Reconstructing phylogenetic trees for protein superfamilies

Structural Phylogenomic Analysis
Estimate Tree of Life; plot key traits onto tree
Extend function prediction through inclusion of structure prediction and analysis
Anti-fungal defensin (Radish); Scorpion toxin; Drosomycin (Drosophila)
Predict active site & subfamily specificity positions
VirB4 model: based on 12% identity to TrwB structure
Annotation transfer by homology
• Status quo approach to protein function prediction
– Given a gene (or protein) of unknown function
• Run BLAST to find homologs
• Identify the top BLAST hit(s)
• If the score is significant, transfer the annotation
– If resources permit, predict domains using PFAM or CDD
• Problems:
– Approach fails completely for ~30% of genes
– Of those with annotations, only 3% have any supporting experimental evidence
• 97% have had functions predicted by homology alone*
– High error rate
* Based on analysis of >300K proteins in the UniProt database
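The status quo procedure above can be sketched in a few lines of Python. This is a minimal illustration, not any particular pipeline: the `BlastHit` record, the `transfer_annotation` helper, the E-value cutoff of 1e-5 and the example hits are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    subject_id: str   # database identifier of the hit (hypothetical values below)
    evalue: float     # BLAST E-value; smaller means more significant
    annotation: str   # annotation stored for the hit in the database


def transfer_annotation(hits: List[BlastHit],
                        evalue_cutoff: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the top hit if its E-value passes the cutoff."""
    if not hits:
        return None  # no homolog found: the approach fails for this gene
    best = min(hits, key=lambda h: h.evalue)
    if best.evalue <= evalue_cutoff:
        return best.annotation
    return None


# Made-up hits for illustration only.
hits = [
    BlastHit("hit_1", 1e-40, "putative odorant receptor"),
    BlastHit("hit_2", 1e-12, "biogenic amine GPCR"),
]
predicted = transfer_annotation(hits)
print(predicted)
```

Note that nothing in this procedure checks whether the top hit's annotation has experimental support or whether the hit is an ortholog, which is how existing errors get propagated.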
Database annotation errors
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors
Figure labels: Sub-functionalization; Neo-functionalization; Propagation of existing database annotation errors; Errors in gene structure
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption.” In Silico Biol. 1998
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)
BLAST against Arabidopsis
Top BLAST hit in Arabidopsis is an RLK!
Panther
PFAM results
Berkeley Phylogenomics
Plant and Animal Innate Immunity Mediated by Structurally Similar Receptor and Receptor-like molecules
TM
Domain fusion/fission
Cytoplasmic Toll Interleukin 1 Receptor (TIR) domain
Errors due to domain shuffling
(sic)
Error presumably due to non-orthology of database hits used for annotation
The top matching BLAST hits are putative odorant receptors
Phylogenetic analysis suggests it’s more likely a Biogenic Amine GPCR
Annotation error (source unknown)
Phylogenomic inference
Gene duplication in ancestral organism
H1 C1 M1 R1 F1 W1
Eisen, 1998
Sjölander, Bioinformatics 2004
H2 C2 M2 R2 F2 W2
Human, Chimp, Mouse, Rat, Fly, Worm
SCI-PHY analysis of selected GPCRs
Venter et al, The sequence of the human genome (2001) Science.
Sjolander, “Phylogenomic inference of protein molecular function: advances and challenges,” (2004) Bioinformatics
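To make the ortholog/paralog distinction concrete, here is a small sketch that labels tree nodes as duplications using the species-overlap heuristic: a node is called a duplication when two of its child subtrees share a species. This heuristic is a stand-in chosen for brevity, not necessarily the method used in the cited work, and the tree topology below is a hypothetical arrangement of the H1..W2 labels from the slide.

```python
def leaves(node):
    """Return all leaf names under a node (a leaf is a string, an inner node a tuple)."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def species_of(leaf):
    """Species code is the first letter of the leaf name, e.g. 'H1' -> 'H'."""
    return leaf[0]


def is_duplication(node):
    """Species-overlap rule: children sharing a species imply a duplication."""
    if isinstance(node, str):
        return False
    seen = set()
    for child in node:
        species = {species_of(name) for name in leaves(child)}
        if seen & species:
            return True
        seen |= species
    return False


def relationship(tree, a, b):
    """Classify two genes by the type of their last common ancestor node."""
    node = tree
    while True:
        deeper = None
        for child in node:
            if not isinstance(child, str):
                names = leaves(child)
                if a in names and b in names:
                    deeper = child
                    break
        if deeper is None:
            break
        node = deeper
    return "paralogs" if is_duplication(node) else "orthologs"


# Hypothetical gene tree: one ancestral duplication, then speciation in each copy.
copy1 = ((("H1", "C1"), ("M1", "R1")), ("F1", "W1"))
copy2 = ((("H2", "C2"), ("M2", "R2")), ("F2", "W2"))
gene_tree = (copy1, copy2)

print(relationship(gene_tree, "H1", "M1"))
print(relationship(gene_tree, "H1", "M2"))
```

Under this sketch, annotation would be transferred only between genes classified as orthologs.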
Phylogenetic reconstruction of protein families is complicated
• Gene duplication
• Domain shuffling
• Lessening of evolutionary pressures associated with speciation and duplication enables significant structural and sequence changes
• Different mutation rates in some lineages
• Different types of constraints at some positions
• Multiple sequence alignment errors
• What members to include? (Some families contain thousands of members)
Caveats
• Sequence “signal” guides the alignment
• If the signal is weak, the alignment can be poor
• As proteins diverge from a common ancestor, their structures and functions can change
– Even structural superposition can be challenging!
• Repeats, domain shuffling, large insertions or deletions can introduce alignment errors
• If tree construction is the aim, errors in the alignment will affect tree accuracy!
Fundamental mechanisms underlying evolution of gene families
Homology and adaptation among protein families
1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)
Drosomycin: antifungal protein, fruit fly
1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
1BK8: Antimicrobial Protein 1 (Ah-Amp1), common horse chestnut
1AYJ: Antifungal protein 1 (RS-AFP1), radish
Protein superfamilies evolve novel forms and functions:
Homology may be hard to detect from sequence similarity alone
Homology detection and alignment accuracy (and %superposable positions!) drops with evolutionary distance
Structure can provide clues, but not necessarily exact definition

Method column groups on the slide: Pairwise alignment; MSA-pw

%ID    #pair  %Superpos  BLAST   ClustalW  Tcoffee  ClustalW  MAFFT   MUSCLE
>70    107    90.6       0.954   0.955     0.955    0.955     0.954   0.954
50-70  63     87.2       0.862   0.903     0.894    0.901     0.919   0.911
40-50  46     83.4       0.862   0.846     0.824    0.872     0.855   0.856
30-40  65     85.4       0.811   0.874     0.867    0.87      0.892   0.925
25-30  41     82.1       0.779   0.782     0.788    0.795     0.837   0.836
20-25  53     77.9       0.678   0.612     0.599    0.627     0.633   0.661
15-20  84     73         0.381   0.451     0.457    0.49      0.496   0.554
10-15  151    64.4       0.16    0.186     0.234    0.302     0.35    0.351
5-10   204    50.4       -0.007  -0.014    0        -0.047    0.098   0.075
0-5    122    39.5       -0.033  -0.049    -0.051   -0.034    -0.024  -0.022
Not all positions in a molecule are created equal
Light-blue positions are variable across subfamilies – but can be very conserved within subfamilies. These are the hallmarks of binding pockets determining substrate specificity.
[Figure: alternative tree topologies for classes A, B and C]
Major differences between trees are in the coarse branching order
When classes A, B and C all appear equally similar to one another, the coarse branching order can be difficult to determine. In this case, it’s critical to be able to weight the subfamily-defining residues as more important when computing the distance between classes.
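One way to picture such weighting is the toy sketch below. It is a hypothetical scheme invented for illustration, not the method used by SCI-PHY or SATCHMO: each column is weighted by how much more conserved it is within classes than across the whole alignment, and the class distance is the weighted mismatch rate. The sequences and the names `pair_identity`, `column_weights` and `class_distance` are assumptions.

```python
from itertools import combinations, product


def pair_identity(residues):
    """Fraction of distinct residue pairs that are identical (needs >= 2 residues)."""
    pairs = list(combinations(residues, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def column_weights(classes):
    """Weight = mean within-class identity minus overall identity, floored at 0.

    Invariant columns and noisy columns both get weight ~0; columns that are
    conserved within classes but differ between classes get a high weight.
    """
    all_seqs = [s for seqs in classes.values() for s in seqs]
    weights = []
    for col in range(len(all_seqs[0])):
        within = [pair_identity([s[col] for s in seqs]) for seqs in classes.values()]
        overall = pair_identity([s[col] for s in all_seqs])
        weights.append(max(0.0, sum(within) / len(within) - overall))
    return weights


def class_distance(seqs_x, seqs_y, weights):
    """Weighted mismatch rate, averaged over all cross-class sequence pairs."""
    total = sum(weights)  # assumed > 0, i.e. at least one informative column
    pairs = list(product(seqs_x, seqs_y))
    dist = 0.0
    for x, y in pairs:
        dist += sum(w for w, a, b in zip(weights, x, y) if a != b) / total
    return dist / len(pairs)


# Toy aligned sequences: column 1 (K/K/R) is the only subfamily-defining column.
classes = {
    "A": ["DKAG", "DKSP"],
    "B": ["DKTP", "DKVG"],
    "C": ["DRLG", "DRMP"],
}
weights = column_weights(classes)
d_ab = class_distance(classes["A"], classes["B"], weights)
d_ac = class_distance(classes["A"], classes["C"], weights)
print(weights, d_ab, d_ac)
```

In this toy case only the subfamily-defining column carries weight, so A and B are grouped together and C is placed apart, which resolves the coarse branching order.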
HMM construction using an initial multiple sequence alignment
[Figure: profile HMM with Match, Insert and Delete/skip states, built from an alignment of five sequences (Seq1 to Seq5)]
Profile or HMM parameter estimation using small training sets
D S I F M K
D S V F M K
D T I W M K
D T I W M K
D T V W M K
What other amino acids might be seen at this position among homologs? What are their probabilities?
The context is critical when estimating amino acid distributions
D S I F M K
D S V F M K
D T I W M K
D T I W L K
D T L W L R
This position may be critical for function or structure, and may not allow substitutions.
Dirichlet Mixture Prior “Blocks9”
Parameters estimated using Expectation Maximization (EM) algorithm.
Training data: 86,000 columns from BLOCKS alignment database.
Combining Prior Knowledge with Observations using Dirichlet Mixture Densities
$\hat{p}_i$ = the estimated probability of amino acid $i$
$\vec{n} = (n_1, \ldots, n_{20})$ = the count vector summarizing the observed amino acids at a position.
$\vec{\alpha}_j = (\alpha_{j,1}, \ldots, \alpha_{j,20})$ = the parameters of component $j$ of the Dirichlet mixture.
Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Sjolander, Karplus, Brown, Hughey, Krogh, Mian and Haussler. CABIOS (1996)
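The posterior-mean estimate from the cited CABIOS paper combines the components as $\hat{p}_i = \sum_j P(\vec{\alpha}_j \mid \vec{n}) \, \frac{n_i + \alpha_{j,i}}{|\vec{n}| + |\vec{\alpha}_j|}$, where $P(\vec{\alpha}_j \mid \vec{n})$ is proportional to the mixture coefficient $q_j$ times the probability of the counts under component $j$. The sketch below implements that formula with a made-up two-component mixture; the component parameters are assumptions chosen for illustration and are not the Blocks9 values.

```python
from math import exp, lgamma, log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_marginal(counts, alpha):
    """log P(n | alpha), dropping the term that depends only on the counts."""
    n_tot, a_tot = sum(counts), sum(alpha)
    value = lgamma(a_tot) - lgamma(n_tot + a_tot)
    for n_i, a_i in zip(counts, alpha):
        value += lgamma(n_i + a_i) - lgamma(a_i)
    return value


def posterior_mean(counts, mixture):
    """Estimate amino acid probabilities; mixture is a list of (q_j, alpha_j)."""
    logs = [log(q) + log_marginal(counts, alpha) for q, alpha in mixture]
    top = max(logs)                              # subtract max for stability
    post = [exp(v - top) for v in logs]
    norm = sum(post)
    post = [p / norm for p in post]              # P(alpha_j | n)
    n_tot = sum(counts)
    estimate = [0.0] * len(counts)
    for weight, (_, alpha) in zip(post, mixture):
        a_tot = sum(alpha)
        for i, (n_i, a_i) in enumerate(zip(counts, alpha)):
            estimate[i] += weight * (n_i + a_i) / (n_tot + a_tot)
    return estimate


# Hypothetical mixture: one flat component, one favoring I, L, M and V.
flat = [1.0] * 20
hydrophobic = [2.0 if aa in "ILMV" else 0.05 for aa in AMINO_ACIDS]
mixture = [(0.5, flat), (0.5, hydrophobic)]

# Observed column I, V, I, I, V from the small training set shown earlier.
column = "IVIIV"
counts = [column.count(aa) for aa in AMINO_ACIDS]
p_hat = dict(zip(AMINO_ACIDS, posterior_mean(counts, mixture)))
print(round(p_hat["I"], 3), round(p_hat["L"], 3), round(p_hat["D"], 3))
```

Although neither L nor D was observed, the estimate gives L more probability than D because the observed I/V counts favor the hydrophobic component, which is the point of using context when estimating distributions from few sequences.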
SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Xia Jiang, Nandini Krishnamurthy, Duncan Brown, Michael Tung, Jake Gunn-Glanville, Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
SATCHMO motivation
• Structural divergence within a superfamily means that…
– Multiple sequence alignment (MSA) is hard
– Alignable positions vary according to degree of divergence
• Current MSA methods are not designed to handle this variability
• Assume globally alignable, all columns (e.g. ClustalW)…
– Over-aligns, i.e. aligns regions that are not superposable
• …or identify and align only highly conserved positions (profile HMMs)
– Discards information important for subfamily specificity
• Reality
– Different degrees of alignability in different sequence pairs, different regions
Agglomerative clustering
Algorithm:
Initialize all objects to be separate classes (leaves in the tree).
Join “closest” classes (connecting each by edges to a node).
Compute distance between new class and other classes.
Join closest two classes.
Iterate until all classes are joined into one class (a tree).
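The steps above can be sketched as follows. The slide does not say how the distance between a new class and the others is computed, so this sketch assumes average linkage over leaf-to-leaf distances; the leaf names and distances are made up.

```python
from itertools import combinations


def average_linkage(members_x, members_y, dist):
    """Mean leaf-to-leaf distance between two classes (an assumed linkage rule)."""
    total = sum(dist[frozenset((a, b))] for a in members_x for b in members_y)
    return total / (len(members_x) * len(members_y))


def agglomerate(names, dist):
    """Build a tree (nested tuples) by repeatedly joining the closest classes."""
    # Each class is (subtree, list of leaf names); start with one class per leaf.
    clusters = [(name, [name]) for name in names]
    while len(clusters) > 1:
        # Find the closest pair of classes.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        (tree_i, members_i), (tree_j, members_j) = clusters[i], clusters[j]
        # Join them under a new node; distances to it are recomputed on demand.
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical pairwise distances between four sequences.
names = ["A1", "A2", "B1", "C1"]
dist = {
    frozenset(("A1", "A2")): 0.10,
    frozenset(("A1", "B1")): 0.40,
    frozenset(("A2", "B1")): 0.50,
    frozenset(("A1", "C1")): 0.90,
    frozenset(("A2", "C1")): 0.80,
    frozenset(("B1", "C1")): 0.85,
}
tree = agglomerate(names, dist)
print(tree)
```

Recomputing linkage from the leaves at every step keeps the sketch short; a practical implementation would cache and update the class-to-class distances instead.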
SATCHMO output
1. Tree
• Cluster based on structural “distance”
• Built simultaneously with alignments
2. Multiple sequence alignments
• Different alignment for each cluster (= each node in tree)
3. Prediction of alignable / non-alignable regions
• 1, 2, 3 mutually dependent, inform each other
– Interact each time two clusters are combined
Note: we can assess alignment quality, but tree topology accuracy is not straightforward to assess.
SATCHMO algorithm:
Progressive profile-profile alignment
• Typical state: set of subtrees
– Cluster (=subtree) contains
• alignment of all subtree sequences
• profile HMM
– Initialization: each sequence forms a leaf in tree
• Iterated step
– Find most closely related pair of subtrees (using HMM scoring)
– Align the MSAs of the two clusters using profile-profile alignment…
– …treats MSA column as single “letter”, keeps columns intact
– Result: new cluster with its own MSA
– Predict “alignable” columns, and build profile HMM (w/Dirichlet mixture densities).
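The "column as a single letter" idea in the iterated step can be illustrated with a simplified global alignment of two MSAs. This is only a sketch: the column score used here (mean residue identity) and the linear gap penalty are assumptions for illustration, whereas SATCHMO itself scores with profile HMMs.

```python
def column_score(col_x, col_y):
    """Mean identity between residues of two columns; gap characters never match."""
    pairs = [(a, b) for a in col_x for b in col_y]
    return sum(a == b and a != "-" for a, b in pairs) / len(pairs)


def align_msas(msa_x, msa_y, gap=-0.5):
    """Globally align two MSAs, keeping every existing column intact."""
    cols_x = list(zip(*msa_x))   # each column becomes one "letter"
    cols_y = list(zip(*msa_y))
    nx, ny = len(cols_x), len(cols_y)

    # Needleman-Wunsch dynamic programming over columns.
    score = [[0.0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(1, nx + 1):
        score[i][0] = i * gap
    for j in range(1, ny + 1):
        score[0][j] = j * gap
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_x[i - 1], cols_y[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback, emitting merged columns (gap-only columns pad the other MSA).
    gap_x = ("-",) * len(msa_x)
    gap_y = ("-",) * len(msa_y)
    merged = []
    i, j = nx, ny
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols_x[i - 1], cols_y[j - 1])
        ):
            merged.append(cols_x[i - 1] + cols_y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            merged.append(cols_x[i - 1] + gap_y)
            i -= 1
        else:
            merged.append(gap_x + cols_y[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


# Two toy clusters, each already aligned internally.
new_msa = align_msas(["MVLS", "MVLS"], ["MLS", "MLT"])
print("\n".join(new_msa))
```

The result is a new cluster MSA in which the rows of each input keep their original relative alignment, matching the "keeps columns intact" behavior described above.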
Assessing sequence alignment with respect to structural alignment
Xia Jiang, Duncan Brown, Nandini Krishnamurthy
Alignment accuracy as a function of % ID (including homologs, full-length sequences)
[Figure: average CS score (y-axis, 0 to 1) versus percent ID bins from 10-15% to 35-40% (x-axis) for CLUSTALW, MUSCLE, MAFFT and SATCHMO]
Alignment of proteins with different overall folds
Summary
• SATCHMO is designed to provide for the assumption of ‘positional homology’ during the tree estimation process
• This assumption -- that we can predict the structurally equivalent positions from sequence information alone -- needs to be tested
• We need a benchmark dataset to evaluate phylogenetic tree topology estimation
