Transcript Document
FROM PROTEIN
SEQUENCES TO
PHYLOGENETIC TREES
Simon Harris
Wellcome Trust Sanger Institute, UK
Agenda
• Remind you that molecular phylogenetics is
complex
– the more you know about the compared proteins and
the method used, the better
• Try to avoid the black box approach as much as possible!
• Give an overview of some phylogenetic methods
and software used with protein alignments - some
practical issues…
From DNA/protein sequences to trees
1
2
3
Sequence data
Align Sequences
Phylogenetic signal?
Patterns—>evolutionary processes?
Distance methods
Character based methods
4
*
Distance calculation
(which model?)
Choose a method
MB
Model?
ML
MP
Weighting?
Model?
(sites, changes)?
Optimality criterion
LS
*
5
*
ME
Calculate or estimate best fit tree
Test phylogenetic reliability
Modified from Hillis et al., (1993). Methods in Enzymology 224, 456-487
Single tree
NJ
Phylogenies from proteins
*
*
*
• Parsimony
• Distance matrices
• Maximum likelihood
• Bayesian methods
Characters based
methods
MP
ML
BM
A
B
C
D
1
G
A
A
A
2
G
G
T
T
3
G
G
T
T
4
C
C
C
T
5
G
T
T
G
A
Distance matrix
A
B
C
D
Explicit model of
sequence evolution
B
Distance methods
C
D
A
-
B C
d1 d2
- d4
-
D
d3
d5
d6
-
Phylogenetic trees from protein
alignments
• Distance methods - model for distance estimation
– Simple formula (e.g. Kimura, use of Dij, LogDet)
– Complex models
• Probability of amino acid changes - Mutational Data Matrices
• Site rate heterogeneity
• Maximum likelihood and Bayesian methods- MDM based
models are used for lnL calculations of sites -> lnL of trees
• Site rate heterogeneity
• Homogenous versus heterogeneous models
• Estimations of data specific rate matrices (amino acid groupings GTR like)
Software: an overview
•
•
CLUSTALXv2 - kimura distances
PHYLIPv3.62 - distance, MP, and ML methods (and more)
–
Some complex protein models
•
–
•
Bootstrapping - bootstrap support values
TREE-PUZZLEv5.2 - distance and a ML method
–
–
ML - quartet method
Complex protein models
•
–
–
–
•
•
PAM, JTT ± site rate heterogeneity
JTT, WAG…matrices ± site rate heterogeneity
From quartets to n-taxa tree - PUZZLE support values
Some sequence statistics - aa frequency and heterogeneity between sequences
Tree comparisons - KH test
PHYML - ML methods (protein and DNA)
MRBAYES - Bayesian
–
Complex protein models
•
•
JTT, WAG…matrices ± site rate heterogeneity
Data partitioning
– Posteriors as support values
•
•
PHYLOBAYESv2.3
P4
–
–
–
All the things you can dream off… almost… ask Peter Foster
Heterogeneous models among taxa or sites
Estimation of amino acid rate matrices for grouped categories (6x6 rate
matrices can be calculated)
PHYLIP3.62
• Protpars: parsimony
• Protdist: models for distance calculations:
– PAM1, JTT, Kimura formula (PAM like), others...
– Correction for rate heterogeneity between sites! Removal of
invariant sites? (not estimated, see TREE-PUZZLE5.2!)
• NJ and LS distance trees (± molecular clock)
• Proml: protein ML analysis (no estimation of site rate
heterogeneity - see TREE-PUZZLE5.2)
– Coefficient of variation (CV) versus alpha shape
parameter CV=1/alpha1/2
• Bootstrapping
Distance methods
A two step approach - two choices!
1) Estimate all pairwise distances
Choose a method (100s) - has an explicit model for
sequence evolution
• Simple formula
• Complex models - PAM, JTT, site rate variation
2) Estimate a tree from the distance matrix
Choose a method: with (ME, LS) or without an optimality
criterion (NJ)?
Simple and complex models
dij = -Ln (1 - Dij - (Dij2/5))
(Kimura)
Simple and fast but can be unreliable - underestimates changes, hence
distances, which can lead to misleading trees - PHYLIP, CLUSTALX
Dij is the fraction of residues that differs between sequence i and j (Dij = 1 - Sij)
dij = ML [P(n), (G, pinv), Xij]
(bad annotation!)
ML is used to estimate the dij based on the sequence alignment and a given
model - MDM, gamma shape parameter and pinv - PHYLIP, PUZZLE. Each
site is used for the calculation of dij, not just the Dij value.
More realistic complexity in relation to protein evolution and the subtle
patterns of amino acid exchange rates…
Note: the values of the different parameters (alpha+pinv) have to be either
estimated, or simply chosen (MDM), prior the dij calculations
1) Choosing/estimating the
parameter of a model
1) Mutation Data Matrices: PAM, JTT, WAG…
• What are the properties of the protein alignment
(% identity, amino acid frequencies, globular, membrane)?
•
Can be corrected for the specific dataset amino acid
frequencies (-F) - in some software only
• Compare ML of different models for a given data
and tree
ModelGenerator and ProtTest are designed for
this
2) Alpha and pInv values have to be estimated on a tree
• TREE-PUZZLE can do that. Reasonable trees
give similar values…
2) Inferring phylogenetic trees
from the estimated dij
a) Without an optimality criterion
•
Neighbor-joining (NJ) (NEIGHBOR)
Different algorithms exist - improvement of the computing
If the dij are additive, or close to it, NJ will find the ME tree
•
BIONJ, WEIGBOR, FastME
b) With an optimality criterion
•
Least squares (FITCH)
•
Minimum evolution (in PAUP - now also PHYLIP)
Fitch Margoliash Method 1968
• Seeks to minimise the weighted squared deviation of the
tree path length distances from the distance estimates uses an objective function
T-1 T
E=
S S wij |dij - pij|
i=1 j=i+1
a
E = the error of fitting dij to pij
T = number of taxa
if a = 2 weighted least squares
wij = the weighting scheme
dij = F(Xij) pairwise distances estimate - from the data using a
specific model (or simply Dij)
pij = length of path between i and j implied on a given tree
dij = pij for additive datasets (all methods will find the right tree)
Minimum Evolution Method
• For each possible alternative tree one can estimate the
length of each branch from the estimated pairwise
distances between taxa (using the LS method) and then
compute the sum (S) of all branch length estimates.
The minimum evolution criterion is to choose the tree
with the smallest S value
2T-3
S=
S Vk
k=1
With Vk being the length of the branch k on a tree
Distance methods
• Advantages:
– Can be fast (NJ)
– Some distance methods (LogDet) can be superior to more complex
approached (ML) in some conditions (shown for DNA alignments)
– Distance trees can be used to estimate parameter values for more
complex models and then used in a ML method
– Provides trees with branch lengths
• Disadvantages:
– Can lose information by reducing the sequence alignment into pairwise
distances
– Can produce misleading (like any method) trees in particular if distance
estimates are not realistic (bad models), deviates from additivity
Character based methods
• Maximum likelihood based methods
– Quartet puzzling method - TREE-PUZZLE
– Standard ML - PHYML, PROML (PHYLIP)
• Bayesian based methods
– MrBayes v3.1
– Phylobayes v2.3
– P4 (Peter Foster)
Characters based
methods
MP
ML
BM
A
B
C
D
1
G
A
A
A
2
G
G
T
T
3
G
G
T
T
4
C
C
C
T
5
G
T
T
G
A
Distance matrix
A
B
C
D
Explicit model of
sequence evolution
B
Distances methods
C
D
A
-
B C
d1 d2
- d4
-
D
d3
d5
d6
-
TREE-PUZZLE5.2
• Protein maximum likelihood method using “quartet
puzzling”
– With various protein rate matrices (JTT, WAG…)
– Can include correction for rate heterogeneity between sites pinv + gamma shape (can estimates the values)
– Can estimate amino acid frequencies from the data
– List site rates categories for each site (2-16)
– Composition statistics
– Molecular clock test
– Can deal with large datasets
• Can be used for ML pairwise distance estimates with
complex models - used with puzzleboot to perform
bootstrapping with PHYLIP
A gamma distribution can be used to model
site rate heterogeneity
Yang 1996
TREE, 11,
367-372
TREE-PUZZLE5.2
The quartet ML tree search method has four steps:
1) Parameters (pInv-gamma) are estimated on a NJ ntaxa tree
2) Calculate the ML tree for all possible quartets (4taxa)
3) Combine quartets in a n-taxa tree (puzzling step)
4) Repeat the puzzling step numerous times (with
randomised order of quartet input)
5) Compute a majority rule consensus tree
from all n-trees - has the puzzle support value
Puzzle support values are not bootstrap values!
TREE-PUZZLE5.2
• Models for amino acid changes:
– PAM, JTT, BLOSUM64, mtREV24, WAG (with
correction for amino acid frequencies)
– Correction for specific dataset amino acid frequencies
– Discrete gamma model for rate heterogeneity between
sites 4-16 categories.
-> output gives the rate category for each site. Can be used to
partition your data and analyse them separately…
• Taxa composition heterogeneity test
• Molecular clock test
Combination of categories that contributes the most to the likelihood
(computation done without clock assumption assuming quartet-puzzling tree):
663480606551131631680164026401551353517555877778857576788666
302601370056337757586784188740366735872783378410002242458471
83622611151658686136640100010026331405414845653410774876588
TREE-PUZZLE5.2
• Can be used to calculate pairwise distances with a
broad diversity of models - puzzleboot (Holder &
Roger)
– Can be used in combination with PHYLIP programs for
bootstrapping:
– SEQBOOT
– NJ or LS…
– CONSENSE
• But PHYML can do ML bootstrapping in a fair
amount of time…
TREE-PUZZLE5.2
• Advantages:
– Can handle larger numbers of taxa for maximum likelihood
analyses
– Implements various models (BLOSUM, JTT, WAG…) and
can incorporate a correction for rate heterogeneity
(pinv+gamma)
– Can estimate for a given tree the gamma shape parameter
and the fraction of constant sites and attribute to each site a
rate category
• Disadvantages:
– Quartet based tree search - amplification of the long branch
attraction artefact within each quartet analysis?
MrBayes 3.1
• Bayesian approach
– Iterative process leading to improvement of trees and model parameters
and that will provide the most probable trees (and parameter values)
• Complex models for amino acid changes:
– PAM and JTT, WAG (with correction for amino acid frequencies, but
you have to type it!?!?!)
– Correction for rate heterogeneity between sites (pinv, discrete gamma,
site specific rates)
• Powerful parameter space search
–
–
–
–
Tree space (tree topologies)
Shape parameter (alpha shape parameter, pinv)
Can work with large dataset
Provides probabilities of support for clades (posterior probabilities)
MrBayes 3.1
• MrBayes will produce a population of trees and parameter
values - obtained by a Markov chain (mcmcmc). If the chain is
working well these will have converged to “probable” values
• In practice we plot the results of an mcmcmc to determine the
region of the chain that converged to probable values. The “burn
in” is the region of the mcmcmc that is ignored for calculation of the
consensus tree
– Trees and parameter values from the region of equilibrium are used to
estimate a consensus tree
– The number of trees recovering a given clade corresponds to the posterior for
that clade, the probability that this clade exists
– The mcmcmc uses the lnL function to compare trees between generations
MrBayes 3.1
Most methods provide a single tree and parameters value
– Bootstrapping provide a distribution of tree topologies
– Puzzling steps also provides a distribution tree topologies
• Bootstrap values - Puzzle support values - Posteriors values ???
• But not to sure how to interpret these different support values - in
each case the support values are for a given dataset and method
used
• Posteriors are typically higher then bootstrap and puzzle support
values?!
MrBayes 3.1: some options
MrBayes 3.1: an example
#NEXUS
begin data;
dimensions ntax=8 nChar=500;
format datatype=protein gap=- missing=?;
matrix
Etc…
Block I
Block II
Begin mrbayes;
log start filename=d.res.nex.log replace;
prset aamodelpr=fixed(wag);
lset rates=invgamma Ngammacat=4;
set autoclose=yes;
mcmc ngen=5000 printfreq=500 samplefreq=10 nchains=4 savebrlens=yes
startingtree=random filename=d.res.nex.out;
quit;
end;
[
Begin mrbayes;
log start filename=d.res.nex.con.log replace;
sumt filename=d.res.nex.out.t burnin=150 contype=allcompat;
end;
]
Tree lnL
Zooming in
Tree lnL
A Bayesian analysis
-Propose a starting tree topology and
parameters values (branch length,
alpha, pinv), calculate lnL
-Change one of these and compare
the lnL with previous proposal
-If the lnL is improved accept it
-If not, accept it only sometimes
-Do many of these…
-Plot the change of lnL in relationship
to the number of generations run
-Determine the region where the
chain converged and calculate the
consensus tree for that region
-> consensus tree with
posteriors for clade support
#generations (mcmcmc)
alpha
“Burn in”
determines the trees
to be ignored for
consensus tree
calculation
-Was the chain run long
enough?
-Do we get the same
result from an
independent chain?
pinv
#generations (mcmcmc)
Consensus tree with a burn in of 1500 (150)
Showing posterior values for the different clades - probability for a given
clade to be correct (for the given data and method used!!!)
+-----------------------A
|
+--------------------------B
|
|
+-----------------------C
|
+---------|(0.98)
|
|
|
+-------------------------D
|
|
+---------|(0.99)
+--------|(0.49)
+----------------------E
|
|
+-----------------------F
+---------|(0.96)
|
+------------------------G
+--------|(0.81)
+--------------------------H
Model choice in protein
analyses
1. Rate matrix choice (20x20 matrices)
– WAG, BLOSSUM62, etc…
2. Recoding protein datasets
– 20x20 --> 6x6 rate matrix (or else)
– Implemented in P4 and PhyloBayes
v2.3
Effect of using different rate matrices on phylogenetics
PHYML
MtRev matrix
PHYML
WAG matrix
Keane et al. 2006
-Numerous eukaryotes do not
possess mitochondria
-They possess instead
hydrogenosomes or mitosomes
-What is the evolutionary origin
of these organelles and what is
their function?
Trichomonas NuoF localises in the hydrogenosomes
Complex I
News and views by Gray (2005). Nature 434, 29-30.
Amino acids categories - recoding in p4
–
–
–
–
–
–
Sulfhydryl: C (1)
Smallhydrophilic: S, T, A, P, G (2)
Acid,amide: D, E, N, Q (3)
Basic: H, R, K (4)
Smallhydrophobic: M, I, L, V (5)
Aromatic: F, Y, W (6)
1
Recoding into 6 states (1-6)
allows the estimation of a
GTR like matrix with 14 free
parameters
123456-
2 3 4
5
x1 x2 x3 x4
x6 x7 x8
x10 x11
x13
6
x5
x9
x12
x14
x15
Why recode amino acids?
Potential advantages:
• Allows generation of a rate matrix specific for the
investigated alignment
• Contributes to mitigating amino acid composition
heterogeneity and homoplasy due to frequent changes
within categories - equivalent to DNA transversion
analyses
Potential disadvantage:
Loss of potential useful signal by reducing the
alphabet from 20 to 6 letters (or else)
Recoding effect on NuoF phylogeny
M US
XENOPUS
ANOPHELES
DRO M AL1
DRO M AL2
CAENORHABD
NEUROSPORA
ASPERGILLU
CANDIDA
YARROWIA
DICTYOSTEL
ARA THA1
ORYZA
SOLANUM
LEISHM ANIA
RHO SPH
PARACOCCUS
BRU M EL
AGROBACTER
SINORHIZOB
BRADYRHIZO
CAULOBACTE
RHODOPSEUD
M ESO RHIZ OB
NOVO SPHING
M A GN ETOSPI
RHODOS PIRI
RIC CON
RIC PRO
TRICHOM O NAS
M A GN ETOCOC
BURKHOLDER
RALSTONIA
NITROSOM ON
NEISSERIA
XANTHO M ON A
XYLELLA
COXIELLA
STR AVE
M YCO BA CT ER
THERM O BIFI
ESCHERICHI
CHLOROFLEX
DEINOCOCCU
LEPTOSPIRA
ENTEROCOCC
WAG (20x20)
+pInv+G
1.0
1.0
1.0
0.8
1.0
GTRrecodedDayhoff
classes (6x6)
+pInv+G
M US
XENOPUS
ANOPHELES
DRO M AL1
DRO M AL2
CAENORHABD
NEUROSPORA
ASPERGILLU
CANDIDA
YARROWIA
TRICHOM O NAS
LEISHM ANIA
ARA THA1
ORYZA
SOLANUM
DICTYOSTEL
BRADYRHIZO
CAULOBACTE
RHODOPSEUD
M ESO RHIZ OB
a-proteobacteria
BRU M EL
AGROBACTER
SINORHIZOB
RHO SPH
PARACOCCUS
NOVO SPHING
M A GN ETOSPI
RHODOS PIRI
RIC CON
RIC PRO
BURKHOLDER
RALSTONIA
NITROSOM ON
NEISSERIA
XANTHO M ON A
XYLELLA
COXIELLA
M A GN ETOCOC
STR AVE
M YCO BA CT ER
THERM O BIFI
DEINOCOCCU
LEPTOSPIRA
CHLOROFLEX
ESCHERICHI
THE M AR1
ENTEROCOCC
0.92
1.0
0.96
1.0
Summary
• No single program allows thorough phylogenetic analyses of
protein alignments
• Combination of PHYLIPv3.6, TREE-PUZZLEv5.2,
PHYML, MrBAYESv3.1, PHYLOBAYESv2.3 and P4 allow
detailed protein phylogenetics
• Experimenting with your data and available
methods/models can lead to interesting and biologically
relevant results (data <-> method)
– Incorporate site rate heterogeneity correction in the model or reduce
heterogeneity by data editing (with and without invariant sites?)
– Partitioning of the alignment (variant - various rates, invariant sites,
secondary structure, protein domains…)
– Amino acid groupings (6 categories - GTR like)
– LogDet for proteins - rare/absent changes? For long alignments?
– DNA based LogDet or the protein alignment…?
• Do not take support values as absolute. Any support value is
for a given method and data, only!
outside
PM
inside
TM domains
• TM domain have very specific structural requirements:
AA composition from TM domain is very distinct from non
TM domains!
• Extracellular and intracellular domains may also have
important functional differences --> different functional
constrains can lead to different AA composition.
• More generally in any protein, surface exposed AA
composition is typically distinct from “internal” AA.
outside
PM
inside
Global alignment (AA)
Sub-alignment 1
B
C
A
B
D
A
D
Model 1
Sub-alignment 1
Model 2
C
outside
PM
inside
Partition 1
Partition 2
Model 1
Model 2
B
A
C
D
outside
PM
inside
Global alignment (AA)
–
–
–
–
–
–
Sulfhydryl: C (1)
Smallhydrophilic: S, T, A, P, G (2)
Acid,amide: D, E, N, Q (3)
Basic: H, R, K (4)
Smallhydrophobic: M, I, L, V (5)
Aromatic: F, Y, W (6)
AA recoding, can mitigate
compositional difference
between domains
B
A
C
D
outside
PM
inside
Global alignment (AA)
Global alignment (DNA)
DNA based models
LogDet? Codon 1,2 or 3?
B
A
C
D