Phylogeny Estimation: Traditional and Bayesian Approaches

Molecular Evolution, 2003
Phylogeny Estimation
• Traditional approaches
– Neighbour-joining algorithm
– Tree searches with optimality criterion
• Parsimony
• Maximum likelihood
• Bayesian Approaches
Traditional approaches
• Neighbour-joining algorithm
– extremely popular.
– relatively fast.
– performs well when the divergence between sequences is low.
• The first step in the algorithm is converting the DNA or protein
sequences into a distance matrix that represents the evolutionary
distance between sequences.
             1        2        3        4        5
1  H959      -
2  H3847  0.00752     -
3  H5539  0.00809  0.01069     -
4  H1067  0.00681  0.01593  0.01547     -
5  H3368  0.00855  0.01126  0.01706  0.01505     -
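The conversion from aligned sequences to a distance matrix can be sketched as follows. This is a minimal illustration that assumes the simplest possible distance, the proportion of differing sites (p-distance); the slides do not say which distance was used for the matrix above. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ (gapped sites skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return {(name_i, name_j): distance} for every pair of sequences."""
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m, n in combinations(alignment, 2)}

# Toy alignment (hypothetical sequences, not the data behind the matrix above).
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACGAACGTTT"}
dm = distance_matrix(toy)
```

A neighbour-joining implementation would take a matrix like `dm` as its input.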
Traditional approaches
• Neighbour-joining algorithm
– A serious weakness of distance methods is that the observed
differences between sequences do not accurately reflect the
evolutionary distances between them.
• Multiple substitutions at the same site hide earlier changes, so the
observed differences underestimate the true distance.
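One common way to compensate for multiple substitutions is to correct the observed proportion of differences under a substitution model. The sketch below assumes the Jukes-Cantor (JC69) model, under which the corrected distance is d = -(3/4) ln(1 - 4p/3); the function name is hypothetical, and the slides do not state which correction, if any, was applied.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor corrected distance from an observed proportion of differing sites p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)

# The corrected distance always exceeds the observed one, and the gap
# widens as sequences diverge.
observed = [0.01, 0.10, 0.30, 0.60]
corrected = [jc69_distance(p) for p in observed]
```

For small p the correction is negligible, which is consistent with the remark that neighbour-joining performs well when divergence is low.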
Traditional approaches
• Parsimony
– In contrast to distance-based approaches, parsimony and ML map the
history of gene sequences onto a tree.
– In parsimony, the score is simply the minimum number of mutations that
could possibly produce the data.
– Parsimony has a few obvious disadvantages.
• The score of a tree is completely determined by the minimum number of
mutations among all of the reconstructions of ancestral sequences.
• Another serious drawback of parsimony arises because it fails to account for
the fact that the number of changes is unlikely to be equal on all branches in
the tree.
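The parsimony score can be computed without enumerating ancestral reconstructions. The sketch below assumes Fitch's algorithm for unordered characters on a rooted binary tree written as nested tuples; the tree encoding, function names, and toy data are hypothetical and chosen only for illustration.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one site on a rooted binary tree.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> observed character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can share a state, so no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: at least one more change is required.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(n_sites))

# Toy data (hypothetical): two candidate trees for four sequences.
toy = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, toy)
score_2 = parsimony_score(tree_2, toy)
```

Here `tree_1` needs fewer changes than `tree_2`, so parsimony prefers it. Note that every change counts equally regardless of branch length, which is the drawback described above.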
Traditional approaches
• Maximum likelihood
• In ML, a hypothesis is judged by how well it predicts the observed data; the
tree that has the highest probability of producing the observed sequences is
preferred.
• To use this approach, we must be able to calculate the probability of a data
set given a phylogenetic tree.
– This requires a model of sequence evolution that describes the relative
probability of various events.
– These probabilities take into account the possibility of unseen events.
• From many perspectives, ML is the most appealing way to estimate
phylogenies. All possible mutational pathways that are compatible with the
data are considered, and the likelihood function is known to be a consistent
and powerful basis for statistical inference in general.
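To make "the probability of a data set given a tree" concrete, the sketch below treats the smallest possible case: two aligned sequences joined by a single branch, under the JC69 model. It is an illustration under those assumptions, not a full tree likelihood (which would sum over ancestral states at every internal node). The function name, sequences, and grid search are hypothetical.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by d expected
    substitutions per site, under the JC69 model."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e   # probability a site ends in the same base
    p_diff = 0.25 - 0.25 * e   # probability of ending in one specific other base
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l

seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGTACGTACGTACTA"   # differs from seq_a at 2 of 20 sites

# Crude maximum-likelihood estimate of d by grid search.
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))
```

The transition probabilities already allow for unseen events: `p_same` includes every pathway in which a site changed and later changed back.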
Bayesian phylogenetics
• Bayesian approaches
• Bayesian approaches to phylogenetics are relatively new, but they are
already generating a great deal of excitement because the primary analysis
produces a tree estimate and measures of uncertainty for the groups on the
tree.
• The field of Bayesian statistics is closely allied with ML.
Bayesian vs ML
• Maximum likelihood vs. Bayesian estimation
– Maximum likelihood
• search for the tree that maximizes the probability of the data
given the tree (P(Data | Tree))
– Bayesian inference
• search for the tree that maximizes the probability of the tree
given the data (P(Tree | Data))
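The two quantities are linked by Bayes' theorem: P(Tree | Data) is proportional to P(Data | Tree) times the prior P(Tree). The sketch below shows that relationship for a small set of candidate trees; the tree labels, log-likelihood values, and the flat prior are all hypothetical numbers chosen for illustration.

```python
import math

def posterior(log_likelihoods, priors):
    """Normalize likelihood x prior over a finite set of candidate trees."""
    top = max(log_likelihoods.values())
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    weights = {t: math.exp(log_likelihoods[t] - top) * priors[t]
               for t in log_likelihoods}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

# Hypothetical log-likelihoods, ln P(Data | Tree), for three topologies.
log_l = {"((A,B),C,D)": -1000.0, "((A,C),B,D)": -1002.0, "((A,D),B,C)": -1005.0}
flat_prior = {t: 1 / 3 for t in log_l}
post = posterior(log_l, flat_prior)
best_tree = max(post, key=post.get)
```

With a flat prior the tree with the highest posterior is also the ML tree; the posterior additionally attaches a probability to each alternative. In real analyses the set of trees is far too large to enumerate, which is why sampling methods are used.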
The phylogenetic inference process
• Data collection is the first step.
• Typically, a few outgroup sequences are included to root the tree.
• Insertions and deletions obscure which of the sites are homologous.
• Multiple-sequence alignment is the process of adding gaps to a
matrix of data so that the nucleotides in one column of the
matrix are related to each other by descent from a common ancestral
residue.
The phylogenetic inference process
• In addition to the data, the scientist must choose a model of
sequence evolution.
• Increasing model complexity improves the fit to the data but
also increases variance in estimated parameters.
• Model selection strategies attempt to find the appropriate level of
complexity on the basis of the available data.
• Model complexity can often lead to computational intractability.

Model    Free parameters
GTR      8
TN93     5
HKY85    4
F84      4
F81      3
K80      1
JC69     0
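The trade-off between fit and complexity can be sketched with one widely used criterion, the Akaike information criterion (AIC = 2k - 2 ln L). The slides do not name a specific model selection strategy, so AIC is an assumption here. The free-parameter counts come from the list above; the log-likelihood values are hypothetical.

```python
def aic(log_likelihood, n_free_parameters):
    """Akaike information criterion: lower is better."""
    return 2 * n_free_parameters - 2 * log_likelihood

# Free parameters per substitution model, as listed above.
free_parameters = {"JC69": 0, "K80": 1, "F81": 3, "HKY85": 4,
                   "F84": 4, "TN93": 5, "GTR": 8}

# Hypothetical maximized log-likelihoods for one data set and one tree.
log_l = {"JC69": -2150.0, "K80": -2120.0, "HKY85": -2100.0, "GTR": -2098.5}

scores = {m: aic(log_l[m], free_parameters[m]) for m in log_l}
best_model = min(scores, key=scores.get)
```

In this made-up example GTR fits slightly better than HKY85, but not by enough to justify four extra parameters. Branch lengths are also free parameters in practice; they are omitted here because they add the same constant to every model on a fixed tree.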
The phylogenetic inference process
Bootstrapping
• The bootstrapping approach involves the generation of
pseudoreplicate data sets by resampling, with replacement, the
sites in the original data matrix.
• When optimality-criterion methods are used, a tree search is
performed for each data set, and the resulting tree is added to the
final collection of trees.
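Generating pseudoreplicates amounts to drawing alignment columns at random, with replacement, until the replicate has as many sites as the original. A minimal sketch, with a hypothetical function name and toy alignment:

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows (taxa) intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

rng = random.Random(1)  # fixed seed so the example is repeatable
toy = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "ACGAACGTTT"}
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
# Each replicate would then be passed to the same tree search used on the
# original data, and the resulting tree stored.
```

The same column indices are applied to every sequence, so each resampled site keeps its pattern across taxa.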
Markov chain Monte Carlo
• The Markov chain Monte Carlo (MCMC) methodology is similar to
the tree-searching algorithm.
• From an initial tree, a new tree is proposed. The moves that change
the tree must involve a random choice.
• The MCMC algorithm also specifies the rules for when to
accept or reject a tree.
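The propose-then-accept cycle can be sketched with the Metropolis rule: always accept a proposal with higher posterior density, and accept a worse one with probability equal to the ratio of densities. The example below is a toy under stated assumptions: the state space is just three labelled trees with hypothetical unnormalized log-posteriors, and the proposal picks one of the other trees uniformly (a symmetric move). Real phylogenetic MCMC proposes local changes to topology, branch lengths, and model parameters.

```python
import math
import random
from collections import Counter

def metropolis(log_post, start, n_steps, rng):
    """Sample states in proportion to exp(log_post) using the Metropolis rule."""
    states = list(log_post)
    current = start
    sample = []
    for _ in range(n_steps):
        # Random, symmetric proposal: any state other than the current one.
        proposal = rng.choice([s for s in states if s != current])
        log_ratio = log_post[proposal] - log_post[current]
        # Accept uphill moves always, downhill moves with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        sample.append(current)  # one tree is recorded every cycle
    return sample

log_post = {"T1": 0.0, "T2": -1.0, "T3": -2.0}  # hypothetical values
rng = random.Random(42)
sample = metropolis(log_post, "T3", 20000, rng)
freq = {t: n / len(sample) for t, n in Counter(sample).items()}
```

Over a long run the frequency of each tree in `sample` approaches its posterior probability, so the sample itself is the estimate of uncertainty.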
Markov chain Monte Carlo
• Metropolis-coupled Markov chain Monte Carlo
heated landscape
Bootstrapping and MCMC generate
a sample of trees
• Note that MCMC yields a much larger sample of trees in the same
computational time, because it produces one tree for every proposal cycle
versus one tree per tree search in the traditional approach.
• However, the sample of trees produced by MCMC is highly auto-correlated.
• As a result, millions of cycles through MCMC are usually required, whereas
many fewer (of the order of 1,000) bootstrap replicates are sufficient for
most problems.
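Whichever method produced it, the sample of trees is typically summarized by how often each group of taxa appears. The sketch below counts clade frequencies in a list of rooted trees written as nested tuples; the encoding, function names, and the four-tree sample are hypothetical. For a bootstrap sample the frequency is a bootstrap proportion; for an MCMC sample it estimates a posterior probability.

```python
from collections import Counter

def clades(tree):
    """Return (leaf_set, list_of_clades) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the group of all leaves below this node
    return leaves, found

def clade_support(trees):
    """Fraction of trees in the sample that contain each clade."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(set(found))
    return {c: n / len(trees) for c, n in counts.items()}

# Hypothetical sample: three trees agree, one disagrees.
trees = [(("A", "B"), ("C", "D"))] * 3 + [(("A", "C"), ("B", "D"))]
support = clade_support(trees)
```

Because neighbouring MCMC samples are auto-correlated, such frequencies are usually computed from a thinned chain rather than from every cycle.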
Bayesian phylogenetics
• Bayesian approaches
• Bayesian methods are exciting because they allow complex models of
sequence evolution to be implemented. Applications include:
– estimating divergence times
– finding the residues that are important to natural selection
– detecting recombination points
Comparison of Methods
Conclusions
• The estimation of phylogenies has become a regular
step in the analysis of new gene sequences.
• It is still too early to tell whether Bayesian approaches will
revolutionize tree estimation in general, but it is already
clear that MCMC-based approaches are extending the
field by answering previously intractable questions.