Rooting Phylogenetic Trees with
Non-reversible Substitution Models
Von Bing Yap* and Terry Speed§
*Statistics and Applied Probability,
National University of Singapore
§Statistics, University of California
Reference: BMC Evolutionary Biology 5:1 (2005)
Molecular Phylogenetics
• From alignments to trees.
• Many methods: parsimony, distance, stochastic
models.
Reversible Models
• Almost all substitution models are reversible:
for example, Pr(anc=A, des=C) = Pr(anc=C,
des=A).
• Rooted trees that give the same unrooted tree
are indistinguishable.
Stationary Models
• Character states have the same frequencies
everywhere on the tree.
• Root can be identified (Yang 1994,
Huelsenbeck et al. 2001).
Nonstationary Models
• Yang and Roberts (1995)
• Galtier and Gouy (1998)
[Figure: substitution models as nested classes, with REVERSIBLE inside STATIONARY inside NON-STATIONARY]
The Simplest NSTA Model
• Parameters:
    rooted tree topology
    θ: root base frequencies
    Q: rate matrix (calibrated)
    branch lengths
• No relationship is assumed between θ and Q.
Specialisations
• If θ is the equilibrium distribution of Q, get
STA.
• If in addition, Q satisfies the detailed balance
conditions, get REV.
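The two specialisations can be checked numerically. The sketch below assumes Q is stored as a 4×4 list of lists whose rows sum to zero and θ as a list of four frequencies. The function names and the example matrices are made up for illustration and are not from the talk.

```python
def is_stationary(theta, Q, tol=1e-9):
    """True if theta is an equilibrium distribution of Q, i.e. theta Q = 0."""
    n = len(theta)
    return all(abs(sum(theta[i] * Q[i][j] for i in range(n))) < tol
               for j in range(n))

def satisfies_detailed_balance(theta, Q, tol=1e-9):
    """True if theta_i Q_ij = theta_j Q_ji for every pair of states."""
    n = len(theta)
    return all(abs(theta[i] * Q[i][j] - theta[j] * Q[j][i]) < tol
               for i in range(n) for j in range(n))

def classify(theta, Q):
    """Return the most specific class among REV, STA, NSTA for (theta, Q)."""
    if not is_stationary(theta, Q):
        return "NSTA"
    if satisfies_detailed_balance(theta, Q):
        return "REV"
    return "STA"

# Illustrative inputs (hypothetical, chosen only to hit each case).
uniform = [0.25] * 4
jukes_cantor = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)]
                for i in range(4)]
cyclic = [[0.0] * 4 for _ in range(4)]
for i in range(4):
    cyclic[i][(i + 1) % 4] = 2.0   # faster in one direction round the cycle
    cyclic[i][(i - 1) % 4] = 1.0   # slower in the other
    cyclic[i][i] = -3.0            # rows sum to zero

print(classify(uniform, jukes_cantor))               # REV
print(classify(uniform, cyclic))                     # STA
print(classify([0.4, 0.3, 0.2, 0.1], jukes_cantor))  # NSTA
```

The cyclic matrix has a uniform equilibrium but unequal forward and backward flows, so it is stationary without being reversible.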
Probability of alignment
• Felsenstein’s algorithm can be used to compute
the probability of one site.
• Multiplying across sites gives probability of
alignment.
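A minimal sketch of this computation follows, assuming a rooted tree written as nested lists of (child, branch length) pairs with taxon names at the leaves. The tree encoding, the Taylor-series matrix exponential and all function names are choices made for this illustration, not the authors' software.

```python
import math

BASES = "ACGT"

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, squarings=10, terms=12):
    """P(t) = exp(Qt) by a scaled Taylor series followed by repeated squaring."""
    n = len(Q)
    scale = t / 2 ** squarings
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, M)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

def partial_likelihoods(node, site, Q):
    """Probability of the data below `node`, for each state at `node`."""
    if isinstance(node, str):  # leaf: indicator of the observed base
        return [1.0 if base == site[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, t in node:
        P = transition_matrix(Q, t)
        below = partial_likelihoods(child, site, Q)
        for a in range(4):
            result[a] *= sum(P[a][b] * below[b] for b in range(4))
    return result

def site_prob(tree, site, theta, Q):
    """Weight the root partial likelihoods by the root frequencies theta."""
    root = partial_likelihoods(tree, site, Q)
    return sum(theta[a] * root[a] for a in range(4))

def alignment_loglik(tree, alignment, theta, Q):
    """Sum of per-site log probabilities (sites treated as independent)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_prob(tree, {name: seq[s] for name, seq in alignment.items()},
                           theta, Q))
        for s in range(n_sites))

# Hypothetical two-taxon example with a Jukes-Cantor rate matrix.
jc = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
pair = [("human", 0.1), ("chimp", 0.2)]
theta = [0.4, 0.3, 0.2, 0.1]  # root frequencies need not match Q's equilibrium
print(alignment_loglik(pair, {"human": "ACGT", "chimp": "ACGA"}, theta, jc))
```

Because θ enters only at the root, moving the root changes the probability whenever θ is not the equilibrium distribution of Q, which is what makes the root identifiable.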
Tree Inference
• Fix a rooted tree.
• Find the most likely parameter values.
• The maximum likelihood is the support of the
tree.
• Choose tree with highest support.
Site Heterogeneity
• Codon positions, secondary structure.
• Deterministic or random relative rates can be
accommodated in the model.
• Two deterministic models: codon position, and
codon position + fast/slow.
Two deterministic models
• codon: 3 fixed unknown rates, corresponding
to codon positions, with weighted average 1.
• codonsite: get two classes of amino acids
(fast/slow) from CLUSTAL alignment output.
Coupled with codon positions, get 6 unknown
rates with weighted average 1.
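The constraint "weighted average 1" can be imposed by rescaling. The sketch below assumes each class has a raw rate and a weight equal to the proportion of sites in that class. The raw values are invented for illustration.

```python
def normalise_rates(raw_rates, weights):
    """Rescale rates so that their weighted average equals 1."""
    total_weight = sum(weights)
    mean = sum(r * w for r, w in zip(raw_rates, weights)) / total_weight
    return [r / mean for r in raw_rates]

def weighted_average(rates, weights):
    return sum(r * w for r, w in zip(rates, weights)) / sum(weights)

# codon model: three classes, one per codon position, equal proportions.
raw = [1.0, 0.5, 3.0]        # hypothetical unnormalised rates
weights = [1 / 3, 1 / 3, 1 / 3]
rates = normalise_rates(raw, weights)
print(rates, weighted_average(rates, weights))

# A site in class c evolves along a branch of length t as if the
# branch had length rates[c] * t.
```

The same function covers the codonsite model by passing six rates and six class proportions.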
Test Data Sets
• A: human, chimp, gorilla
• B: human, mouse, rat
• C: human, chimp, gorilla, orangutan
• D: human, chimp, mouse, rat
• E: human, mouse, chicken, frog
• 13 mitochondrial protein-coding genes
Method
• Unrooted tree is assumed known.
• For each rooted tree consistent with the
unrooted tree, its support is the maximum
loglikelihood upon finding the MLE of the
process parameters and branch lengths.
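The rooted trees consistent with an unrooted tree are obtained by placing the root on each branch in turn, giving 2n − 3 candidates for a binary tree on n taxa. The sketch below assumes the unrooted tree is given as an adjacency dictionary. The internal node names "x" and "y" are hypothetical labels.

```python
def subtree(adj, node, parent):
    """Newick-style string for the part of the tree hanging below `node`."""
    children = [n for n in adj[node] if n != parent]
    if not children:
        return node
    return "(" + ",".join(sorted(subtree(adj, c, node) for c in children)) + ")"

def rootings(adj):
    """One rooted topology per branch of the unrooted tree."""
    edges = {tuple(sorted((u, v))) for u in adj for v in adj[u]}
    trees = []
    for u, v in sorted(edges):
        sides = sorted([subtree(adj, u, v), subtree(adj, v, u)])
        trees.append("(" + ",".join(sides) + ")")
    return trees

# Data set A: three taxa joined at a single internal node.
three = {
    "x": ["human", "chimp", "gorilla"],
    "human": ["x"], "chimp": ["x"], "gorilla": ["x"],
}
# Data set D: four taxa, two internal nodes.
four = {
    "x": ["human", "chimp", "y"], "y": ["mouse", "rat", "x"],
    "human": ["x"], "chimp": ["x"], "mouse": ["y"], "rat": ["y"],
}
for tree in rootings(three):
    print(tree)
print(len(rootings(four)))
```

Each candidate would then be scored by its maximised log-likelihood, as described in the method.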
Method (continued)
• Three processes: REV, STA, NSTA
• Three site models: novar (no variations),
codon (3 classes), codonsite (6 classes).
Method (continued)
• Two outcomes
(a) number of genes for which the correct
rooted tree is the most likely
(b) does the model get the right rooted tree
when the loglikelihoods are summed over
genes?
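Both outcomes can be computed from a table of maximised log-likelihoods per gene and per rooted tree. The sketch below uses invented gene names, tree labels and values purely to show the bookkeeping. None of the numbers come from the study.

```python
def count_successes(support, correct):
    """Outcome (a): number of genes whose best-supported tree is `correct`."""
    return sum(1 for by_tree in support.values()
               if max(by_tree, key=by_tree.get) == correct)

def combined_best(support):
    """Outcome (b): sum log-likelihoods over genes, then pick the best tree."""
    totals = {}
    for by_tree in support.values():
        for tree, loglik in by_tree.items():
            totals[tree] = totals.get(tree, 0.0) + loglik
    return max(totals, key=totals.get), totals

# Hypothetical maximised log-likelihoods for three genes and three rootings.
support = {
    "gene1": {"T1": -1000.0, "T2": -1002.5, "T3": -1003.0},
    "gene2": {"T1": -801.0, "T2": -800.5, "T3": -804.0},
    "gene3": {"T1": -1200.0, "T2": -1204.0, "T3": -1201.0},
}
print(count_successes(support, "T1"))   # 2
best, totals = combined_best(support)
print(best, totals)
```

Summing log-likelihoods over genes corresponds to treating the genes as independent with separate parameters but a shared rooted topology, which is an assumption of this sketch.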
Number of successes
Site model   Process    A    B    C    D    E
novar        NSTA       5   11   12   12    0
novar        STA        4    7    3    4    4
codon        NSTA       9   13   11   12    9
codon        STA        3    6    2    2    5
codonsite    NSTA       9   12   10   12    8
codonsite    STA        3    5    1    5    6
Combined genes: Does it get the right tree?
Site model   Process    A    B    C    D    E
novar        NSTA       N    Y    Y    Y    N
novar        STA        Y    N    N    Y    N
codon        NSTA       Y    Y    Y    Y    Y
codon        STA        Y    N    N    Y    Y
codonsite    NSTA       Y    Y    Y    Y    Y
codonsite    STA        Y    N    N    N    N
Discussion (1)
• In general, NSTA fits much better than STA,
which fits much better than REV, by the
likelihood ratio test criterion.
• Not only does NSTA get the right tree more
often than STA, it is also more discriminative:
the best tree has much larger support compared
to the other trees.
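For nested models, the likelihood ratio statistic is twice the difference in maximised log-likelihoods, usually compared with a chi-square distribution whose degrees of freedom equal the number of extra parameters. The sketch below assumes that approximation applies and uses invented log-likelihood values. The chi-square tail is computed from the standard recurrence for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0 - 1e-9:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_null, loglik_alt, extra_params):
    """Return the statistic 2(lnL_alt - lnL_null) and its chi-square p-value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    return stat, chi2_sf(max(stat, 0.0), extra_params)

# Hypothetical values: STA as the null, NSTA as the alternative (3 extra
# parameters, the count stated in the talk).
stat, p = likelihood_ratio_test(-5230.4, -5215.1, extra_params=3)
print(stat, p)
```

A small p-value here would indicate that the extra root-frequency parameters improve the fit by more than chance alone would explain.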
Discussion (2)
• The codonsite model of site variation is very
crude, which may explain why it performs
worse than the codon model.
• Better methods are needed, along with a
comparison against random-rate models such as
the discrete gamma.
Discussion (3)
• NSTA has only 3 more parameters than
STA, and 6 more than REV, so the extra
computation is not heavy.
• Also, since it is possible to identify the root,
perhaps NSTA should be used routinely.
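The parameter counts can be reproduced with a short tally. This is a sketch under two assumptions not spelled out on the slides: "calibrated" removes exactly one degree of freedom from Q, and REV is the general reversible model with six exchangeability parameters plus base frequencies.

```python
STATES = 4
OFF_DIAGONAL_RATES = STATES * (STATES - 1)   # 12 rates in a general Q
CALIBRATION = 1                              # assumed: one scaling constraint
FREE_FREQUENCIES = STATES - 1                # frequencies sum to 1

# STA: general Q; the root frequencies are determined by Q.
sta = OFF_DIAGONAL_RATES - CALIBRATION
# NSTA: general Q plus free root frequencies.
nsta = sta + FREE_FREQUENCIES
# REV (assumed general reversible): 6 exchangeabilities plus frequencies.
rev = 6 + FREE_FREQUENCIES - CALIBRATION

print(nsta - sta, nsta - rev)   # 3 6
```

Branch lengths and the topology are common to all three models, so they do not affect the differences.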
Discussion (4)
• Constraint on NSTA: base compositions of
sequences that are equally distant from the root
are the same. This may not hold.
• Software freely available upon request. Email
[email protected]