Phylogenetic Analysis by Maximum Likelihood (PAML)
Department of Biology
University College London
• Overview of PAML, things it can do, and
especially things that other programs don’t do.
• An example (of detecting amino acids under positive selection)
• The trouble with sliding windows
PAML programs, currently in ver 3.15
• PAML programs are written in ANSI C.
Executables are provided for MS Windows and
Mac OS X. Source code can be compiled for
Unix and other platforms.
• Free for academics (and everybody else).
• Sequential, not parallelized.
• Old-style command-line programs, with no GUI,
no menu, no mice.
• Yang’s theorem:
Every version of PAML has bugs.
baseml    ML under nucleotide-based models
basemlg   Continuous-gamma, for bases (Yang 1993)
codeml    ML for amino acids & codons
evolver   The world’s best named simulation program
yn00      dN and dS estimation using Y&N2000
chi2      $\chi^2$ critical values and p values
pamp      Parsimony calculations (Yang and Kumar 1996)
mcmctree  Species divergence times, soft bounds, relaxed
          clocks (Yang & Rannala 2006)
PAML docs & examples
• doc in doc/: pamlDOC.pdf, pamlFAQ.pdf,
• examples/ are provided with README files
• Apologies for poor support.
Bug reports can come to my mailbox. Questions
should go to paml discussion group:
• Poor tree search
• Poor user interface
• Many models implemented in the
Uses of PAML (i)
Maximum likelihood parameter estimation and
likelihood ratio tests of hypotheses under a number of
substitution models based on nucleotides, amino acids,
and codons (such as the molecular clock, rate variation
Most of the nucleotide-based models are available in
Most of the models are available in MrBayes?
Uses of PAML (ii)
Likelihood (empirical Bayes) reconstruction of
ancestral nucleotide, amino acid, or codon sequences.
This is the same as parsimony reconstruction except
that it accounts for different branch lengths and
different rates of change between states.
Yang, Z., S. Kumar, and M. Nei. 1995. Genetics 141:1641-1650.
Koshi, J. M., and R. A. Goldstein. 1996. J. Mol. Evol. 42:313-320.
Pupko, T., I. Pe’er, R. Shamir, and D. Graur. 2000. Mol. Biol. Evol. 17:890-896.
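To illustrate how branch lengths enter the reconstruction, here is a minimal sketch, not PAML's implementation. It assumes the JC69 model, a root with only two tips, and made-up branch lengths; the function names `jc69_prob` and `root_posterior` are hypothetical. With tips A and C, parsimony cannot choose between the two states, whereas the posterior favours the state at the end of the short branch.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 transition probability from base i to base j over branch length t
    (t measured in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def root_posterior(tip_states, branch_lengths):
    """Posterior probability of each base at the root, given the tip states
    and the length of the branch leading to each tip."""
    joint = {}
    for x in BASES:
        like = 0.25  # JC69 equilibrium frequency, used as the prior
        for s, t in zip(tip_states, branch_lengths):
            like *= jc69_prob(x, s, t)
        joint[x] = like
    total = sum(joint.values())
    return {x: v / total for x, v in joint.items()}

# Assumed example: tip 1 is A on a short branch, tip 2 is C on a long branch.
post = root_posterior(["A", "C"], [0.05, 1.0])
for base in BASES:
    print(base, round(post[base], 3))
```

With equal branch lengths the same function gives A and C identical posterior probabilities, which matches the parsimony tie.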
Uses of PAML (iii)
• Combined analysis of heterogeneous data sets.
• MrBayes has implemented more powerful
models of this kind (Nylander, et al. 2004. Syst.
• These should make the following debates
• combined analysis (total evidence) vs.
• Supertree vs. supermatrix
Yang, Z. 1996. Maximum-likelihood models for combined analyses of
multiple sequence data. J. Mol. Evol. 42:587-596.
Pupko, T., D. Huchon, Y. Cao et al. 2002. Combining multiple data sets in a
likelihood analysis: which models are the best? Mol. Biol. Evol. 19:2294-2307.
Uses of PAML (iv)
Likelihood ratio test of the clock and likelihood
estimation of species divergences under clock and
relaxed-clock models (baseml & codeml)
Bayesian estimation of species divergence times
using soft bounds and relaxed molecular clocks
(mcmctree), similar to Jeff Thorne’s multidivtime.
Rambaut, A., and L. Bromham. 1998. Mol. Biol. Evol. 15:442-448.
Yoder, A. D., and Z. Yang. 2000. Mol. Biol. Evol. 17:1081-1090.
Yang, Z., and A. D. Yoder. 2003. Syst. Biol. 52:705-716.
Yang, Z., and B. Rannala. 2006. Mol. Biol. Evol. 23:212-226.
Rannala, B. and Z. Yang. in preparation.
Uses of PAML (v): Codon substitution models &
detection of selection in protein-coding genes (codeml)
• Branch models to test positive selection on lineages
on the tree
(Yang 1998. Mol. Biol. Evol. 15:568-573)
• Site models to test positive selection affecting
(Nielsen & Yang. 1998. Genetics 148:929-936; Yang, et al. 2000. Genetics 155:431-449)
• Branch-site models to detect positive selection at a
few sites on a particular lineage
(Yang & Nielsen. 2002. Mol. Biol. Evol. 19:908-917; Yang, et al. 2005. Mol.
Biol. Evol. 22:1107-1118; Zhang, J., R. Nielsen, and Z. Yang. 2005. Mol.
Biol. Evol. 22:2472-2479)
MacCallum, C., and E. Hill. 2006.
Being positive about selection. PLoS
PLoS Biol is receiving and rejecting too
many manuscripts that use the M&K test
and paml/codeml to detect positive selection.
Their main criterion right now is that the
ms. should include experimental
verification to justify publication in such
LRT of amino acid sites under positive selection
H0: there are no sites at which $\omega > 1$
H1: there are such sites
Compare $2\Delta\ell = 2(\ell_1 - \ell_0)$ with a $\chi^2$ distribution
(Nielsen & Yang 1998 Genetics 148:929-936;
Yang, Nielsen, Goldman & Pedersen 2000. Genetics 155:431-449)
Models M1a & M2a
Modified from Nielsen & Yang (1998), where $\omega_0 = 0$ is fixed
Human MHC Class I data:
192 alleles, 270 codons
M1a: $p_0 = 0.830$, $\omega_0 = 0.041$
     $p_1 = 0.170$, $\omega_1 = 1$
M2a: $p_0 = 0.776$, $\omega_0 = 0.058$
     $p_1 = 0.140$, $\omega_1 = 1$
     $p_2 = 0.084$, $\omega_2 = 5.389$
Likelihood ratio test of positive selection:
$2\Delta\ell = 2 \times 259.84 = 519.68$, $P < 0.001$, d.f. = 2
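The arithmetic of this test can be checked in a few lines. The sketch below takes the log-likelihood difference 259.84 from the slide and uses the closed-form upper tail of the chi-square distribution with 2 d.f., exp(-x/2). The function names are illustrative and are not part of PAML.

```python
import math

def lrt_statistic(delta_lnl):
    """Likelihood ratio statistic: twice the log-likelihood difference
    between the alternative and the null model."""
    return 2.0 * delta_lnl

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 d.f.
    For 2 d.f. this has the closed form exp(-x / 2)."""
    return math.exp(-x / 2.0)

delta_lnl = 259.84          # log-likelihood difference, M2a minus M1a (from the slide)
stat = lrt_statistic(delta_lnl)
p_value = chi2_sf_df2(stat)

print("2*delta_lnL =", round(stat, 2))
print("p-value     =", p_value)
print("5% critical value check:", round(chi2_sf_df2(5.991), 3))
```

The p value is far smaller than any conventional threshold, so the comparison of M1a against M2a rejects the null model for these data.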
Posterior probabilities for MHC (M2a)
There are a few wrong ways of detecting positive selection,
one of which is sliding windows.
Sliding window analysis
Two trends in sliding window analysis
• Both dS and dN fluctuate smoothly (because consecutive windows overlap)
• dS fluctuates more than dN (because there are fewer
silent than replacement sites)
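The smoothness can be reproduced with data that contain no trend at all. The following sketch is a toy simulation, not an analysis of real sequences: every site changes independently with the same assumed probability (0.1), and the window width (30) and function names are arbitrary choices. Overlapping windows share most of their sites, so neighbouring window totals are strongly correlated, while disjoint windows are not.

```python
import random

def window_sums(values, width, step):
    """Total number of changes in each window of the given width."""
    last_start = len(values) - width
    return [sum(values[i:i + width]) for i in range(0, last_start + 1, step)]

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a series (correlation between neighbours)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

rng = random.Random(1)
# Null data: every site has the same chance of a change, so there is no real trend.
sites = [1 if rng.random() < 0.1 else 0 for _ in range(3000)]

overlapping = window_sums(sites, width=30, step=1)   # classic sliding window
disjoint = window_sums(sites, width=30, step=30)     # non-overlapping windows

print("lag-1 autocorrelation, overlapping windows:", round(lag1_autocorr(overlapping), 3))
print("lag-1 autocorrelation, disjoint windows:   ", round(lag1_autocorr(disjoint), 3))
```

The smooth peaks and troughs in a sliding-window plot are therefore expected even when the underlying rate is constant along the sequence.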
Sliding windows may be useful for displaying trends that
are known to exist, but are misleading if used to detect
Orthodox statistical analysis
• formulate a biological hypothesis
• design the experiment & collect data
• test whether the data are compatible with the hypothesis
The more-common way of data analysis in biology
• a large amount of data, no a priori hypothesis
• filter and plot data to identify “unexpected” patterns
• test the patterns using statistical tests
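One way to see the risk in the second workflow is to simulate it. The sketch below is a toy example under stated assumptions: sites change independently with a constant probability (0.1), the analyst scans all windows of width 30, picks the one with the most changes, and then applies a one-sided binomial test at the 5% level as if that window had been chosen in advance. All names and settings are hypothetical.

```python
import math
import random

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def max_window_count(sites, width):
    """Largest number of changes found in any sliding window."""
    return max(sum(sites[i:i + width]) for i in range(len(sites) - width + 1))

def false_positive_rate(n_sites=300, width=30, p=0.1, alpha=0.05, reps=500, seed=7):
    """Fraction of null data sets in which the most extreme window,
    selected after looking at the data, is declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sites = [1 if rng.random() < p else 0 for _ in range(n_sites)]
        k = max_window_count(sites, width)
        if binom_upper_tail(k, width, p) < alpha:
            hits += 1
    return hits / reps

rate = false_positive_rate()
print("nominal level: 0.05, observed false-positive rate:", rate)
```

Because the window is chosen for being extreme, the test no longer has its nominal error rate; a pre-specified hypothesis avoids this selection effect.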