Lecture 1: Overview of Phylogenetic
methods and applications
Allan Wilson
Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification,
implying relationships between organisms by
unbroken genetic lines
Phylogenetics seeks to determine these genetic
relationships
Alfred Russel Wallace
Darwin’s sketch: the
first phylogenetic tree?
Charles Darwin
Interpretation of morphological characters is
often subjective, so open to personal biases
Ji et al.
Opalized lower jaw of the
monotreme Steropodon
Hu et al.
Cynodonts (0)
Morganucodonts (1)
Eutriconodonts (1)
Spalacotheriids (2)
Eupantotheres (2)
Archaic therians (2)
Modern therians (2)
e.g. Jaw rotation: weak (0), moderate (1), strong (2) as indicated
by vertical wear facets on molars.
Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded
Steropodon (1) and (2) respectively, helping to account for their
alternative placements of monotremes
Deoxyribonucleic acid
(DNA) -Watson, Crick,
Wilkins and Franklin
Early Molecular phylogenetics
- Immunological distances
- DNA-DNA hybridization
Without access to the actual sequences, it is difficult to apply
corrections and statistical significance testing to these distances
Phylogenetics is now dominated by the clearly
defined 4 nucleotides and 20 amino acids
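Purines: A, G
Pyrimidines: C, T
Transitions: purine to purine (A <-> G) or pyrimidine to pyrimidine (C <-> T)
Transversions: purine to pyrimidine, or the reverse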
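The purine/pyrimidine grouping can be turned into a small classifier. This is a minimal sketch only: the function names `classify_substitution` and `count_ti_tv` are made up for illustration, and the two sequences are the taxon X and taxon Z examples used later in the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Classify a pair of DNA bases as identical, a transition or a transversion."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not an unambiguous DNA base: {base!r}")
    if a == b:
        return "identical"
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def count_ti_tv(seq1, seq2):
    """Count observed transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ti = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv

print(count_ti_tv("ACATGTG", "ACGTCAG"))  # (1, 2)
```

These are observed differences only; no correction for multiple substitutions at the same site is attempted.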
Millions of years
Hominid phylogeny from DNA
Tree terminology
Taxon 1
Taxon 2
Taxon 3
Taxon 4
Taxon 5
Taxon 6
Taxon 7
Taxon 8
Rooted tree
internal
edge/branch
Unrooted tree
external
edge/branch
node
internode
ingroup
outgroup
polyphyly
Sister taxa
paraphyly
bifurcating
polytomy
Overview of phylogenetic procedure - by example
1. Biological problem (the question)
2. Which data to obtain (data sampling)
3. Finding the best tree (search strategy)
4. Defining the best tree (optimality criterion)
1. Biological problem (the question)
What is the relationship of the extinct American
Cheetah (Miracinonyx trumani) to other cats?
Two main sister group hypotheses
A. Cheetahs (Acinonyx jubatus): Limb, skull, vertebrae
morphology
B. Pumas (Felis concolor): Geography, early fossils less
cheetah-like
See Barnett et al.
(Curr. Biol., 2005)
2. Which data to obtain (data sampling)
Mitochondrial (mt) DNA
1. High mtDNA copy number is important because ancient DNA is
degraded
2. Inferring relatively recent (2-10 million year) divergences, so
substantial sequence variation is required
[Figure: observed divergence plotted against time for three classes of marker]
mt control region: best < 2 million years
mt protein/RNA coding: best 2-25 million years
Nuclear protein-coding: best > 25 million years
Mitochondrial partial NADH1 alignment for birds
#Nexus
Begin DATA;
Dimensions ntax=29 nchar=10692;
Format datatype=dna gap=-;
Matrix
Tinamou       AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..
Emu           AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..
Cassowary     AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..
Kiwi          AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..
Rhea          AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..
Ostrich       ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..
Chicken       AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..
BrushTurkey   AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..
MagpieGoose   AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..
Duck          AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..
Broadbill     ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..
Flycatcher    ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..
ZebraFinch    ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..
Rook          GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..
Oystercatcher ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..
Turnstone     ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..
Penguin       GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..
Albatross     ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..
;
End;
Tree reconstruction
Tree-building method, by type of data:
Distances (information loss, often statistical power loss)
- Clustering algorithm (faster): Unweighted pair group method with arithmetic means (UPGMA), Neighbour-joining (NJ)
- Optimality criterion (slower): Minimum evolution (ME)
Discrete (e.g. nucleotides)
- Optimality criterion (slower): Maximum parsimony (MP), Maximum likelihood (ML)
3. Finding the best tree (search strategy)
Number of possible trees (where n is the number of taxa)
Unrooted trees: (2n-5) × (2n-7) × … × 3 × 1
Rooted trees: (2n-3) × (2n-5) × … × 3 × 1
For the 11-taxon cat phylogeny
Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425
Rooted = Unrooted × (2n-3) = 654,729,075
An exhaustive search will examine all trees, but is
not practical for n > 12
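The tree counts can be checked with a few lines of code. This is a small sketch of the product formulas for strictly bifurcating trees; the function names are illustrative, not from any phylogenetics package.

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5) x (2n-7) x ... x 3 x 1."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

def num_rooted_trees(n):
    """Number of rooted bifurcating trees: the unrooted count times (2n-3)."""
    return num_unrooted_trees(n) * (2 * n - 3)

print(num_unrooted_trees(11))  # 34459425
print(num_rooted_trees(11))    # 654729075
```

The count grows faster than exponentially with n, which is why exhaustive searches quickly become impractical.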
Reducing the time for searching “tree space”
Heuristic search
Find an initial tree, and move within near-by tree-space,
discarding worse alternatives
Only a small amount of tree-space is searched and there is no
guarantee of finding the optimal tree - can be trapped in local
maxima
[Figure: search landscape with the starting point, local optima and global optima marked]
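The local-optimum problem can be illustrated with a toy hill-climbing search. This is a sketch under strong simplifying assumptions: real tree-space is not a line, and the list of scores below is a hypothetical landscape invented for the example (higher is better).

```python
def hill_climb(scores, start):
    """Greedy search on a 1-D landscape: move to the better neighbour until none improves."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1) if 0 <= i < len(scores)]
        if not neighbours:
            return current
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[current]:
            return current  # no neighbour is better: an optimum, possibly only local
        current = best

# Hypothetical scores: a local optimum at index 2 and the global optimum at index 6.
landscape = [1, 3, 5, 4, 2, 6, 9, 7, 3]

print(hill_climb(landscape, start=0))  # 2 (trapped on the local optimum)
print(hill_climb(landscape, start=4))  # 6 (reaches the global optimum)
```

The outcome depends on the starting point, which is why heuristic searches are usually repeated from several different starting trees.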
Branch and Bound search
As trees are built and branches added, if the addition of a taxon to
a particular branch results in a tree-length greater than a previously
determined upper bound for the tree, then this topology and all
those derived from it are ignored and the search continues with a
new placement for that taxon
Branch and bound guarantees finding globally optimal trees
[Figure: search landscape with the starting point, local optima and global optima marked]
4. Defining the best tree (optimality criteria)
Distance methods
Absolute distance matrix
                  1    2    3    4    5    6    7    8    9   10   11
 1 Mongoose       -
 2 Hyena        156    -
 3 Sabretooth   207  147    -
 4 Am.Cheetah   192  140  159    -
 5 Lion         186  134  148  131    -
 6 Tiger        160  143  132  111   64    -
 7 Puma         194  139  162   70  124  100    -
 8 House.Cat    206  133  163  124  118  100  117    -
 9 Cheetah      192  139  162  108  127  109   96  110    -
10 Ocelot       206  123  165  116  116   98  111   98  113    -
11 Jaguarundi   204  147  177  123  143  121  101  119  128  131    -
Early phenetics (distance/similarity) studies would note
that taxon X and taxon Z are the most similar
Taxon Y TCAGCTA
Taxon X ACATGTG
Taxon Z ACGTCAG
XZ = 3 differences
YZ = 5 differences
XY = 4 differences
[Figure: phenetic tree grouping taxon X with taxon Z, with taxon Y outside]
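The pairwise difference counts can be reproduced directly from the three sequences. This is a minimal sketch; the helper name `hamming` and the dictionary layout are choices made for this example.

```python
from itertools import combinations

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

taxa = {"X": "ACATGTG", "Y": "TCAGCTA", "Z": "ACGTCAG"}

# Absolute distance for every pair of taxa.
distances = {
    (p, q): hamming(taxa[p], taxa[q])
    for p, q in combinations(sorted(taxa), 2)
}
print(distances)  # {('X', 'Y'): 4, ('X', 'Z'): 3, ('Y', 'Z'): 5}

# The most similar pair is the one a phenetic approach would group first.
closest = min(distances, key=distances.get)
print(closest)  # ('X', 'Z')
```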
Cladistic methods, rather than being concerned with similarity,
are concerned with the nature of changes (apomorphies)
Taxon Y   TC A GCTA
Taxon X   AC A TGTG
Taxon Z   AC G TCAG
Outgroup  AA G TCTG
[Figure: character 3 is highlighted; character states are labelled as synapomorphy, autapomorphy or symplesiomorphy]
Synapomorphies are shared derived characters and so
are considered to define clades (relationship groupings)
Maximum Parsimony: chooses the tree topology that
minimises the number of changes required
* Character 3 changes G to A
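[Figure: two candidate trees rooted with the outgroup. The MP tree groups taxon X with taxon Y, so the character 3 change is a synapomorphy (7 steps). The phenetic tree groups taxon X with taxon Z, which requires homoplasy at character 3 and is sub-optimal (8 steps).]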
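The 7-step and 8-step counts can be checked with Fitch's small-parsimony algorithm. This is a sketch under stated assumptions: the outgroup sequence AAGTCTG is pieced together from the fragments in the cladistics figure (AA, G, TCTG), the nested-tuple tree encoding is a convenience for this example, and all changes are weighted equally.

```python
def fitch_site(tree, states):
    """Fitch pass for one site: return (possible ancestral states, number of changes)."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_length(tree, alignment):
    """Total number of changes a tree requires, summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

alignment = {
    "X": "ACATGTG",
    "Y": "TCAGCTA",
    "Z": "ACGTCAG",
    "Outgroup": "AAGTCTG",  # assumption: reconstructed from the figure fragments
}

mp_tree = ((("X", "Y"), "Z"), "Outgroup")        # X and Y as sister taxa
phenetic_tree = ((("X", "Z"), "Y"), "Outgroup")  # X and Z as sister taxa

print(parsimony_length(mp_tree, alignment))        # 7
print(parsimony_length(phenetic_tree, alignment))  # 8
```

The extra step on the second tree comes from character 3, which must change twice when X and Y are not sister taxa.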
Maximum Likelihood: The explanation that makes
the observed outcome the most likely
L = Pr(D|H)
Probability of the data, given an hypothesis
The hypothesis is a tree topology, its branch lengths and a model under which the data evolved
First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for
gene frequency data; Felsenstein (1981) for DNA sequences
[Figure: four-taxon tree with tip states A, A, G and G; branch lengths of 0.5, 0.5, 0.6, 0.4 and 0.4 substitutions per site]
Sum the probabilities for each of the 16 internal node combinations to get the
likelihood for this single nucleotide site
[Figure: the 16 possible assignments of A, C, G or T to the two internal nodes]
Model of rate change e.g. Hasegawa-Kishino-Yano (1985): 4 base frequencies,
transition/transversion (ti/tv) ratio
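The sum over the 16 internal node combinations can be written out explicitly. The sketch below swaps in the simpler Jukes-Cantor (JC69) model, with equal base frequencies and a single rate, rather than a model with a ti/tv ratio. The assignment of branch lengths is also an assumption: 0.5 for each A tip, 0.4 for each G tip and 0.6 for the internal branch.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69_prob(a, b, t):
    """JC69 probability of observing base b after branch length t, starting from base a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tips, tip_lengths, internal_length):
    """Likelihood of one site on the tree ((tip1, tip2), (tip3, tip4)).

    Sums over every assignment of bases (x, y) to the two internal nodes.
    """
    t1, t2, t3, t4 = tips
    l1, l2, l3, l4 = tip_lengths
    total = 0.0
    for x, y in product(BASES, repeat=2):  # the 16 combinations
        total += (
            0.25                                # JC69 base frequency of x
            * jc69_prob(x, t1, l1)
            * jc69_prob(x, t2, l2)
            * jc69_prob(x, y, internal_length)  # change along the internal branch
            * jc69_prob(y, t3, l3)
            * jc69_prob(y, t4, l4)
        )
    return total

likelihood = site_likelihood(("A", "A", "G", "G"), (0.5, 0.5, 0.4, 0.4), 0.6)
print(likelihood)
```

A useful sanity check is that the likelihoods of all 256 possible tip patterns sum to 1 for any fixed tree and branch lengths.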
The likelihood of a tree is the product of the site
likelihoods. Taken as natural logs, the site likelihoods
can be summed to give the log likelihood:
lnL = ln L(1) + ln L(2) + … + ln L(N), for sites 1 to N
The tree with the highest lnL (that is, the lowest -lnL) is the ML tree
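Working with logs is a numerical necessity as well as a convenience, as this short sketch suggests. All values here are hypothetical: the per-site likelihoods and the two tree scores are invented for illustration.

```python
import math

def log_likelihood(site_likelihoods):
    """Sum of natural-log site likelihoods (the lnL of the tree)."""
    return sum(math.log(l) for l in site_likelihoods)

# Hypothetical site likelihoods for a 100-site alignment.
sites = [1e-5] * 100

# The raw product underflows to zero in floating point...
print(math.prod(sites))       # 0.0
# ...whereas the sum of logs stays well behaved.
print(log_likelihood(sites))  # about -1151.29

# Hypothetical lnL scores for two candidate trees: the ML tree has the highest lnL.
candidate_lnl = {"tree A": -1234.5, "tree B": -1229.8}
ml_tree = max(candidate_lnl, key=candidate_lnl.get)
print(ml_tree)  # tree B
```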
• ML is computationally intensive (slow)
• If branch-lengths are long, such that substitutions
occur multiple times along the same branch for the
same site, ML will be more consistent than MP – if
the evolutionary process is sufficiently well
modelled.
Bayesian Inference: The explanation with the highest
posterior probability
Bayes' Theorem
Pr(H|D) = Pr(H) × Pr(D|H) / Pr(D)
Pr(H|D): posterior probability, the probability of the hypothesis given the data
Pr(H): prior probability, the probability of the hypothesis on previous knowledge
Pr(D|H): likelihood function, probability of the data given the hypothesis
Pr(D): unconditional probability of the data, a normalizing constant ensuring the
posterior probabilities sum to 1.00
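For a small, discrete set of hypotheses Bayes' theorem can be applied directly. In this sketch the three candidate trees, the flat prior and the likelihood values are all hypothetical numbers chosen for illustration.

```python
def posterior(priors, likelihoods):
    """Posterior Pr(H|D) for each hypothesis H, from Pr(H) and Pr(D|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(D), the normalizing constant
    return {h: value / evidence for h, value in joint.items()}

# Hypothetical values: a flat prior over three trees and made-up likelihoods.
priors = {"tree 1": 1 / 3, "tree 2": 1 / 3, "tree 3": 1 / 3}
likelihoods = {"tree 1": 0.004, "tree 2": 0.001, "tree 3": 0.001}

post = posterior(priors, likelihoods)
print(post)  # tree 1 receives about 0.667, the other two about 0.167 each
```

With a flat prior the posterior is simply the likelihood rescaled to sum to 1; an informative prior would shift these values.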
First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)
Bayesian inference in phylogenetics is essentially a likelihood
method, but may more closely reflect the way humans think.
• It is informed by prior knowledge (e.g. fossil data)
• Emphasis is placed on Pr(H|D) instead of Pr(D|H)
Markov chain Monte Carlo (MCMC) is used to approximate
Bayesian posterior probabilities (BPP) over 1,000s to
1,000,000s of generations
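[Figure: MCMC chain over generations 1 to 6 moving among trees 1, 2 and 3, with each proposed new state either accepted or rejected. Tree 1 is sampled in 4 of the 6 generations, so BPP(tree 1) = 4/6.]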
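The accept/reject scheme can be sketched with a Metropolis sampler over three candidate trees. This is a toy illustration only: the unnormalized posterior weights are hypothetical, proposals are drawn uniformly from the other two trees, and real phylogenetic MCMC also samples branch lengths and model parameters.

```python
import random
from collections import Counter

def metropolis(weights, generations, seed=1):
    """Estimate posterior probabilities as the fraction of generations spent in each state."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(generations):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.choice([s for s in states if s != current])
        ratio = weights[proposal] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposal   # new state accepted
        visits[current] += 1     # if rejected, the chain stays where it is
    return {s: visits[s] / generations for s in states}

# Hypothetical unnormalized posterior weights (prior x likelihood) for three trees.
weights = {"tree 1": 4.0, "tree 2": 1.0, "tree 3": 1.0}

bpp = metropolis(weights, generations=60000)
print(bpp)  # tree 1 should come out close to 4/6
```

Only ratios of weights are needed, so the normalizing constant Pr(D) never has to be computed.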
Posterior probabilities are integrated over all trees in the
posterior distribution – providing density distributions rather
than the optimization of likelihood
[Figure: a flat prior for a parameter value (e.g. proportion of invariant sites) over the range 0 to 1.0, and the posterior for the proportion of invariant sites over the same range]
The American cheetah is related to the puma: morphological similarity to the cheetah is convergence
[Figure: two trees for Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with the American felids labelled. Left: maximum parsimony and neighbour-joining (distance) cladogram. Right: maximum likelihood and Bayesian inference phylogram, scale bar 0.05 substitutions/site.]
Applications:
The tree of life
and inferring
our origins
146 gene phylogeny:
Delsuc et al. (Nature,
2006)
Little evidence from
fossils
Identifying selection
ACA GAG CGC Threonine - Glutamic acid - Arginine
ACG GAG AGC Threonine - Glutamic acid - Serine
Synonymous (S)
non-synonymous (N) substitutions
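The two example sequences can be classified codon by codon using the standard genetic code. This sketch only labels observed codon differences; it is not a dN/dS estimator, which also has to account for the number of synonymous and non-synonymous sites. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_changes(seq1, seq2):
    """List codon differences between two aligned coding sequences and their type."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 in length")
    changes = []
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        same_amino_acid = CODON_TABLE[c1] == CODON_TABLE[c2]
        changes.append((c1, c2, "synonymous" if same_amino_acid else "non-synonymous"))
    return changes

print(classify_codon_changes("ACAGAGCGC", "ACGGAGAGC"))
# [('ACA', 'ACG', 'synonymous'), ('CGC', 'AGC', 'non-synonymous')]
```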
The dN/dS ratio can be estimated
along branches of phylogenetic
trees (e.g. Guindon et al. PNAS,
2004)
Here dN/dS is indicated by
branch width
Decreased
dN/dS suggests
purifying
selection
Increased dN/dS
suggests Positive
selection
Cohen (Molec. Biol. Evol., 2002) found increased positive selection at
binding sites in the MHC proteins of estuarine fish Fundulus
heteroclitus populations subject to severe chemical pollution.
MHC (Major histocompatibility
complex) binds antigens and presents
them to T-cells as part of the immune
response.
Positive selection at
binding sites provides
high MHC variability
with which to
confront new
pathogenic threats.
Non-synonymous/synonymous
ratios for peptide binding regions
and non-peptide binding regions
Fish from the Hot spot and Gloucester populations are genetically
adapted to severe chemical pollution and show novel patterns of
DNA substitution for Mhc class II B locus including strong signals
of positive selection at inferred antigen-binding sites
Mhc class II B with inferred
locations of population-specific amino acid changes
for Gloucester and Hot Spot.
Stanhope et al. (Infect. Genet. Evol., 2004)
Severe Acute Respiratory Syndrome coronavirus
(SARS-CoV) has a recombinant history with
lineages of types I and III coronavirus
Using more sophisticated models of sequence evolution,
Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could
not reject a single history across the SARS genome
[Figure: coronavirus phylogeny showing groups I, II and III and the position of SARS-TOR2]
Understanding sequence evolution and the biases that may
result from models (which necessarily are simplifications)
is of vital importance in phylogenetic inference
Host-Parasite coevolution/co-speciation
• Etherington et
al. (J. Gen Virol, 2006)
Carnivoran
strains
Artiodactyl
strains
Caliciviruses infect diverse mammalian hosts and include Norovirus,
the major cause of food-borne viral gastroenteritis in humans.
Host switching by caliciviruses is rare, although pigs have strains from
co-speciation (artiodactyl strain) and host switching (carnivoran strain).
Fig (Ficus) and fig wasp mutualism is reflected by
co-speciation patterns: Machado et al. (PNAS, 2006)
Biogeography: vicariance and dispersal
Most frequent area cladograms: mapping taxa onto landmasses
Many plants: follows wind dispersal patterns
Many land animals: follows continental break-up
[Figure: area cladograms for Africa, S. South America, Australia and New Zealand; example taxa are midges, Southern beech, cushion herb and marsupial mammals]
From: SanMartin and Ronquist (Syst. Biol. 2004)
Conservation genetics: Amur leopard (Panthera pardus orientalis)
Relict population
of 25-40
individuals in the
Russian Far East.
Nuclear microsatellites and mtDNA: Uphyrkina et
al. (J. Hered., 2002)
• validates subspecies distinctiveness
• extreme reduction in genetic diversity in the wild
• captive population genetically mixed with the
Chinese subspecies
Macroevolutionary inference
Cretaceous
65 Ma
Tertiary
Present
Does the 65 Ma meteor impact (Alvarez et al. Science, 1980)
fully explain the “great reptile extinction” and the rise of
modern birds and mammals?
Molecular clock: DNA/protein divergence between
organisms is a function of time
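Under the simplest (strict) clock, two lineages each accumulate change at the same constant rate, so divergence time is the pairwise distance divided by twice the rate. This is a sketch of that arithmetic only; the distance and rate below are hypothetical values, and real analyses need calibration and often allow rates to vary.

```python
import math

def divergence_time(distance, rate):
    """Strict-clock divergence time.

    distance: substitutions per site between two taxa
    rate: substitutions per site per million years, per lineage
    Returns time since divergence in millions of years.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate change since the split, hence the factor of 2.
    return distance / (2.0 * rate)

# Hypothetical values: 0.10 substitutions/site apart, 0.01 substitutions/site/Myr.
print(divergence_time(0.10, 0.01))  # about 5 million years
```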
[Figure: relative diversity (axis 0 to 0.12) across the intervals 144-83 Ma, 83-71 Ma, 71-68 Ma and 68-65 Ma, up to the K/T boundary; time axis from 95 Ma to 65 Ma with intervals labelled E-M Cret, CMP, E-M MAA and L. MAA]
Megafaunal extinctions (human induced or climate change)
Macrauchenia
Bison (Lascaux,
France)
The distribution of coalescence
events over time on the tree allows
inference of relative population size
Arrival of humans
in North America
Last glacial maximum