A Brief Overview of
Molecular
Evolution &
Phylogenetics
Aims of the course:
• To introduce the practice of phylogenetic
inference from molecular data.
• To get to know the applications and computer
programs used to practise phylogenetic
inference.
Two Concepts of
Molecular Evolution
• Orthologous vs Paralogous genes
– Genes & species trees
• Molecular clock
– Substitution rates
Homologous genes
• Orthologous genes
Derived from a process of new
species formation (speciation)
• Paralogous genes
Derived from an original gene
duplication process in a single
biological species
Species trees vs Gene trees
• Orthologous genes of
Cytochrome
Each one is present in a biological
species
• Paralogous genes of
Globin
– α, β, δ globins, myoglobin and leghaemoglobin,
each originated by duplication from an
ancestral gene
Species trees and Gene trees
[Figure: a gene tree (tips a, b, c) shown together with a species tree (tips A, B, D)]
We often assume that gene trees give us
species trees
Orthologues and paralogues
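[Figure: an ancestral gene undergoes a duplication to give 2 copies = paralogues on the same genome. After speciation, genes within one copy lineage (a, b, c or A, B, C) are orthologous, and genes from different copy lineages are paralogous. Sampling only the starred genes (A*, b*, C*) gives a mixture of orthologues and paralogues.]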
The malic enzyme gene tree contains a
mixture of orthologues and paralogues
[Figure: malic enzyme gene tree with bootstrap support values (75 to 100) and one marked gene duplication. Tips: Homo sapiens 2 (Mit), Ascaris suum (Mit), Homo sapiens 1 (Cyt), Anas platyrhynchos (Cyt; Anas = a duck!), Zea mays (Ch), Flaveria trinervia (Ch), Populus trichocarpa (Ch), Solanum tuberosum (Mit), Amaranthus (Mit), Neocallimastix (Hyd), Trichomonas vaginalis (Hyd), Giardia lamblia (Cyt), Schizosaccharomyces, Saccharomyces, Lactococcus lactis. Group labels: plant chloroplast, plant mitochondrion.]
Is there a molecular clock?
• The idea of a molecular clock was
initially suggested by Zuckerkandl and
Pauling in 1962
• They noted that rates of amino acid
replacements in animal haemoglobins
were roughly proportional to time - as
judged against the fossil record
The molecular clock for alpha-globin:
[Figure: number of substitutions (y axis, 0 to 100) plotted against time to common ancestor in millions of years (x axis, 0 to 500), with points for shark, carp, platypus, chicken and cow]
Each point represents the number of substitutions separating each animal
from humans
Rates of amino acid replacement in
different proteins
Protein            Rate (mean replacements per site per 10^9 years)
Fibrinopeptides    8.3
Insulin C          2.4
Ribonuclease       2.1
Haemoglobins       1.0
Cytochrome C       0.3
Histone H4         0.01
• Evolutionary rates depend on the functional constraints of proteins
There is no universal clock
• The initial proposal saw the clock as a Poisson
process with a constant rate
• Now known to be more complex - differences in
rates occur for:
– different sites in a molecule
– different genes
– different base positions (synonymous vs nonsynonymous)
– different regions of genomes
– different genomes in the same cell
– different taxonomic groups for the same gene
• Molecular Clocks Not Exactly Swiss
Phylogenetic Trees
[Figure: a cladogram with leaves A to J, labelling the root, node 1 and node 2, the interior branches, the terminal branches (leaves) and a polytomy]
Trees - Rooted and Unrooted
[Figure: trees of the taxa A to J drawn in rooted form, with the root marked, and in unrooted form]
Rooting using an outgroup
[Figure: an unrooted tree of archaea and eukaryotes is rooted by adding bacteria as the outgroup; in the rooted tree the archaea and eukaryotes form a monophyletic ingroup]
Some Common Phylogenetic Methods
Tree-building method   Types of data: Distances          Types of data: Sites (nucleotides, aa)
Cluster algorithms     UPGMA, NJ                         -
Optimality criteria    Minimum Evolution, Least Squares  Parsimony, Maximum Likelihood, Bayesian Inference
Distance Methods
• Distance Estimates attempt to estimate the mean
number of changes per site since 2 species
(sequences) split from each other.
• Simply counting the number of differences may
underestimate the amount of change - especially
if the sequences are very dissimilar - because of
multiple hits.
• We therefore use a model which includes
parameters which reflect how we think sequences
may have evolved.
Distance calculation: observed versus
actual changes
Substitution   Sequence 1   Sequence 2   Observed   Actual
none           A            A            0          0
single         A            A → C        1          1
coincident     A → C        A → G        1          2
multiple       A            A → G → C    1          2
parallel       A → C        A → C        0          2
convergent     A → C        A → G → C    0          3
reverse        A            A → C → A    0          2
The simplest model: Jukes & Cantor:
dxy = -(3/4) ln(1 - (4/3) D)
• dxy = distance between sequence x and sequence y expressed as the
number of changes per site
(note dxy = r/n where r is number of replacements and n is the total
number of sites. This assumes all sites can vary and when unvaried
sites are present in two sequences it will underestimate the amount of
change which has occurred at variable sites)
• D = the observed proportion of nucleotides which differ between
two sequences (fractional dissimilarity)
• ln = natural log function to correct for superimposed substitutions
• The 3/4 and 4/3 terms reflect that there are four types of
nucleotides and three ways in which a second nucleotide may not
match a first - with all types of change being equally likely (i.e.
unrelated sequences should be 25% identical by chance alone)
The natural logarithm ln is used to correct
for superimposed changes at the same site
• If two sequences are 95% identical they are different at 5% or
0.05 (D) of sites thus:
– dxy = -(3/4) ln(1 - (4/3) × 0.05) = 0.0517
• Note that the observed dissimilarity 0.05 increases only slightly to
an estimated 0.0517 - this makes sense because in two very similar
sequences one would expect very few changes to have been
superimposed at the same site in the short time since the sequences
diverged
• However, if two sequences are only 50% identical they are different
at 50% or 0.50 (D) of sites thus:
– dxy = -(3/4) ln(1 - (4/3) × 0.5) = 0.824
• For dissimilar sequences, which may have diverged a long time ago,
the use of ln infers that a much larger number of superimposed
changes have occurred at the same site
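The two worked values above can be reproduced with a short Python sketch. The function names `p_distance` and `jukes_cantor` are illustrative, not taken from any package, and the code assumes two aligned sequences of equal length with no gaps.

```python
from math import log

def p_distance(seq_x, seq_y):
    """Observed proportion of differing sites (D) between two aligned sequences."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_x, seq_y))
    return diffs / len(seq_x)

def jukes_cantor(D):
    """Jukes-Cantor distance: dxy = -(3/4) ln(1 - (4/3) D)."""
    if D >= 0.75:
        # The logarithm is undefined once D reaches the 75% expected by chance.
        raise ValueError("D must be below 0.75 for the correction to be defined")
    return -0.75 * log(1 - (4 / 3) * D)

print(round(jukes_cantor(0.05), 4))  # 0.0517
print(round(jukes_cantor(0.50), 3))  # 0.824
```

The correction grows without bound as D approaches 0.75, which is why very dissimilar sequences give unreliable distance estimates.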
Distance models can be made more
parameter rich to increase their realism
• It is better to use a model which fits the data
than to blindly impose a model on data
• The most common additional parameters are:
– A correction for the proportion of sites which are unable
to change
– A correction for variable site rates at those sites which
can change
– A correction to allow different substitution rates for
each type of nucleotide change
• PAUP will estimate the values of these additional
parameters for you.
A gamma distribution can be used
to model site rate heterogeneity
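As a rough illustration of what such a model implies, the sketch below draws relative site rates from a gamma distribution with mean 1 (shape alpha, scale 1/alpha). The function name and the alpha values are assumptions chosen for illustration; a small alpha gives strong rate heterogeneity and a large alpha gives nearly equal rates.

```python
import random
from statistics import mean, variance

def sample_site_rates(alpha, n_sites, rng):
    """Draw relative rates for n_sites sites; the expected rate is 1."""
    # gammavariate(shape, scale) has mean shape * scale, so scale = 1 / alpha.
    return [rng.gammavariate(alpha, 1 / alpha) for _ in range(n_sites)]

rng = random.Random(1)
for alpha in (0.5, 5.0):  # illustrative shape values
    rates = sample_site_rates(alpha, 20000, rng)
    print(alpha, round(mean(rates), 2), round(variance(rates), 2))
```

With shape alpha the variance of the rates is 1/alpha, so the printed variance should be much larger for alpha = 0.5 than for alpha = 5.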
Exchangeability parameters for two models of amino acid
replacement.
Exchangeability parameters from two common empirical models of amino acid sequence evolution are presented.
The parameter value for each amino acid pair is indicated by the areas of the bubbles, and discounts the effects of
amino acid frequencies.
(a) The JTT model (Jones, D.T. et al. 1992, CABIOS 8, 275–282) derived from a wide variety of globular proteins.
(b) The mtREV model (Yang, Z. et al. 1998 Mol. Biol. Evol. 15, 1600–161) derived from mammalian mitochondrial
genes that encode various transmembrane proteins.
Distances: advantages:
• Fast - suitable for analysing data
sets which are too large for ML
• A large number of models are
available with many parameters, which improves the estimation of distances
• Use ML to test the fit of model to
data
Distances: disadvantages:
• Information is lost - given only the distances it is
impossible to derive the original sequences
• Only through character-based analyses can the
history of sites be investigated, e.g. the most
informative positions inferred.
• Generally outperformed by Maximum likelihood
methods in choosing the correct tree in computer
simulations
Numbers of possible trees
for N taxa:
• $T(n) = \prod_{i=3}^{n} (2i-5)$ unrooted trees for n ≥ 3 taxa
1, 3, 15, 105, 945, 10395, 135135 (n = 3 to 9)
• For 10 taxa there are 2 x 10^6 unrooted
trees
• For 50 taxa there are 3 x 10^74 unrooted
trees
• How can we find the best tree ?
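The counts on this slide can be checked with a few lines of Python. This is a minimal sketch of the product of (2i - 5) for i from 3 to the number of taxa; the function name is illustrative.

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: product of (2i - 5) for i = 3..n."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    total = 1
    for i in range(3, n_taxa + 1):
        total *= 2 * i - 5
    return total

print([num_unrooted_trees(n) for n in range(3, 10)])
# [1, 3, 15, 105, 945, 10395, 135135]
print(f"{num_unrooted_trees(10):.1e}")
print(f"{num_unrooted_trees(50):.1e}")
```

Because the count grows so quickly, exhaustive evaluation of every tree is only feasible for a handful of taxa.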
Cluster Analysis
UPGMA and NJ
The closest pair of elements is
joined recursively. The distance
matrix is recalculated (*) and the
joined pair is then treated as
a new element
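The join-and-recalculate loop can be sketched in Python for UPGMA, where the distance from a merged cluster to any other cluster is the size-weighted average of its two parts. This is an illustrative sketch, not the code of any package; NJ follows the same loop but chooses the pair with a rate-corrected criterion, which is not shown here. The example distances are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({x, y}) to the distance between x and y.
    Returns the tree topology as nested tuples.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)
    while len(sizes) > 1:
        # Join the closest pair of clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Recalculate distances from the new cluster to every remaining one.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

# Hypothetical distances, chosen to be clock-like.
example = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("AD"): 6, frozenset("BD"): 6, frozenset("CD"): 6,
}
print(upgma("ABCD", example))
```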
Unrooted Neighbor-Joining Tree
[Figure: unrooted neighbor-joining tree of Human, Monkey, Mosquito, Rice and Spinach]
A perfectly additive tree
[Figure: unrooted tree ((A,B),(C,D)) with branch lengths A = 0.1, B = 0.3, C = 0.2, D = 0.6 and internal branch = 0.1]
     A     B     C     D
A    -     0.4   0.4   0.8
B    0.4   -     0.6   1.0
C    0.4   0.6   -     0.8
D    0.8   1.0   0.8   -
The branch lengths in the matrix and the tree path lengths
match perfectly - there is a single unique additive tree
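One way to test whether a distance matrix is perfectly additive is the four-point condition: for every set of four taxa, the two largest of the three pairwise sums must be equal. The sketch below applies it to the four-taxon distances on this slide (A-B 0.4, A-C 0.4, A-D 0.8, B-C 0.6, B-D 1.0, C-D 0.8). The function name and the tolerance are assumptions.

```python
from itertools import combinations

def is_additive(taxa, d, tol=1e-9):
    """Four-point condition on d, a dict keyed by frozenset({x, y})."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([
            d[frozenset((w, x))] + d[frozenset((y, z))],
            d[frozenset((w, y))] + d[frozenset((x, z))],
            d[frozenset((w, z))] + d[frozenset((x, y))],
        ])
        # The two largest sums must agree within the tolerance.
        if sums[2] - sums[1] > tol:
            return False
    return True

slide_matrix = {
    frozenset("AB"): 0.4, frozenset("AC"): 0.4, frozenset("AD"): 0.8,
    frozenset("BC"): 0.6, frozenset("BD"): 1.0, frozenset("CD"): 0.8,
}
print(is_additive("ABCD", slide_matrix))  # True
```

Here A-C plus B-D and A-D plus B-C both equal 1.4, while A-B plus C-D is 1.2, so the matrix fits the tree that groups A with B and C with D.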
Distance estimates may
not make an additive tree
[Figure: tree for Aquifex, Bacillus, Thermus, Deinococc and ruber with branch lengths 0.217, 0.119, 0.057, 0.017, 0.056, 0.079 and 0.145. Annotated tree path lengths: Aquifex > Bacillus (0.335), Aquifex > Thermus (0.33), Thermus > Deinococcus (0.218).]
Some path lengths are
longer and others shorter
than appear in the matrix
Jukes-Cantor distance matrix
Proportion of sites assumed to be invariable = 0.56;
identical sites removed proportionally to base
frequencies estimated from constant sites only
               1         2         4         5
1 ruber        -
2 Aquifex      0.38745   -
4 Deinococc    0.22455   0.47540   -
5 Thermus      0.13415   0.27313   0.23615   -
6 Bacillus     0.27111   0.33595   0.28017   0.28846
Obtaining a tree using pairwise
distances
• Stochastic errors will cause deviation of the
estimated distances from perfect tree additivity
even when evolution proceeds exactly according to
the distance model used
• Poor estimates obtained using an inappropriate
model will compound the problem
• How can we identify the tree which best fits the
experimental data from the many possible trees
Obtaining a tree using pairwise distances
• Use statistics to evaluate the fit of tree to
the data (goodness of fit measures)
– Fitch Margoliash method - a least squares method
– Minimum evolution method - minimises length of tree
• Note that neighbor joining while fast does not
evaluate the fit of the data to the tree
Fitch Margoliash Method 1968:
• Minimises the weighted squared
deviation of the tree path length
distances from the distance
estimates
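A minimal Python sketch of this criterion follows, assuming the tree path lengths have already been fitted. Each squared deviation is divided by the estimated distance raised to a power, with power = 2 as in the PAUP output on the following slide. The function name and the three-taxon numbers are hypothetical.

```python
def weighted_least_squares(observed, path, power=2):
    """Sum of (d_ij - p_ij)^2 / d_ij^power over all pairs of taxa.

    observed: estimated pairwise distances, keyed by pair
    path:     path lengths between the same pairs on a candidate tree
    """
    return sum(
        (observed[pair] - path[pair]) ** 2 / observed[pair] ** power
        for pair in observed
    )

# Hypothetical distance estimates and path lengths for two candidate trees.
observed = {("A", "B"): 0.40, ("A", "C"): 0.45, ("B", "C"): 0.60}
tree_1 = {("A", "B"): 0.40, ("A", "C"): 0.40, ("B", "C"): 0.60}
tree_2 = {("A", "B"): 0.50, ("A", "C"): 0.45, ("B", "C"): 0.60}

scores = {name: weighted_least_squares(observed, paths)
          for name, paths in (("tree 1", tree_1), ("tree 2", tree_2))}
print(min(scores, key=scores.get))  # the tree with the smallest score is preferred
```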
Fitch Margoliash Method 1968:
[Figure: Tree 1 and Tree 2 (best), two alternative trees for Aquifex, Bacillus, Thermus, Deinococc and ruber, each drawn with its fitted branch lengths]
Optimality criterion = distance (weighted least squares with power=2)
Score of best tree(s) found = 0.12243 (average %SD = 11.663)
Tree #       1         2
Wtd. S.S.    0.13817   0.12243
APSD         12.391    11.663
Minimum Evolution Method:
• For each possible alternative tree one can
estimate the length of each branch from
the estimated pairwise distances between
taxa and then compute the sum (S) of all
branch length estimates. The minimum
evolution criterion is to choose the tree with
the smallest S value
Minimum Evolution
[Figure: Tree 1 (best) and Tree 2, two alternative trees for Aquifex, Bacillus, Thermus, Deinococc and ruber, each drawn with its estimated branch lengths]
Optimality criterion = distance (minimum evolution)
Score of best tree(s) found = 0.68998
Tree #      1         2
ME-score    0.68998   0.69163
Parsimony analysis
• Parsimony methods provide one way of
choosing among alternative phylogenetic
hypotheses
• The parsimony criterion favours hypotheses
that maximise congruence and minimise
homoplasy (convergence, reversal & parallelism)
• It depends on the idea of the fit of a
character to a tree
Parsimony
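Seq 1 ...ACCT...
Seq 2 ...AACT...
Seq 3 ...TACT...
Seq 4 ...TCCT...
[Figure: the four sequences placed on a tree, with per-site changes 1, 2, 0, 0 = 3. For the second site (1 = C, 2 = A, 3 = A, 4 = C), one tree requires 2 mutations and an alternative tree requires only 1 mutation.]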
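The change counts for this example can be reproduced with Fitch's small-parsimony algorithm. The sketch below is illustrative: trees are written as nested tuples with an arbitrary root, which does not affect the count, and the function names are assumptions.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0  # a leaf: its observed state, no changes
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    changes = left_changes + right_changes
    shared = left_set & right_set
    if shared:
        return shared, changes
    # No state in common: one extra change is needed at this node.
    return left_set | right_set, changes + 1

def tree_length(tree, seqs):
    """Total number of changes over all sites of the alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )

seqs = {1: "ACCT", 2: "AACT", 3: "TACT", 4: "TCCT"}
site_2 = {name: seq[1] for name, seq in seqs.items()}

print(fitch(((1, 3), (2, 4)), site_2)[1])  # 2
print(fitch(((1, 4), (2, 3)), site_2)[1])  # 1
print(tree_length(((1, 2), (3, 4)), seqs))  # 3
```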
Maximum Likelihood - goal
• To estimate the probability that we would
observe a particular dataset, given a
phylogenetic tree and some notion of how the
evolutionary process worked over time.
– P(D|H)
[Figure: the probability of the observed sequence data, given a tree and a substitution model for the nucleotides a, c, g, t]
Maximum likelihood
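[Figure: tree with tips 1 to 4 and internal nodes 5 and 6. Node 5 joins tips 1 and 2 (branch lengths v1, v2), node 6 joins tips 3 and 4 (v3, v4), and nodes 5 and 6 are connected by a branch of length v5.]
Where:
$g_{x_0}$: prior probability that node 0 has nucleotide x (relative frequency)
$P_{ii}(v) = g_i + (1 - g_i)\,e^{-v}$
(if $g_i = 1/4$, model becomes JC)
$P_{ij}(v) = g_j\,(1 - e^{-v})$ for $i \neq j$
$l_k = g_{x_5}\,P_{x_5 x_1}(v_1)\,P_{x_5 x_2}(v_2)\,P_{x_5 x_6}(v_5)\,P_{x_6 x_3}(v_3)\,P_{x_6 x_4}(v_4)$
Since we do not know x5 and x6 we sum over all the possible nucleotides
$L_k = \sum_{x_5} \sum_{x_6} g_{x_5}\,P_{x_5 x_1}(v_1)\,P_{x_5 x_2}(v_2)\,P_{x_5 x_6}(v_5)\,P_{x_6 x_3}(v_3)\,P_{x_6 x_4}(v_4)$
Summing over all sites:
$\ln L = \sum_{k=1}^{n} \ln L_k$
lnL is maximized by changing the $v_i$'s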
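The site likelihood can be computed directly by summing over the unknown states at nodes 5 and 6. This is a minimal sketch for the four-tip tree only, using a transition probability whose off-diagonal term uses the frequency of the destination nucleotide so that each row sums to 1. The alignment, branch lengths and function names are hypothetical, and no optimisation of the branch lengths is attempted.

```python
from itertools import product
from math import exp, log

NUCLEOTIDES = "ACGT"

def transition_prob(i, j, v, g):
    """Probability of going from nucleotide i to j along a branch of length v."""
    decay = exp(-v)
    if i == j:
        return g[i] + (1 - g[i]) * decay
    return g[j] * (1 - decay)

def site_likelihood(tips, v, g):
    """L_k for one site: tips = (x1, x2, x3, x4), v = (v1, v2, v3, v4, v5)."""
    x1, x2, x3, x4 = tips
    v1, v2, v3, v4, v5 = v
    total = 0.0
    for x5, x6 in product(NUCLEOTIDES, repeat=2):  # unknown internal states
        total += (g[x5]
                  * transition_prob(x5, x1, v1, g)
                  * transition_prob(x5, x2, v2, g)
                  * transition_prob(x5, x6, v5, g)
                  * transition_prob(x6, x3, v3, g)
                  * transition_prob(x6, x4, v4, g))
    return total

def log_likelihood(seqs, v, g):
    """ln L: sum of ln L_k over the columns of a four-sequence alignment."""
    return sum(log(site_likelihood(column, v, g)) for column in zip(*seqs))

g = dict.fromkeys(NUCLEOTIDES, 0.25)            # equal frequencies (JC case)
seqs = ["ACCT", "AACT", "TACT", "TCCT"]         # tips 1 to 4
branch_lengths = (0.1, 0.1, 0.1, 0.1, 0.05)     # hypothetical v1..v5
print(log_likelihood(seqs, branch_lengths, g))
```

A full analysis would repeat this calculation while adjusting the branch lengths, and over alternative topologies, to find the maximum of ln L.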
Bayes’ rule
Bayes’ theorem
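$f(\theta \mid X) = \dfrac{p(\theta)\, l(X \mid \theta)}{\int p(\theta)\, l(X \mid \theta)\, d\theta}$
where $f(\theta \mid X)$ is the posterior distribution, $p(\theta)$ is the prior distribution, $l(X \mid \theta)$ is the likelihood function and the integral in the denominator is the unconditional probability of the data
Pr [Tree/Data] = (Pr [Tree] x Pr [Data/Tree]) / Pr [Data]
[Figure: a prior probability distribution over trees A, B and C is combined with the data (observations) to give the posterior probability distribution over the same trees; the probability axes run to 1.0]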
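For a small, discrete set of trees the rule can be applied directly. The sketch below uses three trees with equal prior probabilities and hypothetical likelihood values. Real analyses cannot enumerate all trees, which is why sampling methods are used instead.

```python
def posterior(prior, likelihood):
    """Pr[Tree|Data] = Pr[Tree] x Pr[Data|Tree] / Pr[Data] for each tree."""
    joint = {tree: prior[tree] * likelihood[tree] for tree in prior}
    pr_data = sum(joint.values())  # unconditional probability of the data
    return {tree: value / pr_data for tree, value in joint.items()}

prior = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
likelihood = {"A": 0.002, "B": 0.006, "C": 0.002}  # hypothetical Pr[Data|Tree]
print(posterior(prior, likelihood))
```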
Markov Chain Monte Carlo (MCMC)
[Figure: the unnormalised posterior $p(\theta)\, l(X \mid \theta)$, shown as probability over parameter space]
Bootstrap
[Figure: a protein alignment and resampled versions of it, shown beside trees whose branches carry bootstrap support values (86, 50, 75, 90, 70, 65)]
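The resampling step can be sketched in Python as below: columns of the alignment are drawn with replacement until the replicate has the same length as the original. The alignment fragment and function name are hypothetical. Building a tree from each replicate and counting how often each group appears gives the support values.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]

alignment = ["ahhfhgkhka", "rhhfkgkhka", "ahdfhgkhka", "rhdfkgkhka"]  # hypothetical
rng = random.Random(42)
for row in bootstrap_replicate(alignment, rng):
    print(row)
```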
Applications of
phylogenetics:
Tracing the origin of a strain
Dating the introduction of a strain
Studying function
Evolutionary studies
Tracing the origin
[Figure: tree with tips labelled Europe, Asia, America and Europe]
Epidemiological data
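RNA viruses: high evolutionary rate
[Figure: dated tree with internal nodes at times t0 and t1, branches a, b, c and d, and samples isolated in 1926 and 1970]
(1926-t0)*v=a
(1970-t1)*v=c+d
...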
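Rearranging the first relation gives t0 = 1926 - a/v, so a node can be dated once the evolutionary rate v is known. The short sketch below assumes a strict clock, and the branch length and rate are hypothetical values chosen only to show the arithmetic.

```python
def ancestor_date(sample_year, branch_length, rate):
    """Solve (sample_year - t) * rate = branch_length for the node date t."""
    return sample_year - branch_length / rate

# Hypothetical values: a = 0.02 substitutions/site, v = 0.001 substitutions/site/year.
print(ancestor_date(1926, 0.02, 0.001))
```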
Function
A ...ahgfhgkhkafkdggggcatgcgayhhks...
B ...rfgfkgkhkaykdggggcatgcgayhhks...
Function 1
C ...ahdfhgkrkafkdggcccatgcgayhhks...
D ...ahdfhgkrkafkdglcccatgcgayhhks...
E ...ghdfhg-rkafkdhtcccatgcgayhhks...
Function 2
Ancestral states
PHYLIP
http://evolution.genetics.washington.edu/phylip.html
DNA
DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid sequences.
DNAMOVE. Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by parsimony and compatibility.
DNAPENNY. Finds all most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search.
DNACOMP. Estimates phylogenies from nucleic acid sequence data using the compatibility criterion.
DNAINVAR. For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants.
DNAML. Estimates phylogenies from nucleotide sequences by maximum likelihood.
DNAMLK. Same as DNAML but assumes a molecular clock.
DNADIST. Computes four different distances between species from nucleic acid sequences.
Proteins
PROTPARS. Estimates phylogenies from protein sequences using the parsimony method.
PROTDIST. Computes a distance measure for protein sequences.
Restriction
RESTML. Estimation of phylogenies by maximum likelihood using restriction sites data.
Continuous
CONTML. Estimates phylogenies from gene frequency data by maximum likelihood.
GENDIST. Computes one of three different genetic distance formulas from gene frequency data.
Discrete characters
MIX. Wagner parsimony method and Camin-Sokal parsimony method.
MOVE. Interactive construction of phylogenies from discrete character data. Evaluates parsimony and compatibility criteria.
PENNY. Finds all most parsimonious phylogenies.
DOLLOP. Estimates phylogenies by the Dollo or polymorphism parsimony criteria.
DOLMOVE. Interactive DOLLOP.
DOLPENNY. Branch-and-bound method.
CLIQUE. Finds the largest clique of mutually compatible characters.
.....
SEQBOOT. Reads in a data set, and produces multiple data sets from it by bootstrap resampling.
FITCH. Estimates phylogenies from distance matrix data under the "additive tree model".
KITSCH. Estimates phylogenies from distance matrix data under the "ultrametric" model.
NEIGHBOR. An implementation of Saitou and Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage clustering) method.
CONSENSE. Computes consensus trees by the majority-rule consensus tree method.
... thanks !!!!