Christian M Zmasek, PhD
[email protected]
15 June 2010
1. Why perform phylogenetic inference?
2. Theoretical background
3. Methods
4. Software & Examples
(C) 2010 Christian M. Zmasek
‘Tree of life’: The relationships amongst
different species
To infer the functions of proteins from family
members in model organisms, or to refine
existing annotations through phylogenetic
analysis
A method to organize/cluster sequences
with biological justification
[Figure: gene tree of sequences from RAT, MOUSE, HUMAN, RICE, LIZARD and SHARK, with labels X, Y and Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication]
[Figure: gene tree of sequences from HUMAN, WHEAT, RAT and BARLEY, with labels Y and Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication]
A phylogeny is the evolutionary history of a species or a
group of species. Lately, the term is also being applied to the
evolutionary history of individual DNA or protein
sequences.
The evolutionary history of organisms or sequences can be
illustrated using a tree-like diagram – a phylogenetic tree.
Initially, phylogenetic trees were built based on the
morphology of organisms.
Around 1960, molecular sequences were recognized
as containing phylogenetic information and hence
as valuable for tree building.
A tree built based on sequence data is called a gene
tree since it is a representation of the evolutionary
history of genes
A tree illustrating the evolutionary history of
organisms is called a species tree
Homologs are defined as sequences which share a common
ancestor (Fitch, 1966)
This definition becomes unclear if mosaic proteins, which
are composed of structural units originating from different
genes, are considered
Phylogenetic trees make sense only if constructed based on
homologous sequences (whole genes/proteins, or domains)
Homologous sequences can be divided into orthologs,
paralogs and xenologs:
Orthologs: diverged by a speciation event (their last
common ancestor on a phylogenetic tree corresponds to a
speciation event)
IMPORTANT: Functional similarity does not imply orthology
Paralogs: diverged by a duplication event (their last
common ancestor corresponds to a duplication)
Xenologs: related to each other by horizontal gene
transfer (via retroviruses, for example)
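The definitions of orthologs and paralogs above can be sketched in a few lines of Python. This is a minimal illustration, assuming a gene tree whose internal nodes are already annotated as "speciation" or "duplication". The nested-tuple tree encoding, the function names and the leaf labels (Y_HUMAN, Z_RAT, ...) are hypothetical. Xenologs are not handled.

```python
# A gene tree as nested tuples: an internal node is (event, left, right),
# a leaf is a plain string. All labels here are hypothetical.

def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)

def lca_event(node, a, b):
    """Return the event at the last common ancestor of the distinct leaves a and b."""
    event, left, right = node
    for child in (left, right):
        # Descend while one child still contains both leaves.
        if not isinstance(child, str) and {a, b} <= leaves(child):
            return lca_event(child, a, b)
    return event

def relationship(tree, a, b):
    """Classify two sequences by the event at their last common ancestor."""
    return "orthologs" if lca_event(tree, a, b) == "speciation" else "paralogs"

# An ancient duplication produced the Y and Z lineages; each lineage then
# split when human and rat diverged.
gene_tree = (
    "duplication",
    ("speciation", "Y_HUMAN", "Y_RAT"),
    ("speciation", "Z_HUMAN", "Z_RAT"),
)

print(relationship(gene_tree, "Y_HUMAN", "Y_RAT"))  # orthologs
print(relationship(gene_tree, "Y_HUMAN", "Z_RAT"))  # paralogs
```

Note that the classification depends only on the annotated event at the last common ancestor, not on how similar the two sequences are.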
Orthologous sequences tend to have more similar
“functions” than paralogs
Yet: Orthologs are mathematically defined,
whereas there is no definition of sequence
“function” (i.e. it is a subjective term)
New genes evolve if mutations accumulate while
selective constraints are relaxed by gene duplication
First recognized by Haldane (“… it [mutation
pressure] will favour polyploids, and particularly
allopolyploids, which possess several pairs of sets of
genes, so that one gene may be altered without
disadvantage…”)
[Figure: trees labeled S, G1 and G2, each with the leaves Wheat, Rat and Human]
Multiple sequence alignment of homologous sequences, followed by one of:
•Pairwise distance calculation, then algorithmic methods based on pairwise distances: UPGMA, Neighbor Joining
•Pairwise distance calculation, then optimality criteria based on pairwise distances: Fitch-Margoliash, Minimal Evolution
•Optimality criteria based on character data: Maximum Parsimony, Maximum Likelihood
•Bayesian Methods (MCMC)
In general, methods toward the start of this list are fast and methods toward the end are “more accurate”.
The simplest method to measure the distance
between two amino acid sequences is by their
fractional dissimilarity p, where n_d is the number of
aligned sequence positions containing non-identical
amino acids and n_s is the number of aligned
sequence positions containing identical amino
acids:
$$p = \frac{n_d}{n_d + n_s}$$
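The fractional dissimilarity can be computed directly from two aligned sequences. The following is a minimal sketch. The function name and the choice to skip columns containing a gap character are assumptions made for this illustration.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fractional dissimilarity p = n_d / (n_d + n_s) of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_d = n_s = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are ignored
        if a == b:
            n_s += 1  # identical amino acids
        else:
            n_d += 1  # non-identical amino acids
    if n_d + n_s == 0:
        raise ValueError("no comparable positions")
    return n_d / (n_d + n_s)

# One difference in six aligned positions gives p = 1/6.
print(p_distance("ARNDCQ", "VRNDCQ"))
```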
Unfortunately, this measure is unrealistic because it does not
take into account:
superimposed changes: multiple mutations at
the same sequence location
different chemical properties of amino acids: for
example, changing leucine into isoleucine is more
likely and should be weighted less than changing
leucine into proline
A more realistic approach for estimating
evolutionary distances is to apply maximum
likelihood to empirical amino acid replacement
models, such as PAM transition probability
matrices.
The likelihood L_H of a hypothesis H (an evolutionary
distance, for example) given some data D (an
alignment, for example) is the probability of D given
H: $L_H = P(D \mid H)$
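As a sketch of what maximizing L_H over candidate distances looks like, the code below uses a simple Poisson replacement model with 20 equally frequent amino acids as a stand-in for an empirical PAM model, which is an assumption made to keep the example short. The hypothesis H is a distance x = ut, and the data D are summarized by the counts of identical and non-identical aligned positions. The function names, the grid search and the example counts are hypothetical.

```python
import math

def log_likelihood(x: float, n_s: int, n_d: int) -> float:
    """log P(D | H) for the hypothesis H 'the distance is x' (x = u*t)."""
    decay = math.exp(-x)
    p_same = 1 / 20 + 19 / 20 * decay  # residue is unchanged after distance x
    p_diff = 1 / 20 - 1 / 20 * decay   # residue became one specific other residue
    # Each aligned position contributes (frequency 1/20) * (replacement probability).
    return n_s * math.log(p_same / 20) + n_d * math.log(p_diff / 20)

def ml_distance(n_s: int, n_d: int, upper: float = 5.0, steps: int = 20000) -> float:
    """Grid search for the distance that maximizes the likelihood."""
    grid = (upper * i / steps for i in range(1, steps + 1))
    return max(grid, key=lambda x: log_likelihood(x, n_s, n_d))

def closed_form(n_s: int, n_d: int) -> float:
    """Analytical maximum for this particular model, valid for p < 19/20."""
    p = n_d / (n_d + n_s)
    return -math.log(1 - 20 * p / 19)

# Hypothetical alignment: 80 identical and 20 non-identical positions.
print(ml_distance(80, 20), closed_form(80, 20))
```

With an empirical model the likelihood has no such closed form, so a numerical search like the one in ml_distance is the usual route.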
UPGMA stands for unweighted pair group
method using arithmetic averages.
It is a clustering method.
The algorithm produces rooted trees
under the assumption of a molecular clock.
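A compact sketch of UPGMA follows. It repeatedly merges the two closest clusters, places the new node at half their distance, and averages distances weighted by cluster size. The data layout (a dict keyed by frozenset pairs), the Newick output and the example distances are assumptions made for this illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """names: leaf labels; dist: dict mapping frozenset({a, b}) to a distance.
    Returns a rooted tree as a Newick string with branch lengths."""
    # cluster label -> (Newick string, number of leaves, height of the cluster)
    clusters = {n: (n, 1, 0.0) for n in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a, height_a), (tree_b, size_b, height_b) = clusters[a], clusters[b]
        height = d[frozenset((a, b))] / 2  # molecular clock: both sides equally deep
        merged = f"({tree_a}:{height - height_a:.3f},{tree_b}:{height - height_b:.3f})"
        key = f"{a}+{b}"
        del clusters[a], clusters[b]
        for c in clusters:
            # Arithmetic average over all leaf pairs, hence the size weights.
            d[frozenset((key, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[key] = (merged, size_a + size_b, height)
    return next(iter(clusters.values()))[0] + ";"

# Hypothetical clock-like distances: A and B are closest, C is the outgroup.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 4.0,
}
print(upgma(["A", "B", "C"], example))
```

Because every merge puts both children at the same depth, all leaves end up equidistant from the root, which is exactly the molecular clock assumption.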
As opposed to UPGMA, neighbor joining
(NJ) is not misled by the absence of a
molecular clock
NJ produces phylogenetic trees (not cluster
diagrams)
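The neighbor joining procedure can be sketched as follows. At each step it joins the pair that minimizes the criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other nodes. The data layout, the Newick output and the example distances are assumptions made for this illustration. NJ trees are unrooted, so placing the root in the middle of the last branch is an arbitrary choice.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """names: leaf labels; dist: dict mapping frozenset({a, b}) to a distance.
    Returns a Newick string; the root position is arbitrary."""
    nodes = {n: n for n in names}  # node label -> Newick string of its subtree
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from node i to every other current node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))  # branch from i to new node
        len_j = d_ij - len_i                              # branch from j to new node
        new = f"{i}+{j}"
        subtree = f"({nodes[i]}:{len_i:.3f},{nodes[j]}:{len_j:.3f})"
        del nodes[i], nodes[j]
        for k in nodes:
            d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        nodes[new] = subtree
    (a, tree_a), (b, tree_b) = nodes.items()
    half = d[frozenset((a, b))] / 2  # split the last branch to get a Newick root
    return f"({tree_a}:{half:.3f},{tree_b}:{half:.3f});"

# Hypothetical distances from a tree without a molecular clock:
# A and B are neighbors, but B sits on a long branch, so A is closest to C.
example = {
    frozenset(("A", "B")): 5.0, frozenset(("A", "C")): 4.0,
    frozenset(("A", "D")): 5.0, frozenset(("B", "C")): 7.0,
    frozenset(("B", "D")): 8.0, frozenset(("C", "D")): 5.0,
}
print(neighbor_joining(["A", "B", "C", "D"], example))
```

In this example the smallest raw distance is between A and C, so a method that simply joins the closest pair would group them, whereas the Q criterion recovers A with B and C with D.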
Fitch-Margoliash
Minimal evolution (ME)
Maximum Parsimony (MP)
Maximum Likelihood (ML)
Minimal evolution (ME): branch lengths are fitted to a tree according
to an unweighted least squares criterion, but
the optimality criterion to evaluate and
compare trees is to minimize the sum of all
branch lengths.
Evaluate a given
topology
Example:
Sequence1: TGC
Sequence2: TAC
Sequence3: AGG
Sequence4: AAG
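Assuming this example is meant to illustrate maximum parsimony, evaluating a given topology means counting the minimum number of changes it requires. The sketch below does this with Fitch's algorithm for the four sequences above. The nested-tuple tree encoding and the function names are assumptions made for this illustration.

```python
def fitch_site(node, states):
    """Return (possible ancestral states, minimum number of changes) for one column."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost            # no change needed here
    return left_set | right_set, left_cost + right_cost + 1  # one change needed

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

alignment = {
    "Sequence1": "TGC",
    "Sequence2": "TAC",
    "Sequence3": "AGG",
    "Sequence4": "AAG",
}

# The three possible topologies for four sequences.
topologies = {
    "((1,2),(3,4))": (("Sequence1", "Sequence2"), ("Sequence3", "Sequence4")),
    "((1,3),(2,4))": (("Sequence1", "Sequence3"), ("Sequence2", "Sequence4")),
    "((1,4),(2,3))": (("Sequence1", "Sequence4"), ("Sequence2", "Sequence3")),
}
for label, tree in topologies.items():
    print(label, parsimony_score(tree, alignment))  # 4, 5 and 6 changes
```

The topology grouping Sequence1 with Sequence2 needs the fewest changes (4), so it is the most parsimonious of the three.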
Probabilistic methods can be used to assign a likelihood to a
given tree and therefore allow the selection of the tree which
is most likely given the observed sequences.
Probability for one residue a to change to b in time t along a
branch of a tree: P(b|a,t)
Its actual calculation is dependent on what model for
sequence evolution is used.
Poisson process:
$P(b \mid a,t) = \frac{1}{20} + \frac{19}{20} e^{-ut}$ for a = b
$P(b \mid a,t) = \frac{1}{20} - \frac{1}{20} e^{-ut}$ for a ≠ b
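A short sketch of the Poisson model and of a likelihood calculation for the smallest possible tree, two leaves joined at a root. The a ≠ b case is written with a minus sign, 1/20 - (1/20)e^(-ut), so that the probabilities over all 20 residues b sum to 1. The function names, the uniform root frequencies of 1/20 and the example values are assumptions made for this illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def poisson_prob(a: str, b: str, ut: float) -> float:
    """P(b | a, t): probability that residue a is replaced by b along a branch."""
    decay = math.exp(-ut)
    if a == b:
        return 1 / 20 + 19 / 20 * decay
    return 1 / 20 - 1 / 20 * decay

def site_likelihood(x: str, y: str, ut1: float, ut2: float) -> float:
    """Likelihood of observing x and y at the two leaves of a rooted tree.
    The unknown root residue is summed out, each with frequency 1/20."""
    return sum(
        poisson_prob(a, x, ut1) * poisson_prob(a, y, ut2) / 20
        for a in AMINO_ACIDS
    )

# Each row of the transition matrix is a probability distribution.
print(sum(poisson_prob("L", b, 0.3) for b in AMINO_ACIDS))

# The likelihood of a whole alignment is the product over its columns.
print(site_likelihood("L", "I", 0.1, 0.2))
```

For this model the result depends only on the total branch length ut1 + ut2, not on where the root is placed between the two leaves.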
Example: MrBayes
Uses a Markov chain Monte Carlo (MCMC)
approach to sample tree space
To assess the reliability of trees
Resampling of alignment columns with replacement
(see the following example)
What is “good enough”? >60%? >90%?
Original sequence alignment:
Sequence 1: ARNDCQ
Sequence 2: VRNDCQ
Column:     123456
Bootstrap resample 1:
Sequence 1: RRQCCA
Sequence 2: RRQCCV
Column:     226551
Bootstrap resample 2:
Sequence 1: AQCDCQ
Sequence 2: VQCDCQ
Column:     165456
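The resampling step can be sketched in Python as follows. Columns are drawn with replacement, and the same draw is applied to every sequence so that the columns stay intact. The function names and the fixed random seed are assumptions made for this illustration.

```python
import random

def resample_columns(alignment, columns):
    """Build a pseudo-alignment from a list of 1-based column indices."""
    return {
        name: "".join(seq[c - 1] for c in columns)
        for name, seq in alignment.items()
    }

def bootstrap(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randint(1, length) for _ in range(length)]
    return columns, resample_columns(alignment, columns)

alignment = {"Sequence 1": "ARNDCQ", "Sequence 2": "VRNDCQ"}

# Reproduce the first resample shown above from its column indices.
print(resample_columns(alignment, [2, 2, 6, 5, 5, 1]))

# Draw a fresh random resample.
rng = random.Random(0)
print(bootstrap(alignment, rng))
```

In a full analysis, a tree is built from each resampled alignment and the support for a branch is the percentage of those trees that contain it.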
Mafft:
http://mafft.cbrc.jp/alignment/software/
Server: http://mafft.cbrc.jp/alignment/server/
T-Coffee:
http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html
Server: http://www.ch.embnet.org/software/TCoffee.html
Server: http://www.ebi.ac.uk/t-coffee/
ClustalW:
ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
Server: http://www.ebi.ac.uk/clustalw/
Probcons:
http://probcons.stanford.edu/
Server: http://probcons.stanford.edu
Muscle:
http://www.drive5.com/muscle/
Server: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
List of programs: http://evolution.genetics.washington.edu/phylip/software.html
ML pairwise distance calculation (protein):
TREE-PUZZLE: http://www.tree-puzzle.de/
Bootstrapping, pairwise distance calculation, UPGMA, NJ, Fitch-Margoliash, ME:
PHYLIP: http://evolution.genetics.washington.edu/phylip.html
ME:
FastME (server): http://atgc.lirmm.fr/fastme/
MEGA: http://www.megasoftware.net/
ML:
PhyML (server): http://www.atgc-montpellier.fr/phyml/
RAxML (server): http://phylobench.vital-it.ch/raxml-bb/
Bayesian (MCMC):
MrBayes: http://mrbayes.csit.fsu.edu/
Parsimony (esp. on Macintosh), display:
PAUP: http://paup.csit.fsu.edu/
Tree display:
Archaeopteryx: http://www.phylosoft.org/archaeopteryx/
Hypothesis testing:
HyPhy: http://www.hyphy.org/
Richard Durbin et al.: Biological Sequence Analysis: Probabilistic Models of Proteins and
Nucleic Acids [http://www.amazon.com/Biological-Sequence-Analysis-ProbabilisticProteins/dp/0521629713/sr=1-1/qid=1170198997/ref=sr_1_1/102-49552971236120?ie=UTF8&s=books]
Joe Felsenstein: Inferring Phylogenies [http://www.amazon.com/Inferring-PhylogeniesJoseph-Felsenstein/dp/0878931775/sr=8-1/qid=1170198215/ref=pd_bbs_sr_1/102-49552971236120?ie=UTF8&s=books]
Ziheng Yang: Computational Molecular Evolution
[http://www.amazon.com/Computational-Molecular-Evolution-OxfordEcology/dp/0198567022/sr=1-1/qid=1170198731/ref=pd_bbs_sr_1/102-49552971236120?ie=UTF8&s=books]
Olivier Gascuel: Mathematics of Evolution & Phylogeny
[http://www.amazon.com/Mathematics-Evolution-Phylogeny-OlivierGascuel/dp/0198566107/sr=1-1/qid=1170198842/ref=sr_1_1/102-49552971236120?ie=UTF8&s=books]
Download and install MrBayes: http://mrbayes.csit.fsu.edu/
Read the tutorial: http://mrbayes.csit.fsu.edu/wiki/index.php/Tutorial
Analyze the provided data set (“primates.nex”)
Download and install PHYLIP:
http://evolution.genetics.washington.edu/phylip.html
Perform seqboot (100x) – dnadist – neighbor (NJ) – consense on
“primates.nex” (you need to change the format accordingly)
Compare the results (MrBayes vs. PHYLIP NJ)
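For the format change in the PHYLIP step, the target layout is simple: a first line with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters. The sketch below writes that layout from sequences already held in memory. It does not parse NEXUS, and the function name and the example sequences are hypothetical, not taken from “primates.nex”.

```python
def to_phylip(alignment):
    """Format a dict of name -> aligned sequence as sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        # Names occupy exactly 10 characters: truncated or padded with spaces.
        lines.append(name[:10].ljust(10) + seq)
    return "\n".join(lines) + "\n"

# Hypothetical sequences for illustration only.
print(to_phylip({"Human": "ACGT", "Chimp": "ACGA"}))
```

Names longer than 10 characters are truncated here, so they need to stay unique in their first 10 characters.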