Phylogenetic Trees

Download Report

Transcript Phylogenetic Trees

Phylogenetic Trees
Çiğdem EROL
[email protected]
Phylogenetic Trees
“Evolutionary Trees from DNA Sequences:
A Maximum Likelihood Approach”
Joseph Felsenstein
Journal of Molecular Evolution
1981
17: 368-376
The Term of Phylogenetic?
• Phylogenetic is of Greek origin
• From the terms phyle/phylon (φυλή/φῦλον),
meaning "tribe, race,"
• From genesis (γένεσις); genetikos (γενετικός),
meaning "relative to birth"
What is the Phylogenetic?
Study of evolutionary relatedness
among various groups of organisms
(for example, species or
populations), which is discovered
through molecular sequencing data
and morphological data matrices.
Simple
Tree
A
B
C
D
Taxonomists used anatomy and physiology
to group and classify organisms
- Morphological features like presence of
feathers or number of legs
When
protein
sequencing,
and
later
DNA
Charles Darwin
sequencing
On the
origin of species. became common, amino acid
and1859
DNA sequences became the common
Chapter IV - Natural
way
to construct trees
Selection
The first to recognize that the
systematic hierarchy represented a
rough approximation of
evolutionary history.
What is Molecular Phylogenetics
Phylogenetics is the study of evolutionary relationships
Example - relationship among species
primates
rodents
crocodiles
crocodiles
birds
lizards
birds
snakes
marsupials
snakes
rodents
primates
lizards
marsupials
A Brief History of Molecular Phlogenetics
1900s
Immunochemical studies: cross-reactions stronger for closely related
organisms
Nuttall (1902) - apes are closest relatives to humans
1960s - 1970s
Protein sequencing methods, electrophoresis, DNA hybridization and PCR
contributed to a boom in molecular phylogeny
late 1970s to present
Discoveries using molecular phylogeny:
- Endosymbiosis - Margulis, 1978
- Divergence of phyla and kingdom - Woese, 1987
- Many Tree of Life projects completed or underway
Applications of Phylogenetic Trees
• Evolution studies
– Relationship of sequences to one another
– Dissect the order of appearance of insertions,
deletions, and mutations
•
•
•
•
Systematic biology
Medical research and epidemiology
Ecology
Others like grouping languages,
medieval manuscripts …
Phylogenetic Trees
• Trees are composed of nodes and branches
– Terminal or leaf nodes correspond to a gene or
organism for which data has been collected
– Internal nodes usually represent an inferred
common ancestor that gave rise to two
independent lineages sometime in the past
Internal
External or Peripheral
Branch
10
11
Rooted and Unrooted Trees
• Some trees make an inference about a
common ancestor and the direction of
evolution and some don’t
– First type is called a rooted tree and has a single
node designated as root which is the common
ancestor
– Second type is called an unrooted tree
• Specifies only relationship between nodes and says
nothing about direction of evolution
Rooted and Unrooted Trees
R
B
C
Time
E
A
A
B
C
D
E
D
Tree as Text
Structure of a phylogenetic tree can be represented in
Newick format using nested parentheses (((B, C), (D, E)), A)
Tree Types
Dendogram: shows the diagrammatic
representation of a phylogenetic tree
Cladogram: shows the branching order of
nodes
Phylogram: shows branching order and
distances
Chronogram shows evolutionary time
through its branch lengths
Dendogram
Cladogram
Phylogram
Chronogram
.035 A
.012
B
.009
.057 C
.016
.044
D
15
3 OTUs
Roots can usually be assigned to unrooted trees using an outgroup
For three species there are 3 possible rooted trees, but only one
possible unrooted tree
1 unrooted tree = 3 rooted trees
16
4 OTUs
3 unrooted trees
= 15 rooted trees
17
The number of possible bifurcating rooted
trees (NR) for n  2 OTUs
(2n  3)!
NR = n  2
2
(n  2)!
(2n  5)!
NU = n  3
2
(n  3)!
The number of possible bifurcating
unrooted trees (NU) for n  3 OTUs
18

Number of OTUs
Number of possible rooted tree

2
1
3
3
4
15
5
105
6
954
7
10,395
8
135,135
9
2,027,025
10
34,459,425
15
213,458,046,676,875
20
8,200,794,532,637,891,559,375

19
Evolution is an historical process.
Only one historical narrative is true.
From 8,200,794,532,637,891,559,375
possibilities, 1 possibility is true and
8,200,794,532,637,891,559,374 are false.
Truth is one,
falsehoods are many.
20
How do we know which of the
8,200,794,532,637,891,559,375
trees is true?
21
We don’t, we infer
by using decision
criteria.
22
Two Major Approaches to Phylogeny Inference
1) Character Based Methods
2) Distance Matrix Methods
A character
provides
information
about an
individual OTU.
A distance
represents a
quantitative
statement
concerning the
dissimilarity
between two
OTUs.
24
A character is a well-defined
feature that in a taxonomic unit
can assume one out of two or
more mutually exclusive
character states.
Mutually exclusive: If David is tall, David cannot be short.
25
26
A character is unordered if a change
from one character state to any other
character state can occur in one step.
27
A character is ordered if there exists a unique
symmetrical path of change from one
character state to another.
28
A character is polar if there exists a unique
asymmetrical (irreversible) path of change
from one character state to another.
POLAR
29
Amino-acid sites are partially ordered characters.
An amino acid cannot change into all other amino
acids in a single step, as sometimes 2 or 3 steps are
required.
a tyrosine may only change
into a leucine through an
intermediate state, i.e.,
phenylalanine or histidine.
30
31
Distance Data
33
Most molecular data yield character
states that are subsequently
converted into distances.
34
Some molecular data can only be
expressed as distances.
35
Types of Data
• Two categories
– Numerical data
• Distance between objects
• E.g.evolutionary distance between two species
• Usually derived from sequence data
– Character data
• Each character has a finite number of states
• E.g. number or legs = 1, 2, 4
• DNA = {A, C, T, G}
Four stages of phylogenetic analysis
Molecular phylogenetic analysis may be described in
four stages:
[1] Selection of sequences for analysis
[2] Multiple sequence alignment
[3] Tree building
[4] Tree evaluation
Step1: Selection of Sequences for Analysis
• Choose homologous sequences in different
species
• Homologous sequences must, by definition,
be derived from a common ancestral
sequence
• Homology is not similarity
Step 2: Multiple Sequence Alignment
The fundamental basis of a phylogenetic tree is
a multiple sequence alignment.
(If there is a misalignment, or if a nonhomologous
sequence is included in the alignment, it will still
be possible to generate a tree.)
Multiple Sequence Alignment
• Creating optimal alignment of many sequences
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWESNG--
• Why do multiple alignments?
– Finding motifs or conserved domains
– Prediction of secondary structure of proteins
– Alignment of a family of sequences may provide more
information than a pair-wise alignment of any two of those
sequences
40
Step 3: Tree building
• Determine the evolutionary distances and
build distance matrix
• For molecular data, evolutionary distances can
be the observed number of nucleotide
differences between the pairs of species.
• Choose your Method
Choosing Method
Common Distance Methods Common Sequence Methods:
• UPGMA / WPGMA
• Maximum Parsimony (MP)
• Neighbor Joining (NJ)
• Maximum Likelihood (ML)
• Least Square (LS)
• Bayesian Inference
• Minimal Evolution (ME)
UPGMA
A
B
D
C
E
A
B
C
E
D
Unweighted Pair Group Method using Arithmetic mean
UPGMA
• So UPGMA is very simple and generates
rooted trees, however…
• Major weakness is that the algorithm assumes
that rates of evolution are the same among
different lineages
• This does not fit existing biological data, so
probably shouldn’t use UPGMA to build
phylogenetic trees
Neighbor-Joining Methods
• Start with all species arranged in a star tree
b
a
c
c
a
d
d
e
b
e
Combine nodes until you find a combination
that gives the smallest sum of branch
lengths.
Neighbor-Joining Methods
• The pair of nodes pulled out (grouped) at each
iteration are chosen so that the total length of the
branches on the tree is minimized
• After a pair of nodes is pulled out, it forms a cluster
in the tree and is included in further rounds of
iteration (and a new distance matrix is generated)
Sequence methods
Most common:
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)
• Bayesian Inference
Maximum Parsimony (MP)
•
•
•
Originally developed for morphological characters
Henning, 1966
William of Ockham: the best hypothesis is the one
that requires the smallest number of assumptions
Maximum Parsimony (MP)
•
Principle:
– Estimate the minimum number of substitutions for a given
topology
– Parsimony-informative sites (exclude invariable sites and
singletons)
– Searching MP trees
•
•
Exhaustive search
Heuristic search
–
–
Result tree might not be the most parsimonious tree
Result
• Multiple result trees are possible (strict consensus tree,
majority-rule consensus tree)
• Most parsimonious tree vs true tree
• Unrooted result trees
Maximum parsimony: exhaustive stepwise addition
B
C
Step 1
A
D
B
D
B
C
C
B
C
D
Step 2
A
E
B
D
C
A
B
E
D
C
A
A
B
D E
A
C
A
Step 3
…………………
Maximum Parsimony (MP)
•
Advantages
–
•
Free from assumptions (model-free)
Disadvantages
–
–
Does not take into account homoplasy
Long-branch attraction (LBA): creates wrong
topologies, if the substitution rate varies extensively
between lineages
Maximum Likelihood (ML)
•
•
•
•
Cavalli-Sforza, Edwards (1967), gene frequency data
Felsenstein (1981), nucleotide sequences
Kishino (1990), proteins
Principle
–
Maximizes the likelihood of observing the sequence data for a specific
model of character state changes
Likelihood of a site = Sum of probabilities of every possible
reconstruction of ancestral states at the internal nodes
Likelihood of the tree = Product of the likelihoods for all sites (=sum of
log likelihoods)
Result = tree with the highest likelihood
–
–
–
•
•
Maximized to estimate branch lengths, not topologies
Search strategies: rarely exhaustive, mostly heuristic
•
•
•
NNI (Nearest neighbor interchanges)
TBR (Tree bisection-reconnection)
SPR (Subtree pruning and regrafting)
Birth Move (insert node)
T1
T2
Death Move (delete node)
T2
T1
Maximum Likelihood
• Creates all possible trees like Maximum Parsimony method but
instead of retaining trees with shortest evolutionary steps……
• Employs a model of evolution whereby different rates of
transition/transversion ration can be used
• Each tree generated is calculated for the probability that it reflects
each position of the sequence data.
• Calculation is repeated for all nucleotide sites
• Finally, the tree with the best probability is shown as the maximum
likelihood tree - usually only a single tree remains
• It is a more realistic tree estimation because it does not assume
equal transition-transversion ratio for all branches.
Which method to use?
• Distance based
– fast
• Maximum Parsimony
– Strong sequence similarity
• Maximum Likelihood
– Very slow
– Use only for small number of sequences
• Most packages (e.g. Phylip, MEGA)
use software for all three methods.
Comparison of Methods
Distance
Maximum
parsimony
Maximum likelihood
Uses only pairwise
distances
Uses only shared
derived characters
Uses all data
Minimizes distance
between nearest
neighbors
Minimizes total
distance
Maximizes tree likelihood
given specific parameter
values
Very fast
Slow
Very slow
Easily trapped in local
optima
Assumptions fail
when evolution is
rapid
Highly dependent on
assumed evolution
model
Good for generating
tentative tree, or
choosing among
multiple trees
Best option when
tractable (<30 taxa,
homoplasy rare)
Good for very small data
sets and for testing trees
built using other methods
Summary of all approaches
• UPGMA assumes molecular clock, so provides a
rooted tree (maybe too strong an assumption)
• Neighbor-joining is good when evolutionary rates
vary. Proven to construct the correct tree.
• Parsimony is good for closely related sequences
• Likelihood method is the most general of all
How confident are we about the inferred phylogeny?
?
rat
human
?
turtle
?
fruit fly
?
oak
duckweed
Step 4: Evaluation Tree (Methods)
• Bootstrap: resample initial data set with one
datum removed and replaced with another
member
• Jackknife: resample initial distribution with one
datum missing and not replaced
• MCMC(Markov chain Monte Carlo ): complex,
but generates random numbers to produce a
desired probability distribution with
which to compare model
The Bootstrap
• Computational method to estimate the confidence level of a certain
phylogenetic tree.
Pseudo sample 1
001122234556667
rat
GGAAGGGGCTTTTTA
human
GGTTGGGGCTTTTTA
turtle
GGTTGGGCCCCTTTA
fruitfly CCTTCCCGCCCTTTT
oak
AATTCCCGCTTCCCT
duckweed AATTCCCCCTTCCCC
Sample
0123456789
rat
human
turtle
fruitfly
oak
duckweed
GAGGCTTATC
GTGGCTTATC
GTGCCCTATG
CTCGCCTTTG
ATCGCTCTTG
ATCCCTCCGG
Pseudo sample 2
rat
human
turtle
fruit fly
oak
duckweed
Inferred tree
445556777888899
rat
CCTTTTAAATTTTCC
human
CCTTTTAAATTTTCC
turtle
CCCCCTAAATTTTGG
fruitfly CCCCCTTTTTTTTGG
oak
CCTTTCTTTTTTTGG
duckweed CCTTTCCCCGGGGGG
Many more replicates
(between 100 - 1000)
Bootstrap values
100
65
rat
human
turtle
0
fruit fly
55
oak
duckweed
• Values are in percentages
• Conventional practice: only values 60-100% are shown
Summary Review
• Relationships Among Groups Of Genes
– Comparing Sequences
– Building Sequence Families
• Sequence Conservation During Evolution
– Aligning Multiple Sequences
• Evolutionary Diagrams
– Tracing The Descent From Common Ancestors
– Growing Phylogenetic Trees
WIBR Bioinformatics, © Whitehead
Institute 2004
63
Easy Tree
• www.ebi.ac.uk/clustalw/
• Paste your alignment
• Select a tree type
• Other options need to be set
(see right)
• Press run
• Make a screen shot
• You can paste it where needed
Phylip (More elaborate tree)
• http://bioweb.pasteur.fr/seqanal/phylogeny/
phylip-uk.html
• Choose protdist from the page
• Paste the MSA
• Bootstrapping e.g.:
Phylip
• Run the query
• Click further analysis
Click Run
Select full screen view
There is your tree
Ugly Tree
• Let’s face it the tree is quite ugly
• http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
• Select the consense.outtree from the previous website and paste
it into the box
• Select submit to create the tree
• Play around with the formats and settings
References
• Felsenstein, J., 1981. “Evolutionary Trees from DNA Sequences: A
Maximum Likelihood Approach”, Journal of Molecular Evolution, 17: 368376.
• Phylogenetics – Wikipedia
• Systematics and Molecular Phylogenetics - NCBI
• Phylogenetic Tree – Wikipedia
• Latek, R., “Practical Bioinformatics Tools For Understanding Evolution”
• Graur, D., “Molecular Phylogenetics”
• “Phylogenetics - Distance-Based Methods”
• Coates, M., Nowak, R. and Castro, R., “Maximum Likelihood Network
Topology Identification”
• Han Chuan Ong, “Phylogenetics”
• Applied Bioinformatics
• Gilbert, D., 2008. “Phylogenetic Trees”
Thank You
Any
Question ?