Transcript MP1
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Week6: Intro to Phylogenetic Reconstruction
& Distance Based Methods
• introduction to phylogenies
• distance based methods
• phylogeny exercises
Phylogeny Objectives
1 - understand the essence of phylogenies (definition of terms)
2 - understand distance based methods of phylogenetic reconstruction
3 - should be able to use various software packages to reconstruct and
view phylogenies:
ClustalX, MEGA, DAMBE, Treeview
Lecture #6
Page 1
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Molecular phylogeny
study of relationships among organisms (molecular systematics),
proteins or genes using molecular biology techniques
• Darwin - thesis #1 - organisms
descend with modification from
common ancestors (CA)
• relationships among organisms, proteins,
genes are illustrated by a phylogenetic tree
internal node - common ancestor (CA)
external node - operational taxonomic unit (OTU)
order of branches define the relationships (topology)
branch length defines the number of changes
Lecture #6
Page 2
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
A brief history of molecular phylogeny
• phylogenetic inference is old (for Biology)
Charles Darwin – Orgin of Species (1859)
Illustration of ‘descent with modification’
Ernst Haeckel “Tree of life” (1891)
Lecture #6
Page 3
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
A brief history of molecular phylogeny
more modern developments …
Molecular phylogeny
• Nuttall (1904) found that the strength of serological cross reactions was correlated
with the level of relatedness between animals - applied to primate phylogeny
• starting 1950’s many more sources of molecular information become avaiable:
e.g. amino acid sequences, allozyme frequencies, DNA hybridization
• these data stimulated the development of quantitative ‘numerical taxonomy’
techniques for phylogenetic analysis
Algorithmic approaches
• first numerical approach to phylogeny based on phenetic approach – i.e. similarity
of morphological characters (Michener and Sokal 1957)
• phylogenetic studies of human populations based on blood allele frequencies
led to the introduction of distance, parsimony & likelihood methods
(Edwards & Cavalli-Sforza 1963, 1964)
Lecture #6
Page 4
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
A brief history of molecular phylogeny …
emergence of sequence based methods:
accumulation of amino acid sequences stimulates development of sequence
based phylogenetic methods
these soon emerge as the most powerful methods (see slide #6 for reasons)
Parsimony, Distance & Maximum likelihood methods (see slide # 10)
Eck and Dayhoff (1966) – working of Atlas of Protein Sequence and Structure –
publish first method for phylogenetic analysis of sequences based on parsimony
Fitch and Margoliash (1967) publish first distance based method – weighted least
squares – for sequence based (cytochrome c) phylogenetic inference
Statistician Neyman (1971) publishes first maximum likelihood method for
phylogenetic analysis of sequence data
Lecture #6
Page 5
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Benefits of using molecular sequences for phylogenetics
1 - sequences evolve in a much more regular manner than morphological characters
2 - less prone to confusion between homology and analogy, homoplasies
3 - vast abundance of characters to analyze
4 - molecular data more amenable to quantitative treatments
5 - molecular data ubiquitous - can be used for microorganisms
6 - can be used to study relationships at many different evolutionary levels
faster evolving genes - mitochondrial DNA - closely related species
slower evolving genes - ribosomal RNA genes - distantly related species
Some success stories ….
• primate evolution - who are humans closest relatives ?
• origin of Cetacea mammals (whales, dolphins, porpoises)
• revising deep taxonomic classification scheme - 3 domains of life
Lecture #6
Page 6
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Unrooted versus rooted phylogenies
R
time
unrooted
only specifies relationships
not the evolutionary path
rooted
root (R) is common ancestor of all OTUs
path from root to OTUs specifies time
knowledge of outgroup required to define root
Lecture #6
Page 7
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Different visual representations of trees
rectangular cladogram
slanted cladogram
branch lengths not proportional to distance
phylogram
branch lengths
proportional to
distance
Lecture #6
Page 8
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Species tree versus gene tree
species tree - represents evolutionary relationships among species
gene tree - represents evolutionary relationship among genes
species trees and genes trees can (and often do!) differ
Reasons for this ??
• comparison of orthologous versus paralogous genes
• horizontal (later) transfer of genes
more on these important
concepts later in course
the concept of an accurate species tree is notoriously difficult to pin down
in this class we will deal almost exclusively with genes trees
Lecture #6
Page 9
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Methods of phylogenetic reconstruction
Distance based
• pairwise evolutionary distances computed for all taxa
• tree constructed using algorithm based on relationships between distances
Maximum parsimony
• nucleotides or amino acids are considered as character states
• best phylogeny is chosen as the one that minimizes the number of changes
between character states
Maximum likelihood
• statistical method of phylogeny reconstruction
• explicit model for how data set generated - nucleotide or amino acid substitution
• find topology that maximizes the probability of the data given the model and the
parameter values (estimated from data)
Lecture #6
Page 10
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Phylogenetic inference
1 – sequences change as they evolve from a common ancestor over time
2 – a group of related sequences retains information (incomplete) about the
evolutionary history that unites them – based on the pattern of changes
3 – phylogeny is estimation, make the best estimate about evolutionary history
given the incomplete information in the sequences being analyzed
4 – information about the past is not available, only extant sequences
5 – therefore any evolutionary scenario (i.e. phylogeny) can be postulated
to explain the changes in the sequences being analyzed
6 – must have some way to discriminate among the (many!) possible phylogenies
Lecture #6
Page 11
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Number of OTUs and number of possible trees
n
# OTUs (n)
2
3
4
5
6
7
8
9
10
# unrooted
trees
(2i-5)
i=3
1
1
3
15
105
954
10,395
135,135
2,027,025
n
# rooted
trees
(2i-5)
(2n-3)
i=3
1
3
15
105
954
10,395
135,135
2,027,025
34,459,425
true tree - true evolutionary history is one of many possibilities
difficult to infer true tree when # OTUs is large
inferred tree - obtained using data and reconstruction method
not necessarily the same as the true tree - a hypothesis
Lecture #6
Page 12
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Alogrithms & Optimality Criteria
Two ways for selecting preferred phylogenies:
Algorithms – sequence of steps that leads to the selection of a phylogeny
- combine phylogeny inference and criterion definition into single step
- move directly to toward the best tree without evaluating many different trees
e.g. UPGMA & Neighbor-joining
Optimality criteria – a criteria is defined whereby different phylogenies are
- compared to one another to determine which is better
- two steps involved: 1 – define criteria (objective function) 2 – use algorithm to compute
objective function on different trees
- this method is much slower – must evaluate many trees (shorcuts often necessary)
- may be more robust because scores are assigned to every phylogeny and then they are
ranked – yields information about how well specified the tree is
e.g. Least squares & Minimum evolution
Compromise – define starting tree with algorithm approach and then search nearby
tree-space using optimality criteria approach
Lecture #6
Page 13
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Least squares method
First distance based method developed; Cavalli-Sforza & Edwards (1967)
Fitch & Margoliash (1967)
Optimality criterion = minimize the residual sum of squares (RS) between the
observed distances (dij - based on distance matrix) and the patristic differences
(eij – based on the branch lengths of the inferred phylogeny)
(d
RS =
2
ij – eij)
i<j
A
B
C
D
A B C D
0 17 21 27
17 0 12 18
21 12 0 14
27 18 14 0
e.g.
dBD = 18
eBD = 6 + 2 + 8 = 16
RS-BD = (18 – 16)2 = 4
Lecture #6
Page 14
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Least squares method …
Statisitcally very sound method since based on Least squares
Logically challenged since it formally estimates branch lengths and not topologies
In principle RS is computed for all possible topologies but in practice this
quickly becomes impracticable (see slide #12) – short cuts are available
to minimize search space (see lecture week7)
Fitch & Margoliash (1967) introduced weighted least squares that corrects for the
bias introduced by long distances
[(d
RS =
ij
– eij)2 / dij]
i<j
Negative branch length estimates can confound method – constraint of
non-negative branch lengths results in substantial improvement
Lecture #6
Page 15
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Minimum evolution method
Optimality criterion = choose the phylogeny that gives the smallest value of
S - the sum of all branch lengths
T
S=
i
S = 35.6
Lecture #6
bi
where T = total # branches
bi = branch length i estimate
S = 35.0
Page 16
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Minimum evolution method …
As with least squares, S should ideally be computed for all possible trees but
this is impossible with many taxa
One shortcut is to start search with neighbor-joining tree and then evaluate
closely related trees to find the best one
Close neighbor interchange (CNI) start with temporary ME tree (e.g. NJ tree for
first step) and evaluate all trees that differ by one or two topological changes
This approach may be more robust than using neighbor-joining alone because
it can result in an ordered list of trees, if many trees represent the data almost
equally well then the best tree may not be so well supported
Lecture #6
Page 17
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
UPGMA method
(unweighted pair group method with arithmetic mean)
simplest method - uses sequential clustering algorithm
results in ‘ultrameric’ trees – equal distances from root to all tips
based on assumption of strict rate constancy among lineages – this is often
violated and so method often gives erroneous trees (not reccomended)
step 1
A
Distance matrix
B
B dAB
C dAC dBC
step 2
(AB)
C d(AB)C
d(AB)C = (dAC + dAB) / 2
A
A
Tree
B
dAB / 2
Lecture #6
B
C
d(AB)C / 2
Page 18
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
UPGMA example
step 1
A
Distance matrix
B 4
C 5
step 2
B
7
C
(AB)
6
2
2
Tree
2
dAB / 2
Lecture #6
1
2
A
B
3
d(AB)C = (dAC + dAB) / 2
A
B
C
d(AB)C / 2
Page 19
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Neighbor-joining method
uses ‘star decomposition’ – identification of neighbors
that sequentially minimize the total length of the tree
1 - start with star tree - no topology
S = total branch length of tree
2 - separate pair of OTUs from all others
S12 = total branch length of tree
3 - choose pair of OTUs that minimizes total branch lengths in the tree
4 - this pair collapsed as single OTU and distance matrix recalculated
5 - next pair of OTUs that gives smallest branch length is chosen
6 - iterate until complete
Lecture #6
Page 20
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Neighbor-joining example
Lecture #6
Page 21
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Neighbor-joining method …
Extremely fast and efficient method, widely used & found in numerous publications
Tends to perform fairly well in simulation studies
May produce tie trees from data set but this appears to be rare
Algorithm is ‘greedy’ and so can get stuck in local optima
Main criticism is that it produces only one tree and does not give any idea
of how many other trees are equally well or almost as supported by the data
For this reason, neighbor-joining is often used as a method to find a starting
tree that other methods (e.g. minimum evolution) will evaluate to find the best tree
Lecture #6
Page 22
Johns Hopkins University - Fall 2003
Phylogenetics & Computational Genomics - 410.640.71
Exercises
1 - choose some alignment to work on
2 - load alignment into Clustal and build neighbor-joining tree
3 - open tree in Treeview and view, manipulate and save tree
4 - load alignment into DAMBE and into MEGA and reconstruct and view trees
using all distance methods available – look for differences in results
5 - manually reconstruct UPGMA tree for the distance matrix on slide #14
6 - open the MEGA formatted version of this same distance matrix
http://jhunix.hcf.jhu.edu/~kjordan6/distances.meg
in MEGA and reconstruct distance based trees using all 3 methods available
(check UPGMA result against manually reconstructed UPGMA tree)
7 - calculate RS for all three distance based trees from #6 and pick best tree
8 - calculate S for all three distance based trees from #6 and pick best tree
Lecture #6
Page 23