Transcript tree

The Genome Access Course
Phylogenetic Analysis
Phylogenetics
• Developed by Willi Hennig (Grundzüge einer Theorie der
Phylogenetischen Systematik, 1950; Phylogenetic
Systematics, 1966)
What is the ancestral sequence?
• pfeffer
• pepper
• (pf/p)e(ff/pp)er
Evolutionary Trees
• A tree is a connected, acyclic 2D graph
• Leaf: Taxon
• Node: Vertex
• Branch: Edge
• Tree length = sum of all branch lengths
• Phylogenetic trees are binary trees
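The terms above can be made concrete with a small sketch. The tree, its taxon names, and its branch lengths are hypothetical, chosen only to illustrate leaves, edges, and tree length.

```python
# Hypothetical toy tree: each edge is (parent, child, branch_length).
edges = [
    ("root", "n1", 0.10),
    ("root", "C", 0.30),
    ("n1", "A", 0.05),
    ("n1", "B", 0.07),
]

def tree_length(edges):
    """Tree length = sum of all branch lengths."""
    return sum(length for _, _, length in edges)

def leaves(edges):
    """Leaves (taxa) are the nodes that never appear as a parent."""
    parents = {parent for parent, _, _ in edges}
    children = {child for _, child, _ in edges}
    return sorted(children - parents)

print(tree_length(edges))
print(leaves(edges))
```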
A Generic Tree
Evolutionary Trees
• Rooted
– common ancestor
– unique path to any leaf
– directed
• Unrooted
– root could be placed anywhere
– fewer possible than rooted
Rooted Tree
generated by DRAWGRAM (PHYLIP)
Unrooted Tree
generated by DRAWTREE (PHYLIP)
Possible Evolutionary Trees
Rooted:   (2n-3)! / (2^(n-2) (n-2)!)
Unrooted: (2n-5)! / (2^(n-3) (n-3)!)

Taxa (n)   Rooted      Unrooted
2          1           1
3          3           1
4          15          3
5          105         15
6          945         105
7          10395       945
8          135135      10395
9          2027025     135135
10         34459425    2027025
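The counts can be reproduced directly from the two formulas. This is a minimal sketch; the function names are illustrative, and the n = 2 unrooted case is set to 1 by convention because the formula is undefined there.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted binary trees for n taxa: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted binary trees for n taxa: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    if n == 2:
        return 1  # by convention: a single branch joins the two taxa
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Note that the unrooted count for n taxa equals the rooted count for n-1 taxa, which is why the two columns are offset by one row.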
Genes vs. Species
• Sequences show gene relationships, but
phylogenetic histories may be different for
gene and species
• Genes evolve at different speeds
• Horizontal gene transfer
Methods for Phylogenetic
Analysis
• Character-State
– Maximum Parsimony
– Maximum Likelihood
• Genetic Distance
– Fitch & Margoliash
– Neighbor-Joining
– Unweighted Pair Group
Phylogenetic Software
• PHYLIP
• PAUP (Available in GCG)
• TREE-PUZZLE
• PhyloBLAST
• Felsenstein maintains an extensive list of
programs on the PHYLIP site
PHYLIP Programs
• dnapars/protpars
• dnadist/protdist
• dnaml (use fastDNAml instead)
• neighbor
• fitch/kitsch
• drawtree/drawgram
Maximum Parsimony
• Most common method
• Allows use of all evolutionary information
• Build and score all possible trees
• Each node is a transformation in a character
state
• Minimize tree length
• Best tree requires the fewest changes to
derive all sequences
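Scoring a single candidate tree can be sketched with Fitch's counting procedure, one common way to find the minimum number of changes on a fixed tree. The slides do not name a specific scoring algorithm, so this choice is an assumption, and the four-taxon alignment is hypothetical.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one character on a rooted binary tree.

    tree   : a leaf name (str) or a (left, right) tuple
    states : dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # Children disagree: one change is required at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total number of changes over all sites (the tree length under parsimony)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical alignment of four taxa, three sites each.
alignment = {"A": "AAG", "B": "AAG", "C": "CTG", "D": "CTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

Here the first tree needs 3 changes and the second needs 5, so the first is the more parsimonious of the two.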
Which is the more parsimonious tree?
[Figure: two candidate trees. First tree: 3 nodes, 9 node crossings. Second tree: 3 nodes, 8 node crossings.]
Maximum Likelihood
• Reconstruction using an explicit
evolutionary model
• The likelihood of a tree is calculated separately
for each nucleotide site. The product of the
likelihoods for each site provides the overall
likelihood of the observed data.
• Demanding computationally
• Slowest method
• Use to test (or improve) an existing tree
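The "product of per-site likelihoods" idea can be sketched in its simplest form: two aligned sequences and a single branch of length d. The Jukes-Cantor model, the grid search, and the two sequences are assumptions made for illustration; real programs evaluate whole trees with richer models.

```python
import math

def jc69_site_likelihood(x, y, d):
    """Likelihood of observing bases x and y at one site, separated by
    d substitutions per site, under the Jukes-Cantor model (assumed here)."""
    e = math.exp(-4.0 * d / 3.0)
    p_xy = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * p_xy  # 0.25 = equal base frequencies

def log_likelihood(seq1, seq2, d):
    """Log of the product of per-site likelihoods (sum of per-site logs)."""
    return sum(math.log(jc69_site_likelihood(x, y, d)) for x, y in zip(seq1, seq2))

# Hypothetical aligned sequences differing at 1 of 10 sites.
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTAA"

# Crude grid search for the branch length that maximizes the likelihood.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: log_likelihood(seq1, seq2, d))
print(best_d, log_likelihood(seq1, seq2, best_d))
```

Logs are summed rather than multiplying raw likelihoods because the product over many sites quickly underflows floating-point precision.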
Clustering Algorithms
• Use distances to calculate phylogenetic
trees
• Trees are based on the relative numbers of
similarities and differences between
sequences
• A distance matrix is constructed by
computing pairwise distances for all
sequences
• Clustering links successively more distant
taxa
DNA Distances
• Distances between pairs of DNA sequences are relatively
simple to compute as the sum of all base pair differences
between the two sequences
• Can only work for pairs of sequences that are similar
enough to be aligned
• All base changes are considered equal
• Insertion/deletions are generally given a larger weight than
replacements (gap penalties).
• Possible to correct for multiple substitutions at a single
site, which is common in distant relationships and for
rapidly evolving sites.
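A minimal sketch of the distance calculation follows. It computes the proportion of differing sites and then applies the Jukes-Cantor correction, one standard correction for multiple substitutions (the slides do not say which correction is intended, so this is an assumption). As a simplification, gap columns are skipped rather than given a gap penalty.

```python
import math

def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Correct an observed proportion p for multiple hits at the same site."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical aligned sequences.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed proportion, and the gap between them grows as sequences diverge.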
Amino Acid Distances
• More difficult to compute
• Substitutions have differing effects on
structure
• Some substitutions require more than one
DNA mutation
• Use replacement frequencies (PAM,
BLOSUM)
Fitch & Margoliash
• 3 sequences are combined at a time to
define branches and calculate their length
• Additive branch lengths
• Accurate for short branches
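The three-sequence step can be sketched as follows: given the pairwise distances among A, B, and C, additivity fixes the lengths of the three branches meeting at their common node. The function name and the example distances are hypothetical.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) from the internal node to A, B, and C,
    chosen so that a + b = d_ab, a + c = d_ac, and b + c = d_bc."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

# Hypothetical pairwise distances.
a, b, c = three_point_branches(d_ab=22, d_ac=39, d_bc=41)
print(a, b, c)
```

With more than three sequences, the remaining sequences are typically treated as one composite group using averaged distances, and the same calculation is repeated.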
Neighbor Joining
• Most common method of tree construction
• Distance matrix adjusted for each taxon
depending on its rate of evolution
• Performs well in simulation studies
• Most efficient computationally
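The rate adjustment can be sketched as one neighbor-joining step: each pairwise distance is adjusted by the total distance of both taxa to all others, and the pair with the smallest adjusted value is joined. The 5-taxon matrix is an illustrative example, not data from the course, and the function name is hypothetical.

```python
def nj_pick_pair(d):
    """One neighbor-joining step on a symmetric distance matrix d (n >= 3).

    Returns (i, j, length_i, length_j): the pair to join and the branch
    lengths from each member to the new internal node.
    """
    n = len(d)
    r = [sum(row) for row in d]  # total distance of each taxon to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]  # rate-adjusted distance
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return i, j, length_i, length_j

# Illustrative distance matrix for taxa a, b, c, d, e.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(d))
```

In this matrix d and e have the smallest raw distance (3), yet a and b are joined first because the adjustment accounts for how far each taxon is from everything else.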
UPGMA – Unweighted Pair Group
Method Using Arithmetic Averages
• Simplest method
• Calculates branch lengths between most
closely related sequences
• Averages distance to next sequence or
cluster
• Predicts a position for the root
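The steps above can be sketched as a short UPGMA routine: join the closest pair, average distances to every other cluster (weighted by cluster size), and repeat until one cluster remains. The taxon names and distances are hypothetical, and the nested-tuple tree format is just a convenience for this sketch.

```python
def upgma(names, d):
    """Build a rooted tree from a symmetric distance matrix.

    Returns (tree, root_height), where tree is a nested tuple of names.
    """
    # clusters: id -> (subtree, number_of_taxa)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): d[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        # Join the two closest clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two members.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: v for pair, v in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        height = d_ij / 2  # the new node sits halfway between its members
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height

# Hypothetical distances among four taxa.
names = ["A", "B", "C", "D"]
d = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, d))
```

The last join defines the root, which is how the method predicts a root position; that placement is only meaningful if all lineages evolve at roughly the same rate.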
Phylogenetic Complications
• Errors
• Loss of function
• Convergent evolution
• Lateral gene transfer
Validation
• Use several different algorithms and data sets
• NJ methods generate one tree, possibly supporting
a tree built by parsimony or maximum likelihood
• Bootstrapping
– Perturb data and note effect on tree
– Repeat many times
– If the tree is unchanged in ~90% of replicates, its correctness is supported
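Bootstrapping can be sketched as resampling alignment columns with replacement and counting how often the result is unchanged. In this sketch `closest_pair` is a deliberately trivial stand-in for a real tree-building program, and the alignment, replicate count, and seed are hypothetical.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """One bootstrap replicate: draw alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Stand-in for a tree builder: the pair of taxa with the fewest differences."""
    def diffs(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=lambda pair: diffs(*pair))

def bootstrap_support(alignment, build, n_reps=200, seed=0):
    """Fraction of replicates in which `build` gives the same result
    as it does on the original alignment."""
    rng = random.Random(seed)
    reference = build(alignment)
    hits = sum(build(resample_columns(alignment, rng)) == reference
               for _ in range(n_reps))
    return hits / n_reps

# Hypothetical alignment of three taxa.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "TTGTACCTGA",
}
print(closest_pair(alignment), bootstrap_support(alignment, closest_pair))
```

In practice `build` would wrap a full tree-building method, and support is usually reported for each individual branch instead of for the whole tree.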
Are there bugs in our genome?
N-acetylneuraminate lyase
The End