Transcript Document
Phylogenetics
• Motivation
• Concepts
• Algorithms
Lecture 13 CS566
1
Motivation
• Life arose just once - “Thou art my brethren but a
few sequences removed”
• Phylogenetic trees = Topologies of evolutionary
relationships between sequences, and, possibly,
species - “Is it man or cow that is the true heir of
the fabulous treasures of the woolly mammoth
dynasty?”
• Phylogenetic tree as guide tree for multiple
sequence alignment (déjà vu)
Lecture 13 CS566
2
Concepts
• Mutation and Evolution
– Mutations that persist over generations = Evolution
• Tree, not a lattice
– Each species arose just once
• Species phylogeny (often) != Sequence phylogeny
– Sequences evolve at different rates
• Within a single species
• Between different species
• Within a single sequence
– Especially in bacteria, horizontal transfer (“Napster’s been
around for ages”) quite common
Lecture 13 CS566
3
Concepts
• Molecular clock assumption
– Sequences drift apart at a constant rate
– Aka edge length proportional to time
– Aka satisfaction of ultrametricity
• For any 3 sequences
(all pair-wise distances are equal) xor
(2 distances are equal, and the third one smaller)
– If true, then
• All path lengths from root to leaf nodes are equal
• Additivity
– Distance metric chosen is
• True distance (fulfils triangular inequality)
• Such that cumulative sum of edge lengths along path between 2
sequences equals the distance between 2 sequences
Lecture 13 CS566
4
Concepts
•
•
•
•
Heuristic forays into intractable space
Start with pairwise “distances”
Path length = Distance (~Evolutionary time)
Work from leaves to node to generate tree
– (opposite of binary tree generation)
• “Its easier to be rootless than to be rooted”
• Binary tree approximation of higher order trees
• Edges do not imply direct links (Missing
links/incomplete data), only a representation of sequence
evolution
Lecture 13 CS566
5
• Parsimony (Character-based)
• Distance based methods
– Neighbor joining
– UPGMA
• Maximum Likelihood
Lecture 13 CS566
IIncreasing Sequence Similarity
Algorithms
6
Algorithms
• UPGMA (Unweighted pair group method with
arithmetic averages)
– Caveat if molecular clock not applicable: “If my cousin looks
more like me than my brother, he must be my lost brother, and
perhaps my brother my cousin?”
• Neighbor joining
– “Give me additive distances, and I shall give thee a tree, even if
some sequences morph faster than others”
• Parsimony
– “Its just a bruise, not Kaposi’s sarcoma!”
• Maximum Likelihood
– “Given the facts, Watson, the answer is elementary!”
Lecture 13 CS566
7
UPGMA
• Easiest to use if molecular clock and additivity are
valid
• No. of clusters = no. of sequences = no. of leaf
nodes
• Inter cluster distance = Average pairwise distance
{While (no. of clusters > 1)
– Connect pair of closest clusters (at distance d) with
intermediate node at distance d/2 from each of them}
• Caveat: Satisfies minimal distance requirement,
but may result in spurious topologies – because of
constant rate evolution assumption
Lecture 13 CS566
8
Parsimony
• Parsimony (“Miserliness in model space”): Pick the
simplest explanation that fits the facts - “If I hear a
blood-curdling scream, it’s just one of my sons trying
to kill the other – not an invasion by aliens!”
• Every possible tree evaluated in terms of total number
of steps needed to convert each sequence to another
– Practical for only a few sequences
• High percentage of similarity a prerequisite
– Neither identical or ‘completely different’ sequence
positions useful
– Each difference should represent a single step
(WYSIWYG) and not a ‘full circle’ or ‘nonshortest route’
Lecture 13 CS566
9
Parsimony
1.
2.
3.
4.
123456789…………
ACCEFAHIKLKNPR
ACCEFGHILLLNPR
ACDEFGHIKLINPK
AADEFGHILLNNPK
*
*
*
1C
3D
C
2C
D
Candidate tree for position 3
Lecture 13 CS566
4D
10
Parsimony
•
•
•
3 sets of 3 trees each compared
The one with lowest total number of
substitutions selected
Refinements:
–
Branch and bound:
•
–
–
Abandon a tree if subtree has a higher score than current
minimal score tree
Heuristic branch-pattern representatives
Non-boolean costs: Tranversion > transition OR use
of amino-acid substitution matrices
Lecture 13 CS566
11
Neighbor Joining
•
•
•
Generates unrooted tree, allowing for unequal branches
Given: Distance matrix for sequences
Steps: Repeat 1-3 till all branches generated
1.
2.
Take closest sequences i, j
Find branch lengths between i and j by treating remaining
sequences as composite (c)
1.
2.
3.
4.
5.
Calculate average i-C and j-C distances
Calculate branch lengths i and j
Treat ij as composite sequence now and generate new distance
table.
Generate multiple trees by starting with different pairs
Compare resulting trees in terms of best fit to original distance
matrix
Lecture 13 CS566
12
Rooting trees
•
Based on a “proxy ancestor”
–
–
–
•
Include a distant relative (“outgroup”) as the proxy
ancestor
Add the outgroup as the last node
Point of attachment of outgroup represents root
Diameter center
–
Place root at center of longest path through tree
Lecture 13 CS566
13
Summary
• Parsimony and ML based approaches
computationally intensive – scalability poor
• Neighbor joining adequate if additivity
assumption is valid
• UPGMA adequate if both molecular clock
and additivity assumptions are valid for
given set of sequences
Lecture 13 CS566
14
Summary
• Phylogenetics useful to understand
sequence evolution
• Phylogenetics makes sense for
– sequences with a high percentage of sequence
identity
– sequences not subject to ‘selection’
• Sequence tree not the same as species tree
Lecture 13 CS566
15