Phylogenetics - Distance

Phylogenetics Distance-Based Methods
CIS 667 March 11, 2004
Phylogenetics
• Attempts to infer the evolutionary history of a
group of organisms or sequences of nucleic
acids or proteins
  - Phylogenetic methods can be used for the study
of evolutionary relationships between species of
organisms as well as genes
  - Attempt to reconstruct evolutionary ancestors
  - Estimate time of divergence from ancestor
Phylogenetic Trees
• We can use phylogenetic trees to illustrate
the evolutionary relationships among
groups of species or genes
• Leaf nodes of the tree are the species or
genes we are comparing, interior nodes
are inferred common ancestors
Phylogenetic Trees
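[Figure: Phylogenetic tree for close human relatives. The common ancestor of humans and apes splits into humans and the common ancestor of gorillas, chimps, and orangutans; that ancestor splits into orangutans and the common ancestor of gorillas and chimps, which splits into chimpanzees and gorillas.]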
History
• Taxonomists used anatomy and
physiology to group and classify
organisms
  - Morphological features like presence of
feathers or number of legs
• When protein sequencing, and later DNA
sequencing, became common, amino acid
and DNA sequences became the common
way to construct trees
[Figure: Phylogenetic tree constructed from amino acid sequences of the cytochrome c protein]
The Big Picture
• Determine the species or genes to be studied
• Acquire homologous sequence data
• Use multiple sequence alignment software like
ClustalW to align
• Clean up data by hand
• Use phylogenetic analysis software like Phylip
based on techniques we will study
• Verify experimentally
Phylogenetics
• Can be used to solve a number of
interesting problems
  - Forensics
    - HIV mutates rapidly
  - Predicting evolution of influenza viruses
  - Predicting functions of uncharacterized genes
    - Ortholog detection
  - Drug discovery
  - Vaccine development
    - Target inferred common ancestor
Types of Data
• Two categories
  - Numerical data
    - Distance between objects
    - E.g. evolutionary distance between two species
    - Usually derived from sequence data
  - Character data
    - Each character has a finite number of states
    - E.g. number of legs = 1, 2, 4
    - DNA = {A, C, T, G}
Phylogenetic Trees
• Trees are composed of nodes and
branches
  - Terminal or leaf nodes correspond to a gene
or organism for which data has been collected
  - Internal nodes usually represent an inferred
common ancestor that gave rise to two
independent lineages sometime in the past
Rooted and Unrooted Trees
• Some trees make an inference about a
common ancestor and the direction of
evolution and some don’t
  - First type is called a rooted tree and has a
single node designated as root which is the
common ancestor
  - Second type is called an unrooted tree
    - Specifies only relationship between nodes and
says nothing about direction of evolution
Rooted and Unrooted Trees
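[Figure: a rooted tree with root R and leaves A through E drawn along a time axis, next to an unrooted tree on the same five leaves]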
Rooted and Unrooted Trees
• Roots can usually be assigned to unrooted
trees using an outgroup
  - Species unambiguously separated the earliest
from others being studied
    - E.g. baboons in case of humans and gorillas
  - For three species there are 3 possible rooted
trees, but only one possible unrooted tree
Rooted and Unrooted Trees
• In fact, the numbers of rooted (N_R) and unrooted
trees (N_U) for n species are
  - $N_R = \frac{(2n - 3)!}{2^{n-2}\,(n - 2)!}$
  - $N_U = \frac{(2n - 5)!}{2^{n-3}\,(n - 3)!}$

Data Sets (n)   Rooted Trees (N_R)                Unrooted Trees (N_U)
2               1                                 1
3               3                                 1
4               15                                3
5               105                               15
10              34,459,425                        2,027,025
15              213,458,046,676,875               7,905,853,580,625
20              8,200,794,532,637,891,559,375     221,643,095,476,699,771,875
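The two counting formulas for N_R and N_U are easy to check numerically. The following is a minimal sketch; the function names are illustrative and not from the slides, and the unrooted formula is only evaluated for n >= 3, where the factorial argument is non-negative.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 species: (2n-3)! / (2^(n-2) (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 species: (2n-5)! / (2^(n-3) (n-3)!)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print one row per data-set size, in the same order as the table columns.
for n in (3, 4, 5, 10, 15, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

The number of unrooted trees for n species equals the number of rooted trees for n - 1 species, which is why the two columns are shifted copies of each other.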
Rooting Trees
• Trees can be rooted by using the outgroup
method previously mentioned, or by
putting the root midway between the two
most distant species as determined by
branch length
  - Branch length measures the amount of
difference that occurred along a branch
  - Assumes the species are evolving in a clocklike manner
Rooting a Tree
More Tree Terminology
• Structure of a phylogenetic tree can be
represented in Newick format using nested
parentheses
  - (((B, C), (D, E)), A)
• If we lack data to tell in which order two or more
independent lineages occurred in the past, the
tree may be multifurcating (more than two
descendants from an interior node); otherwise, it is
bifurcating (exactly two descendants per interior node)
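The nested-parentheses notation maps directly onto nested lists. Here is a minimal sketch of a parser for bare topologies like the one on this slide. It assumes names and parentheses only, with no branch lengths or quoted labels, so it is not a full Newick reader; the function names are illustrative.

```python
def parse_newick(text: str):
    """Parse a bare Newick topology into nested lists of leaf names."""
    tokens = (
        text.replace("(", " ( ").replace(")", " ) ")
        .replace(",", " ").replace(";", " ").split()
    )
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children = []
            while tokens[pos] != ")":
                children.append(node())
            pos += 1                      # consume ")"
            return children
        name = tokens[pos]                # a leaf name
        pos += 1
        return name

    return node()


def leaves(tree):
    """List the leaf names from left to right."""
    return [tree] if isinstance(tree, str) else [x for c in tree for x in leaves(c)]


def is_bifurcating(tree):
    """True if every interior node has exactly two children."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(c) for c in tree)


tree = parse_newick("(((B, C), (D, E)), A)")
print(tree, leaves(tree), is_bifurcating(tree))
```

A tree such as `(A, B, C)` parses to a single interior node with three children, so `is_bifurcating` reports it as multifurcating.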
Character and Distance Data
• Character-based methods use aligned
DNA or protein sequences directly for tree
inference
Species A   ATCGAATCGTTCCGGA
Species B   ATCCAATAGTTCCGGA
Species C   AACGAATCCTACCGGT
Species D   ATCGTTTCCAACCGCT
Species E   ATAGATTCGTTCGGGA
Character and Distance Data
• Distance-based methods must transform
the sequence data into a pairwise
distance (dissimilarity) matrix for use during tree
inference
Species   A   B   C   D
B         2   -   -   -
C         4   5   -   -
D         7   9   5   -
E         3   5   7   8
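One simple way to turn aligned sequences into pairwise distances is to count mismatched positions. The sketch below does that for the five sequences from the character-data slide. Counting raw mismatches is an assumption made here for illustration; the slides do not say how their matrix was derived, so not every entry need agree with it, and real analyses often apply a substitution-model correction instead.

```python
from itertools import combinations

# Aligned sequences from the character-data slide.
sequences = {
    "A": "ATCGAATCGTTCCGGA",
    "B": "ATCCAATAGTTCCGGA",
    "C": "AACGAATCCTACCGGT",
    "D": "ATCGTTTCCAACCGCT",
    "E": "ATAGATTCGTTCGGGA",
}


def mismatches(s: str, t: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))


# Distance for every unordered pair of species.
distances = {
    (x, y): mismatches(sequences[x], sequences[y])
    for x, y in combinations(sequences, 2)
}

for pair, d in distances.items():
    print(pair, d)
```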
Distance-Based Methods
• Given such an input matrix, we want to find
an edge-weighted tree whose leaves
correspond to the species and in which the
distance measured between two leaves
matches the corresponding matrix
value
UPGMA
• UPGMA (Unweighted Pair Group Method
with Arithmetic mean) is the oldest
distance matrix method
  - Uses a distance matrix representing a measure
of genetic distance between the pairs of species
being considered
  - Clusters the two closest species
  - Computes a new distance matrix, using the arithmetic
mean of the distances to the members of the new cluster
  - Repeats until all species are grouped
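The steps above can be sketched in a few lines. This is a minimal illustration that returns the topology only, using the five-species distance matrix from the earlier slide. The `upgma` function and its data layout are assumptions for this sketch, not the course's or Phylip's implementation.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
# Lower triangle of the slide's distance matrix: row -> distances to A, B, C, ...
lower = {"B": [2], "C": [4, 5], "D": [7, 9, 5], "E": [3, 5, 7, 8]}
dist = {
    frozenset((row, names[j])): d
    for row, values in lower.items()
    for j, d in enumerate(values)
}


def upgma(names, dist):
    """Return a Newick-style topology string built by UPGMA clustering."""
    # Each cluster is (label, tuple of member leaves).
    clusters = [(name, (name,)) for name in names]

    def average(c1, c2):
        # Arithmetic mean of all leaf-to-leaf distances between two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append((f"({c1[0]},{c2[0]})", c1[1] + c2[1]))
    return clusters[0][0]


tree = upgma(names, dist)
print(tree)
```

On this matrix the sketch joins A with B first (distance 2), then E with (A,B) (mean distance 4), then C with D (distance 5), and finally the two remaining clusters.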
UPGMA
[Figure: UPGMA clustering of species A through E]
Estimation of Branch Length
• Scaled trees, in which the lengths of the
branches correspond to the degree to
which sequences have diverged, are usually called
phylograms (a cladogram shows only the branching order)
• If rates of evolution are assumed to be
constant in all lineages, then internal nodes
are placed at equal distances from each of
the species they give rise to on a
bifurcating tree (UPGMA ex.)
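As a worked example of the constant-rate placement, take the five-species distance matrix from the earlier slide and the first two UPGMA joins. The node heights h are notation introduced here for illustration: each internal node sits at half the (average) distance between the two groups it joins.

```latex
% A and B are joined first, so their ancestor is equidistant from both:
h_{AB} = \frac{d_{AB}}{2} = \frac{2}{2} = 1
% The cluster (A,B) is then joined with E at the average distance:
d_{(AB)E} = \frac{d_{AE} + d_{BE}}{2} = \frac{3 + 5}{2} = 4,
\qquad h_{(AB)E} = \frac{4}{2} = 2
% The internal branch between the two ancestors is the difference in heights:
h_{(AB)E} - h_{AB} = 2 - 1 = 1
```

So the branches leading to A and B each have length 1, the branch to E has length 2, and the internal branch connecting the two ancestral nodes has length 1.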
UPGMA
• So UPGMA is very simple and generates
rooted trees, however…
• Major weakness is that the algorithm
assumes that rates of evolution are the
same among different lineages
• This does not fit existing biological data,
so UPGMA is probably a poor choice for building
phylogenetic trees
Transformed Distance Method
• Several distance matrix-based alternatives to
UPGMA allow different rates of evolution within
different lineages
  - Oldest and simplest is the transformed distance
method, which takes advantage of an outgroup
    - The other lineages evolve separately from each
other only after they diverge, and the outgroup
diverged first, so it serves as a frame of reference
for comparing how much each of the other lineages
has evolved since then
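The slide does not give the transformation itself. One commonly cited form is d'_ij = (d_ij - d_iD - d_jD) / 2 + mean(d_kD), where D is the outgroup and the mean runs over the ingroup species k. The sketch below applies that form to the five-species matrix from the earlier slide. Treating species D as the outgroup is purely a hypothetical choice for illustration, and the function name is likewise illustrative.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
# Lower triangle of the slide's distance matrix: row -> distances to A, B, C, ...
lower = {"B": [2], "C": [4, 5], "D": [7, 9, 5], "E": [3, 5, 7, 8]}
dist = {
    frozenset((row, names[j])): d
    for row, values in lower.items()
    for j, d in enumerate(values)
}


def d(x, y):
    """Look up the distance between two species."""
    return dist[frozenset((x, y))]


def transformed_distances(ingroup, outgroup):
    """Apply d'_ij = (d_ij - d_iD - d_jD) / 2 + mean_k(d_kD) to every ingroup pair."""
    mean_to_outgroup = sum(d(k, outgroup) for k in ingroup) / len(ingroup)
    return {
        (i, j): (d(i, j) - d(i, outgroup) - d(j, outgroup)) / 2 + mean_to_outgroup
        for i, j in combinations(ingroup, 2)
    }


# Hypothetical assumption: D is the outgroup; the slides do not designate one.
transformed = transformed_distances(["A", "B", "C", "E"], "D")
for pair, value in transformed.items():
    print(pair, value)
```

The transformed matrix can then be clustered in the same way as the original one.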
Neighbor’s Relation Method
• One variant of UPGMA tries to pair
species in such a way as to minimize the
sum of the branch lengths
  - On a rooted tree, pairs of species separated
from each other by only one node are called
neighbors
  - We have important relationships between
neighbors of a phylogenetic tree with four
nodes
Neighbor’s Relation Method
The following hold for this tree
$d_{AC} + d_{BD} = d_{AD} + d_{BC} = a + b + c + d + 2e = d_{AB} + d_{CD} + 2e$
$d_{AB} + d_{CD} < d_{AC} + d_{BD}$
$d_{AB} + d_{CD} < d_{AD} + d_{BC}$
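[Figure: four-species tree. A and B attach by branches a and b to one interior node, C and D attach by branches c and d to the other, and the two interior nodes are joined by branch e.]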
Neighbor’s Relation Method
• Consider all possible pairwise
arrangements of four species, and
determine which satisfies the four-point
condition (set of 2 inequalities)
• This process can be iterated to generate a
complete tree, but the process is
infeasible for large sets of species
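For one set of four species, the check amounts to comparing three sums of paired distances and keeping the pairing with the smallest sum. A minimal sketch on species A, B, C, D from the earlier distance matrix follows; the function name is illustrative.

```python
names = ["A", "B", "C", "D", "E"]
# Lower triangle of the slide's distance matrix: row -> distances to A, B, C, ...
lower = {"B": [2], "C": [4, 5], "D": [7, 9, 5], "E": [3, 5, 7, 8]}
dist = {
    frozenset((row, names[j])): d
    for row, values in lower.items()
    for j, d in enumerate(values)
}


def d(x, y):
    """Look up the distance between two species."""
    return dist[frozenset((x, y))]


def neighbor_pairing(quartet):
    """Return the pairing of four species whose summed pair distances is smallest."""
    w, x, y, z = quartet
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    return min(pairings, key=lambda p: d(*p[0]) + d(*p[1]))


best = neighbor_pairing(("A", "B", "C", "D"))
print(best)
```

Here d_AB + d_CD = 7, d_AC + d_BD = 13, and d_AD + d_BC = 12, so A and B are taken as neighbors, as are C and D. The two larger sums are not equal, which indicates that these distances do not fit a four-species tree exactly.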
Neighbor-Joining Methods
• Other neighborliness approaches are
available as well
• Neighbor-joining methods start with all
species arranged in a star tree
[Figure: a star tree on species a through e, and the tree after one pair of neighbors has been pulled out]
Neighbor-Joining Methods
• The pair of nodes pulled out (grouped) at each
iteration are chosen so that the total length of
the branches on the tree is minimized
• After a pair of nodes is pulled out, it forms a
cluster in the tree and is included in further
rounds of iteration (and a new distance matrix is
generated)
• The pair of nodes (1 and 2, out of N) that minimizes the
tree's total branch length is found by minimizing:
$Q_{12} = (N - 2)\,d_{12} - \sum_{i=1}^{N} d_{1i} - \sum_{i=1}^{N} d_{2i}$
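Reading the two subtracted terms as sums over all nodes i, the criterion can be evaluated for every pair in the five-species distance matrix from the earlier slide. This minimal sketch covers only the pair-selection step of one iteration; branch-length estimation and the matrix update are omitted, and the function name is illustrative.

```python
from itertools import combinations

names = ["A", "B", "C", "D", "E"]
# Lower triangle of the slide's distance matrix: row -> distances to A, B, C, ...
lower = {"B": [2], "C": [4, 5], "D": [7, 9, 5], "E": [3, 5, 7, 8]}
dist = {
    frozenset((row, names[j])): d
    for row, values in lower.items()
    for j, d in enumerate(values)
}


def d(x, y):
    """Look up the distance between two species."""
    return dist[frozenset((x, y))]


def q_values(names):
    """Q_xy = (N - 2) d_xy - sum_i d_xi - sum_i d_yi for every pair (x, y)."""
    n = len(names)
    # Sum of distances from each node to all other nodes.
    total = {x: sum(d(x, y) for y in names if y != x) for x in names}
    return {
        (x, y): (n - 2) * d(x, y) - total[x] - total[y]
        for x, y in combinations(names, 2)
    }


q = q_values(names)
best_pair = min(q, key=q.get)
print(best_pair, q[best_pair])
```

On this matrix the smallest Q belongs to C and D, so they would be pulled out first, even though A and B have the smallest raw distance. This is one way neighbor joining differs from UPGMA.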