
Phylogenetic Analysis
• Motivation
– The problem of explaining the evolutionary history
of today's species
– How do species relate to one another in terms of
common ancestors
– Nucleic acids and Proteins also evolve
• Approaches
– Fossil records, phylogenetic trees
General comments on phylogenetics
• Phylogenetics is the branch of biology that deals with
evolutionary relatedness
• Uses some measure of evolutionary relatedness, e.g.,
morphological features
• Phylogenetics on sequence data is an attempt to reconstruct the
evolutionary history of those sequences
• Relationships between individual sequences are not necessarily
the same as those between the organisms they are found in
• The ultimate goal is to be able to use sequence data from many
sequences to give information about phylogenetic history of
organisms
• Phylogenetic relationships are usually depicted as trees, with
branches representing ancestors of “children”; the tips at the bottom
of the tree (individual organisms) are leaves. Individual branch points
are nodes.
What is phylogenetic analysis and why
should we perform it?
Phylogenetic analysis has two major components:
1. Phylogenetic inference or “tree building”:
the inference of the branching orders, and
ultimately the evolutionary relationships,
between “taxa” (entities such as genes,
populations, species, etc.)
2. Character and rate analysis:
using phylogenies as analytical frameworks
for rigorous understanding of the evolution of
various traits or conditions of interest
• Examine the process of evolution
– What drives evolution?
– Understanding mutation, gene flow and natural selection
• Examine the history of evolution
– What has evolution done in the past?
– Understanding how living organisms are related and how
they have changed over time
• Aim
– To construct a visual representation (a tree) to describe the
assumed evolution occurring between and among different
groups (individuals, populations, species, etc.) and to study
the reliability of the consensus tree.
Common Phylogenetic Tree Terminology
[Figure: rooted tree of taxa A to E, labeled as follows.]
• Terminal nodes: represent the TAXA (genes, populations,
species, etc.) used to infer the phylogeny
• Branches or lineages
• Internal nodes or divergence points: represent hypothetical
ancestors of the taxa
• Ancestral node or ROOT of the tree
Parts of a Phylogenetic Tree
[Figure: tree with labeled node, branch, root, ingroup, and outgroup.]
Phylogenetic trees diagram the evolutionary
relationships between the taxa
[Figure: rooted tree with taxa A, B, C, D, and E at the tips.]
The spacing between the taxa, and the order in which they appear from
top to bottom, have no meaning.
The dimension along the branches either can have no scale (for ‘cladograms’),
can be proportional to genetic distance or amount of change
(for ‘phylograms’ or ‘additive trees’), or can be proportional
to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses
This says that B and C are more closely related to each other than either is to A,
and that A, B, and C form a clade that is a sister group to the clade composed of
D and E. If the tree has a time scale, then D and E are the most closely related.
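The nested-parentheses notation can be read mechanically. The following is a minimal Python sketch, assuming taxon names without spaces, no branch lengths, and no trailing semicolon; the function names `parse_nested` and `clades` are illustrative and do not come from any library.

```python
def parse_nested(text):
    """Parse a nested-parentheses tree such as '((A,(B,C)),(D,E))'.

    Returns nested tuples; leaves are taxon-name strings.
    Assumes no branch lengths, no whitespace and no trailing ';'.
    """
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def clades(tree):
    """Return the sorted leaf set under every internal node, root first."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return [n]
        below = sorted(name for child in n for name in leaves(child))
        found.append(below)  # recorded in post-order
        return below

    leaves(tree)
    return found[::-1]  # reverse post-order puts the root first


tree = parse_nested("((A,(B,C)),(D,E))")
for group in clades(tree):
    print(",".join(group))
```

Each printed group is one clade: the whole tree, then D,E, then A,B,C, then B,C. This matches the reading given above, with B and C grouped inside the A,B,C clade and D,E as its sister group.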
• In Phylogenetic trees
– Leaves represent present day
species
– Interior nodes represent
hypothesized ancestors
– We will only consider binary
trees: edges split only into
two branches (daughter
edges)
– Rooted trees have an explicit
ancestor; the direction of time
is explicit in these trees
– Unrooted trees do not have
an explicit ancestor; the
direction of time is
undetermined in such trees
A few examples of what can be inferred
from phylogenetic trees built from DNA
or protein sequence data:
• Which species are the closest living
relatives of modern humans?
• What were the origins of specific
transposable elements?
• Plus countless others…..
Input data for Phylogenetic
Reconstruction
• Distance Matrix
• Character State Matrix
Types of phylogenetic analysis
methods
• Phenetic (distance methods): trees are constructed based
on observed characteristics, not on evolutionary history
• Cladistic (parsimony and maximum likelihood methods): trees
are constructed based on fitting observed characteristics to
some model of evolutionary history
Distance methods
• Distance methods start from a set of distances d_ij, one for
each pair of sequences i, j in the dataset. d_ij can be the
fraction f of sites u where residues x_i and x_j differ; or d_ij can be
such a fraction but weighted in some way (e.g. Jukes-Cantor
distance)
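As a rough illustration of the two kinds of distance mentioned above, the sketch below computes the observed fraction of differing sites and its Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) f), for DNA. The sequences and the function names `p_distance` and `jukes_cantor` are made up for this example, and gaps or ambiguous bases are not handled.

```python
import math


def p_distance(x, y):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed DNA fraction p < 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences, invented for illustration only.
seqs = {"s1": "ACGTACGTAC", "s2": "ACGTTCGTAA", "s3": "TCGTTCGAAA"}
names = sorted(seqs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        p = p_distance(seqs[a], seqs[b])
        print(a, b, round(p, 3), round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as the observed fraction, because it tries to account for sites that changed more than once.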
Parsimony methods
• Parsimony methods are based on the idea that the most
probable evolutionary pathway is the one that requires the
smallest number of changes from some ancestral state
• For sequences, this implies treating each position separately
and finding the minimal number of substitutions at each position
• Parsimony methods assign a cost to each candidate tree for the
dataset, then search the candidate trees and select
the most parsimonious
• Screening all the candidate trees for even a smallish dataset
would take too much time; the branch-and-bound method builds
trees with increasing numbers of leaves but abandons a partial
topology whenever its cost already exceeds that of the best
complete tree found so far
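The cost of one tree can be counted site by site. One common way to do this is Fitch's algorithm, sketched below under the assumptions of a binary tree written as nested 2-tuples and equal cost for every change. The alignment and the names `fitch_site` and `parsimony_score` are hypothetical; the three candidates are the three possible four-taxon topologies, each written with an arbitrary root.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one alignment column on a binary tree.

    tree   : nested 2-tuples with taxon names as leaves
    states : dict mapping taxon name -> character state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    common = lset & rset
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, lcost + rcost
    # Children disagree: one change is required on this part of the tree.
    return lset | rset, lcost + rcost + 1


def parsimony_score(tree, alignment):
    """Sum the minimal number of changes over all positions."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for j in range(n_sites):
        column = {name: seq[j] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical alignment, invented for illustration only.
alignment = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GCT"}
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, alignment) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the tree grouping A with B needs 3 changes and the other two need 5, so it is the most parsimonious. With more taxa the candidate list grows too quickly to enumerate, which is where branch and bound or heuristics come in.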
Example of parsimonious tree
building
• Tree on left requires
only one change,
tree on right
requires two: left
tree is most
parsimonious
Character State Matrix
• A character has a finite number of
states
• Taxonomical units for which we want to
create phylogeny are called Objects
– e.g. species, population
• Every object has a state vector; objects share
the same characters but not necessarily the same
states!
Character State Matrix M
• M has n rows
(Objects)
• M has m columns
(characters)
• Mij denotes the state
object i has for
character j
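A character state matrix maps directly onto a list of rows. The sketch below is only an illustration: the objects, characters, and states are invented, and indices are 0-based, whereas the slide's M_ij is presumably 1-based.

```python
# Hypothetical objects, characters and states, invented for illustration.
objects = ["species_1", "species_2", "species_3"]       # n = 3 rows
characters = ["has_wings", "leg_pairs", "eye_colour"]   # m = 3 columns
M = [
    ["yes", 3, "red"],
    ["yes", 3, "black"],
    ["no", 4, "black"],
]


def state(i, j):
    """M_ij: the state that object i has for character j (0-based indices)."""
    return M[i][j]


def differing_characters(i, k):
    """Names of the characters on which objects i and k have different states."""
    return [characters[j] for j in range(len(characters)) if M[i][j] != M[k][j]]


print(state(2, 1))
print(differing_characters(0, 1))
```

Every row is the state vector of one object. All rows share the same columns (characters), but the entries (states) may differ.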
Which species are the closest living
relatives of modern humans?
[Figure: two rooted trees of humans, chimpanzees, bonobos, gorillas, and
orangutans with time axes in millions of years ago (MYA): the molecular
tree (axis from 14 to 0 MYA) and the pre-molecular tree (axis from
15-30 to 0 MYA).]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA
hybridization all show that bonobos and
chimpanzees are related more closely to
humans than either is to gorillas.
The pre-molecular view was that the
great apes (chimpanzees, gorillas and
orangutans) formed a clade separate
from humans, and that humans diverged
from the apes at least 15-30 MYA.
A few examples of what can be learned
from character analysis using
phylogenies as analytical frameworks:
• When did specific episodes of positive Darwinian
selection occur during evolutionary history?
• Which genetic changes are unique to the human
lineage?
• What was the most likely geographical location of
the common ancestor of the African apes and
humans?
• Plus countless others…..
The number of unrooted trees increases in a greater
than exponential manner with number of taxa
[Figure: example unrooted trees for increasing numbers of taxa (A, B, C, ...).]
# Taxa (N)    # Unrooted trees
3             1
4             3
5             15
6             105
7             945
8             10,395
9             135,135
10            2,027,025
...           ...
30            ~8.69 x 10^36
(2N - 5)!! = # unrooted trees for N taxa
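The double-factorial formula is easy to check numerically. The sketch below uses the illustrative names `double_factorial` and `unrooted_trees` and assumes binary trees with labelled taxa.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (odd k) or 2 (even k)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of unrooted binary trees for n_taxa >= 3 labelled taxa."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


for n in (3, 4, 5, 6, 7, 8, 9, 10, 30):
    print(n, f"{unrooted_trees(n):,}")
```

Python integers have arbitrary precision, so the count for 30 taxa is computed exactly rather than approximated.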
Inferring evolutionary relationships between
the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the
string at the root and tug on it until the ends of the string (the taxa) fall
opposite the root.
[Figure: an unrooted tree of taxa A, B, C, and D with the root position marked,
and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B
than it is to C or D.
Now, try it again with the root at another position:
[Figure: the same unrooted tree of taxa A, B, C, and D with the root at another
position, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most
closely related to taxon B, and together they
are equally distantly related to taxa C and D.
An unrooted, four-taxon tree theoretically can be rooted in five
different places to produce five different rooted trees
[Figure: unrooted tree 1 with taxa A, B, C, and D and five candidate root
positions numbered 1 to 5, one on each branch; rooting at each position gives
rooted trees 1a, 1b, 1c, 1d, and 1e.]
These trees show five different evolutionary relationships among the taxa!
There are two major ways to root trees:
By outgroup:
Uses taxa (the “outgroup”) that are
known to fall outside of the group of
interest (the “ingroup”). Requires
some prior knowledge about the
relationships among the taxa. The
outgroup can either be species (e.g.,
birds to root a mammalian tree) or
previous gene duplicates (e.g.,
α-globins to root β-globins).
By midpoint or distance:
Roots the tree at the midway point
between the two most distant taxa in
the tree, as determined by branch
lengths. Assumes that the taxa are
evolving in a clock-like manner. This
assumption is built into some of the
distance-based tree building methods.
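[Figure: tree of taxa A, B, C, and D with branch lengths 10, 3, 2, 2, and 5;
the path from A to D crosses branches of length 10, 3, and 5.]
d (A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9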
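The midpoint calculation above can be sketched as a walk along the path between the two most distant taxa. This sketch assumes the 10-unit branch is the one leading to A; "n1" and "n2" are hypothetical labels for the unnamed internal nodes, and `midpoint_on_path` is an illustrative name.

```python
def midpoint_on_path(path):
    """Locate the midpoint of a path given as (start, end, length) edges.

    Returns (edge_start, edge_end, offset), where offset is the distance of
    the midpoint from edge_start along that edge.
    """
    half = sum(length for _, _, length in path) / 2
    walked = 0.0
    for start, end, length in path:
        if walked + length >= half:
            return start, end, half - walked
        walked += length
    raise ValueError("path must contain at least one edge")


# Path between the two most distant taxa in the example, A and D.
# "n1" and "n2" are hypothetical labels for the unnamed internal nodes.
path_a_to_d = [("A", "n1", 10), ("n1", "n2", 3), ("n2", "D", 5)]
print(midpoint_on_path(path_a_to_d))
```

Under these assumptions the root falls on the branch leading to A, 9 units from A and 1 unit from the first internal node. A full implementation would first have to find the most distant pair of taxa in the tree.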
Each unrooted tree theoretically can be rooted
anywhere along any of its branches
[Figure: example unrooted trees for increasing numbers of taxa (A, B, C, ...).]
# Taxa    # Unrooted trees   x   # Roots   =   # Rooted trees
3         1                      3             3
4         3                      5             15
5         15                     7             105
6         105                    9             945
7         945                    11            10,395
8         10,395                 13            135,135
9         135,135                15            2,027,025
...       ...                    ...           ...
30        ~8.69 x 10^36          57            ~4.95 x 10^38
(2N - 3)!! = # rooted trees for N taxa
Molecular phylogenetic tree building methods:
Are mathematical and/or statistical methods for inferring the divergence
order of taxa, as well as the lengths of the branches that connect them.
There are many phylogenetic methods available today, each having
strengths and weaknesses. Most can be classified as follows:
                COMPUTATIONAL METHOD
DATA TYPE       Optimality criterion                Clustering algorithm
Characters      PARSIMONY, MAXIMUM LIKELIHOOD
Distances       MINIMUM EVOLUTION, LEAST SQUARES    UPGMA, NEIGHBOR-JOINING
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA
or protein sequences, directly during tree inference.
Taxa         Characters
Species A    ATGGCTATTCTTATAGTACG
Species B    ATCGCTAGTCTTATATTACA
Species C    TTCACTAGACCTGTGGTCCA
Species D    TTGACCAGACCTGTGGTCCG
Species E    TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise
distances (dissimilarities), and then use the matrix during tree building.
             Species A   Species B   Species C   Species D   Species E
Species A    ----        0.23        0.87        0.73        0.59
Species B    0.20        ----        0.59        1.12        0.89
Species C    0.50        0.40        ----        0.17        0.61
Species D    0.45        0.55        0.15        ----        0.31
Species E    0.40        0.50        0.40        0.25        ----
Below the diagonal: Example 1, uncorrected “p” distance
(= observed fraction of sites that differ).
Above the diagonal: Example 2, Kimura 2-parameter distance
(estimate of the true number of substitutions between taxa).
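To make the two example distances concrete, the sketch below computes both for the Species A and Species B sequences from the character alignment. It uses the standard Kimura 2-parameter formula d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the fractions of transitions and transversions. The function name `k2p_distance` is illustrative, and the sketch assumes an ungapped alignment of A, C, G, and T only.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(x, y):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    transitions = transversions = 0
    for a, b in zip(x, y):
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    n = len(x)
    P = transitions / n
    Q = transversions / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


species_a = "ATGGCTATTCTTATAGTACG"
species_b = "ATCGCTAGTCTTATATTACA"
p = sum(a != b for a, b in zip(species_a, species_b)) / len(species_a)
print(round(p, 2), round(k2p_distance(species_a, species_b), 2))
```

The two sequences differ at 4 of 20 sites (one transition and three transversions), so the sketch prints 0.2 and 0.23, in agreement with the 0.20 and 0.23 entries for species A and B.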
Computational methods for finding optimal trees:
Exact algorithms: "Guarantee" to find the optimal or
"best" tree for the method of choice. Two types used in tree
building:
Exhaustive search: Evaluates all possible unrooted
trees, choosing the one with the best score for the method.
Branch-and-bound search: Eliminates the parts of the
search tree that only contain suboptimal solutions.
Heuristic algorithms: Approximate or “quick-and-dirty”
methods that attempt to find the optimal tree for the method of
choice, but cannot guarantee to do so. Heuristic searches
often operate by “hill-climbing” methods.
Exact searches become increasingly difficult, and
eventually impossible, as the number of taxa increases:
[Table: # unrooted trees = (2N - 5)!! for N taxa, e.g., 3 trees for 4 taxa,
2,027,025 trees for 10 taxa, and about 8.69 x 10^36 trees for 30 taxa.]
Parsimony methods:
Optimality criterion: The ‘most-parsimonious’ tree is the one that
requires the fewest evolutionary events (e.g., nucleotide
substitutions, amino acid replacements) to explain the sequences.
Advantages:
• Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
• Can be used on molecular and non-molecular (e.g., morphological) data.
• Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy)
• Can be used for character (can infer the exact substitutions) and rate analysis.
• Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
• Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!)
• Can be fooled by high levels of homoplasy (‘same’ events).
• Can become positively misleading in the “Felsenstein Zone”:
[See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford
et al. (1996) for a detailed explanation of various parsimony methods.]
Bootstrapping
• Evaluation of the tree reliability
• n trees are built from resampled (bootstrapped) alignments
(n = 100/1000/5000)
• Count how many times a certain branch is reproduced
• Values between 1 and 100 (%)
• If the assumptions the method is based on hold, you should
always get the same tree from the bootstrapped alignments as
you did originally
• The frequency of some feature of your phylogeny in the
bootstrapped set gives some measure of the confidence you
can have for this feature
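The resampling step can be sketched as follows. This is a minimal illustration, not a full bootstrap pipeline: `closest_pair` is a deliberately crude stand-in for a real tree-building method, all function names are hypothetical, and the three sequences are the Species A, B, and C rows of the character alignment.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[j] for j in columns) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, feature, n_replicates=100, seed=0):
    """Percentage of replicates whose tree shows `feature`.

    build_tree : callable(alignment) -> tree   (any tree-building method)
    feature    : callable(tree) -> bool        (e.g. "A and B are grouped")
    """
    rng = random.Random(seed)
    hits = sum(
        feature(build_tree(bootstrap_alignment(alignment, rng)))
        for _ in range(n_replicates)
    )
    return 100.0 * hits / n_replicates


def closest_pair(alignment):
    """Stand-in 'tree builder': the pair of taxa with the fewest differences."""
    names = sorted(alignment)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]

    def differences(pair):
        return sum(x != y for x, y in zip(alignment[pair[0]], alignment[pair[1]]))

    return min(pairs, key=differences)


alignment = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
}
support = bootstrap_support(
    alignment, closest_pair, lambda pair: pair == ("A", "B"), n_replicates=200, seed=1
)
print(support)
```

The printed value is the percentage of the 200 resampled alignments in which A and B still come out as the closest pair. In practice `build_tree` would be the same method used for the original tree, and `feature` would test for a particular branch.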
Phylogeny in medical forensics: HIV
• A dentist who was infected with HIV was suspected of infecting some
of his patients in the course of treatment
• HIV evolves very quickly (about 10^-3 substitutions per site per year)
• Possible to trace the history of infections among individuals by
conducting a phylogenetic analysis of HIV sequences
• Samples were taken from dentist, patients, and other infected
individuals in the community
• Study found 5 patients had been infected by the dentist
Source: Ou et al. 1992. Molecular epidemiology of HIV transmission in a dental practice. Science, 256: 1165-1171.
Did the Florida Dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the dentist, his patients,
and local HIV-infected people (local controls).]
Yes: The HIV sequences from patients A, B, C, E, and G fall within
the clade of HIV sequences found in the dentist.
No: The HIV sequences from patients D and F group with the local
controls instead.
From Ou et al. (1992) and Page & Holmes (1998)