No Slide Title

Download Report

Transcript No Slide Title

Methods of Tree
Reconstruction
Dan Graur
1
2
3
4
5
Molecular phylogenetic approaches:
1. distance-matrix (based on distance measures)
2. character-state (based on character states)
3. maximum likelihood (based on both character states
and distances)
6
DISTANCE-MATRIX METHODS
In the distance matrix methods, evolutionary
distances (usually the number of nucleotide
substitutions or amino-acid replacements between
two taxonomic units) are computed for all pairs of
taxa, and a phylogenetic tree is constructed by
using an algorithm based on some functional
relationships among the distance values.
7
Multiple Alignment
GCGGCTCA
GCGGCCCA
GCGTTCCA
GCGTCCCA
GCGGCGCA
***
**
TCAGGTAGTT
TCAGGTAGTT
TC--CTGGTT
TCAGCTAGTT
TTAGCTAGTT
*
* ***
GGTG-G
GGTG-G
GGTGTG
GTTG-G
GGTG-A
* **
Spinach
Rice
Mosquito
Monkey
Human
8
Compute pairwise distances by
correcting for multiple hits at a single
sites
Number of differences
Number of changes (e.g., number of nucleotide
substitutions, number of amino acid
replacements)
9
Distance Matrix*
Spinach
Rice
Mosquito
Monkey
Human
Spinach
0.0
Rice
9
0.0
Mosquito Monkey Human
106
91
86
118
122
122
0.0
55
51
0.0
3
0.0
*Units: Numbers of nucleotide
substitutions per 1,000 nucleotide sites
10
Distance Methods:
UPGMA
Neighbor-relations
Neighbor joining
11
UPGMA
Unweighted pair-group method with arithmetic means
12
UPGMA employs a sequential clustering
algorithm, in which local topological
relationships are identified in order of decreased
similarity, and the tree is built in a stepwise
manner.
13
simple OTUs
14
composite OTU
15
16
17
UPGMA yields the correct answer only
if the distances are ultrametric!
Q: What happens if the distances are
only additive?
Q: What happens if the distances are not
even additive?
18
Neighborliness methods
The neighbors-relation method
(Sattath & Tversky)
The neighbor-joining method
(Saitou & Nei)
19
In an unrooted bifurcating tree, two
OTUs are said to be neighbors if they
are connected through a single
internal node. Neighbors ≠ Sister Taxa
20
If we combine OTUs A and B into one composite
OTU, then the composite OTU (AB) and the
simple OTU C become neighbors.
21
A
C
B
+
<
D
+
=
+
Four-Point Condition
d(A, B) d(C, D)  d(A,C) d(B, D)  d(A,D) d(B,C)
The Neighbor
Joining Method
23
In distance-matrix
methods, it is assumed:
Similarity  Kinship
24
25
From Similarity to Relationship
Similarities among OTUs can be due to:
• Ancestry:
– Shared ancestral characters (symplesiomorphies)
– Shared derived characters (synapomorphy)
• Homoplasy:
– Convergent events
– Parallel events
– Reversals
26
Parsimony Methods:
Willi Hennig
1913-1976
27
[Entities must not be multiplied beyond necessity]
William of Occam (ca. 1285-1349)
English philosopher & Franciscan monk
William of Occam was “solemnly” excommunicated by Pope John XXII.
28
MAXIMUM PARSIMONY METHODS
Maximum parsimony involves the identification of a
topology that requires the smallest number of
evolutionary changes to explain the observed differences
among the OTUs under study.
In maximum parsimony methods, we use discrete
character states, and the shortest pathway leading to these
character states is chosen as the “best” or maximum
parsimony tree.
Often two or more trees with the same minimum number of
changes are found, so that no unique tree can be inferred.
Such trees are said to be equally parsimonious.
29
Site
__________________ __________________________
Sequences 1
2
3
4
5
6
7
8
9
_________________ ___________________________
1
A
A
G
A
G
T
T
C
A
2
A
G
C
C
G
T
T
C
T
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
T
*
invariant
*
*
30
Site
__________________ __________________________
Sequences 1
2
3
4
5
6
7
8
9
_________________ ___________________________
1
A
A
G
A
G
T
T
C
A
2
A
G
C
C
G
T
T
C
T
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
T
*
variant
*
*
31
Site
__________________ __________________________
Sequences 1
2
3
4
5
6
7
8
9
_________________ ___________________________
1
A
A
G
A
G
T
T
C
A
2
A
G
C
C
G
T
T
C
T
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
T
*
uninformative
*
*
32
Site
__________________ __________________________
Sequences 1
2
3
4
5
6
7
8
9
_________________ ___________________________
1
A
A
G
A
G
T
T
C
A
2
A
G
C
C
G
T
T
C
T
3
A
G
A
T
A
T
C
C
A
4
A
G
A
G
A
T
C
C
T
*
informative
*
*
33
34
35
36
37
In the case of four OTUs, an
informative site can only favor one of
the three possible alternative trees.
Thus, the tree supported by the largest
number of informative sites is the most
parsimonious tree.
38
Inferring the maximum parsimony tree:
1. Identify all the informative sites.
2. For each possible tree, calculate the
minimum number of substitutions at each
informative site.
3. Sum up the number of changes over all
the informative sites for each possible tree.
4. Choose the tree associated with the
smallest number of changes as the
maximum parsimony tree.
39
Maximum parsimony (Practice):
Data
1.TGCA
2.TACC
3.AGGT
4.AAGT
***
Step 1. Identify all the informative sites.
Maximum parsimony (Practice):
Data
1.TGC
2.TAC
3.AGG
4.AAG
Step 2. For each possible tree, calculate the
minimum number of substitutions at each
informative site.
41
Maximum parsimony (Practice):
Data
1.TGC
2.TAC
3.AGG
4.AAG
4
5
6
Step 3. Sum up the number of changes over
all the informative sites for each possible
tree.
42
Maximum parsimony (Practice):
Data
1.TGC
2.TAC
3.AGG
4.AAG
4
5
6
Step 4. Choose the tree associated with the
smallest number of changes as the
maximum parsimony tree.
43
Problem (exaggerated)
44
Fitch’s (1971) method for inferring nucleotides at internal nodes
The set at an internal node is the intersection
() of the two sets at its immediate
descendant nodes if the intersection is not
empty.
The set at an internal node is the union () of
the two sets at its immediate descendant
nodes if the intersection is empty.
When a union is required to form a nodal set,
a nucleotide substitution at this position must
be assumed to have occurred.
45
Fitch’s (1971) method for inferring nucleotides at internal nodes
4 substitutions
3 substitutions
46
Testing properties of ancestral proteins
The ability to infer in silico the sequence of
ancestral proteins, in conjunction with some
astounding developments in synthetic biology,
allow us to “resurrect” putative ancestral
proteins in the laboratory and test their
properties. These properties, in turn, can be
used to test hypotheses concerning the physical
environment which the ancestral organism
inhabited (its paleoenvironment).
47
Testing properties of ancestral proteins
Gaucher et al. (2003) used EF-Tu (Elongation-Factor thermounstable) gene
sequences from completely sequenced mesophile eubacteria to reconstruct
candidate ancestral sequences at nodes throughout the bacterial tree. These
inferred ancestral proteins were, then, synthesized in the laboratory, and their
activities and thermal stabilities were measured and compared to those of
extant organisms.
The temperature profile of the
inferred ancestral protein was
55°C, suggesting that the
ancestor of extant mesophiles
was a thermophile.
Thermostability curves
48
Ancestral reconstruction is not possible with morphological data.
49
The impossibility of exhaustively searching for the
maximum-parsimony tree when the number of
OTUs is large

Number of OTUs
Number of possible rooted tree

2
1
3
3
4
15
5
105
6
954
7
10,395
8
135,135
9
2,027,025
10
34,459,425
15
213,458,046,676,875
20
8,200,794,532,637,891,559,375

50
Exhaustive = Examine all trees, get
the best tree (guaranteed).
Branch-and-Bound = Examine some
trees, get the best tree (guaranteed).
Heuristic = Examine some trees, get a
tree that may or may not be the best
tree.
51
Exhaustive
52
Branchand-Bound
Rationale: The length
of a tree with n+1
OTUs can either be
equal to or larger than
the length of a tree
with n OTUs.
Reminder: The total number of
substitutions in a tree = tree
length
53
Branch
-andBound
Obtain a tree by a fast method. (e.g.,
the neighbor-joining method)
Compute numbers of substitutions
(L) for this tree.
Turn L into an upper bound value.
Rationale: the maximum parsimony
tree must be either equal in length to
L or shorter.
54
Branch
-andBound
The magnitude of the
search will depend on
the data (i.e., luck).
55
Heuristic
56
57
Likelihood
L  data| hypothesis)
• Example:
• Data:
• Hypothesis:
Coin tossing
10 tosses: 6 heads + 4 tails
Binomial distribution
58
LIKELIHOOD IN MOLECULAR
PHYLOGENETICS
L  sequences| tree)
• The data are the aligned sequences
• The model is the probability of change
from one character state to another
(e.g., Jukes & Cantor 1-P model).
• The parameters to be estimated are:
Topology & Branch Lengths
59
60
Bayesian Phylogenetics
Based on “Bayes Theorem”
Thomas Bayes (1701–1761)
A = a proposition, a hypothesis.
B = the evidence.
P(A) = the prior, the initial degree
of belief in A.
P(A|B) = the posterior, the new
degree of belief in A given B (the
evidence).
P(B|A)/P(B) = represents the
support B provides for A.