Transcript L 1

6.1
Chatper 6
Maximum-likelihood
Tree-puzzle
• The concept of likelihood refers given some
data D, a decision must be made about an
adequate explanation of the data.
• In the phylogenetic framework, one
hypotheses include the different tree
structures, the branch lengths, the parameters
of the model of sequence evolution, and so
on.
N=100
h=21
t=79
Binominal distribution
Likelihood function
• Fig6.1
• For ease of computation, first compute the
logarithm of the likelihood function, which
results in sums rather than products:
• It is known that the maximum of a function
y=f(x) – when it exists- is given by the value x
for which the first derivative of the function
equals zero.
• This derivative is equal to zero if θ=h/n ,
positive for smaller values of θ, and
negative for larger values, so that log [L(θ)]
attains its maximum when θ^ = h/n
.
Thus θ^ =h/n is the maximum-likelihood
estimate (MLE) of the probability of
observing a heads in a single coin toss.
Odds ratio • L(21/100) ~ 0.0975
• L( 1/2 ) ~ 1.61 x 10-9
 : 0.21 vs 0.5
0.0975/ 1.61x 10-9 ~ 6 x 107
In evolution, point mutations are considered chance
events, just like tossing a coin. therefore, at least in
principle, the probability of finding a mutation along
one branch in a phylogenetic tree can be calculated by
using the same maximum-likelihood framework.
• The main idea behind phylogeny inference
with maximum likelihood is to determine the
tree topology, branch lengths, and
parameters of the evolutionary model (e.g.,
transition/ transversion ratio, base frequencies,
rate variation among sites) that maximize the
probability of observing the sequences at
hand. In other words, the likelihood function
is the conditional probability of the data
given a hypothesis (i.e., a model of
substitution with a set of parameters θ and
the tree τ, including branch lengths):
6.2
A tree with two taxa has only on branch
connecting the two sequences; the sole
purpose of the exercise is reconstructing the
branch length that produces the data with
maximal probability.
A
B
• 6.2.1 The simple case: Maximum-likelihood
tree for two sequences
• The alignment has length l for the two
sequences Si = (si(1), …, si ( j )), (i = 1, 2),
where si ( j ) is the nucleotide, the amino acid,
or any other letter from a finite alphabet at
sequence position j in sequence i.
GATCATC……..ATCATAAAATTTACGCA
GATACCC……..ATCAATAAATTTACCCA
• Identical pairs of nucleotides (l0) and the
number of different pairs (l1), where l0 + l1= l.
GATCATC……..ATCATAAAATTTACGCA
GATACCC……..ATCAATAAATTTACCCA
GAACCTC……..AACATAAAATTTAGCCA
…..
• First, it is assumed that each site s in the
alignment evolves according to the same
a tree t
model M.
• The assumption also implies that all sites
evolve at the same rate μ.
• The rate at a site is modified by a rate-specific
factor, ρj>0. …..The probability of a certain
site pattern are available.
• Pr [Ds, τ, M, ρj], j =1, …l
(6.11)
l
L (τ, M, ρ) = Pr[D, τ, M, ρ]=P Pr[D,
τ, M,
ρ]
j
j=1
D=(D1, D2, D3, …, Dl)
j
(6.12)
• First, for a fixed choice of τ, M, and the site rate
vector ρ, the probability to observe the alignment
D can be computed with Equation 6.11.
• Second, for a given alignment D, Equation 6.12
can be used to find the MLEs.
• It is assumed that the site-specific rate factor ρj is
drawn from a Г- distribution with expectation 1
and variance 1/α.
6.3
• Consider the tree τ with its branch lengths
(i.e., number of substitutions), the model of
sequence evolution M with its parameters
(e.g., transition/ transversion ratio, stationary
base composition), and the site-specific rate
factor ρj = 1 for each site j. The goal is to
compute the probability of observing one
of the 4n possible patterns in an alignment
of n sequences.
s1
s3
d1
d3
d5
s6
d2
s2
s5
d4
s4
• It is assumed that evolution started from
sequence S0 and then proceeded along the
branches of tree τ with branch lengths d1, d2,
d3, d4, and d5. To compute Pr [Dj, t, M, 1] for a
specific site j, where Dj = (s1, s2, s3, s4) are the
nucleotides observed, it is necessary to know
the ancestral states s0 and s5 . The conditional
probability of the data given the ancestral
states then will be as follows:
• However, in almost any realistic situation, the
ancestral sequences are not available. Therefore,
one sums over all possible combinations of
ancestral states of nucleotides.
• The sum can be efficiently assessed by
evaluating the likelihoods moving from the end
nodes of the tree to the root. In each step, two
nodes from the tree are removed and replaced by
a single node.
6.3.1
• To generalize this equation for more than four
sequences, it is necessary to sum all the
possible assignments of nucleotides at the n - 2
inner nodes of the tree.
• Let Dj = (s1, s2, s3,…, sn ) be a pattern at a site
j , with tree τ and a model M fixed.
Nucleotides at inner nodes of the tree are
abbreviated as xi, i = n + 1, …, 2n – 2.
6.3.1
• For an inner node i with offspring o1 and o2,
the vector Lij = (Lij = (A), Lij = (C) Lij = (G) Lij
= (T)) is defined recursively as where d01 and
d02 are the number of substitutions connecting
node i and its descendants in the tree
It is assumed that the node 2n – 2 has three offspring:
o1 and o2, and o3, respectively.
s1
s3
d1
d3
d5
s6
d2
s2
s5
d4
s4
6.4
• Finding those branch lengths for tree τ maximizing
the log-likelihood function.
• When computing the maximum-likelihood tree, the
model parameters and branch lengths have to be
computed for each tree, and then the tree that
yields the highest likelihood has to be selected.
Because of the numerous tree topologies, testing all
possible trees is impossible.
6.4
• Thus, various heuristics are used to suggest
reasonable trees, including stepwise addition
(e.g., used in Felsenstein’s PHYLIP package:
program DNAML. exe) and star decomposition
MOLPHY, as well as the neighbor-joining (NJ)
algorithm.
6.4.1
• Given a set of n aligned nucleotide sequences,
any group of four of them is called a quartet.
The quartet–puzzling algorithm analyzes all
possible quartets in a data set.
6.4.1
• In essence, the algorithm is a three–step
procedure. The first step, the so-called
maximum-likelihood step computes for each
n!
n
of the ( 4) = 4!(n-4)! possible quartets the
maximum-likelihood values L1, L2, and L3 for
the three possible four-sequence trees T1, T2, T3.
n)
(
· 4
• The resulting list of 3
likelihoods is then
used in the quartet-puzzle step to compute an
intermediate tree by inserting sequences
sequentially in an already reconstructed
subtree.
• eventually, sequence E is inserted at the
branch with minimal penalty.
• For large data it is not feasible to compute all
intermediate trees. Thus the quartet-puzzle step
is repeated at least a thousand times for
various input orders of sequences to avoid
reconstruction artifacts due to the ordering of
the sequences and to get a representative
collection of trees.
• Finally, in step three, the majority -rule
consensus is computed from the resulting
intermediate trees. The resulting tree is called
the quartet-puzzling tree. The consensus step
provides information about the number of times
a particular grouping occurred in the
intermediate trees. This so-called reliability
value, or support value, measures (in %) how
frequently a group of sequences occurs among
all intermediate trees.
6.5
• The accepted strategy is to infer a “reasonable”
tree topology with faster–reconstruction
methods and use that tree to estimate the
parameters. Eventually, a maximum likelihood
tree can be re-estimated with the new set of
parameters. This approach assumes that
parameter estimates are not greatly disturbed
when using a slightly incorrect topology.
Among the fast distance-based tree
reconstruction methods, NJ. It has been shown
that the NJ tree is always similar to the true tree.
1. Based on reasonable pairwise “genetic distance,
estimates” an NJ tree is computed.
2. Then, maximum-likelihood branch lengths are
computed for this tree topology and parameters of
the sequence evolution are estimates.
3. Based on these estimates, a new NJ tree is computed
and Step 2 is repeated.
Step (2) and (3) are repeated until the estimates of the
model parameters are stable.
TREE-PUZZLE employs this idea to obtain approximate
estimates of model parameters, saving computation
time and still serving as an efficient tool to estimate
model parameters.
6.6
• Approach also may be used to study the amount
of evolutionary information contained in a data
set.
• If L1, L2, and L3 are the likelihoods of trees T1,
T2, and T3, then it is possible to compute the
posterior probabilities of each tree Ti as pi =
Li/(L1 +L2 +L3), where the pi terms sum to 1
and 0 < pi <1 for each i. The probabilities p1, p2,
and p3 can be reported simultaneously as a
point P lying inside an equilateral triangle, each
corner of the triangle representing one of the
three possible tree topologies.
• If P is close to one corner – for example, the
corner T1 – the tree T1 receives the highest
support. In a maximum-likelihood analysis, the
tree Ti, which satisfies pi = max {p1, p2, p3}, is
selected as the MLE. However, this decision is
questionable if P is close to the center of the
triangle. A more realistic representation of the
data is a star-like tree rather than an artificially
strictly bifurcating tree.
• Therefore, the likelihood-mapping method partitions
the area of the equilateral triangle into seven regions.
The three trapezoids at the equilateral triangle into
seven regions.
• The three trapezoids at the corners represent the areas
supporting strictly bifurcating trees.
• The three rectangles on the sides represent regions
where the decision between two trees is not obvious.
• The center of the triangle represents sets of P vectors
where all three trees are poorly supported.
• The three likelihoods for the three tree topologies of
each possible quartet are reported as a dot in an
equilateral triangle like the one.
• The distribution of dots in the seven major areas of the
triangle gives an overall impression of the treelikeness of the data.
• That is informative about the mode of evolution of the
sequences under investigation is the percentage of dots
belonging to the three main different areas in the
equilateral triangle.
• The three corners represent fully resolved tree
topologies. The presence of tree-like phylogenetic
signal in the data. The center is the area of star-like
phylogeny.
• The three areas on the sides represent network-like
phylogeny, in which the data support conflicting tree
topologies.