Efficient Estimation of the Accuracy of Maximum Likelihood

Download Report

Transcript Efficient Estimation of the Accuracy of Maximum Likelihood

Taxon Sampling for
Ancestral State Reconstruction
Louxin Zhang
Department of Mathematics
Nat’l University of Singapore
Little Background
to Ancestral Sequence Reconstruction
• Ancestral sequence reconstruction incorporates
sequences from modern organisms into evolutionary
models to estimate the corresponding sequence of
an ancestor that no longer presents on Earth.
• This approach to understanding proteins or life in
general was proposed by Zuckerkandl and Pauling in 1963.
• It has become an popular approach to studying
the origin, evolution, sequence-function relationship
of proteins, genes and other components of life.
After ~ 100 million years
placental
mammals
source: Carnegie Museum of Natural History
How did we become
human?
The differences are due to
the changes in our DNA
(Slide from J.Ma)
human: ~3.1G, 23 chromosomes
chimp: ~3.3G, 24 chromosomes
mouse: ~2.7G, 20 chromosomes
4
dog: ~2.5G, 39 chromosomes
source: U.S. DOE
Recreate Genome of Ancient Human Ancestor
•“Boreoeutherian ancestor”
lived 70 million yrs ago.
• The boreoeutheria was formed
by a series of speciation events
occurring rapidly after the
ancestor, leading to
a star-like phylogeny of
boreoeutheria.
• Computer simulation suggests
a small number of extant genomes
can give a highly accurate
reconstruction of this ancient
genomes. (Blanchette et al’04, Ma et al’07)
Ancestral State Reconstruction Problem
• Given a phylogenetic tree T of a character the evolutionary history of the character
and
an evolutionary model - prior distribution of
all possible states at the root and substitution
probability on each branch of T
• Estimate the root state from the leaf states of
the character in the tree T.
Part I: Methods
for Ancestral State Reconstruction
• Fitch (parsimony) method
Step 1: Compute a subset Sx of letters for each node x
with children y and z as follows
S y  S z if S y and S z are disjoint
Sx  
 S y  S z otherwise
Step 2: Select a letter from the subset obtained at the
root randomly.
G
A
{G}U{ A}
{G, A}∩{G}
G
A
G
G
A
G
A
{G}U{ A}
{G} ∩{G, A}
G
G
• Fitch method
- It assigns a state to the root by minimizing
the total number of substitutions placed on
all branches
-- It is a local and then efficient method.
-- But, it ignores substitution rate on branches
and is very sensitive to the topology of the tree,
leading to several limitations
• Marginal maximum likelihood (ML) method
- It assigns to the root a state a that has the
maximum likelihood defined as
Pr[ the root state is a | the given states at leaves ]
and a tie is broken arbitrarily.
-- It is a global method and so less efficient.
Approximation algs are studied.
-- But, has the largest reconstruction accuracy,
over all methods.
A Reconstruction Example
Consider the following evolutionary model
( Tree + conservation rates + prior probabilities)
for a character of two states 0 and 1.
pprior(0)=pprior(1)=0.5
r
0.9
0.8
0
0.9
1
0.9
1
0.8 means that
a state remains
unchanged
with probability 0.8.
For root state sr=0,
0
0
0.9
1
0.8
0.9
0
1
0.8
0.9
1
0
0.9
0
0.9
1
0.9
1
Pr[ 011 | sr=0 ]
= 0.8x0.1x0.9x0.9 + 0.8x0.9x0.1x0.1=0.072
For root state sr=1,
1
1
0.9
1
0.8
0.9
0
1
0.8
0.9
1
0
0.9
0
0.9
1
0.9
1
Pr[ 011 | sr=1 ]
= 0.2x0.9x0.9x0.9 + 0.2x0.1x0.1x0.1=0.1464
r
pprior(0)=pprior(1)=0.5
0.9
0.8
0
0.9
0.9
1
1 Bayes formula:
Pr[sr  a | 011] 
Pr[ 011 | sr=0] = 0.072
Pr[ 011 | sr=1] = 0.1462
Pr[011| sr  a]  p prior (a)
Pr[011]
0.072 0.5
Pr[011]
0.146 0.5
Pr[sr  1 | 011] 
Pr[011]
Pr[sr  0 | 011] 
The marginal ML method selects 1 as the root state from
leaves states 0 1 1.
Part II: Reconstruction Accuracy
-- Definition
For a reconstruction method M, its accuracy is
the expected probability that the method
reconstructs correctly the root state from a
possible configuration D of states of leaf species:
RAM( T )
= ∑c,D Pr[c evolves into D] Pr[M reconstructs c from D]
= ∑c,D pprior(c) Pr[D|c] Pr [M reconstructs c from D]
{0}
{0}
{0, 1}
{0}
0
0
0
0
Fitch selects 0 as
the root state
0
Fitch selects 0 as
the root state
{0, 1}
{0, 1}
{1}
{0}
0
1
0
1
1
1
0
Fitch selects 0 as
the root state with prob 1/2
Fitch selects 0 as
the root state with prob 1/2
Accuracy of Fitch method:
0.8
RAF( T ) = ∑c,D pprior(c) Pr[D|c] Pr [F reconstructs c from D]
r
= ∑D Pr[D|0] Pr [F reconstructs 0 from D]
0.9
= 0.584+0.072+0.072+0.146x0.5+0.072x0.5
= 0.837
0.9
0.9
Selected
Pr[D| sr = 1]
0
0
0
1
0
1
1
1
pprior(0)=pprior(1)=0.5
0
0
1
0
1
0
1
1
0
1
0
0
1
1
0
1
0.018
0.018
0.018
0.072
0.146
0.072
0.072
0.584
Pr[D| sr = 0]
0.584
0.072
0.072
0.146
0.072
0.018
0.018
0.018
root state
0
0
0
0 or 1(prob1/2)
0 or 1(prob1/2)
1
1
1
Accuracy of ML:
0.8
pprior(0)=pprior(1)=0.5
RAML( T ) = ∑c,D pprior(c) Pr[D|c] Pr [ML reconstructs c from D
r
= ∑D Pr[D|0] Pr [ML reconstructs 0 from D]
0.9
= 0.584+0.072 + 0.072 + 0.146
= 0.874
0.9
0.9
Selected
Pr[D| sr = 1]
0
0
0
1
0
1
1
1
0
0
1
0
1
0
1
1
0
1
0
0
1
1
0
1
0.018
0.018
0.018
0.072
0.146
0.072
0.072
0.584
Pr[D| sr = 0]
0.584
0.072
0.072
0.146
0.072
0.018
0.018
0.018
root state
0
0
0
0
1
1
1
1
For the ML method and tree H,
RAML( H )
= ∑c,D Pr[c evolves into D] Pr[ML reconstructs c from D]
= ∑D ∑c Pr[c evolves into D] Pr[ML reconstructs c from D]
= ∑D maxc Pr[c evolves into D]
c
D:
A
G
C
G
G
Part II: Reconstruction Accuracy
-- Monotonicity
When reconstructing the state of the common ancestor
for a group of organisms, one would expect that the
accuracy will increase with the number of organisms
used.
However, this is not always true over a phylogeny
for a method?
In other words, more organisms do not necessarily
give better estimation for ancestral state.
Theorem: The accuracy function of the
Fitch method is not monotonic.
Consider the following tree (Li, Steel, Zhang’08)
r
pprior(a)=pprior(b)=0.5
0.8
0.9
0.9
0.9
The accuracy of reconstruction from all leaves is 0.866,
while the accuracy of reconstruction from the left leaf
is 0.9
Does this counterintuitive fact occur often?
A counterexample with ultrametric tree
in which the root is equally far from all leaves.
The accuracy of reconstruction from all leaves is 0.915268;
the accuracy of reconstruction
14.1190
from a, i, b, e is 0.921926
10.0613
9.7109
24.9926
17.5103
21.7263
47.8447
1

t 

200

p (t )  0.5  1  e



50.0000
More taxa are not necessarily good in ancestral sequence
reconstruction when the Fitch method is applied.
More sequences data do not always lead to the
true phylogeny when the parsimony method is used.
There are probably two reasons for these counterintuitive facts
-- It ignores character change rate on all branches;
-- the Fitch method is a kind of ‘local method’ .
Theorem: The maximum likelihood (ML) method
has the largest reconstructing accuracy over all
methods for any tree and evolution model.
Corollary: The accuracy function of the ML method
is monotonic
Proof of Corollary.
-- Using a subset of leaves is just a specific reconstruction
method that does not use letter information in the other leaves
-- hence its accuracy is not higher than the reconstruction
from all the leaves when ML is used.
Proof of Theorem: For any method M and tree H,
RAM( H )
= ∑c,D Pr[c evolves into D] Pr[M reconstructs c from D]
= ∑D ∑c Pr[c evolves into D] Pr[M reconstructs c from D]
≤ ∑D ∑c (maxc Pr[c evolves into D]) Pr[M reconstructs c from D]
= ∑D (maxc Pr[c evolves into D]) {∑c Pr[M reconstructs c from D]}
= ∑D (maxc Pr[c evolves into D])
= RAML( H )
c
D:
A
G
C
G
G
Part III: Reconstruction Accuracy
-- Computation
Is the reconstruction accuracy RAM (T)
polynomial-time computable for any phylogeny
T , given a simple Markovian evolutionary
model (say, Juke-Cantor ) and a method M?
Theorem: The reconstruction accuracy is
linear-time computable for Fitch method.
Proof :
1
RA F (T , M )   (1  D(r ))
2
r
Z
Y
X
Computing the accuracy of ML
It is unknown how to compute the accuracy of ML
in polynomial time.
Theorem 1 (B. Ma & Zhang’09).
For any n-leaf tree T in which a binary character
changes with probability at most q<1/2
on each branch, RAML(T) can be approximated
within ratio 1- ε in O(n N4) for any ε,
where N= 2(n  1) 2 ln q / ln1     2(n  1) 2 ln(1 )  1
q

Some notations
r:
the root of tree T;
D:
state configuration of leaves;
pprior (a):
the prior probability of a root state a;
Pr [ D | sr=a] : the probability that root state a
evolves into states in D;
Pr [ sr =a | D]: the probability that the root state is a
given that states in D are observed in leaves.
It is called likelihood of the state given D.
Lemma 1. Let T be phylogeny of k (>1) leaves with root
r. For any configuration D of leaves of T and c,
Pr[D | sr  c]  q
2( k 1)
Proof. It is based on the following facts:
(1) T has 2(k-1) edges;
(2) State c can evolve into D in 2k-2 ways. For
each way E,
Pr[D | sr  c, E] 
2( k 1)
Pr[
s
|
s
]

q
 u v
(u ,v )E (T )
Method Summary


RAML(T)=  max p prior (c) Pr[D | sr  c]
D
c
-- Discretize the probability space
x0=q2(n-1) , xi+1 = xi/(1-ε) 1/(n-1), i<N; xN=1
--- Calculate
the sum of the estimates of the
mutational probability recursively.
Remarks
• Our approach is superior to the sampling
approach since it can also be used to estimate
efficiently the reconstruction error rate.
• Both computing the accuracy and the error
rate for ML method have unknown complexity
status.
Are they NP-hard?
Part IV: Reconstruction Accuracy
-- Taxon Selection
-- Fitch method is efficient, but does not monotonic
accuracy ;
-- ML method has monotonic accuracy,
but computationally intensive
K-Taxa Selection Problem:
Input:
An evolutionary model (
A tree + substitution rate on branches),
a method M, and an integer k>0;
Objective: Find k taxa from which the root
state can be reconstructed with the highest
accuracy when M is used.
Question: Is the K-Taxa Selection Problem NP-hard?
For Fitch method, the bucket idea can be used to
give a good approximation to the K-taxa selection problem
(E. Mossel’08)
Question: How to generalize bucket idea
to study the K-Species Selection Problem
for ML method?
Forward Greedy Algorithm
Backward Greedy Algorithm
Performance comparison
The accuracy of
reconstruction from all
taxa - the accuracy of
reconstruction from a selected
subset of taxa
Performance on the greedy algorithms on random trees with
16 leaves in a Yule model
Performance comparison
The accuracy of
reconstruction from all
taxa - the accuracy of
reconstruction from a selected
subset of taxa
Performance on the greedy algorithms on random
ultrametric trees
DNA Sequence data:
Estimating Boreoeutherian Seq
( from Adam Siepel)
Backward Greedy Algorithm
Forward Greedy Algorithm
Lessons learnt from our experiments
-- The backward greedy selection outperforms
the forward greedy selection.
-- When the small number (<10) of taxa are used,
the reconstruction accuracy varies in a wide range
especially for the Fitch method.
This large variability of accuracy indicates that
caution should be exercised in drawing conclusions
on an ancestral sequence reconstructed from
a small number of taxa.
-- With around 15 organisms, the boreoeutherian ancestral
sequence can be reconstruction with high resolution.
Summary
Fitch method
Monotonicity of the accuracy
Computing the accuracy
Taxon selection problem
Maximum Likelihood
No
Yes
Linear-time algorithm
Good poly-time approx. alg.
NP?
Good poly-time approx. alg.
NP?
?
NP?
Backward Selection Method
Future Work:
Extend our work to general evolution models where
more evolutionary events are considered
Acknowledgement
Guoling Li
Genome Institute of Singapore
Jian Ma
UC Santa Cruz, USA
Bin Ma
University of Waterloo , Canada
Mike Steel
University of Canterbury, New Zealand
Thank You