Modelling heterogeneity in
multi-gene data sets
Klaus Schliep, Barbara Holland,
Mike Hendy, David Penny
Allan Wilson Centre
Palmerston North, NZ
Motivation
• Phylogenomic data sets may involve hundreds of
genes for many species.
• These data sets create challenges for current
phylogenetic methods, as different genes have
different functions and hence evolve under
different processes.
• One question is how best to model this
heterogeneity to give reliable phylogenetic
estimates of the species tree.
Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
S. cerevisiae
S. paradoxus
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
S. castellii
C. albicans
Two extremes
• How many parameters do we need to
adequately represent the branches of all
(unrooted) gene trees? (An unrooted binary
tree for 8 taxa has 2 × 8 − 3 = 13 branches.)
• Somewhere between
13 (a single consensus tree)
&
13 × 106 = 1378 (independent branch lengths for every gene)
• Too few parameters introduce bias
• Too many parameters increase the variance
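The two bounds above can be checked with a few lines (a minimal sketch; the 2n − 3 branch count for unrooted binary trees is standard, the function name is ours):

```python
# An unrooted binary tree on n taxa has 2n - 3 branches, so a single
# consensus tree for 8 taxa needs 13 branch-length parameters, while
# giving each of the 106 genes its own tree needs 106 times as many.
def n_branches(n_taxa: int) -> int:
    return 2 * n_taxa - 3

n_taxa, n_genes = 8, 106
lower = n_branches(n_taxa)   # 13: one shared consensus tree
upper = lower * n_genes      # 1378: independent branch lengths per gene
print(lower, upper)          # prints "13 1378"
```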
Stochastic partitioning
• Attempts to cluster genes into classes that
have evolved in a similar fashion.
• Each class is allowed its own set of
parameters (e.g. branch lengths or the
model of nucleotide substitution).
Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise the parameters for each class.
3. Compute the posterior probability of each
gene under the parameters of each class.
4. Move each gene into the class for which it has
the highest posterior probability.
5. Go to step 2; when no genes change class,
STOP.
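The loop above can be sketched as follows. This is a toy stand-in, not the authors' implementation: each "gene" is a feature vector, each class is summarised by a mean, and the per-class log-likelihood is a spherical Gaussian; in the real method the class parameters would be branch lengths or substitution-model parameters and the likelihood a phylogenetic one.

```python
import numpy as np

def stochastic_partition(genes, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # 1. Randomly assign the n genes to k classes.
    labels = rng.integers(0, k, size=len(genes))
    for _ in range(max_iter):
        # 2. Optimise the parameters for each class (here: the class mean).
        means = np.array([genes[labels == c].mean(axis=0)
                          if np.any(labels == c)
                          else genes[rng.integers(len(genes))]
                          for c in range(k)])
        # 3. Log-likelihood of each gene under each class's parameters
        #    (spherical Gaussian, so just negative squared distance).
        loglik = -((genes[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        # 4. Move each gene into its highest-posterior class.
        new_labels = loglik.argmax(axis=1)
        # 5. Stop when no genes change class.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means

genes = np.array([[0.0], [0.2], [5.0], [5.2]])  # four "genes", two regimes
labels, means = stochastic_partition(genes, k=2)
print(labels)
```

At convergence the assignment is a fixed point: recomputing the class parameters and reassigning every gene leaves the partition unchanged.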
How many classes?
Gene ontology
A different approach
• Allow each gene to have its own set of
parameters
• BUT penalise models where the
parameters are too different from each
other.
Penalized (log-)likelihood
pl(θ, x) = Σ_{i=1..106} l(θ_i, x_i) − ½ λ g(θ)

g(θ) = Σ_{i<j} (θ_i − θ_j)^T K (θ_i − θ_j)

where θ_i are the parameters for the i-th gene tree,
K is a symmetric matrix, and λ is the penalty term.
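A direct transcription of these two formulas (our own toy setup: theta is an (n_genes, p) array of per-gene parameters, `loglik` is a placeholder per-gene log-likelihood, K is a symmetric p × p matrix):

```python
import numpy as np

def penalty(theta, K):
    # g(theta) = sum over pairs i < j of
    #            (theta_i - theta_j)^T K (theta_i - theta_j)
    n = len(theta)
    return sum((theta[i] - theta[j]) @ K @ (theta[i] - theta[j])
               for i in range(n) for j in range(i + 1, n))

def penalized_loglik(theta, data, loglik, K, lam):
    # pl(theta, x) = sum_i l(theta_i, x_i) - 1/2 * lam * g(theta)
    return (sum(loglik(t, x) for t, x in zip(theta, data))
            - 0.5 * lam * penalty(theta, K))
```

With λ = 0 every gene is fitted independently; as λ grows, the penalty forces all θ_i together, approaching the single consensus-tree extreme.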
Number of parameters
• Hastie and Tibshirani (1990) give an
approximation for the number of degrees
of freedom of a penalized likelihood
estimator:

df = tr((H + λK)^{-1} H),  where H = ∂²l(θ, x) / ∂θ²
• This allows us to choose the best λ value
using AIC or BIC.
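A sketch of that selection step, under stated assumptions: H and K below are small illustrative symmetric positive-definite stand-ins, not real phylogenetic Hessians.

```python
import numpy as np

def effective_df(H, K, lam):
    # Effective degrees of freedom of the penalised fit,
    #     df = tr((H + lam*K)^{-1} H)
    # following Hastie & Tibshirani (1990).
    return np.trace(np.linalg.solve(H + lam * K, H))

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)   # stand-in information matrix (SPD)
K = np.eye(5)

# lam = 0: no penalty, df equals the raw parameter count (5 here);
# increasing lam shrinks df towards 0. One could then scan a grid of
# lam values and pick the one minimising AIC = -2 * pl + 2 * df.
for lam in (0.0, 1.0, 10.0):
    print(lam, effective_df(H, K, lam))
```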
Summary
• Tame statisticians are useful too!