Welcome to Comp 665 - UNC Computational Genetics

Download Report

Transcript Welcome to Comp 665 - UNC Computational Genetics

Getting Parameters from data
2015/7/17
Comp 790– Coalescence with Mutations
1
1. Introduction
• No fun
2. Estimating θ
• Unlike ρ,θ does not shape the genealogy, it modifies genetic
types
• Mutation-rich samples allow for more accurate estimation of
‘genealogical shape’-parameters.
• More segregating sites -> More information
2. Estimating θ
• 2.1 Watterson’s estimator
– One of the two very popular estimators
– The estimator has mean θ:
2. Estimating θ
• The estimator is unbiased and variance decreases with
increasing ρ: recombination breaks up linkage and reduces
correlation between sites
• Under exponential growth, it is downwards biased. Under
migration, it is upwards biased, because the MRCA tends to
be pushed further back in time.
2. Estimating θ
• 2.2 Tajima’s estimator
– πij: The number of sites that differ between sequence i and
j.
– πij has mean θ, because πij is the number of segregating
sites in a sample of size two.
– As a consequence,
2. Estimating θ
2. Estimating θ
• Code for Watterson Estimator
2. Estimating θ
2. Estimating θ
2. Estimating θ
2.3. Fu 1994
2. Estimating θ
2.3. Fu 1994
2. Estimating θ
Fu’s other estimators:
i-Mutation: only good with
large n
UPBLUE: does not generalize
to settings with recombination
3. Estimating ρ
• Estimating ρ is hard, both statistically and computationally.
– Using infinite side model, all mutation events can be listed
whereas all recombination events cannot. A recombination
can only be inferred with certainty if all four gametes are
present.
– The number of possible genealogical relationship between
sequences subject to recombination is unlimited.
3. Estimating ρ
• Both TM and HM might overlook important information in the data. They
only provides lower bounds to the number of recombination events. (TM:
Least number of gene trees required to explain the sample. HM: 5.11.3)
• Assume two non-recombining loci with rate ρbetween them.
• Likelihood function:
L(θ1, θ2, ρ) = P θ1, θ2, ρ(S1,n=k1,S2,n=k2)
• For n=2, the two extreme cases:
• L(θ1, θ2, 0) =
•
L(θ1, θ2, ∞) =
3. Estimating ρ
• Assume u1=u2, two genes are of same length, then θ1=θ2. If k1=1 and
k2=7,
L(θ1, θ2, 0)=0.0014
L(θ1, θ2, ∞) =0.0067
(θL = 4 is the maximum likelihood estimator for both ρ=0 and ∞)
The likelihood supports two unlinked loci(ρ= ∞) more than two completely
linked loci(ρ=0).
 ρ>0 even though the data passes the four gamete test and has TM = HM
= 0.
 recombination can be inferred even in the absence of incompatibilities.
• If k1=k2=4, the likelihood supports two complete linked loci.
3. Estimating ρ
• Recombination is difficult to take into account in an analysis and it is
tempting to ignore it or assume that its effects are minor. Unfortunately it
is not true.
• 3.1 Estimators based on summary statistics
– Wakeley’s estimator: Wakeley(1997)
• A complicated function of a complicated function of ρ. ( the form is
fully know though )
• Large variance. The expectation doesn’t strongly depend on ρ.
– Likelihood and summary statistics: Wall(2000)
• Infer ρ based on the likelihood of (Sn, Kn, TM) where Kn is the
number of haplotypes in the sample.
3. Estimating ρ
• 3.2 Pseudo-likelihood estimators: Hudson (2001b), Fearnhead &
Donnelly(2001)
– Consider all pairs of segregating sites, ignoring all non-polymorphic
sites
– Let nij denote the vector of gamete counts for sites i and j (00,01,…).
Let ρij be the scaled recombination rate:
– Probability of obtaining nij given that both i,j are polymorphic
– The proposed pseudo-likelihood function is:
– Then to estimate ρ from this pseudo-likelihood.
• It depends on ρ only. Because ρij is assumed proportional to sequence
length.
3. Estimating ρ
• It is pseudo:
– 1.Only likelihood of pairs of segregating sites are considered
– 2. Pairs are treated as independent of each other
– 3.The likelihood of a pair is conditioned on the pair being segregating
in both loci
4. Monte Carlo methods
• The principle: throw dice several times and calculate the average
– Var(g(X))/M
• Use Monte Carlo integration to find P(Sn=k) : A naïve approach
– Simulate genealogies of n genes, add mutations and count the number
of times a genealogy has exactly k mutations.
– Many simulated genealogies will not contribute to the sum.
4. Monte Carlo methods
• A better approach: Write P(Sn=k) in the form of an integral
– Rearrange the terms and we can get:
• Where X is gamma distributed with parameters k+1 and θ/2
– Or:
• Where Ln is the sum of all branches in the coalescent tree.
– It’s better because every simulated values counts.
2015/7/17
Comp 790– Continuous-Time Coalescence
21
4. Monte Carlo methods
2015/7/17
Comp 790– Continuous-Time Coalescence
22
4. Monte Carlo methods
• 4.1 Likelihood curve
– Monte Carlo methods becomes more useful in evaluating P(Sn=k) for a
whole range of θ values
– E.g.: to calculate L(θ) = Pθ(Sn=k) for a large range of θs and single out
the θ value with highest probability.
– Recall:
– Simulate y1,….yM from Ln and calcuate the empirical average:
– Note that only one set of simulations is performed and is used to
calculate the likelihood for all θ
2015/7/17
Comp 790– Continuous-Time Coalescence
23
4. Monte Carlo methods
• Alternatively, recall:
• one can extend it this way: consider some fixed θ0. for any θ
• Using Monte Carlo technique, the integral can be approximated by:
• Where x1,…xM are M values obtained from the proposal distribution
gamma(k+1, θ0). θ0 is called the ‘driving value’. An appropriate choice of
θ0 could be a simple estimator of θ. E.g. Watterson’s estimator.
2015/7/17
Comp 790– Continuous-Time Coalescence
24
4. Monte Carlo methods
2015/7/17
Comp 790– Continuous-Time Coalescence
25
4. Monte Carlo methods
• 4.2 Monte Carlo integration and the coalescent
– Full likelihood of a sample under a coalescent model:
– H: history
D: data
– Let’s define H here:
– N!(n-1)!/2^(n-1) different coalescent topologies
– Impossible to sum up all these. ( we haven't even considered recomb)
2015/7/17
Comp 790– Continuous-Time Coalescence
26
4. Monte Carlo methods
• A naïve Monte Carlo approach:
– It’s not efficient for most of the coalescent topologies will not be
compatible with D.  most simulations do not contribute to the
likelihood.
– A four sequence example: (1/3 compatible )
2015/7/17
Comp 790– Continuous-Time Coalescence
27
4. Monte Carlo methods
• Importance Sampling:
– Reduce the variance of the estimated probability
– Reduce the number of simulations that contribute little to likelihood
• Instead of choosing histories from distribution Pθ(H), sample histories from
a proposal distribution Q(H)
• Now the likelihood of data can be approximated by:
2015/7/17
Comp 790– Continuous-Time Coalescence
28
4. Monte Carlo methods
• Ideally, one would like to sample from,
where
because in that case the approximation becomes exact:
Not feasible approach.
• A proposal distribution between
and
• Giffiths and Tavare(1994), Stephens and Donnelly(2000)
2015/7/17
Comp 790– Continuous-Time Coalescence
29
4. Monte Carlo methods
• Giffiths and
Tavare(1994):
• Let’s go back to 2.4.2
Infinite Site Model
2015/7/17
Comp 790– Continuous-Time Coalescence
30
• Giffiths and Tavare(1994):
• H is defined as a path through
the diagram.
• H has probability defined by
the product of weights
attached to the edges that
belong to H.
• E.g.H’ follows the rightmost path:
– 1st term of Q(H’):
is the driving value
– Last five terms are all 1
– θ0
2015/7/17
Comp 790– Continuous-Time Coalescence
31
4. Monte Carlo methods
• Pθ(D|H’) = 1 ???
• is the product of coalescent probabilities of the events defining the
history:
– Coalescent -> mutation -> coalescent …
• The factor in front of a fraction is the probability that a mutation happens
in a given lineage(s) or that a coalescent event happens amongst certain
pair of genes.
2015/7/17
Comp 790– Continuous-Time Coalescence
32
4. Monte Carlo methods
• 4.3 Markov Chain Monte Carlo
– Kuhner et al.(1995,1998)
– Finite sites model
– All coalescent topologies are compatible with data
– Likelihood ratio:
2015/7/17
Comp 790– Continuous-Time Coalescence
33
4. Monte Carlo methods
• The importance sampling function is:
• Use Metropolis-Hastings algorithm to construct a Markov Chain with
distribution Q(H)
• The benefit of the approach is that the Markov Chain tends to stay in areas
of the tree space that suport
data well
before moving to another area.34
2015/7/17
Comp 790–the
Continuous-Time
Coalescence