Welcome to Comp 665

Download Report

Transcript Welcome to Comp 665

Continuous Coalescent Model
• The continuous coalescent lends itself to generative models
• Algorithm to construct a plausible genealogy for n genes
1. Start with k = n genes
2. Simulate the waiting time, Tkc, to the next event, Tkc ~ Exp 2k
3. Choose a random pair (i, j) with 1 ≤ i < j ≤ k uniformly
k
among the 2 pairs
4. Merge I and J into one gene and decrease the sample size
by one, k  k -1

5. Repeat from step 2 while k > 1



• Note that this model runs backwards, it begins from the
current population and posits ancestry, in contrast to a
forward algorithm like those used in the first lecture
4/8/2016
Comp 790– Continuous-Time Coalescence
1
In Python
• A simulator in 12 lines

T = [[i,0.0] for i in xrange(N)] # gene id, time of merge
k=N
t = 0.0
while k > 1:
t += expovariate(0.5*k*(k-1))
i = randint(0,k-1)
j = randint(0,k-1)
while i == j:
j = randint(0,k-1)
T[i] = [T[i], T[j], t]
T.pop(j)
k -= 1
4/8/2016
Comp 790– Continuous-Time Coalescence
2
Properties of a Coalescent Tree
• The height, Hn, of the tree is the sum of time epochs, Tj,
where there are j = n, n-1, n-2, … , 2, 1 ancestors.
As n ∞, E(Hn)  2,
and, if n=2, E(H2)=1.
n
E(Hn ) 

n
j2

n
Var(Hn ) 
 

1
 2 1  1n
j(j 1)
E(Tj )  2
j2

n
 j (j 1)
Var(Tj )  4
j2
1
2
2
j2
Thus, the waiting
time for n genes to
find their common
ancestor is less than
twice the time for 2!
As n ∞,
Var(Hn)  4(π2-9)/3,
and, if n=2, Var(H2)=1.

4/8/2016
Comp 790– Continuous-Time Coalescence
3
Sampled Distribution
• N = 1000000
4/8/2016
Comp 790– Continuous-Time Coalescence
4
Example Trees
• Observation: The contribution of T2, where the last two ancestors
converge to a common root, is disproportionately large
4/8/2016
Comp 790– Continuous-Time Coalescence
5
Total Branch Length
• In contrast to Hn, the distribution of the total branch length Ln,
has a simple form:
t /2 n1
P(Ln  t)  (1e
)
• The mean of Ln is found by weighting the coalescent times by
the number of active lineages

n
E(Ln ) 

j2
n1

jE(Tj )  2
1
j
j1
• This sum does not converge for large n, but grows slowly. It
fact, it is proportional to log(n)
4/8/2016
Comp 790– Continuous-Time Coalescence
6
Shared History
• E(Ln) can be used to get a sense of how much history genes share.
• Genes would share the least history if they all arose from a common
ancestor long ago and then propagated along distinct lineages.
• If the mean time to the common ancestor is E(Hn) = 2(1 – 1/n), and we
assume the split was a early as possible (thus minimizing the shared
history), then the total branch length would be nE(Hn) = 2(n-1).
• Comparing to E(Ln) as a fraction of this minimum shared-history case
gives:
n1

1
j1 j
E(Ln )
log(n)


nE(Hn ) n 1
n 1
…

4/8/2016
Comp 790– Continuous-Time Coalescence
7
7
7
Plot of Shared History
• Even for small n, samples, on average, share
considerable history
– share(5) = 48%
– share(10) = 69%
– share(20) = 81%
• Sharing is the fraction
of a genealogy that an
average gene shares
with two or more other
extant genes
4/8/2016
share(n)  1 

n1
1
j1 j
n 1

Comp 790– Continuous-Time Coalescence
8
Variance of Total Branch Length
• The variance in the total branch length is:
n
Var(Ln ) 

n1

j Var(Tj )  4
2
j2
1
j2
j1
which converges to 2π2/3 ≈ 6.579 as n ∞.
• This implies that for large n, Ln is narrowly

centered around E(Ln). Likewise, sharing is also
relatively consistent.
4/8/2016
Comp 790– Continuous-Time Coalescence
9
Implications on Sampling Paths
• Sampling multiple paths from extant genes along their
ancestors is less effective than one might think.
• Most long branches are covered by relatively few samples
• Not surprising since the E(H40) = 1.95 and E(H10) = 1.8
(a 4x increase in samples increases height by less than 10%).
4/8/2016
Comp 790– Continuous-Time Coalescence
10
Effective Population Size
• Real populations are not likely to satisfy the Wright-Fisher
model.
• In particular, most real populations show some sort of
reproductive structure, either due to geography or societal
constraints
• Also likely that the number of descendents is a generation
depends on many factors (health, disease, etc.), as opposed to
the implicit Poisson model
• Total population size is not fixed, but changes over time
4/8/2016
Comp 790– Continuous-Time Coalescence
11
Sanity Check
• When the Wright-Fisher model, or the basic
coalescent, is used to model a real population, the
size of the population (2N) cannot be taken literally.
• For example, many human genes have a MRCA less
than 200,000 years ago. If we consider one
generation per 20 years then N should be less than
200,000/(4*20) = 2500, which is too small (recall the
maximum tree height for the entire population is 2.
and 2(2 generation_time) = 4*20)
4/8/2016
Comp 790– Continuous-Time Coalescence
12