
Coalescent Theory in Biology
www.coalescent.dk
Fixed Parameters: Population Structure, Mutation, Selection, Recombination, ...
Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data (example sequences: TGTTGT, CGTTAT, CATAGT)
Parameter Estimation
Model Testing
Wright-Fisher Model of Population Reproduction
Haploid Model
i. Individuals are made by sampling with replacement in the previous generation.
ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/(2N).
Diploid Model
Individuals are made by sampling, with replacement, one chromosome from the females and one from the males of the previous generation.
Assumptions
1. Constant population size
2. No geography
3. No selection
4. No recombination
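The haploid sampling step can be sketched in a few lines of Python. This is a minimal illustration only; the function names `wright_fisher_generation` and `fraction_sharing_parent` are hypothetical, and `num_alleles` plays the role of 2N.

```python
import random

def wright_fisher_generation(num_alleles, rng):
    """Each offspring allele picks its parent uniformly, with replacement."""
    return [rng.randrange(num_alleles) for _ in range(num_alleles)]

def fraction_sharing_parent(num_alleles, trials, seed=0):
    """Estimate the probability that two given alleles share a parent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        parents = wright_fisher_generation(num_alleles, rng)
        if parents[0] == parents[1]:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # With 2N = 20 the estimate should be close to 1/(2N) = 0.05.
    print(fraction_sharing_parent(20, 20000, seed=1))
```

The estimate fluctuates from seed to seed; only its closeness to 1/(2N) matters here.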
$P(k) := P\{k \text{ alleles had } k \text{ distinct parents}\}$
Ancestor choices:
k -> any: $(2N)^k$
k -> k: $2N(2N-1)\cdots(2N-(k-1)) =: (2N)_{[k]}$
k -> k-1: $\binom{k}{2}(2N)_{[k-1]}$
k -> j: $S_{k,j}(2N)_{[j]}$
$S_{k,j}$ is the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).
For k << 2N:
$P(k) = \frac{(2N)_{[k]}}{(2N)^k} \approx 1 - \binom{k}{2}\frac{1}{2N} \approx e^{-\binom{k}{2}/2N}$
Waiting for most recent common ancestor - MRCA
Distribution of the time until 2 alleles have a common ancestor, $X_2$:
$P(X_2 > 1) = (2N-1)/2N = 1 - 1/(2N)$
$P(X_2 = j) = (1 - 1/(2N))^{j-1} \cdot 1/(2N)$
$P(X_2 > j) = (1 - 1/(2N))^{j}$
Mean: $E(X_2) = 2N$.
Example: 2N = 20,000 and a generation time of 30 years give $E(X_2)$ = 600,000 years.
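The geometric waiting time above is easy to check numerically. A minimal sketch with hypothetical function names; `two_n` stands for 2N.

```python
def prob_wait_exactly(j, two_n):
    """P(X2 = j): no common ancestor for j-1 generations, then one."""
    return (1 - 1 / two_n) ** (j - 1) / two_n

def prob_wait_longer(j, two_n):
    """P(X2 > j): no common ancestor in the first j generations."""
    return (1 - 1 / two_n) ** j

def expected_wait_years(two_n, generation_years):
    """E(X2) = 2N generations, converted to years."""
    return two_n * generation_years

if __name__ == "__main__":
    print(expected_wait_years(20000, 30))  # 600000
```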
10 Alleles’ Ancestry for 15 generations
Multiple and Simultaneous Coalescents
1. Simultaneous Events
2. Multifurcations.
3. Underestimation of Coalescent Rates
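Discrete -> Continuous Time
$t_c := t_d / 2N_e$
Example: 6 generations in discrete time correspond to $6/2N_e$ in continuous time.
$X_k$ is $\mathrm{Exp}\left[\binom{k}{2}\right]$ distributed, $E(X_k) = 1/\binom{k}{2}$.
1.0 corresponds to 2N generations.
[Figure: genealogy of 6 alleles on a time axis from 0.0 to 1.0, i.e. from 0 to 2N generations.]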
The Standard Coalescent
Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.
Going back in time from the sample {1}{2}{3}{4}{5}:
Waiting time                  Coalescing          Resulting partition
$\mathrm{Exp}\binom{5}{2}$    4--5                {1}{2}{3}{4,5}
$\mathrm{Exp}\binom{4}{2}$    3--(4,5)            {1}{2}{3,4,5}
$\mathrm{Exp}\binom{3}{2}$    1--2                {1,2}{3,4,5}
$\mathrm{Exp}\binom{2}{2}$    (1,2)--(3,(4,5))    {1,2,3,4,5}
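The two processes (exponential waiting times, random choice of a pair) can be combined into a short simulation. This is a sketch under the standard coalescent assumptions only; the names `simulate_coalescent` and `mean_tree_height` are hypothetical.

```python
import random

def simulate_coalescent(k, rng):
    """Simulate one genealogy of k alleles, going back in time.

    Returns a list of (waiting_time, merged_block, partition_after) tuples.
    """
    lineages = [frozenset([i]) for i in range(1, k + 1)]
    events = []
    while len(lineages) > 1:
        n = len(lineages)
        rate = n * (n - 1) / 2            # C(n,2) pairs, each coalescing at rate 1
        wait = rng.expovariate(rate)      # continuous part: Exp[C(n,2)]
        i, j = rng.sample(range(n), 2)    # discrete part: uniform choice of a pair
        merged = lineages[i] | lineages[j]
        lineages = [b for idx, b in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((wait, merged, list(lineages)))
    return events

def mean_tree_height(k, reps, seed=0):
    """Average time to the MRCA over many simulated genealogies."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += sum(wait for wait, _, _ in simulate_coalescent(k, rng))
    return total / reps

if __name__ == "__main__":
    for wait, merged, partition in simulate_coalescent(5, random.Random(0)):
        print(round(wait, 3), sorted(merged), [sorted(b) for b in partition])
```

Times are in the scaled units used above, so 1.0 corresponds to 2N generations.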
Expected Height and Total Branch Length
Lineages   Expected time in epoch             Expected branch length in epoch
2          1                                  2
3          1/3                                1
k          $1/\binom{k}{2} = 2/(k(k-1))$      2/(k-1)
Expected total height of tree: $H_k = 2(1 - 1/k)$
i. Infinitely many alleles find 1 ancestor in finite time.
ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles.
Expected total branch length in tree, $L_k$: $2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$
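Both expectations follow from summing over epochs, which a few lines of Python make concrete. A minimal sketch; the function names are illustrative only.

```python
import math

def expected_height(k):
    """H_k: sum over epochs j = 2..k of the expected epoch length 2/(j(j-1))."""
    return sum(2 / (j * (j - 1)) for j in range(2, k + 1))

def expected_total_length(k):
    """L_k: each epoch with j lineages contributes j * 2/(j(j-1)) = 2/(j-1)."""
    return sum(2 / (j - 1) for j in range(2, k + 1))

if __name__ == "__main__":
    for k in (2, 5, 25, 1000):
        print(k, expected_height(k), expected_total_length(k), 2 * math.log(k - 1) if k > 2 else 0.0)
```

The height stays below 2 for every k, while the total length keeps growing, roughly logarithmically.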
Effective Population Size, Ne.
In an idealised Wright-Fisher model:
i. Variation is reduced by a factor 1-1/(2N) per generation.
ii. The expected waiting time for two random alleles to find a common ancestor is 2N generations.
Factors that influence Ne:
i. Variance in offspring number. WF: 1. If the variance is higher, the effective population size is smaller.
ii. Population size variation, for example a k-cycle N1, N2, .., Nk: k/Ne = 1/N1 + .. + 1/Nk. N1 = 10, N2 = 1000 => Ne ≈ 19.8.
iii. Two sexes: Ne = 4NfNm/(Nf+Nm). E.g. Nf = 10, Nm = 1000 => Ne ≈ 40.
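The two formulas for Ne can be evaluated directly. A small sketch with hypothetical function names, using the numbers from the examples above.

```python
def ne_cycle(sizes):
    """Harmonic mean of the population sizes in a cycle: k/Ne = sum of 1/Ni."""
    return len(sizes) / sum(1 / n for n in sizes)

def ne_two_sexes(n_female, n_male):
    """Effective size with two sexes: Ne = 4 Nf Nm / (Nf + Nm)."""
    return 4 * n_female * n_male / (n_female + n_male)

if __name__ == "__main__":
    print(ne_cycle([10, 1000]))      # about 19.8
    print(ne_two_sexes(10, 1000))    # about 39.6
```

In both cases the smaller number dominates: Ne stays close to a small multiple of the bottleneck size or of the rarer sex.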
6 Realisations with 25 leaves
Observations:
The variation is large close to the root.
Trees are unbalanced.
Sampling more sequences
The probability that the ancestor of the sample of size n is in a sub-sample of size k is
$\frac{(n+1)(k-1)}{(n-1)(k+1)}$
Letting n go to infinity gives (k-1)/(k+1), i.e. the probability is already large for quite small samples.
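A quick way to get a feel for this probability is to tabulate it. A minimal sketch; `prob_ancestor_in_subsample` is an illustrative name, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def prob_ancestor_in_subsample(n, k):
    """(n+1)(k-1) / ((n-1)(k+1)) for a sub-sample of size k out of n."""
    return Fraction((n + 1) * (k - 1), (n - 1) * (k + 1))

if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        # Compare a finite sample (n = 100) with the limit (k-1)/(k+1).
        print(k, float(prob_ancestor_in_subsample(100, k)), (k - 1) / (k + 1))
```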
Adding Mutations
m: mutation rate per nucleotide per generation. L: sequence length.
µ = m*L: mutation rate per allele per generation. 2Ne: number of alleles.
θ := 4N*µ: mutation intensity in the scaled process.
[Figure: discrete time and discrete sequence (generations spaced 1/(2Ne) apart, positions spaced 1/L apart) versus continuous time and continuous sequence, with mutation and coalescence events.]
Probability of two genes being identical:
Each of the two lineages mutates at rate θ/2 and the pair coalesces at rate 1, so
P(Coalescence < Mutation) = 1/(1+θ).
Note: Mutation rate and population size usually appear together as a product,
making separate estimation difficult.
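The identity probability can be checked by racing two exponential clocks: coalescence at rate 1 against the first mutation on either lineage at rate θ. A sketch only; the function names are hypothetical and `theta` must be positive for the simulation.

```python
import random

def prob_identical(theta):
    """P(coalescence before any mutation) = 1 / (1 + theta)."""
    return 1 / (1 + theta)

def simulate_identical(theta, trials, seed=0):
    """Monte Carlo estimate of the same probability (theta > 0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        t_coalescence = rng.expovariate(1.0)   # pair coalesces at rate 1
        t_mutation = rng.expovariate(theta)    # two lineages, rate theta/2 each
        if t_coalescence < t_mutation:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(prob_identical(1.0), simulate_identical(1.0, 50000, seed=3))
```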
Three Models of Alleles and Mutations.
Infinite Allele
i. Only identity or non-identity is determinable.
ii. A mutation creates a new type.
Infinite Site
i. An allele is represented by a line.
ii. A mutation always hits a new position.
Finite Site
i. An allele is represented by a sequence.
ii. A mutation changes the nucleotide at a chosen position.
[Figure: the same genealogy under each of the three models, with mutation intensity θ; example sequences include acgtgctt, acgtgcgt, acctgcat, tcctgcat and tcctggct.]
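Infinite Allele Model
Partition of the sample -> allele class sizes (a^b: b allele classes of size a)
{(1)} -> 1^1
{(1,2)} -> 2^1
{(1), (2)} -> 1^2
{(1), (2,3)} -> 1^1 2^1
{(1,2), (3), (4,5)} -> 1^1 2^2
{(1), (2), (3), (4,5)} -> 1^3 2^1
[Figure: genealogy of alleles 1-5 with these partitions at its nodes.]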
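The step from a partition to its class-size summary is purely mechanical, as the following sketch shows. The function names and the `a^b` string format are assumptions made for illustration.

```python
from collections import Counter

def allele_spectrum(partition):
    """Map class size -> number of allele classes of that size."""
    counts = Counter(len(block) for block in partition)
    return dict(sorted(counts.items()))

def spectrum_string(partition):
    """Render the spectrum as 'size^count' terms, smallest size first."""
    return " ".join(f"{size}^{count}" for size, count in allele_spectrum(partition).items())

if __name__ == "__main__":
    print(spectrum_string([(1,), (2, 3)]))
    print(spectrum_string([(1, 2), (3,), (4, 5)]))
    print(spectrum_string([(1,), (2,), (3,), (4, 5)]))
```

Because only class sizes are kept, different partitions such as {(1), (2,3)} and {(1,2), (3)} give the same summary.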
Infinite Site Model
Final Aligned Data Set:
Labelling and unlabelling: positions and sequences
[Figure: the data set of sequences 1-5 shown with full labels, ignoring mutation position, ignoring sequence label, and ignoring both.]
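The forward-backward argument
With 5 sequences, looking at the most recent event back in time:
A given pair coalesces with probability $\frac{2}{5(4+\theta)}$. 9 coalescence events are incompatible with the data.
The event is a mutation with total probability $\frac{\theta}{4+\theta}$. 4 classes of mutation events are incompatible with the data.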
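The probabilities with denominator (4 + θ) come from competing rates: with k lineages, coalescence happens at total rate C(k,2) and mutation at total rate kθ/2. The sketch below computes them under that standard assumption; the function name and dictionary keys are hypothetical.

```python
def next_event_probabilities(k, theta):
    """Probabilities for the most recent event back in time with k lineages."""
    coalescence_rate = k * (k - 1) / 2   # C(k,2) pairs, rate 1 each
    mutation_rate = k * theta / 2        # k lineages, rate theta/2 each
    total = coalescence_rate + mutation_rate
    return {
        "coalescence": coalescence_rate / total,
        "mutation": mutation_rate / total,
        "given_pair_coalesces": 1 / total,
        "given_lineage_mutates": (theta / 2) / total,
    }

if __name__ == "__main__":
    # For k = 5 the shared denominator simplifies to 5(4 + theta)/2.
    print(next_event_probabilities(5, 2.12))
```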
Infinite Site Model: An example
θ = 2.12
[Figure: graph of ancestral states with node labels 2, 3, 2, 5, 3, 4, 5, 9, 10, 5, 14, 19, 33; impossible ancestral states are marked.]
Finite Site Model
Final Aligned Data Set:
acgtgctt
acgtgcgt
acctgcat
tcctgcat
tcctgcat
Diploid Model with Recombination
An individual is made by:
1. The paternal chromosome is obtained by picking a random father.
2. That father's chromosomes recombine to create the individual's paternal chromosome.
Similarly for the maternal chromosome.
The Diploid Model Back in Time.
A recombinant sequence will have two different ancestor sequences in the grandparent generation.
1-recombination histories I: Branch length change
1-recombination histories II: Topology change
1-recombination histories III: Same tree
1-recombination histories IV: Coalescent time must be further back in time than recombination time.
[Figures: for each case, the trees on sequences 1-4 on either side of the recombination point; in case IV the coalescence (c) and recombination (r) times are marked.]
Recombination-Coalescence Illustration
Copied from Hudson 1991
Intensities
Coales.   Recomb.
0         b
1         (1+b)
3         (2+b)
6         2
3         2
1         2
Age to oldest most recent common ancestor
[Figure: age of the oldest most recent common ancestor against the scaled recombination rate ρ, for sequence lengths from 0 kb to 250 kb.]
Number of genetic ancestors to the Human Genome
S: number of segments
$E(S) = 1 + \rho$
[Figure: sequence against time, with coalescence (C) and recombination (R) events.]
Simulations
Statements about the number of ancestors are much harder to make.
Applications to Human Genome
(Wiuf and Hein, 97)
Parameters used: 4Ne = 20,000. Chromosome 1: 263 Mb, 263 cM.
Chromosome 1: 52,000 segments, 6,800 ancestors.
All chromosomes: 86,000 ancestors.
Physical population: 1.3-5.0 million.
A randomly picked ancestor:
(ancestral material comes in batteries!)
[Figure: ancestral material of one ancestor along chromosome 1 at three zoom levels: 0-260 Mb (segments 0-52,000), magnified x35 to 0-7.5 Mb (segment labels 6890 and 8360), and magnified x250 to 0-30 kb.]
Ignoring recombination in phylogenetic analysis
General practice in analysis of viral evolution!
[Figure: genealogy of sequences 1-4 with recombination, next to the tree obtained assuming no recombination.]
Mimics decelerations/accelerations of evolutionary rates.
No recombination and infinite recombination both imply a molecular clock.
Simulated Example
Genotype and Phenotype Covariation: Gene Mapping
Sampling Genotypes and Phenotypes
Decay of local dependency
Time
Reich et al. (2001)
Genotype -> Phenotype Function
Result: The Mapping Function
Dominant/Recessive.
Penetrance
A set of characters.
Binary decision (0,1).
Spurious Occurrence
Quantitative Character.
Heterogeneity
[Figure: mapping from genotype to phenotype.]