Transcript Document

Coalescent Module- Faro July 26th-28th 04
www.coalescent.dk
Monday
H: The Basic Coalescent
W: Forest Fire
W: The Coalescent + History, Geography & Selection
H: The Coalescent with Recombination
Tuesday
H: Recombination cont.
W: The Coalescent & Combinatorics
HW: Computer Session
H: The Coalescent & Human Evolution
Wednesday
H: The Coalescent & Statistics
HW: Linkage Disequilibrium Mapping
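Zooming in!
(from Harding + Sanger)
[Figure: successive magnifications of the human genome down to the β-globin gene on chromosome 11. Scale labels: 3*10^9 bp; *5.000; 6*10^4 bp; *20; 3*10^3 bp; *10^3; 30 bp. Gene structure: 5’ flanking, Exon 1, Exon 2, Exon 3, 3’ flanking. Sequence shown: ATTGCCATGTCGATAATTGGACTATTTTTTTTTT]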
Human Migrations
From Cavalli-Sforza,2001
Data: β-globin from sampled humans.
From Griffiths, 2001
Assume:
1. At most 1 substitution per position.
2. No recombination.
Reducing nucleotide columns to bipartitions gives a bijection between data & unrooted gene trees.
[Figure labels: C, G]
Simplified model of human sequence evolution.
[Figure: two populations, Africa and Non-Africa, on a time axis from Past to Present. Labels: Rate of common ancestry: 1; Mutation rate: 2.5; Wait to common ancestry: 2Ne; 0.2.]
From Griffiths, 2001
Models and their benefits.
Models + Data
1. probability of data (statistics...)
2. probability of individual histories
3. hypothesis testing
4. parameter estimation
Coalescent Theory in Biology
Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...
[Figure labels: Reproductive Structure; Genealogies of non-sequenced data; Genealogies of sequenced data; Parameter Estimation; Model Testing. Sequences shown: TGTTGT, CGTTAT, CATAGT]
Wright-Fisher Model of Population Reproduction
Haploid Model
i. Individuals are made by sampling with replacement in the previous generation.
ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/(2N).
Diploid Model
Individuals are made by sampling a chromosome from the female and one from the male previous generation, with replacement.
Assumptions
1. Constant population size
2. No geography
3. No selection
4. No recombination
10 Alleles’ Ancestry for 15 generations
Waiting for most recent common ancestor - MRCA
Distribution of the number of generations until 2 alleles had a common ancestor, $X_2$:
$P(X_2 > 1) = (2N-1)/2N = 1 - 1/(2N)$
$P(X_2 = j) = (1 - 1/(2N))^{j-1} \cdot 1/(2N)$
$P(X_2 > j) = (1 - 1/(2N))^{j}$
Mean, $E(X_2) = 2N$.
Ex.: 2N = 20.000, generation time 30 years, $E(X_2)$ = 600000 years.
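The waiting-time formulas above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names are hypothetical, and it assumes the haploid Wright-Fisher setting with 2N alleles.

```python
import math


def prob_no_coalescence(two_n: int, j: int) -> float:
    """P(X2 > j): the two alleles have distinct ancestors j generations in a row."""
    return (1.0 - 1.0 / two_n) ** j


def prob_coalescence_at(two_n: int, j: int) -> float:
    """P(X2 = j): the first common ancestor is found exactly j generations back."""
    return (1.0 - 1.0 / two_n) ** (j - 1) / two_n


def expected_wait_years(two_n: int, generation_time: float) -> float:
    """E(X2) = 2N generations, converted to years."""
    return two_n * generation_time


if __name__ == "__main__":
    two_n = 20_000  # value used in the slide example
    print(prob_no_coalescence(two_n, 1))
    print(prob_no_coalescence(two_n, two_n), math.exp(-1))  # nearly equal
    print(expected_wait_years(two_n, 30))
```

For large 2N the survival probability after 2N generations is close to exp(-1), which is what motivates the exponential approximation used later.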
P(k) := P{k alleles had k distinct parents}
Ancestor choices:
k -> any: $(2N)^k$
k -> k: $2N(2N-1)\cdots(2N-(k-1)) =: (2N)_{[k]}$
k -> k-1: $\binom{k}{2}(2N)_{[k-1]}$
k -> j: $S_{k,j}(2N)_{[j]}$
$S_{k,j}$ - the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).
For k << 2N:
$P(k) = \frac{(2N)_{[k]}}{(2N)^k} \approx 1 - \binom{k}{2}\frac{1}{2N} \approx e^{-\binom{k}{2}/2N}$
Geometric/Exponential Distributions
The Geometric Distribution on {1,2,..}, Geo(p):
$P\{Z = j\} = p^{j-1}(1-p)$
$P\{Z > j\} = p^{j}$
$E(Z) = 1/(1-p)$.
The Exponential Distribution on R+, Exp(a):
Density: $f(t) = a e^{-at}$
$P(X > t) = e^{-at}$
Properties: X ~ Exp(a), Y ~ Exp(b) independent
i. $P(X > t_2 \mid X > t_1) = P(X > t_2 - t_1)$ ($t_2 > t_1$)
ii. $E(X) = 1/a$.
iii. $P(Z > t) \approx P(X > t)$ for small a, with $p = e^{-a}$ (equality at integer t).
iv. $P(X < Y) = a/(a + b)$.
v. $\min(X,Y) \sim \mathrm{Exp}(a + b)$.
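Properties iii, iv and v are the ones used repeatedly in the coalescent, so a quick simulation may help. This is a minimal sketch using only the standard library; the function names and the chosen rates are illustrative assumptions.

```python
import math
import random


def geometric_tail(p: float, j: int) -> float:
    """P(Z > j) for the geometric distribution with continuation probability p."""
    return p ** j


def exponential_tail(a: float, t: float) -> float:
    """P(X > t) for X ~ Exp(a)."""
    return math.exp(-a * t)


def simulate_race(a: float, b: float, n: int, seed: int = 1):
    """Draw n pairs X ~ Exp(a), Y ~ Exp(b).

    Returns the fraction of pairs with X < Y and the average of min(X, Y).
    """
    rng = random.Random(seed)
    x_first = 0
    total_min = 0.0
    for _ in range(n):
        x = rng.expovariate(a)
        y = rng.expovariate(b)
        x_first += x < y
        total_min += min(x, y)
    return x_first / n, total_min / n


if __name__ == "__main__":
    a, b = 1.0, 3.0  # illustrative rates
    frac, mean_min = simulate_race(a, b, 100_000)
    print(frac, a / (a + b))        # property iv
    print(mean_min, 1.0 / (a + b))  # property v, via the mean
    print(geometric_tail(math.exp(-0.01), 50), exponential_tail(0.01, 50))  # property iii
```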
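Discrete -> Continuous Time
$t_c := t_d / 2N_e$, e.g. 6 generations in discrete time become $6/2N_e$ in continuous time.
$X_k$, the waiting time while there are k ancestors, is $\mathrm{Exp}\binom{k}{2}$ distributed, so $E(X_k) = 1/\binom{k}{2}$.
1.0 corresponds to 2N generations.
[Figure: genealogy of 6 alleles on a time axis from 0.0 (present) to 1.0 (2N generations).]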
Adding Mutations
m: mutation probability per nucleotide per generation.
L: sequence length.
µ = m*L: mutation probability per allele per generation.
2Ne: number of alleles.
Θ := 4N*µ: mutation intensity in the scaled process.
[Figure: discrete time and discrete sequence (sequence step 1/L, time step 1/(2Ne)) next to continuous time and continuous sequence, where mutations occur at rate Θ/2 per lineage and a pair coalesces at rate 1.]
Probability for two genes being identical:
P(Coalescence < Mutation) = 1/(1+Θ).
Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
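The identity probability follows from the race between two exponential clocks: the pair coalesces at rate 1, and each of the two lineages mutates at half the scaled mutation intensity, giving total mutation rate equal to that intensity. A minimal sketch, with hypothetical function names and purely illustrative parameter values:

```python
def scaled_mutation_rate(n: float, m: float, seq_len: int) -> float:
    """theta = 4 * N * mu, with mu = m * L the mutation probability per allele per generation."""
    return 4.0 * n * m * seq_len


def prob_identical(theta: float) -> float:
    """P(coalescence before mutation) for two genes.

    Coalescence has rate 1 and mutation has total rate theta (theta / 2 on
    each of the two lineages), so the first event is a coalescence with
    probability 1 / (1 + theta).
    """
    return 1.0 / (1.0 + theta)


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    theta = scaled_mutation_rate(n=10_000, m=1e-8, seq_len=1_000)
    print(theta, prob_identical(theta))
```

Only the product N*m*L enters, which is the point of the note on separate estimation.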
The Standard Coalescent
Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.
Example with 5 alleles, from the present (first row) back to the common ancestor (last row):
Partition            Waiting                       Coalescing
{1}{2}{3}{4}{5}      $\mathrm{Exp}\binom{5}{2}$    4--5
{1}{2}{3}{4,5}       $\mathrm{Exp}\binom{4}{2}$    3--(4,5)
{1}{2}{3,4,5}        $\mathrm{Exp}\binom{3}{2}$    1--2
{1,2}{3,4,5}         $\mathrm{Exp}\binom{2}{2}$    (1,2)--(3,(4,5))
{1,2,3,4,5}
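The two processes can be combined in a few lines of simulation. This is a sketch of the standard coalescent as described above (exponential waiting times with rate "k choose 2", then a uniformly chosen pair); the function names are hypothetical.

```python
import math
import random


def simulate_coalescent(n: int, rng: random.Random):
    """Simulate one genealogy of n alleles.

    Returns a list of (time, merged_block) events, where time is measured in
    units of 2N generations and merged_block is the tuple of leaf labels
    below the new ancestor.
    """
    blocks = [(i,) for i in range(1, n + 1)]
    time = 0.0
    events = []
    while len(blocks) > 1:
        k = len(blocks)
        # Continuous part: wait an Exp(C(k,2)) distributed time.
        time += rng.expovariate(math.comb(k, 2))
        # Discrete part: choose a pair of lineages uniformly at random.
        i, j = rng.sample(range(k), 2)
        merged = tuple(sorted(blocks[i] + blocks[j]))
        blocks = [b for idx, b in enumerate(blocks) if idx not in (i, j)]
        blocks.append(merged)
        events.append((time, merged))
    return events


def mean_height(n: int, replicates: int, seed: int = 1) -> float:
    """Average time to the most recent common ancestor over many simulations."""
    rng = random.Random(seed)
    return sum(simulate_coalescent(n, rng)[-1][0] for _ in range(replicates)) / replicates


if __name__ == "__main__":
    for time, block in simulate_coalescent(5, random.Random(7)):
        print(round(time, 3), block)
```

The last event always merges everything into the block (1, 2, ..., n), the most recent common ancestor of the sample.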
Expected Height and Total Branch Length
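While there are k lineages (epoch k):
Expected duration of the epoch: $1/\binom{k}{2} = \frac{2}{k(k-1)}$
Expected branch length in the epoch: $k \cdot \frac{2}{k(k-1)} = \frac{2}{k-1}$
Examples: epoch 2 lasts 1 and contributes branch length 2; epoch 3 lasts 1/3 and contributes branch length 1.
Expected total height of tree:
$H_k = \sum_{j=2}^{k} \frac{2}{j(j-1)} = 2(1 - 1/k)$
i. Infinitely many alleles find 1 ancestor in finite expected time ($H_k \to 2$).
ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles ($H_2 = 1$).
Expected total branch length in tree, $L_k$:
$L_k = 2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$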
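Both expectations are sums over epochs and are easy to evaluate directly. A minimal sketch with hypothetical function names:

```python
import math


def expected_height(k: int) -> float:
    """H_k: sum of expected epoch durations 2 / (j (j - 1)) for j = 2..k."""
    return sum(2.0 / (j * (j - 1)) for j in range(2, k + 1))


def expected_total_length(k: int) -> float:
    """L_k: sum of expected epoch branch lengths 2 / (j - 1) for j = 2..k."""
    return 2.0 * sum(1.0 / j for j in range(1, k))


if __name__ == "__main__":
    for k in (2, 5, 25, 1000):
        print(k, expected_height(k), expected_total_length(k), 2 * math.log(k - 1) if k > 2 else 0.0)
```

The height saturates below 2 while the total length keeps growing logarithmically; the logarithmic approximation captures the growth but is offset from the exact harmonic sum by a constant (twice the Euler-Mascheroni constant, about 1.15, for large k).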
Kingman
(Stoch.Proc. & Appl. 13.235-248 + 2 other articles,1982)
A. Stochastic Processes on Equivalence Relations.
$\Delta = \{(i,i);\ i = 1,\dots,n\}$
$\Theta = \{(i,j);\ i,j = 1,\dots,n\}$
$q_{s,t} = 1$ if $s \prec t$ and $0$ otherwise, where $s \prec t$ means that t is obtained from s by merging two equivalence classes.
This defines a process, $R_t$, going from $\Delta$ to $\Theta$ through equivalence relations on {1,..,n}.
B. The Paint Box & exchangeable distributions on Partitions.
C. All coalescents are restrictions of “The Coalescent” – a process with entrance boundary infinity.
D. Robustness of “The Coalescent”: If the offspring distribution is exchangeable and $\mathrm{Var}(\nu_1) \to \sigma^2$ & $E(\nu_1^m) < M_m$ for all m, then genealogies follow “The Coalescent” in distribution.
E. A series of combinatorial results.
Effective Population Size, Ne.
In an idealised Wright-Fisher model:
i. variation is reduced by a factor 1-1/(2N) per generation.
ii. Expected waiting time for 2 random alleles to find a common ancestor is 2N generations.
Factors that influence Ne:
i. Variance in offspring number. Under WF the variance is 1. If variance is higher, then effective population size is smaller.
ii. Population size variation - example k cycle: N1, N2,..,Nk.
k/Ne = 1/N1 +..+ 1/Nk.
I.e. N1 = 10, N2 = 1000 => Ne = 19.8
iii. Two sexes
Ne = 4NfNm/(Nf+Nm)
I.e. Nf = 10, Nm = 1000 => Ne = 39.6
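The two Ne formulas can be evaluated directly for the slide's examples. A minimal sketch with hypothetical function names:

```python
def ne_cycle(sizes) -> float:
    """Harmonic mean of cycling population sizes: k / Ne = 1/N1 + ... + 1/Nk."""
    return len(sizes) / sum(1.0 / n for n in sizes)


def ne_two_sexes(n_female: float, n_male: float) -> float:
    """Ne = 4 Nf Nm / (Nf + Nm)."""
    return 4.0 * n_female * n_male / (n_female + n_male)


if __name__ == "__main__":
    print(ne_cycle([10, 1000]))
    print(ne_two_sexes(10, 1000))
```

In both cases Ne is dominated by the smaller number: the harmonic mean stays close to twice the bottleneck size, and with unequal sexes Ne stays close to four times the rarer sex.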
6 Realisations with 25 leaves
Observations:
Variation great close to root.
Trees are unbalanced.
Sampling more sequences
The probability that the most recent common ancestor of a sample of size n is also the most recent common ancestor of a sub-sample of size k is
$\frac{(n+1)(k-1)}{(n-1)(k+1)}$
Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
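A few values make the claim concrete. This sketch evaluates the formula exactly with rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def prob_shared_mrca(n: int, k: int) -> Fraction:
    """(n+1)(k-1) / ((n-1)(k+1)): a sub-sample of size k shares the MRCA of the sample of size n."""
    return Fraction((n + 1) * (k - 1), (n - 1) * (k + 1))


def prob_shared_mrca_limit(k: int) -> Fraction:
    """Limit (k-1)/(k+1) as n goes to infinity."""
    return Fraction(k - 1, k + 1)


if __name__ == "__main__":
    for k in (2, 5, 10, 20):
        print(k, float(prob_shared_mrca(100, k)), float(prob_shared_mrca_limit(k)))
```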
Three Models of Alleles and Mutations.
Infinite Allele
i. Only identity, non-identity is determinable.
ii. A mutation creates a new type.
Infinite Site
i. Allele is represented by a line.
ii. A mutation always hits a new position.
Finite Site
i. Allele is represented by a sequence.
ii. A mutation changes nucleotide at chosen position.
[Figure: the three models side by side, each with mutation rate Θ (= 4Nµ). Example sequences: acgtgctt, acgtgcgt, acctgcat, tcctgcat, tcctgcat; in the second panel the fourth sequence reads tcctggct.]
Infinite Allele Model
[Figure: genealogy of alleles 1-5 under the infinite allele model. Allele partitions and their class-size notation, where j^a means a classes of size j:
{(1)} -> 1^1
{(1,2)} -> 2^1
{(1),(2)} -> 1^2
{(1),(2,3)} -> 1^1 2^1
{(1,2),(3),(4,5)} -> 1^1 2^2
{(1),(2),(3),(4,5)} -> 1^3 2^1]
Infinite Site Model
Final Aligned Data Set: [figure]
Number of paths: [figure: graph of ancestral configurations, each node labelled with its number of paths]
Labelling and unlabelling:positions and sequences
[Figure: gene trees for sequences 1-5, shown with full labels, ignoring mutation position, and ignoring sequence label.]
The forward-backward argument
4 classes of mutation events incompatible with data.
9 coalescence events incompatible with data.
[Figure: event probabilities for 5 sequences, with denominators 5(4 + Θ) and (4 + Θ).]
Infinite Site Model: An example
Theta=2.12
[Figure: graph of ancestral states labelled with path counts; some nodes are marked "Impossible Ancestral States".]
Finite Site Model
Final Aligned Data Set:
acgtgctt
acgtgcgt
acctgcat
tcctgcat
tcctgcat
Simplifying assumptions
1) Only substitutions.
   (s1 = TCGGTA, s2 = TGGT-T is reduced to s1 = TCGGA, s2 = TGGTT.)
2) Processes in different positions of the molecule are independent.
3) A nucleotide follows a continuous time Markov Chain.
4) Time reversibility: I.e. πi Pi,j(t) = πj Pj,i(t), where πi is the stationary probability of i. This implies that
$\sum_a P(a)\, P_{a,N_1}(l_1)\, P_{a,N_2}(l_2) = P(N_1)\, P_{N_1,N_2}(l_1 + l_2)$
where $P(a) = \pi_a$, i.e. an ancestor a joined to N1 and N2 by branches of lengths l1 and l2 can be replaced by a single branch of length l1 + l2 from N1 to N2.
5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.
Evolutionary Substitution Process
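[Figure: nucleotide A changing to C along a time axis marked t1, ε, t2.]
Pi,j(t) = probability of going from i to j in time t.
$\lim_{\varepsilon \to 0} \frac{P_{i,j}(\varepsilon)}{\varepsilon} = q_{ij}$ (i ≠ j)
$\lim_{\varepsilon \to 0} \frac{P_{i,i}(\varepsilon) - 1}{\varepsilon} = -q_{ii}$
where $q_{ii} = \sum_{j \neq i} q_{ij}$ is the total rate of leaving i, so the diagonal of the rate matrix is $-q_{ii}$.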
Jukes-Cantor 69:
Rate matrix (rows FROM, columns TO):
        A       C       G       T
A     -3*α      α       α       α
C       α     -3*α      α       α
G       α       α     -3*α      α
T       α       α       α     -3*α
Total Symmetry.
A. Stationary Distribution: (.25,.25,.25,.25)
B. Expected number of substitutions in time t: 3αt
$P_{\alpha,t}(C,G) = \frac{1}{4}\left(1 - e^{-4\alpha t}\right)$
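The Jukes-Cantor transition probabilities have a closed form, which makes it easy to check the time-reversibility consequence stated under the simplifying assumptions. A minimal sketch: the probability of staying in the same state, 1/4 + 3/4 exp(-4αt), is not on the slide but follows because each row must sum to 1; function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_prob(i: str, j: str, alpha: float, t: float) -> float:
    """Jukes-Cantor probability of going from nucleotide i to j in time t."""
    decay = math.exp(-4.0 * alpha * t)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 * (1.0 - decay)


def expected_substitutions(alpha: float, t: float) -> float:
    """Each nucleotide is left at total rate 3 * alpha."""
    return 3.0 * alpha * t


def pulley_sides(n1: str, n2: str, alpha: float, l1: float, l2: float):
    """Both sides of sum_a P(a) P_{a,N1}(l1) P_{a,N2}(l2) = P(N1) P_{N1,N2}(l1 + l2)."""
    lhs = sum(0.25 * jc_prob(a, n1, alpha, l1) * jc_prob(a, n2, alpha, l2)
              for a in NUCLEOTIDES)
    rhs = 0.25 * jc_prob(n1, n2, alpha, l1 + l2)
    return lhs, rhs


if __name__ == "__main__":
    print(jc_prob("C", "G", alpha=0.1, t=2.0))
    print(pulley_sides("C", "G", alpha=0.1, l1=0.7, l2=1.3))
```

Summing over the unknown ancestor gives the same value as a single branch of the combined length, so the position of the root does not affect the probability of the data under a reversible model.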
ATTGTGTATATAT….CAG
ATTGCGTATCTAT….CCG
Chimp
Mouse
Fish
Higher Cells
E.coli
History of Coalescent Approach to Data Analysis
1930-40s: Genealogical arguments well known to Wright & Fisher.
1964: Crow & Kimura: Infinite Allele Model
1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So do King & Jukes.
1971: Kimura & Ohta propose infinite sites model.
1972: Ewens’ Formula: Probability of data under infinite allele model.
1975: Watterson makes explicit use of “The Coalescent”.
1982: Kingman introduces “The Coalescent”.
1983: Hudson introduces “The Coalescent with Recombination”.
1983: Kreitman publishes first major population sequences.
1987-95: Griffiths, Ethier & Tavaré calculate site data probability under infinite site model.
1994-: Griffiths-Tavaré + Kuhner-Yamato-Felsenstein introduce highly computer-intensive simulation techniques to estimate parameters in population models.
1996-: Krone-Neuhauser introduce selection in the Coalescent.
1998-: Donnelly, Stephens, Fearnhead et al.: Major accelerations in coalescent based data analysis.
2000-: Several groups combine Coalescent Theory & Gene Mapping.
2002: HapMap project is started.
Basic Coalescent Summary
i. Genealogical approach to population genetics.
ii. ”The Coalescent” - generic probability distribution on
allele trees.
iii. Combining ”The Coalescent” with Allele/Mutation Models allows calculation of the probability of data.