The Coalescent Theory - Case Western Reserve University


By Mireya Diaz
Department of Epidemiology and Biostatistics
for EECS 458
Agenda
• Basic concepts of population genetics
• The coalescent theory
• Coalescent process of two sequences
• Coalescent time
• Statistical inference
• Applications: reconstruction of human evolutionary
history
• Future avenues
Basic Concepts in Population Genetics
Mutation
Random genetic drift
Selection
Basic Concepts in Population Genetics
• Mutation: plays a limited role in evolution because its effect is slow,
but it contributes to the maintenance of alleles in the population
Locus with 2 alleles: A1 (frequency p(n)) and A2 (frequency q(n) = 1 - p(n))
Non-overlapping generations
A1 -> A2 at rate u and A2 -> A1 at rate v (u, v ~ 10^-5, 10^-6)
An allele can mutate at most once per generation
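Recurrence:
$$p(n+1) = (1-u)\,p(n) + v\,(1-p(n))$$
If the initial gene frequency of A1 is p(0) = p_0:
$$p(n) = \frac{v}{u+v} + \left(p_0 - \frac{v}{u+v}\right)(1-u-v)^{n}$$
As n -> ∞ ("equilibrium"):
$$\hat{p} = \frac{v}{u+v}, \qquad \hat{q} = \frac{u}{u+v}$$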
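As a quick numerical check of the mutation recurrence, the sketch below iterates p(n+1) = (1 - u) p(n) + v (1 - p(n)) and compares it with the closed-form solution. The function names and the starting frequency 0.5 are illustrative assumptions; only the rates u, v ~ 10^-5, 10^-6 come from the slide.

```python
def iterate_mutation(p0, u, v, generations):
    """Iterate p(n+1) = (1 - u) p(n) + v (1 - p(n)) for the given number of generations."""
    p = p0
    for _ in range(generations):
        p = (1 - u) * p + v * (1 - p)
    return p


def closed_form(p0, u, v, n):
    """Closed form p(n) = p_hat + (p0 - p_hat) (1 - u - v)^n with p_hat = v / (u + v)."""
    p_hat = v / (u + v)
    return p_hat + (p0 - p_hat) * (1 - u - v) ** n


if __name__ == "__main__":
    u, v, p0 = 1e-5, 1e-6, 0.5  # p0 is an assumed starting frequency
    for n in (0, 1_000, 100_000):
        print(n, iterate_mutation(p0, u, v, n), closed_form(p0, u, v, n))
    print("equilibrium p_hat =", v / (u + v))
```

With rates this small, the frequency moves toward v/(u+v) only over hundreds of thousands of generations, which is the "slow effect" the slide refers to.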
Basic Concepts in Population Genetics
• Random genetic drift: change in gene frequency due to
random sampling of gametes from a finite population.
Important for small size populations
Each generation 2N gametes sampled at random from
parent generation
y(n): number of gametes of type A1, in the absence of mutation and selection
$$P(y(n+1) = j \mid y(n) = i) = \binom{2N}{j}\, p^{j} (1-p)^{2N-j}, \qquad p = \frac{i}{2N}$$
(Wright-Fisher model)
• One allele will be lost
$$P(\text{fixation of } A_1) = f(A_1 \mid t = 0) \le 1$$
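The binomial sampling step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the population size 2N = 20 and the replicate count are assumptions chosen to keep the run short. It evaluates the Wright-Fisher transition probability and estimates by simulation how often A1 becomes fixed, which should be close to its initial frequency.

```python
import random
from math import comb


def transition_prob(i, j, two_n):
    """P(y(n+1) = j | y(n) = i): binomial sampling of 2N gametes with p = i / 2N."""
    p = i / two_n
    return comb(two_n, j) * p**j * (1 - p) ** (two_n - j)


def simulate_fixation(i0, two_n, rng):
    """Run one population until A1 is lost (0 copies) or fixed (2N copies)."""
    y = i0
    while 0 < y < two_n:
        p = y / two_n
        y = sum(rng.random() < p for _ in range(two_n))  # one binomial draw
    return y == two_n


def fixation_frequency(i0, two_n, replicates, seed=0):
    """Fraction of replicate populations in which A1 becomes fixed."""
    rng = random.Random(seed)
    fixed = sum(simulate_fixation(i0, two_n, rng) for _ in range(replicates))
    return fixed / replicates


if __name__ == "__main__":
    # 6 copies of A1 among 20 gametes: initial frequency 0.3
    print(fixation_frequency(6, 20, 2000))
```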
Basic Concepts in Population Genetics
• Selection: can act at different stages of the life of an
organism (e.g. differential fecundity, viability)
Locus with 2 alleles A1, A2
Three genotypes: A1 A1 (w11), A1 A2 (w12), A2A2 (w22)
with fitness wij, relative survival chances of zygotes of genotype AiAj
Under Hardy-Weinberg equilibrium
$$p(n+1) = \frac{p(n)\,[\,p(n)\,w_{11} + q(n)\,w_{12}\,]}{\bar{w}}, \qquad q(n+1) = \frac{q(n)\,[\,p(n)\,w_{12} + q(n)\,w_{22}\,]}{\bar{w}}$$
$$\bar{w} = p^{2}(n)\,w_{11} + 2\,p(n)\,q(n)\,w_{12} + q^{2}(n)\,w_{22}, \qquad p(n+1) + q(n+1) = 1$$
If
w11 > w12 > w22 -> A1 becomes fixed
w11 < w12 < w22 -> A2 becomes fixed
w11, w22 < w12 -> overdominance, stable polymorphism
w12 < w11, w22 -> underdominance, unstable polymorphism; A1 or A2 becomes fixed, depending on f(0)
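The four fitness regimes can be explored numerically. The sketch below iterates the selection recursion for p(n); the function names and fitness values are illustrative assumptions. For overdominance it uses the standard interior equilibrium p* = (w12 - w22) / (2 w12 - w11 - w22), a textbook result that does not appear on the slide.

```python
def next_freq(p, w11, w12, w22):
    """One generation of selection: p(n+1) = p [p w11 + q w12] / w_bar."""
    q = 1 - p
    w_bar = p * p * w11 + 2 * p * q * w12 + q * q * w22  # mean fitness
    return p * (p * w11 + q * w12) / w_bar


def iterate_selection(p0, w11, w12, w22, generations):
    """Frequency of A1 after the given number of generations."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, w11, w12, w22)
    return p


if __name__ == "__main__":
    gens = 2000
    print("directional  :", iterate_selection(0.1, 1.0, 0.9, 0.8, gens))  # A1 fixes
    print("overdominance:", iterate_selection(0.5, 0.8, 1.0, 0.9, gens))  # tends to 1/3
    print("underdom. 0.4:", iterate_selection(0.4, 1.0, 0.8, 1.0, gens))  # A1 lost
    print("underdom. 0.6:", iterate_selection(0.6, 1.0, 0.8, 1.0, gens))  # A1 fixes
```

The two underdominance runs differ only in the starting frequency, which illustrates why the outcome depends on f(0).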
The Coalescent Theory
• Stochastic process: continuous-time Markov process
• Large population approximation of Wright-Fisher model,
and other neutral models
• Probability model for genealogical tree of random
sample of n genes from large population
• Most significant progress in theoretical population
genetics (past 2 decades). Cornerstone for rigorous
statistical analysis of molecular data from populations
• Motivation: the need to infer the past from samples taken from the
present population
• Seminal work: Kingman, J Appl Prob 19A:27, 1982
The Coalescent Theory – Key Idea
• Start with a sample and trace backwards in time to identify events
in the past, back to the Most Recent Common Ancestor (MRCA) of the
sample
• Consider a sample of n sequences of a DNA region from a population
• Assume no recombination between sequences
• The n sequences are connected by a single phylogenetic tree
(genealogy) whose root is the MRCA
[Figure: genealogy of the sample, with labels MRCA, Diverge, Coalesce]
The Coalescent Theory: Usefulness
• Sample-based theory
• By-product: development of highly-efficient algorithms
for simulation of samples under various population
genetics models
• Particularly suitable for molecular data
• Estimate parameters of evolutionary models (vs.
history of specific locus – phylogenetics)
The Coalescent Process of Two Sequences
• Consider diploid organisms
• Wright-Fisher model:
– Sequence in a population at a generation = random sample
with replacement from those in the previous generation
– Mutations at the locus of interest are selectively neutral (they do not
affect reproductive success: all individuals are equally likely to
reproduce, all lineages equally likely to coalesce)
• P(coalescence in the previous generation) = ?
P = 1/(2N), N = effective population size
$$P(\text{coalescence } t+1 \text{ generations ago}) = \frac{1}{2N}\left(1 - \frac{1}{2N}\right)^{t}$$
• For haploid structures, use N rather than 2N
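The two-sequence coalescence time is therefore geometric with success probability 1/(2N). The short sketch below (function names and the value 2N = 200 are illustrative assumptions) checks numerically that these probabilities sum to 1 and that the mean waiting time is 2N generations.

```python
def coalescence_prob(t, two_n):
    """P(two sequences coalesce exactly t + 1 generations ago)."""
    return (1 / two_n) * (1 - 1 / two_n) ** t


def mean_coalescence_time(two_n, horizon):
    """Mean number of generations back to coalescence, truncated at `horizon`."""
    return sum((t + 1) * coalescence_prob(t, two_n) for t in range(horizon))


if __name__ == "__main__":
    two_n = 200  # assumed number of gene copies (2N)
    print(sum(coalescence_prob(t, two_n) for t in range(20_000)))  # close to 1
    print(mean_coalescence_time(two_n, 20_000))                    # close to 2N
```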
The Coalescent Tree
[Figure: genealogical relationship of a sample of genes, rooted at the MRCA, with coalescent intervals T2, T3, T4, T5]
• Topology is independent of branch lengths
• Branch lengths are independent, exponential random variables
(waiting times between coalescent events)
• Topology is generated by randomly picking lineages to
coalesce -> “all topologies are equally likely”
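These three properties translate directly into a simulation recipe: draw each waiting time independently, then pick a random pair of lineages to merge. The sketch below is one possible way to write it, with hypothetical names. It takes the waiting time with k lineages to be exponential with mean 2/(k(k-1)), as stated on the coalescent-time slide, so times are in coalescent units rather than generations.

```python
import random


def simulate_genealogy(n, rng):
    """Simulate one coalescent genealogy for n sampled genes.

    Returns (events, times): events lists the pairs of lineages merged, from the
    present back to the MRCA; times[k] is the waiting time while k lineages remain.
    """
    lineages = [(i,) for i in range(n)]  # each lineage = tuple of sampled genes below it
    events, times = [], {}
    k = n
    while k > 1:
        rate = k * (k - 1) / 2              # exponential rate, i.e. mean 2 / (k (k - 1))
        times[k] = rng.expovariate(rate)    # branch lengths are drawn independently...
        a, b = rng.sample(range(k), 2)      # ...of which pair of lineages coalesces
        events.append((lineages[a], lineages[b]))
        merged = tuple(sorted(lineages[a] + lineages[b]))
        lineages = [x for idx, x in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
        k -= 1
    return events, times


if __name__ == "__main__":
    events, times = simulate_genealogy(5, random.Random(42))
    for left, right in events:
        print("coalesce", left, "+", right)
    print("tree height:", sum(times.values()))
```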
The Coalescent Time
• Assume: number of mutations in a given period ~ Poisson
– Mean coalescence time of two sequences: 2N generations
– Mean number of mutations between two sequences:
θ = 4Nm (m: mutation rate per sequence per generation)
• Underlying assumption: random mating
(~ organisms with high mobility)
• Coalescent time: time between two successive
coalescent events
• Exponential variable with mean 2/(k(k-1)), in units of 2N generations
k: number of ancestral sequences between the two events
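To see where θ = 4Nm comes from: two sequences are separated by two branches, each 2N generations long on average and each accumulating mutations at rate m, giving 2 x 2N x m = 4Nm expected differences. The sketch below checks this by simulation. The names are hypothetical, θ = 2 is an arbitrary illustrative value, and the small Poisson sampler is included only to keep the example free of third-party libraries.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by multiplying uniforms (Knuth's method)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def pairwise_differences(theta, rng):
    """Number of mutations separating two sequences under the coalescent."""
    t = rng.expovariate(1.0)         # coalescence time in units of 2N generations (k = 2)
    return poisson(theta * t, rng)   # mutations on both branches: Poisson(theta * t)


def mean_differences(theta, replicates, seed=0):
    """Average number of pairwise differences over many simulated pairs."""
    rng = random.Random(seed)
    total = sum(pairwise_differences(theta, rng) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    print(mean_differences(2.0, 20_000))  # should be close to theta = 2
```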
Coalescent Tree Parameters
P(2 lineages pick the same parent and coalesce) = 1/N
P(2 lineages remain distinct) = 1 - 1/N
Expected time to MRCA (height of the tree):
$$E\left[\sum_{k=2}^{n} T(k)\right] = \sum_{k=2}^{n} E[T(k)] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\left(1 - \frac{1}{n}\right)$$
Expected total branch length of the tree:
$$E[T_{tot}(n)] = E\left[\sum_{k=2}^{n} k\,T(k)\right] = \sum_{k=1}^{n-1} \frac{2}{k} \sim 2(\gamma + \log n)$$
(γ: Euler's constant)
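Both expectations follow from E[T(k)] = 2/(k(k-1)) and can be verified with exact arithmetic. The sketch below (hypothetical function names) sums the terms with fractions and compares them with the closed forms 2(1 - 1/n) and the sum of 2/k for k = 1, ..., n-1.

```python
import math
from fractions import Fraction


def expected_tmrca(n):
    """Expected tree height: sum of E[T(k)] = 2 / (k (k - 1)) for k = 2..n."""
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


def expected_total_length(n):
    """Expected total branch length: sum of k * E[T(k)] for k = 2..n."""
    return sum(k * Fraction(2, k * (k - 1)) for k in range(2, n + 1))


if __name__ == "__main__":
    euler_gamma = 0.5772156649  # Euler's constant, truncated
    for n in (2, 10, 100, 1000):
        print(n,
              float(expected_tmrca(n)),                # approaches 2 as n grows
              float(expected_total_length(n)),
              2 * (euler_gamma + math.log(n)))         # asymptotic approximation
```

The height saturates at 2 while the total length keeps growing like log n, so adding more sequences to a sample mostly adds short branches near the tips.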
The Coalescent Theory & Statistical Inference
• Mutation rate
• Age of MRCA
• Recombination rate
• Ancestral population size
• Migration rate
Reconstruction of Human Evolutionary History
• Goal: estimate times of evolutionary events (major
migrations), demographic history (population
bottlenecks, expansions)
• Haploid sequences: mtDNA, Y chromosome
• Case study: recent common ancestry of human Y
chromosome
• Source: Thomson et al. PNAS 2000; 97:7360-5
• Estimations: expected time to MRCA and ages of
certain mutations
• Data: 53-70 chromosomes, sequence variation at
three genes (SMCY, DBY, DFFRY) in Y chromosome
Recent common ancestry of Y chromosome
• For ages of major events: need a mutation rate estimate (single-nucleotide substitutions)
• Estimated from substitutions between chimpanzee and human sequences
• Mutation rate per site per year = No. subst. / (2 × Tsplit × L)
• Tsplit: time since chimp and human split (~5M years ago); L: sequence length
• Assumption: selective neutrality of all changes on Y since divergence
Summary of gene characteristics from sample

Gene    Seq length (bp)  Sample size  No. polym.  No. substitutions  Mutation rate
SMCY    39,931           53           47 (41)     528                1.32 x 10^-9
DBY     8,547            70           14 (12)     107                1.25 x 10^-9
DFFRY   15,642           70           17 (15)     159                1.02 x 10^-9
All     64,120           43           65 (56)     794                1.24 x 10^-9

Source: Table 1 from article
(#): no. polymorphisms after removal of length variants, repeat sequences, indels
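The mutation rates in the table can be reproduced from the formula rate = No. subst. / (2 × Tsplit × L), taking Tsplit = 5 million years and the sequence lengths and substitution counts listed above. The factor 2 accounts for substitutions accumulating along both the human and the chimpanzee lineage. The names in this sketch are illustrative.

```python
T_SPLIT_YEARS = 5_000_000  # chimp-human split time used on the slide (~5M years)

# gene: (sequence length in bp, substitutions between human and chimp), from the table
GENES = {
    "SMCY": (39_931, 528),
    "DBY": (8_547, 107),
    "DFFRY": (15_642, 159),
    "All": (64_120, 794),
}


def mutation_rate(substitutions, length, t_split=T_SPLIT_YEARS):
    """Substitutions per site per year; 2 * t_split covers both diverging lineages."""
    return substitutions / (2 * t_split * length)


if __name__ == "__main__":
    for gene, (length, subs) in GENES.items():
        print(f"{gene:6s} {mutation_rate(subs, length):.2e}")
```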
GENETREE Analysis
• Software: www.stats.ox.ac.uk/~stephens/group/software.html
• Estimate mean number of mutations: θ = 2Nem
Ne: effective number of Y chromosomes in population
m: mutation rate per gene per generation
• Also: expected ages of mutations, time since MRCA
• Assumptions: coalescent process, infinitely-many-sites
mutation model (mutation rate low enough that each mutation occurs at a new site)
• Four insertions, three deletions, two repeat mutations removed (rates
differ from single-nucleotide substitutions)
• Only one segregating site in SMCY appeared to have mutated
more than once -> data fit the infinitely-many-sites model
Recent common ancestry of Y chromosome
MRCA distribution under constant population

Gene    TMRCA^1  95% CI         TMRCA^2   95% CI
SMCY    0.56     (0.40, 0.82)   85,000    (61,000, 125,000)
DBY     0.83     (0.60, 1.10)   154,000   (112,000, 206,000)
DFFRY   0.96     (0.55, 1.21)   120,000   (69,000, 152,000)
All     0.55     (0.36, 0.98)   84,000    (55,000, 149,000)
MRCA distribution under exponential population growth

Gene    TMRCA^1  95% CI             TMRCA^2  95% CI
SMCY    0.0731   (0.0618, 0.1030)   48,000   (41,000, 68,000)
DBY     0.0538   (0.0382, 0.0975)   55,000   (39,000, 100,000)
DFFRY   0.0582   (0.0440, 0.0720)   53,000   (40,000, 65,000)
All     0.0853   (0.0580, 0.2070)   59,000   (40,000, 140,000)

^1 Expected age in Ne generations. ^2 Value in years = Ne*25
GENETREE Analysis
[Figure: GENETREE gene tree with mutations 1 and 2 marked; samples grouped by region: Africa, Asia, Oceania]
Expected ages of mutations in tree:
Mutation 1: 47,000 (35,000; 89,000) – male movement out of Africa
Mutation 2: 40,000 (31,000; 79,000) – beginning of global expansion
Future Avenues
• Population genetics models: incorporation of migration,
population growth, recombination, natural selection
• Longitudinal analysis
• Evolutionary analysis of quantitative trait loci (QTL)
• Properties of coalescent theory (CT):
– Accuracy of the coalescent approximation under combinations of
population size, sample size, mutation rate
– Properties of estimators under MCMC
References
• Handbook of Statistical Genetics, 2nd edition, Vol. 2
• Nature 2002; 3:380-390
• Theoretical Population Biology 1999; 56:1-10