Genome Research - University of Oxford
Coalescent Models for Genetic
Demography
What can the Coalescent do for
you?
Rosalind Harding
University of Oxford
Who was MtEve?
the most recent common ancestor
(MRCA) to whom all mtDNA haplotype
diversity, currently sampled, can be
traced.
One possibility: First a bottleneck, then
multiple lineages are established during
expansion phases
MtEve
But if there wasn’t a bottleneck?
Then our predecessors, collecting data 20,000
years ago, could have identified a different
mtEve, an Eve from an earlier generation;
in 20,000 years' time, a new generation will
be likely to find their mtEve to be a granddaughter of our mtEve.
While our mtEve may be special to us, for
archaeogeneticists of past and future
generations she will have no particular
significance!
Insights from coalescent models
[Figure labels: "Eve?" at three positions; axis: Time, ending at the present]
What is the coalescent?
a simple model which generates a probability
distribution for gene genealogies sampled
from a population.
Further definitions
simple models: abstractions from complex
demographic reality, which preserve key features
population: all individuals within a generation with
the potential to contribute to the gene pool (including
individuals who are reproductively successful as well
as those who are not.)
gene genealogies: lineages of transmission of copies
of a gene from parents to offspring
coalescence: where two transmission lineages find a
common ancestor, looking backwards in time
probability distribution: a set of probabilities for many
possible alternative gene genealogies compatible with
the model
Models and data
Interpreting genetic polymorphism data
consider a sample of genes from a contemporary
population, with their allelic frequencies and
sequence identities determined – these data do
not reveal our genetic past directly, they must be
interpreted.
Options for model choice
evolution as phylogeny, phylo-geography
evolution as a balance of mutation and genetic
drift in a population with a specified demography
(population size, mating pattern, offspring
distribution)
Characteristics of
polymorphism data
For a small proportion of sites in human DNA, a
second allele is present in populations due to a
relatively recent mutation; this is polymorphism.
Polymorphism constitutes a transient phase in
evolution, intermediate between the occurrence of a
mutation and the fixation of either allele at 100%.
MtDNA trees may distort frequencies of
polymorphisms. They show sets of mutation events
as a proxy for fixed differences; it is the new allele
that is assumed to fix (attain 100%).
These potential sources of error for time scale
estimates may be minor but could be substantial.
[Figure: neighbor-joining phylogram of 101 mtDNA coding-region sequences. Ingman and Gyllensten, 2003, Genome Research 13:1600-1606.]
Is phylogenetic branching the right model?
Note variable branch lengths and endpoints; yet all
individuals were sampled in the present!
A phylogenetic model with added
genealogical detail and molecular clock
Trajectories for neutral alleles
Ne=10, constant over time
Understanding genetic drift as genealogy
Two of the gene copies in gen. t are inherited by all of the
offspring copies in generation t+x. This is the process of drift
that leads eventually to either loss or fixation (100% frequency
in the population) of new mutations.
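The drift-as-genealogy picture above can be sketched as a small simulation. This is only an illustration, assuming a Wright-Fisher population in which every offspring copy picks its parent uniformly at random, and taking Ne=10 to mean 20 gene copies; the function name `surviving_ancestors` is made up for this sketch.

```python
import random

def surviving_ancestors(num_copies, generations, rng):
    """Count copies in generation t that still have descendants in generation t+x."""
    # ancestor[i] is the index, in the starting generation, of the copy
    # from which current copy i descends.
    ancestor = list(range(num_copies))
    for _ in range(generations):
        # Wright-Fisher step (assumed): each offspring copy picks a parent
        # copy uniformly at random from the previous generation.
        ancestor = [ancestor[rng.randrange(num_copies)] for _ in range(num_copies)]
    return len(set(ancestor))

rng = random.Random(1)
num_copies = 20  # assumption: 2N gene copies for Ne = 10 diploid individuals
for x in (1, 5, 20, 80, 400):
    print(x, surviving_ancestors(num_copies, x, rng))
```

As x grows, the count of ancestral copies with surviving descendants falls towards one, which is the loss-or-fixation outcome described on the slide.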
Some advantages of coalescent
models over phylogeny for
interpreting polymorphism data
they make better use of molecular clocks and do not
treat polymorphisms as fixed differences;
as models of populations they clarify the difference
between
‘absence of evidence’ (eg for Neanderthal ancestry) and
‘evidence of absence’ (any single locus only represents such a
small sample of ancestors from >50,000 years ago that with
present data we don’t have the statistical power to rule out
Neanderthal ancestry).
they incorporate some measure of our uncertainty
about the evolution of allele frequencies (a mixed
process of mutation and transmission in genealogies).
Assumptions of Kingman’s (1982)
coalescent for interpreting
polymorphism data (random sample)
1. Neutrality
2. All new mutations unique and informative
3. If individuals are diploid in a population of size N,
the model applies to 2N independent, haploid
copies of a gene
4. Random mating within a population
5. Constant population size, Ne
6. A very specific probability distribution for
transmissions of gene copies to 0, 1, 2 … offspring
7. Non-overlapping generations
Aims of coalescent modelling:
to make inferences from genetic data
to simulate different demographies to see what to
expect in polymorphism data;
to estimate parameters under an explicit
demographic model, eg Kingman’s coalescent;
to estimate in which generation (and subpopulation) particular lineages coalesced or
mutations occurred, given explicit demographic
assumptions;
to evaluate the uncertainty in our estimates;
to introduce new parameters to improve the model,
judging by its fit to data, to learn about
demography.
The ancestry of a sample composed of
two copies of the gene in generation t0
Following the ancestry of a sample of two copies of a gene
(gene A) from time t0, ie the present, backwards (red), we find
their most recent common ancestor (MRCA) at generation t8.
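The waiting time for a sample of two can be written down directly. A sketch, assuming a population of 2N gene copies in which each copy picks its parent uniformly at random; the symbols T_2 (generations back to the common ancestor) and t are introduced here for illustration.

```latex
\begin{align*}
P(\text{two copies share a parent in the previous generation}) &= \frac{1}{2N},\\
P(T_2 = t) &= \left(1-\frac{1}{2N}\right)^{t-1}\frac{1}{2N}, \qquad t = 1, 2, \dots\\
E(T_2) &= 2N \text{ generations.}
\end{align*}
```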
Expected coalescence times
[Figure: expected time to coalescence for n lineages, with panels for constant N, N expanding and N reducing (population sizes labelled N0 and N1), drawn against time. Thanks to Lounes for this slide.]
Constant N, n = 5:
E(T2)=2Ne
E(T5)=Ne/5
E(TMRCA)=4Ne(1-1/5)
As the sample size increases towards 2N, E(TMRCA) approaches
4N, which equals the fixation time for a newly arisen mutation.
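The expectations on this slide follow from one formula for the interval during which k lineages remain. A minimal sketch, assuming constant Ne and using the slide's scaling (times in generations); the function names and the value Ne = 10,000 are chosen here for illustration.

```python
def expected_interval(k, ne):
    """Expected generations during which exactly k lineages remain: 4Ne / (k(k-1))."""
    return 4 * ne / (k * (k - 1))

def expected_tmrca(n, ne):
    """Expected generations back to the MRCA of n sampled copies."""
    # Sum the intervals for k = n, n-1, ..., 2 lineages.
    return sum(expected_interval(k, ne) for k in range(2, n + 1))

ne = 10_000  # illustrative value, not taken from this slide
print(expected_interval(2, ne))  # E(T2) = 2Ne
print(expected_interval(5, ne))  # E(T5) = Ne/5
print(expected_tmrca(5, ne))     # E(TMRCA) = 4Ne(1 - 1/5)
```

Because the sum telescopes to 4Ne(1 - 1/n), most of the expected depth of the tree comes from the final interval with two lineages.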
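Simulated genealogies with constant Ne
[Figure: four simulated genealogies, numbered 1 to 4.]
TMRCA, in units of 2Ne generations:
1. 4.57
2. 2.93*
3. 1.48
4. 0.01
*eg 2.93 x 2 x 10,000 x 20 = 1.2 million years
(ie taking Ne = 10,000 and 20 years per generation)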
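Genealogy depths like those listed can be simulated in a few lines. This is a sketch under Kingman's coalescent with constant Ne, not the program used for the slide; the sample size of 10 and the function names are assumptions, while Ne = 10,000 and 20 years per generation are read off the slide's conversion.

```python
import random

def simulate_tmrca(n, rng):
    """One TMRCA for n lineages under constant size, in units of 2Ne generations."""
    total = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, each of the k(k-1)/2 pairs coalesces at rate 1.
        total += rng.expovariate(k * (k - 1) / 2)
    return total

def to_years(t, ne=10_000, years_per_generation=20):
    """Convert a time in units of 2Ne generations to years."""
    return t * 2 * ne * years_per_generation

rng = random.Random(42)
for i in range(1, 5):
    t = simulate_tmrca(10, rng)  # sample size 10 is an assumption
    print(f"{i}. {t:.2f} units = {to_years(t):,.0f} years")

print(f"{to_years(2.93):,.0f} years")  # the slide's example, about 1.2 million
```

Repeated runs vary widely from one genealogy to the next, which is the point of the four examples on the slide.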
Simulating recent expansion: not much
variability in TMRCA between genealogies
[Figure: four simulated genealogies, numbered 1 to 4.]
TMRCA, in units of 2Ne generations:
1. 0.0026
2. 0.0029
3. 0.0028
4. 0.0027
~1000 years of human evolution
1. A time scale is given by the coalescent
model for the demography (drift history)
2. Add mutations
Infinite-sites mutation in a gene tree
The relationship between average pairwise
sequence difference, π, and the
parameter θ in Kingman’s Coalescent
[Figure: time axis in units of 2N generations]
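One way to see this relationship is to simulate pairs of sequences and average their differences. The sketch below writes the two quantities as π (average pairwise difference) and θ, and assumes infinite-sites mutation, time in units of 2N generations, and mutations arriving at rate θ/2 per lineage per unit time (so θ = 4Nμ for 2N gene copies, a standard but here assumed definition). The function name is made up for illustration.

```python
import random

def pairwise_differences(theta, rng):
    """Number of differences between two sequences under infinite sites."""
    t2 = rng.expovariate(1.0)   # coalescence time of the pair, units of 2N generations
    branch_length = 2 * t2      # two lineages, each of length t2
    # Mutations fall along the branches as a Poisson process with rate theta/2
    # per unit length; count arrivals until the branch length is used up.
    count = 0
    position = rng.expovariate(theta / 2)
    while position < branch_length:
        count += 1
        position += rng.expovariate(theta / 2)
    return count

rng = random.Random(7)
theta = 5.0  # illustrative value
reps = 50_000
mean_pi = sum(pairwise_differences(theta, rng) for _ in range(reps)) / reps
print(theta, round(mean_pi, 2))
```

Under these assumptions the simulated mean should sit close to θ, since the expected pairwise difference is the mutation rate times the expected total branch length of the pair.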
Data: Aboriginal Australian mtDNAs
Model: Kingman’s coalescent
[Figure: gene tree for mtDNA coding DNA, sites 9000 to 16000, annotated with question marks.]
One colonization event? Or several founding lineages at different times?
Note the nonuniform spacing of mutations
Another advantage of coalescent
models over phylogeny
While the population bottlenecks implicitly
assumed in phylogenetic and phylogeographic
analyses can be explicitly assumed in a
coalescent framework, alternative
demographies may be assumed, or may be
inferred.
(the relationship between coalescent nodes
and colonization events is very ambiguous.)
Kingman’s coalescent as H0
Kingman’s coalescent model is a starting
point, available to us even before we collect
any data.
Having collected data, we can test whether
the data show goodness-of-fit to the
expectations of our starting model.
If not, we should change or add parameters
to improve the model. At present there are
some options available (not many, but some!)
Variations from Kingman’s
coalescent
1. Selection
2. Recurrent and back mutation
3. Recombination
4. *Non-random mating: eg geographic subdivision
with specified migration between subpopulations
5. *Population size fluctuation, including bottlenecks
and expansions
6. Non-’Poisson’ distributions of offspring numbers
7. Unequal generation intervals between lineages
*similar model but additional parameters
The coalescent with structure
Much migration
Little migration
Each generation, m alleles are exchanged between sub-populations.
In discrete time, an allele migrates with probability m/2N per generation.
In continuous time, the waiting time until a migration is exponentially distributed, expo(m).
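The contrast between much and little migration can be sketched for the simplest case of two lineages. The model below is an assumption made for illustration, not necessarily the one behind the slide: two sub-populations of equal size, time in units of 2N generations, each lineage migrating at rate m, and two lineages in the same sub-population coalescing at rate 1.

```python
import random

def pair_coalescence_time(m, same_deme, rng):
    """Coalescence time for two lineages in a symmetric two-deme model."""
    t = 0.0
    while True:
        if same_deme:
            # Competing events: coalescence (rate 1) or either lineage migrating (rate 2m).
            t += rng.expovariate(1.0 + 2 * m)
            if rng.random() < 1.0 / (1.0 + 2 * m):
                return t
            same_deme = False
        else:
            # Lineages in different demes cannot coalesce; wait for a migration.
            t += rng.expovariate(2 * m)
            same_deme = True

def mean_time(m, same_deme, reps, rng):
    """Average coalescence time over reps simulated pairs."""
    return sum(pair_coalescence_time(m, same_deme, rng) for _ in range(reps)) / reps

rng = random.Random(3)
for m in (5.0, 0.05):  # much migration, little migration
    within = mean_time(m, True, 20_000, rng)
    between = mean_time(m, False, 20_000, rng)
    print(f"m={m}: within {within:.2f}, between {between:.2f}")
```

In this particular model the expected within-deme time is 2 whatever the value of m, while the between-deme expectation is 2 + 1/(2m), so little migration shows up mainly as very deep coalescences between sub-populations.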
Summary and points for
discussion
Data drawn as gene trees show the relative ordering of
coalescence events.
The length of time between coalescence events is a function of
the number of mutation events inferred from the data AND the
assumed demographic history. (Molecular clocks should NOT
be applied directly.)
Present phylo-geographic methods fudge the data to
circumvent thinking about demography. Consequently we do
not learn anything about demography from them.
Furthermore, these methods may be generating some highly
inaccurate time estimates and they don’t provide satisfactory
estimates of the uncertainty surrounding these estimates.
Coalescent modelling to date draws attention to many
concerns, but to improve ‘phylo-geographic’ inference we need
implementations of the structured coalescent appropriate for a
colonization/extinction demography.
[Figure: gene tree for mtDNA coding DNA, sites 500 to 9000]
Implications of drift as genealogy
All the identical copies of a gene, eg all
the copies of the MC1R-151 red hair
allele, carried by thousands of people
across Europe, have been inherited from
a single common ancestor living some
time in the past. Although mutation may
have generated MC1R-151 alleles many
times, all these mutations were quickly
lost, except for one. On one occasion
only, the new mutation increased in
frequency, becoming a common
polymorphism. Could this be true? (We
think so!)