Separating Population Structure from Recent Evolutionary History
Problem: Spatial Patterns Inferred
Earlier Represent An Equilibrium
Between Recurrent Evolutionary
Forces Such as Gene Flow and Drift.
E.g., $F_{st} = \frac{1}{4N_{ev}m + 1}$
But, Can Obtain The Same Pattern
Due to Recent Historical Events That
Have Not Had Time to Reach
Equilibrium
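As a rough numerical illustration of the equilibrium relationship on this slide, the sketch below evaluates F_st = 1 / (4 N_ev m + 1). It assumes the standard island-model reading, in which N_ev is the variance effective size and m is the migration rate per generation; the function name and the example values are hypothetical.

```python
def equilibrium_fst(n_ev: float, m: float) -> float:
    """Equilibrium F_st under the island model: 1 / (4 * N_ev * m + 1)."""
    if n_ev <= 0 or not 0 <= m <= 1:
        raise ValueError("n_ev must be positive and m must lie in [0, 1]")
    return 1.0 / (4.0 * n_ev * m + 1.0)


# F_st depends only on the product N_ev * m (effective migrants per generation).
# The population size of 1000 is an arbitrary illustrative value.
for migrants in (0.25, 1.0, 5.0):
    print(migrants, round(equilibrium_fst(1000, migrants / 1000), 3))
```

The same F_st value can also be observed in populations that are nowhere near this equilibrium, which is the problem the slide raises.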
To Examine Historical Events &
Non-Equilibrium States, Need to
Study Genetic Variation in Both
Space & Time
• Directly Sample Populations From the Past
• Reconstruct Variation Through Time Indirectly
Direct Study: mtDNA in the
Woolly Mammoth
Debruyne et al.
2008. Out of
America: Ancient
DNA Evidence for a
New World Origin of
Late Quaternary
Woolly Mammoths.
Curr. Biol. 18:1320-1326.
Indirect Studies
• Recall that $D_t = D_0(1-r)^t$
• Therefore, Multi-locus or Multi-site Polymorphic Data Contains Historical Information, and This Retention Is For Long Periods of Time When r Is Small.
• Attempts to Reconstruct History Depend Upon Multiple Loci or Upon Multi-Site Haplotypes.
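A minimal sketch of why small r preserves historical information: it evaluates D_t = D_0 (1 - r)^t and the number of generations needed for disequilibrium to halve. The function names and the example recombination rates are illustrative assumptions, not values from the slides.

```python
import math


def disequilibrium(d0: float, r: float, t: int) -> float:
    """Linkage disequilibrium after t generations: D_t = D_0 * (1 - r)**t."""
    return d0 * (1.0 - r) ** t


def half_life(r: float) -> float:
    """Generations needed for D to fall to half of its initial value."""
    return math.log(0.5) / math.log(1.0 - r)


# Unlinked loci (r = 0.5) lose half their disequilibrium every generation;
# tightly linked sites retain it for a very long time.
for r in (0.5, 0.01, 0.0001):
    print(f"r = {r}: half-life of about {half_life(r):.0f} generations")
```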
Multiple Loci: Principal Component Analysis of Genetic Data
This procedure has long been used
in human genetics to extract multilocus information about gene flow
patterns (e.g., Cavalli-Sforza &
Ammerman, 1984).
Multiple Loci: Principal Component Analysis of Genetic Data
Novembre et al. Nature
31 Aug 2008. Based on
197,146 loci in 1,387
individuals.
Overlay of the steepest slope values (upper 5%)
Microsatellite survey of naked
mole rats in Meru National
Park, Kenya (Jon Hess)
Haplotypes
• One Method Is To Look At the Spatial Distribution of Globally Rare, Tip Haplotypes (Although They May be Locally Common)
• Coalescent Theory Implies Such Haplotypes Are Recent, And Therefore Are Not In Equilibrium And Have Limited Spatial Distributions
• Therefore, Globally Rare, Tip Haplotypes Provide A Straightforward Method of Observing The Movements of Genes Through Space Over Short and Recent Time Periods.
Geographic distribution of the Asian and American populations genotyped for the microsatellite
D9S1120
Schroeder, K. B. et al. Mol Biol Evol 2009 26:995-1016
“Private” 9-repeat allele
Visual genotypes, clustered by population, for individuals either homozygous or heterozygous
for the 9-repeat allele
Implies that this
“private allele” is
identical by descent in
all Western Beringians
and Native Americans,
which in turn implies
that Native Americans
Descended (at least in
part) From These
Western Beringian
Populations.
Schroeder, K. B. et al. Mol Biol Evol 2009 26:995-1016
Method for estimating the TMRCA of copies of an allele from
the number of recombination events on its shared
haplotypic background
Schematics of the demographic models used for the coalescent simulations: (A)
population split with two equal-size descendant populations (Asia and America),
(B) population split with NAs/NAm equal to 0.15 at TAs/Am, and (C) population
split with NAs/NAm equal to 0.02 at TAs/Am, followed by population growth such
that NAs/NAm equals 0.15 at T0. Models D and E are the same as models B and
C, respectively, but include population substructure in Asia and in America.
Under the different best models, the mean TMRCA of the 9-repeat
allele ranged from 293 generations to 1,596 generations; using a
generation time of 25 years resulted in a TMRCA of 7,325-39,900
years ago. Averaging over all of our best models, the mean TMRCA
is 513 generations ago or about 12,825 years ago. The 95%
confidence intervals for all of the best models produced ages for
the MRCA of the 9-repeat allele that range from 144 to 1,951
generations ago, or approximately 3,600-48,775 years ago.
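The year figures quoted above follow directly from the 25-year generation time. The short sketch below redoes that arithmetic; the dictionary labels and function name are illustrative, while the generation counts are those given in the text.

```python
GENERATION_TIME_YEARS = 25  # generation time stated in the text


def generations_to_years(generations: int, years_per_generation: int = GENERATION_TIME_YEARS) -> int:
    """Convert a TMRCA in generations to years."""
    return generations * years_per_generation


tmrca_generations = {
    "lowest mean TMRCA": 293,
    "highest mean TMRCA": 1596,
    "mean over best models": 513,
    "lower 95% bound": 144,
    "upper 95% bound": 1951,
}

for label, generations in tmrca_generations.items():
    print(f"{label}: {generations} generations = {generations_to_years(generations)} years")
```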
Haplotype Trees
• Are Biologically Meaningful Only When Recombination Is Absent Or Rare
• Give Some Information About Temporal Ordering of Mutational Variation, Both the Rare and the Common Mutations
• Not Limited to Recent Events, But Can Go Back Further In Time (But Not Beyond the Most Recent Common Ancestral DNA Molecule)
A Haplotype Tree Should Never
Be Equated To A Tree of
Populations. It Is Only The Tree
of The Genetic Variation For
That DNA Region.
There Is Information About
Population History in the
Haplotype Tree, But It Must Be
Extracted Carefully.
Haplotype Trees ≠ Species or
Population Trees
It is dangerous to equate a haplotype
tree to a species tree.
It is NEVER justified to equate a
haplotype tree to a tree of populations
within a species because the problem
of lineage sorting is greater and the
time between events is shorter.
Moreover, a population tree need
not exist at all.
Nested Clade Analysis
• Converts Haplotype Trees Into A Nested Statistical Design
• Other Data (Phenotypic or Geographical) Are Then Overlaid Upon The Nested Design
• Statistical Tests Are Performed To Detect Significant Associations Between the Data and The Haplotype Tree
• DOES NOT EQUATE THE HAPLOTYPE TREE TO A POPULATION TREE!
NCPA Distance Measures
= Sample locations
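A minimal sketch of the two NCPA distance measures, assuming their usual definitions: Dc (clade distance) is the average distance of individuals bearing a clade from that clade's geographical center, and Dn (nested clade distance) is their average distance from the center of the nesting clade. For simplicity the sketch uses planar coordinates and straight-line distances, which is an assumption; the sample data are hypothetical.

```python
import math


def center(samples):
    """Count-weighted geographical center of (x, y, count) samples."""
    total = sum(n for _, _, n in samples)
    cx = sum(x * n for x, _, n in samples) / total
    cy = sum(y * n for _, y, n in samples) / total
    return cx, cy


def mean_distance(samples, point):
    """Count-weighted mean distance of the samples from a point."""
    total = sum(n for _, _, n in samples)
    return sum(math.hypot(x - point[0], y - point[1]) * n for x, y, n in samples) / total


def clade_distances(clade, nesting_clade):
    """Return (Dc, Dn) for a clade nested within a higher-level clade."""
    dc = mean_distance(clade, center(clade))
    dn = mean_distance(clade, center(nesting_clade))
    return dc, dn


# Hypothetical data: (x_km, y_km, number of individuals) per sampling site.
tip = [(0.0, 0.0, 4)]
interior = [(0.0, 0.0, 2), (100.0, 0.0, 2)]
nesting = tip + interior

print("tip clade (Dc, Dn):", clade_distances(tip, nesting))
print("interior clade (Dc, Dn):", clade_distances(interior, nesting))
```

In an actual analysis the significance of Dc and Dn is judged by randomly permuting clades against sampling locations, which this sketch does not attempt.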
A Haplotype Tree In Elephants
Amboseli
Tsavo
Hwange
Matetsi
Sengwa
Victoria Falls
NCPA distances for the elephant haplotype tree (S = significantly small, L = significantly large)

            Within 1-Step Clades                      Within Total Tree
Haplotype   No. in sample   Dc         Dn             1-Step Clade   Dc         Dn
1           35              1021L***   1027L***
2           20              81S***     657S***
3           1               0          601
Old-Young                   944L***    373L***        1-1            884        1173L***
4           11              959L***    832L***
5           16              114        249S*
6           3               0          156S*
Old-Young                   862L***    598L***        1-2            460S***    768S***
7           27              47         47
8           1               0          126
9           1               0          68
Old-Young                   47         -50            1-3            49S***     759S**
                                                      Old-Young      626L***    409L***
Only When Statistical Significance
Is Achieved Is The Biological
Significance Interpreted With
Explicit, a priori Criteria
• For Example, Under Isolation By Distance, It Takes
Many Generations For A New Haplotype To Spread
Across Many Demes.
• Therefore, Expect Older Haplotypes To Be More
Widespread Than Younger Haplotypes
• Younger Haplotypes Tend To Have Geographical
Ranges Nested Within the Ranges of Their Ancestral
Haplotypes
A Haplotype Tree In Elephants
[Figure: haplotype tree over the sampling sites Amboseli, Tsavo, Hwange, Matetsi, Sengwa, and Victoria Falls, annotated four times with "Gene flow with IBD"]
Historical Events Also Leave
Lasting Patterns in Haplotype Trees.
For Example, When A Population
Expands Into a New Area, Even
Haplotypes Recently Created by
Mutation Can Become Geographically
Widespread, and Haplotypes Created By
Mutation After the Expansion Can Be
Located Far From the Geographical
Center of Their Ancestral Haplotype.
[Figure: Range Expansion, Past to Present, across Area A, Area B, and Area C]
Nested Clade Analysis of the Chub (Leuciscus cephalus):
Range Expansion (from Durand et al. 1999)
Older Clade
Younger
Clade
2-1
SPE
Historical Events Also Leave
Lasting Patterns in Haplotype Trees.
For Example, When A Population Is
Fragmented or Otherwise Effectively
Isolated, Haplotypes That Arise After The
Fragmentation/Isolation Event Cannot
Spread to Other Geographical Areas, and
With Increasing Time, More Mutations
Can Accumulate, Resulting In Larger
Than Average Branch Lengths Between
Clades in Different Isolates.
[Figure: Old and Recent Fragmentation across Area A, Area B, and Area C]
Fragmentation between Ambystoma tigrinum tigrinum
(Clade 4-2) and A. t. mavortium (Clade 4-1)
The Nested Design Means That Inferences
Are Robust To Topological Variation
Induced by the Evolutionary
Stochasticity of the Coalescent Process
African Elephants
(Roca, A. L., N. Georgiadis, and S. J. O'Brien. 2005. Cytonuclear genomic dissociation in African elephant species. Nature Genetics 37:96-100.)
Forest Elephant
Savanna Elephant
Fragmentation Inferences From NCA
All 5 DNA regions had a different
topology with respect to the 3 elephant
taxa (only BGN gave the “species tree”);
yet NCPA inferred a fragmentation
event between forest and savanna
elephants in all 5 DNA regions.
[Figure: haplotype trees for Y-DNA, PLP, BGN, mtDNA, and PHKA2, labeled "Past Fragmentation" or "Past Fragmentation Followed By Range Expansion and Secondary Contact"]
Highly Significant Fragmentation Events Found In All Five Haplotype Trees
Nested Clade
Phylogeographic Analysis



• Recurrent Gene Flow, Range Expansion and Fragmentation Could All Have Occurred at Different Times and/or Places.
• NCPA Therefore Looks For Multiple Patterns, Not Just One
• The Relative Temporal Ordering of Events in a Nested Series of Clades Is Also Inferred by NCPA
Range Expansion
Isolation by Distance
Secondary Contact
Isolation by Distance
Fragmentation
Inferences from mtDNA haplotype tree of Ambystoma
tigrinum from NCPA and supplemental test for
secondary contact (Mol. Ecol. 10: 779-791, 2001)
By Analyzing Haplotype Trees
for mtDNA, Y-DNA, X-linked
DNA and Autosomal DNA,
One Can Sample A Wide
Variety of Time Scales and
Both Male and Female
Mediated Gene Flow and
Historical Events
By Analyzing Multiple
Haplotype Trees Can
Statistically Correct For The
Evolutionary Stochasticity of
The Coalescent Process For
Any One Genomic Region
Inference Errors in Nested Clade Analysis
• Inference Requires That An Appropriate Mutation Occurred At the Right Time and Right Place: Therefore, Some Events and Processes Are Missed With A Particular DNA Region.
• Selection and Evolutionary Stochasticity Can Distort The Distribution of Haplotypes in Space and Time, Thereby Leading to False Positive Inferences.
These errors can be minimized by studying multiple loci and
requiring each inference (type, place and time) to be cross-validated by two or more loci.
Multilocus Nested Clade Analysis
• Perform Single Locus NCPA on n loci
• Discard any inferences made only by a single locus
• Group together all the inferences made by 2 or more loci that are concordant by type of inference and geographical location.
• Test the null hypothesis that all inferences of an event that are concordant by event type and location are a single event.
• Because gene flow is a recurrent process, inferences of gene flow between two regions are not necessarily concordant in time, but can test the null hypothesis that there was no gene flow between two regions in an interval of time, say t1 to t2, given multiple inferences of gene flow between the two regions.
• ALL RETAINED INFERENCES HAVE BEEN CROSS-VALIDATED ACROSS LOCI AND HAVE EXPLICIT, QUANTIFIED STATISTICAL SUPPORT.
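The grouping and discarding steps listed above can be sketched as follows. This is only an illustration of the bookkeeping, under the assumption that each single-locus inference is recorded as a (locus, inference type, location, estimated time) tuple; the times are hypothetical and the temporal concordance test itself is not implemented here.

```python
from collections import defaultdict


def cross_validated(inferences, min_loci=2):
    """Group inferences by (type, location); keep groups supported by >= min_loci distinct loci."""
    groups = defaultdict(list)
    for locus, event_type, location, time_mya in inferences:
        groups[(event_type, location)].append((locus, time_mya))
    return {
        key: members
        for key, members in groups.items()
        if len({locus for locus, _ in members}) >= min_loci
    }


# Hypothetical single-locus NCPA output; the times (in MYA) are made up for illustration.
inferences = [
    ("mtDNA", "fragmentation", "forest/savanna", 4.0),
    ("BGN", "fragmentation", "forest/savanna", 4.5),
    ("PLP", "range expansion", "south", 1.0),
]

kept = cross_validated(inferences)
print(kept)
```

The single-locus range expansion is dropped, and the two concordant fragmentation inferences are retained as candidates for the test of a single event.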
Using Theory Developed by Tajima
(1983) and Kimura (1970), The
Distribution Of The Inference Time Is:
$$f(t_i \mid k_i, T_i) = \frac{t_i^{k_i}\exp\left[-t_i(1+k_i)/T_i\right]}{\left[T_i/(1+k_i)\right]^{1+k_i}\,\Gamma(1+k_i)}$$
where $k_i$ is the average pairwise nucleotide diversity among
the haplotypes in DNA region i in the youngest monophyletic
clade that contributed in a statistically significant fashion to
the NCPA inference of interest, and $T_i$ is the age obtained by
the Takahata et al. molecular clock estimator (or perhaps some
other method) for this inference from DNA region i.
Estimated Times To Common Ancestor
(Method of Takahata et al. 2001)
D_hc = Nucleotide Difference Between Humans & Chimps (Split 6 Million Years Ago)
D_h = Nucleotide Difference Within Humans
TMRCA = 12 D_h / D_hc
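A small worked version of the estimator on this slide. The sketch assumes the factor 12 is twice the 6-million-year human-chimpanzee split shown in the figure, so the split time is exposed as a parameter; the function name and the example divergence values are hypothetical.

```python
def tmrca_takahata(d_h: float, d_hc: float, split_mya: float = 6.0) -> float:
    """TMRCA in millions of years: 2 * split_mya * D_h / D_hc (12 * D_h / D_hc for a 6 MYA split)."""
    if d_hc <= 0:
        raise ValueError("d_hc must be positive")
    return 2.0 * split_mya * d_h / d_hc


# Hypothetical divergences: within-human difference is 1/12 of the human-chimp difference.
print(tmrca_takahata(d_h=0.001, d_hc=0.012))
```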
A Likelihood Ratio Test of The
Hypothesis That The Estimated Times
of An Event From j Loci Are The Same
Fragmentation Inferences From NCA
Null Hypothesis: there was a single
fragmentation event between forest
and savanna elephants.
log-likelihood ratio test = 1.497 with 4
degrees of freedom, p = 0.8272. Accept
Null Hypothesis, with T = 4.2 MYA.
There are at least 2 lineages of
African Elephants.
[Figure: haplotype trees for Y-DNA, PLP, BGN, mtDNA, and PHKA2, labeled "Past Fragmentation" or "Past Fragmentation Followed By Range Expansion and Secondary Contact"]
Highly Significant Fragmentation Events Found In All Five Haplotype Trees
Performed Nested Clade Analyses on
25 DNA Regions in Humans
• Mitochondrial DNA (Ingman et al. Nature 408, 708 - 713, 2000: Sykes
et al. American Journal of Human Genetics 57, 1463-1475, 1995; Torroni et al. American Journal of
Human Genetics 53, 563-590, 1993, American Journal of Human Genetics 53, 591-608, 1993).
• Y-DNA (Hammer et al. Molecular Biology and Evolution 15, 427-441, 1998)
• 11 X-Linked Regions (Balciuniene et al. 2001; Garrigan et al. 2005;
Hammer et al. 2004; Harris. & Hey, 1999, 2001; Kaessmann et al. 1999; Nachman et al. 2004;
Saunders et al. 2002; Verrelli et al. 2002; Yu et al. 2002)
• 12 Autosomal Genes (Bamshad et al. 2002, Harding et al. 1997; Hollox
et al. 2001; Jin et al. 1999; Koda et al. 2001; Rana et al. 1999; Rogers et al. 2000; Toomajian and
Kreitman 2002; Wooding et al. 2002; Zhang & Rosenberg 2000).
Three Out-of-Africa Events, All Defined
By Three or More Loci With A High
Degree of Temporal Homogeneity
(P = 0.95, P = 0.51, and P = 0.62)
But With Highly Significant
Heterogeneity Between
The Three Events
The log likelihood ratio test rejects the null hypothesis that all 15 events
are temporally concordant with a probability value of 3.89 × 10^-15.
There Were At Least Three Out-of-Africa
Expansion Events Over the Last 2 Million Years
Inferences of Gene Flow That Are
Concordant Geographically Are NOT
Necessarily Concordant Temporally
Because Gene Flow is a Recurrent
Process. However, We Can Test The
Null Hypothesis of NO GENE FLOW
Between Two Geographical Regions
Over a Specified Time Interval.
Test Of The Null Hypothesis of NO
GENE FLOW Between Two
Geographical Regions Over a
Specified Time Interval l to u:
$$\Lambda[l,u] = \prod_{i=1}^{j}\left(1 - \int_{l}^{u} \frac{t_i^{k_i}\exp\left[-t_i(1+k_i)/T_i\right]}{\left[T_i/(1+k_i)\right]^{1+k_i}\,\Gamma(1+k_i)}\,dt_i\right)$$
$$\mathrm{LRT}([l,u]) = -2\ln\Lambda[l,u]$$
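A minimal numerical sketch of this test, assuming the statistic is read as the product over the j loci of one minus the gamma probability mass that each inference places in the interval [l, u], with LRT = -2 ln of that product. The density is a gamma distribution with shape 1 + k_i and mean T_i; the integration uses Simpson's rule from the standard library only. The function names and the example (k_i, T_i) pairs are hypothetical.

```python
import math


def inference_time_pdf(t: float, k: float, T: float) -> float:
    """Gamma density with shape 1 + k and mean T, evaluated at time t."""
    if t < 0:
        return 0.0
    if t == 0:
        return 1.0 / T if k == 0 else 0.0
    shape = 1.0 + k
    scale = T / shape
    log_pdf = k * math.log(t) - t / scale - shape * math.log(scale) - math.lgamma(shape)
    return math.exp(log_pdf)


def prob_in_interval(l: float, u: float, k: float, T: float, steps: int = 2000) -> float:
    """Simpson's-rule integral of the density from l to u (steps must be even)."""
    h = (u - l) / steps
    total = inference_time_pdf(l, k, T) + inference_time_pdf(u, k, T)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inference_time_pdf(l + i * h, k, T)
    return total * h / 3.0


def no_gene_flow_test(l: float, u: float, loci):
    """Return (Lambda, LRT): Lambda = prod_i (1 - P_i[l, u]), LRT = -2 ln Lambda."""
    lam = 1.0
    for k, T in loci:
        lam *= 1.0 - prob_in_interval(l, u, k, T)
    return lam, -2.0 * math.log(lam)


# Hypothetical (k_i, T_i) pairs for three gene-flow inferences, with T_i in MYA.
example_loci = [(2.0, 1.0), (3.5, 0.8), (1.2, 1.5)]
print(no_gene_flow_test(0.5, 1.5, example_loci))
```

The more inferences that place probability mass inside the interval, the smaller the product becomes and the larger the test statistic against isolation.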
Gamma Distributions For 19
African/Eurasian Gene Flow Inferences
With Isolation By Distance
Extensive overlap implies cross-validation
with the exception of MX1, the only locus
with most of its probability mass in the
Pliocene.
The lack of clusters implies there
were no prolonged breaks in gene
flow throughout the Pleistocene
Testing The Null Hypothesis of No
African/Eurasian Gene Flow Throughout
the Pleistocene
The Null hypothesis
of isolation (no gene
flow) in this time
interval is rejected
with p < 10^-8
All of The Cross
Validated
Inferences
Integrate Well
Into A Single
Overview of The
Emergence of
Humans.
Coalescent Simulations
Set of Fully Specified
Phylogeographic Hypotheses
Simulate Coal.
Process Many Times
Under Each Hypothesis
Virtual Current Generation
Real Current Generation
Draw Simulated
Sample of Same Size
as Real Sample
Statistics on Simulated Sample
Statistics from Real Sample
Compare Relative Fits of The
Simulated Statistics Under Each
Model to The Observed Statistics
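The workflow on this slide can be sketched in a few lines. The example below is a deliberately crude stand-in, not the procedure used in any cited study: it simulates the time to the most recent common ancestor under the standard constant-size coalescent, uses the mean simulated TMRCA as the only summary statistic, and ranks two hypothetical hypotheses by how close that statistic falls to a made-up observed value.

```python
import random


def simulate_tmrca(n: int, rng: random.Random) -> float:
    """TMRCA of n lineages under the constant-size coalescent, in coalescent time units."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0  # number of lineage pairs that can coalesce
        t += rng.expovariate(rate)
    return t


def mean_simulated_tmrca(n: int, scale: float, reps: int, seed: int = 1) -> float:
    """Average simulated TMRCA, rescaled by a hypothesis-specific time scale."""
    rng = random.Random(seed)
    return scale * sum(simulate_tmrca(n, rng) for _ in range(reps)) / reps


# Hypothetical observed statistic and hypotheses (scale = time per coalescent unit).
observed = 1.0
hypotheses = {"small population": 0.5, "large population": 2.0}

# Sample size of 10 matches the (hypothetical) real sample; 5000 replicates per hypothesis.
fits = {
    name: abs(mean_simulated_tmrca(10, scale, 5000) - observed)
    for name, scale in hypotheses.items()
}
best = min(fits, key=fits.get)
print(fits, "best relative fit:", best)
```

Note that the output is only a ranking among the hypotheses that were simulated; nothing in it says that the best-ranked hypothesis fits well in absolute terms.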
Strong Vs. Weak Inference
• Falsification is the strongest inference possible in science, so this is called “strong inference.”
• Inference in NCPA is based upon the falsification of null hypotheses.
• Weak inference refers to the relative fit of a non-exhaustive set of alternatives.
• It is rare that an exhaustive set of every conceivable phylogeographic alternative can be simulated, so the coalescent simulation approach results in weak inference.
• Weak inference can give high relative support to a false hypothesis when all the alternatives are also false.
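A toy illustration of the last point, using made-up likelihoods: when support is normalized over only the models that were compared, one model can receive high relative support even though every model fits the data very poorly in absolute terms. The model names and values are hypothetical.

```python
def relative_support(likelihoods):
    """Normalize likelihoods over the candidate set so that they sum to 1."""
    total = sum(likelihoods.values())
    return {name: value / total for name, value in likelihoods.items()}


# Hypothetical likelihoods: all three models are extremely unlikely to have produced the data.
likelihoods = {"model 1": 8e-20, "model 2": 1.5e-20, "model 3": 5e-21}

support = relative_support(likelihoods)
for name, value in support.items():
    print(f"{name}: relative support {value:.2f}")
```

The normalization hides the absolute scale, which is why a high relative value here cannot by itself distinguish a good model from the least bad of several false ones.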
E.g., Fagundes et al. (PNAS
104:17614-17619, 2007)
Tested 3 Models of Human
Evolution via Simulation
Templeton (Yearbook of
Physical Anthropology
48:33-59, 2005) Falsified All
Three Models, With AFREG
Rejected with p < 10^-17
These Results Are NOT Contradictory!
E.g., Fagundes et al. (PNAS
104:17614-17619, 2007)
Tested 3 Models of Human
Evolution via Simulation
Eswaran et al. (J. Human
Evol. 49:1-18, 2005) Tested
AFREG vs. A model of
Isolation By Distance and
Strongly Rejected AFREG.
[Figure labels: Africa, S. Europe, N. Europe, S. Asia, N. Asia, Pacific, Americas]
These Results Are NOT Contradictory!
Interpretive Criteria
• Simulations assign “probabilities” to complex models as a
whole, making it impossible to interpret the biological
reason for a low probability.
• In contrast, NCPA allows individual components to be
tested, making the biological interpretation clear.
Reject the Null
hypothesis of
no admixture
with p < 10^-17
Interpretive Criteria
The Null hypothesis of isolation
(no gene flow) in the minimal time
interval proposed by Fagundes et
al. is rejected with p = 1.6 × 10^-6 by
testing with multilocus NCPA.
Interpretive Criteria
• Although Fagundes et al. (2007) interpreted the rejection of their assimilation model as a rejection of admixture, the confounded nature of simulation inference means that such an interpretation has no logical validity.
• NCPA allows individual components to be tested, making it clear that the part of their assimilation model that is wrong is NOT admixture, but rather the assumption of prior isolation of archaic Africans and Eurasians.
Coherent Inference
• Coherence is a property referring to nested and composite hypotheses.
• The meaning of coherence is most easily illustrated with nested hypotheses:
[Venn diagram: A nested within B]
One measure of fit is the probability of
the hypotheses. Because A is a nested
subset of B, Prob.(B) ≥ Prob.(A). This
relationship is “coherent”.
If one assigned Prob.(A) > Prob.(B),
this is mathematically impossible and is
said to be “incoherent”.
E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)
The “assimilation” model (B) allows the possibility of admixture
between Africans and Eurasians, measured by the parameter M
that can vary between 0 and 1. Note, M=0 corresponds to
replacement, so the replacement model (A) is a proper subset of
the assimilation model.
Note the probabilities assigned to A and B.
The ABC method is INCOHERENT!
Why Is ABC INCOHERENT?
There is no correction for dimensionality of
the different hypotheses (indexed by i); and
The denominator treats all hypotheses as
mutually exclusive events.
Equation 9 From Beaumont, M. A., W. Y. Zhang, and D. J. Balding. 2002. Approximate
Bayesian computation in population genetics. Genetics 162:2025-2035.
E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)
[Venn diagrams of hypotheses A, B, and C]
With A nested within B:
Prob(A or B or C) = P(B) + P(C) - P(B & C)
Equation 9 states that
Prob(A or B or C) = P(A) + P(B) + P(C)
Hence, the fundamental
equation of ABC is
mathematically incoherent
for nested and/or composite
hypotheses.
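The difference between the two expressions can be checked on a toy probability space. In the sketch below the outcomes are equally likely and the events A, B, and C are hypothetical sets chosen so that A is nested within B and B overlaps C; exact fractions avoid rounding issues.

```python
from fractions import Fraction

SAMPLE_SPACE = set(range(20))  # 20 equally likely outcomes (hypothetical)
A = {0, 1}                     # nested within B
B = {0, 1, 2, 3, 4, 5}
C = {4, 5, 6, 7}               # overlaps B


def prob(event):
    """Probability of an event as the fraction of outcomes it contains."""
    return Fraction(len(event), len(SAMPLE_SPACE))


assert A <= B                  # A is a subset of B
assert prob(A) <= prob(B)      # coherence: the nested hypothesis cannot be more probable

union = prob(A | B | C)
inclusion_exclusion = prob(B) + prob(C) - prob(B & C)
naive_sum = prob(A) + prob(B) + prob(C)

print("P(A or B or C)        =", union)
print("P(B) + P(C) - P(B&C)  =", inclusion_exclusion)
print("P(A) + P(B) + P(C)    =", naive_sum)
```

Treating the three events as mutually exclusive overstates the probability of their union whenever they are nested or overlap.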
Other Methods of Evaluating Hypotheses in the
Coalescent Simulation Approach are Incoherent
• Bayes Factors are known to be incoherent (Lavine,
M., and M. J. Schervish. 1999. Bayes Factors: What
They Are and What They Are Not. The American
Statistician 53:119-122).
• Mesquite and all other programs that treat all
phylogeographic hypotheses as mutually exclusive
alternatives are incoherent.
• Coalescent Simulations Can Only Be Used to Test
Single Parameter Models Against Their
Complement (e.g., FST > 0 vs. FST = 0).
Statistical
Phylogeography
Multilocus NCPA provides a robust, flexible
testing framework.
Simulations have multiple statistical flaws
and cannot be used to test composite
phylogeographic hypotheses.
NCPA defines the general model but does not
yield insight into details.
Once the general model framework has been
inferred by NCPA, simulations can be used to
estimate the underlying parameters.
Statistical
Phylogeography
NCPA and simulation approaches
are not so much alternative
techniques as they are
complementary, and potentially
synergistic, techniques. Both add
to the statistical toolkit of
intraspecific phylogeographers,
and both should be used when
appropriate.