Lecture 14: Population structure and
Population Assignment
February 28, 2014
Last Time
Sample calculation of FST
Defining populations on genetic criteria: introduction to
Structure
Today
Interpretation of F-statistics
More on the Structure program
Principal Components Analysis
Population assignment
FST: What does it tell us?
Degree of differentiation of subpopulations
Rules of thumb:
0.05 to 0.15 is weak to moderate
0.15 to 0.25 is strong differentiation
>0.25 is very strong differentiation
Related to the historical level of gene exchange between
populations
May not represent current conditions
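As a quick illustration of the rules of thumb above, the sketch below computes FST = (HT - HS) / HT for one biallelic locus and labels the result. The function names and the example allele frequencies are hypothetical, and equal subpopulation sizes are assumed.

```python
def heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs):
    """F_ST = (H_T - H_S) / H_T, assuming equally sized subpopulations."""
    h_s = sum(heterozygosity(p) for p in freqs) / len(freqs)  # mean within-subpopulation value
    p_bar = sum(freqs) / len(freqs)                           # pooled allele frequency
    h_t = heterozygosity(p_bar)                               # total expected heterozygosity
    return (h_t - h_s) / h_t


def describe(fst):
    """Label an F_ST value using the rule-of-thumb ranges from the slide."""
    if fst > 0.25:
        return "very strong"
    if fst >= 0.15:
        return "strong"
    if fst >= 0.05:
        return "weak to moderate"
    return "below the listed ranges"


subpop_freqs = [0.2, 0.8]  # hypothetical allele frequencies in two subpopulations
fst = fst_biallelic(subpop_freqs)
print(round(fst, 3), describe(fst))
```

With these made-up frequencies HS = 0.32 and HT = 0.5, so FST = 0.36.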
FST is related to life history
Trait                 Category             FST
Seed Dispersal        Gravity              0.446
                      Explosive/capsule    0.262
                      Winged/Plumose       0.079
Successional Stage    Early                0.411
                      Middle               0.184
                      Late                 0.105
Life Cycle            Annual               0.430
                      Short-lived          0.262
                      Long-lived           0.077
(Loveless and Hamrick, 1984)
Structure Program
One of the most widely-used programs in population genetics
(original paper cited >11,000 times since 2000)
Very flexible model can determine:
The most likely number of uniform groups (populations, K)
The genomic composition of each individual (admixture
coefficients)
Possible population of origin
A simple model of population structure
Individuals in our sample represent a mixture of K (unknown)
ancestral populations.
Each population is characterized by (unknown) allele frequencies at
each locus.
Within populations, markers are in Hardy-Weinberg and linkage
equilibrium.
Roughly speaking, the model sorts individuals into K clusters
so as to minimize departures from HWE and Linkage
Equilibrium.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
More on the model...
Let A1, A2, …, AK represent the (unknown) allele frequencies
in each subpopulation
Let Z1, Z2, …, Zm represent the (unknown) subpopulation of
origin of the sampled individuals
Assuming Hardy-Weinberg and linkage equilibrium within
subpopulations, the likelihood of an individual’s genotype
in subpopulation k is given by the product of the relevant
allele frequencies:
$$\Pr(G_i \mid Z_i = k, A_k) = \prod_{l \in \text{loci}} P_l$$
where P_l is the probability of observing the individual’s genotype at locus l in
subpopulation k
Probability of observing a genotype in a
subpopulation
Probability of observing a genotype at locus l by chance
in population is a function of allele frequencies:
$$P_l = p_i^2$$ (homozygote)
$$P_l = 2 p_i p_j$$ (heterozygote)
$$P = \prod_{l=1}^{m} P_l$$ for m loci
Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium
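The genotype probabilities above can be worked through in a few lines. This is a minimal sketch: the function names, allele labels, and allele frequencies are hypothetical and chosen only to illustrate the homozygote, heterozygote, and multilocus product rules.

```python
from math import prod


def locus_probability(genotype, freqs):
    """Hardy-Weinberg probability of one diploid genotype at one locus."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2           # homozygote: p_i^2
    return 2.0 * freqs[a] * freqs[b]   # heterozygote: 2 p_i p_j


def multilocus_probability(genotypes, freqs_by_locus):
    """Product over loci, which assumes the loci are independent (unlinked)."""
    return prod(locus_probability(g, f) for g, f in zip(genotypes, freqs_by_locus))


# Hypothetical allele frequencies in one subpopulation at two loci
freqs_by_locus = [{"A": 0.7, "a": 0.3}, {"B": 0.5, "b": 0.5}]
individual = [("A", "A"), ("B", "b")]  # homozygous at locus 1, heterozygous at locus 2

p = multilocus_probability(individual, freqs_by_locus)
print(p)
```

Here the two per-locus probabilities are 0.7^2 = 0.49 and 2(0.5)(0.5) = 0.5, so the multilocus probability is about 0.245.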
If we knew the population allele frequencies in
advance, then it would be easy to assign
individuals.
If we knew the individual assignments, it would be
easy to estimate frequencies.
In practice, we don’t know either of these, but
the following MCMC algorithm converges to
sensible joint estimates of both.
MCMC Algorithms Provide a way of Efficiently Exploring
Parameter Space to Find the Most Probable Combination
of Values
http://www.frankfurt-consulting.de/English/optimierung_us.htm
Take Stat 745 Data Mining with Dr. Culp for gory details
MCMC algorithm (for fixed K)
Start with random assignment of individuals to populations
Step 1: Gene frequencies in each population are estimated
based on the individuals that are assigned to it.
Step 2: Individuals are assigned to populations based on gene
frequencies in each population.
Continue this process many times to maximize likelihood of the
arrangement
…Estimation of K performed separately.
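The two alternating steps above can be sketched as a toy sampler for the no-admixture case. This is an illustration under stated assumptions, not the Structure implementation: the function names, the flat Beta(1, 1) prior on allele frequencies, biallelic loci coded as allele counts (0, 1, 2), and the simulated data are all made up for the example. Strictly speaking, the sketch draws samples from the posterior rather than climbing to a single maximum.

```python
import math
import random


def genotype_loglik(g, p):
    """Log HWE probability of carrying g copies (0, 1 or 2) of an allele with frequency p."""
    return math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1.0 - p)


def gibbs_cluster(genotypes, K, sweeps, rng):
    n, n_loci = len(genotypes), len(genotypes[0])
    z = [rng.randrange(K) for _ in range(n)]  # start with random assignments
    for _ in range(sweeps):
        # Step 1: draw allele frequencies for each cluster from its current members
        freqs = []
        for k in range(K):
            members = [genotypes[i] for i in range(n) if z[i] == k]
            row = []
            for locus in range(n_loci):
                ref = sum(g[locus] for g in members)
                alt = 2 * len(members) - ref
                p = rng.betavariate(1 + ref, 1 + alt)
                row.append(min(max(p, 1e-9), 1 - 1e-9))  # keep logs finite
            freqs.append(row)
        # Step 2: reassign each individual in proportion to its genotype likelihood
        for i in range(n):
            ll = [sum(genotype_loglik(g, p) for g, p in zip(genotypes[i], freqs[k]))
                  for k in range(K)]
            top = max(ll)
            weights = [math.exp(v - top) for v in ll]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z


# Simulated data: two strongly differentiated populations, 20 individuals each, 20 loci
rng = random.Random(1)
true_p = [[0.9] * 20, [0.1] * 20]
truth = [0] * 20 + [1] * 20
genotypes = [[sum(rng.random() < p for _ in range(2)) for p in true_p[t]] for t in truth]

z = gibbs_cluster(genotypes, K=2, sweeps=50, rng=rng)
agree = sum(a == b for a, b in zip(z, truth))
accuracy = max(agree, len(truth) - agree) / len(truth)  # cluster labels are arbitrary
print(accuracy)
```

Cluster labels carry no meaning of their own, so agreement with the simulated truth is scored up to a swap of the two labels.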
Admixed individuals are mosaics of
ancestry from the original populations
[Figure: Ancestral Populations]
The two basic ancestry models used by structure.
No Admixture: each individual is derived completely from a
single subpopulation
Admixture: individuals may have mixed ancestry: some fraction
qk of the genome of individual i is derived from subpopulation k.
The admixture model allows for hybrids, and it is more flexible and
often provides a better fit for complicated structure. This is what we
used in lab.
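To make the difference between the two ancestry models concrete, the sketch below computes a single-locus genotype probability when each allele copy is drawn from subpopulation k with probability qk. The function names and allele frequencies are hypothetical. Setting q to (1, 0) recovers the no-admixture case.

```python
def allele_probability(allele, q, freqs_by_pop):
    """Probability of drawing this allele when population k is chosen with probability q[k]."""
    return sum(q_k * freqs[allele] for q_k, freqs in zip(q, freqs_by_pop))


def admixed_locus_probability(genotype, q, freqs_by_pop):
    """Genotype probability when the two allele copies are drawn independently."""
    a, b = genotype
    pa = allele_probability(a, q, freqs_by_pop)
    pb = allele_probability(b, q, freqs_by_pop)
    return pa * pb if a == b else 2.0 * pa * pb


# Hypothetical allele frequencies at one locus in two subpopulations
freqs_by_pop = [{"A": 0.9, "a": 0.1}, {"A": 0.2, "a": 0.8}]

pure = admixed_locus_probability(("A", "a"), [1.0, 0.0], freqs_by_pop)    # no admixture
hybrid = admixed_locus_probability(("A", "a"), [0.5, 0.5], freqs_by_pop)  # half from each
print(pure, hybrid)
```

With these made-up numbers the heterozygote is much more probable for an individual with half its ancestry from each subpopulation (about 0.495) than for an individual entirely from the first subpopulation (about 0.18).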
Notes on Estimating the Number of Subpopulations (K)
Likelihood-based method is the simplest, but likelihood often
increases continuously with K
More variability at values of K beyond the “natural” value
Evanno et al. (2005) method measures change in likelihood and
discounts for variation
Use biological reasoning in arriving at a final value
Can also incorporate prior expectations based on population
locations, other information (e.g., Geneland package)
Often need to do hierarchical analyses: break into subregions and
run Structure separately for each
Estimating K
Structure is run separately at different values of K. The
program computes a statistic that measures the fit of each
value of K (sort of a penalized likelihood); this can be used
to help select K.
Assumed value of K    Ln(Pr(D|K))
1                     -71500
2                     -69200
3                     -70500
Convert to posterior probability using Bayes’ Theorem:
$$\Pr(p \mid \text{Data}) = \frac{\Pr(\text{Data} \mid p)\,\Pr(p)}{\sum_{i=1}^{3} \Pr(\text{Data} \mid p_i)\,\Pr(p_i)}$$
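Applying Bayes' theorem to the three log values above takes a little care, because exponentiating numbers like -70000 directly underflows to zero. The sketch below subtracts the largest value first. It assumes a uniform prior over the candidate values of K unless one is supplied, and the function name is hypothetical.

```python
import math


def posterior_over_k(log_likelihoods, priors=None):
    """Turn ln Pr(Data | K) values into posterior probabilities for each K."""
    ks = sorted(log_likelihoods)
    if priors is None:
        priors = {k: 1.0 / len(ks) for k in ks}  # assumption: uniform prior on K
    log_terms = {k: log_likelihoods[k] + math.log(priors[k]) for k in ks}
    top = max(log_terms.values())                # subtract the maximum to avoid underflow
    weights = {k: math.exp(v - top) for k, v in log_terms.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


ln_pr = {1: -71500.0, 2: -69200.0, 3: -70500.0}  # values from the table above
posterior = posterior_over_k(ln_pr)
best_k = max(posterior, key=posterior.get)
print(best_k, posterior)
```

Because the log values differ by more than a thousand units, essentially all of the posterior probability falls on K = 2.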
Another method for inference of K
The ΔK method of Evanno et al. (2005, Mol. Ecol. 14:
2611-2620):
Eckert, Population Structure, 5-Aug-2008 46
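A rough sketch of the ΔK idea follows: take the second-order change in the mean log likelihood around each K and divide by the standard deviation across replicate runs at that K. Averaging over replicates before differencing is one common reading of the method and is an assumption here, as are the function name, the replicate values, and the K = 4 entry. Only the central values for K = 1 to 3 echo the table shown earlier.

```python
from statistics import mean, stdev


def delta_k(runs):
    """runs maps K to a list of ln Pr(D | K) values from replicate runs."""
    avg = {k: mean(v) for k, v in runs.items()}
    out = {}
    for k in sorted(runs):
        if k - 1 in runs and k + 1 in runs:  # needs both neighbouring values of K
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            out[k] = second_diff / stdev(runs[k])
    return out


# Hypothetical replicate runs (three per K)
runs = {
    1: [-71500, -71510, -71490],
    2: [-69200, -69210, -69190],
    3: [-70500, -70300, -70700],
    4: [-70900, -71300, -70500],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

ΔK is undefined for the smallest and largest K tested, so this approach cannot select K = 1.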
Inferred human population structure
[Figure: Structure bar plot of individuals grouped by region: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America]
Each individual is a thin vertical line that is partitioned into K colored segments
according to its membership coefficients in K clusters.
Rosenberg et al. 2002 Science 298: 2381-2385
Structure is Hierarchical: Groups reveal more substructure when examined
separately
Rosenberg et al. 2002 Science 298: 2381-2385
Alternative clustering method: Principal Components
Analysis
Structure is very computationally intensive
Often no clear best-supported K-value
Alternative is to use traditional multivariate statistics to
find uniform groups
Principal Components Analysis is the most commonly used
algorithm
EIGENSOFT (PCA, Patterson et al., 2006; PloS Genetics
2:e190).
Principal Components Analysis
Efficient way to summarize multivariate data like genotypes
Each axis passes through maximum variation in data, explains a
component of the variation
http://www.mech.uq.edu.au/courses/mech4710/pca/s1.htm
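The idea of an axis through the maximum variation can be shown on a tiny genotype matrix. The sketch below finds the first principal component by power iteration on the covariance matrix, using only the standard library. It is not how EIGENSOFT works internally (that package also rescales each marker), and the function name and genotype data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (axis, scores) for the first principal component of the rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # m x m covariance matrix of the centered allele counts
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration; an uneven start vector avoids being orthogonal to the answer
    v = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Hypothetical allele counts (0, 1, 2) at four loci: three individuals from each of two groups
genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],  # group A
    [0, 0, 2, 2], [0, 1, 2, 2], [1, 0, 2, 1],  # group B
]
axis, scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

The sign of a principal component is arbitrary, so what matters is that the two groups fall on opposite sides of zero along the first axis.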
How do we identify population of origin?
Once you have populations defined, can you assign a
migrant individual to their population of origin?
Human Population Assignment with SNPs
Assayed 500,000 SNP genotypes for 3,192 Europeans
Used Principal Components Analysis to ordinate samples in space
High correspondence between sample ordination and geographic
origin of samples
Individuals assigned to
populations of origin with
high accuracy
Novembre et al. 2008 Nature 456:98
Using Structure to Show Populations of Origin:
Taita Thrush data
Three main sampling locations in Kenya
Low migration rates (radio-tagging study)
155 individuals, genotyped at 7 microsatellite loci
Slide courtesy of Jonathan Pritchard
Likelihood Approaches
Allow evaluation of alternative hypotheses by comparing
their relative likelihoods given the evidence
$$L(H_1, H_2 \mid E) = \frac{P(E \mid H_1)}{P(E \mid H_2)}$$
In a population assignment or forensic context, definition
of the competing hypotheses is the most essential
component
Population Assignment: Likelihood
Assume you find skin cells and blood under fingernails of
a murder victim
Victim had major debts with the Sicilian mafia as well as
the Chinese mafia
Can population assignment help to focus investigation?
$$L(H_1, H_2 \mid G) = LR = \frac{P(G \mid H_1)}{P(G \mid H_2)}$$
What is H1 and what is H2?
Population Assignment: Likelihood
"Assignment Tests" based on allele
frequencies in source populations and
genetic composition of individuals
Likelihood-Based Approaches
Calculate likelihood that individual
genotype originated in particular
population
Assume Hardy-Weinberg and linkage
equilibria
Genotype frequencies corrected for
presence of sampled individual
Usually reported as log10 likelihood for
origin in given population relative to
other population
Implemented in ‘GENECLASS’ program
(http://www.montpellier.inra.fr/URLB/geneclass/geneclass.html)
$$P_{kl} = p_{il}^2$$ for homozygote AiAi in population l at locus k
$$P_{kl} = 2 p_{il} p_{jl}$$ for heterozygote AiAj in population l at locus k
$$P = \prod_{k=1}^{m} P_{kl}$$ for m loci
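A small worked example of a likelihood-based assignment test follows, reporting log10 likelihoods for each candidate population and their difference (the log10 likelihood ratio). Everything named here is hypothetical: the function, the population labels, the allele frequencies, and the frequency floor used to avoid taking the log of zero. The correction for the presence of the sampled individual is not implemented, and this is not the GENECLASS code.

```python
import math


def log10_genotype_likelihood(genotypes, freqs_by_locus, floor=0.01):
    """Sum of log10 Hardy-Weinberg genotype probabilities over independent loci."""
    total = 0.0
    for (a, b), freqs in zip(genotypes, freqs_by_locus):
        pa = max(freqs.get(a, 0.0), floor)  # assumption: floor for alleles unseen in the reference
        pb = max(freqs.get(b, 0.0), floor)
        total += math.log10(pa * pa if a == b else 2.0 * pa * pb)
    return total


# Hypothetical reference allele frequencies at two loci in two candidate populations
reference = {
    "lake": [{"A": 0.9, "B": 0.1}, {"C": 0.8, "D": 0.2}],
    "market": [{"A": 0.2, "B": 0.8}, {"C": 0.3, "D": 0.7}],
}
unknown = [("B", "B"), ("C", "D")]  # genotype of the individual to be assigned

scores = {pop: log10_genotype_likelihood(unknown, f) for pop, f in reference.items()}
best = max(scores, key=scores.get)
log10_lr = scores["market"] - scores["lake"]
print(best, round(log10_lr, 2))
```

With these made-up frequencies the genotype is 84 times more likely under the "market" hypothesis, a log10 likelihood ratio of about 1.92.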
Power of Population Assignment using
Likelihood
Assignment success depends on:
Number of markers used
Polymorphism of markers
Number of possible source populations
Differentiation of populations
Accuracy of allele frequency estimations
Rules of Thumb (Cornuet et al. 1999) for 100% assignment success,
for 10 reference populations need:
30 to 50 reference individuals per population
10 microsatellite loci
HE > 0.6
FST > 0.1
Population Assignment Example: Wolf Populations in Northwest
Territories
Wolves sampled from island and
mainland populations in the Canadian
Northwest Territories
Immigrants detected on mainland (black
circles) from Banks Island (white circles)
Carmichael et al. 2001 Mol Ecol 10:2787
Population Assignment Example: Fish Stories
Fishing competition on Lake
Saimaa in Southeast Finland
Contestant allegedly caught a 5.5
kg salmon, much larger than
usual for the lake
Compared fish from the lake to
fish from local markets
(originating from Norway and
Baltic sea)
7 microsatellites
Based on likelihood analysis, fish
was purchased rather than
caught in lake
[Figure labels: Lake Saimaa, Market]