
Lecture 14: Population structure and
Population Assignment
February 28, 2014
Last Time
 Sample calculation of FST
 Defining populations on genetic criteria: introduction to
Structure
Today
 Interpretation of F-statistics
 More on the Structure program
 Principal Components Analysis
 Population assignment
FST: What does it tell us?
 Degree of differentiation of subpopulations
 Rules of thumb:
 0.05 to 0.15 is weak to moderate
 0.15 to 0.25 is strong differentiation
 >0.25 is very strong differentiation
 Related to the historical level of gene exchange between
populations
 May not represent current conditions
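As a reminder of where such values come from, the following is a minimal Python sketch that computes FST = (HT - HS) / HT for a single biallelic locus and labels the result with the rules of thumb above. It assumes equal-sized subpopulations; the function names are hypothetical, and the label for values below 0.05 is an assumption, since the slide does not name that range.

```python
def fst_biallelic(freqs):
    """FST = (HT - HS) / HT for one biallelic locus.

    freqs: frequency of one allele in each subpopulation
    (subpopulations are assumed to be equal in size).
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                            # expected het., pooled
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)   # mean within-pop het.
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def describe_fst(fst):
    """Label an FST value using the slide's rules of thumb."""
    if fst > 0.25:
        return "very strong"
    if fst >= 0.15:          # boundary placement is a judgment call
        return "strong"
    if fst >= 0.05:
        return "weak to moderate"
    return "below 0.05 (not covered by the slide; label assumed)"


fst = fst_biallelic([0.2, 0.8])
print(round(fst, 3), describe_fst(fst))
```

Two subpopulations with allele frequencies 0.2 and 0.8 give HT = 0.5 and HS = 0.32, so FST = 0.36.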
FST is related to life history
Category             Class               FST
Seed Dispersal       Gravity             0.446
                     Explosive/capsule   0.262
                     Winged/Plumose      0.079
Successional Stage   Early               0.411
                     Middle              0.184
                     Late                0.105
Life Cycle           Annual              0.430
                     Short-lived         0.262
                     Long-lived          0.077
(Loveless and Hamrick, 1984)
Structure Program
 One of the most widely-used programs in population genetics
(original paper cited >11,000 times since 2000)
 Very flexible model can determine:
 The most likely number of uniform groups (populations, K)
 The genomic composition of each individual (admixture
coefficients)
 Possible population of origin
A simple model of population structure
 Individuals in our sample represent a mixture of K (unknown)
ancestral populations.
 Each population is characterized by (unknown) allele frequencies at
each locus.
 Within populations, markers are in Hardy-Weinberg and linkage
equilibrium.
 Roughly speaking, the model sorts individuals into K clusters
so as to minimize departures from HWE and Linkage
Equilibrium.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
More on the model...
 Let A1, A2, …, AK represent the (unknown) allele frequencies
in each subpopulation
 Let Z1, Z2, …, Zm represent the (unknown) subpopulation of
origin of the sampled individuals
 Assuming Hardy-Weinberg and linkage equilibrium within
subpopulations, the likelihood of an individual's genotype
in subpopulation k is given by the product of the relevant
allele frequencies:
\[ \Pr(G_i \mid Z_i = k, A_k) = \prod_{l \in \text{loci}} P_l \]
where P_l is the probability of observing the individual's genotype at locus l in
subpopulation k
Probability of observing a genotype in a
subpopulation
 Probability of observing a genotype at locus l by chance
in population is a function of allele frequencies:
\[ P_l = p_i^2 \] (homozygote)
\[ P_l = 2 p_i p_j \] (heterozygote)
\[ P = \prod_{l=1}^{m} P_l \] (for m loci)
 Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium
 If we knew the population allele frequencies in
advance, then it would be easy to assign
individuals.
 If we knew the individual assignments, it would be
easy to estimate frequencies.
 In practice, we don’t know either of these, but
the following MCMC algorithm converges to
sensible joint estimates of both.
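The "easy" direction, assigning an individual when the allele frequencies are known, can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes biallelic loci, genotypes coded as allele counts (0, 1, or 2), frequencies strictly between 0 and 1, and equal prior probability for each population. The function names are hypothetical.

```python
import math


def genotype_log_likelihood(genotype, freqs):
    """Log-likelihood of a multilocus genotype in one population.

    genotype: allele counts (0, 1, or 2) at each biallelic locus.
    freqs: frequency of the counted allele at each locus (0 < p < 1).
    """
    total = 0.0
    for g, p in zip(genotype, freqs):
        q = 1.0 - p
        prob = (q * q, 2 * p * q, p * p)[g]   # Hardy-Weinberg proportions
        total += math.log(prob)               # independent loci: sum of logs
    return total


def origin_probabilities(genotype, pop_freqs):
    """Probability of origin in each population, assuming equal priors."""
    logs = [genotype_log_likelihood(genotype, f) for f in pop_freqs]
    top = max(logs)                            # subtract max to avoid underflow
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up frequencies for two populations at three loci.
pop_freqs = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]
print(origin_probabilities([2, 2, 1], pop_freqs))
```

With these made-up frequencies, an individual with genotype counts 2, 2, 1 is assigned almost entirely to the first population.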
MCMC Algorithms Provide a way of Efficiently Exploring
Parameter Space to Find the Most Probable Combination
of Values
http://www.frankfurt-consulting.de/English/optimierung_us.htm
Take Stat 745 Data Mining with Dr. Culp for gory details
MCMC algorithm (for fixed K)
 Start with random assignment of individuals to populations
 Step 1: Gene frequencies in each population are estimated
based on the individuals that are assigned to it.
 Step 2: Individuals are assigned to populations based on gene
frequencies in each population.
 Continue this process many times to maximize likelihood of the
arrangement
 …Estimation of K performed separately.
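The two alternating steps can be illustrated with a toy Gibbs-style sampler in Python. This is a simplified sketch, not the Structure implementation: it assumes the no-admixture model, biallelic loci coded as allele counts, and a flat Beta(1, 1) prior on allele frequencies, and it samples assignments rather than explicitly maximizing a likelihood. The function name and defaults are hypothetical.

```python
import math
import random


def gibbs_cluster(genotypes, k, sweeps=50, seed=0):
    """Toy sampler for fixed K; returns the labels from the final sweep.

    genotypes: per-individual lists of allele counts (0, 1, 2) per locus.
    """
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    # Start with a random assignment of individuals to populations.
    z = [rng.randrange(k) for _ in genotypes]
    for _ in range(sweeps):
        # Step 1: draw allele frequencies for each population from a Beta
        # posterior based on the individuals currently assigned to it.
        freqs = []
        for pop in range(k):
            members = [g for g, zi in zip(genotypes, z) if zi == pop]
            row = []
            for locus in range(n_loci):
                ones = sum(g[locus] for g in members)
                zeros = 2 * len(members) - ones
                p = rng.betavariate(1 + ones, 1 + zeros)
                row.append(min(max(p, 1e-9), 1 - 1e-9))  # keep logs finite
            freqs.append(row)
        # Step 2: reassign each individual with probability proportional to
        # its genotype likelihood in each population (the heterozygote
        # factor of 2 is the same for every population, so it is omitted).
        for i, g in enumerate(genotypes):
            logs = [
                sum(c * math.log(p) + (2 - c) * math.log(1 - p)
                    for c, p in zip(g, freqs[pop]))
                for pop in range(k)
            ]
            top = max(logs)
            weights = [math.exp(x - top) for x in logs]
            z[i] = rng.choices(range(k), weights=weights)[0]
    return z
```

Cluster labels are arbitrary, so two runs can recover the same grouping with the labels swapped.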
Admixed individuals are mosaics of
ancestry from the original populations
[Figure: Ancestral Populations]
The two basic ancestry models used by Structure
 No Admixture: each individual is derived completely from a
single subpopulation
 Admixture: individuals may have mixed ancestry: some fraction
q_k of the genome of individual i is derived from subpopulation k
 The admixture model allows for hybrids; it is more flexible and
often provides a better fit for complicated structure. This is what we
used in lab.
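One way to see what the ancestry fractions q_k do is to compute a genotype probability for an admixed individual. The Python sketch below assumes a single biallelic locus and treats the two allele copies as independent draws, each coming from subpopulation k with probability q_k. The function names are hypothetical, and this is an illustration rather than Structure's own code.

```python
def admixed_allele_freq(q, pop_freqs):
    """Allele frequency expected for an individual with ancestry fractions q.

    q: ancestry fractions q_k, one per subpopulation (summing to 1).
    pop_freqs: frequency of the allele in each subpopulation at one locus.
    """
    return sum(qk * pk for qk, pk in zip(q, pop_freqs))


def admixed_genotype_prob(count, q, pop_freqs):
    """Probability of carrying `count` copies (0, 1, or 2) of the allele,
    treating the two allele copies as independent draws."""
    p = admixed_allele_freq(q, pop_freqs)
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)[count]


# A 50:50 hybrid between populations with allele frequencies 0.9 and 0.1.
print(admixed_genotype_prob(1, [0.5, 0.5], [0.9, 0.1]))
```

Setting q to 1 for a single subpopulation recovers the no-admixture case, so the no-admixture model is a special case of the admixture model.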
Notes on Estimating the Number of Subpopulations (K)
 Likelihood-based method is the simplest, but likelihood often
increases continuously with K
 More variability at values of K beyond the "natural" value
 Evanno et al. (2005) method measures change in likelihood and
discounts for variation
 Use biological reasoning in arriving at a final value
 Can also incorporate prior expectations based on population
locations or other information (e.g., Geneland package)
 Often need to do hierarchical analyses: break into subregions and
run Structure separately for each
Estimating K
Structure is run separately at different values of K. The
program computes a statistic that measures the fit of each
value of K (sort of a penalized likelihood); this can be used
to help select K.
Assumed value of K    Ln Pr(D|K)
1                     -71500
2                     -69200
3                     -70500
Convert to posterior probability using Bayes' Theorem:
\[ \Pr(p \mid \text{Data}) = \frac{\Pr(\text{Data} \mid p)\,\Pr(p)}{\sum_{i=1}^{3} \Pr(\text{Data} \mid p_i)\,\Pr(p_i)} \]
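The conversion can be worked through in Python using the three log values from the table. The sketch assumes a uniform prior over the candidate values of K unless one is supplied, and it subtracts the largest log value before exponentiating so that the very small probabilities do not cause numerical problems. The function name is hypothetical.

```python
import math


def posterior_over_k(log_likelihoods, priors=None):
    """Posterior probability of each K from Ln Pr(D|K) via Bayes' Theorem.

    log_likelihoods: dict mapping K to Ln Pr(D|K).
    priors: dict mapping K to its prior; uniform if omitted (an assumption).
    """
    ks = sorted(log_likelihoods)
    if priors is None:
        priors = {k: 1.0 / len(ks) for k in ks}
    logs = {k: log_likelihoods[k] + math.log(priors[k]) for k in ks}
    top = max(logs.values())                   # shift so the largest term is exp(0)
    weights = {k: math.exp(v - top) for k, v in logs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


post = posterior_over_k({1: -71500, 2: -69200, 3: -70500})
print(post)
```

Because the log values differ by more than a thousand units, essentially all of the posterior probability falls on K = 2.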
Another method for inference of K
 The ΔK method of Evanno et al. (2005, Mol. Ecol. 14:
2611-2620)
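Evanno et al. define ΔK as the absolute second-order change in Ln Pr(D|K) between successive values of K, divided by the standard deviation of Ln Pr(D|K) across replicate runs at that K. The Python sketch below computes it from the mean of the replicate runs at each K, which is one common way of doing the calculation. The function name is hypothetical and the replicate values are made up for illustration.

```python
import statistics


def delta_k(runs):
    """Delta K for every K that has both neighbours (K - 1 and K + 1).

    runs: dict mapping K to a list of Ln Pr(D|K) values from replicate runs.
    Requires at least two replicates per K with non-identical values.
    """
    means = {k: statistics.mean(v) for k, v in runs.items()}
    out = {}
    for k in sorted(runs):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            out[k] = second_diff / statistics.stdev(runs[k])
    return out


# Made-up replicate log probabilities, for illustration only.
runs = {
    1: [-7150.0, -7151.0, -7149.0],
    2: [-6920.0, -6921.0, -6919.0],
    3: [-6900.0, -6905.0, -6895.0],
    4: [-6890.0, -6910.0, -6880.0],
}
dk = delta_k(runs)
print(max(dk, key=dk.get), dk)
```

In this made-up example the likelihood keeps creeping upward past K = 2, but the large change in slope and the small run-to-run variance at K = 2 give it the highest ΔK.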
Eckert, Population Structure, 5-Aug-2008 46
Inferred human population structure
Figure labels: Africans, Europeans, MidEast, Cent/S Asia, Asia, Oceania, America
Each individual is a thin vertical line that is partitioned into K colored segments
according to its membership coefficients in K clusters.
Rosenberg et al. 2002 Science 298: 2381-2385
Structure is Hierarchical: Groups reveal more substructure when examined
separately
Rosenberg et al. 2002 Science 298: 2381-2385
Alternative clustering method: Principal Components
Analysis
 Structure is very computationally intensive
 Often no clear best-supported K-value
 Alternative is to use traditional multivariate statistics to
find uniform groups
 Principal Components Analysis is the most commonly used
algorithm
 EIGENSOFT (PCA, Patterson et al., 2006; PloS Genetics
2:e190).
Principal Components Analysis
 Efficient way to summarize multivariate data like genotypes
 Each axis passes through maximum variation in data, explains a
component of the variation
http://www.mech.uq.edu.au/courses/mech4710/pca/s1.htm
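To make the idea of an axis through the maximum variation concrete, here is a small pure-Python sketch that finds the first principal component of a genotype matrix by power iteration and returns each individual's score on that axis. It is an illustration only, not how EIGENSOFT works internally: genotypes are assumed to be coded as allele counts, columns are centred but not rescaled, and the function name is hypothetical.

```python
import random


def first_pc_scores(genotypes, iters=200, seed=0):
    """Scores of each individual on the first principal component.

    genotypes: per-individual lists of allele counts (0, 1, 2) per locus.
    Uses power iteration on X^T X, where X is the column-centred matrix.
    Assumes the data are not all identical (otherwise the norm is zero).
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(m)]       # random starting axis
    for _ in range(iters):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]           # X v
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]   # X^T (X v)
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]                     # renormalise the axis
    return [sum(xi[j] * v[j] for j in range(m)) for xi in x]


# Made-up genotypes: three individuals from each of two differentiated groups.
data = [[2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 0, 0],
        [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 2, 2]]
print(first_pc_scores(data))
```

In this made-up example the first three individuals fall on one side of zero and the last three on the other; the sign of the axis itself is arbitrary.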
How do we identify population of origin?
Once you have populations defined, can you assign a
migrant individual to their population of origin?
Human Population Assignment with SNPs
 Assayed 500,000 SNP genotypes for 3,192 Europeans
 Used Principal Components Analysis to ordinate samples in space
 High correspondence between sample ordination and geographic
origin of samples
 Individuals assigned to populations of origin with high accuracy
Novembre et al. 2008 Nature 456:98
Using Structure to Show Populations of Origin:
Taita Thrush data
 Three main sampling locations in Kenya
 Low migration rates (radio-tagging study)
 155 individuals, genotyped at 7 microsatellite loci
Slide courtesy of Jonathan Pritchard
Likelihood Approaches
 Allow evaluation of alternative hypotheses by comparing
their relative likelihoods given the evidence
\[ L(H_1, H_2 \mid E) = \frac{P(E \mid H_1)}{P(E \mid H_2)} \]
 In a population assignment or forensic context, defining
the competing hypotheses is the most important step
Population Assignment: Likelihood
 Assume you find skin cells and blood under fingernails of
a murder victim
 Victim had major debts with the Sicilian mafia as well as
the Chinese mafia
 Can population assignment help to focus investigation?
\[ L(H_1, H_2 \mid G) = LR = \frac{P(G \mid H_1)}{P(G \mid H_2)} \]
 What are H1 and H2?
Population Assignment: Likelihood
 "Assignment Tests" based on allele
frequencies in source populations and
genetic composition of individuals
 Likelihood-Based Approaches
 Calculate likelihood that individual
genotype originated in particular
population
 Assume Hardy-Weinberg and linkage
equilibria
 Genotype frequencies corrected for
presence of sampled individual
 Usually reported as log10 likelihood for
origin in given population relative to
other population
 Implemented in ‘GENECLASS’ program
(http://www.montpellier.inra.fr/URLB/geneclass/geneclass.html)
Pk l  p
2
il
for homozygote AiAi in
population l at locus k
Pkl  2 pil p jl
for heterozygote AiAj in
population l at locus k
m
P   Pk
k 1
for m loci
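A log10 likelihood ratio of this kind can be sketched in Python as follows. The sketch assumes Hardy-Weinberg and linkage equilibrium, takes the allele frequencies in each candidate population as given, and does not apply the correction for the presence of the sampled individual. The function names, allele labels, and frequencies are all made up for illustration and are not taken from GENECLASS.

```python
import math


def log10_likelihood(genotype, freqs):
    """log10 likelihood of a multilocus genotype in one population.

    genotype: list of (allele_a, allele_b) pairs, one pair per locus.
    freqs: list of dicts mapping allele label to frequency, one per locus.
    Every observed allele must have a non-zero frequency.
    """
    total = 0.0
    for (a, b), f in zip(genotype, freqs):
        prob = f[a] ** 2 if a == b else 2 * f[a] * f[b]
        total += math.log10(prob)
    return total


def log10_lr(genotype, freqs_h1, freqs_h2):
    """log10 of P(G | H1) / P(G | H2)."""
    return log10_likelihood(genotype, freqs_h1) - log10_likelihood(genotype, freqs_h2)


# Made-up microsatellite alleles and frequencies at two loci.
pop1 = [{"120": 0.5, "124": 0.5}, {"200": 0.5, "204": 0.5}]
pop2 = [{"120": 0.1, "124": 0.9}, {"200": 0.5, "204": 0.5}]
individual = [("120", "120"), ("200", "204")]
print(log10_lr(individual, pop1, pop2))
```

Here the genotype has probability 0.125 in the first population and 0.005 in the second, a ratio of 25 in favour of the first (log10 of about 1.4).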
Power of Population Assignment using
Likelihood
 Assignment success depends on:
   Number of markers used
   Polymorphism of markers
   Number of possible source populations
   Differentiation of populations
   Accuracy of allele frequency estimates
 Rules of Thumb (Cornuet et al. 1999): for 100% assignment success
with 10 reference populations, need:
   30 to 50 reference individuals per population
   10 microsatellite loci
   HE > 0.6
   FST > 0.1
Population Assignment Example: Wolf Populations in Northwest
Territories
 Wolves sampled from island and mainland populations in the
Canadian Northwest Territories
 Immigrants detected on mainland (black
circles) from Banks Island (white circles)
Carmichael et al. 2001 Mol Ecol 10:2787
Population Assignment Example: Fish Stories
 Fishing competition on Lake Saimaa in Southeast Finland
 Contestant allegedly caught a 5.5 kg salmon, much larger than
usual for the lake
 Compared fish from the lake to fish from local markets
(originating from Norway and the Baltic Sea)
 7 microsatellites
 Based on likelihood analysis, fish was purchased rather than
caught in lake
[Figure labels: Lake Saimaa; Market]