Lecture 14: Population structure
and Population Assignment
October 12, 2012
Lab 7 Update
 Corrected instructions for lab 7 will be posted today
 Problem 1: consider relative levels of F-statistics as
well as significance from bootstrapping
 Up to 3 points extra credit if problem 2 is done
 See lab open hours schedule on lab web page
 Caveat: exams and class usage of lab
 Other computers are available: see Hari or me
Population structure from worldwide human population
Population = subpopulation. Group = Regions
East Asia
Lab 7 Revised Problem 1
Problem 1. File human_struc.xls contains data for 10 microsatellite loci used to
genotype 41 human populations from a worldwide sample.
a.) Convert the file into Arlequin format and perform AMOVA based on this grouping
of populations within regions using distance. How do you interpret these results?
Report values of the phi-statistics and their statistical significance for each AMOVA
you run.
b.) Do you think that any of these regions can justifiably be divided into subregions?
Pick a region, form a hypothesis for what would be a reasonable grouping of
populations into subregions, then run AMOVA only for the region you selected using
distance measures. Was your hypothesis supported by the data?
c.) GRADUATE STUDENTS: Which of the 5 initially defined regions has the highest
diversity in terms of effective number of alleles? What is your biological explanation
for this?
Lab 7 Original Problem 2 (worth 8 points if you answer this).
Use Structure to further test the hypotheses you developed in
Problem 1.
a.) Calculate the posterior probabilities to test whether:
i. All populations form a single genetically homogeneous group.
ii. There are two genetically distinct groups within your selected region.
iii. There are three genetically distinct groups within your selected region.
b.) Use the ΔK method to determine the most likely number of groups. How does this
compare to the method based on posterior probabilities?
c.) How do the groupings of subpopulations compare to your expectations from
Problem 1?
d.) Is there evidence of admixture among the groups? If so, include a table or figure
showing the proportion of each subpopulation assigned to each group.
e.) GRADUATE STUDENTS: Provide a brief, literature-based
explanation for the groupings you observe.
Last Time
 Sample calculation of FST
 Defining populations on genetic criteria:
introduction to Structure
Today
 Interpretation of F-statistics
 More on the Structure program
 Principal Components Analysis
 Population assignment
FST: What does it tell us?
 Degree of differentiation of subpopulations
 Rules of thumb:
 0.05 to 0.15 is weak to moderate
 0.15 to 0.25 is strong differentiation
 >0.25 is very strong differentiation
 Related to the historical level of gene exchange
between populations
 May not represent current conditions
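The rules of thumb above can be sketched in a few lines of Python. This is only an illustration: it assumes a single biallelic locus, equal-sized subpopulations, and the simple estimator FST = (HT - HS) / HT, and the function names are made up for this example.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(subpop_freqs):
    """F_ST = (H_T - H_S) / H_T, assuming equal-sized subpopulations."""
    n = len(subpop_freqs)
    # H_S: mean expected heterozygosity within subpopulations
    h_s = sum(expected_heterozygosity(p) for p in subpop_freqs) / n
    # H_T: expected heterozygosity of the pooled population
    h_t = expected_heterozygosity(sum(subpop_freqs) / n)
    if h_t == 0.0:
        return 0.0  # same allele fixed everywhere: no differentiation to measure
    return (h_t - h_s) / h_t


def describe_fst(value):
    """Map an F_ST value onto the rule-of-thumb categories from the slide."""
    if value > 0.25:
        return "very strong"
    if value >= 0.15:
        return "strong"
    if value >= 0.05:
        return "weak to moderate"
    return "below 0.05 (no category given on the slide)"


value = fst([0.2, 0.8])  # made-up allele frequencies in two subpopulations
print(round(value, 2), describe_fst(value))
```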
FST is related to life history
Seed Dispersal
Successional Stage
Early 0.411
Middle 0.184
Life Cycle
(Loveless and Hamrick, 1984)
Structure Program
 One of the most widely-used programs in population genetics
(original paper cited >8,000 times since 2000)
 Very flexible model can determine:
 The most likely number of uniform groups (populations, K)
 The genomic composition of each individual (admixture coefficients)
 Possible population of origin
A simple model of population structure
 Individuals in our sample represent a mixture of K
(unknown) ancestral populations.
 Each population is characterized by (unknown) allele
frequencies at each locus.
 Within populations, markers are in Hardy-Weinberg and
linkage equilibrium.
 Roughly speaking, the model sorts individuals into K
clusters so as to minimize departures from HWE
and Linkage Equilibrium.
Slide adapted from Jonathan Pritchard, 2007 presentation to Conservation Genetics meeting
More on the model...
 Let A1, A2, …, AK represent the (unknown) allele
frequencies in each subpopulation
 Let Z1, Z2, … , Zm represent the (unknown)
subpopulation of origin of the sampled individuals
 Assuming Hardy-Weinberg and linkage equilibrium
within subpopulations, the likelihood of an
individual’s genotype in subpopulation k is given by
the product of the relevant allele frequencies:
$\Pr(G_i \mid Z_i = k, A_k) = \prod_{l \in \text{loci}} P_l$
where $P_l$ is the probability of observing the individual's genotype at locus $l$ in
subpopulation $k$
Probability of observing a genotype in
a subpopulation
 Probability of observing a genotype at locus l by
chance in a population is a function of allele frequencies:
$P_l = p_i^2$ for homozygote $A_iA_i$
$P_l = 2 p_i p_j$ for heterozygote $A_iA_j$
$P = \prod_{l=1}^{m} P_l$
for m loci
 Assumes unlinked (independent) loci and Hardy-Weinberg equilibrium
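As a rough illustration of this calculation, the sketch below multiplies Hardy-Weinberg genotype probabilities across independent loci. The data layout (one dictionary of allele frequencies per locus) and the function names are assumptions made for the example, and the frequencies are invented.

```python
from math import prod


def locus_probability(genotype, freqs):
    """Hardy-Weinberg probability of one genotype, given allele frequencies."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # homozygote: p_i^2
    return 2.0 * freqs[a] * freqs[b]  # heterozygote: 2 p_i p_j


def multilocus_probability(genotypes, freqs_by_locus):
    """Product of per-locus probabilities, assuming unlinked loci."""
    return prod(locus_probability(g, f) for g, f in zip(genotypes, freqs_by_locus))


# Made-up allele frequencies for two loci in one subpopulation
freqs = [{"A": 0.6, "a": 0.4}, {"B": 0.5, "b": 0.5}]
individual = [("A", "A"), ("B", "b")]
print(multilocus_probability(individual, freqs))  # 0.36 * 0.5
```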
 If we knew the population allele
frequencies in advance, then it would be
easy to assign individuals.
 If we knew the individual assignments, it
would be easy to estimate frequencies.
 In practice, we don’t know either of
these, but the following MCMC algorithm
converges to sensible joint estimates of both.
MCMC algorithm (for fixed K)
 Start with random assignment of individuals to populations
 Step 1: Gene frequencies in each population are
estimated based on the individuals that are assigned to it.
 Step 2: Individuals are assigned to populations based on
gene frequencies in each population.
 And this is repeated...
 …Estimation of K performed separately.
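The two alternating steps can be sketched as a toy sampler. This is not the Structure program's implementation: it assumes the no-admixture case, biallelic loci with genotypes coded as counts of a reference allele (0, 1 or 2), a flat Beta(1, 1) prior on allele frequencies, and it returns only the final state of the chain rather than a summary over many iterations. The function names are made up for the example.

```python
import math
import random


def hwe_log_prob(genotype, p):
    """Log Hardy-Weinberg probability of a genotype coded as 0, 1 or 2 copies."""
    probs = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    return math.log(probs[genotype])


def toy_structure(genotypes, k, sweeps=200, seed=0):
    """Alternate between sampling allele frequencies and cluster assignments."""
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    # Start with a random assignment of individuals to populations
    z = [rng.randrange(k) for _ in genotypes]
    for _ in range(sweeps):
        # Step 1: sample allele frequencies from the individuals in each population
        freqs = []
        for pop in range(k):
            members = [g for g, zi in zip(genotypes, z) if zi == pop]
            pop_freqs = []
            for locus in range(n_loci):
                ref = sum(g[locus] for g in members)
                alt = 2 * len(members) - ref
                p = rng.betavariate(1 + ref, 1 + alt)
                # Keep p strictly inside (0, 1) so the logarithm is defined
                pop_freqs.append(min(max(p, 1e-9), 1 - 1e-9))
            freqs.append(pop_freqs)
        # Step 2: reassign each individual in proportion to its genotype likelihood
        for i, g in enumerate(genotypes):
            logs = [
                sum(hwe_log_prob(gl, p) for gl, p in zip(g, freqs[pop]))
                for pop in range(k)
            ]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            z[i] = rng.choices(range(k), weights=weights)[0]
    return z


# Made-up data: four individuals fixed for one allele, four for the other
data = [[2] * 5] * 4 + [[0] * 5] * 4
print(toy_structure(data, k=2, sweeps=20, seed=1))
```

Cluster labels are arbitrary (label 0 in one run may correspond to label 1 in another), which is one reason replicate runs need to be aligned before they are compared.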
Admixed individuals are mosaics of
ancestry from the original populations
The two basic ancestry models used by Structure
 No Admixture: each individual is derived completely
from a single subpopulation
 Admixture: individuals may have mixed ancestry: some
fraction qk of the genome of individual i is derived
from subpopulation k.
The admixture model allows for hybrids, and it is also more
flexible and often provides a better fit for complicated
structure. This is what we used in lab.
Notes on Estimating the Number of
Subpopulations (k)
 Likelihood-based method is the simplest, but likelihood
often increases continuously with k
 More variability at values of k beyond “natural” value
 Evanno et al. (2005) method measures change in likelihood
and discounts for variation
 Use biological reasoning in arriving at a final value
 Priors based on population locations, other information
 Often need to do hierarchical analyses: break into
subregions and run Structure separately for each
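A small sketch of the ΔK idea from Evanno et al. (2005): the second-order change in ln P(D) between successive values of K, divided by the standard deviation of ln P(D) across replicate runs at that K. The version below takes the second difference of the per-K means, which is one common way to compute it and is an assumption here. The ln P(D) values are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K for each K that has neighbours K-1 and K+1.

    lnp_by_k maps K to a list of ln P(D) values from replicate runs.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            # Absolute second-order rate of change of mean ln P(D)
            second = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            sd = stdev(lnp_by_k[k])
            if sd > 0:
                result[k] = second / sd
    return result


# Made-up replicate ln P(D) values for K = 1..4
runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-900.0, -901.0, -899.0],
    3: [-890.0, -895.0, -885.0],
    4: [-885.0, -900.0, -870.0],
}
scores = delta_k(runs)
print(scores, "best K:", max(scores, key=scores.get))
```

In this invented data set the likelihood keeps creeping upward after K = 2 while the run-to-run variance grows, so ΔK peaks at K = 2. Note that ΔK cannot be evaluated at the smallest or largest K tested.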
Inferred human population structure
[Figure: Structure bar plot; group labels: Africans, Europeans, MidEast, Cent/S Asia, Oceania, America]
Each individual is a thin vertical line that is partitioned into K colored
segments according to its membership coefficients in K clusters.
Rosenberg et al. 2002 Science 298: 2381-2385
Structure is Hierarchical: Groups reveal more substructure when
examined separately
Rosenberg et al. 2002 Science 298: 2381-2385
Alternative clustering method: Principal
Components Analysis
 Structure is very computationally intensive
 Often no clear best-supported K-value
 Alternative is to use traditional multivariate
statistics to find uniform groups
 Principal Components Analysis is most commonly
used algorithm
 EIGENSOFT (PCA, Patterson et al., 2006; PloS
Genetics 2:e190).
Eckert, Population Structure, 5-Aug-2008
Principal Components Analysis
 Efficient way to summarize multivariate data like genotypes
 Each axis passes through maximum variation in data,
explains a component of the variation
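To make the idea of an axis through maximum variation concrete, the sketch below finds the first principal component of a small genotype matrix by power iteration on the covariance matrix. It is a bare-bones illustration in plain Python, not what EIGENSOFT does: it skips the per-marker scaling that dedicated packages apply, and the genotype data (counts of a reference allele at three loci) are invented.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (scores, axis) for the first principal component of rows."""
    n, m = len(rows), len(rows[0])
    # Centre each column (marker) on its mean
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix between markers
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration from an arbitrary asymmetric starting vector
    axis = [float(j + 1) for j in range(m)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * axis[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variation in the data
        axis = [x / norm for x in w]
    # Project each individual onto the axis
    scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
    return scores, axis


# Made-up genotypes: first three individuals from one group, last three from another
genotypes = [
    [2, 2, 0], [2, 1, 0], [1, 2, 0],
    [0, 0, 2], [0, 1, 2], [1, 0, 2],
]
pc1_scores, pc1_axis = first_principal_component(genotypes)
print([round(s, 2) for s in pc1_scores])
```

The sign of a principal component is arbitrary, so only the separation between groups along the axis is meaningful, not which group gets positive scores.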
How do we identify population of origin?
Human Population Assignment with SNP
 Assayed 500,000 SNP genotypes for 3,192 Europeans
 Used Principal Components Analysis to ordinate samples in
 High correspondence between sample ordination and
geographic origin of samples
Novembre et al. 2008 Nature 456:98
 Individuals assigned
to populations of
origin with high
Likelihood Approaches
 Allow evaluation of alternative hypotheses by
comparing their relative likelihoods given the evidence (E):
$L(H_1, H_2 \mid E) = \frac{P(E \mid H_1)}{P(E \mid H_2)}$
 In a population assignment or forensic context,
definition of the competing hypothesis is the most
essential component
Population Assignment: Likelihood
 Assume you find skin cells and blood under
fingernails of a murder victim
 Victim had major debts with the Sicilian mafia as
well as the Chinese mafia
 Can population assignment help to focus
$L(H_1, H_2 \mid G) = LR = \frac{P(G \mid H_1)}{P(G \mid H_2)}$
 What is H1 and what is H2?
Population Assignment: Likelihood
 "Assignment Tests" based on allele
frequencies in source populations and
genetic composition of individuals
 Likelihood-Based Approaches
 Calculate likelihood that individual
genotype originated in a particular population
 Assume Hardy-Weinberg and
linkage equilibria
 Genotype frequencies corrected
for presence of sampled individual
 Usually reported as log10 likelihood
for origin in given population
relative to other population
 Implemented in ‘GENECLASS’
Pk l  p
for homozygote AiAi in
population l at locus k
Pkl  2 pil p jl
for heterozygote AiAj in
population l at locus k
P   Pk
k 1
for m loci
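A minimal sketch of a likelihood-based assignment test built from these formulas. It is not GENECLASS's implementation: the population names and allele frequencies are invented, the small floor for alleles missing from a reference population is an assumption made to avoid taking the log of zero, and the correction for the presence of the sampled individual is left out.

```python
import math

MISSING_ALLELE_FREQ = 0.01  # assumed floor for alleles unseen in a reference sample


def log10_likelihood(genotype, pop_freqs):
    """log10 probability of a multilocus genotype in one candidate population."""
    total = 0.0
    for (a, b), freqs in zip(genotype, pop_freqs):
        p_a = freqs.get(a, MISSING_ALLELE_FREQ)
        p_b = freqs.get(b, MISSING_ALLELE_FREQ)
        p = p_a ** 2 if a == b else 2.0 * p_a * p_b
        total += math.log10(p)  # a sum of logs is the log of the product over loci
    return total


def assign(genotype, populations):
    """Return the best-supported population and all log10 likelihoods."""
    scores = {name: log10_likelihood(genotype, f) for name, f in populations.items()}
    return max(scores, key=scores.get), scores


# Made-up reference allele frequencies at two loci
populations = {
    "pop1": [{"A": 0.9, "a": 0.1}, {"B": 0.5, "b": 0.5}],
    "pop2": [{"A": 0.2, "a": 0.8}, {"B": 0.1, "b": 0.9}],
}
sample = [("A", "A"), ("B", "b")]
best, scores = assign(sample, populations)
likelihood_ratio = 10 ** (scores["pop1"] - scores["pop2"])
print(best, round(likelihood_ratio, 2))
```

Here H1 is "the sample originated in pop1" and H2 is "the sample originated in pop2"; the likelihood ratio says how many times more probable the genotype is under H1 than under H2.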
Power of Population Assignment using
 Assignment success depends on:
Number of markers used
Polymorphism of markers
Number of possible source populations
Differentiation of populations
Accuracy of allele frequency estimations
 Rules of Thumb (Cornuet et al. 1999) for 100% assignment
success, for 10 reference populations need:
30 to 50 reference individuals per population
10 microsatellite loci
HE > 0.6
FST > 0.1
Knowing what you know about
human population genetics, is it
worth the effort to assign our
skin sample to Asian or Sicilian
Population Assignment Example: Wolf Populations in
Northwest Territories
 Wolves sampled from island
and mainland populations in
Canadian Northwest Territories
 Immigrants detected on mainland
(black circles) from Banks Island
(white circles)
Carmichael et al. 2001 Mol Ecol 10:2787
Population Assignment Example: Fish Stories
 Fishing competition on Lake
Saimaa in Southeast Finland
 Contestant allegedly caught a
5.5 kg salmon, much larger
than usual for the lake
 Compared fish from the lake
to fish from local markets
(originating from Norway and
Baltic sea)
 7 microsatellites
[Figure legend: Lake Saimaa, Market]
 Based on likelihood analysis,
fish was purchased rather
than caught in lake