Transcript Overview
From which population does this
individual come?
ECL 290
May 21, 2004
Questions that will be addressed today:
-What programs can be used for these questions?
-Maximum likelihood and Bayesian theory 101
-What do the results of traditional analyses and
STRUCTURE tell us about our data set?
-Specific methods and statistics for programs
-What program is best for my data and questions?
Individual assignment
From Hansen et al. 2001
Programs and common uses
Geneclass/Geneclass 2
Identifies origin of individuals
Spam
Determines relative contribution of distinct
populations in a mix of organisms
Structure
Determines number of clusters that fits data,
origin
Applications of assignment
tests and examples
Patterns of dispersal
Introgression
Poaching
Mixed Stock Analysis
Dispersal - immigration
Assignment:
Individuals correctly
assigned in some
populations but for
others, many
mismatched
Hypothesis:
Former populations
isolated, latter lots
migration
Statistical tests:
Simulation test: Geneclass (Cornuet et al. 1999)
Structure (Pritchard et al. 2000)
Dispersal - sex biased
Assignment:
Distribution of assignment indices
from a single population differ
between males and females
Hypothesis
Gene flow is mediated primarily by
one sex
Test: Calculate assignment
likelihood p (e.g. Geneclass,
Whichrun, Arlequin), import data
into a spreadsheet and use the
following:
Hybrid Index: $I_H = 1 - \dfrac{\log(p_x)}{\log(p_x) + \log(p_y)}$
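The hybrid index above can also be computed outside a spreadsheet. This is a minimal sketch, assuming p_x and p_y are the assignment likelihoods of one individual to populations x and y, as exported from an assignment program. The function name is made up for illustration.

```python
import math

def hybrid_index(p_x: float, p_y: float) -> float:
    """I_H = 1 - log(p_x) / (log(p_x) + log(p_y)).

    p_x, p_y: assignment likelihoods (0 < p <= 1) of the same individual
    to populations x and y. Values near 1 point toward x, values near 0
    toward y, and 0.5 means both populations are equally likely.
    """
    log_x = math.log(p_x)
    log_y = math.log(p_y)
    if log_x + log_y == 0.0:
        raise ValueError("both likelihoods equal 1; the index is undefined")
    return 1.0 - log_x / (log_x + log_y)

# An individual much more likely under population x than under y
print(hybrid_index(1e-6, 1e-12))
```

The base of the logarithm cancels in the ratio, so natural or base-10 logs give the same index.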
Introgression – wild and domestic
populations
Assignment:
Most individuals correctly
assigned to sample of origin
Hypothesis: Wild population
unaffected by domestic
Test: Structure – especially
good at handling samples
with no prior knowledge of
sample origin. Spam if
populations fairly panmictic
with known sources.
Wildlife forensics
Where did TB-positive deer originate? (Blanchong et al.
2002), Cheating in fishing contests (Primmer 2000), Canid
species, gender and genotype in samples from
saliva in sheep predation wounds (Wildlife Soc. Bull. 31: 926-932)
Hypothesis:
Organism is suspected to be from protected population, or is
not really derived from certain population
Test: Collect baseline samples, simulation option (Geneclass
- Cornuet et al. 1999)
SPAM- Statistical Package for Analyzing Admixtures
-Developed by ADFG for Mixed Stock Analysis (MSA) and
Genetic Stock Identification (GSI)
- Available at :
www.genetics.cf.adfg.state.ak.us/software/spampage.php
Requires control, baseline, and mixture files
SPAM Analyses and Statistics
-Performs Maximum Likelihood Estimation and Simulation
Analyses
-Bayesian modeling of baseline allele frequencies with
maximum likelihood scheme
-Jackknifed standard errors
-Bootstrap confidence intervals
-Likelihood-based C.I.
-Symmetric percentile & Nonsymmetric percentile C.I.
-Studentized bootstrap-t C.I.
-Likelihood ratio tests
Maximum Likelihood Estimation and Simulation Analyses
• THREE maximum likelihood algorithms.
•Iteratively reweighted Least Squares (IRLS)- Default
•Conjugate Gradient (CG)
•Expectation-maximization (EM)
•Simulation creates a user-defined mixture scenario to evaluate
performance for a given baseline
•Iteratively reweighted Least Squares (IRLS)- algorithm
performs increasingly better with more data, lots of
memory
•Conjugate Gradient (CG)- low memory requirements
•Expectation-maximization (EM)- If loci are missing
from the baseline, the program uses this algorithm.
The missing allele frequency is taken as uniform for each locus.
Converts this to the number of individuals per population at each
iteration.
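The EM idea can be sketched in a few lines. This is a generic textbook EM for mixture proportions, not SPAM's own code. It assumes the genotype likelihood of every mixture individual under every baseline population has already been computed. The function and argument names are hypothetical.

```python
def em_mixture_proportions(likelihoods, n_iter=200, tol=1e-10):
    """Estimate stock contributions to a mixture by EM.

    likelihoods[i][k] = P(genotype of mixture individual i | population k).
    Returns a list of estimated contributions that sums to 1.
    """
    n_ind = len(likelihoods)
    n_pops = len(likelihoods[0])
    theta = [1.0 / n_pops] * n_pops          # start from equal contributions
    for _ in range(n_iter):
        expected_counts = [0.0] * n_pops
        # E-step: split each individual across populations by posterior weight
        for row in likelihoods:
            weighted = [t * l for t, l in zip(theta, row)]
            norm = sum(weighted)
            if norm == 0.0:
                raise ValueError("an individual has zero likelihood everywhere")
            for k in range(n_pops):
                expected_counts[k] += weighted[k] / norm
        # M-step: expected number of individuals per population -> proportions
        new_theta = [c / n_ind for c in expected_counts]
        change = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if change < tol:
            break
    return theta

# Three individuals that fit only stock 0, one that fits only stock 1
print(em_mixture_proportions([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

The expected counts in the E-step are the "number of individuals per population per iteration" mentioned on the slide.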
Bayesian Distribution
Fig. 1, Cornuet et al. 1999
Bayesian modeling of baseline allele frequencies with
maximum likelihood scheme
Rannala and Mountain (1997)
Equal-probability prior distribution
All alleles are assumed equally abundant for all the
samples prior to knowledge of baseline frequencies
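A sketch of what an equal-probability prior does to baseline allele frequencies. It assumes a symmetric Dirichlet prior with weight 1/K on each of the K alleles at a locus, which is one common reading of Rannala and Mountain (1997). The weight and the function name are assumptions for illustration, not taken from the slide.

```python
from collections import Counter

def posterior_allele_freqs(baseline_alleles, all_alleles):
    """Posterior mean allele frequencies at one locus.

    baseline_alleles: allele copies observed in one baseline sample.
    all_alleles: every allele seen at this locus across all samples.
    Assumed prior: each allele gets pseudo-count 1/K (K = number of alleles).
    """
    counts = Counter(baseline_alleles)
    k = len(all_alleles)
    n = len(baseline_alleles)
    prior = 1.0 / k
    return {a: (counts.get(a, 0) + prior) / (n + 1.0) for a in all_alleles}

# Allele "C" was never seen in this baseline but still gets a small frequency
print(posterior_allele_freqs(["A", "A", "B", "B"], ["A", "B", "C"]))
```

The practical effect is that an allele missing from a baseline sample no longer forces a genotype likelihood of zero.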
Baudouin and Lebrun 2000
No info available
Bayesian modeling of baseline allele frequencies with
maximum likelihood scheme
Pella and Masuda (2001)
Prior distribution is function of
baseline allele frequencies
among stocks at a locus
Maximum Likelihood Estimation and Simulation Statistics
Jackknifed Standard Errors
•Available for populations aggregated into regions
•Variance of regional proportions due to mixture S.E
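For reference, a generic delete-one jackknife standard error looks like the sketch below. The slide does not say what unit SPAM leaves out in each round, so the resampling unit and the function name here are assumptions.

```python
import math

def jackknife_se(values, estimator):
    """Delete-one jackknife standard error of estimator(values).

    values: list of observations (the unit that is left out one at a time).
    estimator: function mapping a list of observations to a number.
    """
    n = len(values)
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return math.sqrt(variance)

def mean(xs):
    return sum(xs) / len(xs)

print(jackknife_se([1.0, 2.0, 3.0, 4.0, 5.0], mean))
```

For the sample mean, this reproduces the usual standard error (sample standard deviation divided by the square root of n).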
Likelihood-based C.I.
•Available for populations aggregated into regions
•Based on simple binomial model
•Outputs asymptotic results
Maximum Likelihood Estimation and Simulation Statistics
Symmetric percentile C.I.
-Confidence intervals computed based on Gaussian
distribution
Nonsymmetric percentile & Studentized bootstrap-t C.I
-More appropriate for skewed sampling distributions than
the symmetric C.I.s
-Studentized bootstrap-t C.I requires certain baseline.
Requires estimate of variance for each bootstrap replicate.
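A minimal sketch of a nonsymmetric percentile bootstrap interval, assuming the data can be resampled with replacement and re-estimated each time. It is a generic illustration, not SPAM's procedure, and all names are hypothetical.

```python
import random

def percentile_bootstrap_ci(data, estimator, n_boot=2000, alpha=0.10, seed=0):
    """Nonsymmetric percentile interval: take the alpha/2 and 1 - alpha/2
    quantiles of the bootstrap estimates directly, with no Gaussian assumption.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

def proportion(xs):
    return sum(xs) / len(xs)

# 20 of 100 mixture individuals assigned to a stock (1 = assigned, 0 = not)
sample = [1] * 20 + [0] * 80
print(percentile_bootstrap_ci(sample, proportion))
```

Because the limits are read straight off the bootstrap distribution, the interval can be lopsided around the estimate, which is the point when the sampling distribution is skewed.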
Which bootstrap method matches my parameter
(mixture contribution estimates) space?
Contribution between 0.3 and 0.7: sampling distribution of the contribution estimate is
fairly symmetric; give the studentized C.I. a shot.
Contribution between 0.1 and 0.3, or between 0.7 and 0.9: pretty skewed sampling
distribution; best to try the nonsymmetric C.I.
Contribution between 0.0 and 0.1, or between 0.9 and 1.0: extreme true contributions
wreak havoc with the parameter boundary, so the symmetric
percentile method should be used.
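The rule of thumb above can be written as a small helper. The cut-offs come from the slide. How to treat values exactly on a cut-off is not stated there, so the handling of the edges is an assumption, as is the function name.

```python
def suggest_bootstrap_ci(contribution: float) -> str:
    """Suggest a bootstrap C.I. type from the expected stock contribution (0..1)."""
    if not 0.0 <= contribution <= 1.0:
        raise ValueError("contribution must lie between 0 and 1")
    # Only the distance to the nearer boundary (0 or 1) matters
    distance = min(contribution, 1.0 - contribution)
    if distance >= 0.3:        # roughly 0.3 to 0.7: fairly symmetric
        return "studentized bootstrap-t"
    if distance >= 0.1:        # roughly 0.1 to 0.3 or 0.7 to 0.9: skewed
        return "nonsymmetric percentile"
    return "symmetric percentile"   # close to 0 or 1

for value in (0.5, 0.2, 0.95):
    print(value, suggest_bootstrap_ci(value))
```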
Allows user to conduct likelihood ratio tests of
competing mixture models
Three uses in SPAM (Details in Vers. 3.5 Handbook):
1. Reduce bias in mixture estimates
2. Compare independent mixtures
3. Permits a power analysis of sample size selection
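As a reminder of the arithmetic behind a likelihood ratio test, here is a generic sketch. It assumes the two mixture models are nested and that their maximized log-likelihoods are already available. The chi-square reference distribution is the usual large-sample approximation and may be poor when contributions sit near 0 or 1. Names are hypothetical, and SPAM's own output may be organized differently.

```python
import math

def lrt_statistic(loglik_restricted: float, loglik_full: float) -> float:
    """Likelihood ratio statistic: 2 * (lnL_full - lnL_restricted)."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi2_pvalue_df1(stat: float) -> float:
    """Upper-tail chi-square probability for 1 degree of freedom."""
    if stat <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))

# Example: dropping one stock from the model lowers lnL from -103 to -105
stat = lrt_statistic(-105.0, -103.0)
print(stat, chi2_pvalue_df1(stat))
```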
GeneClass2
Basic Functions
-Detection of first-generation migrants
-Assignment of individuals/groups of individuals
-Description of population diversity
-Available at: http://www.montpellier.inra.fr/URLB/
-Help file: http://www.montpellier.inra.fr/URLB/GeneClass2/Help.pdf
GeneClass2 Analyses and
Statistics
Criteria for likelihood calculations:
• Bayesian
• Frequency-based
• Distance-based
GeneClass2 Criteria for
Likelihood Calculations
Bayesian:
1) Probability of individual belonging to each
population
2) List of populations for which the probability is at
least equal to threshold
• Assumptions: HW, linkage equilibria, allelic freq =
exact values
• Options:
– Rannala and Mountain 1997
• Assumes SMM, high μ
– Baudouin and Lebrun 2000
GeneClass2 criteria for
Likelihood Calculations
Frequency-based:
1) Calculate observed frequencies for each allele in a population
2) Estimate likelihood of a diploid genotype occurring in a
population (square of the observed allele frequency for
homozygotes or twice the product of the two allele
frequencies for heterozygotes)
3) Multiply likelihoods for each locus together to yield overall
likelihood for genotype in reference population
• Assumptions: HW, linkage equilibria, allelic freq = exact
values
• Options:
– Paetkau et al. 2004
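The three frequency-based steps can be sketched as follows. This assumes diploid genotypes stored as allele pairs and uses raw observed frequencies, so an allele absent from the reference sample gives a likelihood of zero. The data layout and function names are illustrative, not GeneClass2's.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Step 1: observed allele frequencies at one locus in one population.
    genotypes: list of (allele1, allele2) tuples."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}

def genotype_likelihood(genotype, freqs):
    """Step 2: p^2 for homozygotes, 2pq for heterozygotes (assumes HW)."""
    a, b = genotype
    p, q = freqs.get(a, 0.0), freqs.get(b, 0.0)
    return p * p if a == b else 2.0 * p * q

def multilocus_likelihood(individual, reference):
    """Step 3: multiply across loci (assumes linkage equilibrium).
    individual: {locus: (a, b)}; reference: {locus: [genotypes]}."""
    likelihood = 1.0
    for locus, genotype in individual.items():
        freqs = allele_frequencies(reference[locus])
        likelihood *= genotype_likelihood(genotype, freqs)
    return likelihood

reference = {"L1": [("A", "A"), ("A", "B")], "L2": [("C", "D"), ("C", "D")]}
print(multilocus_likelihood({"L1": ("A", "B"), "L2": ("C", "C")}, reference))
```

With many loci the product becomes very small, so summing log-likelihoods is the usual practical choice.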
GeneClass2 Criteria for
Likelihood Calculations
Distance-based:
• Assigns individual to “closest” population based on
distance
• Does not assume HW, linkage equilibria
• Options:
IAM:
– Nei standard Ds
– Nei minimum Dm
– Nei DA
– Cavalli-Sforza and Edwards Dc
SMM:
– Goldstein et al.
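A sketch of distance-based assignment using Nei's DA, one of the options listed. It assumes the individual is treated as a tiny "population" whose allele frequencies are 1.0 (homozygote) or 0.5 and 0.5 (heterozygote) at each locus. That treatment and the names below are assumptions for illustration.

```python
import math

def nei_da(freqs_x, freqs_y):
    """Nei's DA = 1 - (1/r) * sum over loci of sum over alleles sqrt(x * y).
    freqs_x, freqs_y: {locus: {allele: frequency}} with the same loci."""
    total = 0.0
    for locus, fx in freqs_x.items():
        fy = freqs_y[locus]
        total += sum(math.sqrt(fx[a] * fy.get(a, 0.0)) for a in fx)
    return 1.0 - total / len(freqs_x)

def individual_freqs(individual):
    """Turn {locus: (a, b)} into per-locus allele frequencies."""
    return {
        locus: ({a: 1.0} if a == b else {a: 0.5, b: 0.5})
        for locus, (a, b) in individual.items()
    }

def assign_by_distance(individual, populations):
    """Return the name of the population at the smallest DA from the individual."""
    ind = individual_freqs(individual)
    return min(populations, key=lambda name: nei_da(ind, populations[name]))

pops = {
    "north": {"L1": {"A": 0.9, "B": 0.1}},
    "south": {"L1": {"A": 0.1, "B": 0.9}},
}
print(assign_by_distance({"L1": ("A", "A")}, pops))
```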
GeneClass2 Resampling
Algorithms
Classic: Simulated individual obtained by
drawing alleles at random according to
observed allele frequencies
Options:
• Rannala and Mountain 1997
• Cornuet et al. 1999
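The classic resampling step can be sketched like this: each simulated individual gets two alleles per locus, drawn independently according to the observed frequencies. The data layout and function name are hypothetical.

```python
import random

def simulate_individual(allele_freqs, rng):
    """Draw one diploid genotype per locus from observed allele frequencies.
    allele_freqs: {locus: {allele: frequency}}; rng: a random.Random instance."""
    individual = {}
    for locus, freqs in allele_freqs.items():
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        individual[locus] = tuple(rng.choices(alleles, weights=weights, k=2))
    return individual

rng = random.Random(1)
freqs = {"L1": {"A": 0.7, "B": 0.3}, "L2": {"C": 0.5, "D": 0.5}}
print([simulate_individual(freqs, rng) for _ in range(3)])
```

Repeating this many times and computing the assignment likelihood of each simulated individual gives a reference distribution against which an observed likelihood can be compared.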
GeneClass2 Resampling
Algorithms
New: Simulated individual obtained by
simulating gametes from randomly
chosen pairs of “parents”
Options:
• Paetkau et al. 2004 (recommended)
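A simplified sketch of the gamete-based idea: pick two sampled individuals as "parents" and let each pass on one allele per locus. It assumes unlinked loci and ignores the finer details of the published method, so read it as an illustration only. Names are hypothetical.

```python
import random

def simulate_offspring(population, rng):
    """Build one simulated individual from two randomly chosen 'parents'.
    population: list of individuals, each {locus: (allele1, allele2)}.
    Each parent contributes one randomly chosen allele per locus."""
    mother, father = rng.sample(population, 2)
    return {
        locus: (rng.choice(mother[locus]), rng.choice(father[locus]))
        for locus in mother
    }

rng = random.Random(42)
sample = [
    {"L1": ("A", "B"), "L2": ("C", "C")},
    {"L1": ("A", "A"), "L2": ("C", "D")},
    {"L1": ("B", "B"), "L2": ("D", "D")},
]
print(simulate_offspring(sample, rng))
```

Unlike drawing alleles from pooled frequencies, this keeps alleles that arrived together in a sampled individual (for example in a recent migrant) travelling together through the simulated gametes.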
GeneClass2 Likelihood Measures for
Assignment/Exclusion
• L = L_home
– Likelihood computed from the population
where the individual was sampled
– Appropriate when some source pop’ns
missing
• L = L_home/L_max
– Ratio of L_home over the highest likelihood value
among all pop’n samples including the
pop’n where the individual was sampled
GeneClass2 Likelihood Measures for
Assignment/Exclusion (cont’d)
• L = L_home/L_max_not_home
– Ratio of L_home over the highest likelihood value
among all pop’n samples excluding the
pop’n where the individual was sampled
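The three likelihood measures can be computed side by side once an individual's likelihood under each sampled population is known. The sketch takes L_max to be the highest likelihood among the populations considered, and assumes at least two populations and non-zero likelihoods. The function name is hypothetical.

```python
def likelihood_measures(likelihoods, home):
    """likelihoods: {population: likelihood of the individual's genotype};
    home: name of the population where the individual was sampled."""
    l_home = likelihoods[home]
    l_max = max(likelihoods.values())                       # includes home
    l_max_not_home = max(v for k, v in likelihoods.items() if k != home)
    return {
        "L_home": l_home,
        "L_home/L_max": l_home / l_max,
        "L_home/L_max_not_home": l_home / l_max_not_home,
    }

# Sampled in A, but the genotype is far more likely under B
print(likelihood_measures({"A": 1e-8, "B": 1e-6, "C": 1e-10}, home="A"))
```

A ratio well below 1 flags an individual that fits some other sampled population better than the one where it was caught.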
Program applications
Generally: Bayesian, frequency, distance
Bayesian = Good (robust to many rare alleles)
Migration
Mixed Stock Analysis
Geneclass 2
Lots of mixing
Lots of Source populations
Extensive sampling
SPAM
IMMANC
Not many
populations
(multilocus genotype)
May be better at handling less
Extensively sampled populations
Structure
Structure
Program applications
Genetic Stock Identification
AFLP data
AFLPOP?
Try several methods at once
(e.g. max like, jack)
Accepts Genepop
Can create input for SPAM
Which Run or
DOH
Standard Maximum Likelihood
Arlequin
Can produce extensive stats
(relative to DOH, IMMANC)
Blend of methods
Not duplicated by other programs
Geneclass 2
Structure
Computer Lab Outline
How can STRUCTURE be used to determine how many
populations are in my dataset?
What population do a group of unknown individuals belong
to? (GENECLASS)
What is the contribution of each population to a mixed
stock? (SPAM)