Slide - iPlant Pods


Genome-wide association mapping
Introduction to theory and methodology
Aaron Lorenz
Department of Agronomy and Horticulture
GWAS – Genome-wide Association Study
• Big subject
• Lots of methods and software packages
• Lots of considerations for handling data
• We have some data to analyze
• 75 minutes
Slide credit: Mike Gore
Goal
Find genes contributing to variation in phenotypes of interest
Approaches to mapping genes
Yu and Buckler, 2006
Germplasm
Biometris
Germplasm
• Any genetically diverse natural or artificial population can be used
– Examples
• 71 elite European maize inbred lines (Andersen et al., 2005)
• Diverse panel of 288 maize lines (Harjes et al., 2008)
• Diverse panel of 191 Arabidopsis lines (stock center accessions and individuals sampled from the wild; Atwell et al., 2010)
• 915 dogs from 80 domestic breeds, 83 wild canids, and 10 outbred African shelter dogs
Linkage disequilibrium (LD)
• The non-random association of alleles between loci.
• Extent of LD over physical distance determines marker density needed.
• Common statistic to quantify LD: $D = p_{AB} - p_A p_B$
• Normalized value of D: $r^2 = \frac{D^2}{p_A p_a p_B p_b}$
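The two LD statistics on this slide can be computed directly from haplotype counts. The following is a minimal sketch, assuming phased two-locus haplotype counts are available; the function name `ld_stats` and the example counts are hypothetical and chosen only for illustration.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n               # frequency of the AB haplotype
    p_A = (n_AB + n_Ab) / n       # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n       # frequency of allele B at locus 2
    D = p_AB - p_A * p_B          # D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)   # p_A * p_a * p_B * p_b
    r2 = D * D / denom if denom > 0 else float("nan")  # undefined if a locus is fixed
    return D, r2


# Hypothetical counts: 40 AB, 10 Ab, 10 aB, 40 ab haplotypes
D, r2 = ld_stats(40, 10, 10, 40)
print(round(D, 4), round(r2, 4))
```

With these made-up counts the allele frequencies are all 0.5, so D is bounded by 0.25 and r² rescales it to the 0 to 1 range.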
LD decay in bi-parental linkage mapping populations
Slide credit: Peter Bradbury
Plots of LD across the maize d3 gene (Remington et al., 2001): r² above the diagonal, D′ below the diagonal.
Note that LD drops to nearly 0 within 500 base pairs (bp).
Gaut B. S., Long A. D. Plant Cell 2003;15:1502-1506
Copyright © 2003. American Society of Plant Biologists. All rights reserved.
Extensive LD in barley of the Upper Midwest
Toy example
• 500 random individuals from a population phenotyped and genotyped
– Genotypes were scored for one marker linked to a candidate gene
– Individuals scored as A1A1 = 0, A1A2 = 1, A2A2 = 2
$y = \mu + bw + e$
$H_0: b = 0$
$H_A: b \neq 0$
[Figure: phenotypic value plotted against marker genotype score (0, 1, 2)]
R: lm function
• Fits a linear model with normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables.
• Simple linear regression
– lm(y ~ x)
• See riceGwasEmma.r
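For readers without R at hand, the single-marker test from the toy example can be sketched in plain Python. This is an illustrative least-squares fit of phenotype on allele dosage, not the contents of riceGwasEmma.r; the simulated data, the seed, and the assumed effect size of 0.5 are all hypothetical.

```python
import math
import random


def simple_regression(w, y):
    """Least-squares fit of y = mu + b*w + e; returns (mu, b, t statistic for b)."""
    n = len(w)
    mean_w = sum(w) / n
    mean_y = sum(y) / n
    sxx = sum((wi - mean_w) ** 2 for wi in w)
    sxy = sum((wi - mean_w) * (yi - mean_y) for wi, yi in zip(w, y))
    b = sxy / sxx                      # estimated marker effect
    mu = mean_y - b * mean_w           # estimated intercept
    rss = sum((yi - mu - b * wi) ** 2 for wi, yi in zip(w, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)   # standard error of b
    t = b / se_b if se_b > 0 else float("inf")
    return mu, b, t


# Hypothetical data: 500 individuals, dosage 0/1/2, assumed true effect b = 0.5
random.seed(1)
w = [random.choice([0, 1, 2]) for _ in range(500)]
y = [5.0 + 0.5 * wi + random.gauss(0.0, 1.0) for wi in w]

mu_hat, b_hat, t_stat = simple_regression(w, y)
print(mu_hat, b_hat, t_stat)
```

The t statistic would be compared against a t distribution with n − 2 degrees of freedom to test H0: b = 0; in R, `summary(lm(y ~ x))` reports the same quantities.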
Population structure
• Nearly always present in association mapping panels
• Causes spurious associations if not accounted for.
Extreme example
[Figure: two subpopulations, one carrying only AB gametes and the other only ab gametes]
Within each of these subpopulations only one gamete type is present, so the Ab and aB gametes never occur.
When the subpopulations are combined into one population in equal proportions and LD is calculated, D = freq(AB) – freq(A)*freq(B) = 0.5 – 0.5*0.5 = 0.25, and the two loci are in complete LD regardless of their physical linkage.
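The extreme example can be checked numerically. The sketch below assumes two equal-sized subpopulations, one fixed for AB and one fixed for ab; the helper `d_stat` and the sample size of 10 gametes per subpopulation are hypothetical.

```python
def d_stat(haplotypes):
    """D = freq(AB) - freq(A) * freq(B) for a list of two-locus gametes."""
    n = len(haplotypes)
    p_AB = sum(h == "AB" for h in haplotypes) / n
    p_A = sum(h[0] == "A" for h in haplotypes) / n
    p_B = sum(h[1] == "B" for h in haplotypes) / n
    return p_AB - p_A * p_B


subpop1 = ["AB"] * 10   # subpopulation fixed for AB
subpop2 = ["ab"] * 10   # subpopulation fixed for ab

print(d_stat(subpop1))             # no LD within subpopulation 1
print(d_stat(subpop2))             # no LD within subpopulation 2
print(d_stat(subpop1 + subpop2))   # LD appears only after pooling
```

Within each subpopulation both loci are fixed, so D is zero; pooling alone creates the association, which is why structure produces spurious marker-trait signals.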
Model population structure
$y = \mu + vq + bw + e$
• vq: subpopulation membership (q) and effect (v)
• bw: marker allele dosage (w) and effect (b)
Matrix notation
$y = 1\mu + Qv + Wb + e$
Illustration
3 subpopulations, 2 markers, 10 individuals
$$
\begin{bmatrix} 4.4 \\ 4.6 \\ 5.3 \\ 5.0 \\ 5.8 \\ 5.7 \\ 4.3 \\ 4.6 \\ 4.4 \\ 4.8 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \mu
+
\begin{bmatrix}
0.75 & 0.25 & 0.00 \\
0.65 & 0.30 & 0.05 \\
0.50 & 0.40 & 0.10 \\
0.75 & 0.05 & 0.20 \\
0.80 & 0.00 & 0.20 \\
0.20 & 0.60 & 0.20 \\
0.20 & 0.80 & 0.00 \\
0.30 & 0.70 & 0.00 \\
0.10 & 0.00 & 0.90 \\
0.10 & 0.00 & 0.90
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}
+
\begin{bmatrix}
0 & 0 \\
1 & 1 \\
1 & 1 \\
0 & 1 \\
1 & 1 \\
1 & 0 \\
0 & 0 \\
0 & 1 \\
1 & 0 \\
0 & 1
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
+
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \\ e_8 \\ e_9 \\ e_{10} \end{bmatrix}
$$
$y = 1\mu + Qv + Wb + e$
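To make the matrix form concrete, the sketch below evaluates the fixed part of the model, 1μ + Qv + Wb, for three individuals. The Q rows are the subpopulation memberships of the first three individuals in the illustration; the 0/1 marker codes in W and the values of μ, v, and b are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mat_vec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


# Subpopulation membership proportions (each row sums to 1)
Q = [
    [0.75, 0.25, 0.00],
    [0.65, 0.30, 0.05],
    [0.50, 0.40, 0.10],
]
# Marker allele codes for 2 markers (illustrative values)
W = [
    [0, 0],
    [1, 1],
    [1, 1],
]
mu = 4.0                 # hypothetical intercept
v = [0.5, 1.0, -0.5]     # hypothetical subpopulation effects
b = [0.2, 0.1]           # hypothetical marker effects

# Fixed part of the model: 1*mu + Qv + Wb (the residual e is left out)
fitted = [mu + qv + wb for qv, wb in zip(mat_vec(Q, v), mat_vec(W, b))]
print([round(f, 3) for f in fitted])
```

Because every row of Q sums to 1, the columns of Q add up to the intercept column; when such a model is actually fitted, one column of Q is typically dropped to avoid this collinearity.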
Population structure and differential relatedness (or family structure)
Yu and Buckler, 2006
Mixed-linear model to account for family structure
$y = 1\mu + Qv + Wb + Zu + e$
• u: polygenic effect (random), $u \sim \mathrm{MVN}(0, K\sigma_u^2)$
• K = kinship matrix, normally calculated with genome-wide markers
Efficient Mixed-Model Association (EMMA)
• Uses eigenvalue decomposition to solve the mixed-model equations more efficiently
• (Taking the direct inverse of the covariance matrix is computationally intensive; we want to avoid this in GWAS.)
Options for modeling structure and kinship [see Price et al. (2010)]
Inferring and modeling structure
• Use knowledge on subpop membership directly
• Subpopulation clustering (explicitly infer ancestry)
– STRUCTURE
– ADMIXTURE
• Principal component analysis
– Use top PCs as covariates to correct for pop structure
– Related approach is multi-dimensional scaling (MDS)
Inferring kinship
• Marker similarity matrix
• Realized genomic additive relationship matrix
• Pedigree additive relationship matrix
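As an illustration of a marker-based kinship estimate, the sketch below computes a realized genomic additive relationship matrix using one widely used formulation, K = ZZ' / (2 Σ p_j(1 − p_j)), where Z holds allele dosages centered by twice the allele frequency. The slides do not say which estimator the course script uses, so this choice, the function name `realized_relationship`, and the tiny genotype matrix are assumptions.

```python
def realized_relationship(M):
    """Realized additive relationship matrix from a dosage matrix M.

    M has one row per individual and one column per marker (dosage 0/1/2).
    Assumes at least one marker is polymorphic, so the scaling term is > 0.
    """
    n = len(M)
    m = len(M[0])
    # Allele frequency of the counted allele at each marker
    p = [sum(row[j] for row in M) / (2 * n) for j in range(m)]
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    # Center dosages by their expected value 2p
    Z = [[row[j] - 2 * p[j] for j in range(m)] for row in M]
    return [
        [sum(zi * zk for zi, zk in zip(Z[i], Z[k])) / scale for k in range(n)]
        for i in range(n)
    ]


# Hypothetical genotypes: 4 individuals x 3 markers
M = [
    [0, 2, 1],
    [2, 0, 1],
    [0, 2, 2],
    [2, 0, 0],
]
K = realized_relationship(M)
for row in K:
    print([round(x, 3) for x in row])
```

Individuals 1 and 3 share alleles at most markers and get a positive off-diagonal value, while individuals 1 and 2 carry opposite homozygotes and get a negative one.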
Efficient Mixed-Model Association (EMMA)
See riceGwasEmma.r
Manhattan plot
See riceGwasEmma.r
Statistical threshold: Correcting for multiple testing
[Figure: plot with two candidate significance thresholds, each labeled "Here?"]
• Bonferroni correction
– α_C ≈ α_E / (number of tests), where α_C is the comparison-wise and α_E the experiment-wise error rate
– Assumes independent tests
– Too conservative when tests are correlated (e.g., markers in LD)
• Permutation testing
– Good for linkage mapping
– Generally not valid for GWAS because family structure is not preserved
• False-discovery rate (Benjamini and Hochberg, 1995)
– Calculate expected proportion of declared QTL that are false positives
• Calculate effective number of tests
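The Bonferroni and false-discovery-rate thresholds can be compared on a small set of p-values. This is a minimal sketch: the function names and the ten p-values are hypothetical, and the FDR function follows the Benjamini-Hochberg step-up procedure.

```python
def bonferroni_threshold(alpha, n_tests):
    """Comparison-wise threshold: experiment-wise alpha divided by number of tests."""
    return alpha / n_tests


def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking tests declared significant at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    # Find the largest rank k whose p-value satisfies p_(k) <= k * q / m
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values from 10 marker tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bonf = bonferroni_threshold(0.05, len(pvals))
n_bonf = sum(p <= bonf for p in pvals)
n_fdr = sum(benjamini_hochberg(pvals, q=0.05))
print(bonf, n_bonf, n_fdr)
```

On these made-up values the Bonferroni threshold declares fewer markers significant than the FDR procedure, which is the usual pattern.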
Other software packages to implement linear models for GWAS
• TASSEL: www.maizegenetics.net
• PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
• EIGENSTRAT: http://www.hsph.harvard.edu/alkesprice/software/
• EMMAX: http://genetics.cs.ucla.edu/emmax/
• GAPIT: http://www.maizegenetics.net/gapit
• GenABEL: http://www.genabel.org/packages/GenABEL
• GWASTools: http://www.bioconductor.org/packages/2.11/bioc/html/GWASTools.html
• FaST-LMM: http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/Fastlmm/