Transcript slides
Genome-wide association studies
(GWAS)
Thomas Hoffmann
Department of Epidemiology and Biostatistics,
and Institute for Human Genetics
Outline
GWAS Overview
Study design
Plink overview
QC
Analysis
If we have time
Main idea
●
●
●
Correlation between genotype (SNP) and trait of
interest (e.g., LDL/HDL, blood pressure)
Test ~1 Million SNPs to see if any are correlated
with the phenotype.
Agnostic “shotgun” approach across the
genome
Hirschhorn & Daly, Nat Rev Genet 2005
Size matters
Large # SNPs tested,
multiple comparison
correction
● Very small effect sizes
● TagSNP, rather than actual
SNP
●
Visscher, AJHG
2012,
Outline
GWAS Overview
Study design
gPlink overview
QC
Analysis
If we have time
Study design with Quanto
Disease: case-control
(unmatched/matched),
continuous
➢Hypothesis: Gene only (also
does GxE)
➢Outcome model: Baseline
Risk
➢Genetic Effect: many GWAS
ORs have actually been very
small
➢Parameters: Power/Sample
size
-8
➢Type I error: 5e-8 (=5x10 ,
bonferroni correction for 1
million tests)
➢
What parameters seem reasonable?
How would you explain this in a grant?
http://biostats.usc.edu/software
Outline
GWAS Overview
Study design
gPlink overview
QC
Analysis
If we have time
Plink overview
●
●
●
Command-line driven code [You have seen this, right?]
●
Start > All Programs > Accessories > Command Prompt
●
Rock solid
Graphical interface [You haven't seen this?]
●
Pretty, but a little unstable, especially if you kill something
●
May not allow you to delete some things until restart system
Great documentation
http://pngu.mgh.harvard.edu/~purcell/plink/reference.shtml
●
Apply an operation to data, which produces output. Manually looking
at output much more difficult because of quantity of it for GWAS; use
other programs (Haploview [today], stata, R, e.g.) to display that
output better
gPLINK setup
Click “File > Open project”, and then click “Browse” in the dialogue that pops up, and navigate to
where you have your plink file located, then click OK.
Ignore any “stream error” errors you get, they don't seem to cause any problems.
gPLINK Setup (2)
Click “File > Configure”, then click the two “Browse” buttons, navigating to the haploview and plink
executable files, respectively.
Outline
GWAS Overview
Study design
Plink overview
QC
Analysis
If we have time
QC Steps (1)
Remove SNPs with low call rate (e.g., <95%),
Individuals with too much missing data
Proportion of SNPs actually called by software
If it's low, the clusters aren't well defined, artifacts
PLINK produce summary:
– plink --bfile ceu --missing --out ceu-miss
PLINK on the fly / to produce a new dataset
– plink --bfile ceu --geno 0.05 --mind 0.05 <rest of
command, e.g., association>
WARNING: It will look like it is not doing
anything!!! In the background a plink.exe process
has been launched. You need to wait for the green
circle.
This creates a new dataset based on
the filter, but this can also be done on
the fly with some of the association
analysis options as well...
QC Steps (2)
Remove those with low minor allele frequency?
Rarer variants more likely artifacts / underpowered
Sometimes fit is unstable
Other approaches to deal with rare variants than
single SNP at a time (combine them)
Produce summary: plink --bfile ceu --freq --out ceufreq
Filter: plink --bfile ceu --maf 0.05 --make-bed --out
ceu-0.05
QC Steps (3)
SNPs that fail Hardy-Weinberg
Suppose a SNP with alleles A and B has allele
frequency of p. If random matting, then
AA has frequency p*p
AB has frequency 2*p*(1-p)
BB has frequency (1-p)*(1-p)
Test for this (e.g., chi-squared test), if deviating too
strongly, likely a bad SNP
In practice do for homogeneous populations (more
later), controls only
> hwe = read.table("hwe.hwe", header=TRUE,
stringsAsFactors=FALSE, fill=TRUE)
> head(hwe)
CHR
SNP TEST
GENO O.HET. E.HET. P_HWD
1
1 rs3131968
ALL 0/15/45 0.2500 0.2188 0.5795
2
1 rs3131968
AFF
0/0/0 -1.0000
NA
NA
3
1 rs3131968 UNAFF
0/0/0 -1.0000
NA
NA
4
1 rs12562034
ALL 0/11/49 0.1833 0.1665 1.0000
5
1 rs12562034
AFF
0/0/0 -1.0000
NA
NA
6
1 rs12562034 UNAFF
0/0/0 -1.0000
NA
NA
>
QC Steps (4)
• Check genotype gender from X heterozygosity
plink --bfile ceu --check-sex --out ceu-check-sex
●
Microarrays often have another gender check
ceu-check-sex.sexcheck
FID
IID
PEDSEX
1341
NA06985
2
1341
NA06993
1
1340
NA06994
1
1340
NA07000
2
1340
NA07022
1
1341
NA07034
1
1341
NA07055
2
1340
NA07056
2
1345
NA07345
2
SNPSEX
2
1
1
2
1
1
2
2
2
STATUS
OK
OK
OK
OK
OK
OK
OK
OK
OK
F
-0.06403
1
1
-0.03363
1
1
-0.06594
-0.02845
0.00151
F: The actual X chromosome inbreeding (homozygosity) estimate. Which ones are
female? Has this data been already cleaned?
(5) Check for relatedness, e.g., HapMap
Take overall average
of all SNPs of how
many alleles are
shared.
●E.g., parent-offspring
never shares zero
alleles.
●HapMap was
supposed to be
unrelateds (this was
a bit of a surprise!)
● “MZ twins”
sometime a
duplicated sample...
●
Pemberton et al., AJHG 2010
PLINK Identity By State (IBS)
Calculations
plink --bfile ceu --genome --out ceu-genome
FID1
Family ID for first individual
IID1
Individual ID for first individual
FID2
Family ID for second individual
IID2
Individual ID for second individual
RT
Relationship type given PED file
EZ
Expected IBD sharing given PED file
Z0
P(IBD=0)
Z1
P(IBD=1)
Z2
P(IBD=2)
PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )
PHE
Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs)
DST
IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
PPC
IBS binomial test
RATIO Of HETHET : IBS 0 SNPs (expected value is 2)
PLINK Identity By State (IBS)
Calculations
plink --bfile ceu --genome --out ceu-genome
FID1
Family ID for first individual
IID1
Individual ID for first individual
FID2
Family ID for second individual
IID2
Individual ID for second individual
RT
Relationship type given PED file
EZ
Expected IBD sharing given PED file
Z0
P(IBD=0)
Z1
P(IBD=1)
Z2
P(IBD=2)
PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )
PHE
Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs)
DST
IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
PPC
IBS binomial test
RATIO Of HETHET : IBS 0 SNPs (expected value is 2)
QC Notes
●
●
Exact thresholds differ based on dataset, how
many subjects you have
GWAS of 1,000 subjects vs. GWAS of 100,000
subjects
●
●
Ad-hoc look at QQ-plots (later) to asses some
QC issues
Look at cumulative distributions may be slightly
easier than histograms (what we saw earlier) for
cutoffs in the tails
QC Summary
●
●
●
●
●
Filter SNPs and individuals
●
MAF (very sample size dependent)
●
Low call rates
Test HWE
Relatedness – remove first degree relatives at
least, or account for
Check genotypic gender (e.g., mislabeled
samples)
Anything extra you could do with family data?
●
Mother AA, Father BB, Child AA
Outline
GWAS Overview
Study design
Plink overview
QC
Analysis
If we have time
GWAS analysis
• Most common approach: look at each SNP one-at-a-time
– Additive coding of SNP most common, e.g., # of A
alleles
– Just a covariate in a regression framework
• Dichotomous phenotype: logistic regression
• Continuous phenotype: linear regression
– {BMI}=B1{SNP}+ B2{Age}+...
• Can use residuals/propensity score to speed up
analysis
What is population stratification?
If allele frequency is different in two populations
(races/ethnicities), and a disease prevalence is
different in the two...
● If not corrected for, what is your test actually telling
you?
● Generally inflates all test statistics if not accounted
for
●
Balding, Nature Reviews Genetics 2010
Adjusting for PC's
Maximize
variance using all
SNPs between
subjects – pulls
out clusters of
individuals of
different
populations/subp
opulations.
• Li et al., Science 2008
Adjusting for PC's
• Razib, Current Biology 2008
PCs
●
Eigenstrat most common (linux only)
●
Plink can still run something similar, two steps:
plink --bfile ceu --genome --out ceu-genome
plink --bfile ceu --read-genome ceugenome.genome --cluster --mds-plot 10 --out
ceu-mds
Running the analysis
Continuous phenotype: linear regression
{BMI}=B1{SNP}+ B2{Age}+...
plink --bfile ceu
--linear
--covar ceu.cov
--covar-name PC1,PC2,PC3,PC4
--pheno ceu.phe
--pheno-name phe
--out ceu-analysis
--hide-covar
--hide-covar: supresses lots of output (every covariate for every
regression), unfortunately, cannot do in gPLINK
plink --bfile ceu
--linear
--covar ceu.cov
--covar-name PC1,PC2,PC3,PC4
--pheno ceu.phe
--pheno-name phe
--out ceu-analysis
--hide-covar
--adjust
--adjust: Get the genomic inflation factor lambda, and inflate the
test statistics slightly for it (otherwise can inflate later with other
packages; especially if you want to run in parallel)
Do you really want to adjust? More later...
Format of covariate files
Phenotype and covariate files not structurally
different from each other:
FID IID LDL HDL BMI
ID1 ID1 120 160 200
…
FID and IID match things in the .fam file of the
plink formatted files (which generally calling
software produces)
View analysis results
Manhattan plot
Multiple comparison correction
• If you conduct 20
tests at =0.05, one
true by chance
http://xkcd.com/882/. If
you conduct 1 million
tests...
• Correct for multiple
comparisons
– e.g., Bonferroni, 1
million, =5x10-8
Multiple comparison correction
• If you conduct 20
tests at =0.05, one
true by chance
http://xkcd.com/882/. If
you conduct 1 million
tests...
• Correct for multiple
comparisons
– e.g., Bonferroni, 1
million, =5x10-8
Multiple comparison correction
• If you conduct 20
tests at =0.05, one
true by chance
http://xkcd.com/882/. If
you conduct 1 million
tests...
• Correct for multiple
comparisons
– e.g., Bonferroni, 1
million, =5x10-8
Multiple comparison correction
• If you conduct 20
tests at =0.05, one
true by chance
http://xkcd.com/882/. If
you conduct 1 million
tests...
• Correct for multiple
comparisons
– e.g., Bonferroni, 1
million, =5x10-8
QQ-plots and PC adjustment
Lambda quantifies how much it is deviating from what we
would expect. Want it to be roughly 1, but I don't see too
many objections for 1.04, 1.06, e.g., for ~10,000 subjects.
• Wang, BMC Proc 2009
Genomic inflation factor
Why might the genomic inflation factor be
elevated?
Genomic inflation factor
Why might the genomic inflation factor be
elevated?
No, there was nothing done wrong with the
analysis...
Genomic inflation factor
Why might the genomic inflation factor be
elevated?
There are hundreds of loci known for height...
Genomic inflation factor
Why might the genomic inflation factor be
elevated?
There are hundreds of loci known for height...
After you found a hit: Check cluster
plot (where genotype calls made)
Good calls!
Bad calls!
Laird and Lange
Outline
GWAS Overview
Study design
Plink overview
QC
Analysis
If we have time
Advanced: Conditional analysis
chromosome
Region
2
30
Region
3
25
Region
1
rs144729
5
Column U
Column H
Column AA
Column AQ
http://cgems.cancer.gov
Multiple prostate cancer
loci on 8q24
-log(p-value)
20
rs169019
79
15
Y = b0 + b1X1 + b2X2
10
rs698326
7
In practice, take residual of Y on
X1, then fit that to X2 (avoid
issues of multicollinearity)
5
0
128.10
128.20
128.30
128.40
128.50
128.60
128.70
Position on 8q24 (Mb)
Prostate cancer, Witte, Nat Genet 2007
Conditional Analysis (continued)
●
●
Different approaches – what question is being
asked?
●
Rerun only within region
●
Genome-wide
Hard to coordinate in consortium
Imputation of SNP Genotypes
Combine data from different platforms (e.g., Affy &
Illumina) (for replication / meta-analysis).
Estimate unmeasured or missing genotypes.
Based on measured SNPs and external info (e.g.,
haplotype structure of HapMap).
Increase GWAS power (impute and analyze all), e.g.
Sick sinus syndrome, most significant was 1000
Genomes imputed SNP (Holm et al., Nature Genetics,
2011)
HapMap as reference, now 1000 Genomes Project
Imputation Example
What
would
you do?
Li et al., Ann Rev Genom Human Genet, 2009
Imputation Example
Li et al., Ann Rev Genom Human Genet, 2009
Imputation Example
Li et al., Ann Rev Genom Human Genet, 2009
Imputation Application
TCF7L2 gene region & T2D from the WTCCC data
Observed genotypes black
Imputed genotypes red.
Chromosomal Position
Marchini Nature Genetics2007
http://www.stats.ox.ac.uk/~marchini/#software
Imputation Software
●
●
●
SHAPE-IT to phase http://www.shapeit.fr/
Impute2 to impute (+ 1000
genomes):http://mathgen.stats.ox.ac.uk/impute/impute
_v2.html
Analyze similar to typed genotypes, with plink –dosage
●
●
Data subset, etc., not as flexible with dosages
(i.e., don't work)
1000 Genomes: Tens of millions of SNPs...
Replication
• To replicate:
– Association test for replication sample significant at
0.05/{Number of SNPs replicating} alpha level
– Same genetic model (e.g. additive, dominant)
– Same direction
– Sufficient sample size for replication
• Non-replications not necessarily a false positive
– LD structures, different populations (e.g., flip-flop)
– covariates, phenotype definition, underpowered
Meta-analysis
• Combine multiple studies to increase power
• Either combine p-values (Fisher’s test),
–
when might you use this? (tricky question)
• or z-scores (better)
• plink --meta-analysis study1.assoc
study2.assoc study3.assoc
–
Tip: Run previous analysis with --ci 0.95
Extras
●
●
If using separate control group (“convenience
controls”), need to be extra careful with QC
(give example).
Why?
Summary
●
●
GWAS successful in finding large numbers of
variants associated with disease.
GWAS not as successful as people hoped
●
Not as much of the heritability / variance
explained.
●
Difficult to find true functional SNP?
●
What do we do after we find them?