Monday, 17/09
Large-scale association studies:
brute force and ignorance
Thomas Lumley
BIOINF 744/STATS 771
My experience: the CHARGE consortium
I’m in CHARGE
Genome-wide Association Studies
• If a protein is important in a particular disease or trait
then small changes in function or expression of that
protein should have a small effect on the disease or trait.
• A small effect. Very small. No, smaller than that.
• But not confounded with environmental or lifestyle
factors: like a tiny randomised experiment
• And not relying on prior biological knowledge
– we can find surprises.
• Because SNP associations are very weak, need
10^3 to 10^5 people in the study
Does it work?
• Not for risk prediction or treatment choice
– exception: some adverse drug reactions
• Yes, for discovering new mechanisms or
potential drug targets
– mystery 9p21 variant in CHD
– autophagy in Crohn’s Disease
– sodium transporter affecting uric acid levels
– new ion channels important in heart rhythm
Measurement technologies
• SNP chips
– cDNA attached to glass/silicon
– multiple probes per SNP
– planned layout (Affymetrix) or random (Illumina)
– DNA binds to cDNA, fluorescent tags for readout
• Off-the-shelf
– 10^5 to 10^7 SNPs, genome-wide coverage
• Custom chips
– 384 to 250000 SNPs for a particular purpose
Scale
• SNP chips are cost-effective only for large
sample sizes and numbers of SNPs
– new ‘exome chip’ has all known coding variants
segregating in the population
– 1.5 million chips sold
– a few hundred dollars in large volumes
= dozens of SNPs per 1c.
Quality control
[Figure: genotype clusters labelled homozygote, heterozygote, homozygote]
Quality control
• Easy by hand, but there are 500000 of them
– 1 minute each = 24 hrs/day for a year.
– Need to be brutal and automated
– 10% of SNPs discarded is not unusual
• Batch effects
• Missingness (per SNP and per sample)
• Hardy-Weinberg equilibrium
• Low minor allele frequency (eg Np<100)
• Big differences from expected allele frequency
HWE
• Not looking at population structure here
• Bad SNPs tend to drop out either
heterozygotes or rare-allele homozygotes
• Calling error leads to massive HWE violations
– p-value < 10^-5 is a common standard
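The HWE filter can be sketched as a one-degree-of-freedom chi-square test on the three genotype counts for a SNP. This is an illustrative sketch only: the function name `hwe_chisq_pvalue` is hypothetical, and the chi-square approximation is an assumption (real pipelines often use an exact test, especially for rare alleles).

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium for one SNP.

    Arguments are the observed counts of the three genotypes.
    Returns the p-value; small values suggest a calling problem.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of the first allele
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic SNP: nothing to test
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chisq / 2.0))

# A SNP that has lost all its heterozygotes fails the 10^-5 threshold
print(hwe_chisq_pvalue(500, 0, 500) < 1e-5)
```

A SNP whose counts match the expected proportions exactly, such as 25/50/25, gives a chi-square of zero and a p-value of 1.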
Batch effects
• Experimental design is important
– mix cases and controls in same batches
– especially important with new technologies
After online publication of our report “Genetic Signatures of
Exceptional Longevity in Humans” (1) we discovered that
technical errors in the Illumina 610 array and an inadequate
quality control protocol introduced false positive single
nucleotide polymorphisms (SNPs) in our findings.
-- Sebastiani et al. Science retraction.
Analyses
• Must be simple and fast
• Usually additive genetic model
• Adjust for
– sampling factors such as recruitment site
– precision variables (eg heart rate in QT interval)
– age and sex (because epidemiologists can’t stop themselves)
– population structure summaries
– not for any post-conception exposures that don’t
affect genes.
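As a rough illustration of the additive genetic model, the sketch below regresses a quantitative trait on the allele count (0, 1 or 2) for a single SNP and returns the slope and its standard error. It is a simplification under stated assumptions: the function name `additive_model_fit` is hypothetical, and the adjustment covariates listed above are left out to keep the example short.

```python
import math

def additive_model_fit(allele_counts, phenotype):
    """Simple linear regression of phenotype on allele count (0/1/2).

    Returns (slope, standard_error). The slope is the estimated change
    in phenotype per extra copy of the allele. No covariates are
    included, which a real analysis would add.
    """
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(allele_counts, phenotype))
    residual_variance = rss / (n - 2)
    return slope, math.sqrt(residual_variance / sxx)

slope, se = additive_model_fit([0, 1, 2, 0, 1, 2],
                               [1.1, 1.9, 3.2, 0.8, 2.1, 2.9])
print(slope, se)
```

The pair (slope, standard error) for each SNP is what a study would share for meta-analysis.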
Results
[Figure: one dot for each SNP, ordered by position within chromosome; vertical axis is the number of zeroes in the p-value (need 7-8)]
Zoom in
Meta-analysis
• Most genome-wide studies involve multiple
samples
• Usually share results, not individual data
– combine by precision-weighted meta-analysis
– no loss of efficiency for single-parameter analyses
\[
\hat{\beta} = \frac{\sum_{i=1}^{m} \sigma_i^{-2}\,\hat{\beta}_i}{\sum_{i=1}^{m} \sigma_i^{-2}},
\qquad
\sigma^2 = \frac{1}{\sum_{i=1}^{m} \sigma_i^{-2}}
\]
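A minimal sketch of precision-weighted meta-analysis for one SNP: each study contributes its estimate and standard error, and the weights are the inverse variances. The function name `inverse_variance_meta` and the example numbers are made up for illustration.

```python
import math

def inverse_variance_meta(estimates, standard_errors):
    """Combine per-study estimates by precision (inverse-variance) weighting.

    Returns (combined_estimate, combined_standard_error).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    combined = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined, combined_se

# Three hypothetical studies reporting the same SNP
beta, se = inverse_variance_meta([0.12, 0.08, 0.15], [0.05, 0.04, 0.10])
print(beta, se)
```

Only the per-study estimates and standard errors are needed, which is why studies can share results rather than individual data.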
Computation: not a big deal
• In R, roughly 12 cpu-hrs for quantitative traits,
36 cpu-hrs for binary, time-to-event
• Parallelises very well
– we split by chromosome
• Limited by disk bandwidth
– eg, six parallel R sessions on a cheap eight-core
server
– eg, 500 parallel R sessions on high-quality
supercomputer
Population structure
• Full Bayesian modelling is too slow at this
scale
• Use first few principal components of the
genotype correlation matrix
– population structure is a concern because it leads to
systematic variation in allele frequencies along the whole
genome
– systematic variation in allele frequencies along the whole
genome shows up in principal components
Principal components
• Genotype matrix G has 10^6 columns, 10^4 rows
– don’t want to form G^T G, with 10^12 entries
– work with G G^T, with 10^8 entries
– first few eigenvectors are population structure
components (or common inversions)
– ‘EIGENSTRAT’ was first program to do this
– Reduce effort further by using just 10^5 or 10^4
random SNPs (some loss in quality)
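The idea of working with the person-by-person matrix can be sketched on toy data. The example below centres each SNP column, forms the small person-by-person matrix, and finds its leading eigenvector by power iteration. The function names and the six-person genotype matrix are invented for illustration, and power iteration in pure Python is an assumption made to keep the sketch self-contained; at real scale a numerical linear-algebra library would be used, and this is not a description of how EIGENSTRAT is implemented.

```python
import math

def centre_columns(G):
    """Subtract each SNP's mean allele count from its column."""
    n, m = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in G]

def person_by_person(X):
    """Form X X^T: one row and column per person, not per SNP."""
    n = len(X)
    return [[sum(a * b for a, b in zip(X[i], X[k])) for k in range(n)]
            for i in range(n)]

def top_eigenvector(K, iterations=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    n = len(K)
    v = [float(i + 1) for i in range(n)]   # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(K[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: rows are people, columns are SNPs (allele counts 0/1/2).
# The first three people and the last three differ in allele frequency.
G = [[0, 0, 1, 0, 2],
     [0, 1, 0, 0, 1],
     [1, 0, 0, 0, 2],
     [2, 2, 1, 2, 1],
     [2, 1, 2, 2, 2],
     [1, 2, 2, 2, 1]]
pc1 = top_eigenvector(person_by_person(centre_columns(G)))
print(pc1)
```

In this toy example the first component takes one sign for the first three people and the opposite sign for the last three, which is the kind of summary that would be added as an adjustment variable.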
Principal components: MESA study
Principal components
• Does it work?
– if not, ancestry-informative loci would be overrepresented in association findings
– largely not the case
• slight suggestion in very largest studies that ABO blood
group and lactase persistence loci are cropping up too
often.
Imputation
• Meta-analysis often involves studies using
different SNP chips
• Can only combine results for the same SNPs
– usually a minority
• Imputation allows everyone to use the same
SNPs
• Based on linkage disequilibrium
– with 500,000 SNPs, we are very far from linkage
equilibrium
Imputation
• Haplotyping
– estimate possible haplotypes and their
probabilities for each person in your sample
• In reference panel with all the SNPs (eg HapMap)
– look up which allele is on each haplotype
• Compute posterior mean genotype
\[
\hat{G} = \sum_{g=0,1,2} g \times P(G = g)
= \sum_{g=0,1,2} \; \sum_{\text{haplotypes } h} g \times P(G = g \mid H = h) \times P(H = h)
\]
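A small sketch of the posterior mean genotype for one person at one untyped SNP. It assumes, for simplicity, that the reference panel tells us exactly which allele (0 or 1) each haplotype carries, so P(G = g | H = h) is either 0 or 1. The function name, the haplotype labels and the probabilities are all hypothetical.

```python
def posterior_mean_genotype(haplotype_pair_probs, allele_on_haplotype):
    """Expected allele count at an untyped SNP for one person.

    haplotype_pair_probs: dict mapping (hap1, hap2) to P(H = h)
    allele_on_haplotype:  dict mapping haplotype label to 0 or 1,
                          looked up in the reference panel
    """
    dosage = 0.0
    for (hap1, hap2), prob in haplotype_pair_probs.items():
        genotype = allele_on_haplotype[hap1] + allele_on_haplotype[hap2]
        dosage += genotype * prob
    return dosage

# One person with two plausible haplotype pairs (made-up numbers)
pairs = {("h1", "h1"): 0.7, ("h1", "h2"): 0.3}
alleles = {"h1": 0, "h2": 1}
print(posterior_mean_genotype(pairs, alleles))
```

The result is a fractional allele count between 0 and 2 rather than a hard genotype call.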
Imputation
• Imputation does not use phenotype data
– slightly underestimates association
– but only for SNPs that explain a large fraction of
variation in phenotype
– which basically don’t exist.
• Just plug imputed genotype into regression as
if it was measured.
– some people filter out SNPs where imputation is
low-quality: compare var[G] to 2p(1-p)
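The quality filter can be sketched as a ratio of the observed variance of the imputed genotypes to the variance 2p(1-p) expected for a well-measured SNP with the same allele frequency. The function name `imputation_quality_ratio` is hypothetical, and any cut-off applied to the ratio would be a study-specific choice not given here.

```python
def imputation_quality_ratio(dosages):
    """Ratio of var[G] to 2p(1-p) for imputed genotypes at one SNP.

    Values near 1 suggest confident imputation; values near 0 mean the
    imputed genotypes barely vary between people.
    """
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                        # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)
    if expected == 0.0:
        return float("nan")               # monomorphic: ratio undefined
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return observed / expected

print(imputation_quality_ratio([0.0, 1.0, 2.0, 1.0]))   # hard calls
print(imputation_quality_ratio([0.9, 1.0, 1.1, 1.0]))   # vague imputation
```

When imputation is uninformative, every person's dosage shrinks towards the population mean, so the observed variance falls well below 2p(1-p).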
Imputation
• For meta-analysis, need to impute to the same
set of SNPs before analysis
– most people use 2.5 million HapMap Phase II SNPs
– starting to use 38 million 1000 Genomes SNPs
– for additive genetic model, doesn’t matter
whether SNPs are measured or imputed.
– slightly more work needed for non-additive
genetic models or SNP:SNP interaction models
Resequencing
• 2.5 million SNPs is one per 1000 bases
• Every base varies somewhere in the human
population
• Association studies by sequencing are just
becoming possible
• US$1000 genome probably coming next year
Resequencing
• Basic idea is similar to GWAS, but
– most variants will be rare
– some variants will have stronger associations
– the true functional variant will be measured.
• For sufficiently-common SNPs, use the same
analysis as in GWAS
• For rare variants (SNPs and indels), use a
burden test
Burden tests
• Might expect most mutations to reduce function
– people with more copies of rare variants should have
lower function for that gene (or non-gene locus)
• Use number of variants for each person as
predictor in a regression model
– rarer variants may have larger effects: give them more
weight
– we know or guess that some bases are more likely to
matter: give them more weight
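A sketch of the predictor used in a burden test: a weighted count of rare variants per person within one gene. The weighting 1/sqrt(p(1-p)), which gives rarer variants more weight, is one possible choice assumed here for illustration and is not specified in the slides; the function name and data are also made up.

```python
import math

def burden_scores(genotypes, allele_freqs):
    """Weighted rare-variant count per person for one gene.

    genotypes:    one row per person, one allele count per variant
    allele_freqs: frequency of each variant, strictly between 0 and 1
    """
    # Assumed weighting: rarer variants get larger weights
    weights = [1.0 / math.sqrt(p * (1.0 - p)) for p in allele_freqs]
    return [sum(w * g for w, g in zip(weights, person))
            for person in genotypes]

# Three people, two variants; the first variant is rarer
scores = burden_scores([[1, 0], [0, 1], [0, 0]], [0.001, 0.01])
print(scores)
```

The scores would then be used as the predictor in a regression model, in place of a single SNP's allele count.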
Omnidirectional burden tests
• ‘Loss of function’ is tricky
– ion channel function is to open and close: which
direction is loss of function?
– Leiden variant in Factor V removes ability to be
turned off: loss or gain of function?
• Would be nice to find important genes even if
variants act both ways
• Hard: huge increase in dimension of problem
• Simple meta-analyses are no longer efficient.
Omnidirectional burden tests
• Typically based on correlation
– do people with more similar genotypes have more
similar phenotypes?
• Power is very low if there are many
unimportant variants
Third generation sequencing
• Pacific Biosciences: tethered polymerase
copies single DNA molecule, with spotlight
small enough to see just one base fluoresce
• Oxford Nanopore: drag a single DNA strand
through a tiny hole and measure its shadow
• Ion Torrent: tethered polymerases copy one
base at a time, read-out uses H+ ion released
by adding the base.