Genome-wide Association Studies
Multifactorial traits and complex genetics I
Genome-wide association studies in humans
Wellcome Trust Centre for Human Genetics
Overview
Describe studies aiming to find genetic
differences between individuals that
influence susceptibility to diseases (or other
traits).
Why find disease genes?
Identify putative drug targets.
Identify high risk individuals.
Gene therapy?
Personalised medicine (e.g. stratifying cancer)
Understand the biology of disease.
How do genetic factors influence traits?
Two somewhat competing views
1. Genetic influence on traits is inherited in big, discrete lumps (“Mendelian inheritance”)
- Gregor Mendel (1865)
- Morgan (1915)
- e.g. Inheritance of the ABO blood group worked out (1924)
2. Genetic influence on traits is inherited in essentially continuous quantities (“biometrical”, “multifactorial”, “polygenic” viewpoint)
- Darwin 1859
- Galton 1886 (e.g. human height)
Fisher, Haldane, Wright 1920s-1930s: the modern evolutionary synthesis
Genomics timeline
1950s – structure of DNA
1970s – ‘Sanger sequencing’
1980s – RFLPs (genetic barcode / inexpensive genotyping of marker loci)
1990s – Linkage studies using RFLPs
2000s – Human Genome Project completed; International HapMap Project; first genotyping microarrays; first large-scale association studies
2010s – 1000 Genomes Project; direct-to-consumer genetic testing
Present – Massively large-scale biobank / population sequencing projects (UK Biobank); 100,000 Genomes Project (UK); Precision Medicine Initiative (US), …
Finding Disease Genes 1 (linkage)
Familial Aggregation
Segregation Analysis
Genome-wide Linkage Analysis
Linkage Mapping
Small number of typed markers (A/a, B/b, C/c, …) along a chromosome
[Figure: pedigree of affected and unaffected individuals, each labelled with a pair of three-marker haplotypes, e.g. ABC/abc, aBC/abc, Abc/abc, abC/abc, ABc/abc, abc/abc]
Linkage Mapping
Typical result if successful – a strong signal (good) but
not well localised within a chromosome.
[Figure: linkage signal plotted along the chromosome]
Initial discovery led to the finding of APOE variants affecting risk of Alzheimer’s.
Pericak-Vance et al., Am. J. Hum. Genet. (1991)
Finding Disease Genes 1 (linkage)
Familial Aggregation
Segregation Analysis
We aren’t very good at this!
Genome-wide Linkage Analysis
Candidate Gene Studies + Fine Mapping
Gene Characterization
Successes and Failures
Linkage mapping has been successful in identifying the genetic basis of many human diseases in which the disease penetrance resembles a simple Mendelian model, e.g. Huntington’s disease (HD 1993), cystic fibrosis, some forms of breast cancer (BRCA1 1993), Alzheimer’s (APOE 1991)…
But
“the literature is now replete with linkage screens for an array of common ‘complex’ disorders such as schizophrenia, manic depression, autism, asthma, type I and type II diabetes, Multiple Sclerosis, Lupus. Although many of these studies have reported significant linkage findings, none has led to convincing replication” – Risch (2000)
Successes and Failures
Why? It’s because linkage studies aren’t the right study design for
detecting non-Mendelian-like effects. These so-called ‘complex’ traits
have fundamentally different genetic architectures.
Relative risk (RR) = P( disease | risk allele ) / P( disease | non-risk allele )
‘Mendelian’-like trait => RR > 4 or so, i.e. you are many times more likely to get the disease if you carry a risk allele.
For common diseases, RRs are typically thought to be 1.5 or smaller. (But there may be many such variants.)
[Figure: relative risk (RR) against variant frequency, from rare (e.g. <1%) to common (e.g. 5-50%), with the region occupied by complex diseases marked]
Common complex disease is underlain by multiple mutations of modest effect, typically RR < 1.5
Successes and Failures
Linkage studies aren’t the right study design for detecting complex
trait effects.
[Figure: number of families / case-control pairs needed as a function of relative risk, for a linkage study versus a case/control GWAS study. Risch (2000)]
Relative risk = P( disease | risk allele ) / P( disease | non-risk allele )
Finding Disease Genes 2 - GWAS
Familial Aggregation
Still want a heritable trait!
Segregation Analysis
Genome-wide Linkage Analysis
Genome-wide Association Analysis
Candidate Gene Studies + Fine Mapping
Gene Characterization
Association mapping
Chromosomes
Cases (D)
Controls (U)
1. Collect a set of unrelated affected individuals (cases) and
unaffected individuals (controls).
Association mapping
Chromosomes
Cases (D)
Controls (U)
Red variant is what we’re looking for – e.g. in this toy example,
RR = P(D|red) / P(D|not red)
   = [ P(red|D) P(not red) ] / [ P(not red|D) P(red) ]
   = (5/6 * 5/6) / (1/6 * 1/6)
   = 25
So real effects, e.g. RR < 1.5, are much more subtle than this!
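The arithmetic above can be checked with a few lines of Python. This is only a sketch of the Bayes-rule rearrangement on the slide. The function name `relative_risk` and its argument names are made up for illustration, and the inputs are the toy-example values 5/6 and 1/6.

```python
from fractions import Fraction


def relative_risk(p_variant_given_case, p_variant_in_population):
    """RR = P(D|variant) / P(D|no variant), rewritten with Bayes' rule.

    The P(D) terms cancel, leaving only the variant frequency among
    cases and the variant frequency in the population as a whole.
    """
    p_other_given_case = 1 - p_variant_given_case
    p_other_in_population = 1 - p_variant_in_population
    return (p_variant_given_case * p_other_in_population) / (
        p_other_given_case * p_variant_in_population
    )


# Toy example: red is on 5/6 of case chromosomes and 1/6 of the population.
toy_rr = relative_risk(Fraction(5, 6), Fraction(1, 6))
print(toy_rr)

# A weaker (hypothetical) effect: 22% of case chromosomes vs 20% overall.
subtle_rr = relative_risk(Fraction(22, 100), Fraction(20, 100))
print(float(subtle_rr))
```

Using `Fraction` keeps the toy calculation exact, so the first result is the integer 25 and not a rounded float.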
Association mapping
Cases (D)
Controls (U)
2. Genotype many thousands of genetic markers (but probably
not the causal, functional mutations themselves)
Association mapping
Cases (D)
Controls (U)
3. Hope to rely on correlations between typed markers and
the causal mutations
Association mapping
e.g. in our toy example, at a typed marker SNP:
            Not white   White   Frequency of white
Cases           5         1          1/6
Controls        2         4          2/3
=> Estimate RR = 10 at this marker SNP.
Perform a statistical test for evidence of a difference in allele frequencies between cases and controls (e.g. chi-squared test).
In this toy example P = 0.24, so there is not enough data even for this strong effect.
P < (a stringent threshold) => success!
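The toy numbers can be reproduced with the standard library alone. The sketch below makes two assumptions. First, the slide's P = 0.24 appears to correspond to a 2x2 chi-squared test with Yates' continuity correction; that is inferred from the arithmetic and is not stated on the slide (the uncorrected test gives roughly 0.08). Second, the control frequencies are used as a stand-in for population frequencies. The function name `chi2_2x2` is made up for illustration.

```python
import math
from fractions import Fraction


def chi2_2x2(a, b, c, d, yates=True):
    """Chi-squared test (1 df) on the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the upper
    tail probability is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction, assumed here in order to match P = 0.24.
        diff = max(diff - n / 2, 0.0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))


# Rows: cases, controls. Columns: not white, white.
stat, p = chi2_2x2(5, 1, 2, 4)
stat_raw, p_raw = chi2_2x2(5, 1, 2, 4, yates=False)
print(round(stat, 2), round(p, 2))
print(round(stat_raw, 2), round(p_raw, 2))

# RR estimate at the marker, taking control frequencies as population ones.
rr_marker = (Fraction(5, 6) * Fraction(4, 6)) / (Fraction(1, 6) * Fraction(2, 6))
print(rr_marker)
```

With only 12 chromosomes, even an estimated RR of 10 is not significant, which is the point the slide makes.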
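(Aside: association studies using the transmission disequilibrium test, TDT)
Collect lots of trios (two parents and their offspring)
Condition on phenotype of offspring (case)
High-risk alleles should be over-transmitted
Internal control formed by untransmitted alleles
[Figure: trio pedigree showing the parental A/a alleles and which of them are transmitted to the affected offspring]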
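As a rough illustration of the over-transmission idea, the usual form of the TDT statistic is a McNemar-type comparison among heterozygous parents: how often allele A was transmitted to the affected child versus not transmitted. The counts below are hypothetical and the function name `tdt` is made up for this sketch.

```python
import math


def tdt(transmitted, untransmitted):
    """McNemar-style TDT statistic on heterozygous (A/a) parents.

    transmitted:   number of A/a parents who passed A to the affected child
    untransmitted: number of A/a parents who passed a instead
    Under no association both outcomes are equally likely, and the
    statistic is approximately chi-squared with 1 degree of freedom.
    """
    stat = (transmitted - untransmitted) ** 2 / (transmitted + untransmitted)
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: 100 heterozygous parents, 60 transmitted A.
tdt_stat, tdt_p = tdt(transmitted=60, untransmitted=40)
print(tdt_stat, round(tdt_p, 4))
```

Because the untransmitted alleles come from the same parents, the comparison is matched within families, which is what the slide means by an internal control.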
Difference between linkage and association
Linkage studies
- Collect set of families with individuals carrying disease or
phenotype
- Look for co-segregation of small number of markers with
disease status.
Association Studies
- Collect unrelated individuals and look at allele frequency
differences between cases and controls (or cases and
parents for TDT)
- Requires genotyping many thousands of markers.
- Exploits correlations between nearby genetic variants along chromosomes within the population
Theory
Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease.
Questions
- How many SNPs would actually be needed to cover
the genome?
- Can we actually type enough SNPs, and cheaply
enough, for the large sample sizes required?
Tagging genetic diversity
How many markers are actually required to tag
the diversity?
- To understand this, must first understand
patterns of diversity in natural populations
- Identify catalogue of variants to type
Can we design experiments to analyse such
large numbers of SNPs?
Correlation between SNPs
[Figure: correlation between SNPs against physical distance along the chromosome, comparing real data with the previous prediction. Reich et al., Nature 2001]
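One common way to quantify the correlation between two SNPs is the squared correlation coefficient r² computed from phased haplotypes. The slide does not say which measure is plotted, so treating it as r² is an assumption. The function name `r_squared` and the example haplotypes are hypothetical.

```python
def r_squared(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs, each
    allele coded 0 or 1. Both SNPs must be polymorphic in the sample,
    otherwise the denominator is zero.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Two SNPs that always travel together: one is a perfect tag for the other.
linked = [(1, 1)] * 4 + [(0, 0)] * 6
# Two SNPs whose alleles are combined at random.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

r2_linked = r_squared(linked)
r2_unlinked = r_squared(unlinked)
print(r2_linked, r2_unlinked)
```

An r² near 1 means that typing one SNP tells you almost everything about the other, which is what makes tagging possible.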
Why? - recombination hotspots
Count the number of recombination events in lots of sperm, in the MHC region of chromosome 6
Jeffreys et al 1998
Hotspots are a genome wide feature
More than 80% of recombination in less than
10% of the genome
Recombination gives linkage disequilibrium (LD) a block-like structure
HapMap project
Consortium of a large number of scientists conducting a study to catalogue and describe human genetic diversity
Discovery of over 5 million SNPs across the genome
Estimate that 200,000 to 500,000 SNPs are required to tag the genome (at least in European and Asian populations).
Competition drove technology improvements
[Figure: genotyping chips compared by cost and coverage]
Affymetrix 100K
Affymetrix 500K
Affymetrix 6.0 (~1M SNPs)
…
Illumina 650Y
Illumina 1M
Illumina 2.5M
Illumina 5M
…
Which one to buy?
Costing a GWA
Competition, and anticipation of the power of genome-wide association (GWA) studies, drove the cost of genotyping chips down
Cost per genotype
2003 ~ $1
2005 ~ $0.1
2006 ~ $0.001
2009 ~ $0.0005 (ish)
High throughput microchip arrays
Main players Affymetrix and Illumina
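To see what these per-genotype prices mean for a whole study, the total genotyping cost is roughly individuals x SNPs x cost per genotype. The sketch below uses the approximate prices listed above. The study size (5,000 individuals typed at 500,000 SNPs) is a hypothetical example, and the names `COST_PER_GENOTYPE_USD` and `study_cost` are made up for illustration.

```python
# Approximate cost per genotype in US dollars, as listed on the slide.
COST_PER_GENOTYPE_USD = {2003: 1.0, 2005: 0.1, 2006: 0.001, 2009: 0.0005}


def study_cost(n_individuals, n_snps, year):
    """Rough genotyping cost: every individual typed at every SNP."""
    return n_individuals * n_snps * COST_PER_GENOTYPE_USD[year]


# Hypothetical study: 2,000 cases + 3,000 controls on a 500k chip.
costs = {
    year: study_cost(5_000, 500_000, year)
    for year in sorted(COST_PER_GENOTYPE_USD)
}
for year, cost in costs.items():
    print(f"{year}: ${cost:,.0f}")
```

At 2003 prices such a study would cost billions of dollars. At 2009 prices it is closer to a million, which is why GWAS only became practical in the mid-2000s.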
Power to find weak effects
[Figure: power to detect a relative risk of 1.2 against sample size (number of cases and controls), with one curve per chip: Illumina 650k, Illumina 550k, Illumina 300k, Affymetrix 500k, Affymetrix 100k]
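A back-of-envelope version of such a power curve can be sketched with a normal approximation. This is not how the figure was produced. It assumes the causal SNP is typed directly (so differences in chip coverage are ignored), a multiplicative per-allele relative risk, a rare disease (so control allele frequency is about the population frequency), and a significance threshold of 5e-8, a commonly used genome-wide convention that does not appear on the slide. The function names and the allele frequency of 0.3 are illustrative choices.

```python
from statistics import NormalDist


def case_allele_freq(p, rr):
    """Risk-allele frequency in cases under a multiplicative model."""
    return rr * p / (rr * p + 1 - p)


def power(n_cases, n_controls, p, rr, alpha=5e-8):
    """Approximate power of a two-sided allele-frequency comparison."""
    p1 = case_allele_freq(p, rr)
    # Each individual contributes two chromosomes.
    se = (p1 * (1 - p1) / (2 * n_cases) + p * (1 - p) / (2 * n_controls)) ** 0.5
    z_effect = abs(p1 - p) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(z_effect - z_crit)


power_small = power(1_000, 1_000, p=0.3, rr=1.2)
power_large = power(10_000, 10_000, p=0.3, rr=1.2)
print(round(power_small, 3), round(power_large, 3))
```

Under these assumptions a thousand cases and controls give almost no power at RR = 1.2, while ten thousand of each give power close to 1, consistent with the message that weak effects need large samples.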
Theory
Association studies provide more power, allowing us to detect the small effect sizes of the genes underlying common disease
HapMap
Strong correlations between neighbouring SNPs due to hotspots mean that we don’t necessarily need to type the causal variant
Technology
Competition and commercial drive have meant that we can now affordably type the necessary number of SNPs in large numbers of individuals
GWAS recipe
1. Collect large numbers of case individuals (1000s)
2. Collect large numbers of controls (perhaps
randomly from the population).
(3. Get consent)
4. Extract DNA
5. Genotype individuals at lots of markers
6. At each SNP do a test for allele frequency difference between cases and controls (chi-squared, logistic regression)
7. Look for small p-values (how small?)
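Steps 6 and 7 can be sketched as a loop over SNPs. The example below uses an uncorrected allelic chi-squared test and a Bonferroni threshold (0.05 divided by the number of tests). That is one common answer to "how small?", and with about a million SNPs it gives 5e-8, but it is an assumption here and not something the slide prescribes. The SNP names and allele counts are invented.

```python
import math


def allelic_p_value(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """P-value of a 1 df chi-squared test on a 2x2 table of allele counts.

    Assumes every row and column total is non-zero.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cross = case_alt * ctrl_ref - case_ref * ctrl_alt
    stat = n * cross ** 2 / (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    return math.erfc(math.sqrt(stat / 2))


def scan(snp_counts, alpha=0.05):
    """Test every SNP and keep those below the Bonferroni threshold."""
    threshold = alpha / len(snp_counts)
    hits = []
    for name, counts in snp_counts.items():
        p = allelic_p_value(*counts)
        if p < threshold:
            hits.append((name, p))
    return hits


# Hypothetical counts: (case_alt, case_ref, ctrl_alt, ctrl_ref) per SNP.
example_counts = {
    "snp1": (900, 1100, 700, 1300),
    "snp2": (800, 1200, 790, 1210),
    "snp3": (500, 1500, 450, 1550),
}
hits = scan(example_counts)
print(hits)
```

Note that `snp3` would pass a naive 0.05-style look at a single test but is nowhere near the corrected threshold, and the correction gets far harsher with a million SNPs.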
It works!
Study of ulcerative colitis (inflammatory bowel disease)
2,321 cases, 4,818 controls typed on Affymetrix 6.0 array (~1M SNPs)
There are now (2016) over 160 common SNPs with effects RR < 2
associated with IBD, accounting for ~20% of disease heritability
It works!
www.well.ox.ac.uk/wtccc2/ms
Study of multiple sclerosis (2011)
9,772 cases, 17,376 controls from across Europe
www.genome.gov/gwastudies/
What can possibly go wrong?
Association mapping
Cases (D)
Controls (U)
Genetic markers genotyped
Potential confounders
Testing for small differences in allele frequency in large samples at around a million different SNPs in the genome
• Statistical tests are sensitive to possible confounding, e.g. ??
• Large amounts of data make it difficult to visually inspect the data
Some potential problems
Population Structure
Population differentiation – tends to affect
all parts of the genome
Natural selection – has pronounced effect
at particular loci
Experimental biases
Subtle differences in DNA collection, storage or analysis can lead to both consistent and sporadic differences
Confounding by population structure
[Figure: genotype counts (aa, Aa, AA) in cases and controls, shown for Subpopulation A, Subpopulation B and the two combined]
Within each subpopulation: χ² = 2.1 (p = 0.34) and χ² = 1.57 (p = 0.46)
Combined sample: χ² = 16.3 (p < 0.001)
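The effect can be reproduced with made-up numbers. In the sketch below the two subpopulations differ in both allele frequency and the proportion of cases, but within each one cases and controls have identical genotype proportions. All counts are hypothetical (they are not the data behind the figure), and the p-value shortcut `exp(-chi2 / 2)` is valid only for 2 degrees of freedom, i.e. a 2 x 3 genotype table.

```python
import math


def chi2_genotype_test(table):
    """Pearson chi-squared test on a 2 x 3 table (cases/controls by aa, Aa, AA).

    Returns (statistic, p_value). With 2 degrees of freedom the upper
    tail probability is exactly exp(-statistic / 2).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)


# Rows: cases, controls. Columns: aa, Aa, AA. Hypothetical counts.
subpop_a = [[16, 128, 256], [4, 32, 64]]    # allele A common, many cases
subpop_b = [[64, 32, 4], [256, 128, 16]]    # allele A rare, few cases
pooled = [
    [a + b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(subpop_a, subpop_b)
]

stat_a, p_a = chi2_genotype_test(subpop_a)
stat_b, p_b = chi2_genotype_test(subpop_b)
stat_pooled, p_pooled = chi2_genotype_test(pooled)
print(stat_a, stat_b, round(stat_pooled, 1))
```

Neither subpopulation shows any association, yet the pooled table gives a very large statistic. The signal comes entirely from the mix of subpopulations differing between cases and controls.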
SNP genotyping
[Figure: scatter plot of intensity of probe A against intensity of probe B]
• SNP genotyping is achieved by measuring the evidence for the presence of the two alleles at each SNP in each individual independently
• Genotypes are then obtained by “clustering” the data
• This is hard!
Differences in genotype calling
Cases
Controls
The experimental process is not perfect and
slight differences can lead to apparent allele
frequency differences
An embarrassing example
Plausible hypothesis, big study, genome-wide markers, very small P-value (< 1 x 10^-10).
In a respected journal (Science)...
But not real, and now retracted.
Why – because of genotyping errors!
A quick example to demonstrate some
of the analytical and statistical
challenges…