Population genetics and whole-genome disease association studies
Alkes L. Price
Harvard Medical School &
Broad Institute of MIT and Harvard
April 5, 2007
Outline
1. Introduction to population genetics
2. Whole-genome association studies (WGAS)
3. Applications of population genetics to WGAS:
i. Linkage disequilibrium and haplotypes
ii. Population stratification
iii. Admixture association signals
What is population genetics?
The study of how genetic variation is distributed
within and across populations.
Are different human populations
actually genetically different?
Slightly.
5-7% of worldwide human genetic variation is due to
genetic differences between human populations.
The remaining 93-95% of human genetic variation is due to
differences within human populations
(Rosenberg et al. 2002: Science 298, 2381-5).
What about hair / skin / eye color?
Exceptions due to natural selection.
Why care about population differences?
• Use genetic data to decipher ancient history.
• Relevance to disease association studies.
International HapMap Project
HapMap genotyped 270 samples:
• Utah samples of N. European ancestry (CEU)
• Han Chinese (CHB)
• Japanese (JPT)
• Yoruban samples from Nigeria (YRI)
International HapMap Project
HapMap genotyped CEU, CHB, JPT and YRI samples at
3.8 million single nucleotide polymorphisms (SNPs)
(HapMap 2005: Nature 437, 1299-1320).
How to quantify genetic differences between populations?
Define the F_ST between two populations to be the
proportion of overall variation attributable to differences
between populations (Cavalli-Sforza et al. 1994:
The History and Geography of Human Genes).
It follows that the difference in frequency between the
two populations at a SNP with overall frequency p has
variance 2 F_ST p(1-p).
International HapMap Project: F_ST values
        CEU     CHB     JPT     YRI
CEU             0.11    0.11    0.16
CHB                     0.007   0.19
JPT                             0.19
e.g. p_CHB = 50%, p_YRI = 77%
     p_CHB = 50%, p_JPT = 56%
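As a rough check on the example frequencies, the variance formula 2 F_ST p(1-p) can be evaluated directly. This is a minimal sketch: the function name `freq_diff_sd` is hypothetical, and p is assumed to be the simple average of the two population frequencies.

```python
import math

def freq_diff_sd(fst, p):
    """Standard deviation of the allele-frequency difference between two
    populations, using variance = 2 * F_ST * p * (1 - p)."""
    return math.sqrt(2.0 * fst * p * (1.0 - p))

# CHB vs YRI: F_ST = 0.19, frequencies 50% and 77% (assumed average p = 0.635)
sd_chb_yri = freq_diff_sd(0.19, (0.50 + 0.77) / 2)
# CHB vs JPT: F_ST = 0.007, frequencies 50% and 56% (assumed average p = 0.53)
sd_chb_jpt = freq_diff_sd(0.007, (0.50 + 0.56) / 2)

print(f"CHB-YRI: sd = {sd_chb_yri:.3f}, example difference = 0.27")
print(f"CHB-JPT: sd = {sd_chb_jpt:.3f}, example difference = 0.06")
```

Under these assumptions, each example difference is roughly one standard deviation, i.e. a typical SNP rather than an extreme one.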
PCA results on HapMap data
Discrete clusters or continuous axes?
“We identified six main genetic clusters of human
populations” (Rosenberg et al. 2002: Science 298,
2381-5).
“Gradual variation, rather than major genetic
discontinuities or ‘races’, is typical of global
human genetic diversity” (Serre and Paabo 2004:
Genome Res 14, 1679-85).
Also see Rosenberg et al. 2005: PLoS Genet 1, e70
Outline
1. Introduction to population genetics
2. Whole-genome association studies (WGAS)
3. Applications of population genetics to WGAS:
i. Linkage disequilibrium and haplotypes
ii. Population stratification
iii. Admixture association signals
Whole-genome association studies (WGAS)
Step 1. Obtain DNA samples from
1000 individuals with a specific disease (Cases)
1000 healthy individuals (Controls)
Step 2. Genotype 1000 Cases and 1000 Controls
at 100,000 – 500,000 SNPs
Step 3. Look for a SNP with significantly
different frequency in Cases vs. Controls.
(Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).
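Step 3 can be sketched as a chi-square test on allele counts at a single SNP, followed by a Bonferroni threshold for the number of SNPs tested. This is an illustrative sketch only: the allele counts are hypothetical, the simple allele-count test is an assumption (other tests, such as the Armitage trend test, are also used), and Bonferroni is one conservative choice of correction.

```python
import math

def allelic_chi2(case_alt, case_total, ctrl_alt, ctrl_total):
    """1-df chi-square statistic for a 2x2 table of allele counts
    (alternate vs. reference allele, cases vs. controls)."""
    a, b = case_alt, case_total - case_alt
    c, d = ctrl_alt, ctrl_total - ctrl_alt
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical SNP: 1000 cases and 1000 controls give 2000 alleles per group.
stat = allelic_chi2(case_alt=700, case_total=2000, ctrl_alt=600, ctrl_total=2000)
p = chi2_pvalue_1df(stat)

n_snps = 500_000                      # upper end of the range in Step 2
bonferroni_threshold = 0.05 / n_snps  # per-SNP threshold after correction

print(f"chi2 = {stat:.2f}, p = {p:.2e}")
print("passes genome-wide threshold:", p < bonferroni_threshold)
```

In this made-up example a 35% vs. 30% frequency difference is nominally significant but does not pass the genome-wide threshold, which illustrates how the large number of hypotheses tested reduces power.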
Common Disease/Common Variants hypothesis
The Common Disease/Common Variants hypothesis
suggests that genetic risk for common diseases arises
from a large number (e.g. up to 10 or more) of
common variants (e.g. SNPs with frequency 10-90%)
which each confer modest disease risk (e.g. 1.5x
larger risk of disease per copy of unfavorable allele)
(Reich & Lander 2001: Trends Genet 17, 502-10).
WGAS are aimed at detecting common variants
(Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).
Successes of WGAS
WGAS have identified risk variants for:
• Age-related Macular Degeneration
(Klein et al. 2005: Science 308, 385-9)
• Obesity
(Herbert et al. 2006: Science 312, 279-83)
• Inflammatory Bowel Disease
(Duerr et al. 2006: Science 314, 1461-3)
• Type 2 diabetes
(Sladek et al. 2007: Nature 445, 881-5)
Advantages/disadvantages of WGAS
Advantages:
• Effective for common variants of modest risk
• No prior knowledge of disease pathways required
• Fine localization of disease variant
Disadvantages:
• Large number of hypotheses tested reduces power
• High cost
Cost of WGAS
Affymetrix 500K and Illumina 300K technologies:
genotype hundreds of thousands of SNPs at a cost of
about $500 per sample.
Thus, a WGAS with 1000 Cases and 1000 Controls will
incur about $1 million in genotyping costs.
Whole-genome association studies (WGAS)
Step 1. Obtain DNA samples from
1000 individuals with a specific disease (Cases)
1000 healthy individuals (Controls)
Step 2. Genotype 1000 Cases and 1000 Controls
at 100,000 – 500,000 of 10 million SNPs total
Step 3. Look for a SNP with significantly
different frequency in Cases vs. Controls.
(Hirschhorn & Daly 2005: Nat Rev Genet 6, 95-108).
Outline
1. Introduction to population genetics
2. Whole-genome association studies (WGAS)
3. Applications of population genetics to WGAS:
i. Linkage disequilibrium and haplotypes
ii. Population stratification
iii. Admixture association signals
LD and haplotypes: Recombination
[Figure labels: Mother, Father, Child]
[Figure labels: Population at time 0; Many generations later]
Linkage disequilibrium and haplotypes
SNP #1: A or G
SNP #2: C or G
SNP #1 and SNP #2 are perfect proxies (perfect LD).
The r^2 between SNP #1 and SNP #2 is 100%.
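The r^2 between two SNPs can be computed from haplotype counts as D^2 / (p1 (1-p1) p2 (1-p2)), where D is the joint allele frequency minus the product of the marginal frequencies. The sketch below is illustrative: the function name and the haplotype samples are hypothetical, and it assumes that A at SNP #1 travels with C at SNP #2.

```python
def r_squared(haplotypes, allele1, allele2):
    """Squared correlation (r^2) between two SNPs.

    haplotypes: list of (snp1_allele, snp2_allele) pairs, one per chromosome.
    allele1, allele2: the allele at each SNP that is coded as 1.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n            # freq at SNP #1
    p2 = sum(h[1] == allele2 for h in haplotypes) / n            # freq at SNP #2
    p12 = sum(h == (allele1, allele2) for h in haplotypes) / n   # joint freq
    d = p12 - p1 * p2                                            # LD coefficient D
    return d * d / (p1 * (1 - p1) * p2 * (1 - p2))

# Hypothetical sample: every chromosome carries either A-C or G-G.
perfect = [("A", "C")] * 6 + [("G", "G")] * 4
# Same sample with two recombinant chromosomes (A-G and G-C) added.
imperfect = perfect + [("A", "G"), ("G", "C")]

r2_perfect = r_squared(perfect, "A", "C")
r2_imperfect = r_squared(imperfect, "A", "C")
print(f"perfect LD:   r^2 = {r2_perfect:.2f}")
print(f"imperfect LD: r^2 = {r2_imperfect:.2f}")
```

Two recombinant chromosomes out of twelve are enough to pull r^2 well below the 0.8 level in this toy sample.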
Linkage disequilibrium and haplotypes
More generally, SNP #1 might be an imperfect proxy
(imperfect LD) for all SNPs within 10,000 bp.
WGAS: choose a subset of 100,000-500,000 tag SNPs so that
all SNPs are in strong LD (r^2 > 0.8) with a tag SNP.
Haplotype association mapping: don’t need causal SNP.
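One way to picture tag SNP selection is as a greedy set-cover problem: repeatedly pick the SNP that captures (at r^2 >= 0.8) the most SNPs not yet captured. This is a hedged sketch, not the procedure used to design any particular genotyping array; the function names, SNP names and allele vectors are all hypothetical.

```python
def r_squared(x, y):
    """Squared correlation between two 0/1 allele vectors (one entry per chromosome)."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    return 0.0 if denom == 0 else (pxy - px * py) ** 2 / denom

def greedy_tags(snps, threshold=0.8):
    """Pick tag SNPs until every SNP has r^2 >= threshold with some tag.

    snps: dict mapping SNP name -> 0/1 allele vector.
    Greedy heuristic; the result is not guaranteed to be the smallest tag set.
    """
    # Each SNP captures itself plus every SNP in strong LD with it.
    captures = {
        s: {t for t in snps if t == s or r_squared(snps[s], snps[t]) >= threshold}
        for s in snps
    }
    uncovered, tags = set(snps), []
    while uncovered:
        # Ties are broken by name order so the result is deterministic.
        best = max(sorted(snps), key=lambda s: len(captures[s] & uncovered))
        tags.append(best)
        uncovered -= captures[best]
    return tags

# Hypothetical data: snp1, snp2 and snp3 are perfect proxies for one another.
snps = {
    "snp1": [1, 1, 1, 0, 0, 0, 0, 0],
    "snp2": [1, 1, 1, 0, 0, 0, 0, 0],
    "snp3": [0, 0, 0, 1, 1, 1, 1, 1],
    "snp4": [1, 0, 1, 0, 1, 0, 1, 0],
    "snp5": [1, 0, 1, 0, 1, 0, 1, 1],
}
tags = greedy_tags(snps)
print(tags)
```

Here three tags cover five SNPs: one tag stands in for the block of three perfect proxies, while snp4 and snp5 (r^2 = 0.6 with each other) each need their own tag.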
Affymetrix 500K and Illumina 300K
Proportion of HapMap SNPs which are well tagged
(r^2 > 0.8) by at least one of the tag SNPs in
Affymetrix 500K or Illumina 300K, respectively:
              CEU     CHB+JPT   YRI
Affy 500K     65%     66%       41%
Illum 300K    75%     63%       28%
(Barrett & Cardon 2006: Nat Genet 38, 659-62)
Population differences in extent of LD
Population        Extent of LD    Population history
West African      10,000 bp       no bottleneck
European          50,000 bp       out of Africa 50kya
East Asian        50,000 bp       out of Africa 50kya
Native American   >100,000 bp     Bering strait 15kya
Kosrae            >>100,000 bp    island settled 2kya
Reich et al. 2001: Nature 411, 199-204
Conrad et al. 2006: Nat Genet 38, 1251-60
Bonnen et al. 2006: Nat Genet 38, 214-7
also see Service et al. 2006: Nat Genet 38, 556-560
Future challenges
SNPs that are not in strong LD (r^2 > 0.8) with any of
the tag SNPs in Affymetrix 500K (or Illumina 300K)
may still be well-captured using pairs of tag SNPs, or
more generally, sets of n tag SNPs for some value of n
(de Bakker et al. 2005: Nat Genet 37, 1217-23)
(also see Zaitlen et al. 2007: Am J Hum Genet 80, 683-91).
• However, increased number of hypotheses tested may
reduce power rather than increasing power
(Pe’er et al. 2006: Nat Genet 38, 663-7).
• Related approach: impute all HapMap SNPs and
then carry out WGAS using those imputed SNPs.
Outline
1. Introduction to population genetics
2. An unsolved problem in population genetics
3. Whole-genome association studies (WGAS)
4. Applications of population genetics to WGAS:
i. Linkage disequilibrium and haplotypes
ii. Population stratification
iii. Admixture association signals
Whole-genome association studies
[Diagram labels: Phenotype (case / control); SNP (T / C); Ancestry (N. Europe / S. Europe); ???]
Stratification:
spurious associations due to ancestry differences
between cases and controls.
Height association study in European Americans
[Diagram labels: Phenotype (tall / short); Lactase SNP on chr 2 (T / C); Ancestry (N. Europe / S. Europe); stratification]
Population stratification
Spurious association due to stratification!
(Campbell et al. 2005: Nat Genet 37, 868-72)
EIGENSTRAT:
use PCA to correct for stratification
1. Apply principal components analysis to infer
continuous axes of genetic variation.
2. For each inferred axis, subtract from each genotype and
each phenotype an amount attributable to ancestry along that axis.
3. Evaluate association between ancestry-adjusted
genotypes and ancestry-adjusted phenotypes, using the
Armitage trend test.
Cavalli-Sforza et al. 1994 book
Cavalli-Sforza et al. 1993: Science 259, 630-46
Patterson et al. 2006: PLoS Genet 2, e190
Price et al. 2006: Nat Genet 38, 904-9
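The three steps can be sketched on toy data with a single axis of variation. This is a simplified illustration and not the EIGENSTRAT program itself: all function names and data are hypothetical, only the top axis is found (by power iteration), per-SNP normalization is omitted, and the statistic is assumed to take the form (N - K - 1) times the squared correlation, compared to a chi-square distribution with 1 degree of freedom.

```python
import math
import random

def centered(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_axis(snp_rows, iters=100, seed=0):
    """Step 1: top principal axis (one coordinate per sample), by power iteration.
    snp_rows: list of SNPs, each a list of genotypes (0/1/2), one per sample."""
    rows = [centered(r) for r in snp_rows]
    rng = random.Random(seed)
    axis = [rng.gauss(0, 1) for _ in rows[0]]
    for _ in range(iters):
        # Multiply by the sample-by-sample covariance matrix (up to scale).
        new = [0.0] * len(axis)
        for r in rows:
            w = dot(r, axis)
            new = [n + w * x for n, x in zip(new, r)]
        norm = math.sqrt(dot(new, new))
        axis = [x / norm for x in new]
    return axis

def adjust(values, axis):
    """Step 2: subtract the part of `values` attributable to ancestry along `axis`."""
    v = centered(values)
    gamma = dot(v, axis) / dot(axis, axis)
    return [x - gamma * a for x, a in zip(v, axis)]

def trend_stat(geno, pheno, n_axes):
    """Step 3: (N - K - 1) * squared correlation (assumed form of the statistic)."""
    g, y = centered(geno), centered(pheno)
    gg, yy = dot(g, g), dot(y, y)
    if gg < 1e-12 or yy < 1e-12:
        return 0.0
    return (len(g) - n_axes - 1) * dot(g, y) ** 2 / (gg * yy)

# Hypothetical toy data: samples 0-5 from population A, samples 6-11 from B.
ancestry_snps = [[2] * 6 + [0] * 6, [0] * 6 + [2] * 6, [2] * 6 + [1] * 6]
geno = [2, 2, 1, 1, 2, 1, 1, 0, 1, 1, 0, 0]    # candidate SNP, allele commoner in A
pheno = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0]   # 1 = case; cases commoner in A

axis = top_axis(ancestry_snps)
naive = trend_stat(geno, pheno, n_axes=0)
corrected = trend_stat(adjust(geno, axis), adjust(pheno, axis), n_axes=1)
print(f"uncorrected statistic: {naive:.3f}")
print(f"ancestry-adjusted statistic: {corrected:.3f}")
```

In this toy data the candidate SNP is unrelated to the phenotype within each population, so the uncorrected statistic is non-zero only because of ancestry and the adjusted statistic drops to essentially zero.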
Toy Example
Example of axis of variation
[Figure legend: +, 0, -]
Cavalli-Sforza et al. 1994 book
Cavalli-Sforza et al. 1993: Science 259, 630-46
European American population structure:
What’s inside the melting pot?
???
European American data set
Brigham Rheumatoid Arthritis Sequential Study (BRASS):
488 European American samples with rheumatoid arthritis,
genotyped on a 100K Affy chip (116,204 SNPs).
Results: top two axes of variation
[Figure labels: SE Europe, NW Europe]
Lactase persistence association study
[Diagram labels: Lactase Persistent? (Yes / No); SNP (T / C); ??? (N. Europe / S. Europe); stratification]
Lactase persistence is inferred from the LCT gene on chr 2
(known to perfectly predict lactase persistence; Enattah et al. 2002).
Many associated SNPs near LCT gene on chr 2.
Lactase persistence association study: rs10511418
[Diagram labels: Persistent? (Yes / No); SNP on chr 3 (G / A); Ancestry (N. Europe / S. Europe); stratification]
P-value = 0.0000002 (after correcting for 116,204 hypotheses tested) ?!?
Spurious association due to stratification!
Lactase persistence association study: correcting for stratification
Correcting for stratification (and for 116,204 hypotheses tested):
Genomic Control P-value = 0.0023
EIGENSTRAT P-value = 1.0000
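For comparison, genomic control in its usual form divides every association statistic by one genome-wide inflation factor (lambda), estimated from the median statistic. The sketch below assumes that form and uses hypothetical statistics; it is not the calculation behind the P-values quoted above.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df

def genomic_control(stats):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square stats."""
    lam = max(1.0, statistics.median(stats) / CHI2_1DF_MEDIAN)
    return lam, [s / lam for s in stats]

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical genome-wide statistics, inflated by stratification.
raw = [0.2, 0.5, 0.91, 1.3, 2.0, 30.0]
lam, adjusted = genomic_control(raw)

print(f"lambda = {lam:.2f}")
print(f"top SNP: p = {chi2_pvalue_1df(raw[-1]):.1e} before, "
      f"{chi2_pvalue_1df(adjusted[-1]):.1e} after")
```

Because the same factor is applied to every SNP, a SNP whose frequency differs unusually strongly across ancestries can remain significant after this correction, whereas a per-SNP ancestry adjustment can remove the signal entirely.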
Future challenges
Given genetic data (e.g. SNP data) from a set of
samples of unknown ancestry: what is the best way to
describe the “population structure” in the data – i.e.
departures from the panmictic model of a single
randomly mating population?
• Principal Components Analysis
• STRUCTURE model-based clustering program
Pritchard et al. 2000: Genetics 155, 945-59
Falush et al. 2003: Genetics 164, 1567-86
These methods both fail on HapMap data.
PCA results on HapMap data
The problem: none of the principal components
are able to distinguish CHB from JPT – even if
looking at lower principal components.
PCA results on CHB and JPT only
The problem: discernment between
CHB and JPT requires analyzing
CHB+JPT populations separately.
But what if population structure is continuous?
Outline
1. Introduction to population genetics
2. An unsolved problem in population genetics
3. Whole-genome association studies (WGAS)
4. Applications of population genetics to WGAS:
i. Linkage disequilibrium and haplotypes
ii. Population stratification
iii. Admixture association signals
Admixture association references
Methodology and ANCESTRYMAP program:
Patterson et al. 2004: Am J Hum Genet 74, 979-1000
Admixture mapping in African Americans:
Smith et al. 2004: Am J Hum Genet 74, 1001-13
Successes of admixture mapping in African Americans:
Reich et al. 2005: Nat Genet 37, 1113-8
Freedman et al. 2006: PNAS 103, 14068-73
Admixture mapping in Latino populations:
Price et al. 2007: Am J Hum Genet, in press
Latino admixture creates a mosaic
[Figure: European chromosomes and Native American chromosomes shown over successive generations (4 generations ago, 3 generations ago, 2 generations ago, 1 generation ago, Today), ending in European + Native American chromosomes]
How does Latino admixture mapping work?
[Figure labels: Cases with disease; European chromosome; Native American chromosome; Disease locus]
The Signal of Latino Admixture Association
[Plot: vertical axis from 0% to 100%; horizontal axis: Position on chromosome (cM), 20 to 140]
Admixture association: future challenges
How to best integrate haplotype association and
admixture association signals in a WGAS of an
admixed population?
Acknowledgements
Nick Patterson, Robert Plenge, Michael Weinblatt, Nancy Shadick,
Fuli Yu, David Cox, Alicja Waliszewska, Gavin McDonald, Arti
Tandon, Christine Schirmer, Julie Neubauer, Gabriel Bedoya,
Constanza Duque, Alberto Villegas, Maria Catira Bortolini, Francisco
Salzano, Carla Gallo, Guido Mazzotti, Marcela Tello-Ruiz, Laura
Riba, Carlos Aguilar-Salinas, Samuel Canizales-Quinteros, Marta
Menjivar, William Klitz, Brian Henderson, Chris Haiman, Cheryl
Winkler, Teresa Tusie-Luna, Andres Ruiz-Linares, and
David Reich