Lecture: Genome-Wide Association Studies (GWAS)
Genome-Wide Association
Studies (GWAS)
Epidemiology 243
Molecular Epidemiology of Cancer
Spring 2008
Association Studies of Genetic
Factors
1st generation
Very small studies (<100 cases)
Usually not epidemiologic study design; 1-2 SNPs
2nd generation
Small studies (100-500 cases)
More epi focus; a few SNPs
3rd generation
Large molecular epi studies (>500 cases)
Proper epi design; pathways
4th generation
Consortium-based pooled analyses (>2000 cases)
GxE analyses
5th generation
Post-GWS studies
Boffetta, 2007
International Lung Cancer Consortium (ILCCO)
Wichmann
McLaughlin
Schwartz
Wild
Boffetta
Harris
Goodman
Risch
Kiyohara
Brennan
Benhamou
Wiencke
Christiani
Zhang
Stucker
Yang
Tajima
Landi
Berwick
Hong
Vineis
Lan
Chen
Lazarus
Spitz
Thun
Le Marchand
3 cohort studies
17 population based case-control studies
13 hospital based case-control studies
2 studies with mixed controls
1 cross-sectional study
Issues in genetic association studies
Many genes
  ~25,000 genes, many can be candidates
Many SNPs
  ~12,000,000 SNPs, ability to predict functional SNPs is limited
Methods to select SNPs:
Only functional SNPs in a candidate gene
Systematic screen of SNPs in a candidate gene
Systematic screen of SNPs in an entire pathway
Genomewide screen
Systematic screen for all coding changes
Introduction
A genome-wide association study is an approach that
involves rapidly scanning markers across the
complete sets of DNA, or genomes, of many people
to find genetic variations associated with a particular
disease.
Once new genetic associations are identified,
researchers can use the information to develop better
strategies to detect, treat and prevent the disease.
Such studies are particularly useful in finding genetic
variations that contribute to common, complex
diseases, such as asthma, cancer, diabetes, heart
disease and mental illnesses.
http://www.genome.gov/20019523
Definition of GWAS
A genome-wide association study is
defined as any study of genetic
variation across the entire human
genome that is designed to identify
genetic associations with observable
traits (such as blood pressure or
weight), or the presence or absence of
a disease (such as cancer) or condition.
Potential of GWAS
Whole genome information, when combined with
epidemiological, clinical and other phenotype data,
offers the potential for increased understanding of
basic biological processes affecting human health,
improvement in the prediction of disease and patient
care, and ultimately the realization of the promise of
personalized medicine.
In addition, rapid advances in understanding the
patterns of human genetic variation and maturing
high-throughput, cost-effective methods for
genotyping are providing powerful research tools for
identifying genetic variants that contribute to health
and disease.
Potential of GWAS
Selection of SNPs
(Genome-wide association studies)
Molecular
  Higher requirements: Affymetrix and Illumina
Analytical
  Highest requirements: Data management, automation
Advantages
  No biological assumptions and can identify novel genes/pathways
  Excellent chance to identify risk alleles
  Utility in individual risk assessment
Disadvantages
  High costs
  Concern of multiple tests
SNP Selection
Affymetrix® Genome-Wide
Human SNP Array
The new Affymetrix® Genome-Wide Human
SNP Array 6.0 features 1.8 million genetic
markers, including more than 906,600 single
nucleotide polymorphisms (SNPs) and more
than 946,000 probes for the detection of copy
number variation. The SNP Array 6.0
represents more genetic variation on a single
array than any other product, providing
maximum panel power and the highest
physical coverage of the genome.
The need for GWA
Current understanding of disease etiology is limited
  Therefore, candidate genes or pathways are insufficient
Current understanding of functional variants is limited
  Therefore, focusing on nonsynonymous changes is not sufficient
Results from linkage studies are often inconsistent and broad
  Therefore, the utility of identified linkage regions is limited
GWA studies offer an effective and objective approach
  Better chance to identify disease-associated variants
  Improve understanding of disease etiology
  Improve ability to test gene-gene interaction and predict disease risk
Xu JF, 2007
GWA is promising
Many diseases and traits are influenced by genetic factors
  i.e., they are caused by sequence variants in the genome
Over 12 million SNPs are known in the genome
  i.e., some SNPs will be directly or indirectly associated with causal variants
The cost of SNP genotyping is reduced
  i.e., it is affordable to genotype a large number of SNPs in the genome
Large numbers of cases and controls are available
  i.e., there is statistical power to detect variants with modest effect
When the above conditions are met…
…associated SNPs will have different frequencies between cases and controls
GWA is challenging
Many diseases and traits are influenced by genetic factors
  But probably due to multiple modest risk variants
  They confer a stronger risk when they interact
  Truly associated SNPs are not necessarily highly significant
Too many SNPs are evaluated
  False positives due to multiple tests
Single studies tend to be underpowered
  False negatives
Considerable heterogeneity among studies
  Phenotypic and genetic heterogeneity
  False positives due to population stratification
Xu, 2007
Genome coverage
Two major platforms for GWA
  Illumina: HumanHap300, HumanHap550, and HumanHap1M
  Affymetrix: GeneChip 100K, 500K, 1M, and 2.3M
Genome-wide coverage
  The percentage of known SNPs in the genome that are in LD with the genotyped SNPs
  Calculated based on HapMap
  Calculated based on ENCODE
Xu, 2007
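The coverage definition above (share of known SNPs in LD with at least one genotyped SNP) can be illustrated with a small sketch. It is only an illustration under stated assumptions: LD is measured as pairwise r² computed from phased 0/1 haplotypes, the r² cutoff of 0.8 is an assumed convention that the slides do not specify, and the function names `r_squared` and `coverage` are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic SNPs from phased 0/1 haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:  # a monomorphic SNP carries no LD information
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


def coverage(known, genotyped_ids, r2_threshold=0.8):
    """Fraction of known SNPs tagged by at least one genotyped SNP.

    known: dict of SNP id -> list of 0/1 haplotype alleles (reference panel).
    genotyped_ids: ids of the SNPs present on the array.
    r2_threshold: assumed cutoff; not taken from the slides.
    A genotyped SNP counts as covered because its r^2 with itself is 1.
    """
    covered = 0
    for hap in known.values():
        best = max(r_squared(hap, known[g]) for g in genotyped_ids)
        if best >= r2_threshold:
            covered += 1
    return covered / len(known)


# Toy reference panel of 8 haplotypes; only SNP "A" is on the array.
panel = {
    "A": [0, 0, 1, 1, 0, 1, 0, 1],
    "B": [0, 0, 1, 1, 0, 1, 0, 1],  # perfect LD with A
    "C": [0, 1, 0, 1, 0, 1, 0, 1],  # weak LD with A
}
print(coverage(panel, ["A"]))
```

In practice the reference panel would be HapMap or ENCODE haplotypes, as the slide notes, rather than a toy dictionary.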
Strategies for pre-association
analysis
Quality control
Filter SNPs by genotype call rates
Filter SNPs by minor allele frequencies
Filter SNPs by testing for Hardy-Weinberg
Equilibrium
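The three quality-control filters above can be sketched as follows. This is a minimal illustration, not the procedure used in any particular study: genotypes are assumed to be coded 0/1/2 with None for a missing call, the thresholds (call rate 0.95, MAF 0.01, HWE p-value 1e-4) are assumed defaults, and the function names are hypothetical. In case-control data the Hardy-Weinberg test is often applied to controls only.

```python
import math


def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of one allele."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_pvalue(genotypes):
    """1-df chi-square test of Hardy-Weinberg proportions."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df


def passes_qc(genotypes, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the three filters in order; thresholds are assumptions."""
    if call_rate(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(genotypes) < min_maf:
        return False
    return hwe_pvalue(genotypes) >= min_hwe_p


in_hwe = [0] * 25 + [1] * 50 + [2] * 25      # exact HWE proportions
all_het = [1] * 100                          # gross HWE violation
poorly_called = [None] * 10 + [0] * 40 + [1] * 40 + [2] * 10
print(passes_qc(in_hwe), passes_qc(all_het), passes_qc(poorly_called))
```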
Data Analysis
Single SNP analysis using prespecified genetic models
2 x 3 table (2-df)
Additive model (1-df), and test for additivity
All possible genetic models (recessive,
dominant)
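The single-SNP tests listed above can be sketched on genotype counts. This is an illustration under assumptions: the additive test is implemented as a Cochran-Armitage trend test with weights 0, 1, 2, the dominant and recessive models are obtained by changing the weights, and departure from additivity is taken as the difference between the 2-df and 1-df statistics. The slides do not say which implementation was intended, the function names are hypothetical, and the counts are made up.

```python
import math


def chi2_sf(x, df):
    """Chi-square upper-tail probability; closed forms for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch supports df = 1 or 2 only")


def genotype_test(cases, controls):
    """Pearson chi-square on the 2 x 3 genotype table (2 df)."""
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    chi2 = 0.0
    for r, s in zip(cases, controls):
        col = r + s
        if col == 0:
            continue
        e_case = col * n_cases / total
        e_ctrl = col * n_controls / total
        chi2 += (r - e_case) ** 2 / e_case + (s - e_ctrl) ** 2 / e_ctrl
    return chi2, chi2_sf(chi2, 2)


def trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test (1 df).

    weights (0, 1, 2) = additive, (0, 1, 1) = dominant, (0, 0, 1) = recessive,
    all with respect to the allele counted in the genotype coding.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    cols = [r + s for r, s in zip(cases, controls)]
    t = sum(w * (r * n_controls - s * n_cases)
            for w, r, s in zip(weights, cases, controls))
    sum_w2 = sum(w * w * c * (total - c) for w, c in zip(weights, cols))
    cross = sum(weights[i] * weights[j] * cols[i] * cols[j]
                for i in range(3) for j in range(i + 1, 3))
    variance = n_cases * n_controls / total * (sum_w2 - 2 * cross)
    if variance == 0:
        return 0.0, 1.0
    chi2 = t * t / variance
    return chi2, chi2_sf(chi2, 1)


def additivity_test(cases, controls):
    """Departure from additivity: 2-df statistic minus trend statistic (1 df)."""
    chi2 = max(0.0, genotype_test(cases, controls)[0]
               - trend_test(cases, controls)[0])
    return chi2, chi2_sf(chi2, 1)


# Made-up genotype counts (0, 1, 2 copies of the risk allele).
case_counts, control_counts = [30, 50, 20], [50, 40, 10]
for name, result in [
    ("2 x 3 table", genotype_test(case_counts, control_counts)),
    ("additive", trend_test(case_counts, control_counts)),
    ("dominant", trend_test(case_counts, control_counts, (0, 1, 1))),
    ("recessive", trend_test(case_counts, control_counts, (0, 0, 1))),
    ("non-additivity", additivity_test(case_counts, control_counts)),
]:
    print(f"{name}: chi2 = {result[0]:.3f}, p = {result[1]:.4f}")
```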
Data Analysis
Haplotype analysis
Gene-gene and gene-environment
interactions
Interaction with main effect
Logistic regression
Interaction without main effect: data mining
Classification and regression trees (CART)
Multifactor Dimensionality Reduction (MDR)
Sample size needs as a function of
genotype prevalence and OR for
main effects
Boffetta, 2007
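How sample size depends on genotype prevalence and odds ratio can be sketched with a standard two-proportion formula for an unmatched case-control study with one control per case. This is an assumption for illustration only: it is not necessarily the calculation behind the cited figure, and the function name `cases_needed` and its defaults (two-sided alpha 0.05, power 0.80) are hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(prevalence, odds_ratio, alpha=0.05, power=0.80):
    """Approximate number of cases (with an equal number of controls).

    prevalence: frequency of the at-risk genotype among controls.
    odds_ratio: main-effect odds ratio to be detected.
    Uses the normal-approximation formula without continuity correction.
    """
    p0 = prevalence
    # Genotype frequency among cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)


for prev in (0.01, 0.10, 0.30):
    for oratio in (1.2, 1.5, 2.0):
        print(prev, oratio, cases_needed(prev, oratio))
```

Rerunning the loop with a genome-wide threshold (for example `alpha=1e-7`) shows how much larger a study has to be once multiple testing is taken into account.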
False Positives
False positives: too many dependent tests
Adjust for number of tests
Bonferroni correction
Nominal significance level = study-wide significance / number of tests
Nominal significance level = 0.05/500,000 = 10^-7
Effective number of tests
Take LD into account
Permutation procedure
Permute case-control status
Mimic the actual analyses
Obtain empirical distribution of maximum test statistic under null hypothesis
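The Bonferroni threshold and the permutation procedure above can be sketched as follows. It is a minimal illustration under assumptions: the per-SNP statistic is a 1-df allelic chi-square (the slides do not fix the test), the data are toy data, and the function names are hypothetical. Each permutation shuffles case-control status, repeats the same analysis on every SNP, and records the maximum statistic, which gives the empirical null distribution used for the adjusted p-values.

```python
import random


def bonferroni_threshold(study_wide_alpha, n_tests):
    """Nominal per-test significance level."""
    return study_wide_alpha / n_tests


def allelic_chi2(genotypes, status):
    """1-df chi-square comparing allele counts in cases (1) vs controls (0)."""
    a = sum(g for g, s in zip(genotypes, status) if s == 1)  # alleles in cases
    b = sum(genotypes) - a                                   # alleles in controls
    n_case = 2 * sum(status)
    n_ctrl = 2 * len(status) - n_case
    total = n_case + n_ctrl
    alt = a + b
    ref = total - alt
    if alt == 0 or ref == 0:
        return 0.0
    cross = a * (n_ctrl - b) - b * (n_case - a)
    return total * cross ** 2 / (n_case * n_ctrl * alt * ref)


def max_stat_adjusted_pvalues(snps, status, n_perm=200, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum."""
    rng = random.Random(seed)
    observed = [allelic_chi2(g, status) for g in snps]
    shuffled = list(status)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute case-control status
        null_max.append(max(allelic_chi2(g, shuffled) for g in snps))
    return [(1 + sum(m >= obs for m in null_max)) / (n_perm + 1)
            for obs in observed]


status = [1] * 20 + [0] * 20
snps = [
    [2] * 20 + [0] * 20,  # toy SNP that differs completely between groups
    [0, 1] * 20,          # toy SNP with identical frequencies in both groups
]
print(bonferroni_threshold(0.05, 500_000))
print(max_stat_adjusted_pvalues(snps, status))
```

Because SNPs in LD are shuffled together, the permutation null reflects the dependence among tests, which a plain Bonferroni correction ignores.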
False Positives
False discovery rate (FDR)
Expected proportion of false discoveries among
all discoveries
Offers more power than Bonferroni
Holds under weak dependence of the tests
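One common way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not name a procedure, so its use here is an assumption, and the function name and the p-values in the example are illustrative only.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the test is declared a discovery.

    Step-up rule: find the largest rank k with p_(k) <= k * q / m and
    reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fdr_hits = sum(benjamini_hochberg(pvals, q=0.05))
bonferroni_hits = sum(p <= 0.05 / len(pvals) for p in pvals)
print(fdr_hits, bonferroni_hits)
```

On these made-up p-values the FDR rule declares more discoveries than Bonferroni at the same nominal level, which is the gain in power the slide refers to.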
False Positives
Bayesian approach
Taking the prior probability into account: False-Positive
Report Probability (FPRP)
Confirmation in independent
study populations
The approach may limit the number of false
positives
Confirmation is needed to dissect true from false
positives
Replication, examine the results from the 2nd stage only
Joint analysis, combining data from 1st stage with 2nd stage
Multiple stages
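The False-Positive Report Probability listed under the Bayesian approach above can be sketched numerically. The formula used here, FPRP = alpha(1 - prior) / (alpha(1 - prior) + power x prior), is the usual definition of FPRP; the function name and the example priors are assumptions for illustration.

```python
def fprp(alpha, power, prior):
    """Probability that a 'significant' finding is a false positive.

    alpha: significance level (or observed p-value) of the test.
    power: probability of detecting a true association of the assumed size.
    prior: prior probability that the association is real.
    """
    false_pos = alpha * (1 - prior)
    true_pos = power * prior
    return false_pos / (false_pos + true_pos)


# Same test result, different priors: a random SNP versus a strong candidate.
for prior in (0.0001, 0.001, 0.01, 0.1, 0.5):
    print(prior, round(fprp(alpha=0.05, power=0.8, prior=prior), 4))
```

With a low prior, as for an arbitrary SNP in a genome-wide scan, a nominally significant result is still most likely a false positive, which is why confirmation in independent populations matters.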
Issues of GWAS
Population stratification
Multiple Testing: False Positives
Gene-Environmental Interaction
High Costs
Kingsmore, 2008
GWAS
Proposed GWAS of Lung
Cancer among Non-smokers
Motives and Conceptual Framework
For Study of Genetic Susceptibility to
Lung Cancer among Non-smokers
About 16% of male smokers and 10% of female smokers
will eventually develop lung cancer, which suggests that exposure to
other environmental carcinogens and individual genetic
susceptibility may play an important role in lung cancer among
non-smokers.
It is suggested that 26% of lung cancers are associated with
genetic susceptibility (Lichtenstein P, et al. NEJM, 2000).
We hypothesize that variation in genetic susceptibility, or
single nucleotide polymorphisms (SNPs) of genes in the
inflammation, DNA repair, and cell cycle control pathways, may
be important in the development of lung cancer among non-smokers.
Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A,
Theoretical model of gene-gene/environmental interaction pathway for lung cancer
[Diagram; labels recovered from the slide:
Exposures: tobacco consumption; occupational exposures; environmental carcinogens/procarcinogens (PAHs, xenobiotics, arene, alkine, etc.)
Carcinogen metabolism (active carcinogens vs. detoxified carcinogens): CYP1A1 (MspI, Ile462Val); mEH (Tyr113His, His139Arg); GSTM1 (null); GSTP1 (Ile105Val, Ala114Val); NQO1 (Pro187Ser)
DNA damage and repair: XRCC1 (Arg194Trp, Arg399Gln, Arg280His); DNA damage repaired: normal cell; defective DNA repair gene: DNA damage not repaired
Cell cycle control (G0, G1, S, G2, M): P53 (Arg72Pro); P16 (Ala146Thr); Cyclin D1 (G870A)
Outcomes: programmed cell death; carcinogenesis if cell cycle control is lost]
500K SNP Coverage
Median intermarker distance: 3.3 kb
Mean intermarker distance: 5.4 kb
Average heterozygosity: 0.30
Average minor allele frequency: 0.22
SNPs in genes: 196,384
80% of genome within 10 kb of a SNP
Figure 1. The effects of SNPs on the Risk of Lung Cancer among Smokers and Non-smokers
[Bar chart. Y-axis: OR (0 to 8). X-axis: BRCA1, CHEK1, XRCC3, IFNG, IL-10, ALDH2. Series: Smokers, Non-Smokers, ETS Exp, Non ETS Exp]
Hypothesis
The overall hypothesis is that multiple
sequence variants in the genome are
associated with the risk of lung cancer
among non-smokers. Specifically, we
hypothesize that a number of common
nonsmoking lung cancer risk-modifying
SNPs are in strong LD with the SNPs
arrayed on the 500K GeneChip®.
Figure 2. Structure and Governance of ILCCO
[Organization chart. Executive Committee; five working groups (DNA Repair, Nonsmokers, Familial Cases, Rare Histology, Young Onset), each with a Working Group Coordinator and Working Group Members]
Specific Aims
Aim 1. To perform exploratory tests for
association between 500K SNPs across the
genome and lung cancer risk among 200
non-smoking lung cancer patients and 200
controls.
Aim 2. To perform first stage of confirmatory
association tests between lung cancer risk
and more than 1,000 SNPs implicated in Aim
1 among an independent set of 600 pairs of
cases and controls.
Specific Aims
Aim 3. To perform second stage of confirmatory
association tests between lung cancer risk and more
than 500 SNPs that were replicated in Aim 2 among
an additional 600 cases and 600 controls. Additional
SNPs will also be added from our ongoing pathway
specific analyses of DNA repair, cell cycle regulation,
inflammation and metabolic pathways based on nonsmokers in our lung cancer study.
Aim 4. To perform fine mapping association studies
in the flanking regions of each of the 30-100 SNPs
confirmed in Aim 3 among the entire 1,400 cases and
1,400 controls. The large number of cases with nonsmoking lung cancer in this study population also
allows us to identify SNPs that are associated with
risk of the disease among nonsmokers.
Specific Aims
Aim 5. To explore the generalizability of the
SNPs identified in Specific Aims 1-4 within a
Chinese population of 600 nonsmoking lung
cancer cases and 600 nonsmoking controls.
The relatively homogeneous Chinese
population not only allows us to further
confirm the associations, but also improves
our ability to finely map the SNPs associated
with lung cancer risk among non-smokers.
Discussion: Costs
Affymetrix 500K SNP chip: $1,000/case
2,000 x $1,000 = $2 M
1,000 x $1,000 = $1 M
500 x $1,000 = $0.5 M
500 x 3,000 (SNPs) x $0.15 = $225,000
500 x 30 (SNPs) x $0.15 = $2,250