Andrea Baccarelli, MD, PhD, MPH
Laboratory of Environmental Epigenetics
Harvard School of Public Health
Lecture 7
From GWAS to EWAS &
Interpretation of epigenetic data
Genetics
• Candidate gene approach
• A priori knowledge → candidate genes
• test for association with disease/phenotype
• Genome-wide approach (GWAS)
• Agnostic approach → entire genome
• test for association with disease/phenotype
Graphical representation of GWAS findings
Manhattan plot
Systemic Sclerosis (auto-immune disease)
Radstake et al., Nature Genetics 2010
Published Genome-Wide Associations through 12/2013
Published GWA at p ≤ 5×10⁻⁸ for 17 trait categories
NHGRI GWA Catalog
www.genome.gov/GWAStudies
www.ebi.ac.uk/fgpt/gwas/
Epigenetics
• Candidate gene (gene-specific) approach
• A priori knowledge → candidate genes
• test for association with exposure/risk factor
• test for association with disease/phenotype
• Global (average) level of methylation (5mC content)
• Average methylation of all CpG sites across the genome
• test for association with exposure/risk factor
• test for association with disease/phenotype
• Epigenome-wide approach (EWAS)
• Agnostic approach → entire genome
• test for association with exposure/risk factor
• test for association with disease/phenotype
Examples for DNA methylation
• Candidate gene approach
– AAB’s blood has 26% methylation in the IL6 promoter
(N.B.: any other region of interest can be targeted, e.g.,
CpGi shore, shelf, etc.)
• Global methylation approach
– AAB’s blood has 4.5% methylation (i.e., 4.5% of all
cytosines found in blood are methylated; no information
on where the methylated cytosines are located)
• Genome-wide approach
– Methylation in AAB’s blood is measured at a large number
of CpG sites (e.g., if we use the Illumina Infinium 450K
BeadChip → we will get ≈486,000 numbers [one for each
CpG site] for AAB’s blood)
GWAS/EWAS
• Screen for 100Ks to millions of loci:
– GWAS: Single nucleotide polymorphisms (SNPs)
– EWAS: CpG sites
• The EWAS field is relatively new
• Several tools and methods are adapted from
GWAS
Features covered in the 450K Infinium BeadChip
The 450K BeadChip covers a total of 77,537 CpG Islands and CpG Shores (N+S)
The 450K BeadChip covers a total of 20,617 genes

Region Type          Regions    CpG sites covered on      Average # of CpG
                                450K BeadChip array       sites per region
CpG Island           26,153     139,265                   5.08
N Shore              25,770     73,508                    2.74
S Shore              25,614     71,119                    2.66
N Shelf              23,896     49,093                    1.97
S Shelf              23,968     48,524                    1.94
Remote/Unassigned    -          104,926                   -
Total                           485,553

[Figure: schematic of gene regions (TSS1500, TSS200, 5’ UTR, 3’ UTR) and CpG island regions (N Shelf, N Shore, CpG Island, S Shore, S Shelf)]
GWAS vs. EWAS
• Type of data
– GWAS: a SNP can assume only 3 values: 0 (wt/wt); 1
(wt/var); 2 (var/var)
– EWAS: measures are quantitative, e.g., the Illumina
Infinium β value between 0 and 1
• Changes over time
– GWAS: SNPs (almost) never change
– EWAS: epigenetic marks change over time
• Tissue specificity
– GWAS: SNPs are not tissue specific
– EWAS: epigenetic marks are tissue specific
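To make the contrast in data types concrete, the sketch below codes a genotype as 0/1/2 and computes a β value from methylated and unmethylated probe intensities. The formula β = M / (M + U + 100) is the commonly used Illumina convention and is an assumption here, since the slides do not state it; the function names and intensity values are hypothetical.

```python
def genotype_code(allele1, allele2, variant="var"):
    """Count variant alleles: 0 (wt/wt), 1 (wt/var), 2 (var/var)."""
    return int(allele1 == variant) + int(allele2 == variant)


def beta_value(methylated, unmethylated, offset=100):
    """Beta value from probe intensities; offset=100 is an assumed stabilizing constant."""
    return methylated / (methylated + unmethylated + offset)


# Hypothetical inputs for illustration only
print(genotype_code("wt", "var"))          # a discrete genotype code
print(round(beta_value(2600, 7300), 2))    # a continuous value between 0 and 1
```

A genotype is a count, whereas a β value is a proportion, so the two usually call for different statistical models.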
Volcano plot
Differences between liver cancer cases and controls
Shen et al., Hepatology 2012
Multiple comparisons
• Infinium 450K methylation BeadChip
– Methylation measured at 485,553 CpG sites
– We will do 485,553 statistical tests
– Any problem with that?
• If you conduct 20 tests at α=0.05
– on average, one significant (positive) by chance at p<0.05
• If you conduct 485,553 tests
– about 24,278 significant (positives) by chance at p<0.05
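The arithmetic behind these numbers can be checked with a short sketch. It assumes every null hypothesis is true and, for the "at least one" probability, that the tests are independent; neither assumption holds exactly for real methylation data, and the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of p-values below alpha when all nulls are true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


print(expected_false_positives(20))               # 1.0
print(round(prob_at_least_one(20), 2))            # 0.64
print(round(expected_false_positives(485_553)))   # 24278
```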
Statistical corrections for multiple comparisons
• Bonferroni correction
– Multiple tests inflate the cumulative α
– Dividing α by the number of tests (α/485,553) controls the
chance of any false positive
– Threshold for significance commonly set at p =
0.05/485,553 ≈ 1.0 × 10⁻⁷
• False discovery rate (FDR)
– Focuses on positive (significant) findings at a ‘nominal’
uncorrected p-value
– FDR is the proportion of false positives among all positive
findings
– FDR controlling procedures have been developed to control
the expected proportion of false positives (e.g.,
Benjamini-Hochberg)
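The two corrections can be compared on a toy example. This is a minimal pure-Python sketch of the Bonferroni threshold and the Benjamini-Hochberg step-up procedure; the ten p-values are made up for illustration, and a real EWAS would normally rely on an established statistics library rather than this hand-written version.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k such that p_(k) <= (k / m) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for illustration only
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
threshold = bonferroni_threshold(0.05, len(p))
print([i for i, pv in enumerate(p) if pv <= threshold])  # Bonferroni hits
print(benjamini_hochberg(p))                             # Benjamini-Hochberg hits
```

On these values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the two smallest, which illustrates why FDR control is the less conservative choice.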
                      True association: YES    True association: NO
P-value: Positive     True Positive (TP)       False Positive (FP)
P-value: Negative     False Negative (FN)      True Negative (TN)

P-value = FP / (TN + FP)
Probability of a false positive finding under the null hypothesis (i.e., no true association)

FDR = FP / (TP + FP)
If I have a number X of significant p-values, how many are false positives? (Proportion of false positives)
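The two ratios answer different questions, which a small numerical sketch makes visible. The counts below are hypothetical and chosen only to show that a low false positive proportion among true nulls can coexist with a sizeable FDR among the positive findings.

```python
def false_positive_proportion(fp, tn):
    """FP / (TN + FP): share of truly null sites that test positive."""
    return fp / (tn + fp)


def false_discovery_rate(fp, tp):
    """FP / (TP + FP): share of positive findings that are false."""
    return fp / (tp + fp)


# Hypothetical counts for 10,000 tested sites
tp, fp, fn, tn = 80, 20, 20, 9880
print(round(false_positive_proportion(fp, tn), 4))
print(round(false_discovery_rate(fp, tp), 2))
```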
Learning from past experience (in genetics)
Relative odds of alcohol dependency associated with Taq1A polymorphism
[Figure: odds ratio as a function of publication year, 1990 to 2004; original OR = 8.7, final OR = 1.4]
Smith et al. (2008), American Journal of Epidemiology, 167(2): 125-138.
The winner’s curse
• On eBay – Bidders lack information on the true
value of the item being auctioned
– High variance in the estimated (dollar) values
• many over- and many under-estimates (bids)
– The “winner” is likely to have made the largest
overestimate of value
• i.e., he or she is paying (way) too much
• In genetics – The winner’s curse has been common
– the first report of an association of genetic variation with
disease is likely to overestimate the effect size
• In epigenetics: Does the same apply?
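The winner's curse can be reproduced with a small simulation. In this sketch many hypothetical studies estimate the same true effect with sampling noise, and only the estimates that reach significance are "reported"; the effect size, standard error, and significance cutoff are all assumed values chosen for illustration.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.2, se=0.15, n_studies=10_000,
                           z_crit=1.96, seed=0):
    """Compare the mean of all estimates with the mean of significant ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # A study "wins" when its z statistic exceeds the critical value
    significant = [b for b in estimates if b / se > z_crit]
    return statistics.mean(estimates), statistics.mean(significant)


all_mean, sig_mean = simulate_winners_curse()
print(f"mean of all estimates:         {all_mean:.3f}")
print(f"mean of significant estimates: {sig_mean:.3f}")
```

Because selection on significance keeps only the largest estimates, the second mean exceeds the true effect even though the estimator itself is unbiased.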
Replication is needed
Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication, Nature 447: 655, 2007
Strategies for discovery and Replication
• We will review different approaches for
discovery and replication
• Examples from published studies
– Examples from EWAS when available
– Same concepts apply to both EWAS and GWAS
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive (and false negative) findings
– 66 cases of hepatocellular carcinoma (HCC) assessed using the 450K BeadChip
– Differences in methylation in cancer tissues vs. adjacent non-cancer tissues
– Bonferroni-corrected p value ≤ 0.05; corresponds to a raw p value of ≤ 1.06 × 10⁻⁷
– After Bonferroni adjustment, a total of 130,512 CpG sites significantly differed in
methylation level in tumor compared with non-tumor tissues, with 28,017 CpG sites
hypermethylated and 102,495 hypomethylated in tumor tissues.
Additional filtering
• Hypermethylated sites:
– mean difference in methylation, tumor vs. normal, > 20%
– in > 70% of the tumor tissues, methylation is > 2 SDs above the mean
methylation level of all 66 adjacent tissues
– mean methylation for adjacent tissues < 25%
• Hypomethylated sites:
– mean difference in methylation, tumor vs. normal, > 20%
– in > 70% of the tumor tissues, methylation is > 2 SDs below the mean
methylation level of all 66 adjacent tissues
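The filtering rules above can be sketched as two predicates applied to one CpG site at a time. This is an illustrative reading of the criteria, not the authors' code: it assumes methylation is stored as a fraction between 0 and 1 (so 20% means an absolute difference of 0.20) and that the SD is the sample SD of the adjacent tissues; all names and example values are hypothetical.

```python
import statistics


def is_hypermethylated(tumor, adjacent, min_diff=0.20,
                       min_fraction=0.70, max_adjacent_mean=0.25):
    """Apply the three hypermethylation criteria to one CpG site."""
    adj_mean = statistics.mean(adjacent)
    cutoff = adj_mean + 2 * statistics.stdev(adjacent)
    fraction_above = sum(t > cutoff for t in tumor) / len(tumor)
    return (statistics.mean(tumor) - adj_mean > min_diff
            and fraction_above > min_fraction
            and adj_mean < max_adjacent_mean)


def is_hypomethylated(tumor, adjacent, min_diff=0.20, min_fraction=0.70):
    """Apply the two hypomethylation criteria to one CpG site."""
    adj_mean = statistics.mean(adjacent)
    cutoff = adj_mean - 2 * statistics.stdev(adjacent)
    fraction_below = sum(t < cutoff for t in tumor) / len(tumor)
    return (adj_mean - statistics.mean(tumor) > min_diff
            and fraction_below > min_fraction)


# Hypothetical methylation fractions for a single CpG site
tumor = [0.55, 0.60, 0.50, 0.65, 0.58]
adjacent = [0.10, 0.12, 0.08, 0.11, 0.09]
print(is_hypermethylated(tumor, adjacent))
print(is_hypomethylated(tumor, adjacent))
```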
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive (and false negative) findings
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than same-size discovery only
(Skol AD, Nat Genet 2006).
• All subjects from the ESTHER cohort in
Germany
• Internal Replication
– Discovery on 177 participants from ESTHER
(27K Infinium methylation BeadChip analysis)
– Replication on 316 participants from ESTHER
(Sequenom MASS-ARRAY)
Discovery and replication groups
Discovery
Discovery → validation → replication (top gene)
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive (and false negative) findings
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than same-size discovery only
(Skol AD, Nat Genet 2006).
• Discovery → External (Independent) Replication
– Two (or more) independent studies
– Ensure validation + generalizability
(1) Discovery: Cord blood and peripheral blood samples from 1018 ALSPAC child-mother pairs (450K Infinium methylation BeadChip analysis)
(2) External Replication:
• The WMHP and CANDLE cohorts (27K Infinium methylation BeadChip analysis)
• The NB and MoBa cohorts (450K Infinium methylation BeadChip analysis)
• A case–control study (450K Infinium methylation BeadChip analysis)
Discovery → Replication
Gestational Age:
• 224 top hits: GA had a negative association with methylation at 188
probes and a positive association at 36 probes
• 129 replicated in the NB cohort and 5 replicated in the WMHP and
CANDLE cohorts
• 72 previously reported in the case-control study
Birth Weight:
• 23 associations observed between birth weight and cord blood
methylation in the discovery study
• 2 out of 23 replicated in the MoBa cohort
EWAS validation – Study design
• Discovery only (Single study)
– Prone to false positive (and false negative) findings
• Internal Replication
– Sample two or more groups from the same population
– Group 1: EWAS; Other groups: candidate gene analysis
– Overall power lower than same-size discovery only
(Skol AD, Nat Genet 2006).
• Discovery → Replication
– Two (or more) independent studies
– Ensure validation + generalizability
• Meta-analysis
– Uses estimates from multiple populations
– Needed to achieve large sample size
– Allows for evaluating generalizability
• 44,494 participants of European ancestry
– from nine large studies participating in the
Cohorts for Heart and Aging Research in Genomic
Epidemiology (CHARGE) Consortium.
– seven additional studies
• Each study computes association statistics
(e.g., ORs and p-values), then results are
meta-analyzed
• Only results (not data) are shared
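Combining per-study results without sharing individual data can be sketched with an inverse-variance, fixed-effect meta-analysis. The slides do not say which model the consortium used, so the fixed-effect choice is an assumption, and the log odds ratios and standard errors below are hypothetical.

```python
import math


def fixed_effect_meta(estimates, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1 / total)


# Hypothetical per-study log odds ratios and standard errors
log_ors = [0.10, 0.20, 0.15]
ses = [0.05, 0.10, 0.05]
pooled, pooled_se = fixed_effect_meta(log_ors, ses)
print(f"pooled log OR = {pooled:.3f}, SE = {pooled_se:.3f}")
print(f"pooled OR = {math.exp(pooled):.2f}")
```

Each study contributes only an estimate and its standard error, and more precise studies receive more weight.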
Results for intima media thickness
Forest plot for ZHX2 – rs11781551
(zinc fingers and homeoboxes 2)
Population Stratification*
Each population has unique genetic and social history;
ancestral patterns of migration, mating,
expansions/bottlenecks, stochastic variation all yield
differences in allele frequencies between populations.
Population stratification: cases and controls have
different allele frequencies due to diversity in
populations of origin and unrelated to outcome,
requiring:
1) differences in disease prevalence
2) differences in allele frequencies
*Cardon LR, Palmer LJ, Lancet 2003
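A worked example shows how the two requirements combine to create a spurious association. In this sketch the allele is unrelated to disease within each subpopulation, yet pooling the two produces an odds ratio well above 1; the subpopulation sizes, carrier frequencies, and prevalences are made-up numbers.

```python
def stratum_table(n, carrier_freq, prevalence):
    """Expected 2x2 counts when carrier status and disease are independent."""
    cases, controls = n * prevalence, n * (1 - prevalence)
    return {
        "case_carrier": cases * carrier_freq,
        "case_noncarrier": cases * (1 - carrier_freq),
        "control_carrier": controls * carrier_freq,
        "control_noncarrier": controls * (1 - carrier_freq),
    }


def odds_ratio(t):
    """Odds ratio of disease for carriers vs. non-carriers."""
    return (t["case_carrier"] * t["control_noncarrier"]) / (
        t["case_noncarrier"] * t["control_carrier"])


# Hypothetical subpopulations differing in both allele frequency and prevalence
pop_a = stratum_table(10_000, carrier_freq=0.8, prevalence=0.20)
pop_b = stratum_table(10_000, carrier_freq=0.2, prevalence=0.05)
pooled = {k: pop_a[k] + pop_b[k] for k in pop_a}

print(round(odds_ratio(pop_a), 2), round(odds_ratio(pop_b), 2))
print(round(odds_ratio(pooled), 2))
```

If either the prevalence or the carrier frequency were equal across the two subpopulations, the pooled odds ratio would return to 1.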
What is population stratification?
Balding, Nature Reviews Genetics 2010
Unlinked Genetic Markers in Population Stratification
• Population stratification (or any non-random
mating) allows marker-allele frequencies to vary
among population segments.
• Disease more prevalent in one subpopulation will
be associated with any alleles in high frequency
in that subpopulation.
• If population stratification exists, can often be
detected by analysis of unlinked marker loci.
[Pritchard JK, Rosenberg NA; AJHG 1999; 65:220-228]
Adjusting for Population Stratification in a GWAS
of T2DM*
• Case-control study of 661 cases of T2DM and 614
controls from France.
• Genotyping assayed 392,935 SNPs
• SNP 200kb from lactase gene on 2q21:
– Strong association with T2DM
– Strong north-south prevalence gradient in France
• Used 20,323 SNPs not related to T2DM as measure
of population stratification.
• After adjustment for stratification, most of the
association was removed.
*Sladek R et al. Nature 2007; 445: 881-885.
Sources of analytical variability for methylation EWAS
• Several factors can affect results
– DNA/sample quality
– Plate effects
– Batch effect
– Row/column effect
• How to handle this
– Best laboratory practice
– Randomize/balance samples
– Universal DNA/Replicates
– Bioinformatics/Statistical analysis
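One way to randomize and balance samples is to deal cases and controls across plates separately and then shuffle positions within each plate. This is a minimal sketch under assumed names; plate capacity, chip layout, and other covariates worth balancing (e.g., sex or age) are not handled.

```python
import random


def assign_plates(sample_ids, is_case, n_plates, seed=0):
    """Spread cases and controls evenly over plates, in random order."""
    rng = random.Random(seed)
    cases = [s for s in sample_ids if is_case[s]]
    controls = [s for s in sample_ids if not is_case[s]]
    rng.shuffle(cases)
    rng.shuffle(controls)
    plates = [[] for _ in range(n_plates)]
    # Round-robin dealing keeps case and control counts within one per plate
    for i, sample in enumerate(cases + controls):
        plates[i % n_plates].append(sample)
    # Shuffle within each plate so position is unrelated to status
    for plate in plates:
        rng.shuffle(plate)
    return plates


# Hypothetical study: 24 cases and 24 controls over 4 plates
sample_ids = [f"S{i:02d}" for i in range(48)]
is_case = {s: i < 24 for i, s in enumerate(sample_ids)}
plates = assign_plates(sample_ids, is_case, n_plates=4)
for number, plate in enumerate(plates, start=1):
    print(number, sum(is_case[s] for s in plate), len(plate))
```

With a layout like this, plate and batch effects are not confounded with case/control status and can be adjusted for in the statistical analysis.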
Is DNA Collected and Handled Identically
in Cases and Controls?
• T1DM gene association study: cases from GRID
Study, controls from 1958 British Birth Cohort Study
examining 6322 SNPs.
• Samples from lymphoblastoid cell lines extracted
using same protocol in two different laboratories.
• Case and control DNAs randomly ordered with
teams masked to case/control status.
• Some extreme associations could not be replicated
by second genotyping method.
Clayton DG et al., Nat Genet 2005; 37: 1243-46.
Interpretation of epigenetic data
In-class Readings
Papers
• Lee et al. Quantitative promoter hypermethylation
analysis of RASSF1A in lung cancer: Comparison with
methylation-specific PCR technique and clinical
significance. Mol Med Report 2011.
• Joubert et al. 450K Epigenome-Wide Scan Identifies
Differential DNA Methylation in Newborns Related to
Maternal Smoking during Pregnancy. Environ Health
Perspect 2012
In-class Readings
Questions
• DNA methylation analysis:
– Which technique was used?
– How much DNA was used?
– Did it involve bisulfite treatment?
• Aim of the study:
– What was measured?
– Why?
• Results:
– How were DNA methylation results reported?
– Which statistical analysis was used?
Next lecture
Guest Lectures: Reproductive Epigenetics and
Prenatal Influences on the Epigenome
Karin Michels, PhD, ScD
Co-Director, Ob/Gyn Epidemiology Center, BWH
Heather Herson Burris, MD, MPH
Neonatology, BIDMC