Genome-Wide Association Meta


GWAS Consortia and Meta-Analysis
Inês Barroso
Joint Head of Human Genetics
Metabolic Disease Group Leader
Wellcome Trust Sanger Institute
Objectives
• Why perform meta-analysis?
• How?
• What are the issues to consider?
• What can you gain?
• Setting up consortia
• What next?
– Deep replication
– Fine-mapping
– Rare variants
First Obesity Locus
Science, 2007
Nature Genetics, 2007
PLOS Genetics, 2007
No additional variants robustly
associated with BMI identified
Second Common Obesity Locus
Nature Genetics, 2008
Required:
– >16K samples in initial analysis;
– replication in additional ~90K samples;
– association study in population of different ancestry
CONCLUSION:
Large sample sizes needed to detect the small effects one expects for
complex traits
GWAS Changing Approaches
[Diagram: individual studies (GWAS 1 to GWAS 7) feeding a sequence of stages: Association Signals; Replication in Additional Populations of Similar Descent; Meta-Analysis in Large International Consortia; Replicating Loci; Association Testing in Diverse Populations; Re-Sequencing & Fine-Mapping; Causal Variants and Biology]
Why Perform Meta-Analysis?
• Maximise the value and information from pre-existing data (e.g. GIANT and BMI)
• Increase sample size
– the sample sizes required are often too large for a single
group to perform an appropriately powered study
• Increase power of the study
• Possibly to increase diversity of study samples
– Trans-ethnic fine-mapping
Meta-Analysis: Principles
• Synthesis of different datasets to obtain a summary
based on evidence from the combined data
• In epidemiological terms, meta-analyses provide a better
estimate of effect size
• In the GWAS setting, meta-analysis is usually initially
carried out to help the discovery of further susceptibility
variants of moderate/ small effect size that would have
otherwise escaped detection due to low power
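The idea of a summary estimate from combined data can be sketched with a fixed-effect, inverse-variance-weighted meta-analysis. This is only an illustrative sketch: the slides do not say which weighting scheme is used, and the function name and the three study values below are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Pool per-study effects (e.g. log odds ratios) by inverse-variance weighting."""
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p

# Hypothetical log odds ratios and standard errors from three studies.
betas = [0.10, 0.08, 0.12]
ses = [0.05, 0.04, 0.06]
beta, se, p = fixed_effect_meta(betas, ses)
print(f"pooled OR = {math.exp(beta):.3f}, SE(log OR) = {se:.4f}, p = {p:.2e}")
```

The pooled standard error is smaller than that of any single study, which is the gain in power the slide refers to.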
Sample Size and Power
Power to detect association (p = 5×10⁻⁸) at a variant with risk allele frequency 0.30 and
allelic OR 1.10
>20,000 cases and an equal number of
controls needed for power > 80%
[Figure: power (%), 0 to 100, plotted against n cases (equal to number of controls), 0 to 50,000]
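The shape of the power curve can be approximated with a short calculation. The sketch below is not the method used to draw the slide's figure. It assumes a two-sided allelic test with a normal approximation, and it treats the stated risk allele frequency of 0.30 as the frequency in controls. The function name is hypothetical.

```python
from statistics import NormalDist

def allelic_test_power(n_cases, n_controls, raf_controls, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided test of allele frequency in cases vs controls."""
    norm = NormalDist()
    p0 = raf_controls
    # Risk allele frequency in cases implied by the allelic odds ratio.
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)
    # Each individual contributes two alleles.
    se = (p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls)) ** 0.5
    z_effect = (p1 - p0) / se
    # Critical value for the genome-wide significance threshold.
    z_crit = -norm.inv_cdf(alpha / 2)
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

for n in (5000, 10000, 20000, 25000, 30000):
    print(n, round(allelic_test_power(n, n, 0.30, 1.10), 3))
```

Under these assumptions power stays very low for a few thousand cases and passes 80% only somewhat above 20,000 cases, in line with the slide.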
Forming consortia
• The first step in many GWAS meta-analyses involves setting up consortia to
study specific traits of interest
[Logo: GIANT Consortium]
Consortia Rules of Engagement
• It is helpful to decide governance structures up front
– Steering committees
– Working groups
• Agree principles for data sharing
– Within the consortia
– With the wider community
• Agree principles for authorship
• Have a written document delineating the above
• Ask participating investigators to agree to/ sign up to
the above
• Document can be referred back to when/if needed
GWAS Meta-Analysis Considerations (I)
• The majority of GWAS meta-analyses combine data retrospectively
– Harmonisation of study design can be extremely difficult;
– Meta-analyses can be carried out sequentially, and can be updated when new
GWAS datasets for the same trait emerge
• Robust GWAS meta-analyses require a clear predefined protocol:
– Definition of phenotype to study should be uniform;
– Agreed uniform criteria for inclusion/exclusion of samples;
– Agreed uniform QC for genotyped data (call rate, HWE, MAF);
– There is a need to specify basic analytical options, such as genetic model examined and
strategy for covariate adjustments;
– Need to consider how to correct for population stratification (e.g. genomic control)
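Genomic control, mentioned in the last point, is commonly summarised by an inflation factor lambda: the median observed 1-df chi-square statistic divided by its expected median under the null. A minimal sketch follows. The function names are hypothetical, and the p-values are evenly spaced stand-ins for real association results.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def p_to_chi2(p):
    """Convert a two-sided p-value (0 < p <= 1) to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z

def genomic_control_lambda(p_values):
    """Inflation factor: observed median chi-square over the null median."""
    return median(p_to_chi2(p) for p in p_values) / CHI2_1DF_MEDIAN

def gc_correct(chi2_stat, lam):
    """Deflate a test statistic by lambda; no correction is applied if lambda <= 1."""
    return chi2_stat / max(lam, 1.0)

# Evenly spaced p-values mimic a well-calibrated (null) study.
p_values = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_control_lambda(p_values)
print(f"lambda = {lam:.3f}")
```

A lambda well above 1 suggests inflation from stratification or relatedness. Dividing each study's statistics by its own lambda before pooling is one common convention, not a requirement stated on the slide.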
GWAS Meta-Analysis Considerations (II)
• Retrospective aspect of GWAS meta-analyses
– It is critical to gather information systematically regarding phenotype,
genotyping platform, local QC metrics, covariates, etc.
• Because most GWAS meta-analyses combine pre-existing
datasets, these are often generated using different genotyping
platforms with limited overlap in variants tested
– How should one combine data generated on multiple different platforms?
(imputation)
Imputation
• Allows genotypes to be estimated at SNPs with missing data
– Failed SNPs
– SNPs not present on a given genotyping array
• Example: two studies to be meta-analysed used two different
genotyping platforms with little overlap in SNP content
– Only using overlapping SNPs would exclude the majority of the data
– Imputation allows data to be combined across platforms
– Imputation increases the coverage across the genome and the number
of variants that can be tested for association
Imputation
[Figure: Study 1; Study 2; Study 1 with imputed missing SNPs]
• Imputation
– Requires GWAS genotypes to be used as scaffold
– Requires reference datasets (e.g. www.hapmap.org;
www.1000genomes.org) where the LD (correlation) between SNPs is
known, which allows imputation of genotypes for variants not typed on a
given array. Increasingly these could include reference datasets
generated by whole-genome sequencing of subsets of individuals from
the populations included in the study
• There is specialist software to facilitate imputation as well as meta-analysis
GWAS Meta-Analysis
Collecting Information for Consortia Work
• Requires summary statistics at each variant
• Information on analysis method and covariates used
• Information on the size of the study
• Information on the independence of samples
• Information on approaches taken to adjust for any population
stratification (for example genomic control)
• Information on strand and build of the human genome, on
which allele coding has been based
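Strand and allele coding matter because an effect reported for allele A in one study must not be pooled with an effect reported for allele G in another. A minimal sketch of the alignment step follows. The function name and the conservative handling of palindromic SNPs are assumptions for illustration, not a procedure described on the slide.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def harmonise(effect_allele, other_allele, log_or, ref_effect, ref_other):
    """Return the log OR expressed for ref_effect, or None if alleles cannot be reconciled."""
    # Palindromic SNPs (A/T, C/G) read the same on both strands, so the strand
    # cannot be inferred from the alleles alone; they are dropped here.
    if COMPLEMENT[effect_allele] == other_allele:
        return None
    if (effect_allele, other_allele) == (ref_effect, ref_other):
        return log_or
    # Same strand, but the effect was reported for the other allele.
    if (effect_allele, other_allele) == (ref_other, ref_effect):
        return -log_or
    # Try the opposite strand.
    flipped = (COMPLEMENT[effect_allele], COMPLEMENT[other_allele])
    if flipped == (ref_effect, ref_other):
        return log_or
    if flipped == (ref_other, ref_effect):
        return -log_or
    return None

print(harmonise("T", "C", 0.10, "A", "G"))  # reported on the opposite strand
print(harmonise("G", "A", 0.10, "A", "G"))  # effect reported for the other allele
```

In practice, palindromic SNPs can sometimes be resolved using the reported strand or allele frequencies, which is one reason that information is collected.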
Typical data sharing table format
STUDY TITLE

General information
  Name of study
  Name of analyst
  Email of analyst
  Study design: population-based, family-based (please give details)

Sample information
  Number of cases (females)
  Number of controls (males)
  Ethnic composition
  Possible relatedness issues: are individuals related (how?)
  Possible structure issues: mixed population?

Genotyping and imputation information
  Genotyping platform
  Summary of key QC metrics
  # SNPs passed QC
  Imputation method
  Imputation settings
  Reference data used for imputation: including build

Analytical information
  Association analysis method for imputed genotypes: accounting for uncertainty using SNPTEST or other (which?) program; using only genotypes with P(call)>X (which threshold?) as hard calls; using best guess genotypes
  Calculated GC lambda (typed SNPs)
  Calculated GC lambda (imputed SNPs)
  Covariates included
  Genetic model
  PCA, GC, none
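The three options listed for analysing imputed genotypes differ in how they treat the genotype probabilities. A minimal sketch of each follows. The function names and the example probabilities are hypothetical, and real tools such as SNPTEST handle uncertainty inside the association model rather than through a single dosage value.

```python
def expected_dosage(probs):
    """Expected count of allele B from (P(AA), P(AB), P(BB)); keeps the uncertainty."""
    p_aa, p_ab, p_bb = probs
    return p_ab + 2.0 * p_bb

def best_guess(probs):
    """Most likely genotype, coded as the number of B alleles (0, 1 or 2)."""
    return max(range(3), key=lambda g: probs[g])

def hard_call(probs, threshold):
    """Best-guess genotype if its probability exceeds the threshold, else missing (None)."""
    g = best_guess(probs)
    return g if probs[g] > threshold else None

probs = (0.05, 0.80, 0.15)  # hypothetical imputed genotype probabilities
print(expected_dosage(probs), best_guess(probs), hard_call(probs, 0.9))
```

With these probabilities the best guess is a heterozygote, but a call threshold of 0.9 would set the genotype to missing, so the choice affects both the sample size and the effect estimate.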
Typical data sharing table format
Column header | Description
SNP | SNP rs number (if unknown, e.g. with some Affymetrix SNPs, report Affy SNP ID)
build | e.g. “36”, human genome build used
strand | e.g. “+”, human genome strand used
chromosome | chromosome on which SNP resides
position | position of SNP on chromosome in base pairs, based on human genome build used
imputed | “1” for imputed, “0” for directly-typed SNP passing QC
major_allele | e.g. “G”, major allele at that SNP, based on control frequency
minor_allele | e.g. “A”, minor allele at that SNP, based on control frequency
MAF_controls | e.g. “0.246”, minor allele frequency in controls; provide 3 digits to the right of the decimal
OR_allele | e.g. “A”, allele for which the OR has been estimated
call_rate | e.g. “0.985”, call rate for this SNP across cases and controls; provide 3 digits to the right of the decimal
exact_HWE_cases | exact HWE p value in cases
exact_HWE_controls | exact HWE p value in controls
OR | e.g. “1.097”, allelic odds ratio; provide 3 digits to the right of the decimal
lower_95%CI | e.g. “0.874”, lower 95% confidence limit of the OR; provide 3 digits to the right of the decimal
upper_95%CI | e.g. “1.267”, upper 95% confidence limit of the OR; provide 3 digits to the right of the decimal
additive_p_uncorr | additive model p value, uncorrected for genomic control
additive_p_corr | additive model p value, corrected for genomic control
impute_acc | e.g. “0.98”, metric for imputation accuracy (i.e. value for r2hat or proper_info measures, depending on imputation programme used; if some other measure used, please specify)
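The OR, lower_95%CI and upper_95%CI columns are enough to recover the log odds ratio and its standard error, which are the usual inputs to a meta-analysis. A minimal sketch follows, assuming the interval is a Wald interval that is symmetric on the log scale. The function name and the example row are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def log_or_and_se(odds_ratio, lower_95, upper_95):
    """Recover log(OR) and its standard error from an OR and its 95% confidence limits."""
    beta = math.log(odds_ratio)
    # The interval spans 2 * Z_95 standard errors on the log scale.
    se = (math.log(upper_95) - math.log(lower_95)) / (2.0 * Z_95)
    return beta, se

# Hypothetical row: OR 1.10 with 95% CI 1.00 to 1.21.
beta, se = log_or_and_se(1.10, 1.00, 1.21)
print(f"log OR = {beta:.4f}, SE = {se:.4f}")
```

Values rounded to three decimals lose some precision, so the recovered standard error is approximate.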
Heterogeneity
• Results from meta-analysis of various studies may
suggest between-study heterogeneity (especially
when combining populations of different ancestry)
• How to interpret heterogeneity?
– Differences in study design
– Differences in population structure
– Differences in environmental exposures
– False-positive?
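Between-study heterogeneity is often quantified with Cochran's Q and the I² statistic. The slide does not name a specific measure, so the sketch below is one common choice rather than the consortium's method. The function name and the example effects are hypothetical.

```python
def heterogeneity(betas, ses):
    """Return Cochran's Q and I-squared (%) for per-study effects and standard errors."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled effect.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I-squared: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Two hypothetical studies with clearly different effects.
q, i2 = heterogeneity([0.0, 0.3], [0.05, 0.05])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```

Under homogeneity Q follows a chi-square distribution with (number of studies minus 1) degrees of freedom. A large I² flags a locus for the kinds of follow-up questions listed above.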
Benefits of GWAS Meta-Analysis
• Increased sample sizes for many disease and
continuous trait consortia
– increased power to detect new loci
– new pathways and important biological insights
gained
– greater power to detect even smaller effect sizes and
greater coverage of allele frequency spectrum
• Power of large collaborations/consortia
– Design better powered replication and fine-mapping
experiments (e.g. Metabochip, Immunochip)
GWAS Meta-Analysis Results
• As sample size increases, power increases to detect smaller
effect sizes
• Effect sizes were small with
FTO being the largest;
• For a given allele frequency
novel loci had slightly
smaller effect sizes than
previously established loci
• BMI-increasing allele frequency for
new loci varied from 4-87%,
covering a greater allele frequency
spectrum than the previous
GWAS meta-analysis with ~half
the sample size (24-83%)
Speliotes et al., Nature Genetics, 2010
Published GWAS Meta-Analysis Studies
• Increased the number of loci discovered
• Increased the fraction of the heritability explained
• However…
• Follow-up of discovery results in published GWAS meta-analyses was limited:
– Only a small number (N<30) of top signals from discovery analysis were
taken into additional studies for validation
– Information from discovery meta-analysis has not been fully
exploited
• Development of custom chips to enable
– Deep replication
– Fine-mapping
Metabochip
• The Metabochip is a custom iSelect Illumina array
(~196,725 SNPs).
• Designed to support large scale follow up of putative
associations for glycaemic, cardiovascular and other
metabolic traits.
• The chip incorporates a number of fine-mapping
regions, designed using the initial release of the 1000
Genomes to ensure high SNP coverage
• The chip also incorporates all established GWAS hits
(p < 5×10⁻⁸) known at the time of design
Metabochip: replication
(Illumina ~200k iSelect array, Aug 2009)
[Diagram: replication content of the Metabochip, centre label ~66,117 SNPs, with contributing consortia and trait groups: ICBP-GWAS (SBP/DBP); CARDIoGRAM (MI/CAD); MPV/PLT/WBC; Lipids (HDL/LDL/TG, TC); QT-IGC (QT); GIANT (BMI/WHR, WC/Height/FATPCT); MAGIC (FG, FI/HbA1c/2hrG); DIAGRAM (T2D, T2DAoD/EarlyOnset). Per-trait allocations are labelled in units such as 5k, 2x5k and 3x1k SNPs.]
Metabochip Application
• Metabochip array used for follow up of regions
with preliminary association evidence in:
– DIAGRAM (type 2 diabetes case control analysis)
– MAGIC (quantitative glycaemic trait analysis)
• New loci influencing type 2 diabetes risk and
glycaemic traits have been discovered (Morris et
al., 2012; Scott et al., 2012)
• Metabochip facilitates investigating the genetic
overlap between related cardiometabolic traits
– Shared genetic determinants
Immunochip
• Custom Illumina array with ~200,000 markers
• Trynka et al., 2011
– Application of Immunochip to coeliac disease
– 13 new loci associated with disease risk at p < 5×10⁻⁸
– 1/3 of loci with multiple independent association signals
– 29 of the 54 fine-mapped signals seemed to be
localized to single genes and, in some instances, to
gene regulatory elements
Summary
• GWAS Meta-Analysis
– Made possible by imputation
– Requires agreed-upon phenotype definition, sample and
SNP QC, and analysis plan
– Facilitated the discovery of new loci due to much larger sample sizes
• Large consortia work facilitated the development of
cost-effective custom arrays for deep replication and
fine-mapping
• New methods for conditional analysis
– Allow application to summary level data
• Rare variants – next analytical challenge!
Fine-Mapping in Meta-Analysis Setting
• Several rounds of conditional analysis by local analysts
• Central meta-analysis of results from conditional
analysis
• Iterative process
• Further insights and better coverage by including
– SNPs imputed to the most recent reference set from 1000
Genomes
– Variants from ongoing sequencing efforts (e.g. UK10K)
• Overlay data with ENCODE datasets to evaluate
functional relevance
• If many studies and samples are involved, this process is
laborious and error-prone
Fine-Mapping Using Summary Level Data
• Yang et al., 2012
– Developed a method for approximate conditional and
joint genome-wide association analysis
– Method can use summary-level statistics from a meta-analysis
of genome-wide association studies (GWAS)
and LD information from a reference panel
– Computationally fast
– Avoids having to go back to individual cohorts for
iterative rounds of conditional analysis
– Avoids the need to request individual level genotype
data from participating cohorts (which is something
most people are not comfortable sharing)
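The idea of conditioning on a nearby SNP using only summary statistics and LD can be illustrated with two SNPs. The sketch below is a simplified illustration, not the full method of Yang et al. It assumes standardised genotypes, the same sample size at both SNPs, and an LD correlation r taken from a reference panel. The function name and the numbers are hypothetical.

```python
import math

def conditional_effect(b1, se1, b2, r):
    """Effect of SNP 1 conditional on SNP 2, from marginal effects and LD correlation r."""
    if abs(r) >= 1.0:
        raise ValueError("SNPs in perfect LD cannot be separated")
    denom = 1.0 - r * r
    # Joint estimate for SNP 1 from solving the 2x2 system R * b_joint = b_marginal.
    b_cond = (b1 - r * b2) / denom
    # Approximate SE: conditioning on a correlated SNP inflates the variance.
    se_cond = se1 / math.sqrt(denom)
    return b_cond, se_cond

# Hypothetical marginal effects for two SNPs in strong LD (r = 0.8).
b_cond, se_cond = conditional_effect(b1=0.10, se1=0.02, b2=0.12, r=0.8)
print(f"conditional effect = {b_cond:.4f}, SE = {se_cond:.4f}")
```

Here the marginal signal at SNP 1 largely disappears after conditioning, which suggests it reflects LD with SNP 2 rather than an independent association. No individual-level genotypes are needed for this step.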
Where to next?
• H3A Consortia
– Investigate genetic bases of diseases in Africa
• Trans-ethnic fine-mapping
– Analyse GWAS data from multiple populations with different patterns
of LD
• Investigate the role of rarer variants in disease and underlying
traits:
– Exome chip - targeted array with variants mapping to protein coding
regions and identified from exome sequencing projects
– Sequence-based association studies
• Whole-exome re-sequencing
– Exons, splice junctions and conserved non-coding sequences (NCS);
– More “interpretable” portion of genome;
– More expensive per base but cheaper overall;
• Whole-genome re-sequencing
– Hypothesis free;
– Covers whole genome.
Genetic variants and human disease
[Figure: effect size (low, modest, intermediate, high) plotted against allele frequency (very rare | 0.1% | rare | 0.5% | low frequency | 5% | common). Labelled regions: rare variants causing Mendelian disease (Linkage/Candidate gene); low frequency variants with intermediate effect (Now); common variants (GWAS/Candidate gene); common variants with high effect, very few of these.]