Sampling Design in Regional Fine Mapping
of a Quantitative Trait
Banff International Research Station
Emerging Statistical Challenges and Methods
Session 7: GWAS and Beyond II
25 June 2014
Shelley B. Bull, Lunenfeld-Tanenbaum Research Institute,
& Dalla Lana School of Public Health, University of Toronto
Co-authors: Zhijian Chen and Radu Craiu
Lunenfeld-Tanenbaum Research Institute & University of Toronto
Overview
Setting
Studies designed to follow up associations detected in a GWAS
Fine-mapping of a candidate region by sequencing
Aim to identify a functional sequence variant
Approach
Phase I: Quantitative trait with GWAS data (e.g. N = 5000)
Phase II: Two stage design
Stage 1 sample (n1) – expensive sequencing to identify a
smaller set of promising variants
Stage 2 sample (n2) – cost-effective genotyping of selected
variants in an independent group
Stratification in Stage 1 according to a promising GWAS tag SNP
Bayesian analysis in Stage 1, incorporating genetic model selection
Two-phase Two-stage Design
Background
Two-phase designs +/- Stratification on tag SNP
Chen et al (2012), Schaid et al (2013), Thomas et al (2013)
Earlier: case-cohort designs
Two-stage designs
Skol et al (2007), Thomas et al (2009),
Stanhope & Skol (2012)
Bayesian approaches to genetic association
Stephens & Balding (2009), Wakefield (2009),
WTCCC/Maller et al (2012)
Genetic model (mis)specification
Joo et al (2010), Spencer et al (2011), Vukcevic et al (2011),
Faye et al (2013)
Sampling Designs & Sample Allocation
Based on tag SNP (AA, Aa, aa) from the GWAS:
(1) Simple random sampling (SRS) – ignores tagSNP information
(2) Equal (ES) number from each stratum
(3) Oversampled homozygous (HO) – number larger than under SRS
Example: N=5000, MAF=0.2
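The allocation example can be sketched in a few lines. This is only an illustration under stated assumptions: the tag SNP is taken to be in Hardy-Weinberg equilibrium with "a" as the minor allele, n1 = 600 is borrowed from the later simulation slide, and the function names are hypothetical. The HO design is left out because its oversampling rule is not given on the slide.

```python
def hwe_strata(N, maf):
    """Expected tag-SNP genotype counts (AA, Aa, aa), assuming HWE and minor allele 'a'."""
    p = 1.0 - maf
    return {"AA": N * p * p, "Aa": N * 2 * p * maf, "aa": N * maf * maf}


def srs_allocation(strata, n1):
    """Expected stage 1 count per stratum under simple random sampling (SRS)."""
    N = sum(strata.values())
    return {g: n1 * count / N for g, count in strata.items()}


def es_allocation(strata, n1):
    """Equal number per stratum (ES), capped by what each stratum can supply."""
    per_stratum = n1 / len(strata)
    return {g: min(per_stratum, count) for g, count in strata.items()}


strata = hwe_strata(N=5000, maf=0.2)      # about 3200 AA, 1600 Aa, 200 aa
srs = srs_allocation(strata, n1=600)      # about 384 AA, 192 Aa, 24 aa
es = es_allocation(strata, n1=600)        # about 200 from each stratum
for g in strata:
    print(g, round(strata[g]), round(srs[g]), round(es[g]))
```

Under these assumptions, ES with n1 = 600 would take essentially every rare homozygote from the GWAS sample, while SRS would be expected to include only about 24 of them.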
Quantitative Trait Model
QT Model Parameters: θ = (β0, β1, σ²)
Genetic Models: M1 = additive, M2 = dominant, M3 = recessive
Bayesian Inference: Stage 1 sample
(1) Specify priors for the genetic models and the regression parameters
p(Mj) = 1/3,  p(θ | Mj) = p(θ)
p(θ) = p(β0, β1 | σ²) p(σ²), normal-inverse-gamma (NIG)
(2) Derive model-specific posterior for the regression parameters for a
functional sequence variant – analytic when prior is NIG
(3) Select a genetic model for each seq variant according to the
posterior probability wj = p(Mj | data )
(4) Given selection of a genetic model, compare all seq variants in the
region by computing the posterior probability that variant k is
functional given all the data, and rank them (using the Bayes factor)
p(1) ≥ p(2) ≥ … ≥ p(m)
(5) Construct a 95% credible set that includes all variants such that
p(1) + p(2) + … + p(k) ≥ 0.95 for minimum k
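Steps (1) to (3) can be sketched for a single variant as follows. This is a minimal illustration, not the authors' implementation: it uses the standard closed-form marginal likelihood of a linear model under an NIG prior, with prior mean zero and hypothetical hyperparameters (v, a_prior, b_prior) that do not come from the slides. The toy data are deterministic and constructed to follow a dominant pattern.

```python
import math


def code_genotype(g, model):
    """Code a minor-allele count g (0, 1, 2) under a genetic model."""
    if model == "additive":
        return float(g)
    if model == "dominant":
        return 1.0 if g >= 1 else 0.0
    if model == "recessive":
        return 1.0 if g == 2 else 0.0
    raise ValueError(model)


def log_marginal_nig(x, y, v=100.0, a_prior=1.0, b_prior=1.0):
    """Log marginal likelihood of y = beta0 + beta1*x + e under the assumed prior
    (beta0, beta1) | sigma2 ~ N(0, sigma2*v*I), sigma2 ~ IG(a_prior, b_prior)."""
    n = len(y)
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, syy = sum(y), sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Posterior precision matrix V0^{-1} + X'X (2 x 2, symmetric)
    p11, p12, p22 = 1.0 / v + n, sx, 1.0 / v + sxx
    det = p11 * p22 - p12 * p12
    # Posterior mean = precision^{-1} X'y
    mean0 = (p22 * sy - p12 * sxy) / det
    mean1 = (p11 * sxy - p12 * sy) / det
    a_n = a_prior + n / 2.0
    b_n = b_prior + 0.5 * (syy - (mean0 * sy + mean1 * sxy))
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(det) - math.log(v)   # 0.5 * (log|Vn| - log|V0|)
            + a_prior * math.log(b_prior) - a_n * math.log(b_n)
            + math.lgamma(a_n) - math.lgamma(a_prior))


def model_weights(g, y, models=("additive", "dominant", "recessive")):
    """Posterior model probabilities wj; the equal priors p(Mj) = 1/3 cancel."""
    logml = [log_marginal_nig([code_genotype(gi, m) for gi in g], y) for m in models]
    top = max(logml)
    unnorm = [math.exp(value - top) for value in logml]
    total = sum(unnorm)
    return {m: u / total for m, u in zip(models, unnorm)}


# Toy data: carriers of the minor allele are shifted by 1, plus small alternating noise.
g = [0] * 50 + [1] * 40 + [2] * 10
y = [5.0 + (1.0 if gi >= 1 else 0.0) + (0.1 if i % 2 == 0 else -0.1)
     for i, gi in enumerate(g)]
w = model_weights(g, y)
best = max(w, key=w.get)
selected = best if w[best] > 0.833 else None   # strong criterion used in the simulations
print(w, selected)
```

Because the same prior p(θ) is used under every genetic model, the comparison between models is driven entirely by how well each genotype coding explains the trait.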
Criteria for a Good Design
Higher probability that the correct genetic model is identified
for the sequence variant
Fewer sequence variants selected into the credible set
(number and %) * cost
Higher probability that the functional sequence variant is
selected into the credible set * power
Higher probability that the functional sequence variant is
top ranked in the credible set
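Given per-variant posterior probabilities, the credible set and the design criteria above can be computed as in the sketch below. The probabilities and the index of the functional variant are made up for illustration, and the function names are hypothetical.

```python
def credible_set(post, level=0.95):
    """Smallest set of top-ranked variants whose posterior mass reaches `level`."""
    order = sorted(range(len(post)), key=lambda k: post[k], reverse=True)
    chosen, mass = [], 0.0
    for k in order:
        chosen.append(k)
        mass += post[k]
        if mass >= level:
            break
    return chosen


def design_criteria(post, functional, level=0.95):
    """Size of the credible set, and whether the functional variant is in it / on top."""
    cs = credible_set(post, level)
    return {
        "size": len(cs),
        "pct": 100.0 * len(cs) / len(post),
        "selected": functional in cs,
        "top_ranked": cs[0] == functional,
    }


# Hypothetical posterior probabilities for m = 5 variants (they sum to 1).
post = [0.02, 0.60, 0.25, 0.08, 0.05]
res = design_criteria(post, functional=2)
print(res)
```

Averaging "selected" and "top_ranked" over simulation replicates would give estimates of the selection and ranking probabilities used to compare designs.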
Simulation Design (APOE gene region, 1KG)
Quantitative trait model:
Y = β0 + β1 X + γ 1(X=1) + ε
Parameters: β0 = 5, β1 = 0.25,
σ² = 0.1, 0.5, 1.5,
corresponding to σ/β1 = 1.3, 2.8, 4.9
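The noise-to-effect ratios follow directly from the stated parameters, and the trait model is straightforward to simulate. The sketch below makes several assumptions that are not on the slide: genotypes are drawn under Hardy-Weinberg equilibrium rather than from the 1KG APOE haplotypes, X is the minor-allele count, and γ = 0, γ = β1 and γ = −β1 are taken to give the additive, dominant and recessive models respectively.

```python
import math
import random

BETA0, BETA1 = 5.0, 0.25
SIGMA2 = [0.1, 0.5, 1.5]

# sigma / beta1 for each residual variance, rounded as on the slide
ratios = [round(math.sqrt(s2) / BETA1, 1) for s2 in SIGMA2]
print(ratios)


def simulate_trait(n, maf, sigma2, gamma=0.0, seed=0):
    """Draw (X, Y) pairs from Y = beta0 + beta1*X + gamma*1(X=1) + e, e ~ N(0, sigma2)."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    data = []
    for _ in range(n):
        # Minor-allele count from two independent alleles (HWE assumption)
        x = int(rng.random() < maf) + int(rng.random() < maf)
        y = BETA0 + BETA1 * x + gamma * (1 if x == 1 else 0) + rng.gauss(0.0, sd)
        data.append((x, y))
    return data


additive = simulate_trait(n=600, maf=0.2, sigma2=1.5, gamma=0.0)
dominant = simulate_trait(n=600, maf=0.2, sigma2=1.5, gamma=BETA1)
recessive = simulate_trait(n=600, maf=0.2, sigma2=1.5, gamma=-BETA1)
```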
Simulation Results: Genetic model selection
Designs: SRS (solid line), ES (dashed line), HO (dotted line)
Data simulated under additive, dominant and recessive genetic models.
The rate of selecting the true genetic model for the functional variant, using the strong criterion wj > 0.833.
Common seq variant (MAF=0.2)
1000 simulations
Simulation Results: Size of the 95% credible set
Designs: SRS (solid line), ES (dashed line), HO (dotted line)
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Simulation Results: Selection of functional variant
Designs: SRS (solid line), ES (dashed line), HO (dotted line)
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Simulation Results: Functional variant top ranked
Designs: SRS (solid line), ES (dashed line), HO (dotted line)
Data simulated under additive, dominant and recessive genetic models.
Upper panels: common variant (MAF=0.2) with σ/β1=4.9 (m=201)
Lower panels: low frequency variant (MAF=0.02) with σ/β1=2.8 (m=332)
1000 simulations
Simulation Results: Model selection
Data simulated under additive, dominant and recessive genetic models.
For cases without model selection (no MS), analysed under an additive model.
Common seq variant (MAF=0.2), σ/β1=4.9, n1=600, 1000 simulations
Simulation Results: Cost Efficiency (CE)
A total of m sequence variants are identified in n1 individuals in stage 1,
and a proportion q = (m2 / m) are genotyped in n2=N-n1 in stage 2.
Cost depends on c1, the stage 1 per individual sequencing cost,
and on c2, the stage 2 per individual per marker genotyping cost.
e.g. if N = 5000, n1=500, c1=$1000, n2=4500, m2=100, and c2=$0.50,
then the total two-stage cost is $500,000 + $225,000 = $725,000
compared to a one-stage cost of $5 million.
CE is defined as “Power” / Cost, where “Power” is estimated by the probability
that a functional variant falls within the 95% credible set
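The cost comparison in the example can be reproduced with a short calculation. The function names are hypothetical, and the power value in the last line is a placeholder rather than a simulation result.

```python
def two_stage_cost(n1, c1, n2, m2, c2):
    """Stage 1 sequencing of n1 individuals plus stage 2 genotyping of m2 markers in n2."""
    return n1 * c1 + n2 * m2 * c2


def one_stage_cost(N, c1):
    """Sequencing all N individuals."""
    return N * c1


def cost_efficiency(power, cost):
    """CE = "Power" / Cost, as defined on the slide."""
    return power / cost


N, n1, c1, m2, c2 = 5000, 500, 1000.0, 100, 0.50
total = two_stage_cost(n1, c1, N - n1, m2, c2)   # 500,000 + 225,000
full = one_stage_cost(N, c1)
print(total, full)

# Placeholder power of 0.9, purely to show how CE would be formed.
ce = cost_efficiency(0.9, total)
```

Because stage 2 cost scales with m2, a design that yields a smaller credible set in stage 1 lowers the total cost as well as the genotyping burden.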
Comments and Discussion
• Incorporating Bayesian genetic model selection is worthwhile
• Selection of informative individuals for expensive data collection
can be a useful strategy in statistical genetic design and analysis
• The simulations confirm the intuition that the efficiency of the tag-stratified sampling strategy increases with tag-seq correlation.
• Winner’s curse effects propagate from the GWAS, but are more
complicated
• Cost-efficiency of a two-stage design depends on the relative
costs of sequencing versus genotyping – will it remain practical?
• Analysis of the sequence data limited to low frequency and
common variants – extensions to rare variants
• Other design options – trait-dependent sampling
• How to conduct joint Bayesian inference for stages 1 and 2?
Acknowledgements
Co-Authors:
Zhijian Chen, STAGE Post-doctoral Fellow
Radu Craiu, Dept of Statistical Sciences
Thanks to Laura Faye and Andrew Paterson for helpful
discussions, and to referees for improvements to the paper.
To appear in Genetic Epidemiology
Funding
Thanks
Simulation Results Summary
In stage 1, a total of m variants are sequenced in n1 = 500 individuals, with
equal strata sampling (ES) and an additive genetic model.
Size is the number m2 of sequence SNPs in the 95% credible set (% or count).
P(Select) is the probability the functional variant is selected into the credible set.
P(Rank) is the probability the functional variant is top ranked in the credible set.
GWAS Sample Size