Ohio State Talk, October 2004
Gene-Environment Case-Control
Studies
Raymond J. Carroll
Department of Statistics
Faculty of Nutrition
Texas A&M University
http://stat.tamu.edu/~carroll
Outline
• Problem: Can more efficient inference be done
assuming gene (G) and environment (X)
independence?
• Gene-Environment independence: the case-only
method
• Profile likelihood approach
• Efficiency gains
• Example
• Conclusions
Acknowledgment
• This work is joint with Nilanjan Chatterjee,
National Cancer Institute
• Papers in: Biometrika, Genetic Epidemiology
http://dceg.cancer.gov/people/ChatterjeeNilanjan.html
Outline
• Theoretical Methods:
• With real G and X independence, we used a profile
likelihood method based on nonparametric maximum
likelihood
• (Key insight) Equivalent to a device of pretending the
study is a regular random sample subject to missing
data
• (This allows) generalization to any parametric model
for G given X.
A Little Terminology
• Epidemiologists: Case-control sample
• Econometricians: Choice-based sample
• These are exactly the same problem
• Subjects have two choices (or disease states)
• Subjects have their covariates sampled
conditional on their choices, i.e.,
• Random sample from those with disease
• Random sample from those without disease
Basic Problem Formalized
• Case control sample: D = disease
• Gene expression: G
• Environment: X
• Strata: S
• We are interested in main effects for G and (X,S)
along with their interaction
Prospective Models
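• Simplest logistic model
$$\mathrm{pr}(D=1\mid G,X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3\, G X)$$
• General logistic model
$$\mathrm{pr}(D=1\mid G,X) = H\{\beta_0 + m(G,X,\beta_1)\}$$
• Here H(t) = 1/{1+exp(-t)} is the logistic distribution function
• The function m(G,X,β1) is completely general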
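A minimal sketch of the two prospective models in code, assuming binary or numeric G and X. The function names (`H`, `m_linear`, `pr_disease`) and the ordering of the coefficients inside `beta1` are illustrative choices, not part of the talk.

```python
import math

def H(t):
    """Logistic distribution function H(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def m_linear(g, x, beta1):
    """Simplest choice of m: main effects of G and X plus their product.

    beta1 = (b_g, b_x, b_gx) is an assumed ordering for this sketch.
    """
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x

def pr_disease(g, x, beta0, beta1, m=m_linear):
    """General logistic model: pr(D=1 | G=g, X=x) = H{beta0 + m(g, x, beta1)}."""
    return H(beta0 + m(g, x, beta1))
```

Any other function m(g, x, beta1) can be passed through the `m` argument, which is the sense in which the general model leaves m unrestricted.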
Case-Control Data
• Case-control data are not a random sample
• We observe (G,X) given D, i.e., we observe the covariates
given the response, not vice-versa
• If we had a random sample, linear logistic regression
would be used to fit the model
• Obvious idea: ignore the sampling plan and pretend
you have a random sample
Case-Control Data
• Known Fact: The intercept is not identified, rest
of the model is identified
• The retrospective odds are given by
$$\frac{\mathrm{pr}(G=g,X=x\mid D=1)}{\mathrm{pr}(G=g,X=x\mid D=0)} = \exp\{\beta_0 + m(g,x,\beta_1) - \log(\pi_1/\pi_0)\}, \qquad \pi_d = \mathrm{pr}(D=d)$$
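A short derivation of the retrospective odds, assuming only Bayes' rule and the general logistic model pr(D=1|G,X) = H{β0 + m(G,X,β1)}, with π_d = pr(D=d). The unknown joint distribution of (G,X) cancels, which is why only the intercept is affected:

```latex
\begin{align*}
\frac{\mathrm{pr}(G=g,X=x\mid D=1)}{\mathrm{pr}(G=g,X=x\mid D=0)}
&= \frac{\mathrm{pr}(D=1\mid g,x)\,\mathrm{pr}(g,x)/\pi_1}
        {\mathrm{pr}(D=0\mid g,x)\,\mathrm{pr}(g,x)/\pi_0} \\
&= \exp\{\beta_0 + m(g,x,\beta_1)\}\,\frac{\pi_0}{\pi_1} \\
&= \exp\{\beta_0 + m(g,x,\beta_1) - \log(\pi_1/\pi_0)\}.
\end{align*}
```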
Alternative Derivation: Ignore
Sampling Plan
• Consider a prospective study
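• Let Δ = 1 mean selection into the study
• Pretend
$$\mathrm{pr}(\Delta=1\mid D=d,G,X) \propto n_d/\mathrm{pr}(D=d), \qquad n_d = \text{number of observations with } D=d$$
• Then compute, with π_d = pr(D=d),
$$\mathrm{logit}\{\mathrm{pr}(D=1\mid\Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)$$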
Case-Control Data
$$\mathrm{logit}\{\mathrm{pr}(D=1\mid\Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)$$
• Fact: all parameters except the intercept can be
estimated consistently while ignoring the
sampling plan
• Standard Errors: Those computed ignoring the
sampling plan are asymptotically correct
Case-Control Data
$$\mathrm{logit}\{\mathrm{pr}(D=1\mid\Delta=1,X=x,G=g)\} = \beta_0 + \log(n_1/n_0) - \log(\pi_1/\pi_0) + m(g,x,\beta_1)$$
• The intercept is determined by pr(D=1) in the
population, hence not identified from these data
• Little Known Fact: Adding information about
pr(D=1) adds no information about β1
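The facts on the last three slides can be checked with a small simulation. This is a sketch under assumed settings: the parameter values, the population size, the sample sizes and the helper `fit_logistic` are all hypothetical, and m is taken to be linear with a G times X product term.

```python
import numpy as np

def fit_logistic(Z, d, iters=25):
    """Newton-Raphson for ordinary (prospective) logistic regression."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (d - p)
        hess = (Z * (p * (1 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
N = 200_000                                   # hypothetical population size
beta_true = np.array([-4.0, 0.5, 0.3, 0.4])   # hypothetical (b0, bG, bX, bGX)
G = rng.binomial(1, 0.2, N)                   # G and X generated independently
X = rng.normal(size=N)
Z = np.column_stack([np.ones(N), G, X, G * X])
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-Z @ beta_true)))

# Case-control sampling: n1 cases and n0 controls drawn given D
n1 = n0 = 3000
cases = rng.choice(np.flatnonzero(D == 1), n1, replace=False)
controls = rng.choice(np.flatnonzero(D == 0), n0, replace=False)
idx = np.concatenate([cases, controls])

# Ignore the sampling plan and fit as if the data were a random sample
beta_hat = fit_logistic(Z[idx], D[idx].astype(float))

# Predicted intercept shift: log(n1/n0) - log(pi1/pi0)
pi1 = D.mean()
offset = np.log(n1 / n0) - np.log(pi1 / (1 - pi1))
print("slopes (fit, truth):", beta_hat[1:], beta_true[1:])
print("intercept (fit, beta0 + offset):", beta_hat[0], beta_true[0] + offset)
```

The slopes should land near their true values, while the fitted intercept tracks β0 plus the offset rather than β0 itself.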
Gene-Environment Independence
• In many situations, it may be reasonable to
assume G and X are independently distributed in
the underlying population, possibly after
conditioning on strata
• This assumption is often used in gene-environment interaction studies
G-E Independence: Discussion
• Does not always hold!
• Example: polymorphisms in the smoking
metabolism pathway may affect the degree of
addiction
• If False: Possible severe bias (Albert, et al.,
2001, our own simulations)
G-E Independence: Discussion
• It is reasonable in many problems
• Example: Environment is a treatment in a
randomized study under nested case-control
sampling
• Example: Reasonable when exposure is not
directly controlled by individual behavior
• Radiation exposure for A-bomb survivors
• Carcinogenic exposure of employees
• Pesticide exposure in a rural community
Generalizations
• I have phrased this problem as one where G
and X are independent given strata
• This makes sense contextually in genetic
epidemiology
• All the results I will describe go through if
you can write down a probability model for G
given (X,S): I do this in the Israeli Study.
Generalizations
• If G is binary, it is natural to apply our approach
• Posit a parametric or semiparametric model for
G given (X,S)
• Consequences:
• More efficient estimation of G effects
• Much more efficient estimation of G and (X,S)
interactions.
Gene-Environment Independence
• Rare Disease Approximation: Rare disease for all
values of (G,X)
• May be unreasonable for important genes such as
BRCA1/2
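• Case-only estimate of multiplicative interaction
(Piegorsch, et al., 1994)
$$\mathrm{pr}(D=1\mid G,X) = H\{\beta_0 + \beta_x X + \beta_g G + \beta_{xg} XG\}$$
$$\frac{\mathrm{pr}(G=1\mid X=1,D=1)\,\mathrm{pr}(G=0\mid X=0,D=1)}{\mathrm{pr}(G=0\mid X=1,D=1)\,\mathrm{pr}(G=1\mid X=0,D=1)} \approx \exp(\beta_{xg})$$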
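With binary G and X, the case-only estimate reduces to the log odds ratio between G and X in the 2x2 table of cases. A minimal sketch follows; the function name and the counts in the usage line are hypothetical, and the standard error is the usual large-sample formula for a 2x2 log odds ratio rather than anything specific to the talk.

```python
import math

def case_only_interaction(n11, n10, n01, n00):
    """Case-only estimate of beta_xg from CASES only.

    n_gx = number of cases with G=g and X=x (binary G and X).
    Returns the log odds ratio between G and X among cases and its
    large-sample standard error (square root of the summed reciprocal counts).
    """
    log_or = math.log((n11 * n00) / (n10 * n01))
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return log_or, se

# Hypothetical case counts, for illustration only
est, se = case_only_interaction(n11=40, n10=20, n01=60, n00=120)
```

No controls enter the calculation, which is why the main effects of G and X cannot be recovered this way.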
Gene-Environment Independence:
Case-Only Analysis
• Positive Consequence: Often much more
powerful than standard analysis
• The power advantage of this method has often led
researchers to discard information on controls
• Negative Consequence: no ability to estimate
other risk parameters, which are often of greater
interest (see example later)
• Restrictions: Can handle only multiplicative
interaction; requires rare disease for all values of
(G,X)
Gene-Environment Independence
• Fact: gain in power for inference about a
multiplicative interaction
• Consequence: There is thus (Fisher)
information in the assumption
• Conjecture: Can handle general models and
improve efficiency for all parameters
• We do this via a semiparametric profile likelihood
approach
• We start though from a different likelihood
Prentice-Pyke Calculation
• Methodology: Start with the retrospective
likelihood
$$\mathrm{pr}(G=g,X=x\mid D=d) = \frac{\mathrm{pr}(X=x,G=g)\,\exp[d\{\beta_0+m(g,x,\beta_1)\}]\,[1-H\{\beta_0+m(g,x,\beta_1)\}]}{\sum_{x',g'}\mathrm{pr}(X=x',G=g')\,\exp[d\{\beta_0+m(g',x',\beta_1)\}]\,[1-H\{\beta_0+m(g',x',\beta_1)\}]}$$
• The distribution of (X,G) in the population is left
unspecified
• Semiparametric MLE is usual logistic regression
Environment and Gene Expression
• Methodology: Start with the retrospective
likelihood
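$$\mathrm{pr}(G=g,X=x\mid D=d) = \frac{\mathrm{pr}(X=x)\,\mathrm{pr}(G=g)\,\exp[d\{\beta_0+m(g,x,\beta_1)\}]\,[1-H\{\beta_0+m(g,x,\beta_1)\}]}{\sum_{x',g'}\mathrm{pr}(X=x')\,\mathrm{pr}(G=g')\,\exp[d\{\beta_0+m(g',x',\beta_1)\}]\,[1-H\{\beta_0+m(g',x',\beta_1)\}]}$$
• Note how independence of G and X is used here:
the joint distribution of (X,G) factors as pr(X=x)pr(G=g)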
• We do not want to model the often
multivariate distribution of X
• Gene distribution model can be standard
Environment and Gene Expression
• Methodology: Compute a profile estimate
• Parametric/semiparametric distribution for G
• Nonparametric distribution for X (possibly high
dimensional)
• Result: Explicit profile likelihood
Environment and Gene Expression
• Methodology: Treat λi = pr(X=xi) as distinct
parameters
• Let G have parametric structure: pr(G=g) = f(g,θ)
• Construct the profile likelihood, having estimated
the λi as functions of the data and the other
parameters Ω = {θ, β0, β1, π1 = pr(D=1)}
• The result is a function of Ω: this function can
be calculated explicitly!
Profile Likelihood
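• Result:
$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\};$$
$$S(d,g,x,\Omega) = \frac{f(g,\theta)\,\exp[d\{\kappa + m(g,x,\beta_1)\}]}{1+\exp\{\beta_0 + m(g,x,\beta_1)\}}$$
$$\text{Profile Likelihood} = L(\beta_0,\beta_1,\kappa,\theta) = L(\Omega) = \frac{S(D,G,X,\Omega)}{\sum_{d=0}^{1}\int S(d,g,X,\Omega)\,d\mu(g)}$$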
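A sketch of the profile log-likelihood for the simplest case, assuming binary G with pr(G=1) = θ, a linear m with a G times X product, and independent subjects so that the per-subject terms multiply. The function names and the parameter packing `omega = (beta0, beta1, kappa, theta)` are choices made for this sketch.

```python
import numpy as np

def S(d, g, x, beta0, beta1, kappa, theta):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa+m}] / [1 + exp{beta0+m}].

    Assumptions of this sketch: G is binary with f(1,theta) = theta,
    and m(g,x,beta1) = bg*g + bx*x + bgx*g*x.
    """
    bg, bx, bgx = beta1
    m = bg * g + bx * x + bgx * g * x
    f = theta ** g * (1.0 - theta) ** (1 - g)
    return f * np.exp(d * (kappa + m)) / (1.0 + np.exp(beta0 + m))

def profile_loglik(omega, D, G, X):
    """Sum over subjects of log[ S(D,G,X) / sum_{d,g} S(d,g,X) ]."""
    beta0, beta1, kappa, theta = omega
    num = S(D, G, X, beta0, beta1, kappa, theta)
    den = sum(S(d, g, X, beta0, beta1, kappa, theta)
              for d in (0, 1) for g in (0, 1))
    return float(np.sum(np.log(num) - np.log(den)))
```

In practice this function would be handed to a numerical optimizer over Ω; with binary G the integral over g is just a two-term sum.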
Alternative Derivation
• Consider a prospective study
• Let Δ = 1 mean selection into the study
• Pretend
$$\mathrm{pr}(\Delta=1\mid D=d,G,X) \propto n_d/\mathrm{pr}(D=d), \qquad n_d = \text{number of observations with } D=d$$
• Then compute
pr(D=d,G=g|Δ=1,X)
• This is exactly our profile pseudolikelihood!
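A sketch of why the two coincide, assuming pr(G=g|X) = f(g,θ), π_d = pr(D=d), and κ = β0 + log(n1/n0) - log(π1/π0). Factors that do not depend on (d,g) are absorbed into the proportionality sign:

```latex
\begin{align*}
\mathrm{pr}(D=d,G=g\mid\Delta=1,X=x)
&\propto \mathrm{pr}(\Delta=1\mid D=d)\,\mathrm{pr}(D=d\mid g,x)\,\mathrm{pr}(G=g\mid x) \\
&\propto \frac{n_d}{\pi_d}\,
   \frac{\exp[d\{\beta_0+m(g,x,\beta_1)\}]}{1+\exp\{\beta_0+m(g,x,\beta_1)\}}\,f(g,\theta) \\
&\propto \frac{f(g,\theta)\,\exp[d\{\kappa+m(g,x,\beta_1)\}]}{1+\exp\{\beta_0+m(g,x,\beta_1)\}},
\end{align*}
```

using n_d/π_d = (n0/π0){(n1/n0)(π0/π1)}^d. Normalizing the last line over d = 0, 1 and over g gives the ratio that defines the profile likelihood.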
Alternative Derivation
• We compute:
pr(D=d,G=g|Δ=1,X)
• Standard approach computes
pr(D=d|G=g,Δ=1,X)
• It is this insight that allows us to greatly
generalize the work past independence of G and
X.
Computation
• Intercept: The logistic intercept, and hence
pr(D=1), is weakly identified by itself
• Disease rate: If pr(D=1) is known, or a good
bound for it is specified, there can be significant
gains in efficiency.
• This does not happen for a regular case-control study
Interesting Technical Point
• Profile pseudo-likelihood acts like a likelihood
• Information Asymptotics are (almost) exact
• Missing G data handled seamlessly (see next)
• Missing genotype
• Unphased haplotype data
Missing Data
• We have a formal likelihood:
pr(D=d,G=g|Δ=1,X)
• If gene is missing, suggests the formal
likelihood
$$\mathrm{pr}(D=d\mid\Delta=1,X) = \sum_{g^*}\mathrm{pr}(D=d,G=g^*\mid\Delta=1,X)$$
• Result: Inference as if the data were a
random sample with missing data
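A sketch of the per-subject likelihood contribution with and without an observed genotype, assuming binary G with pr(G=1) = θ and a linear m with a G times X product. Here `g=None` is this sketch's convention for a missing genotype, and κ denotes the shifted intercept β0 + log(n1/n0) - log{pr(D=1)/pr(D=0)}.

```python
import math

def S(d, g, x, beta0, beta1, kappa, theta):
    """f(g,theta) exp[d{kappa+m}] / [1 + exp{beta0+m}] for binary g."""
    bg, bx, bgx = beta1
    m = bg * g + bx * x + bgx * g * x
    f = theta if g == 1 else 1.0 - theta
    return f * math.exp(d * (kappa + m)) / (1.0 + math.exp(beta0 + m))

def log_contribution(d, g, x, beta0, beta1, kappa, theta):
    """log pr(D=d, G=g | Delta=1, X=x), or log pr(D=d | Delta=1, X=x) if g is None."""
    args = (beta0, beta1, kappa, theta)
    den = sum(S(dd, gg, x, *args) for dd in (0, 1) for gg in (0, 1))
    # Missing genotype: sum the joint probability over all possible genotypes
    genotypes = (0, 1) if g is None else (g,)
    num = sum(S(d, gg, x, *args) for gg in genotypes)
    return math.log(num / den)
```

Subjects with and without genotype data then enter one log-likelihood by adding their contributions.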
Measurement Error
• The likelihood formulation also allows us to deal
with measurement error in the environmental
variables
First Simulation
• MSE Efficiency of Profile method: 0.02 <
pr(D=1) < 0.07
[Figure: MSE efficiency (vertical axis 0 to 4) for the G, X, and G times X effects, shown for pr(G)=.05 and pr(G)=.20]
Israeli Ovarian Cancer Study
• Population based case-control study
• Study the interplay of BRCA1/2 mutations (G)
and two known risk factors (E or X) of ovarian
cancer:
• oral contraceptive (OC) use
• parity.
• Missing Data: Approximately 50% of the
controls and 10% of the cases were not
genotyped
Israeli Ovarian Cancer Study
• Results reported in Modan et al., NEJM (2001).
• Their analysis involves
• Assumption that parity and OC use are
independent of BRCA1/2 mutation status
• Simple but approximate methods for
exploiting G and E independence assumption
(including case-only estimate of interaction)
• Risk model adjusted for Age, Race, Family
History, History of Gynecological Surgery
Israeli Ovarian Cancer Study
• Disease risk model including same covariates as
Modan et al (2001)
• In addition, we explicitly adjusted for the
possibility of both G and E being related to S
• FH = family history (breast cancer = 1, ovarian
or >= 2 breast cancer = 2)
$$\mathrm{logit}\{\mathrm{Pr}(G=1\mid S)\} = \beta_0 + \beta_{\mathrm{Ash}}\,I(\mathrm{Ash})$$
Israeli Ovarian Cancer Study
• Question: Can carriers be protected via OC use?
• The logarithm of the odds ratio is the sum of
• The main effect for OC use
• The interaction term between OC use and being a
carrier, i.e., interaction between gene and
environment
• Note how this involves main effects and
interactions
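A sketch of how the carrier-specific odds ratio and a Wald interval combine the two coefficients. The function name and all numerical inputs are hypothetical placeholders, not values from the study; the variance of the sum needs the covariance between the two estimates.

```python
import math

def or_among_carriers(beta_oc, beta_int, var_oc, var_int, cov, z=1.96):
    """Odds ratio for OC use among carriers, with a Wald interval.

    log OR = main effect of OC use + (OC use x carrier) interaction;
    Var(sum) = var_oc + var_int + 2 * cov.
    """
    est = beta_oc + beta_int
    se = math.sqrt(var_oc + var_int + 2.0 * cov)
    return math.exp(est), math.exp(est - z * se), math.exp(est + z * se)

# Hypothetical inputs, for illustration only
odds_ratio, lower, upper = or_among_carriers(
    beta_oc=-0.05, beta_int=0.09, var_oc=0.0004, var_int=0.0009, cov=-0.0003)
```

Because the main effect enters the sum, a case-only analysis, which estimates only the interaction, cannot produce this quantity.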
Israeli Ovarian Cancer Study
• Question: Is there a carrier/OC interaction?
• The case-only method can answer only this
question
Israeli Ovarian Cancer Study
• Interaction of OC and BRCA1/2:
Israeli Ovarian Cancer Study
• Main Effect of BRCA1/2:
Israeli Ovarian Cancer Study
• Odds ratio for OC use among carriers = 1.04
(0.98, 1.09)
• No evidence for protective effect
• Not available from case-only analysis
• The interval is ½ the length of the one from
the usual analysis
Features of the Method
• Allows estimation of all parameters of logistic
regression model and can be used to examine
interaction in alternative scales
• Can be used to estimate OR for non-rare
diseases
• Important for studying major genes such as
BRCA1/2
Features of the Method
• Allows incorporation of external information on
Pr(D=1)
• Unlike with logistic regression in case-control
studies, this information improves efficiency of
estimation
Colorectal Adenoma Study
• PLCO Study: 772 cases, 772 controls
• Three SNPs in the calcium-sensing receptor
region
• HWE assumed
• Interest in the interaction of number of copies of
one haplotype (GCG) and calcium intake from
diet
Colorectal Adenoma Study
• Method #1: Write down the prospective
likelihood and apply missing data techniques
• A standard analysis
• If ignoring the case-control sampling scheme works for
ordinary logistic regression, it should work for missing
haplotype regression too, right?
• Wrong! Biased estimates and standard errors
• Method #2: Our method
Colorectal Adenoma Study
Conclusions
• Standard case-control (choice-based)
studies
• Specify a model for G given X, e.g., G-E independence
in population after conditioning on strata
• No assumptions made about X (high dimensional)
• All parameters estimable, no rare-disease assumption
• Handle missing G data
• Large gains in efficiency versus usual method
• Large gains in efficiency for effects of environment
given the gene
Conclusions
• Theoretical Methods:
• With real G and X independence, we used a profile
likelihood method based on nonparametric maximum
likelihood
• (Key insight) Equivalent to a device of pretending
that the study is a regular random sample subject to
missing data
• (This allows) generalization to any parametric model
for G given X.
Acknowledgment
• Two graduate students have worked on this
project
Iryna Lobach, Yale
Christie Spinka, U of Missouri
Thanks!
http://stat.tamu.edu/~carroll