Transcript Slide 1

Bayesian Variable Selection in
Semiparametric Regression
Modeling with Applications to Genetic
Mapping
Fei Zou
Department of Biostatistics
Email: [email protected]
Outline
• Introduction
– Experimental crosses
– Existing QTL Mapping Methods
• Bayesian semi-parametric QTL Mapping
• Results
• Remarks and Conclusions
http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt
Overview
• One gene one trait: very unlikely
• The vast majority of biological traits are
caused by complex polygenes
– Potentially interacting with each other
• Most traits have significant environmental
exposure components
– Potentially interacting with polygenes
Experimental Crosses: F2
[Figure: parental lines P1 and P2.]
Experimental Crosses
• F2: P1 × P2 gives F1 (AB); F1 × F1 gives F2 with genotypes AA, AB and BB
• Backcross (BC): F1 (AB) crossed back to a parental line gives BC with genotypes AA and AB
QTL Data Format
0: homozygous AA, 2: homozygous BB,
1: heterozygote AB.
Marker positions:
Linkage Analysis
• Data structure:
– Marker data (genotypes plus positions)
– Phenotypic trait(s)
– Other nongenetic covariates, such as age,
gender, environmental conditions etc
• Quantitative trait loci (QTL): a particular
region of the genome containing one or
more genes that are associated with the
trait being assayed or measured
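A minimal sketch of how the data structure listed above might be held in memory, using the 0/1/2 genotype coding from the QTL Data Format slide. All variable names and values here are hypothetical toy data, not taken from any study.

```python
# Genotype coding from the slide: 0 = AA, 1 = AB, 2 = BB.
GENOTYPE_LABELS = {0: "AA", 1: "AB", 2: "BB"}

# Hypothetical toy data: 3 individuals typed at 3 markers on one chromosome.
marker_positions_cM = [0.0, 5.0, 10.0]   # one position per marker
genotypes = [                            # rows: individuals, columns: markers
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 0],
]
phenotype = [10.2, 11.5, 12.1]           # one trait value per individual
covariates = [                           # non-genetic covariates per individual
    {"age": 8, "gender": "F"},
    {"age": 9, "gender": "M"},
    {"age": 8, "gender": "M"},
]


def decode(row):
    """Translate one individual's 0/1/2 codes into genotype labels."""
    return [GENOTYPE_LABELS[g] for g in row]


# Basic consistency checks between the pieces of the data set.
assert all(len(row) == len(marker_positions_cM) for row in genotypes)
assert len(genotypes) == len(phenotype) == len(covariates)
```

Keeping marker positions separate from the genotype matrix mirrors the "genotypes plus positions" split on the slide.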
QTL Mapping of Experimental
Crosses
• Single QTL Mapping
• Single marker analysis (Sax, 1923 Genetics)
• Interval mapping: Lander & Botstein (1989,
Genetics)
• Multiple QTL mapping
• Composite interval mapping (Zeng 1993 PNAS,
1994 Genetics; Jansen & Stam, 1994 Genetics)
• Multiple interval mapping (Kao et al., 1999
Genetics)
• Bayesian analysis (Satagopan et al., 1997
Genetics)
Single QTL Interval Mapping
• For backcross, the model assumes
$y_i \mid \text{QTL genotype is AA} \sim N(\mu_{AA}, \sigma^2)$
$y_i \mid \text{QTL genotype is Aa} \sim N(\mu_{Aa}, \sigma^2)$
• QTL analysis:
$H_0: \mu_{AA} = \mu_{Aa}$ vs $H_A: \mu_{AA} \neq \mu_{Aa}$
• If QTL genotypes are observed, the analysis is
trivial: simple t-test!
• However, QTL position is unknown and
therefore QTL genotypes are unobserved
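To illustrate the "simple t-test" case, here is a small sketch of the pooled two-sample t statistic for a backcross when QTL genotypes are treated as observed. The function name and the trait values are made up for illustration; the equal-variance form matches the common variance assumed in the model above.

```python
import math


def pooled_t_statistic(group_a, group_b):
    """Two-sample t statistic with a pooled (common) variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Within-group sums of squares around each group mean.
    ss_a = sum((y - mean_a) ** 2 for y in group_a)
    ss_b = sum((y - mean_b) ** 2 for y in group_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean_a - mean_b) / se


# Hypothetical trait values split by (observed) QTL genotype.
y_AA = [9.8, 10.1, 10.4, 9.9]
y_Aa = [11.0, 11.3, 10.8, 11.2]
t_stat = pooled_t_statistic(y_AA, y_Aa)  # compare to a t distribution with n_a + n_b - 2 df
```

In practice the genotype groups are not known, which is what motivates interval mapping.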
Interval Mapping
• For QTL between markers
– QTL genotypes missing: can use marker genotypes to
infer the conditional probabilities of the QTL
genotypes for a given QTL position
– Profile likelihood (LOD score) calculated across the
whole genome or candidate regions using EM
algorithm
– In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the
position with the highest LOD score.
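The first step above, inferring QTL genotype probabilities from flanking markers, can be sketched for a backcross as follows. This assumes Haldane's map function (no crossover interference) and a 0/1 coding for AA/AB; the slides do not state which map function was used, and the function names are illustrative.

```python
import math


def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane's map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))


def prob_qtl_AA(left, right, d_left_cM, d_right_cM):
    """P(QTL genotype is AA | flanking marker genotypes) in a backcross.

    left, right: flanking marker genotypes, 0 for AA and 1 for AB.
    d_left_cM, d_right_cM: distances from the QTL position to each marker.
    Undefined when both distances are 0 and the two markers disagree.
    """
    r1 = haldane_r(d_left_cM)
    r2 = haldane_r(d_right_cM)

    def transition(a, b, r):
        # Probability that adjacent loci carry genotypes a and b:
        # no recombination if they agree, recombination if they differ.
        return 1.0 - r if a == b else r

    # The prior 1/2 for each QTL genotype cancels in the ratio.
    w_AA = transition(left, 0, r1) * transition(0, right, r2)
    w_AB = transition(left, 1, r1) * transition(1, right, r2)
    return w_AA / (w_AA + w_AB)
```

These conditional probabilities act as mixture weights in the likelihood at each putative QTL position, which the EM algorithm then maximizes to obtain the LOD profile.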
Profile LOD
[Figure: profile LOD score (vertical axis, 0 to 8) across chromosomes 1 to 19 and X (horizontal axis).]
Multiple QTL Mapping
• Most complex traits are caused by
multiple (potentially interacting) genes,
which also interact with environmental
stimuli
– Drawbacks of single QTL interval mapping in this setting:
• Ghost QTL (Lander & Botstein 1989)
• Low power
Multiple QTL Mapping
• Composite interval mapping (Zeng 1993, 1994;
Jansen & Stam 1994): searches for a putative QTL in
a given region while simultaneously fitting partial
regression coefficients for "background markers" to
adjust for the effects of other QTLs outside the region
• Open choices: which background markers to include, window size, etc.
• Multiple interval mapping (Kao et al. 1999): fits
multiple QTLs simultaneously
• Computationally intensive; how many QTLs to include?
Multiple QTL Mapping
• Bayesian methods (Stephens and Fisch 1998
Biometrics; Sillanpaa and Arjas 1998
Genetics; Yi and Xu 2002 Genetic Research,
and Yi et al. 2003 Genetics): treat the number
of QTLs as a parameter by using reversible
jump Markov chain Monte Carlo (MCMC) of
Green (1995 Biometrika)
• Drawback: the model dimension changes between moves, and the acceptance
probability for such dimension changes may not be handled correctly in
practice (Ven 2004 Genetics)
Multiple QTL Mapping
• Alternatively, multiple QTL mapping can be
viewed as a variable selection problem
– Forward and step-wise selection procedures (Broman
and Speed 2002 JRSSB)
– LASSO, etc
– Bayesian QTL mapping
• Xu (2003 Genetics), Wang et al (2005 Genetics), Huang et al
(2007 Genetics): Bayesian shrinkage
• Yi et al (2003 Genetics): stochastic search variable selection
(SSVS) of George and McCulloch (1993 JASA)
• Yi (2004 Genetics): composite model space of Godsill (2001
J. Comp. Graph. Stat)
• Software: R/qtlbim by Yi’s group
Multiple QTL Mapping
• Limitations of existing QTL mapping
methods
– do not model covariates at all or only
model covariate effect linearly
– do not model interactions at all or
model only lower order interactions,
such as two way interactions
• Multiple QTL mapping is a very large
variable selection problem: for p potential genes,
with p in the hundreds or thousands, there
are $2^p$ possible main effect models, $2^{\binom{p}{2}}$ possible
sets of two-way interactions and $2^{\binom{p}{k}}$ possible sets of higher
order (k > 2) interactions.
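To give a feel for these counts, the sketch below reports them on a log10 scale, since the raw numbers are far too large to print usefully. The function name and the choice of p = 100, k = 3 are illustrative only.

```python
import math


def log10_model_counts(p, k):
    """log10 of the number of candidate models for p genes.

    Each of the p main effects, C(p, 2) two-way interactions, or
    C(p, k) k-way interactions can be either in or out of the model,
    giving 2**p, 2**C(p, 2) and 2**C(p, k) possibilities.
    """
    log10_2 = math.log10(2.0)
    return {
        "main_effects": p * log10_2,
        "two_way": math.comb(p, 2) * log10_2,
        "k_way": math.comb(p, k) * log10_2,
    }


counts = log10_model_counts(p=100, k=3)
# With p = 100 there are already about 10**30 main effect models,
# and math.comb(100, 2) = 4950 candidate two-way interaction terms.
```

Even for a modest p, exhaustive enumeration is out of reach, which is why the slides turn to a model that avoids listing interaction terms explicitly.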
Semiparametric Multiple (Potentially
Interacting) QTL Mapping
• Goal: map multiple potentially interacting QTLs without
explicitly modeling all potential main and higher order
interaction effects
• Semiparametric model:
$y_i = \eta(x_{i1}, x_{i2}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq}) + e_i, \quad i = 1, \ldots, n$, with $e_i \overset{iid}{\sim} N(0, \sigma^2)$,
where the function $\eta$ is unspecified, $x_{i1}, x_{i2}, \ldots, x_{ip}$ are QTL
genotypes and $t_{i1}, \ldots, t_{iq}$ represent all non-genetic
factors/covariates.
• When $\eta$ equals $x_{i1} x_{i2} + x_{i3} t_{i1}$: the model implicitly captures
the two-way interaction between genes 1 and 2 and the
gene-environment interaction between gene 3 and
covariate 1.
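A small sketch of generating data from this model with the example regression function above (product of genes 1 and 2 plus gene 3 times covariate 1). The function is written as eta here; the 0/1 backcross-style genotype coding, the standard normal covariate, and all names and defaults are assumptions for illustration.

```python
import random


def eta(x, t):
    """Example regression function: genes 1 and 2 interact,
    and gene 3 interacts with covariate 1 (0-based indices in code)."""
    return x[0] * x[1] + x[2] * t[0]


def simulate(n, p=3, q=1, sigma=1.0, seed=0):
    """Draw n records (x, t, y) with y = eta(x, t) + e, e ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(p)]    # genotype codes, independent loci
        t = [rng.gauss(0.0, 1.0) for _ in range(q)]  # non-genetic covariates
        y = eta(x, t) + rng.gauss(0.0, sigma)
        data.append((x, t, y))
    return data


records = simulate(n=200)
```

The analyst fitting the semiparametric model never writes eta down; it is only used here to produce data with a known interaction structure.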
Bayesian Semi/non-parametric
Methods
– Dirichlet process (Muller et al. 1996)
– Splines (Smith and Kohn 1996; Denison et al. 1998
and DiMatteo et al. 2001)
– Wavelets (Abramovich et al. 1998 JRSSB)
– Kernel models (Liang et al 2007)
– Gaussian process (Neal 1996; 1997)
• Gaussian process priors have a large support in the space
of all smooth functions through an appropriate choice of
covariance kernel.
• Gaussian processes are flexible for curve estimation because
of their flexible sample path shapes
• Gaussian processes are related to smoothing splines
(Wahba 1978 JRSSB)
Prior Specification on $\eta$
• A Gaussian process such that all possible finite
dimensional distributions $(\eta_1, \ldots, \eta_n)^T$ follow a
multivariate normal with mean 0 and covariance
function
$\mathrm{cov}(\eta_i, \eta_{i'}) = \frac{1}{\tau} \exp\left[ -\sum_{k=1}^{p} \rho_{xk} (x_{ik} - x_{i'k})^2 - \sum_{j=1}^{q} \rho_{tj} (t_{ij} - t_{i'j})^2 \right]$
where $\tau$, the $\rho_{xk}$s and the $\rho_{tj}$s are hyperparameters
and $\eta_i = \eta(x_{i1}, \ldots, x_{ip}, t_{i1}, \ldots, t_{iq})$
• Hyperparameter $\tau$ defines the vertical scale of
variations, i.e., controls the magnitude of the
exponential part. Hyperparameters $\rho_{xk}$ ($\rho_{tj}$) are related to
length scales $1/\sqrt{\rho_{xk}}$ ($1/\sqrt{\rho_{tj}}$), which characterize the distance in
that particular direction over which y is expected to
vary significantly
– $\tau$ controls the smoothness of $\eta$: when $\tau \to 0$
the posterior mean of $\eta$ almost interpolates the data, while it is
centered around the prior mean function if $\tau \to \infty$
– When $\rho_{xk} = 0$, y is expected to be an essentially constant
function of that input variable $x_k$, which is therefore deemed
irrelevant (Mackay 1998).
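The covariance function on this slide can be sketched directly. Here tau stands for the vertical-scale hyperparameter and rho_x, rho_t for the per-variable relevance hyperparameters of the genotypes and covariates; these argument names are illustrative, and the code is a sketch of the squared-exponential form rather than the authors' implementation.

```python
import math


def gp_cov(x_i, x_j, t_i, t_j, tau, rho_x, rho_t):
    """Covariance between the function values at two individuals.

    (1 / tau) * exp(-sum_k rho_x[k] * (x_i[k] - x_j[k])**2
                    -sum_j rho_t[j] * (t_i[j] - t_j[j])**2)
    """
    s = sum(r * (a - b) ** 2 for r, a, b in zip(rho_x, x_i, x_j))
    s += sum(r * (a - b) ** 2 for r, a, b in zip(rho_t, t_i, t_j))
    return math.exp(-s) / tau


def gp_cov_matrix(xs, ts, tau, rho_x, rho_t):
    """n x n prior covariance matrix over all pairs of individuals."""
    n = len(xs)
    return [
        [gp_cov(xs[i], xs[j], ts[i], ts[j], tau, rho_x, rho_t) for j in range(n)]
        for i in range(n)
    ]


# Two individuals that differ only at the first locus. Setting its rho to 0
# makes that locus irrelevant, so their covariance equals the prior variance 1 / tau.
same_as_variance = gp_cov([0, 1], [1, 1], [0.5], [0.5], 2.0, [0.0, 3.0], [1.0])
```

This makes the variable selection idea concrete: a locus whose relevance hyperparameter is zero drops out of the kernel entirely.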
Priors on the $\rho$s
• The original papers on the Gaussian process
(Mackay 1998; Neal 1997) did not view this
method as an approach for variable selection and
imposed a Gamma prior on the $\rho$ parameters.
However, $\rho$ does provide information about the
relevance of any QTL, with a value near zero
indicating an irrelevant QTL.
• For variable selection purposes, we can impose
Gamma mixture priors on $1/\sqrt{\rho_{xk}}$
Prior Specifications
• Inverse Gamma distributions are used for the
priors of $\tau$ and $\sigma^2$.
Simulations
• Setup:
– backcross population
– 200 or 500 individuals
– 151 evenly spaced markers at 5 cM intervals
– Four QTLs with varying heritabilities, under three models:
• Main effect model: all four QTLs act additively
• Main effects plus two-way interactions
• Four-way interactions only
[Figure panel: n = 500, pure four-way interaction model.]
[Figure panel: n = 500, pure additive model.]
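A sketch of how backcross marker genotypes matching the simulation setup (151 markers, 5 cM spacing) could be generated. It assumes a single linkage group and Haldane's map function, neither of which is stated on the slides, and the function names are illustrative.

```python
import math
import random


def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane's map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))


def simulate_backcross(n_ind, n_markers=151, spacing_cM=5.0, seed=0):
    """Simulate 0/1 (AA/AB) genotypes along evenly spaced markers.

    The first marker is AA or AB with probability 1/2 each; each later
    marker switches genotype relative to its neighbour with probability r.
    """
    rng = random.Random(seed)
    r = haldane_r(spacing_cM)
    genotypes = []
    for _ in range(n_ind):
        g = [rng.randint(0, 1)]
        for _ in range(n_markers - 1):
            prev = g[-1]
            g.append(1 - prev if rng.random() < r else prev)
        genotypes.append(g)
    return genotypes


genos = simulate_backcross(n_ind=200)
```

Phenotypes would then be generated from the genotypes at the chosen QTL positions under the additive, two-way, or four-way interaction models listed above.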
Real Data Analysis
• A mouse study
– # samples: 187 backcross samples
– # markers: 85, with an average marker distance of
20 cM
– Phenotypes: inguinal, gonadal, retroperitoneal
and mesenteric fat pad weights
Remarks
• For studies with a large number of samples and/or a large
number of markers, MCMC converges very slowly
– We employed the hybrid Monte Carlo method, which
merges the Metropolis-Hastings algorithm with
sampling techniques based on dynamics simulation.
– We also estimated the maximum a posteriori (MAP)
via the conjugate gradient method (Hestenes and Stiefel 1952
J. Research of National Bureau of Standards)
• provides a point estimate
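The hybrid Monte Carlo idea mentioned above (dynamics simulation followed by a Metropolis-Hastings accept/reject step) can be sketched on a toy target. The one-dimensional standard normal target, step size, and trajectory length below are illustrative assumptions; the actual posterior over the Gaussian process hyperparameters is not reproduced here.

```python
import math
import random


def potential(q):
    """Negative log density (up to a constant) of the toy standard normal target."""
    return 0.5 * q * q


def grad_potential(q):
    return q


def leapfrog(q, p, step, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q = q + step * p                        # full step for position
        if i < n_steps - 1:
            p = p - step * grad_potential(q)    # full step for momentum
    p = p - 0.5 * step * grad_potential(q)      # final half step for momentum
    return q, p


def hmc_sample(n_samples, step=0.2, n_steps=10, seed=0):
    """Draw samples with hybrid (Hamiltonian) Monte Carlo."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)                # fresh momentum each iteration
        q_new, p_new = leapfrog(q, p0, step, n_steps)
        h_old = potential(q) + 0.5 * p0 * p0
        h_new = potential(q_new) + 0.5 * p_new * p_new
        # Metropolis-Hastings acceptance based on the change in total energy.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


draws = hmc_sample(2000)
```

Because each proposal follows the gradient of the log posterior, it can move far from the current state while keeping a high acceptance rate, which is the usual motivation for using it when plain random-walk MCMC converges slowly.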
Real Study: Cardiovascular
Disease
• 2655 tag SNPs from roughly 200 selected
candidate genes for cardiovascular
disease
• 820 individuals
• Non-genetic covariates: gender, smoking
status, age
Remarks
• Semiparametric mapping is powerful in
mapping multiple QTLs (potentially interacting
at higher orders)
– Picks up genes related to the trait regardless
of their marginal main effects or joint epistasis
effects
– Cannot readily differentiate genetic
contributions
• main effect? interaction? or both?
• Follow-up: a fine-tuned parametric model with the selected genes
Remarks and Future Research
• How to extend the methodologies to human
genome-wide association (GWA) studies, where
hundreds of thousands of markers are available
– Is it possible?
– potential solutions: pathway analysis; data reduction
techniques
• How to extend the method to human pedigree
analysis, where a mixed effect model is used for
correlated family members?
– Use inheritance vector: so far results are very
promising
Acknowledgement
• Joint work with
– Hanwen Huang
– Haibo Zhou
– Fuxia Cheng
– Ina Hoeschele
• Funding support
– NIH R01 GM074175