bsybayes - Computer Sciences User Pages
Download
Report
Transcript bsybayes - Computer Sciences User Pages
Bayesian Interval Mapping
1. Bayesian strategy
3-19
2. Markov chain sampling
20-27
3. sampling genetic architectures
28-35
4. criteria for model selection
36-44
QTL 2: Bayes
Seattle SISG: Yandell © 2008
1
QTL model selection: key players
•
observed measurements
– y = phenotypic trait
– m = markers & linkage map
– i = individual index (1,…,n)
•
observed
m
X
missing data
– missing marker data
– q = QT genotypes
q
Q
missing
• alleles QQ, Qq, or qq at locus
•
•
unknown quantities
– = QT locus (or loci)
– = phenotype model parameters
– = QTL model/genetic architecture
unknown
pr(q|m,,) genotype model
– grounded by linkage map, experimental cross
– recombination yields multinomial for q given m
•
Yy
pr(y|q,,) phenotype model
– distribution shape (assumed normal here)
– unknown parameters (could be non-parametric)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
after
Sen Churchill (2001)
2
1. Bayesian strategy for QTL study
• augment data (y,m) with missing genotypes q
• study unknowns (,,) given augmented data (y,m,q)
– find better genetic architectures
– find most likely genomic regions = QTL =
– estimate phenotype parameters = genotype means =
• sample from posterior in some clever way
– multiple imputation (Sen Churchill 2002)
– Markov chain Monte Carlo (MCMC)
• (Satagopan et al. 1996; Yi et al. 2005, 2007)
posterior
posterior for q, , ,
pr ( q, , , | y , m )
QTL 2: Bayes
likelihood * prior
constant
phenotype likelihood * [prior for q, , , ]
constant
pr ( y | q, , ) * [pr ( q | m, , ) pr ( | ) pr ( | m, ) pr ( )]
pr ( y | m )
Seattle SISG: Yandell © 2008
3
6
8
10
prior mean
actual mean
n small prior
n large
n large
prior mean
n small
prior
actual mean
Bayes posterior for normal data
12
14
16
6
8
y = phenotype values
small prior variance
QTL 2: Bayes
10
12
14
16
y = phenotype values
large prior variance
Seattle SISG: Yandell © 2008
4
Bayes posterior for normal data
model
environment
likelihood
prior
yi = + ei
e ~ N( 0, 2 ), 2 known
y ~ N( , 2 )
~ N( 0, 2 ), known
posterior:
single individual
mean tends to sample mean
~ N( 0 + b1(y1 – 0), b12)
sample of n individuals
~ N bn y (1 bn ) 0 , bn 2 / n
with y sum yi / n
{i 1,..., n}
n
shrinkage factor
(shrinks to 1)
bn
QTL 2: Bayes
Seattle SISG: Yandell © 2008
n 1
1
5
what values are the genotypic means?
phenotype model pr(y|q,)
prior mean
data mean
n small prior
data means
n large
posterior means
6
qq
QTL 2: Bayes
8
10
Qq
12
y = phenotype values
Seattle SISG: Yandell © 2008
14
16
QQ
6
Bayes posterior QTL means
posterior centered on sample genotypic mean
but shrunken slightly toward overall mean
phenotype mean:
E ( y | q)
q
V ( y | q) 2
genotypic prior:
E ( q )
y
V ( q ) 2
posterior:
E ( q | y ) bq yq (1 bq ) y V ( q | y ) bq 2 / nq
nq
shrinkage:
QTL 2: Bayes
bq
count {qi q}
nq
nq 1
yq sum yi / nq
{qi q}
1
Seattle SISG: Yandell © 2008
7
partition genotypic effects
on phenotype
• phenotype depends on genotype
• genotypic value partitioned into
– main effects of single QTL
– epistasis (interaction) between pairs of QTL
q 0 q E (Y ; q)
q ( q2 ) ( q2 ) ( q1 , q2 )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
8
partitition genotypic variance
• consider same 2 QTL + epistasis
• centering variance
V ( 0 ) 0 s
2
V ( q ) 1
2
• genotypic variance
• heritability
QTL 2: Bayes
2
h
2
q
2
q
q2 2
Seattle SISG: Yandell © 2008
2
q
2
1
2
2
h12 h22 h122
9
2
12
posterior mean ≈ LS estimate
q | y ~ N (bq ˆq , bqCq 2 )
2
ˆ
N ( q , Cq )
LS estimate ˆq sum i [sum j ˆ ( qij )] sum i wqi yi
variance
V ( ˆq ) sum i wqi2 2 Cq 2
shrinkage
bq 1 /(1 Cq ) 1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
10
pr(q|m,) recombination model
pr(q|m,) = pr(geno | map, locus)
pr(geno | flanking markers, locus)
m1 m2
QTL 2: Bayes
q?
m3
m4
markers
m5
m6
distance along chromosome
Seattle SISG: Yandell © 2008
11
QTL 2: Bayes
Seattle SISG: Yandell © 2008
12
what are likely QTL genotypes q?
how does phenotype y improve guess?
D4Mit41
D4Mit214
what are probabilities
for genotype q
between markers?
120
bp
110
recombinants AA:AB
100
all 1:1 if ignore y
and if we use y?
90
AA
AA
AB
AA
AA
AB
AB
AB
Genotype
QTL 2: Bayes
Seattle SISG: Yandell © 2008
13
posterior on QTL genotypes q
• full conditional of q given data, parameters
– proportional to prior pr(q | m, )
• weight toward q that agrees with flanking markers
– proportional to likelihood pr(y | q, )
• weight toward q with similar phenotype values
– posterior recombination model balances these two
• this is the E-step of EM computations
pr ( y | q, ) * pr ( q | m, )
pr ( q | y, m, , )
pr ( y | m, , )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
14
Where are the loci on the genome?
• prior over genome for QTL positions
– flat prior = no prior idea of loci
– or use prior studies to give more weight to some regions
• posterior depends on QTL genotypes q
pr( | m,q) = pr() pr(q | m,) / constant
– constant determined by averaging
• over all possible genotypes q
• over all possible loci on entire map
• no easy way to write down posterior
QTL 2: Bayes
Seattle SISG: Yandell © 2008
15
what is the genetic architecture ?
• which positions correspond to QTLs?
– priors on loci (previous slide)
• which QTL have main effects?
– priors for presence/absence of main effects
• same prior for all QTL
• can put prior on each d.f. (1 for BC, 2 for F2)
• which pairs of QTL have epistatic interactions?
– prior for presence/absence of epistatic pairs
• depends on whether 0,1,2 QTL have main effects
• epistatic effects less probable than main effects
QTL 2: Bayes
Seattle SISG: Yandell © 2008
16
= genetic architecture:
loci:
main QTL
epistatic pairs
effects:
add, dom
aa, ad, dd
QTL 2: Bayes
Seattle SISG: Yandell © 2008
17
Bayesian priors & posteriors
• augmenting with missing genotypes q
– prior is recombination model
– posterior is (formally) E step of EM algorithm
• sampling phenotype model parameters
– prior is “flat” normal at grand mean (no information)
– posterior shrinks genotypic means toward grand mean
– (details for unexplained variance omitted here)
• sampling QTL loci
– prior is flat across genome (all loci equally likely)
• sampling QTL genetic architecture model
– number of QTL
• prior is Poisson with mean from previous IM study
– genetic architecture of main effects and epistatic interactions
• priors on epistasis depend on presence/absence of main effects
QTL 2: Bayes
Seattle SISG: Yandell © 2008
18
2. Markov chain sampling
• construct Markov chain around posterior
– want posterior as stable distribution of Markov chain
– in practice, the chain tends toward stable distribution
• initial values may have low posterior probability
• burn-in period to get chain mixing well
• sample QTL model components from full conditionals
–
–
–
–
sample locus given q, (using Metropolis-Hastings step)
sample genotypes q given ,,y, (using Gibbs sampler)
sample effects given q,y, (using Gibbs sampler)
sample QTL model given ,,y,q (using Gibbs or M-H)
( , q, , ) ~ pr ( , q, , | y , m)
( , q, , )1 ( , q, , )2 ( , q, , ) N
QTL 2: Bayes
Seattle SISG: Yandell © 2008
19
MCMC sampling of unknowns (q,µ,)
for given genetic architecture
• Gibbs sampler
– genotypes q
– effects µ
– not loci
q ~ pr ( q | yi , mi , , )
pr ( y | q, )pr ( )
~
pr ( y | q)
pr ( q | m, )pr ( | m)
~
pr ( q | m)
• Metropolis-Hastings sampler
– extension of Gibbs sampler
– does not require normalization
• pr( q | m ) = sum pr( q | m, ) pr( )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
20
Gibbs sampler
for two genotypic means
• want to study two correlated effects
– could sample directly from their bivariate distribution
– assume correlation is known
• instead use Gibbs sampler:
– sample each effect from its full conditional given the other
– pick order of sampling at random
– repeat many times
0 1
1
~ N ,
2
0
1
1 ~ N 2 ,1 2
2 ~ N 1 ,1 2
QTL 2: Bayes
Seattle SISG: Yandell © 2008
21
Gibbs sampler samples: = 0.6
N = 200 samples
3
-2
1
0
-2
-1
Gibbs: mean 2
2
1
0
-1
Gibbs: mean 1
2
3
2
1
0
Gibbs: mean 2
-1
1
0
-1
-2
-2
Gibbs: mean 1
2
N = 50 samples
2
0
100
150
200
-2
Gibbs: mean 2
-1
0
1
2
3
Gibbs: mean 1
2
3
2
1
0
-2
-2
Gibbs: mean 2
50
Markov chain index
2
1
1
0
-1
3
2
1
0
-1
-2
Gibbs: mean 2
-1
Gibbs: mean 1
0
-2
-1
50
Gibbs: mean 2
40
-2
30
1
20
0
10
Markov chain index
-1
0
0
10
20
30
40
Markov chain index
QTL 2: Bayes
50
-2
-1
0
1
Gibbs: mean 1
2
0
50
100
150
Markov chain index
Seattle SISG: Yandell © 2008
200
-2
-1
0
1
2
Gibbs: mean 1
22
3
full conditional for locus
• cannot easily sample from locus full conditional
pr( |y,m,µ,q) = pr( | m,q)
= pr( q | m, ) pr( ) / constant
• constant is very difficult to compute explicitly
– must average over all possible loci over genome
– must do this for every possible genotype q
• Gibbs sampler will not work in general
– but can use method based on ratios of probabilities
– Metropolis-Hastings is extension of Gibbs sampler
QTL 2: Bayes
Seattle SISG: Yandell © 2008
23
Metropolis-Hastings idea
f()
0.4
• want to study distribution f()
• unless too complicated
0.2
– take Monte Carlo samples
– propose new value *
• near (?) current value
• from some distribution g
– accept new value with prob a
0
2
4
6
0.4
• Metropolis-Hastings samples:
0.0
– take samples using ratios of f
0.2
0.0
-4
QTL 2: Bayes
10
g(–*)
• Gibbs sampler: a = 1 always
f (* ) g (* )
a min 1,
*
f ( ) g ( )
8
Seattle SISG: Yandell © 2008
-2
0
2
4
24
0
0.0
0.1
2000
0.2
pr( |Y)
0.3
0.4
mcmc sequence
4000
6000
0.5
8000
0.6
10000
Metropolis-Hastings for locus
0
2
4
6
8
10
2
3
4
5
6
7
8
added twist: occasionally propose from entire genome
QTL 2: Bayes
Seattle SISG: Yandell © 2008
25
800
400
0
mcmc sequence
800
400
0
histogram
pr( |Y)
1.0
0.0
0 2 4 6 8
Seattle SISG: Yandell © 2008
0.0 0.2 0.4 0.6
2.0
0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
histogram
pr( |Y)
0.0 0.4 0.8 1.2
pr( |Y)
histogram
4
2
0
pr( |Y)
histogram
6
QTL 2: Bayes
N = 1000 samples
narrow g
wide g
0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
mcmc sequence
150
0 50
150
mcmc sequence
N = 200 samples
narrow g
wide g
0 50
mcmc sequence
Metropolis-Hastings samples
0 2 4 6 8
26
3. sampling genetic architectures
• search across genetic architectures A of various sizes
– allow change in number of QTL
– allow change in types of epistatic interactions
• methods for search
– reversible jump MCMC
– Gibbs sampler with loci indicators
• complexity of epistasis
– Fisher-Cockerham effects model
– general multi-QTL interaction & limits of inference
QTL 2: Bayes
Seattle SISG: Yandell © 2008
27
reversible jump MCMC
• consider known genotypes q at 2 known loci
– models with 1 or 2 QTL
• M-H step between 1-QTL and 2-QTL models
– model changes dimension (via careful bookkeeping)
– consider mixture over QTL models H
1 QTL : Y 0 ( q1 ) e
2 QTL : Y 0 ( q1 ) ( q 2 ) e
QTL 2: Bayes
Seattle SISG: Yandell © 2008
28
geometry of reversible jump
0.6
0.6
0.8
Reversible Jump Sequence
0.8
Move Between Models
b2
0.2 0.4
b2
0.2 0.4
c21 = 0.7
0.0
0.0
m=2
m=1
0.0
QTL 2: Bayes
0.2
0.4
1b1
0.6
0.8
0.0
0.2
0.4
b1
0.6
0.8
1
Seattle SISG: Yandell © 2008
29
geometry allowing q and to change
first 1000 with m<3
0.0
0.0
0.05
b2
0.1 0.2
b2
0.10
0.3
0.15
0.4
a short sequence
0.05
QTL 2: Bayes
0.10
b1
1
0.15
-0.3 -0.2 -0.1 0.0 0.1 0.2
b1
Seattle SISG: Yandell © 2008
1
30
collinear QTL = correlated effects
8-week
additive
2
-0.2
-0.1
cor = -0.7
-0.6
-0.3
-0.4
-0.2
additive 2
cor = -0.81
0.0
0.0
4-week
-0.6
-0.4
-0.2
0.0
0.2
-0.2
additive 1
-0.1
0.0
0.1
0.2
additive 1
effect 1
effect 1
• linked QTL = collinear genotypes
correlated estimates of effects (negative if in coupling phase)
sum of linked effects usually fairly constant
QTL 2: Bayes
Seattle SISG: Yandell © 2008
31
sampling across QTL models
0
1
m+1 2 … m
L
action steps: draw one of three choices
• update QTL model with probability 1-b()-d()
– update current model using full conditionals
– sample QTL loci, effects, and genotypes
• add a locus with probability b()
– propose a new locus along genome
– innovate new genotypes at locus and phenotype effect
– decide whether to accept the “birth” of new locus
• drop a locus with probability d()
– propose dropping one of existing loci
– decide whether to accept the “death” of locus
QTL 2: Bayes
Seattle SISG: Yandell © 2008
32
Gibbs sampler with loci indicators
• consider only QTL at pseudomarkers
– every 1-2 cM
– modest approximation with little bias
• use loci indicators in each pseudomarker
– = 1 if QTL present
– = 0 if no QTL present
• Gibbs sampler on loci indicators
– relatively easy to incorporate epistasis
– Yi, Yandell, Churchill, Allison, Eisen, Pomp (2005 Genetics)
• (see earlier work of Nengjun Yi and Ina Hoeschele)
q 1 ( q1 ) 2 ( q 2 ) , k 0,1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
33
Bayesian shrinkage estimation
• soft loci indicators
– strength of evidence for j depends on
– 0 1 (grey scale)
– shrink most s to zero
• Wang et al. (2005 Genetics)
– Shizhong Xu group at U CA Riverside
q 0 1 1 ( q1 ) 2 2 ( q1 ), 0 k 1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
34
4. criteria for model selection
balance fit against complexity
• classical information criteria
– penalize likelihood L by model size ||
– IC = – 2 log L( | y) + penalty()
– maximize over unknowns
• Bayes factors
– marginal posteriors pr(y | )
– average over unknowns
QTL 2: Bayes
Seattle SISG: Yandell © 2008
35
classical information criteria
• start with likelihood L( | y, m)
– measures fit of architecture () to phenotype (y)
• given marker data (m)
– genetic architecture () depends on parameters
• have to estimate loci (µ) and effects ()
• complexity related to number of parameters
– | | = size of genetic architecture
• BC:
| | = 1 + n.qtl + n.qtl(n.qtl - 1) = 1 + 4 + 12 = 17
• F2:
| | = 1 + 2n.qtl +4n.qtl(n.qtl - 1) = 1 + 8 + 48 = 57
QTL 2: Bayes
Seattle SISG: Yandell © 2008
36
classical information criteria
• construct information criteria
– balance fit to complexity
– Akaike
AIC = –2 log(L) + 2 ||
– Bayes/Schwartz BIC = –2 log(L) + || log(n)
– Broman
BIC = –2 log(L) + || log(n)
– general form: IC = –2 log(L) + || D(n)
• compare models
– hypothesis testing: designed for one comparison
• 2 log[LR(1, 2)] = L(y|m, 2) – L(y|m, 1)
– model selection: penalize complexity
• IC(1, 2) = 2 log[LR(1, 2)] + (|2| – |1|) D(n)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
37
information criteria vs. model size
WinQTL 2.0
SCD data on F2
A=AIC
1=BIC(1)
2=BIC(2)
d=BIC()
models
d
d
information criteria
300
320
340
•
•
•
•
•
•
•
360
d
d
d
1
A
1
1
3
1
1
1A
A
2
2
2
• 2+5+9+2
• 2:2 AD
2
2
d2
d
2
– 1,2,3,4 QTL
– epistasis
2
d
2
A
A
A
4
5
6
7
model parameters p
1
1
A
A
8
9
epistasis
QTL 2: Bayes
Seattle SISG: Yandell © 2008
38
Bayes factors
• ratio of model likelihoods
– ratio of posterior to prior odds for architectures
– averaged over unknowns
pr ( 1 | y , m) / pr ( 2 | y , m) pr ( y | m, 1 )
B12
pr ( 1 ) / pr ( 2 )
pr ( y | m, 2 )
• roughly equivalent to BIC
– BIC maximizes over unknowns
– BF averages over unknowns
2 log( B12 ) 2 log( LR) (| 2 | | 1 |) log( n)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
39
scan of marginal Bayes factor & effect
QTL 2: Bayes
Seattle SISG: Yandell © 2008
40
issues in computing Bayes factors
• BF insensitive to shape of prior on
– geometric, Poisson, uniform
– precision improves when prior mimics posterior
• BF sensitivity to prior variance on effects
– prior variance should reflect data variability
– resolved by using hyper-priors
• automatic algorithm; no need for user tuning
• easy to compute Bayes factors from samples
– sample posterior using MCMC
– posterior pr( | y, m) is marginal histogram
QTL 2: Bayes
Seattle SISG: Yandell © 2008
41
• sampled marginal histogram
• shape affected by prior pr(A)
BF 1 , 2
prior probability
0.10
0.20
– prior pr() chosen by user
– posterior pr( |y,m)
e
e
pr ( 1|y , m) /pr ( 1 )
pr ( 2|y , m) /pr ( 2 )
e
p
u
exponential
Poisson
uniform
p p
e u u
p u u
u u u
p
e
p
e
e p
p
e
0.00
• | | = number of QTL
0.30
Bayes factors & genetic architecture
0
e e
p
e e
p
u p
u p
u u
2
4
6
8
m = number of QTL
10
• pattern of QTL across genome
• gene action and epistasis
QTL 2: Bayes
Seattle SISG: Yandell © 2008
42
3 4
BF sensitivity to fixed prior for effects
Bayes factors
0.5
1
2
4
3
2
3
4
2
4
3
2
4
3
2
4
3
2
4
3
2
4
3
2
3
4
2
4
3
2
1
1
1
1
1
0.2
3
4
2
4
3
2
1
1
1
1
0.05
1
0.20
1
0.50
2.00 5.00 2
hyper-prior heritability h
4
3
2
1
B45
B34
B23
B12
20.00 50.00
2
qj ~ N0, G2 / m, G2 h2 total
, h2 fixed
QTL 2: Bayes
Seattle SISG: Yandell © 2008
43
BF insensitivity to random effects prior
insensitivity to hyper-prior
1.0
3.0
hyper-prior density 2*Beta(a,b)
0.0
0.5
1.0
1.5
2
hyper-parameter heritability h
3
2
3
2
3
2
3
2
Bayes factors
0.2
0.4 0.6
0.0
density
1.0
2.0
0.25,9.75
0.5,9.5
1,9
2,10
1,3
1,1
2.0
3
2
1
1
1
1
0.05
0.10
0.20
2
Eh
1
3
3
2
2
B34
B23
B12
1
1
0.50
1.00
2
qj ~ N0, G2 / m, G2 h2 total
, 12 h2 ~ Beta (a, b)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
44