Bayesian methods with Monte Carlo Markov Chains II


Bayesian Methods with Monte Carlo Markov Chains II
Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
[email protected]
http://tigpbp.iis.sinica.edu.tw/courses.htm
Part 3
An Example in Genetics
Example 1 in Genetics (1)

• Two linked loci with alleles A and a, and B and b
  • A, B: dominant
  • a, b: recessive
• A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab

[Figure: an AaBb double heterozygote producing the gametes AB, Ab, aB, and ab, each parental split labeled with probability 1/2.]
Example 1 in Genetics (2)

• Probabilities for genotypes in gametes:

            No recombination   Recombination
  Male      1 - r              r
  Female    1 - r'             r'

            AB         ab         aB        Ab
  Male      (1-r)/2    (1-r)/2    r/2       r/2
  Female    (1-r')/2   (1-r')/2   r'/2      r'/2
Example 1 in Genetics (3)

• Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79–92.
• More:
  http://en.wikipedia.org/wiki/Genetics
  http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
Example 1 in Genetics (4)

• Genotype probabilities for the offspring (female gametes × male gametes):

  FEMALE \ MALE   AB (1-r)/2           ab (1-r)/2           aB r/2            Ab r/2
  AB (1-r')/2     AABB (1-r)(1-r')/4   aABb (1-r)(1-r')/4   aABB r(1-r')/4    AABb r(1-r')/4
  ab (1-r')/2     AaBb (1-r)(1-r')/4   aabb (1-r)(1-r')/4   aaBb r(1-r')/4    Aabb r(1-r')/4
  aB r'/2         AaBB (1-r)r'/4       aabB (1-r)r'/4       aaBB rr'/4        AabB rr'/4
  Ab r'/2         AABb (1-r)r'/4       aAbb (1-r)r'/4       aABb rr'/4        AAbb rr'/4
Example 1 in Genetics (5)

• Four distinct phenotypes: A*B*, A*b*, a*B* and a*b*.
  • A*: the dominant phenotype from (Aa, AA, aA).
  • a*: the recessive phenotype from aa.
  • B*: the dominant phenotype from (Bb, BB, bB).
  • b*: the recessive phenotype from bb.
• A*B*: 9 gametic combinations.
• A*b*: 3 gametic combinations.
• a*B*: 3 gametic combinations.
• a*b*: 1 gametic combination.
• Total: 16 combinations.
Example 1 in Genetics (6)

• Let $\phi = (1-r)(1-r')$, then
  $P(A^*B^*) = \frac{2+\phi}{4}$
  $P(A^*b^*) = P(a^*B^*) = \frac{1-\phi}{4}$
  $P(a^*b^*) = \frac{\phi}{4}$
Example 1 in Genetics (7)

• Hence, the random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution:
  $\text{Multinomial}\left(n;\ \frac{2+\phi}{4},\ \frac{1-\phi}{4},\ \frac{1-\phi}{4},\ \frac{\phi}{4}\right)$
• We know that
  $\phi = (1-r)(1-r'),\quad 0 \le r \le 1/2,\quad 0 \le r' \le 1/2,$
  so
  $1/4 \le \phi \le 1$.
Bayesian for Example 1 in Genetics (1)

• To simplify computation, we let
  $P(A^*B^*) = \theta_1,\ P(A^*b^*) = \theta_2,\ P(a^*B^*) = \theta_3,\ P(a^*b^*) = \theta_4$
• The random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution:
  $\text{Multinomial}(n;\ \theta_1, \theta_2, \theta_3, \theta_4)$
Bayesian for Example 1 in Genetics (2)

• We assume a Dirichlet prior distribution with parameters $D(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ to estimate the probabilities of A*B*, A*b*, a*B* and a*b*.
• Recall that
  • A*B*: 9 gametic combinations.
  • A*b*: 3 gametic combinations.
  • a*B*: 3 gametic combinations.
  • a*b*: 1 gametic combination.
• We consider $(\alpha_1, \alpha_2, \alpha_3, \alpha_4) = (9, 3, 3, 1)$.
Bayesian for Example 1 in Genetics (3)

• Suppose that we observe the data
  $y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 24)$.
• So the posterior distribution is also Dirichlet, with parameters $D(134, 21, 23, 25)$.
• The Bayesian estimates of the probabilities are
  $(\theta_1, \theta_2, \theta_3, \theta_4) = (0.660, 0.103, 0.113, 0.123)$.
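The estimates above are the posterior means $\alpha_j^{post} / \sum_k \alpha_k^{post}$ of the Dirichlet posterior. As a minimal R sketch (not part of the original slides), the conjugate update and a Monte Carlo check can be written directly; the helper rdirichlet() below is an illustrative implementation via independent gamma variates:

prior <- c(9, 3, 3, 1)                    # Dirichlet prior parameters
y     <- c(125, 18, 20, 24)               # observed multinomial counts
post  <- prior + y                        # posterior: Dirichlet(134, 21, 23, 25)
post / sum(post)                          # posterior means: 0.660 0.103 0.113 0.123

# Sample the Dirichlet posterior via independent gamma draws (illustrative helper)
rdirichlet <- function(n, a) {
  g <- matrix(rgamma(n * length(a), shape = a), ncol = length(a), byrow = TRUE)
  g / rowSums(g)
}
theta <- rdirichlet(10000, post)
colMeans(theta)                           # Monte Carlo check of the posterior means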
Bayesian for Example 1 in Genetics (4)

• Consider the original model,
  $P(A^*B^*) = \frac{2+\phi}{4},\quad P(A^*b^*) = P(a^*B^*) = \frac{1-\phi}{4},\quad P(a^*b^*) = \frac{\phi}{4}$.
• The random sample of n also follows a multinomial distribution:
  $(y_1, y_2, y_3, y_4) \sim \text{Multinomial}\left(n;\ \frac{2+\phi}{4},\ \frac{1-\phi}{4},\ \frac{1-\phi}{4},\ \frac{\phi}{4}\right)$.
• We will assume a Beta prior distribution on $\phi$: $\text{Beta}(\alpha_1, \alpha_2)$.
Bayesian for Example 1 in Genetics (5)

• The posterior distribution becomes
  $P(\phi \mid y_1, y_2, y_3, y_4) = \frac{P(y_1, y_2, y_3, y_4 \mid \phi)\, P(\phi)}{\int P(y_1, y_2, y_3, y_4 \mid \phi)\, P(\phi)\, d\phi}$.
• The integral in the denominator,
  $\int P(y_1, y_2, y_3, y_4 \mid \phi)\, P(\phi)\, d\phi$,
  does not have a closed form.
Bayesian for Example 1 in Genetics (6)

• How do we solve this problem?
  The Monte Carlo Markov Chains (MCMC) method!
• What values are appropriate for $(\alpha_1, \alpha_2)$?
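One simple MCMC option here is a random-walk Metropolis sampler for $\phi$ (this is not the Gibbs sampler developed later in these slides). The R sketch below is illustrative only: it assumes the data y = (125, 18, 20, 24) above, a Beta(1, 1) prior on $\phi$, and a proposal standard deviation of 0.05.

y <- c(125, 18, 20, 24)
log_post <- function(phi, a1 = 1, a2 = 1) {       # log posterior up to a constant
  if (phi <= 0 || phi >= 1) return(-Inf)
  y[1] * log((2 + phi) / 4) + (y[2] + y[3]) * log((1 - phi) / 4) +
    y[4] * log(phi / 4) + dbeta(phi, a1, a2, log = TRUE)
}
set.seed(1)
n_iter <- 20000; phi <- numeric(n_iter); phi[1] <- 0.5
for (t in 2:n_iter) {
  prop <- phi[t - 1] + rnorm(1, 0, 0.05)          # random-walk proposal
  logr <- log_post(prop) - log_post(phi[t - 1])   # symmetric proposal: only the posterior ratio
  phi[t] <- if (log(runif(1)) < logr) prop else phi[t - 1]
}
mean(phi[-(1:2000)])                              # posterior mean of phi after burn-in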
Part 4
Monte Carlo Methods
Monte Carlo Methods (1)

• Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
• http://en.wikipedia.org/wiki/Monte_Carlo_method
• http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt

[Figure: one unknown shuffled deck ("?") leading to four played hands: Lose, Lose, Win, Lose. Chance of winning is 1 in 4!]
Monte Carlo Methods (2)

• Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards.
• Insight: why not just play a few hands, and see empirically how many do in fact win?
• More generally, we can approximate a probability density function using only samples from that density.
Monte Carlo Methods (3)

• Given a very large set $X$ and a distribution $f(x)$ over it.
• We draw a set of $N$ i.i.d. random samples $x^{(1)}, \ldots, x^{(N)}$ from $f$.
• We can then approximate the distribution using these samples:
  $f_N(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left(x^{(i)} = x\right) \xrightarrow[N\to\infty]{} f(x)$
Monte Carlo Methods (4)

• We can also use these samples to compute expectations:
  $E_N(g) = \frac{1}{N}\sum_{i=1}^{N} g\!\left(x^{(i)}\right) \xrightarrow[N\to\infty]{} E(g) = \sum_{x} g(x)\, f(x)$
• And even use them to find a maximum:
  $\hat{x} = \arg\max_{x^{(i)}} f\!\left(x^{(i)}\right)$
Monte Carlo Example

• Let $X_1, \ldots, X_n$ be i.i.d. $N(0,1)$; find $E(X_i^4) = ?$
• Solution:
  $E(X_i^4) = \int_{-\infty}^{\infty} x^4\, \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right) dx = \frac{4!}{2^{4/2}\,(4/2)!} = \frac{24}{8} = 3$
• Use the Monte Carlo method to approximate it:
> x <- rnorm(100000) # 100000 samples from N(0,1)
> x <- x^4
> mean(x)
[1] 3.034175
Part 5
Monte Carlo Method vs. Numerical Integration
Numerical Integration (1)

• Theorem (Riemann Integral): If f is continuous and integrable, then the Riemann sum
  $S_n = \sum_{i=1}^{n} f(c_i)(x_{i+1} - x_i), \text{ where } c_i \in [x_i, x_{i+1}],$
  satisfies
  $S_n \xrightarrow[\max_{i=1,\ldots,n} |x_{i+1} - x_i| \to 0]{} I = \int_a^b f(x)\, dx$, a constant.
• http://en.wikipedia.org/wiki/Numerical_integration
• http://en.wikipedia.org/wiki/Riemann_integral
Numerical Integration (2)

• Trapezoidal Rule:
  $x_i = a + \frac{b-a}{n}\, i$ (equally spaced intervals)
  $x_{i+1} - x_i = h = \frac{b-a}{n}$
  $S_T = \sum_{i=1}^{n} h\, \frac{f(x_i) + f(x_{i+1})}{2} = h\left[\tfrac{1}{2} f(x_1) + f(x_2) + \cdots + f(x_n) + \tfrac{1}{2} f(x_{n+1})\right]$
• http://en.wikipedia.org/wiki/Trapezoidal_rule
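As a minimal R sketch (not from the original slides), the composite trapezoidal sum above can be coded in a few lines; the test integrand x^2 on [0, 1] is an illustrative choice:

trapezoid <- function(f, a, b, n) {
  x <- seq(a, b, length.out = n + 1)        # n equally spaced intervals
  h <- (b - a) / n
  h * (sum(f(x)) - (f(a) + f(b)) / 2)       # h * (f(x1)/2 + f(x2) + ... + f(xn+1)/2)
}
trapezoid(function(x) x^2, 0, 1, 100)       # approximately 1/3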
Numerical Integration (3)

• An Example of Trapezoidal Rule by R:
f <- function(x) {((x-4.5)^3+5.6)/1234}
x <- seq(-100, 100, 1)
plot(x, f(x), ylab = "f(x)", type = 'l', lwd = 2)
temp <- matrix(0, nrow = 6, ncol = 4)
temp[, 1] <- seq(-100, 100, 40); temp[, 2] <- rep(f(-100), 6)
temp[, 3] <- temp[, 1]; temp[, 4] <- f(temp[, 1])
segments(temp[, 1], temp[, 2], temp[, 3], temp[, 4], col = "red")  # vertical sides
temp <- matrix(0, nrow = 5, ncol = 4)
temp[, 1] <- seq(-100, 80, 40); temp[, 2] <- f(temp[, 1])
temp[, 3] <- seq(-60, 100, 40); temp[, 4] <- f(temp[, 3])
segments(temp[, 1], temp[, 2], temp[, 3], temp[, 4], col = "red")  # trapezoid tops
Numerical Integration (4)

[Figure: plot of f(x) with the red trapezoidal approximation produced by the R code above.]
Numerical Integration (5)

• Simpson's Rule:
  Specifically,
  $\int_a^b f(x)\, dx \approx \frac{b-a}{6}\left[f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b)\right]$
• One derivation replaces the integrand $f(x)$ by the quadratic polynomial $P(x)$ which takes the same values as $f(x)$ at the end points $a$ and $b$ and the midpoint $m = \frac{a+b}{2}$.
Numerical Integration (6)

$P(x) = f(a)\,\frac{(x-m)(x-b)}{(a-m)(a-b)} + f(m)\,\frac{(x-a)(x-b)}{(m-a)(m-b)} + f(b)\,\frac{(x-a)(x-m)}{(b-a)(b-m)}$

$\int_a^b P(x)\, dx = \frac{b-a}{6}\left[f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b)\right]$

• http://en.wikipedia.org/wiki/Simpson_rule
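A minimal R sketch (not from the original slides) of the single-interval Simpson formula above; the test integrand x^3 on [0, 2] is an illustrative choice (Simpson's rule is exact for cubics):

simpson <- function(f, a, b) {
  (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))   # (b-a)/6 * [f(a) + 4 f((a+b)/2) + f(b)]
}
simpson(function(x) x^3, 0, 2)                       # returns 4, the exact integral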
Numerical Integration (7)

• Example of Simpson's Rule by R:
f <- function(x) {((x-4.5)^3+100*x^2+5.6)/1234}
x <- seq(-100, 100, 1)
p <- function(x) {   # quadratic interpolant of f at x = -100, 0, 100
  f(-100)*(x-0)*(x-100)/(-100-0)/(-100-100) +
  f(0)*(x+100)*(x-100)/(0+100)/(0-100) +
  f(100)*(x-0)*(x+100)/(100-0)/(100+100)
}
matplot(x, cbind(f(x), p(x)), ylab = "", type = 'l', lwd = 2, lty = 1)
temp <- matrix(0, nrow = 3, ncol = 4)
temp[, 1] <- seq(-100, 100, 100); temp[, 2] <- rep(p(-60), 3)
temp[, 3] <- temp[, 1]; temp[, 4] <- f(temp[, 1])
segments(temp[, 1], temp[, 2], temp[, 3], temp[, 4], col = "red")
Numerical Integration (8)

[Figure: plot of f(x) and its quadratic interpolant p(x) with the red segments produced by the R code above.]
Numerical Integration (9)

• Two problems in numerical integration:
  1. $(a, b) = (-\infty, \infty)$, i.e. $\int_{-\infty}^{\infty} f(x)\, dx$ — how can numerical integration be used?
     Logistic transform: $y = \frac{1}{1+e^{-x}}$
  2. Two or more dimensions (high-dimensional integration)?
     Monte Carlo Method!
• http://en.wikipedia.org/wiki/Numerical_integration
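A minimal R sketch (not from the original slides) of problem 1: substituting x = log(y/(1-y)), the inverse of the logistic transform, turns an integral over the whole real line into one over (0, 1), which the trapezoidal rule can handle; the standard normal density is an illustrative integrand:

f <- function(x) dnorm(x)                              # integrand on (-inf, inf); integrates to 1
g <- function(y) f(log(y / (1 - y))) / (y * (1 - y))   # transformed integrand on (0, 1)
y <- seq(1e-6, 1 - 1e-6, length.out = 10001)
h <- diff(y)[1]
h * (sum(g(y)) - (g(y[1]) + g(y[length(y)])) / 2)      # trapezoidal rule, approximately 1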
Monte Carlo Integration

• $I = \int f(x)\, dx = \int \frac{f(x)}{g(x)}\, g(x)\, dx$,
  where $g(x)$ is a probability distribution function; let $h(x) = \frac{f(x)}{g(x)}$.
• If $x_1, \ldots, x_n \overset{iid}{\sim} g(x)$, then by the law of large numbers (LLN),
  $\frac{1}{n}\sum_{i=1}^{n} h(x_i) \xrightarrow[n\to\infty]{} E[h(x)] = \int h(x)\, g(x)\, dx = \int f(x)\, dx = I$
• http://en.wikipedia.org/wiki/Monte_Carlo_integration
Example (1)

• $\int_0^1 \frac{\Gamma(18)}{\Gamma(7)\,\Gamma(11)}\, x^{10} (1-x)^{10}\, dx = ?$
  (It is $\frac{2}{57} \approx 0.0350877$.)
• Numerical Integration by Trapezoidal Rule:
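A minimal R sketch (not shown in the original slides) of that trapezoidal computation:

f <- function(x) gamma(18) / (gamma(7) * gamma(11)) * x^10 * (1 - x)^10
x <- seq(0, 1, length.out = 1001)
h <- diff(x)[1]
h * (sum(f(x)) - (f(0) + f(1)) / 2)   # approximately 2/57 = 0.0350877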
Example (2)

• Monte Carlo Integration:
  $\int_0^1 \frac{\Gamma(18)}{\Gamma(7)\,\Gamma(11)}\, x^{10} (1-x)^{10}\, dx = \int_0^1 x^4\, \frac{\Gamma(7+11)}{\Gamma(7)\,\Gamma(11)}\, x^{7-1} (1-x)^{11-1}\, dx$
• Let $x \sim \text{Beta}(7, 11)$; then
  $\int_0^1 x^4\, \frac{\Gamma(7+11)}{\Gamma(7)\,\Gamma(11)}\, x^{7-1} (1-x)^{11-1}\, dx = E(x^4)$
• With $x_1, \ldots, x_n \overset{iid}{\sim} \text{Beta}(7, 11)$, the LLN gives
  $\frac{1}{n}\sum_{i=1}^{n} x_i^4 \xrightarrow[n\to\infty]{} E(x^4) = \int_0^1 \frac{\Gamma(18)}{\Gamma(7)\,\Gamma(11)}\, x^{10} (1-x)^{10}\, dx$
Example by C/C++ and R
> x <- rbeta(10000, 7, 11)
> x <- x^4
> mean(x)
[1] 0.03547369
Part 6
Markov Chains
Markov Chains (1)

• A Markov chain is a mathematical model for stochastic systems whose states, discrete or continuous, are governed by a transition probability.
• Suppose the random variables $X_0, X_1, \ldots$ take values in a state space $\Omega$ that is a countable set. A Markov chain is a process that corresponds to the network:
  $X_0 \to X_1 \to \cdots \to X_{t-1} \to X_t \to X_{t+1} \to \cdots$
Markov Chains (2)

• The current state in a Markov chain only depends on the most recent previous state:
  $P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i) \equiv \text{transition probability},$
  where $i_0, \ldots, i, j \in \Omega$.
• http://en.wikipedia.org/wiki/Markov_chain
• http://civs.stat.ucla.edu/MCMC/MCMC_tutorial/Lect1_MCMC_Intro.pdf
An Example of Markov Chains

• $\Omega = \{1, 2, 3, 4, 5\}$
• $X = \{X_0, X_1, \ldots, X_t, \ldots\}$, where $X_0$ is the initial state and so on.
• $P$ is the transition matrix (rows and columns indexed by the states 1–5):

  $P = \begin{pmatrix} 0.4 & 0.6 & 0.0 & 0.0 & 0.0 \\ 0.5 & 0.0 & 0.5 & 0.0 & 0.0 \\ 0.0 & 0.3 & 0.0 & 0.7 & 0.0 \\ 0.0 & 0.0 & 0.1 & 0.3 & 0.6 \\ 0.0 & 0.3 & 0.0 & 0.5 & 0.2 \end{pmatrix}$
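A minimal R sketch (not from the original slides) that simulates this chain; the run length of 1000 steps and the initial state 1 are illustrative choices:

P <- matrix(c(0.4, 0.6, 0.0, 0.0, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0,
              0.0, 0.3, 0.0, 0.7, 0.0,
              0.0, 0.0, 0.1, 0.3, 0.6,
              0.0, 0.3, 0.0, 0.5, 0.2), nrow = 5, byrow = TRUE)
x <- numeric(1000); x[1] <- 1                                    # start in state 1
for (t in 2:1000) x[t] <- sample(1:5, 1, prob = P[x[t - 1], ])   # one transition per step
table(x) / length(x)                                             # empirical state frequencies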
Definition (1)

• Define the probability of going from state $i$ to state $j$ in $n$ time steps as
  $p_{ij}^{(n)} = P(X_{t+n} = j \mid X_t = i)$
• A state $j$ is accessible from state $i$ if there is an $n$ such that $p_{ij}^{(n)} > 0$, where $n = 0, 1, \ldots$
• A state $i$ is said to communicate with state $j$ (denoted $i \leftrightarrow j$) if both $i$ is accessible from $j$ and $j$ is accessible from $i$.
Definition (2)

• A state $i$ has period $d(i)$ if any return to state $i$ must occur in multiples of $d(i)$ time steps.
• Formally, the period of a state is defined as
  $d(i) = \gcd\{n : P_{ii}^{(n)} > 0\}$
• If $d(i) = 1$, then the state is said to be aperiodic; otherwise ($d(i) > 1$), the state is said to be periodic with period $d(i)$.
Definition (3)

• A set of states $C$ is a communicating class if every pair of states in $C$ communicates with each other.
• Every state in a communicating class must have the same period.
• Example:

[Figure: a transition diagram illustrating communicating classes.]
Definition (4)

• A finite Markov chain is said to be irreducible if its state space $\Omega$ is a single communicating class; this means that, in an irreducible Markov chain, it is possible to get to any state from any state.
• Example:

[Figure: a transition diagram of an irreducible chain.]
Definition (5)

• A finite-state irreducible Markov chain is said to be ergodic if its states are aperiodic.
• Example:

[Figure: a transition diagram of an ergodic chain.]
Definition (6)

• A state $i$ is said to be transient if, given that we start in state $i$, there is a non-zero probability that we will never return to $i$.
• Formally, let the random variable $T_i$ be the next return time to state $i$ (the "hitting time"):
  $T_i = \min\{n : X_n = i \mid X_0 = i\}$
• Then, state $i$ is transient iff $P(T_i < \infty) < 1$.
Definition (7)

• A state $i$ is said to be recurrent or persistent iff $P(T_i < \infty) = 1$.
• The mean recurrence time is $\mu_i = E[T_i]$.
• State $i$ is positive recurrent if $\mu_i$ is finite; otherwise, state $i$ is null recurrent.
• A state $i$ is said to be ergodic if it is aperiodic and positive recurrent. If all states in a Markov chain are ergodic, then the chain is said to be ergodic.
Stationary Distributions

• Theorem: If a Markov chain is irreducible and aperiodic, then
  $P_{ij}^{(n)} \xrightarrow[n\to\infty]{} \frac{1}{\mu_j}, \quad \forall\, i, j \in \Omega$
• Theorem: If a Markov chain is irreducible and aperiodic, then there exists a unique $\pi_j = \lim_{n\to\infty} P(X_n = j)$ with
  $\pi_j = \sum_{i} \pi_i P_{ij}, \quad \sum_{j} \pi_j = 1, \quad \forall\, j \in \Omega,$
  where $\pi$ is the stationary distribution.
Definition (8)

• A Markov chain is said to be reversible if there is a stationary distribution $\pi$ such that
  $\pi_i P_{ij} = \pi_j P_{ji}, \quad \forall\, i, j \in \Omega$
• Theorem: if a Markov chain is reversible, then
  $\pi_j = \sum_{i} \pi_i P_{ij}$
An Example of Stationary Distributions

• A Markov chain with three states:

  $P = \begin{pmatrix} 0.7 & 0.3 & 0.0 \\ 0.3 & 0.4 & 0.3 \\ 0.0 & 0.3 & 0.7 \end{pmatrix}$

  [Figure: transition diagram with self-loop probabilities 0.7, 0.4, 0.7 on states 1, 2, 3 and probability 0.3 between neighboring states.]

• The stationary distribution is $\pi = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$:

  $\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) \begin{pmatrix} 0.7 & 0.3 & 0.0 \\ 0.3 & 0.4 & 0.3 \\ 0.0 & 0.3 & 0.7 \end{pmatrix} = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$
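A minimal R sketch (not from the original slides) that checks this numerically and also recovers $\pi$ as the left eigenvector of P for eigenvalue 1:

P  <- matrix(c(0.7, 0.3, 0.0,
               0.3, 0.4, 0.3,
               0.0, 0.3, 0.7), nrow = 3, byrow = TRUE)
pi <- c(1, 1, 1) / 3
pi %*% P                                      # returns (1/3, 1/3, 1/3)

e <- eigen(t(P))                              # left eigenvectors of P = eigenvectors of t(P)
v <- Re(e$vectors[, which.min(abs(e$values - 1))])
v / sum(v)                                    # normalized: the stationary distribution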
Properties of Stationary Distributions

• Regardless of the starting point, an irreducible and aperiodic Markov chain will converge to its stationary distribution.
• The rate of convergence depends on properties of the transition probability.
Part 7
Monte Carlo Markov Chains
Applications of MCMC

• Simulation:
  Ex: $(x, y) \sim f(x, y) = c \binom{n}{x} y^{x+\alpha-1} (1-y)^{n-x+\beta-1}$,
  where $x = 0, 1, 2, \ldots, n$, $0 \le y \le 1$, and $\alpha, \beta$ are known.
• Integration: computing in high dimensions.
  Ex: $E[g(y)] = \int_0^1 g(y)\, f(y)\, dy$
• Bayesian Inference:
  Ex: posterior distributions, posterior means, ...
Monte Carlo Markov Chains

• MCMC methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution.
• The state of the chain after a large number of steps is then used as a sample from the desired distribution.
• http://en.wikipedia.org/wiki/MCMC
Inversion Method vs. MCMC (1)

• Inverse transform sampling, also known as the probability integral transform, is a method of sampling a number at random from any probability distribution given its cumulative distribution function (cdf).
• http://en.wikipedia.org/wiki/Inverse_transform_sampling_method
Inversion Method vs. MCMC (2)

• If a random variable $X$ has cdf $F$, then $F(X)$ has a uniform distribution on [0, 1].
• The inverse transform sampling method works as follows (see the sketch after this list):
  1. Generate a random number from the standard uniform distribution; call this $u$.
  2. Compute the value $x$ such that $F(x) = u$; call this $x_{chosen}$.
  3. Take $x_{chosen}$ to be the random number drawn from the distribution described by $F$.
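A minimal R sketch (not from the original slides): for the exponential distribution with rate 2, the cdf F(x) = 1 - exp(-2x) inverts in closed form to x = -log(1 - u)/2, so step 2 is explicit:

u <- runif(10000)                 # step 1: standard uniform draws
x <- -log(1 - u) / 2              # steps 2-3: invert F(x) = 1 - exp(-2x)
c(mean(x), 1 / 2)                 # sample mean vs. the theoretical mean 0.5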
Inversion Method vs. MCMC (3)

• For a one-dimensional random variable, the inversion method works well, but for two or more dimensions it may not.
• In two or more dimensions, the marginal distributions of the individual random variables can be difficult and time-consuming to compute.
Gibbs Sampling

• One kind of MCMC method.
• The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution.
• George Casella and Edward I. George. "Explaining the Gibbs sampler". The American Statistician, 46:167-174, 1992. (Basic summary and many references.)
• http://en.wikipedia.org/wiki/Gibbs_sampling
Example 1 (1)

• To sample $x$ from
  $(x, y) \sim f(x, y) = c \binom{n}{x} y^{x+\alpha-1} (1-y)^{n-x+\beta-1}$,
  where $x = 0, 1, 2, \ldots, n$, $0 \le y \le 1$, $n, \alpha, \beta$ are known, and $c$ is a constant.
• One can see that
  $f(x \mid y) = \frac{f(x, y)}{f(y)} \propto \binom{n}{x} y^{x} (1-y)^{n-x}$, i.e. $x \mid y \sim \text{Binomial}(n, y)$
  $f(y \mid x) = \frac{f(x, y)}{f(x)} \propto y^{x+\alpha-1} (1-y)^{n-x+\beta-1}$, i.e. $y \mid x \sim \text{Beta}(x+\alpha,\ n-x+\beta)$
Example 1 (2)

• Gibbs sampling algorithm (a compact R sketch follows this list):
  • Initial setting:
    $y_0 \sim \text{Uniform}(0, 1)$ or an arbitrary value in $(0, 1)$
    $x_0 \sim \text{Bin}(n, y_0)$
  • For $t = 0, \ldots, n$, sample a value $(x_{t+1}, y_{t+1})$ from
    $y_{t+1} \sim \text{Beta}(x_t + \alpha,\ n - x_t + \beta)$
    $x_{t+1} \sim \text{Bin}(n, y_{t+1})$
  • Return $(x_n, y_n)$.
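A compact R sketch of this two-step update (not from the original slides; the full course code appears on the "Gibbs sampling by R" slides below), using n = 16, α = 5, β = 7 as in Figure 1:

n <- 16; alpha <- 5; beta <- 7; n_iter <- 5000
x <- numeric(n_iter); y <- numeric(n_iter)
y[1] <- runif(1); x[1] <- rbinom(1, n, y[1])              # initial setting
for (t in 2:n_iter) {
  y[t] <- rbeta(1, x[t - 1] + alpha, n - x[t - 1] + beta) # y | x ~ Beta(x + alpha, n - x + beta)
  x[t] <- rbinom(1, n, y[t])                              # x | y ~ Binomial(n, y)
}
mean(x[-(1:500)])                                         # estimate of E(x) after burn-in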
Example 1 (3)

• Under regular conditions: $x_t \xrightarrow[t\to\infty]{} x$, $y_t \xrightarrow[t\to\infty]{} y$.
• How many steps are needed for convergence?
• Within an acceptable error, such as
  $\left|\ \frac{1}{10}\sum_{t=i+11}^{i+20} x_t - \frac{1}{10}\sum_{t=i+1}^{i+10} x_t\ \right| \le 0.001, \quad \exists\, i.$
• n is large enough, such as $n \ge 1000$.
Example 1 (4)

• Inversion Method:
• $x$ follows a Beta-Binomial distribution:
  $x \sim f(x) = \int_0^1 f(x, y)\, dy = \binom{n}{x}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \cdot \frac{\Gamma(x+\alpha)\,\Gamma(n-x+\beta)}{\Gamma(n+\alpha+\beta)}$
• The cdf of $x$ is $F(x)$, and $F$ evaluated at a random draw has a uniform distribution on [0, 1]:
  $F(x) = \sum_{i=0}^{x} \binom{n}{i}\, \frac{\Gamma(\alpha+\beta)\,\Gamma(i+\alpha)\,\Gamma(n-i+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(n+\alpha+\beta)}$
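A minimal R sketch (not from the original slides) of the Beta-Binomial pmf written directly with gamma functions, which can be used to cross-check the recursive computation inside rbetabin() below:

dbetabin <- function(x, n, a, b) {
  choose(n, x) * gamma(a + b) / (gamma(a) * gamma(b)) *
    gamma(x + a) * gamma(n - x + b) / gamma(n + a + b)
}
sum(dbetabin(0:16, 16, 5, 7))     # the pmf sums to 1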
Gibbs sampling by R (1)
N = 1000; num = 16; alpha = 5; beta = 7
tempy <- runif(1); tempx <- rbeta(1, alpha, beta)
j = 0; Forward = 1; Afterward = 0
# Burn-in: run sweeps of N Gibbs updates until the sweep mean of x stabilizes (or 1000 sweeps)
while((abs(Forward-Afterward) > 0.0001) && (j <= 1000)){
  Forward = Afterward; Afterward = 0
  for(i in 1:N){
    tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
    tempx <- rbinom(1, num, tempy)
    Afterward = Afterward+tempx
  }
  Afterward = Afterward/N; j = j+1
}
# Collect N Gibbs samples of (x, y)
sample <- matrix(0, nrow = N, ncol = 2)
for(i in 1:N){
  tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
  tempx <- rbinom(1, num, tempy)
Gibbs sampling by R (2)

  sample[i, 1] = tempx; sample[i, 2] = tempy
}
# Inverse-method draws for comparison (rbetabin is defined below)
sample_Inverse <- rbetabin(N, num, alpha, beta)
write(t(sample), "Sample for Ex1 by R.txt", ncol = 2)
Xhist <- cbind(hist(sample[, 1], nclass = num)$count,
               hist(sample_Inverse, nclass = num)$count)
write(t(Xhist), "Histogram for Ex1 by R.txt", ncol = 2)
# Column 1: exact Beta-Binomial pmf (computed recursively)
# Column 2: estimate averaged over the Gibbs samples of y
prob <- matrix(0, nrow = num+1, ncol = 2)
for(i in 0:num){
  if(i == 0){
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2]))
    prob[i+1, 1] = gamma(alpha+beta)*gamma(num+beta)
    prob[i+1, 1] = prob[i+1, 1]/(gamma(beta)*gamma(num+beta+alpha))
  }

Gibbs sampling by R (3)

  else{
    if(i == 1){
      prob[i+1, 1] = num*alpha/(num-1+alpha+beta)
      for(j in 0:(num-2))
        prob[i+1, 1] = prob[i+1, 1]*(beta+j)/(alpha+beta+j)
    }
    else
      prob[i+1, 1] = prob[i+1, 1]*(num-i+1)/(i)*(i-1+alpha)/(num-i+beta)
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2]) - pbinom(i-1, num, sample[, 2]))
  }
  if(i != num)
    prob[i+2, 1] = prob[i+1, 1]
}
write(t(prob), "ProbHistogram for Ex1 by R.txt", ncol = 2)
Inversion Method by R (1)
rbetabin <- function(N, size, alpha, beta){
  Usample <- runif(N)
  # P(X = 0) for the Beta-Binomial(size, alpha, beta) distribution
  Pr_0 = gamma(alpha+beta)*gamma(size+beta)/gamma(beta)/gamma(size+beta+alpha)
  # P(X = 1), built up as a product
  Pr = size*alpha/(size-1+alpha+beta)
  for(i in 0:(size-2))
    Pr = Pr*(beta+i)/(alpha+beta+i)
  Pr_Initial = Pr
  sample <- array(0, N)
  CDF <- array(0, (size+1))
  CDF[1] <- Pr_0
Inversion Method by R (2)

  for(i in 1:size){
    CDF[i+1] = CDF[i]+Pr
    Pr = Pr*(size-i)/(i+1)*(i+alpha)/(size-i-1+beta)   # recursion: P(X = i+1) from P(X = i)
  }
  for(i in 1:N){
    # invert: pick the value whose cumulative probability is closest to the uniform draw
    sample[i] = which.min(abs(Usample[i]-CDF))-1
  }
  return(sample)
}
Gibbs sampling by C/C++ (1)–(4)

[C/C++ source code shown as images in the original slides, with the inverse-method section marked.]

Inversion Method by C/C++ (1)–(2)

[C/C++ source code shown as images in the original slides.]
Plot Histograms by Maple (1)

• Figure 1: 1000 samples with n = 16, α = 5 and β = 7.
  Blue: Inversion method
  Red: Gibbs sampling

Plot Histograms by Maple (2)

[Figure 1: histograms of the two samples.]
Probability Histograms by Maple (1)

• Figure 2:
  • Blue histogram and yellow line are the pmf of x.
  • Red histogram is $\hat{P}(X = x) = \frac{1}{m}\sum_{i=1}^{m} P(X = x \mid Y_i = y_i)$ from Gibbs sampling.
Probability Histograms by Maple (2)

• The probability histogram of the blue histogram of Figure 1 would be similar to the blue probability histogram of Figure 2 when the sample size tends to infinity.
• The probability histogram of the red histogram of Figure 1 would be similar to the red probability histogram of Figure 2 when the number of iterations $n \to \infty$.

Probability Histograms by Maple (3)

[Figure 2: probability histograms comparing the exact pmf with the Gibbs-sampling estimate.]
Exercises

• Write your own programs similar to the examples presented in this talk, including Example 1 in Genetics and the other examples.
• Write programs for the examples mentioned at the reference web pages.
• Write programs for other examples that you know.