Bayesian methods with Monte Carlo Markov Chains III


Part 6
Markov Chains
Markov Chains (1)

A Markov chain is a mathematical model for stochastic systems whose states, discrete or continuous, are governed by transition probabilities. Suppose the random variables X_0, X_1, ... take values in a state space (Ω) that is a countable set. A Markov chain is a process that corresponds to the network

X_0 -> X_1 -> ... -> X_{t-1} -> X_t -> X_{t+1} -> ...
Markov Chains (2)

The current state in a Markov chain depends only on the most recent previous state:

P(X_{t+1} = j | X_t = i, X_{t-1} = i_{t-1}, ..., X_0 = i_0)
  = P(X_{t+1} = j | X_t = i)   <- transition probability

where i_0, ..., i, j ∈ Ω.

http://en.wikipedia.org/wiki/Markov_chain
http://civs.stat.ucla.edu/MCMC/MCMC_tutorial/Lect1_MCMC_Intro.pdf
An Example of Markov Chains

Ω = {1, 2, 3, 4, 5}
X = (X_0, X_1, ..., X_t, ...), where X_0 is the initial state and so on.
P is the transition matrix:

         1    2    3    4    5
    1  0.4  0.6  0.0  0.0  0.0
    2  0.5  0.0  0.5  0.0  0.0
P = 3  0.0  0.3  0.0  0.7  0.0
    4  0.0  0.0  0.1  0.3  0.6
    5  0.0  0.3  0.0  0.5  0.2
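As a quick illustration (a minimal sketch, not from the original slides; the run length of 100000 steps is an arbitrary choice), this chain can be simulated in R and its long-run state frequencies inspected:

P <- matrix(c(0.4, 0.6, 0.0, 0.0, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0,
              0.0, 0.3, 0.0, 0.7, 0.0,
              0.0, 0.0, 0.1, 0.3, 0.6,
              0.0, 0.3, 0.0, 0.5, 0.2), nrow = 5, byrow = TRUE)
steps <- 100000                  # arbitrary run length for this sketch
x <- integer(steps); x[1] <- 1   # start from state 1
for(t in 2:steps)
  x[t] <- sample(1:5, 1, prob = P[x[t-1], ])  # one transition of the chain
table(x)/steps                   # empirical long-run state frequencies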
Definition (1)

Define the probability of going from state i to state j in n time steps as

p_ij^(n) = P(X_{t+n} = j | X_t = i)

A state j is accessible from state i if there is an n such that p_ij^(n) > 0, where n = 0, 1, ...

A state i is said to communicate with state j (denoted i <-> j) if it is true that both i is accessible from j and j is accessible from i.
Definition (2)

A state i has period d(i) if any return to state i must occur in multiples of d(i) time steps. Formally, the period of a state is defined as

d(i) = gcd{ n : P_ii^(n) > 0 }

If d(i) = 1, then the state is said to be aperiodic; otherwise (d(i) > 1), the state is said to be periodic with period d(i). For example, in a two-state chain that always switches states, every return takes an even number of steps, so each state has period 2.
Definition (3)

A set of states C is a communicating class if every pair of states in C communicates with each other.
Every state in a communicating class must have the same period.
Example: (figure omitted)
Definition (4)

A finite Markov chain is said to be irreducible if its state space (Ω) is a single communicating class; this means that, in an irreducible Markov chain, it is possible to get to any state from any state.
Example: (figure omitted)
Definition (5)

A finite-state irreducible Markov chain is said to be ergodic if its states are aperiodic.
Example: (figure omitted)
Definition (6)

A state i is said to be transient if, given that we start in state i, there is a non-zero probability that we will never return to i.
Formally, let the random variable T_i be the next return time to state i (the "hitting time"): T_i = min{ n : X_n = i | X_0 = i }.
Then, state i is transient iff P(T_i < ∞) < 1.
Definition (7)

A state i is said to be recurrent or persistent iff P(T_i < ∞) = 1.
The mean recurrence time is μ_i = E[T_i].
State i is positive recurrent if μ_i is finite; otherwise, state i is null recurrent.
A state i is said to be ergodic if it is aperiodic and positive recurrent. If all states in a Markov chain are ergodic, then the chain is said to be ergodic.
Stationary Distributions

Theorem: If a Markov chain is irreducible and aperiodic, then

P_ij^(n) -> 1/μ_j as n -> ∞, for all i, j ∈ Ω.

Theorem: If a Markov chain is irreducible and aperiodic, then there exists a unique

π_j = lim_{n->∞} P(X_n = j), with π_j = Σ_i π_i P_ij and Σ_j π_j = 1, j ∈ Ω,

where π is the stationary distribution.
Definition (8)

A Markov chain is said to be reversible if there is a stationary distribution π such that

π_i P_ij = π_j P_ji for all i, j ∈ Ω.

Theorem: if a Markov chain is reversible, then π_j = Σ_i π_i P_ij.
An Example of Stationary Distributions

A Markov chain on states {1, 2, 3} (transition diagram omitted) with transition matrix

    | 0.7  0.3  0.0 |
P = | 0.3  0.4  0.3 |
    | 0.0  0.3  0.7 |

The stationary distribution is π = (1/3, 1/3, 1/3):

                 | 0.7  0.3  0.0 |
(1/3  1/3  1/3)  | 0.3  0.4  0.3 |  =  (1/3  1/3  1/3)
                 | 0.0  0.3  0.7 |
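A quick way to confirm this (a minimal sketch, not part of the original slides) is to check πP = π directly in R, or to extract the left eigenvector of P associated with eigenvalue 1:

P <- matrix(c(0.7, 0.3, 0.0,
              0.3, 0.4, 0.3,
              0.0, 0.3, 0.7), nrow = 3, byrow = TRUE)
pi_vec <- rep(1/3, 3)
pi_vec %*% P                                   # returns (1/3, 1/3, 1/3)
e <- eigen(t(P))                               # left eigenvectors of P
v <- Re(e$vectors[, which.min(abs(e$values - 1))])
v/sum(v)                                       # normalized: (1/3, 1/3, 1/3)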
Properties of Stationary Distributions

Regardless of the starting point, an irreducible and aperiodic Markov chain will converge to its stationary distribution.

The rate of convergence depends on properties of the transition probability.
Part 7
Monte Carlo Markov Chains
Applications of MCMC

Simulation:
Ex: (x, y) ~ f(x, y) = c C(n, x) y^(x+α-1) (1-y)^(n-x+β-1),
where x = 0, 1, 2, ..., n; 0 ≤ y ≤ 1; α, β are known; and C(n, x) is the binomial coefficient.

Integration: computing expectations in high dimensions.
Ex: E[g(Y)] = ∫_0^1 g(y) f(y) dy

Bayesian Inference:
Ex: Posterior distributions, posterior means, ...
Monte Carlo Markov Chains

MCMC methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution.
The state of the chain after a large number of steps is then used as a sample from the desired distribution.
http://en.wikipedia.org/wiki/MCMC
Inversion Method vs. MCMC (1)

Inverse transform sampling, also known as the probability integral transform, is a method of sampling a number at random from any probability distribution given its cumulative distribution function (cdf).
http://en.wikipedia.org/wiki/Inverse_transform_sampling_method
Inversion Method vs. MCMC (2)

If a random variable X has cdf F, then F(X) has a uniform distribution on [0, 1].
The inverse transform sampling method works as follows:
1. Generate a random number from the standard uniform distribution; call this u.
2. Compute the value x such that F(x) = u; call this x_chosen.
3. Take x_chosen to be the random number drawn from the distribution described by F.
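As a minimal sketch of these three steps (not from the original slides), consider the Exponential(rate) distribution, where F^{-1}(u) = -log(1-u)/rate:

rate <- 2
u <- runif(10000)          # step 1: standard uniform draws
x <- -log(1-u)/rate        # step 2: x = F^{-1}(u)
c(mean(x), 1/rate)         # check: the sample mean is close to 1/rate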
Inversion Method vs. MCMC (3)

For a one-dimensional random variable, the inversion method works well; for two or more dimensions, it may not.
With two or more random variables, the marginal distributions required by the inversion method can be difficult and time-consuming to calculate.
Gibbs Sampling

Gibbs sampling is one kind of MCMC method.
The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution.
George Casella and Edward I. George. "Explaining the Gibbs sampler". The American Statistician, 46:167-174, 1992. (Basic summary and many references.)
http://en.wikipedia.org/wiki/Gibbs_sampling
Example 1 (1)

To sample x from:

(x, y) ~ f(x, y) = c C(n, x) y^(x+α-1) (1-y)^(n-x+β-1)

where x = 0, 1, 2, ..., n; 0 ≤ y ≤ 1; n, α, β are known; and c is a constant.

One can see that

f(x | y) = f(x, y)/f(y) ∝ C(n, x) y^x (1-y)^(n-x)      ~ Binomial(n, y)
f(y | x) = f(x, y)/f(x) ∝ y^(x+α-1) (1-y)^(n-x+β-1)    ~ Beta(x+α, n-x+β)
Example 1 (2)

Gibbs sampling Algorithm:

Initial Setting:
  y_0 ~ Uniform(0, 1) or an arbitrary value in (0, 1)
  x_0 ~ Bin(n, y_0)
For t = 0, ..., n, sample a value (x_{t+1}, y_{t+1}) from
  y_{t+1} ~ Beta(x_t + α, n - x_t + β)
  x_{t+1} ~ Bin(n, y_{t+1})
Return (x_n, y_n)
Example 1 (3)

Under regularity conditions, x_t -> x and y_t -> y as t -> ∞.

How many steps are needed for convergence? Two common stopping rules:
- The change is within an acceptable error, such as
  | Σ_{i=t+11}^{t+20} x_i / 10  -  Σ_{i=t+1}^{t+10} x_i / 10 | ≤ 0.001
- n is large enough, such as n ≥ 1000.
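A hedged sketch of that windowed-mean stopping rule in R (the function name and the window width of 10 are illustrative choices, not from the slides):

converged <- function(x, eps = 0.001){
  t <- length(x) - 20
  if(t < 0) return(FALSE)
  # compare the means of two consecutive blocks of 10 iterates
  abs(mean(x[(t+11):(t+20)]) - mean(x[(t+1):(t+10)])) <= eps
}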
Example 1 (4)

Inversion Method:
x follows a Beta-Binomial distribution:

x ~ f(x) = ∫_0^1 f(x, y) dy
         = C(n, x) [Γ(α+β) / (Γ(α)Γ(β))] [Γ(x+α)Γ(n-x+β) / Γ(n+α+β)]

The cdf of x is F(x), and F(X) has a uniform distribution on [0, 1]:

F(x) = Σ_{i=0}^{x} C(n, i) [Γ(α+β) / (Γ(α)Γ(β))] [Γ(i+α)Γ(n-i+β) / Γ(n+α+β)]
Gibbs sampling by R (1)

N = 1000; num = 16; alpha = 5; beta = 7
# initialize y and x with arbitrary values in (0, 1)
tempy <- runif(1); tempx <- rbeta(1, alpha, beta)
j = 0; Forward = 1; Afterward = 0
# burn-in: iterate Gibbs sweeps until the mean of x stabilizes
while((abs(Forward-Afterward) > 0.0001) && (j <= 1000)){
  Forward = Afterward; Afterward = 0
  for(i in 1:N){
    tempy <- rbeta(1, tempx+alpha, num-tempx+beta)  # y | x ~ Beta
    tempx <- rbinom(1, num, tempy)                  # x | y ~ Binomial
    Afterward = Afterward+tempx
  }
  Afterward = Afterward/N; j = j+1
}
# after burn-in, collect N samples of (x, y)
sample <- matrix(0, nrow = N, ncol = 2)
for(i in 1:N){
  tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
  tempx <- rbinom(1, num, tempy)
Gibbs sampling by R (2)

  sample[i, 1] = tempx; sample[i, 2] = tempy
}
# Inverse method: draw the comparison sample directly from the
# beta-binomial via the rbetabin() function defined below
sample_Inverse <- rbetabin(N, num, alpha, beta)
write(t(sample), "Sample for Ex1 by R.txt", ncol = 2)
Xhist <- cbind(hist(sample[, 1], nclass = num)$count,
               hist(sample_Inverse, nclass = num)$count)
write(t(Xhist), "Histogram for Ex1 by R.txt", ncol = 2)
# prob[, 1]: exact beta-binomial pmf; prob[, 2]: pmf estimated from the Gibbs draws of y
prob <- matrix(0, nrow = num+1, ncol = 2)
for(i in 0:num){
  if(i == 0){
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2]))
    prob[i+1, 1] = gamma(alpha+beta)*gamma(num+beta)
    prob[i+1, 1] = prob[i+1, 1]/(gamma(beta)*gamma(num+beta+alpha))
  }
Gibbs sampling by R (3)

  else{
    if(i == 1){
      prob[i+1, 1] = num*alpha/(num-1+alpha+beta)
      for(j in 0:(num-2))
        prob[i+1, 1] = prob[i+1, 1]*(beta+j)/(alpha+beta+j)
    }
    else
      # recursive update of the exact pmf from the previous term
      prob[i+1, 1] = prob[i+1, 1]*(num-i+1)/(i)*(i-1+alpha)/(num-i+beta)
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2]) -
                        pbinom(i-1, num, sample[, 2]))
  }
  if(i != num)
    prob[i+2, 1] = prob[i+1, 1]
}
write(t(prob), "ProbHistogram for Ex1 by R.txt", ncol = 2)
Inversion Method by R (1)
rbetabin <- function(N, size, alpha, beta){
Usample <- runif(N)
Pr_0 =
gamma(alpha+beta)*gamma(size+beta)/gamma(beta)/gam
ma(size+beta+alpha)
Pr = size*alpha/(size-1+alpha+beta)
for(i in 0:(size-2))
Pr = Pr*(beta+i)/(alpha+beta+i)
Pr_Initial = Pr
sample <- array(0,N)
CDF <- array(0, (size+1))
CDF[1] <- Pr_0
30
Inversion Method by R (2)

  # accumulate the cdf recursively over the support
  for(i in 1:size){
    CDF[i+1] = CDF[i]+Pr
    Pr = Pr*(size-i)/(i+1)*(i+alpha)/(size-i-1+beta)
  }
  # invert: map each uniform draw to the nearest cdf value
  for(i in 1:N){
    sample[i] = which.min(abs(Usample[i]-CDF))-1
  }
  return(sample)
}
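A brief usage sketch (not in the original slides), drawing the same comparison sample as above:

x_inv <- rbetabin(1000, 16, 5, 7)   # 1000 beta-binomial draws, n=16, α=5, β=7
hist(x_inv, nclass = 16)            # compare with the Gibbs histogram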
Gibbs sampling by C/C++ and Inversion Method by C/C++
(The original slides show the C/C++ code, including the inverse method, as images; they are omitted here.)
Plot Histograms by Maple

Figure 1: 1000 samples with n=16, α=5 and β=7.
Blue: inversion method; red: Gibbs sampling. (Figures omitted.)
Probability Histograms by Maple (1)

Figure 2: The blue histogram and yellow line are the pmf of x. The red histogram is

P̂(X = x) = (1/m) Σ_{i=1}^{m} P(X = x | Y_i = y_i)

from Gibbs sampling. (Figure omitted.)
Probability Histograms by Maple (2)

The probability histogram of the blue histogram of Figure 1 would be similar to the blue probability histogram of Figure 2 when the sample size N -> ∞.
The probability histogram of the red histogram of Figure 1 would be similar to the red probability histogram of Figure 2 when the iteration n -> ∞.
(Third figure omitted.)
Exercises

Write your own programs similar to those examples presented in this talk, including Example 1 in Genetics and other examples.
Write programs for those examples mentioned at the reference web pages.
Write programs for the other examples that you know.
Bayesian Methods with Monte Carlo Markov Chains III

Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
[email protected]
http://tigpbp.iis.sinica.edu.tw/courses.htm
Part 10
More Examples of Gibbs Sampling
An Example with Three Random Variables (1)

To sample (X, Y, N) from the joint density

f(x, y, n) = c C(n, x) y^(x+α-1) (1-y)^(n-x+β-1) e^(-λ) λ^n / n!

(reconstructed here to be consistent with the conditional distributions on the next slide), where x = 0, 1, 2, ..., n; 0 ≤ y ≤ 1; n = 0, 1, 2, ...; α, β, λ are known; and c is a constant.
An Example with Three Random Variables (2)

One can see that

f(x | y, n) ∝ C(n, x) y^x (1-y)^(n-x)                    ~ Binomial(n, y)
f(y | x, n) ∝ y^(x+α-1) (1-y)^(n-x+β-1)                  ~ Beta(x+α, n-x+β)
f(n-x | x, y) = e^(-(1-y)λ) [(1-y)λ]^(n-x) / (n-x)!      ~ Poisson((1-y)λ)
An Example with Three Random Variables (3)

Gibbs sampling Algorithm:
1. Initial Setting: t = 0,
   y_0 ~ Unif(0, 1) or an arbitrary value in (0, 1)
   n_0 ~ a discrete uniform distribution, or an arbitrary integer ≥ 1
   x_0 ~ Bin(n_0, y_0)
2. Sample a value (x_{t+1}, y_{t+1}, n_{t+1}) from
   y_{t+1} ~ Beta(x_t + α, n_t - x_t + β)
   n_{t+1} ~ x_t + Poisson((1 - y_{t+1}) λ)
   x_{t+1} ~ Bin(n_{t+1}, y_{t+1})
3. t = t + 1; repeat step 2 until convergence.
An Example with Three Random Variables by R

# 10000 samples with α=2, β=4 and λ=16
N = 10000; alpha = 2; beta = 4; lambda = 16
sample <- matrix(0, nrow = N, ncol = 3)
tempY <- runif(1); tempN <- 1
tempX <- rbinom(1, tempN, tempY)
j = 0; forward = 1; afterward = 0
# burn-in: iterate Gibbs sweeps until the mean of x stabilizes
while((abs(forward-afterward) > 0.001) && (j <= 1000)){
  forward = afterward; afterward = 0
  for(i in 1:N){
    tempY <- rbeta(1, tempX+alpha, tempN-tempX+beta)   # y | x, n
    tempN <- rpois(1, (1-tempY)*lambda)                # n - x | x, y
    tempN = tempN+tempX                                # recover n
    tempX <- rbinom(1, tempN, tempY)                   # x | y, n
    afterward = afterward+tempX
  }
  afterward = afterward/N; j = j+1
}
An Example with Three Random Variables by R (cont.)

# after burn-in, collect N samples of (x, y, n)
for(i in 1:N){
  tempY <- rbeta(1, tempX+alpha, tempN-tempX+beta)
  tempN <- rpois(1, (1-tempY)*lambda)
  tempN = tempN+tempX
  tempX <- rbinom(1, tempN, tempY)
  sample[i, 1] = tempX
  sample[i, 2] = tempY
  sample[i, 3] = tempN
}
An Example with 3 Random Variables by C
(The original slides show the C code, for 10000 samples with α=2, β=4 and λ=16, as images; they are omitted here.)
Example 1 in Genetics (1)

Two linked loci with alleles A and a, and B and b:
  A, B: dominant
  a, b: recessive
A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab.
(Figure of gamete formation omitted.)
Example 1 in Genetics (2)

Probabilities for genotypes in gametes:

          No recombination    Recombination
Male            1-r                 r
Female          1-r'                r'

          AB        ab         aB      Ab
Male    (1-r)/2   (1-r)/2     r/2     r/2
Female  (1-r')/2  (1-r')/2    r'/2    r'/2

(Figure of gamete formation omitted.)
Example 1 in Genetics (3)

Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79-92.
More:
http://en.wikipedia.org/wiki/Genetics
http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
Example 1 in Genetics (4)

Offspring genotypes and probabilities (rows: female gametes; columns: male gametes):

                    Male AB        Male ab        Male aB      Male Ab
                    (1-r)/2        (1-r)/2        r/2          r/2
Female AB (1-r')/2  AABB           aABb           aABB         AABb
                    (1-r)(1-r')/4  (1-r)(1-r')/4  r(1-r')/4    r(1-r')/4
Female ab (1-r')/2  AaBb           aabb           aaBb         Aabb
                    (1-r)(1-r')/4  (1-r)(1-r')/4  r(1-r')/4    r(1-r')/4
Female aB r'/2      AaBB           aabB           aaBB         AabB
                    (1-r)r'/4      (1-r)r'/4      rr'/4        rr'/4
Female Ab r'/2      AABb           aAbb           aABb         AAbb
                    (1-r)r'/4      (1-r)r'/4      rr'/4        rr'/4
Example 1 in Genetics (5)

Four distinct phenotypes: A*B*, A*b*, a*B* and a*b*.
  A*: the dominant phenotype from (Aa, AA, aA).
  a*: the recessive phenotype from aa.
  B*: the dominant phenotype from (Bb, BB, bB).
  b*: the recessive phenotype from bb.
A*B*: 9 gametic combinations.
A*b*: 3 gametic combinations.
a*B*: 3 gametic combinations.
a*b*: 1 gametic combination.
Total: 16 combinations.
Example 1 in Genetics (6)

Let φ = (1-r)(1-r'). Then

P(A*B*) = (2+φ)/4
P(A*b*) = P(a*B*) = (1-φ)/4
P(a*b*) = φ/4
Example 1 in Genetics (7)

Hence, a random sample of size n from the offspring of selfed heterozygotes will follow a multinomial distribution:

Multinomial(n; (2+φ)/4, (1-φ)/4, (1-φ)/4, φ/4)

We know that φ = (1-r)(1-r'), 0 ≤ r ≤ 1/2, and 0 ≤ r' ≤ 1/2, so

1/4 ≤ φ ≤ 1
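As a minimal sketch (the value φ = 0.6 is an arbitrary illustration, and n = 187 matches the total count of the data on the next slide), one such sample can be drawn in R:

phi <- 0.6; n <- 187
p <- c((2+phi)/4, (1-phi)/4, (1-phi)/4, phi/4)  # cell probabilities, sum to 1
rmultinom(1, n, p)                              # one multinomial draw of size n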
Example 1 in Genetics (8)

Suppose that we observe the data

y = (y1, y2, y3, y4) = (125, 18, 20, 24),

which is a random sample from Multinomial(n; (2+φ)/4, (1-φ)/4, (1-φ)/4, φ/4).
Then the probability mass function is

f(y | φ) = [n! / (y1! y2! y3! y4!)] ((2+φ)/4)^y1 ((1-φ)/4)^(y2+y3) (φ/4)^y4
Example 1 in Genetics (9)

How to estimate φ?
MME (shown last week):
http://en.wikipedia.org/wiki/Method_of_moments_%28statistics%29
MLE (shown last week):
http://en.wikipedia.org/wiki/Maximum_likelihood
Bayesian Method:
http://en.wikipedia.org/wiki/Bayesian_method
Example 1 in Genetics (10)

As the value of φ is between 1/4 and 1, we can assume that the prior distribution of φ is Unif(1/4, 1).
The posterior distribution is

f(φ | y) = f(y | φ) f(φ) / ∫ f(y | φ) f(φ) dφ

The integral in the denominator, ∫ f(y | φ) f(φ) dφ, does not have a closed form.
Example 1 in Genetics (11)

We will consider the mean of the posterior distribution (the posterior mean),

E(φ | y) = ∫ φ f(φ | y) dφ

The Monte Carlo Markov Chains method is a good method to estimate E(φ | y) even if ∫ f(y | φ) f(φ) dφ and the posterior mean do not have closed forms.
Example 1 by R

Direct numerical integration when φ ~ U(1/4, 1):

> y <- c(125, 18, 20, 24)
> phi <- runif(1000000, 1/4, 1)
> f_phi <- function(phi){((2+phi)/4)^y[1]*((1-phi)/4)^(y[2]+y[3])*(phi/4)^y[4]}
> mean(f_phi(phi)*phi)/mean(f_phi(phi))
[1] 0.573808

We can assume other prior distributions to compare the results of posterior means: Beta(1,1), Beta(2,2), Beta(2,3), Beta(3,2), Beta(0.5,0.5), Beta(1e-5,1e-5), as sketched below.
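A minimal sketch of that comparison (assuming the same self-normalized Monte Carlo ratio, with the draws now taken from each Beta prior; the helper name is illustrative, not from the slides):

posterior_mean_beta <- function(a, b, M = 1e6){
  phi <- rbeta(M, a, b)        # draws from the Beta(a, b) prior
  w <- f_phi(phi)              # likelihood from the code above
  mean(w*phi)/mean(w)          # posterior mean estimate under this prior
}
posterior_mean_beta(1, 1)      # compare with the tables on the next slides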
Example 1 by C/C++
(The original slides show the C/C++ code, with the prior replaced by other distributions such as Beta(1,1), ..., Beta(1e-5,1e-5), as images; they and the Beta prior figure are omitted here.)
Comparison for Example 1 (1)

Method                    Estimate φ̂    Method                      Estimate φ̂
MME                       0.683616      Bayesian Beta(2,3)          0.564731
MLE                       0.663165      Bayesian Beta(3,2)          0.577575
Bayesian U(1/4,1)         0.573931      Bayesian Beta(1/2,1/2)      0.574928
Bayesian Beta(1,1)        0.573918      Bayesian Beta(1e-5,1e-5)    0.588925
Bayesian Beta(2,2)        0.572103      Bayesian Beta(1e-7,1e-7)    shown below
Comparison for Example 1 (2)

Method                    Estimate φ̂    Method                      Estimate φ̂
Bayesian Beta(10,10)      0.559905      Bayesian Beta(1e-7,1e-7)    0.193891
Bayesian Beta(1e2,1e2)    0.520366      Bayesian Beta(1e-7,1e-7)    0.400567
Bayesian Beta(1e4,1e4)    0.500273      Bayesian Beta(1e-7,1e-7)    0.737646
Bayesian Beta(1e5,1e5)    0.500027      Bayesian Beta(1e-7,1e-7)    0.641388
Bayesian Beta(10^n,10^n),
  n -> ∞                  -> 0.5        Bayesian Beta(1e-7,1e-7)    not stationary
Part 11
Gibbs Sampling Strategy
Sampling Strategy (1)

Strategy I:
  Run one chain for a long time.
  After some "burn-in" period, sample points every fixed number of steps.
  (Diagram: burn-in period, then N samples taken from one chain.)
The code example of Gibbs sampling in the previous lecture uses sampling strategy I; a compact sketch follows below.
http://www.cs.technion.ac.il/~cs236372/tirgul09.ps
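A minimal sketch of Strategy I using the beta-binomial Gibbs updates from Example 1 (the burn-in length and thinning interval are arbitrary illustrative choices):

num <- 16; alpha <- 5; beta <- 7
burn <- 1000; keep <- 1000; thin <- 10
tempx <- 0                                  # start x at an arbitrary state
draws <- matrix(0, nrow = keep, ncol = 2)
for(t in 1:(burn + keep*thin)){
  tempy <- rbeta(1, tempx + alpha, num - tempx + beta)
  tempx <- rbinom(1, num, tempy)
  if(t > burn && (t - burn) %% thin == 0)   # keep every 10th post-burn-in draw
    draws[(t - burn)/thin, ] <- c(tempx, tempy)
}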
Sampling Strategy (2)

Strategy II:
  Run the chain N times, each run for M steps.
  Each run starts from a different state point.
  Return the last state in each run.
  (Diagram: burn-in for each chain, then N samples, one from the last state of each chain.)
Sampling Strategy (3)

Strategy II by R:
N = 100; num = 16; alpha = 5; beta = 7
sample <- matrix(0, nrow = N, ncol = 2)
for(k in 1:N){
  # each chain gets its own random starting point and burn-in
  tempy <- runif(1); tempx <- rbeta(1, alpha, beta)
  j = 0; Forward = 1; Afterward = 0
  while((abs(Forward-Afterward) > 0.001) && (j <= 100)){
    Forward = Afterward; Afterward = 0
    for(i in 1:N){
      tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
      tempx <- rbinom(1, num, tempy)
      Afterward = Afterward+tempx
    }
    Afterward = Afterward/N; j = j+1
  }
  # keep only the last state of this chain
  tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
  tempx <- rbinom(1, num, tempy)
  sample[k, 1] = tempx; sample[k, 2] = tempy
}
Sampling Strategy (4)

Strategy II by C/C++: (shown as images in the original slides; omitted here.)
Strategy Comparison

Strategy I:
  Performs "burn-in" only once, which saves time.
  Samples might be correlated (although only weakly).
Strategy II:
  Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity.
  Must perform "burn-in" steps for each chain, which costs more time.
Hybrid Strategies (1)

Run several chains and sample a few samples from each.
This combines the benefits of both strategies.
(Diagram: burn-in for each chain, then N samples drawn across the chains.)
Hybrid Strategies (2)

Hybrid Strategy by R:
tempN <- N; loc <- 1
sample <- matrix(0, nrow = N, ncol = 2)
while(loc != (N+1)){
  tempy <- runif(1); tempx <- rbeta(1, alpha, beta); j = 0
  pN <- floor(runif(1)*(N-loc))+1   # random number of samples for this chain
  cat(pN, '\n'); Forward = 1; Afterward = 0
  while((abs(Forward-Afterward) > 0.001) && (j <= 100)){
    Forward = Afterward; Afterward = 0
    for(i in 1:N){
      tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
      tempx <- rbinom(1, num, tempy)
      Afterward = Afterward+tempx
    }
    Afterward = Afterward/N; j = j+1
  }
  for(i in loc:(loc+pN-1)){
    tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
    tempx <- rbinom(1, num, tempy)
    sample[i, 1] <- tempx; sample[i, 2] <- tempy
  }
  loc <- i+1
}
Hybrid Strategies (3)

Hybrid Strategy by C/C++: (shown as images in the original slides; omitted here.)
Part 12
Metropolis-Hastings Algorithm
Metropolis-Hastings Algorithm (1)

The Metropolis-Hastings algorithm is another kind of MCMC method.
It can draw samples from any probability distribution Π(x), requiring only that a function proportional to the density can be calculated at x.
The process has three steps:
  Set up a Markov chain;
  Run the chain until it is stationary;
  Estimate with Monte Carlo methods.
http://en.wikipedia.org/wiki/Metropolis-Hastings_algorithm
Metropolis-Hastings Algorithm (2)

Let Π = (π_1, ..., π_n) be a probability density (or mass) function (pdf or pmf).
f is any function, and we want to estimate

I = E_Π[f] = Σ_{i=1}^{n} f(i) π_i

Construct P = {P_ij}, the transition matrix of an irreducible Markov chain with states 1, 2, ..., n, where

P_ij = Pr(X_{t+1} = j | X_t = i), X_t ∈ {1, 2, ..., n},

and Π is its unique stationary distribution.
Metropolis-Hastings Algorithm (3)

Run this Markov chain for times t = 1, ..., N and calculate the Monte Carlo sum

Î = (1/N) Σ_{t=1}^{N} f(X_t);

then Î -> I as N -> ∞.

Sheldon M. Ross (1997). Proposition 4.3. Introduction to Probability Models. 7th ed.
http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
Metropolis-Hastings Algorithm (4)

In order to perform this method for a given distribution Π, we must construct a Markov chain transition matrix P with Π as its stationary distribution, i.e. ΠP = Π.
Consider a matrix P made to satisfy the reversibility condition: for all i and j,

π_i P_ij = π_j P_ji

This property ensures that Σ_i π_i P_ij = π_j for all j, and hence Π is a stationary distribution for P.
Metropolis-Hastings Algorithm (5)

Let a proposal Q = {Q_ij} be irreducible, where Q_ij = Pr(X_{t+1} = j | X_t = i), and let the range of Q equal the range of Π.
But Π does not have to be a stationary distribution of Q.
Process: tweak Q_ij to yield Π.
(Diagram: states from Q_ij, not π  ->  tweak  ->  states from P_ij, π.)
Metropolis-Hastings Algorithm (6)

We assume that P_ij has the form

P_ij = Q_ij α(i, j)   (i ≠ j),
P_ii = 1 - Σ_{j≠i} P_ij,

where α(i, j) is called the acceptance probability; i.e., given X_t = i, take

X_{t+1} = j with probability α(i, j),
X_{t+1} = i with probability 1 - α(i, j).
Metropolis-Hastings Algorithm (7)

For i ≠ j, π_i P_ij = π_j P_ji
  <=>  π_i Q_ij α(i, j) = π_j Q_ji α(j, i)   (*)

WLOG, for some (i, j), π_i Q_ij > π_j Q_ji.
In order to achieve equality (*), one can introduce a probability α(i, j) < 1 on the left-hand side and set α(j, i) = 1 on the right-hand side.
Metropolis-Hastings Algorithm (8)

Then

π_i Q_ij α(i, j) = π_j Q_ji α(j, i) = π_j Q_ji
  =>  α(i, j) = π_j Q_ji / (π_i Q_ij)

These arguments imply that the acceptance probability α(i, j) must be

α(i, j) = min(1, π_j Q_ji / (π_i Q_ij))
Metropolis-Hastings Algorithm (9)

M-H Algorithm:
Step 1: Choose an irreducible Markov chain transition matrix Q with transition probability Q_ij.
Step 2: Let t = 0 and initialize X_0 from the states in Q.
Step 3 (Proposal Step): Given X_t = i, sample Y = j from Q_iY.
Metropolis-Hastings Algorithm (10)

M-H Algorithm (cont.):
Step 4 (Acceptance Step): Generate a random number U from Unif(0, 1).
  If U ≤ α(i, j), set X_{t+1} = Y = j;
  else X_{t+1} = X_t = i.
Step 5: t = t + 1; repeat Steps 3-5 until convergence. (A compact sketch of Steps 1-5 follows below.)
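A minimal sketch of the full algorithm in R (the target π and the random-walk proposal on states 1..5 are illustrative choices, not the lecture's example; a proposal outside {1, ..., 5} has π = 0 and is always rejected):

pi_target <- c(0.1, 0.2, 0.4, 0.2, 0.1)            # target distribution
steps <- 50000
x <- integer(steps); x[1] <- 1                     # Step 2: initialize
for(t in 2:steps){
  y <- x[t-1] + sample(c(-1, 1), 1)                # Step 3: symmetric proposal
  if(y < 1 || y > 5){ x[t] <- x[t-1]; next }       # pi(y) = 0: always reject
  alpha <- min(1, pi_target[y]/pi_target[x[t-1]])  # Q symmetric, so it cancels
  x[t] <- if(runif(1) < alpha) y else x[t-1]       # Step 4: accept or stay
}
table(x)/steps                                     # approximates pi_target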
Metropolis-Hastings Algorithm (11)

An example of Steps 3-5:
1. Sample Y1 from Q_{X0,Y} and compute
   α(X0, Y1) = min(1, π(Y1) Q_{Y1,X0} / (π(X0) Q_{X0,Y1}));
   accepted, so X1 = Y1.
2. Sample Y2 from Q_{X1,Y} and compute
   α(X1, Y2) = min(1, π(Y2) Q_{Y2,X1} / (π(X1) Q_{X1,Y2}));
   not accepted, so X2 = X1 = Y1.
3. Sample Y3 from Q_{X2,Y} and compute
   α(X2, Y3) = min(1, π(Y3) Q_{Y3,X2} / (π(X2) Q_{X2,Y3}));
   accepted, so X3 = Y3.
...and so on, up to Y_N and X_N.
(Diagram: proposals Y1, Y2, Y3, ... from Q_ij are tweaked into the chain X1 = Y1, X2 = Y1, X3 = Y3, ... from P_ij.)
Metropolis-Hastings Algorithm (12)

We may define a "rejection rate" as the proportion of times t for which X_{t+1} = X_t. Clearly, in choosing Q, high rejection rates are to be avoided.
Example: (figure omitted) if π(Y)/π(X_t) is small, it is likely that X_{t+1} = X_t, and more iterations of Steps 3-5 are needed.
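A hedged one-line helper (the function name is illustrative, not from the slides) for estimating the rejection rate from a sampled path x of successive scalar states X_t:

rejection_rate <- function(x) mean(x[-1] == x[-length(x)])  # share of t with X_{t+1} = X_t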
Example (1)

Simulate a bivariate normal distribution:

X = (X1, X2)^T ~ N_2(μ, Σ), with μ = (μ1, μ2)^T and Σ = [σ11 σ12; σ21 σ22],

i.e.

π(X) = exp(-(1/2) (X - μ)^T Σ^{-1} (X - μ)) / (2π |Σ|^{1/2})
Example (2)

Metropolis-Hastings Algorithm:
1. X_0 = μ, i = 0.
2. Generate independent U1 ~ U(-1, 1) and U2 ~ U(-1, 1); let U_i = (U1, U2)^T.
3. Y_i = X_i + U_i.
4. X_{i+1} = Y_i with probability α(X_i, Y_i) = min(1, π(Y_i)/π(X_i));
   X_{i+1} = X_i with probability 1 - α(X_i, Y_i).
5. i = i + 1; repeat steps 2-4 until convergence.
Example of M-H Algorithm by R (1)

# target density: bivariate normal (solve(sigma) is the inverse covariance)
Pi <- function(x, mu, sigma){
  exp(-0.5*((x-mu) %*% solve(sigma) %*% as.matrix(x-mu)))/(2*pi*sqrt(det(sigma)))
}
N <- 1000; mu <- c(3, 7)
sigma <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)
sample <- matrix(0, nrow = N, ncol = 2)
j = 0; tempX <- mu
# burn-in: sweep until the sample mean of X1 is close to mu[1]
while((j < 1000) && (abs(mean(sample[, 1])-mu[1]) > 0.001)){
  for(i in 1:N){
    tempU <- c(runif(1, -1, 1), runif(1, -1, 1))   # uniform proposal step
    tempY <- tempX+tempU
    if(min(c(Pi(tempY, mu, sigma)/Pi(tempX, mu, sigma), 1)) > runif(1)){
      tempX <- tempY; sample[i, ] <- tempY         # accept the proposal
    }
    else{
      tempX <- tempX; sample[i, ] <- tempX         # reject: stay at X_t
    }
  }
  j = j+1
}

Example of M-H Algorithm by R (2)

# after burn-in, collect N samples
for(i in 1:N){
  tempU <- c(runif(1, -1, 1), runif(1, -1, 1))
  tempY <- tempX+tempU
  if(min(c(Pi(tempY, mu, sigma)/Pi(tempX, mu, sigma), 1)) > runif(1)){
    tempX <- tempY; sample[i, ] <- tempY
  }
  else{
    tempX <- tempX; sample[i, ] <- tempX
  }
}
Example of M-H Algorithm by C
(The original slides show the C code as images; they are omitted here.)
A Figure to Check Simulation Results

Black points are simulated samples; colored points trace the probability density contours. (Figure omitted; its axes are X1 and X2.)

plot(sample, xlab = "X1", ylab = "X2")
j = 0
# overlay approximate density contours by scanning a grid of (x, y) values
for(i in seq(0.01, 0.3, 0.02)){
  for(x in seq(0, 6, 0.1)){
    for(y in seq(4, 11, 0.1)){
      if(abs(Pi(c(x, y), mu, sigma)-i) < 0.003)
        points(x, y, col = ((j)%%2+2), pch = 19)
    }
  }
  j = j+1
}
Exercises

Write your own programs similar to those examples presented in this talk, including Example 1 in Genetics and other examples.
Write programs for those examples mentioned at the reference web pages.
Write programs for the other examples that you know.