Bayesian methods with Monte Carlo Markov Chains III
Part 6
Markov Chains
Markov Chains (1)
A Markov chain is a mathematical model for stochastic systems whose states, discrete or continuous, are governed by transition probabilities.
Suppose the random variables $X_0, X_1, \ldots$ take values in a state space $\Omega$ that is a countable set. A Markov chain is a process that corresponds to the network
$$X_0 \rightarrow X_1 \rightarrow \cdots \rightarrow X_{t-1} \rightarrow X_t \rightarrow X_{t+1} \rightarrow \cdots$$
Markov Chains (2)
The current state of a Markov chain depends only on the most recent previous state:
$$P(X_{t+1} = j \mid X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j \mid X_t = i),$$
where $i_0, \ldots, i_{t-1}, i, j \in \Omega$; the right-hand side is the transition probability.
http://en.wikipedia.org/wiki/Markov_chain
http://civs.stat.ucla.edu/MCMC/MCMC_tutorial/Lect1_MCMC_Intro.pdf
An Example of Markov Chains
The state space is $\Omega = \{1, 2, 3, 4, 5\}$ and $X = (X_0, X_1, \ldots, X_t, \ldots)$, where $X_0$ is the initial state and so on. $P$ is the transition matrix:
$$P = \begin{pmatrix}
0.4 & 0.6 & 0.0 & 0.0 & 0.0 \\
0.5 & 0.0 & 0.5 & 0.0 & 0.0 \\
0.0 & 0.3 & 0.0 & 0.7 & 0.0 \\
0.0 & 0.0 & 0.1 & 0.3 & 0.6 \\
0.0 & 0.3 & 0.0 & 0.5 & 0.2
\end{pmatrix}$$
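As an illustration, the following minimal R sketch (added here; the starting state and run length are arbitrary choices, not from the slides) simulates this chain and tabulates how often each state is visited:

P <- matrix(c(0.4, 0.6, 0.0, 0.0, 0.0,
              0.5, 0.0, 0.5, 0.0, 0.0,
              0.0, 0.3, 0.0, 0.7, 0.0,
              0.0, 0.0, 0.1, 0.3, 0.6,
              0.0, 0.3, 0.0, 0.5, 0.2), nrow = 5, byrow = TRUE)
M <- 10000                                    # number of steps to simulate
X <- numeric(M); X[1] <- 1                    # initial state X_0 = 1
for(t in 2:M){
  # draw the next state from row X[t-1] of the transition matrix
  X[t] <- sample(1:5, 1, prob = P[X[t-1], ])
}
table(X)/M                                    # long-run visit frequencies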
Definition (1)
Define the probability of going from state $i$ to state $j$ in $n$ time steps as
$$p_{ij}^{(n)} = P(X_{t+n} = j \mid X_t = i).$$
A state $j$ is accessible from state $i$ if there exists some $n = 0, 1, \ldots$ such that $p_{ij}^{(n)} > 0$.
A state $i$ is said to communicate with state $j$ (denoted $i \leftrightarrow j$) if it is true both that $i$ is accessible from $j$ and that $j$ is accessible from $i$.
Definition (2)
A state $i$ has period $d(i)$ if any return to state $i$ must occur in multiples of $d(i)$ time steps. Formally, the period of a state is defined as
$$d(i) = \gcd\{n : P_{ii}^{(n)} > 0\}.$$
If $d(i) = 1$, then the state is said to be aperiodic; otherwise ($d(i) > 1$), the state is said to be periodic with period $d(i)$.
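For a concrete check, the sketch below (an added illustration with an assumed two-state chain, not from the slides) computes the period of state 1 as the gcd of the step counts at which a return is possible:

P <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)   # states alternate deterministically
Pn <- diag(2); returns <- c()
for(n in 1:20){
  Pn <- Pn %*% P                               # P^n
  if(Pn[1, 1] > 0) returns <- c(returns, n)    # return to state 1 possible in n steps
}
gcd <- function(a, b) if(b == 0) a else gcd(b, a %% b)
Reduce(gcd, returns)                           # 2: state 1 is periodic with period 2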
Definition (3)
A set of states C is a communicating
class if every pair of states in C
communicates with each other.
Every state in a communicating class must have the same period.
Example: (figure omitted)
Definition (4)
A finite Markov chain is said to be
irreducible if its state space (Ω) is a
communicating class; this means that, in
an irreducible Markov chain, it is possible to
get to any state from any state.
Example: (figure omitted)
Definition (5)
A finite-state irreducible Markov chain is said to be ergodic if its states are aperiodic.
Example: (figure omitted)
Definition (6)
A state $i$ is said to be transient if, given that we start in state $i$, there is a non-zero probability that we will never return to $i$.
Formally, let the random variable $T_i$ be the next return time to state $i$ (the "hitting time"):
$$T_i = \min\{n : X_n = i \mid X_0 = i\}.$$
Then, state $i$ is transient iff $P(T_i < \infty) < 1$.
Definition (7)
A state $i$ is said to be recurrent or persistent iff $P(T_i < \infty) = 1$.
The mean recurrence time is $\mu_i = E[T_i]$.
State $i$ is positive recurrent if $\mu_i$ is finite; otherwise, state $i$ is null recurrent.
A state $i$ is said to be ergodic if it is aperiodic and positive recurrent. If all states in a Markov chain are ergodic, then the chain is said to be ergodic.
Stationary Distributions
Theorem: If a Markov chain is irreducible and aperiodic, then
$$P_{ij}^{(n)} \rightarrow \frac{1}{\mu_j} \quad \text{as } n \rightarrow \infty, \quad \forall\, i, j.$$
Theorem: If a Markov chain is irreducible and aperiodic, then there exists a unique $\pi_j = \lim_{n \rightarrow \infty} P(X_n = j)$ with
$$\pi_j = \sum_i \pi_i P_{ij}, \qquad \sum_j \pi_j = 1,$$
where $\pi$ is the stationary distribution.
Definition (8)
A Markov chain is said to be reversible if there is a stationary distribution $\pi$ such that
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \forall\, i, j.$$
Theorem: If a Markov chain is reversible, then $\pi_j = \sum_i \pi_i P_{ij}$.
An Example of Stationary Distributions
A Markov chain on the states $\{1, 2, 3\}$ has transition matrix
$$P = \begin{pmatrix} 0.7 & 0.3 & 0.0 \\ 0.3 & 0.4 & 0.3 \\ 0.0 & 0.3 & 0.7 \end{pmatrix}$$
The stationary distribution is $\pi = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$, since
$$\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) \begin{pmatrix} 0.7 & 0.3 & 0.0 \\ 0.3 & 0.4 & 0.3 \\ 0.0 & 0.3 & 0.7 \end{pmatrix} = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right).$$
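A short numerical check (an R sketch added for illustration, not in the original slides) confirms this both by iterating the chain and via the left eigenvector of P for eigenvalue 1:

P <- matrix(c(0.7, 0.3, 0.0,
              0.3, 0.4, 0.3,
              0.0, 0.3, 0.7), nrow = 3, byrow = TRUE)
pi0 <- c(1, 0, 0)                     # an arbitrary starting distribution
for(t in 1:100) pi0 <- pi0 %*% P      # repeated one-step transitions
pi0                                   # converges to (1/3, 1/3, 1/3)
e <- eigen(t(P))                      # left eigenvectors of P
pi <- Re(e$vectors[, 1]); pi/sum(pi)  # normalized: (1/3, 1/3, 1/3)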
Properties of Stationary Distributions
Regardless of the starting point, an irreducible and aperiodic Markov chain will converge to its stationary distribution.
The rate of convergence depends on properties of the transition probability.
Part 7
Monte Carlo Markov Chains
Applications of MCMC
Simulation:
Ex: $(x, y) \sim f(x, y) = c \binom{n}{x} y^{x+\alpha-1} (1-y)^{n-x+\beta-1}$, where $x = 0, 1, 2, \ldots, n$, $0 \leq y \leq 1$, and $\alpha, \beta$ are known.
Integration: computing in high dimensions.
Ex: $E[g(y)] = \int_0^1 g(y) f(y)\, dy$
Bayesian inference:
Ex: Posterior distributions, posterior means…
Monte Carlo Markov Chains
MCMC methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution.
The state of the chain after a large number of steps is then used as a sample from the desired distribution.
http://en.wikipedia.org/wiki/MCMC
Inversion Method vs. MCMC (1)
Inverse transform sampling, also known
as the probability integral transform, is a
method of sampling a number at random
from any probability distribution given its
cumulative distribution function (cdf).
http://en.wikipedia.org/wiki/Inverse_transform_sampling_method
Inversion Method vs. MCMC (2)
If a random variable has cdf $F$, then $F$ of that variable has a uniform distribution on [0, 1].
The inverse transform sampling method works as follows:
1. Generate a random number from the standard uniform distribution; call this $u$.
2. Compute the value $x$ such that $F(x) = u$; call this $x_{\text{chosen}}$.
3. Take $x_{\text{chosen}}$ to be the random number drawn from the distribution described by $F$.
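A minimal sketch of these three steps (an added illustration assuming an Exponential(rate) target, not an example from the slides):

rate <- 2
u <- runif(10000)            # step 1: standard uniform draws
x <- -log(1 - u)/rate        # steps 2-3: solve F(x) = 1 - exp(-rate*x) = u
mean(x)                      # close to the true mean 1/rate = 0.5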
Inversion Method vs. MCMC (3)
For a one-dimensional random variable, the inversion method works well, but for two or more dimensions it may not.
With two or more random variables, the marginal distribution of each variable can be difficult and time-consuming to compute.
Gibbs Sampling
One kind of MCMC method.
The point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution.
George Casella and Edward I. George. "Explaining the Gibbs sampler". The American Statistician, 46:167-174, 1992. (Basic summary and many references.)
http://en.wikipedia.org/wiki/Gibbs_sampling
Example 1 (1)
To sample $(x, y)$ from:
$$(x, y) \sim f(x, y) = c \binom{n}{x} y^{x+\alpha-1} (1-y)^{n-x+\beta-1},$$
where $x = 0, 1, 2, \ldots, n$, $0 \leq y \leq 1$; $n, \alpha, \beta$ are known and $c$ is a constant.
One can see that
$$f(x \mid y) = \frac{f(x, y)}{f(y)} \propto \binom{n}{x} y^x (1-y)^{n-x} \sim \text{Binomial}(n, y),$$
$$f(y \mid x) = \frac{f(x, y)}{f(x)} \propto y^{x+\alpha-1} (1-y)^{n-x+\beta-1} \sim \text{Beta}(x+\alpha,\, n-x+\beta).$$
Example 1 (2)
Gibbs sampling algorithm:
Initial setting: $y_0 \sim \text{Uniform}(0, 1)$ or an arbitrary value in $(0, 1)$; $x_0 \sim \text{Bin}(n, y_0)$.
For $t = 0, \ldots, N$, sample a value $(x_{t+1}, y_{t+1})$ from
$$y_{t+1} \sim \text{Beta}(x_t+\alpha,\, n-x_t+\beta),$$
$$x_{t+1} \sim \text{Bin}(n, y_{t+1}).$$
Return $(x_N, y_N)$.
Example 1 (3)
Under regularity conditions, $x_t \rightarrow x$ and $y_t \rightarrow y$ as $t \rightarrow \infty$.
How many steps are needed for convergence?
Within an acceptable error, e.g., there exists $i$ such that
$$\left| \frac{1}{10} \sum_{t=i+11}^{i+20} x_t - \frac{1}{10} \sum_{t=i+1}^{i+10} x_t \right| \leq 0.001,$$
or $n$ is large enough, such as $n \geq 1000$.
Example 1 (4)
Inversion method:
$x$ has a Beta-Binomial distribution:
$$x \sim f(x) = \int_0^1 f(x, y)\, dy = \binom{n}{x} \frac{B(x+\alpha,\, n-x+\beta)}{B(\alpha, \beta)}.$$
The cdf of $x$ is $F(x)$, and $F$ of $x$ has a uniform distribution on [0, 1]:
$$F(x) = \sum_{i=0}^{x} \binom{n}{i} \frac{B(i+\alpha,\, n-i+\beta)}{B(\alpha, \beta)}.$$
Gibbs sampling by R
N = 1000; num = 16; alpha = 5; beta = 7
tempy <- runif(1); tempx <- rbeta(1, alpha, beta)
j = 0; Forward = 1; Afterward = 0
# Burn-in: run whole sweeps until the running mean of x stabilizes
while((abs(Forward-Afterward) > 0.0001) && (j <= 1000)){
  Forward = Afterward; Afterward = 0
  for(i in 1:N){
    tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
    tempx <- rbinom(1, num, tempy)
    Afterward = Afterward+tempx
  }
  Afterward = Afterward/N; j = j+1
}
# Collect N Gibbs samples of (x, y)
sample <- matrix(0, nrow = N, ncol = 2)
for(i in 1:N){
  tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
  tempx <- rbinom(1, num, tempy)
  sample[i, 1] = tempx; sample[i, 2] = tempy
}
# Inverse method (rbetabin is defined under "Inversion Method by R" below)
sample_Inverse <- rbetabin(N, num, alpha, beta)
write(t(sample), "Sample for Ex1 by R.txt", ncol = 2)
Xhist <- cbind(hist(sample[, 1], nclass = num)$counts,
               hist(sample_Inverse, nclass = num)$counts)
write(t(Xhist), "Histogram for Ex1 by R.txt", ncol = 2)
# Column 1: exact beta-binomial pmf built by recursion;
# column 2: pmf estimated from the Gibbs samples of y
prob <- matrix(0, nrow = num+1, ncol = 2)
for(i in 0:num){
  if(i == 0){
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2]))
    prob[i+1, 1] = gamma(alpha+beta)*gamma(num+beta)
    prob[i+1, 1] = prob[i+1, 1]/(gamma(beta)*gamma(num+beta+alpha))
  }
  else{
    if(i == 1){
      prob[i+1, 1] = num*alpha/(num-1+alpha+beta)
      for(j in 0:(num-2))
        prob[i+1, 1] = prob[i+1, 1]*(beta+j)/(alpha+beta+j)
    }
    else
      prob[i+1, 1] = prob[i+1, 1]*(num-i+1)/(i)*(i-1+alpha)/(num-i+beta)
    prob[i+1, 2] = mean(pbinom(i, num, sample[, 2])-pbinom(i-1, num, sample[, 2]))
  }
  if(i != num)
    prob[i+2, 1] = prob[i+1, 1]
}
write(t(prob), "ProbHistogram for Ex1 by R.txt", ncol = 2)
Inversion Method by R
rbetabin <- function(N, size, alpha, beta){
  Usample <- runif(N)
  # P(X = 0) for the beta-binomial distribution
  Pr_0 = gamma(alpha+beta)*gamma(size+beta)/gamma(beta)/gamma(size+beta+alpha)
  # P(X = 1), built up term by term
  Pr = size*alpha/(size-1+alpha+beta)
  for(i in 0:(size-2))
    Pr = Pr*(beta+i)/(alpha+beta+i)
  Pr_Initial = Pr
  sample <- array(0, N)
  CDF <- array(0, (size+1))
  CDF[1] <- Pr_0
  # Accumulate the cdf, updating the pmf by recursion
  for(i in 1:size){
    CDF[i+1] = CDF[i]+Pr
    Pr = Pr*(size-i)/(i+1)*(i+alpha)/(size-i-1+beta)
  }
  # Invert the cdf: take the state whose cumulative probability is nearest u
  for(i in 1:N){
    sample[i] = which.min(abs(Usample[i]-CDF))-1
  }
  return(sample)
}
Gibbs sampling by C/C++
(C/C++ code shown as images in the original slides: the Gibbs sampler and the inverse method for Example 1.)
Inversion Method by C/C++
(C/C++ code shown as images in the original slides.)
Plot Histograms by Maple
Figure 1: 1000 samples with n=16, α=5 and β=7. Blue: inversion method; red: Gibbs sampling. (Figures omitted.)
Probability Histograms by Maple (1)
Figure 2: The blue histogram and yellow line are the pmf of $x$. The red histogram is
$$\hat{P}(X = x) = \frac{1}{m} \sum_{i=1}^{m} P(X = x \mid Y_i = y_i)$$
from Gibbs sampling.
Probability Histograms by Maple (2)
The probability histogram of the blue histogram of Figure 1 would be similar to the blue probability histogram of Figure 2 as the sample size $\rightarrow \infty$.
The probability histogram of the red histogram of Figure 1 would be similar to the red probability histogram of Figure 2 as the number of iterations $n \rightarrow \infty$.
Exercises
Write your own programs similar to those
examples presented in this talk, including
Example 1 in Genetics and other examples.
Write programs for those examples
mentioned at the reference web pages.
Write programs for the other examples that
you know.
Bayesian Methods with Monte
Carlo Markov Chains III
Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
[email protected]
http://tigpbp.iis.sinica.edu.tw/courses.htm
Part 10
More Examples of Gibbs Sampling
An Example with Three Random Variables (1)
To sample $(X, Y, N)$ from the joint density
$$f(x, y, n) = c \binom{n}{x} y^{x+\alpha-1} (1-y)^{n-x+\beta-1} \frac{e^{-\lambda} \lambda^{n}}{n!}$$
(the form consistent with the conditional distributions on the next slide), where $x = 0, 1, 2, \ldots, n$, $0 \leq y \leq 1$, $n = 0, 1, 2, \ldots$; $\lambda, \alpha, \beta$ are known and $c$ is a constant.
An Example with Three Random Variables (2)
One can see that
$$f(x \mid y, n) \propto \binom{n}{x} y^x (1-y)^{n-x} \sim \text{Binomial}(n, y),$$
$$f(y \mid x, n) \propto y^{x+\alpha-1} (1-y)^{n-x+\beta-1} \sim \text{Beta}(x+\alpha,\, n-x+\beta),$$
$$f(n-x \mid x, y) \propto \frac{e^{-\lambda(1-y)} \left[\lambda(1-y)\right]^{n-x}}{(n-x)!} \sim \text{Poisson}(\lambda(1-y)).$$
An Example with Three Random Variables (3)
Gibbs sampling algorithm:
1. Initial setting: $t = 0$,
   $y_0 \sim \text{Unif}(0, 1)$ or an arbitrary value in $(0, 1)$;
   $n_0 \sim$ a discrete uniform distribution, or an arbitrary integer $\geq 1$;
   $x_0 \sim \text{Bin}(n_0, y_0)$.
2. Sample a value $(x_{t+1}, y_{t+1}, n_{t+1})$ from
   $y_{t+1} \sim \text{Beta}(x_t+\alpha,\, n_t-x_t+\beta)$,
   $n_{t+1} \sim x_t + \text{Poisson}(\lambda(1-y_{t+1}))$,
   $x_{t+1} \sim \text{Bin}(n_{t+1}, y_{t+1})$.
3. $t = t+1$; repeat step 2 until convergence.
An Example with Three Random Variables by R
10000 samples with α=2, β=4 and λ=16:
N = 10000; alpha = 2; beta = 4; lambda = 16
sample <- matrix(0, nrow = N, ncol = 3)
tempY <- runif(1); tempN <- 1
tempX <- rbinom(1, tempN, tempY)
j = 0; forward = 1; afterward = 0
# Burn-in: run whole sweeps until the running mean of x stabilizes
while((abs(forward-afterward) > 0.001) && (j <= 1000)){
  forward = afterward; afterward = 0
  for(i in 1:N){
    tempY <- rbeta(1, tempX+alpha, tempN-tempX+beta)
    tempN <- rpois(1, (1-tempY)*lambda)
    tempN = tempN+tempX                # n = x + (n - x)
    tempX <- rbinom(1, tempN, tempY)
    afterward = afterward+tempX
  }
  afterward = afterward/N; j = j+1
}
# Collect N Gibbs samples of (x, y, n)
for(i in 1:N){
  tempY <- rbeta(1, tempX+alpha, tempN-tempX+beta)
  tempN <- rpois(1, (1-tempY)*lambda)
  tempN = tempN+tempX
  tempX <- rbinom(1, tempN, tempY)
  sample[i, 1] = tempX
  sample[i, 2] = tempY
  sample[i, 3] = tempN
}
An Example with 3 Random Variables by C
(C code shown as images in the original slides; 10000 samples with α=2, β=4 and λ=16.)
Example 1 in Genetics (1)
Two linked loci with alleles A and a, and B
and b
A, B: dominant
a, b: recessive
A double heterozygote AaBb will produce
gametes of four types: AB, Ab, aB, ab
(Diagram omitted: the Aa and Bb chromosome pairs each segregate into gametes with probability 1/2.)
Example 1 in Genetics (2)
Probabilities for genotypes in gametes:

           No recombination   Recombination
  Male         1-r                r
  Female       1-r'               r'

           AB         ab         aB       Ab
  Male     (1-r)/2    (1-r)/2    r/2      r/2
  Female   (1-r')/2   (1-r')/2   r'/2     r'/2

(Diagram of gamete formation as on the previous slide.)
Example 1 in Genetics (3)
Fisher, R. A. and Balmukand, B. (1928).
The estimation of linkage from the offspring
of selfed heterozygotes. Journal of Genetics,
20, 79–92.
More:
http://en.wikipedia.org/wiki/Genetics
http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
Example 1 in Genetics (4)
Offspring genotype probabilities (FEMALE gamete × MALE gamete):

  FEMALE \ MALE   AB (1-r)/2           ab (1-r)/2           aB r/2           Ab r/2
  AB (1-r')/2     AABB (1-r)(1-r')/4   aABb (1-r)(1-r')/4   aABB r(1-r')/4   AABb r(1-r')/4
  ab (1-r')/2     AaBb (1-r)(1-r')/4   aabb (1-r)(1-r')/4   aaBb r(1-r')/4   Aabb r(1-r')/4
  aB r'/2         AaBB (1-r)r'/4       aabB (1-r)r'/4       aaBB rr'/4       AabB rr'/4
  Ab r'/2         AABb (1-r)r'/4       aAbb (1-r)r'/4       aABb rr'/4       AAbb rr'/4
Example 1 in Genetics (5)
Four distinct phenotypes:
A*B*, A*b*, a*B* and a*b*.
A*: the dominant phenotype from (Aa, AA, aA).
a*: the recessive phenotype from aa.
B*: the dominant phenotype from (Bb, BB, bB).
b*: the recessive phenotype from bb.
A*B*: 9 gametic combinations.
A*b*: 3 gametic combinations.
a*B*: 3 gametic combinations.
a*b*: 1 gametic combination.
Total: 16 combinations.
Example 1 in Genetics (6)
Let $\phi = (1-r)(1-r')$. Then
$$P(A^*B^*) = \frac{2+\phi}{4}, \quad P(A^*b^*) = P(a^*B^*) = \frac{1-\phi}{4}, \quad P(a^*b^*) = \frac{\phi}{4},$$
which sum to 1.
Example 1 in Genetics (7)
Hence, a random sample of size $n$ from the offspring of selfed heterozygotes will follow a multinomial distribution:
$$\text{Multinomial}\left(n;\, \frac{2+\phi}{4}, \frac{1-\phi}{4}, \frac{1-\phi}{4}, \frac{\phi}{4}\right).$$
We know that $\phi = (1-r)(1-r')$ with $0 \leq r \leq 1/2$ and $0 \leq r' \leq 1/2$, so $1/4 \leq \phi \leq 1$.
Example 1 in Genetics (8)
Suppose that we observe the data
$$y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 24),$$
which is a random sample from
$$\text{Multinomial}\left(n;\, \frac{2+\phi}{4}, \frac{1-\phi}{4}, \frac{1-\phi}{4}, \frac{\phi}{4}\right).$$
Then the probability mass function is
$$f(y \mid \phi) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\frac{2+\phi}{4}\right)^{y_1} \left(\frac{1-\phi}{4}\right)^{y_2+y_3} \left(\frac{\phi}{4}\right)^{y_4}.$$
Example 1 in Genetics (9)
How to estimate $\phi$?
MME (shown last week):
http://en.wikipedia.org/wiki/Method_of_moments_%28statistics%29
MLE (shown last week):
http://en.wikipedia.org/wiki/Maximum_likelihood
Bayesian method:
http://en.wikipedia.org/wiki/Bayesian_method
Example 1 in Genetics (10)
As the value of $\phi$ is between 1/4 and 1, we can assume that the prior distribution of $\phi$ is $\text{Unif}(\tfrac{1}{4}, 1)$.
The posterior distribution is
$$f(\phi \mid y) = \frac{f(y \mid \phi)\, f(\phi)}{\int f(y \mid \phi)\, f(\phi)\, d\phi}.$$
The integration in the denominator, $\int f(y \mid \phi)\, f(\phi)\, d\phi$, does not have a closed form.
Example 1 in Genetics (11)
We will consider the mean of the posterior distribution (the posterior mean),
$$E[\phi \mid y] = \int \phi\, f(\phi \mid y)\, d\phi.$$
The Monte Carlo Markov Chain method is a good way to estimate $E[\phi \mid y]$ even if $\int f(y \mid \phi)\, f(\phi)\, d\phi$ and the posterior mean do not have closed forms.
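For instance, a random-walk Metropolis sampler (a hedged sketch added here; this particular implementation and its step size are assumptions, not spelled out in the slides) can draw from $f(\phi \mid y)$ using only the unnormalized posterior, and its sample mean estimates $E[\phi \mid y]$:

y <- c(125, 18, 20, 24)
logpost <- function(phi){                    # log posterior up to a constant,
  if(phi <= 1/4 || phi >= 1) return(-Inf)    # with the Unif(1/4, 1) prior
  y[1]*log((2+phi)/4) + (y[2]+y[3])*log((1-phi)/4) + y[4]*log(phi/4)
}
M <- 50000; phi <- numeric(M); phi[1] <- 0.5
for(t in 2:M){
  prop <- phi[t-1] + runif(1, -0.1, 0.1)     # symmetric random-walk proposal
  if(log(runif(1)) < logpost(prop) - logpost(phi[t-1]))
    phi[t] <- prop                           # accept
  else
    phi[t] <- phi[t-1]                       # reject: stay put
}
mean(phi[-(1:5000)])                         # about 0.574 after burn-in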
Example 1 by R
Direct numerical integration when $\phi \sim U(\tfrac{1}{4}, 1)$:
> y <- c(125, 18, 20, 24)
> phi <- runif(1000000, 1/4, 1)
> f_phi <- function(phi){((2+phi)/4)^y[1]*((1-phi)/4)^(y[2]+y[3])*(phi/4)^y[4]}
> mean(f_phi(phi)*phi)/mean(f_phi(phi))
[1] 0.573808
We can assume other prior distributions to compare the results of posterior means: Beta(1,1), Beta(2,2), Beta(2,3), Beta(3,2), Beta(0.5,0.5), Beta(10^-5, 10^-5).
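The same Monte Carlo integration carries over to a Beta prior by reweighting uniform draws; a minimal sketch (the reweighting approach and the Beta(1,1) choice are illustrative assumptions added here):

phi <- runif(1000000, 0, 1)           # Beta priors live on (0, 1)
w <- f_phi(phi)*dbeta(phi, 1, 1)      # likelihood times prior density
mean(w*phi)/mean(w)                   # about 0.5739, matching the table below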
Example 1 by C/C++
(C/C++ code shown as images in the original slides; it replaces the prior with other distributions, such as Beta(1,1), …, Beta(1e-5,1e-5).)
Beta Prior
(Figure omitted.)
Comparison for Example 1 (1)
  Method                        Estimate $\hat{\phi}$
  MME                           0.683616
  MLE                           0.663165
  Bayesian U(¼,1)               0.573931
  Bayesian Beta(1,1)            0.573918
  Bayesian Beta(2,2)            0.572103
  Bayesian Beta(2,3)            0.564731
  Bayesian Beta(3,2)            0.577575
  Bayesian Beta(½,½)            0.574928
  Bayesian Beta(10^-5,10^-5)    0.588925
  Bayesian Beta(10^-7,10^-7)    shown below
Comparison for Example 1 (2)
  Method                          Estimate $\hat{\phi}$   Method (repeated runs)        Estimate $\hat{\phi}$
  Bayesian Beta(10,10)            0.559905                Bayesian Beta(10^-7,10^-7)    0.193891
  Bayesian Beta(10^2,10^2)        0.520366                Bayesian Beta(10^-7,10^-7)    0.400567
  Bayesian Beta(10^4,10^4)        0.500273                Bayesian Beta(10^-7,10^-7)    0.737646
  Bayesian Beta(10^5,10^5)        0.500027                Bayesian Beta(10^-7,10^-7)    0.641388
  Bayesian Beta(10^n,10^n), n→∞   0.5                     Bayesian Beta(10^-7,10^-7)    not stationary
Part 11
Gibbs Sampling Strategy
Sampling Strategy (1)
Strategy I:
Run one chain for a long time.
After some "burn-in" period, sample points every fixed number of steps.
(Diagram: burn-in period followed by N samples from one chain.)
The code example of Gibbs sampling in the previous lecture uses sampling strategy I.
http://www.cs.technion.ac.il/~cs236372/tirgul09.ps
Sampling Strategy (2)
Strategy II:
Run the chain N times, each run for M steps.
Each run starts from a different starting point.
Return the last state in each run.
(Diagram: burn-in for each chain; N samples, one from the last state of each chain.)
Sampling Strategy (3)
Strategy II by R:
N = 100; num = 16; alpha = 5; beta = 7
sample <- matrix(0, nrow = N, ncol = 2)
for(k in 1:N){
  # each of the N chains starts from its own random state
  tempy <- runif(1); tempx <- rbeta(1, alpha, beta)
  j = 0; Forward = 1; Afterward = 0
  while((abs(Forward-Afterward) > 0.001) && (j <= 100)){
    Forward = Afterward; Afterward = 0
    for(i in 1:N){
      tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
      tempx <- rbinom(1, num, tempy)
      Afterward = Afterward+tempx
    }
    Afterward = Afterward/N; j = j+1
  }
  # keep only the last state of this chain
  tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
  tempx <- rbinom(1, num, tempy)
  sample[k, 1] = tempx; sample[k, 2] = tempy
}
Sampling Strategy (4)
Strategy II by C/C++:
(Code shown as images in the original slides.)
Strategy Comparison
Strategy I:
Performs "burn-in" only once, which saves time.
Samples might be correlated (although only weakly).
Strategy II:
Better chance of "covering" the space, especially if the chain is slow to reach stationarity.
Must perform "burn-in" steps for each chain, which costs more time.
Hybrid Strategies (1)
Run several chains and take a few samples from each.
Combines the benefits of both strategies.
(Diagram: burn-in for each chain; N samples taken across the chains.)
Hybrid Strategies (2)
Hybrid Strategy by R (assumes N, num, alpha, beta from the earlier example):
tempN <- N; loc <- 1
sample <- matrix(0, nrow = N, ncol = 2)
while(loc != (N+1)){
  # start a fresh chain and burn it in
  tempy <- runif(1); tempx <- rbeta(1, alpha, beta); j = 0
  pN <- floor(runif(1)*(N-loc))+1      # how many samples to take from this chain
  cat(pN, '\n'); Forward = 1; Afterward = 0
  while((abs(Forward-Afterward) > 0.001) && (j <= 100)){
    Forward = Afterward; Afterward = 0
    for(i in 1:N){
      tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
      tempx <- rbinom(1, num, tempy)
      Afterward = Afterward+tempx
    }
    Afterward = Afterward/N; j = j+1
  }
  # take pN consecutive samples from this chain
  for(i in loc:(loc+pN-1)){
    tempy <- rbeta(1, tempx+alpha, num-tempx+beta)
    tempx <- rbinom(1, num, tempy)
    sample[i, 1] <- tempx; sample[i, 2] <- tempy
  }
  loc <- i+1
}
Hybrid Strategies (3)
Hybrid Strategy by C/C++:
(Code shown as images in the original slides.)
Part 12
Metropolis-Hastings Algorithm
Metropolis-Hastings Algorithm (1)
Another kind of MCMC method.
The Metropolis-Hastings algorithm can draw samples from any probability distribution $\Pi(x)$, requiring only that a function proportional to the density can be calculated at $x$.
Process in three steps:
Set up a Markov chain;
Run the chain until stationary;
Estimate with Monte Carlo methods.
http://en.wikipedia.org/wiki/Metropolis-Hastings_algorithm
Metropolis-Hastings Algorithm (2)
Let $\Pi = (\pi_1, \ldots, \pi_n)$ be a probability density (or mass) function (pdf or pmf).
$f$ is any function, and we want to estimate
$$I = E_{\Pi}[f] = \sum_{i=1}^{n} f(i)\, \pi_i.$$
Construct $P = \{P_{ij}\}$, the transition matrix of an irreducible Markov chain with states $1, 2, \ldots, n$, where
$$P_{ij} = \Pr(X_{t+1} = j \mid X_t = i), \qquad X_t \in \{1, 2, \ldots, n\},$$
and $\Pi$ is its unique stationary distribution.
Metropolis-Hastings Algorithm (3)
Run this Markov chain for times $t = 1, \ldots, N$ and calculate the Monte Carlo sum
$$\hat{I} = \frac{1}{N} \sum_{t=1}^{N} f(X_t);$$
then $\hat{I} \rightarrow I$ as $N \rightarrow \infty$.
Sheldon M. Ross (1997). Proposition 4.3. Introduction to Probability Models. 7th ed.
http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
Metropolis-Hastings Algorithm (4)
In order to perform this method for a given distribution $\Pi$, we must construct a Markov chain transition matrix $P$ with $\Pi$ as its stationary distribution, i.e. $\Pi P = \Pi$.
Suppose the matrix $P$ is made to satisfy the reversibility condition: for all $i$ and $j$,
$$\pi_i P_{ij} = \pi_j P_{ji}.$$
This property ensures that $\sum_i \pi_i P_{ij} = \pi_j$ for all $j$, and hence $\Pi$ is a stationary distribution for $P$.
Metropolis-Hastings Algorithm (5)
Let a proposal $Q = \{Q_{ij}\}$ be irreducible, where $Q_{ij} = \Pr(X_{t+1} = j \mid X_t = i)$, and let the range of $Q$ equal the range of $\Pi$.
But $\Pi$ does not have to be a stationary distribution of $Q$.
Process: tweak $Q_{ij}$ to yield $\Pi$.
(Diagram: states from $Q_{ij}$, not $\pi$ → tweak → states from $P_{ij}$, $\pi$.)
Metropolis-Hastings Algorithm (6)
We assume that $P_{ij}$ has the form
$$P_{ij} = Q_{ij}\, \alpha(i, j) \quad (i \neq j), \qquad P_{ii} = 1 - \sum_{j \neq i} P_{ij},$$
where $\alpha(i, j)$ is called the acceptance probability; i.e., given $X_t = i$, take
$$X_{t+1} = j \text{ with probability } \alpha(i, j), \qquad X_{t+1} = i \text{ with probability } 1 - \alpha(i, j).$$
Metropolis-Hastings Algorithm (7)
For $i \neq j$, we want $\pi_i P_{ij} = \pi_j P_{ji}$, i.e.
$$\pi_i Q_{ij}\, \alpha(i, j) = \pi_j Q_{ji}\, \alpha(j, i). \quad (*)$$
WLOG, for some $(i, j)$, $\pi_i Q_{ij} > \pi_j Q_{ji}$.
In order to achieve equality $(*)$, one can introduce a probability $\alpha(i, j) < 1$ on the left-hand side and set $\alpha(j, i) = 1$ on the right-hand side.
Metropolis-Hastings Algorithm (8)
Then
$$\pi_i Q_{ij}\, \alpha(i, j) = \pi_j Q_{ji}\, \alpha(j, i) = \pi_j Q_{ji} \implies \alpha(i, j) = \frac{\pi_j Q_{ji}}{\pi_i Q_{ij}}.$$
These arguments imply that the acceptance probability $\alpha(i, j)$ must be
$$\alpha(i, j) = \min\left(1,\, \frac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right).$$
Metropolis-Hastings Algorithm (9)
M-H algorithm:
Step 1: Choose an irreducible Markov chain transition matrix $Q$ with transition probabilities $Q_{ij}$.
Step 2: Let $t = 0$ and initialize $X_0$ from the states of $Q$.
Step 3 (Proposal step): Given $X_t = i$, sample $Y = j$ from $Q_{i,\cdot}$.
Metropolis-Hastings Algorithm (10)
M-H algorithm (cont.):
Step 4 (Acceptance step):
Generate a random number $U$ from $\text{Unif}(0, 1)$.
If $U \leq \alpha(i, j)$, set $X_{t+1} = Y = j$; else $X_{t+1} = X_t = i$.
Step 5: $t = t+1$; repeat Steps 3-5 until convergence.
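To make the steps concrete, here is a minimal R sketch (the three-state target $\Pi = (0.2, 0.3, 0.5)$ and the uniform proposal are assumptions for illustration, not from the slides):

Pi <- c(0.2, 0.3, 0.5)                       # target distribution
Q <- matrix(1/3, nrow = 3, ncol = 3)         # irreducible uniform proposal
M <- 20000; X <- numeric(M); X[1] <- 1       # Step 2: initialize
for(t in 2:M){
  i <- X[t-1]
  j <- sample(1:3, 1, prob = Q[i, ])         # Step 3: propose Y = j
  alpha <- min(1, Pi[j]*Q[j, i]/(Pi[i]*Q[i, j]))   # acceptance probability
  X[t] <- if(runif(1) <= alpha) j else i     # Step 4: accept or stay
}
table(X)/M                                   # close to (0.2, 0.3, 0.5)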
Metropolis-Hastings Algorithm (11)
An example of Steps 3-5:
1. Sample $Y_1$ from $Q_{X_0, \cdot}$ and compute
   $\alpha(X_0, Y_1) = \min\left(1,\, \frac{\pi(Y_1)\, Q_{Y_1, X_0}}{\pi(X_0)\, Q_{X_0, Y_1}}\right)$;
   accepted, so $X_1 = Y_1$.
2. Sample $Y_2$ from $Q_{X_1, \cdot}$ and compute
   $\alpha(X_1, Y_2) = \min\left(1,\, \frac{\pi(Y_2)\, Q_{Y_2, X_1}}{\pi(X_1)\, Q_{X_1, Y_2}}\right)$;
   not accepted, so $X_2 = X_1 = Y_1$.
3. Sample $Y_3$ from $Q_{X_2, \cdot}$ and compute
   $\alpha(X_2, Y_3) = \min\left(1,\, \frac{\pi(Y_3)\, Q_{Y_3, X_2}}{\pi(X_2)\, Q_{X_2, Y_3}}\right)$;
   accepted, so $X_3 = Y_3$; and so on.
(Diagram: proposals $Y_1, Y_2, Y_3, \ldots, Y_N$ from $Q_{ij}$ are tweaked into chain states $X_1 = Y_1$, $X_2 = Y_1$, $X_3 = Y_3$, $\ldots$, $X_N$ from $P_{ij}$.)
Metropolis-Hastings Algorithm (12)
We may define a "rejection rate" as the proportion of times $t$ for which $X_{t+1} = X_t$. Clearly, in choosing $Q$, high rejection rates are to be avoided.
Example: if the proposal $Y$ falls in a low-density region of $\pi$, then $\frac{\pi(Y)}{\pi(X_t)}$ will be small, and it is likely that $X_{t+1} = X_t$; more iterations of Steps 3-5 are needed.
(Diagram omitted.)
Example (1)
Simulate a bivariate normal distribution:
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_2(\mu, \Sigma), \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix},$$
i.e. $X$ has density
$$\frac{1}{2\pi\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (X-\mu)^T \Sigma^{-1} (X-\mu)\right).$$
Example (2)
Metropolis-Hastings algorithm:
1. Initialize $X_0$ (the R code below starts from $\mu$) and set $i = 0$.
2. Generate $U_1 \sim U(-1, 1)$ and $U_2 \sim U(-1, 1)$ independently; let $U_i = (U_1, U_2)^T$.
3. $Y_i = X_i + U_i$.
4. $X_{i+1} = Y_i$ w.p. $\alpha(X_i, Y_i) = \min\left(1,\, \frac{\pi(Y_i)}{\pi(X_i)}\right)$;
   $X_{i+1} = X_i$ w.p. $1 - \alpha(X_i, Y_i)$.
5. $i = i+1$; repeat steps 2-4 until convergence.
Example of M-H Algorithm by R
# bivariate normal density (solve(sigma) gives the covariance inverse)
Pi <- function(x, mu, sigma){
  exp(-0.5*((x-mu)%*%solve(sigma)%*%as.matrix(x-mu)))/(2*pi*sqrt(det(sigma)))
}
N <- 1000; mu <- c(3, 7)
sigma <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)
sample <- matrix(0, nrow = N, ncol = 2)
j = 0; tempX <- mu
# Burn-in: sweep until the sample mean of X1 settles near mu[1]
while((j < 1000) && (abs(mean(sample[, 1])-mu[1]) > 0.001)){
  for(i in 1:N){
    tempU <- c(runif(1, -1, 1), runif(1, -1, 1))
    tempY <- tempX+tempU
    if(min(c(Pi(tempY, mu, sigma)/Pi(tempX, mu, sigma), 1)) > runif(1)){
      tempX <- tempY; sample[i, ] <- tempY
    }
    else{
      sample[i, ] <- tempX
    }
  }
  j = j+1
}
# Final sweep: collect N samples
for(i in 1:N){
  tempU <- c(runif(1, -1, 1), runif(1, -1, 1))
  tempY <- tempX+tempU
  if(min(c(Pi(tempY, mu, sigma)/Pi(tempX, mu, sigma), 1)) > runif(1)){
    tempX <- tempY; sample[i, ] <- tempY
  }
  else{
    sample[i, ] <- tempX
  }
}
Example of M-H Algorithm by C
(C code shown as images in the original slides.)
A Figure to Check Simulation Results
Black points are simulated samples; colored points trace the probability density. (Figure omitted: scatter plot of the samples in the (X1, X2) plane with density contours.)
plot(sample, xlab = "X1", ylab = "X2")
j = 0
# overlay approximate density contours of the target
for(i in seq(0.01, 0.3, 0.02)){
  for(x in seq(0, 6, 0.1)){
    for(y in seq(4, 11, 0.1)){
      if(abs(Pi(c(x, y), mu, sigma)-i) < 0.003)
        points(x, y, col = ((j)%%2+2), pch = 19)
    }
  }
  j = j+1
}
Exercises
Write your own programs similar to those
examples presented in this talk, including
Example 1 in Genetics and other examples.
Write programs for those examples
mentioned at the reference web pages.
Write programs for the other examples that
you know.