Transcript Title

Suggested readings
• Historical notes
– Chap1, Andrieu, C., et al. (2003). An Introduction to MCMC for Machine
Learning. Machine Learning, 50, 5–43.
– Section 11.11 by Gelman.
• Markov chains
– Chap11, David J.C. MacKay. (2003). Information Theory, Inference, and
Learning Algorithms. Cambridge University Press.
http://www.cambridge.org/0521642981.
– Chap11, Grinstead & Snell, Introduction to Probability, found on internet.
– Section 7.1, Tan & Fox, (2007) Lecture Notes on Inverse Problems
• MCMC details
– Andrieu, C., et al. (2003). An Introduction to MCMC for Machine
Learning. Machine Learning, 50, 5–43.
– Section 11.2~11.6 by Gelman.
– Section 7.2, Tan & Fox, (2007) Lecture Notes on Inverse Problems
-1-
Markov Chain
• Weather in the land of Oz
– Land of Oz has unique weather system.
R
P
N S
R 1/2 1/4 1/4 
N 1/2 0 1/2 
S 1/4 1/4 1/2 
If N, likely to S and R the next day, not N again.
If R or S, 50% of the same the next day.
If changed, only half of this change to N.
Note that next day state is not affected by the previous states.
– Case study
• It is R today. What is the probability of S tomorrow ?
• What is the probability of S, day after tomorrow ?
R 1/2 1/4 1/4  1/2 1/4 1/4 
N 1/2 0 1/2  1/2 0 1/2 
S 1/4 1/4 1/2  1/4 1/4 1/2 
p13  p11 p13  p12 p23  p13 p33
• For i’th state, what is the probability of j’th state, day after tomorrow ?
pij  2  pik pkj or P 2  P2
• After many days like 100 days ? P100  P100
• Let the initial state is always R. Calculate states after 3, 10, 100 days.
Let the initial state has equal chance. Calculate states after 3, 10, 100 days.
-2-
Markov Chain
• Remarks
– For a set of states S={s1,s2,…,sr}, Markov chain is a process that a
new state sj is determined by current state si with probability pij.
Note that the new state is not affected by the previous states.
– pij is called transition probabilities, denoted as P(j|i) = pij.
The row vector, called probability vector (or distribution), sums to 1.
pij(n) of the matrix Pn gives the probability that we have state sj, starting
from state si, after n steps.
– As n→∞, Pn approaches limiting matrix W with all rows the same vector
w. w is obtained by using wP=w or w(P-I)=0. in fact, w is left
eigenvector of matrix P.
– Let u be arbitrary starting dist. Then probability distribution after n →∞
steps is uPn = w, which means that regardless of initial u, one gets
always the same distribution w after many steps.
-3-
Markov Chain Monte Carlo
• Introduction
– MCMC is a strategy for generating samples x(i) by exploring the states
using a Markov chain mechanism, such that the samples x(i) mimic
samples drawn from the target distribution p(x).
– MCMC is used when it is not possible (or not computationally efficient)
to sample q directly from p(q|y). Instead, sample iteratively such that at
each step of the process we expect to draw from a distribution that
becomes closer and closer to p(q|y).
– MCMC doesn’t require that p(q|y) be normalized.
– For a wide class of problems, this appears to be the easiest way to get
reliable results.
-4-
Markov Chain Monte Carlo
• Forward process: Markov process
– Stochastic process x(i) is called Markov chain if
  
p x   p x 
i
i 1
 T  x  | x  
i
i 1
or in more simple form p j  piTij
– Given a transition matrix T, find equilibrium distribution p.
1
– Example:


1/2 1/4 1/4  Start with p  .5, .2, .3 ,
2
1
Given T  1/2 0 1/2  then p   p  T  .425, .2, .375
1/4 1/4 1/2  and p    .4, .2, .4




• Inverse process: MCMC process
– Given a distribution p, construct transition matrix T of Markov chain.
 pj 
Tij  qij min 1,  , when i  j, and  j Tij  1
 pi 
where qij is an arbitrary matrix with symmetricity and non-zero elements
Then we obtain after Markov process, the target distribution p.
– Example: Given p  5, 11, 2 /18 Then T  ?
-5-
• Df
– In Markov simulation, sequences of simulation draws are created; each
sequence is produced by starting at an arbitrary point , and drawing
from a transition distribution . The transition probability distribution
should be constructed so that the Markov chain converges to a
stationary distribution which is the target of our concern.
– Recent survey places the Metropolis algorithm as one of the greatest
ten algorithms of science and engineering development of 20th century.
The algorithm is playing key role in sampling method known as MCMC.
These algorithms have played a significant role in statistics,
econometrics, physics and computing science over the last two decades.
-6-
Metropolis-Hastings algorithm
• Introduction
– Most popular MCMC method. Other MCMC algorithms are interpreted
as special cases or extensions of this algorithm.
• MH Algorithm
– A proposal distribution q(x*|x) is introduced.
– The MH step is to sample a new candidate x* given the current x based
on q(x*|x). Then the point is accepted with probability

 p  x  q  x | x 

*
*
A  x, x   min 1,
* 
p
x
q
x
|
x



 


*
Otherwise it remains at x.
– By repeating this steps, one obtains in the end the samples of p(x).
-7-
Metropolis-Hastings algorithm
• Metropolis algorithm
– A proposal distribution q(x*|x) is symmetric w.r.t x* and x. Then the ratio
 p  x  q  x | x 
is simplified.


 p  x  
*
*
*
A  x, x   min 1,
  min 1,

*
 p  x q  x | x  
 p  x  


*
Example: normal pdf at x* with mean at x equals to the vice versa.
• Practice with matlab
– Generate samples of this distribution using a proposal pdf

p  x   0.3exp  0.2 x 2   0.7 exp 0.2  x  10 
q  x* | x   N  x,10 
2

– As the random walk progresses, the number of samples are increased,
and the distribution converges to the target distribution.
0.1
0.09
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
-15
-10
-5
0
5
10
15
20
25
-8-
Metropolis-Hastings algorithm
• Notes
– MH requires careful design of the proposal distribution
– Allowing asymmetric jump rules can increase the speed of random walk.
– There is matlab function for MH algorithm.
• mhsample(1,N,'pdf',p,'proprnd',qrnd,'proppdf',q);
– Variants of MCMC are
• Hybrid Monte Carlo
• Slice sampling
• Reversible jump MCMC
-9-
Simulated annealing for global optimization
• Algorithm
– Target distribution is objective function in optimization problem.
After sampling via MH algorithm, the optimum solution is obtained as
the mode of the simulated distribution, i.e., xˆ  arg max p x  i 
i
x  :i 1,..., N
 
– Using just original function is inefficient. Simulated annealing was
developed to enhance speed of finding this, in which pdf value at
1
iteration i is replaced by
 
p x i 
Ti
where Ti is decreasing cooling schedule
, where Ti  0, i  1, 2,...
Ti 
1
C ln  i  T0 
– Thus SA is just a minor modification of standard MCMC algorithms.
- 10 -
Simulated annealing for global optimization
• Algorithm
0.35
0.1
0.09
0.3
0.08
0.25
0.07
0.06
0.2
0.05
0.15
0.04
0.03
0.1
0.02
0.05
0.01
0
-15
-10
-5
0
5
10
15
20
0
-15
25
- 11 -
-10
-5
0
5
10
15
20
25
Gibbs sampler
• Algorithm
– Particular Markov chain useful for multidimensional problems, also
called alternating conditional sampling.
– Let x=(x1,x2,…,xn). Each iteration of the Gibbs sampler cycles through
the components, drawing each parameter conditional on the values of
all the others. There are thus n steps in an iteration.
p  x j* | x j i 1 
where x j i 1   x1i ,..., x j 1i , x j 1i 1 ,..., xni 1 
– Each parameter xj is updated conditional on the latest values of the
other components which are
• the iteration i values for the already updated components and
• the iteration i-1 values for the others not updated yet.
- 12 -
Gibbs sampler
• Practice with matlab
– Generate samples of bivariate normal distribution with correlation r.
– To apply Gibbs, we need conditional distributions, using the properties of
multivariate normal distribution ((A.1) or (A.2) on page 579)
– Gibbs sampling where sample for conditional pdf drawn by matlab function.
– Gibbs sampling where sample for conditional pdf drawn by MH algorithm.
- 13 -
MH Algorithm for two parameters
• Remark
– Gibbs sampler samples alternatively for each component of the
parameter vector while MH algorithm samples altogether for the
parameter vector.
• Algorithm
– is outlined via example.
- 14 -
Inference and convergence
• Remarks
– If iterations not proceeded long enough, result can be erroneous.
– Recommended simulation is to employ multiple sequences with starting
points dispersed over the domain. Then monitor the variation between
and within the sequence and proceed until the within variation equals
the between variation. More detail is addressed in the textbook p296.
– Early iterations should be discarded. Usually the first half of each
sequence is discarded. This discarding activity is referred as ‘burn-in’.
– One can apply several graphical and statistical tests to assess, roughly,
if the chain has stabilized. In general, none of these tests provide
entirely satisfactory diagnostics.
– See another comment on the convergence at Chap. 29 Monte Carlo
Methods in David J.C. MacKay. (2003). Information Theory, Inference,
and Learning Algorithms. Cambridge University Press.
- 15 -
Homework
• Problem
– For the 3.7 Example: Bioassay experiment,
1. Perform Metropolis algorithm of two parameters to obtain samples of
the posterior distribution.
2. Draw scatter plot.
3. Draw traces of each parameters and sequence of points in 2-D domain.
4. Compute mean, 95% conf. intervals, variances and correlation of the
two parameters.
5. Try different initial point, different proposal density and compare the
results 4.
- 16 -