PowerPoint Template

Download Report

Transcript PowerPoint Template

An Introduction to MCMC
for Machine Learning
(Markov Chain Monte Carlo)
Young Ki Baik
Computer Vision Lab. SNU
References
• An introduction to MCMC for Machine
Learning
•
Andrieu et al. (Machine Learning 2003)
• Introduction to Monte Carlo methods
•
David MacKay.
• Markov Chain Monte Carlo for Computer
Vision
•
•
Zhu, Delleart and Tu. (a tutorial at ICCV05)
http://civs.stat.ucla.edu/MCMC/MCMC_tutorial.htm
• Various PPTs for MCMC in the web
Contents
• MCMC
•
•
•
•
•
•
•
Metropolis-Hasting algorithm
Mixture and cycles of MCMC kernels
Auxiliary variable sampler
Adaptive MCMC
Other application of MCMC
Convergence problem and Trick of MCMC
Remained Problems
• Conclusion
MCMC
• Problem of MC(Monte Carlo)
•
Assembling the entire distribution for MC is usually hard:
• Complicated energy landscapes
• High-dimensional system.
• Extraordinarily difficult normalization
• Solution : MCMC
•
•
•
•
Build up distribution from Markov chain
Choose local transition probabilities which generate distribution
of interest (ensure detailed balance)
Each random variable is chosen based on the previous
variable in the chain
“Walk” along the Markov chain until convergence reached
• Result : Normalization not required, calculation are
local…
MCMC
• What is Markov Chain?
•
A Markov chain is a mathematical model for stochastic
system that generates random variable X1, X2, …, Xt,
where the distribution
p( xt | x1, x2 ,, xt 1 )  p( xt | xt 1 )
xt 1
•
•
xt
xt 1
The distribution of the next random variable depends
only on the current random variable.
The entire chain represents a stationary probability
distribution.
MCMC
• What is Markov Chain Monte Carlo?
•
MCMC is general purpose technique for generating
fair samples from a probability in high-dimensional
space, using random numbers (dice) drawn from
uniform probability in certain range.
Markov chain
states
xt 1
xt
xt 1
Independent
trials of dice
zt 1
zt
zt 1
xt ~ px
zt ~ unif [a, b]
MCMC
• MCMC as a general purpose computing technique
•
Task 1: simulation: draw fair (typical) samples from a
probability which governs a system.
x ~ p x 
•
Task 2: Integration/computing in very high dimensions,
i.e. to compute
E  f  x    f ( x) p ( x)dx
•
Task 3: Optimization with an annealing scheme
x*  arg max p( x)
•
Task 4: Learning:
un supervised learning with hidden variables (simulated from
posterior) or MLE learning of parameters p(x;Θ) needs
simulations as well.
MCMC
• Some notation
xi     x1, x2 ,...,xs 
•
x i  : sample
 : statespace
s : index of state
i 
The stochastic process x is called a Markov chain if

 
p xi  | xi 1 ,...,x1  T xi  | xi 1
•

The chain is homogeneous if T  T xi  | xi 1  remains
i  i 1
invariant for all i, with x  T x | x   1 for any i.
i
•
Chain depends solely on the current state of the
chain and a fixed transition matrix.
MCMC
• Example
•
•
Transition graph for Markov chain
with three state (s=3)
Transition matrix
1
0
0
T   0 0.1 0.9
0.6 0.4 0 
•
0.1
1
x2
0.9
0.4
For the initial state  x1   0.5,0.2,0.3
x1
x3
0.6
 x1 T  0.2,0.6,0.2
 x1 TT  ~, ~, ~ 

 x1 T t  0.2,0.4,0.4
•
px 
This stability result plays a fundamental role in MCMC.
MCMC
• Convergence properties
•
For any starting point, the chain will convergence to the invariant
distribution p(x), as long as T is a stochastic transition matrix that
obeys the following properties:
1) Irreducibility
That is every state must be (eventually) reachable from every
other state.
2) Aperiodicity
This stops the chain from oscillating between different states
3) Reversibility (detailed balance) condition
This holds the system remain its stationary distribution.
 
  

px     px  T x  | x  
p xi  T xi 1 | xi   p xi 1 T xi  | xi 1
i 1
i
.
 
x  i1

 
i
i 1
discrete

p x i    p x i 1 K x i  | x i 1 dx i 1 continuous
Kernal, proposal distribution
MCMC
• Eigen-analysis
•
From the spectral theory, p(x) is the left eigenvector of
the matrix T with corresponding eigenvalue 1.
0.1000 0.5000 0.6000
T T  0.6000 0.2000 0.3000
0.3000 0.3000 0.1000
0.6396 0.7071  0.2673
E  0.6396  0.7071 0.8018 
0.4264 0.0000  0.5345
1.0000 0.0000 0.0000 
D  0.0000  0.7071 0.0000 
0.0000 0.0000  0.2000
•
T T E  ED
Stationary distribution
Eigenvalue v1 always 1
p  e1 / sume1 
The second largest eigenvalue determines the rate of
convergence of the chain, and should be as small as
possible.
Metropolis-Hastings algorithm
• The MH algorithm
•
The most popular MCMC method
• Invariant distgribution p(x)
• Proposal distribution q(x*|x)
• Candidate value x*
• Acceptance probability A(x,x*)
1.
2.
Initialize x 0 .
For i=0 to N-1
- Sample u ~ U [0,1]
- Sample x* ~ qx* | xi  
xi 1  x*
 
 
else
•
x i 1  x i 
Kernel K

 



 
KMH xi 1 | xi   q xi1 | xi  A xi  , xi 1   xi  xi1 r xi 
 
.



 


r x i    q x* | x i  1  A x i  , x* dx*
 





p x * q x i  | x * 
i 
*
i  
 p x q x |x 

- If u  Ax i  , x*   min1,
p xi  KMH xi 1 | xi   p xi 1 KMH xi  | xi 1

Metropolis-Hastings algorithm
• Results of running the MH algorithm
2
2
• Target distribution : px  0.3 exp 0.2x   0.7 exp 0.2x 10 
Proposal distribution




q x* | xi   N xi  ,100
Metropolis-Hastings algorithm
• Different choices of the proposal standard
deviation  *
•
MH requires careful design of the proposal distribution.
•
If  * is narrow, only 1 mode of p(x) might be visited.
•
•
If  * is too wide,
the rejection rate can be high.
If all the modes are visited while
the acceptance probability is high,
the chain is said to “mix” well.
Mixture and cycles of MCMC kernels
• Mixture and cycle
•
•
It is possible to combine several samplers into mixture
and cycles of individual samplers.
If transition kernels K1, K2 have invariant distribution,
then cycle hybrid kernel and mixture hybrid kernel
are also transition kernels.
vK1  1  vK2 , for 0  v  1
Mixture and cycles of MCMC kernels
• Mixtures of kernels
•
Incorporate global proposals to explore vast region
of the state space and local proposals to discover
finer details of the target distribution.
-> target distribution with many narrow peaks
(= reversible jump MCMC algorithm)
Mixture and cycles of MCMC kernels
• Cycles of kernels
•
Split a multivariate state vector into components (block)
-> It can be updated separately.
-> Blocking highly correlated variables
(= Gibbs sampling algorithm)
Auxiliary variable samplers
• Auxiliary variable
•
•
It is often easier to sample from an augmented
distribution p(x,u), where u is an auxiliary variable.
It is possible to obtain marginal samples x by sampling
(x, u), and ignoring the sample u.
• Hybrid Monte Carlo (HMC)
•
•
Use gradient approximation
Slice sampling
Adaptive MCMC
• Adaptive selection of proposal distribution
•
•
The variance of proposal distribution is important.
To automate the process of choosing the proposal distribution
as much as possible.
• Problem
•
•
Adaptive MCMC can disturb the stationary distribution.
Gelfand and Sahu(1994)
• Station distribution is disturbed despite the fact that each
participating kernel has the same stationary distribution.
• Avoidance
•
•
•
Carry out adaptation only initial fixed number of step.
Parallel chains
And so on…
-> inefficient, much more research is required.
Other application of MCMC
• Simulated annealing method for global
optimization
•
To find global maximum of p(x)
 
xˆ  arg max p x i 
x  i  ;i 1,...,N
• Monte Carlo EM
•
To find fast approximation for E-step
• Sequential Monte Carlo method and particle
filters
•
To carry out on-line approximation of probability
distributions using samples.
->using parallel sampling
Convergence problem and Trick of MCMC
• Convergence problem
•
Determining the length of the Markov chain is a difficult task.
• Trick
•
Initial set problem (for starting biases)
• Discards an initial set of samples (Burn-in)
• Set initial sample value manually.
20
25
20
15
10
5
0
0
•
100
200
300
400
500
600
700
800
900
1000
Markov chain test
• Apply several graphical and statistical tests to assess if the
chain has stabilized.
-> It doesn’t provide entirely satisfactory diagnostics.
• Study about convergence problem
Remained problems
• Large dimension model
•
The combination of sampling algorithm with either
gradient optimization or exact one.
• Massive data set
•
A few solution based on importance sampling have been
proposed.
• Many and varied applications…
-> But there is still great room for innovation in this area.
Conclusion
• MCMC
•
•
•
The Markov Chain Monte Carlo methods cover a variety
of different fields and applications.
There are great opportunities for combining existing
sub-optimal algorithms with MCMC in many machine
learning problems.
Some areas are already benefiting from sampling
methods include:
Tracking, restoration, segmentation
Probabilistic graphical models
Classification
Data association for localization
Classical mixture models.