Markov Chain Monte Carlo and Statistics


Jun Liu
Department of Statistics
Stanford University
Based on joint work with F. Liang and W.H. Wong.
7/16/2015
The Basic Problems of Monte Carlo
• Draw a random variable
$$X \sim \pi(x),$$
sometimes with an unknown normalizing constant.
• Estimate the integral
$$I = \int f(x)\,\pi(x)\,dx = E_\pi f(X)$$
How to Sample from π(x)
• The Inversion Method. If $U \sim \mathrm{Unif}(0,1)$, then
$$X = F^{-1}(U) \sim \pi, \quad \text{where } F \text{ is the cdf of } \pi.$$
• The Rejection Method. Use an "envelope" distribution $c\,g(x)$ with $c\,g(x) \ge \pi(x)$:
– Generate $x$ from $g(x)$;
– Draw $u$ from Unif(0,1);
– Accept $x$ if $u < \pi(x)/c\,g(x)$;
– The accepted $x$ follows $\pi(x)$.
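A minimal sketch of the rejection method in Python (our illustration; the target $x^2(1-x)$ on [0,1], the uniform envelope, and the constant c = 0.2 are assumptions chosen so that c g(x) dominates the target):

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_unnorm(x):
    # Unnormalized target on [0, 1] (illustrative choice; max is 4/27 < 0.2).
    return x**2 * (1.0 - x)

def rejection_sample(n, c=0.2):
    # Envelope c * g(x) with g = Unif(0, 1), so c * g(x) >= pi(x) everywhere.
    samples = []
    while len(samples) < n:
        x = rng.uniform()           # generate x from g(x)
        u = rng.uniform()           # draw u from Unif(0, 1)
        if u < pi_unnorm(x) / c:    # accept if u < pi(x) / (c * g(x)); here g(x) = 1
            samples.append(x)       # the accepted x follows pi
    return np.array(samples)
```

Note that only the ratio $\pi(x)/c\,g(x)$ is needed, so $\pi$ may be known only up to its normalizing constant.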
High Dimensional Problems?
Ising Model: $X = (x_\sigma)$, $\sigma \in \Lambda$ (lattice points), $x_\sigma \in \{-1, +1\}$,
$$\pi(X) = Z^{-1}\exp\{\mathrm{Eng}(X)\}, \quad \text{where } \mathrm{Eng}(X) = \frac{1}{T}\sum_{\sigma \sim \sigma'} x_\sigma x_{\sigma'}$$
and $Z$ is the partition function.
Metropolis Algorithm:
(a) pick a lattice point, say $\sigma$, at random;
(b) flip the current $x_\sigma$ to $-x_\sigma$ (so $X^{(t)} \to X^*$);
(c) compute $r = \pi(X^*)/\pi(X^{(t)})$;
(d) make the acceptance/rejection decision.
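A minimal sketch of this single-site sampler (our illustration, assuming ±1 spins on an L x L torus and the energy convention above):

```python
import numpy as np

rng = np.random.default_rng(1)

def ising_metropolis_sweep(X, T=2.0):
    """One sweep of single-site Metropolis for pi(X) ∝ exp{(1/T) sum x_s x_s'},
    spins in {-1, +1} on an L x L torus (periodic boundary, our convention)."""
    L = X.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L), rng.integers(L)   # (a) pick a lattice point at random
        nbrs = (X[(i + 1) % L, j] + X[(i - 1) % L, j]
                + X[i, (j + 1) % L] + X[i, (j - 1) % L])
        # Flipping x_ij changes Eng by -2 * x_ij * nbrs / T, so
        r = np.exp(-2.0 * X[i, j] * nbrs / T)     # (c) r = pi(X*) / pi(X^(t))
        if rng.uniform() < min(1.0, r):           # (d) accept/reject
            X[i, j] = -X[i, j]                    # (b) flip the spin
    return X
```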
General Metropolis-Hastings Recipe
• Start with any $X^{(0)} = x_0$ and a "proposal chain" $T(x, y)$.
• Suppose $X^{(t)} = x_t$. At time $t+1$,
– Draw $y \sim T(x_t, \cdot)$ (i.e., propose a move for the next step)
– Compute the Metropolis ratio (or "goodness" ratio)
$$r = \frac{\pi(y)\,T(y, x_t)}{\pi(x_t)\,T(x_t, y)}$$
– Acceptance/rejection decision: let
$$X^{(t+1)} = \begin{cases} y, & \text{with } p = \min\{1, r\} \\ x_t, & \text{with } 1 - p \quad (\text{"thinning down"}) \end{cases}$$
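A minimal random-walk sketch of this recipe (our illustration; with a symmetric Gaussian proposal the T terms cancel in r):

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis_hastings(log_pi, x0, n_steps, step=1.0):
    """Random-walk proposal T(x, y) = N(y; x, step^2 I), which is symmetric,
    so r reduces to pi(y)/pi(x); log scale avoids overflow."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = [x.copy()]
    for _ in range(n_steps):
        y = x + step * rng.standard_normal(x.shape)   # draw y ~ T(x, .)
        log_r = log_pi(y) - log_pi(x)                 # Metropolis ratio, log scale
        if np.log(rng.uniform()) < log_r:             # accept with p = min{1, r}
            x = y
        chain.append(x.copy())                        # rejected moves keep x_t
    return np.array(chain)

# Example: sample a standard bivariate normal.
chain = metropolis_hastings(lambda z: -0.5 * float(z @ z), np.zeros(2), 5000)
```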
• The detailed balance: for $x \neq y$, the actual transition probability from $x$ to $y$ satisfies
$$\pi(x)\,T(x,y)\,\min\left\{1,\ \frac{\pi(y)\,T(y,x)}{\pi(x)\,T(x,y)}\right\}
= \min\{\pi(x)\,T(x,y),\ \pi(y)\,T(y,x)\}
= \pi(y)\,T(y,x)\,\min\left\{1,\ \frac{\pi(x)\,T(x,y)}{\pi(y)\,T(y,x)}\right\},$$
where the right-hand side is $\pi(y)$ times the transition probability from $y$ to $x$.
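To make the identity concrete, here is a tiny numerical check (our own, not from the talk) on a three-state chain with an arbitrary proposal matrix T:

```python
import numpy as np

# Verify that the Metropolis-Hastings kernel A satisfies
# pi(x) A(x, y) = pi(y) A(y, x), hence pi is invariant.
pi = np.array([0.2, 0.3, 0.5])
T = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5]])   # arbitrary proposal chain (rows sum to 1)

A = np.zeros((3, 3))
for x in range(3):
    for y in range(3):
        if x != y:
            r = (pi[y] * T[y, x]) / (pi[x] * T[x, y])
            A[x, y] = T[x, y] * min(1.0, r)   # actual off-diagonal transition prob
    A[x, x] = 1.0 - A[x].sum()                # rejection mass stays at x

D = pi[:, None] * A
assert np.allclose(D, D.T)        # detailed balance
assert np.allclose(pi @ A, pi)    # pi is an invariant distribution of A
```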
General Markov Chain Simulation
• Question: how to simulate from a target
distribution π(X) via a Markov chain?
• Key: find a transition function $A(x, y)$ so that
$$f_0 A^n \to \pi,$$
that is, π is an invariant distribution of A.
• Different from traditional Markov Chain
theory.
If the actual transition probability is
$$\pi(y)\,\delta(x, y), \quad \text{where } \delta(x, y) \text{ is a symmetric function of } x, y,$$
then the chain has π(x) as its invariant distribution. (I learnt it from Stein.)
For Metropolis-Hastings this holds because
$$T(x,y)\,\min\left\{1,\ \frac{\pi(y)\,T(y,x)}{\pi(x)\,T(x,y)}\right\}
= \pi(y)\,\min\left\{\frac{T(x,y)}{\pi(y)},\ \frac{T(y,x)}{\pi(x)}\right\}.$$

• The moves are very "local".
• The chain tends to be trapped in a local mode. Example:
$$\pi(x, y) \propto e^{-\frac{1}{2}\left(x^2 + \frac{y^2}{0.25^2}\right)} + 2\,e^{-\frac{1}{2}\left((x-5)^2 + \frac{(y-5)^2}{0.25^2}\right)}$$
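A quick illustration of the trapping (our own sketch, using the two-mode density above): a local random-walk Metropolis chain started at the origin essentially never reaches the mode at (5, 5).

```python
import numpy as np

rng = np.random.default_rng(3)

def log_pi(z):
    # Two narrow modes: one at (0, 0), a heavier one at (5, 5).
    x, y = z
    m0 = np.exp(-0.5 * (x**2 + y**2 / 0.25**2))
    m1 = 2.0 * np.exp(-0.5 * ((x - 5)**2 + (y - 5)**2 / 0.25**2))
    return np.log(m0 + m1 + 1e-300)

z = np.zeros(2)
visits_far_mode = 0
for _ in range(20000):
    znew = z + 0.2 * rng.standard_normal(2)   # small, "local" proposal
    if np.log(rng.uniform()) < log_pi(znew) - log_pi(z):
        z = znew
    visits_far_mode += z[0] > 2.5
print("fraction of time near (5,5):", visits_far_mode / 20000)  # ~0 in practice
```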
Other Approaches?
• Gibbs sampler/Heat Bath: better or worse?
• Random directional search --- should be
better if we can do it. “Hit-and-run.”
• Adaptive directional sampling (ADS)
(Gilks, Roberts and George, 1994).
[Figure: multiple parallel chains at iteration t, with current points $x_a$ and $x_c$ (the ADS setup).]
Gibbs Sampler/Heat Bath
• Define a “neighborhood” structure N(x)
– can be a line, a subspace, trace of a group, etc.
• Sample from the conditional distribution.
• Conditional move along a chosen direction:
$$x_{\text{new}} \sim p(x) \propto \pi(x)\,\big|_{x \in N(x_{\text{old}})}$$
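For instance (a standard textbook case, not specific to the talk), the Gibbs sampler for a bivariate normal takes the neighborhood to be a coordinate line and draws exactly from each conditional:

```python
import numpy as np

rng = np.random.default_rng(4)

def gibbs_bivariate_normal(rho, n_steps):
    """Heat-bath / Gibbs sampler for N2(0, [[1, rho], [rho, 1]]): each
    conditional move samples exactly from pi restricted to a coordinate line."""
    x = y = 0.0
    out = np.empty((n_steps, 2))
    for t in range(n_steps):
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # x | y ~ N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # y | x ~ N(rho*x, 1 - rho^2)
        out[t] = x, y
    return out
```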
How to sample along a line?
• What is the correct conditional distribution?

– Random direction: $x \to x' = x + t\,\vec r$, with
$$p(t) \propto \pi(x + t\,\vec r)$$
– Directions chosen a priori: the same as above.
– In ADS? $x_c \to x_c' = x_c + t\,\vec r(x_a, x_c)$, with
$$p(t) \propto |t|^{d-1}\,\pi(x_a + t\,\vec r)$$
The Snooker Theorem
• Suppose $x \sim \pi$ and $y$ is any point in the $d$-dimensional space. Let $\vec r = (x - y)/|x - y|$. If $t$ is drawn from
$$p(t) \propto |t|^{d-1}\,\pi(y + t\,\vec r),$$
then $x' = y + t\,\vec r$ follows the target distribution π.
• If $y$ is generated from π's distribution, the new point $x'$ is independent of $y$.
[Figure: the density $p(t)$ along the line through the anchor $y$ and the point $x$.]
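A sketch of one snooker move (our own; the fixed grid `t_grid` is a crude stand-in for an exact draw from p(t), see the "Another Hurdle" slide below):

```python
import numpy as np

rng = np.random.default_rng(5)

def snooker_move(x, y, log_pi, t_grid):
    """Draw t approximately from p(t) ∝ |t|^{d-1} pi(y + t*r) on a grid,
    then return x' = y + t*r (x is a current draw from pi, y the anchor)."""
    d = x.size
    r = (x - y) / np.linalg.norm(x - y)
    logw = np.array([(d - 1) * np.log(abs(t) + 1e-12) + log_pi(y + t * r)
                     for t in t_grid])
    w = np.exp(logw - logw.max())            # stabilized unnormalized weights
    t = rng.choice(t_grid, p=w / w.sum())    # approximate draw from p(t)
    return y + t * r
```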
Connection with transformation groups
• WLOG, we let y = 0.
• The move is now: $x \to x' = t\,x$. The set $\{t : t \neq 0\}$ forms a transformation group.
• Liu and Wu (1999) show that if $t$ is drawn from
$$p(t) \propto \pi(t \cdot x)\,|J_t(x)|\,H(dt),$$
then the move is invariant with respect to π.
Another Hurdle
• How to draw from something like
$$p(t) \propto |t|^{d-1}\,\pi(y + t\,\vec r)\,?$$
• Adaptive rejection? Approximation? Griddy Gibbs?
• M-H Independence Sampler (Hastings, 1970)
– need to draw from something that is close enough to $p(x)$; see the sketch below.
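A minimal sketch of the independence sampler (our illustration; `sample_q` and `log_q` are assumed helpers for a proposal density q chosen close to the target p):

```python
import numpy as np

rng = np.random.default_rng(6)

def independence_sampler(log_p, sample_q, log_q, x0, n_steps):
    """M-H independence sampler: proposals y ~ q do not depend on the
    current state; it mixes well exactly when q is close to p."""
    x, lpx, lqx = x0, log_p(x0), log_q(x0)
    chain = [x]
    for _ in range(n_steps):
        y = sample_q()
        lpy, lqy = log_p(y), log_q(y)
        # r = p(y) q(x) / (p(x) q(y)), computed on the log scale
        if np.log(rng.uniform()) < (lpy + lqx) - (lpx + lqy):
            x, lpx, lqx = y, lpy, lqy
        chain.append(x)
    return chain
```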
• Propose bigger jumps
– may be rejected too often
• Proposals with mixed-size stepsizes.
• Try multiple times and select good one(s)
(“bridging effect”) (Frankel & Smit, 1996)
• Is it still a valid MCMC algorithm?
Multiple-Try Metropolis (MTM; the current state is x):
• Draw $y_1, \ldots, y_k$ from the proposal $T(x, y)$ (they can be dependent).
• Select $Y = y_j$ with probability $\propto \pi(y_j)\,T(y_j, x)$.
• Draw $x_1^*, \ldots, x_{k-1}^*$ from $T(Y, \cdot)$. Let $x_k^* = x$.
• Accept the proposed $y_j$ with probability
$$p = \min\left\{1,\ \frac{\pi(y_1)T(y_1, x) + \cdots + \pi(y_k)T(y_k, x)}{\pi(x_1^*)T(x_1^*, y_j) + \cdots + \pi(x_k^*)T(x_k^*, y_j)}\right\}$$
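A sketch of one MTM step (our illustration; `propose` draws y ~ T(x, .) and `log_T` evaluates log T(x, y), both assumed helpers):

```python
import numpy as np

rng = np.random.default_rng(7)

def mtm_step(x, log_pi, propose, log_T, k=5):
    """One Multiple-Try Metropolis step with proposal density T(x, y)."""
    ys = [propose(x) for _ in range(k)]                   # k trial proposals
    wy = np.array([log_pi(y) + log_T(y, x) for y in ys])  # log pi(y_j) T(y_j, x)
    py = np.exp(wy - wy.max())
    j = rng.choice(k, p=py / py.sum())                    # select Y = y_j
    y = ys[j]
    xs = [propose(y) for _ in range(k - 1)] + [x]         # reference set, x_k* = x
    wx = np.array([log_pi(xi) + log_T(xi, y) for xi in xs])
    # p = min{1, sum_j pi(y_j)T(y_j, x) / sum_j pi(x_j*)T(x_j*, y)}
    log_r = np.logaddexp.reduce(wy) - np.logaddexp.reduce(wx)
    return y if np.log(rng.uniform()) < log_r else x
```

With a symmetric proposal, `log_T` can simply return 0.0 and the weights reduce to π alone, which is exactly the modification on the next slide.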
A Modification
• If T(x,y) is symmetric, we can use a different acceptance probability:
$$p = \min\left\{1,\ \frac{\pi(y_1) + \cdots + \pi(y_k)}{\pi(x_1^*) + \cdots + \pi(x_k^*)}\right\}$$
Ref: Frankel and Smit (1996)
Back to the example
Random Ray Monte Carlo:
• Propose a random direction
• Pick y from $y_1, \ldots, y_5$
• Correct for the MTM bias
[Figure: trial points $y_1, \ldots, y_5$ along a random ray from the current point $x$.]
An Interesting Twist
• One can choose multiple tries semi-deterministically.
[Figure: random equal grids along a line: trial points $y_1, \ldots, y_8$ on one side of $x$ and reference points $x_1^*, \ldots, x_8^*$ on the other.]
• Pick y from $y_1, \ldots, y_8$.
• The correction rule is the same:
$$p = \min\left\{1,\ \frac{\pi(y_1) + \cdots + \pi(y_8)}{\pi(x_1^*) + \cdots + \pi(x_8^*)}\right\}$$
Use Local Optimization in MCMC
• The ADS formulation is powerful, but its
direction is too “random.”
• How to make use of their framework?
– Keep a population of samples $S_t = \{x_t^{(1)}, \ldots, x_t^{(m)}\}$.
– Randomly select $x_t^{(c)}$ to be updated.
– Use the rest to determine an "anchor point" $y_t$.
• Here we can use local optimization techniques;
• Use MTM to draw a sample along the line, with the help of the Snooker Theorem.
[Figure: distribution contour with the current point $x_c$, the anchor point $x_a$, and a gradient or conjugate-gradient direction.]
Numerical Examples
• An easy multimodal problem
$$\pi = \frac{1}{3}\left\{ N_2(0, I_2) + N_2\!\left(\begin{pmatrix}6\\6\end{pmatrix}, \begin{pmatrix}1 & .9\\ .9 & 1\end{pmatrix}\right) + N_2\!\left(\begin{pmatrix}4\\4\end{pmatrix}, \begin{pmatrix}1 & -.9\\ -.9 & 1\end{pmatrix}\right) \right\}$$
A More Difficult Test Example
• Mixture of 2 Gaussians:
$$\pi(x) = \tfrac{1}{3}\,N_5(0, I_5) + \tfrac{2}{3}\,N_5(5, I_5)$$
• MTM with conjugate-gradient (CG) directions can sample the distribution.
• The Random-Ray also worked well.
• The standard Metropolis cannot get across.
Fitting a Mixture model
• Likelihood:
$$L(\theta \mid y_1, \ldots, y_n) = \prod_{i=1}^{n}\left\{\sum_{j=1}^{3} \frac{p_j}{\sigma_j}\,\phi\!\left(\frac{y_i - \mu_j}{\sigma_j}\right)\right\}$$
$$\theta = (\log p_1, \log p_2;\ \mu_j, \sigma_j,\ j = 1, 2, 3)$$
• Prior: uniform in all, but with constraints
$$\mu_1 \le \mu_2 \le \mu_3; \quad p_j \ge \delta; \quad \sigma_j \ge \sigma_{\min},$$
and each group has at least one data point.
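As a sketch, the log-likelihood above can be coded directly (the dict layout for θ is our convention, not the talk's; the talk parameterizes by log p1, log p2 with p3 = 1 - p1 - p2):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(theta, y):
    """Log-likelihood of the 3-component Gaussian mixture,
    theta = (log p1, log p2; mu_j, sigma_j, j = 1, 2, 3)."""
    p1, p2 = np.exp(theta["logp1"]), np.exp(theta["logp2"])
    p = np.array([p1, p2, 1.0 - p1 - p2])
    mu, sigma = theta["mu"], theta["sigma"]
    # sum_j (p_j / sigma_j) * phi((y_i - mu_j) / sigma_j) for each observation
    dens = sum(p[j] * norm.pdf(y, mu[j], sigma[j]) for j in range(3))
    return np.sum(np.log(dens))
```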
Bayesian Neural Network Training
• Setting: Data $= \{(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)\}$;
nonlinear curve fitting: $y_t = f(x_t) + \epsilon_t$.
• 1-hidden-layer feed-forward NN model:
$$\hat f(x_t) = \sum_{j=1}^{M} \beta_j\,\psi(x_t^{T}\gamma_j)$$
[Figure: network diagram with inputs $x_1, \ldots, x_p$ and hidden units $h_1, \ldots, h_M$.]
• Objective function for optimization:
$$P \propto \prod_t N(y_t \mid \hat f(x_t), \sigma^2)\;g(\sigma^2)\,\prod_j N(\beta_j \mid 0, \sigma_\beta^2)\,N(\gamma_j \mid 0, \sigma_\gamma^2)$$
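A sketch of this objective as code (our own; the prior scales `s_beta`, `s_gamma` and the choice g(σ²) ∝ 1/σ² are illustrative assumptions, not from the talk, and ψ = tanh as on the next slide):

```python
import numpy as np

def log_posterior(beta, gamma, sigma2, X, y, s_beta=5.0, s_gamma=5.0):
    """Unnormalized log of P for f(x) = sum_j beta_j * tanh(x^T gamma_j).
    Shapes: X is (n, p), gamma is (M, p), beta is (M,), y is (n,)."""
    f = np.tanh(X @ gamma.T) @ beta                 # hidden layer, then output layer
    n = len(y)
    log_lik = (-0.5 * n * np.log(2 * np.pi * sigma2)
               - 0.5 * np.sum((y - f) ** 2) / sigma2)   # prod_t N(y_t | f(x_t), sigma^2)
    log_prior = (-0.5 * np.sum(beta ** 2) / s_beta ** 2
                 - 0.5 * np.sum(gamma ** 2) / s_gamma ** 2
                 - np.log(sigma2))                  # assumes g(sigma^2) ∝ 1/sigma^2
    return log_lik + log_prior
```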
Liang and Wong (1999) proposed a method that combines the Snooker Theorem, MTM, exchange Monte Carlo, and the genetic algorithm.
Activation function: tanh(z); number of hidden units M = 2.