Inference V: MCMC Methods
Stochastic Sampling
- In the previous class, we examined methods that use independent samples to estimate P(X = x | e)
- Problem: it is difficult to sample directly from P(X1, …, Xn | e)
  - We had to use likelihood weighting to reweight our samples
  - This introduced bias into the estimate
  - In some cases, such as when the evidence is on the leaves, these methods are inefficient
MCMC Methods
- We are going to discuss sampling methods that are based on Markov chains
  - Markov Chain Monte Carlo (MCMC) methods
- Key ideas:
  - The sampling process is a Markov chain
  - The next sample depends on the previous one
  - These chains can approximate any posterior distribution
- We start by reviewing key ideas from the theory of Markov chains
Markov Chains
- Suppose X1, X2, … take values in some set; wlog these values are 1, 2, …
- A Markov chain is a process that corresponds to the network:
  X1 → X2 → X3 → … → Xn
- To quantify the chain, we need to specify
  - Initial probability: P(X1)
  - Transition probability: P(Xt+1 | Xt)
- A Markov chain has stationary transition probabilities:
  - P(Xt+1 | Xt) is the same for all times t
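As a concrete sketch (not from the slides), a small chain is specified by exactly these two ingredients; the numbers below are hypothetical:

```python
import numpy as np

# Hypothetical two-state chain with states 0 and 1.
p_init = np.array([0.5, 0.5])            # initial probability P(X1)
P = np.array([[0.9, 0.1],                # row i is P(X_{t+1} | X_t = i)
              [0.2, 0.8]])

# Stationary transition probabilities: the same P is applied at every step,
# so the distribution of X_t is p_init @ P^(t-1).
print(p_init @ np.linalg.matrix_power(P, 2))   # distribution of X3
```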
Irreducible Chains
- A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
  - That is, there is a positive probability of reaching j from i after some number of steps
- A chain is irreducible if every state is accessible from every state
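A quick numerical check of accessibility (a sketch, reusing the hypothetical matrix above): j is accessible from i exactly when some power of the transition matrix has a positive (i, j) entry.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # hypothetical transition matrix

# j is accessible from i iff (P^n)[i, j] > 0 for some n >= 1; for a finite
# chain it suffices to check n up to the number of states.
reach = sum(np.linalg.matrix_power(P, n) for n in range(1, len(P) + 1)) > 0
print(reach.all())   # True: every state is accessible from every state
```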
Ergodic Chains
- A state i is positively recurrent if there is a finite expected time to get back to state i after being in state i
  - If X has a finite number of states, then it suffices that i is accessible from itself
- A chain is ergodic if it is irreducible and every state is positively recurrent
(A)periodic Chains
- A state i is periodic if there is an integer d > 1 such that P(Xn = i | X1 = i) = 0 when n is not divisible by d
- A chain is aperiodic if it contains no periodic state
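As an illustration (not from the slides), the deterministic two-state flip chain is periodic with d = 2, since it can return to its starting state only after an even number of steps:

```python
import numpy as np

P = np.array([[0.0, 1.0],    # state 0 always moves to state 1
              [1.0, 0.0]])   # state 1 always moves back to state 0

# The probability of returning to state 0 after k steps is (P^k)[0, 0].
# It is 0 whenever k is not divisible by d = 2, so state 0 is periodic.
for k in range(1, 7):
    print(k, np.linalg.matrix_power(P, k)[0, 0])
```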
Stationary Probabilities
Thm:
- If a chain is ergodic and aperiodic, then the limit
  lim(n→∞) P(Xn | X1 = i)
  exists, and does not depend on i
- Moreover, let
  P*(X = j) = lim(n→∞) P(Xn = j | X1 = i)
- Then P*(X) is the unique probability distribution satisfying
  P*(X = j) = Σi P(Xt+1 = j | Xt = i) P*(X = i)
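A quick numerical sketch of the theorem, using the same hypothetical two-state matrix: starting from either state, repeated transitions converge to the same limit distribution.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # hypothetical transition matrix

# Repeated transitions from either deterministic start converge to the same P*.
for start in ([1.0, 0.0], [0.0, 1.0]):
    p = np.array(start)
    for _ in range(200):
        p = p @ P
    print(p)   # both print approximately [0.6667, 0.3333]
```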
Stationary Probabilities
- The probability P*(X) is the stationary probability of the process
- Regardless of the starting point, the process will converge to this probability
- The rate of convergence depends on properties of the transition probability
Sampling from the stationary probability
- This theory suggests how to sample from the stationary probability:
  - Set X1 = i, for some random/arbitrary i
  - For t = 1, 2, …, n − 1:
    - Sample a value xt+1 for Xt+1 from P(Xt+1 | Xt = xt)
  - Return xn
- If n is large enough, then this is a sample from P*(X)
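A minimal sketch of this procedure in Python, again with a hypothetical two-state transition matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # hypothetical transition matrix

def sample_stationary(P, n=1000):
    """Run the chain for n steps and return the final state x_n."""
    x = rng.integers(len(P))              # random/arbitrary starting state
    for _ in range(n - 1):
        x = rng.choice(len(P), p=P[x])    # sample x_{t+1} from P(X_{t+1} | X_t = x_t)
    return x

# The empirical distribution over many runs approximates P*.
samples = [sample_stationary(P) for _ in range(2000)]
print(np.bincount(samples) / len(samples))  # approximately [2/3, 1/3]
```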
Designing Markov Chains
- How do we construct the right chain to sample from?
- Ensuring aperiodicity and irreducibility is usually easy
- The problem is ensuring the desired stationary probability
Designing Markov Chains
Key tool:
- If the transition probability satisfies
  P(Xt+1 = j | Xt = i) / P(Xt+1 = i | Xt = j) = Q(X = j) / Q(X = i)
  whenever P(Xt+1 = j | Xt = i) > 0,
  then P*(X) = Q(X)
- This gives a local criterion for checking that the chain will have the right stationary distribution
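A small numerical check of this criterion (a sketch; the chain and the target Q are the hypothetical ones from above):

```python
import numpy as np

Q = np.array([2/3, 1/3])     # hypothetical target distribution Q(X)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # hypothetical transition matrix

# Ratio criterion: P(j|i) / P(i|j) == Q(j) / Q(i) whenever P(j|i) > 0.
for i in range(len(P)):
    for j in range(len(P)):
        if i != j and P[i, j] > 0:
            assert np.isclose(P[i, j] / P[j, i], Q[j] / Q[i])

# Consequence: Q is the stationary probability, i.e. Q @ P == Q.
print(np.allclose(Q @ P, Q))   # True
```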
MCMC Methods
- We can use these results to sample from P(X1, …, Xn | e)
- Idea:
  - Construct an ergodic & aperiodic Markov chain such that P*(X1, …, Xn) = P(X1, …, Xn | e)
  - Simulate the chain for n steps to get a sample
MCMC Methods
Notes:
- The Markov chain variable Y takes as values assignments to all the variables that are consistent with the evidence
- For simplicity, we will denote such a state using the vector of variables:
  V(Y) = { ⟨x1, …, xn⟩ ∈ V(X1) × … × V(Xn) | x1, …, xn satisfy e }
Gibbs Sampler
- One of the simplest MCMC methods
- At each transition, change the state of just one Xi
- We can describe the transition probability as a stochastic procedure:
  - Input: a state x1, …, xn
  - Choose i at random (using uniform probability)
  - Sample x'i from P(Xi | x1, …, xi−1, xi+1, …, xn, e)
  - Let x'j = xj for all j ≠ i
  - Return x'1, …, x'n
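A minimal sketch of one such transition in Python; `sample_conditional` is a hypothetical stand-in for sampling from the full conditional described above:

```python
import random

def gibbs_step(x, sample_conditional):
    """One Gibbs transition: resample a single, randomly chosen coordinate.

    x: current state, a list [x1, ..., xn]
    sample_conditional(i, x): hypothetical sampler for
        P(Xi | x1, ..., x_{i-1}, x_{i+1}, ..., xn, e)
    """
    i = random.randrange(len(x))         # choose i uniformly at random
    x_new = list(x)                      # x'_j = x_j for all j != i
    x_new[i] = sample_conditional(i, x)  # sample x'_i from the full conditional
    return x_new
```

Iterating `gibbs_step` simulates the chain; for Bayesian networks the full conditional reduces to a local computation, derived below.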
Correctness of Gibbs Sampler
- By the chain rule,
  P(x1, …, xi−1, xi, xi+1, …, xn | e) = P(x1, …, xi−1, xi+1, …, xn | e) P(xi | x1, …, xi−1, xi+1, …, xn, e)
- Thus, we get
  P(x1, …, xi−1, xi, xi+1, …, xn | e) / P(x1, …, xi−1, x'i, xi+1, …, xn | e) = P(xi | x1, …, xi−1, xi+1, …, xn, e) / P(x'i | x1, …, xi−1, xi+1, …, xn, e)
- Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion
Gibbs Sampling for Bayesian Networks
- Why is the Gibbs sampler "easy" in BNs?
- Recall that the Markov blanket of a variable separates it from the other variables in the network:
  P(Xi | X1, …, Xi−1, Xi+1, …, Xn) = P(Xi | Mbi)
- This property allows us to use local computations to perform the sampling in each transition
Gibbs Sampling in Bayesian Networks
- How do we evaluate P(Xi | x1, …, xi−1, xi+1, …, xn)?
- Let Y1, …, Yk be the children of Xi
- By definition of Mbi, the parents of each Yj are in Mbi ∪ {Xi}
- It is easy to show that
  P(xi | Mbi) = P(xi | Pai) ∏j P(yj | paYj) / Σx'i P(x'i | Pai) ∏j P(yj | pa'Yj)
  where paYj is the assignment to Yj's parents in the current state, and pa'Yj is the same assignment with Xi set to x'i
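A sketch of this local computation; the CPT interfaces here are hypothetical, but the arithmetic follows the formula above:

```python
def gibbs_conditional(values, cpt_xi, child_cpts, state, i):
    """Compute P(Xi | Mb_i) by the local formula above. All arguments are
    hypothetical interfaces:

    values:     possible values of Xi
    cpt_xi:     function (v, state) -> P(Xi = v | Pa_i), reading Pa_i from state
    child_cpts: one function per child Yj, (state) -> P(yj | pa_Yj)
    state:      dict mapping variable names to their current values
    i:          name of the variable Xi being resampled
    """
    scores = []
    for v in values:
        s = {**state, i: v}            # current assignment with Xi set to v
        score = cpt_xi(v, s)           # P(x'_i | Pa_i) ...
        for child in child_cpts:       # ... times prod_j P(y_j | pa_Yj)
            score *= child(s)
        scores.append(score)
    total = sum(scores)                # normalizing constant: sum over x'_i
    return [sc / total for sc in scores]
```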
Sampling Strategy
- How do we collect the samples?
- Strategy I:
  - Run the chain M times, each run for N steps
  - Each run starts from a different starting point
  - Return the last state in each run
(Figure: M independent chains)
Sampling Strategy
- Strategy II:
  - Run one chain for a long time
  - After some "burn in" period, collect a sample every fixed number of steps
(Figure: M samples collected from one chain after the "burn in" period)
Comparing Strategies
- Strategy I:
  - Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity
  - Have to perform "burn in" steps for each chain
- Strategy II:
  - Perform "burn in" only once
  - Samples might be correlated (although only weakly)
- Hybrid strategy (see the sketch below):
  - Run several chains, and draw a few samples from each
  - Combines the benefits of both strategies
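A sketch of the hybrid strategy; the step function, initializer, and all tuning parameters are hypothetical:

```python
def collect_samples(step, init_state, n_chains=5, burn_in=500, thin=50, per_chain=20):
    """Hybrid strategy: several chains, a few thinned samples from each.

    step:       one MCMC transition, state -> next state (e.g. gibbs_step)
    init_state: () -> a fresh random starting state
    The chain count, burn-in length, and thinning interval are hypothetical
    tuning parameters.
    """
    samples = []
    for _ in range(n_chains):
        x = init_state()
        for _ in range(burn_in):        # "burn in" each chain once
            x = step(x)
        for _ in range(per_chain):      # then collect thinned samples
            for _ in range(thin):
                x = step(x)
            samples.append(x)
    return samples
```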