Inference V: MCMC Methods
Stochastic Sampling
In the previous class, we examined methods that use independent samples to estimate P(X = x | e)
Problem: it is difficult to sample from P(X1, …, Xn | e) directly
We had to use likelihood weighting to reweight our samples
This introduced bias into the estimation
In some cases, such as when the evidence is on leaves, these methods are inefficient
MCMC Methods
We are going to discuss sampling methods that are based on Markov chains:
Markov Chain Monte Carlo (MCMC) methods
Key ideas:
The sampling process is a Markov chain
The next sample depends on the previous one
These methods can approximate any posterior distribution
We start by reviewing key ideas from the theory of Markov chains
Markov Chains
X1, X2, … take values in some set
w.l.o.g. these values are 1, 2, …
A Markov chain is a process that corresponds to the network:
X1 → X2 → X3 → … → Xn
To quantify the chain, we need to specify
Initial probability: P(X1)
Transition probability: P(Xt+1 | Xt)
A Markov chain has stationary transition probabilities if
P(Xt+1 | Xt) is the same for all times t
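As a concrete illustration (the two-state chain and its numbers are hypothetical, not from the slides), a chain with stationary transition probabilities on a small state space is fully specified by an initial distribution and a row-stochastic transition matrix:

```python
# Hypothetical two-state chain (states 0 and 1).
# P_TRANS[i][j] = P(X_{t+1} = j | X_t = i), the same matrix for every t;
# this is exactly what "stationary transition probabilities" means.
P_INIT = [0.5, 0.5]             # P(X1)
P_TRANS = [[0.9, 0.1],          # transitions out of state 0
           [0.2, 0.8]]          # transitions out of state 1

# Each row must sum to 1: from any state the chain goes *somewhere*.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P_TRANS)
```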
Irreducible Chains
A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
There is a positive probability of reaching j from i after some number of steps
A chain is irreducible if every state is accessible from every state
Ergodic Chains
A state i is positively recurrent if there is a finite expected time to return to state i after being in state i
If X has a finite number of states, then it suffices that i is accessible from itself
A chain is ergodic if it is irreducible and every state is positively recurrent
(A)periodic Chains
A state i is periodic if there is an integer d > 1 such that
P(Xn = i | X1 = i) = 0 when n is not divisible by d
A chain is aperiodic if it contains no periodic states
Stationary Probabilities
Thm: If a chain is ergodic and aperiodic, then the limit
lim n→∞ P(Xn | X1 = i)
exists, and does not depend on i
Moreover, let
P*(X = j) = lim n→∞ P(Xn = j | X1 = i)
Then P*(X) is the unique probability distribution satisfying
P*(X = j) = Σi P(Xt+1 = j | Xt = i) P*(X = i)
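A small numerical check of the fixed-point equation above, using a hypothetical two-state chain (not one from the slides); for a 2×2 transition matrix the stationary distribution can be written in closed form:

```python
# Hypothetical two-state chain: P[i][j] = P(X_{t+1} = j | X_t = i).
P = [[0.9, 0.1],
     [0.2, 0.8]]

# For a two-state chain with P(0->1) = a and P(1->0) = b, the stationary
# distribution is P* = (b/(a+b), a/(a+b)).
a, b = P[0][1], P[1][0]
p_star = [b / (a + b), a / (a + b)]

# Verify the fixed-point equation: P*(j) = sum_i P(j|i) P*(i).
for j in range(2):
    rhs = sum(P[i][j] * p_star[i] for i in range(2))
    assert abs(p_star[j] - rhs) < 1e-12
```

Here p_star works out to (2/3, 1/3): the chain spends two thirds of its time in state 0.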
Stationary Probabilities
The probability P*(X) is the stationary probability of the process
Regardless of the starting point, the process will converge to this probability
The rate of convergence depends on properties of the transition probability
Sampling from the stationary probability
This theory suggests how to sample from the stationary probability:
Set X1 = i, for some random/arbitrary i
For t = 1, 2, …, n−1
Sample a value xt+1 for Xt+1 from P(Xt+1 | Xt = xt)
Return xn
If n is large enough, then this is (approximately) a sample from P*(X)
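The procedure above can be sketched as follows; the two-state chain and its transition matrix are illustrative assumptions, not from the slides. Repeating the simulation many times shows the distribution of x_n approaching P* = (2/3, 1/3):

```python
import random

# Hypothetical two-state chain with stationary distribution (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]

def sample_chain(n, rng):
    """Set X1 arbitrarily, simulate n-1 transitions, return x_n."""
    x = rng.randrange(2)                        # arbitrary starting state
    for _ in range(n - 1):
        x = 0 if rng.random() < P[x][0] else 1  # draw X_{t+1} ~ P(. | X_t = x)
    return x

rng = random.Random(0)
# For large n, x_n is (approximately) a draw from P*.
draws = [sample_chain(50, rng) for _ in range(20000)]
freq0 = draws.count(0) / len(draws)             # should be close to 2/3
```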
Designing Markov Chains
How do we construct the right chain to sample from?
Ensuring aperiodicity and irreducibility is usually easy
The problem is ensuring the desired stationary probability
Designing Markov Chains
Key tool:
If the transition probability satisfies
P(Xt+1 = j | Xt = i) / P(Xt+1 = i | Xt = j) = Q(X = j) / Q(X = i)
whenever P(Xt+1 = j | Xt = i) > 0
then P*(X) = Q(X)
This gives a local criterion for checking that the chain will have the right stationary distribution
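This ratio condition can be checked mechanically. A sketch on the same hypothetical two-state chain used earlier (the cross-multiplied form, Q(i)·P(j|i) = Q(j)·P(i|j), avoids dividing by zero):

```python
# Hypothetical two-state chain; Q is a candidate stationary distribution.
P = [[0.9, 0.1],
     [0.2, 0.8]]
Q = [2 / 3, 1 / 3]

def satisfies_ratio_criterion(P, Q, tol=1e-12):
    """Check P(j|i)/P(i|j) == Q(j)/Q(i) for all i, j with P(j|i) > 0.

    Written cross-multiplied: Q(i) * P(j|i) == Q(j) * P(i|j).
    """
    n = len(Q)
    return all(abs(Q[i] * P[i][j] - Q[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

# This chain satisfies the criterion w.r.t. Q, so P*(X) = Q(X).
assert satisfies_ratio_criterion(P, Q)
```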
MCMC Methods
We can use these results to sample from P(X1, …, Xn | e)
Idea:
Construct an ergodic & aperiodic Markov chain such that
P*(X1, …, Xn) = P(X1, …, Xn | e)
Simulate the chain for n steps to get a sample
MCMC Methods
Notes:
The Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence
For simplicity, we will denote such a state using the vector of variables:
V(Y) = { <x1, …, xn> ∈ V(X1) × … × V(Xn) | x1, …, xn satisfy e }
Gibbs Sampler
One of the simplest MCMC methods
At each transition, change the state of just one Xi
We can describe the transition probability as a stochastic procedure:
Input: a state x1, …, xn
Choose i at random (using uniform probability)
Sample x'i from P(Xi | x1, …, xi−1, xi+1, …, xn, e)
Let x'j = xj for all j ≠ i
Return x'1, …, x'n
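The transition procedure can be sketched generically. Here `full_conditional(i, x)` stands for P(Xi | x1, …, xi−1, xi+1, …, xn, e), which the caller must supply; the interface is a hypothetical one chosen for illustration, not fixed by the slides:

```python
import random

def gibbs_step(x, full_conditional, rng=random):
    """One Gibbs transition on the state x = [x_1, ..., x_n].

    full_conditional(i, x) must return a dict {value: probability}
    representing P(X_i | x_1,...,x_{i-1}, x_{i+1},...,x_n, e).
    """
    x = list(x)
    i = rng.randrange(len(x))            # choose i uniformly at random
    dist = full_conditional(i, x)
    r, acc = rng.random(), 0.0
    for value, prob in dist.items():     # sample x'_i from the conditional
        acc += prob
        if r < acc:
            x[i] = value
            break
    return x                             # x'_j = x_j for all j != i
```

Iterating `gibbs_step` yields the chain; each step resamples exactly one coordinate and leaves the rest unchanged.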
Correctness of Gibbs Sampler
By the chain rule,
P(x1, …, xi−1, xi, xi+1, …, xn | e) = P(x1, …, xi−1, xi+1, …, xn | e) P(xi | x1, …, xi−1, xi+1, …, xn, e)
Thus, we get
P(x1, …, xi−1, xi, xi+1, …, xn | e) / P(x1, …, xi−1, x'i, xi+1, …, xn | e) = P(xi | x1, …, xi−1, xi+1, …, xn, e) / P(x'i | x1, …, xi−1, xi+1, …, xn, e)
Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion
Gibbs Sampling for Bayesian Networks
Why is the Gibbs sampler "easy" in BNs?
Recall that the Markov blanket of a variable separates it from the other variables in the network:
P(Xi | X1, …, Xi−1, Xi+1, …, Xn) = P(Xi | Mbi)
This property allows us to use local computations to perform sampling in each transition
Gibbs Sampling in Bayesian Networks
How do we evaluate P(Xi | x1, …, xi−1, xi+1, …, xn)?
Let Y1, …, Yk be the children of Xi
By the definition of Mbi, the parents of each Yj are in Mbi ∪ {Xi}
It is easy to show that
P(xi | Mbi) = P(xi | Pai) Πj P(yj | paYj) / Σx'i P(x'i | Pai) Πj P(yj | pa'Yj)
where paYj and pa'Yj denote the parent assignment of Yj with Xi set to xi and to x'i, respectively
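The formula above can be evaluated with purely local table lookups. A minimal sketch on a hypothetical three-node chain network A → B → C (all numbers invented for illustration): B's Markov blanket is {A, C}, and P(B | a, c) is proportional to P(B | a) · P(c | B), normalized over B's values:

```python
# Hypothetical chain network A -> B -> C, all variables binary.
# B's Markov blanket: its parent A and its child C (C has no other parents).
P_B_given_A = {0: {0: 0.7, 1: 0.3},   # P(B | A=0)
               1: {0: 0.2, 1: 0.8}}   # P(B | A=1)
P_C_given_B = {0: {0: 0.9, 1: 0.1},   # P(C | B=0)
               1: {0: 0.3, 1: 0.7}}   # P(C | B=1)

def p_b_given_blanket(a, c):
    """P(B | A=a, C=c): score each value b by P(b | Pa) times the product
    of its children's CPT entries, then normalize over b."""
    scores = {b: P_B_given_A[a][b] * P_C_given_B[b][c] for b in (0, 1)}
    z = sum(scores.values())
    return {b: s / z for b, s in scores.items()}

dist = p_b_given_blanket(a=1, c=1)   # P(B | A=1, C=1)
```

With these numbers, P(B=1 | A=1, C=1) = 0.8·0.7 / (0.8·0.7 + 0.2·0.1) = 0.56/0.58 ≈ 0.966: the only tables touched are B's CPT and its children's CPTs.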
Sampling Strategy
How do we collect the samples?
Strategy I:
Run the chain M times, each run for N steps
Each run starts from a different starting point
Return the last state in each run
This gives M samples from M independent chains
Sampling Strategy
Strategy II:
Run one chain for a long time
After some "burn-in" period, sample a point every fixed number of steps
This gives M samples from one chain
Comparing Strategies
Strategy I:
Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity
Have to perform "burn-in" steps for each chain
Strategy II:
Perform "burn-in" only once
Samples might be correlated (although only weakly)
Hybrid strategy:
Run several chains, and take a few samples from each
Combines the benefits of both strategies
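The hybrid strategy might be sketched like this; the chain-stepping interface and all parameter names are illustrative assumptions, not from the slides:

```python
import random

def collect_hybrid(step, init_state, n_chains=4, burn_in=100,
                   thin=10, samples_per_chain=25, seed=0):
    """Run several chains; in each, discard `burn_in` steps, then keep
    every `thin`-th state until `samples_per_chain` samples are taken."""
    samples = []
    for c in range(n_chains):
        rng = random.Random(seed + c)     # independent randomness per chain
        x = init_state(rng)               # each run starts somewhere else
        for _ in range(burn_in):          # "burn-in" steps, discarded
            x = step(x, rng)
        for _ in range(samples_per_chain):
            for _ in range(thin):         # thinning weakens correlation
                x = step(x, rng)
            samples.append(x)
    return samples

# Demo on the hypothetical two-state chain used earlier.
P = [[0.9, 0.1], [0.2, 0.8]]
demo = collect_hybrid(step=lambda x, rng: 0 if rng.random() < P[x][0] else 1,
                      init_state=lambda rng: rng.randrange(2))
```

This pays the burn-in cost only n_chains times while still drawing from several starting points, and thinning within each chain keeps the kept samples only weakly correlated.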