Suppressing Random Walks in Markov Chain Monte Carlo Using Ordered Overrelaxation
Radford M. Neal
Presenter: Jang Jeong-ho
Introduction
Problems
– Gibbs sampling and simple forms of the Metropolis algorithm
• The distance moved in each iteration is usually small because of the dependencies between variables.
• The state therefore explores the distribution via a random walk.
– This is more serious in the high-dimensional distributions encountered in Bayesian inference and statistical physics.
Two solutions to random walks
– Hybrid Monte Carlo
– Overrelaxation methods
Hybrid Monte Carlo
– Duane, Kennedy, Pendleton, and Roweth, 1987.
– An elaborate form of the Metropolis algorithm.
• Candidate states are found by simulating a trajectory defined by Hamiltonian dynamics.
• Trajectories proceed in a consistent direction until they reach a region of low probability.
• In Bayesian inference problems for complex models based on neural networks, this can be hundreds or thousands of times faster than simple versions of the Metropolis algorithm.
– Problems to which it can be applied
• The state variables are continuous.
• Derivatives of the probability density can be efficiently computed.
– Difficulty
• Requires careful choices for the length of the trajectories and for the stepsize, as the sketch below illustrates.
– Using too large a stepsize causes the dynamics to become unstable, resulting in an extremely high rejection rate.
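To make the trajectory mechanism concrete, here is a minimal sketch of one hybrid Monte Carlo update, using the standard leapfrog discretization with a unit mass matrix; the callables log_prob and grad_log_prob and all parameter names are illustrative assumptions, not anything specified in the talk.

```python
import numpy as np

def hmc_step(x, log_prob, grad_log_prob, step_size, n_steps, rng):
    """One hybrid Monte Carlo update: simulate Hamiltonian dynamics with
    the leapfrog integrator, then apply a global Metropolis accept/reject
    on the change in total energy.  log_prob and grad_log_prob are assumed
    callables for the log density and its gradient."""
    p = rng.standard_normal(x.shape)                 # fresh momentum, unit mass
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(x_new)  # initial half step
    for _ in range(n_steps - 1):
        x_new += step_size * p_new
        p_new += step_size * grad_log_prob(x_new)
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_prob(x_new)  # final half step
    # Too large a step_size makes this energy difference blow up,
    # which is exactly the instability / high rejection rate noted above.
    h_old = -log_prob(x) + 0.5 * p @ p
    h_new = -log_prob(x_new) + 0.5 * p_new @ p_new
    return x_new if np.log(rng.random()) < h_old - h_new else x
```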
Methods based on overrelaxation
– Introduced by Adler, 1981.
• Similar to Gibbs sampling, except that
– The new value for a component is negatively correlated with the old value.
• Successive overrelaxation improves sampling efficiency by suppressing random-walk behavior.
• Does not require that the user select a suitable value for a stepsize parameter.
• Does not suffer from the growth in computation time with system size that results from the use of a global acceptance test in hybrid Monte Carlo.
• Applicable only to problems where all the full conditional distributions are Gaussian.
– Variants
• Most methods employ occasional rejections to ensure that the correct distribution is invariant.
– Rejections can undermine the ability of overrelaxation to suppress random walks.
– The probability of rejection is determined by the distribution to be sampled.
– The probability of rejection cannot be reduced.
– Ordered overrelaxation
• A rejection-free overrelaxation method based on order statistics.
– In principle, it can be used for any distribution for which Gibbs sampling would produce an ergodic Markov chain.
Overrelaxation with Gaussian conditional distributions
Adler's method
– Applicable when the distribution for the state, x = (x_1, …, x_N), is such that all the full conditional densities π(x_i | {x_j}_{j≠i}) are Gaussian.
– The components are updated in turn, using
x_i' = μ_i + α (x_i − μ_i) + σ_i (1 − α²)^{1/2} n,   n ~ N(0, 1)
where μ_i and σ_i are the mean and standard deviation of the conditional distribution for x_i.
−1 ≤ α ≤ 1 : degree of overrelaxation
α < 0 : overrelaxation to the other side of the conditional mean
α = 0 : equivalent to Gibbs sampling
– Leaves the desired distribution invariant.
– Overrelaxed updates with −1 < α < 1 produce an ergodic chain.
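A minimal sketch of Adler's single-component update, assuming the mean mu and standard deviation sigma of the Gaussian full conditional have already been computed; the names are illustrative.

```python
import numpy as np

def adler_update(x_old, mu, sigma, alpha, rng):
    """Adler's overrelaxed update for one Gaussian full conditional:
    x' = mu + alpha*(x - mu) + sigma*sqrt(1 - alpha^2) * n,  n ~ N(0, 1).
    alpha = 0 reproduces Gibbs sampling; alpha near -1 overrelaxes strongly."""
    n = rng.standard_normal()
    return mu + alpha * (x_old - mu) + sigma * np.sqrt(1.0 - alpha**2) * n
```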
How random walks are suppressed
Example
– Bivariate Gaussian with correlation 0.998.
– Degree of overrelaxation
• When α is chosen well, randomization occurs on about the same time scale as is required for the state to move from one end of the distribution to the other.
– As the correlation approaches 1, α should approach −1.
The benefit from overrelaxation
– Autocorrelation time
• The sum of the autocorrelations of a function of state over all lags.
• The efficiency of estimation of E[x_1] is a factor of about 22 better than Gibbs sampling.
• The efficiency of estimation of E[x_1²] is a factor of about 16 better than Gibbs sampling.
– Overrelaxation can reduce the variance of an estimate for a given run length.
– Equally, it can reduce the length of run needed for a desired variance level.
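As a concrete, non-authoritative check of these efficiency claims, the sketch below runs Adler sweeps on the bivariate Gaussian with correlation 0.998 from the example above and compares a crude autocorrelation-time estimate for α = 0 (plain Gibbs sampling) against an assumed overrelaxed setting α = −0.98; the exact factors will vary by run and by the function being estimated.

```python
import numpy as np

rho = 0.998                       # correlation from the example above
sigma = np.sqrt(1.0 - rho**2)     # conditional std. dev. of either component
rng = np.random.default_rng(0)

def sweep(x, alpha):
    """One pass of Adler updates over both components of the standard
    bivariate Gaussian; the conditional mean of x_i given x_j is rho * x_j."""
    for i in (0, 1):
        mu = rho * x[1 - i]
        x[i] = (mu + alpha * (x[i] - mu)
                + sigma * np.sqrt(1.0 - alpha**2) * rng.standard_normal())
    return x[0]

def autocorr_time(chain, max_lag=5000):
    """Crude autocorrelation time: 1 + 2 * (sum of sample autocorrelations)."""
    c = chain - chain.mean()
    acf = np.correlate(c, c, mode="full")[c.size - 1:] / (c @ c)
    return 1.0 + 2.0 * acf[1:max_lag].sum()

for alpha in (0.0, -0.98):        # 0.0 is Gibbs; -0.98 is an assumed setting
    x = np.zeros(2)
    chain = np.array([sweep(x, alpha) for _ in range(50_000)])
    print(f"alpha = {alpha:5.2f}  autocorrelation time ~ {autocorr_time(chain):.0f}")
```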
Overrelaxation is not always beneficial.
– Barone and Frigessi, 1990
• Overrelaxation applied to multivariate Gaussians.
• If a method converges at rate ρ, the computation time required to reach a given accuracy is inversely proportional to −log(ρ).
• Overrelaxation can, for some distributions, be arbitrarily faster than Gibbs sampling.
– For some distributions with negative correlations, it can be better to underrelax.
– Green and Han, 1992
• Values of α very near −1 are not good from the point of view of convergence to equilibrium.
• They suggest using different chains during the initial period and during subsequent generation.
– The benefits of overrelaxation are not universal.
– Contexts where overrelaxation is beneficial
• Multivariate Gaussian distributions where the correlations between components of the state are positive.
Previous proposals for more general overrelaxation methods
Brown and Woch, 1987
– Procedure
• Transform to a new parameterization in which the conditional distribution is Gaussian.
• Do the update by Adler's method.
• Transform back.
– For many problems, the required computations will be costly or infeasible.
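When the conditional CDF and its inverse are available, one way to realize this transformation is to map through the CDF to a standard Gaussian, overrelax there, and map back; the Gamma conditional in this sketch is purely an assumed example, and the CDF evaluations it relies on are precisely the computations that may be costly or infeasible.

```python
import numpy as np
from scipy.stats import norm, gamma

def transformed_overrelax(x_old, cond, alpha, rng):
    """Map the current value to standard-Gaussian coordinates via the
    conditional CDF, apply Adler's update there, and map back.  `cond`
    is a frozen scipy.stats distribution standing in for the full
    conditional (an assumption of this sketch)."""
    z = norm.ppf(cond.cdf(x_old))                   # to N(0, 1) coordinates
    z_new = alpha * z + np.sqrt(1.0 - alpha**2) * rng.standard_normal()
    return cond.ppf(norm.cdf(z_new))                # back to the original scale

# Example: overrelax a Gamma(3, scale=2) conditional with alpha = -0.9
rng = np.random.default_rng(1)
x_new = transformed_overrelax(1.5, gamma(3, scale=2), -0.9, rng)
```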
Brown, Woch, and Creutz, 1987
– Based on the Metropolis algorithm.
– Procedure
• To update component i, first find x_i*, a point near the center of the conditional distribution.
• Take x_i' = x_i* − (x_i − x_i*) as the candidate.
• Accept with probability min[1, π(x_i' | {x_j}_{j≠i}) / π(x_i | {x_j}_{j≠i})].
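A minimal sketch of this reflect-and-test update; log_cond (the possibly unnormalized log conditional) and x_center are assumed inputs. Because reflection through a fixed point is a deterministic involution, the simple Metropolis ratio shown above is the correct acceptance probability.

```python
import numpy as np

def reflect_update(x_i, x_center, log_cond, rng):
    """Reflect the current value through x_center to get the candidate,
    then apply the usual Metropolis test on the log conditional densities
    (any normalizing constant cancels in the ratio)."""
    x_cand = x_center - (x_i - x_center)
    if np.log(rng.random()) < log_cond(x_cand) - log_cond(x_i):
        return x_cand      # accept the reflected point
    return x_i             # reject: keep the old value
```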
Green and Han, 1992
– Procedure
• To update component i, find a Gaussian approximation to the conditional distribution.
• Find a candidate state x_i' by overrelaxing with respect to this approximation:
x_i' = μ_i + α (x_i − μ_i) + σ_i (1 − α²)^{1/2} n
• The candidate is accepted or rejected using Hastings' generalization of the Metropolis algorithm.
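A sketch of the Hastings-corrected version, assuming the Gaussian approximation is summarized by a mean mu and standard deviation sigma, and that log_cond gives the (possibly unnormalized) log of the true conditional; all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def green_han_update(x_i, mu, sigma, alpha, log_cond, rng):
    """Overrelax with respect to the Gaussian approximation N(mu, sigma^2),
    then accept or reject with the Hastings ratio, which corrects for the
    approximation not matching the true conditional exactly."""
    s = sigma * np.sqrt(1.0 - alpha**2)
    x_cand = mu + alpha * (x_i - mu) + s * rng.standard_normal()
    # Proposal densities q(x -> x') and q(x' -> x) under the Gaussian scheme
    log_q_fwd = norm.logpdf(x_cand, loc=mu + alpha * (x_i - mu), scale=s)
    log_q_rev = norm.logpdf(x_i, loc=mu + alpha * (x_cand - mu), scale=s)
    log_ratio = log_cond(x_cand) - log_cond(x_i) + log_q_rev - log_q_fwd
    if np.log(rng.random()) < log_ratio:
        return x_cand
    return x_i
```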
Fodor and Jansen, 1994
– Applicable when the conditional distribution is unimodal.
– The candidate state is the point on the other side of the mode whose probability density is the same as that of the current state.
• It is accepted or rejected based on the derivative of the mapping from the current state to the candidate state.
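A rough sketch of this update for a one-dimensional unimodal density, assuming the density f, its derivative df, the mode location, and a bracketing width are all available; the root-finder and the acceptance probability min[1, |f'(x)/f'(x')|] (one standard way to use the derivative of the deterministic mapping) are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import brentq

def mode_reflect_update(x, f, df, mode, rng, span=50.0):
    """Find the point x' on the other side of the mode with f(x') = f(x),
    then accept with probability min[1, |f'(x)/f'(x')|], the Jacobian of
    the deterministic mapping x -> x'.  Assumes x != mode and that `span`
    is wide enough for the bracket to contain a sign change."""
    g = lambda y: f(y) - f(x)                  # zero at the equal-density point
    if x < mode:
        x_cand = brentq(g, mode, mode + span)  # search to the right of the mode
    else:
        x_cand = brentq(g, mode - span, mode)  # search to the left of the mode
    if rng.random() < min(1.0, abs(df(x) / df(x_cand))):
        return x_cand
    return x
```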
Overrelaxation based on order statistics
Ordered overrelaxation
– Component-wise update.
Basic procedure
– Generate K random values, independently, from the conditional distribution π(x_i | {x_j}_{j≠i}).
– Arrange these K values plus the old value x_i in non-decreasing order, indexing them so that x_i has rank r:
x_i^(0) ≤ x_i^(1) ≤ ⋯ ≤ x_i^(r) = x_i ≤ ⋯ ≤ x_i^(K)
– Let the new value for component i be x_i' = x_i^(K−r).
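A minimal sketch of one ordered-overrelaxation update; sample_cond stands for an assumed routine that draws K independent values from the full conditional of component i, and ties are taken to have probability zero (continuous distribution).

```python
import numpy as np

def ordered_overrelax(x_i, sample_cond, K, rng):
    """Draw K values from the conditional, sort them together with the
    old value, find the old value's rank r, and return the value at the
    mirrored rank K - r.  No rejection step is needed."""
    values = np.sort(np.append(sample_cond(rng, K), x_i))  # K + 1 ordered values
    r = int(np.searchsorted(values, x_i))                  # rank of the old value
    return values[K - r]
```

Moving to the mirrored rank K − r pushes the state toward the opposite tail of the conditional, mimicking Adler's update with α near −1 without requiring the conditional to be Gaussian.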
Validity of ordered overrelaxation
– The method is valid in the sense that it leaves the desired distribution invariant.
• It suffices to show that each update for a component satisfies detailed balance.
– Assuming there are no tied values among the K+1 values, the probability density for a transition from x_i (with rank r) to x_i' (with rank s) satisfies
P(X_i = x_i → X_i' = x_i') ∝ π(x_i | {x_j}_{j≠i}) · K! · π(x_i' | {x_j}_{j≠i}) · I[s = K − r]
• This expression is symmetric in x_i and x_i', so the probability density for the reverse transition is identical.
Strategies for implementing
ordered overrelaxation
Inference for a hierarchical
Bayesian model
Discussion