ch14 - Temple University


Ch. 14: Markov Chain Monte Carlo Methods
based on Stephen Marsland, Machine Learning: An Algorithmic
Perspective, CRC Press, 2009;
C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, An Introduction to MCMC for
Machine Learning, Machine Learning, Vol. 50, No. 1, pp. 5-43, 2003;
and Wikipedia
Longin Jan Latecki
Temple University
[email protected]
Random Numbers
• Pseudo-random numbers from a uniform
distribution:
xn+1 = (a·xn + c) mod m,
where a, c, m are parameters,
e.g., m = 2^32, a = 1,664,525, and c = 1,013,904,223.
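The recurrence above is a linear congruential generator; here is a minimal Python sketch with the slide's example parameters (the function name and seed are my own choices):

```python
# Linear congruential generator from the slide:
# x_{n+1} = (a * x_n + c) mod m, scaled to [0, 1).
def lcg(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudo-random numbers in [0, 1)."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m   # the congruential recurrence
        out.append(x / m)     # scale the integer state to the unit interval
    return out

samples = lcg(seed=42, n=5)
```

The same seed always reproduces the same stream, which is why such numbers are only pseudo-random.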
Monte Carlo Method
While convalescing from an illness in 1946, Stanislaw Ulam was playing
solitaire. It then occurred to him to try to compute the chances that a
particular solitaire laid out with 52 cards would come out successfully
(Eckhardt, 1987). After attempting exhaustive combinatorial calculations, he
decided to go for the more practical approach of laying out several
solitaires at random and then observing and counting the number of
successful plays. This idea of selecting a statistical sample to approximate
a hard combinatorial problem by a much simpler problem is at the heart of
modern Monte Carlo simulation. Stanislaw Ulam soon realized that
computers could be used in this fashion to answer questions
of neutron diffusion and mathematical physics.
From An Introduction to MCMC for Machine Learning
by C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan,
Machine Learning, Vol. 50, No. 1, pp. 5-43, 2003.
Random Samples and Statistical Models
• Random Sample: A random sample is a
collection of random variables X1, X2,…, XN
that have the same probability
distribution and are mutually independent.
– If F is the distribution function of each random
variable Xi in a random sample, we speak of a
random sample from F. Similarly, we speak of a
random sample from a density f, e.g., a random
sample from an N(µ, σ²) distribution, etc.
• Statistical Model for repeated measurements
A dataset consisting of values x(1), x(2),…, x(N) of
repeated measurements of the same quantity is
modeled as the realization of a random sample
X1, X2,…, XN.
The model may include a partial specification of the
probability distribution of each Xi.
Distribution features and sample statistics
• Empirical Distribution Function
FN(a) = #(x(i) ∈ (−∞, a]) / N
• Law of Large Numbers
– lim N->∞ P(|FN(a) – F(a)| > ε) = 0
• This implies that for most realizations
– FN(a) ≈ F(a)
• The histogram
f(x) ≈ #(x(i) ∈ (x − h, x + h]) / (2hN),
i.e., the height of the histogram on (x − h, x + h] ≈ f(x)
We recall that the probability density function (pdf) f of a
random variable X is the derivative of the cumulative
distribution function (cdf) F(t) = Pr(X < t):
f(x) = d/dx F(x)
In particular, we have
F(x) = ∫_{−∞}^x f(u) du
and
f(x) = d/dx F(x) = d/dx ∫_{−∞}^x f(u) du
Monte Carlo Principle 1
• The idea of Monte Carlo simulation is to draw an independent
and identically distributed (i.i.d.) set of samples
{x(1), x(2),…, x(N)}
from a target density p(x) defined on a high-dimensional space
X. These N samples can be used to approximate the target
density with the following empirical point-mass function
pN(x) = (1/N) Σ_{i=1}^N δ_{x(i)}(x) → p(x),  N → ∞
The convergence is almost sure (a.s.) and δ_{x(i)}(x)
denotes the delta-Dirac mass located at x(i).
Think about pN(x) as a histogram or kernel density estimate.
Monte Carlo = Sample based pdf approximation
Regions of high density have many samples
and high weights of samples.
Discrete approximation of a continuous pdf:
pN(x) = (1/N) Σ_{i=1}^N w(x(i)) δ_{x(i)}(x) → p(x),  N → ∞ (a.s.)
Samples for a Gaussian
Matlab demo
Samples for exponential pdf
Matlab demo
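The Matlab demos are not included in the transcript; as a rough Python stand-in (the sampler and all parameters are my own choices), the exponential pdf can be sampled by inverse-transform sampling from a uniform generator, and the histogram of the draws approximates the density:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(n, lam=1.0):
    """Inverse-transform sampling: if U ~ U(0,1), then
    x = -ln(1 - U) / lam has pdf lam * exp(-lam * x)."""
    u = rng.uniform(size=n)
    return -np.log(1.0 - u) / lam

x = sample_exponential(100_000, lam=2.0)
mean = x.mean()  # should be close to 1/lam = 0.5
```

Plotting np.histogram(x, density=True) against lam·exp(−lam·x) gives the kind of visual check the demo presumably shows.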
Delta-Dirac mass
δc(z) = ∞ if z = c, and δc(z) = 0 if z ≠ c,
and
∫_{−∞}^{∞} δc(z) dz = 1
• As a probability measure, the delta measure δc(z)
is characterized by its cumulative distribution function, which is
the Heaviside function (or unit step function)
Hc(z) = 1 if z ≥ c, and Hc(z) = 0 if z < c,
so that
δc(z) = d/dz Hc(z)
[Figure: plots of the step functions H0(z) and Hc(z), each jumping from 0 to 1.]
Delta-Dirac function
• The Dirac delta function can be obtained as the limit (in the sense of
distributions) of a sequence of Gaussians of shrinking width:
δ0(x) = lim_{a→0} ga(x),
where ga is the Gaussian density with mean 0 and standard deviation a.
Kernel Density Estimate
• Given is a set of iid samples {x(1), x(2),…, x(N)} from a density p(x)
fN(x) = (1/N) Σ_{i=1}^N Kh(x − x(i)) → p(x),  N → ∞, h → 0,
where Kh is a kernel (e.g., a Gaussian density) with bandwidth h. Compare this with
pN(x) = (1/N) Σ_{i=1}^N δ_{x(i)}(x) → p(x),  N → ∞
The convergence is almost sure.
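A minimal Gaussian-kernel density estimate sketching the formula above (the bandwidth h = 0.3 and the N(0, 1) sample source are my own choices):

```python
import numpy as np

def kde(x, samples, h=0.3):
    """Evaluate f_N(x) = (1/N) sum_i K_h(x - x_i) with a Gaussian kernel."""
    x = np.atleast_1d(x)
    diffs = x[:, None] - samples[None, :]
    # K_h(u): Gaussian density with mean 0 and standard deviation h
    k = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

rng = np.random.default_rng(1)
samples = rng.normal(size=5000)        # iid draws from N(0, 1)
est = kde(np.array([0.0]), samples)    # estimated density at x = 0
```

The true density at 0 is 1/√(2π) ≈ 0.399; the estimate lands close, slightly smoothed by the kernel.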
Monte Carlo Principle 2
• We can approximate the integrals, e.g., expectations (or very
large sums) E( f ) with tractable sums EN( f ):
EN(f) = (1/N) Σ_{i=1}^N f(x(i)) → E(f) = ∫_X f(x) p(x) dx,  N → ∞
EN(f) = (1/N) Σ_{i=1}^N f(x(i)) → E(f) = Σ_{x∈X} f(x) p(x),  N → ∞
The convergence is almost sure (a.s.)
• The N samples can also be used to obtain a maximum of the
objective function p(x)
Theorem 1
Let x(i) be iid samples from p(x), then
EN(f) = (1/N) Σ_{i=1}^N f(x(i)) → E(f) = ∫_X f(x) p(x) dx,  N → ∞ (a.s.)
Proof. By the strong law of large numbers, we have
µN = EN(f) = (1/N) Σ_{i=1}^N f(x(i)) → µ = E(f) = ∫_X f(x) p(x) dx,  N → ∞ (a.s.)
We observe that µ is the mean or expected value of function f(x)
with respect to the probability density function, while µN is the
mean of f values at the N sample points.
This proves the theorem.
Example
• It is possible to express all probabilities, integrals, and
summations as expectations as we illustrate now.
• Consider a problem (which is completely deterministic) of
integrating a function f(x) from a to b (as in high-school calculus).
This can be expressed as an expectation with respect to a
uniformly distributed, continuous random variable U(a,b)
between a and b. U has density function pU(x) = 1/(b - a), so if we
rewrite the integral we get
∫_a^b f(x) dx = (b − a) ∫_a^b f(x) · (1/(b − a)) dx = (b − a) E(f)
• Let x(i) be iid samples from U, then
((b − a)/N) Σ_{i=1}^N f(x(i)) → (b − a) ∫_a^b f(x) · (1/(b − a)) dx = ∫_a^b f(x) dx,  N → ∞ (a.s.)
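The example above can be sketched directly in Python; the test integral ∫_0^π sin(x) dx = 2 is my own choice:

```python
import math
import random

def mc_integrate(f, a, b, n=200_000, seed=0):
    """Estimate the integral of f over [a, b] as (b - a) times the
    average of f at n uniform samples, per the slide's identity."""
    rng = random.Random(seed)
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

approx = mc_integrate(math.sin, 0.0, math.pi)  # exact value is 2
```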
Theorem 2
• Let x(i) be iid samples from p(x), then
pN(x) = (1/N) Σ_{i=1}^N δ_{x(i)}(x) → p(x),  N → ∞ (a.s.)
Proof. By Theorem 1, we have
µN = EN(f) = (1/N) Σ_{i=1}^N f(x(i)) → µ = E(f) = ∫_X f(x) p(x) dx,  N → ∞ (a.s.)
We set X = R and f(x) = Hx(t) for some t, and obtain the cdf of p(x):
∫_X f(x) p(x) dx = ∫_R Hx(t) p(x) dx = ∫_{−∞}^t p(x) dx
Consequently,
(1/N) Σ_{i=1}^N H_{x(i)}(t) → ∫_{−∞}^t p(x) dx,  N → ∞ (a.s.)
and, differentiating with respect to t,
(1/N) Σ_{i=1}^N δ_{x(i)}(t) → d/dt ∫_{−∞}^t p(x) dx = p(t),  N → ∞ (a.s.)
Importance Sampling
• Sometimes it is hard or impossible to draw samples from p(x).
• We have a proposal distribution q(x) such that its support
includes the support of p(x), i.e., p(x) > 0 ⇒ q(x) > 0, and it is easy
to draw samples from q(x).
• We draw an iid set of samples {x(1), x(2),…, x(N)} from q(x) and
define importance weights as w(x(i)) = p(x(i)) / q(x(i))
Then samples
{(x(1), w(x(1))), (x(2), w(x(2))),…, (x(N), w(x(N)))}
are weighted samples from p(x).
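A small numeric sketch of importance sampling; the target p = N(0, 1), the proposal q = U(−5, 5), and the test function f(x) = x² are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def p(x):
    """Standard normal density (the target)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

N = 200_000
x = rng.uniform(-5.0, 5.0, size=N)   # iid draws from the proposal q
q_height = 1.0 / 10.0                # U(-5, 5) density
w = p(x) / q_height                  # importance weights w = p / q

# Weighted average estimates E_p[x^2], which equals 1 for N(0, 1).
est = np.mean(x**2 * w)
```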
Theorem 3: Importance Sampling
• We have a proposal distribution q(x) such that its support includes
the support of p(x), i.e., p(x) > 0 ⇒ q(x) > 0. We draw an iid set of
samples {x(1), x(2),…, x(N)} from q(x)
and define importance weights as w(x) = p(x) / q(x)
Then pN(x) = (1/N) Σ_{i=1}^N w(x(i)) δ_{x(i)}(x) → p(x),  N → ∞ (a.s.)
Proof: By Theorem 1 applied to q(x) and f(x) = h(x)w(x):
(1/N) Σ_{i=1}^N h(x(i)) w(x(i)) → ∫_X h(x) (p(x)/q(x)) q(x) dx = ∫_X h(x) p(x) dx,  N → ∞ (a.s.)
We set X = R and h(x) = Hx(t) and obtain the cdf of p(x):
∫_X h(x) p(x) dx = ∫_R Hx(t) p(x) dx = ∫_{−∞}^t p(x) dx
Finally, we have
(1/N) Σ_{i=1}^N H_{x(i)}(t) w(x(i)) → ∫_{−∞}^t p(x) dx,  N → ∞ (a.s.)
and
(1/N) Σ_{i=1}^N δ_{x(i)}(t) w(x(i)) → p(t),  N → ∞ (a.s.)
Importance Sampling 2
• When the normalizing constant of p(x) is unknown, it is still possible
to apply the importance sampling method:
• We draw an iid set of samples {x(1), x(2),…, x(N)} from q(x)
and define normalized importance weights as
w̃(x(i)) = w(x(i)) / Σ_{j=1}^N w(x(j))
Then samples {(x(i), w̃(x(i)))}_{i=1}^N are weighted samples from p(x).
Before:
(1/N) Σ_{i=1}^N h(x(i)) w(x(i)) → ∫_X h(x) w(x) q(x) dx,  N → ∞ (a.s.)
Now:
Σ_{i=1}^N h(x(i)) w̃(x(i)) = [(1/N) Σ_{i=1}^N h(x(i)) w(x(i))] / [(1/N) Σ_{j=1}^N w(x(j))]
→ [∫_X h(x) w(x) q(x) dx] / [∫_X w(x) q(x) dx],  N → ∞ (a.s.)
Since w(x)q(x) ∝ p(x), the unknown normalizing constant cancels in this ratio.
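A sketch of the self-normalized estimator when p is known only up to a constant; the unnormalized target exp(−x²/2) and proposal U(−5, 5) are my own choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    """Unnormalized N(0, 1): the 1/sqrt(2*pi) constant is deliberately omitted."""
    return np.exp(-0.5 * x**2)

N = 200_000
x = rng.uniform(-5.0, 5.0, size=N)   # iid draws from q = U(-5, 5)
w = p_tilde(x) / (1.0 / 10.0)        # unnormalized weights p_tilde / q
w_tilde = w / w.sum()                # normalized weights sum to 1

# The unknown constant cancels; this still estimates E_p[x^2] = 1.
est = np.sum(x**2 * w_tilde)
```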
Sampling Importance Resampling (SIR)
• We draw an iid set of samples {x(1), x(2),…, x(N)} from q(x) and
compute the importance weights as
w(x(i)) = p(x(i)) / q(x(i))
• Normalize the weights as
w̃(x(i)) = w(x(i)) / Σ_{j=1}^N w(x(j))
• Resample from the distribution {(x(i), w̃(x(i)))}_{i=1}^N
with probabilities given by the weights and again normalize the
weights.
We obtain weighted samples {(x_new(i), w̃(x_new(i)))}_{i=1}^N from p(x).
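The SIR steps can be sketched as follows; the target N(0, 1) and proposal U(−5, 5) are again my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # target N(0, 1)

N = 100_000
x = rng.uniform(-5.0, 5.0, size=N)   # step 1: draw from q = U(-5, 5)
w = p(x) / (1.0 / 10.0)              # step 1: importance weights p / q
w_tilde = w / w.sum()                # step 2: normalize the weights

# Step 3: resample with replacement, probability proportional to the
# weights; the resampled points then carry equal weights 1/N.
idx = rng.choice(N, size=N, p=w_tilde)
resampled = x[idx]
```

After resampling, the points themselves are (approximately) distributed according to p, at the cost of some duplication of high-weight samples.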
Sample based pdf representation
Regions of high density have many samples
and high weights of samples.
Discrete approximation of a continuous pdf:
pN(x) = (1/N) Σ_{i=1}^N w(x(i)) δ_{x(i)}(x) → p(x),  N → ∞ (a.s.)
Rejection Sampling 1
Sometimes it is hard or impossible to draw samples from p(x).
We can sample from a distribution p(x), which is known up to a proportionality
constant, by sampling from another easy-to-sample proposal distribution q(x)
that satisfies p(x) ≤ Mq(x), M < ∞, using the accept/reject procedure.
The accepted x(i) can be shown to be distributed according to p(x).
Rejection Sampling 2
Rejection sampling suffers from severe limitations. It is not always
possible to bound p(x)/q(x) with a reasonable constant M over the whole
space X. If M is too large, the acceptance probability will be too small.
This makes the method impractical in high-dimensional scenarios.
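A sketch of the accept/reject procedure for p = G(0, 1) with proposal q = U(−5, 5); the bound M = (1/√(2π)) / (1/10) is my own choice, and it works because p peaks at 1/√(2π):

```python
import numpy as np

rng = np.random.default_rng(5)

def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # target G(0, 1)

q_height = 1.0 / 10.0                                 # U(-5, 5) density
M = (1.0 / np.sqrt(2.0 * np.pi)) / q_height           # ensures p <= M * q on [-5, 5]

def rejection_sample(n):
    accepted = []
    while len(accepted) < n:
        x = rng.uniform(-5.0, 5.0)       # propose from q
        u = rng.uniform()
        if u <= p(x) / (M * q_height):   # accept with probability p / (M q)
            accepted.append(x)
    return np.array(accepted)

samples = rejection_sample(20_000)
```

The acceptance rate here is about 1/M ≈ 0.25, which already hints at the problem with large M mentioned above.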
Theorem 4: Rejection Sampling (taken from pdf)
Demo Assignment
Suppose that we only have uniform generators at hand and we need
samples drawn from a Gaussian distribution, i.e., p(x) = G(0, 1). Use the
uniform distribution q(x) = U(-5, 5) to draw samples from G(0, 1).
1. Use rejection sampling to get samples from p(x).
2. Use importance sampling to get samples from p(x).
In both cases plot the histogram of samples overlaid on the plot of the
original Gaussian. What is the percentage of samples you had to reject
with the rejection sampler?
Markov Chain Monte Carlo (MCMC)
MCMC is a strategy for generating samples x(i ) while exploring the state
space X using a Markov chain mechanism. This mechanism is
constructed so that the chain spends more time in the most important
regions. In particular, it is constructed so that the samples x(i ) mimic
samples drawn from the target distribution p(x).
We recall that we use MCMC when we cannot draw samples from p(x)
directly, but can evaluate p(x) up to a normalizing constant.
It is intuitive to introduce Markov chains on finite state spaces,
where x(i) can only take s discrete values x(i) ∈ X = {x1, x2, . . . , xs}.
The stochastic process x(i) is called a Markov chain if
p(x(i) | x(i−1), . . . , x(1)) = p(x(i) | x(i−1)).
Metropolis-Hastings algorithm
We obtain a set of samples {x(1), x(2),…, x(N)} from p(x),
as we show on the next set of slides.
Metropolis-Hastings Example. Bimodal target distribution
p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
Proposal q(x | x(i)) = G(x(i), 100)
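The example can be reproduced with a short random-walk Metropolis-Hastings sampler; since the Gaussian proposal is symmetric, the q terms cancel in the acceptance ratio (the chain length and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def p_tilde(x):
    """The slide's bimodal target, known only up to a constant."""
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10.0)**2)

def metropolis_hastings(n, x0=0.0, proposal_std=10.0):
    x = x0
    chain = np.empty(n)
    for i in range(n):
        x_star = x + proposal_std * rng.standard_normal()  # q(x* | x) = G(x, 100)
        # symmetric proposal: accept with probability min(1, p(x*)/p(x))
        if rng.uniform() < p_tilde(x_star) / p_tilde(x):
            x = x_star
        chain[i] = x
    return chain

chain = metropolis_hastings(50_000)
```

Roughly 30% of the samples should fall in the mode near 0 and 70% in the mode near 10, matching the mixture weights.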
Simulated annealing for global optimization
Our goal is to find a global maximum of p(x).
This technique involves simulating a non-homogeneous Markov chain
whose invariant distribution at iteration i is no longer equal to p(x), but to
pi(x) ∝ p^(1/Ti)(x), where Ti is a decreasing cooling schedule with lim_{i→∞} Ti = 0.
Simulated annealing Example. Bimodal target distribution
p(x) ∝ 0.3 exp(−0.2x²) + 0.7 exp(−0.2(x − 10)²)
Proposal q(x | x(i)) = G(x(i), 100)
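A sketch of simulated annealing on the same target: run Metropolis steps on p^(1/Ti) while Ti shrinks, and keep the best point seen (the logarithmic cooling schedule and iteration count are my own choices):

```python
import numpy as np

rng = np.random.default_rng(7)

def p_tilde(x):
    return 0.3 * np.exp(-0.2 * x**2) + 0.7 * np.exp(-0.2 * (x - 10.0)**2)

def simulated_annealing(n=5000, x0=0.0, proposal_std=10.0):
    x = x0
    best_x, best_val = x0, p_tilde(x0)
    for i in range(1, n + 1):
        T = 1.0 / np.log(1.0 + i)        # cooling schedule T_i -> 0
        x_star = x + proposal_std * rng.standard_normal()
        ratio = p_tilde(x_star) / p_tilde(x)
        # Metropolis step targeting p^(1/T): sharpens around the global max
        if ratio >= 1.0 or rng.uniform() < ratio ** (1.0 / T):
            x = x_star
        if p_tilde(x) > best_val:        # track the best state visited
            best_x, best_val = x, p_tilde(x)
    return best_x

x_max = simulated_annealing()   # should end up near the global maximum at 10
```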
Gibbs Sampler
Gibbs sampler can be viewed as a special case of the Metropolis-Hastings
algorithm. Suppose we have an n-dimensional vector x=(x1, . . . , xj, . . . , xn).
and the expressions for the full conditionals
p(x j | x−j) = p(x j | x1, . . . , xj−1, x j+1, . . . , xn).
In this case, it is often advantageous to use the proposal distribution q
defined for j = 1, . . . , n as
q(x* | x(i)) = p(x*_j | x(i)_−j) if x*_−j = x(i)_−j, and 0 otherwise.
The corresponding acceptance probability is:
A(x(i), x*) = min{1, [p(x*) p(x(i)_j | x(i)_−j)] / [p(x(i)) p(x*_j | x*_−j)]}
Since p(x(i)_j | x(i)_−j) = p(x(i)) / p(x(i)_−j) and p(x*_j | x*_−j) = p(x*) / p(x*_−j),
the ratio reduces to p(x*_−j) / p(x(i)_−j) = 1, because x*_−j = x(i)_−j.
Thus we always accept the new sample.
Gibbs Sampling
x1(t+1) ~ π(x1 | x2(t), x3(t), …, xK(t))
x2(t+1) ~ π(x2 | x1(t+1), x3(t), …, xK(t))
…
xK(t+1) ~ π(xK | x1(t+1), …, xK−1(t+1))
[Figure: Gibbs sampling path alternating along the x1 and x2 axes.]
Slide by Ce Liu
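For a concrete instance, consider a bivariate Gaussian target with correlation ρ (my own example): for this target each full conditional is itself Gaussian, x1 | x2 ~ N(ρ·x2, 1 − ρ²) and symmetrically for x2, so every Gibbs step is an exact draw:

```python
import numpy as np

rng = np.random.default_rng(8)
rho = 0.8   # correlation of the assumed bivariate Gaussian target

def gibbs(n, burn_in=1000):
    x1, x2 = 0.0, 0.0
    chain = np.empty((n, 2))
    kept = 0
    for t in range(n + burn_in):
        # sweep the coordinates, each drawn from its full conditional
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho**2))
        if t >= burn_in:
            chain[kept] = (x1, x2)
            kept += 1
    return chain

chain = gibbs(50_000)
```

The sample correlation of the chain should be close to ρ = 0.8.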