Stochastic approximate inference
Kay H. Brodersen
Computational Neuroeconomics Group
Department of Economics
University of Zurich
Machine Learning and Pattern Recognition Group
Department of Computer Science
ETH Zurich
http://people.inf.ethz.ch/bkay/

When do we need approximate inference?

• How to evaluate the posterior distribution of the model parameters?
  p(θ|X) = p(X|θ) p(θ) / p(X)   →   sample from an arbitrary distribution

• How to compute the evidence term?
  p(X) = ∫ p(X|θ) p(θ) dθ = E_{p(θ)}[p(X|θ)]   →   compute an expectation

• How to compute the expectation of the posterior?
  E[θ|X] = ∫ θ p(θ|X) dθ   →   compute an expectation

• How to make a point prediction?
  ∫ y p(y|X) dy = E[y|X]   →   compute an expectation

Which type of approximate inference?

Deterministic approximations (through structural assumptions)
+ computationally efficient
+ efficient representation
+ learning rules may give additional insight
− application often requires mathematical derivations (hard work)
− systematic error

Stochastic approximations (through sampling)
+ asymptotically exact
+ easily applicable general-purpose algorithms
− computationally expensive
− storage intensive

Themes in stochastic approximate inference

• Sampling from a desired target distribution: we need to find a way of drawing
  random numbers from some target distribution p(z).
  [Figure: a target density p(z) and a function f(z); sampling yields
  z^(1) = 0.201, z^(2) = 1.807, …, z^(10,000) = 0.870.]

• Computing an expectation w.r.t. that target distribution: we can approximate
  the expectation of z using the sample mean:
  E[z] ≈ (1/L) Σ_{l=1}^{L} z^(l)

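As a minimal MATLAB illustration of both themes (MATLAB being the language used
later in these slides), the sketch below assumes a standard normal target, for
which randn already provides a sampler, and approximates E[z] by the sample
mean; the true value is 0.

    % Monte Carlo estimation: approximate E[z] for z ~ N(0,1)
    % by the mean of L samples drawn from the target.
    L = 10000;
    z = randn(1, L);    % draw L samples from the target distribution
    Ez = mean(z)        % sample mean; approaches E[z] = 0 as L grows
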
1  Transformation method

Transformation method for sampling from p(z)

• Idea: we can obtain samples from some distribution p(z) by first sampling
  from the uniform distribution and then transforming these samples.

[Figure: a uniform sample u = 0.9 under the density p(u) is transformed into
the sample z = 1.15 under the target density p(z).]

Transformation method: algorithm

• Algorithm for sampling from p(z)
  1. Draw a random number from the uniform distribution: u^(τ) ~ U(0,1).
  2. Transform u^(τ) by applying the inverse cumulative distribution function
     (cdf) of the desired target distribution: z^(τ) = F⁻¹(u^(τ)).
  3. Repeat both steps for τ = 1, …, T.

Transformation method: example

• Example: sampling from the exponential distribution
  – The desired pdf is: p(z|λ) = λ exp(−λz)
  – The corresponding cdf is: F(z) = 1 − exp(−λz)
  – The inverse cdf is: F⁻¹(u) = −(1/λ) ln(1 − u)
  – Thus, z^(τ) = −(1/λ) ln(1 − u^(τ)) is a sample from the exponential
    distribution.

• Implementation in MATLAB (with λ defined so the snippet runs as-is):

    lambda = 2;                        % rate parameter of the target
    z = zeros(1, 10000);               % preallocate the sample vector
    for t = 1:10000
        z(t) = -1/lambda*log(1-rand);  % inverse cdf of a uniform sample
    end
    hist(z)                            % empirical distribution of the samples
    mean(z)                            % should approach 1/lambda

Transformation method: summary

• Discussion
  + yields high-quality samples
  + easy to implement
  + computationally efficient
  − obtaining the inverse cdf can be difficult

2  Rejection sampling and importance sampling

Rejection sampling

• Idea: when the transformation method cannot be applied, we can resort to a
  more general method called rejection sampling. Here, we draw random numbers
  from a simpler proposal distribution q(z) and keep only some of these
  samples. The proposal distribution q(z), scaled by a factor k, represents an
  envelope of p(z), i.e., k q(z) ≥ p(z) for all z.

Rejection sampling: algorithm

• Algorithm for sampling from p(z)
  1. Sample z₀ from q(z).
  2. Sample u₀ from U(0,1).
  3. If u₀ ≤ p(z₀)/(k q(z₀)), then accept the sample: z^(τ) = z₀.
  4. Otherwise, discard z₀ and u₀.
  5. Repeat until we have obtained T accepted samples.

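A minimal MATLAB sketch of this algorithm, under illustrative assumptions that
are not part of the slides: an unnormalized target p (a modulated Gaussian), a
standard normal proposal q, and a scaling factor k = 6 chosen so that
k q(z) ≥ p(z) everywhere.

    % Rejection sampling: unnormalized target p(z), proposal q(z) = N(0,1).
    p = @(z) exp(-z.^2/2) .* (1 + sin(3*z).^2);  % example unnormalized target
    q = @(z) exp(-z.^2/2) / sqrt(2*pi);          % proposal pdf
    k = 6;                                       % ensures k*q(z) >= p(z)
    T = 10000; z = zeros(1, T); t = 0;
    while t < T
        z0 = randn;                    % step 1: sample z0 from q(z)
        u0 = rand;                     % step 2: sample u0 from U(0,1)
        if u0 <= p(z0) / (k*q(z0))     % step 3: acceptance test
            t = t + 1; z(t) = z0;      % accept the sample
        end                            % step 4: otherwise discard z0, u0
    end
    hist(z, 50)                        % empirical distribution of the samples

The closer k q(z) hugs p(z), the fewer candidates are rejected.
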
Importance sampling

• Idea: if our goal is to compute the expectation E[z], we can outperform
  rejection sampling by bypassing the generation of random samples.

• Naïve approach
  – A naïve approach would be to approximate the expectation as follows. Rather
    than sampling from p(z), we discretize z-space into a uniform grid and
    evaluate:
    E[z] ≈ Σ_{l=1}^{L} p(z^(l)) z^(l)
  – There are two problems with this approach:
    – The number of terms in the summation grows exponentially with the
      dimensionality of z.
    – Only a small proportion of the samples will make a significant
      contribution to the sum. Uniform sampling clearly is very inefficient.

Importance sampling

• Addressing the two problems of the naïve approach above, given a proposal
  distribution q(z), we can approximate the expectation as
  E[z] = ∫ z p(z) dz = ∫ z (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^{L} (p(z^(l))/q(z^(l))) z^(l)
  where the samples z^(l) are drawn from q.

• The quantities w_l = p(z^(l))/q(z^(l)) are known as importance weights, and
  they correct the bias introduced by sampling from the wrong distribution.

• Unlike in the case of rejection sampling, all of the generated samples are
  retained.

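A minimal MATLAB sketch under illustrative assumptions (not from the slides):
target p(z) = N(3,1), proposal q(z) = N(0,2²), and the goal of estimating E[z],
whose true value is 3.

    % Importance sampling: estimate E[z] under p(z) = N(3,1) using
    % samples drawn from the proposal q(z) = N(0,2^2).
    p = @(z) exp(-(z-3).^2/2) / sqrt(2*pi);   % target pdf
    q = @(z) exp(-z.^2/8) / sqrt(8*pi);       % proposal pdf (sd = 2)
    L = 100000;
    z = 2*randn(1, L);                        % all samples come from q
    w = p(z) ./ q(z);                         % importance weights w_l
    Ez = mean(w .* z)                         % approximates E[z] = 3

Note that every draw contributes to the estimate; none is discarded.
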
3  Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC)

• Idea: we can sample from a large class of distributions and overcome the
  problems that previous methods face in high dimensions using a framework
  called Markov Chain Monte Carlo.

Background on Markov chains

• A first-order Markov chain is defined as a series of random variables
  z^(1), …, z^(T) such that the following conditional-independence property
  holds:
  p(z^(τ+1) | z^(1), …, z^(τ)) = p(z^(τ+1) | z^(τ))

• Thus, the graphical model of a Markov chain is a chain:
  z^(1) → z^(2) → z^(3) → … → z^(T)

• A Markov chain is specified in terms of
  – the initial probability distribution p(z^(0))
  – the transition probabilities p(z^(τ+1) | z^(τ))

Background on Markov chains

[Figure: state diagrams of two three-state Markov chains (with transition
probabilities such as 0.5 between states), each shown together with its
equilibrium distribution p(z) over the states z = 1, 2, 3.]

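Numerically, the equilibrium distribution of such a chain can be found by
propagating an arbitrary initial distribution through the transition matrix; a
minimal MATLAB sketch with an illustrative 3-state matrix P, where P(i,j) is
the probability of moving from state i to state j:

    % Equilibrium distribution of a 3-state Markov chain.
    P  = [0.0 0.5 0.5;
          0.5 0.0 0.5;
          0.5 0.5 0.0];   % transition matrix; each row sums to 1
    mu = [1 0 0];         % arbitrary initial distribution p(z^(0))
    for tau = 1:1000
        mu = mu * P;      % propagate the distribution one step
    end
    mu                    % converges to the equilibrium [1/3 1/3 1/3]
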
The idea behind MCMC

[Figure: a desired target distribution p(z), and below it the empirical
distribution f(z) of samples 1, …, 8 produced by a Markov chain whose
equilibrium distribution is p(z).]

Metropolis algorithm

• Algorithm for sampling from p(z)
  1. Initialize by drawing z^(1) somehow.
  2. At cycle τ+1, draw a candidate sample z* from q(z | z^(τ)).
     Importantly, q needs to be symmetric, i.e., q(z₁ | z₂) = q(z₂ | z₁).
  3. Accept z^(τ+1) ← z* with probability
     A(z*, z^(τ)) = min(1, p̃(z*)/p̃(z^(τ))) = min(1, p(z*)/p(z^(τ))),
     and otherwise set z^(τ+1) ← z^(τ). Since only a ratio of densities enters,
     the unnormalized density p̃ suffices: the normalization constant cancels.

• Notes
  – In contrast to rejection sampling, each cycle leads to a new sample, even
    when the candidate z* is discarded.
  – Note that the sequence z^(1), z^(2), … is not a set of independent samples
    from p(z) because successive samples are highly correlated.

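A minimal MATLAB sketch of the algorithm, assuming (illustratively, not from
the slides) a symmetric Gaussian random-walk proposal q(z* | z) = N(z, 0.5²)
and the same unnormalized target p̃ as in the rejection-sampling sketch above:

    % Metropolis sampling with a symmetric random-walk proposal;
    % only the unnormalized target density ptilde is needed.
    ptilde = @(z) exp(-z.^2/2) .* (1 + sin(3*z).^2);
    T = 10000; z = zeros(1, T);
    z(1) = 0;                              % initialize z(1) somehow
    for tau = 1:T-1
        zstar = z(tau) + 0.5*randn;        % symmetric candidate draw
        A = min(1, ptilde(zstar)/ptilde(z(tau)));  % acceptance probability
        if rand <= A
            z(tau+1) = zstar;              % accept the candidate
        else
            z(tau+1) = z(tau);             % reject: repeat the current state
        end
    end
    hist(z, 50)                            % correlated samples from p(z)
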
Metropolis-Hastings algorithm

• Algorithm for sampling from p(z)
  1. Initialize by drawing z^(1) somehow.
  2. At cycle τ+1, draw a candidate sample z* from q_k(z | z^(τ)).
     In contrast to the Metropolis algorithm (see previous slide), q no longer
     needs to be symmetric.
  3. Accept z^(τ+1) ← z* with probability
     A_k(z*, z^(τ)) = min(1, [p̃(z*) q_k(z^(τ) | z*)] / [p̃(z^(τ)) q_k(z* | z^(τ))]),
     and otherwise set z^(τ+1) ← z^(τ).

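Relative to the Metropolis sketch above, only the candidate draw and the
acceptance probability change. A hedged fragment with an illustrative
asymmetric proposal (a Gaussian centred on the current state whose width grows
with it), intended to replace the corresponding two lines inside that loop:

    % Asymmetric proposal q_k(a | b): Gaussian centred on b with a
    % state-dependent width, hence the correction terms in A below.
    sd = @(b) sqrt(1 + b.^2);
    qk = @(a, b) exp(-(a-b).^2./(2*sd(b).^2)) ./ (sqrt(2*pi)*sd(b));
    zstar = z(tau) + sd(z(tau))*randn;     % candidate from q_k(. | z(tau))
    A = min(1, (ptilde(zstar) *qk(z(tau), zstar)) / ...
               (ptilde(z(tau))*qk(zstar, z(tau))));
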
Metropolis: accept or reject?

• Increase in density, p(z*) ≥ p(z^(τ)): the acceptance probability is α = 1,
  so the candidate z* is always accepted.

• Decrease in density, p(z*) < p(z^(τ)): the acceptance probability is
  α = p̃(z*)/p̃(z^(τ)), so the candidate is accepted in proportion to the ratio
  of densities.

Gibbs sampling

• Idea: as an alternative to the Metropolis-Hastings algorithm, Gibbs sampling
  is less broadly applicable but does away with acceptance tests and can
  therefore be more efficient.

• Suppose we wish to sample from a multivariate distribution
  p(z) = p(z₁, …, z_M), e.g., representing several variables in a model. For
  example, we might be interested in their joint posterior distribution.

• In Gibbs sampling, we update one component at a time.

[Figure: the components z₁, …, z₆ of the vector z across successive cycles; in
each cycle, one component is replaced by a new sample while all others are held
fixed.]

Gibbs sampling

• Algorithm for sampling from p(z)
  1. Initialize {z_i : i = 1, …, M} somehow.
  2. At cycle τ+1, sample z_i^(τ+1) ~ p(z_i | z_{\i}^(τ)), i.e., replace the
     i-th variable by a new sample, drawn from a distribution that is
     conditioned on the current values of all other variables. The resulting
     new vector is our new sample. A worked sketch follows below.
  3. In the next cycle, replace a different variable i. The simplest procedure
     is to go round i = 1, …, M, 1, …, M, …; alternatively, i could be chosen
     randomly.

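A minimal MATLAB sketch, for the illustrative case (not from the slides) of a
bivariate Gaussian with zero means, unit variances, and correlation ρ, where
both full conditionals are Gaussian in closed form: z₁ | z₂ ~ N(ρz₂, 1−ρ²),
and symmetrically for z₂.

    % Gibbs sampling from a bivariate Gaussian with correlation rho.
    rho = 0.8; T = 10000;
    z = zeros(T, 2);                 % initialize (z1, z2) = (0, 0) somehow
    for tau = 1:T-1
        z(tau+1, 1) = rho*z(tau, 2)   + sqrt(1-rho^2)*randn;  % z1 | z2
        z(tau+1, 2) = rho*z(tau+1, 1) + sqrt(1-rho^2)*randn;  % z2 | z1
    end
    corrcoef(z(:,1), z(:,2))         % sample correlation approaches rho
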
Summary

• Throughout Bayesian statistics, we encounter intractable problems. Most of
  these problems are: (i) evaluating a distribution; or (ii) computing the
  expectation of a distribution.

• Sampling methods provide a stochastic alternative to deterministic methods.
  They are usually computationally less efficient, but are asymptotically
  correct, broadly applicable, and easy to implement.

• We looked at three main approaches:
  – Transformation method: efficient sampling from simple distributions
  – Rejection sampling and importance sampling: sampling from arbitrary
    distributions; direct computation of an expected value
  – Markov chain Monte Carlo (MCMC): efficient sampling from high-dimensional
    distributions through the Metropolis-Hastings algorithm or Gibbs sampling

Supplementary slides

Transformation method: proof

• Procedure: z^(τ) = F⁻¹(u^(τ))

• Proof: let Z ≔ F⁻¹(U). Is Z ~ F?
  P(Z ≤ t) = P(F⁻¹(U) ≤ t) = P(U ≤ F(t))
  Moreover, U ~ U(0,1) ⟹ P(U ≤ u) = u for 0 ≤ u ≤ 1.
  Therefore P(U ≤ F(t)) = F(t), and hence Z ~ F.

Background on Markov chains

• Ergodicity: a Markov chain is ergodic if the distribution p(z^(τ)) converges
  to an invariant distribution p*(z) irrespective of the choice of initial
  distribution p(z^(0)). A homogeneous Markov chain will be ergodic, subject to
  some weak restrictions on the invariant distribution and the transition
  probabilities.

• Equilibrium distribution: given an ergodic Markov chain, the invariant
  distribution to which any initial distribution converges is called the
  equilibrium distribution. (An ergodic Markov chain may have several invariant
  distributions, but only one of these is the equilibrium distribution.)