Computer Vision - RWTH Aachen University

Download Report

Transcript Computer Vision - RWTH Aachen University

Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
Machine Learning – Lecture 16
Approximate Inference
07.07.2009
Bastian Leibe
RWTH Aachen
http://www.umic.rwth-aachen.de/multimedia
[email protected]
Announcements
• Lecture evaluation
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


First 10-15 minutes of this lecture…
While the beamer is heating up…
B. Leibe
2
Course Outline
• Fundamentals (2 weeks)
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Bayes Decision Theory
Probability Density Estimation
• Discriminative Approaches (5 weeks)


Lin. Discriminants, SVMs, Boosting
Dec. Trees, Random Forests, Model Sel.
• Graphical Models (5 weeks)




Bayesian Networks & Applications
Markov Random Fields & Applications
Exact Inference
Approximate Inference
• Regression Problems (2 weeks)

Gaussian Processes
B. Leibe
3
Recap: Parameter Learning in BNs
• We need to specify two things:
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Structure of Bayesian network (graph topology)
Parameters of each conditional probability table (CPT)
• It is possible to learn both from training data.


But learning structure is much harder than learning parameters.
Also, learning when some nodes are hidden is much harder than
when everything is observable.
• Four cases:
Structure
Observability
Method
Known
Full
Maximum Likelihood Estimation
Known
Partial
EM (or gradient ascent)
Unknown
Full
Search through model space
Unknown
Partial
EM + search through model space
B. Leibe
4
Recap: Learning with Known Structure
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• ML-Learning with complete data (no hidden variables)

Log-likelihood decomposes into sum of functions of µi.

Each µi can be optimized separately.

ML-solution: simply calculate frequencies.
• ML-Learning with incomplete data (hidden variables)

Iterative EM algorithm.
E-step: compute expected counts given previous settings µ(t) of
parameters E[ni,j,k|D,µ(t)].

M-step: re-estimate parameters µ using the expected counts.

( t + 1)
µi ;j ;k
Slide credit: Bernt Schiele
E [n i ;j ;k jD ; µ( t ) ]
à P
0 jD ; µ( t ) ]
E
[n
0
i
;j
;k
k
B. Leibe
5
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
Recap: Unknown Structure
• Goal

Learn a directed acyclic graph (DAG) that best explains the data.
• Constraints-based learning


Use statistical tests of marginal and conditional independence.
Find the set of DAGs whose d-separation relations match the
results of conditional independence tests.
• Score-based learning


Use a global score such as BIC (Bayes Information Criterion).
Find a structure that maximizes this score.
Slide adapted from Zoubin Ghahramani
B. Leibe
6
Topics of This Lecture
• Approximate Inference
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Variational methods
Sampling approaches
• Sampling approaches




Sampling from a distribution
Ancestral Sampling
Rejection Sampling
Importance Sampling
• Markov Chain Monte Carlo




Markov Chains
Metropolis Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
B. Leibe
7
Approximate Inference
• Exact Bayesian inference is often intractable.
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

Often infeasible to evaluate the posterior distribution or to
compute expectations w.r.t. the distribution.
– E.g. because the dimensionality of the latent space is too high.
– Or because the posterior distribution has a too complex form.

Problems with continuous variables
– Required integrations may not have closed-form solutions.

Problems with discrete variables
– Marginalization involves summing over all possible configurations of
the hidden variables.
– There may be exponentially many such states.
 We need to resort to approximation schemes.
B. Leibe
8
Two Classes of Approximation Schemes
• Deterministic approximations (Variational methods)
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

Based on analytical approximations to the posterior distribution
– E.g. by assuming that it factorizes in a certain form
– Or that it has a certain parametric form (e.g. a Gaussian).
 Can never generate exact results, but are often scalable to large
applications.
• Stochastic approximations (Sampling methods)
Given infinite computationally resources, they can generate
exact results.
 Approximation arises from the use of a finite amount of
processor time.
 Enable the use of Bayesian techniques across many domains.
 But: computationally demanding, often limited to small-scale
problems.

B. Leibe
9
Topics of This Lecture
• Approximate Inference
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Variational methods
Sampling approaches
• Sampling approaches




Sampling from a distribution
Ancestral Sampling
Rejection Sampling
Importance Sampling
• Markov Chain Monte Carlo




Markov Chains
Metropolis Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
B. Leibe
10
Sampling Idea
• Objective:
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

Evaluate expectation of a function f(z)
w.r.t. a probability distribution p(z).
• Sampling idea

Draw L independent samples z(l) with l = 1,…,L from p(z).

This allows the expectation to be approximated by a finite sum
XL
1
f^ =
f (zl )
L
l= 1

As long as the samples z(l) are drawn independently from p(z),
then
 Unbiased estimate, independent of the dimension of z!
Slide adapted from Bernt Schiele
B. Leibe
11
Sampling – Challenges
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Problem 1: Samples might not be independent
 Effective sample size might be much smaller than apparent
sample size.
• Problem 2:
If f(z) is small in regions where p(z) is large and vice versa, the
expectation may be dominated by regions of small probability.
 Large sample sizes necessary to achieve sufficient accuracy.

B. Leibe
12
Parametric Density Model
• Example:
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

A simple multivariate (d-dimensional) Gaussian model
p(xj¹ ; § ) =

1
(2¼) D =2 j§ j 1=2
½
¾
1
exp ¡ (x ¡ ¹ ) T § ¡ 1 (x ¡ ¹ )
2
This is a “generative” model
in the sense that we can generate
samples x according to the
distribution.
Slide adapted from Bernt Schiele
B. Leibe
13
Sampling from a Gaussian
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Given: 1-dim. Gaussian pdf (probability density function)
p(x|¹,¾2) and the corresponding cumulative distribution:
Zx
F¹ ;¾2 (x) =
p(xj¹ ; ¾2 )dx
¡ 1
• To draw samples from a Gaussian, we can invert the
cumulative distribution function:
1
2
u » Uniform(0; 1) ) F¹¡ ;¾
2 (u) » p(xj¹ ; ¾ )
F¹ ;¾2 (x)
p(xj¹ ; ¾2 )
Slide credit: Bernt Schiele
B. Leibe
14
Sampling from a pdf (Transformation method)
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• In general, assume we are given the pdf p(x) and the
corresponding cumulativeZdistribution:
x
F (x) =
p(z)dz
¡ 1
• To draw samples from this pdf, we can invert the
cumulative distribution function:
u » Uniform(0; 1) ) F ¡ 1 (u) » p(x)
Slide credit: Bernt Schiele
B. Leibe
15
Ancestral Sampling
• Generalization of this idea to directed graphical models.
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

Joint probability factorizes into conditional probabilities:
• Ancestral sampling



Assume the variables are ordered such that there are no links
from any node to a lower-numbered node.
Start with lowest-numbered node and draw a sample from its
distribution.
x^1 » p(x 1 )
Cycle through each of the nodes in order and draw samples from
the conditional distribution (where the parent variable is set to
its sampled value).
x^n » p(x n jpan )
B. Leibe
16
Discussion
• Transformation method
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

Limited applicability, as we need to invert the indefinite integral
of the required distribution p(z).
• More general


Rejection Sampling
Importance Sampling
Slide credit: Bernt Schiele
B. Leibe
17
Rejection Sampling
• Assumptions
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Sampling directly from p(z) is difficult.
But we can easily evaluate p(z) (up to some normalization factor
Zp):
1
p(z) =
• Idea


Zp
p~(z)
We need some simpler distribution q(z) (called proposal
distribution) from which we can draw samples.
Choose a constant k such that: 8z : kq(z) ¸ p
~(z)
Slide credit: Bernt Schiele
B. Leibe
18
Rejection Sampling
• Sampling procedure
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine



Generate a number z0 from q(z).
Generate a number u0 from the
uniform distribution over [0,kq(z0)].
If u0 > p
~(z0 ) reject sample, otherwise accept.
– Sample is rejected if it lies in the grey shaded area.
– The remaining pairs (u0,z0) have uniform distribution under the
curve p
~(z) .
• Discussion


Original values of z are generated from the distribution q(z).
Samples are accepted
Z with probability p~(z)=kq(z)
Z
p(accept ) =
1
p~(z)=kq(z)q(z)dz =
k
p~(z)dz
 k should be as small as possible!
Slide credit: Bernt Schiele
B. Leibe
19
Rejection Sampling – Discussion
• Limitation: high-dimensional spaces
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

For rejection sampling to be of practical value, we require that
kq(z) be close to the required distribution, so that the rate of
rejection is minimal.
• Artificial example

Assume that p(z) is Gaussian with covariance matrix ¾p2 I

Assume that q(z) is Gaussian with covariance matrix ¾q2 I
Obviously: ¾q2 ¸ ¾p2
D
 In D dimensions: k = (¾ /¾ ) .
q
p
– Assume ¾q is just 1% larger than ¾p.
– D = 1000  k = 1.011000 ¸ 20,000
1
– And p(accept ) ·
20000
 Often impractical to find good proposal distributions for high
dimensions!

Slide credit: Bernt Schiele
B. Leibe
20
Importance Sampling
• Approach
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Approximate expectations directly
(but does not enable to draw samples from p(z) directly).
Goal:
• Simplistic strategy: Grid sampling

Discretize z-space into a uniform grid.

Evaluate the integrand as a sum of the form

But: number of terms grows exponentially with number of
dimensions!
Slide credit: Bernt Schiele
B. Leibe
21
Importance Sampling
• Idea
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine



Use a proposal distribution q(z) from which it is easy to draw
samples.
Express expectations in the form of a finite sum over samples
{z(l)} drawn from q(z).
with importance weights
p(z( l ) )
rl =
q(z( l ) )
Slide credit: Bernt Schiele
B. Leibe
22
Importance Sampling
• Typical setting:
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

p(z) can only be evaluated up to an unknown normalization
constant

p(z) = p~(z)=Z p
q(z) can also be treated in a similar fashion.
q(z) = q~(z)=Z q

Then

p~(z( l ) )
with: r~l =
q~(z( l ) )
Slide credit: Bernt Schiele
B. Leibe
23
Importance Sampling
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Ratio of normalization constants can be evaluated
Z
Zp
1
=
Zq
Zq
Z
p~(z)dz =
L
p~(z( l ) )
1X
q(z)dz '
r~l
(
l
)
L
q~(z )
l= 1
• and therefore
• with
wl = P
r~l
m
Slide credit: Bernt Schiele
= P
r~m
B. Leibe
p~( z ( l ) )
q~( z ( l ) )
p~( z ( m ) )
m q~( z ( m ) )
24
Importance Sampling – Discussion
• Observations
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Success of importance sampling depends crucially on how well
the sampling distribution q(z) matches the desired distribution
p(z).
Often, p(z)f(z) is strongly varying and has a significant proportion of its mass concentrated over small regions of z-space.
 Weights rl may be dominated by a few weights having large
values.

Practical issue: if none of the samples falls in the regions where
p(z)f(z) is large…
– The results may be arbitrary in error.
– And there will be no diagnostic indication (no large variance in rl)!

Key requirement for sampling distribution q(z):
– Should not be small or zero in regions where p(z) is significant!
Slide credit: Bernt Schiele
B. Leibe
25
Topics of This Lecture
• Approximate Inference
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Variational methods
Sampling approaches
• Sampling approaches




Sampling from a distribution
Ancestral Sampling
Rejection Sampling
Importance Sampling
• Markov Chain Monte Carlo




Markov Chains
Metropolis Algorithm
Metropolis-Hastings Algorithm
Gibbs Sampling
B. Leibe
26
Independent Sampling vs. Markov Chains
• So far
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


We’ve considered two methods, Rejection Sampling and
Importance Sampling, which were both based on independent
samples from q(z).
However, for many problems of practical interest, it is difficult
or impossible to find q(z) with the necessary properties.
• Different approach



We abandon the idea of independent sampling.
Instead, rely on a Markov Chain to generate dependent samples
from the target distribution.
Independence would be a nice thing, but it is not necessary for
the Monte Carlo estimate to be valid.
Slide credit: Zoubin Ghahramani
B. Leibe
27
MCMC – Markov Chain Monte Carlo
• Overview
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


Allows to sample from a large class of distributions.
Scales well with the dimensionality of the sample space.
• Idea

We maintain a record of the current state z(¿)

The proposal distribution depends on the current state: q(z|z(¿))

The sequence of samples forms a Markov chain z(1), z(2),…
• Setting


We can evaluate p(z) (up to some normalizing factor Zp):
p~(z)
p(z) =
Zp
At each time step, we generate a candidate sample from the
proposal distribution and accept the sample according to a
criterion.
Slide credit: Bernt Schiele
B. Leibe
28
MCMC – Metropolis Algorithm
• Metropolis algorithm
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


[Metropolis et al., 1953]
Proposal distribution is symmetric: q(zA jzB ) = q(zB jzA )
The new candidate sample z* is accepted with probability
µ
¶
?
p
~
(z
)
A(z? ; z( ¿) ) = min 1;
p~(z( ¿) )
• Implementation

Choose random number u uniformly from unit interval (0,1).

Accept sample if A(z? ; z(¿) ) > u .
• Note

New candidate samples always accepted if p
~(z? ) ¸ p~(z(¿) ) .
– I.e. when new sample has higher probability than the previous one.

The algorithm sometimes accepts a state with lower probability.
Slide credit: Bernt Schiele
B. Leibe
29
MCMC – Metropolis Algorithm
• Two cases
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


If new sample is accepted:
Otherwise:
z(¿+ 1) = z?
z(¿+ 1) = z(¿)
This is in contrast to rejection sampling, where rejected samples
are simply discarded.
 Leads to multiple copies of the same sample!

Slide credit: Bernt Schiele
B. Leibe
30
MCMC – Metropolis Algorithm
• Property
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

When q(zA|zB) > 0 for all z, the distribution of z¿ tends to p(z)
as ¿ ! 1.
• Note


Sequence z(1), z(2),… is not a set of independent samples from
p(z), as successive samples are highly correlated.
We can obtain (largely) independent samples by just retaining
every Mth sample.
• Example: Sampling from a Gaussian



Proposal: Gaussian with ¾ = 0.2.
Green:
Red:
accepted samples
rejected samples
Slide credit: Bernt Schiele
B. Leibe
31
Markov Chains
• Question
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

How can we show that z¿ tends to p(z) as ¿ ! 1?
• Markov chains


First-order Markov chain:
³
´
³
´
p z(m + 1) jz( 1) ; : : : ; z(m ) = p z(m + 1) jz(m )
Marginal probability
³
´
³
´ ³
´
X
p z( m + 1) =
p z(m + 1) jz(m ) p z( m )
z( m )
Slide credit: Bernt Schiele
B. Leibe
32
Markov Chains – Properties
• Invariant distribution
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine



A distribution is said to be invariant (or stationary) w.r.t. a
Markov chain if each step in the chain leaves that distribution
invariant.
Transition probabilities:
³
´
³
´
(m ) ( m + 1)
( m + 1) (m )
T z ;z
= p z
jz
Distribution p*(z) is invariant if:
X
?
p (z) =
T (z0; z) p? (z0)
z0
• Detailed balance


Sufficient (but not necessary) condition to ensure that a
distribution is invariant:
p? (z)T (z; z0) = p? (z0)T (z0; z)
A Markov chain which respects detailed balance is reversible.
Slide credit: Bernt Schiele
B. Leibe
33
MCMC – Metropolis-Hastings Algorithm
• Metropolis-Hastings Algorithm
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine



Generalization: Proposal distribution not required to be
symmetric.
The new candidate sample z* is accepted with probability
µ
¶
?
( ¿) ?
p~(z )qk (z jz )
A(z? ; z( ¿) ) = min 1;
p~(z( ¿) )qk (z? jz( ¿) )
where k labels the members of the set of possible transitions
considered.
• Note

When the proposal distributions are symmetric, MetropolisHastings reduces to the standard Metropolis algorithm.
Slide credit: Bernt Schiele
B. Leibe
34
MCMC – Metropolis-Hastings Algorithm
• Properties
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


We can show that p(z) is an invariant distribution of the Markov
chain defined by the Metropolis-Hastings algorithm.
We show detailed balance:
p(z)qk (zjz0)A k (z0; z) = min f p(z)qk (zjz0); p(z0)qk (z0jz)g
= min f p(z0)qk (z0jz); p(z)qk (zjz0)g
= p(z0)qk (z0jz)A k (z; z0)
Slide credit: Bernt Schiele
B. Leibe
35
MCMC – Metropolis-Hastings Algorithm
• Schematic illustration
For continuous state spaces, a common
choice of proposal distribution is a
Gaussian centered on the current state.
 What should be the variance of the
proposal distribution?
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine

– Large variance: rejection rate will be high for complex problems.
– The scale ½ of the proposal distribution should be as large as
possible without incurring high rejection rates.
 ½ should be of the same order as the smallest length scale ¾min.

This causes the system to explore the distribution by means of a
random walk.
– Undesired behavior: number of steps to arrive at state that is
independent of original state is of order (¾max/¾min)2.
– Strong correlations can slow down the Metropolis algorithm!
B. Leibe
36
Gibbs Sampling
• Approach
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine


MCMC-algorithm that is simple and widely applicable.
May be seen as a special case of Metropolis-Hastings.
• Idea

Sample variable-wise: replace zi by a value drawn from the
distribution p(zi|z\i).
– This means we update one coordinate at a time.

Repeat procedure either by cycling through all variables or by
choosing the next variable.
Slide credit: Bernt Schiele
B. Leibe
37
Gibbs Sampling
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Example

Assume distribution p(z1, z2, z3).

Replace z1 with new value drawn from z1(¿+ 1) » p(z1 jz2(¿) ; z3(¿) )

Replace z2( ¿) with new value drawn from z2


( ¿)
( ¿)
(¿+ 1)
» p(z2 jz1
(¿+ 1)
» p(z3jz1
Replace z3 with new value drawn from z3
And so on…
Slide credit: Bernt Schiele
B. Leibe
(¿+ 1)
; z3 )
(¿)
(¿+ 1)
; z2
(¿+ 1)
38
)
Gibbs Sampling
• Properties
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine





The factor that determines the acceptance probability in the
Metropolis-Hastings is determined by
? ?
?
? ?
?
?
p(z
jz
)p(z
)p(z
jznk )
p(z
)q
(zjz
)
k
k
nk
nk
k
?
A(z ; z) =
=
= 1
?
p(z)qk (z jz)
p(zk jznk )p(znk )p(zk jznk )
I.e. we get an algorithm which always accepts!
If you can compute (and sample from) the conditionals, you can
apply Gibbs sampling.
The algorithm is completely parameter free.
Can also be applied to subsets of variables.
Slide credit: Zoubin Ghahramani
B. Leibe
39
Gibbs Sampling
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Example

20 iterations of Gibbs sampling on a bivariate Gaussian.

Note: strong correlations can slow down Gibbs sampling.
Slide credit: Zoubin Ghahramani
B. Leibe
40
Summary: Approximate Inference
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
• Exact Bayesian Inference often intractable.
• Rejection and Importance Sampling


Generate independent samples.
Impractical in high-dimensional state spaces.
• Markov Chain Monte Carlo (MCMC)



Simple & effective (even though typically computationally
expensive).
Scales well with the dimensionality of the state space.
Issues of convergence have to be considered carefully.
• Gibbs Sampling



Used extensively in practice.
Parameter free
Requires sampling conditional distributions.
B. Leibe
41
References and Further Reading
• Sampling methods for approximate inference are
Augmented Computing
and Sensory
Perceptual
Summer’09
Learning,
Machine
described in detail in Chapter 11 of Bishop’s book.
Christopher M. Bishop
Pattern Recognition and Machine Learning
Springer, 2006
David MacKay
Information Theory, Inference, and Learning Algorithms
Cambridge University Press, 2003
• Another good introduction to Monte Carlo methods can
be found in Chapter 29 of MacKay’s book (also available
online: http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html)
B. Leibe
42