Bayesian Belief Propagation and Image
Interpretation
March 13, 2002
Presenter: David Rosenberg
Overview
• Deals with problems in which we want to
estimate local scene properties that may
depend, to some extent, on global properties
• The paper demonstrates that Bayesian Belief
Propagation (BBP) is well suited to
this class of problems
– In the paper’s examples, BBP’s answers are often
significantly better, and it converges significantly faster,
than the alternative techniques.
An Introductory Problem:
Interpolation
• Find a sequence of consecutive line segments that
– approximates our data points, and
– has small derivatives for each segment.
[Figure: data points, with “?” marking the values to be interpolated]
Interpolation Problem (continued)
• We can formalize this problem as minimizing the
following cost functional:
$$J(Y) = \sum_{k:\; y_k^* \text{ observed}} (y_k - y_k^*)^2 \;+\; \lambda \sum_i (y_i - y_{i-1})^2, \qquad Y = (y_1, \ldots, y_n, y_1^*, \ldots, y_n^*)$$
NOTE: J(Y) is a sum of terms, each
containing only “neighboring” variables.
(A gradient-descent sketch follows the solver list below.)
• Standard solutions to minimization problems:
– Gradient Descent / Relaxation
• Gauss-Seidel relaxation
• Successive over relaxation (SOR)
– Simulated Annealing
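To make the cost concrete, here is a minimal sketch in Python/NumPy of plain gradient descent on J(Y); the observation indices, values, lambda, and step size are all illustrative, not taken from the talk.

```python
import numpy as np

# A minimal sketch of gradient descent on the interpolation cost
# J(Y) = sum_{k observed} (y_k - y*_k)^2 + lam * sum_i (y_i - y_{i-1})^2.
# The data values, lam, and step size below are illustrative.

n = 10
observed = {0: 2.0, 4: 5.0, 9: 1.0}   # index -> observed value y*_k
lam = 1.0                              # smoothness weight
y = np.zeros(n)                        # initial guess

def grad(y):
    g = np.zeros(n)
    for k, ystar in observed.items():  # data term: 2 * (y_k - y*_k)
        g[k] += 2.0 * (y[k] - ystar)
    d = np.diff(y)                     # smoothness term: gradient of
    g[1:] += 2.0 * lam * d             # lam * (y_i - y_{i-1})^2 w.r.t.
    g[:-1] -= 2.0 * lam * d            # both endpoints of each difference
    return g

for _ in range(2000):                  # plain gradient descent
    y -= 0.05 * grad(y)

print(np.round(y, 2))                  # smooth curve through the observations
```

Gauss-Seidel and SOR differ mainly in how these coordinate updates are ordered and scaled.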
The Core Idea
• We can rewrite certain cost functional
minimization problems as MAP estimate
problems for Markov Random Fields
• This is important because Bayesian Belief
Propagation gives optimal solutions very
quickly for MRFs with certain graph structures
Mapping Cost Minimization to MAP
• Note that minimizing J(Y) is equivalent to
maximizing exp( - J(Y) ).
• Suppose our cost functional has the form:
$$J(Y) = \sum_{Y_C \subseteq Y} J_C(Y_C)$$
• Then we can also find Y that maximizes
$$e^{-J(Y)} = \exp\!\Big(-\sum_{Y_C \subseteq Y} J_C(Y_C)\Big) = \prod_{Y_C \subseteq Y} \exp(-J_C(Y_C))$$
This already looks like a product of localized potentials.
Mapping Cost Minimization to MAP
(continued)
• By constraining J to be a sum, we’ve reduced
our problem to the maximization of:
 exp( J
C
(YC ))
YC Y
• Since this function is strictly positive, we can
normalize to create a PDF.
$$P(Y) = \frac{1}{Z} \prod_{Y_C \subseteq Y} \exp(-J_C(Y_C))$$
• (This could be a Gibbs distribution!)
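A tiny sketch of this construction, assuming a single pairwise clique over two binary variables with invented costs:

```python
import numpy as np

# Toy version of the construction above: one pairwise clique cost
# J_C over two binary variables (values invented), turned into a
# potential exp(-J_C) and normalized into a PDF.

J_pair = np.array([[0.0, 1.0],    # J_C(y1, y2) for each joint state
                   [1.0, 0.0]])

potential = np.exp(-J_pair)       # strictly positive, so we can normalize
Z = potential.sum()               # partition function
P = potential / Z                 # P(Y) = (1/Z) exp(-J_C(Y_C))

print(P)                          # low-cost configurations get high mass
```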
Mapping Cost Minimization to MAP
(continued)
• So finding the y’s that minimize J(Y), subject to
the observations that constrain some of the y*’s,
is equivalent to finding the mode (peak) of the
distribution P(Y|Y*).
• This is just the MAP estimate of Y given Y*.
Cost Minimization to MAP on MRF
(continued)
• We have
$$P(Y) = \frac{1}{Z} \prod_{Y_C \subseteq Y} \exp(-J_C(Y_C))$$
• If we can associate each r.v. in Y to a node of a
graph G
– such that each of the YC’s is a clique in G,
– then P(Y) is a Gibbs distribution w.r.t. G.
• If P(Y) is a Gibbs distribution w.r.t. a graph G,
– then the r.v.’s Y are a Markov random field (MRF),
– (Hammersley-Clifford Theorem)
MAP on MRF to Cost Function
Minimization
• Start with the MAP problem on an MRF.
• Every MRF has a Gibbs distribution,
– also by the Hammersley-Clifford theorem.
• By reversing our steps, we will find a cost
function J(Y) whose minimization corresponds
to the MAP estimate on the MRF.
• Thus any problem we can solve by finding the
MAP estimate on an MRF, we can also solve by
minimizing some cost functional.
Our Simplified Problem (from paper)
• We have
– hidden “scene” variables: Xj
– observed “image” variables Yj
• We assume that the following graph structure is
implicit in our cost functional:
[Figure: chain MRF with hidden nodes x1 - x2 - x3 joined by pairwise potentials Ψ(x1, x2) and Ψ(x2, x3); each xj is tied to its observation yj by Φ(xj, yj)]
• The Problem:
– Given some Yj’s, estimate the Xj’s
Straightforward Exact Inference
• Given the joint PDF
– typically specified using potential functions
• We can marginalize out the other variables to
– get the a posteriori distribution for each Xj
• From this we can immediately extract the
– MAP estimate -- just the mode of the a posteriori
distribution
– Least-squares (MMSE) estimate -- just the expected value of
the a posteriori distribution
(A brute-force sketch follows.)
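For the three-node chain of the earlier slide, here is a brute-force sketch (in Python/NumPy) of exact inference; the 5-state potentials are invented for illustration.

```python
import numpy as np

# Brute-force exact inference on the 3-node chain MRF from the slides.
# phi[j] is the observation potential Phi(x_j, y_j) with y_j fixed,
# psi is the pairwise potential Psi(x_i, x_j). Values are illustrative.

S = 5                                   # each x_j takes values 0..4
rng = np.random.default_rng(0)
phi = [rng.random(S) for _ in range(3)]
psi = np.ones((S, S)) + 9 * np.eye(S)   # "tightness 10" style potential

# Joint over (x1, x2, x3), up to the constant Z:
joint = (phi[0][:, None, None] * phi[1][None, :, None] * phi[2][None, None, :]
         * psi[:, :, None] * psi[None, :, :])

post_x1 = joint.sum(axis=(1, 2))        # marginalize out x2 and x3
post_x1 /= post_x1.sum()                # a posteriori distribution of x1

print("MAP estimate:", post_x1.argmax())                  # mode
print("MMSE estimate:", (np.arange(S) * post_x1).sum())   # mean
```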
Derivation of belief propagation
[Figure: the same chain MRF as on the previous slide]
$$x_1^{MMSE} = \operatorname*{mean}_{x_1} \sum_{x_2} \sum_{x_3} P(x_1, x_2, x_3, y_1, y_2, y_3)$$
The posterior factorizes
$$x_1^{MMSE} = \operatorname*{mean}_{x_1} \sum_{x_2} \sum_{x_3} P(x_1, x_2, x_3, y_1, y_2, y_3)$$
$$= \operatorname*{mean}_{x_1} \sum_{x_2} \sum_{x_3} \Phi(x_1, y_1)\,\Phi(x_2, y_2)\,\Psi(x_1, x_2)\,\Phi(x_3, y_3)\,\Psi(x_2, x_3)$$
$$= \operatorname*{mean}_{x_1} \Phi(x_1, y_1) \sum_{x_2} \Phi(x_2, y_2)\,\Psi(x_1, x_2) \sum_{x_3} \Phi(x_3, y_3)\,\Psi(x_2, x_3)$$
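A sketch of this last line, pushing the sums inward; it uses the same invented potentials as the brute-force sketch above, so the resulting estimate matches.

```python
import numpy as np

# Same marginal as the brute-force sketch, but with the sums pushed
# inward as in the last line above. Potentials are illustrative and
# identical to the earlier sketch (same seed).

S = 5
rng = np.random.default_rng(0)
phi = [rng.random(S) for _ in range(3)]   # Phi(x_j, y_j), y_j fixed
psi = np.ones((S, S)) + 9 * np.eye(S)     # Psi(x_i, x_j)

m32 = psi @ phi[2]                 # sum_{x3} Psi(x2,x3) Phi(x3,y3)
m21 = psi @ (phi[1] * m32)         # sum_{x2} Psi(x1,x2) Phi(x2,y2) m32(x2)
post_x1 = phi[0] * m21             # proportional to the x1 marginal
post_x1 /= post_x1.sum()

# O(S^2) per sum instead of O(S^3) for the full triple product.
print("MMSE estimate:", (np.arange(S) * post_x1).sum())
```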
Propagation rules
$$x_1^{MMSE} = \operatorname*{mean}_{x_1} \Phi(x_1, y_1) \sum_{x_2} \Phi(x_2, y_2)\,\Psi(x_1, x_2) \sum_{x_3} \Phi(x_3, y_3)\,\Psi(x_2, x_3)$$
The innermost sum is the message passed from node 3 to node 2, and similarly:
$$M_{2 \to 1}(x_1) = \sum_{x_2} \Psi(x_1, x_2)\,\Phi(x_2, y_2)\,M_{3 \to 2}(x_2)$$
Belief, and message updates
$$b_j(x_j) \propto \prod_{k \in N(j)} M_{k \to j}(x_j)$$
$$M_{j \to i}(x_i) = \sum_{x_j} \Psi_{ij}(x_i, x_j) \prod_{k \in N(j) \setminus i} M_{k \to j}(x_j)$$
(Here the observation y_j can be treated as a fixed neighbor of x_j whose message is Φ(x_j, y_j).)
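A minimal sketch of these updates on a 5-node chain (a tree, so BP is exact here). Following the M₂→₁ equation above, each outgoing message folds in the sender’s local evidence Φ; all sizes and potential values are illustrative.

```python
import numpy as np

# Sketch of the message updates above on an n-node chain.
# msg[(i, j)] holds M_{i->j}; phi and psi are invented potentials.

n, S = 5, 5
rng = np.random.default_rng(1)
phi = [rng.random(S) for _ in range(n)]
psi = np.ones((S, S)) + 9 * np.eye(S)
edges = [(i, i + 1) for i in range(n - 1)]
nbrs = {j: [k for e in edges for k in e if j in e and k != j]
        for j in range(n)}

msg = {(i, j): np.ones(S) for i, j in edges}
msg.update({(j, i): np.ones(S) for i, j in edges})

for _ in range(n):                 # enough sweeps to settle on a chain
    for i, j in list(msg):
        # M_{i->j}(x_j) = sum_{x_i} Psi(x_i,x_j) Phi(x_i,y_i)
        #                 * prod_{k in N(i)\j} M_{k->i}(x_i)
        prod = phi[i].copy()
        for k in nbrs[i]:
            if k != j:
                prod *= msg[(k, i)]
        m = psi.T @ prod
        msg[(i, j)] = m / m.sum()  # normalize to avoid overflow

for j in range(n):
    b = phi[j].copy()              # belief: local evidence times
    for k in nbrs[j]:              # all incoming messages
        b *= msg[(k, j)]
    print(j, b.argmax())           # MAP estimate at each node
```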
Optimal solution in a chain or tree:
Belief Propagation
• “Do the right thing” Bayesian algorithm.
• For Gaussian random variables over time:
Kalman filter.
• For hidden Markov models: the forward/backward
algorithm (the MAP variant is the Viterbi algorithm).
No factorization with loops!
$$x_1^{MMSE} = \operatorname*{mean}_{x_1} \Phi(x_1, y_1) \sum_{x_2} \Phi(x_2, y_2)\,\Psi(x_1, x_2) \sum_{x_3} \Phi(x_3, y_3)\,\Psi(x_2, x_3)\,\Psi(x_1, x_3)$$
The extra factor Ψ(x1, x3) makes the innermost sum depend on x1 as well as x2, so the sums no longer separate into one-argument messages.
[Figure: loopy MRF, x1, x2, x3 forming a triangle, each with its observation yj]
The (Discrete) Interpolation Problem
• Used the integers {1,…,5} as the domain and
range.
• Used evidence:
[Plot: the observed evidence points on the 5 × 5 grid]
The (Discrete) Interpolation Problem
• How do we put the evidence into the MRF?
– As a prior on the random variables.
– Comes from the noise or sensor model.
• I tried two priors (each a vector of weights over the five states; see the sketch below):
– 1. Squared falloff:
• Observed 1 --> prior (25, 16, 9, 4, 1)
• Observed 3 --> prior (9, 16, 25, 16, 9)
– 2. Fourth-power falloff:
• Observed 1 --> prior (625, 256, 81, 16, 1)
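A sketch reproducing these prior vectors; the (5 − distance)^power falloff is my reading of the numbers, not something stated in the talk.

```python
import numpy as np

def prior(observed, power):
    # Distance of each state 1..5 from the observed value, mapped
    # through a (5 - d)^power falloff (my reading of the numbers).
    d = np.abs(np.arange(1, 6) - observed)
    return (5 - d) ** power

print(prior(1, 2))   # -> [25 16  9  4  1]
print(prior(3, 2))   # -> [ 9 16 25 16  9]
print(prior(1, 4))   # -> [625 256  81  16   1]
```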
The (Discrete) Interpolation Problem
• How do we specify the derivative constraint?
– We adjust the potential functions between adjacent
random variables:
• We want potential functions that look something
like:
$$\begin{pmatrix} 10 & 1 & 1 & 1 & 1 \\ 1 & 10 & 1 & 1 & 1 \\ 1 & 1 & 10 & 1 & 1 \\ 1 & 1 & 1 & 10 & 1 \\ 1 & 1 & 1 & 1 & 10 \end{pmatrix}$$
• I call the ratio “10:1” the tightness. (A sketch follows below.)
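A one-line sketch of such a potential; tightness_potential is a hypothetical helper name, not from the talk.

```python
import numpy as np

def tightness_potential(n_states, tightness):
    # tightness on the diagonal (neighbors agree), 1 everywhere else
    return np.ones((n_states, n_states)) + (tightness - 1) * np.eye(n_states)

print(tightness_potential(5, 10))   # 10 on the diagonal, 1 elsewhere
```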
Results for First Prior
Tightness = 2
Tightness = 4
Tightness = 6
[Plots: interpolation results for the first prior at each tightness setting]
Results for Second Prior
Tightness = 2
Tightness = 4
Tightness = 6
[Plots: interpolation results for the second prior at each tightness setting]
Weiss’s Examples
• Interior/exterior example.
• Motion example.
• In both examples, BBP produced much better
results and converged much faster than the
other techniques.
Conclusions: When to use BBP?
• Among problems expressible as cost
function minimization.
• Among problems expressible as MAP or MMSE
problems on MRFs:
– The graph topology should be relatively sparse.
• The number of messages per iteration increases linearly
with the number of edges.
– The r.v. distributions should have a reasonably
small number of dimensions.
• Otherwise, consider approximate inference.
EXTRA SLIDES
Slide on Weiss’s Motion Detection
Mention some approximate inference
approaches
Complexity issues with message passing
• How long are the messages?
• How many messages do we have to pass per
iteration?
• How many iterations until convergence?
• The problem quickly becomes intractable.
Slides on message passing with jointly
Gaussian distributions???
BACKUP SLIDES
Markov Random Fields
• Let G be an undirected graph
– nodes: {1, …, n}
• Associate a random variable X_t with each node t
in G.
• (X_1, …, X_n) is a Markov random field on G if
– every r.v. is independent of its non-neighbors
conditioned on its neighbors:
– P(X_t = x_t | X_s = x_s for all s ≠ t) = P(X_t = x_t | X_s
= x_s for all s ∈ N(t)),
where N(t) is the set of neighbors of node t.
Specifying a Markov Random Field
• It would be nice if we could just specify P( X | N(X) )
for all r.v.’s X (as with Bayesian networks).
• Unfortunately, this would overspecify the joint PDF.
– E.g., for two binary r.v.’s joined by an edge X_1 -- X_2:
• the joint PDF has 3 degrees of freedom,
• but the conditional PDFs X_1|X_2 and X_2|X_1 have 2
degrees of freedom each.
• The Hammersley-Clifford Theorem helps us
specify MRFs.
The Gibbs Distribution
• A Gibbs distribution w.r.t. graph G is a
probability mass function that can be expressed
in the form
– P(x_1, …, x_n) = ∏_{cliques C of G} V_C(x_1, …, x_n),
– where V_C(x_1, …, x_n) depends only on those x_i in
C.
• We can combine potential functions into
products over the maximal cliques, so
– P(x_1, …, x_n) = ∏_{maximal cliques C} V_C(x_1, …,
x_n).
– This may be better in certain circumstances because
we don’t have to specify as many potential functions.
Hammersley Clifford Theorem
• Let the r.v.’s {X_j} have a positive joint
probability mass function.
• Then the Hammersley-Clifford Theorem says
that {X_j} is a Markov random field on graph G
iff it has a Gibbs distribution w.r.t. G.
– Side note: Hammersley and Clifford discovered this theorem in
1971, but they didn’t publish it because they kept thinking they
should be able to remove or relax the positivity assumption.
They couldn’t. Clifford published the result in 1990.
• Specifying the potential functions is equivalent
to specifying the joint probability distribution of
all variables.
• Now it’s easy to specify a valid MRF
Incorporating Evidence nodes into
MRFs
• We would like to have nodes that don’t change
their beliefs -- they are just observations.
• Can we do this via the potential function on the
non-maximal clique containing just that node?
• I think this is what they do in the Yair Weiss
implementation.
• What if we don’t want to specify a potential
function? Make it identically one, since it’s in a
product. (A minimal sketch follows.)
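A trivial sketch of the identically-one trick: the constant potential is the identity in the product, so it leaves the beliefs unchanged.

```python
import numpy as np

# An observed node contributes a peaked single-node potential; a node
# with no potential specified can get the constant potential 1, which
# is the identity in the product and so has no effect on the beliefs.

phi_observed = np.array([25.0, 16.0, 9.0, 4.0, 1.0])  # e.g. "observed 1"
phi_trivial = np.ones(5)                               # identically one

belief = phi_observed * phi_trivial    # unchanged by the trivial factor
print(belief / belief.sum())
```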
From cost functional to transition
matrix
From cost functional to update rule
From update rule to transition matrix
The factorization into pairwise
potentials -- good for general Markov
networks
Other Stuff
• For shorthand, we will write x = (x_1, …, x_n).