Pattern Theory: the Mathematics of Perception

Download Report

Transcript Pattern Theory: the Mathematics of Perception

Pattern Theory: the Mathematics
of Perception
Prof. David Mumford
Division of Applied Mathematics
Brown University
International Congress of Mathematics
Beijing, 2002
Outline of talk
I.
Background: history, motivation, basic
definitions
II. A basic example – Hidden Markov Models and
speech; and extensions
III. The “natural degree of generality” – Markov
Random Fields; and vision applications
IV. Continuous models: image processing via
PDE’s, self-similarity of images and random
diffeomorphisms
URL: www.dam.brown.edu/people/mumford/Papers
/ICM02powerpoint.pdf or /ICM02proceedings.pdf
Some History
• Is there a mathematical theory underlying intelligence?
• 40’s – Control theory (Wiener-Pontrjagin), the output side:
driving a motor with noisy feedback in a noisy world to
achieve a given state
• 70’s – ARPA speech recognition program
• 60’s-80’s – AI, esp. medical expert systems, modal,
temporal, default and fuzzy logics and finally statistics
• 80’s-90’s – Computer vision, autonomous land vehicle
Statistics vs. Logic
• Plato: “If Theodorus, or any other geometer, were
prepared to rely on plausibility when he was doing
geometry, he'd be worth absolutely nothing.”
• Graunt – counting corpses in
medieval London
• Gauss – Gaussian distributions, least squares  relocating
lost Ceres from noisy incomplete data
• Control theory – the Kalman-Wiener-Bucy filter
• AI – Enhanced logics < Bayesian belief networks
• Vision – Boolean combinations of features < Markov
random fields
What you perceive is not what you hear:
ACTUAL SOUND
1. The ?eel is on the shoe
2. The ?eel is on the car
3. The ?eel is on the table
4. The ?eel is on the orange
1.
2.
3.
4.
PERCEIVED WORDS
The heel is on the shoe
The wheel is on the car
The meal is on the table
The peel is on the orange
(Warren & Warren, 1970)
Statistical inference is being used!
Why is this old man recognizable from a cursory glance?
His outline is lost in clutter, shadows and wrinkles;
except for one ear, his face is invisible. No known
algorithm will find him.
The Bayesian Setup, I
VARIABLES:
xo  observable variables,
xh  hidden (not directly observable) variables,
  parameters in model
MODEL:
Pr( xo | xh , ).Pr( xh |  ) or Pr( xh , )
ML PARAMETER ESTIMATION:
arg max  Pr( xˆ ( ) |  ) or

arg max   Pr( xˆo( ) , xh |  )

xh
The Bayesian Setup, II
BAYES’S RULE:
Pr( xˆo | xh , ).Pr( xh |  )
Pr( xh | xˆo , ) 
Pr( xˆo |  )
 Pr( xˆo | xh , ).Pr( xh |  )
This is called the “posterior” distribution on xh
•Sampling Pr(xo,xh|), “synthesis” is the acid test
of the model
The central problem of Statistical learning theory:
• The complexity of the model and the Bias-Variance dilemma
* Minimum Description Length—MDL,
* Vapnik’s VC dimension
A basic example: HMM’s and speech recognition
I. Setup
sk = sound in window around time kDt are the observables
xk = part of phoneme being spoken at time kDt are the hidden vars.
Pr( x , s )   p1 ( xk | xk 1 ). p2 ( sk | xk )
k
Sk-1
sk
Sk+1
xk+1
xk-1
xk
’s = log’s of table of values of p1, p2, so
1 k  k .Ek ( x , s )
Pr( x , s )  e
Z
A basic example: HMM’s and speech recognition
II. Inference by dynamic programming:
(a)
 p (x
1
Pr( xk | sˆ k ) 
k
| xk 1 ). p2 ( sˆk | xk ).Pr( xk 1 | sˆ k 1 )
xk 1
 numerator
xk
(b) Call maxp( xk )  max xk 1 Pr( xk , x k 1 , sˆ k ),
then maxp( xk )  max xk 1  p1 ( xk | xk 1 ). p2 ( sˆk | xk ).maxp( xk 1 ) 
(c) Optimizing the ’s done by “EM” algorithm,
valid for any exponential model
Continuous and discrete variables in
perception
• Perception locks on to discrete labels, and the
world is made up of discrete objects/events
• Make an empirical histogram of changes x and
compute the kurtosis:
k =Exp(( x  x )4 ) /  4
• High kurtosis is nature’s universal signal of
discrete events/objects in space-time.
• Stochastic process with i.i.d. increments has
jumps iff the kurtosis k of its increments is > 3.
A typical stochastic process with jumps
Xt stochastic process with independent increments, then
z  X t 1  X t
e
 z2 / 2
2  X t Brownian, hence continuous
z  X t 1  X t  e z  X t has jumps:
Ex.: daily log-price changes in a sample of
stocks
Note fat power law tails
N.B. vertical axis is log of probability
Particle filtering
• Compiling full conditional probability tables is
usually impractical.
•Use a weighted sample:
Pr( xk | sˆk )
weak
w
i ,k
 x ( xk )
i ,k
i
• Bootstrap particle filtering:
(a) Sample with replacement
(b) Diffuse via p1
(c) Reweight via p2
• Tracking application
(from A.Blake, M.Isard):
Estimating the posterior distribution on
optical flow in a movie (from M.Black)
Horizontal flow
(follow window in red)
Horizontal flow
Horizontal flow
Horizontal flow
Horizontal flow
Horizontal flow
No process is truly Markov
• Speech has longer range patterns than phonemes:
triphones, words, sentences, speech acts, …
• PCFG’s = “probabilistic context free grammars” = almost
surely finite, labeled, random branching processes:
Forest of random trees Tn, labels xv on vertices, leaves
in 1:1 corresp with observations sm, prob. p1(xvk|xv) on
children, p2(sm|xm) on observations.).
• Unfortunate fact: nature is not so obliging, longer range
constraints force context-sensitive grammars. But how to
make these stochastic??
Grammar in the parsed speech of Helen,
a 2 ½ year old
Grammar in images (G. Kanisza):
contour completion
Markov Random Fields:
the natural degree of generality
• Time linear structure of
dependencies
• space/space-time/abstract
situations  general
graphical structure of
dependencies
G  (V , E ), a graph,
 xv vV , the variables,
EC ( xC )
1 
Pr( xV )  e C
,
Z
C the cliques of G
The Markov property: xv, xw are conditionally independent,
given xS , if S separates v,w in G.
Hammersley-Clifford: the converse.
A simple MRF: the Ising model
1
Pr( x , s ) 
e
ZT

  c.


 ( x s )
2

sk+1,l-1
sk,l-1
sk-1,l-1
sk-1,l
, x  1, s 
sk+1,l+1
sk,l+1
xk+1,l
xk,l
xk-1,l
 ,
sk-1,l+1
xk+1,l-1
xk-1,l-1

sk+1,l
sk,l
xk,l-1


x x  T

adj.

xk+1,l+1
xk,l+1
xk-1,l+1
The Ising model and image segmentation
A state-of-the-art image segmentation
algorithm (S.-C. Zhu)
Input
Segmentation
Synthesis from model
I ~ p( I | W*)
Hidden variables describe segments and their texture,
allowing both slow and abrupt intensity and texture changes
(See also Shi-Malik)
Texture synthesis via MRF’s
On left: a cheetah hide;
In middle, a sample from the Gaussian model
with identical second order statistics;
On right, a sample from exponential model
reproducing 7 filter marginals using:
k (( Fk  I )( x ))
1  
Pr( I )  e k 
Z
Monte Carlo Markov Chains
Basic idea: use artificial thermal dynamics to find
minimum energy (=maximum probability) states
Bayesian belief propagation and
the Bethe approximation
• Can find modes of MRF’s on trees using dynamic programming
•When the graph G is not a tree, use its universal covering graph G
•The Bethe approximation =
the 1(G)-invariant MRF on G ‘closest’ to the given MRF,
‘closest’ = minimizing Kullback-Liebler information distance
•‘Bayesian belief propagation’ = finding the modes of the Bethe
approximation with dynamic programming
Continuous models I:
deblurring and denoising
• Observe noisy, blurred image I,
seek to remove noise, enhance edges simultaneously!

min  c1  ( I  J ) 2 dxdy  c2  J
D ,
 D
D 
p

dxdy  c3  

• p=2 is Mumford-Shah model;
• p=1, c3=0 is Osher-Rudin ‘TV’ model
 J
J
 div 
 c  J
t


   I  J 
p


• p=2 is Perona-Malik equation; c=0, p=1 is TV-gradient descent
An example: Bela Bartok enhanced via the
Nitzberg-Shiota filter
Continuous models II: images and scaling
• The statistics of images of
‘natural scenes’ appear to be a
fixed point under blockaveraging renormalization, i.e.
• Assume NN images of
natural scenes have a certain
probability distribution; form
N/2N/2 images by a window
or by 22 averages – get the
same marginal distribution!
Scale invariance has many implications:
•Power law for the spectrum:
Exp Iˆ( , )  c  2   2 
2
• In the continuous limit, images are not locally
integrable functions but generalized functions in:
H 

• Intuitively, this is what we call ‘clutter’ – the
mathematical explanation of why vision is hard
Three axioms for natural images
1. Scale invariance
2. For all zero-mean filters F, the scalar random
variables F  I (i, j ) have kurtosis > 3
(D.Field, J.Huang).
3. Local image patches are dominated by ‘preferred
geometries’: edges, bars, blobs as well as ‘blue sky’
blank patches (D.Marr, B.Julesz, A.Lee).
It is not known if these axioms can be exactly satisfied!
Empirical data on image filter responses
Probability distributions of 1 and 2
filters, estimated from natural image
data.
a) Top plot is for values of horizontal
first difference of pixel values; middle
plot is for random 0-mean 8x8 filters.
Vertical axis in top 2 plots is
log(prob.density).
b) Bottom plot shows level curves of
Joint prob.density of vert.differences
at two horizontally adjacent pixels.
All are highly non-Gaussian!
Mathematical models for random images
I ( x, y )   k (e rk x  xk , e rk y  yk ),
k
( xk , yk , rk ) a Poisson process in
 k  independent samples from
3
,
prob.space of normalized wavelets.
(Random wavelet model)
I ( x, y )   k ( x , y ) (e k ( x , y ) x  xk ( x , y ) , e k ( x , y ) y  yk ( x , y ) ),
r

r

k ( x, y )  min k (erk x  xk , erk y  yk )  supp( k )
(“Dead leaves” model)
Continuous models III:
random diffeomorphisms
• The patterns of the world include shapes, structures which
recur with distortions: e.g. alphanumeric characters,
faces, anatomy
• Thus the hidden variables must include (i) clusters of
similar shapes, (ii) warpings between shapes in a
cluster
• Mathematically: need a metric on (i) the space of
diffeomorphisms Gk of k, or (ii) the space of
“shapes” Sk in k (open subsets with smooth bdry)
Can use diffusion to define a probability measure on Gk .
Metrics on Gk, I
V.Arnold:
Introduce a Riemannian metric in SGk,
the group of volume preserving diffeos.
For any path {t}, let
length of path  


k

 t 1
vt ( x ) dx dt , vt ( x) 
 t ( x) 

t
2
Then he proved geodesics are solutions of Euler’s equation:
vt
 (vt .)vt  p
t
Metrics on Gk, II
Christensen, Rabbitt, Miller – on Gk, use stronger metric:

k
Lvt , vt dx, L  ( I  D)m
vt = velocity, ut = Lvt = momentum in this metric
Geodesics now are solutions to a regularized compressible
form of Euler’s equation:
ut
 (vt .)ut  div(vt )ut   (ut )i ((vt )i )
t
i
Note: linear in u, so u can be a generalized function!
Geodesics in the quotient space S2
Get geodesics {Ht} on Sk by taking singular momentum:
ut ( x )  a(t , x )nx , Ht  Ht ( x )
S2 has remarkable structure:
(A geodesic, F.Beg)
1. ‘Weak’ Hilbert manifold
2. ‘Medial axis’ gives it a cell decomposition
3. Geometric heat equation defines a
deformation retraction
4. Diffusion defines probability measure
(Dupuis-Grenander-Miller, Yip)
Geodesics in the quotient space of ‘landmark
points’ gives a classical mechanical system
(Younes)
Consider geodesics whose momentum has finite support:
N
ut ( x )   ui (t ) Pi (t ) ( x )
i 1
This gives the ODE in which particles traveling in
the same (resp. opposite) direction attract (resp. repel):


dPi
 2 K Pi  Pj u j
dt
j


dui
   Pi K Pi  Pj .(ui .u j )
dt
j
K the Green’s function of L, a Bessel function.
Outlook for Pattern Theory
Finding a rich class of stochastic models adequate
for duplicating human perception yet tractable
(vision remains a major challenge)
 Finding algorithms fast enough to make inferences
with these models (Monte Carlo? BBP ?
competing hypothesis particles?)
 Underpinnings for a better biological theory of
neural functioning e.g. incorporating particle
filtering? grammar? warping? feedback?
URL: www.dam.brown.edu/people/mumford/Papers
/ICM02powerpoint.pdf or /ICM02proceedings.pdf
A sample of Graunt’s data