A Bayesian Hierarchical Model for Learning Natural Scene


A Bayesian Hierarchical Model
for Learning Natural Scene
Categories
L. Fei-Fei and P. Perona. CVPR 2005
Discovering objects and their
location in images
J. Sivic, B. Russell, A. Efros, A. Zisserman and B. Freeman. ICCV 2005
Tomasz Malisiewicz
[email protected]
Advanced Machine Perception
February 2006
Graphical Models: Recent Trend in
Machine Learning
Describing Visual Scenes using
Transformed Dirichlet Processes.
E. Sudderth, A. Torralba, W. Freeman,
and A. Willsky. NIPS, Dec. 2005.
Outline
Goals of both vision papers
Techniques from statistical text modeling
- pLSA vs LDA
Scene Classification via LDA
Object Discovery via pLSA
Goal: Learn and Recognize Natural
Scene Categories
Classify a scene without first extracting
objects
Other techniques we know of:
- Global frequency (Oliva and Torralba)
- Texton histogram (Renninger, Malik et al.)
Goal: Discover Object Categories
 Discover what objects are present in a collection
of images in an unsupervised way
 Find those same objects in novel images
 Determine what local image features correspond
to what objects; segmenting the image
Enter the world of Statistical Text Modeling
D. Blei, A. Ng, and M. Jordan. Latent
Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, January
2003.
Bag-of-words approaches: the order of
words in a document can be neglected
Graphical Model Fun
Bag-of-words
A document is a collection of M words
A corpus (collection of documents) is
summarized in a term-document matrix
[Figure: an object and its representation as a bag of visual 'words']
1990: Latent Semantic Analysis (LSA)
Goal: map high-dimensional count vectors
to a lower dimensional representation to
reveal semantic relations between words
The lower dimensional space is called the
latent semantic space
Dim( latent space ) = K
1990: Latent Semantic Analysis (LSA)
D = {d1, …, dN} documents, W = {w1, …, wM} words, Nij = #(di, wj)
The N x M co-occurrence term-document matrix (documents x words) is mapped
into the latent space by decomposing it into three factors:
  [N x M, documents x words] ≈ [N x K, documents x topics]
                             x [K x K, topics x topics]
                             x [K x M, topics x words]
What did we just do?
Singular Value Decomposition
  N = U S V^T
  U: N x K (documents x topics), S: K x K diagonal matrix of singular values
  (topics x topics), V^T: K x M (topics x words)
LSA summary
SVD on term-document matrix
Approximate N by setting all but the
largest K singular values in S to zero
Produces rank-K optimal approximation to
N in the L2-matrix or Frobenius norm
sense
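To make the LSA summary concrete, here is a minimal sketch (not from any of the papers) of a rank-K approximation of a document-term count matrix via truncated SVD in NumPy; the toy matrix and variable names are illustrative only.

```python
# A minimal LSA sketch: rank-K approximation of a document-term count
# matrix via truncated SVD, using NumPy only.
import numpy as np

def lsa(counts, K):
    """counts: N x M document-term matrix, K: size of the latent space."""
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    # Keep only the K largest singular values (the rest are thresholded to zero).
    U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]
    doc_coords = U_k * s_k          # N x K: documents in latent space
    word_coords = Vt_k              # K x M: words in latent space
    approx = doc_coords @ Vt_k      # rank-K optimal approximation of counts
    return doc_coords, word_coords, approx

# Toy usage: 4 documents over a 6-word vocabulary, 2 latent topics.
N = np.array([[3, 1, 0, 0, 2, 0],
              [2, 2, 0, 1, 1, 0],
              [0, 0, 4, 3, 0, 1],
              [0, 1, 3, 2, 0, 2]], dtype=float)
docs, words, N_hat = lsa(N, K=2)
```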
LSA and Polysemy
Polysemy: the ambiguity of an individual word or phrase that can be used (in
different contexts) to express two or more different meanings
Under the LSA model, the coordinates of a word in latent space can be written
as a linear superposition of the coordinates of the documents that contain
the word
According to this superposition principle, LSA is unable to capture multiple
senses of a word
Problems with LSA
LSA does not define a properly normalized
probability distribution
No obvious interpretation of the directions
in the latent space
Statistically, the use of the L2 norm in
LSA corresponds to a Gaussian error
assumption, which is hard to justify in the
context of count variables
Polysemy problem
pLSA to the rescue
Probabilistic Latent Semantic Analysis
pLSA relies on the likelihood function of
multinomial sampling and aims at an
explicit maximization of the predictive
power of the model
pLSA to the rescue
Decomposition into Probabilities!
  P(wi | dj) = Σ k=1..K P(wi | zk) P(zk | dj)
Observed word distributions = word distributions per topic, mixed according
to the topic distributions per document
Slide credit: Josef Sivic
Learning the pLSA parameters
Nij = #(di, wj): observed counts of
word i in document j
Unlike LSA, pLSA does not minimize any type of ‘squared deviation.’
The parameters are estimated in a probabilistically sound way.
Maximize likelihood of data using EM.
Minimize KL divergence between empirical
distribution and model
Slide credit: Josef Sivic
EM for pLSA (training on a corpus)
E-step: compute posterior probabilities for
the latent variables
M-step: maximize the expected complete
data log-likelihood
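As a rough illustration of these two steps, here is a hedged NumPy sketch of pLSA training with EM; it is not the authors' implementation, and the array names (n for the count matrix, p_z_d for P(z|d), p_w_z for P(w|z)) are my own.

```python
# A minimal pLSA EM sketch (illustrative, not the papers' code).
# n is the N x M matrix of observed counts n(d_i, w_j); K is the number of topics.
import numpy as np

def plsa_em(n, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, M = n.shape
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(iters):
        # E-step: posterior P(z | d, w) for every (d, w) pair, shape N x M x K.
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / joint.sum(2, keepdims=True).clip(1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
        expected = n[:, :, None] * post                  # n(d,w) * P(z|d,w)
        p_w_z = expected.sum(0).T                        # sum over documents
        p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
        p_z_d = expected.sum(1)                          # sum over words
        p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
    return p_z_d, p_w_z
```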
Graphical View of pLSA
pLSA is a generative model
[Plate diagram: d and w are observed variables, z is a latent variable;
plates denote replication over documents and words]
Select a document di with prob P(di)
Pick latent class zk with prob P(zk|di)
Generate word wj with prob P(wj|zk)
How does pLSA deal with previously
unseen documents?
“Folding-in” Heuristic
First train on the corpus to obtain the
topic-specific word distributions P(w|z)
Now re-run the same training EM algorithm,
but don't re-estimate P(w|z); only the mixing
proportions P(z|d) are updated, with
D = {dunseen}
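A minimal sketch of folding-in, reusing the conventions of the pLSA EM sketch above (the function name fold_in and the array names are assumptions, not from the papers): P(w|z) from training is held fixed, and EM only updates the topic mixture of the unseen document.

```python
# Hedged sketch of the "folding-in" heuristic: keep P(w|z) fixed,
# re-estimate only P(z | d_unseen) for the new document.
import numpy as np

def fold_in(n_new, p_w_z, iters=50, seed=0):
    """n_new: length-M count vector of the unseen document; p_w_z: fixed K x M."""
    rng = np.random.default_rng(seed)
    K = p_w_z.shape[0]
    p_z_d = rng.random(K); p_z_d /= p_z_d.sum()          # P(z | d_unseen)
    for _ in range(iters):
        # E-step: P(z | d_unseen, w) for each word, shape M x K.
        joint = p_w_z.T * p_z_d
        post = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M-step: update only the document-specific mixing weights.
        p_z_d = (n_new[:, None] * post).sum(0)
        p_z_d /= p_z_d.sum()
    return p_z_d
```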
Problems with pLSA
Not a well-defined generative model of
documents; d is a dummy index into the
list of documents in the training set (as
many values as documents)
No natural way to assign probability to a
previously unseen document
Number of parameters to be estimated
grows with size of training set
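A rough parameter count makes the last point concrete (a back-of-the-envelope sketch following the argument in the LDA paper, with K topics, M vocabulary words, and N training documents):

```latex
% pLSA fits a topic mixture for every training document:
%   K(M-1) word-per-topic parameters + N(K-1) topic-per-document parameters.
% LDA replaces the N per-document mixtures with one K-dimensional Dirichlet prior:
%   K(M-1) word-per-topic parameters + K Dirichlet parameters.
\#\mathrm{params}_{\text{pLSA}} \approx KM + NK \quad (\text{grows with } N),
\qquad
\#\mathrm{params}_{\text{LDA}} \approx KM + K \quad (\text{independent of } N).
```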
LDA to the rescue
 Latent Dirichlet Allocation treats the topic mixture
weights as a k-parameter hidden random
variable and places a Dirichlet prior on the
multinomial mixing weights
 Dirichlet distribution is conjugate to the
multinomial distribution (most natural prior to
choose: the posterior distribution is also a
Dirichlet!)
[Side-by-side graphical models of pLSA and LDA]
Corpus-Level parameters in LDA
Alpha and beta are corpus-level parameters that
are sampled once in the process of generating the
corpus (they sit outside of the plates!)
 Alpha and beta must be estimated before we
can find the topic mixing proportions belonging
to a previously unseen document
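As a hedged sketch of this generative story (illustrative sizes and randomly chosen parameter values, not anything estimated in the papers), the corpus-level alpha and beta sit outside the loop and every document gets its own topic mixture theta:

```python
# Hedged sketch of LDA as a generative model (illustrative parameters).
# alpha and beta are corpus-level quantities, fixed outside the plates.
import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 3, 20, 15              # topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)                # Dirichlet prior on topic mixtures
beta = rng.dirichlet(np.ones(V), K)    # K x V topic-word distributions

def generate_document():
    theta = rng.dirichlet(alpha)       # per-document topic mixture ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)     # topic assignment for this word
        w = rng.choice(V, p=beta[z])   # word drawn from that topic
        words.append(w)
    return words

corpus = [generate_document() for _ in range(5)]
```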
LDA
Getting rid of plates
[Un-plated LDA graphical model: α generates a topic mixture θ for each
document; every word wn in a document gets a topic assignment zn drawn from
that document's θ, and wn is drawn from the corresponding topic in β]
Thanks to Jonathan Huang for the un-plated LDA graphic
Inference in LDA
Inference = estimation of document-level
parameters
Intractable to compute → must employ
approximate inference
Approximate Inference in LDA
Variational Methods: use Jensen's inequality to obtain a lower bound on the
log likelihood that is indexed by a set of variational parameters
Optimal variational parameters (document-specific) are obtained by minimizing
the KL divergence between the variational distribution and the true posterior
Variational methods are one way of doing approximate inference; Gibbs
sampling (MCMC) is another
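For readers who just want to try this, a convenient off-the-shelf variational implementation is scikit-learn's LatentDirichletAllocation (batch or online variational Bayes); the toy count matrix below is made up, and this is not the code used in either paper.

```python
# Quick way to experiment with variational inference for LDA today.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (rows = documents, columns = vocabulary).
X = np.array([[3, 1, 0, 0, 2, 0],
              [2, 2, 0, 1, 1, 0],
              [0, 0, 4, 3, 0, 1],
              [0, 1, 3, 2, 0, 2]])

lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
doc_topic = lda.fit_transform(X)       # per-document topic proportions
topic_word = lda.components_           # unnormalized topic-word weights
topic_word = topic_word / topic_word.sum(axis=1, keepdims=True)  # P(w|z)
```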
Look at some P(w|z) produced by LDA
Show some pLSI and LDA results applied
to text
An LDA project by Tomasz Malisiewicz
and Jonathan Huang
Search for the word ‘drive’
pLSA and LDA applied to Images
How can one apply these techniques to
the images?
Hierarchical Bayesian
text models
Probabilistic Latent Semantic Analysis (pLSA)
[pLSA graphical model: plates over D documents and N words; d → z → w]
Hofmann, 2001
Latent Dirichlet Allocation (LDA)
[LDA graphical model: plates over D documents and N words; c, θ → z → w]
Blei et al., 2001
Hierarchical Bayesian
text models
Probabilistic Latent Semantic Analysis (pLSA)
[pLSA graphical model applied to images: plates over D documents and N words;
d → z → w; topics correspond to objects such as "face"]
Sivic et al. ICCV 2005
Hierarchical Bayesian
text models
Latent Dirichlet Allocation (LDA)
[LDA graphical model applied to images: plates over D documents and N words;
c, θ → z → w; scene categories such as "beach"]
Fei-Fei et al. CVPR 2005
A Bayesian Hierarchical Model for Learning Natural
Scene Categories
Flow Chart: Quick Overview
How to Generate an Image?
Choose a scene (mountain, beach, …)
Given scene generate an intermediate
probability vector over ‘themes’
For each word:
Determine current theme from mixture
of themes
Draw a codeword from that theme
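Here is a hedged sketch of that generative story in NumPy; the scene names, the randomly initialized class-to-theme and theme-to-codeword distributions, and the patch count are illustrative stand-ins for quantities the model would actually learn.

```python
# Hedged sketch of the generative story on this slide: choose a scene class,
# draw a mixture over themes for that class, then draw a codeword per patch.
import numpy as np

rng = np.random.default_rng(0)
scenes = ["mountain", "beach", "forest"]
n_themes, n_codewords, n_patches = 4, 174, 50   # 174-codeword codebook, as in the paper

# Class-conditional Dirichlet parameters over themes, and theme-to-codeword
# distributions (both would be learned; random here for illustration).
class_alpha = rng.uniform(0.5, 2.0, size=(len(scenes), n_themes))
theme_codeword = rng.dirichlet(np.ones(n_codewords), n_themes)

def generate_image(scene_id):
    pi = rng.dirichlet(class_alpha[scene_id])     # theme mixture for this image
    patches = []
    for _ in range(n_patches):
        theme = rng.choice(n_themes, p=pi)        # current theme for this patch
        codeword = rng.choice(n_codewords, p=theme_codeword[theme])
        patches.append(codeword)
    return patches

image = generate_image(scenes.index("beach"))
```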
How to Generate an Image?
Inference
How to make decision on a novel image
Integrate over the latent variables (theme
mixture and per-patch theme assignments) to
get the likelihood of the image under each
scene category
Approximate Variational Inference (not
easy, but Gibbs sampling is supposed to
be easier)
Codebook
174 codewords (vector-quantized local image patches)
 Detection:
Evenly Sampled Grid
Random Sampling
Saliency Detector
Lowe’s DoG Detector
 Representation:
Normalized 11x11 gray values
128-dim SIFT
Results: Average performance 64%
Confusion Matrix
100 training examples and 50 test examples
Rank statistic test: the probability of a test scene correctly
belonging to one of the top N most probable categories
Results: The Distributions
[Figures: theme distribution and codeword distribution;
note the peak at codeword 174]
Summary of detection and representation
choices
SIFT outperforms pixel gray values
Sliding grid, which creates the largest
number of patches, does best
Discovering objects and their location in
images
Visual Words
 Vector Quantized SIFT descriptors computed in
regions
 Regions come from elliptical shape adaptation
around interest point, and from the maximally
stable regions of Matas et al.
 Both are elliptical regions at twice their detected
scale
Building a Vocabulary
…
Building a Vocabulary
K-means clustering of 300K regions
to get about 1K clusters for each of
Shape Adapted and Maximally Stable
regions
…
Vector quantization
Slide credit: Josef Sivic
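A minimal sketch of this vocabulary-building step (vector quantization of descriptors with k-means); the random descriptors stand in for the 128-dim SIFT descriptors, and MiniBatchKMeans is used here only because it scales toward the ~300K regions mentioned above; it is not the authors' clustering code.

```python
# Hedged sketch of vocabulary building by k-means vector quantization.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
descriptors = rng.random((30_000, 128))           # stand-in SIFT descriptors
                                                  # (the paper clusters ~300K regions)

# Cluster into ~1K visual words (done separately for Shape Adapted and
# Maximally Stable regions in the paper; one vocabulary shown here).
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=0)
kmeans.fit(descriptors)

# Vector quantization: map each new descriptor to its nearest cluster center.
new_descriptors = rng.random((500, 128))
visual_words = kmeans.predict(new_descriptors)    # indices into the vocabulary
```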
pLSA Training
Sanity Check: Remember what quantities
must be estimated?
Results #1: Topic Discovery
This is just the training stage
4 object categories
Plus background
Obtain P(zk|dj) for each image, then
classify image as containing object k
according to the max of P(zk|dj) over k
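That classification rule is a one-liner; in the toy snippet below, p_z_d stands for the P(zk|dj) matrix that would come out of pLSA training (for example, the earlier EM sketch).

```python
# Classification rule from this slide: label each image with its dominant topic.
import numpy as np

p_z_d = np.array([[0.7, 0.1, 0.1, 0.1],     # toy P(z_k | d_j) rows, one per image
                  [0.2, 0.5, 0.2, 0.1]])
predicted_topic = p_z_d.argmax(axis=1)      # object category k maximizing P(z_k | d_j)
```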
Results #1: Topic Discovery
Results #2: Classifying New Images
Object Categories learned on a corpus,
then object categories found in new image
Anybody remember how this is done?
Remember the index d in
the graphical model
How does pLSA deal with previously
unseen documents?
“Folding-in” Heuristic
First train on the corpus to obtain the
topic-specific word distributions P(w|z)
Now re-run the same training EM algorithm,
but don't re-estimate P(w|z); only the mixing
proportions P(z|d) are updated, with
D = {dunseen}
Results #2: Classifying New Images
Train on one set and test on another
Results #3: Segmentation
Localization and Segmentation of Object
For a word occurrence in a particular
document we can examine the probability
of different topics
Find words with P(zk|dj,wi) > .8
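A hedged sketch of that segmentation rule, using the same array conventions as the earlier pLSA sketch (the function name and threshold handling are mine): compute P(z|d,w), which is proportional to P(w|z)P(z|d), for each visual word and keep the words whose posterior on the object topic exceeds 0.8.

```python
# Hedged sketch: for each visual word occurrence in image d, compute the topic
# posterior P(z | d, w) and keep words confidently assigned to the object topic.
import numpy as np

def confident_words(p_w_z, p_z_d_j, topic_k, threshold=0.8):
    """p_w_z: K x M, p_z_d_j: length-K mixture for one image, topic_k: object topic."""
    joint = p_w_z.T * p_z_d_j                          # M x K, proportional to P(z|d,w)
    post = joint / joint.sum(axis=1, keepdims=True).clip(1e-12)
    return np.where(post[:, topic_k] > threshold)[0]   # indices of confident visual words
```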
Results #3: Segmentation
Note: words shown are not the most probable words
for a topic, but instead they are words that have a high
probability of occurring in a topic AND high probability of
occurring in the image
Results #3: Segmentation and Doublets
 Two class image dataset consisting of half the faces
(218 images) and backgrounds (217 images)
 A 4 topic pLSA model is learned for all training faces and
training backgrounds with 3 fixed background topics, i.e.
one (face) topic is learned in addition to the three fixed
background topics
 A doublet vocabulary is then formed from the top 100
visual words of the face topic. A second 4 topic pLSA
model is then learned for the combined vocabulary of
singlets and doublets with the background topics fixed.
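One plausible way to form such doublets (a hedged sketch, not the authors' exact neighborhood definition): pair each occurrence of a top-100 face-topic word with its spatially nearest neighbor from the same set within the image.

```python
# Hedged sketch of doublet formation: pair each top-word occurrence with its
# spatially nearest neighbor among the other top-word occurrences in the image.
import numpy as np

def doublets(word_ids, positions, top_words):
    """word_ids: per-region visual word index; positions: per-region (x, y) array."""
    keep = np.isin(word_ids, top_words)
    ids, pos = word_ids[keep], positions[keep]
    if len(ids) < 2:
        return []
    pairs = []
    for i in range(len(ids)):
        d = np.linalg.norm(pos - pos[i], axis=1)
        d[i] = np.inf                                   # exclude the region itself
        j = d.argmin()
        pairs.append(tuple(sorted((ids[i], ids[j]))))   # unordered word pair = doublet
    return pairs
```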
Doublets
Face segmentation scores: singletons 0.49, doublets 0.61
Efros: didn’t work as much as you’d think
Conclusions
Showed how both papers use bag-of-words approaches
We’re now ready to become experts on
generative models like pLSA and LDA
Graphical Model Fun! (Carlos Guestrin
teaches Graphical Models)
Are you really into Graphical Models?
 Describing Visual Scenes using Transformed Dirichlet Processes. E.
Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
References
A Bayesian Hierarchical Model for Learning Natural Scene Categories. L. Fei-Fei and P. Perona. CVPR 2005.
Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS 2005.
Discovering objects and their location in images. J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. ICCV 2005.
Latent Dirichlet Allocation. D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993–1022, 2003.
Unsupervised Learning by Probabilistic Latent Semantic Analysis. T. Hofmann. Machine Learning, 2001.