Graphical Models
Tamara L Berg
CSE 595 Words & Pictures
Announcements
HW3 online tonight
Start thinking about project ideas
Project proposals in class Oct 30
Come to office hours Oct 23-25 to discuss ideas
Projects should be completed in groups of 3
Topics are open to your choice, but need to
involve some text and some visual analysis.
Project Types
• Implement an existing paper
• Do something new
• Do something related to your research
Data Sources
• Flickr
– API provided for search. Example code to retrieve images and associated textual meta-data available here: http://graphics.cs.cmu.edu/projects/im2gps/
• SBU Captioned Photo Dataset
– 1 million user-captioned photographs. Data links available here: http://dsl1.cewit.stonybrook.edu/~vicente/sbucaptions/
• ImageClef
– 20,000 images with associated segmentations, tags and descriptions available here: http://www.imageclef.org/
• Scrape data from your favorite web source (e.g. Wikipedia, Pinterest, Facebook)
• Rip data from a movie/TV show and find associated text scripts
• Sign language
– 6000-frame continuous signing sequence; for 296 frames, the position of the left and right upper arm, lower arm and hand were manually segmented: http://www.robots.ox.ac.uk/~vgg/data/sign_language/index.html
Generative vs Discriminative
Discriminative version – build a classifier to
discriminate between monkeys and non-monkeys.
P(monkey|image)
Generative vs Discriminative
Generative version - build a model of the joint
distribution.
P(image,monkey)
Generative vs Discriminative
Can use Bayes rule to compute P(monkey | image) if we know P(image, monkey):
P(monkey | image) = P(image, monkey) / P(image)
Generative vs Discriminative
P(monkey | image) = P(image, monkey) / P(image)
Here the joint P(image, monkey) is the generative model and the conditional P(monkey | image) is the discriminative quantity.
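A toy numeric sketch of this relationship (all probabilities below are made up, and the image is held fixed): the discriminative posterior comes out of the generative joint by normalizing over the class values.

```python
# Toy Bayes-rule illustration (hypothetical numbers, one fixed image):
# P(monkey | image) = P(image, monkey) / P(image), where P(image) sums the joint over classes.
p_joint = {"monkey": 0.03, "not_monkey": 0.07}   # made-up P(image, c) values
p_image = sum(p_joint.values())                  # P(image) = sum_c P(image, c)
p_monkey_given_image = p_joint["monkey"] / p_image
print(p_monkey_given_image)                      # ~0.3
```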
Talk Outline
1. Quick introduction to graphical models
2. Examples:
- Naïve Bayes, pLSA, LDA
Slide from Dan Klein
Random Variables
Random variables: X = {X1, X2, X3, ..., Xn}
Let xi be a realization of Xi.
A random variable is some aspect of the world about which we (may) have uncertainty.
Random variables can be:
Binary (e.g. {true, false}, {spam, ham}),
Take on a discrete set of values (e.g. {Spring, Summer, Fall, Winter}),
Or be continuous (e.g. [0, 1]).
Joint Probability Distribution
Random variables: X = {X1, X2, X3, ..., Xn}
Let xi be a realization of Xi.
Joint Probability Distribution:
P(X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn), also written p(x1, x2, x3, ..., xn)
Gives a real value for all possible assignments.
Queries
Joint Probability Distribution:
P(X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn), also written p(x1, x2, x3, ..., xn)
Given a joint distribution, we can reason about unobserved variables given observations (evidence):
P(XQ | xe1, ..., xek)
(XQ – stuff you care about; xe1, ..., xek – stuff you already know)
Representation
Joint Probability Distribution:
P(X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn), also written p(x1, x2, x3, ..., xn)
One way to represent the joint probability distribution for discrete Xi is as an n-dimensional table, each cell containing the probability for a setting of X. This table would have r^n entries if each Xi ranges over r values.
Graphical Models!
Representation
Joint Probability Distribution:
P(X1 = x1, X2 = x2, X3 = x3, ..., Xn = xn), also written p(x1, x2, x3, ..., xn)
Graphical models represent joint probability distributions more economically, using a set of "local" relationships among variables.
Graphical Models
Graphical models offer several useful properties:
1. They provide a simple way to visualize the structure of
a probabilistic model and can be used to design and
motivate new models.
2. Insights into the properties of the model, including
conditional independence properties, can be obtained
by inspection of the graph.
3. Complex computations, required to perform inference
and learning in sophisticated models, can be expressed
in terms of graphical manipulations, in which
underlying mathematical expressions are carried along
implicitly.
from Chris Bishop
Main kinds of models
• Undirected (also called Markov Random Fields) – links express constraints between variables. [graph: A -- B]
• Directed (also called Bayesian Networks) – have a notion of causality; one can regard an arc from A to B as indicating that A "causes" B. [graph: A -> B]
Directed Graphical Models
Directed Graph, G = (X, E)
Nodes: X = {X1, X2, X3, ..., Xn}
Edges: E = {(Xi, Xj) : i ≠ j}
Each node is associated with a random variable.
Definition of joint probability in a graphical model:
p(x1, ..., xn) = ∏_{i=1}^{n} p(xi | xπi)
where πi are the parents of xi.
Example
[Directed graph over X1, ..., X6 with edges X1→X2, X1→X3, X2→X4, X3→X5, X2→X6, X5→X6]
Joint Probability:
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
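A minimal sketch of this factorization in code, using binary variables and made-up conditional probability tables; only the graph structure comes from the slide.

```python
# Factorized joint for the example graph, with hypothetical CPTs over binary variables.
import itertools

p1 = {0: 0.6, 1: 0.4}                                             # p(x1)
p2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}                   # p(x2 | x1)
p3 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}                   # p(x3 | x1)
p4 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}                   # p(x4 | x2)
p5 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}                   # p(x5 | x3)
p6 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.4, 1: 0.6},         # p(x6 | x2, x5)
      (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)."""
    return (p1[x1] * p2[x1][x2] * p3[x1][x3] *
            p4[x2][x4] * p5[x3][x5] * p6[(x2, x5)][x6])

# Sanity check: the factorized joint sums to 1 over all 2^6 assignments.
total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=6))
print(round(total, 10))  # 1.0
```

Each node contributes one local factor conditioned on its parents, which is exactly the product formula above.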
Example
[Same graph, with a conditional probability table attached to each node: p(x1), p(x2 | x1), p(x3 | x1), p(x4 | x2), p(x5 | x3), p(x6 | x2, x5), each table listing probabilities for the 0/1 settings of the variable given its parents.]
Conditional Independence
Independence: p(xA, xB) = p(xA) p(xB)
Conditional independence: p(xA, xC | xB) = p(xA | xB) p(xC | xB)
Or, equivalently: p(xA | xB, xC) = p(xA | xB)
Conditional Independence
By the chain rule (using the usual arithmetic ordering), any joint distribution can be written:
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) p(x5 | x1, x2, x3, x4) p(x6 | x1, x2, x3, x4, x5)
Joint distribution from the example graph:
p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
Missing variables in the local conditional probability functions correspond to missing edges in the underlying graph. Removing an edge into node i eliminates an argument from the conditional probability factor p(xi | x1, x2, ..., x_{i-1}).
Observations
• Graphs can have observed (shaded) and unobserved nodes. If nodes are always unobserved, they are called hidden or latent variables.
• Probabilistic inference in graphical models is the
problem of computing a conditional probability
distribution over the values of some of the nodes
(the “hidden” or “unobserved” nodes), given the
values of other nodes (the “evidence” or
“observed” nodes).
Inference – computing conditional probabilities
[Example graph over X1, ..., X6 as before]
Conditional probabilities: p(x1 | x6) = p(x1, x6) / p(x6)
Marginalization:
p(x1, x6) = ∫_{x2} ∫_{x3} ∫_{x4} ∫_{x5} p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
(integrals become sums for discrete variables)
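A minimal sketch, assuming made-up binary tables on the same example graph, of computing p(x1 | x6) by brute-force marginalization over x2, ..., x5; the algorithms on the next slide exist because this enumeration grows exponentially in the number of variables.

```python
# Brute-force inference on the example graph with hypothetical binary CPTs.
import itertools

def joint(x1, x2, x3, x4, x5, x6):
    # p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5); all tables are made up
    return ([0.6, 0.4][x1] *
            [[0.7, 0.3], [0.2, 0.8]][x1][x2] *
            [[0.5, 0.5], [0.9, 0.1]][x1][x3] *
            [[0.8, 0.2], [0.3, 0.7]][x2][x4] *
            [[0.6, 0.4], [0.1, 0.9]][x3][x5] *
            [[[0.9, 0.1], [0.4, 0.6]], [[0.5, 0.5], [0.2, 0.8]]][x2][x5][x6])

def p_x1_x6(x1, x6):
    # Marginalize out x2, x3, x4, x5.
    return sum(joint(x1, x2, x3, x4, x5, x6)
               for x2, x3, x4, x5 in itertools.product([0, 1], repeat=4))

x6_obs = 1                                          # observed evidence
p_x6 = p_x1_x6(0, x6_obs) + p_x1_x6(1, x6_obs)      # p(x6)
posterior = [p_x1_x6(v, x6_obs) / p_x6 for v in (0, 1)]
print(posterior)                                    # p(x1 | x6 = 1)
```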
Inference Algorithms
• Exact algorithms
– Elimination algorithm
– Sum-product algorithm
– Junction tree algorithm
• Sampling algorithms
– Importance sampling
– Markov chain Monte Carlo
• Variational algorithms
– Mean field methods
– Sum-product algorithm and variations
– Semidefinite relaxations
Talk Outline
1. Quick introduction to graphical models
2. Examples:
- Naïve Bayes, pLSA, LDA
A Simple Example – Naïve Bayes
C – Class
F – Features
[Graph: C → F1, F2, ..., Fn]
P(C, F1, F2, ..., Fn) = P(C) ∏_i P(Fi | C)
We only specify (parameters):
P(C) – prior over class labels
P(Fi | C) – how each feature depends on the class
[Worked spam/ham Naïve Bayes example shown over several slides from Dan Klein:]
Slide from Dan Klein: percentage of documents in the training set labeled as spam/ham.
Slide from Dan Klein: in the documents labeled as spam, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
Slide from Dan Klein: in the documents labeled as ham, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
Classification
Choose the class that maximizes the joint:
P(C, W1, ..., Wn) = P(C) ∏_i P(Wi | C)
c* = argmax_{c ∈ C} P(c) ∏_i P(Wi | c)
Classification
• In practice
– Multiplying lots of small probabilities can result in floating point underflow.
– Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
– Since log is a monotonic function, the class with the highest score does not change.
– So, what we usually compute in practice is (a sketch follows below):
cMAP = argmax_{c ∈ C} [ log P(c) + Σ_i log P(Wi | c) ]
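A minimal sketch of this log-space computation; the classes, vocabulary, and probabilities below are all made up.

```python
# Log-space Naive Bayes classification: score each class by
# log P(c) + sum_i log P(w_i | c) and take the argmax.
import math

# Assumed toy parameters: class priors and per-class word probabilities.
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_p_word = {
    "spam": {"free": math.log(0.05), "meeting": math.log(0.001), "the": math.log(0.06)},
    "ham":  {"free": math.log(0.005), "meeting": math.log(0.02), "the": math.log(0.07)},
}

def classify(words):
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_p_word[c][w] for w in words)
    return max(scores, key=scores.get), scores

print(classify(["free", "the"]))      # "spam" wins under these made-up numbers
print(classify(["meeting", "the"]))   # "ham" wins
```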
Naïve Bayes for modeling
text/metadata topics
Harvesting Image Databases from the Web
Schroff, F., Criminisi, A. and Zisserman, A.
Download images from the web via a search
query (e.g. penguin).
Re-rank images using a naïve Bayes model
trained on text surrounding the images and
meta-data features (image alt tag, image title
tag, image filename).
Top-ranked images are then used to train an SVM classifier to further improve the ranking.
Results
[Image re-ranking results shown over two slides.]
Naïve Bayes on images
Visual Categorization with Bags of Keypoints
Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray
Method
Steps:
– Detect and describe image patches
– Assign patch descriptors to a set of predetermined
clusters (a vocabulary) with a vector quantization
algorithm
– Construct a bag of keypoints, which counts the
number of patches assigned to each cluster
– Apply a multi-class classifier (naïve Bayes), treating
the bag of keypoints as the feature vector, and thus
determine which category or categories to assign to
the image.
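A minimal sketch of the quantization and histogram steps above; random vectors stand in for SIFT-like patch descriptors and for a k-means vocabulary, so nothing here reproduces the paper's actual features or clustering.

```python
# Bag-of-keypoints feature vector: assign each descriptor to its nearest visual word
# and count the assignments into a histogram.
import numpy as np

rng = np.random.default_rng(0)
vocabulary = rng.normal(size=(50, 128))      # 50 visual words (e.g. k-means centers)
descriptors = rng.normal(size=(300, 128))    # descriptors detected in one image

# Vector quantization: nearest cluster center for each descriptor.
dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

# Bag of keypoints: histogram of visual-word counts, used as the image's feature vector.
bag = np.bincount(assignments, minlength=len(vocabulary))
print(bag.shape, bag.sum())                  # (50,) 300
```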
Naïve Bayes (recap)
C – Class
F – Features
[Graph: C → F1, F2, ..., Fn]
P(C, F1, F2, ..., Fn) = P(C) ∏_i P(Fi | C)
We only specify (parameters):
P(C) – prior over class labels
P(Fi | C) – how each feature depends on the class
Naive Bayes Parameters
Problem: Categorize images as one of 7 object
classes using Naïve Bayes classifier:
– Classes: object categories (face, car, bicycle, etc)
– Features – Images represented as a histogram
where bins are the cluster centers or visual word
vocabulary. Features are vocabulary counts.
P(C) treated as uniform.
P(Fi | C) learned from training data – images labeled
with category.
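A minimal sketch of this parameter setup with hypothetical counts; the Laplace (add-one) smoothing is an assumption made for the sketch, not something specified on the slide.

```python
# Estimate Naive Bayes parameters for the bag-of-keypoints setup:
# P(C) uniform, P(F_i | C) from visual-word counts of each class's training images.
import numpy as np

n_classes, vocab_size = 7, 1000
rng = np.random.default_rng(0)
# counts[c, v] = total times visual word v appears in training images of class c (made up)
counts = rng.integers(0, 20, size=(n_classes, vocab_size))

p_class = np.full(n_classes, 1.0 / n_classes)                 # uniform prior
p_word_given_class = (counts + 1) / (counts.sum(axis=1, keepdims=True) + vocab_size)

def classify(histogram):
    """histogram: visual-word counts of a test image; returns the MAP class index."""
    log_scores = np.log(p_class) + histogram @ np.log(p_word_given_class).T
    return int(np.argmax(log_scores))

print(classify(rng.integers(0, 5, size=vocab_size)))
```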
Results
Talk Outline
1. Quick introduction to graphical models
2. Examples:
- Naïve Bayes, pLSA, LDA
pLSA
pLSA
Joint Probability:
P(wi, dj, zk) = P(dj) P(zk | dj) P(wi | zk)
Marginalizing over topics determines the conditional probability:
P(wi | dj) = Σ_k P(zk | dj) P(wi | zk)
Fitting the model
Need to:
Determine the topic vectors common to all documents.
Determine the mixture components specific to each document.
Goal: a model that gives high probability to the words that appear in the corpus.
Maximum likelihood estimation of the parameters is obtained by maximizing the objective function, i.e. the likelihood of the observed words:
∏_i ∏_j P(wi | dj)^{n(wi, dj)}, where n(wi, dj) is the number of occurrences of word wi in document dj.
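A minimal sketch of fitting pLSA by EM on a toy term-document count matrix (random data and random initialization): the E-step computes P(z | d, w) and the M-step re-estimates P(w | z) and P(z | d) from the weighted counts.

```python
# pLSA fit by EM on made-up counts.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 20, 100, 4
counts = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)   # n(d, w)

p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics)         # K x V
p_z_given_d = rng.dirichlet(np.ones(n_topics), size=n_docs)          # D x K

for _ in range(50):
    # E-step: responsibilities P(z | d, w), shape D x V x K
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    resp = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate from counts weighted by the responsibilities
    weighted = counts[:, :, None] * resp                              # D x V x K
    p_w_given_z = weighted.sum(axis=0).T                              # K x V
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = weighted.sum(axis=1)                                # D x K
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

# Log-likelihood of the counts under the fitted model (should increase over EM iterations)
p_w_given_d = p_z_given_d @ p_w_given_z                               # D x V
print(np.sum(counts * np.log(p_w_given_d + 1e-12)))
```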
pLSA on images
Discovering objects and their location
in images
Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman,
William T. Freeman
Documents – Images
Words – visual words (vector quantized SIFT descriptors)
Topics – object categories
Images are modeled as a mixture of topics (objects).
Goals
They investigate three areas:
– (i) topic discovery, where categories are
discovered by pLSA clustering on all available
images.
– (ii) classification of unseen images, where topics
corresponding to object categories are learnt on
one set of images, and then used to determine
the object categories present in another set.
– (iii) object detection, where you want to
determine the location and approximate
segmentation of object(s) in each image.
(i) Topic Discovery
Most likely words for 4 learnt topics (face, motorbike, airplane, car)
(ii) Image Classification
Confusion table for unseen test images against pLSA trained on
images containing four object categories, but no background
images.
(ii) Image Classification
Confusion table for unseen test images against pLSA trained on
images containing four object categories, and background images.
Performance is not quite as good.
(iii) Topic Segmentation
P(zk | wi, dj) > 0.8
[Two further slides show example segmentation results obtained with this criterion.]
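A minimal sketch of this criterion, assuming already-fitted pLSA-style parameters with made-up values: compute P(z | w, d) ∝ P(z | d) P(w | z) for each visual word in an image and keep the words whose most probable topic exceeds 0.8.

```python
# Threshold the word-topic posterior to pick out confidently-assigned visual words.
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_words = 4, 100
p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics)   # K x V, assumed fitted
p_z_given_d = np.array([0.7, 0.2, 0.05, 0.05])                 # topic mixture of one image

word_ids = rng.integers(0, n_words, size=30)                   # visual words in the image
post = p_z_given_d[:, None] * p_w_given_z[:, word_ids]         # unnormalized P(z | w, d)
post /= post.sum(axis=0, keepdims=True)

confident = post.max(axis=0) > 0.8
print(word_ids[confident], post.argmax(axis=0)[confident])     # kept words and their topics
```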
Talk Outline
1. Quick introduction to graphical models
2. Examples:
- Naïve Bayes, pLSA, LDA
LDA
David M Blei, Andrew Y Ng, Michael Jordan
LDA
[Plate diagrams, built up over several slides, contrasting pLSA with LDA and labeling the LDA variables: Dirichlet parameter, per-document topic proportions, per-word topic assignment, observed word, and topics.]
Generating Documents
Joint Distribution
Joint distribution (standard LDA form, per document): p(θ, z, w | α, β) = p(θ | α) ∏_n p(zn | θ) p(wn | zn, β)
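A minimal sketch of the generative process just described, with toy sizes and made-up hyperparameters.

```python
# LDA generative process: draw per-document topic proportions from a Dirichlet,
# then for each word draw a topic assignment and finally a word from that topic.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 4, 50, 30
alpha = np.full(n_topics, 0.5)                               # Dirichlet parameter (assumed)
topics = rng.dirichlet(np.ones(vocab_size), size=n_topics)   # topic-word distributions

theta = rng.dirichlet(alpha)                                 # per-document topic proportions
doc = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=theta)                        # per-word topic assignment
    w = rng.choice(vocab_size, p=topics[z])                  # observed word
    doc.append(w)
print(doc)
```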
LDA on text
Topic discovery from a text corpus. Highly ranked words for 4 topics.
LDA in Animals on the Web
Tamara L Berg, David Forsyth
Animals on the Web outline:
• Harvest pictures of animals from the web using Google text search.
• Select visual exemplars using text-based information (LDA is used for this step).
• Use vision and text cues to extend to similar images.
Text Model
Latent Dirichlet Allocation (LDA) on the words in collected web pages
to discover 10 latent topics for each category.
Each topic defines a distribution over words. Select the 50 most likely
words for each topic.
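A minimal sketch of the word-selection step, assuming an already-fitted topic-word matrix; the random values and placeholder vocabulary below stand in for the model learned from the collected web pages.

```python
# Select the 50 most likely words for each of the 10 topics.
import numpy as np

rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(5000)]                    # placeholder vocabulary
topic_word = rng.dirichlet(np.ones(len(vocab)), size=10)     # 10 topics x vocab, assumed fitted

top_words = [[vocab[i] for i in np.argsort(-topic_word[k])[:50]] for k in range(10)]
print(top_words[0][:10])   # ten most likely words of the first topic
```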
Example Frog Topics:
1.) frog frogs water tree toad leopard green southern music king irish eggs folk princess river ball
range eyes game species legs golden bullfrog session head spring book deep spotted de am free
mouse information round poison yellow upon collection nature paper pond re lived center talk buy
arrow common prince
2.) frog information january links common red transparent music king water hop tree pictures pond
green people available book call press toad funny pottery toads section eggs bullet photo nature
march movies commercial november re clear eyed survey link news boston list frogs bull sites
butterfly court legs type dot blue
Select Exemplars
Rank images according to whether they have these likely words near
the image in the associated page.
Select up to 30 images per topic as exemplars.
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess river
ball range eyes game species legs golden bullfrog
session head ...
2.) frog information january links common
red transparent music king water hop tree
pictures pond green people available book
call press ...
Extensions to LDA for pictures
A Bayesian Hierarchical Model for Learning
Natural Scene Categories
Fei-Fei Li, Pietro Perona
An unsupervised approach to learn and
recognize natural scene categories. A
scene is represented by a collection of
local regions. Each region is represented
as part of a “theme” (e.g. rock, grass
etc) learned from data.
Generating Scenes
1.) Choose a category label (e.g. mountain scene).
2.) Given the mountain class, draw a probability vector that will
determine what intermediate theme(s) (grass rock etc) to
select while generating each patch of the scene.
3.) For creating each patch in the image, first determine a
particular theme out of the mixture of possible themes, and
then draw a codeword given this theme. For example, if a
“rock” theme is selected, this will in turn privilege some
codewords that occur more frequently in rocks (e.g. slanted
lines).
4.) Repeat the process of drawing both the theme and codeword
many times, eventually forming an entire bag of patches that
would construct a scene of mountains.
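A minimal sketch of this generative story; the category count, codebook size, and all distributions below are stand-ins rather than the model learned in the paper (only the 40 themes echo the next slide).

```python
# Scene generation: pick a category, draw its theme proportions, then draw a theme
# and a codeword for each patch, accumulating a bag of patches.
import numpy as np

rng = np.random.default_rng(0)
n_categories, n_themes, n_codewords, n_patches = 13, 40, 200, 100

# Hypothetical category-specific Dirichlet parameters and theme-codeword distributions.
category_alpha = rng.uniform(0.1, 1.0, size=(n_categories, n_themes))
theme_codeword = rng.dirichlet(np.ones(n_codewords), size=n_themes)

category = rng.integers(n_categories)                  # 1) choose a category label
theme_probs = rng.dirichlet(category_alpha[category])  # 2) draw theme proportions
patches = []
for _ in range(n_patches):
    theme = rng.choice(n_themes, p=theme_probs)                   # 3) draw a theme...
    codeword = rng.choice(n_codewords, p=theme_codeword[theme])   # ...then a codeword
    patches.append(codeword)                           # 4) bag of patches for the scene
print(category, patches[:10])
```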
Results – modeling themes
Left: distribution of the 40 intermediate themes.
Right: distribution of codewords, as well as the appearance of 10 codewords selected from the top 20 most likely codewords for this category model.
Results – Scene Classification
[Example images shown for each scene category, with correct and incorrect classifications marked.]