Transcript Lecture XI

CS558 COMPUTER VISION
Lecture XI: Object Recognition (2)
Slides adapted from S. Lazebnik
OUTLINE
• Object recognition: overview and history
• Object recognition: a learning-based approach
• Object recognition: bag-of-features representation
• Object recognition: classification algorithms
BAG-OF-FEATURES MODELS
ORIGIN 1: TEXTURE RECOGNITION
• Texture is characterized by the repetition of basic elements, or textons
• For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters

Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001;
Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
ORIGIN 1: TEXTURE RECOGNITION
[Figure: each texture image is represented as a histogram over a universal texton dictionary]
ORIGIN 2: BAG-OF-WORDS MODELS
• Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)
[Figure: US Presidential Speeches Tag Cloud, http://chir.ag/phernalia/preztags/]
BAG-OF-FEATURES STEPS
1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary
4. Represent images by frequencies (counts) of “visual words”
1. FEATURE EXTRACTION
• Regular grid or interest regions
[Figure: detect patches → normalize patch → compute descriptor]
Slide credit: Josef Sivic
2. LEARNING THE VISUAL VOCABULARY
[Figure: extracted feature descriptors are clustered; the cluster centers form the visual vocabulary]
Slide credit: Josef Sivic
K-MEANS CLUSTERING
• Want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

  D(X, M) = \sum_{k=1}^{K} \sum_{x_i \in \text{cluster } k} \| x_i - m_k \|^2
• Algorithm (a NumPy sketch follows):
  1. Randomly initialize K cluster centers
  2. Iterate until convergence:
     • Assign each data point to the nearest center
     • Recompute each cluster center as the mean of all points assigned to it
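A minimal NumPy sketch of the algorithm above; the function name, initialization scheme, and convergence test are illustrative choices, not part of the slides.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means: X is (n_points, dim); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # init from random data points
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels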
CLUSTERING AND VECTOR QUANTIZATION
• Clustering is a common method for learning a visual vocabulary or codebook
  • Unsupervised learning process
  • Each cluster center produced by k-means becomes a codevector
  • The codebook can be learned on a separate training set
  • Provided the training set is sufficiently representative, the codebook will be “universal”
• The codebook is used for quantizing features
  • A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook
  • Codebook = visual vocabulary
  • Codevector = visual word
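A sketch of such a vector quantizer, assuming the features and codebook are NumPy arrays:

import numpy as np

def quantize(features, codebook):
    """Map each feature vector (rows of `features`) to the index of its
    nearest codevector (rows of `codebook`), i.e. its visual word."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)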
EXAMPLE CODEBOOK
[Figure: example appearance codebook]
Source: B. Leibe
ANOTHER CODEBOOK
[Figure: another appearance codebook]
Source: B. Leibe
YET ANOTHER CODEBOOK
[Figure: codebook from Fei-Fei et al. 2005]
VISUAL VOCABULARIES: ISSUES
• How to choose vocabulary size?
  • Too small: visual words not representative of all patches
  • Too large: quantization artifacts, overfitting
• Computational efficiency
  • Vocabulary trees (Nister & Stewenius, 2006)
SPATIAL PYRAMID REPRESENTATION
• Extension of a bag of features
• Locally orderless representation at several levels of resolution (level 0, level 1, level 2)
Lazebnik, Schmid & Ponce (CVPR 2006)
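A sketch of the spatial pyramid histogram, assuming feature locations normalized to [0, 1) and precomputed visual-word indices; the level weights follow Lazebnik et al. (CVPR 2006):

import numpy as np

def spatial_pyramid(points, words, K, n_levels=3):
    """Concatenate weighted per-cell word histograms over a pyramid of grids.
    points: (n, 2) feature locations in [0, 1); words: (n,) visual-word indices."""
    L = n_levels - 1
    parts = []
    for level in range(n_levels):
        cells = 2 ** level                                   # cells x cells grid
        w = 1.0 / 2 ** L if level == 0 else 1.0 / 2 ** (L - level + 1)
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        for j in range(cells):
            for i in range(cells):
                in_cell = (cx == i) & (cy == j)
                hist = np.bincount(words[in_cell], minlength=K).astype(float)
                parts.append(w * hist)
    vec = np.concatenate(parts)
    return vec / max(vec.sum(), 1e-12)                       # normalize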
SCENE CATEGORY DATASET
[Figure: multi-class classification results, 100 training images per class]
CALTECH101 DATASET
http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[Figure: multi-class classification results, 30 training images per class]
BAGS OF FEATURES FOR ACTION RECOGNITION
• Space-time interest points
Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, IJCV 2008.
IMAGE CLASSIFICATION
• Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
OUTLINE
• Object recognition: overview and history
• Object recognition: a learning-based approach
• Object recognition: bag-of-features representation
• Object recognition: classification algorithms
CLASSIFIERS
• Learn a decision rule assigning bag-of-features representations of images to different classes
[Figure: decision boundary separating “zebra” from “non-zebra” images in feature space]
CLASSIFICATION
• Assign any input vector to one of two or more classes
• Any decision rule divides the input space into decision regions separated by decision boundaries
NEAREST NEIGHBOR CLASSIFIER
• Assign the label of the nearest training data point to each test data point
[Figure: Voronoi partitioning of feature space for two-category 2D and 3D data, from Duda et al.]
Source: D. Lowe
K-NEAREST NEIGHBORS
• For a new point, find the k closest points from a training data set
• The labels of the k points “vote” to classify the new point
• Works well provided that there is lots of data and the distance/similarity function is good
[Figure: classification of a new point with k = 5]
Source: D. Lowe
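A minimal sketch of k-NN voting, assuming Euclidean distance and non-negative integer labels:

import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """Label a point by majority vote among its k nearest training points."""
    d2 = ((train_X - x) ** 2).sum(axis=1)          # squared distances to all points
    nearest = np.argsort(d2)[:k]                   # indices of the k closest points
    return np.bincount(train_y[nearest]).argmax()  # majority label among neighbors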
FUNCTIONS FOR COMPARING HISTOGRAMS
• L1 distance:
  D(h_1, h_2) = \sum_{i=1}^{N} | h_1(i) - h_2(i) |
• χ² distance:
  D(h_1, h_2) = \sum_{i=1}^{N} \frac{(h_1(i) - h_2(i))^2}{h_1(i) + h_2(i)}
• Quadratic distance (cross-bin distance):
  D(h_1, h_2) = \sum_{i,j} A_{ij} (h_1(i) - h_2(j))^2
• Histogram intersection (similarity function):
  I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))
LINEAR CLASSIFIERS
• Find a linear function (hyperplane) to separate positive and negative examples:

  x_i positive: x_i · w + b ≥ 0
  x_i negative: x_i · w + b < 0

Which hyperplane is best?
SUPPORT VECTOR MACHINES
• Find the hyperplane that maximizes the margin between the positive and negative examples
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
SUPPORT VECTOR MACHINES
• Find a hyperplane that maximizes the margin between the positive and negative examples:

  x_i positive (y_i = 1):  x_i · w + b ≥ 1
  x_i negative (y_i = -1): x_i · w + b ≤ -1

  For support vectors, x_i · w + b = ±1

  Distance between a point and the hyperplane: |x_i · w + b| / ||w||

  Therefore, the margin is 2 / ||w||

[Figure: support vectors and margin]
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
FINDING THE MAXIMUM MARGIN HYPERPLANE
1. Maximize the margin 2 / ||w||
2. Correctly classify all training examples:

   x_i positive (y_i = 1):  x_i · w + b ≥ 1
   x_i negative (y_i = -1): x_i · w + b ≤ -1

• Quadratic programming problem:
  Minimize (1/2) wᵀw subject to y_i (w · x_i + b) ≥ 1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
FINDING THE MAXIMUM MARGIN HYPERPLANE
• Solution: w = \sum_i \alpha_i y_i x_i
  (\alpha_i: learned weight; x_i: support vector)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
FINDING THE MAXIMUM MARGIN HYPERPLANE
• Solution:
  w = \sum_i \alpha_i y_i x_i
  b = y_i - w · x_i  (for any support vector)
• Classification function (decision boundary):
  w · x + b = \sum_i \alpha_i y_i (x_i · x) + b
• Notice that it relies on an inner product between the test point x and the support vectors x_i
• Solving the optimization problem also involves computing the inner products x_i · x_j between all pairs of training points
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
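A sketch of evaluating this decision function from a solved model; the argument names are illustrative and assume the solver's outputs (support vectors, weights α_i, labels y_i, offset b) are available:

import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    """w · x + b = sum_i alpha_i y_i (x_i · x) + b, computed from support vectors."""
    return np.sum(alphas * labels * (support_vectors @ x)) + b

# A point is classified by the sign of svm_decision(x, ...).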
NONLINEAR SVMS
• Datasets that are linearly separable work out great
• But what if the dataset is just too hard?
• We can map it to a higher-dimensional space
[Figure: 1D data that is not linearly separable becomes separable after mapping x → (x, x²)]
Slide credit: Andrew Moore
NONLINEAR SVMS
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
Slide credit: Andrew Moore
NONLINEAR SVMS
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
  K(x_i, x_j) = φ(x_i) · φ(x_j)
  (To be valid, the kernel function must satisfy Mercer’s condition)
• This gives a nonlinear decision boundary in the original feature space:
  \sum_i \alpha_i y_i \varphi(x_i) \cdot \varphi(x) + b = \sum_i \alpha_i y_i K(x_i, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
NONLINEAR KERNEL: EXAMPLE
• Consider the mapping φ(x) = (x, x²):
  φ(x) · φ(y) = (x, x²) · (y, y²) = xy + x²y²
  K(x, y) = xy + x²y²  (a polynomial kernel)
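A quick numeric check that this kernel equals the inner product in the lifted space (the test values are arbitrary):

import numpy as np

phi = lambda x: np.array([x, x ** 2])        # lifting map φ(x) = (x, x²)
K = lambda x, y: x * y + x ** 2 * y ** 2     # the polynomial kernel above

x, y = 1.5, -2.0
assert np.isclose(phi(x) @ phi(y), K(x, y))  # both evaluate to the same number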
KERNELS FOR BAGS OF FEATURES
• Histogram intersection kernel:
  I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))
• Generalized Gaussian kernel:
  K(h_1, h_2) = \exp\left( -\frac{1}{A} D(h_1, h_2)^2 \right)
• D can be the L1 distance, Euclidean distance, χ² distance, etc.
• A kernel function must be a similarity function.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
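Sketches of these two kernels in NumPy; `dist` stands for any histogram distance, such as those defined earlier:

import numpy as np

def intersection_kernel(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return np.minimum(h1, h2).sum()

def generalized_gaussian_kernel(h1, h2, A, dist):
    """K(h1, h2) = exp(-D(h1, h2)^2 / A), with D given by `dist`."""
    return np.exp(-dist(h1, h2) ** 2 / A)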
SUMMARY: SVMS FOR IMAGE CLASSIFICATION
1. Pick an image representation (in our case, bag of features)
2. Choose a kernel function for that representation
3. Compute the matrix of kernel values between all pairs of training examples
4. Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
5. At test time: compute kernel values between your test example and all support vectors, and combine them with the learned weights to yield the value of the final decision function
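A sketch of steps 3-5 with a precomputed kernel matrix; the use of scikit-learn's SVC and the χ²-based generalized Gaussian kernel are illustrative choices, not prescribed by the slides:

import numpy as np
from sklearn.svm import SVC

def kernel_matrix(H1, H2, A=1.0, eps=1e-12):
    """Generalized Gaussian kernel over the chi-squared distance, row-wise."""
    d = (((H1[:, None, :] - H2[None, :, :]) ** 2)
         / (H1[:, None, :] + H2[None, :, :] + eps)).sum(axis=2)
    return np.exp(-d ** 2 / A)

# train_hists: (n_train, K) bag-of-features histograms; train_labels: (n_train,)
# clf = SVC(kernel="precomputed").fit(kernel_matrix(train_hists, train_hists), train_labels)
# At test time, kernel values between test examples and training examples:
# preds = clf.predict(kernel_matrix(test_hists, train_hists))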
WHAT ABOUT MULTI-CLASS SVMS?
• Unfortunately, there is no “definitive” multi-class SVM formulation
• In practice, we have to obtain a multi-class SVM by combining multiple binary (two-class) SVMs
• One vs. others
  • Training: learn an SVM for each class vs. the others
  • Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value
• One vs. one
  • Training: learn an SVM for each pair of classes
  • Testing: each learned SVM “votes” for a class to assign to the test example
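A minimal sketch of the one-vs-others decision rule, assuming one trained binary decision function per class:

import numpy as np

def one_vs_others_predict(x, decision_fns):
    """decision_fns: list of per-class SVM decision functions f_c(x).
    Assign the class whose SVM returns the highest decision value."""
    return int(np.argmax([f(x) for f in decision_fns]))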
SVMS: PROS AND CONS
• Pros
  • Many publicly available SVM packages: http://www.kernel-machines.org/software
  • Kernel-based framework is very powerful and flexible
  • SVMs work very well in practice, even with very small training sample sizes
• Cons
  • No “direct” multi-class SVM; must combine two-class SVMs
  • Computation and memory
    • During training, must compute a matrix of kernel values between all pairs of examples (training time ≥ O(n²))
    • Training SVMs can take a very long time for large-scale problems (kernel SVMs can typically handle ~100K training examples on a single machine)
SUMMARY: CLASSIFIERS
• Nearest-neighbor and k-nearest-neighbor classifiers
  • L1 distance, χ² distance, quadratic distance, histogram intersection
• Support vector machines
  • Linear classifiers
  • Margin maximization
  • The kernel trick
  • Kernel functions: histogram intersection, generalized Gaussian, pyramid match
  • Multi-class
• Of course, there are many other classifiers out there
  • Neural networks, boosting, decision trees, …
  • Deep neural networks, convolutional neural networks: jointly learning feature representations and classifiers
PROJECT #4
Visual Recognition Challenge:
DATA SET:
• Graz02
• Bike, Cars, Person, None (background)
TRAIN/VALIDATION/TEST
• Use the training dataset to train the classifiers for recognizing the three object categories
• Use the validation dataset to tune any parameters
• Fix the parameters you tuned and report your final classification accuracy on the test images
FEATURE EXTRACTION
• Sample fixed-size image patches on a regular grid over the image (a sketch follows this list). You can use a single fixed patch size, or several different sizes.
• Sample random-size patches at random locations.
• Sample fixed-size patches around the feature locations from, e.g., your own Harris corner detector or any other feature detector you found online. If you use code you downloaded online, please cite it clearly in your report.
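A sketch of the grid-sampling option, assuming a 2D grayscale image array; the patch size and stride are illustrative:

import numpy as np

def grid_patches(image, patch_size=16, stride=8):
    """Sample fixed-size patches on a regular grid; returns one flattened patch per row."""
    H, W = image.shape
    patches = [image[y:y + patch_size, x:x + patch_size].ravel()
               for y in range(0, H - patch_size + 1, stride)
               for x in range(0, W - patch_size + 1, stride)]
    return np.array(patches)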
FEATURE DESCRIPTION
• It is recommended that you implement the SIFT descriptor described in the lecture slides for each image patch you generate. You may also use other feature descriptors, such as raw pixel values (bias-gain normalized). Your SIFT descriptor does not need to be rotation invariant.
DICTIONARY COMPUTATION
• Run k-means clustering on all training feature descriptors to learn the dictionary. Setting K to 500-1000 should be sufficient, but you may experiment with different sizes.
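A sketch of this step; scikit-learn's KMeans is one convenient option (the project does not prescribe a particular library, and K = 800 is just a value in the suggested range):

import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(descriptor_list, K=800):
    """Stack the descriptors from all training images and cluster into K codevectors."""
    all_descriptors = np.vstack(descriptor_list)   # (total_features, dim)
    kmeans = KMeans(n_clusters=K, n_init=4).fit(all_descriptors)
    return kmeans.cluster_centers_                 # the dictionary, shape (K, dim)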
IMAGE REPRESENTATION
• Given the features extracted from any training/validation/test image, quantize each feature to its closest codevector in the dictionary. For each image, accumulate a histogram counting how many features are quantized to each code in the dictionary. This histogram is your bag-of-features image representation.
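A sketch combining quantization and histogram accumulation; normalizing by the feature count is an illustrative choice:

import numpy as np

def bag_of_features(descriptors, dictionary):
    """Quantize each descriptor to its nearest codevector and accumulate a histogram."""
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # nearest codevector per feature
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / max(hist.sum(), 1.0)             # normalized histogram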
CLASSIFIER TRAINING
• Given the histogram representation of all images, you may use a k-NN classifier or any other classifier you wish. If you want to push recognition performance, you may consider training an SVM classifier. Please feel free to use any SVM library to train your classifier; for example, the Matlab toolbox provided at http://theoval.cmp.uea.ac.uk/~gcc/svm/toolbox/.
RECOGNITION QUALITY
• After you train the classifier, please report recognition accuracy on the test image set: the fraction of images you recognize correctly. Please also generate the so-called confusion matrix to enumerate, e.g., how many images you mis-classify from one category to another. You will need to report results from both the validation and test datasets to see how well your algorithm generalizes to test data.
IMPORTANT NOTE
• Please follow the train/validation/test protocol and don't heavily tune parameters on the test dataset. You will be asked to report recognition accuracy on both the validation and test datasets. Tuning parameters on the test dataset is considered cheating and is not allowed.