Introduction to Machine Learning for Category Representation


Classification 2:
discriminative models
Jakob Verbeek
January 14, 2011
SVM slides stolen from Svetlana Lazebnik
Course website:
http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php
Plan for the course
• Session 4, December 17 2010
  – Jakob Verbeek: The EM algorithm, and Fisher vector image representation
  – Cordelia Schmid: Bag-of-features models for category-level classification
  – Student presentation 2: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
• Session 5, January 7 2011
  – Jakob Verbeek: Classification 1: generative and non-parametric methods
  – Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
  – Cordelia Schmid: Category level localization: Sliding window and shape model
  – Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
• Session 6, January 14 2011
  – Jakob Verbeek: Classification 2: discriminative models
  – Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
  – Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.
Example of classification
[Figure: training images from six categories — apple, pear, tomato, cow, dog, horse — and a set of unlabeled test images]
• Given: training images x and their categories y
• What are the categories of these test images?
Generative vs Discriminative Classification
• Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted as y.
• Goal is to predict the class label y for a given test input x.
• Generative probabilistic methods
  – Model the density of inputs x from each class: p(x|y)
  – Estimate the class prior probability p(y)
  – Use Bayes’ rule to infer the distribution over classes given the input:
    p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}, \qquad p(x) = \sum_y p(y)\, p(x|y)
• Discriminative (probabilistic) methods
  – Directly estimate the class probability given the input: p(y|x)
  – Some methods have no probabilistic interpretation, but fit a function f(x) and assign classes based on sign(f(x))
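To make the generative route concrete, here is a minimal Python sketch (hypothetical 1-D data, with Gaussian class-conditional densities chosen purely for illustration) of Bayes' rule turning p(x|y) and p(y) into p(y|x):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D training data for two classes (illustration only).
x_pos = np.array([2.1, 2.5, 3.0, 2.8])     # class y = +1
x_neg = np.array([-1.0, -0.5, 0.2, -0.8])  # class y = -1

# Generative model: fit p(x|y) for each class and the prior p(y).
mu_pos, sd_pos = x_pos.mean(), x_pos.std()
mu_neg, sd_neg = x_neg.mean(), x_neg.std()
prior_pos = len(x_pos) / (len(x_pos) + len(x_neg))
prior_neg = 1.0 - prior_pos

def posterior_pos(x):
    """Bayes' rule: p(y=+1|x) = p(x|y=+1) p(y=+1) / p(x)."""
    joint_pos = norm.pdf(x, mu_pos, sd_pos) * prior_pos
    joint_neg = norm.pdf(x, mu_neg, sd_neg) * prior_neg
    return joint_pos / (joint_pos + joint_neg)   # p(x) = sum over classes

print(posterior_pos(2.0), posterior_pos(-0.2))
```

A discriminative method would instead model p(y|x) directly, without ever estimating the class densities p(x|y).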
Discriminant function
1. Choose a class of decision functions in feature space.
2. Estimate the function parameters from the training set.
3. Classify a new pattern on the basis of this decision rule.
• kNN example from last week: needs to store all the data
• Separation using a smooth curve: only need to store the curve parameters
Linear classifiers
• Decision function is linear in the features:
  f(x) = w^T x + w_0 = w_0 + \sum_{d=1}^{D} w_d x_d
• Classification is based on the sign of f(x)
• Orientation is determined by w (surface normal)
• Offset from origin is determined by w_0
• Decision surface is a (D-1)-dimensional hyperplane orthogonal to w, given by
  f(x) = w^T x + w_0 = 0
[Figure: hyperplane f(x) = 0 with normal vector w]
Linear classifiers
• Decision function is linear in the features:
  f(x) = w^T x + w_0 = w_0 + \sum_{d=1}^{D} w_d x_d
• Classification is based on the sign of f(x)
• Orientation is determined by w (surface normal)
• Offset from origin is determined by w_0
• Decision surface is a (D-1)-dimensional hyperplane orthogonal to w, given by
  f(x) = w^T x + w_0 = 0
[Figure: hyperplane f(x) = 0 with normal vector w]
• What happens in 3d with w = (1, 0, 0) and w_0 = -1?
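A minimal sketch of sign-based linear classification (numpy, hypothetical numbers), which also illustrates the question above: with w = (1, 0, 0) and w_0 = -1 the decision surface in 3-D is the plane x_1 = 1:

```python
import numpy as np

def linear_classify(X, w, w0):
    """Assign +1/-1 by the sign of f(x) = w^T x + w0."""
    return np.sign(X @ w + w0)

# The question from the slide: in 3-D, w = (1, 0, 0), w0 = -1
# gives the plane x1 = 1 as decision surface.
w, w0 = np.array([1.0, 0.0, 0.0]), -1.0
X = np.array([[2.0, 5.0, -3.0],   # x1 > 1  ->  +1
              [0.5, 1.0,  7.0]])  # x1 < 1  ->  -1
print(linear_classify(X, w, w0))  # [ 1. -1.]
```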
Dealing with more than two classes
• First (bad) idea: construction from multiple binary classifiers
  – Learn the 2-class “base” classifiers independently
  – One vs rest classifiers: train 1 vs (2 & 3), and 2 vs (1 & 3), and 3 vs (1 & 2)
  – One vs one classifiers: train 1 vs 2, and 2 vs 3, and 1 vs 3
• Not clear what should be done in some regions
  – One vs rest: region claimed by several classes
  – One vs one: no agreement
Dealing with more than two classes
• Alternative: define a separate linear score function for each class
  f_k(x) = w_k^T x + w_{k0}
• Assign a sample to the class of the function with maximum value
  y = \arg\max_k f_k(x)
• Decision regions are convex in this case
  – If two points fall in the region, then so do all points on the connecting line
• What shape do the separation surfaces have?
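A minimal sketch of this per-class score / argmax rule (hypothetical weights, numpy):

```python
import numpy as np

def multiclass_linear_predict(X, W, w0):
    """One linear score per class: f_k(x) = w_k^T x + w_k0; pick the argmax."""
    scores = X @ W.T + w0          # shape (n_samples, n_classes)
    return np.argmax(scores, axis=1)

# Hypothetical weights for 3 classes in 2-D (for illustration only).
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
X = np.array([[2.0, 0.0], [-2.0, 2.0], [0.0, -3.0]])
print(multiclass_linear_predict(X, W, w0))   # [0 1 2]
```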
Logistic discriminant for two classes
• Map the linear score function to class probabilities with the sigmoid function:
  f(x) = w_0 + w^T x
  p(y = +1 | x) = \sigma(f(x)), \qquad \sigma(z) = \frac{1}{1 + \exp(-z)}
  p(y = -1 | x) = 1 - p(y = +1 | x) = \sigma(-f(x))
• The class boundary is obtained for p(y|x) = 1/2, thus by setting the linear function in the exponent to zero
[Figure: the sigmoid σ(z); a 2-D example with level sets f(x) = +5, f(x) = 0 where p(y|x) = 1/2, f(x) = -5, and normal vector w]
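A minimal sketch of the two-class logistic discriminant prediction (hypothetical weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_pos(x, w, w0):
    """p(y=+1|x) = sigmoid(w0 + w^T x); p(y=-1|x) = 1 - p(y=+1|x) = sigmoid(-f(x))."""
    return sigmoid(w0 + x @ w)

w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])            # f(x) = 0.5 + 2 - 3 = -0.5
print(p_pos(x, w, w0))              # ~0.38, below 1/2, so predict class -1
```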
Multi-class logistic discriminant
• Map K linear score functions (one for each class) to K class probabilities using the “soft-max” function:
  f_k(x) = w_{k0} + w_k^T x
  p(y = i | x) = \frac{\exp(f_i(x))}{\sum_{k=1}^{K} \exp(f_k(x))}
• The class probability estimates are non-negative, and sum to one.
• Probability for the most likely class increases quickly with the difference in the linear score functions
• For any given pair of classes we find that they are equally likely on a hyperplane in the feature space
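A minimal soft-max sketch (hypothetical weights for K = 3 classes):

```python
import numpy as np

def softmax_probs(x, W, w0):
    """p(y=i|x) = exp(f_i(x)) / sum_k exp(f_k(x)) with f_k(x) = w_k0 + w_k^T x."""
    scores = w0 + W @ x
    scores -= scores.max()              # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # hypothetical weights, K = 3
w0 = np.zeros(3)
p = softmax_probs(np.array([2.0, 1.0]), W, w0)
print(p, p.sum())    # non-negative probabilities that sum to one
```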
Parameter estimation for logistic discriminant
• Maximize the (log-)likelihood of predicting the correct class label for the training data, i.e. the sum of log-likelihoods over all training data:
  L = \sum_{n=1}^{N} \log p(y_n | x_n)
• The derivatives of the log-likelihood have an intuitive interpretation
  ([y_n = c] is the indicator function: 1 if y_n = c, else 0)
  \frac{\partial L}{\partial w_{0c}} = \sum_n \left( [y_n = c] - p(y = c | x_n) \right)
  → the expected number of points from each class should equal the actual number.
  \frac{\partial L}{\partial w_c} = \sum_n \left( [y_n = c] - p(y = c | x_n) \right) x_n = \sum_n a_n x_n
  → the expected value of each feature, weighting points by p(y|x), should equal the empirical expectation.
• No closed-form solution, use gradient-descent methods
  – Note 1: the log-likelihood is concave in the parameters, hence no local optima
  – Note 2: w is a linear combination of the data points
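A minimal sketch of such a gradient-based fit (plain batch gradient steps on the log-likelihood, toy data; any off-the-shelf optimizer would do equally well):

```python
import numpy as np

def fit_logistic(X, y, n_classes, lr=0.1, n_iters=500):
    """Gradient ascent on the log-likelihood of a multi-class logistic discriminant."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    w0 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                 # one-hot targets: [y_n = c]
    for _ in range(n_iters):
        scores = X @ W.T + w0
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)    # p(y=c|x_n)
        # Gradients from the slide: sum_n ([y_n=c] - p(y=c|x_n)) times x_n (or 1)
        G = Y - P
        W += lr * (G.T @ X) / n
        w0 += lr * G.sum(axis=0) / n
    return W, w0

# Toy data (illustration only): two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W, w0 = fit_logistic(X, y, n_classes=2)
pred = np.argmax(X @ W.T + w0, axis=1)
print("training accuracy:", (pred == y).mean())
```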
Support Vector Machines
• Find a linear function (hyperplane) to separate positive and negative examples:
  x_i positive: x_i \cdot w + b > 0
  x_i negative: x_i \cdot w + b < 0
• Which hyperplane is best?
Support vector machines
• Find hyperplane that maximizes the margin
between the positive and negative examples
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples:
  x_i positive (y_i = +1): x_i \cdot w + b \ge 1
  x_i negative (y_i = -1): x_i \cdot w + b \le -1
• For support vectors, x_i \cdot w + b = \pm 1
• Distance between a point and the hyperplane: \frac{|x_i \cdot w + b|}{\|w\|}
• Therefore, the margin is 2 / \|w\|
[Figure: separating hyperplane with margin 2/||w||; the support vectors lie on the margin boundaries]
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin hyperplane
1. Maximize the margin 2 / \|w\|
2. Correctly classify all training data:
   x_i positive (y_i = +1): x_i \cdot w + b \ge 1
   x_i negative (y_i = -1): x_i \cdot w + b \le -1
• Quadratic optimization problem:
  Minimize \frac{1}{2} w^T w subject to y_i (w \cdot x_i + b) \ge 1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin hyperplane
• Solution: w = \sum_i \alpha_i y_i x_i
  (\alpha_i: learned weight; x_i: support vector)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Finding the maximum margin hyperplane
• Solution:
  w = \sum_i \alpha_i y_i x_i
  b = y_i - w \cdot x_i for any support vector
• Classification function (decision boundary):
  w \cdot x + b = \sum_i \alpha_i y_i \, x_i \cdot x + b
• Notice that it relies on an inner product between the test point x and the support vectors x_i
• Solving the optimization problem also involves computing the inner products x_i \cdot x_j between all pairs of training points
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
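A minimal sketch of this solution structure using scikit-learn (toy separable data; a large C value is used here to approximate the hard-margin problem):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Large C approximates the hard-margin problem on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# sklearn stores alpha_i * y_i for the support vectors in dual_coef_,
# so w = sum_i alpha_i y_i x_i can be reconstructed directly:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))            # True
print(clf.coef_, clf.intercept_)            # learned w and b
print(clf.decision_function([[1.0, 1.0]]))  # value of w.x + b for a test point
```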
Summary Linear discriminant analysis
• Two most widely used linear classifiers in practice:
  – Logistic discriminant (supports more than 2 classes directly)
  – Support vector machines (multi-class extensions recently developed)
• In both cases the weight vector w is a linear combination of the data points:
  w = \sum_i \alpha_i x_i
• This means that we only need the inner products between data points to calculate the linear functions:
  f(x) = w^T x + w_0 = w_0 + \sum_{n=1}^{N} \alpha_n x_n^T x = w_0 + \sum_{n=1}^{N} \alpha_n k(x_n, x)
  – The function k that computes the inner products is called a kernel function
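A small sketch of this equivalence (hypothetical data and dual weights alpha_n; the kernel here is the plain inner product):

```python
import numpy as np

rng = np.random.default_rng(1)
Xtrain = rng.normal(size=(5, 3))          # hypothetical training points
alpha = rng.normal(size=5)                # hypothetical dual weights alpha_n
w0 = 0.2
x = rng.normal(size=3)                    # a test point

# Primal view: build w explicitly as a linear combination of the data ...
w = (alpha[:, None] * Xtrain).sum(axis=0)
f_primal = w0 + w @ x

# ... kernel view: never form w, only inner products k(x_n, x) = x_n^T x.
k = Xtrain @ x
f_kernel = w0 + alpha @ k

print(np.isclose(f_primal, f_kernel))     # True
```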
Nonlinear SVMs
• Datasets that are linearly separable work out great:
  [Figure: 1-D points around 0, separable by a single threshold]
• But what if the dataset is just too hard?
  [Figure: 1-D points with one class surrounded by the other; no threshold separates them]
• We can map it to a higher-dimensional space:
  [Figure: the same points mapped to (x, x²), where a line now separates the classes]
Slide credit: Andrew Moore
Nonlinear SVMs
• General idea: the original input space can
always be mapped to some higher-dimensional
feature space where the training set is
separable:
Φ: x → φ(x)
Slide credit: Andrew Moore
Nonlinear SVMs
• The kernel trick: instead of explicitly computing
the lifting transformation φ(x), define a kernel
function K such that
K(xi ,xj) = φ(xi ) · φ(xj)
  (to be valid, the kernel function must satisfy Mercer’s condition)
• This gives a nonlinear decision boundary in the original feature space:
  \sum_i \alpha_i y_i K(x_i, x) + b
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining
and Knowledge Discovery, 1998
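A minimal sketch of the kernel trick on the 1-D example from the earlier slide, using scikit-learn's polynomial kernel (the parameter choices are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# The 1-D example from the slide: one class sits between the other,
# so no linear separator exists in the original space.
X = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

# A degree-2 polynomial kernel (x.x' + 1)^2 implicitly uses the lifting
# x -> (x, x^2) without ever computing phi(x) explicitly.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e3).fit(X, y)

# Decision function from the slide: sum_i alpha_i y_i K(x_i, x) + b,
# evaluated only over the support vectors (alpha_i = 0 elsewhere).
X_test = np.array([[-2.2], [0.1], [2.7]])
K = (clf.support_vectors_ @ X_test.T + 1.0) ** 2
scores = clf.dual_coef_ @ K + clf.intercept_
print(np.sign(scores))                                      # nonlinear boundary: [ 1 -1  1]
print(np.allclose(scores, clf.decision_function(X_test)))   # True
```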
Kernels for bags of features
• Histogram intersection kernel:
  I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))
• Generalized Gaussian kernel:
  K(h_1, h_2) = \exp\left(-\frac{1}{A} D(h_1, h_2)^2\right)
• D can be Euclidean distance, χ2 distance, Earth
Mover’s Distance, etc.
J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study, IJCV 2007
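A minimal sketch of these two kernels (the χ² distance is used for D here, and A is a free scale parameter; both are illustrative choices):

```python
import numpy as np

def intersection_kernel(h1, h2):
    """Histogram intersection: I(h1, h2) = sum_i min(h1(i), h2(i))."""
    return np.minimum(h1, h2).sum()

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-squared distance between two (normalized) histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def generalized_gaussian_kernel(h1, h2, A=1.0):
    """K(h1, h2) = exp(-(1/A) D(h1, h2)^2), with D chosen as the chi-squared
    distance here; A is often set from the distances over the training set."""
    return np.exp(-chi2_distance(h1, h2) ** 2 / A)

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(intersection_kernel(h1, h2), generalized_gaussian_kernel(h1, h2))
```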
Summary: SVMs for image classification
1. Pick an image representation (in our case, bag of features)
2. Pick a kernel function for that representation
3. Compute the matrix of kernel values between every pair of training examples
4. Feed the kernel matrix into your favorite SVM solver to obtain support vectors and weights
5. At test time: compute kernel values for your test example and each support vector, and combine them with the learned weights to get the value of the decision function
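A minimal end-to-end sketch of steps 3-5 with scikit-learn's precomputed-kernel interface (random histograms and labels stand in for real bag-of-features data):

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel_matrix(A, B, gamma=1.0, eps=1e-10):
    """Kernel matrix K[i, j] = exp(-gamma * chi2(A[i], B[j])) between two sets
    of bag-of-features histograms (one common choice, used here for illustration)."""
    K = np.zeros((len(A), len(B)))
    for i, a in enumerate(A):
        K[i] = np.exp(-gamma * 0.5 * np.sum((a - B) ** 2 / (a + B + eps), axis=1))
    return K

rng = np.random.default_rng(0)
train_hists = rng.dirichlet(np.ones(50), size=20)   # 20 hypothetical BoF histograms
train_labels = rng.integers(0, 2, size=20)
test_hists = rng.dirichlet(np.ones(50), size=5)

# Steps 3-5 of the recipe: kernel matrix -> SVM solver -> test-time kernel values.
clf = SVC(kernel="precomputed").fit(chi2_kernel_matrix(train_hists, train_hists),
                                    train_labels)
print(clf.predict(chi2_kernel_matrix(test_hists, train_hists)))
```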
SVMs vs Logistic discriminants
• Multi-class SVMs can be obtained in a similar way as for logistic discriminants
  – Separate linear score function for each class
• Kernels can also be used for non-linear logistic discriminants
  – Requires storing the support vectors, which may cost a lot of memory in practice
  – Computing the kernel between a new data point and the support vectors may be computationally expensive
• For the logistic discriminant to work well in practice with a small number of examples, a regularizer needs to be added to achieve large margins
• Kernel functions can be applied to virtually any linear data processing technique
  – Principal component analysis, k-means clustering, …