Introduction to Machine Learning for Category Representation

Jakob Verbeek
November 27, 2009
Many slides adapted from S. Lazebnik
Plan for this course
1) Introduction to machine learning
2) Clustering techniques
– k-means, Gaussian mixture density
3) Gaussian mixture density continued
– Parameter estimation with EM, Fisher kernels
4) Classification techniques 1
– Introduction, generative methods, semi-supervised
5) Classification techniques 2
– Discriminative methods, kernels
6) Decomposition of images
– Topic models, …
What is machine learning?
• According to Wikipedia
– “Learning is acquiring new knowledge, behaviors, skills, values,
preferences or understanding, and may involve synthesizing
different types of information. The ability to learn is possessed by
humans, animals and some machines. Progress over time tends
to follow learning curves.”
– “Machine learning is a scientific discipline that is concerned
with the design and development of algorithms that allow
computers to change behavior based on data, such as from
sensor data or databases. A major focus of machine learning
research is to automatically learn to recognize complex patterns
and make intelligent decisions based on data. Hence, machine
learning is closely related to fields such as statistics, probability
theory, data mining, pattern recognition, artificial intelligence,
adaptive control, and theoretical computer science.”
Why machine learning?
• Extract knowledge/information from past experience/data
• Use this knowledge/information to analyze new experiences/data
• Designing rules to deal with new data by hand can be difficult
– How to write a program to detect a cat in an image?
• Collecting data can be easier
– Find images with cats, and ones without them
• Use machine learning to automatically find such rules.
• Goal of this course: introduction to machine learning techniques
used in current object recognition systems.
Steps in machine learning
• Data collection
– “training data”, optionally with “labels” provided by a “teacher”.
• Representation
– how the data are encoded into “features” when presented to learning algorithm.
• Modeling
– choose the class of models that the learning algorithm will choose from.
• Estimation
– find the model that best explains the data: simple and fits well.
• Validation
– evaluate the learned model and compare to solution found using other model
classes.
• Apply learned model to new “test” data
Data Representation
• Important issue when using learning techniques
• Different types of representations
– Vectorial, graphs, …
– Homogeneous or heterogeneous, e.g. Images + text
• Choice of representation may impact the choice of learning
algorithm.
• Domain knowledge can help to design or select good features.
– The ultimate feature would solve the learning problem…
• Automatic methods for choosing features are known as “feature selection” methods
Probability & Statistics in Learning
• Many learning methods formulated as a probabilistic model of data
– Can deal with uncertainty in the data
– Missing values for some data can be handled
– Provides a unified framework to combine many different models for
different types of data
• Statistics are used to analyze the behavior of learning algorithms
– Does the learning algorithm recover the underlying model given enough
data: “consistency”
– How fast does it do so: rate of convergence
• Common important assumption
– Training data sampled from the true data distribution
– The test data is sampled from the same distribution
Different forms of learning
• Supervised
– Classification
– Regression
• Unsupervised
– Clustering
– Dimension reduction
– Topic models
– Density estimation
• Semi-supervised
– Combine labeled data with unlabeled data
• Active learning
– Determine the most useful data to label next
• Many other forms…
Supervised learning
• Training data provided as pairs (x,y)
• The goal is to predict an “output” y from an “input” x
• Output y for each input x is the “supervision” that is given
to the learning algorithm.
– Often obtained by manual “annotation” of the inputs x
– Can be costly to do
• Most common examples
– Classification
– Regression
Classification
• Training data consists of “inputs”, denoted x, and corresponding output
“class labels”, denoted as y.
• Goal is to correctly predict for a test data input the corresponding class
label.
• Learn a “classifier” f(x) from the input data that outputs the class label or a
probability over the class labels.
• Example:
– Input: image
– Output: category label, e.g. “cat” vs. “no cat”
• Classification can be binary (two classes), or over a larger number of
classes (multi-class).
– In binary classification we often refer to one class as “positive”, and the other as
“negative”
• A binary classifier creates boundaries in the input space between the areas
assigned to each class
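As a rough illustration of a classifier f(x), here is a minimal sketch of a 1-nearest-neighbor rule. The choice of nearest neighbor, the 2-D feature vectors, and the function name `nearest_neighbor_classify` are assumptions made for this example and are not taken from the slides.

```python
import math

def nearest_neighbor_classify(train_x, train_y, x):
    """Return the label of the training input closest to x (Euclidean distance)."""
    best_index = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best_index]

# Hypothetical 2-D feature vectors standing in for image features.
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2)]
train_y = ["no cat", "no cat", "cat", "cat"]

print(nearest_neighbor_classify(train_x, train_y, (0.8, 0.9)))  # cat
```

The decision boundary here is implicit: every point of the input space is assigned to the class of its closest training example.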
Example of classification
Given: training images and their categories
What are the categories
of these test images?
Regression
• Similar to classification, but output y has the form of one or more
real numbers.
• Goal is to predict for input x an output f(x) that is close to the true y.
• Learn a continuous function
• A “loss” function, or “error” function, measures how well a certain
function f is doing
– In classification we want to minimize the number of errors using a 0/1 loss:
correct or not
– In regression we minimize a graded loss function: the loss is bigger as f(x)
is further from the correct y.
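The two kinds of loss can be sketched as follows. The squared error is used here as one common example of a graded loss; the slides do not name a specific one, so that choice is an assumption.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified examples (average 0/1 loss)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def squared_loss(y_true, y_pred):
    """Mean squared error: grows as predictions move away from the true y."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

classification_error = zero_one_loss(["cat", "no cat", "cat"], ["cat", "cat", "cat"])
regression_error = squared_loss([1.0, 2.0], [1.5, 4.0])
print(classification_error, regression_error)
```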
Example of regression
• Suppose we want to predict gas mileage of a car
based on some characteristics: number of
cylinders or doors, weight, horsepower, year etc.
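A minimal sketch of such a regression, assuming a single input feature (weight) and a straight-line model fitted by ordinary least squares. The numbers are made up for illustration and are not real car data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a * x + b with a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# Made-up numbers for illustration only: weight (tonnes) vs. mileage (mpg).
weights = [1.0, 1.5, 2.0, 2.5]
mileage = [40.0, 33.0, 27.0, 20.0]

a, b = fit_line(weights, mileage)

def predict_mileage(weight):
    """Predicted mileage f(x) for a car of the given weight."""
    return a * weight + b

print(predict_mileage(1.8))
```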
Regression: example 2
• Training set: faces (represented as vectors of distances between
keypoints) together with experimentally obtained attractiveness
rankings
• Learn: function to reproduce attractiveness ranking based on training
inputs and outputs
Attractiveness score f(v)
Vector of distances v
T. Leyvand, D. Cohen-Or, G. Dror, and D. Lischinski, Data-driven enhancement of facial
attractiveness, SIGGRAPH 2008
Other forms of supervised learning
• Structured prediction tasks: predict several
interdependent output variables
[Figure: panels labeled “Image” and “Word”]
Structured Prediction
• Estimation of body poses
• Data association problem: assigning edges to the body parts of the
model
Source: D. Ramanan
Other supervised learning scenarios
• Learning similarity functions from relations between
multiple input objects
Pairwise constraints
Source: X. Sui, K. Grauman
Learning face similarities
• Training data: pairs of faces labeled as same/different
• Similarity measure should ignore: pose, expression, …
• Face identification: are these faces of the same person?
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Unsupervised learning
• Input data x given without desired output variables y.
• Goal is to learn something about the “structure” of the data
• Examples include
– Clustering
– Dimensionality reduction
– Topic models
– Density estimation
• Not always clear how to measure success of unsupervised learning
– Probabilistic models can be evaluated by computing likelihood assigned
to other data sampled from the same distribution
– Clustering can be evaluated by clustering labeled data and measuring how
well the clusters correspond to the classes, although the classes may not
define the most apparent clusters
– Dimensionality reduction can be evaluated by reconstruction errors
Clustering
• Finding a group structure in the data
– Data in one cluster similar to each other
– Data in different clusters dissimilar
• Map each data point to a discrete cluster index
– “flat” methods find k groups (k known, or automatically set)
– “hierarchical” methods define a tree structure over the data
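A “flat” method with known k can be sketched with k-means (Lloyd's algorithm). The initial centers are passed in by hand here to keep the example deterministic; how to initialise them in practice is left open.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a center update."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
assignment, centers = kmeans(points, centers=[points[0], points[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```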
Clustering example
• Learn face similarity from training pairs labeled
as same/different
• Cluster faces based on identity
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Dimension reduction
• Finding a lower dimensional representation of the data
– Useful for compression, visualization, noise reduction
• Unlike regression: target values not given
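One standard way to find a lower dimensional representation is to project onto the direction of largest variance (principal component analysis). The slides do not name a method, so PCA is an assumption here. The sketch reduces 2-D points to 1-D codes and finds the direction by power iteration on the covariance matrix.

```python
import math

def first_principal_direction(points, iterations=100):
    """Leading eigenvector of the 2x2 covariance matrix via power iteration."""
    n = len(points)
    mean = [sum(p[d] for p in points) / n for d in range(2)]
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of the covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return mean, v

def project(points, mean, v):
    """1-D coordinate of each point along direction v."""
    return [(p[0] - mean[0]) * v[0] + (p[1] - mean[1]) * v[1] for p in points]

points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
mean, direction = first_principal_direction(points)
codes = project(points, mean, direction)
print(direction, codes)
```

No target values are used: the direction is chosen only from the spread of the inputs themselves.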
Topic models
• Decompose images or texts into groups of regions or
words that often co-occur (topics)
Topic models for images
• Decompose each image into small set of visual topics
• Spatial coherence enforced by Markov Random Field
• Training images labeled with category (topic) names
• Learning algorithm assigns pixels to categories (topics)
• Test images do not have any labels
[Verbeek & Triggs, CVPR’07]
Density estimation
• Fit probability density on the training data
– Can be combination of discrete and continuous data
– Good fit: high likelihood on training data
– Smooth function: generalizes to new data
• Can be used to detect anomalies
• Many forms of unsupervised
learning can be understood as
doing density estimation
– Type of model differs though
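A minimal sketch of density estimation used for anomaly detection, assuming a single 1-D Gaussian as the density model. The measurements and the log-density threshold are made up for this example.

```python
import math

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var

def log_density(x, mean, var):
    """Log of the Gaussian density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

train = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]  # made-up measurements
mean, var = fit_gaussian(train)

def is_anomaly(x, threshold=-5.0):
    """Flag points whose log-density falls below a hand-picked threshold."""
    return log_density(x, mean, var) < threshold

print(is_anomaly(5.1), is_anomaly(7.0))
```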
Semi-supervised learning
• Learn from supervised and unsupervised data
– Labeled data often expensive to obtain
– Unlabeled data often cheap to obtain
• Why should this work?
– Unsupervised data used to learn about distribution on inputs x
– Supervised data used to learn about input x given output y
Example of semi-supervised learning
• Classification of newsgroup articles into 20 different classes: politics,
sports, education,…
• Use EM to iteratively estimate class label of unlabeled data and
update the model
• Helps when few labeled examples are available
$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$
[Nigam et al., Machine Learning,
Vol. 39, pp 103—134, 2000]
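The EM idea can be sketched on a toy problem. This is not the text classifier of Nigam et al.: it assumes two classes with unit-variance 1-D Gaussian class densities, and the data are made up. The E-step applies Bayes' rule to the unlabeled points; the M-step updates the model from labeled and unlabeled data together.

```python
import math

def gaussian(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def posterior(x, means, priors):
    """Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)."""
    joint = [gaussian(x, m) * p for m, p in zip(means, priors)]
    evidence = sum(joint)  # p(x)
    return [j / evidence for j in joint]

def semi_supervised_em(labeled, unlabeled, iterations=20):
    """EM for two classes; labeled points keep fixed one-hot responsibilities."""
    # Initialise the class means from the labeled examples only.
    means = [
        sum(x for x, y in labeled if y == k) / sum(1 for _, y in labeled if y == k)
        for k in (0, 1)
    ]
    priors = [0.5, 0.5]
    xs = [x for x, _ in labeled] + list(unlabeled)
    for _ in range(iterations):
        # E-step: hard labels for labeled data, soft labels for unlabeled data.
        resp = [[1.0 if y == k else 0.0 for k in (0, 1)] for _, y in labeled]
        resp += [posterior(x, means, priors) for x in unlabeled]
        # M-step: re-estimate class priors and means from all data.
        totals = [sum(r[k] for r in resp) for k in (0, 1)]
        priors = [t / len(xs) for t in totals]
        means = [sum(r[k] * x for r, x in zip(resp, xs)) / totals[k] for k in (0, 1)]
    return means, priors

labeled = [(-1.0, 0), (1.5, 1)]  # one labeled example per class
unlabeled = [-2.2, -1.9, -2.0, -1.8, 2.1, 1.9, 2.0, 2.2]
means, priors = semi_supervised_em(labeled, unlabeled)
print(means, priors)
```

With only one labeled example per class, the initial means are -1.0 and 1.5; the unlabeled points pull them towards the centers of the two groups.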
Active learning
• The learning algorithm can choose its own training examples, or ask
a “teacher” for an answer on selected inputs
– Labeling of most uncertain images
– Labeling of images that maximally reduce uncertainty in model parameters
S. Vijayanarasimhan and K. Grauman, “Cost-Sensitive Active Visual Category Learning,” 2009
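“Labeling of most uncertain images” can be sketched as uncertainty sampling for a binary classifier: ask the teacher about the input whose predicted probability is closest to 0.5. The logistic `toy_model` and the pool of 1-D inputs are hypothetical stand-ins, and this is not the cost-sensitive method of the cited paper.

```python
import math

def most_uncertain(candidates, predict_proba):
    """Return the candidate whose positive-class probability is closest to 0.5."""
    return min(candidates, key=lambda x: abs(predict_proba(x) - 0.5))

def toy_model(x):
    """Hypothetical classifier: logistic function of a 1-D input."""
    return 1.0 / (1.0 + math.exp(-x))

pool = [-3.0, -0.2, 1.5, 4.0]  # unlabeled inputs
query = most_uncertain(pool, toy_model)
print(query)  # -0.2
```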
Generalization
• The ultimate goal is to do as well as possible on new,
unseen data (a test set), but we only have access to
labels (“ground truth”) for the training set
• What makes generalization possible?
• Inductive bias: set of assumptions a learner uses to
predict the target value for previously unseen inputs
– This is the same as modeling or choosing a target hypothesis
class
• Types of inductive bias
– Occam’s razor
– Similarity/continuity bias: similar inputs should have similar
outputs
– …
Achieving good generalization
• Consideration 1: Bias
– How well does your model fit the observed data?
– It may be a good idea to accept some fitting error,
because it may be due to noise or other “accidental”
characteristics of one particular training set
• Consideration 2: Variance
– How robust is the model to the selection of a
particular training set?
– To put it differently, if we learn models on two different
training sets, how consistent will the models be?
Bias/variance tradeoff
• Models with too many
parameters may fit the training
data well (low bias), but are
sensitive to the choice of training set
(high variance)
– Generalization error is due to
overfitting
• Models with too few parameters
may not fit the data well (high
bias) but are consistent across
different training sets (low
variance)
– Generalization error is due to
underfitting
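The tradeoff can be checked empirically by training on many resampled training sets and looking at the predictions at one test input. In this sketch the true function, the noise level, and the two models are assumptions: a constant predictor stands in for “too few parameters” and a 1-nearest-neighbor predictor stands in for a very flexible model.

```python
import math
import random

def true_function(x):
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, noise=0.3):
    """Fixed inputs on a grid, outputs corrupted by Gaussian noise."""
    xs = [i / 10 for i in range(11)]
    ys = [true_function(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def constant_model(xs, ys, x0):
    """Too simple: always predict the mean training output."""
    return sum(ys) / len(ys)

def nearest_neighbor_model(xs, ys, x0):
    """Very flexible: copy the output of the closest training input."""
    return ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x0))]

def bias_and_variance(model, x0, trials=200, seed=0):
    """Squared bias and variance of the prediction at x0 over many training sets."""
    rng = random.Random(seed)
    preds = [model(*sample_training_set(rng), x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_function(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance

x0 = 0.2
bias_const, var_const = bias_and_variance(constant_model, x0)
bias_nn, var_nn = bias_and_variance(nearest_neighbor_model, x0)
print(bias_const, var_const, bias_nn, var_nn)
```

The constant model should show the larger squared bias and the nearest-neighbor model the larger variance.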
Underfitting and overfitting
• How to recognize underfitting?
– High training error and high test error
• How to deal with underfitting?
– Find a more complex model
• How to recognize overfitting?
– Low training error, but high test error
• How to deal with overfitting?
– Get more training data
– Decrease the number of parameters in your model
– Regularization: penalize certain parts of the parameter space or
introduce additional constraints to deal with a potentially ill-posed problem
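As a sketch of regularization, consider fitting a line through the origin with a penalty on the squared slope (ridge regression). The data and the penalty value are made up; the closed form follows from setting the derivative of the penalized squared error to zero.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Minimise sum((y - w*x)^2) + penalty * w^2 for a line through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]  # made-up data, roughly y = 2x

w_unregularized = fit_slope(xs, ys)
w_regularized = fit_slope(xs, ys, penalty=5.0)
print(w_unregularized, w_regularized)
```

The penalty shrinks the slope towards zero, which trades a slightly worse fit on the training data for less sensitivity to it.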
Methodology
• Distinction between training and testing is crucial
– Correct performance on training set is just memorization!
– Not enough to perform well on new test data
• Strictly speaking, the researcher should never look at
the test data when designing the system
– Generalization performance should be evaluated on a hold-out
or validation set
– Raises some troubling issues for learning “benchmarks”
Source: R. Parr
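A minimal sketch of keeping training, validation, and test data apart. The split fractions and the function name `split_dataset` are assumptions for illustration.

```python
import random

def split_dataset(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

Model choices are compared on the validation part; the test part is touched only once, for the final evaluation.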