Transcript Part 1

A Survey on Distance Metric Learning
(Part 1)
Gerry Tesauro
IBM T.J.Watson Research Center
Acknowledgement
• Lecture material shamelessly adapted/stolen from
the following sources:
– Kilian Weinberger:
• “Survey on Distance Metric Learning” slides
• IBM summer intern talk slides (Aug. 2006)
– Sam Roweis slides (NIPS 2006 workshop on
“Learning to Compare Examples”)
– Yann LeCun talk slides (NIPS 2006 workshop
on “Learning to Compare Examples”)
Outline
Part 1
• Motivation and Basic Concepts
• ML tasks where it's useful to learn a distance metric
• Overview of Dimensionality Reduction
• Mahalanobis Metric Learning for Clustering with Side Info (Xing et al.)
• Pseudo-metric online learning (Shalev-Shwartz et al.)
Part 2
• Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)
• Metric Learning for Kernel Regression (Weinberger & Tesauro)
• Metric learning for RL basis function construction (Keller et al.)
• Similarity learning for image processing (LeCun et al.)
Motivation
• Many ML algorithms and tasks require a distance
metric (equivalently, “dissimilarity” metric)
– Clustering (e.g. k-means)
– Classification & regression:
• Kernel methods
• Nearest neighbor methods
– Document/text retrieval
• Find most similar fingerprints in DB to given sample
• Find most similar web pages to document/keywords
– Nonlinear dimensionality reduction methods:
• Isomap, Maximum Variance Unfolding, Laplacian
Eigenmaps, etc.
Motivation (2)
• Many problems may lack a well-defined, relevant
distance metric
– Incommensurate features → Euclidean distance not meaningful
– Side information → Euclidean distance not relevant
– Learning distance metrics may thus be desirable
• A sensible similarity/distance metric may be
highly task-dependent or semantic-dependent
– What do these data points “mean”?
– What are we using the data for?
Which images are most similar?
[Image panels: the same set of face photos grouped in different ways, by pose (left / centered / right), gender (male / female), occupation (student / professor), and background (nature / plain)]
It depends ... what you are looking for.
Key DML Concept:
Mahalanobis distance metric
• The simplest mapping is a linear transformation: x → Lx
• This induces the squared Mahalanobis distance
  D²(xi, xj) = (xi − xj)ᵀ M (xi − xj), where M = LᵀL is PSD
• Algorithms can learn both matrices (L or M)
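As a sanity check of the definitions above, here is a minimal NumPy sketch (toy data and variable names of my own choosing) showing that the Mahalanobis distance with M = LᵀL equals the squared Euclidean distance after the linear map x → Lx:

```python
import numpy as np

# Hypothetical linear map L from R^3 to R^2 and two toy points.
L = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])
M = L.T @ L                                   # Mahalanobis matrix, PSD by construction

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([0.0, 1.0, 1.0])

diff = xi - xj
d2_mahalanobis = diff @ M @ diff              # (xi - xj)^T M (xi - xj)
d2_euclidean_mapped = np.sum((L @ xi - L @ xj) ** 2)

print(d2_mahalanobis, d2_euclidean_mapped)    # the two values agree
```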
>5 Minutes Introduction
to Dimensionality
Reduction
How can the dimensionality be reduced?
• eliminate redundant features
• eliminate irrelevant features
• extract low dimensional structure
Notation
Input: high-dimensional points x1, …, xn ∈ R^d
Output: low-dimensional embedding y1, …, yn ∈ R^r, with r ≪ d
Embedding principle:
Nearby points remain nearby, distant points remain distant.
Estimate r.
Two classes of DR
algorithms
Linear
Non-Linear
Linear dimensionality
reduction
Principal Component
Analysis (Jolliffe 1986)
Project data into subspace
of maximum variance.
Facts about PCA
• Eigenvectors of the covariance matrix C
• Minimizes the sum-of-squares reconstruction error
• Dimensionality r can be estimated from the eigenvalues of C
• PCA requires meaningful scaling of the input features
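A minimal NumPy sketch of PCA as characterized above (illustrative only; data and names are made up): center the data, take the eigenvectors of the covariance matrix C, and project onto the top r of them.

```python
import numpy as np

def pca(X, r):
    """Project an n x d data matrix X onto its top-r principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=False)             # d x d covariance matrix
    evals, evecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]          # largest variance first
    return Xc @ evecs[:, order[:r]]          # n x r embedding

# Toy usage: 100 points in R^5 reduced to 2 dimensions;
# the eigenvalue spectrum (inside pca) could be inspected to choose r.
X = np.random.randn(100, 5)
Y = pca(X, r=2)
```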
Multidimensional Scaling (MDS)
Pairwise distances in miles:
             NY      LA   Phoenix   Chicago
NY            0    2790      2450       810
LA         2790       0       390      2050
Phoenix    2450     390         0      1740
Chicago     810    2050      1740         0
Multidimensional Scaling (MDS)
• equivalent to PCA
• uses eigenvectors of the inner-product matrix
• requires only pairwise distances
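A sketch of classical MDS in NumPy, assuming the standard double-centering construction of the inner-product matrix from the squared pairwise distances (the city-distance table above is used as toy input):

```python
import numpy as np

def classical_mds(D, r=2):
    """Embed n points in R^r given only an n x n matrix of pairwise distances D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # inner-product (Gram) matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:r]      # top-r eigenpairs
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# City distances (miles) from the table above: NY, LA, Phoenix, Chicago.
D = np.array([[   0, 2790, 2450,  810],
              [2790,    0,  390, 2050],
              [2450,  390,    0, 1740],
              [ 810, 2050, 1740,    0]], dtype=float)
coords = classical_mds(D, r=2)               # 2-D layout of the four cities
```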
Non-linear dimensionality reduction
From subspace to submanifold
We assume the data is sampled from a manifold with a lower-dimensional degree of freedom. How can we find a faithful embedding?
Approximate the manifold with a neighborhood graph.
Isomap
Tenenbaum et al. 2000
• Geodesic distance: compute the shortest path between all inputs in the neighborhood graph
• Create the geodesic distance matrix
• Perform MDS with the geodesic distances
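The three Isomap steps can be sketched directly on top of the MDS routine above; this version assumes scikit-learn and SciPy are available (kneighbors_graph for the neighborhood graph, shortest_path for the geodesics) and that the neighborhood graph is connected:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def isomap(X, n_neighbors=10, r=2):
    """Isomap sketch: kNN graph -> geodesic distances -> classical MDS."""
    # 1. Approximate the manifold with a k-nearest-neighbor graph.
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. Geodesic distances = shortest paths in the graph (assumes connectivity).
    D = shortest_path(G, method='D', directed=False)
    # 3. Classical MDS on the geodesic distance matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1][:r]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Toy usage: 200 random 3-D points embedded into 2 dimensions.
Y = isomap(np.random.randn(200, 3), n_neighbors=10, r=2)
```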
Maximum Variance Unfolding (MVU)
Weinberger and Saul 2004
Optimization problem
• Unfold the data by maximizing pairwise distances
• Preserve local distances
• Center the output (translation invariance)
• Problem: the optimization over the output coordinates is non-convex (multiple local minima)
• Solution: change of notation, optimizing the inner-product (Gram) matrix K of the outputs instead of the coordinates, gives a convex problem with a single global minimum
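To illustrate that change of notation, here is a rough sketch of the resulting semidefinite program using cvxpy (assumed installed; this is my own compact rendering, not the authors' solver): the variable is the output Gram matrix K rather than the coordinates, which makes the problem convex.

```python
import numpy as np
import cvxpy as cp   # assumes cvxpy with an SDP-capable solver is installed

def mvu(X, n_neighbors=4, r=2):
    """MVU sketch: maximize the total variance of the output Gram matrix
    K = Y Y^T while preserving local distances; convex in K."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    knn = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]   # neighbor indices

    K = cp.Variable((n, n), PSD=True)
    constraints = [cp.sum(K) == 0]                       # center the output
    for i in range(n):
        for j in knn[i]:
            # preserve local distances: ||y_i - y_j||^2 = ||x_i - x_j||^2
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == D2[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()

    # Recover the r-dimensional embedding from the learned Gram matrix.
    evals, evecs = np.linalg.eigh(K.value)
    order = np.argsort(evals)[::-1][:r]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Toy usage on a small point cloud (SDPs get expensive as n grows).
Y = mvu(np.random.randn(15, 3), n_neighbors=3, r=2)
```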
Unfolding the
swiss-roll
Mahalanobis Metric Learning for Clustering
with Side Information (Xing et al. 2003)
• Exemplars {xi , i=1,…,N} plus two types of side info:
– “Similar” set S = { (xi , xj ) } s.t. xi and xj are “similar” (e.g. same class)
– “Dissimilar” set D = { (xi , xj ) } s.t. xi and xj are “dissimilar”
• Learn optimal Mahalanobis matrix M (a global distance function):
  D²ij = (xi − xj)ᵀ M (xi − xj)
• Goal: keep all pairs of "similar" points close, while separating all "dissimilar" pairs.
• Formulate as a constrained convex programming problem
  – Minimize the distance between the data pairs in S
  – Subject to the data pairs in D being well separated
MMC-SI (Cont’d)
• Objective of learning:
  min_M  Σ_{(xi, xj) ∈ S} D²ij
  s.t.   M ⪰ 0,   Σ_{(xi, xj) ∈ D} D²ij ≥ 1
• M is positive semi-definite
– Ensures non-negativity and the triangle inequality of the metric
• The number of parameters is quadratic in the number of features
– Difficult to scale to a large number of features
– Significant danger of overfitting small datasets
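The learning objective above maps almost line-for-line onto a small cvxpy program; the sketch below (hypothetical helper names, toy data, cvxpy assumed available) follows the slide's formulation with squared distances in both terms, whereas Xing et al.'s paper uses the non-squared distance in the dissimilarity constraint to rule out trivial rank-one solutions.

```python
import numpy as np
import cvxpy as cp   # illustrative sketch; assumes cvxpy is installed

def mmc_si(X, S, D_pairs):
    """Learn M with D^2_ij = (x_i - x_j)^T M (x_i - x_j): minimize distances
    over similar pairs S subject to dissimilar pairs D_pairs summing to >= 1."""
    d = X.shape[1]
    M = cp.Variable((d, d), PSD=True)        # enforces M PSD

    def sq_dist(i, j):
        return cp.quad_form(X[i] - X[j], M)  # (x_i - x_j)^T M (x_i - x_j)

    objective = cp.Minimize(sum(sq_dist(i, j) for (i, j) in S))
    constraints = [sum(sq_dist(i, j) for (i, j) in D_pairs) >= 1]
    cp.Problem(objective, constraints).solve()
    return M.value

# Toy usage: two tight "similar" pairs and one "dissimilar" pair in R^2.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
M = mmc_si(X, S=[(0, 1), (2, 3)], D_pairs=[(0, 2)])
```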
Mahalanobis Metric for
Clustering (MMC-SI)
Xing et al., NIPS 2002
MMC-SI
• Move similarly labeled inputs together.
• Move differently labeled inputs apart.
Convex optimization problem
• target: the Mahalanobis matrix M
• pushing differently labeled inputs apart
• pulling similar points together
• ensuring positive semi-definiteness
• the resulting problem is CONVEX
Gradient Alternating Projection
• Take a step along the gradient.
• Project onto the constraint-satisfying subspace.
• Project onto the PSD cone.
• REPEAT
The algorithm is guaranteed to converge to the optimal solution.
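The PSD-cone projection in the third step has a simple closed form: symmetrize and clip negative eigenvalues. A small NumPy sketch follows (the loop skeleton below uses hypothetical placeholder functions for the problem-specific gradient and constraint projection):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone (Frobenius-norm projection):
    keep the eigenvectors, clip negative eigenvalues at zero."""
    M = (M + M.T) / 2.0                      # symmetrize for numerical safety
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(np.maximum(evals, 0.0)) @ evecs.T

# Schematic gradient alternating projection loop (placeholders, not a full solver):
#   for t in range(num_iters):
#       M = M + step * gradient_of_objective(M)    # 1. gradient step
#       M = project_onto_constraint_set(M)         # 2. enforce the linear constraint(s)
#       M = project_psd(M)                         # 3. project back onto the PSD cone
```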
Mahalanobis Metric Learning: Example I
(a) Data distribution of the original dataset
(b) Data scaled by the global metric
• Keep all the data points within the same classes close
• Separate all the data points from different classes
Mahalanobis Metric Learning: Example II
(a) Original data
(b) Rescaling by learned full M
(c) Rescaling by learned diagonal M
• Diagonal distance metric M can simplify computation, but
could lead to disastrous results
Summary of Xing et al. 2002
• Learns a Mahalanobis metric
• Well suited for clustering
• Can be kernelized
• Optimization problem is convex
• Algorithm is guaranteed to converge
• Assumes data to be uni-modal
POLA
(Pseudo-metric online learning algorithm)
Shalev-Shwartz et al., ICML 2004
• This time the inputs are accessed two at a time.
• Differently labeled inputs are separated.
• Similarly labeled inputs are moved closer.
• A margin is introduced between similar and dissimilar pairs.
Convex optimization
At each time t, we get two inputs x_t, x'_t with a label y_t ∈ {+1, −1} (similar / dissimilar).
Constraint 1: y_t (b − D²_M(x_t, x'_t)) ≥ 1   (margin constraint on the pair, with learned threshold b)
Constraint 2: M ⪰ 0   (positive semi-definiteness)
Both are convex!!
Alternating Projection
• Initialize inside the PSD cone.
• Project onto the constraint-satisfying hyperplane and back.
• Repeat with new constraints.
If a solution exists, the algorithm converges inside the intersection.
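A rough NumPy sketch of a single online update in this spirit (my own rendering with made-up names, not the authors' code): the pair's margin constraint is linear in (M, b), so projecting onto its half-space is a rank-one update of M plus a shift of b, after which M is projected back onto the PSD cone.

```python
import numpy as np

def pola_style_update(M, b, x, x_prime, y, margin=1.0):
    """One online update: project (M, b) onto the half-space
    { y * (b - (x - x')^T M (x - x')) >= margin }, then project M onto the
    PSD cone. y = +1 for a similar pair, y = -1 for a dissimilar pair."""
    v = x - x_prime
    dist2 = v @ M @ v
    loss = max(0.0, margin - y * (b - dist2))        # amount of violation
    if loss > 0.0:
        # Half-space projection: rank-one correction of M and a shift of b.
        tau = loss / (np.sum(v ** 2) ** 2 + 1.0)
        M = M - tau * y * np.outer(v, v)
        b = b + tau * y
    # Project back onto the PSD cone by clipping negative eigenvalues.
    evals, evecs = np.linalg.eigh((M + M.T) / 2.0)
    M = evecs @ np.diag(np.maximum(evals, 0.0)) @ evecs.T
    return M, b

# Toy usage: start from the Euclidean metric in R^3 and process one similar pair.
M, b = np.eye(3), 2.0
M, b = pola_style_update(M, b, np.array([1.0, 0.0, 0.0]),
                         np.array([0.0, 1.0, 0.0]), y=+1)
```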
Summary of POLA
• Learns a Mahalanobis metric
• Online algorithm
• Can also be kernelized
• Introduces a margin
• Algorithm converges if a solution exists
• Assumes data to be unimodal
Neighborhood Components Analysis
Distance metric for visualization and kNN
(Goldberger et al. 2004)