Introduction to Machine Learning for Category Representation


Clustering with k-means and mixture of Gaussian densities
Jakob Verbeek
December 4, 2009
Plan for this course

1) Introduction to machine learning
2) Clustering techniques
   • k-means, Gaussian mixture density
3) Gaussian mixture density continued
   • Parameter estimation with EM, Fisher kernels
4) Classification techniques 1
   • Introduction, generative methods, semi-supervised
5) Classification techniques 2
   • Discriminative methods, kernels
6) Decomposition of images
   • Topic models, …
Clustering
• Finding a group structure in the data
– Data points in the same cluster are similar to each other
– Data points in different clusters are dissimilar
• Map each data point to a discrete cluster index
– “flat” methods find k groups (k known, or automatically set)
– “hierarchical” methods define a tree structure over the data
Hierarchical Clustering
• Data set is partitioned into a tree structure
• Top-down construction
– Start all data in one cluster: root node
– Apply “flat” clustering into k groups
– Recursively cluster the data in each group
• Bottom-up construction
– Start with all points in separate clusters
– Recursively merge the “closest” clusters
– Distance between clusters A and B
  • Min, max, or mean distance between x in A and y in B
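As a concrete illustration of the bottom-up construction, here is a minimal sketch of agglomerative clustering with a single-linkage (minimum) inter-cluster distance; the function name, the toy data X, and the target number of clusters are illustrative assumptions, not part of the slides.

```python
import numpy as np

def agglomerative_min_linkage(X, n_clusters):
    """Naive bottom-up clustering: start with singleton clusters and
    repeatedly merge the pair with the smallest minimum pairwise distance."""
    clusters = [[i] for i in range(len(X))]          # each point starts in its own cluster
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: min distance between x in A and y in B
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]                   # merge the two closest clusters
        del clusters[b]
    return clusters

# toy usage: two well-separated groups in R^2
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(agglomerative_min_linkage(X, 2))
```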
Clustering example
• Learn face similarity from training pairs labeled as
same/different
• Cluster faces based on identity
• Example: Picasa Web Albums, label face clusters
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Clustering example: visual words
[Figure: example visual-word clusters for the categories Airplanes, Motorbikes, Faces, Wild Cats, Leafs, People, Bikes]
Clustering for visual vocabulary construction
• Clustering of local image descriptors
– Most often done using k-means or mixture of Gaussians
– Divide the space of region descriptors into a collection of non-overlapping cells
• Recap of the image representation pipeline
– Extract image regions at different locations and scales:
  randomly, on a regular grid, or using an interest point detector
– Compute a descriptor for each region (e.g. SIFT)
– Assign each descriptor to a cluster center
• Or do “soft assignment” or “multiple assignment”
– Make histogram for complete image
• Possibly separate histograms for different image regions
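A minimal sketch of the last two steps of this pipeline (hard assignment of descriptors and histogram construction), assuming the region descriptors and the visual vocabulary (cluster centers) have already been computed; the variable names are illustrative.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Hard-assign each region descriptor to its nearest cluster center
    and return a normalized bag-of-words histogram for the image."""
    # squared Euclidean distances, shape (num_descriptors, num_centers)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)                    # index of the closest visual word
    hist = np.bincount(assignments, minlength=len(centers)).astype(float)
    return hist / hist.sum()                           # normalize to sum to 1

# toy usage: 100 random "descriptors" of dimension 128, vocabulary of 10 words
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 128))
centers = rng.normal(size=(10, 128))
print(bow_histogram(descriptors, centers))
```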
Definition of k-means clustering
• Given: data set of N points xn, n=1,…,N
• Goal: find K cluster centers mk, k=1,…,K
• Clustering: assign each point to closest center
• Error criterion: sum of squared distances to closest
cluster center for each data point
E {mk }kK1    min k || xn  mk ||22
N
n 1
Examples of k-means clustering
• Data uniformly sampled in R2
• Data non-uniformly sampled in R3
Minimizing the error function
E {mk }kK1    min k || xn  mk ||22
N
n 1
• The error function is
  – non-differentiable due to the min operator
  – non-convex, i.e. there are local minima
• Minimization can be done with an iterative algorithm
  1) Initialize cluster centers
  2) Assign each data point to the nearest center
  3) Update the cluster centers as the mean of the associated data
  4) If cluster centers changed: return to step 2)
  5) Return cluster centers
• Iterations monotonically decrease the error function
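A minimal sketch of this iterative algorithm in plain numpy; random initialization from the data points and the handling of empty clusters are assumptions, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Iteratively minimize E({m_k}) = sum_n min_k ||x_n - m_k||^2."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]       # 1) initialize cluster centers
    for _ in range(max_iter):
        # 2) assign each data point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # 3) update each center as the mean of its associated data
        new_centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):                # 4) stop if centers unchanged
            break
        centers = new_centers
    return centers, z                                        # 5) return centers and assignments
```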
Iteratively minimizing the error function
• Introduce “latent” variable zn, with value in {1, …, K}
  2) Assignment of data point xn to one of the clusters: zn
E {m }
•
   min
N
K
k k 1
n 1
|| xn  mk ||   || xn  mzn ||22  B{mk }kK1 ,{z n }nN1 
N
k
2
2
n 1
Upper-bound on error function, without min operator
– Error function and bound equal for the “min” assignment
– Minimize the bound w.r.t. cluster centers
  3) Update the cluster centers as mean of associated data
\frac{\partial}{\partial m_k} \sum_{n: z_n = k} \| x_n - m_k \|_2^2 = -2 \sum_{n: z_n = k} (x_n - m_k) = 0
\;\Rightarrow\; m_k^{*} = \Big( \sum_{n: z_n = k} x_n \Big) \Big/ \Big( \sum_{n: z_n = k} 1 \Big)
Iteratively minimizing the error function
• Minimization can be done with an iterative algorithm
  1) Assign each data point to the nearest center
     • Constructs a tight bound on the error function
  2) Update the cluster centers as the mean of the associated data
     • Minimizes the bound
E(\{m_k\}_{k=1}^{K}) \,\underset{1)}{=}\, B(\{m_k\}_{k=1}^{K}, \{z_n\}_{n=1}^{N}) \,\underset{2)}{\ge}\, B(\{m_k^{*}\}_{k=1}^{K}, \{z_n\}_{n=1}^{N}) \,\ge\, E(\{m_k^{*}\}_{k=1}^{K})
• Example of “iterative bound optimization”
  – The EM algorithm is another example
Examples of k-means clustering
• Several iterations with two centers
[Plot: error function over the iterations]
Examples of k-means clustering
• Clustering RGB vectors of pixels in images
• Compression of an image file of N × 24 bits
  – Store RGB values of cluster centers: K × 24 bits
  – Store cluster index of each pixel: N × log2 K bits
[Figure: quantized images at compression ratios of roughly 4.2%, 8.3%, and 16.7%]
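A small worked check of the storage cost above, assuming the quoted percentages count N log2 K bits of cluster indices plus the (negligible) K × 24-bit codebook against N × 24 bits of raw RGB: K = 2, 4, 16 give roughly 4.2%, 8.3%, and 16.7%.

```python
import math

def compression_ratio(N, K):
    """Approximate compressed/original size for k-means color quantization:
    (K * 24 + N * log2(K)) bits of centers + indices versus N * 24 bits of raw RGB."""
    compressed = K * 24 + N * math.log2(K)
    return compressed / (N * 24)

# e.g. a 512 x 512 image; the codebook term K*24 is negligible for large N
N = 512 * 512
for K in (2, 4, 16):
    print(K, f"{100 * compression_ratio(N, K):.1f}%")   # ~4.2%, 8.3%, 16.7%
```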
Clustering with Gaussian mixture density
• Each cluster represented by Gaussian density
– Center, as in k-means
– Covariance matrix: cluster spread around center
p(x) = N(x; m, C) = (2\pi)^{-d/2} \, |C|^{-1/2} \exp\!\Big( -\tfrac{1}{2} (x - m)^T C^{-1} (x - m) \Big)
  – d: data dimension
  – |C|: determinant of the covariance matrix C
  – (x - m)^T C^{-1} (x - m): quadratic function of the point x and the mean m
Clustering with Gaussian mixture density
• Mixture density is weighted sum of Gaussians
– Mixing weight: importance of each cluster
p(x) = \sum_{k=1}^{K} \pi_k \, N(x; m_k, C_k)
• Density has to integrate to 1, so we require
  \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
Clustering with Gaussian mixture density
• Given: data set of N points xn, n=1,…,N
• Find mixture of Gaussians (MoG) that best explains data
– Assigns maximum likelihood to the data
– Assume data points are drawn independently from MoG
p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)
\theta = \{\pi_k, m_k, C_k\}_{k=1}^{K}
L(\theta) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)
– Maximize the log-likelihood of the fixed data set X w.r.t. the parameters of the MoG
• As with k-means, the objective function has local optima
  – Can use the Expectation-Maximization (EM) algorithm
  – Similar to the iterative k-means algorithm
Assignment of data points to clusters
• As with k-means, zn indicates the cluster index for xn
• To sample point from MoG
– Select cluster index k with probability given by mixing weight
– Sample point from the k-th Gaussian
– MoG is recovered if we marginalize over the unknown index
p(z_n = k) = \pi_k
p(x_n \mid z_n = k) = N(x_n; m_k, C_k)
p(x_n) = \sum_{k=1}^{K} p(z_n = k) \, p(x_n \mid z_n = k) = \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)
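A minimal sketch of this two-step sampling procedure (pick a component with probability given by its mixing weight, then sample from that Gaussian); the parameter names are illustrative.

```python
import numpy as np

def sample_mog(weights, means, covs, n_samples=1, seed=0):
    """Draw samples from a mixture of Gaussians by ancestral sampling."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        k = rng.choice(len(weights), p=weights)                      # select cluster index z_n = k
        samples.append(rng.multivariate_normal(means[k], covs[k]))   # sample from the k-th Gaussian
    return np.array(samples)

# toy usage: two 2-d components
w = [0.3, 0.7]
m = [np.zeros(2), np.array([5.0, 5.0])]
C = [np.eye(2), np.eye(2)]
print(sample_mog(w, m, C, n_samples=3))
```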
Soft assignment of data points to clusters
• Given data point xn, infer value of zn
– Conditional probability of zn given xn
p(z_n = k \mid x_n) = \frac{p(z_n = k, x_n)}{p(x_n)} = \frac{p(z_n = k) \, p(x_n \mid z_n = k)}{p(x_n)} = \frac{\pi_k \, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s \, N(x_n; m_s, C_s)}
Maximum likelihood estimation of Gaussian
• Given data points xn, n=1,…,N
• Find Gaussian that maximizes data log-likelihood
L(\theta) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log N(x_n; m, C) = \sum_{n=1}^{N} \Big[ -\tfrac{d}{2} \log(2\pi) - \tfrac{1}{2} \log |C| - \tfrac{1}{2} (x_n - m)^T C^{-1} (x_n - m) \Big]
• Set derivative of data log-likelihood w.r.t. parameters to zero
\frac{\partial}{\partial m} L(\theta) = C^{-1} \sum_{n=1}^{N} (x_n - m) = 0 \;\Rightarrow\; m = \frac{1}{N} \sum_{n=1}^{N} x_n
\frac{\partial}{\partial C^{-1}} L(\theta) = \sum_{n=1}^{N} \Big[ \tfrac{1}{2} C - \tfrac{1}{2} (x_n - m)(x_n - m)^T \Big] = 0 \;\Rightarrow\; C = \frac{1}{N} \sum_{n=1}^{N} (x_n - m)(x_n - m)^T
• Parameters are set to the data covariance and mean
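A minimal sketch of these closed-form maximum-likelihood estimates (note the 1/N normalization of the covariance, as in the derivation above).

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood mean and covariance of a d-dimensional Gaussian."""
    m = X.mean(axis=0)                       # m = (1/N) sum_n x_n
    diff = X - m
    C = diff.T @ diff / len(X)               # C = (1/N) sum_n (x_n - m)(x_n - m)^T
    return m, C
```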
Maximum likelihood estimation of MoG
• Use EM algorithm
– Initialize MoG: parameters or soft-assignments
– E-step: soft-assign data points to clusters
– M-step: update the cluster parameters
– Repeat EM steps, terminate if converged
• Convergence of parameters or assignments
• E-step: compute posterior on z given x:
q_{nk} \equiv p(z_n = k \mid x_n) = \frac{\pi_k \, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s \, N(x_n; m_s, C_s)}
• M-step: update Gaussians from data points weighted by posterior
N_k = \sum_{n=1}^{N} q_{nk}
\pi_k = \frac{N_k}{N} = \frac{\sum_{n=1}^{N} q_{nk}}{\sum_{n=1}^{N} \sum_{s=1}^{K} q_{ns}}
m_k = \frac{1}{N_k} \sum_{n=1}^{N} q_{nk} \, x_n
C_k = \frac{1}{N_k} \sum_{n=1}^{N} q_{nk} \, (x_n - m_k)(x_n - m_k)^T
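A minimal sketch of the full EM loop with the E- and M-step updates above; initializing from random soft assignments and adding a tiny ridge to each covariance for numerical stability are assumptions not in the slides, and scipy is used only to evaluate the Gaussian densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=50, seed=0):
    """Fit a mixture of K Gaussians to X (shape (N, d)) by EM."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # initialize with random soft assignments q (rows sum to 1)
    q = rng.random((N, K))
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: update pi_k, m_k, C_k from data points weighted by q_nk
        Nk = q.sum(axis=0)                                # N_k = sum_n q_nk
        weights = Nk / N                                  # pi_k = N_k / N
        means = (q.T @ X) / Nk[:, None]                   # m_k = (1/N_k) sum_n q_nk x_n
        covs = []
        for k in range(K):
            diff = X - means[k]
            covs.append((q[:, k, None] * diff).T @ diff / Nk[k]
                        + 1e-6 * np.eye(d))               # small ridge for numerical stability
        # E-step: q_nk = pi_k N(x_n; m_k, C_k) / sum_s pi_s N(x_n; m_s, C_s)
        dens = np.column_stack([weights[k] * multivariate_normal(means[k], covs[k]).pdf(X)
                                for k in range(K)])
        q = dens / dens.sum(axis=1, keepdims=True)
    return weights, means, covs, q
```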
Maximum likelihood estimation of MoG
• Example of several EM iterations
Clustering with k-means and MoG
• Hard assignment in k-means is not robust near the borders of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
  – Run from several initializations
  – Keep the best result
• The number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [Gemert et al, IEEE TPAMI, 2010]
Plan for this course
1) Introduction to machine learning
2) Clustering techniques
   • k-means, Gaussian mixture density
   • Reading for next week:
     – Neal & Hinton, “A view of the EM algorithm that justifies incremental, sparse, and other variants”, in “Learning in graphical models”, 1998.
     – Part of chapter 3 of my thesis
     – Both available on the course website: http://lear.inrialpes.fr/~verbeek/teaching
3) Gaussian mixture density continued
   • Parameter estimation with EM, Fisher kernels
4) Classification techniques 1
   • Introduction, generative methods, semi-supervised
5) Classification techniques 2
   • Discriminative methods, kernels
6) Decomposition of images
   • Topic models, …