Clustering with k-means and
mixture of Gaussian densities
Jakob Verbeek
December 3, 2010
Course website:
http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php
Plan for the course
• Session 1, October 1 2010
– Cordelia Schmid: Introduction
– Jakob Verbeek: Introduction Machine Learning
• Session 2, December 3 2010
– Jakob Verbeek: Clustering with k-means, mixture of Gaussians
– Cordelia Schmid: Local invariant features
– Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk, Schmid, IJCV 2004.
• Session 3, December 10 2010
– Cordelia Schmid: Instance-level recognition: efficient search
– Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.
Plan for the course
• Session 4, December 17 2010
– Jakob Verbeek: Mixture of Gaussians, EM algorithm, Fisher Vector image representation
– Cordelia Schmid: Bag-of-features models for category-level classification
– Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
• Session 5, January 7 2011
– Jakob Verbeek: Classification 1: generative and non-parametric methods
– Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
– Cordelia Schmid: Category level localization: Sliding window and shape model
– Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
• Session 6, January 14 2011
– Jakob Verbeek: Classification 2: discriminative models
– Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
– Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.
Clustering
• Finding a group structure in the data
– Data in one cluster similar to each other
– Data in different clusters dissimilar
• Map each data point to a discrete cluster index
– “flat” methods find k groups (k known, or automatically set)
– “hierarchical” methods define a tree structure over the data
Hierarchical Clustering
• Data set is partitioned into a tree structure
• Top-down construction
– Start all data in one cluster: root node
– Apply “flat” clustering into k groups
– Recursively cluster the data in each group
• Bottom-up construction
– Start with all points in separate cluster
– Recursively merge “closest” clusters
– Distance between clusters A and B
• Min, max, or mean distance between x in A and y in B
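A minimal sketch of the bottom-up construction, assuming the data points are rows of a NumPy array X; it uses SciPy's hierarchical clustering routines (linkage, fcluster), where the method argument selects the min, max, or mean inter-cluster distance mentioned above. The toy data and the choice of two clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

# Bottom-up merging of closest clusters; 'single' = min,
# 'complete' = max, 'average' = mean distance between clusters.
Z = linkage(X, method='average')

# Cut the resulting tree to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```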
Clustering example
• Learn face similarity from training pairs labeled as
same/different
• Cluster faces based on identity
• Example: Picasa Web Albums, label face clusters
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Clustering example: visual words
[Figure: example visual word clusters for the categories Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes]
Clustering for visual vocabulary construction
• Clustering of local image descriptors
– Most often done using k-means or mixture of Gaussians
– Divide the space of region descriptors into a collection of non-overlapping cells
• Recap of the image representation pipeline
– Extract image regions at different locations and scales:
randomly, on a regular grid, or using an interest point detector
– Compute a descriptor for each region (e.g., SIFT)
– Assign each descriptor to a cluster center
• Or do “soft assignment” or “multiple assignment”
– Make a histogram for the complete image (see the sketch below)
• Possibly separate histograms for different image regions
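A minimal sketch of the last two pipeline steps, assuming the local descriptors have already been extracted into a NumPy array and the cluster centers come from k-means; the array names and sizes are illustrative.

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Hard-assign each descriptor to its nearest center and
    return an L1-normalized visual-word histogram."""
    # Squared Euclidean distances, shape (n_descriptors, n_centers).
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Illustrative usage: 500 SIFT-like 128-D descriptors, 100 visual words.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 128))
centers = rng.normal(size=(100, 128))
print(bow_histogram(descriptors, centers).shape)  # (100,)
```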
Definition of k-means clustering
• Given: data set of N points xn, n=1,…,N
• Goal: find K cluster centers mk, k=1,…,K
• Clustering: assignment of data points to cluster centers
– Binary indicator variables rnk = 1 if xn is assigned to cluster center mk, 0 otherwise
• Error criterion: sum of squared distances between each
data point and assigned cluster center
E({m_k}_{k=1}^K) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} ||x_n - m_k||^2
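As a sketch, this error criterion can be written directly from the formula above; the variable names are illustrative.

```python
import numpy as np

def kmeans_error(X, centers, assignments):
    """Sum of squared distances between each point and its assigned center.
    X: (N, d), centers: (K, d), assignments: (N,) with values in 0..K-1."""
    return ((X - centers[assignments]) ** 2).sum()
```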
Examples of k-means clustering
• Data uniformly sampled in unit square, running
k-means with 5, 10, 15, 20 and 25 centers
Minimizing the error function
E({m_k}_{k=1}^K) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} ||x_n - m_k||^2
• Goal: find centers mk and assignments rnk to minimize the error function
• An iterative algorithm:
1) Initialize cluster centers, somehow
2) Update assignments rnk for fixed mk
3) Update centers mk for fixed assignments rnk
4) If cluster centers changed: return to step 2)
5) Return cluster centers
• Iterations monotonically decrease the error function
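The five steps above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the simple choice of initializing the centers at K distinct random data points and testing convergence by comparing the centers between iterations, as in step 4.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain k-means: X is (N, d); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # 1) Initialize centers, somehow: here, K distinct random data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2) Update assignments for fixed centers: closest center per point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assignments = d2.argmin(axis=1)
        # 3) Update centers for fixed assignments: mean of assigned points.
        new_centers = np.array([
            X[assignments == k].mean(axis=0) if np.any(assignments == k)
            else centers[k]                     # keep empty clusters in place
            for k in range(K)])
        # 4) If the centers did not change, stop; otherwise iterate.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # 5) Return the cluster centers (and the final assignments).
    return centers, assignments
```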
Examples of k-means clustering
• Several iterations with two centers
[Figure: cluster assignments per iteration, and the corresponding error function]
Minimizing the error function
E({m_k}_{k=1}^K) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} ||x_n - m_k||^2

• Update assignments rnk for fixed mk
– Decouples over the data points: for each n minimize \sum_{k=1}^K r_{nk} ||x_n - m_k||^2
– Only one rnk = 1, rest zero
– Assign each point to its closest center
• Update centers mk for fixed assignments rnk
– Decouples over the centers: for each k minimize \sum_{n=1}^N r_{nk} ||x_n - m_k||^2
– Set the derivative to zero: \partial E / \partial m_k = -2 \sum_{n=1}^N r_{nk} (x_n - m_k) = 0
– Put the center at the mean of its assigned data points: m_k = \sum_{n=1}^N r_{nk} x_n / \sum_{n=1}^N r_{nk}
Minimizing the error function
E({m_k}_{k=1}^K) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} ||x_n - m_k||^2

• Goal: find centers mk and assignments rnk to minimize the error function
• An iterative algorithm:
1) Initialize cluster centers, somehow
2) Update assignments rnk for fixed mk
3) Update centers mk for fixed assignments rnk
4) If cluster centers changed: return to step 2)
5) Return cluster centers
• Iterations monotonically decrease the error function
– Both steps reduce the error function
– Only a finite number of possible assignments
Examples of k-means clustering
• Several iterations with two centers
[Figure: cluster assignments per iteration, and the corresponding error function]
Examples of k-means clustering
• Solutions for different initializations
Examples of k-means clustering
• Clustering RGB vectors of pixels in images
• Compression of image file: original size N x 3 x 8 bits
– Store RGB values of cluster centers: K x 24 bits
– Store cluster index of each pixel: N x log K bits
[Figure: reconstructed images at 4.2%, 8.3%, and 16.7% of the original file size]
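A sketch of this compression experiment, assuming the kmeans function sketched earlier (any k-means implementation would do); the synthetic stand-in image and K = 16 are illustrative. The compressed size is roughly log2(K)/24 of the original, plus the small overhead of storing the centers.

```python
import numpy as np

# Assumes the kmeans(X, K) sketch from above; image and K are illustrative.
H, W, K = 64, 64, 16
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(H, W, 3)).astype(float)  # stand-in for a real photo

pixels = image.reshape(-1, 3)                # N x 3 RGB vectors
centers, idx = kmeans(pixels, K)             # cluster the colors
compressed = centers[idx].reshape(H, W, 3)   # each pixel replaced by its center

original_bits = pixels.shape[0] * 3 * 8                            # N x 3 x 8 bits
compressed_bits = K * 24 + pixels.shape[0] * np.ceil(np.log2(K))   # centers + indices
print(compressed_bits / original_bits)       # ~0.17 for K = 16
```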
Clustering with Gaussian mixture density
• Each cluster represented by Gaussian density
– Center, as in k-means
– Covariance matrix: cluster spread around center
p(x) = N(x; m, C) = (2\pi)^{-d/2} |C|^{-1/2} \exp\left( -\tfrac{1}{2} (x - m)^T C^{-1} (x - m) \right)

– d: data dimension
– |C|: determinant of covariance matrix C
– (x - m)^T C^{-1} (x - m): quadratic function of point x and mean m
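A direct transcription of this density into NumPy, useful for checking the formula on small examples; the function and variable names are illustrative.

```python
import numpy as np

def gauss_pdf(x, m, C):
    """Multivariate Gaussian density N(x; m, C) for a single point x."""
    d = len(m)
    diff = x - m
    quad = diff @ np.linalg.solve(C, diff)            # (x - m)^T C^{-1} (x - m)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm

# Optional sanity check against SciPy:
# from scipy.stats import multivariate_normal
# multivariate_normal(mean=m, cov=C).pdf(x)
```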
Clustering with Gaussian mixture density
• Mixture density is weighted sum of Gaussians
– Mixing weight: importance of each cluster
p(x) = \sum_{k=1}^K \pi_k N(x; m_k, C_k)

• Density has to integrate to 1, so we require
\pi_k \geq 0 \quad \text{and} \quad \sum_{k=1}^K \pi_k = 1
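Building on the single-Gaussian density above, the mixture density is just the weighted sum; a minimal sketch, assuming the gauss_pdf helper defined earlier and non-negative weights that sum to one.

```python
import numpy as np

def mog_pdf(x, weights, means, covs):
    """p(x) = sum_k pi_k N(x; m_k, C_k)."""
    return sum(w * gauss_pdf(x, m, C)
               for w, m, C in zip(weights, means, covs))
```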
Clustering with Gaussian mixture density
• Given: data set of N points xn, n=1,…,N
• Find mixture of Gaussians (MoG) that best explains data
– Assigns maximum likelihood to the data
– Assume data points are drawn independently from MoG
p(X) = \prod_{n=1}^N p(x_n) = \prod_{n=1}^N \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)

\theta = \{\pi_k, m_k, C_k\}_{k=1}^K

L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
– Maximize log-likelihood of fixed data set X w.r.t. parameters of MoG
• As with k-means, the objective function has local optima
– Can use Expectation-Maximization (EM) algorithm
– Similar to the iterative k-means algorithm
Assignment of data points to clusters
• As with k-means zn indicates cluster index for xn
• To sample point from MoG
– Select cluster index k with probability given by mixing weight
– Sample point from the k-th Gaussian
– MoG recovered if we marginalize over the unknown index
p(z_n = k) = \pi_k
p(x_n | z_n = k) = N(x_n; m_k, C_k)
p(x_n) = \sum_{k=1}^K p(z_n = k)\, p(x_n | z_n = k) = \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
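A sketch of this two-stage sampling procedure (pick a component index with probability given by its mixing weight, then sample from that Gaussian); the function name and arguments are illustrative.

```python
import numpy as np

def sample_mog(n_samples, weights, means, covs, seed=0):
    """Draw points from a mixture of Gaussians via its latent indices z."""
    rng = np.random.default_rng(seed)
    # Select a cluster index for each sample: p(z = k) = pi_k.
    z = rng.choice(len(weights), size=n_samples, p=weights)
    # Sample each point from its selected Gaussian: p(x | z = k) = N(x; m_k, C_k).
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return x, z
```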
Soft assignment of data points to clusters
• Given data point xn, infer value of zn
– Conditional probability of zn given xn
p(z_n = k | x_n) = \frac{p(z_n = k, x_n)}{p(x_n)} = \frac{p(z_n = k)\, p(x_n | z_n = k)}{p(x_n)} = \frac{\pi_k N(x_n; m_k, C_k)}{\sum_{s=1}^K \pi_s N(x_n; m_s, C_s)}
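This posterior is exactly what the E-step of EM computes for every point; a minimal sketch, assuming the gauss_pdf helper from above.

```python
import numpy as np

def responsibilities(X, weights, means, covs):
    """p(z_n = k | x_n) for all points and components; returns shape (N, K)."""
    # Unnormalized posteriors: pi_k * N(x_n; m_k, C_k).
    r = np.array([[w * gauss_pdf(x, m, C)
                   for w, m, C in zip(weights, means, covs)]
                  for x in X])
    return r / r.sum(axis=1, keepdims=True)   # normalize over k
```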
Maximum likelihood estimation of Gaussian
• Given data points xn, n=1,…,N
• Find Gaussian that maximizes data log-likelihood
L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log N(x_n; m, C)
          = \sum_{n=1}^N \left[ -\tfrac{d}{2} \log(2\pi) - \tfrac{1}{2} \log |C| - \tfrac{1}{2} (x_n - m)^T C^{-1} (x_n - m) \right]
• Set derivative of data log-likelihood w.r.t. parameters to zero
\partial L(\theta) / \partial m = \sum_{n=1}^N C^{-1} (x_n - m) = 0 \quad \Rightarrow \quad m = \frac{1}{N} \sum_{n=1}^N x_n

\partial L(\theta) / \partial C^{-1} = \sum_{n=1}^N \left[ \tfrac{1}{2} C - \tfrac{1}{2} (x_n - m)(x_n - m)^T \right] = 0 \quad \Rightarrow \quad C = \frac{1}{N} \sum_{n=1}^N (x_n - m)(x_n - m)^T

• Parameters are set to the data mean and covariance
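The resulting closed-form estimates are simply the sample mean and sample covariance; a one-line sketch for each, where ddof=0 gives the 1/N maximum-likelihood covariance rather than the unbiased 1/(N-1) version.

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance of data X with shape (N, d)."""
    m = X.mean(axis=0)                      # m = (1/N) sum_n x_n
    C = np.cov(X, rowvar=False, ddof=0)     # C = (1/N) sum_n (x_n - m)(x_n - m)^T
    return m, C
```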
Maximum likelihood estimation of MoG
• Use EM algorithm
– Initialize MoG: parameters or soft-assignments
– E-step: soft-assign data points to clusters
– M-step: update the cluster parameters
– Repeat EM steps, terminate if converged
• Convergence of parameters or assignments
• E-step: compute posterior on z given x:
p(z_n = k | x_n) = \frac{\pi_k N(x_n; m_k, C_k)}{\sum_{s=1}^K \pi_s N(x_n; m_s, C_s)}
• M-step: update Gaussians from data points weighted by posterior
\pi_k = \frac{1}{N} \sum_{n=1}^N q_{nk}

m_k = \frac{1}{N_k} \sum_{n=1}^N q_{nk} x_n, \qquad N_k = \sum_{n=1}^N q_{nk}

C_k = \frac{1}{N_k} \sum_{n=1}^N q_{nk} (x_n - m_k)(x_n - m_k)^T

with q_{nk} = p(z_n = k | x_n) for all n, k
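Putting the E-step and M-step above together, a compact EM sketch under the simple choice of initializing from a random hard assignment; the array names, iteration count, and small ridge on the covariances are illustrative, and the responsibilities helper is the one sketched earlier.

```python
import numpy as np

def em_mog(X, K, n_iter=50, seed=0):
    """Fit a K-component mixture of Gaussians to X (N, d) by EM."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize with a random hard assignment (one simple option).
    q = np.eye(K)[rng.integers(K, size=N)].astype(float)
    for _ in range(n_iter):
        # M-step: update weights, means, covariances from soft counts.
        Nk = q.sum(axis=0)                        # N_k = sum_n q_nk
        weights = Nk / N                          # pi_k
        means = (q.T @ X) / Nk[:, None]           # m_k
        covs = []
        for k in range(K):
            diff = X - means[k]
            covs.append((q[:, k, None] * diff).T @ diff / Nk[k]
                        + 1e-6 * np.eye(d))       # small ridge for numerical stability
        # E-step: recompute posteriors q_nk = p(z_n = k | x_n).
        q = responsibilities(X, weights, means, covs)
    return weights, means, covs
```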
Maximum likelihood estimation of MoG
• Example of several EM iterations
Clustering with k-means and MoG
• Hard assignment in k-means is not robust near the borders of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
– Run from several initializations
– Keep the best result
• Number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [Gemert et al, IEEE TPAMI, 2010]
Further reading material
• Paper by Radford Neal & Geoffrey Hinton, “A view of the EM algorithm that justifies incremental, sparse, and other variants”, in “Learning in Graphical Models”, 1998. (available online)
• Chapter 9 of the book “Pattern Recognition and Machine Learning” by Chris Bishop (Springer, 2006)