Introduction to Machine Learning for Category Representation
Clustering with k-means and mixture of Gaussian densities
Jakob Verbeek
December 4, 2009
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
3) Gaussian mixture density continued: parameter estimation with EM, Fisher kernels
4) Classification techniques 1: introduction, generative methods, semi-supervised
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …
Clustering
• Finding a group structure in the data
– Data in one cluster similar to each other
– Data in different clusters dissimilar
• Map each data point to a discrete cluster index
– “flat” methods find k groups (k known, or automatically set)
– “hierarchical” methods define a tree structure over the data
Hierarchical Clustering
• Data set is partitioned into a tree structure
• Top-down construction
– Start all data in one cluster: root node
– Apply “flat” clustering into k groups
– Recursively cluster the data in each group
• Bottom-up construction
– Start with all points in separate clusters
– Recursively merge the "closest" clusters (see the sketch below)
– Distance between clusters A and B:
• min, max, or mean distance between x in A and y in B
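As an illustration of the bottom-up construction, here is a minimal sketch using SciPy's agglomerative clustering on made-up 2-D data; the method argument corresponds to the min ('single'), max ('complete'), or mean ('average') inter-cluster distance mentioned above, and all data and variable names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
               rng.normal(3.0, 0.5, (10, 2))])

# Bottom-up construction: start with every point in its own cluster
# and recursively merge the two closest clusters.
#   method='single'   -> min  distance between x in A and y in B
#   method='complete' -> max  distance
#   method='average'  -> mean distance
Z = linkage(X, method='average')

# Cut the resulting tree to obtain a "flat" clustering with 2 groups
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)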
Clustering example
• Learn face similarity from training pairs labeled as
same/different
• Cluster faces based on identity
• Example: Picasa Web Albums, label face clusters
[Guillaumin, Verbeek, Schmid, ICCV 2009]
Clustering example: visual words
[Figure: example clusters for the categories Airplanes, Motorbikes, Faces, Wild Cats, Leafs, People, and Bikes]
Clustering for visual vocabulary construction
• Clustering of local image descriptors
– Most often done using k-means or a mixture of Gaussians
– Divide the space of region descriptors into a collection of non-overlapping cells
• Recap of the image representation pipeline (a code sketch follows below)
– Extract image regions at different locations and scales:
randomly, on a regular grid, or using an interest point detector
– Compute a descriptor for each region (e.g. SIFT)
– Assign each descriptor to a cluster center
• Or do "soft assignment" or "multiple assignment"
– Make a histogram for the complete image
• Possibly separate histograms for different image regions
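A minimal sketch of the quantization and histogram steps, assuming the region descriptors have already been extracted (e.g. 128-D SIFT vectors) and a codebook of cluster centers has already been learned with k-means; the array and function names are made up for the example.

import numpy as np

def bow_histogram(descriptors, centers):
    # Hard-assign each descriptor to its nearest center and
    # return a normalized bag-of-words histogram.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)                     # index of closest center
    hist = np.bincount(assignments, minlength=len(centers)).astype(float)
    return hist / hist.sum()                            # normalize over visual words

# Usage with stand-in data (real descriptors would come from SIFT on image regions)
descriptors = np.random.rand(500, 128)                  # 500 region descriptors
centers = np.random.rand(100, 128)                      # codebook of 100 visual words
h = bow_histogram(descriptors, centers)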
Definition of k-means clustering
• Given: data set of N points xn, n=1,…,N
• Goal: find K cluster centers mk, k=1,…,K
• Clustering: assign each point to closest center
• Error criterion: sum of squared distances to closest
cluster center for each data point
E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \min_k \| x_n - m_k \|_2^2
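This error criterion translates directly into code; the short sketch below evaluates it for a given data matrix and set of centers (illustrative names, NumPy assumed).

import numpy as np

def kmeans_error(X, centers):
    # Sum over all points of the squared distance to the closest center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, K) distances
    return d2.min(axis=1).sum()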
Examples of k-means clustering
• Data uniformly sampled in R2
• Data non-uniformly sampled in R3
Minimizing the error function
E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \min_k \| x_n - m_k \|_2^2
• The error function is
– non-differentiable due to the min operator
– non-convex, i.e. there are local minima
• Minimization can be done with an iterative algorithm (a code sketch follows below):
1) Initialize the cluster centers
2) Assign each data point to the nearest center
3) Update the cluster centers as the mean of the associated data
4) If the cluster centers changed: return to step 2
5) Return the cluster centers
• Iterations monotonically decrease the error function
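A minimal NumPy sketch of this iterative algorithm, initializing the centers with K randomly chosen data points (one of several possible initializations); the function and variable names are illustrative.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    # 1) Initialize cluster centers with K random data points
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each data point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        z = d2.argmin(axis=1)
        # 3) Update each center as the mean of its associated data
        new_centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                                for k in range(K)])
        # 4) Stop if the centers did not change, else iterate again
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # 5) Return cluster centers (and the final assignments)
    return centers, z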
Iteratively minimizing the error function
• Introduce a "latent" variable z_n with value in {1, …, K}
2) Assignment of data point x_n to one of the clusters: z_n

E(\{m_k\}_{k=1}^{K}) = \sum_{n=1}^{N} \min_k \| x_n - m_k \|_2^2 \le \sum_{n=1}^{N} \| x_n - m_{z_n} \|_2^2 = B(\{m_k\}_{k=1}^{K}, \{z_n\}_{n=1}^{N})

• Upper bound on the error function, without the min operator
– Error function and bound are equal for the "min" assignment
– Minimize the bound w.r.t. the cluster centers
3) Update the cluster centers as the mean of the associated data

\frac{\partial}{\partial m_k} \sum_{n: z_n = k} \| x_n - m_k \|_2^2 = -2 \sum_{n: z_n = k} (x_n - m_k) = 0

m_k^{*} = \sum_{n: z_n = k} x_n \Big/ \sum_{n: z_n = k} 1
Iteratively minimizing the error function
• Minimization can be done with iterative algorithm
1) Assign each data point to nearest center
• Construct tight bound on error function
2) Update the cluster centers as mean of associated data
• Minimize bound
E(\{m_k\}_{k=1}^{K}) \overset{1)}{=} B(\{m_k\}_{k=1}^{K}, \{z_n\}_{n=1}^{N}) \overset{2)}{\ge} B(\{m_k^{*}\}_{k=1}^{K}, \{z_n\}_{n=1}^{N}) \ge E(\{m_k^{*}\}_{k=1}^{K})
• Example of “Iterative bound optimization”
– The EM algorithm is another example
Examples of k-means clustering
• Several iterations with two centers
[Figure: cluster assignments over the iterations, with a plot of the error function]
Examples of k-means clustering
• Clustering RGB vectors of pixels in images
• Compression of the image file: N × 24 bits (a code sketch follows below)
– Store RGB values of the cluster centers: K × 24 bits
– Store the cluster index of each pixel: N × log₂K bits
[Example images compressed to 4.2%, 8.3%, and 16.7% of the original file size]
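The compression idea above can be sketched with a few lines of code; here scikit-learn's KMeans is used for convenience, and the image array is a random stand-in for a real photo.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: H x W x 3 array of 8-bit RGB values
image = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
pixels = image.reshape(-1, 3).astype(float)

K = 16                                        # 16 colors -> log2(16) = 4 bits per pixel
km = KMeans(n_clusters=K, n_init=5).fit(pixels)

# Reconstruct the image from the cluster index of each pixel
quantized = km.cluster_centers_[km.labels_].reshape(image.shape).astype(np.uint8)

# Storage cost: K x 24 bits for the centers + N x log2(K) bits for the indices
N = pixels.shape[0]
bits = K * 24 + N * np.log2(K)
print(bits / (N * 24))                        # fraction of the original 24 bits per pixel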
Clustering with Gaussian mixture density
• Each cluster represented by Gaussian density
– Center, as in k-means
– Covariance matrix: cluster spread around center
p(x) = N(x; m, C) = \frac{1}{(2\pi)^{d/2} |C|^{1/2}} \exp\left( -\frac{1}{2} (x - m)^T C^{-1} (x - m) \right)

where d is the data dimension, |C| the determinant of the covariance matrix C, and the exponent a quadratic function of the point x and the mean m.
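This density translates directly into code; the sketch below is a plain NumPy transcription (scipy.stats.multivariate_normal.pdf would give the same values), with illustrative names.

import numpy as np

def gauss_density(x, m, C):
    # Evaluate the multivariate Gaussian N(x; m, C) at a single point x
    d = len(m)
    diff = x - m
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(C))
    quad = diff @ np.linalg.solve(C, diff)    # (x - m)^T C^{-1} (x - m)
    return np.exp(-0.5 * quad) / norm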
Clustering with Gaussian mixture density
• Mixture density is weighted sum of Gaussians
– Mixing weight: importance of each cluster
p(x) = \sum_{k=1}^{K} \pi_k \, N(x; m_k, C_k)

• The density has to integrate to 1, so we require

\pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
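The mixture density is then a one-line extension of the Gaussian density; the sketch below reuses the gauss_density helper defined above (illustrative names for the weights, means, and covariances).

import numpy as np

def mog_density(x, pis, means, covs):
    # p(x) = sum_k pi_k N(x; m_k, C_k), assuming pis are non-negative and sum to 1
    return sum(pi * gauss_density(x, m, C) for pi, m, C in zip(pis, means, covs))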
Clustering with Gaussian mixture density
• Given: data set of N points xn, n=1,…,N
• Find mixture of Gaussians (MoG) that best explains data
– Assigns maximum likelihood to the data
– Assume data points are drawn independently from MoG
p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)

\theta = \{\pi_k, m_k, C_k\}_{k=1}^{K}

L(\theta) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)

– Maximize the log-likelihood of the fixed data set X w.r.t. the parameters θ of the MoG
• As with k-means, the objective function has local optima
– Can use Expectation-Maximization (EM) algorithm
– Similar to the iterative k-means algorithm
Assignment of data points to clusters
• As with k-means zn indicates cluster index for xn
• To sample point from MoG
– Select cluster index k with probability given by mixing weight
– Sample point from the k-th Gaussian
– MoG is recovered if we marginalize over the unknown index
p(z_n = k) = \pi_k

p(x_n \mid z_n = k) = N(x_n; m_k, C_k)

p(x_n) = \sum_{k=1}^{K} p(z_n = k) \, p(x_n \mid z_n = k) = \sum_{k=1}^{K} \pi_k \, N(x_n; m_k, C_k)
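The two-step sampling procedure above can be written as a short sketch (pick a component with probability π_k, then draw from that Gaussian); names are illustrative and NumPy's random Generator is assumed.

import numpy as np

def sample_mog(n_samples, pis, means, covs, seed=0):
    # Draw samples from a mixture of Gaussians
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        k = rng.choice(len(pis), p=pis)                              # select cluster index k with prob. pi_k
        samples.append(rng.multivariate_normal(means[k], covs[k]))   # sample from the k-th Gaussian
    return np.array(samples)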
Soft assignment of data points to clusters
• Given data point xn, infer value of zn
– Conditional probability of zn given xn
p(z_n = k \mid x_n) = \frac{p(z_n = k, x_n)}{p(x_n)} = \frac{p(z_n = k) \, p(x_n \mid z_n = k)}{p(x_n)} = \frac{\pi_k \, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s \, N(x_n; m_s, C_s)}
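These posteriors can be computed for all points and clusters at once; the sketch below reuses the gauss_density helper from above and uses illustrative names.

import numpy as np

def responsibilities(X, pis, means, covs):
    # p(z_n = k | x_n) for every data point n (rows) and component k (columns)
    R = np.array([[pi * gauss_density(x, m, C)
                   for pi, m, C in zip(pis, means, covs)] for x in X])
    return R / R.sum(axis=1, keepdims=True)   # normalize over the components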
Maximum likelihood estimation of Gaussian
• Given data points xn, n=1,…,N
• Find Gaussian that maximizes data log-likelihood
L(\theta) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log N(x_n; m, C) = \sum_{n=1}^{N} \left[ -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |C| - \frac{1}{2} (x_n - m)^T C^{-1} (x_n - m) \right]
• Set derivative of data log-likelihood w.r.t. parameters to zero
\frac{\partial L(\theta)}{\partial m} = \sum_{n=1}^{N} C^{-1} (x_n - m) = 0 \quad\Rightarrow\quad m = \frac{1}{N} \sum_{n=1}^{N} x_n

\frac{\partial L(\theta)}{\partial C^{-1}} = \sum_{n=1}^{N} \left[ \frac{1}{2} C - \frac{1}{2} (x_n - m)(x_n - m)^T \right] = 0 \quad\Rightarrow\quad C = \frac{1}{N} \sum_{n=1}^{N} (x_n - m)(x_n - m)^T

• Parameters are set as the data mean and covariance
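In code, these maximum likelihood estimates are simply the sample mean and the (biased, 1/N) sample covariance; the sketch assumes X is an (N, d) NumPy array.

import numpy as np

def gaussian_mle(X):
    # Maximum likelihood mean and covariance of a single Gaussian
    m = X.mean(axis=0)                        # m = (1/N) sum_n x_n
    diff = X - m
    C = diff.T @ diff / len(X)                # C = (1/N) sum_n (x_n - m)(x_n - m)^T
    return m, C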
Maximum likelihood estimation of MoG
• Use the EM algorithm (a code sketch follows below)
– Initialize the MoG: parameters or soft-assignments
– E-step: soft-assign data points to clusters
– M-step: update the cluster parameters
– Repeat the EM steps, terminate if converged
• Convergence of parameters or assignments
• E-step: compute the posterior on z_n given x_n:

q_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k \, N(x_n; m_k, C_k)}{\sum_{s=1}^{K} \pi_s \, N(x_n; m_s, C_s)}
• M-step: update the Gaussians from the data points, weighted by the posteriors:

\pi_k = \frac{1}{N} \sum_{n=1}^{N} q_{nk}

m_k = \frac{1}{N_k} \sum_{n=1}^{N} q_{nk} \, x_n, \qquad N_k = \sum_{n=1}^{N} q_{nk}

C_k = \frac{1}{N_k} \sum_{n=1}^{N} q_{nk} \, (x_n - m_k)(x_n - m_k)^T
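A compact sketch that puts the E-step and M-step together, initialized from a random hard assignment; the data X is assumed to be an (N, d) NumPy array, scipy.stats.multivariate_normal is used to evaluate the Gaussians, and the small ridge added to the covariances for numerical stability is an assumption, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, K, n_iter=50, seed=0):
    # EM for a mixture of Gaussians (illustrative sketch)
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialize with a random hard assignment, treated as soft assignments q_nk
    q = np.eye(K)[rng.integers(K, size=N)]                 # N x K
    for _ in range(n_iter):
        # M-step: update mixing weights, means, and covariances
        Nk = q.sum(axis=0)                                 # effective counts N_k
        pis = Nk / N
        means = (q.T @ X) / Nk[:, None]
        covs = []
        for k in range(K):
            diff = X - means[k]
            Ck = (q[:, k, None] * diff).T @ diff / Nk[k]
            covs.append(Ck + 1e-6 * np.eye(d))             # ridge for stability (assumption)
        # E-step: recompute the posteriors q_nk
        q = np.column_stack([pis[k] * multivariate_normal.pdf(X, means[k], covs[k])
                             for k in range(K)])
        q /= q.sum(axis=1, keepdims=True)
    return pis, means, covs, q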
Maximum likelihood estimation of MoG
• Example of several EM iterations
Clustering with k-means and MoG
• Hard assignment in k-means is not robust near the borders of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
– Run from several initializations
– Keep the best result
• The number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [Gemert et al, IEEE TPAMI, 2010]
Plan for this course
1) Introduction to machine learning
2) Clustering techniques: k-means, Gaussian mixture density
• Reading for next week:
– Neal & Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants", in "Learning in Graphical Models", 1998
– Part of chapter 3 of my thesis
– Both available on the course website http://lear.inrialpes.fr/~verbeek/teaching
3) Gaussian mixture density continued: parameter estimation with EM, Fisher kernels
4) Classification techniques 1: introduction, generative methods, semi-supervised
5) Classification techniques 2: discriminative methods, kernels
6) Decomposition of images: topic models, …