Clustering - Microsoft Research
Clustering
How are we doing on the pass sequence?
Pretty good! We can now automatically learn the features needed to track both people
(same set of weights used for all hidden units).
But it sucks that we need to hand-label the coordinates of both men in 30 frames and hand-label the 2 classes for the white-shirt tracker.
Unsupervised learning
• Goal: Learn a machine or model that explains the data using a predefined set of assumptions about how the explanations can work
• There isn't any labeled data, just input patterns (i.e., instead of (x,t) pairs, we have only x's)
• Examples of unsupervised learning
  – Clustering (e.g., k-means clustering)
  – Dimensionality reduction (e.g., PCA)
  – Data compression methods (e.g., ZIP, JPEG, MPEG)
  – Generative models & Bayesian networks (tomorrow)
A simple example of unsupervised learning
• Our explanation of the data is that each training case is equal to an unknown number μ plus zero-mean Gaussian noise with unknown variance σ²
• Unsupervised learning:
  – Determine μ and σ², and
  – For each training case, determine the noise signal
[Figure: the noise signal for training case xn]
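As a toy illustration of this setup (my own sketch, not from the talk; `x` is assumed to be a 1-D NumPy array of training cases), maximum-likelihood estimation gives:

```python
import numpy as np

def fit_gaussian(x):
    """Estimate the unknown mean and noise variance, and recover each case's noise signal."""
    mu = x.mean()          # estimate of the unknown number (the mean)
    var = x.var()          # estimate of the noise variance sigma^2
    noise = x - mu         # noise signal for each training case x_n
    return mu, var, noise
```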
A simple example of unsupervised learning
• What can this model be used for?
• Outlier detection / novelty detection
– Given a new input xN+1, define p as follows (see figure):
  If xN+1 < μ, let p = ∫_{z=−∞}^{xN+1} N(z|μ,σ²) dz, and otherwise let p = ∫_{z=xN+1}^{∞} N(z|μ,σ²) dz
– Then, we can say that the new input xN+1 is an outlier ("unusual") if p < 0.01
• Preference prediction
  – Given a set of new inputs, we can rank them according to probability
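A small sketch of the outlier test and the probability-based ranking above, assuming SciPy's normal distribution is available (`scipy.stats.norm`); this is illustrative code, not the talk's:

```python
import numpy as np
from scipy.stats import norm

def outlier_p(x_new, mu, var):
    """Tail probability p of a new input under N(mu, sigma^2); small p means 'unusual'."""
    sigma = np.sqrt(var)
    if x_new < mu:
        return norm.cdf(x_new, loc=mu, scale=sigma)   # integral from -inf up to x_new
    return norm.sf(x_new, loc=mu, scale=sigma)        # integral from x_new up to +inf

def rank_by_probability(xs, mu, var):
    """Preference prediction: order new inputs from most to least probable."""
    return sorted(xs, key=lambda x: -norm.pdf(x, loc=mu, scale=np.sqrt(var)))

# e.g. flag x_new as an outlier if outlier_p(x_new, mu, var) < 0.01
```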
Problem
[Figure: a Gaussian estimated from training data, with the training data shown; a test case that has higher density than any of the training cases, but is quite probably an outlier]
K-means clustering
[Figure: data and initial (random) means; assign each point to its nearest mean; set each mean to the average of its data; iterations 1-4 and convergence]
K-means clustering
• Suppose x1,…,xN are real-valued or vector-valued
• Goal: Learn K means and assign each training case to a mean
• Denote the means by μ1,…,μK
• Use one-of-K encoding for the assignments: rnk = 1 if xn is assigned to μk, and rnk = 0 otherwise
• The goal of k-means clustering is to find the r's and μ's that minimize the cost function
  J = Σn Σk rnk ||xn − μk||²
• Generally, finding an exact solution takes time that is exponential in N
K-means clustering
• Note that given the μ's we can efficiently find the best r's, and given the r's, we can efficiently find the best μ's
• Here's the algorithm:
  Pick initial means randomly or cleverly
  Loop until convergence
  – Assign each training case to its nearest mean: rnk = 1 if k = argminj ||xn − μj||², and rnk = 0 otherwise
  – Set each mean to the average of its training cases: μk = Σn rnk xn / Σn rnk
• Each step minimizes J with respect to the r's or the μ's, but the procedure is not guaranteed to find the global minimum of J (see the sketch below)
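A minimal NumPy sketch of this alternating loop (my own illustrative code; `X` is an N×D array):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternately assign cases to their nearest mean and recompute the means."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]              # pick initial means randomly
    for _ in range(n_iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances to each mean
        r = d.argmin(axis=1)                                  # assign each x_n to its nearest mean
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])                # set each mean to its cluster average
        if np.allclose(new_mu, mu):                           # converged: J stops decreasing
            break
        mu = new_mu
    return mu, r
```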
Example
[Figure: data and initial (random) means; assign each point to its nearest mean; set each mean to the average of its data; iterations 1-4 and convergence]
Example: Color quantization
• Consider each image pixel as a 3-D vector (RGB) and use K-means clustering to find K "prototype colors"
• Now, each pixel in the image can be stored using log2(K) bits, with some loss in color quality (see the sketch below)
[Figure: the original image at 24 bits per pixel, and quantized versions at 1, 1.6 and 3.3 bits per pixel]
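Reusing the `kmeans` sketch above, color quantization might look like this (illustrative only; `image` is assumed to be an H×W×3 RGB array):

```python
import numpy as np

def quantize_colors(image, K):
    """Map every pixel to its nearest of K prototype colors, so each pixel needs log2(K) bits."""
    pixels = image.reshape(-1, 3).astype(float)        # treat each pixel as a 3-D RGB vector
    prototypes, assignment = kmeans(pixels, K)         # learn K "prototype colors"
    return prototypes[assignment].reshape(image.shape) # rebuild the image from the prototypes
```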
EM for a mixture of Gaussians
• Initialization: Pick the μ's, Φ's and π's randomly (but validly)
• E step: For each training case, we need
  q(z) = p(z|x) = p(x|z)p(z) / (Σz p(x|z)p(z))
  Defining γ(znk) = q(znk = 1), we need to actually compute
  γ(znk) = πk N(xn|μk,Φk) / Σj πj N(xn|μj,Φj)
  Do it in the log-domain!
• M step: Re-estimate the parameters from the responsibilities:
  πk = (1/N) Σn γ(znk),  μk = Σn γ(znk) xn / Σn γ(znk),  Φk = Σn γ(znk)(xn − μk)(xn − μk)ᵀ / Σn γ(znk)
  Recall: For labeled data, γ(znk) = znk
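A compact sketch of these two steps for diagonal-covariance Gaussians (matching the Φ's used in the image example that follows), with the E step computed in the log-domain; this is my own illustrative code, not the talk's:

```python
import numpy as np
from scipy.special import logsumexp

def em_mog(X, K, n_iters=50, seed=0):
    """EM for a mixture of K diagonal-covariance Gaussians on an N x D data matrix X."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                  # mixing proportions
    mu = X[rng.choice(N, K, replace=False)]                   # means
    phi = np.tile(X.var(axis=0), (K, 1)) + 1e-6               # diagonal covariances
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk) = q(z_nk = 1), computed in the log-domain
        log_p = (np.log(pi)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * phi)[None, :, :]
                                + (X[:, None, :] - mu[None, :, :]) ** 2 / phi[None, :, :],
                                axis=2))
        gamma = np.exp(log_p - logsumexp(log_p, axis=1, keepdims=True))
        # M step: re-estimate pi, mu, phi from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        phi = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, phi, gamma
```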
EM for mixture of Gaussians: E step
[Figure: a two-component model with π1 = 0.5 and π2 = 0.5, and with μ1, μ2 and Φ1, Φ2 shown as images; for successive images z from the data set, the posteriors P(c=1|z) / P(c=2|z) come out to 0.52/0.48, 0.51/0.49, 0.48/0.52 and 0.43/0.57]
EM for mixture of Gaussians: M step
[Figure: π1 = 0.5, π2 = 0.5; μ1, μ2 and Φ1, Φ2 shown as images]
Set μ1 to the average of z·P(c=1|z)
Set μ2 to the average of z·P(c=2|z)
EM for mixture of Gaussians: M step
[Figure: π1 = 0.5, π2 = 0.5; μ1, μ2 and Φ1, Φ2 shown as images]
Set Φ1 to the average of diag((z − μ1)(z − μ1)ᵀ)·P(c=1|z)
Set Φ2 to the average of diag((z − μ2)(z − μ2)ᵀ)·P(c=2|z)
… after iterating to convergence:
[Figure: converged parameters; π1 = 0.6 with μ1 and Φ1 shown as images, π2 = 0.4 with μ2 and Φ2 shown as images]
Non-vector-space clustering
For K-means clustering and EM,
• The data must lie in a vector space, where Euclidean distance is a natural measure of similarity
  – Some methods (such as mixtures of Gaussians) allow each cluster to stretch and rotate its data, but these methods are still essentially based on Euclidean distance
• There can be advantages to using kernels k(xi,xk) to measure similarity and then making predictions using training cases
• Now, we will study this approach for clustering, using s(i,k) to denote the similarity of xi to xk (these are like kernels, but not necessarily formally the same; see the small example below)
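For instance, the similarity used later for the digits, s(i,k) = −||xi − xk||², can be precomputed as a matrix (a small illustrative helper, not from the talk):

```python
import numpy as np

def negative_squared_distances(X):
    """Similarity matrix with s(i,k) = -||x_i - x_k||^2 for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return -(diff ** 2).sum(axis=-1)
```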
K-centers clustering
(aka K-medoids clustering and K-medians clustering)
[Figure: randomly choose initial exemplars (data centers); assign data points to nearest centers; for each cluster, pick the best new center; repeat; convergence: final set of exemplars (centers)]
The K-centers clustering algorithm
• Denote the index of the center that xi is currently assigned to by ci (ci = i indicates that xi is a center)
• Initialization: Randomly pick K points and set ck = k
• Loop until convergence
  – For all i s.t. ci ≠ i, set ci = argmax_{k: ck = k} s(i,k)
  – For all k s.t. ck = k:
    • Compute knew = argmax_{i: ci = k} Σ_{j: cj = k} s(j,i)
    • For all i s.t. ci = k, set ci = knew
• This is similar to K-means clustering, except that the means are always data points (see the sketch below)
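A sketch of this loop given a precomputed similarity matrix `S` with `S[i, k] = s(i, k)` (for example from the helper above); the code is my own and unoptimized:

```python
import numpy as np

def k_centers(S, K, n_iters=100, seed=0):
    """K-centers clustering: like K-means, except the centers are always data points."""
    N = S.shape[0]
    rng = np.random.default_rng(seed)
    centers = rng.choice(N, K, replace=False)            # randomly pick K exemplars
    for _ in range(n_iters):
        c = centers[S[:, centers].argmax(axis=1)]        # assign each point to its most similar center
        c[centers] = centers                             # a center stays assigned to itself
        new_centers = np.array([
            members[S[np.ix_(members, members)].sum(axis=0).argmax()]
            for members in (np.flatnonzero(c == k) for k in centers)
        ])                                               # best new center maximizes within-cluster similarity
        if set(new_centers) == set(centers):             # exemplar set stopped changing
            break
        centers = new_centers
    c = centers[S[:, centers].argmax(axis=1)]            # final assignment to the final centers
    c[centers] = centers
    return centers, c
```

For the Buffalo digits on the next slide, S would simply be the matrix of negative squared distances between digit images.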
Handwritten digit clustering and recognition
Example: Clustering 3000 Buffalo digits
Similarity: s(i,k) = −||xi − xk||²
[Figure: exemplars found for K = 10, 20, 40 and 80]
The effect of random initialization
[Figure: squared error of k-centers clustering over many random initializations, compared with affinity propagation and with the solution that minimizes distances; how good is the solution?]
Affinity propagation
(Frey & Dueck, Science, Feb 2007)
[Animation: affinity propagation on 2-D data over iterations 1-15; the axes run from -5 to +5, and each point is shaded on a scale from non-exemplar to exemplar]
Affinity propagation
[Figure: iteration 15 of affinity propagation (axes -5 to +5, points shaded from non-exemplar to exemplar); caption: "Solution that minimizes distances"]
How does affinity propagation work?
• The sum-product or max-sum algorithms (loopy belief propagation) can be used to approximately maximize the k-centers objective function (more on this tomorrow)
• Result – Affinity Propagation: Data points exchange 'responsibilities' and 'availabilities', as sketched in the diagrams below
Sending responsibilities
[Diagram: data instance i sends responsibility r(i,k) to candidate exemplar k, taking into account the availabilities a(i,k') it has received from competing candidate exemplars k']
Sending availabilities
[Diagram: candidate exemplar k sends availability a(i,k) to data instance i, taking into account the responsibilities r(i',k) it has received from supporting data instances i']
Making decisions: each data point i picks as its exemplar the k that maximizes a(i,k) + r(i,k)
Message damping
• Unstable dynamics are avoided in practice by damping the messages:
  r(i,k)* = λ·r(i,k) + (1−λ)·r(i,k)old
  a(i,k)* = λ·a(i,k) + (1−λ)·a(i,k)old
• Default: λ = 0.5
MATLAB implementation
(from http://www.psi.toronto.edu/affinitypropagation)
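The MATLAB listing itself did not survive in this transcript; as a stand-in, here is a rough Python sketch of the responsibility/availability updates from Frey & Dueck (2007), with the preferences assumed to sit on the diagonal of `S` (my own code, not the psi.toronto.edu implementation):

```python
import numpy as np

def affinity_propagation(S, n_iters=200, lam=0.5):
    """Pass responsibilities r(i,k) and availabilities a(i,k) until exemplars emerge."""
    N = S.shape[0]
    R = np.zeros((N, N))     # responsibilities
    A = np.zeros((N, N))     # availabilities
    idx = np.arange(N)
    for _ in range(n_iters):
        # Responsibilities: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        top = AS.argmax(axis=1)
        best = AS[idx, top]
        AS[idx, top] = -np.inf
        R_new = S - best[:, None]
        R_new[idx, top] = S[idx, top] - AS.max(axis=1)
        R = lam * R_new + (1 - lam) * R                   # damping, as on the slide above
        # Availabilities: a(i,k) = min(0, r(k,k) + sum of positive r(i',k) from other points)
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())                # keep the self-responsibility r(k,k)
        A_new = np.minimum(0, Rp.sum(axis=0)[None, :] - Rp)
        np.fill_diagonal(A_new, Rp.sum(axis=0) - Rp.diagonal())   # a(k,k)
        A = lam * A_new + (1 - lam) * A
    # Decisions: k is an exemplar if a(k,k) + r(k,k) > 0; other points join their most similar exemplar
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    labels = exemplars[S[:, exemplars].argmax(axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels
```

With λ = 0.5 this matches the default damping above; the diagonal of S plays the role of the preferences, which influence how many exemplars emerge.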
Document summarization using affinity propagation
s(sentence i, sentence k) = - Number of bits needed to encode the words in sentence i using the words in sentence k and a global dictionary
Preference(sentence k) = - Number of bits needed to encode the words in sentence k using only the global dictionary
1) Affinity propagation identifies exemplars by recursively sending real-valued messages between pairs of data points.
2) The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure.
3) The availability is set to the self-responsibility plus the sum of the positive responsibilities the candidate exemplar receives from other points.
4) For different numbers of clusters, the reconstruction errors achieved by affinity propagation and k-centers clustering are compared.
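The talk does not spell out the coding model behind the bit counts defined at the top of this slide, so the following is only a hypothetical sketch of one way such a similarity could be computed: words that appear in the reference sentence are coded with a short index into it, and all other words fall back to a global unigram dictionary (both choices are my assumptions):

```python
import math
from collections import Counter

def encoding_bits(sentence, reference, global_counts, total_words):
    """Hypothetical cost in bits of coding `sentence` using the words of `reference`
    plus a global dictionary; assumes every word has a nonzero global count."""
    ref_vocab = set(reference)
    bits = 0.0
    for w in sentence:
        if w in ref_vocab:
            bits += math.log2(len(ref_vocab) + 1)                   # short index into the reference sentence
        else:
            bits += -math.log2(global_counts[w] / total_words)      # fall back to the global dictionary
    return bits

# counts = Counter(w for s in sentences for w in s); total = sum(counts.values())
# s(i, k)       = -encoding_bits(sentences[i], sentences[k], counts, total)
# preference(k) = -encoding_bits(sentences[k], [], counts, total)   # global dictionary only
```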
Questions?
How are we doing on the pass sequence?
Can clustering be used to automatically learn the two modes of tracking for the man in the white shirt?
• Maybe, but this system is getting too complex!
• Is there any simple way to put the pieces together…?