Transcript: Clustering
Applied Human Computer Interaction, Week 5
Yan Ke
Basic Clustering Methods
• K-means
• Expectation Maximization (EM)
Some Data
This could easily be modeled by a Gaussian Mixture (with 5 components).
But let's look at a satisfying, friendly and infinitely popular alternative…
Copyright © 2001, 2004, Andrew W. Moore
Lossy Compression
Suppose you transmit the coordinates of points drawn randomly from this dataset.
You can install decoding software at the receiver.
You're only allowed to send two bits per point.
It'll have to be a "lossy transmission".
Loss = Sum Squared Error between decoded coords and original coords.
What encoder/decoder will lose the least information?

Idea One
Break into a grid, decode each bit-pair as the middle of each grid-cell.
(Grid cells labelled 00, 01, 10, 11.)
Idea Two
Break into a grid, decode each bit-pair as the centroid of all data in that grid-cell.
(Grid cells labelled 00, 01, 10, 11.)
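A minimal sketch (hypothetical uniform data and a fixed 2×2 grid assumed) comparing the sum-squared error of the two decoders: the grid-cell middle (Idea One) versus the centroid of the data in that cell (Idea Two).

import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))          # hypothetical dataset in the unit square

# Encode: which of the four grid cells (00, 01, 10, 11) does each point fall in?
cell = (points[:, 0] > 0.5).astype(int) * 2 + (points[:, 1] > 0.5).astype(int)

# Idea One: decode each bit-pair as the middle of its grid cell
middles = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
sse_middle = ((points - middles[cell]) ** 2).sum()

# Idea Two: decode each bit-pair as the centroid of the data in that cell
centroids = np.array([points[cell == c].mean(axis=0) for c in range(4)])
sse_centroid = ((points - centroids[cell]) ** 2).sum()

print(sse_middle, sse_centroid)         # the centroid decoder is never worse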
K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!
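A minimal K-means sketch in Python/numpy following the numbered steps above; the data X and k=5 are hypothetical, for illustration only.

import numpy as np

def kmeans(X, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # step 2: random initial Centers
    for _ in range(iters):
        # step 3: each datapoint finds its closest Center
        owner = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # steps 4-5: each Center jumps to the centroid of the points it owns
        new_centers = np.array([X[owner == j].mean(axis=0) if np.any(owner == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):            # step 6: repeat until nothing moves
            break
        centers = new_centers
    return centers, owner

X = np.random.default_rng(1).random((500, 2))            # hypothetical 2-D data
centers, owner = kmeans(X, k=5)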
K-means: Start
Advance apologies: in Black and White this example will deteriorate.
Example generated by Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99). (Available on www.autonlab.org/pap.html)
K-means continues…
[Sequence of figures showing the Centers moving at each iteration; not reproduced here]
K-means terminates
K-means Questions
• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?
…we'll deal with these questions over the next few slides
Will we find the optimal configuration?
• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)
Trying to find good optima
• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.
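As an illustration of Idea 2, a small sketch (assuming scikit-learn is available and hypothetical data): run K-means from several random start configurations and keep the lowest-distortion result; scikit-learn's n_init parameter does exactly this.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((500, 2))   # hypothetical data

# n_init=20: run K-means from 20 random start configurations and
# keep the run with the lowest distortion (inertia_ = sum squared error)
km = KMeans(n_clusters=5, n_init=20, random_state=0).fit(X)
print(km.inertia_, km.cluster_centers_)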
Common uses of K-means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
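A minimal sketch of that last use (assuming scikit-learn and a stand-in RGB image array img of shape (H, W, 3)): K-means picks a k-colour palette and each pixel is replaced by its nearest palette entry.

import numpy as np
from sklearn.cluster import KMeans

img = np.random.default_rng(0).random((64, 64, 3))        # stand-in for a real RGB image
pixels = img.reshape(-1, 3)

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_                              # the 16-colour palette
quantized = palette[km.labels_].reshape(img.shape)         # each pixel -> nearest palette colour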
Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster"
2. Find "most similar" pair of clusters
3. Merge it into a parent cluster
4. Repeat…until you've merged the whole dataset into one cluster

How do we define similarity between clusters?
• Minimum distance between points in clusters (in which case we're simply doing Euclidean Minimum Spanning Trees)
• Maximum distance between points in clusters
• Average distance between points in clusters

You're left with a nice dendrogram, or taxonomy, or hierarchy of datapoints (not shown here).
Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym).
Single Linkage Comments
• It's nice that you get a hierarchy instead of an amorphous collection of groups
• If you want k groups, just cut the (k-1) longest links
• There's no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.
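A minimal sketch (assuming SciPy is available, hypothetical data) of agglomerative clustering with the three similarity definitions above; fcluster with criterion='maxclust' effectively cuts the longest links to leave k groups.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).random((50, 2))     # hypothetical datapoints

Z = linkage(X, method='single')                  # single linkage: minimum distance between clusters
# method='complete' -> maximum distance, method='average' -> average distance
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree so that k=3 groups remain
# dendrogram(Z)                                  # draws the hierarchy / taxonomy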
What you should know
• All the details of K-means
• The theory behind K-means as an optimization algorithm
• How K-means can get stuck
• The outline of Hierarchical clustering
• Be able to contrast which problems would be relatively well/poorly suited to K-means vs Gaussian Mixtures vs Hierarchical clustering
EM Algorithm
Likelihood, Mixture Models and
Clustering
Introduction
• In the last class the K-means algorithm for clustering was introduced.
• The two steps of K-means, assignment and update, appear frequently in data mining tasks.
• In fact a whole framework under the title "EM Algorithm", where EM stands for Expectation and Maximization, is now a standard part of the data mining toolkit.
Outline
• What is Likelihood?
• Examples of Likelihood estimation
• Information Theory – Jensen's Inequality
• The EM Algorithm and Derivation
• Example of Mixture Estimations
• Clustering as a special case of Mixture Modeling
Meta-Idea
[Diagram: a Probability Model generates Data; Inference (Likelihood) goes from the Data back to the Model]
A model of the data-generating process gives rise to data. Model estimation from data is most commonly done through likelihood estimation.
From PDM by HMS
Likelihood Function
P(Model | Data) = P(Data | Model) · P(Model) / P(Data)
Find the "best" model which has generated the data. In a likelihood function the data is considered fixed and one searches for the best model over the different choices available.
Model Space
• The choice of the model space is plentiful but not unlimited.
• There is a bit of "art" in selecting the appropriate model space.
• Typically the model space is assumed to be a linear combination of known probability distribution functions.
Examples
• Suppose we have the following data: 0,1,1,0,0,1,1,0
• In this case it is sensible to choose the Bernoulli distribution B(p) as the model space.
• Now we want to choose the best p, i.e., the value of p that makes this data most likely.
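For this particular dataset (four 1s and four 0s), a worked version of that likelihood, assuming the eight observations are independent draws from B(p):
L(p) = ∏ p^x(i) · (1 − p)^(1 − x(i)) = p^4 · (1 − p)^4
which is largest at p̂ = 4/8 = 0.5.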
Examples
• Suppose the following are marks in a course: 55.5, 67, 87, 48, 63
• Marks typically follow a Normal distribution N(μ, σ²), whose density function is
  f(x; μ, σ) = (1/(σ√(2π))) · exp(−(x − μ)²/(2σ²))
• Now, we want to find the best μ and σ, such that the likelihood of these marks is maximized.
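A quick numeric check (a sketch using numpy; for a Normal model the closed-form MLE is the sample mean and the uncorrected sample standard deviation):

import numpy as np

marks = np.array([55.5, 67, 87, 48, 63])
mu_hat = marks.mean()          # MLE of the mean
sigma_hat = marks.std()        # MLE of sigma (divides by n, not n-1)
print(mu_hat, sigma_hat)       # approximately 64.1 and 13.2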
Examples
• Suppose we have data about heights of people (in cm): 185, 140, 134, 150, 170
• Heights follow a normal (log-normal) distribution, but men on average are taller than women. This suggests a mixture of two distributions.
Maximum Likelihood Estimation
• We have reduced the problem of selecting the best model to that of selecting the best parameter.
• We want to select a parameter p which will maximize the probability that the data was generated from the model with the parameter p plugged in.
• The parameter p is called the maximum likelihood estimator.
• The maximum of the function can be obtained by setting the derivative of the function to 0 and solving for p.
Two Important Facts
• If A1, …, An are independent then P(A1, …, An) = P(A1) · P(A2) · … · P(An).
• The log function is monotonically increasing: x ≤ y ⇒ log(x) ≤ log(y).
• Therefore, if a function f(x) ≥ 0 achieves a maximum at x1, then log(f(x)) also achieves a maximum at x1.
Example of MLE
• Now, choose p which maximizes L(p). Instead we will maximize l(p) = log L(p).
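A worked sketch of that maximization for the Bernoulli example (writing n1 for the number of 1s among n observations; n1 = 4, n = 8 in the data above):
l(p) = log L(p) = n1·log p + (n − n1)·log(1 − p)
dl/dp = n1/p − (n − n1)/(1 − p) = 0  ⇒  p̂ = n1/n = 4/8 = 0.5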
Properties of MLE
• There are several technical properties of the estimator, but let's look at the most intuitive one:
• As the number of data points increases, we become more sure about the parameter p.
Properties of MLE
r is the number of data points. As the number of data points increases, the confidence of the estimator increases.
Matlab commands
[phat,ci] = mle(Data,'distribution','Bernoulli');  % MLE of p for 0/1 data, with confidence interval ci
[phat,ci] = mle(Data,'distribution','Normal');     % MLE of the Normal parameters (mean and standard deviation)
MLE for Mixture Distributions
• When we proceed to calculate the MLE for a mixture, the presence of the sum of the distributions prevents a "neat" factorization using the log function.
• A completely new rethink is required to estimate the parameter.
• The new rethink also provides a solution to the clustering problem.
A Mixture Distribution
Missing Data
• We think of clustering as a problem of estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a missing data problem. Several other problems can be formulated as missing data problems.
Missing Data Problem
• Let D = {x(1), x(2), …, x(n)} be a set of n observations.
• Let H = {z(1), z(2), …, z(n)} be a set of n values of a hidden variable Z.
  - z(i) corresponds to x(i)
  - Assume Z is discrete.
EM Algorithm
• The log-likelihood of the observed data is
  l(θ) = log p(D | θ) = log Σ_H p(D, H | θ)
• Not only do we have to estimate θ, but also H.
• Let Q(H) be a probability distribution on the missing data.
EM Algorithm
  l(θ) = log Σ_H Q(H) · p(D, H | θ) / Q(H) ≥ Σ_H Q(H) log p(D, H | θ) − Σ_H Q(H) log Q(H) = F(Q, θ)
• The inequality is because of Jensen's Inequality.
• This means that F(Q, θ) is a lower bound on l(θ).
• Notice that the log of a sum has become a sum of logs.
EM Algorithm
• The EM Algorithm alternates between maximizing F with respect to Q (with θ fixed) and then maximizing F with respect to θ (with Q fixed).
EM Algorithm
• It turns out that the E-step is just: set Q(H) = p(H | D, θ).
• And, furthermore, with this choice the bound is tight: F(Q, θ) = l(θ).
• Just plug Q(H) = p(H | D, θ) into F to verify this.
EM Algorithm
• The M-step reduces to maximizing the first term of F (the Σ_H Q(H) log p(D, H | θ) term) with respect to θ, as there is no θ in the second term.
EM Algorithm for Mixture of Normals
[Slide shows the mixture-of-Normals density together with the E-step and M-step update equations]
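A minimal sketch of those updates, as my own illustration rather than the exact equations on the slide: EM for a 1-D mixture of K Normals in numpy, with hypothetical data echoing the earlier "heights" example.

import numpy as np

def em_gmm_1d(x, K=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = rng.choice(x, K, replace=False)          # initial means
    var = np.full(K, x.var())                     # initial variances
    for _ in range(iters):
        # E-step: responsibility of component k for point i, p(z=k | x(i), theta)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibility-weighted data
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# hypothetical data: two overlapping groups of heights
x = np.concatenate([np.random.default_rng(1).normal(162, 6, 200),
                    np.random.default_rng(2).normal(177, 7, 200)])
print(em_gmm_1d(x))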
EM and K-means
• Notice the similarity between EM for Normal mixtures and K-means.
• The expectation step is the assignment.
• The maximization step is the update of centers.
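For comparison, a sketch (assuming scikit-learn, hypothetical data) running both on the same data: KMeans makes hard assignments, while GaussianMixture, fitted with EM, gives soft responsibilities.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((300, 2))        # hypothetical data

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # hard assignments
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gm.predict_proba(X)                           # soft responsibilities from the E-step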
Utilizing Clustering Methods for Forecasting

Pattern Sequence Forecasting (PSF)
• PSF is a forecasting method that mixes traditional regression methods (such as moving average (MA) and auto-regression (AR)) with clustering methods such as K-means and EM.
• PSF can achieve highly accurate forecasts of electricity consumption.
Electricity Power Consumption
Forecasting
Daily Power Consumption Variation
Pattern Sequence Forecasting (PSF)
• Step 1: Pre-processing the data (re-organizing the table so each day is one row of hourly values plus a cluster class):

        1st hour   2nd hour   …   Class
  day1     …          …       …     1
  day2     …          …       …     2
  day3     …          …       …     3
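A sketch of this pre-processing step, as my own illustration with made-up hourly data (scikit-learn assumed): each day becomes a 24-dimensional load vector, K-means assigns each day a class, and the series of days becomes a sequence of class labels.

import numpy as np
from sklearn.cluster import KMeans

# made-up hourly consumption: one row per day, one column per hour
rng = np.random.default_rng(0)
daily_load = rng.random((90, 24))                 # 90 days x 24 hourly readings

# Step 1: label every day with the cluster of its daily pattern
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(daily_load)
label_sequence = km.labels_                       # e.g. [2, 0, 1, 1, ...] - one class per day

# PSF then looks for past occurrences of the most recent label pattern
# and forecasts from the days that followed those occurrences.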
Selecting the number of Clusters
Distribution of Weekdays for Clusters
Misclassification by K-means
Means of Daily Electricity Consumption
Pattern Sequence Forecasting (PSF): Predictions