
Clustering
Gilad Lerman
Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
What is Clustering?
• Partitioning data into classes with:
  - high intra-class similarity
  - low inter-class similarity
• Is it well-defined?

What is Similarity?

Clearly a subjective, problem-dependent measure
How Similar Are the Clusters?

Ex. 1: Two clusters or one cluster?
How Similar Are the Clusters?

Ex. 2: A cluster or outliers?
Sum-Squares Intra-class Similarity

Given a cluster $S_1 = \{x_1, \ldots, x_{N_1}\}$:

Mean: $\mu_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i$

Within Cluster Sum of Squares:
$\mathrm{WCSS}(S_1) = \sum_{x_i \in S_1} \| x_i - \mu_1 \|^2$, where $\| y \|^2 = \sum_{j=1}^{D} (y)_j^2$

Note that $\mu_1 = \operatorname{argmin}_c \sum_{x_i \in S_1} \| x_i - c \|^2$.
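To make the argmin note concrete, here is a minimal NumPy sketch (the toy cluster is made up, not from the slides) checking numerically that the cluster mean minimizes the within-cluster sum of squares:

```python
import numpy as np

rng = np.random.default_rng(0)
S1 = rng.normal(size=(50, 3))   # a made-up cluster of 50 points in R^3
mu1 = S1.mean(axis=0)           # cluster mean

def wcss_around(points, c):
    """Sum of squared Euclidean distances from the points to a candidate center c."""
    return float(np.sum(np.linalg.norm(points - c, axis=1) ** 2))

# Moving the center away from the mean never decreases the sum of squares.
for _ in range(1000):
    c = mu1 + rng.normal(scale=0.5, size=3)
    assert wcss_around(S1, c) >= wcss_around(S1, mu1)
```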
Within Cluster Sum of Squares

For a set of clusters $S = \{S_1, \ldots, S_K\}$:

$\mathrm{WCSS}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \| x_i - \mu_j \|^2$

Can use $\| y \|_1 = \sum_{j=1}^{D} |(y)_j|$ instead of $\| y \|_2 = \sqrt{\sum_{j=1}^{D} (y)_j^2}$,
so we get the Within Clusters Manhattan Distance:

$\mathrm{WCMD}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \| x_i - m_j \|_1$, where $m_j = \operatorname{argmin}_c \sum_{x_i \in S_j} \| x_i - c \|_1$

Question: how to compute/estimate $c$?
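A minimal NumPy sketch of both objectives for a given partition (the names `X` and `labels` are illustrative). Since the $\ell_1$ norm separates across coordinates, the coordinate-wise median of each cluster minimizes the inner sum, which is one standard answer to the question above.

```python
import numpy as np

def wcss(X, labels):
    """WCSS(S): sum over clusters of squared l2 distances to the cluster mean."""
    return sum(
        float(np.sum(np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0), axis=1) ** 2))
        for k in np.unique(labels)
    )

def wcmd(X, labels):
    """WCMD(S): sum over clusters of l1 distances to the coordinate-wise median,
    which minimizes the l1 objective within each cluster."""
    return sum(
        float(np.sum(np.abs(X[labels == k] - np.median(X[labels == k], axis=0))))
        for k in np.unique(labels)
    )
```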
Minimizing WCSS
• Precise minimization is “NP-hard”
• Approximate minimization for WCSS by K-means
• Approximate minimization for WCMD by K-medians
The K-means Algorithm
• Input: data & number of clusters (K)
• Randomly guess locations of the K cluster centers
• Assign each data point to its nearest center
• Recompute each center as the mean of the points assigned to it
• Repeat till convergence ….
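A minimal NumPy sketch of these steps (variable names and the simple initialization are illustrative, not the lecture's reference code). Replacing the mean update with a coordinate-wise median, and the squared-distance assignment with $\ell_1$ distances, gives K-medians.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```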
Demonstration: K-means/medians

Applet
K-means: Pros and Cons

Pros
• Often fast
• Often terminates at a local minimum

Cons
• May not obtain the global minimum
• Depends on initialization
• Need to specify K
• Sensitive to outliers
• Sensitive to variations in sizes and densities of clusters
• Not suitable for non-convex shapes
• Does not apply directly to categorical data
Spectral Clustering
Idea: embed data for easy clustering
• Construct weights based on proximity:
  $W_{ij} = e^{-\| x_i - x_j \|^2 / \sigma}$ if $i \neq j$, and $W_{ij} = 0$ otherwise
  (normalize $W$)
• Embed using eigenvectors of $W$
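A sketch of this embedding step, assuming a hand-picked scale `sigma` and the symmetric normalization $D^{-1/2} W D^{-1/2}$; both choices are illustrative, not prescribed by the slides. The rows of the returned matrix are the embedded points, which can then be clustered, e.g., by K-means.

```python
import numpy as np

def spectral_embedding(X, K, sigma=1.0):
    # Gaussian weights W_ij = exp(-||x_i - x_j||^2 / sigma), zero on the diagonal.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-d2 / sigma)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization: D^{-1/2} W D^{-1/2}, with D = diag of row sums.
    d = W.sum(axis=1)
    W_norm = W / np.sqrt(np.outer(d, d))
    # Embed each point via the top-K eigenvectors of the normalized W.
    eigvals, eigvecs = np.linalg.eigh(W_norm)
    return eigvecs[:, -K:]  # one row per data point
```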
Clustering vs. Classification
• Clustering: find classes in an unsupervised way (often K is given though)
• Classification: labels of clusters are given for some data points (supervised learning)
Data 1: Face images
• Facial images (e.g., of persons 5, 8, 10) live on different “planes” in the “image space”
• They are often well-separated, so simple clustering can apply to them (but not always…)
• Question: What is the high-dimensional image space?
• Question: How can we present high-dim. data in 3D?
Data 2: Iris Data Set
Setosa, Versicolor, Virginica
• 50 samples from each of 3 species
• 4 features per sample: length & width of sepal and petal
Data 2: Iris Data Set
• Setosa is clearly separated from the other 2
• Can’t separate Virginica and Versicolor (need a training set, as done by Fisher in 1936)
• Question: What are other ways to visualize?
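A quick sketch (assuming scikit-learn is available) of running K-means on the four Iris features; it typically isolates Setosa in its own cluster while mixing Versicolor and Virginica, consistent with the points above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 species
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Compare cluster labels against the species blocks (0-49, 50-99, 100-149).
print(labels[:50], labels[50:100], labels[100:], sep="\n")
```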
Data 3: Color-based Compression of Images
Applet
• Question: What are the actual data points?
• Question: What does the error mean?
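One common reading of such a demo, written as a hedged sketch (Pillow and scikit-learn assumed; the file name is a placeholder): each pixel's RGB triple is a data point, K-means clusters those triples, and every pixel is replaced by its cluster center.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float) / 255.0
pixels = img.reshape(-1, 3)  # data points: one RGB vector per pixel
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
compressed = km.cluster_centers_[km.labels_].reshape(img.shape)
# One way to read the "error": the WCSS of this partition, i.e. km.inertia_.
```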
Some methods for # of Clusters (with online codes)
• Gap statistics (a sketch follows below)
• Model-based clustering
• G-means
• X-means
• Data-spectroscopic clustering
• Self-tuning clustering
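As one concrete example from this list, a hedged sketch of the gap statistic (presumably the method of Tibshirani, Walther & Hastie, 2001): compare log WCSS on the data with its average over uniform reference datasets drawn from the data's bounding box, and pick the smallest K whose gap is within one standard error of the next one. The number of references and the simple bounding-box null are illustrative choices, not prescribed here.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in range(1, k_max + 1):
        # log WCSS on the data for k clusters.
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # log WCSS on uniform reference data drawn from the bounding box.
        log_ref = np.array([
            np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                   .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
            for _ in range(n_refs)
        ])
        gaps.append(log_ref.mean() - log_wk)
        sks.append(log_ref.std() * np.sqrt(1 + 1 / n_refs))
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k
    return k_max
```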
Your mission
• Learn about clustering (theoretical results, algorithms, codes)
• Focus: methods for determining # of clusters
  - Understand details
  - Compare using artificial and real data
  - Conclude good/bad scenarios for each (prove?)
  - Come up with new/improved methods
• Summarize info: literature survey and possibly new/improved demos/applets
• We can suggest additional questions tailored to your interest