Transcript: Clustering
Clustering
Gilad Lerman
Math Department, UMN
Slides/figures stolen from M.-A. Dillies, E. Keogh, A. Moore
What is Clustering?
Partitioning data into classes with
high intra-class similarity
low inter-class similarity
Is it well-defined?
What is Similarity?
Clearly a subjective measure, or at least problem-dependent
How Similar Are Clusters?
Ex. 1: Two clusters or one cluster?
How Similar Are Clusters?
Ex. 2: Cluster or outliers?
Sum-Squares Intra-class Similarity
Given cluster $S_1 = \{x_1, \ldots, x_{N_1}\}$

Mean: $\mu_1 = \frac{1}{N_1} \sum_{x_i \in S_1} x_i$

Within Cluster Sum of Squares:
$\mathrm{WCSS}(S_1) = \sum_{x_i \in S_1} \|x_i - \mu_1\|_2^2$, where $\|y\|_2^2 = \sum_{j=1}^{D} (y)_j^2$

Note that $\mu_1 = \operatorname{argmin}_c \sum_{x_i \in S_1} \|x_i - c\|_2^2$
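To make the formulas concrete, here is a minimal numpy sketch (the cluster X1 below is made-up illustrative data, not from the slides):

```python
import numpy as np

# Hypothetical cluster S1: N1 = 5 points in D = 2 dimensions.
X1 = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9], [1.1, 2.1]])

mu1 = X1.mean(axis=0)              # the cluster mean, mu_1
wcss = np.sum((X1 - mu1) ** 2)     # sum of squared l2 distances to the mean

print("mu1 =", mu1, "WCSS(S1) =", wcss)
```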
Within Cluster Sum of Squares
For a set of clusters $S = \{S_1, \ldots, S_K\}$:
$\mathrm{WCSS}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2$

Can use $\|y\|_1 = \sum_{j=1}^{D} |(y)_j|$ instead of $\|y\|_2 = \big( \sum_{j=1}^{D} (y)_j^2 \big)^{1/2}$
So we get the Within Clusters Manhattan Distance:
$\mathrm{WCMD}(S) = \sum_{j=1}^{K} \sum_{x_i \in S_j} \|x_i - m_j\|_1$,
where $m_j = \operatorname{argmin}_c \sum_{x_i \in S_j} \|x_i - c\|_1$

Question: how to compute/estimate $m_j$?
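A sketch of how one might compute both quantities, assuming the data sit in an array X with a label array assigning each point to a cluster (both hypothetical names). It also answers the question above: the $\ell_1$ sum decouples across coordinates, and in one dimension the median minimizes $\sum_i |x_i - c|$, so $m_j$ is the coordinatewise median of cluster $j$.

```python
import numpy as np

def wcss(X, labels, K):
    """Within Cluster Sum of Squares: centers are the cluster means."""
    return sum(np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
               for j in range(K))

def wcmd(X, labels, K):
    """Within Clusters Manhattan Distance: the coordinatewise median
    minimizes the sum of l1 distances, so m_j is the median of cluster j."""
    return sum(np.sum(np.abs(X[labels == j] - np.median(X[labels == j], axis=0)))
               for j in range(K))
```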
Minimizing WCSS
Precise minimization is “NP-hard”
Approximate minimization for WCSS by
K-means
Approximate minimization for WCMD by
K-medians
The K-means Algorithm
Input: Data & number of clusters (K)
Randomly guess locations of K cluster centers
Assign each point to its nearest center
Move each center to the mean of its assigned points
Repeat till convergence ….
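A minimal sketch of the loop just described (Lloyd's algorithm), assuming numpy and a data matrix X with one row per point; the function and parameter names are my own:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm); converges to a local minimum of WCSS."""
    rng = np.random.default_rng(seed)
    # Randomly guess K cluster centers by sampling data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (a center with no points keeps its old location).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels
```

K-medians is the same loop with medians in place of means and $\ell_1$ distances in place of Euclidean ones.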
Demonstration: K-means/medians
Applet
K-means: Pros and Cons
Pros
Often fast
Often terminates at a local minimum
Cons
May not obtain the global minimum
Depends on initialization
Need to specify K
Sensitive to outliers
Sensitive to variations in sizes and densities of clusters
Not suitable for clusters with non-convex shapes
Does not apply directly to categorical data
Spectral Clustering
Idea: embed data for easy clustering
• Construct weights based on proximity:
  $W_{ij} = e^{-\|x_i - x_j\|^2 / \sigma}$ if $i \neq j$, and $0$ otherwise
  (here $\sigma > 0$ is a scale parameter)
• (Normalize $W$)
• Embed the data using the top eigenvectors of $W$
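A sketch of the embedding step under these definitions. The symmetric normalization $D^{-1/2} W D^{-1/2}$ below is one common choice (in the style of Ng, Jordan & Weiss), not necessarily the one intended on the slide, and sigma is the assumed scale parameter:

```python
import numpy as np

def spectral_embed(X, K, sigma=1.0):
    """Build Gaussian proximity weights, normalize them, and embed the
    data via the top K eigenvectors of the normalized weight matrix."""
    # Pairwise squared distances ||x_i - x_j||^2.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / sigma)
    np.fill_diagonal(W, 0.0)                 # W_ii = 0
    d = W.sum(axis=1)
    # One common normalization: D^{-1/2} W D^{-1/2}.
    W_norm = W / np.sqrt(np.outer(d, d))
    # eigh returns eigenvalues in ascending order; take the K largest.
    vals, vecs = np.linalg.eigh(W_norm)
    U = vecs[:, -K:]
    # Row-normalize; the rows are the embedded points.
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```

Running K-means on the rows of the returned matrix completes the clustering.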
Clustering vs. Classification
Clustering – find classes in an unsupervised
way (often K is given though)
Classification – labels of clusters are given
for some data points (supervised learning)
Data 1: Face images
Facial images (e.g., of persons 5,8,10) live on different
“planes” in the “image space”
They are often well separated, so simple clustering can be applied to them (but not always…)
Question: What is the high-dimensional image space?
Question: How can we present high-dim. data in 3D?
Data 2: Iris Data Set
Setosa
Versicolor
Virginica
50 samples from each of 3 species
4 features per sample:
length & width of sepal and petal
Data 2: Iris Data Set
Setosa is clearly separated from the other two
Can’t separate Virginica and Versicolor
(need a training set, as done by Fisher in 1936)
Question: What are other ways to visualize?
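One quick way to check these claims is to run K-means on scikit-learn’s built-in copy of the data set (any clustering method could stand in here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target      # 150 samples, 4 features, 3 species

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cross-tabulate clusters against species: Setosa typically gets its own
# cluster, while Versicolor and Virginica end up partly mixed.
for j in range(3):
    counts = np.bincount(y[labels == j], minlength=3)
    print(f"cluster {j}: setosa={counts[0]}, "
          f"versicolor={counts[1]}, virginica={counts[2]}")
```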
Data 3: Color-based Compression of Images
Applet
Question: What are the actual data points?
Question: What does the error mean?
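To answer both questions in code: the data points are the pixels’ RGB values, and the error is the WCSS of the pixels around their assigned palette colors. A sketch, with the function name and parameters my own; image is assumed to be an (h, w, 3) array:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, K):
    """Color-based compression: cluster the pixels' RGB values into K
    representative colors and replace each pixel by its cluster's color."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)   # one data point per pixel
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(pixels)
    compressed = km.cluster_centers_[km.labels_].reshape(h, w, 3)
    # km.inertia_ is the WCSS: the total squared distance of the pixels
    # to their assigned palette colors -- the compression "error".
    return compressed, km.inertia_
```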
Some methods for # of Clusters
(with online codes)
Gap statistics (a sketch follows after this list)
Model-based clustering
G-means
X-means
Data-spectroscopic clustering
Self-tuning clustering
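As a taste of the first method on the list, here is a bare-bones sketch of the gap statistic (Tibshirani, Walther & Hastie, 2001); the uniform bounding-box reference, B, k_max, and the K-means settings are simplifying choices of mine, not the published defaults:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=8, B=10, seed=0):
    """Estimate the number of clusters by comparing log(WCSS) on the data
    to its average under a uniform reference over the data's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wcss(data, k):
        return np.log(KMeans(n_clusters=k, n_init=4, random_state=0)
                      .fit(data).inertia_)

    gaps, sks = [], []
    for k in range(1, k_max + 1):
        ref = np.array([log_wcss(rng.uniform(lo, hi, size=X.shape), k)
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wcss(X, k))      # Gap(k)
        sks.append(ref.std() * np.sqrt(1 + 1.0 / B))  # s_k
    # Rule: smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k
    return k_max
```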
Your mission
Learn about clustering (theoretical results,
algorithms, codes)
Focus: methods for determining # of clusters
Understand details
Compare using artificial and real data
Conclude good/bad scenarios for each (prove?)
Come up with new/improved methods
Summarize info: literature survey and possibly
new/improved demos/applets
We can suggest additional questions tailored to
your interest