Hierarchical Stability Based Model Selection for Data Clustering

Bing Yin
Advisor: Greg Hamerly
4/8/2016
Roadmap
What is clustering?
What is model selection for clustering algorithms?
Stability Based Model Selection: Proposals and Problems
Hierarchical Stability Based Model Selection
● Algorithm
● Unimodality Test
● Experiments
Future work
Main Contribution
● Extended the concept of stability to hierarchical stability.
● Solved the symmetric data set problem.
● Made stability a competitive tool for model selection.
What is clustering?
Given:
● data set of “objects”
● some relations between those objects: similarities, distances, neighborhoods,
connections,…
Goal: find meaningful groups of objects such that
● objects in the same group are “similar”
● objects in different groups are “dissimilar”
Clustering is:
● a form of unsupervised learning
● a method of data exploration
What is clustering? An Example
Image segmentation
Microarray analysis: serum stimulation of human fibroblasts (Eisen, Spellman, et al., PNAS, 1998)
● 9,800 spots representing 8,600 genes
● 12 samples taken over a 24-hour period
● Clusters can be roughly categorized as genes involved in:
A: cholesterol biosynthesis
B: the cell cycle
C: the immediate-early response
D: signaling and angiogenesis
E: wound healing and tissue remodeling
Other applications: document clustering, post-search grouping, data mining, social network analysis, gene family grouping, …
What is clustering? An Algorithm
K-means algorithm (Lloyd, 1957)
Given: data points $X_1, \dots, X_n \in \mathbb{R}^d$ and the number of clusters $K$ to find.
1. Randomly initialize the centers $m_1^0, \dots, m_K^0$.
2. Iterate until convergence:
2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{i+1}, \dots, C_K^{i+1}$ by
$X_s \in C_k^{i+1}$ where $\|X_s - m_k^i\|^2 \le \|X_s - m_l^i\|^2$ for all $l = 1, \dots, K$.
2.2 Compute the new cluster centers by
$m_k^{i+1} = \frac{1}{|C_k^{i+1}|} \sum_{X_s \in C_k^{i+1}} X_s$
What is optimized? K-means minimizes the sum of within-cluster squared distances:
$\sum_{k=1}^{K} \sum_{X_s \in C_k} \|X_s - m_k\|^2$
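A minimal NumPy sketch of the algorithm above; the fixed iteration cap, the seeded initialization, and the lack of empty-cluster handling are simplifications of mine, not part of the original formulation.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: X is an (n, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the centers by sampling K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 2.1 Assign each point to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 Recompute each center as the mean of its assigned points.
        # (An empty cluster would yield NaN; a full implementation re-seeds it.)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```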
What is model selection?
Clustering algorithms such as K-means need to know K before running.
The correct K for a given data set is usually unknown.
So we need a principled way to find this K, and with it the positions of the K centers.
This problem is called model selection for clustering algorithms.
Existing model selection methods:
● Bayesian Information Criterion
● Gap statistic
● Projection test
…
● Stability-based approach
Stability Based Model Selection
The basic idea:
● scientific truth should be reproducible in experiments.
Repeatedly run a clustering algorithm on the same data
with parameter K and get a collection of clusterings:
● If K is the correct model, the clusterings should be similar to each other.
● If K is a wrong model, the clusterings may be quite different from each other.
This property is referred to as the stability of K (von Luxburg, 2007).
Stability Based Model Selection (2)
Example on a toy data set:
If we can mathematically define this stability score for K, then stability
can be used to find the correct model for the given data.
Define the Stability
Variation of Information (VI)
● Clustering $C_1$: $X_1, \dots, X_K$ and clustering $C_2$: $X'_1, \dots, X'_K$ on data $X$ with $n$ points.
● The probability that a point $p$ belongs to $X_i$ is: $P(i) = |X_i| / n$
● The entropy of $C_1$: $H(C_1) = -\sum_{i=1}^{K} P(i) \log P(i)$
● The joint probability that $p$ is in both $X_i$ and $X'_j$ is $P(i,j) = |X_i \cap X'_j| / n$, with joint entropy $H(C_1, C_2) = -\sum_{i,j} P(i,j) \log P(i,j)$
● The VI is defined as: $VI(C_1, C_2) = 2 H(C_1, C_2) - H(C_1) - H(C_2)$, equivalently $H(C_1) + H(C_2) - 2 I(C_1, C_2)$
VI is a distance between two clusterings.
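A sketch of VI computed from two label vectors, following the definitions above (Meilă's Variation of Information); the function name and contingency-table construction are my choices, not from the talk.

```python
import numpy as np

def variation_of_information(labels1, labels2):
    """VI(C1, C2) = 2*H(C1, C2) - H(C1) - H(C2)."""
    n = len(labels1)
    # Build the joint distribution P(i, j) from the contingency table.
    _, inv1 = np.unique(labels1, return_inverse=True)
    _, inv2 = np.unique(labels2, return_inverse=True)
    joint = np.zeros((inv1.max() + 1, inv2.max() + 1))
    np.add.at(joint, (inv1, inv2), 1.0 / n)
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)  # marginals P(i), P(j)

    def entropy(p):
        p = p[p > 0]  # ignore empty cells: 0 * log(0) = 0
        return -(p * np.log(p)).sum()

    return 2.0 * entropy(joint.ravel()) - entropy(p1) - entropy(p2)
```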
Define the Stability (2)
Calculate the VI score for a single K:
● Cluster the data using K-means with K clusters; run this M times.
● Calculate the pairwise VI of these M clusterings.
● Average the VIs and use the result as the VI score for K.
The calculated VI score indicates the instability of K.
Try this over different values of K; the K with the lowest VI score (instability) is chosen as the
correct model. (A sketch of this loop follows.)
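A sketch of the selection loop just described, reusing the kmeans() and variation_of_information() sketches above; M = 10 and the candidate range 2..10 are illustrative choices, not values from the talk.

```python
from itertools import combinations
import numpy as np

def instability(X, K, M=10):
    # Run K-means M times with different random initializations ...
    runs = [kmeans(X, K, seed=m)[0] for m in range(M)]
    # ... and average the VI over all pairs of resulting clusterings.
    return np.mean([variation_of_information(a, b)
                    for a, b in combinations(runs, 2)])

def select_k(X, candidates=range(2, 11)):
    # The most stable K (lowest average pairwise VI) is chosen as the model.
    return min(candidates, key=lambda k: instability(X, k))
```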
Define the Stability (3)
A good example of stability
A bad example of stability: symmetric data
Why?
Because clustering the data into 9 clusters apparently offers more equally good grouping choices than clustering them into 3.
Hierarchical Stability
Problems with the concept of stability introduced above:
● Symmetric data sets
● Only local optimization – stability tends to favor the smaller K
Proposed solution:
● Analyze the stability in a hierarchical manner
● Perform a unimodality test to detect when the recursion should terminate
Hierarchical Stability
Given: data set X
HS-means (a sketch in code follows below):
● 1. Test whether X is a unimodal cluster.
● 2. If it is not, find the optimal K for X by analyzing stability;
otherwise, X is a single cluster: return.
● 3. Partition X into K subsets.
● 4. For each subset, recursively perform this algorithm from step 1.
● 5. Merge the answers from the subsets as the answer for the current data.
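A high-level sketch of the HS-means recursion; is_unimodal() stands in for either of the unimodality tests on the following slides, and select_k() is the stability-based selection sketched earlier.

```python
def hs_means(X):
    # Steps 1-2: if X looks like a single (unimodal) cluster, stop recursing.
    if is_unimodal(X):
        return [X]
    # Steps 2-3: pick the most stable K and partition X accordingly.
    K = select_k(X)
    labels, _ = kmeans(X, K)
    # Steps 4-5: recurse into each subset and merge the answers.
    clusters = []
    for k in range(K):
        clusters.extend(hs_means(X[labels == k]))
    return clusters
```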
Unimodality Test – χ² Unimodality Test
Fact: a sum of squared standard Gaussians follows a χ² distribution.
● If $x_1, \dots, x_d$ are $d$ independent standard Gaussian variables, then
$S = x_1^2 + \dots + x_d^2$ follows a χ² distribution with $d$ degrees of freedom.
For a given data set X, calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$:
● If X is a single (standardized) Gaussian, then S follows a χ² distribution with $d$ degrees of freedom.
● Otherwise, S does not follow a χ² distribution.
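A sketch of this test using a Kolmogorov–Smirnov comparison against the χ²_d distribution; the per-dimension standardization (rather than full whitening with the covariance matrix) and the 0.05 significance level are my simplifying assumptions.

```python
import numpy as np
from scipy import stats

def chi2_unimodal(X, alpha=0.05):
    # Standardize each dimension so a single Gaussian becomes roughly
    # standard normal. (Full whitening would also decorrelate dimensions.)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    S = (Z ** 2).sum(axis=1)  # S_i = x_1^2 + ... + x_d^2
    d = X.shape[1]
    # If X is one Gaussian, S should follow chi^2 with d degrees of freedom.
    _, p_value = stats.kstest(S, stats.chi2(df=d).cdf)
    return p_value > alpha  # True -> consistent with a single cluster
```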
Unimodality Test – Gap Test
Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani, 2000).
Given: data set X, candidate k
● Cluster X into k clusters and get the within-cluster dispersion $W_k$.
● Generate uniform reference data sets, cluster each into k clusters, and calculate the averaged dispersion $W_k^*$.
● $\mathrm{gap}(k) = \log W_k^* - \log W_k$
● Select the smallest k s.t. $\mathrm{gap}(k) > \mathrm{gap}(k+1)$.
● We use it in another way: just ask, is k = 1? (See the sketch below.)
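A sketch of the gap test reduced to the k = 1 question, reusing the kmeans() sketch; using B = 10 reference sets and dropping Tibshirani's standard-error correction term are simplifications of mine.

```python
import numpy as np

def within_dispersion(X, k):
    labels, centers = kmeans(X, k)
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))

def gap(X, k, B=10, seed=0):
    rng = np.random.default_rng(seed)
    # Reference dispersion: uniform data over the bounding box of X.
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(within_dispersion(rng.uniform(lo, hi, X.shape), k))
           for _ in range(B)]
    return np.mean(ref) - np.log(within_dispersion(X, k))

def is_unimodal(X):
    # "Just ask: is k = 1?" -- keep one cluster if splitting does not raise the gap.
    return gap(X, 1) >= gap(X, 2)
```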
Experiments
Synthetic data
● Both Gaussian and uniform distributions
● Dimensions from 2 up to 20
● c-separation between each cluster center and its nearest neighbor is 4
● 200 points in each cluster, 10 clusters in total
Handwritten digits
● U.S. Postal Service handwritten digits
● 9,298 instances in 256 dimensions
● 10 true clusters (maybe!)
KDD control curves
● 600 instances in 60 dimensions
● 6 true clusters, each with 100 instances
Estimated number of clusters (mean ± standard deviation):

Method          | Synthetic Gaussian (10 true) | Synthetic Uniform (10 true) | Handwritten Digits (10 true) | KDD Control Curves (6 true)
HS-means        | 10 ± 1                       | 10 ± 1                      | 6 ± 0                        | 6.5 ± 0.5
Lange Stability | 6.5 ± 1.5                    | 7 ± 1                       | 2 ± 0                        | 3 ± 0
PG-means        | 10 ± 1                       | 19.5 ± 1.5                  | 20 ± 1                       | 17 ± 1
Experiments – Symmetric Data
(Figure: clusterings found by HS-means vs. Lange Stability on a symmetric data set.)
Future Work
● A better unimodality testing approach.
● More detailed performance comparisons with existing methods
such as within-cluster distance, the VI metric, and so on.
● Improve the speed of the algorithm.

Questions and Comments
Thank you!