
Different Perspectives at
Clustering:
The “Number-of-Clusters” Case
B. Mirkin
School of Computer Science
Birkbeck College, University of London
IFCS 2006
Different Perspectives at Number
of Clusters: Talk Outline
Clustering and K-Means: A discussion
Clustering goals and four perspectives
Number of clusters in:
- Classical statistics perspective
- Machine learning perspective
- Data Mining perspective
(including a simulation study with 8 methods)
- Knowledge discovery perspective
(including a comparative genomics project)
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
Example: W. Jevons (1835-1882),
updated in Mirkin 1996
Pluto doesn’t fit in the two clusters of planets
Example: A Few Clusters
Clustering interface to WEB search engines (Grouper):
Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster   # sites   Interpretation
1         24        Society, religion
2         12        Middle East, War, History
3         31        Economy, Travel

• Israel and Judaism
• Judaica collection
• The state of Israel
• Arabs and Palestinians
• Israel Hotel Association
• Electronics in Israel
Clustering: Main Steps
Data collecting
Data pre-processing
Finding clusters (the only step
appreciated in conventional clustering)
Interpretation
Drawing conclusions
Conventional Clustering:
Cluster Algorithms
Single Linkage: Nearest Neighbour
Ward Agglomeration
Conceptual Clustering
K-means
Kohonen SOM
…
K-Means: a generic clustering method
Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the clusters thus obtained
3. Iterate 1. and 2. until convergence
4. Output final centroids and clusters
[Figure: points (*) and K = 3 hypothetical centroids (@), shown over successive iterations]
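To make the generic loop concrete, here is a minimal K-Means sketch in Python with NumPy (the random seeding of centroids and the data matrix X are illustrative assumptions, not part of the slides):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    """Generic K-Means: seed K centroids, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 1. Assign points to centroids by the minimum-distance rule
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # 2. Put centroids at the gravity centres of the clusters thus obtained
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # 3. Iterate 1 and 2 until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output final centroids and clusters
    return centroids, labels
```

With K = 3 this mirrors the slide's picture of points (*) and three hypothetical centroids (@).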
Advantages of K-Means
Conventional:
- Models typology building
- Computationally effective
- Can be incremental, `on-line'
Unconventional:
- Associates feature salience with feature scales and correlation/association
- Applicable to mixed scale data
Drawbacks of K-Means
• No advice on: data pre-processing, number of clusters, initial setting
• Instability of results
• Criterion can be inadequate
• Insufficient interpretation aids
Initial Centroids: Correct (two-cluster case)
[Figure: initial seeds and final centroids]
Different Initial Centroids: Wrong, even though seeded in different clusters
[Figure: initial seeds and final centroids]
Two types of goals
(with no clear-cut borderline)
Engineering goals
Data analysis goals
Engineering goals (examples)
Devising a market segmentation to
minimise the promotion and advertisement
expenses
Dividing a large scheme into modules to
minimise the cost
Organisation structure design
Data analysis goals (examples)
Recovery of the distribution function
Prediction
Revealing patterns in data
Enhancing knowledge with additional concepts
and regularities
Each of these goals is realised in a different perspective on clustering
Clustering Perspectives
Classical statistics:
Recovery of a multimodal distribution function
Machine learning: Prediction
Data mining: Revealing patterns in data
Knowledge discovery: Enhancing knowledge with additional concepts and regularities
Clustering Perspectives at # Clusters
Classical statistics:
As many as meaningful modes (mixture items)
Machine learning:
As many as needed for acceptable prediction
Data mining:
As many as meaningful patterns in data
(including incomplete clustering)
Knowledge discovery:
As many as needed to produce additional concepts and regularities
Main Sources for Deriving # Clusters
Classical statistics:
Model of the world
Machine learning:
Cost & Accuracy Trade Off
Data mining:
Data
Knowledge discovery:
Domain knowledge
Classical Statistics Perspective
There must be a model of data generation
- e.g., a mixture of Gaussians
The task: identify all parameters of the model by using observed data
- e.g., the number of Gaussians and their probabilities, means and covariances
Mixture of 3 Gaussian densities
Classical statistics perspective on K-Means
K-Means can be seen as a maximum likelihood method with spherical Gaussians of the same variance:
- within a cluster, all variables are independent and Gaussian with the same cluster-independent variance (z-scoring is a must then);
- the issue of the number of clusters can be approached with conventional approaches to hypothesis testing
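In this spirit, a common practical device (an illustration, not from the slides) is to fit Gaussian mixtures for a range of K and compare them by an information criterion such as BIC; a minimal scikit-learn sketch, assuming a data matrix X:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_range=range(1, 11)):
    """Fit spherical Gaussian mixtures for each K and keep the one with the lowest BIC."""
    best_k, best_bic = None, np.inf
    for k in k_range:
        gm = GaussianMixture(n_components=k, covariance_type="spherical",
                             n_init=5, random_state=0).fit(X)
        bic = gm.bic(X)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```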
Machine learning perspective
Clusters should be of help in learning incrementally generated data
The number should be specified by the
trade-off between accuracy and cost
A criterion should guarantee partitioning
of the feature space with clearly separated
high density areas;
A method should be proven to be
consistent with the criterion on the
population
Machine learning on K-Means
The number of clusters: to be specified according to prediction goals
Pre-processing: no advice
An incremental version of K-Means converges to a minimum of the summary within-cluster variance, under conventional assumptions of data generation (MacQueen 1967 is the major reference, though the method can be traced back a dozen or two years earlier)
Data mining perspective
Data recovery framework for
data mining methods
Type of Data:
- Similarity
- Temporal
- Entity-to-feature
- Co-occurrence
Type of Model:
- Regression
- Principal components
- Clusters
Model:
Data = Model_Derived_Data + Residual
Pythagoras:
|Data|² = |Model_Derived_Data|² + |Residual|²
The better the fit, the better the model.
K-Means as a data recovery method
Representing a partition:
- Cluster k: centroid c_{kv} (v – feature)
- Binary 1/0 membership z_{ik} (i – entity)
Basic equations (analogous to PCA):

y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{v=1}^{V} \sum_{k=1}^{K} c_{kv}^2 N_k + \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} (y_{iv} - c_{kv})^2

where y is a data entry, c a cluster centroid, z a membership value, i an entity, v a feature/category, k a cluster, and N_k the cardinality of cluster S_k.
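A quick numeric check of this Pythagorean decomposition (the data matrix X, the partition and the cluster count below are arbitrary toy choices):

```python
import numpy as np

# Hypothetical small data matrix (N entities x V features) and a 2-cluster partition
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.5, 6.2]])
labels = np.array([0, 0, 1, 1])
K = 2

centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
scatter = (X ** 2).sum()                                         # sum_i sum_v y_iv^2
explained = sum((centroids[k] ** 2).sum() * (labels == k).sum()  # sum_v c_kv^2 N_k
                for k in range(K))
residual = sum(((X[labels == k] - centroids[k]) ** 2).sum()      # within-cluster scatter
               for k in range(K))
assert np.isclose(scatter, explained + residual)
```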
Meaning of Data scatter

D^2 = \sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{v=1}^{V} \left( \sum_{i=1}^{N} y_{iv}^2 \right)

The data scatter is the sum of feature contributions \sum_{i} y_{iv}^2, each proportional to the feature's summary variance – the basis for feature pre-processing (dividing by range, not std).
Contribution of a feature F to a partition

Contrib(F) = \sum_{v \in F} \sum_{k=1}^{K} c_{kv}^2 N_k

Proportional to:
- the correlation ratio \eta^2, if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal:
  - Pearson chi-square (Poisson normalised)
  - Goodman–Kruskal tau-b (range normalised)
Contribution of a quantitative feature to a partition

Contrib(F) = N \sum_{k=1}^{K} p_k c_k^2 = N \sigma^2 \eta^2

proportional to the correlation ratio \eta^2 = \sum_{k=1}^{K} p_k c_k^2 / \sigma^2 if F is quantitative.
Contribution of a nominal feature to a partition

Contr(F) = N \sum_{k=1}^{K} \sum_{j} \frac{(p_{kj} - p_k p_j)^2}{p_k B_j}

Proportional to a contingency coefficient:
- Pearson chi-square (Poisson normalised): B_j = p_j
- Goodman–Kruskal tau-b (range normalised): B_j = 1
Pythagorean Decomposition of data scatter for interpretation
Contribution based description of clusters:
- C. Dickens: FCon = 0
- M. Twain: LenD < 28
- L. Tolstoy: NumCh > 3 or Direct = 1
Principal Cluster Analysis (Anomalous Pattern) Method

y_{iv} = c_v z_i + e_{iv}, where z_i = 1 if i \in S and z_i = 0 if i \notin S.

With the squared Euclidean distance, the data scatter decomposes as

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{v=1}^{V} c_{Sv}^2 N_S + \sum_{i \in S} \sum_{v=1}^{V} (y_{iv} - c_{Sv})^2 + \sum_{i \notin S} \sum_{v=1}^{V} y_{iv}^2

or, in terms of distances,

\sum_{i=1}^{N} d(i, 0) = d(c_S, 0) N_S + \sum_{i \in S} d(i, c_S) + \sum_{i \notin S} d(i, 0),

so c_S must be anomalous, that is, interesting.
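A rough sketch of extracting one Anomalous Pattern, assuming the data have already been standardised so that the reference point is the origin (an illustration, not the author's code):

```python
import numpy as np

def anomalous_pattern(X, max_iter=100):
    """Extract one Anomalous Pattern cluster from standardised data X (reference point 0)."""
    # Start from the entity farthest from the reference point
    centroid = X[np.argmax((X ** 2).sum(axis=1))].copy()
    members = None
    for _ in range(max_iter):
        # Assign an entity to the AP cluster if it is closer to the centroid than to 0
        d_centroid = ((X - centroid) ** 2).sum(axis=1)
        d_origin = (X ** 2).sum(axis=1)
        new_members = d_centroid < d_origin
        if members is not None and np.array_equal(new_members, members):
            break
        members = new_members
        centroid = X[members].mean(axis=0)
    return members, centroid
```

iK-Means would repeat this extraction on the remaining entities, discard singletons, and use the surviving centroids to initialise K-Means.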
Initial setting with Anomalous Single Cluster for iK-Means
[Figure: Tom Sawyer example]
iK-Means with Anomalous Single Clusters
[Figure: Tom Sawyer example – anomalous single clusters 1, 2, 3 extracted one by one from the reference point 0]
Anomalous clusters + K-Means
After extracting 2 clusters (how can one know that 2 is right?)
[Figure: final clusters]
Simulation study of 8 methods
(joint work with Mark Chiang)
Number-of-clusters methods:
• Variance based:
Hartigan(HK)
Calinski & Harabasz (CH)
Jump Statistic (JS)
• Structure based:
Silhouette Width (SW)
• Consensus based:
Consensus Distribution area (CD)
Consensus Distribution mean (DD)
• Sequential extraction of APs:
Least Square (LS)
Least Moduli (LM)
Data generation for the experiment
• Gaussian Mixture (6, 7, 9 clusters) with:
• Cluster spatial size:
- constant (spherical)
- k-proportional
- k²-proportional
• Cluster spread (distance between centroids): PPCA model

Spread    Spherical    k-proportional    k²-proportional
Large     2            10                10
Small     0.2          0.5               2
Evaluation of results:
Estimated clustering versus that
generated
•Number of clusters
•Distance between centroids
•Similarity between partitions
Distance between estimated centroids (e) and those generated (g)
Estimated centroids e1(q1), …, e5(q5); generated clusters G1(p1), G2(p2), G3(p3), with weights q and p used in the distance computation below.
Prime assignment:
g1 – e2;  g2 – e4;  g3 – e5
Final assignment:
g1 – e2, e1;  g2 – e4, e3;  g3 – e5
[Figure: generated clusters and estimated centroids under the prime and final assignments]
Distance between centroids: quadratic and city-block
1. Assignment:
g1(p1) – e1(q1), e2(q2)
g2(p2) – e3(q3), e4(q4)
g3(p3) – e5(q5)
2. Distancing:
d1 = (q1*d(g1,e1) + q2*d(g1,e2)) / (q1+q2)
d2 = (q3*d(g2,e3) + q4*d(g2,e4)) / (q3+q4)
d3 = q5*d(g3,e5) / q5
3. Averaging:
p1*d1 + p2*d2 + p3*d3
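A sketch of this between-centroid distance in Python (assumptions: arrays g and e of generated and estimated centroids with weight vectors p and q; every estimated centroid is assigned to its nearest generated centroid, which reproduces the final assignment above):

```python
import numpy as np

def centroid_distance(g, p, e, q, metric="quadratic"):
    """Distance between generated centroids g (weights p) and estimated centroids e (weights q).
    1. Assignment: each estimated centroid goes to its nearest generated centroid.
    2. Distancing: d_k = weighted average distance from g_k to its assigned e's (weights q).
    3. Averaging: sum_k p_k * d_k."""
    def dist(a, b):
        return ((a - b) ** 2).sum() if metric == "quadratic" else np.abs(a - b).sum()

    assign = [min(range(len(g)), key=lambda k: dist(g[k], e[j])) for j in range(len(e))]
    total = 0.0
    for k in range(len(g)):
        idx = [j for j in range(len(e)) if assign[j] == k]
        if idx:  # a generated centroid with no assigned estimate contributes nothing here
            d_k = sum(q[j] * dist(g[k], e[j]) for j in idx) / sum(q[j] for j in idx)
            total += p[k] * d_k
    return total
```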
Similarity between partitions
according to their confusion table
• Relative distance (Mirkin-Cherny 1970)
• Tchouprov coefficient (Cramer 1943)
• Adjusted Rand Index (Hubert-Arabie, 1985)
• Average Overlap (Mirkin 2005)
Results
(at 9 clusters, 1000 entities, 20 features generated)
[Table: estimated number of clusters, distance between centroids, and Adjusted Rand Index for HK, CH, JS, SW, CD, DD, LS and LM, each under large and small spread]
Knowledge discovery perspective on clustering
Conforming to and enhancing domain knowledge:
- Informal considerations so far
Relevant items:
- Decision trees
- External validation
A case to generalise
Entities with a similarity measure
A clustering interpretation tool developed
A clustering method using a similarity threshold, leading to a number of clusters
Domain knowledge leading to constraints on the similarity threshold
The best-fitting interpretation provides for the best number of clusters
Entities with a similarity measure
740 Homologous Protein Families (HPFs) in 30 herpes-virus genomes
Homology defined by a protein sequence fragment
[Figure: sequence fragments shared by families F1, F2, F3]
Sequence-neighbourhood-based similarity measure on HPFs
Interpretation tool: Mapping to an evolutionary tree over genomes
[Figure: evolutionary tree over the genomes (leaves labelled a, b, c, …) with HPFs F1, F2, F3 mapped onto it]
Different aggregations – different histories
Algorithm ADDI-S (Mirkin JoC 1987), a data approximation technique
Criterion: maximize the contribution to the data scatter – the average within-cluster similarity c multiplied by the cluster's size #S
Algorithm ADDI-S:
- Take S = {j} for an arbitrary j
- Given S, find c = c(S) and the similarities b(i, S) to S for all entities i in and out of S
- Check the differences b(i, S) − c/2. If they are consistent, change the state of a most contributing entity. Else, stop and output S.
The resulting S has a tightness property.
Related: Holzinger (1941) B-coefficient, Arkadiev & Braverman (1964, 1967) Specter, Mirkin (1976, 1987) ADDI family, Ben-Dor, Shamir, Yakhini (1999) CAST
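A rough Python sketch of the ADDI-S loop, assuming a symmetric similarity matrix A that has already been shifted by the threshold b and treating self-similarities crudely (an illustration, not the original implementation):

```python
import numpy as np

def addi_s(A, seed, max_iter=1000):
    """Grow/shrink one cluster S from `seed` using the b(i,S) - c/2 test,
    where c is the average within-cluster similarity (diagonal excluded)."""
    n = A.shape[0]
    in_S = np.zeros(n, dtype=bool)
    in_S[seed] = True
    for _ in range(max_iter):
        size = in_S.sum()
        sub = A[np.ix_(in_S, in_S)]
        c = (sub.sum() - np.trace(sub)) / (size * (size - 1)) if size > 1 else 0.0
        # b(i, S): average similarity of entity i to the current cluster S
        b = A[:, in_S].mean(axis=1)
        gain = b - c / 2.0
        # entities whose membership state disagrees with the sign of the gain
        candidates = np.where(in_S ^ (gain > 0))[0]
        candidates = candidates[candidates != seed]  # keep the seed inside (sketch choice)
        if candidates.size == 0:
            break
        flip = candidates[np.argmax(np.abs(gain[candidates]))]
        in_S[flip] = ~in_S[flip]
    return np.where(in_S)[0]
```

Rerunning the procedure from different seeds, or on the yet-unclustered entities, yields a set of clusters whose number is governed by the similarity shift b.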
Algorithm ADDI-S (Mirkin 1987), a data approximation technique
Number of clusters: depends on the similarity shift threshold b:
b(i, j) ← b(i, j) − b
Domain knowledge: Function is known at some HPFs
287 pairs of HPFs with known function, of which 86 are SYNONYMOUS (same function)
[Figure: density of similarity values for synonymous and non-synonymous pairs]
Two threshold values: 0.42 (min error) and 0.67 (no non-synonymous)
Knowledge enhancing
- Analyzing the reconstructed contents of the 3 family ancestors and HUCA (the root)
- Analyzing differences between the b = 0.42 and b = 0.67 cluster reconstructions
- Analyzing gene arrangement within genomes:
Glycoprotein L's HPFs are sequence-dissimilar, but in the genomes they are always followed by a glycolase that is mapped to HUCA (Glyc L → Glycolase); therefore glycoprotein L must be in HUCA too.
Final HPFs and APFs
HPFs with a sequence-based similarity measure
Interpretation: parsimonious histories
ADDI-S clustering using a similarity threshold, leading to a number of clusters
Domain knowledge: 86 pairs should be in the same clusters, and 201 in different clusters → 2 suggested similarity thresholds
Best fit: 102 APFs (aggregating 249 HPFs) and 491 singleton HPFs
Whole HPF aggregation method's structure
(joint work with R. Camargo, T. Fenner, P. Kellam, G. Loizou)
Conclusion I: Number of clusters?
Engineering perspective: defined by cost/effect
Classical statistics perspective: can and should be determined from data with a model
Machine learning perspective: can be specified according to the prediction accuracy to achieve
Data mining perspective: not to be pre-specified; only those clusterings that bear interesting patterns are of interest
Knowledge discovery perspective: not to be pre-specified; take those that are best at enhancing knowledge
Conclusion II: The same applies to any other data analysis concept
Classical statistics perspective: can be determined from data with a model
Machine learning perspective: the prediction accuracy to achieve
Data mining perspective: data approximation
Knowledge discovery perspective: knowledge enhancing
Variance based methods
Hartigan (HK):
- calculate HT = (W_K / W_{K+1} − 1)(N − K − 1), where N is the number of entities
- find the K at which HT becomes less than a threshold of 10
Calinski and Harabasz (CH):
- calculate CH = ((T − W_K)/(K − 1)) / (W_K / (N − K)), where T is the data scatter
- find the K which maximizes CH
W_K is, for a given K, the smallest within-cluster summary distance to centroids among those found at different K-Means initializations.
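A minimal sketch of these two indices over a range of K, assuming scikit-learn's KMeans supplies W_K via its inertia_ attribute and that the data scatter T is taken around the grand mean:

```python
import numpy as np
from sklearn.cluster import KMeans

def variance_based_indices(X, k_max=15, n_init=10):
    """Compute W_K (within-cluster summary distance), Hartigan's HT and CH over K."""
    N = len(X)
    T = ((X - X.mean(axis=0)) ** 2).sum()          # data scatter around the grand mean
    W = {k: KMeans(n_clusters=k, n_init=n_init, random_state=0).fit(X).inertia_
         for k in range(1, k_max + 2)}
    HT = {k: (W[k] / W[k + 1] - 1) * (N - k - 1) for k in range(1, k_max + 1)}
    CH = {k: ((T - W[k]) / (k - 1)) / (W[k] / (N - k)) for k in range(2, k_max + 1)}
    return W, HT, CH
```

Hartigan's rule then takes the smallest K with HT below 10, while CH takes the K with the largest index value.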
Variance based methods
Jump Statistic (JS):
- for each entity i, clustering S = {S_1, S_2, …, S_K} and centroids C = {c_1, c_2, …, c_K}, calculate d(i, S_k) = (y_i − c_k)^T Γ^{-1} (y_i − c_k) and d_K = (Σ_k Σ_{i∈S_k} d(i, S_k)) / (P·N), where P is the number of features, N is the number of rows and Γ is the covariance matrix of y
- select a transformation power, typically P/2
- calculate the jumps JS(K) = d_K^{−P/2} − d_{K−1}^{−P/2}, with d_0^{−P/2} ≡ 0
- find the K which maximizes JS
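A sketch of the jump statistic as described above, assuming scikit-learn's KMeans for the clusterings and ignoring possible numerical over/underflow of the −P/2 power:

```python
import numpy as np
from sklearn.cluster import KMeans

def jump_statistic_k(X, k_max=15):
    """Jump statistic: distortions under a Mahalanobis-like metric, transformed with
    power -P/2; the K with the largest jump is selected (d_0^(-P/2) is defined as 0)."""
    N, P = X.shape
    Gamma_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    transformed = {0: 0.0}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        diffs = X - km.cluster_centers_[km.labels_]
        d_k = np.einsum('ij,jk,ik->', diffs, Gamma_inv, diffs) / (P * N)
        transformed[k] = d_k ** (-P / 2)
    jumps = {k: transformed[k] - transformed[k - 1] for k in range(1, k_max + 1)}
    return max(jumps, key=jumps.get)
```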
Structure based methods
Silhouette Width (SW):
- for each entity i, a(i) = average dissimilarity between i and all other entities of the cluster to which i belongs
- for each other cluster S_k, d(i, S_k) = average dissimilarity between i and all entities of S_k, and b(i) = min over those S_k of d(i, S_k)
- s(i) = [b(i) − a(i)] / max(a(i), b(i))
- calculate the average s = Σ_i s(i) / N
- find the K maximizing the average s
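A minimal sketch, assuming scikit-learn is used both for the clusterings and for the average silhouette width:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 16)):
    """Return the K whose K-Means clustering has the largest average silhouette width s."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)
```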
Consensus based methods
Consensus Distribution area (CD):
For each K, over different K-Means initializations:
- find the connectivity matrix
- calculate the consensus matrix
- calculate its cumulative distribution function CDF
- calculate the area under the CDF, A(K)
- calculate

\Delta(K+1) = \begin{cases} A(K+1), & K = 1 \\ \dfrac{A(K+1) - A(K)}{A(K)}, & K \ge 2 \end{cases}

- find the K which maximizes Δ(K)
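A compact sketch of the consensus-distribution area for one K (assumptions: repeated K-Means initializations stand in for the resampling scheme, and A(K) is approximated by the area under the empirical CDF of off-diagonal consensus values):

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_area(X, k, n_runs=20):
    """Area A(K) under the CDF of the consensus matrix built from repeated K-Means runs."""
    N = len(X)
    consensus = np.zeros((N, N))
    for r in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=1, random_state=r).fit_predict(X)
        # connectivity matrix: 1 if two entities fall in the same cluster in this run
        consensus += (labels[:, None] == labels[None, :]).astype(float)
    consensus /= n_runs
    # empirical CDF of the off-diagonal consensus values and the area under it
    vals = np.sort(consensus[np.triu_indices(N, k=1)])
    cdf = np.arange(1, len(vals) + 1) / len(vals)
    return np.trapz(cdf, vals)
```

Δ(K+1) is then formed from A(K) and A(K+1) as above; the DD variant on the next slide reuses the mean and variance of the same consensus matrix.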
Consensus based methods
Consensus Distribution mean (DD):
- μ_K is the mean of the consensus matrix
- σ_K² is the variance of the consensus matrix
- avdis(K) = μ_K (1 − μ_K) − σ_K²
- davdis(K) = (avdis(K) − avdis(K+1)) / avdis(K+1)
- find the K which maximizes davdis (DD)
Sequential cluster extraction
Intelligent K-Means:
• Anomalous Patterns (initial clusters)
• Removal of singletons
• K-Means
• Euclidean distance + the within-cluster mean → Least Squares criterion (LS)
• Manhattan distance + the within-cluster median → Least Moduli criterion (LM)
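The LS/LM distinction amounts to pairing a distance with its optimal centre statistic; a tiny illustrative sketch:

```python
import numpy as np

def update_centre(points, criterion="LS"):
    """LS: squared Euclidean distance pairs with the within-cluster mean;
    LM: Manhattan distance pairs with the componentwise within-cluster median."""
    return points.mean(axis=0) if criterion == "LS" else np.median(points, axis=0)
```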