The K-Means Method in Cluster Analysis and Making It Intelligent
B.G. Mirkin
Professor, Department of Data Analysis and Artificial Intelligence, National Research University Higher School of Economics, Moscow, Russia
Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK
1
Outline:
Clustering as empirical classification
K-Means and its issues:
  (1) Determining K and initialization
  (2) Weighting variables
Addressing (1):
  Data recovery clustering and K-Means (Mirkin 1987, 1990)
  One-by-one clustering: Anomalous patterns and iK-Means
  Other approaches
  Computational experiment
Addressing (2):
  Three-stage K-Means
  Minkowski K-Means
  Computational experiment
Conclusion
2
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of KMeans; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive
Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery;
Data Recovery Model for K-Means; for Ward; Extensions to Other Data
Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means;
Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on
Subsets and Partitions; Validity and Reliability
3
Referred recent work:
B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: An experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41
B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260
B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075
4
What is clustering?
Finding homogeneous fragments,
mostly sets of entities, in datasets for
further analysis
5
Example: W. Jevons (1857) planet
clusters (updated by Mirkin 1996)
Pluto does not fit into the two clusters of planets:
it gave rise to another cluster (September 2006)
6
Example: A Few Clusters
Clustering interface to WEB search engines (Grouper)
Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster (# sites): Interpretation
  24 sites: Society, religion
  12 sites: Middle East, War, History
  31 sites: Economy, Travel
(each cluster comes with View and Refine options in the interface)

Sample site titles:
• Israel and Judaism
• Judaica collection
• The state of Israel
• Arabs and Palestinians
• Israel Hotel Association
• Electronics in Israel
7
Clustering algorithms:
Nearest neighbour
Agglomerative clustering
Divisive clustering
Conceptual clustering
K-means
Kohonen SOM
Spectral clustering
………………….
8
Batch K-Means:
a generic clustering method
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids according to the minimum distance rule
2. Put centroids in the gravity centres of the thus obtained clusters
3. Iterate steps 1 and 2 until convergence
[Figure: data points (*) and K = 3 hypothetical centroids (@)]
9
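A minimal NumPy sketch of these steps (illustrative only: the name batch_kmeans, the convergence test on unchanged assignments and the empty-cluster guard are my own choices, not taken from the slides):

```python
import numpy as np

def batch_kmeans(Y, centroids, max_iter=100):
    """Batch K-Means: alternate the minimum-distance rule (step 1) and the
    gravity-centre update (step 2) until the assignments stop changing."""
    c = np.asarray(centroids, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid (minimum distance rule)
        d = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # Step 3: converged
        labels = new_labels
        # Step 2: move each centroid to the gravity centre of its cluster
        for k in range(c.shape[0]):
            if np.any(labels == k):                # a centroid that lost all points stays put
                c[k] = Y[labels == k].mean(axis=0)
    return labels, c
```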
K-Means: a generic clustering method
(Steps 0-3 as on the previous slide; the figure shows the next state of the iterations)
10
K-Means: a generic clustering method
(Steps 0-3 as above; the figure shows a further iteration)
11
K-Means: a generic clustering method
(Steps 0-3 as above)
4. Output final centroids and clusters
[Figure: final centroids (@) and their clusters]
12
K-Means criterion: summary distance to cluster centroids
Minimize

W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)

where d is the squared Euclidean distance
[Figure: data points (*) and cluster centroids (@)]
13
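For concreteness, a small helper (my own naming) that evaluates this criterion for a given partition and set of centroids; steps 1 and 2 of the batch algorithm each decrease this value, which is why the iterations converge:

```python
import numpy as np

def kmeans_criterion(Y, labels, centroids):
    """W(S, c): total squared Euclidean distance of every entity
    to the centroid of its own cluster."""
    return float(((Y - centroids[labels]) ** 2).sum())
```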
Advantages of K-Means
- Models typology building
- Simple "data recovery" criterion
- Computationally efficient
- Can be utilised incrementally, 'on-line'
Shortcomings of K-Means
- Initialisation: no advice on K or on the initial centroids
- No guarantee of reaching a deep minimum
- No defence against irrelevant features
14
Initial Centroids: Correct
Two cluster case
15
Initial Centroids: Correct
[Figure: initial and final centroid positions]
16
Different Initial Centroids
17
Different Initial Centroids: Wrong
[Figure: initial and final centroid positions]
18
(1) To address:
Number of clusters
  Issue: the criterion always decreases with K, so W_K < W_{K-1}
Initial setting
  Issue: reaching a deeper minimum
The two are interrelated: a good initial setting leads to a deeper minimum
19
Number K: conventional approach
Take a range R_K of K, say K = 3, 4, ..., 15
For each K ∈ R_K:
  run K-Means 100-200 times from randomly chosen initial centroids and take the deepest value W(S, c) = W_K
Compare the W_K over all K ∈ R_K in a special way and choose the best, for instance by:
  Gap statistic (2001)
  Jump statistic (2003)
  Hartigan (1975): in the ascending order of K, pick the first K at which
  H_K = (W_K / W_{K+1} - 1)(N - K - 1) ≤ 10
20
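A sketch of this conventional procedure, reusing the batch_kmeans and kmeans_criterion helpers sketched earlier; the restart count, the K range and the threshold of 10 follow the slide, while the exact form of Hartigan's statistic is my reconstruction:

```python
import numpy as np

def best_WK(Y, K, n_starts=100, seed=0):
    """Deepest W_K found over many K-Means runs from random initial centroids."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_starts):
        seeds = Y[rng.choice(len(Y), size=K, replace=False)]
        labels, c = batch_kmeans(Y, seeds)
        best = min(best, kmeans_criterion(Y, labels, c))
    return best

def hartigan_K(Y, K_range=range(3, 16), threshold=10.0):
    """Pick the first K, in ascending order, at which H_K <= threshold."""
    N = len(Y)
    Ks = list(K_range)
    W = {K: best_WK(Y, K) for K in Ks + [Ks[-1] + 1]}   # W_{K+1} is needed as well
    for K in Ks:
        H_K = (W[K] / W[K + 1] - 1.0) * (N - K - 1)
        if H_K <= threshold:
            return K
    return Ks[-1]
```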
(1) Addressing
* Number of clusters
* Initial setting
with a PCA-like method in the data
recovery approach
21
Representing a partition
Cluster k is represented by:
  its centroid c_{kv} (v: feature)
  its binary 1/0 membership vector z_{ik} (i: entity)
22
Basic equations (same as for PCA, but the score vectors z_k are constrained to be binary)

y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}

y: data entry; z: 1/0 membership, not a score; c: cluster centroid;
i: entity; N: number of entities; v: feature/category; k: cluster
23
Quadratic data scatter decomposition (Pythagorean)

y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{v=1}^{V} \sum_{k=1}^{K} c_{kv}^2 N_k + \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} (y_{iv} - c_{kv})^2

K-Means: alternating least-squares minimisation
y: data entry; z: 1/0 membership; c: cluster centroid;
i: entity; N_k: cardinality of cluster S_k; v: feature/category; k: cluster
24
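A quick numeric check of this decomposition (illustrative; it holds for any partition as long as the centroids are the within-cluster means):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(20, 3))
labels = np.repeat([0, 1], 10)                      # an arbitrary 2-cluster partition
c = np.vstack([Y[labels == k].mean(axis=0) for k in range(2)])
Nk = np.bincount(labels, minlength=2)

scatter   = (Y ** 2).sum()                          # sum_i sum_v y_iv^2
explained = ((c ** 2) * Nk[:, None]).sum()          # sum_v sum_k c_kv^2 N_k
residual  = ((Y - c[labels]) ** 2).sum()            # within-cluster squared residuals
assert np.isclose(scatter, explained + residual)
```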
Equivalent criteria (1)
A. Bilinear residuals squared, MIN:
\sum_{i=1}^{N} \sum_{v \in V} e_{iv}^2
Minimizing the difference between the data and the cluster structure
B. Distance to centre squared, MIN:
W = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k), with d the squared Euclidean distance
Minimizing the difference between the data and the cluster structure
25
Equivalent criteria (2)
C. Within-group error squared, MIN:
\sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v \in V} (y_{iv} - c_{kv})^2
Minimizing the difference between the data and the cluster structure
D. Within-group variance weighted by cluster size, MIN:
\sum_{k=1}^{K} |S_k| \sigma^2(S_k)
Minimizing the within-cluster variance
26
Equivalent criteria (3)
E. Semi-averaged within-cluster distances, MIN:
\sum_{k=1}^{K} \sum_{i, j \in S_k} d(i, j) / |S_k|, with d the squared Euclidean distance
Minimizing dissimilarities within clusters
F. Semi-averaged within-cluster similarities, MAX:
\sum_{k=1}^{K} \sum_{i, j \in S_k} a(i, j) / |S_k|, where a(i, j) = \langle y_i, y_j \rangle
Maximizing similarities within clusters
27
Equivalent criteria (4)
G. Distant centroids, MAX:
\sum_{k=1}^{K} \sum_{v \in V} c_{kv}^2 |S_k|
Finding anomalous types
H. Consensus partition, MAX:
\sum_{v=1}^{V} \eta^2(S, v)
Maximizing the correlation between the sought partition and the given variables
28
Equivalent criteria (5)
I. Spectral clusters, MAX:
\sum_{k=1}^{K} z_k^T Y Y^T z_k / (z_k^T z_k)
Maximizing the summary Rayleigh quotient over binary membership vectors
29
PCA-inspired Anomalous Pattern Clustering

y_{iv} = c_v z_i + e_{iv}, where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{v=1}^{V} c_{Sv}^2 N_S + \sum_{i \in S} \sum_{v=1}^{V} (y_{iv} - c_{Sv})^2 + \sum_{i \notin S} \sum_{v=1}^{V} y_{iv}^2

With the squared Euclidean distance d:

\sum_{i=1}^{N} d(y_i, 0) = N_S d(c_S, 0) + \sum_{i \in S} d(y_i, c_S) + \sum_{i \notin S} d(y_i, 0)

Maximizing the explained part N_S d(c_S, 0): c_S must be anomalous, that is, interesting
30
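A sketch of extracting one Anomalous Pattern cluster, assuming the data have been centred so that the reference point is the origin; starting from the entity farthest from the origin and assigning each entity to the nearer of the tentative centroid and the origin is my reading of the method:

```python
import numpy as np

def anomalous_pattern(Y, mask=None, max_iter=100):
    """Extract one Anomalous Pattern cluster from the centred data Y.
    The reference point (the origin) never moves; the AP centroid starts
    at the entity farthest from the origin and is updated as a mean."""
    if mask is None:
        mask = np.ones(len(Y), dtype=bool)          # entities still available
    idx = np.flatnonzero(mask)
    Z = Y[idx]
    c = Z[(Z ** 2).sum(axis=1).argmax()].copy()     # farthest entity as the seed
    in_S = None
    for _ in range(max_iter):
        # nearer to the AP centroid than to the origin -> belongs to the pattern
        new_in_S = ((Z - c) ** 2).sum(axis=1) < (Z ** 2).sum(axis=1)
        if in_S is not None and np.array_equal(new_in_S, in_S):
            break
        in_S = new_in_S
        c = Z[in_S].mean(axis=0)                    # update the AP centroid
    return idx[in_S], c
```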
Initial setting with an Anomalous Pattern cluster
[Figure: Tom Sawyer image used as the illustration]
31
Anomalous Pattern Clusters: Iterate
[Figure: Tom Sawyer image with anomalous pattern clusters 1, 2 and 3 extracted one after another; 0 marks the reference point]
32
iK-Means: Anomalous clusters + K-Means
After extracting 2 clusters (how can one know that 2 is right?)
[Figure: final clusters]
33
iK-Means:
Defining K and Initial Setting with
Iterative Anomalous Pattern Clustering
Find all Anomalous Pattern clusters
Remove smaller (e.g., singleton)
clusters
Put the number of remaining clusters
as K and initialise K-Means with their
centres
34
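Putting the pieces together, a sketch of this procedure built on the anomalous_pattern and batch_kmeans helpers sketched above; the min_size threshold for discarding small patterns is a parameter of my own:

```python
import numpy as np

def ik_means(Y, min_size=2):
    """iK-Means: extract Anomalous Pattern clusters one by one, discard the
    small ones, then run batch K-Means from the centres of those remaining."""
    Yc = Y - Y.mean(axis=0)                 # centre the data: reference point = origin
    mask = np.ones(len(Yc), dtype=bool)     # entities not yet extracted
    centres = []
    while mask.any():
        members, c = anomalous_pattern(Yc, mask)
        if len(members) >= min_size:        # keep only sufficiently large patterns
            centres.append(c)
        mask[members] = False               # remove the extracted entities
    K = len(centres)                        # K is defined by the data, not preset
    labels, final_centres = batch_kmeans(Yc, np.array(centres))
    return K, labels, final_centres
```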
Study of eight Number-of-clusters
methods (joint work with Mark Chiang):
• Variance based:
Hartigan (HK)
Calinski & Harabasz (CH)
Jump Statistic (JS)
• Structure based:
Silhouette Width (SW)
• Consensus based:
Consensus Distribution area (CD)
Consensus Distribution mean (DD)
• Sequential extraction of APs (iK-Means):
Least Square (LS)
Least Moduli (LM)
35
Experimental results
at 9 Gaussian clusters (3 spread patterns), 1000 x 15 data size
[Table: estimated number of clusters and Adjusted Rand Index, for large and small spread, for HK, CH, JS, SW, CD, DD, LS and LM; the two winners are counted each time, and entries are marked as 1-, 2- or 3-time winners]
36
[Results table continued]
37
(2) To address: weighting features according to relevance

\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik} w_v^{\beta} |y_{iv} - c_{kv}|^{\beta} = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)

w: feature weights = scale factors; s_{ik}: 1/0 cluster membership

3-step K-Means:
- given s, c, find w (weights)
- given w, c, find s (clusters)
- given s, w, find c (centroids)
- iterate till convergence
38
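The weight step can be written in closed form when the weights are required to be positive and sum to 1; the update below is the standard one from the weighted K-Means literature and is my reconstruction rather than a formula shown on the slide (it assumes β > 1):

w_v = 1 / \sum_{u=1}^{M} (D_v / D_u)^{1/(\beta - 1)}, where D_v = \sum_{k=1}^{K} \sum_{i \in S_k} |y_{iv} - c_{kv}|^{\beta}

```python
import numpy as np

def update_weights(Y, labels, centroids, beta=2.0):
    """Given clusters and centroids, choose feature weights w_v (summing to 1)
    that minimise the weighted criterion; D_v is the per-feature dispersion."""
    D = np.zeros(Y.shape[1])
    for k in range(centroids.shape[0]):
        D += (np.abs(Y[labels == k] - centroids[k]) ** beta).sum(axis=0)
    D = np.maximum(D, 1e-12)                # guard against zero dispersion
    # the smaller a feature's within-cluster dispersion, the larger its weight
    return 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
```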
Minkowski centres
Minimize over c:

d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}

At β > 1, d(c) is convex
Gradient method
39
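A sketch of the per-feature Minkowski centre obtained by plain gradient descent on d(c), which is convex for β > 1 as stated above; the starting point, step size and iteration count are arbitrary choices of mine:

```python
import numpy as np

def minkowski_center(y, beta=1.5, lr=0.01, n_steps=2000):
    """One-dimensional Minkowski centre: argmin_c sum_i |y_i - c|**beta, beta > 1."""
    c = float(np.median(y) if beta < 2 else np.mean(y))   # reasonable starting point
    for _ in range(n_steps):
        diff = c - y
        grad = beta * np.sign(diff) * np.abs(diff) ** (beta - 1.0)
        c -= lr * grad.sum() / len(y)       # averaged gradient step
    return c
```

At β = 2 the centre is the ordinary mean, and at β = 1 it is the median; intermediate values of β give centres between the two.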
Minkowski metric effects
The more uniform the distribution of the entities over a feature, the smaller its weight
Uniform distribution ⇒ w = 0
The best Minkowski power β is data dependent
The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record)
40
Conclusion:
The data recovery K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects
Coder: data clustering; Model: clusters; Decoder: data recovery
Further work:
Extending the approach to other data types: text, sequence, image, web page
Upgrading K-Means to address the issue of interpretation of the results
41
HEFCE survey of students’ satisfaction
HEFCE method: ALL 93 of highest mark
STRATA: 43 best, ranging from 71.8 to 84.6
42