Data clustering: Topics of Current Interest

Boris Mirkin (1,2)
(1) National Research University Higher School of Economics, Moscow, Russian Federation
(2) Birkbeck, University of London, UK

Supported by:
- "Teacher-Student" grants from the Research Fund of NRU HSE, Moscow (2011-2013)
- International Lab for Decision Analysis and Choice, NRU HSE, Moscow (2008-present)
- Laboratory of Algorithms and Technologies for Network Analysis, NRU HSE, Nizhniy Novgorod, Russia (2010-present)

Data clustering: Topics of Current Interest

1. K-Means clustering and two issues
   1. Finding the right number of clusters
      1. Before clustering (anomalous clusters)
      2. While clustering (divisive; no minima of the density function)
   2. Weighting features (3-step iterations)
2. K-Means at similarity clustering (kernel K-Means)
3. Semi-average similarity clustering
4. Consensus clustering
5. Spectral clustering, threshold clustering and modularity clustering
6. Laplacian pseudo-inverse transformation
7. Conclusion

Batch K-Means: a generic clustering method

Entities are presented as multidimensional points (*).

0. Put K hypothetical centroids (seeds), e.g. K = 3 centroids (@).
1. Assign points to the centroids according to the Minimum distance rule.
2. Put centroids in the gravity centres of the clusters thus obtained.
3. Iterate steps 1 and 2 until convergence.
4. Output the final centroids and clusters.

[Figure: points (*) and centroids (@) shown over four successive slides, animating steps 1-3 until convergence.]

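A minimal sketch of the batch K-Means scheme above in Python/NumPy (not the author's code; the toy data, the seeding by random entities and the parameter defaults are illustrative assumptions):

```python
import numpy as np

def batch_kmeans(Y, K, max_iter=100, seed=0):
    """Batch K-Means following steps 0-4 of the slide (a sketch)."""
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K randomly chosen entities.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # 1. Minimum distance rule: assign each point to its nearest centroid.
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # 3. convergence: no assignment has changed
        labels = new_labels
        # 2. Put centroids at the gravity centres of the clusters just obtained.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = Y[labels == k].mean(axis=0)
    # 4. Output the final centroids and clusters.
    return centroids, labels

# Toy usage: three well-separated blobs, K = 3 hypothetical centroids.
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ((0, 0), (3, 0), (0, 3))])
centroids, labels = batch_kmeans(Y, K=3)
```
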
K-Means criterion: summary distance to cluster centroids

Minimize

$$W(S,c) \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$$

[Figure: points (*) grouped around their cluster centroids (@).]

Advantages of K-Means
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, 'on-line'

Shortcomings of K-Means
- Initialisation: no advice on K or on the initial centroids
- No deep minima: only local optimisation
- No defence against irrelevant features

How should the number and location of initial centers be chosen? (Mirkin 1998; Chiang and Mirkin 2010)

Equivalent criterion: minimize

$$W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, c_k)$$

or, equivalently, maximize

$$D(S,c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle$$

over S and c, where N_k is the number of entities in S_k and <c_k, c_k> is the squared Euclidean distance between 0 and c_k.

The data scatter (the sum of squared data entries) equals W(S,c) + D(S,c) and is constant while partitioning.

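A small numeric check of the decomposition just stated (a sketch; the data matrix and the partition are arbitrary toy choices): for any partition S with cluster means c, the data scatter T equals W(S,c) + D(S,c).

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 4))
Y = Y - Y.mean(axis=0)                 # preprocess by centering: 0 is the grand mean

K = 3
labels = np.arange(len(Y)) % K         # any partition S into K clusters
W = D = 0.0
for k in range(K):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)               # centroid of S_k
    W += ((Yk - ck) ** 2).sum()        # within-cluster summary squared distances
    D += len(Yk) * (ck ** 2).sum()     # N_k <c_k, c_k>

T = (Y ** 2).sum()                     # data scatter: sum of squared data entries
print(np.isclose(T, W + D))            # True: T = W(S,c) + D(S,c)
```
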
How should the number and location of initial centers be chosen? (2)

Maximize

$$D(S,c) = \sum_{k=1}^{K} N_k \langle c_k, c_k \rangle, \qquad N_k = |S_k|$$

Preprocess the data by centering, so that 0 is the grand mean; <c_k, c_k> is then the squared Euclidean distance between the grand mean and c_k.

Look for anomalous and populated clusters, that is, clusters further away from the origin.

How should the number and location of initial centers be chosen? (3)

Preprocess the data by centering at a reference point, typically the grand mean, so that 0 becomes the grand mean.

Build just one Anomalous cluster.

How should the number and location of initial centers be chosen? (4)

Preprocess the data by centering at a reference point, typically the grand mean, so that 0 becomes the grand mean.

Build the Anomalous cluster S (see the sketch below):
1. Initial center: c is the entity farthest away from 0.
2. Cluster update: if d(y_i, c) < d(y_i, 0), assign y_i to S.
3. Centroid update: compute the within-S mean c'; if c' differs from c, go to step 2 with c set to c'. Otherwise, halt.

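A minimal sketch of the Anomalous cluster procedure of this slide (my own transcription; Y is assumed centered and d is the squared Euclidean distance):

```python
import numpy as np

def anomalous_cluster(Y):
    """Anomalous cluster, steps 1-3 of the slide. Y is assumed centered (0 = grand mean)."""
    d_to_0 = (Y ** 2).sum(axis=1)              # squared distance of every entity to 0
    c = Y[d_to_0.argmax()].copy()              # 1. initial center: entity farthest from 0
    while True:
        # 2. cluster update: i joins S if it is closer to c than to 0
        S = ((Y - c) ** 2).sum(axis=1) < d_to_0
        c_new = Y[S].mean(axis=0)              # 3. centroid update: within-S mean
        if np.allclose(c_new, c):
            return S, c                        # halt: the centroid did not move
        c = c_new

# Usage: the anomalous cluster supplies one initial centroid (the iK-Means idea).
rng = np.random.default_rng(2)
Y = np.vstack([rng.normal((0, 0), 0.5, (40, 2)), rng.normal((5, 5), 0.5, (10, 2))])
Y = Y - Y.mean(axis=0)
S, c = anomalous_cluster(Y)
print(S.sum(), c)      # the 10 far-away points should form the anomalous cluster
```
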
How should the number and location of initial centers be chosen? (5)

The Anomalous cluster procedure is (almost) K-Means, up to:
(i) the number of clusters being K = 2: the "anomalous" cluster and the "main body" of entities around 0;
(ii) the center of the "main body" cluster being forcibly kept at 0;
(iii) the entity farthest away from 0 initializing the anomalous cluster.

How should the number and location of initial centers be chosen? (6)

Anomalous-cluster-initialized iK-Means is superior to the following methods (Chiang, Mirkin 2010):

Method                                  Acronym
Calinski and Harabasz index             CH
Hartigan rule                           HK
Gap statistic                           GS
Jump statistic                          JS
Silhouette width                        SW
Consensus distribution area             CD
Average distance between partitions     DD
Square error iK-Means                   LS
Absolute error iK-Means                 LM

Issue: weighting features according to relevance and the Minkowski β-distance (Amorim, Mirkin 2012)

$$\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta}\, |y_{iv} - c_{kv}|^{\beta} \;=\; \sum_{k=1}^{K} \sum_{i \in S_k} d_{\beta}(y_i, c_k)$$

w: feature weights = scale factors; s_ik is the cluster membership indicator (s_ik = 1 iff i is in S_k).

3-step K-Means (see the sketch below):
- Given s, c, find w (weights)
- Given w, c, find s (clusters)
- Given s, w, find c (centroids)
- Repeat until convergence

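A sketch of the 3-step alternation for the special case β = 2, where the centroid step reduces to cluster means; the weight update shown is obtained by minimizing the criterion over w under the normalisation sum(w) = 1, which is my assumption since the slide does not spell the update out.

```python
import numpy as np

def three_step_wkmeans(Y, K, n_iter=50, seed=0):
    """3-step weighted K-Means sketch for beta = 2 (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    c = Y[rng.choice(n, K, replace=False)].copy()     # centroids
    w = np.full(m, 1.0 / m)                           # feature weights (scale factors)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Given w, c: find s (clusters) by the minimum weighted-distance rule.
        d = (((Y[:, None, :] - c[None, :, :]) * w) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Given s, w: find c (centroids); at beta = 2 these are the cluster means.
        c = np.vstack([Y[labels == k].mean(axis=0) if np.any(labels == k) else c[k]
                       for k in range(K)])
        # Given s, c: find w (weights); at beta = 2, minimizing over w with sum(w) = 1
        # gives w_v proportional to 1 / D_v, D_v being the within-cluster dispersion of v.
        D = sum(((Y[labels == k] - c[k]) ** 2).sum(axis=0) for k in range(K))
        w = 1.0 / np.maximum(D, 1e-12)
        w /= w.sum()
    return labels, c, w
```
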
Issue: weighting features according to relevance and the Minkowski β-distance (2)

Minkowski centers

• Minimize over c:

$$d(c) = \sum_{i \in S_k} |y_{iv} - c|^{\beta}$$

• At β > 1, d(c) is convex
• Gradient method

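A minimal sketch of the gradient method for the one-dimensional Minkowski center (the step size, tolerance and toy data are my own choices); for β > 1 the function d(c) is convex, so the descent converges to its minimizer.

```python
import numpy as np

def minkowski_center(y, beta, lr=0.01, tol=1e-7, max_iter=10000):
    """Minimize d(c) = sum_i |y_i - c|**beta over a scalar c by gradient descent (beta > 1)."""
    c = float(np.mean(y))                           # a convenient starting point
    for _ in range(max_iter):
        # d'(c) = -beta * sum_i sign(y_i - c) * |y_i - c|**(beta - 1)
        grad = -beta * np.sum(np.sign(y - c) * np.abs(y - c) ** (beta - 1))
        step = lr * grad / len(y)                   # scale by n to keep the step stable
        c -= step
        if abs(step) < tol:
            break
    return c

y = np.array([1.0, 2.0, 2.5, 7.0])
print(minkowski_center(y, beta=2.0))    # the mean, 3.125
print(minkowski_center(y, beta=1.1))    # moves towards the median as beta approaches 1
```
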
Issue: weighting features according to relevance and the Minkowski β-distance (3)

Minkowski metric effects
• The more uniform the distribution of the entities over a feature, the smaller its weight
• Uniform distribution: w = 0
• The best Minkowski power β is data dependent
• The best β can be learnt from the data in a semi-supervised manner (with clustering of all objects)
• Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record)

K-Means kernelized 1

• K-Means: given a quantitative data matrix, find centers c_k and clusters S_k to minimize

$$W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(x_i, c_k)$$

with d the squared Euclidean distance.

• Girolami 2002:

$$W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} \Big( A(i,i) \;-\; \frac{2}{|S_k|} \sum_{j \in S_k} A(i,j) \;+\; a(S_k) \Big)$$

where A(i,j) = <x_i, x_j> and a(S_k) is the within-cluster mean of A, so the kernel trick applies: <x_i, x_j> can be replaced by K(x_i, x_j).

• Mirkin 2012:

$$W(S,c) = \mathrm{Const} - \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i,j \in S_k} A(i,j) = \mathrm{Const} - F(S)$$

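A numeric sanity check of the last identity (a sketch; the linear kernel A = YY^T and the arbitrary partition are toy assumptions): the feature-space criterion W(S,c) equals trace(A) minus the sum of within-cluster averaged kernel blocks, so the criterion can be evaluated from similarities alone.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(40, 3))
K = 4
labels = np.arange(len(Y)) % K                 # an arbitrary partition into K clusters
A = Y @ Y.T                                    # linear kernel: A(i, j) = <x_i, x_j>

# Left-hand side: W(S, c) computed from the features, c_k being the cluster means.
W = sum(((Y[labels == k] - Y[labels == k].mean(axis=0)) ** 2).sum() for k in range(K))

# Right-hand side: Const - F(S), computed from the kernel matrix alone.
const = np.trace(A)                            # sum_i A(i, i)
F = sum(A[np.ix_(labels == k, labels == k)].sum() / (labels == k).sum() for k in range(K))
print(np.isclose(W, const - F))                # True
```
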
K-Means kernelized 2

• K-Means equivalent criterion: find a partition {S_1, ..., S_K} to maximize

$$G(S_1, \dots, S_K) = \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i,j \in S_k} A(i,j) = \sum_{k=1}^{K} a(S_k)\, |S_k|$$

where a(S_k) is the within-cluster mean of A.

Mirkin (1976, 1996, 2012): build the partition {S_1, ..., S_K} by finding one cluster at a time.

K-Means kernelized 3

• K-Means equivalent criterion, one cluster S at a time: maximize

g(S) = a(S)|S|

where a(S) is the within-cluster mean of A.

AddRemAdd(i): a local search that starts from entity i and adds/removes one entity at a time (see the sketch below).

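A compact sketch of an AddRem-style local search for g(S) = a(S)|S| (my own simplified variant of the routine named on the slide; the toy block-structured similarity matrix is made up):

```python
import numpy as np

def g(A, members):
    """Semi-average criterion g(S): sum of within-S similarities divided by |S|."""
    idx = np.flatnonzero(members)
    return A[np.ix_(idx, idx)].sum() / len(idx) if len(idx) else 0.0

def add_rem(A, i):
    """Greedy local search started from S = {i}: flip the single entity improving g most."""
    n = len(A)
    S = np.zeros(n, dtype=bool)
    S[i] = True
    while True:
        best_gain, best_j = 0.0, None
        for j in range(n):
            if j == i and S.sum() == 1:
                continue                      # keep the seed while it is the only member
            T = S.copy()
            T[j] = not T[j]                   # try adding j (if outside S) or removing it
            gain = g(A, T) - g(A, S)
            if gain > best_gain + 1e-12:
                best_gain, best_j = gain, j
        if best_j is None:
            return S, g(A, S)                 # local maximum: no single flip improves g
        S[best_j] = not S[best_j]

# Toy similarity with two blocks; the search started from entity 0 recovers the first block.
A = np.full((6, 6), 0.1)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
S, score = add_rem(A, 0)
print(np.flatnonzero(S), score)   # [0 1 2] 3.0
```

The partitional, additive and explorative extensions on the following slides simply call such a routine from every seed i and then remove (or subtract) the best cluster found.
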
K-Means kernelized 4

• Semi-average criterion: maximize g(S) = a(S)|S|, with a(S) the within-cluster mean, using AddRemAdd(i).

(1) Spectral form: maximize the Rayleigh quotient

$$g(S) = \frac{s^T A s}{s^T s}$$

where s is the 0/1 indicator vector of S.

(2) Tightness: the average similarity between entity j and S is
    greater than a(S)/2 if j is in S,
    less than a(S)/2 if j is not in S.

Three extensions to the entire data set

• Partitional: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i in I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i in I;
  3. Remove S from I; if I is not empty, go to 1; else halt.
• Additive: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i in I;
  2. Take S = S(i*) for the i* maximizing f(S(i)) over all i in I;
  3. Subtract a(S)ss^T from A; if the stop-condition does not hold, go to 1; else halt.
• Explorative: take the set of all entities I
  1. Compute S(i) = AddRem(i) for all i in I;
  2. Leave those S(i) that do not overlap much.

Consensus partition I: given partitions R_1, R_2, ..., R_n, find an "average" R

• A partition R = {R_1, R_2, ..., R_K} defines an incidence matrix Z = (z_ik):
  z_ik = 1 if i is in R_k; z_ik = 0 otherwise
• A partition R = {R_1, R_2, ..., R_K} defines a projector matrix P = (p_ij):
  P = Z(Z^T Z)^{-1} Z^T
• Criterion: minimize

$$\Phi(R) = \sum_{m=1}^{n} \| Z_m - P Z_m \|^2$$

where Z_m is the incidence matrix of the given partition R_m.
(Mirkin, Muchnik 1981, in Russian; Mirkin 2012)

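A small sketch of this criterion (the toy partitions are made up): build the incidence matrix Z of a candidate consensus partition, its projector P = Z(Z^T Z)^{-1} Z^T, and score the candidate by the summed squared residuals of the given partitions' incidence matrices.

```python
import numpy as np

def incidence(labels, K):
    """Partition -> N x K incidence matrix Z with z_ik = 1 iff entity i is in class k."""
    Z = np.zeros((len(labels), K))
    Z[np.arange(len(labels)), labels] = 1.0
    return Z

def consensus_score(candidate, given_partitions):
    """Phi(R) = sum_m ||Z_m - P Z_m||^2 for a candidate consensus partition R."""
    Z = incidence(candidate, candidate.max() + 1)
    P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T            # projector onto the span of Z's columns
    return sum(np.linalg.norm(Zm - P @ Zm) ** 2
               for Zm in (incidence(p, p.max() + 1) for p in given_partitions))

# Toy example: three given partitions of six entities; the candidate agreeing with the
# majority receives the lower (better) score.
given = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 1, 1, 1, 1])]
print(consensus_score(np.array([0, 0, 0, 1, 1, 1]), given))   # smaller
print(consensus_score(np.array([0, 1, 0, 1, 0, 1]), given))   # larger
```
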
Consensus partition 2: given partitions R_1, R_2, ..., R_n, find an "average" R

$$\Phi(R_1, \dots, R_K) = \sum_{m=1}^{n} \| Z_m - P_Z Z_m \|^2 \;\to\; \min$$

This is equivalent to maximizing

$$G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i,j \in R_k} A(i,j) = \sum_{k=1}^{K} a(R_k)\, |R_k|$$

where a(R_k) is the within-cluster mean of A.

Consensus partition 3: given partitions R_1, R_2, ..., R_n, find an "average" R

Minimizing $\Phi(R_1, \dots, R_K) = \sum_{m=1}^{n} \| Z_m - P_Z Z_m \|^2$ is equivalent to maximizing $G(R_1, \dots, R_K) = \sum_{k=1}^{K} \frac{1}{|R_k|} \sum_{i,j \in R_k} A(i,j) = \sum_{k=1}^{K} a(R_k)\, |R_k|$.

Mirkin, Shestakov (2013):
(1) This approach is superior to a number of contemporary consensus clustering approaches.
(2) Consensus clustering of the results of multiple runs of K-Means recovers clusters better than the best single K-Means run.

Additive clustering I

Given a similarity matrix A = (A(i,j)), find clusters
• u_1 = (u_i1), u_2 = (u_i2), ..., u_K = (u_iK), where
  u_ik is either 1 or 0   - crisp clusters
  0 <= u_ik <= 1          - fuzzy clusters
• with intensities μ_1, μ_2, ..., μ_K

Additive model:

$$A(i,j) = \mu_1 u_{i1} u_{j1} + \dots + \mu_K u_{iK} u_{jK} + E(i,j), \qquad \min \sum_{i,j} E(i,j)^2$$

Shepard, Arabie 1979 (presented 1973); Mirkin 1987 (1976 in Russian)

Additive clustering II

Given a similarity matrix A = (A(i,j)): iterative extraction.
Mirkin 1987 (1976 in Russian): a double-greedy procedure.

• OUTER LOOP: one cluster at a time. Minimize

$$L(A, \mu, u) = \sum_{i,j} \big( A(i,j) - \mu\, u_i u_j \big)^2$$

1. Find a real μ (intensity) and a binary 1/0 u (membership) to (locally) minimize L(A, μ, u).
2. Take the cluster S = { i : u_i = 1 }.
3. Update A to A - μ u u^T (subtraction of μ within S).
4. Reiterate until a stop-condition is met.

Additive clustering III

Given a similarity matrix A = (A(i,j)): iterative extraction.
Mirkin 1987 (1976 in Russian): a double-greedy procedure.

• OUTER LOOP: extracting one cluster at a time leads to the decomposition

$$T(A) = \mu_1^2 |S_1|^2 + \mu_2^2 |S_2|^2 + \dots + \mu_K^2 |S_K|^2 + L \qquad (*)$$

where T(A) = Σ_{i,j} A(i,j)^2 is the similarity scatter and μ_k^2 |S_k|^2 is the contribution of cluster k.

Given S_k, the optimal μ_k = a(S_k), so the contribution is μ_k^2 |S_k|^2 = f(S_k)^2; hence the additive extension of AddRem is applicable (see the sketch below).

A similar double-greedy approach to fuzzy clustering: Mirkin, Nascimento 2012.

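A compact sketch of the double-greedy extraction described on the last two slides (my own simplification: the inner search only grows the cluster, and all names are illustrative): on the current residual matrix, grow S while the contribution (a(S)|S|)^2 increases, set μ = a(S), subtract μ within S, and repeat.

```python
import numpy as np

def contrib(A, S):
    """Contribution (a(S)|S|)^2 of cluster S on the (residual) similarity matrix A."""
    idx = sorted(S)
    return (A[np.ix_(idx, idx)].mean() * len(idx)) ** 2

def extract_clusters(A, n_clusters):
    """One-cluster-at-a-time additive clustering: a simplified, grow-only sketch."""
    A = A.astype(float).copy()
    n = len(A)
    result = []
    for _ in range(n_clusters):
        # Inner greedy step: seed with the largest residual entry, grow while the contribution rises.
        S = set(np.unravel_index(np.abs(A).argmax(), A.shape))
        while True:
            gains = {j: contrib(A, S | {j}) for j in set(range(n)) - S}
            best_j = max(gains, key=gains.get) if gains else None
            if best_j is None or gains[best_j] <= contrib(A, S):
                break
            S.add(best_j)
        idx = sorted(int(j) for j in S)
        mu = A[np.ix_(idx, idx)].mean()        # optimal intensity: within-S residual mean
        result.append((idx, round(float(mu), 3)))
        A[np.ix_(idx, idx)] -= mu              # subtract mu within S and reiterate
    return result

# Toy: two planted diagonal blocks of different intensity are recovered one at a time.
A = np.full((6, 6), 0.05)
A[:3, :3] += 0.9
A[3:, 3:] += 0.5
print(extract_clusters(A, 2))   # [([0, 1, 2], 0.95), ([3, 4, 5], 0.55)]
```
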
Different criteria I

• Summary Uniform (Mirkin 1976 in Russian)
  Within-S sum of the similarities A(i,j), shifted by a constant threshold, to be maximized.
  Related to the criteria considered above.
• Summary Modular (Newman 2004)
  Within-S sum of A(i,j) - B(i,j), to be maximized, where
  B(i,j) = A(i,+)A(+,j)/A(+,+)

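A small sketch contrasting the two summary criteria on a toy similarity matrix (the matrix and the candidate subset S are made up): the modular criterion replaces A(i,j) by A(i,j) - B(i,j), with B(i,j) the expected similarity under the row/column totals.

```python
import numpy as np

def within_sum(M, S):
    """Within-S sum of the entries of M."""
    return M[np.ix_(S, S)].sum()

A = np.full((6, 6), 0.1)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0

row = A.sum(axis=1)                      # A(i, +)
B = np.outer(row, row) / A.sum()         # B(i, j) = A(i, +) A(+, j) / A(+, +)
S = [0, 1, 2]

print(within_sum(A, S))                  # within-S sum of A(i, j) as such
print(within_sum(A - B, S))              # within-S sum of A(i, j) - B(i, j): modularity of S
```
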
Different criteria II

• Normalized cut (Shi, Malik 2000): maximize

  A(S,S)/A(S,+) + A(S',S')/A(S',+)

  where S' is the complement of S, and A(S,S), A(S,+) denote summary similarities.

This can be reformulated as minimizing a Rayleigh quotient

$$f(S) = \frac{z^T L(A)\, z}{z^T z}$$

where z is binary and L(A) is the Laplace transformation of A:

$$L(A)(i,j) = \delta(i,j) - \frac{A(i,j)}{\sqrt{A(i,+)\,A(+,j)}}$$

FADDIS: Fuzzy Additive Spectral Clustering

• Spectral: B = the pseudo-inverse Laplacian of A
  - One cluster at a time
• Minimize ||B - ξ^2 u u^T||^2 (one cluster to find)
• Residual similarity: B is replaced by B - ξ^2 u u^T
• Stopping conditions
• Equivalently: a Rayleigh quotient to maximize, max u^T B u / u^T u
  [this follows from the model, in contrast to the very popular, yet purely heuristic, approach of Shi and Malik 2000]
• Experimentally demonstrated to be competitive on:
  - ordinary graphs for community detection
  - conventional (dis)similarity data
  - affinity data (kernel transformations of feature-space data)
  - in-house synthetic data generators

Competitive at:
• Community detection in ordinary graphs
• Conventional similarity data
• Affinity similarity data
• Lapin-transformed similarity data:
  D = diag(B 1_N)
  L = I - D^{-1/2} B D^{-1/2}
  L+ = pinv(L)
• There are examples at which Lapin does not work

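A direct transcription of the Lapin (Laplacian pseudo-inverse) transformation listed above, in NumPy (a sketch; the toy similarity matrix B is made up):

```python
import numpy as np

def lapin(B):
    """Lapin: normalized Laplacian of B followed by the Moore-Penrose pseudo-inverse."""
    d = B.sum(axis=1)                                # D = diag(B 1_N)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(B)) - D_isqrt @ B @ D_isqrt       # L = I - D^{-1/2} B D^{-1/2}
    return np.linalg.pinv(L)                         # L+ = pinv(L)

# Toy similarity with two blocks: in L+, within-block entries come out larger
# than between-block ones, which a subsequent clustering step can exploit.
B = np.full((6, 6), 0.1)
B[:3, :3] = 1.0
B[3:, 3:] = 1.0
print(np.round(lapin(B), 2))
```
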
Example at which Lapin does work, but the square-error approach does not

Conclusion

• Clustering is still far from being a mathematical theory; however, it is getting meatier, with Gaussian kernels bringing in distributions and the Laplacian transformation bringing in dynamics.
• To make it into a theory, there is still a way to go:
  - Modeling dynamics
  - Compatibility of multiple data and metadata
  - Interpretation