Transcript slides

Approximate clusters, biclusters and n-clusters in the analysis of binary and general data matrices

Boris Mirkin (1,2)
(1) National Research University Higher School of Economics, Moscow, RF
(2) Birkbeck, University of London, UK

Supported by:
- Grants from the Research Fund of NRU HSE Moscow (2011-2016)
- International Lab for Decision Analysis and Choice, NRU HSE Moscow (2008-2016)
My text (2011) expressing my views. From Computing Reviews of the ACM, June 27, 2011:
• "Core Concepts in Data Analysis is clean and devoid of any fuzziness. The author presents his theses with a refreshing clarity seldom seen in a text of this sophistication."
Contents
• Quantitative approximation: Lessons for extension to categorical
approximation
• One approximate, anomalous, cluster
• K-Means clustering, anomalous clusters and extensions
• K-Means (kernelized) at Similarity data
• Approximate bicluster
• Approximate tricluster
• Conclusion
Illustrative data: Which college is greatest?
Colleges: Science, Engineering, Arts

Name      Stud   Acad   Sch   DistL   Type
Soliver   3800   437    2     No      MSc
Sembey    5880   360    3     No      MSc
Sixpent   4780   380    3     No      BSc
Etom      3680   279    2     Yes     MSc
Efinen    5140   223    3     Yes     BSc
Enkee     2420   169    2     Yes     BSc
Aywar     4780   302    4     Yes     Certif
Annaty    5440   580    5     Yes     Certif
Which college is greatest? Quantify First
Colleges: Science, Engineering, Arts

Name      Stud   Ac    #S   DisL   T-MS   T-BS   T-Cert
Soliver   3800   437   2    0      1      0      0
Sembey    5880   360   3    0      1      0      0
Sixpent   4780   380   3    0      0      1      0
Etom      3680   279   2    1      1      0      0
Efinen    5140   223   3    1      0      1      0
Enkee     2420   169   2    1      0      1      0
Aywar     4780   302   4    1      0      0      1
Annaty    5440   580   5    1      0      0      1
Which college is greatest? Approximate

X = objects-by-features data matrix, with feature loads F1–F7 and object scores S1–S8 to find:

        F1     F2    F3   F4   F5   F6   F7    Object score
        3800   437   2    0    1    0    0     S1
        5880   360   3    0    1    0    0     S2
        4780   380   3    0    0    1    0     S3
        3680   279   2    1    1    0    0     S4
        5140   223   3    1    0    1    0     S5
        2420   169   2    1    0    1    0     S6
        4780   302   4    1    0    0    1     S7
        5440   580   5    1    0    0    1     S8
Approximate I
• X – object-to-feature data matrix
• S – object scores, to find
• F – feature loadings, to find
X(i,k)=S(i)F(k) + E(i,k)
X=S*F+E
(entry)
(matrix)
Find S, F minimizing residuals E(i,k):
min
𝑖,𝑘 𝐸
2
𝑖, 𝑘 =
𝒊,𝒌(𝑿
𝒊, 𝒌 − 𝑺 𝒊 ∗ 𝑭(𝒌))𝟐
Approximate II
X(i,k)=S(i)F(k) + E(i,k)
X=S*F+E
Find S, F to minimize residuals E(i,k):
2
𝟐
𝐸
𝑖,
𝑘
=
(𝑿
𝒊,
𝒌
−
𝑺
𝒊
𝑭(𝒌))
𝑖,𝑘
𝒊,𝒌
(entry)
(matrix)
minE2 =
Solution: First singular triplet (, z, c)
Xc= z, XTz= c,  > 0,  maximum
S= 1/2z, F= 1/2c
(*)
X 2 = 2 + E2
(**)
Approximate III
With two hidden factors:
X(i,k) = S1(i)F1(k) + S2(i)F2(k) + E(i,k)   (entry)
X = S*F + E                                  (matrix)
\min_{S,F} \|E\|^2

Solution: the two first singular triplets (\mu_1, z_1, c_1), (\mu_2, z_2, c_2)

\|X\|^2 = \mu_1^2 + \mu_2^2 + \|E\|^2    (***)

These two factors can be used for data visualization! (A sketch follows.)
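As a concrete illustration (my own sketch, not from the slides; numpy's SVD stands in for the singular triplets, and the plotting lines assume matplotlib), the two-factor fit and the decomposition (***) can be computed as follows:

```python
import numpy as np

def two_factor_approximation(X):
    """Least-squares rank-2 fit X ~ S1*F1' + S2*F2' via the two leading singular triplets."""
    U, mu, Vt = np.linalg.svd(X, full_matrices=False)   # X = U * diag(mu) * Vt
    S = U[:, :2] * np.sqrt(mu[:2])       # object scores S1, S2 (columns)
    F = Vt[:2, :].T * np.sqrt(mu[:2])    # feature loadings F1, F2 (columns)
    E = X - S @ F.T                      # residuals
    # Pythagorean decomposition (***): ||X||^2 = mu1^2 + mu2^2 + ||E||^2
    assert np.isclose((X ** 2).sum(), mu[0] ** 2 + mu[1] ** 2 + (E ** 2).sum())
    return S, F, E

# Usage for visualization, assuming Y is the preprocessed college data matrix:
# S, F, E = two_factor_approximation(Y)
# plt.scatter(S[:, 0], S[:, 1])   # assuming matplotlib.pyplot imported as plt
```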
Data pre-processing
• Centering: subtract from each feature its backdrop (reference) value
• Normalizing: divide each feature by a scale factor to balance the features' “contributions”
• No need for this if all the entries across the table are comparable (averaging is meaningful) or summable (summation is meaningful); a minimal sketch follows.
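A minimal sketch of such pre-processing (my own code; the grand-mean backdrop and range-based scale factors are assumptions consistent with the numbers shown on the following slides):

```python
import numpy as np

def preprocess(X):
    """Center each feature by its grand mean (backdrop) and divide by its range
    (scale factor) to balance feature contributions."""
    X = np.asarray(X, dtype=float)
    backdrop = X.mean(axis=0)                 # backdrop value per feature
    scale = X.max(axis=0) - X.min(axis=0)     # range per feature
    scale[scale == 0] = 1.0                   # guard against constant features
    return (X - backdrop) / scale
```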
[Figure: the first principal component before and after centering]
Red dots – raw data X; red arrow – the PC for X.
Blue dots – centered data Y; blue arrow – the PC for Y.
Circle – the space origin.
Effect: the component's contribution \mu^2 drops from 98.6% to 43.3% of the data scatter.
Why? All PCs must go through the origin.
Data centered and normalized
(taking into account the tripled Type by further relating the three columns on the right to 3)
Normalization: feature balancing

          Stud    Ac      #S      DisL    T-MS    T-BS    T-Cert
Soliver   -0.20    0.23   -0.33   -0.63    0.36   -0.22   -0.14
Sembey     0.40    0.05    0      -0.63    0.36   -0.22   -0.14
Sixpent    0.08    0.09    0      -0.63   -0.22    0.36   -0.14
Etom      -0.23   -0.15   -0.33    0.38    0.36   -0.22   -0.14
Efinen     0.19   -0.29    0       0.38   -0.22    0.36   -0.14
Enkee     -0.60   -0.42   -0.33    0.38   -0.22    0.36   -0.14
Aywar      0.08   -0.10    0.33    0.38   -0.22   -0.22    0.43
Annaty     0.27    0.58    0.67    0.38   -0.22   -0.22    0.43
Visualize I: Features centered, not normalized

[Scatter plot of the eight colleges on the plane of the two first principal components: Soliver, Enkee, Etom, Sixpent, Aywar, Sembey, Efinen, Annaty]
Visualize II: Features centered and normalized (taking into account the tripled Type)

[Scatter plot of the same eight colleges: Annaty, Sembey, Sixpent, Aywar, Soliver, Efinen, Etom, Enkee]
Lessons learnt to extend to clustering
• Preliminary feature centering and normalization is cool: data against a backdrop (centering) plus feature balancing (normalization)
• Approximation with multiplicative-additive hidden factor models is cool: SVD and spectral decompositions, Pearson correlation
• The least-squares criterion is cool: means, variances and Euclidean distances
• One-by-one extraction of hidden factors is cool: easy computation, orthogonality (for 2-way data only), and an additive decomposition of the data scatter over contributions
Binary data (FC contexts)
• Considered quantitative, as admitting both:
  - averaging (the proportions of unities, p(i) or p(j))
  - summation to totals, which makes appropriate specific normalizations involving random-interaction mechanisms and the products p(i)*p(j)
Approximate one cluster I: Binary hidden factor

X(i,k) = S(i)F(k) + E(i,k)   (model)

Find F and BINARY S to minimize the residuals E(i,k):

\min E^2 = \sum_{i,k} E^2(i,k) = \sum_{i,k} \bigl(X(i,k) - S(i)F(k)\bigr)^2
         = \sum_{i \in S} d^2(X(i,:), F) + \sum_{i \notin S} d^2(X(i,:), 0)

S = { i : S(i) = 1 };  d^2(x,y) – squared Euclidean distance
Approximate one cluster II: ANOMALOUS CLUSTER

Find F and BINARY S to minimize

\min E^2 = \sum_{i \in S} d^2(X(i,:), F) + \sum_{i \notin S} d^2(X(i,:), 0)

S = { i : S(i) = 1 };  d^2(x,y) – squared Euclidean distance

Properties:
A. Given cluster S, the optimal F is the within-S gravity center
B. Given center F, assign to S the objects nearer to F than to 0
Preprocess the data by centering to a reference point, typically the grand mean; 0 then is the reference point (a sketch follows).
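A minimal sketch of extracting one anomalous cluster by alternating properties A and B (my own illustrative code; starting from the object farthest from the reference point and stopping when S stabilizes are assumptions in the spirit of the slide):

```python
import numpy as np

def anomalous_cluster(Y):
    """One anomalous cluster on centered data Y (reference point = origin 0).
    Alternates: F = within-S mean (property A); S = objects closer to F than to 0 (property B)."""
    S = {int(np.argmax((Y ** 2).sum(axis=1)))}     # seed: the object farthest from 0
    while True:
        F = Y[sorted(S)].mean(axis=0)                          # property A
        d_to_F = ((Y - F) ** 2).sum(axis=1)
        d_to_0 = (Y ** 2).sum(axis=1)
        S_new = set(np.flatnonzero(d_to_F < d_to_0).tolist())  # property B
        if S_new == S or not S_new:
            return sorted(S), F
        S = S_new
```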
Mathematics of data summarization: Quantitative
K principal components (PCA)
• Y = ZC^T + E
• Y – N × V data matrix, known
• Z – N × K hidden factor scores, unknown
• C – V × K loading matrix
• E – N × V residual matrix
• \min_{Z,C} \|E\|^2
Quantitative data summarization:
K principal components (PCA / SVD)
• Y = SF^T + E
• \min_{S,F} \|E\|^2
• Y = UMV^T + E – SVD (singular value decomposition), M – the diagonal matrix of singular values
• Solution: S = UM^{1/2}, F = VM^{1/2}
Byproduct: Pythagorean decomposition
• \|Y\|^2 = \mu_1^2 + \mu_2^2 + ... + \mu_K^2 + \|E\|^2    (*)
Categorical data summarization:
K-Means clustering
• Y = SF^T + E
• Y – N × V data matrix, known
• S – N × K matrix of 0/1 cluster membership vectors, unknown
• F – V × K matrix of cluster centers
• E – N × V residual matrix
• \min_{S,F} \|E\|^2
SVD-like data recovery clustering model [Mirkin 1987 (Rus), 1990 (Eng)]

Y = SF^T + E
S – N × K 0/1 cluster membership matrix;  F – V × K center matrix;  E – N × V residual matrix
\min_{S,F} [\|E\|^2 = W(S,F)]

Criteria from (*):
Minimize  W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} d^2(y_i, f_k)
or
Maximize  B(S,F) = \sum_{k=1}^{K} |S_k| \langle f_k, f_k \rangle
over the partition S and centers F.

Pythagorean decomposition:  \|Y\|^2 = W(S,F) + B(S,F)    (*)
SVD-like data recovery clustering model [Mirkin 1987 (Rus), 1990 (Eng)]
Y = SF^T + E

At centered data:
Minimize  W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} d^2(y_i, f_k)
or
Maximize (anomalous clusters)  B(S,F) = \sum_{k=1}^{K} |S_k| \langle f_k, f_k \rangle
over the partition S and centers F.
K-Means criterion: Summary distance to cluster centroids

[Illustration: three clouds of points (*), each with its centroid (@)]

Minimize
W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - f_{kv})^2 = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, f_k)
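A compact illustrative implementation of batch K-Means minimizing W(S,F) (my own sketch; the random initial centers here are just a placeholder, since the following slides argue for anomalous-cluster initialization):

```python
import numpy as np

def k_means(Y, K, n_iter=100, seed=0):
    """Alternating minimization of W(S,F): assign each object to its nearest
    center, then move each center to its cluster's mean."""
    rng = np.random.default_rng(seed)
    F = Y[rng.choice(len(Y), size=K, replace=False)]           # initial centers
    for _ in range(n_iter):
        d = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        S = d.argmin(axis=1)                                     # cluster assignments
        F_new = np.array([Y[S == k].mean(axis=0) if np.any(S == k) else F[k]
                          for k in range(K)])
        if np.allclose(F_new, F):
            break
        F = F_new
    d = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    S = d.argmin(axis=1)
    W = ((Y - F[S]) ** 2).sum()                                  # criterion value W(S,F)
    return S, F, W
```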
Advantages of K-Means
- Models typology building
- Simple “data recovery” criterion
- Computationally effective
- Can be utilised incrementally, ‘on-line’
Shortcomings of K-Means
- Initialisation: no advice on K or on initial centroids
- Local search only: deep minima are not guaranteed
- No defence against irrelevant features
How should the number and location of initial centers be chosen?
(Mirkin 1998; Chiang and Mirkin 2010)

Equivalent criterion:
Minimize  W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, f_k)   over S and F,
or
Maximize  D(S,F) = \sum_{k=1}^{K} N_k \langle f_k, f_k \rangle
where N_k is the number of entities in S_k, and \langle f_k, f_k \rangle is the squared Euclidean distance between 0 and f_k.

Data scatter (the sum of squared data entries) = W(S,F) + D(S,F);
the data scatter is constant while partitioning.
How should the number and location of initial centers be chosen? (2)

Maximize  D(S,F) = \sum_{k=1}^{K} N_k \langle f_k, f_k \rangle,  where N_k = |S_k|

Preprocess the data by centering: 0 is the grand mean.
\langle f_k, f_k \rangle is the squared Euclidean distance between 0 and f_k.

Look for anomalous & populated clusters!!!
Further away from the origin: this can be done one cluster at a time (see the sketch below).
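A hedged sketch of such one-by-one initialization (iK-Means style; my own code reusing the hypothetical anomalous_cluster function sketched earlier, with the min_size threshold for discarding small clusters as an assumption):

```python
import numpy as np

def ik_means_init(Y, min_size=2):
    """Extract anomalous clusters one by one from centered data Y; centers of the
    sufficiently populated ones become initial centroids (so K is found automatically)."""
    remaining = np.arange(len(Y))      # indices of not-yet-clustered objects
    centers = []
    while len(remaining) >= min_size:
        S, f = anomalous_cluster(Y[remaining])   # positions within Y[remaining]
        if len(S) >= min_size:
            centers.append(f)                    # keep the anomalous & populated cluster
        remaining = np.delete(remaining, S)      # remove the extracted objects
    return np.array(centers)                     # K = len(centers)
```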
How should the number and location of initial centers be chosen? (3)

Anomalous Cluster (iK-Means) initialization is superior to the methods in the table below (Chiang, Mirkin, 2010), yet as is it leads to too many clusters.

Method                                  Acronym
Calinski and Harabasz index             CH
Hartigan rule                           HK
Gap statistic                           GS
Jump statistic                          JS
Silhouette width                        SW
Consensus distribution area             CD
Average distance between partitions     DD
Square error iK-Means                   LS
Absolute error iK-Means                 LM
Agglomeration of iK-Means clusters & weighting features using Minkowski p-distance
(Amorim, Mirkin, 2012; Amorim, Makarenkov, Mirkin, 2016, to appear)

Feature weights in K-Means:

\sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik} w_v^p |y_{iv} - f_{kv}|^p = \sum_{k=1}^{K} \sum_{i \in S_k} d_{wp}(y_i, f_k)

Ward merging over minima of the Ward distance:

D_w(S_k, S_m) = \frac{|S_k| \cdot |S_m|}{|S_k| + |S_m|} d_{wp}(f_k, f_m)

Works well when noise features are added and feature values are blurred, unlike the other methods (a small sketch follows).
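A small sketch of the two quantities involved (my own code; the feature weights w, exponent p and cluster sizes are assumed given, and this is not the full Minkowski weighted K-Means procedure):

```python
import numpy as np

def d_wp(x, f, w, p):
    """Weighted Minkowski p-th power dissimilarity: sum_v w_v^p * |x_v - f_v|^p."""
    return float(np.sum((w ** p) * np.abs(x - f) ** p))

def ward_distance(f_k, f_m, n_k, n_m, w, p):
    """Ward-style merging criterion between clusters of sizes n_k, n_m with centers f_k, f_m."""
    return (n_k * n_m) / (n_k + n_m) * d_wp(f_k, f_m, w, p)
```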
K-Means kernelized 1

[Illustration: a point x_i and its cluster center c_k]

• K-Means: given a quantitative data matrix, find centers f_k and clusters S_k to minimize
  W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} d^2(x_i, f_k)
• Girolami 2002:
  W(S,F) = \sum_{k=1}^{K} \sum_{i \in S_k} \Bigl( A(i,i) - \frac{2}{|S_k|} \sum_{j \in S_k} A(i,j) + a(S_k) \Bigr)
  where A(i,j) = <x_i, x_j>, so that the kernel trick applies: <x_i, x_j> -> K(x_i, x_j)
• Mirkin 1996, 2012:
  W(S,F) = Const - \sum_{k=1}^{K} \sum_{i,j \in S_k} A(i,j) / |S_k| = Const - G(S)
K-Means kernelized 2
• K-Means equivalent criterion: find the partition {S_1, ..., S_K} to maximize
  G(S_1, ..., S_K) = \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i,j \in S_k} A(i,j) = \sum_{k=1}^{K} a(S_k) |S_k|
  where a(S_k) is the within-cluster mean of A
• Mirkin (1976, 1996, 2012): build the partition {S_1, ..., S_K} by finding one cluster at a time
K-Means kernelized 3
• K-Means equivalent criterion, one cluster S at a time: maximize
  g(S) = a(S)|S|
  where a(S) is the within-cluster mean of A
• AddRemAdd(i) algorithm: add/remove entities one by one
  - Quite competitive against modern community detection algorithms
  - Provably tight clusters: the average similarity between S and j is
      greater than a(S)/2 if j is in S,
      less than a(S)/2 if j is not in S
  - Can be used (for all i) to decide whether there are any clusters at all (a sketch follows)
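A hedged sketch of such add/remove local search for one cluster over a similarity matrix A (my own illustrative code; the exact AddRemAdd(i) rules may differ in detail):

```python
import numpy as np

def add_rem_add(A, i):
    """Greedy local search for one tight cluster seeded at entity i:
    repeatedly add or remove the single entity giving the largest increase of
    g(S) = a(S)|S| = (sum of A over S x S) / |S|."""
    A = np.asarray(A, dtype=float)
    N = len(A)
    in_S = np.zeros(N, dtype=bool)
    in_S[i] = True

    def g(mask):
        idx = np.flatnonzero(mask)
        return A[np.ix_(idx, idx)].sum() / len(idx) if len(idx) else 0.0

    while True:
        current = g(in_S)
        best_gain, best_j = 0.0, None
        for j in range(N):
            trial = in_S.copy()
            trial[j] = not trial[j]            # add j if outside S, remove if inside
            gain = g(trial) - current
            if gain > best_gain:
                best_gain, best_j = gain, j
        if best_j is None:                     # no positive increment: stop
            return np.flatnonzero(in_S)
        in_S[best_j] = not in_S[best_j]
```

Running it for every seed i and keeping the cluster with the largest g(S) corresponds to the "for all i" use mentioned above.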
Single anomalous cluster at temperature map data (Nascimento, Caska, Mirkin 2015) to delineate upwellings

Given temperature map data over pixels i, find a center f and a cluster of pixels S to maximize
g(S, f) = |S| \langle f, f \rangle

[Figure: sea-surface temperature map off Portugal, with ocean, land surface and the delineated upwelling (pink)]
Bicluster (Mirkin, Rostovtsev, 1978) as a prerequisite to our decision tree builder, to get A-homogeneous clusters targeted at B features
• Feature-to-feature association analysis

[Figure: data table with potential input features A (rows) against target features B (columns); a blue box (blob) marks high association values between the A inputs and the B features]
Approximate bicluster at comparable data I: Binary both factor and load

X(i,k) = \lambda u(i) v(k) + E(i,k)
\lambda – intensity, a factor relating the scale of X to the 1/0 scale

Find BINARY u and v to minimize the residuals E(i,k):

\min E^2 = \sum_{i,k} E^2(i,k) = \sum_{i,k} \bigl(X(i,k) - \lambda u(i) v(k)\bigr)^2

At a binary data matrix R = (r(i,k)), a reasonable data transformation is
r'(i,k) = r(i,k) - \pi_0,
where \pi_0 is the scale shift, a soft threshold.
Approximate bicluster, II
• Find a bicluster (U, V) to satisfy
  r'_{ij} = \lambda u_i v_j + e_{ij},   with u_i = 1 if i \in U and u_i = 0 if i \notin U,
  at
  L^2 = \sum_{i,j} e_{ij}^2 = \sum_{i,j} r'^2_{ij} - \lambda^2 |U||V| \to \min
• Equivalent to maximizing the contribution
  g(U,V) = \lambda^2 |U||V| \to \max
  at \lambda = the within-(U,V)-cluster mean of r'_{ij} = density of (U,V) - \pi_0
Algorithm Box(i)
• Start: U = {i}, V = { j : r_{ij} = 1 }
• Find i* (or j*) maximizing the increment of G,
  D(i*) = G(U ± i*, V) - G(U, V),
  and add/remove i* at U (or j* at V)
• Stop if max(D(i*), D(j*)) < 0
Do that for all i, j; take the bicluster maximizing G. (A sketch follows.)
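A minimal Python sketch of this greedy search (my own code, working on the shifted matrix r' = r - pi_0; taking the contribution g = lambda^2|U||V| with lambda the within-box mean, and the stopping details, as assumptions):

```python
import numpy as np

def box(Rp, i):
    """Greedy bicluster search on the shifted matrix Rp = R - pi0, seeded at row i.
    Locally maximizes the contribution g(U, V) = lambda^2 * |U| * |V|, where
    lambda is the mean of Rp over the box U x V."""
    n_rows, n_cols = Rp.shape
    U = np.zeros(n_rows, dtype=bool); U[i] = True
    V = Rp[i] > 0                               # start: columns with unities in row i

    def g(u, v):
        if not u.any() or not v.any():
            return 0.0
        block = Rp[np.ix_(np.flatnonzero(u), np.flatnonzero(v))]
        return block.mean() ** 2 * block.size   # lambda^2 |U||V|

    while True:
        current = g(U, V)
        best_gain, best_move = 0.0, None
        for axis, mask in ((0, U), (1, V)):
            for t in range(len(mask)):
                trial = mask.copy(); trial[t] = not trial[t]    # add/remove one row or column
                gain = (g(trial, V) if axis == 0 else g(U, trial)) - current
                if gain > best_gain:
                    best_gain, best_move = gain, (axis, t)
        if best_move is None:                   # maximal increment < 0: stop
            return np.flatnonzero(U), np.flatnonzero(V)
        axis, t = best_move
        if axis == 0:
            U[t] = not U[t]
        else:
            V[t] = not V[t]
```

Running box(Rp, i) from every seed and keeping the bicluster with the largest contribution corresponds to the final "do that for all i, j" step.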
Good property of the Box algorithm with respect to the proportions of unities in lines

At any resulting bicluster (U, V):
• For all i (and j) from within the bicluster:  D(i) \ge (D + \pi_0)/2
• For all i (and j) outside of the bicluster:  D(i) \le (D + \pi_0)/2
where
• D(i) is the proportion of unities in the i-th line within the bicluster
• D is the density of the bicluster (the proportion of unities in it)
• \pi_0 is the scale shift
At synthetic data
• Compared with the Bimax algorithm (Prelic et al. 2006)
• Data NOT binary
• Context size: 100 × 50
• Number of biclusters: 10
• Noise: N(0, \sigma^2), \sigma = 0, 0.05, ..., 0.4
Relevance and recovery

[Diagram: overlap K between the set of true biclusters and the set of generated biclusters]

Recovery = K / true
Relevance = K / generated
• true – the true (planted) clusters
• generated – the clusters found by the algorithm
Results: Box is better with respect to Recovery
Extending to tri-clusters routinely (joint work with A. Kramarenko, 2011)
• Tricluster: (\lambda, U, V, W)
• One-box model:  r_{ijk} = \lambda u_i v_j w_k + e_{ijk}
• Least-squares criterion (a sketch follows):
  L^2 = \sum_{i,j,k} e_{ijk}^2 = \sum_{i,j,k} r_{ijk}^2 - \lambda^2 |U||V||W| \to \min
Extending to triclusters (illustration)

[Illustration: a titles × tags binary matrix in which a block of unities forms a bicluster; combining it with the tags × genres relation extends it to a titles × tags × genres tricluster]
Good property of the Box algorithm with respect to the proportions of unities in lines

At any resulting tricluster (U, V, W):
• For all i (and j, and k) from within the tricluster:  D(i) \ge (D + \pi_0)/2
• For all i (and j, and k) outside of the tricluster:  D(i) \le (D + \pi_0)/2
where
• D(i) is the proportion of unities in the i-th line within the tricluster
• D is the density of the tricluster (the proportion of unities in it)
• \pi_0 is the scale shift (subtracted)

No similar statement holds for planes (pairs (i,j), (i,k), (j,k)), because there are no plane moves in the Box algorithm.
Triclusters and biclusters on the IMDb database (250 films, 738 tags, 20 genres)

• Tricluster model:
  Titles: The Shawshank Redemption (1994), The Godfather (1972), The Godfather: Part II (1974) ///
  1. Prison | Murder | Friend | Shawshank | Banker /// Crime | Drama
  2. Mafia | Wedding | Lawyer | Violence | Organized Crime /// Crime | Drama | Thriller
  3. Cuba | New York | Business | 1920s | 1950s /// Crime | Drama | Thriller

• Bicluster model:
  Titles: The Shawshank Redemption (1994), The Godfather (1972), The Godfather: Part II (1974) ///
  1. Prison,Crime | Prison,Drama | Murder,Crime | ...
Most contributing tri-clusters, I

Contribution 28.5
  Titles: 'Star Wars: Episode V - The Empire Strikes Back (1980)', 'Star Wars (1977)', 'Star Wars: Episode VI - Return of the Jedi (1983)'
  Tags:   'Rebel', 'Princess', 'Empire', 'Death Star'
  Genres: 'Adventure', 'Action', 'Sci-Fi', 'Fantasy'

Contribution 18.9
  Titles: '12 Angry Men (1957)', 'Double Indemnity (1944)', 'Chinatown (1974)', 'The Big Sleep (1946)', 'Witness for the Prosecution (1957)', 'Dial M for Murder (1954)', 'Shadow of a Doubt (1943)'
  Tags:   'Murder', 'Trial', 'Widow', 'Marriage', 'Private Detective', 'Blackmail', 'Letter'
  Genres: 'Crime', 'Drama', 'Thriller', 'Mystery', 'Film-Noir'
Car makes / Features / Competition, I: Typically the “competition” is better

Makes:       'Audi A4 Sdn', 'Audi S4 Sdn', 'Acura TL Sdn', 'Cadillac CTS Sdn'
Features:    '4-Door', '5 passeng', 'Sedan', '5 bhp'
Competition: 'Acura TL', 'BMW 3-series sedan', 'Lexus IS sedan', 'Mercedes-Benz C-class'
Car makes / Features / Competition, II: Typically the “competition” is better

Makes:       'Audi A5 Cpe', 'Audi S5 Cpe'
Features:    '2-Door', 'Coupe', 'All Wheel Drive', '4 passeng'
Competition: 'Lexus IS sedan', 'BMW 3-series coupe', 'Cadillac CTS coupe', 'Infinity G37 coupe'
Conclusion
• Approximate clustering is a fertile approach, both for theory and practice
• Just a thought: the one-cluster ARA(i) algorithm could serve as a context-generating tool, so that its FCA structure may lead to conclusions about the clarity of the community structure in a network
• Some directions for future work:
  • Modelling dynamics
  • Compatible methods for multiple data and metadata
  • Tri- and n-clustering, specifically involving planes and higher-dimension facets
  • Superclusters and super-n-clusters analogous to rough sets (core and shell) – just started, jointly with I. Rodin