Clustering - University of Kentucky

Download Report

Transcript Clustering - University of Kentucky

Subspace Clustering/Biclustering
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
The UNIVERSITY
of Mining,
KENTUCKY
CS685 : Special
Topics in Data
UKY
Data Mining: Clustering
k
 dist( x
t 1 ict
K-means clustering minimizes
Where dist ( x , c
i
2
t
)
m
 (x
j 1
ij
i
, ct ) 2
 ctj ) 2
CS685 : Special Topics in Data Mining, UKY
Clustering by Pattern Similarity
(p-Clustering)
• The micro-array “raw” data shows 3 genes and their
values in a multi-dimensional space
– Parallel Coordinates Plots
– Difficult to find their patterns
• “non-traditional”
clustering
3
CS685 : Special Topics in Data Mining, UKY
Clusters Are Clear After
Projection
4
CS685 : Special Topics in Data Mining, UKY
Motivation
• E-Commerce: collaborative filtering
Movie
1
Movie
2
Movie
3
Movie
4
Viewer 1
1
2
4
3
Viewer 2
4
Viewer 3
2
3
4
6
Viewer 4
3
4
5
7
Viewer 5
6
5
5
5
Movie
5
Movie
6
Movie
7
5
7
3
1
3
4
CS685 : Special Topics in Data Mining, UKY
rating
Motivation
8
7
6
5
4
3
2
1
0
m
viewer 1
viewer 2
viewer 3
viewer 4
viewer 5
ie
ov
1
m
ie
ov
2
m
ie
ov
3
m
ie
ov
4
m
ie
ov
6
5
m
ie
ov
6
m
ie
ov
7
CS685 : Special Topics in Data Mining, UKY
Motivation
Movie
1
Movie
2
Movie
3
Movie
4
Viewer 1
1
2
4
3
Viewer 2
4
Viewer 3
2
3
4
6
Viewer 4
3
4
5
7
Viewer 5
6
5
5
7
Movie
5
Movie
6
Movie
7
5
7
3
1
3
4
CS685 : Special Topics in Data Mining, UKY
Motivation
8
rating
6
viewer 1
4
viewer 3
viewer 4
2
0
movie 1
movie 2
movie 4
8
movie 6
CS685 : Special Topics in Data Mining, UKY
Gene Expression Data
CS685 : Special Topics in Data Mining, UKY
Biclustering of Gene Expression
Data
• Genes not regulated under all conditions
• Genes regulated by multiple
factors/processes concurrently
• Key to determine function of genes
• Key to determine classification of conditions
CS685 : Special Topics in Data Mining, UKY
Motivation
• DNA microarray analysis
CH1I
CH1B
CH1D
CH2I
CH2B
CTFC3
4392
284
4108
280
228
VPS8
401
281
120
275
298
EFB1
318
280
37
277
215
SSA1
401
292
109
580
238
FUN14
2857
285
2576
271
226
SP07
228
290
48
285
224
MDM10
538
272
266
277
236
CYS3
322
288
41
278
219
DEP1
312
272
40
273
232
NTG1
329
296
33
274
228
11
CS685 : Special Topics in Data Mining, UKY
strength
Motivation
450
400
350
300
250
200
150
100
50
0
CH1I
CH1D
CH2B
condition
12
CS685 : Special Topics in Data Mining, UKY
Motivation
• Strong coherence exhibits by the selected
objects on the selected attributes.
– They are not necessarily close to each other but
rather bear a constant shift.
– Object/attribute bias
• bi-cluster
13
CS685 : Special Topics in Data Mining, UKY
Challenges
• The set of objects and the set of attributes are
usually unknown.
• Different objects/attributes may possess
different biases and such biases
– may be local to the set of selected
objects/attributes
– are usually unknown in advance
• May have many unspecified entries
14
CS685 : Special Topics in Data Mining, UKY
What’s Biclustering?
• Given an n x m matrix, A, find a set of
submatrices, Bk, such that the contents of
each Bk follow a desired pattern
• Row/Column order need not be consistent
between different Bks
CS685 : Special Topics in Data Mining, UKY
Bipartite Graphs
• Matrix can be thought of as a Graph Rows are
one set of vertices L, Columns are another set
R
• Edges are weighted by the corresponding
entries in the matrix If all weights are binary,
biclustering becomes biclique finding
CS685 : Special Topics in Data Mining, UKY
Bicluster Structures
CS685 : Special Topics in Data Mining, UKY
Previous Work
• Subspace clustering
– Identifying a set of objects and a set of
attributes such that the set of objects are
physically close to each other on the subspace
formed by the set of attributes.
• Collaborative filtering: Pearson R
– Only considers global offset of each
object/attribute.
 (o1  o1 )(o2  o2 )
2
2
(
o

o
)

(
o

o
)
 1 1  2 2
18
CS685 : Special Topics in Data Mining, UKY
bi-cluster
• Consists of a (sub)set of objects and a (sub)set
of attributes
– Corresponds to a submatrix
– Occupancy threshold 
• Each object/attribute has to be filled by a certain
percentage.
– Volume: number of specified entries in the
submatrix
– Base: average value of each object/attribute (in
the bi-cluster)
19
CS685 : Special Topics in Data Mining, UKY
bi-cluster
CH1I
CH1B
CH1D
CH2I
CH2B
Obj base
CTFC3
VPS8
401
120
298
273
EFB1
318
37
215
190
322
41
219
194
347
66
20
244
219
SSA1
FUN14
SP07
MDM10
CYS3
DEP1
NTG1
Attr base
CS685 : Special Topics in Data Mining, UKY
Bicluster Structures
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Example:
Back.: 5
Col 0: 1
Col 1: 3
Col 2: 2
Row 0: 2
8
10
9
Row 1: 4
10
12
11
Row 2: 1
7
9
8
• Correlation between any two columns = correlation between
any two rows = 1.
• aij = aiJ + aIj – aIJ, where aiJ = mean of row i, aIj = mean of
column j, aIJ = mean of A.
• Biological meaning: the genes have the same (amount of)
response to the conditions.
22
CS685 : Special Topics in Data Mining, UKY
bi-cluster
• Perfect -cluster
d ij  d iJ  d Ij  d IJ
d ij  d Ij  d iJ  d IJ
diJ
dij
d ij  d iJ  d Ij  d IJ
• Imperfect -cluster
dIJ
dIj
– Residue:

dij  diJ  d Ij  d IJ , dij is specified
rij 
0,
dij is unspecifie d
23
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
•
Model:
o A bicluster is represented the submatrix A of the whole
expression matrix (the involved rows and columns need
not be contiguous in the original matrix).
o Each entry Aij in the bicluster is the superposition
(summation) of:
o The background level
o The row (gene) effect
o The column (condition) effect
o A dataset contains a number of biclusters, which are not
necessarily disjoint.
24
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Finding the largest -bicluster:
– The problem of finding the largest square bicluster (|I| = |J|) is NP-hard.
– Objective function for heuristic methods (to
minimize):
1
H (I , J ) 
I J
2
(
a

a

a

a
)
 ij iJ Ij IJ
iI , jJ
=> sum of the components from each row and
column, which suggests simple greedy
algorithms to evaluate each row and column
independently.
25
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
•
Greedy methods:
– Algorithm 0: Brute-force deletion (skipped)
– Algorithm 1: Single node deletion
• Parameter(s):  (maximum squared residue).
• Initialization: the bicluster contains all rows and
columns.
• Iteration:
1. Compute all aIj, aiJ, aIJ and H(I, J) for reuse.
2. Remove a row or column that gives the maximum decrease
of H.
•
•
Termination: when no action will decrease H or H <=
.
Time complexity: O(MN)
26
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
•
Greedy methods:
– Algorithm 2: Multiple node deletion (take one more
parameter . In iteration step 2, delete all rows and
columns with row/column residue > H(I, J)).
– Algorithm 3: Node addition (allow both additions and
deletions of rows/columns).
27
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Handling missing values and masking
discovered biclusters: replace by random
numbers so that no recognizable structures
will be introduced.
• Data preprocessing:
– Yeast: x  100log(105x)
– Lymphoma: x  100x (original data is already logtransformed)
28
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Some results on yeast cell cycle data
(288417):
29
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Some results on lymphoma data (402696):
No. of genes, no. of
conditions
4, 96
10, 29
103, 25 127, 13
11, 25
13, 21
10, 57
2, 96
25, 12
9, 51
3, 96
2, 96
30
CS685 : Special Topics in Data Mining, UKY
Cheng and Church
• Discussion:
– Biological validation: comparing with the clusters
in previously published results.
– No evaluation of the statistical significance of the
clusters.
– Both the model and the algorithm are not tailored
for discovering multiple non-disjoint clusters.
– Normalization is of utmost importance for the
model, but this issue is not well-discussed.
31
CS685 : Special Topics in Data Mining, UKY
The FLOC algorithm
Generating initial clusters
Determine the best action for
each row and each column
Perform the best action of each
row and column sequentially
Improved?
Y
N
CS685 : Special Topics in Data Mining, UKY
The FLOC algorithm
• Action: the change of membership of a row(or
column) with respect to a cluster
column
1
2
M=4
3
4
1
3
4
2
2
2
1
3
2
3
3
4
2
0
4
row
N=3
33
M+N actions are
Performed at
each iteration
CS685 : Special Topics in Data Mining, UKY
Performance
• Microarray data: 2884 genes, 17 conditions
– 100 bi-clusters with smallest residue were
returned.
– Average residue = 10.34
• The average residue of clusters found via the state of
the art method in computational biology field is 12.54
– The average volume is 25% bigger
– The response time is an order of magnitude faster
34
CS685 : Special Topics in Data Mining, UKY
Coherent Cluster
Want to accommodate noises but not outliers
35
CS685 : Special Topics in Data Mining, UKY
Coherent Cluster
• Coherent cluster
– Subspace clustering
• pair-wise disparity
– For a 22 (sub)matrix consisting of objects {x,
y} and attributes {a, b}
  d xa
D 
 d ya

d xb  


d yb  
dxa
 (d xa  d ya )  (d xb  d yb )
mutual bias
of attribute a
mutual bias
of attribute b
36
x d
z ya
y
dxb
dyb
a
b
attribute
CS685 : Special Topics in Data Mining, UKY
Coherent Cluster
– A 22 (sub)matrix is a -coherent cluster if its D
value is less than or equal to .
– An mn matrix X is a -coherent cluster if every
22 submatrix of X is -coherent cluster.
• A -coherent cluster is a maximum -coherent cluster if
it is not a submatrix of any other -coherent cluster.
– Objective: given a data matrix and a threshold ,
find all maximum -coherent clusters.
37
CS685 : Special Topics in Data Mining, UKY
Coherent Cluster
• Challenges:
– Finding subspace clustering based on distance
itself is already a difficult task due to the curse of
dimensionality.
• The (sub)set of objects and the (sub)set of attributes
that form a cluster are unknown in advance and may
not be adjacent to each other in the data matrix.
– The actual values of the objects in a coherent
cluster may be far apart from each other.
• Each object or attribute in a coherent cluster may bear
some relative bias (that are unknown in advance) and
such bias may be local to the coherent cluster.
38
CS685 : Special Topics in Data Mining, UKY
References
• J. Young, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing
subspace correlation in a large data set, Proceedings of the 18th
IEEE International Conference on Data Engineering (ICDE), pp. 517528, 2002.
• H. Wang, W. Wang, J. Young, P. Yu, Clustering by pattern similarity
in large data sets, to appear in Proceedings of the ACM SIGMOD
International Conference on Management of Data (SIGMOD),
2002.
• Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced
pClustering and its applications to gene expression data
Bioinformatics and Bioengineering, 2004.
• J. Liu and W. Wang, OP-Cluster: clustering by tendency in high
dimensional space, ICDM’03.
39
CS685 : Special Topics in Data Mining, UKY