
A Hybrid Method for Categorical Clustering
Student: 吳宗和
Advisor: Prof. J. Hsiang
Date: 7/4/2006
Outline

- Introduction
- Motivation
- Related Work
- Research Goal / Notation
- Our Method
- Experiments
- Conclusion
- Future Work
Introduction

Data clustering partitions a set of data elements into groups so that elements in the same group are similar, while elements in different groups are dissimilar [1].

Data elements can be:
- Recorded as numerical values (e.g., degree: 0°~360°).
- Recorded as categorical values (e.g., sex: {Male, Female}).

The similarity measure must be well defined by the developer.
Goal: find the latent structure in the dataset.

Clustering methods proposed for numerical data do not fit categorical data, because they lack a proper measure of similarity [21].
Example: an instance of a movie database.

Instance         | Director        | Actor         | Genre                   | Partition 1 | Partition 2
X-Men III        | Brett Ratner    | Hugh Jackman  | Action Adventure        | C1          | G1
Superman Returns | Bryan Singer    | Brandon Routh | Action Adventure        | C2          | G1
X-Men II         | Bryan Singer    | Hugh Jackman  | Action Adventure        | C1          | G1
Spiderman III    | Sam Raimi       | Tobey Maguire | Science Fiction Fantasy | C3          | G2
Van Helsing      | Stephen Sommers | Hugh Jackman  | Action Adventure        | C1          | G1

Attribute domains:
- Director: {Brett Ratner, Bryan Singer, Sam Raimi, Stephen Sommers}
- Actor: {Hugh Jackman, Brandon Routh, Tobey Maguire}
- Genre: {Action Adventure, Science Fiction Fantasy}
Outline

- Introduction
- Motivation
- Related Work
- Research Goal / Notation
- Our Method
- Experiments
- Conclusion
- Future Work
Motivation

Needs:
- Databases: much of the data stored in databases is categorical (described by people).
- Web navigation: web documents become categorical data after feature selection.
- Knowledge discovery: find the latent structure in the data.

Difficulties in clustering categorical data:
- The categorical values an element can take are not ordered, so there is no intuitive measure of distance [21].
- Methods designed specifically for clustering categorical data are not easy to use.
Motivation (cont')

Problems in related work:
- Hard to understand or use.
- Parameters are data-sensitive.
- Most of them CANNOT decide the number of groups while clustering [3,13,17,18].

Our guess/assumption: do we need an all-new method?
- We can reuse the methods proposed for numerical data (even though their measures of distance do not fit categorical data).
- What is really NEEDED is only a good measure of similarity/distance between two groups.
- REUSE the framework for clustering numerical data (intuitive approach, fewer parameters, less sensitivity to the data).
- Obtain a better clustering result.
Related Work

Partition based:
- K-modes [Z. Huang, 1998]
- Monte-Carlo algorithm [Tao Li et al., 2003]
- COOLCAT [Barbara, D., Couto, J., & Li, Y., 2002]

Agglomerative based:
- ROCK [Guha, S., Rastogi, R., & Shim, K., 2000]
- CACTUS [Ganti, V., Gehrke, J., & Ramakrishnan, R., 1999]
- Best-K (ACE) [Keke Chen, Ling Liu, 2005]
Summary of related work

Method                | Local minimum problem | Parameters to decide | Number of groups decided by
K-Modes               | Yes                   | No                   | Human
Monte-Carlo algorithm | No                    | No                   | Human
COOLCAT               | Yes                   | No                   | Human
ROCK                  | Yes                   | Hard                 | Human
CACTUS                | Yes                   | Hard                 | Human
Best-K (ACE)          | Yes                   | No                   | Machine (quality is not good)
Outline

- Introduction
- Motivation
- Related Work
- Research Goal / Notation
- Our Method
- Experiments
- Conclusion
- Future Work
Research Goal

Propose a method with:
- A measure of similarity between two groups.
- Reuse of the gravitation concept (for intuition and fewer parameters).
- A machine-made decision of the proper number of groups.
- A step that strengthens the clustering result (for a better result).
Notation

- D: a data set of n elements p1, p2, p3, ..., pn, with |D| = n.
- Each element pi is a d-dimensional vector of categorical attributes, ∀i, i.e. pi = <ai1, ai2, ..., aid>, where aij ∈ Aj and Aj is the finite set of all possible values aij can take.
- In the related work, K is an integer given by the user; the elements are divided into K groups G1, G2, ..., GK such that ∪Gi = D and ∀i,j, i ≠ j, Gi ∩ Gj = ∅.
Outline

- Introduction
- Motivation
- Related Work
- Research Goal
- Our Method
  - A measure of similarity
  - Step 1: Gravitation model
  - Step 2: Strengthen the result
- Experiments
- Conclusion
- Future Work
Our method

One similarity function, two steps.

- Define a measure of similarity between two groups, and use this similarity as the measure of distance.
- Step 1: Use the concept of the gravitation model to find the most suitable number of groups K.
- Step 2: Overcome local optima and refine the result.
The intuition of similarity

Based on the structure of the group:
- Each group has its own probability distribution over each attribute, which can be taken as the group's structure.
- Two groups are similar if they have very similar probability distributions over each attribute.

[Figure: attribute value distributions of groups Gi and Gj, compared by Similarity(Gi, Gj).]
The similarity function

A group Gi = <A1, A2, A3, ..., Ad>, where each Ar is a random variable for r = 1..d, and p(Ar = v | Gi) is the probability that Ar takes value v in Gi.

Entropy of Ar in group Gi:
  entropy(Ar) = − Σ_{v∈Ar} p(Ar = v | Gi) log p(Ar = v | Gi)

Entropy of Gi:
  E(Gi) = Σ_{r=1..d} entropy(Ar)

Similarity between two groups (a small computation sketch follows):
  Sim(Gi, Gj) = |Gi ∪ Gj| · E(Gi ∪ Gj) − |Gi| · E(Gi) − |Gj| · E(Gj)
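A minimal sketch of this computation (my own illustration, not the thesis code), assuming a group is represented as a Python list of equal-length tuples of categorical values:

```python
from collections import Counter
from math import log

def group_entropy(group):
    """E(G): sum, over the d attributes, of the entropy of that
    attribute's value distribution inside the group."""
    n, d = len(group), len(group[0])
    total = 0.0
    for r in range(d):
        counts = Counter(row[r] for row in group)
        total -= sum((c / n) * log(c / n) for c in counts.values())
    return total

def sim(g_i, g_j):
    """Sim(Gi, Gj) = |Gi ∪ Gj|·E(Gi ∪ Gj) − |Gi|·E(Gi) − |Gj|·E(Gj).
    Non-negative; used as the distance (radius) between groups."""
    merged = g_i + g_j
    return (len(merged) * group_entropy(merged)
            - len(g_i) * group_entropy(g_i)
            - len(g_j) * group_entropy(g_j))
```

Because merging two groups with identical attribute distributions leaves every per-attribute distribution unchanged, Sim evaluates to 0 in that case, which matches the properties listed on the next slide.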
Geometric analogy

Sim(Gi, Gj) can serve as our geometric analogy of distance, because:
- Sim(Gi, Gj) ≥ 0, ∀i,j.
- Sim(Gi, Gi) = 0, ∀i.
- Sim(Gi, Gj) = Sim(Gj, Gi), ∀i,j.
Step 1

Use the concept of the gravitation model. At time T, two groups G1 and G2 at distance R(G1, G2) attract each other with force

  fg(G1, G2) = G · M(G1) · M(G2) / R(G1, G2)²

Mass, radius, and time:
- Mass of a group M(Gi) = the number of its elements.
- Radius between two groups R(Gi, Gj) = Similarity(Gi, Gj).
- Time step = ∆T, a constant.

Merging groups yields a clustering tree.
A merging pass

At time T, with fg(G1, G2) = G · M(G1) · M(G2) / R(G1, G2)²:

- accG1 = fg(G1, G2) / M(G1), accG2 = fg(G1, G2) / M(G2)
- Rnew(Gi, Gj) = R(Gi, Gj) − ½ (accGi + accGj) ∆t²
- If Rnew(Gi, Gj) < 0, merge Gi and Gj: the groups have come close enough (radius = 0) to merge, and the sequence of merges builds a clustering tree.

Where:
- M(Gi) = the number of elements in Gi.
- R(Gi, Gj) = Sim(Gi, Gj).
- G = 1.0.
- fg(Gi, Gj) is the gravitational force.
- ∆t is the time period, ∆t = 1.0.

At time T + ∆t the two groups are at the new distance Rnew(G1, G2). (A code sketch of one pass follows.)
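A sketch of one merging pass under these rules, assuming the `sim` function from the similarity slide and `groups` given as a list of groups (each a list of element tuples). This is an illustrative reading of the slide, not the thesis implementation:

```python
def merging_pass(groups, G=1.0, dt=1.0):
    """One pass of the gravitation model: every pair of groups moves
    toward each other under f = G*M1*M2/R^2; pairs whose new radius
    drops to zero or below are merged. Returns the new list of groups."""
    # current radii (Sim is the distance between groups)
    radii = {}
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            radii[(i, j)] = sim(groups[i], groups[j])

    # decide which pairs collapse within this time step
    to_merge = []
    for (i, j), r in radii.items():
        m_i, m_j = len(groups[i]), len(groups[j])
        force = G * m_i * m_j / (r ** 2) if r > 0 else float("inf")
        acc_i, acc_j = force / m_i, force / m_j
        r_new = r - 0.5 * (acc_i + acc_j) * dt ** 2
        if r_new <= 0:
            to_merge.append((i, j))

    # union-find style merge of all collapsing pairs
    parent = list(range(len(groups)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in to_merge:
        parent[find(i)] = find(j)

    merged = {}
    for idx, g in enumerate(groups):
        merged.setdefault(find(idx), []).extend(g)
    return list(merged.values())
```

Repeating `merging_pass` until one group remains produces the sequence of snapshots (the clustering tree) used on the next slides.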
A clustering tree

[Figure: the clustering tree built by successive merges, from the individual elements p1, p2, ..., pn up to a single root cluster C1.]
Suitable K

In the sequence of merging passes, find the suitable number of groups in the partition.

Suppose Step 1 merges everything into one group after T iterations, giving snapshots S1, S2, S3, ..., ST-1, ST, where Si is the partition after the i-th iteration.

- Stable_i = ∆Time(Si, Si+1) × Radius(Si), where Radius(Si) = min{ Sim(Gα, Gβ) : Gα, Gβ in Si, α ≠ β }.
- The suitable K is the number of groups in the Si for which Σ_{r=0..l} Stable(S_{i−r}) is maximal, where l is given by the user. (One possible reading of this rule as code follows.)

In the following slides, we use K to denote this proper number of clusters in the partition.
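A sketch of one way to read this rule (hypothetical helper name, assuming `sim` from earlier; `snapshots` pairs each iteration's time stamp with its list of groups):

```python
def suitable_k(snapshots, l=2):
    """Pick the most stable snapshot in the merge sequence.
    Stable_i = (t_{i+1} - t_i) * Radius(S_i), where Radius(S_i) is the
    smallest pairwise Sim among the groups of S_i; the suitable K is the
    group count of the snapshot whose window sum of Stable values is
    maximal."""
    stable = []
    for (t, groups), (t_next, _) in zip(snapshots, snapshots[1:]):
        radius = min(sim(a, b)
                     for x, a in enumerate(groups)
                     for b in groups[x + 1:])
        stable.append((t_next - t) * radius)

    best_i, best_score = 0, float("-inf")
    for i in range(len(stable)):
        # window of up to l+1 consecutive stability scores ending at i
        score = sum(stable[i - r] for r in range(l + 1) if i - r >= 0)
        if score > best_score:
            best_i, best_score = i, score
    return len(snapshots[best_i][1])
```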
Step 2

Is there still a local minimum problem?
- Yes. We overcome it with a randomized algorithm.

Randomized step:
- Exchanging two elements between two groups at a time [11] is too time-consuming.
- Goal: enlarge the distance/radius between every two groups in the result obtained from Step 1.
- For efficiency, use a tree-structured data structure: the digital search tree.
Step 2 (cont')

Why a digital search tree (D.S.T.)?
- Locality: elements are stored as strings, and deeper nodes look like their siblings and children.
Operations in a D.S.T.

A D.S.T. is tree-structured, stores binary strings, and each node has degree ≤ 2.

Operations cost little, O(d) each:
- Search
- Insert
- Delete

Search operation (O(d)). [Example figure omitted.]
Operations in a D.S.T. (cont')

- Insertion (O(d)). Example: insert C010 into the tree. [Figure omitted.]
- Deletion (O(d)). Example: delete G110. [Figure omitted.]

A minimal code sketch of these operations follows.
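A small digital search tree over fixed-length binary strings, written as an illustration of the O(d) operations (class and method names are mine, not the thesis's):

```python
class DSTNode:
    def __init__(self, key):
        self.key = key              # a fixed-length binary string, e.g. '0101'
        self.child = [None, None]   # child[0] for bit '0', child[1] for bit '1'

class DigitalSearchTree:
    """At depth k the k-th bit of the searched key decides which child to
    follow, so all nodes of a subtree share the bits on the path to that
    subtree's root, and search/insert/delete cost O(d) for d-bit keys."""

    def __init__(self):
        self.root = None

    def search(self, key):
        node, depth = self.root, 0
        while node is not None:
            if node.key == key:
                return node
            node, depth = node.child[int(key[depth])], depth + 1
        return None

    def insert(self, key):
        if self.root is None:
            self.root = DSTNode(key)
            return
        node, depth = self.root, 0
        while node.key != key:                      # stop if key already stored
            bit = int(key[depth])
            if node.child[bit] is None:
                node.child[bit] = DSTNode(key)
                return
            node, depth = node.child[bit], depth + 1

    def delete(self, key):
        # find the node holding `key`, remembering its parent
        parent, node, depth = None, self.root, 0
        while node is not None and node.key != key:
            parent, node = node, node.child[int(key[depth])]
            depth += 1
        if node is None:
            return False
        # walk down to any leaf of the node's subtree
        leaf_parent, leaf = None, node
        while leaf.child[0] is not None or leaf.child[1] is not None:
            leaf_parent = leaf
            leaf = leaf.child[0] if leaf.child[0] is not None else leaf.child[1]
        if leaf is node:                            # the node itself is a leaf
            if parent is None:
                self.root = None
            else:
                parent.child[0 if parent.child[0] is node else 1] = None
        else:                                       # pull the leaf's key up
            node.key = leaf.key
            leaf_parent.child[0 if leaf_parent.child[0] is leaf else 1] = None
        return True
```

The locality property mentioned on the previous slide follows from the structure: every node in the subtree rooted at some node α shares the bit prefix given by the path from the root to α.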

Randomizing

Transform the result from Step 1 into a forest of K digital search trees:
- The nodes in tree i ≡ the elements in Gi, ∀i.
- Randomly select two trees i and j, and select an internal node α from tree i.
- Let gα = { node n | α is an ancestor of node n } ∪ {α}.
- Then compute (see the sketch after this list):

  GoodExchange(Gi, Gj, gα) = Yes, if Sim(Gi − gα, Gj ∪ gα) > Sim(Gi, Gj);
                             No, otherwise.
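A minimal sketch of the acceptance test, reusing the `sim` function defined earlier and treating Gi, Gj, and gα as lists of element tuples (an element-level approximation of the subtree move; the reconstruction of the garbled formula above is my reading of it):

```python
def good_exchange(g_i, g_j, g_alpha):
    """GoodExchange(Gi, Gj, gα): accept moving the elements of the subtree
    gα out of Gi and into Gj only when that enlarges the radius (Sim)
    between the two groups."""
    remaining = [p for p in g_i if p not in g_alpha]
    return sim(remaining, g_j + list(g_alpha)) > sim(g_i, g_j)
```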
Randomizing (cont')

Example with K = 2 and α = '0101'.

[Figure: the forest of two digital search trees G1 and G2 in Phase 0 and Phase 1; the subtree rooted at α = '0101' is moved from G1 to G2, and the exchange is labeled a good exchange (Sim values 8.985047 and 3.248661 shown).]
Outline

- Introduction
- Motivation
- Related Work
- Research Goal / Notation
- Our Method
- Experiments
  - Datasets
  - Compared methods and criteria
- Conclusion
- Future Work
Experiments

Datasets:
- Real datasets [7]:
  - Congressional Voting Records.
  - Mushroom.
- Document datasets [8,9]:
  - Reuters-21578.
  - 20 Newsgroups.
Compared methods & criteria

Compared methods:
- K-modes (randomly partitions the elements into K groups).
- COOLCAT (selects some elements as seeds, places the seeds into K groups, then places the remaining elements).

Criteria (average of 10 runs; a small computation sketch follows):

- Category Utility = (1/K) Σ_{k=1..K} (|c_k|/n) Σ_{r=1..d} Σ_{v∈Ar} [ p(Ar = v | c_k)² − p(Ar = v)² ]

- Expected Entropy = − Σ_{k=1..K} Σ_{r=1..d} Σ_{v∈Ar} p(Ar = v | c_k) log p(Ar = v | c_k)

- Purity = Σ_{k=1..K} (n_k/n) P(c_k), where P(c_k) = (1/n_k) max_j(n_kj) and n_kj is the number of elements of the j-th input class that were assigned to cluster k.
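A minimal sketch of two of these criteria (function names and data layout are mine; clusters are lists of element tuples, and `cluster_labels[k]` holds the true class labels of the elements assigned to cluster k):

```python
from collections import Counter
from math import log

def purity(cluster_labels):
    """Purity = (1/n) * sum_k max_j(n_kj)."""
    n = sum(len(c) for c in cluster_labels)
    return sum(max(Counter(c).values()) for c in cluster_labels) / n

def expected_entropy(clusters):
    """Sum over clusters, attributes, and values of -p(Ar=v|c_k) log p(Ar=v|c_k)."""
    total = 0.0
    for c in clusters:
        n_k, d = len(c), len(c[0])
        for r in range(d):
            counts = Counter(row[r] for row in c)
            total -= sum((cnt / n_k) * log(cnt / n_k) for cnt in counts.values())
    return total
```

Category utility can be computed analogously by replacing the -p log p term with p(Ar=v|c_k)² − p(Ar=v)², weighting each cluster by |c_k|/n, and averaging over the K clusters.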
Real datasets

Goal: cluster the data the way the human-assigned classes do.

Mushroom dataset:
- 2 classes (edible, poisonous).
- 22 attributes.
- 8124 records.

Method     | # Groups | Category Utility | Expected Entropy | Purity
K-modes    | 22       | 0.60842867       | 20.2             | 95.5%
Coolcat    | 30       | 0.25583642       | 35.1             | 96.2%
Our Method | 18       | 0.74743694       | 19.4             | 99.4%
Real datasets

Congressional Voting Records dataset:
- 2 classes (democrat, republican), 435 records, 16 attributes (all boolean-valued).
- Some records (about 10%) have missing values.

Method     | # Groups | Category Utility | Expected Entropy | Purity
K-modes    | 22       | 0.22505624       | 6.04533891       | 93.4%
Coolcat    | 29       | 0.14361348       | 7.42124111       | 94.6%
Our Method | 14       | 0.3177278        | 6.84627962       | 92.4%
Document datasets

Application to documents.

20 Newsgroups:
- 20 subjects, each containing 1000 articles; 20000 documents in total (too many).
- We select the documents of 3 subjects as the dataset (3000 articles).
- The BOW package [4] is used for feature selection; 100 features.

Method     | # Groups | Expected Entropy | Purity
K-modes    | K = 5    | 3.02019429       | 92.2%
Coolcat    | K = 5    | 3.44101501       | 90.8%
Our Method | K = 5    | 2.93761945       | 95.0%
Document datasets

Reuters-21578:
- 135 subjects, 21578 articles; each subject contains a different number of articles.
- We select the 10 subjects with the most articles.
- The BOW package [4] is used for feature selection; 100 features.

Method     | # Groups | Expected Entropy | Purity
K-modes    | K = 18   | 2.30435514       | 51.8%
Coolcat    | K = 18   | 2.68157959       | 68.3%
Our Method | K = 18   | 0.78800815       | 66.9%
Conclusion

In this work, we proposed a measure of similarity such that:
- The framework used for numerical data clustering can be reused.
- It needs FEWER parameters and is easy to use.
- It tries to avoid getting trapped in local minima.
- It still obtains clustering results as good as or better than methods proposed specifically for categorical data.
Future Work

For the method itself:
- It still takes a long time to compute, because the similarity function has to be adjusted to satisfy the triangle inequality.

For applications:
- Build a framework that clusters documents without relying on other packages.
- Apply our method to the "binding sets" problem in bioinformatics.
References

[1] A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, 1999.
[2] Yen-Jen Oyang, Chien-Yu Chen, Shien-Ching Huang, and Cheng-Fang Lin. Characteristics of a Hierarchical Data Clustering Algorithm Based on Gravity Theory. Technical Report NTUCSIE 02-01.
[3] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 1999.
[4] McCallum, A. K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow.
[5] DIGITAL LIBRARIES: Metadata Resources. http://www.ifla.org/II/metadata.htm
[6] W.E. Wright. A formalization of cluster analysis and gravitational clustering. Doctoral dissertation, Washington University, 1972.
[7] Newman, D.J., Hettich, S., Blake, C.L., & Merz, C.J. (1998). UCI Repository of Machine Learning Databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[8] David D. Lewis. Reuters-21578 text categorization test collection. AT&T Labs - Research, 1997.
[9] Ken Lang. 20 Newsgroups. http://people.csail.mit.edu/jrennie/20Newsgroups/
[10] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
References (cont')

[11] T. Li, S. Ma, and M. Ogihara. Entropy-based criterion in categorical clustering. Proc. of Intl. Conf. on Machine Learning (ICML), 2004.
[12] Bock, H.-H. Probabilistic aspects in cluster analysis. In O. Opitz (Ed.), Conceptual and Numerical Analysis of Data, pp. 12-44. Berlin: Springer-Verlag, 1989.
[13] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.
[14] MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
[15] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication, New York, 1973.
[16] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. Hellenic Database Symposium, 2003.
[17] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. Proc. of ACM SIGKDD Conference, 1999.
[18] D. Barbara, Y. Li, and J. Couto. COOLCAT: An entropy-based algorithm for categorical clustering. Proc. of ACM Conf. on Information and Knowledge Management (CIKM), 2002.
[19] R.C. Dubes. How many clusters are best? An experiment. Pattern Recognition, vol. 20, no. 6, pp. 645-663, 1987.
[20] Keke Chen and Ling Liu. "The 'Best K' for Entropy-based Categorical Clustering." Proc. of Scientific and Statistical Database Management (SSDBM 2005), Santa Barbara, CA, June 2005.
[21] Cheng-Fang Lin and Yen-Jen Oyang. A Study of the Characteristics of a Gravity-Based Two-Phase Hierarchical Data Clustering Algorithm. Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2002.
[24] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, 2000.
Appendix 1

K-modes:
- Extends the k-means algorithm to cluster large data sets with categorical values.
- Uses a "mode" to represent each group (see the sketch below):
  Mode: DQ_i = {x1, x2, x3, ..., xn}, x_j ∈ S(X_j), 1 ≤ i ≤ K
  dist(d_i, DQ_i) = CountDiffAttributes(d_i, DQ_i).
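A minimal sketch of the mode and the distance it uses (my own illustration; elements are tuples of categorical values):

```python
from collections import Counter

def mode_of(group):
    """The mode of a group: for every attribute, its most frequent value."""
    d = len(group[0])
    return tuple(Counter(row[r] for row in group).most_common(1)[0][0]
                 for r in range(d))

def count_diff_attributes(element, mode):
    """dist(d_i, DQ_i): number of attributes on which the element disagrees
    with the group's mode (simple matching distance)."""
    return sum(1 for a, m in zip(element, mode) if a != m)
```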
Appendix 2

ROCK:

For a boolean attribute j, the values of two points x_i and x_k form a 2x2 contingency table:

             | x_i: A_j = 1 | x_i: A_j = 0
x_k: A_j = 1 |      a       |      b
x_k: A_j = 0 |      c       |      d

- The Jaccard coefficient J_j(x_i, x_k) = a / (a + b + c) measures the closeness of points x_i and x_k on attribute j.
- Define Neighbor(x_i, x_k) = true if J_j(x_i, x_k) ≥ threshold ψ for every attribute j; otherwise Neighbor(x_i, x_k) = false.
- Define Link(p, C_i) = the number of points q in cluster C_i with Neighbor(p, q) = true.
- Best clusters (objective function): maximize Σ_i Σ_{p∈C_i} Link(p, C_i) / |C_i|^{1 + 2 f(ψ)}.
- For a well-defined data set, ψ is 0.5 and f(ψ) = (1 − ψ)/(1 + ψ).
- The threshold ψ determines whether the result is good or not, and it is hard to decide.
- The problem is relaxed to producing many clusters. (A small sketch of the neighbor/link computation follows.)
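A small sketch of the neighbor and link computations. Note this uses the common point-level variant of the Jaccard coefficient over (attribute, value) pairs rather than the per-attribute boolean form on the slide; names and layout are mine:

```python
def jaccard(x_i, x_k):
    """Jaccard coefficient over the (attribute index, value) pairs of two
    elements: |shared pairs| / |all pairs|."""
    s_i, s_k = set(enumerate(x_i)), set(enumerate(x_k))
    return len(s_i & s_k) / len(s_i | s_k)

def is_neighbor(x_i, x_k, psi):
    """Neighbor(x_i, x_k): true when the Jaccard similarity reaches psi."""
    return jaccard(x_i, x_k) >= psi

def link(p, cluster, psi):
    """Link(p, C_i): how many points of the cluster are neighbors of p."""
    return sum(1 for q in cluster if q != p and is_neighbor(p, q, psi))
```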
Appendix 3

CACTUS is an agglomerative algorithm that uses attribute values to group the data.

Definitions:
- Support: Pair(s_ji, s_kt) = #{ (S_j = s_ji) ∧ (S_k = s_kt) }.
- Strong connection: S_conn(s_ji, s_kt) = true if Pair(s_ji, s_kt) > threshold ψ.
- Pairs are extended to sets of attribute values.
- A cluster is defined as a region of attribute values that are pairwise strongly connected.
- The data are viewed from the attributes' side: find k regions of attribute values.
- The choice of the threshold ψ decides whether the result is good or not.
Appendix 4

COOLCAT [2002]:

Use entropy to measure the uncertainty of attribute S_i. If S_i takes values { s_i1, s_i2, s_i3, ..., s_it }, then

  Entropy(S_i) = − Σ_{j=1..t} p(S_i = s_ij) log p(S_i = s_ij)

When S_i is nearly deterministic, Entropy(S_i) → 0: the uncertainty is small.

For a cluster C_k, its entropy is the sum of the entropies of the attributes S_i within it:

  E(C_k) = Σ_{i=1..d} Entropy(S_i)

For a partition P = {C1, C2, ..., Ck}, the expected entropy of P is

  E(P) = Σ_k (|C_k| / m) · E(C_k), where m is the total number of elements.
Appendix 4 (cont')

- A partition P is a perfect clustering when, in every cluster C_i, each attribute takes a single unique value.
- COOLCAT's objective function is to minimize the expected entropy E(P).

Algorithm:
- At the start, COOLCAT picks the k points of the data set S that are most unlike each other, and assigns them to the k clusters as seeds.
- Each remaining point p in S − {assigned points} is assigned to the cluster i whose entropy increases the least.

Properties:
- Very easy to implement, but performance depends strongly on the initial selection.
- Sensitive to the processing order (sequence sensitive).
- Easily trapped in local minima, so the expected entropy found is not minimal.

A small sketch of the greedy placement step follows.
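A simplified sketch of the greedy placement (my own approximation: it minimizes the increase of the per-cluster entropy rather than the size-weighted expected entropy of the whole partition; elements are tuples of categorical values):

```python
from collections import Counter
from math import log

def cluster_entropy(cluster):
    """E(C_k) = sum over attributes of the attribute's entropy in the cluster."""
    n, d = len(cluster), len(cluster[0])
    total = 0.0
    for r in range(d):
        counts = Counter(row[r] for row in cluster)
        total -= sum((c / n) * log(c / n) for c in counts.values())
    return total

def coolcat_assign(points, seeds):
    """Greedy COOLCAT-style placement: the seeds start the k clusters, then
    every remaining point joins the cluster whose entropy grows the least."""
    clusters = [[s] for s in seeds]
    for p in points:
        increases = [cluster_entropy(c + [p]) - cluster_entropy(c)
                     for c in clusters]
        clusters[increases.index(min(increases))].append(p)
    return clusters
```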