Scalable Clustering of Categorical Data

HKU CS Database Research Seminar
August 11th, 2004
Panagiotis Karras
The Problem
• Clustering is a problem of great importance.
• Partitioning data into groups so that similar objects are grouped together.
• Clustering of numerical data is well treated.
• Clustering of categorical data is more challenging: there is no inherent distance measure.
An Example
• A Movie Relation:
• Distance or similarity between values is not immediately obvious.
Some Information Theory
• The Mutual Information measure is employed.
• Clusters should be informative about the data they contain.
• Given a cluster, we should be able to predict the attribute values of its objects accurately.
• Information loss should be minimized.
An Example
• In the Movie Relation, clustering C is better than clustering D according to this measure (why?).
The Information Bottleneck Method
• Formalized by Tishby et al. [1999].
• Clustering: the compression of one random variable that preserves as much information as possible about another.
• Conditional entropy of A given T:
  $H(A \mid T) = -\sum_{t \in T} p(t) \sum_{a \in A} p(a \mid t) \log p(a \mid t)$
• It captures the uncertainty of predicting the values of A given the values of T.
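A minimal sketch of this quantity in Python (the function name and the array layout, with p(t) as a length-n vector and the rows p(A|t) as an n×d matrix, are assumptions made for the illustration):

```python
import numpy as np

def conditional_entropy(p_t, p_a_given_t):
    """H(A|T) = -sum_t p(t) sum_a p(a|t) log p(a|t), in bits.

    p_t         : shape (n,)   -- the probabilities p(t)
    p_a_given_t : shape (n, d) -- row t holds the distribution p(A|t)
    """
    # Replace zeros by 1 before taking the log; those terms contribute 0 anyway.
    safe = np.where(p_a_given_t > 0, p_a_given_t, 1.0)
    return -float(np.sum(p_t[:, None] * p_a_given_t * np.log2(safe)))
```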
The Information Bottleneck Method
• Mutual Information quantifies the amount of information that variables convey about each other [Shannon, 1948]:
  $I(T;A) = H(T) - H(T \mid A) = H(A) - H(A \mid T)$
The Information Bottleneck Method
• A set of n tuples on m attributes.
• Let d be the size of the set of all possible attribute values.
• Then the data can be conceptualized as an n×d matrix M.
• M[t,a] = 1 iff tuple t contains value a.
• The rows of the normalized M contain the conditional probability distributions p(A|t).
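As a concrete illustration (the toy tuples below are invented for the sketch, not the paper's exact Movie relation), M and its row-normalized form can be built as follows; with a uniform p(t), the rows feed directly into the entropy sketches above:

```python
import numpy as np

# Hypothetical toy relation; the attribute values are placeholders.
tuples = [
    {"director": "Scorsese",  "actor": "De Niro", "genre": "Crime"},
    {"director": "Coppola",   "actor": "De Niro", "genre": "Crime"},
    {"director": "Hitchcock", "actor": "Grant",   "genre": "Thriller"},
]
values = sorted({v for t in tuples for v in t.values()})
col = {v: j for j, v in enumerate(values)}          # one column per attribute value

M = np.zeros((len(tuples), len(values)))
for i, t in enumerate(tuples):
    for v in t.values():
        M[i, col[v]] = 1.0                          # M[t,a] = 1 iff tuple t contains value a

M_norm = M / M.sum(axis=1, keepdims=True)           # each row is now p(A | t)
p_t = np.full(len(tuples), 1.0 / len(tuples))       # uniform p(t)
```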
The Information Bottleneck Method
• In the Movie Relation example, each row of the normalized matrix gives p(A|t) for one movie tuple.
The Information Bottleneck Method
• Clustering is the problem of maximizing the Mutual Information I(A;C) between attribute values and cluster identities, for a given number k of clusters [Tishby et al. 1999].
• Finding the optimal clustering is NP-complete.
• The Agglomerative Information Bottleneck (AIB) algorithm was proposed by Slonim and Tishby [1999].
• It starts with n clusters and reduces their number by one at each step, so that the Information Loss in I(A;C) is minimized.
LIMBO Clustering
• scaLable InforMation Bottleneck.
• Keeps only sufficient statistics in memory.
• Compact summary model.
• Clustering based on the model.
What is a DCF?
• A cluster is summarized in a Distributional Cluster Feature (DCF):
  $DCF(c) = \big(\, p(c),\; p(A \mid c) \,\big)$
• A pair of the probability of cluster c and the conditional probability distribution of the attribute values given c.
• The distance between DCFs is defined as the Information Loss incurred by merging the corresponding clusters, computed via the Jensen-Shannon divergence (sketched below).
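A hedged sketch of this distance, representing a DCF simply as a pair (p(c), p(A|c)) with the conditional distribution held in a numpy array; the merged DCF and the weighted Jensen-Shannon form of the loss are written out explicitly:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def merge_dcfs(dcf1, dcf2):
    """DCF of the merged cluster c* = c1 ∪ c2:
    p(c*) = p(c1) + p(c2), and p(A|c*) is the probability-weighted mixture."""
    (p1, d1), (p2, d2) = dcf1, dcf2
    p = p1 + p2
    return p, (p1 * d1 + p2 * d2) / p

def merge_distance(dcf1, dcf2):
    """Information loss incurred by merging two clusters:
    (p(c1) + p(c2)) times the weighted Jensen-Shannon divergence of p(A|c1), p(A|c2)."""
    (p1, d1), (p2, d2) = dcf1, dcf2
    w = p1 + p2
    pi1, pi2 = p1 / w, p2 / w
    mix = pi1 * d1 + pi2 * d2                    # the merged conditional p(A | c1 ∪ c2)
    return w * (pi1 * kl(d1, mix) + pi2 * kl(d2, mix))
```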
The DCF Tree
• Height-balanced tree with branching factor B.
• The DCFs at the leaves define the clustering of the tuples.
• Non-leaf nodes merge the DCFs of their children.
• A compact hierarchical summarization of the data.
The LIMBO algorithm
• Three phases.
• Phase 1: Insertion into the DCF tree.
  o Each tuple t is converted into DCF(t).
  o DCF(t) follows a path downward in the tree, along the closest non-leaf DCFs.
  o At the leaf level, let DCF(c) be the entry closest to DCF(t).
  o If there is an empty entry in the leaf of DCF(c), DCF(t) is placed there.
  o If there is no empty entry but there is sufficient free space, the leaf is split into two halves, with the two farthest DCFs as seeds for the new leaves. The split moves upward as necessary.
  o Else, if there is no space, the two closest DCF entries in {leaf, t} are merged (see the sketch below).
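A minimal sketch of the leaf-level decision, reusing `merge_dcfs` and `merge_distance` from the DCF sketch; here a leaf is just a list of DCFs with a fixed capacity, and splitting and upward propagation are omitted:

```python
def insert_into_leaf(leaf, dcf_t, capacity):
    """Phase 1, leaf level only (sketch): place DCF(t) in an empty entry if one
    exists, otherwise merge the two closest DCFs among the leaf entries and t."""
    if len(leaf) < capacity:
        leaf.append(dcf_t)                       # an empty entry is available
        return
    candidates = leaf + [dcf_t]
    i, j = min(((i, j) for i in range(len(candidates))
                       for j in range(i + 1, len(candidates))),
               key=lambda ij: merge_distance(candidates[ij[0]], candidates[ij[1]]))
    merged = merge_dcfs(candidates[i], candidates[j])
    leaf[:] = [c for k, c in enumerate(candidates) if k not in (i, j)] + [merged]
```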
The LIMBO algorithm
• Phase 2: Clustering.
  o For a given value of k, the DCF tree is used to produce k DCFs that serve as representatives of k clusters, employing the Agglomerative Information Bottleneck algorithm.
• Phase 3: Associating tuples with clusters.
  o A scan over the data set is performed and each tuple is assigned to the closest cluster (both phases are sketched below).
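The same helpers give a naive sketch of Phases 2 and 3 (a straightforward merge loop for the illustration, not the paper's actual implementation):

```python
def phase2_agglomerative_ib(leaf_dcfs, k):
    """Repeatedly merge the pair of DCFs with the smallest information loss
    until only k representatives remain."""
    dcfs = list(leaf_dcfs)
    while len(dcfs) > k:
        i, j = min(((i, j) for i in range(len(dcfs))
                           for j in range(i + 1, len(dcfs))),
                   key=lambda ij: merge_distance(dcfs[ij[0]], dcfs[ij[1]]))
        merged = merge_dcfs(dcfs[i], dcfs[j])
        dcfs = [c for idx, c in enumerate(dcfs) if idx not in (i, j)] + [merged]
    return dcfs

def phase3_assign(tuple_dcfs, representatives):
    """Assign each tuple (given as DCF(t)) to the representative whose merge
    with it causes the least information loss."""
    return [min(range(len(representatives)),
                key=lambda r: merge_distance(t, representatives[r]))
            for t in tuple_dcfs]
```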
Intra-Attribute Value Distance
• How do we define the distance between categorical values of the same attribute?
• Values should be placed within a context.
• Similar values appear in similar contexts.
• What is a suitable context?
• The distribution an attribute value induces on the remaining attributes.
Intra-Attribute Value Distance
• The distance between two values is then defined as the Information Loss incurred about the other attributes if we merge these values (see the sketch below).
• In the Movie example, Scorsese and Coppola are the most similar directors.
• Distance between tuples = sum of the distances between their attribute values.
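A sketch of this value-to-value distance on top of the 0/1 matrix M built earlier; the column bookkeeping and the frequency-based estimate of p(v) are assumptions made for the illustration, not necessarily the paper's exact normalization:

```python
def value_context(M, value_col, other_cols):
    """The distribution that one attribute value induces on the values of the
    remaining attributes (its context)."""
    rows = M[:, value_col] > 0                   # tuples containing this value
    ctx = M[rows][:, other_cols].sum(axis=0)
    return ctx / ctx.sum()

def value_distance(M, col_v, col_w, other_cols):
    """Information loss from merging two values of the same attribute,
    via merge_distance() from the DCF sketch; p(v), p(w) are estimated here
    from relative frequencies (an assumption of this sketch)."""
    n = M.shape[0]
    p_v = float((M[:, col_v] > 0).sum()) / n
    p_w = float((M[:, col_w] > 0).sum()) / n
    return merge_distance((p_v, value_context(M, col_v, other_cols)),
                          (p_w, value_context(M, col_w, other_cols)))
```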
Experiments - Algorithms
• Four algorithms are compared:
• ROCK: an agglomerative algorithm by Guha et al. [1999].
• COOLCAT: a scalable non-hierarchical algorithm, the most similar to LIMBO, by Barbará et al. [2002].
• STIRR: a dynamical-systems approach using a hypergraph of weighted attribute values, by Gibson et al. [1998].
• LIMBO: in addition to the space-bound version, LIMBO was also implemented in an accuracy-control version, where a distance threshold is imposed on the decision to merge two DCFs, expressed as a multiple φ of the average mutual information of all tuples. The two versions differ only in Phase 1.
Experiments – Data Sets
• The following data sets are used:
• Congressional Votes (435 boolean tuples on 16 issues, from 1984, classified as Democrat or Republican).
• Mushroom (8,124 tuples with 22 attributes, classified as poisonous or edible).
• Database and Theory bibliography (8,000 tuples on research papers, with 4 attributes).
• Synthetic data sets (5,000 tuples, 10 attributes; DS5 and DS10 with 5 and 10 classes).
• Web data (web pages: a tuple set of authorities, with the hubs that link to them as attributes).
Experiments - Quality Measures
• Several measures are used to capture the subjectivity of clustering quality:
• Information Loss: the lower the better.
• Category Utility: the difference between the expected number of correctly guessed attribute values with and without a clustering.
• Min Classification Error: for tuples that are already classified.
• Precision (P) and Recall (R): P measures the accuracy with which a cluster reproduces a class, and R the completeness with which it does so.
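For concreteness, a small sketch of how P and R can be scored for a single (cluster, class) pair; the slides do not spell out how clusters are matched to classes or how the scores are averaged, so those steps are left out:

```python
def precision_recall(cluster_members, class_members):
    """Precision and recall of one cluster against one class.
    Both arguments are sets of tuple identifiers."""
    overlap = len(cluster_members & class_members)
    precision = overlap / len(cluster_members) if cluster_members else 0.0
    recall = overlap / len(class_members) if class_members else 0.0
    return precision, recall
```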
Quality-Efficiency trade-offs for LIMBO
• Whether we control the size (S) or the accuracy (φ) of the model, there is a trade-off between expressiveness (large S, small φ) and compactness (small S, large φ).
• For branching factor B = 4 we obtain:
• For large S and small φ, the bottleneck is Phase 2.
Quality-Efficiency trade-offs for LIMBO
• Still, in Phase 1 we can obtain significant compression of the data sets at no expense in the final quality.
• This consistency can be attributed in part to the effect of Phase 3, which assigns tuples to cluster representatives.
• Even for large values of φ and small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in linear time.
Comparative Evaluations
• The tables show the results of all algorithms on all quality measures for the Votes and Mushroom data sets.
• LIMBO's quality is superior to that of ROCK and COOLCAT.
• COOLCAT comes closest to LIMBO.
Web Data
• The authorities are clustered into three clusters with an information loss of 61%.
• LIMBO accurately characterizes the structure of the web graph.
• The three clusters correspond to different viewpoints (pro, against, irrelevant).
Scalability Evaluation
• Four data sets of size 500K, 1M, 5M, and 10M tuples (10 clusters, 10 attributes each).
• Phase 1 in detail for LIMBOφ:
• For 1.0 < φ < 1.5 we get manageable model size and fast execution time.
Scalability Evaluation
• We set φ = 1.2, 1.3 and S = 1 MB, 5 MB.
• Time scales linearly with the data set size.
• We also varied the number of attributes: again, linear behavior.
Scalability - Quality Results
• The quality measures remain the same for different data set sizes.
Conclusions
• LIMBO has advantages over other information-theoretic clustering algorithms in terms of scalability and quality.
• LIMBO is the only hierarchical, scalable categorical clustering algorithm, and it is based on a compact summary model.
Main Reference
• P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In 9th International Conference on Extending Database Technology (EDBT), Heraklion, Greece, 2004.
References
• D. Barbará, J. Couto, and Y. Li. COOLCAT: An Entropy-Based Algorithm for Categorical Clustering. In CIKM, McLean, VA, 2002.
• D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, 1998.
• S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In ICDE, Sydney, Australia, 1999.
• C. Shannon. A Mathematical Theory of Communication, 1948.
• N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, CO, 1999.
• N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, 1999.