Clustering by impact
George Mason University
(joint work with P. Chen, J. Couto,
and Y. Li)
Copyright Daniel Barbara
Organizations are constantly acquiring and storing
new data (data streams)
The need to quickly extract knowledge from the
newly arrived data (and compare it with the old) is
pressing.
Clustering data streams
Continuous data: Fractal Clustering
Categorical (nominal) data: Entropy-based
Clustering and data streams
To cluster continuously arriving data streams a
clustering algorithm should behave incrementally:
make the decision based on the newly arrived
point and a concise description of the clusters
encountered so far.
Concise: a bounded amount of RAM to describe the
clusters, independent of the number of data
points processed so far.
Most algorithms in the literature do not
have that property:
They look at the entire set of points at once
They cannot make decisions point by point.
The description of the clusters is usually the set
of points in them.
Some of the algorithms have high complexity
Paper by U. Fayyad, D. Bradley and C. Reina:
“Scaling Clustering algorithms to large databases”
Main idea: keep descriptions of centroids + set
descriptions that are likely and unlikely to change
given a new data point.
Papers by Motwani et al.: incrementally updating
centroids while receiving a data stream. The goal
is to have an approximation to “min squares”
whose performance is bounded.
Find functions that naturally define clusters
and that can be easily computed given a
new point and a concise representation of
the current clusters.
Place a new point in the cluster for which
the evaluated function shows a minimum
(or a maximum), i.e., the cluster on which the
point has the least impact.
Numerical data points: fractal dimension
Measures the self-similarity of points.
The idea is that the lower the change in the
fractal dimension (when the point is included),
the more self-similar the point is with respect to
the cluster.
Categorical data points: entropy.
Also measures similarity
Lower entropy means similar points.
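The generic placement rule can be sketched as follows; `impact` is a hypothetical caller-supplied function (change in fractal dimension for numeric data, change in entropy for categorical data), and the outlier threshold is an assumption added for illustration:

```python
def place_point(clusters, point, impact, threshold=None):
    """Assign `point` to the cluster whose impact function changes least.

    `impact(cluster, point)` returns the change in the cluster's
    defining quantity (fractal dimension, entropy, ...) if `point`
    were added.  If every change exceeds `threshold`, the point is
    treated as an outlier (returns None).
    """
    best, best_impact = None, float("inf")
    for c in clusters:
        delta = impact(c, point)
        if delta < best_impact:
            best, best_impact = c, delta
    if threshold is not None and best_impact > threshold:
        return None  # outlier: impact too large for every cluster
    return best
```

As a toy usage, `impact` below is just distance to the cluster mean; any incrementally computable function fits the same interface.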
Fractal dimension is a (not necessarily integer) number
that characterizes the number of dimensions ``filled'' by the
object represented by the dataset. The object on the upper
right corner, called the Menger sponge (when complete)
has a F.D. equal to 2.73 (less than the embedding space,
whose dimension is 3)
Conjecture: if part of a dataset brings about a change in the
overall fractal dimension of the set, then this part is
``anomalous'' (exhibits different behavior) with respect to
the rest of the dataset.
D_q = (1/(q-1)) * lim_{r->0} [ log( sum_i p_i^q ) / log r ],  for q != 1
D_1 = lim_{r->0} [ sum_i p_i log p_i / log r ]
where r = grid size and p_i = fraction of points in the i-th grid box
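As an illustration, D_0 (the q = 0 case, plain box counting) can be estimated by fitting the slope of log N(r) against log(1/r). This is a minimal sketch, assuming points normalized to [0,1)^d and dyadic grid sizes:

```python
import math

def box_counting_dimension(points, levels=6):
    """Estimate the box-counting dimension D_0 of points in [0,1)^d.

    At level l the grid size is r = 2**-l; N(r) is the number of
    occupied boxes.  D_0 is the slope of log N(r) versus log(1/r),
    fit here by ordinary least squares over the chosen levels.
    """
    xs, ys = [], []
    for l in range(1, levels + 1):
        inv_r = 2 ** l
        boxes = {tuple(int(c * inv_r) for c in p) for p in points}
        xs.append(math.log(inv_r))   # log(1/r)
        ys.append(math.log(len(boxes)))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

On a diagonal line segment the estimate comes out near 1; on a filled square, near 2.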
Cantor Dust Set
Box counting (cont.)
(plot: occupied-box population vs. grid size, log-log scale)
At step n there are 2^n occupied boxes of size r_0/3^n, each holding a
fraction p_i = 2^-n of the points, so
D_1 = lim_{n->inf} (-n log 2) / (log r_0 - n log 3) = log 2 / log 3 ~ 0.63
Take an unlabelled point in the sample and
start a cluster.
Find its close neighbors and add them to the
cluster.
Find close neighbors to points in the
cluster and keep adding them; when none are
left, go back to the first step.
Space in RAM is not proportional to the size of the
dataset, but rather to the size of the grid and
number of grid levels kept.
These vary with:
Accuracy (odd-shaped clusters may require more
grid levels)
Scalability results with
number of points
Quality of clusters (Dataset1)
Percentage of points clustered right
High dimensional set
10 dimensions, 2 clusters
% of points clustered right
Results with the noisy dataset
92 % of the noise gets filtered out.
% points clustered right
Memory usage vs. dimensions
(plot: memory used vs. number of dimensions, 1 through 10)
Space taken by the boxes is small, but it grows with the
number of dimensions.
Memory reduction techniques:
• Use boxes with # points > epsilon.
• Cache boxes
• Have only smallest granularity boxes and derive the rest.
None of them causes a significant degradation of quality. (2
and 3 have an impact on running time.)
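Technique 1 is just a filter over the box population. A minimal sketch, assuming boxes are kept as a map from grid-index tuple to point count:

```python
def prune_boxes(box_counts, epsilon):
    """Keep only grid boxes holding more than `epsilon` points.

    `box_counts` maps a box index tuple to its point count; sparsely
    populated boxes are dropped, trading a little accuracy for memory.
    """
    return {box: n for box, n in box_counts.items() if n > epsilon}
```

Techniques 2 and 3 (caching, deriving coarser boxes from the finest level) trade running time instead of accuracy.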
Comparison with other algorithms
Comparison of quality
For Categorical data
Place a new point where it minimizes some function
of the entropies of the individual clusters (e.g.,
min(max(entropy(C_i))))
Heuristic (problem is NP-Hard)
Entropy of each cluster:
E(C_k) = - sum_{i=1..d} sum_{j=1..n_i} P(V_ij | C_k) log P(V_ij | C_k)
where d = number of attributes and V_ij is the j-th value in the
domain of attribute i
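A sketch of this per-cluster entropy, assuming a cluster is stored as a list of value tuples and attributes are treated as independent:

```python
import math
from collections import Counter

def cluster_entropy(records):
    """Entropy E(C_k) of a cluster of categorical records (value tuples).

    Sums, over every attribute i, the entropy of that attribute's
    value distribution within the cluster (attributes treated as
    independent).
    """
    if not records:
        return 0.0
    n = len(records)
    e = 0.0
    for i in range(len(records[0])):
        for count in Counter(r[i] for r in records).values():
            p = count / n
            e -= p * math.log(p)
    return e
```

A cluster whose records all agree has entropy 0; a two-way split on one attribute contributes log 2.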
Minimize expected entropy
Need to seed “k” clusters:
Select a sample
Find 2 points that are the most dissimilar (their
joint entropy is the highest).
Place them in 2 different clusters
Find another point that is the most dissimilar
(pairwise) to the ones already selected, start a
new cluster with it, and repeat until k clusters
are seeded.
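A sketch of the seeding step; as a simplification, pairwise dissimilarity is approximated here by Hamming distance rather than joint entropy:

```python
from itertools import combinations

def seed_points(sample, k):
    """Pick k mutually dissimilar seed records from a sample.

    Dissimilarity is approximated by Hamming distance (the actual
    criterion in the slides is pairwise joint entropy; this is a
    stand-in for the sketch).
    """
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    # start with the most dissimilar pair
    seeds = list(max(combinations(sample, 2), key=lambda pr: dist(*pr)))
    while len(seeds) < k:
        # add the point maximizing its minimum distance to chosen seeds
        rest = [p for p in sample if p not in seeds]
        seeds.append(max(rest, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds
```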
For a given point and k current clusters:
Compute the expected entropy as the new point
is placed in each cluster.
Choose the one that minimizes the expected
entropy.
After finishing with a batch of points,
reprocess m% of them (take the ``worst'' fits out
and re-cluster them): helps with sensitivity to
the processing order.
Notice that the current cluster description is
concise: counts of V_ij for every i = 1,..,d
(number of attributes) and for every j (domain of
each attribute).
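Putting the incremental step together with this concise description, a sketch (each cluster is per-attribute value counts plus a size; all names are illustrative):

```python
import math
from collections import Counter

def entropy_from_counts(counts_per_attr, n):
    """Entropy of a cluster given per-attribute value counts and size n."""
    e = 0.0
    for counts in counts_per_attr:
        for c in counts.values():
            p = c / n
            e -= p * math.log(p)
    return e

def best_cluster(clusters, point):
    """Index of the cluster minimizing the expected entropy after placement.

    Each cluster is (counts_per_attr, n): value counts for every
    attribute plus the cluster size, i.e. the concise description.
    """
    def expected_entropy(assign_to):
        total_n = sum(n for _, n in clusters) + 1
        e = 0.0
        for idx, (counts_per_attr, n) in enumerate(clusters):
            if idx == assign_to:
                trial = [Counter(c) for c in counts_per_attr]
                for i, v in enumerate(point):
                    trial[i][v] += 1   # tentatively add the point
                e += (n + 1) / total_n * entropy_from_counts(trial, n + 1)
            else:
                e += n / total_n * entropy_from_counts(counts_per_attr, n)
        return e
    return min(range(len(clusters)), key=expected_entropy)
```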
COOLCAT and the MDL
MDL = minimum description length.
Widely used to argue about how good a
classifier is: how many bits does it take to
send to a receiver the description of your
classifier + the exceptions
K(h, D) = K(h) + K(D using h)
h = model, D = data
K(h) = k log(|D|)
K(h, D) ~ k log(|D|) + E(C): for fixed k, minimizing the expected
entropy E(C) minimizes the description length
Real and synthetic datasets
Evaluate quality and performance
Category utility function (how much “better” the
probability distribution in the individual clusters
is with respect to the original distribution)
External entropy: take an attribute not used in the
clustering, compute the entropy of each cluster
with respect to it, and then the expected external
entropy
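A sketch of the external-entropy metric, assuming clusters are lists of record indices and the held-out attribute is given as one label per record:

```python
import math
from collections import Counter

def external_entropy(clusters, labels):
    """Expected entropy of clusters w.r.t. a held-out attribute.

    `clusters` is a list of lists of record indices; `labels[i]` is
    the held-out attribute's value for record i.  Lower is better: a
    good clustering concentrates each label in few clusters.
    """
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        e = -sum((v / len(c)) * math.log(v / len(c))
                 for v in counts.values())
        total += len(c) / n * e   # weight by cluster size
    return total
```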
Archaeological data set
(table: external entropy (Ext. E) and expected entropy (Exp. E),
including a brute-force baseline)
KDD99 Cup data
(plot: results vs. number of records, N x 1000)
Clustering data streams as they come:
Consider the r.v. X = 0 if the new point is an outlier, 1 otherwise.
Using Chernoff bounds:
Must see s “successes” (non-outliers) in a window of w points, with
s >= (1 - eps) p w, where p is the expected fraction of non-outliers.
If you don’t, it is time for new clusters…
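A sketch of the window test; the exact Chernoff-derived threshold may differ in the original formulation, but here the window flags drift when successes fall below (1 - eps) * p * w:

```python
from collections import deque

def make_tracker(w, p, eps):
    """Track whether the current clusters still fit the stream.

    Over a sliding window of `w` points, at least (1 - eps) * p * w
    non-outliers ("successes") are expected if the cluster model still
    holds; fewer than that signals it is time for new clusters.
    Returns a function fed one boolean per point (True = not outlier).
    """
    window = deque(maxlen=w)
    need = (1 - eps) * p * w

    def observe(is_success):
        window.append(is_success)
        if len(window) < w:
            return False            # not enough evidence yet
        return sum(window) < need   # True -> drift: build new clusters
    return observe
```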
FC, COOLCAT and Tracking
Find a good definition of outlier:
FC: if the min change in FD exceeds a
threshold, the point is an outlier.
COOLCAT: mutual information of new point
with respect to clusters
One tracking experiment with FC
One tracking experiment with
COOLCAT (intrusion detection)
More tracking experiments
Hybrid data: numeric and categorical
Indexing based on clustering
``Using the Fractal Dimension to Cluster Datasets,'' Proceedings of the
ACM-SIGKDD International Conference on Knowledge Discovery and Data
Mining, Boston, August 2000. D. Barbara, P. Chen.
``Tracking Clusters in Evolving Data Sets,'' Proceedings of FLAIRS'2001,
Special Track on Knowledge Discovery and Data Mining, Key West, FL,
May 2001. D. Barbara, P. Chen.
``Fractal Characterization of Web Workloads,'' Proceedings of the 11th
International World Wide Web Conference, May 2002. D. Menasce, V.
Almeida, D. Barbara, B. Abrahao, and F. Ribeiro.
``Using Self-Similarity to Cluster Large Data Sets,'' to appear in
Journal of Data Mining and Knowledge Discovery, Kluwer Academic pub.
D. Barbara, P. Chen.
``Requirements for Clustering Data Streams,'' SIGKDD Explorations
(Special Issue on Online, Interactive, and Anytime Data Mining), Vol. 3,
No. 2, Jan 2002. D. Barbara.
``COOLCAT: An Entropy-Based Algorithm for Categorical Clustering,''
Submitted for publication. D. Barbara, J. Couto, Y. Li.