Clustering Data Streams

Download Report

Transcript Clustering Data Streams

Clustering Data Streams
Chun Wei
Dept Computer & Information Technology
Advisor: Dr. Sprague
Data Stream
Massive data sets accumulated at an
astonishing rate.
 Examples:

Tracking network data to study change in
traffic patterns and possible intrusions
 Tracking meteorological data, such as
temperatures

NASA MISR satellite
Collects several TB of
satellite imagery
data daily
Challenges
Compactness of data representation
 Fast, incremental processing of new data
points (one-pass and linear access of
data)
 Clear and fast identification of changes
in evolving clustering models

Compactness
Utilize a data structure that summarizes
a group of data points, minimizing the
storage space
 The space required does not grow
appreciably with the number of points
processed

Incremental Processing of data
When clustering new data points, the
algorithm should not require comparison
with all the points processed in the past
 The data must be processed as they are
produced. Linear scan is required,
random access is prohibitively
expensive.

Identification of Changes
The algorithm must be able to:
 diagnose changes in evolving data
streams
 distinguish outliers from data points that
represent a new cluster
Current Algorithms
BIRCH
 STREAM
 CLU-STREAM
…


N
i 1
Xi
BIRCH

Use CF vectors to store data
 CF = (N, ∑Xi2 , ∑ Xi ) Xi is a vector
 Store the number of points, the linear sum
and the square sum of all data points in a
micro-cluster
 Sufficient to calculate centroids, radius,
diameter and distances
B-Tree
Root
7
2*
3*
5*
7*
14* 16*
16
22
19* 20* 22*
29
24* 27* 29*
33* 34* 38* 39*
Building of CF Tree

B-Tree, with a branch factor B, threshold
T and L maximum number of entries in a
leave node
CF3
CF6
Non-Leaf node
CF1
Leaf node
CF2
CF3
CF4
CF5
CF6
Adjusting CF Tree
Increases the threshold T so that each
leaf entry to absorb more points. T can
be set as radius or diameter.
 Leaf entries with “far fewer” points are
regarded as “outliers” and written back to
disk.

STREAM
Process data streams in batches of
points
 Use weighted centroids Ci to represent
ith batch of points.
 Recursively cluster the weighted
centroids until k-clusters

Problems with BIRCH & STREAM
Old data points are equally important as
new data points
 May not be able to detect new trends in
evolving data stream

CLU-STREAM
Also use CF vectors to store data
summary
 Use time stamps to record the elapsed
time from the beginning
 Take snapshots at different time stamp,
favoring the most recent data
(Snapshot: micro-clusters stored at
particular moments in the stream)

CLU-STREAM (continue)
A snapshot contains q micro-clusters, q
depends on the memory available
 New data points will be assigned to one
of the micro-clusters in previous
snapshot if it falls within the maximum
boundary of that micro-cluster.

CLU-STREAM (continue)
If a new data points fails to fit into any
current cluster, a new cluster is created,
and an existing one is deleted or two
merged.
 A cluster is removed if the average timestamp when it absorbs m new data
points is least recent.

Detect New Trends

Comparing clustering results from
snapshots to snapshots reveals trends in
evolving data stream.
References





Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for
Clustering Evolving Data Stream. In Proc. of the 29th VLDB Conference.
Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD
Explorations, 3 (2), 23-27.
Ester M., Kriegel H.-P., Sander J. & Xu X (1998). Clustering for Mining in
Large Spatial Databases. Special Issue on Data Mining, KI-Journal,
ScienTec Publishing, No. 1.
Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003).
Clustering Data Streams: Theory and Practice. IEEE Transactions on
Knowledge and Data Engineering, 15 (3), 515-528.
Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient Data
Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD
International Conference on Management of Data.