Clustering Data Streams
Download
Report
Transcript Clustering Data Streams
Clustering Data Streams
Chun Wei
Dept Computer & Information Technology
Advisor: Dr. Sprague
Data Stream
Massive data sets accumulated at an
astonishing rate.
Examples:
Tracking network data to study change in
traffic patterns and possible intrusions
Tracking meteorological data, such as
temperatures
NASA MISR satellite
Collects several TB of
satellite imagery
data daily
Challenges
Compactness of data representation
Fast, incremental processing of new data
points (one-pass and linear access of
data)
Clear and fast identification of changes
in evolving clustering models
Compactness
Utilize a data structure that summarizes
a group of data points, minimizing the
storage space
The space required does not grow
appreciably with the number of points
processed
Incremental Processing of data
When clustering new data points, the
algorithm should not require comparison
with all the points processed in the past
The data must be processed as they are
produced. Linear scan is required,
random access is prohibitively
expensive.
Identification of Changes
The algorithm must be able to:
diagnose changes in evolving data
streams
distinguish outliers from data points that
represent a new cluster
Current Algorithms
BIRCH
STREAM
CLU-STREAM
…
N
i 1
Xi
BIRCH
Use CF vectors to store data
CF = (N, ∑Xi2 , ∑ Xi ) Xi is a vector
Store the number of points, the linear sum
and the square sum of all data points in a
micro-cluster
Sufficient to calculate centroids, radius,
diameter and distances
B-Tree
Root
7
2*
3*
5*
7*
14* 16*
16
22
19* 20* 22*
29
24* 27* 29*
33* 34* 38* 39*
Building of CF Tree
B-Tree, with a branch factor B, threshold
T and L maximum number of entries in a
leave node
CF3
CF6
Non-Leaf node
CF1
Leaf node
CF2
CF3
CF4
CF5
CF6
Adjusting CF Tree
Increases the threshold T so that each
leaf entry to absorb more points. T can
be set as radius or diameter.
Leaf entries with “far fewer” points are
regarded as “outliers” and written back to
disk.
STREAM
Process data streams in batches of
points
Use weighted centroids Ci to represent
ith batch of points.
Recursively cluster the weighted
centroids until k-clusters
Problems with BIRCH & STREAM
Old data points are equally important as
new data points
May not be able to detect new trends in
evolving data stream
CLU-STREAM
Also use CF vectors to store data
summary
Use time stamps to record the elapsed
time from the beginning
Take snapshots at different time stamp,
favoring the most recent data
(Snapshot: micro-clusters stored at
particular moments in the stream)
CLU-STREAM (continue)
A snapshot contains q micro-clusters, q
depends on the memory available
New data points will be assigned to one
of the micro-clusters in previous
snapshot if it falls within the maximum
boundary of that micro-cluster.
CLU-STREAM (continue)
If a new data points fails to fit into any
current cluster, a new cluster is created,
and an existing one is deleted or two
merged.
A cluster is removed if the average timestamp when it absorbs m new data
points is least recent.
Detect New Trends
Comparing clustering results from
snapshots to snapshots reveals trends in
evolving data stream.
References
Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for
Clustering Evolving Data Stream. In Proc. of the 29th VLDB Conference.
Barbara, D. (2003). Requirements for Clustering Data Streams. SIGKDD
Explorations, 3 (2), 23-27.
Ester M., Kriegel H.-P., Sander J. & Xu X (1998). Clustering for Mining in
Large Spatial Databases. Special Issue on Data Mining, KI-Journal,
ScienTec Publishing, No. 1.
Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003).
Clustering Data Streams: Theory and Practice. IEEE Transactions on
Knowledge and Data Engineering, 15 (3), 515-528.
Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient Data
Clustering Method for Very Large Databases. In Proc. of ACM SIGMOD
International Conference on Management of Data.