Transcript ppt

A Framework for Clustering
Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong
Wang, Philip S. Yu
Presented by: Di Yang
Charudatta Wad
Outline
Background of Clustering
Motivation for Clustering over Streaming
Data.
Overall Solution
Micro Clusters
Pyramid Time Frame
Macro Cluster
Cluster Maintenance
Background of Clustering
Definition of Clustering
For a given set of data points, partitioning them
into one or more groups of similar objects.
“Similarity” is often defined with the use of some
distance measure.
Difference between “group by” queries and
clustering.
Background of Clustering
Some of the most popular clustering
algorithms:
K- Means, BIRCH, CURE, Density Based
Clustering.
Clustering has many applications in data
bases, information visualization, data
mining.
What are Oultiers?
Motivation
Challenge in Streaming Environment:
Clustering is an expensive process.
Resource constraints.
Infinite streams.
Can simply extending one pass algorithms
for static databases to stream processing
suffice?
Motivation
Requirements of clustering for stream
processing:
Statistical summary information storage.
Efficient update process.
Ability to cluster for a specific time horizon,
Overall Solution of the Paper
 Divide the clustering process to two
phases
Online Component:
periodically stores detailed summary statistics
Offline Component
uses only the summary statistics to do clustering
Micro-Clusters
What is a Micro-Cluster
A Micro-Cluster is a set of individual data points that are
close to each other and will be treated as a single unit in
further offline Macro-clustering.
View of Micro-Cluster
View of Macro-Cluster
Micro-Clusters
What to Store in a Micro-Cluster
=
Key idea: Additivity Property
Pyramidal Time Frame
 The micro-clusters are stored at snapshots.
…
…
Snapshot
 When should we make the snapshot?
 The snapshots follow a pyramidal pattern
Pyramidal Time Frame
 Snapshots are classified into different orders which can vary from 1 to
log α(T). For example, T is 55, α=2, then we have orders 0 with
interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4,
order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with
interval 2^5=32.
 For a data stream the maximum number of snap- shots maintained at T
time units since the beginning of the stream mining process is
(α + 1) log α(T). (α + 1 for each order)
Why Pyramidal Pattern?
 For any user-specified time window of h, at least one
stored snapshot can be found within 2 h units of the
current time.
Please Note: Only Approximate Answers!!!
Micro Cluster Creation
It is assumed that a total of q microclusters are maintained at any moment by
the algorithm.
This is done using an offline process (kmeans) at the very beginning of the data
stream computation process.
Online Micro Cluster Maintenance
 How to deal with a new coming point?
1.
Join one of the old cluster
2.
Create a new cluster by its own
 How to deal with the old clusters
1.
Delete them (based on relevance stamp)
2.
Merge them (merge the closest two)
A merged cluster will have all the IDs its components have
Macro-Cluster Creation
Based on the Additivity Property of cluster
feature vector
Macro-Cluster Creation
Current Time T, the window size is h. That means the user want to
find the clusters formed in (T-h, T).
Approach:
1.
2.
3.
1st step: Find the snapshot for T, get the micro-cluster set S(T).
2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h).
Use S(T)-S(T-h)
Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T)
and a cluster with Id C1 in S(T-h). Then the we use
CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before
T-h, thus should not contribute to the micro-cluster formed in (T-h,T)
Example
C_ID: [C1, C2, C3]
Time: T
C_ID: [C1]
Time: T-h
C_ID: [C2, C3]
Result: T-h
Macro-Cluster Creation
Run K-means on Micro-Clusters
How do you feel about this paper?
My feeling:
Quite Fuzzy Results:
Approximation is every where.
Nothing New:
Micro-Clusters, K-means, Cluster Feature
Vectors, Pyramidal Time Frame are all old
stuffs.
Counter Example
C_ID: [C1, C2, C3]
Time: T
C_ID: [C2]
Time: T-h
C_ID: [C1, C3]
Result
Advertisement
 Di and Charu’s project deals with:
1. Deterministic Clusters
2. Clusters with Arbitrary Shapes
3. Real Expirations
4. Disk Version
5. Outlier Detection by Free