On Demand Classification of Data Streams

Download Report

Transcript On Demand Classification of Data Streams

On Demand Classification
of Data Streams
Charu C. Aggarwal
Jiawei Han
Philip S. Yu
Proc. 2004 Int. Conf. on Knowledge Discovery
and Data Mining (KDD'04), Seattle, WA, Aug.
2004
Speaker: Pei-Min Chou
Date:2005/04/01
1
Outline







Introduction
Supervised Micro-cluster
Snapshot
Maintenance Supervised Micro-cluster
Training Data Stream
Classification on Demand
Empirical Results
2
Introduction




Advances in data storage often grow without limit
referred to as data streams
one-pass mining model
does not recognize the changes and it is too
expensive to keep track of the entire history
static classification model likely to drop when there
is a sudden burst
Our model
simultaneous training and testing streams used for
dynamic classification of data sets
3
Supervised Micro-cluster : Modify Micro-cluster

Only from training data and each with same class

Data streams



Multi-dimensional points X 1 ... X k ... with time stamps T1, … Tk ….
1
d
Each point contains d dimensions, i.e., X i  xi ... xi


A micro-cluster for n points is defined as a (2*d + 4) tuple:
- the sum of the squares of the data values
- the sum of the data values
- the sum of the squares of the time stamps
- the sum of the time stamps
-the number of data points
-variable corresponding to class id corresponds to
the class label of that micro-cluster
4
Snapshot





not too expensive to keep track history
storing the behavior of the micro-clusters at different moments in
time
if (t mod 2i) = 0 but (t mod 2i+1)!= 0
reaches max capacity, the oldest snapshot in this frame is removed
geometric time frame
 vary from 0 to a value no larger than log2(T),
T is the maximum length of the stream
 maximum number
=(max capacity)*log2(T)
5
Maintenance Supervised Micro-clusters

Nearest neighbor and k-means algorithms

The initial micro-clusters is offline process
offline ---answers various user queries based on the
stored summary statistics

When a new data point Xik arrives, it is either
added to a micro-cluster, or a new microcluster is created
6
Classification on Demand

Construct




Find the correct time-horizon
The value of kfit
Large or small horizon be chosen
Test
7
Find the correct time-horizon

Macro-clusters are created over a user-specified time
S(tc-h)
S(tc)
horizon h
tc-h



h
tc
Let S(tc): the snapshot of micro-clusters at time tc
S(tc-h): the snapshot of micro-clusters at time tc-h
The new set of micro-clusters N(tc-h) are created by
subtracting S(tc-h) from S(tc)
Subtractive property
 Let C1 and C2 be two sets of points such that C1  C2
Then CFT (C1  C2 )  CFT (C1 )  CFT (C2 )
8
Training Data Stream

A small portion of the stream is used for the
process of horizon fitting stream segment


kfit :the number of points in the data used and the
value small as 1% of the data
remaining portion of the training stream is
used for the creation and maintenance of the
class-specific micro-clusters
9
The value of kfit





Horizon determined classification accuracy
Process executed periodically for changes
kfit should be small enough so that the points in it reflect
the immediate locality of tc
Qfit :pre-specified number of time units
 a part of the training stream
 the class labels are known a-priori
Nearest neighbor procedure (XεQfit)
 Find the closest micro-cluster in N(tc,h) to X

compare the class label and true label
10
Large or small horizon be chosen




The accuracy of all the time horizons which
are tracked by the geometric time frame are
determined
The p time horizons which provide the
greatest dynamic classification accuracy
by
First sight ---smallest
Stable stream ---large
11
Test


test stream is a separate process which is
executed continuously throughout the
algorithm
Insert Xt , nearest neighbor classication
process is applied using each (Xt belong H)


results in the determination class lable
these p class labels reported as the relevant class
12
Empirical Results



Pentium III,512MB,WinXP
Both real and synthetic
Advantage




much higher classification accuracy
Good scalability in terms of dimensionality and the
number of class labels
stable processing rate
Space-efficient
13
Experiment
14
Experiment
15
Experiment
16