On Demand Classification of Data Streams
Download
Report
Transcript On Demand Classification of Data Streams
On Demand Classification
of Data Streams
Charu C. Aggarwal
Jiawei Han
Philip S. Yu
Proc. 2004 Int. Conf. on Knowledge Discovery
and Data Mining (KDD'04), Seattle, WA, Aug.
2004
Speaker: Pei-Min Chou
Date:2005/04/01
1
Outline
Introduction
Supervised Micro-cluster
Snapshot
Maintenance Supervised Micro-cluster
Training Data Stream
Classification on Demand
Empirical Results
2
Introduction
Advances in data storage often grow without limit
referred to as data streams
one-pass mining model
does not recognize the changes and it is too
expensive to keep track of the entire history
static classification model likely to drop when there
is a sudden burst
Our model
simultaneous training and testing streams used for
dynamic classification of data sets
3
Supervised Micro-cluster : Modify Micro-cluster
Only from training data and each with same class
Data streams
Multi-dimensional points X 1 ... X k ... with time stamps T1, … Tk ….
1
d
Each point contains d dimensions, i.e., X i xi ... xi
A micro-cluster for n points is defined as a (2*d + 4) tuple:
- the sum of the squares of the data values
- the sum of the data values
- the sum of the squares of the time stamps
- the sum of the time stamps
-the number of data points
-variable corresponding to class id corresponds to
the class label of that micro-cluster
4
Snapshot
not too expensive to keep track history
storing the behavior of the micro-clusters at different moments in
time
if (t mod 2i) = 0 but (t mod 2i+1)!= 0
reaches max capacity, the oldest snapshot in this frame is removed
geometric time frame
vary from 0 to a value no larger than log2(T),
T is the maximum length of the stream
maximum number
=(max capacity)*log2(T)
5
Maintenance Supervised Micro-clusters
Nearest neighbor and k-means algorithms
The initial micro-clusters is offline process
offline ---answers various user queries based on the
stored summary statistics
When a new data point Xik arrives, it is either
added to a micro-cluster, or a new microcluster is created
6
Classification on Demand
Construct
Find the correct time-horizon
The value of kfit
Large or small horizon be chosen
Test
7
Find the correct time-horizon
Macro-clusters are created over a user-specified time
S(tc-h)
S(tc)
horizon h
tc-h
h
tc
Let S(tc): the snapshot of micro-clusters at time tc
S(tc-h): the snapshot of micro-clusters at time tc-h
The new set of micro-clusters N(tc-h) are created by
subtracting S(tc-h) from S(tc)
Subtractive property
Let C1 and C2 be two sets of points such that C1 C2
Then CFT (C1 C2 ) CFT (C1 ) CFT (C2 )
8
Training Data Stream
A small portion of the stream is used for the
process of horizon fitting stream segment
kfit :the number of points in the data used and the
value small as 1% of the data
remaining portion of the training stream is
used for the creation and maintenance of the
class-specific micro-clusters
9
The value of kfit
Horizon determined classification accuracy
Process executed periodically for changes
kfit should be small enough so that the points in it reflect
the immediate locality of tc
Qfit :pre-specified number of time units
a part of the training stream
the class labels are known a-priori
Nearest neighbor procedure (XεQfit)
Find the closest micro-cluster in N(tc,h) to X
compare the class label and true label
10
Large or small horizon be chosen
The accuracy of all the time horizons which
are tracked by the geometric time frame are
determined
The p time horizons which provide the
greatest dynamic classification accuracy
by
First sight ---smallest
Stable stream ---large
11
Test
test stream is a separate process which is
executed continuously throughout the
algorithm
Insert Xt , nearest neighbor classication
process is applied using each (Xt belong H)
results in the determination class lable
these p class labels reported as the relevant class
12
Empirical Results
Pentium III,512MB,WinXP
Both real and synthetic
Advantage
much higher classification accuracy
Good scalability in terms of dimensionality and the
number of class labels
stable processing rate
Space-efficient
13
Experiment
14
Experiment
15
Experiment
16