On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams
Peng Wang, H. Wang, X. Wu, W. Wang, and B. Shi
Proc. of the Fifth IEEE International
Conference on Data Mining (ICDM’05)
Speaker: Yu Jiun Liu
Date : 2006/9/26
Introduction
• State of the art
  ◦ Incrementally updated classifiers
  ◦ Ensemble classifiers
• Model granularity
  ◦ Traditional: monolithic
  ◦ This paper: semantic decomposition
Motivation
• The model is decomposable into smaller components.
• The decomposition is semantic-aware.
Monolithic Models
• Stream: $r_1, \ldots, r_k, \ldots$
• Attributes: $A_1, \ldots, A_d$
• Class label: $C_i$
• Window: $W_i$, over records $r_i, \ldots, r_{i+w-1}$
• Model (classifier): $C_i$
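The window definition above can be illustrated with a minimal sketch. The record layout (a tuple of attribute values plus a class label) and the function name `sliding_windows` are assumptions made for illustration and do not come from the slides.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple

# Assumed record layout: (attribute values A_1..A_d, class label).
Record = Tuple[Tuple[str, ...], str]


def sliding_windows(stream: Iterable[Record], w: int) -> Iterator[List[Record]]:
    """Yield W_1, W_2, ...; window W_i holds records r_i, ..., r_{i+w-1}."""
    window = deque(maxlen=w)
    for record in stream:
        window.append(record)  # with maxlen set, the oldest record drops out
        if len(window) == w:
            yield list(window)


# Hypothetical four-record stream with two attributes.
stream = [
    (("a1", "b1"), "C1"),
    (("a1", "b2"), "C2"),
    (("a2", "b1"), "C1"),
    (("a2", "b2"), "C2"),
]
windows = list(sliding_windows(stream, w=3))
```

With `w=3`, the four records produce two windows, each shifted by one record from the previous one.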
Rule-based Models
• Rule form: $p_1 \wedge p_2 \wedge \cdots \wedge p_k \rightarrow C_j$
• minsup = 0.3 and minconf = 0.8
• Valid rules of $W_1$ are:
• Valid rules of $W_3$ are:
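As a rough illustration of how a rule is checked against minsup and minconf, the sketch below uses the usual association-rule definitions: support is the fraction of window records that satisfy the predicates and carry the rule's class, and confidence is that count divided by the number of records satisfying the predicates. These definitions, the dictionary-based record layout, and the sample window are assumptions; the slides do not spell them out.

```python
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (attribute values, class label)
Rule = Tuple[Dict[str, str], str]    # (predicates p_1..p_k, class C_j)


def matches(predicates: Dict[str, str], attrs: Dict[str, str]) -> bool:
    """True if the record satisfies every predicate of the rule."""
    return all(attrs.get(a) == v for a, v in predicates.items())


def support_confidence(rule: Rule, window: List[Record]) -> Tuple[float, float]:
    predicates, label = rule
    n_match = sum(1 for attrs, _ in window if matches(predicates, attrs))
    n_both = sum(1 for attrs, c in window
                 if c == label and matches(predicates, attrs))
    support = n_both / len(window) if window else 0.0
    confidence = n_both / n_match if n_match else 0.0
    return support, confidence


def is_valid(rule: Rule, window: List[Record],
             minsup: float = 0.3, minconf: float = 0.8) -> bool:
    support, confidence = support_confidence(rule, window)
    return support >= minsup and confidence >= minconf


# Hypothetical window of five records.
window = [
    ({"A1": "a", "A2": "x"}, "C1"),
    ({"A1": "a", "A2": "y"}, "C1"),
    ({"A1": "a", "A2": "x"}, "C2"),
    ({"A1": "b", "A2": "x"}, "C2"),
    ({"A1": "b", "A2": "y"}, "C2"),
]
rule_b = ({"A1": "b"}, "C2")  # matches 2 records, both labelled C2
rule_a = ({"A1": "a"}, "C1")  # matches 3 records, 2 labelled C1
```

Here `rule_b` has support 2/5 and confidence 2/2, so it passes both thresholds, while `rule_a` has confidence 2/3 and fails minconf = 0.8.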
Algorithm
• Phase 1: Initialization
  ◦ Use the first w records to train all valid rules for window $W_1$.
  ◦ Construct the RS-tree and REC-tree.
• Phase 2: Update
  ◦ When record $r_{i+w}$ arrives, insert it into the REC-tree and update the support and confidence of the rules it matches.
  ◦ Delete the oldest record and update the support and confidence of the rules it matches.
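The update phase can be sketched as simple count maintenance. This is a simplified illustration, not the paper's implementation: it scans a fixed rule list linearly instead of using the RS-tree and REC-tree, it does not discover new rules, and the class name `RuleCounter` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

Record = Tuple[Dict[str, str], str]  # (attribute values, class label)
Rule = Tuple[Dict[str, str], str]    # (predicates, class label)


class RuleCounter:
    """Keeps per-rule counts current as the window slides by one record."""

    def __init__(self, rules: List[Rule], w: int) -> None:
        self.rules = rules
        self.w = w
        self.window: deque = deque()
        self.n_match = [0] * len(rules)  # records satisfying the predicates
        self.n_both = [0] * len(rules)   # ... that also carry the rule's class

    def _adjust(self, record: Record, delta: int) -> None:
        attrs, label = record
        for i, (predicates, rule_label) in enumerate(self.rules):
            if all(attrs.get(a) == v for a, v in predicates.items()):
                self.n_match[i] += delta
                if label == rule_label:
                    self.n_both[i] += delta

    def add(self, record: Record) -> None:
        # The new record raises the counts of the rules it matches.
        self.window.append(record)
        self._adjust(record, +1)
        # Once the window overflows, the oldest record lowers them again.
        if len(self.window) > self.w:
            self._adjust(self.window.popleft(), -1)

    def support(self, i: int) -> float:
        return self.n_both[i] / len(self.window) if self.window else 0.0

    def confidence(self, i: int) -> float:
        return self.n_both[i] / self.n_match[i] if self.n_match[i] else 0.0


counter = RuleCounter([({"A1": "a"}, "C1")], w=2)
counter.add(({"A1": "a"}, "C1"))
counter.add(({"A1": "a"}, "C2"))
```

Only the rules matched by the arriving or departing record are touched, which is the reason a per-rule model can be updated more cheaply than a monolithic one that must be retrained.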
Data Structure
RS-Tree
• A prefix tree with attribute order
• Each node N represents a unique rule R: $P \rightarrow C_i$
• N' ($P' \rightarrow C_j$) is a child node of N, iff:
REC-Tree
• Each record r is stored as a sequence
• Node N points to a rule in the RS-tree if:
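To make the prefix-tree idea concrete, the sketch below stores rules in a plain trie keyed by (attribute, value) pairs taken in a fixed attribute order, and looks up every stored rule that a record satisfies. It is only an approximation under stated assumptions: the exact parent/child and pointer conditions of the RS-tree and REC-tree are not recoverable from the slides, each antecedent carries a single class label here, and the names `Node`, `ATTR_ORDER`, `insert_rule`, and `matching_rules` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

ATTR_ORDER = ["A1", "A2", "A3"]  # assumed fixed attribute order


class Node:
    def __init__(self) -> None:
        self.children: Dict[Tuple[str, str], "Node"] = {}
        self.label: Optional[str] = None  # class label if a rule ends here


def insert_rule(root: Node, predicates: Dict[str, str], label: str) -> None:
    """Insert rule P -> label, walking P's items in attribute order."""
    node = root
    for attr in ATTR_ORDER:
        if attr in predicates:
            node = node.children.setdefault((attr, predicates[attr]), Node())
    node.label = label


def matching_rules(root: Node, attrs: Dict[str, str]) -> List[Tuple[dict, str]]:
    """Return (antecedent, label) for every stored rule the record satisfies."""
    found: List[Tuple[dict, str]] = []

    def walk(node: Node, prefix: list, start: int) -> None:
        if node.label is not None and prefix:
            found.append((dict(prefix), node.label))
        # Only later attributes may extend the current prefix.
        for idx in range(start, len(ATTR_ORDER)):
            key = (ATTR_ORDER[idx], attrs.get(ATTR_ORDER[idx]))
            child = node.children.get(key)
            if child is not None:
                walk(child, prefix + [key], idx + 1)

    walk(root, [], 0)
    return found


root = Node()
insert_rule(root, {"A1": "a"}, "C1")
insert_rule(root, {"A1": "a", "A3": "z"}, "C1")
insert_rule(root, {"A2": "y"}, "C2")
hits = matching_rules(root, {"A1": "a", "A2": "x", "A3": "z"})
```

Because rules that share a prefix share a path, a record only descends into subtrees whose predicates it already satisfies.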
Detecting Concept Drifts
• Percentage vs. distribution of the misclassified records.
The percentage approach cannot tell us which part of the classifier gives
rise to the inaccuracy.
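The contrast between the two views can be sketched as follows. The result layout (the id of the rule that made the prediction, the predicted class, and the actual class) and both function names are assumptions for illustration; the slides do not give the paper's exact drift statistic.

```python
from collections import Counter
from typing import List, Tuple

# Assumed layout: (id of the rule that fired, predicted class, actual class).
Result = Tuple[str, str, str]


def error_percentage(results: List[Result]) -> float:
    """Single number for the whole classifier: fraction misclassified."""
    wrong = sum(1 for _, predicted, actual in results if predicted != actual)
    return wrong / len(results) if results else 0.0


def misclassified_by_rule(results: List[Result]) -> Counter:
    """Distribution view: how many errors each rule is responsible for."""
    return Counter(rule_id for rule_id, predicted, actual in results
                   if predicted != actual)


# Hypothetical classification results over five records.
results = [
    ("R1", "C1", "C1"),
    ("R2", "C2", "C1"),
    ("R2", "C2", "C1"),
    ("R3", "C1", "C1"),
    ("R2", "C2", "C2"),
]
overall = error_percentage(results)
per_rule = misclassified_by_rule(results)
```

The overall figure only says that 2 of 5 records were misclassified, while the per-rule counts show that both errors come from `R2`, which is the component that would need updating.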
Definition
Finding Rule Algorithm
Update Algorithm
Experiments
• CPU: 1.7 GHz
• Memory: 256 MB
• Datasets: synthetic and real-life datasets.
• Synthetic:
• Real-life dataset:
  ◦ 10,344 records and 8 dimensions.
Effect of model updating
• Synthetic
• 10 dimensions
• Window size 5000
• 4 dimensions changing
The relation between concept drifts and $N_{ij}$
Effect of rule composition
Accuracy and Time
• Window size: 10,000
• EC (ensemble classifier): 10 classifiers, each trained on 1000 records.
• Synthetic data.
Real life data
Conclusion
• Overcome the effects of concept drifts.
• By reducing granularity, change detection and model update can be more efficient without compromising classification accuracy.