Data Stream Classification:
Training with Limited Amount of
Labeled Data
Mohammad Mehedy Masud
Latifur Khan
Bhavani Thuraisingham
University of Texas at Dallas
Jing Gao
Jiawei Han
University of Illinois at
Urbana-Champaign
To appear in the IEEE International Conference on Data Mining
(ICDM), Pisa, Italy, Dec 15-19, 2008
Funded by: Air Force
Data Stream Classification
Techniques
Data stream classification is a challenging task
for two reasons:
◦ Infinite length – can’t use all historical data for training
◦ Concept-drift – old models become outdated
Solutions:
◦ Single model classification with incremental learning
◦ Ensemble classification
◦ Ensemble techniques can be updated more efficiently
and handle concept-drift more effectively.
Our solution: ensemble approach
Ensemble Classification
Ensemble techniques build an ensemble of M classifiers.
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one old classifier (if required)
[Figure: ensemble classification workflow – (1) the ensemble {L1, L2, …, LM} classifies the last unlabeled data chunk; (2) a new classifier is trained from the last labeled data chunk and the ensemble is updated.]
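A minimal sketch of this chunk-by-chunk ensemble maintenance, assuming a fixed ensemble size M and a simple oldest-out replacement policy (class and method names are illustrative, not from the paper):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: keep at most M classifiers, adding one per labeled chunk
// and replacing an old one when the ensemble is full.
final class FixedSizeEnsemble<C> {
    private final int maxClassifiers;               // M
    private final List<C> classifiers = new ArrayList<>();

    FixedSizeEnsemble(int maxClassifiers) { this.maxClassifiers = maxClassifiers; }

    /** Called once per labeled chunk with the classifier trained on that chunk. */
    void addClassifier(C trainedOnLastChunk) {
        classifiers.add(trainedOnLastChunk);
        if (classifiers.size() > maxClassifiers) {
            // Replace one old classifier (if required). Dropping the oldest is the
            // simplest policy; the slide does not specify which classifier is replaced.
            classifiers.remove(0);
        }
    }

    List<C> members() { return classifiers; }
}
```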
Limited Labeled Data Problem
Ensemble techniques assume that the entire data
chunk is labeled during training.
This assumption is impractical because
◦ Data labeling is both time consuming and costly
◦ It may not be possible to label all the data
◦ Especially in a streaming environment, where data is
produced at high speed
Our solution:
◦ Train classifiers with limited amount of labeled data
◦ Assuming a fraction of the data chunk is labeled
◦ We obtain better results compared to other techniques
that train classifiers with fully labeled data chunks
Training With Partially-Labeled
Chunk
Train new classifier using semi-supervised clustering
If a new class has arrived in the stream, refine the
existing classifiers
Update the ensemble
[Figure: training with a partially labeled chunk – (1) the ensemble {L1, L2, …, LM} classifies the last unlabeled data chunk; (2) a new classifier is trained from the last partially labeled chunk using semi-supervised clustering; (3) the existing classifiers are refined; (4) the ensemble is updated.]
Semi-Supervised Clustering
Overview:
Only a few instances in the training data are labeled
We apply impurity-based clustering
The goal is to minimize cluster impurity
A cluster is completely pure if all the labeled data in that
cluster is from the same class
Objective function:
K: number of clusters, Xi: data points belonging to cluster i,
Li: labeled data points belonging to cluster i
Impi: Impurity of cluster i = Entropy * dissimilarity count
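A sketch of the impurity term for one cluster is shown below. The entropy is computed over the class distribution of the cluster's labeled points; the "dissimilarity count" is assumed here to be, for each labeled point, the number of other labeled points in the same cluster with a different class label (the slide only names the two factors, so this pairwise form is an assumption):

```java
// Illustrative computation of Impi = entropy * dissimilarity count for one cluster,
// given the class histogram of its labeled points.
final class ClusterImpurity {

    /** classCounts[j] = number of labeled points of class Cj in the cluster. */
    static double impurity(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        if (total == 0) return 0.0;                    // no labeled points: treated as pure

        // Entropy of the labeled-class distribution (base-2).
        double entropy = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            entropy -= p * Math.log(p) / Math.log(2);
        }

        // Assumed dissimilarity count: ordered pairs of labeled points with different labels.
        long dissimilar = 0;
        for (int c : classCounts) dissimilar += (long) c * (total - c);

        return entropy * dissimilar;
    }

    public static void main(String[] args) {
        System.out.println(impurity(new int[] {5, 5}));  // mixed cluster: 1.0 * 50 = 50.0
        System.out.println(impurity(new int[] {10, 0})); // pure cluster: 0.0
    }
}
```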
Semi-Supervised Clustering Using E-M
Constrained initialization:
Initialize K seeds
For each class Cj,
Select kj seeds from the labeled instances of Cj using the farthest-first
traversal heuristic,
where kj = (Nj/N) * K
Nj = number of labeled instances of class Cj, N = total number of labeled
instances
Repeat E-step and M-step until convergence:
E-step
Assign each instance to a cluster so that the objective function is minimized
Apply Iterated Conditional Modes (ICM) until convergence
M-step
Re-compute cluster centroids
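A sketch of the constrained initialization step, assuming Euclidean distance and simple rounding of kj = (Nj/N) * K (the data layout, helper names, and the minimum of one seed per class are illustrative choices, not from the slide):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative constrained seeding: each class Cj contributes kj = (Nj/N) * K seeds,
// chosen from its labeled instances by farthest-first traversal.
final class ConstrainedSeeding {

    static List<double[]> selectSeeds(Map<Integer, List<double[]>> labeledByClass, int K) {
        int n = 0;                                              // N: total labeled instances
        for (List<double[]> pts : labeledByClass.values()) n += pts.size();

        List<double[]> seeds = new ArrayList<>();
        for (List<double[]> classPoints : labeledByClass.values()) {
            int nj = classPoints.size();                        // Nj
            int kj = Math.max(1, (int) Math.round((double) nj / n * K));
            seeds.addAll(farthestFirst(classPoints, Math.min(kj, nj)));
        }
        return seeds;
    }

    /** Farthest-first traversal: start with one point, then repeatedly add the point
     *  whose distance to the nearest already-chosen seed is largest. */
    static List<double[]> farthestFirst(List<double[]> points, int k) {
        List<double[]> chosen = new ArrayList<>();
        chosen.add(points.get(0));
        while (chosen.size() < k) {
            double[] best = null;
            double bestDist = -1.0;
            for (double[] p : points) {
                double d = Double.MAX_VALUE;
                for (double[] s : chosen) d = Math.min(d, dist(p, s));
                if (d > bestDist) { bestDist = d; best = p; }
            }
            chosen.add(best);
        }
        return chosen;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int r = 0; r < a.length; r++) s += (a[r] - b[r]) * (a[r] - b[r]);
        return Math.sqrt(s);
    }
}
```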
Saving Cluster Summary as Micro-Cluster
For each of the K clusters created using the
semi-supervised clustering, save the following:
Centroid
n: total number of points
L: total number of labeled points
Lt[ ]: array containing the total number of labeled
points belonging to each class
e.g.: Lt[j] = total number of labeled data points belonging to class Cj
Sum[ ]: array containing the sum of the attribute
values of each dimension of all the data points
e.g.: Sum[r] = sum of the r-th dimension of all data points in the
cluster
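A sketch of this summary structure, using the field names from the slide; the incremental add() helper and the handling of unlabeled points via a negative label are illustrative details:

```java
// Per-cluster summary kept after the raw data points are discarded.
final class MicroCluster {
    final double[] centroid;   // cluster centroid
    int n;                     // total number of points
    int L;                     // total number of labeled points
    final int[] Lt;            // Lt[j] = number of labeled points of class Cj
    final double[] Sum;        // Sum[r] = sum of the r-th attribute over all points

    MicroCluster(double[] centroid, int numClasses) {
        this.centroid = centroid;
        this.Lt = new int[numClasses];
        this.Sum = new double[centroid.length];
    }

    /** Absorb one data point; classLabel < 0 means the point is unlabeled. */
    void add(double[] x, int classLabel) {
        n++;
        for (int r = 0; r < x.length; r++) Sum[r] += x[r];
        if (classLabel >= 0) { L++; Lt[classLabel]++; }
    }
}
```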
Using the Micro-Clusters as Classification
Model
We remove all the raw data points after saving
the micro-clusters
The set of K such micro-clusters built from a
data chunk serves as a classification model
To classify a test data point x using the model:
Find the Q nearest micro-clusters (by computing the
distance between x and their centroids)
For each class Cj, compute the "cumulative
normalized frequency" CNFrq[j], where
CNFrq[j] = sum of Lt[j]/L over all the Q micro-clusters
Output the class Cj with the highest CNFrq[j]
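A sketch of this Q-nearest-micro-cluster classification, reusing the MicroCluster summary sketched above and assuming Euclidean distance between the test point and the centroids:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Classify a test point with one model (a set of K micro-clusters).
final class MicroClusterClassifier {

    static int classify(List<MicroCluster> model, double[] x, int q, int numClasses) {
        // Find the Q nearest micro-clusters by centroid distance.
        List<MicroCluster> sorted = new ArrayList<>(model);
        sorted.sort(Comparator.comparingDouble(mc -> dist(x, mc.centroid)));
        List<MicroCluster> nearest = sorted.subList(0, Math.min(q, sorted.size()));

        // CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters.
        double[] cnFrq = new double[numClasses];
        for (MicroCluster mc : nearest) {
            if (mc.L == 0) continue;                 // cluster has no labeled points
            for (int j = 0; j < numClasses; j++) {
                cnFrq[j] += (double) mc.Lt[j] / mc.L;
            }
        }

        // Output the class Cj with the highest CNFrq[j].
        int best = 0;
        for (int j = 1; j < numClasses; j++) {
            if (cnFrq[j] > cnFrq[best]) best = j;
        }
        return best;
    }

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int r = 0; r < a.length; r++) s += (a[r] - b[r]) * (a[r] - b[r]);
        return Math.sqrt(s);
    }
}
```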
Experiments
Data sets:
Synthetic data – simulates evolving data streams
Real data – botnet data, collected from real botnet traffic
generated in a controlled environment
Baseline:
On-demand Stream
C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand
classification of evolving data streams. IEEE Transactions on Knowledge and
Data Engineering, 18(5):577–589, 2006.
For training, we use 5% labeled data in each chunk.
So, if there are 100 instances in a chunk
Our technique (SmSCluster) uses only 5 labeled and 95
unlabeled instances for training
On-demand Stream uses 100 labeled instances for training
Environment
S/W: Java, OS: Windows XP, H/W: Pentium-IV, 3GHz dual core,
2GB RAM
Results
Each time unit = 1,000 data points (botnet)
and 1,600 data points (synthetic)
Conclusion
We address a more practical approach to
classifying evolving data streams: training with a
limited amount of labeled data.
Our technique applies semi-supervised
clustering to train classification models.
Our technique outperforms other state-of-the-art
stream classification techniques, which use 20 times
more labeled data for training than our technique.