Data Stream Classification - The University of Texas at Dallas


Data Stream Classification:
Training with Limited Amount of
Labeled Data
Mohammad Mehedy Masud
Latifur Khan
Bhavani Thuraisingham
University of Texas at Dallas
Jing Gao
Jiawei Han
University of Illinois at
Urbana-Champaign
To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008
Funded by: Air Force
Data Stream Classification
Techniques

Data stream classification is a challenging task for two reasons:
◦ Infinite length – we cannot use all historical data for training
◦ Concept-drift – old models become outdated

Solutions:
◦ Single model classification with incremental learning
◦ Ensemble classification
◦ Ensemble techniques can be updated more efficiently and handle concept-drift more effectively

Our solution: ensemble approach
Ensemble Classification

Ensemble techniques build an ensemble of M classifiers.
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one of the old classifiers, if required (a rough sketch of this loop follows the figure below)
[Figure: the data stream is divided into chunks. The last labeled data chunk is used to train a new classifier (step 2), which updates the ensemble L1 ... LM, while the last unlabeled data chunk is classified by the ensemble (step 1).]
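A minimal Java sketch of this chunk-by-chunk maintenance loop is given below. The Classifier interface, the train function, and the rule of replacing the member with the highest error on the newest labeled chunk are illustrative assumptions; the slide only says a new classifier replaces an old one "if required".

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical classifier interface; predict() and errorOn() are assumed names.
interface Classifier {
    int predict(double[] x);
    double errorOn(List<double[]> chunk, List<Integer> labels);
}

class EnsembleMaintainer {
    private final int M;                              // maximum number of ensemble members
    private final List<Classifier> ensemble = new ArrayList<>();

    EnsembleMaintainer(int maxSize) { this.M = maxSize; }

    // Train a new classifier on the latest labeled chunk and update the ensemble.
    void onNewLabeledChunk(List<double[]> chunk, List<Integer> labels,
                           BiFunction<List<double[]>, List<Integer>, Classifier> train) {
        Classifier fresh = train.apply(chunk, labels);
        if (ensemble.size() < M) {                    // room left: just add the new member
            ensemble.add(fresh);
            return;
        }
        int worst = 0;                                // otherwise find the worst old member
        double worstErr = -1.0;
        for (int i = 0; i < ensemble.size(); i++) {
            double err = ensemble.get(i).errorOn(chunk, labels);
            if (err > worstErr) { worstErr = err; worst = i; }
        }
        if (fresh.errorOn(chunk, labels) < worstErr) {
            ensemble.set(worst, fresh);               // replace only "if required"
        }
    }

    // Classify one instance by a simple majority vote over the ensemble members.
    int classify(double[] x) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Classifier c : ensemble) votes.merge(c.predict(x), 1, Integer::sum);
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet())
            if (e.getValue() > bestVotes) { bestVotes = e.getValue(); best = e.getKey(); }
        return best;
    }
}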
Limited Labeled Data Problem


Ensemble techniques assume that the entire data chunk is labeled during training.
This assumption is impractical because
◦ Data labeling is both time-consuming and costly
◦ It may not be possible to label all the data
◦ Especially in a streaming environment, where data is produced at high speed

Our solution:
◦ Train classifiers with a limited amount of labeled data
◦ Assume that only a fraction of each data chunk is labeled
◦ We obtain better results than other techniques that train classifiers with fully labeled data chunks
Training With Partially-Labeled
Chunk
◦ Train a new classifier using semi-supervised clustering
◦ If a new class has arrived in the stream, refine the existing classifiers
◦ Update the ensemble

[Figure: the last unlabeled data chunk is classified by the ensemble L1 ... LM (step 1); the last partially labeled chunk is used to train a classifier via semi-supervised clustering (step 2); the ensemble is refined (step 3) and then updated (step 4).]
Semi-Supervised Clustering

Overview:
◦ Only a few instances in the training data are labeled
◦ We apply impurity-based clustering
◦ The goal is to minimize cluster impurity
◦ A cluster is completely pure if all the labeled data in that cluster belong to the same class
Objective function (one possible form is reconstructed below):
◦ K: number of clusters; Xi: data points belonging to cluster i; Li: labeled data points belonging to cluster i
◦ Impi: impurity of cluster i = entropy * dissimilarity count
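The formula itself is not reproduced in this transcript, so the display below is only one plausible reconstruction consistent with the definitions above (intra-cluster dispersion weighted by impurity); the exact form used in the paper may differ.

\[
  \min_{X_1,\dots,X_K} \; \sum_{i=1}^{K} \Bigl( \sum_{x \in X_i} \lVert x - \mu_i \rVert^{2} \Bigr) \bigl( 1 + \mathrm{Imp}_i \bigr),
  \qquad
  \mathrm{Imp}_i = \mathrm{Ent}_i \times \mathrm{DC}_i
\]

where \(\mu_i\) is the centroid of cluster \(i\), \(\mathrm{Ent}_i\) is the entropy of the class labels among the labeled points \(L_i\), and \(\mathrm{DC}_i\) is the dissimilarity count of cluster \(i\).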
Semi-Supervised Clustering Using E-M

Constrained initialization:
◦ Initialize K seeds
◦ For each class Cj, select kj seeds from the labeled instances of Cj using the farthest-first traversal heuristic
◦ kj = (Nj/N) * K, where Nj = number of labeled instances in class Cj and N = total number of labeled instances

Repeat E-step and M-step until convergence:

E-step
◦ Assign each instance to a cluster so that the objective function is minimized
◦ Apply Iterated Conditional Modes (ICM) until convergence

M-step
◦ Re-compute the cluster centroids (a rough sketch of the whole procedure follows)
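Below is a rough Java sketch of the constrained initialization and a surrounding E-M loop. The E-step here is a plain nearest-centroid assignment standing in for the impurity-aware ICM step described on the slide, and every class and method name is an illustrative assumption, so this is an approximation rather than the paper's algorithm.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SemiSupervisedSeeding {

    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
        return s;
    }

    // Farthest-first traversal: pick k well-spread seeds from the given points.
    static List<double[]> farthestFirst(List<double[]> pts, int k, Random rnd) {
        List<double[]> seeds = new ArrayList<>();
        if (pts.isEmpty() || k <= 0) return seeds;
        seeds.add(pts.get(rnd.nextInt(pts.size())));
        while (seeds.size() < k && seeds.size() < pts.size()) {
            double[] farthest = null;
            double farthestDist = -1;
            for (double[] p : pts) {
                double nearest = Double.MAX_VALUE;
                for (double[] s : seeds) nearest = Math.min(nearest, dist2(p, s));
                if (nearest > farthestDist) { farthestDist = nearest; farthest = p; }
            }
            seeds.add(farthest);
        }
        return seeds;
    }

    // Constrained initialization: give each class Cj roughly kj = (Nj / N) * K seeds,
    // chosen from its own labeled instances by farthest-first traversal.
    static List<double[]> constrainedInit(Map<Integer, List<double[]>> labeledByClass, int K) {
        int N = 0;
        for (List<double[]> pts : labeledByClass.values()) N += pts.size();
        Random rnd = new Random(1);
        List<double[]> seeds = new ArrayList<>();
        for (List<double[]> pts : labeledByClass.values()) {
            int kj = Math.max(1, (int) Math.round((double) pts.size() / N * K));
            seeds.addAll(farthestFirst(pts, kj, rnd));
        }
        return seeds;   // rounding may make the total differ slightly from K
    }

    // Simplified E-M: the E-step assigns every point to its nearest centroid
    // (the impurity-aware ICM assignment is omitted); the M-step recomputes centroids.
    static List<double[]> cluster(List<double[]> allPoints, List<double[]> seeds, int iterations) {
        List<double[]> centroids = new ArrayList<>();
        for (double[] s : seeds) centroids.add(s.clone());
        int dim = allPoints.get(0).length;
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[centroids.size()][dim];
            int[] counts = new int[centroids.size()];
            for (double[] p : allPoints) {                          // E-step (simplified)
                int best = 0;
                for (int c = 1; c < centroids.size(); c++)
                    if (dist2(p, centroids.get(c)) < dist2(p, centroids.get(best))) best = c;
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            for (int c = 0; c < centroids.size(); c++)              // M-step
                if (counts[c] > 0)
                    for (int d = 0; d < dim; d++) centroids.get(c)[d] = sums[c][d] / counts[c];
        }
        return centroids;
    }
}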
Saving Cluster Summary as Micro-Cluster

For each of the K clusters created by the semi-supervised clustering, save the following summary (sketched below):
◦ Centroid
◦ n: total number of points
◦ L: total number of labeled points
◦ Lt[ ]: array containing the number of labeled points belonging to each class
   e.g., Lt[j]: number of labeled data points belonging to class Cj
◦ Sum[ ]: array containing the sum of the attribute values of each dimension over all the data points
   e.g., Sum[r]: sum of the r-th dimension of all data points in the cluster
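A small additive structure along these lines can hold the summary; the field names mirror the slide (n, L, Lt[ ], Sum[ ]), while the class itself and its methods are illustrative assumptions, not code from the paper.

// Additive summary of one cluster; the raw points can be discarded once it is built.
class MicroCluster {
    final int numClasses;
    final int dim;
    int n = 0;                // total number of points absorbed
    int L = 0;                // number of labeled points among them
    final int[] Lt;           // Lt[j]: labeled points belonging to class Cj
    final double[] Sum;       // Sum[r]: sum of the r-th attribute over all points

    MicroCluster(int numClasses, int dim) {
        this.numClasses = numClasses;
        this.dim = dim;
        this.Lt = new int[numClasses];
        this.Sum = new double[dim];
    }

    // Absorb one data point; label = -1 means the point is unlabeled.
    void add(double[] x, int label) {
        n++;
        for (int r = 0; r < dim; r++) Sum[r] += x[r];
        if (label >= 0) { L++; Lt[label]++; }
    }

    // The centroid can always be recovered from the additive summary as Sum[r] / n.
    double[] centroid() {
        double[] c = new double[dim];
        for (int r = 0; r < dim; r++) c[r] = Sum[r] / n;
        return c;
    }
}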
Using the Micro-Clusters as Classification
Model
We remove all the raw data points after saving the micro-clusters
The set of K such micro-clusters built from a data chunk serves as a classification model
To classify a test data point x using the model (see the sketch below):
◦ Find the Q nearest micro-clusters (by computing the distance between x and their centroids)
◦ For each class Cj, compute the "cumulative normalized frequency" CNFrq[j] = sum of Lt[j]/L over the Q micro-clusters
◦ Output the class Cj with the highest CNFrq[j]
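A Java sketch of this classification step follows, reusing the MicroCluster summary sketched earlier; the use of Euclidean distance to the centroid and the method names are assumptions for illustration.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class MicroClusterModel {
    private final List<MicroCluster> clusters;   // the K micro-clusters built from one chunk
    private final int numClasses;

    MicroClusterModel(List<MicroCluster> clusters, int numClasses) {
        this.clusters = clusters;
        this.numClasses = numClasses;
    }

    // Classify a test point x from its Q nearest micro-clusters.
    int classify(double[] x, int Q) {
        // Sort micro-clusters by the distance between x and their centroids.
        List<MicroCluster> sorted = new ArrayList<>(clusters);
        sorted.sort(Comparator.comparingDouble(mc -> dist2(x, mc.centroid())));

        // Cumulative normalized frequency: CNFrq[j] = sum of Lt[j] / L over the Q nearest clusters.
        double[] cnFrq = new double[numClasses];
        for (MicroCluster mc : sorted.subList(0, Math.min(Q, sorted.size()))) {
            if (mc.L == 0) continue;                 // no labeled points, nothing to contribute
            for (int j = 0; j < numClasses; j++) cnFrq[j] += (double) mc.Lt[j] / mc.L;
        }

        // Output the class with the highest cumulative normalized frequency.
        int best = 0;
        for (int j = 1; j < numClasses; j++) if (cnFrq[j] > cnFrq[best]) best = j;
        return best;
    }

    private static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) { double diff = a[d] - b[d]; s += diff * diff; }
        return s;
    }
}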
Experiments

Data sets:
◦ Synthetic data – simulates evolving data streams
◦ Real data – botnet data, collected from real botnet traffic generated in a controlled environment

Baseline: On-demand Stream
◦ C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
For training, we use 5% labeled data in each chunk. So, if there are 100 instances in a chunk:
◦ Our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training
◦ On-demand Stream uses 100 labeled instances for training

Environment:
◦ S/W: Java; OS: Windows XP; H/W: Pentium IV, 3 GHz dual core, 2 GB RAM
Results


[Result charts: each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic).]
Conclusion
We address a more practical approach to classifying evolving data streams – training with a limited amount of labeled data.
Our technique applies semi-supervised clustering to train classification models.
Our technique outperforms other state-of-the-art stream classification techniques, which use 20 times more labeled data for training than our technique.
