
Incremental Generalized Eigenvalue
Classification on data streams
[email protected]
High Performance Computing and Networking Institute
National Research Council – Naples, ITALY
International Workshop on Data Stream Management and Mining
Beijing, October 27-29, 2008
Acknowledgements
• Panos Pardalos, Onur Seref, Claudio Cifarelli
• Davide Feminiano, Salvatore Cuciniello
• Rosanna Verde
Outline
• Introduction
  • Challenges, applications and existing methods
• ReGEC (Regularized Generalized Eigenvalue Classifier)
• I-ReGEC (Incremental ReGEC)
• I-ReGEC on data streams
• Experiments
Supervised learning
• Supervised learning refers to the capability of a system to learn from examples
• The trained system is able to answer new questions
• Supervised means that the desired answer for the training set is provided by an external teacher
• Binary classification is among the most successful methods for supervised learning
Supervised learning
• The Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm that uses a subset of the training data
• The advantage is that the classification model can be incrementally updated
  • The algorithm decides online which points bring new information and updates the model
• Experiments and comparisons assess I-ReGEC classification accuracy and processing speed
The challenges
• Applications on massive data sets are emerging with increasing frequency
  • Data has to be analyzed as soon as it is produced
• Legacy databases and warehouses with petabytes of data cannot be loaded in main memory, so data are accessed as streams
• Classification algorithms have to deal with large amounts of data delivered in the form of streams
The challenges
• Classification has to be performed online
  • Data is processed on the fly, at transmission speed
• Need for data sampling
  • Classification models must not overfit the data, yet be detailed enough to describe the phenomenon
• Training behavior changes over time
  • The nature and gradient of the data change over time
Applications on data streams
• Sensor networks
  • Power grids, telecommunications, bio, seismic, security, …
• Computer network traffic
  • Spam, intrusion detection, IP traffic logs, …
• Bank transactions
  • Financial data, frauds, credit cards, …
• Web
  • Browser clicks, user queries, link rating, …
Support Vector Machines
• The SVM algorithm is among the most successful methods for classification, and its variations have been applied to data streams
• The general-purpose methods are only suitable for small-size problems
• For large problems, chunking, subset selection and decomposition methods work on subsets of points
• SVM-Light and libSVM are among the most widely used implementations relying on chunking, subset selection and decomposition methods
Support Vector Machines
Find two parallel lines with maximum margin, leaving all points of a class on one side.
[Figure: classes A and B separated with maximum margin; the support vectors lie on the margin lines.]

min over w, b of  ||w||²
s.t.  Aw + b ≥ e
      Bw + b ≤ -e
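As a side note not on the original slides, a minimal sketch of this maximum-margin formulation using a linear SVM from scikit-learn; the toy data and parameter values are assumptions made for the example.

```python
# Minimal sketch of training a maximum-margin (linear SVM) classifier on two
# point sets A and B; data and parameters are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
A = rng.normal(loc=+1.0, size=(50, 2))        # class +1 sample points
B = rng.normal(loc=-1.0, size=(50, 2))        # class -1 sample points

X = np.vstack([A, B])
y = np.hstack([np.ones(len(A)), -np.ones(len(B))])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # solves the max-margin problem
print("w:", clf.coef_[0], "b:", clf.intercept_[0])
print("support vectors:", clf.support_vectors_.shape[0])
```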
SVM for data streams
• The batch technique uses an SVM model on the complete data set
• The error-driven technique randomly stores k samples in a training set and the others in a test set; if a point is correctly classified, it remains in the training set
• The fixed-partition technique divides the training set into batches of fixed size; this partition permits adding points to the current SVM model according to the ones loaded in memory
SVM for data streams
• The exceeding-margin technique checks, at time t and for each new point, whether the new data exceeds the margin evaluated by the SVM (see the sketch after this list)
  • If so, the point is added to the incremental training set; otherwise it is discarded
• The exceeding-margin + errors technique adds the new point to the training set if it either exceeds the margin or is misclassified
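A schematic sketch of the exceeding-margin update, under my reading that "exceeding the margin" means the new point violates the margin condition y·f(x) ≥ 1 under the current model; the helper below is illustrative and not the original implementation.

```python
# Sketch of the exceeding-margin idea (one possible reading, not the original
# code): keep a new point only if it violates the current SVM margin, then retrain.
import numpy as np
from sklearn.svm import SVC

def exceeding_margin_update(clf, X_inc, y_inc, x_new, y_new):
    """clf: an already fitted SVC; returns the (possibly retrained) classifier
    and the updated incremental training set."""
    if y_new * clf.decision_function(x_new.reshape(1, -1))[0] < 1.0:
        X_inc = np.vstack([X_inc, x_new])                 # point falls inside the margin
        y_inc = np.append(y_inc, y_new)
        clf = SVC(kernel="linear").fit(X_inc, y_inc)      # retrain on the enlarged set
    return clf, X_inc, y_inc                              # otherwise the point is discarded
```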
Pros of SVM based methods
• They require a minimal computational burden to build classification models
• In case of kernel-based nonlinear classification, they reduce the size of the training set and thus of the related kernel
• All of these methods show that a sensible data reduction is possible while maintaining a comparable level of classification accuracy
A different approach:
ReGEC
Find two lines, each the closest to one set and the furthest from the other.
[Figure: classes A and B, each fitted by its own line.]

min over (w, g) ≠ 0 of  ||Aw - eg||² / ||Bw - eg||²

M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems, OMS, 2007.
ReGEC formulation
min over (w, g) ≠ 0 of  ||Aw - eg||² / ||Bw - eg||²

Let:
G = [A -e]'[A -e],  H = [B -e]'[B -e],  z = [w' g]'

The minimization becomes:
min over z ≠ 0 of  z'Gz / z'Hz

i.e., the Rayleigh quotient of the generalized eigenvalue problem G x = λ H x
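The eigenvalue step can be sketched as follows; this is a minimal illustration that omits the regularization term ReGEC adds to G and H, and assumes H is nonsingular.

```python
# Minimal sketch of the generalized eigenvalue step behind ReGEC
# (no regularization term, H assumed nonsingular).
import numpy as np
from scipy.linalg import eig

def generalized_eig_planes(A, B):
    """Return (w1, g1), (w2, g2): the planes closest to class A and class B."""
    GA = np.hstack([A, -np.ones((A.shape[0], 1))])   # [A -e]
    GB = np.hstack([B, -np.ones((B.shape[0], 1))])   # [B -e]
    G, H = GA.T @ GA, GB.T @ GB
    vals, vecs = eig(G, H)                           # solve G z = lambda H z
    vals = vals.real
    z_min = vecs[:, np.argmin(vals)].real            # minimizes z'Gz / z'Hz -> class A plane
    z_max = vecs[:, np.argmax(vals)].real            # maximizes the quotient -> class B plane
    return (z_min[:-1], z_min[-1]), (z_max[:-1], z_max[-1])
```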
ReGEC classification
of new points
A new point is assigned to the class described by the closest line.
[Figure: classes A and B with their respective lines.]

dist(x, P_i) = |x'w - g| / ||w||
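Continuing the sketch above, assignment by distance could look like this; the function name and class labels are illustrative.

```python
# Sketch of the assignment rule: a point goes to the class whose plane
# x'w - g = 0 is closest; uses the planes returned by generalized_eig_planes.
import numpy as np

def assign_class(x, plane_A, plane_B):
    (w1, g1), (w2, g2) = plane_A, plane_B
    dist_A = abs(x @ w1 - g1) / np.linalg.norm(w1)   # distance to the class A plane
    dist_B = abs(x @ w2 - g2) / np.linalg.norm(w2)   # distance to the class B plane
    return "A" if dist_A <= dist_B else "B"
```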
ReGEC
• Let [w1, g1] and [w2, g2] be the eigenvectors of the minimum and maximum eigenvalues of G x = λ H x; then:
  • a ∈ A is closer to x'w1 - g1 = 0 than to x'w2 - g2 = 0
  • b ∈ B is closer to x'w2 - g2 = 0 than to x'w1 - g1 = 0
Incremental learning
• The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results
• A smaller set of points reduces the probability of overfitting the problem
• A classification model built from a smaller subset is computationally more efficient in predicting new points
• As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is evaluated only against the small subset
C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, and P.M. Pardalos.
Incremental Classification with Generalized Eigenvalues, JoC, 2007.
Incremental learning algorithm
1:  Γ1 = C \ C0
2:  {M0, Acc0} = Classify(C; C0)
3:  k = 1
4:  while |Γk| > 0 do
5:      xk = arg max over x ∈ {Mk-1 ∩ Γk-1} of dist(x, P_class(x))
6:      {Mk, Acck} = Classify(C; Ck-1 ∪ {xk})
7:      if Acck > Acck-1 then
8:          Ck = Ck-1 ∪ {xk}
9:      end if
10:     Γk = Γk-1 \ {xk}
11:     k = k + 1
12: end while
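A hedged Python rendering of the selection loop above; Classify and dist are placeholders for the ReGEC training and distance computations, points are assumed hashable (e.g., indices into the data matrix), and the guard for an empty candidate set is my addition.

```python
# Sketch of the incremental selection loop (control flow only; classify and
# dist are placeholders for the ReGEC training / distance routines).
def incremental_train(C, C0, classify, dist):
    """C: full training set, C0: initial incremental set.
    classify(C, S) -> (misclassified points, accuracy) of the model trained on S;
    dist(x) -> distance of x from the plane of its own class."""
    gamma = [x for x in C if x not in C0]        # Gamma_1 = C \ C0
    Ck = list(C0)
    M, acc = classify(C, Ck)                     # {M0, Acc0}
    while gamma:
        candidates = [x for x in M if x in gamma]
        if not candidates:                       # guard: nothing left to try
            break
        xk = max(candidates, key=dist)           # misclassified point farthest from its plane
        M, new_acc = classify(C, Ck + [xk])      # {Mk, Acck}
        if new_acc > acc:                        # keep xk only if accuracy improves
            Ck = Ck + [xk]
            acc = new_acc
        gamma.remove(xk)                         # Gamma_k = Gamma_{k-1} \ {xk}
    return Ck
```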
Incremental classification on
data streams
• Find a small and robust subset of the training set while accessing the data available in a window of size wsize
• When the window is full, all points in it are processed by the classifier (see the sketch below)
M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue
Classification on Data Streams. MODULAD, 2007.
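A possible shape of the windowed streaming loop, reusing the incremental_train sketch above; the stream object and helper arguments are assumptions, not the authors' code.

```python
# Sketch of windowed stream processing: buffer wsize labelled points, then let
# the incremental learner decide which of them enter the model (illustrative).
def process_stream(stream, wsize, C0, classify, dist):
    model_set = list(C0)                 # current incremental training set
    window = []
    for point in stream:                 # stream: any iterable of labelled points
        window.append(point)
        if len(window) == wsize:         # window full: process it
            model_set = incremental_train(model_set + window, model_set,
                                          classify, dist)
            window.clear()               # move on to the next window
    return model_set
```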
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
[Figure: a window of size wsize sliding over the incoming data stream]
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
[Figure: window containing old and new data]
At each step, the data in the window are processed with the incremental learning classifier…
…and the hyperplanes are built
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
Step by step, new points are processed…
…and I-ReGEC updates the hyperplane configuration
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
But not all points are considered…
Some of them are discarded if their information contribution is negligible
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
New unknown incoming points are classified by their distance from the hyperplanes
I-ReGEC: Incremental Regularized
Eigenvalue Classifier on Streams
An incremental learning technique based on ReGEC that determines the classification model from a very small sample of the data stream
Experiments
Large-noisy-crossed-norm data set:
• 200,000 points with 20 features, equally divided into 2 classes
• 100,000 training points
• 100,000 test points
• Each class is drawn from a multivariate normal distribution
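For illustration only, a similar synthetic set could be generated as below; the class means and covariance are placeholders, since the slide does not give the actual distribution parameters.

```python
# Illustrative generation of a two-class, 20-feature Gaussian data set
# (placeholder means/covariance; not the exact large-noisy-crossed-norm parameters).
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 100_000                                   # 20 features, 100,000 points per class
mu = np.ones(d)                                      # placeholder class mean
class_A = rng.multivariate_normal( mu, np.eye(d), size=n)
class_B = rng.multivariate_normal(-mu, np.eye(d), size=n)
X = np.vstack([class_A, class_B])
y = np.hstack([np.ones(n), -np.ones(n)])
```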
Misclassification results
SI-ReGEC has the lowest error and uses the smallest incremental set

Legend:
B: Batch SVM
ED: Error-driven KNN
FP: Fixed-partition SVM
EM: Exceeding-margin SVM
EM+E: Exceeding-margin + errors SVM
Window size
Larger windows lead to a smaller training subset, and execution time increases with window size
One billion elements per day (roughly 11,500 points per second) can be processed on standard hardware
Conclusion and future work
• The classification accuracy of I-ReGEC compares well with that of other methods
• I-ReGEC produces small incremental training sets
• Future work: investigate how to dynamically adapt the window size to the stream rate and to nonstationary data streams