An Adaptive Nearest Neighbor
Classification Algorithm
for Data Streams
Yan-Nei Law & Carlo Zaniolo
University of California, Los Angeles
PKDD, Porto, 2005
1
Outline
Related Work
ANNCAD
Properties of ANNCAD
Conclusion
2
Classifying Data Streams
Problem Statement: We seek an algorithm for
classifying data streams with numerical attributes (it also works for totally ordered domains).
Desiderata:
Fast update speed for newly arriving records.
Only a single pass over the data.
Incremental algorithms are needed.
Coping with concept changes.
Classical mining algorithms were not designed for
data streams and need to be replaced or modified.
3
Classifying Data Streams: Related Work
Hoeffding trees:
VFDT and CVFDT: build decision trees incrementally.
Require a large number of examples to obtain a classifier with fair performance.
Unsatisfactory performance when the training set is small.
Ensembles:
Combine base models by a voting technique.
Suitable for coping with concept drift.
Fail to provide a simple model and understanding of the problem.
4
State of the Art:
Nearest Neighbor Classifiers
Pros and cons:
+: Strong intuitive appeal and simple to implement
-: Fail to provide simple models/rules
-: Expensive Computations
ANN: Approximate Nearest Neighbor with error guarantee 1+ε:
Idea: pre-process the data into a data structure
(e.g., a ring-cover tree) to speed up the search.
Designed for stored data only.
The time to update the pre-processing structure depends on the size
of the data set, which may be unbounded.
5
Our Algorithm: ANNCAD
Adaptive NN Classification Algorithm for Data
Streams
Model building:
Pre-assign classes to obtain an approximate
result and provide simple models/rules.
Decompose the feature space to make
classification decisions.
Akin to wavelets.
Classification:
Find the NN for classification adaptively:
progressively expand the search area
around a test point (star in the figure).
6
Quantize Feature Space and
Compute Multi-resolution Coefficients
A set of 100 two-class training points.
Quantize the feature space and record the class counts into data arrays.
Each coarser coefficient is the average of the four cells below it, e.g. (8+9+10+0)/4 = 6.75.
Multi-resolution representation of a two-class data set.
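A minimal sketch of this step, assuming 2-D points in the unit square, a g x g grid with g a power of two, and per-cell class counts kept in NumPy arrays (names and array layout are illustrative, not the authors' code). Coarser coefficients are simple averages of 2 x 2 blocks, e.g. (8+9+10+0)/4 = 6.75, in the wavelet-like spirit of the slide:

```python
import numpy as np

def quantize(points, labels, g=8, n_classes=2):
    """Quantize 2-D points in [0,1)^2 into a g x g grid and count
    how many training points of each class fall into every cell."""
    counts = np.zeros((n_classes, g, g))
    for (x, y), c in zip(points, labels):
        i, j = min(int(x * g), g - 1), min(int(y * g), g - 1)
        counts[c, i, j] += 1
    return counts

def coarsen(counts):
    """One multi-resolution step: average every 2 x 2 block of cells,
    e.g. (8 + 9 + 10 + 0) / 4 = 6.75 at the next coarser level."""
    c, g, _ = counts.shape
    return counts.reshape(c, g // 2, 2, g // 2, 2).mean(axis=(2, 4))

def build_pyramid(counts):
    """All resolution levels, finest first."""
    levels = [counts]
    while levels[-1].shape[-1] > 1:
        levels.append(coarsen(levels[-1]))
    return levels

# toy usage: 100 random two-class points
rng = np.random.default_rng(0)
pts = rng.random((100, 2))
lab = (pts[:, 0] + pts[:, 1] > 1).astype(int)
pyramid = build_pyramid(quantize(pts, lab, g=8))
print([lvl.shape for lvl in pyramid])
```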
7
Building a Classifier
Label each block with its majority class: e.g. B=6.75, R=0.6 → Blue.
Label a block only if |C1st| - |C2nd| > 80%; otherwise tag it "Mix": e.g. B=2, R=4.25 → Mix; B=3, R=3.25 → Mix.
Hierarchical structure of the ANNCAD classifier.
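A sketch of the block-labeling rule on one level, under the same illustrative layout; reading the 80% threshold as relative to the block's total count is an assumption:

```python
import numpy as np

MIX, EMPTY = -1, -2  # illustrative tags for "mixed" and empty blocks

def label_blocks(counts, threshold=0.8):
    """counts: (n_classes, g, g) per-block class counts for one level.
    Label a block with its majority class only when the gap between the
    largest and second-largest class counts exceeds `threshold` of the
    block total (the slide's 80% rule, read relative to the block total);
    otherwise tag it MIX, or EMPTY when no training point fell in it."""
    total = counts.sum(axis=0)
    order = np.sort(counts, axis=0)      # ascending along the class axis
    gap = order[-1] - order[-2]          # |C_1st| - |C_2nd| per block
    majority = counts.argmax(axis=0)
    labels = np.where(gap > threshold * total, majority, MIX)
    return np.where(total > 0, labels, EMPTY)
```

With the pyramid from the previous sketch, `classifier = [label_blocks(lvl) for lvl in pyramid]` gives one labeling per resolution level.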
8
Decision Algorithm on the ANNCAD Hierarchy
Classified block: label the test point with its class.
Unclassified block: go to the next level.
Block with tag "Mix": go back to the previous level and compute the distance
between the test point and the center of every nonempty neighboring block.
The combined classifier over multiple levels.
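A sketch of this decision procedure, assuming the levels are visited from finest to coarsest (consistent with progressively expanding the search area) and reusing the illustrative data layout from the previous sketches; for brevity the nearest-neighbor step scans all nonempty blocks rather than only the neighbors of the test point's block:

```python
import numpy as np

MIX, EMPTY = -1, -2

def classify(x, pyramid, labelings):
    """Multi-level decision for a test point x in [0,1)^2.
    pyramid[k]   : (n_classes, g_k, g_k) class counts at level k (finest first)
    labelings[k] : (g_k, g_k) block labels at level k (class id, MIX or EMPTY)
    Reading of the slide: an EMPTY (unclassified) block sends the search to
    the next coarser level; a classified block decides immediately; a MIX
    block sends us back to the previous (finer) level, where the nearest
    nonempty block decides."""
    for k, labels in enumerate(labelings):
        g = labels.shape[0]
        i = min(int(x[0] * g), g - 1)
        j = min(int(x[1] * g), g - 1)
        tag = labels[i, j]
        if tag == EMPTY:
            continue                      # unclassified block: next level
        if tag != MIX:
            return int(tag)               # classified block: done
        fine = pyramid[max(k - 1, 0)]     # "Mix": go back to prev. level
        return nearest_block_class(x, fine)
    return MIX                            # no level could decide

def nearest_block_class(x, counts):
    """Majority class of the nonempty block whose center is closest to x."""
    total = counts.sum(axis=0)
    g = total.shape[0]
    best_class, best_dist = MIX, np.inf
    for i in range(g):
        for j in range(g):
            if total[i, j] == 0:
                continue
            center = np.array([(i + 0.5) / g, (j + 0.5) / g])
            dist = np.linalg.norm(np.asarray(x, dtype=float) - center)
            if dist < best_dist:
                best_dist = dist
                best_class = int(counts[:, i, j].argmax())
    return best_class
```

Usage with the earlier structures: `classify((0.3, 0.7), pyramid, classifier)`.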
9
Incremental Update
[Figure: the multi-resolution data arrays before and after a new training point arrives; at each level only the cell containing the new point changes.]
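A sketch of the incremental update on the pyramid from the first sketch: a new training point touches exactly one cell per level, so only that cell changes. The 1/4^k increment at level k reflects that a coarser coefficient averages 4^k finest cells (my reading of the averaged representation, for the 2-D case):

```python
def update(pyramid, x, c):
    """Add one training point of class c at position x in [0,1)^2.
    pyramid: list of (n_classes, g_k, g_k) arrays, finest level first,
    where coarser levels store averages of 2 x 2 blocks of the level below.
    Only the single cell containing x changes at each level, so the cost
    is O(number of levels) = O(log2 g)."""
    for k, counts in enumerate(pyramid):
        g = counts.shape[-1]
        i = min(int(x[0] * g), g - 1)
        j = min(int(x[1] * g), g - 1)
        counts[c, i, j] += 1.0 / (4 ** k)   # an average over 4**k finest cells
```

For example, `update(pyramid, (0.3, 0.7), c=1)`; only the touched blocks need to be relabeled afterwards.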
10
Concept Drift: Adaptation by
Exponential Forgetting
Data array A, fading factor λ (0 < λ ≤ 1): A_new = λ · A_old.
No effect if the concept does not change.
Adapts quickly (exponentially) if the concept changes.
No extra memory needed (no sliding window required).
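A sketch of exponential forgetting on the same arrays: every count is scaled by a factor λ in (0, 1] before new points are added, so old data decays geometrically and no window of past records needs to be stored:

```python
def forget(pyramid, lam=0.98):
    """Exponential forgetting: scale every data array by lam (0 < lam <= 1).
    Old contributions decay geometrically; no sliding window is kept."""
    for counts in pyramid:
        counts *= lam
```

Calling `forget(pyramid, 0.98)` before each update (0.98 is the factor used on the concept-shift slide) lets the classifier track drift; λ = 1 disables forgetting.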
11
Grid Position and Resolution
Problem: Neighborhood decision strongly depends on grid
position
Solution: Build several classifiers by shifting grid position by
1/n. Then combine the results by voting.
Example: 4 different grids for building 4 classifiers.
Thm. Let x be a test point, with n^d classifiers and b(x) the blocks containing x; then
for z ∉ b(x) and y ∈ b(x): dist(x, y) < (1 + 1/(n-1)) · dist(x, z).
In practice, only 2-3 classifiers are needed to achieve good results.
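A sketch of the shifted-grid ensemble, reusing the illustrative quantize, build_pyramid, label_blocks and classify helpers from the earlier sketches; for brevity it shifts all dimensions by the same offset, giving n classifiers rather than the full n^d, in line with the observation that 2-3 classifiers already work well:

```python
import numpy as np
from collections import Counter

def build_shifted_classifiers(points, labels, g=8, n=2, n_classes=2):
    """One classifier per grid offset s/(n*g), s = 0..n-1; shifting the grid
    is emulated by translating the points before quantization."""
    classifiers = []
    for s in range(n):
        shift = s / (n * g)
        shifted = (np.asarray(points, dtype=float) + shift) % 1.0
        pyr = build_pyramid(quantize(shifted, labels, g, n_classes))
        classifiers.append((shift, pyr, [label_blocks(lvl) for lvl in pyr]))
    return classifiers

def predict(x, classifiers):
    """Combine the shifted-grid classifiers by majority vote."""
    votes = []
    for shift, pyr, labs in classifiers:
        xs = ((x[0] + shift) % 1.0, (x[1] + shift) % 1.0)
        votes.append(classify(xs, pyr, labs))
    return Counter(votes).most_common(1)[0][0]
```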
12
Properties of ANNCAD
Compact support: locality property allows fast
update
Dealing with noise: can set a threshold for
classification decision
Multi-resolution: controls the fineness of the result and the use of system resources.
Low complexity (g^d = total number of cells):
Building the classifier: O(min(N, g^d)).
Testing: O(log2(g) + 2^d).
Updating: O(log2(g) + 1).
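As an illustrative calculation (numbers assumed for concreteness): with d = 3 attributes and g = 32 intervals per dimension, the grid has g^d = 32,768 cells; with N = 3,000 training points the building cost O(min(N, g^d)) is effectively O(N), a test visits at most log2(32) = 5 levels and up to 2^3 = 8 neighboring blocks, and an update touches log2(32) + 1 = 6 cells.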
13
Experiments
Synthetic Data
3-d unit cube.
Class distribution: class 0 inside a sphere of radius 0.5, class 1 outside.
3000 training examples.
1000 test examples.
(a) different initial resolutions.
Exact ANN (baseline, see the sketch below):
Expand the search area by doubling the radius until some training point is reached.
Classify the test point with the majority class of the points found.
(b) different numbers of classifiers in the ensemble.
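A sketch of this exact-ANN baseline as described above (the starting radius r0 and the brute-force layout are illustrative):

```python
import numpy as np
from collections import Counter

def exact_ann(x, points, labels, r0=0.01):
    """Baseline: grow the search radius by doubling it until at least one
    training point falls inside, then return the majority class among
    the points found."""
    x, points = np.asarray(x, dtype=float), np.asarray(points, dtype=float)
    d = np.linalg.norm(points - x, axis=1)
    r = r0
    while True:
        inside = np.asarray(labels)[d <= r]
        if inside.size:
            return Counter(inside.tolist()).most_common(1)[0][0]
        r *= 2.0
```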
14
Experiments (Cont'd)
Real Data 1 -- Letter
Recognition
Objective: identify a pixel display as one of the 26 letters.
16 numerical attributes describe the pixel display.
15,000 training examples
5,000 test examples
Add 5% noise by randomly reassigning classes.
Grid size: 16 units
#Classifiers: 2
15
ANNCAD vs. VFDT
Real Data 2 – Forest
Cover Type
Objective: predict forest
cover type.
10 numerical attributes.
12,000 training examples
9,000 test examples
Grid size: 32 units
#Classifiers: 2
16
Concept Shift: ANNCAD vs CVFDT
Real Data 3 – Adult
Objective: determine whether a person's salary is >50K.
Concept shift simulation: group records by race.
λ = 0.98
Grid Size: 64
#Classifiers: 2
17
Conclusion and Future Work
ANNCAD
An incremental classification algorithm that finds nearest neighbors adaptively.
Suitable for mining data streams: fast update
speed
Exponential forgetting for concept shift/drift.
Future Work: Detect concept shift/drift by
changes in class label of blocks.
18
THANK YOU!
19