Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams

Geoffrey Holmes, Bernhard Pfahringer and Richard Kirkby
Department of Computer Science, University of Waikato, New Zealand

Traditional machine learning model
Data stream requirements
Scaling up existing methods
Network caching
CHIC architecture
Experimental design
Results
Discussion
Further work
Traditional Machine Learning

Create train/test splits of the data (possibly via cross-validation)
Load ALL the training data into main memory
Compute a model from the training data (may involve multiple passes)
Load all the test data into main memory
Compute accuracy measures from the test data
Consequences

Data is processed one instance at a time
Very few incremental methods exist – none used seriously in practice
Many existing techniques don’t scale
Machine learning is perceived as a small-to-medium data set field of study
Larger data sets are tackled through sampling, or by building several models on separate portions of the data and combining their predictions
Data Streams

Takes the “stream” view; the source may be finite or infinite
The concept of train/test is less well defined: you could train for a while, then test for a while – but what is the definition of “a while”?
Whatever you do, you can be sure that ALL the data will NOT fit in main memory
Data Stream Constraints

Cannot store instances (not all of them, anyway)
Cannot use more than the available memory – no swapping to disk
Cannot dwell too long over an instance – must keep up with the incoming rate of instances
Cannot wait to make predictions – need to be ready to make predictions at any time
Scaling up existing methods

Could learn models using existing methods in batches and then merge the models
Could merge instances (meta-instances)
Could use a cache of models, updating the cache over time – e.g. with least-recently-used or least-frequently-used replacement strategies (a minimal sketch follows below)
Could do the above, but use performance measures to decide the make-up of the cache
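As a concrete illustration of the cache-of-models idea, here is a minimal Python sketch of an LRU-style model cache. This is not from the slides; LRUModelCache and its methods are illustrative names only.

```python
from collections import OrderedDict

class LRUModelCache:
    """Fixed-capacity cache of trained models with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.models = OrderedDict()  # model_id -> trained model, oldest first

    def add(self, model_id, model):
        if model_id in self.models:
            self.models.move_to_end(model_id)   # refresh recency
        self.models[model_id] = model
        if len(self.models) > self.capacity:
            self.models.popitem(last=False)     # evict the least recently used

    def touch(self, model_id):
        # Call whenever a cached model is used, so it survives eviction longer.
        if model_id in self.models:
            self.models.move_to_end(model_id)
```

An LFU variant would track use counts instead of recency; the last bullet above replaces both with performance measures.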
Caching in Data Communications

Web proxy caches provide a good model for what we need in order to satisfy the stream constraints
Real caches are hierarchical (e.g. Squid)
The hierarchy provides a mechanism for sharing the load and increasing the likelihood of a page hit
When full, a cache needs a replacement policy
To replicate this system we need to design a hierarchy, fill it (with models) and implement a model replacement policy
General CHIC Architecture

Idea: build a hierarchy of N levels as follows:
Level zero: a data buffer fed from the stream
Level one: build models from the data at level zero
Levels two to N-1: fill with the “best” models from the lower levels
Level N: adopt models from level N-1, but also discard models so that new ones can be entered
For prediction, use all models in the hierarchy and vote (see the sketch below)
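A minimal Python sketch of the hierarchy-plus-voting idea, assuming scikit-learn-style models with a predict method. CHICHierarchy and all details here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

class CHICHierarchy:
    """Sketch of the CHIC structure: a data buffer plus N levels of model slots."""

    def __init__(self, num_levels=6, slots_per_level=16, buffer_size=1000):
        self.buffer = []                               # level zero: raw instances
        self.buffer_size = buffer_size
        self.slots_per_level = slots_per_level
        self.levels = [[] for _ in range(num_levels)]  # levels one to N

    def all_models(self):
        return [m for level in self.levels for m in level]

    def predict(self, x):
        # Every model in the hierarchy gets one vote; the majority class wins.
        votes = Counter(m.predict([x])[0] for m in self.all_models())
        return votes.most_common(1)[0][0]
```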
Features

Can use any machine learning algorithm to build the models
Implements a form of batch-incremental learning
The replacement policy can be performance based
As with a web cache, CHIC fills up initially and then keeps the best-performing models
If a variety of models is used at the lower levels, then it is possible to adapt to the source of the data
Experimental Design

Try to demonstrate adaptation to the data source
Learn a mixture of models at levels one and two and let the performance-based promotion mechanism take over
Evaluation raises two issues: the performance measure used for model promotion/deletion, and the overall performance of the hierarchy (adopting a first-test-then-train approach; see the sketch below)
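A minimal sketch of the first-test-then-train loop, building on the hypothetical CHICHierarchy sketch above. The stream interface and evaluate_then_train are assumptions for illustration, not the authors' evaluation code.

```python
def evaluate_then_train(hierarchy, stream, learner_factories):
    """Each incoming instance is used for testing before it is ever
    used for training; full buffers become new level-one models."""
    correct = total = 0
    for x, y in stream:                        # stream yields (instance, label) pairs
        if hierarchy.all_models():
            correct += int(hierarchy.predict(x) == y)
            total += 1
        hierarchy.buffer.append((x, y))
        if len(hierarchy.buffer) == hierarchy.buffer_size:
            X = [xi for xi, _ in hierarchy.buffer]
            Y = [yi for _, yi in hierarchy.buffer]
            for make_learner in learner_factories:
                hierarchy.levels[0].append(make_learner().fit(X, Y))
            # Promotion/replacement between levels would run here (see later slides).
            hierarchy.buffer.clear()
    return correct / max(total, 1)
```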
Data Sources

Random Naïve Bayes: weight attribute labels and class values (here we use 5 classes, 5 nominal and 5 numeric attributes)
Random Tree: choose a depth and randomly assign splitter nodes – here 5 nodes deep, leaves starting from depth 3, with the same numbers of attributes/classes as above
Random RBF: a random set of centers for each class, where each center is weighted and has a standard deviation – here 50 centers, 10 numeric attributes and 5 classes (sketched below)
Real data: Forest covertype (UCI repository), 7 classes, 500K instances
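A minimal sketch of a random-RBF-style stream generator along the lines described above. The exact sampling scheme is not specified in the slides, so every detail below (uniform centroids, the weight and spread ranges, make_rbf_generator itself) is an assumption.

```python
import random

def make_rbf_generator(num_centers=50, num_attributes=10, num_classes=5, seed=1):
    rng = random.Random(seed)
    centers = [{
        "centroid": [rng.uniform(0, 1) for _ in range(num_attributes)],
        "label": rng.randrange(num_classes),
        "weight": rng.uniform(0, 1),        # how often this center is drawn
        "stddev": rng.uniform(0.05, 0.2),   # spread of instances around it
    } for _ in range(num_centers)]
    weights = [c["weight"] for c in centers]

    def next_instance():
        # Pick a center in proportion to its weight, then sample an
        # instance around its centroid with the center's own spread.
        c = rng.choices(centers, weights=weights)[0]
        x = [rng.gauss(mu, c["stddev"]) for mu in c["centroid"]]
        return x, c["label"]

    return next_instance
```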
Specific CHIC Architecture

Six levels with 16 models per level
Data buffer of size 1000 instances
The first level uses four algorithms to generate models: Naïve Bayes (N), C4.5 (J), linear SVM (L), RBF-kernel SVM (R) – see the sketch below
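To make the four-learner setup concrete, a sketch using scikit-learn stand-ins. The slides name Naïve Bayes, C4.5 and two SVMs; DecisionTreeClassifier is only a stand-in for C4.5, and the LEARNERS mapping is an illustrative assumption.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One model factory per group, keyed by the letters used in the results slides.
LEARNERS = {
    "N": lambda: GaussianNB(),              # Naive Bayes
    "J": lambda: DecisionTreeClassifier(),  # stand-in for C4.5
    "L": lambda: SVC(kernel="linear"),      # linear SVM
    "R": lambda: SVC(kernel="rbf"),         # RBF-kernel SVM
}
```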
Example

Read 1000 instances into the buffer and build 4 models; repeat on the next 3 buffers – this leaves level one full
Read the next 1000 instances and build 4 more models
Using the buffer data as test data, evaluate all models at level 1 – promote the best 4 (groupwise) to level 2 and replace the worst 4 with the new ones
Continue in this manner (at 12,000 instances level 2 will be full)
Note: levels 1 & 2 always have 4 models from each of the 4 groups
From level 2, ONLY promote the best (this is what adapts to the source) – a sketch of one promotion cycle follows
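A minimal sketch of one level-one promotion/replacement cycle, again building on the hypothetical sketches above. The scoring function, the .group tag on models and promotion_cycle itself are assumptions; the slides do not spell out tie-breaking or the exact bookkeeping.

```python
def accuracy(model, test_data):
    return sum(model.predict([x])[0] == y for x, y in test_data) / len(test_data)

def promotion_cycle(hierarchy, new_models, test_data, groups=("N", "J", "L", "R")):
    """Evaluate level one on the buffer, promote the best model of each
    algorithm group to level two, delete the worst four overall, and
    insert the four newly built models."""
    level1 = sorted(hierarchy.levels[0], key=lambda m: accuracy(m, test_data))
    promoted = []
    for g in groups:
        candidates = [m for m in level1 if m.group == g]  # assumes a .group tag
        if candidates:
            promoted.append(candidates[-1])               # best of this group
    hierarchy.levels[1].extend(promoted)
    remaining = [m for m in level1 if m not in promoted]  # sorted worst first
    hierarchy.levels[0] = remaining[4:] + list(new_models)
```

With 16 slots this leaves level one holding 12 models and 4 free slots, which matches the snapshots on the results slides.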
Example Contd

Four models are promoted to level 3 and the 4 worst are deleted, freeing 8 spaces
Levels 3, 4 & 5 work on the same basis; level 6 simply deletes the worst 4 models to free up space
At prediction time all models have an equal vote and the classification decision is arrived at by majority
Results

Each row below shows one level of the hierarchy (16 slots); the letters identify the algorithm that built each model (N, J, L, R as above) and “-” marks an empty slot.

after 1000 instances:
N---J---L---R---
after 2000 instances:
NN--JJ--LL--RR--
after 3000 instances:
NNN-JJJ-LLL-RRR-
after 4000 instances:
NNNNJJJJLLLLRRRR
after 5000 instances:
N-NNJJ-JL-LLR-RR
NJLR------------
after 6000 instances:
NNNNJJJJLLLLRRRR
NJLR------------
after 7000 instances:
NNN-J-JJL-LLR-RR
NJLRNJLR--------
after 8000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLR--------
Continued

after 9000 instances:
NN-N-JJJLL-L-RRR
NJLRNJLRNJLR----
after 10000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLRNJLR----
after 11000 instances:
-NNN-JJJLLL-RR-R
NJLRNJLRNJLRNJLR
after 12000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLRNJLRNJLR
after 13000 instances:
NN-NJJ-JLL-LRR-R
NNLJNLLRN-L-N-L
JJJJ-----------
Model Adaptation to Source

random tree
N-NN-JJJLL-LR-RR
NNNNLNNLJLRNNJLR
JJJJJNNLJJ---JL
JJJJJJJJJJJJJJJJ
JJJJ--JJ-JJJJJ-J
JJJJJJJJJJJJJJJJ
random naive Bayes
-NNNJJ-JLLL-R-RR
NJJJLLLLRNJJJLLR
NNNNNJJNJJ-NN--
NNNNNNNNNNNNNNNN
NNNNNNNN---NNNN
NNNNNNNNNNNNNNNN
Continued

random RBF
NNN--JJJLL-LRR-R
NNJJLRNJNNJJNJLR
RRJRJJRRRRJRRJJJ
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
covertype
N-NN-JJJ-LLLR-RR
NNJJJRJLNRJ---R
LJJLLJLJLJL--N-
LLLLLLLLLLLLJLLJ
LLLLLLLLJL-J-L-
LLLLLLLLLLLLLLLL
Learning Curve – Random Tree
[learning-curve plot]

Learning Curve – CoverType
[learning-curve plot]
Conclusion

Novel architecture for data streams
Operates much like a web cache (hierarchy and replacement policy)
Provides a scaling-up mechanism for arbitrary classifiers (batch-incremental)
Can be adapted to clustering, regression and association rule learning
Thousands of options still to explore!