Cache Hierarchy Inspired Compression: A Novel Architecture for Data Streams
Geoffrey Holmes, Bernhard Pfahringer and Richard Kirkby
Department of Computer Science, University of Waikato, New Zealand
Outline
Traditional machine learning model
Data stream requirements
Scaling up existing methods
Network caching
CHIC architecture
Experimental Design
Results
Discussion
Further Work
Traditional Machine Learning
Create train/test splits of the data (possibly via cross-validation)
Load ALL the training data into main memory
Compute a model from the training data (may involve multiple passes)
Load all the test data into main memory
Compute accuracy measures from the test data
Consequences
Very few methods process data one instance at a time (incremental learning), and none are used seriously in practice
Many existing techniques don’t scale
Machine Learning is perceived as a small- to medium-sized data set field of study
Larger data sets are tackled through sampling, or by building several models on separate portions and combining their predictions
Data Streams
Takes the “stream” view; the source may be finite or infinite
Concept of train/test is less well defined – could train for a while then test for a while, but what is the definition of “a while”?
Whatever you do, you can be sure that ALL the data will NOT fit in main memory
Data Stream Constraints
Cannot store instances (not all, anyway)
Cannot use more than available memory – no swapping to disk
Cannot dwell too long over an instance – must keep up with the incoming rate of instances
Cannot wait to make predictions – need to be ready to make predictions at any time
Scaling up existing methods
Could learn models using existing methods in batches and then merge the models
Could merge instances (meta-instances)
Could use a cache model where we keep a set of models and update the cache over time – e.g. least-recently-used or least-frequently-used strategies
Could do the above but use performance measures to decide the make-up of the cache (see the sketch below)
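A minimal sketch of that last idea, assuming scikit-learn-style models with a score(X, y) method; the name ModelCache and its details are illustrative, not a published design:

    from operator import itemgetter

    class ModelCache:
        """Fixed-size cache of models, evicted by performance score
        (cf. LRU/LFU, which would evict by recency/frequency)."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = []                      # (score, model) pairs

        def rescore(self, X, y):
            # Re-evaluate every cached model on the most recent batch.
            self.entries = [(m.score(X, y), m) for _, m in self.entries]

        def add(self, model, score):
            # When full, the worst scorer makes way for the newcomer.
            if len(self.entries) >= self.capacity:
                self.entries.sort(key=itemgetter(0))
                self.entries.pop(0)                # evict lowest scorer
            self.entries.append((score, model))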
Caching in Data Communications
Web proxy caches provide a good model for what we need to satisfy stream constraints
Real caches are hierarchical (e.g. Squid)
The hierarchy provides a mechanism for sharing the load and increasing the likelihood of a page hit
When full, a cache needs a replacement policy
To replicate this system we need to design a hierarchy, fill it (with models) and implement a model replacement policy
General CHIC Architecture
Idea: build a hierarchy of N levels as follows:
Level zero: data buffer fed from the stream
Level one: build models from the data at level zero
Levels two to N-1: fill with the “best” models from lower levels
Level N: adopt models from level N-1, but also discard models so that new ones can be entered
For prediction, use all models in the hierarchy and vote (see the sketch below)
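A minimal sketch of this structure, assuming each model exposes a predict(x) method returning a single class label; the class name CHIC and its fields are shorthand for the description above, not the authors' code:

    from collections import Counter

    class CHIC:
        """N levels of fixed-size model slots; levels[0] is level one
        (built from the level-zero data buffer), higher levels hold
        promoted models."""

        def __init__(self, num_levels, level_size):
            self.levels = [[] for _ in range(num_levels)]
            self.level_size = level_size

        def all_models(self):
            return [m for level in self.levels for m in level]

        def predict(self, x):
            # Every model currently in the hierarchy gets one equal vote.
            votes = Counter(m.predict(x) for m in self.all_models())
            return votes.most_common(1)[0][0]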
Features
Can use any machine learning algorithm to build models
Implements a form of batch-incremental learning
Replacement policy can be performance based
As with the web cache, CHIC fills up initially and then keeps the best-performing models
If a variety of models is used at the lower levels then it is possible to adapt to the source of the data
Experimental Design
Try to demonstrate adaptation to the data source
Learn a mixture of models at levels one and two and let the performance-based promotion mechanism take over
Evaluate: two issues
Need a performance measure (for model promotion/deletion)
Overall performance of the hierarchy (adopt a “first test, then train” approach – see the sketch below)
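A sketch of the “first test, then train” loop against the CHIC sketch above; train_on_buffer is an assumed helper standing in for the build/promote cycle described later:

    def test_then_train(chic, stream, buffer_size=1000):
        """Each buffer tests the hierarchy BEFORE it is used for
        learning, so no model is ever tested on data it has seen."""
        correct = total = 0
        buffer = []
        for x, y in stream:
            buffer.append((x, y))
            if len(buffer) == buffer_size:
                if chic.all_models():                      # test first ...
                    correct += sum(chic.predict(xi) == yi
                                   for xi, yi in buffer)
                    total += len(buffer)
                chic.train_on_buffer(buffer)  # ... then train (assumed helper)
                buffer = []
        return correct / total if total else 0.0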
Data Sources
Random Naïve Bayes
Weight attribute labels and class values (here we use 5 classes, 5 nominal and 5 numeric attributes)
Random Tree
Choose a depth and randomly assign splitter nodes – here 5 nodes deep, leaves starting from depth 3, same number of attributes/classes as above
Random RBF
Random set of centers for each class; each center is weighted and has a standard deviation – here 50 centers, 10 numeric attributes and 5 classes (see the sketch below)
Real data
Forest covertype (UCI repository), 7 classes, 500K instances
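As a concrete illustration of the Random RBF source under the slide's stated parameters (50 centres, 10 numeric attributes, 5 classes); the per-centre standard-deviation range is an assumption:

    import numpy as np

    rng = np.random.default_rng(1)
    n_centres, n_attrs, n_classes = 50, 10, 5
    centres  = rng.uniform(0, 1, size=(n_centres, n_attrs))
    labels   = rng.integers(0, n_classes, size=n_centres)  # class per centre
    weights  = rng.uniform(0, 1, size=n_centres)
    weights /= weights.sum()                               # centre weights
    stddevs  = rng.uniform(0.0, 0.1, size=n_centres)       # assumed range

    def next_instance():
        c = rng.choice(n_centres, p=weights)   # weighted centre choice
        x = centres[c] + rng.normal(0.0, stddevs[c], size=n_attrs)
        return x, labels[c]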
Specific CHIC Architecture
Six levels with 16 models per level
Data buffer of size 1000 instances
First level uses four algorithms to generate models: Naïve Bayes (N), C4.5 (J), linear SVM (L), RBF-kernel SVM (R) – see the stand-ins sketched below
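For illustration, the four model groups with scikit-learn stand-ins (the original presumably used its own implementations; the “J” tag for C4.5 suggests WEKA's J48, but that is a guess). The letters match the cache displays in the Results slides:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier   # stand-in for C4.5
    from sklearn.svm import LinearSVC, SVC

    def make_models():
        return {"N": GaussianNB(),
                "J": DecisionTreeClassifier(),        # C4.5-like tree
                "L": LinearSVC(),
                "R": SVC(kernel="rbf")}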
Example
Read 1000 instances into the buffer and build 4 models; repeat on the next 3 buffers – level one is now full
Read the next 1000 instances and build 4 more models
Using the buffer data as test data, evaluate all models at level one – promote the best 4 (groupwise) to level two and replace the worst 4 with the new ones (see the sketch below)
Continue in this manner (at 12,000 instances level two will be full)
Note: levels 1 and 2 always hold models from all 4 groups
From level 2 ONLY promote the best (adapt to the source)
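One cycle of this example, sketched over the CHIC structure above; learners is a list of zero-argument model factories (e.g. the classes from the earlier sketch), and evaluate (accuracy of a model on the buffer) plus the groupwise bookkeeping are simplified assumptions:

    def process_buffer(chic, X, y, learners, evaluate):
        """Build 4 new models from the 1000-instance buffer. While
        level one has room, just add them; once full, score level one
        on the buffer, move the best four up to level two, delete the
        worst four, and insert the new models in the freed slots."""
        new_models = [make().fit(X, y) for make in learners]
        level1 = chic.levels[0]
        if len(level1) + len(new_models) <= chic.level_size:
            level1.extend(new_models)               # still filling up
            return
        ranked = sorted(level1, key=lambda m: evaluate(m, X, y))
        chic.levels[1].extend(ranked[-4:])          # promote best four
        chic.levels[0] = ranked[4:-4] + new_models  # worst four deleted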
Example Contd
Four models are promoted to level 3 and the 4 worst are deleted, freeing 8 spaces
Levels 3, 4 and 5 work on the same basis; level 6 simply deletes the worst 4 models to free up space (see the sketch below)
At prediction time all models have an equal vote and the classification decision is arrived at by majority
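The upper levels sketched the same way; again evaluate is an assumed scoring helper, and the “promote four, delete four” rule follows the slide:

    def maintain_upper_levels(chic, X, y, evaluate):
        """Levels 2..5 promote their best four and delete their worst
        four when full (freeing 8 slots); the top level cannot promote,
        so it simply deletes its worst four to make room."""
        top = len(chic.levels) - 1
        for i in range(1, top + 1):
            level = chic.levels[i]
            if len(level) < chic.level_size:
                continue
            ranked = sorted(level, key=lambda m: evaluate(m, X, y))
            if i < top:
                chic.levels[i + 1].extend(ranked[-4:])  # best four move up
                chic.levels[i] = ranked[4:-4]           # worst four deleted
            else:
                chic.levels[i] = ranked[4:]   # top level: delete worst four

Majority-vote prediction over every surviving model is the predict method in the CHIC sketch earlier.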
Results
after 1000 instances:
N---J---L---R---
after 2000 instances:
NN--JJ--LL--RR--
after 3000 instances:
NNN-JJJ-LLL-RRR-
after 4000 instances:
NNNNJJJJLLLLRRRR
after 5000 instances:
N-NNJJ-JL-LLR-RR
NJLR------------
after 6000 instances:
NNNNJJJJLLLLRRRR
NJLR------------
after 7000 instances:
NNN-J-JJL-LLR-RR
NJLRNJLR--------
after 8000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLR--------
Continued
after 9000 instances:
NN-N-JJJLL-L-RRR
NJLRNJLRNJLR----
after 10000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLRNJLR----
after 11000 instances:
-NNN-JJJLLL-RR-R
NJLRNJLRNJLRNJLR
after 12000 instances:
NNNNJJJJLLLLRRRR
NJLRNJLRNJLRNJLR
after 13000 instances:
NN-NJJ-JLL-LRR-R
NNLJNLLRN-L-N-L-
JJJJ------------
Model Adaptation to Source
random tree
N-NN-JJJLL-LR-RR
NNNNLNNLJLRNNJLR
JJJJJNNLJJ---JL
JJJJJJJJJJJJJJJJ
JJJJ--JJ-JJJJJ-J
JJJJJJJJJJJJJJJJ
random naive Bayes
-NNNJJ-JLLL-R-RR
NJJJLLLLRNJJJLLR
NNNNNJJNJJ-NN--
NNNNNNNNNNNNNNNN
NNNNNNNN---NNNN
NNNNNNNNNNNNNNNN
Continued
random RBF
NNN--JJJLL-LRR-R
NNJJLRNJNNJJNJLR
RRJRJJRRRRJRRJJJ
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
RRRRRRRRRRRRRRRR
covertype
N-NN-JJJ-LLLR-RR
NNJJJRJLNRJ---R
LJJLLJLJLJL--N-L
LLLLLLLLLLLJLLJ
LLLLLLLLJL-J-L-LLLLLLLLLLLLLLLL
Learning Curve – Random Tree
Learning Curve - CoverType
Conclusion
Novel architecture for data streams
Operates much like a web cache (hierarchy and replacement policy)
Provides a scaling-up mechanism for arbitrary classifiers (batch-incremental)
Can be adapted to clustering, regression and association rule learning
Thousands of options still to explore!