Scaling Decision Tree Induction

Download Report

Transcript Scaling Decision Tree Induction

Mining Decision Trees from
Data Streams
Tong Suk Man Ivy
CSIS DB Seminar
February 12, 2003
1
Contents
 Introduction: problems in mining data streams
 Classification of stream data
 VFDT algorithm
 Window approach
 CVFDT algorithm
 Experimental results
 Conclusions
 Future work
2
Data Streams
 Characteristics
 Large volume of ordered data points, possibly infinite
 Arrive continuously
 Fast changing
 Appropriate model for many applications:
 Phone call records
 Network and security monitoring
 Financial applications (stock exchange)
 Sensor networks
3
Problems in Mining Data Streams
 Traditional data mining techniques usually
require
 Entire data set to be present
 Random access (or multiple passes) to the data
 Much time per data item
 Challenges of stream mining
 Impractical to store the whole data
 Random access is expensive
 Simple calculation per data due to time and space
constraints
4
Classification of Stream Data
 VFDT algorithm
 “Mining High-Speed Data Streams”, KDD 2000.
Pedro Domingos, Geoff Hulten
 CVFDT algorithm (window approach)
 “Mining Time-Changing Data Streams”, KDD 2001.
Geoff Hulten, Laurie Spencer, Pedro Domingos
5
Hoeffding Trees
6
Definitions
 A classification problem is defined as:
 N is a set of training examples of the form (x, y)
 x is a vector of d attributes
 y is a discrete class label
 Goal: To produce from the examples a model
y=f(x) that predict the classes y for future
examples x with high accuracy
7
Decision Tree Learning
 One of the most effective and
widely-used classification
methods
 Induce models in the form of
decision trees
 Each node contains a test on the
attribute
 Each branch from a node
corresponds to a possible outcome
of the test
 Each leaf contains a class prediction
 A decision tree is learned by
recursively replacing leaves by test
nodes, starting at the root
Age<30?
Yes
No
Car Type=
Sports Car?
Yes
Yes
No
No
8
Challenges
 Classic decision tree learners assume all training
data can be simultaneously stored in main
memory
 Disk-based decision tree learners repeatedly
read training data from disk sequentially
 Prohibitively expensive when learning complex trees
 Goal: design decision tree learners that read
each example at most once, and use a small
constant time to process it
9
Key Observation
 In order to find the best attribute at a node, it may be
sufficient to consider only a small subset of the training
examples that pass through that node.
 Given a stream of examples, use the first ones to choose the
root attribute.
 Once the root attribute is chosen, the successive examples
are passed down to the corresponding leaves, and used to
choose the attribute there, and so on recursively.
 Use Hoeffding bound to decide how many examples are
enough at each node
10
Hoeffding Bound
 Consider a random variable a whose range is R
 Suppose we have n observations of a
_
 Mean: a
 Hoeffding bound states:
With probability 1- , the true mean of a is at least
_
a   , where
R 2 ln( 1 /  )

2n
11
How many examples are enough?
 Let G(Xi) be the heuristic measure used to choose test
attributes (e.g. Information Gain, Gini Index)
 Xa : the attribute with the highest attribute evaluation
value after seeing n examples.
 Xb : the attribute with the second highest split
evaluation function value after seeing n examples.
 Given a desired , if G  G( X a )  G( X b )  
after
seeing n examples at a node,
 Hoeffding bound guarantees the true G  G    0
probability 1-.
, with
 This node can be split using Xa, the succeeding examples will
be passed to the new leaves.
R 2 ln( 1 /  )

2n
12
Algorithm
 Calculate the information gain for the attributes and
determines the best two attributes
 Pre-pruning: consider a “null” attribute that consists of not
splitting the node
 At each node, check for the condition
G  G( X a )  G( X b )  
 If condition satisfied, create child nodes based on the
test at the node
 If not, stream in more examples and perform
calculations till condition satisfied
13
Age<30?
Yes
Data Stream
No
Yes
_
_
G(Car Type) - G(Gender)  
Age<30?
Yes
No
Car Type=
Sports Car?
Yes
Data Stream
Yes
No
No
14
Performance Analysis
 p: probability that an example passed through DT
to level i will fall into a leaf at that point
 The expected disagreement between the tree
produced by Hoeffding tree algorithm and that
produced using infinite examples at each node is
no greater than  /p.
 Required memory: O(leaves * attributes * values
* classes)
15
VFDT
16
VFDT (Very Fast Decision Tree)
 A decision-tree learning system based on the Hoeffding tree
algorithm
 Split on the current best attribute, if the difference is less than a
user-specified threshold
 Wasteful to decide between identical attributes
 Compute G and check for split periodically
 Memory management
 Memory dominated by sufficient statistics
 Deactivate or drop less promising leaves when needed
 Bootstrap with traditional learner
 Rescan old data when time available
17
VFDT(2)
 Scales better than pure memory-based or pure
disk-based learners
 Access data sequentially
 Use subsampling to potentially require much less than
one scan
 VFDT is incremental and anytime
 New examples can be quickly incorporated as they
arrive
 A usable model is available after the first few examples
and then progressively defined
18
Experiment Results (VFDT vs. C4.5)
 Compared VFDT and C4.5 (Quinlan, 1993)
 Same memory limit for both (40 MB)
 100k examples for C4.5
 VFDT settings: δ= 10-7, τ= 5%, nmin=200
 Domains: 2 classes, 100 binary attributes
 Fifteen synthetic trees 2.2k – 500k leaves
 Noise from 0% to 30%
19
Experiment Results
Accuracy as a function of the number of training examples
20
Experiment Results
Tree size as a function of number of training examples
21
Mining Time-Changing Data Stream
 Most KDD systems, include VFDT, assume training data is
a sample drawn from stationary distribution
 Most large databases or data streams violate this
assumption
 Concept Drift: data is generated by a time-changing
concept function, e.g.
 Seasonal effects
 Economic cycles
 Goal:
 Mining continuously changing data streams
 Scale well
22
Window Approach
 Common Approach: when a new example
arrives, reapply a traditional learner to a sliding
window of w most recent examples
 Sensitive to window size
 If w is small relative to the concept shift rate, assure the
availability of a model reflecting the current concept
 Too small w may lead to insufficient examples to learn
the concept
 If examples arrive at a rapid rate or the concept
changes quickly, the computational cost of reapplying a
learner may be prohibitively high.
23
CVFDT
24
CVFDT
 CVFDT (Concept-adapting Very Fast Decision
Tree learner)
 Extend VFDT
 Maintain VFDT’s speed and accuracy
 Detect and respond to changes in the examplegenerating process
25
Observations
 With a time-changing concept, the current
splitting attribute of some nodes may not be the
best any more.
 An outdated subtree may still be better than the
best single leaf, particularly if it is near the root.
 Grow an alternative subtree with the new best attribute
at its root, when the old attribute seems out-of-date.
 Periodically use a bunch of samples to evaluate
qualities of trees.
 Replace the old subtree when the alternate one
becomes more accurate.
26
CVFDT algorithm
 Alternate trees for each node in HT start as empty.
 Process examples from the stream indefinitely. For
each example (x, y),
 Pass (x, y) down to a set of leaves using HT and all
alternate trees of the nodes (x, y) passes through.
 Add (x, y) to the sliding window of examples.
 Remove and forget the effect of the oldest examples, if the
sliding window overflows.
 CVFDTGrow
 CheckSplitValidity if f examples seen since last checking of
alternate trees.
 Return HT.
27
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
28
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
29
CVFDTGrow
 For each node reached by the example in HT,
 Increment the corresponding statistics at the node.
 For each alternate tree Talt of the node,
 CVFDTGrow
 If enough examples seen at the leaf in HT which
the example reaches,
 Choose the attribute that has the highest average
value of the attribute evaluation measure (information
gain or gini index).
 If the best attribute is not the “null” attribute, create a
node for each possible value of this attribute
30
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
31
Forget old example
 Maintain the sufficient statistics at every node in HT to monitor
the validity of its previous decisions.
 VFDT only maintain such statistics at leaves.
 HT might have grown or changed since the example was initially
incorporated.
 Assigned each node a unique, monotonically increasing ID
as they are created.
 forgetExample (HT, example, maxID)
 For each node reached by the old example with node ID no
larger than the max leave ID the example reaches,
 Decrement the corresponding statistics at the node.
 For each alternate tree Talt of the node, forget(Talt, example, maxID).
32
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
33
CheckSplitValidtiy
 Periodically scans the internal nodes of HT.
 Start a new alternate tree when a new winning
attribute is found.
 Tighter criteria to avoid excessive alternate tree
creation.
 Limit the total number of alternate trees.
34
Smoothly adjust to concept drift
 Alternate trees are grown
the same way HT is.
 Periodically each node
with non-empty alternate
trees enter a testing
Yes
mode.
 M training examples to
compare accuracy.
 Prune alternate trees
with non-increasing
accuracy over time.
 Replace if an alternate
tree is more accurate.
Yes
Age<30?
Yes
Married?
No
No
No
Car Type=
Sports Car?
Yes
Yes
No
Experience
<1 year?
No
Yes
No
No
Yes
35
Adjust to concept drift(2)
 Dynamically change the window size
 Shrink the window when many nodes gets
questionable or data rate changes rapidly.
 Increase the window size when few nodes are
questionable.
36
Performance
 Require memory O(nodes * attributes * attribute
values * classes).
 Independent of the total number of examples.
 Running time O(Lc * attributes * attribute values *
number of classes).
 Lc : the longest length an example passes through * number
of alternate trees.
 Model learned by CVFDT vs. the one learned by
VFDT-Window:
 Similar in accuracy
 O(1) vs. O(window size) per new example.
37
Experiment Results




Compare CVFDT, VFDT, VFDT-Window
5 million training examples
Concept changed at every 50k examples
Drift Level: average percentage of the test points that
changes label at each concept change.
 About 8% of test points change label each drift
 100,000 examples in window
 5% noise
 Test the model every 10k examples throughout the run,
averaged these results.
38
Experiment Results (CVFDT vs. VFDT)
drift level
Error rate as a function of number of attributes
39
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of number of attributes
40
Experiment Results (CVFDT vs.
VFDT)
Portion of data set
that is labelled -ve
Error rates of learners as a function of the number
of examples seen
41
Experiment Results (CVFDT vs.
VFDT)
Error rates as a function of the amount of concept drift
42
Experiment Results
CVFDT’s drift characteristics
43
Experiment Results (CVFDT vs.
VFDT vs. VFDT-window)
Stimulated by
running VFDT
on W for every
100K examples
instead of every
example
Error Rate:
VFDT: 19.4%
CVFDT: 16.3%
VFDT-Window: 15.3%
Running Time:
VFDT: 10 minutes
CVFDT: 46 minutes
VFDT-Window: expect 548 days
Error rates over time of CVFDT, VFDT, and VFDT-window
44
Experiment Results
 CVFDT not use too much RAM
 D=50, CVFDT never uses more than 70MB
 Use as little as half the RAM of VFDT
 VFDT often had twice as many leaves as the
number of nodes in CVFDT’s HT and alternate
subtrees combined
 Reason: VFDT considers many more outdated
examples and is forced to grow larger trees to
make up for its earlier wrong decisions due to
concept drift
45
Conclusions
 CVFDT – a decision-tree induction system
capable of learning accurate models from high
speed, concept-drifting data streams
 Grow an alternative subtree whenever an old one
becomes questionable
 Replace the old subtree when the new more
accurate
 Similar in accuracy to applying VFDT to a moving
window of examples
46
Future Work
 Concepts changed periodically and removed
subtrees may become useful again
 Comparisons with related systems
 Continuous attributes
 Weighting examples
47
Reference List
 P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of
the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2000.
 G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In
Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2001
 V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring
evolving data. In Proceedings of the Sixteenth International Conference on Data
Engineering, 2000.
 J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L. Loh. BOAT: optimistic decision
tree construction. In Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data, 1999.
48
The end
Q&A
49
Thank You!
50