CVFDT algorithm
Download
Report
Transcript CVFDT algorithm
Mining Decision Trees from
Data Streams
Thanks: Tong Suk Man Ivy
HKU
1
Contents
Introduction: problems in mining data streams
Classification of stream data
VFDT algorithm
Window approach
CVFDT algorithm
Experimental results
Conclusions
Future work
2
Data Streams
Characteristics
Large volume of ordered data points, possibly infinite
Arrive continuously
Fast changing
Appropriate model for many applications:
Phone call records
Network and security monitoring
Financial applications (stock exchange)
Sensor networks
3
Problems in Mining Data Streams
Traditional data mining techniques usually
require
Entire data set to be present
Random access (or multiple passes) to the data
Much time per data item
Challenges of stream mining
Impractical to store the whole data
Random access is expensive
Simple calculation per data due to time and space
constraints
4
Processing Data Streams:
Motivation
A growing number of applications generate streams of data
Performance measurements in network monitoring and traffic
management
Call detail records in telecommunications
Transactions in retail chains, ATM operations in banks
Log records generated by Web Servers
Sensor network data
Application characteristics
Massive volumes of data (several terabytes)
Records arrive at a rapid rate
Goal: Mine patterns, process queries and compute statistics on data
streams in real-time
(from VLDB’02 Tutorial)
5
Data Streams: Computation Model
Data Streams
Synopsis in Memory
Stream
Processing
Engine
e1 ,..., en
(Approximate)
Answer
A data stream is a (massive) sequence of elements:
Stream processing requirements
Single pass: Each record is examined at most once
Bounded storage: Limited Memory (M) for storing synopsis
Real-time: Per record processing time (to maintain synopsis) must be low
6
Network Management Application
Network Management involves monitoring and configuring
network hardware and software to ensure smooth operation
Monitor link bandwidth usage, estimate traffic demands
Quickly detect faults, congestion and isolate root cause
Load balancing, improve utilization of network resources
Measurements
Alarms
Network Operations
Center
Network
(from VLDB’02 Tutorial)
7
IP Network Measurement Data
IP session data (collected using Cisco NetFlow)
Source
10.1.0.2
18.6.7.1
13.9.4.3
15.2.2.9
12.4.3.8
10.5.1.3
11.1.0.6
19.7.1.2
Destination
16.2.3.7
12.4.0.3
11.6.8.2
17.1.2.1
14.8.7.4
13.0.0.1
10.3.4.5
16.5.5.8
Duration
12
16
15
19
26
27
32
18
Bytes
20K
24K
20K
40K
58K
100K
300K
80K
Protocol
http
http
http
http
http
ftp
ftp
ftp
AT&T collects 100 GBs of NetFlow data each day!
(from VLDB’02 Tutorial)
8
(from VLDB’02 Tutorial)
Network Data Processing
Traffic estimation
How many bytes were sent between a pair of IP addresses?
What fraction network IP addresses are active?
List the top 100 IP addresses in terms of traffic
Traffic analysis
What is the average duration of an IP session?
What is the median of the number of bytes in each IP session?
Fraud detection
List all sessions that transmitted more than 1000 bytes
Identify all sessions whose duration was more than twice the normal
Security/Denial of Service
List all IP addresses that have witnessed a sudden spike in traffic
Identify IP addresses involved in more than 1000 sessions
9
Data Stream Processing
Algorithms
Generally, algorithms compute approximate answers
Difficult to compute answers accurately with limited memory
Approximate answers - Deterministic bounds
Algorithms only compute an approximate answer, but bounds on
error
Approximate answers - Probabilistic bounds
Algorithms compute an approximate answer with high probability
With probability at least
of the actual
answer
1,the
computed answer is within a factor
Single-pass algorithms for processing streams also
applicable to (massive) terabyte databases!
(from VLDB’02 Tutorial)
10
Classification of Stream Data
VFDT algorithm
“Mining High-Speed Data Streams”, KDD 2000.
Pedro Domingos, Geoff Hulten
CVFDT algorithm (window approach)
“Mining Time-Changing Data Streams”, KDD 2001.
Geoff Hulten, Laurie Spencer, Pedro Domingos
11
Hoeffding Trees
12
Definitions
A classification problem is defined as:
N is a set of training examples of the form (x, y)
x is a vector of d attributes
y is a discrete class label
Goal: To produce from the examples a model
y=f(x) that predict the classes y for future
examples x with high accuracy
13
Decision Tree Learning
One of the most effective and
widely-used classification
methods
Induce models in the form of
decision trees
Each node contains a test on the
attribute
Each branch from a node
corresponds to a possible outcome
of the test
Each leaf contains a class prediction
A decision tree is learned by
recursively replacing leaves by test
nodes, starting at the root
Age<30?
Yes
No
Car Type=
Sports Car?
Yes
Yes
No
No
14
Challenges
Classic decision tree learners assume all training
data can be simultaneously stored in main
memory
Disk-based decision tree learners repeatedly
read training data from disk sequentially
Prohibitively expensive when learning complex trees
Goal: design decision tree learners that read
each example at most once, and use a small
constant time to process it
15
Key Observation
In order to find the best attribute at a node, it may be
sufficient to consider only a small subset of the training
examples that pass through that node.
Given a stream of examples, use the first ones to choose the
root attribute.
Once the root attribute is chosen, the successive examples
are passed down to the corresponding leaves, and used to
choose the attribute there, and so on recursively.
Use Hoeffding bound to decide how many examples are
enough at each node
16
Hoeffding Bound
Consider a random variable a whose range is R
Suppose we have n observations of a
_
Mean: a
Hoeffding bound states:
With probability 1- , the true mean of a is at least
_
a , where
R 2 ln( 1 / )
2n
17
How many examples are enough?
Let G(Xi) be the heuristic measure used to choose test
attributes (e.g. Information Gain, Gini Index)
Xa : the attribute with the highest attribute evaluation
value after seeing n examples.
Xb : the attribute with the second highest split
evaluation function value after seeing n examples.
Given a desired , if G G( X a ) G( X b )
after
seeing n examples at a node,
Hoeffding bound guarantees the true G G 0
probability 1-.
, with
This node can be split using Xa, the succeeding examples will
be passed to the new leaves.
R 2 ln( 1 / )
2n
18
Algorithm
Calculate the information gain for the attributes and
determines the best two attributes
Pre-pruning: consider a “null” attribute that consists of not
splitting the node
At each node, check for the condition
G G( X a ) G( X b )
If condition satisfied, create child nodes based on the
test at the node
If not, stream in more examples and perform
calculations till condition satisfied
19
Age<30?
Yes
Data Stream
No
Yes
_
_
G(Car Type) - G(Gender)
Age<30?
Yes
No
Car Type=
Sports Car?
Yes
Data Stream
Yes
No
No
20
Performance Analysis
p: probability that an example passed through DT
to level i will fall into a leaf at that point
The expected disagreement between the tree
produced by Hoeffding tree algorithm and that
produced using infinite examples at each node is
no greater than /p.
Required memory: O(leaves * attributes * values
* classes)
21
VFDT
22
VFDT (Very Fast Decision Tree)
A decision-tree learning system based on the Hoeffding tree
algorithm
Split on the current best attribute, if the difference is less than a
user-specified threshold
Wasteful to decide between identical attributes
Compute G and check for split periodically
Memory management
Memory dominated by sufficient statistics
Deactivate or drop less promising leaves when needed
Bootstrap with traditional learner
Rescan old data when time available
23
VFDT(2)
Scales better than pure memory-based or pure
disk-based learners
Access data sequentially
Use subsampling to potentially require much less than
one scan
VFDT is incremental and anytime
New examples can be quickly incorporated as they
arrive
A usable model is available after the first few examples
and then progressively defined
24
Experiment Results (VFDT vs. C4.5)
Compared VFDT and C4.5 (Quinlan, 1993)
Same memory limit for both (40 MB)
100k examples for C4.5
VFDT settings: δ= 10-7, τ= 5%, nmin=200
Domains: 2 classes, 100 binary attributes
Fifteen synthetic trees 2.2k – 500k leaves
Noise from 0% to 30%
25
Experiment Results
Accuracy as a function of the number of training examples
26
Experiment Results
Tree size as a function of number of training examples
27
Mining Time-Changing Data Stream
Most KDD systems, include VFDT, assume training data is
a sample drawn from stationary distribution
Most large databases or data streams violate this
assumption
Concept Drift: data is generated by a time-changing
concept function, e.g.
Seasonal effects
Economic cycles
Goal:
Mining continuously changing data streams
Scale well
28
Window Approach
Common Approach: when a new example arrives,
reapply a traditional learner to a sliding window of
w most recent examples
Sensitive to window size
If w is small relative to the concept shift rate, assure the
availability of a model reflecting the current concept
Too small w may lead to insufficient examples to learn
the concept
If examples arrive at a rapid rate or the concept
changes quickly, the computational cost of reapplying a
learner may be prohibitively high.
29
CVFDT
30
CVFDT
CVFDT (Concept-adapting Very Fast Decision
Tree learner)
Extend VFDT
Maintain VFDT’s speed and accuracy
Detect and respond to changes in the examplegenerating process
31
Observations
With a time-changing concept, the current
splitting attribute of some nodes may not be the
best any more.
An outdated subtree may still be better than the
best single leaf, particularly if it is near the root.
Grow an alternative subtree with the new best attribute
at its root, when the old attribute seems out-of-date.
Periodically use a bunch of samples to evaluate
qualities of trees.
Replace the old subtree when the alternate one
becomes more accurate.
32
CVFDT algorithm
Alternate trees for each node in HT start as empty.
Process examples from the stream indefinitely. For
each example (x, y),
Pass (x, y) down to a set of leaves using HT and all
alternate trees of the nodes (x, y) passes through.
Add (x, y) to the sliding window of examples.
Remove and forget the effect of the oldest examples, if the
sliding window overflows.
CVFDTGrow
CheckSplitValidity if f examples seen since last checking of
alternate trees.
Return HT.
33
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
34
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
35
CVFDTGrow
For each node reached by the example in HT,
Increment the corresponding statistics at the node.
For each alternate tree Talt of the node,
CVFDTGrow
If enough examples seen at the leaf in HT which
the example reaches,
Choose the attribute that has the highest average
value of the attribute evaluation measure (information
gain or gini index).
If the best attribute is not the “null” attribute, create a
node for each possible value of this attribute
36
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
37
Forget old example
Maintain the sufficient statistics at every node in HT to monitor
the validity of its previous decisions.
VFDT only maintain such statistics at leaves.
HT might have grown or changed since the example was initially
incorporated.
Assigned each node a unique, monotonically increasing ID
as they are created.
forgetExample (HT, example, maxID)
For each node reached by the old example with node ID no
larger than the max leave ID the example reaches,
Decrement the corresponding statistics at the node.
For each alternate tree Talt of the node, forget(Talt, example, maxID).
38
CVFDT algorithm: process each
example
Pass example down to leaves
add example to sliding window
Window overflow?
Read new example
Yes
Forget oldest example
No
CVFDTGrow
f examples since last checking?
Yes
No
CheckSplitValidty
39
CheckSplitValidtiy
Periodically scans the internal nodes of HT.
Start a new alternate tree when a new winning
attribute is found.
Tighter criteria to avoid excessive alternate tree
creation.
Limit the total number of alternate trees.
40
Smoothly adjust to concept drift
Alternate trees are grown
the same way HT is.
Periodically each node
with non-empty alternate
trees enter a testing mode.
Yes
M training examples to
compare accuracy.
Prune alternate trees
with non-increasing
accuracy over time.
Replace if an alternate
tree is more accurate.
Yes
Age<30?
Yes
Married?
No
No
No
Car Type=
Sports Car?
Yes
Yes
No
Experience
<1 year?
No
Yes
No
No
Yes
41
Adjust to concept drift(2)
Dynamically change the window size
Shrink the window when many nodes gets
questionable or data rate changes rapidly.
Increase the window size when few nodes are
questionable.
42
Performance
Require memory O(nodes * attributes * attribute
values * classes).
Independent of the total number of examples.
Running time O(Lc * attributes * attribute values *
number of classes).
Lc : the longest length an example passes through * number
of alternate trees.
Model learned by CVFDT vs. the one learned by
VFDT-Window:
Similar in accuracy
O(1) vs. O(window size) per new example.
43
Experiment Results
Compare CVFDT, VFDT, VFDT-Window
5 million training examples
Concept changed at every 50k examples
Drift Level: average percentage of the test points that
changes label at each concept change.
About 8% of test points change label each drift
100,000 examples in window
5% noise
Test the model every 10k examples throughout the run,
averaged these results.
44
Experiment Results (CVFDT vs. VFDT)
drift level
Error rate as a function of number of attributes
45
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of number of attributes
46
Experiment Results (CVFDT vs.
VFDT)
Portion of data set
that is labelled -ve
Error rates of learners as a function of the number
of examples seen
47
Experiment Results (CVFDT vs.
VFDT)
Error rates as a function of the amount of concept drift
48
Experiment Results
CVFDT’s drift characteristics
49
Experiment Results (CVFDT vs.
VFDT vs. VFDT-window)
Stimulated by
running VFDT
on W for every
100K examples
instead of every
example
Error Rate:
VFDT: 19.4%
CVFDT: 16.3%
VFDT-Window: 15.3%
Running Time:
VFDT: 10 minutes
CVFDT: 46 minutes
VFDT-Window: expect 548 days
Error rates over time of CVFDT, VFDT, and VFDT-window
50
Experiment Results
CVFDT not use too much RAM
D=50, CVFDT never uses more than 70MB
Use as little as half the RAM of VFDT
VFDT often had twice as many leaves as the
number of nodes in CVFDT’s HT and alternate
subtrees combined
Reason: VFDT considers many more outdated
examples and is forced to grow larger trees to
make up for its earlier wrong decisions due to
concept drift
51
Conclusions
CVFDT – a decision-tree induction system
capable of learning accurate models from high
speed, concept-drifting data streams
Grow an alternative subtree whenever an old one
becomes questionable
Replace the old subtree when the new more
accurate
Similar in accuracy to applying VFDT to a moving
window of examples
52
Future Work
Concepts changed periodically and removed
subtrees may become useful again
Comparisons with related systems
Continuous attributes
Weighting examples
53
Reference List
P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of
the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2000.
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In
Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2001
V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring
evolving data. In Proceedings of the Sixteenth International Conference on Data
Engineering, 2000.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L. Loh. BOAT: optimistic decision
tree construction. In Proceedings of the 1999 ACM SIGMOD International
Conference on Management of Data, 1999.
54
The end
Q&A
55
Thank You!
56