Mining High-Speed Data Streams
Pedro Domingos
Geoff Hulten
Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000)
Presented by: William Kniffin
2
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
3
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
5
Introduction – System Limitations
• Knowledge Discovery System limitations:
  o Time
  o Memory
  o Sample Size
• Traditional Systems:
  o Amount of available data is small
  o Systems use a fraction of their computational power to avoid overfitting
• Current Systems:
  o Bottlenecks are time and memory
  o The majority of sample data goes unused; underfitting issues surface
6
Introduction
• Today’s Algorithms:
  o Efficient, but cannot handle supermassive databases
  o Current data mining systems are not equipped to handle the exponential increase in data
  o New examples arrive at a higher rate than they can be mined
7
Introduction - Cont.
• Requirements for ‘Modern’ Algorithms:
  o Operate continuously and indefinitely
  o Incorporate new examples as they become available
  o Never lose potentially valuable information
  o Build a model using at most one scan of the database or dataset
  o Use only a fixed amount of main memory
  o Require small, constant time per record
  o Produce a usable model that is available at any point during the algorithm’s runtime
8
Introduction - Cont.
• What can fulfill these requirements?
  o Incremental Learning Methods, also known as:
     Online Methods
     Successive Methods
     Sequential Methods
9
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
10
Hoeffding Trees
• Classic Decision Tree Learners
  o Examples: ID3, C4.5, CART
  o Assume all examples can be stored simultaneously in main memory; examples that do not fit are lost to learning.
• Disk-based Decision Tree Learners
  o Examples: SLIQ, SPRINT
  o Assume examples are stored on disk.
  o Big datasets easily fill the disk, and errors occur when the dataset is too large to fit.
11
Hoeffding Trees - Cont.
• A typical classification problem:
  o Given: N training examples of the form (x, y)
  o y = discrete class label
  o x = vector of d attributes
  o Goal: Produce a model y = f(x) that predicts the classes y of future examples x with high accuracy.
12
Hoeffding Trees - Cont.
• Challenge: Design a decision tree learner for extremely large (potentially infinite) datasets, with high accuracy and low computational cost.
• Given a stream of examples:
  o The first ones are used to choose the root test
  o Succeeding ones pass down to the corresponding leaves
  o The best attribute is picked at each leaf
  o The process continues recursively
14
Hoeffding Trees - Cont.
• Hoeffding Bound:
  o Complicated … basically:
  o Let G(Xi) be the heuristic measure used to choose test attributes (e.g., the Gini index in CART, or information gain in C4.5)
  o Our goal is to ensure that, with high probability, the attribute chosen using n examples (where n is as small as possible) is the same one that would be chosen using infinite examples
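The "complicated" bound glossed over above is short enough to state. After n independent observations of a real-valued random variable r with range R, the Hoeffding bound guarantees that, with probability 1 − δ, the true mean of r lies within ε of the observed mean, where:

```latex
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
```

Applied to tree induction: if the observed difference between the best and second-best attribute, ΔG = G(X_a) − G(X_b), exceeds ε, then X_a really is the better split with probability 1 − δ.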
16
Hoeffding Tree Algorithm
• Inputs:
  o S : sequence of examples
  o X : set of discrete attributes
  o G(·) : split evaluation function
  o δ : desired probability of choosing the wrong attribute at any given node
• Output:
  o HT : A decision tree (Hoeffding Tree)
17
Hoeffding Tree Algorithm
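The pseudocode figure on this slide did not survive transcription. Below is a minimal Python sketch of the core induction step, simplified to a single growing leaf with discrete attributes and information gain; all class and function names here are my own, not the paper's.

```python
import math
from collections import defaultdict

def hoeffding_bound(R, delta, n):
    # With probability 1 - delta, the true mean of a variable with
    # range R is within eps of the mean of n observations.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def entropy(class_counts):
    # Entropy (in bits) of a {class: count} dictionary.
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)

class HoeffdingLeaf:
    """One growing leaf: accumulates the sufficient statistics n_ijk
    and splits once the Hoeffding bound separates the two best attributes."""

    def __init__(self, n_attrs, delta=1e-6):
        self.n_attrs = n_attrs
        self.delta = delta
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[i][v][k] = number of examples with attribute i = v, class k
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(n_attrs)]

    def add(self, x, y):
        # Update the counts for one example (x, y) sorted into this leaf.
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.counts[i][v][y] += 1

    def info_gain(self, i):
        # G(X_i) = H(class) - sum_v P(v) * H(class | X_i = v)
        cond = sum(sum(ck.values()) / self.n * entropy(ck)
                   for ck in self.counts[i].values())
        return entropy(self.class_counts) - cond

    def try_split(self):
        # Return the attribute to split on, or None if the bound cannot
        # yet distinguish the best attribute from the runner-up.
        gains = sorted(((self.info_gain(i), i) for i in range(self.n_attrs)),
                       reverse=True)
        (g1, best), (g2, _) = gains[0], gains[1]
        R = math.log2(max(2, len(self.class_counts)))  # range of info gain
        eps = hoeffding_bound(R, self.delta, self.n)
        return best if g1 - g2 > eps else None
```

A full learner would route each arriving example to its leaf, call `add`, and replace the leaf with an internal node (plus fresh child leaves) whenever `try_split` returns an attribute.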
22
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
23
The VFDT System
• Very Fast Decision Tree learner (VFDT)
• A decision tree learning system
• Based on the Hoeffding Tree algorithm
• VFDT allows the use of either information gain or the Gini index as the attribute evaluation measure
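The two evaluation measures are easy to sketch side by side; this is an illustrative implementation operating on a list of per-class counts, not VFDT's actual code:

```python
import math

def gini(counts):
    # Gini index of a class-count list: 1 - sum_k p_k^2 (0 for a pure node).
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    # Entropy in bits of a class-count list: -sum_k p_k * log2(p_k).
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

Either measure is plugged in as the G function; both are minimized (zero) on pure nodes and maximized on uniform class distributions.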
24
The VFDT System - Cont.
• Includes a number of refinements to the Hoeffding Tree algorithm:
  o Ties
  o G-Computation
  o Memory
  o Poor Attributes
  o Initialization
  o Rescans
25
The VFDT System - Ties
• Two or more attributes may have similar G’s
• A very large number of examples may then be required to decide between them with high confidence
• VFDT lets the user specify a tie threshold τ: when ΔG < ε < τ, further examples are unlikely to change the outcome, so VFDT breaks the tie and splits on the current best attribute
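The tie rule combines with the basic Hoeffding test into a single split decision; a small sketch (the function name and default τ are my own choices):

```python
import math

def hoeffding_bound(R, delta, n):
    # eps such that the true mean is within eps of the sample mean
    # of n observations with probability 1 - delta.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, delta=1e-6, R=1.0, tau=0.05):
    # Split when the best attribute is clearly better (delta_G > eps),
    # or when the race is a near-tie that more data will not resolve
    # (eps has shrunk below the user threshold tau).
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second) > eps or eps < tau
```

With few examples a 0.01 gain difference is inconclusive; with enough examples ε drops below τ and the tie is simply broken.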
26
The VFDT System - G-Computation
• The most significant part of the time cost per example is recomputing G
• Computing G for every new example is inefficient, since a single example is very unlikely to change the best attribute
• VFDT therefore lets users specify a value nmin
• nmin : the number of new examples that must accumulate at a leaf before G is recomputed
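The nmin refinement amounts to simple per-leaf bookkeeping; a hypothetical helper (not from the paper) showing the intended cadence:

```python
class LeafCounter:
    """Per-leaf bookkeeping for the nmin refinement: the expensive
    G recomputation is attempted only once every nmin examples."""

    def __init__(self, nmin=200):
        self.nmin = nmin
        self.seen_since_eval = 0

    def observe(self):
        # Called once per example sorted into this leaf.
        self.seen_since_eval += 1
        if self.seen_since_eval >= self.nmin:
            self.seen_since_eval = 0
            return True   # caller should recompute G now
        return False
```

With nmin = 200, a leaf that receives 1000 examples triggers only 5 G recomputations instead of 1000.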
27
The VFDT System - Memory
• VFDT’s memory use is dominated by the memory required to keep counts for all growing leaves.
• If the maximum available memory is reached, VFDT deactivates the least promising leaves.
• The least promising leaves are the ones with the lowest values of p_l · e_l, where p_l is the probability that an arbitrary example reaches leaf l and e_l is the observed error rate at that leaf.
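The deactivation policy is a ranking by p_l · e_l; a sketch only, with an assumed representation of leaves as (name, p_l, e_l) tuples:

```python
def deactivate_least_promising(leaves, keep):
    # Rank growing leaves by promise p_l * e_l (probability of reaching
    # the leaf times its observed error rate) and keep only the top
    # `keep`; the rest would have their count tables freed.
    ranked = sorted(leaves, key=lambda t: t[1] * t[2], reverse=True)
    active = [name for name, _, _ in ranked[:keep]]
    frozen = [name for name, _, _ in ranked[keep:]]
    return active, frozen
```

High p_l · e_l means the leaf both sees many examples and misclassifies many of them, so splitting it has the greatest potential to reduce overall error.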
28
The VFDT System - Poor Attributes
• VFDT’s memory usage is also reduced by dropping, early on, attributes that do not look promising.
• The memory used to store the corresponding counts can then be freed.
29
The VFDT System - Initialization
• VFDT can be initialized with the tree produced by a conventional RAM-based learner on a small subset of the data.
• This gives VFDT a “head start.”
30
The VFDT System - Rescans
• VFDT can rescan previously-seen examples.
• Rescans are activated if:
  o The data arrives slowly enough that time allows for rescans
  o The dataset is finite and small enough that rescanning is feasible
• With rescans, VFDT need never grow a smaller tree than one produced by other algorithms.
31
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
32
Synthetic Data Study
• Comparison of VFDT with C4.5
• Both systems were restricted to the same amount of RAM (40 MB)
• VFDT used information gain as the G function.
• The target concept was a randomly generated tree:
  o At a depth of 18, all nodes were replaced with leaves
  o Each leaf was randomly assigned a class
• A stream of training examples was then generated
  o Various levels of class and attribute noise were added
33
Synthetic Data Study - Cont.
Accuracy as a function of the number of training examples
34
Synthetic Data Study - Cont.
Tree Size as a function of the number of training examples
35
Synthetic Data Study - Cont.
Accuracy as a function of the noise level
37
Web Data - Trial Run
• Application of VFDT to mining the stream of Web page requests
• Test location: the entire University of Washington campus
• Statistics for mining 1.6 million examples:
  o VFDT took 1450 seconds to do one pass over the training data
  o 983 of those seconds were spent reading data from disk
  o C4.5 took 24 hours to mine the same 1.6 million examples
39
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
40
Conclusion - Hoeffding Trees
• A method for learning online
• Learns from the increasingly common high-volume data streams
• Allows learning in very small constant time per example
• Offers strong guarantees of asymptotic similarity to the corresponding batch-learned trees.
41
Conclusion - VFDT Systems
• A high-performance data mining system
• Based on Hoeffding trees
• Empirical studies show its effectiveness in taking advantage of massive numbers of examples
• Practical, efficient, and accurate.
42
Outline
→ Introduction
→ Hoeffding Trees
→ The VFDT System
→ Performance Study
→ Conclusion / Summary
→ Test Questions
43
Test Questions - 1 of 3
• Question: Name four of the challenges that modern algorithms have to overcome today.
• Answer: See Slide 7. Any four of:
  o Operate continuously and indefinitely
  o Incorporate new examples as they become available
  o Never lose potentially valuable information
  o Build a model using at most one scan of the database or dataset
  o Use only a fixed amount of main memory
  o Require small, constant time per record
  o Produce a usable model that is available at any point during the algorithm’s runtime
44
Test Questions - 2 of 3
• Question: List the input requirements of the HT algorithm, and state what output is generated.
• Answer: See Slide 16.
  o Inputs:
     S : sequence of examples
     X : set of discrete attributes
     G(·) : split evaluation function
     δ : desired probability of choosing the wrong attribute at any given node
  o Output:
     HT : A decision tree (Hoeffding Tree)
45
Test Questions - 3 of 3
• Question: What does VFDT stand for and when does the algorithm rescan?
• Answer: See Slides 23 and 30. VFDT stands for Very Fast Decision Tree learner.
  o Rescans are activated if:
     The data arrives slowly enough that time allows for rescans
     The dataset is finite and small enough that rescanning is feasible