MINING HIGH-SPEED DATA STREAMS
Presented by:
Yumou Wang
Dongyun Zhang
Hao Zhou
INTRODUCTION
The world's information is doubling every two years.
From 2006 to 2011, the amount of information grew by a factor of 9 in just five years.
INTRODUCTION
By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers".
However, the IT staff available to manage it will grow by less than a factor of 1.5.
Current algorithms can handle only small amounts of data, often less than a single day's worth for many applications, for example banks and telecommunication companies.
INTRODUCTION
Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses.
Today, dealing with these huge amounts of data in a responsible way is very important.
Mining these continuous data streams brings unique opportunities, but also new challenges.
BACKGROUND
Design criteria for mining high-speed data streams:
It must be able to build a model using at most one scan of the data.
It must use only a fixed amount of main memory.
It must require small constant time per record.
BACKGROUND
Usually, incremental KDD systems process these examples as they arrive.
Shortcomings: the learned models are highly sensitive to example ordering compared with the batch model.
Other systems can produce the same model as the batch version, but are much slower.
CLASSIFICATION METHOD
Input:
Examples of the form (x, y), where y is the class label and x is the vector of attributes.
Output:
A model y = f(x) that predicts the classes y of future examples x with high accuracy.
DECISION TREE
One of the most effective and widely used classification methods.
A decision tree is a decision support tool that uses a tree-like graph or model.
Decision trees are commonly used in machine learning.
BUILDING A DECISION TREE
1. Start at the root.
2. Test all the attributes and choose the best one according to some heuristic measure.
3. Split the node into branches and leaves.
4. Recursively replace leaves by test nodes (a minimal sketch follows).
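The steps above can be made concrete with a small ID3-style sketch. This is illustrative only: the names build_tree and info_gain are ours, not from the paper, and information gain is used as the heuristic measure on discrete attributes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Entropy reduction obtained by splitting on attribute attr."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attributes):
    """Steps 1-4 above: pick the best attribute, split, recurse on each branch."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attributes:          # pure node or no tests left
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    best = max(attributes, key=lambda a: info_gain(examples, a))
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best])
    return {"test": best, "branches": branches}

# Tiny usage example with dictionary-valued attribute vectors:
data = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rain",  "windy": "no"},  "stay"),
        ({"outlook": "rain",  "windy": "yes"}, "stay")]
print(build_tree(data, ["outlook", "windy"]))
```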
EXAMPLE OF DECISION TREE
(Figure: example decision tree)
PROBLEMS
There are some problems with traditional decision tree learners.
Some of them assume that all training examples can be stored simultaneously in main memory.
Disadvantage: this limits the number of examples they can learn from.
Disk-based decision tree learners keep the examples on disk and repeatedly read them.
Disadvantage: this becomes expensive when learning complex trees.
HOEFFDING TREES
Designed for extremely large datasets.
Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node.
Key question: how many examples are sufficient?
HOEFFDING BOUND
Definition: the statistical result that decides how many examples n each node needs is called the Hoeffding bound.
Assume: R is the range of the variable r, we have n independent observations, and the observed mean is r'.
With probability 1 - δ, the true mean of r is at least r' - ϵ, where
ϵ = √( R² ln(1/δ) / (2n) )
HOEFFDING BOUND
ϵ = √( R² ln(1/δ) / (2n) )
ϵ is a decreasing function of n: the larger n is, the smaller ϵ becomes.
ϵ bounds the difference between the true mean and the observed mean of r.
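As a small illustration (a sketch, not the paper's code), the function below computes ϵ = √(R² ln(1/δ) / (2n)) and shows it shrinking as n grows. Taking R = log₂(c) for information gain with c classes is the usual choice; here c = 2, so R = 1.

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range R lies within epsilon of the mean of n observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# Epsilon decreases as more examples n are observed (R = 1, delta = 1e-7):
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(hoeffding_bound(R=1.0, delta=1e-7, n=n), 4))
```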
HOEFFDING TREE ALGORITHM
Inputs:
S -> is a sequence of examples,
X -> is a set of discrete attributes,
G(.) -> is a split evaluation function,
δ -> is one minus the desired
probability of choosing the correct
attribute at any given node.
Outputs:
HT -> is a decision tree.
HOEFFDING TREE ALGORITHM
Goal:
Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples.
Let Xa be the attribute with the highest observed G' and Xb the attribute with the second-highest, after seeing n examples.
Let ΔG' = G'(Xa) - G'(Xb).
If ΔG' > ϵ, then Xa is the correct choice with probability 1 - δ.
Thus a node needs to accumulate examples from the stream until ϵ becomes smaller than ΔG'.
HOEFFDING TREE ALGORITHM
The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain for the attributes and determines the best attribute.
At each node it checks the condition ΔG' > ϵ.
If the condition is satisfied, it creates child nodes based on the test at the node.
If not, it streams in more training examples and repeats the calculations until the condition is satisfied (a sketch of this check follows).
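A hedged sketch of that per-leaf check (function names are ours): split on the current best attribute Xa only when its observed gain exceeds the runner-up Xb's by more than ϵ.

```python
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, n, delta=1e-7, R=1.0):
    """gains maps attribute name -> observed gain G' at this leaf after n examples."""
    ranked = sorted(gains, key=gains.get, reverse=True)
    xa, xb = ranked[0], ranked[1]
    delta_g = gains[xa] - gains[xb]                       # ΔG' = G'(Xa) - G'(Xb)
    return xa, delta_g > hoeffding_bound(R, delta, n)     # split on Xa once ΔG' > ϵ

# After 5000 examples the 0.12 gap exceeds ϵ of about 0.04, so the leaf is split on "age":
print(should_split({"age": 0.42, "income": 0.30, "zip": 0.05}, n=5000))
```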
HOEFFDING TREE ALGORITHM
Memory cost
d: number of attributes
c: number of classes
v: number of values per attribute
l: number of leaves in the tree
The memory cost for each leaf is O(dvc).
The memory cost for the whole tree is O(ldvc) (a small worked example follows).
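A small worked example of these costs, with illustrative numbers that are not taken from the paper:

```python
# Illustrative memory estimate for the O(dvc) counters kept at each leaf.
d, v, c, l = 100, 10, 2, 10_000    # attributes, values per attribute, classes, leaves
counters_per_leaf = d * v * c      # 100 * 10 * 2 = 2,000 counts per leaf
counters_total = l * counters_per_leaf
print(counters_per_leaf, counters_total)   # 2000 20000000, i.e. O(ldvc)
```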
ADVANTAGES OF HOEFFDING TREE
1. Can deal with extremely large datasets.
2. Each example is read at most once, in small constant time, which makes it possible to mine online data sources.
3. Builds very complex trees with acceptable computational cost.
VFDT—VERY FAST DECISION TREE
Breaking ties
Reduces wasted examples
Useful under the condition ΔG < ϵ < τ (see the sketch after this list)
Use of n_min
A split decision is unlikely to change with a single extra example
Significantly reduces the time spent on re-computation
Memory cleanup
Promise of a leaf measured by p_l · e_l
Clearing of the least promising leaves
Option of enabling reactivation
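A minimal sketch of the tie-breaking rule named above, assuming the same Hoeffding bound as before (the function name and the example thresholds are illustrative): when two attributes look nearly equally good, VFDT splits once ϵ falls below the user-set threshold τ instead of waiting indefinitely.

```python
import math

def hoeffding_bound(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def vfdt_should_split(delta_g, n, delta=1e-7, tau=0.05, R=1.0):
    """Split if ΔG' > ϵ (the normal test) or if ϵ < τ (the tie is broken)."""
    eps = hoeffding_bound(R, delta, n)
    return delta_g > eps or eps < tau

# With a tiny gap of 0.01 the plain Hoeffding test alone would wait a very long
# time; once ϵ drops below τ = 0.05 the tie is broken and VFDT splits anyway.
for n in (1_000, 10_000, 100_000):
    print(n, vfdt_should_split(0.01, n))
```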
VFDT—VERY FAST DECISION TREE
Filtering out poor attributes
Poor attributes are dropped early
Reduces memory consumption
Initialization
Can be initialized with another existing tree, e.g. one built by a conventional learner
Gives the learner a head start
Rescans: previously seen examples can be rescanned when the data rate allows
TESTS—CONFIGURATION
14 concepts
Generated by random decision trees using a parameter f
Number of leaves: 2.2k to 61k
Noise level: 0 to 30%
50k examples for testing
Available memory: 40MB
Legacy processors
TESTS—SYNTHETIC DATA
δ = 10^-7, τ = 5%, n_min = 200
(Result figures for the synthetic data)
TESTS—SYNTHETIC DATA
Time consumption
20m examples: VFDT takes 5752s to read and 625s to process
100k examples: C4.5 takes 36s, VFDT takes 47s
TESTS—PARAMETERS
With and without over-pruning
TESTS—PARAMETERS
With ties vs. without ties:
65 nodes vs. 8k nodes for VFDT
805 nodes vs. 8k nodes for VFDT-boot
72.9% vs. 86.9% accuracy for VFDT
83.3% vs. 88.5% accuracy for VFDT-boot
n_min = 1 vs. n_min = 200:
VFDT: +1.1% accuracy, 3.8x the time
VFDT-boot: -0.9% accuracy, 3.7x the time
5% more nodes
TESTS—PARAMETERS
40MB vs. 80MB memory:
7.8k more nodes
VFDT: +3.0% accuracy
VFDT-boot: +3.2% accuracy
δ = 10^-7 vs. δ = 10^-2:
30% fewer nodes
VFDT: +2.3% accuracy
VFDT-boot: +1.0% accuracy
TESTS—WEB DATA
For predicting web page accesses
1.89m examples
61.1% accuracy by always predicting the most common class
276,230 examples for testing
TESTS—WEB DATA
Decision stump
64.2% accuracy
1277s to learn
C4.5 with 40MB memory
74.5k examples
2975s to learn
73.3% accuracy
VFDT bootstrapped with C4.5
1.61m examples
1450s to learn after initialization (983s to read)
TESTS—WEB DATA
(Result figures for the web data)
MINING TIME-CHANGING DATA STREAMS
WHY IS VFDT NOT ENOUGH?
VFDT assumes the training data is a sample drawn from a stationary distribution.
Most large databases or data streams violate this assumption.
Concept drift: data is generated by a time-changing concept function, e.g. seasonal effects or economic cycles.
Goal:
Mine continuously changing data streams
Scale well
WHY IS VFDT NOT ENOUGH?
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples.
Sensitive to window size:
If w is small relative to the concept shift rate, this assures the availability of a model reflecting the current concept.
But too small a w may leave insufficient examples to learn the concept.
If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high.
CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner):
Extends VFDT
Maintains VFDT's speed and accuracy
Detects and responds to changes in the example-generating process
CVFDT (CONTD.)
With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date.
Periodically use a sample of examples to evaluate the quality of the trees.
Replace the old subtree when the alternate one becomes more accurate (see the sketch below).
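A highly simplified sketch of the alternate-subtree mechanism described above. The class and field names are ours, and the accuracy bookkeeping is compressed; the real CVFDT maintains sufficient statistics over a sliding window of examples.

```python
class Node:
    def __init__(self, attribute):
        self.attribute = attribute      # current splitting attribute
        self.alternate = None           # alternate subtree being grown
        self.acc_current = 0.0          # accuracy of the current subtree on recent examples
        self.acc_alternate = 0.0        # accuracy of the alternate subtree

    def start_alternate(self, new_best_attribute):
        """Called when the test says a different attribute now looks best."""
        if self.alternate is None:
            self.alternate = Node(new_best_attribute)

    def periodic_check(self):
        """Periodically compare the two subtrees on a sample of recent examples
        and swap in the alternate when it becomes more accurate."""
        if self.alternate is not None and self.acc_alternate > self.acc_current:
            self.attribute = self.alternate.attribute
            self.alternate = None

node = Node("region")
node.start_alternate("season")                     # concept drift: "season" now looks best
node.acc_current, node.acc_alternate = 0.71, 0.78  # measured on the evaluation sample
node.periodic_check()
print(node.attribute)                              # -> season
```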
HOW CVFDT WORKS
EXAMPLE
SAMPLE EXPERIMENT RESULT
CONCLUSION AND FUTURE WORK
CVFDT is able to keep a decision tree up-to-date with a window of examples by using a small constant amount of time for each new example that arrives.
Empirical studies show that CVFDT is able to keep its model up-to-date with a massive data stream even in the face of large and frequent concept shifts.
Future work: currently CVFDT discards subtrees that are out-of-date, but some concepts change periodically and these subtrees may become useful again; identifying these situations and taking advantage of them is an area for further study.
THANK YOU