
MINING HIGH-SPEED DATA STREAMS
Presented by:
Yumou Wang
Dongyun Zhang
Hao Zhou
INTRODUCTION
 The world’s information is doubling every two years.
 From 2006 to 2011, the amount of information grew by a factor of 9 in just five years.
INTRODUCTION
 By 2020 the world will generate 50 times the amount of information and 75 times the number of "information containers".
 However, the IT staff available to manage it will grow by less than a factor of 1.5.
 Current algorithms can only deal with small amounts of data, often less than a single day’s worth for many applications.
 For example, banks and telecommunications companies.
INTRODUCTION
 Problem: when new examples arrive at a higher rate than they can be mined, the amount of unused data grows without bound as time progresses.
 Today, dealing with these huge amounts of data in a responsible way is very important.
 Mining these continuous data streams brings unique opportunities, but also new challenges.

BACKGROUND
 Design criteria for mining high-speed data streams:
 It must be able to build a model using at most one scan of the data.
 It must use only a fixed amount of main memory.
 It must require small constant time per record.
BACKGROUND
 Usually, a KDD system is used to process these examples as they arrive.
 Shortcoming: the learned models are highly sensitive to example ordering compared to the batch model.
 Others can produce the same model as the batch version, but are much slower.
CLASSIFICATION METHOD
 Input:
Examples of the form (x, y), where y is the class label and x is the vector of attributes.
 Output:
A model y = f(x) that predicts the classes y of future examples x with high accuracy.

DECISION TREE
 One of the most effective and widely used classification methods.
 A decision tree is a decision support tool that uses a tree-like graph or model.
 Decision trees are commonly used in machine learning.
BUILDING A DECISION TREE
 1.
Starting at the root.
 2. Testing all the attributes and choose
the best one according to some heuristic
measure.
 3. Split one node into branches and leaves.
 4. Recursively replacing leaves by test
nodes.
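As a point of reference before the streaming version, here is a minimal batch sketch of this greedy procedure (illustrative only: discrete attributes and information gain as the heuristic measure are assumptions, and none of the names come from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting `examples` (a list of (x, y)) on attribute `attr`."""
    base = entropy([y for _, y in examples])
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(v) / len(examples) * entropy(v) for v in by_value.values())
    return base - remainder

def build_tree(examples, attrs):
    """Greedy batch decision-tree induction (ID3-style): pick the best attribute,
    split, and recurse until nodes are pure or attributes run out."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(examples, a))
    branches = {}
    for x, y in examples:
        branches.setdefault(x[best], []).append((x, y))
    rest = [a for a in attrs if a != best]
    return (best, {v: build_tree(sub, rest) for v, sub in branches.items()})
```

Note that every training example must be held in memory at once here, which is exactly the limitation the streaming approach below removes.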
EXAMPLE OF DECISION TREE
PROBLEMS
 There are some problems with traditional decision trees.
 Some learners assume that all training examples can be stored simultaneously in main memory.
 Disadvantage: this limits the number of examples that can be learned from.
 Disk-based decision tree learners keep the examples on disk and repeatedly read them.
 Disadvantage: expensive when learning complex trees.

HOEFFDING TREES
 Designed for extremely large datasets.
 Main idea: find the best attribute at a given node by considering only a small subset of the training examples that pass through that node.
 How many examples are sufficient? That is decided by the Hoeffding bound.
HOEFFDING BOUND
 Definition: the statistical result that decides how many examples n each node needs is called the Hoeffding bound.
 Assume r is a random variable with range R, and we have n independent observations with mean r’.
 With probability 1 − δ, the true mean of r is at least r’ − ϵ, where

    ϵ = √( R² ln(1/δ) / (2n) )
HOEFFDING BOUND

    ϵ = √( R² ln(1/δ) / (2n) )

 ϵ is a decreasing function of n: the bigger n is, the smaller ϵ is.
 ϵ bounds the difference between the true mean and the observed mean of r (a small helper evaluating the bound follows).

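A small helper that evaluates this bound (a sketch, not code from the paper; the example values of R, δ and n below are assumptions for illustration):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the observed mean of
    a variable with range `value_range` is within epsilon of the true mean after
    n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Illustration: information gain over two classes has range R = 1 (log2 of 2).
# With delta = 1e-7 and n = 200 observations, epsilon is roughly 0.2.
print(hoeffding_bound(value_range=1.0, delta=1e-7, n=200))
```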
HOEFFDING TREE ALGORITHM
 Inputs:
S -> a sequence of examples,
X -> a set of discrete attributes,
G(.) -> a split evaluation function,
δ -> one minus the desired probability of choosing the correct attribute at any given node.
 Outputs:
HT -> a decision tree.
HOEFFDING TREE ALGORITHM
Goal:
Ensure that, with high probability, the attribute chosen using n examples is the same as the one that would be chosen using infinitely many examples.
Let Xa be the attribute with the highest observed G’ and Xb the attribute with the second highest, after seeing n examples.
Let ΔG’ = G’(Xa) – G’(Xb).
If ΔG’ > ϵ, then Xa really is the better attribute with probability 1 – δ.
Thus a node needs to accumulate examples from the stream until ϵ becomes smaller than ΔG’.
HOEFFDING TREE ALGORITHM
 The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain for the attributes and determines the best attribute.
 At each node it checks the condition ΔG > ϵ. If the condition is satisfied, it creates child nodes based on the test at the node.
 If not, it streams in more training examples and repeats the calculations until the condition is satisfied (a condensed sketch of this per-leaf logic follows).
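A condensed, self-contained sketch of this per-leaf logic (illustrative: the class and method names are mine, not the paper's; discrete attributes and information gain as G(.) are assumed):

```python
import math
from collections import defaultdict

class HoeffdingLeaf:
    """Sufficient statistics for one leaf: per attribute/value/class counts."""
    def __init__(self, attributes):
        self.attributes = attributes
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.class_counts = defaultdict(int)
        self.n = 0

    def observe(self, x, y):
        """Update counts with one example; no raw examples are stored."""
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1

    def info_gain(self, attr):
        def H(cls_counts):
            total = sum(cls_counts.values())
            return -sum(c / total * math.log2(c / total)
                        for c in cls_counts.values() if c)
        remainder = sum(sum(cls.values()) / self.n * H(cls)
                        for cls in self.counts[attr].values())
        return H(self.class_counts) - remainder

    def try_split(self, delta, value_range=1.0):
        """Return the attribute to split on once the Hoeffding bound separates
        the best attribute from the runner-up, else None (keep accumulating)."""
        if self.n == 0:
            return None
        gains = sorted(((self.info_gain(a), a) for a in self.attributes),
                       reverse=True)
        eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * self.n))
        if len(gains) > 1 and gains[0][0] - gains[1][0] > eps:
            return gains[0][1]
        return None
```

Usage: call observe(x, y) for each example that reaches the leaf and try_split(delta=1e-7) periodically; when it returns an attribute, the leaf is replaced by a test node on that attribute with fresh child leaves.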
HOEFFDING TREE ALGORITHM
 Memory cost:
 d — number of attributes
 c — number of classes
 v — number of values per attribute
 l — number of leaves in the tree
 The memory cost for each leaf is O(dvc).
 The memory cost for the whole tree is O(ldvc) (a worked example follows).
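As an illustrative worked example (the concrete numbers are assumptions, not from the slides): with d = 100 attributes, v = 10 values per attribute, c = 2 classes and l = 1000 active leaves, each leaf keeps about d·v·c = 2,000 counters and the whole tree about l·d·v·c = 2,000,000 counters, independent of how many examples have streamed past.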
ADVANTAGES OF HOEFFDING TREE
 1. Can deal with extremely large datasets.
 2. Each example is read at most once, in a small constant time. This makes it possible to mine online data sources.
 3. Builds very complex trees with acceptable computational cost.
VFDT—VERY FAST DECISION TREE

 Breaking ties
 Reduces wasted examples.
 A split is made even when ∆𝐺 < 𝜖, provided 𝜖 < 𝜏.

 Use of 𝑛𝑚𝑖𝑛
 The best split is unlikely to change with a single example.
 Recomputing G only every 𝑛𝑚𝑖𝑛 examples significantly reduces re-computation time.

 Memory cleanup
 Leaves are ranked by the measure 𝑝𝑙𝑒𝑙 (probability of reaching the leaf times its observed error).
 The least promising leaves are deactivated.
 They can optionally be reactivated later.

 A sketch of the refined split check follows.

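The refined split check, sketched as a variation on the HoeffdingLeaf class above (the function shape is an assumption; the 𝜏 and 𝑛𝑚𝑖𝑛 defaults mirror the values quoted in the tests below):

```python
import math

def vfdt_try_split(leaf, delta, tau=0.05, n_min=200, value_range=1.0):
    """VFDT-style split check on a HoeffdingLeaf: re-evaluate gains only every
    n_min examples, and break ties once epsilon has shrunk below tau."""
    if leaf.n == 0 or leaf.n % n_min != 0:       # n_min: skip recomputing G every example
        return None
    gains = sorted(((leaf.info_gain(a), a) for a in leaf.attributes),
                   reverse=True)
    if len(gains) < 2:
        return None
    eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * leaf.n))
    delta_g = gains[0][0] - gains[1][0]
    if delta_g > eps or eps < tau:               # clear winner, or a tie worth breaking
        return gains[0][1]
    return None
```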
VFDT—VERY FAST DECISION TREE

 Filtering out poor attributes
 Clearly poor attributes are dropped early.
 Reduces memory consumption.

 Initialization
 Can be initialized with an existing tree from another learner.
 Gives VFDT a head start.

 Rescans
 Previously seen examples can be rescanned when data arrives slowly enough.
TESTS—CONFIGURATION

 14 concepts
 Generated by random decision trees using 𝑓
 Number of leaves: 2.2k to 61k
 Noise level: 0 to 30%

 50k examples for testing
 Available memory: 40MB
 Legacy processors

TESTS—SYNTHETIC DATA

 𝛿 = 10⁻⁷, 𝜏 = 5%, 𝑛𝑚𝑖𝑛 = 200
TESTS—SYNTHETIC DATA
 Time consumption
 20m examples:
 VFDT takes 5752s to read and 625s to process.
 100k examples:
 C4.5 takes 36s.
 VFDT takes 47s.
TESTS—PARAMETERS

With and without over-pruning
TESTS—PARAMETERS
 With ties vs. without ties
 65 nodes vs. 8k nodes for VFDT
 805 nodes vs. 8k nodes for VFDT-boot
 72.9% vs. 86.9% for VFDT
 83.3% vs. 88.5% for VFDT-boot

 𝑛𝑚𝑖𝑛 = 1 vs. 𝑛𝑚𝑖𝑛 = 200
 VFDT: +1.1% accuracy, +3.8x time
 VFDT-boot: -0.9% accuracy, +3.7x time
 5% more nodes
TESTS—PARAMETERS
 40MB vs. 80MB memory
 7.8k more nodes
 VFDT: +3.0% accuracy
 VFDT-boot: +3.2% accuracy

 𝛿 = 10⁻⁷ vs. 𝛿 = 10⁻²
 30% fewer nodes
 VFDT: +2.3% accuracy
 VFDT-boot: +1.0% accuracy
TESTS—WEB DATA

 For predicting accesses
 1.89m examples
 61.1% accuracy with the most common class
 276230 examples for testing
TESTS—WEB DATA
 Decision stump
 64.2% accuracy
 1277s to learn

 C4.5 with 40MB memory
 74.5k examples
 2975s to learn
 73.3% accuracy

 VFDT bootstrapped with C4.5
 1.61m examples
 1450s to learn after initialization (983s to read)
MINING TIME-CHANGING DATA STREAMS
WHY IS VFDT NOT ENOUGH?
 VFDT assumes the training data is a sample drawn from a stationary distribution.
 Most large databases or data streams violate this assumption.
 Concept drift: data is generated by a time-changing concept function, e.g.
 Seasonal effects
 Economic cycles

 Goal:
 Mine continuously changing data streams
 Scale well

WHY IS VFDT NOT ENOUGH?
 Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples (a naive version is sketched below).
 Sensitive to window size:
 If w is small relative to the concept shift rate, the model is assured to reflect the current concept.
 But too small a w may leave insufficient examples to learn the concept.
 If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying a learner may be prohibitively high.
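A minimal sketch of that naive sliding-window baseline, reusing the batch build_tree from the earlier decision-tree sketch (illustrative only); it makes the cost problem concrete: every arriving example triggers a full relearn over w examples:

```python
from collections import deque

def sliding_window_learner(stream, w, attrs):
    """Naive baseline: rebuild a batch tree over the last w examples every time
    a new example arrives. Cost is a full O(w) relearn per example."""
    window = deque(maxlen=w)
    for x, y in stream:
        window.append((x, y))
        yield build_tree(list(window), attrs)   # expensive: relearn from scratch
```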
CVFDT

 CVFDT (Concept-adapting Very Fast Decision Tree learner)
 Extends VFDT
 Maintains VFDT’s speed and accuracy
 Detects and responds to changes in the example-generating process
CVFDT (CONTD.)
 With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
 An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
 When the old attribute seems out of date, grow an alternative subtree with the new best attribute at its root.
 Periodically use a batch of samples to evaluate the quality of the trees.
 Replace the old subtree when the alternate one becomes more accurate (a simplified sketch of this mechanism follows).
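A much-simplified sketch of that alternate-subtree bookkeeping (the class, field, and helper names are illustrative assumptions, not CVFDT's actual data structures):

```python
class CVFDTNode:
    """Internal node that may carry an alternate subtree grown on newer data."""
    def __init__(self, split_attr):
        self.split_attr = split_attr
        self.children = {}
        self.alt_tree = None            # alternate subtree rooted at the new best attribute

    def maybe_start_alternate(self, new_best_attr):
        """Begin growing an alternate subtree when the current split looks outdated."""
        if self.alt_tree is None and new_best_attr != self.split_attr:
            self.alt_tree = CVFDTNode(new_best_attr)

    def maybe_swap(self, recent_examples, accuracy_of):
        """Periodically compare old and alternate subtrees on recent examples and
        keep the more accurate one. `accuracy_of(tree, examples)` is an assumed helper."""
        if self.alt_tree is None:
            return self
        if accuracy_of(self.alt_tree, recent_examples) > accuracy_of(self, recent_examples):
            return self.alt_tree        # the alternate wins: it replaces this subtree
        return self
```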
HOW CVFDT WORKS
EXAMPLE
SAMPLE EXPERIMENT RESULT
CONCLUSION AND FUTURE WORK
 CVFDT is able to keep a decision tree up to date with a window of examples, using only a small constant amount of time for each new example that arrives.
 Empirical studies show that CVFDT is able to keep its model up to date with a massive data stream even in the face of large and frequent concept shifts.
 Future work: currently CVFDT discards subtrees that are out of date, but some concepts change periodically and these subtrees may become useful again. Identifying these situations and taking advantage of them is another area for further study.
THANK YOU