Mining Top-K Frequent Items in a Data Stream with Flexible Sliding

Download Report

Transcript Mining Top-K Frequent Items in a Data Stream with Flexible Sliding

Mining Top-K Frequent Items in
a Data Stream with
Flexible Sliding Windows
Hoang Thanh Lam and Toon Calders
Outline
•
•
•
•
•
•
•
A Survey of Frequency Measures
Max-Frequency
Problem Statement
Memory Issues
A Memory-efficient Stream Summary
A Stream Summary in Practice
Conclusions
/ name of department
11-4-2016
PAGE 1
A Survey of Frequency Measures
I am monitoring the data
stream
a
b
a
c
a
d
b
Time
Data Stream
/ name of department
11-4-2016
PAGE 2
A Survey of Frequency Measures
Which items are the
most important?
a
b
a
c
a
d
b
Time
Data Stream
/ name of department
11-4-2016
PAGE 3
A Survey of Frequency Measures
Resource allocation
Prediction burst early
a
b
a
c
a
d
b
Time
Data Stream
/ name of department
11-4-2016
PAGE 4
A Survey of Frequency Measures
•
•
•
•
Relative frequency (RF)
RF in a fixed-length sliding window
Measures based on decaying factor
Max frequency (MaxFreq)
/ name of department
11-4-2016
PAGE 5
Max Frequency Measure
• Independent from parameters
• Recent occurrence is more important
• Captures the entire history of the stream
/ name of department
11-4-2016
PAGE 6
A Survey of Frequency Measures
Max-Frequency
MaxFreq(a)=Max {1/2, 2/3,3/6,4/7} = 2/3
4/7
3/6
a
b
a
a
c
c
2/3
1/2
a
d
a
b
Data Stream
/ name of department
11-4-2016
PAGE 7
Max-Frequency Measure
MaxFrequency
MaxFreq(a)=Max {1/2, 2/3,3/6,4/7} = 2/3
Border Point
4/7
3/6
a
b
a
Max Point
a
c
c
2/3
1/2
a
d
a
b
Data Stream
/ name of department
11-4-2016
PAGE 8
Properties of Border Points
a
A
B
If there exists A and B such that RF(a,A)>RF(a,B), the given occurrence
of a cannot be a border point, otherwise it is.
/ name of department
11-4-2016
PAGE 9
Stream summary
/ name of department
11-4-2016
PAGE 10
Minimum Threshold Principle
A border point with the current associated frequency
less than a pre-defined minimum threshold never
becomes a max point with frequency greater than
that threshold value.
/ name of department
11-4-2016
PAGE 11
Problem statement
• Given a data stream S evolving over time,
• At every time point:
• Maintain a summary
• Upon request:
• Give top-k most frequent items according to MaxFreq
• Based on the summary
• fast
/ name of department
11-4-2016
PAGE 12
Memory issues
• It can be proven that in the worst case the number of
border points is the same order of magnitude as the
stream size.
Storing the complete set of borders is not feasible
in stream applications !!!
/ name of department
11-4-2016
PAGE 13
Memory issues
• To answer even the top-2 query exactly we need an
amount of memory being linear in the number of
distinct items in the stream
• In many applications it is not feasible:
• with multiple streams and
• the number of distinct of items is huge
• Therefore:
• approximate algorithms
/ name of department
11-4-2016
PAGE 14
A Memory-efficient Stream Summary
/ name of department
11-4-2016
PAGE 15
A Memory-efficient Stream Summary
Guess this lower
bound !!!
/ name of department
11-4-2016
PAGE 16
Guessing the lower bound
• Let Xk be the size of the minimum suffix of the stream S such that
exact k distinct items can be found in this suffix
a b b c d a d d e a a a f f c c a c
X5= 11
1
Observation: X k is a lower bound on the MaxFreq of the top-k items
Choose
/ name of department
1
c.E ( X k )
as the guessed lower bound and use this for pruning
11-4-2016
PAGE 17
MeanSummary
• Pruning all the border points at which the associated
relative frequency of the given item is less than the
pruning threshold
1

c.E ( X k )
/ name of department
11-4-2016
PAGE 18
Accuracy analysis
• There is no false positive
• According to the Markov’s inequality the possible false
negative top-k list is bounded by:
1
Pr( X k  cE ( X k )) 
c
/ name of department
11-4-2016
PAGE 19
However …
• Estimation of E(Xk) is well-known as the Coupon
Collector Problem (CCP)
• No closed formula of E(Xk) for any distribution is
known so far except for the uniform distribution
• Moreover, stream is very dynamic, item distribution
is changing over time
Therefore:
• Algorithm that dynamically adjusts threshold
/ name of department
11-4-2016
PAGE 20
A Stream Summary in Practice
/ name of department
11-4-2016
PAGE 21
MinSummary
• Initialize the pruning threshold with lowest possible value   0
• At every time step:
• Look up the minimum MaxFreq of the top-l stored items
• Update  if this value is greater than its current value
• In this way:
• threshold  is dynamically increasing
• There are no false positives
• Frequencies are exact
/ name of department
11-4-2016
PAGE 22
Datasets used in the experiments
/ name of department
11-4-2016
PAGE 23
Accuracy of MinSummary
/ name of department
11-4-2016
PAGE 24
Memory Usage Kosarak Data
/ name of department
11-4-2016
PAGE 25
Memory Usage Sligro Data
/ name of department
11-4-2016
PAGE 26
Conclusions
• We prove that the exact solution for the top-k
frequent item problem requires a prohibited amount
of memory
• However it is possible to summarize the stream with
very small memory usage to solve the problem with
high accuracy
/ name of department
11-4-2016
PAGE 27
Questions
• Thank you very much !
/ name of department
11-4-2016
PAGE 28