whats_hot_and_whats_not
Download
Report
Transcript whats_hot_and_whats_not
What’s Hot and What’s Not:
Tracking Most Frequent Items
Dynamically
G. Cormode and S. Muthukrishman
Rutgers University
ACM Principles of Database Systems 2003
ACM Transactions on Database Systems 2005
Introduction
Find “hot” items, but the set of hot
items will change over time
Applications: caching, load balancing,
sensor networks, data mining, etc.
Usually focus on “insert” only, this
paper also take “delete” into account
Prior works
Stream with sliding window (*)
Arrival time
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1
Elements
Flajolet-Martin approach (*)
Estimate number of distinct elements
Majority voting algorithm
Use only one counter to identify the
majority item
Lossy counting
* http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385
Contribution of the paper
Dynamically maintain the hot items
Both insert and delete transactions are
supported
Randomized algorithm
Use hash table
Use “random” to confuse omniscient
adversary
Small space required
Short processing time
Finding the majority item
Keep log2m+1 counters
C0: keep how many items are “live”
Cj (j!=0): increase or decrease if bit(x,j)=1
Search: if there is a majority, it is given by
log2 m
j 1
2 j gt (c j , c0 / 2)
No false negative, but false positive is possible
Algorithms to find the majority
element in a sequence of updates
Example
Find majority:
x=0 +21 +0=2
Space of 8 items
Counter 0
Counter 1 (20)
#>(counter 0)/2 ?
Counter 2 (21)
Counter 3 (22)
1 2 2 2 7 2 4 6
False positive is possible!
Finding hot items
Sequence with length n
Item identifiers: 1..m
nx(t): # of inserts - # of deletes
before time t
fx(t): nx(t)/sigma(ny(t), y=1..m)
Hot item: given k, fx(t) > 1/(k+1)
Process Item (insert or delete)
Classify sets by universal hash function
Initialize c[0..2Tk][0..logm]=0, c=0
T: # of groups
k: frequency threshold (fx(t)>1/(k+1))
for all (i, transType) do
if (transType == insert) c=c+1
else c=c-1
for x=1 to T do
index = hash(x) // uniformly distributed
UpdateCounters(i,transType,c[index])
Find hot sets
for i=1 to T do
//for each group
if c[i][0] ≧n/(k+1)
position=0; t=1;
for j=1 to logm do
if (c[i][j] ≧ n/(k+1))
position = position + t
t = t*2
output(position)
Similar to the algorithm to find the majority
Error probability
Choosing |h|≧2k, T=log2(k/δ), the
algorithm ensures that the probability
of all hot items being output is at
least 1-δ
Details of the proof (*,**)
* Universal classes of hash functions, J. Comput. Syst. 1979
** the two papers currently presented
Experiments
Synthetic data:
Uniformly insert
Zip-f insert
Uniformly delete
1,000,000 items
k=50 (hot items: f>1/(k+1))
Real data:
Telephone connections (from AT&T)
3.5 million transactions
Every 100,000 transactions, query (src, dest)
pairs with frequency greater than 1%
Results of synthetic data
Recall: proportion of the hot items that
are found by the method
Precision: proportion of items identified
by the algorithm are hot items
Results of real data
Conclusion
Propose a new method for identifying
hot items
Cope with dynamic datasets