whats_hot_and_whats_not

Transcript whats_hot_and_whats_not

What’s Hot and What’s Not:
Tracking Most Frequent Items
Dynamically
G. Cormode and S. Muthukrishman
Rutgers University
ACM Principles of Database Systems 2003
ACM Transactions on Database Systems 2005
Introduction
 Find “hot” items, but the set of hot
items will change over time
 Applications: caching, load balancing,
sensor networks, data mining, etc.
 Usually focus on “insert” only, this
paper also take “delete” into account
Prior works
 Stream with sliding window (*)
Arrival time
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1
Elements
 Flajolet-Martin approach (*)
 Estimate number of distinct elements
 Majority voting algorithm
 Use only one counter to identify the
majority item
 Lossy counting
* http://vc.cs.nthu.edu.tw/ezLMS/show.php?id=385
Contribution of the paper
 Dynamically maintain the hot items
 Both insert and delete transactions are
supported
 Randomized algorithm
 Use hash table
 Use “random” to confuse omniscient
adversary
 Small space required
 Short processing time
Finding the majority item
 Keep log2m+1 counters
 C0: keep how many items are “live”
 Cj (j!=0): increase or decrease if bit(x,j)=1
 Search: if there is a majority, it is given by

log2 m
j 1
2 j gt (c j , c0 / 2)
 No false negative, but false positive is possible
Algorithms to find the majority
element in a sequence of updates
Example
Find majority:
x=0 +21 +0=2
Space of 8 items
Counter 0
Counter 1 (20)
#>(counter 0)/2 ?
Counter 2 (21)
Counter 3 (22)
1 2 2 2 7 2 4 6
False positive is possible!
Finding hot items
 Sequence with length n
 Item identifiers: 1..m
 nx(t): # of inserts - # of deletes
before time t
 fx(t): nx(t)/sigma(ny(t), y=1..m)
 Hot item: given k, fx(t) > 1/(k+1)
Process Item (insert or delete)
 Classify sets by universal hash function
 Initialize c[0..2Tk][0..logm]=0, c=0
 T: # of groups
 k: frequency threshold (fx(t)>1/(k+1))
 for all (i, transType) do
if (transType == insert)  c=c+1
else  c=c-1
for x=1 to T do
index = hash(x) // uniformly distributed
UpdateCounters(i,transType,c[index])
Find hot sets
 for i=1 to T do
//for each group
if c[i][0] ≧n/(k+1)
position=0; t=1;
for j=1 to logm do
if (c[i][j] ≧ n/(k+1))
position = position + t
t = t*2
output(position)
Similar to the algorithm to find the majority
Error probability
 Choosing |h|≧2k, T=log2(k/δ), the
algorithm ensures that the probability
of all hot items being output is at
least 1-δ
 Details of the proof (*,**)
* Universal classes of hash functions, J. Comput. Syst. 1979
** the two papers currently presented
Experiments
 Synthetic data:





Uniformly insert
Zip-f insert
Uniformly delete
1,000,000 items
k=50 (hot items: f>1/(k+1))
 Real data:
 Telephone connections (from AT&T)
 3.5 million transactions
 Every 100,000 transactions, query (src, dest)
pairs with frequency greater than 1%
Results of synthetic data
 Recall: proportion of the hot items that
are found by the method
 Precision: proportion of items identified
by the algorithm are hot items
Results of real data
Conclusion
 Propose a new method for identifying
hot items
 Cope with dynamic datasets

whats_hot_and_whats_not

Transcript whats_hot_and_whats_not

Directory