Transcript Document
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically
By Graham Cormode
& S. Muthukrishnan
Rutgers University, Piscataway NY
Presented by Tal Sterenzy
Motivation
A basic statistic on database relations is which items are hot, i.e., occur frequently
The goal: dynamically maintain the hot items in the presence of delete and insert transactions
Examples:
DBMS – keep statistics to improve performance
Telecommunication networks - network
connections start and end over time
Overview
Definitions
Prior work
Algorithm description & analysis
Experimental results
Summary
Formal definition
Sequence of n transactions on m items [1…m]
$n_i(t)$ – net occurrence of item i at time t:
the number of times it has been inserted minus the number of times it has been deleted
$f_i(t) = n_i(t) / \sum_{j=1}^{m} n_j(t)$ – current frequency of item i at time t
$f^*(t) = \max_i f_i(t)$ – frequency of the most frequent item at time t
The k most frequent items at time t are those with the k largest $f_i(t)$
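A quick worked example with hypothetical numbers: for m = 3, after inserting items 1, 1, 2, 3 and then deleting item 3, we have $n_1(t) = 2$, $n_2(t) = 1$, $n_3(t) = 0$, so $f_1(t) = 2/3$, $f_2(t) = 1/3$, $f_3(t) = 0$, and $f^*(t) = 2/3$.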
Finding k hot items
k is a parameter
Item i is a hot item if $f_i(t) > 1/(k+1)$
Hot items are frequent items that account for a significant fraction of the entire dataset; for example, with k = 9, an item is hot if it makes up more than a tenth of the current total
There can be at most k hot items, and there may be none
Assume the basic integrity constraint $n_i(t) \ge 0$: an item is never deleted more often than it has been inserted
Our algorithm
A highly efficient, randomized algorithm for maintaining hot items in a dynamically changing database
It monitors the changes to the data distribution and maintains a data structure of size O(k log(k/δ) log m)
When queried, we can find all hot items in time O(k log(k/δ) log m), with probability 1 − δ
No need to scan the underlying relation
Small tail assumption
Restriction:
$f_1 \ge f_2 \ge \dots \ge f_m$ are the frequencies of the items, in sorted order
A set of frequencies has a small tail if $\sum_{i > k} f_i(t) < 1/(k+1)$
If there are k hot items, then the small tail property holds
If the small tail property holds, then some of the top-k items might not be hot
We shall analyze our solution in the presence
and absence of this small tail property (STP)
Prior work – why is it not adaptable?
All these algorithms keep counters, which are incremented when an item is observed and decremented or reallocated under certain circumstances
They cannot be directly adapted to handle both insertions and deletions: after deletions, the state of the algorithm differs from the state it would have reached without the insertions and deletions of that item
Work on dynamic data is sparse, and it provides no guarantees for the fully dynamic case with deletions
Our algorithm - idea
Do not keep counters of individual items, but
rather of subsets of items
Ideas from group testing:
Design a number of tests, each of which groups together a subset of the m items, in order to find up to k items which test positive
Here: find the k items that are hot
Minimize the number of tests, where each test group consists of a subset of the items
General procedure
For each transaction on item i, determine
which subsets it is included in: S(i)
Each subset has a counter:
For insertion: increment all S(i) counters
For deletion: decrement all S(i) counters
The test for each group will be: does its counter exceed a threshold?
Identifying the hot items is done by combining
test results from several groups
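A minimal sketch of this general procedure in Python, with the choice of groups left abstract (subsets_of is a hypothetical placeholder for the mapping i → S(i); the concrete construction is the subject of the following slides):

def update(counters, total, i, is_insert, subsets_of):
    # Apply one transaction on item i to the group counters.
    diff = 1 if is_insert else -1
    total += diff                 # net number of items "alive"
    for s in subsets_of(i):       # S(i): the groups containing i
        counters[s] += diff
    return total

def positive_groups(counters, total, k):
    # A group "tests positive" if its counter exceeds the
    # hot-item threshold total/(k+1).
    return [s for s, c in enumerate(counters) if c > total / (k + 1)]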
The challenge is choosing the subsets
Bounding the number of required subsets
Finding a concise representation of the groups
Giving an efficient way to go from the results of the tests to the set of hot items
Let's start with a simple case: k = 1 (frequency > 1/2)
A deterministic algorithm for maintaining the majority item
Finding majority item
For insertions only, constant time and space
Keep log m + 1 counters:
One counter of the number of items “alive”: $c_0 = n(t) = \sum_i n_i(t)$
The rest are labeled $c_1 \dots c_{\log m}$, one per group
Each group represents a bit in the binary representation of the item
Each group consists of half of the items: those whose corresponding bit is 1
Finding majority item – cont.
bit(i, j) – reports the value of the j-th bit in the binary representation of i
gt(x, y) – returns 1 if x > y, 0 otherwise
Scheme:
Insertion of item i: increment each counter $c_j$ such that bit(i, j) = 1, in time O(log m)
Deletion of item i: decrement each counter $c_j$ such that bit(i, j) = 1, in time O(log m)
Query: if there is a majority item, then it is given by $\sum_{j=1}^{\log_2 m} 2^{j-1} \, gt(c_j, c_0/2)$, computed in time O(log m)
Finding majority item – cont.
Theorem: the algorithm finds a majority item, if there is one, in time O(log m) per operation
The state of the data structure after I insertions and D deletions of an item is identical to its state after just c = I − D insertions
In the case of insertions only, the majority is clearly found, so the same holds in the dynamic case
UpdateCounters procedure
int c[0 … log m]
UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then diff = 1 else diff = -1
  c[0] = c[0] + diff
  for j = 1 to log m do
    c[j] = c[j] + diff * bit(i, j)
FindMajority procedure
FindMajority(c[0 … log m])
  position = 0; t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then
      position = position + t
    t = 2 * t
  return position
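The two procedures above translate directly into a runnable Python sketch (the class name and the use of ceil(log2(m+1)) bits for identifiers in [1…m] are my choices, not the paper's):

import math

class MajorityTracker:
    # Deterministic majority tracking under inserts and deletes:
    # one counter per bit of the item identifier, plus a total count.
    def __init__(self, m):
        self.bits = math.ceil(math.log2(m + 1))   # log m counters
        self.c = [0] * (self.bits + 1)            # c[0] = total count

    def update(self, i, insert=True):
        diff = 1 if insert else -1
        self.c[0] += diff
        for j in range(1, self.bits + 1):
            if (i >> (j - 1)) & 1:                # bit(i, j)
                self.c[j] += diff

    def query(self):
        # If a majority item exists, reconstruct it bit by bit.
        position, t = 0, 1
        for j in range(1, self.bits + 1):
            if self.c[j] > self.c[0] / 2:         # gt(c_j, c_0/2)
                position += t
            t *= 2
        return position

For example, after update(5), update(5), update(3), query() returns 5, since item 5 holds a strict majority of the live items.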
Randomized constructions for finding hot items
Observation: if we select a subset containing exactly one hot item, applying the majority algorithm to it will identify that hot item
Definition:
Let $F \subseteq [1 \dots m]$ denote the set of hot items
A set $S \subseteq [1 \dots m]$ is a good set if $|S \cap F| = 1$
How many subsets do we need?
Theorem: picking O(k log k) subsets, each by drawing m/k items uniformly from [1…m], means that with constant probability we have included k good subsets $S_1 \dots S_k$ such that $\bigcup_i (F \cap S_i) = F$
Proof: let p be the probability that a subset contains exactly one item of F; each of the m/k draws lands in F with probability k/m, so
$p = \frac{m}{k} \cdot \frac{k}{m} \left(1 - \frac{k}{m}\right)^{m/k - 1} = \left(1 - \frac{k}{m}\right)^{m/k - 1}$
and for $1 \le k \le m/2$: $1/4 \le p \le 2/e \le 3/4$
O(k log k) subsets will then guarantee, with constant probability, that we have a good subset for each hot item (coupon collector problem)
Coupon collector problem
p is the probability that a coupon is good
X – the number of trials required to collect at least one coupon of each type
Epoch i begins after the i-th success and ends with the (i+1)-th success
$X_i$ – the number of trials in the i-th epoch
$X_i$ is distributed geometrically, with $p_i = p(k-i)/k$
$E[X] = \sum_{i=0}^{k-1} E[X_i] = \sum_{i=0}^{k-1} \frac{k}{p(k-i)} = \frac{k}{p} \sum_{i=1}^{k} \frac{1}{i} = \frac{k}{p} \ln k + O(k) = O(k \log k)$
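A small simulation of this bound (the parameters are hypothetical; p plays the role of the constant success probability from the previous slide):

import math, random

def trials_to_collect(k, p):
    # Draw until all k coupon types are seen; each draw is good with
    # probability p, and a good draw is uniform over the k types.
    seen, trials = set(), 0
    while len(seen) < k:
        trials += 1
        if random.random() < p:
            seen.add(random.randrange(k))
    return trials

k, p = 20, 0.5
runs = [trials_to_collect(k, p) for _ in range(1000)]
print(sum(runs) / len(runs))   # approx (k/p) * H_k, about 144 here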
Defining the groups with universal hash functions
The groups are chosen in a pseudo-random way, using universal hash functions:
Fix a prime P > 2k
a and b are drawn uniformly from [0 … P−1]
Then set: $h_{a,b}(x) = ((ax + b) \bmod P) \bmod 2k$ and $S_{a,b,i} = \{x \mid h_{a,b}(x) = i\}$
Fact: over all choices of a and b, for x ≠ y: $\Pr[h_{a,b}(x) = h_{a,b}(y)] \le \frac{1}{2k}$
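A sketch of this family in Python (the specific prime $P = 2^{31} - 1$ is my choice for illustration):

import random

def make_hash(k, P=2_147_483_647):   # 2^31 - 1 is prime
    # Draw h_{a,b}(x) = ((a*x + b) mod P) mod 2k from the family.
    a = random.randrange(P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % (2 * k)

h = make_hash(k=8)
groups = [h(x) for x in range(1, 101)]   # items 1..100 spread over 2k = 16 groups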
Choosing and updating the subsets
We will choose T = log(k/δ) values of a and b, which creates 2kT = 2k log(k/δ) subsets of items
Processing an item i means:
For each of the T repetitions, determine which of its 2k groups i falls into
For each such group, update its log m counters based on the bit representation of i
If a group is good, this procedure will give us its hot item
Space requirements
The a and b values are O(m), so storing them takes O(log(k/δ) log m) space
Number of counters: 2k log(k/δ) (log m + 1)
Total space: O(k log(k/δ) log m):
log(k/δ) choices of a, b
× 2k subsets each
× (log m + 1) counters each
For example (hypothetical parameters): k = 20, δ = 0.01, m = 10^6 gives T = ⌈log₂ 2000⌉ = 11 repetitions, 440 subsets, and 440 · 21 = 9,240 counters
Probability of each hot item being in at least one good subset is at least 1 − δ
Consider one hot item i: in each of the T repetitions, it is placed in one of 2k groups
Since any other item collides with i with probability at most 1/(2k), the expected total frequency f of the other items in i's group is
$E[f] = \sum_{j \ne i} \frac{f_j}{2k} \le \frac{1}{2k}\left(1 - \frac{1}{k+1}\right) = \frac{1}{2(k+1)}$
If f < 1/(k+1), the majority algorithm finds the hot item – success
If f ≥ 1/(k+1), the majority might not be found – failure
Probability of failure < 1/2 (by Markov's inequality)
Probability of failing in all T repetitions < $(1/2)^{\log(k/\delta)} = \delta/k$
So, by a union bound over the at most k hot items, the probability of any hot item failing is at most δ
Detecting good subsets
Given a subset $S_{a,b,i}$ and its associated counters $c_0 \dots c_{\log m}$, it is possible to detect deterministically whether the subset is a good subset
Proof: a subset can fail in two ways (c denotes the total count of all items):
It contains no hot item (assuming STP): then $c_0 < c/(k+1)$
It contains more than one hot item: then there will be some j such that $c_j > c/(k+1)$ and $c_0 - c_j > c/(k+1)$
If neither test fires, the subset is determined to be good
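As a sketch, the deterministic test from this proof (counters is the list $c_0 \dots c_{\log m}$ of one subset and n is the total item count; the function name is mine):

def is_good_subset(counters, n, k):
    # A subset is rejected if its total is below the hot-item threshold
    # (no hot item, under STP), or if some bit split leaves both halves
    # above the threshold (two or more hot items).
    threshold = n / (k + 1)
    if counters[0] <= threshold:
        return False                  # no hot item in this subset
    for cj in counters[1:]:
        if cj > threshold and counters[0] - cj > threshold:
            return False              # at least two hot items collide
    return True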
ProcessItem procedure
Initialize c[0 … 2Tk][0 … log m], draw a[1 … T] and b[1 … T], and set n = 0
(n is the running total count of all items, kept separately from the counter array c)
ProcessItem(i, transtype, T, k)
  if (transtype = ins) then
    n = n + 1
  else
    n = n - 1
  for x = 1 to T do
    index = 2k(x-1) + ((a[x]*i + b[x]) mod P) mod 2k
    UpdateCounters(i, transtype, c[index])
GroupTest procedure
GroupTest(T, k, b)
  for i = 1 to 2Tk do
    if c[i][0] > n*b then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > n*b and c[i][0] - c[i][j] > n*b) then
          skip to next i        // two or more hot items in this group
        if c[i][j] > n*b then
          position = position + t
        t = 2 * t
      output position
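Putting ProcessItem and GroupTest together, a self-contained Python sketch of the whole structure (the class name, the prime P, and the exact rounding of T are my assumptions; the logic follows the pseudocode above):

import math, random

class HotItems:
    def __init__(self, k, delta, m, P=2_147_483_647):     # P: a large prime
        self.k, self.P = k, P
        self.T = max(1, math.ceil(math.log2(k / delta)))  # repetitions
        self.bits = math.ceil(math.log2(m + 1))           # log m
        self.a = [random.randrange(P) for _ in range(self.T)]
        self.b = [random.randrange(P) for _ in range(self.T)]
        # 2kT subsets, each with counters c[0 .. log m]
        self.c = [[0] * (self.bits + 1) for _ in range(2 * k * self.T)]
        self.n = 0                                        # total count

    def process(self, i, insert=True):                    # ProcessItem
        diff = 1 if insert else -1
        self.n += diff
        for x in range(self.T):
            h = ((self.a[x] * i + self.b[x]) % self.P) % (2 * self.k)
            row = self.c[2 * self.k * x + h]
            row[0] += diff                                # UpdateCounters
            for j in range(1, self.bits + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += diff

    def query(self, k_prime=None):                        # GroupTest
        kp = k_prime if k_prime is not None else self.k   # k' <= k allowed
        thresh = self.n / (kp + 1)
        hot = set()
        for row in self.c:
            if row[0] <= thresh:
                continue                                  # no hot item here
            position, t, good = 0, 1, True
            for j in range(1, self.bits + 1):
                if row[j] > thresh and row[0] - row[j] > thresh:
                    good = False                          # >= 2 hot items
                    break
                if row[j] > thresh:
                    position += t
                t *= 2
            if good:
                hot.add(position)
        return hot

For example, h = HotItems(k=5, delta=0.05, m=1_000_000); feed the stream with h.process(item) and h.process(item, insert=False), and h.query() returns the candidate hot items.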
Algorithm correctness
With probability at least 1 − δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all hot items
Time to process an item: O(log(k/δ) log m)
Time to get all hot items: O(k log(k/δ) log m)
With or without STP, we are still guaranteed to include all hot items with high probability
Without STP, we might also output some infrequent items
Algorithm correctness – cont.
When will an infrequent item be output? (without STP)
A set with 2 or more hot items will be detected and skipped
A set with exactly one hot item will never output a wrong item: even if, in some split, the half without the hot item exceeds the threshold, then both halves exceed it and the set is detected and skipped
Only a set with no hot item in which, for every one of the log m splits, exactly one half exceeds the threshold will cause the algorithm to output an infrequent item
Algorithm properties
The set of counters created with T = log(k/δ) can be used to find hot items with parameter k', for any k' < k, with probability of success 1 − δ, by calling GroupTest(log(k/δ), k, 1/(k'+1))
Proof: in the proof of the success probability for k hot items, the expected frequency of the other items in a group becomes
$E[f] \le \frac{1}{2k}\left(1 - \frac{1}{k'+1}\right) \le \frac{1}{2(k'+1)}$
and the rest of the argument is unchanged
Experiments
The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms.
The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists.
Recall is the proportion of the hot items that are found by the method out of the total number of hot items.
Precision is the proportion of the items output by the algorithm that are actually hot.
Synthetic data (Recall)
Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Synthetic data (Precision)
Zipf parameter of the hot items: 0 – distributed uniformly, 3 – highly skewed
Real data (Recall)
Real data was obtained from one of AT&T's networks for part of a day.
Real Data (Precision)
Real data has no guarantee of having the small tail property
Varying frequency at query time
The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% to 0.02%
Conclusions and extensions
A new method that can cope with dynamic datasets is proposed.
It would be interesting to try to use the algorithm to compare the differences in frequencies between different datasets.
Can we find combinatorial designs that achieve the same properties, but with a deterministic construction, for maintaining hot items?