Tracking most frequent items dynamically


Tracking most frequent items dynamically
Article by G. Cormode and S. Muthukrishnan
Presented by Simon Kamenkovich
Motivation

Most DB management systems maintain "hot items" statistics.
Hot items are used as simple outliers in data mining, and for anomaly detection in network applications and financial market data.
The proposed algorithm handles both deletions and insertions, with better qualities than other existing methods.
Introduction
Given: n integers in the range [1…m].
n_i(t) - the net occurrence of item i at time t (insertions minus deletions).
f_i(t) = n_i(t) / Σ_{j=1}^{m} n_j(t) - the current frequency of item i.
Item i is hot if f_i(t) > 1/(k+1).
So there may be up to k hot items, and there may be none.
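As a concrete illustration of these definitions, here is a minimal Python sketch (our own helper, not from the paper; it counts exactly, with no sublinear-space tricks): n_i(t) is the net count of item i and item i is hot when its frequency exceeds 1/(k+1).

```python
from collections import Counter

def hot_items(transactions, k):
    """transactions: list of (item, delta) with delta = +1 insert, -1 delete."""
    counts = Counter()
    for item, delta in transactions:
        counts[item] += delta            # n_i(t), the net occurrence of item i
    total = sum(counts.values())         # sum_j n_j(t)
    # item i is hot when f_i(t) = n_i(t)/total exceeds 1/(k+1)
    return {i for i, c in counts.items() if c / total > 1 / (k + 1)}
```

For example, a stream with five inserts of item 1 and a couple of other transactions has item 1 as its only hot item for k = 2.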
Preliminaries
If we're allowed O(m) space, then a simple heap will process each insert and delete in O(log m) and find all the hot items in O(k log k).
Lemma 1: Any algorithm which guarantees to find all and only items which have frequency greater than 1/(k+1) must store Ω(m) bits.
Proof: Consider any S ⊆ {1…m} and insert each of its items once, so n = |S|. To test whether i ∈ S, insert ⌊n/k⌋ further copies of i:
if i ∈ S: f_i = (⌊n/k⌋ + 1)/(n + ⌊n/k⌋) > (n/k)/(n + n/k) = 1/(k+1),
if i ∉ S: f_i = ⌊n/k⌋/(n + ⌊n/k⌋) ≤ 1/(k+1), since k·⌊n/k⌋ ≤ n.
So by querying for hot items we can extract the whole set S, and the structure must encode Ω(m) bits.
Small Tail Property
Say f_1 ≥ f_2 ≥ … ≥ f_m are the frequencies of the items in non-increasing order. The set of frequencies has a small tail if Σ_{i>k} f_i < 1/(k+1).
If there are k hot items then the small tail property holds, but the converse is not always true.
Prior works
Algorithm                  | Type          | Time per item      | Space
Misra-Gries                | Deterministic | O(log k) amortized | O(k)
Frequent                   | Randomized    | O(1) expected      | O(k)
Lossy Counting             | Deterministic | O(log(n/k))        | Ω(k log(n/k))
Charikar et al.            | Randomized    | Ω(k/ε² log n)      | Ω(k/ε² log n)
Gilbert et al. (quantiles) | Randomized    | Ω(k² log² n)       | Ω(k² log² n)
Base method
Definition: A dyadic range sum oracle returns the sum of the counts of items in the range (i·2^j + 1)…(i+1)·2^j, for 0 ≤ j ≤ log m, 0 ≤ i < m/2^j.

DivideAndConquer(l, r, thresh)
  if oracle(l, r) > thresh
    if (l = r) then output(l)
    else
      DivideAndConquer(l, ⌊(l+r)/2⌋, thresh)
      DivideAndConquer(⌊(l+r)/2⌋ + 1, r, thresh)

Theorem 1: Calling DivideAndConquer(1, m, n/(k+1)) will output all and only the hot items. A total of O(k log(m/k)) calls will be made to the oracle.
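A minimal Python sketch of this search, using exact counts as a stand-in for the oracle (a real implementation replaces it with a small-space approximate oracle); the function names are ours:

```python
def find_hot(counts, m, k):
    """counts: dict item -> net count, items in [1..m], m a power of two."""
    thresh = sum(counts.values()) / (k + 1)   # n/(k+1)

    def oracle(l, r):
        # exact dyadic range-sum oracle (stand-in for a sketch-based one)
        return sum(c for i, c in counts.items() if l <= i <= r)

    out = []
    def divide_and_conquer(l, r):
        if oracle(l, r) > thresh:             # range may contain a hot item
            if l == r:
                out.append(l)                 # singleton above threshold: hot
            else:
                mid = (l + r) // 2
                divide_and_conquer(l, mid)    # left dyadic half
                divide_and_conquer(mid + 1, r)  # right dyadic half
    divide_and_conquer(1, m)
    return out
```

Each level of the recursion keeps at most k live ranges (their counts sum to n), which is where the O(k log(m/k)) bound on oracle calls comes from.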
Oracle design
N. Alon et al. - requires O(km) space
Gilbert et al. (Random Subset Sums) - requires O(k² log m log(k/δ)) space
Charikar et al. - requires O(k log m log(k/δ)) space
Cormode-Muthukrishnan - requires O(k log m log(k/δ)) space
Group Testing
Idea: design a number of tests, each of which groups together some of the m items, in order to find up to k items which test positive.
General procedure: for each transaction on item i, we determine each subset S(i) it is included in, and increment or decrement the counters associated with those subsets.
For the insertions-only case, a majority item can be found in O(1) time and space by the algorithm of Boyer and Moore.
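For reference, the Boyer-Moore majority vote for the insert-only case is only a few lines (a standard textbook sketch, not code from the paper):

```python
def majority_candidate(stream):
    """One pass, O(1) space: returns the majority element if one exists
    (otherwise the returned candidate is arbitrary and must be verified)."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # candidate gains a vote
        else:
            count -= 1                # candidate loses a vote
    return candidate
```

The group-testing structure below generalizes this idea so that deletions are supported too.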
Deterministic algorithm
Each test includes half of the range [1..m], corresponding to the binary representations of the values.

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then c[0] = c[0] + 1
  else c[0] = c[0] - 1
  for j = 1 to log m do
    if (transtype = ins) then c[j] = c[j] + bit(j, i)
    else c[j] = c[j] - bit(j, i)
Deterministic algorithm (cont.)
FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then position = position + t
    t = 2 * t
  return position

Theorem 2: The above algorithm finds a majority item, if there is one, with O(log m) time per operation.
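A Python sketch of this deterministic structure (one total counter plus about log m bit counters), supporting both inserts and deletes; the class and method names are ours:

```python
class BitMajority:
    def __init__(self, m):
        self.bits = m.bit_length()       # number of bit tests, about log2 m
        # c[0] = total count, c[j] = count of items with bit j-1 set
        self.c = [0] * (self.bits + 1)

    def update(self, i, delta):          # delta = +1 for insert, -1 for delete
        self.c[0] += delta
        for j in range(1, self.bits + 1):
            if (i >> (j - 1)) & 1:       # bit(j, i)
                self.c[j] += delta

    def majority(self):                  # FindMajority: valid only if a majority exists
        position, t = 0, 1
        for j in range(1, self.bits + 1):
            if self.c[j] > self.c[0] / 2:
                position += t            # the majority item must have this bit set
            t *= 2
        return position
```

If a majority exists, each of its bits wins its bit test, so the decoded position is exactly the majority item; with no majority the output is meaningless and must be verified.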
Randomized Algorithm
Definition: Let F ⊆ [1…m] denote the set of hot items. A set S ⊆ [1…m] is a good set if |S ∩ F| = 1.
Theorem 3: Picking O(k log k) subsets, each drawn by taking m/k items uniformly at random from [1…m], means that with constant probability we have included k good subsets S_1, S_2, …, S_k such that each S_i is a good set and ∪_i (F ∩ S_i) = F.
Proof: The probability of picking exactly one member of F when drawing m/k items is
p = (m/k) · (k/m) · (1 - k/m)^{m/k - 1} ≥ (1 - k/m)^{m/k}.
In the non-trivial case 1 ≤ k ≤ m/2, so p ≥ 1/e².
As for the Coupon Collector Problem, picking O(k ln k) subsets guarantees that we have a good set for each hot item.
Coupon Collector Problem
X - number of trials required to collect at least one of each type of coupon
Epoch i begins after the i-th success and ends with the (i+1)-th success
X_i - number of trials in the i-th epoch
X_i is distributed geometrically with p_i = p(k-i)/k
p is the probability that a coupon is good

X = Σ_{i=0}^{k-1} X_i
E[X] = Σ_{i=0}^{k-1} E[X_i] = Σ_{i=0}^{k-1} k/(p(k-i)) = (k/p) Σ_{i=1}^{k} 1/i = (k/p) H_k = (k/p) ln k + O(k) = O(k log k)
Using hash functions
We don't store the sets explicitly - that would take O(m log k) space. Instead we choose the sets in a pseudo-random fashion using hash functions: fix a prime P > 2k, and take a and b uniformly at random from [0 … P-1]:
h_{a,b}(x) = ((ax + b) mod P) mod 2k
S_{a,b,i} = { x | h_{a,b}(x) = i }
Fact: over all choices of a and b, for x ≠ y, Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/(2k).
Choose T = log(k/δ) pairs of (a, b); each pair defines 2k sets.
Data Structure
Data structure: log(k/δ) groups × 2k subsets per group × log m counters per subset.
ProcessItem
Initialize c[0 … 2Tk][0 … log m]
Draw a[1 … T], b[1 … T]; set c = 0   // c is the running count of items

ProcessItem(i, transtype, T, k)   // called once per transaction (i, transtype)
  if (transtype = ins) then c = c + 1
  else c = c - 1
  for x = 1 to T do
    index = 2k(x-1) + ((i·a[x] + b[x]) mod P) mod 2k   // they had 2(x-1)
    UpdateCounters(i, transtype, c[index])

Space used by the algorithm is O(k log(k/δ) log m).
Lemma
Lemma 2: The probability of each hot item being in at least one good set is at least 1 - δ.
Proof: In each of the T repetitions we put a hot item i into one of 2k buckets. The expected total frequency of the other items falling into the same bucket is
(Σ_{j≠i} f_j) / 2k ≤ (1/2k) · (k/(k+1)) = 1/(2(k+1)).
If this frequency is < 1/(k+1) then the hot item is a majority in its bucket and we can find it; if it is > 1/(k+1) then we may not be able to find a majority.
By the Markov inequality, the probability of failure in one repetition is < 1/2.
So the probability of failing in all T = log(k/δ) repetitions is at most δ/k, and the probability of any hot item failing is at most δ.
GroupTest
Lemma 3: Given a subset Sa,b,i and its associated set of counters, it is possible to detect whether Sa,b,i is a good set, using the small tail property.

GroupTest(T, k, b)   // b is the threshold fraction; cb denotes c·b, with c the running item count
  for i = 1 to 2Tk do   // they had T here
    if c[i][0] > cb
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > cb and c[i][0] - c[i][j] > cb)
           or (c[i][j] < cb and c[i][0] - c[i][j] < cb)
          then skip to next i
        if c[i][j] > cb then position = position + t
        t = 2 * t
      output position
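Putting the pieces together, here is a compact Python sketch of the whole randomized structure (ProcessItem plus GroupTest at threshold 1/(k+1)). It is an illustrative implementation under our own naming, and as an assumption it takes the hash prime P larger than m as well as 2k, so the map behaves well over the whole item domain:

```python
import random

class HotSketch:
    """T hash groups, 2k buckets per group, ~log m bit counters per bucket."""

    def __init__(self, m, k, T, seed=1):
        rng = random.Random(seed)
        self.m, self.k, self.T = m, k, T
        self.bits = m.bit_length()
        self.P = self._next_prime(max(m, 2 * k))      # prime above the domain
        self.a = [rng.randrange(1, self.P) for _ in range(T)]
        self.b = [rng.randrange(self.P) for _ in range(T)]
        # c[bucket][0] = bucket count, c[bucket][j] = items in bucket with bit j-1 set
        self.c = [[0] * (self.bits + 1) for _ in range(2 * T * k)]
        self.total = 0                                # net number of live items

    @staticmethod
    def _next_prime(n):
        def is_prime(x):
            if x < 2:
                return False
            d = 2
            while d * d <= x:
                if x % d == 0:
                    return False
                d += 1
            return True
        n += 1
        while not is_prime(n):
            n += 1
        return n

    def update(self, i, delta):                       # ProcessItem: delta = +1 / -1
        self.total += delta
        for x in range(self.T):
            bucket = 2 * self.k * x + \
                ((self.a[x] * i + self.b[x]) % self.P) % (2 * self.k)
            row = self.c[bucket]
            row[0] += delta
            for j in range(1, self.bits + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += delta

    def hot_items(self):                              # GroupTest at fraction 1/(k+1)
        thresh = self.total / (self.k + 1)
        out = set()
        for row in self.c:
            if row[0] <= thresh:
                continue                              # bucket cannot hold a hot item
            position, t, good = 0, 1, True
            for j in range(1, self.bits + 1):
                hi, lo = row[j], row[0] - row[j]
                if (hi > thresh) == (lo > thresh):
                    good = False                      # zero or several candidates: skip
                    break
                if hi > thresh:
                    position += t
                t *= 2
            if good:
                out.add(position)
        return out
```

A stream where a few items dominate is decoded bucket by bucket, and since all counters are incremented and decremented symmetrically, deleting an item's copies restores the structure exactly, which is why the output is order-independent.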
Algorithm properties
Theorem 4: With probability at least 1 - δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all the hot items using O(k log(k/δ) log m) space. The time for an update is O(log(k/δ) log m), and the time to list all hot items is O(k log(k/δ) log m).
Corollary 1: If we have the small tail property, then we will output no items which are not hot.
Proof: We will look for a hot item only if one exists, and only it will be output.
Algorithm properties (cont)
Corollary 2: The set of counters created with T = log(k/δ) can be used to find hot items with parameter k' for any k' < k, with probability of success 1 - δ, by calling GroupTest(log(k/δ), k, 1/(k'+1)).
Proof: According to Lemma 2, since (1/2k) · (k'/(k'+1)) ≤ 1/(2(k'+1)).
Lemma 4: The output of the algorithm is the same for any reordering of the input data.
Experiments
The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms.
The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists.
Definitions: The recall is the proportion of the hot items found by the method to the total number of hot items.
The precision is the proportion of items identified by the algorithm which are hot to the number of all output items.
Synthetic data (Recall)
Zipf parameter for hot items: 0 - distributed uniformly, 3 - highly skewed.
Synthetic data (Precision)
GroupTesting required more memory and took longer to process each item.
Real Data (Recall)
Real data was obtained from one of AT&T's networks for part of a day.
Real Data (Precision)
Real data has no guarantee of having the small tail property…
Varying frequency at query time
The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% down to 0.02%.
The other algorithms are designed around a fixed frequency threshold supplied in advance.
Conclusions and extensions
A new method which can cope with dynamic datasets is proposed.
It would be interesting to use the algorithm to compare the differences in frequencies between different datasets.
Can we find a combinatorial design that achieves the same properties, but with a deterministic construction, for maintaining hot items?
FIN