Transcript Slides
Data Stream Algorithms
Ke Yi
Hong Kong University of Science and Technology
Problem One: Missing Card
I take one from a deck of 52 cards, and pass the rest to
you. Suppose you only have a (very basic) calculator and
bad memory, how can you find out which card is missing
with just one pass over the 51 cards?
What if there are two missing cards?
2
A data stream algorithm …
3
Makes one pass over the input data
Uses a small amount of memory (much smaller than the
input data)
Computes something
Why do we need streaming algorithms
Networking
Often get to see the data once
Don’t want to store the entire data
Databases
Data stored on disk, sequential scans are much faster
Data stream algorithms have been a very active research
area for the past 15 years
Problems considered today
4
Missing card
Majority
Heavy hitters
Problem two: Majority
Given a sequence of items, find the majority if there is one
AA B C D B AA B B AAAAAA C C C D A B AAA
Answer: A
Trivial if we have O(n) memory
Can you do it with O(1) memory and two passes?
First pass: find the possible candidate
Second pass: compute its frequency and verify that it is > n/2
How about one pass?
5
Unfortunately, no
Problem three: Heavy hitters
Problem: find all items with counts > φn, for some 0< φ<n
Relaxation:
If an item has count > φ n, it must be reported, together with its
estimated count with (absolute) error < εn
If an item has count < (φ − ε) n, it cannot be reported
For items in between, don’t care
In fact, we will solve the most difficult case φ = ε
Applications
6
Frequent IP addresses
Data mining
The algorithm
[Metwally, Agrawal, Abbadi, 2006]
Input parameters: number of counters m, stream S
Algorithm:
for each element e in S {
if e is monitored {
find counter of e, counteri;
counteri++;
} else {
find the element with least frequency, em, denote its frequency min;
replace em with e;
assign counter for e with min+1;
}
}
7
Properties of the Algorithm
Actual count of a monitored item ≤ counter
Actual count of a monitored item ≥ counter – min
Actual count of an item not monitored ≤ min
Proof by induction
The sum of the counters maintained is n
Why?
So min <= n/m
If we set m = 1/ε, it’s sufficient to solve the heavy hitter
problem
8
Why?
So the heavy hitter problem can be solved in O(1/ε) space
How about deletions?
Any deterministic algorithm needs Ω(n) space
Why?
In fact, Las Vegas randomization doesn’t help
Will design a randomized algorithm that works with high
probability
9
For any item x, we can estimate its actual count within error εn
with probability 1-δ for any small constant δ
The Count-Min sketch
[Cormode, Muthukrishnan, 2003]
A Count-Min (CM) Sketch with parameters ( , ) is represented by
a two-dimensional array counts with width w and depth d : count[1,1] count[d , w]
Given parameters
2
1
( , ) , set w and d log .
Each entry of the array is initially zero.
d
hash functions are chosen uniformly at random from a 2-univeral family
For example, we can choose a prime number p > u, and random aj, bj, j=1,…,d. Define:
h1 ,, hd : {1n} {1 w}
hj(x) = (aj x + bj mod p) mod w
Property: for any x ≠ y, Pr[hj(x)=hj(y)] ≤ 1/w
10
Updating the sketch
Update procedure :
When item x arrives, set 1 j d
count[ j , h j ( x)] count[ j , h j ( x)] 1
h1
x
hd
1
1
1
1
When item x is deleted, do the same except changing +1 to -1
11
Estimating the count of x
aˆ x min count[ j, h j ( x)]
Q (x )
j
actual count
Theorem 1
estimated count
ax aˆ x
Pr[aˆ x ax n]
12
Proof
We introduce indicator variables
I x, y, j
( x y ) (h j ( x) h j ( y ))
1
if
0
otherwise
1
E ( I x , y , j ) Pr[ h j ( x) h j ( y )]
w 2
Define the variable
I x, j I x, y, j a y
y
By construction,
13
count[ j , h j ( x)] a x I x , j
min count[ j , h j (i )] ai
For the other direction, observe that
E ( I x, j ) E I x, y , j a y a y E ( I x , y , j ) n / 2
y
y
Pr[ aˆ x a x n] Pr[j , count[ j , h j ( x)] a x n]
Pr[j , a x I x , j a x n]
Pr[j, I x, j 2E( I x, j )] 2d
Markov inequality
Pr[ X t ]
E( X )
t
t 0
1
1
So, the Count-Min Sketch has size O log
14
■