Optimal Space Lower Bounds for all Frequency Moments
Download
Report
Transcript Optimal Space Lower Bounds for all Frequency Moments
Optimal Approximations
of the Frequency
Moments of Data
Streams
Piotr Indyk
David Woodruff
The Streaming Model
4
3
7
3
1
1
7
…
Stream of elements a1, …, an each in {1, …, m}
Want to compute statistics on stream
Elements arranged in adversarial order
Algorithms given one pass over stream
Goal: Minimum space algorithm
Frequency Moments [AMS96]
n = stream size, m = universe size
fi = # occurrences of item i
k-th moment
F0 = # of distinct elements
F1 = n = stream size
F2 = self-join size
Why are frequency moments important?
Applications
Estimating distinct elements with low space
Estimate query selectivity to huge DB without sorting
Routers gather # distinct destinations
F2 estimates size of self-joins:
Bob
a
c
fB2
b
y
x
Bob
a
x
, Alice y
Bob
a
z
Bob
Bob
c
x
Bob
c
z
Bob
Alice b
Bob
Alice
+
fA2
z
=4+1=5
Fk measures data skewness
The Best Deterministic Algorithm
Trivial algorithm for Fk
Store/update fi for each item i, sum fik at end
Space = O(mlog n): m items i, log n bits to count fi
Negative Results [AMS96]:
Compute Fk exactly (m) space
Any deterministic alg. outputs X with
|Fk – X| < Fk must use (m) space
What about randomized algorithms?
Randomized Approx Algs for Fk
Randomized alg. -approximates Fk if outputs X s.t.
Pr[|Fk – X| < Fk ] > 2/3
Previous work
(table suppresses polylog mn)
Upper
F0
1/2
F1
F2
Fk
1
1/2
m1-1/(k-1)
Lower
[FM85, GT02,
BJKST02]
[AMS96]
[CK04, G04]
1/2
[IW03, W04]
1
1/2
m1-2/k
[W04]
[BJKS02]
Matching Upper Bound
Our Contribution:
For every k there is a 1-pass O~(m1-2/k)
space algorithm to -approximate Fk
Additional Features:
1. Works even if we allow deletions, that is,
stream of elements (i, +), (i,-)
2. Constant update time
Techniques
Previous Algorithms [AMS96, CK04, G04]
1. Cleverly construct small-space estimator X s.t.
E[X] = Fk
Var[X] small
2. Apply Chebyshev’s inequality
Our “algorithm’’
1. Divide frequencies into “buckets”
0, [1, 2), [2, 4), [4, 8), …, [2i-1, 2i), …
2. Estimate size si of each bucket
3. Output X = i si 2ik
What’s Left?
Remaining Problem: Estimate si = # of
elements with frequency in each bucket [2i-1, 2i)
Is this always easy? No.
Suppose always easy – then could approximate
the maximum frequency
This is HARD – (m) space [AMS96]
However, (m) only applies to “worst-case” streams,
otherwise can do better: Countsketch [CCF-C]
For the moment, let’s assume:
1.
9 a 1-pass oracle Max returning the
maximum frequency using O(B) space
(we remove this using CountSketch)
Max
frequency
items
2. We have a very long RAM of random bits
(we remove this using Nisan’s generator)
0
1
1
0
0
0
1
…
General Idea: Max + Sampling
4
Restrict input stream to a random subset of
items in {1, …, m}, where items are included
independently with probability p.
3
7
3
1
Random subset = {1, 3}
3
3
1
1
…
1
7
…
General Idea: Max + Sampling
Restrict input to a random subset of items in
{1, …, m}, where items are included independently
with probability p.
What are chances the maximum lies in
Si = elements r such that fr 2 [2i-1, 2i)?
q = (1-p) j > i sj ¢ (1 – (1-p)si)
Idea: 1. Estimate q as q’ by taking independent trials
and computing fraction of max in Si
2. If already estimated sj for j > i, solve this
expression for si.
When is this estimate any good?
Recall q = (1-p){j > i} sj (1 – (1-p)si), so estimate si:
Need 1.
2.
(holds inductively)
(tight concentration of q’)
Requires 9 p so that q > 1/R, where
R = # trials used to estimate q
When is this estimate any good?
q = (1-p)j > i sj (1 – (1-p)si)
p too large? ! q too small
p too small? ! q too small
Motivates the following:
Say a class Si contributes if and only if si > j > i sj /R
If R = (log n), then Fk ¼ contributing i si 2ik
The Idealized Algorithm
1.
Use the random string to generate hash functions hjr :
[m] -> [2j] for j 2 [log m] and r 2 [R]
2.
Restrict stream Str to Strjr, those items i with hjr(i) = 1
3.
For each Strjr, compute Max(Strjr)
4.
5.
To estimate si given s’t for t > i, find some j for which
“enough” of the Max(Strjr) come from Si, and then set
Output F’k = i s’i 2ik
Removing the assumptions
1.
Assumption: 9 a 1-pass oracle Max returning the
maximum frequency using O(B) space
[CCF-C02]: 9 a 1-pass O(B)-space algorithm CountSketch
which, given stream Str, outputs all x for which fx2 ¸ F2/B
Recall: Si contributes if and only if si > j > i sj /R
Lemma: If Si = [2i-1, 2i) contributes, then
Proof: Holder’s inequality.
Removing the assumptions
2. We have an infinite string of random bits
Consider a space-S algorithm A and a function
f, with random strings R1, …, Rn that, when
processing a stream, maintains a variable
C, and updates as follows: C = C + f(i, Ri)
[Indyk00] Then R1, …, Rn can be generated using
Nisan’s PRG, and:
1.
2.
The new algorithm A’ has space O~(S)
The outputs of A’ and A are indistinguishable
Our algorithm follows this framework
Conclusions
Result: Tight O~(m1-2/k) upper bound
Handle deletions (j, -)
O~(1) update time
Open Problem: Reduce O~ factors