The Average Case Complexity of Counting Distinct Elements

David Woodruff
IBM Almaden
Problem Description
• Given a data stream of n insertions of records, count the
number F0 of distinct records
• One pass over the data stream
• Algorithms must use a small amount of memory and have
fast update time
– too expensive to store the set of distinct records
– implies algorithms must be randomized and must
settle for an approximate solution: output
F ∈ [(1-ε)F0, (1+ε)F0] with constant probability
The Data Stream
• How is the data in the stream organized?
– Usually, assume worst-case ordered
• In this case, Θ(1/ε²) bits are necessary and sufficient to estimate F0
– As the quality of approximation improves to, say, ε = 1%, the quadratic
dependence is a major drawback
– Sometimes a random ordering can be assumed
• Suppose we are mining salaries, and they are ordered alphabetically by
surname. If there is no correlation between salaries and surnames, then the
stream of salaries is randomly ordered
• The backing sample architecture assumes data randomly ordered by design
(Gibbons, Matias, Poosala)
• This model is referred to as the Random-Order Model (Guha, McGregor)
• Unfortunately, even in this model, we still need Ω(1/ε²) bits to estimate F0
(Chakrabarti, Cormode, McGregor). Intuitively, this is because the data values
are still worst-case; only their order is random.
Random Data Model
• In an attempt to bypass the Ω(1/ε²) bound, we propose to study the
case when the data comes from an underlying distribution.
– Problem 3 of Muthukrishnan’s book: “Provide improved estimates for Lp
sums including distinct element estimation if input stream has statistical
properties such as being Zipfian.”
• There is a distribution defined by probabilities p_i, 1 ≤ i ≤ m, with Σ_i p_i = 1.
• The next item in the stream is chosen independently of previous
items, and is i with probability p_i.
• We call this the Random-Data Model.
• The Random-Data Model is contained in the Random-Order Model.
Random Data Model
• This model for F0 was implicitly studied before
– by Motwani and Vassilvitskii when the distribution is Zipfian. This
distribution is useful for estimating WebGraph statistics
– sampling-based algorithms used in practice impose distributional
assumptions, without which they have poor performance (Charikar et al)
– the Generalized Inverse Gaussian-Poisson (GIGP) model studies sampling-based
estimators when the distribution is uniform and when it is Zipfian
– by Guha and McGregor for estimating the density function of an
unknown distribution, which is useful in learning theory
Further Restriction
• We focus on the case when each probability p_i = 1/d for an unknown
value of d (so uniform over a subset of [m])
• Captures the setting of sampling with replacement from a set of unknown
cardinality
• For a certain range of d, we show that one can beat the space lower
bound that holds for adversarial data and randomly-ordered data
• For another choice of d, we show the lower bound for adversarial
and randomly-ordered data also applies in this setting
• The distribution is fairly robust, in the sense that other distributions with a
few heavy items, and remaining items that are approximately uniform,
have the same properties above
Our Upper Bound
• 1-pass algorithm with an expected O(d(log 1/ε)/(nε²) + log m) bits of
space, whenever d ≥ 1/ε² and d ≤ n. The per-item processing time
is constant.
• Recall the distribution is uniform over a d-element subset of [m], and
we see n samples from it, so this is a typical setting of parameters.
• Notice that for n even slightly larger than d, the algorithm does much
better than the Ω(1/ε²) lower bound in other data stream models.
• One can show that for every combination of known algorithms with
different space/time tradeoffs for F0 in the adversarial model, our
algorithm is either better in space or in time.
Our Lower Bound
• Our main technical result is that if n and d are Θ(1/ε²), then even
estimating F0 in the random data model requires Ω(1/ε²) space
• Lower bound subsumes previous lower bounds, showing that even
for a natural (random) choice of data, the problem is hard
• Our choice of distribution for showing the lower bound was used in
subsequent work by Chakrabarti and Brody, where it turned out to be
useful for establishing an Ω(1/ε²) lower bound for constant-pass
algorithms for estimating F0
Techniques: Upper Bound
• Very simple observation:
– Since d ≤ n, each item should have frequency n/d in the stream.
– If n/d is Ω(1/ε²), we can just compute n and the frequency
of the first item in the stream to get a (1+ε)-approximation to d
– Using a balls-and-bins occupancy bound of Kamath et al, a good
estimate of d implies a good estimate of F0
• If n/d is less than 1/ε², we could instead store the first O(1/ε²) items
(hashed appropriately for small space), treat them as a set S, and
count the number of items in the remaining part of the stream that
land in S
– Correct, but unnecessary if d is much less than n.
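To make these two cases concrete, here is a minimal Python sketch, assuming in-memory lists in place of a true stream and exact sets in place of the hashed storage; the function names and constants are ours, not the paper’s:

```python
def estimate_d_from_frequency(stream, n):
    # Case n/d >= 1/eps^2: the first item's frequency concentrates around
    # n/d, so n divided by that frequency approximates d.
    first = stream[0]
    freq = sum(1 for x in stream if x == first)
    return n / freq

def estimate_d_by_collisions(stream, eps):
    # Case n/d < 1/eps^2: remember the first O(1/eps^2) distinct items as a
    # set S, then count how many of the later items land in S.
    s_size = int(1 / eps**2)
    S, later, hits = set(), 0, 0
    for x in stream:
        if len(S) < s_size:
            S.add(x)
        else:
            later += 1
            hits += (x in S)
    # Each later sample lands in S with probability |S|/d,
    # so d ~ |S| * later / hits.
    return len(S) * later / max(hits, 1)

def f0_from_d(d_hat, n):
    # Occupancy bound (Kamath et al): E[F0] = d * (1 - (1 - 1/d)^n),
    # so a good estimate of d yields a good estimate of F0.
    return d_hat * (1 - (1 - 1/d_hat) ** n)
```

The second function stores Θ(1/ε²) items, which is exactly the cost the next slide avoids when d is much less than n.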
Techniques: Upper Bound
• Instead, record the first item x in the stream, and find the position i of
the second occurrence of x in the stream.
• The second occurrence should appear at roughly position d in the stream,
so i provides a constant-factor approximation to d
• Since n = Ω(d), position i should be in the first half of the stream with
large constant probability.
• Now store the first i/(nε²) distinct stream elements in the second half
of stream, treat them as a set S, and count the remaining items in
the stream that occur in S.
– Good enough for (1+²)-approximation
– Space is O(log n + (i log m)/(nε²)) ≈ O(log n + (d log m)/(nε²))
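A sketch of this refinement under the same assumptions (plain Python sets stand in for the hashed storage; the name estimate_d_refined and the fallback for a non-repeating first item are ours):

```python
def estimate_d_refined(stream, n, eps):
    # The position i of the second occurrence of the first item is a
    # constant-factor approximation to d.
    x = stream[0]
    i = next((t for t in range(1, n) if stream[t] == x), n // 2)  # fallback if x never repeats
    # Store the first ~i/(n*eps^2) distinct items of the second half as S,
    # then count how many of the remaining items land in S.
    s_size = max(1, round(i / (n * eps**2)))
    S, later, hits = set(), 0, 0
    for y in stream[n // 2:]:
        if len(S) < s_size:
            S.add(y)
        else:
            later += 1
            hits += (y in S)
    return s_size * later / max(hits, 1)  # d ~ |S| * later / hits
```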
Techniques: Upper Bound
• Space is O(log n + (d log m)/(nε²)), but we can do better.
• For each j in [m], sample j independently with probability 1/(iε²). In
expectation, the new distribution is now uniform over d/(iε²) items. If j is
sampled, say j survives.
• Go back to the previous step: store the first i/(nε²) distinct surviving
stream elements in the second half of the stream, treat them as a set S,
and count the remaining items in the stream that occur in S.
– Since only Θ(1/ε²) items survive, we can store S with only (i log 1/ε)/(nε²)
bits by hashing item IDs down to a range of size, say, 1/ε⁵
– We estimate the distribution’s support size in the sub-sampled stream,
which is roughly d/(iε²). We can get a (1+ε)-approximation to this
quantity provided it is at least Ω(1/ε²), which it is with high probability.
Then scale by iε² to estimate d, and thus F0 by the previous reasoning.
– Constant update time
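Bookkeeping for the stated space bound, as a sketch of our own arithmetic (using i = Θ(d), and hashed IDs of O(log(1/ε)) bits each since the hash range is 1/ε⁵):

```latex
\underbrace{O(\log n)}_{\text{position } i}
+ \underbrace{O(\log m)}_{\text{first item } x}
+ \underbrace{\frac{i}{n\varepsilon^2}\cdot O\!\left(\log\frac{1}{\varepsilon}\right)}_{\text{set } S \text{ of hashed IDs}}
= O\!\left(\frac{d\log(1/\varepsilon)}{n\varepsilon^2} + \log m\right).
```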
Techniques: Lower Bound
Consider the uniform distribution μ on [d] = {1, 2, …, d}. d and n are Θ(1/ε²).
Consider a stream of n samples from μ, where we choose n so that
1-(1-1/d)^{n/2} ∈ [1/3, 2/3]
Let X be the characteristic vector of the first n/2 stream samples on the universe [d]. So
X_i = 1 if and only if item i occurs in these samples.
Let Y be the characteristic vector of the second n/2 stream samples on the universe [d].
So Y_i = 1 if and only if item i occurs in these samples.
Let wt(X), wt(Y) be the number of ones in vectors X, Y.
We consider a communication game. Alice is given X and wt(Y), while Bob is given Y and
wt(X), and they want to know if
Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
Alice and Bob should solve this problem with large probability, where the probability is
over the choice of X and Y, and their coin tosses (which can be fixed).
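For intuition, a small simulation of one instance of this game (purely illustrative; the parameters and seed are arbitrary choices of ours):

```python
import random

def game_instance(d, n, seed=0):
    rng = random.Random(seed)
    # X, Y: characteristic vectors of the two halves of a random stream on [d].
    X = [0] * d
    for _ in range(n // 2):
        X[rng.randrange(d)] = 1
    Y = [0] * d
    for _ in range(n // 2):
        Y[rng.randrange(d)] = 1
    wtX, wtY = sum(X), sum(Y)
    delta = sum(a != b for a, b in zip(X, Y))  # Hamming distance Delta(X, Y)
    tau = wtX + wtY - 2 * wtX * wtY / d        # expected distance at these weights
    return delta >= tau                        # the bit Alice and Bob must decide
```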
Techniques: Lower Bound
Strategy
(1) show the space complexity S of the streaming algorithm is at least the
one-way communication complexity CC of the game
(2) lower bound CC
Theorem: S ≥ CC
Proof: Alice runs the streaming algorithm on a random stream a_X generated
by her characteristic vector X. Alice transmits the state to Bob.
Bob continues the computation of the streaming algorithm on a random
stream a_Y with characteristic vector Y.
At the end the algorithm estimates F0 of a stream whose elements are in
the support of X or Y (or both)
Notice that the two halves of the stream are independent in the random
data model, so the stream generated has the right distribution
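The reduction itself is mechanical. Here is a sketch against a hypothetical one-pass interface (init/update/output are our stand-ins for an arbitrary streaming algorithm, not an actual API):

```python
def simulate_reduction(init, update, output, a_X, a_Y):
    state = init()
    for item in a_X:              # Alice runs the algorithm on her half-stream
        state = update(state, item)
    # --- the state crosses the channel; its size lower-bounds the space S ---
    for item in a_Y:              # Bob resumes the computation on his half
        state = update(state, item)
    return output(state)          # (1 +/- eps)-approximation to F0 of the whole stream
```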
Techniques: Lower Bound
We show the estimate of F0 can solve the communication game
Remember, the communication game is to decide whether
Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
Note the quantity on the right is the expected value of Δ(X,Y)
Some Lemmas
(1) Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1-o(1)
(2) Consider the variable X’, distributed as X, but conditioned on wt(X) = k.
The variable Y’ is distributed as Y, but conditioned on wt(Y) = r.
– X’ and Y’ are uniform over strings of Hamming weight k and r, respectively
– Choose k, r to be integers in [d/4, 3d/4]
Then for any constant δ > 0, there is a constant α > 0, so that
Pr[|Δ(X’,Y’) – E[Δ(X’,Y’)]| ≥ α√d] ≥ 1 - δ
Follows from the standard deviation of the hypergeometric distribution
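Expanding the expectation in Lemma (2), a derivation we add for completeness: fix X’ of weight k, and let Y’ be a uniformly random weight-r string, so each coordinate of Y’ equals 1 with marginal probability r/d. By linearity of expectation,

```latex
\mathbb{E}[\Delta(X',Y')]
= \sum_{t=1}^{d} \Pr[X'_t \neq Y'_t]
= k\left(1-\frac{r}{d}\right) + (d-k)\,\frac{r}{d}
= k + r - \frac{2kr}{d}.
```

The coordinates of Y’ are sampled without replacement, which is why Δ(X’,Y’) follows a (shifted) hypergeometric law.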
Techniques: Lower Bound
• Δ(X,Y) = 2F0(a_X ∘ a_Y) - wt(X) - wt(Y)
• Note that Bob has wt(X) and wt(Y), so he can compute
τ = wt(X) + wt(Y) - 2wt(X)wt(Y)/d
• If F is the output of the streaming algorithm, Bob simply computes
2F - wt(X) - wt(Y) and checks if it is greater than τ
• F ∈ [(1-ε)F0, (1+ε)F0] with large constant probability
• 2F - wt(X) - wt(Y) ∈ [2F0 - wt(X) - wt(Y) - 2εF0, 2F0 - wt(X) - wt(Y) + 2εF0]
= [Δ(X,Y) - 2εF0, Δ(X,Y) + 2εF0]
= [Δ(X,Y) - Θ(εd), Δ(X,Y) + Θ(εd)]   (since F0 = Θ(d))
= [Δ(X,Y) - Θ(√d), Δ(X,Y) + Θ(√d)]   (since ε = Θ(1/√d))
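The identity in the first bullet is inclusion-exclusion on the supports, writing |X ∧ Y| for the number of coordinates where both vectors are 1:

```latex
F_0(a_X \circ a_Y) = \mathrm{wt}(X) + \mathrm{wt}(Y) - |X \wedge Y|,
\qquad
\Delta(X,Y) = \mathrm{wt}(X) + \mathrm{wt}(Y) - 2|X \wedge Y|
            = 2F_0(a_X \circ a_Y) - \mathrm{wt}(X) - \mathrm{wt}(Y).
```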
Techniques: Lower Bound
• By our lemmas, and using that E[Δ(X,Y)] = τ, with large constant
probability either Δ(X,Y) > τ + α√d or Δ(X,Y) < τ - α√d
• By the previous slide, Bob’s value 2F - wt(X) - wt(Y) is in the range
[Δ(X,Y) - Θ(√d), Δ(X,Y) + Θ(√d)]
• If Δ(X,Y) > τ + α√d, then 2F - wt(X) - wt(Y) > τ + α√d - Θ(√d) ≥ τ
(choosing constants so that α dominates the Θ(√d) term)
• If Δ(X,Y) < τ - α√d, then 2F - wt(X) - wt(Y) < τ - α√d + Θ(√d) ≤ τ
• So Bob can use the output of the streaming algorithm to solve the
communication problem, so S ≥ CC
Lower Bounding CC
Communication game: Alice is given X and wt(Y), while Bob is given Y
and wt(X), and they want to know if
Δ(X,Y) ≥ wt(X) + wt(Y) - 2wt(X)wt(Y)/d
Here X, Y have the distribution of being independent characteristic
vectors of a stream of n/2 samples, where
1-(1-1/d)^{n/2} ∈ [1/3, 2/3]
We have Pr[d/4 ≤ wt(X), wt(Y) ≤ 3d/4] = 1-o(1)
By averaging, a correct protocol is also correct with large probability for
fixed weights i and j in [d/4, 3d/4], so we can assume X is a random string
of Hamming weight i, and Y a random string of Hamming weight j.
Lower Bounding CC
[Figure: a 0/1 matrix whose rows are indexed by the strings X with wt(X) = i
and whose columns by the strings Y with wt(Y) = j.]
Since i, j ∈ [d/4, 3d/4], one can show that the fraction of 1s in each row is
roughly balanced, i.e., in [1/2-o(1), 1/2+o(1)], so the expected fraction of 1s
in each column is also in [1/2-o(1), 1/2+o(1)].
For a message M of Alice, say M is large if many different X cause Alice to
send M.
We show that with large probability, for any large message M, the fraction of
1s in most columns is in, say, [1/10, 9/10]. Then Bob doesn’t know what to
output.
Recall that an entry is 1 if and only if Δ(X,Y) ≥ i + j - 2ij/d
Lower Bounding CC
Since each row is roughly balanced, the expected fraction of 1s in
each column is in [1/2-o(1), 1/2+o(1)].
But the variance could be huge, e.g., the matrix may look like this,
in which case Bob can easily output the answer.
[Figure: a matrix in which each row is roughly balanced between 0s and 1s,
but each column is almost entirely 1s or almost entirely 0s.]
Lower Bounding CC
• Can show this doesn’t happen by the second-moment method.
• Let V_y be the fraction of 1s, among rows in S, in the column indexed by y.
• Let S be the set of x which cause Alice to send a large message M.
• Consider a random column Y, and write V = V_Y. Let C_u = 1 if and only if
Δ(u,Y) > τ. So V = (1/|S|) Σ_{u ∈ S} C_u.
• Var[V] = (1/|S|²) Σ_{u,v ∈ S} (E[C_u C_v] - E[C_u]E[C_v]).
• Notice that E[C_u]E[C_v] ∈ [1/4-o(1), 1/4+o(1)],
while E[C_u C_v] = Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · Pr[Δ(v,Y) > τ]
≈ Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] · ½.
• Since S is a large set, most pairs u, v ∈ S have Hamming distance in [d/5, 4d/5].
– A technical lemma shows that Pr[Δ(u,Y) > τ | Δ(v,Y) > τ] is a constant strictly
less than 1.
• Hence, E[C_u C_v] is a constant strictly less than 1/2, and since this happens
for most pairs u, v, we get that Var[V] is a constant strictly less than 1/4.
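Putting the bullets together, our arithmetic for the variance bound, with c > 0 the constant gap from the technical lemma (holding for a 1-o(1) fraction of pairs):

```latex
\mathrm{Var}[V]
= \frac{1}{|S|^2}\sum_{u,v\in S}\Bigl(\mathbb{E}[C_u C_v]-\mathbb{E}[C_u]\,\mathbb{E}[C_v]\Bigr)
\le \left(\frac{1}{2}-c\right) - \left(\frac{1}{4}-o(1)\right) + o(1)
= \frac{1}{4} - c + o(1).
```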
Lower Bounding CC
The fraction of 1s in a random column is just V = (1/|S|) Σ_{u ∈ S} C_u.
Let κ be a small positive constant. By Chebyshev’s inequality,
Pr[|V - 1/2| > 1/2 - κ] ≤ Pr[|V - E[V]| > 1/2 - κ - o(1)] ≤ Var[V]/(1/2 - κ)² + o(1)
But we showed Var[V] is a constant strictly less than 1/4, so this probability is a
constant strictly less than 1 for small enough κ.
Hence, for a random column Y, the fraction of 1s is in [κ, 1-κ] with constant
probability.
It follows that with some constant probability Bob outputs the wrong answer.
Hence, most messages of Alice must be small, so one can show that there
must be 2^Ω(d) of them, so that the communication is Ω(d) = Ω(1/ε²).
Conclusions
• Introduced the random data model, and studied F0-estimation under
distributions uniform over a subset of d items
• For a certain range of d, we show that one can beat the space lower
bound that holds for adversarial data and randomly-ordered data
• For another choice of d, we show the lower bound for adversarial
and randomly-ordered data also applies in this setting
• Are there other natural distributions that admit more space-efficient
algorithms in this model? Are there other useful ways of bypassing
the (1/²2) lower bound?