Transcript Talk

Stochastic Streams: Sample
Complexity vs. Space Complexity
David Woodruff
IBM Almaden
Joint work with Michael Crouch, Andrew McGregor, and Greg Valiant
Motivation
• (Well-studied) Statistics question: how many samples from a
distribution are needed to estimate a property of a distribution?
• (Well-studied) Streaming question: for a given fixed stream of
samples, how much space is needed to estimate a property of a
distribution?
• Our work: understand the tradeoff between the sample and space
complexity
Model
4
3
7
3
1
1
2
…
• Algorithm sees a stream of i.i.d. samples from a distribution
• Algorithm only has 1 pass over the samples
• Goal: understand the tradeoff between the number t of samples
needed to solve a problem, versus the space s of the algorithm
Problems
• (Statistics) Given t samples from a distribution p = p1 , … , pn on n
items, estimate the collision probability i p2i up to a 1 + ϵ - factor
• (Graph Problems) Given t independent edges chosen with
replacement from a graph G, decide if G is connected
• (Linear Algebra) Given t independent samples from a subspace S of
GF 2 d , determine if S has dimension d/2 or dimension d
Talk Outline
• Sample/Space Tradeoff for Collision Probability Estimation
• Sample/Space Tradeoff for Deciding Connectivity
• Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Collision Probability
• Given t samples from distribution p = p1 , … , pn on n items, estimate the collision
probability i p2i up to a 1 + ϵ - factor
• Collision probability is called F2
• If t = o(n.5 ), impossible with any amount of space
• Distinguish p =
1
1
,…,
n
n
2
n
2
n
vs. p = ( , … , , 0, … , 0)
• Our algorithm
n
t
• For any t = Ω𝜖 (n.5 ), can 1 + ϵ -approximate F2 with t samples and Oϵ (1 + ) space
Collision Probability Algorithm
• Break the t samples into t/w contiguous groups of w samples
4 … 3
7 …
Group 2
Group 1
3
1 …
1
…
Group 3
• For each group of samples a1 , … , aw let Xi,j = 1 if ai = aj , and let
1
X=
i≠j X i,j be the probability of a collision on the group
w(w−1)
• Use w log n bits of space to compute an estimate X for a group, and average
estimates over t/w groups
Collision Probability Algorithm
• E X = E Xi,j =
2
p
k k = F2
F2
w w−1
n.5 F2
Θ(
)
w
• Var X ≤
+
• Chebyshev’s inequality implies a 1 + ϵ -approximation to F2 with
n
n.5
error probability O
2 +
2
wt ϵ
• Set t >
n.5
, and
𝜖2
w=O
tϵ
n
tϵ2
Collision Probability Lower Bound
• Use lower bound for random order streams
• Case 1: see a stream a1 , … , at of t < n distinct items from a universe U
• Case 2: see a stream a1 , … , at of t-r distinct items together with an item i
which occurs r times, all from universe U
• Order of streams is random
t
• [AOMP, GH] any streaming algorithm needs Ω 2.5 space to distinguish
r
t
the cases, even with an infinite random tape (conjectured space: Ω 2 )
r
Collision Probability Lower Bound
• Choose a random function h: U -> [n]
• Given a stream a1 , … , at of items in a random order, feed the algorithm for IID
streams the stream h(a1 ), … , h(at )
a1 a2 a3 a4 a5
… at
h(a1 ) h(a2 ) h(a3 ) h(a4 ) h(a5 ) …
h(at )
Collision Probability Lower Bound
• If a1 , … , at are distinct, obtain IID samples from distribution
1
1
( ,…, )
n
n
• If a1 , … , at are distinct together with an item i occurring r times, roughly
see IID samples from distribution
1
r
1
r r 1
r
1
r
− , …to
, F
− , , − ,…, − )
Question: (Extend
n
nt
n
nt t n
nt
n
nt
k
• If r > t/n.5 , then F2 in two cases differs by a constant factor
5
n4
• Implies w = Ω( 1.5) (Conjectured =
t
n
Ω( ))
t
Talk Outline
• Sample/Space Tradeoff for Collision Probability Estimation
• Sample/Space Tradeoff for Deciding Connectivity
• Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Graph Connectivity
• Given t independent edges chosen with replacement from graph G,
decide if G is connected
• Simulate a random walk starting at node 1
• Store current vertex
• If see an edge not incident to the current vertex, discard it
• Remember first node i which you haven’t seen. Finish when i > n
Graph Connectivity
2
Start at
vertex 1
Current Vertex:
1
3
1
First Untouched Vertex: done
23
4
4
See IID Stream: {1, 4}, {2, 3}, {1, 4}, {3,4}, {1,2}, {2, 3}, {1,2}, {3,4}
do nothing
do nothing
do nothing
O(log n) space, and O m (mn2 ) = O m2 n2 samples
The Loopy Graph
• For each vertex v in G, add m − dv self-loops to vertex v
• The resulting “loopy graph” H is m-regular and has mn edges
• Perform previous algorithm on H
• Each sample is important!
• O(log n) space, but only O mn n2 = O mn3 samples
Use More Space and Fewer Samples
• We have an algorithm with O(log n) space and O mn3 samples
How can we use more space but fewer samples?
• Take p random walks in loopy graph, and remember current vertex of each one
• O(p log n) space
• Use each sample to update all p random walks
• Issue: random walks not independent
• Fix: can simulate independence since most walks don’t move on a given sample
What to do with p independent random walks?
Space/Time Tradeoff for Connectivity [Feige]
• For any k, the following is a correct algorithm whp:
• (Phase 1) Ensure graph has no connected components of size ≤ k
• (Phase 2) Sample
n log n
k
x
vertices and verify the samples are connected
x
If there is asuppose
component
Otherwise,
we
with ≤ k vertices, declare
are
in phase 2
graph is disconnected
x
x
Will sample a vertex from
each group of k vertices
Implementation in the IID Model
• For any p ≤ n, can implement algorithm with O(p) space and
mn2
O( 2 )
p
samples
• Set k = O(n/p)
• Even for p = O(1), this improves our earlier O(log n) space and O mn3 samples
• Phase 1: sample O(p) nodes at random, run random walk on each of them
estimate in O(1) space the number of distinct nodes visited in each walk
• Phase 2: sample O(p) nodes at random, run random walk on each of them
keep track of which sampled nodes are connected
Talk Outline
• Sample/Space Tradeoff for Collision Probability Estimation
• Sample/Space Tradeoff for Deciding Connectivity
• Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Determining if a Subspace Has Full Rank
• Given t IID samples from a subspace S of GF 2 d , is rank(S) = d/2 or rank(S) = d?
• O(log d) space algorithm: choose a random standard unit vector ei and check if
any sample equals ei
• After O 2d samples, if S has full rank, will see ei
• If S has rank d/2, with probability ½, will never see ei
• Our Lower Bound: Any algorithm succeeding with probability > 2/3 and using o(d)
space must use 2Ω d samples
Statistical Query Framework
• A “statistical query” algorithm (s.q. algorithm) is adaptively proposes
functions f1 , f2 , … with fi : 0,1 d → −1, 1 and receives estimates of
Ex fi x that are corrupted via adversarial noise
• An s-query s.q. algorithm with tolerance τ is an algorithm that:
• for every rank d/2 subspace S, after querying f1 , … , fs with responses r1 , … , rs with
ri − Ex uniform in S [fi x ] ≤ τ, it outputs “rank d/2” with probability 3/4
• if the responses r1 , … , rs satisfy ri − Ex uniform in 0,1 d [fi (x) ≤ τ, the algorithm
outputs “full rank” with probability 3/4
• Main Lemma: For any fi ,
d
Pr Ex uniform on
S
0,1 d fi
x − Ex uniform on S fi x
−8
>2
d
−4
≤2
Statistical Query Framework
Each of t players in a communication game receives a sample from GF 2d
P1
P2
P3
Pt
…
Each message depends on all previous messages and is at most s bits
[SVG] If there is a communication protocol with probability 1-δ, then there is an
δ
O(st) query s.q. algorithm with probability 1-2δ with tolerance s
t2
Set s = o d and t = 2Ω(d) and apply our main lemma
Conclusions
• Studied space versus sample tradeoffs in the data stream model
• Obtained tradeoffs for statistical, graph, and linear algebra problems
• Open questions: tighten our bounds
• General question: unify the techniques for the different problems