Transcript x x 2
Data Streams and Applications in
Computer Science
David Woodruff
IBM Almaden
Presburger lecture, ICALP, 2014
Thanks to my advisors
Prof. Ron Rivest
Prof. Piotr Indyk
Prof. Andy Yao
Thanks for your mentorship and research advice, and
early guidance on a path in theoretical computer science
and my amazing summer interns
Arnab
Bhattacharyya
Marco
Molinaro
Yi
Li
Jelani
Nelson
Eric
Price
Huy
Nguyen
Grigory
Yaroslavtsev
and my awesome collaborators!
in the theory group at IBM
and throughout the world…
My current research interests
•
•
•
•
•
•
•
Communication Complexity
Data Stream Algorithms and Lower Bounds
Graph Algorithms
Machine learning
Numerical Linear Algebra
Sketching
Sparse Recovery
Talk Outline
• Data Stream Model and Sample Results
– Distinct Elements
– Frequency Moments
– Characterization of Algorithms
• Connections to Other Areas
– Compressed Sensing
– Linear Algebra
– Machine Learning
Data Streams
• A data stream is a sequence of data that is too large to be stored in
available memory
• Examples
– Internet search logs
– Network Traffic
– Financial transactions
– Sensor networks
– Scientific data streams (astronomical, genomics, physical simulations)…
Streaming Model
4
3
7
3
1
1
0
…
• Stream of elements a1, …, am each in {1, …, n}
• Single or small number of passes over the data
• Algorithms should work for any ordering of elements
• Almost all algorithms are randomized and approximate
– Usually necessary to achieve efficiency
– Randomness is in the algorithm, not the input
• Goals: minimize space complexity (in bits), processing time
Vector Interpretation
Stream: 8 2 1 9 1 9 2 4 4 9 4 2 5 4 2 5 8 5 2 5
Vector x:
1
2
3
4
5
6
7
8
9
• Think of x as an n-dimensional vector
– Initially, x = 0n
• Insertion of i is interpreted as xi = xi + 1
• Output an approximation to f(x) with high probability
(1) Distinct Elements
• Streaming model originated in work of Flajolet and Martin, ‘85
– Studied the distinct elements question
– # of distinct elements, denoted F0, is |{i | xi > 0}|
– Output a number Z with F0 · Z · (1+ε) F0 with 99% probability
– Can we do better than just storing all the coordinates of x?
– Yes, and tight bounds are known [Indyk,W],[W],[Kane,Nelson,W]:
£(1/ ε2 + log n) bits of space, O(1) processing time
– Connections: to prove the tight lower bound, the gap-hamming
communication problem was introduced
Gap-Hamming Problem
x 2 {0,1}n
y 2 {0,1}n
• Promise: Hamming distance satisfies Δ(x,y) > n/2 + εn or Δ(x,y) < n/2 - εn
• Lower bound of Ω(ε-2) for randomized 1-way communication [Indyk, W],
[W], [Jayram, Kumar, Sivakumar]
• Same for 2-way communication [Chakrabarti, Regev]
• Applications: in information complexity, functional monitoring,
embeddings, linear algebra, differential privacy, sparsifiers, …
(Andoni, Brody, Clarkson, de Wolf, Jayram, Krauthgamer, McGregor,
Mironov, Pitassi, Reingold, Sherstov, Talwar, Vadhan, Vidick, W, Zhang…)
(2) Frequency Moments
• Streaming model revived in work of Alon, Matias, and Szegedy, ’96
[AMS]
• Consider more general turnstile streaming model [coined by
Muthukrishnan]
– positive and negative updates, so xi = xi + 1 or xi = xi – 1
– summarize statistics of difference x-y of two streams of insertions
Frequency Moments
• [AMS] study frequency moments Fp = sumi=1n |xi|p , or
equivalently lp-norms
– Summarize skewness of an empirical distribution
– F2 used in computing self-join sizes, geometry and
linear algebra
– F1 used for measuring distance between distributions,
and in “robust” algorithms (regression, subspace
approximation)
Flat
Skewed
Frequency Moments
• Output a number Z with Fp · Z · (1+ε) Fp with 99%
probability
• Near-tight bounds known (Andoni, Bar-Yossef,
Braverman, Chakrabarti, Coppersmith, Cormode,
Ganguly, Gronemeier, Indyk, Jayram, Kane,
Krauthgamer, Kumar, Li, Nelson, Porat, Sivakumar,
Sun, W, …)
Any guesses on how the space bounds depend on p?
Frequency Moments
• F2 is the “breaking point”
– Fp for p · 2 doable in O~(1) bits of space
– Fp for p > 2 requires £~(n1-2/p) bits of space
• Algorithms achieve O~(1) processing times
• Connections: “sub-sampling + heavy hitters” technique
for the upper bound
– Used in many data stream, embedding, and linear
algebra problems: earthmover distance, mixed norms,
sampling in the turnstile model, compressed sensing,
graph sparsifiers, regression
– Optimally solves sumi=1n G(xi) problems
[Braverman, Ostrovsky]
–Estimation error ¼ |x|2/B
Subsampling + Heavy Hitters
–Can be used to find the “heavy
• CountSketch
hitters”[Charikar,Chen,Farach-Colton]:
– Give–It
each
i a random
is acoordinate
linear map
x -> S ¢¾(i)
x 2 {-1,1}
– Randomly partition coordinates into B buckets,
•Easy
under
updates
maintain
cj = to
Σi maintain
¾(i)¢x
in
j-th
bucket
s.t. h(i) = j
i
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
.
Σi s.t. h(i) = 2 ¾(i)¢xi .
– Estimate xi as ch(i)
.
Subsampling + Heavy Hitters
(n)
x2
Rn:
Number of n1/2
coordinates
1
1 n1/4
n1/3
Value
• Subsampling [Indyk,W]:
– Create nested sequence of subsets of [n]
– [n] = Llog n ¶ Llog n - 1 ¶ … ¶ L0
– Li contains about 2i random coordinates
– Run CountSketch to find heavy hitters of each xLi
– Estimate number of coordinates “at every scale”
– Obtain a rough approximation x’ to x
(3) Characterization of Turnstile
Algorithms
•
All known algorithms in the turnstile model have the form:
1. Choose a random matrix A independent of x
2. Maintain the “linear sketch” Ax in the stream
A
x
3. Output a function of Ax
•
Question (?!): does the optimal algorithm for any function in the turnstile
model have this form?
•
[Li, Nguyen, W] Yes, up to a factor of log n in the space
–
Some caveats, e.g., can’t necessarily store A in low space
–
For lower bounds doesn’t matter, gives simpler proof strategy since
just need to rule out linear sketches
= Ax
Talk Outline
• Data Stream Model and Sample Results
– Distinct Elements
– Frequency Moments
– Characterization of Algorithms
• Connections to Other Areas
– Compressed Sensing
– Linear Algebra
– Machine Learning
Compressed Sensing
• Compute a sketch A¢x with a small number
of rows (a.k.a. measurements)
• Output x’ which approximates x in the sense
that
x
|x’-x|p · (1+ε) |x-xk|q
where xk is the best k-sparse approximation
to x
• Similar to heavy hitters problem solved by
CountSketch
• Variations of CountSketch + subsampling:
• Can design algorithms with near-optimal x2
number of measurements as a function
of various ε, k, p, q [Price, W]
• For p = q = 2, can reduce number of
measurements by adaptively invoking
CountSketch [Indyk, Price, W]
Linear Algebra
• Least squares regression
– Fitting points to a line, or more generally a subspace
Example Regression
250
200
150
Example Regression
100
50
0
0
50
100
150
– minx |Ax-b|2 for n x d matrix A, n x 1 vector b
– Typically n >> d, i.e., the problem is over-constrained
Linear Algebra
• If S is a random projection matrix:
– compute S*A and S*b,
– solve minx |SAx-Sb|2
– Intuition: randomly rotate the column span of [A°b], then
drop all but first O(d) coordinates
• (0, 0, 0, …, 0, 1) 2 Rn
• After rotation approximately:
(± 1/n1/2, …, ± 1/n1/2)
• Drop all but first d coordinates
and rescale by (n/d)1/2
(± 1/d1/2, …, ± 1/d1/2) 2 Rd
Linear Algebra
• 1+ε approximation in O(nd log n) + poly(d/ε) time using Fast
Johnson Lindenstrauss Transforms (restricted family of
projections)
• If replace S with CountSketch, this still works! [Clarkson, W]
– Leads to running time O(nnz(A)) + poly(d/ε) time, where
nnz(A) is the number of non-zero entries of A
• Low Rank Approximation
– Using CountSkech instead of Fast Johnson Lindenstrauss
improves running time from O(nd log n) to O(nnz(A))
[Clarkson, W]
• Beautiful followup works by Li, Mahoney, Meng, Miller,
Nelson, Nguyen, Peng
Machine Learning
• CountSketch can be used to estimate inner products
– Estimate <x,y> as <Sx, Sy>
– E[<Sx, Sy>] = <x,y>
– Var[<Sx, Sy>] · |x|2 |y|2/B
• Replace expensive inner product computations in classification
algorithms with approximations via CountSketch
– perceptron and minimum enclosing ball [Clarkson, Hazan, W]
• Often interested in non-linear kernel transformations of input
points: x1, …, xn -> f(x1), …, f(xn)
– “Tensor product” CountSketch of Pagh gives subspace
embeddings of the polynomial kernel [Avron, Nguyen, W]
Conclusions
• Many data stream and sketching techniques give efficient
ways of “compressing” big data – a broadly applicable
goal in computer science
– Compressed sensing, graph algorithms, linear
algebra, machine learning…
– Recently been looking at shape-fitting and clustering
problems, etc.
– Also useful for proving lower bounds in other areas,
e.g., number of measurements in sparse recovery
[DoBa,Indyk,Price,W]
– I’m sure there are many other unexplored areas
• Thank you!