Tight Bounds for Distributed Functional Monitoring


David Woodruff
IBM Almaden
Qin Zhang
Aarhus University
MADALGO
Based on a paper in STOC, 2012
k-party Number-In-Hand Model
[Figure: players P_1, P_2, …, P_k connected in a network; player P_i holds input x_i]
- Player-to-player communication
- Protocol transcript always determines who speaks next
Goals:
- compute a function f(x_1, …, x_k)
- minimize communication complexity
k-party Number-In-Hand Model
[Figure: coordinator C connected to players P_1, P_2, …, P_k; player P_i holds input x_i]
Convenient to introduce a “coordinator” C
All communication goes through the coordinator
Communication only affected by a factor of 2
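A minimal sketch of why relaying through the coordinator costs at most a factor of 2. It assumes each message is a (sender, receiver, bits) triple and that the recipient is determined by the transcript so far, so no addressing bits are charged. The message list and function names are hypothetical.

```python
# Hypothetical illustration: compare direct player-to-player cost with
# the cost of relaying every message through a coordinator C.

def direct_cost(messages):
    """Total bits when players send messages to each other directly."""
    return sum(bits for _sender, _receiver, bits in messages)

def coordinator_cost(messages):
    """Total bits when each message goes sender -> C and then C -> receiver."""
    total = 0
    for _sender, _receiver, bits in messages:
        total += bits  # sender uploads the message to the coordinator
        total += bits  # coordinator forwards the same message to the receiver
    return total

# (sender, receiver, message length in bits) for a made-up 4-player protocol
messages = [(1, 2, 10), (2, 3, 7), (3, 1, 4), (4, 2, 12)]
print(direct_cost(messages), coordinator_cost(messages))
```

Each message is charged once on the way in and once on the way out, which is where the factor of 2 comes from.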
Model Motivation
• Data distributed and stored in the cloud
– Impractical to put data on a single device
• Sensor networks
– Communication is power-intensive
• Network routers
– Bandwidth limitations
• Distributed functional monitoring
Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, …
k-Party Number-In-Hand Model
[Figure: coordinator C connected to players P_1, P_2, …, P_k; player P_i holds input x_i]
For distributed databases:
- |x|_0 is the number of distinct elements
- |x|_2^2 is known as the self-join size
- |x|_2 is useful for regression and low-rank approximation
Important for applications: the x_i are non-negative
Which functions do we care about?
- ∀i, x_i ∈ {0, 1, …, n}^n
- x = x_1 + x_2 + … + x_k
- f(x) = |x|_p = (Σ_j x(j)^p)^(1/p), where x(j) is the j-th coordinate of x
- |x|_0 is the number of non-zero coordinates
- Talk will focus on |x|_0 and |x|_2
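The definitions above can be sketched in a few lines, computed centrally on the summed vector, which is exactly what the distributed protocols try to avoid. The function names and the three example player vectors are made up for illustration.

```python
def combined_vector(player_vectors):
    """x = x_1 + x_2 + ... + x_k, added coordinate by coordinate."""
    return [sum(coords) for coords in zip(*player_vectors)]

def norm_0(x):
    """|x|_0: number of non-zero coordinates."""
    return sum(1 for v in x if v != 0)

def norm_p(x, p):
    """|x|_p = (sum over coordinates of |x(j)|^p)^(1/p), for p > 0."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

players = [[1, 0, 0, 2], [0, 0, 3, 2], [1, 0, 0, 0]]
x = combined_vector(players)   # [2, 0, 3, 4]
print(norm_0(x), norm_p(x, 2))
```

Here |x|_2^2 = 4 + 9 + 16 = 29, the self-join size of the combined data.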
Randomized Communication
Complexity
• What is the randomized communication cost of f?
• i.e., the minimal cost of a protocol, which for every set of
inputs, fails in computing f with probability < 1/3
• Ω(n) cost for |x|_0 and |x|_2
• Reduction from 2-Player Set-Disjointness (DISJ)
• Alice has a set S ⊆ [n]
• Bob has a set T ⊆ [n]
• Either |S ∩ T| = 0 or |S ∩ T| = 1
• |S ∩ T| = 1 → DISJ(S, T) = 1, |S ∩ T| = 0 → DISJ(S, T) = 0
• [KS, R]: Ω(n) communication
• Prohibitive
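One way to see the reduction, sketched under the assumption that Alice and Bob use the characteristic vectors of S and T as their inputs and exchange |S| and |T| (an extra O(log n) bits). An exact value of |x|_0 or |x|_2 then reveals |S ∩ T|. The function names are hypothetical.

```python
def characteristic_vector(subset, n):
    """0/1 vector of length n with a 1 in position j-1 for each j in subset."""
    return [1 if j in subset else 0 for j in range(1, n + 1)]

def summed_vector(S, T, n):
    """x = x_Alice + x_Bob for the two characteristic vectors."""
    return [a + b for a, b in zip(characteristic_vector(S, n),
                                  characteristic_vector(T, n))]

def intersection_size_from_norm_0(S, T, n):
    x = summed_vector(S, T, n)
    norm_0 = sum(1 for v in x if v != 0)      # equals |S ∪ T|
    return len(S) + len(T) - norm_0           # equals |S ∩ T|

def intersection_size_from_norm_2(S, T, n):
    x = summed_vector(S, T, n)
    norm_2_squared = sum(v * v for v in x)    # equals |S| + |T| + 2|S ∩ T|
    return (norm_2_squared - len(S) - len(T)) // 2

print(intersection_size_from_norm_0({1, 3, 5}, {3, 4}, 6))
```

Under the promise that |S ∩ T| is 0 or 1, the returned value is DISJ(S, T) as defined on the slide.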
Approximate Answers
Compute a relation with probability > 2/3:
f(x) ∈ (1 ± ε) |x|_0
f(x) ∈ (1 ± ε) |x|_2
What is the randomized communication cost
as a function of k, ε, and n?
Will ignore log(nk/ε) factors
Understanding dependence on ε is critical, e.g., ε<.01
Previous Results
• |x|_0: Ω(k + ε^-2) and O(k·ε^-2)
• |x|_2: Ω(k + ε^-2) and O(k·ε^-2)
Our Results
• |x|_0: lower bound improved from Ω(k + ε^-2) to Ω(k·ε^-2); upper bound O(k·ε^-2)
• |x|_2: lower bound improved from Ω(k + ε^-2) to Ω(k·ε^-2); upper bound O(k·ε^-2)
First lower bounds to depend on the product of k and ε^-2
Implications for data streams:
- First tight space lower bound for estimating number of distinct elements without using the Gap-Hamming Problem
- Improves lower bound for estimation of |x|_p, p > 2
Previous Lower Bounds
• Lower bounds for |x|_0 and |x|_2
• [CMY] Ω(k)
• [ABC] Ω(ε^-2)
• Reduction from Gap-Orthogonality (GAP-ORT)
• P_1, P_2 have u, v ∈ {0,1}^(ε^-2), respectively
• |Δ(u, v) - 1/(2ε^2)| < 1/ε or |Δ(u, v) - 1/(2ε^2)| > 2/ε
• [CR, S] Ω(ε^-2) communication
Talk Outline
• Lower Bounds
– |x|_0
– |x|_2
Lower Bound for |x|_0
• Improve bound to optimal Ω(k·ε^-2)
• Study a simpler problem: k-GAP-THRESH
– Each player P_i holds a bit Z_i
– Z_i are i.i.d. Bernoulli(β)
– Decide if
Σ_{i=1}^k Z_i > βk + (βk)^(1/2) or Σ_{i=1}^k Z_i < βk - (βk)^(1/2)
Otherwise don’t care
• Rectangle property: for any correct protocol transcript τ,
Z_1, Z_2, …, Z_k are independent conditioned on τ
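A small sketch of the k-GAP-THRESH promise problem as stated above: sample the bits and classify the sum against the two thresholds. The function names and the labels "high", "low", and "dont_care" are hypothetical.

```python
import math
import random

def sample_bits(k, beta, rng):
    """Draw Z_1, ..., Z_k i.i.d. Bernoulli(beta)."""
    return [1 if rng.random() < beta else 0 for _ in range(k)]

def gap_thresh(bits, beta):
    """Classify sum(Z_i) against beta*k +/- sqrt(beta*k)."""
    k = len(bits)
    mean = beta * k
    dev = math.sqrt(beta * k)
    total = sum(bits)
    if total > mean + dev:
        return "high"
    if total < mean - dev:
        return "low"
    return "dont_care"  # inside the gap: any answer is acceptable

rng = random.Random(0)
bits = sample_bits(1000, 0.1, rng)
print(sum(bits), gap_thresh(bits, 0.1))
```

With k = 16 and β = 0.25 the thresholds are 4 ± 2, so a sum of 9 is "high" and a sum of 1 is "low".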
Rectangle Property of
Communication
• Let r be the randomness of C, P_1, …, P_k
• For any fixed r, the set S of inputs giving
rise to a transcript τ is a combinatorial
rectangle: S = S_1 × S_2 × … × S_k
• If input distribution is a product distribution,
conditioned on τ and r, inputs are
independent
• Since this holds for every r, inputs are
independent conditioned on τ
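The rectangle property can be checked by brute force on a toy deterministic protocol (randomness r fixed). The three-player protocol below is invented purely for illustration: P_1 announces the parity of its input, and that message decides whether P_2 or P_3 speaks next.

```python
from itertools import product

def transcript(x1, x2, x3):
    """Toy protocol: the first message determines who speaks second."""
    first = x1 % 2                 # P_1 sends the parity of x1
    if first == 0:
        second = int(x2 >= 2)      # P_2 sends one bit about x2
    else:
        second = int(x3 >= 2)      # P_3 sends one bit about x3
    return (first, second)

def is_rectangle(inputs):
    """True if the set of input tuples equals S_1 x S_2 x ... x S_k."""
    sides = [set(coord) for coord in zip(*inputs)]
    return set(inputs) == set(product(*sides))

# Group all inputs from {0,1,2,3}^3 by the transcript they produce.
preimages = {}
for x in product(range(4), repeat=3):
    preimages.setdefault(transcript(*x), []).append(x)

all_rectangles = all(is_rectangle(xs) for xs in preimages.values())
print(len(preimages), all_rectangles)
```

For contrast, the set {(0,0,0), (1,1,1)} is not a rectangle, so no transcript of any protocol can have exactly that preimage.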
k-GAP-THRESH
[Figure: coordinator C connected to players P_1, P_2, …, P_k; player P_i holds bit Z_i]
• The Z_i are i.i.d. Bernoulli(β)
• Coordinator wants to decide if:
Σ_{i=1}^k Z_i > βk + (βk)^(1/2) or Σ_{i=1}^k Z_i < βk - (βk)^(1/2)
• By independence of the Z_i | τ, equivalent to C having “noisy”
independent copies of the Z_i
A Key Lemma
• Lemma: For any protocol Π which succeeds w.pr. > .99, the
transcript τ is such that w.pr. > 1/2, for at least k/2 different i,
H(Z_i | τ) < H(.01 β)
• Proof: Suppose τ does not satisfy this
– With large probability,
βk - O((βk)^(1/2)) < E[Σ_{i=1}^k Z_i | τ] < βk + O((βk)^(1/2))
– Since the Z_i are independent given τ,
Σ_{i=1}^k Z_i | τ is a sum of independent Bernoullis
– Since most H(Z_i | τ) are large, by anti-concentration, both
events occur with constant probability:
Σ_{i=1}^k Z_i | τ > βk + (βk)^(1/2), Σ_{i=1}^k Z_i | τ < βk - (βk)^(1/2)
So Π can’t succeed with large probability
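The anti-concentration step can be illustrated numerically. The sketch below computes the exact distribution of a sum of independent Bernoullis and the mass beyond the two thresholds. The extreme case shown, where the transcript reveals nothing and every conditional probability stays at β, is an assumption chosen for illustration, as are the values k = 400 and β = 0.25.

```python
import math

def sum_distribution(probs):
    """Exact distribution of a sum of independent Bernoulli(p_i) variables."""
    dist = [1.0]                          # the empty sum is 0 with probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for s, mass in enumerate(dist):
            new[s] += mass * (1 - p)      # this bit is 0
            new[s + 1] += mass * p        # this bit is 1
        dist = new
    return dist

def tail_probabilities(probs, beta):
    """Mass above beta*k + sqrt(beta*k) and below beta*k - sqrt(beta*k)."""
    k = len(probs)
    hi = beta * k + math.sqrt(beta * k)
    lo = beta * k - math.sqrt(beta * k)
    dist = sum_distribution(probs)
    p_high = sum(m for s, m in enumerate(dist) if s > hi)
    p_low = sum(m for s, m in enumerate(dist) if s < lo)
    return p_high, p_low

# Uninformative transcript: every Z_i | tau is still Bernoulli(beta).
p_high, p_low = tail_probabilities([0.25] * 400, 0.25)
print(p_high, p_low)
```

When the conditional probabilities are all 0 or 1 (the transcript reveals every bit), the sum is deterministic and one of the two tails has zero mass, which is the situation a successful protocol must approach.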
Composition Idea
[Figure: coordinator C connected to players P_1, P_2, …, P_k; player P_i holds bit Z_i; the link between C and a player is labeled DISJ]
The input to P_i in k-GAP-THRESH, denoted Z_i, is the output
of a 2-party Disjointness (DISJ) instance between C and P_i
- Let S be a random set of size 1/(4ε^2) from {1, 2, …, 1/ε^2}
- For each i, if Z_i = 1, then choose T_i of size 1/(4ε^2) so that
DISJ(S, T_i) = 1, else choose T_i so that DISJ(S, T_i) = 0
- Distributional complexity of solving DISJ with probability
1 - β/100, when DISJ(S, T) = 1 with probability β, is Ω(1/ε^2) [R]
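A sketch of how such a composed instance could be sampled, assuming m = 1/ε^2 is a multiple of 4 and that DISJ(S, T_i) = 1 means |S ∩ T_i| = 1, as defined earlier in the talk. Drawing T_i uniformly at random subject to that constraint is an assumption made here. The function name is hypothetical.

```python
import random

def sample_composed_instance(z_bits, m, rng):
    """Return S and sets T_i with |S ∩ T_i| = Z_i, all of size m/4."""
    universe = list(range(1, m + 1))
    size = m // 4
    S = set(rng.sample(universe, size))
    outside = [j for j in universe if j not in S]   # the 3m/4 elements not in S
    T_sets = []
    for z in z_bits:
        if z == 1:
            # exactly one element shared with S, the rest outside S
            T = set(rng.sample(outside, size - 1))
            T.add(rng.choice(sorted(S)))
        else:
            # no element shared with S
            T = set(rng.sample(outside, size))
        T_sets.append(T)
    return S, T_sets

rng = random.Random(0)
z = [1, 0, 1, 1, 0]
S, T_sets = sample_composed_instance(z, 16, rng)   # m = 16, i.e. eps = 1/4
print([len(S & T) for T in T_sets])
```

The printed intersection sizes reproduce the bits Z_i, so solving DISJ(S, T_i) is the same as learning Z_i.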
Putting it All Together
• Key Lemma → For most i, H(Z_i | τ) < H(.01β)
• Since H(Z_i) = H(β) for all i, for most i protocol Π solves
DISJ(S, T_i) with probability ≥ 1 - β/100
• For most i, the communication between C and P_i is Ω(ε^-2)
– Otherwise, C could simulate the other players without any
communication and contradict lower bound for DISJ(S, T_i)
• Total communication is Ω(k·ε^-2)
• Can show a reduction to estimating |x|_0
Reduction to |x|_0
• Think of C as a player
• C’s input vector x_C is characteristic vector
of the set [1/ε^2] \ S
• P_i’s input vector x_i is characteristic vector
of the set T_i
• When |T_i ∩ S| = 1, support of x = x_C + Σ_i x_i
usually increases by 1
• Choose β = Θ(1/(ε^2 k)) so that
Σ_{i=1}^k Z_i = βk ± (βk)^(1/2) = 1/ε^2 ± 1/ε
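The reduction can be traced on a small hand-built example with m = 1/ε^2 = 16. The sets below are made up. The last one shows why the support "usually" rather than always grows by 1: two players can hit the same element of S.

```python
def support_size(S, T_sets, m):
    """|x|_0 for x = x_C + sum_i x_i, where x_C marks [m] \\ S and x_i marks T_i."""
    x = [0] * (m + 1)                 # index 0 unused; coordinates are 1..m
    for j in range(1, m + 1):
        if j not in S:
            x[j] += 1                 # coordinator's characteristic vector
    for T in T_sets:
        for j in T:
            x[j] += 1                 # player's characteristic vector
    return sum(1 for v in x[1:] if v != 0)

S = {1, 2, 3, 4}
T1 = {1, 5, 6, 7}        # Z_1 = 1: shares element 1 with S
T2 = {8, 9, 10, 11}      # Z_2 = 0: disjoint from S
T3 = {3, 12, 13, 14}     # Z_3 = 1: shares element 3 with S
T4 = {1, 9, 15, 16}      # Z_4 = 1: shares element 1 again (collision with T1)

print(support_size(S, [T1, T2, T3], 16))       # 12 + 2
print(support_size(S, [T1, T2, T3, T4], 16))   # still 14 because of the collision
```

Without collisions the support is (m - |S|) + Σ_i Z_i, so a (1 ± ε) estimate of |x|_0 separates the two k-GAP-THRESH cases.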
Talk Outline
• Lower Bounds
– |x|_0
– |x|_2
Lower Bound for Euclidean Norm
• Improve Ω(k + ε^-2) bound to optimal Ω(k·ε^-2)
• Use Gap-Orthogonality (GAP-ORT(X, Y))
– Alice, Bob have X, Y ∈ {0,1}^(ε^-2)
– Decide: |Δ(X, Y) - 1/(2ε^2)| < 1/ε or |Δ(X, Y) - 1/(2ε^2)| > 2/ε
– Consider uniform distribution on X, Y
[Figure label: GAP-ORT(X, Y) = 1]
• [KLLRX, CKW] For any protocol Π that solves GAP-ORT with constant probability,
I(X, Y; Π) = H(X, Y) - H(X, Y | Π) = Ω(1/ε^2)
Information Implications
• By chain rule,
I(X, Y ; Π) = Σ_{i=1}^{1/ε^2} I(X_i, Y_i ; Π | X_<i, Y_<i) = Ω(ε^-2)
• For most i, I(X_i, Y_i ; Π | X_<i, Y_<i) = Ω(1)
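The step from the sum to "most i" is an averaging argument. A sketch, reading "most" as "a constant fraction"; the symbols m, a_i, c, and G are introduced here for illustration and do not appear on the slide.

```latex
% Notation introduced for this sketch:
%   m = 1/\varepsilon^2, \quad
%   a_i = I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}), \quad
%   c > 0 the constant hidden in the \Omega(\varepsilon^{-2}) bound.
\begin{align*}
0 \le a_i &\le H(X_i, Y_i) \le 2
  && \text{($X_i$ and $Y_i$ are single bits)} \\
\sum_{i=1}^{m} a_i &\ge c\,m
  && \text{(chain rule and the lower bound on $I(X, Y ; \Pi)$)} \\
G &= \{\, i : a_i \ge c/2 \,\} \\
c\,m \;\le\; \sum_{i=1}^{m} a_i
  &\le 2\,|G| + \frac{c}{2}\,\bigl(m - |G|\bigr)
  \;\le\; 2\,|G| + \frac{c\,m}{2} \\
|G| &\ge \frac{c\,m}{4}
\end{align*}
```

So at least a c/4 fraction of the coordinates carry Ω(1) bits of conditional information each.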
XOR DISJ
We compose GAP-ORT with a variant of k-Party DISJ
[Figure: players P_1, …, P_{k/2} hold sets T_1, …, T_{k/2}; players P_{k/2+1}, …, P_k hold sets T_{k/2+1}, …, T_k; every T_i ⊆ [n]]
• Choose random j ∈ [n] and random S ∈ {00, 10, 01, 11}:
S = 00: j doesn’t occur in any T_i
S = 10: j occurs only in T_1, …, T_{k/2}
S = 01: j occurs only in T_{k/2+1}, …, T_k
S = 11: j occurs in T_1, …, T_k
• Every j’ ≠ j occurs in at most one set T_i
• Output equals 1 if S ∈ {10, 01}, otherwise output is 0
• I(Π ; T_1, …, T_k | j, S, D) = Ω(k) for any Π for which I(Π ; S) = Ω(1)
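A sketch of an XOR DISJ instance generator and its output function. How the elements j' ≠ j are spread over the players is not specified on the slide, so giving each one to a uniformly random player or to nobody is an assumption, and the auxiliary variable D is not modeled. The code assumes k is even and k ≥ 4, so that each half contains at least two sets.

```python
import random

def sample_xor_disj(n, k, rng):
    """Sample (j, S, T_1..T_k) following the four cases for the special element j."""
    j = rng.randrange(1, n + 1)
    S = rng.choice(["00", "10", "01", "11"])
    T = [set() for _ in range(k)]
    for other in range(1, n + 1):
        if other == j:
            continue
        owner = rng.randrange(k + 1)     # value k means "in no set" (assumption)
        if owner < k:
            T[owner].add(other)          # every j' != j lands in at most one set
    half = k // 2
    if S[0] == "1":
        for i in range(half):
            T[i].add(j)                  # j occurs in T_1, ..., T_{k/2}
    if S[1] == "1":
        for i in range(half, k):
            T[i].add(j)                  # j occurs in T_{k/2+1}, ..., T_k
    return j, S, T

def xor_disj_output(T):
    """1 iff exactly one half of the players shares a common element."""
    half = len(T) // 2
    left = set.intersection(*T[:half])   # can only be {j} or empty when half >= 2
    right = set.intersection(*T[half:])
    return int(bool(left) != bool(right))

rng = random.Random(1)
j, S, T = sample_xor_disj(20, 6, rng)
print(S, xor_disj_output(T))
```

The output is the XOR of the two bits of S, which is the property the later composition with GAP-ORT relies on.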
GAP-ORT + XOR DISJ
• Take 1/ε^2 independent copies of XOR DISJ
– T^i = (T^i_1, …, T^i_k), j^i, S^i, D^i are variables for i-th instance
• Is the number of outputs equal to 1 about 1/(2ε^2) ± 1/ε or
about 1/(2ε^2) ± 2/ε?
[Figure: 1/ε^2 XOR DISJ instances stacked under one brace]
Intuitive Proof
• GAP-ORT is “embedded” inside of GAP-ORT + XOR DISJ
Output is XOR of bits in S
• Implies for any correct protocol Π:
For most i, I(S^i ; Π | S^<i) = Ω(1)
• Implies via a direct sum:
For most i, I(Π ; T^i | j, S, D, T^<i) = Ω(k)
• Implies via the chain rule:
I(Π ; T^1, …, T^(1/ε^2) | j, S, D) = Ω(k/ε^2)
• Implies communication is Ω(k/ε^2)
Conclusions
• Tight communication lower bounds for estimating
|x|_0 and |x|_2
• Techniques imply tight lower bounds for empirical
entropy, heavy hitters, quantiles
• Other results:
– Model in which the x_i undergo poly(n) additive updates
to their coordinates
– Coordinator continually maintains (1+ε)-approximation
– Improve k^2/poly(ε) to k/poly(ε) communication for |x|_2