Tight Bounds for Distributed Functional Monitoring

Transcript Tight Bounds for Distributed Functional Monitoring

Tight Bounds for Distributed
Functional Monitoring
David Woodruff
IBM Almaden
Qin Zhang
Aarhus University
MADALGO
Based on a paper in STOC, 2012
k-party Number-In-Hand Model
P1
x2
x1
Pk
P2
-Player to player
communication
xk
x3
…
P3
P4
x4
- Protocol
transcript
always
determines who
speaks next
Goals:
- compute a function f(x1, …, xk)
- minimize communication complexity
k-party Number-In-Hand Model
C
P1
P2
P3
x1
x2
x3
…
Pk
xk
Convenient to introduce a “coordinator” C
All communication goes through the coordinator
Communication only affected by a factor of 2
Model Motivation
• Data distributed and stored in the cloud
– Impractical to put data on a single device
• Sensor networks
– Communication is power-intensive
• Network routers
– Bandwidth limitations
• Distributed functional monitoring
Authors: Can, Cormode, Huang, Muthukrishnan, PattShamir, Shafrir, Tirthapura, Wang, Yi, Zhao, …
k-Party Number-In-Hand Model
For distributed databases:
C
Important for
applications that
|x|0 is number of distinct elements the xi are non|x|P2 is knownPas self-join
P size … negative
P
2
1
2
3
|x|2x1useful for regression,
x2
x3 low-rank
approx
k
xk
Which functions do we care about?
- 8i, xi 2 {0,1, … n}n
- x = x 1 + x2 + … + xk
- f(x) = |x|p = (Σi xip)1/p
- |x|0 is number of non-zero coordinates
- Talk will focus on |x|0 and |x|2
Randomized Communication
Complexity
• What is the randomized communication cost of f?
• i.e., the minimal cost of a protocol, which for every set of
inputs, fails in computing f with probability < 1/3
•
(n) cost for |x|0 and |x|2
• Reduction from 2-Player Set-Disjointness (DISJ)
• Alice has a set S µ [n]
• Bob has a set T µ [n]
• Either |S Å T| = 0 or |S Å T| = 1
• |S Å T| = 1 ! DISJ(S,T) = 1, |S Å T| = 0 !DISJ(S,T) = 0
• [KS, R] (n) communication
• Prohibitive
Approximate Answers
Compute a relation with probability > 2/3:
f(x)2(1 ± ε) |x|0
f(x)2(1 ± ε) |x|2
What is the randomized communication cost
as a function of k, ε, and n?
Will ignore log(nk/ε) factors
Understanding dependence on ε is critical, e.g., ε<.01
Previous Results
• |x|0: (k + ε-2) and O(k¢ε-2 )
• |x|2: (k + ε-2) and O(k¢ε-2 )
Our Results
• |x|0: (k + ε-2) and O(k¢ε-2 )
(k¢ε-2)
• |x|2: (k + ε-2) and O(k¢ε-2 )
(k¢ε-2)
First lower bounds
Implications for data streams:
to
depend
on
- First tight space lower bound for estimating number of distinct elements
of k and εwithout using the Gap-Hammingproduct
Problem
2
- Improves lower bound for estimation of |x|p, p > 2
Previous Lower Bounds
• Lower bounds for |x|0 and |x|2
• [CMY] (k)
• [ABC] (ε-2)
• Reduction from Gap-Orthogonality (GAP-ORT)
• P1, P2 have u, v 2
-2
ε
{0,1} ,
respectively
• |¢(u, v) – 1/(2ε2)| < 1/ε or |¢(u, v) - 1/(2ε2)| > 2/ε
• [CR, S] (ε-2) communication
Talk Outline
• Lower Bounds
– |x|0
– |x|2
Lower Bound for |x|0
• Improve bound to optimal (k¢ε-2)
• Study a simpler problem: k-GAP-THRESH
– Each player Pi holds a bit Zi
– Zi are i.i.d. Bernoulli(¯)
– Decide if
 i=1k Zi > ¯ k + (¯ k)1/2 or  i=1k Zi < ¯ k - (¯ k)1/2
Otherwise don’t care
• Rectangle property: for any correct protocol transcript ¿,
Z1, Z2, …, Zk are independent conditioned on ¿
Rectangle Property of
Communication
• Let r be the randomness of C, P1, …, Pk
• For any fixed r, the set S of inputs giving
rise to a transcript ¿ is a combinatorial
rectangle: S = S1 x S2 x … x Sk
• If input distribution is a product distribution,
conditioned on ¿ and r, inputs are
independent
• Since this holds for every r, inputs are
independent conditioned on ¿
k-GAP-THRESH
C
P1
P2
P3
Z1
Z2
Z3
…
Pk
Zk
• The Zi are i.i.d. Bernoulli(¯)
• Coordinator wants to decide if:
i=1k Zi > ¯ k + (¯ k)1/2 or  i=1k Zi < ¯ k - (¯ k)1/2
• By independence of the Zi | ¿ , equivalent to C having “noisy”
independent copies of the Zi
A Key Lemma
• Lemma: For any protocol ¦ which succeeds w.pr. >.99, the
transcript ¿ is such that w.pr. > 1/2, for at least k/2 different i,
H(Zi | ¿) < H(.01 ¯)
• Proof: Suppose ¿ does not satisfy this
– With large probability,
¯ k - O(¯ k)1/2 < E[ i=1k Zi | ¿] < ¯ k + O(¯ k)1/2
– Since the Zi are independent given ¿,
i=1k Zi | ¿ is a sum of independent Bernoullis
– Since most H(Zi | ¿) are large, by anti-concentration, both
events occur with constant probability:
 i=1k Zi | ¿ > ¯ k + (¯ k)1/2 ,  i=1k Zi | ¿ < ¯ k - (¯ k)1/2
So ¦ can’t succeed with large probability
Composition Idea
C
DISJ
P1
P2
P3
Z1
Z2
Z3
…
Pk
Zk
The input to Pi in k-GAP-THRESH, denoted Zi, is the output
of a 2-party Disjointness (DISJ) instance between C and Si
- Let S be a random set of size 1/(4ε2) from {1, 2, …, 1/ε2}
- For each i, if Zi = 1, then choose Ti of size 1/(4ε2) so that
DISJ(S, Ti) = 1, else choose Ti so that DISJ(S, Ti) = 0
- Distributional complexity of solving DISJ with probability
1-¯/100, when DISJ(S,T) = 1 with probability ¯, is (1/ε2) [R]
Putting it All Together
• Key Lemma ! For most i, H(Zi | ¿) < H(.01¯)
• Since H(Zi) = H(¯) for all i, for most i protocol ¦ solves
DISJ(X, Yi) with probability ¸ 1- ¯/100
• For most i, the communication between C and Pi is (ε-2)
– Otherwise, C could simulate the other players without any
communication and contradict lower bound for DISJ(X, Yi)
• Total communication is (k¢ε-2)
• Can show a reduction to estimating |x|0
Reduction to |x|0
• Think of C as a player
• C’s input vector xC is characteristic vector
of the set [1/ε2] \ S
• Pi’s input vector xi is characteristic vector
of the set Ti
• When |Ti Å S| = 1, support of x = xC + i xi
usually increases by 1
• Choose ¯ = £(1/(ε2 k)) so that
 i=1k Zi = ¯ k +- (¯ k)1/2 = 1/ε2 +- 1/ε
Talk Outline
• Lower Bounds
– |x|0
– |x|2
Lower Bound for Euclidean Norm
• Improve (k + ε-2) bound to optimal (k¢ε-2)
• Use Gap-Orthogonality (GAP-ORT(X, Y))
–
–
–
–
GAP-ORT(X,Y) = 1
-2
ε
Alice, Bob have X, Y 2 {0,1}
Decide: |¢(X, Y) – 1/(2ε2)| <1/ε or |¢(X, Y) - 1/(2ε2)| >2/ε
Consider uniform distribution on X,Y
• [KLLRX, CKW] For any protocol ¦ that solves GAPORT with constant probability,
I(X, Y; ¦) = H(X,Y) – H(X,Y | ¦) = (1/ε2)
Information Implications
• By chain rule,
2
1/ε
I(X, Y ; ¦) = i=1 I(Xi, Yi ; ¦ | X< i, Y< i) = (ε-2)
• For most i, I(Xi, Yi ; ¦ | X< i, Y< i) = (1)
XOR DISJ
We compose GAP-ORT with a variant of k-Party DISJ
P1
…
Pk/2
T1
…
Tk/2
Pk/2+1
Tk/2+1
…
…
Pk
Tk µ [n]
• Choose random j 2 [n] and random S 2 {00, 10, 01, 11}:
S = 00: j doesn’t occur in any Ti
S = 10: j occurs only in T1, …, Tk/2
S = 01: j occurs only in Tk/2+1, …, Tk
S = 11: j occurs in T1, …, Tk
• Every j’  j occurs in at most one set Ti
• Output equals 1 if S 2 {10, 01}, otherwise output is 0
• I(¦ ; T1, …, Tk | j, S, D) = (k) for any ¦ for which I(¦ ; S) = (1)
GAP-ORT + XOR DISJ
• Take 1/ε2 independent copies of XOR DISJ
– Ti = (Ti1, …, Tik), ji, Si, Di are variables for i-th instance
• Is the number of outputs equal to 1 about 1/(2ε2) +-1/ε or
about 1/(2ε2) +- 2/ε?
1/ε2
{
XOR DISJ instance
XOR DISJ instance
…
XOR DISJ instance
Intuitive Proof
• GAP-ORT is “embedded” inside of GAP-ORT + XOR DISJ
Output is XOR of bits in S
• Implies for any correct protocol ¦:
For most i, I(Si ; ¦ | S< i) = (1)
• Implies via a direct sum:
For most i, I(¦ ; Ti | j, S, D, T< i ) = (k)
• Implies via the chain rule:
2
I(¦; T1, …, T1/ε | j, S, D) = (k/ε2)
• Implies communication is (k/ε2)
Conclusions
• Tight communication lower bounds for estimating
|x|0 and |x|2
• Techniques imply tight lower bounds for empirical
entropy, heavy hitters, quantiles
• Other results:
– Model in which the xi undergo poly(n) additive updates
to their coordinates
– Coordinator continually maintains (1+ε)-approximation
– Improve k2/poly(ε) to k/poly(ε) communication for |x|2

Tight Bounds for Distributed Functional Monitoring

Transcript Tight Bounds for Distributed Functional Monitoring

Directory