The Complexity of Computing on Massive Data Sets


The Complexity of Massive Data Set Computations
Ziv Bar-Yossef
Computer Science Division
U.C. Berkeley
Ph.D. Dissertation Talk
May 6, 2002
1
What Are Massive Data Sets?
Examples
• The Web
• IP packets
• Supermarket transactions
• Telephone call graph
• Astronomical observations
Characterizing properties
• Huge collections of raw data
• Data is generated and modified continuously
• Distributed over many sites
• Slow storage devices
• Data is not organized / indexed
2
Nontraditional Computational Challenges
Traditionally: cope with the difficulty of the problem
Massive data sets: cope with the size of the data and the restricted access to it
Restricted access to the data
• Random access: expensive
• “Streaming” access: more feasible
• Some data may be unavailable
• Fetching data is expensive
Sub-linear running time
• Ideally, independent of data size
Sub-linear space
• Ideally, logarithmic in data size
3
Basic Framework
Massive data set computations are typically:
• Approximate
• Randomized
• Have a restricted access regime
[Diagram: the input data is read through an access regime by a randomized algorithm ($$ = randomness), which produces an approximate output.]
4
Prominent Computational Models
for Massive Data Sets
• Sampling Computations
– Sub-linear running time & space
– Suitable for “insensitive” functions
• Data Stream Computations
– Linear running time, sub-linear space
– Can compute sensitive functions
• Sketch Computations
– Suitable for distributed data
5
Sampling Computations
[Diagram: a sampling algorithm with randomness $$ queries entries of x1,…,xn and outputs an approximation of f(x1,…,xn).]
• Query input at random locations
• Can choose query distribution and can query adaptively
• Complexity measure: query complexity
• Applications
– Statistical parameter estimation
– Computational and statistical learning [Valiant 84, Vapnik 98]
– Property testing [RS96,GGR96]
6
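To make the sampling model concrete, here is a minimal Python sketch (not from the talk; the function name and interface are illustrative) of a sampling computation: estimating the mean of n numbers in [0,1] by querying uniformly random positions. By Hoeffding's inequality, O(1/ε² · log(1/δ)) queries give an (ε,δ)-approximation.

import math
import random

def approx_mean(query, n, eps, delta):
    """(eps, delta)-approximate the mean of x_1..x_n in [0,1].

    `query(i)` returns x_i for a 0-based index i; only sub-linearly many
    positions are queried.  By Hoeffding's inequality,
    k = O(1/eps^2 * log(1/delta)) uniform queries suffice.
    """
    k = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    total = sum(query(random.randrange(n)) for _ in range(k))
    return total / k

# Toy usage (the data is accessed only through `query`):
x = [random.random() for _ in range(10 ** 6)]
est = approx_mean(lambda i: x[i], len(x), eps=0.01, delta=0.05)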
Data Stream Computations
[HRR98, AMS96, FKSV99]
[Diagram: items x1 x2 x3 … xn arrive as a one-way stream; a data stream algorithm with a small memory and randomness $$ outputs an approximation of f(x1,…,xn).]
• Input arrives in a one-way stream in arbitrary order
• Complexity measures: space and time per data item
• Applications
– Database (Frequency moments [AMS96])
– Networking (Lp distance [AMS96, FKSV99, FS00, Indyk 00])
– Web Information Retrieval (Web crawling, Google query logs [CCF02])
7
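A minimal Python sketch in the spirit of the AMS frequency-moment estimators [AMS96] (an illustration, not the paper's exact construction): each counter holds Σj s(j)·fj for a pseudo-random sign function s, and the square of a counter estimates F2. The real algorithm uses 4-wise independent signs and median-of-means amplification; here a simple hash and a median stand in for both.

import hashlib
from statistics import median

def sign(r, item):
    """Pseudo-random +-1 sign for `item` in repetition r (stand-in for a 4-wise independent hash)."""
    digest = hashlib.blake2b(f"{r}:{item}".encode(), digest_size=1).digest()[0]
    return 1 if digest & 1 else -1

def ams_f2_estimate(stream, repetitions=33):
    """Crude AMS-style estimate of F2 = sum_j fj^2 in one pass with O(repetitions) counters."""
    counters = [0] * repetitions
    for item in stream:                       # single pass over the stream
        for r in range(repetitions):
            counters[r] += sign(r, item)      # counter r holds sum_j sign(r, j) * fj
    return median(z * z for z in counters)    # E[z^2] = F2 for each counter

# Toy usage: frequencies are (1, 2, 3), so F2 = 1 + 4 + 9 = 14
print(ams_f2_estimate([1, 2, 2, 3, 3, 3]))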
Sketch Computations
[GM98, BCFM98, FKSV99]
[Diagram: t sites hold data x11…x1k, x21…x2k, …, xt1…xtk; each site sends a compressed “sketch” (using randomness $$) to a sketch algorithm, which outputs an approximation of f(x11,…,xtk).]
• Algorithm computes from data “sketches” sent from sites
• Complexity measure: sketch lengths
• Applications
– Web Information Retrieval (Identifying document similarities [BCFM98])
– Networking (Lp distance [FKSV99])
– Lossy compression, approximate nearest neighbor
8
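As a concrete instance of a sketch computation, here is a minimal Python min-hash sketch in the spirit of [BCFM98] (illustrative only; the function names, hash choice, and sketch length are made up): each site compresses its document, viewed as a set of shingles, into the minima of k hash functions, and resemblance (Jaccard similarity) is estimated from the two short sketches alone.

import hashlib

def minhash_sketch(elements, k=64):
    """Compress a set into the minima of k pseudo-random hash functions."""
    def h(i, e):
        return int.from_bytes(hashlib.blake2b(f"{i}:{e}".encode(), digest_size=8).digest(), "big")
    return [min(h(i, e) for e in elements) for i in range(k)]

def estimate_resemblance(sketch_a, sketch_b):
    """Each coordinate collides with probability |A ∩ B| / |A ∪ B|, so the agreement rate estimates it."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)

# Toy usage: two "documents" as shingle sets; true resemblance is 3/5
doc_a = {"the cat", "cat sat", "sat on", "on mat"}
doc_b = {"the cat", "cat sat", "sat on", "on rug"}
print(estimate_resemblance(minhash_sketch(doc_a), minhash_sketch(doc_b)))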
Main Objective
Explore the limitations of the
above computational models
• Develop general lower bound techniques
• Obtain lower bounds for specific functions
9
Thesis Blueprint
[Diagram: the thesis builds on statistical decision theory, information theory, and communication complexity. Statistical decision theory gives lower bounds for general functions in the sampling model [BKS01, B02]. Communication complexity gives general CC lower bounds [BJKS02b] and, via information theory, one-way and simultaneous CC lower bounds [BJKS02a]; these transfer to data stream computations by a reduction from one-way CC and to sketch computations by a reduction from simultaneous CC.]
10
Sampling Lower Bounds
(with R. Kumar and D. Sivakumar; STOC 2001 and manuscript, 2002)
• Combinatorial lower bound [BKS01]
– bounds the expected query complexity of every function
– tends to be weak
– based on a generalization of Boolean block sensitivity [Nisan 89]
• Statistical lower bounds
– bound the query complexity of symmetric functions
– via Hellinger distance: worst-case query complexity [BKS01]
– via KL distance: expected query complexity [B02]
– tend to be tight
– work by a reduction from statistical hypothesis testing
• Information theory lower bound [B02]
– bounds the worst-case query complexity of symmetric functions
– has better dependence on the domain size
11
Main Idea
[Diagram: ε-disjoint inputs w, x, y have pairwise disjoint approximation sets Cε(w), Cε(x), Cε(y).]
(ε,δ)-approximation: Pr(T(x) ∈ Cε(x)) ≥ 1 − δ
Main observation:
Since for all x, T(x) ∈ Cε(x) w.p. 1 − δ, then:
x,y ε-disjoint ⇒ T(x), T(y) are “far” from each other
12
Main Result
Theorem
For any symmetric f and ε-disjoint inputs x,y, and for any algorithm that (ε,δ)-approximates f,
• Worst-case # of queries ≥ 1/h²(Ux,Uy) · log(1/δ)
• Expected # of queries ≥ 1/KL(Ux,Uy) · log(1/δ)
where
• Ux – uniform query distribution on x (induced by: pick i u.a.r., output xi)
• Hellinger: h²(Ux,Uy) = 1 − Σa (Ux(a) · Uy(a))^½
• KL: KL(Ux,Uy) = Σa Ux(a) · log(Ux(a) / Uy(a))
13
Example: Mean
Theorem (originally, [CEG95])
Approximating the mean of n numbers in [0,1] to within ±ε additive error requires Ω(1/ε² · log(1/δ)) queries.
[Diagram: hard instance – x has a (½+ε) fraction of 1's and a (½−ε) fraction of 0's, while y has a (½−ε) fraction of 1's and a (½+ε) fraction of 0's.]
h²(Ux,Uy) = KL(Ux,Uy) = O(ε²)
Other applications: selection functions, frequency moments, extractors and dispersers
14
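A small numeric check (not part of the talk) of the two distances for this hard instance: the induced query distributions are Bernoulli(½+ε) and Bernoulli(½−ε), and the snippet below shows that both h² and KL scale like ε², as the slide states.

import math

def hellinger_sq(p, q):
    """h^2(P,Q) = 1 - sum_a sqrt(P(a) * Q(a)) for distributions given as dicts."""
    return 1.0 - sum(math.sqrt(p[a] * q[a]) for a in p)

def kl(p, q):
    """KL(P || Q) = sum_a P(a) * log(P(a) / Q(a))."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

for eps in (0.1, 0.05, 0.025):
    ux = {1: 0.5 + eps, 0: 0.5 - eps}   # uniform query distribution on x
    uy = {1: 0.5 - eps, 0: 0.5 + eps}   # uniform query distribution on y
    # Both ratios stay roughly constant, so h^2 and KL are Theta(eps^2)
    print(eps, hellinger_sq(ux, uy) / eps ** 2, kl(ux, uy) / eps ** 2)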
Proof Outline
1. For symmetric functions, WLOG, all queries are uniform without replacement
2. If the # of queries is ≤ n^½, can further assume queries are uniform with replacement
3. For any ε-disjoint inputs x,y: an (ε,δ)-approximation of f with k queries yields a hypothesis test of Ux against Uy with error δ and k samples
4. Hypothesis testing lower bounds
• via Hellinger distance (worst-case)
• via KL distance (expected) (cf. [Siegmund 85])
15
Statistical Hypothesis Testing
[Diagram: a black box emits k i.i.d. samples, drawn from either P or Q, to a hypothesis test.]
• Black box contains either P or Q
• Test has to decide: “P” or “Q”
• Allowed error probability δ
• Goal: minimize k
16
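For intuition (an illustration, not from the talk; the distributions and sample counts are arbitrary), a tiny Python simulation of the testing problem: a maximum-likelihood test on the k i.i.d. samples, whose empirical error probability drops quickly as k grows.

import math
import random

def mle_test(samples, p, q):
    """Output "P" if the samples are more likely under P than under Q."""
    log_lr = sum(math.log(p[s] / q[s]) for s in samples)
    return "P" if log_lr >= 0 else "Q"

# The black box secretly holds P; estimate the test's error probability as k grows.
p = {1: 0.55, 0: 0.45}
q = {1: 0.45, 0: 0.55}
rng = random.Random(0)
for k in (10, 50, 200):
    trials = 1000
    errors = sum(
        mle_test([rng.choices((1, 0), weights=(0.55, 0.45))[0] for _ in range(k)], p, q) != "P"
        for _ in range(trials)
    )
    print(k, errors / trials)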
Sampling Algorithm ⇒ Hypothesis Test
x,y: ε-disjoint inputs (so Cε(x) ∩ Cε(y) = ∅)
[Diagram: the black box contains either Ux or Uy and emits k i.i.d. samples to the sampling algorithm.]
Output “Ux” if the algorithm's output ∈ Cε(x), and “Uy” otherwise
17
Lower Bound via Hellinger Distance
A hypothesis test for Ux against Uy with error δ and k samples implies V(Ux^k, Uy^k) ≥ 1 − 2δ
Lemma (cf. Le Cam, Yang 90)
1. V(Ux^k, Uy^k) ≤ h(Ux^k, Uy^k) · (2 − h²(Ux^k, Uy^k))^½
2. h²(Ux^k, Uy^k) = 1 − (1 − h²(Ux, Uy))^k
Corollary: k ≥ 1/h²(Ux,Uy) · log(1/δ)
18
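A sketch of how the corollary follows from the two facts above (a standard manipulation; it assumes δ < 1/4 and that h²(Ux,Uy) is bounded away from 1, so that ln(1/(1−h²)) = Θ(h²)):

\[
(1-2\delta)^2 \;\le\; V^2(U_x^k,U_y^k) \;\le\; h^2(U_x^k,U_y^k)\bigl(2-h^2(U_x^k,U_y^k)\bigr)
\;=\; 1-\bigl(1-h^2(U_x^k,U_y^k)\bigr)^2 ,
\]
\[
\text{so}\quad \bigl(1-h^2(U_x,U_y)\bigr)^{2k} \;=\; \bigl(1-h^2(U_x^k,U_y^k)\bigr)^2 \;\le\; 1-(1-2\delta)^2 \;\le\; 4\delta ,
\]
\[
\text{hence}\quad k \;\ge\; \frac{\ln\bigl(1/(4\delta)\bigr)}{2\,\ln\bigl(1/(1-h^2(U_x,U_y))\bigr)}
\;=\; \Omega\!\left(\frac{1}{h^2(U_x,U_y)}\,\log\frac{1}{\delta}\right).
\]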
Communication Complexity
[Yao 79]
f: X × Y → Z
[Diagram: Alice holds x ∈ X, Bob holds y ∈ Y; using randomness $$ they exchange messages and output f(x,y).]
Rδ(f) = randomized CC of f with error δ
19
Multi-Party Communication
f: X1 × … × Xt → Z
[Diagram: players P1,…,Pt hold inputs x1,…,xt respectively and communicate to compute f(x1,…,xt).]
20
Example: Set-disjointness
t-party set-disjointness
Pi gets Si ⊆ [n]; Disjt = 1 if |∩i Si| ≥ 1, and Disjt = 0 if Si ∩ Sj = ∅ for all i ≠ j
Theorem [KS87, R90]: Rδ(Disj2) = Ω(n)
Theorem [AMS96]: Rδ(Disjt) = Ω(n/t^4)
Best upper bound: Rδ(Disjt) = O(n/t)
21
Restricted Communication Models
One-Way Communication [PS84, Ablayev 93, KNR95]
[Diagram: messages flow one way, P1 → P2 → … → Pt, and the last player outputs f(x1,…,xt).]
• Reduces to data stream computations
Simultaneous Communication [Yao 79]
[Diagram: P1, P2, …, Pt each send a single message to a referee, who outputs f(x1,…,xt).]
• Reduces to sketch computations
22
Example: Disjointness ⇒ Frequency Moments
k-th frequency moment
Input stream: a1,…,am ∈ [n]; for j ∈ [n], fj = # of occurrences of j in a1,…,am
Fk(a1,…,am) = Σ j∈[n] (fj)^k
Theorem [AMS96]: Disj_{n^{1/k}} ≤ Fk (t-party disjointness with t = n^{1/k} reduces to Fk)
Corollary: DS(Fk) = n^Ω(1), k > 5
Best upper bounds:
DS(Fk) = n^O(1), k > 2
DS(Fk) = O(log n), k = 0,1,2
23
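For reference (an illustration only), the definition of Fk as a few lines of Python; F0 counts distinct items, F1 is the stream length, and F2 is the self-join size.

from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum over j of (f_j)^k, where f_j = # of occurrences of j in the stream."""
    return sum(f ** k for f in Counter(stream).values())

stream = [1, 2, 2, 3, 3, 3]            # frequencies f = (1, 2, 3)
print(frequency_moment(stream, 0),     # F0 = 3 distinct items
      frequency_moment(stream, 1),     # F1 = 6, the stream length
      frequency_moment(stream, 2))     # F2 = 14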
Information Statistics Approach to
Communication Complexity
(with T.S. Jayram, R. Kumar, and D. Sivakumar, Manuscript 2002)
A novel lower bound technique for randomized CC
based on statistics and information theory
Applications
• General CC lower bounds
– t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
– Lp (solving an open problem of [Saks-Sun 02])
– Inner product
• One-way CC lower bounds
– t-party set-disjointness: Ω(n/t^(1+ε)) for any ε > 0
• Space lower bounds in the data stream model
– frequency moments: n^Ω(1), k > 2 (proving a conjecture of [AMS96])
– Lp distance
24
Statistical View of Communication Complexity
Π – a δ-error randomized protocol for f: X × Y → Z
Π(x,y) – distribution over transcripts
Lemma:
For any two input pairs (x,y), (x’,y’) with f(x,y) ≠ f(x’,y’),
V(Π(x,y), Π(x’,y’)) ≥ 1 – 2δ
Proof:
By reduction from hypothesis testing.
Corollary:
h²(Π(x,y), Π(x’,y’)) ≥ 1 – 2δ^½
25
Information Cost
[Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
For a protocol Π that computes f, how much information does Π(x,y) have to reveal about (x,y)?
μ = (X,Y) – a distribution over inputs of f
Definition: μ-information cost
icost_μ(Π) = I(X,Y ; Π(X,Y))
icost_{μ,δ}(f) = min_Π {icost_μ(Π)}
I(X,Y ; Π(X,Y)) ≤ H(Π(X,Y)) ≤ |Π(X,Y)|
⇒ An information cost lower bound gives a CC lower bound
26
Direct Sum for Information Cost
Decomposable functions:
f(x,y) = g(h(x1,y1),…,h(xn,yn)), where h: Xi × Yi → {0,1} and g: {0,1}^n → {0,1}
Example: set-disjointness
Disj2(x,y) = (x1 ∧ y1) ∨ … ∨ (xn ∧ yn)
Theorem (direct sum):
For appropriately chosen μ, μ’,
icost_{μ,δ}(f) ≥ n · icost_{μ’,δ}(h)
⇒ A lower bound on icost(h) gives a lower bound on icost(f)
27
Information Cost of Single-Bit Functions
In Disj2, μ’ = ½ μ’1 + ½ μ’2, where:
μ’1 = ½(1,0) + ½(0,0), μ’2 = ½(0,1) + ½(0,0)
Lemma 1: For any protocol Π for AND,
icost_{μ’}(Π) ≥ Ω(h²(Π(0,1), Π(1,0)))
Lemma 2: h²(Π(0,1), Π(1,0)) = h²(Π(1,1), Π(0,0))
Corollary 1: icost_{μ’,δ}(AND) ≥ Ω(1 – 2δ^½)
Corollary 2: icost_{μ,δ}(Disj2) ≥ Ω(n · (1 – 2δ^½))
28
Proof of Lemma 2
“Rectangle” property of deterministic protocols:
For any transcript a, the set of all (x,y) with Π(x,y) = a is a “combinatorial rectangle” S × T, where S ⊆ X and T ⊆ Y
“Rectangle” property of randomized protocols:
For all x ∈ X, y ∈ Y, there exist functions px: {0,1}* → [0,1] and qy: {0,1}* → [0,1] such that for any possible transcript a,
Pr(Π(x,y) = a) = px(a) · qy(a)
Therefore:
h²(Π(0,1), Π(1,0)) = 1 − Σa (Pr(Π(0,1) = a) · Pr(Π(1,0) = a))^½
= 1 − Σa (p0(a) · q1(a) · p1(a) · q0(a))^½ = h²(Π(0,0), Π(1,1))
29
Conclusions
• Studied limitations of computing on massive data sets
– Sampling computations
– Data stream computations
– Sketch computations
• Lower bound methodologies are based on
– Information theory
– Statistical decision theory
– Communication complexity
• Lower bound techniques:
– Reveal novel aspects of the models
– Present a “template” for obtaining specific lower bounds
30
Open Problems
• Sampling
– Lower bounds for non-symmetric functions
– Property testing lower bounds
• Communication complexity
– Study the communication complexity of approximations
– Tight lower bound for t-party set disjointness
– Under what circumstances are one-way and
simultaneous communication equivalent?
31
Thank You!
32
Yao’s Lemma
[Yao 83]
Definition: μ-distributional CC, D_{μ,δ}(f)
Complexity of the best deterministic protocol that computes f with error ≤ δ on inputs drawn according to μ
Yao's Lemma: Rδ(f) ≥ max_μ D_{μ,δ}(f)
• Convenient technique for proving randomized CC lower bounds
33
Communication Complexity Lower
Bounds via Information Theory
(with T.S. Jayram, R. Kumar, and D. Sivakumar, Complexity 2002)
• A novel information theory paradigm for proving
CC lower bounds
• Applications
– Characterization results (w.r.t. product distributions):
• 1-way ≈ simultaneous
• 2-party 1-way ≈ t-party 1-way
• VC-dimension characterization of t-party 1-way CC
– Optimal lower bounds for simultaneous CC
• t-party set-disjointness: Ω(n/t)
• Generalized addressing function
34
Information Theory
[Diagram: a sender transmits a message m ∈ M over a noisy channel; the receiver gets r ∈ R.]
• M – distribution of transmitted messages
• R – distribution of received messages
• Goal of receiver: reconstruct m from r
• δg: error probability of a reconstruction function g
For a Boolean M:
Fano's Inequality: for all g, H2(δg) ≥ H(M | R)
MLE Principle: δMLE ≤ H(M | R)
35
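A small numeric illustration (not from the talk; the channel parameters are arbitrary): for a toy noisy channel carrying a uniform Boolean message, the snippet computes H(M | R) and the error of MLE decoding, and checks both inequalities above.

import math

def H2(p):
    """Binary entropy function."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Toy joint distribution of a Boolean message M and a received symbol R
# (uniform M over a slightly asymmetric noisy channel).
joint = {(1, 1): 0.45, (1, 0): 0.05, (0, 1): 0.10, (0, 0): 0.40}
p_r = {r: joint[(0, r)] + joint[(1, r)] for r in (0, 1)}

cond_entropy = sum(p_r[r] * H2(joint[(1, r)] / p_r[r]) for r in (0, 1))   # H(M | R)
mle_error = sum(min(joint[(0, r)], joint[(1, r)]) for r in (0, 1))        # error of MLE reconstruction

print(mle_error <= cond_entropy)        # MLE principle: delta_MLE <= H(M | R)
print(H2(mle_error) >= cond_entropy)    # Fano: H2(delta_g) >= H(M | R) for any g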
Information Theory View of Distributional CC
[Diagram: “God” holds (x,y) and transmits f(x,y); Alice & Bob run the CC protocol and receive the transcript Π(x,y).]
• x,y are distributed according to μ = (X,Y)
• “God” transmits f(x,y) to Alice & Bob
• Alice & Bob receive the transcript Π(x,y)
• Fano's inequality: for any δ-error protocol Π for f,
H2(δ) ≥ H(f(X,Y) | Π(X,Y))
36
Simultaneous CC vs. One-Way CC
Theorem
For every product distribution μ over X × Y and every Boolean f,
D_{μ,2H2(δ),sim}(f) ≤ D_{μ,δ,A→B}(f) + D_{μ,δ,B→A}(f)
Proof
A(x) – message of A on x in a δ-error A→B protocol for f
B(y) – message of B on y in a δ-error B→A protocol for f
Construct a SIM protocol for f:
A → Referee: A(x)
B → Referee: B(y)
Referee outputs MLE(f(X,Y) | A(x), B(y))
37
Simultaneous CC vs. One-Way CC
Proof (cont.)
By the MLE Principle,
Pr_μ(MLE(f(X,Y) | A(X),B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X),B(Y))
By Fano,
H(f(X,Y) | A(X),Y) ≤ H2(δ) and H(f(X,Y) | X,B(Y)) ≤ H2(δ)
Lemma: for independent X,Y,
H(f(X,Y) | A(X),B(Y)) ≤ H(f(X,Y) | A(X),Y) + H(f(X,Y) | X,B(Y))
⇒ The protocol errs with probability at most 2H2(δ) □
38