Massive Data Sets and Information Theory


Massive Data Sets and Information Theory
Ziv Bar-Yossef
Department of Electrical Engineering
Technion
What are Massive Data Sets?
Technology: the World-Wide Web, IP packet flows, phone call logs
Science: astronomical sky surveys, weather data
Business: credit card transactions, billing records, supermarket sales
Nontraditional Challenges
Traditionally: cope with the complexity of the problem.
Massive data sets: cope with the complexity of the data.
New challenges:
• Restricted access to the data
• Not enough time to read the whole data
• Only a tiny fraction of the data can be held in main memory
Computing over Massive Data Sets
A computer program reads the data x (n is very large) and outputs an approximation of f(x).
Examples: mean, parity.
• An approximation of f(x) is sufficient
• The program can be randomized
Models for Computing over Massive Data Sets
• Sampling
• Data Streams
• Sketching
Sampling
The program queries a few data items (n is very large) and outputs an approximation of f(x).
Examples:
• Mean: O(1) queries
• Parity: n queries
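As an illustration of the sampling model (not from the talk), here is a minimal Python sketch that estimates the mean of a [0,1]-valued data set to within roughly ε using O(1/ε²) uniform random queries, independently of n; the function names and the constant in the query count are illustrative.

```python
import random

def estimate_mean(query, n, eps):
    """Estimate the mean of a data set with values in [0, 1] to within
    about eps, with high probability, using O(1/eps^2) uniform random
    queries -- independent of the data size n (Hoeffding's bound).

    `query(i)` returns the i-th data item."""
    q = int(4 / eps ** 2) + 1          # number of random queries
    total = 0.0
    for _ in range(q):
        i = random.randrange(n)        # uniformly random position
        total += query(i)
    return total / q

# Toy usage: here the "massive" data set is just an in-memory list.
data = [random.random() for _ in range(10 ** 6)]
approx_mean = estimate_mean(lambda i: data[i], len(data), eps=0.01)
```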
Data Streams
The program streams through the data (n is very large), using limited memory, and outputs an approximation of f(x).
Examples:
• Mean: O(1) memory
• Parity: 1 bit of memory
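A minimal sketch of the data-stream model for the two examples above (not from the talk): one pass over the data, keeping only a running sum and counter for the mean and a single bit for the parity.

```python
def stream_mean_and_parity(stream):
    """Single pass over a stream of 0/1 items: the mean needs only a
    running sum and a counter (O(1) words of memory), and the exact
    parity needs a single bit of state."""
    total, count, parity = 0, 0, 0
    for x in stream:
        total += x
        count += 1
        parity ^= x            # 1 bit of memory suffices for parity
    return total / count, parity

# Toy usage: the stream is generated on the fly and never stored.
mean, parity = stream_mean_and_parity(int(i % 3 == 0) for i in range(10 ** 6))
```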
Sketching
Compress each data segment (Data1, Data2) into a small "sketch" (Sketch1, Sketch2); compute over the sketches to obtain an approximation of f(Data1, Data2).
Examples:
• Equality: O(1) size sketch
• Hamming distance: O(1) size sketch
• Lp distance (p > 2): Θ(n^(1-2/p)) size sketch
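A minimal sketch (not from the talk) of the O(1)-size equality sketch via polynomial fingerprinting: both sides hash their data at the same shared random point, and the referee compares only the fingerprints; names and constants are illustrative.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime; all arithmetic is done mod P

def equality_sketch(data, r):
    """O(1)-size sketch for equality testing: a polynomial fingerprint
    of `data` evaluated at a shared random point r (mod the prime P).
    Two distinct inputs of length n collide with probability <= n/P."""
    h = 0
    for byte in data:
        h = (h * r + byte + 1) % P     # +1 avoids leading-zero collisions
    return h

r = random.randrange(1, P)              # shared public randomness
sketch1 = equality_sketch(b"massive data set #1", r)
sketch2 = equality_sketch(b"massive data set #2", r)
equal_guess = (sketch1 == sketch2)      # decided from the sketches alone
```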
Algorithms for Massive Data Sets
Sampling:
• Mean and other moments
• Median and other quantiles
• Volume estimations
• Graph problems
• Low-rank matrix approximations
Data Streams:
• Frequency moments
• Distinct elements
• Functional approximations
• Histograms
• Geometric problems
• Graph problems
• Database problems
Sketching:
• Equality
• Edit distance
• Hamming distance
• Lp distance
Our Goal
Study the limits of computing over massive data sets:
• Query complexity lower bounds
• Data stream memory lower bounds
• Sketch size lower bounds
Main Tools
Communication complexity, information theory, statistics
Communication Complexity [Yao 79]
Alice and Bob hold inputs a and b and exchange messages m1, m2, m3, m4, …; a referee computes f(a,b) from the "transcript" Π.
cost(Π) = Σ_i |m_i|
CC(f) = min_{Π : Π computes f} cost(Π)
Communication Complexity View of Sampling
Alice sends queries i1, i2, …; Bob, who holds the data x, answers with x[i1], x[i2], …; a referee computes an approximation of f(x) from the "transcript" Π.
cost(Π) = # of queries
QC(f) = min_{Π : Π computes f} cost(Π)
Information Complexity [Chakrabarti, Shi, Wirth, Yao 01]
Information complexity: the minimum amount of information a protocol that computes f has to reveal about its inputs.
μ: distribution on inputs to f
X: random variable with distribution μ
icost(Π) = I(X; Π(X))
IC_μ(f) = min_{Π : Π computes f} icost(Π)
Note: For some functions, any protocol must reveal much more information about X than just f(X).
CC Lower Bounds via IC Lower Bounds
Useful properties of information complexity:
• Lower bounds communication complexity
• Amenable to "direct sum" decompositions
Framework for bounding CC via IC:
1. Find an appropriate "hard input distribution" μ.
2. Prove a lower bound on IC_μ(f):
   a. Decomposition: decompose IC_μ(f) into "simple" information quantities.
   b. Basis: prove a lower bound on the simple quantities.
Applications of Information Complexity
• Data streams [Bar-Yossef, Jayram, Kumar, Sivakumar 02] [Chakrabarti, Khot, Sun 03]
• Sampling [Bar-Yossef 03]
• Communication complexity and decision tree complexity [Jayram, Kumar, Sivakumar 03]
• Quantum communication complexity [Jain, Radhakrishnan, Sen 03]
• Cell probe [Sen 03] [Chakrabarti, Regev 04]
• Simultaneous messages [Chakrabarti, Shi, Wirth, Yao 01]
The "Election Problem"
• Input: a sequence x of n votes to k parties
• f(x) = the vote distribution (e.g., for n = 18, k = 6: f(x) = (7/18, 4/18, 3/18, 2/18, 1/18, 1/18))
• Want to get D s.t. ||D – f(x)|| < ε
Theorem: QC(f) = Ω(k/ε²)
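For contrast with the lower bound, here is a hedged Python sketch (not from the talk) of the natural sampling upper bound: the empirical distribution of O(k/ε²) uniformly sampled votes is within ε of f(x) in L1 distance with constant probability; the constant and function names are illustrative.

```python
import random
from collections import Counter

def estimate_vote_distribution(query_vote, n, k, eps):
    """Estimate the vote distribution f(x) over k parties to within eps
    (in L1 distance, with constant probability) from O(k / eps^2)
    uniformly random vote queries -- the upper bound matching the
    Omega(k / eps^2) query lower bound; constants are illustrative."""
    q = int(4 * k / eps ** 2) + 1
    counts = Counter(query_vote(random.randrange(n)) for _ in range(q))
    return [counts[party] / q for party in range(k)]

# Toy usage: the slides' example with n = 18 votes and k = 6 parties.
votes = [0] * 7 + [1] * 4 + [2] * 3 + [3] * 2 + [4] + [5]
D = estimate_vote_distribution(lambda i: votes[i], len(votes), k=6, eps=0.1)
```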
Sampling Lower Bound
Lemma 1 (Normal form lemma): WLOG, in any protocol Π that computes the election problem, the queries are uniformly distributed and independent.
Π(x): transcript of a full protocol
Q(x): transcript of a single random query "protocol"
If cost(Π) = q, then Π(x) = (Q(x), …, Q(x)) (q independent copies).
Sampling Lower Bound (cont.)
Lemma 2 (Decomposition lemma): For any X, I(X; Π(X)) ≤ q · I(X; Q(X)).
I(X; Π(X)): information cost of Π w.r.t. X
I(X; Q(X)): information cost of Q w.r.t. X
Therefore, q ≥ I(X; Π(X)) / I(X; Q(X)).
Combinatorial Designs
[Figure: subsets B1, B2, B3 of a universe U]
A family of subsets B1, …, Bm of a universe U s.t.
1. Each of them constitutes half of U.
2. The intersection of each two of them is relatively small.
Fact: There exist designs of size exponential in |U| (via constant rate, constant relative minimum distance binary error-correcting codes).
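The existence fact can also be seen via the probabilistic method; the following Python sketch (an assumption of this write-up, not the talk's code-based construction) draws random half-size subsets and keeps a family whose pairwise intersections stay close to |U|/4. The size bound, slack, and names are illustrative.

```python
import itertools
import random

def random_design(universe_size, m, slack=0.15, trials=200):
    """Try to find m subsets of {0, ..., universe_size-1} such that each
    has size universe_size // 2 and every pairwise intersection has size
    at most (1/4 + slack) * universe_size.

    Random half-size subsets satisfy this with high probability as long
    as m is at most exponential in universe_size (probabilistic method);
    this sketch simply retries until a valid family is drawn."""
    universe = range(universe_size)
    bound = (0.25 + slack) * universe_size
    for _ in range(trials):
        family = [frozenset(random.sample(universe, universe_size // 2))
                  for _ in range(m)]
        if all(len(a & b) <= bound
               for a, b in itertools.combinations(family, 2)):
            return family
    return None   # unlucky draw; retry with more trials or larger slack

design = random_design(universe_size=40, m=16)
```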
Hard Input Distribution for the Election Problem
Let B1, …, Bm be a family of subsets of {1, …, k} that form a design of size m = 2^Ω(k).
X is uniformly chosen among x1, …, xm, where in xi:
• ½ + ε of the votes are split among parties in Bi.
• ½ − ε of the votes are split among parties in Bi^c.
1. Unique decoding: For every i ≠ j, ||f(xi) – f(xj)|| > 2ε. Therefore, I(X; Π(X)) = Ω(H(X)) = Ω(k).
2. Low diameter: I(X; Q(X)) = O(ε²).
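Putting the pieces together (a one-line derivation spelled out here for completeness, using the notation above):

```latex
q \;\ge\; \frac{I(X;\Pi(X))}{I(X;Q(X))}
  \;\ge\; \frac{\Omega(k)}{O(\varepsilon^{2})}
  \;=\; \Omega\!\left(\frac{k}{\varepsilon^{2}}\right),
\qquad \text{hence } QC(f) = \Omega(k/\varepsilon^{2}).
```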
Conclusions
• Information theory plays an increasingly major role in complexity theory.
• More applications of information complexity?
• Can we use deeper information theory in complexity theory?
Thank You