Lecture 3 (streaming)
Sketching, Sampling and other Sublinear Algorithms: Streaming
Alex Andoni (MSR SVC)
A scenario

Challenge: compute something on the table, using small space.

Example of "something":
• # distinct IPs
• max frequency
• other statistics…

IP              Frequency
131.107.65.14   3
18.9.22.69      2
80.97.56.20     2
128.112.128.81  9
127.0.0.1       8
257.2.5.7       0
7.8.20.13       1
Sublinear: a panacea?

A sub-linear space algorithm for solving the Travelling Salesperson Problem? Sorry, perhaps a different lecture.

It is hard to solve even very simple problems sublinearly. Ex: what is the count of distinct IPs seen?

Will settle for:
• Approximate algorithms: 1+𝜖 approximation
  true answer ≤ output ≤ (1+𝜖) ⋅ (true answer)
• Randomized: the above holds with probability 95%

A quick and dirty way to get a sense of the data.
Streaming data

• Data through a router
• Data stored on a hard drive, or streamed remotely
  More efficient to do a linear scan on the hard drive; working memory is the (smaller) main memory
Application areas

Data can come from:
• Network logs, sensor data
• Real time data
• Search queries, served ads
• Databases (query planning)
• …
Problem 1: # distinct elements

Problem: compute the number of distinct elements in the stream.
Trivial solution: 𝑂(𝑚) space for 𝑚 distinct elements.
Will see: 𝑂(log 𝑚) space (approximate).

Stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1
Distinct Elements: idea 1
[Flajolet-Martin'85, Alon-Matias-Szegedy'96]

Algorithm:
• Hash function ℎ: 𝑈 → [0,1]
• Compute 𝑚𝑖𝑛𝐻𝑎𝑠ℎ = min𝑖∈𝑆 ℎ(𝑖)
• Output 1/𝑚𝑖𝑛𝐻𝑎𝑠ℎ − 1
Repeats of the same element 𝑖 don't matter.

Algorithm DISTINCT:
  Initialize:
    minHash = 1
    hash function h into [0,1]
  Process(int i):
    if (h(i) < minHash)
      minHash = h(i);
  Output: 1/minHash - 1

"Analysis": the 𝑚 distinct hash values split [0,1] into 𝑚+1 segments, so
𝐸[𝑠𝑒𝑔𝑚𝑒𝑛𝑡] = 1/(𝑚 + 1), for 𝑚 distinct elements;
in particular, 𝐸[𝑚𝑖𝑛𝐻𝑎𝑠ℎ] = 1/(𝑚 + 1).
(Figure: for the stream 2 5 7 5 5, the points ℎ(5), ℎ(7), ℎ(2) on the interval [0,1].)
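A minimal Python sketch of idea 1 (not from the slides; it uses a salted blake2b hash as a stand-in for the random hash function h):

import hashlib

class Distinct:
    # Idea 1: track minHash = min over stream elements of h(i) in [0,1).
    def __init__(self, salt: bytes = b"seed-0"):
        self.salt = salt        # picking a salt = picking the hash function h
        self.min_hash = 1.0     # minHash, initialized to 1

    def h(self, i) -> float:
        # Hash element i to a pseudo-random point in [0,1).
        d = hashlib.blake2b(repr(i).encode(), salt=self.salt).digest()
        return int.from_bytes(d[:8], "big") / 2**64

    def process(self, i) -> None:
        # Repeats of the same element hash to the same value, so they don't matter.
        self.min_hash = min(self.min_hash, self.h(i))

    def output(self) -> float:
        # E[minHash] = 1/(m+1), so 1/minHash - 1 estimates m.
        return 1.0 / self.min_hash - 1.0

d = Distinct()
for x in [2, 5, 7, 5, 5]:
    d.process(x)
print(d.output())   # high-variance estimate of 3; average over hash functions to reduce error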
Distinct Elements: idea 2

Store 𝑚𝑖𝑛𝐻𝑎𝑠ℎ approximately: store just the count of zeros right after the binary point. Storing 𝑚𝑖𝑛𝐻𝑎𝑠ℎ exactly takes 𝑂(log 𝑛) bits; the zero count needs only 𝑂(log log 𝑛) bits.
Randomness: 2-wise independence is enough!

Algorithm DISTINCT:
  Initialize:
    minHash2 = 0
    hash function h into [0,1]
  Process(int i):
    if (h(i) < 1/2^minHash2)
      minHash2 = ZEROS(h(i));
  Output: 2^minHash2

Here ZEROS(x) counts the zeros right after the binary point, e.g. for x = 0.0000001100101, ZEROS(x) = 6.

Better accuracy using more space:
• error 1 + 𝜖: repeat 𝑂(1/𝜖²) times with different hash functions
• HyperLogLog: can do it with just one hash function [FFGM'07]
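A corresponding Python sketch for idea 2 (a toy version, assuming ZEROS counts the zeros right after the binary point; the hash is again a salted blake2b stand-in):

import hashlib

class DistinctSmall:
    # Idea 2: keep only the exponent ZEROS(minHash) instead of minHash itself.
    def __init__(self, salt: bytes = b"seed-0"):
        self.salt = salt
        self.min_hash2 = 0    # minHash is roughly 2**-min_hash2

    def h(self, i) -> float:
        d = hashlib.blake2b(repr(i).encode(), salt=self.salt).digest()
        return int.from_bytes(d[:8], "big") / 2**64

    def zeros(self, x: float) -> int:
        # Zeros right after the binary point: ZEROS(0.0000001100101...) = 6.
        n = 0
        while x < 0.5 and n < 64:   # cap guards against x == 0
            x *= 2.0
            n += 1
        return n

    def process(self, i) -> None:
        if self.h(i) < 2.0 ** -self.min_hash2:
            self.min_hash2 = self.zeros(self.h(i))

    def output(self) -> float:
        # 2**min_hash2 ~ 1/minHash: a constant-factor estimate of m.
        return 2.0 ** self.min_hash2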
Problem 2: max count / heavy hitters

Problem: compute the maximum frequency of an element in the stream.

Bad news: hard to distinguish whether an element repeated (max = 1 vs 2).

Good news: can find "heavy hitters", i.e. elements with frequency > (total frequency)/𝑠, using space proportional to 𝑠.

Stream: 2 5 7 5 5

i   Frequency
2   1
5   3
7   1
Heavy Hitters: CountMin
[Charikar-Chen-FarachColton'04, Cormode-Muthukrishnan'05]

Algorithm CountMin:
  Initialize(w, L):
    array Sketch[L][w]
    L hash functions h[L], into {0,…,w-1}
  Process(int i):
    for(j=0; j<L; j++)
      Sketch[j][ h[j](i) ] += 1;
  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for(j=0; j<L; j++)
        freq[i] = min(freq[i], Sketch[j][h[j](i)]);
    }
    // freq[] is the frequency estimate

(Figure: an 𝐿 × 𝑤 array of counters; each stream element, e.g. 2, is hashed by ℎ1, ℎ2, ℎ3 into one counter per row. For the example stream on the slide, the estimates are freq[2] = 1, freq[5] = 3, freq[7] = 1, freq[11] = 1.)
Heavy Hitters: analysis

Each row's counter Sketch[j][h[j](5)] = frequency of 5, plus "extra mass" from colliding elements.
Expected "extra mass" ≤ total mass / 𝑤.
Chebyshev: the bound holds with probability > 1/2 (per row).
𝐿 = 𝑂(log 𝑚) rows to get high probability (for all 𝑚 elements).
Compute heavy hitters from freq[].
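A runnable Python version of the CountMin pseudocode (the hash functions are again a blake2b stand-in, with one salt per row):

import hashlib

class CountMin:
    # An L x w array of counters, with one hash function per row.
    def __init__(self, w: int, L: int):
        self.w, self.L = w, L
        self.sketch = [[0] * w for _ in range(L)]

    def h(self, j: int, i) -> int:
        # j-th hash function, mapping element i into {0, ..., w-1}.
        d = hashlib.blake2b(repr(i).encode(), salt=j.to_bytes(8, "big")).digest()
        return int.from_bytes(d[:8], "big") % self.w

    def process(self, i) -> None:
        for j in range(self.L):
            self.sketch[j][self.h(j, i)] += 1

    def freq(self, i) -> int:
        # Collisions only add mass, so the min over rows is the best estimate.
        return min(self.sketch[j][self.h(j, i)] for j in range(self.L))

cm = CountMin(w=16, L=4)
for x in [2, 5, 7, 5, 5]:
    cm.process(x)
print(cm.freq(5), cm.freq(2))   # 3 and 1 (possibly overestimated)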
Problem 3: Moments

Problem: compute frequency moments:
• variance 𝐹2 = ∑𝑖 𝑓(𝑖)², or
• higher moments 𝐹𝑘 = ∑𝑖 𝑓(𝑖)ᵏ for 𝑘 > 2
Skewness (𝑘=3), kurtosis (𝑘=4), etc.
A different proxy for max: lim𝑘→∞ 𝐹𝑘^(1/𝑘) = max𝑖 𝑓(𝑖)

i   𝑓(𝑖)   𝑓(𝑖)²   𝑓(𝑖)⁴
2   1      1       1
5   3      9       81
7   2      4       16

𝐹2 = 1+9+4 = 14,   𝐹2^(1/2) ≈ 3.74
𝐹4 = 1+81+16 = 98,   𝐹4^(1/4) ≈ 3.15
(note how 𝐹𝑘^(1/𝑘) approaches max𝑖 𝑓(𝑖) = 3)
𝐹2 moment

Use the Johnson-Lindenstrauss lemma! (2nd lecture)
Store the sketch 𝑆 = 𝐺𝑓, where:
• 𝑓 = frequency vector
• 𝐺 = 𝑘 × 𝑛 matrix of Gaussian entries
Update on element 𝑖: 𝐺(𝑓 + 𝑒𝑖) = 𝐺𝑓 + 𝐺𝑒𝑖
Guarantees:
• 𝑘 = 𝑂(1/𝜖²) counters (words)
• 𝑂(𝑘) time to update
Better: ±1 entries, 𝑂(1) update time [AMS'96, TZ'04]
𝐹𝑘: precision sampling => next
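A toy NumPy version of this linear sketch with ±1 entries (here G is stored explicitly for clarity; a real streaming implementation would generate G's entries on the fly from a hash function):

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 400                 # k = O(1/eps^2) counters
G = rng.choice([-1.0, 1.0], size=(k, n))

# Maintain S = G f under stream updates without ever storing f:
# element i contributes G e_i, i.e. column i of G.
S = np.zeros(k)
stream = rng.integers(0, n, size=10_000)
for i in stream:
    S += G[:, i]                 # G(f + e_i) = G f + G e_i

f = np.bincount(stream, minlength=n).astype(float)
print(S @ S / k, f @ f)         # ||S||^2 / k estimates F2 = ||f||^2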
Scenario 2: distributed traffic

Statistics on the traffic difference/aggregate between two routers.
E.g.: the traffic differs by how many packets?

Linearity is the power!
Sketch(data 1) + Sketch(data 2) = Sketch(data 1 + data 2)
Sketch(data 1) − Sketch(data 2) = Sketch(data 1 − data 2)

Router 1:
IP             Frequency
131.107.65.14  1
18.9.22.69     1
35.8.10.140    1

Router 2:
IP             Frequency
131.107.65.14  1
18.9.22.69     2

Two sketches should be sufficient to compute something on the difference or sum.
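A small NumPy check of this linearity, reusing the ±1 sketch idea from above (the router frequency vectors here are synthetic):

import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 50
G = rng.choice([-1.0, 1.0], size=(k, n))        # shared sketch matrix

f1 = rng.integers(0, 5, size=n).astype(float)   # router 1 frequencies
f2 = rng.integers(0, 5, size=n).astype(float)   # router 2 frequencies

# Linearity: sketches add and subtract exactly like the underlying vectors.
assert np.allclose((G @ f1) - (G @ f2), G @ (f1 - f2))

# So the difference of the two sketches estimates ||f1 - f2||^2.
diff = G @ f1 - G @ f2
print(diff @ diff / k, (f1 - f2) @ (f1 - f2))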
Common primitive: estimate sum

Given: 𝑛 quantities 𝑎1, 𝑎2, … 𝑎𝑛 in the range [0,1].
Goal: estimate 𝑆 = 𝑎1 + 𝑎2 + ⋯ + 𝑎𝑛 "cheaply".

Standard sampling: pick a random set 𝐽 = {𝑗1, … 𝑗𝑚} of size 𝑚, and compute an estimate 𝑆̃ from the sampled values (e.g., from 𝑎1 and 𝑎3).
Estimator: 𝑆̃ = (𝑛/𝑚) ⋅ (𝑎𝑗1 + 𝑎𝑗2 + ⋯ + 𝑎𝑗𝑚)
Chebyshev bound: with 90% success probability,
(1/2)⋅𝑆 − 𝑂(𝑛/𝑚) < 𝑆̃ < 2⋅𝑆 + 𝑂(𝑛/𝑚)
For constant additive error, need 𝑚 = Ω(𝑛).
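A quick NumPy illustration of the standard sampling estimator (synthetic data):

import numpy as np

def sample_sum(a: np.ndarray, m: int, rng: np.random.Generator) -> float:
    # Estimator: (n/m) * sum of m randomly sampled terms.
    j = rng.choice(len(a), size=m, replace=False)
    return len(a) / m * a[j].sum()

rng = np.random.default_rng(2)
a = rng.random(10_000)                          # n quantities in [0,1]
print(a.sum(), sample_sum(a, m=100, rng=rng))   # a single noisy estimate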
Precision Sampling Framework

Alternative "access" to the 𝑎𝑖's: for each term 𝑎𝑖, we get a (rough) estimate 𝑎̃𝑖, up to some precision 𝑢𝑖 chosen in advance: |𝑎𝑖 − 𝑎̃𝑖| < 𝑢𝑖.

Challenge: achieve a good trade-off between:
• the quality of the approximation to 𝑆
• using only weak precisions 𝑢𝑖 (minimize the "cost" of estimating the 𝑎̃𝑖's)

Compute an estimate 𝑆̃ from 𝑎̃1, 𝑎̃2, 𝑎̃3, 𝑎̃4.
Formalization

Sum Estimator vs Adversary:
1. The Adversary fixes 𝑎1, 𝑎2, … 𝑎𝑛; the Estimator fixes the precisions 𝑢𝑖.
2. The Adversary fixes 𝑎̃1, … 𝑎̃𝑛 s.t. |𝑎𝑖 − 𝑎̃𝑖| < 𝑢𝑖.
3. Given 𝑎̃1, … 𝑎̃𝑛, the Estimator outputs 𝑆̃ s.t. |∑𝑖 𝑎𝑖 − 𝑆̃| < 1.

What is the cost? To achieve precision 𝑢𝑖, one uses 1/𝑢𝑖 "resources": e.g., if 𝑎𝑖 is itself a sum 𝑎𝑖 = ∑𝑗 𝑎𝑖𝑗 computed by subsampling, then one needs Θ(1/𝑢𝑖) samples. Here, average cost = (1/𝑛) ⋅ ∑𝑖 1/𝑢𝑖.
For example, one can choose all 𝑢𝑖 = 1/𝑛: average cost ≈ 𝑛.
Precision Sampling Lemma
[A-Krauthgamer-Onak'11]

Goal: estimate ∑𝑎𝑖 from {𝑎̃𝑖} satisfying |𝑎𝑖 − 𝑎̃𝑖| < 𝑢𝑖.

Precision Sampling Lemma: can get, with 90% success:
• 𝑂(1) additive error and 1.5 multiplicative error:
  𝑆 − 𝑂(1) < 𝑆̃ < 1.5⋅𝑆 + 𝑂(1),
  with average cost equal to 𝑂(log 𝑛)
• more generally, 𝜖 additive error and 1+𝜖 multiplicative error:
  𝑆 − 𝜖 < 𝑆̃ < (1+𝜖)⋅𝑆 + 𝜖,
  with average cost equal to 𝑂(𝜖⁻³ log 𝑛)

Example: distinguish ∑𝑎𝑖 = 3 vs ∑𝑎𝑖 = 0. Consider two extreme cases:
• if three 𝑎𝑖 = 1: enough to have crude approximations for all (𝑢𝑖 = 0.1)
• if all 𝑎𝑖 = 3/𝑛: only a few need good approximations (𝑢𝑖 = 1/𝑛), and the rest can have 𝑢𝑖 = 1
Precision Sampling Algorithm

(Lemma, restated: 𝜖 additive and 1+𝜖 multiplicative error with 90% success, at average cost 𝑂(𝜖⁻³ log 𝑛).)

Algorithm:
• choose each 𝑢𝑖 ∈ [0,1] i.i.d., distributed as the minimum of 𝑂(𝜖⁻³) uniform random variables
• Estimator: 𝑆̃ = count of the number of 𝑖's s.t. 𝑎̃𝑖/𝑢𝑖 > 6, up to a normalization constant (a concrete function of the 𝑎̃𝑖's and 𝑢𝑖's)

Proof of correctness:
• we use only those 𝑎̃𝑖 which are 1.5-approximations to 𝑎𝑖
• 𝐸[𝑆̃] ≈ ∑ Pr[𝑎𝑖/𝑢𝑖 > 6] = ∑ 𝑎𝑖/6
• E[1/ui] = O(log n) w.h.p.
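A toy Python version of the constant-factor regime (uniform 𝑢𝑖, threshold 6; random noise plays the role of the adversary here):

import random

def precision_sample_sum(a, seed=0):
    # Toy precision-sampling estimator (constant-factor regime).
    # We only ever see a_tilde_i with |a_i - a_tilde_i| < u_i; the estimator
    # counts indices with a_tilde_i / u_i > 6 and rescales by 6.
    rng = random.Random(seed)
    count = 0
    for a_i in a:
        u_i = rng.random()                       # precision chosen in advance
        a_tilde = a_i + rng.uniform(-u_i, u_i)   # any value this close is allowed
        if a_tilde / u_i > 6:
            count += 1
    return 6 * count                             # E[count] ~ sum(a_i) / 6

data = [0.5] * 300 + [0.0] * 9_700               # true sum = 150
print(precision_sample_sum(data))                # constant-factor estimate, w.c.p.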
Moments (𝐹𝑘) via precision sampling

Theorem: there is a linear sketch for 𝐹𝑘 with 𝑂(1) approximation and 𝑂(𝑛^(1−2/𝑘) log 𝑛) space (90% success probability).

Sketch:
• pick random 𝑢𝑖 ∈ [0,1] and 𝑟𝑖 ∈ {±1}, and let 𝑦𝑖 = 𝑥𝑖 ⋅ 𝑟𝑖 / 𝑢𝑖^(1/𝑘)
• throw the 𝑦𝑖's into one hash table 𝐻 with 𝑚 = 𝑂(𝑛^(1−2/𝑘) log 𝑛) cells
  (e.g., for 𝑥 = (𝑥1, …, 𝑥6): 𝐻 = [𝑦1+𝑦4, 𝑦3, 𝑦2+𝑦5+𝑦6])
• Estimator: max𝑗 |𝐻[𝑗]|ᵏ

Randomness: 𝑂(1)-wise independence suffices.
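A toy NumPy version of this sketch, under the reconstruction above (𝑦𝑖 = 𝑥𝑖𝑟𝑖/𝑢𝑖^(1/𝑘) and estimator max𝑗 |𝐻[𝑗]|ᵏ; fully random signs and hashes are used instead of 𝑂(1)-wise independence):

import numpy as np

def fk_estimate(x: np.ndarray, k: int, seed: int = 0) -> float:
    # y_i = x_i * r_i / u_i**(1/k), hashed into m = O(n**(1-2/k) log n) cells;
    # estimate F_k by max_j |H[j]|**k.
    n = len(x)
    rng = np.random.default_rng(seed)
    m = max(1, int(n ** (1 - 2 / k) * np.log(n)))
    u = np.maximum(rng.random(n), 1e-12)     # avoid u_i = 0
    r = rng.choice([-1.0, 1.0], size=n)
    cell = rng.integers(0, m, size=n)
    y = x * r / u ** (1 / k)
    H = np.zeros(m)
    np.add.at(H, cell, y)                    # each cell sums the y_i hashed into it
    return float(np.abs(H).max() ** k)

rng = np.random.default_rng(3)
x = rng.integers(0, 3, size=10_000).astype(float)
print(fk_estimate(x, k=4), np.sum(x ** 4))   # crude, constant-factor-style estimate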
Streaming++

LOTS of work in the area.

Surveys:
• Muthukrishnan: http://algo.research.googlepages.com/eight.ps
• McGregor: http://people.cs.umass.edu/~mcgregor/papers/08graphmining.pdf
• Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf
• Open problems: http://sublinear.info

Examples:
• Moments, sampling
• Median estimation, longest increasing sequence
• Graph algorithms, e.g., dynamic graph connectivity [AGG'12, KKM'13, …]
• Numerical algorithms (e.g., regression, SVD approximation)
  Fastest (sparse) regression […CW'13, MM'13, KN'13, LMP'13]
  Related to Compressed Sensing