CS 361A
(Advanced Data Structures and Algorithms)
Lectures 16 & 17
(Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani
Game Plan for Week
Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today
Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items
Synopsis Data Structures
 Synopses
Webster – a condensed statement or outline (as of a
narrative or treatise)
CS 361A – succinct data structure that lets us answer queries efficiently
 Synopsis Data Structures
“Lossy” Summary (of a data stream)
Advantages – fits in memory + easy to communicate
Disadvantage – lossiness implies approximation error
Negative Results ⇒ best we can do
Key Techniques – randomization and hashing
Numerical Examples
Approximate Query Processing [AQUA/Bell Labs]
Database Size – 420 MB
Synopsis Size – 420 KB (0.1%)
Approximation Error – within 10%
Running Time – 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya,
Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
Data Size – 10^9 items
Synopsis Size – 1249 items
Approximation Error – within 1%
Synopses
 Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable – for distributed scenario
 Applicability?
General-purpose – e.g. random samples
Specific-purpose – e.g. distinct values estimator
 Granularity?
Per database – e.g. sample of entire table
Per distinct value – e.g. customer profiles
Structural – e.g. GROUP-BY or JOIN result samples
Examples of Synopses
Synopses need not be fancy!
Simple Aggregates – e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items
Random Samples
Types of Samples
 Oblivious sampling – at item level
o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
 Value-based sampling – e.g. distinct-value samples
 Structured samples – e.g. join sampling
Naïve approach – keep samples of each relation
Problem – sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
(figure: relations L and R with join-attribute values A and B – what if A is sampled from L and B from R?)
Basic Scenario
Goal – maintain uniform sample of item-stream
Sampling Semantics?
Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded
Fixed-size sample without replacement
o Our focus today
Fixed-size sample with replacement
o Show – can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
Reservoir Sampling [Vitter]
Input – stream of items X1 , X2, X3, …
Goal – maintain uniform random sample S of
size n (without replacement) of stream so far
Reservoir Sampling
Initialize – include first n elements in S
Upon seeing item Xt
o Add Xt to S with probability n/t
o If added, evict random previous item
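A minimal sketch of this update rule (Python; the function and variable names are illustrative, not from Vitter's paper):

    import random

    def reservoir_sample(stream, n):
        # Maintain a uniform random sample S of size n (without replacement)
        # over the items seen so far.
        S = []
        for t, x in enumerate(stream, start=1):
            if t <= n:
                S.append(x)                    # first n items enter directly
            elif random.random() < n / t:      # include X_t with probability n/t
                S[random.randrange(n)] = x     # evict a uniformly random current member
        return S

For example, reservoir_sample(range(10**6), 100) keeps a uniform 100-item sample in one pass over the stream.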
Analysis
 Correctness?
Fact: At each instant, |S| = n
Theorem: At time t, any Xi ∈ S with probability n/t
Exercise – prove via induction on t
 Efficiency?
Let N be stream size
E[# updates to S] = n + Σ_{t=n+1..N} n/t = n(1 + H_N − H_n) = O(n ln(N/n))
Remark: Verify this is optimal.
 Naïve implementation ⇒ N coin flips ⇒ time O(N)
Improving Efficiency
(figure: stream X1 X2 … X14 with the items inserted into sample S marked, for n=3; the gaps between insertions are the jumps, e.g. J3=2 and J9=4)
 Random variable Jt – number of items jumped over after time t
 Idea – generate Jt directly and skip that many items
 Cumulative Distribution Function – F(s) = P[Jt ≤ s], for t>n & s≥0
F(s) = 1 − ∏_{T=t+1..t+s} (1 − n/T) = 1 − ∏_{T=t+1..t+s} (T−n)/T = 1 − (t+1−n)^(s) / (t+1)^(s)
where a^(b) = a(a+1)(a+2)···(a+b−1)
Analysis
Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating Jt?
Pick random number U ∈ [0,1]
Find smallest j such that U ≤ F(j)
How?
o Linear scan ⇒ O(N) time
o Binary search with Newton’s interpolation ⇒ O(n²(1 + polylog(N/n))) time
Remark – see paper for optimal algorithm
Sampling over Sliding Windows
[Babcock-Datar-Motwani]
Sliding Window W – last w items in stream
Model – item Xt expires at time t+w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal – Maintain uniform sample of size n of
sliding window
Reservoir Sampling?
 Observe
any item in sample S will expire eventually
must replace with random item of current window
 Problem
no access to items in W-S
storing entire window requires O(w) memory
 Oversampling
Backing sample B – select each item with probability Θ((n log w)/w)
Sample S – select n items from B at random
Upon expiry in S ⇒ replenish from B
Claim: n < |B| < n log w with high probability
Index-Set Approach
 Pick random index set I = {i1, …, in} ⊆ {0, 1, …, w-1}
 Sample S – items Xi with (i mod w) ∈ I in current window (see the code sketch below)
 Example
Suppose – w=2, n=1, and I={1}
Then – sample is always Xi with odd i
 Memory – only O(n)
 Observe
S is uniform random sample of each window
But sample is periodic (union of arithmetic progressions)
Correlation across successive windows
 Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
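A small sketch of the index-set scheme (Python; positions are numbered from 0 and the names are illustrative):

    import random

    def index_set_sample(stream, w, n):
        # Fix n residues I in {0,...,w-1}; the sample of the current window is
        # the most recent item seen at each selected residue, which always lies
        # among the last w items.
        I = set(random.sample(range(w), n))
        latest = {}
        for t, x in enumerate(stream):
            if t % w in I:
                latest[t % w] = x              # overwriting keeps the occurrence inside the current window
        return list(latest.values())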
Chain-Sample Algorithm
 Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1 – keep n independent such samples
 Chain-Sampling
Add Xt to S with probability 1/min{t,w} – evict earlier sample
Initially – standard Reservoir Sampling up to time w
Pre-select Xt’s replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}
o Xt expires ⇒ must replace from Wt+w
o At time r, save Xr and pre-select its own replacement ⇒
building “chain” of potential replacements
Note – if evicting earlier sample, discard its “chain” as well
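A rough sketch of chain sampling for sample size 1 (Python; run n independent copies for a size-n sample, and treat the bookkeeping as illustrative rather than the paper's exact pseudocode):

    import random

    def chain_sample(stream, w):
        sample_idx, sample_val = None, None
        chain = []            # saved replacements (index, value), oldest first
        pending = None        # index of the next replacement to save
        for t, x in enumerate(stream, start=1):
            if pending == t:
                chain.append((t, x))                      # save replacement, pre-select its successor
                pending = random.randint(t + 1, t + w)
            if sample_idx is not None and t - sample_idx >= w:
                sample_idx, sample_val = chain.pop(0)     # current sample expired: promote chain head
            if random.random() < 1.0 / min(t, w):
                sample_idx, sample_val = t, x             # new sample evicts the old one and its chain
                chain = []
                pending = random.randint(t + 1, t + w)
        return sample_val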
Example
Stream: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3
(figure: four snapshots of this stream showing the current sample and its chain of pre-selected replacements as the window slides)
Expectation for Chain-Sample
 T(x) = E[chain length for Xt at time t+x]
T(x) = 1 for x ≤ 1
T(x) = 1 + (1/w)·Σ_{i<x} T(i) for x > 1
 E[chain length] = T(w) ≤ e ≈ 2.718
 E[memory required for sample size n] = O(n)
Tail Bound for Chain-Sample
 Chain = “hops” of total length at most w
 Chain of h hops ⇒ ordered (h+1)-partition of w
h hops of total length less than w
plus the remainder
 Each partition has probability w^(−h)
 Number of partitions: C(w, h) ≤ (ew/h)^h
 h = O(log w) ⇒ probability of a partition is O(w^(−c))
 Thus – memory O(n log w) with high probability
Comparison of Algorithms
Algorithm      Expected memory   High-probability memory
Periodic       O(n)              O(n)
Oversample     O(n log w)        O(n log w)
Chain-Sample   O(n)              O(n log w)
Chain-Sample beats Oversample:
Expected memory – O(n) vs O(n log w)
High-probability memory bound – both O(n log w)
Oversample may have sample size shrink below n!
Sketches
and
Frequency Moments
Generalized Stream Model
Data stream: 2, 0, 1, 3, 1, 2, 4, . . .
(figure: frequency vector m = (m0, m1, m2, m3, m4) = (1, 2, 2, 1, 1) for this prefix)
Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value – captures deletions
Example
On seeing element (i,a) = (2,2): m2 increases from 2 to 4.
On seeing element (i,a) = (1,-1): m1 decreases from 2 to 1.
(figure: the two updated frequency vectors)
Frequency Moments
 Input Stream
values from U = {0,1,…,N-1}
frequency vector m = (m0,m1,…,mN-1)
 Kth Frequency Moment Fk(m) = Σi mi^k
F0: number of distinct values (Lecture 15)
F1: stream size
F2: Gini index, self-join size, Euclidean norm
Fk: for k>2, measures skew, sometimes useful
F∞: maximum frequency
 Problem – estimation in small space
 Sketches – randomized estimators
Naive Approaches
Space N – counter mi for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation
Sketching F2
 Random Hash h(i): {0,1,…,N-1} → {-1,+1}
 Define Zi = h(i)
 Maintain X = Σi miZi
Easy for update streams (i,a) – just add aZi to X
 Claim: X² is an unbiased estimator for F2
Proof: E[X²] = E[(Σi miZi)²]
= E[Σi mi²Zi²] + E[Σi≠j mimjZiZj]
= Σi mi²E[Zi²] + Σi≠j mimjE[Zi]E[Zj]   (from independence)
= Σi mi² + 0 = F2
 Last line? – Zi² = 1 and E[Zi] = 0 since Zi is uniform on {-1,+1}
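A single-counter sketch of this estimator (Python; it stores a fully random sign per value for clarity, which is exactly the Ω(N)-memory issue addressed a few slides later via pairwise/4-wise independent hashing):

    import random
    from collections import defaultdict

    class F2Sketch:
        # X = sum_i m_i * Z_i with Z_i uniform in {-1,+1}; X*X is unbiased for F2.
        def __init__(self, seed=0):
            rng = random.Random(seed)
            self.signs = defaultdict(lambda: rng.choice((-1, +1)))   # Z_i, drawn lazily
            self.X = 0
        def update(self, i, a=1):
            # stream element (i, a): m_i += a
            self.X += a * self.signs[i]
        def estimate(self):
            return self.X * self.X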
Estimation Error?
 Chebyshev bound: P[ |Y − E[Y]| > λE[Y] ] ≤ Var[Y] / (λ²E[Y]²)
 Define Y = X² ⇒ E[Y] = E[X²] = Σi mi² = F2
 Observe E[X⁴] = E[(Σi miZi)⁴]
= E[Σ mi⁴Zi⁴] + 4E[Σ mimj³ZiZj³] + 6E[Σ mi²mj²Zi²Zj²] + 12E[Σ mimjmk²ZiZjZk²] + 24E[Σ mimjmkmlZiZjZkZl]
= Σi mi⁴ + 6Σi<j mi²mj²
Why?
 By definition Var[Y] = E[Y²] – E[Y]² = E[X⁴] – E[X²]²
= [Σi mi⁴ + 6Σi<j mi²mj²] – [Σi mi⁴ + 2Σi<j mi²mj²]
= 4Σi<j mi²mj² ≤ 2E[X²]² = 2F2²
Estimation Error?
Var Y
 Chebyshev bound P Y  EY  λEY  2
2
λ EY 
2
2F2
2


 P [relative estimation error >λ] λ 2 F 2 λ 2
2
 Problem – What if we want λ really small?
 Solution
Compute s = 8/λ2 independent copies of X
Estimator Y = mean(Xi2)
Variance reduces by factor s
2
2F
1
 P [relative estimation error >λ]  2 2 2 
4
sλ F2
CS 361A
29
Boosting Technique
 Algorithm A: Randomized λ-approximate estimator f
P[(1- λ)f* ≤ f ≤ (1+ λ)f*] = 3/4
 Heavy-Tail Problem: the estimate could equal f*−z, f*, f*+z with probabilities 1/16, 3/4, 3/16 – averages are dragged off by the tails, so use a median
 Boosting Idea
O(log 1/ε) independent estimates from A(X)
Return median of estimates
 Claim: P[median is λ-approximate] >1- ε
Proof:
P[specific estimate is λ-approximate] = ¾
Bad event only if >50% estimates not λ-approximate
Binomial tail – probability less than ε
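A sketch of the resulting median-of-means combination (Python; sketch_factory and the parameter choices are illustrative – any basic estimator with update/estimate methods, such as the F2Sketch above, can be plugged in):

    import statistics

    def boosted_f2(stream, lam, copies, sketch_factory):
        # copies = O(log 1/eps) groups, each averaging s = 8/lam^2 independent
        # sketches; the median of the group means is within (1 +/- lam) of F2
        # with probability at least 1 - eps.
        s = max(1, int(8 / lam ** 2))
        groups = [[sketch_factory() for _ in range(s)] for _ in range(copies)]
        for i, a in stream:                    # feed every update to every sketch
            for group in groups:
                for sk in group:
                    sk.update(i, a)
        means = [statistics.mean(sk.estimate() for sk in group) for group in groups]
        return statistics.median(means)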
Overall Space Requirement
Observe
Let m = Σmi
Each hash needs an O(log m)-bit counter
s = 8/λ² hash functions for each estimator
O(log 1/ε) such estimators
Total O(λ⁻² · log(1/ε) · log m) bits
Question – Space for storing hash function?
Sketching Paradigm
 Random Sketch: inner product ⟨f, Z⟩ = Σi f(i)·Zi
frequency vector m = (m0,m1,…,mN-1)
random vector Z (currently, uniform {-1,+1})
 Observe
Linearity ⇒ Sketch(m1) ± Sketch(m2) = Sketch(m1 ± m2)
Ideal for distributed computing
 Observe
Suppose: Given i, can efficiently generate Zi
Then: can maintain sketch for update streams
Problem
o Must generate Zi=h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits
Two birds, One stone
 Pairwise Independent Z1,Z2, …, Zn
for all Zi and Zk, P[Zi=x, Zk=y] = P[Zi=x].P[Zk=y]
⇒ property E[ZiZk] = E[Zi]·E[Zk]
 Example – linear hash function
Seed S=<a,b> from [0..p-1], where p is prime
Zi = h(i) = ai+b (mod p)
 Claim: Z1, Z2, …, Zn are pairwise independent
Zi=x and Zk=y ⇒ x=ai+b (mod p) and y=ak+b (mod p)
fixing i, k, x, y ⇒ unique solution for a, b
P[Zi=x, Zk=y] = 1/p² = P[Zi=x]·P[Zk=y]
 Memory/Randomness: n log p → 2 log p
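A small sketch of this seeded linear hash (Python; the final reduction to {-1,+1} is an added illustration for connecting back to the sign sketch, not part of the slide's construction, and it is only approximately unbiased):

    import random

    class PairwiseHash:
        # h(i) = a*i + b (mod p); over the random seed (a, b), any two outputs
        # h(i), h(k) with i != k (mod p) are independent.
        def __init__(self, p=2_147_483_647):   # p must be prime (a Mersenne prime here)
            self.p = p
            self.a = random.randrange(p)
            self.b = random.randrange(p)
        def value(self, i):
            return (self.a * i + self.b) % self.p
        def sign(self, i):
            # illustrative reduction of h(i) to {-1,+1}
            return 1 if self.value(i) % 2 == 0 else -1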
Wait a minute!
 Doesn’t pairwise independence screw up proofs?
 No – the E[X²] calculation only has degree-2 terms
 But – what about Var[X²]?
 Need 4-wise independence
Application – Join-Size Estimation
 Given
Join attribute frequencies f1 and f2
Join size = f1.f2
 Define – X1 = f1.Z and X2 = f2.Z
 Choose – Z as 4-wise independent & uniform {-1,1}
 Exercise: Show, as before,
E[X1 X2] = f1·f2
Var[X1 X2] ≤ 2·|f1|²·|f2|²
Hint: a·b ≤ |a|·|b| (Cauchy–Schwarz)
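A sketch of the join-size estimator (Python; fully random signs stand in for the 4-wise independent Z the slide calls for, and the method names are illustrative):

    import random
    from collections import defaultdict

    class JoinSizeSketch:
        # X1 = f1.Z and X2 = f2.Z share one sign vector Z; X1*X2 estimates f1.f2.
        def __init__(self, seed=0):
            rng = random.Random(seed)
            self.signs = defaultdict(lambda: rng.choice((-1, +1)))
            self.X1 = 0
            self.X2 = 0
        def update_r1(self, i, a=1):           # value i occurs a more times in R1's join column
            self.X1 += a * self.signs[i]
        def update_r2(self, i, a=1):           # value i occurs a more times in R2's join column
            self.X2 += a * self.signs[i]
        def estimate(self):
            return self.X1 * self.X2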
Bounding Error Probability
 Using s copies of X’s & taking their mean Y
Pr[ |Y − f1·f2| ≥ λ·f1·f2 ] ≤ Var(Y) / (λ²(f1·f2)²)
≤ 2·|f1|²·|f2|² / (sλ²(f1·f2)²)
= 2 / (sλ²cos²θ)
 Bounding error probability?
Need – s > 2/(λ²cos²θ)
 Memory? – O( λ⁻² · cos⁻²θ · log(1/ε) · (log N + log m) )
 Problem
To choose s – need an a-priori lower bound on cos θ = f1·f2 / (|f1|·|f2|)
What if cos θ is really small?
Sketch Partitioning
Idea for dealing with the |f1|²|f2|²/(f1·f2)² issue – partition the domain into regions whose self-join sizes are smaller, to compensate for a small join size (cos θ)
(figure: join-attribute frequency distributions over dom(R1.A) and dom(R2.B))
Without partitioning: self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K
With a two-region partition: Σ over regions of self-join(R1.A) × self-join(R2.B) = 200×5 + 200×5 = 2K
Sketch Partitioning
Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances
Sketch Partitioning
 Partition Space Allocation?
Can solve optimally, given domain partition
 Optimal Partition: Find the K-partition minimizing Σ_{i=1..K} Var[Xi], where each Var[Xi] is governed by the self-join sizes within partition i
 Results
Dynamic Programming – optimal solution for single join
NP-hard – for queries with multiple joins
Fk for k > 2
Assume – stream length m is known
(Exercise: Show can fix with log m space overhead by
repeated-doubling estimate of m.)
Choose – random stream item ap, with p uniform from {1,2,…,m}
Suppose – ap = v ∈ {0,1,…,N-1}
Count subsequent frequency of v
r = | {q : q ≥ p, aq = v} |
Define X = m·(r^k – (r-1)^k)
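A sketch of one copy of this estimator, assuming the stream is stored and its length m is known (Python; a true streaming version would instead pick the position p with a size-1 reservoir sample):

    import random

    def fk_basic_estimate(stream, k):
        # X = m * (r^k - (r-1)^k); E[X] = Fk.  Average many independent copies
        # and take a median of the averages to control the error.
        m = len(stream)
        p = random.randrange(m)                                  # uniform position
        v = stream[p]
        r = sum(1 for q in range(p, m) if stream[q] == v)        # frequency of v from p onward
        return m * (r ** k - (r - 1) ** k)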
Example
Stream
7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
m = 20
p = 9
ap = 5
r = 3
Fk for k > 2
E[X] = (m/m)·[ 1^k + (2^k − 1^k) + … + (m1^k − (m1−1)^k)
+ 1^k + (2^k − 1^k) + … + (m2^k − (m2−1)^k)
+ …
+ 1^k + (2^k − 1^k) + … + (mn^k − (mn−1)^k) ]
= Σi mi^k = Fk
(summing over the m equally likely choices of stream position, grouped by distinct value)
 Var(X) ≤ k·N^(1−1/k)·Fk²
 Bounded error probability ⇒ s = O(k·N^(1−1/k) / λ²)
 Boosting ⇒ memory bound O(k·N^(1−1/k)·λ⁻²·(log 1/ε)·(log N + log m))
Frequency Moments
F0 – distinct values problem (Lecture 15)
F1 – sequence length
for case with deletions, use Cauchy distribution
F2 – self-join size/Gini index (Today)
Fk for k >2
omitting grungy details
can achieve space bound
O(k·N^(1−1/k)·λ⁻²·(log 1/ε)·(log N + log m))
F∞ – maximum frequency
Communication Complexity
ALICE
input A
BOB
input B
 Cooperatively compute function f(A,B)
Minimize bits communicated
Unbounded computational power
 Communication Complexity C(f) – bits exchanged by
optimal protocol Π
 Protocols?
1-way versus 2-way
deterministic versus randomized
 Cδ(f) – randomized complexity for error probability δ
Streaming & Communication Complexity
 Stream Algorithm ⇒ 1-way communication protocol
 Simulation Argument
Given – algorithm S computing f over streams
Alice – initiates S, providing A as input stream prefix
Communicates to Bob – S’s state after seeing A
Bob – resumes S, providing B as input stream suffix
 Theorem – Stream algorithm’s space requirement is
at least the communication complexity C(f)
Example: Set Disjointness
Set Disjointness (DIS)
A, B subsets of {1,2,…,N}
Output 1 A  B  φ

0 A  B  φ
Theorem: Cδ(DIS) = Ω(N), for any δ<1/2
Lower Bound for F∞
Theorem: Fix ε<1/3, δ<1/2. Any stream algorithm S with
P[ (1-ε)F∞ < S < (1+ε)F∞ ] > 1-δ
needs Ω(N) space
Proof
Claim: S ⇒ 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S’s state to Bob
Bob streams set B to S
Observe
F∞ = 2 if A ∩ B ≠ ∅
F∞ = 1 if A ∩ B = ∅
Relative error ε < 1/3 ⇒ DIS solved exactly!
Error probability δ < ½ ⇒ Ω(N) bits must be communicated ⇒ Ω(N) space
Extensions
 Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms
 Lower Bound for Fk, k>2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players
Theorem: Fix ε,δ<½ and k > 5. Any stream algorithm S with
P[ (1-ε)Fk < S < (1+ε)Fk ] > 1-δ
needs Ω(N^(1−(2+δ)/k)) space
Implies Ω(N^(1/2)) even for multi-pass algorithms
Tracking
High-Frequency Items
Problem 1 – Top-K List
[Charikar-Chen-Farach-Colton]
The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, …
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch – maintains count-estimates of high-frequency elements
Definitions
 Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream
 FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing the top k
Naïve sampling gives a solution with p = O((m log k) / mk)
 FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε)·mk
Naïve sampling gives no solution
Main Idea
 Consider
single counter X
hash function h(i): {1, 2,…,N} → {-1,+1}
 Input element i ⇒ update counter X += Zi = h(i)
 For each r, use X·Zr as estimator of mr
Theorem: E[XZr] = mr
Proof
X = Σi miZi
E[XZr] = E[Σi miZiZr] = Σi miE[ZiZr] = mrE[Zr²] = mr
Cross-terms cancel
Finding Max Frequency Element
 Problem – Var[X] = F2 = Σi mi²
 Idea – t counters, independent 4-wise hashes h1,…,ht
h1, …, ht : i → {+1, –1}
 Use t = O(log m · Σi mi² / (m1)²)
 Claim: New variance ≤ Σi mi² / t = (m1)² / log m
 Overall Estimator
repeat + median of averages
with high probability, approximate m1
Problem with “Array of Counters”
Variance – dominated by highest frequency
Estimates for less-frequent elements (e.g. the kth)
corrupted by higher frequencies
variance >> mk
Avoiding Collisions?
spread out high frequency elements
replace each counter with hashtable of b counters
Count Sketch
 Hash Functions
4-wise independent hashes h1,...,ht and s1,…,st
hashes independent of each other
 Data structure: hashtables of counters X(r,c)
(figure: t hashtables, each a row of b counters numbered 1 … b)
s1, …, st : i → {1, …, b}
h1, …, ht : i → {+1, -1}
Overall Algorithm
 sr(i) – one of b counters in rth hashtable
 Input i ⇒ for each r, update X(r,sr(i)) += hr(i)
 Estimator(mi) = medianr { X(r,sr(i)) • hr(i) }
 Maintain heap of k top elements seen so far
 Observe
Collisions with high-frequency items are not completely eliminated
A few of the estimates X(r,sr(i)) • hr(i) could have high variance
Median not sensitive to these poor estimates
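A compact sketch of the Count-Sketch structure (Python; Python's built-in hash with per-row seeds stands in for the 4-wise independent hashes, and the heap of the current top-k elements is omitted):

    import random
    import statistics

    class CountSketch:
        # t hashtables of b counters; the estimate of m_i is the median over
        # rows r of X[r][s_r(i)] * h_r(i).
        def __init__(self, t, b, seed=0):
            rng = random.Random(seed)
            self.t, self.b = t, b
            self.row_seeds = [(rng.random(), rng.random()) for _ in range(t)]
            self.X = [[0] * b for _ in range(t)]
        def _bucket(self, r, i):               # s_r(i) in {0, ..., b-1}
            return hash((self.row_seeds[r][0], i)) % self.b
        def _sign(self, r, i):                 # h_r(i) in {-1, +1}
            return 1 if hash((self.row_seeds[r][1], i)) % 2 == 0 else -1
        def update(self, i, a=1):
            for r in range(self.t):
                self.X[r][self._bucket(r, i)] += a * self._sign(r, i)
        def estimate(self, i):
            return statistics.median(self.X[r][self._bucket(r, i)] * self._sign(r, i)
                                     for r in range(self.t))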
Avoiding Large Items
 b = Ω(k) ⇒ with probability Ω(1), no collision with top-k elements
 t hashtables represent independent trials
 Need O(log(m/δ)) trials to estimate with probability 1−δ
 Also need – small variance for colliding small elements
 Claim:
P[variance due to small items in each estimate < (Σi>k mi²)/b] = Ω(1)
 Final bound: b = O(k + Σi>k mi² / (mk)²)
Final Results
Zipfian Distribution: mi ∝ 1/i^z [Power Law]
FindApproxTop
Space: O( [k + (Σi>k mi²) / (mk)²] · log(m/δ) )
Roughly: the sampling bound, but with frequencies squared
Zipfian – gives improved results
FindCandidateTop
Zipf parameter 0.5 ⇒ O(k log N log m)
Compare: sampling bound O((kN)^0.5 log k)
Problem 2 – Elephants-and-Ants
[Manku-Motwani]
Stream
 Identify items whose current frequency exceeds
support threshold s = 0.1%.
[Jacobson 2000, Estan-Verghese 2001]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Window 1
Window 2
Window 3
Window-size w is function of support s – specify later…
Lossy Counting in Action ...
(figure: frequency counts built up from the empty state over the first window)
At window boundary, decrement all counters by 1
Lossy Counting (continued)
(figure: frequency counts updated with the next window’s items)
At window boundary, decrement all counters by 1
Error Analysis
How much do we undercount?
If current size of stream = N and window-size w = 1/ε,
then frequency error ≤ # windows = εN
Rule of thumb:
Set ε = 10% of support s
Example:
Given support frequency s = 1%,
set error frequency
ε = 0.1%
Putting it all together…
Output:
Elements with counter values exceeding (s-ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s–ε)N
How many counters do we need?
 Worst case bound: (1/ε)·log(εN) counters
 Implementation details…
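A minimal sketch of the windowed algorithm just described (Python; the per-entry bookkeeping of the full paper is simplified and the names are illustrative):

    def lossy_counting(stream, s, eps):
        # Window size w = 1/eps; at each window boundary decrement every counter
        # and drop those that reach zero.  Report items with count > (s - eps)*N.
        w = int(1 / eps)
        counts, N = {}, 0
        for x in stream:
            N += 1
            counts[x] = counts.get(x, 0) + 1
            if N % w == 0:                     # window boundary
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        return [x for x, c in counts.items() if c > (s - eps) * N]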
Number of Counters?
 Window size w = 1/ε
 Number of windows m = εN
 ni – # counters alive over last i windows
 Fact: Σ_{i=1..j} i·ni ≤ j·w, for j = 1,2,…,m
 Claim: Σ_{i=1..j} ni ≤ Σ_{i=1..j} w/i, for j = 1,2,…,m
Counter must average 1 increment/window to survive
 ⇒ # active counters = Σ_{i=1..m} ni ≤ Σ_{i=1..m} w/i ≤ w·log m = (1/ε)·log(εN)
Enhancements
Frequency Errors
For counter (X, c), true frequency in [c, c+εN]
Trick: Track number of windows t counter has been active
For counter (X, c, t), true frequency in [c, c+t-1]
If (t = 1), no error!
Batch Processing
Decrements after k windows
Algorithm 2: Sticky Sampling
Stream
 Create counters by sampling
 Maintain exact counts thereafter
(figure: items sampled from the stream, each with an exact counter)
What is sampling rate?
Sticky Sampling (continued)
For finite stream of length N
Sampling rate = (2/εN)·log(1/(sδ))
δ = probability of failure
Output:
Elements with counter values exceeding (s−ε)N
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s−ε)N
Same rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Same error guarantees as Lossy Counting, but probabilistic
Number of counters?
Finite stream of length N
Sampling rate: (2/εN)·log(1/(sδ))
Infinite stream with unknown N
Gradually adjust sampling rate
In either case,
Expected number of counters = (2/ε)·log(1/(sδ))
Independent of N
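A rough sketch of the finite-stream version (Python; the rate follows the slide's formula, the adaptive-rate variant for unknown N is omitted, and the names are illustrative):

    import math
    import random

    def sticky_sampling(stream, N, s, eps, delta):
        # Create a counter for an unseen item with the sampling probability below;
        # once a counter exists, count that item exactly.
        rate = (2.0 / (eps * N)) * math.log(1.0 / (s * delta))
        counts = {}
        for x in stream:
            if x in counts:
                counts[x] += 1
            elif random.random() < rate:
                counts[x] = 1
        return [x for x, c in counts.items() if c > (s - eps) * N]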
References – Synopses
 Synopsis data structures for massive data sets. Gibbons and
Matias, DIMACS 1999.
 Tracking Join and Self-Join Sizes in Limited Storage, Alon,
Gibbons, Matias, and Szegedy. PODS 1999.
 Join Synopses for Approximate Query Answering, Acharya,
Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
 Random Sampling for Histogram Construction: How much is
enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
 Random Sampling Techniques for Space Efficient Online
Computation of Order Statistics of Large Datasets, Manku,
Rajagopalan, and Lindsay. SIGMOD 1999.
 Space-efficient online computation of quantile summaries,
Greenwald and Khanna. SIGMOD 2001.
References – Sampling
 Random Sampling with a Reservoir, Vitter. Transactions on
Mathematical Software 11(1):37-57 (1985).
 On Sampling and Relational Operators. Chaudhuri and Motwani.
Bulletin of the Technical Committee on Data Engineering (1999).
 On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya.
SIGMOD 1999.
 Congressional Samples for Approximate Answering of Group-By
Queries, Acharya, Gibbons, and Poosala. SIGMOD 2000.
 Overcoming Limitations of Sampling for Aggregation Queries,
Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE 2001.
 A Robust Optimization-Based Approach for Approximate Answering
of Aggregate Queries, Chaudhuri, Das and Narasayya. SIGMOD 01.
 Sampling From a Moving Window Over Streaming Data. Babcock,
Datar, and Motwani. SODA 2002.
 Sampling algorithms: lower bounds and applications. Bar-Yossef–
Kumar–Sivakumar. STOC 2001.
References – Sketches
 Probabilistic counting algorithms for data base applications.
Flajolet and Martin. JCSS (1985).
 The space complexity of approximating the frequency
moments. Alon, Matias, and Szegedy. STOC 1996.
 Approximate Frequency Counts over Streaming Data. Manku
and Motwani. VLDB 2002.
 Finding Frequent Items in Data Streams. Charikar, Chen, and
Farach-Colton. ICALP 2002.
 An Approximate L1-Difference Algorithm for Massive Data
Streams. Feigenbaum, Kannan, Strauss, and Viswanathan.
FOCS 1999.
 Stable Distributions, Pseudorandom Generators, Embeddings
and Data Stream Computation. Indyk. FOCS 2000.