CS 361A
(Advanced Data Structures and Algorithms)
Lectures 16 & 17
(Nov 16 and 28, 2005)
Synopses, Samples, and Sketches
Rajeev Motwani
Game Plan for Week
Last Class
Models for Streaming/Massive Data Sets
Negative results for Exact Distinct Values
Hashing for Approximate Distinct Values
Today
Synopsis Data Structures
Sampling Techniques
Frequency Moments Problem
Sketching Techniques
Finding High-Frequency Items
Synopsis Data Structures
Synopses
Webster – a condensed statement or outline (as of a
narrative or treatise)
CS 361A – a succinct data structure that lets us answer queries efficiently
Synopsis Data Structures
“Lossy” Summary (of a data stream)
Advantages – fits in memory + easy to communicate
Disadvantage – lossiness implies approximation error
Negative Results ⇒ best we can do
Key Techniques – randomization and hashing
Numerical Examples
Approximate Query Processing [AQUA/Bell Labs]
Database Size – 420 MB
Synopsis Size – 420 KB (0.1%)
Approximation Error – within 10%
Running Time – 0.3% of time for exact query
Histograms/Quantiles [Chaudhuri-Motwani-Narasayya,
Manku-Rajagopalan-Lindsay, Khanna-Greenwald]
Data Size – 109 items
Synopsis Size – 1249 items
Approximation Error – within 1%
Synopses
Desiderata
Small Memory Footprint
Quick Update and Query
Provable, low-error guarantees
Composable – for distributed scenario
Applicability?
General-purpose – e.g. random samples
Specific-purpose – e.g. distinct values estimator
Granularity?
Per database – e.g. sample of entire table
Per distinct value – e.g. customer profiles
Structural – e.g. GROUP-BY or JOIN result samples
Examples of Synopses
Synopses need not be fancy!
Simple Aggregates – e.g. mean/median/max/min
Variance?
Random Samples
Aggregates on small samples represent entire data
Leverage extensive work on confidence intervals
Random Sketches
structured samples
Tracking High-Frequency Items
Random Samples
Types of Samples
Oblivious sampling – at item level
o Limitations [Bar-Yossef–Kumar–Sivakumar STOC 01]
Value-based sampling – e.g. distinct-value samples
Structured samples – e.g. join sampling
Naïve approach – keep samples of each relation
Problem – sample-of-join ≠ join-of-samples
Foreign-Key Join [Chaudhuri-Motwani-Narasayya SIGMOD 99]
[Figure: relations L and R with join-attribute values A and B – what if A is sampled from L and B from R?]
Basic Scenario
Goal – maintain uniform sample of item-stream
Sampling Semantics?
Coin flip
o select each item with probability p
o easy to maintain
o undesirable – sample size is unbounded
Fixed-size sample without replacement
o Our focus today
Fixed-size sample with replacement
o Show – can generate from previous sample
Non-Uniform Samples [Chaudhuri-Motwani-Narasayya]
Reservoir Sampling [Vitter]
Input – stream of items X1 , X2, X3, …
Goal – maintain uniform random sample S of
size n (without replacement) of stream so far
Reservoir Sampling
Initialize – include first n elements in S
Upon seeing item Xt
o Add Xt to S with probability n/t
o If added, evict random previous item
Analysis
Correctness?
Fact: At each instant, |S| = n
Theorem: At time t, each Xi (i ≤ t) is in S with probability n/t
Exercise – prove via induction on t
Efficiency?
Let N be stream size
E[# updates to S] = n + Σt=n+1..N n/t = n(1 + HN − Hn) = O(n·ln(N/n))
Remark: Verify this is optimal.
Naïve implementation – N coin flips, time O(N)
Improving Efficiency
[Figure: stream X1 … X14 with the items inserted into sample S marked (n = 3); skip lengths J3 = 2 and J9 = 4]
Random variable Jt – number jumped over after time t
Idea – generate Jt and skip that many items
Cumulative Distribution Function – for t ≥ n and s ≥ 0:
F(s) = P[Jt ≤ s] = 1 − Π j=1..s+1 (t+j−n)/(t+j)
     = 1 − [(t+1−n)(t+2−n)···(t+s+1−n)] / [(t+1)(t+2)···(t+s+1)]
Analysis
Number of calls to RANDOM()?
one per insertion into sample
this is optimal!
Generating Jt?
Pick random number U ∈ [0,1]
Find smallest j such that U ≤ F(j)
How?
o Linear scan – O(N) time
o Binary search with Newton's interpolation –
  O(n²(1 + polylog(N/n))) time
Remark – see paper for optimal algorithm
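A sketch of the skip-based update under the slide's setup: draw U and invert F by the linear-scan option above, computing the product form of F(s) incrementally. Function names are illustrative; the binary-search/interpolation variant would be faster.

```python
import itertools
import random

def generate_skip(n, t):
    """Return J_t: how many items to skip after time t (t >= n), smallest s with U <= F(s)."""
    u = random.random()
    tail = 1.0                          # tail = P[J_t > s]
    s = 0
    while True:
        tail *= (t + s + 1 - n) / (t + s + 1)
        if u <= 1.0 - tail:             # F(s) = 1 - tail
            return s
        s += 1

def reservoir_sample_skipping(stream, n):
    stream = iter(stream)
    sample = list(itertools.islice(stream, n))   # first n items fill the reservoir
    t = len(sample)
    while True:
        skip = generate_skip(n, t)
        chunk = list(itertools.islice(stream, skip + 1))
        if len(chunk) <= skip:          # stream ended during the skip
            return sample
        t += skip + 1
        sample[random.randrange(n)] = chunk[-1]  # insert the item after the skip

print(reservoir_sample_skipping(range(1000), 3))
```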
Sampling over Sliding Windows
[Babcock-Datar-Motwani]
Sliding Window W – last w items in stream
Model – item Xt expires at time t+w
Why?
Applications may require ignoring stale data
Type of approximation
Only way to define JOIN over streams
Goal – Maintain uniform sample of size n of
sliding window
Reservoir Sampling?
Observe
any item in sample S will expire eventually
must replace with random item of current window
Problem
no access to items in W-S
storing entire window requires O(w) memory
Oversampling
Backing sample B – select each item with probability Θ((n log w)/w)
Sample S – select n items from B at random
Upon expiry in S, replenish from B
Claim: with high probability, n ≤ |B| = O(n log w)
Index-Set Approach
Pick random index set I = {i1, … , in} ⊆ {0, 1, … , w-1}
Sample S – items Xi with (i mod w) ∈ I, over the current window
Example
Suppose – w=2, n=1, and I={1}
Then – sample is always Xi with odd i
Memory – only O(n)
Observe
S is uniform random sample of each window
But sample is periodic (union of arithmetic progressions)
Correlation across successive windows
Problems
Correlation may hurt in some applications
Some data (e.g. time-series) may be periodic
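A small sketch of the index-set approach above (illustrative names; I is drawn once up front):

```python
import random

def index_set_sample(stream, w, n, seed=0):
    """Yield the current window sample: items whose position mod w lies in I."""
    rng = random.Random(seed)
    I = set(rng.sample(range(w), n))     # random index set, chosen once
    slots = {}                           # residue -> most recent item with that residue
    for pos, item in enumerate(stream):
        if pos % w in I:
            slots[pos % w] = item        # overwrites the expired item in that slot
        yield list(slots.values())

# With w=2 and n=1, if I happens to be {1} the sample is always the item at an
# odd position, as in the slide's example.
for sample in index_set_sample(range(10), w=2, n=1):
    pass
print(sample)
```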
Chain-Sample Algorithm
Idea
Fix expiry problem in Reservoir Sampling
Advance planning for expiry of sampled items
Focus on sample size 1 – keep n independent such samples
Chain-Sampling
Add Xt to S with probability 1/min{t,w} – evict earlier sample
Initially – standard Reservoir Sampling up to time w
Pre-select Xt’s replacement Xr ∈ Wt+w = {Xt+1, …, Xt+w}
o Xt expires ⇒ must replace from Wt+w
o At time r, save Xr and pre-select its own replacement
building “chain” of potential replacements
Note – if evicting earlier sample, discard its “chain” as well
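A sketch of chain-sampling for a single sample (n = 1), following the steps above: one pending replacement index is kept for the chain's last link, and the whole chain is discarded whenever a new sample is chosen. Names and details are illustrative, not from the original slides.

```python
import random

def chain_sample(stream, w):
    """Maintain one uniform sample of the last-w sliding window (chain-sample)."""
    chain = []            # [(index, value), ...]; chain[0] is the current sample
    pending = None        # index of the pre-selected replacement for chain[-1]
    for t, v in enumerate(stream, start=1):
        # current sample expired? promote the next link of the chain
        if chain and chain[0][0] <= t - w:
            chain.pop(0)
        if random.random() < 1.0 / min(t, w):
            # X_t becomes the sample; discard the old chain entirely
            chain = [(t, v)]
            pending = random.randint(t + 1, t + w)
        elif pending == t:
            # pre-selected replacement arrives: save it, pre-select its own replacement
            chain.append((t, v))
            pending = random.randint(t + 1, t + w)
        yield chain[0][1]   # current sample of the window

samples = list(chain_sample(range(100), w=10))
print(samples[-1])
```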
Example
[Figure: chain-sampling example on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, shown at four successive steps]
Expectation for Chain-Sample
T(x) = E[chain length for Xt at time t+x]
T(x) = 1                          for x ≤ 1
T(x) = 1 + (1/w) Σi<x T(i)        for x > 1
E[chain length] = T(w) = (1 + 1/w)^(w−1) < e ≈ 2.718
E[memory required for sample size n] = O(n)
Tail Bound for Chain-Sample
Chain = “hops” of total length at most w
Chain of h hops ⇒ ordered (h+1)-partition of w
h hops of total length less than w
plus, remainder
Each partition has probability w^(−h)
Number of partitions ≤ C(w, h) ≤ (ew/h)^h
h = Ω(log w) ⇒ probability of such a chain is O(w^(−c))
Thus – memory O(n log w) with high probability
Comparison of Algorithms
Algorithm       Expected       High-Probability
Periodic        O(n)           O(n)
Oversample      O(n log w)     O(n log w)
Chain-Sample    O(n)           O(n log w)
Chain-Sample beats Oversample:
Expected memory – O(n) vs O(n log w)
High-probability memory bound – both O(n log w)
Oversample may have sample size shrink below n!
Sketches and Frequency Moments
Generalized Stream Model
Data stream: 2, 0, 1, 3, 1, 2, 4, . . .
[Figure: the corresponding frequency vector – m0 = 1, m1 = 2, m2 = 2, m3 = 1, m4 = 1]
Input Element (i,a)
a copies of domain-value i
increment to ith dimension of m by a
a need not be an integer
Negative value – captures deletions
Example
[Figure: starting from m = (1, 2, 2, 1, 1): on seeing element (i,a) = (2,2), m2 increases by 2, giving (1, 2, 4, 1, 1); on seeing element (i,a) = (1,-1), m1 decreases by 1, giving (1, 1, 4, 1, 1)]
Frequency Moments
Input Stream
values from U = {0,1,…,N-1}
frequency vector m = (m0,m1,…,mN-1)
Kth Frequency Moment Fk(m) = Σi mi^k
F0: number of distinct values (Lecture 15)
F1: stream size
F2: Gini index, self-join size, Euclidean norm
Fk: for k>2, measures skew, sometimes useful
F∞: maximum frequency
Problem – estimation in small space
Sketches – randomized estimators
Naive Approaches
Space N – counter mi for each distinct value i
Space O(1)
if input sorted by i
single counter recycled when new i value appears
Goal
Allow arbitrary input
Use small (logarithmic) space
Settle for randomization/approximation
Sketching F2
Random Hash h(i): {0,1,…,N-1} → {-1,+1}
Define Zi = h(i)
Maintain X = Σi miZi
Easy for update streams (i,a) – just add aZi to X
Claim: X² is an unbiased estimator for F2
Proof: E[X²] = E[(Σi miZi)²]
  = E[Σi mi²Zi²] + E[Σi≠j mimjZiZj]
  = Σi mi²E[Zi²] + Σi≠j mimjE[Zi]E[Zj]   (cross-terms factor by independence)
  = Σi mi² + 0 = F2
Last line? – Zi² = 1 and E[Zi] = 0, since Zi is uniform over {-1,+1}
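A minimal sketch of this estimator: one counter X = Σi miZi with random signs, returning X² as the F2 estimate. For simplicity the signs here are drawn fully independently; as discussed later, 4-wise independent hashing suffices and saves space. Names are illustrative.

```python
import random

def f2_single_estimate(stream, domain_size, seed=0):
    """One AMS-style counter: X = sum_i m_i * Z_i; X**2 estimates F2."""
    rng = random.Random(seed)
    z = [rng.choice((-1, 1)) for _ in range(domain_size)]   # Z_i in {-1, +1}
    x = 0
    for i, a in stream:          # update stream: a copies of value i
        x += a * z[i]
    return x * x

stream = [(2, 2), (0, 1), (1, 2), (3, 1), (4, 1)]   # m = (1,2,2,1,1), F2 = 11
print(f2_single_estimate(stream, domain_size=5))
```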
Estimation Error?
Chebyshev bound: P[ |Y − E[Y]| ≥ λE[Y] ] ≤ Var[Y] / (λ²E[Y]²)
Define Y = X² ⇒ E[Y] = E[X²] = Σi mi² = F2
Observe E[X⁴] = E[(Σi miZi)⁴]
  = E[Σ mi⁴Zi⁴] + 4E[Σ mimj³ZiZj³] + 6E[Σ mi²mj²Zi²Zj²]
    + 12E[Σ mimjmk²ZiZjZk²] + 24E[Σ mimjmkmlZiZjZkZl]
  = Σ mi⁴ + 6Σi<j mi²mj²
Why? – every term with some Zi to an odd power has expectation 0 (4-wise independence and E[Zi] = 0), while Zi² = Zi⁴ = 1
By definition Var[Y] = E[Y²] – E[Y]² = E[X⁴] – E[X²]²
  = [Σ mi⁴ + 6Σi<j mi²mj²] – [Σ mi⁴ + 2Σi<j mi²mj²]
  = 4Σi<j mi²mj² ≤ 2E[X²]² = 2F2²
Estimation Error?
Chebyshev bound: P[ |Y − E[Y]| ≥ λE[Y] ] ≤ Var[Y] / (λ²E[Y]²)
⇒ P[relative estimation error > λ] ≤ 2F2² / (λ²F2²) = 2/λ²
Problem – What if we want λ really small?
Solution
Compute s = 8/λ² independent copies of X
Estimator Y = mean(Xi²)
Variance reduces by factor s
⇒ P[relative estimation error > λ] ≤ 2F2² / (sλ²F2²) = 1/4
Boosting Technique
Algorithm A: Randomized λ-approximate estimator f
P[(1−λ)f* ≤ f ≤ (1+λ)f*] ≥ 3/4
Heavy-Tail Problem: e.g. P[f = f*−z, f*, f*+z] = [1/16, 3/4, 3/16] – averaging such estimates can be skewed by the tail, so take a median instead
Boosting Idea
O(log1/ε) independent estimates from A(X)
Return median of estimates
Claim: P[median is λ-approximate] >1- ε
Proof:
P[specific estimate is λ-approximate] ≥ ¾
Bad event only if >50% estimates not λ-approximate
Binomial tail – probability less than ε
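A sketch combining slides 29-30: keep s·g sign-counters in one pass, average within groups of s = 8/λ², and return the median of the g group averages. The constants follow the slides; the signs are fully independent here for simplicity, and all names are illustrative.

```python
import math
import random
import statistics

def f2_estimate(stream, domain_size, lam=0.5, eps=0.05, seed=0):
    """Median of g averages, each over s sign-counters, all maintained in one pass."""
    rng = random.Random(seed)
    s = math.ceil(8 / lam ** 2)                # copies averaged per estimate
    g = math.ceil(4 * math.log(1 / eps))       # estimates whose median is returned
    signs = [[rng.choice((-1, 1)) for _ in range(domain_size)]
             for _ in range(s * g)]
    counters = [0] * (s * g)
    for i, a in stream:                        # update stream: a copies of value i
        for c in range(s * g):
            counters[c] += a * signs[c][i]
    averages = [statistics.mean(x * x for x in counters[j * s:(j + 1) * s])
                for j in range(g)]
    return statistics.median(averages)

stream = [(2, 2), (0, 1), (1, 2), (3, 1), (4, 1)]   # F2 = 1 + 4 + 4 + 1 + 1 = 11
print(f2_estimate(stream, domain_size=5))
```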
Overall Space Requirement
Observe
Let m = Σmi
Each counter X needs O(log m) bits
s = 8/λ² hash functions (one counter each) for each estimator
O(log 1/ε) such estimators
Total: O(λ^(−2) · log(1/ε) · log m) bits
Question – Space for storing hash function?
Sketching Paradigm
Random Sketch: inner product ⟨f, Z⟩ = Σi f(i)·Zi
frequency vector m = (m0,m1,…,mN-1)
random vector Z (currently, uniform {-1,1})
Observe
Linearity Sketch(m1) ± Sketch(m2) = Sketch (m1 ± m2)
Ideal for distributed computing
Observe
Suppose: Given i, can efficiently generate Zi
Then: can maintain sketch for update streams
Problem
o Must generate Zi=h(i) on first appearance of i
o Need Ω(N) memory to store h explicitly
o Need Ω(N) random bits
Two birds, One stone
Pairwise Independent Z1,Z2, …, Zn
for all Zi and Zk, P[Zi=x, Zk=y] = P[Zi=x].P[Zk=y]
property E[ZiZk] = E[Zi].E[Zk]
Example – linear hash function
Seed S=<a,b> from [0..p-1], where p is prime
Zi = h(i) = ai+b (mod p)
Claim: Z1,Z2, …, Zn are pairwise independent
Zi=x and Zk=y ⇔ x = ai+b (mod p) and y = ak+b (mod p)
Fixing i, k, x, y ⇒ unique solution for a, b
P[Zi=x, Zk=y] = 1/p² = P[Zi=x]·P[Zk=y]
Memory/Randomness: reduced from n log p bits (storing h explicitly) to 2 log p bits (the seed)
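A sketch of the linear hash family above. The prime p = 2^31 − 1 and the parity-based mapping to {−1,+1} are illustrative choices; the parity map adds only an O(1/p) bias, and (as the next slide notes) the Var[X²] analysis would additionally require 4-wise independence.

```python
import random

P = 2**31 - 1                      # illustrative prime, larger than the domain

def make_linear_hash(rng=random):
    """h(i) = (a*i + b) mod P -- pairwise independent over {0, ..., P-1}."""
    a, b = rng.randrange(P), rng.randrange(P)
    return lambda i: (a * i + b) % P

def make_sign_hash(rng=random):
    """Map the linear hash to {-1, +1} via parity (bias is only O(1/P))."""
    h = make_linear_hash(rng)
    return lambda i: 1 if h(i) % 2 == 0 else -1

z = make_sign_hash()
print([z(i) for i in range(10)])   # only the seed (a, b) needs to be stored
```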
Wait a minute!
Doesn’t pairwise independence screw up proofs?
No – E[X2] calculation only has degree-2 terms
But – what about Var[X2]?
Need 4-wise independence
Application – Join-Size Estimation
Given
Join attribute frequencies f1 and f2
Join size = f1.f2
Define – X1 = f1.Z and X2 = f2.Z
Choose – Z as 4-wise independent & uniform {-1,1}
Exercise: Show, as before,
E[X1 X2] = f1·f2
Var[X1 X2] ≤ 2 |f1|²·|f2|²
Hint: Cauchy-Schwarz – a·b ≤ |a|·|b|
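A sketch of this join-size estimator: a shared sign vector Z, X1 = f1·Z, X2 = f2·Z, and the mean of X1·X2 over several copies. Signs are drawn independently here for simplicity; names and example frequencies are illustrative.

```python
import random
import statistics

def join_size_estimate(f1, f2, copies=200, seed=0):
    """Mean over `copies` sketches of X1*X2, where X1 = f1.Z and X2 = f2.Z."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(copies):
        z = [rng.choice((-1, 1)) for _ in range(len(f1))]
        x1 = sum(f * s for f, s in zip(f1, z))
        x2 = sum(f * s for f, s in zip(f2, z))
        estimates.append(x1 * x2)
    return statistics.mean(estimates)

f1 = [5, 0, 3, 2]      # join-attribute frequencies in R1
f2 = [1, 4, 2, 2]      # join-attribute frequencies in R2
print(join_size_estimate(f1, f2), "vs exact", sum(a * b for a, b in zip(f1, f2)))
```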
Bounding Error Probability
Using s copies of X’s & taking their mean Y
Pr[ |Y − f1·f2| ≥ λ·f1·f2 ] ≤ Var(Y) / (λ²(f1·f2)²)
   ≤ 2|f1|²|f2|² / (sλ²(f1·f2)²)
   = 2 / (sλ²·cos²θ)
Bounding error probability?
Need – s > 2/(λ²cos²θ)
Memory? – O(cos^(−2)θ · λ^(−2) · log(1/ε) · (log N + log m))
Problem
To choose s – need an a-priori lower bound on cos θ = f1·f2 / (|f1|·|f2|)
What if cos θ really small?
Sketch Partitioning
Idea for dealing with the |f1|²|f2|²/(f1·f2)² issue – partition the domain into regions where the self-join sizes are smaller, to compensate for a small join size (cos θ)
[Figure: example frequency distributions over dom(R1.A) and dom(R2.B)]
Without partitioning: self-join(R1.A) × self-join(R2.B) = 205 × 205 ≈ 42K
With a two-region partition: Σ over regions of self-join(R1.A) × self-join(R2.B) = 200×5 + 200×5 = 2K
Sketch Partitioning
Idea
intelligently partition join-attribute space
need coarse statistics on stream
build independent sketches for each partition
Estimate = Σ partition sketches
Variance = Σ partition variances
Sketch Partitioning
Partition Space Allocation?
Can solve optimally, given domain partition
Optimal Partition: Find the K-partition minimizing Σi=1..K √Var[Xi], where each Var[Xi] is governed by the per-partition self-join sizes
Results
Dynamic Programming – optimal solution for single join
NP-hard – for queries with multiple joins
Fk for k > 2
Assume – stream length m is known
(Exercise: Show can fix with log m space overhead by
repeated-doubling estimate of m.)
Choose – random stream item ap
p uniform from {1,2,…,m}
Suppose – ap = v ∈ {0,1,…,N-1}
Count subsequent frequency of v
r = | {q : q ≥ p, aq = v} |
Define X = m·(r^k – (r−1)^k)
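A sketch of this basic Fk estimator, assuming (as the slide does) that the stream length m is known in advance; names are illustrative and the averaging at the end is just to show the estimate concentrating.

```python
import random
import statistics

def fk_single_estimate(stream, m, k, rng=random):
    """Pick a random position p, count how often a_p recurs from p onward."""
    p = rng.randint(1, m)          # uniform position in {1, ..., m}
    v = None
    r = 0
    for pos, item in enumerate(stream, start=1):
        if pos == p:
            v = item               # the chosen value a_p
        if pos >= p and item == v:
            r += 1
    return m * (r ** k - (r - 1) ** k)

stream = [7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8]   # example from the next slide
est = statistics.mean(fk_single_estimate(stream, m=20, k=3) for _ in range(5000))
print(est, "vs exact F3 =", sum(stream.count(x) ** 3 for x in set(stream)))
```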
Example
Stream
7,8,5,1,7,5,2,1,5,4,5,10,6,5,4,1,4,7,3,8
m = 20
p = 9
ap = 5
r = 3
Fk for k > 2
E[X] = (m/m) · Σ over distinct values i of [ 1^k + (2^k − 1^k) + … + (mi^k − (mi−1)^k) ]
     = Σi mi^k = Fk
(summing over the m equally likely choices of position p, grouped by value i; each inner sum telescopes to mi^k)
Var(X) ≤ k·N^(1−1/k)·Fk²
Bounded error probability ⇒ s = O(k·N^(1−1/k) / λ²) copies
Boosting ⇒ memory bound
O(k·N^(1−1/k)·λ^(−2)·(log 1/ε)(log N + log m))
Frequency Moments
F0 – distinct values problem (Lecture 15)
F1 – sequence length
for case with deletions, use Cauchy distribution
F2 – self-join size/Gini index (Today)
Fk for k >2
omitting grungy details
can achieve space bound
O(kN1 – 1/k λ-2 (log 1/ε)(log n + log m))
F∞ – maximum frequency
Communication Complexity
Alice holds input A; Bob holds input B
Cooperatively compute function f(A,B)
Minimize bits communicated
Unbounded computational power
Communication Complexity C(f) – bits exchanged by
optimal protocol Π
Protocols?
1-way versus 2-way
deterministic versus randomized
Cδ(f) – randomized complexity for error probability δ
Streaming & Communication Complexity
Stream Algorithm ⇒ 1-way communication protocol
Simulation Argument
Given – algorithm S computing f over streams
Alice – initiates S, providing A as input stream prefix
Communicates to Bob – S’s state after seeing A
Bob – resumes S, providing B as input stream suffix
Theorem – Stream algorithm’s space requirement is
at least the communication complexity C(f)
Example: Set Disjointness
Set Disjointness (DIS)
A, B subsets of {1,2,…,N}
Output: 1 if A ∩ B ≠ ∅, 0 if A ∩ B = ∅
Theorem: Cδ(DIS) = Ω(N), for any δ<1/2
Lower Bound for F∞
Theorem: Fix ε<1/3, δ<1/2. Any stream algorithm S with
P[ (1-ε)F∞ < S < (1+ε)F∞ ] > 1-δ
needs Ω(N) space
Proof
Claim: S gives a 1-way protocol for DIS (on any sets A and B)
Alice streams set A to S
Communicates S’s state to Bob
Bob streams set B to S
Observe
F∞ = 2 if A ∩ B ≠ ∅
F∞ = 1 if A ∩ B = ∅
Relative error ε < 1/3 ⇒ DIS solved exactly!
Error probability δ < ½ ⇒ Ω(N) communication ⇒ Ω(N) space
Extensions
Observe
Used only 1-way communication in proof
Cδ(DIS) bound was for arbitrary communication
Exercise – extend lower bound to multi-pass algorithms
Lower Bound for Fk, k>2
Need to increase gap beyond 2
Multiparty Set Disjointness – t players
Theorem: Fix ε,δ<½ and k > 5. Any stream algorithm S with
P[ (1-ε)Fk < S < (1+ε)Fk ] > 1-δ
needs Ω(N^(1−(2+δ)/k)) space
Implies Ω(N^(1/2)) space even for multi-pass algorithms
Tracking High-Frequency Items
Problem 1 – Top-K List
[Charikar-Chen-Farach-Colton]
The Google Problem
Return list of k most frequent items in stream
Motivation
search engine queries, network traffic, …
Remember
Saw lower bound recently!
Solution
Data structure Count-Sketch maintaining
count-estimates of high-frequency elements
Definitions
Notation
Assume {1, 2, …, N} in order of frequency
mi is frequency of ith most frequent element
m = Σmi is number of elements in stream
FindCandidateTop
Input: stream S, int k, int p
Output: list of p elements containing top k
Naive sampling gives a solution with p = O((m log k) / mk)
FindApproxTop
Input: stream S, int k, real ε
Output: list of k elements, each of frequency mi > (1−ε)·mk
Naive sampling gives no solution
Main Idea
Consider
single counter X
hash function h(i): {1, 2,…,N} → {-1,+1}
Input element i ⇒ update counter X += Zi = h(i)
For each r, use XZr as estimator of mr
Theorem: E[XZr] = mr
Proof
X = Σi miZi
E[XZr] = E[Σi miZiZr] = Σi miE[Zi Zr] = mrE[Zr2] = mr
Cross-terms cancel
Finding Max Frequency Element
Problem – var[X] = F2 = Σi mi2
Idea – t counters, independent 4-wise hashes h1,…,ht
h1: i → {+1, –1}
…
ht: i → {+1, –1}
Use t = O(log m · Σi mi² / (m1)²)
Claim: new variance ≤ (Σi mi²)/t = (m1)²/log m
Overall Estimator
repeat + median of averages
⇒ with high probability, approximate m1
Problem with “Array of Counters”
Variance – dominated by highest frequency
Estimates for less-frequent elements (e.g. the kth most frequent) are corrupted by the higher frequencies – variance >> mk
Avoiding Collisions?
spread out high frequency elements
replace each counter with hashtable of b counters
Count Sketch
Hash Functions
4-wise independent hashes h1,...,ht and s1,…,st
hashes independent of each other
Data structure: t hashtables of b counters each, X(r,c) for r = 1…t, c = 1…b
s1: i → {1, ..., b}    h1: i → {+1, -1}
…
st: i → {1, ..., b}    ht: i → {+1, -1}
Overall Algorithm
sr(i) – selects one of the b counters in the rth hashtable
Input i ⇒ for each r, update X(r,sr(i)) += hr(i)
Estimator(mi) = medianr { X(r,sr(i)) • hr(i) }
Maintain heap of k top elements seen so far
Observe
Not completely eliminated collision with high frequency items
Few of estimates X(r,sr(i)) • hr(i) could have high variance
Median not sensitive to these poor estimates
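A minimal Count-Sketch in the spirit of slides 55-56 (without the top-k heap). Python's salted built-in hash stands in for the 4-wise independent hash functions of the real construction; names and the example stream are illustrative.

```python
import random
import statistics

class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.t, self.b = t, b
        self.salts = [(rng.random(), rng.random()) for _ in range(t)]
        self.tables = [[0] * b for _ in range(t)]

    def _bucket(self, r, i):                    # plays the role of s_r(i)
        return hash((self.salts[r][0], i)) % self.b

    def _sign(self, r, i):                      # plays the role of h_r(i)
        return 1 if hash((self.salts[r][1], i)) % 2 == 0 else -1

    def update(self, i, count=1):
        for r in range(self.t):
            self.tables[r][self._bucket(r, i)] += count * self._sign(r, i)

    def estimate(self, i):
        return statistics.median(
            self.tables[r][self._bucket(r, i)] * self._sign(r, i)
            for r in range(self.t))

cs = CountSketch(t=5, b=64)
stream = [1] * 100 + [2] * 50 + list(range(3, 200))   # a few elephants, many ants
for x in stream:
    cs.update(x)
print(cs.estimate(1), cs.estimate(2), cs.estimate(7))
```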
Avoiding Large Items
b > ck for a suitable constant c ⇒ with probability Ω(1), no collision with the top-k elements
t hashtables represent independent trials
Need O(log(m/δ)) trials to estimate with probability 1−δ
Also need – small variance for the colliding small elements
Claim:
P[variance due to small items in each estimate < (Σi>k mi²)/b] = Ω(1)
Final bound: b = O(k + Σi>k mi² / (mk)²)
Final Results
Zipfian Distribution: mi ∝ 1/i^z [Power Law]
FindApproxTop
Space O([k + (Σi>k mi²)/(mk)²] · log(m/δ))
Roughly: the sampling bound, but with frequencies squared
Zipfian – gives improved results
FindCandidateTop
Zipf parameter z = 0.5
Space O(k log N log m)
Compare: sampling bound O((kN)^0.5 log k)
Problem 2 – Elephants-and-Ants
[Manku-Motwani]
Stream
Identify items whose current frequency exceeds
support threshold s = 0.1%.
[Jacobson 2000, Estan-Verghese 2001]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
Window 1
Window 2
Window 3
Window-size w is function of support s – specify later…
Lossy Counting in Action ...
[Figure: frequency counts start empty; the counts from the first window are added]
At window boundary, decrement all counters by 1
Lossy Counting (continued)
[Figure: the counts from the next window are added to the existing frequency counts]
At window boundary, decrement all counters by 1
Error Analysis
How much do we undercount?
If   current size of stream = N
and  window-size w = 1/ε
then frequency error ≤ # windows = εN
Rule of thumb:
Set ε = 10% of support s
Example:
Given support frequency s = 1%,
set error frequency
ε = 0.1%
Putting it all together…
Output:
Elements with counter values exceeding (s-ε)N
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s–ε)N
How many counters do we need?
Worst case bound: (1/ε)·log(εN) counters
Implementation details…
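A sketch of Lossy Counting as described above: windows of size w = 1/ε, every counter decremented at each window boundary (and dropped at zero), and output of items whose count exceeds (s−ε)N. Names and the example stream are illustrative.

```python
def lossy_counting(stream, support, eps):
    w = int(1 / eps)                 # window size
    counts = {}
    n = 0
    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1
        if n % w == 0:               # window boundary: decrement and prune
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    threshold = (support - eps) * n
    return {item: c for item, c in counts.items() if c > threshold}

stream = [1, 2, 1, 3, 1, 2, 1, 4, 1, 5] * 300    # item 1: 50%, item 2: 20%
print(lossy_counting(stream, support=0.15, eps=0.02))
```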
Number of Counters?
Window size w = 1/ε
Number of windows m = εN
ni – # counters alive over last i windows
Fact: Σi=1..j i·ni ≤ j·w    for j = 1,2,…,m
Claim: Σi=1..j ni ≤ Σi=1..j w/i    for j = 1,2,…,m
Counter must average 1 increment/window to survive
# active counters = Σi=1..m ni ≤ Σi=1..m w/i ≤ w·log m = (1/ε)·log(εN)
Enhancements
Frequency Errors
For counter (X, c), true frequency in [c, c+εN]
Trick: Track number of windows t counter has been active
For counter (X, c, t), true frequency in [c, c+t-1]
If (t = 1), no error!
Batch Processing
Decrements after k windows
Algorithm 2: Sticky Sampling
Stream
Create counters by sampling
Maintain exact counts thereafter
[Figure: example stream – 28, 31, 41, 23, 35, 19, 34, 15, 30 – with counters created for the sampled items]
What is sampling rate?
Sticky Sampling (continued)
For finite stream of length N
Sampling rate = 2/(εN) · log(1/(sδ))
where δ = probability of failure
Output:
Elements with counter values exceeding (s-ε)N
Approximation guarantees (probabilistic)
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least (s-ε)N
Same error guarantees as Lossy Counting, but probabilistic
Same rule of thumb: set ε = 10% of support s
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Number of counters?
Finite stream of length N
Sampling rate: 2/(εN) · log(1/(sδ))
Infinite stream with unknown N
Gradually adjust sampling rate
In either case,
Expected number of counters = (2/ε) · log(1/(sδ))
Independent of N
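A sketch of Sticky Sampling for a stream of known length N (the fixed-rate case above): an unseen item creates a counter with probability 2/(εN)·log(1/(sδ)) and is counted exactly thereafter. Names and the example are illustrative.

```python
import math
import random

def sticky_sampling(stream, N, support, eps, delta):
    rate = (2.0 / (eps * N)) * math.log(1.0 / (support * delta))
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1            # exact counting once a counter exists
        elif random.random() < rate:
            counts[item] = 1             # counter created by sampling
    threshold = (support - eps) * N
    return {item: c for item, c in counts.items() if c > threshold}

stream = [1, 2, 1, 3, 1, 2, 1, 4, 1, 5] * 300
print(sticky_sampling(stream, N=len(stream), support=0.15, eps=0.02, delta=1e-4))
```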
References – Synopses
Synopsis data structures for massive data sets. Gibbons and
Matias, DIMACS 1999.
Tracking Join and Self-Join Sizes in Limited Storage, Alon,
Gibbons, Matias, and Szegedy. PODS 1999.
Join Synopses for Approximate Query Answering, Acharya,
Gibbons, Poosala, and Ramaswamy. SIGMOD 1999.
Random Sampling for Histogram Construction: How much is
enough? Chaudhuri, Motwani, and Narasayya. SIGMOD 1998.
Random Sampling Techniques for Space Efficient Online
Computation of Order Statistics of Large Datasets, Manku,
Rajagopalan, and Lindsay. SIGMOD 1999.
Space-efficient online computation of quantile summaries,
Greenwald and Khanna. SIGMOD 2001.
References – Sampling
Random Sampling with a Reservoir, Vitter. Transactions on
Mathematical Software 11(1):37-57 (1985).
On Sampling and Relational Operators. Chaudhuri and Motwani.
Bulletin of the Technical Committee on Data Engineering (1999).
On Random Sampling over Joins. Chaudhuri, Motwani, and Narasayya.
SIGMOD 1999.
Congressional Samples for Approximate Answering of Group-By
Queries, Acharya, Gibbons, and Poosala. SIGMOD 2000.
Overcoming Limitations of Sampling for Aggregation Queries,
Chaudhuri, Das, Datar, Motwani and Narasayya. ICDE 2001.
A Robust Optimization-Based Approach for Approximate Answering
of Aggregate Queries, Chaudhuri, Das and Narasayya. SIGMOD 01.
Sampling From a Moving Window Over Streaming Data. Babcock,
Datar, and Motwani. SODA 2002.
Sampling algorithms: lower bounds and applications. Bar-Yossef–
Kumar–Sivakumar. STOC 2001.
References – Sketches
Probabilistic counting algorithms for data base applications.
Flajolet and Martin. JCSS (1985).
The space complexity of approximating the frequency
moments. Alon, Matias, and Szegedy. STOC 1996.
Approximate Frequency Counts over Streaming Data. Manku
and Motwani. VLDB 2002.
Finding Frequent Items in Data Streams. Charikar, Chen, and
Farach-Colton. ICALP 2002.
An Approximate L1-Difference Algorithm for Massive Data
Streams. Feigenbaum, Kannan, Strauss, and Viswanathan.
FOCS 1999.
Stable Distributions, Pseudorandom Generators, Embeddings
and Data Stream Computation. Indyk. FOCS 2000.