Data Stream Computation (2)
Statistics estimation over data streams
Slides modified from Minos Garofalakis (Yahoo! Research) and S. Muthukrishnan (Rutgers University)
Outline
Introduction
Frequency moment estimation
Element frequency estimation
Data Stream Processing Algorithms
Generally, algorithms compute approximate answers
– Provably difficult to compute answers accurately with limited memory
Approximate answers - Deterministic bounds
– Algorithms only compute an approximate answer, but with deterministic bounds on the error
Approximate answers - Probabilistic bounds
– Algorithms compute an approximate answer with high probability
• With probability at least 1 - δ, the computed answer is within a factor of (1 ± ε) of the actual answer
Sampling: Basics
Idea: A small random sample S of the data often well-represents all the data
– For a fast approximate answer, apply a “modified” query to S
– Example: select agg from R
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n = 12)
Sample S: 9 5 1 8
– If agg is avg, return the average of the elements in S
answer: 5.75
– Number of odd elements? Scale the sample fraction up by n: 12 · (3/4) = 9 (see the sketch below)
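A quick way to see the idea in code (a minimal sketch, not from the slides; in general S would be drawn with random.sample):

    stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]        # n = 12
    S = [9, 5, 1, 8]                                      # the slide's sample

    est_avg = sum(S) / len(S)                             # 5.75 (true avg is 5)
    frac_odd = sum(1 for x in S if x % 2 == 1) / len(S)   # 3/4 of S is odd
    est_odd_count = len(stream) * frac_odd                # 12 * 3/4 = 9 (true: 8)
    print(est_avg, est_odd_count)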
Probabilistic Guarantees
Example: Actual answer is within 5.75 ± 1 with prob ≥ 0.9
Randomized algorithms: Answer returned is a specially-built random variable
Use tail inequalities to give probabilistic bounds on the returned answer
– Markov Inequality
– Chebyshev’s Inequality
– Chernoff/Hoeffding Bound
Basic Tools: Tail Inequalities
General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
[Figure: a probability distribution with its tail probability shaded]
Basic Inequalities: Let X be a random variable with expectation μ and variance Var[X] (Markov additionally requires X ≥ 0). Then for any ε > 0:
Markov: Pr(X ≥ ε) ≤ μ/ε
Chebyshev: Pr(|X - μ| ≥ ε) ≤ Var[X]/ε²
Tail Inequalities for Sums
Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
Chernoff Bound: Let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (Pr[Xi = 0] = 1 - p). Let X = Σᵢ Xᵢ and μ = mp be the expectation of X. Then, for any ε > 0:
Pr(|X - μ| ≥ εμ) ≤ 2·exp(-με²/2)
No need to compute Var(X), but the independence assumption is needed!
Application to count queries (see the sketch below):
– m is the size of sample S (4 in our example)
– p is the fraction of odd elements in the stream (2/3 in our example)
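A small Monte Carlo sanity check of this bound (not from the slides; m is enlarged from the example's 4 so the bound is non-trivial):

    import math
    import random

    m, p, eps = 100, 2 / 3, 0.2
    mu = m * p
    bound = 2 * math.exp(-mu * eps ** 2 / 2)             # the slide's bound

    trials = 100_000
    far = sum(
        abs(sum(random.random() < p for _ in range(m)) - mu) >= eps * mu
        for _ in range(trials)
    )
    print(far / trials, "<=", bound)                     # empirical tail vs. bound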
The Streaming Model
Underlying signal: One-dimensional array A[1…N] with values A[i] all initially zero
– Multi-dimensional arrays as well (e.g., row-major)
Signal is implicitly represented via a stream of updates
– j-th update is <k, c[j]>, implying
• A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0; see the sketch after this slide)
Goal: Compute functions on A[] subject to
– Small space
– Fast processing of updates
– Fast function computation
– …
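A minimal illustration of the update model (toy values assumed, not from the slides; 0-based indexing for convenience):

    N = 8
    A = [0] * N                                  # the implicit signal A[0..N-1]

    updates = [(3, 2), (1, 1), (3, -1), (5, 4)]  # turnstile: c may be negative
    for k, c in updates:
        A[k] += c                                # A[k] := A[k] + c

    print(A)                                     # [0, 1, 0, 1, 0, 4, 0, 0]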
Streaming Model: Special Cases
Time-Series Model
– Only j-th update updates A[j] (i.e., A[j] := c[j])
Cash-Register Model
– c[j] is always >= 0 (i.e., increment-only)
– Typically, c[j]=1, so we see a multi-set of items in one
pass
Turnstile Model
– Most general streaming model
– c[j] can be >=0 or <0 (i.e., increment or decrement)
Problem difficulty varies depending on the model
– E.g., MIN/MAX in Time-Series vs. Turnstile!
Frequency moment computation
Problem
• Data arrives online: a1, a2, a3, …, am, with each ai ∈ {1, 2, ..., n}
• Let f(i) = |{ j : aj = i }| (the frequency of i, represented by A[i])
• Fk = Σᵢ₌₁ⁿ f(i)ᵏ
Example
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
F0 = 5 <distinct elements>, F1 = 7,
F2 = 11 (1·1 + 2·2 + 2·2 + 1·1 + 1·1) (surprise index)
What is F∞ (= maxᵢ f(i))? See the sketch below.
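To make the definitions concrete, an exact computation of these moments (a sketch is only needed when there is no room to store f itself):

    from collections import Counter

    stream = [3, 1, 2, 4, 2, 3, 5]
    f = Counter(stream)                        # f(i) = |{j : a_j = i}|

    F0 = len(f)                                # distinct elements: 5
    F1 = sum(f.values())                       # stream length: 7
    F2 = sum(v * v for v in f.values())        # surprise index: 11
    F_inf = max(f.values())                    # answers the quiz: 2
    print(F0, F1, F2, F_inf)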
Frequency moment computation
F1 is easy: a single counter of the stream length suffices
How about the others?
– Focus on F2 and F0
– Estimation of general Fk
Linear-Projection (AMS) Sketch Synopses
Goal: Build a small-space summary for the distribution vector f(i) (i = 1,…, N) seen as a stream of i-values
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .  →  f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
Basic Construct: Randomized linear projection of f() = inner/dot product of the f-vector with a vector ξ of random values from an appropriate distribution:
⟨f, ξ⟩ = Σᵢ f(i)·ξᵢ
– Simple to compute over the stream: Add ξᵢ whenever the i-th value is seen
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .  →  ξ₁ + 2ξ₂ + 2ξ₃ + ξ₄ + ξ₅
– Generate the ξᵢ's in small O(logN) space using pseudo-random generators
– Tunable probabilistic guarantees on approximation error
– Delete-proof: Just subtract ξᵢ to delete an occurrence of value i
AMS (sketch) cont.
Key Intuition: Use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = F2
– Var[X] is small
Probabilistic error guarantees (e.g., actual answer is 10 ± 1 with probability 0.9)
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables {ξᵢ : i = 1,…, N}
– Pr[ξᵢ = +1] = Pr[ξᵢ = -1] = 1/2
• Expected values: E[ξᵢ] = 0 and E[ξᵢ²] = 1
– The variables ξᵢ are 4-wise independent
• Expected value of the product of any 4 distinct ξᵢ's is 0: E(ξ₁ξ₂ξ₃ξ₄) = 0
– The ξᵢ's can be generated using a pseudo-random generator using only O(logN) space (for seeding)! See the sketch below.
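A minimal sketch of one standard construction (assumed here; the slide does not spell it out): a random degree-3 polynomial over a prime field is 4-wise independent, and its state is just four coefficients, i.e., O(logN) bits:

    import random

    P = (1 << 61) - 1                    # a Mersenne prime, assumed > N

    class FourWiseSign:
        def __init__(self, rng=random):
            # the entire "seed": four random coefficients
            self.coef = [rng.randrange(P) for _ in range(4)]
        def xi(self, i):
            h = 0
            for c in self.coef:          # Horner: evaluate the cubic at i
                h = (h * i + c) % P
            return 1 if h & 1 else -1    # fold one hash bit into {-1, +1}

    xi = FourWiseSign()
    print([xi.xi(i) for i in range(1, 6)])   # e.g., [1, -1, -1, 1, 1]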
AMS (sketch) cont.
Example
Data stream R: 4 1 2 4 1 4  →  f(1) = 2, f(2) = 1, f(3) = 0, f(4) = 3
Per arrival, update Z := Z + ξᵢ (e.g., Z := Z + ξ₄ when the first 4 arrives)
Final: Z = 2ξ₁ + ξ₂ + 3ξ₄
Suppose {ξᵢ}:
1) ξ₁, ξ₂ = +1 and ξ₃, ξ₄ = -1; then Z = ?
2) ξ₂ = +1 and ξ₁, ξ₃, ξ₄ = -1; then Z = ?
X = Z²  (worked out in the sketch below)
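Working the example in code (note: the second ξ assignment is garbled in the transcript; we read it as ξ₂ = +1 with the rest -1):

    def Z_of(stream, xi):                    # add xi_i on each arrival of i
        return sum(xi[a] for a in stream)

    R = [4, 1, 2, 4, 1, 4]                   # f = (2, 1, 0, 3), F2 = 4 + 1 + 9 = 14
    case1 = {1: +1, 2: +1, 3: -1, 4: -1}     # Z = 2 + 1 - 3 = 0,  X = 0
    case2 = {1: -1, 2: +1, 3: -1, 4: -1}     # Z = -2 + 1 - 3 = -4, X = 16
    for xi in (case1, case2):
        z = Z_of(R, xi)
        print(z, z * z)                      # X = Z^2; E[X] over the xi's is F2 = 14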
AMS (sketch) cont.
Expected value of X = F2. Since E(ξᵢ) = 0 and E(ξᵢ²) = 1:
E(X) = E[(Σᵢ f(i)ξᵢ)²]
     = E[Σᵢ f(i)²ξᵢ²] + 2·E[Σᵢ<ᵢ' f(i)f(i')ξᵢξᵢ']
     = Σᵢ f(i)²·1 + 2·0 = F2
Using 4-wise independence, it is possible to show that
E(X²) = Σᵢ f(i)⁴ + 6·Σᵢ<ᵢ' f(i)²f(i')²
Var[X] = E(X²) - E(X)² ≤ 2F2²
Boosting Accuracy
Chebyshev’s Inequality: Pr(|X - E[X]| ≥ εE[X]) ≤ Var[X] / (ε²E[X]²)
Boost accuracy to ε by averaging over s = 16/ε² independent copies of X (reduces variance):
[x, x, …, x (s = 16/ε² copies)  →  Average  →  y]
E[Y] = E[X] = F2
Var[Y] = Var[X]/s ≤ ε²F2²/8
By Chebyshev:
Pr(|Y - F2| ≥ εF2) ≤ Var[Y] / (ε²F2²) ≤ 1/8
Boosting Confidence
Boost confidence to 1 - δ by taking the median of 2log(1/δ) independent copies of Y
Each Y is a Bernoulli trial, where “FAILURE” (Y deviates by more than εF2) has Pr ≤ 1/8
[2log(1/δ) copies of y  →  median, which lands in [(1 - ε)F2, (1 + ε)F2] with Pr ≥ 1 - δ]
Pr[|median(Y) - F2| ≥ εF2]
  = Pr[# failures in 2log(1/δ) trials ≥ log(1/δ)]
  ≤ δ  (by the Chernoff bound)
Summary of AMS Sketching for F2
Step 1: Compute random variables Z = Σᵢ f(i)ξᵢ
Step 2: Define X = Z²
Steps 3 & 4: Average 16/ε² independent copies of X to get each Y; return the median of 2log(1/δ) such averages
[16/ε² copies per average, 2log(1/δ) averages y  →  median]
Main Theorem: Sketching approximates F2 to within a relative error of ε with probability ≥ 1 - δ using space O( log(1/δ)·logN / ε² )
– Remember: O(logN) space for “seeding” the construction of each X (the full pipeline is sketched below)
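The whole pipeline in one self-contained sketch (assumptions: the cubic-polynomial 4-wise family from earlier; loops kept simple rather than fast):

    import math
    import random
    from statistics import median

    P = (1 << 61) - 1

    def make_xi(rng):
        coef = [rng.randrange(P) for _ in range(4)]
        def xi(i):
            h = 0
            for c in coef:
                h = (h * i + c) % P
            return 1 if h & 1 else -1
        return xi

    def ams_f2(stream, eps=0.5, delta=0.1, rng=random):
        s1 = math.ceil(16 / eps ** 2)              # copies per average (accuracy)
        s2 = 2 * math.ceil(math.log(1 / delta))    # averages to median (confidence)
        xis = [[make_xi(rng) for _ in range(s1)] for _ in range(s2)]
        Z = [[0] * s1 for _ in range(s2)]
        for a in stream:                           # one pass, O(s1*s2) counters
            for j in range(s2):
                for k in range(s1):
                    Z[j][k] += xis[j][k](a)
        return median(sum(z * z for z in row) / s1 for row in Z)

    print(ams_f2([3, 1, 2, 4, 2, 3, 5]))           # true F2 = 11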
Binary-Join COUNT Query
Problem: Compute the answer for the query COUNT(R ⋈A S)
Example:
Data stream R.A: 4 1 2 4 1 4  →  fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
Data stream S.A: 3 1 2 4 2 4  →  fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
COUNT(R ⋈A S) = Σᵢ fR(i)·fS(i) = 10  (2 + 2 + 0 + 6)
Exact solution: too expensive, requires O(N) space!
– N = sizeof(domain(A))
Basic AMS Sketching Technique [AMS96]
Key Intuition: Use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = COUNT(R ⋈A S)
– Var[X] is small
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables {ξᵢ : i = 1,…, N}
AMS Sketch Construction
Compute random variables XR = Σᵢ fR(i)ξᵢ and XS = Σᵢ fS(i)ξᵢ
– Simply add ξᵢ to XR (XS) whenever the i-th value is observed in the R.A (S.A) stream
Define X = XR·XS to be the estimate of the COUNT query (see the sketch below)
Example:
Data stream R.A: 4 1 2 4 1 4  →  fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
– First update: XR := XR + ξ₄; final XR = 2ξ₁ + ξ₂ + 3ξ₄
Data stream S.A: 3 1 2 4 2 4  →  fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
– First update: XS := XS + ξ₃; final XS = ξ₁ + 2ξ₂ + ξ₃ + 2ξ₄
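The join example in code (a minimal sketch; for brevity the signs are drawn fully independently, which is stronger than the 4-wise family the analysis needs):

    import random

    xi = {i: random.choice([-1, 1]) for i in [1, 2, 3, 4]}

    R = [4, 1, 2, 4, 1, 4]
    S = [3, 1, 2, 4, 2, 4]
    XR = sum(xi[a] for a in R)       # = 2*xi[1] + xi[2] + 3*xi[4]
    XS = sum(xi[a] for a in S)       # = xi[1] + 2*xi[2] + xi[3] + 2*xi[4]
    print(XR * XS)                   # one copy of X; E[X] = COUNT = 10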
Binary-Join AMS Sketching Analysis
Expected value of X = COUNT(R ⋈A S):
E[X] = E[XR·XS]
     = E[(Σᵢ fR(i)ξᵢ)·(Σᵢ fS(i)ξᵢ)]
     = E[Σᵢ fR(i)fS(i)ξᵢ²] + E[Σᵢ≠ᵢ' fR(i)fS(i')ξᵢξᵢ']
     = Σᵢ fR(i)·fS(i)   (since E[ξᵢ²] = 1 and E[ξᵢξᵢ'] = 0)
Using 4-wise independence, it is possible to show that
Var[X] ≤ 2·SJ(R)·SJ(S)
where SJ(R) = Σᵢ fR(i)² is the self-join size of R (its second/L2 moment)
Boosting Accuracy
Chebyshev’s Inequality: Pr(|X - E[X]| ≥ εE[X]) ≤ Var[X] / (ε²E[X]²)
Boost accuracy to ε by averaging over s = 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X (reduces variance):
[x, x, …, x (s copies)  →  Average  →  y]
E[Y] = E[X] = COUNT(R ⋈A S)
Var[Y] = Var[X]/s ≤ ε²COUNT²/8
By Chebyshev:
Pr(|Y - COUNT| ≥ ε·COUNT) ≤ Var[Y] / (ε²COUNT²) ≤ 1/8
Boosting Confidence
Boost confidence to 1 - δ by taking the median of 2log(1/δ) independent copies of Y
Each Y is a Bernoulli trial, where “FAILURE” (Y deviates by more than ε·COUNT) has Pr ≤ 1/8
[2log(1/δ) copies of y  →  median, which lands in [(1 - ε)COUNT, (1 + ε)COUNT] with Pr ≥ 1 - δ]
Pr[|median(Y) - COUNT| ≥ ε·COUNT]
  = Pr[# failures in 2log(1/δ) trials ≥ log(1/δ)]
  ≤ δ  (by the Chernoff bound)
Summary of Binary-Join AMS Sketching
Step 1: Compute random variables XR = Σᵢ fR(i)ξᵢ and XS = Σᵢ fS(i)ξᵢ
Step 2: Define X = XR·XS
Steps 3 & 4: Average 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X to get each Y; return the median of 2log(1/δ) such averages
[8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) copies per average, 2log(1/δ) averages y  →  median]
Main Theorem (AGMS99): Sketching approximates COUNT to within a relative error of ε with probability ≥ 1 - δ using space O( SJ(R)·SJ(S)·log(1/δ)·logN / (ε²·COUNT²) )
– Remember: O(logN) space for “seeding” the construction of each X
Distinct Value Estimation (F0)
Problem: Find the number of distinct values in a stream of values with domain [0,…, N-1]
– Zeroth frequency moment F0
– Statistics: number of species or classes in a population
– Important for query optimizers
– Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
Example (N = 64)
Data stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
Hard problem for random sampling!
– Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
Hash (aka FM) Sketches for Distinct Value Estimation [FM85]
Assume a hash function h(x) that maps incoming values x in [0,…, N-1] uniformly across [0,…, 2^L - 1], where L = O(logN)
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
– A value x is mapped to lsb(h(x))
Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
– For each incoming value x, set BITMAP[lsb(h(x))] = 1
Example: x = 5, h(x) = 101100, so lsb(h(x)) = 2 and BITMAP becomes [0 0 0 1 0 0] (positions 5…0)
Prob[lsb(h(x)) = i] = ?
Hash (FM) Sketches for Distinct Value Estimation [FM85]
By uniformity through h(x): Prob[BITMAP[k] = 1] = 1/2^(k+1)
– Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .
[Figure: BITMAP positions L-1 … 0 — all 0s at positions >> log(d), a fringe of 0/1s around log(d), all 1s at positions << log(d)]
Let R = position of the rightmost zero in BITMAP
– Use R as an indicator of log(d)
[FM85] prove that E[R] = log(φd), where φ = .7735
– Estimate d = 2^R / φ
– Average several iid instances (different hash functions) to reduce estimator variance (see the sketch below)
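A minimal single-BITMAP FM sketch (assumptions: Python's seeded RNG stands in for the "ideal" hash function; φ = .7735 as above; real use averages several instances):

    import random

    L = 32
    PHI = 0.7735

    def lsb(y):                          # position of the least-significant 1 bit
        return (y & -y).bit_length() - 1

    class FMSketch:
        def __init__(self, seed=0):
            self.seed = seed
            self.bitmap = 0              # BITMAP packed into an int
        def _h(self, x):                 # stand-in for an ideal hash function
            return random.Random(hash((self.seed, x))).getrandbits(L) | (1 << L)
        def add(self, x):
            self.bitmap |= 1 << lsb(self._h(x))
        def estimate(self):
            r = 0                        # R = position of the rightmost zero
            while (self.bitmap >> r) & 1:
                r += 1
            return 2 ** r / PHI

    fm = FMSketch()
    for x in [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]:
        fm.add(x)
    print(fm.estimate())                 # true F0 = 5

Two such sketches (same hash) over different streams compose by OR-ing their bitmaps, which is the composability property noted on a later slide.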
Accuracy of FM
[Figure: m independent BITMAPs, BITMAP 1 … BITMAP m, each with its own fringe around log(d)]
Using m = O( (1/ε²)·log(1/δ) ) independent bitmaps (different hash functions) and averaging their R values gives a 1 ± ε approximation with probability at least 1 - δ
Hash (FM) Sketches for Distinct Value Estimation
[FM85] assume “ideal” hash functions h(x) (N-wise independence)
– In practice: h(x) = (a · x + b) mod N, mapped into [0,…, 2^L - 1], where a, b are random binary vectors
Composable: Component-wise OR/add distributed sketches together
– Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality
Cash Register Sketch (AMS)
• A more general algorithm for Fk
• Choose a random position p from 1..m (stream sampling) and let r = |{q : q ≥ p, aq = ap}|
• Estimator: X = m·(r^k - (r-1)^k)  (see the sketch below)
• Using F2 (k = 2) as an example
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .  (m = 7)
If we choose the first element a1, then r = 2 and X = 7·(2·2 - 1·1) = 21
And for a2: r = ?, X = ?
For a5: r = ?, X = ?
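A one-pass implementation of this estimator (a minimal sketch; reservoir sampling, which the slide's "stream sampling" implies but does not spell out, picks the random position p):

    import random

    def fk_estimate(stream, k=2, rng=random):
        m, a_p, r = 0, None, 0
        for a in stream:
            m += 1
            if rng.randrange(m) == 0:    # position m becomes p w.p. 1/m
                a_p, r = a, 1            # restart the suffix count at a_p
            elif a == a_p:
                r += 1                   # r = |{q : q >= p, a_q = a_p}|
        return m * (r ** k - (r - 1) ** k)

    print(fk_estimate([3, 1, 2, 4, 2, 3, 5], k=2))   # one copy of X; true F2 = 11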
Cash Register Sketch (AMS)
Y = average of A copies of X, and Z = median of B copies of Y
[B copies of (A copies of x → Average → y)  →  median]
• Claim: This is a 1 ± ε approximation to F2, and the space used is
O(AB) = O( √n·log(1/δ) / ε² ) words of size O(logn + logm),
with probability at least 1 - δ.
Analysis: Cash Register Sketch
• E(X) = F2
• V(X) = E(X²) - (E(X))²
• Using (a² - b²) ≤ 2(a - b)·a, we have V(X) ≤ 2·F1·F3
• Also, V(X) ≤ 2·F1·F3 ≤ 2·√n·F2²
Hence,
E(Yi) = E(Xi) = F2
V(Yi) = V(Xi)/A ≤ 2·√n·F2² / A
Analysis Contd.
• Applying Chebyshev’s inequality:
Pr(|Yi - F2| ≥ εF2) ≤ V(Yi) / (ε²F2²) ≤ 2·√n·F2² / (A·ε²·F2²) = 2·√n / (A·ε²) ≤ 1/8  (for A = 16·√n/ε²)
• Hence, by Chernoff bounds, the probability that more than B/2 of the Yi’s deviate that far is at most δ, if we take B = O(log(1/δ)) copies of Yi. Hence the median gives the desired approximation.
Computation of Fk
E(X) = Fk, with r = |{q : q ≥ p, aq = ap}| and X = m·(r^k - (r-1)^k)
When A = 8·k·n^(1-1/k) / ε² and B = 2·log(1/δ),
we get a 1 ± ε approximation of Fk with probability at least 1 - δ
Estimate the element frequency
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .  →  f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
Ask for f(1) = ? f(4) = ?
Two approaches:
- AMS-based algorithm
- Count-Min sketch
AMS (sketch) based algorithm
Key Intuition: Use randomized linear projections of f() to define a random variable Z such that, for a given element i,
E(ξᵢ·Z) = A[i] = f(i)
Similarly, E(ξⱼ·Z) = f(j)
Basic Idea:
– Define a family of 4-wise independent {-1, +1} random variables (same as before), Pr[ξᵢ = +1] = Pr[ξᵢ = -1] = 1/2
Let Z = ⟨f, ξ⟩ = Σᵢ f(i)ξᵢ. Then
E(ξᵢ·Z) = E[f(i)·ξᵢ·ξᵢ] + E[Σᵢ'≠ᵢ f(i')·ξᵢ'·ξᵢ] = f(i)·1 + 0 = f(i)
AMS cont.
Keep an array of d ╳ w counters Z[i, j]
Use d hash functions h1, …, hd to map each element a to [1..w]
[Figure: element a hashed by h1(a), …, hd(a) into a d ╳ w array]
Update: Z[i, hi(a)] += ξᵢ,ₐ
Estimate: Est(fa) = medianᵢ ( Z[i, hi(a)] · ξᵢ,ₐ )  (see the sketch below)
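A minimal implementation of this structure (assumptions: Python's seeded RNG stands in for the hash functions hᵢ and the sign family ξᵢ,ₐ; w and d are free parameters here):

    import random
    from statistics import median

    class AMSFrequency:
        def __init__(self, w, d, seed=0):
            self.w, self.d, self.seed = w, d, seed
            self.Z = [[0] * w for _ in range(d)]
        def _h(self, i, a):              # row-i bucket for element a
            return random.Random(hash((self.seed, "h", i, a))).randrange(self.w)
        def _xi(self, i, a):             # row-i sign for element a
            return random.Random(hash((self.seed, "x", i, a))).choice([-1, 1])
        def add(self, a, c=1):
            for i in range(self.d):
                self.Z[i][self._h(i, a)] += c * self._xi(i, a)
        def estimate(self, a):
            return median(self.Z[i][self._h(i, a)] * self._xi(i, a)
                          for i in range(self.d))

    cs = AMSFrequency(w=16, d=5)
    for a in [3, 1, 2, 4, 2, 3, 5]:
        cs.add(a)
    print(cs.estimate(1), cs.estimate(4))   # true f(1) = 1, f(4) = 1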
The Count Min (CM) Sketch
Simple sketch idea, can be used for point queries (fᵢ), range queries, quantiles, join size estimation
Creates a small summary as an array of w ╳ d counters C
Uses d hash functions to map each element to [1..w]
w = ⌈e/ε⌉ and d = ⌈log(1/δ)⌉
CM Sketch Structure
[Figure: element xi hashed by h1(xi), …, hd(xi), adding +1 to one counter in each of the d rows of width w]
Each element xi is mapped to one counter per row:
C[k, hk(xi)] = C[k, hk(xi)] + 1  (-1 if deletion)
or + c[j] if the incoming update is <j, c[j]>
Estimate A[j] by taking mink C[k, hk(j)]  (see the sketch below)
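A minimal Count-Min implementation following the slide's parameters (Python's seeded RNG stands in for the pairwise-independent hashes; note the min-estimate is only guaranteed with non-negative updates):

    import math
    import random

    class CountMin:
        def __init__(self, eps, delta, seed=0):
            self.w = math.ceil(math.e / eps)         # w = e/ε columns
            self.d = math.ceil(math.log(1 / delta))  # d = log(1/δ) rows
            self.seed = seed
            self.C = [[0] * self.w for _ in range(self.d)]
        def _h(self, k, x):
            return random.Random(hash((self.seed, k, x))).randrange(self.w)
        def update(self, x, c=1):                    # the update <x, c>
            for k in range(self.d):
                self.C[k][self._h(k, x)] += c
        def query(self, x):                          # never underestimates
            return min(self.C[k][self._h(k, x)] for k in range(self.d))

    cm = CountMin(eps=0.1, delta=0.01)
    for x in [3, 1, 2, 4, 2, 3, 5]:
        cm.update(x)
    print(cm.query(1), cm.query(4))                  # true f(1) = 1, f(4) = 1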
CM Sketch Summary
CM sketch guarantees that the approximation error on point queries is less than ε·||A||₁, in size O( (1/ε)·log(1/δ) )
– Probability of larger error is less than δ
Hints
– Counts are biased! Can you limit the expected amount of extra “mass” at each bucket? (Use Markov; one worked answer below)
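One way to answer the hint (a standard argument, assuming w = e/ε and d = log(1/δ) as on the CM slide above):

    % extra mass colliding into row k's bucket for element j, in expectation:
    \mathbb{E}[\mathrm{extra}_k]
      = \sum_{j' \neq j} \Pr[h_k(j') = h_k(j)] \, A[j']
      \le \frac{\lVert A \rVert_1}{w}
      = \frac{\varepsilon}{e} \lVert A \rVert_1
    % Markov bounds one row; independence of the d rows finishes the proof:
    \Pr\big[\mathrm{extra}_k \ge \varepsilon \lVert A \rVert_1\big] \le \tfrac{1}{e},
    \qquad
    \Pr\big[\min_k C[k, h_k(j)] \ge A[j] + \varepsilon \lVert A \rVert_1\big]
      \le e^{-d} = \delta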