Data Stream Computation (2)


Statistics estimation over data streams
Slides modified from Minos Garofalakis (Yahoo! Research)
and S. Muthukrishnan (Rutgers University)
Outline
• Introduction
• Frequency moment estimation
• Element frequency estimation
Data Stream Processing Algorithms
• Generally, algorithms compute approximate answers
– Provably difficult to compute answers accurately with limited memory
• Approximate answers with deterministic bounds
– Algorithms only compute an approximate answer, but with deterministic bounds on the error
• Approximate answers with probabilistic bounds
– Algorithms compute an approximate answer with high probability
• With probability at least 1 − δ, the computed answer is within a factor ε of the actual answer
Sampling: Basics
• Idea: A small random sample S of the data often represents all the data well
– For a fast approximate answer, apply a "modified" query to S
– Example: select agg from R
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n = 12)
Sample S: 9 5 1 8
– If agg is avg, return the average of the elements in S
answer: 5.75
– Number of odd elements?
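For instance, a minimal Python sketch of the sample-based estimates (stream and sample hard-coded from the example above; the count estimate simply scales the sample fraction up to n):

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]   # n = 12
sample = [9, 5, 1, 8]                            # uniform random sample S

# avg: the sample average directly estimates the stream average
avg_estimate = sum(sample) / len(sample)         # 5.75

# count of odd elements: scale the sample fraction up to the full stream
odd_in_sample = sum(1 for x in sample if x % 2 == 1)
count_estimate = len(stream) * odd_in_sample / len(sample)   # 12 * 3/4 = 9.0
# (the true count is 8, so this particular sample is off by 1)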
Probabilistic Guarantees
• Example: The actual answer is within 5.75 ± 1 with probability ≥ 0.9
• Randomized algorithms: the answer returned is a specially-built random variable
• Use tail inequalities to give probabilistic bounds on the returned answer
– Markov Inequality
– Chebyshev’s Inequality
– Chernoff/Hoeffding Bound
Basic Tools: Tail Inequalities
• General bounds on the tail probability of a random variable (that is, the probability that a random variable deviates far from its expectation)
[Figure: probability distribution with the shaded tail probability]
• Basic inequalities: Let X be a random variable with expectation μ and variance Var[X]. Then, for any ε > 0:
Markov:     Pr(X ≥ ε) ≤ μ/ε
Chebyshev:  Pr(|X − μ| ≥ εμ) ≤ Var[X] / (ε²μ²)
Tail Inequalities for Sums
• Possible to derive even stronger bounds on tail probabilities for the sum of independent Bernoulli trials
• Chernoff bound: Let X1, ..., Xm be independent Bernoulli trials such that Pr[Xi = 1] = p (and Pr[Xi = 0] = 1 − p). Let X = Σi Xi and μ = mp be the expectation of X. Then, for any ε > 0,
Pr(|X − μ| ≥ εμ) ≤ 2·exp(−με²/2)
No need to compute Var(X), but the independence assumption is needed!
• Application to count queries:
– m is the size of the sample S (4 in the example)
– p is the fraction of odd elements in the stream (2/3 in the example)
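Plugging the example numbers into the bound (a worked check in Python, not from the slides):

import math

m, p, eps = 4, 2/3, 0.5                   # sample size, odd fraction, relative error
mu = m * p
print(2 * math.exp(-mu * eps**2 / 2))     # ~1.43: vacuous for a sample of only 4
print(2 * math.exp(-100*p * eps**2 / 2))  # ~0.0005: useful once m grows to 100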
The Streaming Model
• Underlying signal: a one-dimensional array A[1…N] with values A[i] all initially zero
– Multi-dimensional arrays as well (e.g., row-major)
• Signal is implicitly represented via a stream of updates
– The j-th update is <k, c[j]>, implying
• A[k] := A[k] + c[j] (c[j] can be >= 0 or < 0)
• Goal: compute functions on A[] subject to
– Small space
– Fast processing of updates
– Fast function computation
– …
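A minimal Python rendering of the update loop (names are illustrative); the whole point of what follows is to compute functions of A without materializing it like this:

from collections import defaultdict

A = defaultdict(int)                         # implicit signal A[...], initially zero
updates = [(3, 1), (1, 1), (3, 2), (1, -1)]  # stream of <k, c[j]> pairs
for k, c in updates:
    A[k] += c                                # turnstile: c[j] may be positive or negative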
Streaming Model: Special Cases
• Time-Series Model
– Only the j-th update touches A[j] (i.e., A[j] := c[j])
• Cash-Register Model
– c[j] is always >= 0 (i.e., increment-only)
– Typically c[j] = 1, so we see a multi-set of items in one pass
• Turnstile Model
– Most general streaming model
– c[j] can be >= 0 or < 0 (i.e., increment or decrement)
• Problem difficulty varies depending on the model
– E.g., MIN/MAX in Time-Series vs. Turnstile!
Frequency moment computation
Problem
• Data arrives online: a1, a2, a3, ..., am, with each ai ∈ {1, 2, ..., n}
• Let f(i) = |{ j : aj = i }| (represented by A[i])
• Fk = Σ_{i=1}^{n} f(i)^k
Example
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
F0 = 5 (number of distinct elements), F1 = 7,
F2 = 1·1 + 2·2 + 2·2 + 1·1 + 1·1 = 11 (the "surprise index")
What is F∞? (the maximum frequency)
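Checking the example directly (a small Python sketch; it stores the whole histogram, which is exactly what streaming algorithms must avoid):

from collections import Counter

stream = [3, 1, 2, 4, 2, 3, 5]
f = Counter(stream)                      # f(1)=1, f(2)=2, f(3)=2, f(4)=1, f(5)=1
F0 = len(f)                              # 5 distinct elements
F1 = sum(f.values())                     # 7, the stream length
F2 = sum(v*v for v in f.values())        # 11, the surprise index
F_inf = max(f.values())                  # 2, the maximum frequency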
Frequency moment computation
• Easy for F1
• How about the others?
– Focus on F2 and F0
– Estimation of general Fk
Linear-Projection (AMS) Sketch Synopses
• Goal: Build a small-space summary for the distribution vector f(i) (i = 1, ..., N), seen as a stream of i-values
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
• Basic construct: randomized linear projection of f() = the inner/dot product of the f-vector with a vector ξ of random values from an appropriate distribution:
⟨f, ξ⟩ = Σi f(i)·ξi
– Simple to compute over the stream: add ξi whenever the i-th value is seen
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .  →  ξ1 + 2ξ2 + 2ξ3 + ξ4 + ξ5
– Generate the ξi's in small O(log N) space using pseudo-random generators
– Tunable probabilistic guarantees on the approximation error
– Delete-proof: just subtract ξi to delete an occurrence of the i-th value
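A minimal sketch of this update rule in Python. The ξi's here come from a degree-3 polynomial hash over a prime field, a standard way to realize 4-wise independence (see the next slide) with an O(log N)-size seed; the particular prime and seeding are illustrative assumptions, not prescribed by the slides:

import random

P = 2**31 - 1                                # a prime >= N
seed = [random.randrange(P) for _ in range(4)]

def xi(i):
    # 4-wise independent {-1,+1} variable for index i
    h = ((seed[0] + seed[1]*i + seed[2]*i*i + seed[3]*i*i*i) % P) % 2
    return 1 if h else -1

Z = 0
for i in [3, 1, 2, 4, 2, 3, 5]:              # the example stream
    Z += xi(i)                               # insert; a delete would do Z -= xi(i)
# Z now equals xi(1) + 2*xi(2) + 2*xi(3) + xi(4) + xi(5)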
AMS (sketch) cont.
• Key intuition: use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = F2, with probabilistic error guarantees (e.g., the actual answer is 10 ± 1 with probability 0.9)
– Var[X] is small
• Basic idea:
– Define a family of 4-wise independent {−1, +1} random variables {ξi : i = 1, ..., N}
– Pr[ξi = +1] = Pr[ξi = −1] = 1/2
• Expected value of each ξi: E[ξi] = 0 and E[ξi²] = 1
– The variables ξi are 4-wise independent
• The expected value of the product of any 4 distinct ξi's is 0: E(ξ1·ξ2·ξ3·ξ4) = 0
– The ξi's can be generated using a pseudo-random generator using only O(log N) space (for seeding)!
AMS (sketch) cont.
Example
Data stream R: 4 1 2 4 1 4
f(i): f(1) = 2, f(2) = 1, f(3) = 0, f(4) = 3
Each arriving 4 performs Z ← Z + ξ4, and so on, so at the end of the stream
Z = 2ξ1 + ξ2 + 3ξ4
Suppose {ξi} satisfies:
1) ξ1, ξ2 ∈ {1}, ξ3, ξ4 ∈ {−1}; then Z = ?
2) ξ4 ∈ {1}, ξ1, ξ2, ξ3 ∈ {−1}; then Z = ?
X = Z²
AMS (sketch) cont.
• Expected value of X is F2, using E(ξi) = 0 and E(ξi²) = 1:
E(X) = E[(Σi f(i)·ξi)²]
     = E[Σi f(i)²·ξi²] + 2·E[Σ_{i<i'} f(i)·f(i')·ξi·ξi']
     = Σi f(i)²    (the first expectation is 1, the second is 0)
• Using 4-wise independence, it is possible to show that
E(X²) = Σi f(i)⁴ + 6·Σ_{i<i'} f(i)²·f(i')²
Var[X] = E(X²) − E(X)² ≤ 2·F2²
Boosting Accuracy
• Chebyshev's inequality:
Pr(|X − E[X]| ≥ ε·E[X]) ≤ Var[X] / (ε²·E[X]²)
• Boost accuracy to ε by averaging over s = 16/ε² independent copies of X (averaging reduces the variance):
x, x, ..., x  →  Average  →  y    (s = 16/ε² copies)
E[Y] = E[X] = F2
Var[Y] = Var[X]/s ≤ ε²·F2²/8    (using Var[X] ≤ 2·F2²)
• By Chebyshev:
Pr(|Y − F2| ≥ ε·F2) ≤ Var[Y] / (ε²·F2²) ≤ 1/8
Boosting Confidence
• Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
• Each Y is a Bernoulli trial; "FAILURE" (Y falls outside [(1 − ε)·F2, (1 + ε)·F2]) has Pr ≤ 1/8
y, y, ..., y  →  median    (2·log(1/δ) copies)
The median is bad only if at least half the copies fail:
Pr[|median(Y) − F2| ≥ ε·F2]
= Pr[# failures in 2·log(1/δ) trials ≥ log(1/δ)]
≤ δ    (by the Chernoff bound)
Summary of AMS Sketching for F2
• Step 1: Compute random variables Z = Σi f(i)·ξi
• Step 2: Define X = Z²
• Steps 3 & 4: Average 16/ε² independent copies of X; return the median of 2·log(1/δ) such averages
x, x, x → Average → y
x, x, x → Average → y      →  median
x, x, x → Average → y
• Main theorem: Sketching approximates F2 to within a relative error of ε with probability ≥ 1 − δ using space O(log(1/δ)·log N / ε²)
– Remember: O(log N) space for "seeding" the construction of each X
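To make Steps 1–4 concrete, here is a minimal end-to-end F2 estimator in Python (a sketch under the assumptions above; the 4-wise family is again a degree-3 polynomial hash, and the copy counts come straight from the two boosting slides):

import math
import random
from statistics import mean, median

P = 2**31 - 1                                      # prime field for the hashes

def make_xi(rng):
    # one independently seeded 4-wise independent {-1,+1} family
    s = [rng.randrange(P) for _ in range(4)]
    return lambda i: 1 if ((s[0] + s[1]*i + s[2]*i*i + s[3]*i*i*i) % P) % 2 else -1

def estimate_F2(stream, eps=0.5, delta=0.05, rng=random.Random(0)):
    n_avg = math.ceil(16 / eps**2)                 # Step 3: copies per average
    n_med = math.ceil(2 * math.log(1 / delta))     # Step 4: averages per median
    ys = []
    for _ in range(n_med):
        xs = []
        for _ in range(n_avg):
            xi = make_xi(rng)
            z = sum(xi(a) for a in stream)         # Step 1: Z = sum_i f(i)*xi_i
            xs.append(z * z)                       # Step 2: X = Z^2
        ys.append(mean(xs))
    return median(ys)

print(estimate_F2([3, 1, 2, 4, 2, 3, 5]))          # true F2 = 11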
Binary-Join COUNT Query
• Problem: Compute the answer for the query COUNT(R ⋈A S) = Σi fR(i)·fS(i)
• Example:
Data stream R.A: 4 1 2 4 1 4   →   fR(1) = 2, fR(2) = 1, fR(3) = 0, fR(4) = 3
Data stream S.A: 3 1 2 4 2 4   →   fS(1) = 1, fS(2) = 2, fS(3) = 1, fS(4) = 2
COUNT(R ⋈A S) = 2 + 2 + 0 + 6 = 10
• Exact solution: too expensive, requires O(N) space!
– N = sizeof(domain(A))
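For reference, the exact answer computed naively in Python (this is the O(N)-space baseline the sketch is designed to avoid):

from collections import Counter

fR = Counter([4, 1, 2, 4, 1, 4])           # R.A stream
fS = Counter([3, 1, 2, 4, 2, 4])           # S.A stream
count = sum(fR[i] * fS[i] for i in fR)     # sum_i fR(i)*fS(i) = 10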
Basic AMS Sketching Technique [AMS96]
• Key intuition: use randomized linear projections of f() to define a random variable X such that
– X is easily computed over the stream (in small space)
– E[X] = COUNT(R ⋈A S)
– Var[X] is small
• Basic idea:
– Define a family of 4-wise independent {−1, +1} random variables {ξi : i = 1, ..., N}
AMS Sketch Construction
• Compute random variables XR = Σi fR(i)·ξi and XS = Σi fS(i)·ξi
– Simply add ξi to XR (resp. XS) whenever the i-th value is observed in the R.A (resp. S.A) stream
• Define X = XR·XS to be the estimate of the COUNT query
• Example:
Data stream R.A: 4 1 2 4 1 4    (each 4 performs XR ← XR + ξ4)
XR = 2ξ1 + ξ2 + 3ξ4
Data stream S.A: 3 1 2 4 2 4    (each 1 performs XS ← XS + ξ1)
XS = ξ1 + 2ξ2 + ξ3 + 2ξ4
Binary-Join AMS Sketching Analysis
• Expected value of X = COUNT(R ⋈A S):
E[X] = E[XR·XS]
     = E[(Σi fR(i)·ξi)·(Σi fS(i)·ξi)]
     = E[Σi fR(i)·fS(i)·ξi²] + E[Σ_{i≠i'} fR(i)·fS(i')·ξi·ξi']
     = Σi fR(i)·fS(i)    (since E[ξi²] = 1 and E[ξi·ξi'] = 0)
• Using 4-wise independence, it is possible to show that
Var[X] ≤ 2·SJ(R)·SJ(S)
where SJ(R) = Σi fR(i)² is the self-join size of R (its second/L2 moment)
Boosting Accuracy
• Chebyshev's inequality:
Pr(|X − E[X]| ≥ ε·E[X]) ≤ Var[X] / (ε²·E[X]²)
• Boost accuracy to ε by averaging over s = 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X (averaging reduces the variance):
x, x, ..., x  →  Average  →  y    (s copies)
E[Y] = E[X] = COUNT(R ⋈A S)
Var[Y] = Var[X]/s ≤ ε²·COUNT²/8
• By Chebyshev:
Pr(|Y − COUNT| ≥ ε·COUNT) ≤ Var[Y] / (ε²·COUNT²) ≤ 1/8
Boosting Confidence
• Boost confidence to 1 − δ by taking the median of 2·log(1/δ) independent copies of Y
• Each Y is a Bernoulli trial; "FAILURE" (Y falls outside [(1 − ε)·COUNT, (1 + ε)·COUNT]) has Pr ≤ 1/8
y, y, ..., y  →  median    (2·log(1/δ) copies)
Pr[|median(Y) − COUNT| ≥ ε·COUNT]
= Pr[# failures in 2·log(1/δ) trials ≥ log(1/δ)]
≤ δ    (by the Chernoff bound)
Summary of Binary-Join AMS Sketching
• Step 1: Compute random variables XR = Σi fR(i)·ξi and XS = Σi fS(i)·ξi
• Step 2: Define X = XR·XS
• Steps 3 & 4: Average 8·(2·SJ(R)·SJ(S)) / (ε²·COUNT²) independent copies of X; return the median of 2·log(1/δ) such averages
x, x, x → Average → y
x, x, x → Average → y      →  median
x, x, x → Average → y
• Main theorem (AGMS99): Sketching approximates COUNT to within a relative error of ε with probability ≥ 1 − δ using space
O(SJ(R)·SJ(S)·log(1/δ)·log N / (ε²·COUNT²))
– Remember: O(log N) space for "seeding" the construction of each X
Distinct Value Estimation (F0)
• Problem: Find the number of distinct values in a stream of values with domain [0, ..., N−1]
– The zeroth frequency moment F0
– Statistics: number of species or classes in a population
– Important for query optimizers
– Network monitoring: distinct destination IP addresses, source/destination pairs, requested URLs, etc.
• Example (N = 64)
Data stream: 3 0 5 3 0 1 7 5 1 0 3 7
Number of distinct values: 5
• Hard problem for random sampling!
– Must sample almost the entire table to guarantee the estimate is within a factor of 10 with probability > 1/2, regardless of the estimator used!
Hash (aka FM) Sketches for Distinct
Value Estimation [FM85]
• Assume a hash function h(x) that maps incoming values x in [0, ..., N−1] uniformly across [0, ..., 2^L−1], where L = O(log N)
• Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y
– A value x is mapped to lsb(h(x))
• Maintain Hash Sketch = BITMAP array of L bits, initialized to 0
– For each incoming value x, set BITMAP[lsb(h(x))] = 1
Example: x = 5, h(x) = 101100, so lsb(h(x)) = 2 and we set BITMAP[2] = 1:
BITMAP (positions 5 4 3 2 1 0):  0 0 0 1 0 0
• Prob[lsb(h(x)) = i] = ?
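A minimal FM sketch in Python (the linear hash below is an illustrative stand-in for the ideal hash function [FM85] assume; a single bitmap gives a noisy estimate, which is why the analysis averages several instances):

import random

L, PHI = 32, 0.7735

class FMSketch:
    def __init__(self, rng=random.Random(0)):
        self.a = rng.randrange(1, 2**L) | 1      # random odd multiplier (assumption)
        self.b = rng.randrange(2**L)
        self.bitmap = 0

    def lsb(self, y):
        # position of the least-significant 1 bit (L-1 if y == 0)
        return (y & -y).bit_length() - 1 if y else L - 1

    def add(self, x):
        h = (self.a * x + self.b) % 2**L         # hash into [0, 2^L - 1]
        self.bitmap |= 1 << self.lsb(h)

    def estimate(self):
        r = 0
        while (self.bitmap >> r) & 1:            # R = position of rightmost zero
            r += 1
        return 2**r / PHI                        # d ~ 2^R / phi

fm = FMSketch()
for x in [3, 0, 5, 3, 0, 1, 7, 5, 1, 0, 3, 7]:
    fm.add(x)
print(fm.estimate())                             # true number of distinct values: 5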
Hash (FM) Sketches for Distinct Value
Estimation [FM85]
• By uniformity through h(x): Prob[BITMAP[k] = 1] = 1/2^(k+1)
– Assuming d distinct values: expect d/2 to map to BITMAP[0], d/4 to map to BITMAP[1], . . .
BITMAP (positions L−1 ... 0):
0 0 0 0 ... | fringe of 0/1s | ... 1 1 1 1
positions >> log(d) are mostly 0; the fringe sits around log(d); positions << log(d) are mostly 1
• Let R = position of the rightmost zero in BITMAP
– Use R as an indicator of log(d)
• [FM85] prove that E[R] = log(φd), where φ ≈ 0.7735
– Estimate d = 2^R / φ
– Average several iid instances (different hash functions) to reduce estimator variance
Accuracy of FM
[Figure: m independent BITMAPs, each all-1s at low positions and all-0s at high positions]
• Keep m = O((1/ε²)·log(1/δ)) independent bitmaps (with independent hash functions) and combine their R values
• This yields a (1 ± ε) approximation of the number of distinct values with probability at least 1 − δ, using O((1/ε²)·log(1/δ)·log N) bits in total
Hash (FM) Sketches for Distinct Value
Estimation
• [FM85] assume "ideal" hash functions h(x) (N-wise independence)
– In practice: h(x) = (a · x + b) mod N, mapped into [0, ..., 2^L−1], where a, b are random binary vectors
• Composable: component-wise OR/add distributed sketches together
– Estimate |S1 ∪ S2 ∪ … ∪ Sk| = set-union cardinality
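Continuing the Python sketch above, merging is one line (fm1 and fm2 are hypothetical sketches built over two different streams with the same hash function):

merged = FMSketch()
merged.a, merged.b = fm1.a, fm1.b           # the sketches must share the hash function
merged.bitmap = fm1.bitmap | fm2.bitmap     # sketch of S1 ∪ S2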
Cash Register Sketch (AMS)
• A more general algorithm, for any Fk
• Stream sampling: choose a random position p from 1..m, and let
r = |{q : q ≥ p, aq = ap}|    (occurrences of ap from position p onward)
• Estimator: X = m·(r^k − (r − 1)^k)
• Using F2 (k = 2) as an example:
Data stream: 3, 1, 2, 4, 2, 3, 5
If we choose the first element a1, then r = 2 and X = 7·(2·2 − 1·1) = 21
And for a2: r = ?, X = ?
For a5: r = ?, X = ?
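A minimal one-pass version of this estimator in Python, using reservoir sampling to pick the random position p without knowing m in advance (the reservoir trick is an implementation assumption; the slide just says "choose random p"):

import random

def ams_Fk_single(stream, k, rng=random.Random(0)):
    # one copy of X = m*(r^k - (r-1)^k)
    m, sampled, r = 0, None, 0
    for a in stream:
        m += 1
        if rng.randrange(m) == 0:    # with probability 1/m, resample p here
            sampled, r = a, 0
        if a == sampled:             # r counts occurrences of a_p from p onward
            r += 1
    return m * (r**k - (r - 1)**k)

print(ams_Fk_single([3, 1, 2, 4, 2, 3, 5], k=2))   # E[X] = F2 = 11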
Cash Register Sketch (AMS)
• Y = average of A copies of X, and Z = median of B copies of Y
x, x, x → Average → y
x, x, x → Average → y      →  median    (A copies per average, B averages)
x, x, x → Average → y
• Claim: Z is a (1 ± ε) approximation to F2 with probability at least 1 − δ, and the space used is
O(A·B) = O(√n·log(1/δ) / ε²) words of size O(log n + log m)
Analysis: Cash Register Sketch
• E(X) = F2
• V(X) = E(X²) − (E(X))²
• Using (a² − b²) ≤ 2(a − b)·a, we have V(X) ≤ 2·F1·F3
• Also, V(X) ≤ 2·F1·F3 ≤ 2·√n·F2²
Hence,
E(Yi) = E(Xi) = F2
V(Yi) = V(Xi)/A ≤ 2·√n·F2² / A
Analysis Contd.
• Applying Chebyshev's inequality:
Pr(|Yi − F2| ≥ ε·F2) ≤ V(Yi) / (ε²·F2²) ≤ 2·√n / (A·ε²) ≤ 1/8
for A = 16·√n/ε²
• Hence, by Chernoff bounds, the probability that more than B/2 of the Yi's deviate that far is at most δ, if we take B = O(log(1/δ)) copies of the Yi's. Hence, the median gives the desired approximation.
Computation of Fk
• E(X) = Fk, where r = |{q : q ≥ p, aq = ap}| and X = m·(r^k − (r − 1)^k)
(Unbiasedness follows by telescoping: averaging over the m choices of p, each element i contributes Σ_{r=1}^{f(i)} (r^k − (r−1)^k) = f(i)^k.)
• When A = 8·k·n^(1−1/k)/ε² and B = 2·log(1/δ), we get a (1 ± ε) approximation with probability at least 1 − δ
Estimate the element frequency
Data stream: 3, 1, 2, 4, 2, 3, 5, . . .
f(1) = 1, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1
• Ask for f(1) = ? f(4) = ?
– AMS-based algorithm
– Count-Min sketch
AMS (sketch) based algorithm
• Key intuition: use randomized linear projections of f() to define a random variable Z such that, for a given element i,
E(ξi·Z) = A[i] = f(i)
Similarly, E(ξj·Z) = f(j)
• Basic idea:
– Define a family of 4-wise independent {−1, +1} random variables (same as before), with Pr[ξi = +1] = Pr[ξi = −1] = 1/2
– Let Z = ⟨f, ξ⟩ = Σi f(i)·ξi
– Then E(ξi·Z) = E[f(i)·ξi·ξi] + E[Σ_{i'≠i} f(i')·ξi'·ξi] = f(i)·1 + 0 = f(i)
AMS cont.
• Keep an array of w ╳ d counters, one counter Z[i, j] per row i and bucket j
• Use d hash functions h1, ..., hd to map each element a to [1..w]
[Figure: element a is hashed to one counter per row, via h1(a), ..., hd(a)]
Update: Z[i, hi(a)] += ξ_{i,a}
Estimate: Est(fa) = median_i ( Z[i, hi(a)] · ξ_{i,a} )
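A minimal implementation of this estimator in Python (essentially the Count-Sketch; the per-row bucket and sign hashes below are simple illustrative stand-ins for the limited-independence families the analysis requires):

import random
from statistics import median

class AMSPointSketch:
    def __init__(self, w, d, rng=random.Random(0)):
        self.w, self.d = w, d
        self.Z = [[0] * w for _ in range(d)]
        self.hseed = [rng.randrange(1 << 31) for _ in range(d)]
        self.sseed = [rng.randrange(1 << 31) for _ in range(d)]

    def _h(self, i, a):                  # bucket of element a in row i
        return hash((self.hseed[i], a)) % self.w

    def _xi(self, i, a):                 # sign of element a in row i
        return 1 if hash((self.sseed[i], a)) % 2 else -1

    def add(self, a, c=1):
        for i in range(self.d):
            self.Z[i][self._h(i, a)] += c * self._xi(i, a)

    def estimate(self, a):
        return median(self.Z[i][self._h(i, a)] * self._xi(i, a) for i in range(self.d))

sk = AMSPointSketch(w=16, d=5)
for x in [3, 1, 2, 4, 2, 3, 5]:
    sk.add(x)
print(sk.estimate(4))                    # true f(4) = 1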
The Count Min (CM) Sketch
• Simple sketch idea; can be used for point queries (fi), range queries, quantiles, and join size estimation
• Creates a small summary as an array of w ╳ d counters C
• Use d hash functions to map elements to [1..w], with
w = ⌈e/ε⌉ and d = ⌈log(1/δ)⌉
CM Sketch Structure
[Figure: element xi is hashed to one counter per row, via h1(xi), ..., hd(xi); each of those d counters is incremented by +1]
• Each element xi is mapped to one counter per row
• C[k, hk(xi)] = C[k, hk(xi)] + 1 (−1 if deletion),
or + c[j] if the incoming update is <j, c[j]>
• Estimate A[j] by taking mink C[k, hk(j)]
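A minimal Count-Min sketch in Python, with w and d set from ε and δ as on the previous slide (the row hashes are illustrative):

import math
import random

class CountMin:
    def __init__(self, eps, delta, rng=random.Random(0)):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.C = [[0] * self.w for _ in range(self.d)]
        self.seeds = [rng.randrange(1 << 31) for _ in range(self.d)]

    def _h(self, k, x):
        return hash((self.seeds[k], x)) % self.w

    def update(self, j, c=1):                   # processes an update <j, c[j]>
        for k in range(self.d):
            self.C[k][self._h(k, j)] += c

    def query(self, j):                         # estimate of A[j]; never too low
        return min(self.C[k][self._h(k, j)] for k in range(self.d))

cm = CountMin(eps=0.1, delta=0.01)
for x in [3, 1, 2, 4, 2, 3, 5]:
    cm.update(x)
print(cm.query(2))                              # true f(2) = 2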
CM Sketch Summary
• The CM sketch guarantees that the approximation error on point queries is less than ε·||A||1, using an array of size O((1/ε)·log(1/δ))
– The probability of a larger error is less than δ
• Hints
– Counts are biased (they can only over-estimate)! Can you limit the expected amount of extra "mass" at each bucket? (Use Markov; a sketch of the argument follows.)
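A sketch of the standard argument behind the hint (not spelled out on the slide): fix an element j and a row k, and let the extra mass be Ek = C[k, hk(j)] − A[j] ≥ 0. By pairwise independence of hk,
E[Ek] = Σ_{j' ≠ j} A[j'] / w ≤ ||A||1 / w = (ε/e)·||A||1
so Markov's inequality gives Pr[Ek ≥ ε·||A||1] ≤ 1/e. Since the d rows use independent hash functions, the minimum over rows fails only if every row fails:
Pr[mink C[k, hk(j)] ≥ A[j] + ε·||A||1] ≤ (1/e)^d ≤ δ for d = log(1/δ).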