Transcript A 1

Data Streams Part 3:
Approximate Query
Evaluation
Reynold Cheng
23rd July, 2002
Approximate Query Evaluation
 Why?
 Handling load – streams coming too fast
 Data stream is archived in a off-site data
warehouse, expensive access of archived
data
 Avoid unbounded storage and computation
 Ad hoc queries need approximate history
 Try to look at the data items only once and
in a fixed order
Synopses
 Queries may access or aggregate past data
 Need bounded-memory history-approximation
 Synopsis?
– Succinct summary of old stream tuples
– Like indexes/materialized-views, but base data is
unavailable
 Examples
–
–
–
–
–
Sliding Windows
Samples
Sketches
Histograms
Wavelet representation
Sketching Techniques




Self-Join Size Estimation
Stream of values from dom(A) = {1,2,…,n}
Let f(i) = frequency of value i
Consider SJ(A) = Σidom(A)f(i)2, or Gini’s index of
homogeneity.
 Useful in parallel DB applications, error
estimation in query result size estimation and
access plan costs.
 Equivalent query: Q = COUNT (R |><|A R)
Evaluating SJ(A) = Σ fi2
 To update S, keep a counter f(i) for each value i
in the domain D  (|dom(A)|) bits of storage
 This is true for any deterministic algorithm
 Question – estimating S in sub-linear space
(O(log |dom(A)|))
 A randomized technique that offers strong
probabilistic guarantees on the quality of SJ(A)
Self-Join Size Estimation
 AMS Technique (randomized sketches)
 Given (f(1),f(2),…,f(n))
 Generate a family of 4-wise independent binary
random variables {Zi: i = 1,…, |dom(A)|}
 Zi = random{-1,1}
 P[Zi = +1] = P[Zi = -1] = ½ (E[Zi] = 0)
 By using tools like orthogonal arrays, such
families can be constructed online using
O(log|dom(A)|) space.
Self-Join Size Estimation
 Define X = Σi dom(A) f(i)Zi (X incrementally
computable)
 Theorem Exp[X2] = Σ f(i)2
– Cross-terms f(i)Zi f(j)Zj have 0 expectation
– Square-terms f(i)Zi f(i)Zi = f(i)2
 Hence, X2 is an unbiased estimator for SJ(A)
 Space = log (N + Σ f(i))
 Independent samples Xk reduce variance
Estimation Quality
 How can independent samples Xk improve the quality of
estimation?
 Keep s1 x s2 iid instantiations for Xk
 Use independent random seeds to get a family of Zi’s for
each instance
 s1 reduces variance, s2 boosts confidence
s1
Atomic
Sketch
Avg(X1j2)
Avg(X2j2)
s2
Avg(X3j2)
Avg(X4j2)
Avg(X5j2)
Median
(sketch)
Sample Run of AMS
Z1 =
V =
3
1
-1 1
1
Σvi2 = 123
Z1 =
6
2
-1
X1= 5, X12 = 25
V =
4
1
-1 1
1
6
-1
5
Z2 =
7
-1 1
-1 1
1
X2= 14, X22 = 196 Est = 110.5
2
5
Z2 =
7
-1 1
-1 1
1
Σ vi2 = 130, X1= 6, X12 = 36, X2= 12, X22 = 144, Est = 90
Comments on AMS




The self-join size can be computed on-line
Sufficiently small variance (controlled by s1 and s2)
Can estimate SJ(A) in one pass
Probabilistic guarantee: a relative error of at most  with
probability at least 1-
 Can this method be extended for other queries?
Complex Aggregate Queries
 A. Dobra et al. extend the idea of AMS to
provide approximate answers to complex
aggregate queries.
 SELECT AGG FROM R1,R2,…,Rr where 
 AGG: COUNT/SUM/AVERAGE
  : conjunction of equality joins (Ri.Aj = Rk.Al)
 The error of these estimates is at most ε with
probability 1-δ.
Attribute Constraint of 
 Each attribute of a relation appears at most once
in the join condition .
 For any attribute Ri.Aj occuring m > 1 times in ,
add m-1 new “attributes” to Ri
 Replace m-1 occurrences in  with the new
attributes
  = ((R1.A1=R2.A1) & (R1.A1=R3.A1)) becomes
((R1.A1=R2.A1) & (R1.A2=R3.A1))
The COUNT Query
 QCOUNT : # of tuples in the cross-product of
R1,…,Rr that satisfy the equality constraints in .
 SELECT COUNT(*) from R1,…,Rr where 
 Approach: Express the above query using
mathematical operators.
 Rename the 2n join attributes in  to A1, A2,…,
A2n such that each constraint in  is Aj = An+j for
1  j  n.
Projection of Domains
 D = dom(A1) x dom(A2) x…x dom(A2n)
 Sk is the set of join attributes appearing in
Rk
 Dk = dom(Ak1) x dom(Ak2) x…x
dom(Ak|Sk|); Ak1,…,Ak|Sk| are attributes in Sk
 Dk is a projection of D on attributes in Sk
Value Assignment I
 An assignment I assigns values to (a subset
of) join attributes
 If I  D, each join attribute is assigned a
value I[j] by I.
 If I  Dk, then I only assigns a value I[j] to
attributes j  Sk.
 Use letter j to refer to attribute Aj
Value Assignment I
 I[Sk] is the projection of I on attributes in
Sk.
 fk(I) is the number of tuples in Rk that
match I.
Example
 SELECT COUNT (*) FROM R1, R2, R3
where R1.A1 = R2.A1 and R2.A2 = R3.A1
 Renaming: SELECT COUNT (*) FROM
R1, R2, R3 where A1 = A3 and A2 = A4
– A1 = R1.A1
– A3 = R2.A1
A2 = R2.A2
A4 = R3.A1
 Note: Joins in the form Aj = An+j
 S1 = {1}, S2 = {2,3}, S3 = {4}
Example (cont.)
 D = dom(A1) x dom(A2) x dom(A3) x
dom (A4)
 I1 = <1, 3, 2, 6>  D; I1[3] = 2
 D1 = dom(A1); D2 = dom(A2) x dom(A3);
D3 = dom(A4)
 I1[S1] = <1>, I1[S2] = <3,2>, I1[S3] = <6>
 f2(<3,2>) = 5 means 5 tuples from R2 have
A2=3, A3=2
Exact QCOUNT
 QCOUNT=ΣID,j:I[j]=I[n+j]Πrk=1fk(I[Sk])
–
–
Product of the # of tuples in each relation that match a value
assignment I
Summed over all I D that satisfy the equi-join constraints ε
Approximate QCOUNT
 Similar to self-join size!
 Construct a random variable X that is an
unbiased estimator for QCOUNT
 Variance can be bounded from above
 Use averaging and median-selection tricks
to boost the accuracy and confidence of X
 Achieve small relative error with high
probability
Constructing X
 For each pair of join attributes j, n+j in ,
build random variables
{Zj,l: l=1,…,|dom(Aj)|},where Zj,l  {-1, +1}
 Random variables belonging to families for
different attribute pairs are independent
 Space:Σnj=1O(log|dom(Aj)|)
Constructing X (2)
 For each relation Rk, define atomic sketch Xk =
ΣIDk (fk(I)Π jSk Zj,I[j])
 X = unbiased estimator of QCOUNT = Πrk=1Xk
 E(X) = QCOUNT
 Each Xk can be efficiently computed as tuples of
Rk are streaming in.
– Initialize Xk to 0
– For each tuple t in the Rk stream, addΠjSk Zj,t[j] to Xk
Join Graph and Acyclicity
 A join graph is an undirected graph
– a node for each relation, and,
– an edge for each join attribute pair j, n+j
 If the join graph is acyclic, the variance of
X is bounded from above.
 Many join queries are acyclic, like chain
joins and star joins.
Probabilistic Guarantees
 If QCOUNT is acyclic, multi-join COUNT
query over relations R1,…, Rr, then a
sketch of size O(log|dom(Aj)|) is possible
to approximate QCOUNT so that:
– the relative error of the estimate is at most ε
with probability 1-δ.
Basic Notions of Approximation
 For aggregate queries (e.g., SUM, COUNT),
approximation quality can be measured by absolute
relative error:
– |Estimated value – Actual value| / Actual value
 Open question: for queries involving more than simple
aggregation, how should we define approximation?
 Consider S |><|BT: (S: {A,B}, T: {B,C})
A
B
C
A
B
C
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of Approximation
(2)
 Can we accept this kind of approximation?
A
B
C
A
B
C
10
20.5
Doctor
11
21.6
Doctor
8
10.3
Lawyer
8
10.3
Student
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of Approximation
(3)
 Can we provide useful (semantically correct) but stale
results?
A
B
C
A
B
C
10
20.5
Doctor
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
Actual Result (at time t)
Approximate Result
(correct result at time t - )
Related Issues
 Metric for set-valued queries
 Composition of approximate operators
 How is it understood/controlled by user?
 Integrate into query language
 Query planning and interaction with
resource allocation
 Accuracy-efficiency-storage tradeoff and
global metric
Conclusions
 To evaluate aggregate queries, O(|dom(A)|)
size is needed to obtain exact answers.
 True for any deterministic algorithms
 We introduce randomized algorithms that
uses only O(Σnj=1log|dom(Aj)|) space.
 Only 1 pass is needed for estimation.
Conclusions (2)
 Express the target query by mathematical
operators
 Use an unbiased estimator for the expression
 Independent samples reduce variance
 A probabilistic guarantee: the relative error of the
estimate is at most ε with probability 1-δ.
 Question: Can we use this technique to answer
queries involving more general selection
predicates (e.g., inequality joins)?
References
N. Alon, Y. Matias, M. Szegedy. The Space
Complexity of Approximating the
Frequency Moments, STOC ’96.
A. Dobra, M. Garofalakis, J. Gehrke, R.
Rastogi. Processing Complex Aggregate
Queries over Data Streams, SIGMOD ’02.
Thank You!