Transcript A 1
Data Streams Part 3:
Approximate Query
Evaluation
Reynold Cheng
23rd July, 2002
Approximate Query Evaluation
Why?
Handling load – streams coming too fast
Data stream is archived in a off-site data
warehouse, expensive access of archived
data
Avoid unbounded storage and computation
Ad hoc queries need approximate history
Try to look at the data items only once and
in a fixed order
Synopses
Queries may access or aggregate past data
Need bounded-memory history-approximation
Synopsis?
– Succinct summary of old stream tuples
– Like indexes/materialized-views, but base data is
unavailable
Examples
–
–
–
–
–
Sliding Windows
Samples
Sketches
Histograms
Wavelet representation
Sketching Techniques
Self-Join Size Estimation
Stream of values from dom(A) = {1,2,…,n}
Let f(i) = frequency of value i
Consider SJ(A) = Σidom(A)f(i)2, or Gini’s index of
homogeneity.
Useful in parallel DB applications, error
estimation in query result size estimation and
access plan costs.
Equivalent query: Q = COUNT (R |><|A R)
Evaluating SJ(A) = Σ fi2
To update S, keep a counter f(i) for each value i
in the domain D (|dom(A)|) bits of storage
This is true for any deterministic algorithm
Question – estimating S in sub-linear space
(O(log |dom(A)|))
A randomized technique that offers strong
probabilistic guarantees on the quality of SJ(A)
Self-Join Size Estimation
AMS Technique (randomized sketches)
Given (f(1),f(2),…,f(n))
Generate a family of 4-wise independent binary
random variables {Zi: i = 1,…, |dom(A)|}
Zi = random{-1,1}
P[Zi = +1] = P[Zi = -1] = ½ (E[Zi] = 0)
By using tools like orthogonal arrays, such
families can be constructed online using
O(log|dom(A)|) space.
Self-Join Size Estimation
Define X = Σi dom(A) f(i)Zi (X incrementally
computable)
Theorem Exp[X2] = Σ f(i)2
– Cross-terms f(i)Zi f(j)Zj have 0 expectation
– Square-terms f(i)Zi f(i)Zi = f(i)2
Hence, X2 is an unbiased estimator for SJ(A)
Space = log (N + Σ f(i))
Independent samples Xk reduce variance
Estimation Quality
How can independent samples Xk improve the quality of
estimation?
Keep s1 x s2 iid instantiations for Xk
Use independent random seeds to get a family of Zi’s for
each instance
s1 reduces variance, s2 boosts confidence
s1
Atomic
Sketch
Avg(X1j2)
Avg(X2j2)
s2
Avg(X3j2)
Avg(X4j2)
Avg(X5j2)
Median
(sketch)
Sample Run of AMS
Z1 =
V =
3
1
-1 1
1
Σvi2 = 123
Z1 =
6
2
-1
X1= 5, X12 = 25
V =
4
1
-1 1
1
6
-1
5
Z2 =
7
-1 1
-1 1
1
X2= 14, X22 = 196 Est = 110.5
2
5
Z2 =
7
-1 1
-1 1
1
Σ vi2 = 130, X1= 6, X12 = 36, X2= 12, X22 = 144, Est = 90
Comments on AMS
The self-join size can be computed on-line
Sufficiently small variance (controlled by s1 and s2)
Can estimate SJ(A) in one pass
Probabilistic guarantee: a relative error of at most with
probability at least 1-
Can this method be extended for other queries?
Complex Aggregate Queries
A. Dobra et al. extend the idea of AMS to
provide approximate answers to complex
aggregate queries.
SELECT AGG FROM R1,R2,…,Rr where
AGG: COUNT/SUM/AVERAGE
: conjunction of equality joins (Ri.Aj = Rk.Al)
The error of these estimates is at most ε with
probability 1-δ.
Attribute Constraint of
Each attribute of a relation appears at most once
in the join condition .
For any attribute Ri.Aj occuring m > 1 times in ,
add m-1 new “attributes” to Ri
Replace m-1 occurrences in with the new
attributes
= ((R1.A1=R2.A1) & (R1.A1=R3.A1)) becomes
((R1.A1=R2.A1) & (R1.A2=R3.A1))
The COUNT Query
QCOUNT : # of tuples in the cross-product of
R1,…,Rr that satisfy the equality constraints in .
SELECT COUNT(*) from R1,…,Rr where
Approach: Express the above query using
mathematical operators.
Rename the 2n join attributes in to A1, A2,…,
A2n such that each constraint in is Aj = An+j for
1 j n.
Projection of Domains
D = dom(A1) x dom(A2) x…x dom(A2n)
Sk is the set of join attributes appearing in
Rk
Dk = dom(Ak1) x dom(Ak2) x…x
dom(Ak|Sk|); Ak1,…,Ak|Sk| are attributes in Sk
Dk is a projection of D on attributes in Sk
Value Assignment I
An assignment I assigns values to (a subset
of) join attributes
If I D, each join attribute is assigned a
value I[j] by I.
If I Dk, then I only assigns a value I[j] to
attributes j Sk.
Use letter j to refer to attribute Aj
Value Assignment I
I[Sk] is the projection of I on attributes in
Sk.
fk(I) is the number of tuples in Rk that
match I.
Example
SELECT COUNT (*) FROM R1, R2, R3
where R1.A1 = R2.A1 and R2.A2 = R3.A1
Renaming: SELECT COUNT (*) FROM
R1, R2, R3 where A1 = A3 and A2 = A4
– A1 = R1.A1
– A3 = R2.A1
A2 = R2.A2
A4 = R3.A1
Note: Joins in the form Aj = An+j
S1 = {1}, S2 = {2,3}, S3 = {4}
Example (cont.)
D = dom(A1) x dom(A2) x dom(A3) x
dom (A4)
I1 = <1, 3, 2, 6> D; I1[3] = 2
D1 = dom(A1); D2 = dom(A2) x dom(A3);
D3 = dom(A4)
I1[S1] = <1>, I1[S2] = <3,2>, I1[S3] = <6>
f2(<3,2>) = 5 means 5 tuples from R2 have
A2=3, A3=2
Exact QCOUNT
QCOUNT=ΣID,j:I[j]=I[n+j]Πrk=1fk(I[Sk])
–
–
Product of the # of tuples in each relation that match a value
assignment I
Summed over all I D that satisfy the equi-join constraints ε
Approximate QCOUNT
Similar to self-join size!
Construct a random variable X that is an
unbiased estimator for QCOUNT
Variance can be bounded from above
Use averaging and median-selection tricks
to boost the accuracy and confidence of X
Achieve small relative error with high
probability
Constructing X
For each pair of join attributes j, n+j in ,
build random variables
{Zj,l: l=1,…,|dom(Aj)|},where Zj,l {-1, +1}
Random variables belonging to families for
different attribute pairs are independent
Space:Σnj=1O(log|dom(Aj)|)
Constructing X (2)
For each relation Rk, define atomic sketch Xk =
ΣIDk (fk(I)Π jSk Zj,I[j])
X = unbiased estimator of QCOUNT = Πrk=1Xk
E(X) = QCOUNT
Each Xk can be efficiently computed as tuples of
Rk are streaming in.
– Initialize Xk to 0
– For each tuple t in the Rk stream, addΠjSk Zj,t[j] to Xk
Join Graph and Acyclicity
A join graph is an undirected graph
– a node for each relation, and,
– an edge for each join attribute pair j, n+j
If the join graph is acyclic, the variance of
X is bounded from above.
Many join queries are acyclic, like chain
joins and star joins.
Probabilistic Guarantees
If QCOUNT is acyclic, multi-join COUNT
query over relations R1,…, Rr, then a
sketch of size O(log|dom(Aj)|) is possible
to approximate QCOUNT so that:
– the relative error of the estimate is at most ε
with probability 1-δ.
Basic Notions of Approximation
For aggregate queries (e.g., SUM, COUNT),
approximation quality can be measured by absolute
relative error:
– |Estimated value – Actual value| / Actual value
Open question: for queries involving more than simple
aggregation, how should we define approximation?
Consider S |><|BT: (S: {A,B}, T: {B,C})
A
B
C
A
B
C
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of Approximation
(2)
Can we accept this kind of approximation?
A
B
C
A
B
C
10
20.5
Doctor
11
21.6
Doctor
8
10.3
Lawyer
8
10.3
Student
3
10.2
Teacher
3
10.2
Teacher
Actual Result
Approximate Result
Basic Notions of Approximation
(3)
Can we provide useful (semantically correct) but stale
results?
A
B
C
A
B
C
10
20.5
Doctor
10
20.5
Doctor
8
10.3
Lawyer
8
10.3
Lawyer
3
10.2
Teacher
Actual Result (at time t)
Approximate Result
(correct result at time t - )
Related Issues
Metric for set-valued queries
Composition of approximate operators
How is it understood/controlled by user?
Integrate into query language
Query planning and interaction with
resource allocation
Accuracy-efficiency-storage tradeoff and
global metric
Conclusions
To evaluate aggregate queries, O(|dom(A)|)
size is needed to obtain exact answers.
True for any deterministic algorithms
We introduce randomized algorithms that
uses only O(Σnj=1log|dom(Aj)|) space.
Only 1 pass is needed for estimation.
Conclusions (2)
Express the target query by mathematical
operators
Use an unbiased estimator for the expression
Independent samples reduce variance
A probabilistic guarantee: the relative error of the
estimate is at most ε with probability 1-δ.
Question: Can we use this technique to answer
queries involving more general selection
predicates (e.g., inequality joins)?
References
N. Alon, Y. Matias, M. Szegedy. The Space
Complexity of Approximating the
Frequency Moments, STOC ’96.
A. Dobra, M. Garofalakis, J. Gehrke, R.
Rastogi. Processing Complex Aggregate
Queries over Data Streams, SIGMOD ’02.
Thank You!