Transcript ppt slides

A Robust, Optimization-Based
Approach for Approximate Answering
of Aggregate Queries
Surajit Chaudhuri
Gautam Das
Vivek Narasayya
Presented by
Sushanth Sivaram Vallath
Why Approximate query
Answering?




Most applications are OLAP and
data mining for analyzing large
databases.
Most of the queries are aggregation
queries on these large databases.
Hence expensive and resource
intensive.
Approximate answers given
accurately and efficiently benefits the
scalability of the application.
Approaches to
Approximation/Data Reduction


Use pre-computed samples of data
instead of complete data
Sampling:




Weighted Sampling
Congressional sampling
On the fly sampling (error prone when
selections, GROUP BYs and joins are
used)
Workload (Deterministic solution for
identical workloads)
Approaches to
Approximation/Data Reduction
(contd.)



“similar” workloads, are
considered as optimization
problem (minimizing the error)
Histograms
Wavelets
Attacking “similar” workloads


Stratified sampling
Minimize error in estimation of
aggregates
Drawbacks of previous studies



Lack of rigorous problem
formulations leds to solutions that
are difficult to evaluate theoretically.
Does not deal with uncertainty in
incoming queries that are “similar”
but identical
Ignore the variance in data
distribution of aggregated columns
Architecture for AQP

Queries with selections, foreignkey joins and GROUP BY,
containing aggregation functions
such as COUNT, SUM, AVG.
Architecture for AQP
Architecture for AQP


Offline Component: building of
sample
Online Component:
1.
2.
Rewrites an incoming query to
use the sample to answer the
query approximately
Reports the answer with error
estimates
Pre-computing Samples for Fixed
Workload

Fundamental Regions: For a given
relation R and workload W, consider
partitioning the records in R into a
minimum number of regions R1, R2, …, Rr
such that for any region Rj, each query in
W selects either all records in Rj or none.
Fixed Workload
1.
Identify all fundamental regions → r
After step 1, Case A (r ≤ k) and Case B (r >k)
(k → sample size)
2.
3.
2.
3.
Case A (r ≤ k): (Pick Sample records) Selects the
samples by picking exactly one record from each
important fundamental region
Assigns appropriate values to additional columns in the
sample records
Case B (r > k): (Pick Sample records) select k regions
and then pick one record from each of the selected
regions. The heuristic is to select top k.
Assign Values to Additional Columns. This is an
optimization problem, which is solved by partially
differentiating and the resulting linear equations using
Gauss-Seidel method.
Disadvantage


Per-query (probabilistic) error
guarantee is not possible.
If incoming query is not identical
to a query in the given workload,
FIXED can result in
unpredictable errors.
Lifting Workload to Query
Distributions



Should be resilient to situations
where “similar” but not identical
queries
“Similar”ity is not based on syntax. If
the records returned by the two
queries have significant overlap, it is
similar.
Each record will have a probability
associated with it such that the
incoming query will select this record
Rationale for Stratified
Sampling
Minimize the
MSE of the lifted
workload

An effective scheme for
stratification should be such that
the expected variance, over all
queries in each stratum, is
small, and allocate more
samples to strata with larger
expected variances.
Example
ProductID
Revenue
1
10
2
10
3
10
4
1000
Solution for single-table selection
queries with Aggregation
1.
Stratification
1.
2.
2.
Allocation
1.
3.
How many strata required during
partition?
How many records should each strata
have?
Determine the number of samples
required across each strata
Sampling
Pragmatic Issues




Identifying Fundamental
Regions
Handling Large Number of
Fundamental Regions
Obtaining Integer Solutions
Obtaining an Unbiased
Estimator
Extensions for more General
Queries



GROUP BY Queries
JOIN Queries
Other Extensions

Mix of COUNT and SUM queries
Conclusion

The solutions FIXED and
STRAT handles the problems of
data variance, heterogeneous
mixes of queries, GROUP BY
and foreign-key joins.