
APPROXIMATE QUERY
PROCESSING IN DATABASES
By:
Jatinder Paul
Introduction
[Diagram: a user issues a SQL query to a Decision Support System (DSS) and receives an exact answer. Long response times!]
• In recent years, advances in data collection and management technologies have led to the proliferation of very large databases.
• Decision support applications such as Online Analytical Processing (OLAP) and data mining tools are gaining popularity. Executing such applications on large volumes of data can be resource intensive.
• For very large (multi-gigabyte) databases, processing even a single query involves accessing enormous amounts of data, leading to very high running times.
What is Approximate Query Processing (AQP)?
 Keeping query processing time small is very important in data mining and decision support systems (DSS).
 In aggregate queries (involving SUM, COUNT, AVG, etc.), precision to the "last decimal" is not needed.
 e.g., "What is the average sale of a product in all the states of the US?"
 Exactness of the result is not very important. The result can be given as an approximate answer with some error guarantee.
 e.g., average salary:
$59,000 +/- $500 (with 95% confidence) in 10 seconds
vs. $59,152.25 in 10 minutes
 Approximate Query Processing (AQP) techniques sacrifice accuracy to improve the running time.
Different AQP techniques
 All AQP techniques involve building some kind of synopsis (summary) of the database.
 AQP techniques differ in the way the synopsis is built.
 Sampling: the most preferred and practically used technique.
* Randomized sampling: sampling uniformly at random.
* Deterministic sampling: selecting the best sample that minimizes the total error in the approximate answers.
 Histograms: involve creating histograms which represent the database.
 Wavelets: still a research area!
 Here we concentrate on the various sampling techniques.
Basic Sampling Based AQP Architecture
1. Pre-processing phase (offline): the sample is built.
2. Run-time phase (online): queries are rewritten to run against the sample. The result is then scaled to give the approximate answer (see the sketch below).
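As a concrete illustration of the run-time phase, here is a minimal sketch in Python, assuming a 1% uniform sample of a relation R(ProductId, Revenue) has already been materialized offline as a table sample_R; the table and column names are illustrative, not from the original slides.

```python
import sqlite3

SAMPLING_RATE = 0.01  # assumed 1% uniform sample built in the pre-processing phase

def approximate_sum(conn: sqlite3.Connection) -> float:
    """Rewrite SELECT SUM(Revenue) FROM R to run against sample_R, then scale up."""
    # Run the aggregate on the much smaller sample table instead of the base table.
    (sample_sum,) = conn.execute("SELECT SUM(Revenue) FROM sample_R").fetchone()
    # Scale the sample result by the inverse of the sampling rate.
    return sample_sum / SAMPLING_RATE
```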
Limitations of Uniform Sampling
for Aggregation Queries
 We discuss three problems that adversely affect the accuracy of sampling-based estimations:
o the presence of skewed data in aggregate values,
o the effect of low selectivity in selection queries, and
o the presence of small groups in group-by queries.
Each of these problems is discussed in detail below.
Effect of data skew
 A skewed database is characterized by the existence of certain tuples that deviate from the rest in terms of their contribution to the aggregate value.
 These tuples are called outliers. They are the values that contribute heavily to the error in uniform sampling.
 Outliers in the data result in large variance:
large variance => large standard deviation, and
large standard deviation => large error in the results.
Effect of data skew
 Example: Consider a relation R containing two columns <ProductId, Revenue> and four records {<1, 10>, <2, 10>, <3, 10>, <4, 1000>}, and the aggregate query Q: SELECT SUM(Revenue) FROM R.
We use a sample S of two records from R to answer the query. The query is answered by:
1. running it against S, and
2. scaling the result by a factor of two (since we are using a 50% sample).
 Consider a sample S1 = {<1, 10>, <3, 10>}. The estimated answer for Q using S1 is 40, which is a severe underestimate of the actual answer (1030).
 Now consider another sample S2 = {<1, 10>, <4, 1000>}. The estimated answer for Q using S2 is 2020, which is a significant overestimate.
Thus, large variance in the aggregate column can lead to large relative errors (the sketch below reproduces these numbers).
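The arithmetic of this example can be checked with a few lines of Python; this is only a sketch of the 50%-sample estimator used above.

```python
# Relation R from the example: four <ProductId, Revenue> records.
R = {1: 10, 2: 10, 3: 10, 4: 1000}

def estimate_sum(sample_ids, scale=2):
    """Estimate SUM(Revenue) from a 50% sample by scaling the sample sum by 2."""
    return scale * sum(R[pid] for pid in sample_ids)

print(estimate_sum([1, 3]))  # S1 -> 40, a severe underestimate of the true 1030
print(estimate_sum([1, 4]))  # S2 -> 2020, a significant overestimate
```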
Handling data skew: Outlier-indexes
 The main idea is to deal with outliers separately, and sample from the rest of the relation.
 Partition the relation R into two sub-relations:
o RO (outliers) and
o RNO (non-outliers).
 The query Q can now be considered as the UNION of two sub-queries:
o Q applied to RO and
o Q applied to RNO.
Handling data skew: Outlier-indexes
 Preprocessing steps
1. Determine outliers: specify the sub-relation RO of the relation R deemed to be the set of outliers.
2. Sample non-outliers: select a uniform random sample T of the relation RNO.
 Query processing steps (sketched below)
1. Aggregate outliers: apply the query to the outliers in RO, accessed via the outlier-index.
2. Aggregate non-outliers: apply the query to the sample T and extrapolate to obtain an estimate of the query result for RNO.
3. Combine aggregates: combine the approximate result for RNO with the exact result for RO to obtain an approximate result for R.
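A minimal sketch of this scheme for a SUM aggregate, assuming outliers are simply the values at or above a given cutoff; the cutoff choice and the in-memory representation are illustrative, not the algorithm from the cited paper.

```python
import random

def outlier_index_sum(values, cutoff, sampling_rate):
    """Approximate SUM over a skewed column using an outlier index (sketch)."""
    outliers     = [v for v in values if v >= cutoff]   # RO: handled exactly
    non_outliers = [v for v in values if v < cutoff]    # RNO: handled via a sample

    exact_outlier_sum = sum(outliers)                   # 1. aggregate outliers
    if not non_outliers:
        return exact_outlier_sum

    # 2. aggregate non-outliers: uniform sample T of RNO, then extrapolate
    k = max(1, int(sampling_rate * len(non_outliers)))
    sample_T = random.sample(non_outliers, k)
    estimated_rno_sum = sum(sample_T) * len(non_outliers) / k

    # 3. combine the exact outlier aggregate with the estimated non-outlier aggregate
    return exact_outlier_sum + estimated_rno_sum
```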
Outlier-indexes: Other Issues
 Since the database content changes over time, the outlier indexes and samples need to be refreshed appropriately; precomputed samples can become stale over time.
 An alternative is to do the sampling completely online, i.e., make it a part of query processing.
 For the COUNT aggregate, outlier-indexing is not beneficial since there is no variance among the data values.
 Outlier indexing is also not useful for aggregates that depend on the rank order of tuples rather than on their actual values (e.g., MIN, MAX and MEDIAN).
Overcoming limitations of sampling in AQP
Effect of low selectivity
 Most queries involve selection conditions and/or group-bys.
 If the selectivity of a query is low (i.e., it selects only a few records), it adversely impacts the accuracy of sampling-based estimation.
 A selection query partitions the relation into two sub-relations:
o tuples that satisfy the condition (the relevant sub-relation) and
o tuples that do not.
 If we sample uniformly from the relation, the number of tuples sampled from the relevant sub-relation will be proportional to its size. If this relevant sample size is small due to the low selectivity of the query, it may lead to large error.
Effect of small groups.
 The effect of small groups is the same as the effect of low selectivity.
 Group-by queries also partition the relation into numerous sub-relations (the tuples that belong to specific groups).
 Thus, for uniform sampling to perform well, the relevant sub-relation should be large in size, which is not the case in general.
Using Workload Information:
Solution to low selectivity and small groups
 One approach to this problem is to use weighted sampling (instead of uniform sampling) of the data, exploiting workload information while drawing the sample.
 The essential idea behind the weighted sampling scheme is to sample more from subsets of the data that are small in size but important, i.e., have high usage. This results in low error.
 This approach is based on the fact that the usage of a database is typically characterized by considerable locality in the access pattern, i.e., queries against the database access certain parts of the data more than others.
Using Workload Information:
Solution to low selectivity and small groups
 The use of workload information involves the following steps:
1. Workload Collection: obtain a workload consisting of representative queries against the database. Modern database systems provide tools to log queries posed against the server (e.g., the Profiler component of Microsoft SQL Server).
2. Trace Query Patterns: the workload can be analyzed to obtain parsed information, e.g., the set of selection conditions that are posed.
3. Trace Tuple Usage: the execution of the workload reveals additional information on the usage of specific tuples, e.g., the frequency of access to each tuple, or the number of queries in the workload for which it passes the selection condition.
4. Weighted Sampling: perform sampling by taking into account the weights of tuples (from Step 3).
Weighted Sampling
 A tuple ti has weight wi if it is required to answer wi of the queries in the workload.
 The normalized weight wi' is defined as wi' = wi / Σj wj.
 A tuple is accepted in the sample with probability pi = n · wi', where n is the sample size.
 Thus, the probability with which each tuple is accepted in the sample varies from tuple to tuple.
 The inverse of this probability is the multiplication factor associated with the tuple, used while answering the query. Each aggregate computed over this tuple gets multiplied by this multiplication factor (see the sketch below).
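A small sketch of this weighted sampling scheme, assuming the per-tuple weights w_i have already been collected from the workload (Step 3 above); capping the acceptance probability at 1 is an implementation detail of this sketch.

```python
import random

def weighted_sample(tuples, weights, n):
    """Draw a workload-weighted sample of expected size about n.

    weights[i] = w_i, the number of workload queries that needed tuple i.
    Returns (tuple, multiplication_factor) pairs for use at query time.
    """
    total = sum(weights)
    sample = []
    for t, w in zip(tuples, weights):
        w_norm = w / total                  # normalized weight w_i'
        p = min(1.0, n * w_norm)            # acceptance probability p_i = n * w_i'
        if random.random() < p:
            sample.append((t, 1.0 / p))     # multiplication factor = 1 / p_i
    return sample
```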
Problem with Group-By Queries
 Decision support queries routinely segment the data into groups.
 For example, a group-by query on the U.S. census database could be used to determine the per capita income per state. However, there can be a huge discrepancy in the sizes of different groups; e.g., the state of California has nearly 70 times the population of Wyoming.
 As a result, a uniform random sample of the relation will contain disproportionately fewer tuples from the smaller groups, which leads to poor accuracy for answers on those groups, because accuracy is highly dependent on the number of sample tuples that belong to that group.
 The standard error is inversely proportional to √n for a uniform sample, where n is the uniform random sample size.
Solution: Congressional Samples
 Congressional samples are a hybrid union of uniform and biased samples.
 The strategy adopted is to divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
 Consider the US Congress, which is a hybrid of the House and the Senate:
the House has representatives from each state in proportion to its population;
the Senate has an equal number of representatives from each state.
 We now apply the House and Senate scenario to representing different groups.
House sample: a uniform random sample of the whole relation, so each group is represented in proportion to its size.
Senate sample: sample an equal number of tuples from each group.
Congressional Samples
 We define a strategy S1 as follows:
divide the available sample space X equally among the g groups, and take a uniform random sample within each group.
 Congressional approach: in this approach we consider the entire set of possible group-by queries over a relation R.
 Let G* be the set of non-empty groups under the grouping G. The grouping G partitions the relation R according to the cross-product of all the grouping attributes; this is the finest possible partitioning for group-bys on R. Any group h on any other grouping T ⊆ G is the union of one or more groups g from G*.
 Constructing Congress:
1. Apply S1 to each grouping T ⊆ G.
2. Let T* be the set of non-empty groups under the grouping T, and let mT be the number of such groups.
3. By S1, each of the non-empty groups in T should get a uniform random sample of X/mT tuples from that group.
Congressional Samples
 Constructing Congress (continued):
4. Thus, for each subgroup g in G* of a group h in T*, the expected space allocated to g is
Sg,T = (X / mT) · (ng / nh),
where ng and nh are the number of tuples in g and h respectively.
5. Then, for each group g in G*, take the maximum over all T of Sg,T as the sample size for g, and scale it down to limit the total space used to X. The final formula is:
SampleSize(g) = X · maxT Sg,T / Σg'∈G* maxT Sg',T
6. For each group g in G*, select a uniform random sample of size SampleSize(g).
Thus we have a stratified, biased sample in which each group at the finest partitioning is its own stratum. Congress essentially guarantees that both large and small groups in all groupings will have a reasonable number of samples (a small allocation sketch follows).
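A sketch of this allocation for an in-memory relation, assuming the notation above (G* is the set of finest groups, T ranges over all subsets of the grouping attributes, and Sg,T = (X/mT)·(ng/nh)); it is meant to mirror steps 1-6, not the exact implementation in the Congressional samples paper.

```python
from collections import Counter
from itertools import combinations

def congress_allocation(rows, grouping_attrs, X):
    """Allocate a sample budget X across the finest groups, following steps 1-6."""
    n_g = Counter(tuple(r[a] for a in grouping_attrs) for r in rows)  # finest groups G*

    best = {g: 0.0 for g in n_g}                        # max over T of S_{g,T}
    for k in range(len(grouping_attrs) + 1):            # every grouping T, incl. empty
        for T in combinations(grouping_attrs, k):
            n_h = Counter(tuple(r[a] for a in T) for r in rows)  # group sizes under T
            m_T = len(n_h)                              # number of non-empty groups in T
            for g in n_g:
                h = tuple(g[grouping_attrs.index(a)] for a in T)
                s_gT = (X / m_T) * (n_g[g] / n_h[h])    # expected space for g within h
                best[g] = max(best[g], s_gT)

    scale = X / sum(best.values())                      # scale down to fit within X
    return {g: best[g] * scale for g in best}

# Example: allocate 100 sample rows over groups of (state, product), where
# rows = [{"state": "CA", "product": "A", ...}, ...]
# sizes = congress_allocation(rows, ["state", "product"], 100)
```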
Rewriting for biased samples
 Query rewriting involves two key steps:
a) scaling up the aggregate expressions, and
b) deriving error bounds on the estimate.
 For each tuple, let its scale factor ScaleFactor be the inverse of the sampling rate for its stratum.
 All the sample tuples belonging to a group will have the same ScaleFactor. Thus, the key step in scaling is to efficiently associate each tuple with its corresponding ScaleFactor.
 There are two approaches to doing this (see the sketch below):
a) store the ScaleFactor (SF) with each tuple in the sample relation, or
b) use a separate table to store the ScaleFactors for the groups.
 Each approach has its pros and cons.
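For approach (a), the rewrite itself can stay in plain SQL; the following sketch assumes a stratified sample table sample_R that stores the per-tuple ScaleFactor in a column SF (the table, column, and grouping names are illustrative).

```python
# Original query over the base relation:
#   SELECT State, SUM(Revenue) FROM R GROUP BY State
# Rewritten to run against the stratified sample, scaling each tuple by its SF:
rewritten_query = """
    SELECT State, SUM(Revenue * SF) AS approx_sum
    FROM sample_R
    GROUP BY State
"""
```

Approach (b) would instead join the sample with a small per-group ScaleFactor table before aggregating.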
Dynamic Sample Selection for AQP
 All previous strategies for AQP use a single sample with a fixed bias and do not take advantage of extra disk space.
 Increasing the size of a sample stored on disk increases the running time of a query executing against that sample.
 Dynamic sampling gets around this problem by:
1. creating a large sample containing a family of differently biased sub-samples during the pre-processing phase, and
2. using only a portion of the sample to answer each query at runtime.
 Because there are many different sub-samples with different biases available to choose from at runtime, the chances increase that one of them will be a "good fit" for any particular query that is issued.
 And since only a small portion of the overall sample is used to answer a query, response time is low.
Generic Dynamic Sample Selection
Architecture
 Pre-Processing Phase
Consists of two steps:
1. The data distribution and query distribution (e.g., workload information) are examined to identify a set of biased samples to be created.
2. The samples are created and stored in the database along with their metadata.
[Figure 2: Pre-processing phase. The database is divided into samples, and the metadata identifies the characteristics of each sample.]
Generic Dynamic Sample Selection
Architecture
 Runtime Phase
Consists of two steps:
1. The query is rewritten to run against the sample tables.
2. The appropriate sample table(s) to use for a given query Q are determined by comparing Q with the metadata annotations for the samples.
[Figure 3: Runtime phase.]
We need algorithms to decide:
1. Which samples are to be built during the pre-processing phase.
2. Which samples are to be used for query answering during the runtime phase.
It should be possible to quickly determine which of the various samples to use to answer a given query (see the sketch below).
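As a rough illustration of the runtime selection step, the sketch below assumes each pre-built sample is annotated with metadata recording which group-by columns it is tuned for and how many rows it contains; the matching rule is an assumption of this sketch, since the slides leave the algorithm open.

```python
def choose_sample(query_groupby_cols, samples_metadata):
    """Pick the smallest pre-built sample whose metadata covers the query's group-bys.

    samples_metadata: {sample_table: {"covers": set of columns, "rows": int}}
    """
    candidates = [
        (meta["rows"], name)
        for name, meta in samples_metadata.items()
        if set(query_groupby_cols) <= meta["covers"]   # sample is a "good fit"
    ]
    # Fall back to the general-purpose overall sample if no specialized sample fits.
    return min(candidates)[1] if candidates else "overall_sample"
```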
Small Group Sampling: A dynamic sample selection technique
 Small group sampling targets the most common type of analysis queries: aggregation queries with group-bys.
 Basic idea: uniform sampling performs well for large groups, but it performs poorly for small groups since it does not draw enough representatives from them.
 The small group sampling approach uses a combination of:
o a uniform random sample (the overall sample) over large groups, and
o one or more small group tables that contain only rows from small groups. The small group tables are not downscaled.
Small Group Sampling Algorithm
Description
Pre-processing phase
The pre-processing algorithm takes two input parameters:
 Base sampling rate r, which determines the size of the uniform random sample that is created (i.e., the overall sample).
 Small group fraction t, which determines the maximum size of each small group sample table.
Let N = the number of rows in the database, and C = the set of columns in the database.
The pre-processing algorithm can be implemented efficiently by making just two scans:
1) The first scan identifies the frequently occurring values for each column and their approximate frequencies.
2) In the second scan, the small group tables for each column in S are constructed along with the overall sample.
Small Group Sampling Algorithm
Description
Pre-processing phase algorithm
First pass of the algorithm
1. Initially, the set S is initialized to C.
2. Count the number of occurrences of each distinct value in each column of the database and determine the common values for each column (this can be done using a separate hash table per column).
3. If the number of distinct values for a column exceeds a threshold u, then remove that column from S and stop maintaining its count.
L(C) is defined as the minimum set of values from C whose frequencies sum to at least N(1 − t).
Second pass of the algorithm
1. Determine the set of common values L(C) for each column C.
2. Rows with values from the set L(C) will not be included in the small group table for C, but rows with all other values will be; there are at most Nt such rows.
3. If a column C has no small groups, remove it from S.
4. After computing L(C) for every C ∈ S, the algorithm creates a metadata table which contains a mapping from each column name to a unique index between 0 and |S| − 1.
Pre-Processing Phase Algorithm
 The final step in pre-processing is to make a second scan of the database to construct the sample tables (see the sketch below).
1. Each row containing an uncommon value for one or more columns (i.e., a value not in the set L(C)) is added to the small group sample table for the appropriate column(s).
2. At the same time as the small group tables are being constructed, the pre-processing algorithm also creates the overall sample, using reservoir sampling to maintain a uniform random sample of rN tuples.
3. Each row that is added to either a small group table or the overall sample is tagged with an extra bit-mask field (of length |S|) indicating the set of small group tables to which that row was added. This field is used during runtime query processing to avoid double-counting rows that are assigned to multiple sample tables.
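The two scans can be condensed into the following Python sketch for an in-memory list of rows; r, t and u are the parameters defined earlier, and the returned structures correspond to the overall sample (with bit-masks), the small group tables and the column-to-index metadata. This is an illustrative reading of the algorithm, not the paper's implementation.

```python
import random
from collections import Counter

def small_group_preprocess(rows, columns, r, t, u):
    """Two-scan pre-processing for small group sampling (sketch)."""
    N = len(rows)

    # --- First scan: count value frequencies per column; drop columns with more
    # than u distinct values (they are too fine-grained to have "common" values).
    S, counts = set(columns), {c: Counter() for c in columns}
    for row in rows:
        for c in list(S):
            counts[c][row[c]] += 1
            if len(counts[c]) > u:
                S.discard(c)            # stop maintaining this column's counts

    # L(C): minimal set of most frequent values covering at least N * (1 - t) rows.
    L = {}
    for c in list(S):
        common, covered = set(), 0
        for value, freq in counts[c].most_common():
            if covered >= N * (1 - t):
                break
            common.add(value)
            covered += freq
        if covered == N:                # no uncommon rows => column has no small groups
            S.discard(c)
        else:
            L[c] = common
    column_index = {c: i for i, c in enumerate(sorted(S))}   # metadata table

    # --- Second scan: build the small group tables and the overall reservoir sample.
    small_groups = {c: [] for c in S}
    reservoir, target = [], max(1, int(r * N))
    for seen, row in enumerate(rows, start=1):
        mask = 0
        for c in S:
            if row[c] not in L[c]:                   # uncommon value => small group row
                small_groups[c].append(row)
                mask |= 1 << column_index[c]         # bit-mask of tables holding the row
        if len(reservoir) < target:                  # reservoir sampling keeps ~rN rows
            reservoir.append((row, mask))
        elif random.randrange(seen) < target:
            reservoir[random.randrange(target)] = (row, mask)
    return reservoir, small_groups, column_index
```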
Small Group Sampling Algorithm
Description
Runtime phase
 When a query arrives at runtime, it is rewritten to run against the sample tables instead of the base fact table.
 Each query is executed against the overall sample, scaling the aggregate values by the inverse of the sampling rate r.
 In addition, for each column C ∈ S in the query's group-by list, the query is also executed against that column's small group table. The aggregate values are not scaled when executing against the small group sample tables.
 Finally, the results from the various sample queries are aggregated together into a single approximate query answer. Since a row can be included in multiple sample tables, the rewritten queries include filters that avoid double-counting rows (see the sketch below).
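Continuing the sketch above, the runtime combination for a single-column group-by SUM might look like the following; scaling by 1/r for the overall sample, using small-group rows unscaled, and filtering with the bit-mask to avoid double counting are the points the slides describe, while the exact filtering rule here is an assumption of the sketch.

```python
from collections import defaultdict

def approx_groupby_sum(groupby_col, agg_col, r,
                       overall_sample, small_groups, column_index):
    """Approximate SELECT groupby_col, SUM(agg_col) FROM R GROUP BY groupby_col."""
    result = defaultdict(float)
    bit = 1 << column_index[groupby_col] if groupby_col in column_index else 0

    # Overall sample: scale each row by 1/r, skipping rows that also live in this
    # column's small group table (the bit-mask filter prevents double counting).
    for row, mask in overall_sample:
        if not (mask & bit):
            result[row[groupby_col]] += row[agg_col] / r

    # Small group table for this column: its rows are complete, so they are unscaled.
    for row in small_groups.get(groupby_col, []):
        result[row[groupby_col]] += row[agg_col]

    return dict(result)
```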
Small group sampling vs. Congressional sampling
1. Congressional sampling creates only a single sample, so the sample must necessarily be very general purpose in nature and is only a loose fit for any particular query. Small group sampling uses the dynamic sample selection architecture and thus can have the benefits of more specialized samples that are each tuned for a narrower, more specific class of queries.
2. The pre-processing time required by congressional sampling is proportional to the number of different combinations of grouping columns, which is exponential in the number of columns. This renders it impractical for typical data warehouses that have dozens or hundreds of potential grouping columns. In contrast, the pre-processing time for small group sampling is linear in the number of columns in the database.
Problem with Joins
 The approximate query strategies discussed so far use some sort of sample (called a base sample) obtained using uniform random sampling.
 The use of base samples to estimate the output of a join of two or more relations, however, can produce a poor quality approximation, for the following two reasons:
1. Non-Uniform Result Sample: in general, the join of two uniform random base samples is not a uniform random sample of the output of the join. In most cases, this non-uniformity significantly degrades the accuracy of the answer and the confidence bounds.
2. Small Join Result Sizes: the join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds, since they critically depend on the query result size.
 Thus, it is in general impossible to produce good quality approximate answers using samples on the base relations alone.
Solution: Join Synopses
 We need to pre-compute samples of join results to make quality answers possible.
 A first, naive solution to pre-computing such samples is to execute all possible join queries of interest and collect samples of their results.
 But this is not feasible, since it is too expensive to compute and maintain.
 One strategy is to compute samples of the results of a small set of distinguished joins, which can be used to obtain random samples of all possible joins in the schema. The samples of these distinguished joins are called join synopses.
 Join synopses are a solution for queries with only foreign-key joins.
Join Synopses
 Lemma J.1: The sub-graph of G on the k nodes in any k-way foreign key join must be a connected sub-graph with a single root node.
We denote the relation corresponding to the root node as the source relation for the k-way foreign key join.
 Lemma J.2: There is a 1-1 correspondence between the tuples in a relation r1 and the tuples in any k-way foreign key join with source relation r1.
 For each relation r, there is some maximum foreign key join (i.e., the one having the largest number of relations) with r as the source relation. For example, in the accompanying figure (omitted here), the join rooted at relation C is the maximum foreign key join with source relation C.
Join Synopses
 Definition of join synopses: For each node u in G, corresponding to a relation r1:
Let J(u) be the output of the maximum foreign key join with source r1.
Let S(u) be a uniform random sample of r1.
The join synopsis JS(u) is defined to be the output of joining S(u) with the remaining relations in that maximum foreign key join.
The join synopses of a schema consist of JS(u) for all u in G.
 Join synopses are also referred to as join samples.
Join Synopses
 A single join synopsis can be used for a large number of distinct joins.
 Lemma J.3: From a single join synopsis for a node whose maximum foreign key join has k relations, we can extract a uniform random sample of the output of between k and 2^(k−1) distinct foreign key joins.
 In other words, a join synopsis of relation r can be used to provide a random sample of any join involving r and one or more of its descendants.
 Since Lemma J.2 fails to apply in general for any relation other than the source relation, the joining tuples in any relation r other than the source relation will not in general be a uniform random sample of r. Thus, distinct join synopses are needed for each node/relation.
 Since tuples in join synopses are the results of multi-way joins, a possible concern is that they will be too large because they have many columns. To reduce the columns stored for tuples in join synopses, we can eliminate redundant columns (for example, join columns) and store only the columns of interest. Small relations can be stored in their entirety, rather than as part of join synopses.
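As a closing sketch, building a join synopsis for one source relation can be pictured as follows, assuming pure foreign-key joins so that every sampled source tuple matches exactly one tuple in each descendant relation (Lemma J.2); the data layout (row dicts and descendant tables keyed by primary key) is an assumption of this sketch.

```python
import random

def join_synopsis(source_rows, descendants, sample_size):
    """Build a join synopsis JS(u) for one source relation (sketch).

    source_rows: list of row dicts for the source relation.
    descendants: list of (fk_column, rows_by_primary_key) pairs, listed in join
                 order for the maximum foreign key join with this source.
    """
    sample = random.sample(source_rows, min(sample_size, len(source_rows)))
    synopsis = []
    for row in sample:
        joined = dict(row)
        # Foreign-key join: each tuple matches exactly one tuple per descendant,
        # so sampling the source first still yields a uniform sample of the join.
        for fk_col, rows_by_pk in descendants:
            joined.update(rows_by_pk[joined[fk_col]])
        synopsis.append(joined)
    return synopsis
```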
References
1. Surajit Chaudhuri, Gautam Das, Mayur Datar, Rajeev Motwani, Vivek Narasayya: Overcoming Limitations of Sampling for Aggregation Queries. ICDE 2001.
2. Surajit Chaudhuri, Gautam Das, Vivek Narasayya: A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. SIGMOD Conference 2001.
3. Brian Babcock, Surajit Chaudhuri, Gautam Das: Dynamic Sample Selection for Approximate Query Processing. SIGMOD Conference 2003.
4. S. Acharya, P. B. Gibbons, V. Poosala, S. Ramaswamy: Join Synopses for Approximate Query Answering. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 275-286, June 1999.
5. Gautam Das: Survey of Approximate Query Processing Techniques. (Invited Tutorial) SSDBM 2003.
6. P. B. Gibbons, S. Acharya, V. Poosala: Aqua Approximate Query Answering System.
7. Yong Yao, Johannes Gehrke: Query Processing for Sensor Networks.
THANK YOU