A Statistics Propagation Approach to Enable Cost

Download Report

Transcript A Statistics Propagation Approach to Enable Cost

University of Stuttgart
as
Anwendungssoftware
A Statistics Propagation Approach
to Enable Cost-Based Optimization
of Statement Sequences
Tobias Kraft, Holger Schwarz, Bernhard Mitschang
Institute of Parallel and Distributed Systems
University of Stuttgart
University
of Stuttgart
as
Anwendungssoftware
Overview
• Motivation
• Cost Estimation Approach
• Histogram Propagation
• Related Work
• Experiments
• Conclusion & Future Work
2
University
of Stuttgart
as
Motivation
Anwendungssoftware
The Case for Optimization
• Many of today’s applications
embed query generators.
• Some of these generators not
only produce a single query but
a sequence of SQL statements
(e.g. MicroStrategy DSS tools).
• Rewriting these sequences may
lead to significant performance
improvements!
3
• Development of an optimizer based on rewrite rules and a
heuristic priority-based control strategy
Coarse-Grained Optimization (CGO) [VLDB03]
University
of Stuttgart
as
Motivation
Anwendungssoftware
Statement Sequences
We focus on statement sequences that:
CREATE TABLE q1
(custkey INTEGER, turnover1990 FLOAT);
CREATE TABLE q2
(custkey INTEGER, turnover1991 FLOAT);
• compute the final result of a request
in a set of subsequent steps,
• allow to share intermediate results by
multiple subsequent steps,
• temporarily store intermediate results
in tables that are being created and
dropped within the sequence,
• store the final result in a table that is
not being dropped within the sequence.
4
CREATE TABLE q3
(custkey INTEGER, name VARCHAR(25));
INSERT INTO q1
SELECT o.custkey, SUM(o.totalprice)
FROM
orders o
WHERE o.orderyear = 1990
GROUP BY o.custkey;
INSERT INTO q2
SELECT o.custkey, SUM(o.totalprice)
FROM
orders o
WHERE o.orderyear = 1991
GROUP BY o.custkey;
INSERT INTO q3
SELECT c.custkey, c.name
FROM
q1, q2, customer c
WHERE q1.custkey = c.custkey
AND q1.custkey = q2.custkey
AND q2.turnover1991 >
q1.turnover1990;
DROP TABLE q1;
DROP TABLE q2;
q1
q2
q3
University
of Stuttgart
as
Motivation
Anwendungssoftware
Problems of the Heuristic Control Strategy
• In some scenarios a rule application may lead to deterioration
of performance.
• Different behavior on different platforms and database
management systems.
• Alternative sequences of rule applications lead to different
results.
Need for a cost-based approach.
5
University
of Stuttgart
as
Cost Estimation Approach
Anwendungssoftware
Problems & Solutions
• Cost estimates depend on the physical layout of the database and the
capabilities and strategies of the DBMS‘s query optimizer.
A cost model on top of the DBMS is no feasible solution.
Make use of the cost estimates provided by the DBMS‘s query
optimizer.
• DBMSs only provide cost estimates for statements on existing tables.
Execute CREATE TABLES statements before cost estimation.
• Missing statistics for the created tables causes the DBMS‘s query
optimizer to use default values for cardinality and selectivity.
Propagate statistics through the INSERT statements.
Make these propagated statistics available to the DBMS‘s optimizer.
6
University
of Stuttgart
as
Cost Estimation Approach
Anwendungssoftware
Algorithm of Cost Estimation
Input:
Output:
A statement sequence S.
A cost estimate for S.
totalcosts = 0
foreach CREATE TABLE statement c in S
Execute c on the underlying database system.
foreach INSERT statement i in S (in the order given by S)
Retrieve a cost estimate for i from the optimizer of the underlying database system.
Add this cost estimate to totalcosts.
Translate i into an algebraic tree.
Retrieve histograms for the base tables from the underlying database system and
propagate them through the algebraic tree to retrieve histograms for the target
table of i.
Store the resulting histograms in the catalog of the underlying database system.
foreach CREATE TABLE statement c in S
Drop the table that has been created by c.
return totalcosts.
7
University
of Stuttgart
as
Cost Estimation Approach
Anwendungssoftware
Architectural Overview
sequence
optimized
sequence
cost-estimation component
for statement sequences
statement cost
aggregation
histogram
propagation
Statistics API
cost-based CGO optimizer
cost-based
control strategy
CGO ruleset
JDBC
histograms
cost estimates
for statements
DBMS query optimizer
DBMS execution engine
and data management
8
DDL statements
University
of Stuttgart
Histogram Propagation
as
Anwendungssoftware
Overview
• We have adopted techniques from approximate query
answering and added some extensions:





T4
interval arithmetic for arithmetic terms,
heuristics for grouping and aggregation,
unification of comparison operators,
normalization of histograms,
common subexpressions


• SQL statement sequence  algebraic operator trees.

• Everything is a bucket / histogram:
 NULL values are represented by a special bucket.
 Constant values (used in arithmetic terms) are
represented by a histogram with a single bucket.
 Sets of constants (used in IN-predicates) are
represented by a histogram.
9
X


T1
T2
T3
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Histograms
• Buckets are 4-tuples:
(low, high, card, dv )
Histogram
#properties:
#dataType:
• The NULL value:
(null, null, card, 1)
1
*
#buckets
Bucket
• A constant value c :
(c, c, card, 1)
#properties:
#dataType:
#cardinality:
#distinctValues:
1
1
CharValue
-value: String
Properties
int
double
double
1
#low
1
Value
#dataType:
10
Properties
int
#high
int
DateValue
DecimalValue
IntegerValue
-value: Date
-value: double
-value: long
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Algebra and Propagation
• Operators: projection, selection, cartesian product, union,
difference and grouping (including aggregation).
• A join can be represented by a cartesian product followed by a
selection that contains the join condition.
• Arithmetic terms (projection / selection) and predicates
(selection) are also represented by operator trees.
• Propagation is done by recursively traversing the algebraic
operator tree and the arithmetic trees in a post order manner.
11
University
of Stuttgart
Histogram Propagation
as
Anwendungssoftware
Algebra and Propagation
• Arithmetic Operators:




histograms as input
a result histogram as output
iterate over all bucket combinations
compute a result bucket for each bucket combination
• Comparison Operators:





A3
histograms as input
selectivity as output
some operators also provide modified histograms
iterate over all bucket combinations
compute a selectivity for each bucket combination
(and adapt the buckets)
A1  A2
A1
A2
A1 ‘
A2 ‘
selectivity
A1  A2
A1
12
A2
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Interval Arithmetic for Arithmetic Terms
• Example of using interval arithmetic for adding two buckets:
 BO.low = BI1.low + BI2.low
 BO.high = BI1.high + BI2.high
BI1
+
0 10 20 30 40 50
=
BI2
0 10 20 30 40 50
+
0 10 20 30 40 50
BO
0 10 20 30 40 50
=
0 10 20 30 40 50
0 10 20 30 40 50 60
Serialization
13
10 20 30 40 50 60
University
of Stuttgart
Histogram Propagation
as
Anwendungssoftware
Heuristics for Grouping and Aggregation
• Determine amount of groups and average group sizes.
• Use these values together with the histogram of the attribute
that should be aggregated to compute the aggregate buckets.
cardinality
30
100 groups
with
group size 50
50
tuple
SUM(A) ?
100
20
10
histogram of attribute A
0
10 20 30 40 50 60 70 80 90 100 110
lower bound of SUM(A) =
450 + 500 = 950
14
cardinality
50
tuple
value
upper bound of SUM(A) =
2100 + 1900 + 875 = 4875
histogram
of SUM(A)
950 4875
value
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Unification of Comparison Operators
• Comparison operators compare histograms.
A single operator implementation can be used for different purposes.
No separate join implementation is necessary.
• E.g., the comparison operator ‘=‘ can be used in:
 a predicate comparing two attributes,
 a predicate comparing an attribute and a constant,
 a join condition of an equi-join
 cartesian product followed by a selection that
contains the join-condition as predicate,
 an IN predicate comparing an attribute with a set of values
 join with a table (represented by a histogram) that
contains the values of the value set.
15
C
C1
…
Cn
University
of Stuttgart
as
Related Work
Anwendungssoftware
on Histogram Propagation
• Papers on Approximate Query Answering:
 Yannis E. Ioannidis, Viswanath Poosala:
Histogram-Based Approximation of Set-Valued Query-Answers.
VLDB 1999
 Viswanath Poosala, Venkatesh Ganti, Yannis E. Ioannidis:
Approximate Query Answering using Histograms.
IEEE Data Eng. Bull. 1999
• Transformation of SQL queries on tables into SQL queries on
histograms.
• Join is done by creating two tables under the uniform spread
assumption such that each table represents a value distribution
which fits to the respective input histogram.
• Only equi-joins, no support for predicates that compare two
attributes, no grouping and no support of arithmetic terms.
16
University
of Stuttgart
Experiments
as
Anwendungssoftware
Experimental Setup
• Database: TPC-H benchmark database on IBM DB2 V9.
• Sample sequences:
 Sequence S1 generated by the MicroStrategy DSS tool suite.
 A set of semantically equivalent sequences that result from the
application of the CGO rewrite rules.
 Variants of those sequences resulting from the use of different
values in the filter predicates of the sequence.
S1
S1
Q5
S2
S3
S4
S5
Q4
Q1
17
Q2
Q3
S6
S7
S8
S10
S9
S10
Q7
Q6
University
of Stuttgart
Experiments
as
Anwendungssoftware
Results for Alternative Sequences
execution time (s)
cost estimate (timerons / 1000)
S1
9,455
2,379
2,345
S1
S2
6,862
S2
1,826
1,803
S3
6,859
S3
1,810
1,777
S4
6,858
S4
1,791
1,765
9,375
S5
S6
3,791
2,356
2,344
S5
S6
1,103
1,059
16,815
S7
6,805
S7
S8
6,803
S8
1,763
S9
6,802
S9
1,766
S10
3,735
S10
1,777
16,887
16,887
55,170
1,083
without histogram propagation
with histogram propagation
18
University
of Stuttgart
Experiments
as
Anwendungssoftware
Results for Different Selectivities
9,500
cost estimate (timerons / 1000) for sequence S1
9,475
9,450
9,425
9,400
100,000
2,475
300,000
parameter value
500,000
execution time (s) for sequence S1
2,425
2,375
2,325
2,275
100,000
300,000
parameter value
without histogram propagation
19
500,000
with histogram propagation
University
of Stuttgart
as
Anwendungssoftware
Conclusion & Future Work
• Cost estimates for statement sequences are necessary to avoid
rule applications that lead to performance deterioration.
• Making use of the cost estimates of the underlying DBMS is a
feasible solution.
• Histogram propagation is necessary to get useable cost
estimates for statements that access intermediate-result tables
and to avoid bad plans.
• Future Work:
 A cost-based control strategy.
 Extensive measurements.
20
University
of Stuttgart
as
Anwendungssoftware
Thank you for
your attention!
21
University
of Stuttgart
22
as
Anwendungssoftware
University
of Stuttgart
as
Cost Estimation Approach
Anwendungssoftware
Statistics API
Statistics API
DB2
implementation
Oracle
implementation
SQL Server
implementation
JDBC
DB2
database
23
Oracle
database
SQL Server
database
• is an interface that offers uniform DBMS-independent access to
DBMS statistics, meta data and optimizer estimates
• provides a flexible histogram format that abstracts from
proprietary data structures used in different DBMSs
• implementations exist for IBM DB2, Oracle and MS SQL Server
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Common Subexpressions
• Identify arithmetic terms that appear multiple times in the
different clauses of an SQL query.
• Otherwise, the arithmetic term will be recomputed for each
appearance and modifications of histograms may get lost.
 $1, $2
 $1, $2 = count(*)
INSERT INTO temptable
SELECT
year(o.o_orderdate), count(*)
FROM
orders o
WHERE
year(o.o_orderdate) > 1992 AND
o.o_orderpriority = ‘1-URGENT‘
GROUP BY year(o.o_orderdate)
 o_orderpriority
= ‘1-URGENT‘
 $1 > 1992
 $1 = year(o_orderdate)
24
orders o_orderdate, o_orderpriority
University
of Stuttgart
as
Histogram Propagation
Anwendungssoftware
Normalization of Histograms
• Worst case:
 The number of buckets in the output histogram is the product of
the number of buckets in the input histograms.
 Serializing a histogram may double the number of buckets.
Normalization:
 prior to the enumeration phase and / or after an output
histogram has been produced,
 reduces the number of buckets by merging adjacent buckets,
 trade-off between complexity / performance and quality.
Merge
10 buckets  4 buckets
25
0 10 20 30 40 50
0
10 20 30 40 50