Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans
Kabra and DeWitt
presented by Zack Ives
CSE 590DB, May 11, 1998
The Problem
Query execution plans are formulated based on the estimated cost of operations, which in turn depends on estimates of table size and cardinality
These estimates may be highly inaccurate,
especially for user-defined types or predicates
The errors become multiplicative as the
number of joins increases
With greater knowledge, we might have chosen a plan closer to the optimal one
The Solution
We shall monitor how the query is doing at key
points, and consider dynamically re-optimizing
those portions of the query which have not yet
been started
Since re-optimization is expensive, we shall
only do it if we think we will see an
improvement
Elements of the Algorithm
Annotated Query Execution Plans
Annotate plan with estimates of size
Runtime Collection of Statistics
Statistics collectors embedded in execution tree
Keep overhead down
Dynamic Resource Re-allocation
Reallocate memory to individual operations
Query Plan Modification
May wish to re-optimize the remainder of query
Annotated Query Plans
At each point in the tree, we save the expected:
Sizes and cardinalities
Selectivities of predicates
Estimates of number of groups to be aggregated
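A minimal sketch of what such an annotated plan node might look like (hypothetical Python, not Paradise's actual data structures):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class PlanAnnotation:
        est_size_bytes: float                    # expected output size
        est_cardinality: float                   # expected number of tuples
        est_selectivity: Optional[float] = None  # for predicates
        est_num_groups: Optional[float] = None   # for aggregations

    @dataclass
    class PlanNode:
        operator: str                                   # e.g. "scan", "hash-join", "aggregate"
        children: List["PlanNode"] = field(default_factory=list)
        annotation: Optional[PlanAnnotation] = None
        observed_cardinality: Optional[float] = None    # filled in at run time by a collector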
Statistics Collectors
Add into tree
Must be collectable in a single pass
Will only help with portions of the query “beyond” the current pipeline
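A rough illustration of a collector as a pass-through iterator (the names and the size proxy are assumptions, not the paper's implementation):

    class StatCollector:
        # Pass-through operator: forwards tuples unchanged while recording
        # cardinality and size in a single pass over its input.
        def __init__(self, child_iter, on_done):
            self.child_iter = child_iter
            self.on_done = on_done          # callback receiving (rows, total_bytes)
            self.rows = 0
            self.total_bytes = 0

        def __iter__(self):
            for tup in self.child_iter:
                self.rows += 1
                self.total_bytes += len(str(tup))   # crude size proxy for illustration
                yield tup
            # The statistic is complete only once the input is exhausted, so it
            # can only help operators "beyond" the current pipeline.
            self.on_done(self.rows, self.total_bytes)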
Resource Re-Allocation
Based on improved estimates, we can modify the memory allocated to each operation
Results: less I/O, better performance
Only for operations that have not yet begun executing
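A hedged sketch of one possible reallocation policy, splitting a memory budget in proportion to the updated size estimates (the proportional policy is an assumption):

    def reallocate_memory(pending_ops, total_memory_mb):
        # pending_ops: {operator_name: updated_input_size_estimate_mb} for
        # operators that have NOT yet begun executing.
        if not pending_ops:
            return {}
        total = sum(pending_ops.values())
        if total == 0:
            even = total_memory_mb / len(pending_ops)
            return {name: even for name in pending_ops}
        return {name: total_memory_mb * size / total
                for name, size in pending_ops.items()}

    # e.g. reallocate_memory({"hash-join": 400.0, "sort": 100.0}, 64.0)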
Plan Modification
Suspend query, save intermediate result in a temp file
Create new plan for the remainder, treating the temp as an input
Only re-optimize the part not yet begun
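A simplified sketch of the modification step; the executor/optimizer objects and their methods are hypothetical stand-ins for the paper's machinery:

    def maybe_modify_plan(executor, optimizer, remainder_plan, finished_subtree):
        # Re-optimize only the part of the query that has not yet begun, treating
        # the finished subtree's result (a temp file) as an input table.
        candidate = optimizer.optimize(remainder_plan, treat_as_input=finished_subtree)

        # Modify the plan only if the new estimate, including the cost of writing
        # the temp file, beats staying with the current remainder.
        if candidate.estimated_cost + finished_subtree.temp_write_cost < remainder_plan.estimated_cost:
            executor.suspend()
            executor.materialize_to_temp(finished_subtree)   # save intermediate in temp file
            return executor.resume(candidate)
        return executor.resume(remainder_plan)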
Re-Optimization
When to re-optimize:
Calculate the time the current plan should take (using gathered stats)
Only consider re-optimization if:
Our original estimate was off by at least some factor ε₂, and
T_opt,estimated < ε₁ · T_cur-plan,improved, where ε₁ ≈ 5% and the cost of optimization depends on the number of operators, esp. joins
Only modify the plan if the new estimate, including the cost of
writing the temp file, is better
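A small sketch of this trigger test; ε₁ defaults to 5% as above, while the value used for ε₂ here is only an assumed placeholder:

    def should_reoptimize(est_card, observed_card,
                          t_opt_estimated, t_cur_plan_improved,
                          eps1=0.05, eps2=2.0):
        # Consider re-optimization only if (a) the original estimate was off by
        # at least a factor eps2 (2.0 is an assumed value), and (b) the estimated
        # optimizer run time is below eps1 (~5%) of the improved estimate of the
        # remaining current-plan cost.
        ratio = max(observed_card, est_card) / max(min(observed_card, est_card), 1e-9)
        return ratio >= eps2 and t_opt_estimated < eps1 * t_cur_plan_improved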
Low-Overhead Statistics
Want to find “most effective” statistics
Don’t want to gather statistics for “simple” queries
Want to limit the effect of the algorithm to a maximum overhead ratio
Factors:
Probability of inaccuracy
Fraction of query affected
Inaccuracy Potentials
The following heuristics are used:
Inaccuracy potential = low, medium, high
Lower if we have more information on table value distribution
1 + max of inputs for multiple-input selection
Always high for user-defined methods
Always high for non-equijoins
For most other operators, same as worst of inputs
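One way these heuristics could be expressed (the operator flags are assumed; only the case analysis follows the bullets above):

    LOW, MEDIUM, HIGH = 0, 1, 2   # inaccuracy potential levels

    def inaccuracy_potential(op, input_potentials):
        # op exposes a few assumed boolean flags; input_potentials lists the
        # potentials of the operator's inputs.
        if op.uses_user_defined_method or op.is_non_equijoin:
            return HIGH                                                # always high
        if op.is_multi_input_selection:
            return min(HIGH, 1 + max(input_potentials, default=LOW))   # 1 + max of inputs, capped
        if op.is_selection and op.has_value_distribution_stats:
            return LOW                                                 # more info => lower potential
        return max(input_potentials, default=LOW)                      # same as worst of inputs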
More Heuristics
Check fraction of query affected
Check how many other operators use the same statistic
The winner:
Higher inaccuracy potentials first
Then, if a tie, the one affecting the larger portion of the plan
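A sketch of this selection rule under an assumed candidate representation: rank by inaccuracy potential, break ties by fraction of the plan affected, and stop adding collectors once the maximum overhead ratio would be exceeded:

    def choose_collectors(candidates, max_overhead_ratio, est_query_cost):
        # candidates: list of dicts with keys 'potential' (0..2),
        # 'fraction_affected' (0..1) and 'overhead_cost' (same units as est_query_cost).
        ranked = sorted(candidates,
                        key=lambda c: (c["potential"], c["fraction_affected"]),
                        reverse=True)
        chosen, spent = [], 0.0
        for c in ranked:
            if (spent + c["overhead_cost"]) / est_query_cost <= max_overhead_ratio:
                chosen.append(c)
                spent += c["overhead_cost"]
        return chosen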
Implementation
On top of Paradise (parallel database that supports ADTs, built on an OO framework)
Using System-R optimizer
New SCIA (Stat Collector Insertion Algorithm)
and Dynamic Re-Optimization modules
It Works!
Results are 5% worse for simple queries, much better for complex queries
Of course, we would not really collect statistics on simple queries
Data skew made a slight difference: both normal and re-optimized queries performed slightly better