Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans
Kabra and DeWitt
presented by Zack Ives
CSE 590DB, May 11, 1998
The Problem
Query execution plans are formulated based on the estimated costs of operations, which in turn depend on estimates of table size and cardinality
These estimates may be highly inaccurate, especially for user-defined types or predicates
The errors become multiplicative as the number of joins increases (for example, if each of four join selectivity estimates is off by a factor of 3, the final cardinality estimate can be off by roughly 3^4 = 81)
With better knowledge, we might have chosen a more nearly optimal plan
The Solution
We shall monitor how the query is doing at key points, and consider dynamically re-optimizing those portions of the query that have not yet been started
Since re-optimization is expensive, we shall only do it if we expect to see an improvement
Elements of the Algorithm
Annotated Query Execution Plans
  Annotate the plan with estimates of size
Runtime Collection of Statistics
  Statistics collectors embedded in the execution tree
  Keep overhead down
Dynamic Resource Re-allocation
  Reallocate memory to individual operations
Query Plan Modification
  May wish to re-optimize the remainder of the query
Annotated Query Plans
At each point in the tree we save the expected:
  Sizes and cardinalities
  Selectivities of predicates
  Estimates of the number of groups to be aggregated
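A minimal sketch of how such an annotated plan node might be represented (hypothetical Python; the field names are illustrative, not taken from the paper or from Paradise):

    # Hypothetical annotated plan node; field names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class PlanNode:
        operator: str                              # e.g. "scan", "hash-join", "aggregate"
        children: List["PlanNode"] = field(default_factory=list)
        # Optimizer-time annotations saved with the plan:
        est_cardinality: Optional[int] = None      # expected number of tuples
        est_size_bytes: Optional[int] = None       # expected output size
        est_selectivity: Optional[float] = None    # for predicates
        est_num_groups: Optional[int] = None       # for aggregates
        # Filled in at run time if a statistics collector is inserted here:
        observed_cardinality: Optional[int] = None
        observed_size_bytes: Optional[int] = None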
Statistics Collectors
Added into the tree
Must be collectable in a single pass
Will only help with portions of the query "beyond" the current pipeline
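A minimal sketch of a collector as a pass-through operator that observes tuples in a single pass (hypothetical Python, building on the PlanNode sketch above; not Paradise's actual operator interface):

    # Hypothetical pass-through statistics collector.
    def collect_stats(child_iter, node):
        count = 0
        size = 0
        for tup in child_iter:
            count += 1
            size += len(str(tup))   # crude stand-in for the real tuple size
            yield tup               # pass the tuple through unchanged
        # The totals are known only after the input is exhausted, which is
        # why collectors only help the parts of the plan "beyond" the
        # pipeline that is currently running.
        node.observed_cardinality = count
        node.observed_size_bytes = size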
Resource Re-Allocation
Based on improved estimates, we can modify the memory allocated to each operation
Results: less I/O, better performance
Only for operations that have not yet begun executing
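One plausible (hypothetical) re-allocation policy: split a fixed memory budget among the not-yet-started operators in proportion to their improved input-size estimates; the paper's actual policy may differ.

    # Hypothetical sketch: redistribute a memory budget among operators that
    # have not started yet, in proportion to their improved input sizes.
    def reallocate_memory(pending_ops, total_budget_bytes):
        """pending_ops: list of (operator_name, improved_input_size_bytes)."""
        if not pending_ops:
            return {}
        total_size = sum(size for _, size in pending_ops)
        if total_size == 0:
            share = total_budget_bytes // len(pending_ops)
            return {name: share for name, _ in pending_ops}
        return {name: int(total_budget_bytes * size / total_size)
                for name, size in pending_ops}

    # Example: a hash join whose build input turned out much larger than
    # estimated now gets a larger share, reducing its I/O.
    print(reallocate_memory([("hash-join", 800_000), ("sort", 200_000)],
                            64 * 1024 * 1024))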
Plan Modification
 Create new plan for
remainder, treating temp as
an input
 Only re-optimize part not
begun
 Suspend query, save
intermediate in temp file
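A rough sketch of the modification step (hypothetical; optimizer.optimize and its extra_inputs argument are stand-ins, not the paper's or Paradise's actual API):

    import pickle
    import tempfile

    # Hypothetical sketch of suspending the query, materializing the finished
    # intermediate result, and re-planning only the remainder.
    def modify_plan(optimizer, remaining_query, intermediate_tuples):
        # 1. Save the completed intermediate result in a temp file.
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".tmp")
        pickle.dump(list(intermediate_tuples), tmp)
        tmp.close()
        # 2. Re-optimize only the part of the query that has not begun,
        #    treating the temp file as an ordinary input.
        new_plan = optimizer.optimize(remaining_query,
                                      extra_inputs={"temp": tmp.name})
        return new_plan, tmp.name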
Re-Optimization
When to re-optimize:
  Calculate the time the current plan should take (using the gathered stats)
  Only consider re-optimization if:
    Our original estimate was off by at least some factor δ2, and
    T_opt,estimated < δ1 · T_cur-plan,improved, where δ1 ≈ 5% and the cost of optimization depends on the number of operators, especially joins
  Only modify the plan if the new estimate, including the cost of writing the temp file, is better
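The test above could be coded along these lines (a hedged sketch: δ1 ≈ 5% comes from the slide, while the default δ2 and the symmetric way of measuring "off by a factor" are assumptions):

    # Hypothetical sketch of the "when to re-optimize" test.
    def should_reoptimize(est_cost_original, est_cost_improved,
                          est_cost_reoptimization,
                          delta1=0.05, delta2=2.0):
        # (1) The original estimate must be off by at least a factor delta2.
        hi = max(est_cost_original, est_cost_improved)
        lo = max(min(est_cost_original, est_cost_improved), 1e-9)
        if hi / lo < delta2:
            return False
        # (2) Re-optimization must be cheap relative to the improved estimate
        #     of finishing the current plan: T_opt < delta1 * T_cur.
        return est_cost_reoptimization < delta1 * est_cost_improved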
Low-Overhead Statistics
Want to find the "most effective" statistics
Don't want to gather statistics for "simple" queries
Want to limit the effect of the algorithm to a maximum overhead ratio
Factors:
  Probability of inaccuracy
  Fraction of the query affected
Inaccuracy Potentials
The following heuristics are used:
  Inaccuracy potential = low, medium, or high
  Lower if we have more information on the table's value distribution
  1 + max of the inputs for a multiple-input selection
  Always high for user-defined methods
  Always high for non-equijoins
  For most other operators, same as the worst of the inputs
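One way to encode these heuristics (a hypothetical sketch; the numeric levels and the exact rule ordering are illustrative):

    # Hypothetical encoding of the inaccuracy-potential heuristics.
    LOW, MEDIUM, HIGH = 0, 1, 2

    def inaccuracy_potential(op, input_potentials,
                             has_distribution_stats=False,
                             user_defined=False, equijoin=True):
        if user_defined:                     # user-defined methods: always high
            return HIGH
        if op == "join" and not equijoin:    # non-equijoins: always high
            return HIGH
        if op == "select" and len(input_potentials) > 1:
            # multiple-input selection: 1 + max of the inputs (capped at high)
            return min(HIGH, 1 + max(input_potentials))
        base = max(input_potentials) if input_potentials else MEDIUM
        if has_distribution_stats:           # distribution info lowers it
            base = max(LOW, base - 1)
        return base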
More Heuristics
Check the fraction of the query affected:
  Check how many other operators use the same statistic
The winner:
  Higher inaccuracy potentials first
  Then, if there is a tie, the one affecting the larger portion of the plan
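Putting the selection rule together (hypothetical sketch; the candidate tuple layout and the overhead accounting are assumptions, with the cost cap standing in for the maximum overhead ratio):

    # Hypothetical sketch of picking which statistics collectors to insert.
    # Each candidate is (name, inaccuracy_potential, fraction_affected, cost).
    def choose_collectors(candidates, max_overhead_ratio=0.05):
        # Higher inaccuracy potential wins; ties go to the candidate that
        # affects the larger fraction of the plan.
        ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
        chosen, overhead = [], 0.0
        for name, potential, fraction, cost in ranked:
            if overhead + cost <= max_overhead_ratio:
                chosen.append(name)
                overhead += cost
        return chosen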
Implementation
On top of Paradise (a parallel database that supports ADTs, built on an OO framework)
Using a System-R optimizer
New SCIA (Stat Collector Insertion Algorithm) and Dynamic Re-Optimization modules
It Works!
Results are 5% worse for simple queries, much better for complex queries
Of course, we would not really collect statistics on simple queries
Data skew made a slight difference: both normal and re-optimized queries performed slightly better