R - METU Computer Engineering

Download Report

Transcript R - METU Computer Engineering

Query Optimization
SPRING 2006
CENG 352 Database Management Systems
1
Query Evaluation
• Problem: An SQL query is declarative – does
not specify a query execution plan.
• A relational algebra expression is procedural
– there is an associated query execution plan.
• Solution: Convert SQL query to an equivalent
relational algebra and evaluate it using the
associated query execution plan.
– But which equivalent expression is best?
SPRING 2006
CENG 352 Database Management Systems
2
Naive Conversion
SELECT DISTINCT TargetList
FROM
R1, R2, …, RN
WHERE Condition
is equivalent to
TargetList (Condition (R1  R2  ...  RN))
but this may imply a very inefficient query execution plan.
Example: Name (Id=ProfId CrsCode=‘CS532’ (Professor  Teaching))
• Result can be < 100 bytes
• But if each relation is 50K then we end up computing an
intermediate result Professor  Teaching of size 1G
before shrinking it down to just a few bytes.
Problem: Find an equivalent relational algebra expression that can be
evaluated “efficiently”.
SPRING 2006
CENG 352 Database Management Systems
3
Query Processing Architecture
SPRING 2006
CENG 352 Database Management Systems
4
Query Optimizer
•
Uses heuristic algorithms to evaluate relational
algebra expressions. This involves:
1. Enumerating alternative plans (Typically a subset of all
possible plans).
•
Needs equivalence rules
2. Estimating the cost of each enumerated plan and
choosing the plan with the lowest estimated cost.
•
Query optimizers actually do not “optimize” – just
try to find “reasonably good” evaluation strategies
SPRING 2006
CENG 352 Database Management Systems
5
Selection and Projection Rules
• Break complex selection into simpler ones:
– Cond1Cond2 (R)  Cond1 (Cond2 (R) )
• Break projection into stages:
– attr (R)   attr ( attr (R)), if attr  attr
• Commute projection and selection:
–  attr (Cond(R))  Cond ( attr (R)),
if attr  all attributes in Cond
SPRING 2006
CENG 352 Database Management Systems
6
Commutativity and Associativity of Join
(and Cartesian Product as Special Case)
• Join commutativity: R
S  S
R
– used to reduce cost of nested loop evaluation strategies (smaller relation
should be in outer loop)
• Join associativity: R
(S
T)  (R
S)
T
– used to reduce the size of intermediate relations in computation of multirelational join – first compute the join that yields smaller intermediate
result
• N-way join has T(N) N! different evaluation plans
– T(N) is the number of different binary trees with N leaf nodes
– N! is the number of permutations
• Query optimizer cannot look at all plans (might take longer to
find an optimal plan than to compute query brute-force). Hence it
does not necessarily produce optimal plan
SPRING 2006
CENG 352 Database Management Systems
7
Pushing Selections and Projections
• Cond (R  S)  R
Cond
S
– Cond relates attributes of both R and S
– Reduces size of intermediate relation since rows can be
discarded sooner
• Cond (R  S)  Cond (R)  S
– Cond involves only the attributes of R
– Reduces size of intermediate relation since rows of R
are discarded sooner
• attr(R  S)  attr(attr (R)  S),
if attributes(R)  attr  attr
– reduces the size of an operand of product
SPRING 2006
CENG 352 Database Management Systems
8
Equivalence Example
• C1 C2 C3 (R  S)  C1 (C2 (C3 (R  S) ) )
 C1 (C2 (R)  C3 (S) )
 C2 (R)
C1 C3 (S)
assuming C2 involves only attributes of R,
C3 involves only attributes of S,
and C1 relates attributes of R and S
SPRING 2006
CENG 352 Database Management Systems
9
Estimating Cost - Example 1
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId
-- join condition
AND P. DeptId = ‘CS’ AND T.Semester = ‘F1994’
 Name(DeptId=‘CS’  Semester=‘F1994’(Professor
 Name
Master query
execution plan
(nothing pushed)
Id=ProfId
Teaching))
DeptId=‘CS’ Semester=‘F1994’
Id=ProfId
Professor
SPRING 2006
Teaching
CENG 352 Database Management Systems
10
Metadata on Tables (in system catalog)
– Professor (Id, Name, DeptId)
• size: 200 pages, 1000 rows, 50 departments
• indexes: clustered, 2-level B+tree on DeptId, hash on Id
– Teaching (ProfId, CrsCode, Semester)
• size: 1000 pages, 10,000 rows, 4 semesters
• indexes: clustered, 2-level B+tree on Semester;
hash on ProfId
– Definition: Weight of an attribute – average number
of rows that have a particular value
• weight of Id = 1 (it is a key)
• weight of ProfId = 10 (10,000 classes/1000 professors)
SPRING 2006
CENG 352 Database Management Systems
11
Estimating Cost - Example 1
• Join - block-nested loops with 52 page buffer (50
pages – input for Professor, 1 page – input for
Teaching, 1 – output page
– Scanning Professor (outer loop): 200 page transfers,
(4 iterations, 50 transfers each)
– Finding matching rows in Teaching (inner loop):
1000 page transfers for each iteration of outer loop
– Total cost = 200+4*1000 = 4200 page transfers
SPRING 2006
CENG 352 Database Management Systems
12
Estimating Cost - Example 1
(cont’d)
• Selection and projection – scan rows of
intermediate file, discard those that don’t satisfy
selection, project on those that do, write result
when output buffer is full.
• Complete algorithm:
– do join, write result to intermediate file on disk
– read intermediate file, do select/project, write final
result
– Problem: unnecessary I/O
SPRING 2006
CENG 352 Database Management Systems
13
Pipelining
• Solution: use pipelining:
– join and select/project act as coroutines, operate as
producer/consumer sharing a buffer in main memory.
• When join fills buffer; select/project filters it and outputs
result
• Process is repeated until select/project has processed last
output from join
– Performing select/project adds no additional cost
Intermediate
result
join
SPRING 2006
buffer
select/project
CENG 352 Database Management Systems
output
final result
14
Estimating Cost - Example 1
(cont’d)
• Total cost:
4200 + (cost of outputting final result)
– We will disregard the cost of outputting final
result in comparing with other query evaluation
strategies, since this will be same for all
SPRING 2006
CENG 352 Database Management Systems
15
The resulting Query Execution Plan 1
 Name
(On-the-fly)
DeptId=‘CS’ Semester=‘F1994’
Id=ProfId
Professor
Pipelining
(On-the-fly)
(Block Nested Loops)
Teaching
– Note: different join algorithms (index-nested loop, sort-merge etc.)
could have been used for the same tree to get different query execution
plans with possibly different costs.
SPRING 2006
CENG 352 Database Management Systems
16
Cost Example 2
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId AND
P. DeptId = ‘CS’ AND T.Semester = ‘F1994’
Name(Semester=‘F1994’ (DeptId=‘CS’ (Professor)
 Name
Partially pushed
plan: selection
pushed to Professor
Teaching))
Semester=‘F1994’
DeptId=‘CS’
SPRING 2006
Id=ProfId
Professor
Id=ProfId
Teaching
CENG 352 Database Management Systems
17
Cost Example 2 -- selection
• Compute DeptId=‘CS’ (Professor) (to reduce size of one
join table) using clustered, 2-level B+ tree on DeptId.
– 50 departments and 1000 professors; hence weight
of DeptId is 20 (roughly 20 CS professors). These
rows are in ~4 consecutive pages in Professor.
• Cost = 4 (to get rows) + 2 (to search index) = 6
• keep resulting 4 pages in memory and pipe to next step
clustered index
on DeptId
SPRING 2006
CENG 352 Database Management Systems
rows of
Professor
18
Cost Example 2 -- join
• Index-nested loops join using hash index on
ProfId of Teaching and looping on the selected
professors (computed on previous slide)
– Since selection on Semester was not pushed, hash
index on ProfId of Teaching can be used
– Note: if selection on Semester were pushed, the
index on ProfId would have been lost – an
advantage of not using a fully pushed query
execution plan
SPRING 2006
CENG 352 Database Management Systems
19
Cost Example 2 – join (cont’d)
– Each professor matches ~10 Teaching rows. Since 20 CS
professors, hence 200 teaching records.
– All index entries for a particular ProfId are in same bucket.
Assume ~1.2 I/Os to get a bucket.
• Cost = 1.2  20 (to fetch index entries for 20 CS
professors) + 200 (to fetch Teaching rows, since hash
index is unclustered) = 224
1.2
ProfId
SPRING 2006
20  10
Teaching
hash
CENG 352 Database Management Systems
20
Cost Example 2 – select/project
• Pipe result of join to select (on Semester) and
project (on Name) at no I/O cost
• Cost of output same as for Example 1
• Total cost:
6 (select on Professor) + 224 (join) = 230
• Comparison:
4200 (example 1) vs. 230 (example 2) !!!
SPRING 2006
CENG 352 Database Management Systems
21
Resulting Query Execution Plan 2
 Name
on the fly
Semester=‘F1994’
use clustered
B+tree index
DeptId=‘CS’
Professor
Id=ProfId
on the fly
indexed
nested loop
Teaching
Note : Other possibilities are block nested-loops, sort-merge.
SPRING 2006
CENG 352 Database Management Systems
22
Other Possible Query Trees
 Name
 Name
DeptId=‘CS’
Id=ProfId
Id=ProfId
DeptId=‘CS’

Semester=‘F1994’
Semester=‘F1994’
Professor
Teaching
Professor
SPRING 2006
CENG 352 Database Management Systems
Teaching
23
Estimating Output Size
• It is important to estimate the size of the output of a
relational expression – size serves as input to the next
stage and affects the choice of how the next stage will
be evaluated.
• Size estimation uses the following measures on a
particular instance of R:
–
–
–
–
–
Tuples(R): number of tuples
Blocks(R): number of blocks
Values(R.A): number of distinct values of A
MaxVal(R.A): maximum value of A
MinVal(R.A): minimum value of A
SPRING 2006
CENG 352 Database Management Systems
24
Reduction Factor
• Consider a query block:
SELECT attribute list
FROM relation list
WHERE term1  term2 ...  termn
• Every term in WHERE eliminates some of the result
tuples.
• A reduction factor is associated with each term.
• The number of tuples in the result is estimated as the
maximum size (i.e. product of cardinalities) times the
product of reduction factors.
SPRING 2006
CENG 352 Database Management
Systems
25
Computing reduction factors
• reduction (Ri.A=val) =
1
Values(Ri.A)
– if no index and no statistics about the distinct values: 1/10
(arbitrarily).
• reduction (Ri.A > val) =
MaxVal(Ri.A) – val
MaxVal(Ri.A) – MinVal(Ri.A)
– No index, not of arithmetic type: 1/2 (arbitrarily).
SPRING 2006
CENG 352 Database Management Systems
26
Reduction factors
• reduction (Ri.A=Rj.B) =
1
max(Values(Ri.A), Values(Rj.B))
– if only one of the columns have an index:
1
Values(R.A)
– if no index: 1/10 arbitrarily
SPRING 2006
CENG 352 Database Management Systems
27
Reduction Due to Complex Conditions
• reduction(Cond1 AND Cond2) =
reduction(Cond1)  reduction(Cond2)
• reduction(Cond1 OR Cond2) =
min(1, reduction(Cond1) + reduction(Cond2))
SPRING 2006
CENG 352 Database Management Systems
28
Reduction Due to TargetList
• reduction(TargetList) =
number-of-attributes (TargetList)
i number-of-attributes (Ri)
SPRING 2006
CENG 352 Database Management Systems
29
Choosing Query Execution Plan
• Step 1: Choose a logical plan
• Step 2: Reduce search space
• Step 3: Use a heuristic search to further
reduce complexity
SPRING 2006
CENG 352 Database Management Systems
30
Step 1: Choosing a Logical Plan
• Involves choosing a query tree, which indicates the
order in which algebraic operations are applied
• Heuristic: Pushed trees are good, but sometimes “nearly
fully pushed” trees are better due to indexing (as we
saw in the example)
• So: Take the initial “master plan” tree and produce a
fully pushed tree plus several nearly fully pushed trees.
SPRING 2006
CENG 352 Database Management Systems
31
Step 2: Reduce Search Space
• Deal with associativity of binary operators
(join, union, …)
A
B
C
Logical query
execution plan
SPRING 2006
D
D
A
B
C
Equivalent
query tree
C
D
A
B
Equivalent left
deep query tree
CENG 352 Database Management Systems
32
Step 2 (cont’d)
• Two issues:
– Choose a particular shape of a tree (like in the
previous slide)
• Equals the number of ways to parenthesize N-way
join – grows very rapidly
– Choose a particular permutation of the leaves
• E.g., 4! permutations of the leaves A, B, C, D
SPRING 2006
CENG 352 Database Management Systems
33
Step 2: Dealing With Associativity
• Too many trees to evaluate: settle on a particular
shape: left-deep tree.
– Used because it allows pipelining:
P1
P2
A
B
X
X
C
P3
Y
Y
D
– Property: once a row of X has been output by P1, it need not
be output again (but C may have to be processed several
times in P2 for successive portions of X)
– Advantage: none of the intermediate relations (X, Y) have to
be completely materialized and saved on disk.
• Important if one such relation is very large, but the final result is
small
SPRING 2006
CENG 352 Database Management Systems
34
Step 3: Heuristic Search
• The choice of left-deep trees still leaves open
too many options (N! permutations):
– (((A
– (((C
B)
A)
C)
D)
D),
B), …..
• A heuristic (often dynamic programming
based) algorithm is used to get a ‘good’ plan
SPRING 2006
CENG 352 Database Management Systems
35
Step 3: Dynamic Programming
Algorithm
• To compute a join of E1, E2, …, EN in a leftdeep manner:
– Start with 1-relation expressions (can involve , )
– Choose the best and “nearly best” plans for each (a
plan is considered nearly best if its output has some
“interesting” form, e.g., is sorted)
– Combine these 1-relation plans into 2-relation
expressions. Retain only the best and nearly best 2relation plans
– Do same for 3-relation expressions, etc.
SPRING 2006
CENG 352 Database Management Systems
36
Enumeration of Plans
• ORDER BY, GROUP BY, aggregates etc. handled as a final
step, using either an `interestingly ordered’ plan or an
additonal sorting operator.
• An N-1 way plan is not combined with an additional
relation unless there is a join condition between them,
unless all predicates in WHERE have been used up.
–
i.e., avoid Cartesian products if possible.
• In spite of pruning plan space, this approach is still
exponential in the # of tables.
SPRING 2006
CENG 352 Database Management
Systems
37
Example 1
• Pass1:
–
Sailors:
B+ tree on rating
Hash on sid
Reserves:
B+ tree on bid
Sailors: B+ tree matches rating>5, and is probably
cheapest. However, if this selection is expected to retrieve
a lot of tuples, and index is unclustered, file scan may be
cheaper.
sname
sid=sid
bid=100 rating > 5
• Still, B+ tree plan kept (because tuples are in rating order).
–

Reserves: B+ tree on bid matches bid=100; cheapest.
Reserves Sailors
Pass 2:
– We consider each plan retained from Pass 1 as the outer, and
consider how to join it with the (only) other relation.

e.g., Reserves as outer: Hash index can be used to get Sailors tuples
that satisfy sid = outer tuple’s sid value.
SPRING 2006
CENG 352 Database Management
Systems
38
Example 2
sid,count(*) as numres
SELECT S.sid, COUNT(*) As numres
FROM Boats B, Reserves R, Sailors S
WHERE R.sid = S.sid and B.bid = R.bid
and B.color = ‘red
GROUP BY S.sid
GROUP BYsid
sid=sid
• Available Indexes :
Sailors:
B+ tree on sid
Hash on sid
Reserves:
B+ tree on sid
B+ tree on bid (clustered)
Boats:
B+ tree on color
Hash on color
SPRING 2006
CENG 352 Database Management
Systems
Sailors
bid=bid
color = red
Reserves
Boats
39
• Pass 1: Find best plans for each file.
– Sailors: File scan
– Reserves: File scan
– Boats: hash index on color, B+ tree index on color
• Pass 2: Possible joins are considered.
–
–
–
–
–
–
–
–
Reserves Boats
Reserves Sailors
Sailors Boats
Sailors
Reserves
Boats (access B+ tree)
Sailors
Boats (access B+ tree)
Reserves
Boats (access hash)
Sailors
Boats (access hash)
Reserves
SPRING 2006
CENG 352 Database Management
Systems
•For each such pair,
consider every join
method and for each join
method consider every
access path for the inner
relation.
•For each pair keep the
cheapest of the plans &
the cheapest plan that
generates tuples in sorted
order
40
• Pass 3: For each plan retained in pass 2, taken as outer
relation, consider how to join remaining relation as inner
one.
 Example plan for this pass:
Boats (via hash)
Reserves (via B+tree) (sort-merge)
Result is joined with Sailors (via B+ tree) (sort merge)
• GROUP BY is considered after all joins. It requires sorting
on sid.
SPRING 2006
CENG 352 Database Management
Systems
41
Translating Complex SQL Queries into R.A.
• As an example consider the following SQL query:
SELECT sname, age
FROM Sailors
WHERE rating > SELECT max(rating)
FROM Sailors
WHERE age = 30
• Decompose into 2 blocks:
SELECT max(rating)
FROM Sailors
WHERE age = 30
Max(rating (age = 30 (Sailors))
SPRING 2006
SELECT sname, age
FROM Sailors
WHERE rating > constant
sname,age (rating>constant (Sailors))
CENG 352 Database Management
Systems
42
Query Blocks: Units of Optimization
• An SQL query is parsed into a
collection of query blocks, and these are
optimized one block at a time.
• Nested blocks are usually treated as
calls to a subroutine, made once per
outer tuple.
SELECT S.sname
FROM Sailors S
WHERE S.age IN
(SELECT MAX (S2.age)
FROM Sailors S2
GROUP BY S2.rating)
Outer block

Nested block
For each block, the plans considered are:
– All available access methods, for each reln in FROM clause.
– All left-deep join trees (i.e., all ways to join the relations one-ata-time, with the inner reln in the FROM clause, considering all reln
permutations and join methods.)
SPRING 2006
CENG 352 Database Management
Systems
43
Nested Queries
• Nested block is optimized
independently, with the outer tuple
considered as providing a selection
condition.
• Outer block is optimized with the cost
of `calling’ nested block computation
taken into account.
• Implicit ordering of these blocks
means that some good strategies are
not considered. The non-nested
version of the query is typically
optimized better.
SPRING 2006
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid=S.sid)
Nested block to optimize:
SELECT *
FROM Reserves R
WHERE R.bid=103
AND S.sid= outer value
Equivalent non-nested query:
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid
CENG 352 Database Management
44
AND R.bid=103
Systems
Summary
• Query optimization is an important task in a relational
DBMS.
• Must understand optimization in order to understand
the performance impact of a given database design
(relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
–
Consider a set of alternative plans.
• Must prune search space; typically, left-deep plans only.
–
Must estimate cost of each plan that is considered.
• Must estimate size of result and cost for each plan node.
• Key issues: Statistics, indexes, operator implementations.
SPRING 2006
CENG 352 Database Management
Systems
45
Summary (Contd.)
• Single-relation queries:
–
–
All access paths considered, cheapest is chosen.
Issues: Selections that match index, whether index key has all
needed fields and/or provides tuples in a desired order.
• Multiple-relation queries:
–
All single-relation plans are first enumerated.
• Selections/projections considered as early as possible.
Next, for each 1-relation plan, all ways of joining another
relation (as inner) are considered.
– Next, for each 2-relation plan that is `retained’, all ways of
joining another relation (as inner) are considered, etc.
– At each level, for each subset of relations, only best plan for
each
interesting order
is `retained’.
SPRING
2006
CENGof
352tuples
Database Management
46
–
Systems