Notes on query optimization
QUERY PROCESSING AND OPTIMIZATION
Overview
SQL is a declarative language:
Specifies the outcome of the computation without defining any flow of control
Requires the DBMS to select an execution plan
Allows optimizations
Sample query
SELECT C, E
FROM R, S
WHERE R.B = "z" AND S.F = 30 AND R.A = S.D
The two tables

R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400

S:
D  E  F
1  u  10
3  v  30
5  w  30
First execution plan
Can use relational algebra to express an execution plan
Could be:
Cartesian product: R×S
Selection: σR.B="z" ∧ S.F=30 ∧ R.A=S.D (R×S)
Projection: πC,E (σR.B="z" ∧ S.F=30 ∧ R.A=S.D (R×S))
Graphical representation
πC,E
σR.B="z" ∧ S.F=30 ∧ R.A=S.D
R×S
R×S:
A  B  C    D  E  F
1  x  100  1  u  10
1  x  100  3  v  30
1  x  100  5  w  30
2  y  200  1  u  10
2  y  200  3  v  30
2  y  200  5  w  30
3  z  300  1  u  10
3  z  300  3  v  30
3  z  300  5  w  30
4  y  400  1  u  10
4  y  400  3  v  30
4  y  400  5  w  30
Second execution plan
Selections: σB="z" (R) and σF=30 (S)
Join: σB="z" (R) ⋈R.A=S.D σF=30 (S)
Projection: πC,E (…)
The two tables

R:
A  B  C
1  x  100
2  y  200
3  z  300
4  y  400

S:
D  E  F
1  u  10
3  v  30
5  w  30
After the selections

σB="z" (R):
A  B  C
3  z  300

σF=30 (S):
D  E  F
3  v  30
5  w  30

σB="z" (R) ⋈R.A=S.D σF=30 (S):
A  B  C    D  E  F
3  z  300  3  v  30
Discussion
Second plan
Extracts the relevant rows of tables R and S first
Uses a more efficient join:
for each row in σB = "z" (R):
    for each row in σF = 30 (S):
        if R.A = S.D:
            include rows in result
Note that the inner loop searches the smaller temporary table (σF = 30 (S))
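The second plan can be sketched in a few lines of Python on the sample tables (a minimal illustration; the tuple layout and variable names are mine, not part of the slides):

```python
# Second plan on the sample tables R(A, B, C) and S(D, E, F).
R = [(1, 'x', 100), (2, 'y', 200), (3, 'z', 300), (4, 'y', 400)]
S = [(1, 'u', 10), (3, 'v', 30), (5, 'w', 30)]

# Push the selections down first.
r_sel = [r for r in R if r[1] == 'z']   # sigma B = 'z' (R)
s_sel = [s for s in S if s[2] == 30]    # sigma F = 30 (S)

# Nested-loop join on R.A = S.D; the inner loop scans the smaller table.
result = [(r[2], s[1])                  # project on C, E
          for r in r_sel
          for s in s_sel
          if r[0] == s[0]]
print(result)   # [(300, 'v')]
```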
More generally
Exclude as quickly as possible:
Irrelevant lines
Irrelevant attributes
This matters most when the tables involved reside on different hosts (distributed DBMS)
Whenever possible, ensure that inner join loops search tables that can reside in main memory
Caching considerations
Cannot rely on LRU to achieve that:
It will keep in memory recently accessed pages of all tables
Must keep
All pages of the table inside the inner loop
No pages of the other table
Can either
Let the DBMS manage the cache
Use a scan-tolerant cache algorithm (ARC)
A third plan
Find the lines of R where B = "z"
Using an index on S.D, find the lines of S where S.D matches R.A for those lines
Include each pair of lines in the join
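A sketch of this index-based plan, with the index on S.D modeled as a Python dict (an illustrative stand-in for a real DBMS index; names are mine):

```python
# Index-join sketch: build a lookup table on S.D, probe it with R.A.
from collections import defaultdict

R = [(1, 'x', 100), (2, 'y', 200), (3, 'z', 300), (4, 'y', 400)]
S = [(1, 'u', 10), (3, 'v', 30), (5, 'w', 30)]

index_on_D = defaultdict(list)
for s in S:
    index_on_D[s[0]].append(s)              # bucket S rows by D

result = []
for r in R:
    if r[1] == 'z':                         # lines of R where B = 'z'
        for s in index_on_D.get(r[0], []):  # index lookup: S.D matching R.A
            result.append(r + s)
print(result)   # [(3, 'z', 300, 3, 'v', 30)]
```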
Processing a query (I)
Parse the query
Convert query parse tree into a
logical query plan (LQP)
Apply equivalence rules (laws) and try to
improve upon extant LQP
Estimate result sizes
Processing a query (II)
Consider possible physical plans
Estimate their cost
Select the best
Execute it
Given the high cost of query processing, it
makes sense to evaluate various alternatives
Example (from [GM])
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
Relational algebra plan
π_title (σ_<condition> (StarsIn))
where <condition> tests whether the <tuple> <attribute> starName is IN
π_name (σ_birthdate LIKE '%1960' (MovieStar))
Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra
Logical query plan
π_title (σ_starName=name (StarsIn × π_name (σ_birthdate LIKE '%1960' (MovieStar))))
The Cartesian product could indicate a brute-force solution
Fig. 7.18: Applying the rule for IN conditions
Estimating result sizes
Need the expected sizes of
StarsIn
MovieStar
Estimate the costs of each option
Logical Query Plan → physical plans P1, P2, …, Pn with estimated costs C1, C2, …, Cn
Pick the best!
Query optimization
At two levels
Relational algebra level:
Use equivalence rules
Detailed query plan level:
Takes into account result sizes
Considers DB organization
How it is stored
Presence and types of indexes, …
Result sizes do matter
Consider the Cartesian product
Very costly when its two operands are
large tables
Less true when the tables are small
Equivalence rules for joins
R⋈S = S⋈R
(R⋈S)⋈T = R⋈(S⋈T)
Column order does not matter because the columns have labels
Rules for product and union
Equivalence rules for Cartesian product:
R×S = S×R
(R×S)×T = R×(S×T)
Equivalence rules for union:
R∪S = S∪R
(R∪S)∪T = R∪(S∪T)
Column order does not matter because the columns have labels
Rules for selections and unions
Equivalence rules for selection:
σp1∧p2(R) = σp1(σp2(R))
σp1∨p2(R) = σp1(R) ∪ σp2(R)
Equivalence rules for union:
R∪S = S∪R
(R∪S)∪T = R∪(S∪T)
Combining selections and joins
If predicate p only involves attributes of R:
σp (R⋈S) = σp (R)⋈S
If predicate q only involves attributes of S:
σq (R⋈S) = R⋈σq (S)
Warning:
πp1, p2(R) is NOT the same as πp1(πp2(R))
Combining selections and joins
σp∧q (R⋈S) = σp (R)⋈σq (S)
σp∧q∧m (R⋈S) = σm [σp (R)⋈σq (S)]
σp∨q (R⋈S) = [σp (R)⋈S] ∪ [R⋈σq (S)]
Combining projections and selections
Let
x be a subset of the attributes of R
z the set of attributes of R used in predicate p
then
πx [σp(R)] = πx [σp [πx∪z(R)]]
We can only eliminate attributes that are
not used in the selection predicate!
Combining projections and joins
Let
x be a subset of the attributes of R
y a subset of the attributes of S
z the common attributes of R and S
then
πxy (R⋈S) = πxy {[πxz (R)]⋈[πyz (S)]}
Combining projections, selections and joins
Let
x, y, z be as above
z' the union of z and the attributes used in predicate p
then
πxy {σp (R⋈S)} = πxy {σp {[πxz' (R)]⋈[πyz' (S)]}}
Combining selections, projections
and Cartesian product
Rules are similar
Just replace join operator by Cartesian
product operator
Keep in mind that join is a restricted
Cartesian product
Combining selections and unions
σp(R ∪ S) = σp(R) ∪ σp(S)
σp(R − S) = σp(R) − S = σp(R) − σp(S)
Finding the most promising transformations
σp1∧p2 (R) → σp1 [σp2 (R)]
Use successive selections
σp (R⋈S) → [σp (R)]⋈S
Do selections before joins
R⋈S → S⋈R
πx [σp (R)] → πx {σp [πx∪z (R)]}
Do projections before selections
First heuristics
Do projections early
Example from [GM]:
Given R(A,B,C,D,E) and the select predicate
P: (A=3) ∧ (B="cat")
It seems a good idea to replace πE {σP (R)} by πE {σP {πABE(R)}}
What if we have indexes?
Same example with indexes
Assume attribute A is indexed
Use the index to locate all tuples where A = 3
Select the tuples where B = "cat"
Do the projections
In other words, πE {σP (R)} is the best solution
Second heuristics
Do selections early
Especially if we can use indexes
but no heuristic is always true
Estimating cost of query plans
Requires
Estimating the sizes of the results
Estimating the number of I/O operations
We will generally assume that the cost of a query plan is dominated by the number of tuples being read or written
Estimating result sizes
Relevant data are
T(R) : number of tuples in R
S(R) : size of each tuple of R (in bytes)
B(R): number of blocks required to store R
V(R, A) : number of distinct values for
attribute A in R
Example

Relation R:
Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14

T(R) = 6
Assuming dates take 8 bytes and strings 20 bytes, S(R) = 48 bytes
B(R) = 1 block
V(R, Owner) = 3, V(R, Pet) = 2, V(R, Vax date) = 4
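These statistics are easy to compute for the example relation; a small Python sketch (the 20-byte string and 8-byte date widths are the slide's assumptions, and the variable names are mine):

```python
# Compute T(R), S(R) and V(R, A) for the sample relation.
R = [("Alice", "Cat", "3/2/15"), ("Alice", "Cat", "3/2/15"),
     ("Bob",   "Dog", "10/8/14"), ("Bob",  "Dog", "10/8/15"),
     ("Carol", "Dog", "11/9/14"), ("Carol", "Cat", "12/7/14")]

T = len(R)                                # number of tuples
S_bytes = 20 + 20 + 8                     # Owner, Pet (strings), Vax date
V = {col: len({row[i] for row in R})      # distinct values per attribute
     for i, col in enumerate(["Owner", "Pet", "Vax date"])}

print(T, S_bytes, V["Owner"], V["Pet"])   # 6 48 3 2
```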
Estimating cost of W = R1 × R2
T(W) = T(R1)×T(R2)
Obvious!
S(W) = S(R1)+S(R2)
Estimating cost of W = σA=a (R)
S(W) = S(R)
T(W) = T(R)/V(R, A)
but this assumes that the values of A are uniformly distributed over all the tuples
Example
W = σOwner="Bob"(R)
As T(R) = 6 and V(R, Owner) = 3
T(W) = 6/3 = 2

Owner  Pet  Vax date
Alice  Cat  3/2/15
Alice  Cat  3/2/15
Bob    Dog  10/8/14
Bob    Dog  10/8/15
Carol  Dog  11/9/14
Carol  Cat  12/7/14
Making another assumption
Assume now that values in select
expression Z = val are uniformly distributed
over all possible V(R, Z) values.
If W = σZ=val (R)
T(W) = T(R)/V(R, Z)
Estimating sizes of range queries
Attribute Z of table R has 50 possible values ranging from 1 to 100
If W = σZ>80 (R), what is T(W)?
Assuming the values of Z are uniformly distributed over [1, 100]:
T(W) = T(R)×(100 − 80)/(100 − 1 + 1) = 0.2×T(R)
Explanation
T(W) = T(R)×(Query_Range/Value_Range)
If the query had been W = σZ≥80 (R), T(W) would have been
T(R)×(100 − 80 + 1)/(100 − 1 + 1) = 0.21×T(R)
21 possible values
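The two range estimates can be wrapped in one hypothetical helper (the function and parameter names are mine):

```python
# Range-query size estimate: T(W) = T(R) * query_range / value_range,
# assuming Z is uniformly distributed over the integers lo..hi.
def range_estimate(T_R, lo, hi, bound, inclusive=False):
    query_range = hi - bound + (1 if inclusive else 0)  # values with Z > bound (or Z >= bound)
    value_range = hi - lo + 1
    return T_R * query_range / value_range

print(range_estimate(1000, 1, 100, 80))                  # Z > 80  -> 200.0
print(range_estimate(1000, 1, 100, 80, inclusive=True))  # Z >= 80 -> 210.0
```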
Estimating the size of R⋈S queries
We consider R(X, Y)⋈S(Y, Z)
Special cases:
R and S have disjoint values for Y:
T(R⋈S) = 0
Y is the key of S and a foreign key in R:
T(R⋈S) = T(R)
Almost all tuples of R and S have the same value for Y:
T(R⋈S) = T(R)×T(S)
Estimating the size of R⋈S queries
General case:
Will assume
Containment of values:
If V(R, Y) ≤ V(S, Y) then all values of Y in R are also in S
Preservation of value sets:
If A is an attribute of R that is not in S, then V(R⋈S, A) = V(R, A)
Estimating the size of R⋈S queries
If V(R, Y) ≤ V(S, Y)
Every Y value of R is present in S
On average, a given tuple in R is likely to match T(S)/V(S, Y) tuples of S
R has T(R) tuples
T(R⋈S) = T(R)×T(S)/V(S, Y)
Estimating the size of R⋈S queries
If V(R, Y) ≥ V(S, Y)
Every Y value of S is present in R
On average, a given tuple in S is likely to match T(R)/V(R, Y) tuples of R
S has T(S) tuples
T(R⋈S) = T(R)×T(S)/V(R, Y)
Estimating the size of R⋈S queries
In general
T(R⋈S) = T(R)×T(S)/max(V(R, Y), V(S, Y))
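The general rule fits in a one-line helper (a sketch; the function name is mine, and integer division is used because the estimate counts tuples):

```python
# T(R join S) = T(R) * T(S) / max(V(R, Y), V(S, Y))
def join_size(T_R, T_S, V_R_Y, V_S_Y):
    return T_R * T_S // max(V_R_Y, V_S_Y)

# Illustration with T(R)=5,000, T(S)=200, V(R, Y)=100, V(S, Y)=50:
print(join_size(5000, 200, 100, 50))   # 10000
```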
An example (I)
Finding all employees who live in a city
where the company has a plant:
EMPLOYEE( EID, NAME, …., CITY)
PLANT(PLANTID, …,CITY)
SELECT E.NAME
FROM EMPLOYEE E, PLANT P
WHERE E.CITY = P.CITY
SELECT EMPLOYEE.NAME
FROM EMPLOYEE JOIN PLANT
ON EMPLOYEE.CITY= PLANT.CITY
An example (II)
Assume
T(E)=5,000
V(E, CITY) = 100
T(P)= 200
V(P, CITY) = 50
T(E⋈P) = T(E)×T(P)/
MAX(V(E, CITY), V(P, CITY))
= 5,000×200/MAX(100, 50)
= 1,000,000/100 = 10,000
Estimating the size of multiple joins
R⋈S⋈U
R(A, B): T(R)=1,000, V(R,B)=20
S(B, C): T(S)=2,000, V(S,B)=50, V(S,C)=100
U(C, D): T(U)=5,000, V(U,C)=500
Left to right:
T(R⋈S) = 2,000,000/max(20, 50) = 40,000
T((R⋈S)⋈U) = 200,000,000/max(100, 500) = 400,000
Right to left:
T(S⋈U) = 10,000,000/max(100, 500) = 20,000
T(R⋈(S⋈U)) = 20,000,000/max(20, 50) = 400,000
Estimating the size of multicondition joins
R(X, y1, y2, …)⋈S(y1, y2, …, Z)
If V(R, y1) ≤ V(S, y1) and V(R, y2) ≤ V(S, y2) …
Every value of R is present in S
On average, a given tuple in R is likely to match T(S)/(V(S, y1)×V(S, y2)…) tuples of S
R has T(R) tuples
T(R⋈S) = T(R)×T(S)/(V(S, y1)×V(S, y2)…)
Multicondition join
R(X, y1, y2, …)⋈S(y1, y2, …, Z)
In general
T(R⋈S) = T(R)×T(S)/[max(V(R, y1), V(S, y1))×max(V(R, y2), V(S, y2))× …]
Estimating the size of unions
T(R⋃S)
For a bag union:
T(R⋃S) = T(R)+T(S) (exact)
For a regular union:
If the relations are disjoint: T(R⋃S) = T(R)+T(S)
If one relation contains the other: T(R⋃S) = max(T(R), T(S))
Estimate: T(R⋃S) = (max(T(R), T(S))+T(R)+T(S))/2
We take the average!
Estimating the size of intersections
T(R⋂S)
If the relations are disjoint: T(R⋂S) = 0
If one relation contains the other: T(R⋂S) = min(T(R), T(S))
Estimate: T(R⋂S) = min(T(R), T(S))/2
We take the average!
Estimating the size of set differences
T(R−S)
If the relations are disjoint: T(R−S) = T(R)
If relation R contains relation S: T(R−S) = T(R)−T(S)
Estimate: T(R−S) = (2T(R)−T(S))/2
We take the average!
Estimating the cost of eliminating duplicates
δ(R)
If all tuples are duplicates: T(δ(R)) = 1
If no tuples are duplicates: T(δ(R)) = T(R)
Estimate: T(δ(R)) = T(R)/2
If R(a1, a2, …) and we know the V(R, ai):
T(δ(R)) = Πi V(R, ai)
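These midpoint estimates can be collected into small helpers (a sketch; the function names are mine, and capping the duplicate-elimination estimate with min(…) matches the min(ΠV, T/2) used in the plan-cost examples later in these notes):

```python
# Midpoint estimates for set operations and duplicate elimination.
def union_size(T_R, T_S):          # average of max(T_R, T_S) and T_R + T_S
    return (max(T_R, T_S) + T_R + T_S) / 2

def intersection_size(T_R, T_S):   # average of 0 and min(T_R, T_S)
    return min(T_R, T_S) / 2

def difference_size(T_R, T_S):     # average of T_R and T_R - T_S
    return T_R - T_S / 2

def distinct_size(T_R, *V_values): # T(delta(R)) when the V(R, ai) are known
    prod = 1
    for v in V_values:
        prod *= v
    return min(T_R / 2, prod)

print(union_size(100, 60), intersection_size(100, 60), difference_size(100, 60))
# 130.0 30.0 70.0
```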
Collecting statistics
Can explicitly request statistics
Maintain them incrementally
Can collect histograms
Give an idea of how the data are distributed
Not all patrons borrow an equal number of books
The Zipf distribution (I)
Empirical distribution
Verified in many cases
Ranks items by frequency/popularity
If f is the probability of accessing/using the most popular item in the list (rank 1)
The probability of accessing/using the second most popular item will be close to f/2
The probability of accessing/using the third most popular item will be close to f/3
The Zipf distribution (II)
[Figure: frequency (0.00 to 0.25) versus rank (1 to 50)]
The Zipf distribution (III)
Can adjust the slope of the curve by adding an exponent
If f is the probability of accessing/using the most popular item in the list (rank 1)
The probability of accessing/using the n-th ranked item in the list will be close to f/n^i
i = ½ seems to be a good choice
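A normalized Zipf distribution is easy to generate (a sketch; `s` plays the role of the exponent i from the slide, and the function name is mine):

```python
# Zipf frequencies for n items: the k-th most popular item gets weight 1/k**s.
def zipf_frequencies(n, s=1.0):
    weights = [1 / k**s for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]    # normalize so frequencies sum to 1

freqs = zipf_frequencies(50)               # classic Zipf: rank-k frequency ~ f/k
print(round(freqs[0] / freqs[1], 2))       # 2.0 (rank 1 twice as likely as rank 2)
flatter = zipf_frequencies(50, s=0.5)      # exponent 1/2 flattens the curve
```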
Example (I)
A library uses two tables to keep track of its books
Book(BID, Title, Authors, Loaned, Due)
Patron(PID, Name, Phone, Address)
The Loaned attribute of a book is equal to
The PID of the patron who borrowed the book
Zero if the book is on the shelves
Example (II)
We want to find the titles of all books currently loaned to "E. Chambas"
T(Book) = 5,000
T(Patron) = 500
V(Book, Loaned) = 200
V(Patron, Name) = 500
First plan
X = Book ⋈Loaned=PID Patron
Y = σName = "E. Chambas"(X)
Since PID is the key of Patron and assuming that Loaned is a foreign key in Book:
T(X) = T(Book) = 5,000 (all books are borrowed!)
T(Y) = T(X)/V(Patron, Name) = 5,000/500 = 10
Second plan
X = σName = "E. Chambas"(Patron)
Y = Book ⋈Loaned=PID X
T(X) = T(Patron)/V(Patron, Name) = 500/500 = 1
T(Y) = T(Book)×T(X)/V(Book, Loaned) = 5,000/500 = 10
Comparing the two plans (I)
Comparison based on the number of tuples created by the plan minus the number of tuples constituting the answer
That number should be the same for all correct plans
For the same reason, we do not consider the number of tuples being read
Comparing the two plans (II)
Cost of first plan: 5,000
Cost of second plan: 1
An example (I)
Finding all employees who live in a city
where the company has a plant:
EMPLOYEE( EID, NAME, …., CITY)
PLANT(PLANTID, …,CITY)
Assume
T(E)=5,000
V(E, CITY) = 100
V(E, NAME) = 5,000
T(P)= 200
V(P, CITY) = 50
A first plan
X = E ⋈E.CITY=P.CITY P
Y = πE.NAME (X)
T(X) = T(E)×T(P)/max(V(E, CITY), V(P, CITY)) = 5,000×200/max(100, 50) = 1,000,000/100 = 10,000
T(Y) = 10,000 (not possible!)
A second plan
X = πP.CITY (P)
Y = δ(X)
Z = E ⋈E.CITY=Y.CITY Y
U = πE.NAME (Z)
T(X) = T(P) = 200
T(Y) = V(X, CITY) = V(P, CITY) = 50
T(Z) = T(E)×T(Y)/max(V(E, CITY), V(Y, CITY)) = 5,000×50/max(100, 50) = 2,500
T(U) = T(Z) = 2,500
Comparing the two plans
First plan:
X: 10,000
(Y: 10,000, the final answer, not counted)
Total: 10,000
Second plan:
X: 200
Y: 50
Z: 2,500
(U: 2,500, the final answer, not counted)
Total: 2,750
Here it pays off to eliminate duplicates early
Example [GM]
We have R(a,b) and S(b,c)
We want δ(σa="a" (R⋈S))
We know
T(R) = 5,000
T(S) = 2,000
V(R, a) = 50
V(R, b) = 100
V(S, b) = 200
V(S, c) = 100
First plan
X1 = σa="a" (R)
X2 = X1⋈S
X3 = δ (X2)
T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = T(X1)×T(S)/max(V(R, b), V(S, b)) = 100×2,000/max(100, 200) = 1,000
T(X3) = min(…, T(X2)/2) = 500 (the final answer, doesn't count)
Second plan
X1 = δ(R)
X2 = δ(S)
X3 = σa="a" (X1)
X4 = X3⋈X2
T(X1) = min(V(R, a)×V(R, b), T(R)/2) = min(50×100, 5,000/2) = 2,500
T(X2) = min(V(S, b)×V(S, c), T(S)/2) = min(200×100, 2,000/2) = 1,000
T(X3) = T(X1)/V(R, a) = 2,500/50 = 50
T(X4) = T(X3)×T(X2)/max(V(R, b), V(S, b)) = 50×1,000/max(100, 200) = 250 (the final answer, doesn't count)
Comparing the two plans
First plan:
X1: 100
X2: 1,000
(X3: 500, the final answer, not counted)
Total: 1,100
Second plan:
X1: 2,500
X2: 1,000
X3: 50
(X4: 250, the final answer, not counted)
Total: 3,550
Here it did not pay off to eliminate duplicates early
A hybrid plan
X1 = σa="a" (R)
X2 = δ(X1)
X3 = X2⋈S
X4 = δ(X3)
T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = min(V(R, b), T(X1)/2) = min(100, 50) = 50
T(X3) = T(X2)×T(S)/max(V(R, b), V(S, b)) = 50×2,000/max(100, 200) = 500
T(X4) = min(…, T(X3)/2) = 250 (the final answer, doesn't count)
Comparing the two best plans
First plan:
X1: 100
X2: 1,000
(X3: 500, the final answer, not counted)
Total: 1,100
Hybrid plan:
X1: 100
X2: 50
X3: 500
(X4: 250, the final answer, not counted)
Total: 650
Reducing the sizes of the tables
in a join is a good idea if we can
do it on the cheap
Ordering joins
Join methods are often asymmetric, so cost(R⋈S) ≠ cost(S⋈R)
Useful to build a join tree
A simple greedy algorithm will work well:
Start with the pair of relations whose estimated join size is the smallest
Then repeatedly find, among the other relations, the one that would produce the smallest estimated size when joined to the current tree
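A sketch of this greedy heuristic, reusing the estimate T(R⋈S) = T(R)×T(S)/max(V(R,Y), V(S,Y)) and the R, S, U statistics from the multiple-join example (the names are mine, and merging the V-dicts is a simplification that ignores containment of value sets):

```python
# Greedy join ordering: repeatedly join in the relation that yields the
# smallest estimated intermediate result. Each relation is (T, {attr: V}).
def estimate_join(T1, V1, T2, V2):
    size = T1 * T2
    for y in set(V1) & set(V2):              # divide by max(V) per shared attribute
        size //= max(V1[y], V2[y])
    return size

def greedy_order(relations):
    names = set(relations)
    pairs = [(a, b) for a in names for b in names if a < b]
    a, b = min(pairs, key=lambda p: estimate_join(*relations[p[0]], *relations[p[1]]))
    order = [a, b]
    cur = (estimate_join(*relations[a], *relations[b]),
           {**relations[a][1], **relations[b][1]})   # simplification: union the V-dicts
    names -= {a, b}
    while names:
        nxt = min(names, key=lambda n: estimate_join(*cur, *relations[n]))
        cur = (estimate_join(*cur, *relations[nxt]),
               {**cur[1], **relations[nxt][1]})
        order.append(nxt)
        names.remove(nxt)
    return order

stats = {'R': (1000, {'B': 20}), 'S': (2000, {'B': 50, 'C': 100}),
         'U': (5000, {'C': 500})}
print(greedy_order(stats))   # ['S', 'U'] joined first (estimate 20,000), then 'R'
```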
Implementing joins
A. Nested loops
W = []
for each row r in R:
    for each row s in S:
        if match_found(r, s):
            W.append(concatenate(r, s))
Number of operations: T(R)×T(S)
The idea
Table R
Table S
Try to match
every tuple of R
with all tuples of S
Optimization
Assume that the second relation can fit in main memory
It is read only once
Number of reads is T(R) + T(S)
B. Sort and merge
We sort the two tables using the matching attributes as sorting keys
Can now select the matches by doing a merge
Single-pass process unless we have duplicate matches
Number of operations is
O(T(R)log(T(R))) + O(T(S)log(T(S))) + T(R) + T(S)
assuming one table does not have potential duplicate matches
Great if the tables are already sorted
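A sort-merge join sketch on the sample R and S tables (assuming, as the slide does, no duplicate join keys):

```python
# Sort-merge join on R.A = S.D.
R = [(3, 'z', 300), (1, 'x', 100), (4, 'y', 400), (2, 'y', 200)]
S = [(5, 'w', 30), (1, 'u', 10), (3, 'v', 30)]

R.sort(key=lambda r: r[0])      # O(T(R) log T(R))
S.sort(key=lambda s: s[0])      # O(T(S) log T(S))

result, i, j = [], 0, 0
while i < len(R) and j < len(S):           # single merge pass
    if R[i][0] < S[j][0]:
        i += 1
    elif R[i][0] > S[j][0]:
        j += 1
    else:                                  # keys match: emit concatenated rows
        result.append(R[i] + S[j])
        i += 1                             # safe only without duplicate join keys
print(result)   # [(1, 'x', 100, 1, 'u', 10), (3, 'z', 300, 3, 'v', 30)]
```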
C. Hashing
Assume both tables maintain a hash with K buckets for the matching attributes
for i in range(K):
    join all R entries in bucket i with all S entries in the same bucket
We replace a big join by K smaller joins
Number of operations will be:
K×(T(R)/K)×(T(S)/K) = T(R)×T(S)/K
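A hash-join sketch on the sample tables with K = 4 buckets (the explicit key comparison inside each bucket guards against hash collisions between different key values):

```python
# Hash join: partition both tables into K buckets on the join key,
# then join bucket i of R only with bucket i of S.
from collections import defaultdict

R = [(1, 'x', 100), (2, 'y', 200), (3, 'z', 300), (4, 'y', 400)]
S = [(1, 'u', 10), (3, 'v', 30), (5, 'w', 30)]
K = 4

buckets_R, buckets_S = defaultdict(list), defaultdict(list)
for r in R:
    buckets_R[hash(r[0]) % K].append(r)    # partition R on R.A
for s in S:
    buckets_S[hash(s[0]) % K].append(s)    # partition S on S.D

result = []
for i in range(K):                         # K small joins instead of one big one
    for r in buckets_R[i]:
        for s in buckets_S[i]:
            if r[0] == s[0]:               # re-check the key inside the bucket
                result.append(r + s)
print(sorted(result))   # [(1, 'x', 100, 1, 'u', 10), (3, 'z', 300, 3, 'v', 30)]
```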