Notes on query optimization


QUERY PROCESSING AND OPTIMIZATION

Overview

SQL is a declarative language:
- Specifies the outcome of the computation without defining any flow of control
- Will require the DBMS to select an execution plan
- Will allow optimizations
Sample query

SELECT C, E
FROM R, S
WHERE R.B = "z" AND S.F = 30 AND R.A = S.D
The two tables
R:
 A  B  C
 1  x  100
 2  y  200
 3  z  300
 4  y  400

S:
 D  E  F
 1  u  10
 3  v  30
 5  w  30

First execution plan


Can use relational algebra to express an execution plan.
Could be:
- Cartesian product: R×S
- Selection: σR.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R×S)
- Projection: πC, E (σR.B = "z" ∧ S.F = 30 ∧ R.A = S.D (R×S))
Graphical representation
πC, E
σR.B = "z" ∧ S.F = 30 ∧ R.A = S.D
R×S
R×S:
 A  B  C    D  E  F
 1  x  100  1  u  10
 1  x  100  3  v  30
 1  x  100  5  w  30
 2  y  200  1  u  10
 2  y  200  3  v  30
 2  y  200  5  w  30
 3  z  300  1  u  10
 3  z  300  3  v  30
 3  z  300  5  w  30
 4  y  400  1  u  10
 4  y  400  3  v  30
 4  y  400  5  w  30

Second execution plan

- Selection:
  σB = "z" (R)
  σF = 30 (S)
- Join:
  σB = "z" (R) ⋈R.A=S.D σF = 30 (S)
- Projection:
  πC, E (…)

The two tables
R:
 A  B  C
 1  x  100
 2  y  200
 3  z  300
 4  y  400

S:
 D  E  F
 1  u  10
 3  v  30
 5  w  30

After the selections
σB = "z" (R):
 A  B  C
 3  z  300

σF = 30 (S):
 D  E  F
 3  v  30
 5  w  30

After the join, σB = "z" (R) ⋈R.A=S.D σF = 30 (S):
 A  B  C    D  E  F
 3  z  300  3  v  30

Discussion

The second plan:
- First extracts the relevant rows of tables R and S
- Uses a more efficient join:

  for each row in σB = "z" (R):
      for each row in σF = 30 (S):
          if R.A = S.D:
              include the pair of rows in the result

Note that the inner loop scans the smaller temporary table (σF = 30 (S)).
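
As an illustration only (not part of the original notes), here is a minimal Python sketch of this second plan, with each table stored as a list of dicts; the row layout and variable names are assumptions:

    # Toy data for R(A, B, C) and S(D, E, F), matching the tables above
    R = [{"A": 1, "B": "x", "C": 100}, {"A": 2, "B": "y", "C": 200},
         {"A": 3, "B": "z", "C": 300}, {"A": 4, "B": "y", "C": 400}]
    S = [{"D": 1, "E": "u", "F": 10}, {"D": 3, "E": "v", "F": 30},
         {"D": 5, "E": "w", "F": 30}]

    # Second plan: push the selections down, nested-loop join, then project
    R_sel = [r for r in R if r["B"] == "z"]        # sigma B = "z" (R)
    S_sel = [s for s in S if s["F"] == 30]         # sigma F = 30 (S)

    result = [{"C": r["C"], "E": s["E"]}           # pi C, E
              for r in R_sel                       # outer loop
              for s in S_sel                       # inner loop (smaller table)
              if r["A"] == s["D"]]                 # join condition R.A = S.D
    print(result)                                  # [{'C': 300, 'E': 'v'}]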
More generally


- Exclude as quickly as possible:
  - Irrelevant rows
  - Irrelevant attributes
- This matters most when the tables involved reside on different hosts (distributed DBMS)
- Whenever possible, ensure that inner join loops scan tables that can reside in main memory
Caching considerations



- Cannot rely on LRU to achieve that:
  - It will keep in memory the recently accessed pages of all tables
- Must keep:
  - All pages of the table inside the inner loop
  - No pages of the other table
- Can either:
  - Let the DBMS manage the cache
  - Use a scan-tolerant cache algorithm (ARC)
A third plan



- Find the rows of R where B = "z"
- Using an index on S.D, find the rows of S where S.D matches R.A for those rows
- Include each matching pair of rows in the join
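
A Python dict can stand in for the index on S.D. The sketch below is illustrative only (it reuses the toy R and S defined above) and folds in the remaining predicate on S.F:

    index_on_D = {}
    for s in S:
        index_on_D.setdefault(s["D"], []).append(s)    # index S on attribute D

    result = [{"C": r["C"], "E": s["E"]}
              for r in R if r["B"] == "z"               # rows of R where B = "z"
              for s in index_on_D.get(r["A"], [])       # index lookup: S.D = R.A
              if s["F"] == 30]                          # remaining predicate on S
    print(result)                                       # [{'C': 300, 'E': 'v'}]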
Processing a query (I)




- Parse the query
- Convert the query parse tree into a logical query plan (LQP)
- Apply equivalence rules (laws) and try to improve upon the extant LQP
- Estimate result sizes
Processing a query (II)




- Consider possible physical plans
- Estimate their cost
- Select the best
- Execute it

Given the high cost of query processing, it makes sense to evaluate various alternatives.
Example (from [GM])
SELECT title
FROM StarsIn
WHERE starName IN (
    SELECT name
    FROM MovieStar
    WHERE birthdate LIKE '%1960'
);
Relational Algebra plan

πtitle
 σ<condition>
  StarsIn
  <condition>: <tuple> <attribute> starName IN (πname (σbirthdate LIKE '%1960' (MovieStar)))

Fig. 7.15: An expression using a two-argument σ, midway between a parse tree and relational algebra

Logical query plan

πtitle
 σstarName = name
  × (the Cartesian product could indicate a brute-force solution)
   StarsIn
   πname (σbirthdate LIKE '%1960' (MovieStar))

Fig. 7.18: Applying the rule for IN conditions

Estimating result sizes

We need the expected sizes of StarsIn and of the σ and π expressions on MovieStar.

Estimate the costs of each option

From the logical query plan, generate candidate physical plans P1, P2, …, Pn, estimate their costs C1, C2, …, Cn, and pick the best!

Query optimization

At two levels:
- Relational algebra level:
  - Use equivalence rules
- Detailed query plan level:
  - Takes into account result sizes
  - Considers the DB organization:
    - How it is stored
    - Presence and types of indexes, …

Result sizes do matter

Consider the Cartesian product:
- Very costly when its two operands are large tables
- Less true when the tables are small

Equivalence rules for joins:
- R⋈S = S⋈R
- (R⋈S)⋈T = R⋈(S⋈T)

Column order does not matter because the columns have labels.

Rules for product and union

Equivalence rules for Cartesian product:
- R × S = S × R
- (R × S) × T = R × (S × T)

Equivalence rules for union:
- R ∪ S = S ∪ R
- (R ∪ S) ∪ T = R ∪ (S ∪ T)

Column order does not matter because the columns have labels.

Rules for selections and unions

Equivalence rules for selection:
- σp1 ∧ p2 (R) = σp1 (σp2 (R))
- σp1 ∨ p2 (R) = σp1 (R) ∪ σp2 (R)

Equivalence rules for union:
- R ∪ S = S ∪ R
- (R ∪ S) ∪ T = R ∪ (S ∪ T)

Combining selections and joins

- If predicate p only involves attributes of R not used in the join:
  σp (R⋈S) = σp (R) ⋈ S
- If predicate q only involves attributes of S not used in the join:
  σq (R⋈S) = R ⋈ σq (S)

Warning:
πp1, p2 (R) is NOT the same as πp1 (πp2 (R))

Combining selections and joins (continued)

If p only involves attributes of R and q only involves attributes of S:
- σp ∧ q (R⋈S) = σp (R) ⋈ σq (S)
- σp ∧ q ∧ m (R⋈S) = σm [(σp R) ⋈ (σq S)]
- σp ∨ q (R⋈S) = [σp (R) ⋈ S] ∪ [R ⋈ σq (S)]

Combining projections and selections

Let
- x be a subset of the attributes of R
- z be the set of attributes of R used in predicate p
then
- πx [σp (R)] = πx [σp [πx∪z (R)]]

We can only eliminate attributes that are not used in the selection predicate!

Combining projections and joins

Let
- x be a subset of the attributes of R
- y be a subset of the attributes of S
- z be the common attributes of R and S
then
- πx∪y (R⋈S) = πx∪y {[πx∪z (R)] ⋈ [πy∪z (S)]}

Combining projections, selections and joins

Let
- x, y, z be as above
- z' be the union of z and the attributes used in predicate p
then
- πx∪y {σp (R⋈S)} = πx∪y {σp {[πx∪z' (R)] ⋈ [πy∪z' (S)]}}

Combining selections, projections and Cartesian product

- The rules are similar
- Just replace the join operator by the Cartesian product operator
  - Keep in mind that a join is a restricted Cartesian product

Combining selections with unions and differences

- σp (R ∪ S) = σp (R) ∪ σp (S)
- σp (R − S) = σp (R) − S = σp (R) − σp (S)

Finding the most promising transformations

- σp1 ∧ p2 (R) → σp1 [σp2 (R)]
  - Use successive selections
- σp (R⋈S) → [σp (R)] ⋈ S
  - Do selections before joins
- R⋈S → S⋈R
- πx [σp (R)] → πx {σp [πx∪z (R)]}
  - Do projections before selections
First heuristic

Do projections early
- Example from [GM]:
  - Given R(A,B,C,D,E) and the select predicate
    P: (A=3) ∧ (B="cat")
  - It seems a good idea to replace
    πE {σP (R)}
    by
    πE {σP {πA,B,E (R)}}
  - What if we have indexes?
Same example with indexes

Assume attribute A is indexed:
- Use the index to locate all tuples where A = 3
- Select the tuples where B = "cat"
- Do the projections

In other words, πE {σP (R)} is the best solution.
Second heuristic

Do selections early
- Especially if we can use indexes
- But no heuristic is always true
Estimating the cost of query plans

Requires:
- Estimating the sizes of the results
- Estimating the number of I/O operations

We will generally assume that the cost of a query plan is dominated by the number of tuples being read or written.
Estimating result sizes

The relevant data are:
- T(R): number of tuples in R
- S(R): size of each tuple of R (in bytes)
- B(R): number of blocks required to store R
- V(R, A): number of distinct values for attribute A in R
Example





Relation R:
 Owner  Pet  Vax date
 Alice  Cat  3/2/15
 Alice  Cat  3/2/15
 Bob    Dog  10/8/14
 Bob    Dog  10/8/15
 Carol  Dog  11/9/14
 Carol  Cat  12/7/14

T(R) = 8
Assuming dates take 8 bytes and strings 20 bytes, S(R) = 48 bytes
B(R) = 1 block
V(R, Owner) = 3, V(R, Pet) = 2, V(R, Vax date) = 4
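
These statistics are cheap to compute on a toy relation. The Python sketch below is an illustration only (not part of the original notes); it uses the six sample rows shown above and assumes 20-byte strings, 8-byte dates and a 4 KB block, so its counts apply to that sample only:

    rows = [("Alice", "Cat", "3/2/15"), ("Alice", "Cat", "3/2/15"),
            ("Bob", "Dog", "10/8/14"), ("Bob", "Dog", "10/8/15"),
            ("Carol", "Dog", "11/9/14"), ("Carol", "Cat", "12/7/14")]

    T_R = len(rows)                             # T(R): number of tuples
    S_R = 20 + 20 + 8                           # S(R): two strings + one date, in bytes
    B_R = -(-T_R * S_R // 4096)                 # B(R): 4 KB blocks, rounded up
    V = {i: len({row[i] for row in rows})       # V(R, A): distinct values per attribute
         for i in range(3)}

    print(T_R, S_R, B_R, V)                     # 6 48 1 {0: 3, 1: 2, 2: 5}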
Estimating the cost of W = R1 × R2

- T(W) = T(R1)×T(R2)   (obvious!)
- S(W) = S(R1)+S(R2)
Estimating the cost of W = σA=a (R)

- S(W) = S(R)
- T(W) = T(R)/V(R, A)
  - but this assumes that the values of A are uniformly distributed over all the tuples
Example



W = σOwner = "Bob" (R)

As T(R) = 6 and V(R, Owner) = 3,
T(W) = 6/3 = 2

 Owner  Pet  Vax date
 Alice  Cat  3/2/15
 Alice  Cat  3/2/15
 Bob    Dog  10/8/14
 Bob    Dog  10/8/15
 Carol  Dog  11/9/14
 Carol  Cat  12/7/14
Making another assumption


Assume now that the values in the select expression Z = val are uniformly distributed over all possible V(R, Z) values.

If W = σZ=val (R):
- T(W) = T(R)/V(R, Z)
Estimating sizes of range queries

- Attribute Z of table R has 50 possible values ranging from 1 to 100
- If W = σZ > 80 (R), what is T(W)?
- Assuming the values of Z are uniformly distributed over [1, 100]:
  T(W) = T(R)×(100 – 80)/(100 – 1 + 1) = 0.2×T(R)
Explanation

T(W) = T(R)×(Query_Range/Value_Range)

If the query had been W = σZ ≥ 80 (R), T(W) would have been
T(R)×(100 – 80 + 1)/(100 – 1 + 1) = 0.21×T(R)
(21 possible values)
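
A small helper (illustrative only; the name and signature are mine) that applies this Query_Range/Value_Range estimate:

    def estimate_range(T_R, lo, hi, bound, inclusive=False):
        """Estimate T(sigma Z > bound (R)), or Z >= bound, when Z is uniform over [lo, hi]."""
        value_range = hi - lo + 1
        query_range = hi - bound + (1 if inclusive else 0)
        return T_R * query_range / value_range

    print(estimate_range(1000, 1, 100, 80))                  # Z >  80: 200.0 = 0.20 x T(R)
    print(estimate_range(1000, 1, 100, 80, inclusive=True))  # Z >= 80: 210.0 = 0.21 x T(R)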
Estimating the size of R⋈S queries

We consider R(X, Y) ⋈ S(Y, Z).

Special cases:
- R and S have disjoint values for Y:
  T(R⋈S) = 0
- Y is the key of S and a foreign key in R:
  T(R⋈S) = T(R)
- Almost all tuples of R and S have the same value for Y:
  T(R⋈S) = T(R)×T(S)

Estimating the size of R⋈S queries

General case: we will assume
- Containment of value sets:
  If V(R, Y) ≤ V(S, Y), then all values of Y in R are also in S
- Preservation of value sets:
  If A is an attribute of R that is not in S, then V(R⋈S, A) = V(R, A)

Estimating the size of R⋈S queries

If V(R, Y) ≤ V(S, Y):
- Every value of Y in R is present in S
- On average, a given tuple of R matches T(S)/V(S, Y) tuples of S
- R has T(R) tuples
- Hence T(R⋈S) = T(R)×T(S)/V(S, Y)

Estimating the size of R⋈S queries

If V(R, Y) ≥ V(S, Y):
- Every value of Y in S is present in R
- On average, a given tuple of S matches T(R)/V(R, Y) tuples of R
- S has T(S) tuples
- Hence T(R⋈S) = T(R)×T(S)/V(R, Y)

Estimating the size of R⋈S queries

In general:
- T(R⋈S) = T(R)×T(S)/max(V(R, Y), V(S, Y))
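
The general estimate is easy to wrap in a helper (illustrative only; the function name is mine). The usage line reproduces the employee/plant example on the next slides:

    def estimate_join(T_R, T_S, V_R_Y, V_S_Y):
        """T(R join S) ~ T(R) * T(S) / max(V(R, Y), V(S, Y))."""
        return T_R * T_S / max(V_R_Y, V_S_Y)

    print(estimate_join(5000, 200, 100, 50))   # employees join plants: 10000.0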
An example (I)



Finding all employees who live in a city where the company has a plant:
- EMPLOYEE(EID, NAME, …, CITY)
- PLANT(PLANTID, …, CITY)

SELECT E.NAME
FROM EMPLOYEE E, PLANT P
WHERE E.CITY = P.CITY

SELECT EMPLOYEE.NAME
FROM EMPLOYEE JOIN PLANT
ON EMPLOYEE.CITY = PLANT.CITY
An example (II)

Assume:
- T(E) = 5,000 and V(E, CITY) = 100
- T(P) = 200 and V(P, CITY) = 50

T(E⋈P) = T(E)×T(P)/max(V(E, CITY), V(P, CITY))
       = 5,000×200/max(100, 50)
       = 1,000,000/100 = 10,000
Estimating the size of multiple joins

R⋈S⋈U with:
- R(A, B):  T(R) = 1,000   V(R, B) = 20
- S(B, C):  T(S) = 2,000   V(S, B) = 50   V(S, C) = 100
- U(C, D):  T(U) = 5,000   V(U, C) = 500

Left to right:
- T(R⋈S) = 2,000,000/max(20, 50) = 40,000
- T(R⋈S⋈U) = 200,000,000/max(100, 500) = 400,000

Right to left:
- T(S⋈U) = 10,000,000/max(100, 500) = 20,000
- T(R⋈S⋈U) = 20,000,000/max(20, 50) = 400,000
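
Reusing the estimate_join helper sketched earlier (illustrative only), both join orders give the same final estimate; V(R⋈S, C) = V(S, C) = 100 by preservation of value sets:

    T_RS = estimate_join(1000, 2000, 20, 50)              # R join S: 40,000
    left_to_right = estimate_join(T_RS, 5000, 100, 500)   # (R join S) join U: 400,000

    T_SU = estimate_join(2000, 5000, 100, 500)            # S join U: 20,000
    right_to_left = estimate_join(1000, T_SU, 20, 50)     # R join (S join U): 400,000

    print(left_to_right, right_to_left)                   # 400000.0 400000.0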
Estimating the size of multicondition joins

R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)

If V(R, y1) ≤ V(S, y1) and V(R, y2) ≤ V(S, y2) …:
- Every combination of join values in R is present in S
- On average, a given tuple of R matches T(S)/(V(S, y1)×V(S, y2)×…) tuples of S
- R has T(R) tuples
- Hence T(R⋈S) = T(R)×T(S)/(V(S, y1)×V(S, y2)×…)

Multicondition join

R(X, y1, y2, …) ⋈ S(y1, y2, …, Z)

In general:
- T(R⋈S) = T(R)×T(S)/[max(V(R, y1), V(S, y1)) × max(V(R, y2), V(S, y2)) × …]

Estimating the size of unions

T(R⋃S):
- For a bag union:
  T(R⋃S) = T(R)+T(S)   (exact)
- For a regular (set) union:
  - If the relations are disjoint: T(R⋃S) = T(R)+T(S)
  - If one relation contains the other: T(R⋃S) = max(T(R), T(S))
  - Estimate: T(R⋃S) = (max(T(R), T(S)) + T(R) + T(S))/2
    We take the average of the two extremes!
Estimating the size of intersections

T(R⋂S):
- If the relations are disjoint: T(R⋂S) = 0
- If one relation contains the other: T(R⋂S) = min(T(R), T(S))
- Estimate: T(R⋂S) = min(T(R), T(S))/2
  We take the average of the two extremes!
Estimating the size of set differences

T(R−S):
- If the relations are disjoint: T(R−S) = T(R)
- If relation R contains relation S: T(R−S) = T(R)−T(S)
- Estimate: T(R−S) = (2T(R)−T(S))/2 = T(R)−T(S)/2
  We take the average of the two extremes!
Estimating the cost of eliminating duplicates

δ(R):
- If all tuples are duplicates: T(δ(R)) = 1
- If no tuples are duplicates: T(δ(R)) = T(R)
- Estimate: T(δ(R)) = T(R)/2

If R(a1, a2, …) and we know the V(R, ai):
- T(δ(R)) = Πi V(R, ai)
(In the plans below we use the smaller of the two estimates.)
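
The midpoint estimates above can be collected into small helpers (illustrative only; the names are mine). The last one combines the two duplicate-elimination bounds the way the plans below do:

    def estimate_union(T_R, T_S):          # between max(T(R), T(S)) and T(R)+T(S)
        return (max(T_R, T_S) + T_R + T_S) / 2

    def estimate_intersection(T_R, T_S):   # between 0 and min(T(R), T(S))
        return min(T_R, T_S) / 2

    def estimate_difference(T_R, T_S):     # between T(R)-T(S) and T(R)
        return T_R - T_S / 2

    def estimate_distinct(T_R, V_values=None):
        guess = T_R / 2                    # T(R)/2 ...
        if V_values:                       # ... capped by the product of the V(R, ai) when known
            product = 1
            for v in V_values:
                product *= v
            guess = min(guess, product)
        return guess

    print(estimate_distinct(5000, [50, 100]))   # min(2500, 5000) = 2500.0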
Collecting statistics



- Can explicitly request statistics
- Can maintain them incrementally
- Can collect histograms
  - They give an idea of how the data are distributed
  - Not all patrons borrow an equal number of books
The Zipf distribution (I)




- An empirical distribution, verified in many cases
- Ranks items by frequency/popularity
- If f is the probability of accessing/using the most popular item in the list (rank 1):
  - The probability of accessing/using the second most popular item will be close to f/2
  - The probability of accessing/using the third most popular item will be close to f/3
The Zipf distribution (II)

- Can alter the shape of the distribution

[Figure: frequency vs. rank (ranks 1–50) for a Zipf distribution]

The Zipf distribution (III)



- Can adjust the slope of the curve by adding an exponent
- If f is the probability of accessing/using the most popular item in the list (rank 1):
  - The probability of accessing/using the n-th ranked item will be close to f/n^i
- i = ½ seems to be a good choice
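
A quick way to see the effect of the exponent is to generate the normalized frequencies (illustrative only; not part of the notes):

    def zipf_frequencies(N, i=0.5):
        """Access probabilities for N items ranked 1..N under a Zipf law with exponent i."""
        weights = [1 / rank**i for rank in range(1, N + 1)]   # f, f/2^i, f/3^i, ...
        total = sum(weights)
        return [w / total for w in weights]                   # normalize to probabilities

    freqs = zipf_frequencies(50)
    print(freqs[0], freqs[1], freqs[49])   # most popular, second most popular, least popular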
Example (I)

A library uses two tables to keep track of its books:
- Book(BID, Title, Authors, Loaned, Due)
- Patron(PID, Name, Phone, Address)

The Loaned attribute of a book is equal to:
- The PID of the patron who borrowed the book
- Zero if the book is on the shelves
Example (II)



We want to find the titles of all books currently loaned to "E. Chambas".

- T(Book) = 5,000    V(Book, Loaned) = 200
- T(Patron) = 500    V(Patron, Name) = 500
First plan

X = Book ⋈Loaned = PID Patron
Y = σName = "E. Chambas" (X)

Since PID is the key of Patron, and assuming Loaned were a foreign key in Book:
T(X) = T(Book) = 5,000   (as if all books were borrowed!)
T(Y) = T(X)/V(Patron, Name) = 5,000/500 = 10
Second plan

X = σName = "E. Chambas" (Patron)
Y = Book ⋈Loaned = PID X

T(X) = T(Patron)/V(Patron, Name) = 500/500 = 1
T(Y) = T(Book)×T(X)/max(V(Book, Loaned), V(Patron, PID)) = 5,000×1/500 = 10
Comparing the two plans (I)

The comparison is based on the number of tuples created by the plan, minus:
- The number of tuples constituting the answer
  - It should be the same for all correct plans

For the same reason, we do not consider the number of tuples being read.
Comparing the two plans (II)


- Cost of the first plan: 5,000
- Cost of the second plan: 1
An example (I)


Finding all employees who live in a city where the company has a plant:
- EMPLOYEE(EID, NAME, …, CITY)
- PLANT(PLANTID, …, CITY)

Assume:
- T(E) = 5,000   V(E, CITY) = 100   V(E, NAME) = 5,000
- T(P) = 200     V(P, CITY) = 50
A first plan




X = E ⋈E.CITY = P.CITY P
Y = πE.NAME (X)

T(X) = T(E)×T(P)/max(V(E, CITY), V(P, CITY))
     = 5,000×200/max(100, 50)
     = 1,000,000/100 = 10,000
T(Y) = 10,000   (not possible!)
A second plan








X = πP.CITY (P)
Y = δ(X)
Z = E ⋈E.CITY = Y.CITY Y
U = πE.NAME (Z)

T(X) = T(P) = 200
T(Y) = V(X, CITY) = V(P, CITY) = 50
T(Z) = T(E)×T(Y)/max(V(E, CITY), V(Y, CITY))
     = 5,000×50/max(100, 50) = 2,500
T(U) = T(Z) = 2,500
Comparing the two plans

First plan:
  T(X) = 10,000
  T(Y) = 10,000   (final answer: not counted)
  Cost: 10,000

Second plan:
  T(X) =   200
  T(Y) =    50
  T(Z) = 2,500
  T(U) = 2,500   (final answer: not counted)
  Cost: 2,750

Here it pays off to eliminate duplicates early.
Example [GM]



- We have R(a, b) and S(b, c)
- We want δ(σa="a" (R⋈S))
- We know:
  T(R) = 5,000    T(S) = 2,000
  V(R, a) = 50    V(R, b) = 100
  V(S, b) = 200   V(S, c) = 100
First plan

X1 = σa="a" (R)
X2 = X1⋈S
X3 = δ(X2)

T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = T(X1)×T(S)/max(V(R, b), V(S, b))
      = 100×2,000/max(100, 200)
      = 1,000
T(X3) = min(…, T(X2)/2) = 500   (final answer: doesn't count)
Second plan

X1 = δ(R)
X2 = δ(S)
X3 = σa="a" (X1)
X4 = X3⋈X2

T(X1) = min(V(R, a)×V(R, b), T(R)/2)
      = min(50×100, 5,000/2) = 2,500
T(X2) = min(V(S, b)×V(S, c), T(S)/2)
      = min(200×100, 2,000/2) = 1,000
T(X3) = T(X1)/V(R, a) = 2,500/50 = 50
T(X4) = T(X3)×T(X2)/max(V(R, b), V(S, b))
      = 50×1,000/max(100, 200) = 250   (final answer: doesn't count)
Comparing the two plans

First plan:
  T(X1) =   100
  T(X2) = 1,000
  T(X3) =   500   (final answer: not counted)
  Cost: 1,100

Second plan:
  T(X1) = 2,500
  T(X2) = 1,000
  T(X3) =    50
  T(X4) =   250   (final answer: not counted)
  Cost: 3,550

Here it did not pay off to eliminate duplicates early.
A hybrid plan

X1 = σa="a" (R)
X2 = δ(X1)
X3 = X2⋈S
X4 = δ(X3)

T(X1) = T(R)/V(R, a) = 5,000/50 = 100
T(X2) = min(V(R, b), T(X1)/2) = min(100, 50) = 50
T(X3) = T(X2)×T(S)/max(V(R, b), V(S, b))
      = 50×2,000/max(100, 200) = 500
T(X4) = min(…, T(X3)/2) = 250   (final answer: doesn't count)
Comparing the two best plans

First plan:
  T(X1) =   100
  T(X2) = 1,000
  T(X3) =   500   (final answer: not counted)
  Cost: 1,100

Hybrid plan:
  T(X1) = 100
  T(X2) =  50
  T(X3) = 500
  T(X4) = 250   (final answer: not counted)
  Cost: 650
Reducing the sizes of the tables in a join is a good idea if we can do it on the cheap.
Ordering joins

- Join methods are often asymmetric, so cost(R⋈S) ≠ cost(S⋈R)
- Useful to build a join tree
- A simple greedy algorithm will work well (see the sketch below):
  - Start with the pair of relations whose estimated join size is the smallest
  - Then repeatedly find, among the other relations, the one that would produce the smallest estimated size when joined to the current tree
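
A sketch of that greedy ordering (illustrative only: the statistics format, the join_stats estimator, and the assumption that the join attribute keeps the smaller V after a join are all mine):

    import itertools

    # Each relation is summarized by its statistics:
    # T = number of tuples, V = {attribute: number of distinct values}
    def join_stats(r, s):
        """Estimated statistics of r join s using T(R)*T(S) / product of max(V(R, y), V(S, y))."""
        shared = set(r["V"]) & set(s["V"])
        size = r["T"] * s["T"]
        for y in shared:
            size /= max(r["V"][y], s["V"][y])
        V = {**r["V"], **s["V"]}
        for y in shared:                   # join attribute keeps the smaller V
            V[y] = min(r["V"][y], s["V"][y])
        return {"T": size, "V": V}

    def greedy_join_order(relations):
        rels = list(relations)
        # Start with the pair whose estimated join size is the smallest
        i, j = min(itertools.combinations(range(len(rels)), 2),
                   key=lambda p: join_stats(rels[p[0]], rels[p[1]])["T"])
        order = [rels[i]["name"], rels[j]["name"]]
        tree = join_stats(rels[i], rels[j])
        remaining = [r for k, r in enumerate(rels) if k not in (i, j)]
        while remaining:
            # Add the relation that keeps the estimated intermediate result smallest
            nxt = min(remaining, key=lambda r: join_stats(tree, r)["T"])
            order.append(nxt["name"])
            tree = join_stats(tree, nxt)
            remaining.remove(nxt)
        return order

    R = {"name": "R", "T": 1000, "V": {"B": 20}}
    S = {"name": "S", "T": 2000, "V": {"B": 50, "C": 100}}
    U = {"name": "U", "T": 5000, "V": {"C": 500}}
    print(greedy_join_order([R, S, U]))    # ['S', 'U', 'R'] for these statistics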
Implementing joins

A. Nested loops

W = []
for r in R:
    for s in S:
        if match_found(r, s):    # the join condition holds
            W.append(r + s)      # concatenate the two matching rows

Number of operations:
- T(R)×T(S)
The idea: try to match every tuple of R with every tuple of S.
Optimization

Assume that the second relation can fit in main memory:
- It is read only once
- The number of reads is T(R) + T(S)
B. Sort and merge

- Sort the two tables using the matching attributes as sorting keys
- Can then find the matches with a merge (see the sketch below):
  - A single-pass process unless we have duplicate matches
  - The number of operations is
    O(T(R) log T(R)) + O(T(S) log T(S)) + T(R) + T(S)
    assuming one table has no potential duplicate matches
  - Great if the tables are already sorted
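
A minimal sort-merge join sketch (illustrative only, not from the notes), assuming rows are dicts and the join value is unique in S, which is the single-pass assumption above:

    def sort_merge_join(R, S, a, b):
        R = sorted(R, key=lambda r: r[a])       # O(T(R) log T(R))
        S = sorted(S, key=lambda s: s[b])       # O(T(S) log T(S))
        out, i, j = [], 0, 0
        while i < len(R) and j < len(S):        # single merge pass: T(R) + T(S)
            if R[i][a] < S[j][b]:
                i += 1
            elif R[i][a] > S[j][b]:
                j += 1
            else:                               # match: concatenate the two rows
                out.append({**R[i], **S[j]})
                i += 1                          # safe because the join value is unique in S
        return out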
C. Hashing

- Assume both tables maintain a hash with K buckets on the matching attributes
- for i in range(K):
      join all R entries in bucket i with all S entries in the same bucket
  - We replace one big join by K smaller joins (see the sketch below)
- The number of operations will be:
  K×(T(R)/K)×(T(S)/K) = T(R)×T(S)/K
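
A hash-join sketch along the same lines (illustrative only; K, the dict-based rows and the helper name are assumptions):

    from collections import defaultdict

    def hash_join(R, S, a, b, K=64):
        buckets_R, buckets_S = defaultdict(list), defaultdict(list)
        for r in R:
            buckets_R[hash(r[a]) % K].append(r)     # partition R on the join attribute
        for s in S:
            buckets_S[hash(s[b]) % K].append(s)     # partition S the same way
        out = []
        for i in range(K):                          # K small joins instead of one big one
            for r in buckets_R[i]:
                for s in buckets_S[i]:
                    if r[a] == s[b]:                # a bucket may hold different values
                        out.append({**r, **s})
        return out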