Transcript Document
Algebraic Laws
C D (R) C ( D ( R))
C ( R S ) R C S
C ( R S ) C ( R) S
C ( R D S ) C ( R) D S
For the binary operators, we push the selection
only if all attributes in the condition C are in R.
Example:
•
Consider relation schemas R(A,B) and S(B,C) and the expression
below:
(A=1 OR A=3) AND B<C(R S)
1. Splitting AND
A=1 OR A=3 (B < C(R S))
2. Push to S
A=1 OR A=3 (R B < C(S))
3. Push to R
A=1 OR A=3 (R) B < C(S)
Pushing selections
• Usually selections are pushed down the expression tree.
• The following example shows that it is sometimes useful to pull
selection up in the tree.
StarsIn(title,year,starName)
Movie(title,year,length,studioName)
CREATE VIEW MoviesOf1996 AS
SELECT * FROM MOVIE WHERE year=1996;
Query: Which stars worked for which studios in 1996?
SELECT starName,studioName
FROM MoviesOf1996 NATURAL JOIN StarsIN;
pull selection up
then push down
Laws for (bag) Projection
• A simple law: Project out attributes that are not needed later.
– I.e. keep only the input attr. and any join attribute.
L ( R S ) L ( M ( R) N (S ))
L ( R C S ) L ( M ( R) C N (S ))
L ( R S ) L ( M ( R) N (S ))
L C R C M R
Examples for pushing projection
Schema R(a,b,c), S(c,d,e)
aex ( R S )
aex ( R c,e (S ))
abx,d e y ( R S ) x, y ( abx,c (R) d ey,c (S ))
Example: Pushing Projection
• Schema: StarsIn(title,year,starName)
• Query:
SELECT starName FROM StarsIn
WHERE year = 1996;
starName
starName
Should we transform to
year=1996
?
Depends!
Is StarsIn stored or
computed?
year=1996
starName,year
StarsIn
StarsIn
Reasons for not pushing the projection
• If StarsIn is stored, the for the projection we have to scan the
relation.
• If the relation is pipelined from some previous computation, then
yes, we better do the projection (on the fly).
• Also, if for example there is an index on year for StarsIn, such
index is useless in the projected relation starName,year(StarsIn)
– While such an index is very useful for the selection on “year=1996”
Laws for duplicate elimination and grouping
• Try to move in a position where it can be eliminated altogether
E.g. when is applied on
• A stored relation with a declared primary key
• A relation that is the result of a operation, since grouping creates a relation
with no duplicates.
( L ( R)) L ( R)
Also:
L ( R) L ( M ( R))
What’s M?
absorbs
Improving logical query plans
• Push as far down as possible (sometimes pull them up first).
• Do splitting of complex conditions in in order to push even
further.
• Push as far down as possible, introduce new early (but take
care for exceptions)
• Combine with to produce -joins or equi-joins
• Choose an order for joins
Example of improvement
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE ‘%1960’;
title
title
starname=name AND birthdate LIKE ‘%1960’
starName=name
birthdate LIKE ‘%1960’
StarsIn
MovieStar
StarsIn
MovieStar
And a better plan introducing a projection
to filter out useless attributes:
title
starName=name
name
StarsIn
birthdate LIKE ‘%1960’
MovieStar
Estimating the Cost of Operations
•
•
•
•
We don’t want to execute the query in order to learn the costs.
So, we need to estimate the costs.
What’s the cost?
The number of I/O’s to needed to manage the intermediate
relations.
This number will be a function of the size of intermediate
relations,
–
•
i.e. number of their tuples times the number of bytes per tuple
How can we estimate the number of tuples in an intermediate
relation?
Rules about estimation formulas:
1. Give (somehow) accurate estimates
2. Easy to compute
Projection
• Projection retains duplicates, so the number of tuples in the
result is the same as in the input.
• Result tuples are usually shorter than the input tuples.
• The size of a projection is the only one we can compute exactly.
Selection
Let S = A=c (R)
We can estimate
T(S) = T(R) / V(R,A)
Let S = A<c (R)
On average, T(S) would be T(R)/2, but more properly: T(R)/3
Let S = Ac (R),
Then, an estimate is:
T(S) = T(R)*[(V(R,A)-1)/V(R,A)], or simply
T(S) = T(R)
Selection ...
Let
S = C AND D(R) = C(D(R))
U = D(R).
and
First estimate T(U) and then use this to estimate T(S).
Example
S = a=10 AND b<20(R)
T(R) = 10,000, V(R,a) = 50
T(S) = (1/50)* (1/3) * T(R) = 67
Note: Watch for selections like:
a=10 AND a>20(R)
Selection ...
• Let S = C OR D(R).
• Simple estimate: T(S) = T(C(R)) + T(D(R)).
• Problem: It is possible that T(S)T(R)!
A more accurate estimate
• Let:
– T(R)=n,
– m1 = size of selection on C, and
– m2 = size of selection on D.
• Then T(S) = n(1-(1-m1/n)(1-m2/n)) Why?
• Example: S = a=10 OR b<20(R).
T(R) = 10,000, V(R,a) =50
• Simple estimation: T(S) = 3533
• More accurate: T(S) = 3466
Natural Join R(X,Y) S(Y,Z)
• Anything could happen!
• No tuples join: T(R S) = 0
• Y is the key in S and a foreign key in R (i.e., R.Y refers to S.Y):
Then,
T(R S) = T(R)
• All tuples join: i.e. R.Y=S.Y = a.
Then,
T(R S) = T(R)*T(S)
Two Assumptions
Containment of value sets
• If V(R,Y) ≤ V(S,Y), then every Y-value in R is assumed to occur as
a Y-value in S
• When such thing can happen?
• For example when: Y is foreign key in R, and key in S
Preservation of set values
• If A is an attribute of R but not S, then is assumed that
V(R S, A)=V(R, A)
• This may be violated when there are dangling tuples in R
• There is no violation when: Y is foreign key in R, and key in S
Natural Join size estimation
• Let, R(X,Y) and S(Y,Z), where Y is a single attribute.
• What’s the size of T(R S)?
•
•
•
•
•
Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s join?
Suppose V(R,Y) V(S,Y)
By the containment of set values we infer that:
Every Y’s value in R appears in S.
So, the tuple r of R surely is going match with some tuples of S, but what’s the
probability it matches with s?
• It’s 1/V(S,Y).
• Hence, T(R S) = T(R)*T(S)/V(S,Y)
– When V(R,Y) V(S,Y)
• By a similar reasoning, for the case when V(S,Y) V(R,Y), we get T(R S) =
T(R)*T(S)/V(S,Y).
• So, sumarizing we have as an estimate:
T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}
• Remember:
T(R S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}
• Example:
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
• Estimate the size of R S U
• T(R S) =
40,000,
• T((R S) U) =
400,000
• T(S U) =
20,000,
• T(R (S U)) =
400,000
• The equality of results is
not a coincidence.
• Note 1: estimate of final
result should not
depend on the
evaluation order
• Note 2: intermediate
results could be of
different sizes
Natural join with multiple join attrib.
• R(x,y1,y2) S(y1,y2,z)
• T(R S) = T(R)*T(S)/m1*m2, where
m1 = max{V(R,y1),V(S,y1)}
m2 = max{V(R,y2),V(S,y2)}
• Why?
• Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s
agree on y1?
From the previous reasoning, it’s 1/max{V(R,y1),V(S,y1)}
• Similarly, what’s the probability that r and s agree on y2?
It’s 1/max{V(R,y2),V(S,y2)}
• Assuming that aggrements on y1 and y2 are independent we estimate:
T(R S) =
T(R)*T(S)/[max{V(R,y1),V(S,y1)} * max{V(R,y2),V(S,y2)}]
Example:
T(R)=1000, V(R,b)=20, V(R,c)=100
T(S)=2000, V(S,d)=50, V(S,e)=50
R(a,b,c) R.b=S.d AND R.c=S.e S(d,e,f)
T(R S) = (1000*2000)/(50*100)=400
• Another example: (one of the previous)
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
• Estimate the size of R S U
• Observe that R S U = (R U) S
• T(R U) =
1000*5000 = 5,000,000
• Note that the number of b’s in the product is 20 (V(R,b)), and the number
of c’s is 500 (V(U,c)).
• T((R U) S) =
5,000,000 * 2000 / (50 * 500) = 400,000