Transcript T(R)

Estimating the Cost of Operations
•
We don’t want to execute the query in order to learn the costs.
So, we need to estimate the costs.
•
How can we estimate the number of tuples in an intermediate
relation?
Rules about estimation formulas:
1. Give (somehow) accurate estimates
2. Easy to compute
Projection
• Projection  retains duplicates, so the number of tuples in the
result is the same as in the input.
• Result tuples are usually shorter than the input tuples.
• The size of a projection is the only one we can compute exactly.
Selection
Let S = A=c (R)
We can estimate
T(S) = T(R) / V(R,A)
Let S = A<c (R)
On average, T(S) would be T(R)/2, but more properly: T(R)/3
Let S = Ac (R),
Then, an estimate is:
T(S) = T(R)*[(V(R,A)-1)/V(R,A)], or simply
T(S) = T(R)
Selection ...
Let
S = C AND D(R) = C(D(R))
U = D(R).
and
First estimate T(U) and then use this to estimate T(S).
Example
S = a=10 AND b<20(R)
T(R) = 10,000, V(R,a) = 50
T(S) = (1/50)* (1/3) * T(R) = 67
Note: Watch for selections like:
a=10 AND a>20(R)
Selection ...
• Let S = C OR D(R).
• Simple estimate: T(S) = T(C(R)) + T(D(R)).
• Problem: It is possible that T(S)T(R)!
A more accurate estimate
• Let:
– T(R)=n,
– m1 = size of selection on C, and
– m2 = size of selection on D.
• Then T(S) = n(1-(1-m1/n)(1-m2/n)) Why?
• Example: S = a=10 OR b<20(R).
T(R) = 10,000, V(R,a) =50
• Simple estimation: T(S) = 3533
• More accurate: T(S) = 3466
Natural Join R(X,Y)  S(Y,Z)
• Anything could happen!
• No tuples join: T(R  S) = 0
• Y is key in S and a foreign key in R (i.e., R.Y refers to S.Y): Then,
T(R  S) = T(R)
• All tuples join: i.e. R.Y=S.Y = a.
Then,
T(R  S) = T(R)*T(S)
Two Assumptions
Containment of value sets
• If V(R,Y) ≤ V(S,Y), then every Y-value in R is assumed to occur as
a Y-value in S
• When such thing can happen?
• For example when: Y is foreign key in R, and key in S
Preservation of set values
• If A is an attribute of R but not S, then it is assumed that
V(R  S, A)=V(R, A)
• This may be violated when there are dangling tuples in R
• There is no violation when: Y is foreign key in R, and key in S
Natural Join size estimation
• Let, R(X,Y) and S(Y,Z), where Y is a single attribute.
• What’s the size of T(R  S)?
•
•
•
•
•
Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s join?
Suppose V(R,Y)  V(S,Y)
By the containment of set values we infer that:
Every Y’s value in R appears in S.
So, the tuple r of R surely is going match with some tuples of S, but what’s the
probability it matches with s?
• It’s 1/V(S,Y).
• Hence, T(R  S) = T(R)*T(S)/V(S,Y)
– When V(R,Y)  V(S,Y)
• By a similar reasoning, for the case when V(S,Y)  V(R,Y), we get T(R  S) =
T(R)*T(S)/V(R,Y).
• So, sumarizing we have as an estimate:
T(R  S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}
T(R  S) = T(R)*T(S)/max{V(R,Y),V(S,Y)}
• Example:
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
• Estimate the size of R  S  U
• T(R  S) =
1000*2000 / 50 = 40,000
• T((R  S)  U) =
40000 * 5000 / 500 = 400,000
• T(S  U) =
20,000
• T(R  (S  U)) =
1000*20000 / 50 = 400,000
• The equality of results is
not a coincidence.
• Note 1: estimate of final
result should not
depend on the
evaluation order
• Note 2: intermediate
results could be of
different sizes
Natural join with multiple join attrib.
• R(x,y1,y2)  S(y1,y2,z)
• T(R  S) = T(R)*T(S)/m1*m2, where
m1 = max{V(R,y1),V(S,y1)}
m2 = max{V(R,y2),V(S,y2)}
• Why?
• Let r be a tuple in R and s be a tuple in S. What’s the probability that r and s
agree on y1?
From the previous reasoning, it’s 1/max{V(R,y1),V(S,y1)}
• Similarly, what’s the probability that r and s agree on y2?
It’s 1/max{V(R,y2),V(S,y2)}
• Assuming that aggrements on y1 and y2 are independent we estimate:
T(R  S) =
T(R)*T(S)/[max{V(R,y1),V(S,y1)} * max{V(R,y2),V(S,y2)}]
Example:
T(R)=1000, V(R,b)=20, V(R,c)=100
T(S)=2000, V(S,d)=50, V(S,e)=50
R(a,b,c)  R.b=S.d AND R.c=S.e S(d,e,f)
T(R  S) = (1000*2000)/(50*100)=400
• Another example: (one of the previous)
R(a,b), T(R)=1000, V(R,b)=20
S(b,c), T(S)=2000, V(S,b)=50, V(S,c)=100
U(c,d), T(U)=5000, V(U,c)=500
• Estimate the size of R  S  U
• Observe that R  S  U = (R  U)  S
• T(R  U) =
1000*5000 = 5,000,000
• Note that the number of b’s in the product is 20 (V(R,b)), and the number
of c’s is 500 (V(U,c)).
• T((R  U)  S) =
5,000,000 * 2000 / (50 * 500) = 400,000
Size estimates for other operations
Cartesian product:
T(R  S) = T(R) * T(S)
Bag Union: sum of sizes
Set union:
larger + half the smaller. Why?
Because a set union can be as large as the sum of sizes or as small as
the larger of the two arguments. Something in the middle is suggested.
Intersection: half the smaller. Why?
Because intersection can be as small as 0 or as large as the sizes of the
smaller. Something in the middle is suggested.
Difference: T(R-S) = T(R) - 1/2*T(S)
Because the result can have between T(R) and T(R)-T(S). Something
in the middle is suggested.
Size estimates for other operations
Duplicate elimination  in (R(a1,...,an)):
The size ranges from 1 to T(R).
T((R))= V(R,[a1...an]), if available (but usually not available).
Otherwise: T((R))= min[V(R,a1)*...*V(R,an), 1/2*T(R)] is
suggested. Why?
V(R,a1)*...*V(R,an) is the upper limit on the number of distinct
tuples that could exist
1/2*T(R) is because the size can be as small as 1 or as big as T(R)
Grouping and Aggregation:
similar to , but only with respect to grouping attributes.
Computing the statistics
• It is the DBA who orders the statistics to be computed by the
system.
• T(R)’s, and V(R,A)’s are just aggregation queries (COUNT
queries).
• However, they are expensive to be computed.
Incremental computation of statistics
Maintaining T(R):
• Add 1 for every insertion and subtract 1 for every deletion.
– What’s the problem?
• If there is a B-Tree on any attribute of R, then:
Just keep track of the B-Tree blocks and infer the approximate size
of the relation.
• Requires effort only when?
• On B-Tree changes, which is relative rare compared with the rate
of insertions and deletions.
Incremental computation of statistics
Maintaining V(R,A):
• If there is an index on attribute A of a relation R, then:
– On insert into R, we must find the A-value for the new tuple in
the index anyway, and so we can determine whether there is
already such a value for A. If not increment V(R,A).
– On deletion…
• If there isn’t an index on A, the system could in effect create a
rudimentary index by keeping a data structure (e.g. B-Tree) that
holds every value of A.
• Final option: Sampling the relation.
Histograms
1-2 3-4
4-5
Equal width
6-7 8-9
4
7
rest
Most frequent values
Advantage: more accurate estimate of the size of a join.
Example (freq. Values histogram)
Estimate U = R(a,b)  S(b,c)
V(R,b) = 14. Histogram for R.b:
0:150, 1:200, 5:100, rest: 550
V(S,b) = 13. Histogram for S.b:
0:100, 1:80, 2:70, rest: 250
• Tuples in U
– on 0:
• 100*150 = 15,000
– on 1:
We have 9 values, because V(S,b)<V(R,b), and
by the preservation of value sets assumption, all
the 9 values we didn’t consider yet in S, will be
in R as well.
• 200*80 = 16,000
– on 2:
• 70 * (550/(14-3)) = 3500
– on 5:
• 100 * (250/(13-3)) = 2500
– on the 9 other values:
• 9*(550/11)*(250/10) = 9*1250
Total T(U) = 15000 + 16000 + 3500 +
2500 + 9*1250 = 48,250
Simple estimate (equal occurrence
assumption) T(U) = 1000*500/14 =
35,714
Example (equal width histogram)
• Schemas:
Jan(day,temp)
July(day,temp)
• Query:
SELECT Jan.day, July.day
FROM Jan, July
WHERE Jan.temp=July.temp;
Size of join of each band T1*T2/Width
On band 40-49: 10*5/10 = 5
On band 50-59: 5*20/10 = 10
 size of the result is thus 5+10 = 15
Without using the histogram we would
estimate the size as
245*245/100 = 600 !!