SSSS - Computer Science

Download Report

Transcript SSSS - Computer Science

QUERY PROCESSING
&
OPTIMIZATION
Dr. Awad Khalil
Computer Science Department
AUC
CSCI 453 -- Query Processing
1
Content
Why Query Optimization?
 Optimization Procedure
 Syntactic and semantic checking
 Casting the query into some internal
representation
 Heuristic optimization
 Semantic Query Optimization
 Systematic Optimization
 Selection of the Cheapest Plan
 Code Generation for the Access Plan

CSCI 453 -- Query Processing
2
Why Query Optimization

When users submit queries to a DBMS, they expect a response that is not
only correct and consistent; but also timely, that is, it is produced in an
acceptable period of time.

Queries can be written in a number of different ways, many of them being
inefficient. Therefore, the DBMS should take a query and, before it is run,
prepare a version that can be executed efficiently, a process that is known as
Query Optimization.

In practice, a DBMS is concerned with the improvement of execution strategy
rather finding the most efficient version, which is effectively impossible for
most queries.

The larger a database becomes, the greater the need for a query optimizer.

It is one of the strengths of relational databases that query optimization can
be done automatically by a software optimizer included within the DBMS
software.
CSCI 453 -- Query Processing
3
Optimization Procedure
1. Syntactic and semantic checking.
2. Casting the query into some internal representation.
3. Heuristic optimization (converting to a more efficient form).
4. Semantic query optimization.
5. Systematic optimization.
6. Selection of the cheapest plan.
7. Code generation for the access plan.
CSCI 453 -- Query Processing
4
1- Syntactic and semantic checking
A
query is expressed in a high-level language (SQL)
and is parsed to check if it syntactically correct (i.e.
obeys the rules of SQL grammar).
 The
query is then validated to see if it semantically
correct (i.e. verified to see that the attributes, tables,
views and other objects actually exist in the database).
CSCI 453 -- Query Processing
5
2- Casting the query into some internal
representation

The query is converted into a form more suitable for machine
manipulation. In relational data models, the form is based on
relational algebra or relational calculus. The relational
algebra is commonly used and usually manipulated in the form
of operator graphs.
S(S#, Sname, Status, City)
SP(S#, P#, Qty)
Query: Get names of suppliers
Who supply part P2.
((S Join SP) Where P#=‘P2’)[Sname]
CSCI 453 -- Query Processing
6
3- Heuristic optimization


(Converting to more efficient form)
Heuristic optimization depends on the syntax and not the
semantics of the database.
It is based only on the general qualities of the relational algebra
expressions and involves substituting relational algebra
expressions with more efficient expressions using equivalence
preserving transformation rules.
Objectives:
A- Minimize disk input/output and processing by reducing the size
of intermediate tables in a query.
B- Directly reducing the amount of computation involved by
rewriting simpler expression.
C- Reducing the amount of computation involved by rewriting
expression, although more complex but less computationally
demanding.
CSCI 453 -- Query Processing
7
A- Rules that tend to reduce the size of intermediate
tables
1.
Perform selection as early as possible (especially
before join)
STUDENT (Std#, Sname)
COURSE (Course#, Cname, Instructor)
RGISTRATION (Std#, Course#, Date)
GRADE (Std#, Course#, Grade)
Query: List the names of students
registered in the database course.
((STUDENT Join (REGISTRATION
Join COURSE)) Where Cname =
‘Database’ ) [Sname]
CSCI 453 -- Query Processing
8
Perform selection as early as possible (especially before join)

The selection operator can
be pushed as far down the
operator graph as possible.

At intermediate nodes, the
operators are pushed down
the appropriate branches.
CSCI 453 -- Query Processing
9
Perform selection as early as possible (especially before join)

The selection operator can be
passed down again:
CSCI 453 -- Query Processing
10
A- Rules that tend to reduce the size of intermediate
tables
2.
Perform projection as early as possible

Under certain conditions projection may be
commuted with a join. When a projection is preceded
by a join, it is possible to push the projection down
before the join, but the projection acquires new
attributes, therefore the original projection must be
performed after the join. Unless the cardinalities of
the intermediate relations are reduced, the usefulness
of pushing a projection before a join is questionable.
CSCI 453 -- Query Processing
11
Perform projection as early as possible
(STUDENT Join (((REGISTRATION
Join (COURSE Where Cname =
‘Database’)) [Std#, Course#])))
[Sname]
The projection,
Project(Std#, Course#)
should be pushed down the tree!
CSCI 453 -- Query Processing
12
Perform projection as early as possible
CSCI 453 -- Query Processing
13
A- Rules that tend to reduce the size of intermediate
tables
3.
Perform select before project

If the selection condition involves only some of the attributes
in the projection list, then the two operations can be
commuted, e.g.:
((GRADE [Std#, Course#]) Where Std# = 123)
can be commuted to:
(GRADE Where Std# = 123) [Std#, Course#]
CSCI 453 -- Query Processing
14
B- Rules that tend to directly reduce the amount of
computation involved – making expression simpler
1.
Combine a cascade of selections into one selection.
Query: Get the full details of courses with course number CSCI-453
where the instructor is Khalil.
((COURSE Where Instructor = ‘Khalil’) Where Course# = ‘CSCI-453’)
Can be converted to:
COURSE Where Instructor = ‘Khalil’ AND Course# = ‘CSCI-453’
CSCI 453 -- Query Processing
15
B- Rules that tend to directly reduce the amount of
computation involved – making expression simpler
2.
Combine a cascade of projections into one
projection
((COURSE [Cname, Instructor]) [Cname])
Can be converted to:
COURSE [Cname]
CSCI 453 -- Query Processing
16
C- Rules that tend to directly reduce the amount of
computation involved – rewriting in a less
computationally demanding form
Where P OR (Q AND R)
Can be rewritten as:
Where (P OR Q) AND (P OR R)
CSCI 453 -- Query Processing
17
4- Semantic Query Optimization

Semantic query optimization transformations use constraints on
database schema to modify queries.
Example: Consider the Join of the two tables:
SP (S#, P#, Qty) and
P (P#, Pname, Color, Weight, City)
If SP.P# is a foreign key (with no nulls allowed) and is matched to the
primary key P.P#, then
(SP Join P) [S#]
can be transformed to
SP[S#].
CSCI 453 -- Query Processing
18
5- Systematic Optimization
 Having
reorganized the query, the Query Optimizer
must then consider how to retrieve the information
physically from the database.
Query Optimizer generates a query plan
(execution strategy, access plan, or execution plan)
using access routines (access aids or low level
implementation procedures) for the various operations.
 The
 In
optimizing at this level, an accurate cost estimate for
each execution strategy must be calculated. Cost
estimation is a time consuming task.
CSCI 453 -- Query Processing
19
5- Systematic Optimization (Cont’d)
Statistical Information:
 Systematic optimizers may make use of the following
information:










Number of tuples.
Number of blocks used to store these tuples.
Number of distinct data values.
Percent of total number of relevant database blocks used by
the relation.
Ordering of tuples in the blocks.
Blocking factor for each file.
Existence and type of indexes.
Number of levels of each index.
Number of blocks for packed relations.
Physical clustering of records.
CSCI 453 -- Query Processing
20
5- Systematic Optimization (Cont’d)
Types of cost functions:
1- Accesses to secondary storage costs:
Cost of searching for, reading and writing data blocks that reside on secondary
storage. Temporary, intermediate files may also need to be accessed; this represents
significant problem as there must also be an accurate estimate of the size of
intermediate results to calculate the number of I/O required. The access cost is the
number of blocks that must be brought into main memory for reading and the number
of blocks that must be written out to secondary storage.
2- Computation Costs:




Cost of performing in-memory operations in the data buffers during query executions.
Operations include:
Searching for records.
Sorting records.
Merging records for a join.
Performing computations on field values.
3- Communication Costs:
Cost of shipping the query and results from database site to terminal where the query
originated.
CSCI 453 -- Query Processing
21
5- Systematic Optimization (Cont’d)
Goals:
 For large databases, the main emphasis is on reducing accesses
costs to secondary storage, i.e., the number of block transfers
between disk and memory.
 Small databases, in which most data can be stored in memory
focus on minimizing computation.
 In case of distributed databases, communication costs must also
be minimized.
 It is difficult to include all cost components into a weighted cost
function, therefore, most cost functions consider a single factor
only or possibly a combination of one factor for estimating I/O
and one factor to estimate the use of CPU.
CSCI 453 -- Query Processing
22
5- Systematic Optimization (Cont’d)
Use of Costs:
t( R ) - the number of tuples in relation R.
b( R ) - the number of blocks needed to store the relation R, if R is packed
forms.
bf( R ) - the number of tuples per block, also called the blocking factor of
R.
If R is packed. Then b( R) = t( R ) / bf( R ).
n(A, R) - the number of distinct values of attribute A in relation R. This can
be used to approximate the number of tuples (t) that have a
particular value. If we assume that the values of A are uniformly
distributed in R, then the number of tuples expected to have a
particular value c for A (called the selection size):
s(A=c, R) - average number of records that will satisfy an equality selection
condition on an attribute. If we assume that the values for A are
uniformly distributed in R, then s = t( R )/n(A,R).
CSCI 453 -- Query Processing
23
5- Systematic Optimization (Cont’d)
Example:
Consider the following database:
STUDENT (Stuid, Stuname, Major, Credits)
ENROLL (Course#, Stuid, Grade)
 Thus to estimate the number of students in the university with a
CS major:
If there are 10,000 students, then t(STUDENT) = 10,000
-
-
If there are 25 possible major subjects, then n(MAJOR, STUDENT) = 25
Then, we can estimate the number of CS majors as: s(MAJOR=’CS’,
STUDENT) = t(STUDENT)/n(MAJOR, STUDENT) = 10,000/25 = 400
Note that if A is a primary key, then,
n(A, R)=t( R ) and the selection
size is 1.
CSCI 453 -- Query Processing
24
5- Systematic Optimization (Cont’d)
Processing Joins:

In examining the systematic cost of evaluating a typical query, say a Join,
most processors focus on the accesses to secondary storage. This cost will
involve in not only the effort of retrieving the tables that are input to a query
and writing out the final result, but also involve the costs of writing and
reading any intermediate tables. This is especially important when
considering the size of intermediate results in complex query.
CSCI 453 -- Query Processing
25
5- Systematic Optimization (Cont’d)

To calculate size of a Join, say between R and S, of
size t( R ) and t( S ) respectively, we first need to
estimate the number of tuples of R that will match of S
on the corresponding attributes. There are several
distinct possibilities to consider:
1.
If there are no common attributes, Join becomes a Product, and
the number of tuples in the result is t( R ) * t( S ).
2.
If the set of common attributes is a key for one relation, then the
number of tuples in the Join can be no larger than the number of
tuples in the other relation, e.g., if the common attributes are a
key for R then the size of the Join is less than or equal to t( S ).
CSCI 453 -- Query Processing
26
5- Systematic Optimization (Cont’d)
Methods of Processing Joins:
 Nested loops using blocks.
 Sort-merge.
 Using
an index or hash key.
CSCI 453 -- Query Processing
27
5- Systematic Optimization (Cont’d)
Nested loops using blocks:

Assuming that both R and S are packed relations having b( R ) and b( S )
respectively, then if we have two buffers, we can bring the first block of R
into the first buffer and bring each block of S in turn into the second buffer,
compare each tuple of the r block with each tuple of the s block before
switching in the next s block. When we have finished all the s blocks we bring
the next r block into the first buffer and so on. The algorithm can be shown as:
For each block of R
For each block of S
For each tuple in the R block
For each tuple in the S block
If the tuples satisfy the condition then add to join
End
End
End
End
CSCI 453 -- Query Processing
28
5- Systematic Optimization (Cont’d)
Cost Functions for JOIN using Nested Loops:

Refer to the text book “Fundamentals of Database Systems”, El
Masri:


Third Edition: pages 618 – 621
Fourth Edition: pages 527 – 529
CSCI 453 -- Query Processing
29
5- Systematic Optimization (Cont’d)
Sort-Merge Join:
 In the previous method, the assumption has been made
that the tuples in the tables are not sorted in any
particular way. If both files are sorted on the attribute to
be joined, then another access method, Sort-Merge Join
is preferable (see references).
CSCI 453 -- Query Processing
30
5- Systematic Optimization (Cont’d)
Using an Index or Hash Key:




If one of the files, S, has an index on the common attribute A, or if A is a hash key, then each
tuple of R would be retrieved in the usual way and the index or hashing algorithm would be used
to find all the matching records of S.
For example, to find STUDENT Join ENROLL, representing S and R respectively, we access
each ENROLL record in sequence and then use the index on Stuid to find each matching
STUDENT record. The overall cost depends on the type of index as follows:
If A is the primary key of S and we have a primary index on S then the access cost is the cost of
accessing all blocks of R plus the cost of reading the index and accessing one record of S for
each of the tuples in R:
b( R ) + (t(R) * (L(indexname) + 1),
where
L(indexname) is the number of levels in a multilevel index, which is equivalent to the average
number of index accesses to find an entry.
If the index is a clustering index (a file with a clustering index is one in which data records are
physically ordered on a non-key field that does not have a distinct value for each record) with
selection size s(A=c,S), the cost is: b( R ) + (t(R) * (L(indexname) + s(A=c, S)/bf(S)))
If we have a hash function (in a hash file, the hash field of a record is subjected to a hashing
function which gives the block address that contains that record) instead of an index, the cost is:
b( R ) + (t(R) * h)
where,
h is the average number of accesses to get a block from the its hash key value.
If the index is a secondary index (a secondary index is an ordered file of data values that point to
entities in a data file containing this value in one of its fields), we get:
b( R ) + (t(R) * (L(indexname) + s(A=c,S)))
CSCI 453 -- Query Processing
31
6- Selection of the Cheapest Plan
 Having
generated several access plans, the
systematic optimizer would then choose the
cheapest.
CSCI 453 -- Query Processing
32
7- Code Generation for the Access Plan
 The
cheapest plan is then coded for execution at
the appropriate time.
CSCI 453 -- Query Processing
33
Thank you
CSCI 453 -- Query Processing
34