Relational Database Performance - CSCI 6442
Download
Report
Transcript Relational Database Performance - CSCI 6442
Relational Database Performance
CSCI 6442
Copyright 2013, David C. Roberts, all rights reserved
“Optimization”
More properly called access path selection
“Optimizer” selects a strategy for processing
Approaches:
◦ Cost-based: estimate total cost to process by different
approaches, choose lowest estimate
◦ Heuristic: use rules to decide how to process
Cost-based is typically used by all database systems
today
2
RUNSTATS
RUNSTATS is the name of a statisticsgathering utility first included in IBM’s
DB2
It scans the database, gathers statistics
used for estimating costs for access path
selection
DBA determines how often and when to
run the utility
What statistics do you think are
gathered?
3
Quandary
The more that RUNSTATS collects, the
better job the optimizer can do of
selecting efficient processing methods
However, RUNSTATS uses a lot of
resources, scanning every relation
Use of RUNSTATS must be balanced
against its cost
4
The Optimizer
Selects which indexes to use
Chooses the order of using indexes
Chooses algorithms to use
Decides when to apply predicates
5
SQL Statement Parts of Interest
Simple query:
SELECT ENAME, JOB
FROM EMP
WHERE SAL > 20 OR JOB = ‘VP’
OR JOB LIKE ‘PRES%’;
We’re interested in the FROM clause (that
tells us the table names) and the WHERE
clause (that tells us the predicates)
6
SINGLE-TABLE
QUERIES
7
Predicates
WHERE clause of SQL statement is made
up of predicates
Each predicate is a condition
Each condition references a column
Conditions may be equality, inequality,
range, LIKE
The first three conditions use an index if
one exists, scan the table if no index
exists
8
Example
SAL > 20
OR
JOB = ‘VP’
OR
JOB LIKE ‘PRES%’;
For each predicate, do we use an index to
retrieve rows that make it true, then
examine each row for the other predicates?
9
Predicate Selectivity
Selectivity: an estimate of the fraction of
rows of a table that make a predicate true
10
Classes of Predicates
Predicate: condition in the WHERE clause
Predicates are combined using AND, OR to
make WHERE clauses
Classes of predicates:
◦ Sargable: search arguments that can be processed
close to the data
◦ Residual: not sargable, such as complex use of nesting
11
Access Paths
Five possible access paths:
◦
◦
◦
◦
◦
Table scan
Non-selective index scan
Selective index scan
Index only access
Fully qualified unique index
Each of these types of scans has different
cost estimates for its use
12
Predicate Selectivity
Selectivity function f(p): % of rows retrieved on average by
predicate p
Number of rows retrieved is strongly related to the cost to carry
out the operation
n = number of rows in table
Form of P
f
column = value
1/n
column != value
1-1/n (nearly 1)
column > value
(high value - search value)/(high value low value)
Column LIKE ‘value’
n
p1 or p2
f(p1) + f(p2)
p1 and p2
f(p1) * f(p2)
13
Single-Table Queries
Find out which columns that are
referenced in the WHERE clause have
indexes
Find out selectivity of indexes
Estimate selectivity of each predicate
Use most selective index-predicate
combination to retrieve rows that satisfy
one predicate
Examine each row for other predicates
14
MULTI-TABLE QUERIES
(I.E., JOINS)
15
Join
Result of a join is a subset of the
Cartesian product of the tables being
joined
Cartesian product of two tables with m
and n rows is a new table of mn rows,
where every row of the join consists of
one row of the first table and one row of
the second table
16
Example Join
17
Simple Join Processing Algorithm
1.
2.
3.
Form the Cartesian product of all tables
involved in the join
Scan rows of the Cartesian product, testing
each against all of the predicates
Eliminate rows of the Cartesian product
that don’t meet the predicates
What’s wrong with this picture?
Think about two tables of 1 million
rows. Cartesian product would be 1
thousand billion rows!
18
Joining More than 2 Relations
A join of more than two relations is
processed 2 relations at a time
Part of access path planning is to select
that sequence
We will talk about algorithms for joining 2
tables and then choosing the order of
processing a multi-table join
19
Joins
An equijoin is based on equality of an
attribute of each of two relations
Outer join includes all rows of both
tables even if some rows did not have a
matching value
A semi-join can be based on inequalities
as the relationship
20
Join-Processing Algorithms
Nested loop join
◦ Each tuple of outer relation is compared to all
rows of the inner relation
Sort-merge join
Hash-based join
21
Nested-Loop Join
The algorithm:
For efficiency, the relation with higher
cardinality (R) is chosen as the inner
relation
Number of operations: NR+ NR* NS
What if there is an index?
22
Nested-Loop
23
Join Order
For JOIN queries, the “outer” table is
access first, “inner” second
Order for joining tables must be selected
Most selective first
Least costly joins first
24
Merge Join
First, each relation is sorted on the join
attribute
Then both relations are scanned in the
order of the join attributes
Tuples that satisfy the join predicate are
concatenated and placed in the output
relation
Number of operations: NR+NS (after the
sort!)
What is there is an index on R or S or both?
25
Merge Join Algorithm
26
Merge Join
27
Hash-Join
The joins we have looked at compare
tuples in the first relation with tuples in
the second relation that cannot possibly
be part of the join
The goal of the hash join is to compare
only those tuples that might be part of
the join
Hashing is used to identify those tuples
There are many variations of hash-join
28
Simple Hash-Join Algorithm
29
Hash-Join Performance
Performance of hash-join can be superior
to other join algorithm
Performance depends on the hashing
algorithm (although note Lum’s research)
Perfect hashing algorithm could find
match or non-match with a single probe
With hash table in RAM, processing
would be very fast
30
Indexes
Impact of a b-tree index on performance
of these algorithms is obvious
But the index must be maintained itself
When an attribute that’s indexed in
changed in a relation, the value in the
index must also be changed
And note that the changes must be
synchronized (and locked together)
31
Order of Processing Joins
Typically, all combinations of order of
processing are considered and a cost
developed for each
Selectivity of predicates, selectivity of
indexes, cardinality of relations all are factors
in cost analysis
Goal is to minimize number of intermediate
results produced during processing
Usually, low selectivity values are processed
first (that is, highest selectivity)
32
Summary
Single-table queries
Multi-table queries
◦ Nested loop
◦ Sort-merge
◦ Hash
Order of processing joins
33
But Note:
We have left out a LOT
Relations may be partitioned and joins
processed by partition
Many other parts of the DBMS affect
performance
If you are responsible for database
performance, buy a book and dig in
Remember not to give up on
normalization to get performance
34
And Now: What You Do for
Performance
35
How to Start
First, don’t even consider denormalization
You have many tools to get the
performance you need without ruining
the data model (and the applications)
Performance test the applications
Look for SQL operations that are taking a
long time
36
EXPLAIN
IBM invented the EXPLAIN utility; it
explains the processing strategy for each
WHERE clause
Run it for operations that are taking too
long
Look for table scans, cartesian product
joins
Provide indexes to speed things up
37
EXPLAIN PLAN
Tells you the execution plan an Oracle
database follow for a SQL statement
Inserts a row describing each step of the
execution plan into a specified table
Determines total cost of execution
38
39
Beyond EXPLAIN
There are many indexing options, other
options to control physical characteristics
of the database
Learn about them, learn how to control
them
But you will go very far with EXPLAIN
and providing indexes
40
THANK YOU!
41