Relational Database Performance - CSCI 6442

Transcript Relational Database Performance - CSCI 6442

Relational Database Performance
CSCI 6442
Copyright 2013, David C. Roberts, all rights reserved
“Optimization”
More properly called access path selection
 “Optimizer” selects a strategy for processing
 Approaches:

◦ Cost-based: estimate total cost to process by different
approaches, choose lowest estimate
◦ Heuristic: use rules to decide how to process

Cost-based is typically used by all database systems
today
2
RUNSTATS
RUNSTATS is the name of a statisticsgathering utility first included in IBM’s
DB2
 It scans the database, gathers statistics
used for estimating costs for access path
selection
 DBA determines how often and when to
run the utility
What statistics do you think are
gathered?

3
Quandary
The more that RUNSTATS collects, the
better job the optimizer can do of
selecting efficient processing methods
 However, RUNSTATS uses a lot of
resources, scanning every relation
 Use of RUNSTATS must be balanced
against its cost

4
The Optimizer
Selects which indexes to use
 Chooses the order of using indexes
 Chooses algorithms to use
 Decides when to apply predicates

5
SQL Statement Parts of Interest
Simple query:
SELECT ENAME, JOB
FROM EMP
WHERE SAL > 20 OR JOB = ‘VP’
OR JOB LIKE ‘PRES%’;
We’re interested in the FROM clause (that
tells us the table names) and the WHERE
clause (that tells us the predicates)
6
SINGLE-TABLE
QUERIES
7
Predicates
WHERE clause of SQL statement is made
up of predicates
 Each predicate is a condition
 Each condition references a column
 Conditions may be equality, inequality,
range, LIKE
 The first three conditions use an index if
one exists, scan the table if no index
exists

8
Example
SAL > 20
OR
JOB = ‘VP’
OR
JOB LIKE ‘PRES%’;
For each predicate, do we use an index to
retrieve rows that make it true, then
examine each row for the other predicates?
9
Predicate Selectivity

Selectivity: an estimate of the fraction of
rows of a table that make a predicate true
10
Classes of Predicates
Predicate: condition in the WHERE clause
 Predicates are combined using AND, OR to
make WHERE clauses
 Classes of predicates:

◦ Sargable: search arguments that can be processed
close to the data
◦ Residual: not sargable, such as complex use of nesting
11
Access Paths

Five possible access paths:
◦
◦
◦
◦
◦

Table scan
Non-selective index scan
Selective index scan
Index only access
Fully qualified unique index
Each of these types of scans has different
cost estimates for its use
12
Predicate Selectivity



Selectivity function f(p): % of rows retrieved on average by
predicate p
Number of rows retrieved is strongly related to the cost to carry
out the operation
n = number of rows in table
Form of P
f
column = value
1/n
column != value
1-1/n (nearly 1)
column > value
(high value - search value)/(high value low value)
Column LIKE ‘value’
n
p1 or p2
f(p1) + f(p2)
p1 and p2
f(p1) * f(p2)
13
Single-Table Queries
Find out which columns that are
referenced in the WHERE clause have
indexes
 Find out selectivity of indexes
 Estimate selectivity of each predicate
 Use most selective index-predicate
combination to retrieve rows that satisfy
one predicate
 Examine each row for other predicates

14
MULTI-TABLE QUERIES
(I.E., JOINS)
15
Join
Result of a join is a subset of the
Cartesian product of the tables being
joined
 Cartesian product of two tables with m
and n rows is a new table of mn rows,
where every row of the join consists of
one row of the first table and one row of
the second table

16
Example Join
17
Simple Join Processing Algorithm
1.
2.
3.
Form the Cartesian product of all tables
involved in the join
Scan rows of the Cartesian product, testing
each against all of the predicates
Eliminate rows of the Cartesian product
that don’t meet the predicates
What’s wrong with this picture?
Think about two tables of 1 million
rows. Cartesian product would be 1
thousand billion rows!
18
Joining More than 2 Relations
A join of more than two relations is
processed 2 relations at a time
 Part of access path planning is to select
that sequence
 We will talk about algorithms for joining 2
tables and then choosing the order of
processing a multi-table join

19
Joins
An equijoin is based on equality of an
attribute of each of two relations
 Outer join includes all rows of both
tables even if some rows did not have a
matching value
 A semi-join can be based on inequalities
as the relationship

20
Join-Processing Algorithms

Nested loop join
◦ Each tuple of outer relation is compared to all
rows of the inner relation
Sort-merge join
 Hash-based join

21
Nested-Loop Join

The algorithm:
For efficiency, the relation with higher
cardinality (R) is chosen as the inner
relation
 Number of operations: NR+ NR* NS

What if there is an index?
22
Nested-Loop
23
Join Order
For JOIN queries, the “outer” table is
access first, “inner” second
 Order for joining tables must be selected
 Most selective first
 Least costly joins first

24
Merge Join
First, each relation is sorted on the join
attribute
 Then both relations are scanned in the
order of the join attributes
 Tuples that satisfy the join predicate are
concatenated and placed in the output
relation
 Number of operations: NR+NS (after the
sort!)

What is there is an index on R or S or both?
25
Merge Join Algorithm
26
Merge Join
27
Hash-Join
The joins we have looked at compare
tuples in the first relation with tuples in
the second relation that cannot possibly
be part of the join
 The goal of the hash join is to compare
only those tuples that might be part of
the join
 Hashing is used to identify those tuples
 There are many variations of hash-join

28
Simple Hash-Join Algorithm
29
Hash-Join Performance
Performance of hash-join can be superior
to other join algorithm
 Performance depends on the hashing
algorithm (although note Lum’s research)
 Perfect hashing algorithm could find
match or non-match with a single probe
 With hash table in RAM, processing
would be very fast

30
Indexes
Impact of a b-tree index on performance
of these algorithms is obvious
 But the index must be maintained itself
 When an attribute that’s indexed in
changed in a relation, the value in the
index must also be changed
 And note that the changes must be
synchronized (and locked together)

31
Order of Processing Joins
Typically, all combinations of order of
processing are considered and a cost
developed for each
 Selectivity of predicates, selectivity of
indexes, cardinality of relations all are factors
in cost analysis
 Goal is to minimize number of intermediate
results produced during processing
 Usually, low selectivity values are processed
first (that is, highest selectivity)

32
Summary
Single-table queries
 Multi-table queries

◦ Nested loop
◦ Sort-merge
◦ Hash

Order of processing joins
33
But Note:
We have left out a LOT
 Relations may be partitioned and joins
processed by partition
 Many other parts of the DBMS affect
performance
 If you are responsible for database
performance, buy a book and dig in
 Remember not to give up on
normalization to get performance

34
And Now: What You Do for
Performance
35
How to Start
First, don’t even consider denormalization
 You have many tools to get the
performance you need without ruining
the data model (and the applications)
 Performance test the applications
 Look for SQL operations that are taking a
long time

36
EXPLAIN
IBM invented the EXPLAIN utility; it
explains the processing strategy for each
WHERE clause
 Run it for operations that are taking too
long
 Look for table scans, cartesian product
joins
 Provide indexes to speed things up

37
EXPLAIN PLAN
Tells you the execution plan an Oracle
database follow for a SQL statement
 Inserts a row describing each step of the
execution plan into a specified table
 Determines total cost of execution

38
39
Beyond EXPLAIN
There are many indexing options, other
options to control physical characteristics
of the database
 Learn about them, learn how to control
them
 But you will go very far with EXPLAIN
and providing indexes

40
THANK YOU!
41

Relational Database Performance - CSCI 6442

Transcript Relational Database Performance - CSCI 6442

Directory