Physical Database Design(1)

Download Report

Transcript Physical Database Design(1)

L03: Physical Database Design (I)
Introduction
 Index Selection
 Partitioning & denormalization

H.Lu/HKUST
Overview




After logical design, we have a “good” logical schema that
describes the data to be stored and related constraints;
The next step is physical database design, a process of
producing a physical schema, i.e., a description of the
implementation of the database on secondary storage using
the target DBMS
The main tasks including: choosing indexes, make clustering
decisions, and to refine the conceptual and external schemas
(if necessary) to meet performance goals.
We must begin by understanding the workload:



H.Lu/HKUST
The most important queries and how often they arise.
The most important updates and how often they arise.
The desired performance for these queries and updates.
L03: Physical Database Design -- 2
Understanding the Workload

For each query in the workload:




Which relations does it access?
Which attributes are retrieved?
Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
For each update in the workload:


H.Lu/HKUST
Which attributes are involved in selection/join conditions?
How selective are these conditions likely to be?
The type of update (INSERT/DELETE/UPDATE), and the
attributes that are affected.
L03: Physical Database Design -- 3
T-R Cross-Reference
A.
B.
C.
D.
E.
Emplyee(eno, enam, title)
Pay (title, salary)
Project(pno, pname, budget, loc)
Assignment (eno, pno, duration)
Enter the details of a new project and the employees assigned to it.
Enter the details of a new employee and his/her assignment
Change the salary for a particular job title.
List the project number, name and the average salary of employees
working on the project
List the name of employees and the name of projects they have been
assigned to.
QA
EMP
PAY
PROJ
ASSIGN
H.Lu/HKUST
QB
QC
QD
QE
I R U D I R U D I R U D I R U D I R U D
X
X
X
X X
X
X
X
X
X
X
X
X
L03: Physical Database Design -- 4
Types of Queries (I)

Point query: returns at most one record (or part of a record)
based on an equality selection.
SELECT name FROM Employee WHERE eid=8788

Multipoint query: may return several records based on an
equality selection
SELECT name FROM Employee WHERE dept=Sales

Range query: returns a set of records whose values lie in an
interval of half interval for the selected attribute
 SELECT name FROM Employee WHERE salary > 15000

Prefix match query on an attribute of sequence of attributes X
specifies only a prefix of X
Lname = ‘Lu’ Fname LIKE ‘Ho’
H.Lu/HKUST
L03: Physical Database Design -- 5
Types of Queries
Extremal query: returns a set of records whose value
on some attribute is minimum or maximum.
 Ordering query: Display a set of records in the order
of the value of some attribute (s)
 Grouping query: partition the results of a query into
groups
 Join query: multi-table queries

 Equality join
 Non-equality join
H.Lu/HKUST
L03: Physical Database Design -- 6
Cost Estimation
Each query is associated a cost for its execution.
 The main concern of physical database design is
performance, which involves cost estimation.
 Cost estimation:

 Must estimate cost of each operation in a query
execution plan tree.
• Depends on input cardinalities and processing method.
 Must estimate size of result for each operation in tree!
• Use information about the input relations.
• For selections and joins, assume independence of
predicates.
H.Lu/HKUST
L03: Physical Database Design -- 7
Statistics and Catalogs

Need information about the relations and indexes involved.
Catalogs typically contain at least:
 # tuples (NTuples) and # pages (NPages) for each relation.
 # distinct key values (NKeys) and NPages for each index.
 Index height, low/high key values (Low/High) for each tree
index.

Catalogs updated periodically.
 Updating whenever data changes is too expensive; lots of
approximation anyway, so slight inconsistency ok.

More detailed information (e.g., histograms of the values in
some field) are sometimes stored.
H.Lu/HKUST
L03: Physical Database Design -- 8
Selectivity of Predicates
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk
Consider a query block:
 Maximum # tuples in result is the product of the
cardinalities of relations in the FROM clause.
 Selectivity, s, associated with each term reflects the
impact of the term in reducing result size.

 Term col=value s = 1 / number of distinct values
 Term col > value s = (High -value)/(High - Low)
 Mutiple terms: product of each term
 Implicit assumption that terms are independent!
H.Lu/HKUST
L03: Physical Database Design -- 9
Joins

Theta-join : R  c S 
S

S.id Rid
.
 Result schema same as that of cross-product.
 Fewer tuples than cross-product, might be able to compute more
efficiently
Equi-Join: A special case of condition join where the condition c contains
only equalities.
S

 c ( R  S)
R
sid
R
 Result schema similar to cross-product, but only one copy of fields for
which equality is specified.
Natural Join: Equijoin on all common fields.
H.Lu/HKUST
L03: Physical Database Design -- 10
Join Selectivity

Join selectivity of R
JS 

R
R.A S.B
S
R. A S .B S
RS
Estimate join selectivity
 R.A is a key
 R.A is a key, and S.B is a foreign key reference to R
 Neither the above two is true
H.Lu/HKUST
L03: Physical Database Design -- 11
Decisions to Make

Index selection
 What indexes should we create?
•
Which relations should have indexes? What field(s) should be
the search key? Should we build several indexes?
 For each index, what kind of an index should it be?
•

Clustered? Hash/tree? Dynamic/static? Dense/sparse?
Should we make changes to the conceptual schema?



H.Lu/HKUST
Consider alternative normalized schemas? (Remember, there
are many choices in decomposing into BCNF, etc.)
Should we ``undo’’ some decomposition steps and settle for a
lower normal form? (Denormalization.)
Vertical/Horizontal partitioning, replication, views ...
L03: Physical Database Design -- 12
L03: Physical Database Design (I)
Introduction
 Index Selection
 Partitioning & denormalization

H.Lu/HKUST
Indexes

A Heap file allows us to retrieve records:



Sometimes, we want to retrieve records by
specifying the values in one or more fields, e.g.,



by specifying the rid, or
by scanning all records sequentially
Find all students in the “CS” department
Find all students with a gpa > 3
Indexes are file structures that enable us to answer
such value-based queries efficiently.
H.Lu/HKUST
L03: Physical Database Design -- 14
Indexes

An index on a file speeds up selections on the search
key fields for the index.



Any subset of the fields of a relation can be the search
key for an index on the relation.
Search key is not the same as key (minimal set of
fields that uniquely identify a record in a relation).
An index contains a collection of data entries, and
supports efficient retrieval of all data entries k* with
a given key value k.
H.Lu/HKUST
L03: Physical Database Design -- 15
Alternatives for Data Entry k* in Index

Three alternatives:
 Data record with key value k
 <k, rid of data record with search key value k>
 <k, list of rids of data records with search key k>

Choice of alternative for data entries is orthogonal to the
indexing technique used to locate data entries with a given
key value k.
 Examples of indexing techniques: B+ trees, hash-based
structures
 Typically, index contains auxiliary information that directs
searches to the desired data entries
H.Lu/HKUST
L03: Physical Database Design -- 16
Alternatives for Data Entries (Contd.)

Alternative 1:
 If this is used, index structure is a file organization for
data records (like Heap files or sorted files).
 At most one index on a given collection of data
records can use Alternative 1. (Otherwise, data
records duplicated, leading to redundant storage and
potential inconsistency.)
 If data records very large, # of pages containing data
entries is high. Implies size of auxiliary information
in the index is also large, typically.
H.Lu/HKUST
L03: Physical Database Design -- 17
Alternatives for Data Entries (Contd.)

Alternatives 2 and 3:
 Data entries typically much smaller than data records. So,
better than Alternative 1 with large data records, especially if
search keys are small. (Portion of index structure used to direct
search is much smaller than with Alternative 1.)
 If more than one index is required on a given file, at most one
index can use Alternative 1; rest must use Alternatives 2 or 3.
 Alternative 3 more compact than Alternative 2, but leads to
variable sized data entries even if search keys are of fixed
length.
H.Lu/HKUST
L03: Physical Database Design -- 18
Example B+-Tree Index
17
Root
5
2* 3*




5* 7* 8*
13
20
14* 16*
17* 18*
22
20* 21*
30
22* 27* 29*
33* 34* 38* 39*
Balanced tree
Leaf nodes contain keys and pointers pointing to data pages
Internal nodes contain keys and pointers: Each pointer Pi,
associated to a key Ki, points to a subtree in which all key values k
lie between Ki and Ki+1, P0(Pm) points to a subtree in which all
key values are less then K0 (large than Km),
Each node must at least be half full
H.Lu/HKUST
L03: Physical Database Design -- 19
B+ Trees in Practice

Typical order: 100. Typical fill-factor: 67%.


Typical capacities:



average fanout = 133
Height 4: 1334 = 312,900,700 records
Height 3: 1333 = 2,352,637 records
Can often hold top levels in buffer pool:



Level 1 =
1 page = 8 Kbytes
Level 2 = 133 pages = 1 Mbyte
Level 3 = 17,689 pages = 133 MBytes
H.Lu/HKUST
L03: Physical Database Design -- 20
Key Length and Fanout

The leaf pages contain 40 million key-pointer pairs
Page size = 4K,
length of pointer = 6 bytes,
length of key = 4 bytes
Tree level: 3 levels

How about the key length is 94 bytes?
H.Lu/HKUST
L03: Physical Database Design -- 21
Static Hashing Index
# primary pages fixed, allocated sequentially, never
de-allocated; overflow pages if needed.
 h(k) mod M = bucket to which data entry with key k
belongs. (M = # of buckets)

h(key) mod N
key
0
2
h
N-1
Primary bucket pages
H.Lu/HKUST
Overflow pages
L03: Physical Database Design -- 22
Extensible Hashing
LOCAL DEPTH
2
32*16*
GLOBAL DEPTH
2
00
Bucket A
2
Bucket C
15* 7* 19*
Bucket D
2
011
10*
Bucket C
101
2
110
15* 7* 19*
Bucket D
111
2
4* 12* 20*
010
100
2
H.Lu/HKUST
1* 5* 21* 13* Bucket B
000
001
10*
DIRECTORY
32* 16* Bucket A
2
3
1* 5* 21*13* Bucket B
2
11
3
GLOBAL DEPTH
01
10
LOCAL DEPTH
Bucket A2
(`split image'
of Bucket A)
3
DIRECTORY
4* 12* 20*
Bucket A2
(`split image'
of Bucket A)
L03: Physical Database Design -- 23
Index Classification

Primary vs. secondary: If search key contains primary key,
then called primary index.
 Unique index: Search key contains a candidate key.

Clustered vs. unclustered: If order of data records is the same
as, or `close to’, order of data entries, then called clustered
index.
 Alternative 1 implies clustered, but not vice-versa.
 A file can be clustered on at most one search key.
 Cost of retrieving data records through index varies greatly
based on whether index is clustered or not!
H.Lu/HKUST
L03: Physical Database Design -- 24
Clustered vs. Unclustered Index

Suppose that Alternative (2) is used for data entries, and that
the data records are stored in a Heap file.
 To build clustered index, first sort the Heap file (with some
free space on each page for future inserts).
 Overflow pages may be needed for inserts. (Thus, order of data
recs is `close to’, but not identical to, the sort order.)
CLUSTERED
Index entries
direct search for
data entries
Data entries
UNCLUSTERED
Data entries
(Index File)
(Data file)
Data Records
H.Lu/HKUST
Data Records
L03: Physical Database Design -- 25
Index Classification (Contd.)

Dense vs. Sparse: If there is at
least one data entry per search
key value (in some data record),
then dense.
 Alternative 1 always leads
to dense index.
 Every sparse index is
clustered!
 Sparse indexes are smaller;
however, some useful
optimizations are based on
dense indexes.
H.Lu/HKUST
Ashby, 25, 3000
22
Basu, 33, 4003
Bristow, 30, 2007
25
30
Ashby
33
Cass
Cass, 50, 5004
Smith
Daniels, 22, 6003
Jones, 40, 6003
40
44
Smith, 44, 3000
44
50
Tracy, 44, 5004
Sparse Index
on
Name
Data File
Dense Index
on
Age
L03: Physical Database Design -- 26
Index Classification (Contd.)

Composite Search Keys: Search on a
combination of fields.
 Equality query: Every field value
is equal to a constant value. E.g.
wrt <sal,age> index:
• age=20 and sal =75

Range query: Some field value is
not a constant. E.g.:
• age =20; or age=20 and sal > 10

Data entries in index sorted by
search key to support range queries.
 Lexicographic order, or
 Spatial order.
H.Lu/HKUST
Examples of composite key
indexes using lexicographic order.
11,80
11
12,10
12
12,20
13,75
<age, sal>
10,12
20,12
75,13
name age sal
bob 12
10
cal 11
80
joe 12
20
sue 13
75
12
13
<age>
10
Data records
sorted by name
80,11
20
75
80
<sal, age>
<sal>
Data entries in index
sorted by <sal,age>
Data entries
sorted by <sal>
L03: Physical Database Design -- 27
Clustered Index and Queries
Clustered index can be sparse: save space and
number of disk access
 Good for multipoint queries, i.e., equality accesses to
nonunique fields – the results will be in consecutive
pages.
 Good for equality join on an attribute with few
distinct values;
 If both relations have clustering index on join
attributes, can use merge join

H.Lu/HKUST
L03: Physical Database Design -- 28
Non-clustered index

Non-clustered index are
 best if they cover the query;
• Query can be answered using the index only
 good if each query retrieves significantly fewer
records than there are pages in the file
• Significant: sequential scan is much faster than random
scan
• Useful for point queries
• For multipoint queries, may or may not help
H.Lu/HKUST
L03: Physical Database Design -- 29
Choice of Indexes
One approach: consider the most important queries
in turn. Consider the best plan using the current
indexes, and see if a better plan is possible with an
additional index. If so, create it.
 Before creating an index, must also consider the
impact on updates in the workload!

 Trade-off: indexes can make queries go faster, updates
slower. Require disk space, too.
H.Lu/HKUST
L03: Physical Database Design -- 30
Issues to Consider in Index Selection

Attributes mentioned in a WHERE clause are
candidates for index search keys.
 Exact match condition suggests hash index.
 Range query suggests tree index.
• Clustering is especially useful for range queries,
although it can help on equality queries as well in the
presence of duplicates.

Try to choose indexes that benefit as many queries as
possible. Since only one index can be clustered per
relation, choose it based on important queries that
would benefit the most from clustering.
H.Lu/HKUST
L03: Physical Database Design -- 31
Issues in Index Selection (Contd.)

Multi-attribute search keys should be considered when a
WHERE clause contains several conditions.


If range selections are involved, order of attributes should be
carefully chosen to match the range ordering.
Such indexes can sometimes enable index-only strategies for
important queries.
• For index-only strategies, clustering is not important!

When considering a join condition:

Hash index on inner is very good for Index Nested Loops.
• Should be clustered if join column is not key for inner, and inner
tuples need to be retrieved.

Clustered B+ tree on join column(s) good for Sort-Merge.
H.Lu/HKUST
L03: Physical Database Design -- 32
Example 1

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname=‘Toy’ AND E.dno=D.dno
Hash index on D.dname supports ‘Toy’ selection.
 Given this, index on D.dno is not needed.


Hash index on E.dno allows us to get matching (inner) Emp
tuples for each selected (outer) Dept tuple.
What if WHERE included: `` ... AND E.age=25’’ ?
 Could retrieve Emp tuples using index on E.age, then join with
Dept tuples satisfying dname selection. Comparable to strategy
that used E.dno index.
 So, if E.age index is already created, this query provides much
less motivation for adding an E.dno index.
H.Lu/HKUST
L03: Physical Database Design -- 33
Example 2

SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE E.sal BETWEEN 10000 AND 20000
AND E.hobby=‘Stamps’ AND E.dno=D.dno
Clearly, Emp should be the outer relation.
 Suggests that we build a hash index on D.dno.

What index should we build on Emp?
 B+ tree on E.sal could be used, OR an index on E.hobby could
be used. Only one of these is needed, and which is better
depends upon the selectivity of the conditions.
• As a rule of thumb, equality selections more selective than range
selections.

As both examples indicate, our choice of indexes is guided by
the plan(s) that we expect an optimizer to consider for a
query. Have to understand optimizers!
H.Lu/HKUST
L03: Physical Database Design -- 34
Examples of Clustering

B+ tree index on E.age can be
used to get qualifying tuples.



How selective is the condition?
Is the index clustered?
SELECT E.dno
FROM Emp E
WHERE E.age>40
Consider the GROUP BY query.
SELECT E.dno, COUNT (*)
 If many tuples have E.age > 10,
FROM Emp E
using E.age index and sorting the
WHERE E.age>10
retrieved tuples may be costly.
GROUP BY E.dno
 Clustered E.dno index may be
better!

Equality queries and duplicates:

Clustering on E.hobby helps!
H.Lu/HKUST
SELECT E.dno
FROM Emp E
WHERE E.hobby=Stamps
L03: Physical Database Design -- 35
Clustering and Joins
SELECT E.ename, D.mgr
FROM Emp E, Dept D
WHERE D.dname=‘Toy’ AND E.dno=D.dno

Clustering is especially important when accessing inner tuples in
INL.


Should make index on E.dno clustered.
Suppose that the WHERE clause is instead:
WHERE E.hobby=‘Stamps AND E.dno=D.dno


If many employees collect stamps, Sort-Merge join may be worth
considering. A clustered index on D.dno would help.
Summary: Clustering is useful whenever many tuples are to be
retrieved.
H.Lu/HKUST
L03: Physical Database Design -- 36
Multi-Attribute Index Keys

To retrieve Emp records with age=30 AND sal=4000, an index
on <age,sal> would be better than an index on age or an index
on sal.



If condition is: 20<age<30 AND 3000<sal<5000:


Clustered tree index on <age,sal> or <sal,age> is best.
If condition is: age=30 AND 3000<sal<5000:


Such indexes also called composite or concatenated indexes.
Choice of index key orthogonal to clustering etc.
Clustered <age,sal> index much better than <sal,age> index!
Composite indexes are larger, updated more often.
H.Lu/HKUST
L03: Physical Database Design -- 37
Covering Index

<E.dno>
SELECT D.mgr
FROM Dept D, Emp E
WHERE D.dno=E.dno
A number of
SELECT D.mgr, E.eid
<E.dno,E.eid>
queries can be
FROM Dept D, Emp E
Tree index!
WHERE D.dno=E.dno
answered
without
SELECT E.dno, COUNT(*)
retrieving any
<E.dno>
FROM Emp E
tuples from one
GROUP BY E.dno
or more of the
SELECT E.dno, MIN(E.sal)
relations
<E.dno,E.sal>
FROM Emp E
involved if a
Tree index!
GROUP BY E.dno
suitable index is
<E. age,E.sal> SELECT AVG(E.sal)
available.
or
FROM Emp E
<E.sal, E.age> WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
Tree!
H.Lu/HKUST
L03: Physical Database Design -- 38
Cost Based Index Selection

SQL Server 7 and above
 Input: schema (with existing index), workload
 Output: recommendations adding/removing indexes

Basic approach:
 First step: pick the best index for each SQL statement in the
workload by enumerating relevant indexes on attributes, and
computing the cost of execution plans that would use the index
 Second step: computes the cost of combining the indexes
picked in the first step. The subset of indexes with the lowest
combined cost is used for the final recommendation.

Details: VLDB 02 paper
H.Lu/HKUST
L03: Physical Database Design -- 39
Summary

Understanding the nature of the workload for the application,
and the performance goals, is essential to developing a good
design.
 What are the important queries and updates? What
attributes/relations are involved? Indexes support efficient
retrieval of records based on the values in some fields.

Index is a collection of data entries plus a way to quickly find
entries with given key values. Data entries can be actual data
records, <key, rid> pairs, or <key, rid-list> pairs.
 Choice orthogonal to indexing technique used to locate data
entries with a given key value.

Indexes can be classified as clustered vs. unclustered, primary
vs. secondary, and dense vs. sparse. Differences have
important consequences for utility/performance.
H.Lu/HKUST
L03: Physical Database Design -- 40
Summary (Contd.)


Several indexes can be built on a given file of data records,
each with a different search key.
Indexes must be chosen to speed up important queries (and
perhaps some updates!).







Index maintenance overhead on updates to key fields.
Choose indexes that can help many queries, if possible.
Build indexes to support index-only strategies.
Clustering is an important decision; only one index on a given
relation can be clustered!
Order of fields in composite index key can be important.
Static indexes may have to be periodically re-built.
Statistics have to be periodically updated.
H.Lu/HKUST
L03: Physical Database Design -- 41