Transcript Chapter 8

Chapter 8
Physical Database Design
McGraw-Hill/Irwin
Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Outline





Overview of Physical Database Design
File Structures
Query Optimization
Index Selection
Additional Choices in Physical Database
Design
8-2
Overview of Physical Database
Design
 Sequence of decision-making processes.
 Decisions involve the storage level of a
database: file structure and optimization
choices.
 Importance of providing detailed inputs
and using tools that are integrated with
DBMS usage of file structures and
optimization decisions
8-3
Storage Level of Databases
 Closest to the hardware and operating
system.
 Physical records organized into files.
 The number of physical record accesses is
an important measure of database
performance.
 Difficult to predict physical record
accesses
8-4
Logical Records (LR) and
Physical Records (PR)
(a) Multiple LRs per PR
PR
LR
(b) LR split across PRs
LR
PR
LR
(c) PR containing LRs
from different tables
PR
LRT1
LRT2
LR
PR
LRT2
8-5
Transferring Physical Records
Application buffers:
Logical records (LRs)
LR 1
DBMS Buffers:
Logical records (LRs) inside
of physical records (PRs)
read PR
1
LR 2
LR 3
LR 4
Operating system:
Physical records
(PRs) on disk
read
LR 1
LR 2
write
PR1
write
PR2
LR 3
PR2
LR 4
8-6
Objectives
 Minimize response time to access and
change a database.
 Minimizing computing resources is a
substitute measure for response time.
 Database resources
 Physical record transfers
 CPU operations
 Communication network usage (distributed
processing)
8-7
Constraints
 Main memory and disk space
 Minimizing main memory and disk
space can lead to high response times.
 Useful to consider additional memory
and disk space
8-8
Combined Measure of
Database Performance
 Weight combines physical record
accesses and CPU usage
 Weight is usually close to 0
 Mmany CPU operations can be performed
in the time to perform one physical record
transfer.
Combined Measure of Database Performance :
PRA + W * CPU-OP
8-9
Inputs, Outputs, and
Environment
Table design
(from logical database design)
File structures
Table
profiles
Application
profiles
Physical
database
design
Data placement
Data formatting
Denormalization
Knowledge about file structures and
query optimization
8-10
Difficulty of physical database
design





Number of decisions
Relationship among decisions
Detailed inputs
Complex environment
Uncertainty in predicting physical record
accesses
8-11
Inputs of Physical Database
Design
 Physical database design requires inputs
specified in sufficient detail.
 Table profiles used to estimate
performance measures.
 Application profiles provide importance of
applications.
8-12
Table Profile
 Tables
 Number of rows
 Number of physical records
 Columns
 Number of unique values
 Distribution of values
 Correlation of columns
 Relationships: distribution of related rows
8-13
Histogram
 Specify distribution of values
 Two dimensional graph
 Column values on the x axis
 Number of rows on the y axis
 Variations
 Equal-width: do not work well with skewed
data
 Equal-height: control error by the number of
ranges
8-14
Equal-Width Histogram
Number of Rows
Salary Histogram (Equal Width)
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
10000 50000
50001 90000
90001 - 130001 - 170001 - 210001 - 250001 - 290001 - 330001 - 370001 130000 170000 210000 250000 290000 330000 370000 410000
Salary
8-15
Equal-Height Histogram
Number of Rows
Salary Histogram (Equal Width)
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
10000 50000
50001 90000
90001 - 130001 - 170001 - 210001 - 250001 - 290001 - 330001 - 370001 130000 170000 210000 250000 290000 330000 370000 410000
Salary
8-16
Application profiles
 Application profiles summarize the
queries, forms, and reports that access a
database.
Typical Components of an Application Profile
Application Type
Query
Form
Report
Statistics
Frequency; distribution of parameter values
Frequency of insert, update, delete, and retrieval
operations to the main form and the subform
Frequency; distribution of parameter values
8-17
File structures
 Selecting among alternative file structures
is one of the most important choices in
physical database design.
 In order to choose intelligently, you must
understand characteristics of available file
structures.
8-18
Sequential Files





Simplest kind of file structure
Unordered: insertion order
Ordered: key order
Simple to maintain
Provide good performance for processing
large numbers of records
8-19
Unordered Sequential File
StdSSN
PR1
Name ...
123-45-6789 Joe Abbot ...
788-45-1235 Sue Peters ...
122-44-8655 Pat Heldon ...
...
Insert a new logical
record in the last
physical record .
PRn
466-55-3299 Bill Harper ...
323-97-3787 Mary Grant ...
543-01-9593 Tom Adtkins
8-20
Ordered Sequential File
StdSSN
PR1
Name ...
122-44-8655 Pat Heldon ...
123-45-6789 Joe Abbot ...
323-97-3787 Mary Grant ...
...
Rearrange physical record
to insert new logical record.
PRn
466-55-3299 Bill Harper ...
788-45-1235 Sue Peters ...
543-01-9593 Tom Adtkins
8-21
Hash Files
 Support fast access by unique key value
 Convert a key value into a physical record
address
 Mod function: typical hash function
 Divisor: large prime number close to the file
capacity
 Physical record number: hash function plus
the starting physical record number
8-22
Example: Hash Function
Calculations for StdSSN Key
.
StdSSN
122448655
123456789
323973787
466553299
788451235
543019593
StdSSN Mod 97 PR Number
26
176
39
189
92
242
80
230
24
174
13
163
8-23
Hash File after Insertions
PR163
543-01-9593 Tom Adtkins
123-45-6789 Joe Abbot
PR230
466-55-3299 Bill Harper
...
...
PR189
PR174
788-45-1235 Sue Peters
122-44-8655 Pat Heldon
...
...
PR176
PR242
323-97-3787 Mary Grant
8-24
Collision Handling Example
Home address = Hash function value + Base address
(122448946 mod 97 = 26) + 150
PR176
...
122-44-8655 Pat Heldon
...
122-44-8752 Joe Bishop
...
Home address (176) is full.
122-44-8849 Mary Wyatt
...
122-44-8946 Tom Atkins
adtkins
PR177
Linear probe to find
a
physical record with space
122-44-8753 Bill Hayes
...
...
8-25
Hash File Limitations
 Poor performance for sequential search
 Reorganization when capacity exceeds
70%
 Dynamic hash files reduce random search
performance but eliminate periodic
reorganization
8-26
Multi-Way Tree (Btrees) Files
 A popular file structure supported by most
DBMSs.
 Btree provides good performance on both
sequential search and key search.
 Btree characteristics:




Balanced
Bushy: multi-way tree
Block-oriented
Dynamic
8-27
Structure of a Btree of Height 3
Root
node
Level
0
Level
1
Level
2
...
...
...
Leaf nodes
8-28
Btree Node Containing Keys
and Pointers
Key1 Key 2 ... Key d
Pointer 1
... Key 2d
Pointer 2 Pointer 3 Pointer d+1 ... Pointer 2d+1
Each non root node contains at least half capacity
(d keys and d+1 pointers).
Each non root node contains at most full capacity
(2d keys and 2 d+1 pointers).
8-29
Btree Insertion Examples
(a) Initial Btree
20
22
28
35
45
70
40
50
60
65
(b) After inserting 55
20
22
28
35
45
70
40
50
55
(c) After inserting 58
20
22
28
35
40
50
45
58
70
55
60
65
Middle key value
(58) moved up
60
Node split
65
8-30
Btree Deletion Examples
(a) Initial Btree
20
22
28
45
70
35
50
60
50
65
65
(b) After deleting 60
20
22
28
45
70
35
(c) After deleting 65
20
22
28
35
70
45
50
Borrowing a key
8-31
Cost of Operations
 The height of Btree dominates the number
of physical record accesses operation.
 Logarithmic search cost
 Upper bound of height: log function
 Log base: minimum number of keys in a node
 Insertion cost
 Cost to locate the nearest key
 Cost to change nodes
8-32
B+Tree
 Provides improved performance on
sequential and range searches.
 In a B+tree, all keys are redundantly
stored in the leaf nodes.
 To ensure that physical records are not
replaced, the B+tree variation is usually
implemented.
8-33
B+tree Illustration
Index set
Link to first leaf
...
...
Sequence set
8-34
Index Matching
 Determining usage of an index for a query
 Complexity of condition determines match.
 Single column indexes: =, <, >, <=, >=, IN
<list of values>, BETWEEN, IS NULL,
LIKE ‘Pattern’ (meta character not the first
symbol)
 Composite indexes: more complex and
restrictive rules
8-35
Index Matching Examples






C2 BETWEEN 10 and 20: match on C2
C3 IN (10,20): match on C3
C1 <> 10: no match
C4 LIKE 'A%‘: match on C4
C4 LIKE '%A‘: no match
C2 = 5 AND C3 = 20 AND C1 = 10: matches on
index with C1, C2, and C3
8-36
Bitmap Index
 Can be useful for stable columns with few values
 Bitmap:
 String of bits: 0 (no match) or 1 (match)
 One bit for each row
 Bitmap index record
 Column value
 Bitmap
 DBMS converts bit position into row identifier.
8-37
Bitmap Index Example
Faculty Table
RowId
1
2
3
4
5
6
7
8
9
10
11
12
FacSSN
098-55-1234
123-45-6789
456-89-1243
111-09-0245
931-99-2034
998-00-1245
287-44-3341
230-21-9432
321-44-5588
443-22-3356
559-87-3211
220-44-5688
… FacRank
Asst
Asst
Assc
Prof
Asst
Prof
Assc
Asst
Prof
Assc
Prof
Asst
Bitmap Index on FacRank
FacRank
Asst
Assc
Prof
Bitmap
110010010001
001000100100
000101001010
8-38
Bitmap Join Index
 Bitmap identifies rows of a related table.
 Represents a precomputed join
 Can define for a join column or a non-join
column
 Typically used in query dominated
environments such as data warehouses
(Chapter 16)
8-39
Summary of File Structures
Unordered Ordered Hash
Sequential Y
search
Key
search
Range
search
Usage
B+tree
Bitmap
Y
Extra PRs Y
N
Linear
Linear
N
Y
Constant
time
N
Y
Primary
only
Primary
only
Logarithmic Y
Y
Primary
Primary or
or
secondary
secondary
Secondary
only
8-40
Query Optimization
 Query optimizer determines
implementation of queries.
 Major improvement in software
productivity
 Improve performance with knowledge
about the optimization process
8-41
Translation Tasks
Query
Syntax and semantic
analysis
Parsed query
Query transformation
Relational algebra query
Access plan
evaluation
Access
Plan
Access
Plan
Access plan
interpretation
Code generation
Query results
Machine code
8-42
Access Plan Evaluation
 Optimizer evaluates thousands of access
plans
 Access plans vary by join order, file
structures, and join algorithm.
 Some optimizers can use multiple indexes
on the same table.
 Access plan evaluation can consume
significant resources
8-43
Access Plan Example 1
Sort Merge Join
Sort(FacSSN)
BTree(FacSSN)
Sort Merge Join
Faculty
Btree(OfferNo)
BTree(OfferNo)
Enrollment
Offering
8-44
Access Plan Example 2
Sort Merge Join
Sort(OfferNo)
BTree(OfferNo)
Sort Merge Join
Enrollment
BTree(FacSSN)
BTree(FacSSN)
Faculty
Offering
8-45
Join Algorithms
 Nested loops: inner and outer loops;
universal
 Sort merge: join column sorting or indexes
 Hybrid join: combination of nested loops
and sort merge
 Hash join: uses internal hash table
 Star join: uses bitmap join indexes
8-46
Improving Optimization Results
 Monitor poorly performing access plans
 Look for problems involving table profiles
and query coding practices
 Use hints carefully to improve results
 Override optimizer judgment
 Cover file structures, join algorithms, and join
orders
 Use as a last result
8-47
Table Profile Deficiencies
 Detailed and current statistics needed
 Beware of uniform value assumption and
independence assumption
 Use hints to overcome optimization blind
spots
 Estimation of result size for parameterized
queries
 Correlated columns: multiple index access
may be useful
8-48
Query Coding Practices
 Avoid functions on indexable columns
 Eliminate unnecessary joins
 For conditions on join columns, test the condition
on the parent table.
 Do not use the HAVING clause for row
conditions.
 Avoid repetitive binding of complex queries
 Beware of queries that use complex views
8-49
Index Selection
 Most important decision
 Difficult decision
 Choice of clustered and nonclustered
indexes
8-50
Clustering Index Example
Index set
Sequence set
<Abe, 1> <Adam, 2>
1. Abe, Denver, ...
2. Adam, Boulder, ...
<Bill, 4> <Bob, 3>
3. Bob, Denver, ...
4. Bill, Aspen, ...
<Carl, 5> <Carol, 6>
5. Carl, Denver, ...
6. Carol, Golden, ...
...
Physical records
containing rows
8-51
Nonclustering Index Example
Index set
Sequence set
<Abe, 6> <Adam, 2>
1. Carl,Denver, ...
2. Adam, Boulder, ...
<Bill, 4> <Bob, 5>
3. Carol, Golden, ...
4. Bill, Aspen, ...
<Carl, 1> <Carol, 3>
5. Bob, Denver, ...
6. Abe, Denver, ...
...
Physical
records
containing rows
8-52
Inputs and Outputs of Index
Selection
SQL statements
and weights
Clustered index
choices
Index
Selection
Table profiles
Nonclustered
index choices
8-53
Trade-offs in Index Selection
 Balance retrieval against update performance
 Nonclustering index usage:
 Few rows satisfy the condition in the query
 Join column usage if a small number of rows result in
child table
 Clustering index usage:
 Larger number of rows satisfy a condition than for
nonclustering index
 Use in sort merge join algorithm to avoid sorting
 More expensive to maintain
8-54
Difficulties of Index Selection
 Application weights are difficult to specify.
 Distribution of parameter values needed
 Behavior of the query optimization
component must be known.
 The number of choices is large.
 Index choices can be interrelated.
8-55
Selection Rules I
Rule 1: A primary key is a good candidate for a clustering
index.
Rule 2: To support joins, consider indexes on foreign keys.
Rule 3: A column with many values may be a good choice
for a non-clustering index if it is used in equality
conditions.
Rule 4: A column used in highly selective range conditions
is a good candidate for a non-clustering index.
Rule 5: A combination of columns used together in query
conditions may be good candidates for nonclustering
indexes if the joint conditions return few rows, the DBMS
optimizer supports multiple index access, and the
columns are stable.
8-56
Selection Rules II
Rule 6: A frequently updated column is not a good
index candidate.
Rule 7: Volatile tables (lots of insertions and
deletions) should not have many indexes.
Rule 8: Stable columns with few values are good
candidates for bitmap indexes if the columns
appear in WHERE conditions.
Rule 9: Avoid indexes on combinations of
columns. Most optimization components can use
multiple indexes on the same table.
8-57
Index Creation
 To create the indexes, the CREATE
INDEX statement can be used.
 The word following the INDEX keyword is
the name of the index.
 CREATE INDEX is not part of SQL:2003.
 Examples
CREATE INDEX StdGPAIndex ON Student (StdGPA)
CREATE UNIQUE INDEX OfferNoIndex ON Offering
(OfferNo)
CREATE BITMAP INDEX OffYearIndex ON Offering
(OffYear)
8-58
Denormalization
 Additional choice in physical database
design
 Denormalization combines tables so that
they are easier to query.
 Use carefully because normalized designs
have important advantages.
8-59
Normalized designs
 Better update performance
 Require less coding to enforce integrity
constraints
 Support more indexes to improve query
performance
8-60
Repeating Groups
 Collection of associated values.
 Normalization rules force repeating groups to be
stored in an M table separate from an
associated one table.
 If a repeating group is always accessed with its
associated parent table, denormalization may be
a reasonable alternative.
8-61
Denormalizing a Repeating
Group
Normalized
Denormalized
Te rritory
TerrNo
TerrName
TerrLoc
1
M
Te rritorySale s
Te rritory
TerrNo
TerrName
TerrLoc
Qtr1Sales
Qtr2Sales
Qtr3Sales
Qtr4Sales
TerrNo
TerrQtr
TerrSales
8-62
Denormalizing a Generalization
Hierarchy
Normalized
Denormalized
Em p
EmpNo
EmpName
EmpHireDate
Em p
1
1
1
SalaryEm p
HourlyEm p
EmpNo
EmpSalary
EmpNo
EmpRate
EmpNo
EmpName
EmpHireDate
EmpSalary
EmpRate
8-63
Codes and Meanings
Normalized
Denormalized
De pt
De pt
DeptNo
DeptName
DeptLoc
DeptNo
DeptName
DeptLoc
1
1
M
M
Em p
Em p
EmpNo
EmpName
DeptNo
EmpNo
EmpName
DeptNo
DeptName
8-64
Record Formatting
 Record formatting decisions involve
compression and derived data.
 Compression is a trade-off between inputoutput and processing effort.
 Derived data is a trade-offs between query
and update operations.
8-65
Storing Derived Data to
Improve Query Performance
Order
Product
OrdNo
OrdDate
OrdAmt
ProdNo
ProdName
ProdPrice
1
derived
data
1
M
OrdLine
M
OrdNo
ProdNo
Qty
8-66
Parallel Processing
 Parallel processing can improve retrieval
and modification performance.
 Retrieving many records can be improved
by reading physical records in parallel.
 Many DBMSs provide parallel processing
capabilities with RAID systems.
 RAID is a collection of disks (a disk array)
that operates as a single disk.
8-67
Striping in RAID
Storage Systems
Each stripe consists of four adjacent physical records. Three stripes are
shown separated by dotted lines.
PR1
PR2
PR3
PR4
PR5
PR6
PR7
PR8
PR9
PR10
PR11
PR12
8-68
Other Ways to Improve
Performance
 Transaction processing: add computing
capacity and improve transaction design.
 Data warehouses: add computing capacity
and store derived data.
 Distributed databases: allocate processing
and data to various computing locations.
8-69
Summary
 Goal: minimize response time
 Constraints: disk space, memory, communication
bandwidth
 Table profiles and application profiles must be
specified in sufficient detail.
 Environment: file structures and query optimization
 Monitor and possibly improve query optimization
results
 Index selection: most important decision
 Other techniques: denormalization, record
formatting, and parallel processing
8-70