chap05 - Gonzaga University
Download
Report
Transcript chap05 - Gonzaga University
Chapter 5:
Physical Database Design and
Performance
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99258
[email protected]
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-1
Objectives
•
•
•
•
•
•
•
•
Definition of terms
Describe the physical database design process
Choose storage formats for attributes
Select appropriate file organizations
Describe three types of file organization
Describe indexes and their appropriate use
Translate a database model into efficient structures
Know when and how to use denormalization
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-2
Physical Database Design
• Purpose–translate the logical description of
data into the technical specifications for
storing and retrieving data
• Goal–create a design for storing data that
will provide adequate performance and
insure database integrity, security, and
recoverability
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-3
Physical Design Process
Inputs
Normalized
Volume
Decisions
relations
Attribute data types
estimates
Physical record descriptions
Attribute definitions
Response time
Data
(doesn’t always match
logical design)
expectations
security needs
Leads to
Backup/recovery needs
Integrity expectations
DBMS
technology used
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
File
organizations
Indexes and
database
architectures
Query optimization
TM 5-4
Figure 5-1 Composite usage map
(Pine Valley Furniture Company)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
5
TM 5-5
What will we learn from
Usage Map?
• Why should Manufacutred_Part and
Purchased_Part tables be separated?
• Why should SUPPLIER and SUPPLIES
tables be combined?
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-6
Figure 5-1 Composite usage map
(Pine Valley Furniture Company) (cont.)
Data (transaction)
volumes
Database access frequencies are estimated from “transaction volumes.”
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
7
TM 5-7
Figure 5-1 Composite usage map
(Pine Valley Furniture Company) (cont.)
Access Frequencies
(per hour)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-8
Figure 5-1 Composite usage map
(Pine Valley Furniture Company) (cont.)
Usage analysis:
14,000 purchased parts
accessed per hour
8000 quotations accessed
from these 140 purchased part
accesses
7000 suppliers accessed from
these 8000 quotation accesses
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-9
Figure 5-1 Composite usage map
(Pine Valley Furniture Company) (cont.)
Usage analysis:
7500 suppliers accessed per
hour
4000 quotations accessed
from these 7500 supplier
accesses
4000 purchased parts
accessed from these 4000
quotation accesses
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-10
Designing Fields
• Field: smallest unit of data in
database
• Field design
– Choosing data type
– Coding, compression, encryption
– Controlling data integrity
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
11
TM 5-11
Choosing Data Types
•
•
•
•
•
•
•
CHAR–fixed-length character
VARCHAR2–variable-length character (memo)
LONG–large number
NUMBER–positive/negative number
INEGER–positive/negative whole number
DATE–actual date
BLOB–binary large object (good for graphics,
sound clips, etc.)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
12
TM 5-12
Figure 5-2 Example code look-up table
(Pine Valley Furniture Company)
Code saves space, but costs
an additional lookup to
obtain actual value
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
13
TM 5-13
Field Data Integrity
•
Default value–assumed value if no explicit value
–
•
•
•
e.g., s_state CHAR(2) DEFAULT ‘WI’;
Range control–allowable value limitations
(constraints or validation rules)
Null value control–allowing or prohibiting empty
fields
Referential integrity–range control (and null value
allowances) for foreign-key to primary-key matchups
Sarbanes-Oxley Act (SOX) legislates importance of financial data integrity
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
14
TM 5-14
Handling Missing Data
• Substitute an estimate of the missing value (e.g.,
using a formula)
• Construct a report listing missing values
• In programs, ignore missing data unless the value
is significant (sensitivity testing)
Triggers can be used to perform these operations
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-15
Physical Records
• Physical Record: A group of fields stored in
adjacent memory locations and retrieved
together as a unit
• Page: The amount of data read or written in
one I/O operation
• Blocking Factor: The number of physical
records per page
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-16
Denormalization
• Transforming normalized relations into unnormalized physical
record specifications
• Benefits:
– Can improve performance (speed) by reducing number of table lookups
(i.e. reduce number of necessary join queries)
• Costs (due to data duplication)
– Wasted storage space
– Data integrity/consistency threats
• Common denormalization opportunities
– One-to-one relationship (Fig. 5-3)
– Many-to-many relationship with attributes (Fig. 5-4)
– Reference data (1:N relationship where 1-side has data not used in any
other relationship) (Fig. 5-5)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-17
Figure 5-3 A possible denormalization situation: two entities with oneto-one relationship
mandatory
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
optional
18
TM 5-18
Figure 5-4 A possible denormalization situation: a many-to-many
relationship with nonkey attributes
Extra table
access required
Null description possible
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-19
Figure 5-5
A possible
denormalization
situation:
reference data
Extra table
access
required
Data duplication
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-20
Partitioning
• Horizontal Partitioning: Distributing the rows of a table
into several separate files
– Useful for situations where different users need access
to different rows
– Three types: Key Range Partitioning, Hash Partitioning,
or Composite Partitioning
• Vertical Partitioning: Distributing the columns of a table
into several separate relations
– Useful for situations where different users need access
to different columns
– The primary key must be repeated in each file
• Combinations of Horizontal and Vertical
Partitions often correspond with User Schemas (user views)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-21
Partitioning (cont.)
• Advantages of Partitioning:
– Efficiency: Records used together are grouped together
– Local optimization: Each partition can be optimized for
performance
– Security, recovery
– Load balancing: Partitions stored on different disks,
reduces contention
– Take advantage of parallel processing capability
• Disadvantages of Partitioning:
– Inconsistent access speed: Slow retrievals across partitions
– Complexity: Non-transparent partitioning
– Extra space or update time: Duplicate data; access from
multiple partitions
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-22
This figure shows how data are recorded on magnetic disks.
Figure: Magnetic Disk Components
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
4-23
TM 5-23
Oracle 11g Horizontal Partitioning Methods
• Range partitioning
– Partitions defined by range of field values
– Could result in unbalanced distribution of rows
– Like-valued fields share partitions
• Hash partitioning
– Partitions defined via hash functions
– Will guarantee balanced distribution of rows
– Partition could contain widely varying valued fields
• List partitioning
– Based on predefined lists of values for the partitioning key
• Composite partitioning
– Combination of the other approaches
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-24
Designing Physical Files
• Physical File:
– A named portion of secondary memory allocated for
the purpose of storing physical records
– Tablespace–named set of disk storage elements in
which physical files for database tables can be stored
– Extent–contiguous section of disk space
• Constructs to link two pieces of data:
– Sequential storage
– Pointers : field of data that can be used to locate related
fields or records
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
25
TM 5-25
Figure 5-6 DBMS terminology in an Oracle 11g environment
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-26
File Organizations
• Technique for physically arranging records of a file on
secondary storage
• Factors for selecting file organization:
–
–
–
–
–
–
Fast data retrieval and throughput
Efficient storage space utilization
Protection from failure and data loss
Minimizing need for reorganization
Accommodating growth
Security from unauthorized use
• Types of file organizations
– Sequential (Fig. 5-7a): the most efficient with storage space.
– Indexed (Fig. 5-7b) : quick retrieval
• Join Index, Fig 5-8
– Hashed (Fig. 5-7c) : easiest to update
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
27
TM 5-27
Figure 5-7a
Sequential file
organization
1
2
Records of the
file are stored in
sequence by the
primary key
field values
If sorted –
every insert or
delete requires
resort
If not sorted
Average time to
find desired record
= n/2
n
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
28
TM 5-28
Indexed File Organizations
• Indexed File Organization: the storage of records
either sequentially or nonsequentially with an index
that allows software to locate individual records
• Index: a table or other data structure used to
determine in a file the location of records that satisfy
some condition
• Primary keys are automatically indexed
• Other fields or combinations of fields can also be
indexed; these are called secondary keys (or
nonunique keys)
• Oracle has a CREATE INDEX operation, and MS
ACCESS allows indexes to be created for most field
types
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-29
Figure 5-7b Indexed file organization
uses a tree search
Average time to find
desired record = depth
of the tree
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-30
Figure 5-7c
Hashed file or
index
organization
Hash algorithm
Usually uses divisionremainder to determine
record position. Records
with same position are
grouped in lists
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
31
TM 5-31
Figure 5-8 Join Indexes–speeds up join operations
b) Join index for matching
foreign key (FK) and primary
key (PK)
a) Join
index for
common
non-key
columns
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-32
Understand their general concepts
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
33
TM 5-33
Clustering Files
• In some relational DBMSs, related records from
different tables can be stored together in the same
disk area
• Useful for improving performance of join
operations
• Primary key records of the main table are stored
adjacent to associated foreign key records of the
dependent table
• e.g. Oracle has a CREATE CLUSTER command
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
34
TM 5-34
Cluster
• Clustering is a method of storing tables that are
intimately related and often joined together into the
same area on disk. For example, instead of the
WORKER table being in one section of the disk and
the WORKERSKILL table being somewhere else,
their rows could be interleaved together in a single
area, called a cluster.
• The cluster key is the field (or fields) by which the
table are usually joined in a query (e.g., Worker_ID
in the example). To cluster tables, you must own the
tables you are going to cluster together.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-35
Fig. 4-10: Mapping an entity with a multivalued attribute
(a) Employee entity type with multivalued attribute
[Two
relations
WORKER
created
Worker_ID
with one
Name
containing
Age
all of the
Loding
attributes
{Skill, Ability}
except the
multiMultivalued attribute becomes a separate relation with foreign key valued
attribute,
(b) Mapping a multivalued attribute
and the
second
WORKER
one
Lodging
Name Age
Worker_ID
contains
the pk (on
WORKERSKILL
the first
one) and
Ability
Skill
Worker_ID
the multivalued
One–to–many relationship between original entity and new relation attribute]
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-36
The following is the basic syntax of the create cluster command:
CREATE CLUSTER cluster_name (field-1 datatype,
Field-2 datatype …)
CREATE CLUSTER WORKERandSKILL (
Judy VARCHAR2(25));
Cluster created.
This creates a cluster (a space is set aside, as it would be
for a table) with nothing in it. The use of
Worker_Cluster for the cluster key is irrelevant; you’ll
never use it again.
Next, tables are created to be included in this cluster.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-37
CREATE TABLE WORKER (
Worker_ID NUMBER(2),
Name
VARCHAR2 (25) NOT NULL,
Age
NUMBER (2),
Lodging
VARCHAR2 (15)
)
CLUSTER WORKERandSKILL (Worker_ID)
;
Prior to inserting rows into WORKER, you must create a cluster index:
CREATE INDEX WORKERandSKILL_NDX
ON CLUSTER WORKERandSKILL;
Notice that nowhere does either statement say explicitly
that the Worker_ID column goes into the July cluster key.
The mentioned in their respective cluster statements.
Multiple columns and cluster keys are matched first to first,
second to second, third to third, and so on. Now a second
table is added to the cluster.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-38
CREATE TABLE WORKERSKILL (
Worker_ID NUMBER(2),
Skill
VARCHAR2(25) NOT NULL,
Ability
VARCHAR2 (15) NOT NULL
)
CLUSTER WORKERandSKILL (Worker_ID)
;
We will investigate how the tables are stored in the cluster.
Recall that WORKER table has four fields: Worker_ID, Name, Age,
and Lodging. The WORKERSKILL table has three fields:
Worker_ID, Skill, and Ability. When these two tables are clustered,
each unique Worker_ID is actually stored only once, in the cluster
key. To each Worker_ID are attached the columns from both of
these tables.
The data from both of these tables is actually stored in a single
location, almost as if the cluster where a big table containing data
drawn from both of the tables that make it up.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-39
Figure: How data is stored in clusters
AGE
LODGING
23
29
22
18
16
43
41
55
15
33
27
PAPA KING
ROSE HILL
CRANMER
ROSE HILL
MATTS
WEITBROCHT
ROSE HILL
PAPA KING
21
25
33
35
32
67
26
25
NAME
WORKER_ID
ADAH TALBOT
ANDREW DYE
BART SARJEANT
DICK JONES
DONALD ROLLO
ELBERT TALBOT
GEORGE OSCAR
GERHARDT KENTGEN
HELEN BRANDT
JED HOPKINS
JOHN PEARSON
1
2
3
4
5
6
7
8
9
10
11
ROSE HILL
ROSE HILL
CRANMER
WEITBROCHT
MATTS
MULLERS
PALMER WALLBOM
PAT LAVAY
PETER LAWSON
RICHARD KOCH ROLAND
BRANDT
VICTORIA LYNN
WILFRED LOWELL
12
13
14
15
16
17
18
CRANMER
WILLIAM SWING
19
MATTS
ROSE HILL
from WORKER table
SKILL
ABILITY
PHOTO
GOOD
SMITHY
EXCELLENT
COMPUTER
SLOW
COMBINE DRIVER
VERY FAST
COMBINE DRIVR
WOODCUTTER
SMITHY
GOOD
AVERAGE
GOOD
SMITHY
PHOTO
COMPUTER
PRECISE
AVERAGE
AVERAGE
from WORKERSKILL table
CLUSTER key
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-40
Figure: How data is stored in clusters
AGE
LODGING
23
29
22
18
16
43
41
55
15
33
27
PAPA KING
ROSE HILL
CRANMER
ROSE HILL
MATTS
WEITBROCHT
ROSE HILL
PAPA KING
21
25
33
35
32
67
26
25
NAME
WORKER_ID
ADAH TALBOT
ANDREW DYE
BART SARJEANT
DICK JONES
DONALD ROLLO
ELBERT TALBOT
GEORGE OSCAR
GERHARDT KENTGEN
HELEN BRANDT
JED HOPKINS
JOHN PEARSON
1
2
3
4
5
6
7
8
9
10
11
ROSE HILL
ROSE HILL
CRANMER
WEITBROCHT
MATTS
MULLERS
PALMER WALLBOM
PAT LAVAY
PETER LAWSON
RICHARD KOCH ROLAND
BRANDT
VICTORIA LYNN
WILFRED LOWELL
12
13
14
15
16
17
18
CRANMER
WILLIAM SWING
19
MATTS
ROSE HILL
SKILL
ABILITY
PHOTO
GOOD
SMITHY
EXCELLENT
COMPUTER
SLOW
COMBINE DRIVER
VERY FAST
COMBINE DRIVR
WOODCUTTER
SMITHY
GOOD
AVERAGE
GOOD
SMITHY
PHOTO
COMPUTER
PRECISE
AVERAGE
AVERAGE
Why we do not simply create a table like the one shown above?
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-41
Optimizing Query Processing A Quick Way to Speed Access to Your data
• Indexes may be used to improve the
efficiency of data searches to meet
particular search criteria after the table has
been in use for some time. Therefore, the
ability to create indexes quickly and
efficiently at any time is important.
• SQL indexes can be created on the basis of
any selected attributes
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-42
Optimizing Query Processing (conti.)
For example:
SELECT index_name
FROM user_indexes;
SELECT *
FROM inventory
WHERE inv_qoh>= 130;
CREATE INDEX
qoh_index
ON inventory (inv_qoh);
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
Note that this statement
defines an index called
qoh_index for the qoh
column in the inventory
table. This index
ensures that in the next
example SQL only
needs to look at row in
the database that satisfy
the WHERE condition,
and is, therefore,
quicker to produce an
answer.
TM 5-43
SQL> SELECT *
2 FROM inventory
3 WHERE inv_qoh>= 130;
INV_ID
ITEM_ID COLOR
---------- ---------- -------------------3
3 Khaki
4
3 Khaki
6
3 Navy
7
3 Navy
9
4 Eggplant
10
4 Eggplant
11
4 Eggplant
19
5 Bright Pink
20
5 Bright Pink
21
5 Bright Pink
INV_SIZE
INV_PRICE
INV_QOH
---------- ---------- ---------S
29.95
150
M
29.95
147
S
29.95
139
M
29.95
137
S
59.95
135
M
59.95
168
L
59.95
187
10
15.99
148
11
15.99
137
12
15.99
134
10 rows selected.
CREATE INDEX qoh_index ON inventory (inv_qoh);
SQL> SELECT *
2 FROM inventory
3 WHERE inv_qoh>= 130;
INV_ID
ITEM_ID COLOR
---------- ---------- -------------------21
5 Bright Pink
9
4 Eggplant
7
3 Navy
20
5 Bright Pink
6
3 Navy
4
3 Khaki
19
5 Bright Pink
3
3 Khaki
10
4 Eggplant
11
4 Eggplant
INV_SIZE
INV_PRICE
INV_QOH
---------- ---------- ---------12
15.99
134
S
59.95
135
M
29.95
137
11
15.99
137
S
29.95
139
M
29.95
147
10
15.99
148
S
29.95
150
M
59.95
168
L
59.95
187
10 rows selected.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-44
Optimizing Query Processing (conti.)
You may even create an index that prevents
you from using a value that has been used
before. Such a feature is especially useful
when the index field (attribute) is a primary
key whose values must not be duplicated:
CREATE UNIQUE INDEX <index_field>
ON <tablename> (the key field);
SELECT index_name FROM user_indexes;
DROP INDEX <index_name>;
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-45
Summary on Optimizing Query
Processing (conti.)
The indexes are defined so as to optimize the
processing of SELECT statements. ;
An index is never explicitly referenced in a SELECT
statement; the syntax of SQL does not allow this;
During the processing of a statement, SQL itself
determines whether an existing index will be used;
An index may be created or deleted at any time;
When updating, inserting or deleting rows, SQL
also maintains the indexes on the tables concerned.
This means that, on the one hand, the processing time
for SELECT statements is reduced, while, on the other
hand, the processing time for update statements (such
as INSERT, UPDATE and DELETE) can increase.
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-46
Rules for Using Indexes
1. Use on larger tables
2. Index the primary key of each table
3. Index search fields (fields frequently in
WHERE clause)
4. Fields in SQL ORDER BY and GROUP BY
commands
5. When there are >100 values but not when
there are <30 values
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-47
Rules for Using Indexes (cont.)
6.
7.
8.
9.
Avoid use of indexes for fields with long values;
perhaps compress values first
If key to index is used to determine location of
record, use surrogate (like sequence nbr) to
allow even spread in storage area
DBMS may have limit on number of indexes per
table and number of bytes per indexed field(s)
Be careful of indexing attributes with null values;
many DBMSs will not recognize null values in
an index search
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-48
Query Optimization
• Parallel query processing–possible when
working in multiprocessor systems
• Overriding automatic query optimization–
allows for query writers to preempt the
automated optimization
• Picking data block size–factors to consider
include:
– Block contention, random and sequential row
access speed, row size
• Balancing I/O across disk controllers
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-49
Query Design Guidelines
•
•
•
•
•
•
•
1. Understand how indexes are used
2. Keep optimization statistics up-to-date
3. Use compatible data types for fields and literals
4. Write simple queries
5. Break complex queries into multiple, simple parts
6. Don’t use one query inside another
7. Don’t combine a table with itself
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
50
TM 5-50
Query Design Guidelines (cont.)
•
•
•
•
•
8. Create temporary tables for groups of queries
9. Combine update operations
10. Retrieve only the data you need
11. Don’t have the DBMS sort without an index
12. Learn!
– Learning to Learn and Learning to Change!
• 13. Consider the total query processing time for ad
hoc queries
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
51
TM 5-51
Database Architectures
(Model) (Figure-Extra)
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
Legacy
Systems
Current
Technology
Data
Warehouses
multidimensional
52
TM 5-52
Exercise and Homework
• Exercises (p.234)
– #1 (index), 5 (storage), 8 (denormalization)
• HW (p.235) [usage map]
– #19
Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems
TM 5-53