- Courses - University of California, Berkeley

Download Report

Transcript - Courses - University of California, Berkeley

Physical Database Design
University of California, Berkeley
School of Information Management and
Systems
SIMS 257: Database Management
9/25/2001
SIMS 257: Database Management
Review
• Database Design Process
• Normalization
9/25/2001
SIMS 257: Database Management
Database Design Process
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Application 4
Conceptual
requirements
9/25/2001
SIMS 257: Database Management
Internal
Model
Normalization
• Normalization theory is based on the
observation that relations with certain
properties are more effective in inserting,
updating and deleting data than other sets of
relations containing the same data
• Normalization is a multi-step process
beginning with an “unnormalized” relation
– Hospital example from Atre, S. Data Base: Structured Techniques for
Design, Performance, and Management.
9/25/2001
SIMS 257: Database Management
Normal Forms
•
•
•
•
•
•
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
9/25/2001
SIMS 257: Database Management
Normalization
No transitive
dependency
between
nonkey
attributes
All
determinants
are candidate
keys - Single
multivalued
dependency
9/25/2001
BoyceCodd and
Higher
SIMS 257: Database Management
Functional
dependencyof
nonkey
attributes on
the primary
key - Atomic
values only
Full
Functional
dependencyof
nonkey
attributes on
the primary
key
Unnormalized Relations
• First step in normalization is to convert the
data into a two-dimensional table
• In unnormalized relations data can repeat
within a column
9/25/2001
SIMS 257: Database Management
Unnormalized Relation
Patient #
Surgeon #
145
1111 311
Surg. date
Patient Name
Jan 1,
1995; June
12, 1995
John White
Patient Addr Surgeon
15 New St.
New York,
NY
243
1234 467
2345 189
Jan 8,
1996
Charles Brown
4876 145
Nov 5,
1995
Hal Kane
5123 145
May 10,
1995
Paul Kosher
Charles
Field
10 Main St. Patricia
Rye, NY
Gold
Dogwood
Lane
Harrison,
David
NY
Rosen
55 Boston
Post Road,
Chester,
CN
Beth Little
Blind Brook
Mamaronec
k, NY
Beth Little
6845 243
Apr 5,
1994 Dec
15, 1984
Ann Hood
Hilton Road
Larchmont, Charles
NY
Field
9/25/2001
Postop drug
Drug side effects
Gallstone
s removal;
Beth Little Kidney
Michael
stones
Penicillin,
Diamond removal
none-
Apr 5,
1994 May
10, 1995
Mary Jones
Surgery
SIMS 257: Database Management
rash
none
Eye
Cataract
removal
Thrombos Tetracyclin Fever
is removal e none
none
Open
Heart
Surgery
Cholecyst
ectomy
Gallstone
s
Removal
Eye
Cornea
Replacem
ent Eye
cataract
removal
Cephalosp
orin
none
Demicillin
none
none
none
Tetracyclin
e
Fever
First Normal Form
Patient #
Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name
1111
145
01-Jan-95 John White
1111
311
12-Jun-95 John White
15 New St.
New York,
NY
15 New St.
New York,
NY
1234
243
05-Apr-94 Mary Jones
10 Main St.
Rye, NY
1234
467
10-May-95 Mary Jones
2345
4876
5123
6845
6845
9/25/2001
189
145
145
243
243
Charles
08-Jan-96 Brown
10 Main St.
Rye, NY
Dogwood
Lane
Harrison,
NY
05-Nov-95 Hal Kane
55 Boston
Post Road,
Chester,
CN
05-Apr-94 Ann Hood
15-Dec-84 Ann Hood
Hilton Road
Larchmont,
NY
Drug adminSide Effects
Charles Field
Gallstone
s removal
Kidney
stones
removal
Eye
Cataract
removal
Patricia Gold
Thrombos
is removal none
none
David Rosen
Open
Heart
Surgery
none
Beth Little
Cholecyst
ectomy
Demicillin
Beth Little
Michael
Diamond
Blind Brook
Mamaronec
10-May-95 Paul Kosher k, NY
Beth Little
Hilton Road
Larchmont,
NY
Surgery
Penicillin
rash
none
none
Tetracyclin
e
Fever
Cephalosp
orin
Charles Field
Gallstone
s
Removal
none
Eye
Cornea
Replacem Tetracyclin
ent
e
Charles Field
Eye
cataract
removal
SIMS 257: Database Management
none
none
none
Fever
none
Second Normal Form
Patient #
1111
1234
2345
4876
5123
6845
9/25/2001
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
SIMS 257: Database Management
Second Normal Form
Surgeon #
Surgeon Name
145 Beth Little
189 David Rosen
243 Charles Field
311 Michael Diamond
467 Patricia Gold
9/25/2001
SIMS 257: Database Management
Second Normal Form
Patient # Surgeon # Surgery Date
1111
1111
1234
1234
2345
4876
9/25/2001
Surgery
Drug Admin Side Effects
145
Gallstones
01-Jan-95 removal
Kidney
Penicillin
rash
311
stones
12-Jun-95 removal
none
none
243
Eye Cataract
05-Apr-94 removal
Tetracycline Fever
467
Thrombosis
10-May-95 removal
189
Open Heart
08-Jan-96 Surgery
Cephalospori
n
none
145
Cholecystect
05-Nov-95 omy
Demicillin
none
none
none
none
none
5123
145
6845
243
6845
243
Gallstones
10-May-95 Removal
Eye cataract
15-Dec-84 removal
Eye Cornea
05-Apr-94 Replacement
SIMS 257: Database Management
none
none
Tetracycline Fever
Third Normal Form
Patient # Surgeon # Surgery Date
9/25/2001
Surgery
Drug Admin
1111
145
1111
311
01-Jan-95 Gallstones removal
Kidney stones
12-Jun-95 removal
1234
243
05-Apr-94 Eye Cataract removal Tetracycline
1234
467
10-May-95 Thrombosis removal
2345
189
08-Jan-96 Open Heart Surgery
Cephalosporin
4876
145
05-Nov-95 Cholecystectomy
Demicillin
5123
145
10-May-95 Gallstones Removal
none
6845
243
none
6845
243
15-Dec-84 Eye cataract removal
Eye Cornea
05-Apr-94 Replacement
SIMS 257: Database Management
Penicillin
none
none
Tetracycline
Third Normal Form
Drug Admin
9/25/2001
Side Effects
Cephalosporin
none
Demicillin
none
none
none
Penicillin
rash
Tetracycline
Fever
SIMS 257: Database Management
Most 3NF Relations are also
BCNF
Patient #
1111
1234
2345
4876
5123
6845
9/25/2001
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
SIMS 257: Database Management
Fourth Normal Form
• Any relation is in Fourth Normal Form if it
is BCNF and any multivalued dependencies
are trivial
• Eliminate non-trivial multivalued
dependencies by projecting into simpler
tables
9/25/2001
SIMS 257: Database Management
Fifth Normal Form
• A relation is in 5NF if every join
dependency in the relation is implied by the
keys of the relation
• Implies that relations that have been
decomposed in previous NF can be
recombined via natural joins to recreate the
original relation.
9/25/2001
SIMS 257: Database Management
Normalization
• Normalization is performed to reduce or
eliminate Insertion, Deletion or Update
anomalies.
• However, a completely normalized database
may not be the most efficient or effective
implementation.
• “Denormalization” is sometimes used to
improve efficiency.
9/25/2001
SIMS 257: Database Management
Denormalization
• Usually driven by the need to improve
query speed
• Query speed is improved at the expense of
more complex or problematic DML (Data
manipulation language) for updates,
deletions and insertions.
9/25/2001
SIMS 257: Database Management
Downward Denormalization
Before:
Customer
ID
Address
Name
Telephone
After:
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
9/25/2001
Customer
ID
Address
Name
Telephone
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
SIMS 257: Database Management
Upward Denormalization
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Price
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Item
Order No
Item No
Item Price
Num Ordered
9/25/2001
SIMS 257: Database Management
Order Item
Order No
Item No
Item Price
Num Ordered
9/25/2001
SIMS 257: Database Management
Today
• Physical Database Design
• Access Methods
• Indexes
Based on McFadden Modern Database Management
and Atre Database:Structured Techniques for Design,
Performance and Management
9/25/2001
SIMS 257: Database Management
Database Design Process
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Internal
Model
Application 4
Conceptual
requirements
9/25/2001
Physical
Design
SIMS 257: Database Management
Physical Database Design
• Many physical database design decisions
are implicit in the technology adopted
– Also, organizations may have standards or an
“information architecture” that specifies
operating systems, DBMS, and data access
languages -- thus constraining the range of
possible physical implementations.
• We will be concerned with some of the
possible physical implementation issues
9/25/2001
SIMS 257: Database Management
Physical Database Design
• The primary goal of physical database
design is data processing efficiency
• We will concentrate on choices often
available to optimize performance of
database services
• Physical Database Design requires
information gathered during earlier stages
of the design process
9/25/2001
SIMS 257: Database Management
Physical Design Information
• Information needed for physical file and
database design includes:
– Normalized relations plus size estimates for them
– Definitions of each attribute
– Descriptions of where and when data are used
• entered, retrieved, deleted, updated, and how often
– Expectations and requirements for response time,
and data security, backup, recovery, retention and
integrity
– Descriptions of the technologies used to
implement the database
9/25/2001
SIMS 257: Database Management
Physical Design Decisions
• There are several critical decisions that will
affect the integrity and performance of the
system.
–
–
–
–
–
9/25/2001
Storage Format
Physical record composition
Data arrangement
Indexes
Query optimization and performance tuning
SIMS 257: Database Management
Storage Format
• Choosing the storage format of each field
(attribute). The DBMS provides some set of
data types that can be used for the physical
storage of fields in the database
• Data Type (format) is chosen to minimize
storage space and maximize data integrity
9/25/2001
SIMS 257: Database Management
Objectives of data type selection
•
•
•
•
Minimize storage space
Represent all possible values
Improve data integrity
Support all data manipulations
• The correct data type should, in minimal space,
represent every possible value (but eliminated
illegal values) for the associated attribute and can
support the required data manipulations (e.g.
numerical or string operations)
9/25/2001
SIMS 257: Database Management
Access Data Types
•
•
•
•
•
•
•
•
•
Numeric (1, 2, 4, 8 bytes, fixed or float)
Text (255 max)
Memo (64000 max)
Date/Time (8 bytes)
Currency (8 bytes, 15 digits + 4 digits decimal)
Autonumber (4 bytes)
Yes/No (1 bit)
OLE (limited only by disk space)
Hyperlinks (up to 64000 chars)
9/25/2001
SIMS 257: Database Management
Access Numeric types
• Byte
– Stores numbers from 0 to 255 (no fractions). 1 byte
• Integer
– Stores numbers from –32,768 to 32,767 (no fractions) 2 bytes
• Long Integer
• Single
(Default)
– Stores numbers from –2,147,483,648 to 2,147,483,647 (no
fractions). 4 bytes
– Stores numbers from -3.402823E38 to –1.401298E–45 for
negative values and from 1.401298E–45 to 3.402823E38 for
positive values.
4 bytes
• Double
– Stores numbers from –1.79769313486231E308 to –
4.94065645841247E–324 for negative values and from
1.79769313486231E308 to 4.94065645841247E–324 for
positive values.
15
8 bytes
• Replication ID
– Globally unique identifier (GUID)
9/25/2001
SIMS 257: Database Management
N/A
16 bytes
Controlling Data Integrity
•
•
•
•
•
Default values
Range control
Null value control
Referential integrity
Handling missing data
9/25/2001
SIMS 257: Database Management
Designing Physical Records
• A physical record is a group of fields stored
in adjacent memory locations and retrieved
together as a unit
• Fixed Length and variable fields
9/25/2001
SIMS 257: Database Management
Designing Physical Files/Internal
Model
• Overview
• terminology
• Access methods
9/25/2001
SIMS 257: Database Management
Physical Design
• Internal Model/Physical Model
User request
Interface 1
External Model
DBMS
Internal Model
Access Methods
Interface 2
Operating
System
Access Methods
Interface 3
Data
Base
9/25/2001
SIMS 257: Database Management
Physical Design
• Interface 1: User request to the DBMS. The user
presents a query, the DBMS determines which
physical DBs are needed to resolve the query
• Interface 2: The DBMS uses an internal model
access method to access the data stored in a logical
database.
• Interface 3: The internal model access methods
and OS access methods access the physical
records of the database.
9/25/2001
SIMS 257: Database Management
Physical File Design
• A Physical file is a portion of secondary
storage (disk space) allocated for the
purpose of storing physical records
• Pointers - a field of data that can be used to
locate a related field or record of data
• Access Methods - An operating system
algorithm for storing and locating data in
secondary storage
• Pages - The amount of data read or written
in one disk input or output operation
9/25/2001
SIMS 257: Database Management
Internal Model Access Methods
• Many types of access methods:
–
–
–
–
–
–
Physical Sequential
Indexed Sequential
Indexed Random
Inverted
Direct
Hashed
• Differences in
– Access Efficiency
– Storage Efficiency
9/25/2001
SIMS 257: Database Management
Physical Sequential
• Key values of the physical records are in
logical sequence
• Main use is for “dump” and “restore”
• Access method may be used for storage as
well as retrieval
• Storage Efficiency is near 100%
• Access Efficiency is poor (unless fixed size
physical records)
9/25/2001
SIMS 257: Database Management
Indexed Sequential
• Key values of the physical records are in logical
sequence
• Access method may be used for storage and
retrieval
• Index of key values is maintained with entries for
the highest key values per block(s)
• Access Efficiency depends on the levels of index,
storage allocated for index, number of database
records, and amount of overflow
• Storage Efficiency depends on size of index and
volatility of database
9/25/2001
SIMS 257: Database Management
Index Sequential
Data File
Actual
Value
9/25/2001
Address
Block
Number
Dumpling
1
Harty
2
Texaci
3
...
…
Adams
Becker
Dumpling
SIMS 257: Database Management
Block 1
Getta
Harty
Block 2
Mobile
Sunoci
Texaci
Block 3
Indexed Sequential: Two Levels
Key
Value
Key
Value
150
1
385
2
001
003
.
.
150
Address
385
7
678
8
805
9
…
Key
Value
251
.
.
385
Address
536
3
678
4
Key
Value
9/25/2001
Address
455
480
.
.
536
605
610
.
.
678
Address
785
5
805
6
SIMS 257: Database Management
791
.
.
805
705
710
.
.
785
Indexed Random
• Key values of the physical records are not
necessarily in logical sequence
• Index may be stored and accessed with Indexed
Sequential Access Method
• Index has an entry for every data base record.
These are in ascending order. The index keys are
in logical sequence. Database records are not
necessarily in ascending sequence.
• Access method may be used for storage and
retrieval
9/25/2001
SIMS 257: Database Management
Indexed Random
Becker
Harty
Actual
Value
Address
Block
Number
Adams
2
Becker
1
Dumpling
3
Getta
2
Harty
1
Adams
Getta
Dumpling
9/25/2001
SIMS 257: Database Management
Btree
F
B
|| D || F|
|| P || Z|
H || L || P|
R || S || Z|
Devils
Aces
Boilers
Cars
9/25/2001
Flyers
Hawkeyes
Hoosiers
Minors
Panthers
SIMS 257: Database Management
Seminoles
Inverted
• Key values of the physical records are not
necessarily in logical sequence
• Access Method is better used for retrieval
• An index for every field to be inverted may
be built
• Access efficiency depends on number of
database records, levels of index, and
storage allocated for index
9/25/2001
SIMS 257: Database Management
Inverted
CH 145
101, 103,104
Actual
Value
Address
Block
Number
CH 145
1
CS 201
2
CS 623
3
PH 345
…
CS 201
102
CS 623
105, 106
9/25/2001
SIMS 257: Database Management
Student
name
Course
Number
Adams
CH145
Becker
cs201
Dumpling ch145
Getta
ch145
Harty
cs623
Mobile
cs623
Direct
• Key values of the physical records are not
necessarily in logical sequence
• There is a one-to-one correspondence between a
record key and the physical address of the record
• May be used for storage and retrieval
• Access efficiency always 1
• Storage efficiency depends on density of keys
• No duplicate keys permitted
9/25/2001
SIMS 257: Database Management
Hashing
• Key values of the physical records are not
necessarily in logical sequence
• Many key values may share the same physical
address (block)
• May be used for storage and retrieval
• Access efficiency depends on distribution of keys,
algorithm for key transformation and space
allocated
• Storage efficiency depends on distibution of keys
and algorithm used for key transformation
9/25/2001
SIMS 257: Database Management
Comparative Access Methods
Factor
Storage space
Sequential
retrieval on
primary key
Random Retr.
Multiple Key
Retr.
Deleting records
Sequential
No wasted space
Indexed
Hashed
No wasted
space for data
but extra space for index
more space needed for
addition and deletion of
records after initial load
Very fast
Moderately Fast
Impractical
Moderately Fast
Impractical
Possible but needs Very fast with
multiple indexes
a full scan
can create wasted OK if dynamic
space
OK if dynamic
Adding records requires rewriting
file
Easy but requires
Maintenance of
Updating records usually requires
indexes
rewriting file
9/25/2001
SIMS 257: Database Management
Very fast
Not possible
very easy
very easy
very easy