Slides from Lecture 7 - Courses - University of California, Berkeley

Download Report

Transcript Slides from Lecture 7 - Courses - University of California, Berkeley

Physical Database Design
University of California, Berkeley
School of Information Management
and Systems
SIMS 257: Database Management
IS 257 – Spring 2004
2004-02-10 SLIDE 1
Lecture Outline
• Review
– Normalization
• Physical Database Design
• Access Methods
IS 257 – Spring 2004
2004-02-10 SLIDE 2
Lecture Outline
• Review
– Normalization
• Physical Database Design
• Access Methods
IS 257 – Spring 2004
2004-02-10 SLIDE 3
Database Design Process
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Internal
Model
Application 4
Conceptual
requirements
IS 257 – Spring 2004
2004-02-10 SLIDE 4
Normalization
• Normalization theory is based on the
observation that relations with certain
properties are more effective in inserting,
updating and deleting data than other sets
of relations containing the same data
• Normalization is a multi-step process
beginning with an “unnormalized” relation
– Hospital example from Atre, S. Data Base:
Structured Techniques for Design,
Performance, and Management.
IS 257 – Spring 2004
2004-02-10 SLIDE 5
Normal Forms
•
•
•
•
•
•
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
IS 257 – Spring 2004
2004-02-10 SLIDE 6
Normalization
No transitive
dependency
between
nonkey
attributes
All
determinants
are candidate
keys - Single
multivalued
dependency
IS 257 – Spring 2004
BoyceCodd and
Higher
Functional
dependency
of nonkey
attributes on
the primary
key - Atomic
values only
Full
Functional
dependency
of nonkey
attributes on
the primary
key
2004-02-10 SLIDE 7
Unnormalized Relations
• First step in normalization is to convert the
data into a two-dimensional table
• In unnormalized relations data can repeat
within a column
IS 257 – Spring 2004
2004-02-10 SLIDE 8
Unnormalized Relation
Patient #
Surgeon #
145
1111 311
Surg. date
Patient Name
Jan 1,
1995; June
12, 1995
John White
Patient Addr Surgeon
15 New St.
New York,
NY
243
1234 467
2345 189
Jan 8,
1996
Charles Brown
4876 145
Nov 5,
1995
Hal Kane
5123 145
May 10,
1995
Paul Kosher
Charles
Field
10 Main St. Patricia
Rye, NY
Gold
Dogwood
Lane
Harrison,
David
NY
Rosen
55 Boston
Post Road,
Chester,
CN
Beth Little
Blind Brook
Mamaronec
k, NY
Beth Little
6845 243
Apr 5,
1994 Dec
15, 1984
Ann Hood
Hilton Road
Larchmont, Charles
NY
Field
IS 257 – Spring 2004
Postop drug
Drug side effects
Gallstone
s removal;
Beth Little Kidney
Michael
stones
Penicillin,
Diamond removal
none-
Apr 5,
1994 May
10, 1995
Mary Jones
Surgery
rash
none
Eye
Cataract
removal
Thrombos Tetracyclin Fever
is removal e none
none
Open
Heart
Surgery
Cholecyst
ectomy
Gallstone
s
Removal
Eye
Cornea
Replacem
ent Eye
cataract
removal
Cephalosp
orin
none
Demicillin
none
none
none
Tetracyclin
e
Fever
2004-02-10 SLIDE 9
First Normal Form
• To move to First Normal Form a relation
must contain only atomic values at each
row and column.
– No repeating groups
– A column or set of columns is called a
Candidate Key when its values can uniquely
identify the row in the relation.
IS 257 – Spring 2004
2004-02-10 SLIDE 10
First Normal Form
Patient #
Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name
1111
145
01-Jan-95 John White
1111
311
12-Jun-95 John White
15 New St.
New York,
NY
15 New St.
New York,
NY
1234
243
05-Apr-94 Mary Jones
10 Main St.
Rye, NY
1234
467
10-May-95 Mary Jones
2345
4876
5123
6845
6845
189
145
145
243
243
IS 257 – Spring 2004
Charles
08-Jan-96 Brown
10 Main St.
Rye, NY
Dogwood
Lane
Harrison,
NY
05-Nov-95 Hal Kane
55 Boston
Post Road,
Chester,
CN
05-Apr-94 Ann Hood
15-Dec-84 Ann Hood
Hilton Road
Larchmont,
NY
Drug adminSide Effects
Charles Field
Gallstone
s removal
Kidney
stones
removal
Eye
Cataract
removal
Patricia Gold
Thrombos
is removal none
none
David Rosen
Open
Heart
Surgery
none
Beth Little
Cholecyst
ectomy
Demicillin
Beth Little
Michael
Diamond
Blind Brook
Mamaronec
10-May-95 Paul Kosher k, NY
Beth Little
Hilton Road
Larchmont,
NY
Surgery
Penicillin
rash
none
none
Tetracyclin
e
Fever
Cephalosp
orin
Charles Field
Gallstone
s
Removal
none
Eye
Cornea
Replacem Tetracyclin
ent
e
Charles Field
Eye
cataract
removal
none
none
none
Fever
none
2004-02-10 SLIDE 11
Second Normal Form
• A relation is said to be in Second Normal
Form when every nonkey attribute is fully
functionally dependent on the primary key.
– That is, every nonkey attribute needs the full
primary key for unique identification
IS 257 – Spring 2004
2004-02-10 SLIDE 12
Second Normal Form
Patient #
1111
1234
2345
4876
5123
6845
IS 257 – Spring 2004
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
2004-02-10 SLIDE 13
Second Normal Form
Surgeon #
Surgeon Name
145 Beth Little
189 David Rosen
243 Charles Field
311 Michael Diamond
467 Patricia Gold
IS 257 – Spring 2004
2004-02-10 SLIDE 14
Second Normal Form
Patient # Surgeon # Surgery Date
1111
1111
1234
1234
2345
4876
Drug Admin Side Effects
145
Gallstones
01-Jan-95 removal
Kidney
Penicillin
rash
311
stones
12-Jun-95 removal
none
none
243
Eye Cataract
05-Apr-94 removal
Tetracycline Fever
467
Thrombosis
10-May-95 removal
189
Open Heart
08-Jan-96 Surgery
Cephalospori
n
none
145
Cholecystect
05-Nov-95 omy
Demicillin
none
none
none
none
none
5123
145
6845
243
6845
243
IS 257 – Spring 2004
Surgery
Gallstones
10-May-95 Removal
Eye cataract
15-Dec-84 removal
Eye Cornea
05-Apr-94 Replacement
none
none
Tetracycline Fever
2004-02-10 SLIDE 15
Third Normal Form
• A relation is said to be in Third Normal Form if
there is no transitive functional dependency
between nonkey attributes
– When one nonkey attribute can be determined with
one or more nonkey attributes there is said to be a
transitive functional dependency.
• The side effect column in the Surgery table is
determined by the drug administered
– Side effect is transitively functionally dependent on
drug so Surgery is not 3NF
IS 257 – Spring 2004
2004-02-10 SLIDE 16
Third Normal Form
Patient # Surgeon # Surgery Date
Surgery
Drug Admin
1111
145
1111
311
01-Jan-95 Gallstones removal
Kidney stones
12-Jun-95 removal
1234
243
05-Apr-94 Eye Cataract removal Tetracycline
1234
467
10-May-95 Thrombosis removal
2345
189
08-Jan-96 Open Heart Surgery
Cephalosporin
4876
145
05-Nov-95 Cholecystectomy
Demicillin
5123
145
10-May-95 Gallstones Removal
none
6845
243
none
6845
243
15-Dec-84 Eye cataract removal
Eye Cornea
05-Apr-94 Replacement
IS 257 – Spring 2004
Penicillin
none
none
Tetracycline
2004-02-10 SLIDE 17
Third Normal Form
Drug Admin
IS 257 – Spring 2004
Side Effects
Cephalosporin
none
Demicillin
none
none
none
Penicillin
rash
Tetracycline
Fever
2004-02-10 SLIDE 18
Boyce-Codd Normal Form
• Most 3NF relations are also BCNF
relations.
• A 3NF relation is NOT in BCNF if:
– Candidate keys in the relation are composite
keys (they are not single attributes)
– There is more than one candidate key in the
relation, and
– The keys are not disjoint, that is, some
attributes in the keys are common
IS 257 – Spring 2004
2004-02-10 SLIDE 19
BCNF Relations
Patient # Patient Name
IS 257 – Spring 2004
Patient #
1111 John White
1111
1234 Mary Jones
Charles
2345 Brown
1234
4876 Hal Kane
4876
5123 Paul Kosher
5123
6845 Ann Hood
6845
2345
Patient Address
15 New St. New
York, NY
10 Main St. Rye,
NY
Dogwood Lane
Harrison, NY
55 Boston Post
Road, Chester,
Blind Brook
Mamaroneck, NY
Hilton Road
Larchmont, NY
2004-02-10 SLIDE 20
Fourth Normal Form
• Any relation is in Fourth Normal Form if it
is BCNF and any multivalued
dependencies are trivial
• Eliminate non-trivial multivalued
dependencies by projecting into simpler
tables
IS 257 – Spring 2004
2004-02-10 SLIDE 21
Fifth Normal Form
• A relation is in 5NF if every join
dependency in the relation is implied by
the keys of the relation
• Implies that relations that have been
decomposed in previous NF can be
recombined via natural joins to recreate
the original relation.
IS 257 – Spring 2004
2004-02-10 SLIDE 22
Normalization
• Normalization is performed to reduce or
eliminate Insertion, Deletion or Update
anomalies.
• However, a completely normalized
database may not be the most efficient or
effective implementation.
• “Denormalization” is sometimes used to
improve efficiency.
IS 257 – Spring 2004
2004-02-10 SLIDE 23
Denormalization
• Usually driven by the need to improve
query speed
• Query speed is improved at the expense
of more complex or problematic DML
(Data manipulation language) for updates,
deletions and insertions.
IS 257 – Spring 2004
2004-02-10 SLIDE 24
Downward Denormalization
Before:
Customer
ID
Address
Name
Telephone
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
IS 257 – Spring 2004
After:
Customer
ID
Address
Name
Telephone
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
2004-02-10 SLIDE 25
Upward Denormalization
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Item
Order No
Item No
Item Price
Num Ordered
IS 257 – Spring 2004
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Price
Order Item
Order No
Item No
Item Price
Num Ordered
2004-02-10 SLIDE 26
Lecture Outline
• Review
– Normalization
• Physical Database Design
• Access Methods
IS 257 – Spring 2004
2004-02-10 SLIDE 27
Database Design Process
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Internal
Model
Application 4
Conceptual
requirements
IS 257 – Spring 2004
Physical
Design
2004-02-10 SLIDE 28
Physical Database Design
• Many physical database design decisions
are implicit in the technology adopted
– Also, organizations may have standards or an
“information architecture” that specifies
operating systems, DBMS, and data access
languages -- thus constraining the range of
possible physical implementations.
• We will be concerned with some of the
possible physical implementation issues
IS 257 – Spring 2004
2004-02-10 SLIDE 29
Physical Database Design
• The primary goal of physical database
design is data processing efficiency
• We will concentrate on choices often
available to optimize performance of
database services
• Physical Database Design requires
information gathered during earlier stages
of the design process
IS 257 – Spring 2004
2004-02-10 SLIDE 30
Physical Design Information
• Information needed for physical file and
database design includes:
– Normalized relations plus size estimates for them
– Definitions of each attribute
– Descriptions of where and when data are used
• entered, retrieved, deleted, updated, and how often
– Expectations and requirements for response time,
and data security, backup, recovery, retention and
integrity
– Descriptions of the technologies used to implement
the database
IS 257 – Spring 2004
2004-02-10 SLIDE 31
Physical Design Decisions
• There are several critical decisions that
will affect the integrity and performance of
the system.
– Storage Format
– Physical record composition
– Data arrangement
– Indexes
– Query optimization and performance tuning
IS 257 – Spring 2004
2004-02-10 SLIDE 32
Storage Format
• Choosing the storage format of each field
(attribute). The DBMS provides some set
of data types that can be used for the
physical storage of fields in the database
• Data Type (format) is chosen to minimize
storage space and maximize data integrity
IS 257 – Spring 2004
2004-02-10 SLIDE 33
Objectives of data type selection
•
•
•
•
•
Minimize storage space
Represent all possible values
Improve data integrity
Support all data manipulations
The correct data type should, in minimal
space, represent every possible value (but
eliminated illegal values) for the
associated attribute and can support the
required data manipulations (e.g.
numerical or string operations)
IS 257 – Spring 2004
2004-02-10 SLIDE 34
Access Data Types
•
•
•
•
•
•
•
•
•
Numeric (1, 2, 4, 8 bytes, fixed or float)
Text (255 max)
Memo (64000 max)
Date/Time (8 bytes)
Currency (8 bytes, 15 digits + 4 digits decimal)
Autonumber (4 bytes)
Yes/No (1 bit)
OLE (limited only by disk space)
Hyperlinks (up to 64000 chars)
IS 257 – Spring 2004
2004-02-10 SLIDE 35
Access Numeric types
• Byte
– Stores numbers from 0 to 255 (no fractions). 1 byte
• Integer
– Stores numbers from –32,768 to 32,767 (no fractions) 2 bytes
• Long Integer
(Default)
– Stores numbers from –2,147,483,648 to 2,147,483,647 (no fractions). 4
bytes
• Single
– Stores numbers from -3.402823E38 to –1.401298E–45 for negative
values and from 1.401298E–45 to 3.402823E38 for positive values.
4 bytes
• Double
– Stores numbers from –1.79769313486231E308 to –
4.94065645841247E–324 for negative values and from
1.79769313486231E308 to 4.94065645841247E–324 for positive
values.
15
8 bytes
• Replication ID
– Globally unique identifier (GUID) N/A
IS 257 – Spring 2004
16 bytes
2004-02-10 SLIDE 36
Controlling Data Integrity
•
•
•
•
•
Default values
Range control
Null value control
Referential integrity
Handling missing data
IS 257 – Spring 2004
2004-02-10 SLIDE 37
Designing Physical Records
• A physical record is a group of fields
stored in adjacent memory locations and
retrieved together as a unit
• Fixed Length and variable fields
IS 257 – Spring 2004
2004-02-10 SLIDE 38
Designing Physical/Internal Model
• Overview
• terminology
• Access methods
IS 257 – Spring 2004
2004-02-10 SLIDE 39
Physical Design
• Internal Model/Physical Model
User request
Interface 1
External Model
DBMS
Internal Model
Access Methods
Interface 2
Operating
System
Access Methods
Interface 3
Data
Base
IS 257 – Spring 2004
2004-02-10 SLIDE 40
Physical Design
• Interface 1: User request to the DBMS.
The user presents a query, the DBMS
determines which physical DBs are
needed to resolve the query
• Interface 2: The DBMS uses an internal
model access method to access the data
stored in a logical database.
• Interface 3: The internal model access
methods and OS access methods access
the physical records of the database.
IS 257 – Spring 2004
2004-02-10 SLIDE 41
Physical File Design
• A Physical file is a portion of secondary storage
(disk space) allocated for the purpose of storing
physical records
• Pointers - a field of data that can be used to
locate a related field or record of data
• Access Methods - An operating system
algorithm for storing and locating data in
secondary storage
• Pages - The amount of data read or written in
one disk input or output operation
IS 257 – Spring 2004
2004-02-10 SLIDE 42
Internal Model Access Methods
• Many types of access methods:
– Physical Sequential
– Indexed Sequential
– Indexed Random
– Inverted
– Direct
– Hashed
• Differences in
– Access Efficiency
– Storage Efficiency
IS 257 – Spring 2004
2004-02-10 SLIDE 43
Physical Sequential
• Key values of the physical records are in
logical sequence
• Main use is for “dump” and “restore”
• Access method may be used for storage
as well as retrieval
• Storage Efficiency is near 100%
• Access Efficiency is poor (unless fixed
size physical records)
IS 257 – Spring 2004
2004-02-10 SLIDE 44
Indexed Sequential
• Key values of the physical records are in logical
sequence
• Access method may be used for storage and
retrieval
• Index of key values is maintained with entries for
the highest key values per block(s)
• Access Efficiency depends on the levels of
index, storage allocated for index, number of
database records, and amount of overflow
• Storage Efficiency depends on size of index and
volatility of database
IS 257 – Spring 2004
2004-02-10 SLIDE 45
Index Sequential
Data File
Actual
Value
IS 257 – Spring 2004
Address
Block
Number
Dumpling
1
Harty
2
Texaci
3
...
…
Adams
Becker
Dumpling
Block 1
Getta
Harty
Block 2
Mobile
Sunoci
Texaci
Block 3
2004-02-10 SLIDE 46
Indexed Sequential: Two Levels
Key
Value
Key
Value
Address
150
1
385
2
001
003
.
.
150
Address
385
7
678
8
805
9
…
Key
Value
Address
536
3
678
4
Key
Value
251
.
.
385
455
480
.
.
536
605
610
.
.
678
Address
785
5
805
6
791
.
.
805
IS 257 – Spring 2004
705
710
.
.
785
2004-02-10 SLIDE 47
Indexed Random
• Key values of the physical records are not
necessarily in logical sequence
• Index may be stored and accessed with Indexed
Sequential Access Method
• Index has an entry for every data base record.
These are in ascending order. The index keys
are in logical sequence. Database records are
not necessarily in ascending sequence.
• Access method may be used for storage and
retrieval
IS 257 – Spring 2004
2004-02-10 SLIDE 48
Indexed Random
Becker
Harty
Actual
Value
Address
Block
Number
Adams
2
Becker
1
Dumpling
3
Getta
2
Harty
1
Adams
Getta
Dumpling
IS 257 – Spring 2004
2004-02-10 SLIDE 49
Btree
F
B
||
D || F|
||
P || Z|
H ||
L || P|
R ||
S || Z|
Devils
Aces
Boilers
Cars
IS 257 – Spring 2004
Flyers
Hawkeyes
Hoosiers
Minors
Panthers
Seminoles
2004-02-10 SLIDE 50
Inverted
• Key values of the physical records are not
necessarily in logical sequence
• Access Method is better used for retrieval
• An index for every field to be inverted may
be built
• Access efficiency depends on number of
database records, levels of index, and
storage allocated for index
IS 257 – Spring 2004
2004-02-10 SLIDE 51
Inverted
Student
name
Course
Number
CH 145
101, 103,104
Actual
Value
Address
Block
Number
CH 145
1
CS 201
2
CS 623
3
PH 345
…
Adams
CH145
Becker
cs201
Dumpling ch145
CS 201
102
Getta
ch145
Harty
cs623
Mobile
cs623
CS 623
105, 106
IS 257 – Spring 2004
2004-02-10 SLIDE 52
Direct
• Key values of the physical records are not
necessarily in logical sequence
• There is a one-to-one correspondence
between a record key and the physical
address of the record
• May be used for storage and retrieval
• Access efficiency always 1
• Storage efficiency depends on density of
keys
• No duplicate keys permitted
IS 257 – Spring 2004
2004-02-10 SLIDE 53
Hashing
• Key values of the physical records are not
necessarily in logical sequence
• Many key values may share the same physical
address (block)
• May be used for storage and retrieval
• Access efficiency depends on distribution of
keys, algorithm for key transformation and space
allocated
• Storage efficiency depends on distibution of
keys and algorithm used for key transformation
IS 257 – Spring 2004
2004-02-10 SLIDE 54
Comparative Access Methods
Factor
Storage space
Sequential
retrieval on
primary key
Random Retr.
Multiple Key
Retr.
Deleting records
Sequential
No wasted space
Indexed
Hashed
No wasted
space for data
but extra space for index
more space needed for
addition and deletion of
records after initial load
Very fast
Moderately Fast
Impractical
Moderately Fast
Very fast with
multiple indexes
OK if dynamic
Very fast
OK if dynamic
very easy
Easy but requires
Maintenance of
indexes
very easy
Impractical
Possible but needs
a full scan
can create wasted
space
Adding records requires rewriting
file
Updating records usually requires
rewriting file
IS 257 – Spring 2004
Not possible
very easy
2004-02-10 SLIDE 55
Next Time
• Indexes and when to index
• Integrity Constraints
• Referential Integrity
IS 257 – Spring 2004
2004-02-10 SLIDE 56