Slides from Lecture 5 - Courses - University of California, Berkeley

Download Report

Transcript Slides from Lecture 5 - Courses - University of California, Berkeley

Database Design: Normalization
and The Relational Model
University of California, Berkeley
School of Information Management
and Systems
SIMS 257: Database Management
IS 257 – Fall 2004
2004.09.22 - SLIDE 1
Lecture Outline
• Review
– Logical Design for the Diveshop
database
• Normalization
IS 257 – Fall 2004
2004.09.22 - SLIDE 2
Lecture Outline
• Review
– Logical Design for the Diveshop
database
• Normalization
IS 257 – Fall 2004
2004.09.22 - SLIDE 3
Database Design Process
Application 1
External
Model
Application 2
Application 3
Application 4
External
Model
External
Model
External
Model
Application 1
Conceptual
requirements
Application 2
Conceptual
requirements
Application 3
Conceptual
requirements
Conceptual
Model
Logical
Model
Internal
Model
Application 4
Conceptual
requirements
IS 257 – Fall 2004
2004.09.22 - SLIDE 4
DiveShop ER Diagram
Customer
No
DiveCust
1
Destination
Name
Destination
no
Customer
No
ShipVia
n
Dest
n
1
DiveOrds
n
1
ShipVia
ShipVia
1
Destination
no
Site No
1
n
Site No
BioSite
Species
No
1
Destination
n
Sites
Order
No
n
1
1/n
ShipWrck
Order
No
DiveItem
n
Item
No
n
Site No
1
Species
No
BioLife
IS 257 – Fall 2004
1
DiveStok
Item
No
2004.09.22 - SLIDE 5
Logical Design: Mapping to a Relational
Model
• Each entity in the ER Diagram becomes a
relation.
• A properly normalized ER diagram will indicate
where intersection relations for many-to-many
mappings are needed.
• Relationships are indicated by common columns
(or domains) in tables that are related.
• We will examine the tables for the Diveshop
derived from the ER diagram
IS 257 – Fall 2004
2004.09.22 - SLIDE 6
Customer = DIVECUST
Customer No
Name
Street
City
State/Prov Zip/Postal Code
Country
1480 Louis Jazdzewski
2501 O'Connor
New Orleans
LA
60332
U.S.A.
1481 Barbara Wright
6344 W. Freeway
San Francisco
CA
95031
U.S.A.
1909 Stephen Bredenburg
559 N.E. 167
Indianapolis
Place IN
46241
U.S.A.
1913 Phillip Davoust
123 First Street
Berkeley CA
94704
U.S.A.
1969 David Burgett
320 Montgomery
SeattleStreet
WA
98105
U.S.A.
2001 Mary Rioux1701 Gateway
Pueblo
Blvd. #385
CO
81002
U.S.A.
2306 Kim Lopez 14134 Nottingham
HonoluluLane
HI
96826
U.S.A.
2589 Hiram Marley
7233 Mill Run
SanDrive
Francisco
CA
94123
U.S.A.
3154 Tanya Kulesa
505 S. Flower,
NewMail
YorkStop
NY 48943 10032
U.S.A.
3333 Charles Sekaron
110 East Park
Miller
Avenue,SD
Box 8
57362
U.S.A.
3684 Lowell Lutz915 E. Fesler
Dallas
TX
75043
U.S.A.
4158 Keith Lucas56 South Euclid
Chicago IL
60542
U.S.A.
4175 Karen Ng 2134 ElmhillKlamath
Pike Falls
OR
97603
U.S.A.
5510 Ken Soule 58 Sansome
Aurora
Street CO
89022
U.S.A.
IS 257 – Fall 2004
Phone
First Contact
(902) 555-88881/29/95
(415) 555-43212/2/93
(317) 555-36441/5/93
(415) 555-91843/9/98
(206) 555-75803/12/99
(719) 555-20103/15/97
(808) 555-50501/29/99
(415) 555-64302/18/99
(212) 555-67501/30/99
(613) 555-43333/16/98
(214) 555-27222/15/99
(312) 555-43103/17/98
(503) 555-47003/20/99
(303) 555-66952/5/99
2004.09.22 - SLIDE 7
Mapping to Other Models
• Hierarchical
– Need to make decisions about access paths
• Network
– Need to pre-specify all of the links and sets
• Object-Oriented
– What are the objects, datatypes, their
methods and the access points for them
• Object-Relational
– Same as relational, but what new datatypes
might be needed or useful (more on OR later)
IS 257 – Fall 2004
2004.09.22 - SLIDE 8
Lecture Outline
• Review
– Logical Design for the Diveshop
database
• Normalization
IS 257 – Fall 2004
2004.09.22 - SLIDE 9
Normalization
• Normalization theory is based on the
observation that relations with certain
properties are more effective in inserting,
updating and deleting data than other sets
of relations containing the same data
• Normalization is a multi-step process
beginning with an “unnormalized” relation
– Hospital example from Atre, S. Data Base:
Structured Techniques for Design,
Performance, and Management.
IS 257 – Fall 2004
2004.09.22 - SLIDE 10
Normal Forms
•
•
•
•
•
•
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
IS 257 – Fall 2004
2004.09.22 - SLIDE 11
Normalization
No transitive
dependency
between
nonkey
attributes
All
determinants
are candidate
keys - Single
multivalued
dependency
IS 257 – Fall 2004
BoyceCodd and
Higher
Functional
dependency
of nonkey
attributes on
the primary
key - Atomic
values only
Full
Functional
dependency
of nonkey
attributes on
the primary
key
2004.09.22 - SLIDE 12
Unnormalized Relations
• First step in normalization is to convert the
data into a two-dimensional table
• In unnormalized relations data can repeat
within a column
IS 257 – Fall 2004
2004.09.22 - SLIDE 13
Unnormalized Relation
Patient #
Surgeon #
145
1111 311
Surg. date
Patient Name
Jan 1,
1995; June
12, 1995
John White
Patient Addr Surgeon
15 New St.
New York,
NY
243
1234 467
2345 189
Jan 8,
1996
Charles Brown
4876 145
Nov 5,
1995
Hal Kane
5123 145
May 10,
1995
Paul Kosher
Charles
Field
10 Main St. Patricia
Rye, NY
Gold
Dogwood
Lane
Harrison,
David
NY
Rosen
55 Boston
Post Road,
Chester,
CN
Beth Little
Blind Brook
Mamaronec
k, NY
Beth Little
6845 243
Apr 5,
1994 Dec
15, 1984
Ann Hood
Hilton Road
Larchmont, Charles
NY
Field
IS 257 – Fall 2004
Postop drug
Drug side effects
Gallstone
s removal;
Beth Little Kidney
Michael
stones
Penicillin,
Diamond removal
none-
Apr 5,
1994 May
10, 1995
Mary Jones
Surgery
rash
none
Eye
Cataract
removal
Thrombos Tetracyclin Fever
is removal e none
none
Open
Heart
Surgery
Cholecyst
ectomy
Gallstone
s
Removal
Eye
Cornea
Replacem
ent Eye
cataract
removal
Cephalosp
orin
none
Demicillin
none
none
none
Tetracyclin
e
Fever
2004.09.22 - SLIDE 14
First Normal Form
• To move to First Normal Form a relation
must contain only atomic values at each
row and column.
– No repeating groups
– A column or set of columns is called a
Candidate Key when its values can uniquely
identify the row in the relation.
IS 257 – Fall 2004
2004.09.22 - SLIDE 15
First Normal Form
Patient #
Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name
1111
145
01-Jan-95 John White
1111
311
12-Jun-95 John White
15 New St.
New York,
NY
15 New St.
New York,
NY
1234
243
05-Apr-94 Mary Jones
10 Main St.
Rye, NY
1234
467
10-May-95 Mary Jones
2345
4876
5123
6845
6845
IS 257 – Fall 2004
189
145
145
243
243
Charles
08-Jan-96 Brown
10 Main St.
Rye, NY
Dogwood
Lane
Harrison,
NY
05-Nov-95 Hal Kane
55 Boston
Post Road,
Chester,
CN
05-Apr-94 Ann Hood
15-Dec-84 Ann Hood
Hilton Road
Larchmont,
NY
Drug adminSide Effects
Charles Field
Gallstone
s removal
Kidney
stones
removal
Eye
Cataract
removal
Patricia Gold
Thrombos
is removal none
none
David Rosen
Open
Heart
Surgery
none
Beth Little
Cholecyst
ectomy
Demicillin
Beth Little
Michael
Diamond
Blind Brook
Mamaronec
10-May-95 Paul Kosher k, NY
Beth Little
Hilton Road
Larchmont,
NY
Surgery
Penicillin
rash
none
none
Tetracyclin
e
Fever
Cephalosp
orin
Charles Field
Gallstone
s
Removal
none
Eye
Cornea
Replacem Tetracyclin
ent
e
Charles Field
Eye
cataract
removal
none
none
none
Fever
none
2004.09.22 - SLIDE 16
1NF Storage Anomalies
• Insertion: A new patient has not yet undergone
surgery -- hence no surgeon # -- Since surgeon
# is part of the key we can’t insert.
• Insertion: If a surgeon is newly hired and hasn’t
operated yet -- there will be no way to include
that person in the database.
• Update: If a patient comes in for a new
procedure, and has moved, we need to change
multiple address entries.
• Deletion (type 1): Deleting a patient record may
also delete all info about a surgeon.
• Deletion (type 2): When there are functional
dependencies (like side effects and drug)
changing one item eliminates other information.
IS 257 – Fall 2004
2004.09.22 - SLIDE 17
Second Normal Form
• A relation is said to be in Second Normal
Form when every nonkey attribute is fully
functionally dependent on the primary
key.
– That is, every nonkey attribute needs the full
primary key for unique identification
IS 257 – Fall 2004
2004.09.22 - SLIDE 18
Second Normal Form
Patient #
1111
1234
2345
4876
5123
6845
IS 257 – Fall 2004
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
2004.09.22 - SLIDE 19
Second Normal Form
Surgeon #
Surgeon Name
145 Beth Little
189 David Rosen
243 Charles Field
311 Michael Diamond
467 Patricia Gold
IS 257 – Fall 2004
2004.09.22 - SLIDE 20
Second Normal Form
Patient # Surgeon # Surgery Date
1111
1111
1234
1234
2345
4876
Drug Admin Side Effects
145
Gallstones
01-Jan-95 removal
Kidney
Penicillin
rash
311
stones
12-Jun-95 removal
none
none
243
Eye Cataract
05-Apr-94 removal
Tetracycline Fever
467
Thrombosis
10-May-95 removal
189
Open Heart
08-Jan-96 Surgery
Cephalospori
n
none
145
Cholecystect
05-Nov-95 omy
Demicillin
none
none
none
none
none
5123
145
6845
243
6845
243
IS 257 – Fall 2004
Surgery
Gallstones
10-May-95 Removal
Eye cataract
15-Dec-84 removal
Eye Cornea
05-Apr-94 Replacement
none
none
Tetracycline Fever
2004.09.22 - SLIDE 21
1NF Storage Anomalies Removed
• Insertion: Can now enter new patients without
surgery.
• Insertion: Can now enter Surgeons who haven’t
operated.
• Deletion (type 1): If Charles Brown dies the
corresponding tuples from Patient and Surgery
tables can be deleted without losing information
on David Rosen.
• Update: If John White comes in for third time,
and has moved, we only need to change the
Patient table
IS 257 – Fall 2004
2004.09.22 - SLIDE 22
2NF Storage Anomalies
• Insertion: Cannot enter the fact that a particular
drug has a particular side effect unless it is given
to a patient.
• Deletion: If John White receives some other drug
because of the penicillin rash, and a new drug
and side effect are entered, we lose the
information that penicillin can cause a rash
• Update: If drug side effects change (a new
formula) we have to update multiple occurrences
of side effects.
IS 257 – Fall 2004
2004.09.22 - SLIDE 23
Third Normal Form
• A relation is said to be in Third Normal Form if
there is no transitive functional dependency
between nonkey attributes
– When one nonkey attribute can be determined with
one or more nonkey attributes there is said to be a
transitive functional dependency.
• The side effect column in the Surgery table is
determined by the drug administered
– Side effect is transitively functionally dependent on
drug so Surgery is not 3NF
IS 257 – Fall 2004
2004.09.22 - SLIDE 24
Third Normal Form
Patient # Surgeon # Surgery Date
IS 257 – Fall 2004
Surgery
Drug Admin
1111
145
1111
311
01-Jan-95 Gallstones removal
Kidney stones
12-Jun-95 removal
Penicillin
1234
243
05-Apr-94 Eye Cataract removal Tetracycline
1234
467
10-May-95 Thrombosis removal
2345
189
08-Jan-96 Open Heart Surgery
Cephalosporin
4876
145
05-Nov-95 Cholecystectomy
Demicillin
5123
145
10-May-95 Gallstones Removal
none
6845
243
none
6845
243
15-Dec-84 Eye cataract removal
Eye Cornea
05-Apr-94 Replacement
none
none
Tetracycline
2004.09.22 - SLIDE 25
Third Normal Form
Drug Admin
IS 257 – Fall 2004
Side Effects
Cephalosporin
none
Demicillin
none
none
none
Penicillin
rash
Tetracycline
Fever
2004.09.22 - SLIDE 26
2NF Storage Anomalies Removed
• Insertion: We can now enter the fact that a
particular drug has a particular side effect
in the Drug relation.
• Deletion: If John White recieves some
other drug as a result of the rash from
penicillin, but the information on penicillin
and rash is maintained.
• Update: The side effects for each drug
appear only once.
IS 257 – Fall 2004
2004.09.22 - SLIDE 27
Boyce-Codd Normal Form
• Most 3NF relations are also BCNF
relations.
• A 3NF relation is NOT in BCNF if:
– Candidate keys in the relation are composite
keys (they are not single attributes)
– There is more than one candidate key in the
relation, and
– The keys are not disjoint, that is, some
attributes in the keys are common
IS 257 – Fall 2004
2004.09.22 - SLIDE 28
Most 3NF Relations are also BCNF – Is
this one?
Patient # Patient Name Patient Address
15 New St. New
1111 John White York, NY
10 Main St. Rye,
1234 Mary Jones NY
Charles
Dogwood Lane
2345 Brown
Harrison, NY
55 Boston Post
4876 Hal Kane
Road, Chester,
Blind Brook
5123 Paul Kosher Mamaroneck, NY
Hilton Road
6845 Ann Hood
Larchmont, NY
IS 257 – Fall 2004
2004.09.22 - SLIDE 29
BCNF Relations
Patient # Patient Name
IS 257 – Fall 2004
Patient #
1111 John White
1111
1234 Mary Jones
Charles
2345 Brown
1234
4876 Hal Kane
4876
5123 Paul Kosher
5123
6845 Ann Hood
6845
2345
Patient Address
15 New St. New
York, NY
10 Main St. Rye,
NY
Dogwood Lane
Harrison, NY
55 Boston Post
Road, Chester,
Blind Brook
Mamaroneck, NY
Hilton Road
Larchmont, NY
2004.09.22 - SLIDE 30
Fourth Normal Form
• Any relation is in Fourth Normal Form if it
is BCNF and any multivalued
dependencies are trivial
• Eliminate non-trivial multivalued
dependencies by projecting into simpler
tables
IS 257 – Fall 2004
2004.09.22 - SLIDE 31
Fifth Normal Form
• A relation is in 5NF if every join
dependency in the relation is implied by
the keys of the relation
• Implies that relations that have been
decomposed in previous NF can be
recombined via natural joins to recreate
the original relation.
IS 257 – Fall 2004
2004.09.22 - SLIDE 32
Effectiveness and Efficiency Issues for
DBMS
• Focus on the relational model
• Any column in a relational database can
be searched for values.
• To improve efficiency indexes using
storage structures such as BTrees and
Hashing are used
• But many useful functions are not
indexable and require complete scans of
the the database
IS 257 – Fall 2004
2004.09.22 - SLIDE 33
Example: Text Fields
• In conventional RDBMS, when a text field
is indexed, only exact matching of the text
field contents (or Greater-than and Lessthan).
– Can search for individual words using pattern
matching, but a full scan is required.
• Text searching is still done best (and
fastest) by specialized text search
programs (Search Engines) that we will
look at more later.
IS 257 – Fall 2004
2004.09.22 - SLIDE 34
Normalization
• Normalization is performed to reduce or
eliminate Insertion, Deletion or Update
anomalies.
• However, a completely normalized
database may not be the most efficient or
effective implementation.
• “Denormalization” is sometimes used to
improve efficiency.
IS 257 – Fall 2004
2004.09.22 - SLIDE 35
Normalizing to death
• Normalization splits database information
across multiple tables.
• To retrieve complete information from a
normalized database, the JOIN operation
must be used.
• JOIN tends to be expensive in terms of
processing time, and very large joins are
very expensive.
IS 257 – Fall 2004
2004.09.22 - SLIDE 36
Downward Denormalization
Customer
ID
Address
Name
Telephone
Before:
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
IS 257 – Fall 2004
After:
Customer
ID
Address
Name
Telephone
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
2004.09.22 - SLIDE 37
Upward Denormalization
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Item
Order No
Item No
Item Price
Num Ordered
IS 257 – Fall 2004
Order
Order No
Date Taken
Date Dispatched
Date Invoiced
Cust ID
Cust Name
Order Price
Order Item
Order No
Item No
Item Price
Num Ordered
2004.09.22 - SLIDE 38
Denormalization
• Usually driven by the need to improve
query speed
• Query speed is improved at the expense
of more complex or problematic DML
(Data manipulation language) for updates,
deletions and insertions.
IS 257 – Fall 2004
2004.09.22 - SLIDE 39
Using RDBMS to help normalize
• Example database: Cookie
• Database of books, libraries, publisher and
holding information for a shared (union)
catalog
IS 257 – Fall 2004
2004.09.22 - SLIDE 40
Cookie relationships
IS 257 – Fall 2004
2004.09.22 - SLIDE 41
Cookie BIBFILE relation
ACCNO
A003
T082
C024
B006
B007
B005
B008
B010
B009
B012
B011
B014
B013
B016
B017
F047
B116
S102
B118
B018
C031
C032
C034
AUTHOR
TITLE
LOC
PUBID DATE PRICE
AMERICAN LIBRARY ASSOCIATION
ALA BULLETIN
CHICAGO
04
$3.00
ANDERSON, THEODORE
THE TEACHING OF MODERN
PARIS LANGUAGES
53
1955
$10.95
AXT, RICHARD G.
COLLEGE SELF STUDY
BOULDER,
: LECTURES
CO. ON
51INSTITU
1960
$7.00
BALDERSTON, FREDERICK
MANAGING
E.
TODAYS SAN
UNIVERSITY
FRANCISCO 27
1975
$6.00
BARZUN, JACQUES
TEACHER IN AMERICA
GARDEN CITY
18
1954
$7.00
BARZUN, JACQUES
THE AMERICAN UNIVERSITY
NEW YORK
: HOW IT 24
RUNS, W
1970
$5.00
BARZUN, JACQUES
THE HOUSE OF INTELLECT
NEW YORK
24
1961
$8.00
BELL, DANIEL
THE COMING OF POST-INDUSTRIAL
NEW YORK
SOCIETY
09
1976
:
$10.00
BENSON, CHARLES S. IMPLEMENTING THE SAN
LEARNING
FRANCISCO
SOCIETY
27
1974
$9.00
BERG, IVAR
EDUCATION AND JOBS
BOSTON
: THE GREAT TRAINING
10
1971
$12.00
BERSI, ROBERT M.
RESTRUCTURING THE
WASHINGTON,
BACCALAUREATE
D.C.
03
1973
$11.00
BEVERIDGE, WILLIAM I.THE ART OF SCIENTIFIC
NEWINVESTIGATION
YORK
58
1957
$14.00
BIRD, CAROLINE
THE CASE AGAINST NEW
COLLEGE
YORK
08
1975
$13.00
BISSELL, CLAUDE T. THE STRENGTH OF THE
TORONTO
UNIVERSITY 57
1968
$14.00
BLAIR, GLENN MYERS EDUCATIONAL PSYCHOLOGY
NEW YORK
30
1962
$11.00
BLAKE, ELIAS, JR.
THE FUTURE OF THECAMBRIDGE,
BLACK COLLEGES
MA.02
1971
$14.25
BOLAND, R.J.
CRITICAL ISSUES IN INFORMATION
CHICHESTER, ENG.
SYSTEMS
63
1987
R
$30.95
BROWN, SANBORN C., SCIENTIFIC
ED.
MANPOWER
CAMBRIDGE, MASS.
29
1971
$4.00
BUCKLAND, MICHAEL K.LIBRARY SERVICES ELMSFORD,
IN THEORY AND
NY CONTEXT
70
1983
$12.00
BUDIG, GENE A.
ACADEMIC QUICKSAND
LINCOLN,
: SOME
NEBRASKA
TRENDS
37 AND1973
ISS
$13.00
CALIFORNIA. DEPT. OFLAW
JUSTICE
IN THE SCHOOLMONTCLAIR, N.J. 35
1974
$0.50
CAMPBELL, MARGARET
WHY
A. WOULD A GIRLOLD
GO INTO
WESTBURY,
MEDICINE?
48
N.Y. 1973
$1.50
CARNEGIE COMMISSION
A DIGEST
ON HIGHER
OF REPORTS
NEW
OF
YORK
THE CARNEGIE
30
COMM
1974
$3.50
IS 257 – Fall 2004
PAGINATION
ILL
HEIGHT
63 V.
ILL.
26
294 P.
22
X, 300 P. GRAPHS
28
XVI, 307 P.
24
280 P.
18
XII, 319 P.
20
VIII, 271 P.
21
XXVII, 507 P.
21
XVII, 147 P.
24
XX, 200 P.
21
IV, 160P.
23
XIV, 239 P.
18
XII, 308 P.
18
VII, 251 P.
21
678 P.
24
VIII, PP. 539
23
XV, 394 P. ILL.
24
X, 180 P.
26
XII, 201 P. ILL.
23
74 P.
23
IV, 87 P.
21
V, 114 P.
24
399 P.
24
2004.09.22 - SLIDE 42
How to Normalize?
• Currently no way to have multiple authors
for a given book, and there is duplicate
data spread over the BIBFILE table
• Can we use the DBMS to help us
normalize?
• Access example…
IS 257 – Fall 2004
2004.09.22 - SLIDE 43
Next Week
• Physical DB Design and Access Methods
IS 257 – Fall 2004
2004.09.22 - SLIDE 44