Information Organization and Retrieval - Courses

Download Report

Transcript Information Organization and Retrieval - Courses

Database Design: Normalization
University of California, Berkeley
School of Information Management and
Systems
SIMS 202: Information Organization and
Retrieval
8/28/97
Information Organization and Retrieval
Review
• Entity-Relationship Diagrams
• Designing a database
8/28/97
Information Organization and Retrieval
Entities
•
•
•
•
•
•
•
•
Customer
Invoice
Employee
Inventory
Supplier
Account
Sales Rep
Parts
8/28/97
• Timecard
• Check
Information Organization and Retrieval
Functional areas
•
•
•
•
•
•
•
Ordering
Inventory
Supplies
Shipping
Personnel
Payroll
We will concentrate on Ordering and
Inventory
8/28/97
Information Organization and Retrieval
Ordering Normalization
Rep#
Sales-Rep
Part#
Cust#
Customer
Writes
Orders
Invoice
Invoice#
Rep#
Cust#
8/28/97
Information Organization and Retrieval
Invoice#
Contains
Quantity
Line-Item
Emp#
Wage
Employee
ISA
Hourly
ER Model
Company#
Sales
Sales-Rep
Part#
Part# Quantity
Company#
Cust#
Customer
Writes
Orders
Invoice
Invoice# Rep#
Invoice# Quantity
Contains
Line-Item
Ordered
Part
Has
Contains
Cust#
On-Order
Supplied
Part
Part
Part#
Count
Price
8/28/97
Supplier
Information Organization and Retrieval
Supplies
Part#
Cost
Company#
Mapping to a Relational Model
• Each entity in the ER Diagram becomes a
relation.
• A properly normalized ER diagram will
indicate where intersection relations for
many-to-many mappings are needed.
• Relationships are indicated by common
columns (or domains) in tables that are
related.
• We will examine the tables for the Acme
Widget Company derived from the ER
diagram
8/28/97
Information Organization and Retrieval
Employee
SSN
123-76-3423
342-88-7865
486-87-6543
843-36-7659
…
8/28/97
Lastname
Jones
Smith
Hendersen
Martinez
…
Firstname
Janet
Thomas
Charles
Roberto
…
Middlename Birthdate Address
City
Mary
6/25/63 234 State Berkeley
Frederick
8/4/70 12 Lambert Oakland
Robert
9/23/61 44 Central Berkeley
Garcia
7/8/58 76 Highland Berkeley
…
…
…
…
Information Organization and Retrieval
Sales-Rep
SSN
Rep # Sales
123-76-3423
1 $12,345.45
843-36-7659
2 $231,456.75
Hourly
SSN
Wage
342-88-7865
$12.75
486-87-6543
$20.50
8/28/97
Information Organization and Retrieval
Customer
Cust #
COMPANY
STREET1
Integrated Standards
1 Ltd.
35 Broadway
STREET2
STATE
ZIPCODE
New York
NY
02111
34 Bureaucracy Plaza Floors 1-172
3 Control Elevation
Cyber Assicates
Place
Center
Phildelphia
PA
03756
Cyberoid
NY
08645
35 Libra Plaza
Nashua
NH
09242
1 Broadway
Middletown
IN
32467
88 Oligopoly Place
3 Independence
Parkway
Sagrado
TX
78798
Rivendell
CA
93456
8 Little Mighty Micro
34 Last One Drive
Orinda
CA
94563
9 SportLine Ltd.
38 Champion Place
Compton
CA
95328
2 MegaInt Inc.
3 Cyber Associates
General
4 Consolidated
Consolidated
5 MultiCorp
Internet Behometh
6 Ltd.
Consolidated
7 Brands, Inc.
8/28/97
Floor 12
CITY
Suite 882
Information Organization and Retrieval
Invoice
Invoice # Cust #
93774
84747
88367
88647
776879
65689
8/28/97
Rep #
3
4
5
9
2
6
Information Organization and Retrieval
1
1
2
1
2
2
Line-Item
Invoice # Part #
Quantity
93774
3
10
84747
23
1
88367
75
2
88647
4
3
776879
22
5
65689
76
12
93774
23
10
88367
34
2
8/28/97
Information Organization and Retrieval
Part
Part #
1
2
3
4
5
6
7
8
9
8/28/97
Name
Price
Count
Big blue widget
3.76
2
Small blue Widget
7.35
4
Tiny red widget
5.25
7
large red widget
157.23
23
double widget rack
10.44
12
Small green Widget
30.45
58
Big yellow widget
7.96
1
Tiny orange widget
81.75
42
Big purple widget
55.99
9
Information Organization and Retrieval
Joins
Part #
Invoice # Part #
Quantity
93774
3
10
84747
23
1
88367
75
2
88647
4
3
776879
22
5
65689
76
12
93774
23
10
88367
34
2
1
2
3
4
5
6
7
8
9
Cust #
COMPANY
STREET1
Integrated Standards
1 Ltd.
35 Broadway
8/28/97
Rep #
3
4
5
9
2
6
1
1
2
1
2
2
STREET2
STATE
ZIPCODE
NY
02111
34 Bureaucracy Plaza Floors 1-172
3 Control Elevation
Cyber Assicates
Place
Center
Phildelphia
PA
03756
Cyberoid
NY
08645
35 Libra Plaza
Nashua
NH
09242
1 Broadway
Middletown
IN
32467
88 Oligopoly Place
3 Independence
Parkway
Sagrado
TX
78798
Rivendell
CA
93456
8 Little Mighty Micro
34 Last One Drive
Orinda
CA
94563
9 SportLine Ltd.
38 Champion Place
Compton
CA
95328
3 Cyber Associates
General
4 Consolidated
Consolidated
5 MultiCorp
Internet Behometh
6 Ltd.
Consolidated
7 Brands, Inc.
Information Organization and Retrieval
Floor 12
CITY
New York
2 MegaInt Inc.
Invoice # Cust #
93774
84747
88367
88647
776879
65689
Name
Price
Count
Big blue widget
3.76
2
Small blue Widget
7.35
4
Tiny red widget
5.25
7
large red widget
157.23
23
double widget rack
10.44
12
Small green Widget
30.45
58
Big yellow widget
7.96
1
Tiny orange widget
81.75
42
Big purple widget
55.99
9
Suite 882
Today
• Normalization
• Effectiveness and Efficiency criteria for
database designs
• Advantages and failings of DBMS
technology
8/28/97
Information Organization and Retrieval
Normalization
• Normalization theory is based on the
observation that relations with certain
properties are more effective in inserting,
updating and deleting data than other sets of
relations containing the same data
• Normalization is a multi-step process
beginning with an “unnormalized” relation
– Hospital example from Atre, S. Data Base: Structured Techniques for
Design, Performance, and Management.
8/28/97
Information Organization and Retrieval
Normal Forms
•
•
•
•
•
•
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
8/28/97
Information Organization and Retrieval
Normalization
No transitive
dependency
between
nonkey
attributes
All
determinants
are candidate
keys - Single
multivalued
dependency
8/28/97
BoyceCodd and
Higher
Information Organization and Retrieval
Functional
dependencyof
nonkey
attributes on
the primary
key - Atomic
values only
Full
Functional
dependencyof
nonkey
attributes on
the primary
key
Unnormalized Relations
• First step in normalization is to convert the
data into a two-dimensional table
• In unnormalized relations data can repeat
within a column
8/28/97
Information Organization and Retrieval
Unnormalized Relation
Patient #
Surgeon #
145
1111 311
Surg. date
Patient Name
Jan 1,
1995; June
12, 1995
John White
Patient Addr Surgeon
15 New St.
New York,
NY
243
1234 467
2345 189
Jan 8,
1996
Charles Brown
4876 145
Nov 5,
1995
Hal Kane
5123 145
May 10,
1995
Paul Kosher
Charles
Field
10 Main St. Patricia
Rye, NY
Gold
Dogwood
Lane
Harrison,
David
NY
Rosen
55 Boston
Post Road,
Chester,
CN
Beth Little
Blind Brook
Mamaronec
k, NY
Beth Little
6845 243
Apr 5,
1994 Dec
15, 1984
Ann Hood
Hilton Road
Larchmont, Charles
NY
Field
8/28/97
Postop drug
Drug side effects
Gallstone
s removal;
Beth Little Kidney
Michael
stones
Penicillin,
Diamond removal
none-
Apr 5,
1994 May
10, 1995
Mary Jones
Surgery
Information Organization and Retrieval
rash
none
Eye
Cataract
removal
Thrombos Tetracyclin Fever
is removal e none
none
Open
Heart
Surgery
Cholecyst
ectomy
Gallstone
s
Removal
Eye
Cornea
Replacem
ent Eye
cataract
removal
Cephalosp
orin
none
Demicillin
none
none
none
Tetracyclin
e
Fever
First Normal Form
• To move to First Normal Form a relation
must contain only atomic values at each row
and column.
– No repeating groups
– A column or set of columns is called a
Candidate Key when its values can uniquely
identify the row in the relation.
8/28/97
Information Organization and Retrieval
First Normal Form
Patient #
Surgeon # Surgery DatePatient Name Patient Addr Surgeon Name
1111
145
01-Jan-95 John White
1111
311
12-Jun-95 John White
15 New St.
New York,
NY
15 New St.
New York,
NY
1234
243
05-Apr-94 Mary Jones
10 Main St.
Rye, NY
1234
467
10-May-95 Mary Jones
2345
4876
5123
6845
6845
8/28/97
189
145
145
243
243
Charles
08-Jan-96 Brown
10 Main St.
Rye, NY
Dogwood
Lane
Harrison,
NY
05-Nov-95 Hal Kane
55 Boston
Post Road,
Chester,
CN
05-Apr-94 Ann Hood
15-Dec-84 Ann Hood
Hilton Road
Larchmont,
NY
Drug adminSide Effects
Charles Field
Gallstone
s removal
Kidney
stones
removal
Eye
Cataract
removal
Patricia Gold
Thrombos
is removal none
none
David Rosen
Open
Heart
Surgery
none
Beth Little
Cholecyst
ectomy
Demicillin
Beth Little
Michael
Diamond
Blind Brook
Mamaronec
10-May-95 Paul Kosher k, NY
Beth Little
Hilton Road
Larchmont,
NY
Surgery
Penicillin
rash
none
none
Tetracyclin
e
Fever
Cephalosp
orin
Charles Field
Gallstone
s
Removal
none
Eye
Cornea
Replacem Tetracyclin
ent
e
Charles Field
Eye
cataract
removal
Information Organization and Retrieval
none
none
none
Fever
none
1NF Storage Anomalies
• Insertion: A new patient has not yet undergone
surgery -- hence no surgeon # -- Since surgeon # is
part of the key we can’t insert.
• Insertion: If a surgeon is newly hired and hasn’t
operated yet -- there will be no way to include that
person in the database.
• Update: If a patient comes in for a new procedure,
and has moved, we need to change multiple
address entries.
• Deletion (type 1): Deleting a patient record may
also delete all info about a surgeon.
• Deletion (type 2): When there are functional
dependencies (like side effects and drug) changing
one item eliminates other information.
8/28/97
Information Organization and Retrieval
Second Normal Form
• A relation is said to be in Second Normal
Form when every nonkey attribute is fully
functionally dependent on the primary key.
– That is, every nonkey attribute needs the full
primary key for unique identification
8/28/97
Information Organization and Retrieval
Second Normal Form
Patient #
1111
1234
2345
4876
5123
6845
8/28/97
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
Information Organization and Retrieval
Second Normal Form
Surgeon #
Surgeon Name
145 Beth Little
189 David Rosen
243 Charles Field
311 Michael Diamond
467 Patricia Gold
8/28/97
Information Organization and Retrieval
Second Normal Form
Patient # Surgeon # Surgery Date
1111
1111
1234
1234
2345
4876
8/28/97
Surgery
Drug Admin Side Effects
145
Gallstones
01-Jan-95 removal
Kidney
Penicillin
rash
311
stones
12-Jun-95 removal
none
none
243
Eye Cataract
05-Apr-94 removal
Tetracycline Fever
467
Thrombosis
10-May-95 removal
189
Open Heart
08-Jan-96 Surgery
Cephalospori
n
none
145
Cholecystect
05-Nov-95 omy
Demicillin
none
none
none
none
none
5123
145
6845
243
6845
243
Gallstones
10-May-95 Removal
Eye cataract
15-Dec-84 removal
Eye Cornea
05-Apr-94 Replacement
Information Organization and Retrieval
none
none
Tetracycline Fever
1NF Storage Anomalies
Removed
• Insertion: Can now enter new patients without
surgery.
• Insertion: Can now enter Surgeons who haven’t
operated.
• Deletion (type 1): If Charles Brown dies the
corresponding tuples from Patient and Surgery
tables can be deleted without losing information
on David Rosen.
• Update: If John White comes in for third time, and
has moved, we only need to change the Patient
table
8/28/97
Information Organization and Retrieval
2NF Storage Anomalies
• Insertion: Cannot enter the fact that a particular
drug has a particular side effect unless it is given
to a patient.
• Deletion: If John White receives some other drug
because of the penicillin rash, and a new drug and
side effect are entered, we lose the information
that penicillin can cause a rash
• Update: If drug side effects change (a new
formula) we have to update multiple occurrences
of side effects.
8/28/97
Information Organization and Retrieval
Third Normal Form
• A relation is said to be in Third Normal
Form if there is no transitive functional
dependency between nonkey attributes
– When one nonkey attribute can be determined
with one or more nonkey attributes there is said
to be a transitive functional dependency.
• The side effect column in the Surgery table
is determined by the drug administered
– Side effect is transitively functionally
dependent on drug so Surgery is not 3NF
8/28/97
Information Organization and Retrieval
Third Normal Form
Patient # Surgeon # Surgery Date
8/28/97
Surgery
Drug Admin
1111
145
1111
311
01-Jan-95 Gallstones removal
Kidney stones
12-Jun-95 removal
1234
243
05-Apr-94 Eye Cataract removal Tetracycline
1234
467
10-May-95 Thrombosis removal
2345
189
08-Jan-96 Open Heart Surgery
Cephalosporin
4876
145
05-Nov-95 Cholecystectomy
Demicillin
5123
145
10-May-95 Gallstones Removal
none
6845
243
none
6845
243
15-Dec-84 Eye cataract removal
Eye Cornea
05-Apr-94 Replacement
Information Organization and Retrieval
Penicillin
none
none
Tetracycline
Third Normal Form
Drug Admin
8/28/97
Side Effects
Cephalosporin
none
Demicillin
none
none
none
Penicillin
rash
Tetracycline
Fever
Information Organization and Retrieval
2NF Storage Anomalies
Removed
• Insertion: We can now enter the fact that a
particular drug has a particular side effect in
the Drug relation.
• Deletion: If John White recieves some other
drug as a result of the rash from penicillin,
but the information on penicillin and rash is
maintained.
• Update: The side effects for each drug
appear only once.
8/28/97
Information Organization and Retrieval
Boyce-Codd Normal Form
• Most 3NF relations are also BCNF
relations.
• A 3NF relation is NOT in BCNF if:
– Candidate keys in the relation are composite
keys (they are not single attributes)
– There is more than one candidate key in the
relation, and
– The keys are not disjoint, that is, some
attributes in the keys are common
8/28/97
Information Organization and Retrieval
Most 3NF Relations are also
BCNF
Patient #
1111
1234
2345
4876
5123
6845
8/28/97
Patient Name Patient Address
15 New St. New
John White York, NY
10 Main St. Rye,
Mary Jones NY
Charles
Dogwood Lane
Brown
Harrison, NY
55 Boston Post
Hal Kane
Road, Chester,
Blind Brook
Paul Kosher Mamaroneck, NY
Hilton Road
Ann Hood
Larchmont, NY
Information Organization and Retrieval
Effectiveness and Efficiency
Issues for DBMS
• Focus on the relational model
• Any column in a relational database can be
searched for values.
• To improve efficiency indexes using storage
structures such as BTrees and Hashing are
used
• But many useful functions are not indexable
and require complete scans of the the
database
8/28/97
Information Organization and Retrieval
Example: Text Fields
• In conventional RDBMS, when a text field
is indexed, only exact matching of the text
field contents (or Greater-than and Lessthan).
– Can search for individual words using pattern
matching, but a full scan is required.
• Text searching is still done best (and fastest)
by specialized text search programs (Search
Engines) that we will look at more later.
8/28/97
Information Organization and Retrieval
Normalizing to death
• Normalization splits database information
across multiple tables.
• To retrieve complete information from a
normalized database, the JOIN operation
must be used.
• JOIN tends to be expensive in terms of
processing time, and very large joins are
very expensive.
8/28/97
Information Organization and Retrieval
Advantages of RDBMS
• Possible to design complex data storage and
retrieval systems with ease (and without
programming).
• Support for ACID transactions
–
–
–
–
8/28/97
Atomic
Consistent
Independent
Durable
Information Organization and Retrieval
Advantages of RDBMS
• Support for very large databases
• Automatic optimization of searching (when
possible)
• RDBMS have a simple view of the
database that conforms to much of the data
used in businesses.
• Standard query language (SQL)
8/28/97
Information Organization and Retrieval
Disadvantages of RDBMS
• Until recently, no support for complex
objects such as documents, video, images,
spatial or time-series data. (ORDBMS are
adding support these).
• Often poor support for storage of complex
objects. (Disassembling the car to park it in
the garage)
• Still no efficient and effective integrated
support for things like text searching within
fields.
8/28/97
Information Organization and Retrieval
Assignment
• Examine the Cookie database using Access
and look at the ER Diagram for it posted on
the assignments page.
• Consider the possibilities of Book
publications
– What are the problems with the database?
– What new fields would you add to the database,
and where?
– Draw a new ER diagram showing your design.
8/28/97
Information Organization and Retrieval
pubid
Cookie ER diagram
accno
BIBFILE
CALLFILE
Has call
accno
Libid
Callno
publishes
Has index
accno
8/28/97
PUBFILE
INDXFILE
libid
pubid
subid
Information Organization and Retrieval
Library
Address, etc
SUBFILE
Has subject
subid
LIBFILE
Has copy
subject