Database design 2
Download
Report
Transcript Database design 2
Database Design - Normalization
Normalization
are a set of techniques for
organizing data into tables in order to...
– Eliminate most redundancy
– Prevent incompleteness
Much
of this presentation is devoted to a
rather long example... will work through it
together
1
Two Steps to Normalization
1.
Put the data into tabular form (by
removing repeating groups)
2. Remove duplicated data into separate
tables
2
First - A Simple Example
Graphic
2-1 - Employee Qualifications
3
Put Data Into Tabular Form
Problem:
What do we do if an employee
has more than one qualification?
Graphic
2-2
4
1. Divide Data Into Two Tables
Employee Table
– Emp#, Name, Dept#, DeptName, DeptLocation
Qualification Table
– Emp#, Qualification Desc, Qualification Year
Graphic
2-3
5
2. Remove Duplicated Data
Notice
that Dept. Number "05" is
"Auditing" and is located at "HO"
It is repeated for every employee in that
department
– This wastes space
– Makes updating data more complicated
– "Elegance" rule is violated
6
Whose Data is it Anyway?
The
basic problem is that department name
and location are really data about
departments rather than employees
It belongs in a separate Department table
Graphic
2-4
7
A Very Basic Example
This
example was presented informally
The rules of normalization have their
foundation in mathematics
– On one hand, we can have confidence in
normalization as a technique
– On the other hand, it's very easy to become lost
in mathematical terminology and proofs
– Remember that data modeling is "design" and
we should be careful about anything that lead
us to one "right" answer.
8
Relational Notation
The
sample tables we've seen so far take up
a lot of space
We need a more concise notation
If we eliminate the sample rows, we are left
with the table names and columns
Graphic
2-5
9
Relational Notation - cont.
Text
books usually use data displayed as in
Graphic 2-5
Designers usually want a little more
information
Graphic
2-6
10
Exercise
Normalize
the database shown in the
graphic
Take about 10 minutes
Remember two steps...
– Put the information in tabular format
– Eliminate redundant information
Graphic
2-E1
11
A More Complex Example
We'll
introduce the rules of normalization as
we proceed
The rules can be daunting at first, but we'll
look at the problems they solve
12
Hospital Survey Example
The
form displayed in Graphic 2-7 is one
used in an actual survey of antibiotic drug
usage in Australian hospitals
Survey used to determine which drugs and
dosages were being used for various
operations, to ensure that patients were
properly prescribed for, and that the public
was not paying for unnecessary drugs
13
One Form for Each Operation
Each
hospital in the survey was given a
unique hospital number
All hospital numbers were prefixed with 'H'
Hospitals fell into three categories:
– 'P' for public
– 'V for private
– 'T' for training
» All training hospitals were public so,
'T' implied 'P'
14
Operations
All
Operation Numbers were assigned
sequentially be each hospital
Operation Code was a standard
international code for the named operation
Procedure Group was a broader
classification
15
Surgeons
Surgeon
Number was allocated by
individual hospitals to allow surgeons to
remain anonymous
The prefix 'S' stood for surgeon
16
Drugs
The
Total Drug Cost was the total cost of
all drug doses for the operation
The bottom of the form recorded the
individual antibiotic drugs used in the
operation
The Drug Code consisted of the short name
for the drug and the size of the dose
17
As the Study Continued...
...
It was decided to replace the heaps of
forms with a computerized database
Graphic 2-8 shows the initial database
design using the relational notation
This was done by one person, who was the
data modular, the physical database
designer, and the programmer
18
Exercise
Normalize
the database as shown in
Graphic 2-8
Take about 15 minutes
Remember two steps...
– Put the information in tabular format
– Eliminate redundant information
19
Determining Columns
Normalization
relies on certain assumptions
about the way data is represented
– We need to make sure that these are valid
There
are some problems normalization
does not solve
– It is better to address these at the outset, rather
than carry excess baggage through the whole
normalization process
The
following steps are necessary...
20
One Fact per Column
First
we make sure that each column in the
table represents one fact only
– The Drug code column holds both a short name
for the drug and a dosage size
– The Dosage size consists of a numeric size and
a unit of measure
– The Hospital Category really provides two
facts
» Hospital Public or Private?
» Does the hospital provide training?
21
Hidden Data
Make
sure that we have not lost any data in
the translation to tabular form
Common problem: We cannot rely on the
rows of the table being stored in any
particular order
– Suppose the original survey forms had been
filed in order of return
– We would need a Return Date or Sequence
column
22
Derivable Data
We
need to remove any data that can be
derived from other data in the table and
amend the columns accordingly
– Remember our basic objective is nonredundancy
– The Total drug cost was derived by adding
together the Drug Costs multiplied by the
Number of Doses
23
Determining the Key
Key
- a minimal set of columns that hold a
different combination of values for each
row in the table
The value of the Key uniquely identifies
one row on the table
– The combination of Hospital Number and
Operation number
– In relational notation, underline the key
column(s)
24
Exercise
Using
Graphic 2-8, Clean up the design (i.e.
list all the columns that are valid in the new
design)
Remember...
–
–
–
–
One fact per column
Don't lose hidden data
Remove data that can be derived
Determine the Key
Take
about 10 minutes
25
After Tidying Up...
Graphic
2-9
26
Repeating Groups & 1st Normal Form
Our
first task was to put the data in tabular
format…
– it might seem like we've done this
– but we've actually hidden a problem about the
drug administered data
27
Limiting Number of Occurrences
The
drug administration data is the major
cause of the tables complexity and
inelegance
– Drug Short Name 2, Drug Short Name 4 etc.
The
columns needed to accommodate up to
four drugs account for most of the
complexity
28
Why Only Four Drugs?
Why
not five or six or more?
Four drugs represented the maximum
arrived at by asking one of the survey teams
"What would be the maximum number of
drugs used in an operation?"
In fact, this number was frequently
exceeded
– some operations had 10 or more
29
What's the Problem?
Part
of the problem is that the question was
poorly put… a line on the form was
required for each drug-dose combination,
rather than just for each different drug
The maximum number of drugs could
increase later, so this model still rates
poorly against the stability criterion
30
Paper Over the Differences
With
the paper form, a continuation sheet
was simply attached when more drugs were
needed
We could add more columns to the table
easily, but the application program changes
would be much more difficult
31
What Was Done...
The
original designer decided to handle
continuations by suffixing the operation
number with "a", "b", or "c" to indicate a
continuation
This caused program changes and
comprised the original simplicity of the
system
32
Data Reusability & Program
Complexity
The
main difficulties are with data
reusability and program complexity
Program can easily answer questions…
– How many operations were performed by
neurosurgeons?
– Which hospital is spending the most money on
drugs?
But
not …
– How much money was spent on Ampicillin?
33
Another Way
You
might argue that some queries are
always going to be more difficult than
others
What would happen if we had designed the
table on the basis of "one row per drug"?
34
Recognizing Repeating Groups
A Set
of columns repeated a number of
times - a "repeating group" - results in…
– inflexibility
– complexity
– poor data reusability
Graphic
2-10
– curly brackets indicate a repeating group with
an indefinite number of occurrences
35
Solution
A general
and flexible solution should not
set any limit on the maximum number of
occurrences
It should also handle of few or no
occurrences (the drug-free operation)
36
Removing Repeating Groups
First
step of Normalization:
– Put the data in tabular form by identifying and
eliminating repeating group
Split
the file into two tables
– basic operation data
– (repeating) drug administration table
37
Split into Two Tables
Remove
all repeating group columns to a
new table (each occurrence of group is a
row in the new table
Include the primary key of the original table
in the new table (this makes a foreign key
in the new table)
38
Split into Two Tables - cont.
Add
a 'Sequence' column if needed
Name the new table
Identify and underline the primary key of
the new table
39
Exercise
Using
Graphic 2-10 remove all repeating
groups
40
Solution
Graphic
2-11
41
Determining Key of New Table
Not
always an easy task
Question:
What is the minimum number of
combination of columns needed to uniquely
identify one row?
42
A Six Column Key
Hospital
Number - FK
Operation Number - FK
Drug Short Name
Dose
Unit of Measure
Method of Administration
43
First Normal Form
Our
tables are now technically in first
normal form (1NF). What have we
achieved?
– All data of the same kind in now held in the
same place
– The number of different drug dosages that can
be recorded for an operation is effectively
unlimited
– An operation that don't use any drugs is
allowed
44
Problems with First Normal Form
Look
at Operation table in Graphic 2-11
Every row for an operation (e.g. Hospital
number 17) will contain the facts that its
name is St. Vincent's and Fred Fleming is
the contact person
Criterion of non-redundancy is not being
met
45
More Problems
Change
a fact about a hospital and you will
have to change it for every operation in the
hospital
If
we delete the last operation for the
hospital, then we effectively delete the
hospital
46
Eliminating Redundancy
Solve
all these problems by removing the
hospital information to a separate table, in
which each hospital number appears once
only (which becomes the primary key)
Graphic
2-12
47
Determinants
For
a given hospital number there could be
only one...
–
–
–
–
hospital name
hospital type
contact person
training status
The
hospital number is a determinant for
the other columns
48
Formal Procedure
1.
Identify and determinants (other than
the primary key) and the columns they
determine
2. Create a separate table for each
determinant and its columns (the
determinant becomes the primary key)
3. Name the new table
4. Remove the determined columns from
the original table, leave the determinant
49
Other Determinants
Hospital
number + Surgeon Number =>
Surgeon Specialty
Operation code => Operation Name,
Procedure Group
Drug Short Name => Drug Name,
Manufacturer
Drug Short Name + Method of
Administration + Size of Dose + Unit of
Measure = Cost of Dose
50
Exercise
Finish
normalizing Graphic 2-10
51
Third Normal Form
Graphic
This
2-13 is the final model
model is in Third Normal Form (3NF)
52
Where is Second Normal Form?
Our
approach took us directly to Third
Normal Form
Most texts treat this as a two stage process
– deal first with the determinants that are part of
the table's key (Second Normal Form)
– then with the non-key determinants
53
Is Third Normal Form the End?
Unfortunately,
no
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form
We'll discuss later...
54
Third Normal Form Hint
Every
non-key column must be a fact about
the key, the whole key, and nothing but the
key
55
Candidate Keys
Sometime
more than one column or
combination of columns could serve as a
primary key (e.g. Drug Name instead of
Drug Short Name)
These are called Candidate Keys
In step 1 we should really say...
– Identify any determinants, other than candidate
keys...
56
Foreign Keys
When
we removed repeating groups we
carried the primary key of the original table
to cross-reference back
These columns are called Foreign Keys
57
Self-Referencing Tables
A Foreign
Key may refer back to the Primary
Key of the same table
– Example: An Employees table might have
employee id as its Primary Key and a Foreign Key
of Manager id which refers to another row in the
same table
The
convention to represent Foreign Keys is
an asterisk...
– EMPLOYEE (Employee Id, Name, Manager id*
58
Referential Integrity
If
the Employees table has a Department Id
foreign key referring back to the
Departments table's Primary key, then we
would expect to find a valid Department Id
on the Departments table for every value of
Department id in the Employees table
If not, our database lacks Referential
Integrity
59
Summary
Normalization
is a set of techniques for
organizing data into tables to eliminate
certain types of redundancy an
incompleteness
Normalization relies on correct
identification of determinants and keys
60
Last Slide - Normalization
Assignment
#9 due next week
61