Transcript Job
Database Normalization
What is Normalization
Normalization allows us to organize data so
that it:
• Allows faster access (dependencies make sense)
• Reduced space (less redundancy)
Data Normalization
Primarily a tool to validate and improve a
logical design so that it satisfies certain
constraints that avoid unnecessary
duplication of data
The process of decomposing relations with
anomalies to produce smaller, wellstructured relations
Results of Normalization
Removes the following modification
anomalies (integrity errors) with the
database
• Insertion
• Deletion
• Update
ANOMALIES
Insertion
• inserting one fact in the database requires knowledge
of other facts unrelated to the fact being inserted
Deletion
• Deleting one fact from the database causes loss of
other unrelated data from the database
Update
• Updating the values of one fact requires multiple
changes to the database
Insertion Anomaly
Suppose we want to enter the new Job 'Cooker' having $800 as a salary.
We cannot enter the data into the EMPLOYEE relation until an employee
enrolls in this Job.
This restriction seems silly. Why should we have to wait until someone
takes the Job before we can record its salary? This situation is called
insertion anomaly. It is due to the relational model constraint that disables
'Null values' for the primary key (EMPID).
EMPLOYEE
EMPID
NAME
JOB
200
300
600
700
450
Smith
John
George
Jean
Mike
Driver
Guardian
Guardian
Gardener
Gardener
SALAR
Y
1500
2000
2000
2500
2500
Insertion Anomaly
These anomalies can be eliminated by decomposing the
employee relation into the two ones shown in the
following figure.
EMP-JOB
EmpId
200
300
600
700
450
Name
Smith
John
George
Jean
Mike
JOB-SAL
Job
driving
Guardian
Guardian
Gardener
Gardener
Job
Salary
Driving
Guardian
1500
2000
Gardener
2500
ANOMALIES EXAMPLES
TABLE: COURSE
COURSE#
SECTION#
C_NAME
CIS564
072
Database Design
CIS564
073
Database Design
CIS570
072
Oracle Forms
CIS564
074
Database Design
ANOMALIES EXAMPLES
Insertion:
Suppose our College has approved a new
course called CIS 563: SQL & PL/SQL.
Can this information about the new course be
entered (inserted) into the table COURSE in its
present form?
COURSE#
SECTION#
C_NAME
CIS564
072
Database Design
CIS564
073
Database Design
CIS570
072
Oracle Forms
CIS564
074
Database Design
Deletion anomaly
For the data in following figure (where we have only one
driver), if we delete the employee number 200, we will lose
not only that this employee is a driver, but also that driving
costs' $1500. This is called a deletion anomaly; we are losing
more information than we want to. We lose facts about two
entities with one deletion.
EMPLOYEE
EMPID
200
300
600
700
450
NAME
Smith
John
George
Jean
Mike
JOB
Driver
Guardian
Guardian
Gardener
Gardener
SALARY
1500
2000
2000
2500
2500
Update anomaly
Example1: If the salary of the job "Guardian" changes
from $2000 to $2100, the updating operation must be
repeated with each Guardian employee.
EMPLOYEE
EMPID
200
300
600
700
450
NAME
Smith
John
George
Jean
Mike
JOB
Driver
Guardian
Guardian
Gardener
Gardener
SALARY
1500
2000
2000
2500
2500
ANOMALIES EXAMPLES
Update:
Suppose the course name (C_Name) for
CIS 564 got changed to Database Management.
How many times do you have to make this
change in the COURSE table in its current
form?
COURSE#
SECTION#
C_NAME
CIS564
072
Database Design
CIS564
073
Database Design
CIS570
072
Oracle Forms
CIS564
074
Database Design
ANOMALIES
Following Figure shows the relations issued from the decomposition
phase. Now the previous anomalies are avoided so then we can :
Delete the employee Smith from EMP-JOB without losing the fact
that driving has a salary $1500,
Insert a new job in the JOB-SAL table without enrolled employees,
Modify only one tuple in the JOB-SAL table instead of many tuples
in the initial relation.
EMP-JOB
EmpId
200
300
600
700
450
Name
Smith
John
George
Jean
Mike
JOB-SAL
Job
driving
Guardian
Guardian
Gardener
Gardener
Job
Salary
Driving
Guardian
1500
2000
Gardener
2500
ANOMALIES
So, a table (relation) is a stable (‘good’) table only
if it is free from any of these anomalies at any point
in time.
You have to ensure that each and every table in a
database is always free from these modification
anomalies. And, how do you ensure that?
‘Normalization’ theory helps.
NORMAL FORMS
1 NF
2NF
3NF
BCNF (Boyce-Codd Normal Form)
4NF
5NF
Normal Forms
Normalization is done through changing or
transforming data into various Normal Forms.
There are 5 Normal Forms but we almost
never use 4NF or 5NF.
We will only be concerned with 1NF, 2NF,
and 3NF.
Normal Forms
For a database to be in a normal form, it must
meet all requirements of the previous forms:
• Eg. For a database to be in 2NF, it must already be
in 1NF. For a database to be in 3NF, it must
already be in 1NF and 2NF.
Manager
Fatma
Abdulaziz
Ali
Employees
Sayed, Tariq
Tafla, Mohammed
Sarai, Miriam
Sample Data
Manager
Fatma
Abdulkaziz
Ali
Employees
Sayed, Tariq
Tafla, Mohammed
Sarai, Miriam
Data that is not atomic means:
•
•
•
•
We can’t easily sort the data
We can’t easily search or index the data
We can’t easily change the data
We can’t easily reference the data in other tables
Manager
Fatma
Abdulaziz
Ali
Employee1
Sayed
Tafla
Sarai
Employee2
Tariq
Mohammed
Miriam
Sample Data
• We still can’t easily sort, search, or index our
employees.
• What if a manager has more than 2 employees, 10
employees, 100 employees? We’d need to add
columns to our database just for these cases.
• It is still hard to reference our employees in other
tables.
Manager
Fatma
Abdulaziz
Ali
Employee1
Sayed
Tafla
Sarai
Employee2
Tariq
Mohammed
Miriam
First Normal Form
1NF means that we must:
• Eliminate duplicate columns from the same table,
and
• Create separate tables for each group of related
data into separate tables, each with a unique row
identifier (primary key)
Let’s get started by making our columns
atomic…
Manager
Fatma
Fatma
Abdulaziz
Abdulaziz
Ali
Ali
Employee
Sayed
Tariq
Tafla
Mohammed
Sarai
Miriam
Primary Key
The best primary key would
be the Employee column.
Every employee only has
one manager, therefore an
employee is unique.
Employee
Sayed
Tariq
Tafla
Mohammed
Sarai
Miriam
Manager
Fatma
Fatma
Abdulaziz
Abdulaziz
Ali
Ali
Students should now say that the Employee is the
Primary Key since there are now multiple manager
values in the table. Only Employee is unique.
First Normal Form
Congratulations!
The fact that all our data and
columns is atomic and we
have a primary key means
that we are in 1NF!
Employee
Sayed
Tariq
Tafla
Mohammed
Sarai
Miriam
Manager
Fatma
Fatma
Abdulaziz
Abdulaziz
Ali
Ali
Students should now say that the Employee is the
Primary Key since there are now multiple manager
values in the table. Only Employee is unique.
ID
1
2
3
4
5
6
7
8
9
Employee ManagerID
Sayed
7
Tariq
7
Tafla
8
Mohammed
8
Sarai
9
Miriam
9
Fatma
Abdulaziz
Ali
Manager
Fatma
Abdulaziz
Ali
Employees
Sayed, Tariq
Tafla, Mohammed
Sarai, Miriam
ID
1
2
3
4
5
6
7
8
9
Employee ManagerID
Sayed
7
Tariq
7
Tafla
8
Mohammed
8
Sarai
9
Miriam
9
Fatma
Abdulaziz
Ali
Moving to Second Normal Form
A database in 2NF must also be in 1NF:
• Data must be atomic
• Every row (or tuple) must have a unique primary
key
Plus:
• Subsets of data that apply to multiple rows
(repeating data) are moved to separate tables
CustID
1
2
3
4
5
6
7
8
9
FirstName
Bob
John
Sandy
Maria
Gameil
James
Shiela
Ian
Ed
LastName
Smith
Brown
Jessop
Hernandez
Hintz
Richardson
Green
Sampson
Rodgers
Address
123 Main St.
555 2nd Ave.
4256 James St.
4599 Columbia
569 Summit St.
12 Cameron Bay
12 Michigan Ave.
56 Manitoba St.
15 Athol St.
City
Tucson
St. Paul
Chicago
Vancouver
St. Paul
Regina
Chicago
Winnipeg
Regina
State
AZ
MN
IL
BC
MN
SK
IL
MB
SK
Zip
12345
54355
43555
V5N 1M0
54355
S4T 2V8
43555
M5W 9N7
S4T 2V9
City
Tucson
St. Paul
Chicago
Vancouver
St. Paul
Regina
Chicago
Winnipeg
Regina
State
AZ
MN
IL
BC
MN
SK
IL
MB
SK
Zip
12345
54355
43555
V5N 1M0
54355
S4T 2V8
43555
M5W 9N7
S4T 2V9
Moving to Second Normal Form
The CustID determines all the data in the row, but
U.S. Zip codes determines the City and State. (eg. A
given Zip code can only belong to one city and state
so storing Zip codes with a City and State is
redundant)
This means that City and State are Functionally
Dependent on the value in Zip code and not only the
primary key.
Moving to Second Normal Form
To be in 2NF, this repeating data must be in its
own table.
So:
• Let’s create a Zip code table that maps Zip codes
to their City and State.
• Note that Canadian Postal Codes are different: the
same city and state can have many different postal
codes.
Customer Table
CustID
1
2
3
4
5
6
7
8
9
FirstName
Bob
John
Sandy
Maria
Gameil
James
Shiela
Ian
Ed
Zip Code Table
Zip
12345
54355
43555
V5N 1M0
S4T 2V8
M5W 9N7
S4T 2V9
LastName
Smith
Brown
Jessop
Hernandez
Hintz
Richardson
Green
Sampson
Rodgers
City
Tucson
St. Paul
Chicago
Vancouver
Regina
Winnipeg
Regina
State
AZ
MN
IL
BC
SK
MB
SK
Address
123 Main St.
555 2nd Ave.
4256 James St.
4599 Columbia
569 Summit St.
12 Cameron Bay
12 Michigan Ave.
56 Manitoba St.
15 Athol St.
Zip
12345
54355
43555
V5N 1M0
54355
S4T 2V8
43555
M5W 9N7
S4T 2V9
Advantages of 2NF
Saves space in the database by reducing
redundancies
If a customer calls, you can just ask them for
their Zip code and you’ll know their city and
state! (No more spelling mistakes)
If a City name changes, we only need to make
one change to the database.
Summary So Far…
1NF:
• All data is atomic
• All rows have a unique primary key
2NF:
• Data is in 1NF
• Subsets of data in multiple columns are moved to a
new table
• These new tables are related using foreign keys
Moving to 3NF
To be in 3NF, a database must be:
• In 2NF
• All columns must be fully functionally dependent
on the primary key (There are no transitive
dependencies)
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111
50
1,000 50,000
2 1002 AB-111
60
500 30,000
3 1001 ZA-245
35
100 3,500
4 1003 MB-153
82
25 2,050
5 1004 ZA-245
42
10
420
6 1002 ZA-245
40
50 2,000
7 1001 AB-111
75
100 7,500
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111
50
1,000 50,000
2 1002 AB-111
60
500 30,000
3 1001 ZA-245
35
100 3,500
4 1003 MB-153
82
25 2,050
5 1004 ZA-245
42
10
420
6 1002 ZA-245
40
50 2,000
7 1001 AB-111
75
100 7,500
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111
50
1,000 50,000
2 1002 AB-111
60
500 30,000
3 1001 ZA-245
35
100 3,500
4 1003 MB-153
82
25 2,050
5 1004 ZA-245
42
10
420
6 1002 ZA-245
40
50 2,000
7 1001 AB-111
75
100 7,500
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111
50
1,000 50,000
2 1002 AB-111
60
500 30,000
3 1001 ZA-245
35
100 3,500
4 1003 MB-153
82
25 2,050
5 1004 ZA-245
42
10
420
6 1002 ZA-245
40
50 2,000
7 1001 AB-111
75
100 7,500
Maybe price is dependent on the ProdID and
Quantity: The more you buy of a given product
the cheaper that product becomes!
So we ask the business manager and she tells
us that this is the case.
OrderID CustID ProdID Price Quantity Total
1 1001 AB-111
50
1,000 50,000
2 1002 AB-111
60
500 30,000
3 1001 ZA-245
35
100 3,500
4 1003 MB-153
82
25 2,050
5 1004 ZA-245
42
10
420
6 1002 ZA-245
40
50 2,000
7 1001 AB-111
75
100 7,500
We say that Price has a transitive dependency on
ProdID and Quantity.
• This means that Price isn’t just determined by the OrderID.
It is also determined by the size (or quantity) of the order
(and of course what is ordered).
OrderID
CustID
ProdID
Price
Quantity
Total
OrderID
CustID
ProdID
Price
Quantity
Total
OrderID
CustID
ProdID
Price
Quantity
OrderID
CustID
ProdID
Price
Quantity
OrderID
CustID
ProdID
ProdID
Quantity
Price
Quantity
OrderID CustID ProdID Quantity ProdID Quantity Price
1 1001 AB-111
1,000 AB-111
1
75
2 1002 AB-111
500 AB-111
101
60
3 1001 ZA-245
100 AB-111
501
50
4 1003 MB-153
25 ZA-245
1
42
5 1004 ZA-245
10 ZA-245
11
40
6 1002 ZA-245
50 ZA-245
51
35
7 1001 AB-111
100 MB-153
1
82
OrderID CustID ProdID Quantity ProdID Quantity Price
1 1001 AB-111
1,000 AB-111
1
75
2 1002 AB-111
500 AB-111
101
60
3 1001 ZA-245
100 AB-111
501
50
4 1003 MB-153
25 ZA-245
1
42
5 1004 ZA-245
10 ZA-245
11
40
6 1002 ZA-245
50 ZA-245
51
35
7 1001 AB-111
100 MB-153
1
82
Congratulations! We’re now in 3NF!
We can also quickly figure out what price to
offer our customers for any quantity they want.
To Summarize (again)
A database is in 3NF if:
• It is in 2NF
• It has no transitive dependencies
A transitive dependency exists when one attribute (or
field) is determined by another non-key attribute (or
field)
We remove fields with a transitive dependency to a new
table and link them by a foreign key.
Summarizing
A database is in 2NF if:
• It is in 1NF
• There is no repeating data in its tables.
Put another way, if we use a composite primary key,
then all attributes are dependent on all parts of the key.
And Finally…
A database is in 1NF if:
• All its attributes are atomic (meaning they contain
only a single unit or type of data), and
• All rows have a unique primary key.