Normalization
Download
Report
Transcript Normalization
IT 244
Database Management System
Topic 3 : Normalization
Ref : -
A First Course in Database System
(Jeffrey D Ullman, Jennifer Widom) + online.
Different Technique
• Thus far in this course we have discussed Entity
Relationship Modeling as a technique for
designing a database
• There exists another important design
technique, called Normalization which we shall
now cover .
• Before we discuss the details, let us first
consider the applicability of each of these
techniques
Entity Relationship Modeling
• Entity Relationship Modeling is appropriate
when a database is to be built from scratch
• This arises when a completely new information
system is being built (e.g. a company realizes
that they need to keep track all purchase made
from its suppliers in order to improve efficiency)
• In the above scenario no relations exist (or can
be identified directly), thus we need to model the
requirements in the form of an ER model, and
then convert the latter to the Relational Model
Normalization
If you've been working with databases for a while,
chances are you've heard the term
normalization. Perhaps someone's asked you "Is
that database normalized?" or "Is that in BCNF?"
All too often, the reply is "Yeah." Normalization is
often brushed aside as a luxury that only
academics have time for. However, knowing the
principles of normalization and applying them to
your daily database design tasks really isn't all
that complicated and it could drastically improve
the performance of your DBMS.
Normalization
• However, there exists another scenario whereby a
company decides to migrate from a file-based
information system to a database system
• It is in this latter scenario that Normalization is most
applicable.
• In this scenario it is possible to define relations
directly (i.e. without going through the intermediate
ER modeling stage) by analyzing the report
structures needed by the system
Normalization
• This analysis results in a set of relations,
however, these relations are very often
unsatisfactory from the viewpoint of database
efficiency (as we shall see shortly)
• Normalization is the technique by which we
decompose unsatisfactory relations into smaller
relations, each of which are "better" in some
sense than the original
• Is the step by step process of organizing data in
a way that eliminates potential problem in
updating and accessing data
• it's the process of efficiently organizing data in a
database.
Normalized Student Database
Students (Idnumber, Surname, First name, Gender, Sponsor)
Dept (DeptID, Dept name)
DeptCourses (DeptId, CourseNumber,Course Title, Coordinator)
CourseTextbooks (ISBN, CoourseNumber,Title, Author, Edition)
TextbookFees (FeeCode, ISBN, Fee, Fee Amount)
StudentCourses
(Idnumber,CourseNumber,Fees ID)
Normalizations
• There are two goals of the normalization
process: eliminate redundant data (for example,
storing the same data in more than one table)
and ensure data dependencies make sense (only
storing related data in a table).
• Both of these are worthy goals as they reduce
the amount of space a database consumes and
ensure that data is logically stored.
Normalizations
The database community has developed a series
of guidelines for ensuring that databases are
normalized. These are referred to as normal
forms and are numbered from one (the lowest
form of normalization, referred to as first normal
form or 1NF) through sixth (sixth normal form or
6NF).
In practical applications, you'll often see 1NF, 2NF,
and 3NF along with the occasional 4NF. Fifth and
Sixth normal form is very rarely seen and won't
be discussed in this course.
Normalizations
Before we begin our discussion of the normal
forms, it's important to point out that they are
guidelines and guidelines only.
Occasionally, it becomes necessary to distance
from them to meet practical business
requirements.
However, when variations take place, it's extremely
important to evaluate any possible effect they
could have on your system and account for
possible inconsistencies. That said, let's explore
the normal forms.
Normal Form
• Relations can be in:
- 1st Normal Form,
- 2nd Normal Form,
- 3rd Normal Form
- 4th Normal Form
- or 5th Normal Form
- even 6th NF in 2002
Normal Form
First normal form (1NF) sets the very basic rules
for an organized database:
Eliminate duplicative columns from the same
table.
A table can be said to be 1NF by identifying
the non-key attribute is dependent on non-key
attribute and identify the column or attribute
which produces more than one value.
Create separate tables for each group of related data
and identify each row with a unique column or set of
columns (the primary key).
Normal Form
Second normal form (2NF) further addresses the
concept of removing duplicative data:
Remove subsets of data that apply to multiple
rows of a table and place them in separate
tables.
Create relationships between these new tables
and their predecessors through the use of
foreign keys.
• 2NF: A table can set to be in 2NF, if the table
has already satisfy the 1NF and eliminating
partial functional dependency
Normal Form
Third normal form (3NF) goes one large step
further:
Remove columns that are not dependent upon the
primary key.
: A Table can be said to be 3 NF if the table
already satisfy 2NF and eliminate transitive
dependency.
Finally, fourth normal form (4NF) has one
requirement:
A relation is in 4NF if it has no multi-valued
dependencies and already said to satisfy the 3NF
1st Normal Form
– Supplier(supp_no, pro_no, city, supp_status,
(pro_no,qty))
• Supplier(supp_no, pro_no, city, supp_status, qty)
• Relation Supplier is in 1st Normal Form
• The first NF addresses the occurrence of repeating
groups in an entity… an entity is said to be in 1NF if
it contains no repeating group.
• K, A, B, C, D, E, F, R1, R2, R3
• K, A, B, C, D, E, F
• Even though Supplier can be represented in a
relational database it suffers from the following
problems – anomalies
1st Normal Form
• Problems with Relation Supplier
• Insert : We cannot insert a supplier until that
supplier supplies at least one part
• Delete : If we delete the only tuple for a
particular shipment for a supplier, we
also lose supplier information in the
process
• Modify : the city value for a given supplier appears
many times causing update problems
• Solution to these problems is to further normalize - into
2nd Normal Form
Progressing to the 1NF
1. Place all attributes which appear in a
repeating group into a new table;
2. Duplicate in the new table the primary
key of he table from which the repeating
group was extracted;
3. Designate a primary key for each new
table produced. This key will be the
primary key of the original table
concatenated with one or more attributes
in the new table.
2nd Normal Form
• A relation is said to be in 2nd Normal Form if it
has been placed in first NF and if every
data item in the entity is fully functionally
dependent (FD’s) on the primary key of the
entity.
• K1 + K2, A, B, C, D, E, F (F depends only on K2)
• K1 + K2, A, B, C, D, E
• A non-key attribute is any attribute that does
not participate in the primary key
Progressing to the 2NF
1. If a data is fully functionally dependent on only
a part of the primary key, move that data item
and the part of the primary key on which it is
fully functionally dependent to new table;
2. If other data items are functionally dependent
on the same part of the key, place them in the
new table
3. Make the partial primary key value copied from
the original table of the primary key of the new
table
4. Examine each new table to ensure that each
data item is fully functionally dependent on the
primary key, that is that they are in the 2NF
Functional Dependence
• Given a Relation R, an attribute Y of R is
functionally dependent on attribute X if and
only if each X value is associated with only
one Y value
• Y is fully functionally dependent on X if it is
functionally dependent on X and not
functionally dependent on any part of X
Functional Dependence
status
Supp_no
City
Qty
Pro_no
Above diagram is called the Functional Dependency Diagram (FDD)
2nd Normal Form
• From the definition of 2nd NF we get:
SECOND (supp_no, status, city)
SP(supp_no, pro_no, qty)
• Need to verify that these 2 relations
overcome all the problems associated with
relation FIRST
• However, the SECOND/SP structure still
has problems associated with it
2nd Normal Form
• Problems
• Insert - Cannot express the fact that a particular
city has a particular status value until at
least one supplier is located in that city
• Delete - If we delete the only supplier for a
particular city, then we not only lose that
supplier information but we also lose city
info:
• Update - The status value for a given city appears
many times, causing the usual update
problems
• Thus need to normalize further into 3NF
Anomalies
• Goal of relational schema design is to
avoid anomalies and redundancy.
– Update anomaly : one occurrence of a fact is
changed, but not all occurrences.- should
necessary only one changed.
– Deletion anomaly : valid fact is lost when a
tuple is deleted.- deletion of a record might
delete data that is not intended to be deleted
– Insertion anomaly: Adding a new record into a
relation would require entering another record
to another relation
3rd Normal Form
• A relation R is in 3rd NF if and only if it is in 2NF,
and every non-key attribute is non-transitively
dependent on the primary key attribute
• En entity is said to be in 3NF if has been placed
in 2NF and if all fields which are not part of the
primary key are mutually independent , that is
there are no transitive dependencies (where a
field is dependent not only on the key but also
on another field)
• K, A, B, C, D, E (E is also dependent on A)
• K, A, B, C, D
Progressing to 3NF
1. Identify all fields involved in transitive
dependencies;
2. Move all items involved in transitive
dependencies to a new entity
3. Identify a primary key for the new entity
4. Place the primary key for the new entity
as a foreign key on the original entity.
There should be no remaining transitive
dependencies on the original entity.
Boyce-Codd Normal Form
• An entity is said to be in BCNF if it is in 3NF and
every data item on which some other item is fully
functionally dependent is a candidate key.
• 3rd NF does not deal satisfactorily when a relation
has multiple candidate keys. A candidate key is
any key that can act as a primary key.
• A relation is in BCNF if and only if every
determinant is a candidate key. A determinant is
any attribute on which some other attribute is fully
functionally dependent on.
BCNF Example
• Entity X (A, B, C, D, E, F, G, H)
• {A B} and {B C} are both Candidate
primary key.
• Entity X (A + B, D, F, G)
• Entity Y (B + C, E, H)
BCNF Example
• Consider this Relation SSP
Sup_no
sname
Pro_no
Qty
SI
Smith
PI
300
Sl
Smith
P2
200
SI
Smith.
P3
400
SI
Smith
P4
200
Update problems with SSP, even though SSP is in 3rd NF, there are two
candidate keys for SSP:
(sup_no, pro_no) and (sname, pro_no)
Here sname is a determinant, but not a candidate key
Hence relation SSP is not in BCNF SSP can be decomposed into:
SS(sup_no, sname)
SP(sname, pro_no, qty)
Both SS and SP are in BCNF Can be verified that problems with SSP have
now disappeared
Higher Normal Form
• Higher Normal Forms such as 4th NF and
5th NF do exist, but they are mainly of
theoretical interest only
• This is because more than 99 % of relations
which are in 3rd NF or BCNF are also in 4th
NF and 5th NF
Choosing Between ER Model and
Normalisation
• ER modeling is appropriate when the applications being
developed is new
• In this case normalization should be applied once the ER
model is transformed to the Relational Model
• On the other hand the pure normalization approach is
appropriate when the application being developed already
exists
Eg. A company wants to migrate its S/W from a file-based system
to a database system
In this case the database could be designed by applying
normalization techniques to the reports produced by the old
(file_based) system.
Modeling Entities from Relations
• It is possible to construct an ER model from the
relations produced by the normalization process
• The ER model is drawn in three steps
Step 1
For each relation, draw a rectangle to represent the
equivalent entity type, using the relation name.
Step 2
For each pair of entity types, compare the primary keys. If
the key of one is fully contained within the other, draw a
relationship line between them
Step 3
For each pair of entity types, if the key of one is a foreign
key of another draw a relationship line between the entity
types
Example
Customer (Cust#, Name, Address)
Order (Order#, Date, Total, Cust#)
Order Item (Order#, Item#, Quantity)
Stock Item (Item#, Name, Unit Price, Qty_on_hand)
Statement Item (Cust#, Date, Line#, PayType.
PayAmt)
Example of Bad Design
Drinkers(name, addr, beersLiked, manf, favBeer)
name
Vili
Vili
Tolu
addr
Kalau
???
Manima
beersLiked
VB
Royal
VB
manf
A.B.
R.B
???
favBeer
Ikale
???
VB
Data is redundant, because each of the ???’s can be figured
out by using the FD’s name -> addr favBeer and
beersLiked -> manf.
This Bad Design Also
Exhibits Anomalies
name
Vili
Vili
Tolu
addr
Kalau
Kalau
Manima
beersLiked
VB
Royal
VB
manf
A.B.
R.B
A.B.
favBeer
Ikale
Ikale
VB
• Update anomaly: if Vili is transferred to Fanga,
will we remember to change each of his tuples?
• Deletion anomaly: If nobody likes VB, we lose track
of the fact that A.B manufactures VB.
• Insertion anomaly: If entering of records under beersLiked
require entering of records under manf.
Normalization Benefits
Make the data model more informative to users
Normalized tables, and the relationship between one
normalized table and another, mirror real-world
concepts and their interrelationships.
Avoid bias towards any particular pattern of querying
Normalized tables are suitable for general-purpose
querying. This means any queries against these tables,
including future queries whose details cannot be
anticipated, are supported. In contrast, tables that are
not normalized lend themselves to some types of
queries, but not others.
Normalization Benefits
Functional dependency:
Attribute B has a functional
dependency on attribute A (i.e., A → B) if, for each value of attribute
A, there is exactly one value of attribute B. If value of A is repeating in
tuples then value of B will also repeat. In our example, Employee
Address has a functional dependency on Employee ID, because a
particular Employee ID value corresponds to one and only one
Employee Address value. (Note that the reverse need not be true:
several employees could live at the same address and therefore one
Employee Address value could correspond to more than one
Employee ID. Employee ID is therefore not functionally dependent on
Employee Address.) An attribute may be functionally dependent either
on a single attribute or on a combination of attributes. It is not
possible to determine the extent to which a design is normalized
without understanding what functional dependencies apply to the
attributes within its tables; understanding this, in turn, requires
knowledge of the problem domain.