itec_630_data_normalization_final_section_7_for_final

Download Report

Transcript itec_630_data_normalization_final_section_7_for_final

ITEC 630: Session Seven
Systems Analysis and Development Methodologies
ITEC 630: Information Systems
Analysis, Modeling, and Design
Session 7: Chapter 9 (Conger)
Database Design & Implementation
Invoices
Data Represented in an MS Excel
Spreadsheet
Revised (flatten) Spreadsheet
Zero Normal Form
• In Figure 1 you see that there are several
repeating attributes in the
data Order0NF table –
– the ordered item information repeats nine times
and
– the contact information is repeated twice,
• once for shipping information and
• once for billing information.
Zero Normal Form
• Although this initial version of orders could
work, what happens when an order has
more than nine order items?
– Do you create additional order records for
them?
– What about the vast majority of orders that
only have one or two items?
– Do we really want to waste all that storage
space in the database for the empty
fields? Likely not.
Zero Normal Form
• Furthermore:
– do you want to write the code required to
process the nine copies of item information,
• even if it is only to marshal it back and forth
between the appropriate number of objects.
• Once again, likely not.
First Normal Form
First Normal Form addresses three issues:
1. A row of data cannot contain repeating groups of similar data
2.
Create a separate table for each set of related data.
3. Each row of data must have a unique identifier (or Primary Key).
First Normal Form
When this value is made up of two or more columns, it is referred to as
a concatenated primary key.
First Normal Form
First Normal Form
• Figure 2 presents a reworked data schema where
the order schema is put in first normal form.
• The introduction of the OrderItem1NF table
enables us to have as many, or as few, order
items associated with an order, increasing the
flexibility of our schema while reducing storage
requirements for small orders (the majority of our
business).
First Normal Form
• The ContactInformation1NF table offers a similar benefit,
when an order is shipped and billed to the same person
(once again the majority of cases) we could use the same
contact information record in the database to reduce data
redundancy.
• OrderPayment1NF was introduced to enable customers to
make several payments against an order – Order0NF
could accept up to two payments, the type being
something like “MC" and the description “MasterCard
Payment", although with the new approach far more than
two payments could be supported.
First Normal Form
• Multiple payments are accepted only when the
total of an order is large enough that a customer
must pay via more than one approach, perhaps
paying some by check and some by credit card.
Second Normal Form
Next we test each table for partial dependencies on a
concatenated key.
This means that for a table that has a concatenated primary
key, each column in the table that is not part of the primary
key must depend upon the entire concatenated key for its
existence.
If any column only depends upon one part of the
concatenated key, then we say that the entire table has failed
Second Normal Form and we must CREATE ANOTHER
TABLE to rectify the failure.
Second Normal Form
You must ask yourself “Can this column
exist without one or the other part of the
concatenated primary key?”
Second Normal Form
• The address is needed by the Customers table,
but also by the Orders, Shipping, Invoices,
Accounts Receivable, and Collections tables.
• Instead of storing the customer's address as a
separate entry in each of these tables, store it in
one place, either in the Customers table or in a
separate Addresses table.
Second Normal Form
Testing
nd
2
Normal Form
Testing
nd
2
Normal Form
Second Normal Form
Second Normal Form
• Although the solution presented in Figure 2 is improved
over that of Figure 1, it can be normalized further.
• Figure 3 presents the data schema of Figure 2 in second
normal form (2NF).
• An entity type is in second normal form (2NF) when it is in
1NF and when every non-key attribute, any attribute that
is not part of the primary key, is fully dependent on the
primary key.
Second Normal Form
• This was definitely not the case with the OrderItem1NF
table, therefore we need to introduce the new table
Item2NF.
• The problem with OrderItem1NF is that item information,
such as the name and price of an item, do not depend
upon an order for that item.
• For example, if Hal Jordan orders three widgets and Oliver
Queen orders five widgets, the facts that the item is called
a “widget" and that the unit price is $19.95 is constant.
Second Normal Form
• This information depends on the concept of an item, not
the concept of an order for an item, and therefore should
not be stored in the order items table – therefore the
Item2NF table was introduced.
• OrderItem2NF retained the TotalPriceExtended column, a
calculated value that is the number of items ordered
multiplied by the price of the item.
• The value of the SubtotalBeforeTax column within the
Order2NF table is the total of the values of the total price
extended for each of its order items
Third Normal Form (3NF)
• Must be in 2NF
• Remove any transitive dependencies.
• A transitive dependency is when nonkey
attributes are dependent not only on the
primary key, but also on a nonkey
attribute.
13-30
Third Normal Form (3NF)
tbl_Tournament
int_Tournament
Tournament
Year
Winner
Winner Date of Birth
1
Indiana Invitational
1998
Al Fredrickson
July 21, 1975
2
Cleveland Open
1999
Bob Albertson
September 28, 1968
3
Des Moines Masters
1999
Al Fredrickson
July 21, 1975
4
Indiana Invitational
1999
Chip Masterson
March 14, 1977
Third Normal Form (3NF)
• The violation is the last column, Winner
Date of Birth, which is not a property of
the key but is functionally dependent upon
the Winner value.
• The term "functionally dependent" means
that column Winner Date of Birth is a
function of Winner.
Third Normal Form (3NF)
tbl_TournamentWinners
Tournament
int_Tournament
1
Indiana Invitational
2
Cleveland Open
3
Des Moines Masters
4
Indiana Invitational
Year
Winner
1998 Al Fredrickson
1999 Bob Albertson
1999 Al Fredrickson
1999 Chip Masterson
tbl_Players
Player
Winner Date of Birth
Al Fredrickson
July 21, 1975
Bob Albertson
September 28, 1968
Chip Masterson
March 14, 1977
Third Normal Form (3NF)
tbl_TournamentWinners
Tournament
int_Tournament
1
Indiana Invitational
2
Cleveland Open
3
Des Moines Masters
4
Indiana Invitational
Year
1998
1999
1999
1999
Winner
345-67-9097
220-45-1111
345-67-9097
667-45-1356
tbl_Players
Int_Player
Player
Winner Date of Birth
345-67-9097
Al Fredrickson
July 21, 1975
220-45-1111
Bob Albertson
September 28, 1968
667-45-1356
Chip Masterson
March 14, 1977
Third Normal Form (3NF)
Third Normal Form (3NF)
• An entity type is in third normal form (3NF) when
it is in 2NF and when all of its attributes are
directly dependent on the primary key.
• A better way to word this rule might be that the
attributes of an entity type must depend on all
portions of the primary key.
Third Normal Form (3NF)
• In this case there is a problem with the
OrderPayment2NF table, the payment type
description (such as “Mastercard" or “Check")
depends only on the payment type, not on the
combination of the order id and the payment
type.
• To resolve this problem the PaymentType3NF
table was introduced in Figure 4, containing a
description of the payment type as well as a
unique identifier for each payment type.
Third Normal Form (3NF)
• A memorable statement of Edgar F. Codd's
definition of 3NF:
– paralleling the traditional pledge to give true
evidence in a court of law, was given by Bill Kent:
• A common variation supplements this
definition with the oath: "so help me Codd"
Nerd laughter ensues……
Third and Half Normal Form (3.5NF)
Boyce-Codd Normal Form (BCNF or 3.5NF)
• Also referred to as the "third and half (3.5) normal
form",
• Adds one more requirement:
– In addition to meeting all the requirements of the third
normal form.
– Every determinant must be a candidate key.
• A candidate key is a combination of attributes that can be
uniquely used to identify a database record without any
extraneous data. Each table may have one or more candidate
keys. One of these candidate keys is selected as the table primary
key.
Fourth Normal Form
•Meet all the requirements of the third normal form.
•Seeks to “remove all multi-valued dependencies”.
•Many to Many
•So what exactly does this mean?
•A multi-valued attribute is an attribute that helps to
describe the entity.
•In order to represent this in a table, the attribute could
be repeated several times. Consider this real-world
scenario:
Fourth Normal Form
• Two Examples
– Courses, Textbooks, and Instructors
– Town and Region
First 4th Normal Form Example
Fourth Normal Form
2017 Fall Semester ITEC (Information Technology) Courses
•
•
•
•
•
•
•
•
•
Course
630
630
630
630
630
630
640
640
Book
Silberschatz
Nederpelt
Silberschatz
Nederpelt
Silberschatz
Nederpelt
Silberschatz
Silberschatz
Lecturer
John D
Aaron D
William M
John D
Christian G
Christian G
John D
Aaron D
Fourth Normal Form
•
•
•
•
•
•
•
•
•
•
•
•
•
Course
630
630
630
630
630
630
630
630
630
630
630
630
Book
Silberschatz
Canon
Nederpelt
Canon
Silberschatz
Canon
Nederpelt
Canon
Silberschatz
Canon
Nederpelt
Canon
Lecturer
John D
John D
Aaron D
Aaron D
William M
William M
John D
John D
Christian G
Christian G
Christian G
Christian G
These are the
multi-valued
dependencies
Fourth Normal Form
• there are two multivalued dependencies in this
Example:
– {course}{book}
– {course}{lecturer}
• To be compliant with 4NF, one must create a new
table for book that has course as its Foreign Key.
• Lecturer would also exist in a separate table.
Lecture and course would be “connected” via a
linking table. This is a many to many problem
Second 4th Normal Form Example
Fourth Normal Form
Fourth Normal Form
• The town, Region, and (Congressional) District together
form the candidate key.
• In this example, Providence has a split congressional
district and exists in multiple regions.
• Lincoln also exists in multiple regions but only has a single
district.
• The above table meets Boyce-Codd Normal Form (3.5
Normal Form) because all three attributes are part of the
candidate key
Fourth Normal Form
•This type of table is called a cross-product,
where data is repeated.
•It is not in 4th normal form because of this
repeating data scenario.
•To fix, you would create two separate tables:
Fourth Normal Form
Fourth Normal Form
• In 4th normal form, you reduce the need for
inserts, updates and deletes and therefore you
eliminate the possibility for data anomalies.
• It is reasonable to strive for 4th normal form as a
baseline in your designs. In some cases, your
database will take up less space.
• The cost for normalizing this far is only in query
performance, as you’ll need additional joins to
report on the data.
Fifth Normal Form
• The 5th Normal Form is:
– a “projection-based normalization approach
where every nontrivial join dependency is
implied by a candidate key”.
• project-join normal form (PJ/NF)
• A table is said to be in the 5NF if and only if
every join dependency in it is implied by
the candidate keys.
Fifth Normal Form
• A table T is subject to a join dependency if
T can always be recreated by joining
multiple tables each having a subset of the
attributes of T. This is a nontrivial join
dependency
Fifth Normal Form
• PLAIN ENGLISH: The table should have only
candidate keys and its primary key should consist
of only a single column.
Fifth Normal Form
• Any fact should be able to be reconstructed
without any anomalous results in any case
regardless of the number of tables being joined.
Join Statements
• Left Join
– Returns all rows from the left table even if there are no
matches in the right table
• Right Join
– Returns all the rows from the right table even if there
are no matches in the left table.
• Inner Join
– Returns rows where there is at least one match in both
tables.
• Full Join
– Returns rows when there is a match in one of the
tables
Fifth Normal Form
• In the previous example, I showed how one
table became two, reducing the risk of data
anomalies.
• 5th normal form could split a single table
into three or more.
Fifth Normal Form
Fifth Normal Form
• Let’s say you have a table with store,
supplier, and products.
• The table contains a row for each store:
– which is repeated for each supplier the store
uses,
– and again for each product that the supplier
provides to the particular store
– (A store might not sell all the products that a
supplier provides).
Fifth Normal Form
• All fields in this table are valuable and
necessary to represent the row and is in
4th Normal Form.
• A relation is in 4NF if it has no multi-valued
dependencies
• In fact, this is a common type of table that
could represent a many-to-many-to-many
relationship among stores, products, and
suppliers.
• But, notice that there is repeating data:
Fifth Normal Form
Fifth Normal Form
• In 5NF, this table would become three
tables:
1. store_to_prod table
2. supp_to_prod table
3. store_to_supp.
Fifth Normal Form
Fifth Normal Form
• When a store picks up a new supplier, a new
row is added to the store_to_supp table,
eliminating the redundancy of extra rows for
product.
• When a store sells a new product, the
supp_to_prod table will let you know where
to get it.
Fifth Normal Form
• A simple example, yes — but it demonstrates
5th normal form.
• As you can see, you increase the amount of
relations and tables as you normalize deeper
and deeper.
Fifth Normal Form
Traveling Salesman
Brand
Product Type
Jack Schneider
Acme
Vacuum Cleaner
Jack Schneider
Acme
Breadbox
Willy Loman
Robusto
Pruning Shears
Willy Loman
Robusto
Vacuum Cleaner
Willy Loman
Robusto
Breadbox
Willy Loman
Robusto
Umbrella Stand
Louis Ferguson
Robusto
Vacuum Cleaner
Louis Ferguson
Robusto
Telescope
Louis Ferguson
Acme
Vacuum Cleaner
Louis Ferguson
Acme
Lava Lamp
Louis Ferguson
Nimbus
Tie Rack
Fifth Normal Form
• The table's predicate is:
– Products of the type designated by Product Type,
– made by the brand designated by Brand,
– are available from the traveling salesman
designated by Traveling Salesman.
Fifth Normal Form
• In the absence of any rules restricting the
valid possible combinations of Traveling
Salesman, Brand, and Product Type, the
three-attribute table on slide #53 is
necessary in order to model the situation
correctly.
• But what if we did in fact have rules???
Fifth Normal Form
– A Traveling Salesman has certain Brands and
certain Product Types in his repertoire.
– If Brand B1 and Brand B2 are in his repertoire,
and Product Type P is in his repertoire, then
(assuming Brand B1 and Brand B2 both make
Product Type P), (filterless vacuum cleaners)
– the Traveling Salesman must offer products of
Product Type P those made by Brand B1 and
those made by Brand B2.
Fifth Normal Form
• In that case, it is possible to split the table
into three:
Fifth Normal Form
Traveling Salesman
Product Type
Jack Schneider
Vacuum Cleaner
Jack Schneider
Breadbox
Willy Loman
Pruning Shears
Willy Loman
Vacuum Cleaner
Traveling Salesman
Brand
Jack Schneider
Acme
Willy Loman
Robusto
Louis Ferguson
Robusto
Willy Loman
Breadbox
Willy Loman
Umbrella Stand
Louis Ferguson
Acme
Louis Ferguson
Telescope
Louis Ferguson
Nimbus
Louis Ferguson
Vacuum Cleaner
Louis Ferguson
Lava Lamp
Product Types By Traveling Salesman
Brands By Traveling Salesman
Brand
Product Type
Acme
Vacuum Cleaner
Acme
Breadbox
Acme
Lava Lamp
Robusto
Pruning Shears
Robusto
Vacuum Cleaner
Robusto
Breadbox
Robusto
Umbrella Stand
Robusto
Telescope
Nimbus
Tie Rack
Product Types By Brand
Sixth Normal Form
• Sixth normal form (abbreviated 6NF) is
achieved if and only if it satisfies no
nontrivial join dependencies at all
– A table T is subject to a join dependency if T
can always be recreated by joining multiple
tables each having a subset of the attributes of
T. This is a nontrivial join dependency
Sixth Normal Form
– If one of the tables in the join has all the
attributes of the other table(s) it is joined with,
the join dependency is called trivial join
dependency.
Sixth Normal Form
• Sixth Normal Form (6NF) is a term used in relational
database theory by Christopher Date to describe
databases which decompose relational variables to
irreducible elements.
• Important when dealing with temporal variables or other
interval data.
– For instance, if a relation comprises a supplier's name, status, and
city, we may also want to add temporal data, such as the time
during which these values are, or were, valid (e.g., for historical
data)
– but the three values may vary independently of each other and at
different rates.
– We may, for instance, wish to trace the history of changes to
Status.
Sixth Normal Form
• A temporal database is a database which contains aspects
of time analysis.
• Specifically, such databases often deal with two concepts
of time – valid time and transaction time.
– Valid time is defined as the time period during which a data fact
may be true with respect to the real world (i.e., a time when
something actually happened),
– Transaction time refers to the time period during which a data
fact may be stored in the database (i.e., a time when something is
measured and/or reported).
Denormalizing
Denormalizing
• At a certain point it is logical to push back
and denormalize.
• You would do this for performance and
manageability purposes, or because the
design of the database dictates it.
• In the previous example, it might not be
practical from a development standpoint to
bring all tables to 5th form.
Denormalizing
• Consider the purpose of the database.
• If it is a transaction-based database with
many data access points, you will want to
normalize as much as possible to ensure
the integrity of the data.
• If the primary purpose of the database is
for ad-hoc queries, then backing off might
be well received.
Denormalizing
• One thing I always come back to is the original intent of
normalization:
– To reduce the number of data anomalies that can occur during
INSERT, UPDATE, and DELETE operations.
• Normally, if tables are up to 3NF, then INSERT would
never be a problem due to the constraints set up by
primary and foreign key relationships.
• But UPDATEs and DELETEs do create potential issues.
• So, if you are not going to update or delete a row from a
table (or there is little likelihood), then it is safe to
denormalize an otherwise 4th or 5th model “down” to
3rd.