Data Modeling - Pennsylvania State University

Download Report

Transcript Data Modeling - Pennsylvania State University

INFYS540
Normalizing Your Database
and
Why you WANT to do it!
Lesson 7
Chapter 5 Appendix
7/18/2015
1
Data Redundancy Problems
• Redundancy breeds errors
– Same data defined in multiple places is BAD
– Spelling/typographical error prone
– Lack of data integrity
• Inability to perform simple queries
• Inflexibility and inscalability
• Impossible to MAINTAIN!
7/18/2015
3
Relational Database
PROJECTCHIEF
Project
Project Chief
Computing 333-22-1111
Intranet
987-65-4321
Contracting 123-45-6789
CAT
333-22-1111
EMPLOYEES
LName
FName
Jones
Mike
Smith
Tony
Lee
Bruce
Doodle
Yankee
1
SSN
123-45-6789
987-65-4321
567-89-1234
333-22-1111
What is a candidate key?
What is a primary key?
What is a foreign key?
7/18/2015
Dept
M&B
C2G
MLD
M&B
1
DEPARTMENTS
Dept
Dept. Director
MLD
181-94-5676
C2G
987-65-4321
M&B
123-45-6789
Room
B115
123
147
5
Purpose of Normalization
• Take advantage of the powerful tools
available in a DBMS
• There are five levels of Normalization
– The higher the Normal Form the “better” and more
efficient the database
– But, increasing the levels of Normal Form takes time
and effort
– For most applications, 3rd Normal Form will solve
most potential problems with a DB
7/18/2015
11
Normalizing Database
• Process of creating well-structured
tables.
• Improve performance, integrity of data
• 5-step process (w/ 2 rules) to achieve
Third Normal Form (3NF)
• First two steps put DB into a form so
you can normalize it
7/18/2015
12
Rule #1 in Databases
Never design
redundant data into
a Database
duplicate data is not consistent
duplicate data wastes space
7/18/2015
13
Step 1. Primary Keys
• A primary key is one or more data fields
(columns) that uniquely identify each record in the
table
• What would the primary key be below?
– “table of employees, assigned to a department.”
EMPLOYEES
LName
FName
Jones
Mike
Smith
Tony
Lee
Bruce
7/18/2015
SSN
123-45-6789
987-65-4321
567-89-1234
Dept
Math
M&B
Science
14
Step 1. Primary Keys
• Answer: The SSN
• It is the only “guaranteed” unique
column in the table. Names are easily
repeated.
EMPLOYEES
LName
FName
Jones
Mike
Smith
Tony
Lee
Bruce
7/18/2015
SSN
123-45-6789
987-65-4321
567-89-1234
Dept
Math
M&B
Science
15
Step 1. Primary Keys
• Now try the following example:
• “A table of projects assigned to employees, listing
the project name and the employee’s function on the
project.”
EmpProj A Counter --The MS Access Default Key
Counter
SSN
Project
Function
1
123-45-6789 Dining
Designer
2
123-45-6789 Computing
Designer
3
987-65-4321 Contracting Designer
4
444-55-6666 Intranet
Webmaster
5
222-99-7777 Dining
Overwatch
7/18/2015
16
Step 1. Primary Keys
• It is the combination of the SSN and the Project
fields. Why?
EMPLOYEES’ PROJECTS
Counter
SSN
1
123-45-6789
2
123-45-6789
3
987-65-4321
4
444-55-6666
5
222-99-7777
7/18/2015
Project
Dining
Computing
Contracting
Intranet
Dining
Function
Designer
Designer
Designer
Webmaster
Overwatch
17
Step 1. Primary Keys
• Because, you can have the following:
EMPLOYEES’ PROJECTS
Counter
SSN
1
123-45-6789
2
123-45-6789
3
987-65-4321
4
444-55-6666
5
222-99-7777
Project
Dining
Dining
Intranet
Intranet
Dining
Function
Designer
Designer
Designer
Webmaster
Overwatch
• Redundant records! (Redundancy = BAD)
7/18/2015
18
Rule #2 about Databases
NEVER Use a
Counter as a
Primary Key
7/18/2015
19
Step 2: Eliminate Many-to-Many
Relationships
• What is wrong with the following
table?
• “a table of personnel authorized access to a project”
PROJECTS QUERY ACCESS
Project
Access_1
Dining
222-99-7777
Computing
222-99-7777
Intranet
987-65-4321
7/18/2015
Access_2
181-94-5676
181-94-5676
818-49-6765
Access_3
123-45-6789
20
Step 2: Eliminate Many-to-Many
Relationships
• Here’s essentially what this table looks
like within the Access relationships
diagram:
Projects:
Project
Project Chief
Department
Access_1
Access_2
Access_3
7/18/2015
Employees:
SSN
Last Name
First Name
....
21
Step 2: Eliminate Many-to-Many
Relationships
• Here’s how you model it in a database:
– Break it up into two one-to-many relationships
Projects:
Project
Project Chief
Department
....
1
1
Employees:
SSN
Last Name
First Name
....
Access to
Project Info:
Project
SSN
7/18/2015
22
Step 2: Eliminate Many-to-Many
Relationships
• How to do it:
– The primary key of the new table is the
composite of the primary keys of the existing
tables.
• Primary key of Projects = Project Name
• Primary key of Employees = SSN
• New table primary key of Project Name and SSN
7/18/2015
23
Step 2: Eliminate Many-to-Many
Relationships
– No artificial restrictions on number of people with access
– You can add attributes about the types of access granted
– You can easily query who has access to information
about each project
PROJ QUERY
ACCESS
EMPLOYEE
LName
Jones
Smith
Lee
Doodle
7/18/2015
FName
Mike
Tony
Bruce
Yankee
SSN
123-45-6789
987-65-4321
567-89-1234
333-22-1111
Project
Dining
Dining
Computing
Computing
Intranet
Intranet
Intranet
SSN
222-99-7777
181-94-5676
222-99-7777
181-94-5676
987-65-4321
818-49-6765
123-45-6789
PROJECT
Project
Computing
Intranet
Contracting
CAT
ProjectChief Dept
333-22-1111 MATH
987-65-4321 M&B
123-45-6789 M&B
333-22-1111 Admin
24
What is wrong with the following?
“A table of PCs, which are loaded with many different
applications, and assigned to a user.”
PCSerial#
10291
10301
10311
7/18/2015
LoadedSoftware
Word, Powerpoint, ccMail
Word, Powerpoint, Lotus Notes
Word, LotusNotes, Borland C++
Assigned
Jones
Smith
Hacker
25
Step 3: Achieving 1NF:
All Data must be Atomic
• “Atomic” - the data occupying a field cannot be
further broken down.
– i.e., no multi-data entries
– i.e., “No attributes can have more than one value for a
single instance of an entity”
PCSerial#
10291
LoadedSoftware
Word, Powerpoint, ccMail
Assigned
Jones
• If not atomic, updating is complex and error prone
• If not atomic, can not easily query the database
7/18/2015
26
Step 3 Answer
PCSerial#
10291
10291
10291
10301
10301
10301
10311
10311
10311
7/18/2015
LoadedSoftware
Word
Powerpoint
ccMail
Word
Powerpoint
LotusNotes
Word
LotusNotes
Borland C++
Assigned
Jones
Jones
Jones
Smith
Smith
Smith
Hacker
Hacker
Hacker
27
Step 3. Achieving 1NF:
All Data must be Atomic
Another source of redundancy: calculated fields
TotalYTD
Age
DaysRemaining
Solution: Use a Query!
Remove all calculated fields from table and create a query
...then use the query whenever you need up-to-date data
7/18/2015
28
Step 4. Achieving 2NF:
Eliminate Partial Dependencies
• What is a partial dependency?
– Look at the table. What’s redundant?
– “A table of functions an employee is assigned to for a
project, and the project chief.”
EMPLOYEES’ PROJECTS
SSN
Project
123-45-6789
Dining
123-45-6789
Computing
123-45-6789
Intranet
987-65-4321
Intranet
444-55-6666
Intranet
222-99-7777
Dining
7/18/2015
Function
Designer
Designer
Member
Designer
Webmaster
Overwatch
Project Chief
222-99-7777
333-88-5656
987-65-4321
987-65-4321
987-65-4321
222-99-7777
29
Step 4. Achieving 2NF:
Eliminate Partial Dependencies
• Function depends on the entire primary key: SSN
and Project.
• ProjectChief is dependent on just a portion of the
primary key
EMPLOYEES’ PROJECTS
SSN
Project
123-45-6789
Dining
123-45-6789
Computing
123-45-6789
Intranet
987-65-4321
Intranet
444-55-6666
Intranet
222-99-7777
Dining
7/18/2015
Function
Designer
Designer
Member
Designer
Webmaster
Overwatch
ProjectChief
222-99-7777
333-88-5656
987-65-4321
987-65-4321
987-65-4321
222-99-7777
30
Step 4. Achieving 2NF:
Eliminate Partial Dependencies
• Why is this bad?
– Well, what’s wrong with the following?
EMPLOYEES’ PROJECTS
SSN
Project
123-45-6789
Dining
123-45-6789
Computing
123-45-6789
Intranet
987-65-4321
Intranet
444-55-6666
Intranet
222-99-7777
Dining
7/18/2015
Function
Designer
Designer
Member
Designer
Webmaster
Overwatch
Project Chief
222-99-7777
333-88-5656
987-65-4321
987-65-4321
222-99-7777
222-99-7777
31
Step 4. Achieving 2NF:
Eliminate Partial Dependencies
• A partial dependency (PD) occurs when a nonkey field depends on only a part of the primary
key, and not the whole primary key.
• PDs are a relation. So, we need a new table.....
EMPLOYEES’ PROJECTS
SSN
Project
123-45-6789
Dining
123-45-6789
Computing
123-45-6789
Intranet
987-65-4321
Intranet
444-55-6666
Intranet
222-99-7777
Dining
7/18/2015
Function
Designer
Designer
Member
Designer
Webmaster
Overwatch
Project Chief
222-99-7777
333-88-5656
987-65-4321
987-65-4321
987-65-4321
222-99-7777
32
Step 4. Achieving 2NF:
Eliminate Partial Dependencies
• Here’s how it should look......
EMPLOYEES’ PROJECTS
SSN
Project
123-45-6789
Dining
123-45-6789
Computing
123-45-6789
Intranet
987-65-4321
Intranet
444-55-6666
Intranet
222-99-7777
Dining
PROJECTS
Project
Project Chief
Dining
222-99-7777
Computing
333-88-5656
Intranet
987-65-4321
7/18/2015
Function
Designer
Designer
Member
Designer
Webmaster
Overwatch
33
Step 5: Achieving 3NF:
Eliminate Transitive Dependencies
• What is wrong with the following table?
PROJECTS
Project
Dining
Computing
Intranet
Contracting
CAT
7/18/2015
Project Chief
222-99-7777
333-88-5656
987-65-4321
187-87-8787
333-22-1111
Dept.
Admin
Admin
M&B
M&B
Grounds
Dept. Director
181-94-5676
181-94-5676
818-49-6765
818-49-6765
123-45-6789
Room
B115
B115
123
123
147
34
Step 5: Achieving 3NF:
Eliminate Transitive Dependencies
• We have fields dependent on a non-key field:
– The Director and Room fields clearly relate to
the Dept., and have nothing to do with the
project. (Dept is a “determinant” that is not a
candidate key)
PROJECTS
Project
Dining
Computing
Intranet
Contracting
CAT
7/18/2015
Project Chief
222-99-7777
333-88-5656
987-65-4321
187-87-8787
333-22-1111
Dept.
Admin
Admin
M&B
M&B
GRND
Dept. Director
181-94-5676
181-94-5676
818-49-6765
818-49-6765
123-45-6789
Room
B115
B115
123
123
147
35
Step 5: Achieving 3NF:
Eliminate Transitive Dependencies
• A transitive dependency occurs when a non-key field
depends on another non-key field.
• Why is this bad?.
– A typo appeared in the Contracting line. A database
without the transitive dependency would not have
allowed this to happen.
PROJECTS
Project
Dining
Computing
Intranet
Contracting
CAT
7/18/2015
Project Chief
222-99-7777
333-88-5656
987-65-4321
187-87-8787
333-22-1111
Dept.
Admin
Admin
M&B
M&B
GRND
Dept. Director
181-94-5676
181-94-5676
818-49-6765
818-49-6765
123-45-6789
Room
B115
B115
123
124
147
36
Step 5: Achieving 3NF:
Eliminate Transitive Dependencies
• How to do it:
a. Which fields are dependent on a non-key field
in the table? (Director, Room)
b. Which fields are these dependent on? (Dept)
c. Create a new table with (b) as the primary key.
d. Put (a) in the new table.
e. Remove (a) from the old table.
7/18/2015
37
Step 5: Achieving 3NF:
Eliminate Transitive Dependencies
• Here are the new tables.
PROJECTS
Project
Dining
Computing
Intranet
Contracting
CAT
DEPARTMENTS
Dept. Name
Admin
M&B
GRND
7/18/2015
Project Chief
222-99-7777
333-88-5656
987-65-4321
187-87-8787
333-22-1111
Dept.
Admin
Admin
M&B
M&B
GRND
Dept. Director
181-94-5676
818-49-6765
123-45-6789
Room
B115
123
147
38
Data Analysis: Normalization
• An entity is in first normal form (1NF) if there are no
attributes that can have more than one value for a single
instance of the entity.
• An entity is in second normal form (2NF) if it is already
in 1NF, and if the values of all non-primary key attributes
are dependent on the full primary key – not just part of it.
• An entity is in third normal form (3NF) if it is already in
2NF, and if the values of its non-primary key attributes are
not dependent on any other non-primary key attributes.
7/18/2015
39
Conclusion
•
•
•
•
•
•
•
•
7/18/2015
Rule1: Never design redundant data into a database
Rule2: Never use a counter as Primary Key
Identify proper primary keys (1NF)
Break up many-to-many relationships (1NF)
1NF: Break all data into atomic components
2NF: Identify/eliminate partial dependencies
3NF: Eliminate transitive dependencies
Common sense test
41
What is a Good Data Model?
– A good data model is simple.
• As a general rule, the data attributes that describe an entity
should describe only that entity.
– A good data model is essentially nonredundant.
• This means that each data attribute, other than foreign keys,
describes at most one entity.
– A good data model should be flexible and
adaptable to future needs.
• We should make the data models as application-independent as
possible to encourage database structures that can be extended
or modified without impact to current programs.
7/18/2015
42
• Data and Referential Integrity
– There are at least three types of data integrity
that must be designed into any database - key
integrity, domain integrity and referential
integrity.
– Key Integrity:
• Every table should have a primary key (which may
be concatenated).
– The primary key must be controlled such that no two
records in the table have the same primary key value.
– The primary key for a record must never be allowed to
have a NULL value.
7/18/2015
49
• Data and Referential Integrity
– Domain Integrity:
• Appropriate controls must be designed to ensure that
no field takes on a value that is outside of the range
of legal values.
– Referential Integrity:
• A referential integrity error exists when a foreign
key value in one table has no matching primary key
value in the related table.
7/18/2015
50
• Referential Integrity:
• Referential integrity is specified in the form of
deletion rules as follows:
– No restriction.
• Any record in the table may be deleted without regard to
any records in any other tables.
– Delete:Cascade.
• A deletion of a record in the table must be automatically
followed by the deletion of matching records in a related
table.
– Delete:Restrict.
• A deletion of a record in the table must be disallowed until
any matching records are deleted from a related table.
– Delete:Set Null.
• A deletion of a record in the table must be automatically
followed by setting any matching keys in a related table to
the value NULL.
7/18/2015
51
• Database Capacity Planning
– A database is stored on disk.
• The database administrator will want an estimate of
disk capacity for the new database to ensure that
sufficient disk space is available.
– Database capacity planning can be calculated
with simple arithmetic as follows.
1 For each table, sum the field sizes.
– This is the record size for the table.
2 For each table, multiply the record size times the
number of entity instances to be included in the
table.
– This is the table size.
7/18/2015
54
• Database Capacity Planning
– Database capacity planning can be calculated
with simple arithmetic as follows. (continued)
3 Sum the table sizes.
– This is the database size.
4 Optionally, add a slack capacity buffer (e.g., 10%) to
account for unanticipated factors or inaccurate
estimates above.
– This is the anticipated database capacity.
7/18/2015
55