Database Theory and Terminology, Part 1

Download Report

Transcript Database Theory and Terminology, Part 1

Database Theory and
Terminology, Part 1
Theory and Practice
• We will begin by looking at the theory
and the terminology of table design.
• After that, we will focus on the practical
side—using Access to:
– Create a new database
– Design tables
– Create relationships
• That’s what assignment 1 will be about.
Modern Databases
• Your reading assignment for this week is the first two
chapters of a book called “Databases Demystified.”
• This book contains a general definition of a database: a
database is a collection of interrelated data items that are
managed as a single unit.
• This definition is deliberately vague, allowing it to cover
most of the different types of databases that have been
used over the past five decades or so.
• For a relational database, the definition can be more
focused, at least on the logical level:
• Database: A collection of tables, the relationships between
them, and auxiliary items such as views and stored
procedures.
DBMS
• A database is managed, strangely enough, by
something called a “database management system”
(DBMS).
• Popular DBMS’s include Oracle, MySql, DB2, and
Microsoft’s SQL Server (for large-scale databases) and
Access (for smaller databases).
• Large-scale DBMS’s like Oracle and SQL Server typically
run on specialized computers called servers, which
provide data for many computers (clients) over
networks. They typically store their data in many files,
frequently spread across many hard drives, and even
many different servers.
Access
• While Access can be used on networks, it is more
of a “personal” DBMS, running on the user’s
computer instead of a separate server. It stores
everything in a single file (*.mdb for Access 2003
and earlier; *.accdb for Access 2007).
• The single-file feature of Access is why we will
use it in this class. It makes it easy for me to share
entire databases with you, and for you to turn in
databases for assignments and projects.
Microsoft Access
• Access is not the best or most powerful DBMS;
it is just the most convenient for use in this
class.
• Nevertheless, it is good enough and powerful
enough that it serves very nicely as a training
database: most of what you need to know
about databases you can learn using Access.
• You’ll start learning how to use Access in the
next lecture and in lab.
Key Point!
• Good relational database design is about
optimizing how the data is STORED, not how it is
DISPLAYED.
• Most “tables” you have seen—in books, in
lectures, on the web—were probably optimized
for display, not for storage.
• Relational database tables are designed for
consistency and to reduce redundancy. They are
not designed for appearance.
• When we learn SQL and Visual Basic, we will look
at various ways to display the data stored in
relational database tables.
Why “relational?”
• We’ll be learning about database relationships
later is this mega-lecture.
• However, relationships are not what gave the
relational database its name.
• The term “relational” comes from the
mathematical concept of “relation,” which refers
to a set of ordered pairs (or triplets, etc.; the
generic term is “tuple”) of items. A mathematical
function is a special type of relation.
Entities
• In a properly designed relational database, each
relation (table) represents a single “entity”.
• An entity is sort of a generic noun. For example, the
concept of Customer is an entity, but one particular
customer is not an entity.
• “Customers” could be the name of the table
representing the Customer entity. Each row in the table
would represent a particular real customer.
• In object-oriented programming (OOP), an entity is
typically represented by something called a “class.” An
individual instance of that class (a particular customer,
for example), is called an “object.”
In each group, which is the entity?
• IOE 373
• Course
• Team
• Jets
• Person
• Barack Obama
• Tom Brady
• Player
• Company
• Microsoft
• University
• Michigan
• Honda Accord
• Car
• Product
• Crest Toothpaste
• This Slide
• Slide
Tuples, Records, Rows
• What Codd referred to as a “tuple” is generally
referred to as a “record” or a “row” in practical
database discussions.
• Records are called rows because they are
represented by the horizontal rows in database
tables.
• The items in the previous slide that were not
entities were records—instances of the entity.
• In OOP, a record (row) is called an “object,” which
is an instance of a class.
Attributes, Fields, Columns
• Codd and other theorists called the items in each tuple
“attributes”, properties which all instances of an entity
have, but which aren’t always the same for each instance.
• For example, attributes of the Customer entity could
include LastName and FirstName. Each customer has these
attributes, but the particular values are different.
• In practical database discussions, attributes are referred to
as “fields” or “columns”. (“Columns” since they are
represented by the vertical columns in a table.)
• In OOP, an attribute is generally referred to as a “property.”
Tables as Templates
• You can think of a table as a template for
describing or designing things of a particular
type.
• For example, the “car” entity can be
described by attributes such as make,
model, year, color, miles per gallon, etc.
• You can then use a table with those fields to
describe individual cars (rows, records):
Typical Beginners’ Tables
• I’ve often seen the case where people with no
formal database training have tried to create
their own databases—perhaps their CD or recipe
collection, or to run their small business, or for
some event they are organizing.
• Frequently they will turn to Excel. They will take a
single worksheet and try to put all of their data
into a single table. Eventually, these single-table
databases run into some serious problems—the
problems that Codd was trying to address.
Guate Tours
• Here’s an example of a simple table that starts
to grow, and the problems created by it.
• It is based on a database designed by a
student last year. She actually designed the
database properly; I took her data and made
the table worse so I could use it as an
example.
Guate Tours
• Guate Tours is a company in Guatemala which conducts tours of the
historical, cultural, and natural treasures of that country.
• The company's manager, Cristina Rodriguez (""C Rod"") has used
Excel to prepare a list of the tours that they offer.
A student named Rita who works for Guate Tours in the summer returns home from her
IOE 373 class, and tells the manager that they should add a Tour ID number to their table
to serve as a primary key, especially since several of their tours have similar names. In
the past, the manager has told employees to lead the Antigua tour, and the employees
then took the customers on the WRONG Antigua tour. So having a numeric key for each
tour may help.
Each tour is generally led by the same guide, but the manager sometimes forgets which
employee leads which tour. Therefore, she decides to add the employee's name to the
list.
Then one day the father of one of the customers is in a bad accident back in Italy and is
taken to the hospital. The customer's brother calls Guate Tours and asks them to get in
touch with the customer, but the manager has trouble finding the cellphone number of
the guide. When the customer finally finds out about the accident, he is irate and
threatens to sue the company. The manager decides it might be a good idea to add her
employees' contact info to the list.
So this is what the list looks like. Eventually, the manager gets sick of managing and
decides to give up the tour business and go into VB programming. On her last day, she
appoints Rita as the new manager and removes herself from the list.
Manager’s name was here
After this, Rita looks more closely at the list and realizes that she is identified as the
manager if you look at one row, but as a tour guide if you look at another. She recalls
that this is the “update anomaly.” When C-Rod made Rita the manager, she only changed
her title in one place. Rita changes her title to Manager in the other place now.
Rita changes her title to Manager in the other place now.
A few days later Rita gets a call from a customer from California who really wants to take
the Rio Dulce trip. Rita looks at the list and says, “Do you mean the Rio Dulce Boat Trip?”
The customer replies “No, I don't ever want to get that close to crocodiles again! I want
that five-star hotel.” Rita then realizes the mistake-when C-Rod removed herself from the
list, she took the whole Rio Dulce trip along with her. Rita remembers learning in 373
that this is called the “delete anomaly.” Too much data was stored in a single row in the
list.
She adds the Rio Dulce trip back into the list, and realizes that she doesn't have anyone
to lead it.
Rita is now overworked. She is still leading two different tours AND managing the
company, and she's still got the Rio Dulce trip with nobody to guide it. She decides to
hire someone new. She hasn't quite decided which of her trips she is going to give to the
new employee, so she tries to just insert his name and contact info into the table.
She knows something is wrong—this is supposed to be a table of trips, and now she has
a row in the table which doesn’t have a trip or a unique numeric identifier. She recalls
learning something about this problem back in 373—the “insert anomaly”, it was called.
The Three Anomalies
• The story highlights what are known as the three anomalies:
– Insert Anomaly: This occurs when you are storing information
about one entity in a table focused on a different entity. In the
story, the table was designed around tours. When a new
employee was hired, there was no logical place to put his
information because he was not yet assigned to a tour.
– Update Anomaly: For the same reason, updating employee
information, like a change of title or phone number, is
problematic in a table focused on a different entity. If you are not
careful, inconsistencies can arise. In the story, Rita’s title was
changed in one row, but not another. The table was no longer
consistent as to what her title really was.
– Delete Anomaly: When employee information is stored in the
tours table, deleting a tour may result in the loss of all
information about an employee.
• You should know these anomalies for the quiz on 9/21.
The Three Anomalies
• The insert, update, and delete anomalies were
frequently encountered in early attempts to
create useful databases back in the 1960’s.
• Dr. E.F. Codd developed relational database
theory as a way of avoiding these anomalies.
• All three problems arise because data about
multiple entities are stored in a single table.
• In the Guate Tours example, the entities are
Tours and Employees.
Two Entities in One Table
Tours
Employees
The cure: Third Normal Form
• Codd and others defined the rules for the proper
structure of database tables; these have become
known as “normal forms.” While six (and
sometimes more) normal forms have been
defined, the acceptable level of “normalization”
for a database is Third Normal Form (3NF).
• You will learn the details of 3NF later in this
lengthy slideshow.
• I will teach you 3NF mostly by example, but I’ll
give you a brief rundown of what the first three
normal forms mean.
OLTP vs. OLAP
• Modern databases tend to fall into one of two
categories:
– Online Transaction Processing (OLTP)
– Online Analytical Processing (OLAP)
– We’ll spend a couple of weeks learning about
OLTP databases;
– We’ll only spend a few slides learning about OLAP
databases.
– Normalization is only necessary and desirable for
OLTP databases.
OLAP: Online Analytical Processing
• OLAP databases are typically composed of
historical data, such as sales records from
previous years. This data is often gathered
(using SQL) from an OLTP database.
• While there are technical differences in the
definitions, an OLAP database is often called a
“data warehouse.”
OLAP
• Since the data is historical, it doesn’t need to
be updated. Therefore, the insert, update, and
delete anomalies that can plague an OLTP
database are rarely a concern with OLAP
databases.
Uses for OLAP databases
• OLAP databases are generally used to help
management make decisions. They can be
used to answer questions like:
– Which products sell best in which regions?
– Who are our best (and worst) sales people?
– Which stores should we close, and where should
we open new stores?
– Will we be better off trying to sell more lowvolume, high-price items, or more high-volume,
low-price items?
Characteristics of OLAP databases
• Since questions (queries) like these can take a
long time to run when the data is stored in
multiple tables, OLAP databases are frequently
“denormalized”—the data from several tables is
combined into a single table which is not in third
normal form. Queries on single tables run faster
than queries across multiple tables.
• Since these tables do not need to be updated,
denormalization does not pose a problem.
OLAP, we hardly knew ya!
• Unfortunately, that’s all the time we have to
spend on OLAP databases.
• This doesn’t mean they’re not important, just
that they are easier to understand than OLTP
databases—many typical Excel sheets are pretty
close to being OLAP databases.
• If you want to learn more, there are many books
(including Databases Demystified, chapter 12)
and websites which cover the topic.
OLTP
• OLTP (Online Transaction Processing)
databases are the type we’ll be focusing on in
this class. OLTP databases are designed to
handle:
– Frequent changes (inserts, updates, deletes)
– Frequent requests for small amounts of data, such
as a customer or product record.
– Multiple concurrent users.
OLTP databases: Is there anything they
can’t do?
• Well, yes. Properly normalized OLTP databases
are not optimized for:
– Quickly retrieving large amounts of summary data.
– Using in pivot tables to analyze historical data.
• In general, you run your day-to-day business
using an OLTP database.
• You then periodically transfer that data to a nonnormalized OLAP database.
• You use the OLAP database to analyze your
business’s performance.
Practice
• Would you be more likely to find a table called
“2009Sales” in an OLAP or OLTP database?
• How about “Customers”?
• “QuarterlySummary”?
OLTP
• Because the data in OLTP databases is frequently
modified, they are highly susceptible to problems
caused by the three anomalies (insert, update,
delete).
• The vaccine for these anomalies is normalization;
in particular, getting the tables into third normal
form (3NF).
• Reminder: Normalization is not required for OLAP
databases because their data is not updated
frequently. In fact, it is common to denormalize
OLAP databases.
Reminder!
• Good relational database design is about
optimizing how the data is STORED, not how it is
DISPLAYED.
• Most “tables” you have seen—in books, in
lectures, on the web—were probably optimized
for display, not for storage.
• Relational database tables are designed for
consistency and to reduce redundancy. They are
not designed for appearance.
• When we learn SQL and Visual Basic, we will look
at various ways to display the data stored in
relational database tables.
Normalization: A Database Thing
• While it is common to talk about normalizing
tables, normalization is a process that refers to
an entire database.
• Normalizing a table generally involves creating
more tables.
• Before we can learn more about normalizing a
database, we need to know about designing
tables.