IntroducingDatabases

Download Report

Transcript IntroducingDatabases

Introducing Databases
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2005  Department of Computer & Information Science
Goals
By the end of this unit, you should understand …
• … what a database is.
• … what components comprise a database.
• … what a Database Management System is.
• … the difference among the different types of
database structures.
• … generally, how database administrators
construct databases.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
So, what is a database?
• In a general sense, a database is any
organized collection of data.
• Examples:
–
–
–
–
–
Grocery List
Audio CD Catalog
Phone Book
Airline Ticketing Software
Tax Preparation Software
–
–
–
–
–
Oncourse
Google
MapQuest
Amazon
eBay
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Databases in the Digital World
• When we think of applications we commonly
use, we often think of word processors as tools
for solving projects that require us to write; we
think of spreadsheets as tools to help us solve
problems dealing with numbers (statistics,
averages, etc.)
• Whereas spreadsheets are good at answering
questions involving numbers ("What is the
average … ?"), databases are good at solving
other types of questions ("Are there any
compact discs available by … ?").
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Databases in the Digital World
(continued)
• Word processors process text.
• Spreadsheets process number data.
• Databases process data.
(from geekgirl's plain-english computing)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Data vs. Information
• For the user of a database, the end goal is to
view meaningful information.
• Raw data, the values we store in a database, by
themselves are essentially useless. For instance,
do we know what the value 85215 means? Is it
a zip code? Is it a student ID number? Is it a
code for a billing application? We don't know …
(Hernandez)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Data Processing
• When we process data, we connect sets of data
to make meaningful information.
• For instance, if we connect the value 85215 to
the value "Tax Preparation – 1040
(Schedule C)", we're probably able to
discern that the value 85215 is a code that
represents some type of billable service – tax
preparation, in this case (Hernandez).
• The end result of data processing is meaningful
information.
• Data is stored; information is retrieved.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Types of Modern Databases
Operational Databases
• Used for online
transaction processing
(OLTP)
• Dynamic in nature
("just in time"
information)
• Used heavily by
commercial entities
Analytical Databases
• Used for online
analytical processing
(OLAP)
• Static in nature
• Often, use OLTPs to
populate data
• Used heavily by
research entities
-from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Historical Database Models
• A database model speaks to how we
create a database.
• Throughout the years, people have used
these models for creating databases:
–
–
–
–
The
The
The
The
Hierarchical Model
Network Model
Relational Model (most commonly used today)
Object-Oriented Model (the future?)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
The Hierarchical Model
• The hierarchical model connects tables of data
via parent/child relationships. In such relations,
a parent table can have 1 or more children, but
a child table must have 1 and only 1 parent.
• Tables connect using the physical arrangement
of records.
• The hierarchical model requires that a user know
the structure of the database. Access always
starts at the root table.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Hierarchical Model Example
Agents
Entertainers
Schedule
Clients
Engagements
Payments
- Figure 1.1 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Network Database Model
• Introduces nodes and sets structures. Nodes are
collections of records and set structures are the
relationships in the database.
• The relationship between nodes has 1 node as
the owner node, with 1 or more member nodes.
A record in a member node can only be related
to only 1 record in an owner node. Records in a
member node cannot exist without being related
to a record in an owner node.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Network Model Example
Agents
Represent
Manage
Clients
Make
Payments
Entertainers
Schedule
Perform
Engagements
Play
Musical Styles
- Figure 1.3 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Relational Model
• Derived from two branches of mathematics – set
theory & first-order predicate logic.
• Stores data in relations (tables). Each table is
composed of tuples (records) and attributes
(fields).
• Two features of this model allow us to access
data without knowing database structure:
– The physical structure of the records and fields in a table doesn’t
matter.
– We identify each individual record in a table by a unique value.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Table Relationships
• We categorize table relationships in the
Relational Model as follows:
– One-to-One (1:1)
– One-to-Many (1:N)
– Many-to-Many (N:N)
• To establish a relationship between tables,
we need to match values of a shared field.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Relationship Example
Agent ID
Agent First Name
Agent Last Name
100
Mike
Hernandez
05/16/95
101
Greg
Piercy
10/15/95
102
Katherine
Ehrlich
03/01/96
Client ID Agent ID
Hire Date
Client First Name
Client Last Name
9001
100
Stewart
Jameson
9002
100
Shannon
McLain
9003
102
Estella
Pundt
- Figure 1.5 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Advantages of Relational Databases
• Layers of data integrity
– Table level data integrity: ensures records aren’t duplicated and
key values are present
– Relationship level data integrity: ensures that the relationship
between two tables is valid
– Business level: ensures that data is accurate in terms of business
rules
• Data consistency & accuracy – result of built-in
data integrity.
• Independence from physical structure
• Easy data retrieval
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Database Management Software
• Relational database management systems
(RDBMS) are applications used to “create,
maintain, modify and manipulate” a database.
• Typically, RDBMSs include:
–
–
–
–
Tools to build tables and establish table relationships
Tools for creating forms for user input/output.
Tools for querying a database (asking the database a question)
Tools for creating reports for output.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Phases of Database Design
1. Requirements Analysis – Understanding the
information needs of a business client through
interviews to understand their current (and
future) business environment.
2. Data Modeling – Modeling the database
structure using one of the established datamodeling methods, like entity-relationship
diagrams; end goal is to visually represent the
database structure.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Phases of Database Design (cont.)
3. Data Normalization – Breaking large
tables into smaller ones to eliminate
redundant data and avoid problems
when manipulating data.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Database Tables
• A database stores data in relations,
perceived by the user as tables.
– Comprised of tuples (records) and attributes (fields)
– Chief structures in a database
– Logical and physical order of fields and records
doesn’t matter
– Every table must contain a Primary Key Field, which
uniquely identifies each of the table’s records.
– Tables can represent objects or events.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Types of Tables
• Data Table
– Most common type of table in a relational database
– Store data that supplies information
– Dynamic in nature
• Validation Table (Lookup Table)
– Stores data used when enforcing data integrity
– Usually static in nature
– Examples: job codes, city names, billing categories,
etc.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Fields
• A field, or attribute, is the smallest structure in a
database.
• Represents a characteristic of the subject of the
table to which it belongs.
• The quality of information retrieved from the
database depends heavily on the time invested
in ensuring the structural and data integrity of
fields (more on that later …).
• A field should contain 1 and only 1 distinct value
(FirstName or LastName versus FullName, for
instance.)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Records
• A record, or tuple, is a specific instance of
the subject of a table. A record is made up
of all fields in a table. Some fields may
have empty values.
• The primary key field stores a value that
uniquely identifies the record throughout
the database.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Record & Field Example
Fields
Student ID
Student First Name
Student Last Name
Student Major 1
40853
William
Harden
Political Science
98364
Maria
Garcia-Grande
Nursing
15792
Michael
Bobersky
Psychology
Records
Table Name is Students
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Views
• A view, or a virtual table or saved query, is
made up of fields from other tables in the
database. The contributing tables are called
base tables.
• Since data is stored in other tables, databases
do not store data associated with views (thus
eliminating redundancy). Databases only store
the structure of the view.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Advantages of Views
•
•
•
You can work with data from multiple
base tables simultaneously.
Security – views prevent restricted users
from manipulating data stored in base
tables.
Views are useful for implementing data
integrity (a validation view).
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Primary Keys
• A primary key is a field or group of fields that
uniquely identifies a record. A primary key
comprised of two or more fields is called a
composite primary key. Every table must have a
primary key!
• The most important key in a table:
–
–
–
–
Uniquely identifies a specific record throughout a database
Identifies a specific table throughout the database
Enforces table-level integrity
Helps to establish relationships between tables
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Foreign Keys
• A foreign key is important when we establish
relationships between tables.
• To create a foreign key, you would take a
primary key from one table and copy it in a
second table. In the second table, the key
becomes a foreign key.
• Foreign keys enforce relationship-level integrity
– values in one table's foreign key field must
match exactly with the corresponding values of
a second table's primary key field.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Example of Primary & Foreign Keys
Agent ID
Agent First Name
Agent Last Name
100
Mike
Hernandez
05/16/95
101
Greg
Piercy
10/15/95
102
Katherine
Ehrlich
03/01/96
Client ID Agent ID
Clients
Table
Hire Date
Agents
Table
Client First Name
Client Last Name
9001
100
Stewart
Jameson
9002
100
Shannon
McLain
9003
102
Estella
Pundt
Agent ID is the Primary Key in the Agents Table
and a Foreign Key in the Clients Table.
- Adapted from Figure 3.11 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Relationships
• We can build a relationship between tables
if we can relate the records in one table
with the records in the joining table.
• Two methods for building a relationship:
– Linking primary and foreign keys
– Linking tables via a third table called a linking table or
associative table
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Importance of Relationships
• Relationships allow users to establish
views based on multiple base tables.
• Relationships help to reduce data
redundancy and eliminate duplicate data,
thus reinforcing data integrity.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Categorizing Relationships
• We categorize relationships between
tables in three ways:
– The type of relationship between tables
– The way that each table in relationship participates in
that relationship
– The degree of participation that each table
participates in a relationship
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Different Types of Relationships
• One-to-One Relationship (1:1)
• One-to-Many Relationship (1:N)
• Many-to-Many Relationship (N:N)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
One-To-One Relationships (1:1)
• In a one-to-one relationship (1:1), we relate one
and only one record from a parent table to one
and only one record in a second table (a child
table).
• To create a 1:1 relationship, we copy the
primary key of a parent table into a child table,
where it becomes a foreign key.
• This type of relationship is unique because both
tables share the same primary key. The primary
key in the child table serves both as that table's
primary key and a foreign key .
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Example of a 1:1 Relationship
Employee ID
Employee First Name Employee Last Name
100
Zachary
Erlich
101
Susan
McClain
102
Joe
Rosales
Employee ID Hourly Rate
Commission Rate
100
25.00
5.0%
101
19.75
3.5%
102
22.50
5.0%
Compensation Table
Employee
Table
Employee ID is the Primary Key for both tables
and also a Foreign Key in the Compensation Table.
- Adapted from Figure 3.13 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
One-To-Many Relationships (1:N)
• In a one-to-many (1:N) relationship, we relate a
record in one table (a parent table) to many
records in a second table (a child table).
• To create a 1:N relationship, we copy the
primary key of a parent table into a child table,
where it becomes a foreign key.
• This type of relationship is the most common
type of relationship in the relational database
model.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Example of a 1:N Relationship
Agent ID
Agent First Name
Agent Last Name
100
Mike
Hernandez
05/16/95
101
Greg
Piercy
10/15/95
102
Katherine
Ehrlich
03/01/96
Client ID Agent ID
Clients
Table
Hire Date
Agents
Table
Client First Name
Client Last Name
9001
100
Stewart
Jameson
9002
100
Shannon
McLain
9003
102
Estella
Pundt
Agent ID is the Primary Key in the Agents Table
and a Foreign Key in the Clients Table.
- Adapted from Figure 3.14 from Herenandez
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Many-To-Many Relationships (N:N)
• In a many-to-many relationship, we relate many
records in one table to many records in a second
table.
• We cannot inherently create a N:N relationship.
Instead, we can resolve a N:N relationship by
copying the primary keys of each table into a
third table, called a linking (associative) table.
Together, the copied keys form a composite
primary key. Individually, they serve as foreign
keys for the other table.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Example of Resolving an
N:N Relationship
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Relationship Participation
• There are two ways that we categorize
relationships based on participation:
– Mandatory Participation: If a user MUST enter at least
one record into a parent table before s/he may enter
records in a child table.
– Optional Participation: If a user MAY enter records in
a child table without entering records in the parent
table.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Degrees of Participation
• We calculate a table's degree of participation by:
– The minimum number of records it must associate with a single
record in the related table.
– The maximum number of records that a related table may
associate with a single record in the given table.
• Think of the degree of participation as the
minimum and maximum number of relationships
for a single record in a table.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Example of Degree of Association
• Assume that for a Department, advisors
are assigned at least 1 student and up to
50 students, but no more.
• The degree of participation of the Advisor
Table would be 1,50. That is, an advisor
must be assigned to at least one student
in the Student Table, but has a limit of 50
students in the Student Table.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Field Specification
• Field Specification (also called domain)
includes all of the elements of a field.
There are three types of field elements:
– General Elements: Include all of the basic information
about a field, including the field name, the field
description and a field's parent table.
– Physical Elements: Include information on how the
field is constructed and how a user views the field;
data type, field length and display format are all
physical elements.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Field Specification (continued)
– Logical Elements: Describe the values that a field can
store, including required values, range of values and
default values.
• Field specification is an important part of
database design because it helps to
enforce field-level integrity of a database.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Data Integrity
• "Data integrity refers to the validity,
consistency, and accuracy of the data in a
database." (Hernandez, p. 71)
• Four Types of Data Integrity:
–
–
–
–
Table-level integrity
Field-level integrity
Relationship-level integrity
Business rules
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Table-Level Integrity
• Also known as entity integrity
• Ensures there are no duplicate records
throughout a database
• Makes sure that primary keys with a table
are unique never null
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Field-Level Integrity
• Also known as domain integrity
• Guarantees that that structure of each
field is sound:
– Values are "valid, consistent and accurate"
(Hernandez, p. 71)
– Values of the same type (for instance, we would
define fields related to an academic major in a
consistent manner throughout the database).
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Relationship-Level Integrity
• Also known as referential integrity
• Checks to make sure that the relationships
between tables are sound.
• Also, ensures that records in related tables
are synchronized when someone enters
data, deletes data or otherwise
manipulates it.
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Business Rules
• A database is framed to fit the ways in
which an organization runs its business.
• Business rules may affect several aspects
of database design, including:
–
–
–
–
–
Field ranges and valid values
Types of table relationships
Degree of relationships
Degree of participation
Synchronization of tables
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
Questions?
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science
References
• geekgirl's plain-english computing
(website):
http://www.geekgirls.com/menu_databases.htm
• Database Design for Mere Mortals, 2nd
Edition by Michael Hernandez (AddisonWesley, 2004)
CSCI N207: Data Analysis Using Spreadsheets
Copyright ©2004  Department of Computer & Information Science