Transcript lecture3

LIS 557
Database Design and
Management
William Voon
Michael Cole
Spring '04
Relational Databases
5 February 2004
"Simplicity does not precede complexity, but
follows it." - Alan J. Perlis
The Relational Database
1970 – Invented by Edgar (Ted) Codd
(IBM)
"A Relational Model of Data for Large Shared
Data Banks"
http://www.acm.org/classics/nov95/toc.html
1981 – Turing Award
2002 – Forbes names relational databses as one of the
most important modern innovations
http://www.forbes.com/forbes/2002/1223/172_print.html
The Problem
●
Databases were already important, but computer time cost hundreds of
dollars a minute.
●
So, programs needed to be as efficient as possible before execution.
●
Early databases had
–
a rigid hierarchical structure, or
–
a complex system of navigational pointers to the physical locations
of the data on magnetic tapes.
It took groups of programmers to write database queries (CODASYL).
Every query required a new program. New data types required a
redesign of the database. The price of efficiency was utter inflexibility.
Codd's Insight
●
●
Relationships between data items should be based on
the item's values, and not on separately specified
linking or nesting.
–
A big simplification for queries since we only talk
about the data
–
Allows flexibility to use existing data in new ways
Users should be able to work at a more naturallanguage level and not have to know anything about
the details of the data storage
Codd's RDB
●
●
●
●
This separation of the data from the details of its storage,
etc. requires much more computing power to keep track of
those details
In this respect relational databases are far less efficient than
thier predecessors, but they are spectacularly more cost
effective
Relational databases really took off as computing power
continued to improve (Moore's Law)
Today, relational databases are dominant.
RDB development
●
●
1973 - IBM begins development in the System R project
–
Structured Query Language (SQL), to express queries
(Don Chamberlin and Ray Boyce)
–
A cost-based optimizer, which automatically translates a
high-level query into an efficient plan for executing the
query (Pat Selinger)
–
A query compiler, to save query plans for future use
(Raymond Lorie)
Boyce and Codd also developed Boyce-Codd Normal Form
to design efficient RDB tables without needless duplication
of information in different tables
Early RDB Commercialization
●
●
●
●
●
1970s – INGRES UC-Berkeley (Mike Stonebraker)
1977 – Relational Software Inc. (Larry Ellison)
produces first commercially available relational
database.
1981 – IBM announces SQL/DS.
1983 – IBM introduces DB2 for large mainframes. It
now stores data on handhelds to supercomputers
and handles billions of transactions per day
1983 – Relational Software renamed Oracle
The RDBMS
●
●
●
Handles the complex physical details of the
location and actual relationships between the
data sets
Presents the data model so that the user
sees the database as a collection of tables
In designing and using the database we need
only think about relationships between tables
Tables
Columns store attributes (same data type
for each column!)
Rows (tuples) store entities
Table = entity set = relation
A flat version of a database
Why is this inefficient?
Relating Tables
Table name: CUSTOMER
Table name: SALESREP
The tables are independent, but we can still learn
that Nam June Paik is Sonny Blount's sales rep or
that Paik handles both Blount and John Gilmore
(and that makes sense).
What's the big deal?
●
●
Complex things (the point of the database)
can be broken into logical units (perhaps
along the lines of the data domains)
The units (aka tables) can be selected so
that they minimize the redundant data in all
of the tables
Nature of Inefficiency
●
●
Duplicated data
Natural groupings are associated with types
of queries
Anatomy of a Table
●
A table (relation) is a 2-D structure
●
Each row (tuple) defines a single entity
●
●
Each column defines the attributes of the entities in the
set
Each row/column element has a single data value of a
type appropriate to the attribute (and allowed by the
software)
●
Order of rows and columns does not matter to the user
●
Every table has a primary key
Column = Attribute
●
Each column represents an attribute
●
Each column has a unique name
●
The values in the column are of a single type
●
Range of possible values is the column's
domain
Row = Entity
●
●
A row corresponds to an entity, so reading
across the columns we see all of the specific
attributes for that entity
Taken together the rows are the entity set.
Notice that this is logically equivalent to the
table itself.
Null Values
●
Can be useful
–
Placeholder for an unknown attribute value
–
Known, but missing attribute value
–
A 'not applicable' condition
Data Types
●
Options are defined by the software
●
Should be appropriate to the attribute
●
Standard data types:
●
–
Numeric
–
Character
–
Date
–
Logical
Other data types may be supported
Numeric Data
●
Numbers that are arithmetically meaningful
–
●
It makes sense to add, subtract, etc.
Other uses of numbers are handled as
character data
–
e.g. Telephone numbers, credit card identifiers,
zip codes
Character Data
●
●
Also called string data
Any set of characters that is not intended for
mathematical manipulation
Date Data
Calendar dates are stored in a Julian date format
invented by Joseph Scaliger in 1583
–
●
the Julian date is the number of elapsed
days since the beginning of a cycle of 7,980
years
The system allows easy calculation of the
integer difference between two calendar dates
Julian Dates
●
●
The 7,980 year cycle was derived by combining several
traditional time cycles (solar, lunar, and a particular
Roman tax cycle) for which 7,980 was a common
multiple.
The first Julian cycle began on January 1, 4713 B.C.
and will end on January 22, 3268.
February 5, 2004 @ 19:00:00 EST
= 2453041.5000
A calendar calculator:
http://www.fourmilab.com/documents/calendar/
Logical Data
●
●
True / False
Can perform Boolean logical calculations
(AND, OR, NOT) on this data
Keys
●
The RDB works because of controlled
redundancy
–
Tables share common attributes, so they can be
linked
–
The idea is that a specific attribute in one table
can be used to point to a row (entity) in another
table
Key Types
●
Superkey
–
●
Candidate key
–
●
Can be an attribute or a combination of attributes that
uniquely identifies each entity (row) in a table
A minimal superkey (A superkey that contains no
superkeys)
Primary key
–
A candidate key that identifies all the values in a row
(entity)
The Primary Key
●
●
Each table must have a primary key
–
There must be a unique key for each entity
(row), null values are not allowed
–
The primary key is the identifier for each entity
A table with this property has entity integrity
Key Types
●
Secondary key
–
●
An attribute (or combination) used only for data
retrieval
Foreign key
–
An attribute (or combination) in one table whose
values must match the primary key in another
table or be null.
Integrity
●
Entity integrity depends on the primary key
–
●
Ensures each entity has a unique identifier
Referential integrity depends on the foreign
key(s)
–
Makes it impossible for an attribute to have an
invalid entry (might be null)
Assignment
●
●
Design a collection of tables for a database
Briefly, what is the database about? Who are
the users? What purposes will the database
serve?
–
At least three related tables
–
Explain why this is the most efficient design
–
Is this also the most logical design?