9/13 - SEAS - University of Pennsylvania
Download
Report
Transcript 9/13 - SEAS - University of Pennsylvania
Relational Model & Algebra
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
September 13, 2005
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Administrivia
Homework assignments will normally be given out
on Thursdays, due the following Thursday unless
otherwise directed
Start thinking about which project you want to do,
who you might work with
Will need to form groups and pick a project by the end of
next week (I’ll announce more)
We will soon be holding an extra session on the project,
as well as essential tools and skills (stay tuned)
2
Thinking Back to Last Time…
There are a variety of ways of representing data,
each with trade-offs
Free text
Classes and subclasses
Shapes/points in space
“Objects” with “properties”
In general, our emphasis will be on the last item
… though there are spatial databases, OO databases, text
databases, and the like…
3
The Relational Data Model (1970)
Lessons from the Codd paper
Let’s separate physical implementation from logical
Model the data independently from how it will be used (accessed,
printed, etc.)
Describe the data minimally and mathematically
A relation describes an association between data items – tuples with
attributes
We generally think of tables and rows, but that’s somewhat
imprecise
Use standard mathematical (logical) operations over the data – these are
the relational algebra or relational calculus
How does this model relate to objects, properties? What are its abilities and
limitations?
4
Why Did It Take So Many Years to
Implement Relational Databases?
Codd’s original work: 1969-70
Earliest relational database research: ~1976
Oracle “2.0”: 1979
Why the gap?
1.
2.
3.
4.
“You could do the same thing in other ways”
“Nobody wants to write math formulas”
“Why would I turn my data into tables?”
“It won’t perform well”
What do you think?
5
Getting More Concrete: Building
a Database and Application
1.
Start with a conceptual model
2.
Design & implement schema
3.
4.
“On paper” using certain techniques we’ll discuss next week
We ignore low-level details – focus on logical representation
Design and codify (in SQL) the relations/tables
Do physical layout – indexes, etc.
Import the data
Write applications using DBMS and other tools
Many of the hard problems are taken care of by other people
(DBMS, API writers, library authors, web server, etc.)
6
Conceptual Design for
CIS Student Course Survey
“Who’s taking what, and what grade do they expect?”
fid
PROFESSOR
This design is independent of
the final form of the report!
STUDENT
sid
name
Teaches
COURSE
Takes
exp-grade
name
cid
name
semester
7
Example Schema
STUDENT
sid
name
COURSE
Takes
sid
exp-grade
cid
cid
subj
sem
1
Jill
1
A
550-0105
550-0105
DB
F05
2
Qun
1
A
700-1005
700-1005
AI
S05
3
Nitin
3
C
500-0105
501-0105
Arch
F05
Our focus now: relational
schema – set of tables
Can have other kinds of
schemas – XML, object, …
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0105
2
Saul
2
700-1005
8
Martin
8
501-0105
8
Some Terminology
Columns of a relation are called attributes or fields
The number of these columns is the arity of the relation
The rows of a relation are called tuples
Each attribute has values taken from a domain, e.g., subj has
domain string
Theoretically: a relation is a set of tuples; no tuple can occur
more than once
Real systems may allow duplicates for efficiency or other reasons –
we’ll ignore this for now
Objects and XML may also have the same content with different
“identity”
9
Describing Relations
A schema can be represented many ways
To the DBMS, use data definition language (DDL) –
like programming language type definitions
In relational DBs, we use relation(attribute:domain)
STUDENT(sid:int, name:string)
Takes(sid:int, exp-grade:char[2], cid:string)
COURSE(cid:string, subj:string, sem:char[3])
Teaches(fid:int, cid:string)
PROFESSOR(fid:int, name:string)
10
More on Attribute Domains
Relational DBMSs have very limited “built-in” domains: either
tables or scalar attributes – int, string, byte sequence, date,
etc.
But more generally:
We can have “nested relations”
Object-oriented, object-relational systems allow complex, user-defined
domains – lists, classes, etc.
XML systems allow for XML trees (or lists of trees) that follow
certain structural constraints
Database people, when they are discussing design, often assume
domains are evident to the reader:
STUDENT(sid, name)
11
Integrity Constraints
Domains and schemas are one form of constraint on a valid
data instance
Other important constraints include:
Key constraints:
Subset of fields that uniquely identifies a tuple, and for which no subset of
the key has this property
May have several candidate keys; one is chosen as the primary key
A superkey is a subset of fields that includes a key
Inclusion dependencies (referential integrity constraints):
A field in one relation may refer to a tuple in another relation by including
its key
The referenced tuple must exist in the other relation for the database
instance to be valid
12
SQL: Structured Query Language
The standard language for relational data
Invented by folks at IBM, esp. Don Chamberlin
Actually not a great language…
Beat a more elegant competing standard, QUEL, from
Berkeley
Separated into a DML & DDL
DML based on relational algebra & (mostly) calculus,
which we discuss this week
13
Table Definition:
SQL-92 DDL and Constraints
CREATE TABLE STUDENT
(sid INTEGER,
name CHAR(20),
)
CREATE TABLE Takes
(sid INTEGER,
exp-grade CHAR(2),
cid STRING(8),
PRIMARY KEY (sid, cid),
FOREIGN KEY (sid)
REFERENCES STUDENT,
FOREIGN KEY (cid)
REFERENCES COURSE
)
14
Example Data Instance
STUDENT
sid
COURSE
Takes
name
sid
exp-grade
cid
cid
subj
sem
1
Jill
1
A
550-0105
550-0105
DB
F05
2
Qun
1
A
700-1005
700-1005
AI
S05
3
Nitin
3
C
500-0105
501-0105
Arch
F05
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0105
2
Saul
2
700-1005
8
Martin
8
501-0105
15
From Tables SQL Application
<html>
<body>
<!-- hypotheticalEmbeddedSQL:
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
-->
</body>
</html>
C -> machine code sequence -> microprocessor
Java -> bytecode sequence -> JVM
SQL -> relational algebra expression -> query execution engine
16
Codd’s Relational Algebra
A set of mathematical operators that compose,
modify, and combine tuples within different relations
Relational algebra operations operate on relations
and produce relations (“closure”)
f: Relation Relation
f: Relation x Relation Relation
17
Codd’s Logical Operations: The
Relational Algebra
Six basic operations:
Projection
Selection
Union
Difference
Product
(Rename)
(R)
(R)
R1 [ R2
R1 – R2
R1 £ R2
->b (R)
And some other useful ones:
Join
Semijoin
Intersection
Division
R1 ⋈ R2
R1 ⊲ R2
R1 Å R2
R1 ¥ R2
18
Data Instance for Operator Examples
STUDENT
Takes
sid
sid
exp-grade
name
COURSE
cid
cid
subj
sem
1
Jill
1
A
550-0105
550-0105
DB
F05
2
Qun
1
A
700-1005
700-1005
AI
S05
3
Nitin
3
A
700-1005
501-0105
Arch
F05
4
Marty
3
C
500-0105
4
C
500-0105
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0105
2
Saul
2
700-1005
8
Martin
8
501-0105
19
Projection,
20
Selection,
21
Product X
22
Join, ⋈: A Combination of Product
and Selection
23
Union
24
Difference –
25
Rename, b
The rename operator can be expressed several
ways:
The book has a very odd definition that’s not algebraic
An alternate definition:
b(x)
Takes the relation with schema
Returns a relation with the attribute list b
Rename isn’t all that useful, except if you join a relation
with itself
Why would it be useful here?
26
Mini-Quiz
This completes the basic operations of the relational
algebra. We shall soon find out in what sense this is
an adequate set of operations. Try writing queries for
these:
The names of students named “Bob”
The names of students expecting an “A”
The names of students in Milo Martin’s 501 class
The sids and names of students not enrolled
27
Deriving Intersection
Intersection: as with set operations, derivable from
difference
A-B
B-A
A
B
AÅB
≡ (A [ B) – (A – B) – (B – A)
≡ (A - B) – (B - A)
28
Division
A somewhat messy operation that can be expressed
in terms of the operations we have already defined
Used to express queries such as “The fid's of faculty
who have taught all subjects”
Paraphrased: “The fid’s of professors for which there
does not exist a subject that they haven’t taught”
29
Division Using Our Existing Operators
All possible teaching assignments: Allpairs:
fid,subj (PROFESSOR £ subj(COURSE))
NotTaught, all (fid,subj) pairs for which professor fid
has not taught subj:
Allpairs - fid,subj(Teaches ⋈ COURSE)
Answer is all faculty not in NotTaught:
fid(PROFESSOR) - fid(NotTaught)
´ fid(PROFESSOR) - fid(
fid,subj (PROFESSOR £ subj(COURSE)) fid,subj(Teaches ⋈ COURSE))
30
Division: R1 R2
Requirement: schema(R1) ¾ schema(R2)
Result schema: schema(R1) – schema(R2)
“Professors who have taught all courses”:
fid (fid,subj(Teaches ⋈ COURSE) subj(COURSE))
What about “Courses that have been taught by all
faculty”?
31
The Big Picture: SQL to Algebra to
Query Plan to Web Page
Web Server /
UI / etc
Query Plan – an
operator tree
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
32
Hint of Future Things: Optimization
Is Based on Algebraic Equivalences
Relational algebra has laws of commutativity, associativity, etc.
that imply certain expressions are equivalent in semantics
They may be different in cost of evaluation!
c Ç d(R) ´ c(R) [ d(R)
c (R1 £ R2) ´ R1 ⋈c R2
c Ç d (R) ´ c (d (R))
Query optimization finds the most efficient representation to
evaluate (or one that’s not bad)
33
Next Time: An Equivalent, But
Very Different, Formalism
Codd invented a relational calculus that he proved
was equivalent in expressiveness
Based on a subset of first-order logic – declarative,
without an implicit order of evaluation
More convenient for describing certain things, and for
certain kinds of manipulations
… And, in fact, the basis of SQL!
34