9/9 - SEAS - University of Pennsylvania

Download Report

Transcript 9/9 - SEAS - University of Pennsylvania

Relational Model & Algebra
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
September 9, 2003
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Administrivia
 New classroom (as you know): Towne 311
But Thursday, we combine with Prof. Davidson’s CSE 330 class –
we’ll be in 402 Logan Hall, on 240 South 36th Street
 Dinkar’s office hours and Location:
12:30-1:30, Mondays in Moore 459
 Homework assignments will normally be given out on
Thursdays, due the following Thursday unless otherwise
directed
2
Blogging and the Project
 Who played with blogger.com over the
weekend?
 What’s it all about?
 Notable features?
 Start thinking about which project you want to
do, who you might work with
 Will need to form groups and pick a project by the
end of next week
3
Database Design
Lessons from the Codd paper
 Let’s separate physical implementation from logical
 Model the data independently from how it will be
used (accessed, printed, etc.)
 “Normalize the data” into relations (tables) – this is the
relational model
 Use a standard set of operations over the data – these are
the relational algebra or relational calculus
What are the benefits here??? Why was this so successful?
4
Building a Database Application
1. Start with a conceptual model


“On paper” using certain techniques we’ll discuss next week
We ignore low-level details – focus on logical representation
2. Design & implement schema


Create the tables
Do physical layout – indexes, etc.
3. Import the data
4. Write applications using DBMS and other tools
Many of the hard problems are taken care of by other people
(DBMS, API writers, library authors, web server, etc.)
5
Conceptual Design for
CIS Student Course Survey
“Who’s taking what, and what grade do they expect?”
fid
PROFESSOR
This design is independent of
the final form of the report!
STUDENT
sid
name
Teaches
COURSE
Takes
exp-grade
name
cid
name
semester
6
Example Schema
STUDENT
sid
name
COURSE
Takes
sid
exp-grade
cid
cid
subj
sem
1
Jill
1
A
550-0103
550-0103
DB
F03
2
Qun
1
A
700-1003
700-1003
AI
S03
3
Nitin
3
C
500-0103
501-0103
Arch
F03
 Our focus now: relational
schema – set of tables
 Can have other kinds of
schemas – XML, object,
…
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0103
2
Saul
2
700-1003
8
Roth
8
501-0103
7
Some Terminology




Columns of a relation are called attributes or fields
The number of these columns is the arity of the relation
The rows of a relation are called tuples
Each attribute has values taken from a domain, e.g.,
subj has domain string
Theoretically: a relation is a set of tuples; no tuple can
occur more than once
 Real systems may allow duplicates for efficiency or other
reasons – we’ll ignore this for now
 Objects and XML may also have the same content with different
“identity”
8
Describing Relations
 A schema can be represented many ways
 To the DBMS, use data definition language
(DDL) – like programming language type
definitions
 In relational DBs, we use table(attribute:domain)
STUDENT(sid:int, name:string)
Takes(sid:int, exp-grade:char[2], cid:string)
COURSE(cid:string, subj:string, sem:char[3])
Teaches(fid:int, cid:string)
PROFESSOR(fid:int, name:string)
9
More on Domains
 Relational DBMSs have very limited “built-in” domains:
either tables or scalar attributes – int, string, byte
sequence, date, etc.
 Object-oriented, object-relational systems allow
complex, user-defined domains – lists, classes, etc.
 XML systems allow for XML trees (or lists of trees) that
follow certain structural constraints
Database people, when they are discussing design, often
assume domains are evident to the reader:
STUDENT(sid, name)
10
Integrity Constraints
 Domains and schemas are one form of constraint on a
valid data instance
 Other important constraints include:
 Key constraints:
 Subset of fields that uniquely identifies a tuple, and for which no
subset of the key has this property
 May have several candidate keys; one is chosen as the primary key
 A superkey is a subset of fields that includes a key
 Inclusion dependencies (referential integrity constraints):
 A field in one relation may refer to a tuple in another relation by
including its key
 The referenced tuple must exist in the other relation for the
database instance to be valid
11
SQL: Structured Query Language
The standard language for relational data
 Invented by folks at IBM
 Actually not a great language…
 Beat a more elegant competing standard, QUEL,
from Berkeley
Separated into a DML & DDL
DML based on relational algebra & calculus, which
we discuss today & Thursday
12
SQL-92 DDL and Constraints
CREATE TABLE STUDENT
(sid INTEGER,
name CHAR(20),
)
Should (sid,name) be
unique?
How about (sid)?
How about (name)?
CREATE TABLE Takes
(fid INTEGER,
exp-grade CHAR(2),
cid STRING(8),
PRIMARY KEY (fid, cid),
FOREIGN KEY (fid)
REFERENCES STUDENT,
FOREIGN KEY (cid)
REFERENCES COURSE
)
13
Example Data Instance
STUDENT
sid
name
COURSE
Takes
sid
exp-grade
cid
cid
subj
sem
1
Jill
1
A
550-0103
550-0103
DB
F03
2
Qun
1
A
700-1003
700-1003
AI
S03
3
Nitin
3
C
500-0103
501-0103
Arch
F03
Do these operations satisfy the
constraints?
ins PROFESSOR(2, Smith)
ins PROFESSOR(3, Smith)
del COURSE(501-0103, Arch,
F03)
upd Teaches(1, 550-0103) ->
(1, 555-0103)
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0103
2
Saul
2
700-1003
8
Roth
8
501-0103
14
Applications Embed Queries in SQL
<html>
<body>
<!-- hypotheticalEmbeddedSQL:
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
-->
</body>
</html>
C -> machine code sequence -> microprocessor
Java -> bytecode sequence -> JVM
SQL -> relational algebra expression -> query execution engine
15
Relational Algebra
 Relational algebra operations operate on relations and produce
relations (“closure”)
f: Relation -> Relation
 Six basic operations:






Projection
Selection
Union
Difference
Product
(Rename)
f: Relation x Relation -> Relation
A (R)
 (R)
R1 [ R2
R1 – R2
R1 £ R2
A->B (R)
 And some other useful ones:




Join
Semijoin
Intersection
Division
R1 ⋈ R2
R1 ⊲ R2
R1 Å R2
R1 ¥ R2
16
Data Instance for Operators
STUDENT
Takes
sid
sid
exp-grade
name
COURSE
cid
cid
subj
sem
1
Jill
1
A
550-0103
550-0103
DB
F03
2
Qun
1
A
700-1003
700-1003
AI
S03
3
Nitin
3
A
700-1003
501-0103
Arch
F03
4
Marty
3
C
500-0103
4
C
500-0103
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0103
2
Saul
2
700-1003
8
Roth
8
501-0103
17
Key Points
 Projection
 What happens if projection causes duplicate values?
 “True relational” vs. SQL models – set vs. bag semantics
 Selection
 What can go in the predicate?
 Complex predicates (and, or, not) nice but not really necessary
 Product
 What to do when name clashes
 Join from other operators
 Theta and natural joins
 Union compatibility
 Intersection from difference
 Semijoin from Join
18
Examples
 This completes the basic operations of the
relational algebra. We shall soon find out in
what sense this is an adequate set of
operations. Try writing queries for these:




The names of students named “Bob”
The names of students expecting an “A”
The names of students in Amir Roth’s 501 class
The sids and names of students not enrolled
19
Division (Not Very Commonly Used)
 A somewhat messy operation that can be
expressed in terms of the operations we have
already defined
 Used to express queries such as “The fid's of
faculty who have taught all subjects”
 Paraphrased: “The fid’s of professors for which
there does not exist a subject that they haven’t
taught”
20
Division Using Our Existing Operators
 First we build a relation, Allpairs, with all
possible teaching assignments:
fid,subj (PROFESSOR £ subj(COURSE))
 Next, compute relation NotTaught, all (fid,subj)
pairs for which professor fid has not taught subj:
Allpairs - fid,subj(Teaches ⋈ COURSE)
21
Division Using Existing Operators, ctd.
fid(NotTaught) is the set of id's of faculty who have not
taught some course
Finally, our answer is all faculty who have not failed to
teach some course:
fid(PROFESSOR) - fid(NotTaught)
´ fid(PROFESSOR) fid,subj (PROFESSOR £ subj(COURSE)) fid,subj(Teaches ⋈ COURSE)
22
Division: The  Operator
 Much simpler to use the notation R  S
 Schema of R must be a superset of the schema
of S, and the result has schema schema(R)schema(S).
 We could write “professors who have taught all
courses” as
fid ((Teaches ⋈ COURSE)  subj(COURSE))
 What about “Courses that have been taught by
all faculty”?
23
The Big Picture: SQL to Algebra to
Query Plan to Web Page
Web Server /
UI / etc
Query Plan – an
operator tree
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
24
Hint of Future Things: Algebraic
Equivalences and Optimization
 Relational algebra has laws of commutativity,
associativity, etc. that imply certain expressions are
equivalent in semantics
 They may be different in cost of evaluation!
c Ç d(R) ´ c(R) [ d(R)
c (R1 £ R2) ´ R1 ⋈c R2
c Ç d (R) ´ c (d (R))
 Query optimization finds the most efficient
representation to evaluate (or one that’s not bad)
25
Next Time: An Equivalent, But
Very Different, Formalism
 Codd invented a relational calculus that he
proved was equivalent in expressiveness
 Based on a subset of first-order logic – declarative,
without an implicit order of evaluation
 More convenient for describing certain things, and for
certain kinds of manipulations
26