Transcript intro

Principles of Database
Management Systems
CSE 544
Introduction
March 31st, 1999
Staff
Instructor: Alon Levy
Sieg, Room 310, [email protected]
Office hours: wed, 2:30-3:30. Or by email.
TAs: Zack Ives and Rachel Pottinger
Office hours:
 Zack: Mondays at noon (224)
 Rachel: Thursdays at 2:30pm (224)
Mailing list: cse544@cs
Web page: (a lot of stuff already there)
http://www.cs.washington.edu/education/courses/544/99sp/
Course Times
In general, WF, 12-1:20pm (with a 5
minute breather in the middle).
Two special dates:
Monday, April 5th
Monday, April 19th
No classes on last week.
Goals of the Course
Purpose:
Foundations of database management
systems.
Issues in building database systems.
Introduction to current research issues in
databases.
Have fun: databases are not just bunches of
tuples.
Grading
Homeworks: 15%
SQL querying fun
Join implementations
Project: 25%
A query optimization engine for data
integration.
Midterm: 15%
Final: 35%
Participation and intangibles: 10%
Textbook
Database System Implementation,
Ullman, Widom, and Garcia-Molina, to be
published by Prentice-Hall in June;
available from the copy center.
Other Useful Texts
 Database Management Systems (Ramakrishnan)
 Foundations of Databases (Abiteboul, Hull & Vianu)
 Parallel and Distributed DBMS (Ozsu and Valduriez)
 Transaction Processing (Gray and Reuter)
 Database Systems (Silberschatz, Korth and Sudarshan)
 Data and Knowledge based Systems (volumes I, II)
(Ullman)
 Readings in Database Systems (Stonebraker and
Hellerstein)
 Proceedings of SIGMOD, VLDB, PODS conferences.
Prerequisites
Real Prerequisites
Operating systems
Data structures and
algorithms
Distributed systems
Complexity theory
Mathematical Logic
Knowledge
Representation
User interface design
Programming
languages
Artificial Intelligence
(Search)
 Greek, Hebrew,
French
Why Use a DBMS?
All programs manipulate data, so why
use a database?
•
•
•
•
•
•
•
Large amounts of data (Giga’s, Tera’s)
Data is very structured
Persistent data
Valuable data
Performance requirements
Concurrent access to the data
Restricted access to data
Functionality of a DBMS
Persistent storage management
Transaction management
Resiliency: recovery from crashes.
Separation between logical and physical
views of the data.
High level query and data manipulation
language.
Efficient query processing
Interface with programming languages
Persistent Storage
Becomes a hard problem because of the
interaction with the other levels of the
DBMS:
What are we storing?
Efficient indexing
Special issues due to resiliency requirements
Exploit “semantic” knowledge
Issue: interaction with the operating
system. Should we rely on the OS?
Transaction Processing
and Recovery
For efficient use of resources, we want
concurrent access to data.
Systems sometimes crash.
A “real” database guarantees ACID:
Atomicity: all or nothing of a transaction.
Consistency: always leave the DB consistent.
Isolation: every transaction runs as if it’s the
only one in the system.
Durability: if committed, we really mean it.
Do we really want ACID?
Physical vs. Logical Levels
External Schema 1
External Schema 2
Relational Schema
Physical Schema
•Conceptual schema: tables and their
attributes
•Physical schema: files, indexes
hash tables.
•External schema: views of the different
applications, classes of users.
System catalog: The component of the
database that stores meta data.
Conceptual design: a precursor to the
relational schema.
Disk
The Relational Model
Data is organized into tables with attributes.
Rows in the tables are tuples.
Student
Course
Quarter
Charles
CS 444
Fall, 1997
Dan
CS 142
…
…
Winter,
1998
…
The power of simplicity!
Logical Model Issues
What data model should we use?
Relational, object-oriented, object-relational,
deductive database model, semi-structured
How do we design a good schema?
(normal forms, index selection)
Are we really providing an abstraction?
How does this abstraction interact with
the programming language? (the
impedance mismatch).
Querying a Database
Find all the students who have taken
CSE444 in Winter, 1998.
S(tructured) Q(uery) L(anguage)
select E.name
from Enroll E
where E.course=CSE444 and

E.quarter=“Winter, 1998”
SQL also provides update facilities.
SQL: an acquired taste (try datalog first)
Issues in Query Languages
Does it provide the appropriate
functionality?
SQL books get thicker and thicker.
Expressive power of a query language.
Ease of use (query by example)
Declarativity
Provide guidance in writing “good”
queries?
Query Optimization
A query is a declarative specification of
“what” you want.
A query execution plan is an imperative
program to produce the answer.
Query optimization: produce an efficient
query execution plan.
Issues: large search space of plans, cost
estimation, semantic transformations
Real goal: avoid the bad plans.
Database Industry
Relational databases are a great success
of theoretical ideas.
“Big 3” DBMS companies are among the
largest software companies in the world.
IBM (with DB2) and Microsoft (SQL
Server, Microsoft Access) are also
important players.
$20B industry
Moving to warehousing, decision support.
Course (Rough) Outline
The basics:
The relational model
SQL
Views, integrity constraints
Conceptual modeling
datalog (recursive queries)
Physical representation:
Index structures.
Course Outline (cont)
Query execution:
Algorithms for joins, selections, projections.
Query Optimization
Advanced topics:
data integration
data mining
semi-structured data
Transaction processing
The relational data model
Terminology
Attribute names
Product (relation name)
Name
Price
Category
Manufacturer
gizmo
$19.99
gadgets
GizmoWorks
Power gizmo $29.99
gadgets
GizmoWorks
SingleTouch $149.99
photography
Canon
MultiTouch
household
Hitachi
tuples
$203.99
(Arity=4)
Product(name: string, Price: real, category: enum, Manufacturer: string)
More Terminology
Every attribute has an atomic type.
Relation Schema: relation name + attribute names + attribute types
Relation instance: a set of tuples. Only one copy of any tuple! (not)
Database Schema: a set of relation schemas.
Database instance: a relation instance for every relation in the schema.
More on Tuples
Formally, a mapping from attribute names to (correctly typed) values:
name
price
category
manufacturer
gizmo
$19.99
gadgets
GizmoWorks
Sometimes we refer to a tuple by itself: (note order of attributes)
(gizmo, $19.99, gadgets, GizmoWorks)
or
Product (gizmo, $19.99, gadgets, GizmoWorks).
Integrity Constraints
An important functionality of a DBMS is to enable the specification
of integrity constraints and to enforce them.
Knowledge of integrity constraints is also useful for query
optimization.
Examples of constraints:
keys, superkeys
foreign keys
domain constraints, tuple constraints.
Functional dependencies, multivalued dependencies.
Keys
A minimal set of attributes that uniquely the tuple (I.e., there is no
pair of tuples with the same values for the key attributes):
Person: social security number
name
name + address
name + address + age
Perfect keys are often hard to find, but organizations usually
invent something anyway.
Superkey: a set of attributes that contains a key.
A relation may have multiple keys: (but only one primary key)
employee number, social-security number
Foreign Key Constraints
Purchase:
Product:
buyer price product
name
Joe
Jack
gizmo G-sym
E-gizmo G-sym
$20
$20
gizmo
E-gizmo
manufacturer description
great stuff
even better
An attribute of a relation R is must refer to a key of a relation S.
Functional Dependencies
Definition:
If two tuples agree on the attributes
A1, A2, … A n
then they must also agree on the attributes
B1, B2, … B m
Formally:
A1, A2, … A n
B1, B2, … B m
Key of a relation: all the attributes are either on the left or right.
Some Obvious Properties
of FD’s
A1, A2, … A n
B1, B2, … B m
Is equivalent to
A1, A2, … A n
B1
A1, A2, … A n
B2
…
A1, A2, … A n
A1, A2, … A n
A
i
Bm
Always holds.
Splitting rule
and
Combing rule
Comparing Functional
Dependencies
Entailment: a set of functional dependencies S1 entails a set S2 if:
any database that satisfies S1 much also satisfy S2.
Example:
A
B,
B
C entails
A
C
Equivalence: two sets of FD’s are equivalent if each entails the other.
{A
B,
B
C}
is equivalent to {A B, A C, B
Closure: Given a set of attributes A and a set of dependencies C,
we want to find all the other attributes that are functionally
determined by A.
C}
Closure Algorithm
Start with Closure=A.
Until closure doesn’t change do:
if A1, A2, … A n
B is in C, and
A1, A2, … A n
Are all in the closure, and
B is not in Closure
then
add B to closure.
Problems in Designing
Schema
Name
Fred
Fred
Joe
Joe
Problems:
- redundancy
- update anomalies
- deletion anomalies
SSN
123-321-99
123-321-99
909-438-44
909-438-44
Phone Number
(201)
(206)
(908)
(212)
555-1234
572-4312
464-0028
555-4000
Relation Decomposition
Break the relation into two relations:
Name
Fred
Joe
SSN
123-321-99
909-438-44
Name
Phone Number
Fred
Fred
Joe
Joe
(201)
(206)
(908)
(212)
555-1234
572-4312
464-0028
555-4000
Boyce-Codd Normal Form
A simple condition for removing anomalies from relations:
A relation R is in BCNF if and only if:
Whenever there is a nontrivial dependency A1, A2, … A n
for R , it is the case that { A , A , … A }
1
2
n
is a super-key for R.
B
In English (though a bit vague):
Whenever a set of attributes of R is determining another attribute,
should determine all the attributes of R.