9/5 - SEAS - University of Pennsylvania
Download
Report
Transcript 9/5 - SEAS - University of Pennsylvania
Introduction
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
September 5, 2007
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Welcome to CIS 550,
Database and Information Systems!
Instructor: Zachary Ives, [email protected]
576 Levine Hall North
Office hours: Wednesdays, 3:30-4:30PM (before class)
TA: Mengmeng Liu, [email protected]
Office hours: Tuesdays, 3:00-4:00PM
Newsgroup: upenn.cis.cis550
Home page: www.seas.upenn.edu/~zives/cis550/
Texts and readings:
Ramakrishnan & Gerke, Database Systems, 3rd ed.
Supplementary papers (to be handed out in class)
2
Course Format and Grading
Roughly one major topic area per week to two weeks
Readings in the text & current or influential research papers
Occasionally, summaries/commentary on papers (5%)
Homework assignment for each topic area and mini-projects
(30%)
One midterm (10%), one final exam (20%)
Project (30%) – groups of 3
Build an apartment-finding mashup
(Or propose your own idea)
General participation, discussion, intangibles (5%)
3
What’s this Course About?
Most CS courses concentrate on code – our
interest is managing and representing data
Warning: this course doesn’t focus on teaching SQL or
how to be an Oracle DBA (though it will get you started)
It’s not an “application” course!
… So what in the world are we studying for 14
weeks???
4
Data – What Do We Need to Do with It?
5
Ways Information Is Represented
6
Example: An Encyclopedia Entry
(www.wikipedia.com)
A database is a collection of data elements (facts) stored in a computer in a
systematic way, such that a computer program can consult it to answer questions.
The answers to those questions become information that can be used to make
decisions that may not be made with the data elements alone. The computer
program used to manage and query a database is known as a database management
system (DBMS). The properties of database systems are studied in information
science.
At the core of the concept of a database is the idea of a collection of generic facts,
or pieces of knowledge. Facts may be structured in a number of ways, known as
database models. For instance, one database model is to associate each fact with a
record representing an entity (such as a person), and to arrange these entities into
trees or hierarchies – the hierarchical database model. Another model is to arrange
facts into sets of values which satisfy logical predicates – the relational database
model.
The first database management systems were developed in the 1960s. A pioneer in
the field was Charles Bachman. Two key data models arose at this time: the
network model (developed by CODASYL) followed by the hierarchical model (as
implemented in IMS). These were later usurped by the relational model, which was
contemporary with the so-called flat model designed for very small tasks.
7
Example: To-Do List
Buy school supplies
Go to orientation
Exercise
Buy Philly postcards
Take photo with Ben F.
due 9/4
on 9/4
every M/W/F
How does this differ from the plain text model?
What might you do with it that you couldn’t?
8
Example: Your PDA/Cell Phone
Calendar
Event Day
Lunch 10/24
Advice 10/25
Biking 10/26
Dinner 10/26
When Who
1pm Zack
9am Dr. Smith
9am Jane
6PM Jane
Where
Cavanaugh’s
599 Levine
Pottruck
Food Court
Contacts
Who
Zack
Dr. Smith
Jane
Phone
6-2789
6-1234
543-2198
Office
576 Levine N
599 Levine
2220 Walnut St.
Email
zives
drsmith
jane
9
What If We Want to Include Contact
Info on Our Calendar?
Do we also want to keep e-mail addresses, telephone
numbers etc.?
Should we expand the number of columns in our table:
Event When Who-name
Lunch 1pm
Zack
…
Who-email
zives
Who-tel …. Where
6-2789 …. Cav…
What is the trade-off in terms of entering data?
10
“Link” Calendar with Contacts?
Why can’t we “link” calendar entries with contact
info, and show the results of the two?
The link could be based on something as simple as
the person's name
(What’s the danger here? What else might work
better?)
This brings up an issue – how to “follow links”
If we were to do this in C, how might it be done? In
Java?
11
Another Kind of Link: Classes and
Subclasses
Person has attributes:
ssn
PennID
set of user IDs
given name
family name
…
Student IS A person who:
takes courses
is given grades
learns an academic topic
listens to lectures in class, OR over the Web, OR on videotape
This is yet another kind of information
How have you previously seen such relationships encoded?
12
Data Representation and Modeling
All of the data we’ve seen have an implicit data model
The data model includes some basic assumptions about what’s an
“item” of data, how to interpret it, and so on
The relational data model was the first model that is
independent of its data structures and implementation
A theory of normalization guides you in designing relations
Concepts have been adapted to form object-oriented data
models, XML, etc.
There are “sibling” fields to databases with semantic models:
natural language (meanings of bags or lists of words)
information retrieval (associations between words and documents)
ontologies (inferences about relationships between classes, and
classes and subclasses of relationships)
We’ll mostly focus on the relational model and its descendants
13
The DBMS Provides an Interface
over the Database
A database (DB) is a large, integrated collection of data
Generally is cohesive in “some” way
A DB models a real-world organization or unit
A database management system (DBMS) is a software
package designed to store and manage databases
Reliable storage & recovery of 100s of GB
Querying/updating interface and API
(for applications and Web pages)
Support for many concurrent users
Why do we need a DBMS, instead of coding in Java?
14
DBMS Benefit #1: Generality and
Declarativity
Don’t require the programmer or user to know
details like indices, sort orders, machine
speeds, disk speeds, concurrent users, etc.
Instead, the programmer/user programs with a
logical model in mind
The DBMS “makes it happen” based on an
understanding of relative costs of different
methods
15
Benefit #2: Efficiency and Scale
Size of personal address book is probably less
than 100 entries, but there are things we'd like
to do quickly and efficiently:
“Give me all appointments on 10/28”
“When am I next meeting Jim?”
“Program” these as quickly as possible (and
make them resilient to data format changes)
Scale to a corporate calendar with hundreds of
thousands of entries
16
Benefit #3: Management of
Concurrency and Reliability
Suppose other people are allowed access to
your calendar and are allowed to modify it?
How do we stop two people changing the file at
the same time and leaving it in a physical (or
logical) mess?
Suppose the system crashes while we are
changing the calendar. How do we recover our
work?
This requires a basic concept…
17
Transactions
Key concept for concurrency is that of a
transaction: an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry).
Key concept for recoverability is that of a log:
keeping track of all actions carried out by the db.
18
Anatomy of a Typical DBMS
(Simplification!)
API/GUI
Query
Optimizer
Stats
Physical plan
Exec. Engine
Catalog
Schemas Data/etc
Logging, Recovery,
Requests
Index/file/rec Mgr
Data/etc
Requests
Buffer Mgr
Pages
Pages
Storage Mgr
Data
Red = logical
Blue = physical
Requests
Storage
19
The Database Abstraction
Provided by the DBMS
We think of databases at two levels:
Logical structure:
What users/programmers see – program or query interface
Physical structure:
Organization on disk, indices, etc.
The logical level is further split into:
Overall database design (conceptual; seen by the DB
designer)
Views that various users get to see
20
The Three-level Architecture for
Databases
View 1
View 2
Schema
…
View N
Logical,
Conceptual Level
Physical Level
(file organization,
indexing)
21
Data Independence
A user of a relational database system should
be able to use the database without knowing
about how the precisely how data is stored, e.g.
SELECT When, Where
FROM Calendar
WHERE Who = “Jane"
After all, you don't worry IEEE floating-point
when you do division in a Java program or with
a calculator
22
More on Data Independence
Logical data independence
Protects the user from changes in the logical
structure of the data:
could reorganize the calendar “schema” without changing
how we query it
Physical data independence
Protects the user from changes in the physical
structure of data:
could add an index on who (or sort by when) without
changing how the user would write the query, but the query
would execute faster (query optimization)
23
Presentation Layer (4th Tier):
Data-Driven Web Sites
HTML
view
Processing
“Data driven web sites” also add an HTML
“presentation” layer on top of what we’ve seen
Or they use XML plus “style sheets” to get the same
effect
24
An Issue: 80% of the World’s
Data is Not in a DB!
Examples:
Scientific data
(large images, complex programs that analyze the data)
Personal data
WWW and email
(some of it is stored in something resembling a DBMS)
Data management is expanding to tackle these problems
Flexibility – data management imposes many constraints to
make problems solvable
Must deal with entities outside our control
In this course, we’ll start by focusing on databases, but
eventually look “outside the box” at the Web and at
gluing together data from many places
25
Combining Databases with Mediators
(a kind of middleware)
“Mediated Schema”
XML
A layer above the three-tiered architecture, to combine multiple
databases/sources on the Web
Some of these are databases over which we have no control
Some must be accessed in special ways
We generally need to think about how to translate between different
database formats
This problem of data integration is a particular focus here at Penn (and
several other top departments)
26
How Does One Build a Database?
Start with a conceptual model
Design & implement schema
Write applications using DBMS and other tools
Many ways of doing this where the hard problems are
taken care of by other people (DBMS, API writers,
library authors, web server, etc.)
Common applications include PHP/JSP/servletdriven web sites
The DBMS takes care of query optimization and
execution
27
Conceptual Design
fid
PROFESSOR
name
Teaches
STUDENT
sid
name
COURSE
Takes
cid
name
semester
28
Designing a Schema (Set of Relations)
STUDENT
COURSE
Takes
sid
name
sid
cid
cid
name
sem
1
Jill
1
550-0105
550-0103
DB
F05
2
Bo
1
700-1005
700-1003
AI
S05
3
Maya
3
500-0105
501-0103
Arch
F05
Convert to tables +
constraints
Then need to do
“physical” design: the
layout on disk, indices,
etc.
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0105
2
Saul
2
700-1005
8
Roth
8
501-0105
29
Applications Use Queries in SQL
Structured Query Language, often embedded (e.g., in servlets, JSP)
Based on restricted first-order logic expressions over relations
Not procedural – defines constraints on the output
Converted into a query plan that exploits properties; run over the
data by the query optimizer and query execution engine
<html>
<body>
<!-- hypotheticalEmbeddedSQL:
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
-->
</body>
</html>
30
Processing the Query
Web Server /
UI / etc
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
31
DBMSs in the Real World
A huge industry for 20% of the world’s data!
Big, mature relational databases
IBM DB2, Oracle, Microsoft SQL Server
Adding advanced features, including “native XML” support
“Middleware” above these systems
SAP, Siebel, PeopleSoft, dozens of special-purpose apps
“Application servers,” aka TP Monitors
WebSphere, WebLogic, Tomcat, …
Support transactional applications in EJB, .NET, etc.
Integration and warehousing systems
BEA AquaLogic, DB2 Information Integrator
Current trends:
Web services; XML everywhere
Smarter, self-tuning systems
32
So What about Database Research?
Not focusing on the problems of Oracle…
Understanding what’s possible to do with XML
Better query processing
Better languages for meta-info (e.g., constraints)
Better interaction with the non-database world
Data streams and sensors
Peer-to-peer architectures and data cooperatives
Integrating data from different formats
Lots of theory and systems-building
You’ll see familiar concepts in this course from operating
systems and from complexity theory/logic
… And from programming languages, AI planning, …
33
In this Course...
Study relational databases, their design, how to
query, what forms of indices to use.
Beyond relational algebra: a logical model of
data (Datalog), recursion
XML and semi-structured data models
Understanding DB internals
How DBs are built
Performance implications
Integrating and mediating between databases
(a huge problem today)
34
Your First Assignment…
Read the Codd paper
Write a one-paragraph summary of the key
ideas in this paper and post to the newsgroup:
upenn.cis.cis550
35