9/4 - SEAS - University of Pennsylvania

Download Report

Transcript 9/4 - SEAS - University of Pennsylvania

Introduction
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
September 4, 2003
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Welcome!
To CIS 550, officially “Database & Information
Systems” in the course catalog…
… A tour of the data management field…
… A question for you: what does this really
mean? What is this course (and the field)
about?
2
What The Course Is Not…
(A Few Warnings)
This is not a course on Oracle or SQL
 It may not directly impact your marketability
 It’s an investigation into the principles of data management –
which will improve your understanding
This course will not be a cakewalk!
 The data management field is broad, and we’ll touch on many
subjects at a rapid pace
 8 homeworks, paper summaries, term project, midterm, final, …
This course is not suitable for people with a limited
programming background; need skills in:




Algorithms & data structures
Logic
Programming languages
Perhaps even a little complexity theory!
3
What The Course Will Do
Most CS courses concentrate on code – now you’ll
understand data management and representation
 In the end, it’s all about the data!
Background in most of the important areas:




Data design, modeling
Understanding of DB system internals, performance
Understanding of data-driven systems (e.g., web sites)
An understanding of the complexities of integrating data –
perhaps the biggest CS problem today
 Understanding of what research topics in data management are
4
Administrivia
Instructor: Zachary Ives, zives@cis
 Levine 611 (until end of Sept.; then 5th floor GRW)
 Office hours: Tuesday, 3:00-4:00PM (after class)
TA: Dinkar Gupta, [email protected]
 Office hours on Monday; time 3:00-4:00PM (office TBA)
Newsgroup: upenn.cis.cis550
Home page: www.seas.upenn.edu/~zives/cis550/
Text(s):
 Ramakrishnan & Gerke, Database Systems, 3rd ed.
 Supplementary papers (to be handed out in class)
 Other books may be useful (see web page)
5
Course Format and Grading
 We’ll cover roughly one major topic area per week to
two weeks
 Readings in the text & research papers
 Occasionally, summaries/commentary on papers (5%)
 Homework assignment for each topic area (30%)
 One midterm (10%), one final exam (20%)
 Project (30%) – groups of 3-4:
 Build a blogging system on top of a database, or
 Build a P2P data sharing system for XML data
 (Or propose your own idea)
 General participation, discussion, intangibles (5%)
6
Diving In…





What is a database and a DBMS?
Why do we need a DBMS?
Database and data management architectures
Process of building a DB
DBMS components
7
What’s Data Management?
In the 1960’s and early 70’s:
 file formats, traversals, indexes
In the 1980’s (mid- to late-70’s in research) and 90’s:
 Separation of logical + physical data representations
 Well-defined general purpose, declarative data manipulation
language (DML) and data definition language (DDL)
 Reliable, consistent storage + concurrency control
 Sophisticated system that takes DDL statements and knowledge
of physical data representations and produces answers in
“optimized” way
Today:
 All that plus managing and manipulating data in many models
and representations
8
What is a DBMS?
 A database (DB) is a large, integrated collection
of data
 Generally is cohesive in “some” way
 A DB models a real-world organization or unit
 A database management system (DBMS) is a
software package designed to store and
manage databases
 Reliable storage & recovery of 100s of GB
 Querying/updating interface and API
 Support for many concurrent users
9
Connection to Other
Areas of CS…
 Programming languages and software
engineering (obviously)
 Algorithms (obviously)
 Logic, discrete math, and theory of computation
 Systems: concurrency, operating systems, file
organization and networks, peer-to-peer, …
 Web (and Semantic Web), information retrieval,
digital libraries, software agents, …
 AI planning and machine learning
10
But 80% of the World’s
Data is Not in a DB!
Examples:
 scientific data (large images, complex programs that analyze the
data)
 personal data
 WWW
Data management is expanding to tackle these problems
 Flexibility – data management imposes many
constraints to make problems solvable
 Must deal with entities outside our control
In this course, we’ll start by focusing on databases, but
eventually look “outside the box”
11
Why Not “Program up” Databases
As Needed?
For simple (single-concept) and small
databases this is often the best solution
 Flat files and grep get us a long way
 But there are limits:





The structure is complicated (more than a simple table)
The database gets large (e.g., bigger than RAM)
Many people want to use it simultaneously
Need for reliable recovery from crashes
Updates generally require complete rewrite of file
12
Example: Palm-Style Calendar
We might start by building a file with the
following structure:
Event Day
When Who
Where
Lunch 10/24
CS123 10/25
Biking 10/26
Dinner 10/26
1pm
9am
9am
6PM
Joe’s Diner
Morris234
Jane’s house
Café Le Boeuf
Rick
Dr. Egghead
Jane
Jane
This text file is easy to deal with. So there's no
need for a DBMS! Right…?
13
Problem 1: Data Organization
 Consider the all-important who field. Do we also
want to keep e-mail addresses, telephone
numbers etc.?
 Expand our file to look like:
Event When Who-name
Who-email
…
Who-tel …. Where
 Now we are keeping our address book in our
calendar and doing so redundantly
14
“Link” Calendar with
Address Book?
 Two conceptual “entities” – contact information
and calendar – with a relationship between
them, linking people in the calendar to their
contact information
 This link could be based on something as
simple as the person's name
15
Problem 2: Efficiency
 Size of personal address book is probably less
than one hundred entries, but there are things
we'd like to do quickly and efficiently
 “Give me all appointments on 10/28”
 “When am I next meeting Jim?”
 “Program” these as quickly as possible (and
make them resilient to data format changes)
 Have these programs executed efficiently
 What would happen if you were using a
corporate calendar with hundreds of thousands
of entries?
16
Problem 3: Concurrency and
Reliability
 Suppose other people are allowed access to
your calendar and are allowed to modify it?
How do we stop two people changing the file at
the same time and leaving it in a physical (or
logical) mess?
 Suppose the system crashes while we are
changing the calendar. How do we recover our
work?
17
Transactions
 Key concept for concurrency is that of a
transaction : an atomic sequence of database
actions (read/write) on data items (e.g. calendar
entry).
 Key concept for recoverability is that of a log :
keeping track of all actions carried out by the db.
 Sounds like operating systems all over again!
18
Database Architecture:The
Traditional View
It is common to describe databases in two ways:
 Logical structure:
 What users see. The program or query language interface.
 Physical structure:
 How files are organized. What indexing mechanisms are
used.
The logical level is further split into two
components:
 Overall database design (conceptual; seen by the DB
designer)
 Views that various users get to see
19
Three-level Architecture
View 1
View 2
Schema
…
View N
Conceptual Level
Physical Level
(file organization,
indexing)
20
Data Independence
A user of a relational database system should
be able to use query the database without
knowing about how the precisely how data is
stored, e.g.
SELECT When, Where
FROM Calendar
WHERE Who = "Bill"
After all, you don't worry much how numbers are
stored when you program some arithmetic or
use a computer-based calculator
21
More on Data Independence
Logical data independence
 Protects the user from changes in the logical
structure of the data: could reorganize the calendar
“schema” without changing how I query it
Physical data independence
 Protects the user from changes in the physical
structure of data: could add an index on who (or sort
by when) without changing how the user would write
the query, but the query would execute faster (query
optimization)
22
That's the Traditional View, But ...
 Three-level architecture is not always
achievable: when databases get big, queries
must be carefully written to achieve efficiency
 Also, may need a 4th tier… Sometimes this is
called middleware
23
Combining Databases with Mediators
(a kind of middleware)
“Mediated Schema”
XML
May need to add further layers to combine
multiple databases/sources on the Web
 Some of these are databases over
which we have no control
 Some must be accessed in special ways
24
Data-Driven Web Sites:
Consumers of Database Output
HTML
view
Processing
 “Data driven web sites” also add an HTML
“presentation” layer on top of what we’ve
seen
25
The Process of Building a Database
 Start with a conceptual model
 Design & implement schema
 Write applications using DBMS and other tools
 Many ways of doing this where the hard problems are
taken care of by other people (DBMS, API writers,
library authors, web server, etc.)
26
Conceptual Design
fid
PROFESSOR
name
Teaches
STUDENT
sid
name
COURSE
Takes
cid
name
semester
27
Designing a Schema (Set of Relations)
STUDENT
COURSE
Takes
sid
name
sid
cid
cid
name
sem
1
Jill
1
550-0103
550-0103
DB
F03
2
Qun
1
700-1003
700-1003
AI
S03
3
Nitin
3
500-0103
501-0103
Arch
F03
 Convert to tables +
constraints
 Then need to do “physical”
design: the layout on disk,
indices, etc.
PROFESSOR
Teaches
fid
name
fid
cid
1
Ives
1
550-0103
2
Saul
2
700-1003
8
Roth
8
501-0103
28
Applications Use Queries in SQL
 Structured Query Language
 Based on restricted first-order logic expressions over relations
 Not procedural – defines constraints on the output
 Converted into a query plan that exploits properties; run over the
data by the query optimizer and query execution engine
<html>
<body>
<!-- hypotheticalEmbeddedSQL:
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
-->
</body>
</html>
29
Processing the Query
Web Server /
UI / etc
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
30
DBMS in a Bit More Detail
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Catalog
Schemas Data/etc
Logging, recovery
Requests
Index/file/rec Mgr
Data/etc
Requests
Buffer Mgr
Pages
Pages
Storage Mgr
Data
Requests
Storage
31
DBMSs in the Wild
A huge industry for 20% of the world’s data!
 Big, mature relational databases
 IBM, Oracle, Microsoft
 “Middleware” above these
 SAP, PeopleSoft, dozens of special-purpose apps
 “Application servers”
 Integration and warehousing systems
 Trends:
 More integration; web services; XML everywhere
 Smarter, self-tuning systems
32
The Research World
 Conventional databases aren’t interesting!
 Understanding what’s possible to do with XML
 Better query processing
 Better languages for meta-info (e.g., constraints)




Data streams
Peer-to-peer
Integrating data from different formats
Lots of theory and systems-building
33
In this Course...
 Study relational databases, their design, how to
query, what forms of indices to use.
 Beyond relational algebra: a logical model of
data (Datalog), recursion
 XML and semi-structured data models
 Understanding DB internals
 How DBs are built
 Performance implications
 Integrating and mediating between databases
(a huge problem today)
34
Questions?
Dilbert, 8/9/2003 (via online archive)
35