Slides - SEAS - University of Pennsylvania
Download
Report
Transcript Slides - SEAS - University of Pennsylvania
Introduction
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
January 10, 2005
Welcome to CIS 650,
Database and Information Systems!
Instructor: Zachary Ives, zives@cis
576 Levine Hall North
Office hours: Tuesday, 2:30-3:30PM (before colloquium)
Home page: www.seas.upenn.edu/~zives/cis650/
Texts and readings:
Hellerstein and Stonebraker: Readings in Database Systems,
4th ed.
(Should be available soon)
Supplementary papers (will be linked via schedule)
2
Course Format and Grading
Very discussion-oriented; about one topic area per week or two
Readings in the text & other research papers – summaries/commentary
on papers (20%)
“Midterm report” (25%)
You’ll take one of the topics we’ve discussed and write a summary and
synthesis paper
Graded for organization, clarity, grammar, etc. as well as content
Project (50%) -- may choose to work in teams:
Implementation
Experimentation / validation
Project report (should be in the style of a research paper)
Brief (~15-minute) presentation
Participation, discussion, intangibles (5%)
At the end, you should be equipped to do research in this field, or to
take ideas from databases and apply them to your field
3
So What Is This Course About?
Not how to build an Oracle-driven Web site…
… nor even how to build Oracle…
4
What Is Unique about Data
Management?
It’s been said that databases and data
management focus on scalability to huge
volumes of data
What is it that makes this possible – and what
makes the work interesting if NOT at huge
scale?
Why are data management techniques useful in
situations where scale isn’t the bottleneck?
5
The Key Principle: Data Independence
Most methods of programming don’t separate
the logical and physical representations of data
The data structures, access methods, etc. are all
given via interfaces!
The relational data model was the first model
for data that is independent of its data
structures and implementation
6
What Is Data Independence?
Codd points out that previous methods had:
Order dependence
Index dependence
Access path dependence
Still true in today’s Java/C#: what is the
drawback?
What might you be able to do in removing
those?
7
The Relational Data Model
More than just tables!
True relations: sets of tuples
The only data representation a user/programmer “sees”
Explicit encoding of everything in values
Additional integrity constraints
Key constraints, functional dependencies, …
General and universal means of encoding everything!
(Semantics are pushed to queries)
A secondary concept: views
Define virtual, derived relations that are always “live”
A way of encapsulating, abstracting data
8
Constraints and Normalization
Fundamental idea: we don’t want to build semantics
into the data model, but we want to be able to encode
certain constraints
Functional dependencies, key constraints, foreign-key
constraints, multivalued dependencies, join dependencies, etc.
Allows limited data validation, plus opportunities for optimization
The theory of normalization (see CSE 330, CIS 550)
makes use of known constraints
Idea: eliminate redundancy, in order to maintain consistency in
the presence of updates
(Note that there’s no reason for normalization of data in views!)
Ergo, XML???
9
Relational Completeness
(Plus Extensions): Declarativity
What is special about relational query languages that
makes them amenable to scalability?
Limited expressiveness – particularly when we consider
conjunctive queries (even with recursion)
Guaranteed polytime execution in size of data
Can reason about containment, invert them, etc.
“Magic sets”
(What about XQuery’s Turing-completeness???)
Equivalence between relational calculus and algebra
Calculus fully declarative, basis of query languages
Algebra imperative but polytime, basis of runtime systems
Predictability of operations cost models
Ability to supplement data with auxiliary structures for
performance
10
Concurrency and Reliability
(Generally requires full control)
Another key element of databases – ACID properties
Atomicity, Consistency, Isolation, Durability
Transaction : an atomic sequence of database actions
(read/write) on data items (e.g. calendar entry)
Recoverability via a log: keeping track of all actions
carried out by the database
How do distributed systems, Web services, serviceoriented architectures, and the like affect these
properties?
11
Other Data Models
Concepts from the relational data model have
been adapted to form object-oriented data
models (with classes and subclasses), XML
models, etc.
But doesn’t this result in some loss of logical-physical
independence?
GMAP and answering queries using views?
12
What Is a Data Management System?
Of course, there are traditional databases
The focus of most work in the past 25 years
“Tight loops” due to locally controlled data
Indexing, transactions, concurrency, recovery,
optimization
But…
13
80% of the World’s Data is Not in
Databases!
Examples:
Scientific data
(large images, complex programs that analyze the data)
Personal data
WWW and email
(some of it is stored in something resembling a DBMS)
Network traffic logs
Sensor data
Are there benefits to declarative techniques and data
independence in tackling these issues?
XML is a great way to make this data available
Also need to deal with data we don’t control and can’t guarantee
consistency over
14
An Example of Data Management with
Heterogeneity: Data Integration
“Mediated Schema”
XML
A layer above heterogeneous sources, to combine them
under a unified logical abstraction
Some of these are databases over which we have no control
Some must be accessed in special ways
Data integration system translates queries over mediated
schema to the languages of the sources; converts answers to
mediated schema
15
Other Interesting Points
Data streams and sensor data
How do we process infinite amounts of data?
Peer-to-peer architectures
What’s the best way of finding data here?
Personal information management
Can we use integration-style concepts and a bit of AI to manage
associations between our data?
Web search
What’s the back-end behind Google?
Semantic Web
How do we semantically interrelate data to build a better Web?
16
Layers of a Typical
Data Management System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Catalog
Schemas Data/etc
Logging, recovery
Requests
Access Methods
Data/etc
Requests
Buffer Mgr
Pages
Pages
Physical retrieval
Data
Red = logical
Blue = physical
Requests
Source
17
Query Answering in a Data
Management System
Based on declarative query languages
Based on restricted first-order logic expressions over relations
Not procedural – defines constraints on the output
Converted into a query plan that exploits properties; run over the
data by the query optimizer and query execution engine
Data may be local or remote
Data may be heterogeneous or homogeneous
Data sources may have different interfaces, access methods, etc.
Most common query languages:
SQL (based on tuple relational calculus)
Datalog (based on domain relational calculus, plus fixpoint)
XQuery (functional language; has an XML calculus core)
18
Processing the Query
Web Server /
UI / etc
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
19
DBMSs in the Real World
Big, mature relational databases
IBM, Oracle, Microsoft
“Middleware” above these
SAP, PeopleSoft, dozens of special-purpose apps
“Application servers”
Integration and warehousing systems
Current trends:
Web services; XML everywhere
Smarter, self-tuning systems
Stream systems
20
Our Agenda this Semester
Reading the canonical papers in the data
management literature
Some are very systems-y
Some are very experimental
Some are highly algorithmic, complexity-oriented
Gaining an understanding of the principles of
building systems to handle declarative queries
over large volumes of data
21
For Next Time
Skim Codd if you haven’t already
Read the overview papers of the two first
database systems:
Astrahan et al., pp. 117 Wong et al. (skip Section 2; focus on pp. 200-)
Write a summary of your assigned paper and
email it to me at zives@cis
Key question: how well did this system mesh with
Codd’s relational model? (You may need to skim
through other aspects of your assigned paper to help
answer that question)
22