Slides - SEAS - University of Pennsylvania

Download Report

Transcript Slides - SEAS - University of Pennsylvania

Introduction
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
January 10, 2005
Welcome to CIS 650,
Database and Information Systems!
Instructor: Zachary Ives, zives@cis
 576 Levine Hall North
 Office hours: Tuesday, 2:30-3:30PM (before colloquium)
Home page: www.seas.upenn.edu/~zives/cis650/
Texts and readings:
 Hellerstein and Stonebraker: Readings in Database Systems,
4th ed.
 (Should be available soon)
 Supplementary papers (will be linked via schedule)
2
Course Format and Grading
Very discussion-oriented; about one topic area per week or two
 Readings in the text & other research papers – summaries/commentary
on papers (20%)
“Midterm report” (25%)
 You’ll take one of the topics we’ve discussed and write a summary and
synthesis paper
 Graded for organization, clarity, grammar, etc. as well as content
Project (50%) -- may choose to work in teams:




Implementation
Experimentation / validation
Project report (should be in the style of a research paper)
Brief (~15-minute) presentation
Participation, discussion, intangibles (5%)
At the end, you should be equipped to do research in this field, or to
take ideas from databases and apply them to your field
3
So What Is This Course About?
Not how to build an Oracle-driven Web site…
… nor even how to build Oracle…
4
What Is Unique about Data
Management?
 It’s been said that databases and data
management focus on scalability to huge
volumes of data
 What is it that makes this possible – and what
makes the work interesting if NOT at huge
scale?
 Why are data management techniques useful in
situations where scale isn’t the bottleneck?
5
The Key Principle: Data Independence
 Most methods of programming don’t separate
the logical and physical representations of data
 The data structures, access methods, etc. are all
given via interfaces!
 The relational data model was the first model
for data that is independent of its data
structures and implementation
6
What Is Data Independence?
 Codd points out that previous methods had:
 Order dependence
 Index dependence
 Access path dependence
 Still true in today’s Java/C#: what is the
drawback?
 What might you be able to do in removing
those?
7
The Relational Data Model
More than just tables!
 True relations: sets of tuples
 The only data representation a user/programmer “sees”
 Explicit encoding of everything in values
Additional integrity constraints
 Key constraints, functional dependencies, …
General and universal means of encoding everything!
 (Semantics are pushed to queries)
A secondary concept: views
 Define virtual, derived relations that are always “live”
 A way of encapsulating, abstracting data
8
Constraints and Normalization
 Fundamental idea: we don’t want to build semantics
into the data model, but we want to be able to encode
certain constraints
 Functional dependencies, key constraints, foreign-key
constraints, multivalued dependencies, join dependencies, etc.
 Allows limited data validation, plus opportunities for optimization
 The theory of normalization (see CSE 330, CIS 550)
makes use of known constraints
 Idea: eliminate redundancy, in order to maintain consistency in
the presence of updates
 (Note that there’s no reason for normalization of data in views!)
 Ergo, XML???
9
Relational Completeness
(Plus Extensions): Declarativity
What is special about relational query languages that
makes them amenable to scalability?
 Limited expressiveness – particularly when we consider
conjunctive queries (even with recursion)




Guaranteed polytime execution in size of data
Can reason about containment, invert them, etc.
“Magic sets”
(What about XQuery’s Turing-completeness???)
 Equivalence between relational calculus and algebra
 Calculus  fully declarative, basis of query languages
 Algebra  imperative but polytime, basis of runtime systems
 Predictability of operations  cost models
 Ability to supplement data with auxiliary structures for
performance
10
Concurrency and Reliability
(Generally requires full control)
 Another key element of databases – ACID properties
 Atomicity, Consistency, Isolation, Durability
 Transaction : an atomic sequence of database actions
(read/write) on data items (e.g. calendar entry)
 Recoverability via a log: keeping track of all actions
carried out by the database
 How do distributed systems, Web services, serviceoriented architectures, and the like affect these
properties?
11
Other Data Models
 Concepts from the relational data model have
been adapted to form object-oriented data
models (with classes and subclasses), XML
models, etc.
 But doesn’t this result in some loss of logical-physical
independence?
 GMAP and answering queries using views?
12
What Is a Data Management System?
 Of course, there are traditional databases
 The focus of most work in the past 25 years
 “Tight loops” due to locally controlled data
 Indexing, transactions, concurrency, recovery,
optimization
 But…
13
80% of the World’s Data is Not in
Databases!
Examples:
 Scientific data
(large images, complex programs that analyze the data)
 Personal data
 WWW and email
(some of it is stored in something resembling a DBMS)
 Network traffic logs
 Sensor data
 Are there benefits to declarative techniques and data
independence in tackling these issues?
 XML is a great way to make this data available
 Also need to deal with data we don’t control and can’t guarantee
consistency over
14
An Example of Data Management with
Heterogeneity: Data Integration
“Mediated Schema”
XML
A layer above heterogeneous sources, to combine them
under a unified logical abstraction
 Some of these are databases over which we have no control
 Some must be accessed in special ways
 Data integration system translates queries over mediated
schema to the languages of the sources; converts answers to
mediated schema
15
Other Interesting Points
Data streams and sensor data
How do we process infinite amounts of data?
Peer-to-peer architectures
What’s the best way of finding data here?
Personal information management
Can we use integration-style concepts and a bit of AI to manage
associations between our data?
Web search
What’s the back-end behind Google?
Semantic Web
How do we semantically interrelate data to build a better Web?
16
Layers of a Typical
Data Management System
API/GUI
(Simplification!)
Query
Optimizer
Stats
Physical plan
Exec. Engine
Catalog
Schemas Data/etc
Logging, recovery
Requests
Access Methods
Data/etc
Requests
Buffer Mgr
Pages
Pages
Physical retrieval
Data
Red = logical
Blue = physical
Requests
Source
17
Query Answering in a Data
Management System
 Based on declarative query languages
 Based on restricted first-order logic expressions over relations
 Not procedural – defines constraints on the output
 Converted into a query plan that exploits properties; run over the
data by the query optimizer and query execution engine
 Data may be local or remote
 Data may be heterogeneous or homogeneous
 Data sources may have different interfaces, access methods, etc.
 Most common query languages:
 SQL (based on tuple relational calculus)
 Datalog (based on domain relational calculus, plus fixpoint)
 XQuery (functional language; has an XML calculus core)
18
Processing the Query
Web Server /
UI / etc
Hash
STUDENT
Optimizer
Takes
by cid
Execution
Engine
Merge
COURSE
by cid
Storage
Subsystem
SELECT *
FROM STUDENT, Takes, COURSE
WHERE STUDENT.sid = Takes.sID
AND Takes.cID = cid
19
DBMSs in the Real World
 Big, mature relational databases
 IBM, Oracle, Microsoft
 “Middleware” above these
 SAP, PeopleSoft, dozens of special-purpose apps
 “Application servers”
 Integration and warehousing systems
 Current trends:
 Web services; XML everywhere
 Smarter, self-tuning systems
 Stream systems
20
Our Agenda this Semester
 Reading the canonical papers in the data
management literature
 Some are very systems-y
 Some are very experimental
 Some are highly algorithmic, complexity-oriented
 Gaining an understanding of the principles of
building systems to handle declarative queries
over large volumes of data
21
For Next Time
 Skim Codd if you haven’t already
 Read the overview papers of the two first
database systems:
 Astrahan et al., pp. 117 Wong et al. (skip Section 2; focus on pp. 200-)
 Write a summary of your assigned paper and
email it to me at zives@cis
 Key question: how well did this system mesh with
Codd’s relational model? (You may need to skim
through other aspects of your assigned paper to help
answer that question)
22