Transcript Document
Lecture 1: Overview of CSCI 585
Prof. Shahram Ghandeharizadeh
Director of USC Database Lab (http://dblab.usc.edu)
Computer Science Department
University of Southern California
Logistics
Collection of technical papers:
Pre-req for the course:
CSCI 485: Introduction to File and Database
Management, and
Knowledge C++ programming language.
Extensive use of Blackboard for homework
and project submissions. Make sure to have
access to:
ACM/IEEE/Springer digital libraries.
URLs work from USC machines.
http://den.usc.edu
Power-point of presentations also available
from http://dblab.usc.edu
Pre-Req
585 assumes you know the following:
Transactions and their ACID properties.
Concurrency control protocols such as locking
and time-stamp based protocols.
Crash recovery techniques such as logging and
shadow paging.
Physical characteristics of magnetic disks.
SQL
Relational algebra operators
ER data modeling
Alternative normal forms.
Visit http://dblab.usc.edu/csci485 for an
overview of this material.
Instructor Details
Dr. Shahram Ghandeharizadeh
Office: SAL 208
E-mail: [email protected]
Phone: 213-740-4781
Office Hours:
Tuesday: 12:30 to 2 pm
Thursday: 4:30 to 5:30 pm
Class URL: http://dblab.usc.edu/csci585
TA
Shahin Shayandeh
Office: SAL 200C
E-mail: [email protected]
Office Hours:
Mondays: 3:30 to 5 pm
Thursday: 12:30 to 2 pm
Outline
Motivation for DBMS
An outline for the course material
Grading: Assignments and projects
Database Management
Systems (DBMS)
Used almost on a daily basis for either
individual or business use.
Relational database vendors were one
of the fastest growing sectors during
the .COM boom!
DATABASE & DBMS
Database: An integrated collection of
data, usually stored on secondary
storage, typically describing the
activities of one or more related
organizations.
Database management system (DBMS):
A collection of software/programs
designed to assist in maintaining and
utilizing large collections of data.
BEFORE DBMS
User 1
User 2
Application
programs
Application
programs
Data
Data
AFTER DBMS
User 1
Application
programs
DBMS
User 2
Application
programs
Data
managed by
DBMS
WHY A DBMS?
1.
2.
3.
4.
5.
6.
7.
8.
Reduced application development time
Data independence: Application programs not dependent on
data representation and storage details
Data sharing: data is better utilized (discovered and reused),
redundancy of data is minimized
Data integrity and consistency: one may enforce consistency
constraints on data, e.g., number of seats sold ≤ number of seats
on the plane × 1.1
Centralized control: DBA tunes the database to balance user's
needs
Security: mechanisms to prevent unauthorized access. These
mechanisms are based on content instead of file-oriented
approach.
Concurrency control: avoids undesirable race conditions that
arise with simultaneous access/updates to data
Crash recovery: ensures the integrity of data in the presence of
failures
DBMS ARCHITECTURE
User 1
…
DBMS
User n
DB
Physical data
Conceptua
l schema
An Emerging Phenomena
User 1
Application
programs
DBMS
User 2
Application
programs
Data
managed by
DBMS
Example
F. Chang et. al. Bigtable: A Distributed
Storage System for Structured Data. In OSDI
2006. Last paragraph of the paper:
“Finally, we have found that there are significant
advantages to building our own storage solution
at Google. We have gotten substantial amount of
flexibility from designing our own data model for
Bigtable. In addition, our control over Bigtable’s
implementation, and the other Google
infrastructure upon which Bigtable depends,
means that we can remove bottlenecks and
inefficiencies as they arise.”
WHAT HAS CHANGED?
1.
2.
Relational database technology is now more than a quarter of
century old.
While concepts such as concurrency control are extremely
valuable, the performance loss attributed to their use is not
justified for some non-banking applications.
E.g., A social networking site is not a banking application.
3.
RDBMS vendors increased functionality for their own niche,
increasing complexity. Each application used a decreasing
fraction of the provided features.
A deployment requires a specialist, trained in database administration, for
maintainence.
4.
Availability of data is paramount.
Cost of downtime is estimated at thousands of dollars per minute.
5.
6.
SQL is too general and cumbersome to use with some
applications.
Storage has become larger and more economical.
10 cents per Gigabyte of magnetic disk storage.
Flash as a new layer in the storage hierarchy: DRAM, Flash, Disk.
7 to 8 dollars per Gigabyte of DRAM.
A bank’s data (TPC benchmark) becomes main memory resident!
Cross-roads
Since 1998, database researchers have
been aware of the limitations:
More modular architecture based on
simple, component-based building
blocks.
One architecture will not satisfy all
applications.
585 Syllabus
Storage and Storage Management:
M. Seltzer. Beyond Relational Databases.
Communications of the ACM, July 2008, Vol. 51,
No. 7.
D. A. Patterson, G. Gibson, and R. H. Katz. A
Case for Redundant Arrays of Inexpensive Disks
(RAID). ACM SIGMOD, 1988.
G. Graefe. The five-minute rule twenty-years
later, and how flash memory changes the rules.
Proceedings of the Third International Workshop
on Data Management on New Hardware (DaMoN),
2007.
Flash as a new storage medium.
2-3 weeks.
Start homework 1 using Berkeley DB.
585 Syllabus (Cont…)
Parallel DBMS:
D. DeWitt et al. The Gamma Database Machine
Project. IEEE Transactions on Knowledge and
Data Engineering, Vol. 2, 1990.
F. Chang et al. Bigtable: A Distributed Storage
System for Structured Data. In OSDI 2006.
J. Dean and S. Ghemawat. MapReduce:
Simplified Data Processing on Large Clusters. In
Communications of the ACM, Vol. 51, No. 1, 2008.
Data intensive applications can be parallelized
effectively.
2 Weeks.
585 Syllabus (Cont…)
Spatial Index Structures:
A. Guttman. R-Trees: A Dynamic Index Structure for
Spatial Searching. In ACM SIGMOD 1984.
P. E. O’Neil, and D. Quass. Improved Query Performance
with Variant Indexes. In ACM SIGMOD 1997.
No substitute for smart data indexing techniques!
Brute-force approaches are not acceptable.
2 Weeks.
Initiate your project to build a relational
query processing software using Berkeley
DB.
585 Syllabus (Cont…)
Query optimizations:
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,
R. A. Lorie, T. G. Price. Access Path Selection in
Relational Database Management System. In
ACM SIGMOD 1979.
S. Chaudhuri. An Overview of Query
Optimization in Relational Systems. PODS 1998.
Techniques to select index structures. Focus is
on your project.
2 Weeks.
585 Syllabus (Cont…)
Decision Support:
R. Agrawal and R. Srikant. Fast Algorithms for
Mining Association Rules in Large Databases. In
VLDB 1994.
J. Gray et al. Data Cube: A Relational
Aggregation Operator Generalizing Group-by,
Cross-Tab and SubTotals. Data Mining and
Knowledge Discovery 1(1), 1997.
C. Stolte, D. Tang, and P. Hanrahan. Polaris: A
System for Query, Analysis, and Visualization of
Multidimensional Databases. Communications
of the ACM, Vol. 51, No. 11, November 2008.
Discovery of trends in large data sets and their
visualization.
2-3 Weeks.
585 Syllabus (Cont…)
Main Memory Databases:
P. A. Boncz, M. L. Kristen, and S. Manegold.
Breaking the Memory Wall in MonetDB.
Communications of the ACM, December 2008,
Vol. 51, No. 12.
Use L2 cache of a CPU!
2 Weeks
585 Syllabus (Cont…)
Cache Management:
S. Ghandeharizadeh and S. Shyandeh. Greedy
Cache Management Techniques for Mobile
Devices. In First International IEEE Workshop on
Ambient Intelligence, Media and Sensing. April
2007.
Effective support for variable sized objects.
Time permitting, 1 to 2 weeks.
Grading
Midterm 1: 35%
Midterm 2: 35%
Assignments: 10%
Project: 20%
For next lecture
Read:
M. Seltzer. Beyond Relational Databases.
Communications of the ACM, July 2008, Vol. 51,
No. 7.