Running 1996 - University of Houston

Download Report

Transcript Running 1996 - University of Houston

Introduction Data Management
Christoph F. Eick
This Week
1. Introduction
to Databases
2. Course Information
3. Grading and Other Things
4. Questionnaire
5. The E/R Data Model
Christoph F. Eick
Introduction Data Management
Textbooks for COSC 6340
Required Text: Raghu Ramakrishnan and Johannes
Gehrke, Data Management Systems, McGraw Hill,
Third Edition, 2002 (complication: the chapter numbers in
the new edition are different!!)
 Recommended: Jiawei Han and Micheline Kamber,
Data Mining: Concepts and Techniques, Morgan
Kaufman Publishers, 2001, ISBN 1-55860-489-8 (4
chapters will be covered)
 Other books with relevant material: Ramez Elmasri
and Shamkant Navathe, Fundamentals of Database
Systems, Third Edition Addison Wesley ISBN: 0-80531755-4

Christoph F. Eick
Introduction Data Management
Lectures in COSC 6340






I: Basic Database Management Concepts --- Review of basic database concepts,
techniques, and languages (9 classes, Chapters 1-5, 8-11, and 16 of the textbook).
III: Introduction to KDD and Data Warehousing centering on data warehouses,
OLAP, and data mining; moreover, more detailed coverage of querying and mining
data streams and database clustering (5 classes; Chapters 1, 2, 6, and 7 of the
Han/Kamber book & chapter 25 of our textbook and additional material)
III: Relational Database Design (2 classes, chapters 19)
IV: Implementation of Relational Operators, Query Optimization, and Physical
Database Design (Chapters 12+14+20, 3-4 classes)
V: Internet Databases and XML (1 class, chapter 27 of the textbook)
VI: Other: discussion of home works and exams, student presentations; discussion
of course projects (3 classes)
Christoph F. Eick
Introduction Data Management
Tentative Schedule COSC 6340



Jan. 18: Introduction to COSC 6340
Fast Review of Undergraduate Material (Jan. 20-Feb. 15)
 Jan. 20: Entity-Relationship Data Model  more detailed than textbook
 Jan. 25: Entity-Relationship Data Model
 Jan. 27: Relational Data Model
 Feb. 1: Mapping E/R Diagrams to Relations
 Feb. 8/10/15: Index & storage structures, B+-trees, and hashing, PDBD
 Feb. 15/17: Relational Algebra
 Feb. 22: Writing SQL Queries (somewhat short)
 Feb. 24: Leftovers / Review
March 1: Exam0 (Undergraduate Review Exam)
Christoph F. Eick
Introduction Data Management
Tentative Schedule COSC 6340 Part2

II: KDD and Data Warehousing (approx. 5.5 classes)











March 3: Introduction to KDD
March 8: Similarity Assessment
March 10: Clustering
March 22: Association Rule Mining
March 29: Spatial Databases
March 31: Data Warehouses and OLAP
April 8 (30 minutes): Spatial Data Mining
April 5+7+[8]: III: Relational Database Design (2.5 classes)
April 14: Midterm Exam (30 minutes review on April 12, 2005)
April 19+26+[28]: Student Presentations
IV: Physical Database Design and Query Optimization (3 classes)
 April 8(makeup): Implementation of Relational Operators
 April 8(makeup)/12: Introduction to Query Optimization
 April 12/19: Physical Database Design II

V: Internet Databases (1 class)
 April 21: Introduction to XML and Semi-Structured Data

April 28: Course Summary + Teaching Evaluation + Leftovers
Christoph F. Eick
Introduction Data Management
Exams and Homeworks COSC 6340






Tu., March 1: Undergraduate Material Review Exam
Th., April 14: Midterm Exam
3 graded homeworks: deadlines: Feb. 17, April 11, April 28
Ungraded Homeworks
Final Exam: Tu/TH., May 10, 11a-1:30p
Qualifying Exam Part2: Fr., May 13, 10-11:30a
Introduction Data Management
Christoph F. Eick
Why are integrated databases popular?




Avoidance of uncontrolled redundancy
Making knowledge accessible that would otherwise not be
accessible
Standardization --- uniform representation of data facilitating
import and export
Reduction of software development (though the availability of
data management systems)
Bookkeeping
Device
Integrated
Database
Car Salesman
Introduction Data Management
Christoph F. Eick
Popular Topics in Databases









Efficient algorithms for data collections that reside on disks (or which are
distributed over multiple disk drives, multiple computers or over the
internet).
Study of data models (knowledge representation, mappings, theoretical
properties)
Algorithms to run a large number of transactions on a database in parallel;
finding efficient implementation for queries that access large databases;
database backup and recovery,…
Database design
How to use database management systems as an application programmer /
end user.
How to use database management systems as database administrator
How to implement database management systems
Data summarization, knowledge discovery, and data mining
Special purpose databases (genomic, geographical, internet,…)
Introduction Data Management
Christoph F. Eick
Data Model
Data Model
is used to define
Schema (defines
a set of database
states)
Current Database
State
Introduction Data Management
Christoph F. Eick
Schema for the Library Example
using the E/R Data Model
when
title
name
ssn
Person
1-to-1
(0,35)
author
B#
phone
Check_out
1-to Many
(0,1)
Many-to-1
Book
Many-to-Many
Christoph F. Eick
Introduction Data Management
Relational Schema for Library Example
in SQL/92
CREATE TABLE Person
(ssn CHAR(9),
name CHAR(30),
phone INTEGER,
PRIMARY KEY (ssn));
CREATE TABLE Book
(B# INTEGER,
title CHAR(30),
author CHAR(20),
PRIMARY KEY (B#));
CREATE TABLE Checkout(
book INTEGER,
person CHAR(9),
since DATE,
PRIMARY KEY (book),
FOREIGN KEY (book) REFERENCES Book,
FOREIGN KEY (person) REFERENCES Person));
Christoph F. Eick
Introduction Data Management
Referential Integrity in SQL/92

SQL/92 supports all 4 options on
CREATE TABLE Enrolled
deletes and updates.
(sid CHAR(20),
 Default is NO ACTION
cid CHAR(20),
(delete/update is rejected)
 CASCADE (also delete all tuples
grade CHAR(2),
that refer to deleted tuple)
PRIMARY KEY (sid,cid),
 SET NULL / SET DEFAULT (sets
FOREIGN KEY (sid)
foreign key value of referencing
REFERENCES Students
tuple)
ON DELETE CASCADE
ON UPDATE SET DEFAULT )
Christoph F. Eick
Introduction Data Management
Example of an Internal Schema
for the Library Example
INTERNAL Schema Library12 references Library.
Book is stored sequentially,
index on B# using hashing,
index on Author using hashing.
Person is stored using hashing on ssn.
Check_out is stored sequentially,
index on since using B+-tree.
Introduction Data Management
Christoph F. Eick
Example: Stored Database
Index on B#
Block#= B# mod 10
0
1
Relation Book
20
30
Index on
Author
(1, C,W)
1
11
51
(20, Y,W)
(51, C, B)
W,
…
(11, Y,W)
(30, Z, B)
0
Relation Person 1
Block#= sss mod 10
Relation
Checkout
(200,…)
(500,…)
Index on since
(101,…)
Modern Relational DBMS
Transaction
Concepts; capability
of running many
transactions in
parallel; support for
backup and recovery.
Support for Web-Interfaces, XML, and
Data Exchange
Support for OO; capability
to store operations
Efficient Implementation of
Queries (Query Optimization,
Join & Selection &
Indexing techniques)
Modern
DBMS
Support for special
Data-types: long fields,
images, html-links, DNA-sequences,
spatial information,…
Support
for datadriven
computing
Support for
Data Mining
operations
Support for
OLAP and Data
Warehousing
Support for higher level user interfaces:
graphical, natural language, form-based,…
Introduction Data Management
Christoph F. Eick
Disks and Files


DBMS stores information on (“hard”) disks.
This has major implications for DBMS design!
 READ: transfer data from disk to main memory (RAM).
 WRITE: transfer data from RAM to disk.
 Both are high-cost operations, relative to in-memory
operations, so must be planned carefully!
Christoph F. Eick
Introduction Data Management
Why Not Store Everything in Main
Memory?



Costs too much. $100 will buy you either 512MB of
RAM or 50GB of disk today --- that is disk storage
100 times cheaper (but it is approx. 10000 times
slower).
Main memory is volatile. We want data to be saved
between runs. (Obviously!)
Typical storage hierarchy:
 Main memory (RAM) for currently used data.
 Disk for the main database (secondary storage).
 Tapes for archiving older versions of the data
(tertiary storage).
Remark: All reported disk performance/prize data are as of middle of 2003
Components of a Disk
Disk head

The platters spin (say, 90rps).
The arm assembly is
moved in or out to position
a head on a desired track.
Tracks under heads make
a cylinder (imaginary!).

Sector
Arm movement
Only one head
reads/writes at any
one time.

Arm assembly
Block size is a multiple
of sector size (which is fixed).

Spindle
Tracks
Platters
Introduction Data Management
Christoph F. Eick
Accessing a Disk Page


Time to access (read/write) a disk block:
 seek time (moving arms to position disk head on
track)
 rotational delay (waiting for block to rotate under
head)
 transfer time (actually moving data to/from disk
surface)
Seek time and rotational delay dominate.
 Seek time varies from about 1 to 20msec
 Rotational delay varies from 0 to 10msec
 Transfer rate is about 1msec per 32KB page
Christoph F. Eick
Introduction Data Management
Review: The ACID properties

A tomicity: All actions in the Xact happen, or none happen.
C onsistency: If each Xact is consistent, and the DB starts
consistent, it ends up consistent.
I solation: Execution of one Xact is isolated from that of
other Xacts.
D urability: If a Xact commits, its effects persist.

The Recovery Manager guarantees Atomicity & Durability.



Introduction Data Management
Christoph F. Eick
Example

Consider two transactions (Xacts):
T1:
T2:
BEGIN A=A+100, B=B-100 END
BEGIN A=1.06*A, B=1.06*B END
Intuitively, the first transaction is transferring $100
from B’s account to A’s account. The second is
crediting both accounts with a 6% interest payment.
 There is no guarantee that T1 will execute before T2 or
vice-versa, if both are submitted together. However,
the net effect must be equivalent to these two
transactions running serially in some order.

Introduction Data Management
Christoph F. Eick
Atomicity of Transactions



A transaction might commit after completing all its actions, or it
could abort (or be aborted by the DBMS) after executing some
actions.
A very important property guaranteed by the DBMS for all
transactions is that they are atomic.
DBMS logs all actions so that it can undo the actions
of aborted transactions and redo the actions of
successful transactions.
Introduction Data Management
Christoph F. Eick
Concurrency in a DBMS


Users submit transactions, and can think of each transaction
as executing by itself.
 Concurrency is achieved by the DBMS, which interleaves
actions (reads/writes of DB objects) of various transactions.
 Each transaction must leave the database in a consistent
state if the DB is consistent when the transaction begins.
 DBMS will enforce some ICs, depending on the ICs
declared in CREATE TABLE statements.
 Beyond this, the DBMS does not really understand the
semantics of the data. (e.g., it does not understand how
the interest on a bank account is computed).
Issues: Effect of interleaving transactions, and crashes.
Introduction Data Management
Christoph F. Eick
Example (Contd.)

Consider a possible interleaving (schedule):
T1:
T2:

A=1.06*A,
B=B-100
B=1.06*B
This is OK. But what about:
T1:
T2:

A=A+100,
A=A+100,
A=1.06*A, B=1.06*B
B=B-100
The DBMS’s view of the second schedule:
T1:
T2:
R(A), W(A),
R(A), W(A), R(B), W(B)
R(B), W(B)
Introduction Data Management
Christoph F. Eick
Summary



Concurrency control and recovery are among the most
important functions provided by a DBMS.
Users need not worry about concurrency.
 System automatically inserts lock/unlock requests and
schedules actions of different transactions in such a way as
to ensure that the resulting execution is equivalent to
executing the transactions one after the other in some order.
Write-ahead logging (WAL) is used to undo the actions of
aborted transactions and to restore the system to a consistent
state after a crash.