Running 1996 - University of Houston

Download Report

Transcript Running 1996 - University of Houston

Introduction File and Database Systems
Christoph F. Eick
First 4 Weeks
Introduction to Databases
2. Course Information
Week1/2
3. Grading and Other Things
4. Questionnaire
5. The Relational Data Model
6. Relational Algebra / SQL Part1
Weeks 2-4
7. The E/R Data Model
8. SQL Part2
1.
Christoph F. Eick
Introduction File and Database Systems
Textbooks for COSC 6340
Required Text: Raghu Ramakrishnan and Johannes
Gehrke, Data Management Systems, McGraw Hill,
Third Edition, 2002
 Other books with relevant material: Ramez Elmasri
and Shamkant Navathe, Fundamentals of Database
Systems, Fourth Edition

Christoph F. Eick
Introduction File and Database Systems
Lectures in COSC 3480
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
Basic Concepts of Database Management (2 classes; Chapter 1, 2.1,2.2, 2.3;
instructor teaching material)
Introduction to the Relational Data Model (1.5 classes; Chapter 3.1, 3.2, 3.3, 3.4,
3.6)
Introduction to the Relational Algebra and SQL (3 classes; Chapter 4.2, Chapter
5)
Conceptual Schema Design using the Entity Relationship Data Model (2-3
classes; instructor material; Chapters 2.4, 2.5)
Relational Database Design and Normalization (2 classes; instructor material,
Chapter 19)
Introduction to KDD and Data Warehousing (1.5 classes; instructor material,
Chapters 25 and 26)
Disks, Files, Storage Structures, Index Structures and Physical Database
Design (4 classes, Chapter 8, 9, 10, 11.1, 11.2, 13, 20)
Spatial Data Management (1,5 classes, Chapter 28)
Internet Databases and XML (2.5 classes; Chapter 7 and 27)
Query Optimization (1 class; Chapter 12, only if enough time left)
Summary: Where Do We Stand? (1 class; instructor material)
Introduction File and Database Systems
Christoph F. Eick
Other News





The lab will start on Th, January 26, 2006, 8:30a. More
details about the lab will be discussed next week.
There will be three exams in Spring 2006: Tu., Feb. 28, Th.,
April 6, and ??, May ??.
All important information about the course can be found in
the webpage associated with this course:
http://www2.cs.uh.edu/~ceick/3480.html
Please inspect the webpage regularly.
The webpage is an evolving information document. Most of
the information still refers to the Fall 2005 teaching of the
course. The webpage will be updated during the course of
the semester!
Introduction File and Database Systems
Christoph F. Eick
UH Data Mining and Machine Learning Group (UH-DMML)
Christoph F. Eick and Ricardo Vilalta
http://www.tlc2.uh.edu/dmmlg
Goal: Development of data analysis and data mining techniques and the
application of these techniques to challenging problems in physics,
geology, astronomy, environmental sciences, and medicine.
Topics investigated:
 Meta Learning
 Classification and Learning from Examples
 Clustering
 Distance Function Learning
 Using Reinforcement Learning for Data Mining
 Spatial Data Mining
 Knowledge Discovery
Introduction File and Database Systems
Christoph F. Eick
Databases
Definition: A database is a collection of data with the
following properties:
 It represents certain aspect of the real-world.
 Its data are logically related.
 It is created for a specific purpose.
Introduction File and Database Systems
Christoph F. Eick
DBMS
Definition: A database management system (DBMS) is a set of
software that are used to define, store, manipulate and control the
data in a database.
 define --- define data types, structures and constraints.
 store --- store data; provide efficient access.
 manipulate --- perform retrieval and update operations using a
query language.
 control --- control access to data.
Database System = Database + DBMS
Introduction File and Database Systems
Christoph F. Eick
A Brief History Note


Database technology has a history of about 40 years.
Database technology has gone through several generations .
First Generation: File systems, 50's -- 60's
 A typical file system consists of a set of independent files, and a
number of application programs

Definition: A file stores a set of record (on a disk drive) all of
which have the same format.
Introduction File and Database Systems
Christoph F. Eick
An Example File System
A banking system may have

files for customers, saving accounts and checking
accounts;

application programs to deposit and withdraw money, to
find balance, etc.

different files are used for customers, saving and checking
accounts
Introduction File and Database Systems
Christoph F. Eick
Problems of File Systems (1)

It is difficult to support new applications. Two existing
application programs:
(i) find customers who have a checking account
(ii) find customers who have a saving account
 Need a new program to find the customers who have a
checking account and a saving account.
Christoph F. Eick
Introduction File and Database Systems
Problems of File Systems (2)

It has no centralized control of all data.
 Files are often created for a particular application.
 Files are created and managed independently.
Christoph F. Eick
Introduction File and Database Systems
Problems of File Systems (3)
There often exists severe data redundancy and
inconsistency.
Checking-Account: Acct#, Owner-name, Owner-SSN, OwnerAddr, Balance, ...
Saving-Account: Acct#, Owner-name, Owner-SSN, OwnerAddr, Balance, Interest, …

Christoph F. Eick
Introduction File and Database Systems
Problems of File Systems (4)

It lacks concurrency control.
Concurrency control: prevent mutual interference of
concurrent requests.
Example (Airplane ticket reservation): Consider the situation
when two customers are trying to book the only ticket left
for a flight through two operators at about the same time.
Introduction File and Database Systems
Christoph F. Eick
Problems of File Systems (5)

Weak security

Can not provide multiple views of the same data

Lack isolation between program and data

Lack self-describing feature
Introduction File and Database Systems
Christoph F. Eick
Database History (Continued)
Second Generation: Hierarchical database systems (HDBS),
late 60's -- early 70's




Best known HDBS: IMS (Information Management System
of IBM).
One-to-many relationships between parent records and
child records which can have different types.
Data are organized in trees
Records are connected by pointers.
Introduction File and Database Systems
Christoph F. Eick
An IMS Query
Query: find all Binghamton University students whose major is
computer science and whose GPA is higher than 3.5.
GU University (Name = `Binghamton University')
Department (Name = `Computer Science')
Student (GPA > 3.5)
L1: GNP Student (GPA > 3.5)
Goto L1
Christoph F. Eick
Introduction File and Database Systems
History of Database (Continued)
Third Generation: Network database systems (NDBS), late 60's
-- early 70's

Some commercial NDBSs: IDS II (Honeywell), DMS II
(UNISYS).

In NDBS, record types are organized into an acyclic graph.

Main problem with HDBS and NDBS: difficult to use.
Christoph F. Eick
Introduction File and Database Systems
History of Database (Continued)
Fourth Generation: Relational database systems (RDBS), early
70's -- now
 Example relational DBSs: Oracle 7, Sybase, Informax, DB2,
Ingres, ...
 In RDBS, data are organized into tables (relations).
Christoph F. Eick
Introduction File and Database Systems
History of Database (Continued)
Fifth Generation: Object-oriented and Object-Relational
database systems (OODBS), 80's -- now
 Example OODBSs: O2, Objectivity, ObjectStore, Versant, …
 Example ORDBSs: Oracle 8, Informix, UniSQL/X.
Introduction File and Database Systems
Christoph F. Eick
Database Languages
Data Definition Language (DDL): used by DBA or database designer to
define database schemas.
Data Manipulation Language (DML): used by database users to
retrieve, insert, delete and update data in the database.
Query language: The part of DML that is used to retrieve data.
Data Control Language (DCL): used by database owners and DBA to
control the access of data.
Christoph F. Eick
Introduction File and Database Systems
Persons Involving DBS (1)

DBMS developers: Those who design and implement
DBMS software: buffer manager, query processor,
transaction manager, interface, ...

Database designers: Those who are responsible for
determining
 what data should be stored in the database;
 how data in the database should be organized;
 the design of customized views;
 the design of special data structures to improve
the performance of the system.
Introduction File and Database Systems
Christoph F. Eick
Persons Involving DBS (2)

Database administrator (DBA): Those who manage
and monitor the daily operation of a database
system.
 authorization for database access, e.g., who can
access what data in what mode.
 routine maintenance: backup, install new tools, ...
 modification to existing database design.
Introduction File and Database Systems
Christoph F. Eick
Persons Involving DBS (3)

End-users:
 Casual users: those who access the database
using SQL directly.

 Naive users: those who access the database
using pre-prepared packages.
Application programmers: Those who write menu
applications for naive users, typically, through
database calls embedded in a program.
Christoph F. Eick
Introduction File and Database Systems
After This Course, You Will Be








familiar with the relational data model;
a decent database designer;
a sophisticated casual user;
a good application programmer;
knowledgeable with major aspects on how to use a DBMS
knowledge in a few advanced topics with respect to
database systems
This course will not teach you to become a database
administrator
The implementation of DBMS will only be partially covered.
Christoph F. Eick
Introduction File and Database Systems
Popular Topics in Databases









Efficient algorithms for data collections that reside on disks (or which are
distributed over multiple disk drives, multiple computers or over the
internet).
Study of data models (knowledge representation, mappings, theoretical
properties)
Algorithms to run a large number of transactions on a database in parallel;
finding efficient implementation for queries that access large databases;
database backup and recovery,…
Database design
How to use database management systems as an application programmer /
end user.
How to use database management systems as database administrator
How to implement database management systems
Data summarization, knowledge discovery, and data mining
Special purpose databases (genomic, geographical, internet,…)
Introduction File and Database Systems
Christoph F. Eick
Review: Why are integrated databases popular?
Bookkeeping
Device
Integrated
Database
Car Salesman
Introduction File and Database Systems
Christoph F. Eick
Review: Why are integrated databases popular?





Avoidance of uncontrolled redundancy
Making knowledge accessible that would otherwise not be
accessible
Standardization --- uniform representation of data facilitating
import and export
Reduction of software development (though the availability of
data management systems)
Support for Parallel Access and Data Security
Bookkeeping
Device
Integrated
Database
Car Salesman
Introduction File and Database Systems
Christoph F. Eick
Data Model
Data Model
is used to define
Schema (defines
a set of database
states)
Current Database
State
Introduction File and Database Systems
Christoph F. Eick
Schema for the Library Example
using the E/R Data Model
when
title
name
ssn
Person
1-to-1
(0,35)
author
B#
phone
Check_out
1-to Many
(0,1)
Many-to-1
Book
Many-to-Many
Christoph F. Eick
Introduction File and Database Systems
Relational Schema for Library Example
in SQL/92
CREATE TABLE Person
(ssn CHAR(9),
name CHAR(30),
phone INTEGER,
PRIMARY KEY (ssn));
CREATE TABLE Book
(B# INTEGER,
title CHAR(30),
author CHAR(20),
PRIMARY KEY (B#));
CREATE TABLE Checkout(
book INTEGER,
person CHAR(9),
since DATE,
PRIMARY KEY (book),
FOREIGN KEY (book) REFERENCES Book,
FOREIGN KEY (person) REFERENCES Person));
Introduction File and Database Systems
Christoph F. Eick
Example Instances
Person(name, ssn, phone): (Eick,111111111.33345), (Miller,
222222222,33337)
Book(B#,title, author): (1,Today,Yu), (2, Today, Yu), (7, Blue, Xu)
Checkout(book,person,since): (2,222222222.8/8/05),
(7,222222222,8/8/05)
Christoph F. Eick
Introduction File and Database Systems
Referential Integrity in SQL/92

SQL/92 supports all 4 options on
CREATE TABLE Enrolled
deletes and updates.
(sid CHAR(20),
 Default is NO ACTION
cid CHAR(20),
(delete/update is rejected)
 CASCADE (also delete all tuples
grade CHAR(2),
that refer to deleted tuple)
PRIMARY KEY (sid,cid),
 SET NULL / SET DEFAULT (sets
FOREIGN KEY (sid)
foreign key value of referencing
REFERENCES Students
tuple)
ON DELETE CASCADE
ON UPDATE SET DEFAULT )
Christoph F. Eick
Introduction File and Database Systems
Example of an Internal Schema
for the Library Example
INTERNAL Schema Library12 references Library.
Book is stored sequentially,
index on B# using hashing,
index on Author using hashing.
Person is stored using hashing on ssn.
Check_out is stored sequentially,
index on since using B+-tree.
Introduction File and Database Systems
Christoph F. Eick
Example: Stored Database
Index on B#
Block#= B# mod 10
0
1
Relation Book
20
30
Index on
Author
(1, C,W)
1
11
51
(20, Y,W)
(51, C, B)
W,
…
(11, Y,W)
(30, Z, B)
0
Relation Person 1
Block#= sss mod 10
Relation
Checkout
(200,…)
(500,…)
Index on since
(101,…)
Introduction File and Database Systems
Christoph F. Eick
3 Schema Architecture
External
Schema
External
Schema
Conceptual
Schema
Internal
Schema
External
Schema
How users
see the Data
What the database
contains and what
constraints hold with
respect to the database
How the data are
physically stored
Introduction File and Database Systems
Christoph F. Eick
Data Independence
Data Independence: the ability to modify the lower level
descriptions of a database without causing application
programs to be rewritten.


Logical Data Independence: the ability to modify the conceptual
schema without causing application programs to be rewritten.
Physical Data Independence: the ability to modify the internal
schema without causing application programs to be rewritten.
Data independence is achieved through proper manipulation of the
above two mappings.
Modern Relational DBMS
Transaction
Concepts; capability
of running many
transactions in
parallel; support for
backup and recovery.
Support for Web-Interfaces, XML, and
Data Exchange
Support for OO; capability
to store operations
Efficient Implementation of
Queries (Query Optimization,
Join & Selection &
Indexing techniques)
Modern
DBMS
Support for special
Data-types: long fields,
images, html-links, DNA-sequences,
spatial information,…
Support
for datadriven
computing
Support for
Data Mining
operations
Support for
OLAP and Data
Warehousing
Support for higher level user interfaces:
graphical, natural language, form-based,…
Introduction File and Database Systems
Christoph F. Eick
Disks and Files


DBMS stores information on (“hard”) disks.
This has major implications for DBMS design!
 READ: transfer data from disk to main memory (RAM).
 WRITE: transfer data from RAM to disk.
 Both are high-cost operations, relative to in-memory
operations, so must be planned carefully!
Christoph F. Eick
Introduction File and Database Systems
Why Not Store Everything in Main
Memory?



Costs too much. $100 will buy you either 512MB of
RAM or 50GB of disk today --- that is disk storage
100 times cheaper (but it is approx. 10000 times
slower).
Main memory is volatile. We want data to be saved
between runs. (Obviously!)
Typical storage hierarchy:
 Main memory (RAM) for currently used data.
 Disk for the main database (secondary storage).
 Tapes for archiving older versions of the data
(tertiary storage).
Remark: All reported disk performance/prize data are as of middle of 2003
Components of a Disk
Disk head

The platters spin (say, 90rps).
The arm assembly is
moved in or out to position
a head on a desired track.
Tracks under heads make
a cylinder (imaginary!).

Sector
Arm movement
Only one head
reads/writes at any
one time.

Arm assembly
Block size is a multiple
of sector size (which is fixed).

Spindle
Tracks
Platters
Christoph F. Eick
Introduction File and Database Systems
Accessing a Disk Page


Time to access (read/write) a disk block:
 seek time (moving arms to position disk head on
track)
 rotational delay (waiting for block to rotate under
head)
 transfer time (actually moving data to/from disk
surface)
Seek time and rotational delay dominate.
 Seek time varies from about 1 to 20msec
 Rotational delay varies from 0 to 10msec
 Transfer rate is about 1msec per 32KB page
Introduction File and Database Systems
Christoph F. Eick
DBMS Support Transactions


Database management systems provide powerful
transaction concepts that “guarantee” ACID
properties
Transaction:
Begin_Transaction
<set of operations that read and modify the database>
End_Transaction

Usually 2 commands are available to terminate a
transaction: Abort and Commit
Christoph F. Eick
Introduction File and Database Systems
Review: The ACID properties

A tomicity: All actions of the transaction happen, or none
happen.
C onsistency: If each transaction is consistent, and the DB
starts consistent, it ends up consistent.
I solation: Execution of one transaction is isolated from
that of other transaction s.
D urability: If a transaction commits, its effects persist.

The Recovery Manager guarantees Atomicity & Durability.



Introduction File and Database Systems
Christoph F. Eick
Example

Consider two transactions (Xacts):
T1:
T2:
BEGIN A=A+100, B=B-100 END
BEGIN A=1.06*A, B=1.06*B END
Intuitively, the first transaction is transferring $100
from B’s account to A’s account. The second is
crediting both accounts with a 6% interest payment.
 There is no guarantee that T1 will execute before T2 or
vice-versa, if both are submitted together. However,
the net effect must be equivalent to these two
transactions running serially in some order.

Christoph F. Eick
Introduction File and Database Systems
Atomicity of Transactions



A transaction might commit after completing all its actions, or it
could abort (or be aborted by the DBMS) after executing some
actions.
A very important property guaranteed by the DBMS for all
transactions is that they are atomic.
DBMS logs all actions so that it can undo the actions
of aborted transactions and redo the actions of
successful transactions.
Christoph F. Eick
Introduction File and Database Systems
Concurrency in a DBMS


Users submit transactions, and can think of each transaction
as executing by itself.
 Concurrency is achieved by the DBMS, which interleaves
actions (reads/writes of DB objects) of various transactions.
 Each transaction must leave the database in a consistent
state if the DB is consistent when the transaction begins.
 DBMS will enforce some ICs, depending on the ICs
declared in CREATE TABLE statements.
 Beyond this, the DBMS does not really understand the
semantics of the data. (e.g., it does not understand how
the interest on a bank account is computed).
Issues: Effect of interleaving transactions, and crashes.
Introduction File and Database Systems
Christoph F. Eick
Example (Contd.)

Consider a possible interleaving (schedule):
T1:
T2:

A=1.06*A,
B=B-100
B=1.06*B
This is OK. But what about:
T1:
T2:

A=A+100,
A=A+100,
A=1.06*A, B=1.06*B
B=B-100
The DBMS’s view of the second schedule:
T1:
T2:
R(A), W(A),
R(A), W(A), R(B), W(B)
R(B), W(B)
Introduction File and Database Systems
Christoph F. Eick
Summary



Concurrency control and recovery are among the most
important functions provided by a DBMS.
Users need not worry about concurrency.
 System automatically inserts lock/unlock requests and
schedules actions of different transactions in such a way as
to ensure that the resulting execution is equivalent to
executing the transactions one after the other in some order.
Write-ahead logging (WAL) is used to undo the actions of
aborted transactions and to restore the system to a consistent
state after a crash.