Chapter 1: Introduction

Download Report

Transcript Chapter 1: Introduction

CS 505
Intermediate Topics in Database Systems
Prof. Tingjian Ge
with thanks to Prof. Stan Zdonik, Brown University
Prof. Sam Madden, MIT
Prof. Avi Silberschatz, Yale University
What is a Database System?
 Database:
A very large collection of related data
 Models a real world enterprise:
 Entities (e.g., teams, games / students, courses)
 Relationships (e.g., The Wildcats are playing in the NCAA)
 DBMS: A software system that can be used to store,
manage and retrieve data from databases
 Database System: DBMS+data (+ applications)
1.2
Why Study Databases??
 Shift from computation to information
 Always true for corporate computing
 More and more true in the scientific world
 and of course, the Web
 DBMS encompasses much of CS in a practical discipline
Operating systems
Languages
Distributed systems
Performance
Theory
AI
 Jobs!
High-paying jobs!!!
1.3
Why Databases??
 Why not store everything in flat files:
i.e., use the file system of the OS, cheap/simple…
Name, Course, Grade
John Smith, CS115, B
Mike Stonebraker, CS405G, A
This is how things were
in the “Bad Old Days”
Jim Gray, CS505, A
John Smith, CS315, B+
…………………
Yes, but not scalable…
1.4
Problem 1
 Data redundancy and inconsistency
 Multiple file formats,
 duplication of information in different files
Name, Course, Email, Grade
John Smith, [email protected], CS112, B
Jim Gray, CS560, [email protected], A
John Smith, CS560, [email protected], B+
Name, Email, Course, Grade
Mike Stonebraker, [email protected], CS234, A
J. Smith, [email protected], CS560, B+
Why is this a problem?
 Wasted space
 Potential inconsistencies
(e.g., multiple formats, John Smith vs Smith J.)
1.5
Problem 2
 Data retrieval:
 Find the students who took CS505
 Find the students with GPA > 3.5
For every query we need to write a program!
 We need the retrieval to be:
 Easy to write
 Execute efficiently
1.6
Problem 3
 Data Integrity
 No support for sharing:
 Prevent simultaneous modifications
 No coping mechanisms for system crashes
 No means of Preventing Data Entry Errors (checks must be hard-coded
in the programs)
 Security problems
 Database systems offer solutions to all the above problems
1.7
Problem 4
 Long-lived data  Evolution
 What happens if I need to change my mind about how
the data is stored?
 Access patterns change
 Tuning
 Don’t want to have to re-write all my applications.
 Solution: Data independence!
1.8
Data Organization
 Data Models: a framework for describing




data objects
data relationships
data semantics
data constraints
 Presents primitives for
 Representing Data  Data Definition Language (DDL)
 Manipulating Data  Data Manipulation Language (DML)
 Entity-Relationship model
 We will concentrate on Relational model
 Other models:
 object-oriented model
 semi-structured data models, XML
1.9
Database Schema
 Similar to types and variables in programming languages
 Schema – the structure of the database
 e.g., the database consists of information about a set of
customers and accounts and the relationship between them)
 Expressed in some data model
 Occurs at multiple levels
 Logical schema: database design at the logical level
 Physical schema: database design at the physical level
1.10
Data Organization
 Two levels of data modeling
 Logical level: describes data stored in database, and the
relationships among the data.
type customer = record
name : string;
street : string;
city : integer;
end;
 Physical level: describes how a record (e.g., customer) is
stored.
 Also, View level: application programs hide details of data types.
Views can also hide information (e.g., salary) for security
purposes.
1.11
View of Data
A logical architecture for a database system
Data independence!
1.12
Entity-Relationship Model
Example of schema in the entity-relationship model
1.13
Entity Relationship Model (Cont.)
 E-R model of real world
 Entities (objects)
 E.g. customers, accounts, bank branch
 Relationships between entities
 E.g. Account A-101 is held by customer Johnson
 Relationship set depositor associates customers with accounts
 Widely used for database design
 Database design in E-R model usually converted to design in the
relational model (coming up next) which is used for storage and
processing
1.14
Relational Model
Attributes
 Example of tabular data in the relational model
Customer-id
customername
192-83-7465
Johnson
019-28-3746
Smith
192-83-7465
Johnson
321-12-3123
Jones
019-28-3746
Smith
Key
customerstreet
customercity
accountnumber
Alma
Palo Alto
A-101
North
Rye
A-215
Alma
Palo Alto
A-201
Main
Harrison
A-217
North
Rye
A-201
Schema = Customer (Customer-id, customer-name,
customer-table, account-number)
This whole table is an instance of the Customer schema.
1.15
Data Organization
 Data Storage (Ch 11)
Where can data be stored?
 Main memory
 Secondary memory (hard disks)
 Optical store
 Tertiary store (tapes)
 Move data? Determined by buffer manager
 Mapping data to files? Determined by file manager
1.16
Data retrieval
 Queries
Query = Declarative data retrieval
describes what data, not how to retrieve it
Ex. Give me the students with GPA > 3.5
vs
Scan the student file and retrieve the records with gpa>3.5
 Why?
1. Easier to write
2. Efficient to execute (why?)
1.17
Data retrieval
Query
Query Processor
Plan
Query Optimizer
Query Evaluator
Data
 Query Optimizer
“compiler” for queries (aka “DML Compiler”)
Plan ~ Assembly Language Program
Optimizer Does Better With Declarative Queries:
1. Algorithmic Query (e.g., in C) 1 Plan to choose from
2. Declarative Query (e.g., in SQL) n Plans to choose from
1.18
SQL
 SQL: widely used (declarative) non-procedural language
 E.g. find the name of the customer with customer-id 192-83-7465
select customer.customer-name
from customer
where customer.customer-id = ‘192-83-7465’
 E.g. find the balances of all accounts held by the customer with
customer-id 192-83-7465
select account.balance
from depositor, account
where depositor.customer-id = ‘192-83-7465’ and
depositor.account-number = account.account-number
 Procedural languages: C++, Java, relational algebra
1.19
Data Integrity
Transaction processing (Ch 15, 16)
 Why Concurrent Access to Data must be Managed?
John and Jane withdraw $50 and $100 from a common
account…
John:
1. get balance
2. if balance > $50
3. balance = balance - $50
4. update balance
Jane:
1. get balance
2. if balance > $100
3. balance = balance - $100
4. update balance
Initial balance $300. Final balance=?
It depends…
1.20
Data Integrity
Recovery (Ch 17)
Transfer $50 from account A ($100) to account B ($200)
1. get balance for A
2. If balanceA > $50
3. balanceA = balanceA – 50
4.Update balanceA in database
5. Get balance for B
System crashes….
6. balanceB = balanceB + 50
7. Update balanceB in database
Recovery management
1.21
Outline
 CS 405G: application-oriented
 How to develop database applications: User, application developer,
DBA
 CS 505 (This course!) : system-oriented
 Learn the internals of a relational DBMS (developer for Oracle,
better application developer, or system researcher)
1.22