CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Queries for Today






What is a database?
What is a database management system?
Why take a database course?
Who will teach?
How to take the class?
Preview of class contents
7/20/2015
Luke Huan Univ. of Kansas
2
What: Database Systems Then
7/20/2015
Luke Huan Univ. of Kansas
3
What: Database Systems Today
7/20/2015
Luke Huan Univ. of Kansas
4
What: Database Systems Today
7/20/2015
Luke Huan Univ. of Kansas
5
What: Database Systems Today
7/20/2015
Luke Huan Univ. of Kansas
6
So… What is a Database?



A database is a very large, integrated collection of data.
Data is a group of facts that can be recorded and have an
implicit meaning.
Typically a database is used to model a real-world
“enterprise” (or a miniworld)



Entities (e.g., basketball teams, games)
Relationships (e.g. KU’s basketball team won <you name it> last
week)
Might surprise you how flexible this is

Web search:



P2P filesharing:


7/20/2015
Entities: words, documents
Relationships: word in document, document links to document.
Entities: words, filenames, hosts
Relationships: word in filename, file available at host
Luke Huan Univ. of Kansas
7
Main Characteristics of Databases

Self-describing nature of a database system


Insulation between programs and data


Use data model to hide storage details and present the users with
a conceptual view of the database
Support of multiple views of the data


Allows changing data storage structures and operations without
having to change the DBMS access
Data Abstraction


A DBMS catalog stores the description of the database. The
description is called meta-data.
Each user may see a different view of the database, which
describes only the data of interest to that user.
Sharing of data and multi-user transaction processing
7/20/2015
Luke Huan Univ. of Kansas
8
Databases make these folks happy ...

End users in many fields


DB application programmers



Build data entry & analysis tools on top of DBMSs
Build web services that run off DBMSs
Database administrators (DBAs)





Business, education, science, …
Design logical/physical schemas
Handle security and authorization
Data availability, crash recovery
Database tuning as needs evolve
DBMS vendors, programmers

Oracle, IBM, MS …
7/20/2015
Luke Huan Univ. of Kansas
9
…must
understand how a DBMS works
What is a Database Management System?

A Database Management System (DBMS) is a
collection of programs that enable uses to create
and maintain databases


store, manage, and access data in a databases.
Typically this term is used narrowly

Relational databases with transactions


Mostly because they predate other large repositories


Also because of technical richness
When we say DBMS in this class we will usually
follow this convention

7/20/2015
E.g. Oracle, DB2, SQL Server
But keep an open mind about applying the ideas!
Luke Huan Univ. of Kansas
10
What: Is the WWW a DBMS?

Fairly sophisticated search available



Crawler indexes pages on the web
Keyword-based search for pages
But, currently


data is mostly unstructured and untyped
search only:




can’t modify the data
can’t get summaries, complex combinations of data
few guarantees provided for freshness of data,
consistency across data items, fault tolerance, …
web sites typically have a (relational) DBMS in the
background to provide these functions.
7/20/2015
Luke Huan Univ. of Kansas
11
What: Is the WWW a DBMS?

The picture is changing quickly


Information Extraction to get structure from unstructured
New standards e.g., XML, Semantic Web can help data
modeling
7/20/2015
Luke Huan Univ. of Kansas
12
What: “Search” vs. Query

What if you wanted to
find out which actors
donated to John Kerry’s
presidential campaign?

Try “actors donated to
john kerry” in your
favorite search engine.

If it isn’t “published”, it
can’t be searched!
7/20/2015
Luke Huan Univ. of Kansas
13
What: A “Database Query” Approach
7/20/2015
Luke Huan Univ. of Kansas
14
What: Is a File System a DBMS?

Thought Experiment 1:



You and your project partner are editing the same file.
Q:
How
do
you
write
You both save it at the same time.
programs
over a
Whose
changes survive?
A) Yours
B) Partner’s
C) Both
subsystem
when
it D) Neither E) ???
promises
you
• Thought
Experiment
2:
–You’re updating a file.
A:
Very,
very
–The power goes out.
–Which changes survive?
only “???” ?
carefully!!
A) All B) None C) All Since Last Save D) ???
7/20/2015
Luke Huan Univ. of Kansas
15
OS Support for Data Management

Data can be stored in RAM




this is what every programming language offers!
RAM is fast, and random access
Isn’t this heaven?
Every OS includes a File System




manages files on a magnetic disk
allows open, read, seek, close on a file
allows protections to be set on a file
drawbacks relative to RAM?
7/20/2015
Luke Huan Univ. of Kansas
16
Database Management Systems

What more could we want than a file system?





Simple, efficient ad hoc1 queries
concurrency control
recovery
benefits of good data modeling
S.M.O.P.2? Not really…


as we’ll see this semester
in fact, the OS often gets in the way!
1ad
hoc: formed or used for specific or immediate problems or needs
2SMOP: Small Matter Of Programming
7/20/2015
Luke Huan Univ. of Kansas
17
Current Commercial Outlook





A major part of the software industry:
 Oracle, IBM, Microsoft
 also Sybase, Informix (now IBM), Teradata
 smaller players: java-based dbms, devices, OO, …
Lots of related industries
 data warehouse, document management, storage, backup,
reporting, business intelligence, ERP, CRM, app integration
Traditional Relational DBMS products dominant and evolving
 adapted for extensibility (user-defined types), native XML
support.
 Microsoft merger of file system/DB…?
Open Source coming on strong
 MySQL, PostgreSQL, Apache Derby, BerkeleyDB, Ingres,
EigenBase
And of course, the other “database” technologies
 Search engines, P2P, etc.
7/20/2015
Luke Huan Univ. of Kansas
18
Advantages of a DBMS: a short list







Controlling redundancy
Restrict unauthorized access
Providing persistent storage for program objects
Providing storage structure for efficient query processing
Providing backup and crash recovery
….
And many many others that are going to be explored in this class
7/20/2015
Luke Huan Univ. of Kansas
19
What database systems will we cover?

We will be try to be broad and touch upon




Relational DBMS (e.g. Oracle, SQL Server, DB2, Postgres)
“Semi-structured” DB systems (e.g. XML repositories like
Xindice)
Data mining: transfer data into knowledge!
Starting point


We assume you have used web search engines
We assume you don’t know relational databases


So focus will be on relational DBMSs

7/20/2015
Yet they pioneered many of the key ideas
With frequent side-notes on search engines, XML issues
Luke Huan Univ. of Kansas
20
Why take this class?
A.
B.
C.
D.
E.
Database systems are at the core of CS
They are incredibly important to society
The topic is intellectually rich
It isn’t that much work
Looks good on your resume
Let’s spend a little time on each of these
7/20/2015
Luke Huan Univ. of Kansas
21
Why take this class?
A. Database systems are the core of CS

Shift from computation to information




True in corporate computing for years
Web, p2p made this clear for personal computing
Increasingly true of scientific computing
Need for DB technology has exploded in the last years




7/20/2015
Corporate: retail swipe/clickstreams, “customer relationship
mgmt”, “supply chain mgmt”, “data warehouses”, etc.
Web:not just “documents”. Search engines, e-commerce,
blogs, wikis, other “web services”.
Scientific: digital libraries, genomics, satellite imagery,
physical sensors, simulation data
Personal: Music, photo, & video libraries. Email archives.
File contents (“desktop search”).
Luke Huan Univ. of Kansas
22
Why take this class?
B. DBs are incredibly important to society

“Knowledge is power.” -- Sir Francis
Bacon

“With great power comes great
responsibility.” -- SpiderMan’s Uncle
Ben
Policy-makers should understand technological possibilities.
Informed Technologists needed in public discourse on usage.
7/20/2015
Luke Huan Univ. of Kansas
23
Why take this class?
C. The topic is intellectually rich.

representing information


languages and systems for querying data




controlling concurrent access
ensuring transactional semantics
reliable data storage


complex queries & query semantics*
over massive data sets
concurrency control for data manipulation


data modeling
maintain data semantics even if you pull the plug
data mining
 Let your data speak
* semantics: the meaning or relationship of meanings of a sign or set of signs
7/20/2015
Luke Huan Univ. of Kansas
24
Why take this class?
D. It isn’t that much work.

Bad news: It is a lot of work.

Good news: the course is front loaded


7/20/2015
Most of the hard work is in the first half of the
semester
Load balanced with most other classes
Luke Huan Univ. of Kansas
25
Why take this class?
E. Looks good on my resume.

Yes, but why? This is not a course for:



IBM DB2 engine developers
 Though it’s useful for both!
It is a course for well-educated computer scientists


7/20/2015
Oracle administrators
Database system concepts and techniques increasingly used
“outside the box”
 Ask your friends at Microsoft, Yahoo!, Google, Apple, etc.
 Actually, they may or may not realize it!
A rich understanding of these issues is a basic and
(un?)fortunately unusual skill.
Luke Huan Univ. of Kansas
26
Who?

Instructors



Prof. Luke Huan, EECS
[email protected]
Prof. Office Hours:

7/20/2015
Eaton Hall 2034, M, W 4:15-5:15pm
Luke Huan Univ. of Kansas
27
How? Workload

Projects with a “real world” focus:




Homework assignments and quizzes
Exams – 1 Midterm & 1 Final
Projects to be done in groups of 2


Pick your partner ASAP
The course is “front-loaded”

7/20/2015
Build a web-based application w/PostgreSQL and
your favorite internet programming languge (PHP,
JAVA, PERL, just naming a few)
most of the hard work is in the first half
Luke Huan Univ. of Kansas
28
How? Administrative

Text book: Elmasri & Navatgh 5th edition

Class website - people.eecs.ku.edu/~jhuan/EECS647
 read it regularly
 Please include eecs 647 in your mail to the
instructor (for quick response)
Grading, hand-in policies, etc. is on syllabus

* The textbook will be available at the KU bookstore today or tomorrow
7/20/2015
Luke Huan Univ. of Kansas
29
Academic Misconduct


We take academic misconduct very seriously.
We encourage student to work together for doing
homework and projects, but each student should write
down their own solution.



You are absolutely encouraged to discuss with your
classmates about your homework assignments,
programming exercises, and final projects
You are responsible for all your works
For homework assignments and projects resulted from
discussion, please always acknowledge other people’s
contribution by including a sentence at the beginning of
the hand-in saying “I discussed the homework (project)
with XXX (include more names if necessary)”. There is
absolutely NO penalty for doing so.
7/20/2015
Luke Huan Univ. of Kansas
30
Agenda for the rest of today

Today: a preview of the class contents:









Relational model
SQL
Query processing

XML


This Wednesday


ER model
Logical database design
File system
Transaction
Data mining
Database in the future
the Entity-Relationship model
Today’s lecture is from Chapter 1 and 2 in E & N
Read Chapter 3 for next class.
7/20/2015
Luke Huan Univ. of Kansas
31
Agenda …

Design a Database




7/20/2015
ER model
Relational model
Database logical design
Query with SQL
Luke Huan Univ. of Kansas
32
Data Models Describe Data


A data model is a collection of concepts for
describing data.
Different levels of data models



7/20/2015
Conceptual (semantic) data model uses data (facts) to
describe a physical world
Representational model describes data in database
management systems
Physical data model describe how data is stored in
files in computers.
Luke Huan Univ. of Kansas
33
An ER Model for a mini-World
name
name
ssn
Salary
Employees


number
Works_For
manager
Departments
Entity: Real-world object distinguishable from other
objects. An entity is described (in DB) using a set of
attributes.
Relationship: Association among two or more entities.
E.g., John works in Pharmacy department.
Relational Model Describes Data


Relational model is the most popular representational
data model that are used by current commercial
DBMS.
Relational model describes data using tables
SSN
Name
Salary
123456789
John Smith
30000
333444555
Franklin Wong
40000
453453453
Joyce English
25000
name
ssn
Salary
Employees
Cardinality = 3, degree = 3, all rows distinct
 Do all columns in a relation instance have to be distinct?

7/20/2015
Luke Huan Univ. of Kansas
35
Logical Database Design
name
name
ssn
Salary
Employees
manager
number
Works_For
Departments
SSN
Name
Salary
DNumber
DName
Manager
123456789
John
Smith
30000
5
research
123456789
333444555 Franklin
Wong
40000
453453453
25000
7/20/2015
Joyce
English
SSN
Dnum
123456789
5
333444555
5
453453453
5
Luke Huan Univ. of Kansas
36
Logical Database Design (Cont.)
name
name
ssn
Employees
manager
number
Salary
Works_For
Departments
SSN
Name
Salary
DNum
DNumber
DName
Manager
123456789
John
Smith
30000
5
5
research
123456789
333444555 Franklin
Wong
40000
5
453453453
25000
5
7/20/2015
Joyce
English
Luke Huan Univ. of Kansas
37
Logical Database Design (Cont.)
name
name
ssn
Employees
manager
number
Salary
Works_For
Departments
SSN
Name
Salary
DNumber
DName
Manager
123456789
John Smith
30000
5
research
123456789
333444555
Franklin Wong
40000
5
research
123456789
453453453
Joyce English
25000
5
research
123456789
7/20/2015
Luke Huan Univ. of Kansas
38
SQL: Query Information from a Database

Query 1: Retrieve the salary of
Franklin Wong
Q1: SELECT
SALARY
FROM
EMPLOYEE
WHERE
NAME=‘Franklin Wong‘
SSN
Name
Salary
123456789
John
Smith
30000
333444555 Franklin
Wong
40000
453453453
25000
Answer: 40000
7/20/2015
Luke Huan Univ. of Kansas
Joyce
English
39
Agenda …

Design a Database Management System

File systems




Query processing
Transactions


7/20/2015
Disk storage
Indexing
Concurrency
Recovery
Luke Huan Univ. of Kansas
40
Components of a Disk
Disk head

Spindle
Tracks
The platters spin (say, 90rps).
The arm assembly is
moved in or out to position
a head on a desired track.
Tracks under heads make
a cylinder (imaginary!).

Sector
Arm movement
Arm assembly
Platters
Indexing is the Key to Speed up
Disk Operations

``Find all employees whose salary > $25000’’



If data is in sorted file, do binary search to find first
such employee, then scan to find others.
Cost of binary search can be quite high.
Simple idea: Create an `index’ file.
Page 1
Page 2
Index File
kN
k1 k2
Page 3
Page N
 Can do binary search on (smaller) index file!
Data File
Concurrent execution of user programs

Why?

Utilize CPU while waiting for disk I/O


(database programs make heavy use of disk)
Avoid short programs waiting behind long ones

7/20/2015
e.g. ATM withdrawal while bank manager sums balance
across all accounts
Luke Huan Univ. of Kansas
43
Key concept: Transaction


an atomic sequence of database actions (reads/writes)
takes DB from one consistent state to another
consistent state 1
7/20/2015
transaction
Luke Huan Univ. of Kansas
consistent state 2
44
Example
checking: $200
savings: $1000



Transaction
“transfer $100
from Saving to
checking”
checking: $300
savings: $900
Here, consistency is based on our knowledge of banking
“semantics”
In general, up to writer of transaction to ensure transaction
preserves consistency
DBMS provides (limited) automatic enforcement, via integrity
constraints

e.g., balances must be >= 0
7/20/2015
Luke Huan Univ. of Kansas
45
Agenda …

Advanced topics


XML
Data mining




7/20/2015
Association
Classification
Clustering
Database in the future
Luke Huan Univ. of Kansas
46