CPS 216: Advanced Database Systems
Download
Report
Transcript CPS 216: Advanced Database Systems
CPS 216: Advanced Database
Systems
Shivnath Babu
Outline for Today
• What this class is about: Data management
• What we will cover in this class
• Logistics
What does a Database System mean to you?
(Hint: What are they used for? Give examples)
User/Application
Data Management
Query
Query
Query
Data
DataBase Management System (DBMS)
Example: At a Company
Query 1: Is there an employee named “Nemo”?
Query 2: What is “Nemo’s” salary?
Query 3: How many departments are there in the company?
Query 4: What is the name of “Nemo’s” department?
Query 5: How many employees are there in the
“Accounts” department?
Employee
Department
ID
Name
DeptID
Salary
…
ID
Name
…
10
Nemo
12
120K
…
12
IT
…
20
Dory
156
79K
…
34
Accounts
…
40
Gill
89
76K
…
89
HR
…
52
Ray
34
85K
…
156
Marketing
…
…
…
…
…
…
…
…
…
DataBase Management System (DBMS)
High-level
Query Q
Answer
DBMS
Data
Translates Q into
best execution plan
for current conditions,
runs plan
Example: Store that Sells Cars
Make
Model OwnerID
ID Name
Owners of
Honda Accords Honda Accord
12
12 Nemo
who are <=
Honda Accord
156
156 Dory
23 years old
Join (Cars.OwnerID = Owners.ID)
Filter (Make = Honda and
Model = Accord)
Cars
Age
22
21
Filter (Age <= 23)
Owners
Make
Model
OwnerID
ID
Name
Age
Honda
Accord
12
12
Nemo
22
Toyota
Camry
34
34
Ray
42
Mini
Cooper
89
89
Gill
36
Honda
Accord
156
156
Dory
21
…
…
…
…
…
…
DataBase Management System (DBMS)
High-level
Query Q
Answer
DBMS
Keeps data safe
and correct
despite failures,
concurrent
updates, online
processing, etc.
Data
Translates Q into
best execution plan
for current conditions,
runs plan
DBMS is multi-user
• Example
Get account balance from database;
If balance > amount of withdrawal then
balance = balance - amount of withdrawal;
dispense cash;
store new balance into database;
• Homer at ATM1 withdraws $100
• Marge at ATM2 withdraws $50
• Initial balance = $400, final balance = ?
– Should be $250 no matter who goes first
Final balance = $250
Homer withdraws $100:
read balance; $400
if balance > amount then
balance = balance - amount; $300
write balance; $300
Marge withdraws $50:
read balance; $300
if balance > amount then
balance = balance - amount; $250
write balance; $250
Final balance = $300
Homer withdraws $100:
Marge withdraws $50:
read balance; $400
read balance; $400
If balance > amount then
balance = balance - amount; $350
write balance; $350
if balance > amount then
balance = balance - amount; $300
write balance; $300
Final balance = $350
Homer withdraws $100:
Marge withdraws $50:
read balance; $400
read balance; $400
if balance > amount then
balance = balance - amount; $300
write balance; $300
if balance > amount then
balance = balance - amount; $350
write balance; $350
Concurrency control in DBMS
• Similar to concurrent programming problems
– But data is not all in main-memory
• Appears similar to file system concurrent
access?
– Approach taken by MySQL initially; now
MySQL offers better alternatives
• But want to control at much finer granularity
• Or else one withdrawal would lock up all
accounts!
Recovery in DBMS
• Example: balance transfer
decrement the balance of account X
by $100;
increment the balance of account Y
by $100;
• Scenario 1: Power goes out after the first
instruction
• Scenario 2: DBMS buffers and updates data in
memory (for efficiency); before they are written
back to disk, power goes out
• Log updates; undo/redo during recovery
DataBase Management System (DBMS)
High-level
Query Q
Answer
DBMS
Keeps data safe
and correct
despite failures,
concurrent
updates, online
processing, etc.
Data
Translates Q into
best execution plan
for current conditions,
runs plan
Summary of modern DBMS features
• Persistent storage of data
• Logical data model; declarative queries and
updates ! physical data independence
• Multi-user concurrent access
• Safety from system failures
• Performance, performance, performance
– Massive amounts of data (terabytes ~
petabytes)
– High throughput (thousands ~ millions
transactions per minute)
– High availability (¸ 99.999% uptime)
Modern DBMS Architecture
Applications
SQL
DBMS
Parser
Logical query plan
Query Optimizer
Physical query plan
Query Executor
Access method API calls
Storage Manager
Storage system API calls File system API calls
OS
Disk(s)
World of “Big Data”
• Numbers reported by Google from 2007:
– Data processed per month is 400 PB (PetaBytes)
– Average job size is 180 GB
• For 180 GB of data, it takes:
– 30 minutes to read from disk (@100 MB/s)
– 600 minutes to download at 5 MB/s
• Big data is hard to move (but easy to store –
few cents per GB)
• Can throw parallelism at the problem
Word Count over a Given Set of
Web Pages
see bob throw
see spot run
see
bob
throw
see
spot
run
1
1
1
1
1
1
bob
run
see
spot
throw
Can we do word count in parallel?
1
1
2
1
1
The MapReduce Framework
(pioneered by Google)
Automatic Parallel Execution in
MapReduce
Handles failures automatically, e.g., restarts tasks if a
node fails; runs multiples copies of the same task to
avoid a slow task slowing down the whole job
Course Outline
• Principles of query processing (25%)
– Indexes
– Query execution plans and operators
– Query optimization
• Data storage (15%)
– Databases Vs. filesystems (Google FileSystem, Hadoop Distributed
FileSystem)
– Row-oriented Vs. column-oriented storage
– Flash memory and Solid State Drives
• Scalable data processing (30%)
– Parallel query plans and operators
– Systems based on MapReduce
– Scalable key-value stores
• Concurrency control and recovery (15%)
– Consistency models for data (ACID, BASE, Serializability)
– Write-ahead logging
• Information retrieval and Data mining (15%)
– Web search (Google PageRank, inverted indexes)
– Association rules and clustering
Course Logistics
• Useful reference: Database Systems: The
Complete Book, by H. Garcia-Molina, J. D. Ullman,
and J. Widom
• Web site: http://www.cs.duke.edu/courses/fall09/cps216
• Grading:
– Project 40%
– Homework Assignments 15%
– Midterm 20%
– Final 25%
Summary: Data Management is
Important
• Core aspect of most sciences and
engineering today
• Core need in industry (esp., “big data”)
• Cool mix of theory and systems
• Chances are you will find something
interesting even if you primary interest is
elsewhere