The End of an Architectural Era (It’s Time for a Complete Rewrite)
M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem,
and P. Helland
VLDB, 2007
Presented by: Suprio Ray
The I/O Gap
• Disk capacity doubles every 18 months
• Memory size doubles every 18 months
• Disk bandwidth doubles every 10 years
(R. Freitas et al., FAST 2008)
• Memory (latency) is ~6,000 times faster than disk
• Avoid accessing disk (if possible)
One size does not fit all
• OLTP
– Amazon: 42 TB
– Typical: less than a TB
• Data Warehouse
– Yahoo: 2 PB
– eBay: 1.4 PB
• Search engines (text)
– Google: 850 TB
• Scientific
– US Department of Energy (NERSC): 3.5 PB
• Stream processing

Goal: Build a custom, high-performance OLTP database
Overview
• Motivation
• OLTP overheads
• System architecture
• Transaction management
• Evaluation
• Conclusion and discussion
Database System Architecture
[Architecture diagram: an SQL query enters the Parser, which emits relational algebra to the Query Rewriter and Optimizer; the resulting query execution plan runs on the Execution Engine, which issues read/write calls into Transaction Management. Query processing consults Statistics, Catalogs, and System Data. Transaction Management comprises the Buffer Manager (caching data and indexes), the Transaction Manager, the Concurrency Controller (with its Lock Table), and the Recovery Manager (with its Log).]
OLTP Overheads
• Logging
– Log records must be written to disk for durability
• Locking
– Acquired to read or write a record
• Latching
– Protects updates to shared data structures
• Buffer management
– Caches disk pages in memory
Design considerations to remove overheads

Optimization                   Advantage
Memory-resident database       Removes buffer management
Partitioning and replication   High availability; removes logging
Single-threaded execution      Removes locking and latching
Transaction variants           Remove concurrency control
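
To make the table concrete, here is a minimal Python sketch of how the first three rows compose: the partition is memory-resident (no buffer manager) and served by a single thread (no locks or latches). The Site class, table names, and transaction format are invented for illustration, not H-Store's actual interfaces.

class Site:
    """A memory-resident partition served by exactly one thread:
    transactions run serially to completion, so there is no buffer
    pool to manage and no locks or latches to take."""
    def __init__(self):
        self.tables = {"accounts": {1: {"balance": 100}}}  # all in RAM
        self.inbox = []                                    # queued transactions

    def submit(self, txn):
        self.inbox.append(txn)

    def run(self):
        # The single thread drains the queue; each transaction finishes
        # before the next one starts, so nothing ever races.
        while self.inbox:
            txn = self.inbox.pop(0)
            txn(self.tables)

def debit(tables):   # a transaction: plain code over RAM-resident tables
    tables["accounts"][1]["balance"] -= 10

site = Site()
site.submit(debit)
site.run()
print(site.tables["accounts"][1]["balance"])   # 90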
H-Store system architecture
• Shared-nothing, main-memory, row-store relational database
• Node
– hosts 1 or more sites
• Site
– single-threaded
– one site per core
• Relation
– divided into one or more partitions, or cloned
• Partition
– replicated and hosted on multiple sites
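
As a rough illustration of this layout, the sketch below hash-partitions keys and replicates each partition on k sites. The placement rule is invented; it is not the paper's actual partitioning algorithm.

import zlib

def partition_for(key, num_partitions):
    # Deterministic hash partitioning on the key (invented rule).
    return zlib.crc32(key.encode()) % num_partitions

def replica_sites(partition, num_sites, k=2):
    # Host each partition on k consecutive sites, wrapping around,
    # so every partition is replicated on multiple sites.
    return [(partition + i) % num_sites for i in range(k)]

num_partitions, num_sites = 4, 4
for key in ["cust-17", "cust-42"]:
    p = partition_for(key, num_partitions)
    print(key, "-> partition", p, "on sites", replica_sites(p, num_sites))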
Runtime model
• Stored procedure interface for transactions
– Unique name
– Control and SQL commands
• SQL command execution
– the execution plan is annotated
– and passed to the Transaction Manager
– plans are transmitted to the executing sites
– results are passed back to the initiator
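
The sketch below shows the general shape of such a stored procedure: a uniquely named, parameterized routine mixing control logic with SQL, executed entirely inside the engine. Python and sqlite3 are stand-ins here; this is not H-Store's actual procedure interface, and the invoice example is invented.

import sqlite3

def pay_invoice(conn, invoice_id, amount):
    # A uniquely named procedure: control logic plus parameterized SQL,
    # executed entirely on the server side.
    (due,) = conn.execute("SELECT due FROM invoices WHERE id = ?",
                          (invoice_id,)).fetchone()
    if amount >= due:                                  # control command
        conn.execute("UPDATE invoices SET due = 0 WHERE id = ?",
                     (invoice_id,))
    else:
        conn.execute("UPDATE invoices SET due = due - ? WHERE id = ?",
                     (amount, invoice_id))
    conn.commit()
    return due - amount                # result passed back to the initiator

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, due REAL)")
conn.execute("INSERT INTO invoices VALUES (1, 50.0)")
print(pay_invoice(conn, 1, 20.0))   # 30.0 still due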
System deployment
• The cluster deployment framework (CDF) accepts
– a set of stored procedures
– a database schema
– a sample workload
– the available sites
• The CDF produces
– a set of compiled stored procedures
– a physical DB layout
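
A small sketch of the CDF's interface as described above; all field names are invented to show the shape of the inputs and outputs, not the framework's real API.

from dataclasses import dataclass

@dataclass
class DeploymentInput:              # what the CDF accepts
    stored_procedures: list         # one per transaction class
    schema: str                     # the database schema (DDL)
    sample_workload: list           # drives partitioning decisions
    sites: list                     # the available sites

@dataclass
class DeploymentOutput:             # what the CDF produces
    compiled_procedures: dict       # procedure name -> executable plan
    physical_layout: dict           # table -> partition/replica placement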
Transaction variants
• Single-sited
- All queries can be executed on just one node
• One-shot
- Individual queries can each be executed on a single node
• Two-phase
- A read-only phase 1 decides; phase 2 can be executed without integrity violation
• Strongly two-phase
- Either all replicas continue or all abort
• Sterile
- Order of execution doesn’t matter
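
To illustrate the two-phase class: phase 1 performs only reads and integrity checks (the only point where the transaction can abort), and phase 2 performs only writes that cannot violate integrity. The inventory example is invented.

def place_order(inventory, item, qty):
    # Phase 1: read-only integrity check; the only point of abort.
    if inventory.get(item, 0) < qty:
        return "abort"
    # Phase 2: writes that, given phase 1 passed, cannot violate integrity.
    inventory[item] -= qty
    return "commit"

inventory = {"widget": 5}
print(place_order(inventory, "widget", 3))   # commit
print(place_order(inventory, "widget", 9))   # abort, phase 1 rejects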
Transaction management
• Replica synchronization
– Read any replica; update all replicas
• Transaction ordering
– Each transaction is timestamped with (site_id, local_unique_timestamp)
• Concurrency control considerations
– OLTP transactions are very short-lived
– Single-threaded execution avoids page latching
– Not needed for some transaction classes (single-sited / one-shot / sterile)
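
A sketch of the timestamping scheme: each site pairs a local monotonic counter with its site id, which makes every timestamp unique cluster-wide. Comparing the counter first and breaking ties by site id is an assumption here; the talk only specifies the two components.

import itertools

class SiteClock:
    def __init__(self, site_id):
        self.site_id = site_id
        self.counter = itertools.count()   # local unique timestamp source

    def next_timestamp(self):
        # (local counter, site id): unique cluster-wide; sorting the
        # pairs yields a total order over all transactions.
        return (next(self.counter), self.site_id)

s1, s2 = SiteClock(1), SiteClock(2)
t_a, t_b, t_c = s1.next_timestamp(), s2.next_timestamp(), s1.next_timestamp()
print(sorted([t_c, t_b, t_a]))   # [(0, 1), (0, 2), (1, 1)]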
Concurrency control strategy
• Basic strategy
– Wait a small amount of time for conflicting transactions with lower timestamps
– If none is found, execute the subplan and send the result
– Else, issue an abort
• Intermediate strategy
– Wait for a length of time approximated by MaxD * average_round_trip_message_delay
• Advanced strategy
– If needed, abort a transaction using optimistic concurrency control rules
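
The sketch below combines the basic and intermediate strategies: wait up to roughly MaxD * average_round_trip_message_delay for a conflicting transaction with a lower timestamp; if none appears, execute the subplan, otherwise abort. The pending-set representation and the constants are invented.

import time

def run_subplan(subplan, ts, pending, max_d=2, avg_rtt=0.005):
    # Wait up to ~MaxD * average_round_trip_message_delay for a
    # conflicting transaction with a lower timestamp to appear.
    deadline = time.time() + max_d * avg_rtt
    while time.time() < deadline:
        if any(other < ts for other in pending):
            return "abort"           # a lower-timestamped conflict exists
        time.sleep(0.001)
    return subplan()                 # none appeared: safe to execute

pending = set()                      # timestamps of conflicting transactions
print(run_subplan(lambda: "result", ts=(5, 1), pending=pending))   # result
pending.add((3, 2))
print(run_subplan(lambda: "result", ts=(5, 1), pending=pending))   # abort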
Evaluation – experimental setup
• Benchmark: a variant of TPC-C
– all transaction classes made one-shot and strongly two-phase
– all transaction classes implemented as stored procedures
• Databases
– H-Store
– a popular commercial RDBMS, "X"
• Hardware
– dual-core 2.8 GHz system
– 4 GB RAM
– 4 x 250 GB SATA disk drives
Evaluation – results
• Metric: transactions/second per core
• H-Store is 82 times faster than X
[Bar chart, "Transactions/sec per core" per database: H-Store ≈ 35,000; X ≈ 425; X without logging ≈ 1,250; best TPC-C record* ≈ 1,000]
* performance record published by TPC-C
H-Store limitations
• The database must fit into the available memory
• A cluster-wide power failure can cause the loss of committed transactions
• Only a limited subset of SQL-99 is supported
– DDL operations like ALTER and DROP aren't supported
• Challenging operational model
– Changing the schema or reconfiguring hardware requires first saving and shutting down the system
• No WAN support (single data center)
– In case of a network partition, some queries will not execute
Conclusion
• Predicts the demise of the general-purpose database
• H-Store is a custom, main-memory database optimized for OLTP
• H-Store shows a significant performance advantage over a popular commercial relational database
Discussion
• Raw speed vs. ease of use
– Limited DDL support; changing the schema or nodes requires a restart
• “Separation of concerns”
– Is it a good idea to embed application logic in stored procedures?
• Custom vs. general-purpose query language
– Is SQL to be replaced with Ruby-on-Rails?
• No WAN support: single data-center assumption
– CAP theorem
• Catastrophic failure scenario