H-Store - CUHK CSE


CSCI5570 Large Scale Data
Processing Systems
NewSQL
James Cheng
CSE, CUHK
Slide Ack.: modified based on the slides from Joy Arulraj
The End of an Architectural Era
(It’s time for a complete rewrite)
Michael Stonebraker, Samuel Madden, Daniel J. Abadi,
Stavros Harizopoulos, Nabil Hachem, Pat Helland
VLDB 2007
2
Outline
• The current state of the world
• Why current architecture is aged
• How to beat it by a factor of 50 in every
market I can think of
• Implications for the research community
3
Motivation
• System R (1974)
– Seminal database design from IBM
– First implementation of SQL
• Hardware has changed a lot over 3 decades
– Databases still based on System R’s design
– Includes DB2, SQL server, etc.
4
System R Architectural Features
• Disk oriented storage and indexing structures
• Multi-threading to hide latency
• Locking-based concurrency control
mechanisms
• Log-based recovery
5
Current DBMS Standard
• Store fields in one record contiguously on disk
• Use B-tree indexing
• Use small (e.g. 4K) disk blocks
• Align fields on byte or word boundaries
• Conventional (row-oriented) query optimizer
and executor
6
OLTP Bottlenecks
7
OLTP: Where does the time go?
8
Terminology – “Row Store”
Record 1
Record 2
Record 3
Record 4
E.g. DB2, Oracle, Sybase, SQLServer, …
9
Row Stores
• Can insert and delete a record in one physical
write
• Good for business data processing
• And that was what System R and Ingres were
gunning for
10
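The "one physical write" point follows from the contiguous layout. A minimal Python sketch, with a hypothetical fixed-width schema chosen only for illustration:

```python
import struct

# Hypothetical fixed-width schema: (id: int32, balance: int32, name: 16 bytes)
RECORD_FMT = "<ii16s"
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 24 bytes per record

def pack_record(rec_id, balance, name):
    # All fields of one record sit contiguously, so inserting a record
    # is a single contiguous write of RECORD_SIZE bytes.
    return struct.pack(RECORD_FMT, rec_id, balance, name.encode())

page = bytearray()
page += pack_record(1, 100, "alice")   # one physical write per insert
page += pack_record(2, 250, "bob")

# Reading record i back is one contiguous slice at offset i * RECORD_SIZE:
rec_id, balance, name = struct.unpack_from(RECORD_FMT, page, 1 * RECORD_SIZE)
```

This is exactly the layout business data processing favors: whole-record inserts, deletes, and lookups touch one contiguous region.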
Extensions to Row Stores Over the Years
• Architectural stuff (Shared nothing, shared
disk)
• Object relational stuff (user-defined types and
functions)
• XML stuff
• Warehouse stuff (materialized views, bit map
indexes)
• …
11
At This Point, RDBMS is aged
• There are at least 4 (non trivial) markets
where a row store can be clobbered by a
specialized architecture (CIDR 07 paper)
– Warehouses (Vertica, SybaseIQ, KX, …)
– Text (Google, Yahoo, …)
– Scientific data (MatLab, ASAP, …)
– Streaming data (StreamBase, Coral8, …)
12
At This Point, RDBMS is aged
• Leaving RDBMS with only the OLTP market
(i.e., business data processing)
• But they are no good at that either!!!!!!
13
New OLTP Proposal
• First part
– Main memory
– No multi-threading
– Grid orientation
– Transient undo and no redo
– No knobs
• Second part
– JDBC/ODBC overhead
– Concurrency control
– Undo
– 2 phase commit
14
New OLTP Proposal
• First part
– Main memory
– No multi-threading
– Grid orientation
– Transient undo and no redo
– No knobs
• Second part
– JDBC/ODBC overhead
– Concurrency control
– Undo
– 2 phase commit
15
Main Memory Deployment
• 1970’s: disk
• Now: main memory
TPC-C is 100 Mbytes per warehouse; 1000 warehouses
is a HUGE operation, i.e. 100 Gbytes,
i.e. it fits in main memory
16
Main Memory Deployment
• 1970’s: terminal operator
• Now: unknown client over the web
Cannot allow user stalls inside a transaction!!!!!
Hence, there are no user stalls or disk stalls!!!!!
17
No Multi-threading
• Heaviest TPC-C Xact reads/writes 400 records
– Less than 1 msec!!
• Run all commands to completion; single
threaded
• Dramatically simplifies DBMS
– No concurrent B-tree, no multi-threaded data
structure
– No pool of file handles, buffers, threads, … => no
resource governor!
18
Grid Computing
• Obviously cheaper
• Obvious wave of the foreseeable future
(replacing shared disk)
• Horizontally partition data
– Shared nothing query optimizer and executor
• Add/delete sites on the fly required
High end OLTP has to “scale out” not “scale up”
19
High Availability
• 1970’s: disaster recovery was “tape shipping”
• Now: 7 x 24 x 365 no matter what
• Some organizations today run a hot standby: a
second machine sits idle waiting to take over if
first one fails => only half of the resource used
20
Peer-to-Peer HA
• Multiple machines in a peer-to-peer
configuration
• OLTP load dispersed across multiple machines
• Inter-machine replication used for fault
tolerance
• Redundancy (at the table level) in the grid
• Optimizer chooses which instance of a table to
read, writes all instances (transactionally)
21
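The read-one/write-all replication on this slide can be sketched in a few lines; the class and names are hypothetical, and real H-Store applies writes transactionally rather than one key at a time:

```python
import random

class ReplicatedTable:
    """Table-level redundancy sketch: reads go to any one replica (so
    OLTP load disperses across machines), while writes go to every
    replica so all copies stay identical."""
    def __init__(self, num_replicas):
        self.replicas = [dict() for _ in range(num_replicas)]

    def read(self, key):
        # The optimizer may choose any instance of the table to read.
        return random.choice(self.replicas).get(key)

    def write(self, key, value):
        # Writes are applied to all instances.
        for replica in self.replicas:
            replica[key] = value

t = ReplicatedTable(num_replicas=3)
t.write("acct-1", 500)
```

Because every replica applies every write, any single machine can fail and the others still hold a complete copy of the table.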
Logging
• Undo log for roll-back if a transaction fails, but
can be deleted on transaction commit
• No redo log, simply recover data from an
operational site
22
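The "transient undo, no redo" idea can be made concrete with a small sketch over a hypothetical in-memory key-value store: remember old values while the transaction runs, discard them on commit (nothing ever goes to disk), replay them backwards on abort.

```python
class Transaction:
    """Transient-undo sketch: the undo list lives in main memory only."""
    def __init__(self, db):
        self.db = db
        self.undo = []                 # in-memory (key, old_value) pairs

    def write(self, key, value):
        self.undo.append((key, self.db.get(key)))  # remember old value
        self.db[key] = value

    def commit(self):
        self.undo.clear()              # transient undo: simply thrown away

    def abort(self):
        for key, old in reversed(self.undo):       # roll back newest-first
            if old is None:
                self.db.pop(key, None)  # key did not exist before
            else:
                self.db[key] = old
        self.undo.clear()

db = {"x": 1}
t = Transaction(db)
t.write("x", 2)
t.write("y", 9)
t.abort()                              # rolls back; db is {"x": 1} again
```

Durability comes not from a redo log but from the replicas: a recovering site copies current data from an operational peer.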
Recovery in a K-safe Environment
• Restore dead site
• Query up sites for live data
• When up to speed, join the grid
• Stop if you lose K+1 sites
• No redo log!!!!
Vertica has shown this to be perfectly workable
23
No Knobs
• RDBMSs have a vast array of complex tuning
knobs
• New design should be self-everything
– self-healing
– self-maintaining
– self-tuning
– self-…
24
Main Sources of Overhead in RDBMS
• Disk I/O (gone)
• Resource control (gone)
• Synchronization (gone)
• Latching with multi-threaded data structures (gone)
• Redo log (gone)
• Undo log (but in main memory and discarded on commit)
• JDBC/ODBC interface
• Dynamic locking for concurrency control
• 2 phase commit (for multi-site updates and copies)
25
New OLTP Proposal
• First part
– Main memory
– No multi-threading
– Grid orientation
– Transient undo and no redo
– No knobs
• Second part
– JDBC/ODBC overhead
– Concurrency control
– Undo
– 2 phase commit
26
H-Store System Architecture
• A grid of computers
• Rows of tables are placed contiguously in
main memory, with B-tree indexing
• Each H-Store site is single-threaded
• Multi-cores => multiple logical sites (one site
for each core) per physical site
• Main memory on physical site partitioned
among logical sites
27
Stored Procedures
• OLTP has changed
– 1970’s: conversational transactions
– Now: can ask for all of them in advance
• Applications use stored procedures =>
JDBC/ODBC overhead (gone)
28
Transaction Classes
• Classify transactions in OLTP applications into
classes and use their properties
• Get all transaction classes in advance
– Instances differ by run-time parameters
• Construct a physical database design (manually
now; automatically in the future)
– Table partitioning
– Table-level replication
• Create a query plan for each class
29
Transaction Classes
• Example
– Class : “Insert record in History where customer =
$(customer-Id) ; more SQL statements ;”
– Runtime instance supplies $(customer-Id), etc.
• Each transaction class has certain properties
– Optimize concurrency control protocols
– And commit protocols
30
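In code, a transaction class is just a parameterized stored procedure registered in advance, with the run-time instance supplying values such as $(customer-Id). A minimal sketch; the table layout and procedure name are hypothetical:

```python
# A transaction class known at definition time: the SQL (here, plain
# Python over a dict-based store) is fixed, only parameters vary.
def insert_history(db, customer_id, amount):
    db["history"].append({"customer": customer_id, "amount": amount})
    # ... the rest of the pre-planned statements of the class follow ...

db = {"history": []}
insert_history(db, customer_id=42, amount=10.0)  # a run-time instance
```

Because the whole class is known ahead of time, H-Store can compile a query plan for it once and pick concurrency-control and commit protocols per class.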
Transaction Classes
• Transaction classes
– Constrained tree applications
– Single-site transactions
– One-shot transactions
– Two-phase transactions
– Sterile transactions
• Prevalent in major commercial online retail
applications
• H-Store makes use of their properties
31
Constrained Tree Application
Every transaction has equality predicates
on the primary key of the root node
(Tree schema: Customer at the root, Order below it,
Order Line below Order; each of Partition 1 and
Partition 2 holds one subtree of the root’s descendants)
32
Single-sited transactions
• All queries hit same partition
• Every transaction run to completion at a single
site
• Constrained Tree Application
– Root table can be horizontally hash-partitioned
– Collocate corresponding shards of child tables
– No communication between partitions
33
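The partitioning rule above is easy to sketch: hash-partition the root table on its primary key, and place each child shard at the same site as its root shard by applying the same function to the inherited key. A minimal illustration (modulo stands in for the hash function; the tables are hypothetical):

```python
NUM_PARTITIONS = 2

def site_of(customer_id):
    # The root table (Customer) is hash-partitioned on its primary key.
    return customer_id % NUM_PARTITIONS

customers = [{"cid": 7}, {"cid": 8}]
orders = [{"cid": 7, "oid": 1}, {"cid": 8, "oid": 2}]

sites = [{"customers": [], "orders": []} for _ in range(NUM_PARTITIONS)]
for c in customers:
    sites[site_of(c["cid"])]["customers"].append(c)
for o in orders:
    sites[site_of(o["cid"])]["orders"].append(o)  # collocated with its root
```

Since every transaction in a CTA has an equality predicate on the root key, it resolves to exactly one site and runs there with no inter-partition communication.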
Single-sited transactions
• CTAs are common in OLTP
• Making non-CTAs single-sited
– remove the read-only tables from the schema and
check if the application then becomes a CTA
– if yes, replicate these tables at all sites
• One-shot transactions
34
One-shot transactions
• One-shot transaction
– execute in parallel without requiring intermediate
results to be communicated among sites
– no inter-query dependencies
• One-shot transaction can be decomposed into
single-sited plans
• Transactions in many applications can often be
made one-shot by
– vertical partitioning of tables among sites
– replicating read-only columns
35
Two-phase classes
• A transaction class is two-phase if
– Phase 1 : read-only operations (transaction may
be aborted based on the query results)
– Phase 2 : queries and updates can’t violate
integrity
• A transaction class is strongly two-phase if
– Two phase and
– Phase 1 operations on all sites produce the same
result wrt aborting or continuing
36
Sterile classes
• Two concurrent transactions commute when
any interleaving of their single-site sub-plans
produces the same final database state as any
other interleaving (if both commit)
• A transaction class is sterile if
– Its transactions commute with transactions of all
other classes (including itself)
37
Query Execution
• Cost-based query optimizer producing query
plans based on transaction classes at
transaction definition time
– Single-sited: dispatch to the appropriate site
– One-shot: decompose into a set of plans, each
executed at a single site
• Standard run-time optimizer for general
transactions as sites may communicate
38
Transaction Management
• Every transaction receives a timestamp
(site_id, local_unique_timestamp)
• Given an ordering of sites, timestamps are
unique and form a total order
39
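A sketch of the timestamp scheme on this slide: each site pairs its globally ordered site id with a local counter, so stamps are unique and ordinary tuple comparison yields a total order.

```python
import itertools

class Site:
    """Issues timestamps of the form (site_id, local_unique_timestamp).
    Site ids are globally ordered and local timestamps never repeat,
    so every pair is unique and tuple comparison is a total order."""
    def __init__(self, site_id):
        self.site_id = site_id
        self._counter = itertools.count()

    def next_timestamp(self):
        return (self.site_id, next(self._counter))

a, b = Site(0), Site(1)
stamps = [a.next_timestamp(), b.next_timestamp(), a.next_timestamp()]
```

No coordination between sites is needed to assign stamps, which is what lets replicas independently agree on one execution order.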
Transaction Management
• Single-sited or one-shot
– each transaction dispatched to replica sites and
executed to completion
– unless sterile, each execution site waits a small period
of time (account for network delays) for transactions
to arrive => execution in timestamp order => all
replicas updated in the same order => identical
outcome at each replica, all commit or all abort => no
data inconsistency!
– no redo log, no concurrency control, no distributed
commit processing!
40
Transaction Management
• Two-phase
– no integrity violation
– no undo log
– if also single-sited/one-shot, no transaction
facilities at all!
41
Transaction Management
• Sterile
– no concurrency control needed
– no need to enforce an execution order of transactions
– no guarantee that all sites make the same abort/continue decision
• workers respond “abort” or “continue”
• execution supervisor communicates the info to worker
sites
• standard distributed commit processing needed unless the
transaction is strongly two-phase
42
Transaction Management
• Non-sterile, non single-sited, non one-shot
– do not use dynamic locking (expensive for short-lived
transactions) for transaction consistency, instead:
– first run with basic strategy
– if too many aborts, run intermediate strategy
– if still too many aborts, escalate to advanced
strategy
43
Transaction Management
• Basic Strategy
– timestamp ordering of subplan pieces
– wait for “small period of time” to preserve
timestamp order
– execute the subplan
– if no abort, continue with next subplan
– if no more subplan, commit
44
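The basic strategy can be sketched as a priority queue per site: buffer arriving subplan pieces briefly, then execute in timestamp order. The real system waits a small period of wall-clock time to absorb network reordering; in this hypothetical sketch the wait is modeled as releasing everything stamped before a watermark.

```python
import heapq

class BasicStrategyQueue:
    """Buffers subplan pieces and releases them in timestamp order."""
    def __init__(self):
        self._heap = []

    def arrive(self, timestamp, subplan):
        heapq.heappush(self._heap, (timestamp, subplan))

    def release_before(self, watermark):
        # Everything stamped before the watermark has "waited long
        # enough" and is executed in timestamp order.
        ready = []
        while self._heap and self._heap[0][0] < watermark:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

q = BasicStrategyQueue()
q.arrive((0, 2), "piece-B")  # arrived first, but stamped later
q.arrive((0, 1), "piece-A")
ready = q.release_before((0, 3))
```

Even though piece-B arrived first, the queue emits piece-A first, preserving the global timestamp order.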
Transaction Management
• Intermediate Strategy
– increase wait time to sequence the subplans,
thereby lowering abort probability
• Advanced Strategy
– track read set and write set of each transaction at
each site
– worker site runs each subplan, and aborts (if
necessary) by standard optimistic concurrency
control rules
45
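The advanced strategy's validation step can be sketched with the standard optimistic rule: abort a transaction whose read set intersects the write set of a transaction that committed while it was running. This hypothetical sketch collapses timing and treats every earlier commit as overlapping:

```python
class Worker:
    """Tracks read/write sets and validates by the optimistic rule."""
    def __init__(self):
        self.committed_write_sets = []  # write sets of overlapping commits

    def run(self, reads, writes):
        for ws in self.committed_write_sets:
            if ws & set(reads):
                return "abort"          # validation failed: stale read
        self.committed_write_sets.append(set(writes))
        return "commit"

w = Worker()
first = w.run(reads=["x"], writes=["y"])   # nothing to conflict with
second = w.run(reads=["y"], writes=["z"])  # read "y" after it was written
```

The point of escalating to this strategy only when aborts are frequent is that tracking read and write sets costs something on every transaction, while the basic strategy costs almost nothing.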
Results
• H-Store
– Targets OLTP workload
– Shared-nothing main memory database
• TPC-C benchmark
– All classes made two-phase → No coordination
– Replication + Vertical partitioning → One-shot
– All classes still sterile in this schema → No waits
46
Results
RDBMS: a very popular commercial RDBMS
Best TPC-C: best record on TPC website
47
TPC-C Performance on a Low-end Machine
• A very popular commercial RDBMS (or the
elephants)
– 850 transactions/sec
• H-Store
– 70,416 transactions/sec
Factor of 82!!!!!
48
Implications for the Elephants
• They are selling “one size fits all”
• Which is 30 year old legacy technology that is
good at nothing
• The elephants: a collection of OLTP systems,
connected to ETL, and connected to one or more
data warehouses
– ETL: extract-transform-load, used to convert OLTP
data to a common format and load it into a data
warehouse (for BI queries)
49
The DBMS Landscape –
Performance Needs
A 2-D landscape of performance needs: OLTP at the
simple-operations, write-focus end; Data Warehouse at
the complex-operations, read-focus end; other apps in between
50
One Size Does Not Fit All
The same landscape, populated with specialized engines:
Vertica/C-Store at the Data Warehouse end (complex operations,
read-focus), H-Store at the OLTP end (simple operations,
write-focus), and Big Table, open source systems, etc. in between.
The elephants get only “the crevices”
51
Summary
• “One size fits all” databases excel at nothing
• H-Store
– Clean design for OLTP domain from scratch
52