
Scalable, Consistent, and Elastic Database Systems for Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
[email protected]

Web replacing Desktop
Paradigm shift in Infrastructure
Cloud computing
• Computing infrastructure and solutions delivered as a service
◦ Industry worth USD 150 billion by 2014*
• Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing
• Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm

Databases for cloud platforms
Data is central to applications
• DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
• Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence

Application landscape
• Social gaming
• Rich content and mash-ups
• Managed applications
• Cloud application platforms

Challenges for OLTP systems
• Scalability
◦ While ensuring efficient transaction execution!
• Lightweight Elasticity
◦ Scale on-demand!
• Self-Manageability
◦ Intelligence without a human controller!

Two approaches to scalability
• Scale-up
◦ Preferred in the classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
• Scale-out
◦ Cloud friendly (key-value stores)
◦ Execution at a single server
▪ Limited functionality & guarantees
◦ No multi-row or multi-step transactions

Why care about transactions?

confirm_friend_request(user1, user2)
{
  begin_transaction();
  update_friend_list(user1, user2, status.confirmed);
  update_friend_list(user2, user1, status.confirmed);
  end_transaction();
}

Simplicity in application design with ACID transactions

// Without transactions, the application must handle partial failures itself.
// Option A: compensate by reverting the first update if the second fails.
confirm_friend_request_A(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch(exception e) {
    report_error(e);
    return;
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch(exception e) {
    revert_friend_list(user1, user2);
    report_error(e);
    return;
  }
}

// Option B: enqueue the failed update and retry it later.
confirm_friend_request_B(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch(exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch(exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
  }
}

Challenge: Transactions at Scale
[Figure: key-value stores offer scale-out but not ACID transactions; RDBMSs offer ACID transactions but not scale-out. The challenge is to provide both.]

Challenge: Lightweight Elasticity
Provisioning on-demand and not for peak
[Figure: capacity vs. demand over time. Traditional infrastructures provision for peak capacity, leaving unused resources; deployment in the cloud tracks demand to optimize operating cost. Slide credits: Berkeley RAD Lab]

Challenge: Self-Manageability
• Managing a large distributed system
◦ Detecting failures and recovering
◦ Coordination and synchronization
◦ Provisioning
◦ Capacity planning
◦ …
“A large distributed system is a Zoo”
• Cloud platforms inherently multitenant
◦ Balance conflicting goals
▪ Minimize operating cost while ensuring good performance

Contributions for OLTP systems
• Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
• Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
• Self-Manageability
◦ Pythia [in progress]

Contributions
[Figure: map of contributions]
• Data Management
◦ Analytics: Ricardo [SIGMOD ’10]; Anonimos [ICDE ’10], [TKDE]
◦ Transaction Processing, dynamic partitioning: G-Store [SoCC ’10]; MD-HBase [MDM ’11] (Best Paper Runner-up)
◦ Transaction Processing, static partitioning: ElasTraS [HotCloud ’09], [TR ’10]; Albatross [VLDB ’11]; Zephyr [SIGMOD ’11]; Pythia [in progress]
• Novel Architectures
◦ Hyder [CIDR ’11] (Best Paper); CoTS [ICDE ’09], [VLDB ’09]; TCAM [DaMoN ’08]
This talk: ElasTraS, G-Store, Albatross, Zephyr, Pythia

Transactions at Scale
[Figure: key-value stores offer scale-out but not ACID transactions; RDBMSs offer ACID transactions but not scale-out.]

Scale-out with static partitioning
• Table-level partitioning (range, hash)
◦ Distributed transactions
• Partitioning the database schema
◦ Co-locate data items accessed together
◦ Goal: minimize distributed transactions

Scale-out with static partitioning
• Table-level partitioning (range, hash)
◦ Distributed transactions
• Partitioning the database schema
◦ Co-locate data items accessed together
◦ Goal: minimize distributed transactions
• Scaling out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
◦ Cloud SQL Server [ICDE 2011]
◦ Megastore [CIDR 2011]
◦ Relational Cloud [CIDR 2011]

Dynamically formed partitions
• Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration-based applications
◦ Scientific computing applications
• Not amenable to static partitioning
• How to get the benefit of partitioning when accesses do not statically partition?
◦ Ours is the first solution to allow that

Online Multi-player Games
[Figure: a player profile record with ID, Name, $$$, and Score]
• Execute transactions on player profiles while the game is in progress
• Partitions/groups are dynamic
• Hundreds of thousands of concurrent groups

Data Fusion for dynamic partitions [G-Store, SoCC 2010]
• Transactional access to a group of data items formed on-demand
• Challenge: avoid distributed transactions!
• Key Group Abstraction (sketched below)
◦ Groups are small
◦ Groups execute a non-trivial number of transactions
◦ Groups are dynamic and on-demand
• Groups are dynamically formed tenant databases

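A minimal sketch of what the Key Group abstraction could look like to an application. The interface, type names, and method signatures below are illustrative assumptions for this transcript, not the actual G-Store API.

// Hypothetical sketch of the Key Group abstraction (illustrative names only,
// not the actual G-Store API): a group is formed on demand over a set of keys,
// multi-key ACID transactions execute against it, and the group is deleted
// when no longer needed.
import java.util.List;

final class GroupHandle { final String groupId; GroupHandle(String id) { this.groupId = id; } }
class GroupingException extends Exception {}
class TxnAbortedException extends Exception {}

interface GroupContext {
    byte[] read(byte[] key) throws TxnAbortedException;   // reads restricted to group members
    void write(byte[] key, byte[] value) throws TxnAbortedException;
}

interface Transaction<T> {
    T run(GroupContext ctx) throws TxnAbortedException;
}

interface KeyGroupStore {
    // Form a group over the given keys; ownership of all keys is transferred
    // to a single (leader) node before the group becomes usable.
    GroupHandle createGroup(List<byte[]> keys) throws GroupingException;

    // Execute a multi-key ACID transaction entirely within one group.
    <T> T execute(GroupHandle group, Transaction<T> txn) throws TxnAbortedException;

    // Dissolve the group and return ownership of the keys to the followers.
    void deleteGroup(GroupHandle group);
}

In the gaming example above, a group would be created over the profiles of the players in a match, transactions on those profiles would execute against the group while the game is in progress, and the group would be deleted when the match ends.
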
Transactions on Groups
Without distributed transactions
• One key is selected as the leader
• Followers transfer ownership of their keys to the leader
→ Ownership of all keys in the group at a single node
[Figure: the grouping protocol turns individual keys into a Key Group owned by the leader.]

Why is group formation hard?
• Guarantee the contract between leaders and followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system
• How to ensure efficient and ACID execution of transactions?

Grouping protocol
[Figure: timeline of log entries and messages between the leader and the follower(s). On a create request the leader logs L(Creating) and sends J; a follower logs L(Joining), replies with JA, and logs L(Joined) once the leader acknowledges with JAA; the leader then logs L(Joined) and group operations proceed. On a delete request the leader logs L(Deleting) and sends D; the follower logs L(Free) and replies with DA; the leader logs L(Deleted).]
• Conceptually akin to “locking”
◦ Locks held by groups (a leader-side sketch follows)

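A simplified, leader-side sketch of the create/delete handshake above. The message tags (J, JA, JAA, D, DA) and logged states follow the figure, but the Log and Channel abstractions and the class below are assumptions for illustration, not G-Store's actual code.

// Illustrative sketch of the leader side of the grouping protocol.
// Message tags and log states follow the slide's figure; the Log and
// Channel abstractions are assumptions for this sketch.
import java.util.List;

enum GroupState { FREE, CREATING, JOINING, JOINED, DELETING, DELETED }
enum Msg { J, JA, JAA, D, DA }

interface Log { void append(String groupId, GroupState state); }           // durable log
interface Channel { void send(String followerId, Msg m, String groupId); } // messaging layer

class GroupLeader {
    private final Log log;
    private final Channel net;

    GroupLeader(Log log, Channel net) { this.log = log; this.net = net; }

    // Create request: log intent, then ask every follower to yield its key.
    void onCreateRequest(String groupId, List<String> followers) {
        log.append(groupId, GroupState.CREATING);
        for (String f : followers) net.send(f, Msg.J, groupId);
    }

    // A follower acknowledged the join (JA): confirm with JAA. Once all
    // followers have acknowledged, the group is JOINED and transactions
    // on the group can start.
    void onJoinAck(String groupId, String follower, boolean allAcked) {
        net.send(follower, Msg.JAA, groupId);
        if (allAcked) log.append(groupId, GroupState.JOINED);
    }

    // Delete request: log intent and release keys back to the followers.
    void onDeleteRequest(String groupId, List<String> followers) {
        log.append(groupId, GroupState.DELETING);
        for (String f : followers) net.send(f, Msg.D, groupId);
    }

    // All followers confirmed deletion (DA): the group no longer exists.
    void onAllDeleteAcks(String groupId) {
        log.append(groupId, GroupState.DELETED);
    }
}
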
Efficient transaction processing
• How does the leader execute transactions?
◦ Caches data for the group members → the underlying data store is equivalent to a disk
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation (execution path sketched below)
[Figure: the leader runs a transaction manager and a cache manager backed by a log; updates are propagated asynchronously to the followers.]

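A rough sketch of the execution path described above: reads are served from a cache backed by the key-value store, writes are logged for durability, and dirty entries are flushed asynchronously to propagate updates. The KeyValueStore, WriteAheadLog, and executor wiring are assumptions for illustration, not G-Store internals.

// Sketch of leader-side execution: the key-value store is treated like a
// disk behind a cache, writes are made durable via a log, and dirty entries
// are flushed to the store asynchronously. All types are illustrative.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface KeyValueStore { byte[] get(String key); void put(String key, byte[] value); }
interface WriteAheadLog { void logWrite(String key, byte[] value); }

class GroupLeaderExecutor {
    private final KeyValueStore store;                 // the "disk"
    private final WriteAheadLog wal;                   // durability
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Map<String, byte[]> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    GroupLeaderExecutor(KeyValueStore store, WriteAheadLog wal) {
        this.store = store;
        this.wal = wal;
        // Asynchronously propagate updates back to the underlying store.
        flusher.scheduleWithFixedDelay(this::flushDirty, 1, 1, TimeUnit.SECONDS);
    }

    byte[] read(String key) {
        // Serve from the cache; fall back to the key-value store on a miss.
        return cache.computeIfAbsent(key, store::get);
    }

    void write(String key, byte[] value) {
        wal.logWrite(key, value);   // transaction logging for durability
        cache.put(key, value);
        dirty.put(key, value);
    }

    private void flushDirty() {
        for (Map.Entry<String, byte[]> e : dirty.entrySet()) {
            store.put(e.getKey(), e.getValue());
            dirty.remove(e.getKey(), e.getValue());
        }
    }
}
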
Prototype: G-Store [SoCC 2010]
An implementation over key-value stores
[Figure: G-Store architecture — application clients issue transactional multi-key accesses to a grouping middleware layer (grouping layer + transaction manager) resident on top of the key-value store logic at each node, over distributed storage.]

G-Store Evaluation
• Implemented using HBase
◦ Added the middleware layer
◦ ~10,000 LOC
• Experiments in Amazon EC2
• Benchmark: an online multi-player game
• Cluster size: 10 nodes
• Data size: ~1 billion rows (>1 TB)
• For groups with 100 keys
◦ Group creation latency: ~10 – 100 ms
◦ More than 10,000 groups concurrently created

G-Store Evaluation
[Charts: group creation latency and group creation throughput]

Lightweight Elasticity
Provisioning on-demand and not for peak
[Figure: capacity vs. demand over time. Traditional infrastructures provision for peak capacity, leaving unused resources; deployment in the cloud tracks demand to optimize operating cost. Slide credits: Berkeley RAD Lab]

Elasticity in the Database tier
[Figure: a load balancer routing requests to the application/web/caching tier, which is backed by the database tier.]

Live database migration
• Migrate a database partition (or tenant) in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant systems
• Different from
◦ Migration between software versions
◦ Migration in case of schema evolution

VM migration for DB elasticity
• One DB partition per VM
◦ Pros: allows fine-grained load balancing
◦ Cons:
▪ Performance overhead
▪ Poor consolidation ratio [Curino et al., CIDR 2011]
• Multiple DB partitions in a VM
◦ Pros: good performance
◦ Cons: migrate all partitions → coarse-grained load balancing
[Figure: left, one VM per partition on a hypervisor; right, multiple partitions in a single VM.]

Live database migration
• Multiple partitions share the same database process
◦ Shared process multitenancy
• Migrate individual partitions on-demand in a live system
◦ Virtualization in the database tier
• Straightforward solution
◦ Stop serving the partition at the source
◦ Copy to the destination
◦ Start serving at the destination
◦ Expensive! (sketched below)

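A sketch of the straightforward stop-and-copy approach, to make its cost explicit: the partition is unavailable for the entire copy. The Node and Partition types are assumptions for illustration.

// Straightforward (and expensive) migration: stop-and-copy.
// The partition is unavailable from stopServing() until startServing(),
// i.e., for the whole duration of the copy. Types are illustrative.
interface Partition { byte[] snapshot(); void load(byte[] data); }
interface Node {
    void stopServing(String partitionId);
    void startServing(String partitionId);
    Partition getPartition(String partitionId);
}

class StopAndCopyMigration {
    void migrate(String partitionId, Node source, Node destination) {
        source.stopServing(partitionId);                       // downtime begins
        byte[] data = source.getPartition(partitionId).snapshot();
        destination.getPartition(partitionId).load(data);      // copy all persistent data
        destination.startServing(partitionId);                 // downtime ends
    }
}

The migration cost measures on the next slide quantify exactly this kind of overhead.
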
Migration cost measures
• Service unavailability
◦ Time the partition is unavailable
• Number of failed requests
◦ Number of operations failing / transactions aborting
• Performance overhead
◦ Impact on response times
• Additional data transferred

Two common DBMS architectures
• Decoupled storage architectures
◦ ElasTraS, G-Store, Deuteronomy, MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
• Shared nothing architectures
◦ SQL Azure, Relational Cloud, MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]

Why is live DB migration hard?
• Persistent data must be migrated (GBs)
◦ How to ensure no downtime?
• Nodes can fail during migration
◦ How to guarantee correctness during failures?
▪ Transaction atomicity and durability
▪ Recover migration state after failure
• Transactions execute during migration
◦ How to guarantee serializability?
▪ Transaction correctness equivalent to normal operation

Our approach: Zephyr [SIGMOD 2011]
• Migration executed in phases
◦ Starts with the transfer of minimal information to the destination (the “wireframe”)
• Database pages used as the granule of migration
◦ Unique page ownership
• Source and destination concurrently execute transactions in one migration phase
◦ Minimal transaction synchronization
• Guaranteed serializability
• Logging and handshaking protocols

Simplifying assumptions
• For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices
• Extensions in the paper [SIGMOD 2011]
◦ Relaxes these assumptions

Design overview
[Figure: the source owns all database pages P1, P2, P3, …, Pn and executes the active transactions TS1,…,TSk; the destination owns no pages.]

Init mode
Freeze indices and migrate wireframe
[Figure: the source still owns pages P1…Pn and continues executing transactions TS1,…,TSk; the destination receives the index wireframe with all pages un-owned.]

What is an index wireframe?
[Figure: the index structure is copied from the source to the destination without the database pages it points to; the destination starts with the wireframe and no page ownership.]

Dual mode
Requests for un-owned pages can block (page pull sketched below)
[Figure: old, still-active transactions TSk+1,…,TSl continue at the source while new transactions TD1,…,TDm execute at the destination; a page such as P3, accessed by destination transaction TDi, is pulled from the source. Index wireframes remain frozen.]

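A rough sketch of how the destination might serve a page access in dual mode: an access to an un-owned page blocks while the page is pulled from the source, and once pulled the page is owned by the destination and never pulled again. The types and method names are assumptions for illustration, not Zephyr's actual code.

// Illustrative sketch of dual-mode page access at the destination.
// Pages are the granule of migration and each page has a unique owner;
// an access to an un-owned page blocks while the page is pulled from the
// source. Types below are assumptions, not Zephyr internals.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface SourceNode { byte[] pullPage(int pageId); } // fetches the page and transfers ownership

class DestinationPageManager {
    private final SourceNode source;
    private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();

    DestinationPageManager(SourceNode source) { this.source = source; }

    // Called by a destination transaction (e.g., TDi accessing page P3).
    byte[] getPage(int pageId) {
        // If the page was already migrated, serve it locally; otherwise the
        // requesting transaction blocks while the page is pulled from the
        // source. computeIfAbsent ensures each page is pulled at most once,
        // and a pulled page is never returned to the source.
        return ownedPages.computeIfAbsent(pageId, source::pullPage);
    }
}
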
Finish mode
Pages can be pulled by the destination, if needed
[Figure: the remaining pages P1, P2, … are pushed from the source to the destination; transactions TDm+1,…,TDn execute at the destination; migration then completes.]

Normal operation
Index wireframe un-frozen
[Figure: the destination owns all pages P1…Pn and executes transactions TDn+1,…,TDp.]

Artifacts of this design
• Once migrated, pages are never pulled back by the source
◦ Abort transactions at the source that access migrated pages
• No structural changes to indices during migration
◦ Abort transactions (at both nodes) that make structural changes to indices
• Destination “pulls” pages on-demand
◦ Transactions at the destination experience higher latency compared to normal operation

Serializability
• Only concern is “dual mode”
◦ Init and Finish: only one node is executing transactions
• Local predicate locking of the internal index and exclusive page ownership → no phantoms
• Strict 2PL → transactions are locally serializable
• Pages transferred only once
◦ No Tdest → Tsource conflict dependency
• Guaranteed serializability

Recovery
• Transaction recovery
◦ For every database page, Tsrc → Tdst
◦ Recovery: transactions replayed in conflict order
• Migration recovery
◦ Atomic transitions between migration modes
▪ Developed logging and handshake protocols (sketched below)
◦ Every page has exactly one owner
▪ Bookkeeping at the index level

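A minimal sketch of atomic migration-mode transitions under the logging-and-handshake idea above: each transition is durably logged and acknowledged by the peer, so after a failure both nodes recover to the same migration mode. The enum, log, and peer interfaces are illustrative assumptions, not Zephyr's actual protocol.

// Sketch of atomic migration-mode transitions with logging and a handshake.
// Mode names follow the slides; the Log/Peer abstractions are assumptions.
enum MigrationMode { NORMAL, INIT, DUAL, FINISH }

interface MigrationLog {
    void logTransition(MigrationMode from, MigrationMode to);
    MigrationMode lastLoggedMode();
}
interface Peer { void notifyTransition(MigrationMode to); } // blocks until the peer acknowledges

class MigrationController {
    private final MigrationLog log;
    private final Peer peer;
    private MigrationMode mode;

    MigrationController(MigrationLog log, Peer peer) {
        this.log = log;
        this.peer = peer;
        // After a failure, resume from the last durably logged mode.
        this.mode = log.lastLoggedMode();
    }

    // Transition INIT -> DUAL -> FINISH -> NORMAL, one step at a time.
    void transitionTo(MigrationMode next) {
        log.logTransition(mode, next);   // durable before taking effect
        peer.notifyTransition(next);     // handshake: peer acknowledges the same mode
        mode = next;
    }
}
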
Correctness
• In the presence of arbitrary repeated failures, Zephyr ensures:
◦ Updates made to database pages are consistent
◦ A failure does not leave a page without an owner
◦ Both source and destination are in the same migration mode
• Guaranteed termination and starvation freedom

Implementation
• Prototyped using an open-source OLTP database, H2
◦ Supports standard SQL/JDBC API
◦ Serializable isolation level
◦ Tree indices
◦ Relational data model
• Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained using the index
◦ ~6000 LOC
• Tungsten SQL Router migrates JDBC connections during migration

Results Overview
• Downtime (partition unavailability)
◦ S&C: 3 – 8 seconds (needed to migrate; unavailable for updates)
◦ Zephyr: no downtime; either the source or the destination is available
• Service interruption (failed operations)
◦ S&C: ~100s – 1,000s; all transactions with updates are aborted
◦ Zephyr: ~10s – 100s; an order of magnitude less interruption
• Minimal operational and data transfer overhead

Failed Operations
[Chart: Zephyr incurs an order of magnitude fewer failed operations than stop-and-copy]

Concluding Remarks
• Major enabling technologies
◦ Scalable distributed database infrastructure
▪ ElasTraS
◦ Dynamically formed data partitions
▪ G-Store
◦ Live database migration
▪ Albatross, Zephyr

Future Directions
• Self-managing controller for large multitenant database infrastructures
• Novel data management architectures
◦ Leveraging advances in novel hardware
◦ Convergence of transactional and analytics systems for real-time intelligence
• Putting the human in the loop: leveraging crowd-sourcing

Thank you!
Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson