Scalable, Consistent, and
Elastic Database Systems for
Cloud Platforms
Sudipto Das
Computer Science, UC Santa Barbara
[email protected]
Sponsors:
Web replacing Desktop
Paradigm shift in Infrastructure
Cloud computing
Computing infrastructure and solutions delivered as a service
◦ Industry worth USD 150 billion by 2014*
Contributors to success
◦ Economies of scale
◦ Elasticity and pay-per-use pricing
Popular paradigms
◦ Infrastructure as a Service (IaaS)
◦ Platform as a Service (PaaS)
◦ Software as a Service (SaaS)
*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm
Databases for cloud platforms
Data is central to applications
DBMSs are a mission-critical component in the cloud software stack
◦ Manage petabytes of data, drive revenue
◦ Serve a variety of applications (multitenancy)
Data needs for cloud applications
◦ OLTP systems: store and serve data
◦ Data analysis systems: decision support, intelligence
Application landscape
◦ Social gaming
◦ Rich content and mash-ups
◦ Managed applications
◦ Cloud application platforms
Challenges for OLTP systems
Scalability
◦ While ensuring efficient transaction execution!
Lightweight Elasticity
◦ Scale on-demand!
Self-Manageability
◦ Intelligence without a human controller!
Two approaches to scalability
Scale-up
◦ Preferred in classical enterprise setting (RDBMS)
◦ Flexible ACID transactions
◦ Transactions access a single node
Scale-out
◦ Cloud friendly (key-value stores)
◦ Execution at a single server
Limited functionality & guarantees
◦ No multi-row or multi-step transactions
Why care about transactions?
confirm_friend_request(user1, user2)
{
  begin_transaction();
  update_friend_list(user1, user2, status.confirmed);
  update_friend_list(user2, user1, status.confirmed);
  end_transaction();
}
Simplicity in application design with ACID transactions
confirm_friend_request_A(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch(exception e) {
    report_error(e);
    return;
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch(exception e) {
    revert_friend_list(user1, user2);
    report_error(e);
    return;
  }
}

confirm_friend_request_B(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch(exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch(exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
  }
}
Challenge: Transactions at Scale
[Figure: design space — key-value stores provide scale-out but not ACID transactions; RDBMSs provide ACID transactions but not scale-out]
Challenge: Lightweight Elasticity
Provisioning on-demand and not for peak
[Figure: capacity vs. demand over time. Traditional infrastructures provision for peak capacity and leave unused resources; deployment in the cloud scales capacity with demand to optimize operating cost. Slide credits: Berkeley RAD Lab]
Challenge: Self-Manageability
Managing a large distributed system
◦ Detecting failures and recovering
◦ Coordination and synchronization
◦ Provisioning
◦ Capacity planning
◦ …
“A large distributed system is a Zoo”
Cloud platforms inherently multitenant
◦ Balance conflicting goals
  Minimize operating cost while ensuring good performance
Contributions for OLTP systems
Transactions at Scale
◦ ElasTraS [HotCloud 2009, UCSB TR 2010]
◦ G-Store [SoCC 2010]
Lightweight Elasticity
◦ Albatross [VLDB 2011]
◦ Zephyr [SIGMOD 2011]
Self-Manageability
◦ Pythia [in progress]
Contributions
Data Management
◦ Analytics: Ricardo [SIGMOD ‘10]; Anonimos [ICDE ‘10], [TKDE]
◦ Transaction Processing (this talk)
  Dynamic partitioning: G-Store [SoCC ‘10]; MD-HBase [MDM ‘11] (Best Paper Runner-up)
  Static partitioning: ElasTraS [HotCloud ‘09], [TR ‘10]; Albatross [VLDB ‘11]; Zephyr [SIGMOD ‘11]
  Pythia [in progress]
◦ Novel Architectures: Hyder [CIDR ‘11] (Best Paper); CoTS [ICDE ‘09], [VLDB ‘09]; TCAM [DaMoN ‘08]
Transactions at Scale
[Figure repeated: key-value stores provide scale-out but not ACID transactions; RDBMSs provide ACID transactions but not scale-out]
Scale-out with static partitioning
Table-level partitioning (range, hash)
◦ Distributed transactions
Partitioning the database schema (see the routing sketch after this list)
◦ Co-locate data items accessed together
◦ Goal: Minimize distributed transactions
Scaling out with static partitioning
◦ ElasTraS [HotCloud 2009, TR 2010]
◦ Cloud SQL Server [ICDE 2011]
◦ Megastore [CIDR 2011]
◦ Relational Cloud [CIDR 2011]
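A minimal illustration of the co-location idea, as a hedged Java sketch (the class and method names are hypothetical, not taken from any of the systems listed above): all rows that share a partitioning key are routed to the same node, so transactions that touch only one user's data never become distributed.

class PartitionRouter {
    private final int numPartitions;

    PartitionRouter(int numPartitions) { this.numPartitions = numPartitions; }

    // Profile, friend list, and messages of one user share the same
    // partitioning key, so they land on the same partition/node.
    int partitionFor(long userId) {
        return Math.floorMod(Long.hashCode(userId), numPartitions);
    }
}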
Dynamically formed partitions
Access patterns change, often rapidly
◦ Online multi-player gaming applications
◦ Collaboration-based applications
◦ Scientific computing applications
Not amenable to static partitioning
How to get the benefit of partitioning when accesses do not statically partition?
◦ Ours is the first solution to allow that
Online Multi-player Games
[Figure: a player profile record with ID, Name, $$$, and Score fields]
Online Multi-player Games
Execute transactions on player profiles while the game is in progress
Online Multi-player Games
Partitions/groups are dynamic
Online Multi-player Games
Hundreds of thousands of concurrent groups
Data Fusion for dynamic partitions [G-Store, SoCC 2010]
Transactional access to a group of data items formed on-demand
Challenge: Avoid distributed transactions!
Key Group Abstraction (see the API sketch below)
◦ Groups are small
◦ Groups execute a non-trivial number of transactions
◦ Groups are dynamic and on-demand
Groups are dynamically formed tenant databases
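To make the abstraction concrete, here is a minimal sketch of what a key-group client API could look like in Java; the interface and method names (createGroup, execute, deleteGroup) are illustrative assumptions, not the actual G-Store interface.

import java.util.Arrays;
import java.util.List;

interface KeyGroupClient {
    // Form a group on demand; the leader key's node becomes the execution site.
    String createGroup(String leaderKey, List<String> followerKeys);

    // Run a multi-key transaction entirely within one group (single node, ACID).
    void execute(String groupId, Runnable transaction);

    // Dissolve the group and return ownership of the keys to the followers.
    void deleteGroup(String groupId);
}

class GameRound {
    // Usage: group the profiles of the players in one game instance.
    static void play(KeyGroupClient client) {
        String group = client.createGroup("player:alice",
                Arrays.asList("player:bob", "player:carol"));
        client.execute(group, () -> {
            // read scores, update wallets, etc.
        });
        client.deleteGroup(group);
    }
}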
Transactions on Groups
Without distributed transactions
[Figure: the grouping protocol forms a Key Group by bringing ownership of the keys to a single node; one key is selected as the leader, and the followers transfer ownership of their keys to the leader]
Why is group formation hard?
Guarantee the contract between leaders and followers in the presence of:
◦ Leader and follower failures
◦ Lost, duplicated, or re-ordered messages
◦ Dynamics of the underlying system
How to ensure efficient and ACID execution of transactions?
Grouping protocol
[Figure: grouping protocol timeline of log entries and messages. On a create request the leader logs L(Creating) and sends a join request J to the followers; a follower logs L(Joining) and replies with JA; the leader logs L(Joined) and acknowledges with JAA, after which the follower logs L(Joined) and group operations proceed. On a delete request the leader logs L(Deleting) and sends D; the follower logs L(Free) and replies with DA, and the leader logs L(Deleted).]
Conceptually akin to “locking”
◦ Locks held by groups
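A minimal Java sketch of the leader-side view of this protocol; the states and message names (J, JA, JAA, D, DA) follow the slide, while the class, method, and helper names are illustrative assumptions rather than the actual G-Store implementation.

enum GroupState { FREE, CREATING, JOINED, DELETING, DELETED }

class GroupLeader {
    private GroupState state = GroupState.FREE;

    void onCreateRequest() {
        log("L(Creating)");                 // force the log entry, then contact followers
        state = GroupState.CREATING;
        sendToAllFollowers("J");            // join request
    }

    void onJoinAck(String followerKey) {    // follower logged L(Joining) and replied JA
        sendToFollower(followerKey, "JAA"); // follower logs L(Joined) on receiving JAA
    }

    void onAllFollowersJoined() {
        log("L(Joined)");
        state = GroupState.JOINED;          // group operations can now run at the leader
    }

    void onDeleteRequest() {
        log("L(Deleting)");
        state = GroupState.DELETING;
        sendToAllFollowers("D");            // return key ownership to the followers
    }

    void onAllDeleteAcks() {                // followers logged L(Free) and replied DA
        log("L(Deleted)");
        state = GroupState.DELETED;
    }

    private void log(String entry) { /* append to the write-ahead log */ }
    private void sendToAllFollowers(String msg) { /* messaging layer */ }
    private void sendToFollower(String key, String msg) { /* messaging layer */ }
}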
Efficient transaction processing
How does the leader execute transactions?
◦ Caches data for group members; the underlying data store is equivalent to a disk (sketched below)
◦ Transaction logging for durability
◦ Cache asynchronously flushed to propagate updates
◦ Guaranteed update propagation
[Figure: the leader runs a transaction manager and a cache manager backed by a log; updates are asynchronously propagated to the followers]
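A minimal sketch of this execution path in Java: the leader's cache fronts the key-value store, writes are logged before being applied, and dirty entries are flushed back asynchronously. All names are illustrative assumptions, not the prototype's code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class GroupExecutionSite {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    String read(String key) {
        // Read through to the underlying key-value store on a miss;
        // the store plays the role a disk plays for a classical RDBMS.
        return cache.computeIfAbsent(key, this::readFromStore);
    }

    void write(String key, String value) {
        appendToLog(key, value);                        // durability via the leader's log
        cache.put(key, value);                          // apply to the cached copy
        flusher.submit(() -> writeToStore(key, value)); // asynchronous update propagation
    }

    private String readFromStore(String key) { return null; /* key-value store get */ }
    private void appendToLog(String key, String value) { /* force log record */ }
    private void writeToStore(String key, String value) { /* key-value store put */ }
}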
Prototype: G-Store [SoCC 2010]
An implementation over Key-value stores
[Figure: G-Store architecture. Application clients issue transactional multi-key accesses to a grouping middleware layer (grouping layer + transaction manager) resident on top of the key-value store logic at each node, over distributed storage]
G-Store Evaluation
Implemented using HBase
◦ Added the middleware layer
◦ ~10000 LOC
Experiments in Amazon EC2
Benchmark: An online multi-player game
Cluster size: 10 nodes
Data size: ~1 billion rows (>1 TB)
For groups with 100 keys
◦ Group creation latency: ~10 – 100ms
◦ More than 10,000 groups concurrently created
G-Store Evaluation
[Charts: group creation latency and group creation throughput]
Lightweight Elasticity
Provisioning on-demand and not for peak
[Figure repeated: capacity vs. demand over time for traditional infrastructures vs. cloud deployment. Slide credits: Berkeley RAD Lab]
Elasticity in the Database tier
[Figure: a load balancer in front of the application/web/caching tier and the database tier]
Live database migration
Migrate a database partition (or tenant) in a live system
◦ Optimize operating cost
◦ Resource orchestration in multitenant systems
Different from
◦ Migration between software versions
◦ Migration in case of schema evolution
VM migration for DB elasticity
One DB partition per VM [Figure: one VM per partition on a shared hypervisor]
◦ Pros: allows fine-grained load balancing
◦ Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
Multiple DB partitions in a VM [Figure: a single VM hosting all partitions on the hypervisor]
◦ Pros: good performance
◦ Cons: migrating the VM migrates all partitions, so load balancing is coarse-grained
Live database migration
Multiple partitions share the same database process
◦ Shared process multitenancy
Migrate individual partitions on-demand in a live system
◦ Virtualization in the database tier
Straightforward solution (sketched below)
◦ Stop serving the partition at the source
◦ Copy to the destination
◦ Start serving at the destination
◦ Expensive!
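For contrast, a minimal sketch of that straightforward stop-and-copy approach in Java; the Partition and Node types and their method names are hypothetical. The partition is unavailable for the entire duration of the copy, which is exactly the cost the rest of this talk is designed to avoid.

interface Partition { }

interface Node {
    void stopServing(Partition p);
    byte[] dump(Partition p);            // serialize the whole partition (GBs)
    void load(Partition p, byte[] data);
    void startServing(Partition p);
}

class StopAndCopyMigration {
    static void migrate(Partition p, Node source, Node destination) {
        source.stopServing(p);                // downtime starts here
        destination.load(p, source.dump(p));  // copy everything while unavailable
        destination.startServing(p);          // downtime ends only after the full copy
    }
}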
Migration cost measures
Service unavailability
◦ Time the partition is unavailable
Number of failed requests
◦ Number of operations failing / transactions aborting
Performance overhead
◦ Impact on response times
Additional data transferred
Two common DBMS architectures
Decoupled storage architectures
◦ ElasTraS, G-Store, Deuteronomy, MegaStore
◦ Persistent data is not migrated
◦ Albatross [VLDB 2011]
Shared nothing architectures
◦ SQL Azure, Relational Cloud, MySQL Cluster
◦ Migrate persistent data
◦ Zephyr [SIGMOD 2011]
Why is live DB migration hard?
Persistent data must be migrated (GBs)
◦ How to ensure no downtime?
Nodes can fail during migration
◦ How to guarantee correctness during failures?
  Transaction atomicity and durability
  Recover migration state after failure
Transactions execute during migration
◦ How to guarantee serializability?
  Transaction correctness equivalent to normal operation
Our approach: Zephyr [SIGMOD 2011]
Migration executed in phases (see the phase sketch below)
◦ Starts with the transfer of minimal information to the destination (“wireframe”)
Database pages used as the granule of migration
◦ Unique page ownership
Source and destination concurrently execute transactions in one migration phase
  Minimal transaction synchronization
Guaranteed serializability
  Logging and handshaking protocols
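A minimal sketch of these phases as an enum a node might track; the comments summarize the modes described on the following slides, and the Java form itself is an illustrative assumption.

enum MigrationMode {
    NORMAL, // no migration in progress; indices and pages served as usual
    INIT,   // wireframe (index metadata without pages) transferred to the destination
    DUAL,   // source and destination both execute transactions; pages pulled on demand
    FINISH  // remaining pages pushed to the destination; source winds down
}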
Simplifying assumptions
For this talk
◦ Transactions access a single partition
◦ No replication
◦ No structural changes to indices
Extensions in the paper [SIGMOD 2011]
◦ Relax these assumptions
Design overview
[Figure: before migration, the source owns all database pages P1, P2, P3, …, Pn and executes the active transactions TS1,…,TSk; the destination owns no pages]
Init mode
Freeze indices and migrate wireframe
[Figure: the index wireframe is copied to the destination; the source still owns pages P1…Pn and continues executing TS1,…,TSk, while the corresponding pages at the destination remain un-owned]
What is an index wireframe?
[Figure: the source’s index with all of its pages vs. the destination’s wireframe copy, which contains only the internal index structure without the pages]
Dual mode
Requests for un-owned pages can block
[Figure: old, still-active transactions TSk+1,…,TSl execute at the source; new transactions TD1,…,TDm execute at the destination, which pulls pages it does not yet own (e.g., P3, accessed by TDi) from the source. Index wireframes remain frozen at both nodes]
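A minimal Java sketch of page access under the unique-ownership rule during dual mode: the destination pulls an un-owned page from the source on first access, while the source, having transferred a page, never pulls it back and instead aborts transactions that touch it. Class and method names are illustrative assumptions, not the Zephyr code.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class MigratingNode {
    private final Set<Integer> ownedPages = ConcurrentHashMap.newKeySet();
    private final boolean isDestination;

    MigratingNode(boolean isDestination) { this.isDestination = isDestination; }

    byte[] accessPage(int pageId) throws TransactionAbort {
        if (ownedPages.contains(pageId)) {
            return readLocalPage(pageId);
        }
        if (isDestination) {
            byte[] page = pullFromSource(pageId); // fetch page and ownership, exactly once
            ownedPages.add(pageId);
            return page;
        }
        // Source side: the page has already migrated; never pull it back.
        throw new TransactionAbort("page " + pageId + " already migrated");
    }

    private byte[] readLocalPage(int pageId) { return new byte[0]; /* local buffer pool */ }
    private byte[] pullFromSource(int pageId) { return new byte[0]; /* RPC to the source */ }
}

class TransactionAbort extends Exception {
    TransactionAbort(String msg) { super(msg); }
}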
Finish mode
Pages can be pulled by the destination, if needed
[Figure: the remaining pages P1, P2, … are pushed from the source to the destination; the source has completed its transactions, and new transactions TDm+1,…,TDn execute at the destination]
Normal operation
Index wireframe un-frozen
[Figure: migration is complete; the destination owns all pages P1…Pn and executes transactions TDn+1,…,TDp]
Artifacts of this design
Once migrated, pages are never pulled back by the source
◦ Abort transactions at the source accessing the migrated pages
No structural changes to indices during migration
◦ Abort transactions (at both nodes) that make structural changes to indices
Destination “pulls” pages on-demand
◦ Transactions at the destination experience higher latency compared to normal operation
Serializability
Only concern is “dual mode”
◦ Init and Finish: only one node is executing transactions
Local predicate locking of the internal index and exclusive page ownership ⇒ no phantoms
Strict 2PL ⇒ transactions are locally serializable
Pages transferred only once
◦ No Tdest → Tsource conflict dependency
Guaranteed serializability
Recovery
Transaction recovery
◦ For every database page, Tsrc → Tdst (transactions at the source precede those at the destination in the conflict order)
◦ Recovery: transactions replayed in conflict order
Migration recovery
◦ Atomic transitions between migration modes
  Developed logging and handshake protocols (a generic sketch follows below)
◦ Every page has exactly one owner
  Bookkeeping at the index level
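A generic sketch (not the actual Zephyr handshake) of one way to make such a transition atomic across failures: force a log record of the new mode before acting on it, and on recovery resume from the last logged mode. It reuses the MigrationMode enum sketched earlier; class and method names are assumptions.

class MigrationStateLog {
    private MigrationMode current = MigrationMode.NORMAL;

    void transitionTo(MigrationMode next) {
        forceLog("MODE " + next);  // durable before the switch takes effect
        current = next;
    }

    MigrationMode recover() {
        // Re-read the log; the last forced MODE record defines the mode to resume in.
        current = lastLoggedModeOrDefault(MigrationMode.NORMAL);
        return current;
    }

    private void forceLog(String record) { /* append + fsync */ }
    private MigrationMode lastLoggedModeOrDefault(MigrationMode dflt) { return dflt; }
}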
Correctness
In the presence of arbitrary repeated failures, Zephyr ensures:
◦ Updates made to database pages are consistent
◦ Failure does not leave a page without an owner
◦ Both source and destination are in the same migration mode
Guaranteed termination and starvation freedom
Implementation
Prototyped using H2, an open-source OLTP database
◦ Supports standard SQL/JDBC API
◦ Serializable isolation level
◦ Tree indices
◦ Relational data model
Modified the database engine
◦ Added support for freezing indices
◦ Page migration status maintained using the index
◦ ~6000 LOC
Tungsten SQL Router migrates JDBC connections during migration
Results Overview
Downtime (partition unavailability)
◦ S&C (stop and copy): 3 – 8 seconds (time needed to migrate, during which the partition is unavailable for updates)
◦ Zephyr: no downtime; either the source or the destination is always available
Service interruption (failed operations)
◦ S&C: ~100s – 1,000s of failed operations; all transactions with updates are aborted
◦ Zephyr: ~10s – 100s; an order of magnitude less interruption
Minimal operational and data transfer overhead
Failed Operations
[Chart: an order of magnitude fewer failed operations with Zephyr]
Concluding Remarks
Major enabling technologies
◦ Scalable distributed database infrastructure: ElasTraS
◦ Dynamically formed data partitions: G-Store
◦ Live database migration: Albatross, Zephyr
Future Directions
Self-managing controller for large multitenant database infrastructures
Novel data management architectures
◦ Leveraging advances from novel hardware
◦ Convergence of transactional and analytics systems for real-time intelligence
Putting the human in the loop: leveraging crowd-sourcing
Thank you!
Collaborators
UCSB:
Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu
Shashank Agarwal, Shyam Antony, Aaron Elmore,
Shoji Nishimura (NEC Japan)
Microsoft Research Redmond:
Phil Bernstein, Colin Reid
IBM Almaden:
Yannis Sismanis, Kevin Beyer, Rainer Gemulla,
Peter Haas, John McPherson