SecondSite: Disaster Tolerance as a Service Shriram Rajagopalan
SECONDSITE: DISASTER TOLERANCE AS A SERVICE
Shriram Rajagopalan
Brendan Cully
Ryan O’Connor
Andrew Warfield
FAILURES IN A DATACENTER
TOLERATING FAILURES IN A DATACENTER
REMUS
The initial idea behind Remus was to tolerate datacenter-level failures.
CAN A WHOLE DATACENTER FAIL?
Yes!
It’s a “Disaster”!
DISASTERS
“Truck driver in Texas kills all the websites you really use” … Southlake FD found that he had low blood sugar.
- valleywag.com
(Illustrative image courtesy of TangoPango, Flickr.)
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”
- Om Malik, GigaOM
DISASTERS..
“Water-main break cripples Dallas County computers, operations”
The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays, keeping some prisoners in jail longer than normal.
- Dallas Morning News, Jun 2010
MORE FODDER BACK HOME
“An explosion … near our server bank … electrical box containing 580 fiber cables. electrical box … was covered in asbestos … mandated the wearing of hazmat suits .... Worse yet, the dynamic rerouting —which is the hallmark of the internet … did not function. In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowswitch.com (a Gossamer Threads customer)
DISASTER RECOVERY – THE OLD-FASHIONED WAY
• Storage replication between a primary and a backup site.
• Manually restore physical servers from backup images.
• Data loss and long outage periods.
• Expensive hardware: storage arrays, replicators, etc.
STATE OF THE ART DISASTER RECOVERY
(Figure: the protected site and the recovery site each run VirtualCenter and Site Recovery Manager; datastore groups are copied between sites by array replication. VMs are online at the protected site and offline at the recovery site; when the protected site's VMs become unavailable, the recovery site's copies are powered on.)
Source: VMWare Site Recovery Manager – Technical Overview
PROBLEMS WITH EXISTING SOLUTIONS
• Data loss & service disruption (RPO ~15 min, RTO ~a few hours)
• Complicated recovery planning (e.g. service A needs to be up before B, etc.)
• Application-level recovery
Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.
DISASTER TOLERANCE AS A SERVICE?
Our Vision
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
PRIMARY & BACKUP SITES
(Figure: the two sites, connected by a link with 5 ms RTT.)
FAILOVER & FAILBACK WITHOUT OUTAGE
(Figure: initially Primary Site: Vancouver, Backup Site: Kamloops; after failover, Primary Site: Kamloops, Backup Site: Vancouver.)
• Complete state recovery (CPU, disk, memory, network)
• No application-level recovery
MAIN CONTRIBUTIONS
Remus (NSDI ’08)
• Checkpoint-based state replication
• Fully transparent HA
• Recovery consistency
• No application-level recovery
RemusDB (VLDB ’11)
• Optimize server latency
• Reduce replication bandwidth by up to 80% using page delta compression (see the sketch after this list)
• Disk read tracking
SecondSite (VEE ’12)
• Failover arbitration in the wide area
• Stateful network failover over the wide area
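As background for the RemusDB bandwidth number above, here is a minimal sketch of the page delta compression idea: rather than resending every dirty page in full at each checkpoint, send only the byte runs that changed since the copy the backup already holds. This is an illustrative reconstruction (page size, run encoding, and names are assumptions), not RemusDB's actual code.

```python
# Illustrative sketch of page delta compression: ship only the byte runs of
# a dirty page that differ from the copy the backup already has.
PAGE_SIZE = 4096

def page_delta(old: bytes, new: bytes):
    """Encode `new` as (offset, changed-bytes) runs relative to `old`."""
    runs, i = [], 0
    while i < PAGE_SIZE:
        if old[i] != new[i]:
            j = i
            while j < PAGE_SIZE and old[j] != new[j]:
                j += 1                      # extend the run of differing bytes
            runs.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return runs

def apply_delta(old: bytes, runs) -> bytes:
    """Backup side: rebuild the new page from its old copy plus the runs."""
    page = bytearray(old)
    for off, data in runs:
        page[off:off + len(data)] = data
    return bytes(page)

# A page with a 4-byte change ships as one small run instead of 4 KB.
old = bytes(PAGE_SIZE)
new = bytearray(old); new[100:104] = b"abcd"
assert apply_delta(old, page_delta(old, bytes(new))) == bytes(new)
```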
FAILURE DETECTION IN REMUS
(Figure: primary and backup hosts on a LAN; NIC1 on each host faces the external network, NIC2 carries the checkpoint stream over a dedicated replication link.)
• A pair of independent, dedicated NICs carries the replication traffic.
• The backup declares a primary failure only if it cannot reach the primary via NIC1 or NIC2, and it can reach the external network via NIC1 (see the sketch below).
• Failure of the replication link alone results in backup shutdown.
• Split brain occurs only when both NICs/links fail.
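The decision rule above can be made concrete with a small sketch. This is an assumed restatement of the slide's predicate, not the Remus source; the boolean inputs stand in for whatever heartbeat or probe mechanism a deployment uses.

```python
# Assumed restatement of the slide's failure-detection predicate.
def backup_action(reach_primary_nic2: bool,   # dedicated replication link
                  reach_primary_nic1: bool,   # path via the external network
                  reach_external: bool) -> str:
    if reach_primary_nic2:
        return "continue replication"         # primary alive, link healthy
    if reach_primary_nic1:
        # Only the replication link failed; the primary is still up, so the
        # backup shuts itself down rather than risk divergence.
        return "backup shutdown"
    if reach_external:
        # Primary unreachable on both paths while we can still see the
        # outside world: declare failure and take over. (If in fact both of
        # the primary's links died, this is the rare split-brain case.)
        return "declare primary failed"
    return "backup shutdown"                  # isolated: cannot decide safely
```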
FAILURE DETECTION IN WIDE-AREA DEPLOYMENTS
(Figure: the primary and backup datacenters face the Internet via NIC1 and exchange checkpoints over a WAN replication channel via NIC2; there is no dedicated link.)
• Cannot distinguish between link and node failure.
• Higher chance of split brain, as the network is no longer reliable.
FAILOVER ARBITRATION
• Local quorum of simple reachability detectors ("stewards").
• Stewards can be placed on third-party clouds.
• Google App Engine implementation in ~100 LoC (a minimal steward sketch follows).
• The provider or user could substitute more sophisticated implementations.
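To give a feel for how small a steward can be, here is a hypothetical sketch of one as a bare HTTP responder: a site polls it, and any successful reply counts as "this steward is reachable from me." The framework and port are illustrative, not the deck's actual ~100 LoC cloud implementation.

```python
# Hypothetical steward: a stateless HTTP responder. Reachability is the
# only signal it provides; it keeps no state about either site.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Steward(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)               # a reply == "you can reach me"
        self.end_headers()
        self.wfile.write(b"alive")

if __name__ == "__main__":
    HTTPServer(("", 8080), Steward).serve_forever()
```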
FAILOVER ARBITRATION..
(Figure: both sites poll an a priori agreed-upon set of five stewards; failed probes are marked with an X. Quorum logic at the primary: "I need majority to stay alive." Quorum logic at the backup: "I need exclusive majority to failover." A sketch of this quorum logic follows.)
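A minimal sketch of the two quorum rules, under one plausible reading of the figure, not the paper's code: stewards are assumed to also report whether they have recently heard from the peer site, which is what makes the backup's majority exclusive.

```python
# One plausible reading of the figure's quorum rules. The `heard_primary`
# field is an assumed steward report ("have you heard from the primary
# recently?") used to make the backup's majority exclusive.
def majority(votes: list) -> bool:
    return sum(votes) > len(votes) // 2

def primary_keeps_running(steward_reachable: list) -> bool:
    # "I need majority to stay alive": keep hosting the VMs only while a
    # majority of the agreed steward set is reachable.
    return majority(steward_reachable)

def backup_fails_over(steward_reachable: list, heard_primary: list) -> bool:
    # "I need exclusive majority to failover": take over only if the backup
    # reaches a majority AND the primary does not also hold one, so both
    # sites can never be active at once.
    return majority(steward_reachable) and not majority(heard_primary)
```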
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION
• Remus (LAN): gratuitous ARP from the backup host (see the sketch below).
• SecondSite (WAN/Internet): BGP route update from the backup datacenter.
• Needs support from the upstream ISP(s) at both datacenters.
• IP migration is achieved through BGP multi-homing.
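For the LAN case, a gratuitous ARP can be sent in a few lines. The sketch below uses scapy with placeholder addresses; it illustrates the mechanism, not how Remus itself emits the announcement.

```python
# Illustration of the LAN failover mechanism: a gratuitous ARP announcing
# that the service IP now lives at the backup host's MAC. Addresses below
# are placeholders, not values from the paper.
from scapy.all import ARP, Ether, sendp

SERVICE_IP = "192.0.2.10"                    # example address (RFC 5737)
BACKUP_MAC = "00:16:3e:00:00:02"             # placeholder backup-host MAC

garp = Ether(src=BACKUP_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2,                                    # unsolicited ("gratuitous") reply
    hwsrc=BACKUP_MAC, psrc=SERVICE_IP,
    hwdst="ff:ff:ff:ff:ff:ff", pdst=SERVICE_IP)
sendp(garp, iface="eth0")                    # switches relearn; traffic shifts
```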
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..
(Figure: BGP multi-homing through BCNet (AS-271). The primary site in Vancouver (134.87.2.173/134.87.2.174) and the backup site in Kamloops (207.23.255.237/207.23.255.238) both announce the same stub AS-64678 prefix, 134.87.3.0/24, and run the replication channel between them. Traffic is routed to the primary site, which announces "as-path prepend 64678", while the backup announces "as-path prepend 64678 64678 64678 64678"; on failover, traffic is re-routed to the backup site, which re-announces with "as-path prepend 64678 64678". A toy model of this path-length preference follows.)
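The prepending in the figure works because, with higher-priority attributes equal, BGP prefers the route with the shortest AS path. A toy model (not router code) of that tie-break:

```python
# Toy model of BGP best-path selection by AS-path length, showing why the
# site that prepends less attracts the traffic for the shared prefix.
def best_route(announcements: dict) -> str:
    """announcements: site name -> AS path announced for the shared prefix."""
    return min(announcements, key=lambda site: len(announcements[site]))

before = {"vancouver": ["64678"],            # primary: single prepend
          "kamloops":  ["64678"] * 4}        # backup: four prepends, unattractive
after  = {"kamloops":  ["64678"] * 2}        # failover: primary withdrawn,
                                             # backup re-announces a shorter path
assert best_route(before) == "vancouver"
assert best_route(after) == "kamloops"
```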
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
EVALUATION
(Slide dialogue:)
"Failover works!!"
"I want periodic failovers with no downtime!"
"More than one failure? I will have to restart HA!"
"Did you run regression tests?"
RESTARTING HA
• Need to resynchronize storage.
• Avoiding service downtime requires online resynchronization.
• Leverage DRBD, which resynchronizes only the blocks that have changed.
• Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol (see the sketch below).
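A sketch of what checkpoint-based asynchronous disk replication implies at the backup, under assumed semantics (this is not the actual Remus/DRBD integration): writes belonging to an in-flight checkpoint epoch are buffered and applied to stable storage only when the matching memory checkpoint commits, so disk and memory recover to the same point.

```python
# Assumed backup-side semantics: buffer the current epoch's writes and
# apply them atomically when the matching memory checkpoint commits.
class BackupDisk:
    def __init__(self):
        self.committed = {}                  # block number -> durable data
        self.pending = {}                    # writes of the current epoch

    def receive_write(self, block: int, data: bytes):
        self.pending[block] = data           # buffer; do not touch the disk yet

    def checkpoint_commit(self):
        self.committed.update(self.pending)  # epoch complete: apply atomically
        self.pending.clear()

    def checkpoint_abort(self):
        self.pending.clear()                 # primary died mid-epoch: discard
```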
REGRESSION TESTS
• Synthetic workloads to stress-test the replication pipeline (a harness sketch follows).
• Failovers every 90 minutes.
• Discovered some interesting corner cases:
  • Page-table corruption in memory checkpoints
  • Write-after-write I/O ordering in disk replication
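The harness implied by this slide might look like the following skeleton; `kill_primary_site` and `service_responds` are hypothetical hooks for the deployment's own fault injection and client-side liveness checks.

```python
# Hypothetical regression harness: trigger a failover every 90 minutes while
# a synthetic workload runs, and fail the test on any visible outage.
import time

FAILOVER_PERIOD = 90 * 60                    # seconds, per the slide

def regression_loop(kill_primary_site, service_responds):
    while True:
        time.sleep(FAILOVER_PERIOD)
        kill_primary_site()                  # induce a "disaster"
        deadline = time.time() + 30
        while not service_responds():        # clients must keep getting answers
            assert time.time() < deadline, "failover caused an outage"
            time.sleep(1)
```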
SECONDSITE – THE COMPLETE PICTURE
(Plot: 4 VMs x 100 clients/VM.)
• Service downtime includes the timeout for failure detection (10 s).
• The failure detection timeout is configurable.
REPLICATION BANDWIDTH CONSUMPTION
(Plot: 4 VMs x 100 clients/VM.)
DEMO
Expect a real disaster (conference demos are not a good idea!)
APPLICATION THROUGHPUT VS. REPLICATION LATENCY
(Plot: SPECweb with 100 clients, Kamloops.)
RESOURCE UTILIZATION VS. APPLICATION LOAD
(Plots: Domain-0 CPU utilization and bandwidth usage on the replication channel, showing the cost of HA as a function of application load; OLTP with 100 clients.)
RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD
(Plot: OLTP workload.)
SETUP WORKFLOW – RECOVERY SITE
The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMWare Site Recovery Manager – Technical Overview
RECOVERY PLAN
(Figure: the recovery plan's ordered steps: VM shutdown, high-priority VM shutdown, prepare storage, high-priority VM recovery, normal-priority VM recovery, low-priority VM recovery.)
Source: VMWare Site Recovery Manager – Technical Overview