SecondSite: Disaster Tolerance as a Service Shriram Rajagopalan
SECONDSITE: DISASTER TOLERANCE AS A SERVICE
Shriram Rajagopalan
Brendan Cully
Ryan O’Connor
Andrew Warfield
FAILURES IN A DATACENTER
TOLERATING FAILURES IN A DATACENTER
REMUS
The initial idea behind Remus was to tolerate datacenter-level failures.
CAN A WHOLE DATACENTER FAIL?
Yes!
It’s a “Disaster”!
DISASTERS
“Truck driver in Texas kills all the websites you really use” … Southlake FD found that he had low blood sugar.
- valleywag.com
(Illustrative image courtesy of TangoPango, Flickr.)
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”
- Om Malik, GigaOM
DISASTERS..
“Water-main break cripples Dallas County computers, operations”
The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays, keeping some prisoners in jail longer than normal.
- Dallas Morning News, Jun 2010
MORE FODDER BACK HOME
“An explosion … near our server bank … electrical box containing 580 fiber cables. electrical box … was covered in asbestos … mandated the wearing of hazmat suits .... Worse yet, the dynamic rerouting —which is the hallmark of the internet … did not function. In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowswitch.com (a Gossamer Threads customer)
DISASTER RECOVERY – THE OLD-FASHIONED WAY
• Storage replication between a primary and a backup site.
• Manually restore physical servers from backup images.
• Data loss and long outage periods.
• Expensive hardware: storage arrays, replicators, etc.
STATE OF THE ART DISASTER RECOVERY
(Figure: the protected site and the recovery site each run VirtualCenter and Site Recovery Manager; datastore groups are copied between sites by array replication. VMs are online at the protected site and offline at the recovery site; when the protected site's VMs become unavailable, the recovery site's copies are powered on.)
Source: VMWare Site Recovery Manager – Technical Overview
PROBLEMS WITH EXISTING SOLUTIONS
• Data loss & service disruption (RPO ~15 min, RTO ~a few hours)
• Complicated recovery planning (e.g. service A needs to be up before B, etc.)
• Application-level recovery
Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.
DISASTER TOLERANCE AS A SERVICE?
Our Vision
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
PRIMARY & BACKUP SITES
(Figure: the two sites, connected by a link with 5 ms RTT.)
FAILOVER & FAILBACK WITHOUT OUTAGE
(Figure: initially Primary Site: Vancouver, Backup Site: Kamloops; after failover, Primary Site: Kamloops, Backup Site: Vancouver.)
• Complete state recovery (CPU, disk, memory, network)
• No application-level recovery
MAIN CONTRIBUTIONS
Remus (NSDI ’08)
• Checkpoint-based state replication
• Fully transparent HA
• Recovery consistency
• No application-level recovery
RemusDB (VLDB ’11)
• Optimize server latency
• Reduce replication bandwidth by up to 80% using page delta compression (see the sketch after this list)
• Disk read tracking
SecondSite (VEE ’12)
• Failover arbitration in the wide area
• Stateful network failover over the wide area
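As background for the RemusDB bandwidth number above, here is a minimal sketch of the page delta compression idea: rather than resending every dirty page in full at each checkpoint, send only the byte runs that changed since the copy the backup already holds. This is an illustrative reconstruction (page size, run encoding, and names are assumptions), not RemusDB's actual code.

```python
# Illustrative sketch of page delta compression: ship only the byte runs of
# a dirty page that differ from the copy the backup already has.
PAGE_SIZE = 4096

def page_delta(old: bytes, new: bytes):
    """Encode `new` as (offset, changed-bytes) runs relative to `old`."""
    runs, i = [], 0
    while i < PAGE_SIZE:
        if old[i] != new[i]:
            j = i
            while j < PAGE_SIZE and old[j] != new[j]:
                j += 1                      # extend the run of differing bytes
            runs.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return runs

def apply_delta(old: bytes, runs) -> bytes:
    """Backup side: rebuild the new page from its old copy plus the runs."""
    page = bytearray(old)
    for off, data in runs:
        page[off:off + len(data)] = data
    return bytes(page)

# A page with a 4-byte change ships as one small run instead of 4 KB.
old = bytes(PAGE_SIZE)
new = bytearray(old); new[100:104] = b"abcd"
assert apply_delta(old, page_delta(old, bytes(new))) == bytes(new)
```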
FAILURE DETECTION IN REMUS
(Figure: primary and backup hosts on a LAN; NIC1 on each host faces the external network, NIC2 carries the checkpoint stream over a dedicated replication link.)
• A pair of independent, dedicated NICs carries the replication traffic.
• The backup declares a primary failure only if it cannot reach the primary via NIC1 or NIC2, and it can reach the external network via NIC1 (see the sketch below).
• Failure of the replication link alone results in backup shutdown.
• Split brain occurs only when both NICs/links fail.
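The decision rule above can be made concrete with a small sketch. This is an assumed restatement of the slide's predicate, not the Remus source; the boolean inputs stand in for whatever heartbeat or probe mechanism a deployment uses.

```python
# Assumed restatement of the slide's failure-detection predicate.
def backup_action(reach_primary_nic2: bool,   # dedicated replication link
                  reach_primary_nic1: bool,   # path via the external network
                  reach_external: bool) -> str:
    if reach_primary_nic2:
        return "continue replication"         # primary alive, link healthy
    if reach_primary_nic1:
        # Only the replication link failed; the primary is still up, so the
        # backup shuts itself down rather than risk divergence.
        return "backup shutdown"
    if reach_external:
        # Primary unreachable on both paths while we can still see the
        # outside world: declare failure and take over. (If in fact both of
        # the primary's links died, this is the rare split-brain case.)
        return "declare primary failed"
    return "backup shutdown"                  # isolated: cannot decide safely
```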
FAILURE DETECTION IN WIDE-AREA DEPLOYMENTS
(Figure: the primary and backup datacenters face the Internet via NIC1 and exchange checkpoints over a WAN replication channel via NIC2; there is no dedicated link.)
• Cannot distinguish between link and node failure.
• Higher chance of split brain, as the network is no longer reliable.
FAILOVER ARBITRATION
• Local quorum of simple reachability detectors ("stewards").
• Stewards can be placed on third-party clouds.
• Google App Engine implementation in ~100 LoC (a minimal steward sketch follows).
• The provider or user could substitute more sophisticated implementations.
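To give a feel for how small a steward can be, here is a hypothetical sketch of one as a bare HTTP responder: a site polls it, and any successful reply counts as "this steward is reachable from me." The framework and port are illustrative, not the deck's actual ~100 LoC cloud implementation.

```python
# Hypothetical steward: a stateless HTTP responder. Reachability is the
# only signal it provides; it keeps no state about either site.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Steward(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)               # a reply == "you can reach me"
        self.end_headers()
        self.wfile.write(b"alive")

if __name__ == "__main__":
    HTTPServer(("", 8080), Steward).serve_forever()
```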
FAILOVER ARBITRATION..
(Figure: both sites poll an a priori agreed-upon set of five stewards; failed probes are marked with an X. Quorum logic at the primary: "I need majority to stay alive." Quorum logic at the backup: "I need exclusive majority to failover." A sketch of this quorum logic follows.)
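A minimal sketch of the two quorum rules, under one plausible reading of the figure, not the paper's code: stewards are assumed to also report whether they have recently heard from the peer site, which is what makes the backup's majority exclusive.

```python
# One plausible reading of the figure's quorum rules. The `heard_primary`
# field is an assumed steward report ("have you heard from the primary
# recently?") used to make the backup's majority exclusive.
def majority(votes: list) -> bool:
    return sum(votes) > len(votes) // 2

def primary_keeps_running(steward_reachable: list) -> bool:
    # "I need majority to stay alive": keep hosting the VMs only while a
    # majority of the agreed steward set is reachable.
    return majority(steward_reachable)

def backup_fails_over(steward_reachable: list, heard_primary: list) -> bool:
    # "I need exclusive majority to failover": take over only if the backup
    # reaches a majority AND the primary does not also hold one, so both
    # sites can never be active at once.
    return majority(steward_reachable) and not majority(heard_primary)
```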
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION
• Remus (LAN): gratuitous ARP from the backup host (see the sketch below).
• SecondSite (WAN/Internet): BGP route update from the backup datacenter.
• Needs support from the upstream ISP(s) at both datacenters.
• IP migration is achieved through BGP multi-homing.
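For the LAN case, a gratuitous ARP can be sent in a few lines. The sketch below uses scapy with placeholder addresses; it illustrates the mechanism, not how Remus itself emits the announcement.

```python
# Illustration of the LAN failover mechanism: a gratuitous ARP announcing
# that the service IP now lives at the backup host's MAC. Addresses below
# are placeholders, not values from the paper.
from scapy.all import ARP, Ether, sendp

SERVICE_IP = "192.0.2.10"                    # example address (RFC 5737)
BACKUP_MAC = "00:16:3e:00:00:02"             # placeholder backup-host MAC

garp = Ether(src=BACKUP_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2,                                    # unsolicited ("gratuitous") reply
    hwsrc=BACKUP_MAC, psrc=SERVICE_IP,
    hwdst="ff:ff:ff:ff:ff:ff", pdst=SERVICE_IP)
sendp(garp, iface="eth0")                    # switches relearn; traffic shifts
```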
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..
(Figure: BGP multi-homing through BCNet (AS-271). The primary site in Vancouver (134.87.2.173/134.87.2.174) and the backup site in Kamloops (207.23.255.237/207.23.255.238) both announce the same stub AS-64678 prefix, 134.87.3.0/24, and run the replication channel between them. Traffic is routed to the primary site, which announces "as-path prepend 64678", while the backup announces "as-path prepend 64678 64678 64678 64678"; on failover, traffic is re-routed to the backup site, which re-announces with "as-path prepend 64678 64678". A toy model of this path-length preference follows.)
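The prepending in the figure works because, with higher-priority attributes equal, BGP prefers the route with the shortest AS path. A toy model (not router code) of that tie-break:

```python
# Toy model of BGP best-path selection by AS-path length, showing why the
# site that prepends less attracts the traffic for the shared prefix.
def best_route(announcements: dict) -> str:
    """announcements: site name -> AS path announced for the shared prefix."""
    return min(announcements, key=lambda site: len(announcements[site]))

before = {"vancouver": ["64678"],            # primary: single prepend
          "kamloops":  ["64678"] * 4}        # backup: four prepends, unattractive
after  = {"kamloops":  ["64678"] * 2}        # failover: primary withdrawn,
                                             # backup re-announces a shorter path
assert best_route(before) == "vancouver"
assert best_route(after) == "kamloops"
```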
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
EVALUATION
(Slide dialogue:)
"Failover works!!"
"I want periodic failovers with no downtime!"
"More than one failure? I will have to restart HA!"
"Did you run regression tests?"
RESTARTING HA
• Need to resynchronize storage.
• Avoiding service downtime requires online resynchronization.
• Leverage DRBD, which resynchronizes only the blocks that have changed.
• Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol (see the sketch below).
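A sketch of what checkpoint-based asynchronous disk replication implies at the backup, under assumed semantics (this is not the actual Remus/DRBD integration): writes belonging to an in-flight checkpoint epoch are buffered and applied to stable storage only when the matching memory checkpoint commits, so disk and memory recover to the same point.

```python
# Assumed backup-side semantics: buffer the current epoch's writes and
# apply them atomically when the matching memory checkpoint commits.
class BackupDisk:
    def __init__(self):
        self.committed = {}                  # block number -> durable data
        self.pending = {}                    # writes of the current epoch

    def receive_write(self, block: int, data: bytes):
        self.pending[block] = data           # buffer; do not touch the disk yet

    def checkpoint_commit(self):
        self.committed.update(self.pending)  # epoch complete: apply atomically
        self.pending.clear()

    def checkpoint_abort(self):
        self.pending.clear()                 # primary died mid-epoch: discard
```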
REGRESSION TESTS
• Synthetic workloads to stress-test the replication pipeline (a harness sketch follows).
• Failovers every 90 minutes.
• Discovered some interesting corner cases:
  • Page-table corruption in memory checkpoints
  • Write-after-write I/O ordering in disk replication
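The harness implied by this slide might look like the following skeleton; `kill_primary_site` and `service_responds` are hypothetical hooks for the deployment's own fault injection and client-side liveness checks.

```python
# Hypothetical regression harness: trigger a failover every 90 minutes while
# a synthetic workload runs, and fail the test on any visible outage.
import time

FAILOVER_PERIOD = 90 * 60                    # seconds, per the slide

def regression_loop(kill_primary_site, service_responds):
    while True:
        time.sleep(FAILOVER_PERIOD)
        kill_primary_site()                  # induce a "disaster"
        deadline = time.time() + 30
        while not service_responds():        # clients must keep getting answers
            assert time.time() < deadline, "failover caused an outage"
            time.sleep(1)
```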
SECONDSITE – THE COMPLETE PICTURE
(Plot: 4 VMs x 100 clients/VM.)
• Service downtime includes the timeout for failure detection (10 s).
• The failure detection timeout is configurable.
REPLICATION BANDWIDTH CONSUMPTION
(Plot: 4 VMs x 100 clients/VM.)
DEMO
Expect a real disaster (conference demos are not a good idea!)
APPLICATION THROUGHPUT VS. REPLICATION LATENCY
(Plot: SPECweb with 100 clients, Kamloops.)
RESOURCE UTILIZATION VS. APPLICATION LOAD
(Plots: Domain-0 CPU utilization and bandwidth usage on the replication channel, showing the cost of HA as a function of application load; OLTP with 100 clients.)
RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD
(Plot: OLTP workload.)
SETUP WORKFLOW – RECOVERY SITE
The user creates a recovery plan, which is associated with one or more protection groups.
Source: VMWare Site Recovery Manager – Technical Overview
RECOVERY PLAN
(Figure: the recovery plan's ordered steps: VM shutdown, high-priority VM shutdown, prepare storage, high-priority VM recovery, normal-priority VM recovery, low-priority VM recovery.)
Source: VMWare Site Recovery Manager – Technical Overview