
Disaster Recovery 2.0
A paradigm shift in DR Architecture.
Iwan ‘e1’ Rahabok, VCAP-DCD, VCP5
Staff SE, Strategic Accounts
+65 9119-9226 | [email protected] | virtual-red-dot.blogspot.com | sg.linkedin.com/in/e1ang
© 2009 VMware Inc. All rights reserved
Business Requirements
Protect the Business in the event of a Disaster
It is similar to Insurance.
• It’s no longer acceptable to run a business without DR protection.
The question is now about…
• How do we cut the DR cost & complexity? People cost, technology cost, etc.
2
Disaster did strike in Singapore
29 June 2004
• Electricity Supply Interruption
• More than 300,000 homes were left in the dark
• About 30% of Singapore was affected.
 If both your Prod and DR datacenters were in that 30%…
• Caused by the disruption of natural gas supply from West Natuna, Indonesia. A valve
at the gas receiving station operated by ConocoPhillips tripped. Natural gas supply
was disrupted causing 5 units of the combined-cycle gas turbines (CCGT) at Tuas
Power Station, Power Seraya Power Station and SembCorp Cogen to trip.
• Some of the CCGTs could not switch to diesel successfully. Investigation into the
incident is in progress.
Other Similar Incidents
• The first disruption in natural gas supply occurred on 5 Aug 2002 due to a tripping of
a valve in the gas receiving station which led to a power blackout.
3
Disaster Recovery (DR) vs Disaster Avoidance (DA)
DA requires that the Disaster must be avoidable.
• DA implies that there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems.
Once avoided, for all practical purposes, there is no more disaster.
• There is no recovery required.
• There is no panic & chaos.
DA is about Preventing (no downtime). DR is about Recovering (already down).
• Two opposite contexts.
It is insufficient to have DA only.
DA does not protect the business when Disaster strikes.
Get DR in place first, then DA.
DR Context: It’s a Disaster, so…
It might strike when we’re not ready
• E.g. the IT team is having an offsite meeting, and the next flight is 8 hours away.
• Key technical personnel are not around (e.g. sick or on holiday).
We can’t assume Production is up.
• There might be nothing for us to evacuate or migrate to the DR site.
• Even if the servers are up, we might not even be able to access them (e.g. network is down).
Even if it’s up, we can’t assume we have time to gracefully shut down or migrate.
• Shutting down multi-tier apps is complex and takes time when you have 100s…
We can’t assume certain systems will not be affected.
• DR Exercise should involve entire datacenter.
Assume the worst, and start from that point.
5
Singapore MAS Guidelines
MAS is very clear that DR means a Disaster has happened, as there is an outage.
Clause 8.3.3 states that the Total Site should be tested. So if you are not doing an entire-DC test, you are not compliant.
DR: Assumptions
A company wide DR Solution shall assume:
• Production is down or not accessible.
 Entire datacenter, not just some systems.
• Key personnel are not available
 Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc.
 Intelligence should be built into the system to eliminate reliance on human experts.
• Manual Run Books are not 100% up to date
 Manual documents (Word, Excel, etc) covering every step to recover the entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors.
 Automation & virtualisation reduce this risk.
7
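To make the "automation over manual runbooks" point concrete, here is a minimal, hypothetical sketch of what an executable runbook could look like: the recovery steps live in a data structure instead of a Word document, so the same code runs them in order and stops on the first failure. The step names and commands below are illustrative placeholders, not the actual procedure.

```python
# Hypothetical, minimal "executable runbook" sketch.
# Steps are data, not prose; the runner executes them in order and records the
# outcome, so the procedure cannot silently drift from what actually runs
# during a Dry Run.
import subprocess
import datetime

RUNBOOK = [
    # (description, shell command) - illustrative placeholders only
    ("Verify replicated LUNs are visible", "echo check-luns"),
    ("Bring up AD/DNS in the Shadow Production LAN", "echo start-core-services"),
    ("Power on application servers in dependency order", "echo start-app-tier"),
    ("Run application smoke tests", "echo smoke-test"),
]

def run_runbook(steps):
    log = []
    for description, command in steps:
        started = datetime.datetime.now().isoformat(timespec="seconds")
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        ok = result.returncode == 0
        log.append((started, description, "OK" if ok else "FAILED"))
        if not ok:
            break  # stop on the first failure instead of improvising, as a human would have to
    return log

if __name__ == "__main__":
    for entry in run_runbook(RUNBOOK):
        print(*entry, sep=" | ")
```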
DR Principles
To Business Users, the actual DR experience must be identical to the Dry Run they experience
• In a panic or chaotic situation, users should deal with something they have been trained on.
• This means the Dry Run has to simulate Production (without shutting down Production)
Dry Run must be done regularly.
• This ensures:
 New employees are covered.
 Existing employees do not forget.
 The procedures are not outdated (hence incorrect or damaging)
• Annual is too long a gap, especially if many users or departments are involved.
DR System must be a replica of the Production System
• Testing with a system that is not identical to production renders the Dry Run invalid.
• Manually maintaining 2 copies of 100s of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production System.
• System ≠ Datacenter. Normally, the DR DC is smaller. System here means the collection of servers, storage, network and security that makes up “an application from the business point of view”.
8
Datacenter wide DR Solution: Technical Requirements
Fully Automated
• Eliminate reliance on many key personnel.
• Eliminate outdated (hence misleading) manual runbooks.
Enable frequent Dry Run, with 0 impact to Production.
• Production must not be shut down, as this impacts the business.
 Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure.
• No clashing with Production hostnames and IP addresses.
• If Production is not impacted, then users can take time to test DR. No need to finish within a certain time window anymore.
Scalable to entire datacenter
• 1000s of servers
• Cover all aspects of infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.
9
DR 1.0 architecture (current thinking)
Typical DR 1.0 solution (at infrastructure layer) has the following properties, by area:

Server
• Data drive (LUN) is replicated.
• OS/App drive is not. So there are 2 copies: Production and DR. They have different hostnames and IP addresses. They can’t be the same, as an identical hostname/IP would result in a conflict because the network spans both datacenters.
• This means the DR system is actually different from Production, even in an actual DR. Production never fails over to DR. Only the data gets mounted.
• Technically, this is not a “production recovery” solution, but a “Site 2 mounting Site 1 data” solution. IT has been telling the Business that IT is recovering Production, while what IT actually does is run a different system; the only thing used from Production is the data.

Storage
• Not integrated with the server. Practically 2 different solutions, manually run by 2 different teams, with a lot of manual coordination and unhappiness.

Network
• Not aware of DR Test or Dry Run. It’s 1 network for all purposes.
• Firewall rules manually maintained on both sides.
DR 1.0 architecture: Limitations
Technically, it is not even a DR solution
• We do not recover the Production System. We merely mount production Data on a
different System
 The only way for the System to be recovered is to do a SAN boot on the DR Site.
• Can’t prove to auditors that DR = Production.
• Registry changes, config changes, etc. are hard to track at the OS and Application level.
Manual mapping of data drives to their associated servers on the DR site.
Not a scalable solution, as manual updates don’t scale to 1000s of servers.
Heavy on scripting, which is not tested regularly.
DR Testing relies heavily on IT expertise.
11
DR Requirements: Summary
R01 (DR copy = Production copy. Dry Run = Actual Run.): This is to avoid an invalid Dry Run where the System Under Test itself is not the same. No changes are allowed (e.g. IP address and host name), as any change means Dry Run ≠ real DR.
R02 (Identical User Experience): From the business users’ point of view, the entire Dry Run exercise must match the real/actual DR experience.
R03 (No impact on Production during Dry Run): The DR test should not require Production to be shut down, as that becomes a real failover. A real failover can’t be done frequently as it impacts the business. The Business will resist testing, making the DR Solution risky due to rare testing.
R04 (Frequent Dry Run): This is only possible if Production is not affected.
R05 (No reliance on human experts): A datacenter-wide DR needs a lot of experts from many disciplines, making it an expensive effort. The actual procedure should be simple; it should not involve recovering from error states.
R06 (Scalable to entire datacenter): The DR solution should scale to 1000s of servers while maintaining RTO and simplicity.
12
R01: DR Copy = Production Copy
Solution: replicate System + Data, not just data drive (LUN).
• OS, Apps, settings, etc.
Implication of the solution:
• If the Production network is not stretched, the server will be unreachable. Changing the IP will break the Application.
• If the Production network is stretched, the IP Address and Hostname will conflict with Production. Changing the Hostname will definitely break the Application. A stretched L2 network is not a full solution. Entire LAN isolation is the solution.
Solution: the entire Dry Run network must be isolated (a bubble network)
• No conflict with Production, as it’s actually identical. It’s a shadow of the Production LAN.
• All network services (AD, DNS, DHCP, Proxy) must exist in the Shadow Prod LAN.
Implication of the solution:
• For VMs, this is easily done via vSphere and SRM.
• Physical Servers need to be connected to the Dry Run LAN. A permanent connection simplifies operations and eliminates the risk of accidental updates to production.
13
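As an illustration of requirement R01, the sketch below compares a hypothetical export of the Production inventory against the Shadow Production LAN inventory and flags any server whose hostname or IP differs, since any difference would make the Dry Run invalid. The inventory format and values are made up for the example (the CRM server names reuse the ones from the later diagram).

```python
# Hypothetical R01 check: the DR copy in the Shadow Production LAN must carry
# the same identity as Production - same hostnames, same IP addresses.
production = {
    "CRM-Web-Server.vmware.com": "10.10.10.10",
    "CRM-App-Server.vmware.com": "10.10.10.20",
    "CRM-DB-Server.vmware.com":  "10.10.10.30",
}

shadow_prod_lan = {
    "CRM-Web-Server.vmware.com": "10.10.10.10",
    "CRM-App-Server.vmware.com": "10.10.10.20",
    "CRM-DB-Server.vmware.com":  "10.10.10.30",
}

def verify_r01(prod, shadow):
    """Return a list of violations; an empty list means DR copy = Production copy."""
    violations = []
    for host, ip in prod.items():
        if host not in shadow:
            violations.append(f"missing in Shadow LAN: {host}")
        elif shadow[host] != ip:
            violations.append(f"IP mismatch for {host}: prod={ip}, shadow={shadow[host]}")
    for host in shadow:
        if host not in prod:
            violations.append(f"extra server in Shadow LAN: {host}")
    return violations

print(verify_r01(production, shadow_prod_lan) or "R01 satisfied: identical copy")
```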
R02: Identical User Experience
[Diagram: Production desktop pools at desktop.ABC Corp.com; on-demand DR Test desktop pools at Desktop-DRTest.ABC Corp.com]
VDI is a natural companion to DR as it makes the “front-end” experience seamless.
• Users use Virtual Desktop as their day to day desktop.
• VDI enables us to DR the desktop too.
During Dry Run
• Users connect to desktop.vmware.com for production and desktop-DR.vmware.com for the Dry Run. Having 2 desktops means the environment is completely isolated.
During actual Disaster
• Desktop-DR.vmware.com is renamed to desktop.vmware.com, as the original desktop.vmware.com is down (affected by the same DR). Users connect to desktop.vmware.com, just like they do in their day to day, hence creating an identical experience.
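One way that rename could be automated is with an RFC 2136 dynamic DNS update, assuming the corporate DNS accepts such updates and the dnspython library is available; the zone, record names and DNS server address below are illustrative only.

```python
# Hedged sketch: repoint desktop.vmware.com at the DR desktop pool during an
# actual disaster, using an RFC 2136 dynamic update via dnspython.
import dns.update
import dns.query
import dns.rcode

update = dns.update.Update("vmware.com")
update.delete("desktop")                                        # drop the old record(s)
update.add("desktop", 300, "CNAME", "desktop-dr.vmware.com.")   # point users at the DR pool

response = dns.query.tcp(update, "10.20.20.53", timeout=10)     # DR-site DNS server (illustrative)
if response.rcode() != dns.rcode.NOERROR:
    raise RuntimeError("DNS update refused; fall back to the manual procedure")
```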
R03: No impact on Production during Dry Run
To achieve the above, the DR Solution:
• Cannot require Production to be shut down or stopped. It must be Business as Usual.
• Must be an independent, full copy altogether, with no reliance on Production components.
 Network, security, AD, DNS, Load Balancer, etc.
15
R04: Frequent Dry Run
To achieve the above, the DR Solution cannot:
• Be laborious or prone to human error. A fully automated solution addresses this.
• Touch the production system or network. So it has to be an isolated environment. A Shadow Production LAN solves this.
• VMware SRM enables the automation component for VMs.
You should have full confidence that the Actual Failover will work. This can only be achieved if you can do frequent dry runs.
16
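As a sketch of what "frequent, fully automated" could look like in practice: a small scheduler that kicks off a test recovery at a fixed interval and records the result. Note that start_test_recovery() is a hypothetical placeholder for whatever actually drives the failover tooling (for example an SRM recovery plan run in test mode); it is not a real API call.

```python
# Hypothetical scheduler for frequent, unattended Dry Runs.
import time
import datetime

def start_test_recovery(plan_name):
    """Placeholder: call the real DR automation here (e.g. an SRM recovery
    plan in test mode) and return True on success."""
    print(f"Starting test recovery for plan '{plan_name}' into the Shadow Production LAN")
    return True

def dry_run_loop(plan_name, iterations, interval_seconds):
    """Run the test recovery `iterations` times, `interval_seconds` apart,
    keeping a simple pass/fail history."""
    history = []
    for _ in range(iterations):
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        history.append((stamp, "PASS" if start_test_recovery(plan_name) else "FAIL"))
        time.sleep(interval_seconds)
    return history

if __name__ == "__main__":
    # Interval of 0 for the demo; a real schedule would be e.g. monthly,
    # far more frequent than the traditional annual exercise.
    for stamp, outcome in dry_run_loop("Entire-Datacenter", iterations=3, interval_seconds=0):
        print(stamp, outcome)
```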
Solution: Dealing with Physical Servers
[Diagram: Singapore (Prod Site) runs CRM-Web-Server.vmware.com (10.10.10.10), CRM-App-Server.vmware.com (10.10.10.20) and CRM-DB-Server.vmware.com (10.10.10.30). Singapore (DR Site) hosts identical copies, with the same hostnames and IP addresses, inside the Shadow Production LAN, plus CRM-DB-Server-Test.vmware.com (20.20.20.30) outside it.]
17
Physical Servers: Dual boot option
This VM is a Jump Box.
Without a Jump Box, we cannot access Shadow Production LAN
during Dry Run. It runs on ESXi which is connected to both LANs.
Shadow Production LAN (10.10.10.x)
LAN on Datacenter 2 (20.20.20.x)
Physical Server must be dual-boot (OS):
- Normal Operation: Test/Dev environment (default boot)
- Dry Run or DR: Shadow Production network
18
Physical Servers: Dual partition option
This VM is a Jump Box.
Without a Jump Box, we cannot access Shadow Production LAN
during Dry Run. It runs on ESXi which is connected to both LANs.
Shadow Production LAN (10.10.10.x)
LAN on Datacenter 2 (20.20.20.x)
1 physical box:
- DR Partition
- Test/Dev Partition
Typical Physical Network: it’s 1 network
[Diagram: Production Networks spanning Singapore (Prod Site), Country X (any site), Singapore (DR Site) and a Users Site; each datacenter runs Production VMs and Production PMs, with AD/DNS and non-AD DNS at every site.]
ABC Corp operates in many
countries in Asia, with Singapore
being the HQ.
A system may consist of multiple servers from more than 1 country.
DNS service for Windows is
provided by MS AD.
DNS service for non Windows is
provided by non MS AD.
Users (from any country) can access any server (physical or virtual) in any country, as basically there is only 1 “network”. There is routing to connect the various LANs.
In 1 “network”, we can’t have 2 machines with the same host name or the same IP. Each LAN has its own network address. Hence a change of IP address is required when moving from the Prod Site to the DR Site.
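To make that re-addressing burden concrete, here is a small illustration using only the Python standard library: every server that "moves" to the DR site in this single routed network must take a new address on the DR LAN, and this mechanical mapping has to be maintained for every server. The subnets are examples only; the point is that the mapping disappears once the DR site runs an isolated Shadow Production LAN with identical addresses.

```python
# Illustration of DR 1.0 re-addressing: map each Prod-site address onto the
# DR-site LAN, keeping the host part. Subnets are examples only.
import ipaddress

PROD_LAN = ipaddress.ip_network("10.10.10.0/24")
DR_LAN   = ipaddress.ip_network("20.20.20.0/24")

def readdress(prod_ip: str) -> str:
    """Return the DR-site address for a Prod-site address (same host number)."""
    ip = ipaddress.ip_address(prod_ip)
    if ip not in PROD_LAN:
        raise ValueError(f"{prod_ip} is not in {PROD_LAN}")
    host_part = int(ip) - int(PROD_LAN.network_address)
    return str(DR_LAN.network_address + host_part)

# Every one of these mappings has to be reflected in the OS, the application
# configuration, firewall rules and DNS - for thousands of servers.
for prod_ip in ("10.10.10.10", "10.10.10.20", "10.10.10.30"):
    print(prod_ip, "->", readdress(prod_ip))
```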
Site 2 needs to have 2 distinct Networks
This VM is a Jump Box.
Without a Jump Box, we cannot access Shadow Production LAN
during Dry Run. It runs on ESXi which is connected to both LANs.
[Diagram: Shadow Production LAN (10.10.10.x) hosting the DR Server; LAN on Datacenter 2 (20.20.20.x) hosting the Test/Dev Server.]
Mode: Normal Operation or During Dry Run
[Diagram: Site 1 has the Production LAN (10.10.10.x). Site 2 has the Shadow Production LAN (10.10.10.x), reachable via the Jump Box, and the Non-Prod LAN (20.20.20.x). The Users Site is on the Desktop LAN (30.30.30.x).]
22
Mode: Partial DR
[Diagram: Site 1 Production LAN (10.10.10.x), Site 2 Non-Prod LAN (20.20.20.x), and the Users Site on the Desktop LAN (30.30.30.x).]
23
Summary: DR 2.0 and 1.0
R01: DR 1.0 does not meet (it uses 2 copies, which are manually synced); DR 2.0 meets.
R02: DR 1.0 does not meet (the DR system ≠ the Production system); DR 2.0 meets.
R03: DR 1.0 does not meet (the Dry Run is done on another system, not the production copy); DR 2.0 meets.
R04: DR 1.0 does not meet; DR 2.0 meets.
R05: DR 1.0 does not meet (resource intensive: dual boot, scripts, etc.); DR 2.0 meets.
R06: DR 1.0 does not meet; DR 2.0 meets.
Overall: DR 1.0 works for Physical Servers but does not work well in a Virtual Environment. DR 2.0 fits VMs much better than Physical servers; the Network must have a Shadow Production LAN.
24
Pre-Failover
[Diagram: the user at 10.30.30.30 queries the Global DNS for www.abc.com and receives Virtual IP 1 (10.10.10.10) on the Prod Site (10.10.10.0/24). The Prod Site load balancer maps the VIP to a server IP (10.10.10.10 => 10.20.20.31), source-NATs the client (10.30.30.30 => 10.20.20.20) and load-balances across the Production VMs and PMs on 10.20.20.0/24. The DR Site (192.168.10.0/24) has its own load balancer with VIP 2.]
Post-Failover
[Diagram: the same DNS query for www.abc.com now returns Virtual IP 2 (192.168.10.10) on the DR Site (192.168.10.0/24). The DR Site load balancer maps the VIP to a server IP (192.168.10.10 => 10.20.20.31), source-NATs the client (10.30.30.30 => 10.20.20.20) and load-balances across the recovered Production VMs and PMs on 10.20.20.0/24.]
DR Dry Run
[Diagram: during a Dry Run the user queries www-dr-test.abc.com and receives Virtual IP 2 (192.168.10.10) on the DR Site. The DR Site load balancer maps the VIP to a server IP (192.168.10.10 => 10.20.20.31), source-NATs the client (10.30.30.30 => 10.20.20.20) and load-balances across the DR Test VMs and PMs, while Production on the Prod Site (10.10.10.0/24, VIP 1) keeps serving users.]
Making it Work
• Strict enforcement to have external users use the VIP.
• Strict enforcement to have peer vApp stacks use the VIP.
• DNS failover at the global site load-balancer would have to be manual: the Network admin needs to update www.abc.com on the global site load-balancer to reflect the VIP at the secondary DC.
• Server load-balancers are only applicable for serving specific applications. Application support on load-balancers is vendor dependent, with varying depth of app support.
• Applications will need to support source NAT. Some applications have known issues when used in conjunction with NAT (e.g. FTP); however, server load-balancers may be able to mitigate these issues. Vendor dependent.
• Not running a stretched VLAN means VMs with strong systemic dependencies must be placed on the same site, possibly as a vApp. Communication between VMs across sites can only be done using a VIP, where a specific function and pool of VMs must have already been configured.
28
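The following sketch models the three states shown in the diagrams above using plain data structures, to make the moving parts explicit: the global DNS answer, the VIP-to-server mapping on the load balancer, and the source NAT applied to the client. The addresses are the illustrative values from the slides; this is a model, not actual load-balancer configuration.

```python
# Plain-Python model of the GSLB + load-balancer behaviour in the three states.
# Addresses are the illustrative values from the diagrams.

GLOBAL_DNS = {
    "pre_failover":  {"www.abc.com": "10.10.10.10"},         # VIP 1, Prod Site
    "post_failover": {"www.abc.com": "192.168.10.10"},       # VIP 2, DR Site (manual update)
    "dry_run":       {"www.abc.com": "10.10.10.10",          # Production untouched
                      "www-dr-test.abc.com": "192.168.10.10"},
}

# Each load balancer maps its VIP to a real server and source-NATs the client.
LOAD_BALANCERS = {
    "10.10.10.10":   {"server": "10.20.20.31", "snat": "10.20.20.20", "pool": "Production"},
    "192.168.10.10": {"server": "10.20.20.31", "snat": "10.20.20.20", "pool": "DR / DR Test"},
}

def resolve_and_connect(state, hostname, client_ip="10.30.30.30"):
    vip = GLOBAL_DNS[state][hostname]
    lb = LOAD_BALANCERS[vip]
    return (f"{state}: {hostname} -> VIP {vip}; client {client_ip} SNAT'ed to "
            f"{lb['snat']}; VIP mapped to {lb['server']} ({lb['pool']} pool)")

print(resolve_and_connect("pre_failover", "www.abc.com"))
print(resolve_and_connect("post_failover", "www.abc.com"))
print(resolve_and_connect("dry_run", "www-dr-test.abc.com"))
```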
DA
From the view of DR
29
DA & DR in virtual environment
DR and DA solutions do not fit well together in vSphere 5
• DA requires 1 vCenter
 DA needs long-distance migration, which doesn’t work across 2 vCenters.
• DR requires 2 vCenters.
 vCenter prevents the same VM from appearing twice in its inventory.
 We can’t assume the vCenter on the main site is recoverable.
There is confusion about DR + DA
• You cannot have DA + DR on the same “system”. You need 3 instances:
 1 primary
 1 secondary for DR purposes
 1 secondary for DA purposes.
• The next slide explains the limitations of some DA solutions for the DR use case.
 This is not to criticise the DA solutions, as they are good solutions for the DA use case.
DA Solution: Stretched Cluster (+ Long Distance vMotion)
When actual DR strikes…
• We can’t assume Production is up. Hence vMotion is not a solution.
• HA will kick in and boot all VMs. Boot order will not be honoured.
Challenge of the above solution: how do we Test?
• The DR Solution must be tested regularly, as per Requirement R04.
• The test must be identical from the user’s point of view, as per Requirement R02.
• So the test will have to be like this:
 Cut replication, then mount the LUNs, then add the VMs into vCenter, then boot the VMs.
 But… we cannot mount the LUNs in the same vCenter as they have the same signature! Even if we could, we would have to know the exact placement of each VM (which is complex). Even if we could, we cannot run 2 copies of the same VM in the same vCenter! This means the Production VMs must be down, which fails Requirement R03.
Conclusion:
Stretched Cluster does not even qualify as a DR Solution, as it can’t be tested & it’s 100% manual.
31
DA Solution: 2 Clusters in 1 VC (+ Long Distance vMotion)
This is a variant of Stretched Cluster.
• It fixes the risk & complexity of Stretched Cluster, and there is no performance impact from uncontrolled long-distance vMotion.
When actual DR strikes…
• We can’t assume Production is up. Hence vMotion is not a solution.
• HA will not even kick in, as it’s a separate cluster. In fact, the VMs will be in an error state, appearing italicised in vCenter.
Challenge of the above solution: How do we Test?
• All issues facing Stretched Cluster apply.
Conclusion:
2-Cluster is inferior to Stretched Cluster from a DR point of view.
32
Stretched Datacenter: View from the Network
Bro, can you add some design info on the complexity of stretching the network (assume no virtualisation, all physical servers)?
A lot of VMware folks don’t appreciate the complexity & implications (design, operational, performance, upgrade, troubleshooting) when a network is stretched across 2 physical datacenters (say they are 40 km apart).
33
Active/Active or Active/Passive
Which one makes sense?
34
Background
Active/Active Datacenter has many levels of definition:
• Both DCs are actively running workload, so one is not idle.
 This means Site 2 can be running non-Production workloads, like Test/Dev and DR.
• Both DCs are actively running Production workload.
 Building on the previous level, this means Site 2 must run Production workload.
• Both DCs are actively running Production workload, with application-level failover.
 Building on the previous level, the same App runs on both sides. But the instance on Site 2 is not serving users. It’s waiting for an application-level failover.
 This is typically done via a geo-cluster solution.
• Both DCs are actively running Production workload, with active/active at the application level.
 Both Apps are running. Normally done via a global Load Balancer.
 No need to fail over, as each App is “complete”. It has the full data, and it does not need to tell the other App when its data is updated. No transaction-level integrity is required.
 This is the ideal. But most apps cannot do this, as the data cannot be split. You can only have 1 copy of the data.
35
In the vSphere context, this is what is meant by Active/Active vSphere: both vSphere environments are actively running Production VMs.
A closer look at Active/Active
[Diagram: in the Active/Active layout, each site has its own vCenter and runs 250 Prod VMs plus 500 Test/Dev VMs, with lots of traffic between the sites (Prod to Prod, T/D to T/D). In the alternative layout, one site runs 500 Prod VMs on Prod Clusters and the other runs 1000 Test/Dev VMs on T/D Clusters, each under its own vCenter.]
MAS TRM Guideline
It states “near” 0, not 0. It states “should”, not “must”. It states “critical”, not all systems. So A/A is only for a subset. This points to an Application-level solution, not an Infrastructure-level one. We can add this capability without changing the architecture, as shown on the next slide.
37
Adding Active/Active to a mostly Active/Passive vSphere
[Diagram: starting from the mostly Active/Passive layout (one site with 500 Prod VMs on Prod Clusters, the other with 1000 Test/Dev VMs on T/D Clusters, each with its own vCenter), a Global LB is added at each site and a small cluster (1 cluster, 50 VMs) on the second site runs the critical applications in Active/Active.]
Thank You
39
© 2009 VMware Inc. All rights reserved