Multi-Site Clustering Considerations


PROTECTING THE VIRTUALIZED DATACENTER – HIGH AVAILABILITY
Gorazd Šemrov
Microsoft Consulting Services
[email protected]
DATA PROTECTION PLANNING
CONSIDERATIONS
• What needs protection?
  • Local resources (physical & virtual)
  • Remote sites
• What are your recovery goals?
  • Prioritize by tier
  • Organizational expectations
• Do you have a disaster recovery plan?
  • Downtime RPO/RTO
  • Testing
• How much bandwidth do you have to manage protection?
  • Is your time better spent on other priorities?
• What are your budgetary realities?
HOST CLUSTERING
• Cluster service runs in the (physical) host and manages the VMs
• VMs move between cluster nodes
  • Live Migration: no downtime
  • Quick Migration: session state saved to disk
[Diagram: host cluster nodes attached to a shared SAN]
GUEST CLUSTERING
• Cluster service runs inside a VM
• Apps and services inside the VM are managed by the cluster
• Apps move between clustered VMs
[Diagram: guest cluster VMs attached to shared iSCSI storage]
GUEST VS. HOST: HEALTH DETECTION

Fault                      Host Cluster   Guest Cluster
Host Hardware Failure      Yes            Yes
Parent Partition Failure   Yes            Yes
VM Failure                 Yes            Yes
Guest OS Failure           No             Yes
Application Failure        No             Yes
HOST + GUEST CLUSTERING
• The optimal solution: offers the most flexibility and protection
• VM high availability & mobility between physical nodes
• Application & service high availability & mobility between VMs
• Increases complexity
[Diagram: a guest cluster connected over iSCSI runs on top of host clusters, each attached to its own SAN]
SETTINGS: ANTIAFFINITYCLASSNAMES
• AntiAffinityClassNames
  • Groups with the same AntiAffinityClassNames value try to avoid moving to the same node
  • http://msdn.microsoft.com/en-us/library/aa369651(VS.85).aspx
• Enables VM distribution across host nodes
  • Better utilization of host OS resources
• Failover behavior on large clusters: KB 299631
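A minimal sketch of assigning the property with the FailoverClusters PowerShell module; the group and class names here are hypothetical:
PowerShell (R2):
  Import-Module FailoverClusters
  # AntiAffinityClassNames is a multi-valued string property
  $class = New-Object System.Collections.Specialized.StringCollection
  $class.Add("SQL VMs") | Out-Null
  # VM groups that share this class name try to avoid the same host node
  (Get-ClusterGroup "SQL-VM1").AntiAffinityClassNames = $class
  (Get-ClusterGroup "SQL-VM2").AntiAffinityClassNames = $class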
SETTINGS: AUTO-START
• Mark groups as lower priority
  • Lets the most important VMs start first
• Group property, enabled by default
• VMs with auto-start disabled need a manual restart to recover after a crash
SETTINGS: PERSISTENT MODE
• A highly available service or application will return to its original owner node
• Better VM distribution after a cold start
• Enabled by default for VM groups
• Disabled by default for other groups
MULTI-SITE CLUSTERING
CONSIDERATIONS
Network
Storage
Compute
Quorum
MULTI-SITE CLUSTERS FOR
DISASTER RECOVERY
What are Multi-Site Clusters?
• A single cluster solution extended over metropolitan-wide distances to protect against datacenter failures
[Diagram: Site A and Site B; cluster nodes are located at physically separate sites]
MULTI-SITE CLUSTERING
CONSIDERATIONS
Network
Storage
Compute
Quorum
STORAGE DEPLOYMENT OPTIONS
Traditional Cluster Storage
[Diagram: cluster nodes attached to SAN disks]
• Shared-nothing storage model
• Unit of failover is at the LUN/disk level
• Ideal for Hyper-V Quick Migration scenarios
STORAGE DEPLOYMENT OPTIONS
Cluster Shared Volumes (CSV)
[Diagram: cluster nodes concurrently attached to a single SAN disk]
• Multiple nodes can concurrently access the same disk
• Unit of failover is at the VM level
• Ideal for Hyper-V Quick and Live Migration
• Verify vendor support
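A minimal sketch of adding a clustered disk to CSV with the FailoverClusters module; the disk name is hypothetical:
PowerShell (R2):
  Import-Module FailoverClusters
  # Add an online clustered disk to Cluster Shared Volumes so all nodes
  # can access it concurrently
  Add-ClusterSharedVolume -Name "Cluster Disk 1"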
REPLICATION METHOD:
SYNCHRONOUS
• Host receives the “write complete” response from the storage only after the data is successfully written to both storage devices
  1. Write request (host → primary storage)
  2. Replication (primary storage → secondary storage)
  3. Acknowledgement (secondary storage → primary storage)
  4. Write complete (primary storage → host)
REPLICATION METHOD:
ASYNCHRONOUS
• Host receives the “write complete” response from the storage after the data is successfully written to just the primary storage device; replication happens afterwards
  1. Write request (host → primary storage)
  2. Write complete (primary storage → host)
  3. Replication (primary storage → secondary storage)
COMPARING DATA REPLICATION
METHODS
Recovery Point Objective (RPO)
  • Synchronous: high business impact, critical applications (RPO = 0)
  • Asynchronous: medium-to-low business impact, critical applications (RPO > 0)
Application I/O performance
  • Synchronous: applications not sensitive to high I/O latency
  • Asynchronous: applications sensitive to high I/O latency
Distance between sites
  • Synchronous: 50 km to 300 km
  • Asynchronous: >200 km
Bandwidth cost
  • Synchronous: high
  • Asynchronous: mid-low
CLUSTER VALIDATION WITH
REPLICATED STORAGE
• Multi-Site clusters are not required to pass the
Storage tests to be supported
• Validation Guide and Policy
• http://go.microsoft.com/fwlink/?LinkID=119949
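A sketch of validating a multi-site cluster while skipping the storage tests; the node names are hypothetical:
PowerShell (R2):
  Import-Module FailoverClusters
  # Skip the Storage test category, which replicated storage is not
  # required to pass
  Test-Cluster -Node "N1","N2","B3","B4" -Ignore "Storage"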
ASYMMETRICAL STORAGE
SUPPORT IN SP1
• Improves the multi-site cluster experience
• Storage only visible to a subset of nodes
• Storage topology used for smart placement
  • Workload placement based on its underlying storage connectivity
[Diagram: nodes N1–N4 share a SAN; disk set #1 is visible on N1 & N2, disk set #2 on N3 & N4]
• Example: SQL and non-SQL workloads are separated by storage visibility
CHOOSING A STRETCHED STORAGE MODEL
• Matrix: replication method (hardware, software, or appliance) against storage model (Traditional Cluster Storage vs. Cluster Shared Volumes) and Live Migration support
• For hardware and appliance replication, consult your vendor on what is supported
MULTI-SITE CLUSTERING
CONSIDERATIONS
Network
Storage
Compute
Quorum
NETWORK DEPLOYMENT OPTIONS
Stretched VLANs
[Diagram: the public network (10.10.10.*) and a redundant network (20.20.20.*) each span Site A and Site B as single stretched subnets]
NETWORK DEPLOYMENT OPTIONS
Different Subnets
[Diagram: Site A uses 10.10.10.* (public) and 20.20.20.* (redundant); Site B uses 30.30.30.* (public) and 40.40.40.* (redundant)]
CHALLENGES WITH STRETCHING THE NETWORK
• Clustering has no distance limitations (although 3rd-party plug-ins may)
• Longer distances traditionally mean greater network latency
  • Missed inter-node health checks can cause false failover
• Cluster heartbeating is fully configurable:
  • SameSubnetDelay (default = 1 second): frequency at which heartbeats are sent
  • SameSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface is considered down
  • CrossSubnetDelay (default = 1 second): frequency at which heartbeats are sent to nodes on dissimilar subnets
  • CrossSubnetThreshold (default = 5 heartbeats): missed heartbeats before an interface is considered down for nodes on dissimilar subnets
• Command Line: Cluster.exe /prop
• PowerShell (R2): Get-Cluster | fl *
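A sketch of relaxing cross-subnet heartbeating for a high-latency WAN; the values shown are illustrative, not recommendations:
PowerShell (R2):
  Import-Module FailoverClusters
  # The *Delay properties are stored in milliseconds
  $cluster = Get-Cluster
  $cluster.CrossSubnetDelay = 2000      # send a heartbeat every 2 seconds
  $cluster.CrossSubnetThreshold = 10    # 10 missed heartbeats before down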
SECURITY OVER THE WAN
• Encrypt inter-node communication
  • Trade-off: security versus performance
• SecurityLevel (default = signed communication)
  • 0 = clear text
  • 1 = signed (default)
  • 2 = encrypted
[Diagram: nodes 10.10.10.1 and 20.20.20.1 in Site A communicate with 30.30.30.1 and 40.40.40.1 in Site B]
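A minimal sketch of requiring encrypted intra-cluster traffic:
PowerShell (R2):
  # 0 = clear text, 1 = signed (default), 2 = encrypted
  (Get-Cluster).SecurityLevel = 2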
UPDATING VM IP ADDRESSES ON FAILOVER
• Not needed if on the same subnet
• On cross-subnet failover, if the guest uses:
  • DHCP: the IP address is updated automatically
  • Static IP: a new IP address must be configured after failover; this can be scripted (see the sketch below)
• If using multiple subnets, it is easier to use DHCP in the guest OS
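A hypothetical sketch of such a script, run inside the guest after a cross-subnet failover; the interface name and addresses are examples only:
Command Line (inside the guest):
  rem Re-assign the static address for the new site's subnet
  netsh interface ipv4 set address name="Local Area Connection" static 30.30.30.111 255.255.255.0 30.30.30.1
  rem Point the guest at a DNS server reachable from the new site
  netsh interface ipv4 set dnsservers name="Local Area Connection" static 30.30.30.10 primary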
DNS CONSIDERATIONS
• Nodes in dissimilar subnets: on failover the VM obtains a new IP address
• Clients need that new IP address from DNS to reconnect
[Diagram: the VM fails over from Site A (10.10.10.111) to Site B (20.20.20.222); the record is updated on DNS Server 1, replicated to DNS Server 2, and only then obtained by clients]
SOLUTION #1: LOCAL FAILOVER FIRST
• Configure local failover first for high availability
  • No change in IP addresses
  • No DNS replication issues
  • No data going over the WAN
• Cross-site failover for disaster recovery
[Diagram: the VM (10.10.10.111) fails over within Site A first; Site B (20.20.20.222) is used only for disaster recovery]
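One way to sketch "local failover first" is with preferred owners, listing the Site A nodes ahead of the Site B nodes; all names are hypothetical:
PowerShell (R2):
  # The VM prefers NodeA1/NodeA2 and only crosses sites as a last resort
  Set-ClusterOwnerNode -Group "SQL-VM1" -Owners "NodeA1","NodeA2","NodeB1","NodeB2"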
SOLUTION #2: STRETCH VLANS
• Deploying a stretched VLAN minimizes client reconnection times
• The IP address of the VM never changes
[Diagram: a VLAN spans Site A and Site B; the VM keeps 10.10.10.111, so DNS Servers 1 and 2 never need an updated record]
SOLUTION #3: NETWORKING DEVICE
ABSTRACTION
• The networking device uses an independent 3rd IP address
• The 3rd IP address is registered in DNS and used by clients
[Diagram: clients resolve VM = 30.30.30.30 from DNS; the device directs them to 10.10.10.111 in Site A or 20.20.20.222 in Site B]
CSV NETWORKING CONSIDERATIONS
• Cluster Shared Volumes requires all nodes to be in the same subnet
  • Use a VLAN on your CSV network
• Other networks can still use multiple subnets
[Diagram: a VLAN carries the CSV network between Site A and Site B]
LIVE MIGRATING ACROSS SITES
• Live migration moves a running VM between cluster nodes
  • TCP reconnects make the move unnoticeable to clients
• Use VLANs to achieve live migration between sites
  • The VM’s IP address connecting the client to the VM will not change
• Network bandwidth planning
  • Live migration may require significant network bandwidth, depending on the amount of memory allocated to the VM
  • Live migration times will be longer over high-latency or low-bandwidth WAN connections
• Remember that CSV and live migration are independent, but complementary, technologies
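A minimal sketch of triggering a cross-site live migration with the FailoverClusters module; the VM and node names are hypothetical:
PowerShell (R2):
  # On R2 this cmdlet live migrates the clustered VM to the target node
  Move-ClusterVirtualMachineRole -Name "SQL-VM1" -Node "NodeB1"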
MULTI-SUBNET VS. VLAN RECAP
• Both approaches support Quick Migration and fast failover
• Seamless Live Migration, Cluster Shared Volumes, and static IPs in the guest require a stretched VLAN
• Multi-subnet deployments offer flexibility; stretched VLANs add network complexity
MULTI-SITE CLUSTERING CONSIDERATIONS
Network
Storage
Compute
Quorum
QUORUM DEPLOYMENT OPTIONS
1. Disk Only
2. Node Majority
3. Node & Disk Majority
4. File Share Witness
REPLICATED DISK WITNESS
• A witness is a tie-breaker when nodes lose network connectivity
• When the witness is replicated, it is no longer a single decision maker, and problems occur
• Do not use in multi-site clusters unless directed by your vendor
[Diagram: three nodes each cast a vote; the replicated witness storage is an ambiguous extra vote]
NODE MAJORITY
[Diagram: 5-node cluster, majority = 3; three nodes in the primary site (Site A), two in Site B; cross-site network connectivity broken]
• Each node asks: can I communicate with a majority of the nodes in the cluster?
  • Site A nodes: yes, so they stay up
  • Site B nodes: no, so they drop out of cluster membership
NODE MAJORITY
[Diagram: the same 5-node cluster, majority = 3; disaster strikes Site A, the primary site]
• Site A nodes are down
• Site B nodes cannot communicate with a majority of the nodes, so they drop out of cluster membership
• Quorum must be forced manually to bring Site B online
FORCING QUORUM
• Forcing quorum is a way to manually override and start a node even though it has not achieved quorum
  • Always understand why quorum was lost
• Used to bring the cluster online without quorum
• The cluster starts in a special “forced” state
  • Once majority is achieved, it drops out of the “forced” state
• Command Line: net start clussvc /fixquorum (or /fq)
• PowerShell (R2): Start-ClusterNode -FixQuorum (or -fq)
MULTI-SITE WITH FILE SHARE WITNESS
• File Share Witness placed in Site C (branch office)
• Complete resiliency and automatic recovery from the loss of any one site
[Diagram: Site A and Site B are connected over the WAN; the file share witness \\Foo\Share resides in Site C]
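A minimal sketch of configuring this quorum model, reusing the share name from the diagram:
PowerShell (R2):
  # Switch to Node and File Share Majority with the witness share in Site C
  Set-ClusterQuorum -NodeAndFileShareMajority "\\Foo\Share"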
CHANGES IN SERVICE PACK 1
Node Vote Weight
• Granular control of which nodes have votes in determining quorum
• Flexibility for multi-site clusters
  • Prefer the primary site during a network split
  • Complete failure of the backup site will not bring down the cluster
• Command Line: Cluster.exe . node <NodeName> /prop NodeWeight=0
• PowerShell (R2): (Get-ClusterNode “NodeName”).NodeWeight = 0
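A sketch of removing the votes from both backup-site nodes (the B3/B4 names match the next slide; requires SP1):
PowerShell (R2):
  # With no votes in the backup site, a network split resolves in favor
  # of the primary site
  "B3","B4" | ForEach-Object { (Get-ClusterNode $_).NodeWeight = 0 }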
CHANGES IN SERVICE PACK 1
• Prevent Quorum
• Scenario: the admin started the backup site (B3 & B4) with the /ForceQuorum option
• When the primary site (N1 & N2) is restarted normally:
  • N1 & N2 overwrite the authoritative cluster configuration
  • Changes made by B3 & B4 are overwritten
• When the primary site is started with /PreventQuorum (recommended):
  • The quorum override is avoided
  • Changes made by B3 & B4 are maintained
  • N1 & N2 gracefully join the existing membership
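A sketch of the full recovery flow with the node names above (assumes SP1):
PowerShell (R2):
  # On a surviving backup-site node, force the cluster online
  Start-ClusterNode -Name "B3" -FixQuorum
  # When the primary site returns, start its nodes so they join the existing
  # membership instead of overwriting the authoritative configuration
  Start-ClusterNode -Name "N1" -PreventQuorum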
QUORUM MODEL RECAP
• Node and File Share Majority
  • Even number of nodes
  • Best availability solution: FSW in a 3rd site
• Node Majority
  • Odd number of nodes
  • More nodes in the primary site
• Node and Disk Majority
  • Use as directed by vendor
• No Majority: Disk Only
  • Not recommended
  • Use as directed by vendor
MULTI-SITE CLUSTERING CONTENT
Design guide:
http://technet.microsoft.com/en-us/library/dd197430.aspx
Deployment guide/checklist:
http://technet.microsoft.com/en-us/library/dd197546.aspx
QUESTIONS?
After the session, please fill out the questionnaire.
The questionnaires will be sent to your e-mail address and will also be available through your profile on the conference web portal, www.ntk.si.
By completing it, you help improve the conference.
Thank you!