Contact Information
https://github.com/RowdyVinson
[email protected]
@RowdyVinson
https://www.linkedin.com/in/rowdyvinson-0600024
Using Failover Clusters for High Availability
ROWDY VINSON
Who are you?
Who here is a developer-turned-DBA? A systems admin/engineer? Other?
Does anyone work in the virtualization/storage stack regularly?
Who knows PowerShell well enough to change a setting with a script?
We’ll be touching on concepts in all of these areas today, so please ask questions if something doesn’t translate well into your native tongue.
Who I am
In my free time I build things, play video games, and argue the merits of Star Trek over all other “Star”-based sci-fi fandoms.
Professionally, I’m a Systems Engineer and DBA who has been working with SQL Server in various roles for about a decade.
My team is responsible for the delivery and support of 300 virtual servers in a Tier 4 datacenter environment.
I’m responsible for the health of about 60 SQL Servers and about 400 user databases supporting 30 different applications.
Why is HA important?
All servers go down eventually
Planned
  Upgrades (hardware and software)
  Patches (OS and applications)
Unplanned
  Failures
  Accidents
  Jr. Sys Admins (or Sr. Sys Admins between the hours of 6pm Friday and 9am Monday)
Save your skin and start planning for this
How do we get HA? Clusters!
What is a “Cluster”?
Nodes
  Configured servers
Roles
  Services
Resources
  Storage
  Names/IPs
  Networks
How do I keep my cluster happy?
Quorum rules
Cluster Validation
  This is required for Microsoft Support
Redundant hardware platform
  VMware host isolation
  Network multipathing
  HA SAN
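To make the quorum rule concrete, here is a minimal sketch of the majority-vote idea. It assumes a simple model in which every node and the witness carries one vote; the function name and the three-vote example are illustrative and not taken from the slides.

```python
def has_quorum(votes_online, total_votes):
    """Return True while a strict majority of configured votes is reachable."""
    return votes_online > total_votes // 2


# Assumed example: two nodes plus a disk witness, so three votes in total.
TOTAL_VOTES = 3
for online in range(TOTAL_VOTES + 1):
    state = "quorum held" if has_quorum(online, TOTAL_VOTES) else "quorum lost"
    print(f"{online}/{TOTAL_VOTES} votes online: {state}")
```

In this model an even vote count gains nothing over the odd count below it: four votes still need three online, which is one reason a witness is usually added to a two-node cluster.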
How do they react to X?
Maintenance (Demo 1)
Node-specific Failure (Demo 2)
Role performance issues (Demo 3)
Environment failure (Demo 4)
Demo 0
Scenario: Check-Cluster*
Outcome: This gives us an overview of the state of the cluster. We’ll also use it to populate variables for use later in the demos.
*This is a script of my own design that we’ve found valuable for troubleshooting in our environment. I’m adding some logic to it to improve remediation of issues, but this is a work in progress.
Demo 1
Scenario: The node that the SQL role is running on is gracefully shut down.
Outcome: What we see here is a best case for clustering. The non-active node notices 5 seconds of heartbeat failure (the default threshold is 5 missed heartbeats and the default interval is 1 second) and initiates a role recovery on itself.
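The threshold-and-interval arithmetic can be sketched as a small simulation. This is only an illustration of the counting logic under the defaults quoted above (threshold 5, interval 1 second); the function name and the boolean timeline are assumptions, and the real cluster service is considerably more involved.

```python
def seconds_until_failover(heartbeats, threshold=5, interval_s=1):
    """Return the elapsed seconds at which `threshold` consecutive heartbeats
    have been missed, or None if that never happens.

    `heartbeats` holds one boolean per interval (True = heartbeat received).
    """
    missed = 0
    for tick, received in enumerate(heartbeats, start=1):
        # Any received heartbeat resets the count of consecutive misses.
        missed = 0 if received else missed + 1
        if missed >= threshold:
            return tick * interval_s
    return None


# The active node answers three heartbeats and then goes away for good.
timeline = [True] * 3 + [False] * 10
print(seconds_until_failover(timeline))
```

With this timeline the partner node reacts at second 8, five seconds after the last heartbeat it received. A short blip of fewer than five consecutive misses returns None and triggers nothing.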
Demo 2
Scenario: NIC is disabled on the node running the SQL role.
Outcome: In this case, a real “failure” happens from the perspective of the cluster service. The role is shifted to the other node and service is restored.
Demo 3
Scenario: High node resource use.
Outcome: This demonstrates that most typical performance-related issues will not cause failover. HA is not a solution to performance problems.
Demo 4
Scenario: Interrupted iSCSI connection (failed SAN connection).
Outcome: The cluster loses access to its quorum drive and to the storage resources for the SQL role, so both nodes assume they are wounded and stop serving the role.
This error extends well beyond the DBA team. The systems team and the network team would have to coordinate to resolve it. Redundancy in the network and virtualization stacks could help reduce this risk, but it is never zero.
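The difference between Demo 2 and Demo 4 comes down to whether any node can still reach everything the role depends on. A rough sketch of that check follows; the resource names, node names, and the `eligible_nodes` helper are all hypothetical and only model the idea, not the cluster service itself.

```python
# Hypothetical resources the SQL role depends on; real clusters define their own.
REQUIRED_RESOURCES = ("network", "quorum disk", "data disk")


def eligible_nodes(nodes):
    """Return the names of nodes that can still reach every required resource.

    `nodes` maps a node name to a dict of resource name -> reachable (bool).
    """
    return [
        name
        for name, reachable in nodes.items()
        if all(reachable.get(resource, False) for resource in REQUIRED_RESOURCES)
    ]


# Demo 2 shape: only the active node loses its NIC, so the role has somewhere to go.
nic_failure = {
    "node-a": {"network": False, "quorum disk": True, "data disk": True},
    "node-b": {"network": True, "quorum disk": True, "data disk": True},
}

# Demo 4 shape: the SAN connection drops for every node, so no node qualifies.
san_failure = {
    "node-a": {"network": True, "quorum disk": False, "data disk": False},
    "node-b": {"network": True, "quorum disk": False, "data disk": False},
}

print(eligible_nodes(nic_failure))
print(eligible_nodes(san_failure))
```

When the list comes back empty there is nowhere left to fail over to, which is why shared dependencies such as the SAN need their own redundancy.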
Can my team support a cluster?
What you need:
Strong server support team
  Virtualization expertise may be required here if it is used
  Windows OS expertise is a must
  PowerShell is a must
  Enough staff to handle SLA commitments during turnover/vacations
Significant redundancy in the environment
  High-tier datacenter
  Redundancy in networks and virtualization
Are they right for me?
¯\_(ツ)_/¯
Questions?
More Reading
Edwin Sarmiento : http://www.edwinmsarmiento.com/resources-windowsserver-failover-clustering-wsfc-for-sql-server-dbas/
Cluster Quorum info: https://technet.microsoft.com/en-us/library/jj612870(v=ws.11).aspx
More Quorum info: https://blogs.msdn.microsoft.com/sqlalwayson/2012/03/13/quorum-voteconfiguration-check-in-alwayson-availability-group-wizards-andy-jing/
DR matrixes: https://www.brentozar.com/archive/2014/05/new-highavailability-planning-worksheet/