Implementing HA/DR based on an SLA
=tg= Thomas Grohser, NTT Data
SQL Server MVP
SQL Server Performance Engineering
SQL Saturday #500 Boston, MA March 19th 2016
select * from =tg= where topic = @@Version

=tg= Thomas Grohser, NTT DATA
Senior Director Technical Solutions Architecture
email: [email protected] / [email protected]
Focus on SQL Server Security, Performance Engineering, Infrastructure and Architecture
21 years SQL Server
New papers coming 2016
Close relationship with
• SQLCAT (SQL Server Customer Advisory Team)
• SCAN (SQL Server Customer Advisory Network)
• TAP (Technology Adoption Program)
• Product Teams in Redmond
Active PASS member and PASS Summit Speaker

@@Version | Remark
SQL 4.21 | First SQL Server ever used (1994)
SQL 6.0 | First Log Shipping with failover
SQL 6.5 | First SQL Server Cluster (NT 4.0 + Wolfpack)
SQL 7.0 | 2+ billion rows / month in a single table
SQL 2000 | 938 days with 100% availability
SQL 2000 IA64 | First SQL Server on Itanium IA64
SQL 2005 IA64 | First OLTP long distance database mirroring
SQL 2008 IA64 | First replication into mirrored databases
SQL 2008R2 IA64 | First 256 CPUs & > 500,000 STMT/sec
SQL 2008R2 x64 | First scale-out > 1,000,000 STMT/sec; first time 1.2+ trillion rows in a table
SQL 2012 | > 220,000 transactions per second; > 1.3 trillion rows in a table
SQL 2014 | > 400,000 transactions per second; fully automated deploy and management; AlwaysOn automatic HA and DR
SQL 2016 | Can't wait to raise the bar again
NTT DATA Overview
• 20,000 professionals – Optimizing
balanced global delivery
• $1.6B – Annual revenues with history of
above-market growth
• Long-term relationships – >1,000
clients; mid-market to large enterprise
• Delivery excellence – Enabled by
process maturity, tools and accelerators
• Flexible engagement – Spans
consulting, staffing, managed services,
outsourcing, and cloud
• Industry expertise – Driving depth in
select industry verticals
Why NTT DATA for MS Services:
• NTT DATA is a Microsoft Gold
Certified Partner. We cover
the entire MS Stack, from
applications to infrastructure to
the cloud
• Proven track record with 500+
MS solutions delivered in the
past 20 years
Drawing at the end of the session
Drop your business card or fill out provided
blank card and drop in the red bag
Must be present at the time of drawing at the
end of the session to win:
Agenda
What can go wrong?
A look at the SLA
A tour of HA/DR features/options
The Surprise
Q&A
ATTENTION:
Important Information
may be displayed on
any slide at any time!
! Without Warning !
What can go wrong?
Single component failure
Server failure
Datacenter failure
Network failure between datacenters
Data loss/corruption
There is a lot more that can go wrong but we
have only one hour today
A look at the SLA
Don’t confuse luck with keeping the SLA!
It is easy to achieve 99.999%
It is almost impossible to guarantee
An SLA boils down to
RPO
RTO
with different values during different times of the day/week/month
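For perspective, the arithmetic behind that 99.999% figure: it leaves only 0.001% of the year for downtime, i.e. 0.00001 × 365 × 24 × 60 ≈ 5.3 minutes per year, total, across every failure and every maintenance action.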
RPO – Recovery Point Objective
In plain English: How much data can we lose?
None
This is what you always aim for
Some
Seconds
Minutes
Hours
Days
All
This is not the worst that can happen
All and the competition “finds” it
RTO – Recovery Time Objective
In plain English: How much time after a
failure till we have to be up and running again
None
Impossible
Some
Seconds
Minutes
Hours
Days
Forever
HA/DR/LR explained
HA – High Availability
RTO: seconds to minutes
RPO: Zero to seconds
Automatic failover
Well tested (maybe with each patch or release)
DR – Disaster Recovery
RTO: minutes to hours
RPO: seconds to minutes (even hours)
Manual failover into prepared environment
Tested from time to time
LR – Last Resort
RTO: days to weeks
RPO: minutes to hours (even a whole day)
Rebuild the system from scratch (hardware has to be ordered, floor space and connectivity have to be rented)
Have a rough plan
A good bare minimum SLA

Failure Modes (single failures): SLA | RTO | RPO
Single component failure: HA | seconds | close to zero
Server failure: HA (DR) | seconds | close to zero
Datacenter failure: DR | minutes to hours | minutes
Network failure between datacenters: HA/DR | minutes | seconds
Data loss/corruption (Deletion): HA/DR/LR | days | a lot
Data loss/corruption (Corruption): DR (LR) | days | a lot
Data loss/corruption (Sabotage): LR (DR) | days | a lot

Concurrent failures:
Any different double failure: HA/DR
Any double failure: DR/LR
Any triple failure: LR

Priorities:
Recoverability (no data loss)
Availability (keep going)
Performance (keep running)

Defined maintenance windows and adjusted values based on the day and time of the day
Sample SLA

Failure Modes (single failures): SLA | RTO | RPO
Single component failure: HA | < 30 seconds | close to zero
Server failure: HA | < 30 seconds | close to zero
Datacenter failure: DR | < 30 minutes | up to 1 minute
Network failure between datacenters: DR | < 30 minutes | up to 1 minute
Data loss/corruption (Deletion): DR | 3 days | up to one day
Data loss/corruption (Corruption): DR | 3 days | up to one day
Data loss/corruption (Sabotage): LR | 3 days | up to one day

During Maintenance Window: N/A for HA failures; otherwise RTO 30 minutes after end of window, RPO as of before the maintenance window

Concurrent failures:
Any different double failure: HA
Any double failure: DR/LR
Any triple failure: LR

Priorities:
Recoverability (no data loss)
Availability (keep going)
Performance (keep running)

Defined maintenance windows Saturday 1pm till Sunday 9pm except if EOM
A tour of HA/DR features/options
Backup / Restore
Available since SQL 4.x (1.x but that one didn’t run on Windows)
A backup is a consistent point in time COPY of the database
Backups contain all data up to the point the backup finishes
Snapshots are not backups
Backups need to be verified (restored and then DBCC CHECKDB
validated)
RTO
Depends on the backup/restore solution, disk speed, and DB size
Native Backup/Restore throughput:
1 Gb/s network: 5-6 GB/minute (300-350 GB/hour per NIC)
Local rotating spindle: 5-12 GB/minute (500-700 GB/hour per spindle)
Plus Database Recovery Time
RPO
Depends on the backup frequency and if the last backup is good
There is no excuse for not having a valid backup!
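A minimal T-SQL sketch of that verification loop (database name, backup path, and logical file names are placeholders, not from the session):

-- Back up with page checksums so media errors are detectable
BACKUP DATABASE [MyDB]
    TO DISK = N'\\backupserver\sql\MyDB_full.bak'   -- hypothetical path
    WITH CHECKSUM, COMPRESSION, STATS = 10;

-- Quick media check (does NOT replace a real restore test)
RESTORE VERIFYONLY
    FROM DISK = N'\\backupserver\sql\MyDB_full.bak'
    WITH CHECKSUM;

-- Real verification: restore to a copy and check consistency
RESTORE DATABASE [MyDB_Verify]
    FROM DISK = N'\\backupserver\sql\MyDB_full.bak'
    WITH MOVE N'MyDB'     TO N'E:\verify\MyDB_Verify.mdf',   -- logical names assumed
         MOVE N'MyDB_log' TO N'E:\verify\MyDB_Verify.ldf',
         RECOVERY;

DBCC CHECKDB (N'MyDB_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;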
Log Shipping
Available since SQL 6.0 (manual possible before)
Take transaction log backups and optionally restore them, as fast as possible or with a delay, to one or more targets.
Manual failover
Failback possible but not if managed via GUI
RTO
Manual process
Depends on
frequency and timing of backup, copy, restore job
Database Recovery Time
RPO
Depends on frequency and timing of backup and copy job
Restores fail from time to time; the achievable RPO gets worse until they are fixed
Stagger the backup, copy, and restore jobs (by default they are all scheduled at the same time):
backup, then copy a minute later, then restore a minute later
Don't run the jobs at the same time for all databases
Failover Clustering
Available since SQL 6.5 (useful since 7.0)
Disks are connected to multiple servers and are used only on the active node.
The SQL Server Service is installed on all nodes but active only on one at a
time.
The most important component (the data) is the only one that does not exist twice
Depending on the technology, the nodes have to be close or very close to each other
RTO
Automatic failover in some cases
Depends on
Storage System
Number of LUNS (a.k.a. Volumes, Disks)
System Shutdown Time (depends on amount of memory and last checkpoint time)
System Startup Time
Database Recovery Time
Depends on DNS refresh time if the failover crosses networks
RPO
Since the same disks are used, it's either zero (good) or forever (bad)
Not covering data deletion/corruption
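A quick way to see where a clustered instance is currently running, using only built-in server properties and DMVs:

-- 1 if this instance is installed as a failover cluster instance,
-- plus the node it is currently running on
SELECT SERVERPROPERTY('IsClustered')                   AS IsClustered,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS')   AS ActiveNode;

-- Nodes that can host this instance
SELECT NodeName FROM sys.dm_os_cluster_nodes;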
Database Mirroring
Available since SQL 2005
Transactions are sent to a secondary machine that holds a copy of the
database and are committed on both machines. Synchronous or
asynchronous commit can be configured. The mirror can be made readable
(via database snapshots).
Automatic failover if a witness is configured.
Works great across datacenters, can be combined with Log Shipping
and/or Clustering
RTO
Database is already running with a warm cache on the secondary
Depending on the size of the redo queue
Depending on Database Recovery Time
RPO
Depending on synchronous (zero) or asynchronous (depending on network speed and
transaction volume)
Issue: if the secondary fails, the primary continues without it after a few seconds; the effective RPO can then increase. This must be caught by monitoring.
Does not cover data deletion; might cover some cases of physical data corruption.
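The monitoring that catches the growing-RPO issue can be a simple poll of sys.database_mirroring; a sketch (the alerting side, e.g. a SQL Agent alert or database mail, is left out):

-- Mirrored databases that are currently not synchronized
-- (DISCONNECTED or SUSPENDED means the principal may be running exposed)
SELECT DB_NAME(database_id)          AS database_name,
       mirroring_state_desc,
       mirroring_witness_state_desc
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL
  AND mirroring_state_desc <> 'SYNCHRONIZED';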
AlwaysOn FCI – Failover Cluster Instance
Available since SQL 2012
Very similar to Failover Clustering
Better failure detection and automatic failover
Other than that, same RTO/RPO as clustering
AlwaysOn AG – Availability Groups
Available since SQL 2012 (enhanced in 2014 and 2016)
Transactions are sent to multiple secondary machines that hold copies of the
database and are committed on all synchronous replicas. Synchronous or asynchronous
commit can be configured per replica. Replicas can be made readable.
Automatic failover if a witness is configured or there is an odd number of replicas.
Works great across datacenters, can be combined with Log Shipping and/or
AlwaysOn FCI
RTO
Database is already running with a warm cache on the secondaries
Depending on redo queue size
Depending on Database Recovery Time
Depends on DNS refresh time if the failover crosses networks
RPO
Depending on synchronous (zero) or asynchronous (depending on network speed and transaction
volume)
Issue: if a secondary fails, the primary continues without it after a few seconds; the effective RPO can then increase. This must be caught by monitoring.
Does not cover data deletion; might cover some cases of physical data corruption.
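The equivalent check for Availability Groups, sketched against the standard DMVs; the send queue approximates data at risk (RPO) and the redo queue approximates failover time (RTO):

-- Per replica and database: synchronization state and queue sizes
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)          AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,          -- KB of log not yet sent (drives RPO when asynchronous)
       drs.redo_queue_size               -- KB of log not yet redone (drives RTO on failover)
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas           AS ar ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;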
Other HA/DR Technologies
use at your own risk
Transactional/Merge/Peer2Peer Replication
All kinds of storage replications
Please do a real test when using more than one disk
Real test = full random write load generated by SQL Server (e.g. SQL Stress) on all disks, then pull the plug
If the databases come online on the other side with no data loss and no corruption, try again
If this works 3 times you will be the first person on the planet to have successfully tested this
All kinds of VM replication
Magic and luck (a.k.a. software driven solutions)
Combining HA/DR Features
Backup/Restore
Must (not can – must) be combined with all others. You
need backups of your database
Log Shipping
Should (maybe even must) be combined with all others.
Some extra work required if combined with
Mirroring/AlwaysOn Availability Groups
Log backups must be taken from secondary after failover
Mirroring/AlwaysOn Availability Groups without
automatic failover and clustering works as expected
Mirroring/AlwaysOn Availability Groups with
automatic failover and clustering always creates a
double failover
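One hedged pattern for the log backup part when Availability Groups are involved: run the same backup job on every replica and let the built-in preference function decide which one actually takes the backup, so the job keeps working after a failover (database name and share are placeholders):

-- SQL 2012+: returns 1 only on the replica preferred for backups
IF sys.fn_hadr_backup_is_preferred_replica(N'MyDB') = 1
BEGIN
    BACKUP LOG [MyDB]
        TO DISK = N'\\logshare\MyDB\MyDB_log.trn'   -- hypothetical share
        WITH CHECKSUM, INIT;
END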
RTO - Summary
Detect Failure
Verify Failure
Notify Failure
(only with manual failover)
Verify Failure
(only with manual failover)
Initiate Failover
Failover
Database Recovery
Repointing users
(only with manual failover)
How long does it take to detect a failure
The more severe the problem the shorter
usually the detection time
How often do you check your system health?
Every 5 minutes
That means if I as a DBA can fail over the system to DR
in less than 4.9 minutes, it's not actually an outage?
How long does it take to react to a failure
Automatic Failover (seconds, depending on
configuration)
Manual Failover
Who is the DBA on call
Does he/she pick up immediately
Communicate the problem
Connect to Network
Login
How long does it take to fail over
It depends
Backup Restore
Hours
Log Shipping
Minutes
Clustering / AlwaysOn FCI
Seconds to Minutes
Plus time to mount the disks
Possibly time to verify the disks (since Windows Server 2012 that takes
seconds; before that it could be hours).
DB Mirroring / AlwaysOn AG
Seconds
A bit more if it's a cross-network failover
Database Recovery
It depends
On oldest uncommitted transaction
Number and size of transactions in flight that
need rollback
So somewhere between
Less than a second
Several weeks
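To see how exposed a system is to long recovery, the built-in commands below show the oldest open transaction; both are standard, nothing to install:

-- Oldest active transaction in the current database (SPID and start time)
DBCC OPENTRAN;

-- Open transactions across the instance, oldest first
SELECT st.session_id,
       dt.transaction_id,
       DB_NAME(dt.database_id)             AS database_name,
       dt.database_transaction_begin_time
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions  AS st
  ON st.transaction_id = dt.transaction_id
ORDER BY dt.database_transaction_begin_time;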
RPO Summary
How good is your Monitoring
How old is the last known good backup
How old is the last known good log backup
How far behind are the secondaries
Data deletion or corruption: how long does
restore take…
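Answering the backup-age questions above should not require guessing; msdb keeps the history. A sketch (it shows when backups finished, not whether they were verified as good):

-- Most recent full ('D') and log ('L') backup per database
SELECT d.name AS database_name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full_backup,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b ON b.database_name = d.name
GROUP BY d.name
ORDER BY d.name;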
RTO/RPO Improvement
To handle data deletion / corruption
Multiple delayed log shipping copies of the database
Stop the restore the moment you detect the corruption, recover to just
before the corruption happened, bring the database online (see the
sketch after this list)
Recover the data
To minimize Database Recovery Time
Monitor long running transactions and alert on them.
Design systems around them.
Partition large tables and rebuild indexes partition by
partition
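A sketch of the stop-and-recover step on a delayed log shipping copy (timestamp, database name, and path are placeholders; the copy must still be unrecovered, i.e. in NORECOVERY or STANDBY):

-- Apply log backups only up to just before the bad statement ran
RESTORE LOG [MyDB]
    FROM DISK = N'\\logshare\MyDB\MyDB_20160319_1015.trn'
    WITH STOPAT = '2016-03-19 10:12:00', NORECOVERY;

-- Bring the delayed copy online as of that point in time
RESTORE DATABASE [MyDB] WITH RECOVERY;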
Multiple Databases
How to handle recovery if more than one
database is involved and the data between the
databases needs to be consistent
If you have data loss in one or more of the databases
And if one or more of the following statements is true
You have Cross Database Transactions
You have Distributed Transaction between
2 or more databases on the same server
2 or more databases on different servers
You need either
A plan for how to restore them all to the last point in time they were in sync
A plan for how to reconcile the data into a consistent state
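One hedged way to make "the last point in time they were in sync" restorable is to periodically issue a named, marked transaction that touches every involved database; each database's log then carries the same mark and can be restored to it. A sketch (the SyncMarker tables, database names, and paths are placeholders):

-- Run on a schedule; the mark is written to the log of every database
-- the transaction modifies
BEGIN TRANSACTION SyncPoint WITH MARK 'Cross-database sync point';
    UPDATE MyDB1.dbo.SyncMarker SET LastMark = SYSDATETIME();
    UPDATE MyDB2.dbo.SyncMarker SET LastMark = SYSDATETIME();
COMMIT TRANSACTION SyncPoint;

-- During recovery (after restoring full/diff backups WITH NORECOVERY),
-- stop each database at the same mark
RESTORE LOG [MyDB1]
    FROM DISK = N'\\logshare\MyDB1\MyDB1_log.trn'
    WITH STOPATMARK = 'SyncPoint', RECOVERY;
RESTORE LOG [MyDB2]
    FROM DISK = N'\\logshare\MyDB2\MyDB2_log.trn'
    WITH STOPATMARK = 'SyncPoint', RECOVERY;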
Prioritized list of all systems
This is an absolute must
If two or more systems are down at the same time, this list helps make the hard decisions
Who gets recovered first
The list can depend on the point in time (e.g.
different priority on the weekend than during
the week)
THANK YOU!
and may the force be with you…
Questions?
[email protected]
[email protected]