High Availability with Microsoft SQL Server

Download Report

Transcript High Availability with Microsoft SQL Server

MICROSOFT SQL SERVER
HIGH AVAILABILITY
AND DISASTER RECOVERY
Michael Poremba // October 2008
Database HA & DR Experience…
2

Work with business to determine HA or DR
requirements for applications and data?

Design HA or DR solutions?

Administer HA or DR process?

Still learning MS SQL Server HA & DR capabilities?
Scope of this Presentation
3
Presentation Focus

Data Availability
Data recovery
 High availability
 Disaster recovery


Technology Focus
MS SQL Server
 Physical servers
 SANs

Beyond Scope of Presentation









In-depth how-to
(available elsewhere)
Partitioned views (federated)
Advanced DBA techniques
Custom application logic
3rd-party software solutions
Alternate DBMS engines
(e.g. Oracle; DB2)
HA on virtual machines
Complex scenarios & solutions
Load balancing
4
Introduction to Data Availability
So, you need to make your
production database bulletproof…
Data Availability Continuum
5
Degrees of protection for information systems:
Business Risk
Data loss
Solution
Redundant data
High Availability
Downtime of
database service
Redundant system
components
Disaster Recovery
Downtime of
business operations
Redundant systems
and facilities
Data Recovery
Business Case for Availability
6
High Availability


Keep business-critical
applications available
Secondary:
 Server
maintenance
Disaster Recovery


Protect against loss of
data center
Secondary:
 Application
upgrades
 Infrastructure upgrades
Service Level Agreement (SLA)
7




Permitted downtime (planned vs. unplanned?)
Uptime SLA
Downtime
per Year
Downtime
per Month
99.9%
8.76 hours
43.8 minutes
99.99%
52.6 minutes
4.38 minutes
99.999%
5.26 minutes
0.438 minutes
Acceptable data/transaction loss
Application response times
Mean time to recovery
Note: Database uptime is not equivalent to application availability


Failures of other application services
Network outages
Protect What?
8

Application data stores
Databases
 Files
 Other data repositories


Database services


DBMS availability for applications
Application services

Application availability for users and external systems
Databases are the heart of most information systems;
they deserve the highest affordable protection.
Database Failure Scenarios
9
Physical Infrastructure Failures

Storage subsystem
Logical Data Failures

 Disk
 DBMS
interruption
 Drops / deletes
 Controller



Network
Server
Power
Operator errors



Application defects
DBMS defects
Data corruption
Service Recovery Strategies
10
Standby
Mode
Cold
standby
Warm
standby
Failover Behavior
SQL Server Feature
• Manual intervention required • Backup and restore
to restore offline data copy
• Data copy online and ready • Transaction log
• Manual failover required
shipping
• Database mirroring
Hot
• Automatic failover
• Database mirroring
standby
• Failover clustering
Data Recovery—Terminology
11
Terminology varies for source vs. copy
High Availability Strategy
Data Source
Data Copy
Backup and Restore
Database
Backup
Log Shipping
Primary
Secondary
Standby
Database Mirroring
Principal
Mirror
Failover Clustering
Primary
Active
Secondary
Passive
Standby
Inactive
12
Data Recovery
[Briefly…]
Database Backups
13

Traditional backup types
Full backup
 Differential backup
 Transaction log backup


Disk is better than tape
First backup to disk (separate physical disk volume)
 Detect exceptions encountered during backup
 Verify backup files
 Copy backup files to tape or remote disk


Data retention policy for backup files
Database Backup Strategy
14
Backup of user databases not sufficient for recovery
 System database
 Master database
 MSDB database
 Model database
 External data stores…
Synch with External Data Stores
15
Synchronize recovered database with external data
stores:
 Identity column seeds
 Full-text indexes
(SQL Server 2000)



LDAP entries
File system objects
Other databases
Backup Retention Policy
16



Location of backup files
Duration of retention
Protection of sensitive data
 Sarbanes/Oxley
(SOX)
 HIPAA
 Internal

policies for data management and protection
Access to backups from offsite data storage
Data Recovery Process
17

Backup file sets



Full baseline, differential, and
transaction logs
Recovery strategy depends on
failure scenario

Retrieving backup files




Offsite storage
Tape
Network copy
Dependency on multiple
people to get access to
backup files





Create comprehensive failure
matrix
Devise recovery strategy for
each scenario
Does worst-case recovery
scenario fit within SLA
parameters?
Recovery time; SLA
Include future data growth in
recovery plan
Fully test recovery
strategies—practice is
essential
18
High Availability
High Availability
19

Minimize or avoid service downtime
 Whether

planned or unplanned
When components fail,
service interruption is brief or non-existent
 Automatic

failover
Eliminate single points of failure (as affordable)
 Redundant
components
 Fault-tolerant servers
Redundant Components
20
Objective: Avoid single points of failure (where affordable)
Approach: Use redundant components for database service
 Database server nodes
 Server components





DBMS instance
User databases
Storage devices
Storage unit components



MPIO: Interfaces; paths; switches; controllers
RAID: Disks
Networking


ECC RAM; failure-tolerant HW & OS
MPIO: Interfaces; paths; switches
Data copies

E.g. Recovering torn page from mirror in SQL Server 2008
Transaction Log Shipping
21


Warm standby solution
Duplicate user database
 Copy

transaction logs to standby server & restore
Database available for read-only access
 Users
must disconnect for logs to be applied
 Two database licenses required if querying standby



Manual application failover
Supported on standard hardware
Possible data loss (unapplied transactions)
Database Mirroring
22

Redundancy at user database level










Requires witness server
Mirror-aware application client connection

node A
node B
High-availability: commit @ log on mirror; automatic failover
High-protection: commit @ log on mirror; manual failover
High-performance: commit when logged on principal
Very fast automatic failover—seconds


Mirror always redoing transactions from principal
Negligible impact on transaction throughput
Multiple mirroring modes:


witness
(optional)
Mirrored over private network channel


Duplicate copy of user database
Independent storage devices
Multiple copies of instance databases
Provided by client library
Database connection string must specify both servers
Mirror may be available for read-only access (snapshots)
Works with standard hardware
Local Storage
· local sys DBs
· source user DB
Local Storage
· local sys DBs
· mirror user DB
Mirror Witness
23


With mirroring, more than one server is required to
decide on failover
Witness automates failover from primary to mirror
Watches database availability
 Reports observations back to principal and mirror




Runs in separate SQL Server instance (Express is OK)
Prevents “split brain” scenario
Very low resource consumption


Can be witness for multiple databases
Not a single point of failure
SQL Server Failover Clustering
24

Two clustered nodes
 Active/Passive

MS SQL services
 Running

config
on virtual server
Shared storage device
 User
databases
 System databases
 Quorum drive
 Redundant internal
components
node A
node B
Shared Storage
· system DBs
· user DBs
· quorum
Active/Passive Failover Clustering
25

Redundancy at database instance level



Single data copy on shared storage
device





SQL Agent; Analysis Services; Full-Text
engine, MS DTC
Automatic failover (up to minutes)
DBMS accessed over virtual IP
Database not available from inactive
node for DB client connections


No I/O overhead reducing throughput
Storage unit is single point of failure for
cluster
node A
node B
All database services are clustered


All databases fail over together
Shared copy of system databases
Storage is controlled by one cluster node
at a time
Requires hardware certified by Microsoft
for Microsoft Cluster Service
Shared Storage
· system DBs
· user DBs
· quorum
HA Comparison
26
Database Mirroring









Scope: user DB
Standard hardware
One SQL license
(unless querying snapshots on
mirror)
Very fast failover (seconds)
OS flexible (e.g. 32/64)
Independent storage
Independent services
Reporting on mirror
Geographic separation OK
Failover Clustering









Scope: DBMS instance
Certified hardware
One SQL license
(only one node can access
database)
Automatic failover (up to minutes)
Enterprise OS
Shared storage
Clustered services
Standby not available
Servers are usually co-located
Considerations for HA
27

HA complements backup and recovery strategy


Does not replace data recovery plan
Application service availability is often determined by
a network of interdependent services
Availability can be difficult to define (e.g. partial failures)
 Failure probability difficult to measure or compute


Increased system complexity could lead to lower service
availability!
Operator error a leading cause of availability issues
 Increased number/types of system components
 More complex to configure and administer

Data Recovery Requirements
Requirements
Backup and
Recovery
Log Shipping
DB Mirroring –
High-Performance
DB Mirroring –
High-Protection
DB Mirroring –
High-Availability
Failover Clustering
28
Cost
Low
Low/Med
Medium
Medium
Medium
High
Relative complexity
Low
Low
Medium
Medium
High
High
Data loss
Possible
Latest log
Possible
None
None
None
Scope of duplication
Database
Database
Database
Database
Database
DBMS
Failover
Downtime
Downtime
Manual
Manual
Seconds
Up to minutes
Client redirect
Manual
Manual
Automatic
Automatic
Automatic
Automatic
Rolling upgrades & maint.
No
No
OS & DB
OS & DB
OS & DB
OS
Access data on secondary
Restore
Read-only
Snapshot
Snapshot
Snapshot
No
Geographic separation
OK
OK
OK
Latency?
Latency?
Latency?
29
Disaster Recovery
Disaster Recovery
30

Minimize downtime of business operations
 Redundant

systems and facilities
SQL Server features:
 Transaction
log shipping
 Database mirroring
 Failover clustering

Other technologies
 Storage-based
mirroring
Disaster Recovery Planning
31







Data security requirements
Clarify SLA, data loss allowance
Evaluate system cost vs. data protection
Failure analysis
System redundancy
Process validation
Training for personnel
Prevention practices
 Executing disaster recovery and business continuity


Practice, practice, practice
Business Continuity Facility
32

System redundancy
 Systems:
Web servers app servers; database, etc.
 Data: Databases; data files on OS; security info, etc.
 Networking: Domain, routing, subnet, VIPs, etc.

Alternate facilities
 Network
bandwidth
 Physical or network access by operations staff

Failover
 Often
a deliberate decision, using manual failover
Data Redundancy
33

Synchronous redundancy
Network bandwidth cost
 Network latency and application performance
 Network reliability


Asynchronous redundancy
Risk of data loss
 More cost-effective
 Resilient to network latency issues


Candidate Technologies
SQL Server database mirroring
 Failover clustering with SAN-based mirroring

DR Using Database Mirroring
34



Two sites: Primary and DR location
Separate failover clusters at each site
SQL Server database mirroring between sites
witness
(optional)
failover cluster at site A
node A1
failover cluster at site B
node A2
Shared Storage A
· local sys DBs
· local quorum
· source user DB
database
mirroring
node B1
node B2
Shared Storage B
· local sys DBs
· local quorum
· mirror user DB
DR Using SAN-Based Mirroring
35




Two sites: Primary and DR location
Four-node failover cluster; one virtual IP address
SAN-based mirroring between sites
Manual cluster failover
failover cluster nodes at site A
node A1
failover cluster nodes at site B
node A2
Shared Storage A
· system DBs
· quorum
· user DBs
node B1
storagebased
mirroring
node B2
Shared Storage B
· system DBs
· quorum
· user DBs
36
Complimentary Technologies
[Skip if time is running short.]
SAN-Based Data Mirroring
37

Data blocks duplicated at storage level


Copy performed in sequence and coordinated with
database checkpoint




Ensures consistency of mirrored data files
Synchronous or asynchronous mirroring
Co-located or geographically dispersed—both are OK


Similar to transaction log shipping
SAN link bandwidth must support database I/O rate
May require extra feature support from SAN vendor
Could rely on Failover Clustering for HA
SQL Server Database Snapshots
38


Read-only point-in-time database snapshot
No data is copied—instantaneous
 Historical
snapshot pages tracked separately from
changing pages

Snapshots can be maintained indefinitely
 Limited

only by available storage
Snapshot copy can be used for reporting
 Read-only,
so no locking issues
SQL Server Replication
39

Transactional replication




Merge replication



High transaction volume
Low data latency required
Mixed technologies:
Integrates with other DBMS
Bi-directional data changes
Typically server-to-client
Snapshot replication



Large, infrequent data
changes
Data change latency OK
Best for smaller data sets




Subscriber databases
available for reporting
Replicate data subsets
Some data loss is possible
Periodically validate
replicated data
40
App Development and Admin
Considerations for App Developers
41












App services tolerant to database service interruptions
Application transactions must be handled in code—data consistency
Exception handling for transaction retry, connection recovery
Requires coding standards, code reviews, and testing
Bulk data operations
Transaction volume impacts rollback time during failover
Batch jobs must be run on alternate nodes
Don’t bypass transaction logging
Synchronization with external data sources?
Be aware of database recovery model
Mirroring uses FailoverPartner in connection string
Use TCP/IP as client protocol
Considerations for Admins
42


Use identical server hardware, when possible
Design network redundancies, when feasible




Always manage through virtual cluster, not individual cluster nodes
Retest failover/failback after HA maintenance
Diagnose after failover





Consider network latency for geographic separation
Repair alternate node
Resynchronize data, as necessary
Be aware of primary/secondary locations
Ensure application services are connected and functioning properly
Keep server node configurations synchronized:



Service pack and patch levels
Duplicate non-redundant resources
Jobs; logins and permissions; OS & sys objects
HA Risks
43






System performance degradation
HA system complexity leads to availability issues
Some system failures not planned for
Backup and recovery planning incomplete
Administrators not fully trained or informed
User databases not synchronized with other data
sources
Common Admin Use Cases
44

Maintain HA nodes
 Hardware
maintenance
 Rolling upgrades and software patches

Resynchronize the redundant copy
 Re-synch
mirror
 Restart log shipping

Diagnose and repair
 Diagnose
cause of failover
 Repair failed node and restore failover capabilities
 Test failover and failback
Common Admin Actions
45
Train and practice administrators to:
 Initiate a database mirror
 Manually failover mirror database or cluster node
 Add/remove passive node from mirror or cluster
 Upgrade/patch servers nodes
 Restart or redirect application services
46
More Information
References—Books
47
High Availability



Microsoft SQL Server 2008 High
Availability with Clustering &
Database Mirroring
by Michael Otey, 2009.
Microsoft SQL Server High
Availability
by Paul Bertucci, 2004.
Pro SQL Server 2005 High
Availability
by Allan Hirt, 2007.
Related Topics



Pro SQL Server 2005 Replication
by Sujoy Paul, 2006.
Pro SQL Server 2005 Service Broker
by Klaus Aschenbrenner, 2007.
The Rational Guide to SQL Server
2005 Service Broker
by Roger Wolter, 2006.
References—Presentations
48







Microsoft Load Balancing and Clustering
http://ce.sharif.edu/courses/84-85/2/ce317/resources/root/lecture%20slides/
14.%20Microsoft%20Load%20Balancing%20and%20Clustering.ppt
SQL Server 2005 High Availability
http://www.atlantamdf.com/Presentations/AtlantaMDF_111207HA.ppt
High Availability Technologies In SQL Server 2000 And SQL Server 2005
http://202.181.238.2/hk/teched2004/ppt/Day_2_Rm407/DAT431(1330-1445).ppt
Meeting the Availability Challenge
http://download.microsoft.com/download/E/D/C/EDCF54DB-19CD-4882-9FC44F7D46FCEAA6/HighAvailability.ppt
Disaster Recovery Mistakes
http://www.sqlsig.org/Oct%2011%20DASSUG%20-%20Jason%20Hall%2010-11-07%20MM.ppt
SQL Server 2005 High Availability
http://blogs.msdn.com/sql2005event/attachment/564303.ashx
Effective Usage of SQL Server 2005 Database Mirroring
http://www.sqlserver-qa.net/SSQAEffective%20Usage%20of%20SQL%20Server%202005%20Database%20Mirroring_show.ppt
References—Articles
49

Achieve High Availability for SQL Server
http://technet.microsoft.com/en-us/magazine/cc162477.aspx

Geographically Dispersed Clusters in Windows
Server 2003
http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx

Restoring file and filegroup backups
http://support.microsoft.com/kb/281122/en-us

Restoring specific tables or rows from backups
http://support.microsoft.com/kb/321836/en-us

Maintaining Availability During Upgrades
http://msdn.microsoft.com/en-us/library/ms191449.aspx