Exchange Server 2013 High Availability and Site Resilience

Download Report

Transcript Exchange Server 2013 High Availability and Site Resilience

Capabilities are subject to change
Packaging and licensing have not yet been determined
Any screen captures or concepts shown are pre-release and for illustration purposes only
Disclaimer
This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein.
The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because
Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the
accuracy of any information presented after the date of the presentation. This presentation is for informational purposes only.
MICROSOFT MAKES NO WARRANTIES, EXPRESSED, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly
provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other
intellectual property.
Database sizes
must be
manageable
Reseeds must
be fast and
reliable
Passive copy
disk IOPS
inefficient
Lagged copies
Limited agility
have assymetric from low disk
storage design space recovery
Capacity is
increasing,
but IOPS aren’t
Server1
DB1 Passive
Server2
Server3
20MB/Sec
DB1 Active
20MB/Se
DB2 Passive
DB2 Active
DB1 c
Passive
DB3 Passive
DB4 Active
Server4
DB1
Active
12MB/Sec
DB1
Passive
DB3 Active
12MB/Sec
DB4 Active
20MB/Sec
DAG
Single database copy/disk:
• Reseed 2TB Database = ~23 hrs
• Reseed 8TB Database = ~93 hrs
4 database copies/disk:
• Reseed 2TB Disk = ~9.7 hrs
• Reseed 8TB Disk = ~39 hrs
Server1
Server2
Server3
Server4
50% IOPS
40%
Utilization
40% IOPS
IOPS
Utilization
Utilization
50% IOPS
40%
Utilization
40% IOPS
IOPS
Utilization
Utilization
30% IOPS
Utilization
50% IOPS
40%
Utilization
40% IOPS
IOPS
Utilization
Utilization
30% IOPS
Utilization
50% IOPS
40%
IOPS
Utilization
Utilization
30% IOPS
Utilization
Active
A/A/P/L
A/P/P/L
Passive
A/A/P/L
Passive
A/P/P/L
Passive
A/A/P/L
Passive
A/P/P/L
Lag
A/A/P/L
A/P/P/L
Lag
DAG
Exchange 2013 failover speed 2x better than Exchange 2010!
Decreased IO latency = better user experience
Exchange Server 2010
Passive database IOPS = active database IOPS
• Active = 100MB Checkpoint Depth
• Passives = 5MB Checkpoint Depth
Exchange Server 2013
Passive database IOPS = 50% of active database IOPS
• IOPS Savings + ESE Fast failover = 100MB
Checkpoint Depth on passives with no failover perf
penalty
25% increase in aggregate disk utilization, e.g.,
• 4 database copies/disk
• Balanced = 1 Active, 2 Passives, 1 Lag
Failure Mode: Actives/disk doubles*
• 2 Active, 1 Passive, 1 Lag
• 50% IOPS Utilization
passive database IO
=
active database IO
passive copy
performs
aggressive prereading
background
database
maintenance runs
at 5 MB/sec/copy
Single logical
disk/partition
per physical disk
Database copies
per volume =
copies per
database
Same neighbors
on all servers
Balance
activation
preferences
Disk failure on
active copy =
database
failover
Failed disk &
database
corruption
need to be
addressed
quickly
Fast recovery to
restore
redundancy is
needed
Use spares to automatically
restore database redundancy
after a disk failure
Automatic Reseed
Periodically scan
for failed and
suspended copies
Check
prerequisites:
single copy, spare
availability
Allocate and
remaps a spare
Start the seed
Verify that healthy
copy
Release the
original disk
AutoDagDatabasesRootFolderPath
AutoDagVolumesRootFolderPath
Configure storage
subsystem with spare disks
Create DAG, add servers
with configured storage
Create directory and
mount points
Configure DAG, including 3
new properties
Create mailbox databases
and database copies
MDB1
AutoDagDatabaseCopiesPerVolume = 1
MDB1
MDB1 DB
MDB2
MDB2
MDB1 logs
MDB1 DB
MDB1 logs
Name
Check
Action
Threshold
System Bad State
No threads, including
non-managed threads,
can be scheduled
Hard restart (bugcheck)
302 seconds
Long I/O Times
I/O operation latency
Hard restart (bugcheck)
41 seconds
Replication service
memory threshold
(ok, not a storage
failure )
MSExchangeRepl.exe
consumes excessive
memory
1. Log event 4395 with
termination message
2. Initiate termination of
msexchangerepl.exe
3. If termination fails, hard
restart (bugcheck)
4 GB
Managed
Availability
Database
Failover
Changes
DAG Network
Auto-Config
Best Copy
Selection
Changes
Cmdlet
Enhancements
Maintenance
Mode
Transport HA
Enhancements
If a protocol goes down on a mailbox server, every active
database loses access to that protocol
For most protocols, quick correction is provided
through restart action
If restart fails, often a failover is triggered
• Protocols control recovery sequence
• Recovery sequence optimized thru Office 365 experience;
Service experience accrues to enterprise!
Layer 4 LB
time
CAS15-1 CAS15-2
DAG
MBX15-1
DB1
DB1
OWA
DB2
MBX15-2
DB1
OWA
DB2
Managed Availability = Monitoring + HA
“Stuff breaks, but the Experience does not”
MBX15-3
DB1
OWA
DB2
• Reliable and scalable monitoring framework for Exchange components
Provides • Broader perspective across groups of Exchange servers
• Sequencing mechanism to control when recovery actions are done vs.
alert issued (human engaged)
Provides • Common set of recovery actions
• Set of enhancements to the best copy selection (BCS) process
• Mechanism to control in and out of service for Mailbox and CAS
Provides (maintenance mode++)
Restart
Service - kill and
start a service;
optional dump
AppPool - restart
an app pool;
optional dump
Server bugcheck the
machine
Failover,
Offline,
Online
Database - failover
a single active
database
System- failover
all active
databases
Protocol off - set
health state for
protocol to offline
Protocol on calculate when a
health set is green
Escalate
Notify a
human of an
issue
Checks for a
server hosting
a copy of the
affected
database that
has health sets
in a state that
is the same as
the current
server hosting
the affected
copy
Same as Source
Checks for a
server hosting
a copy of the
affected
database that
has health sets
in a state that
is better than
the current
server hosting
the affected
copy
All Better than Source
Checks for a
server hosting
a copy of the
affected
database that
has all health
sets Medium
and above in a
healthy state
Up to Normal Healthy
All Healthy
Checks for a
server hosting
a copy of the
affected
database that
has all health
sets in a
healthy state
cas1
cas2
Redmond
cas3
cas4
Portland
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Redmond
2. Stop the Cluster Service on Remaining DAG members: Stop-Clussvc
3. Activate DAG members in 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Portland
mbx1
mbx2
Redmond
dag1
mbx3
mbx4
Portland
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Redmond
2. Stop the Cluster Service on Remaining DAG members: Stop-Clussvc
3. Activate DAG members in 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 –ActiveDirectorySite:Portland
mbx1
mbx2
Redmond
dag1
mbx3
mbx4
Portland
namespace simplification consolidation
of server roles separation of CAS array and DAG
recovery de-coupling of CAS and Mailbox by AD
site
load balancing changes
three locations
Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log
file, automatic failover should occur
If not, you can perform fast recovery using previous steps
mbx1
mbx2
dag1
mbx3
mbx4
witness
Redmond
Portland
Download the preview version of Exchange Server 2013
Try the new Exchange Online in the Office 365 Enterprise Preview
Follow the Exchange Team Blog
Product Documentation