Availability - Welcome to CUNY

Introduction to the new mainframe:
Large-Scale Commercial Computing
Chapter 5: Availability
© Copyright IBM Corp., 2006. All rights reserved.
Chapter objectives
After completing this chapter, you will be
able to:
• Understand what availability means to a
commercial enterprise
• Describe the inhibitors to availability
• Describe operating system facilities that
improve availability
• Describe the major components of Parallel
Sysplex
A real customer requirement:
Royal Bank Boosts Availability - Online Banking
Challenge: Maximize Availability
System: IBM System z Parallel Sysplex
Front End - Internet
• WebSphere MQ for z/OS, V5.3
Back End - Data/Applications
• DB2 Database
• IMS Database
• CICS Applications
12 million customers
2.5 million online
60,000 employees
Benefits
• Reliable integration with internet
• Supports ~40 web-based applications
• Efficient use of parallel sysplex
• Improved customer availability
Introduction to availability
High Availability
• Fault-tolerant, failure-resistant infrastructure supporting continuous application processing
Continuous Operations
• Non-disruptive backups and system maintenance coupled with continuous availability of applications
Disaster Recovery
• Protection against unplanned outages such as disasters through reliable, predictable recovery
Protection of critical business data
Recovery is predictable and reliable
Operations continue after a disaster
Costs are predictable and manageable
What is availability?
Availability is the state of an application being
accessible to the end user.
Outage Definition
An outage (unavailability) is the time during which a system is not
available to an end user.
Outages may be planned or unexpected (unplanned).
Planned outages include causes such as database
reorganization, release changes, and network
reconfiguration. Unplanned outages are caused by some
kind of hardware, software, or data problem.
Although planned outages can be scheduled, they are still
disruptive. The modern trend is to try to avoid planned
outages altogether. This requires extensive hardware
and software facilities.
Cost of outages (1)
Financial Impact of Downtime Per Hour (by various Industries)
Source: Contingency Planning Research & Strategic Research Corp.
Cost of outages (2)
Types of Outages
Common Causes for “Application Downtime”
Source: Standish Group Research
Inhibitors to availability
Number of 9s – or the Myth of the nines
Class of 9s | Outage
99.999 %    | 5 min / year
99.99 %     | 53 min / year
99.9 %      | 8.8 hrs / year
99 %        | 88 hrs / year
90 %        | 876 hrs / year
Example (labels from the slide figure): Continuous Availability; z/OS Parallel Sysplex; Fault Tolerant; S/390 Parallel Sysplex; High Availability; Single IBM System z CPC; General Purpose; Highly available UNIX Cluster; Campus LAN
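The outage figures for each class of nines follow directly from the availability percentage. The short Python sketch below reproduces that arithmetic, assuming a 365-day year (8,760 hours); the function name is illustrative and not part of the original material.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the expected outage time per year for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.999, 99.99, 99.9, 99.0, 90.0):
        hours = downtime_hours_per_year(pct)
        # Report short outages in minutes and longer ones in hours.
        if hours < 1:
            print(f"{pct} % -> {hours * 60:.1f} min / year")
        else:
            print(f"{pct} % -> {hours:.1f} hrs / year")
```

The results round to the values listed for the classes of nines (about 5 minutes, 53 minutes, 8.8 hours, 88 hours, and 876 hours per year).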
IBM System z9 EC – Under the covers (Model S38 or S54)
Figure (front view): Internal Batteries (optional), Hybrid Cooling, Power Supplies, Processor Books and Memory, CEC Cage, 3x I/O cages, Support Elements
Redundancy – IBM Mainframe Hardware
• Power
  - 2x Power Supply
  - 2x Power feed
• Internal Battery Feature
  - Optional internal battery in case of loss of external power
• Cooling
• Dynamic oscillator switchover
• Processors
  - Multiprocessors
  - Spare PUs
• Memory
  - Chip sparing
  - Error Correction and Checking
• Enhanced book availability
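Why duplicated components help can be shown with the standard reliability formulas for independent components: redundant (parallel) units fail only if all of them fail, while units that all must work (series) multiply their availabilities. The sketch below is a minimal illustration under the assumption of independent failures; the 99 % figure is a made-up input, not a measured value for any IBM component.

```python
from math import prod


def parallel(availability: float, n: int) -> float:
    """Availability of n redundant units when any single unit suffices."""
    return 1 - (1 - availability) ** n


def series(*availabilities: float) -> float:
    """Availability of a chain in which every unit must be working."""
    return prod(availabilities)


if __name__ == "__main__":
    single_supply = 0.99  # hypothetical availability of one power supply
    print(f"one supply:   {single_supply:.4f}")
    print(f"two supplies: {parallel(single_supply, 2):.4f}")
    # A system that needs power AND cooling, each duplicated:
    print(f"power+cooling: {series(parallel(0.99, 2), parallel(0.99, 2)):.6f}")
```

Real failures are rarely fully independent (a shared power feed, for example), which is one reason the hardware duplicates feeds as well as supplies.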
Concurrent Maintenance and Upgrades
• Duplex Units
  - Power Supplies,
• Concurrent Microcode (Firmware) updates
• Hot Pluggable I/O
• PU Conversion
• Permanent and Temporary Capacity Upgrades
  - Capacity Upgrade on Demand (CUoD)
  - Customer Initiated Upgrade (CIU)
  - On/Off Capacity on Demand (On/Off CoD)
• Capacity BackUp (CBU)
Capacity BackUp (CBU)
Who Needs It?
• Any business with a requirement for increased availability or Disaster Recovery
What Is It?
• Provides the ability to nondisruptively increment capacity
temporarily, when capacity is lost elsewhere in the enterprise
• Dual Microcode Loads
  - Provide two machine configurations in one box
• Take advantage of "spare" PUs
• Significant cost savings possible
  - Standby MIPS cost can be eliminated
  - IBM Software license charges on standby MIPS can be eliminated
• Configure memory and channels to support production workload
Figure labels: CBU Server, Production Server
How Can I Use It?
• Adjacent machines in the same location
• Multiple images in the same Parallel Sysplex® cluster
• Backup/Recovery site
z9 EC Enhanced Book Availability
Book Add
• Model Upgrade by the addition of a single new book adding physical
processors, memory, and I/O Connections
Continued Capacity with Fenced Book
• Make use of the LICCC defined resources of the fenced book to
allocate physical resources on the operational books as possible
Book Repair
• Replacement of a defective book when that book had been previously
fenced from the system during the last IML
Book Replacement
• Removal and replacement of a book for either repair or upgrade
IBM System z9 EC – Enhanced Book Replacement (EBR) Flow
Step 1
Book Add
• Processor Upgrade
• Add Memory
• Additional I/O Bandwidth
Book Replace/Repair
• Models S18, S28, S38, S54 only
• Requires sufficient resources in remaining Book(s)
Failed Book
• Models S18, S28, S38, S54 only
• Requires sufficient resources in remaining Book(s)
Step 2
Book Replace/Repair
• Prepare for Book removal via SE
• Resource reassigned to active Book(s) before repair/replace
• 'Fence' off Book for removal
Failed Book
• Re-IML system with failed Book 'fenced' off
• During IML, reassign resource to surviving Book(s)
• Remove 'fenced' Book for replacement/repair
Step 3
• Remove Book to be replaced/repaired
• Replace with new/repaired Book
Step 4
After Book Add/Replace/Repair
• Restore/Reconfigure Processors, Memory, and I/O Resources
EBR - Dynamic Memory Move
• The Dynamic Memory Move operation concurrently changes the physical memory backing of an absolute storage increment
• Performed transparent to the Operating System
• Utilizes the zSeries Copy/Reassign Hardware
• Used during EBA to:
  - Move physical memory usage from the targeted book to books that will be remaining in the system.
  - Optimize memory allocation after EBA completion.
Example:
Absolute storage increment "123" is concurrently moved from physical memory increment 1 to physical memory increment 2.
Figure labels: Absolute Storage Space, Physical Memory
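The idea behind the Dynamic Memory Move example can be sketched as a mapping from absolute storage increments to physical memory increments: the contents are copied to the target increment and the mapping is then reassigned, so a reader addressing increment "123" sees the same data before and after. The Python below is only a toy model under that assumption; the class and method names are hypothetical, and the real operation is performed concurrently by Copy/Reassign hardware rather than by software like this.

```python
class MemoryMap:
    """Toy model: absolute storage increments backed by physical increments."""

    def __init__(self, physical_ids):
        # Physical increment id -> contents (None means the increment is free).
        self.physical = {pid: None for pid in physical_ids}
        # Absolute storage increment -> physical increment id currently backing it.
        self.backing = {}

    def assign(self, absolute, physical_id, data):
        if self.physical[physical_id] is not None:
            raise ValueError("physical increment already in use")
        self.physical[physical_id] = bytearray(data)
        self.backing[absolute] = physical_id

    def read(self, absolute):
        # The caller only ever names the absolute increment.
        return bytes(self.physical[self.backing[absolute]])

    def move(self, absolute, target_id):
        source_id = self.backing[absolute]
        if self.physical[target_id] is not None:
            raise ValueError("target increment already in use")
        self.physical[target_id] = bytearray(self.physical[source_id])  # copy
        self.backing[absolute] = target_id                              # reassign
        self.physical[source_id] = None                                 # free source


if __name__ == "__main__":
    memory = MemoryMap(physical_ids=[1, 2])
    memory.assign(absolute=123, physical_id=1, data=b"payload")
    memory.move(absolute=123, target_id=2)
    print(memory.read(123), "now backed by physical increment", memory.backing[123])
```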
EBR - Redundant I/O Interconnect (RII)
STI Multipath Module (STI-MP)
• A multiplexer that supports attachment to four I/O features in an I/O domain and has an alternate path to a second STI-MP for a redundant I/O infrastructure.
Key Usage
• Memory Upgrade
• Dynamic MBA fanout error recovery
• Reduction of UIRA outage
• Book Repair
• STI cable repair
• MBA fanout card repair
• On book add MBA fanouts used for I/O are concurrently rebalanced to the new book
Figure: Processor Book 0 and Processor Book 1, each with Memory Cards, L2 Cache, PUs, and 8 MBA Fanouts (ring structure) driving 16 STIs (STI 2.7 GB/sec; ICB-4 2 GB/sec). STI from Book 0 and STI from Book 1 enter the I/O Cage through STI-MP and STI-A8 cards (STI mother card and STI daughter card), which connect to I/O features and I/O ports such as FICON Express2 and OSA-Express2.
EBR - Concurrent Physical Processor Reassignment
• This operation is used for concurrently changing the physical backing of one or more logical processors
• The state of the source operating physical processor is captured and transplanted into the target physical processor.
• Expected to be transparent to the operating system.
• Utilizes the PU sparing function
• Used during EBA to:
  - Move processors from the targeted book to spare processors on a book remaining in the system
  - Rebalance processors after EBA completion.
Figure labels: Logical PU6; Physical PUx, PUy
Evolution of RAS for IBM System z high end Systems
Function | z900 | z990 | z9 EC
Microcode Driver Updates | 6 Hr Scheduled Outage | 6 Hr Scheduled Outage | Concurrent*
Book Replacement** | Not Applicable | Scheduled Outage | Concurrent
Memory Replacement | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
ECC on Memory Control Circuitry (EX: SMI) | Unscheduled Outage | Unscheduled Outage | Transparent
Memory Bus Adapter (MBA) Replacement | Scheduled Outage. Lose connectivity to I/O Domain | Scheduled Outage. Lose connectivity to I/O Domain | Concurrent. Connectivity to I/O Domain remains
STI Failure | As for MBA | As for MBA | As for MBA
Oscillator Failure | Unscheduled Outage | Unscheduled Outage | Transparent
Processor Upgrades | Concurrent | Concurrent | Concurrent
Physical Memory Upgrades | Scheduled Outage | Scheduled Outage | Concurrent (Book Offline)
I/O Upgrades | Concurrent | Concurrent | Concurrent
Spare PUs | 1 / System | 2 / Book | 2 / System
*In select circumstances
**Customer pre-planning required, may require acquisition of additional hardware resources
Create a redundant I/O configuration
Figure labels: LPAR1, LPAR2, ... LPARn (shown twice); CSS / CHPID; Director (Switch); DASD CU, DASD CU, ...
RAS Features of a Storage Subsystem
• Independent dual power feeds
• N+1 power supply technology/hot swappable power supplies, fans
• N+1 cooling
• Battery backup
• Non-Volatile Subsystem cache, to protect writes that have not been hardened to DASD yet
• Nondisruptive maintenance
• Concurrent LIC activation
• Concurrent repair and replace actions
• RAID architecture
• Redundant microprocessors and data paths
• Concurrent upgrade support (that is, ability to add disks while subsystem is online)
• Redundant shared memory
• Spare disk drives
• Remote Copy to a second storage subsystem
  - Synchronous (Peer to Peer Remote Copy, PPRC)
  - Asynchronous (Extended Remote Copy, XRC)
Disk Mirroring using PPRC and XRC
PPRC (Metro Mirror)
• Synchronous remote data mirroring
  - Application receives "I/O complete" when both primary and secondary disks are updated
• Typically supports metropolitan distance
• Performance impact must be considered
  - Latency of 10 µs/km
XRC (z/OS Global Mirror)
• Asynchronous remote data mirroring
  - Application receives "I/O complete" as soon as primary disk is updated
• Unlimited distance support
• Performance impact negligible
• System Data Mover (SDM) provides
  - Data consistency of secondary data
  - Central point of control
Figure: numbered write flows (steps 1 to 4) for PPRC and for XRC, the latter passing through the SDM running on System z with z/OS.
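The difference between the two mirroring modes lies in when the application is told "I/O complete". The Python sketch below models that difference only; the class names are hypothetical, in-memory dictionaries stand in for disks, and the `drain` method is a simplified stand-in for the role of a data mover, not a model of SDM itself.

```python
from collections import deque


class SyncMirror:
    """Synchronous style: acknowledge only after both copies are updated."""

    def __init__(self):
        self.primary, self.secondary = {}, {}

    def write(self, key, value):
        self.primary[key] = value
        self.secondary[key] = value  # remote update happens before the ack
        return "I/O complete"


class AsyncMirror:
    """Asynchronous style: acknowledge after the primary update alone."""

    def __init__(self):
        self.primary, self.secondary = {}, {}
        self.pending = deque()  # updates not yet applied at the secondary

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))
        return "I/O complete"

    def drain(self):
        # Apply queued updates in their original order to keep the
        # secondary copy consistent.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value


if __name__ == "__main__":
    sync, asyn = SyncMirror(), AsyncMirror()
    sync.write("acct", 100)
    asyn.write("acct", 100)
    print("sync secondary: ", sync.secondary)   # already up to date
    print("async secondary:", asyn.secondary)   # still empty until drained
    asyn.drain()
    print("async after drain:", asyn.secondary)
```

The window in which the asynchronous secondary lags the primary is where data can be lost in a disaster, while the synchronous write pays for its zero lag with added latency on every update.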
PPRC Failover / Failback (FO/FB)
• The new primary volumes (at the remote site) record changes while in failover
mode.
• The original mode of the volumes at the local site is preserved as it was when
the failover was initiated.
• Only the changes made since the time of failover need to be resynchronized, not the entire data set.
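The bullets above describe change recording: while in failover mode the new primary remembers which parts of the volume were updated, so failback copies only those parts. A minimal Python sketch of that idea follows, assuming a volume is a list of tracks and using a set as the change record; the `PprcPair` class and its methods are hypothetical names for illustration, not an actual PPRC interface.

```python
class PprcPair:
    """Toy model of a mirrored volume pair: A (local site) and B (remote site)."""

    def __init__(self, n_tracks):
        self.a = [None] * n_tracks
        self.b = [None] * n_tracks
        self.failed_over = False
        self.changed = set()  # tracks updated on B while in failover mode

    def write(self, track, value):
        if not self.failed_over:
            # Normal mode: synchronous mirroring keeps both volumes identical.
            self.a[track] = value
            self.b[track] = value
        else:
            # Failover mode: B is the new primary and records what changed.
            self.b[track] = value
            self.changed.add(track)

    def failover(self):
        self.failed_over = True
        self.changed.clear()

    def failback(self):
        # Resynchronize only the tracks changed since the failover.
        copied = sorted(self.changed)
        for track in copied:
            self.a[track] = self.b[track]
        self.changed.clear()
        self.failed_over = False
        return copied


if __name__ == "__main__":
    pair = PprcPair(n_tracks=8)
    pair.write(0, "before")
    pair.failover()
    pair.write(3, "during")
    pair.write(5, "during")
    print("tracks copied at failback:", pair.failback())
    print("volumes identical:", pair.a == pair.b)
```

In this run only two of the eight tracks are copied back, which is the saving the slide refers to.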
Figure: four stages for volumes A and B, each showing where application I/Os run: Normal (Sync PPRC, full duplex), Failover (Sync PPRC suspended), Failback Start, and Failback Finish (Sync PPRC, full duplex).
Parallel Sysplex
• Removes Single Point of Failure
  - Server
  - LPAR
  - Subsystems
• Planned and Unplanned Outages
• Single System Image
• Dynamic Session Balancing
• Dynamic Transaction Routing
• Highlights
  - Data sharing
  - Locking
  - Cross-system workload dispatching
  - Synchronization of time for logging, etc.
• Hardware/software combination
  - Coupling Facility
  - Sysplex Timer – TOD clock synchronization
  - Workload Manager in z/OS
  - Compatibility and exploitation in software subsystems, like data sharing in database systems
Figure: three IBM System z servers
z/OS factors to availability
• Workload balancing using Workload Manager (WLM)
• Highly automated system
• Capability to restart applications using the Automatic Restart Manager (ARM) without interfering with other applications or z/OS itself
• Assists two-phase commit using Resource Recovery Services (RRS)
• Make dynamic changes to your system configuration using the System Modification Program Extended (SMP/E)
Error recording and error recovery routines
z/OS Recovery
z/OS Recovery features
• Recovery Termination Manager (RTM)
• Extended Specify Task Abnormal Exit (ESTAE)
• Functional Recovery Routine (FRR)
The Human Factor ….
Automation: critical for successful rapid recovery and continuity
The More People Involved…..
….. The Higher the Odds of Human Errors.
The benefits of automation:
• Allows business continuity processes to be built on a reliable, consistent
recovery time
• Recovery times can remain consistent as the system scales to provide a flexible
solution designed to meet changing business needs
• Reduces infrastructure management cost and staffing skill requirements
• Reduces or eliminates human error during the recovery process at time of
disaster
• Facilitates regular testing to help ensure repeatable, reliable, scalable business
continuity
• Helps maintain recovery readiness by managing and monitoring the server, data
replication, workload and the network along with the notification of events that
occur within the environment
Tiers of Disaster Recovery
Tier 7 - Near zero or zero Data Loss: Highly automated takeover on a complex-wide or business-wide basis, using remote disk mirroring
Tier 6 - Near zero or zero Data Loss remote disk mirroring helping with data integrity and data consistency
Tier 5 - software two site, two phase commit (transaction integrity); or repetitive PiT copies w/ small data loss
Tier 4 - Batch/Online database shadowing & journaling, repetitive PiT copies, fuzzy copy disk mirroring
Tier 3 - Electronic Vaulting
Tier 2 - PTAM, Hot Site
Tier 1 - PTAM*
Solutions shown on the chart:
• GDPS/PPRC: RTO < 1 hr; RPO 0
• GDPS/XRC, GDPS/Global Mirror: RTO < 2 hr; RPO < 1 min
• GDPS/PPRC HyperSwap Manager: RTO depends on customer automation; RPO 0
Figure: Value plotted against Time to Recover (hrs), with axis marks 15 Min., 1-4, 4-6, 6-8, 8-12, 12-16, 24, 72. Bands: Mission Critical Applications, Somewhat Critical Applications, Not so Critical Applications. Regions: Dedicated Remote Hot Site, Active Secondary Site, Point-in-Time Backup.
Tiers based on Share Group 1992
*PTAM = Pickup Truck Access Method
Today’s Business Continuity Objectives Demand Rapid Database
Availability
Achieve Application and Database Restart
• Consistent, repeatable, fast
• Database Restart: To start a database application following an outage without having to restore the database
  - This is a process measured in minutes
Avoid Application and Database Recovery
• Unpredictable recovery time, usually very long and very labor intensive
• Database Recovery:
  - Restore last set of Image Copy tapes and apply log changes to bring database up to point of failure
  - This is a process measured in hours or even days
What is GDPS/PPRC? (Metro Mirror)
Figure: SITE 1 and SITE 2, each attached to the NETWORK.
Multi-site base or Parallel Sysplex environment
Remote data mirroring using PPRC
Manages unplanned reconfigurations
• z/OS, CF, disk, tape, site
• Designed to maintain data consistency and integrity
across all volumes
• Supports fast, automated site failover
• No or limited data loss - (customer business policies)
Single point of control for
• Standard actions
Stop, Remove, IPL system(s)
• Parallel Sysplex Configuration management
• User defined script (e.g. Planned Site Switch)
• PPRC Configuration management
Multiple Site Workload - Cross-site Sysplex
Continuous Availability Configuration
Figure: SITE 1 and SITE 2 with CF1 and CF2; systems P1, P2, P3, P4 (PROD), K1, and K2; CBU; disks labeled P, S, and K/L.
Continuous Availability and Disaster Recovery at unlimited distance
(GDPS/PPRC & GDPS/XRC)
IBM System z Solution
Figure: Production Site 1 and Site 2 at metropolitan distance (GDPS/PPRC; PPRC primary P, PPRC secondary P'), and Site 3 at unlimited distance (GDPS/XRC; XRC primary, XRC secondary X'; PX). Each Parallel Sysplex has CFs; attachment is FICON or ESCON.
Continuous Availability
• GDPS/PPRC or GDPS/PPRC HM
• Designed to provide continuous availability and no data loss between sites 1 and 2
• Sites 1 and 2 can be same building or campus distance to minimize performance impact
Disaster/Recovery
• Production site 1 failure
  - Site 3 can recover with no data loss in most instances
• Site 2 failure
  - Production can continue with site 1 data (P')
• Site 1 and 2 failure
  - Site 3 can recover with minimal loss of data
SUMMARY
Built In Redundancy
Capacity Upgrade on Demand
Capacity Backup
Hot Pluggable I/O
Addresses Planned/Unplanned Hardware and Software Outages
Addresses Site Failure/Maintenance
Flexible, Nondisruptive Growth
Capacity beyond largest CEC
Scales better than SMPs
Sync/Async Data Mirroring
Eliminates Tape/Disk SPOF
No/Some Data Loss
Dynamic Workload/Resource Management
Application Independent
Key terms in this chapter
• ARM
• Automate
• Availability
• CA
• Data sharing
• Disaster
• Disk mirroring
• GDPS
• HA
• LPAR
• MTBF
• N+1
• Recover
• SMP/E
• SPOF
• Sysplex
• Sysplex Timer
• System log
• Trace