CSIS0402 System Architecture Resiliency


CSIS0402 System Architecture Resiliency
K.P. Chow
University of Hong Kong
Aspects of Resiliency
• How do we ensure that there is no data loss and no message loss during a system failure?
• How do we reduce the number of visible stoppages?
• How do we reduce the length of the system downtime if there is a failure?
• Disaster recovery?
• How do we provide 7 by 24 by 365 service?
Operational Procedures vs. Reliability
• Managing an IBM mainframe the way one manages an NT 4.0 server at home → low reliability
• Take an application:
– Run it on any major platform, e.g. an NT 4.0 server
– Perform testing systematically
– Proper configuration control procedures
– Good operation procedures
→ High reliability
Causes of Downtime
• Hardware failure (20%)
• Planned downtime, e.g. hardware maintenance or database backup
• Application problems
• Operational problems
• Network failure
Reliability (1)
• Reliability: the probability that the system survives; e.g. a disk reliability of 99.99% means the failure probability is 0.0001
• Series configuration (A → B → C, with reliabilities pA, pB, pC):
Reliability of the system = pA*pB*pC
• Parallel configuration (X, Y, Z side by side, with reliabilities pX, pY, pZ):
Reliability of the system = …
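The series formula above can be checked in a few lines. For the parallel configuration, where any one surviving component keeps the system up, the standard result (not spelled out on the slide) is R = 1 − (1−pX)(1−pY)(1−pZ). A minimal sketch; the function names are mine:

```python
from math import prod

def series_reliability(ps):
    # All components must survive: R = pA * pB * pC
    return prod(ps)

def parallel_reliability(ps):
    # System fails only if every component fails:
    # R = 1 - (1-pX)(1-pY)(1-pZ)
    return 1 - prod(1 - p for p in ps)

# Three disks, each 99.99% reliable
ps = [0.9999, 0.9999, 0.9999]
print(series_reliability(ps))    # ≈ 0.9997 (lower than any one disk)
print(parallel_reliability(ps))  # ≈ 1 - 1e-12 (far higher than one disk)
```

Chaining components multiplies failure risk; replicating them multiplies it away.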
Reliability (2)
• How about multiple components running independently?
• Independent components (A, B, C with reliabilities pA, pB, pC), e.g. a web server, a task scheduler, and an intrusion detection system:
Reliability of the system = min(pA,pB,pC)
System Recovery with Backup Server
1. Detecting the failure (Detect)
2. Cleaning up work in progress (Clean up)
3. Activating the application on the backup system (Restart)
4. Reprocessing "lost" messages (Reprocess)
[Figure: client connected to the production system, with a backup system standing by]
Backup Server – Failure Detection
• Heartbeat:
– The backup server sends the primary system a message on a regular basis: "Are you there?"
– The primary system replies: "Yes I am."
• Issues of heartbeat: it is only a coarse mechanism
– It only checks whether the heartbeat program is running, not the production system itself
– If the heartbeat reports a failure, the backup cannot distinguish between an inactive primary and a heartbeat problem (e.g. a network failure)
• Additional supports:
– How about a backup network connection? The heartbeat still cannot distinguish between a temporary glitch and a long stoppage
– Have the client program give regular reports on its progress, such as the response times of the last 10 transactions
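The heartbeat mechanism above can be sketched as a timeout check on the backup side. Class and parameter names are illustrative, not from any specific product, and as the slide notes, a timeout alone cannot tell a dead primary from a dead link or a stuck heartbeat program:

```python
import time

class HeartbeatMonitor:
    """Backup-side heartbeat: declares the primary suspect after
    `timeout` seconds without a "Yes I am" reply (names are mine)."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_reply = time.monotonic()

    def on_reply(self):
        # Called whenever the primary answers the "Are you there?" probe.
        self.last_reply = time.monotonic()

    def primary_suspect(self):
        # True = no reply within the timeout window. This is the coarse
        # part: it cannot distinguish primary failure from network failure.
        return time.monotonic() - self.last_reply > self.timeout

m = HeartbeatMonitor(timeout=0.1)
m.on_reply()
print(m.primary_suspect())   # False right after a reply
time.sleep(0.2)
print(m.primary_suspect())   # True once the timeout has elapsed
```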
Failure Handling
• What to do when a failure is detected?
1. On any failure, switch to backup, then identify the cause
2. Stop, identify the cause, and switch to backup as the last resort
• Strategy 1 is used on "really resiliency-conscious" sites, with a plan for switching:
– Ensure switching will not cause a system stop
– Ensure only qualified operators can execute the switch, or make the switch automatic
• Strategy 2 is used when the resiliency requirement is "low":
– Problem: backup switching procedures that are not well tested will generate many problems, e.g. incorrect network configurations, incorrect software configurations, hardware capacity too small to run the "real" production load
Backup Server – Clean-up Work in Progress
• Requires that all data updated on the production system be copied to the backup system
• Using the last copy of the data is not sufficient, because incomplete transactions must be rolled back
• 2 techniques:
1. Roll forward: copy the database logs and apply them against a copy of the database (continual roll forward); non-database data is copied directly
2. Mirroring: keep a mirror copy of the disk subsystem on the backup system by duplicating all writes on the backup system
• Roll forward vs. mirroring:
– Roll forward is more efficient
– Mirroring is easier to operate
– Mirroring is more expensive: additional network bandwidth, disk and processor
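Continual roll forward, as described above, amounts to applying shipped log records (afterimages) against the backup's copy of the database. A minimal sketch; the log record format is invented for illustration:

```python
# Continual roll-forward sketch: the backup holds a snapshot of the
# database and applies log records as they are shipped from production.
# Record format ({"op", "key", "value"}) is illustrative only.

def apply_log(db, log):
    for rec in log:
        if rec["op"] == "write":
            db[rec["key"]] = rec["value"]   # afterimage overwrites the row
        elif rec["op"] == "delete":
            db.pop(rec["key"], None)
    return db

backup_db = {"acct:1": 100}                  # last copied snapshot
shipped_log = [
    {"op": "write", "key": "acct:1", "value": 80},
    {"op": "write", "key": "acct:2", "value": 20},
]
print(apply_log(backup_db, shipped_log))
# {'acct:1': 80, 'acct:2': 20}
```

Mirroring, by contrast, would duplicate every disk write as it happens rather than replaying a log.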
Backup Server – Activating the Application
• Restart the system and put it in a "production" state
• Make sure the following components are restarted properly:
– Network
– Batch jobs
– On-line programs
Restart the Network
• The client must send its input to the backup machine instead of the primary machine
• Should the client resume the unfinished session or restart the session with a new log-on? It depends:
– For TCP/IP sessions carrying Web traffic → a new session
– For sessions between the Web server and the transaction server → resume the session if possible
– Others, …
• Techniques to resume the session:
– The backup machine comes up with the same IP address as the production machine (possible if the production and backup machines are on the same LAN segment)
– Use intelligent routers to redirect the traffic from one IP to another
– Use a special application protocol between the client and the server, e.g. the client automatically starts a new session
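The last technique, where the client itself starts a new session against the backup, can be sketched as a failover connect. The addresses and the `connect()` stub are hypothetical stand-ins for a real network connection:

```python
# Client-side failover sketch: try the production address first, then
# the backup. connect() is a stub; a real client would open a socket
# and run its application protocol (and possibly a fresh log-on) on top.

def connect(addr):
    # Stand-in for a real connect; pretends production (10.0.0.1) is down.
    if addr == "10.0.0.1":
        raise ConnectionError(addr)
    return f"session to {addr}"

def connect_with_failover(addrs):
    for addr in addrs:
        try:
            return connect(addr)     # new session, as for Web traffic
        except ConnectionError:
            continue                 # fall through to the backup address
    raise ConnectionError("no server reachable")

print(connect_with_failover(["10.0.0.1", "10.0.0.2"]))
# session to 10.0.0.2
```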
Restart the Batch Jobs
• A batch job is a sequence of jobs to be executed, with the sequence defined by the job control deck
• Common usage of batch jobs: update a database with data from an external input file or a table in the database
• Criteria to restart the batch program in a batch job:
– Able to reposition the input data on the next record to be processed
– Able to rebuild the internal data in memory, e.g. cumulative totals
• Techniques:
– Store the program restart data, e.g. the number of input records already processed and the current values of the total variables
– Add a table or records in the database to store the restart data
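The restart technique above, storing the number of records processed and the running totals at each commit point, can be sketched as follows. The checkpoint dict stands in for a restart-data table in the database; all names are illustrative:

```python
# Batch-restart sketch: restart data is persisted at each commit point,
# so after a failure the job can reposition its input (records already
# processed) and rebuild its in-memory data (the cumulative total).

def run_batch(records, checkpoint, batch_size=2):
    start = checkpoint.get("processed", 0)   # reposition the input
    total = checkpoint.get("total", 0)       # rebuild internal data
    for i in range(start, len(records)):
        total += records[i]
        if (i + 1) % batch_size == 0:        # commit point: save restart data
            checkpoint.update(processed=i + 1, total=total)
    checkpoint.update(processed=len(records), total=total)
    return total

# The job died after committing 2 records with a running total of 30;
# on restart it resumes at record 3 instead of starting over.
ckpt = {"processed": 2, "total": 30}
print(run_batch([10, 20, 5, 5, 1], ckpt))   # 41
```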
Restart the Batch Jobs (cont)
• Problems in restarting batch jobs:
– Identifying the job control deck, determining the failed job and restarting at the right place: handled by a good mainframe OS, e.g. IBM's
– What if there is no restart information? 2 alternatives:
  – At the failure time, the program had just started running
  – At the failure time, the program was finishing and had already deleted the restart data
– Need to synchronize the job control and the restart data: case by case
• Worst case: reload the database (or roll back the database) and restart the batch job from the beginning
Restart the Online Program
• Client failure:
– The client application restarts
– Restart information is needed in order to resume the session
– The user may be required to log on again before resuming the session
– Need to handle the case where the user has been off-line for a long time
• Server failure:
– Recovers to the last transaction (to be discussed)
• Both client and server fail:
– Use the same failure handling routines discussed above
• Failure scenarios are complex:
– Use a simple approach, e.g. store all state information in the database and keep both client and server applications stateless
Backup Server – Reprocess "lost" Messages
• Issue: the client sent a message to the server and then the server failed; the client does not know whether the failure happened before or after the transaction was committed
• How about asking the user to query the database?
[Figure: client–server message exchange, with the failure points to consider marked X along the path]
Reprocessing "lost" Messages
• 2 questions during recovery:
– How does the client know the server completed the last message input?
– If the message was completed, what was the last message output?
• Recovery protocol:
1. The client puts a sequence number in the message to the server; the sequence number is incremented on every transaction
2. The server stores the sequence number in a special row for recovery
3. On recovery, the client sends a special message to query the last sequence number (which indicates the last processed transaction)
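A minimal sketch of this recovery protocol, with the special recovery row modeled as an instance attribute (class and method names are mine):

```python
# Recovery-protocol sketch: the client numbers every request; the server
# stores the last committed sequence number in a special recovery row.
# On recovery the client compares its last sent number with that row.

class Server:
    def __init__(self):
        self.recovery_row = 0      # last committed sequence number
        self.data = []

    def process(self, seq, payload):
        # In a real system the payload and the sequence-number update
        # would be committed in the same transaction, so they agree.
        self.data.append(payload)
        self.recovery_row = seq

    def last_committed(self):
        return self.recovery_row

server = Server()
server.process(1, "debit $10")
server.process(2, "credit $5")

# The server crashed after the client sent message 3 but before
# committing it. On recovery the client queries the recovery row:
last_sent = 3
if server.last_committed() < last_sent:
    print("resend message", last_sent)   # message 3 was lost: resend it
```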
Dual Active
• For applications that cannot afford the delay of switching to backup during recovery
• Dual active strategies:
– The backup application actively waits for the data to start the recovery process
– The client is logged onto both systems → no need to switch the network and security checking
– Since both systems are equal, they can be used for load-balancing
• Approaches:
– Clustering: the database (which should be mirrored) is shared by both systems
– Two-database approach: each system has its own database
Dual Active – Clustering
• Both systems can read and write to all disks
• Problems:
– Buffer management: if system A updated a buffer and B wants the same data, either write the data to disk or send the buffer data to B
– Lock management: both systems must see all the locks
– Log file: to reduce system complexity, a single DB log file can be used, which may have performance problems
[Figure: systems A and B, each with a lock manager (L), joined by cross-coupled networking and cross-coupled disk]
Clustering Techniques
• All the above problems can be solved, at the cost of lower performance
• Solutions are application dependent
• E.g. the solution to the lock manager duplication problem introduces a large performance overhead
– Can be improved by locking out more at one time (reducing the number of lock operations), e.g. block locks instead of record locks
– But block locks will increase lock contention
– … (depends on the application)
• Can clustering increase performance instead?
– It depends: yes, if the number of reads far exceeds the number of writes
Dual Active – Two-database Approach
• Each system has its own database
– Each transaction is processed twice, once on each database
– Two-phase commit is used to ensure the systems are kept in sync
– If one system fails, the other system should stop sending it transactions
• Two problems:
1. If a system was down and is fixed, then before bringing it online, it must catch up with all missed transactions
– Capture the missed transactions
– Reprocess the input for catch-up purposes: in the same order the transactions were processed by the live system, not the input order
2. If the client connection to both systems is working but the connection between the systems is broken
– Each system tries to contact a 3rd system when it cannot contact its partner
– The 3rd system should kill one of the two systems (to avoid 2 inconsistent running systems)
• Performance problem: works well if most transactions are reads
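The catch-up step above can be sketched as a replay of the missed transactions in the live system's commit order, which is what keeps the two databases identical; the transaction record format is invented for illustration:

```python
# Catch-up sketch for the two-database approach: the repaired system
# replays the transactions it missed, already sorted by the order the
# live system committed them (not the original input order), so both
# databases end in the same state even when transactions conflict.

def catch_up(db, missed_txns):
    for txn in missed_txns:          # commit order of the live system
        db[txn["key"]] = txn["value"]
    return db

repaired = {"x": 1}
missed = [
    {"key": "x", "value": 2},        # committed first on the live system
    {"key": "y", "value": 9},
]
print(catch_up(repaired, missed))    # {'x': 2, 'y': 9}
```

Replaying in input order instead could leave conflicting updates applied in a different order on the two systems.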
Two-Phase Commit
• Subtransactions A and C, coordinator B
• Prepare phase: the coordinator checks if all subtransactions are ready to commit; each replies OK or Fail, and on any Fail the coordinator aborts the transaction
• Commit phase: the coordinator tells all subtransactions to commit or abort; each replies Done
Two-Phase Commit (cont)
• Failure scenarios:
– If A or C fails in the prepare phase, B will time out and abort the transaction
– If B fails in the prepare phase, A and C will time out and mark the transaction aborted
– If A and C fail after sending "OK" but before the commit, they will query B when they come back up to check whether the transaction is committed or aborted → the final state of the transaction must be stored on disk
– If B fails in the commit phase, A and C will time out with their locks held; a heuristic abort can be used
• Performance problems:
– Many network messages
– Subtransaction and transaction status are written to disk: extra disk writes
– Commit processing takes some time and locks are held longer: more chance of deadlock
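The two phases can be sketched as follows, with B as the coordinator and A and C as subtransactions. This is a minimal illustration only; as the slides note, a real implementation must also force the commit decision to disk before the commit phase, which this sketch omits:

```python
# Minimal two-phase-commit sketch. Participants are plain objects with
# prepare/commit/abort methods; all names are illustrative.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        return self.can_commit       # reply "OK" (True) or "Fail" (False)

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # prepare phase: poll everyone
    if all(votes):
        for p in participants:
            p.commit()                           # commit phase
        return "committed"
    for p in participants:
        p.abort()                                # any Fail → abort everywhere
    return "aborted"

a, c = Participant(), Participant(can_commit=False)
print(two_phase_commit([a, c]))   # aborted (one subtransaction said Fail)
print(a.state, c.state)           # aborted aborted
```

Note that even the participant that voted OK ends up aborted: that all-or-nothing outcome is the point of the protocol.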
Example
• Design: handle the resiliency in the transaction servers (i.e. session handling in the transaction server), not the web servers
• Router: balances web traffic across the web servers
• Web servers: dual active
– Hold static data
• Transaction server switching:
– Web servers open connections to both transaction servers (production and backup)
– No heartbeat is needed, why?
– 2-phase commit to synchronize the DB
[Figure: Web browser → router → web servers (dual active) → transaction and DB servers (production and backup)]
System Software Failures
• System software: DBMS, operating systems, …
• Causes:
– Timing errors, memory leakage
– Corrupt data on disk (the worst)
• Solutions:
– Reboot the system
– Database reorganization: rebuild the indexes
– Switch to backup: works if the cause is a timing error
Planned Downtime
• Why planned downtime?
– Housekeeping functions: make a backup copy of the database
– Preventive maintenance
– Change the hardware and software configuration
• Database backup:
1. Online backup
– To recover the DB from the online backup copy, one must apply the afterimages from the log to bring the DB back to a consistent state
– This is required even if the DB is not doing anything while copying, because modified buffers exist in memory
2. Disk mirroring
– The disk system should support disk mirroring and hot swapping (removing a disk while the system is running)
– If the DB is running when the disk is removed, one needs the log to bring the DB back to a consistent state
Planned Downtime (cont)
• Preventive maintenance:
– Not an issue if the hardware platform supports hot replacement and dynamic partitioning
– Otherwise, preventive maintenance is the same as hardware reconfiguration
• Hardware/software reconfiguration – use the backup system:
1. Take the backup machine offline
2. Change the software or hardware
3. Bring the backup machine up-to-date
4. Switch to backup
5. Repeat the process on the other machine
6. Bring the primary machine up-to-date and switch back to the primary machine
Application Software Failure
• The worst kind of error:
– Best case: the program fails; solved by fixing the error and restarting the program
– Worst case: the program does not fail, but the DB is corrupted
• Database roll back: go back to a state before the corruption and rerun the work
1. Identify the corruption point
2. Get the DB to the state before the corruption:
– Roll back: use the DB roll-back facility together with the log beforeimages
– Roll forward: reload a backup DB and apply the log afterimages
3. Reapply the transactions (available from the input audit trail)
• Manual reprocessing: used if automatic reprocessing is not possible, e.g. there is no audit trail
• Semi-automatic reprocessing can be used if multiple DBs are involved, e.g. 2 systems communicate using a message queue and only one system is corrupted
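The roll-back-and-rerun procedure can be sketched as undoing beforeimages back past the corruption point and then reapplying the good input from the audit trail. The record formats are invented for illustration:

```python
# Roll-back-and-rerun sketch: restore the state before the corruption
# using log beforeimages (applied newest-first), then reapply the good
# transactions captured in the input audit trail.

def roll_back(db, beforeimages):
    for rec in reversed(beforeimages):   # undo, most recent change first
        db[rec["key"]] = rec["before"]
    return db

db = {"acct:1": 70}                        # corrupted state
log = [{"key": "acct:1", "before": 100}]   # beforeimage of the bad txn
audit_trail = [("acct:1", -20)]            # the correct input to rerun

roll_back(db, log)                         # back to the pre-corruption state
for key, delta in audit_trail:
    db[key] += delta                       # rerun the work
print(db)   # {'acct:1': 80}
```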
Application Software Failure (cont)
• Database patch:
– Patch the DB by writing a program that fixes the corrupted data
– Information needed in order to build the patch: the transaction type in error, the effect of the error, and the input data for that transaction
• Second-order effect: a program or a user could have seen the corrupted data and acted wrongly
Application Software Failure – Error Reporting
• Input log of all important input transactions:
– Who the user was
– What the transaction was
– The input data
• Good error reporting: display the nature of the error and enough information to trace back to the input log
• Preventing application software errors:
– Prevent bad data from getting to the DB: check all input data and avoid committing to the database if anything looks strange
– Look for corrupted data when reading the DB
– Display enough information when an error is encountered
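The input log and error report above can be sketched together: each entry records who, what, and the input data, and an error report carries enough to trace back to the entry. Field and function names are illustrative:

```python
import json

# Input-log sketch: every important transaction is logged with the user,
# the transaction type, and the input data, so an error report can be
# traced back to the exact input that caused it.

input_log = []

def log_input(user, txn_type, data):
    entry = {"seq": len(input_log) + 1, "user": user,
             "txn": txn_type, "data": data}
    input_log.append(entry)
    return entry["seq"]

def report_error(seq, nature):
    # Display the nature of the error plus a trace-back to the input log.
    entry = input_log[seq - 1]
    return (f"error: {nature}; input #{seq} ({entry['txn']}) "
            f"by {entry['user']}: {json.dumps(entry['data'])}")

seq = log_input("alice", "transfer",
                {"from": "acct:1", "to": "acct:2", "amt": 50})
print(report_error(seq, "negative balance after commit"))
```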