Fault Tolerant MPI
Anthony Skjellum*$, Yoginder Dandass$, Pirabhu Raman*
MPI Software Technology, Inc.*
Mississippi State University$
FALSE2002 Workshop
November 14, 2002
Outline
• Motivations
• Strategy
• Audience
• Technical Approaches
• Summary and Conclusions
2
Motivations for MPI/FT
• Well-written and well-tested legacy MPI applications will abort, hang, or die more often in harsh or long-running environments because of extraneously introduced errors
• Parallel computations are fragile at present
• There is apparent demand for recovery of running parallel applications
• Learn how fault tolerant we can make MPI programs and implementations without abandoning this programming model
3
Strategy
• Build on MPI/Pro, a commercial MPI
• Support extant MPI applications
• Define application requirements/subdomains
• Do a very good job for the Master/Slave model first
• Offer higher-availability services
• Harden transports
• Work with OEMs to offer more recoverable services
• Use specialized parallel computational models to enhance effective coverage
• Exploit third-party checkpoint/restart, nMR, etc.
• Exploit Gossip for detection
4
Audience
• High-scale, higher-reliability users
• Low-scale, extremely high-reliability users (nMR involved for some nodes)
• Users of clusters for production runs
• Users of embedded multicomputers
• Space-based and highly embedded settings
• Grid-based MPI applications
5
Detection and Recovery From Extraneously Induced Errors
[Diagram: errors/failures strike the application, MPI, the network/drivers/NIC, and the OS/run-time/monitors; detection is handled respectively by ABFT/aBFT, MPI sanity checks, network sanity checks, and watchdog/BIT/other mechanisms; the recovery process is driven by application execution-model specifics, after which the application recovers.]
6
Coarse Grain Detection and Recovery
[Flowchart: mpirun –np NP my.app (adequate if no SEU). No error: my.app finishes and mpirun finishes (success). Process dies or MPI-lib error: my.app aborts and mpirun finishes; detection asks "aborted run?" and, if yes, recovery sends it to ground (failure). my.app hangs: mpirun hangs; detection asks "hung job?"; if yes, recovery aborts my.app (failure); if no, continue waiting.]
7
Example: NIC errors
[Diagram: rank r0's send buffer reaches rank r1's receive buffer through the user level, MPI, the device level, and the NIC on each side.]
• The NIC has the 2nd-highest SEU strike rate after the main CPU
• Legacy MPI applications will be run in simplex mode
8
“Obligations” of a Fault-Tolerant MPI
• Ensure reliability of data transfer at the MPI level
• Build reliable header fields
• Detect process failures
• Transient error detection and handling
• Checkpointing support
• Two-way negotiation with scheduler and checkpointing components
• Sanity checking of MPI applications and underlying resources (non-local)
9
Low Level Detection Strategy for Errors and Dead Processes
[Flowchart: initiate device-level communication. On low-level success, return MPI_SUCCESS. Otherwise ask the error handler (EH) whether the error is recoverable; if not, trigger a recovery event. If it is recoverable and the error type is a timeout, ask the self-checking thread (SCT) whether the peer is alive: if the peer is dead, trigger a recovery event; otherwise reset the timeout and retry. Other recoverable error types also reset the timeout and retry.]
EH: Error Handler
SCT: Self-Checking Thread
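Read as code, the chart is a retry loop around the device-level operation. The sketch below is a minimal illustration, assuming hypothetical helpers device_send, eh_is_recoverable, sct_peer_alive, and trigger_recovery_event (none of these are actual MPI/Pro or MPI/FT interfaces):

```c
#include <mpi.h>

/* Hypothetical device-level status codes and helper routines; the names
   are illustrative only, not real MPI/Pro or MPI/FT entry points. */
typedef enum { DEV_OK, DEV_TIMEOUT, DEV_OTHER } dev_status;

dev_status device_send(int peer, const void *buf, int len, int timeout_ms);
int  eh_is_recoverable(dev_status s);      /* EH: error handler          */
int  sct_peer_alive(int peer);             /* SCT: self-checking thread  */
void trigger_recovery_event(int peer, dev_status s);

int lowlevel_send(int peer, const void *buf, int len)
{
    for (;;) {
        dev_status s = device_send(peer, buf, len, /*timeout_ms=*/1000);
        if (s == DEV_OK)
            return MPI_SUCCESS;                /* low-level success        */

        if (!eh_is_recoverable(s)) {           /* ask EH: recoverable?     */
            trigger_recovery_event(peer, s);   /* no: hand off to recovery */
            return MPI_ERR_OTHER;
        }
        if (s == DEV_TIMEOUT && !sct_peer_alive(peer)) {
            trigger_recovery_event(peer, s);   /* timeout and peer is dead */
            return MPI_ERR_OTHER;
        }
        /* recoverable error, or timeout with peer alive:
           reset the timeout and retry (the "A" loop in the chart) */
    }
}
```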
10
Overview of MPI/FT Models

Model      App. Style     MPI Support      nMR                            Cp/Recov (App/Sys)
MFT-I      SPMD           With MPI-1.2     No ranks / several ranks nMR   Yes / Yes
MFT-II     SPMD           No MPI           No ranks / several ranks nMR   Yes / Yes
MFT-IIIs   Master/Slave   With MPI-1.2     Rank 0 nMR                     Yes / Yes
MFT-IIIm   Master/Slave   With MPI-1.2     Several ranks nMR              Yes / Yes
MFT-IVs    Master/Slave   With MPI-2 DPM   Rank 0 nMR                     Yes / Yes
MFT-IVm    Master/Slave   With MPI-2 DPM   Several ranks nMR              Yes / Yes
—          Master/Slave   No MPI           No ranks / several ranks nMR   Yes / Yes
11
Design Choices
• Message replication (nMR to simplex)
  – Replicated ranks send/receive messages independently from each other
  – One copy of the replicated rank acts as the message conduit
• Message replication (nMR to nMR)
  – Replicated ranks send/receive messages independently from each other
  – One copy of the replicated rank acts as the message conduit
12
Parallel nMR Advantages
• Voting on messages only (not on each word of state); see the sketch below
[Diagram: three replicas A, B, and C of a four-rank job (ranks 0–3); n = 3, np = 4.]
• Local errors remain local
• Requires two failures to fail (e.g., A0 and C0)
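Because voting is on messages only, the receiving side can take a byte-wise majority over the n copies of each message before delivering it. A small illustration for n = 3, not the actual MPI/FT voting code:

```c
#include <stddef.h>

/* Byte-wise 2-of-3 majority vote over three received copies of a message.
   Returns 0 if every byte had a majority, -1 if some byte had none
   (all three copies disagreed). */
int vote_3mr(const unsigned char *a, const unsigned char *b,
             const unsigned char *c, unsigned char *out, size_t len)
{
    int ok = 0;
    for (size_t i = 0; i < len; i++) {
        if (a[i] == b[i] || a[i] == c[i])      out[i] = a[i];
        else if (b[i] == c[i])                 out[i] = b[i];
        else { out[i] = a[i]; ok = -1; }       /* no majority for this byte */
    }
    return ok;
}
```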
13

MFT-IIIs: Master/Slave with MPI-1.2
• Master (rank 0) is nMR
• MPI_COMM_WORLD is maintained in nMR
• MPI-1.2 only
  – Application cannot actively manage processes
  – Only middleware can restart processes
• Pros:
  – Supports send/receive MPI-1.2
  – Minimizes excess messaging
  – Largely MPI-application transparent
  – Quick recovery possible
  – ABFT-based process recovery assumed
• Cons:
  – Scales to O(10) ranks only
  – Voting still limited
  – Application explicitly fault aware
14
MFT-IVs: Master/Slave with MPI-2
• Master (rank 0) is nMR
• MPI_COMM_WORLD is maintained in nMR
• MPI_COMM_SPAWN() (see the sketch below)
• Pros:
  – Application can actively restart processes
  – Supports send/receive MPI-1.2 + DPM
  – Minimizes excess messaging
  – Largely MPI-application transparent
  – Quick recovery possible, simpler than MFT-IIIs
  – ABFT-based process recovery assumed
• Cons:
  – Scales to O(10) ranks only
  – Voting still limited
  – Application explicitly fault aware
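With MPI-2 dynamic process management, the application (or middleware acting on its behalf) can re-create a failed slave itself. A minimal sketch of the MPI_COMM_SPAWN() step, assuming the failure has already been detected and that the slave binary is a hypothetical ./slave; reintegrating the new process into the application's communication pattern is omitted:

```c
#include <mpi.h>
#include <stdio.h>

/* Respawn one slave and return an intercommunicator to it.  A sketch only:
   MFT-IVs recovery must also rebind the new process into the application. */
MPI_Comm respawn_slave(const char *slave_exe)
{
    MPI_Comm slave_comm = MPI_COMM_NULL;
    int spawn_err = MPI_SUCCESS;

    MPI_Comm_spawn((char *)slave_exe, MPI_ARGV_NULL, /*maxprocs=*/1,
                   MPI_INFO_NULL, /*root=*/0, MPI_COMM_SELF,
                   &slave_comm, &spawn_err);

    if (spawn_err != MPI_SUCCESS)
        fprintf(stderr, "respawn of %s failed\n", slave_exe);
    return slave_comm;
}
```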
15
Checkpointing the Master for Recovery
• Master checkpoints
• Voting on master liveness
• Master failure detected
  – Lowest-rank slave restarts the master from checkpointed data (sketched below)
  – Any of the slaves could promote and assume the role of master
  – Peer liveness knowledge required to decide the lowest rank
• Pros:
  – Recovery independent of the number of faults
  – No additional resources
• Cons:
  – Checkpointing further reduces scalability
  – Recovery time depends on the checkpointing frequency
[Diagram: Rank 0 (master) checkpoints to a storage medium; slaves Rank 1 through Rank n exchange MPI messages with the master (e.g., a message from 1 to 0 and from 0 to 2); arrows distinguish MPI messages from checkpointing data.]
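One way to realize "lowest-rank slave restarts the master" is for each slave, once master failure is detected, to check whether any lower-ranked slave is still alive before promoting itself. A rough sketch, assuming a liveness vector maintained by the gossip/SCT layer and hypothetical helpers read_master_checkpoint and become_master:

```c
/* live[r] != 0 means rank r is believed alive (maintained elsewhere, e.g.
   by gossip or the self-checking thread).  The helpers are hypothetical. */
int  read_master_checkpoint(void *state, int max_bytes);
void become_master(const void *state, int nbytes);

void on_master_failure(const int *live, int np, int my_rank)
{
    /* The master is rank 0; find the lowest-ranked live slave. */
    int lowest = -1;
    for (int r = 1; r < np; r++)
        if (live[r]) { lowest = r; break; }

    if (lowest == my_rank) {
        static char state[1 << 20];               /* checkpointed master state */
        int n = read_master_checkpoint(state, (int)sizeof state);
        become_master(state, n);                  /* promote and resume work   */
    }
    /* all other slaves wait for the promoted master to take over */
}
```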
16
Checkpointing Slaves for Recovery (Speculative)
• Slaves checkpoint periodically at a low frequency
• Prob. of failure of a slave > prob. of failure of the master
• Master failure detected
  – Recovered from data checkpointed at various slaves
  – Peer liveness knowledge required to decide the lowest rank
• Pros:
  – Checkpointing overhead of the master eliminated
  – Aids in faster recovery of slaves
• Cons:
  – Increase in master recovery time
  – Increase in overhead due to checkpointing of all the slaves
• Slaves are stateless, so checkpointing slaves doesn't help in any way in restarting the slaves
• Checkpointing at all the slaves could be really expensive
• Instead of checkpointing, slaves could return the results to the master
[Diagram: Rank 0 (master) and slaves Rank 1 through Rank n, each slave with its own storage medium (SM); arrows show the flow of MPI messages (e.g., from 1 to 0 and from 0 to 2) and of checkpointing data.]
17
Adaptive Checkpointing and nMR of the Master for Recovery
• Start with ‘n’ replicates
• Initial checkpointing calls generate no-ops
• Slaves track the liveness of the master and the replicates
• Failure of the last replicate initiates checkpointing
• Pros:
  – Tolerates ‘n’ faults with negligible recovery time
  – Subsequent faults can still be recovered
• Cons:
  – Increase in overhead of tracking the replicates
[Diagram: replicated Rank 0 (multiple copies of the master) checkpoints to a storage medium; slaves Rank 1 through Rank n exchange messages with the master (e.g., from 1 to 0 and from 0 to 2); arrows distinguish the logical flow of MPI messages, the actual flow of MPI messages, and checkpointing data.]
18
Self-Checking Threads
(Scales > O(10) nodes can be considered)
• Invoked by the MPI library
• Queried by the coordinator
• Checks whether peers are alive
• Checks for network sanity
• Serves coordinator queries
• Exploits timeouts
• Periodic execution, no polling (see the sketch below)
• Provides a heartbeat across the application
• Can check internal MPI state
• Vote on communicator state
• Check buffers
• Check queues for aging
• Check local program state
• Invoked periodically
• Invoked when suspicion arises
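A self-checking thread of the portable kind can be a plain POSIX thread that wakes up periodically, probes its peers, and raises suspicion rather than aborting. A minimal sketch, assuming a thread-capable MPI and hypothetical check_peer / report_suspect routines:

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical peer-liveness probe (e.g., a nonblocking ping with timeout)
   and a hypothetical hook for notifying the coordinator/gossip layer. */
extern int  check_peer(int rank);          /* 1 = responsive, 0 = suspect */
extern void report_suspect(int rank);

static volatile int sct_running = 1;

/* Self-checking thread: periodic execution, no busy polling. */
static void *sct_main(void *arg)
{
    int np = *(int *)arg;
    while (sct_running) {
        sleep(1);                          /* heartbeat period             */
        for (int r = 0; r < np; r++)
            if (!check_peer(r))
                report_suspect(r);         /* raise suspicion, don't abort */
        /* could also vote on communicator state, check queues for aging */
    }
    return NULL;
}

void start_sct(int *np)
{
    pthread_t tid;
    pthread_create(&tid, NULL, sct_main, np);
    pthread_detach(tid);
}
```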
19
MPI/FT SCT Support Levels
• Simple, non-portable; uses internal state of MPI and/or the system ("I")
• Simple, portable; exploits threads and PMPI_ ("II") or the PERUSE API (see the sketch below)
• Complex state checks, portable; exploits queue interfaces ("III")
• All of the above
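A level-"II" style check can ride on the standard PMPI profiling interface: the library's MPI_Send symbol is intercepted and the real work done through PMPI_Send. A trivial sketch (the failure counter is illustrative, not an MPI/FT facility):

```c
#include <mpi.h>
#include <stdio.h>

/* Intercept MPI_Send via the profiling interface and record the outcome
   for a self-checking thread to inspect later. */
static long sends_failed = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int err = PMPI_Send(buf, count, type, dest, tag, comm);
    if (err != MPI_SUCCESS) {
        sends_failed++;                    /* evidence of trouble */
        fprintf(stderr, "MPI_Send to rank %d returned %d\n", dest, err);
    }
    return err;
}
```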
20
MPI/FT Coordinator
• Spawned by mpirun or similar
• Closely coupled with (a "friend" of) the MPI library of the application
• User transparent
• Periodically collects status information from the application
• Can kill and restart the application or individual ranks
• Preferably implemented using MPI-2
• We'd like to replicate and distribute this functionality
21
Use of Gossip in MPI/FT
• Applications in Models III and IV assume a star topology
• Gossip requires a virtual all-to-all topology
• The data network may be distinct from the control network
• Gossip provides:
  – A potentially scalable and fully distributed scheme for failure detection and notification with reduced overhead
  – Notification of failures in the form of a broadcast
22
Gossip-based Failure Detection
[Diagram: gossip exchange among nodes 1, 2, and 3, with Tcleanup = 5 * Tgossip. Each node keeps a table of per-node heartbeats and a suspect vector S. When a gossip message arrives, entries carrying newer heartbeats are updated and the others are left unchanged; a node whose heartbeat has not been updated within Tcleanup gossip cycles is suspected and eventually declared dead.]
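The table update shown in the diagram reduces to: when a gossip message arrives, keep the larger heartbeat for every node, and suspect any node whose entry has not changed for Tcleanup gossip cycles. A compact sketch (the data layout and constants are illustrative, not the MPI/FT implementation):

```c
#define NP        8              /* number of nodes (illustrative)      */
#define T_CLEANUP 5              /* cycles before a node is suspected   */

typedef struct {
    int heartbeat[NP];           /* latest heartbeat seen for each node */
    int last_update[NP];         /* local cycle when entry last changed */
    int suspect[NP];             /* S: suspect vector                   */
} gossip_state;

/* Merge a received gossip message and refresh the suspect vector. */
void gossip_merge(gossip_state *g, const int incoming_hb[NP], int now)
{
    for (int i = 0; i < NP; i++) {
        if (incoming_hb[i] > g->heartbeat[i]) {   /* newer info: update  */
            g->heartbeat[i]   = incoming_hb[i];
            g->last_update[i] = now;
            g->suspect[i]     = 0;
        }
        if (now - g->last_update[i] >= T_CLEANUP) /* stale: suspect it   */
            g->suspect[i] = 1;
    }
}
```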
23
Consensus about Failure
[Diagram: suspect matrices maintained at Node 1 and Node 2, each tagged with a live list L; the suspect matrices are merged at Node 1, and once every live node suspects Node 3, Node 3 is declared dead at Nodes 1 and 2.]
L – Live list
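Consensus is then a matter of merging suspect vectors: a node is declared dead once every node still on the live list suspects it. A small sketch consistent with the merged matrices above (the layout is illustrative):

```c
#define NP 4   /* nodes 0..NP-1; illustrative size */

/* suspect[i][j] == 1 means node i suspects node j.
   live[i]       == 1 means node i is still on the live list L.
   Returns 1 if all live nodes agree that node j has failed. */
int consensus_on_failure(const int suspect[NP][NP], const int live[NP], int j)
{
    for (int i = 0; i < NP; i++) {
        if (!live[i]) continue;          /* dead nodes don't get a vote      */
        if (i == j)   continue;          /* ignore the suspect's own row     */
        if (!suspect[i][j]) return 0;    /* someone still trusts node j      */
    }
    return 1;                            /* unanimous: mark node j dead in L */
}
```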
24
Issues with Gossip - I
• After node a fails, if node b, the node that arrives at consensus on node a's failure last (the notification broadcaster), also fails before the broadcast:
  – Gossiping continues until another node, c, suspects that node b has failed
  – Node c broadcasts the failure notification of node a
  – Eventually node b is also determined to have failed
25
Issues with Gossip - II
• If the control and data networks are separate:
  – MPI progress threads monitor the status of the data network
  – Failure of the link to the master is indicated when communication operations time out
  – Gossip monitors the status of the control network
  – The progress threads communicate the suspected status of the master node to the gossip thread
  – Gossip incorporates this information in its own failure detection mechanism
26
Issues with Recovery
• If a network failure causes the partitioning of processes:
  – Two or more isolated groups may form that communicate within themselves
  – Each group assumes that the other processes have failed and attempts recovery
  – Only the group that can reach the checkpoint data is allowed to initiate recovery and proceed
    • The issue of recovering when multiple groups can access the checkpoint data is under investigation
  – If only nMR is used, the group with the master is allowed to proceed
    • The issue of recovering when the replicated master processes are split between groups is under investigation
27
Shifted APIs
• Try to "morally conserve" the MPI standard
• Timeout parameter added to messaging calls to control the behavior of individual MPI calls (see the sketch below)
  – Modify existing MPI calls, or
  – Add new calls with the added functionality to support the idea
• Add a callback function to MPI calls (for error handling)
  – Modify existing MPI calls, or
  – Add new calls with the added functionality
• Support in-band or out-of-band error management made explicit to the application
• Runs in concert with MPI_ERRORS_RETURN
• Offers the opportunity to give hints as well, where meaningful
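As an illustration of the idea, a shifted send could take the standard arguments plus a timeout and an error callback. The name and signature below are hypothetical, not part of MPI/FT or the MPI standard; the body shows one way such a call could be layered on nonblocking MPI:

```c
#include <mpi.h>

/* Hypothetical "shifted" send: standard arguments plus a timeout and an
   error callback.  Intended to run with MPI_ERRORS_RETURN in effect. */
typedef void (*MPIFT_Err_cb)(int mpi_err, void *ctx);

int MPIFT_Send_timed(const void *buf, int count, MPI_Datatype type,
                     int dest, int tag, MPI_Comm comm,
                     double timeout_sec, MPIFT_Err_cb cb, void *ctx)
{
    MPI_Request req;
    int done = 0, err;
    double t0 = MPI_Wtime();

    err = MPI_Isend(buf, count, type, dest, tag, comm, &req);
    while (err == MPI_SUCCESS && !done) {
        err = MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        if (!done && MPI_Wtime() - t0 > timeout_sec) {
            MPI_Cancel(&req);
            MPI_Request_free(&req);
            err = MPI_ERR_OTHER;             /* report as a timeout    */
        }
    }
    if (err != MPI_SUCCESS && cb)
        cb(err, ctx);                        /* in-band error handling */
    return err;
}
```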
28
Application-based Checkpoint
• Point of synchronization for a cohort of processes
• Minimal fault tolerance could be applied only at such checkpoints
• Defines the "save state" or "restart data" needed to resume
• Common practice in parallel CFD and other MPI codes, because of the reality of failures
• Essentially gets no special help from the system
• Look to parallel I/O (MPI-2) for improvement (see the sketch below)
• Why? Minimum complexity of I/O + feasible
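With MPI-2 parallel I/O, each rank can write its restart data at a rank-dependent offset in one shared checkpoint file. A minimal sketch of that pattern (the file name and layout are illustrative):

```c
#include <mpi.h>

/* Each rank writes `count` doubles of restart data into one shared file,
   at an offset determined by its rank.  Illustrative pattern only. */
void write_checkpoint(MPI_Comm comm, const double *state, int count)
{
    int rank;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Comm_rank(comm, &rank);
    offset = (MPI_Offset)rank * count * sizeof(double);

    MPI_File_open(comm, "app.ckpt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, state, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```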
29
In Situ Checkpoint Options
• Checkpoint to bulk memory
• Checkpoint to flash
• Checkpoint to other distributed RAM
• Other choices?
• Are these useful? It depends on the error model
30
Early Results with Hardening Transport: CRC vs. Time-Based nMR
[Charts: (1) comparison of nMR and CRC with the baseline using MPI/Pro (version 1.6.1-1tv): total time (sec) vs. message size (bytes, roughly 32 B to 512 KB) for the no-CRC, CRC, and 3MR cases; (2) MPI/Pro comparisons of time ratios, normalized against baseline performance: crc/nocrc and 3mr/nocrc vs. message size (bytes).]
31
Early Results, II
[Charts: (1) comparison of nMR and CRC with the baseline using MPICH (version 1.2.1): total time (sec) vs. message size (bytes) for the no-CRC, CRC, and 3MR cases; (2) MPICH comparison of time ratios using baseline MPI/Pro timings: nocrc_mpich/nocrc_mpipro, crc_mpich/nocrc_mpipro, and 3mr_mpich/nocrc_mpipro vs. message size (bytes).]
32
Early Results, III
[Charts: time-based nMR with MPI/Pro — (1) total time for 10,000 runs vs. message size for various nMR; (2) MPI/Pro time-ratio comparisons for various nMR relative to the baseline.]
33
Other Possible Models
• Master/Slave was considered before
• Broadcast/reduce data-parallel apps
• Independent processing + corner turns
• Ring computing
• Pipeline bi-partite computing
• General MPI-1 models (all-to-all)
• Idea: trade generality for coverage
34
What about Receiver-Based Models?
• Should we offer, instead of or in addition to MPI/Pro, a receiver-based model?
• Utilize publish/subscribe semantics
• Bulletin boards? Tagged messages, etc.
• Try to get rid of the single point of failure this way
• Sounds good; can it be done?
• Will anything like an MPI code work?
• Does anyone code this way!? (e.g., JavaSpaces, Linda, military embedded distributed computing)
35
Plans for Upcoming 12 Months
• Continue implementation of MPI/FT to support applications in simplex mode
• Remove single points of failure for Master/Slave
• Support for multi-SPMD models
• Explore additional application-relevant models
• Performance studies
• Integrate fully with the Gossip protocol for detection
36
Summary & Conclusions
• MPI/FT = MPI/Pro + one or more availability enhancements
• Fault-tolerance concerns lead to new MPI implementations
• Support for simplex, parallel nMR, and/or mixed mode
• nMR is not scalable
• Both time-based nMR and CRC (depending on message size and the MPI implementation) – can do now
• Self-checking threads – can do now
• Coordinator (execution models) – can do very soon
• Gossip for detection – can do, need to integrate
• Shifted APIs/callbacks – easy to do; will people use them?
• Early results with CRC vs. nMR over a TCP/IP cluster shown
37
Related Work
• G. Stellner (CoCheck, 1996)
  – Checkpointing that works with MPI and Condor
• M. Hayden (The Ensemble System, 1997)
  – Next-generation Horus communication toolkit
• G. Bosilca et al. (MPICH-V, 2002) – new, to be presented at SC2002
  – Grid-based modifications to MPICH for "volatile nodes"
  – Automated checkpoint, rollback, message logging
• G.F. Fagg and J.J. Dongarra (FT-MPI, 2000)
  – Growing/shrinking communicators in response to node failures, memory-based checkpointing, reversing calculation?
• A. Agbaria and R. Friedman (Starfish, 1999)
  – Event bus, works in a specialized language, related to Ensemble
• Evripidou et al. (A Portable Fault Tolerant Scheme for MPI, 1998)
  – Redundant-processes approach to masking failed nodes
• Also, substantial literature related to ISIS/HORUS (Dr. Birman et al. at Cornell) that is interesting for distributed computing
38