Multiple Fault-tolerant Framework for High
Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)
Heon Y. Yeom
Distributed Computing Systems Lab.
Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
Hardware performance limitations are steadily being overcome as Moore's Law continues to hold.
These cutting-edge technologies make "Tera-scale" clusters feasible!
However, what about the system's reliability?
Distributed systems are still fragile due to unexpected failures.
Motivation
High-performance Network Trend: the Multiple Fault-tolerant Framework targets three MPICH variants.
MPICH-G2 (Ethernet): good speed (1 Gbps); common; MPICH standard; demands fault-resilience!
MPICH-GM (Myrinet): high speed (10 Gbps); popular; MPICH compatible; demands fault-resilience!
MVAPICH (InfiniBand): high speed (up to 30 Gbps); will be popular; MPICH compatible; demands fault-resilience!
Introduction
Unreliability of distributed systems:
Even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure.
Our goal is to construct a practical multiple fault-tolerant framework for various MPICH variants running on high-performance clusters/Grids.
Introduction
Why the Message Passing Interface (MPI)?
Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
We chose the MPICH series because:
MPI is the most popular programming model in cluster computing.
Providing fault-tolerance to MPI is more cost-effective than providing it at the OS or hardware level.
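For context, here is a minimal MPI program in C (illustrative, not from the FT-MPICH sources). Every rank must survive to the collective call, which is exactly why a single local failure can doom the whole job:

```c
/* A minimal MPI program: if any one of these ranks dies mid-run,
 * standard MPICH aborts the whole job and all work is lost. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = 0, global = 0;
    for (long i = rank; i < 1000000; i += size)  /* split the work */
        local += i;

    /* A single failed rank here stalls or kills every other rank. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %ld\n", global);
    MPI_Finalize();
    return 0;
}
```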
Architecture
-Concept-
The Multiple Fault-tolerant Framework combines three building blocks: failure detection (monitoring), a consensus & election protocol, and a checkpoint/restart (C/R) protocol.
Architecture
-Overall System-
[Diagram] The Management System talks to the communication layer on each node over Ethernet/Gigabit Ethernet, while the MPI processes exchange application messages over the high-speed network (Myrinet, InfiniBand).
Architecture
-Development History-
2003: MPICH-GF, fault-tolerant MPICH-G2 (Ethernet)
2004: FT-MPICH-GM, fault-tolerant MPICH-GM (Myrinet)
2005 to current: FT-MVAPICH, fault-tolerant MVAPICH (InfiniBand)
Management System
Makes MPI more reliable by providing:
Failure detection
Checkpoint coordination
Initialization coordination
Checkpoint transfer
Output management
Recovery
Management System
[Diagram] A user (or a third-party scheduler such as PBS or LSF) submits a job through the CLI to the Leader Job Manager, which communicates over Ethernet with a Local Job Manager on each node. Each Local Job Manager supervises its MPI process; the MPI processes communicate over the high-speed network (e.g. Myrinet, InfiniBand) and write checkpoints to stable storage.
Job Management System 1/2
The Job Management System:
Manages and monitors multiple MPI processes and their execution environments
Should be lightweight
Helps the system take consistent checkpoints and recover from failures
Has a fault-detection mechanism
Two main components: the Central Manager and the Local Job Managers
Job Management System 2/2
Central Manager
Manages all system functions and states
Detects node and Job Manager failures via periodic heartbeats
Job Manager
Relays messages between the Central Manager and the MPI processes
Detects unexpected MPI process failures
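A minimal sketch of such a heartbeat-based failure detector, under assumed names (node_t, HEARTBEAT_INTERVAL, and mark_failed are illustrative, not from the FT-MPICH sources):

```c
#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_INTERVAL 1   /* seconds between expected beats */
#define MISS_LIMIT         3   /* beats missed before declaring failure */

typedef struct { int id; time_t last_beat; bool alive; } node_t;

/* Called whenever a heartbeat message arrives from a Job Manager. */
void on_heartbeat(node_t *n) { n->last_beat = time(NULL); }

/* Called periodically by the Central Manager: any node silent for too
 * long is declared dead, which triggers the recovery protocol. */
void sweep(node_t *nodes, int count, void (*mark_failed)(node_t *))
{
    time_t now = time(NULL);
    for (int i = 0; i < count; i++)
        if (nodes[i].alive &&
            now - nodes[i].last_beat > MISS_LIMIT * HEARTBEAT_INTERVAL) {
            nodes[i].alive = false;
            mark_failed(&nodes[i]);   /* start recovery for this node */
        }
}
```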
Fault-Tolerant MPI 1/3
To provide MPI fault-tolerance, we adopt:
A coordinated checkpointing scheme (vs. an independent scheme): the Central Manager is the coordinator!
Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators.
A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of MPI source code (see the sketch after this list).
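A minimal sketch of how user-transparent, application-level checkpointing can be wired in without touching application source: the FT library installs a signal handler at startup, so an unmodified MPI program checkpoints on demand. save_process_image() and the storage path are hypothetical stand-ins for the checkpoint toolkit:

```c
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the checkpoint toolkit's entry point:
 * dumps the process image (registers, stack, heap) to `path`. */
static void save_process_image(const char *path) { (void)path; }

static void ckpt_handler(int signo)
{
    (void)signo;
    /* A real implementation must defer to a safe point; a process
     * cannot be checkpointed in the middle of an MPI call. */
    save_process_image("/stable_storage/rank.ckpt");  /* illustrative path */
}

/* Called from inside the FT library's wrapped MPI_Init, so the
 * user's MPI source code never changes. */
void install_ckpt_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ckpt_handler;
    sigaction(SIGUSR1, &sa, NULL);
}
```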
Fault-Tolerant MPI 2/3
Coordinated Checkpointing
[Diagram] The Central Manager issues a checkpoint command to all ranks (rank0 to rank3); each rank writes its image to stable storage, producing a new checkpoint version (ver 1, ver 2, ...).
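A sketch of the coordinator side of one checkpoint round shown in the figure; the helper functions are hypothetical stand-ins for the Central Manager's messaging layer, not the actual FT-MPICH interface:

```c
#include <stdio.h>

#define NPROCS 4

/* Hypothetical helpers standing in for the manager's messaging layer. */
static void send_checkpoint_cmd(int rank) { printf("ckpt -> rank%d\n", rank); }
static void wait_for_ack(int rank)        { printf("ack  <- rank%d\n", rank); }
static void commit_version(int version)   { printf("commit ver %d\n", version); }

/* One coordinated checkpoint round, as in the figure: the Central
 * Manager tells every rank to pause and checkpoint, waits until all
 * images reach stable storage, then atomically bumps the version.
 * If any rank fails to ack, version N-1 remains the recovery line. */
void coordinated_checkpoint(int version)
{
    for (int r = 0; r < NPROCS; r++)
        send_checkpoint_cmd(r);      /* ranks flush in-flight messages */
    for (int r = 0; r < NPROCS; r++)
        wait_for_ack(r);             /* checkpoint image safely stored */
    commit_version(version);         /* all-or-nothing: old version kept */
}
```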
Fault-Tolerant MPI 3/3
Recovery from failures
[Diagram] When the Central Manager detects a failure, it rolls all ranks (rank0 to rank3) back to the last committed checkpoint version in stable storage and resumes execution.
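A matching sketch of the recovery path: because checkpoints are coordinated, every rank (not just the failed one) returns to the same committed version, keeping the global state consistent. The helpers are again illustrative:

```c
#include <stdio.h>

#define NPROCS 4

/* Hypothetical helpers for the manager's recovery machinery. */
static void restart_process(int rank, int ver)
    { printf("restart rank%d from ver %d\n", rank, ver); }
static void rollback_process(int rank, int ver)
    { printf("rollback rank%d to ver %d\n", rank, ver); }
static int  last_committed_version(void) { return 1; }

/* Invoked by the failure detector when one rank dies: the failed rank
 * is restarted from the last committed checkpoint, and every surviving
 * rank is rolled back to that same version, so no rank runs ahead of
 * the global recovery line. */
void recover(int failed_rank)
{
    int ver = last_committed_version();
    for (int r = 0; r < NPROCS; r++) {
        if (r == failed_rank)
            restart_process(r, ver);
        else
            rollback_process(r, ver);
    }
}
```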
Management System
MPICH-GF
Based on the Globus Toolkit 2
Hierarchical management system, suitable for multiple clusters
Supports recovery from process/manager/node failure
Limitations:
Does not support recovery from multiple simultaneous failures
Has a single point of failure (the Central Manager)
Management System
FT-MPICH-GM
New version; it does not rely on the Globus Toolkit.
Removes the hierarchical structure: Myrinet/InfiniBand clusters no longer require it.
Supports recovery from multiple failures
FT-MVAPICH
More robust: removes the single point of failure via leader election for the job manager (a sketch follows).
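The slides do not name the election algorithm; a minimal sketch of one common choice is given below, assuming every Job Manager knows the same member list and the liveness table maintained by the heartbeat detector sketched earlier:

```c
#include <stdbool.h>

#define NMGRS 8

/* Liveness table kept up to date by the heartbeat failure detector. */
static bool alive[NMGRS];

/* Deterministic election: all managers scan the same member list, so
 * they independently agree on the highest-ranked live manager as the
 * new leader, without needing a central coordinator. */
int elect_leader(void)
{
    for (int id = NMGRS - 1; id >= 0; id--)
        if (alive[id])
            return id;
    return -1;  /* no manager left alive; the job cannot recover */
}
```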
Fault-tolerant MPICH variants
[Diagram] All three variants share the same layered stack: collective operations and P2P operations sit on top of the ADI (Abstract Device Interface), and beneath the ADI each variant plugs in its device: Globus2 over Ethernet (MPICH-GF), GM over Myrinet (FT-MPICH-GM), and MVAPICH over InfiniBand (FT-MVAPICH). A common FT module provides the recovery module, the checkpoint toolkit, and connection re-establishment.
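A sketch of what connection re-establishment involves after a rollback: each surviving rank discards its stale endpoint to the restarted peer and rebuilds it from fresh address information republished by the Job Managers. The helper names are illustrative, not the FT module's real interface:

```c
#include <stdio.h>

#define NPROCS 4

typedef struct { int rank; int port; } endpoint_t;

/* Hypothetical helpers standing in for the device layer (GM,
 * InfiniBand verbs, or TCP, depending on the variant). */
static void       close_endpoint(endpoint_t *ep) { ep->port = -1; }
static endpoint_t lookup_new_address(int rank)
    { endpoint_t ep = { rank, 5000 + rank }; return ep; }  /* illustrative */
static void       connect_endpoint(endpoint_t *ep)
    { printf("reconnect to rank%d:%d\n", ep->rank, ep->port); }

/* After recovery, a surviving rank drops its stale connection to the
 * restarted peer and redials the address the Job Manager republished,
 * so pre-failure and post-recovery traffic never mix. */
void reestablish(endpoint_t table[NPROCS], int restarted_rank)
{
    close_endpoint(&table[restarted_rank]);
    table[restarted_rank] = lookup_new_address(restarted_rank);
    connect_endpoint(&table[restarted_rank]);
}
```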
Future Works
We are working to incorporate our FT protocol into the GT-4 framework:
MPICH-GF is GT-2 compliant
Incorporating the fault-tolerant management protocol into GT-4
Making MPICH work with different clusters:
Gig-E, Myrinet, InfiniBand
Open-MPI, VMI, etc.
Supporting non-Intel CPUs: AMD (Opteron)
GRID Issues
Who should be responsible for:
Monitoring the up/down status of nodes?
Resubmitting the failed process?
Allocating new nodes?
The GRID job management layer, resource management, the scheduler, health monitoring ... ?