Multiple Fault-tolerant Framework for High
Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH)
Heon Y. Yeom
Distributed Computing Systems Lab.
Seoul National University
Contents
1. Motivation
2. Introduction
3. Architecture
4. Conclusion
Motivation
Hardware performance limitations are steadily being overcome as Moore's Law continues to hold.
These cutting-edge technologies make "Tera-scale" clusters feasible!
However, what about the system's reliability?
Distributed systems are still fragile due to unexpected failures.
Motivation
High-performance Network Trend: the Multiple Fault-tolerant Framework targets three MPICH variants.
MPICH-G2 (Ethernet): good speed (1 Gbps); common; MPICH standard; demands fault-resilience!
MPICH-GM (Myrinet): high speed (10 Gbps); popular; MPICH compatible; demands fault-resilience!
MVAPICH (InfiniBand): high speed (up to 30 Gbps); will be popular; MPICH compatible; demands fault-resilience!
Introduction
Unreliability of distributed systems:
Even a single local failure can be fatal to parallel processes, since it can render useless all computation executed up to the point of failure.
Our goal is to construct a practical multiple fault-tolerant framework for various MPICH variants running on high-performance clusters/Grids.
Introduction
Why the Message Passing Interface (MPI)?
Designing a generic FT framework is extremely hard due to the diversity of hardware and software systems.
We chose the MPICH series because:
MPI is the most popular programming model in cluster computing.
Providing fault-tolerance to MPI is more cost-effective than providing it at the OS or hardware level.
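For context, here is a minimal MPI program in C (illustrative, not from the FT-MPICH sources). Every rank must survive to the collective call, which is exactly why a single local failure can doom the whole job:

```c
/* A minimal MPI program: if any one of these ranks dies mid-run,
 * standard MPICH aborts the whole job and all work is lost. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = 0, global = 0;
    for (long i = rank; i < 1000000; i += size)  /* split the work */
        local += i;

    /* A single failed rank here stalls or kills every other rank. */
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %ld\n", global);
    MPI_Finalize();
    return 0;
}
```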
Architecture
-Concept-
The Multiple Fault-tolerant Framework combines three building blocks: failure detection (monitoring), a consensus & election protocol, and a checkpoint/restart (C/R) protocol.
Architecture
-Overall System-
[Diagram] The Management System talks to the communication layer on each node over Ethernet/Gigabit Ethernet, while the MPI processes exchange application messages over the high-speed network (Myrinet, InfiniBand).
Architecture
-Development History-
2003: MPICH-GF, fault-tolerant MPICH-G2 (Ethernet)
2004: FT-MPICH-GM, fault-tolerant MPICH-GM (Myrinet)
2005 to current: FT-MVAPICH, fault-tolerant MVAPICH (InfiniBand)
Management System
Makes MPI more reliable by providing:
Failure detection
Checkpoint coordination
Initialization coordination
Checkpoint transfer
Output management
Recovery
Management System
[Diagram] A user (or a third-party scheduler such as PBS or LSF) submits a job through the CLI to the Leader Job Manager, which communicates over Ethernet with a Local Job Manager on each node. Each Local Job Manager supervises its MPI process; the MPI processes communicate over the high-speed network (e.g. Myrinet, InfiniBand) and write checkpoints to stable storage.
Job Management System 1/2
The Job Management System:
Manages and monitors multiple MPI processes and their execution environments
Should be lightweight
Helps the system take consistent checkpoints and recover from failures
Has a fault-detection mechanism
Two main components: the Central Manager and the Local Job Managers
Job Management System 2/2
Central Manager
Manages all system functions and states
Detects node and Job Manager failures via periodic heartbeats
Job Manager
Relays messages between the Central Manager and the MPI processes
Detects unexpected MPI process failures
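A minimal sketch of such a heartbeat-based failure detector, under assumed names (node_t, HEARTBEAT_INTERVAL, and mark_failed are illustrative, not from the FT-MPICH sources):

```c
#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_INTERVAL 1   /* seconds between expected beats */
#define MISS_LIMIT         3   /* beats missed before declaring failure */

typedef struct { int id; time_t last_beat; bool alive; } node_t;

/* Called whenever a heartbeat message arrives from a Job Manager. */
void on_heartbeat(node_t *n) { n->last_beat = time(NULL); }

/* Called periodically by the Central Manager: any node silent for too
 * long is declared dead, which triggers the recovery protocol. */
void sweep(node_t *nodes, int count, void (*mark_failed)(node_t *))
{
    time_t now = time(NULL);
    for (int i = 0; i < count; i++)
        if (nodes[i].alive &&
            now - nodes[i].last_beat > MISS_LIMIT * HEARTBEAT_INTERVAL) {
            nodes[i].alive = false;
            mark_failed(&nodes[i]);   /* start recovery for this node */
        }
}
```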
Fault-Tolerant MPI 1/3
To provide MPI fault-tolerance, we adopt:
A coordinated checkpointing scheme (vs. an independent scheme): the Central Manager is the coordinator!
Application-level checkpointing (vs. kernel-level checkpointing): this method requires no effort on the part of cluster administrators.
A user-transparent checkpointing scheme (vs. user-aware): this method requires no modification of MPI source code (see the sketch after this list).
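A minimal sketch of how user-transparent, application-level checkpointing can be wired in without touching application source: the FT library installs a signal handler at startup, so an unmodified MPI program checkpoints on demand. save_process_image() and the storage path are hypothetical stand-ins for the checkpoint toolkit:

```c
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the checkpoint toolkit's entry point:
 * dumps the process image (registers, stack, heap) to `path`. */
static void save_process_image(const char *path) { (void)path; }

static void ckpt_handler(int signo)
{
    (void)signo;
    /* A real implementation must defer to a safe point; a process
     * cannot be checkpointed in the middle of an MPI call. */
    save_process_image("/stable_storage/rank.ckpt");  /* illustrative path */
}

/* Called from inside the FT library's wrapped MPI_Init, so the
 * user's MPI source code never changes. */
void install_ckpt_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = ckpt_handler;
    sigaction(SIGUSR1, &sa, NULL);
}
```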
Fault-Tolerant MPI 2/3
Coordinated Checkpointing
[Diagram] The Central Manager issues a checkpoint command to all ranks (rank0 to rank3); each rank writes its image to stable storage, producing a new checkpoint version (ver 1, ver 2, ...).
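A sketch of the coordinator side of one checkpoint round shown in the figure; the helper functions are hypothetical stand-ins for the Central Manager's messaging layer, not the actual FT-MPICH interface:

```c
#include <stdio.h>

#define NPROCS 4

/* Hypothetical helpers standing in for the manager's messaging layer. */
static void send_checkpoint_cmd(int rank) { printf("ckpt -> rank%d\n", rank); }
static void wait_for_ack(int rank)        { printf("ack  <- rank%d\n", rank); }
static void commit_version(int version)   { printf("commit ver %d\n", version); }

/* One coordinated checkpoint round, as in the figure: the Central
 * Manager tells every rank to pause and checkpoint, waits until all
 * images reach stable storage, then atomically bumps the version.
 * If any rank fails to ack, version N-1 remains the recovery line. */
void coordinated_checkpoint(int version)
{
    for (int r = 0; r < NPROCS; r++)
        send_checkpoint_cmd(r);      /* ranks flush in-flight messages */
    for (int r = 0; r < NPROCS; r++)
        wait_for_ack(r);             /* checkpoint image safely stored */
    commit_version(version);         /* all-or-nothing: old version kept */
}
```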
Fault-Tolerant MPI 3/3
Recovery from failures
[Diagram] When the Central Manager detects a failure, it rolls all ranks (rank0 to rank3) back to the last committed checkpoint version in stable storage and resumes execution.
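A matching sketch of the recovery path: because checkpoints are coordinated, every rank (not just the failed one) returns to the same committed version, keeping the global state consistent. The helpers are again illustrative:

```c
#include <stdio.h>

#define NPROCS 4

/* Hypothetical helpers for the manager's recovery machinery. */
static void restart_process(int rank, int ver)
    { printf("restart rank%d from ver %d\n", rank, ver); }
static void rollback_process(int rank, int ver)
    { printf("rollback rank%d to ver %d\n", rank, ver); }
static int  last_committed_version(void) { return 1; }

/* Invoked by the failure detector when one rank dies: the failed rank
 * is restarted from the last committed checkpoint, and every surviving
 * rank is rolled back to that same version, so no rank runs ahead of
 * the global recovery line. */
void recover(int failed_rank)
{
    int ver = last_committed_version();
    for (int r = 0; r < NPROCS; r++) {
        if (r == failed_rank)
            restart_process(r, ver);
        else
            rollback_process(r, ver);
    }
}
```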
Management System
MPICH-GF
Based on the Globus Toolkit 2
Hierarchical management system, suitable for multiple clusters
Supports recovery from process/manager/node failure
Limitations:
Does not support recovery from multiple simultaneous failures
Has a single point of failure (the Central Manager)
Management System
FT-MPICH-GM
New version; it does not rely on the Globus Toolkit.
Removes the hierarchical structure: Myrinet/InfiniBand clusters no longer require it.
Supports recovery from multiple failures
FT-MVAPICH
More robust: removes the single point of failure via leader election for the job manager (a sketch follows).
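The slides do not name the election algorithm; a minimal sketch of one common choice is given below, assuming every Job Manager knows the same member list and the liveness table maintained by the heartbeat detector sketched earlier:

```c
#include <stdbool.h>

#define NMGRS 8

/* Liveness table kept up to date by the heartbeat failure detector. */
static bool alive[NMGRS];

/* Deterministic election: all managers scan the same member list, so
 * they independently agree on the highest-ranked live manager as the
 * new leader, without needing a central coordinator. */
int elect_leader(void)
{
    for (int id = NMGRS - 1; id >= 0; id--)
        if (alive[id])
            return id;
    return -1;  /* no manager left alive; the job cannot recover */
}
```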
Fault-tolerant MPICH variants
[Diagram] All three variants share the same layered stack: collective operations and P2P operations sit on top of the ADI (Abstract Device Interface), and beneath the ADI each variant plugs in its device: Globus2 over Ethernet (MPICH-GF), GM over Myrinet (FT-MPICH-GM), and MVAPICH over InfiniBand (FT-MVAPICH). A common FT module provides the recovery module, the checkpoint toolkit, and connection re-establishment.
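A sketch of what connection re-establishment involves after a rollback: each surviving rank discards its stale endpoint to the restarted peer and rebuilds it from fresh address information republished by the Job Managers. The helper names are illustrative, not the FT module's real interface:

```c
#include <stdio.h>

#define NPROCS 4

typedef struct { int rank; int port; } endpoint_t;

/* Hypothetical helpers standing in for the device layer (GM,
 * InfiniBand verbs, or TCP, depending on the variant). */
static void       close_endpoint(endpoint_t *ep) { ep->port = -1; }
static endpoint_t lookup_new_address(int rank)
    { endpoint_t ep = { rank, 5000 + rank }; return ep; }  /* illustrative */
static void       connect_endpoint(endpoint_t *ep)
    { printf("reconnect to rank%d:%d\n", ep->rank, ep->port); }

/* After recovery, a surviving rank drops its stale connection to the
 * restarted peer and redials the address the Job Manager republished,
 * so pre-failure and post-recovery traffic never mix. */
void reestablish(endpoint_t table[NPROCS], int restarted_rank)
{
    close_endpoint(&table[restarted_rank]);
    table[restarted_rank] = lookup_new_address(restarted_rank);
    connect_endpoint(&table[restarted_rank]);
}
```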
Future Works
We are working to incorporate our FT protocol into the GT-4 framework:
MPICH-GF is GT-2 compliant
Incorporating the fault-tolerant management protocol into GT-4
Making MPICH work with different clusters:
Gig-E, Myrinet, InfiniBand
Open-MPI, VMI, etc.
Supporting non-Intel CPUs: AMD (Opteron)
GRID Issues
Who should be responsible for:
Monitoring the up/down status of nodes?
Resubmitting the failed process?
Allocating new nodes?
The GRID job management layer, resource management, the scheduler, health monitoring ... ?