Scalable FT - Computer Science


Scalable Fault Tolerance:
Xen Virtualization for PGAS Models
on High-Performance Networks
Daniele Scarpazza, Oreste Villa, Fabrizio Petrini, Jarek Nieplocha,
Vinod Tipparaju, Manoj Krishnan
Pacific Northwest National Laboratory
Radu Teodorescu, Jun Nakano, Josep Torrellas
University of Illinois
Duncan Roweth
Quadrics
In collaboration with
Patrick Mullaney, Novell
Wayne Augsburger, Mellanox
FastOS, Santa Clara CA, June 18 2007
Project Motivation
[Figure: MTBF (hours) as a function of system size (number of nodes), comparing TeraFLOPS- and PetaFLOPS-scale systems]
• Component count in high-end systems has been growing
• How do we utilize large (10^3–10^5 processor) systems for solving complex science problems?
• Fundamental problems
– Scalability to massive processor counts
– Application performance on a single processor, given the increasingly complex memory hierarchy
– Hardware and software failures
Multiple FT Techniques
Application drivers
• The multidisciplinary, multiresolution, and multiscale nature of scientific problems drives the demand for high-end systems
• Applications place increasingly differing demands on system resources: disk, network, memory, and CPU
• Some applications have natural fault resiliency and require very little support
System drivers
• Different I/O configurations, programmable or simple/commodity NICs, proprietary/custom/commodity operating systems
Tradeoffs between acceptable rates of failure and cost
• Cost effectiveness is the main constraint in HPC
Therefore, it is not cost-effective or practical to rely on a single fault tolerance approach for all applications and systems
Key Elements of SFT
• BCS – Buffered Coscheduling provides global coordination of system activities, communication, and CR
• ReVive / ReVive I/O – efficient CR capability for shared-memory servers (cluster nodes)
• IBA/QsNET – virtualization of high-performance network interfaces and protocols (the focus of this talk)
• FT ARMCI – fault-tolerance module for the ARMCI runtime system
• XEN – hypervisor to enable virtualization of the compute-node environment, including the OS (external dependency)
Transparent System-level CR of PGAS Applications on Infiniband and QsNET
We explored a new approach to cluster fault tolerance by integrating Xen with the latest generations of Infiniband and Quadrics high-performance networks
Focus on Partitioned Global Address Space (PGAS) programming models
• Most existing work has focused on MPI
Design goals
• Low-overhead and transparent migration
Main Contributions
Integration of Xen and Infiniband
• Enhanced Xen's kernel modules to fully support user-level Infiniband protocols and IP over IB with minimal overhead
Support for Partitioned Global Address Space programming models (PGAS)
• Emphasis on ARMCI
Automatic detection of a Global Recovery Line and coordinated migration
• Perform a live migration without any change to user applications
Experimental evaluation
Xen Hypervisor
On each machine, Xen allows the creation of a privileged virtual machine (Dom0) and one or more non-privileged VMs (DomUs)
Xen provides the ability to pause, un-pause, checkpoint, and resume DomUs (a sketch of these control operations follows below)
Xen employs para-virtualization
• Non-privileged domains run a modified operating system featuring guest device drivers
• Their requests are forwarded to the native device driver in Dom0 using a split driver model
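As a rough illustration of the domain control operations the infrastructure builds on, the sketch below pauses and un-pauses a DomU through the libxenctrl interface of Xen 3.x. The domain ID is a placeholder, error handling is minimal, and saving/restoring a full domain image is normally driven through the management toolstack (xm save / xm restore) rather than from application code; this is a sketch under those assumptions, not the project's actual implementation.

```c
/* Minimal sketch (assumptions: Xen 3.x libxenctrl, a DomU with ID 5).
 * Pausing a DomU freezes its vCPUs, which is the first step of a
 * checkpoint; un-pausing lets the guest run again. */
#include <stdio.h>
#include <stdint.h>
#include <xenctrl.h>

int main(void)
{
    uint32_t domid = 5;              /* hypothetical DomU ID */
    int xc = xc_interface_open();    /* handle to the hypervisor control interface */
    if (xc < 0) {
        perror("xc_interface_open");
        return 1;
    }

    if (xc_domain_pause(xc, domid))  /* freeze the guest's vCPUs */
        fprintf(stderr, "pause failed\n");

    /* ... a checkpoint of the paused domain would be taken here,
     *     e.g. by invoking "xm save" from the management toolstack ... */

    if (xc_domain_unpause(xc, domid)) /* resume the guest */
        fprintf(stderr, "unpause failed\n");

    xc_interface_close(xc);
    return 0;
}
```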
Infiniband Device Driver
The driver is implemented in two sections
• A para-virtualized section for slow-path control operations (e.g., queue pair creation)
• A direct-access section for fast-path data operations (transmit/receive); see the sketch below
Based on the Ohio State/IBM implementation
The driver was extended to support additional CPU architectures and Infiniband adapters
• Added a proxy layer to allow subnet and connection management from guest VMs
• Propagate suspend/resume events to applications, not only to kernel modules
• Several stability improvements
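To show where the split-driver boundary falls, the sketch below walks through a typical libibverbs sequence and marks which verbs are control-path (proxied through the para-virtualized back-end in Dom0) and which are data-path (direct hardware access from the guest). It assumes libibverbs and one HCA are present; connection setup (address exchange, ibv_modify_qp) is omitted and only indicated in comments.

```c
/* Sketch of the control-path / data-path split seen by the Xen
 * Infiniband driver (assumptions: libibverbs, one HCA installed). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);      /* control path */
    if (!devs || num == 0) {
        fprintf(stderr, "no Infiniband devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);        /* control path */
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* control path */
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);  /* control path */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);               /* control path */
    if (!qp) { fprintf(stderr, "cannot create QP\n"); return 1; }
    printf("created QP %u (slow path, proxied through Dom0)\n", qp->qp_num);

    /* Once the QP is connected (address exchange and ibv_modify_qp omitted),
     * data transfers bypass Dom0 entirely:
     *   ibv_post_send(qp, &wr, &bad_wr);   -- fast path, direct HW access
     *   ibv_poll_cq(cq, 1, &wc);           -- fast path, direct HW access */

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```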
Software Stack
Xen/Infiniband Device Driver:
Communication Performance
Parallel Programming Models
Single Threaded
• Data Parallel, e.g. HPF
Multiple Processes
• Partitioned-Local Data Access
– MPI
• Uniform-Global-Shared Data Access
– OpenMP
• Partitioned-Global-Shared Data Access
– Co-Array Fortran
• Uniform-Global-Shared + Partitioned Data Access
– UPC, Global Arrays, X10
Fault Tolerance in PGAS Models
Implementation considerations
• 1-sided communication, perhaps some 2-sided and collectives
– Special considerations in the implementation of a global recovery line
• Memory operations need to be synchronized for checkpoint/restart
• Memory is a combination of local and global (globally visible)
– Global memory may be shared from the OS point of view
– Pinned and registered with the network adapter
[Figure: SMP nodes 0 … n connected by a network; each node hosts processes (P) with local memory (L) and globally visible memory (G)]
Xen-enabled ARMCI
Runtime system for one-sided communication
• Used by Global Arrays, Rice Co-Array Fortran, GPSHMEM; an IBM X10 port is under way
Portable high-performance remote memory copy interface
• Asynchronous remote memory access (RMA)
• Fast collective operations
• Zero-copy protocols, explicit NIC support
• "Pure" non-blocking communication - 99.9% overlap
• Data locality: shared memory within an SMP node and RMA across nodes
• High performance delivered on a wide range of platforms
• Multi-protocol and multi-method implementation
[Figure: examples of data transfers optimized in ARMCI]
Fundamental Communication Models in HPC
• Message passing (2-sided model): P0 sends A and P1 receives it into B
• Remote memory access (1-sided model): P0 puts A directly into B on P1 (sketched in code below)
• Shared-memory load/stores (0-sided model): P0 and P1 access the same memory, A = B
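To make the 1-sided model concrete, the sketch below performs a simple put between two processes using the classic ARMCI C API (ARMCI_Init, ARMCI_Malloc, ARMCI_Put, ARMCI_Fence). It assumes an MPI launcher for process management; the buffer size and values are purely illustrative.

```c
/* Minimal one-sided put sketch with ARMCI (assumptions: MPI used for
 * process startup, at least 2 processes). Process 0 writes directly
 * into process 1's globally visible buffer; no receive is posted on
 * process 1, which is the essence of the 1-sided model. */
#include <stdio.h>
#include <mpi.h>
#include <armci.h>

#define N 8

int main(int argc, char **argv)
{
    int me, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    ARMCI_Init();

    /* Collective allocation: ptrs[p] holds the address of process p's
     * globally visible (pinned, registered) buffer. */
    void *ptrs[nproc];
    ARMCI_Malloc(ptrs, N * sizeof(double));

    double src[N];
    for (int i = 0; i < N; i++) src[i] = me * 100.0 + i;

    MPI_Barrier(MPI_COMM_WORLD);

    if (me == 0 && nproc > 1) {
        ARMCI_Put(src, ptrs[1], N * sizeof(double), 1); /* one-sided put to proc 1 */
        ARMCI_Fence(1);                                 /* wait for remote completion */
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (me == 1) {
        double *buf = (double *) ptrs[1];
        printf("process 1 received %.1f ... %.1f\n", buf[0], buf[N - 1]);
    }

    ARMCI_Free(ptrs[me]);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
```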
Global Recovery Lines (GRLs)
A GRL is required before each checkpoint / migration
A GRL is required for Infiniband networks because
• IBA does not allow location-independent layer 2 and layer 3 addresses
• IBA hardware maintains stateful connections that are not accessible by software
The protocol that enforces a GRL has
• A drain phase, which completes any ongoing communication
• Followed by a global silence, during which it is possible to perform node migration
• And a resume phase, in which the processing nodes acquire knowledge of the new network topology
A skeleton of these three phases is sketched below
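The following is a hypothetical skeleton of the three GRL phases, not the actual SFT implementation: the helpers drain_network, checkpoint_or_migrate, and reconnect_endpoints are placeholders for the real runtime/driver hooks, and MPI barriers stand in for the runtime's own coordination layer.

```c
/* Hypothetical Global Recovery Line skeleton. Phase boundaries are
 * global: no process enters the "silence" phase until every process
 * has finished draining its in-flight communication. */
#include <stdio.h>
#include <mpi.h>

/* Placeholders for the real runtime/driver hooks. */
static void drain_network(void)         { /* complete outstanding RDMA operations */ }
static void checkpoint_or_migrate(void) { /* pause the DomU, save or move its image */ }
static void reconnect_endpoints(void)   { /* rebuild connections for the new topology */ }

void global_recovery_line(void)
{
    /* 1. Drain: finish all in-flight one-sided operations. */
    drain_network();
    MPI_Barrier(MPI_COMM_WORLD);   /* everyone has drained */

    /* 2. Global silence: no network traffic; safe to checkpoint/migrate. */
    checkpoint_or_migrate();
    MPI_Barrier(MPI_COMM_WORLD);   /* all images saved or moved */

    /* 3. Resume: learn the new topology and re-establish connections. */
    reconnect_endpoints();
    MPI_Barrier(MPI_COMM_WORLD);   /* safe to resume communication */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    global_recovery_line();
    printf("GRL complete\n");
    MPI_Finalize();
    return 0;
}
```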
GRL and Resource Management
Experimental Evaluation
Our experimental testbed is a cluster of 8 Dell PowerEdge 1950 servers
Each node has two dual-core Intel Xeon Woodcrest processors running at 3.0 GHz, with 8 GB of memory
The cluster is interconnected with Mellanox InfiniHost III 4X HCA adapters
SUSE Linux Enterprise Server 10
Xen 3.0.2
Timing of a GRL
Scalability of Network Drain
Scalability of Network Resume
Save and Restore Latencies
Migration Latencies
Conclusion
We have presented a novel software infrastructure that allows completely transparent checkpoint/restart
We have implemented a device driver that enhances the existing Xen/Infiniband drivers
Support for PGAS programming models
Minimal overhead: tens of milliseconds
• Most of the time is spent saving/restoring the node image