The NetFT Project


Low Overhead Fault Tolerant
Networking (in Myrinet)
Vijay Lakamraju
Israel Koren
C. Mani Krishna
Architecture and Real-Time Systems (ARTS) Lab.
Department of Electrical and Computer Engineering
University of Massachusetts Amherst, MA 01003
Motivation

- An increasing use of COTS components in systems has been motivated by the need to:
  - Reduce cost in design and maintenance
  - Reduce software complexity
- The emergence of low-cost, high-performance COTS networking solutions
  - e.g., Myrinet, SCI, Fibre Channel, etc.
- The increasing complexity of network interfaces has renewed concerns about their reliability
  - The amount of silicon used has increased tremendously
The Basic Question
How can we incorporate fault tolerance
into a COTS network technology without
greatly compromising its performance?
Microprocessor-based Networks

- Most modern network technologies have processors in their interface cards that help to achieve superior network performance
- Many of these technologies allow changes in the program running on the network processor
- Such programmable interfaces offer numerous benefits:
  - Developing different fault tolerance techniques
  - Validating fault recovery using fault injection
  - Experimenting with different communication protocols
- We use Myrinet as the platform for our study
Myrinet

- Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
- At its core is a powerful RISC processor
- It is scalable to thousands of nodes
- Low-latency communication (8 µs) is achieved through direct interaction with the network interface ("OS bypass")
- Flow control, error control, and simple "heartbeat mechanisms" are incorporated in hardware
- Link and routing specifications are public & standard
- Myrinet support software is supplied "open source"
Myrinet Configuration

[Figure: block diagram — the host node (host processor, system memory, system bridge) connects over the I/O bus to the Myrinet card, which contains a PCI bridge, DMA engine (PCIDMA), LANai SRAM, and the LANai 9 chip (RISC core, timers 0-2, host interface, packet interface) with SAN/LAN conversion logic.]
Hardware & Software

[Figure: hardware/software stack — on the host, the application sits above middleware (e.g., MPI), the TCP/IP interface, and the OS driver, connected over the I/O bus to the Myrinet card; the card's network processor and local memory run the Myrinet Control Program on the programmable interface.]
Susceptibility to Failures

- Dependability evaluation was carried out using software-implemented fault injection
  - Faults were injected into the Myrinet Control Program (MCP)
- A wide range of failures was observed:
  - Unexpected latencies and reduction of bandwidth
  - The network processor can hang and stop responding
  - A host system can crash/hang
  - A remote network interface can get affected
- Similar types of failures can be expected from other high-speed networks
- Such failures can greatly impact the reliability/availability of the system
Summary of Experiments

Failure Category               Count    % of Injections
Host Interface Hang              514       24.6
Messages Dropped/Corrupted       264       12.7
MCP Restart                       65        3.1
Host Computer Crash                9        0.43
Other Errors                      23        1.15
No Impact                       1205       57.9
Total                           2080      100

- More than 50% of the observed failures were host interface hangs
Design Considerations

- The faults must be detected and diagnosed as quickly as possible
- The network interface must be up and running as soon as possible
- The recovery process must ensure that no messages are lost or improperly received/sent
  - Complete correctness should be achieved
- The overhead on the normal running of the system must be minimal
- The fault tolerance should be made as transparent to the user as possible
Fault Detection

- Continuously polling the card can be very costly
- We use a spare interval timer to implement watchdog-timer functionality for fault detection
- We set the LANai to raise an interrupt when the timer expires
- A routine (L_timer) that the LANai is supposed to execute every so often resets this interval timer
- If the interface hangs, L_timer is not executed, causing our interval timer to expire and a FATAL interrupt to be raised
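The detection mechanism can be sketched as a toy model (illustrative only, not the actual GM/MCP code; the class and method names are invented here, and the 50 ms window is an assumption based on the reported detection latency):

```python
WATCHDOG_PERIOD_MS = 50  # assumed detection window

class Watchdog:
    """Toy model: L_timer rearms a spare interval timer; a hang stops
    the rearming, the timer expires, and a FATAL interrupt is raised."""
    def __init__(self, now=0):
        self.deadline = now + WATCHDOG_PERIOD_MS
        self.fatal = False

    def l_timer(self, now):
        # Heartbeat the healthy LANai executes "every so often".
        self.deadline = now + WATCHDOG_PERIOD_MS  # rearm the timer

    def tick(self, now):
        # Interval-timer hardware: FATAL if no rearm before the deadline.
        if now >= self.deadline:
            self.fatal = True

wd = Watchdog()
wd.tick(40)                  # healthy: deadline (50) not yet reached
healthy = wd.fatal           # still False
wd.l_timer(40)               # heartbeat rearms the timer (deadline 90)
wd.tick(120)                 # interface hung: L_timer never ran again
detected = wd.fatal          # True: the FATAL interrupt fires
```

The point of the design is that detection costs one timer rearm per heartbeat instead of continuous host-side polling.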
Fault Recovery Summary

- The FATAL interrupt signal is picked up by the fault recovery daemon on the host
- The failure is verified through numerous probing messages
- The control program is reloaded into the LANai SRAM
- Any process that was accessing the board prior to the failure is also restored to its original state
- Simply reloading the MCP will not ensure correctness
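The daemon's reaction to the FATAL interrupt can be outlined roughly as follows (a hypothetical sketch; the classes, method names, and probe semantics are invented for illustration and are not GM's API):

```python
class Card:
    """Toy stand-in for the Myrinet board."""
    def __init__(self):
        self.hung, self.mcp_loaded = True, False
    def respond_to_probes(self):
        return not self.hung          # a hung card ignores probes
    def reload_mcp(self):
        self.hung, self.mcp_loaded = False, True

class Proc:
    """Toy stand-in for a host process that was using the board."""
    def __init__(self, name):
        self.name, self.restored = name, False
    def restore(self, card):
        self.restored = card.mcp_loaded   # re-establish ports/tokens

def on_fatal_interrupt(card, processes):
    # Step 1: verify the failure with probing messages.
    if card.respond_to_probes():
        return "false alarm"
    # Step 2: reload the control program into LANai SRAM.
    card.reload_mcp()
    # Step 3: restore every process that was using the board.
    for p in processes:
        p.restore(card)
    return "recovered"

card, procs = Card(), [Proc("mpi_rank0"), Proc("mpi_rank1")]
result = on_fatal_interrupt(card, procs)
```

As the last slide bullet warns, step 2 alone is not enough; the following slides show why.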
Myrinet Programming Model

- Flow control is achieved through send and receive tokens
- Myrinet software (GM) provides reliable, in-order delivery of messages
  - A modified form of the "Go-Back-N" protocol is used
  - Sequence numbers for the protocol are provided by the MCP
  - One stream of sequence numbers exists per destination
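The reliable-delivery machinery can be sketched as a simplified Go-Back-N flavor (not GM's actual modified protocol; one sequence stream per destination is assumed, and all names are illustrative):

```python
class Sender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}          # seq -> payload, kept until acked

    def send(self, payload):
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += 1
        return seq, payload

    def ack(self, seq):
        self.unacked.pop(seq, None)

    def resend_all(self):          # "go back": resend every unacked msg
        return sorted(self.unacked.items())

class Receiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def recv(self, seq, payload):
        if seq == self.expected:   # in order: deliver it
            self.delivered.append(payload)
            self.expected += 1
        return seq                 # the ack carries the seq either way

s, r = Sender(), Receiver()
s.send("a"); s.send("b")
r.recv(0, "a")                     # "a" delivered, but its ack is lost
s.ack(r.recv(1, "b"))              # "b" delivered and acked
for seq, p in s.resend_all():      # sender times out, resends from 0
    s.ack(r.recv(seq, p))          # duplicate "a" filtered by seq check
```

The sequence-number check is what lets the receiver silently discard retransmissions; the next slides show what breaks when that state lives only on the card.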
Typical Control Flow

Sender:
1. User process prepares message
2. User process sets send token
3. LANai sdmas message
4. LANai sends message
5. LANai receives ACK
6. LANai sends event to process
7. User process handles notification event
8. User process reuses buffer

Receiver:
1. User process provides receive buffer
2. User process sets recv token
3. LANai recvs message
4. LANai sends ACK
5. LANai rdmas message
6. LANai sends event to process
7. User process handles notification event
8. User process reuses buffer
Duplicate Messages

Sender:
1. User process prepares message
2. User process sets send token
3. LANai sdmas message
4. LANai sends message
5. LANai goes down (the ACK is lost)
6. Driver reloads MCP into board
7. Driver resends all unacked messages
8. LANai sdmas message
9. LANai sends message

Receiver:
1. User process provides receive buffer
2. User process sets recv token
3. LANai recvs message
4. LANai sends ACK (lost)
5. LANai rdmas message
6. LANai sends event to process
7. User process handles notification event
8. User process reuses buffer
9. LANai recvs duplicate message — ERROR!

Lack of redundant state information is the cause of this problem.
Lost Messages

Sender:
1. User process prepares message
2. User process sets send token
3. LANai sdmas message
4. LANai sends message
5. LANai receives ACK
6. LANai sends event to process
7. User process handles notification event
8. User process reuses buffer

Receiver:
1. User process provides receive buffer
2. User process sets recv token
3. LANai recvs message
4. LANai sends ACK
5. LANai goes down (before the message is rdmaed to host memory)
6. Driver reloads MCP into board
7. Driver sets all recv tokens again
8. LANai waits for a message that will never be resent — ERROR!

An incorrect commit point is the cause of this problem.
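The commit-point issue can be illustrated with a toy model (assumed simplified semantics, not GM code): acking on receipt, before the DMA to host memory, makes the sender drop a message the host never saw; acking only after the DMA keeps the crash recoverable.

```python
def receiver(events, ack_after_dma):
    """Replay a sequence of card events and report (acked, in_host_memory)."""
    acked, in_host_memory = False, False
    for ev in events:
        if ev == "recv":
            if not ack_after_dma:
                acked = True       # early ack: premature commit point
        elif ev == "dma":
            in_host_memory = True
            if ack_after_dma:
                acked = True       # ack only once the host has the data
        elif ev == "crash":
            break                  # card goes down; later events never run
    return acked, in_host_memory

# Card crashes between receiving the message and DMAing it to the host:
early = receiver(["recv", "crash", "dma"], ack_after_dma=False)
late = receiver(["recv", "crash", "dma"], ack_after_dma=True)
# early: acked but never delivered -> the sender discards it (lost message)
# late: no ack was sent, so the sender will resend after recovery
```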
Fault Recovery

- We need to keep a copy of the state information
  - Checkpointing can be a big overhead
  - Logging critical message information is enough
- GM functions are modified so that:
  - A copy of the send and receive tokens is made with every send and receive call
  - The host processes provide the sequence numbers, one per (destination node, local port) pair
  - The copy of a send/receive token is removed when the send/receive completes successfully
- The MCP is modified so that:
  - An ACK is sent out only after a message is DMAed to host memory
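The host-provided sequence numbers can be sketched as follows (a toy model under assumed simplified semantics; the class name is invented). Because the counters live in host memory, one per (destination node, local port) pair, an MCP reload does not reset them, so retransmissions after recovery keep their original numbers and the receiver's duplicate filtering still works:

```python
class HostSeqState:
    """Per-process sequence state kept on the host: survives MCP reload."""
    def __init__(self):
        self.seq = {}               # (dest_node, local_port) -> next seq

    def next_seq(self, dest, port):
        n = self.seq.get((dest, port), 0)
        self.seq[(dest, port)] = n + 1
        return n

host = HostSeqState()
a = host.next_seq(3, 0)             # first message to (node 3, port 0)
b = host.next_seq(3, 0)
# --- LANai hangs; the MCP is reloaded; host memory is untouched ---
c = host.next_seq(3, 0)             # the stream continues uninterrupted
```

Had the MCP owned these counters, the reload would have restarted the stream at 0, recreating the duplicate-message scenario above.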
Performance Impact

- The scheme has been integrated successfully into GM
  - Over 1 man-year for complete implementation
- How much of the performance of the system has been compromised?
  - After all, one can't get a free lunch these days!
- Performance is measured using two key parameters:
  - Bandwidth obtained with large messages
  - Latency of small messages
Latency

[Figure: latency results]

Bandwidth

[Figure: bandwidth results]
Summary of Results

Performance Metric                  GM         FTGM
Bandwidth                           92.4 MB/s  92 MB/s
Latency                             11.5 µs    13.0 µs
Host-CPU utilization for send       0.3 µs     0.55 µs
Host-CPU utilization for receive    0.75 µs    1.15 µs
LANai-CPU utilization               6.0 µs     6.8 µs

Host Platform: Pentium III with 256 MB RAM, RedHat Linux 7.2
Summary of Results
Fault Detection Latency = 50 ms
Fault Recovery Latency = 0.765 s
Per-Process Latency = 0.50 s
Our Contributions

- We have devised smart ways to detect and recover from network interface failures
- Our fault detection technique for "network processor hangs" uses software-implemented watchdog timers
- Fault recovery time (including reloading of the network control program) is ~2 seconds
- Performance impact is under 1% for messages over 1 KB
- Complete user transparency was achieved