Computing in the RAIN: A Reliable Array of Independent Nodes



Computing in the RAIN: A Reliable Array of Independent Nodes
Group A3
Ka Hou Wong
Jahanzeb Faizan
Jonathan Sippel
Introduction
Presenter: Ka Hou Wong
Introduction

RAIN
  Research collaboration between Caltech and Jet Propulsion Laboratory

Goal
  Identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components
Hardware Platform

Heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces through a network of switches

[Figure: computers C0-C9 connected through switches S0-S3; C = Computer, S = Switch]
Software Platform

Collection of software modules that run in conjunction with operating system services and standard network protocols

[Figure: software stack - Application; MPI/PVM; RAIN; TCP/IP; network connections (Ethernet, Myrinet, ATM, Servernet)]
Key Building Blocks for Distributed Computer Systems

Communication
  Fault-tolerant communication topologies
  Reliable communication protocols

Fault Management
  Group membership techniques

Storage
  Distributed data storage schemes based on error-control codes
Features of RAIN

Communication
  Provides fault tolerance in the network via the following mechanisms:
    Bundled interfaces
    Link monitoring
    Fault-tolerant interconnect topologies
Features of RAIN (cont’d)

Group membership
  Identifies healthy nodes that are participating in the cluster

Data storage
  Uses redundant storage schemes over multiple disks for fault tolerance
Communication
Presenter: Jahanzeb Faizan
Communication


Fault-tolerant interconnect topologies
Network interfaces
Fault-tolerant Interconnect Topologies

Goal
  To connect computer nodes to a network of switches in order to maximize the network's resistance to partitioning

[Figure: compute nodes C attached to a ring of switches S]

How do you connect n nodes to a ring of n switches?
Naïve Approach

Connect the computer nodes to the nearest switches in a regular fashion

1-fault-tolerant: the network is easily partitioned with two switch failures

[Figure: ring of switches with each compute node attached to its nearest switches; two switch failures split the ring]
Diameter Construction Approach

Connect computer nodes to the switching network in the most non-local way possible
  Computer nodes are connected to maximally distant switches
  Nodes of degree 2 connected between switches should form a diameter
Diameter Construction Approach (cont'd)

Construction (Diameters). Let ds = 4 and dc = 2. For all i, 0 <= i < n, label the compute nodes ci and the switches si. Connect switch si to s(i+1) mod n, i.e., in a ring. Connect node ci to switches si and s(i + floor(n/2) + 1) mod n.

Can tolerate 3 faults of any kind without partitioning the network

[Figure: example diameter constructions for n = 7 and n = 8]
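A minimal sketch of the construction in Python (an assumed reading of the slide above; the function name and data representation are illustrative, not part of the RAIN software):

def diameter_construction(n):
    """n compute nodes attached to a ring of n switches in the most non-local way."""
    # Ring of switches: s_i connected to s_(i+1) mod n
    switch_ring = [(i, (i + 1) % n) for i in range(n)]
    # Node c_i attached to switches s_i and s_(i + n//2 + 1) mod n
    node_links = {i: (i, (i + n // 2 + 1) % n) for i in range(n)}
    return switch_ring, node_links

ring, links = diameter_construction(8)
print(links[0])   # node c0 attaches to switches s0 and s5 when n = 8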
Protocol for Link Failure

Goal
  Monitoring of available paths

Requirements
  Correctness
  Bounded Slack
  Stability
Correctness

Must correctly reflect the true state of the channel

[Figure: bidirectional communication between nodes A and B]

If one side sees time-outs, both sides should mark the channel as being down
Bounded Slack

Ensure that both endpoints have a maximum slack of n transitions

[Figure: link histories seen at nodes A and B over time (U = link up, D = link down); in one history node A sees many more transitions than node B, in the other nodes A and B see tightly coupled views of the channel]
Stability

Each real channel event (i.e., a time-out) should cause at most some bounded number of state transitions at each endpoint
Consistent-History Protocol for
Link Failures


Monitor available paths in the network for proper functioning

Modified ping protocol guarantees each side of the communication channel sees the same history (bounded slack)
The Protocol


Reliable message passing

Implementation:
  Sliding window protocol
  Existing reliable communication layer not needed
  Reliable messaging built on top of ping messages
The Protocol (cont’d)
Sending and receiving of the token using reliable messaging
  Tokens are sent on request
  Consistent history maintained

Sending and receiving of ping messages using unreliable messaging
  Detects when the link is up or down
  Implemented by pings or hardware feedback
Demonstration
[Figure: protocol state machine starting in the Down state; states are labeled Up or Down with a token count t, and edges are labeled trigger event / tokens sent]

t: token count
T: token arrival event
tout: time-out event
Edge labels: trigger event / token sent
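As a rough illustration only (not the authors' state machine or code), one endpoint of the consistent-history protocol can be sketched as an object that marks the channel down on a time-out and spends a token before recording an up transition, so neither side runs ahead of the other by more than the bounded slack:

class LinkEndpoint:
    """Illustrative endpoint of the consistent-history link monitor."""

    def __init__(self, tokens=1):
        self.up = False       # current view of the channel
        self.tokens = tokens  # tokens in hand; a transition consumes one

    def on_timeout(self):
        """Ping time-out: record a Down transition and grant a token to the peer."""
        self.up = False
        return 1              # tokens sent to the other side

    def on_token(self, granted, ping_ok):
        """Token arrival: record an Up transition only if a token is available."""
        self.tokens += granted
        if ping_ok and not self.up and self.tokens > 0:
            self.tokens -= 1
            self.up = True
            return 1          # let the peer follow with its own transition
        return 0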
Group Membership
Presenter: Jonathan Sippel
Group Membership



Provides a level of agreement between non-faulty processes in a distributed application

Tolerates permanent and transient failures in both nodes and links

Based on two mechanisms:
  Token Mechanism
  911 Mechanism
Token Mechanism




Nodes in the membership are ordered in a logical ring

Token passed at a regular interval from one node to the next

Token carries the authoritative knowledge of the membership

Node updates its local membership information according to the received token
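A toy sketch of the token mechanism (class and field names are assumptions for illustration, not the RAIN API): the token carries the authoritative membership and a sequence number, and every node adopts the token's view before forwarding it around the ring:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Token:
    members: List[str]   # authoritative membership, in ring order
    seq: int = 0         # incremented on every hop (used later by the 911 mechanism)

@dataclass
class Node:
    name: str
    local_members: List[str] = field(default_factory=list)
    last_seq: int = -1   # sequence number of this node's local copy of the token

    def on_token(self, token: Token) -> Token:
        # Adopt the token's membership as the authoritative view and keep a copy.
        self.local_members = list(token.members)
        self.last_seq = token.seq
        token.seq += 1   # incremented every time the token is passed on
        return token     # forwarded to the next live node in the logical ring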
Token Mechanism (cont’d)

Aggressive Failure Detection
[Figure: token ring of nodes A, B, C, D, before and after aggressive failure detection]
Token Mechanism (cont’d)

Conservative Failure Detection
[Figure: token ring of nodes A, B, C, D, before and after conservative failure detection]
911 Mechanism

When is the 911 Mechanism used?
  Token Regeneration - regenerate a token that is lost if a node or a link fails
  Dynamic Scalability - add a new node to the system

What is a 911 message?
  Request for the right to regenerate the lost token
  Must be approved by all the live nodes in the membership
Token Regeneration





Only one node is allowed to regenerate the token

Token sequence number is used to guarantee mutual exclusivity and is incremented every time the token is passed from one node to the next

Each node makes a local copy of the token on receipt

Sequence number on the node's local copy of the token is added to the 911 message and compared to the sequence numbers on the local copies of the token on the other live nodes

911 request is denied by any node with a more recent copy of the token
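The mutual-exclusion rule can be illustrated with a small sketch (function names are assumptions): the 911 request carries the requester's token sequence number, any node holding a more recent local copy denies it, and regeneration requires unanimous approval:

def approve_911(local_seq: int, request_seq: int) -> bool:
    """A live node approves the request only if its own token copy is no newer."""
    return request_seq >= local_seq

def may_regenerate(replies: list) -> bool:
    """The requester regenerates the token only with approval from every live node."""
    return all(replies)

# Example: a requester whose copy has seq 41 asks nodes holding copies with seq 40 and 42.
replies = [approve_911(40, 41), approve_911(42, 41)]
print(may_regenerate(replies))   # False - some node has a more recent token copy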
Dynamic Scalability


911 message sent by a new node to join the group

Receiving node
  Treats the message as a join request because the originating node is not in the membership
  Updates the membership the next time it receives the token and sends it to the new node
Data Storage

The RAIN system provides a distributed storage system based on a class of erasure-correcting codes called array codes, which provide a mathematical means of representing data so that lost information can be recovered
Data Storage (cont’d)

Array codes
  With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data
  With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost
  A code is said to be Maximum Distance Separable (MDS) if m = n - k
  The only operations necessary to encode/decode an array code are simple binary XOR operations
Data Storage (cont’d)
  Data:    a         b         c         d         e         f
  Data:    A         B         C         D         E         F
  Parity:  B+D+e+f   C+E+f+a   D+F+a+b   E+A+b+c   F+B+c+d   A+C+d+e

Data Placement Scheme for a (6, 4) Array Code (each column is stored on a separate node)
Data Storage (cont’d)
  Data:    ?         ?         c         d         e         f
  Data:    ?         ?         C         D         E         F
  Parity:  ?         ?         D+F+a+b   E+A+b+c   F+B+c+d   A+C+d+e

Data Placement Scheme for a (6, 4) Array Code with the first two nodes erased

A = C + d + e + (A + C + d + e)
b = A + (E + A + b + c) + c + E
a = b + (D + F + a + b) + D + F
B = F + c + (F + B + c + d) + d
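The placement and the recovery steps above can be checked with plain XOR arithmetic. The sketch below uses single-byte stand-ins for the twelve data symbols (an illustration of the slide's example, not the RAIN storage code):

from functools import reduce
from operator import xor

def parity(*symbols):
    return reduce(xor, symbols)

# Twelve data symbols a..f and A..F, here arbitrary byte values.
a, b, c, d, e, f = 1, 2, 3, 4, 5, 6
A, B, C, D, E, F = 11, 12, 13, 14, 15, 16

# Placement from the slide: each column (node) holds two data symbols and one parity.
nodes = [
    (a, A, parity(B, D, e, f)),
    (b, B, parity(C, E, f, a)),
    (c, C, parity(D, F, a, b)),
    (d, D, parity(E, A, b, c)),
    (e, E, parity(F, B, c, d)),
    (f, F, parity(A, C, d, e)),
]

# Erase the first two nodes and recover a, A, b, B from the remaining four,
# following the XOR steps shown on the slide.
(c_, C_, P3), (d_, D_, P4), (e_, E_, P5), (f_, F_, P6) = nodes[2:]

A_rec = parity(C_, d_, e_, P6)     # A = C + d + e + (A + C + d + e)
b_rec = parity(A_rec, P4, c_, E_)  # b = A + (E + A + b + c) + c + E
a_rec = parity(b_rec, P3, D_, F_)  # a = b + (D + F + a + b) + D + F
B_rec = parity(F_, c_, d_, P5)     # B = F + c + (F + B + c + d) + d

assert (a_rec, A_rec, b_rec, B_rec) == (a, A, b, B)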
Data Storage (cont’d)

Distributed store/retrieve operations
  For a store operation, a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code
  For a retrieve operation, symbols are collected from any k nodes and decoded
  The original data can be recovered with up to n - k node failures
  The encoding scheme provides for dynamic reconfigurability and load balancing
RAIN Contributions to Distributed
Computing Systems



Fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures

Fault management techniques based on group membership

Data storage schemes based on computationally efficient error-control codes
References


Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, and Jehoshua Bruck, "Computing in the RAIN: A Reliable Array of Independent Nodes," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 2, February 2001.

http://www.rainfinity.com/