Case Study: Infiniband

• Infiniband architecture
– Specification (Infiniband architecture specification release 1.2, Oct. 5, 2004)
available at the Infiniband Trade Association (http://www.infinibandta.org)
• Potential improvements
• Infiniband architecture overview
– Components:
• Links
• Channel adaptors
• Switches
• Routers
– The specification allows Infiniband to be used as a wide area network,
but it is mostly adopted as a system/storage area network.
– Topology:
• Irregular
• Regular: Fat tree
– Link speed:
• 2.5Gbps (1X), 10Gbps (4X), and 30Gbps (12X).
• Layers: somewhat similar to TCP/IP
– Physical layer
– Link layer
• Error detection (CRC checksum)
• Flow control (credit based)
• Switching, virtual lanes (VL)
• Forwarding table computed by the subnet manager
– Not adaptive
– Network layer: across subnets.
• Of no use in the cluster environment
– Transport layer
• Reliable/unreliable, connection/datagram
– Verbs: interface between adaptors and OS/Users
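The credit-based flow control listed under the link layer can be illustrated with a toy model. This is a minimal sketch, not the IBA credit protocol itself: the class name `CreditLink`, the buffer count, and the one-credit-per-packet accounting are assumptions made for illustration.

```python
from collections import deque


class CreditLink:
    """Toy model of credit-based flow control on one virtual lane.

    Assumption: one credit corresponds to one free receive buffer slot.
    """

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers   # credits currently held by the sender
        self.rx_buffer = deque()          # packets waiting at the receiver

    def send(self, packet):
        """Transmit only if a credit is available; packets are never dropped."""
        if self.credits == 0:
            return False                  # sender must stall (back-pressure)
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_drain(self):
        """Receiver forwards one packet and returns one credit to the sender."""
        if not self.rx_buffer:
            return None
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditLink(receiver_buffers=2)
results = [link.send(p) for p in ("p0", "p1", "p2")]  # third send has no credit
drained = link.receiver_drain()                        # frees one buffer slot
retry = link.send("p2")                                # a credit is available again
```

The point of the sketch is that the sender stalls instead of overrunning the receiver, so the link layer does not lose packets to buffer overflow.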
• Packet format:
• Local Route Header (LRH): 8 bytes. Used for local routing by
switches within an IBA subnet
• Global Route Header (GRH): 40 bytes. Used for routing
between subnets
• Base Transport Header (BTH): 12 bytes, for IBA transport
• Reliable datagram extended transport header (RDETH): 4
bytes, just for reliable datagram
• Datagram extended transport header (DETH): 8 bytes
• RDMA extended transport header (RETH): 16 bytes
• Atomic, ACK, and Atomic ACK extended transport headers
• Immediate data extended transport header: 4 bytes,
optimized for small packets.
• Invalidate extended transport header
• Invariant CRC and variant CRC:
– CRC over the fields that do not change in transit and over
the fields that do, respectively.
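To get a feel for what these header sizes cost, the following sketch adds them up for a few packet types. The header sizes come from the list above; the CRC sizes (4-byte invariant CRC, 2-byte variant CRC) are not stated in the slides and are an assumption here, as are the header combinations chosen for each packet type and the 256-byte payload.

```python
# Header sizes in bytes, taken from the packet format list above.
HEADER_BYTES = {
    "LRH": 8, "GRH": 40, "BTH": 12,
    "RDETH": 4, "DETH": 8, "RETH": 16, "IMM": 4,
}

# Assumption (not stated in the slides): invariant CRC 4 bytes, variant CRC 2 bytes.
CRC_BYTES = 4 + 2


def overhead(headers, payload):
    """Return (non-payload bytes, payload fraction of the whole packet)."""
    extra = sum(HEADER_BYTES[h] for h in headers) + CRC_BYTES
    return extra, payload / (extra + payload)


# Illustrative header combinations (assumed, for a 256-byte payload):
local_send = overhead(["LRH", "BTH"], payload=256)            # send within a subnet
rdma_write = overhead(["LRH", "BTH", "RETH"], payload=256)    # first RDMA write packet
global_dgram = overhead(["LRH", "GRH", "BTH", "DETH"], payload=256)  # datagram across subnets
```

Under these assumptions a local send carries 26 bytes of overhead, while crossing subnets adds the 40-byte GRH on top, which is one reason the network layer is of little interest inside a single-subnet cluster.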
• Local Route Header:
– Switching based on the destination port address (LID)
– Multipath switching by allocating multiple LIDs to one
port
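A small sketch of how multiple LIDs per port give multipath switching: the switch still does a plain table lookup on the destination LID, and the sender selects a path by choosing which of the destination's LIDs to use. The LID values, port numbers, and function names below are hypothetical, and the assumption that the destination owns a block of consecutive LIDs is made only for illustration.

```python
# Hypothetical forwarding table of one switch: destination LID -> output port.
# The destination end port is assumed to own LIDs 16..19, and the subnet
# manager is assumed to have spread them over two uplinks (ports 1 and 2).
forwarding_table = {16: 1, 17: 2, 18: 1, 19: 2}


def output_port(dlid):
    """Deterministic lookup: the switch only inspects the destination LID."""
    return forwarding_table[dlid]


def pick_dlid(base_lid, num_lids, flow_id):
    """Sender-side path selection: choose one of the destination's LIDs."""
    return base_lid + (flow_id % num_lids)


# Four flows to the same destination port alternate between the two uplinks.
paths = [output_port(pick_dlid(16, 4, flow)) for flow in range(4)]
```

The switch itself stays non-adaptive; all path diversity comes from how the table is filled and which LID the source picks.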
• GRH: same format as IPv6 (16-byte addresses)
• Base transport header:
• Verbs
– OS/Users access the adaptor through verbs
– Communication mechanism: Queue Pair (QP)
• Support the four types of services, including reliable
connection service
• Each connection takes one QP on each end.
• Each QP has a send queue and a receive queue.
• Users can post send requests to the send queue and
receive requests to the receive queue.
• Three types of send operations: SEND, RDMA(WRITE, READ, ATOMIC), MEMORY-BINDING
• One receive operation (matching SEND)
• Queue Pair:
– The status of the result of an operation
(send/receive) is stored in the completion queue.
– Send/receive queues can bind to different
completion queues.
• Related system level verbs:
– Open QP, create completion queue, open HCA,
open protection domain, register memory,
allocate memory window, etc.
• User level verbs:
– post send/receive request, poll for completion.
• To communicate:
– Make system calls to set up everything (open QP, bind
QP to port, bind completion queues, connect local QP to
remote QP, register memory, etc).
– Post send/receive requests.
– Check completion.
– What if a packet arrives before a receive request is
posted?
• Not specified in the standard
• The right response should be a ‘receiver not ready (RNR)’
error. The sender is back-pressured in this case.
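The post/check-completion cycle and the receiver-not-ready case can be sketched with a toy queue pair. This is not the verbs API: the class `QueuePair`, its methods, and the `progress` step that stands in for the adaptor are all hypothetical names, and the RNR handling (keep the send request queued and report "RNR") is an assumption about one reasonable behaviour.

```python
from collections import deque


class QueuePair:
    """Toy queue pair: send queue, receive queue, and one completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()
        self.peer = None

    def post_send(self, data):
        self.send_queue.append(data)

    def post_recv(self, buffer):
        self.recv_queue.append(buffer)

    def progress(self):
        """Stand-in for the adaptor: try to deliver the oldest send request."""
        if not self.send_queue:
            return "idle"
        if not self.peer.recv_queue:
            return "RNR"                   # no receive posted: sender is held back
        data = self.send_queue.popleft()
        buffer = self.peer.recv_queue.popleft()
        buffer[:len(data)] = data          # data lands in the user's buffer
        self.completion_queue.append(("send", "OK"))
        self.peer.completion_queue.append(("recv", "OK"))
        return "OK"

    def poll(self):
        """User-level completion check; returns None if nothing completed."""
        return self.completion_queue.popleft() if self.completion_queue else None


def connect(a, b):
    a.peer, b.peer = b, a


a, b = QueuePair(), QueuePair()
connect(a, b)
a.post_send(b"hello")
first = a.progress()        # receiver has not posted a receive yet
rx = bytearray(8)
b.post_recv(rx)
second = a.progress()       # now the send matches the posted receive
```

Because the receive buffer is supplied by the user before the data arrives, the data can be placed directly into user memory, which is the sense in which buffer management is pushed out to the user.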
• Infiniband has a perfect software interface
(Chien'94 paper):
– The network subsystem realizes all user level
functionality.
– User-level access to the network interface: a
few machine instructions accomplish the
transmission task without involving the OS.
– Network supports in-order delivery and
fault tolerance.
– Buffer management is pushed out to the user.
• SilverStorm 9024:
– 24 ports 4X (10Gbps) or 8 ports 12X (30Gbps)
– switch type: cut-through
– switch latency: < 140ns
– switch bandwidth: 480Gbps
– forwarding table size: 48K
– VL support: 8 + 1 management
• SilverStorm 9240:
– 24 expansion slots, each expansion module with 12
4X ports or 4 12X ports (24 x 12 = 288, a 288 by
288 switch)
– switch type: cut-through
– switch latency: < 140ns to < 420ns
– switch bandwidth: 5.76Tbps
– forwarding table size: 48K
– VL support: 8 + 1 management
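The quoted switch bandwidths can be reproduced from the port counts and link speeds. The sketch below assumes that the aggregate figure counts both directions of every port; that factor of two is an inference from the numbers, not something stated in the slides, and the function names are illustrative.

```python
LANE_GBPS = 2.5  # 1X link speed from the link speed list


def link_gbps(width):
    """Signalling rate of a link of the given width (1, 4, or 12)."""
    return LANE_GBPS * width


def switch_gbps(ports, width, directions=2):
    """Aggregate bandwidth, assuming both directions of each port are counted."""
    return ports * link_gbps(width) * directions


s9024_4x = switch_gbps(24, 4)        # 24 ports 4X
s9024_12x = switch_gbps(8, 12)       # 8 ports 12X
s9240_4x = switch_gbps(24 * 12, 4)   # 288 ports 4X
```

With this assumption both 9024 configurations give 480 Gbps and the 9240 gives 5760 Gbps, i.e. 5.76 Tbps, matching the figures above.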
• Potential improvements on Infiniband using
compiled communication
– Improving the internal Infiniband fabric:
• Offline routing for a static pattern (static SM for a
reduced traffic pattern) can be beneficial for
irregular networks.
• Simplify the layer architecture with a direct
link model (for known patterns): the header can be
simplified, although this may not matter much since
the Infiniband layers are thin.
• Simplify the protection mechanism.
• Circuit-switched Infiniband.
• Reliable communication protocol is still needed.
• Potential benefits can be evaluated by simulation.
• Improving the messaging software (software to
hardware interface): no chance.
• Improving the MPI implementation over
Infiniband: similar to our current work on Ethernet
– Message scheduling for collective/point-to-point
communications based on the network topology.
– Exploiting NIC features (buffers in NIC, multicast)
– Reducing the number of instructions in a library routine
makes sense. Compiled communication can be used to
optimize the MPI library.
– Compiled communication can help improve the
library implementation (e.g. reducing the number of
message copies, early request posting, using RDMA,
etc).
• One particular project:
– Design algorithms for the Infiniband subnet manager
– Improve routing performance of the Infiniband subnet
manager (SM).
• Objective: minimize the maximum channel load for a given
traffic pattern
• Optimize according to a given pattern: the traffic pattern in an
application is usually not all-to-all
– Default routing used in IBA SM
• For a sparse traffic pattern, the maximum channel load can
usually be minimized using the minimum interference
principle.
– Need to extend minimum interference routing to load-balanced,
deadlock-free routing.
– The best way to realize the IBA SM is still unclear
at this time; we can probably do something
here.
• Irregular network or Fat tree network
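The routing objective (minimize the maximum channel load for a given traffic pattern) can be made concrete with a short evaluation routine. This is only a sketch of the metric, not a routing algorithm and not what any IBA SM does: the two-level topology (leaf switches `s0`, `s1`, spine switches `t0`, `t1`), the host names, and the two candidate routings are hypothetical.

```python
from collections import Counter


def max_channel_load(routes):
    """routes maps (src, dst) -> list of nodes on the path.

    Each consecutive node pair is one directed channel; the load of a
    channel is the number of flows routed over it.
    """
    load = Counter()
    for path in routes.values():
        for channel in zip(path, path[1:]):
            load[channel] += 1
    return max(load.values())


# Sparse pattern: a -> c and b -> d, hosts a, b on leaf s0 and c, d on leaf s1.
routes_shared = {
    ("a", "c"): ["a", "s0", "t0", "s1", "c"],
    ("b", "d"): ["b", "s0", "t0", "s1", "d"],   # both flows use spine t0
}
routes_spread = {
    ("a", "c"): ["a", "s0", "t0", "s1", "c"],
    ("b", "d"): ["b", "s0", "t1", "s1", "d"],   # second flow avoids t0
}

shared_max = max_channel_load(routes_shared)
spread_max = max_channel_load(routes_spread)
```

For this pattern the routing that keeps the two flows from interfering halves the maximum channel load, which is the effect a pattern-aware subnet manager would aim for.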