InfiniBand
• Was originally designed as a “system area network”:
connecting CPUs and I/O devices.
– A larger role: replacing all I/O standards for data centers:
PCI (backplane), Fibre Channel, and Ethernet: everything
connects through InfiniBand. Not long haul yet.
– A lesser role: low-latency, high-bandwidth, low-overhead
interconnect for commercial datacenters between servers and
storage.
• Can form local area or even wide area networks.
• Has become the de facto interconnect for high-performance
clusters (100+ systems in the Top 500
supercomputer list).
• Infiniband architecture
– Specification (Infiniband
architecture specification release 1.3,
March 3, 2015) available at
Infiniband Trade Association
(http://www.infinibandta.org)
• Infiniband architecture overview
– Components:
• Links, Channel adaptors, Switches, Routers
– The specification allows Infiniband wide area network,
but mostly adopted as a system/storage area network.
• Cabling specification?
– Topology:
• Irregular
• Regular: Fat tree, hypercube, etc
• Infiniband architecture overview
– Physical layer: Cabling standard?
– Link speed (signal rate):
• Single data rate (SDR): 2.5Gbps (1X), 10Gbps (4X), and 30Gbps
(12X).
• Double data rate (DDR): 5Gbps (1X), 20 Gbps (4X),
60Gbps(12X)
• Quad data rate (QDR): 10Gbps (1X), 40Gbps(4X), 120Gbps(12X)
• Fourteen data rate (FDR): 14Gbps(1X), 56Gbps(4X),
168Gbps(12X)
• Enhanced data rate (EDR): 25Gbps(1X), 100Gbps(4X),
300Gbps(12X)
– 8b/10b encoding in SDR, DDR, and QDR
• Maps each 8-bit symbol to a 10-bit symbol to achieve DC balance
(similar number of 0’s and 1’s in 20 bits, no more than
five 1’s or 0’s in a row, etc).
– 64b/66b encoding in FDR and EDR
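The signal rates and encodings above imply a lower usable data rate, since part of every transmitted bit stream is encoding overhead. The sketch below works that arithmetic out, assuming the rounded per-lane rates listed on this slide; the function and table names are illustrative only.

```python
from fractions import Fraction

# Per-lane signal rates in Gbps, as listed above (FDR and EDR are rounded).
SIGNAL_RATE_GBPS = {"SDR": Fraction(5, 2), "DDR": 5, "QDR": 10, "FDR": 14, "EDR": 25}

# Encoding efficiency: payload bits per transmitted bit.
ENCODING = {
    "SDR": Fraction(8, 10), "DDR": Fraction(8, 10), "QDR": Fraction(8, 10),
    "FDR": Fraction(64, 66), "EDR": Fraction(64, 66),
}

def data_rate_gbps(generation: str, lanes: int) -> float:
    """Usable data rate after removing line-encoding overhead."""
    return float(SIGNAL_RATE_GBPS[generation] * lanes * ENCODING[generation])

for gen in SIGNAL_RATE_GBPS:
    print(gen, [round(data_rate_gbps(gen, width), 2) for width in (1, 4, 12)])
```

For example, a 4X SDR link signals at 10 Gbps but carries 8 Gbps of data, because 8b/10b spends 2 of every 10 bits on encoding.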
Infiniband link speed
[Figure: InfiniBand roadmap, from the InfiniBand Trade Association,
http://www.infinibandta.org/content/pages.php?pg=technology_overview]
• Layer architecture: somewhat similar to TCP/IP
– Physical layer, somewhat simple
– Link layer
• Error detection (CRC checksum)
• Flow control (credit based)
• Switching, virtual lanes (VL)
• Forwarding table computed by subnet manager
– Not adaptive/adaptive
– Network layer: across subnets.
• No use for the cluster environment
– Transport layer
• Reliable/unreliable, connection/datagram
– Verbs: interface between adaptors and OS/Users
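The credit-based flow control listed under the link layer can be pictured with a toy model: the sender holds one credit per free receive buffer and may transmit only while it has credits, so the receiver is never overrun. This is a minimal sketch with hypothetical names; the real protocol's credit units and flow-control packets are not modeled.

```python
from collections import deque

class CreditLink:
    """Toy model of one virtual lane with credit-based flow control."""

    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # credits the sender currently holds
        self.rx_buffer = deque()          # packets waiting at the receiver

    def send(self, packet) -> bool:
        """Transmit only if a credit is available; never overrun the receiver."""
        if self.credits == 0:
            return False                  # sender must wait for a credit
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def drain(self):
        """Receiver consumes one packet and returns one credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet

link = CreditLink(receiver_buffers=2)
results = [link.send(p) for p in ("p0", "p1", "p2")]  # third send is refused
link.drain()                                          # frees one buffer
retry = link.send("p2")                               # a credit is available again
```

Because a packet is sent only when buffer space is known to exist, the link does not need to drop packets on congestion.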
• Link layer Packet format:
• Local Route Header (LRH): 8 bytes. Used for local routing by
switches within an IBA subnet
• Global Route Header (GRH): 40 Bytes. Used for routing
between subnets
• Base Transport header (BTH): 12 Bytes, for IBA transport
• Reliable datagram extended transport header (RDETH): 4
bytes, just for reliable datagram
• Datagram extended transport header (DETH): 8 bytes
• RDMA extended transport header (RETH): 16 bytes
• Atomic, ACK, and Atomic ACK extended transport headers
• Immediate DATA extended transport header: 4 bytes,
optimized for small packets.
• Invalidate
• Invariant CRC and variant CRC:
– The invariant CRC covers the fields that do not change in transit;
the variant CRC also covers the fields that can change at each hop.
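The header sizes above add up to a per-packet overhead that depends on which headers a packet carries. The sketch below sums them for two assumed header combinations; the combinations and the names `HEADER_BYTES` and `header_overhead` are illustrative, and the CRC fields are left out because their sizes are not given here.

```python
# Header sizes in bytes, taken from the list above.
HEADER_BYTES = {
    "LRH": 8, "GRH": 40, "BTH": 12, "RDETH": 4,
    "DETH": 8, "RETH": 16, "ImmDt": 4,
}

def header_overhead(headers):
    """Total bytes of the named headers (CRC fields are not counted)."""
    return sum(HEADER_BYTES[name] for name in headers)

# Assumed example: an RDMA request that stays inside one subnet (no GRH).
local_rdma = header_overhead(["LRH", "BTH", "RETH"])
# Assumed example: the same request routed between subnets (GRH added).
global_rdma = header_overhead(["LRH", "GRH", "BTH", "RETH"])
print(local_rdma, global_rdma)
```

The 40-byte GRH more than doubles the header overhead in this example, which is one reason it is only carried when it is needed.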
• Local Route Header:
– Switching based on the destination port address (LID)
– Multipath switching by allocating multiple LIDs to one
port
• Local Route Header:
– Switching based on the destination port address
(LID).
• Forwarding table entry: (LID, outgoing-port)
• Local Route Header:
– Multipath switching by allocating multiple
LIDs to one port, see the previous example.
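Destination-based switching with multiple LIDs per port can be sketched as a plain lookup table. The LID values and port numbers below are made up for illustration; the point is only that two LIDs naming the same end port may map to different output ports, giving two paths.

```python
# Hypothetical forwarding table of one switch: destination LID -> output port.
forwarding_table = {
    0x10: 1,   # host A, first LID
    0x11: 2,   # host A, second LID: same end port, different path
    0x20: 3,   # host B
}

def forward(dest_lid: int) -> int:
    """Return the output port for a packet, using only its destination LID."""
    return forwarding_table[dest_lid]

print(forward(0x10), forward(0x11))
```

The sender picks the path simply by choosing which of host A's LIDs to put in the LRH; the switch itself stays stateless.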
• GRH: uses the same address format as IPv6 (16-byte
addresses)
Subnet management
• Discover subnet topology and topology changes,
compute the paths, assign LIDs, distribute the
routes, configure devices
– Not well-defined in the specification
– Forwarding table must be computed such that all
devices in the network can be reached.
• References
• A. Bermudez, R. Casado, F.J. Quiles, T. M. Pinkston, J. Duato, “Evaluation of
a Subnet Management Mechanism for Infiniband Networks”, ICPP 2003.
• A. Vishnu, A. R. Mamidala, H. Jin, D. K. Panda, “Performance Modeling of
Subnet Management on Fat Tree Infiniband Networks using OpenSM”,
Workshop on System Management Tools on Large Scale Parallel Systems,
Held in Conjunction with IPDPS 2005
• InfiniBand devices and entities related to
subnet management
• Devices: Channel Adapters (CA), Host Channel
Adapters, switches, routers
• Subnet manager (SM): discovering, configuring,
activating and managing the subnet
• A subnet management agent (SMA) in every device
generates and responds to control packets (subnet
management packets, SMPs) and configures local
components for subnet management
• The SM exchanges control packets with the SMAs through the subnet
management interface (SMI).
• Subnet management packets (SMP)
– 256 bytes of data
– Use unreliable datagram service on the
management virtual lane (VL 15)
– Two routing schemes
• LID routed: uses the forwarding table lookup
– Used after the subnet is set up, e.g. to check the status of an
active port
• Direct routed: carries the output port
for each intermediate hop.
– Used for subnet discovery, before the subnet is set up
• Subnet management packets (SMP)
– Define the operation to be performed by SM
– Get: get the information about CA, switch, port
– Set: set the attribute of a port (e.g. LID)
– GetResp: get response
– Trap: inform SM about the state of a local node
• An SMA keeps resending a Trap message until it receives a
TrapRepress packet.
• Topology information can be obtained by a sweep
and by periodic Traps.
• Subnet Management phases:
– Topology discovery: sending direct routed SMP
to every port and processing the responses.
– Path computation: computing valid paths
between each pair of end nodes
– Path distribution phase: configuring the
forwarding table
• Subnet discovery
– The SM starts by sending a direct routed Get SMP to its
local node. Upon receiving the response, the SM sends SMPs
that reach progressively deeper into the subnet.
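The discovery sweep described above behaves like a breadth-first search in which every probe carries its own list of output ports. The sketch below is a toy model under stated assumptions: the wiring table is hypothetical, `get_node_info` stands in for a direct routed Get SMP, and end nodes are treated like switches for simplicity.

```python
from collections import deque

# Hypothetical physical wiring: node -> {local port number: neighbor node}.
WIRING = {
    "H1": {1: "S1"},
    "S1": {1: "H1", 2: "S2", 3: "H2"},
    "S2": {1: "S1", 2: "H3"},
    "H2": {1: "S1"},
    "H3": {1: "S2"},
}

def get_node_info(start, route):
    """Stand-in for a direct routed Get SMP: follow the listed output ports
    hop by hop and report the node reached and its port numbers."""
    node = start
    for port in route:
        node = WIRING[node][port]
    return node, sorted(WIRING[node])

def discover(sm_node):
    """Breadth-first sweep: probe the local node first, then one hop deeper."""
    routes = {}                      # node -> direct route from the SM
    queue = deque([[]])              # empty route = the SM's local node
    while queue:
        route = queue.popleft()
        node, ports = get_node_info(sm_node, route)
        if node in routes:
            continue                 # already discovered via a shorter route
        routes[node] = route
        for port in ports:
            queue.append(route + [port])
    return routes

routes = discover("H1")
print(routes)
```

No forwarding tables are consulted anywhere in the sweep, which is why direct routing works before the subnet has been configured.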
• Path computation:
– Compute paths between all pairs of nodes
– For irregular topology:
• Up/Down routing does not work directly
– It needs information about both the incoming interface and the
destination, but Infiniband forwards on the destination only
– Potential solution:
» find all possible paths
» remove all possible down link following up links in each
node
» find one output port for each destination
– Other solutions: destination renaming
– Fat tree topology:
• What is the best that can be achieved (optimal routing) is also
not clear.
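As a point of reference for the path computation phase, the sketch below builds destination-only forwarding tables using plain shortest-hop routing. It is not Up/Down routing and makes no deadlock-freedom claim; the topology, the LID assignment, and all names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical subnet: node -> {port: neighbor}. S* are switches, H* end nodes.
WIRING = {
    "H1": {1: "S1"}, "H2": {1: "S1"}, "H3": {1: "S2"},
    "S1": {1: "H1", 2: "S2", 3: "H2"},
    "S2": {1: "S1", 2: "H3"},
}
LIDS = {"H1": 1, "H2": 2, "H3": 3}   # assumed LID assignment for end nodes

def compute_tables():
    """Build per-switch tables {destination LID: output port} by running a
    breadth-first search outward from each destination (shortest hop count)."""
    tables = {n: {} for n in WIRING if n.startswith("S")}
    for dest, lid in LIDS.items():
        seen = {dest}
        queue = deque([dest])
        while queue:
            closer = queue.popleft()
            for node, ports in WIRING.items():
                if node in seen:
                    continue
                # Does `node` have a port wired to the already-reached node?
                port = next((p for p, nb in ports.items() if nb == closer), None)
                if port is None:
                    continue
                seen.add(node)
                if node in tables:
                    tables[node][lid] = port
                    queue.append(node)   # keep expanding through switches only
    return tables

tables = compute_tables()
print(tables)
```

Each switch ends up with exactly one output port per destination LID, which matches the "one output port for each destination" constraint noted above.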
• Path distribution:
– Ordering issue: the network may be in an
inconsistent state when partially updated, which
may result in deadlock during this period.
• Traditional solution: allow no data packets for a period of
time
• Alternative: deadlock-free reconfiguration schemes.
– How to do this correctly, effectively, and
incrementally is still open.
• Base transport header:
• Verbs
– OS/Users access the adaptor through verbs
– Communication mechanism: Queue Pair (QP)
• Users can queue up a set of instructions that the
hardware executes.
• A pair of queues in each QP: one for send, one for
receive.
• Users can post send requests to the send queue and
receive requests to the receive queue.
• Three types of send operations: SEND, RDMA(WRITE, READ, ATOMIC), MEMORY-BINDING
• One receive operation (matching SEND)
• Queue Pair:
– The status of the result of an operation
(send/receive) is stored in the completion queue.
– Send/receive queues can bind to different
completion queues.
• Related system level verbs:
– Open QP, create completion queue, open HCA,
open protection domain, register memory,
allocate memory window, etc
• User level verbs:
– post send/receive request, poll for completion.
• To communicate:
– Make system calls to set up everything (open
QP, bind QP to port, bind completion queues,
connect local QP to remote QP, register
memory, etc).
– Post send/receive requests.
– Check completion.
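The three steps above (set up, post requests, check completion) can be illustrated with a toy queue pair. This is a simulation only, not the real verbs API: the class, its methods, and `connect` are hypothetical names, and `process` stands in for work the adapter hardware would do.

```python
from collections import deque

class QueuePair:
    """Toy queue pair: a send queue, a receive queue, and a completion queue."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()
        self.peer = None

    def post_send(self, data: bytes):
        self.send_queue.append(data)

    def post_recv(self, buffer: bytearray):
        self.recv_queue.append(buffer)

    def process(self):
        """Stand-in for the adapter hardware: match each queued SEND with a
        receive request posted at the peer and record both completions."""
        while self.send_queue and self.peer.recv_queue:
            data = self.send_queue.popleft()
            buffer = self.peer.recv_queue.popleft()
            buffer[:len(data)] = data
            self.completion_queue.append(("send", len(data)))
            self.peer.completion_queue.append(("recv", len(data)))

    def poll(self):
        """Return one completion entry, or None if nothing has finished."""
        return self.completion_queue.popleft() if self.completion_queue else None

def connect(a: QueuePair, b: QueuePair):
    """Pair two queue pairs, as the setup step would do."""
    a.peer, b.peer = b, a

local, remote = QueuePair(), QueuePair()
connect(local, remote)
inbox = bytearray(16)
remote.post_recv(inbox)        # receiver posts a buffer first
local.post_send(b"hello")      # sender queues a SEND
local.process()                # "hardware" executes the queued work
```

Note that the receiver supplies the buffer before the data arrives, which reflects the point that buffer management is the user's responsibility.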
• InfiniBand has an almost perfect
software/network interface (Chien'94
paper):
– The network subsystem realizes all user level
functionality.
– User-level access to the network interface: a
few machine instructions accomplish the
transmission task without involving the OS.
– Network supports in-order delivery and
fault tolerance.
– Buffer management is pushed out to the user.
• Mellanox product brief: “SwitchX-2 Virtual
Protocol Interconnect Optimized for SDN”
– Virtual protocol interconnect
• Automatically senses Infiniband, Ethernet,
Fibre Channel, and data center bridging
• Flexible port configuration
– 36 IB FDR ports or 40/56GbE Ports
– 64 10GbE ports
– 24 2/4/8Gb FC ports
– SDN support
• Complete support for Openflow and Subnet
management
• Remotely configurable routing table, overlay, control
plane.
Some claims (InfiniBand advantages)
• Infiniband and Ethernet can carry each other’s
traffic (EoIB) and (IBoE), and both can carry
TCP/IP
• InfiniBand is in general faster
– 10G Ethernet vs. IB DDR (20G) and QDR (40G)
– 40G Ethernet vs. IB EDR (100G)
• InfiniBand is no longer hard to use
• InfiniBand is optimized for fat-tree
• InfiniBand still has more features than Ethernet
– Fault tolerance, multicast, etc