
Hot Interconnects 18
The PERCS
High-Performance Interconnect
Baba Arimilli: Chief Architect
Ram Rajamony, Scott Clark
© 2010 IBM Corporation
Outline
 HPCS Program Background and Goals
 PERCS Topology
 POWER7 Hub Chip
– Overview
– HFI and Packet Flows
– ISR and Routing
– CAU
– Chip and Module Metrics
 Summary
“This design represents a tremendous increase in the use of optics in systems,
and a disruptive transition from datacom- to computercom-style optical interconnect technologies.”
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the
views of the Defense Advanced Research Projects Agency.
HPCS Background and Goals
DARPA’s “High Productivity Computing Systems” Program
Goal: Provide a new generation of economically viable high productivity
computing systems for the national security and industrial user community
 Impact:
– Performance (time-to-solution): speedup by 10X to 40X
– Programmability (idea-to-first-solution): dramatically reduce cost & development time
– Portability (transparency): insulate software from system
– Robustness (reliability): continue operating in the presence of localized hardware
failure, contain the impact of software defects, & minimize likelihood of operator error
 Applications: Weather Prediction, Ocean/Wave Forecasting, Ship Design, Climate Modeling, Nuclear Stockpile Stewardship, Weapons Integration
PERCS – Productive, Easy-to-use, Reliable Computing System is
IBM’s response to DARPA’s HPCS Program
What The HPC Community Told Us They Needed
 Maximum Core Performance
– … to minimize number of cores needed for a given level of performance as well as
lessen the impact of sections of code with limited scalability
 Low Latency, High Bandwidth Communications Fabric
– … to maximize scalability of science and engineering applications
 Large, Low Latency, High Bandwidth Memory Subsystem
– … to enable the solution of memory-intensive problems
 Large Capacity, High Bandwidth I/O Subsystem
– … to enable the solution of data-intensive problems
 Reliable Operation
– … to enable long-running simulations
Design Goals
 High bisection bandwidth
 Low packet latency
 High interconnect bandwidth (even for packets < 50 bytes)
Design Goals
 High bisection bandwidth
– Small fanout from hub chip necessitates mesh/toroidal topologies
– Use very high fanout from hub chip to increase interconnect “reach”
 Low packet latency
– A large number of switching stages adds delay at every stage
– Use a topology with a very small number of hops
 High interconnect bandwidth (even for packets < 50 bytes)
– Architect switch pipeline to handle small packets
– Automatically (in hardware) aggregate and disaggregate small packets
End Result: The PERCS System Rack
(Rack photo callouts: bulk power regulators with universal power input, storage enclosure, Hub chip modules, POWER7 QCMs, and water conditioning units that accept standard building chilled water.)
PERCS Topology
Topology
 Numerous Topologies evaluated
 Converged on a Multi-level Direct Connect Topology to address the design goals of high bisection bandwidth and low latency
– Multiple levels
– All elements in each level of the hierarchy are fully connected with each other
 Examples: (figure showing example multi-level direct-connect topologies)
Design Trade-offs
 Narrowed options to 2-level and 3-level direct-connect topologies
 Link bandwidth ratios determined by the number of levels
– Let the levels be denoted Z, L, D
– 3 levels: longest direct path is ZLZ-D-ZLZ → need Z:L:D bandwidth ratios of 4:2:1
– 2 levels: longest direct path is L-D-L → need an L:D bandwidth ratio of 2:1
 Maximum system size determined by the number of links of each type (see the sketch below)
– 3 levels: maximum system size ~ Z^4 · L^2 · D^1 · (POWER7s per Hub chip)
– 2 levels: maximum system size ~ L^2 · D^1 · (POWER7s per Hub chip)
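These scaling relations are easy to sanity-check in a few lines. The sketch below mirrors the formulas on this slide; the example numbers (31 L-links, 16 D-links, 4 POWER7s per hub) are taken from the hub-chip slides later in the deck and are illustrative, not a statement of the deployed maximum.

```python
# Sketch of the system-size scaling relations above (illustrative inputs).

def max_size_3_level(z_links, l_links, d_links, p7_per_hub):
    # ~ Z^4 . L^2 . D^1 . POWER7s-per-Hub
    return (z_links ** 4) * (l_links ** 2) * d_links * p7_per_hub

def max_size_2_level(l_links, d_links, p7_per_hub):
    # ~ L^2 . D^1 . POWER7s-per-Hub
    return (l_links ** 2) * d_links * p7_per_hub

# 7 LL + 24 LR = 31 L-links and 16 D-links per hub, 4 POWER7s per hub:
print(max_size_2_level(l_links=31, d_links=16, p7_per_hub=4))  # ~61,504 POWER7 chips
```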
Design Trade-offs within the Topology
 Aggregate bi-directional link bandwidths (in GB/s)
– 3 levels, 1 POWER7/Hub chip: 4-2-8 (Z = 160, L = 80, D = 40)
– 3 levels, 2 POWER7s/Hub chip: 2-2-8 (Z = 320, L = 160, D = 80)
– 3 levels, 4 POWER7s/Hub chip: 4-1-8 (Z = 640, L = 320, D = 160)
– 2 levels, 1 POWER7/Hub chip: 64-32 (L = 80, D = 40)
– 2 levels, 2 POWER7s/Hub chip: 32-32 (L = 160, D = 80)
– 2 levels, 4 POWER7s/Hub chip: 16-32 (L = 320, D = 160) ← design point
– 2 levels, 4 POWER7s/Hub chip: 64-16 (L = 320, D = 160)
– 2 levels, 4 POWER7s/Hub chip: 4-64 (L = 320, D = 160)
– …
 Why the alternatives were rejected:
– Too many Hub chips → high cost
– Too much bandwidth per link → high power
– Too many links → low point-to-point bandwidth, high cost & power
PERCS POWER7 Hierarchical Structure
 POWER7 Chip
– 8 cores
 POWER7 QCM & Hub Chips
– QCM: 4 POWER7 chips
• 32-core SMP image
– Hub Chip: one per QCM
• Interconnects QCMs, Nodes, and Super Nodes
– Hub Module: Hub Chip with optics
 POWER7 HPC Node
– 2U node
– 8 QCMs, 8 Hub Chip Modules
• 256 cores
 POWER7 ‘Super Node’
– Multiple Nodes per ‘Super Node’
– Basic building block
 Full System
– Multiple ‘Super Nodes’
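A quick tally of the cores implied at each level of this hierarchy; the four-drawers-per-supernode figure comes from the SuperNode slide later in the deck, so treat the totals as illustrative.

```python
# Core counts implied by the hierarchy above.
CORES_PER_P7 = 8
P7_PER_QCM   = 4   # one QCM = 32-core SMP image
QCM_PER_NODE = 8   # 2U node: 8 QCMs + 8 hub modules
NODES_PER_SN = 4   # a supernode spans four drawers (see the SuperNode slide)

cores_per_qcm  = CORES_PER_P7 * P7_PER_QCM       # 32
cores_per_node = cores_per_qcm * QCM_PER_NODE    # 256
cores_per_sn   = cores_per_node * NODES_PER_SN   # 1024
print(cores_per_qcm, cores_per_node, cores_per_sn)
```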
Logical View of PERCS Interconnect
(Figure.) A drawer holds eight quad POWER7 modules, each paired with a Hub chip. Within each quad module, the four POWER7 chips are joined by the POWER7 Coherency Bus and an SMP router; the Hub chip attaches to that bus and contains the HFIs (Host Fabric Interfaces), the CAU (Collective Acceleration Unit), and the ISR (Integrated Switch/Router). Seven "local" L-links connect the Hub chips within a drawer, and 24 "remote" L-links connect Hub chips across the drawers of a supernode, giving full direct connectivity between all quad modules of a supernode via L-links. Sixteen D-links per Hub chip provide full direct connectivity between supernodes.
System-Level D-Link Cabling Topology
 The number of SuperNodes in the system dictates the D-link connection topology (see the sketch below):
– >256-SN topology: 1 D-link interconnecting each SN-SN pair
– 256-SN topology: 2 D-links interconnecting each SN-SN pair
– 128-SN topology: 4 D-links interconnecting each SN-SN pair
– 64-SN topology: 8 D-links interconnecting each SN-SN pair
– 32-SN topology: 16 D-links interconnecting each SN-SN pair
– 16-SN topology: 32 D-links interconnecting each SN-SN pair
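One simple allocation that reproduces this table is to split a supernode's pool of D-links evenly across the supernode count; a minimal sketch, assuming 32 hub chips per supernode with 16 D-links each (per the drawer and hub-chip slides):

```python
D_LINKS_PER_SUPERNODE = 32 * 16   # 32 hub chips x 16 D-links each = 512

def d_links_per_sn_pair(num_supernodes):
    # Even split of the D-link pool; matches the table above.
    return max(1, D_LINKS_PER_SUPERNODE // num_supernodes)

for n in (16, 32, 64, 128, 256, 512):
    print(n, d_links_per_sn_pair(n))   # 32, 16, 8, 4, 2, 1
```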
POWER7 Hub Chip
POWER7 Hub Chip Overview
 Extends POWER7 capability for high performance cluster optimized systems
 Replaces external switching and routing functions in prior networks
 Low diameter Two-tier Direct graph network topology is used to interconnect tens of thousands of
POWER7 8-core processor chips to dramatically improve bi-section bandwidth and latency
 Highly Integrated Design
– Integrated Switch/Router (ISR)
– Integrated HCA (HFI)
– Integrated MMU (NMMU)
– Integrated PCIe channel controllers
– Distributed function across the POWER7 and Hub chipset
– Chip and optical interconnect on module
• Enables maximum packaging density
 Hardware Acceleration of key functions
– Collective Acceleration
•No CPU overhead at each intermediate stage of the spanning tree
– Global Shared Memory
•No CPU overhead for remote atomic updates
•No CPU overhead at each intermediate stage for small packet disaggregation/aggregation
– Virtual RDMA
•No CPU overhead for address translation
(Figure: Hub chipset schematic showing four POWER7 chips, coherency bus control, two HFIs, the CAU, and the Integrated Switch Router.)
POWER7 Hub Chip Block Diagram
(Block diagram; aggregate bidirectional bandwidths.)
– POWER7 QCM connect: 4 POWER7 links, 3 Gb/s x 8B → 192 GB/s
– Inter-node connect: 7 L-Local links, 3 Gb/s x 8B → 336 GB/s (shared with the POWER7 Coherency Bus)
– Intra-SuperNode connect: 24 L-Remote links, 10 Gb/s x 6b → 240 GB/s
– Inter-SuperNode connect: 16 D links, 10 Gb/s x 12b → 320 GB/s
– PCIe connect: 3 PCIe 2.1 channels (x16, x16, x8), 5 Gb/s x 40b → 40 GB/s
– On-chip units: NMMU, two HFIs, CAU, ISR, and the POWER7, LL, LR, and D link controllers
1.128 TB/s of off-chip interconnect bandwidth
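The 1.128 TB/s total is simply the sum of the aggregate interface bandwidths in the diagram:

```python
# Aggregate bidirectional bandwidths from the block diagram, in GB/s.
off_chip_gbps = {
    "POWER7 QCM (4 W/X/Y/Z links)": 192,
    "L-Local (7 links)":            336,
    "L-Remote (24 links)":          240,
    "D (16 links)":                 320,
    "PCIe (3 channels)":             40,
}
print(sum(off_chip_gbps.values()))   # 1128 GB/s = 1.128 TB/s
```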
Host Fabric Interface (HFI) and Packet Flows
Host Fabric Interface (HFI) Features
 Non-coherent interface between the POWER7 QCM and the ISR
– Four ramps/ports from each HFI to the ISR
 Address translation provided by the NMMU
– HFI provides EA, LPID, Key, Protection Domain
– Multiple page sizes supported
 POWER7 cache-based sourcing to the HFI, injection from the HFI
– HFI can extract produced data directly from the processor cache
– HFI can inject incoming data directly into the processor L3 cache
 Communication controlled through “windows”
– Multiple windows supported per HFI
Host Fabric Interface (HFI) Features (cont’d)
 Supports three APIs
– Message Passing Interface (MPI)
– Global Shared Memory (GSM)
• Support for active messaging in HFI (and POWER7 Memory Controller)
– Internet Protocol (IP)
 Supports five primary packet formats
– Immediate Send
• ICSWX instruction for low latency
– FIFO Send/Receive
• One to sixteen cache lines moved from local send FIFO to remote receive FIFO
– IP
• IP to/from FIFO
• IP with Scatter/Gather Descriptors
– GSM/RDMA
• Hardware and software reliability modes
– Collective: Reduce, Multi-cast, Acknowledge, Retransmit
21
Host Fabric Interface (HFI) Features (cont’d)
 GSM/RDMA Packet Formats
– Full RDMA (memory to memory)
• Write, Read, Fence, Completion
• Large message sizes with multiple packets per message
– Half-RDMA (memory to/from receive/send FIFO)
• Write, Read, Completion
• Single packet per message
– Small-RDMA (FIFO to memory)
• Atomic updates
• ADD, AND, OR, XOR, and Cmp & Swap with and without Data Fetch
– Remote Atomic Update (FIFO to memory)
• Multiple independent remote atomic updates
• ADD, AND, OR, XOR
• Hardware guaranteed reliability mode
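The atomic operations listed for Small-RDMA and Remote Atomic Update are conventional read-modify-write primitives applied at the target; a hedged sketch of their semantics, purely illustrative and not the HFI implementation:

```python
# Target-side semantics of the listed atomic operations (illustrative only).
def atomic_rmw(memory, addr, op, operand, compare=None):
    old = memory[addr]
    if op == "ADD":
        memory[addr] = old + operand
    elif op == "AND":
        memory[addr] = old & operand
    elif op == "OR":
        memory[addr] = old | operand
    elif op == "XOR":
        memory[addr] = old ^ operand
    elif op == "CMP_SWAP":
        if old == compare:
            memory[addr] = operand
    return old   # returned to the initiator only when "Data Fetch" is requested

mem = {0x100: 5}
print(atomic_rmw(mem, 0x100, "ADD", 3), mem[0x100])   # 5 8
```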
HFI Window Structure
(Figure: windows 0…n, with window context detail.) Each window context contains the HFI command count, the send FIFO address, the receive FIFO address, the epoch vector address, a pointer to the segment table of the task using the window, a page table pointer for the partition (cf. SDR1), the job key, the process ID, and the LPAR ID. The send and receive FIFOs and the page/segment tables reside in real (physical) memory, and window state is managed cooperatively at the user, OS, and hypervisor levels.
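A hedged sketch of the per-window state described above; the field names and the grouping by privilege level are illustrative assumptions, not the hardware register names:

```python
from dataclasses import dataclass

@dataclass
class HfiWindowContext:
    """Illustrative per-window state (names and grouping are assumptions)."""
    # User-level
    command_count: int        # HFI command count (doorbell)
    # OS-level
    send_fifo_addr: int       # real address of this window's send FIFO
    recv_fifo_addr: int       # real address of this window's receive FIFO
    epoch_vector_addr: int
    segment_table_addr: int   # segment table of the task using the window
    # Hypervisor-level
    page_table_ptr: int       # page table for the partition (cf. SDR1)
    job_key: int
    process_id: int
    lpar_id: int
```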
End-to-End Packet Flow
(Figure.) Data moves from a source POWER7 chip's caches or memory across the POWER7 Coherency Bus and POWER7 links into the Hub chip's HFI, which injects packets into its ISR; the packets cross the ISR network to the destination Hub chip, whose HFI delivers them over the destination POWER7 Coherency Bus into the target chip's cache or memory.
HFI FIFO Send/Receive Packet Flow
 Bandwidth optimized
 Single or multiple packets per doorbell
 Packets processed in FIFO order
 Packet sizes up to 2 KB
(Flow, source node:) check space in the send FIFO → calculate the flit count → get send FIFO slots → build the base header and message header → copy data to the send FIFO (cache/memory) → ring the HFI doorbell; the HFI then dma_reads the packet and sends it through the ISR network.
(Flow, destination node:) the receiving HFI dma_writes the packet into the receive FIFO (cache/memory); once the packet is marked valid, software copies it from the receive FIFO to the user's buffer and a user callback consumes the data.
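The send side of this flow can be summarized in a few lines of pseudo-driver code; the FIFO and doorbell objects and their methods are invented for illustration and do not reflect the real HFI programming interface:

```python
FLIT_BYTES = 128    # ISR flit size (see the ISR slide)
MAX_PACKET = 2048   # 2 KB maximum packet size (16 cache lines)

def fifo_send(send_fifo, doorbell, headers, payload):
    """Sketch of the send-side FIFO flow (hypothetical objects and methods)."""
    packet = headers + payload
    assert len(packet) <= MAX_PACKET
    flits = -(-len(packet) // FLIT_BYTES)     # calculate flit count (ceiling)
    if send_fifo.free_slots() < flits:        # check space in the send FIFO
        raise BufferError("send FIFO full")
    slot = send_fifo.reserve(flits)           # get send FIFO slots
    send_fifo.write(slot, packet)             # copy headers + data into the FIFO
    doorbell.ring(slot, flits)                # ring the HFI doorbell; the HFI
                                              # dma_reads and injects the packet
```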
HFI Immediate Send/Receive Packet Flow
 Latency optimized
 Single cache line packet size
(Flow, source node:) check for available HFI buffers → build the base header and message header in the cache → execute the ICSWX instruction, which pushes the cache line directly to the HFI for injection into the ISR network.
(Flow, destination node:) the receiving HFI dma_writes the packet into the receive FIFO (cache/memory); once the packet is valid, software copies it from the receive FIFO to the user's buffer and a user callback consumes the data.
HFI GSM/RDMA Message Flow (RDMA Write)
 Single or multiple RDMA messages per doorbell
 Send-side HFI breaks a large message into multiple packets of up to 2 KB each
 Large RDMA messages are interleaved with smaller RDMA messages to prevent head-of-line blocking in the RDMA CMD FIFO
 Initiator notification packet traverses the network in the opposite direction from the data (not shown)
(Flow, source node:) check space in the RDMA CMD FIFO → build the RDMA base header in the RDMA CMD FIFO (cache/memory) → ring the HFI doorbell; the HFI dma_reads the RDMA header, dma_reads the RDMA payload from the task's EA space in memory, and sends the packets through the ISR network.
(Flow, destination node:) the receiving HFI dma_writes each packet into the task's EA space in memory; for the last packet it generates initiator and/or remote completion notifications to the receive FIFOs, and a user callback consumes the notification.
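The send-side packetization of a large RDMA write is worth spelling out: the HFI carves the message into packets of at most 2 KB. A minimal sketch; the descriptor format is invented for illustration:

```python
MAX_RDMA_PACKET = 2048   # up to 2 KB of payload per packet

def packetize_rdma_write(src_ea, dst_ea, length):
    """Yield (source EA, destination EA, bytes) for each packet of one RDMA write."""
    offset = 0
    while offset < length:
        chunk = min(MAX_RDMA_PACKET, length - offset)
        yield (src_ea + offset, dst_ea + offset, chunk)
        offset += chunk

# A 5 KB message becomes three packets: 2 KB, 2 KB, 1 KB.
print(list(packetize_rdma_write(0x1000, 0x8000, 5 * 1024)))
```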
HFI GSM Small RDMA (Atomic) Update Flow
 Single or multiple packets per doorbell
 Packets processed in FIFO order
 Single cache line packet size
 Initiator notification packet traverses the network in the opposite direction from the data (not shown)
(Flow, source node:) check space in the send FIFO → calculate the flit count → get send FIFO slots → build the base header and message header → copy data to the send FIFO (cache/memory) → ring the HFI doorbell; the HFI dma_reads the packet and sends it through the ISR network. A user callback later consumes the notification.
(Flow, destination node:) the receiving HFI dma_writes the packet, performs the atomic read-modify-write against the task's EA space in memory, and generates initiator and/or remote completion notifications to the receive FIFOs, with or without fetch data.
HFI GSM Remote Atomic Update Flow
 Single or multiple packets per doorbell
 Packets processed in FIFO order
 Single cache line packet size
 Single or multiple remote atomic updates per packet
 No notification packets
 Assumes hardware reliability
(Flow, source node:) check space in the send FIFO → calculate the flit count → get send FIFO slots → build the base header and message header → copy data to the send FIFO (cache/memory) → ring the HFI doorbell; the HFI dma_reads the packet, sends it through the ISR network, and updates the "packets sent" count register.
(Flow, destination node:) the receiving HFI dma_writes the packet, performs the atomic read-modify-write against the task's EA space in memory, and updates the "packets received" count register.
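Because this flow generates no notification packets, completion is inferred from the sent/received count registers in the diagram. A minimal accounting sketch; the class, names, and polling model are assumptions for illustration:

```python
class RemoteAtomicCounters:
    """Illustrative packets-sent / packets-received accounting (not the HW registers)."""
    def __init__(self):
        self.sent = 0       # bumped by the source HFI for each packet sent
        self.received = 0   # bumped by the destination HFI for each update applied

    def quiesced(self):
        # With hardware-guaranteed reliability, all issued updates have been
        # applied once the two counts match.
        return self.sent == self.received
```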
Integrated Switch Router (ISR) and Routing
Integrated Switch Router (ISR) Features
 Two-tier, full-graph network
 3.0 GHz internal 56x56 crossbar switch
– 8 HFI, 7 LL, 24 LR, 16 D, and SRV ports (summed in the sketch below)
 Input/output buffering
 Virtual channels for deadlock prevention
 2 KB maximum packet size
– 128B FLIT size
 Link reliability
– CRC-based link-level retry
– Lane steering for failed links
 IP multicast support
– Multicast route tables per ISR for replicating and forwarding multicast packets
 Global counter support
– ISR compensates for link latencies as counter information is propagated
– HW synchronization with Network Management setup and maintenance
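The 56x56 crossbar size follows directly from the port list above, counting the SRV port as one:

```python
# ISR crossbar ports: 8 HFI + 7 LL + 24 LR + 16 D + 1 SRV = 56.
isr_ports = {"HFI": 8, "LL": 7, "LR": 24, "D": 16, "SRV": 1}
print(sum(isr_ports.values()))   # 56
```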
Integrated Switch Router (ISR) - Routing
 Packet's view: distributed source routing
– The paths taken by packets are deterministic direct routes
– Packets are injected with the desired destination indicated in the header
– Partial routes are picked up in the ISR at each hop of the path
 Routing characteristics
– 3-hop L-D-L longest direct route
– 5-hop L-D-L-D-L longest indirect route
– Cut-through wormhole routing
– Full hardware routing using distributed route tables across the ISRs (see the sketch after this list)
• Source route tables for packets injected by the HFI
• Port route tables for packets at each hop in the network
• Separate tables for inter-supernode and intra-supernode routes
– FLITs of a packet arrive in order; packets of a message can arrive out of order
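A hedged sketch of the distributed-route-table idea referenced above: the packet carries only its destination (plus route mode), and each ISR's own table supplies the next hop. The table layout and lookup API here are invented for illustration:

```python
def route(dest, start_hub, route_tables, max_hops=5):
    """Follow per-ISR route tables hop by hop (illustrative, not the HW format).

    route_tables[hub][dest] -> (output_port, next_hub); the partial route is
    picked up at each hop, so no full path is ever stored in the packet.
    """
    hub, hops = start_hub, 0
    while hub != dest:
        _port, hub = route_tables[hub][dest]
        hops += 1
        assert hops <= max_hops   # 3 hops direct (L-D-L), 5 hops indirect
    return hops
```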
Integrated Switch Router (ISR) – Routing (cont’d)
 Routing modes
– Hardware single direct routing
– Hardware multiple direct routing
• For a less-than-full system, where more than one direct path exists
– Hardware indirect routing for data striping and failover
• Round-robin or random selection among indirect routes
– Software-controlled indirect routing through the hardware route tables
 Route Mode Selection
– Dynamic network information provided to upper layers of the stack to select route mode
– Decision on route can be made at any layer and percolated down
– Route mode is placed into the packet header when a packet is injected into the ISR
SuperNode
(Figure.) A POWER7 IH drawer contains 8 octants, each pairing a QCM (4-chip POWER7 processor module) with a Hub chip containing the Integrated Switch Router (ISR). Each Hub has 7 L-Local links to the other Hubs within the drawer, 24 L-Remote links to the Hubs in the other three drawers of the supernode, and 16 D-links to other supernodes.
Direct Routing
(Figure: SuperNode A to SuperNode B over links such as LR12, D3, LR21.)
 SN-SN direct routing: 2 L-hops + 1 D-hop = 3 hops total
 Maximum bisection bandwidth
 One to many direct paths, depending on system size
Indirect Routing
(Figure: SuperNode A to SuperNode B through an intermediate SuperNode x, over links such as LL7, D5, LR30, D12, LR21.)
 SN-SN indirect routing: 3 L-hops + 2 D-hops = 5 hops total
 Total indirect paths = # SNs – 2, one per candidate intermediate supernode (see the sketch below)
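The path count above is just the number of candidate intermediate supernodes:

```python
def indirect_sn_routes(num_supernodes):
    # Any supernode other than the source and the destination can serve as
    # the intermediate hop of an L-D-L-D-L indirect route.
    return num_supernodes - 2

print(indirect_sn_routes(512))   # 510 candidate indirect routes
```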
Integrated Switch Router (ISR) – Network Management
 Network Management software initializes and monitors the interconnect
– Central Network Manager (CNM) runs on the Executive Management Server (EMS)
– Local Network Management Controllers (LNMC) run on the service processors in each
drawer
 Cluster network configuration and verification tools for use at cluster installation
 Initializes the ISR links and sets up the route tables for data transmission
 Runtime monitoring of network hardware
– Adjusts data paths to circumvent faults and calls out network problems
 Collects state and performance data from the HFI & ISR on the hubs during runtime
 Configures and maintains Global Counter master
Collectives Acceleration Unit
Collectives Acceleration Unit (CAU) Features
 Operations
– Reduce: NOP, SUM, MIN, MAX, OR, AND, XOR
– Multicast
 Operand sizes and formats
– Single precision and double precision
– Signed and unsigned
– Fixed point and floating point
 Extended coverage with software aid
– Types: barrier, all-reduce
– Reduce ops: MIN_LOC, MAX_LOC, (floating point) PROD
 Tree topology (see the sketch after this list)
– Multiple-entry CAM per CAU: supports multiple independent trees
– Multiple neighbors per CAU: each neighbor can be either a local or a remote CAU/HFI
– Each tree has one and only one participating HFI window on any involved node
– It is up to software to set up the topology
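Software builds the tree, and each CAU combines the operands arriving from its neighbors before forwarding a single result toward the root. The sketch below models a reduce in software under that assumption; the tree representation and API are invented for illustration:

```python
import operator
from functools import reduce as fold

# Reduce operators from the list above (NOP omitted).
REDUCE_OPS = {"SUM": operator.add, "MIN": min, "MAX": max,
              "OR": operator.or_, "AND": operator.and_, "XOR": operator.xor}

def cau_reduce(tree, node, contributions, op="SUM"):
    """tree: node -> list of child nodes; contributions: node -> local operand."""
    combine = REDUCE_OPS[op]
    partials = [cau_reduce(tree, child, contributions, op)
                for child in tree.get(node, [])]
    return fold(combine, partials, contributions[node])

# E.g. Tree A from the next slide (nodes 0-3), rooted at node 0 for illustration.
print(cau_reduce({0: [1, 2, 3]}, 0, {0: 1, 1: 2, 2: 3, 3: 4}))   # 10
```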
Collectives Acceleration Unit (CAU) Features (cont’d)
 Sequence Numbers for Reliability and Pipelining
– Software driven retransmission protocol if credit return is delayed
– The previous collective is saved on CAU for retransmission
• Allows retransmit from point of failure vs. restart of entire operation
 Reproducibility
– Binary trees for reproducibility
– Wider trees for better performance
CAU: Example of Trees
(Figure: eight nodes, each with an HFI and a CAU, connected into four overlapping trees.)
– Tree A: nodes 0, 1, 2, 3
– Tree B: nodes 0, 5
– Tree C: nodes 4, 5, 6, 7
– Tree D: nodes 3, 7
CAU: Operations on a Tree
(Figure: two views of the same tree over nodes 4-7, one for broadcast and one for reduce.) In a broadcast, data flows from the root down through each CAU to the leaves; in a reduce, contributions flow from the leaves upward, with each CAU combining them on the way to the root.
Chip and Module Metrics
POWER7 Hub Chip
(Die photo: POWER7 PHYs, POWER7 Coherency Bus, PCIe PHYs, NMMU, HFIs, CAU, ISR, L-Local link PHYs, LR & D link PHYs.)
 45 nm lithography, Cu, SOI
– 13 levels of metal
– 440M transistors
 1.128 TB/s interconnect bandwidth
 582 mm² die
– 26.7 mm x 21.8 mm
– 3707 signal I/O
– 11,328 total I/O
 61 mm x 96 mm glass ceramic LGA module
– 56 12X optical modules
• LGA attach onto the substrate
Off-chip Interconnect and PLLs
 (4) 8B W, X, Y, Z interconnect buses to POWER7
– 3.0 Gb/s single-ended EI-3 → 192 GB/s throughput
 (7) 8B L-Local (LL) interconnect buses to Hub chips
– 3.0 Gb/s single-ended EI-3 → 336 GB/s throughput
• Shared physical transport for the Cluster Interconnect and the POWER7 Coherency Bus protocol
 (24) L-Remote LR[0..23] within-drawer Hub optical interconnect buses
– 6b @ 10 Gb/s differential, 8b/10b encoded → 240 GB/s throughput
 (16) D[0..15] between-drawer Hub optical interconnect buses
– 10b @ 10 Gb/s differential, 8b/10b encoded → 320 GB/s throughput
 PCI general-purpose I/O → 40 GB/s throughput
 1.128 TB/s total Hub I/O bandwidth
 24 total PLLs
– (3) "Core" PLLs: (1) internal logic, (1) W/X/Y/Z EI-3 buses, (1) LL0–LL6 EI-3 buses
– (2) intermediate-frequency "IF LC Tank" PLLs: (1) optical LR buses, (1) optical D buses
– (14) high-frequency "HF LC Tank" PLLs: (6) optical LR buses, (8) optical D buses
– (5) PCI-E "Combo PHY" PLLs
Hub Module Overview
(Photos: module assembly; optical modules & fiber ribbons; D-link optical module sites, POWER7 Hub chip, mLGA interposer, and LR-link optical module sites on the substrate.)
– Technology: High Performance Glass Ceramic LGA
– Body size: 61 mm x 95.5 mm
– LGA grid: depopulated 58 x 89
– Layer count: 90
– Module BSM I/O: 5139
Summary
Key System Benefits Enabled by the PERCS Interconnect
 A PetaScale System with Global Shared Memory
 Dramatic improvement in Performance and Sustained Performance
– Scale Out Application Performance (Sockets, MPI, GSM)
– Ultra-low latency, enormous bandwidth and very low CPU utilization
 Elimination of the Traditional Infrastructure
– HPC network: no HCAs or external switches; 50% fewer PHYs and cables than an equivalent fat-tree structure with the same bisection bandwidth
– Storage: no FCS HBAs, external switches, storage controllers, or DASD within the compute node
– I/O: no external PCI-Express controllers
 Dramatic cost reduction
– Reduce the overall Bill of Material (BOM) costs in the System
 A step function improvement in Data Center reliability
– Compared to commodity clusters with external storage controllers, routers/switches, etc.
 Full virtualization of all hardware in the data center
 Robust end-to-end systems management
 Dramatic reduction in Data Center power compared to commodity clusters
Acknowledgements
 Authors
– Baba Arimilli, Ravi Arimilli, Robert Blackmore, Vicente Chung, Scott Clark, Wolfgang Denzel, Ben Drerup, Torsten Hoefler, Jody Joyner, Jerry Lewis, Jian Li, Nan Ni, Ram Rajamony, Aruna Ramanan, and Hanhong Xue
This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the
views of the Defense Advanced Research Projects Agency.