NaNet: A Custom NIC for Low-Latency, Real-Time GPU Stream Processing
Alessandro Lonardo, Andrea Biagioni
INFN Roma 1
GPU L0 RICH TRIGGER
■ Very promising results in increasing the selection efficiency for interesting events by integrating GPUs into the central L0 trigger processor, exploiting their computing power to implement more complex trigger primitives.
Requirements for the RICH detector L0 trigger:
 Throughput: event data rate of 600 MB/s
 latL0-GPU = latproc + latcomm < 1 ms
[Diagram: RO board → L0 GPU → L0TP data path; 10 MHz event rate in, 1 MHz L0 output rate, max 1 ms latency]
■ Processing is not an issue. For RICH single-ring fitting, processing of a 1000-event buffer on a Kepler GTX 680 gives:
 Throughput of 2.6 GB/s
 Latency of 60 μs
(data from “Real-Time Use of GPUs in NA62 Experiment”, CERN-PH-EP-2012-260)
The real challenge is to implement a RO board – L0 GPU link with:
 Sustained bandwidth > 600 MB/s (RO board output on GbE links)
 Small and stable latency
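As a rough budget implied by these numbers (taking latproc ≈ 60 μs from the GTX 680 measurement above):

```latex
\mathrm{lat_{comm}} < 1\,\mathrm{ms} - \mathrm{lat_{proc}}
                    \approx 1000\,\mu\mathrm{s} - 60\,\mu\mathrm{s}
                    = 940\,\mu\mathrm{s}
```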
Processing Latency
 Processing (on buffers of > 1000 events) takes 50 ns per event.
 latproc is quite stable (< 200 μs), once data is available to be processed!
 Consolidated results on C1060; Fermi and Kepler are far better.
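For reference, the per-event figure is consistent with the buffer-level numbers quoted on the previous slide:

```latex
\mathrm{lat_{proc}}(1000\ \mathrm{events}) \approx 1000 \times 50\,\mathrm{ns} = 50\,\mu\mathrm{s}
```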
Communication Latency
latcomm: time needed to copy event data from the L0-GPU receiving GbE MAC to GPU memory.
Standard NIC data flow:
1. The NIC receives incoming packets; data are written to a CPU memory buffer (kernel-driver network stack protocol handling).
2. The CPU writes the data to GPU memory (application-issued cudaMemcpyHostToDevice).
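A minimal sketch of this standard two-hop path, in plain C with the CUDA runtime; buffer and event sizes are illustrative assumptions, not the actual NA62 readout code:

```c
/* Sketch of the standard (non-NaNet) receive path described above.
 * Sizes and names are illustrative. */
#include <cuda_runtime.h>
#include <sys/socket.h>
#include <stdlib.h>

#define EVT_BUF_SIZE (1000 * 64)   /* e.g. 1000 events, assumed ~64 B each */

void receive_and_stage(int sock, void *d_events /* device buffer */)
{
    char *h_buf = malloc(EVT_BUF_SIZE);

    /* Step 1: the kernel network stack delivers the UDP payload into host memory. */
    recv(sock, h_buf, EVT_BUF_SIZE, 0);

    /* Step 2: the application copies the staged events into GPU memory. */
    cudaMemcpy(d_events, h_buf, EVT_BUF_SIZE, cudaMemcpyHostToDevice);

    free(h_buf);
}
```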
[Flow chart: Start → copy data from CPU to GPU → processing time → copy results from GPU to CPU → End; 1000 evts per packet]
Communication Latency
1) Host-to-GPU memory data transfer latency for a 1000-event buffer: O(100) μs.
2) Time spent in the Linux kernel network stack protocol handling for 64 B packet data transfers: O(10) μs.
Both are affected by significant fluctuations due to OS jitter.
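A minimal sketch of how such a host-to-GPU transfer can be timed with CUDA events; this is a generic measurement pattern, not the actual benchmark code behind the figures above:

```c
/* Time a host-to-device copy with CUDA events (result in microseconds). */
#include <cuda_runtime.h>
#include <stdio.h>

void time_h2d(void *d_dst, const void *h_src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */
    printf("H2D copy of %zu B: %.1f us\n", bytes, ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```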
NaNet Solution
Problem: lower the communication latency and its fluctuations.
How?
1) Offload the CPU from network stack protocol management.
2) Inject data from the NIC directly into the GPU(s) memory.
NaNet solution:
Re-use the APEnet+ FPGA-based NIC, which already implements (2), adding a network stack protocol management offload engine (UDP Offload Engine) to the logic.
APEnet+
3D Torus Network:
■ Scalable (today up to 32K nodes)
■ Direct network: no external switches.
APEnet+ Card:
■ FPGA based (ALTERA EP4SGX290)
■ PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
■ Single I/O slot width, 4 torus links, 2-D torus (secondary piggy-back double slot width, 6 links, 3-D torus topology)
■ Fully bidirectional torus links, 34 Gbps
■ Industry-standard QSFP
■ A DDR3 SODIMM bank
APEnet+ Logic:
■ Torus Link
■ Router
■ Network Interface
 NIOS II 32-bit microcontroller
 RDMA engine
 GPU I/O accelerator
■ PCIe x8 Gen2 core
APEnet+ GPUDirect P2P Support
[Diagram: APEnet+ data flow]
■ P2P between Nvidia Fermi and APEnet+:
 First non-Nvidia device supporting it!!!
 Joint development with Nvidia.
 APEnet+ board acts as a peer.
■ No bounce buffers on host. APEnet+ can target GPU memory with no CPU involvement.
■ GPUDirect allows direct data exchange on the PCIe bus.
■ Real zero copy, inter-node GPU-to-host, host-to-GPU and GPU-to-GPU.
■ Latency reduction for small messages.
Overview of the NaNet Implementation
[Block diagram: NaNet Network Interface – 32-bit microcontroller, UDP offload, NaNet Ctrl, 1Gb Eth port, GPU I/O accelerator, memory controller, on-board memory, TX/RX block, PCIe x8 Gen2 core (8 lanes @ 5 Gbps)]
 Stripped-down APEnet+ logic, (logically) eliminating the torus and router blocks
 UDP offloading engine
 HAL-based microcontroller firmware (essentially used for configuration only)
 Implemented on the Altera Stratix IV development system
APEnet+
3D Torus Network:
■ Scalable (today up to 32K nodes)
■ Cost effective: no external switches.
APEnet+ Card:
■ FPGA based (ALTERA EP4SGX290)
■ PCI Express x16 slot, signaling capabilities for up to dual x8 Gen2 (peak 4+4 GB/s)
■ Single I/O slot width, 4 torus links, 2-D torus topology; secondary piggy-back card, resulting in a double slot width, 6 links, 3-D torus topology.
■ Fully bidirectional torus links, 34 Gbps aggregated raw bandwidth (408 Gbps total switching capacity…)
■ Industry-standard QSFP+ (Quad Small Form-factor Pluggable) for high-density applications on copper as well as on optical media (4×10 Gbps lanes per interface)
■ A DDR3 SODIMM bank
APEnet+ Core: DNP
APEnet based on the DNP:
■ RDMA: zero-copy RX & TX!
■ Small latency and high bandwidth
■ GPU cluster features (APEnet+):
 RDMA support for GPUs! No buffer copies between GPU and host.
 Very good GPU-to-GPU latency (direct GPU interface, Nvidia P2P).
■ SystemC models, VHDL (synthesizable) code, AMBA Interface (SHAPES), PCI Express Interface (APEnet+).
■ Implementation on FPGA and “almost” tape-out on ASIC.
Network Interface:
 TX: gathers data coming in from the PCIe port, fragmenting the data stream into packets forwarded to the relevant destination port.
 RX: RDMA (Remote Direct Memory Access) capabilities, PUT and GET, are implemented at the firmware level.
 Microcontroller (NIOS II): simplifies the DNP core HW and the host-side driver. It manages the RDMA LUT allocated in the On Board Memory (a sketch follows below):
• Add/delete entries in case of buffer register/unregister operations.
• Retrieve the entry to satisfy buffer-info requests for the incoming DNP PUT/GET operands.
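A hedged C sketch of such a buffer look-up table; the field names, table size and linear search are assumptions for illustration, not the actual NIOS II firmware:

```c
/* Illustrative RDMA look-up table: register/unregister a buffer and resolve
 * the entry covering the target address of an incoming PUT/GET. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t virt_addr;   /* registered buffer virtual address */
    uint64_t bus_addr;    /* address usable by the DMA engines */
    uint32_t size;        /* buffer length in bytes            */
    uint8_t  on_gpu;      /* host memory or GPU memory         */
    uint8_t  valid;
} rdma_lut_entry_t;

#define LUT_ENTRIES 256
static rdma_lut_entry_t lut[LUT_ENTRIES];   /* assumed to live in on-board memory */

int lut_register(uint64_t virt, uint64_t bus, uint32_t size, uint8_t on_gpu)
{
    for (int i = 0; i < LUT_ENTRIES; i++) {
        if (!lut[i].valid) {
            lut[i] = (rdma_lut_entry_t){ virt, bus, size, on_gpu, 1 };
            return i;                       /* entry added on buffer registration */
        }
    }
    return -1;                              /* table full */
}

void lut_unregister(int idx) { lut[idx].valid = 0; }   /* buffer unregister */

rdma_lut_entry_t *lut_lookup(uint64_t virt)            /* incoming PUT/GET target */
{
    for (int i = 0; i < LUT_ENTRIES; i++)
        if (lut[i].valid && virt >= lut[i].virt_addr &&
            virt < lut[i].virt_addr + lut[i].size)
            return &lut[i];
    return NULL;
}
```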
The DNP HW block structure is split into: Torus Links, Router, and Network Interface.
[Block diagram: X+/X-, Y+/Y-, Z+/Z- torus links; Router (routing logic, 7x7-port switch, arbiter); Network Interface (TX/RX block, 32-bit microcontroller, collective comm block, GPU I/O accelerator, memory controller, on-board memory, 1Gb Eth port); PCIe x8 Gen2 core, 8 lanes @ 5 Gbps]
APEnet+ Core: DNP (2)
■ ROUTER:
 Dimension-order routing policy to implement communications on the switch’s ports (a routing sketch follows below).
 The router allocates and grants the proper path.
 The arbiter manages conflicts between packets.
■ MULTIPLE TORUS LINKS:
 Packet-based direct network, 2-D/3-D torus topology.
 Bidirectional Ser/Des with 8b10b encoder for DC balance, de-skewing technology and CDR.
 Encapsulates the APEnet+ packets into a light, low-level, word-stuffing protocol. Fixed-size header/footer envelope.
 Error detection via EDAC/CRC at packet level.
 Virtual Channels and Flow-Control Logic to guarantee deadlock-free transmission and enhance fault tolerance.
[Block diagram: same DNP block structure as on the previous slide]
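For illustration, a hedged C sketch of dimension-order routing on a 3D torus, choosing the shorter way around each ring; port naming and data layout are assumptions, not the actual APEnet+ router implementation:

```c
/* Dimension-order routing: resolve X first, then Y, then Z. */
enum port { PORT_XP, PORT_XM, PORT_YP, PORT_YM, PORT_ZP, PORT_ZM, PORT_LOCAL };

typedef struct { int x, y, z; } coord_t;

static enum port step(int here, int dest, int ring, enum port plus, enum port minus)
{
    int fwd = (dest - here + ring) % ring;      /* hops going in the "plus" direction */
    return (fwd <= ring - fwd) ? plus : minus;  /* take the shorter arc of the ring   */
}

enum port dor_next_port(coord_t here, coord_t dest, coord_t size /* torus sizes */)
{
    if (here.x != dest.x) return step(here.x, dest.x, size.x, PORT_XP, PORT_XM);
    if (here.y != dest.y) return step(here.y, dest.y, size.y, PORT_YP, PORT_YM);
    if (here.z != dest.z) return step(here.z, dest.z, size.z, PORT_ZP, PORT_ZM);
    return PORT_LOCAL;   /* same node: deliver to the local network interface */
}
```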
LATENCY BENCHMARK
■ One-way point-to-point test involving two nodes (a pseudocode sketch follows below):
 Receiver node tasks:
• Allocates a buffer on either host or GPU memory.
• Registers it for RDMA.
• Sends its address to the transmitter node.
• Starts a loop waiting for N buffer-received events.
• Ends by sending back an acknowledgement packet.
 Transmitter node tasks:
• Waits for an initialization packet containing the receiver node buffer (virtual) memory address.
• Writes that buffer N times in a loop with RDMA PUT.
• Waits for a final ACK packet.
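The test structure, as a C pseudostructure; every apelink_* prototype below is a hypothetical placeholder, since the real APEnet RDMA API names and signatures are not shown in these slides:

```c
#include <stddef.h>

/* Hypothetical placeholders standing in for the APEnet+ RDMA API. */
void *apelink_alloc(size_t len, int on_gpu);
void  apelink_register_buffer(void *buf, size_t len);
void  apelink_send_addr(int node, void *buf);
void  apelink_wait_recv_event(void);
void  apelink_send_ack(int node);
void *apelink_wait_init_packet(void);
void  apelink_rdma_put(int node, void *dst, const void *src, size_t len);
void  apelink_wait_ack(void);

void receiver(int tx_node, int on_gpu, int n, size_t len)
{
    void *buf = apelink_alloc(len, on_gpu);   /* host or GPU memory      */
    apelink_register_buffer(buf, len);        /* register it for RDMA    */
    apelink_send_addr(tx_node, buf);          /* publish its address     */
    for (int i = 0; i < n; i++)
        apelink_wait_recv_event();            /* one event per RDMA PUT  */
    apelink_send_ack(tx_node);                /* closing acknowledgement */
}

void transmitter(int rx_node, const void *payload, int n, size_t len)
{
    void *remote = apelink_wait_init_packet();        /* receiver buffer address */
    for (int i = 0; i < n; i++)
        apelink_rdma_put(rx_node, remote, payload, len);
    apelink_wait_ack();                               /* final ACK */
}
```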
■ No small-message optimizations such as copying data into temporary buffers:
 Reduced pipelining capability of the APEnet+ HW.
 No large difference in performance with respect to the round-trip test.
■ ~7-8 μs on the GPU-GPU test!
■ ~2x for GPU TX !! still …
Latency benchmark: P2P effects
■ No P2P = cudaMemcpyD2H/H2D() on host bounce buffers
■ Buffers pinned with cuMemHostRegister
■ cuMemcpy() costs ~8-10 μs
■ MVAPICH2 tested on the same test system*
■ 2 × cuMemcpy()
* http://mvapich.cse.ohio-state.edu/performance/mvapich2/inter_gpu.shtml
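A minimal sketch of the no-P2P staging path and the pinning step mentioned above, using the CUDA runtime counterparts of the driver-API call named on the slide; sizes are illustrative:

```c
/* Host bounce-buffer path: pin the buffer, then pay one copy per hop. */
#include <cuda_runtime.h>
#include <stdlib.h>

void stage_without_p2p(void *d_dst, size_t bytes)
{
    void *h_bounce = malloc(bytes);

    /* Page-lock the bounce buffer (runtime counterpart of cuMemHostRegister). */
    cudaHostRegister(h_bounce, bytes, cudaHostRegisterDefault);

    /* ... NIC / previous stage fills h_bounce ... */

    /* Each hop through the host costs one such copy (~8-10 us per the slide). */
    cudaMemcpy(d_dst, h_bounce, bytes, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_bounce);
    free(h_bounce);
}
```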
APEnet+ VS rest of the World
■ Below 32 KB: P2P wins
■ 32 KB – 128 KB: P2P shows limits
■ Over 128 KB: pure bandwidth wins
Bandwidth Benchmark
■ Preliminary results on Fermi:
■ Curves exhibit a plateau at large message sizes:
 Host RX ~ 1.3 GB/s
 GPU RX ~ 1.1 GB/s
 To improve: accelerate the buffer search performed by the μC
■ GPU TX curves:
 P2P read protocol overhead
NaNet
NaNet is based on APEnet+:
■ It maintains all the features of APEnet+.
■ Different card: ALTERA DevKit (with a smaller device, Stratix IV EP4SGX230).
■ Router and Torus Links are on board but they are not used at the moment.
■ New feature: UDP offload and NaNet Controller.
[Block diagram: DNP with X/Y/Z torus links, Router (routing logic, 7x7-port switch, arbiter), Network Interface (32-bit microcontroller, UDP offload, NaNet Ctrl, GPU I/O accelerator, memory controller, on-board memory, TX/RX block, 1Gb Eth port), PCIe x8 Gen2 core, 8 lanes @ 5 Gbps]
UDP OFFLOAD – NaNet Ctrl
■ NiosII UDP Offload:
 Open IP.
 Collection of HW components that can be programmed by the Nios II to selectively redirect UDP packets arriving over the Altera TSE MAC into a HW processing path.
 Porting to Altera Stratix IV EP4SGX230 (the project was based on the Stratix II 2SGX90).
 clk @ 200 MHz (instead of 35 MHz).
 The output of the UDP Offload is the PRBS packet, which contains only the number of bytes indicated in the UDP header for its payload.
■ NaNet CTRL:
 Implements an Avalon streaming sink interface to collect data coming from the source interface of the UDP Offload.
 Encapsulates the UDP payload into APEnet+ packets (layout sketched below):
• 1 header + 1 footer (128-bit word each)
• Payload (128-bit words, max size 4 KB = 256 words)
[Diagrams: NaNet Network Interface block diagram; UDP Offload / NaNet Ctrl data path: 1Gb Eth port → TSE MAC → (SNK) UDP Offload (SRC) → SIZE + PAYLOAD → NaNet CTRL → HEADER + PAYLOAD + FOOTER]
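A hedged sketch of this packet framing as a C structure; the type names are illustrative, and the internal layout of the header and footer words is not given on the slide, so they are left opaque:

```c
/* One 128-bit header, up to 256 128-bit payload words (4 KB), one 128-bit footer. */
#include <stdint.h>

typedef struct { uint64_t w[2]; } word128_t;     /* one 128-bit word      */

#define APENET_MAX_PAYLOAD_WORDS 256             /* 256 * 16 B = 4 KB max */

typedef struct {
    word128_t header;                                /* 1 header word  */
    word128_t payload[APENET_MAX_PAYLOAD_WORDS];     /* up to 4 KB     */
    word128_t footer;                                /* 1 footer word  */
} apenet_packet_t;
```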
TEST
■ Benchmarking Platform:
 1U server with two multi-core INTEL CPUs, equipped with an APEnet+ card.
 1U, S2075 NVIDIA system packing 4 Fermi-class GPUs (~4 TFlops).
■ UDP offload and NaNet CTRL test:
 The host generates a data stream of 10^5 32-bit words (packet size is 4 KB).
 The packets follow the standard path.
 The Nios II reads the packets and checks whether the data correspond to those sent by the host (a sketch of this check follows below).
■ Integration of UDP offload and NaNet CTRL in the Network Interface completed:
 Debugging stage.
 Latency measurements.
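A hedged C sketch of the word-by-word check; expected_word() is a hypothetical stand-in for the host's pattern generator:

```c
/* Compare each received 32-bit word against the word the host is known to have sent. */
#include <stdint.h>

#define STREAM_WORDS 100000u              /* 10^5 32-bit words          */
#define PKT_WORDS    (4096u / 4u)         /* 4 KB packets -> 1024 words */

extern uint32_t expected_word(uint32_t idx);   /* hypothetical host pattern */

int check_packet(const uint32_t *pkt, uint32_t first_idx)
{
    for (uint32_t i = 0; i < PKT_WORDS && first_idx + i < STREAM_WORDS; i++)
        if (pkt[i] != expected_word(first_idx + i))
            return -1;                    /* mismatch: flag the error        */
    return 0;                             /* packet matches the host stream  */
}
```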
[Testbench diagram: HOST → ETH → TSE MAC → UDP Offload (SIZE, PAYLOAD) → NaNet CTRL (HEADER, PAYLOAD, FOOTER) → NIOS]
THANK YOU
Alessandro Lonardo, Roberto Ammendola, Andrea Biagioni, Pier Stanislao Paolucci, Davide Rossetti, Ottorino Frezza, Francesca Lo Cicero, Piero Vicini, Francesco Simula, Laura Tosoratto
BACK-UP SLIDE
QUonG: GPU + 3D NETWORK
The EURETILE HPC platform is based on the QUonG development.
QUantum chromodynamics ON Gpu (QUonG) is a comprehensive initiative aiming at providing a hybrid, GPU-accelerated x86_64 cluster with a 3D toroidal mesh topology, able to scale up to 10^4/10^5 nodes, with bandwidths and latencies balanced for the requirements of modern LQCD codes.
■ Heterogeneous cluster: PC mesh accelerated with high-end GPUs and interconnected via a 3-D torus network
■ Tight integration between accelerators (GPU) and custom/reconfigurable network (DNP on FPGA), allowing latency reduction and computing-efficiency gains
■ Communicating over an optimized custom interconnect (APEnet+) with a standard software stack (MPI, OpenMP, …)
■ Optionally an augmented programming model (cuOS)
■ Community of researchers sharing codes and expertise (LQCD, GWA, Biocomputing, laser-plasma interaction)
■ GPUs by Nvidia:
 solid HW and good SW
 Collaboration with the Nvidia US development team to “integrate” the GPU with our network
EURETILE PLATFORM
■ 2 parallel and synergic development lines:
 High Performance Computing (HPC).
 Virtual Emulation Platform.
■ Based on common and unifying key elements:
 Benchmarks.
 Common software tool-chain.
 Simulation framework.
 Fault-tolerant, brain-inspired network (i.e. DNP) interfaced to custom ASIPs and/or commodity computing accelerators.
■ Scientific High Performance Computing Platform, leveraging the INFN QUonG project:
 Intel CPUs, networked through an interconnected mesh composed of PCIe boards hosting the DNP integrated in FPGA.
 Software-programmable accelerators in the form of ASIPs (developed using TARGET’s ASIP design tool-suite) integrated in the FPGA. INFN will also explore the addition of GPGPUs.
■ High-abstraction-level simulation platform (RWTH-AACHEN):
 Based on RISC models provided by RWTH-AACHEN and TLM models of the INFN DNP.
APEnet+ board production and test
■ 4 APEnet+ boards produced during 2011.
■ 15 APEnet+ boards in 2Q/12 and 10 more to complete the QUonG rack for 4Q/12.
■ Preliminary technical tests performed by the manufacturer.
■ Deeper functional tests:
 Clock generators
• Fixed-frequency oscillators measured with a digital oscilloscope.
• Programmable clock (Si570) firmware has been validated.
 JTAG chain
• Stratix IV and MAX II (EPM2210). 64 MB Flash memory. Master controller EPM240 CPLD.
• Windows OK; complete functionality on Linux obtained by bypassing the EPM240 firmware.
 PCIe
• Altera Hard IP + PLDA IP.
• PLDA test-bench adapted and implemented. Successful.
 Memory
• SODIMM DDR3. FPGA acts as memory controller.
• NIOS + Qsys environment (read and write). Still in progress.
 Ethernet
• 2 Ethernet RJ45 connectors (1 main board + 1 daughter board).
• NIOS + Qsys environment. Still in progress.
 Remote links
• 6 links (4 main board + 2 daughter board).
• Transceiver Toolkit by Altera (bit error rate with random pattern) to find the best parameters (400 MHz / 32 Gbps main board – 350 MHz / 28 Gbps daughter board).
Benchmarking Platform
■ 3 slightly different servers:
 SuperMicro motherboards.
 CentOS 5.6/5.7/5.8 x86_64.
 Dual Xeon 56xx.
 12 GB – 24 GB DDR3 memory.
 Nvidia C2050/M2070 on x16 Gen2 slots.
■ Preliminary benchmarks:
 Coded with the APEnet RDMA API.
 CUDA 4.1.
 One-way point-to-point test involving two nodes.
• Receiver node tasks:
– Allocates a buffer on either host or GPU memory.
– Registers it for RDMA.
– Sends its address to the transmitter node.
– Starts a loop waiting for N buffer-received events.
– Ends by sending back an acknowledgement packet.
• Transmitter node tasks:
– Waits for an initialization packet containing the receiver node buffer (virtual) memory address.
– Writes that buffer N times in a loop with RDMA PUT.
– Waits for a final ACK packet.
QUonG Status and Near Future
[Photos: 1U servers with two multi-core INTEL CPUs equipped with APEnet+ cards; 1U S2075 NVIDIA system packing 4 Fermi-class GPUs (~4 TFlops)]
■ Deployment of the system in 2012
 42U standard rack system
• 60/30 TFlops/rack in single/double precision
• 25 kW/rack (0.4 kW/TFlops)
• 300 k€/rack (<5 k€/TFlops)
■ Full rack prototype construction
 20 TFlops ready at 1Q/12
 Full rack ready at 4Q/12
 … waiting for Kepler GPUs
QUonG status and near future
• QUonG elementary mechanical assembly:
– multi-core INTEL (packed in 2 1U rackable systems)
– S2090 FERMI GPU system (5 TFlops)
– 2 APEnet+ boards
• 42U rack system:
– 60 TFlops/rack peak
– 25 kW/rack (i.e. 0.4 kW/TFlops)
– 300 k€/rack (i.e. 5 k€/TFlops)
• The EURETILE HW Platform demonstrator at the 2012 project review will be a stripped version of the QUonG elementary mechanical assembly with:
• 2 CPU systems with/without GPUs connected with APEnet+ boards
• To demonstrate:
– a running prototype of the EURETILE HW platform
– a preliminary implementation of the “faults awareness” hardware block (sensor registers and link error counter read, …)
[Diagram: cluster node – CPUs and GPGPUs with GPU interfaces, APEnet+ boards, 6 torus links, 1+ GPUs per node]
GPU support: P2P
■ CUDA 4.0:
 Uniform address space
 GPUDirect 2.0, aka P2P among up to 8 GPUs
■ CUDA 4.1: P2P protocol with alien devices
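For the GPU-to-GPU case that GPUDirect 2.0 exposes, a minimal sketch with the public CUDA runtime calls (the "alien device" path used by APEnet+ has no public equivalent in this API):

```c
/* Enable peer access between two GPUs; with UVA, cudaMemcpy with
 * cudaMemcpyDefault then moves data directly over PCIe, no host bounce buffer. */
#include <cuda_runtime.h>
#include <stdio.h>

int enable_p2p(int dev_a, int dev_b)
{
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, dev_a, dev_b);
    if (!ok) {
        printf("devices %d and %d cannot access each other\n", dev_a, dev_b);
        return -1;
    }
    cudaSetDevice(dev_a);
    cudaDeviceEnablePeerAccess(dev_b, 0);   /* flags must be 0 */
    return 0;
}
```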
■ P2P between Nvidia Fermi and APEnet+:
 First non-Nvidia device to support it!!!
 Joint development with NVidia
 APEnet+ card acts as a peer
 APEnet+ I/O on GPU FB memory
■ Problems:
 Work around current chipset bugs
 Exotic PCIe topologies
 Sandy Bridge Xeon
P2P advantages
P2P means:
■ Data exchange on the PCIe bus
■ No bounce buffers on host
So:
■ Latency reduction for small msg
■ Avoid host cache pollution for large msg
■ Free GPU resources, e.g. for GPU-to-GPU memcpy
■ More room for comp/comm overlap
DNP technical details
• PCI Express Interface
— Built on ALTERA PCIe Hard IP + commercial wrapper/multi-DMA engine (PLDA EZDMA2) with (up to) 8 independent and concurrent DMA engines
• High-speed serial link interface to the DNP core
— Multiple virtual channels to avoid deadlock
— 4 bonded independent serial links, each lane running at 8.5 Gb/s
— Each lane providing CDR, 8b10b encoder, de-skewing logic, …
• Design of a hardware/firmware RDMA-supporting sub-system based on the ALTERA native uP (NIOS II)
• Experimental direct interface for GPU and/or custom integrated accelerators
[Diagram: transceiver lane – Transmitter Channel: Byte Serializer, 8b10b Encoder, TX PMA Serializer; Receiver Channel: CDR, RX PMA Deserializer, Word Aligner, 8b10b Decoder, Deskew FIFO, Byte Deserializer, Byte Ordering]
THE END
THANK YOU