
Call FP7-ICT-2009-4 Objective ICT-2009.8.1
FET – Future Emerging Technologies: Concurrent Tera-device Computing
EURETILE HW platforms
Piero VICINI - INFN Roma
CASTNESS’11 - January 2011 - Rome
INFN contribution to EURETILE project
• Medium/Long term INFN objectives
  • HPC systems dedicated to scientific applications (more than 20 years of "history" and several generations of APE machines): design and use
  • Optimize well-established application kernels (LQCD) and explore new challenging applications (neural network modeling, Bio-Computing, Gravitational wave analysis, Complex systems…)
  • Scaling to (multi-)petaflops parallel systems needs a scalable interconnection network with 10³-10⁵ network routers
• DNP (Distributed Network Processor) refinement and evolution
  • Analyze the collective behavior of a "huge" number of interconnected DNP-based computing nodes (deadlock, starvation, throughput efficiency,…)
  • Add fault tolerance capabilities to limit the impact of link failures on the network
  • Add "brain inspired" features to explore new programming models and to boost computing performance
SHAPES legacy
• A suitable INFN APE computing engine
  • (Sh)ApOtto multi-tile (8+) processor, 40(+) GFlops, 10 W
    • 8(+) RISC+VLIW_FP cores + DNP-based network
    • SoC based on "single tile replica", allowing the number of tiles to grow with silicon processes
  • Multi-chip high density system:
    • 1K (Sh)ApOtto, 40 TFlops, 20 kW, 200 kEuro per rack
  • Enhanced/new programming model, semi-automated application mapping software, HW-dependent light OS
[Figure: TeraMotherBoard layout: a 4x8 grid of M8+ modules, each flanked by DC/DC converters, with a back connectors area (power supply), a front connectors area (I/O) and a 3DT connectors area for TeraMotherBoard stacking]
• But…
  • Risky investment, with mass production 3-5 years from now: technology is growing fast and people have learned the lesson…
  • We need a strong partnership with a silicon foundry
  • …and last but not least… we need 3-5 MEuro for NRE (chip, mechanics, man power…)
• Pump up flops/W, flops/Euro, flops/m³
• The race is still open, but the current situation doesn't allow us to start NOW and successfully compete with emerging "commodity" hardware
HPC Emerging “commodity”: (GP)GPU
• General Purpose Graphic Processing Unit
• Impressive peak performances
  • TFlops per chip
• Videogames market, i.e. 10 G$/yr
• Two main competitors (Nvidia, ATI)
• Architecture and characteristics fit very well with LQCD requirements
  • Many-core (>>100) SIMD-like architecture
  • Single core specialized for data-parallel floating point computation
  • High local memory bandwidth
  • "Green": high Flops/W ratio
  • Cost effective: high Flops/$ ratio
• Nvidia Fermi (Tesla 20xx)
  • 3×10⁹ transistors
  • ~500 cores, 1 TF SP, 0.5 TF DP
  • 6 GB external memory (150 GB/s)
  • ~250 W, <2K Euro
                Xeon X5670   Opteron 8439   ATI HD5870   Tesla C1060   Tesla C2070
# of cores      6            6              1600         240           448
GFlops (SP)     140          134            2720         933           1030
GFlops (DP)     70           67             544          78            515
TDP (Watt)      95           105            188          188           247
Price (Euro)    1600         2000           400          1500          < 2000
GFlops/Euro     0.04         0.03           1.36         0.05          > 0.26
GFlops/Watt     0.74         0.64           2.89         0.41          2.09
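A minimal sketch of how the last two rows follow from the raw data; the assumption (inferred from the printed values) is that both ratios use the double-precision GFlops figure, and the C2070 price of 2000 Euro stands in for the quoted "< 2000".

/* Recompute the two derived rows of the table above.
 * Assumption: both ratios use the double-precision GFlops figure. */
#include <stdio.h>

struct device {
    const char *name;
    double gflops_dp;   /* GFlops (DP)  */
    double tdp_watt;    /* TDP (Watt)   */
    double price_euro;  /* Price (Euro) */
};

int main(void)
{
    struct device dev[] = {
        { "Xeon X5670",    70.0,  95.0, 1600.0 },
        { "Opteron 8439",  67.0, 105.0, 2000.0 },
        { "ATI HD5870",   544.0, 188.0,  400.0 },
        { "Tesla C1060",   78.0, 188.0, 1500.0 },
        { "Tesla C2070",  515.0, 247.0, 2000.0 },  /* price quoted as "< 2000" */
    };
    for (size_t i = 0; i < sizeof dev / sizeof dev[0]; i++)
        printf("%-14s GFlops/Euro = %.2f  GFlops/Watt = %.2f\n",
               dev[i].name,
               dev[i].gflops_dp / dev[i].price_euro,
               dev[i].gflops_dp / dev[i].tdp_watt);
    return 0;
}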
… it’s always a matter of “brute force”…
LQCD & GPU
• Story begins with video games… (Egri, Fodor et al. 2006)
• Wilson-Dirac operator at 120 GFlops (K. Ogawa 2009)
• Domain Wall fermions (Tsukuba/Taiwan 2009)
• Definitive work: QUDA lib (M.A. Clark et al. 2009):
  • Double, single, half precision
  • Half-precision solver with reliable updates > 100 GFlops
  • MIT/X11 Open Source License
• INFN code development
  • 2D spin models (Di Renzo et al, 2008)
  • LQCD staggered fermions on Chroma (Cossu, D'Elia et al, 2009) with impressive results for a single GPU:
    • 1 CPU + C1060 = 1.5 apeNEXT crates (!)
• But one GPU is not enough, we need to scale up to 100-1000
• Many levels of parallelism are needed:
  • Intra-GPU, i.e. efficient single-GPU codes
  • Intra-node, i.e. efficient hardware to support GPU-to-GPU communication within the same host
  • Inter-node, i.e. a low-latency, high-bandwidth network optimized to support RDMA, first-neighbour comms,…
Emerging embedded system (…) architectures: ARM + accelerators
• Have you ever heard of similar architectures? ;-)
• NVidia Tegra: multi-ARM + specialized audio/video/graphics accelerators
• FreeScale i.MX6 platform
• TI DaVinci: (multi) ARM + DSP
• Project Denver, Jan 5 announcement: NVIDIA CPU running the ARM instruction set, integrated on the same chip as the NVIDIA GPU
  • "An ARM processor coupled with an NVIDIA GPU represents the computing platform of the future." W. Dally, Nvidia Chief Scientist
Next generation FPGA
• Latest FPGA-based systems are the ideal hardware for prototyping significant components of the EURETILE reference platform
• Two main FPGA families: ALTERA STRATIX V – XILINX VIRTEX 7
  • 28 nm, introduction during 2011
  • TFlops performance, (multi-)Terabit I/O bandwidth, hard-wired µP cores
• Altera Embedded Initiative
• Xilinx Extensible Processing Platform
DNP: Distributed Network Processor
• DNP: 3D torus network controller
  • packet-based direct network with 2D/3D torus topology
  • fixed-size header/footer envelope (header + footer)
  • auto-routing using dimension-order static routing, with deadlock avoidance (see the sketch after this list)
  • error detection via EDAC/CRC at packet level
  • RDMA capabilities, PUT and GET, implemented at the firmware level
  • SystemC models, VHDL (synthesizable) code, AMBA interface (SHAPES), PCI Express interface
  • implementation on FPGA and "almost" tape-out on ASIC
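As a rough illustration of the dimension-order static routing rule listed above, here is a minimal C sketch; the header layout, the port naming and the shortest-direction choice are illustrative assumptions, not the actual DNP firmware or packet format.

/* Sketch of dimension-order (X, then Y, then Z) static routing on a 3D torus.
 * Header fields and port names are illustrative assumptions only. */
#include <stdio.h>

#define NX 4
#define NY 4
#define NZ 4

enum port { X_PLUS, X_MINUS, Y_PLUS, Y_MINUS, Z_PLUS, Z_MINUS, LOCAL };

struct header {            /* fixed-size packet header (illustrative) */
    int dst[3];            /* destination node coordinates (x, y, z)  */
};

/* Pick the output port at node `cur` for a packet addressed to hdr->dst:
 * route along X until the X coordinate matches, then Y, then Z.
 * On a torus the shorter of the two wrap-around directions is chosen. */
static enum port route(const int cur[3], const struct header *hdr)
{
    static const int size[3] = { NX, NY, NZ };
    static const enum port plus[3]  = { X_PLUS,  Y_PLUS,  Z_PLUS  };
    static const enum port minus[3] = { X_MINUS, Y_MINUS, Z_MINUS };

    for (int d = 0; d < 3; d++) {
        int delta = (hdr->dst[d] - cur[d] + size[d]) % size[d];
        if (delta == 0)
            continue;                       /* this dimension already done */
        return (delta <= size[d] / 2) ? plus[d] : minus[d];
    }
    return LOCAL;                           /* packet has reached its node */
}

int main(void)
{
    int cur[3] = { 0, 0, 0 };
    struct header hdr = { { 3, 1, 2 } };
    printf("first hop from (0,0,0) to (3,1,2): port %d\n", route(cur, &hdr));
    return 0;
}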
DNP enhancements in EURETILE
• Introduce fault tolerance hardware capabilities
  • link self-diagnostics
  • new (fault tolerant) routing algorithms
• Exploration of "on-chip" DNP-ASIP integration and optimization of "off-chip" DNP-GPU integration
• Explore brain-inspired features (multicast,…)
[Figure: DNP block diagram: a 7x7-port switch (routing logic + arbiter) connecting six bidirectional torus links (X+, X-, Y+, Y-, Z+, Z-) through TX/RX FIFOs & logic, a collective communication block, a 100/1000 Eth port, a memory controller with DDR3 module, a PCIe X8 Gen2 core (8 lanes @ 5 Gbps) and a NIOS II processor, over a 128-bit @ 250 MHz bus]
Custom PC Cluster Network: APEnet+
• APEnet+: DNP on an FPGA-based PCI Express card providing a 3D torus network for PC clusters
  • FPGA-based (Altera Stratix IV) card with PCIe form factor
  • Single slot width, 4 torus links, 2D torus topology
  • Secondary piggy-back card, resulting in double slot width, 6 links, 3D torus topology
  • Embedded NIOS processor to support RDMA operations
• FPGA (Stratix IV EP4SGX2xx) synthesis results:
  • PCIe x8 Gen2 host interface (peak 4+4 GB/s)
  • 6 torus links, fully bidirectional, 34 Gb/s per direction (~400 Gb/s aggregate bandwidth), on 4 lanes using QSFP+ interconnect mechanics (a bandwidth check follows this list)
  • Internal clock up to 210 MHz
  • 128-bit word size crossbar switch
  • Resource usage on Stratix IV EP4SGX290: 15% logic elements, 20% registers, 50% internal memory
    • For next-generation (28 nm) FPGAs these numbers become negligible: a preliminary estimation gives 3% LE, 4% regs, 15% mems
• Deliverables
  • 3-channel prototype board for link electrical characterization and firmware development, completed and tested in 2010
  • APEnet+ board (6 channels) design completed in 2010
  • 4 APEnet+ boards in 1Q 2011
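A rough check of the bandwidth figures quoted above; this is only a sketch, and the 8b/10b line-coding factor for PCIe Gen2 comes from the PCIe standard, not from APEnet+ documentation.

/* Back-of-the-envelope check of the APEnet+ bandwidth figures above. */
#include <stdio.h>

int main(void)
{
    /* PCIe x8 Gen2: 8 lanes x 5 Gb/s raw, 8b/10b encoded -> payload rate */
    double pcie_lanes = 8.0, pcie_raw_gbps = 5.0, encoding = 8.0 / 10.0;
    double pcie_gbytes_per_dir = pcie_lanes * pcie_raw_gbps * encoding / 8.0;
    printf("PCIe x8 Gen2: %.1f GB/s per direction (slide: 4+4 GB/s)\n",
           pcie_gbytes_per_dir);

    /* Torus links: 6 links, bidirectional, 34 Gb/s each direction */
    double links = 6.0, per_dir_gbps = 34.0;
    double aggregate_gbps = links * 2.0 * per_dir_gbps;
    printf("Torus links: %.0f Gb/s aggregate (slide: ~400 Gb/s)\n",
           aggregate_gbps);
    return 0;
}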
QUOnG as the EURETILE HPC platform demonstrator
• QUantum chromodynamics ON Gpu
  • PC clusters accelerated with high-end GPUs and interconnected via the APEnet+ 3D torus network
  • Added value: tight integration between the accelerators (GPU) and the custom/reconfigurable network (DNP on FPGA) allows a gain in computing efficiency
  • Production and deployment of green, cost-effective medium/large systems in 2011
• Elementary unit:
  • multi-core INTEL (packed in 2 1U rackable systems)
  • S2070 FERMI GPU system (4 TFlops)
  • 2 APEnet+ boards
• 42U rack system:
  • 60 TFlops/rack peak
  • 25 kW/rack (i.e. 0.4 kW/TFlops)
  • 300 k€/rack (i.e. 5 k€/TFlops)
• We will leverage the QUOnG system to demonstrate the EURETILE HPC platform
  • Similar mechanics/topology but enhanced DNP (fault tolerant, brain inspired,…) on the network board
Embedded reference platform demonstrator
• APEnet+ is the "ready to use" elementary component of the EURETILE embedded platform
  • Interconnected APEnet+ boards realize a prototype of the array of DNP-interconnected FPGAs, useful to test the fault tolerance capabilities and the brain-inspired enhancements of the network
• On availability of 28 nm FPGA components:
  • evaluate the design of a new, enhanced APEnet+ board (EURETNet…)
    • integrated hard-wired ARM µP
    • many resources to explore coupling with an ASIP accelerator
  • investigate the integration of 16/32 small modules equipped with 28 nm FPGAs (+ local memory banks) on a "backbone" board (leveraging the APE/SHAPES mechanics design)
    • to accelerate dedicated tasks (via ASIP) in a bio-computing application currently under evaluation
[Figure: backbone board sketch: 16 modules (numbered 0-15) arranged in a 4x4 grid between the BackPlane and the FrontPlane, plus a PB block]
Backup Slides
Next years' INFN computing requirements
V. Lubicz – CSN4 talk, September 2009
• Compute-intensive physics (excluding LHC stuff)
  • ~0.01-1 PFlops for a single research group
  • ~0.1-10 PFlops nationwide
• and beyond LQCD…
  • 2D spin models, Bio-Computing, Gravitational wave analysis, Complex systems, 2D/3D Fluid Dynamics, Monte Carlo for medical and space sciences, …
GPU activities: (inter)national scenario
• A random list (not exhaustive):
  • "Keeneland" (GeorgiaTech, Nvidia, HP, Oak Ridge): HP + Nvidia, 2 PFlops peak in 2012
  • "Tianhe" (NUDT, China): Intel CPUs + AMD GPUs, 1.2 PFlops peak
  • "Nebulae" ("China's Dawning Information Industry Co", Shenzhen, China): AMD Opteron + Nvidia GPUs, 1.2 PFlops peak
  • GPU supercomputer at CQSE (Taiwan), 16 Nvidia S1070
  • SGI: next servers (UltraViolet) CPU+GPU
  • TOP500 systems
• INFN activities:
  • GPU Computing Interest Group (see Bosi talk)
  • GPU for LQCD computing
The question is: are GPUs good enough for LQCD computation?
• FP vs local memory bandwidth
  • LQCD requirement: 1 word of I/O per 8 flop -> 4 Bytes / 8 flop
  • GPU memory bandwidth is 150 GB/s, peak performance 2.7 TFlops
  • LQCD-on-GPU theoretical peak performance is (150/4)*8 = 300 GFlops, i.e. 11% of peak (similar to the measured efficiency…)
• Remote access vs local memory bandwidth, i.e. scaling capability
  • LQCD requirement (R): 16 (8) words of local access per 1 word of remote access -> Rqcd = 16 (8)
  • GPU I/O interface is PCI Express Gen2 x16 -> 16*5 Gb/s -> 10 GB/s
  • Ratio local/remote is: 150/10 = 15 ≈ Rqcd
• GFlops/Watt
  • GPU (LQCD peak): 300 GFlops / 180 W = 1.7 vs 2 (SHAPES platform)
• GFlops/Euro
  • GPU (LQCD peak): 300 GFlops / 350 Euro = 0.85 vs 1.3 (SHAPES platform)
So the answer is:
Yes!! GPUs show system parameter values similar (perhaps better?) to SHAPES, and it's also real hardware…
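A minimal C sketch reproducing the back-of-the-envelope estimates above; the 150 GB/s, 2.7 TFlops, 10 GB/s, 180 W and 350 Euro figures are taken from the slide, and the formulas are just the stated ratios.

/* Reproduce the slide's balance estimates for LQCD on a GPU. */
#include <stdio.h>

int main(void)
{
    double mem_bw_gbs     = 150.0;   /* GPU local memory bandwidth, GB/s      */
    double peak_gflops    = 2700.0;  /* GPU peak performance, GFlops          */
    double pcie_gbs       = 10.0;    /* PCIe Gen2 x16 bandwidth, GB/s (slide) */
    double bytes_per_word = 4.0;     /* single-precision word                 */
    double flops_per_word = 8.0;     /* LQCD: 8 flop per word of I/O          */

    /* Memory-bandwidth-limited LQCD performance */
    double lqcd_gflops = mem_bw_gbs / bytes_per_word * flops_per_word;
    printf("LQCD-on-GPU peak: %.0f GFlops (%.0f%% of %.0f GFlops peak)\n",
           lqcd_gflops, 100.0 * lqcd_gflops / peak_gflops, peak_gflops);

    /* Local-to-remote bandwidth ratio vs the LQCD requirement Rqcd */
    printf("local/remote bandwidth ratio: %.0f (Rqcd = 16 (8))\n",
           mem_bw_gbs / pcie_gbs);

    /* Efficiency metrics used for the comparison with SHAPES */
    printf("GFlops/W:    %.1f\n", lqcd_gflops / 180.0);
    printf("GFlops/Euro: %.2f\n", lqcd_gflops / 350.0);
    return 0;
}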