Journées Informatiques IN2P3/DAPNIA, La Londe-les-Maures

Evaluating a new chip for LQCD simulations ?
Let's hope …
Gilbert Grosdidier
LAL-Orsay/IN2P3/CNRS
14 June 2007
New LQCD chip @ LAL - GDR meeting - GG
The ‘LQCD chip’ Hall of Fame

Or at least the ‘quasi LQCD dedicated chip’ Hall of Fame …

 Ape family
– Ape
– ApeMille
– ApeNext
– ???

 QCDOC family
– QCDOC
– IBM BlueGene (L, P, …)
New ‘off the shelf’ candidates ?
 IBM Cell BE
– Standing for Cell Broadband Engine
– Powering the Sony PS3

 GP GPU
– Standing for General-Purpose Graphics Processing Unit
– ATI cards, NVIDIA, …
– They equip every commodity computer on our desks …

 FPGA processors
– Standing for Field-Programmable Gate Arrays
– Not evaluated yet

 The future Intel many-core processors
– In the lab … now
– In my computer … but when ?

 As for the target, PR told us: for LQCD, it's one PetaFlop or die !
– Is this too ambitious ?
– Or just too shy ?
What is the IBM Cell chip ?
 In a single chip, a PPC processor combined with 8 small but fast graphics-like units
Cell BE Roadmap
[Roadmap chart, from "Cell BE Roadmap", version 5.0, 24-Jul-2006]
– 2006: Cell BE (1+8), 90 nm SOI
– 2007: Cell BE (1+8), 65 nm SOI (cost reduction)
– 2008: Enhanced Cell BE (1+8 eDP), 65 nm SOI, targeting 100 GF DP ? / 200 GF SP ? (performance enhancements / scaling)
– 2009-2010: next generation, 2 PPE' + 32 SPE', 45 nm SOI, ~1 TF (est.)
All future dates are estimations only; subject to change without notice.
Cell Broadband Engine™ Blade – Road Map (2)
[Roadmap chart, from "Cell BE Roadmap", version 5.0, 24-Jul-2006]

 Cell BE-Based Blade (GA: 2H06)
– 2 Cell BE processors, single-precision floating-point affinity
– 1 GB memory, up to 4x PCI Express™

 Advanced Cell BE-Based Blade (target availability: 2H07)
– 2 Cell BE processors, single-precision floating-point affinity
– 2 GB memory, up to 16x PCI Express

 Enhanced Cell BE-Based Blade (target availability: 1H08)
– 2 Enhanced Cell BE processors, SP & DP floating-point affinity
– Up to 16 GB memory, up to 16x PCI Express

 Software: SDK 1.1 (GA: 17 July 2006), SDK 2.0 (target: 1H07), SDK 3.0 (target: 2H07)

All future dates are estimations only; subject to change without notice.
Cell Features
 Heterogeneous multi-core system architecture
– Power Processor Element (PPE) for control tasks
– Synergistic Processor Elements (SPE) for data-intensive processing

 A Synergistic Processor Element (SPE) consists of
– a Synergistic Processor Unit (SPU)
– a Synergistic Memory Flow Control (MFC)
• data movement and synchronization (a minimal DMA sketch follows the diagram)
• interface to the high-performance Element Interconnect Bus (EIB)

[Block diagram: 8 SPEs (each an SXU with its Local Store and MFC, 16 B/cycle ports) and one PPE (PXU with L1 and L2 caches) attached to the Element Interconnect Bus (EIB, up to 96 B/cycle); a MIC towards dual XDR™ memory and a BIC towards FlexIO™; 64-bit Power Architecture with VMX.]
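To make the MFC's role concrete, here is a minimal SPU-side sketch (not taken from our MILC port) of how a block of main memory is pulled into the Local Store with the Cell SDK intrinsics from spu_mfcio.h; the buffer size, the tag id and the effective address passed through argp are illustrative assumptions.

/* Minimal SPU-side DMA sketch (Cell SDK, spu_mfcio.h).
   Illustrative only: buffer size, tag id and the effective address
   received in 'argp' are assumptions, not the actual MILC port. */
#include <spu_mfcio.h>

#define TAG 3                      /* any MFC tag id in 0..31 */

static volatile char buf[16384] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    (void)speid; (void)envp;

    /* argp is assumed to hold the 16-byte aligned effective address
       of a 16 KB block prepared by the PPU. */
    mfc_get(buf, argp, sizeof(buf), TAG, 0, 0);

    /* Wait for this tag group to complete before touching 'buf'. */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    /* ... compute on the Local Store copy, then mfc_put() results back ... */
    return 0;
}

The key point for LQCD is that every operand must be staged explicitly into the 256 kB Local Store this way; there is no cache doing it for us.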
Cell Processor
"Supercomputer-on-a-Chip"

 Synergistic Processor Elements (SPE)
– 8 per chip
– 128-bit wide SIMD units
– Integer and floating-point capable
– 256 KB Local Store
– Up to 25.6 GF/s per SPE, i.e. ~200 GF/s total (at a clock speed of 3.2 GHz)

 Power Processor Element (PPE)
– General-purpose, 64-bit RISC processor (PowerPC 2.02)
– 2-way hardware multithreaded
– L1: 32 KB I + 32 KB D; L2: 512 KB
– Coherent load/store, VMX
– 3.2 GHz

 Internal interconnect (Element Interconnect Bus)
– Coherent ring structure
– 300+ GB/s total internal interconnect bandwidth
– DMA control to/from the SPEs supports >100 outstanding memory requests

 External interconnects
– 25.6 GB/s memory interface bandwidth
– 2 configurable I/O interfaces: a coherent interface (SMP, 20 GB/s) and a normal I/O interface (I/O & graphics, 5 GB/s)
– Total bandwidth configurable between the interfaces: up to 35 GB/s out, up to 25 GB/s in

 Memory management & mapping
– SPE Local Store aliased into PPE system memory
– MFC/MMU controls SPE DMA accesses: compatible with the PowerPC virtual memory architecture, S/W controllable from PPE MMIO, hardware or software TLB management
– SPE DMA access protected by the MFC/MMU
IBM Cell review (1)
 Pros: very flexible, at least theoretically
– Attractive design: a PowerPC970 unit + 8 SPUs packaged on the same chip
– SPU: Synergistic Processor Unit (sic)
– Very good FP computation performance: 200 GFlops aggregated per chip
– Very attractive and flexible processing model to accommodate the SPUs
• Pipeline, SIMD, custom …
– DMA between the SPUs
– Fast bus access to each SPU (dual port, 25 GB/s)
– Fast InfiniBand connectivity with other chips (4 GB/s)
– The chip exists …
– And a board shipping with 2 Cells is already available (Cell BE Blade)
– Overlays available
– Several important development tools available
• Simulator
• Debugger
• Static performance analyzer
• Integrated development environment (unused)
IBM Cell review (2)
 Cons: there seem to be too many constraints
– Single-precision FP computations only (DP rather slow, ~2 GF/SPU)
– Scalar performance very low
• Required to port the code to SPU-specific vector instructions (see the sketch after this list)
– Very small memory store on board each SPU (256 kB for data + code, no cache)
• Difficult to fit a QCD sub-lattice larger than 2x2x2x4 (70 kB) on this, even in SP
– Transfer overhead when moving data through DMA is a penalty
• Difficult to satisfy the last 2 points at the same time
– No MPI port available yet for SPU-to-SPU transfers across different chips/blades/chassis/racks (only for PPU-to-PPU transfers)
– Some very basic vector operations missing (dot product, …)
– Not easy to trigger an SPU-to-SPU DMA transfer
– Required to go through the PPU as a bridge
– Very hectic merge between the LQCD code and the Cell environment makefile systems

 Unsure
– Price ? A PS3 station costs 600 €, but with only 6 SPUs
– Availability ? Chip mass production seems difficult/unreliable (cf. above)
– Obsolescence ?
– Future road map ?
• Memory on board
• Double-precision FP computations
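To illustrate what "porting the code to SPU-specific vector instructions" means in practice, here is a hedged sketch (not MILC code) of the same single-precision multiply-add loop written once in plain scalar C and once with the intrinsics from spu_intrinsics.h; the function names, and the assumption that the arrays are 16-byte aligned and padded to a multiple of 4 elements, are mine.

/* Scalar vs. SPU-vectorized single-precision loop (sketch only).
   Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
#include <spu_intrinsics.h>

/* Scalar version: each iteration uses only one lane of the 128-bit ALU. */
void axpy_scalar(float a, const float *x, float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Vector version: 4 floats per instruction via the fused multiply-add. */
void axpy_spu(float a, const vec_float4 *x, vec_float4 *y, int n4)
{
    int i;
    vec_float4 va = spu_splats(a);    /* broadcast the scalar to 4 lanes */
    for (i = 0; i < n4; i++)          /* n4 = n / 4 vector elements */
        y[i] = spu_madd(va, x[i], y[i]);
}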
GP GPU preview
 No detailed investigation yet (on my side :-), but here is what I gathered
– A few insights: several reports from independent sources seem to indicate an overall x10 speed-up for several important kernels of the LQCD code
• Wilson matrix or staggered matrix multiplication (Wilson kernel)
• Hopping matrix (HMC), something very close to Dirac matrix inversion

 Pros
– Large memory on board of the GPU card: 768 MB for the most recent model
• Sub-lattice sizes of the order of 16x16x16x60 can fit on a single Nvidia card
– Specific (CUDA) compiler, which avoids having to use OpenGL instructions to code LQCD computations
• I was told that CUDA is C-like, and very close to C
• This compiler is SIMD-able (vectorization)
– Theoretically, several GPU cards can fit in a commodity computer
• 4 to 6, depending on the number of PCI slots available (?)

 Cons
– Specialized processors, hence some libraries missing (??)
– Significant overhead when transferring data from the main processor to the GPU
• But this could be balanced by a reduced number of exchanges (large memory)
– The CUDA compiler is vendor software, not open source (Nvidia)
• And it is of course dedicated to Nvidia only (not ATI)
– Card cost: about 600 $/€ per GPU
Future Intel many-core chip(s)
 The multi-core trend is now moving towards 'many-core'
– Talking right now about 80 cores ('tiles') per chip
– With a reduced clock speed to decrease power dissipation
• A 15% slower clock usually means a 50% power saving …
– They claim to achieve 1 TFLOPS on a single chip with 62 W dissipation
– They stack the main memory chip(s) directly on top of the CPU chip
• Saving wire dissipation
• Increasing the number of connections
• Several memory layers can be stacked on top of each other, hence modularity
• They claim to be able to achieve right now up to 256 MB/tile (??)
• This quasi-on-chip memory can be split between shared memory and private memory for each of the 80 tiles

 Worrying
– This chip runs in the lab, but will only be on the market in 2012 (?)
1 TFLOPS on a Chip
– 62 W
– Experimental research test chip: 100M transistors, 80 tiles, 275 mm²
– Multi/many-core chip research: future tera-scale chips could use an array of tens to hundreds of cores with reconfigurable caches, as well as special-purpose hardware accelerators, utilizing a scalable on-die interconnect fabric
(All plans, products, dates, and figures are preliminary and subject to change without notice.)
The targeted LQCD code : MILC
 Rather recent (frozen around 2002), suggested by the LPT people
 Obviously optimized for scalar machines
• More specifically Intel processors
 MPI version available, but not used
 A thorough review showed (too late) that
– it was definitely too scalar-oriented
• hence very hard to vectorize
– it has some awful sequences of code
• Gaussian random generator
• triangular matrix simulation
– however, it is rather well designed at the data tree structure level
• header embedding
– and it is easy to split between a master thread and several slave threads
• most probably because of the earlier MPI restructuring
 Other code was proposed, but the makefile reorganization is painful
• HMC, on standby for the time being, because of very low manpower
The evaluation
 First of all, it was meant to be a Cell speed-up evaluation
– Was the FP computing power of the SPUs as huge as advertised ???
– Possibly an evaluation of the porting, debugging and development tools as well
– Avoid as much as possible any code or data restructuring
• Hoping the scalar code could run 'as is'

 A top-down approach was selected, with two ideas behind it
– No DMA transfers between SPUs
• It was not the main target of the study
– Avoid PPU-SPU transfers as much as possible
– Relocate only the most CPU-hungry part onto the SPUs

 A (rather) special version of the MILC code was tailored by Zhao-Feng Liu in October to achieve this
– The algorithm is split into only 2 chunks, between PPU and SPU
• On the PPU: initializations, data structure building, file reading, final result cross-checks
• On the SPUs: the main compute loop for a 2x2x2x4 QCD sub-lattice
– The compute loop includes only Dirac matrix inversions
• It can be run 1000 … 1000000 times ….
• It runs identically on the 8 SPUs in parallel (a sketch of how such a harness can be set up follows this list)
– Results are cross-checked against the scalar version running on a standard Linux node
• They appear to be fine up to now
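For reference, here is a minimal sketch of how such a PPU-side harness can launch the same SPU program on the 8 SPEs with libspe2; it is not the actual code tailored for this evaluation, and the binary name spu_loop.elf, the per-SPE argument block and the (absent) error handling are simplified assumptions.

/* PPU-side sketch: run one SPU binary on NSPU SPEs in parallel (libspe2).
   "spu_loop.elf" and the per-SPE argument block are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <libspe2.h>

#define NSPU 8

static spe_program_handle_t *prog;

static void *run_spe(void *argp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);

    spe_program_load(ctx, prog);
    /* Blocks until the SPU program exits; argp typically points to a
       small structure holding the sub-lattice addresses for this SPE. */
    spe_context_run(ctx, &entry, 0, argp, NULL, NULL);
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t th[NSPU];
    int i;

    prog = spe_image_open("spu_loop.elf");   /* hypothetical SPU binary */
    if (!prog) { perror("spe_image_open"); return 1; }

    for (i = 0; i < NSPU; i++)               /* one POSIX thread per SPE */
        pthread_create(&th[i], NULL, run_spe, NULL);
    for (i = 0; i < NSPU; i++)
        pthread_join(th[i], NULL);

    spe_image_close(prog);
    return 0;
}

Such a harness is built on the PPU side with -lspe2 -lpthread; one POSIX thread per SPE is needed because spe_context_run blocks until the SPU program terminates.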
Evaluation pitfalls ?
 The overlay technique to split and run the program was tried and dropped
– It wastes too much memory in the very tiny LS space
– It does not help that much for this test

 It is not yet fully obvious that all pointers on the SPU access only the LS
– Meaning that some phases of the compute loop on the SPUs may still be accessing the main memory of the PPU directly
• Which is a definite speed-up penalty !!!
– There does not seem to be a convenient tool to forbid or track this
– This issue was not resolved because it became overshadowed by other considerations
First results: very disappointing
 Showerman (NCSA) claimed at SC'06 to get a x30 speed-up with MILC on the Cell
 We managed to run our tests on both a Barcelona Cell (thanks to Christine) and a Montpellier one (thanks to IBM), and we also got a factor x30
– But it was on the execution time, not on the speed ;-((

 There are several obvious explanations for this
– The code we ran was purely scalar, not using any SPU-specific vector instructions
– It contains a lot of array-index integer computations, for which the Cell is very poorly fitted
– It was not optimized at all in terms of instruction reshuffling, and the compiler may behave oddly in this area
– The NCSA claim involved a small piece of the MILC code, not the overall compute loop at all, and was therefore a bit misleading (if not more …)
The partial conclusion
 Even to achieve only an evaluation of the Cell with any LQCD code, a huge amount of work must be invested
– To reorganize and vectorize the code
– To reorganize the data, and more specifically the locality of the data
• Meaning that 2 pieces of the same data will have to be contiguous in LS memory space to allow efficient vectorization
• It is then highly recommended to use SoA (Structures of Arrays) instead of AoS (Arrays of Structures), to allow easier navigation through very simple iterators (see the sketch after this list)
– To avoid slow array-index computations
– To chase heavily scalar routines or sequences of code
– To carefully apply loop-unrolling techniques to make efficient use of the 2 ALUs (and pipelines) provided in each SPU; and this is very tricky because these 2 pipelines are not functionally identical
• One is for FP and integer computations
• The other one is, more or less, for bit/byte reshuffling

 But, clearly, we would like to know whether it is worth doing all this only for evaluation purposes ???
– Avoiding putting the cart before the horse :-((
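To make the SoA recommendation concrete, here is a tiny, hypothetical sketch; the type and field names are mine and only loosely mirror an SU(3) colour vector of 3 complex components. In the AoS layout, the 4 floats loaded into one 128-bit register belong to different components and parts, whereas in the SoA layout 4 consecutive sites of the same component are contiguous and map directly onto one SPU vector.

/* AoS vs. SoA layouts for a field of SU(3) colour vectors (sketch only).
   Names are illustrative; VOL is the local sub-lattice volume. */
#define VOL (2*2*2*4)

/* Array of Structures: natural for scalar code, but the 4 floats that end
   up in one 128-bit register mix real/imaginary parts of different sites. */
typedef struct { float re, im; } cplx;
typedef struct { cplx c[3]; } su3_vector_aos;
su3_vector_aos field_aos[VOL];

/* Structure of Arrays: field_soa.re[ic][site..site+3] holds 4 consecutive
   sites of the same component, i.e. exactly one 128-bit vector operand. */
typedef struct {
    float re[3][VOL] __attribute__((aligned(16)));
    float im[3][VOL] __attribute__((aligned(16)));
} su3_field_soa;
su3_field_soa field_soa;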
Another approach ?
 Clearly, from what I discussed later with Showerman, they chose the bottom-up technique to port MILC to the Cell
– They select only small, CPU-hungry bits of code which can fit on the SPU
– Leaving the PPU to manage the overall task

 But they have to concede that
– The x30 speed-up they claim was the best one they could achieve
– The pieces are too small to prevent the eventual CPU speed-up from being washed out by transfer overheads

 And I would like to add that this approach does not avoid the headache of code vectorization and data reshuffling either
 But I found it very interesting that both ways, which are complementary, are, in fine, pointing to big difficulties in using this Cell architecture efficiently as it is now
Where does it hurt ?
 There are currently too many contention points on the Cell, and it is very painful to try to escape one of them without being bitten by another one, if not several
 If I dare to suggest improvements to IBM, given the current status of the evaluation, here follows my priority list
– Above all, increase the SPU LS size by a significant factor (x1000)
– Implement a really fast double-precision FP ALU on the SPU
– Implement efficiently some missing vector operations on the SPU
• Dot product
– Implement MPI for the SPU
– Improve the connection speed with the outside of the Cell
– Improve the main PPU processing performance
– Increase the number of SPUs per Cell chip
Next steps
 Try to evaluate the top speed-up we can achieve with the purest FP compute chunk
– A simple Wilson matrix multiplication, for example (a sketch of its core operation follows this list)
– With vectorization, but excluding the data reorganization step

 We know very well that this maximum will be an asymptote
– The overall speed-up will always be smaller, even if we do our best
• to redesign the data structures
• to vectorize just about everything
– Also because we will need to implement DMA exchanges at all levels
• SPU-SPU on chip and between chips, SPU-SPU between chassis and racks
• To account for boundary sub-lattice nodes

 So, if this asymptote is too low, the game will be over
– Which does not mean we will be able to get interesting overall performance even if it is high enough

 Is x30 already a reasonable target for the asymptote value ?
– I seem to remember the initial one was x50-x100 for the overall increase
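For reference, the "purest FP compute chunk" essentially reduces to complex 3x3 matrix times 3-vector products; below is a plain scalar C sketch of that core operation (type and routine names are mine, only loosely patterned on the MILC conventions), i.e. the piece that would then be rewritten with SPU vector intrinsics to measure the asymptote.

/* Core of the Wilson kernel: w = U * v, with U an SU(3) matrix and v, w
   colour vectors of 3 complex components.  Scalar sketch; names are
   illustrative and only loosely follow the MILC conventions. */
typedef struct { float re, im; } complexf;
typedef struct { complexf e[3][3]; } su3_matrix;
typedef struct { complexf c[3]; } su3_vector;

void mult_su3_mat_vec(const su3_matrix *u, const su3_vector *v,
                      su3_vector *w)
{
    int i, j;
    for (i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (j = 0; j < 3; j++) {      /* complex multiply-accumulate */
            re += u->e[i][j].re * v->c[j].re - u->e[i][j].im * v->c[j].im;
            im += u->e[i][j].re * v->c[j].im + u->e[i][j].im * v->c[j].re;
        }
        w->c[i].re = re;
        w->c[i].im = im;
    }
}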
Next steps & Conclusion(s)
 Obviously, if a worthwhile speed-up is ever achieved
– One must also evaluate the DMA performance between 2 neighbouring SPUs
• On the same chip
– If this one is too low, forget the Cell

 But we cannot dismiss the Cell before these 2 values are measured
– Bare FP compute speed-up
– DMA transfer overheads

 And this because the Cell chip is there and available
– We will be blamed if we do not at least try to use it
– Again, the price tag is a real issue then
• But we are not there yet …
– And porting costs could be very high as well

 QUESTIONS ?