Accelerators_SCED11x


Transcript: Accelerators_SCED11x

 Old CW: Transistors expensive
 New CW: “Power wall”: power expensive, transistors free
(Can put more on a chip than you can afford to turn on)
 Old CW: Multiplies are slow, memory access is fast
 New CW: “Memory wall”: memory slow, multiplies fast
(200-600 clocks to DRAM memory, 4 clocks for an FP multiply)
 Old CW: Increasing Instruction Level Parallelism (ILP) via compilers, innovation
(Out-of-order, speculation, VLIW, …)
 New CW: “ILP wall”: diminishing returns on more ILP
 New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
 Old CW: Uniprocessor performance 2X / 1.5 yrs
 New CW: Uniprocessor performance only 2X / 5 yrs?
Credit: D. Patterson, UC-Berkeley
"If one ox could not do the job, they did not try to grow a bigger ox,
but used two oxen." - Admiral Grace Murray Hopper.
 It turns out that sacrificing uniprocessor performance can save you a lot of power.
 Example (the arithmetic is sketched below):
 Scenario One: one-core processor with power budget W
 Increase frequency/ILP by 20%
 Substantially increases power, by more than 50%
 But only increases performance by 13%
 Scenario Two: decrease frequency by 20% with a simpler core
 Decreases power by 50%
 Can now add another core (one more ox!)
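To make the arithmetic concrete, here is a minimal sketch of the scenario numbers. It assumes the common first-order model that dynamic power scales roughly with the cube of frequency (P ∝ V²f, with voltage tracking frequency); the exact exponent varies by design.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Assumed model: dynamic power ~ f^3 (voltage scales with frequency). */
    double p_up   = pow(1.20, 3.0);  /* +20% frequency: ~1.73x power, i.e. >50% more */
    double p_down = pow(0.80, 3.0);  /* -20% frequency: ~0.51x power, i.e. ~50% less */

    /* Two slower, simpler cores fit in roughly the original budget W...   */
    double two_core_power = 2.0 * p_down;  /* ~1.02x W */
    /* ...and deliver ~1.6x the throughput on parallelizable work.         */
    double two_core_perf  = 2.0 * 0.80;

    printf("power at 1.2f: %.2f, power at 0.8f: %.2f\n", p_up, p_down);
    printf("two cores at 0.8f: power %.2f, throughput %.2f\n",
           two_core_power, two_core_perf);
    return 0;
}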
"If one ox could not do the job, they did not try to grow a bigger ox,
but used two oxen." - Admiral Grace Murray Hopper.
"If you were plowing a field, which would you rather use?
Two strong oxen or 1024 chickens ?" - Seymour Cray
 Chickens are gaining momentum nowadays:
 For certain classes of applications (not including field
plowing...), you can run many cores at lower frequency and
come out ahead (big time) in the speed game
 Molecular Dynamics Codes (VMD, NAMD, etc.) reported
speedups of 25x – 100x!!
 Oxen are good at plowing
 Chickens pick up feed
 Which do I use if I want to catch mice?
 I’d much rather have a couple cats
Moral: Finding the most appropriate tool for the job brings
about big gains in efficiency.
Addendum: That tool will only exist and be affordable if
someone can make money on it.
Cray High Density Custom
Compute System
 “Same” performance on Cray’s
2-cabinet custom solution
compared to 200-cabinet x86
Off-the-Shelf system
 Engineered to achieve
application performance at
< 1/100 the space, weight, and
power cost of an off-the-shelf
system
 Cray designed, developed,
integrated and deployed
System Characteristics    Cray Custom Solution    Off-the-Shelf System
Cabinets                  2                       200
Sockets                   48                      37,376
Core Count                96                      149,504
FPGAs                     88                      0
Total Power               42.7 kW                 8,780 kW
Peak Flops                499 GF                  1.2 PF
Total Floor Space         8.87 sq ft              4,752 sq ft
[Figure: Energy efficiency (log scale, ~0.1 to 1000) vs. flexibility (coverage). From most efficient/least flexible to most flexible: dedicated HW ASICs, reconfigurable processors/logic, ASPs, DSPs, embedded processors. GPUs were in the dedicated-ASIC region 7-10 years ago; now they sit in the reconfigurable processor/logic space.]
General Purpose computing on Graphics Processing Units
 Previous GPGPU Constraint:
 To get general purpose code working, you had to use the
corner cases of the graphics API
 Essentially, re-write the entire program as a collection of
shaders and polygons
[Figure: the fragment-program model. Input registers, textures, and constants feed a fragment program (with temp registers, scoped per thread, per shader, per context), which writes to output registers and FB memory.]
 “Compute Unified Device Architecture”
 General purpose programming model
 User kicks off batches of threads on the GPU
 GPU = dedicated super-threaded, massively data
parallel co-processor
 Targeted software stack
 Compute oriented drivers, language, and tools
 Driver for loading computational programs
onto GPU
 512 GPU cores
 1.30 GHz
 Single precision floating point performance: 1331 GFLOPs
(2 single precision flops per clock per core)
 Double precision floating point performance: 665 GFLOPs
(1 double precision flop per clock per core; the arithmetic is checked in the sketch below)
 Internal RAM: 6 GB GDDR5
 Internal RAM speed: 177 GB/sec (compared to 30s-ish GB/sec for
regular RAM)
 Has to be plugged into a PCIe slot (at most 8 GB/sec)
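Those peak figures are just cores × clock × flops-per-clock; a quick plain-C check using only the numbers above:

#include <stdio.h>

int main(void)
{
    int    cores     = 512;
    double clock_ghz = 1.30;

    /* 2 single precision flops per clock per core, 1 double precision. */
    double peak_sp_gflops = cores * clock_ghz * 2.0;  /* ~1331 GFLOPs */
    double peak_dp_gflops = cores * clock_ghz * 1.0;  /* ~665 GFLOPs  */

    printf("peak SP: %.0f GFLOPs, peak DP: %.0f GFLOPs\n",
           peak_sp_gflops, peak_dp_gflops);
    return 0;
}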
 Calculation: TFLOPS (GPU) vs. 150 GFLOPS (CPU)
 Memory Bandwidth: ~5-10x
 Cost Benefit: a GPU in every PC means massive volume
[Figure 1.1: Enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens]
 The Good:
 Performance: focused silicon use
 High bandwidth for streaming applications
 Similar power envelope to high-end CPUs
 High volume → affordable
 The Bad:
 Programming: Streaming languages (CUDA, OpenCL, etc.)
 Requires significant application intervention / development
 Sensitive to hardware knowledge – memories, banking, resource management, etc.
 Not good at certain operations or applications
 Integer performance, irregular data, pointer logic, low compute intensity
 Questions about reliability / error
 Many have been addressed in most recent hardware models
 Knights Ferry
 32 Cores
 Wide vector units
 x86 ISA
 Mostly a test platform at this
point
 Knights Corner will be the
first real product (2012)
 Configurable logic blocks
 Interconnection mesh
 Can be incorporated into cards
or integrated inline.
 The Good:
 Performance: good silicon use (do only what you need;
maximize parallel ops/cycle)
 Rapid growth: logic cells, speed, I/O
 Power: ~1/10th that of CPUs
 Flexible: tailor to application
 The Bad:
 Programming: VHDL, Verilog, etc.
 Advances have been made here to translate high level code (C, Fortran, etc.) to HW
 Compile Time: Place and Route for the FPGA layout can take
multiple hours
 FPGAs are typically clocked at about 1/10th to 1/5th of ASIC rates
 Cost: They’re actually not cheap
 External – entire application offloading
 “Appliances” – DataPower, Azul
 Attached – targeted offloading
 PCIe cards – CUDA/FireStream GPUs, FPGA cards.
 Integrated – tighter connection
 On-chip – AMD Fusion, Cell BE, Network processing chips
 Incorporated – CPU instructions
 Vector instructions, FMA, Crypto-acceleration
[Images: Cray XK6 integrated hybrid blade; AMD “Fusion”; IBM “CloudBurst” (DataPower); Nvidia M2090]
C. Cascaval, et al., IBM Journal of R&D, 2010
 Programming accelerators requires describing:
 1. What portions of the code will run on the accelerator (as
opposed to on the CPU)
 2. How that code maps to the architecture of the accelerator
(both compute elements and memories)
 The first is typically done on a function-by-function basis
 i.e. a GPU kernel
 The second is much more variable
 Parallel directives, SIMT block description, VHDL/Verilog…
 Integrating these is not very mature at this point, but coming
__global__ void
saxpy_cuda(int n, float a, float *x, float *y)
{
    // one thread per element: compute this thread's global index
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
…
// round up so every element gets a thread
int nblocks = (n + 255) / 256;
// invoke the kernel with 256 threads per block
saxpy_cuda<<<nblocks, 256>>>(n, 2.0, x, y);
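For completeness, a minimal host-side sketch of how x and y typically reach the kernel above. Hedged: the device pointers x_d/y_d and host arrays x_h/y_h are illustrative names, not from the original slides.

/* Illustrative host-side setup for the saxpy launch above. */
float *x_d, *y_d;                                /* device pointers (hypothetical names) */
cudaMalloc((void**)&x_d, n * sizeof(float));
cudaMalloc((void**)&y_d, n * sizeof(float));
cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);

int nblocks = (n + 255) / 256;                   /* one thread per element */
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);

cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(x_d);
cudaFree(y_d);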
 There are several efforts (mostly libraries and directive
methods) to lower the barrier to entry for accelerator
programming
 Library example: Thrust – STL-like interface for GPUs
thrust::device_vector<int> D(10, 1);        // ten 1s on the device
thrust::fill(D.begin(), D.begin() + 7, 9);  // first seven become 9
thrust::host_vector<int> H(D.size());       // host-side vector (declaration added so the snippet is self-contained)
thrust::sequence(H.begin(), H.end());       // H = 0, 1, 2, ...
…
 Accelerator example: OpenACC – Like OpenMP
#pragma acc parallel [clauses]
{ structured block }
http://www.openacc-standard.org/
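As a rough sketch of the directive style (assuming an OpenACC-capable compiler; the clause choices here are illustrative), the saxpy loop from the CUDA example could be offloaded like this:

void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    /* The compiler generates the accelerator kernel and the data movement
       described by the clauses: x is copied in, y is copied in and back. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}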
1. Profile your code
 What code is heavily used (and amenable to acceleration)?
2. Write accelerator kernels for the heavily used code (Amdahl)
 Replace the CPU version with an accelerator offload
3. ???: play “chase the bottleneck” around the accelerator
 AKA re-write the kernel a dozen times (see the timing sketch below)
4. Profit!
 Faster science/engineering/finance/whatever!
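One way to play “chase the bottleneck” is to time each kernel variant as you go. A minimal sketch using CUDA events, reusing the illustrative saxpy_cuda, nblocks, and device pointers from the sketches above:

/* Time one kernel variant with CUDA events (fragment; assumes the
   saxpy_cuda kernel and device pointers from the sketches above). */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
saxpy_cuda<<<nblocks, 256>>>(n, 2.0f, x_d, y_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);           /* wait for the kernel to finish */

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);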
 Architectures are moving towards “effective use of space” (or
power).
 Focusing architectures on a specific task (at the expense of
others) can make for very efficient/effective tools (for that
task)
 HPC systems are beginning to integrate acceleration at
numerous levels, but “PCIe card GPU” is the most common
 Exploiting the most popular accelerators requires intervention
by application programmers to map codes to the architecture.
 Developing for accelerators can be challenging, as significantly
more hardware knowledge is needed to get good performance
 There are major efforts at improving this
 Tomorrow
 2 – 3 pm: CUDA Programming Part I
 3:30 – 5 pm: CUDA Programming Part II
 WSCC 2A/2B
 Tomorrow at 5:30pm
 BOF: Broad-based Efforts to Expand Parallelism
Preparedness in the Computing Workforce
 WSCC 611/612 (here)
 Wednesday at 10:30am
 Panel/Discussion: Parallelism, the Cloud, and the Tools of
the Future for the next generation of practitioners
 WSCC 2A/2B