CUDA Lecture 1 Introduction to Massively Parallel Computing
Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
A quiet revolution and potential buildup
Computation: TFLOPs vs. 100 GFLOPs
[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 2
• Solve computation-intensive problems faster
• Make infeasible problems feasible
• Reduce design time
• Solve larger problems in the same amount of time
• Improve an answer's precision
• Gain competitive advantage
Introduction to Massively Parallel Processing – Slide 4
• Another factor: modern scientific method
[Diagram: the modern scientific method – observation of nature leads to theory, which is tested by physical experimentation and, increasingly, by numerical simulation]
Introduction to Massively Parallel Processing – Slide 5
A growing number of applications in science,
engineering, business and medicine require
computing speeds that cannot be achieved by
current conventional computers because the problems are
• too computationally intensive
• too complicated to model
• too hard, expensive, slow or dangerous for the
laboratory.
Parallel processing is an approach which helps to make
these computations feasible.
Introduction to Massively Parallel Processing – Slide 6
One example: grand challenge problems that cannot
be solved in a reasonable amount of time with today’s
computers.
• Modeling large DNA structures
• Global weather forecasting
• Modeling motion of astronomical bodies
Another example: real-time applications, which have
time constraints.
If the data are not processed within the available time,
the results become meaningless.
Introduction to Massively Parallel Processing – Slide 7
Atmosphere modeled by dividing it into 3-
dimensional cells. Calculations of each cell repeated
many times to model passage of time.
For example
Suppose the whole global atmosphere is divided into cells of
size 1 mile × 1 mile × 1 mile to a height of 10 miles (10
cells high) – about 5 × 10^8 cells.
If each calculation requires 200 floating point
operations, 10^11 floating point operations are necessary in
one time step.
Introduction to Massively Parallel Processing – Slide 8
Example continued
To forecast the weather over 10 days using 10-minute
intervals, a computer operating at 100 megaflops (10^8
floating point operations per second) would take 10^7
seconds or over 100 days.
To perform the calculation in 10 minutes would require a
computer operating at 1.7 teraflops (1.7 × 10^12 floating
point operations per second).
Introduction to Massively Parallel Processing – Slide 9
Two main types
Parallel computing: Using a computer with more than one
processor to solve a problem.
Distributed computing: Using more than one computer to
solve a problem.
Motives
• Usually faster computation
• Very simple idea – that n computers operating
simultaneously can achieve the result n times faster
• It will not be n times faster for various reasons.
• Other motives include: fault tolerance, larger amount of
memory available, …
Introduction to Massively Parallel Processing – Slide 10
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The
author cannot believe that there will be any insuperable
difficulty in extending it to computers. It is not to be
expected that the necessary programming techniques will
be worked out overnight. Much experimenting remains to
be done. After all, the techniques that are commonly used
in programming today were only won at considerable toil
several years ago. In fact the advent of parallel
programming may do something to revive the pioneering
spirit in programming which seems at the present to be
degenerating into a rather dull and routine occupation …”
- S. Gill, “Parallel Programming”, The Computer Journal, April 1958
Introduction to Massively Parallel Processing – Slide 11
Applications are typically written from scratch (or
manually adapted from a sequential program)
assuming a simple model, in a high-level language (e.g.
C) with explicit parallel computing primitives.
Which means
Components difficult to reuse so similar components
coded again and again
Code difficult to develop, maintain or debug
Not really developing a long-term software solution
Introduction to Massively Parallel Processing – Slide 12
[Diagram: the memory hierarchy – processors and their registers at the highest level (small, fast, expensive), then memory levels 1, 2 and 3, down to main memory at the lowest level (large, slow, cheap)]
Introduction to Massively Parallel Processing – Slide 13
Execution speed relies on exploiting data locality
temporal locality: a data item just accessed is likely to be
used again in the near future, so keep it in the cache
spatial locality: neighboring data is also likely to be used
soon, so load them into the cache at the same time using
a ‘wide’ bus (like a multi-lane motorway)
from a programmer's point of view this is all handled
automatically – good for simplicity but maybe not for
best performance
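As a concrete illustration (a minimal sketch, not from the slides; the array name and size are arbitrary), the loop order in the C fragment below determines whether spatial locality is exploited when summing a row-major matrix:

    #include <stdio.h>
    #define N 4096
    static float a[N][N];            /* C stores this row-major: a[i][j] and a[i][j+1] are adjacent */

    int main(void) {
        float sum = 0.0f;
        /* cache-friendly: the inner loop walks consecutive addresses (spatial locality) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        /* swapping the loops (for j ... for i ... sum += a[i][j];) strides N floats
           between accesses and typically runs several times slower */
        printf("sum = %f\n", sum);
        return 0;
    }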
Introduction to Massively Parallel Processing – Slide 14
Useful work (i.e. floating point ops) can only be done
at the top of the hierarchy.
So data stored lower must be transferred to the registers
before we can work on it.
Possibly displacing other data already there.
This transfer among levels is slow.
This then becomes the bottleneck in most computations.
Spend more time moving data than doing useful work.
The result: speed depends to a large extent on problem
size.
Introduction to Massively Parallel Processing – Slide 15
Good algorithm design then requires
Keeping active data at the top of the hierarchy as long as
possible.
Minimizing movement between levels.
Need enough work to do at the top of the hierarchy to
mask the transfer time at the lower levels.
The more processors you have to work with, the larger
the problem has to be to accomplish this.
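One standard way to do this on a CPU (an illustrative sketch, not from the slides; N and TILE are arbitrary) is blocking or tiling: process the data in tiles small enough to stay in cache so each element is reused while it is still at the top of the hierarchy.

    #define N    4096
    #define TILE 64                    /* tile chosen so a TILE x TILE block fits in cache */
    static float in[N][N], out[N][N];

    /* blocked transpose: each tile of 'in' is read, and the matching tile of 'out'
       written, while both are still cache-resident, minimizing traffic to main memory */
    void transpose_blocked(void) {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        out[j][i] = in[i][j];
    }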
Introduction to Massively Parallel Processing – Slide 16
Introduction to Massively Parallel Processing – Slide 17
Each cycle, CPU takes data from registers, does an
operation, and puts the result back
Load/store operations (memory ↔ registers) also take
one cycle
CPU can do different operations each cycle
Output of one operation can be input to next
CPUs haven’t been this simple for a long time!
Introduction to Massively Parallel Processing – Slide 18
“The complexity for minimum component
costs has increased at a rate of roughly a
factor of two per year... Certainly over the
short term this rate can be expected to
continue, if not to increase. Over the
longer term, the rate of increase is a bit
more uncertain, although there is no
reason to believe it will not remain nearly
constant for at least 10 years. That means
by 1975, the number of components per
integrated circuit for minimum cost will be
65,000. I believe that such a large circuit
can be built on a single wafer.”
- Gordon E. Moore, co-founder, Intel
Electronics Magazine, 4/19/1965
Photo credit: Wikipedia
Introduction to Massively Parallel Processing – Slide 19
Translation: The number of transistors on an integrated
circuit doubles every 1½ - 2 years.
Data credit: Wikipedia
Introduction to Massively Parallel Processing – Slide 20
As the chip size (amount of circuitry) continues to
double every 18-24 months, the speed of a basic
microprocessor also doubles over that time frame.
This happens because, as we develop tricks, they are
built into the newest generation of CPUs by the
manufacturers.
Introduction to Massively Parallel Processing – Slide 21
Instruction-level parallelism (ILP)
Out-of-order execution, speculation, …
Vanishing opportunities in power-constrained world
Data-level parallelism
Vector units, SIMD execution, …
Increasing opportunities; SSE, AVX, Cell SPE,
Clearspeed, GPU, …
Thread-level parallelism
Increasing opportunities; multithreading, multicore, …
Intel Core2, AMD Phenom, Sun Niagara, STI Cell,
NVIDIA Fermi, …
Introduction to Massively Parallel Processing – Slide 22
Most processors have
multiple pipelines for
different tasks and can
start a number of
different operations each
cycle.
For example, consider the
Intel Core architecture.
Introduction to Massively Parallel Processing – Slide 23
Each core in an Intel Core 2 Duo chip has
14 stage pipeline
3 integer units (ALU)
1 floating-point addition unit (FPU)
1 floating-point multiplication unit (FPU)
FP division is very slow
2 load/store units
In principle capable of producing 3 integer and 2 FP
results per cycle
Introduction to Massively Parallel Processing – Slide 24
Technical challenges
Compiler to extract best performance, reordering
instructions if necessary
Out-of-order CPU execution to avoid delays waiting for
read/write or earlier operations
Branch prediction to minimize delays due to conditional
branching (loops, if-then-else)
Memory hierarchy to deliver data to registers fast
enough to feed the processor
These all limit the number of pipelines that can be
used and increase the chip complexity
90% of Intel chip devoted to control and data
Introduction to Massively Parallel Processing – Slide 25
[Chart: processor power versus performance (courtesy Mark Horowitz and Kevin Skadron)]
Introduction to Massively Parallel Processing – Slide 26
CPU clock stuck at about 3 GHz since 2006 due to
problems with power consumption (up to 130W per
chip)
Chip circuitry still doubling every 18-24 months
More on-chip memory and memory management units
(MMUs)
Specialized hardware (e.g. multimedia, encryption)
Multi-core (multiple CPUs on one chip)
Thus peak performance of chip still doubling every 18-
24 months
Introduction to Massively Parallel Processing – Slide 27
[Diagram: multicore CPU – a few processor/memory pairs above a shared global memory]
Handful of processors each supporting ~1 hardware
thread
On-chip memory near processors (cache, RAM or
both)
Shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 28
[Diagram: manycore GPU – many processor/memory pairs above a shared global memory]
Many processors each supporting many hardware
threads
On-chip memory near processors (cache, RAM or
both)
Shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 29
Four 3.33 GHz cores, each of which can run 2 threads
Integrated MMU
Introduction to Massively Parallel Processing – Slide 30
Cannot continue to scale processor frequencies
No 10 GHz chips
Cannot continue to increase power consumption
Can’t melt chips
Can continue to increase transistor density
As per Moore’s Law
Introduction to Massively Parallel Processing – Slide 31
Exciting applications in future mass computing
market have been traditionally considered
“supercomputing applications”
Molecular dynamics simulation, video and audio coding
and manipulation, 3D imaging and visualization,
consumer game physics, virtual reality products, …
These “super apps” represent and model the physical,
concurrent world
Various granularities of parallelism exist, but…
Programming model must not hinder parallel
implementation
Data delivery needs careful management
Introduction to Massively Parallel Processing – Slide 32
Computers no longer get faster, just wider
You must rethink your algorithms to be parallel!
Data-parallel computing is most scalable solution
Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16
cores, …
You will always have more data than cores – build the
computation around the data
Introduction to Massively Parallel Processing – Slide 33
Traditional parallel architectures cover some super
apps
DSP, GPU, network and scientific applications, …
The game is to grow mainstream architectures “out” or
domain-specific architectures “in”
CUDA is the latter
[Diagram: traditional applications covered by current architectures and new applications covered by domain-specific architectures, with obstacles separating the two]
Introduction to Massively Parallel Processing – Slide 34
Introduction to Massively Parallel Processing – Slide 35
Massive economies of scale
Massively parallel
Introduction to Massively Parallel Processing – Slide 36
Produced in vast numbers for computer graphics
Increasingly being used for
Computer game “physics”
Video (e.g. HD video decoding)
Audio (e.g. MP3 encoding)
Multimedia (e.g. Adobe software)
Computational finance
Oil and gas
Medical imaging
Computational science
Introduction to Massively Parallel Processing – Slide 37
The GPU sits on a PCIe graphics card inside a standard
PC/server with one or two multicore CPUs
Introduction to Massively Parallel Processing – Slide 38
Up to 448 cores on a single chip
Simplified logic (no out-of-order execution, no branch
prediction) means much more of the chip is devoted
to computation
Arranged as multiple units with each unit being
effectively a vector unit, all cores doing the same thing
at the same time
Very high bandwidth (up to 140GB/s) to graphics
memory (up to 4GB)
Not general purpose – for parallel applications like
graphics and Monte Carlo simulations
Can also build big clusters out of GPUs
Introduction to Massively Parallel Processing – Slide 39
Four major vendors:
NVIDIA
AMD: bought ATI several years ago
IBM: co-developed Cell processor with Sony and
Toshiba for Sony Playstation, but now dropped it for
high-performance computing
Intel: was developing the “Larrabee” chip as a GPU, but it is
now aimed at high-performance computing
Introduction to Massively Parallel Processing – Slide 40
A quiet revolution and potential buildup
Computation: TFLOPs vs. 100 GFLOPs
[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 41
A quiet revolution and potential buildup
Bandwidth: ~10x
[Chart: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 42
High throughput computation
GeForce GTX 280: 933 GFLOPs/sec
High bandwidth memory
GeForce GTX 280: 140 GB/sec
High availability to all
180M+ CUDA-capable GPUs in the wild
Introduction to Massively Parallel Processing – Slide 43
[Timeline, 1995–2010: NVIDIA GPU transistor counts – RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]
Introduction to Massively Parallel Processing – Slide 44
Throughput is paramount
Must paint every pixel within frame time
Scalability
Create, run and retire lots of threads very rapidly
Measured 14.8 Gthreads/sec on increment() kernel
Use multithreading to hide latency
One stalled thread is o.k. if 100 are ready to run
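The increment() kernel behind that measurement is not shown here; a minimal CUDA kernel of that kind might look like the sketch below (the names and the 256-thread block size are assumptions):

    __global__ void increment(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index for this thread
        if (i < n)
            x[i] += 1.0f;                                // each thread increments one element
    }

    // launched with enough 256-thread blocks to cover n elements:
    //   increment<<<(n + 255) / 256, 256>>>(d_x, n);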
Introduction to Massively Parallel Processing – Slide 45
Different goals produce different designs
GPU assumes work load is highly parallel
CPU must be good at everything, parallel or not
CPU: minimize latency experienced by one thread
Big on-chip caches, sophisticated control logic
GPU: maximize throughput of all threads
Number of threads in flight limited by resources → lots
of resources (registers, bandwidth, etc.)
Multithreading can hide latency → skip the big caches
Share control logic across many threads
Introduction to Massively Parallel Processing – Slide 46
GeForce 8800 (2007)
[Diagram: GeForce 8800 architecture – the host feeds an input assembler and thread execution manager; an array of streaming multiprocessors, each with a parallel data cache and texture units; load/store units connect to global memory]
Introduction to Massively Parallel Processing – Slide 47
16 highly threaded streaming multiprocessors (SMs)
> 128 floating point units (FPUs)
367 GFLOPs peak performance (25-50 times that of
current high-end microprocessors)
265 GFLOPs sustained for applications such as visual
molecular dynamics (VMD)
768 MB DRAM
86.4 GB/sec memory bandwidth
4 GB/sec bandwidth to CPU
Introduction to Massively Parallel Processing – Slide 48
Massively parallel, 128 cores, 90 watts
Massively threaded, sustains 1000s of threads per
application
30-100 times speedup over high-end microprocessors
on scientific and media applications
Medical imaging, molecular dynamics, …
Introduction to Massively Parallel Processing – Slide 49
Fermi GF100 (2010)
~ 1.5 TFLOPs (single precision), ~800 GFLOPs (double
precision)
230 GB/sec DRAM bandwidth
Introduction to Massively Parallel Processing – Slide 50
Introduction to Massively Parallel Processing – Slide 51
32 CUDA cores per SM (512 total)
8x peak FP64 performance
50% of peak FP32 performance
Direct load/store to memory
Usual linear sequence of bytes
High bandwidth (hundreds of GB/sec)
64 KB of fast, on-chip RAM
Software or hardware managed
Shared amongst CUDA cores
Enables thread communication
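As a sketch of how this on-chip memory lets threads communicate (an assumed example, not from the slides; it presumes 256-thread blocks and n a multiple of 256), each thread stages a value in shared memory, the block synchronizes, and then every thread reads a value written by a different thread:

    __global__ void reverse_within_block(float *d, int n) {
        __shared__ float s[256];                    // fast on-chip RAM, shared by the whole block
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        if (i < n)
            s[t] = d[i];                            // each thread writes one element
        __syncthreads();                            // wait until every thread in the block has written
        if (i < n)
            d[i] = s[blockDim.x - 1 - t];           // read an element written by another thread
    }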
Introduction to Massively Parallel Processing – Slide 52
SIMT (Single Instruction Multiple Thread) execution
Threads run in groups of 32 called warps
Minimum of 32 threads all doing the same thing at (almost)
the same time
Threads in a warp share instruction unit (IU)
All cores execute the same instructions simultaneously, but
with different data
Similar to vector computing on CRAY supercomputers
Hardware automatically handles divergence
Natural for graphics processing and much scientific
computing; also a natural choice for many-core chips to
simplify each core
Introduction to Massively Parallel Processing – Slide 53
Hardware multithreading
Hardware resource allocation and thread scheduling
Hardware relies on threads to hide latency
Threads have all resources needed to run
Any warp not waiting for something can run
Context switching is (basically) free
Introduction to Massively Parallel Processing – Slide 54
Lots of active threads is the key to high performance:
no “context switching”; each thread has its own registers,
which limits the number of active threads
threads on each SM execute in warps – execution
alternates between “active” warps, with warps becoming
temporarily “inactive” when waiting for data
for each thread, one operation completes long before the
next starts – avoids the complexity of pipeline overlaps
which can limit the performance of modern processors
memory access from device memory has a delay of 400–600
cycles; with 40 threads this is equivalent to 10–15
operations, so hopefully there’s enough computation to
hide the latency
Introduction to Massively Parallel Processing – Slide 55
Intel Core 2 / Xeon / i7
4-6 MIMD (Multiple Instruction / Multiple Data) cores
each core operates independently
each can be working with a different code, performing
different operations with entirely different data
few registers, multilevel caches
10-30 GB/s bandwidth to main memory
Introduction to Massively Parallel Processing – Slide 56
NVIDIA GTX280:
240 cores, arranged as 30 units each with 8 SIMD (Single
Instruction / Multiple Data) cores
all cores executing the same instruction at the same time, but
working on different data
only one instruction decoder needed to control all cores
functions like a vector unit
lots of registers, almost no cache
5 GB/s bandwidth to host processor (PCIe x16 gen 2)
140 GB/s bandwidth to graphics memory
Introduction to Massively Parallel Processing – Slide 57
One issue with SIMD / vector execution: what happens
with if-then-else branches?
Standard vector treatment: execute both branches but
only store the results for the correct branch.
NVIDIA treatment: works with warps of length 32. If a
particular warp only goes one way, then that is all that is
executed.
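A small CUDA sketch of the difference (illustrative only): in divergent() the condition splits threads within a warp, so both branches are executed with some threads masked off; in uniform() every thread in a warp makes the same decision, so only one path is executed.

    __global__ void divergent(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)      // neighbouring threads in the same warp disagree
            x[i] *= 2.0f;              // -> the warp executes both branches, masking threads off
        else
            x[i] += 1.0f;
    }

    __global__ void uniform(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0)       // every thread in the warp takes the same branch
            x[i] *= 2.0f;              // -> only that branch is executed
        else
            x[i] += 1.0f;
    }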
Introduction to Massively Parallel Processing – Slide 58
How do NVIDIA GPUs deal with the same architectural
challenges as CPUs?
Very heavily multi-threaded, with at least 8 threads per
core:
Solves the pipeline problem, because the output of one
calculation can be used as an input for the next calculation
for the same thread
Switch from one set of threads to another when waiting for
data from graphics memory
Introduction to Massively Parallel Processing – Slide 59
Introduction to Massively Parallel Processing – Slide 60
Scalable programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
Introduction to Massively Parallel Processing – Slide 61
[Chart: reported application speedups over CPU implementations – 17x, 35x, 45x, 100x, 110–240x, and 13–457x across applications]
Introduction to Massively Parallel Processing – Slide 62
Big breakthrough in GPU computing has been
NVIDIA’s development of CUDA programming
environment
Initially driven by needs of computer games developers
Now being driven by new markets (e.g. HD video
decoding)
C plus some extensions and some C++ features (e.g.
templates)
Introduction to Massively Parallel Processing – Slide 63
Host code runs on CPU, CUDA code runs on GPU
Explicit movement of data across the PCIe connection
Very straightforward for Monte Carlo applications,
once you have a random number generator
Harder for finite difference applications
Introduction to Massively Parallel Processing – Slide 64
At the top level, we have a master process which runs
on the CPU and performs the following steps:
1. initialize card
2. allocate memory in host and on device
3. repeat as needed:
   1. copy data from host to device memory
   2. launch multiple copies of execution “kernel” on device
   3. copy data from device memory to host
4. de-allocate all memory and terminate
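A minimal host-side sketch of these steps (the kernel, array size and block size below are placeholders, not part of the lecture):

    #include <stdlib.h>

    __global__ void kernel(float *x, int n) {               // trivial stand-in kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void) {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);                // allocate memory in the host
        float *d_x;
        cudaMalloc(&d_x, bytes);                            // allocate memory on the device
                                                            // (the first CUDA call also initializes the card)
        // ... fill h_x with input data here ...

        // repeat as needed:
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // copy data from host to device
        kernel<<<(n + 255) / 256, 256>>>(d_x, n);               // launch many copies of the kernel
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // copy data from device to host

        cudaFree(d_x);                                      // de-allocate all memory and terminate
        free(h_x);
        return 0;
    }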
Introduction to Massively Parallel Processing – Slide 65
At a lower level, within the GPU:
each copy of the execution kernel executes on an SM
if the number of copies exceeds the number of SMs,
then more than one will run at a time on each SM if
there are enough registers and shared memory, and the
others will wait in a queue and execute later
all threads within one copy can access local shared
memory but can’t see what the other copies are doing
(even if they are on the same SM)
there are no guarantees on the order in which the copies
will execute
Introduction to Massively Parallel Processing – Slide 66
Augment C/C++ with minimalist abstractions
Let programmers focus on parallel algorithms, not
mechanics of parallel programming language
Provide straightforward mapping onto hardware
Good fit to GPU architecture
Maps well to multi-core CPUs too
Scale to 100s of cores and 10000s of parallel threads
GPU threads are lightweight – create/switch is free
GPU needs 1000s of threads for full utilization
Introduction to Massively Parallel Processing – Slide 67
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads
Introduction to Massively Parallel Processing – Slide 68
Parallel kernels composed of many threads
[Diagram: a single thread t]
All threads execute the same sequential program
Threads are grouped into thread blocks
Threads in the same block can cooperate
[Diagram: a block b of threads t0, t1, …, tB]
Threads/blocks have unique IDs
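A sketch (assumed example) of how these unique IDs let otherwise identical threads work on different data; this is the SAXPY operation that also appears later in the application table:

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        // every thread runs this same program; the built-in IDs tell each one which element is its
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }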
Introduction to Massively Parallel Processing – Slide 69
CUDA virtualizes the physical hardware
Thread is a virtualized scalar processor: registers, PC,
state
Block is a virtualized multiprocessor: threads, shared
memory
Scheduled onto physical hardware without pre-
emption
Threads/blocks launch and run to completion
Blocks should be independent
Introduction to Massively Parallel Processing – Slide 70
[Diagram: the CUDA machine model – many blocks, each with its own on-chip memory, above a shared global memory]
Introduction to Massively Parallel Processing – Slide 71
Cf. PRAM (Parallel Random Access Machine) model
[Diagram: PRAM – processors sharing a single global memory]
Global synchronization isn’t cheap
Global memory access times are expensive
Introduction to Massively Parallel Processing – Slide 72
Cf. BSP (Bulk Synchronous Parallel) model, MPI
[Diagram: processor/memory pairs connected by an interconnection network]
Distributed computing is a different setting
Introduction to Massively Parallel Processing – Slide 73
Multicore CPU
Manycore GPU
Introduction to Massively Parallel Processing – Slide 74
Philosophy: provide minimal set of extensions
necessary to expose power
Function and variable qualifiers
Execution configuration
Built-in variables and functions valid in device code
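A sketch touching each category of extension (the kernel and variable names are invented for illustration):

    __constant__ float scale;                          // variable qualifier: device constant memory
                                                       //   (set from the host with cudaMemcpyToSymbol)

    __device__ float square(float x) {                 // function qualifier: callable from device code
        return x * x;
    }

    __global__ void scale_squares(float *d, int n) {   // function qualifier: a kernel, launched from the host
        __shared__ float tile[256];                    // variable qualifier: per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x; // built-in variables: blockIdx, blockDim, threadIdx
        if (i < n) {
            tile[threadIdx.x] = d[i];
            __syncthreads();                           // built-in function: block-wide barrier
            d[i] = scale * square(tile[threadIdx.x]);
        }
    }

    // execution configuration: <<<blocks, threads per block>>>
    //   scale_squares<<<(n + 255) / 256, 256>>>(d_ptr, n);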
Introduction to Massively Parallel Processing – Slide 75
Introduction to Massively Parallel Processing – Slide 76
Application | Description | Source (lines of code) | Kernel (lines of code) | % time
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
Introduction to Massively Parallel Processing – Slide 77
GeForce 8800 GTX vs. 2.2 GHz Opteron 248
Introduction to Massively Parallel Processing – Slide 78
10x speedup in a kernel is typical, as long as the kernel
can occupy enough parallel threads
25x to 400x speedup if the function’s data
requirements and control flow suit the GPU and the
application is optimized
Introduction to Massively Parallel Processing – Slide 79
Parallel hardware is here to stay
GPUs are massively parallel manycore processors
Easily available and fully programmable
Parallelism and scalability are crucial for success
This presents many important research challenges
Not to mention the educational challenges
Introduction to Massively Parallel Processing – Slide 80
Reading: Chapter 1, “Programming Massively Parallel
Processors” by Kirk and Hwu.
Based on original material from
The University of Akron: Tim O’Neil, Kathy Liszka
Hiram College: Irena Lomonosov
The University of Illinois at Urbana-Champaign
David Kirk, Wen-mei W. Hwu
The University of North Carolina at Charlotte
Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Oxford University: Mike Giles
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/4/2011.
Introduction to Massively Parallel Processing – Slide 81