CUDA Lecture 1 Introduction to Massively Parallel Computing
Prepared 6/4/2011 by T. O’Neil for 3460:677, Fall 2011, The
University of Akron.
A quiet revolution and potential buildup
Computation: TFLOPs vs. 100 GFLOPs
[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 2
• Solve computation-intensive problems faster
• Make infeasible problems feasible
• Reduce design time
• Solve larger problems in the same amount of time
• Improve an answer's precision
• Gain competitive advantage
Introduction to Massively Parallel Processing – Slide 4
• Another factor: modern scientific method
[Diagram: the modern scientific method – observation of nature leads to theory, which is tested by physical experimentation and, increasingly, by numerical simulation]
Introduction to Massively Parallel Processing – Slide 5
A growing number of applications in science,
engineering, business and medicine require
computing speeds that cannot be achieved by
current conventional computers because the problems are
• too computationally intensive
• too complicated to model
• too hard, expensive, slow or dangerous for the
laboratory.
Parallel processing is an approach which helps to make
these computations feasible.
Introduction to Massively Parallel Processing – Slide 6
One example: grand challenge problems that cannot
be solved in a reasonable amount of time with today’s
computers.
• Modeling large DNA structures
• Global weather forecasting
• Modeling motion of astronomical bodies
Another example: real-time applications, which have
time constraints.
If the data are not processed within the available time,
the results become meaningless.
Introduction to Massively Parallel Processing – Slide 7
Atmosphere modeled by dividing it into 3-
dimensional cells. Calculations of each cell repeated
many times to model passage of time.
For example
Suppose the whole global atmosphere is divided into cells of
size 1 mile × 1 mile × 1 mile to a height of 10 miles (10
cells high) – about 5 × 10^8 cells.
If each calculation requires 200 floating point
operations, 10^11 floating point operations are necessary in
one time step.
Introduction to Massively Parallel Processing – Slide 8
Example continued
To forecast the weather over 10 days using 10-minute
intervals, a computer operating at 100 megaflops (10^8
floating point operations per second) would take 10^7
seconds or over 100 days.
To perform the calculation in 10 minutes would require a
computer operating at 1.7 teraflops (1.7 × 10^12 floating
point operations per second).
Introduction to Massively Parallel Processing – Slide 9
Two main types
Parallel computing: Using a computer with more than one
processor to solve a problem.
Distributed computing: Using more than one computer to
solve a problem.
Motives
• Usually faster computation
• Very simple idea – that n computers operating
simultaneously can achieve the result n times faster
• It will not be n times faster for various reasons.
• Other motives include: fault tolerance, larger amount of
memory available, …
Introduction to Massively Parallel Processing – Slide 10
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The
author cannot believe that there will be any insuperable
difficulty in extending it to computers. It is not to be
expected that the necessary programming techniques will
be worked out overnight. Much experimenting remains to
be done. After all, the techniques that are commonly used
in programming today were only won at considerable toil
several years ago. In fact the advent of parallel
programming may do something to revive the pioneering
spirit in programming which seems at the present to be
degenerating into a rather dull and routine occupation …”
- S. Gill, “Parallel Programming”, The Computer Journal, April 1958
Introduction to Massively Parallel Processing – Slide 11
Applications are typically written from scratch (or
manually adapted from a sequential program)
assuming a simple model, in a high-level language (e.g.
C) with explicit parallel computing primitives.
Which means
Components difficult to reuse so similar components
coded again and again
Code difficult to develop, maintain or debug
Not really developing a long-term software solution
Introduction to Massively Parallel Processing – Slide 12
[Diagram: the memory hierarchy – processors and their registers at the highest level (small, fast, expensive), then memory levels 1, 2 and 3, down to main memory at the lowest level (large, slow, cheap)]
Introduction to Massively Parallel Processing – Slide 13
Execution speed relies on exploiting data locality
temporal locality: a data item just accessed is likely to be
used again in the near future, so keep it in the cache
spatial locality: neighboring data is also likely to be used
soon, so load them into the cache at the same time using
a ‘wide’ bus (like a multi-lane motorway)
from a programmer's point of view this is all handled
automatically – good for simplicity but maybe not for
best performance
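As a concrete illustration (a minimal sketch, not from the slides; the array name and size are arbitrary), the loop order in the C fragment below determines whether spatial locality is exploited when summing a row-major matrix:

    #include <stdio.h>
    #define N 4096
    static float a[N][N];            /* C stores this row-major: a[i][j] and a[i][j+1] are adjacent */

    int main(void) {
        float sum = 0.0f;
        /* cache-friendly: the inner loop walks consecutive addresses (spatial locality) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        /* swapping the loops (for j ... for i ... sum += a[i][j];) strides N floats
           between accesses and typically runs several times slower */
        printf("sum = %f\n", sum);
        return 0;
    }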
Introduction to Massively Parallel Processing – Slide 14
Useful work (i.e. floating point ops) can only be done
at the top of the hierarchy.
So data stored lower must be transferred to the registers
before we can work on it.
Possibly displacing other data already there.
This transfer among levels is slow.
This then becomes the bottleneck in most computations.
Spend more time moving data than doing useful work.
The result: speed depends to a large extent on problem
size.
Introduction to Massively Parallel Processing – Slide 15
Good algorithm design then requires
Keeping active data at the top of the hierarchy as long as
possible.
Minimizing movement between levels.
Need enough work to do at the top of the hierarchy to
mask the transfer time at the lower levels.
The more processors you have to work with, the larger
the problem has to be to accomplish this.
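One standard way to do this on a CPU (an illustrative sketch, not from the slides; N and TILE are arbitrary) is blocking or tiling: process the data in tiles small enough to stay in cache so each element is reused while it is still at the top of the hierarchy.

    #define N    4096
    #define TILE 64                    /* tile chosen so a TILE x TILE block fits in cache */
    static float in[N][N], out[N][N];

    /* blocked transpose: each tile of 'in' is read, and the matching tile of 'out'
       written, while both are still cache-resident, minimizing traffic to main memory */
    void transpose_blocked(void) {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        out[j][i] = in[i][j];
    }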
Introduction to Massively Parallel Processing – Slide 16
Introduction to Massively Parallel Processing – Slide 17
Each cycle, CPU takes data from registers, does an
operation, and puts the result back
Load/store operations (memory ↔ registers) also take
one cycle
CPU can do different operations each cycle
Output of one operation can be input to next
CPUs haven’t been this simple for a long time!
Introduction to Massively Parallel Processing – Slide 18
“The complexity for minimum component
costs has increased at a rate of roughly a
factor of two per year... Certainly over the
short term this rate can be expected to
continue, if not to increase. Over the
longer term, the rate of increase is a bit
more uncertain, although there is no
reason to believe it will not remain nearly
constant for at least 10 years. That means
by 1975, the number of components per
integrated circuit for minimum cost will be
65,000. I believe that such a large circuit
can be built on a single wafer.”
- Gordon E. Moore, co-founder, Intel
Electronics Magazine, 4/19/1965
Photo credit: Wikipedia
Introduction to Massively Parallel Processing – Slide 19
Translation: The number of transistors on an integrated
circuit doubles every 1½ - 2 years.
Data credit: Wikipedia
Introduction to Massively Parallel Processing – Slide 20
As the chip size (amount of circuitry) continues to
double every 18-24 months, the speed of a basic
microprocessor also doubles over that time frame.
This happens because, as we develop tricks, they are
built into the newest generation of CPUs by the
manufacturers.
Introduction to Massively Parallel Processing – Slide 21
Instruction-level parallelism (ILP)
Out-of-order execution, speculation, …
Vanishing opportunities in power-constrained world
Data-level parallelism
Vector units, SIMD execution, …
Increasing opportunities; SSE, AVX, Cell SPE,
Clearspeed, GPU, …
Thread-level parallelism
Increasing opportunities; multithreading, multicore, …
Intel Core2, AMD Phenom, Sun Niagara, STI Cell,
NVIDIA Fermi, …
Introduction to Massively Parallel Processing – Slide 22
Most processors have
multiple pipelines for
different tasks and can
start a number of
different operations each
cycle.
For example, consider the
Intel Core architecture.
Introduction to Massively Parallel Processing – Slide 23
Each core in an Intel Core 2 Duo chip has
14 stage pipeline
3 integer units (ALU)
1 floating-point addition unit (FPU)
1 floating-point multiplication unit (FPU)
FP division is very slow
2 load/store units
In principle capable of producing 3 integer and 2 FP
results per cycle
Introduction to Massively Parallel Processing – Slide 24
Technical challenges
Compiler to extract best performance, reordering
instructions if necessary
Out-of-order CPU execution to avoid delays waiting for
read/write or earlier operations
Branch prediction to minimize delays due to conditional
branching (loops, if-then-else)
Memory hierarchy to deliver data to registers fast
enough to feed the processor
These all limit the number of pipelines that can be
used and increase the chip complexity
90% of Intel chip devoted to control and data
Introduction to Massively Parallel Processing – Slide 25
[Chart: processor power versus performance (courtesy Mark Horowitz and Kevin Skadron)]
Introduction to Massively Parallel Processing – Slide 26
CPU clock stuck at about 3 GHz since 2006 due to
problems with power consumption (up to 130W per
chip)
Chip circuitry still doubling every 18-24 months
More on-chip memory and memory management units
(MMUs)
Specialized hardware (e.g. multimedia, encryption)
Multi-core (multiple CPUs on one chip)
Thus peak performance of chip still doubling every 18-
24 months
Introduction to Massively Parallel Processing – Slide 27
[Diagram: multicore CPU – a few processor/memory pairs above a shared global memory]
Handful of processors each supporting ~1 hardware
thread
On-chip memory near processors (cache, RAM or
both)
Shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 28
[Diagram: manycore GPU – many processor/memory pairs above a shared global memory]
Many processors each supporting many hardware
threads
On-chip memory near processors (cache, RAM or
both)
Shared global memory space (external DRAM)
Introduction to Massively Parallel Processing – Slide 29
Four 3.33 GHz cores, each of which can run 2 threads
Integrated MMU
Introduction to Massively Parallel Processing – Slide 30
Cannot continue to scale processor frequencies
No 10 GHz chips
Cannot continue to increase power consumption
Can’t melt chips
Can continue to increase transistor density
As per Moore’s Law
Introduction to Massively Parallel Processing – Slide 31
Exciting applications in future mass computing
market have been traditionally considered
“supercomputing applications”
Molecular dynamics simulation, video and audio coding
and manipulation, 3D imaging and visualization,
consumer game physics, virtual reality products, …
These “super apps” represent and model the physical,
concurrent world
Various granularities of parallelism exist, but…
Programming model must not hinder parallel
implementation
Data delivery needs careful management
Introduction to Massively Parallel Processing – Slide 32
Computers no longer get faster, just wider
You must rethink your algorithms to be parallel!
Data-parallel computing is most scalable solution
Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16
cores, …
You will always have more data than cores – build the
computation around the data
Introduction to Massively Parallel Processing – Slide 33
Traditional parallel architectures cover some super
apps
DSP, GPU, network and scientific applications, …
The game is to grow mainstream architectures “out” or
domain-specific architectures “in”
CUDA is the latter
[Diagram: traditional applications covered by current architectures and new applications covered by domain-specific architectures, with obstacles separating the two]
Introduction to Massively Parallel Processing – Slide 34
Introduction to Massively Parallel Processing – Slide 35
Massive economies of scale
Massively parallel
Introduction to Massively Parallel Processing – Slide 36
Produced in vast numbers for computer graphics
Increasingly being used for
Computer game “physics”
Video (e.g. HD video decoding)
Audio (e.g. MP3 encoding)
Multimedia (e.g. Adobe software)
Computational finance
Oil and gas
Medical imaging
Computational science
Introduction to Massively Parallel Processing – Slide 37
The GPU sits on a PCIe graphics card inside a standard
PC/server with one or two multicore CPUs
Introduction to Massively Parallel Processing – Slide 38
Up to 448 cores on a single chip
Simplified logic (no out-of-order execution, no branch
prediction) means much more of the chip is devoted
to computation
Arranged as multiple units with each unit being
effectively a vector unit, all cores doing the same thing
at the same time
Very high bandwidth (up to 140GB/s) to graphics
memory (up to 4GB)
Not general purpose – for parallel applications like
graphics and Monte Carlo simulations
Can also build big clusters out of GPUs
Introduction to Massively Parallel Processing – Slide 39
Four major vendors:
NVIDIA
AMD: bought ATI several years ago
IBM: co-developed Cell processor with Sony and
Toshiba for Sony Playstation, but now dropped it for
high-performance computing
Intel: was developing the “Larrabee” chip as a GPU, but it is
now aimed at high-performance computing
Introduction to Massively Parallel Processing – Slide 40
A quiet revolution and potential buildup
Computation: TFLOPs vs. 100 GFLOPs
[Chart: peak floating-point performance over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 41
A quiet revolution and potential buildup
Bandwidth: ~10x
[Chart: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
GPU in every PC – massive volume and potential impact
Introduction to Massively Parallel Processing – Slide 42
High throughput computation
GeForce GTX 280: 933 GFLOPs/sec
High bandwidth memory
GeForce GTX 280: 140 GB/sec
High availability to all
180M+ CUDA-capable GPUs in the wild
Introduction to Massively Parallel Processing – Slide 43
[Timeline, 1995–2010: NVIDIA GPU transistor counts – RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]
Introduction to Massively Parallel Processing – Slide 44
Throughput is paramount
Must paint every pixel within frame time
Scalability
Create, run and retire lots of threads very rapidly
Measured 14.8 Gthreads/sec on increment() kernel
Use multithreading to hide latency
One stalled thread is o.k. if 100 are ready to run
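The increment() kernel behind that measurement is not shown here; a minimal CUDA kernel of that kind might look like the sketch below (the names and the 256-thread block size are assumptions):

    __global__ void increment(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index for this thread
        if (i < n)
            x[i] += 1.0f;                                // each thread increments one element
    }

    // launched with enough 256-thread blocks to cover n elements:
    //   increment<<<(n + 255) / 256, 256>>>(d_x, n);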
Introduction to Massively Parallel Processing – Slide 45
Different goals produce different designs
GPU assumes work load is highly parallel
CPU must be good at everything, parallel or not
CPU: minimize latency experienced by one thread
Big on-chip caches, sophisticated control logic
GPU: maximize throughput of all threads
Number of threads in flight limited by resources → lots
of resources (registers, bandwidth, etc.)
Multithreading can hide latency → skip the big caches
Share control logic across many threads
Introduction to Massively Parallel Processing – Slide 46
GeForce 8800 (2007)
[Diagram: GeForce 8800 architecture – the host feeds an input assembler and thread execution manager; an array of streaming multiprocessors, each with a parallel data cache and texture units; load/store units connect to global memory]
Introduction to Massively Parallel Processing – Slide 47
16 highly threaded streaming multiprocessors (SMs)
> 128 floating point units (FPUs)
367 GFLOPs peak performance (25-50 times that of
current high-end microprocessors)
265 GFLOPs sustained for applications such as visual
molecular dynamics (VMD)
768 MB DRAM
86.4 GB/sec memory bandwidth
4 GB/sec bandwidth to CPU
Introduction to Massively Parallel Processing – Slide 48
Massively parallel, 128 cores, 90 watts
Massively threaded, sustains 1000s of threads per
application
30-100 times speedup over high-end microprocessors
on scientific and media applications
Medical imaging, molecular dynamics, …
Introduction to Massively Parallel Processing – Slide 49
Fermi GF100 (2010)
~ 1.5 TFLOPs (single precision), ~800 GFLOPs (double
precision)
230 GB/sec DRAM bandwidth
Introduction to Massively Parallel Processing – Slide 50
Introduction to Massively Parallel Processing – Slide 51
32 CUDA cores per SM (512 total)
8x peak FP64 performance
50% of peak FP32 performance
Direct load/store to memory
Usual linear sequence of bytes
High bandwidth (hundreds of GB/sec)
64 KB of fast, on-chip RAM
Software or hardware managed
Shared amongst CUDA cores
Enables thread communication
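As a sketch of how this on-chip memory lets threads communicate (an assumed example, not from the slides; it presumes 256-thread blocks and n a multiple of 256), each thread stages a value in shared memory, the block synchronizes, and then every thread reads a value written by a different thread:

    __global__ void reverse_within_block(float *d, int n) {
        __shared__ float s[256];                    // fast on-chip RAM, shared by the whole block
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;
        if (i < n)
            s[t] = d[i];                            // each thread writes one element
        __syncthreads();                            // wait until every thread in the block has written
        if (i < n)
            d[i] = s[blockDim.x - 1 - t];           // read an element written by another thread
    }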
Introduction to Massively Parallel Processing – Slide 52
SIMT (Single Instruction Multiple Thread) execution
Threads run in groups of 32 called warps
Minimum of 32 threads all doing the same thing at (almost)
the same time
Threads in a warp share instruction unit (IU)
All cores execute the same instructions simultaneously, but
with different data
Similar to vector computing on CRAY supercomputers
Hardware automatically handles divergence
Natural for graphics processing and much scientific
computing; also a natural choice for many-core chips to
simplify each core
Introduction to Massively Parallel Processing – Slide 53
Hardware multithreading
Hardware resource allocation and thread scheduling
Hardware relies on threads to hide latency
Threads have all resources needed to run
Any warp not waiting for something can run
Context switching is (basically) free
Introduction to Massively Parallel Processing – Slide 54
Lots of active threads is the key to high performance:
no “context switching”; each thread has its own registers,
which limits the number of active threads
threads on each SM execute in warps – execution
alternates between “active” warps, with warps becoming
temporarily “inactive” when waiting for data
for each thread, one operation completes long before the
next starts – avoids the complexity of pipeline overlaps
which can limit the performance of modern processors
memory access from device memory has a delay of 400–600
cycles; with 40 threads this is equivalent to 10–15
operations, so hopefully there’s enough computation to
hide the latency
Introduction to Massively Parallel Processing – Slide 55
Intel Core 2 / Xeon / i7
4-6 MIMD (Multiple Instruction / Multiple Data) cores
each core operates independently
each can be working with a different code, performing
different operations with entirely different data
few registers, multilevel caches
10-30 GB/s bandwidth to main memory
Introduction to Massively Parallel Processing – Slide 56
NVIDIA GTX280:
240 cores, arranged as 30 units each with 8 SIMD (Single
Instruction / Multiple Data) cores
all cores executing the same instruction at the same time, but
working on different data
only one instruction decoder needed to control all cores
functions like a vector unit
lots of registers, almost no cache
5 GB/s bandwidth to host processor (PCIe x16 gen 2)
140 GB/s bandwidth to graphics memory
Introduction to Massively Parallel Processing – Slide 57
One issue with SIMD / vector execution: what happens
with if-then-else branches?
Standard vector treatment: execute both branches but
only store the results for the correct branch.
NVIDIA treatment: works with warps of length 32. If a
particular warp only goes one way, then that is all that is
executed.
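A small CUDA sketch of the difference (illustrative only): in divergent() the condition splits threads within a warp, so both branches are executed with some threads masked off; in uniform() every thread in a warp makes the same decision, so only one path is executed.

    __global__ void divergent(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)      // neighbouring threads in the same warp disagree
            x[i] *= 2.0f;              // -> the warp executes both branches, masking threads off
        else
            x[i] += 1.0f;
    }

    __global__ void uniform(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x % 2 == 0)       // every thread in the warp takes the same branch
            x[i] *= 2.0f;              // -> only that branch is executed
        else
            x[i] += 1.0f;
    }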
Introduction to Massively Parallel Processing – Slide 58
How do NVIDIA GPUs deal with the same architectural
challenges as CPUs?
Very heavily multi-threaded, with at least 8 threads per
core:
Solves the pipeline problem, because the output of one
calculation can be used as an input for the next calculation
for the same thread
Switch from one set of threads to another when waiting for
data from graphics memory
Introduction to Massively Parallel Processing – Slide 59
Introduction to Massively Parallel Processing – Slide 60
Scalable programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
Introduction to Massively Parallel Processing – Slide 61
[Chart: reported application speedups over CPU implementations – 17x, 35x, 45x, 100x, 110–240x, and 13–457x across applications]
Introduction to Massively Parallel Processing – Slide 62
Big breakthrough in GPU computing has been
NVIDIA’s development of CUDA programming
environment
Initially driven by needs of computer games developers
Now being driven by new markets (e.g. HD video
decoding)
C plus some extensions and some C++ features (e.g.
templates)
Introduction to Massively Parallel Processing – Slide 63
Host code runs on CPU, CUDA code runs on GPU
Explicit movement of data across the PCIe connection
Very straightforward for Monte Carlo applications,
once you have a random number generator
Harder for finite difference applications
Introduction to Massively Parallel Processing – Slide 64
At the top level, we have a master process which runs
on the CPU and performs the following steps:
1. initialize card
2. allocate memory in host and on device
3. repeat as needed:
   1. copy data from host to device memory
   2. launch multiple copies of execution “kernel” on device
   3. copy data from device memory to host
4. de-allocate all memory and terminate
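A minimal host-side sketch of these steps (the kernel, array size and block size below are placeholders, not part of the lecture):

    #include <stdlib.h>

    __global__ void kernel(float *x, int n) {               // trivial stand-in kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void) {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);                // allocate memory in the host
        float *d_x;
        cudaMalloc(&d_x, bytes);                            // allocate memory on the device
                                                            // (the first CUDA call also initializes the card)
        // ... fill h_x with input data here ...

        // repeat as needed:
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // copy data from host to device
        kernel<<<(n + 255) / 256, 256>>>(d_x, n);               // launch many copies of the kernel
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // copy data from device to host

        cudaFree(d_x);                                      // de-allocate all memory and terminate
        free(h_x);
        return 0;
    }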
Introduction to Massively Parallel Processing – Slide 65
At a lower level, within the GPU:
each copy of the execution kernel executes on an SM
if the number of copies exceeds the number of SMs,
then more than one will run at a time on each SM if
there are enough registers and shared memory, and the
others will wait in a queue and execute later
all threads within one copy can access local shared
memory but can’t see what the other copies are doing
(even if they are on the same SM)
there are no guarantees on the order in which the copies
will execute
Introduction to Massively Parallel Processing – Slide 66
Augment C/C++ with minimalist abstractions
Let programmers focus on parallel algorithms, not
mechanics of parallel programming language
Provide straightforward mapping onto hardware
Good fit to GPU architecture
Maps well to multi-core CPUs too
Scale to 100s of cores and 10000s of parallel threads
GPU threads are lightweight – create/switch is free
GPU needs 1000s of threads for full utilization
Introduction to Massively Parallel Processing – Slide 67
Hierarchy of concurrent threads
Lightweight synchronization primitives
Shared memory model for cooperating threads
Introduction to Massively Parallel Processing – Slide 68
Parallel kernels composed of many threads
[Diagram: a single thread t]
All threads execute the same sequential program
Threads are grouped into thread blocks
Threads in the same block can cooperate
[Diagram: a block b of threads t0, t1, …, tB]
Threads/blocks have unique IDs
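A sketch (assumed example) of how these unique IDs let otherwise identical threads work on different data; this is the SAXPY operation that also appears later in the application table:

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        // every thread runs this same program; the built-in IDs tell each one which element is its
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }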
Introduction to Massively Parallel Processing – Slide 69
CUDA virtualizes the physical hardware
Thread is a virtualized scalar processor: registers, PC,
state
Block is a virtualized multiprocessor: threads, shared
memory
Scheduled onto physical hardware without pre-
emption
Threads/blocks launch and run to completion
Blocks should be independent
Introduction to Massively Parallel Processing – Slide 70
[Diagram: the CUDA machine model – many blocks, each with its own on-chip memory, above a shared global memory]
Introduction to Massively Parallel Processing – Slide 71
Cf. PRAM (Parallel Random Access Machine) model
[Diagram: PRAM – processors sharing a single global memory]
Global synchronization isn’t cheap
Global memory access times are expensive
Introduction to Massively Parallel Processing – Slide 72
Cf. BSP (Bulk Synchronous Parallel) model, MPI
[Diagram: processor/memory pairs connected by an interconnection network]
Distributed computing is a different setting
Introduction to Massively Parallel Processing – Slide 73
Multicore CPU
Manycore GPU
Introduction to Massively Parallel Processing – Slide 74
Philosophy: provide minimal set of extensions
necessary to expose power
Function and variable qualifiers
Execution configuration
Built-in variables and functions valid in device code
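A sketch touching each category of extension (the kernel and variable names are invented for illustration):

    __constant__ float scale;                          // variable qualifier: device constant memory
                                                       //   (set from the host with cudaMemcpyToSymbol)

    __device__ float square(float x) {                 // function qualifier: callable from device code
        return x * x;
    }

    __global__ void scale_squares(float *d, int n) {   // function qualifier: a kernel, launched from the host
        __shared__ float tile[256];                    // variable qualifier: per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x; // built-in variables: blockIdx, blockDim, threadIdx
        if (i < n) {
            tile[threadIdx.x] = d[i];
            __syncthreads();                           // built-in function: block-wide barrier
            d[i] = scale * square(tile[threadIdx.x]);
        }
    }

    // execution configuration: <<<blocks, threads per block>>>
    //   scale_squares<<<(n + 255) / 256, 256>>>(d_ptr, n);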
Introduction to Massively Parallel Processing – Slide 75
Introduction to Massively Parallel Processing – Slide 76
Application | Description | Source (lines of code) | Kernel (lines of code) | % time
H.264 | SPEC ‘06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ‘06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TPACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
Introduction to Massively Parallel Processing – Slide 77
GeForce 8800 GTX vs. 2.2 GHz Opteron 248
Introduction to Massively Parallel Processing – Slide 78
10x speedup in a kernel is typical, as long as the kernel
can occupy enough parallel threads
25x to 400x speedup if the function’s data
requirements and control flow suit the GPU and the
application is optimized
Introduction to Massively Parallel Processing – Slide 79
Parallel hardware is here to stay
GPUs are massively parallel manycore processors
Easily available and fully programmable
Parallelism and scalability are crucial for success
This presents many important research challenges
Not to mention the educational challenges
Introduction to Massively Parallel Processing – Slide 80
Reading: Chapter 1, “Programming Massively Parallel
Processors” by Kirk and Hwu.
Based on original material from
The University of Akron: Tim O’Neil, Kathy Liszka
Hiram College: Irena Lomonosov
The University of Illinois at Urbana-Champaign
David Kirk, Wen-mei W. Hwu
The University of North Carolina at Charlotte
Barry Wilkinson, Michael Allen
Oregon State University: Michael Quinn
Oxford University: Mike Giles
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/4/2011.
Introduction to Massively Parallel Processing – Slide 81