Introduction to CUDA (1 of n*)
Joseph Kider
University of Pennsylvania
CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
GPU architecture review
CUDA
First of two or three dedicated classes
Acknowledgements
Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work:
http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
David Kirk and Wen-mei Hwu's UIUC course:
http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
GPUs are:
Parallel
Multithreaded
Many-core
GPUs have:
Tremendous computational horsepower
High memory bandwidth
GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation
Graphics!
Transistors are devoted to:
Processing
Not:
Data caching
Flow control
GPU Architecture Review
Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Threading Hardware in G80
Sources
Slides by ECE 498 AL: Programming Massively Parallel Processors, Wen-mei Hwu
John Nickolls, NVIDIA
Fixed-function pipeline
[Pipeline diagram: 3D Application or Game → 3D API commands (OpenGL or Direct3D) → GPU front end → vertex processor → primitive assembly → rasterization and interpolation → fragment processor → raster operations → frame buffer. The CPU-GPU boundary sits at AGP/PCIe.]
Programmable pipeline
[Pipeline diagram: same stages as above, but with a programmable vertex processor and a programmable fragment processor in place of the fixed vertex and fragment stages.]
Unified Programmable pipeline
[Pipeline diagram: the separate vertex and fragment stages are replaced by a single unified vertex, fragment, and geometry processor; primitive assembly, rasterization and interpolation, raster operations, and the frame buffer are unchanged.]
General Diagram (6800/NV40)
TurboCache
Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the surviving sets of 4 pixels (quads) are passed on to the fragment units.
Why the NV40 series was better
Massive parallelism
Scalability
Lower-end products have fewer pixel pipes and fewer vertex shader units
Computation power
222 million transistors
First to comply with Microsoft's DirectX 9 spec
Dynamic branching in pixel shaders
Dynamic Branching
Helps detect if a pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz
This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won't get much battery-based runtime for a 3D game
GeForce 7800 GTX
Parallelism
[Block diagram: 8 vertex engines → Z-cull → triangle setup/raster → shader instruction dispatch → 24 pixel shaders → fragment crossbar → 16 raster operation pipelines → 4 memory partitions.]
G80 – Graphics Mode
The future of GPUs is programmable processing
So – build the architecture around the processor
[Block diagram: Host → Input Assembler → Setup / Rstr / ZCull, with Vtx, Geom, and Pixel Thread Issue units feeding an array of SP/TF clusters with L1 caches, backed by L2 caches and frame buffer (FB) partitions, all coordinated by a thread processor.]
G80 CUDA mode – A Device Example
Processors execute computing threads
New operating mode/HW interface for computing
[Block diagram: Host → Input Assembler → Thread Execution Manager, feeding SP clusters that each have a parallel data cache and texture unit; load/store units connect everything to global memory.]
Why Use the GPU for Computing?
The GPU has evolved into a very flexible and powerful processor:
It's programmable using high-level languages
It supports 32-bit floating point precision
It offers lots of GFLOPS
GPU in every PC and workstation
[Chart: GFLOPS over time — NV30 = GeForce FX 5800, NV35 = GeForce FX 5950 Ultra, NV40 = GeForce 6800 Ultra, G70 = GeForce 7800 GTX, G71 = GeForce 7900 GTX, G80 = GeForce 8800 GTX.]
What is Behind such Evolution?
The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: a CPU die dominated by control logic and cache next to its DRAM, versus a GPU die dominated by ALUs next to its DRAM.]
The fast-growing video game industry exerts strong economic pressure that forces constant innovation
What is (Historical) GPGPU?
General-purpose computation using the GPU and graphics API in applications other than 3D graphics
GPU accelerates critical path of application
Data-parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation
Applications – see GPGPU.org
Previous GPGPU Constraints
Dealing with graphics API
Working with the corner cases of the graphics API
Addressing modes
Limited texture size/dimension
Shader capabilities
Limited outputs
Instruction sets
Lack of integer & bit ops
Communication limited
Between pixels
Scatter a[i] = p
[Diagram: a fragment program reading from input registers, textures, and constants, using temp registers, and writing only to output registers / the frame buffer (FB); resources are per thread, per shader, and per context.]
An Example of Physical Reality Behind CUDA
CPU (host)
GPU w/ local DRAM (device)
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code (SPMD)
Each thread has an ID that it uses to compute memory addresses and make control decisions
[Figure: threads with threadID 0–7 each executing:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]
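A minimal sketch of such a kernel is shown below; the device function func, the array names, and the indexing are placeholders standing in for whatever per-element work the slide's snippet implies, not code from the slides.

__device__ float func(float x) { return 2.0f * x; }  // placeholder per-element work

__global__ void processArray(const float* input, float* output)
{
    // every thread runs this same code (SPMD); only threadID differs
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}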
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks
Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
Threads in different blocks cannot cooperate
[Figure: Thread Block 0 through Thread Block N - 1, each with threadIDs 0–7 running the same per-element snippet as above.]
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution
For hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
[Figure: Host launches Kernel 1 on Grid 1 (blocks (0, 0) through (2, 1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5x3 array of threads. Courtesy: NVIDIA]
Block and Thread IDs
Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
Image processing
Solving PDEs on volumes
…
[Figure: Device with Grid 1 of blocks (0, 0) through (2, 1); Block (1, 1) expanded into a 5x3 array of threads. Courtesy: NVIDIA]
CUDA Device Memory Space Overview
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
The host can R/W global, constant, and texture memories
[Figure: (Device) Grid with Block (0, 0) and Block (1, 0), each with shared memory, registers, threads, and local memory; global, constant, and texture memory are accessible to both blocks and the host.]
Global, Constant, and Texture Memories (Long Latency Accesses)
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Texture and constant memories
Constants initialized by host
Contents visible to all threads
[Figure: same device memory space diagram as above. Courtesy: NVIDIA]
Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
Image processing
[Figure 3.2: An example of CUDA thread organization — Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 4x2x2 array of threads. Courtesy: NVIDIA]
CUDA Memory Model Overview
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Long latency access
We will focus on global memory for now
[Figure: Grid with Block (0, 0) and Block (1, 0), each with shared memory, registers, and threads, all connected to global memory along with the host.]
Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
Available in laptops, desktops, and clusters
GPU parallelism is doubling every year
Programming model scales transparently
Programmable in C with CUDA
[Product photos: GeForce 8800, Tesla D870, Tesla S870]
Single-Program Multiple-Data (SPMD)
CUDA integrated CPU + GPU application C program
Serial C code executes on CPU
Parallel kernel C code executes on GPU thread blocks
[Figure: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args); on Grid 0, then more CPU serial code, then GPU parallel kernel KernelB<<< nBlk, nTid >>>(args); on Grid 1 — a sketch of this pattern follows.]
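A sketch of that alternation in host code, assuming KernelA and KernelB are kernels launched with the parameters the slide shows (the kernel bodies here are invented placeholders, and d_data is assumed to be device memory allocated elsewhere):

__global__ void KernelA(float* data) { data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
__global__ void KernelB(float* data) { data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void run(float* d_data, int nBlk, int nTid)
{
    // ... serial C code on the CPU ...
    KernelA<<<nBlk, nTid>>>(d_data);   // parallel kernel on the GPU (Grid 0)
    // ... more serial C code on the CPU ...
    KernelB<<<nBlk, nTid>>>(d_data);   // parallel kernel on the GPU (Grid 1)
}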
Grids and Blocks
A kernel is executed as a grid of thread blocks
All threads share global memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution using barriers
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
[Figure 3.2: An example of CUDA thread organization — Host, Device, Kernel 1 on Grid 1, Kernel 2 on Grid 2, Block (1, 1) expanded into a 4x2x2 array of threads. Courtesy: NVIDIA]
CUDA Thread Block
Programmer declares (Thread) Block:
Block size 1 to 512 concurrent threads
Block shape 1D, 2D, or 3D
Block dimensions in threads
All threads in a Block execute the same thread program
Threads share data and synchronize while doing their share of the work
Threads have thread id numbers within Block
[Figure: a CUDA Thread Block with thread ids 0, 1, 2, 3, …, m running the same thread program. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Block diagram: a Streaming Processor Array (SPA) of Texture Processor Clusters (TPCs); each TPC contains a TEX unit and two Streaming Multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.]
CUDA Processor Terminology
SPA: Streaming Processor Array (variable across the GeForce 8-series; 8 TPCs in the GeForce 8800)
TPC: Texture Processor Cluster (2 SM + TEX)
SM: Streaming Multiprocessor (8 SP); multi-threaded processor core; the fundamental processing unit for a CUDA thread block
SP: Streaming Processor
Streaming Multiprocessor (SM)
8 Streaming Processors (SP)
2 Super Function Units (SFU)
Multi-threaded instruction dispatch
1 to 512 threads active
Shared instruction fetch per 32 threads
Cover latency of texture and global memory loads
20+ GFLOPS
16 KB shared memory
[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.]
G80 Thread Computing Pipeline
Processors execute computing threads
The future of GPUs is programmable processing
Alternative operating mode specifically for computing
Generates thread grids based on kernel calls
So – build the architecture around the processor
[Block diagram: the graphics-mode front end (Host → Input Assembler → Setup / Rstr / ZCull → Vtx/Geom/Pixel Thread Issue) overlaid with the compute-mode front end (Host → Input Assembler → Thread Execution Manager); both feed the same SP clusters with parallel data caches, texture units (TF, texture L1), load/store units, L2 caches, frame buffer (FB) partitions, and global memory, coordinated by a thread processor.]
Thread Life Cycle in HW
Grid is launched on the SPA
Thread Blocks are serially distributed to all the SMs
Each SM launches Warps of Threads
Potentially >1 Thread Block per SM
2 levels of parallelism
SM schedules and executes Warps that are ready to run
As Warps and Thread Blocks complete, resources are freed so the SPA can distribute more Thread Blocks
[Figure: Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) expanded into a 5x3 array of threads.]
SM Executes Blocks
Threads are assigned to SMs at Block granularity
Up to 8 Blocks per SM, as resources allow
An SM in G80 can take up to 768 threads
Could be 256 (threads/block) * 3 blocks
Or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
[Diagram: SM 0 and SM 1, each with an MT IU, SPs, and shared memory holding resident blocks of threads t0 t1 t2 … tm; a TF with texture L1 and an L2 sit in front of memory.]
Thread Scheduling/Execution
Each Thread Block is divided into 32-thread Warps
This is an implementation decision, not part of the CUDA programming model
Warps are the scheduling units in an SM
If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM?
Each Block is divided into 256/32 = 8 Warps
There are 8 * 3 = 24 Warps
[Diagram: Block 1 Warps and Block 2 Warps (t0 t1 t2 … t31) feeding a Streaming Multiprocessor with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs.]
SM Warp Scheduling
SM hardware implements zero-overhead Warp scheduling
Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible Warps are selected for execution on a prioritized scheduling policy
All threads in a Warp execute the same instruction when selected
4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
[Timeline: the SM multithreaded Warp scheduler interleaves, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96.]
SM Instruction Buffer – Warp Scheduling
Fetch one warp instruction/cycle
From instruction L1 cache
Into any instruction buffer slot
Issue one "ready-to-go" warp instruction/cycle
From any warp - instruction buffer slot
Operand scoreboarding used to prevent hazards
Issue selection based on round-robin/age of warp
[Diagram: I$ (L1) → multithreaded instruction buffer → register file (RF), C$ (L1), and shared memory → operand select → MAD and SFU units.]
Scoreboarding
All register operands of all instructions in the Instruction Buffer are scoreboarded
Instruction becomes ready after the needed values are deposited
Prevents hazards
Cleared instructions are eligible for issue
Decoupled Memory/Processor pipelines
Any thread can continue to issue instructions until scoreboarding prevents issue
Allows Memory/Processor ops to proceed in the shadow of Memory/Processor ops of other waiting threads
[Timeline: warps from TB1, TB2, and TB3 (TB = Thread Block, W = Warp) are interleaved so that stalls such as TB1 W1, TB2 W1, and TB3 W2 are hidden by issuing instructions from other warps.]
Granularity Considerations
For Matrix Multiplication, should I use 4X4, 8X8, 16X16 or 32X32 tiles?
For 4X4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
There are 8 warps, but each warp is only half full.
For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!
There are 16 warps available for scheduling in each SM
Each warp spans four slices in the y dimension
For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity, unless other resource considerations overrule.
There are 24 warps available for scheduling in each SM
Each warp spans two slices in the y dimension
Memory Hardware in G80
CUDA Device Memory Space: Review
Each thread can:
R/W per-thread registers
R/W per-thread local memory
R/W per-block shared memory
R/W per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
The host can R/W global, constant, and texture memory
[Figure: (Device) Grid with Block (0, 0) and Block (1, 0), each with shared memory, registers, threads, and local memory; global, constant, and texture memory are shared with the host.]
Parallel Memory Sharing
Local Memory: per-thread
Private per thread
Auto variables, register spill
Shared Memory: per-block
Shared by threads of the same block
Inter-thread communication
Global Memory: per-application
Shared by all threads
Inter-Grid communication
Sequential grids in time
[Diagram: a thread with its local memory, a block with its shared memory, and Grid 0 … Grid 1 sharing global memory.]
SM Memory Architecture
Threads in a block share data & results
In memory and shared memory
Synchronize at barrier instruction
Per-block shared memory allocation
Keeps data close to the processor
[Diagram: SM 0 and SM 1 with MT IU, SPs, shared memory, and resident blocks; a TF with texture L1 and an L2 sit in front of memory. Courtesy: John Nickolls, NVIDIA]
SM Register File
Register File (RF)
32 KB (8K entries) for each SM in G80
TEX pipe can also read/write RF
2 SMs share 1 TEX
Load/Store pipe can also read/write RF
[Diagram: I$ (L1) → multithreaded instruction buffer → register file (RF), C$ (L1), and shared memory → operand select → MAD and SFU units.]
Programmer View of Register File
There are 8192 registers in each SM in G80
This is an implementation decision, not part of CUDA
Registers are dynamically partitioned across all blocks assigned to the SM
Once assigned to a block, a register is not accessible by threads in other blocks
[Diagram: the register file partitioned across 3 blocks vs. 4 blocks.]
Matrix Multiplication Example
If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
Each block requires 10*256 = 2560 registers
8192 = 3 * 2560 + change
So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases the use of registers by 1?
Each Block now requires 11*256 = 2816 registers
8192 < 2816 * 3, so only two Blocks can run on an SM
More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers
One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
This allows for finer-grain threading than traditional CPU threading models
The compiler can trade off between instruction-level parallelism and thread-level parallelism
Let's program this thing!
GPU Computing History
2001/2002 – researchers see the GPU as a data-parallel coprocessor
The GPGPU field is born
2007 – NVIDIA releases CUDA
CUDA – Compute Unified Device Architecture
GPGPU shifts to GPU Computing
2008 – Khronos releases the OpenCL specification
CUDA Abstractions
A hierarchy of thread groups
Shared memories
Barrier synchronization
CUDA Terminology
Host – typically the CPU
Code written in ANSI C
Device – typically the GPU (data-parallel)
Code written in extended ANSI C
Host and device have separate memories
CUDA Program
Contains both host and device code
CUDA Terminology
Kernel – data-parallel function
Invoking a kernel creates lightweight threads on the device
Threads are generated and scheduled with hardware
Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
Executed N times in parallel by N different CUDA threads
[Code figure callouts: Thread ID, Declaration Specifier, Execution Configuration — a minimal sketch follows.]
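The annotated code figure is not reproduced in this transcript; below is a minimal vector-add sketch in the same spirit, with the three callouts marked in comments (the kernel name VecAdd and the size N are illustrative, not from the slide):

// Declaration specifier: __global__ marks a function as a kernel that runs on the device
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;        // Thread ID: each of the N threads handles one element
    C[i] = A[i] + B[i];
}

// Execution configuration: launch one block of N threads
// VecAdd<<<1, N>>>(d_A, d_B, d_C);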
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Grid – one or more thread blocks
1D or 2D
Block – array of threads
1D, 2D, or 3D
Each block in a grid has the same number of threads
Each thread in a block can
Synchronize
Access shared memory
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block – 1D, 2D, or 3D
Example: Index into vector, matrix, volume
Thread Hierarchies
Thread ID: Scalar thread identifier
Thread Index: threadIdx
1D: Thread ID == Thread Index
2D block with size (Dx, Dy)
Thread ID of index (x, y) == x + y Dx
3D block with size (Dx, Dy, Dz)
Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
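As a sketch, the same flattening written with the built-in index variables (the helper name is illustrative):

__device__ int flatThreadId()
{
    // 1D: ID == threadIdx.x
    // 2D block (Dx, Dy): ID == x + y*Dx
    // 3D block (Dx, Dy, Dz): ID == x + y*Dx + z*Dx*Dy
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}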
Thread Hierarchies
[Figure: a 2D thread index within one 2D thread block.]
Thread Hierarchies
Thread Block
Group of threads
G80 and GT200: Up to 512 threads
Fermi: Up to 1024 threads
Reside on same processor core
Share memory of that core
Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
Block Index: blockIdx
Dimension: blockDim
1D
or 2D
Thread Hierarchies
16x16 threads per block
2D thread block
Thread Hierarchies
Example: N = 32
16x16 threads per block (independent of N)
threadIdx ([0, 15], [0, 15])
2x2 thread blocks in grid
blockIdx ([0, 1], [0, 1])
blockDim = 16
i = [0, 1] * 16 + [0, 15]
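A sketch of that index computation inside a kernel (the kernel name and the element-wise work are illustrative):

// N = 32, 16x16 threads per block, 2x2 blocks in the grid
__global__ void touchMatrix(float* A, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // [0, 1] * 16 + [0, 15]
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        A[j * N + i] += 1.0f;                        // each thread touches one element
}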
Thread Hierarchies
Thread blocks execute independently
In any order: parallel or series
Scheduled in any order by any number of cores
Allows code to scale with core count
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Thread Hierarchies
Threads in a block
Share (limited) low-latency memory
Synchronize execution
To coordinate memory accesses
__syncthreads()
Barrier – threads in block wait until all threads reach this
Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
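A small sketch of the barrier in use: threads stage data in shared memory, synchronize, then read an element written by another thread (the kernel and the reversal pattern are illustrative only; launch with at most 256 threads in one block):

__global__ void reverseBlock(float* data)
{
    __shared__ float tile[256];             // low-latency memory shared by the block
    int t = threadIdx.x;
    tile[t] = data[t];                      // each thread writes one element
    __syncthreads();                        // barrier: wait until every write has landed
    data[t] = tile[blockDim.x - 1 - t];     // now safe to read another thread's element
}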
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
Host can transfer to/from device
Global memory
Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
cudaMalloc()
Allocates global memory on the device
cudaFree()
Frees global memory on the device
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
cudaMalloc() takes a pointer to device memory and a size in bytes
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
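A minimal allocation sketch consistent with those callouts (the 64-float size is arbitrary):

float* Md = NULL;                  // pointer that will hold a device address
int size = 64 * sizeof(float);     // size in bytes
cudaMalloc((void**)&Md, size);     // allocate global memory on the device
// ... use Md in kernels ...
cudaFree(Md);                      // release the device allocation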
CUDA Memory Transfers
cudaMemcpy()
Memory transfer
Host to host
Host to device
Device to host
Device to device
[Diagram: Host ↔ Device global memory]
Does this remind you of VBOs in OpenGL?
Plain cudaMemcpy() blocks the host until the transfer completes; cudaMemcpyAsync() is the asynchronous variant.
CUDA Memory Transfers
Host to device
Destination (device), source (host)
[Diagram: Host ↔ Device global memory]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
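A sketch of the transfer calls, destination first and source second (M is a host array, Md its device copy; both are assumed to have been allocated already):

// host to device: destination is the device pointer, source is the host pointer
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
// device to host: the direction flag and the pointer roles are reversed
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);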
Matrix Multiply
P = M * N
Assume M and N are square for simplicity
Is this data-parallel?
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
1,000 x 1,000 matrix
1,000,000 dot products
Each with 1,000 multiplies and 1,000 adds
Matrix Multiply: CPU Implementation
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    // naive triple loop: P[i][j] is the dot product of row i of M and column j of N
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
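The skeleton figure itself is missing from this transcript; a hedged reconstruction of its three-step structure, using the function name from the textbook chapter the slide cites:

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    // 1. Allocate device memory for M, N, and P; copy M and N to the device
    // 2. Launch the kernel that computes P = M * N on the device
    // 3. Copy P back to the host and free the device memory
}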
Matrix Multiply
Step 1
Add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
Allocate input
Allocate output
Read back from device
Does this remind you of GPGPU with GLSL?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
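A hedged sketch of the transfers those callouts describe, filled into the skeleton above (error checking omitted):

int size = width * width * sizeof(float);
float *Md, *Nd, *Pd;

cudaMalloc((void**)&Md, size);                     // allocate input M on the device
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&Nd, size);                     // allocate input N on the device
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&Pd, size);                     // allocate output P on the device

// ... kernel invocation goes here (steps 2 and 3 below) ...

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // read the result back from the device
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);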
Matrix Multiply
Step 2
Implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
Accessing a matrix, so using a 2D block
Each thread computes one output element
Where did the two outer for loops in the CPU implementation go?
No locks or synchronization, why?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
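The kernel figure is likewise missing; a hedged reconstruction in the spirit of the textbook example, where one width x width block computes one Pd element per thread:

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    // the 2D thread index replaces the two outer loops of the CPU version
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float Pvalue = 0;
    for (int k = 0; k < width; ++k)
        Pvalue += Md[ty * width + k] * Nd[k * width + tx];

    // each thread writes a distinct element, so no locks or synchronization are needed
    Pd[ty * width + tx] = Pvalue;
}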
Matrix Multiply
Step 3
Invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
One block with width by width threads
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
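A hedged sketch of the launch matching that callout: a 1x1 grid whose single block has width x width threads.

dim3 dimGrid(1, 1);            // one block
dim3 dimBlock(width, width);   // width x width threads in that block
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);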
Matrix Multiply
One Block of threads computes matrix Pd
Each thread computes one element of Pd
Each thread
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 with a single Block 1 of WIDTH x WIDTH threads; Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?