GPU computing and CUDA
Marko Mišić ([email protected])
Milo Tomašević ([email protected])
YUINFO 2012
Kopaonik, 29.02.2012.
Introduction to GPU computing (1)
Graphics Processing Units (GPUs) have been used
for non-graphics computation for several years
This trend is called
General-Purpose computation on GPUs (GPGPU)
GPGPU applications can be found in:
Computational physics/chemistry/biology
Signal processing
Computational geometry
Database management
Computational finance
Computer vision
Introduction to GPU computing (2)
The GPU is a highly parallel processor
good at data-parallel processing
with many calculations per memory access
The same computation executed
on many data elements in parallel
with high arithmetic intensity
Same computation means lower requirement
for sophisticated flow control
High arithmetic intensity and many data elements
mean that memory access latency can be hidden
with calculations instead of big data caches
CPU vs. GPU trends (1)
CPU is optimized to execute tasks
Big caches hide memory latencies
Sophisticated flow control
GPU is specialized for compute-intensive,
highly parallel computation
More transistors can be devoted to data processing
rather than data caching and flow control
[Figure: the CPU die devotes large areas to control logic and cache next to a few ALUs, backed by DRAM; the GPU die devotes most of its area to many ALUs, with only small control and cache areas, backed by its own DRAM]
CPU vs. GPU trends (2)
The GPU has evolved into a very flexible and powerful processor
Programmable using high-level languages
Computational power: 1 TFLOPS vs. 100 GFLOPS
Bandwidth: ~10x bigger
GPU is found in almost every workstation
[Figure: peak GFLOPS over time — the many-core GPU pulls away from the multi-core CPU; courtesy of John Owens]
CPU vs. GPU trends (3)
[Chart: CUDA advantage of 10x to 197x over CPU implementations across domains — rigid body physics solver; matrix numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS); wave equation (FDTD: 1.2 Gcells/s; FFT: 52 GFLOPS, as defined by benchFFT); biological sequence match (SSEARCH: 5.2 Gcells/s); finance (Black-Scholes: 4.7 GOptions/s)]
History of GPU programming
The fast-growing video game industry exerts
strong pressure that forces constant innovation
Programmable shaders (2000)
GPUs evolved from fixed-function pipeline processors
to the more programmable, general-purpose processors
Programmed through OpenGL and DirectX API
Lots of limitations
Memory access, ISA, floating-point support, etc.
NVIDIA CUDA (2007)
AMD/ATI (Brook+, FireStream, Close-To-Metal)
Microsoft DirectCompute (DirectX 10/DirectX 11)
Open Computing Language, OpenCL (2009)
CUDA overview (1)
Compute Unified Device Architecture (CUDA)
A new hardware and software architecture
for issuing and managing computations on the GPU
Started with NVIDIA 8000 (G80) series GPUs
General-purpose programming model
SIMD / SPMD
User launches batches of threads on the GPU
The GPU can be seen as a dedicated
super-threaded, massively data-parallel coprocessor
Explicit and unrestricted memory management
CUDA overview (2)
The GPU is viewed as a compute device
that is a coprocessor to the CPU (host)
Executes compute-intensive part of the application
Runs many threads in parallel
Has its own DRAM (device memory)
Data-parallel portions of an application are expressed
as device kernels which run on many threads
GPU threads are extremely lightweight
Very little creation overhead
GPU needs 1000s of threads for full efficiency
Multicore CPU needs only a few
CUDA overview (3)
Dedicated software stack
Runtime and driver
C-language extension
for easier programming
Targeted API for advanced users
Complete tool chain
Compiler, debugger, profiler
Libraries and 3rd party support
GPU Computing SDK
cuFFT, cuBLAS...
FORTRAN, C++, Python,
MATLAB, Thrust, GMAC…
[Figure: CUDA software stack on the CPU — Application → CUDA Libraries (FFT, BLAS) → CUDA Runtime → CUDA Driver → GPU]
Programming model (1)
CUDA application consists of two parts
Sequential parts are executed on the CPU (host)
Compute-intensive parts are executed on the GPU (device)
The CPU is responsible for data management,
memory transfers, and the GPU execution configuration
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args);
...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args);
...
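The host/device split above can be made concrete with a minimal, hypothetical vector-addition program: the CPU allocates device memory, transfers data, sets the execution configuration, launches the kernel, and copies the result back.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Device kernel: each thread computes one element of c
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    // Host is responsible for device memory management and transfers
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Execution configuration: nBlk blocks of nTid threads
    int nTid = 256, nBlk = (n + nTid - 1) / nTid;
    vecAdd<<<nBlk, nTid>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```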
Programming model (2)
A kernel is executed as
a grid of thread blocks
A thread block is a batch of
threads that can cooperate
with each other by:
Efficiently sharing data
through shared memory
Synchronizing their execution
Two threads from
two different blocks
cannot cooperate
[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D array of blocks, Block (0,0) … Block (2,1), and each block, e.g. Block (1,1), is a 2D array of threads, Thread (0,0) … Thread (4,2)]
Programming model (3)
Threads and blocks have IDs
So each thread can decide
what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory
addressing when processing
multidimensional data
Image processing
Solving PDEs on volumes
[Figure: grid/block/thread hierarchy — blocks indexed in 2D within Grid 1, threads indexed in 2D within Block (1,1)]
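A hypothetical image-processing kernel illustrates how block and thread IDs combine into global coordinates (the name and computation are illustrative, not from the slides):

```cuda
// Invert an 8-bit grayscale image of size width x height,
// launched with a 2D grid of 2D blocks.
__global__ void invertImage(unsigned char *img, int width, int height) {
    // Combine block ID and thread ID into global pixel coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard threads that fall outside the image
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}
```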
Memory model (1)
Each thread can:
Read/write
per-thread registers
Read/write
per-thread local memory
Read/write
per-block shared memory
Read/write
per-grid global memory
Read only
per-grid constant memory
Read only
per-grid texture memory
[Figure: CUDA memory spaces — per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory, the last three also accessible from the host]
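The memory spaces above can be sketched in code (hypothetical names; texture memory is bound through a separate host-side API and is omitted here):

```cuda
__constant__ float coeffs[16];      // per-grid constant memory: read-only in kernels
__device__   float results[1024];   // per-grid global memory: read/write

__global__ void memoryDemo(void) {
    __shared__ float tile[64];      // per-block shared memory: read/write

    float r = coeffs[0];            // scalar locals normally live in registers
    float big[128];                 // large per-thread arrays may spill to local memory

    big[threadIdx.x % 128] = r;
    tile[threadIdx.x % 64]  = big[threadIdx.x % 128];
    __syncthreads();

    results[threadIdx.x] = tile[0]; // write back to global memory
}
```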
Memory model (2)
The host can read/write
global, constant,
and texture memory
All stored in device DRAM
Global memory accesses are slow
~200 cycles
Memory architecture
optimized for high bandwidth
Memory banks
Transactions
[Figure: threads of each block going through registers and shared memory to the device's global memory (DRAM), which the host can also read and write]
Memory model (3)
Shared memory is a fast on-chip memory
Allows threads in a block to share intermediate data
Access time ~ 3-4 cycles
Could be seen as user-managed cache (scratchpad)
Threads are responsible
for moving data into and out of the shared memory
Small in size (up to 48KB)
[Figure: data d0 … d7 staged from DRAM into the per-multiprocessor shared memories, next to the ALUs, instead of relying on a hardware cache]
A common programming strategy
Local and global memory reside in device memory
(DRAM)
Much slower access than shared memory
A common way of performing computation on the device
is to block it up (tile) to take advantage of fast shared
memory
Partition the data set into subsets that fit into shared memory
Handle each data subset with one thread block by:
Loading the subset from global memory to shared memory
Performing the computation on the subset from shared memory
Each thread can efficiently multi-pass over any data element
Copying results from shared memory to global memory
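The steps above can be sketched as a generic tiled kernel (a minimal sketch; the block-mean computation is only a placeholder workload):

```cuda
#define BLOCK_SIZE 256

// Subtract the block-local mean from every element. Each thread
// re-reads the whole tile from shared memory, so the multi-pass
// over the data subset stays on-chip and cheap.
__global__ void subtractBlockMean(const float *in, float *out) {
    __shared__ float tile[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];          // 1. load subset into shared memory
    __syncthreads();                    //    wait until the whole tile is present

    float sum = 0.0f;                   // 2. compute using the shared tile
    for (int k = 0; k < BLOCK_SIZE; ++k)
        sum += tile[k];

    out[i] = in[i] - sum / BLOCK_SIZE;  // 3. copy result back to global memory
}
```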
Matrix Multiplication Example (1)
P = M * N of size WIDTH x WIDTH
Without blocking:
One thread handles one element of P
M and N are loaded WIDTH times from
global memory
[Figure: matrices M, N, and P, each WIDTH x WIDTH; one row of M and one column of N are read per element of P]
Matrix Multiplication Example (2)
P = M * N of size WIDTH x WIDTH
With blocking:
One thread block handles one
BLOCK_SIZE x BLOCK_SIZE
sub-matrix Psub of P
M and N are only loaded
WIDTH / BLOCK_SIZE times
from global memory
Great saving of memory bandwidth!
[Figure: P partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; Psub is computed from a strip of M and a strip of N, tile by tile]
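A sketch of the blocked version (assuming, for brevity, that WIDTH is a multiple of BLOCK_SIZE and matrices are row-major):

```cuda
#define BLOCK_SIZE 16

// P = M * N, all WIDTH x WIDTH. Each block computes one
// BLOCK_SIZE x BLOCK_SIZE tile of P; tiles of M and N are staged
// through shared memory, so each global element is loaded only
// WIDTH / BLOCK_SIZE times.
__global__ void matMulTiled(const float *M, const float *N,
                            float *P, int width) {
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile
        Ms[threadIdx.y][threadIdx.x] =
            M[row * width + t * BLOCK_SIZE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] =
            N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();              // tile fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();              // done with this tile
    }
    P[row * width + col] = acc;
}
```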
CUDA API (1)
The CUDA API is an extension
to the C programming language consisting of:
Language extensions
To target portions of the code for execution on the device
A runtime library split into:
A common component providing built-in vector types
and a subset of the C runtime library
in both host and device codes
A host component to control and access
one or more devices from the host
A device component providing device-specific functions
CUDA API (2)
Function declaration qualifiers
__global__, __host__, __device__
Variable qualifiers
__device__, __shared__, __constant__, etc.
Built-in variables
gridDim, blockDim, blockIdx, threadIdx
Mathematical functions
Kernel calling convention (execution configuration)
myKernel<<< DimGrid, DimBlock >>>(arg1, … );
Programmer explicitly specifies block and grid organization
1D, 2D, or 3D
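For example, a 2D execution configuration using the built-in dim3 type might look like this (myKernel and its parameters are hypothetical):

```cuda
__global__ void myKernel(float *data, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;   // placeholder computation
}

void launch(float *d_data, int width, int height) {
    dim3 DimBlock(16, 16);             // 16x16 = 256 threads per block, 2D
    // Enough blocks to cover the whole problem, rounding up
    dim3 DimGrid((width  + DimBlock.x - 1) / DimBlock.x,
                 (height + DimBlock.y - 1) / DimBlock.y);
    myKernel<<<DimGrid, DimBlock>>>(d_data, width, height);
}
```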
Hardware implementation (1)
The device is a set of multiprocessors
Each multiprocessor is a set of 32-bit
processors with a SIMD architecture
At each clock cycle, a multiprocessor
executes the same instruction on a
group of threads called a warp
Including branches
Allows scalable execution of kernels
Adding more multiprocessors improves
performance
[Figure: device with Multiprocessors 1 … N, each containing Processors 1 … M driven by a single Instruction Unit]
Hardware implementation (2)
[Figure: device architecture — host, input assembler, and thread execution manager feeding an array of multiprocessors, each with a parallel data cache and texture unit, connected through load/store units to global memory]
Hardware implementation (3)
Each thread block of a grid is split into warps
that get executed by one multiprocessor
Each thread block is executed by only one multiprocessor
A warp consists of threads with consecutive thread IDs
Shared memory space resides in the on-chip shared memory
Registers are allocated among the threads
A kernel that requires too many registers will fail to launch
A multiprocessor can execute several blocks concurrently
Shared memory and registers are allocated
among the threads of all concurrent blocks
Decreasing shared memory usage (per block) and
register usage (per thread) increases number of
blocks that can run concurrently
Memory architecture (1)
In a parallel machine,
many threads access memory
Memory is divided into banks
Essential to achieve high bandwidth
Each bank can service one address per cycle
A memory can service
as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank
result in a bank conflict
Conflicting accesses are serialized
Shared memory is organized in a similar fashion
[Figure: memory divided into Banks 0 … 15]
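With 16 banks serving a half-warp (as on the G80-class hardware described here), the stride of a shared-memory access pattern determines whether conflicts occur. A hypothetical illustration:

```cuda
__global__ void bankDemo(const float *in, float *out) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    s[tid] = in[tid];
    __syncthreads();

    // Stride 1: thread i reads bank (i mod 16) -- every thread of a
    // half-warp hits a different bank, so there is no conflict.
    float fast = s[tid];

    // Stride 2: threads i and i+8 of a half-warp hit the same bank,
    // a 2-way conflict that serializes the two accesses.
    float slow = s[(tid * 2) % 256];

    out[tid] = fast + slow;
}
```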
Memory architecture (2)
When accessing global memory,
accesses are combined into transactions
Peak bandwidth is achieved
when all threads in a half-warp access
contiguous memory locations
“Memory coalescing”
In that case, there are no bank conflicts
The programmer is responsible for optimizing algorithms
to access data in an appropriate fashion
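As an illustration (hypothetical kernels, not from the slides), the addresses touched by consecutive threads decide whether reads combine into few transactions:

```cuda
// Coalesced: thread i touches element i, so a half-warp reads
// 16 contiguous words that combine into a single transaction.
__global__ void copyCoalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements
// apart, so the hardware must issue many separate transactions.
__global__ void copyStrided(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}
```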
Performance considerations
CUDA has a low learning curve
It is easy to write a correct program
Performance can vary greatly
depending on the resource constraints of
the particular device architecture
Performance-conscious programmers still need
to be aware of those constraints
to make good use of contemporary hardware
It is essential to understand
hardware and memory architecture
Thread scheduling and execution
Suitable memory access patterns
Shared memory utilization
Resource limitations
Conclusion
Highly multithreaded architecture of modern GPUs is
very suitable for solving data-parallel problems
Vastly improves performance in certain domains
It is expected that GPU architectures will evolve
to further broaden application domains
We are in the dawn of heterogeneous computing
Software support is developing rapidly
Mature tool chain
Libraries
Available applications
OpenCL
References
David Kirk, Wen-mei Hwu, Programming Massively Parallel Processors: A Hands-on
Approach, Morgan Kaufmann, 2010.
Course ECE498AL, University of Illinois, Urbana-Champaign,
http://courses.engr.illinois.edu/ece498/al/
Dann Connors, OpenCL and CUDA Programming for Multicore
and GPU Architectures, ACACES 2011, Fiuggi, Italy, 2011.
David Kirk, Wen-mei Hwu, Programming and Tuning
Massively Parallel Systems, PUMPS 2011, Barcelona, Spain, 2011.
NVIDIA CUDA C Programming Guide 4.0, 2011.
Mišić, Đurđević, Tomašević, “Evolution and Trends in GPU Computing”, MIPRO
2012, Abbazia, Croatia, 2012. (to be published)
NVIDIA Developer zone, http://developer.nvidia.com/category/zone/cuda-zone
http://en.wikipedia.org/wiki/GPGPU
http://en.wikipedia.org/wiki/CUDA
GPU training wiki,
https://hpcforge.org/plugins/mediawiki/wiki/gpu-training/index.php/Main_Page
GPU computing and CUDA
Questions?
Marko Mišić ([email protected])
Milo Tomašević ([email protected])
YUINFO 2012
Kopaonik, 29.02.2012.