
GPU computing and CUDA
Marko Mišić ([email protected])
Milo Tomašević ([email protected])
YUINFO 2012
Kopaonik, 29.02.2012.
Introduction to GPU computing (1)

- Graphics Processing Units (GPUs) have been used for non-graphics computation for several years
- This trend is called General-Purpose computation on GPUs (GPGPU)
- GPGPU applications can be found in:
  - Computational physics/chemistry/biology
  - Signal processing
  - Computational geometry
  - Database management
  - Computational finance
  - Computer vision

Introduction to GPU computing (2)

- The GPU is a highly parallel processor, good at data-parallel processing with many calculations per memory access
  - The same computation is executed on many data elements in parallel, with high arithmetic intensity
- The same computation on every element means a lower requirement for sophisticated flow control
- High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches

CPU vs. GPU trends (1)

- The CPU is optimized to execute sequential tasks
  - Big caches hide memory latencies
  - Sophisticated flow control
- The GPU is specialized for compute-intensive, highly parallel computation
  - More transistors can be devoted to data processing rather than data caching and flow control
[Figure: simplified die layouts of a CPU (large control logic and cache, few ALUs) and a GPU (many ALUs, little control logic and cache), each backed by DRAM]

CPU vs. GPU trends (2)

- The GPU has evolved into a very flexible and powerful processor
  - Programmable using high-level languages
  - Computational power: ~1 TFLOPS (many-core GPU) vs. ~100 GFLOPS (multi-core CPU)
  - Memory bandwidth: roughly 10x higher
  - A GPU is found in almost every workstation
[Figure: peak performance of many-core GPUs vs. multi-core CPUs over time; courtesy of John Owens]

CPU vs. GPU trends (3)

[Chart: reported CUDA speedups over CPU implementations across several domains - rigid body physics solver, matrix numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS), wave equation (FDTD: 1.2 Gcells/s; FFT: 52 GFLOPS, as defined by benchFFT), biological sequence matching (SSEARCH: 5.2 Gcells/s), and finance (Black-Scholes: 4.7 GOptions/s); the CUDA advantage ranges from 10x to 197x]

History of GPU programming

- The fast-growing video game industry exerts strong pressure that forces constant innovation
- Programmable shaders (2000)
  - GPUs evolved from fixed-function pipeline processors to more programmable, general-purpose processors
  - Programmed through the OpenGL and DirectX APIs
  - Lots of limitations: memory access, ISA, floating-point support, etc.
- NVIDIA CUDA (2007)
- AMD/ATI (Brook+, FireStream, Close-To-Metal)
- Microsoft DirectCompute (DirectX 10/DirectX 11)
- Open Computing Language, OpenCL (2009)

CUDA overview (1)

- Compute Unified Device Architecture (CUDA)
  - A new hardware and software architecture for issuing and managing computations on the GPU
  - Started with the NVIDIA 8000 (G80) series of GPUs
- General-purpose programming model
  - SIMD / SPMD
  - The user launches batches of threads on the GPU
  - The GPU can be seen as a dedicated super-threaded, massively data-parallel coprocessor
  - Explicit and unrestricted memory management

CUDA overview (2)

- The GPU is viewed as a compute device that acts as a coprocessor to the CPU (host)
  - Executes the compute-intensive part of the application
  - Runs many threads in parallel
  - Has its own DRAM (device memory)
- Data-parallel portions of an application are expressed as device kernels which run on many threads (a minimal kernel sketch follows below)
  - GPU threads are extremely lightweight: very little creation overhead
  - The GPU needs thousands of threads for full efficiency; a multicore CPU needs only a few

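The slides do not show code at this point; the following is a minimal sketch of what such a device kernel looks like. The kernel name and parameters are illustrative, not taken from the talk.

// Hypothetical kernel: each of the many lightweight GPU threads scales one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against surplus threads
        data[i] = data[i] * factor;
}
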
CUDA overview (3)

- Dedicated software stack
  - Runtime and driver
  - C-language extension for easier programming
  - Targeted API for advanced users
  - Complete tool chain: compiler, debugger, profiler
  - Libraries and 3rd-party support: GPU Computing SDK; cuFFT, cuBLAS...; FORTRAN, C++, Python, MATLAB, Thrust, GMAC…
[Figure: CUDA software stack - the application on the CPU uses the CUDA libraries (FFT, BLAS), the CUDA runtime, and the CUDA driver, which control the GPU]

Programming model (1)

- A CUDA application consists of two parts
  - Sequential parts are executed on the CPU (host)
  - Compute-intensive parts are executed on the GPU (device)
- The CPU is responsible for data management, memory transfers, and the GPU execution configuration (see the host-side sketch below)

  Serial code (host)
  Parallel kernel (device):  KernelA<<< nBlk, nTid >>>(args); ...
  Serial code (host)
  Parallel kernel (device):  KernelB<<< nBlk, nTid >>>(args); ...

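A minimal host-side sketch of this flow, not taken from the slides: the kernel KernelA, the array h_data, and the block size are illustrative assumptions.

// Hedged sketch of the host-side flow: allocate, copy in, launch, copy out.
__global__ void KernelA(float *data, int n);   // device code, defined elsewhere

void run(float *h_data, int n)
{
    float *d_data;
    size_t size = n * sizeof(float);

    cudaMalloc((void **)&d_data, size);                        // allocate device memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> device transfer

    int nTid = 256;                        // threads per block
    int nBlk = (n + nTid - 1) / nTid;      // enough blocks to cover n elements
    KernelA<<< nBlk, nTid >>>(d_data, n);  // execution configuration + kernel launch

    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // device -> host transfer
    cudaFree(d_data);                                          // release device memory
}
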
Programming model (2)

- A kernel is executed as a grid of thread blocks
- A thread block is a batch of threads that can cooperate with each other by:
  - Efficiently sharing data through shared memory
  - Synchronizing their execution
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of blocks, e.g. Block (1, 1), and each block is a 2D array of threads, e.g. Thread (0, 0) through Thread (4, 2)]

Programming model (3)

- Threads and blocks have IDs, so each thread can decide what data to work on (see the indexing sketch below)
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- This simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
[Figure: the same grid/block/thread hierarchy as on the previous slide, with block and thread indices highlighted]

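A hedged sketch (not from the slides) of how 2D block and thread IDs address the pixels of a width x height image; the kernel and parameter names are illustrative.

__global__ void invert(unsigned char *image, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index

    if (x < width && y < height)                    // skip threads that fall outside the image
        image[y * width + x] = 255 - image[y * width + x];
}
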
Memory model (1)

- Each thread can (see the declaration sketch below):
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read-only per-grid constant memory
  - Read-only per-grid texture memory
[Figure: CUDA memory hierarchy - each thread has its own registers and local memory, each block has its shared memory, and the whole grid (and the host) can access global, constant, and texture memory]

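A hedged illustration (not from the slides) of where these memory spaces appear in CUDA C; the kernel assumes a 1D block of at most 256 threads, and all names are made up.

__constant__ float coeff[16];                 // per-grid constant memory, read-only in kernels

__global__ void example(float *input, float *output)   // input/output point to global memory
{
    __shared__ float tile[256];               // per-block shared memory
    float acc = 0.0f;                         // per-thread register (may spill to local memory)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = input[i];             // global -> shared
    __syncthreads();

    acc = tile[threadIdx.x] * coeff[threadIdx.x % 16];
    output[i] = acc;                          // register -> global
}
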
Memory model (2)

- The host can read/write global, constant, and texture memory
  - All stored in device DRAM
- Global memory accesses are slow
  - ~200 cycles
- The memory architecture is optimized for high bandwidth
  - Memory banks
  - Transactions
[Figure: the host and the device both access global memory (DRAM); on the device, each block has its shared memory and each thread its registers]

Memory model (3)

- Shared memory is a fast on-chip memory (see the sketch below)
  - Allows threads in a block to share intermediate data
  - Access time of ~3-4 cycles
  - Can be seen as a user-managed cache (scratchpad): threads are responsible for bringing data into and moving it out of shared memory
  - Small in size (up to 48 KB)
[Figure: each multiprocessor pairs its ALUs with a small cache and an on-chip shared memory that stages data elements (d0-d7) to and from DRAM]

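A hedged sketch (not from the slides): a block stages data in shared memory, synchronizes, and reads it back in a different order. The kernel is assumed to be launched with BLOCK threads per block; all names are illustrative.

#define BLOCK 256

__global__ void reverse_blocks(float *data)
{
    __shared__ float buf[BLOCK];                       // user-managed scratchpad
    int base = blockIdx.x * BLOCK;

    buf[threadIdx.x] = data[base + threadIdx.x];       // bring data into shared memory
    __syncthreads();                                   // wait until the whole tile is loaded

    data[base + threadIdx.x] = buf[BLOCK - 1 - threadIdx.x];  // move results back out, reversed
}
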
A common programming strategy

- Local and global memory reside in device memory (DRAM)
  - Much slower access than shared memory
- A common way of performing computation on the device is to block it up (tile it) to take advantage of fast shared memory
  - Partition the data set into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory
    - Performing the computation on the subset from shared memory (each thread can efficiently make multiple passes over any data element)
    - Copying the results from shared memory back to global memory

Matrix Multiplication Example (1)

- P = M * N, all of size WIDTH x WIDTH
- Without blocking:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P of size WIDTH x WIDTH; one thread computes one element of P from a row of M and a column of N]

Matrix Multiplication Example (2)

- P = M * N, all of size WIDTH x WIDTH
- With blocking:
  - One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P
  - M and N are only loaded WIDTH / BLOCK_SIZE times from global memory
  - Great saving of memory bandwidth! (a kernel sketch follows below)
[Figure: P is partitioned into BLOCK_SIZE x BLOCK_SIZE tiles; the block computing Psub walks over the matching tiles of M and N]

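A hedged sketch of the blocked (tiled) multiplication described above, not the original course code. It assumes square WIDTH x WIDTH matrices in row-major order with WIDTH divisible by BLOCK_SIZE.

#define BLOCK_SIZE 16

__global__ void matMulTiled(const float *M, const float *N, float *P, int width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];   // tile of M in shared memory
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];   // tile of N in shared memory

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the WIDTH / BLOCK_SIZE tile pairs that contribute to this element
    for (int t = 0; t < width / BLOCK_SIZE; ++t) {
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_SIZE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_SIZE + threadIdx.y) * width + col];
        __syncthreads();                           // both tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                           // done with these tiles
    }

    P[row * width + col] = sum;                    // one thread writes one element of Psub
}

A matching launch would use dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE) and dim3 dimGrid(WIDTH / BLOCK_SIZE, WIDTH / BLOCK_SIZE), so that one block covers one Psub tile.
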
CUDA API (1)

- The CUDA API is an extension to the C programming language consisting of:
  - Language extensions
    - To target portions of the code for execution on the device
  - A runtime library split into:
    - A common component providing built-in vector types and a subset of the C runtime library in both host and device code
    - A host component to control and access one or more devices from the host
    - A device component providing device-specific functions

CUDA API (2)

- Function declaration qualifiers
  - __global__, __host__, __device__
- Variable qualifiers
  - __device__, __shared__, __constant__, etc.
- Built-in variables
  - gridDim, blockDim, blockIdx, threadIdx
- Mathematical functions
- Kernel calling convention (execution configuration)
  - myKernel<<< DimGrid, DimBlock >>>(arg1, … );
  - The programmer explicitly specifies the block and grid organization (a complete example follows below)
    - 1D, 2D or 3D

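A hedged, self-contained example (not from the slides) combining the qualifiers, built-in variables, and execution configuration listed above; the kernel and helper names are illustrative.

#include <cuda_runtime.h>

__device__ float square(float x)             // callable from device code only
{
    return x * x;
}

__global__ void myKernel(float *out, int n)  // callable from the host, runs on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables
    if (i < n)
        out[i] = square((float)i);
}

int main(void)
{
    const int n = 1024;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    dim3 dimBlock(256);                    // 1D block of 256 threads
    dim3 dimGrid((n + 255) / 256);         // enough 1D blocks to cover n elements
    myKernel<<< dimGrid, dimBlock >>>(d_out, n);    // execution configuration

    cudaDeviceSynchronize();               // wait for the kernel to finish
    cudaFree(d_out);
    return 0;
}
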
Hardware implementation (1)

- The device is a set of multiprocessors
- Each multiprocessor is a set of 32-bit processors with a SIMD architecture
- At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
  - Including branches
- This allows scalable execution of kernels
  - Adding more multiprocessors improves performance
[Figure: the device contains Multiprocessor 1 through Multiprocessor N; each multiprocessor has a single instruction unit driving Processor 1 through Processor M]

Hardware implementation (2)

[Figure: G80-style GPU block diagram - the host feeds an input assembler and a thread execution manager, which dispatch work to many processor clusters, each with its parallel data cache and texture unit; load/store units connect them to global memory]

Hardware implementation (3)

- Each thread block of a grid is split into warps that get executed by one multiprocessor
  - A warp consists of threads with consecutive thread IDs
  - Each thread block is executed by only one multiprocessor
  - The shared memory space resides in the on-chip shared memory
  - Registers are allocated among the threads
  - A kernel that requires too many registers will fail to launch
- A multiprocessor can execute several blocks concurrently
  - Shared memory and registers are allocated among the threads of all concurrent blocks
  - Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently

Memory architecture (1)

- In a parallel machine, many threads access memory
  - Memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
  - Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
- Shared memory is organized in a similar fashion (see the sketch below)
[Figure: memory organized as Bank 0 through Bank 15]

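A hedged illustration (not from the slides) of bank conflicts in shared memory, assuming 16 banks of consecutive 32-bit words as on the G80-class hardware discussed here; the kernel is purely illustrative and assumes at most 256 threads per block.

__global__ void bank_demo(float *out)
{
    __shared__ float s[512];
    int tid = threadIdx.x;

    s[tid] = (float)tid;
    __syncthreads();

    float a = s[tid];          // stride-1: each thread hits a different bank, no conflict
    float b = s[tid * 2];      // stride-2: pairs of threads hit the same bank, 2-way conflict
    out[blockIdx.x * blockDim.x + tid] = a + b;
}
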
Memory architecture (2)

- When accessing global memory, accesses are combined into transactions
- Peak bandwidth is achieved when all threads in a half-warp access contiguous memory locations
  - "Memory coalescing"
  - In that case, there are no bank conflicts
- The programmer is responsible for optimizing algorithms to access data in an appropriate fashion (see the access-pattern sketch below)

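A hedged sketch (not from the slides) contrasting a coalesced and a strided global-memory access pattern; the kernel and parameter names are illustrative.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // neighboring threads read neighboring words: coalesced
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];   // neighboring threads read words far apart: poorly coalesced
}
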
Performance considerations

- CUDA has a low learning curve
  - It is easy to write a correct program
- Performance can vary greatly depending on the resource constraints of the particular device architecture
  - Performance-conscious programmers still need to be aware of these constraints to make good use of contemporary hardware
- It is essential to understand the hardware and memory architecture
  - Thread scheduling and execution
  - Suitable memory access patterns
  - Shared memory utilization
  - Resource limitations

Conclusion

- The highly multithreaded architecture of modern GPUs is very suitable for solving data-parallel problems
  - Vastly improves performance in certain domains
- GPU architectures are expected to evolve and further broaden their application domains
  - We are at the dawn of heterogeneous computing
- Software support is developing rapidly
  - Mature tool chain
  - Libraries
  - Available applications
  - OpenCL

GPU computing and CUDA
Questions?
Marko Mišić ([email protected])
Milo Tomašević ([email protected])
YUINFO 2012
Kopaonik, 29.02.2012.