Transcript slides
NVVP, Existing Libraries, Q/A
Textbook / resources
Eclipse Nsight, NVIDIA Visual Profiler
Available libraries
Questions
Certificate dispersal
(Optional) Multiple GPUs: Where’s Pixel-Waldo?
TEXTBOOK
Programming Massively Parallel Processors: A Hands-on Approach
David Kirk, Wen-mei Hwu
NVIDIA DEVELOPER ZONE
Early access to driver and software updates
Heavily curated help forum
Requires registration and approval (nearly automated)
developer.nvidia.com
US!
We’re pretty passionate about this GPU computing stuff.
Collaboration is cool
If you think you’ve got a problem that can benefit from GPU computation, we may have some ideas.
Nsight: an IDE with an Eclipse foundation
CUDA-aware syntax highlighting / suggestions / recognition
Hooked into NVVP
NVVP: deep profiling of every aspect of GPU execution (memory bandwidth, branch divergence, bank conflicts, compute/transfer overlap, and more!)
Provides suggestions for optimization
Graphical view of GPU performance
Nsight and NVVP are available on our cuda# machines
ssh -X <user>@<cuda machine>
Nsight demo on Week 3 code
Why re-invent the wheel?
• There are many GPU enabled tools built on
CUDA that are already available
• These tools have been extensively tested for
efficiency and in most cases will outperform
custom solutions
• Some require CUDA-like code structure
Linear Algebra, cuBLAS
CUDA-enabled basic linear algebra subroutines
• GPU-accelerated version of the complete standard BLAS library
• Provided with the CUDA toolkit, along with code examples
• Callable from C and Fortran
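For illustration (not from the original slides), here is a minimal sketch of calling cuBLAS from C: a single-precision AXPY (y = alpha*x + y) run on the GPU. The array size and values are arbitrary.

/*
 * Minimal cuBLAS sketch (not from the original slides): y = alpha*x + y
 * on the GPU. Compile with: nvcc saxpy_cublas.cu -lcublas
 */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 1024;
    const float alpha = 2.0f;
    float hx[1024], hy[1024];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = (float)i; }

    float *dx, *dy;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Copy operands to the GPU using cuBLAS helper routines. */
    cublasSetVector(n, sizeof(float), hx, 1, dx, 1);
    cublasSetVector(n, sizeof(float), hy, 1, dy, 1);

    /* y = alpha * x + y, computed entirely on the GPU. */
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);

    /* Copy the result back and spot-check one element. */
    cublasGetVector(n, sizeof(float), dy, 1, hy, 1);
    printf("hy[10] = %f (expected 12)\n", hy[10]);

    cublasDestroy(handle);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}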
Linear Algebra, CULA, MAGMA
CULA and MAGMA extend BLAS
• CULA (Paid)
  CULA Dense: LAPACK and BLAS implementations, solvers, decompositions, basic matrix operations
  CULA Sparse: specialized sparse-matrix routines, specialized storage structures, iterative methods
• MAGMA (Free, BSD license; Fortran bindings)
  LAPACK and BLAS implementations, developed by the same team as LAPACK
IMSL Fortran/C Numerical Library
Large collection of mathematical and statistical GPU-accelerated functions
• Free evaluation, paid extension
• http://www.roguewave.com/products/imsl-numerical-libraries/fortran-library.aspx
Image/Signal Processing: NVIDIA Performance Primitives
1900 image processing and 600 signal processing algorithms
• Free and provided with the CUDA toolkit; code examples included
• Can be used in tandem with visualization libraries like OpenGL and DirectX
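For illustration (not from the original slides), a minimal sketch of one NPP signal routine: element-wise addition of two device arrays with nppsAdd_32f. The array size is arbitrary, and the linker flag depends on the toolkit version.

/*
 * Minimal NPP sketch (not from the original slides): dC[i] = dA[i] + dB[i]
 * on the GPU. Compile with: nvcc npps_add.cu -lnpp
 * (newer toolkits split the library, e.g. -lnpps).
 */
#include <npp.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int n = 1024;
    float hA[1024], hB[1024], hC[1024];
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    /* NPP provides its own device allocators for signal data. */
    Npp32f *dA = nppsMalloc_32f(n);
    Npp32f *dB = nppsMalloc_32f(n);
    Npp32f *dC = nppsMalloc_32f(n);

    cudaMemcpy(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Element-wise addition, computed on the GPU by NPP. */
    nppsAdd_32f(dA, dB, dC, n);

    cudaMemcpy(hC, dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("hC[10] = %f (expected 30)\n", hC[10]);

    nppsFree(dA); nppsFree(dB); nppsFree(dC);
    return 0;
}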
CUDA without the CUDA: Thrust Library
Thrust is a high-level interface to GPU computing.
• Offers template-interface access to sort, scan, reduce, etc.
• A production-tested version is provided with the CUDA toolkit.
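For illustration (not the workshop's demo code), a minimal sketch of Thrust's template interface: fill a device_vector, sort it, and reduce it, with no hand-written kernels.

// Minimal Thrust sketch (not from the original slides).
// Compile with: nvcc thrust_demo.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>
#include <iostream>

int main(void) {
    // Generate random data on the host.
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand() % 1000;

    // Copying to a device_vector moves the data to the GPU.
    thrust::device_vector<int> d = h;

    // Sort and reduce run as GPU kernels under the hood.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    int mn = d.front();
    int mx = d.back();
    std::cout << "min = " << mn << ", max = " << mx
              << ", sum = " << sum << std::endl;
    return 0;
}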
Python and CUDA
PyCUDA
• Python interface to CUDA functions.
• Simply a collection of wrappers, but effective.
NumbaPro (Paid)
• Announced this year at GTC 2013: a native CUDA Python compiler
• Python = 4th major CUDA language
R and CUDA
R+GPU
• Package with accelerated alternatives for common R statistical functions
rpud / rpudplus
• Package with accelerated alternatives for common R statistical functions
Rcuda
• … Package with accelerated alternatives for common R statistical functions
Where’s Pixel-Waldo?
Motivation: given two images that contain a unique suspect and a number of distinct bystanders, identify the suspect by pairwise comparison.
This is hard.
We’ll simplify the problem by reducing the targets to pixel triples.
0: Upload an image and a list to store targets to each GPU.
[Slide diagram: f.bmp goes to GPU0 with an empty target list (0|0|0|…); s.bmp goes to GPU1 with an empty target list (0|0|0|…)]
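A minimal sketch (not the workshop's actual source) of what step 0 might look like: each GPU gets its own copy of one image plus an empty target list. The image size is assumed, and the pixel buffers here are dummies standing in for the decoded f.bmp / s.bmp data.

// Step 0 sketch: per-GPU upload of an image and an empty target list.
// Assumes two GPUs are present.
#include <cuda_runtime.h>
#include <stdlib.h>

#define MAX_TARGETS 4096

int main(void) {
    const size_t numPixels = 1024 * 768;            // assumed image size
    unsigned char *hImg[2];
    unsigned char *dImg[2];
    int *dTargets[2];

    for (int gpu = 0; gpu < 2; ++gpu) {
        hImg[gpu] = (unsigned char *)calloc(numPixels * 3, 1);  // RGB triples

        cudaSetDevice(gpu);                          // switch GPU context
        cudaMalloc((void **)&dImg[gpu], numPixels * 3);
        cudaMalloc((void **)&dTargets[gpu], MAX_TARGETS * sizeof(int));

        // Upload the image and zero the target list (the 0|0|0|... slots).
        cudaMemcpy(dImg[gpu], hImg[gpu], numPixels * 3, cudaMemcpyHostToDevice);
        cudaMemset(dTargets[gpu], 0, MAX_TARGETS * sizeof(int));
    }
    /* ... steps 1-3 ... */
    return 0;
}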
1: Find all positions of potential targets (triples) within each image, using both GPUs independently.
[Slide diagram: GPU0 finds positions 11 | 143 | 243 | … in f.bmp; GPU1 finds positions 3 | 1632 | 54321 | … in s.bmp]
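A minimal sketch (not the workshop's actual kernel) of step 1: each GPU independently scans its image for target pixel triples and appends the matching positions to its own list. The match test here (a pure-red pixel) is a stand-in for the real criterion, and the images are dummy buffers so the sketch stands alone.

// Step 1 sketch: independent target search on each GPU.
#include <cuda_runtime.h>

#define MAX_TARGETS 4096

__global__ void findTargets(const unsigned char *img, size_t numPixels,
                            int *targets, int *count) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels) return;

    const unsigned char *p = img + 3 * i;           // RGB triple
    if (p[0] == 255 && p[1] == 0 && p[2] == 0) {    // assumed match test
        int slot = atomicAdd(count, 1);             // grab a slot in the list
        if (slot < MAX_TARGETS) targets[slot] = (int)i;
    }
}

int main(void) {
    const size_t numPixels = 1024 * 768;
    unsigned char *dImg[2];
    int *dTargets[2], *dCount[2];

    for (int gpu = 0; gpu < 2; ++gpu) {
        cudaSetDevice(gpu);
        cudaMalloc((void **)&dImg[gpu], numPixels * 3);
        cudaMemset(dImg[gpu], 0, numPixels * 3);    // stands in for the uploaded image
        cudaMalloc((void **)&dTargets[gpu], MAX_TARGETS * sizeof(int));
        cudaMalloc((void **)&dCount[gpu], sizeof(int));
        cudaMemset(dCount[gpu], 0, sizeof(int));

        // Kernel launches are asynchronous, so the two GPUs work concurrently.
        int threads = 256;
        int blocks = (int)((numPixels + threads - 1) / threads);
        findTargets<<<blocks, threads>>>(dImg[gpu], numPixels,
                                         dTargets[gpu], dCount[gpu]);
    }
    for (int gpu = 0; gpu < 2; ++gpu) {
        cudaSetDevice(gpu);
        cudaDeviceSynchronize();
    }
    return 0;
}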
2: Allow GPU0 to access GPU1 memory; use both images and target lists to compare potential suspects.
[Slide diagram: GPU0 (f.bmp, target list 11 | 143 | 243 | …) reads GPU1's s.bmp and target list (3 | 1632 | 54321 | …) over the PCI bus and fills a result slot (0 | 0)]
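A minimal sketch (not the workshop's actual source) of step 2: GPU0 is given peer access to GPU1's memory, so a kernel running on GPU0 can dereference GPU1's target-list pointers directly over the PCI bus (unified virtual addressing). The comparison itself is a stand-in that just records the first candidate pair; the buffers are allocated empty so the sketch compiles on its own.

// Step 2 sketch: peer-to-peer comparison on GPU0 using GPU1's data.
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_TARGETS 4096

__global__ void comparePair(const int *targets0, const int *count0,
                            const int *targets1, const int *count1,
                            int *match) {
    // One thread per (i, j) candidate pair (simplified: at most 256 entries
    // in list 1). A real comparison would also inspect surrounding pixels.
    int i = blockIdx.x, j = threadIdx.x;
    if (i < *count0 && j < *count1) {
        match[0] = targets0[i];   // position in f.bmp (GPU0)
        match[1] = targets1[j];   // position in s.bmp (GPU1, via peer access)
    }
}

int main(void) {
    int *dTargets0, *dCount0, *dTargets1, *dCount1, *dMatch;

    cudaSetDevice(1);             // GPU1's list, produced by step 1
    cudaMalloc((void **)&dTargets1, MAX_TARGETS * sizeof(int));
    cudaMalloc((void **)&dCount1, sizeof(int));
    cudaMemset(dCount1, 0, sizeof(int));

    cudaSetDevice(0);             // GPU0's list and the result slot
    cudaMalloc((void **)&dTargets0, MAX_TARGETS * sizeof(int));
    cudaMalloc((void **)&dCount0, sizeof(int));
    cudaMemset(dCount0, 0, sizeof(int));
    cudaMalloc((void **)&dMatch, 2 * sizeof(int));
    cudaMemset(dMatch, 0, 2 * sizeof(int));

    // Let GPU0 read GPU1's memory directly (requires hardware support).
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaDeviceEnablePeerAccess(1, 0);
        // GPU0 runs the comparison; GPU1's list crosses the PCI bus.
        comparePair<<<MAX_TARGETS, 256>>>(dTargets0, dCount0,
                                          dTargets1, dCount1, dMatch);
        cudaDeviceSynchronize();
    }
    printf("peer access %s\n", canAccess ? "enabled" : "unavailable");
    return 0;
}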
3: Print the positions of the single matching suspect.
[Slide diagram: GPU0 (f.bmp, target list 11 | 143 | 243 | …) sends the matching pair of positions (132 | 629) over the PCI bus to the CPU]
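A minimal sketch of step 3: the matched pair of positions computed on GPU0 in step 2 is copied back across the PCI bus to the CPU and printed. dMatch is allocated empty here so the sketch stands alone; in the real flow it is the result buffer written by the step-2 comparison kernel.

// Step 3 sketch: device-to-host copy of the result, then print.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int *dMatch;
    int hMatch[2] = {0, 0};

    cudaSetDevice(0);
    cudaMalloc((void **)&dMatch, 2 * sizeof(int));
    cudaMemset(dMatch, 0, 2 * sizeof(int));

    // Copy the two matching positions back to the CPU and print them.
    cudaMemcpy(hMatch, dMatch, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("suspect found at %d in f.bmp and %d in s.bmp\n",
           hMatch[0], hMatch[1]);

    cudaFree(dMatch);
    return 0;
}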
Walk through the source code.
Things to note:
• This is unoptimized and known to be inefficient, but it covers the concepts of asynchronous streams, GPU context switching, universal addressing, and peer-to-peer access
• The source code requires the tclap library to compile
• The source code will be made available in a GitHub repository after the workshop