NVidia CUDA Programming Guide


Ceng 545
GPU Computing
Grading
 Midterm Exam: 20%
 Homeworks: 40%
    Demo/knowledge: 25%
    Functionality: 40%
    Report: 35%
 Project: 40%
    Design Document: 25%
    Project Presentation: 25%
    Demo/Final Report: 50%
Textbooks / References
 D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, 978-0-12-381472-2
 J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, 978-0-13-138768-3
 Draft textbook by Prof. Hwu and Prof. Kirk available at the website
 NVIDIA, NVidia CUDA Programming Guide, NVidia, 2007 (reference book)
 Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
 Lecture Notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/
 Lecture notes will be posted at the class web site
GPU vs CPU
 A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
 For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has 3,200 million), while CPUs have few execution units and higher clock speeds.
 GPUs have much deeper pipelines (several thousand stages vs. 10-20 for CPUs).
 GPUs have significantly faster and more advanced memory interfaces, as they need to shift around a lot more data than CPUs.
Many-core GPUs vs Multicore CPU
 Design philosophies: the design of a CPU is optimized for sequential code performance (large cache memories are provided to reduce instruction and data access latencies).
 Memory bandwidth:
    CPU: it has to satisfy requirements from the OS, applications, and I/O devices.
    GPU: small cache memories are provided to help control the bandwidth requirements, so multiple threads that access the same memory data do not all need to go to the DRAM.
 Marketplace
CPU vs. GPU - Hardware
 More transistors devoted to data processing
Processing Element
 Processing element = thread processor = ALU
Memory Architecture
 Constant Memory
 Texture Memory
 Device Memory
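These spaces map directly onto CUDA C qualifiers. Below is a minimal sketch (our own illustration; names such as scale and coeff are invented) touching constant and device memory; texture memory, which would be bound through the texture-reference API, is left out for brevity:

```cuda
#include <cstdio>

__constant__ float coeff[16];           // constant memory: cached, read-only on the device

__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeff[0] * in[i];      // every thread reads the same constant value
}

int main() {
    const int n = 1024;
    float h_coeff[16] = {2.0f};
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;                // device (global) memory
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[3] = %f\n", h_out[3]);  // expect 6.0 (= 2.0 * 3)
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```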
Why Massively Parallel Processor
 A quiet revolution and potential build-up
Calculation: 367 GFLOPS vs. 32 GFLOPS
Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
Until recently, GPUs were programmed only through graphics APIs
[Chart: GFLOPS over time for successive GPU generations. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
GPU in every PC and workstation – massive volume and potential impact
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
GeForce 8800
16 highly threaded SMs (each with 8 SPs), 128 FPUs, 367 GFLOPS,
768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU
[Block diagram: the host feeds an input assembler and thread execution manager; arrays of SPs with parallel data caches, texture units, and load/store units all connect to global memory]
We have
GTX 480
15 highly threaded SMs (each with 32 SPs), 480 FPUs, 1344 GFLOPS,
1536 MB DRAM, 177.4 GB/s memory bandwidth
Tesla C2070
Compared to the latest quad-core CPUs, the Tesla C2050 and C2070
Computing Processors deliver equivalent supercomputing performance
at 1/10th the cost and 1/20th the power consumption.
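The figures above can be checked on whatever card is installed via the runtime's cudaGetDeviceProperties. A small sketch (our example; the bandwidth figure is a rough double-data-rate estimate derived from the reported clocks, not an official number):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0

    printf("Device:             %s\n",     prop.name);
    printf("SMs:                %d\n",     prop.multiProcessorCount);
    printf("Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);

    // memoryClockRate is in kHz, memoryBusWidth in bits; factor 2 for DDR.
    double bw = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8) / 1e9;
    printf("Peak mem BW (est.): %.1f GB/s\n", bw);
    return 0;
}
```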
Terms
 GPGPU
 General-Purpose computing on a Graphics Processing Unit
 Using graphic hardware for non-graphic computations
 CUDA
 Compute Unified Device Architecture
 Software architecture for managing data-parallel programming
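In this data-parallel model each thread handles one data element. A minimal sketch (the kernel name vecAdd is ours, not from the slides):

```cuda
// One thread per element: the loop disappears into the thread grid.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overshoot
        c[i] = a[i] + b[i];
}

// Host side: one thread per element, 256 threads per block.
//   vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```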
Parallel Programming
 MPI: computing nodes do not share memory; all data sharing and
interaction must be done through explicit message passing.
CUDA, by contrast, provides shared memory (see the sketch below).
 OpenMP: it has not been able to scale beyond a couple hundred
computing nodes due to thread management overheads and cache
coherence hardware requirements.
CUDA achieves simple thread management with no cache-coherence
hardware requirements.
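For instance, threads within a CUDA block cooperate through on-chip shared memory rather than message passing. A sketch (the blockSum kernel is hypothetical and assumes a launch with 256 threads per block):

```cuda
// Each block sums 256 input elements cooperatively in shared memory.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];                        // visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                  // lightweight barrier, no OS threads

    // Tree reduction: the data stays on chip instead of going back to DRAM.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];                     // one partial sum per block
}
```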
Future Apps Reflect a Concurrent World
 Exciting applications in the future mass-computing market have
traditionally been considered "supercomputing applications":
 molecular dynamics simulation, video and audio coding and
manipulation, 3D imaging and visualization, consumer game
physics, and virtual reality products
 These "super-apps" represent and model the physical, concurrent world
 Various granularities of parallelism exist, but...
 the programming model must not hinder parallel implementation
 data delivery needs careful management
Stretching Traditional Architectures
 Traditional parallel architectures cover some super-applications
 DSP, GPU, network apps, Scientific
 The game is to grow mainstream architectures “out” or domain-
specific architectures “in”
 CUDA is the latter
[Diagram: traditional applications fall within current architecture coverage; new applications lie between that coverage and domain-specific architecture coverage, separated by obstacles]
Previous Projects
Application  Description                                                                          Source (lines)  Kernel (lines)  % time
H.264        SPEC '06 version, change in guess vector                                                     34,811             194     35%
LBM          SPEC '06 version, change to single precision and print fewer reports                          1,481             285    >99%
RC5-72       Distributed.net RC5-72 challenge client code                                                  1,979             218    >99%
FEM          Finite element modeling, simulation of 3D graded materials                                    1,874             146     99%
RPES         Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion                            1,104             281     99%
PNS          Petri Net simulation of a distributed system                                                    322             160    >99%
SAXPY        Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine              952              31    >99%
TPACF        Two Point Angular Correlation Function                                                          536              98     96%
FDTD         Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation                 1,365              93     16%
MRI-Q        Computing a matrix Q, a scanner's configuration in MRI reconstruction                           490              33    >99%
(Source and Kernel count lines of code; % time is the share of execution time spent in the kernel.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
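For a sense of scale, the SAXPY entry above is the classic y = a*x + y loop, which in CUDA becomes a kernel with one thread per element. A sketch of the idea (our illustration, not the 31-line benchmark kernel itself):

```cuda
// y[i] = a * x[i] + y[i], one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)                                      // guard the tail
        y[i] = a * x[i] + y[i];
}
```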
Speedup of Applications
[Bar chart: kernel and application speedup, GPU relative to CPU, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; the y-axis runs from 0 to 60, and off-scale kernel bars are labeled 457, 431, 316, 263, 210, and 79]
 GeForce 8800 GTX vs. 2.2 GHz Opteron 248
 A 10× speedup in a kernel is typical, as long as the kernel can occupy enough
parallel threads (see the launch-sizing sketch below)
 25× to 400× speedup if the function's data requirements and control flow suit the
GPU and the application is optimized
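"Occupying enough parallel threads" means sizing the launch so every SM has work to hide memory latency behind. A sketch using the hypothetical saxpy kernel above:

```cuda
// Round the grid up so each of the n elements gets its own thread;
// with n in the millions, tens of thousands of resident threads keep
// the SMs busy while memory loads are in flight.
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```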