GPU Handout
Graphics Processing Units
REFERENCES:
• Computer Architecture, 5th edition, Hennessy and Patterson, 2012
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
• http://www.moderngpu.com/intro/performance.html
• http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU
• CPU: small fraction of chip used for arithmetic
http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
CPU vs. GPU
• GPU: large fraction of chip used for arithmetic
http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
CPU vs. GPU
• Intel Haswell: 170 GFlops on quad-core at 3.4GHz
• AMD Radeon R9 290: 4800 GFlops at 0.95GHz
• Nvidia GTX 970: 5000 GFlops at 1.05GHz
GPGPU
• General-purpose GPU programming
• Massively parallel
• Scientific computing, brain simulations, etc.
• In supercomputers
  • 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking)
  • Including 2nd and 6th places
OpenCL vs. CUDA
• Both for GPGPU
• OpenCL
  • Open standard
  • Supported on AMD, NVIDIA, Intel, Altera, …
• CUDA
  • Proprietary (Nvidia)
  • Losing ground to OpenCL?
• Similar performance
CUDA
• Programming on Parallel Machines, Norm Matloff, Chapter 5
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• Uses a thread hierarchy
  • Thread
  • Block
  • Grid
Thread
• Executes an instance of a kernel (program)
• ThreadID (within block), program counter, registers, private memory, input and output parameters
• Private memory for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6
Block
• Set of concurrently executing threads
• Cooperate via barrier synchronization and shared memory (fast but small)
• BlockID (within grid)
Nvidia Fermi Whitepaper pg 6
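A minimal sketch of block-level cooperation (a hypothetical kernel, not from the handout): each thread loads one value into shared memory, all threads meet at a barrier, then the block combines the values with a tree reduction. It assumes blocks of exactly 256 threads.

```cuda
// Hypothetical example: block-wide sum using shared memory and barriers.
// Assumes blockDim.x == 256; each block reduces 256 elements of `in`.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];          // fast per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];           // each thread loads one element
    __syncthreads();                    // barrier: all loads complete

    // tree reduction: 128 partial sums, then 64, 32, ... 1
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();                // barrier after each step
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];       // one result per block
}
```

Threads in *different* blocks cannot synchronize this way; they communicate only through slow global memory, between kernel launches.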
Grid
• Array of thread blocks running the same kernel
• Read and write global memory (slow – hundreds of cycles)
• Synchronize between dependent kernel calls
Nvidia Fermi Whitepaper pg 6
Hardware Mapping
• GPU
  • executes 1+ kernel (program) grids
• Streaming Multiprocessor (SM)
  • executes 1+ thread blocks
• CUDA core
  • executes thread
Fermi Architecture
• Debuted in 2010
• 512 CUDA cores
  • each executes 1 FP or integer instruction per cycle
• 32 CUDA cores per SM
• 16 SMs per GPU
• 6 64-bit memory ports
• PCI-Express interface to CPU
• GigaThread scheduler distributes blocks to SMs
  • each SM has a thread scheduler (in hardware)
  • fast context switch
• 3 billion transistors
Nvidia Fermi Whitepaper pg 7
CUDA core
• pipelined integer and FP units
• IEEE 754-2008 FP
  • fused multiply-add
• integer unit
  • boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
Streaming Multiprocessor (SM)
• 32 CUDA cores
• 16 ld/st units
  • calculate source/destination addresses
• Special Function Units
  • sin, cosine, reciprocal, sqrt
Nvidia Fermi Whitepaper pg 8
Warps
• 32 threads from a block are bundled into a warp, which executes the same instruction each cycle
• this becomes the minimum size of SIMD data
• warps are implicitly synchronized
• if threads branch in different directions, the warp steps through both paths using predicated instructions
• two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
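A hypothetical kernel (not from the handout) showing why divergence costs time: when threads of one warp take different sides of a branch, the warp executes both sides, with the inactive lanes predicated off.

```cuda
// Hypothetical divergence example. Lanes of one warp disagree on the
// branch, so the warp executes BOTH paths under predication.
__global__ void divergent(float *a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // even lanes active on this path...
        a[i] = a[i] * 2.0f;
    else                     // ...odd lanes active on this one
        a[i] = a[i] + 1.0f;  // the warp pays for both branches
}

// Branching at warp granularity instead (e.g. on (i / 32) % 2) avoids
// the penalty: all 32 lanes of a warp then take the same path.
```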
Maxwell Architecture
• 2014
• 16 streaming multiprocessors × 128 cores/SM = 2048 cores
Programming CUDA
C code:

daxpy(n, 2.0, x, y);  // invoke

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
Programming CUDA
CUDA code:

__host__
int nblocks = (n+511)/512;             // grid size
daxpy<<<nblocks,512>>>(n, 2.0, x, y);  // 512 threads/block

__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
n=8192, 512 threads/block

grid
  block0
    warp0:   Y[0]=A*X[0]+Y[0]          ... Y[31]=A*X[31]+Y[31]
    ...
    warp15:  Y[480]=A*X[480]+Y[480]    ... Y[511]=A*X[511]+Y[511]
  ...
  block15
    warp0:   Y[7680]=A*X[7680]+Y[7680] ... Y[7711]=A*X[7711]+Y[7711]
    ...
    warp15:  Y[8160]=A*X[8160]+Y[8160] ... Y[8191]=A*X[8191]+Y[8191]
Moving data between host and GPU

int main() {
    int n = 8192, nblocks = (n+511)/512;
    double *x, *y, *dx, *dy;
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y…
    cudaMalloc((void **)&dx, n*sizeof(double));
    cudaMalloc((void **)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double),
               cudaMemcpyHostToDevice); …
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);  // pass device pointers
    cudaThreadSynchronize();
    cudaMemcpy(y, dy, n*sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
}
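Every CUDA runtime call above returns an error code that the slide code ignores. A common pattern wraps each call in a checking macro; CHECK() here is our own hypothetical helper, while cudaError_t and cudaGetErrorString() are standard CUDA runtime API.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical error-checking macro (not part of the CUDA API itself).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
//   CHECK(cudaMalloc((void **)&dx, n*sizeof(double)));
//   CHECK(cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice));
```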