Transcript GPU Handout

Graphics Processing Units

References:
• Computer Architecture: A Quantitative Approach, 5th Edition, Hennessy and Patterson, 2012
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
• http://www.moderngpu.com/intro/performance.html
• http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU
• CPU: small fraction of the chip is used for arithmetic
http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html

CPU vs. GPU
• GPU: large fraction of the chip is used for arithmetic
http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite
CPU vs. GPU
• Intel Haswell: 170 GFlops on quad-core at 3.4 GHz
• AMD Radeon R9 290: 4800 GFlops at 0.95 GHz
• Nvidia GTX 970: 5000 GFlops at 1.05 GHz
GPGPU
• General-purpose GPU programming
• Massively parallel
• Scientific computing, brain simulations, etc.
• Used in supercomputers
  • 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking)
  • including the systems in 2nd and 6th place
OpenCL vs. CUDA
• Both are for GPGPU
• OpenCL
  • Open standard
  • Supported on AMD, NVIDIA, Intel, Altera, …
• CUDA
  • Proprietary (Nvidia)
  • Losing ground to OpenCL?
• Similar performance
CUDA
• Programming on Parallel Machines, Norm Matloff, Chapter 5
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• Uses a thread hierarchy (launch sketch below):
  • Thread
  • Block
  • Grid
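
A minimal sketch of how the hierarchy appears in source code (the kernel name whereami and the launch shape are illustrative, not from the slides): a launch names a grid of blocks of threads, and each thread locates itself with the built-in hierarchy variables.

__global__ void whereami(int *out) {
    // Global thread ID: which block in the grid, how wide a block is,
    // and which thread within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}
// Launch a grid of 16 blocks, 512 threads each:
// whereami<<<16, 512>>>(d_out);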
Thread
• Executes an instance of a kernel (program)
• Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
• Private memory is used for register spills, function calls, array variables
Nvidia Fermi Whitepaper pg 6
Block
• Set of concurrently executing threads
• Threads cooperate via barrier synchronization and shared memory (fast but small); sketch below
• Has a BlockID (within its grid)
Nvidia Fermi Whitepaper pg 6
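
An illustrative sketch of that cooperation (the kernel is hypothetical, not from the whitepaper): each thread of a block stages one element in shared memory, then the block synchronizes at a barrier before threads read each other's values. It assumes blocks of at most 512 threads.

__global__ void reverse_block(float *d) {
    __shared__ float s[512];        // shared memory: visible to the whole block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];             // each thread stages one element
    __syncthreads();                // barrier: wait until every store has landed
    d[base + t] = s[blockDim.x - 1 - t];  // now safe to read neighbors' data
}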
Grid
• Array of thread blocks running the same kernel
• Blocks read and write global memory (slow: hundreds of cycles)
• Synchronize between dependent kernel calls (sketch below)
Nvidia Fermi Whitepaper pg 6
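
A brief sketch of that synchronization (the kernel names step1/step2 are hypothetical): launches issued to the same stream run in order, so a dependent kernel sees the previous grid's global-memory writes, and the host can wait for completion explicitly.

step1<<<nblocks, 512>>>(d_data);   // writes global memory
step2<<<nblocks, 512>>>(d_data);   // same (default) stream: starts after step1 finishes
cudaDeviceSynchronize();           // host blocks until both grids complete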
Hardware Mapping
• GPU: executes 1+ kernel (program) grids
• Streaming Multiprocessor (SM): executes 1+ thread blocks
• CUDA core: executes a thread
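
The shape of this mapping for a particular card can be queried at run time; a minimal sketch using the CUDA runtime API:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);    // properties of device 0
    printf("%s: %d SMs, warp size %d\n",
           p.name, p.multiProcessorCount, p.warpSize);
    return 0;
}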
Fermi Architecture
• Debuted in 2010; 3 billion transistors
• 512 CUDA cores
  • each executes 1 FP or integer instruction per cycle
  • 32 CUDA cores per SM, 16 SMs per GPU
• 6 64-bit memory ports
• PCI-Express interface to the CPU
• GigaThread scheduler distributes blocks to SMs
  • each SM has a hardware thread scheduler
  • fast context switch
Nvidia Fermi Whitepaper pg 7
CUDA Core
• Pipelined integer and FP units
• IEEE 754-2008 FP, including fused multiply-add (sketch below)
• Integer unit: boolean, shift, move, compare, ...
Nvidia Fermi Whitepaper pg 8
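
Fused multiply-add computes a*b+c with a single rounding step, rather than rounding once after the multiply and again after the add. A small illustrative device function (hypothetical, not from the whitepaper):

__global__ void fma_demo(double *out, double a, double b, double c) {
    out[0] = fma(a, b, c);  // fused: one rounding of the exact a*b+c
    out[1] = a * b + c;     // may round twice if the compiler does not contract it
}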
Streaming Multiprocessor (SM)
• 32 CUDA cores
• 16 ld/st units
  • calculate source/destination addresses
• 4 Special Function Units (SFUs)
  • sine, cosine, reciprocal, square root (intrinsics sketch below)
Nvidia Fermi Whitepaper pg 8
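
The SFUs back CUDA's fast-math intrinsics; a tiny illustration (the intrinsic __sinf and the library sinf are real CUDA functions, the kernel itself is hypothetical):

__global__ void sfu_demo(float *out, float x) {
    out[0] = __sinf(x);   // fast hardware approximation on the SFU
    out[1] = sinf(x);     // slower, more accurate software expansion
}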
Warps
• 32 threads from a block are bundled into a warp, which executes the same instruction each cycle
• the warp is therefore the minimum size of the SIMD data path
• warps are implicitly synchronized
• if threads branch in different directions, the warp steps through both paths using predicated instructions (sketch below)
• two warp schedulers each select 1 instruction from a warp per cycle to issue to 16 cores, 16 ld/st units, or 4 SFUs
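
A minimal illustration of such divergence (hypothetical kernel): the condition splits every warp, so the warp executes both sides, with each thread keeping only the results of its own path via predication.

__global__ void diverge(int *out) {
    int i = threadIdx.x;
    if (i % 2 == 0)        // splits the warp: both paths are stepped through,
        out[i] = 1;        // non-matching threads are predicated off
    else
        out[i] = 2;
}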
Maxwell Architecture
• Debuted in 2014
• 16 streaming multiprocessors × 128 cores/SM = 2048 cores
Programming CUDA
• Plain C code:

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

daxpy(n, 2.0, x, y); // invoke
Programming CUDA
• CUDA code:

__host__ // host side: set up the grid and launch
int nblocks = (n+511)/512;            // grid size (round up)
daxpy<<<nblocks,512>>>(n, 2.0, x, y); // 512 threads/block

__global__ // device side: one thread per element
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
n = 8192, 512 threads/block:

grid
  block 0
    warp 0:  Y[0]=A*X[0]+Y[0] ... Y[31]=A*X[31]+Y[31]
    ...
    warp 15: Y[480]=A*X[480]+Y[480] ... Y[511]=A*X[511]+Y[511]
  ...
  block 15
    warp 0:  Y[7680]=A*X[7680]+Y[7680] ... Y[7711]=A*X[7711]+Y[7711]
    ...
    warp 15: Y[8160]=A*X[8160]+Y[8160] ... Y[8191]=A*X[8191]+Y[8191]
Moving data between host and GPU

#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    int n = 8192, nblocks = (n+511)/512;
    double *x, *y, *dx, *dy;                      // host and device arrays
    x = (double *)malloc(sizeof(double)*n);
    y = (double *)malloc(sizeof(double)*n);
    // initialize x and y…
    cudaMalloc((void **)&dx, n*sizeof(double));   // allocate device memory
    cudaMalloc((void **)&dy, n*sizeof(double));
    cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n*sizeof(double), cudaMemcpyHostToDevice);
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);       // kernel gets device pointers
    cudaDeviceSynchronize();                      // wait for the kernel to finish
    cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx); cudaFree(dy);
    free(x); free(y);
}
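
Each CUDA runtime call above returns a cudaError_t that real code should check; a minimal sketch (assumes <stdio.h> is included):

cudaError_t err = cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));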