GPU Computing
Dr. Bo Yuan
E-mail: [email protected]
Overview
Foundation: GPU, CUDA, Thread, Memory Structure
Intermediate: Kernel, Vector Addition, Matrix Multiplication, Shared Memory
Advanced: Warp, Memory Access, Resource Optimization, Dynamic Parallelism
Extension: Floating Point, Stream, Multiple GPUs, Parallel Matlab
2
What is a GPU?
• Graphics Processing Unit
• First GPU: GeForce 256 (1999)
• Connected to the motherboard via PCI Express
• High computational density and memory bandwidth
• Massively multithreaded many-core chips
• Traditionally used for real-time rendering
• Several million units are sold each year.
3
Graphics Cards
4
GPU Pipeline
5
GPU Pipeline
Rasterization
6
Anti-Aliasing
[Figure: triangle geometry rendered aliased vs. anti-aliased]
7
GPGPU
• General-Purpose Computing on GPUs
• Massively parallel, simple operations
• Suitable for compute-intensive engineering problems
• The original problem needs to be cast into native graphics operations.
• Launched through OpenGL or DirectX API calls
• Input data are stored in texture images and issued to the GPU by submitting triangles.
• Highly restricted access to input/output
• Very tedious; only limited success, achieved with painstaking effort
8
Trend of Computing
9
CPU vs. GPU
[Figure: the CPU devotes most of its die area to control logic and cache with a few ALUs, while the GPU devotes most of its die area to a large array of ALUs; both connect to DRAM. Key differences: number of ALUs and memory bandwidth.]
10
Power of the Crowd
• SM: Streaming Multiprocessor
  – Multi-threaded processor core
  – Processing unit for a thread block
  – Contains SPs (Streaming Processors) and SFUs (Special Function Units)
• SP: Streaming Processor
  – Scalar ALU for a single CUDA thread
• SIMT: Single-Instruction, Multiple-Thread
  – Shared instruction fetch per 32 threads (warp)
[Figure: a Streaming Multiprocessor with an Instruction L1 cache, instruction fetch/dispatch unit, shared memory, eight SPs and two SFUs.]
11
Need For Speed
12
Green Computing
[Chart: GFLOPS per Watt — GTX 750 Ti: 21.8, GTX 680: 15.85, GTX 580: 6.48, Intel Core i7-980XE: 1.7]
13
Supercomputing
• TITAN, Oak Ridge National Laboratory
• Speed: 24.8 PFLOPS (theoretical), 17.6 PFLOPS (measured)
• CPU: AMD Opteron 6274 (18,688 × 16 cores)
• GPU: NVIDIA Tesla K20 (18,688 × 2496 cores)
• Cost: US$97 million
• Power: 9 MW
14
Personal Supercomputer
15
What is CUDA?
• Compute Unified Device Architecture
• Introduced by NVIDIA in 2007
• Scalable parallel programming model
• Small extensions to standard C/C++
• Enables general-purpose GPU computing
• Straightforward APIs to manage devices, memory, etc.
• Only supports NVIDIA GPUs
http://developer.nvidia.com/category/zone/cuda-zone
16
CUDA-Enabled GPU
[Figure: the host feeds an input assembler and a thread execution manager, which dispatch work to an array of multiprocessors, each with its own parallel data cache and texture unit; load/store units connect them to global memory.]
17
CUDA GPUs
Compute Capability | GPUs | Cards
2.0 | GF100, GF110 | GeForce GTX 470, GTX 480, GTX 570, GTX 580, GTX 590, Tesla C2050, C2070
2.1 | GF104, GF114, GF116, GF108, GF106 | GeForce GT 430, GT 440, GTX 460, GTX 550 Ti, GTX 560 Ti, GT 640, GT 630
3.0 | GK104, GK106, GK107 | GeForce GTX 690, GTX 680, GTX 670, GTX 660, GTX 650 Ti, GTX 650
3.5 | GK110, GK208 | GeForce GTX TITAN, GT 640 (Rev. 2), GT 630 (Rev. 2), Tesla K40, Tesla K20
5.0 | GM107, GM108 | GeForce GTX 750 Ti, GTX 750, GTX 860M
18
Fermi Architecture
19
Kepler Architecture
• GeForce GTX 680 (Mar. 22, 2012)
• GK104, 28 nm process
• 3.5 billion transistors on a 294 mm² die
• CUDA cores: 1536 (8 SMs × 192 SPs)
• Memory bandwidth: 192 GB/s
• Peak performance: 3090 GFLOPS
• TDP: 195 W
• Release price: $499
20
Maxwell Architecture
• GeForce GTX 750 Ti (Feb. 18, 2014)
• GM107, 28 nm process
• 1.87 billion transistors on a 148 mm² die
• CUDA cores: 640 (5 SMs × 128 cores)
• Memory bandwidth: 86.4 GB/s
• Peak performance: 1306 GFLOPS
• TDP: 60 W
• Release price: $149
21
CUDA Teaching Lab
• GTX 750 (GM107): Compute Capability 5.0, 512 CUDA cores, 1 GB 128-bit GDDR5, 80 GB/s, 1044 GFLOPS, TDP 55 W, RMB 799
• GT 630 (GK208): Compute Capability 3.5, 384 CUDA cores, 2 GB 64-bit GDDR3, 14.4 GB/s, 692.7 GFLOPS, TDP 25 W, RMB 419
22
CUDA Installation
https://developer.nvidia.com/cuda-downloads
23
CUDA: deviceQuery
24
CUDA: bandwidthTest
25
CUDA Applications
26
CUDA Showcase
27
Heterogeneous Computing
28
Heterogeneous Computing
29
Grids, Blocks and Threads
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device. Grid 1 contains Blocks (0,0) through (2,1); Block (1,1) contains Threads (0,0) through (4,2).]
30
Thread Block
• Threads have thread ID numbers within their block.
• Threads use their thread ID to select work.
• Threads are assigned to SMs at block granularity.
• Each GT200 SM can hold a maximum of 8 blocks.
• Each GT200 SM can hold a maximum of 1024 threads.
• Threads in the same block can share data and synchronize.
• Threads in different blocks cannot cooperate.
• Each block can execute in any order relative to other blocks.
31
Transparent Scalability
[Figure: the same kernel grid of Blocks 0–7 runs on a small device two blocks at a time and on a larger device four blocks at a time; blocks may execute in any order.]
32
Code Example
33
Hello World!
// Standard C:
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

// With a CUDA kernel:
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
34
Device Code
• The CUDA keyword __global__ indicates a kernel function that:
  – Runs on the device.
  – Is called from the host.
• The CUDA keyword __device__ indicates a device function that:
  – Runs on the device.
  – Is called from a kernel function or another device function.
• Triple angle brackets <<< >>> indicate a call from host code to device code (a kernel launch).
• nvcc separates the source code into two components:
  – Device functions are processed by the NVIDIA compiler.
  – Host functions are processed by the standard host compiler.
  – $ nvcc hello.cu
35
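To make the distinction concrete, here is a minimal sketch (not from the slides) of a __device__ helper called from a __global__ kernel; the names square and scale_kernel are illustrative only.

// Device function: runs on the GPU, callable only from device code.
__device__ float square(float x) {
    return x * x;
}

// Kernel function: runs on the GPU, launched from the host with <<< >>>.
__global__ void scale_kernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] = square(data[i]);
}

Compiling the file with nvcc separates the two parts automatically, exactly as described above.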
Memory Management
• Host and device memories are separate entities.
• Device pointers point to GPU memory.
– May be passed to/from host code.
– May not be dereferenced in host code.
• Host pointers point to CPU memory
– May be passed to/from device code.
– May not be dereferenced in device code.
• CUDA APIs for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– C equivalents: malloc(), free(), memcpy()
36
Memory Space
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
[Figure: each block has its own shared memory and per-thread registers; all blocks share the global and constant memory, which the host can also access.]
GeForce GTX 680:
  Memory bandwidth … 192 GB/s
  Single-precision floating point … 4 B
  Peak performance … 3090 GFLOPS
  Practical performance … 48 GFLOPS (192 GB/s ÷ 4 B per operand, if every operand has to be fetched from global memory)
37
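As an illustration (not from the slides), the snippet below sketches how each memory space is typically declared in CUDA C; the array and kernel names are hypothetical.

// Constant memory: read-only for kernels, set by the host.
__constant__ float coef[16];

__global__ void memory_spaces(float *global_data) {   // global_data resides in global memory
    __shared__ float tile[256];     // shared memory: one copy per block
    int i = threadIdx.x;            // automatic variable: held in a per-thread register
    tile[i] = global_data[i] * coef[0];
    __syncthreads();
    global_data[i] = tile[i];
}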
Addition on Device
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
• add() will execute on the device.
• add() will be called from the host.
• a, b and c must point to device memory.
• We need to allocate memory on the GPU.
38
Addition on Device: main()
int main(void) {
    int a, b, c;              // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = 2;
    b = 7;

    // Copy inputs to device (CPU -> GPU)
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
39
Addition on Device: main()
    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host (GPU -> CPU)
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
40
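The slides omit error handling; as a hedged aside reusing the variables above, CUDA runtime calls return a cudaError_t that can be checked, for example:

// Minimal error check (illustrative; not part of the original example).
cudaError_t err = cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));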
Moving to Parallel
• Each call to add() adds two integers.
• With add() running in parallel, we can do vector addition in parallel:
  add<<<nblocks, 1>>>(d_a, d_b, d_c)
• Each parallel invocation of add() is referred to as a block.
• By using blockIdx.x to index into the arrays, each block handles a different index.
• Blocks can be arranged in 2D:
  – dim3 nblocks(M, N)
  – blockIdx.x, blockIdx.y
41
Vector Addition on Device
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Block 0: c[0]=a[0]+b[0];
Block 1: c[1]=a[1]+b[1];
Block 2: c[2]=a[2]+b[2];
Block 3: c[3]=a[3]+b[3];
42
Vector Addition on Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate space for host copies of a, b, c and set up initial values
    a = (int *)malloc(size); rand_ints(a, N);
    b = (int *)malloc(size); rand_ints(b, N);
    c = (int *)malloc(size); rand_ints(c, N);
43
Vector Addition on Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
44
CUDA Threads
• Each block can be split into parallel threads.
• Threads can be up to 3D:
– dim3 nthreads(M, N, P)
– threadIdx.x, threadIdx.y, threadIdx.z
__global__ void add (int *a, int *b, int *c) {
c[threadIdx.x]=a[threadIdx.x]+b[threadIdx.x];
}
add<<<1, N>>>(d_a, d_b, d_c);
45
Combining Blocks and Threads
• We have seen parallel vector addition using:
  – Many blocks with one thread each
  – One block with many threads
• Let's adapt vector addition to use both blocks and threads.
  – Why bother?
46
Indexing
M = 8; // 8 threads/block
int index = threadIdx.x + blockIdx.x * M;
int index = threadIdx.x + blockIdx.x * blockDim.x;
For example, with 8 threads per block, thread 5 of block 2 computes index 2 × 8 + 5 = 21.

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
47
Indexing
#define N (2048*2048)
#define M 512 // THREADS_PER_BLOCK
…
add<<<N/M, M>>>(d_a, d_b, d_c);
add<<<(N+M-1)/M, M>>>(d_a, d_b, d_c, N);   // handles N that is not a multiple of M

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
48
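As a quick sanity check (an illustrative sketch, not from the slides), the grid-size arithmetic for an N that is not a multiple of the block size; the value 1000 is hypothetical:

#define M 512                          // threads per block
int N = 1000;                          // hypothetical vector length
int nblocks = (N + M - 1) / M;         // = 2 blocks -> 1024 threads launched
add<<<nblocks, M>>>(d_a, d_b, d_c, N); // threads with index >= N do nothing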
Data Access Pattern
[Figure: a 1D stencil — each output element is the sum of the input elements within "radius" positions on either side, so each input element is read multiple times.]
49
Sharing Data Between Threads
• Each thread generates one output element.
  – blockDim.x elements per block
• Each input element needs to be read several times.
  – High I/O cost
• Within a block, threads can share data via shared memory.
  – Data are not visible to threads in other blocks.
• Shared memory is extremely fast on-chip memory.
• Declared using the keyword __shared__, allocated per block.
• Read (blockDim.x + 2*radius) input elements from global to shared memory.
50
Collaborative Threads
[Figure: with radius = 3 and blockDim.x = 10, input elements 0–15 are staged into the shared array temp; threads T0–T9 produce the blockDim.x output elements.]
• Thread 0 produces the values of temp[i], i = 0, 3, 13.
• Thread 9 requires the values of temp[i], i = 9, 10, 11, 12, 13, 14, 15.
• void __syncthreads() is therefore needed so that no thread reads shared data before it has been written.
51
Kernel Synchronization
__global__ void vector_sum(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2*RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;   // global index
    int lindex = threadIdx.x + RADIUS;                    // local index

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        // Some threads do extra work: load the halo elements
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    __syncthreads();   // make sure temp is ready

    // Sum based on shared memory
    int offset, result = 0;
    for (offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}
52
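A possible host-side launch for this kernel (a sketch under stated assumptions, not from the slides): it assumes BLOCK_SIZE and RADIUS are compile-time constants and that the input array carries RADIUS halo elements at each end, since the kernel reads in[gindex - RADIUS] and in[gindex + BLOCK_SIZE].

#define BLOCK_SIZE 256
#define RADIUS 3
#define N (BLOCK_SIZE * 64)      // hypothetical number of output elements

// d_in points to N + 2*RADIUS ints; d_in + RADIUS is the first "real" element.
// d_out points to N ints.
vector_sum<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);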
Matrix Multiplication
// Host Code Only
void MatrixMulOnHost(float *M, float *N, float *P, int Width)
{
    int i, j, k;
    float a, b, sum;
    for (i = 0; i < Width; ++i)
        for (j = 0; j < Width; ++j) {
            sum = 0;
            for (k = 0; k < Width; ++k) {
                a = M[i * Width + k];
                b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
[Figure: row i of M and column j of N combine to produce element P(i, j); all matrices are Width × Width.]
53
Single Thread Block
dim3 dimGrid(1,1);
dim3 dimBlock(Width, Width);
…
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
…
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd,
int Width){
int k=0;
float Pvalue = 0, Melement, Nelement;
for (k = 0; k < Width; ++k) {
Melement = Md[threadIdx.y*Width+k]; // Md[threadIdx.y, k]
Nelement = Nd[k*Width+threadIdx.x]; // Nd[k, threadIdx.x]
Pvalue += Melement * Nelement;
}
// Pd[threadIdx.y, threadIdx.x]
Pd[threadIdx.y*Width+threadIdx.x] = Pvalue;
}
54
Single Thread Block
• Each thread computes one element of Pd.
• Each thread:
  – Loads a row of matrix Md.
  – Loads a column of matrix Nd.
  – Performs one multiply and one addition for each pair of Md and Nd elements.
• What is the maximum size of the matrix?
• CGMA: Compute to Global Memory Access ratio
[Figure: Grid 1 contains a single block; Thread (2, 2) reads one row of Md and one column of Nd to produce one element of Pd.]
55
Multiple Blocks
• Break Pd into square tiles of size TILE_WIDTH × TILE_WIDTH.
• Each block calculates one tile (Pdsub):
  – Each thread calculates one element.
  – Block size equals tile size.
• Requires both the block ID and the thread ID.
[Figure: the WIDTH × WIDTH matrix Pd is partitioned into tiles; block (bx, by) computes the tile Pdsub from the corresponding strip of Md and Nd.]
56
Multiple Blocks: An Example
TILE_WIDTH = 2
[Figure: a 4 × 4 result matrix Pd is divided into four 2 × 2 tiles; Block(0,0), Block(1,0), Block(0,1) and Block(1,1) each compute one tile, e.g. Block(0,0) computes Pd0,0, Pd1,0, Pd0,1 and Pd1,1 from the corresponding rows of Md and columns of Nd.]
57
Multiple Blocks: Indexing
• Tile size: TILE_WIDTH
• Block: blockIdx.x, blockIdx.y
• Thread: threadIdx.x, threadIdx.y
• Row: blockIdx.y * TILE_WIDTH + threadIdx.y
• Col: blockIdx.x * TILE_WIDTH + threadIdx.x
[Figure: a 4 × 4 grid of elements (0,0)–(3,3), with blockIdx and threadIdx labelled along the two axes.]
58
Multiple Blocks: Device Code
__global__ void MatrixMulKernel(float* Md, float* Nd,
float* Pd, int Width){
// Calculate the row index of the Pd element and Md
int Row=blockIdx.y*TILE_WIDTH+threadIdx.y;
// Calculate the col index of the Pd element and Nd
int Col=blockIdx.x*TILE_WIDTH+threadIdx.x;
int k;
float Pvalue = 0;
// Each thread computes one element of Pd
for (k = 0; k < Width; ++k)
Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
Pd[Row*Width+Col] = Pvalue;
}
59
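For completeness, a hedged sketch of the corresponding host-side launch (the slides only show the single-block launch); it assumes Width is a multiple of TILE_WIDTH.

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per tile
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per element of the tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);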
Block Granularity
[Figure: two SMs, each with its own instruction unit, SPs and shared memory, holding several blocks of threads t0, t1, t2, …, tm.]
• Each SM in GT200 can take up to 1024 threads and 8 blocks.
• 8×8 blocks: 64 threads per block; 1024/64 = 16 blocks, but only 8 are allowed, so 64×8 = 512 threads per SM (half capacity).
• 16×16 blocks: 256 threads per block; 1024/256 = 4 blocks — full capacity!
• 32×32 blocks: 1024 threads per block, exceeding the limit of 512 threads per block.
60
Global Memory Access
• Each thread requires one row from Md and one column from Nd.
• For a k×k thread block, each row/column will be accessed k times.
• To reduce the global memory I/O, it is beneficial to load the required data once into the shared memory.
[Figure: a 2×2 thread block T0,0–T1,1 reading rows of Md and columns of Nd.]
61
Splitting Md
• The shared memory per SM is limited (e.g., 64 KB).
• It is shared among all blocks in the same SM.
• Luckily, not all data needs to be in the shared memory simultaneously.
[Figure: the Md tile is loaded in phases — Phase 1 stages Md0,0, Md1,0, Md0,1, Md1,1 into the shared array Mds; Phase 2 stages Md2,0, Md3,0, Md2,1, Md3,1.]
62
Shared Memory: Device Code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width){
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaboratively load Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();   // make sure the shared memory is ready
63
Shared Memory: Device Code
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // make sure all threads have finished working on the shared memory
    }
    Pd[Row*Width+Col] = Pvalue;
}
• Each SM in G80 can take up to 768 threads and 8 blocks.
• Each SM has 16 KB of shared memory.
• For a tile of size 16 × 16, each block requires 16 × 16 × 4 = 1 KB for Mds.
• In total, 2 KB are required for each block, so 8 blocks could be supported.
• However, since 768/(16 × 16) = 3, only 3 blocks and 6 KB will be in use.
64
Performance Considerations
• GPU computing is easy — as long as performance is not a major concern:
  – Host, Device, Kernel, Block, Thread
  – GPU cards
  – Up and running in a few days
• Once performance matters:
  – There are various performance bottlenecks.
  – A 10× speedup is often within your reach.
  – A 100× speedup takes a significant amount of tuning effort.
65
Thread Execution
• Conceptually, threads in a block can execute in any order with respect to each other.
  – Exception: barrier synchronizations
• Each thread block is partitioned into warps.
  – Hardware cost considerations
  – The unit of thread scheduling in SMs
  – 32 threads per warp: [0, …, 31], [32, …, 63], …
• All threads in a warp execute the same instruction.
• SM hardware implements zero-overhead thread scheduling.
  – Long-latency operations can be tolerated when several other warps are around.
  – The GPU does not require as much chip area for cache memories and branch prediction mechanisms as CPUs do.
66
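As an illustrative aside (not from the slides), CUDA exposes the warp size as the built-in variable warpSize; a thread can compute its warp and lane indices like this (the kernel and array names are hypothetical):

__global__ void warp_info(int *warp_id, int *lane_id) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    warp_id[tid] = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    lane_id[tid] = threadIdx.x % warpSize;   // position of the thread within its warp (0..31)
}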
Warp Scheduling
• Suppose 1 global memory access is needed for every 4 instructions.
• Instruction: 4 clock cycles
• Memory latency: 200 clock cycles
• At least 14 warps are required to keep the units fully utilized: 200/(4 × 4) = 12.5, so 13 additional warps are needed to cover the latency, plus the warp that is waiting.
[Figure: warps A, B, C, D take turns executing on the SM while each waits for its memory access, then A runs again.]
67
Flow Divergence
• SIMT can reduce the cost of fetching and processing instructions.
• SIMT works well when all threads in a warp follow the same control flow.
• Multiple sequential passes may be required for an if-then-else construct: the "then" path and the "else" path are executed one after the other.
[Figure: the SM warp scheduler interleaves instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96.]
68
Flow Divergence
69
70
Flow Divergence
• The main performance concern with branching is divergence.
  – Threads within a warp take different paths.
  – The control paths are traversed one at a time.
• How can divergence be avoided when the branch condition is a function of the thread ID?
  – With divergence:
    • if (threadIdx.x>2) { }
    • Threads 0, 1 and 2 follow a different path than the rest of the threads in the warp.
  – Without divergence:
    • if (threadIdx.x/WARP_SIZE>=2) { }
    • Creates two different paths for threads in a block.
    • Branch granularity is a whole multiple of the warp size.
    • All threads in any given warp follow the same path.
71
Memory Coalescing
• Dynamic Random Access Memory (DRAM)
  – Each bit is stored in a separate capacitor.
  – All storage locations have nearly identical access time.
  – In practice, many consecutive locations are accessed in parallel.
  – All threads in a warp should access consecutive memory locations (coalescing) to maximize memory bandwidth utilization.
[Figure: threads traversing rows of Md produce accesses that are not coalesced, while threads traversing columns of Nd produce coalesced accesses.]
72
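To make the access patterns concrete, here is an illustrative sketch (not from the slides) of a coalesced and a strided read; the kernel and parameter names are hypothetical.

// Coalesced: consecutive threads read consecutive addresses.
__global__ void read_coalesced(float *in, float *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = in[i];
}

// Not coalesced: consecutive threads read addresses "stride" elements apart.
__global__ void read_strided(float *in, float *out, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = in[i * stride];
}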
Memory Layout of a Matrix in C
[Figure: the 4 × 4 matrix M is stored in row-major order as M0,0 M1,0 M2,0 M3,0 M0,1 … M3,3; threads T0–T3 access elements over Time Period 1 and Time Period 2, illustrating one of the two access patterns.]
73
Memory Layout of a Matrix in C
[Figure: the same row-major layout of M, with threads T0–T3 accessing elements over Time Period 1 and Time Period 2 in the other access pattern, contrasting coalesced with uncoalesced accesses.]
74
Shared Memory Architecture
• Many threads access memory, so shared memory is divided into banks:
  – Successive 32-bit words are assigned to successive banks.
  – Each bank has a bandwidth of 32 bits per clock cycle.
  – G80 has 16 banks: bank = address % 16
  – The same as the size of half a warp
• Each memory bank can service one address per cycle.
  – Can service as many simultaneous accesses as there are banks.
• Multiple simultaneous accesses to the same bank may result in a bank conflict.
  – Conflicting accesses are serialized.
  – There are no bank conflicts between different half-warps.
75
Bank Conflicts
• Shared memory is as fast as registers if there are no bank conflicts.
• The fast case:
  – If all threads of a half-warp access different banks, there is no bank conflict.
  – If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).
• The slow case:
  – Bank conflict: multiple threads in the same half-warp access the same bank.
  – The accesses must be serialized.
  – Cost = max # of simultaneous accesses to a single bank
[Figure: threads mapped onto Banks 0–15.]
76
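A common remedy, shown here as an illustrative sketch (not from the slides), is to pad a shared-memory tile by one element so that the elements of a column fall into different banks; it assumes a square matrix whose side is gridDim.x * TILE, and TILE is a hypothetical constant.

#define TILE 16

__global__ void transpose_tile(float *in, float *out) {
    // The extra "+ 1" pads each row: a column of the tile then spans
    // different banks instead of causing a 16-way bank conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int width = gridDim.x * TILE;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column read, conflict-free thanks to padding
}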
Bank Addressing Example
• No bank conflicts
  – Linear addressing: Threads 0–15 map one-to-one onto Banks 0–15.
• No bank conflicts
  – Random 1:1 permutation: each thread still hits a distinct bank.
77
Bank Addressing Example
• 2-way bank conflicts
  – Linear addressing, stride = 2: Threads 0 and 8 hit the same bank, Threads 1 and 9 hit the same bank, and so on.
• 8-way bank conflicts
  – Linear addressing, stride = 8: eight threads hit Bank 0 and eight threads hit Bank 8 (x8).
78
Partitioning of SM Resources
• Execution resources in an SM:
  – Registers
  – Block slots (GT200: 8)
  – Thread slots (GT200: 1024)
  – Number of 16×16 blocks = 1024/(16×16) = 4
  – These determine the number of threads running on an SM.
  – Subtle interactions may cause underutilization of resources.
• Register file:
  – Stores automatic variables declared in a CUDA kernel.
  – G80: 32 KB (8192 entries) per SM
  – Dynamically partitioned across all blocks in the same SM.
  – Each thread can only access the registers assigned to itself.
79
SM Resources Example
• For 16×16 blocks, if each thread uses 10 registers:
  – Each block requires 16×16×10 = 2560 registers.
  – Number of blocks = 8192/2560 = 3
• If each thread increases its register use by 1:
  – Each block now requires 16×16×11 = 2816 registers.
  – Number of blocks = 8192/2816 = 2
  – Only two blocks can run on an SM.
  – The number of threads drops from 768 to 512.
  – A 1/3 reduction of parallelism due to a single extra automatic variable!
80
Occupancy Calculator
81
Instruction Mix
• Each processor core has limited instruction processing bandwidth.
• Every instruction consumes processing bandwidth:
  – Floating point calculation
  – Load instruction
  – Branch instruction
• We should try to increase the efficiency of instructions.

for (int k=0; k<BLOCK_SIZE; k++)
    Pvalue += Mds[ty][k]*Nds[k][tx];

Per iteration: 1 loop counter increment instruction, 1 loop branch instruction, 2 address arithmetic instructions, and only 2 floating point arithmetic instructions.
82
Instruction Mix
Loop Unrolling:
Pvalue += Mds[ty][0]*Nds[0][tx] + … + Mds[ty][15]*Nds[15][tx];
• Express the dot-product computation as one long multiply-add expression.
• Eliminates the loop branch instruction.
• Eliminates the loop counter update.
• Matrix indices become constants rather than variables.
• With the help of the compiler, the address arithmetic instructions can also be eliminated!
83
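Rather than unrolling by hand, nvcc can also be asked to unroll a loop with #pragma unroll; a hedged sketch applied to the tiled kernel's inner loop (not from the slides):

// Hint to the compiler to fully unroll the fixed-trip-count loop.
#pragma unroll
for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];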
Dynamic Parallelism
• A child CUDA kernel can be called from within a parent CUDA kernel, without CPU involvement.
• An extension to the flat, single level of parallelism
• Requires Compute Capability 3.5+
• Benefits:
  – Simplified CPU/GPU cooperation
  – Dynamic load balancing
  – Data-dependent execution
  – Recursive parallel algorithms
84
What does it mean?
[Figure: without dynamic parallelism, the CPU launches every kernel on the GPU; with dynamic parallelism, GPU kernels launch further kernels themselves.]
85
What does it mean?
86
Example
__global__ void ChildKernel(void *data){
    // Operate on data
}

__global__ void ParentKernel(void *data){
    ChildKernel<<<1, 16>>>(data);
}

// In host code
ParentKernel<<<256, 64>>>(data);

__global__ void RecursiveKernel(void *data){
    if(continueRecursion == true)
        RecursiveKernel<<<64, 16>>>(data);
}
87
Matrix Example
for (int i=0; i<N; i++){
for (int j=0; j<M; j++){
convolution_function(i, j);
}
}
88
Matrix Example
for (int i=0; i<N; i++){
    for (int j=0; j<M[i]; j++){
        convolution_function(i, j);
    }
}
Oversubscription: with a flat launch, enough threads must be provided for the largest M[i], leaving many threads idle on the shorter rows.
89
Matrix Example
__global__ void con_kernel(int i){
convolution_function(i, threadIdx.x);
}
__global__ void dynamic_parallelism_kernel(int *M){
con_kernel<<<1, M[blockIdx.x]>>>(blockIdx.x);
}
//In Host Code
dynamic_parallelism_kernel<<<N, 1>>>(M);
90
Synchronization
__global__ void Parent_Kernel() {
...
//Kernel code
if(threadIdx.x==0){
Child_Kernel<<<1, 256>>>();
// Thread will launch kernel and keep going
cudaDeviceSynchronize();
// Make thread wait for Child_Kernel to complete
}
__syncthreads();
//If all threads in the block need Child_Kernel to complete
...
//Code that needs data generated by Child_Kernel
}
91
Timing GPU Kernels
float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                  // Place the start event
kernel <<<..>>>(..);                        // Returns immediately
cudaEventRecord(stop, 0);                   // Place the stop event
cudaEventSynchronize(stop);                 // Make sure stop is reached
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
[Figure: on Stream 0, the start event, the kernel and the stop event are placed in order.]
92
Multiple Kernels
// Create two streams
cudaStream_t stream[2];
for (int i=0; i<2; ++i)
cudaStreamCreate(&stream[i]);
// Launch a Kernel from each stream
KernelOne <<<64, 512, 0, stream[0]>>>(..);
KernelTwo <<<64, 512, 0, stream[1]>>>(..);
// Destroy the streams
for (int i=0; i<2; ++i)
cudaStreamDestroy(stream[i]);
• Synchronization is implied for events within the same stream.
• More than one stream can be associated with a GPU.
93
Multiple GPUs
int nDevices;
cudaGetDeviceCount(&nDevices);
cudaDeviceProp prop;
for (int i=0; i<nDevices; i++) {
cudaGetDeviceProperties(&prop, i);
printf("Device Number: %d\n", i);
printf("Device Name: %s\n", prop.name);
printf("Compute Capability: %d.", prop.major);
printf("%d\n", prop.minor);
printf("Memory Bus Width: %d\n", prop.memoryBusWidth);
}
94
Streams and Multiple GPUs
cudaSetDevice(0);
cudaStreamCreate(&streamA);
cudaSetDevice(1);
cudaStreamCreate(&streamB);

// Launch kernels
KernelOne <<<..., streamA>>>(..);   // Invalid! (streamA belongs to device 0, but device 1 is active)
KernelTwo <<<..., streamB>>>(..);   // Valid

• Streams belong to the GPU that was active when they were created.
• Calls to a stream are invalid if its associated GPU is not active.
95
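A hedged sketch of one way to make both launches valid (not from the slides): select each stream's device before launching on it; grid and block stand for whatever launch configuration is wanted.

// Launch on device 0 using streamA
cudaSetDevice(0);
KernelOne <<<grid, block, 0, streamA>>>(..);

// Launch on device 1 using streamB
cudaSetDevice(1);
KernelTwo <<<grid, block, 0, streamB>>>(..);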
Floating Point Considerations
• Numeric values are represented as bit patterns.
• IEEE Floating Point Standard
  – Sign (S), Exponent (E) and Mantissa (M)
  – Each (S, E, M) pattern uniquely identifies a floating point number.
• For each bit pattern, its numeric value is derived as:
  – Value = (-1)^S × M × 2^E, where 1.0B ≤ M < 10.0B
• The interpretation of S:
  – S = 0: positive number
  – S = 1: negative number
96
Normalized Representation of M
• Subscripts D and B denote decimal and binary representations, respectively.
• Specifying 1.0B ≤ M < 10.0B makes the mantissa value of each floating point number unique.
  – For example: 0.5D = 1.0B × 2^-1
  – The only valid mantissa value is M = 1.0B.
  – Neither 10.0B × 2^-2 (M = 10.0B) nor 0.1B × 2^0 (M = 0.1B) qualifies.
  – Just like 10.0D × 10^5 or 0.9D × 10^-3 are not valid.
• Because all mantissa values are of the form 1.XX…, we can omit the "1." part from the representation.
  – The mantissa value of 0.5D in a 2-bit mantissa is 00, obtained by omitting "1." from 1.00.
  – With the IEEE format, an n-bit mantissa is effectively an (n+1)-bit mantissa.
97
Excess Encoding of E
Decimal Value | Two's Complement | Excess-3
Reserved      | 100              | 111
-3            | 101              | 000
-2            | 110              | 001
-1            | 111              | 010
0             | 000              | 011
1             | 001              | 100
2             | 010              | 101
3             | 011              | 110
The excess-3 representation increases monotonically with the decimal value.
In an n-bit exponent representation, 2^(n-1) - 1 is added to the two's complement value to form its excess representation.
Value = (-1)^S × 1.M × 2^(E - (2^(n-1) - 1))
98
Representable Numbers
(5-bit format: 1 sign bit S, 2 exponent bits E, 2 mantissa bits M; values shown for S=0, the S=1 value being its negative)

E   M    No-zero          Abrupt underflow   Denormalization
00  00   2^-1             0                  0
00  01   2^-1 + 1×2^-3    0                  1×2^-2
00  10   2^-1 + 2×2^-3    0                  2×2^-2
00  11   2^-1 + 3×2^-3    0                  3×2^-2
01  00   2^0              2^0                2^0
01  01   2^0 + 1×2^-2     2^0 + 1×2^-2       2^0 + 1×2^-2
01  10   2^0 + 2×2^-2     2^0 + 2×2^-2       2^0 + 2×2^-2
01  11   2^0 + 3×2^-2     2^0 + 3×2^-2       2^0 + 3×2^-2
10  00   2^1              2^1                2^1
10  01   2^1 + 1×2^-1     2^1 + 1×2^-1       2^1 + 1×2^-1
10  10   2^1 + 2×2^-1     2^1 + 2×2^-1       2^1 + 2×2^-1
10  11   2^1 + 3×2^-1     2^1 + 3×2^-1       2^1 + 3×2^-1
11  --   Reserved pattern
99
Representable Numbers
[Figure: the representable numbers of the no-zero format plotted on a number line from 0 to 3.]
• The exponent bits define the major intervals of representable numbers.
• The mantissa bits define the number of representable numbers in each interval.
• Zero is not representable in this format.
• Representable numbers become closer to each other toward 0.
• There is a gap in representable numbers in the vicinity of 0.
100
Representing Zero
• Abrupt Underflow
  – Treats all bit patterns with E=0 as 0.
  – Takes away several representable numbers near zero and lumps them all into 0.
  [Figure: number line from 0 to 3 with the representable numbers near 0 removed.]
• Denormalization
  – Relaxes the normalization requirement for numbers very close to 0.
  – Whenever E=0, the mantissa is assumed to be 0.xx (no implied leading 1).
  – The exponent is assumed to be the same as in the previous interval.
  [Figure: number line from 0 to 3 with evenly spaced representable numbers filling the gap near 0.]
  – Value = 0.M × 2^(-2^(n-1) + 2)
  – Example: 0 00 01 (S E M) → 0.01B × 2^0 = 2^-2
101
Accuracy and Rounding
• 1.00 × 2^-2 + 1.00 × 2^1
  = 0.001 × 2^1 + 1.00 × 2^1 = 1.001 × 2^1 ≈ 1.00 × 2^1 (Error = 0.001 × 2^1)
• 1.00 × 2^0 + 1.00 × 2^0 + 1.00 × 2^-2 + 1.00 × 2^-2
  = 1.00 × 2^1 + 1.00 × 2^-2 + 1.00 × 2^-2
  = 1.00 × 2^1 + 1.00 × 2^-2
  = 1.00 × 2^1
• [1.00 × 2^0 + 1.00 × 2^0] + [1.00 × 2^-2 + 1.00 × 2^-2]
  = 1.00 × 2^1 + 1.00 × 2^-1
  = 1.01 × 2^1
• Sorting data in ascending order may help achieve greater accuracy.
  – Numbers with similar numerical values are close to each other.
102
Single vs. Double Precision
• GPUs were traditionally not good at double precision calculation.
  – Requires compute capability 1.3 or above.
  – Around 1/8 of single precision performance.
  – Improved greatly to 1/2 with the Fermi architecture.
• It is important to avoid using double precision when it is not necessary.
  – Add the 'f' specifier on float literals:
    • Y = X*0.123;    // double assumed
    • Y = X*0.123f;   // float explicit
  – Use the float version of standard library functions:
    • Y = sin(X);     // double assumed
    • Y = sinf(X);    // single precision explicit
103
Matlab in Parallel
• Matlab: Numerical Computing Environment
• Parallel Computing Toolbox (PCT)
• Offloads work from one MATLAB session (the client) to other MATLAB sessions (the workers).
• Can run as many as eight MATLAB workers (2011b) on your local machine in addition to your MATLAB client session.
http://www.mathworks.cn/cn/help/distcomp/index.html
104
Parfor
• Parallel for-loop
• The parfor body is executed on the client and workers.
• The data on which parfor operates is sent from the client to the workers, and the results are sent back to the client and pieced together.
• MATLAB workers evaluate iterations in no particular order, and independently of each other.
• Classification of variables:
  – Loop, Sliced, Reduction, Broadcast, Temporary
105
Classification of Variables
a=0;
c=pi;
z=0;
r=rand(1,10);
parfor i=1:10      % i: loop variable
    a=i;           % a: temporary
    z=z+i;         % z: reduction
    b(i)=r(i);     % b, r: sliced
    if i<=c        % c: broadcast
        d=2*a;     % d: temporary
    end
end
106
Parfor Example
X=zeros(1,N);
for i = 1:N
    X(i)=sin(i/N*2*pi);
end

% parallelization
X=zeros(1,N);
matlabpool open local 8   % create 8 workers
parfor i = 1:N
    X(i)=sin(i/N*2*pi);
end
matlabpool close          % close all workers
107
Notes on Parfor
• Each loop iteration must be independent of the other iterations.
• In the Windows Task Manager, there are multiple Matlab processes:
  – Higher CPU usage
  – Higher memory usage
• It incurs significant overhead: it only pays off with long loop iterations and/or time-consuming calculations.
• Be prepared to be surprised:
  – Some Matlab functions are already optimized for multithreading.
  – The practical speedup value is generally quite moderate.
108
GPU Accelerated Matlab
• Matlab users can now easily enjoy the benefits of GPU computing.
• Capabilities
– Evaluating built-in functions on the GPU.
– Running MATLAB code on the GPU.
• Requirements
– Matlab 2014a (Recommended)
– NVIDIA CUDA-enabled devices with compute capability of 1.3 or greater
– NVIDIA CUDA device driver 3.1 or greater
• Check the GPU environment
– gpuDeviceCount: number of available GPU devices
– gpuDevice: select and query GPU device
109
Create Data on GPU
• Transferring data between the workspace and the GPU:
M = rand(6);                 % in the MATLAB workspace
G = gpuArray(single(M));     % send to the GPU
N = gather(G);               % bring back to the workspace
• Directly creating GPU data:
G = ones(100,100,50, 'single', 'gpuArray');
size(G)                      % 100 100 50
classUnderlying(G)           % single
110
Execute Code on GPU
• Run built-in functions:
X = rand(1000,'single','gpuArray');
Gfft = fft(X);
Y = gather(Gfft);
• Run element-wise MATLAB code:
function c = myCal(rawdata, gain, offst)
    c = (rawdata .* gain) + offst;

meas = ones(1000)*3;               % CPU
gn = rand(1000,'gpuArray')/100;    % GPU
offs = rand(1000,'gpuArray')/50;   % GPU
corrected = arrayfun(@myCal,meas,gn,offs);
results = gather(corrected);
111
Timing GPU Code
A = rand(1024,'gpuArray');
fh = @()fft(A);
gputimeit(fh);

gd = gpuDevice();
tic();
B = fft(A);
wait(gd);
t = toc();

A = rand(12000,400,'gpuArray');
B = rand(400,12000,'gpuArray');
f = @()A*B;
t = gputimeit(f);

X = rand(1000,'gpuArray');
f = @()svd(X);
t1 = gputimeit(f,1);
t3 = gputimeit(f,3);
112
Testing Host-GPU Bandwidth
sizeOfDouble = 8; sizes = power(2, 14:28);
sendTimes = inf(size(sizes)); gatherTimes = inf(size(sizes));
for i=1:numel(sizes)
numElements = sizes(i)/sizeOfDouble;
hostData = randi([0 9], numElements, 1);
gpuData = gpuArray.randi([0 9], numElements, 1);
sendFcn = @()gpuArray(hostData);
sendTimes(i) = gputimeit(sendFcn);
gatherFcn = @()gather(gpuData);
gatherTimes(i) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth, maxSendIdx] = max(sendBandwidth);
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth, maxGatherIdx] = max(gatherBandwidth);
113
Testing Host-GPU Bandwidth
[Figure: Data Transfer Bandwidth — transfer speed (GB/s) versus array size (bytes) for "Send to GPU" and "Gather from GPU".]
114
Testing CPU Bandwidth
sizeOfDouble = 8; sizes = power(2, 14:28);
memoryTimesHost = inf(size(sizes));
for i=1:numel(sizes)
numElements = sizes(i)/sizeOfDouble;
hostData = randi([0 9], numElements, 1);
plusFcn = @()plus(hostData, 1.0);
memoryTimesHost(i) = timeit(plusFcn);
end
memoryBandwidthHost = 2*(sizes./memoryTimesHost)/1e9;
[maxBWHost, maxBWIdxHost] = max(memoryBandwidthHost);
115
Testing GPU Bandwidth
memoryTimesGPU = inf(size(sizes));
for i=1:numel(sizes)
numElements = sizes(i)/sizeOfDouble;
gpuData = gpuArray.randi([0 9], numElements, 1);
plusFcn = @()plus(gpuData, 1.0);
memoryTimesGPU(i) = gputimeit(plusFcn);
end
memoryBandwidthGPU = 2*(sizes./memoryTimesGPU)/1e9;
[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidthGPU);
116
Bandwidth: CPU vs. GPU
[Figure: Read+Write Bandwidth — speed (GB/s) versus array size (bytes) for GPU and Host.]
117
Testing Matrix Multiplication
sizes = power(2, 12:2:24); N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesGPU = inf(size(sizes));
for i=1:numel(sizes)
A = rand(N(i), N(i)); B = rand(N(i), N(i));
mmTimesHost(i) = timeit(@()A*B);
A = gpuArray(A); B = gpuArray(B);
mmTimesGPU(i) = gputimeit(@()A*B);
end
mmGFlopsHost = (2*N.^3 - N.^2)./mmTimesHost/1e9;
[maxGFlopsHost,maxGFlopsHostIdx] = max(mmGFlopsHost);
mmGFlopsGPU = (2*N.^3 - N.^2)./mmTimesGPU/1e9;
[maxGFlopsGPU,maxGFlopsGPUIdx] = max(mmGFlopsGPU);
118
Testing Matrix Multiplication
[Figure: Double precision matrix-matrix multiply — calculation rate (GFLOPS) versus matrix size (numel) for GPU and Host.]
119
Testing Matrix Multiplication
[Figure: Single precision matrix-matrix multiply — calculation rate (GFLOPS) versus matrix size (numel) for GPU and Host.]
120
Review
• What are the differences among MPI, OpenMP and CUDA?
• Why is the GPU suitable for high performance computing?
• What is the general framework of CUDA programming?
• What is a kernel function and how to call it from the host code?
• What is the advantage of splitting a block into threads?
• Why do we need multiple thread blocks?
• What are the major memory types in CUDA?
• When should we use shared memory?
• What resource factors are critical to GPU programming?
121
Review
• What is a warp and why do we need it?
• What is flow divergence and how to avoid it?
• What is bank conflict?
• What is instruction mix?
• What are the benefits of Dynamic Parallelism?
• How to measure the performance of GPU code?
• How to run kernel functions in parallel?
• How is a floating point number represented in the IEEE format?
• How to execute Matlab code on GPU?
122