Particle Swarm Methods for Parameter Optimization on GPU

Parallel Concept and Hardware Architecture
CUDA Programming Model Overview
Yukai Hung
Department of Mathematics
National Taiwan University
Parallel Concept Overview
making your program faster
Parallel Computing Goals
Solve the problem in less time
- divide one problem into smaller pieces
- solve the smaller problems concurrently
- allows solving even bigger problems
Prepare to parallelize a problem
- represent the algorithm as a Directed Acyclic Graph
- identify dependencies in the problem
- identify critical paths in the algorithm
- modify dependencies to shorten the critical paths
3
Parallel Computing Goals
What is parallel computing?
4
Amdahl’s Law
Speedup of a parallel program is limited by the amount of serial work
5
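As a concrete statement of the law (a standard formulation, not taken from the slides): if a fraction $p$ of the work can be parallelized over $n$ processors, the overall speedup is

$$ S(n) = \frac{1}{(1-p) + p/n} \leq \frac{1}{1-p} $$

so even with $p = 0.9$ the speedup can never exceed $10\times$, no matter how many processors are added.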
Race Condition
Consider the following parallel program
- the threads' instructions may interleave in an unpredictable order
7
Race Condition
Scenario 1
- the result value R is 2 if the initial value R is 1
8
Race Condition
Scenario 2
- the result value R is 2 if the initial value R is 1
9
Race Condition
Scenario 3
- the result value R is 3 if the initial value R is 1
10
Race Condition with Lock
Solve the race condition by Locking
- manage the shared resource between threads
- must be used carefully to avoid deadlock and load-imbalance problems
11
Race Condition with Lock
Guarantees that the executed instruction order is correct
- but the locked section reverts to a sequential procedure
- and the lock/release procedure has high overhead
12
Race Condition with Semaphore
Solve the race condition by Semaphore
- a multi-valued locking method (an extension of binary locking)
- the instructions in the P and V procedures are atomic operations
13
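A minimal CUDA sketch of the same idea (my own illustration, not from the slides): many threads increment one shared counter; the plain read-modify-write races and loses updates, while the atomic version counts every increment.

#include <cstdio>
#include <cuda_runtime.h>

//every thread increments the shared counter with a plain
//read-modify-write; two threads may read the same old value
//and both write back old+1, so increments are lost
__global__ void unsafeIncrement( int* counter )
{
    *counter = *counter + 1;
}

//atomicAdd performs the read-modify-write as one indivisible
//operation, so every increment is counted exactly once
__global__ void safeIncrement( int* counter )
{
    atomicAdd(counter, 1);
}

int main()
{
    int h_count = 0;
    int* d_count;
    cudaMalloc((void**)&d_count,sizeof(int));

    cudaMemcpy(d_count,&h_count,sizeof(int),cudaMemcpyHostToDevice);
    unsafeIncrement<<< 1,256 >>>(d_count);
    cudaMemcpy(&h_count,d_count,sizeof(int),cudaMemcpyDeviceToHost);
    printf("racy count   : %d (expected 256)\n",h_count);

    h_count = 0;
    cudaMemcpy(d_count,&h_count,sizeof(int),cudaMemcpyHostToDevice);
    safeIncrement<<< 1,256 >>>(d_count);
    cudaMemcpy(&h_count,d_count,sizeof(int),cudaMemcpyDeviceToHost);
    printf("atomic count : %d (always 256)\n",h_count);

    cudaFree(d_count);
    return 0;
}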
Instruction Level Parallelism
Multiple instructions are executed simultaneously
- reorder the instructions carefully to gain efficiency
- the compiler reorders the assembly instructions automatically
(figure: instructions scheduled into steps 1-3)
14
Data Level Parallelism
Multiple data operations are executed simultaneously
- computational data is separable and independent
- single operation repeated over different input data
(figure: sequential procedure versus parallel procedure)
15
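A short sketch of the contrast (my own illustration; the array scaling operation is a made-up example): the sequential procedure applies the same operation to each element one after another, while the parallel procedure gives every element to its own thread.

//sequential procedure: one loop iteration per data element
void scaleSequential( float* data, int n, float alpha )
{
    for( int i=0; i<n; i++ )
        data[i] = alpha*data[i];
}

//parallel procedure: the same single operation repeated over
//different input data, one element per thread
__global__ void scaleParallel( float* data, float alpha )
{
    int i = blockIdx.x*blockDim.x+threadIdx.x;
    data[i] = alpha*data[i];
}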
Flynn’s Taxonomy
Classification for parallel computers and programs
                       Single Data                Multiple Data
Single Instruction     SISD (single-core CPU)     SIMD (GPU/vector processor)
Multiple Instruction   MISD (very rare)           MIMD (multi-core CPU)
16
Flynn’s Taxonomy
Classification for parallel computers and programs
(figures: SISD and SIMD execution models)
17
Flynn’s Taxonomy
Classification for parallel computers and programs
(figures: MISD and MIMD execution models)
18
CPU and GPU Hardware Comparison
CPU versus GPU
20
CPU versus GPU
Intel Penryn quad-core: 255 mm² with 0.82B transistors
NVIDIA GTX280: >500 mm² with 1.4B transistors
21
CPU versus GPU
(figures: computing GFLOPS and memory bandwidth comparison)
22
CPU versus GPU
(figure: CPU versus GPU die layout)
- compare the control/cache area and the number of ALUs
- compare the clock rates, core counts, and execution latency
23
General Purpose GPU Computation
24
General Purpose GPU Computation
algorithm conversion requires knowledge of the graphics APIs (OpenGL and DirectX)
25
General Purpose GPU Computation
data must be converted to and from pixel formats,
which restricts the general usage of algorithms
26
From Traditional GPU Architecture to Today
Simplified Graphic Pipeline
sorting stage
z-buffer collection
28
Simplified Graphic Pipeline
maximum depth
z-cull feedback
29
Simplified Graphic Pipeline
scale up some units
30
Simplified Graphic Pipeline
add framebuffer access
bottleneck is the FBI (framebuffer interface) unit for memory management
31
Simplified Graphic Pipeline
add programmability by ALU units
add programmable geometry and pixel shaders
32
Simplified Graphic Pipeline
consider two similar units
special cases in the pipeline
(1) one triangle but lots of pixels: the pixel shader is busy
(2) lots of triangles but one pixel: the geometry shader is busy
33
Unified Graphic Pipeline
combine the two units into a unified shader
scalable between geometry and pixel workloads
memory resource management becomes important
34
Graphic Pipeline Comparison
software pipeline
hardware pipeline
35
Unified Graphic Architecture
Switch on two modes: Graphic and CUDA mode
(figure: graphics-mode block diagram - Host, Input Assembler, Setup/Rstr/ZCull, Vtx/Geom/Pixel Thread Issue, streaming processor (SP) arrays with texture (TF) units, L1/L2 caches, framebuffer (FB) partitions, and the Thread Processor)
36
Unified Graphic Architecture
Switch on two modes: Graphic and CUDA mode
(figure: CUDA-mode block diagram - Host, Input Assembler, Thread Execution Manager, parallel data caches, texture units, load/store units, and Global Memory)
37
Thread Streaming Processing
no data communication between threads, which is suitable for traditional graphics workloads
38
Shader Register File/Cache
Separate register files
- strict streaming processing mode
- no data sharing at the instruction level
- dynamic allocation and renaming
- no memory addressing order exists
- registers overflow into local memory
39
Thread Streaming Processing
data communication between different threads (in the same shader) becomes an issue
40
Shader Register File/Cache
Shared register files
- an extra memory hierarchy level between the shader registers and global memory
- shares data between threads in the same shader (threads in the same block)
- synchronizes all threads in the shader
41
Unified Graphic Architecture
42
CPU and GPU Chip Architecture
43
CPU and GPU Chip Architecture
44
Nowadays GPU Applications
Traditional game rendering
- clothing demo and Star Tales benchmark
- real-time physical phenomenon rendering
- mixed mode for physical phenomenon simulation
Scientific computational usage
- molecular dynamics simulation
- protein folding and n-body simulation
- medical imaging and computational fluid dynamics
CUDA community showcase
45
Nowadays GPU Features
GPUs are becoming more programmable than before
- programmed with only a standard C extension on unified scalable shaders
GPUs now support 32-bit and 64-bit floating point operations
- almost IEEE floating-point compliant except for some special cases
- lack mantissa denormalization for small floating-point numbers
GPUs have much higher memory bandwidth than CPUs
- multiple memory banks driven by the need for high-performance graphics
Massive data-level parallel architecture
- hundreds of thread processors on the chip
- thousands of concurrent threads on the shaders
- lightweight thread switching hides the long memory latency
46
General Purpose GPU Environment
CUDA: Compute Unified Device Architecture
- realistic hardware and software GPGPU solution
- minimal set of standard C language extensions
- tool set includes compiler and software development kits
OpenCL: Open Computing Language
- similar to CUDA from the GPGPU point of view
- supports both CPU and GPU hardware architectures
- executes across heterogeneous platform resources
47
CUDA Programming Overview
CUDA Programming Model
Integrated host and device application C program
- serial or modestly parallel parts in host C code
- highly parallel parts in device C extension code
...
...
49
CUDA Programming Model
What is the compute device?
- coprocessor to the host part
- have its own device memory space
- run many active threads in parallel
What is the difference between CPU and GPU threads?
- GPU threads are extremely lightweight
- GPU threads have almost no creation overhead
- the GPU needs more than 1000 threads for full occupancy
- a multi-core CPU can create and execute only a few threads
50
CUDA Programming Model
A kernel is a function executed on the GPU
- all threads execute the same kernel function
- different threads can take different control paths
- each thread has its own ID to control itself
float x = data[threadID];
float y = kernel_func(x);
51
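A minimal sketch of such a kernel (my own illustration; data, n, and kernel_func are placeholder names following the slide): every thread executes the same function, but its own ID decides which element it touches and which control path it takes.

//placeholder per-element computation
__device__ float kernel_func( float x )
{
    return 2.0f*x;
}

//all threads run the same kernel; the thread ID controls each thread
__global__ void kernel( float* data, int n )
{
    int threadID = blockIdx.x*blockDim.x+threadIdx.x;

    //threads with an out-of-range ID take a different control path
    if( threadID < n )
    {
        float x = data[threadID];
        float y = kernel_func(x);
        data[threadID] = y;
    }
}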
CUDA Thread Hierarchy
GPU can create numerous threads concurrently
52
CUDA Thread Hierarchy
GPU can create numerous threads concurrently
- all threads are divided into some blocks
53
CUDA Thread Hierarchy
GPU can create numerous threads concurrently
- all threads are divided into some blocks
- all blocks are grouped into one grid
54
CUDA Thread Hierarchy
GPU can create numerous threads concurrently
- all threads are divided into some blocks
- all blocks are grouped into one grid
- a kernel is executed as a grid of blocks of threads
56
CUDA Transparent Scalability
(figure: thread blocks 1-12 scheduled onto the device a few at a time)
57
CUDA Transparent Scalability
(figure: the same thread blocks 1-12 scheduled more blocks at a time on a larger device)
58
CUDA Transparent Scalability
(figure: blocks may execute in any order, so the same grid runs unchanged on devices of different sizes)
59
CUDA Transparent Scalability
60
CUDA Thread Hierarchy
Identify local and global thread index
- each block has its own local blockID in the grid
- each thread has its own local threadID in its block
- the global thread index is computed from the blockID and threadID
(figure: three blocks of seven threads; within each block index 0-2 the local thread indices run 0-6, giving global indices 0-20)
61
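A small sketch of the index computation for the layout above (my own illustration, assuming three blocks of seven threads as in the figure):

//local index : threadIdx.x runs 0..6 inside each block
//block index : blockIdx.x runs 0..2 inside the grid
//global index: blockIdx.x*blockDim.x+threadIdx.x runs 0..20
__global__ void writeGlobalIndex( int* d_out )
{
    int globalIdx = blockIdx.x*blockDim.x+threadIdx.x;
    d_out[globalIdx] = globalIdx;
}

//launched as three blocks of seven threads:
//writeGlobalIndex<<< 3,7 >>>(d_out);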
CUDA Thread Hierarchy
CUDA thread hierarchy summary
grid of blocks of threads
62
CUDA Thread Hierarchy
CUDA thread hierarchy summary
63
CUDA Programming Model
A kernel executes as a grid of thread blocks
- different kernels can execute with different grid and block sizes
- each block has its blockID and each thread has its threadID
(figure: kernel 1 launched with 1D blocks of four threads; kernel 2 launched with 2D blocks of 2x4 threads)
64
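A short host-side sketch of the two launch configurations in the figure (my own illustration; kernel1, kernel2, d_data, and the grid size are hypothetical names and values):

dim3 gridSize(2);              //2 blocks in the grid (hypothetical size)

//kernel 1: 1D blocks with 4 threads each (thread index 0..3)
dim3 blockSize1(4);
kernel1<<< gridSize,blockSize1 >>>(d_data);

//kernel 2: 2D blocks with 2x4 threads each
dim3 blockSize2(4,2);          //blockDim.x=4, blockDim.y=2
kernel2<<< gridSize,blockSize2 >>>(d_data);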
CUDA Memory Hierarchy
block of threads
registers
local memory
shared memory
65
CUDA Memory Hierarchy
grid of blocks
registers
local memory
shared memory
66
CUDA Memory Hierarchy
grid of blocks
registers
local memory
shared memory
texture memory
constant memory
global memory
67
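A small kernel sketch showing where these memory spaces appear in code (my own illustration; reversing a 256-element tile is a made-up example, and the read-only constant and texture spaces are not shown):

__global__ void memorySpaces( float* d_global )
{
    //register: an automatic variable private to each thread
    int idx = threadIdx.x;

    //shared memory: visible to all threads in the same block
    __shared__ float tile[256];

    //global memory: visible to all threads and to the host
    tile[idx] = d_global[idx];

    //wait until the whole block has filled the shared tile
    __syncthreads();

    d_global[idx] = tile[255-idx];
}

//launched with one block of 256 threads: memorySpaces<<< 1,256 >>>(d_A);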
CUDA Memory Hierarchy
CUDA memory hierarchy summary
68
Start from a Vector-Addition Simple Example
Vector-Addition Simple Example
Global memory or device memory
- its contents are visible to all threads
- used to communicate between host and device
- readable and writable by all threads
- long read/write memory access latency
We will focus on the global memory now
70
Vector-Addition Simple Example
cudaMalloc(…)
- allocate space in the global memory
- require address pointer and space size
cudaFree(…)
- free space in the global memory
- require address pointer to free space
71
Vector-Addition Simple Example
cudaMemcpy(…)
- data transfers between host and device
- require source and destination pointer
- require data transfer direction and size
cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
72
Vector-Addition Simple Example
//allocate and initialize the host memory space
float *h_A;
float *h_B;
float *h_C;
h_A=(float*)malloc(N*sizeof(float));
h_B=(float*)malloc(N*sizeof(float));
h_C=(float*)malloc(N*sizeof(float));
//allocate the device memory space
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A,N*sizeof(float));
cudaMalloc((void**)&d_B,N*sizeof(float));
cudaMalloc((void**)&d_C,N*sizeof(float));
//copy raw data from host to device memory
cudaMemcpy(d_A,h_A,N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,N*sizeof(float),cudaMemcpyHostToDevice);
73
Vector-Addition Simple Example
//execute the kernel on N/256 blocks of 256 threads
vectorAddition<<< N/256,256 >>>(d_A,d_B,d_C);
//copy result data from device to host memory
cudaMemcpy(h_C,d_C,N*sizeof(float),cudaMemcpyDeviceToHost);
//free the host memory space
free( h_A );
free( h_B );
free( h_C );
//free the device memory space
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
74
Vector-Addition Simple Example
//compute vector addition C=A+B
//each thread performs one pair-wise addition
__global__ void vectorAddition( float* d_A, float* d_B, float* d_C )
{
    int globalIdx;

    //compute the thread global index
    globalIdx = blockIdx.x*blockDim.x+threadIdx.x;

    //each thread performs its own element addition
    d_C[globalIdx] = d_A[globalIdx]+d_B[globalIdx];

    return;
}
75
Vector-Addition Simple Example
Declaration      Call from    Execute on
__global__       host         device
__device__       device       device
__host__         host         host
76
Vector-Addition Simple Example
Restrictions on the global function
- the return type must be void
Restrictions on the device function
- no function pointers
- no recursion
- no variable number of arguments
- no static variables declared inside the function
77
Vector-Addition Simple Example
//compute vector addition C=A+B
//each thread performs one pair-wise addition
__global__ void vectorAddition( float* d_A, float* d_B, float* d_C )
{
    int globalIdx;

    //compute the thread global index
    globalIdx = blockIdx.x * blockDim.x + threadIdx.x;

    //each thread performs its own element addition
    d_C[globalIdx] = addElement(d_A[globalIdx],d_B[globalIdx]);

    return;
}

__device__ float addElement( float input_a, float input_b )
{
    return (input_a+input_b);
}
78
Vector-Addition Simple Example
CUDA built-in structures
struct dim3
{
    int x;
    int y;
    int z;
};

struct uint3
{
    unsigned int x;
    unsigned int y;
    unsigned int z;
};
79
Vector-Addition Simple Example
CUDA built-in variables
- used inside global or device functions
gridDim (dim3 variable): records the dimensions of the current grid
blockIdx (uint3 variable): records the block index within the current grid
blockDim (dim3 variable): records the dimensions of the current block
threadIdx (uint3 variable): records the thread index within the current block
80
Matrix-Addition Simple Example
//allocate and initialize the host memory space
float *h_A;
float *h_B;
float *h_C;
h_A=(float*)malloc(N*N*sizeof(float));
h_B=(float*)malloc(N*N*sizeof(float));
h_C=(float*)malloc(N*N*sizeof(float));
//allocate the device memory space
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A,N*N*sizeof(float));
cudaMalloc((void**)&d_B,N*N*sizeof(float));
cudaMalloc((void**)&d_C,N*N*sizeof(float));
//copy raw data from host to device memory
cudaMemcpy(d_A,h_A,N*N*sizeof(float),cudaMemcpyHostToDevice);
cudaMemcpy(d_B,h_B,N*N*sizeof(float),cudaMemcpyHostToDevice);
81
Matrix-Addition Simple Example
dim3 gridSize;
dim3 blockSize;
//set the grid size and block size
gridSize.x=N/16; blockSize.x=16;
gridSize.y=N/16; blockSize.y=16;
//execute the kernel on an (N/16 x N/16) grid of 16x16-thread blocks
matrixAddition<<<gridSize,blockSize>>>(d_A,d_B,d_C);
//copy result data from device to host memory
cudaMemcpy(h_C,d_C,N*N*sizeof(float),cudaMemcpyDeviceToHost);
//free the host memory space
free(h_A);
free(h_B);
free(h_C);
//free the device memory space
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
82
Matrix-Addition Simple Example
//compute matrix addition C=A+B
//each thread performs one pair-wise addition
__global__ void matrixAddition( float* d_A, float* d_B, float* d_C )
{
    int msize;
    int rowIdx;
    int colIdx;

    //compute the matrix size
    msize = gridDim.x*blockDim.x;

    //compute the thread global index
    rowIdx = blockIdx.y*blockDim.y+threadIdx.y;
    colIdx = blockIdx.x*blockDim.x+threadIdx.x;

    //each thread performs its own element addition
    d_C[rowIdx*msize+colIdx] = d_A[rowIdx*msize+colIdx]
                             + d_B[rowIdx*msize+colIdx];

    return;
}
83
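The slides do not cover error checking; as a common practice (not from the slides), the launch result can be checked with the standard CUDA runtime calls cudaGetLastError and cudaGetErrorString:

//check for kernel launch or configuration errors after the launch
matrixAddition<<<gridSize,blockSize>>>(d_A,d_B,d_C);
cudaError_t err = cudaGetLastError();
if( err != cudaSuccess )
    printf("kernel launch failed: %s\n",cudaGetErrorString(err));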
Reference
- Mark Harris: http://www.markmark.net/
- Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php
84