IEEE Boston Continuing Education Program
Ken Domino, Domem Technologies
May 9, 2011
Announcements
Course website updates:
Syllabus- http://domemtech.com/ieee-pp/Syllabus.docx
Lecture1– http://domemtech.com/ieee-pp/Lecture1.pptx
Lecture2– http://domemtech.com/ieee-pp/Lecture2.pptx
References- http://domemtech.com/ieee-pp/References.docx
The Ocelot April 5 download is not working.
PRAM
Parallel Random Access Machine (PRAM).
Idealized SIMD parallel computing model.
Unlimited number of RAMs, called Processing Units (PUs).
The RAMs operate with the same instructions, synchronously.
Shared Memory is unlimited and accessed in one unit of time.
Shared Memory access is one of CREW (concurrent read, exclusive write), CRCW (concurrent read, concurrent write), or EREW (exclusive read, exclusive write).
Communication between RAMs is only through Shared Memory.
PRAM pseudo code
Parallel for loop
for Pi , 1 ≤ i ≤ n in parallel do … end
(aka “data-level parallelism”)
Synchronization
A simple example from C:
Synchronization
What happens if we have two threads competing for
the same resources (char_in/char_out)?
Synchronization
What happens if two threads execute this code
serially?
No prob!
Synchronization
What happens if two threads execute this code in
parallel? We can sometimes get a problem.
char_in of T2 overwrites char_in of T1!
Synchronization
Synchronization forces thread serialization, e.g., so
concurrent access does not cause problems.
Synchronization
Two types:
Mutual exclusion, using a “mutex” semaphore = a lock.
Cooperation: wait on an object until all other threads are ready, using wait() + notify(), or barrier synchronization.
Deadlock
Deadlock can arise from mutually exclusive use of two or more resources: each thread holds one resource while waiting forever for a resource held by the other.
PRAM Synchronization
“stay idle” – wait until the other processors complete;
“cooperative” synchronization
CUDA
“Compute Unified Device Architecture”
Developed by NVIDIA, introduced November 2006
Based on C, extended later to work with C++.
CUDA provides three key abstractions:
a hierarchy of thread groups
shared memories
barrier synchronization
http://www.nvidia.com/object/IO_37226.html,
http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf,
Nickolls, J., Buck, I., Garland, M., and Skadron, K. Scalable parallel programming with CUDA. ACM Queue 6(2), 40–53 (2008).
GPU coprocessor to CPU
NVIDIA GPU Architecture
Multiprocessor (MP) = texture/processor cluster (TPC)
Dynamic random-access memory (DRAM), aka “global memory”
Raster operation processor (ROP)
L2 – Level-2 memory cache
NVIDIA GPU Architecture
Streaming Multiprocessor (SM)
Streaming Processor (SP)
Streaming Multiprocessor Control (SMC)
Texture Processing Unit (TPU)
Con Cache – “constant” memory
Sh. Memory – “shared” memory
Multithreaded Instruction Fetch and Issue unit (MTIFI)
1st generation, G80 – 2006
3rd generation, Fermi, GTX 570 - 2010
Single-instruction, multiple-thread
“SIMT”
SIMT = SIMD + SPMD (single program, multiple data).
Multiple threads.
Sort of “Single Instruction”, except that each instruction executed is in multiple independent parallel threads.
Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.
Single-instruction, multiple-thread
The Streaming Multiprocessor is a hardware multithreaded unit.
Threads are executed in groups of 32 parallel threads
called warps.
Each thread has its own set of registers.
Individual threads composing a warp are of the same
program and start together at the same program
address, but they are otherwise free to branch and
execute independently.
Single-instruction, multiple-thread
Instruction executed is same for each warp.
If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken.
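For instance (a sketch, not from the slides), a kernel like this forces each warp to execute both branch paths one after the other, because even- and odd-numbered threads within the same warp take different sides:

```cuda
__global__ void diverge(int * d)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)          // half of each warp takes this path...
        d[tid] = 2 * tid;
    else                       // ...the other half takes this one,
        d[tid] = 2 * tid + 1;  // so the warp runs both serially
}
```

A branch on blockIdx.x alone, by contrast, does not diverge, since all threads of a warp belong to the same block.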
Single-instruction, multiple-thread
Warps are serialized if there is:
Divergence in instructions (i.e., conditional branch
instruction)
write access to the same memory
Warp Scheduling
SM hardware implements near-zero overhead
Warp scheduling
Warps whose next instruction has its operands ready for
consumption can be executed
Eligible Warps are selected for execution by priority
All threads in a Warp execute the same instruction
4 clock cycles needed to dispatch the instruction for all
threads (G80)
Cooperative Thread Array (CTA)
An abstraction for synchronizing threads.
AKA a thread block (a grid is an array of CTAs).
CTAs are mapped onto warps.
Cooperative Thread Array (CTA)
Each thread has a unique integer thread ID (TID).
Threads of a CTA share data in global or shared
memory
Threads synchronize with the barrier instruction.
CTA thread programs use their TIDs to select work and
index shared data arrays.
Cooperative Thread Array (CTA)
The programmer declares a 1D, 2D, or 3D grid shape and its dimensions in threads.
The TID is a 1D, 2D, or 3D index.
There are restrictions on grid sizes.
Kernel
Every thread in a grid executes the same body of
instructions, called a kernel.
In CUDA, it’s just a function.
CUDA Kernels
Kernels declared with __global__ void
Parameters are the same for all threads.
__global__ void fun(float * d, int size)
{
    int idx = threadIdx.x
        + blockDim.x * blockIdx.x
        + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
        + blockDim.x * gridDim.x * threadIdx.y;
    if (idx < 0)
        return;
    if (idx >= size)
        return;
    d[idx] = idx * 10.0 / 0.1;
}
CUDA Kernels
Kernels are called via “chevron syntax”
Func<<< Dg, Db, Ns, S >>>(parameters)
Dg is of type dim3 and specifies the dimension and size of the grid
Db is of type dim3 and specifies the dimension and size of the block
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block
S is of type cudaStream_t and specifies the associated stream
A kernel has void return type; it must return values through pointer parameters.
Example:
Foo<<<1, 100>>>(1, 2, i);
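A hedged example of a two-dimensional launch with the syntax above (the kernel name and sizes are invented for illustration):

```cuda
dim3 grid(4, 2);      // Dg: 4 x 2 = 8 blocks
dim3 block(16, 16);   // Db: 16 x 16 = 256 threads per block
// Ns and S default to 0: no dynamic shared memory, default stream.
fun<<<grid, block>>>(d, size);
```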
Memory
CTAs have various types of memory:
Global, shared, constant, texture, registers.
Threads can access host memory, too.
Types of memory
CUDA Memory
Data types (int, long, float, double, etc.) are the same as on the host.
Shared memory is shared between threads in a block.
Global memory is shared by all threads in all blocks.
Constant memory is shared by all threads in all blocks, but it cannot be changed (so it is faster).
Host memory (of the CPU) can be accessed by all threads in all blocks.
Shared Memory
__shared__ declares a variable that:
Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Is accessible only from the threads within the block.
Examples:
extern __shared__ float shared[];
(or declared on kernel call—later!)
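A sketch of that dynamic form, where the size is supplied as the Ns launch argument (kernel name and sizes are assumptions):

```cuda
extern __shared__ float shared[];   // size is fixed at launch, not here

__global__ void scale(float * d)
{
    int tid = threadIdx.x;
    shared[tid] = d[tid];           // stage data in fast shared memory
    __syncthreads();                // barrier: wait for the whole block
    d[tid] = 2.0f * shared[tid];
}

// Host side: the third launch parameter is the shared-memory size in bytes.
// scale<<<1, 256, 256 * sizeof(float)>>>(d);
```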
Global Memory
__device__ declares a variable that:
Resides in global memory space;
Has the lifetime of an application;
Is accessible from all the threads within the grid and
from the host through the runtime library
(cudaGetSymbolAddress() / cudaGetSymbolSize() /
cudaMemcpyToSymbol() /
cudaMemcpyFromSymbol())
Can be allocated through cudaMalloc()
Examples:
extern __device__ int data[100];
cudaMalloc(&d, 100*sizeof(int));
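A hedged sketch of reaching such a variable from the host through two of the runtime calls listed above:

```cuda
__device__ int data[100];   // lives in global memory for the whole run

int main()
{
    int h[100] = { 0 };
    // Copy the host array to the device symbol, and back again.
    cudaMemcpyToSymbol(data, h, sizeof(h));
    cudaMemcpyFromSymbol(h, data, sizeof(h));
    return 0;
}
```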
Basic host function calls
Global memory allocation via cudaMalloc()
Copying memory between host and GPU via
cudaMemcpy()
Kernels are called by chevron syntax
Counting 6’s
We have an array of integers h[0..size-1] and want to count the number of 6’s that appear in it.
How do we do this in CUDA?
Counting 6’s
Divide the array
into blocks of
blocksize
threads.
For each block,
sum the number
of times 6
appears.
Return the sum
for each block.
Counting 6’s
#include <stdio.h>
__global__ void c6(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int val = d_in[i + blockIdx.x * blockDim.x];
        if (val == 6)
            sum++;
    }
    d_out[blockIdx.x] = sum;
}
Counting 6’s
In main program,
call the kernel
with the correct
dimensions of the
block.
Note: size % blocksize must equal 0.
How would we
extend this for
arbitrary array
size?
int main()
{
    int size = 300;
    int * h = (int*)malloc(size * sizeof(int));
    for (int i = 0; i < size; ++i)
        h[i] = i % 10;
    int * d_in;
    int * d_out;
    int bsize = 100;
    int blocks = size / bsize;   // size is an exact multiple of bsize
    int threads_per_block = bsize;
    int rv1 = cudaMalloc(&d_in, size*sizeof(int));
    int rv2 = cudaMalloc(&d_out, blocks*sizeof(int));
    int rv3 = cudaMemcpy(d_in, h, size*sizeof(int),
        cudaMemcpyHostToDevice);
    c6<<<blocks, threads_per_block>>>(d_in, d_out, size);
    cudaThreadSynchronize();
    int rv4 = cudaGetLastError();
    int * r = (int*)malloc(blocks * sizeof(int));
    int rv5 = cudaMemcpy(r, d_out, blocks*sizeof(int),
        cudaMemcpyDeviceToHost);
    int sum = 0;
    for (int i = 0; i < blocks; ++i)
        sum += r[i];
    printf("Result = %d\n", sum);
    return 0;
}
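One way to answer the slide’s question (a sketch, not the author’s solution): launch enough blocks to cover the whole array, and guard every global read against the array bound:

```cuda
__global__ void c6_any(int * d_in, int * d_out, int size)
{
    int sum = 0;
    for (int i = 0; i < blockDim.x; i++)
    {
        int idx = i + blockIdx.x * blockDim.x;
        if (idx < size && d_in[idx] == 6)   // skip out-of-range elements
            sum++;
    }
    d_out[blockIdx.x] = sum;
}
```

The host then launches a rounded-up number of blocks, e.g. blocks = (size + bsize - 1) / bsize.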
Developing CUDA programs
Install CUDA SDK (drivers, Toolkit, examples)
Windows, Linux, Mac:
Use Version 4.0, release candidate 2. (The older 3.2 release
does not work with VS2010 easily! You can install both VS2010
and VS2008, but you will have to manage paths.)
http://developer.nvidia.com/cuda-toolkit-40
Install toolkit, tools SDK, and example code
For drivers, you must have an NVIDIA GPU card
Recommendation: The CUDA examples use definitions in a
common library—do not force your code to depend on it by
using it.
Developing CUDA programs
Emulation
Do not install CUDA drivers (will fail).
Windows and Mac only
Install VirtualBox.
Create 40GB virtual drive.
Install Ubuntu from ISO image on VirtualBox.
Install Ocelot
(http://code.google.com/p/gpuocelot/downloads/list)
Install various dependencies (sudo apt-get install xxxx, for g++, boost, etc.)
Note: There is a problem with the current release of Ocelot—I
emailed [email protected] to resolve build issue.
Developing CUDA programs
Windows:
Install VS2010 C++ Express
(http://www.microsoft.com/visualstudio/en-us/products/2010-editions/visual-cpp-express)
(Test installation with “Hello World” .cpp example.)
Developing CUDA programs
Windows:
Create an empty c++ console project
Create hw.cu “hello world” program in source directory
Project -> Custom Build Rules, check the box for CUDA 4.0 targets
Add hw.cu into your empty project
Note: the “.cu” suffix stands for “CUDA source code”. You can put CUDA syntax into .cpp files, but the build environment won’t know what to compile them with (cl/g++ vs. nvcc).
Developing CUDA programs
hw.cu:
#include <stdio.h>

__global__ void fun(int * mem)
{
    *mem = 1;
}

int main()
{
    int h = 0;
    int * d;
    cudaMalloc(&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    fun<<<1,1>>>(d);
    cudaThreadSynchronize();
    int rv = cudaGetLastError();
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h);
    return 0;
}
Developing CUDA programs
Compile, link, and run.
(The Version 4.0 installation adjusts all environment variables.)
NVCC
nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases.
Use the -keep option to see intermediate files.
(You need to add “.” to the include directories when compiling.)
NVCC
Compiles “.cu” files into “.cu.cpp” files.
Two types of targets: virtual and real, represented as PTX assembly code and “cubin” binary code, respectively.
PTXAS
Compiles PTX assembly code into machine code, placed in an ELF module.
# cat hw.sm_10.cubin | od -t x1 | head
0000000 7f 45 4c 46 01 01 01 33 02 00 00 00 00 00 00 00
0000020 02 00 be 00 01 00 00 00 00 00 00 00 34 18 00 00
0000040 34 00 00 00 0a 01 0a 00 34 00 20 00 03 00 28 00
0000060 16 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000120 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
0000140 03 00 00 00 00 00 00 00 00 00 00 00 a4 03 00 00
0000160 7f 01 00 00 00 00 00 00 00 00 00 00 04 00 00 00
0000200 00 00 00 00 0b 00 00 00 03 00 00 00 00 00 00 00
0000220 00 00 00 00 23 05 00 00 22 00 00 00 00 00 00 00
Disassembly of the machine code can be done using cuobjdump or my own utility nvdis (http://forums.nvidia.com/index.php?showtopic=183438).
PTX, the GPU assembly code
PTX = “Parallel Thread Execution”
The target for PTX is an abstract GPU machine.
Contains operations for load, store, register declarations, add, sub, mul, etc.
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with …/be.exe
// nvopencc 4.0 built on 2011-03-24
.entry _Z3funPi (
    .param .u32 __cudaparm__Z3funPi_mem)
{
    .reg .u32 %r<4>;
    .loc 16 4 0
$LDWbegin__Z3funPi:
    .loc 16 6 0
    mov.s32       %r1, 1;
    ld.param.u32  %r2, [__cudaparm__Z3funPi_mem];
    st.global.s32 [%r2+0], %r1;
    .loc 16 7 0
    exit;
$LDWend__Z3funPi:
} // _Z3funPi
CUDA GPU targets
Virtual – PTX code is embedded in the executable as a string, then compiled at runtime “just-in-time”.
Real – PTX code is compiled into target machine code at build time and embedded in the executable.
Next time
For next week, we will go into more detail:
The CUDA runtime API;
Writing efficient CUDA code;
A look at some important examples.