Introduction to CUDA Programming


Profiler, Assembly, and Floating-Point
Introduction to CUDA Programming
Andreas Moshovos
Winter 2009
Some material from:
Wen-Mei Hwu and David Kirk
NVIDIA
Robert Strzodka, Dominik Göddeke, NVISION08 presentation
http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf
The CUDA Profiler
• Both GUI and command-line
• Non-GUI control:
– CUDA_PROFILE
• set to 1 or 0 to enable or disable the profiler
– CUDA_PROFILE_LOG
• set to the name of the log file (will default to ./cuda_profile.log)
– CUDA_PROFILE_CSV
• set to 1 or 0 to enable or disable a comma-separated version of the log
– CUDA_PROFILE_CONFIG
• specify a config file with up to 4 signals
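For example, the config file is plain text with one counter name per line; a hypothetical file (these counter names are from the CUDA 2.x-era profiler and may differ by toolkit version) could look like:
    gld_incoherent
    gld_coherent
    gst_incoherent
    gst_coherent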
Profiler Signals
Profiler Counters
• Grid size X, Y
• Block size X, Y, Z
• Dyn smem per block:
– Dynamic shared memory
• Sta smem per block:
– static shared memory
• Reg per thread
• Mem transfer dir
– Direction: 0 = host to device, 1 = device to host
• Mem transfer size
– In bytes
Interpreting Profiler Counters
• Values represent events within a thread warp
• The profiler targets only one multiprocessor
– Values will not correspond to the total number of warps launched for a particular kernel.
– Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
• Values are best used to identify relative performance differences between non-optimized and optimized code
– e.g., make the number of non-coalesced loads go from some non-zero value to zero
CUDA Visual Profiler
• Helps measure and find potential performance problems
– GPU and CPU timing for all kernel invocations and memcpys
– Time stamps
• Access to hardware performance counters
Assembly
PTX: Assembly for NVIDIA GPUs
• Parallel Thread eXecution
• Virtual Assembly
– Translated to actual machine code at runtime
– Allows for different hardware implementations
• Might enable additional optimizations
– E.g., %clock register to time blocks of code
Code Generation Flow
[Flow diagram: the programmer writes a C/C++ Application (and, optionally, an ASM-level Library); the C/C++ Compiler emits PTX Code; a PTX-to-Target Translator turns the PTX into Target code for the actual GPU (G80, …)]

• Parallel Thread eXecution (PTX)
– Virtual Machine and ISA
– Programming model
– Execution resources and state
• ISA – Instruction Set Architecture
– Variable declarations
– Instructions and operands
• Translator is an optimizing compiler
– Translates PTX to Target code
– Runs at program install time
• Driver implements the VM runtime
– Coupled with the Translator
How to See the PTX code
• nvcc -keep
– Produces .ptx and .cubin files
• nvcc --opencc-options -LIST:source=on
PTX Example
• CUDA
    float4 me = gx[gtid];
    me.x += me.y * me.z;
• PTX
    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    # 174  me.x += me.y * me.z;
    mad.f32           $f1, $f5, $f3, $f1;
• Registers are virtual – the actual hardware registers are hidden from PTX
PTX Syntax Example
Another Example: CUDA Function
• CUDA
__device__ void interaction(
    float4 b0, float4 b1, float3 *accel)
{
    float3 r;
    r.x = b1.x - b0.x;
    r.y = b1.y - b0.y;
    r.z = b1.z - b0.z;
    float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
    float s = 1.0f / sqrt(distSqr);
    accel->x += r.x * s;
    accel->y += r.y * s;
    accel->z += r.z * s;
}
• PTX
sub.f32    $f18, $f1, $f15;
sub.f32    $f19, $f3, $f16;
sub.f32    $f20, $f5, $f17;
mul.f32    $f21, $f18, $f18;
mul.f32    $f22, $f19, $f19;
mul.f32    $f23, $f20, $f20;
add.f32    $f24, $f21, $f22;
add.f32    $f25, $f23, $f24;
rsqrt.f32  $f26, $f25;
mad.f32    $f13, $f18, $f26, $f13;
mov.f32    $f14, $f13;
mad.f32    $f11, $f19, $f26, $f11;
mov.f32    $f12, $f11;
mad.f32    $f9, $f20, $f26, $f9;
mov.f32    $f10, $f9;
PTX Data types
Predicated Execution
• Source:
    if (cond)
        Then Code
    After:
        After Code
• Without predication (branch):
– p = Evaluate cond
– Branch to After if not true
– Then Code
– After: After Code
• With predication:
– p = Evaluate cond
– (p) Then Code
– After Code
PTX Predicated Execution
Variable Declaration
Parameterized Variable Names
• How to create 100 register “variables”
• .reg .b32 %r<100>
• Declares %r0 - %r99
Addresses as Operands
• The value of x
• The value of tbl[12]
• The base address of tbl
Compiling a loop that calls a function - again
• CUDA
– sx is shared
– mx, accel are local
for (i = 0; i < K; i++) {
    if (i != threadIdx.x) {
        interaction(sx[i], mx, &accel);
    }
}
• PTX
mov.s32        $r12, 0;
$Lt_0_26:
setp.eq.u32    $p1, $r12, $r5;
@$p1 bra       $Lt_0_27;
mul.lo.u32     $r13, $r12, 16;
add.u32        $r14, $r13, $r1;
ld.shared.f32  $f15, [$r14+0];
ld.shared.f32  $f16, [$r14+4];
ld.shared.f32  $f17, [$r14+8];
[func body from previous slide inlined here]
$Lt_0_27:
add.s32        $r12, $r12, 1;
mov.s32        $r15, 128;
setp.ne.s32    $p2, $r12, $r15;
@$p2 bra       $Lt_0_26;
Yet Another Example: SAXPY code
cvt.u32.u16 $blockid, %ctaid.x; // Calculate i from thread/block IDs
cvt.u32.u16 $blocksize, %ntid.x;
cvt.u32.u16 $tid, %tid.x;
mad24.lo.u32 $i, $blockid, $blocksize, $tid;
ld.param.u32 $n, [N]; // Nothing to do if n ≤ i
setp.le.u32 $p1, $n, $i;
@$p1 bra $L_finish;
mul.lo.u32 $offset, $i, 4; // Load y[i]
ld.param.u32 $yaddr, [Y];
add.u32 $yaddr, $yaddr, $offset;
ld.global.f32 $y_i, [$yaddr+0];
ld.param.u32 $xaddr, [X]; // Load x[i]
add.u32 $xaddr, $xaddr, $offset;
ld.global.f32 $x_i, [$xaddr+0];
ld.param.f32 $alpha, [ALPHA]; // Compute and store alpha*x[i] + y[i]
mad.f32 $y_i, $alpha, $x_i, $y_i;
st.global.f32 [$yaddr+0], $y_i;
$L_finish: exit;
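For reference, CUDA source along the following lines compiles to PTX of roughly this shape (a minimal sketch; the kernel and parameter names are illustrative, not taken from the slide):

    __global__ void saxpy(int n, float alpha, const float *x, float *y)
    {
        // i computed from block and thread IDs, as in the mad24 above
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // nothing to do if n <= i
        if (i < n)
            y[i] = alpha * x[i] + y[i];   // compiles to a mad.f32
    }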
The %clock register
• Real time clock cycle counter
• How to read:
– mov.u32 $r1, %clock;
• Can be used to time code
• It measures real time, not just the time spent executing this thread
– If a thread is blocked, time still elapses
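In CUDA C the same counter is exposed through the clock() device function; a minimal sketch of timing a section of a kernel (kernel and variable names are illustrative):

    __global__ void timed_kernel(int *cycles)
    {
        clock_t start = clock();      // read %clock at the start

        // ... code to be timed goes here ...

        clock_t stop = clock();       // read %clock again

        // one thread per block records the elapsed cycle count
        // (real time, not per-thread execution time)
        if (threadIdx.x == 0)
            cycles[blockIdx.x] = (int)(stop - start);
    }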
PTX Reference
• Please Read the PTX ISA specification
– Posted under the handouts section
Occupancy Calculator
• http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
• GPU Occupancy
– Active warps / max warps
– Threads/block
– Registers/thread
– Shared memory/block
• nvcc -cubin
– code {
    name = my_kernel
    lmem = 0
    smem = 24
    reg = 5
    bar = 0
    bincode { … }
    const { … }
  }
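As a hedged worked example (using the hypothetical kernel resources from the -cubin output above and the G80 per-multiprocessor limits of 8192 registers, 16 KB shared memory, 768 threads and 8 blocks): with 128 threads/block, 5 registers/thread and 24 bytes of shared memory/block, registers allow 8192 / (5 * 128) = 12 blocks and shared memory is not a constraint, so the 768-thread limit is reached first with 6 blocks (768 threads = 24 warps), giving an occupancy of 24/24 = 100%.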
Occupancy Calculator Example
Floating Point Considerations
Comparison of FP Capabilities
                              G80                    SSE                      IBM Altivec            Cell SPE
Precision                     IEEE 754               IEEE 754                 IEEE 754               IEEE 754
Rounding modes for            Round to nearest       All 4 IEEE: nearest,     Round to nearest       Round to zero /
FADD and FMUL                 and round to zero      zero, +inf, -inf         only                   truncate only
Denormal handling             Flush to zero          Supported,               Supported,             Flush to zero
                                                     1000's of cycles         1000's of cycles
NaN support                   Yes                    Yes                      Yes                    No
Overflow and Infinity         Yes, only clamps       Yes                      Yes                    No, infinity
support                       to max norm
Flags                         No                     Yes                      Yes                    Some
Square root                   Software only          Hardware                 Software only          Software only
Division                      Software only          Hardware                 Software only          Software only
Reciprocal estimate           24 bit                 12 bit                   12 bit                 12 bit
accuracy
Reciprocal sqrt               23 bit                 12 bit                   12 bit                 12 bit
estimate accuracy
log2(x) and 2^x               23 bit                 No                       12 bit                 No
estimates accuracy
IEEE Floating Point Representation
• A floating point binary number consists of three parts:
– sign (S), exponent (E), and mantissa (M).
– Each (S, E, M) pattern uniquely identifies a floating point
number.
• For each bit pattern, its IEEE floating-point value is derived as:
– value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B
• The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.
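A minimal sketch in host C code (variable names are illustrative) that pulls the S, E, M fields out of a single-precision IEEE 754 number:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = 0.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);        /* reinterpret the float's bit pattern */

        uint32_t S = bits >> 31;               /* 1-bit sign */
        uint32_t E = (bits >> 23) & 0xFFu;     /* 8-bit biased exponent */
        uint32_t M = bits & 0x7FFFFFu;         /* 23-bit mantissa (leading 1. implicit) */

        printf("S=%u  E=%u (unbiased %d)  M=0x%06X\n", S, E, (int)E - 127, M);
        return 0;
    }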
Normalized Representation
• Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point number unique.
– For example, the only mantissa value allowed for 0.5D is M = 1.0
• 0.5D = 1.0B * 2^-1
– Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies
• Because all mantissa values are of the form 1.XX…, one can omit the “1.” part in the representation.
– The mantissa value of 0.5D in a 2-bit mantissa is 00, which is derived by omitting “1.” from 1.00.
Exponent Representation
• In an n-bit exponent representation, 2^(n-1) - 1 is added to the exponent's 2's complement representation to form its excess representation.
– See the table for a 3-bit exponent representation
• A simple unsigned integer comparator can be used to compare the magnitude of two FP numbers
• Symmetric range for +/- exponents (111 reserved)
2's complement    Actual decimal        Excess-3
000               0                     011
001               1                     100
010               2                     101
011               3                     110
100               (reserved pattern)    111
101               -3                    000
110               -2                    001
111               -1                    010

E = represented E - BIAS
A Hypothetical 5-bit Floating Point Representation
• Assume 1-bit S, 2-bit E, and 2-bit M
– 0.5D = 1.00B * 2^-1
– 0.5D = 0 00 00, where
• S = 0
• E = 00
• M = (1.)00
2's complement    Actual decimal        Excess-1
00                0                     01
01                1                     10
10                (reserved pattern)    11
11                -1                    00
Representable Numbers
• The representable numbers of a given format is the set of all numbers that can be exactly represented in the format.
• See the table for the representable numbers of an unsigned 3-bit integer format:

000    0
001    1
010    2
011    3
100    4
101    5
110    6
111    7

[Figure: the representable numbers marked on a number line from 0 to 9]
Hypothetical 5-bit FP: Representable Numbers
E    M    No-zero (S=0 / S=1)               Abrupt underflow (S=0 / S=1)     Gradual underflow (S=0 / S=1)
00   00   2^-1           / -(2^-1)          0 / 0                            0 / 0
00   01   2^-1+1*2^-3    / -(2^-1+1*2^-3)   0 / 0                            1*2^-2 / -1*2^-2
00   10   2^-1+2*2^-3    / -(2^-1+2*2^-3)   0 / 0                            2*2^-2 / -2*2^-2
00   11   2^-1+3*2^-3    / -(2^-1+3*2^-3)   0 / 0                            3*2^-2 / -3*2^-2
01   00   2^0            / -(2^0)           (same in all three columns)
01   01   2^0+1*2^-2     / -(2^0+1*2^-2)    (same in all three columns)
01   10   2^0+2*2^-2     / -(2^0+2*2^-2)    (same in all three columns)
01   11   2^0+3*2^-2     / -(2^0+3*2^-2)    (same in all three columns)
10   00   2^1            / -(2^1)           (same in all three columns)
10   01   2^1+1*2^-1     / -(2^1+1*2^-1)    (same in all three columns)
10   10   2^1+2*2^-1     / -(2^1+2*2^-1)    (same in all three columns)
10   11   2^1+3*2^-1     / -(2^1+3*2^-1)    (same in all three columns)
11   --   Reserved pattern
Flush To Zero
• Treat all bit patterns with E=0 as 0.0
– This takes away several representable numbers near zero and lumps them all into 0.0
– For a representation with a large M, a large number of representable numbers will be removed.

[Figure: number line showing the gap around zero left by flush-to-zero]
Hypothetical 5-bit FP: Representable Numbers
[Same table of representable numbers as above, repeated on the original slide for the flush-to-zero discussion.]
Denormalized Numbers
• The actual method adopted by the IEEE standard is called denormalized numbers or gradual underflow.
– The method relaxes the normalization requirement for numbers very close to 0.
– Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX.
– In general, if the n-bit exponent is 0, the value is 0.M * 2^(-2^(n-1) + 2)
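As a quick check in host C code (a hedged sketch; it assumes the platform actually supports IEEE 754 denormals rather than flushing to zero), the smallest normal and smallest denormal single-precision values can be printed like this:

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        float smallest_normal   = FLT_MIN;                 /* 2^-126: E=1, M=0      */
        float smallest_denormal = FLT_MIN * FLT_EPSILON;   /* 2^-149: E=0, M=0...01 */

        printf("smallest normal:   %g\n", (double)smallest_normal);
        printf("smallest denormal: %g\n", (double)smallest_denormal);
        return 0;
    }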
Hypothetical 5-bit FP: Representable Numbers
[Same table of representable numbers as above, repeated on the original slide for the gradual-underflow discussion.]
Floating Point Numbers
• As the exponent gets larger, the distance between two consecutive representable numbers increases
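A minimal sketch in host C code (values are illustrative) showing the spacing between consecutive floats growing with magnitude:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* distance from a float to the next larger representable float */
        printf("gap just above 1.0f:   %g\n", (double)(nextafterf(1.0f, 2.0f) - 1.0f));
        printf("gap just above 1.0e6f: %g\n", (double)(nextafterf(1.0e6f, 2.0e6f) - 1.0e6f));
        return 0;
    }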
Arithmetic Instruction Throughput
• int and float add, shift, min, max and float mul, mad: 4 cycles per warp
– int multiply (*) is by default 32-bit
• requires multiple cycles / warp
– Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
– This applies to G80; on G20 the 32-bit multiply should be OK
• Integer divide and modulo are expensive
– The compiler will convert literal power-of-2 divides to shifts
– Be explicit in cases where the compiler can't tell that the divisor is a power of 2
– Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch after this list)
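A hedged CUDA sketch of the two tricks above (kernel and variable names are illustrative):

    __global__ void index_math(int *out, int n)   // n is assumed to be a power of 2
    {
        // 24-bit multiply intrinsic: 4-cycle multiply on G80
        int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

        // i % n replaced by a mask, valid only because n is a power of 2
        int slot = i & (n - 1);

        out[i] = slot;   // out has one element per launched thread
    }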
Arithmetic Instruction Throughput
• Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
– These are the versions prefixed with “__”
– Examples: __rcp(), __sin(), __exp()
• Other functions are combinations of the above
– y / x == rcp(x) * y == 20 cycles per warp
– sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp
Runtime Math Library
• There are two types of runtime math operations
– __func(): direct mapping to hardware ISA
• Fast but lower accuracy (see the programming guide for details)
• Examples: __sin(x), __exp(x), __pow(x,y)
– func(): compiles to multiple instructions
• Slower but higher accuracy (5 ulp, units in the last place, or less)
• Examples: sin(x), exp(x), pow(x,y)
• The -use_fast_math compiler option forces every func() to compile to __func()
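A minimal CUDA sketch (names are illustrative) contrasting the two flavours:

    __global__ void math_demo(float *out, const float *in)
    {
        float x = in[threadIdx.x];
        float fast     = __sinf(x);   // maps directly to hardware: fast, lower accuracy
        float accurate = sinf(x);     // multiple instructions: slower, within a few ulp
        out[threadIdx.x] = fast - accurate;   // the difference shows the accuracy gap
    }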
Make your program float-safe!
• G20 has double precision support
– G80 is single-precision only
– Double precision has additional performance cost
• Only one double-precision unit per multiprocessor
– Careless use of double or undeclared types may run more slowly on G80+
• Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
– Add an ‘f’ specifier on float literals:
• foo = bar * 0.123;    // double assumed
• foo = bar * 0.123f;   // float explicit
– Use the float version of standard library functions:
• foo = sin(bar);    // double assumed
• foo = sinf(bar);   // single precision explicit
Deviations from IEEE-754
• Addition and Multiplication are IEEE 754 compliant
– Maximum 0.5 ulp (units in the last place) error
• However, these are often combined into a multiply-add (FMAD)
– The intermediate result is truncated
• Division is non-compliant (2 ulp)
• Not all rounding modes are supported
• Denormalized numbers are not supported
• No mechanism to detect floating-point exceptions
Units in the Last Place Error
• If the result of an FP computation is:
– 3.12 x 10^-2 = 0.0312
• But the answer when computed to infinite precision is:
– 0.0312159
• Then the error is:
– (0.0312159 - 0.0312) / 0.0001 = 0.159 ulp
• For binary representations with round to nearest, the maximum error is 0.5 ulp
Mixed Precision Methods
From slides by
Robert Strzodka
Dominik Göddeke
http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf
What is a Mixed Precision Method?
Mixed Precision Performance Gains
Single vs Double Precision FP
Float s23e8
Double s53e11
Round off and Cancellation
Double Precision != Better Accuracy
The Dominant Data Error
Understanding Floating Point Operations
Commutative Summation
Commutative Summation Example
Say we want to calculate 1 + 0.0000004 – 0.00000003
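A hedged host-C sketch of the slide's example (the exact printed values depend on how each intermediate step rounds, which is precisely the point):

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0f, b = 0.0000004f, c = -0.00000003f;

        float left_to_right = (a + b) + c;   /* each small term is rounded against 1 */
        float small_first   = a + (b + c);   /* small terms combined before meeting 1 */

        /* the two results may differ in the last bits even though real-number
           addition is commutative and associative */
        printf("%.10f\n%.10f\n", (double)left_to_right, (double)small_first);
        return 0;
    }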
High Precision Emulation
Example: Addition c = a + b
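One common building block for this kind of emulation is an error-free addition that returns both the rounded sum and the rounding error. A hedged sketch (this particular formulation is Knuth's two-sum, not necessarily the exact code behind the slide, and it assumes the compiler preserves the written order of operations):

    __device__ void two_sum(float a, float b, float *s, float *e)
    {
        float sum = a + b;                    // rounded single-precision sum
        float bb  = sum - a;                  // portion of b that actually entered sum
        *s = sum;
        *e = (a - (sum - bb)) + (b - bb);     // rounding error lost by the addition
    }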
Please read the following: