Floating-Point Considerations

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE408, University of Illinois, Urbana-Champaign
Motivation
• High-speed floating-point arithmetic has become a
standard feature for microprocessors.
– It is important for application programmers to understand and
take advantage of floating-point arithmetic in developing
their applications
• Focus on
– Accuracy of floating-point arithmetic
– Precision of floating-point number representation
• How the accuracy and precision of floating-point
number representation should be taken into
consideration in parallel computation
Objectives
• To understand the fundamentals of floating-point
representation
• To know the IEEE-754 Floating Point Standard
• GeForce 8800 CUDA floating-point speed, accuracy,
and precision
– Deviations from IEEE-754
– Accuracy of device runtime functions
– -use_fast_math compiler option
– Future performance considerations
What is IEEE floating-point format?
• A floating point binary number consists of three parts:
– sign (S), exponent (E), and mantissa (M).
– Each (S, E, M) pattern uniquely identifies a floating point number.
• For each bit pattern, its IEEE floating-point value is derived as:
– value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B (= 2.0D)
• The interpretation of S is simple: S=0 results in a positive
number and S=1 a negative number.
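To make the three fields concrete, here is a minimal C sketch (the helper name decode_float is ours, not from the slides; it assumes a normalized value, so the E = 0 and all-ones-E special cases covered later are not handled):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <math.h>

    /* Pull S, E, M out of an IEEE-754 single-precision value and
       reconstruct it as (-1)^S * 1.M * 2^(E-127). */
    static void decode_float(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);       /* reinterpret the 32 bits */
        uint32_t s = bits >> 31;              /* 1-bit sign              */
        uint32_t e = (bits >> 23) & 0xFF;     /* 8-bit excess-127 field  */
        uint32_t m = bits & 0x7FFFFF;         /* 23-bit mantissa field   */
        double value = (s ? -1.0 : 1.0)
                     * (1.0 + m / (double)(1 << 23))   /* implicit "1." */
                     * pow(2.0, (int)e - 127);
        printf("%g -> S=%u E=%u M=0x%06X (value %g)\n", f, s, e, m, value);
    }

    int main(void) {
        decode_float(0.5f);    /* S=0, E=126 (= -1 + 127), M=0 */
        decode_float(-2.5f);   /* S=1, E=128, M=0x200000       */
        return 0;
    }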
Normalized Representation
• Specifying that 1.0B ≤ M < 10.0B makes the mantissa
value for each floating-point number unique.
– For example, the only mantissa value allowed for 0.5D
is M = 1.0
• 0.5D = 1.0B * 2^-1
– Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies
• Because all mantissa values are of the form 1.XX…,
one can omit the “1.” part in the representation.
– The mantissa value of 0.5D in a 2-bit mantissa is 00, which
is derived by omitting “1.” from 1.00.
Exponent Representation
• In an n-bits exponent
representation, 2n-1-1 is
added to its 2's complement
representation to form its
excess representation.
– See Table for a 3-bit exponent
representation
• A simple unsigned integer
comparator can be used to
compare the magnitude of
two FP numbers
• Symmetric range for +/exponents (111 reserved)
2's complement | Actual decimal | Excess-3
000            | 0              | 011
001            | 1              | 100
010            | 2              | 101
011            | 3              | 110
100            | (reserved)     | 111
101            | -3             | 000
110            | -2             | 001
111            | -1             | 010
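A small C sketch of the bias arithmetic (excess_encode is an illustrative name, not from the slides): the excess code is just the actual exponent plus 2^(n-1) - 1, which is why unsigned comparison of the codes orders the exponents correctly.

    #include <stdio.h>

    /* Bias an actual exponent into its n-bit excess code. */
    static unsigned excess_encode(int actual, int n) {
        return (unsigned)(actual + ((1 << (n - 1)) - 1));  /* add 2^(n-1)-1 */
    }

    int main(void) {
        /* n = 3, bias = 3: -3 -> 0 (000), 0 -> 3 (011), 3 -> 6 (110). */
        for (int e = -3; e <= 3; e++)
            printf("actual %2d -> excess-3 code %u\n", e, excess_encode(e, 3));
        return 0;
    }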
A simple, hypothetical 5-bit FP format
• Assume 1-bit S, 2-bit E, and 2-bit M
– 0.5D = 1.00B * 2^-1
– 0.5D = 0 00 00, where S = 0, E = 00, and M = (1.)00
2's complement | Actual decimal | Excess-1
00             | 0              | 01
01             | 1              | 10
10             | (reserved)     | 11
11             | -1             | 00
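A sketch of packing a normalized value 1.M * 2^e into this 5-bit layout (pack5 is an illustrative helper; it does not validate against the reserved E = 11 pattern):

    #include <stdio.h>

    /* 1-bit S | 2-bit excess-1 E | 2-bit M, packed into the low 5 bits. */
    static unsigned pack5(int sign, int exp, unsigned frac) {
        unsigned e = (unsigned)(exp + 1);            /* excess-1 bias */
        return (unsigned)(sign << 4) | (e << 2) | (frac & 0x3);
    }

    int main(void) {
        printf("0.5  -> %u (binary 0 00 00)\n", pack5(0, -1, 0)); /* 1.00B * 2^-1 */
        printf("1.25 -> %u (binary 0 01 01)\n", pack5(0,  0, 1)); /* 1.01B * 2^0  */
        return 0;
    }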
Representable Numbers
• The representable numbers of a given format are the
set of all numbers that can be exactly represented in
the format.
• See the table for the representable numbers of an
unsigned 3-bit integer format
Bit pattern | Value
000         | 0
001         | 1
010         | 2
011         | 3
100         | 4
101         | 5
110         | 6
111         | 7

[Number line: the representable values 0–7 marked as ticks on an axis from -1 to 9]
Representable Numbers of a 5-bit Hypothetical IEEE Format
(Note: the no-zero format cannot represent zero!)

E  | M  | No-zero (S=0 / S=1)             | Abrupt underflow (S=0 / S=1) | Gradual underflow (S=0 / S=1)
00 | 00 | 2^-1          / -(2^-1)         | 0 / 0                        | 0 / 0
00 | 01 | 2^-1+1*2^-3   / -(2^-1+1*2^-3)  | 0 / 0                        | 1*2^-2 / -1*2^-2
00 | 10 | 2^-1+2*2^-3   / -(2^-1+2*2^-3)  | 0 / 0                        | 2*2^-2 / -2*2^-2
00 | 11 | 2^-1+3*2^-3   / -(2^-1+3*2^-3)  | 0 / 0                        | 3*2^-2 / -3*2^-2
01 | 00 | 2^0           / -(2^0)          | (same as no-zero)            | (same as no-zero)
01 | 01 | 2^0+1*2^-2    / -(2^0+1*2^-2)   | (same as no-zero)            | (same as no-zero)
01 | 10 | 2^0+2*2^-2    / -(2^0+2*2^-2)   | (same as no-zero)            | (same as no-zero)
01 | 11 | 2^0+3*2^-2    / -(2^0+3*2^-2)   | (same as no-zero)            | (same as no-zero)
10 | 00 | 2^1           / -(2^1)          | (same as no-zero)            | (same as no-zero)
10 | 01 | 2^1+1*2^-1    / -(2^1+1*2^-1)   | (same as no-zero)            | (same as no-zero)
10 | 10 | 2^1+2*2^-1    / -(2^1+2*2^-1)   | (same as no-zero)            | (same as no-zero)
10 | 11 | 2^1+3*2^-1    / -(2^1+3*2^-1)   | (same as no-zero)            | (same as no-zero)
11 | -- | (reserved pattern)              |                              |
Representable Numbers: 5 Observations
1. 3 major intervals of representable numbers (defined by the
number of different exponent values)
– E = 00B representable numbers: {0.500, 0.625, 0.750, 0.875}D
– E = 01B representable numbers: {1.000, 1.250, 1.500, 1.750}D
– E = 10B representable numbers: {2.000, 2.500, 3.000, 3.500}D
[Number line: the representable numbers marked between 0 and 4]
Representable Numbers: 5 Observations (cont.)
2. The mantissa bits define the number of representable numbers in
each interval; with an N-bit mantissa there are 2^N representable
numbers in each interval
– A value falling in an interval will be rounded to a representable number
– The more representable numbers in an interval, the
more precisely we can represent values in it
– Hence the number of mantissa bits determines the precision of the
representation
3. 0 is not representable
4. Numbers greater than 0 but less than 0.5 are not representable
(a gap in the representable numbers)
5. Representable numbers get closer to each other toward 0
Flush to Zero
• Treat all bit patterns with E=0 as 0.0
– This takes away several representable numbers near zero
and lumps them all into 0.0
– For a representation with a large M, a large number of
representable numbers is removed.
Flush to Zero
E  | M  | No-zero (S=0 / S=1)             | Flush to Zero (S=0 / S=1)
00 | 00 | 2^-1          / -(2^-1)         | 0 / 0
00 | 01 | 2^-1+1*2^-3   / -(2^-1+1*2^-3)  | 0 / 0
00 | 10 | 2^-1+2*2^-3   / -(2^-1+2*2^-3)  | 0 / 0
00 | 11 | 2^-1+3*2^-3   / -(2^-1+3*2^-3)  | 0 / 0
01 | 00 | 2^0           / -(2^0)          | 2^0          / -(2^0)
01 | 01 | 2^0+1*2^-2    / -(2^0+1*2^-2)   | 2^0+1*2^-2   / -(2^0+1*2^-2)
01 | 10 | 2^0+2*2^-2    / -(2^0+2*2^-2)   | 2^0+2*2^-2   / -(2^0+2*2^-2)
01 | 11 | 2^0+3*2^-2    / -(2^0+3*2^-2)   | 2^0+3*2^-2   / -(2^0+3*2^-2)
10 | 00 | 2^1           / -(2^1)          | 2^1          / -(2^1)
10 | 01 | 2^1+1*2^-1    / -(2^1+1*2^-1)   | 2^1+1*2^-1   / -(2^1+1*2^-1)
10 | 10 | 2^1+2*2^-1    / -(2^1+2*2^-1)   | 2^1+2*2^-1   / -(2^1+2*2^-1)
10 | 11 | 2^1+3*2^-1    / -(2^1+3*2^-1)   | 2^1+3*2^-1   / -(2^1+3*2^-1)
11 | -- | (reserved pattern)              |
Denormalized Numbers
• The method actually adopted by the IEEE standard is
called denormalized numbers, or gradual underflow.
– The method relaxes the normalization requirement for
numbers very close to 0.
– Whenever E=0, the mantissa is no longer assumed to be of the
form 1.XX; rather, it is assumed to be 0.XX.
– In general, if the n-bit exponent field is 0, the value is
• 0.M * 2^(-2^(n-1) + 2)
• e.g. if a 3-bit exponent field is 0, the value is 0.M * 2^-2
Denormalization
E  | M  | No-zero (S=0 / S=1)             | Flush to Zero (S=0 / S=1) | Denormalized (S=0 / S=1)
00 | 00 | 2^-1          / -(2^-1)         | 0 / 0                     | 0      / 0
00 | 01 | 2^-1+1*2^-3   / -(2^-1+1*2^-3)  | 0 / 0                     | 1*2^-2 / -1*2^-2
00 | 10 | 2^-1+2*2^-3   / -(2^-1+2*2^-3)  | 0 / 0                     | 2*2^-2 / -2*2^-2
00 | 11 | 2^-1+3*2^-3   / -(2^-1+3*2^-3)  | 0 / 0                     | 3*2^-2 / -3*2^-2
01 | 00 | 2^0           / -(2^0)          | (same as no-zero)         | (same as no-zero)
01 | 01 | 2^0+1*2^-2    / -(2^0+1*2^-2)   | (same as no-zero)         | (same as no-zero)
01 | 10 | 2^0+2*2^-2    / -(2^0+2*2^-2)   | (same as no-zero)         | (same as no-zero)
01 | 11 | 2^0+3*2^-2    / -(2^0+3*2^-2)   | (same as no-zero)         | (same as no-zero)
10 | 00 | 2^1           / -(2^1)          | (same as no-zero)         | (same as no-zero)
10 | 01 | 2^1+1*2^-1    / -(2^1+1*2^-1)   | (same as no-zero)         | (same as no-zero)
10 | 10 | 2^1+2*2^-1    / -(2^1+2*2^-1)   | (same as no-zero)         | (same as no-zero)
10 | 11 | 2^1+3*2^-1    / -(2^1+3*2^-1)   | (same as no-zero)         | (same as no-zero)
11 | -- | (reserved pattern)              |                           |
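The three columns can be generated mechanically. A C sketch enumerating the positive half of the 5-bit format under each E=0 convention (for the 2-bit exponent the denormalized value is 0.M * 2^(-2^(2-1)+2) = 0.M * 2^0):

    #include <stdio.h>

    int main(void) {
        for (unsigned E = 0; E <= 2; E++) {        /* E = 11 is reserved */
            for (unsigned M = 0; M <= 3; M++) {
                double scale  = (E == 0) ? 0.5 : (E == 1) ? 1.0 : 2.0;
                double nozero = (1.0 + M / 4.0) * scale;      /* 1.M * 2^(E-1) */
                double ftz    = (E == 0) ? 0.0 : nozero;      /* E=0 lumped to 0 */
                double denorm = (E == 0) ? M / 4.0 : nozero;  /* 0.M * 2^0 */
                printf("E=%u M=%u  no-zero=%-5g  ftz=%-5g  denormalized=%g\n",
                       E, M, nozero, ftz, denorm);
            }
        }
        return 0;
    }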
Special Bit Patterns
All numbers are normalized floating-point numbers
except when
• All E bits are 0s
– If M = 0, the number represents zero
– If M ≠ 0, the number is denormalized
• All E bits are 1s
– If M = 0, the number represents ∞
• All representable numbers fall between -∞ and +∞
• Dividing by 0 results in ∞ (overflow)
• Dividing by ∞ results in 0
– If M ≠ 0, the number is “Not a Number” (NaN)
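These rules are observable from ordinary C; the sketch below assumes IEEE-754 arithmetic, which both hosts and CUDA devices provide:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float one = 1.0f, zero = 0.0f;
        printf("1/0   -> %f (isinf=%d)\n", one / zero, isinf(one / zero));
        printf("0/0   -> %f (isnan=%d)\n", zero / zero, isnan(zero / zero));
        printf("1/inf -> %f\n", one / (one / zero)); /* dividing by inf gives 0 */
        return 0;
    }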
Special Bit Patterns: NaN
• Generated by operations with input values that do not
make sense, e.g. 0/0, 0 × ∞, ∞/∞, ∞ - ∞
• Signaling NaNs (SNaN): represented with the MSB of M = 0
– Cause an exception if used as input to an arithmetic operation;
execution is interrupted
• Something is wrong with the execution of the code
• Making uninitialized data SNaNs allows detection of the
use of such data during execution
• Current GPUs do not support SNaN due to the difficulty
of supporting accurate signaling during massively
parallel execution
Special Bit Patterns: Quiet NaNs
• Quiet NaNs: represented with the MSB of M = 1
• (1.0 + quiet NaN) generates a quiet NaN
• A quiet NaN is printed as “NaN”
– The user can review the output and decide whether the application
should be re-run with different input for more valid results.
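A short C sketch of quiet-NaN behavior (nanf comes from C99's math.h; exactly how printf spells the value varies by platform, commonly "nan"):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float qnan = nanf("");                 /* a quiet NaN          */
        float r = 1.0f + qnan;                 /* propagates: r is NaN */
        printf("1.0 + NaN = %f\n", r);         /* typically prints nan */
        printf("NaN == NaN -> %d\n", qnan == qnan); /* 0: NaN never compares equal */
        return 0;
    }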
Precision
• Single precision (SP), 32 bits: 1-bit S, 8-bit E, 23-bit M
• Double precision (DP), 64 bits: 1-bit S, 11-bit E, 52-bit M
– M has 29 more bits than SP: the largest representation error is
reduced to 1/2^29 of that of the SP format
– E has 3 more bits than SP: the number of intervals of
representable numbers is extended
• Very large and very small values become representable
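The precision gap is easy to see with a value like 0.1, which no binary format represents exactly; a small C sketch:

    #include <stdio.h>

    int main(void) {
        float  f = 0.1f;   /* 23-bit mantissa: error near the 8th decimal digit  */
        double d = 0.1;    /* 52-bit mantissa: error near the 17th decimal digit */
        printf("float  0.1 = %.20f\n", f);
        printf("double 0.1 = %.20f\n", d);
        return 0;
    }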
Arithmetic Accuracy and Rounding
• The accuracy of a floating-point arithmetic operation is measured
by the maximal error introduced by the operation
– The source of error is that the result cannot be exactly
represented and requires rounding
• the mantissa of the result needs too many bits to be represented exactly
• A common cause of rounding is pre-shifting in floating-point arithmetic
– Example: the two operands to a floating-point addition have different
exponents
• The mantissa of the operand with the smaller exponent is right-shifted
until the exponents are equal
• The exact result may then have more bits than the format can accommodate
Rounding Error Example
Assume the 5-bit representation (E: excess-1; denormalized M)
• 1.00*2^1 + 1.00*2^-2 { (0 10 00) + (0 00 01) }
– Right-shift the mantissa of 1.00*2^-2 (0 00 01) so that its exponent
becomes 1 instead of -2
– 1.00*2^1 + 0.001*2^1 = 1.001*2^1 (ideally)
– The result is not representable in the 5-bit format (it requires a
3-bit mantissa)
– Generate a result = the closest representable number, 1.01*2^1
• This introduces an error of 0.001*2^1 = half the place value of the
least significant place: 0.5 ULP (Units in the Last Place)
• The most error introduced by properly designed hardware is no
more than 0.5 ULP.
– This is the accuracy achieved by the addition and subtraction
operations in G80/GT200.
– Inversion introduces 2 ULP of error
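The same pre-shifting loss shows up in real single precision: with a 23-bit M, adding 1.0 to 2^24 shifts the 1.0 entirely out of the mantissa. A quick C check:

    #include <stdio.h>

    int main(void) {
        float big = 16777216.0f;                  /* 2^24                   */
        printf("2^24 + 1 = %.1f\n", big + 1.0f);  /* 16777216.0: 1.0 lost   */
        printf("2^24 + 2 = %.1f\n", big + 2.0f);  /* 16777218.0: 2.0 kept   */
        return 0;
    }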
Algorithm Considerations
Numerical algorithms often need to sum up a large number of values
• The order of summing the values can affect the accuracy of the
final result
• Example: using the 5-bit representation (E: excess-1; M:
denormalized), {1.00*2^0 = (0 01 00); 1.00*2^-2 = (0 00 01)}
A sequential order of performing the following addition yields:
(1.00*2^0) + (1.00*2^0) + (1.00*2^-2) + (1.00*2^-2)
= (1.00*2^1) + (1.00*2^-2) + (1.00*2^-2)
= (1.00*2^1) + (1.00*2^-2)
= 1.00*2^1
Algorithm Considerations (cont.)
Using a parallel algorithm, the order of performing the addition yields:
[(1.00*2^0) + (1.00*2^0)] + [(1.00*2^-2) + (1.00*2^-2)]
= (1.00*2^1) + (1.00*2^-1)
= 1.01*2^1
• Here the parallel order happens to be more accurate, but other
parallel orderings might produce less accurate results than the
serial algorithm
Pre-sorting of the operands is often used to maximize floating-point
arithmetic accuracy:
[(1.00*2^-2) + (1.00*2^-2)] + [(1.00*2^0) + (1.00*2^0)]
= (1.00*2^-1) + (1.00*2^1)
= 1.01*2^1
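The effect is reproducible in ordinary single precision (this sketch uses the 2^24 absorption from the rounding slide rather than the 5-bit format):

    #include <stdio.h>

    int main(void) {
        float big = 16777216.0f;                 /* 2^24                        */
        float ascending  = (1.0f + 1.0f) + big; /* small terms first: exact    */
        float descending = (big + 1.0f) + 1.0f; /* each 1.0 absorbed in turn   */
        printf("ascending  = %.1f\n", ascending);   /* 16777218.0 */
        printf("descending = %.1f\n", descending);  /* 16777216.0 */
        return 0;
    }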
Arithmetic Instruction Throughput
• int and float add, shift, min, max and float mul, mad: 4 cycles per
warp
– int multiply (*) is by default 32-bit and requires multiple
cycles per warp
• SP float is 32 bits with a 23-bit M, so the fast arithmetic
path handles 24-bit products
• Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply
• Integer divide and modulo are expensive
– The compiler will convert literal power-of-2 divides to shifts
– Be explicit in cases where the compiler can’t tell that the divisor is
a power of 2!
– Useful trick: foo % n == foo & (n-1) if n is a power of 2
• Example: foo = 11; n = 8; n-1 = 7; 11 % 8 = 3 = (1011)&(0111) = 0011
(see the sketch below)
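A sketch of the trick in C (valid only when n is a power of 2; on unsigned operands the two forms are exactly equivalent):

    #include <stdio.h>

    int main(void) {
        unsigned foo = 11, n = 8;                /* n must be a power of 2 */
        printf("11 %% 8 = %u\n", foo % n);       /* may compile to a divide */
        printf("11 & 7 = %u\n", foo & (n - 1));  /* single AND instruction  */
        return 0;
    }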
Arithmetic Instruction Throughput
• Reciprocal, reciprocal square root, sin/cos, log, exp:
16 cycles per warp
– These are the versions prefixed with “__”
– Examples: __rcp(), __sin(), __exp()
• Other functions are combinations of the above
– y / x == rcp(x) * y takes 20 cycles per warp
– sqrt(x) == rcp(rsqrt(x)) takes 32 cycles per warp
Runtime Math Library
• There are two types of runtime math operations
– __func(): direct mapping to the hardware ISA
• Fast but lower accuracy (see the programming guide for details)
• Examples: __sinf(x), __expf(x), __powf(x,y)
– func(): compiles to multiple instructions
• Slower but higher accuracy (5 ULP, units in the last place, or less)
• Examples: sinf(x), expf(x), powf(x,y)
• The -use_fast_math compiler option forces every
func() to compile to __func()
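A CUDA sketch of choosing per call site instead of flipping the whole file with -use_fast_math (kernel and parameter names are illustrative):

    __global__ void sines(const float *in, float *accurate, float *fast, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            accurate[i] = sinf(in[i]);    /* multi-instruction, higher accuracy */
            fast[i]     = __sinf(in[i]);  /* hardware special-function unit:
                                             faster but lower accuracy */
        }
    }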
Make your program float-safe!
• Future hardware will have double-precision support
– G80 is single-precision only
– Double precision will have an additional performance cost
– Careless use of double or undeclared types may run more slowly on
G80 and beyond
• It is important to be float-safe (be explicit whenever you want
single precision) to avoid using double precision where it is not
needed
– Add the ‘f’ specifier on float literals:
• foo = bar * 0.123;    // double assumed
• foo = bar * 0.123f;   // float explicit
– Use the float version of standard library functions:
• foo = sin(bar);       // double assumed
• foo = sinf(bar);      // single precision explicit
Deviations from IEEE-754
• Addition and multiplication are IEEE 754 compliant
– Maximum 0.5 ULP (units in the last place) of error
• However, they are often combined into a multiply-add (FMAD)
– The intermediate result is truncated
• Division is non-compliant (2 ULP)
• Not all rounding modes are supported
• Denormalized numbers are not supported
• No mechanism to detect floating-point exceptions
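When the FMAD truncation matters, the explicitly rounded CUDA intrinsics keep the multiply and the add as two separately IEEE-rounded operations; a sketch (kernel name illustrative):

    __global__ void mul_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            /* __fmul_rn/__fadd_rn are never merged into an FMAD, so the
               intermediate product is rounded, not truncated. */
            float p = __fmul_rn(a[i], b[i]);
            c[i]    = __fadd_rn(p, c[i]);
        }
    }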
GPU Floating Point Features
Feature                            | G80                                | SSE                                   | IBM Altivec                | Cell SPE
Precision                          | IEEE 754                           | IEEE 754                              | IEEE 754                   | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and round to zero | All 4 IEEE: nearest, zero, +inf, -inf | Round to nearest only      | Round to zero/truncate only
Denormal handling                  | Flush to zero                      | Supported, 1000s of cycles            | Supported, 1000s of cycles | Flush to zero
NaN support                        | Yes                                | Yes                                   | Yes                        | No
Overflow and infinity support      | Yes, only clamps to max norm       | Yes                                   | Yes                        | No, infinity
Flags                              | No                                 | Yes                                   | Yes                        | Some
Square root                        | Software only                      | Hardware                              | Software only              | Software only
Division                           | Software only                      | Hardware                              | Software only              | Software only
Reciprocal estimate accuracy       | 24 bit                             | 12 bit                                | 12 bit                     | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                             | 12 bit                                | 12 bit                     | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                             | No                                    | 12 bit                     | No