Floating Point Computation

Jyun-Ming Chen
Contents
• Sources of Computational Error
• Computer Representation of (Floating-Point) Numbers
• Efficiency Issues
Sources of Computational Error
• Converting a mathematical problem into a numerical problem introduces errors due to limited computational resources:
– round-off error (limited precision of representation)
– truncation error (limited time for computation)
• Misc.
– error in the original data
– blunder (programming or data-input error)
– propagated error
Supplement: Error Classification (Hildebrand)
• Gross error: caused by human or mechanical mistakes.
• Round-off error: the consequence of using a number specified by n correct digits to approximate a number which requires more than n digits (generally infinitely many digits) for its exact specification.
• Truncation error: any error which is neither a gross error nor a round-off error. Frequently, a truncation error corresponds to the fact that, whereas an exact result would be afforded (in the limit) by an infinite sequence of steps, the process is truncated after a certain finite number of steps.
Common Measures of Error
• Definitions
– total error = round-off error + truncation error
– absolute error = |numerical − exact|
– relative error = absolute error / |exact|
• If the exact value is zero, the relative error is not defined.
Ex: Round-off Error
• A representation consists of a finite number of digits.
• Implication: the representable real numbers are discrete points on the real line (more later).
Watch out for printf !!
• By default, "%f" prints out only 6 digits after the decimal point.
Ex: Numerical Differentiation
• Evaluating the first derivative of f(x):
f(x+h) = f(x) + f'(x) h + f''(x) h^2/2 + ...
⇒ f'(x) = (f(x+h) − f(x))/h − f''(x) h/2 − ...
⇒ f'(x) ≈ (f(x+h) − f(x))/h, for small h
• The neglected terms, −f''(x) h/2 − ..., are the truncation error.
Numerical Differentiation (cont)
• Select a problem with a known answer
– so that we can evaluate the error!
f(x) = x^3 ⇒ f'(x) = 3x^2 ⇒ f'(10) = 300
Numerical Differentiation (cont)
• Error analysis
– as h decreases, the (truncation) error decreases
• What happened at h = 0.00001?!
Ex: Polynomial Deflation
• f(x) is a polynomial with 20 real roots:
f(x) = (x − 1)(x − 2) ... (x − 20)
• Use any method to numerically solve for a root, then deflate the polynomial to 19th degree.
• Solve another root, then deflate again, and again, ...
• The accuracy of the roots obtained gets worse each time due to error propagation.
Computer Representation of Floating Point Numbers
Floating point vs. fixed point
Decimal–binary conversion
Standard: IEEE 754 (1985)
Floating vs. Fixed Point
• Decimal, 6 digits (positive numbers)
– Fixed point, with 5 digits after the decimal point:
• 0.00001, ..., 9.99999
– Floating point, 2 digits for the exponent (base 10) and 4 digits for the mantissa (accuracy):
• 0.001x10^-99, ..., 9.999x10^99
• Comparison:
– Fixed point: fixed accuracy; simple math for computation (sometimes used in graphics programs)
– Floating point: trades accuracy for a larger range of representation
Decimal–Binary Conversion
• Ex: 134 (base 10) = 10000110 (base 2)
• Ex: 0.125 (base 10) = 0.001 (base 2)
• Ex: 0.1 (base 10) = 0.000110011001100... (base 2), a non-terminating (repeating) binary fraction
Floating Point Representation
± f x b^e
• Fraction f
– usually normalized so that b^-1 ≤ f < 1
• Base b
– 2 for personal computers
– 16 for mainframes
– ...
• Exponent e
Understanding Your Platform
Padding
How about
IEEE 754-1985
• Purpose: make floating-point systems portable
• Defines: the number representation, how calculations are performed, exceptions, ...
• Single precision (32-bit)
• Double precision (64-bit)
Number Representation
• s: sign of the mantissa
• Range (roughly)
– single: 10^-38 to 10^38
– double: 10^-307 to 10^307
• Precision (roughly)
– single: 7 significant decimal digits
– double: 15 significant decimal digits
• Exercise: describe how these numbers are obtained.
Implication
• When you write your program, make sure the results you print carry the meaningful significant digits.
Implicit One
• The mantissa is normalized (1.xxx...) so that the leading 1 need not be stored, gaining one extra bit of precision.
• Ex: −3.5
Exponent Bias
• Ex: in single precision, the exponent has 8 bits
– 0000 0000 (0) to 1111 1111 (255)
• Add an offset to represent both + and − exponents
– effective exponent = biased exponent − bias
– bias value: 32-bit (127); 64-bit (1023)
– Ex: 32-bit
• 1000 0000 (128): effective exp. = 128 − 127 = 1
Ex: Convert −3.5 to a 32-bit FP Number
Examine Bits of FP Numbers
• Explain how this program works
The "Examiner"
• Use the previous program to
– observe how ME works
– test subnormal behaviors on your computer/compiler
– convince yourself why the subtraction of two nearly equal numbers produces lots of error
– NaN: Not-a-Number !?
Design Philosophy of IEEE 754
• Field layout: [s|e|m]
• s first: whether the number is +/− can be tested easily
• e before m: simplifies sorting
• The exponent is represented with a bias (not 2's complement) for ease of sorting
– [biased rep] −1, 0, 1 = 126, 127, 128
– [2's compl.] −1, 0, 1 = 0xFF, 0x00, 0x01
• 2's complement would need more complicated math for sorting and increment/decrement
Exceptions
• Overflow
– ±INF: when a number exceeds the range of representation
• Underflow
– when numbers are too close to zero, they are treated as zero
• Dwarf
– the smallest representable number in the FP system
• Machine Epsilon (ME)
– a number with computational significance (more later)
Extremities (more later)
• E = (1...1)
– M = (0...0): infinity
– M not all zeros: NaN (Not a Number)
• E = (0...0)
– M = (0...0): clean zero
– M not all zeros: dirty zero (see next page)
Not-a-Number
• Numerical exceptions
– sqrt of a negative number
– invalid domain of trigonometric functions
– ...
• Often causes the program to stop running
Extremities (32-bit)
• Max: 0 11111110 11111111111111111111111
(1.111...1)_2 x 2^(254−127) = (2 − 2^-23) x 2^127 ≈ 2^128
• Min (w/o stepping into dirty zero): 0 00000001 00000000000000000000000
(1.000...0)_2 x 2^(1−127) = 2^-126
Dirty-Zero (a.k.a. denormals)
• No "implicit one"
• Denormal behavior is not handled consistently across platforms and compilers
• If you are not sure how to handle them, stay away from them; scale your problem properly
– "Many problems can be solved by pretending they do not exist"
Dirty-Zero (cont)
• Denormals fill the gap of the real line between 0 and the smallest normalized number, 2^-126:
00000000 10000000 00000000 00000000 = 2^-126 (smallest normalized)
00000000 01000000 00000000 00000000 = 2^-127
00000000 00100000 00000000 00000000 = 2^-128
00000000 00010000 00000000 00000000 = 2^-129
(Dwarf: the smallest representable number)
Dwarf (32-bit)
• 0 00000000 00000000000000000000001
• Value: 2^-149
Machine Epsilon (ME)
• Definition
– the smallest non-zero number that makes a difference when added to 1.0 on your working platform
• This is not the same as the dwarf
Computing ME (32-bit)
• Start from 1 + eps and bring eps closer and closer to 0 until 1 + eps no longer differs from 1.0.
• ME: (00111111 10000000 00000000 00000001) − 1.0 = 2^-23 ≈ 1.19 x 10^-7
Effect of ME
Significance of ME
• Never terminate an iteration by testing whether two FP numbers are equal.
• Instead, test whether |x − y| < ME.
Numerical Scaling
• Number density: there are as many IEEE 754 numbers between [1.0, 2.0] as there are between [256, 512].
• Revisit:
– "round-off" error
– ME: a measure of the density near 1.0
• Implication:
– scale your problem so that intermediate results lie between 1.0 and 2.0 (where numbers are dense, and where round-off error is smallest)
Scaling (cont)
• Performing computation on denser portions of the real line minimizes the round-off error
– but don't overdo it; switching to double precision will easily increase the precision
– the densest part is near the subnormals, if density is defined as numbers per unit length
How Subtraction Is Performed on Your PC
• Steps:
– convert to base 2
– equalize the exponents by adjusting the mantissa values; truncate the values that do not fit
– subtract the mantissas
– normalize
Subtraction of Nearly Equal Numbers
• Base 10: 1.24446 − 1.24445 = 0.00001 → normalized to 1.00000 x 10^-5
• Significant loss of accuracy: only the leading digit of the result is meaningful; the trailing digits supplied by normalization are unreliable.
Theorem of Loss of Precision
• Let x, y be normalized floating-point machine numbers with x > y > 0.
• If 2^-p ≤ 1 − y/x ≤ 2^-q, then at most p and at least q significant binary bits are lost in the subtraction x − y.
• Interpretation:
– "When two numbers are very close, their subtraction introduces a lot of numerical error."
Implications
• When you program, avoid subtracting nearly equal numbers. Instead of:
f(x) = sqrt(x+1) − 1
g(x) = ln(x) − 1
you should write:
f(x) = (sqrt(x+1) − 1) (sqrt(x+1) + 1)/(sqrt(x+1) + 1) = x/(sqrt(x+1) + 1)
g(x) = ln(x) − ln(e) = ln(x/e)
• Similarly:
sqrt(x^2+1) − 1 = (sqrt(x^2+1) − 1) (sqrt(x^2+1) + 1)/(sqrt(x^2+1) + 1) = x^2/(sqrt(x^2+1) + 1)
• Every FP operation introduces error, but the subtraction of nearly equal numbers is the worst and should be avoided whenever possible.
Efficiency Issues
• Horner scheme
• Program examples
Horner Scheme
• For polynomial evaluation, e.g. a3 x^3 + a2 x^2 + a1 x + a0 = ((a3 x + a2) x + a1) x + a0
• Compare efficiency with the term-by-term evaluation
Accuracy vs. Efficiency

Good Coding Practice

On Arrays …
Issues of PI
• 3.14 is often not accurate enough
– 4.0*atan(1.0) is a good substitute
Compare:
Exercise
• Explain why, in floating-point arithmetic,
0.1 + 0.1 + ... + 0.1 (100,000 terms) ≠ 10,000
• Explain why the harmonic series
Σ (n = 1 to ∞) 1/n = 1 + 1/2 + 1/3 + 1/4 + ...
converges when implemented numerically.
Exercise
• Why does Me( ) not work as advertised?
• Construct the 64-bit version of everything
– bit examiner
– Dme( )
• 32-bit: int and float. Can every int be represented by a float (if converted)?