Introduction to Floating-point Numbers


Introduction and
Floating-point Numbers
Douglas Wilhelm Harder, M.Math. LEL
Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, Ontario, Canada
ece.uwaterloo.ca
[email protected]
© 2012 by Douglas Wilhelm Harder. Some rights reserved.
Outline
This topic discusses numerical methods:
– We will give a quick overview of the tools used
• Iteration
• Linear algebra
• Interpolation
• Taylor series
• Bracketing
– Landau symbols and floating-point numbers
– Topics to be covered in NE 216 and NE 217
• IVPs in NE 216
• PDEs in NE 217
Five Tools for Numerical Methods
For many numerical algorithms, we use one or more of
five tools:
– Iteration
– Linear algebra
– Interpolation
– Taylor series
– Bracketing
All numerical algorithms you have seen use these
techniques in some combination
Iteration
The first tool used by almost all numerical algorithms
is iteration
– We have a problem for which we are looking for a solution x*
– We start with an initial approximation or guess, call it x₀
– We develop an algorithm or formula that takes a given approximation xₖ and hopefully produces a better approximation xₖ₊₁ of x*
– We need mechanisms to indicate:
• When our guess is “good enough”
• When it is pointless to continue using a particular algorithm
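As a minimal Matlab sketch of both mechanisms (our own illustration, anticipating the x = cos(x) example on the next slides; the names eps_step and N_max are our choices):

x        = 0.7388;   % initial approximation x0
eps_step = 1e-10;    % "good enough": stop when the step is this small
N_max    = 100;      % "pointless": give up after this many iterations
for k = 1:N_max
    x_new = cos( x );               % one step of x = f(x)
    if abs( x_new - x ) < eps_step
        break;                      % converged
    end
    x = x_new;
end

Checking the step size abs(x_new - x) is the simplest "good enough" test; the cap N_max guards against an algorithm that is not converging.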
Iteration
The easiest example is an application of the fixed-point
theorem
Given a problem of the form
x = f(x)
there are certain conditions (local contraction mappings)
under which this may be solved by starting with an initial guess x₀ and iterating:
xₖ₊₁ = f(xₖ)
Iteration
Let’s look at
x = cos(x)
Plotting this, we see that
x₀ = 0.7388
is likely a good initial approximation
Iteration
>> x = 0.7388
x =
0.738800000000000
>> for i = 1:20
x = cos( x )
end
x* = 0.739085133215160641655312
0.739277172332048
0.738955759728378
0.739172274566653
0.739026430946466
0.739124674496043
0.739058497154931
0.739103075323556
0.739073047076154
0.739093274529801
0.739079649103192
0.739088827367838
0.739082644784437
0.739086809449742
0.739084004082345
0.739085893812137
0.739084620867674
0.739085478338494
0.739084900735888
0.739085289815975
0.739085027726959
Iteration
>> x = 0.7388
x =
0.738800000000000
x* = 0.739085133215160641655312
>> for i = 1:3
x = x + (cos(x) - x)/(sin(x) + 1)
end
0.739085151172220
0.739085133215161
0.739085133215161
Note: some algorithms are better than others…
Selecting Initial Points
Note: there is no such thing as a good numerical algorithm that can find a solution from an arbitrary initial point
– Perhaps a mathematician cares about such properties, but if you
as an engineer do not have a reasonable idea of what a good
starting point is, you have a problem…
– In this course, where necessary, you will be given initial points
unless you are looking at a problem where you are explicitly
asked to estimate an initial point
Linear Algebra
The next tool used is linear algebra
– Most interesting systems contain more than one variable
– Usually this results in a system of either non-linear or linear
equations
– The only systems that we can easily solve are linear systems
– Systems of non-linear equations can be solved by approximating them with systems of linear equations and iterating
Linear Algebra
In one dimension, this is what Newton’s method does:
– Given a non-linear function f and a point xₖ,
– Find a tangent to the function at (xₖ, f(xₖ))
Linear Algebra
The equation of a line with slope m at a point (xₖ, yₖ) is

$$y = m(x - x_k) + y_k$$

Finding the root of this line requires only simple algebra:

$$x = x_k - \frac{y_k}{m}$$

In this case, we have:

$$x_{k+1} = x_k - \frac{f(x_k)}{f^{(1)}(x_k)}$$
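A minimal Matlab sketch of this iteration (ours, reusing f(x) = cos(x) − x from the earlier example):

f  = @(x) cos( x ) - x;    % a root of f solves x = cos(x)
df = @(x) -sin( x ) - 1;   % the derivative f'(x)
x  = 0.7388;
for k = 1:3
    x = x - f( x )/df( x ) % Newton iterate
end

Since −f(x)/f⁽¹⁾(x) = (cos(x) − x)/(sin(x) + 1), this is exactly the fast iteration shown on the earlier slide.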
Linear Algebra
Once we start getting into larger systems of linear equations, there are fast iterative methods
– Much better than Gaussian elimination…
– One of the worst performing is the Jacobi method
$$Mx = b$$
$$(D + M_{\text{off}})x = b$$
$$Dx + M_{\text{off}}x = b$$
$$Dx = b - M_{\text{off}}x$$
$$x = D^{-1}(b - M_{\text{off}}x)$$

x = f(x)
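The last line is the Jacobi iteration in fixed-point form. As a minimal Matlab sketch (our own packaging; the function name jacobi is our choice):

function [x] = jacobi( M, b, x, N )
    % Perform N Jacobi iterations on Mx = b from the initial guess x.
    D    = diag( diag( M ) );   % diagonal part of M
    Moff = M - D;               % off-diagonal part, so M = D + Moff
    for k = 1:N
        x = D \ (b - Moff*x);   % x := D^(-1) (b - Moff*x)
    end
end

The next slides carry out exactly these steps at the command line.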
Linear Algebra
Consider this system of linear equations:
>> M = [5.2 0.3 0.7 0.4; 0.3 4.8 -1.3 0.5; 0.7 -1.3 7.3 -0.8; 0.4 0.5 -0.8 6.4];
>> b = [2 4 5 2]';
>> D = diag( diag( M ) );
>> Moff = M - D;    % M = D + Moff

This is the system

$$\begin{pmatrix}
 5.2 &  0.3 &  0.7 &  0.4 \\
 0.3 &  4.8 & -1.3 &  0.5 \\
 0.7 & -1.3 &  7.3 & -0.8 \\
 0.4 &  0.5 & -0.8 &  6.4
\end{pmatrix} x = \begin{pmatrix} 2 \\ 4 \\ 5 \\ 2 \end{pmatrix}$$

>> x = D^-1 * b    % initial guess: Dx = b
x =
   0.384615384615385
   0.833333333333333
   0.684931506849315
   0.312500000000000

The solution is

$$x \approx \begin{pmatrix} 0.18039333 \\ 1.02772874 \\ 0.88701657 \\ 0.33181118 \end{pmatrix}$$
Linear Algebra
>> for i = 1:5
x = D^-1 * (b - Moff*x)
end
x =
   0.220297681770285
   0.962245071566561
   0.830698981383913
   0.308973810151036
x =
   0.193509166827092
   1.012360930456767
   0.869025926564131
   0.327393371346209
x =
   0.184041581486460
   1.022496722669196
   0.882538012313945
   0.329943220201888
x =
   0.181441747403601
   1.026482360721093
   0.885530302546704
   0.331432096237808
x =
   0.180694469520357
   1.027300171035569
   0.886652537362349
   0.331657244174278

The iterates are converging to the solution

$$x \approx \begin{pmatrix} 0.18039333 \\ 1.02772874 \\ 0.88701657 \\ 0.33181118 \end{pmatrix}$$
Interpolation
The next tool is finding interpolating polynomials:
– Given a set of points, find the best-fitting polynomial that passes
through them
– For example, given (2, 5) and (7, 8), find a line y = ax + b that
passes through these points
2a + b = 5
7a + b = 8
– This is a system of two equations and two unknowns:
>> [2 1; 7 1] \ [5 8]'
ans =
0.600000000000000
3.800000000000000
y = 0.6 x + 3.8
Interpolation
Given n points with unique x values, it is always possible
to find a unique interpolating polynomial passing through
the points
– For example, given (–2, 6), (–1, 0), (3, 4), and (5, 7), find a cubic polynomial y = ax³ + bx² + cx + d passing through them
(–2)³ a + (–2)² b + (–2) c + d = 6
(–1)³ a + (–1)² b + (–1) c + d = 0
3³ a + 3² b + 3 c + d = 4
5³ a + 5² b + 5 c + d = 7
– This gives the system

$$\begin{pmatrix}
 -8 &  4 & -2 & 1 \\
 -1 &  1 & -1 & 1 \\
 27 &  9 &  3 & 1 \\
125 & 25 &  5 & 1
\end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} = \begin{pmatrix} 6 \\ 0 \\ 4 \\ 7 \end{pmatrix}$$
Interpolation
In Matlab, we would do the following:

>> M = [-8 4 -2 1; -1 1 -1 1; 27 9 3 1; 125 25 5 1]
M =
    -8     4    -2     1
    -1     1    -1     1
    27     9     3     1
   125    25     5     1
>> c = M \ [6 0 4 7]'
c =
  -0.188095238095238
   1.400000000000000
  -0.483333333333333
  -2.071428571428571

This solves the system on the previous slide, giving

y = –0.188 x³ + 1.400 x² – 0.483 x – 2.071
Interpolation
Checking our answer:

>> plot( [-2 -1 3 5], [6 0 4 7], 'ro' )
>> xs = -2.5:0.1:5.5;
>> ys = polyval( c, xs );
>> hold on
>> plot( xs, ys, 'b' );
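As an aside (our own note, not on the original slide), Matlab's built-in polyfit produces the same coefficients when the requested degree is one less than the number of points:

>> c2 = polyfit( [-2 -1 3 5], [6 0 4 7], 3 );   % matches c up to rounding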
Taylor Series
The fourth tool used in numerical methods, and one we will use extensively, is the Taylor series
– Normally, these are written as

$$f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2}f^{(2)}(x_0)(x - x_0)^2 + \frac{1}{3!}f^{(3)}(x_0)(x - x_0)^3 + \cdots$$

We can truncate the series as

$$f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{1}{2}f^{(2)}(x_0)(x - x_0)^2 + \frac{1}{3!}f^{(3)}(\xi)(x - x_0)^3$$

where ξ ∈ (x₀, x)
Taylor Series
In numerical analysis, we usually write Taylor series as follows:
– Given a point x, we'd like to know what happens at x + h where h is very small
– Thus, the Taylor series will usually be in the form

$$f(x + h) = f(x) + f^{(1)}(x)h + \frac{1}{2}f^{(2)}(x)h^2 + \frac{1}{3!}f^{(3)}(x)h^3 + \cdots$$

– The truncated forms, of course, contain an approximation of the truncation error:

$$f(x + h) = f(x) + f^{(1)}(x)h + \frac{1}{2}f^{(2)}(\xi)h^2$$

where ξ ∈ (x, x + h)
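A quick check of that error term (our own experiment): the truncated form says the error of the linear approximation is (1/2)f⁽²⁾(ξ)h², so halving h should reduce the error by a factor of approximately four:

f = @(x) exp( x );   % f' = f, so f(x) + f'(x)h = f(x)*(1 + h)
x = 1;
for h = [0.1 0.05 0.025]
    err = abs( f(x + h) - f(x)*(1 + h) );   % truncation error
    fprintf( 'h = %5.3f   error = %.3e\n', h, err );
end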
Bracketing
The fifth tool is bracketing:
– Determine that the solution is on an interval [a, b]
– Find an algorithm to reduce the size of the interval to either
[a, c] or [c, b]
where a < c < b
– Iterate until the width of the interval is sufficiently small
• Choose the endpoint that best satisfies the conditions
– The most inefficient of methods…
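A minimal Matlab sketch of the simplest bracketing algorithm, bisection (our own illustration, again using f(x) = cos(x) − x): c is the midpoint, and we keep whichever half-interval still brackets a sign change.

f = @(x) cos( x ) - x;   % f(0) > 0 > f(1), so a root lies on [0, 1]
a = 0;
b = 1;
while b - a > 1e-10
    c = 0.5*(a + b);         % midpoint
    if f( a )*f( c ) <= 0
        b = c;               % the sign change is on [a, c]
    else
        a = c;               % the sign change is on [c, b]
    end
end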
Summary of Numerical Tools
We have quickly summarized five tools used in
numerical algorithms:
– Iteration
– Linear algebra
– Interpolation
– Taylor series
– Bracketing
The balance of this topic will discuss:
– Landau symbols
– Floating-point numbers
– Topics to be covered in NE 216 and NE 217
• IVPs in NE 216
• PDEs in NE 217
Landau Symbols
We will also use Landau symbols
– When iterating, we will very often have the following situation:
• There is a value h describing a quantity (such as a step size) about which we may make observations
– For example:
• For some algorithms, if we make h smaller by half, the error of our approximation is reduced by approximately half
• In others (e.g., Newton's method), if the error is h at one iteration, the error at the next iteration will be approximately h²
Landau Symbols
To state this mathematically, we will use the big-O
Landau symbol:
– An algorithm is O(h) if halving h reduces the error by half
– We will use O(h²) to indicate that halving h will reduce the error by a factor of four:

$$\left(\frac{h}{2}\right)^2 = \frac{h^2}{4}$$

– We will see one case where the error is O(h⁴)—halving h will reduce the error by a factor of 16:

$$\left(\frac{h}{2}\right)^4 = \frac{h^4}{16}$$
Floating-point Numbers
Everything we do deals with floating-point numbers
– Unfortunately, there are problems with floating-point numbers:
• We can only store a finite amount of precision
• We lose associativity: (x + y) + z ≠ x + (y + z)
– Both of these require us to carefully design our algorithms…
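For example (our own quick check of the lost associativity):

>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
ans =
     0

The two sums differ in their least-significant bit.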
Floating-point Numbers
Floating-point operations are specified in IEEE-754
– Led by William Kahan
– One of the most successful examples of collaboration:
• Every individual in the working group could have promoted the floating-point format used by his or her company
• Instead, they created a floating-point format that no one used but was superior to all that were in use
[Photo credit: George M. Bergman]
Floating-point Numbers
Some users will try:
>> 1 - 2/3 - 1/3
ans =
5.551115123125783e-017
and claim floating-point numbers don't work…

In binary, the three values are stored as:

1     1.0000000000000000000000000000000000000000000000000000
2/3   0.10101010101010101010101010101010101010101010101010101
1/3   0.010101010101010101010101010101010101010101010101010101

The two subtractions then proceed as:

  1.0000000000000000000000000000000000000000000000000000
- 0.10101010101010101010101010101010101010101010101010101
  0.01010101010101010101010101010101010101010101010101011

  0.01010101010101010101010101010101010101010101010101011
- 0.010101010101010101010101010101010101010101010101010101
  0.000000000000000000000000000000000000000000000000000001
Floating-point Numbers
Thus,
>> 1 - 2/3 - 1/3
ans =
5.551115123125783e-017
>> 2^-54
ans =
5.551115123125783e-017
Floating-point arithmetic is not exact
– Each operation—including conversion of decimal numbers to
binary—may have an error up to 0.5 in the least-significant bit
– For function evaluations, the allowable error is slightly larger
Absolute and Relative Error
To begin, we need to review absolute and relative error:
– If a approximates the value x, we say that
|x − a| is the absolute error of a

$$\frac{|x - a|}{|x|}$$ is the relative error of a

$$\frac{|x - a|}{|x|} \times 100\,\%$$ is the percent relative error
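For example (our own numbers), take a = 3.14 as an approximation of x = π:

>> abs( pi - 3.14 )               % absolute error, approximately 0.00159
>> abs( pi - 3.14 ) / abs( pi )   % relative error, approximately 0.000507, or 0.0507 %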
Absolute and Relative Error
Absolute error is not very useful:
– An error of 1 mm may be:
• Insignificant in a trip to Mars
• Possibly significant in designing a Mars rover
• Catastrophic for anything designed at the nanometer scale
We will focus on the relative error:
– A relative error of 0.00001, that is, 0.001 %, is acceptable in many engineering applications regardless of scale
– This is, however, application specific
Storing Real Numbers
Let us try to store real numbers with a finite number of digits—for example,

π ≈ 3.14

We will set some constraints:
– Use a fixed amount of memory
– Represent both very large and very small numbers
– Represent numbers with a small relative error
– Easily test equality and relative magnitude
A Simple Example
How significant a range can we represent with six
decimal digits and a sign?
±NNNNNN
Ideas?
A Reasonable Representation
Here’s one very simple idea:
– Let the six digits
±NNNNNN
represent:
±NNN.NNN
– For example, +039432 represents 39.432
– We store –3.14152 as -003142
A Reasonable Representation
Our range is somewhat limited…
– We can only represent numbers from 0.001 to 999.999
– Also, consider the relative error:

  Value       Representation   Relative Error
  0.0015      +000002          33 %
  999.9985    +999998          0.00005 %
Limited range, some numbers with large relative errors…
Relative magnitude can, however, be found quickly
A Reasonable Representation
Here is another idea:
– Let the six digits
±EEMNNN
represent:

$$\pm M.NNN \times 10^{EE - 49}$$

where we will require that M is non-zero
– For example, +549238 represents 9.238 × 10⁵
– We represent 372.863 as +513729
A Reasonable Representation
How does it fare?
±00MNNN represents numbers as small as:

$$\pm M.NNN \times 10^{00 - 49} = \pm M.NNN \times 10^{-49}$$

• For example, +005723 represents 5.723 × 10⁻⁴⁹
±99MNNN represents numbers as large as:

$$\pm M.NNN \times 10^{99 - 49} = \pm M.NNN \times 10^{50}$$

• For example, +995723 represents 5.723 × 10⁵⁰
A Reasonable Representation
Also, no number on the range
[1.000 × 10⁻⁴⁹, 9.999 × 10⁵⁰]
has a representation with a relative error larger than 0.05 %
– For example, 33 476 688 is represented by +563348 with a relative error of 0.0099 %
– For example, 6.626 069 57 × 10⁻³⁴ is represented by +156626 with a relative error of 0.0010 %
A Reasonable Representation
The requirement that the digit M is non-zero ensures
unique representations:
±EEMNNN
Otherwise, all four of
+491000 +500100 +510010 +520001
represent the same value: 1
1.000 × 10⁰   0.100 × 10¹   0.010 × 10²   0.001 × 10³
– Imagine if simply checking for equality required addition and
iteration?
A Reasonable Representation
Also, by choosing the order
±EEMNNN
and using a bias (the –49), we have one final advantage:
– Relative comparisons are also fast
These four represent numbers using our format:
+856729 +389657 +573823 +195737
Which is the largest in magnitude? Which is smallest?
6.729 × 10³⁶   9.657 × 10⁻¹¹   3.823 × 10⁸   5.737 × 10⁻³⁰
A Reasonable Representation
We have a few more issues:
– Representing zero: +000000
– Requiring negative zero: -000000
Why?
– Recall that a floating-point zero represents all numbers on the range (–s, s) where s is the smallest number that can be represented by a non-zero floating-point number
– One reason: branch cuts:

$$\ln(-1 + 0j) = \pi j$$
$$\ln(-1 - 0j) = -\pi j$$
A Reasonable Representation
Other issues:
– 1.153 × 10⁻⁴⁹ can be represented with full precision, but 4.853 × 10⁻⁵⁰ must be represented by +000000
Solution?
Denormalized numbers: ±00NNNN represents ±N.NNN × 10⁻⁴⁹, where the leading digit may now be zero
For example, 4.853 × 10⁻⁵⁰ = 0.4853 × 10⁻⁴⁹ is represented by ±000485
– Representing infinity (for example, 1/0, –1/0):
+990000 and -990000
– Representing undefined operations (0/0, not-a-number or NaN):
+990100
IEEE 754
When Dr. Kahan led the committee that eventually produced the IEEE 754 standard, there were numerous conflicts
– People from numerous corporations were represented, each wanting to advocate for their own representations
– Each corporation had already invested in their own designs
• No one wants to modify existing hardware that has already been
tested
Fortunately, this committee overcame these biases and
produced an excellent standard
– IEEE 754-2008 contains most of the original standard
IEEE 754
The original standard defines two formats:
– The float, a single-precision floating-point number
– The double, a double-precision floating-point number
For most applications outside of graphics, float is
unacceptable
– We will focus on double
IEEE 754
The double uses 64 bits:
SEEEEEEEEEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
to represent (in binary):

$$(-1)^S \times 1.NNN{\ldots}N_2 \times 2^{EEEEEEEEEEE_2 - 01111111111_2}$$

where 01111111111₂ = 1023
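One way to inspect these fields (our own sketch, using Matlab's built-in num2hex): the leading three hexadecimal digits hold the sign bit and the eleven exponent bits.

>> h = num2hex( pi )       % the 64 bits of pi as 16 hexadecimal digits
h =
400921fb54442d18
>> e = hex2dec( h(1:3) );  % the leading 12 bits: sign and exponent
>> bitshift( e, -11 )      % the sign bit S
ans =
     0
>> mod( e, 2048 ) - 1023   % the unbiased exponent: pi = 1.5707... x 2^1
ans =
     1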
IEEE 754
The smallest positive normalized number is
0000000000010000000000000000000000000000000000000000000000000000
which represents

$$(-1)^0 \times 1.000{\ldots}0_2 \times 2^{00000000001_2 - 01111111111_2} = 2^{-1022} \approx 2.225 \times 10^{-308}$$
IEEE 754
Denormalized numbers go as small as
0000000000000000000000000000000000000000000000000000000000000001
which represents

$$(-1)^0 \times 0.000{\ldots}01_2 \times 2^{1 - 01111111111_2} = 2^{-1074} \approx 4.941 \times 10^{-324}$$

Note: $2^{-52} \times 2^{1_2 - 01111111111_2} = 2^{-52 + 1 - 1023} = 2^{-1074}$
IEEE 754
The largest positive number is
0111111111101111111111111111111111111111111111111111111111111111
which represents

$$(-1)^0 \times 1.111{\ldots}1_2 \times 2^{11111111110_2 - 01111111111_2} \approx 2^{1024} \approx 1.79769 \times 10^{308}$$
>> format long
>> 1.999999999999999778 * 2^1023
ans =
1.797693134862316e+308
>> 2^1024
ans =
Inf
Recall in decimal: 1.999…₁₀ = 2
In binary: 1.111…₂ = 2
IEEE 754
Infinity is represented by
S111111111110000000000000000000000000000000000000000000000000000
which represents

$$(-1)^S \times \infty$$
If the mantissa is not all zero, it represents a NaN
– NaN has special properties:
NaN == NaN returns false (0)—you must use isnan( x )
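For example (our own quick check):

>> NaN == NaN
ans =
     0
>> isnan( NaN )
ans =
     1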
IEEE 754
You can view the underlying format:
>> format hex
>> pi
ans =
400921fb54442d18
>> exp(1)
ans =
4005bf0a8b14576a
>> 1/0
ans =
7ff0000000000000
>> 0
ans =
0000000000000000
>> 1e-300
ans =
01a56e1fc2f8f359
Hexadecimal   Binary
    0          0000
    1          0001
    2          0010
    3          0011
    4          0100
    5          0101
    6          0110
    7          0111
    8          1000
    9          1001
    a          1010
    b          1011
    c          1100
    d          1101
    e          1110
    f          1111
Issues with Floating-point Numbers
There are still issues:
– Overflow and underflow
– Subtractive cancellation
– Adding large and small numbers
– Order of operations—not associative
These must be dealt with in our algorithms...
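For example (our own illustration of subtractive cancellation), subtracting two nearly equal numbers leaves only the noisy low-order bits:

>> (1 + 1e-15) - 1
ans =
     1.110223024625157e-15

The result has a relative error of about 11 %, even though both operands were correct to within half a bit.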
A More Accurate Sum
The Kahan algorithm for adding numbers:
function [s] = Kahan_sum( v )
    s = 0;                   % running sum
    c = 0;                   % compensation: the low-order bits lost so far
    for x = v
        y = x - c;           % subtract the previously lost bits
        t = s + y;           % low-order bits of y may be lost here...
        c = (t - s) - y;     % ...and (t - s) - y recovers their negation
        s = t;
    end
end
A More Accurate Sum
Consider:
>> v = rand( 1, 10000000 );
>> sum( v )
ans =
5.000908006717473e+006
>> Kahan_sum( v )
ans =
5.000908006717303e+006
>> sum( sort( v, 'ascend' ) )
ans =
5.000908006717453e+006
>> sum( sort( v, 'descend' ) )
ans =
5.000908006717471e+006
A More Accurate Sum
Are you serious?
> Kahan_sum := proc( v )
      local s, c, y, t, x;
      s := 0;
      c := 0;
      for x in v do
          y := x - c;
          t := s + y;
          c := (t - s) - y;
          s := t;
      end do;
      return s;
  end proc:
> S := [seq( rand()/1e12, i = 1..1000000 )]:
A More Accurate Sum
Are you serious?
> add( i, i = S );
5.002264965 × 10⁵
> add( i, i = sort( S ) );
5.002264658 × 10⁵
> add( i, i = sort( S, `>` ) );
5.002264784 × 10⁵
> Kahan_sum( S );
5.002264636 × 10⁵
> Digits := 30:
> add( i, i = S );
5.00226463567620254 × 10⁵
The Laboratories
In the laboratories of this class, you will see six problems that arise often in nanotechnology engineering
The laboratories will be divided into two parts:
– A one-hour presentation one week, and
– A one-hour help session the next week for assistance
The Laboratories
This is part of an integrated approach in your
nanotechnology courses:
– NE 113 Engineering Computation was your introduction
– NE 216 will focus on numerical solutions to IVPs
– NE 217 focuses on PDEs
This will lead to
NE 336 Micro and Nanosystem Computer-aided Design
The Laboratories
NE 216 will look at numerical approximations to:
– Numerical algorithms
– Differentiation
– 1st-order initial value problems (IVPs) using Euler and Heun's
methods
– 1st-order IVPs with the better 4th-order Runge-Kutta and the
Dormand-Prince methods
– Systems of IVPs and converting higher-order IVPs into a system
of 1st-order IVPs
– Boundary-value problems (BVPs) using the shooting method
The Laboratories
NE 217 will look at numerical approximations to:
– Boundary-value problems using finite differences
– Heat-conduction/diffusion equation
– Heat-conduction/diffusion using the Crank-Nicolson method with
insulated boundaries
– Wave equation
– Laplace's equation in two and three dimensions
– The heat-conduction/diffusion and wave equations in two and
three dimensions
The Laboratories
By the end of this sequence of laboratories, you will be
able to produce the following animations:
Outline
In this topic, we saw:
– Five tools used in numerical algorithms:
• Iteration
• Linear algebra
• Interpolation
• Taylor series
• Bracketing
– Landau symbols and floating-point numbers, IEEE 754
– A summary of the laboratories in NE 216 and NE 217
• IVPs and then PDEs