Floating point numbers

Computable reals
 “computable numbers may be
described briefly as the real numbers
whose expressions as a decimal are
calculable by finite means.” (A. M. Turing,
On Computable Numbers, with an Application to the
Entscheidungsproblem, Proc. London Mathematical
Soc., Ser. 2, Vol. 42, pages 230-265, 1936-7.)
Look first at decimal reals
 A real number may be approximated by a
decimal expansion with a determinate
decimal point.
 As more digits are added to the decimal
expansion the precision rises.
 Any effective calculation is always finite – if
it were not then the calculation would go on
for ever.
 There is thus a limit to the precision
with which the reals can be represented.
Irrational numbers
 In principle, irrational numbers
such as Pi (which is transcendental) or
root 2 (which is algebraic) have no finite
decimal representation
 We are always dealing with
approximations to them.
 We can still treat Pi as a real rather
than a rational because there is
always an algorithmic step by which
we can add another digit to its
expansion.
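The idea of always being able to add another digit can be sketched in Python. This is only an illustration, using root 2 rather than Pi because its digits follow from an integer square root; the function name sqrt2_digits is an assumption made for this example.

```python
from math import isqrt

def sqrt2_digits(n):
    """Return floor(sqrt(2) * 10**n): root 2 truncated to n decimal places."""
    # isqrt works on exact integers, so no floating point error creeps in.
    return isqrt(2 * 10 ** (2 * n))

# Each pass adds one more digit to the expansion; the loop could go on for ever.
for n in range(1, 6):
    digits = str(sqrt2_digits(n))
    print(digits[0] + "." + digits[1:])
```

Any run of the loop stops after finitely many digits, so what is printed is always an approximation.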
First solution
 Store the numbers in memory just as they
are printed as a string of characters.
 249.75
Would be stored as 6 bytes as shown below.
Note that the decimal digits have ASCII codes
in the range 30H to 39H.
Char: 2  4  9  .  7  5
Hex:  32 34 39 2E 37 35
(32 is the char for 2, 2E is the full stop char)
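A quick check of this byte layout, as a sketch in Python (the variable names are chosen for this example only):

```python
text = "249.75"
stored = text.encode("ascii")      # one byte per printed character

print(len(stored))                 # number of bytes used
print(stored.hex(" ").upper())     # the bytes in hexadecimal
```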
Implications
 The number strings can be of variable
length.
 This allows arbitrary precision.
 This representation is used in
systems like Mathematica, which
require very high accuracy.
Example with Mathematica
In[1]:=5!
Out[1]=120
In[2]:=10!
Out[2]=3628800
In[3]:=50!
Out[3]=30414093201713378043612608166064768844377641568960512000000000000
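The same exact results can be reproduced with Python's built-in integers, which also grow to whatever length is needed. This is a sketch for comparison only and says nothing about how Mathematica stores its numbers internally.

```python
from math import factorial

# Python integers have arbitrary length, so every digit of 50! is kept.
for n in (5, 10, 50):
    print(n, factorial(n))
```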
Decimal byte arithmetic
“9” + “8” = “17” decimal
 39H + 38H = 71H hexadecimal ASCII
 57 + 56 = 113 decimal ASCII
 Adjust by taking 30H = 48 away, giving
41H = 65
 If greater than “9” = 39H = 57, take
away 10 = 0AH and carry 1
 Thus 41H - 0AH = 65 - 10 = 55 = 37H, so
the answer would be 31H, 37H = “17”
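The adjustment steps above can be sketched in Python as follows. The function names add_ascii_digits and add_decimal_strings are assumptions made for this illustration, and the sketch handles non-negative whole numbers only (no decimal point, no sign).

```python
def add_ascii_digits(a, b, carry_in=0):
    """Add two ASCII digit codes; return (carry_out, ASCII code of the result digit)."""
    total = a + b + carry_in - 0x30    # e.g. 39H + 38H - 30H = 41H
    if total > 0x39:                   # greater than "9": take away 10, carry 1
        return 1, total - 10
    return 0, total

def add_decimal_strings(x, y):
    """Add two strings of decimal digits, working from the right-hand end."""
    width = max(len(x), len(y))
    xs = x.rjust(width, "0").encode("ascii")
    ys = y.rjust(width, "0").encode("ascii")
    out = bytearray()
    carry = 0
    for a, b in zip(reversed(xs), reversed(ys)):
        carry, digit = add_ascii_digits(a, b, carry)
        out.append(digit)
    if carry:
        out.append(0x31)               # leading "1" from the final carry
    return bytes(reversed(out)).decode("ascii")

print(add_decimal_strings("9", "8"))
```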
Representing variables
 Variables are represented as pointers
to character strings in this system
 A=249.75
A → 32 34 39 2E 37 35
Advantages
 Arbitrarily precise
 Needs no special hardware
Disadvantages
 Slow
 Needs complex memory management
Binary Coded Decimal (BCD) or
Calculator style floating point
 Note that 249.75 can be represented
as 2.4975 x 10^2
 Store this, 2 digits to a byte, to fixed
precision as follows
mantissa: 24 97 50
exponent: 02
32 bits overall
Each digit uses 4 bits
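A minimal sketch of this packing in Python. The function name pack_bcd is hypothetical, and the sketch assumes a 6 digit mantissa and a non-negative 2 digit exponent with no sign information, since the slide does not say how signs are stored.

```python
def pack_bcd(mantissa_digits, exponent):
    """Pack a 6-digit mantissa and 2-digit exponent into 4 bytes, two digits per byte."""
    digits = mantissa_digits + f"{exponent:02d}"
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError("expected 6 mantissa digits and an exponent from 0 to 99")
    # High nibble holds the first digit of each pair, low nibble the second.
    return bytes((int(digits[i]) << 4) | int(digits[i + 1]) for i in range(0, 8, 2))

packed = pack_bcd("249750", 2)         # 2.49750 x 10^2
print(packed.hex(" "))
```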
Normalise
Convert a positive number N to a format with one
non-zero digit in front of the decimal point as follows:
1. If N>=10 then whilst N>=10 divide by
10 and add 1 to the exponent
2. Else whilst N<1 multiply by 10 and
decrement the exponent
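The two loops can be sketched in Python like this. The function name normalise is chosen for the example, Decimal is used so that dividing by 10 is exact, and the sketch assumes a positive input.

```python
from decimal import Decimal

def normalise(n):
    """Return (mantissa, exponent) with 1 <= mantissa < 10 and n = mantissa x 10^exponent."""
    n = Decimal(n)
    if n <= 0:
        raise ValueError("this sketch handles positive numbers only")
    exponent = 0
    while n >= 10:      # too big: divide by 10, add 1 to the exponent
        n /= 10
        exponent += 1
    while n < 1:        # too small: multiply by 10, decrement the exponent
        n *= 10
        exponent -= 1
    return n, exponent

print(normalise("249.75"))
```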
Add floating point
1. Denormalise smaller number so that
exponents equal
2. Perform addition
3. Renormalise
Eg 949.75 + 52.0 = 1001.75
9.49750 E02 → 9.49750 E02
5.20000 E01 → 0.52000 E02 +
10.01750 E02 → 1.00175 E03
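The three steps can be sketched in Python with integer mantissas. This is an illustration under stated assumptions, not a full implementation: the name fp_add is hypothetical, mantissas are 6 digit integers (9.49750 is held as 949750), both operands are positive and already normalised, and digits shifted out are simply dropped.

```python
DIGITS = 6  # mantissa digits: one before the point, five after

def fp_add(m1, e1, m2, e2):
    """Add two positive decimal floats given as (6-digit integer mantissa, exponent)."""
    if e1 < e2:                        # make the first operand the larger one
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 //= 10 ** (e1 - e2)             # 1. denormalise the smaller number
    total = m1 + m2                    # 2. perform the addition
    exponent = e1
    while total >= 10 ** DIGITS:       # 3. renormalise, dropping the last digit
        total //= 10
        exponent += 1
    return total, exponent

print(fp_add(949750, 2, 520000, 1))    # 9.49750 E02 + 5.20000 E01
```

A result of (100175, 3) reads as 1.00175 E03. Any non-zero digit removed by the integer divisions is where accuracy is lost.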
Note loss of accuracy
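Compare Octave, which uses floating point
numbers, with Mathematica, which uses full
precision arithmetic
 Octave displays only 5 significant figures
by default; its IEEE double precision numbers
hold about 15 to 16 significant decimal digits,
still far fewer than the 65 digits of 50!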
Octave
fact(5)
ans = 120
fact(10)
ans = 3628800
fact(50)
ans = 3.0414e+64
Mathematica
In[1]:=5!
Out[1]=120
In[2]:=10!
Out[2]=3628800
In[3]:=50!
Out[3]=30414093201713378043612608166064768844377641568960512000000000000
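The same comparison can be sketched in Python, assuming (as on mainstream platforms) that Python's float is an IEEE double like Octave's numbers. It separates what is displayed from what is stored.

```python
from fractions import Fraction
from math import factorial

exact = factorial(50)          # arbitrary precision integer, every digit kept
approx = float(exact)          # nearest double precision floating point value

print(f"{approx:.4e}")         # 5 significant figures, like the Octave display
print(int(approx) == exact)    # False: the double cannot hold all 65 digits

# The stored value is still much more accurate than the 5 figure display.
relative_error = Fraction(abs(int(approx) - exact), exact)
print(relative_error < Fraction(1, 10 ** 15))
```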
Loss of precision continued
 When there is a big difference in size
between the numbers, the smaller one falls
below the last mantissa digit that is kept,
so the addition is lost with floating point
Octave
325000000 + 108
ans = 3.2500D+08
Mathematica
In[1]:= 325000000 + 108
Out[1]= 325000108
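A sketch of the effect in Python. A 5 figure decimal context is used here to model the 5 significant figures shown above; that choice is an assumption made for illustration and is not how Octave stores numbers.

```python
from decimal import Context, Decimal

five_figures = Context(prec=5)                     # keep 5 significant digits
lost = five_figures.add(Decimal(325000000), Decimal(108))
print(lost)                                        # the 108 has disappeared

exact = 325000000 + 108                            # integer arithmetic keeps it
print(exact)
```

With a longer mantissa the small operand survives: an IEEE double has enough digits to hold 325000108 exactly, so the loss only appears once the gap between the operands exceeds the mantissa length.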
Institute of Electrical and Electronics Engineers
IEEE floating point numbers
Single Precision
S | E | F
1 bit | 8 bits | 23 bits
Definition
 N = (-1)^s x 1.F x 2^(E-127)
The leading 1 of 1.F is always present in a normalised
number, so this bit is deleted and not stored.
Example 1
3.25
In fixed point binary = 11.01
= 1.101 x 2^1
In IEEE format this is
s=0, E=128, F=10100… thus in IEEE it is
S|E        |F
0|1000 0000|1010 0000 0000 0000 0000 000
Example 2
-0.375 = -3/8
In fixed point binary = -0.011
= (-1)^1 x 1.1 x 2^-2
In IEEE format this is
s=1, E=125, F=1000… thus in IEEE it is
S|E        |F
1|0111 1101|1000 0000 0000 0000 0000 000
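The bit fields of a single precision number can be inspected with Python's struct module. This is a sketch: the helper names ieee32_fields and ieee32_value are assumptions made for the example, and the exponent bias of 127 is the one defined by IEEE 754 for single precision.

```python
import struct

def ieee32_fields(x):
    """Split x, rounded to IEEE 754 single precision, into (s, E, F)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def ieee32_value(s, e, f):
    """Rebuild a normalised value: (-1)^s x 1.F x 2^(E-127)."""
    return (-1) ** s * (1 + f / 2 ** 23) * 2.0 ** (e - 127)

for x in (3.25, -0.375):
    s, e, f = ieee32_fields(x)
    print(x, s, f"{e:08b}", f"{f:023b}")
```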
Range (smallest normalised to largest magnitude)
 IEEE32 1.17 x 10^-38 to 3.40 x 10^38
 IEEE64 2.23 x 10^-308 to 1.79 x 10^308
 80bit 3.36 x 10^-4932 to 1.18 x 10^4932
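The 32 and 64 bit limits can be checked from Python as a sketch. It assumes that Python's float is an IEEE double, which holds on mainstream platforms; the helper name f32 is chosen for this example, and the 80 bit format is not available from the standard library so it is left out.

```python
import struct
import sys

def f32(hex_bits):
    """Interpret 8 hex digits as a big-endian IEEE single precision number."""
    return struct.unpack(">f", bytes.fromhex(hex_bits))[0]

smallest32 = f32("00800000")   # E = 1, F = 0: smallest normalised single
largest32 = f32("7f7fffff")    # E = 254, F all ones: largest finite single

print(smallest32, largest32)
print(sys.float_info.min, sys.float_info.max)   # the same limits for doubles
```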