IEEE Arithmetic

Download Report

Transcript IEEE Arithmetic

IEEE Arithmetic
UC Berkeley
Fall 2004, E77
http://jagger.me.berkeley.edu/~pack/e77
Copyright 2005, Andy Packard. This work is licensed under the Creative Commons Attribution-ShareAlike
License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to
Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Computer representation of numbers
Integers and rational numbers in different bases
Floating point
IEEE standards
Roundoff
Errors on basic arithmetic
Integers in base 10
The base10 number (2075)10 means
2 10  0 10  7 10  5 10
3
2
1
0
What are the “rules”?
– In each digit, there is a number from 0 to 9
– The “value” of the slots are increasing powers of 10, starting at
100
Throughout the lecture, the “explanatory”
expression is always given in base 10
Integers in base 2
The base2 number (1001101)2 means
1  2  0  2  0  2  1  2  1 2  0  2  1  2
6
5
4
3
2
1
What are the “rules”?
– In each digit, there is a number from 0 to 1
– The “value” of the slots are increasing powers of 2, starting at 20
This numbers in this “explanatory”
expression are base 10
0
Integers in base p
The base p number (d4 d3 d2 d1 d0)p means
d 4  p 4  d3  p 3  d 2  p 2  d1  p1  d 0  p 0
What are the “rules”?
– In each digit, there is a number from 0 to p-1
– The “value” of the slots are increasing powers of p, starting at p0
Rational numbers in base 10
The base 10 number (75.396)10 means
1
2
7 10  5 10  3 10  9 10  6 10
1
0
What are the “rules”?
– In each digit, there is a number from 0 to 9
– The “value” of the slots to the left of the decimal point are
increasing powers of 10, starting at 100
– The “value” of the slots to the right of the decimal point are
decreasing powers of 10, starting at 10-1
– The leading digit is usually not zero, but it is ok if it is.
– We usually write 75.396 instead of 075.396, but they are the same.
3
Rational numbers in base 2
The base 2 number (1001.101)2 means
1 23  0  22  0  21  1 20  1 21  0  22  1 23
What are the “rules”?
– In each digit, there is a number from 0 to 1
– The “value” of the slots to the left of the decimal point are
increasing powers of 2, starting at 20
– The “value” of the slots to the right of the decimal point are
decreasing powers of 2, starting at 2-1
– The leading digit is usually not zero (see previous slide)
Rational numbers in base p
The base p number (d2 d1 d0 .d-1 d-2 d-3)p means
d 2  p 2  d1  p1  d 0  p 0  d 1  p 1  d 2  p 2  d 3  p 3
What are the “rules”?
– In each digit, there is a number from 0 to p-1
– The “value” of the slots to the left of the decimal point are
increasing powers of p, starting at p0
– The “value” of the slots to the right of the decimal point are
decreasing powers of p, starting at p-1
– The leading digit is usually not 0 (see previous slide)
“Floating Point” numbers in base p
The base p number (d1 d0 .d-1 d-2 d-3) × pe means
mantissa
d  p
1
1
exponent
 d 0  p 0  d 1  p 1  d 2  p 2  d 3  p 3  p e
What are the “rules”?
– In each digit, there is a number from 0 to p-1
– The “value” of the slots to the left of the decimal point are
increasing powers of p, starting at p0
– The “value” of the slots to the right of the decimal point are
decreasing powers of p, starting at p-1
– The exponent, e, must be an integer.
• It just “moves” the decimal point e places to the right.
Normalized Floating point numbers in base 2
A normalized binary number is written d0 .d-1 d-2 d-3 × 2e.
The rules for normalized are:
– Only one digit to left of decimal point
– Value of this digit, d0 is always 1
So, normalized binary numbers appear as (for example)
1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
In an unnormalized number, still allow only one digit to left
of decimal point, but it may be 0.
Finally, we need a ± sign in front to denote positive or
negative, giving the form
± d0 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
Computer representation of numbers
Computers “store” everything
– Integers, Characters
– Floating point numbers (eg., 1.435678, cos(2.3), etc.)
– Memory location addresses
– Computer chip instructions (“multiply contents of register A by
contents of register B, and store result in register C”)
as (long) sequences of digits made up of 0’s and 1’s.
– Each single digit is called a “bit” (8 bits is a byte)
– So, a “bit” can store a single base2 number (0 or 1)
For scientific computing and engineering calculations, an
important question is “How are floating point numbers
represented?” Need to store three things:
– the mantissa,
– the exponent, and
– the sign
Storing the mantissa, exponent and sign
Normalized numbers are of the form
±1 .d-1 d-2 d-3 d-4 d-5 d-6 × 2e
Store the sign with 1 bit: 0 means positive, 1 means
negative.
Store the mantissa (6 digits above) bit-by-bit.
Store the exponent as a binary integer
– What about the sign of the exponent? Later…
The number of bits used to store the mantissa controls the
precision to which numbers can be stored.
The number of bits used to store the exponent controls the
range of numbers (smallest to biggest) that can be stored
IEEE Standard
Single precision
– 32 bits
• 1 sign bit (0 means positive, 1 means negative)
• 8 bits for the exponent
• 23 bits for the fraction
Double precision
– 64 bits
• 1 sign bit (0 means positive, 1 means negative)
• 11 bits for the exponent
• 52 bits for the fraction
Plus, some additional rules as how to interpret things.
Let’s use a different precision to explain the rules.
Toy precision, using IEEE rules
7 bits (only 27=128 different possibilities)
1 sign bit, S, (0 means positive, 1 means negative)
3 bits for the exponent, E (000,001,010,011,100,101,110,111)
3 bits for the mantissa, M
Rules
– If 1≤E≤6, then value is -1S × 2E-3 × 1.M
•
•
•
•
2E-3 takes on values 0.25, 0.5, 1, 2, 4, 8
1.M values: 1, 1.125, 1.25, 1.375, 1.5, 1.6125, 1.75, 1.8725
Minimum value is 0.25, maximum value is 15 (realmin, realmax)
48 positive numbers, 48 negative numbers
– If E=0, and M≠0, then value is -1S × 2-2 × 0.M
• 1/32, 2/32, 3/32, 4/32, 5/32, 6/32, 7/32 (and their negatives)
• These are the unnormalized numbers (14 of them)
– If E=0 and M=0, then value is 0 or -0 (based on S)
– If E=7 and M=0, then value is Inf or –Inf (based on S)
– If E=7 and M≠0, then value is NaN (14 of these)
Double precision, using IEEE rules
64 bits (264=18400000000000000000 different possibilities)
1 sign bit, S, (0 means positive, 1 means negative)
11 bits for the exponent, E (between 0 and 2047)
52 bits for the mantissa, M (between 0 and 4.5×1015)
Rules
– If 1≤E≤2046, then value is -1S × 2E-1023 × 1.M
• 2E-1023 takes on 2046 different values, from 2-1022 to 21023
• 1.M takes on 4.5×1015 evenly spaced values from 1 to 2 (just less)
• Min value is 2.2×10-308 , maximum value is 1.8×10308
– If E=0, and M≠0, then value is -1S × 2-1022 × 0.M
• 4.5×1015 evenly spaced values from 0 to 2-1022 (just less)
• These are the unnormalized numbers (9.0×1015 of them)
– If E=0 and M=0, then value is 0 or -0 (based on S)
– If E=2047 and M=0, then value is Inf or –Inf (based on S)
– If E=2047 and M≠0, then value is NaN (4.5×1015 of these)
Toy precision, using IEEE rules
3-bit mantissa, 3-bit exponent, and sign bit gives
110 distinct numbers between -15 and 15,
+0 and -0, +Inf and –Inf, NaN
There are gaps. For 15<x<15, let fl(x) denote the closest
floating point number to x. The relative representation
error is
x  fl ( x)
fl ( x)
For toy precision, the maximum relative error is 2-4. In IEEE
double precision, the maximum relative error is 2-53 which
is about 10-16.
Size of gaps between representable numbers depends on
number (relative representation error)
IEEE Arithmetic
Arithmetic (add, subtract, multiply and divide) on pairs of
these numbers can give results that are not representable.
IEEE standard is that the stored result of these operations
should be the nearest floating point number. Hence
fl(x+y) = (x+y)×(1+δ) where | δ |<10-16
fl(x-y) = (x-y)×(1+δ) where | δ |<10-16
fl(x×y) = (x×y)×(1+δ) where | δ |<10-16
fl(x/y) = (x/y)×(1+δ) where | δ |<10-16
IEEE arithmetic also has fl(√y) = √y(1+δ)