Floating Point Numbers and Arithmetic
CDA 3101
Spring 2016
Introduction to Computer Organization
Floating Point
Theory, Notation, MIPS
11, 16 February 2016
Overview
• Floating point numbers
• Scientific notation
– Decimal scientific notation
– Binary scientific notation
• IEEE 754 FP Standard
• Floating point representation inside a computer
– Greater range vs. precision
• Decimal to Floating Point conversion
• Type is not associated with data
• MIPS floating point instructions, registers
Computer Numbers
• Computers are made to deal with numbers
• What can we represent in n bits?
– Unsigned integers: 0 to 2^n - 1
– Signed integers: -2^(n-1) to 2^(n-1) - 1
• What about other numbers?
– Very large numbers? (seconds/century)
3,155,760,000ten (3.15576ten x 10^9)
– Very small numbers? (atomic diameter)
0.00000001ten (1.0ten x 10^-8)
– Rationals (repeating pattern): 2/3 (0.666666666...)
– Irrationals: 2^(1/2) (1.414213562373...)
– Transcendentals: e (2.718...), π (3.141...)
Scientific Notation
6.02 x 10^23
– mantissa: 6.02 (contains the decimal point)
– radix (base): 10
– exponent: 23
• Normalized form: no leading 0s
(exactly one digit to left of decimal point)
• Alternatives to representing 1/1,000,000,000
–Normalized:
1.0 x 10^-9
–Not normalized:
0.1 x 10^-8, 10.0 x 10^-10
Binary Scientific Notation
1.0two x 2^-1
– mantissa: 1.0two (contains the “binary point”)
– radix (base): 2
– exponent: -1
• Floating point arithmetic
–Binary point is not fixed (as it is for integers)
–Declare such a variable in C as float
Floating Point Representation
• Normal format: +1.xxxxxxxxxxtwo x 2^(yyyytwo)
• Multiple of Word Size (32 bits)
– Bit 31: S (1 bit)
– Bits 30-23: Exponent (8 bits)
– Bits 22-0: Significand (23 bits)
• S represents Sign
Exponent represents y’s
Significand represents x’s
• Represent numbers as small as 2.0 x 10^-38 to
as large as 2.0 x 10^38
Overflow and Underflow
• Overflow
– Result is too large (> 2.0 x 10^38)
– Exponent larger than can be represented in 8-bit Exponent field
• Underflow
– Result is too small
• > 0, < 2.0 x 10^-38
– Negative exponent larger than can be represented in 8-bit Exponent field
• How to reduce chances of overflow or underflow?
Double Precision FP
• Use two words (64 bits)
– First word, bit 31: S (1 bit)
– First word, bits 30-20: Exponent (11 bits)
– First word, bits 19-0: Significand (20 bits)
– Second word, bits 31-0: Significand (cont’d) (32 bits)
• C variable declared as double
• Represent numbers almost as small as
2.0 x 10^-308 to almost as large as 2.0 x 10^308
• Primary advantage is greater accuracy (52 bits)
Floating Point Representation
Normalized scientific notation: +1.xxxxtwo x 2^(yyyytwo)
Single Precision:
– Bit 31: S (1 bit)
– Bits 30-23: Exponent (8 bits)
– Bits 22-0: Significand (23 bits)
Double Precision:
– First word, bit 31: S (1 bit)
– First word, bits 30-20: Exponent (11 bits)
– First word, bits 19-0: Significand (20 bits)
– Second word, bits 31-0: Significand (cont’d) (32 bits)
Exponent: biased notation, bias 127 (SP), 1023 (DP)
Significand: sign-magnitude notation
IEEE 754 FP Standard
• Used in almost all computers (since 1980)
– Porting of FP programs
– Quality of FP computer arithmetic
• Sign bit: 1 means negative, 0 means positive
• Significand:
– Leading 1 implicit for normalized numbers
– 1 + 23 bits single, 1 + 52 bits double
– always true: 0 <= Significand < 1
• 0 has no leading 1
– Reserve exponent value 0 just for number 0
(-1)^S x (1 + Significand) x 2^Exp
IEEE 754 Exponent
• Use FP numbers even without FP hardware
– Sort records with FP numbers using integer
compares
• Break FP number into 3 parts: compare
signs, then compare exponents, then
compare significands
• Faster (single comparison, ideally)
– Highest order bit is sign (negative < positive)
– Exponent next, so bigger exponent => bigger #
– Significand last: if exponents are the same, bigger significand => bigger #
Biased Notation for Exponents
• Two’s complement does not work for exponent
– A negative exponent would look larger than a positive one in an unsigned integer compare
• Most negative exponent: 0000 0001two
• Most positive exponent: 1111 1110two
• Bias: number added to real exponent
– 127 for single precision
– 1023 for double precision
• 1.0 x 2^-1
0 0111 1110 000 0000 0000 0000 0000 0000
(-1)^S x (1 + Significand) x 2^(Exponent - Bias)
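The biased-exponent formula can be sketched in C by assembling the three fields into one 32-bit word. This is a minimal illustration, not part of the original slides: the names pack_single and bits_to_float are hypothetical, and the code assumes that float on the target machine is an IEEE 754 single.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a single-precision bit pattern from its fields.
   true_exp is the real (unbiased) exponent; frac is the 23-bit Significand
   field, i.e. the bits after the implicit leading 1. */
uint32_t pack_single(uint32_t sign, int true_exp, uint32_t frac)
{
    /* Normalized numbers only: the stored exponent must land in 1..254. */
    assert(sign <= 1u && true_exp >= -126 && true_exp <= 127);
    assert(frac < (1u << 23));
    uint32_t stored_exp = (uint32_t)(true_exp + 127);  /* add the bias */
    return (sign << 31) | (stored_exp << 23) | frac;
}

/* Reinterpret a 32-bit pattern as a float (assumes float is IEEE 754 single). */
float bits_to_float(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For the slide's example, pack_single(0, -1, 0) yields the stored exponent 126 (0111 1110two), and bits_to_float turns that pattern into 0.5, which is 1.0 x 2^-1.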
Significand
• Method 1 (Fractions):
– In decimal: 0.340ten => 340ten/1000ten => 34ten/100ten
– In binary: 0.110two => 110two/1000two (6ten/8ten) => 11two/100two (3ten/4ten)
– Helps understand the meaning of the significand
• Method 2 (Place Values):
– Convert from scientific notation
– In decimal: 1.6732 = (1x10^0) + (6x10^-1) + (7x10^-2) + (3x10^-3) + (2x10^-4)
– In binary: 1.1001 = (1x2^0) + (1x2^-1) + (0x2^-2) + (0x2^-3) + (1x2^-4)
– Interpretation of value in each position extends beyond the decimal/binary point
– Good for quickly calculating significand value
– Use this method for translating FP numbers
Binary to Decimal FP
0 0110 1000 101 0101 0100 0011 0100 0010
• Sign: 0 => positive
• Exponent:
– 0110 1000two = 104ten
– Bias adjustment: 104 - 127 = -23
• Significand:
– 1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + ...
= 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
= 1.0 + 0.666115
• Represents: 1.666115 x 2^-23 ~ 1.986 x 10^-7
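The same decoding steps can be sketched in C using the place-value method. The struct fp_fields and the function decode_single are hypothetical names introduced for illustration; the sketch handles normalized single-precision patterns only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three decoded fields. */
struct fp_fields {
    uint32_t sign;        /* 0 => positive, 1 => negative */
    int      true_exp;    /* stored exponent minus the bias (127) */
    double   significand; /* 1 + place values of the 23 fraction bits */
};

struct fp_fields decode_single(uint32_t bits)
{
    struct fp_fields out;
    uint32_t stored_exp = (bits >> 23) & 0xFFu;
    uint32_t frac = bits & 0x7FFFFFu;

    /* Exponents 0 and 255 are reserved for special values. */
    assert(stored_exp != 0u && stored_exp != 255u);

    out.sign = bits >> 31;
    out.true_exp = (int)stored_exp - 127;   /* bias adjustment */

    /* Place values: the first fraction bit is worth 2^-1, the next 2^-2, ... */
    out.significand = 1.0;                  /* implicit leading 1 */
    double place = 0.5;
    for (int i = 22; i >= 0; i--) {
        if ((frac >> i) & 1u)
            out.significand += place;
        place /= 2.0;
    }
    return out;
}
```

The slide's bit pattern is 0x34554342 in hexadecimal; decoding it gives sign 0, exponent -23, and a significand of about 1.666115.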
Decimal to Binary FP (1/2)
• Simple Case: If denominator is a power of
2 (2, 4, 8, 16, etc.), then it’s easy.
• Example: Binary FP representation of -0.75
– -0.75 = -3/4
– -11two/100two = -0.11two
– Normalized to -1.1two x 2^-1
– (-1)^S x (1 + Significand) x 2^(Exponent - 127)
– (-1)^1 x (1 + .100 0000 ... 0000) x 2^(126 - 127)
1 0111 1110 100 0000 0000 0000 0000 0000
Decimal to Binary FP (2/2)
• Denominator is not a power of 2
– Number cannot be represented precisely
– Lots of significand bits for precision
– Difficult part: get the significand
• Rational numbers have a repeating pattern
• Conversion
– Write out binary number with repeating pattern.
– Cut it off after correct number of bits (different
for single vs. double precision).
– Derive sign, exponent and significand fields.
Decimal to Binary
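Example: -3.333333...
– Integer part: 3ten = 11two
– Fraction part: multiply by 2 repeatedly; the integer part of each product is the next bit
0.33333333 x 2 = 0.66666666 => bit 0
0.66666666 x 2 = 1.33333332 => bit 1
0.33333332 x 2 = 0.66666664 => bit 0
– Result: -11.0101010...two => -1.1010101...two x 2^1
1. Significand: 101 0101 0101 0101 0101 0101
2. Sign: negative => 1
3. Exponent: 1 + 127 = 128ten = 1000 0000two
1 1000 0000 101 0101 0101 0101 0101 0101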
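The repeated multiply-by-2 step can be sketched in C. The function name fraction_bits is hypothetical, and the sketch cuts the pattern off after n bits (truncation) rather than rounding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the first n binary digits of frac
   (0 <= frac < 1, 1 <= n <= 32), packed into an integer with the
   first digit after the binary point as the most significant bit. */
uint32_t fraction_bits(double frac, int n)
{
    assert(frac >= 0.0 && frac < 1.0);
    assert(n >= 1 && n <= 32);

    uint32_t bits = 0;
    for (int i = 0; i < n; i++) {
        frac *= 2.0;              /* shift the next digit left of the point */
        bits <<= 1;
        if (frac >= 1.0) {        /* integer part is 1: emit a 1 bit */
            bits |= 1u;
            frac -= 1.0;
        }
    }
    return bits;
}
```

With this sketch, fraction_bits(1.0 / 3.0, 8) gives 0101 0101two, matching the 0.010101... pattern above, and fraction_bits(2.0 / 3.0, 23) gives the 23-bit Significand field of the normalized value 1.1010101...two.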
Types and Data
0011 0100 0101 0101 0100 0011 0100 0010
– 1.986 x 10^-7 (as a single-precision float)
– 878,003,010 (as an integer)
– “4UCB” (as ASCII characters)
– ori $s5, $v0, 17218 (as a MIPS instruction)
• Data can be anything; operation of instruction
that accesses operand determines its type!
• Power/danger of unrestricted addresses/pointers:
–Use ASCII as FP, instructions as data, integers as
instructions, ...
–Security holes in programs
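A short C sketch of the point that the operation, not the data, determines the type: the same 32-bit word is read three ways. The function names are hypothetical, the float view assumes IEEE 754 single precision, and the character view reads the most significant byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* View the word as a float by copying its bits (no numeric conversion). */
float word_as_float(uint32_t word)
{
    float f;
    assert(sizeof f == sizeof word);
    memcpy(&f, &word, sizeof f);
    return f;
}

/* View the word as a signed integer by copying its bits. */
int32_t word_as_int(uint32_t word)
{
    int32_t i;
    memcpy(&i, &word, sizeof i);
    return i;
}

/* View byte number index of the word as a character, counting from the
   most significant byte (index 0) to the least significant (index 3). */
char word_as_char(uint32_t word, int index)
{
    assert(index >= 0 && index <= 3);
    return (char)((word >> (24 - 8 * index)) & 0xFFu);
}
```

Applied to 0x34554342, the word from the slide, the three views give roughly 1.986 x 10^-7, the integer 878,003,010, and the characters '4', 'U', 'C', 'B'.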
IEEE 754 Special Values
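Number line (single precision), from most negative to most positive:
– Negative overflow: below -(1 - 2^-24) x 2^128
– Expressible negative numbers: -(1 - 2^-24) x 2^128 to -0.5 x 2^-127
– Negative underflow: between -0.5 x 2^-127 and 0
– Zero
– Positive underflow: between 0 and 0.5 x 2^-127
– Expressible positive numbers: 0.5 x 2^-127 to (1 - 2^-24) x 2^128
– Positive overflow: above (1 - 2^-24) x 2^128
Special Value | Exponent | Significand
+/- 0 | 0000 0000 | 0
Denormalized number | 0000 0000 | Nonzero
NaN | 1111 1111 | Nonzero
+/- infinity | 1111 1111 | 0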
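The special-value table translates directly into a classification routine. The enum fp_kind and the functions below are hypothetical names used only for this sketch, which assumes IEEE 754 single precision.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the rows of the special-value table. */
enum fp_kind { KIND_ZERO, KIND_DENORM, KIND_NORMAL, KIND_INFINITY, KIND_NAN };

/* Classify a single-precision bit pattern by its Exponent and
   Significand fields; the sign bit does not affect the class. */
enum fp_kind classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t significand = bits & 0x7FFFFFu;

    if (exponent == 0u)       /* 0000 0000: zero or denormalized */
        return significand == 0u ? KIND_ZERO : KIND_DENORM;
    if (exponent == 255u)     /* 1111 1111: infinity or NaN */
        return significand == 0u ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;       /* everything else has the implicit leading 1 */
}

/* Convenience wrapper: classify a float by copying out its bits. */
enum fp_kind classify_single(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return classify_bits(bits);
}
```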
Value: Not a Number
• What is the result of sqrt(-4.0) or 0/0?
– If infinity is not an error, these shouldn’t be either.
– Called Not a Number (NaN)
– Exponent = 255, Significand nonzero
• Applications
– NaNs help with debugging
– They contaminate: op(NaN, X) = NaN
–Don’t use NaN
– Ask math majors
Value: Denorms
• Problem: There’s a gap among representable FP numbers around 0
– Smallest pos num: a = 1.0...0two x 2^-126 = 2^-126
– 2nd smallest pos num: b = 1.0...01two x 2^-126 = 2^-126 + 2^-149
– a - 0 = 2^-126
– b - a = 2^-149
– Gap! The distance from 0 to a is far larger than the distance from a to b
• Solution:
– Denormalized numbers: no leading 1
– Smallest pos num: a = 2^-149
– 2nd smallest num: b = 2^-148
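One way to see how denormalized numbers close the gap is to step through adjacent bit patterns. The sketch below is illustrative only: next_above is a hypothetical helper, FLT_MIN is the standard name for the smallest normalized float (2^-126), and the code assumes IEEE 754 single precision with denormals enabled.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the next representable single above a non-negative,
   finite f. For such values, adding 1 to the bit pattern moves to the
   neighboring representable number. */
float next_above(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);

    /* Sign bit must be 0 and the exponent must not be 1111 1111. */
    assert((bits >> 31) == 0u && ((bits >> 23) & 0xFFu) != 255u);

    bits += 1u;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With denormals, next_above(0.0f) is a positive number smaller than FLT_MIN, and the step from 0 to it equals the step from FLT_MIN to next_above(FLT_MIN), so the spacing near 0 is uniform instead of leaving a gap.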
Rounding
• Math on real numbers => rounding
• Rounding also occurs when converting types
– Double <=> single precision <=> integer
• Round towards +infinity
– ALWAYS round “up”: 2.001 => 3; -2.001 => -2
• Round towards -infinity
– ALWAYS round “down”: 1.999 => 1; -1.999 => -2
• Truncate
– Just drop the last bits (round towards 0)
• Round to (nearest) even (default)
– 2.5 => 2; 3.5 => 4
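The four rounding modes can be sketched in C for the integer-valued examples on this slide. These helper names are hypothetical, and the sketch rounds to whole numbers only; FP hardware applies the same rules at the last significand bit.

```c
#include <assert.h>

/* Truncate: a C cast from double to long drops the fraction (toward 0).
   The range check keeps the value safely inside a 32-bit long. */
long round_toward_zero(double x)
{
    assert(x > -2.0e9 && x < 2.0e9);
    return (long)x;
}

/* Round towards +infinity: always round "up". */
long round_up(double x)
{
    long t = round_toward_zero(x);
    return (x > (double)t) ? t + 1 : t;
}

/* Round towards -infinity: always round "down". */
long round_down(double x)
{
    long t = round_toward_zero(x);
    return (x < (double)t) ? t - 1 : t;
}

/* Round to nearest; on an exact tie pick the even neighbor. */
long round_half_even(double x)
{
    long lo = round_down(x);
    double frac = x - (double)lo;        /* 0 <= frac < 1 */
    if (frac > 0.5) return lo + 1;
    if (frac < 0.5) return lo;
    return (lo % 2 == 0) ? lo : lo + 1;  /* tie: choose the even one */
}
```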
FP Fallacy
• FP Add, subtract associative: FALSE!
– x = -1.5 x 10^38, y = 1.5 x 10^38, and z = 1.0
– x + (y + z) = -1.5 x 10^38 + (1.5 x 10^38 + 1.0)
= -1.5 x 10^38 + (1.5 x 10^38) = 0.0
– (x + y) + z = (-1.5 x 10^38 + 1.5 x 10^38) + 1.0
= (0.0) + 1.0 = 1.0
• Floating Point add, subtract are not associative!
– Why? FP result approximates real result
– 1.5 x 10^38 is so much larger than 1.0 that 1.5 x 10^38 + 1.0 in floating point representation is still 1.5 x 10^38
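The example can be reproduced in C. The function names are hypothetical; the volatile temporaries are an assumption-driven precaution that forces each intermediate sum to be stored as a float.

```c
#include <assert.h>

/* x + (y + z): the inner sum is computed first. */
float add_right_first(float x, float y, float z)
{
    volatile float inner = y + z;
    return x + inner;
}

/* (x + y) + z: the left pair is computed first. */
float add_left_first(float x, float y, float z)
{
    volatile float inner = x + y;
    return inner + z;
}

/* The slide's values, checked with assert. */
void check_slide_example(void)
{
    float x = -1.5e38f, y = 1.5e38f, z = 1.0f;
    assert(add_right_first(x, y, z) == 0.0f);  /* the 1.0 is absorbed by y */
    assert(add_left_first(x, y, z) == 1.0f);   /* x and y cancel first */
}
```

The two groupings give different answers because y + z rounds back to y: 1.0 is far below the spacing between representable values near 1.5 x 10^38.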
Computational Errors with FP
FP Addition / Subtraction
• Much more difficult than with integers
• Can’t just add significands
• Algorithm
– De-normalize to match exponents
– Add (subtract) significands to get resulting one
– Keep the same exponent
– Normalize (possibly changing exponent)
• Note: If signs differ, just perform a subtract instead.
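The four steps can be sketched in C on raw bit patterns. This is a simplified illustration under stated assumptions, not how real hardware is built: fp_add_sketch is a hypothetical name, both operands must be positive and normalized, bits shifted out are truncated rather than rounded, and overflow is not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t single_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Add two positive, normalized singles using the slide's four steps. */
float fp_add_sketch(float a, float b)
{
    uint32_t ua = single_bits(a), ub = single_bits(b);
    uint32_t ea = (ua >> 23) & 0xFFu, eb = (ub >> 23) & 0xFFu;
    assert((ua >> 31) == 0u && (ub >> 31) == 0u);
    assert(ea > 0u && ea < 255u && eb > 0u && eb < 255u);

    /* 24-bit significands with the implicit leading 1 made explicit. */
    uint32_t sa = (ua & 0x7FFFFFu) | 0x800000u;
    uint32_t sb = (ub & 0x7FFFFFu) | 0x800000u;

    /* Step 1: de-normalize the smaller operand to match exponents. */
    if (ea < eb) {
        uint32_t t = ea; ea = eb; eb = t;
        t = sa; sa = sb; sb = t;
    }
    uint32_t shift = ea - eb;
    sb = (shift < 32u) ? (sb >> shift) : 0u;

    /* Steps 2 and 3: add significands, keep the larger exponent. */
    uint32_t sum = sa + sb;
    uint32_t e = ea;

    /* Step 4: normalize if the sum carried past the leading 1. */
    if (sum & 0x1000000u) {
        sum >>= 1;
        e += 1u;
    }
    assert(e < 255u);  /* overflow is outside this sketch */

    uint32_t bits = (e << 23) | (sum & 0x7FFFFFu);
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, adding 1.5 and 2.25 shifts the significand of 1.5 right by one place to match the exponent of 2.25, and the sum needs no renormalization.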
MIPS FP Architecture (1/2)
• Separate floating point instructions:
– Single Precision: add.s, sub.s, mul.s, div.s
– Double Precision: add.d, sub.d, mul.d, div.d
• These instructions are far more complicated than
their integer counterparts
• Problems:
– It’s inefficient to have different instructions take vastly
differing amounts of time.
– Generally, a particular piece of data will not change from
FP to int, or vice versa, within a program.
– Some programs do not do floating point calculations
– It takes lots of hardware relative to integers to do FP fast
MIPS FP Architecture (2/2)
• 1990 Solution: separate chip that handles only FP.
• Coprocessor 1: FP chip
– Contains 32 32-bit registers: $f0, $f1, …
– Most registers specified in .s and .d instructions refer to this set ($f)
– Separate load and store: lwc1 and swc1
(“load word coprocessor 1”, “store …”)
– Double Precision: by convention, even/odd pair contain
one DP FP number: $f0/$f1, … , $f30/$f31
• 1990 Computer contains multiple separate chips:
– Processor: handles all the normal stuff
– Coprocessor 1: handles FP and only FP;
• Move data between main processor and coprocessors:
– mfc0, mtc0, mfc1, mtc1, etc.
Floating Point Hardware (FP Add)
C => MIPS
float f2c (float fahr) {
    return ((5.0 / 9.0) * (fahr - 32.0));
}
f2c:
    lwc1  $f16, const5($gp)    # $f16 = 5.0
    lwc1  $f18, const9($gp)    # $f18 = 9.0
    div.s $f16, $f16, $f18     # $f16 = 5.0/9.0
    lwc1  $f20, const32($gp)   # $f20 = 32.0
    sub.s $f20, $f12, $f20     # $f20 = fahr - 32.0
    mul.s $f0, $f16, $f20      # $f0 = (5/9)*(fahr-32)
    jr    $ra                  # return
Conclusion
• Floating Point numbers approximate values that we
want to use.
• IEEE 754 Floating Point Standard is most widely
accepted attempt to standardize FP arithmetic
• MIPS architectural elements to support FP
– Registers ($f0-$f31)
– Single Precision (32 bits, 2 x 10^-38 ... 2 x 10^38)
• add.s, sub.s, mul.s, div.s
– Double Precision (64 bits, 2 x 10^-308 ... 2 x 10^308)
• add.d, sub.d, mul.d, div.d
• Type is not associated with data, bits have no
meaning unless given in context (e.g., int vs. float)