Lect 29a-IEEE Floating Point Units
IEEE Floating Point
Handling Denormalized Numbers
1/8/2007 - L24 IEEE Floating Point
Basics
Copyright 2006 - Joanne DeGroat, ECE, OSU
Lecture overview
Setting up denormalized numbers for operations with normalized numbers
The floating point standard
Single Precision
s (1 bit) | e (8 bits) | f (23 bits)
The value v of the bits stored in the representation is:
If e = 255 and f /= 0, then v is NaN regardless of s
If e = 255 and f = 0, then v = (-1)^s x infinity
If 0 < e < 255, then v = (-1)^s x 2^(e-127) x (1.f) - normalized number
If e = 0 and f /= 0, then v = (-1)^s x 2^(-126) x (0.f) - denormalized number, which allows for graceful underflow
If e = 0 and f = 0, then v = (-1)^s x 0 (zero)
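The five cases above can be sketched as a small decoder. This is an illustrative sketch only; the function name `decode_single` is an assumption made for this example and does not come from the lecture.

```python
import math

def decode_single(bits: int) -> float:
    """Decode a 32-bit single-precision pattern using the five cases listed above.

    decode_single is a hypothetical helper name used only for this sketch.
    """
    s = (bits >> 31) & 0x1       # sign bit
    e = (bits >> 23) & 0xFF      # 8-bit biased exponent
    f = bits & 0x7FFFFF          # 23-bit fraction
    sign = -1.0 if s else 1.0

    if e == 255:
        # f /= 0 gives NaN regardless of s; f = 0 gives signed infinity
        return math.nan if f != 0 else sign * math.inf
    if e == 0:
        # Denormalized (or zero when f = 0): hidden bit is 0, exponent is fixed at -126
        return sign * (f / 2**23) * 2.0**-126
    # Normalized: hidden bit is 1, exponent is e - 127
    return sign * (1 + f / 2**23) * 2.0**(e - 127)
```

For instance, `decode_single(0x00000001)` gives the smallest positive denormalized value, 2^(-149).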
Consider the example
A = 100
$42C8 0000
0100 0010 1100 1000 0000 0000 0000 0000
S=0
E = 1000 0101 = 133, so the exponent is 133 - 127 = 6
F = 1001 0000 ..., so ManA = 1.100100000
B = 25
$41C8 0000
0100 0001 1100 1000 0000 0000 0000 0000
S=0
E = 1000 0011 = 131, so the exponent is 131 - 127 = 4
F = 1001 0000 ..., so ManB = 1.100100000
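The field values for A and B can be checked by packing the numbers as single-precision floats and slicing out the bits. This is a minimal sketch using Python's standard `struct` module; the helper name `fields` is an assumption made for illustration.

```python
import struct

def fields(x: float):
    """Return (bits, s, e, f) for x stored as an IEEE single-precision value.

    fields is a hypothetical helper name used only for this sketch.
    """
    # Pack as big-endian float32, then reinterpret the same 4 bytes as an unsigned int
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31               # sign bit
    e = (bits >> 23) & 0xFF      # biased exponent
    f = bits & 0x7FFFFF          # 23-bit fraction
    return bits, s, e, f

for x in (100.0, 25.0):
    bits, s, e, f = fields(x)
    print(f"{x}: ${bits:08X}  S={s}  E={e} (exponent {e - 127})  F={f:023b}")
```

The printed hex patterns and exponents should agree with the values worked out by hand above.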
Example Continued
For A + B, the binary points must be aligned by shifting ManB right 2 places
ManA    = 1.10010000000
ShfManB = 0.01100100000 (aligned to an exponent of 6)
Sum     = 1.11110100000 with a binary exponent of 6
0100 0010 1111 1010 0000 0000 0000 0000 = $42FA 0000
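The align-then-add step can be sketched on the raw bit patterns. This is a simplified illustration, not a full floating point adder: it assumes both inputs are positive and normalized, truncates bits shifted out during alignment instead of rounding, and does not handle overflow to infinity. The function names are assumptions made for this example.

```python
def split(bits: int):
    """Return (biased exponent, 24-bit mantissa with the hidden 1 restored).

    Assumes a normalized input (0 < e < 255).
    """
    e = (bits >> 23) & 0xFF
    man = (bits & 0x7FFFFF) | (1 << 23)
    return e, man

def add_positive_normalized(a_bits: int, b_bits: int) -> int:
    """Add two positive normalized singles given as 32-bit patterns (hypothetical helper)."""
    ea, ma = split(a_bits)
    eb, mb = split(b_bits)
    if ea < eb:                      # make A the operand with the larger exponent
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= (ea - eb)                 # align B to A's exponent (truncates, no rounding)
    total = ma + mb
    e = ea
    if total >> 24:                  # carry out of the 24-bit mantissa: renormalize
        total >>= 1
        e += 1
    return (e << 23) | (total & 0x7FFFFF)   # drop the hidden 1 and repack

print(f"${add_positive_normalized(0x42C80000, 0x41C80000):08X}")
```

With A = $42C8 0000 and B = $41C8 0000, the mantissa of B is shifted right by 2 before the add, matching the hand calculation.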
Some basics
Consider 11001, or 25
This could be represented as 1.1001 x 2^4
or as 0.011001 x 2^6 to have the same value
Consider 0.00110000 as the fractional part of a denormalized number
This has value 0.00110000 x 2^(-126)
Or, with a shift by 1 position, 0.0110000 x 2^(-127)
(Note: this is a shift left of the digits past a fixed binary point)
Why do this
With the shift it has the same value, but the format becomes:
v = (-1)^s x 2^(-127) x (x.xxxx) = (-1)^s x 2^(e-127) x (x.xxxx)
where x.xxxx is the shifted fractional part and e here is 0.
This is the same format as a normalized number with e = 0, so the operation no longer cares that one of the inputs was denormalized.
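One way to picture this setup step is an unpack routine that returns the same (e, mantissa) form for both kinds of input. This is a sketch under stated assumptions: it covers only e < 255, the names `unpack_uniform` and `value` are hypothetical, and exact rational arithmetic from Python's `fractions` module is used just to check that the value is unchanged.

```python
from fractions import Fraction

def unpack_uniform(bits: int):
    """Return (e, man) so that the value is man/2^23 x 2^(e-127) for any e < 255.

    unpack_uniform is a hypothetical helper name used only for this sketch.
    """
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0:
        man = f << 1             # denormalized: 0.f shifted left one place gives x.xxxx
    else:
        man = (1 << 23) | f      # normalized: 1.f with the hidden 1 made explicit
    return e, man

def value(bits: int) -> Fraction:
    """Exact magnitude using the single formula 2^(e-127) x (x.xxxx); sign ignored."""
    e, man = unpack_uniform(bits)
    return Fraction(man, 2**23) * Fraction(2) ** (e - 127)
```

After this step, downstream logic can apply the same 2^(e-127) scaling to either kind of operand, which is the point made above.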
General Rule
Fixed binary point and shift digits
000.1000 to 001.0000
Shift left – subtract 1 from exponent
001.0000 to 000.1000
Shift right – add 1 to exponent
Fixed binary digits and move binary point
000.1000 to 0001.000
Move right – subtract 1 from exponent
0001.000 to 000.1000
Move left – add 1 to exponent
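The rules above can be checked numerically with exact fractions. This is an illustrative sketch; the helper `value_of` is an assumption made for this example and simply evaluates a binary string with a binary point, scaled by a power of two.

```python
from fractions import Fraction

def value_of(digits: str, exp: int) -> Fraction:
    """Exact value of a binary string such as '000.1000' multiplied by 2**exp.

    value_of is a hypothetical helper name used only for this sketch.
    """
    whole, frac = digits.split(".")
    # Read all digits as one integer, then divide by 2^(number of fractional digits)
    mantissa = Fraction(int(whole + frac, 2), 2 ** len(frac))
    return mantissa * Fraction(2) ** exp

# Fixed binary point, digits shifted left: subtract 1 from the exponent
assert value_of("000.1000", 0) == value_of("001.0000", -1)
# Fixed digits, binary point moved right: subtract 1 from the exponent
assert value_of("000.1000", 0) == value_of("0001.000", -1)
# The earlier example: 25 written two ways
assert value_of("1.1001", 4) == value_of("0.011001", 6) == 25
```

Each assertion pairs a shift or point move with the matching exponent adjustment, so the represented value stays the same.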