15-Floating Point
Number Representation
Fixed and Floating Point
• No Method Capable of Representing ALL Real Numbers Using Finite Register Lengths
• Must Use Approximations to Represent Values
• Concentrate on Two Forms:
– Fixed Point
– Floating Point
– Others are:
• Rational Number Systems – uses ratios of integers
• Logarithmic Number Systems – uses signs and logarithms of values
Fixed Versus Floating Point
• Fixed Point: Any Two Consecutive Values Differ by 1 unit in the last place (ulp)
– Equal Spacing Between Numbers
• Floating Point Values Use Two Multi-Bit Words
– Mantissa
– Exponent
• Both Forms Must be Capable of Representing Signed Quantities
• Fixed Point Values CAN be Used to Represent Fractional Quantities
Floating Point Characteristics
• Total Number of Representations = Total Bit Strings
– For n-bit Register we have $2^n$
• Range of Values is Larger than Fixed Point (for the Same Number of Bits)
• Precision of Value is Smaller
• Distance Between Two Consecutive Values Increases with Magnitude
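A rough illustration of the spacing claims above, assuming Python 3.9+ for `math.ulp`. The 8-fraction-bit fixed-point format (`FRACTION_BITS`) and the helper name `float_spacing` are hypothetical, chosen only for comparison:

```python
import math

# Hypothetical fixed-point format: 8 fractional bits, so the spacing is constant.
FRACTION_BITS = 8
fixed_ulp = 2.0 ** -FRACTION_BITS


def float_spacing(x: float) -> float:
    """Gap between x and the next larger double (math.ulp, Python 3.9+)."""
    return math.ulp(x)


for x in (1.0, 1024.0, 2.0 ** 52):
    # The fixed-point gap never changes; the floating-point gap grows with x.
    print(f"x = {x}: fixed ulp = {fixed_ulp}, float ulp = {float_spacing(x)}")
```

Near 1.0 the double is far finer than this fixed-point format, but at 2^52 consecutive doubles are a whole unit apart.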
Floating Point
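Fields (MSb to LSb): s | e | m
s – Sign Bit (signed magnitude)
e – Exponent (true exponent in 2’s complement; stored with BIAS added)
m – Mantissa (significand or fraction), m in [0,1), m_MAX = 1 - ulp
Value = $(-1)^s \times 1.m \times 2^{(e - BIAS)}$, where the leading 1 in 1.m is the hidden bit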
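float – BIAS = 127 (32 bits: 1 for s, 8 for e, 23 for m)
double – BIAS = 1023 (64 bits: 1 for s, 11 for e, 52 for m)
With a Bias of $2^{k-1}$ for a k-bit Exponent, the Sign of the Exponent is the Complement of its MSb
Thus, adding/subtracting such a bias is just complementation of the MSb; the IEEE biases (127, 1023) are $2^{k-1} - 1$, so an extra adjustment by 1 is needed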
Floating Point Example
double = 00000000 bfe80000
Little Endian – MSW has Higher Address
s: 1
e: 011 1111 1110
m: 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
s = 1; e = 1022; m = 0.5
Value = $(-1)^1 \times 1.5 \times 2^{(1022-1023)}$
Value = -(1.5)(0.5) = -0.75
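The field extraction in this example can be checked with a short sketch. The helper names `decode_double` and `field_value` are illustrative, not from the slides, and the formula only covers normalized values:

```python
import struct


def decode_double(bits: int):
    """Split a 64-bit pattern into sign, biased exponent, and fraction fields."""
    s = bits >> 63
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    return s, e, f


def field_value(s: int, e: int, f: int) -> float:
    """Evaluate (-1)^s * 1.m * 2^(e - 1023) for a normalized double."""
    m = f / 2.0 ** 52                 # fraction in [0, 1)
    return (-1) ** s * (1.0 + m) * 2.0 ** (e - 1023)


bits = 0xBFE8000000000000             # MSW bfe80000, LSW 00000000
s, e, f = decode_double(bits)
value = field_value(s, e, f)

# Cross-check against the machine's own interpretation of the same 8 bytes.
native = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
print(s, e, f / 2.0 ** 52, value, native)
```

Both `value` and `native` come out as -0.75, matching the hand calculation.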
Floating Point Normalization
• Redundant Representations are Possible!
$0.110101 \times 2^{101} = 0.0110101 \times 2^{110} = 0.00110101 \times 2^{111}$ (all digits binary)
• Hidden Bit Helps
• Out of All Possible Representations, Choose One With Fewest Leading Zeros in Significand
• This is Normalization
• After Performing Arithmetic, Renormalization May be Needed
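A minimal sketch of normalization on an explicit (significand, exponent) pair. It assumes a fraction of `width` bits with no hidden bit, and the function name `normalize` is hypothetical:

```python
def normalize(significand: int, width: int, exponent: int):
    """Shift a width-bit fraction left until its top bit is 1.

    The integer `significand` stands for the fraction significand / 2**width,
    so each left shift doubles the fraction and must decrement the exponent.
    """
    if significand == 0:
        return 0, exponent            # zero has no normalized form
    while significand < (1 << (width - 1)):
        significand <<= 1
        exponent -= 1
    return significand, exponent


# 0.00110101 (binary) * 2^7 has two leading zeros in the fraction.
sig, exp = normalize(0b00110101, 8, 7)
print(format(sig, "08b"), exp)
```

The result is the fraction 0.11010100 with exponent 5, the same value with no leading zeros.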
Floating Point Special Numbers
Value v when exponent e and fraction f are special values (IEEE standard)
Note: NaN = Not a Number
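As a sketch of how the special encodings are distinguished for IEEE doubles: an all-zeros exponent field means zero or a denormal, and an all-ones exponent field means infinity or NaN, depending on whether the fraction is zero. The function name `classify` is illustrative:

```python
import struct


def classify(x: float) -> str:
    """Classify a double by its exponent field e and fraction field f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    e = (bits >> 52) & 0x7FF
    f = bits & ((1 << 52) - 1)
    if e == 0:                        # all-zeros exponent
        return "zero" if f == 0 else "denormal"
    if e == 0x7FF:                    # all-ones exponent
        return "infinity" if f == 0 else "NaN"
    return "normal"


for x in (0.0, -0.0, 5e-324, 1.0, float("inf"), float("nan")):
    print(x, classify(x))
```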
IEEE/ANSI 754/854 Standard
Denormalized Numbers
• Allows for Gradual Degradation for Underflow
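Gradual underflow can be observed directly. The sketch below assumes IEEE doubles with the default round-to-nearest-even mode and denormals enabled, which is what standard CPython builds on common hardware provide. The helper name `halvings_until_zero` is hypothetical:

```python
import sys


def halvings_until_zero(x: float) -> int:
    """Count how many times x can be halved before it rounds to 0.0."""
    count = 0
    while x > 0.0:
        x /= 2.0
        count += 1
    return count


smallest_normal = sys.float_info.min                          # 2**-1022
smallest_denormal = smallest_normal * sys.float_info.epsilon  # 2**-1074

print(smallest_normal / 2.0 > 0.0)            # still representable, as a denormal
print(halvings_until_zero(smallest_normal))
```

Without denormals, the first halving below the smallest normal value would already produce zero. With them, precision is lost one bit at a time over the 52 fraction bits.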
Denormals
Operations – Internal Precision
Floating Point Addition/Subtraction
Floating Point Multiplication/Division
Conversions and Roundings
Exceptions
Rounding Schemes
Signed Magnitude
Two’s Complement
Round to Nearest (Signed Magnitude)
Rounding Comments
Round to Nearest Even/Odd
Round to Nearest Even
Round to Nearest Odd (R*)
Jamming/von Neumann Rounding
ROM Rounding
Rounding
Rounding Examples
Round Towards +∞
Downward Directed Rounding
Floating Point Operations
Adders/Subtractors
Operand Packing/Unpacking
Other Key Parts of FP Add/Sub Unit
Pre-Shifting
Four-stage Combinational Shifter
Pre-shifts Operand by 0 to 15 Bits
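One plausible reading of a four-stage shifter covering 0 to 15 bits is a logarithmic shifter whose stages shift by 1, 2, 4 and 8 bits, each enabled by one bit of the shift amount. The slide does not state the stage sizes, so that structure is an assumption, as are the name `pre_shift` and the sticky output:

```python
def pre_shift(significand: int, shift: int):
    """Right-shift through four conditional stages of 1, 2, 4, and 8 bits.

    Returns (shifted significand, sticky), where sticky is 1 if any
    nonzero bit was shifted out. Stage sizes are an assumed design.
    """
    if not 0 <= shift <= 15:
        raise ValueError("shift must be in 0..15")
    sticky = 0
    for stage in range(4):
        if (shift >> stage) & 1:      # this stage is enabled
            amount = 1 << stage
            lost = significand & ((1 << amount) - 1)
            sticky |= int(lost != 0)
            significand >>= amount
    return significand, sticky


print(pre_shift(0b10110001, 5))
```

Each stage is a row of 2-to-1 multiplexers in hardware, so four rows cover all sixteen shift amounts.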
Leading Zeros/Ones – Counting vs. Prediction
Leading Zeros Prediction
Guard Digits
What is the smallest number of extra digits needed for rounding? post-normalization?
• Multiplication – Double Length Result
• Add/Sub w/ differing exp. – Can have Double Length Result
• FP Unit Provides Single Length Result
Significand Ranges
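• Assume Significand $M \in (0, 1 - ulp]$
• Then Normalized M (radix 2) ranges as:
$M_{min} = \frac{1}{2}$
$M_{max} = 1 - ulp$
• Multiplication: $prod = M_1 \times M_2$
$\frac{1}{4} \le prod \le 1 - 2\,ulp + ulp^2 < 1$
• For postnormalization need at most one shift left to get:
$\frac{1}{2} \le prod < 1$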
Significand Ranges (cont)
• Division: $quot = M_1 / M_2$
$\frac{1}{2} < quot \le 2(1 - ulp) < 2$
• Need at most one shift right to get:
$\frac{1}{2} \le quot < 1$
• Conclusion:
– 1 Extra Digit Needed for Postnormalization
– 1 Extra Digit Needed for Round-to-Nearest
• 2 Extra Digits Needed
– G - guard
– R - round
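The one-shift claim for postnormalization can be checked exhaustively on a small format. The sketch assumes binary normalized fractions in [1/2, 1 - ulp] and a hypothetical 4-bit significand (`BITS`); exact rational arithmetic avoids any rounding in the check itself:

```python
from fractions import Fraction

BITS = 4  # hypothetical short significand so every case can be enumerated
HALF = Fraction(1, 2)

# All normalized fractions: 0.1000 through 0.1111 in binary.
normalized = [Fraction(k, 2 ** BITS) for k in range(2 ** (BITS - 1), 2 ** BITS)]


def left_shifts_for_product(m1: Fraction, m2: Fraction) -> int:
    """Left shifts needed to bring m1*m2 back into [1/2, 1)."""
    prod, shifts = m1 * m2, 0
    while prod < HALF:
        prod *= 2
        shifts += 1
    return shifts


def right_shifts_for_quotient(m1: Fraction, m2: Fraction) -> int:
    """Right shifts needed to bring m1/m2 back into [1/2, 1)."""
    quot, shifts = m1 / m2, 0
    while quot >= 1:
        quot /= 2
        shifts += 1
    return shifts


max_left = max(left_shifts_for_product(a, b) for a in normalized for b in normalized)
max_right = max(right_shifts_for_quotient(a, b) for a in normalized for b in normalized)
print(max_left, max_right)
```

Both maxima are 1, which is why a single guard digit is enough for postnormalization in this setting.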
“Sticky Bit” in std754
• Round-to-Nearest-Even Requires 1 Extra Bit
– The “sticky bit”, S
• Turns out to be Logical-OR of Other Additional Bits
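A minimal sketch of round-to-nearest-even driven by a round bit and a sticky bit. It assumes the significand has already been postnormalized, so the first discarded bit acts as the round bit and the sticky bit is the OR of everything below it. The function name and integer encoding are illustrative:

```python
def round_to_nearest_even(value: int, extra_bits: int) -> int:
    """Drop extra_bits low-order bits from value, rounding to nearest even.

    round_bit is the first discarded bit; sticky is the logical OR of all
    discarded bits below it.
    """
    if extra_bits < 1:
        raise ValueError("need at least one extra bit")
    kept = value >> extra_bits
    round_bit = (value >> (extra_bits - 1)) & 1
    sticky = int(value & ((1 << (extra_bits - 1)) - 1) != 0)
    # Round up when above the halfway point, or exactly halfway and kept is odd.
    if round_bit and (sticky or (kept & 1)):
        kept += 1                     # a carry out here would need renormalization
    return kept


for v in (0b1010011, 0b1010100, 0b1010101, 0b1011100):
    print(format(v, "07b"), "->", format(round_to_nearest_even(v, 3), "04b"))
```

The sticky bit is what separates an exact tie (round to even) from a value just above the tie (always round up).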
Floating Point Multiplier
Floating Point Divider