Floating Point Arithmetic
Ellen Spertus
MCS 111
October 11, 2001
Decimal addition (1)
• Problem: 9.999×10^1 + 1.610×10^-1
• Estimate answer:
2
Decimal addition (2)
• Problem: 9.999×10^1 + 1.610×10^-1
• Calculate answer:
   9.999×10^1
 + 1.610×10^-1
3
Decimal addition (3)
• Problem: 9.999×10^1 + 1.610×10^-1
• How should we add them?
4
Floating point addition
• Adjust numbers to have same exponent
• Add the significands
• Normalize the sum
5
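The three steps above can be sketched in Python for the decimal problem from the earlier slides. This is only an illustration: the function name add_decimal_fp, the choice of four significant digits, and round-to-nearest (ties to even) for the final result are assumptions, not something the slides specify.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def add_decimal_fp(sig_a, exp_a, sig_b, exp_b, digits=4):
    """Add sig_a x 10^exp_a and sig_b x 10^exp_b, keeping `digits` significant digits."""
    # Step 1: rewrite both numbers with the larger exponent.
    exp = max(exp_a, exp_b)
    a = Decimal(sig_a).scaleb(exp_a - exp)
    b = Decimal(sig_b).scaleb(exp_b - exp)
    # Step 2: add the significands.
    total = a + b
    # Step 3: normalize so that 1 <= |significand| < 10.
    while abs(total) >= 10:
        total = total.scaleb(-1)
        exp += 1
    while total != 0 and abs(total) < 1:
        total = total.scaleb(1)
        exp -= 1
    # Assumption: round to nearest (ties to even) at `digits` significant digits.
    quantum = Decimal(1).scaleb(1 - digits)
    total = total.quantize(quantum, rounding=ROUND_HALF_EVEN)
    if abs(total) >= 10:  # rounding can carry, e.g. 9.9996 -> 10.000
        total = total.scaleb(-1).quantize(quantum)
        exp += 1
    return total, exp

significand, exponent = add_decimal_fp("9.999", 1, "1.610", -1)
print(f"{significand} x 10^{exponent}")  # 1.002 x 10^2
```

The exact sum is 100.151, so keeping four significant digits gives 1.002×10^2.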
Binary addition (1)
• Problem: 1.01×2^2 + 1.101×2^-1
• Adjust numbers to have same exponent:
• Add the significands
• Normalize the sum
6
Binary addition (2)
• Problem: 1.11×2^1 + 1.01×2^3
• Adjust numbers to have same exponent:
• Add the significands
• Normalize the sum
7
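The two binary problems above can be checked with exact rational arithmetic. The sketch below follows the same align, add, normalize steps; the helper names parse_binary, to_binary and add_binary_fp are made up for this example, and no rounding is applied, so the sum keeps every bit.

```python
from fractions import Fraction

def parse_binary(text):
    """Convert a binary literal such as '1.101' to an exact Fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole or "0", 2))
    if frac:
        value += Fraction(int(frac, 2), 2 ** len(frac))
    return value

def to_binary(value, max_bits=16):
    """Render a non-negative Fraction as a binary string (at most max_bits fraction bits)."""
    whole = int(value)
    frac = value - whole
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)
        bits.append(str(bit))
        frac -= bit
    return format(whole, "b") + ("." + "".join(bits) if bits else "")

def add_binary_fp(sig_a, exp_a, sig_b, exp_b):
    """Add sig_a x 2^exp_a and sig_b x 2^exp_b (positive operands only)."""
    # Adjust both numbers to the larger exponent.
    exp = max(exp_a, exp_b)
    a = parse_binary(sig_a) / 2 ** (exp - exp_a)
    b = parse_binary(sig_b) / 2 ** (exp - exp_b)
    # Add the significands.
    total = a + b
    # Normalize so that 1 <= significand < 2.
    while total >= 2:
        total /= 2
        exp += 1
    while 0 < total < 1:
        total *= 2
        exp -= 1
    return total, exp

for problem in (("1.01", 2, "1.101", -1), ("1.11", 1, "1.01", 3)):
    sig, exp = add_binary_fp(*problem)
    print(f"{to_binary(sig)} x 2^{exp}")
# 1.011101 x 2^2
# 1.1011 x 2^3
```

In decimal these are 5.8125 and 13.5. Neither sum happens to need a normalization shift, but the first one needs six fraction bits, which is more than a short significand field can hold.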
8-bit floating-point format (2)
• Exponent (3 bits) is biased by 3
• The leading one of significand is implicit
• Zero is represented by all zeros
sign    exponent  significand  number (base 2)  number (base 10)
1 bit   3 bits    4 bits
0       100       0000
0       000       1000
8
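One way to check entries for the table above is a small decoder for the 8-bit format. This is a sketch under the three rules stated on the slide (bias of 3, implicit leading one, all zeros means zero). The function name decode8 is hypothetical, and treating every other bit pattern as an ordinary number (no infinities, NaNs or denormals) is an assumption.

```python
from fractions import Fraction

BIAS = 3  # exponent bias from the slide

def decode8(bits):
    """Decode 'S EEE FFFF' (spaces optional) into an exact Fraction."""
    bits = bits.replace(" ", "")
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected 8 binary digits")
    if bits == "00000000":
        return Fraction(0)  # zero is represented by all zeros
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:4], 2) - BIAS
    significand = 1 + Fraction(int(bits[4:], 2), 16)  # implicit leading one
    return sign * significand * Fraction(2) ** exponent

for pattern in ("0 100 0000", "0 000 1000"):
    print(pattern, "->", float(decode8(pattern)))
# 0 100 0000 -> 2.0
# 0 000 1000 -> 0.1875
```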
Practice
Add two numbers from previous slide
sign    exponent  significand  number (base 2)  number (base 10)
1 bit   3 bits    4 bits
0       100       0000
0       000       1000
9
Problem
10
Rounding (1)
• Round 1.00011 to have one fewer digit
• Modes
– Always round up (IRS)
– Always round down
– Truncate
– Round to nearest even
11
Rounding (2)
• Round -1.00011 to have one fewer digit
• Modes
– Always round up (IRS)
– Always round down
– Truncate
– Round to nearest even
12
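The four modes can be compared on the two values from the rounding slides. In this sketch "round up" is read as rounding toward positive infinity and "round down" as rounding toward negative infinity; that reading, and the function name round_binary, are assumptions. Values are held as exact fractions, so 1.00011 in binary is 35/32.

```python
import math
from fractions import Fraction

def round_binary(value, frac_bits, mode):
    """Round an exact Fraction to `frac_bits` binary fraction digits."""
    scaled = value * 2 ** frac_bits
    if mode == "up":              # toward +infinity (assumed meaning of "round up")
        n = math.ceil(scaled)
    elif mode == "down":          # toward -infinity
        n = math.floor(scaled)
    elif mode == "truncate":      # toward zero
        n = math.trunc(scaled)
    elif mode == "nearest_even":  # round() on a Fraction breaks ties toward even
        n = round(scaled)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return Fraction(n, 2 ** frac_bits)

x = Fraction(0b100011, 2 ** 5)  # 1.00011 in binary
for value in (x, -x):
    for mode in ("up", "down", "truncate", "nearest_even"):
        n = round_binary(value, 4, mode) * 16
        sign = "-" if n < 0 else ""
        digits = format(abs(int(n)), "05b")
        print(f"{mode:13s} {sign}{digits[0]}.{digits[1:]}")
```

For 1.00011 this prints 1.0010, 1.0001, 1.0001, 1.0010. For -1.00011 it prints -1.0001, -1.0010, -1.0001, -1.0010. Truncation matches "round down" for the positive value and "round up" for the negative one, and both values sit exactly halfway between two candidates, so round to nearest even picks the one whose last bit is 0.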
Ensuring accurate results
• Our significands are 4 bits wide.
• We use 6 bits when adding two significands.
– Guard bit
– Round bit
• Purpose: Accurate rounding
13
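To see why the two extra bits matter, the sketch below adds 1.0000×2^1 and 1.1100×2^-3 twice: once keeping only 4 fraction bits while aligning, and once keeping 6 (guard and round bits) and rounding at the end. The operands, the function names, and the use of round to nearest even for the final step are assumptions chosen for illustration. Significands are stored as integers with the leading one written out, so 1.0000 is 0b10000.

```python
def round_nearest_even(value, drop):
    """Drop the low `drop` bits of a non-negative integer, rounding ties to even."""
    if drop == 0:
        return value
    quotient, remainder = divmod(value, 1 << drop)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and quotient % 2 == 1):
        quotient += 1
    return quotient

def add_significands(sig_a, exp_a, sig_b, exp_b, extra_bits):
    """Add two positive numbers, keeping `extra_bits` extra fraction bits in the adder.

    Assumes exp_a >= exp_b and that the sum needs no renormalization.
    """
    a = sig_a << extra_bits
    # Align b to a's exponent; bits shifted past the extra bits are lost.
    b = (sig_b << extra_bits) >> (exp_a - exp_b)
    return round_nearest_even(a + b, extra_bits), exp_a

no_extra, _ = add_significands(0b10000, 1, 0b11100, -3, extra_bits=0)
with_extra, _ = add_significands(0b10000, 1, 0b11100, -3, extra_bits=2)
print(format(no_extra, "05b"), format(with_extra, "05b"))  # 10001 10010
```

The exact sum is 2.21875. Without the extra bits the result is 1.0001×2^1 = 2.125, while with guard and round bits it is 1.0010×2^1 = 2.25, which is the closer of the two representable values.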
Adding large numbers
• What if we add 1.1111×2^4 + 1.1111×2^4?
14
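A quick check of the question above, assuming the 8-bit format from the earlier slide (3 exponent bits, bias 3) and assuming every exponent pattern encodes an ordinary exponent, so the largest one is 7 - 3 = 4. The function name add_and_check is made up for this sketch.

```python
from fractions import Fraction

EXPONENT_BITS = 3
BIAS = 3
MAX_EXPONENT = (1 << EXPONENT_BITS) - 1 - BIAS  # 4, assuming no reserved patterns

def add_and_check(sig_a, sig_b, exp):
    """Add two positive numbers that share an exponent; report whether the result overflows."""
    total = sig_a + sig_b
    # Normalize so that 1 <= significand < 2.
    while total >= 2:
        total /= 2
        exp += 1
    return total, exp, exp > MAX_EXPONENT

sig = Fraction(0b11111, 16)  # 1.1111 in binary
print(add_and_check(sig, sig, 4))  # (Fraction(31, 16), 5, True)
```

The sum normalizes to 1.1111×2^5, and an exponent of 5 does not fit in the 3-bit biased field, so the addition overflows.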
How can we get underflow?
15
Associativity of arithmetic
• (x+y)+z = x+(y+z)
• When is this true?
16
Breakdown of associativity
• Values
– x = 1.0000
– y = 0.00001
– z = 0.00001
• (x+y)+z
• x+(y+z)
Assume rounding by truncation.
17
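The breakdown can be reproduced with exact fractions and a truncating helper. This sketch assumes a significand with 4 fraction bits (as in the 8-bit format), positive values only, and an unlimited exponent range; the name fl is just shorthand for "store as floating point".

```python
import math
from fractions import Fraction

FRACTION_BITS = 4

def fl(value):
    """Truncate a positive Fraction to a significand with 4 fraction bits."""
    if value == 0:
        return value
    sig, exp = value, 0
    # Normalize so that 1 <= sig < 2.
    while sig >= 2:
        sig /= 2
        exp += 1
    while sig < 1:
        sig *= 2
        exp -= 1
    # Truncate the significand, then scale back.
    scale = 2 ** FRACTION_BITS
    sig = Fraction(math.floor(sig * scale), scale)
    return sig * Fraction(2) ** exp

x = Fraction(1)      # 1.0000
y = Fraction(1, 32)  # 0.00001 in binary
z = Fraction(1, 32)  # 0.00001 in binary

left = fl(fl(x + y) + z)   # (x+y)+z
right = fl(x + fl(y + z))  # x+(y+z)
print(left, right)  # 1 17/16
```

In (x+y)+z each small term is truncated away on its own, leaving 1.0000. In x+(y+z) the two small terms first combine into 0.0001, which survives the addition and gives 1.0001.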
MIPS floating point
• 32 floating-point registers (32 bits each)
• Instructions
– Addition: add.s, add.d
– Subtraction: sub.s, sub.d
– Multiplication: mul.s, mul.d
– Division: div.s, div.d
– Comparison: c.x.s and c.x.d where x is:
eq, neq, lt, le, gt, ge
– Conditional branch: bc1t, bc1f
18
Summary
• Computers aren’t limited to integers
• Floating-point arithmetic is quirky
– Loss of precision due to rounding
– Underflow
– Overflow
• Big picture: Floating point arithmetic can
be implemented with enough
______________________.
19