
Computer Architecture
Lecture Notes
Spring 2005
Dr. Michael P. Frank
Competency Area 4:
Computer Arithmetic
Introduction
• In previous chapters we’ve discussed:
—Performance (execution time, clock cycles, instructions, MIPS, etc.)
—Abstractions:
  Instruction Set Architecture
  Assembly Language and Machine Language
• In this chapter:
—Implementing the Architecture:
– How does the hardware really add, subtract, multiply and divide?
– Signed and unsigned representations
– Constructing an ALU (Arithmetic Logic Unit)
Introduction
• Humans naturally represent numbers in base 10; computers, however, work in base 2.
Example:
(1111 1111)2 = –1 (signed representation)
(1111 1111)2 = 255 (unsigned representation)
Note: Signed representations include sign-magnitude, two’s complement, and one’s complement.
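The same bit pattern yielding two different values can be checked with a short Python sketch (an editorial illustration, not part of the original slides):

```python
# One 8-bit pattern, two interpretations: unsigned vs. two's-complement signed.
def as_unsigned(bits: int, width: int = 8) -> int:
    return bits & ((1 << width) - 1)

def as_signed(bits: int, width: int = 8) -> int:
    value = bits & ((1 << width) - 1)
    # If the sign bit is set, subtract 2^width to recover the negative value.
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value

pattern = 0b1111_1111
print(as_unsigned(pattern))  # 255
print(as_signed(pattern))    # -1
```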
Possible Representations
• Sign Magnitude (first bit is sign bit, others magnitude):
000 = +0   001 = +1   010 = +2   011 = +3
100 = –0   101 = –1   110 = –2   111 = –3
• One’s Complement (first bit is sign bit; invert other bits for magnitude):
000 = +0   001 = +1   010 = +2   011 = +3
100 = –3   101 = –2   110 = –1   111 = –0
• Two’s Complement (negation: invert bits and add 1):
000 = +0   001 = +1   010 = +2   011 = +3
100 = –4   101 = –3   110 = –2   111 = –1
• NOTE: Computers today use two’s complement binary representation for signed numbers.
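The three 3-bit tables above can be regenerated with a small Python sketch (an added illustration, not from the slides; note that –0 prints as +0 in Python):

```python
# Decode every 3-bit pattern under the three signed representations.
def sign_magnitude(bits, width=3):
    sign = -1 if bits >> (width - 1) else 1
    return sign * (bits & ((1 << (width - 1)) - 1))

def ones_complement(bits, width=3):
    if bits >> (width - 1):                       # negative: invert the other bits
        return -((~bits) & ((1 << (width - 1)) - 1))
    return bits

def twos_complement(bits, width=3):
    return bits - (1 << width) if bits >> (width - 1) else bits

for b in range(8):
    print(f"{b:03b}: SM={sign_magnitude(b):+d}  1C={ones_complement(b):+d}  2C={twos_complement(b):+d}")
```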
Two’s Complement Representations
• 32-bit signed numbers (MIPS):
0000 0000 0000 0000 0000 0000 0000 0000two = 0ten
0000 0000 0000 0000 0000 0000 0000 0001two = +1ten
0000 0000 0000 0000 0000 0000 0000 0010two = +2ten
...
0111 1111 1111 1111 1111 1111 1111 1110two = +2,147,483,646ten
0111 1111 1111 1111 1111 1111 1111 1111two = +2,147,483,647ten (maxint)
1000 0000 0000 0000 0000 0000 0000 0000two = –2,147,483,648ten (minint)
1000 0000 0000 0000 0000 0000 0000 0001two = –2,147,483,647ten
1000 0000 0000 0000 0000 0000 0000 0010two = –2,147,483,646ten
...
1111 1111 1111 1111 1111 1111 1111 1101two = –3ten
1111 1111 1111 1111 1111 1111 1111 1110two = –2ten
1111 1111 1111 1111 1111 1111 1111 1111two = –1ten
• The hardware need only test the first bit to determine the sign.
Two’s Complement Operations
• Negating a two's complement number:
– invert all bits and add 1
– Or, preserve the rightmost 1 and the 0’s to its right, and flip all bits to the left of that rightmost 1
• Converting n-bit numbers into m-bit numbers with m > n:
Example: Convert a 4-bit signed number into an 8-bit number.
0010 → 0000 0010 (+2ten)
1010 → 1111 1010 (–6ten)
— “Sign extension” is used: the most significant bit is copied into the upper (leftmost) bits of the new word. For unsigned numbers, the leftmost bits are filled with 0’s.
—Example instructions: lbu/lb, slt/sltu, etc.
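Sign extension can be sketched in a few lines of Python (an added illustration, not from the slides):

```python
# Widen a signed value from `from_width` bits to `to_width` bits by
# copying its sign bit into the new upper bits.
def sign_extend(value, from_width, to_width):
    mask_from = (1 << from_width) - 1
    value &= mask_from
    if value >> (from_width - 1):                     # sign bit set:
        value |= ((1 << to_width) - 1) & ~mask_from   # fill upper bits with 1s
    return value

print(f"{sign_extend(0b0010, 4, 8):08b}")  # 00000010 (+2)
print(f"{sign_extend(0b1010, 4, 8):08b}")  # 11111010 (-6)
```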
Addition and Subtraction
• Just like in grade school (carry/borrow 1s):
    0111        0111        0110
  + 0110      - 0110      - 0101
• Two's complement operations are easy:
— subtraction using addition of the negative number:
    0111
  + 1010
• Overflow (result too large for finite computer word):
— e.g., adding two n-bit numbers does not yield an n-bit number:
    0111
  + 0001
    1000
Note that the term “overflow” is somewhat misleading; it does not mean a carry “overflowed.”
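Signed overflow can be detected by comparing operand and result signs, as in this Python sketch (an added illustration, not from the slides):

```python
# n-bit two's-complement addition with overflow detection: overflow occurs
# when both operands share a sign but the result's sign differs.
def add_with_overflow(a, b, width=4):
    mask = (1 << width) - 1
    result = (a + b) & mask
    sa = (a >> (width - 1)) & 1
    sb = (b >> (width - 1)) & 1
    sr = (result >> (width - 1)) & 1
    overflow = (sa == sb) and (sr != sa)
    return result, overflow

print(add_with_overflow(0b0111, 0b0001))  # (8, True): 7 + 1 overflows 4-bit signed
```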
32-bit ALU with Zero Detect:
* Recall that given the following control lines, we get these functions:
000 = and
001 = or
010 = add
110 = subtract
111 = slt
* We’ve learned how to build each of these functions in hardware.
[Figure: 32 one-bit ALUs (ALU0–ALU31) chained through CarryIn/CarryOut, with Bnegate and Operation control inputs, Less/Set wiring for slt, a Zero output computed over Result0–Result31, and an Overflow output from ALU31.]
So far…
• We’ve studied how to implement a 1-bit ALU in hardware that supports the MIPS instruction set:
— key idea: use a multiplexor to select the desired output function
— we can efficiently perform subtraction using two’s complement
— we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important issues about hardware:
— all of the gates are always working
— the speed of a gate is affected by the number of inputs to the gate
— the speed of a circuit is affected by the number of gates in series (on the “critical path” or the “deepest level of logic”)
• Changes in hardware organization can improve performance
— we’ll look at examples for addition (carry-lookahead adder), multiplication, and division
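The replicated 1-bit-slice design can be simulated with a short Python sketch (an editorial addition, not from the slides; overflow handling for slt is omitted for brevity):

```python
# Ripple ALU built from 1-bit slices, mirroring the control encodings
# above (000=and, 001=or, 010=add, 110=subtract, 111=slt).
def alu(a, b, control, width=32):
    """control is a 3-bit string: Bnegate bit + 2-bit operation select."""
    bnegate = control[0] == '1'
    op = control[1:]
    mask = (1 << width) - 1
    carry = 1 if bnegate else 0          # carry-in 1 completes the +1 of negation
    result = 0
    for i in range(width):               # replicate the 1-bit ALU `width` times
        ai = (a >> i) & 1
        bi = ((b >> i) & 1) ^ (1 if bnegate else 0)   # Bnegate inverts b
        s = ai ^ bi ^ carry              # sum bit of the full adder
        if op == '00':
            bit = ai & bi
        elif op == '01':
            bit = ai | bi
        else:                            # '10' (add/sub) and '11' (slt) use the adder
            bit = s
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        result |= bit << i
    if control == '111':                 # slt: route the sign of a-b into bit 0
        result = (result >> (width - 1)) & 1
    return result & mask
```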
Better adder design
• For adder design:
— Problem: the ripple-carry adder is slow due to sequential evaluation of the carry-in/carry-out bits
• Consider the carry-in inputs:
cin1 = (a0 · cin0) + (b0 · cin0) + (a0 · b0) = cout0
cin2 = (a1 · cin1) + (b1 · cin1) + (a1 · b1) = cout1
cin3 = (a2 · cin2) + (b2 · cin2) + (a2 · b2) = cout2
• Using substitution, we can see the “ripple” effect:
cin1 = (a0 · cin0) + (b0 · cin0) + (a0 · b0)
then,
cin2 = a1 · [(a0 · cin0) + (b0 · cin0) + (a0 · b0)] + b1 · [(a0 · cin0) + (b0 · cin0) + (a0 · b0)] + (a1 · b1)
Carry-Lookahead Adder
• Faster carry schemes exist that improve the speed of adders in hardware and reduce complexity in equations, namely the carry-lookahead adder.
• Let cini represent the ith carry-in bit; then:
cini+1 = (ai · cini) + (bi · cini) + (ai · bi)
cini+1 = (ai · bi) + cini · (ai + bi)
• We can now define the terms generate and propagate:
generate: gi = ai · bi
propagate: pi = ai + bi
• Then, cini+1 = gi + (pi · cini)
Carry-Lookahead Adder
• Suppose gi is 1. The adder generates a carry-out independent of the value of the carry-in, i.e. (couti = cini+1):
cini+1 = gi + (pi · cini) = 1 + (pi · cini) = 1
• Now suppose gi is 0 and pi is 1:
cini+1 = gi + (pi · cini) = 0 + (1 · cini) = cini
• The adder propagates a carry-in to a carry-out. In summary, couti is 1 if either gi is 1 or both pi and cini are 1.
• This new approach creates the first level of abstraction:
cin1 = g0 + (p0 · cin0)
cin2 = g1 + (p1 · g0) + (p1 · p0 · cin0)
cin3 = g2 + (p2 · g1) + (p2 · p1 · g0) + (p2 · p1 · p0 · cin0)
cin4 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0) + (p3 · p2 · p1 · p0 · cin0)
Carry-Lookahead Adder
• Sometimes the first level of abstraction produces large equations. It is then beneficial to look at a second level of abstraction, produced by dividing the adder into 4-bit groups and computing propagate and generate signals at a higher level:
P0 = p3 · p2 · p1 · p0
P1 = p7 · p6 · p5 · p4
P2 = p11 · p10 · p9 · p8
P3 = p15 · p14 · p13 · p12
• We’re representing a 16-bit adder, with a “super” propagate signal and a “super” generate signal.
• So Pi is true only if each of the bits in the group propagates a carry.
Carry-Lookahead Adder
• A “super” generate signal is true only if the group produces a carry out of its most significant bit:
G0 = g3 + (p3 · g2) + (p3 · p2 · g1) + (p3 · p2 · p1 · g0)
G1 = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
G2 = g11 + (p11 · g10) + (p11 · p10 · g9) + (p11 · p10 · p9 · g8)
G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)
• Now we can represent the carry-out signals for the 16-bit adder with two levels of abstraction as:
Cin1 = G0 + (P0 · cin0)
Cin2 = G1 + (P1 · G0) + (P1 · P0 · cin0)
Cin3 = G2 + (P2 · G1) + (P2 · P1 · G0) + (P2 · P1 · P0 · cin0)
Cin4 = G3 + (P3 · G2) + (P3 · P2 · G1) + (P3 · P2 · P1 · G0) + (P3 · P2 · P1 · P0 · cin0)
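The two-level computation above can be checked directly in Python (an added sketch, not from the slides):

```python
# Two-level carry-lookahead for a 16-bit adder: per-bit g/p signals,
# "super" P/G signals over 4-bit groups, then the group carries C1-C4.
def cla16(a, b, cin0=0):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(16)]   # generate
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(16)]   # propagate
    P = [p[4*j] & p[4*j+1] & p[4*j+2] & p[4*j+3] for j in range(4)]
    G = []
    for j in range(4):
        g0, g1, g2, g3 = g[4*j:4*j+4]
        p1, p2, p3 = p[4*j+1], p[4*j+2], p[4*j+3]
        G.append(g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0))
    C1 = G[0] | (P[0] & cin0)
    C2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & cin0)
    C3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & cin0)
    C4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & cin0))
    return C1, C2, C3, C4
```

C4 equals the carry out of the full 16-bit addition, computed without any rippling through the lower bits.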
Carry-Lookahead Adder Design: 2nd Level of Abstraction
[Figure: a 16-bit adder built from four 4-bit ALUs (ALU0–ALU3) on inputs a0–a15 and b0–b15, producing Result0–3, Result4–7, Result8–11, and Result12–15. Each ALU exports its group propagate (P0–P3) and generate (G0–G3) signals to a carry-lookahead unit, which produces the group carries C1–C4 and the final CarryOut.]
O(log n)-time carry-skip adder
• With this structure, we can do a 2n-bit add in 2(n+1) logic stages.
• Hardware overhead is <2× regular ripple-carry.
[Figure: an 8-bit segment of the carry-skip structure, showing full-adder cells (inputs A, B, Cin; outputs S, Cout) with generate/propagate signals (G, P) combined across least-significant (Gls, Pls) and most-significant (Pms) halves; the 2nd, 3rd, and 4th carry ticks are highlighted.]
Multiplication Algorithms
• Recall that multiplication is accomplished via shifting and addition.
• Example:
      0010   (multiplicand)
    x 0110   (multiplier)
      ----
      0000   (multiply by LSB of multiplier)
   + 0010    (shift multiplicand left 1 bit)
     -----
     00100   (intermediate product)
  + 0010     (shift multiplicand left 2 bits)
   -------
   0001100   (product)
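The shift-and-add procedure above is a few lines of Python (an added illustration, not from the slides):

```python
# Shift-and-add multiplication: for each 1 bit of the multiplier,
# add a correspondingly shifted copy of the multiplicand.
def multiply(multiplicand, multiplier, width=4):
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:
            product += multiplicand << i
    return product

print(f"{multiply(0b0010, 0b0110):07b}")  # 0001100 (2 x 6 = 12)
```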
Multiplication Algorithm 1
Hardware implementation of Algorithm 1:
[Figure: a 64-bit Multiplicand register (shift left), a 32-bit Multiplier register (shift right), a 64-bit ALU, a 64-bit Product register (write), and a control test unit.]
Multiplication Algorithm 1
Start
For each bit:
1. Test Multiplier0.
   — If Multiplier0 = 1: 1a. Add multiplicand to product and place the result in the Product register.
   — If Multiplier0 = 0: skip to step 2.
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition? No (< 32 repetitions): repeat. Yes (32 repetitions): Done.
Multiplication Algorithm 1
Example: (4-bit) 0010two x 0011two = 0000 0110two

Iteration | Step                        | Multiplier | Multiplicand | Product
0         | Initial steps               | 0011       | 0000 0010    | 0000 0000
1         | 1a: LSB of multiplier = 1   | 0011       | 0000 0010    | 0000 0010
          | 2: shift Mcand left         | 0011       | 0000 0100    | 0000 0010
          | 3: shift Multiplier right   | 0001       | 0000 0100    | 0000 0010
2         | 1a: LSB of multiplier = 1   | 0001       | 0000 0100    | 0000 0110
          | 2: shift Mcand left         | 0001       | 0000 1000    | 0000 0110
          | 3: shift Multiplier right   | 0000       | 0000 1000    | 0000 0110
3         | 1: LSB of multiplier = 0    | 0000       | 0000 1000    | 0000 0110
          | 2: shift Mcand left         | 0000       | 0001 0000    | 0000 0110
          | 3: shift Multiplier right   | 0000       | 0001 0000    | 0000 0110
4         | 1: LSB of multiplier = 0    | 0000       | 0001 0000    | 0000 0110
          | 2: shift Mcand left         | 0000       | 0010 0000    | 0000 0110
          | 3: shift Multiplier right   | 0000       | 0010 0000    | 0000 0110
Multiplication Algorithms
• For Algorithm 1 we initialize the left half of the multiplicand register to 0 to accommodate its left shifts. All adds are 64 bits wide. This is wasteful and slow.
• Algorithm 2: instead of shifting the multiplicand left, shift the product register to the right => halves the widths of the ALU and the multiplicand register.
[Figure: a 32-bit Multiplicand register, a 32-bit Multiplier register (shift right), a 32-bit ALU, a 64-bit Product register (shift right, write), and a control test unit.]
Multiplication Algorithm 2
Start
For each bit:
1. Test Multiplier0.
   — If Multiplier0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register.
   — If Multiplier0 = 0: skip to step 2.
2. Shift the Product register right 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition? No (< 32 repetitions): repeat. Yes (32 repetitions): Done.
Multiplication Algorithm 2
Example: (4-bit) 0010two x 0011two = 0000 0110two

Iteration | Step                        | Multiplier | Multiplicand | Product
0         | Initial steps               | 0011       | 0010         | 0000 0000
1         | 1a: LSB of multiplier = 1   | 0011       | 0010         | 0010 0000
          | 2: shift Product right      | 0011       | 0010         | 0001 0000
          | 3: shift Multiplier right   | 0001       | 0010         | 0001 0000
2         | 1a: LSB of multiplier = 1   | 0001       | 0010         | 0011 0000
          | 2: shift Product right      | 0001       | 0010         | 0001 1000
          | 3: shift Multiplier right   | 0000       | 0010         | 0001 1000
3         | 1: LSB of multiplier = 0    | 0000       | 0010         | 0001 1000
          | 2: shift Product right      | 0000       | 0010         | 0000 1100
          | 3: shift Multiplier right   | 0000       | 0010         | 0000 1100
4         | 1: LSB of multiplier = 0    | 0000       | 0010         | 0000 1100
          | 2: shift Product right      | 0000       | 0010         | 0000 0110
          | 3: shift Multiplier right   | 0000       | 0010         | 0000 0110
Multiplication Algorithm 3
• The third multiplication algorithm combines the right half of the product register with the multiplier.
• This reduces the number of steps needed to implement the multiply, and it also saves space.
Hardware implementation of Algorithm 3:
[Figure: a 32-bit Multiplicand register, a 32-bit ALU, a 64-bit Product register (shift right, write), and a control test unit.]
Multiplication Algorithm 3
Start
For each bit:
1. Test Product0.
   — If Product0 = 1: 1a. Add multiplicand to the left half of the product and place the result in the left half of the Product register.
   — If Product0 = 0: skip to step 2.
2. Shift the Product register right 1 bit.
32nd repetition? No (< 32 repetitions): repeat. Yes (32 repetitions): Done.
Multiplication Algorithm 3
Example: (4-bit) 0010two x 0011two = 0000 0110two

Iteration | Step                     | Multiplicand | Product
0         | Initial steps            | 0010         | 0000 0011
1         | 1a: LSB of product = 1   | 0010         | 0010 0011
          | 2: shift Product right   | 0010         | 0001 0001
2         | 1a: LSB of product = 1   | 0010         | 0011 0001
          | 2: shift Product right   | 0010         | 0001 1000
3         | 1: LSB of product = 0    | 0010         | 0001 1000
          | 2: shift Product right   | 0010         | 0000 1100
4         | 1: LSB of product = 0    | 0010         | 0000 1100
          | 2: shift Product right   | 0010         | 0000 0110
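Algorithm 3 maps directly onto a combined product/multiplier register, as in this Python sketch (an added illustration, not from the slides):

```python
# Algorithm 3: the lower half of the Product register initially holds
# the multiplier; each step tests the LSB, conditionally adds the
# multiplicand to the upper half, then shifts the whole register right.
def multiply_alg3(multiplicand, multiplier, width=4):
    product = multiplier & ((1 << width) - 1)
    for _ in range(width):
        if product & 1:                       # 1. test Product0
            product += multiplicand << width  # 1a. add to the left half
        product >>= 1                         # 2. shift Product right 1 bit
    return product

print(f"{multiply_alg3(0b0010, 0b0011):08b}")  # 00000110
```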
Division Algorithms
• Example: 166ten ÷ 12ten = 13ten remainder 10ten

          13     (Quotient)
        ____
   12 | 166      (Divisor | Dividend)
      - 12
        46
      - 36
        10       (Remainder)

• Hardware implementations are similar to the multiplication algorithms:
+ Algorithm 1 → implements the conventional division method
+ Algorithm 2 → reduces the divisor register and ALU by half
+ Algorithm 3 → eliminates the quotient register completely
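The conventional (restoring) division method can be sketched in Python (an added illustration, not from the slides):

```python
# Restoring division: bring down one dividend bit at a time; if the
# running remainder covers the divisor, subtract and set a quotient bit.
def divide(dividend, divisor, width=32):
    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder

print(divide(166, 12, width=8))  # (13, 10)
```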
Floating Point Numbers
• We need a way to represent:
— numbers with fractions, e.g., 3.1416
— very small numbers, e.g., .000000001
— very large numbers, e.g., 3.15576 x 10^9
• Representation:
— sign, exponent, significand: (–1)^sign x significand x 2^exponent
— more bits for the significand gives more accuracy
— more bits for the exponent increases dynamic range
• IEEE 754 floating point standard:
| SIGN | EXPONENT | SIGNIFICAND |
— Single precision: 1 sign bit, 8-bit exponent, 23-bit significand
— Double precision: 1 sign bit, 11-bit exponent, 52-bit significand
IEEE 754 floating-point standard
• The leading “1” bit of the significand is implicit.
• The exponent is “biased” to make sorting easier:
—all 0s is the smallest exponent, all 1s is the largest
—bias of 127 for single precision and 1023 for double precision
• Summary: (–1)^sign x (1 + fraction) x 2^(exponent – bias)
• Example: –0.75ten = –1.1two x 2^–1
— Single precision: (–1)^1 x (1 + .1000…) x 2^(126–127)
– 1|01111110|10000000000000000000000
— Double precision: (–1)^1 x (1 + .1000…) x 2^(1022–1023)
– 1|01111111110|10000000000000000000 (32 more 0s)
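The single-precision bit pattern above can be verified with Python's struct module (an added check, not part of the slides):

```python
import struct

# Extract the IEEE 754 single-precision fields of a value.
def float_bits(x):
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

sign, exponent, fraction = float_bits(-0.75)
print(sign, f"{exponent:08b}", f"{fraction:023b}")
# 1 01111110 10000000000000000000000
```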
FP Addition Algorithm
• The number with the smaller
exponent must be shifted
right before adding.
—So the “binary points” align.
• After adding, the sum must
be normalized.
—Then it is rounded,
– and possibly re-normalized
• Possible errors include:
—Overflow (exponent too big)
—Underflow (exp. too small)
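The align/add/normalize steps can be modeled on (significand, exponent) pairs, as in this toy Python sketch (an added illustration, not from the slides; rounding and sign handling are omitted):

```python
# Toy FP add: significands are unsigned fixed-point integers with
# `precision` fraction bits after the implicit leading 1.
def fp_add(sig_a, exp_a, sig_b, exp_b, precision=23):
    # Step 1: shift the number with the smaller exponent right,
    # so the binary points align.
    if exp_a < exp_b:
        sig_a >>= (exp_b - exp_a)
        exp = exp_b
    else:
        sig_b >>= (exp_a - exp_b)
        exp = exp_a
    sig = sig_a + sig_b
    # Step 2: normalize the sum back into 1.xxx form.
    while sig >= (1 << (precision + 1)):
        sig >>= 1
        exp += 1
    return sig, exp

# 1.5 x 2^0 + 1.0 x 2^-1 = 2.0, with 4 fraction bits:
print(fp_add(0b11000, 0, 0b10000, -1, precision=4))  # (16, 1) = 1.0 x 2^1
```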
Floating-Point Addition Hardware
• Implements
algorithm
from prev.
slide.
• Note high
complexity
compared
with integer
addition HW.
FP Multiplication Algorithm
• Add the exponents, adjusting for the bias.
• Multiply the significands.
• Normalize, checking for overflow/underflow,
• then round (re-normalizing if necessary).
• Compute the sign.
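The exponent-bias bookkeeping is the subtle step; this toy Python sketch shows it on biased-exponent, fixed-point pairs (an added illustration, not from the slides; rounding and sign are omitted):

```python
# Toy FP multiply: exponents are biased, significands are unsigned
# fixed-point integers with `precision` fraction bits.
def fp_mul(sig_a, exp_a, sig_b, exp_b, bias=127, precision=23):
    # Both exponents carry the bias, so subtract it once after adding.
    exp = exp_a + exp_b - bias
    # Multiply significands and rescale back to `precision` fraction bits.
    sig = (sig_a * sig_b) >> precision
    while sig >= (1 << (precision + 1)):   # normalize back to 1.xxx form
        sig >>= 1
        exp += 1
    return sig, exp

# 1.5 x 1.5 = 2.25: significand 0xC00000 (= 1.1two), biased exponent 127.
print(fp_mul(0xC00000, 127, 0xC00000, 127))  # (9437184, 128) = 1.125 x 2^1
```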
Ethics Addendum: Intel Pentium FP bug
• In July 1994, Intel discovered there was a bug in the Pentium’s FP division hardware…
— But it decided not to announce the bug, and went ahead and shipped chips having the flaw anyway, to save time and money.
– Based on its analysis, Intel thought errors could arise only rarely.
• Even after the bug was discovered by users, Intel initially refused to replace the bad chips on request!
— It got a lot of bad PR from this…
• Lesson: Good, ethical engineers fix problems when they first find them, and don’t cover them up!
Summary
• Computer arithmetic is constrained by limited precision.
• Bit patterns have no inherent meaning but standards do exist:
– two’s complement
– IEEE 754 floating point
• Computer instructions determine “meaning” of the bit patterns.
• Performance and accuracy are important so there are many
complexities in real machines (i.e., algorithms and
implementation).
* Please read the remainder of the chapter on your own. However, for exams you will only be responsible for the material that has been covered in the lectures.