
Computer Architecture and Design – ECEN 350
Part 7
[Some slides adapted from M. Irwin, D. Patterson and others]

ALU Implementation
Review: Basic Hardware

1. AND gate (c = a · b)
2. OR gate (c = a + b)
3. Inverter (c = a′)
4. Multiplexor (if d == 0, c = a; else c = b)

Truth tables:

  a b | c = a · b      a b | c = a + b      a | c = a′      d | c
  0 0 | 0              0 0 | 0              0 | 1           0 | a
  0 1 | 0              0 1 | 1              1 | 0           1 | b
  1 0 | 0              1 0 | 1
  1 1 | 1              1 1 | 1
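The truth tables above can be modeled directly as tiny Python functions (a sketch for checking, not part of the slides):

```python
def and_gate(a, b):        # c = a . b
    return a & b

def or_gate(a, b):         # c = a + b
    return a | b

def inverter(a):           # c = NOT a
    return a ^ 1

def mux(d, a, b):          # if d == 0, c = a; else c = b
    return a if d == 0 else b
```

Each function reproduces one row of its gate's truth table for inputs 0 and 1.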
A Simple Multi-Function Logic Unit

• To warm up, let's build a logic unit to support the and and or
  instructions for MIPS (32-bit registers)
• We'll just build a 1-bit unit and use 32 of them
• Possible implementation using a multiplexor:

Implementation with a Multiplexor

• Selects one of the inputs to be the output, based on a control input (Operation)

[Figure: inputs a and b feed both an AND gate and an OR gate; a 2-to-1 multiplexor controlled by Operation selects which result to output]
Implementations

• Not easy to decide the best way to implement
  - do not want too many inputs to a single gate
  - do not want to have to go through too many gates (= levels)
• Let's look at a 1-bit ALU for addition (inputs a, b, CarryIn; outputs Sum, CarryOut):

  cout = a·b + a·cin + b·cin
  sum  = a′·b′·cin + a′·b·cin′ + a·b′·cin′ + a·b·cin
       = a ⊕ b ⊕ cin          (exclusive or, xor)
Building a 1-bit Binary Adder

[Figure: a 1-bit full adder with inputs A, B, carry_in and outputs S, carry_out]

  A B carry_in | carry_out S
  0 0 0        | 0 0
  0 0 1        | 0 1
  0 1 0        | 0 1
  0 1 1        | 1 0
  1 0 0        | 0 1
  1 0 1        | 1 0
  1 1 0        | 1 0
  1 1 1        | 1 1

S = A xor B xor carry_in
carry_out = A·B + A·carry_in + B·carry_in   (majority function)
• How can we use it to build a 32-bit adder?
• How can we modify it easily to build an adder/subtractor?
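The full adder, and the ripple-carry chain it extends to, can be sketched in Python (illustrative model, not from the slides; bit lists are least-significant-bit first):

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin                             # S = A xor B xor carry_in
    cout = (a & b) | (a & cin) | (b & cin)      # majority function
    return s, cout

def ripple_add(a_bits, b_bits, cin=0):
    # Chain n full adders: each stage's carry_out feeds the next carry_in.
    out = []
    for a, b in zip(a_bits, b_bits):
        s, cin = full_adder(a, b, cin)
        out.append(s)
    return out, cin                             # sum bits and final carry-out
```

For example, 5 + 6 = 11: `ripple_add([1,0,1,0], [0,1,1,0])` returns `([1,1,0,1], 0)`.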
1-bit Adder Logic

• Half-adder with one xor gate
• Full-adder from 2 half-adders and an or gate
• Half-adder with the xor gate replaced by primitive gates, using the equation
  A ⊕ B = A·B′ + A′·B
Building a 32-bit ALU

[Figure, left — 1-bit ALU for AND, OR and add: a and b feed an AND gate, an OR gate, and a full adder (CarryIn, CarryOut); a multiplexor with control line Operation selects among the three results]

[Figure, right — Ripple-Carry Logic for a 32-bit ALU: 32 copies ALU0..ALU31; ALUi takes ai, bi and a CarryIn, produces Resulti, and its CarryOut feeds the CarryIn of ALUi+1]
What about Subtraction (a – b)?

• Two's complement approach: just negate b and add.
• How do we negate?
  - recall the negation shortcut: invert each bit of b and set the CarryIn of the
    least significant ALU (ALU0) to 1

[Figure: the 1-bit ALU extended with a Binvert control line — a multiplexor selects b or its inverse before the adder; Operation still selects among AND, OR and add]
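The Binvert trick can be checked with a small Python sketch (illustrative names, not from the slides) over 32-bit values:

```python
MASK = 0xFFFFFFFF  # 32-bit word

def alu_addsub(a, b, binvert):
    # With Binvert = 1, every bit of b is inverted and the CarryIn of ALU0
    # is forced to 1, so the adder computes a + ~b + 1 = a - b (two's complement).
    if binvert:
        b = ~b & MASK
    return (a + b + binvert) & MASK
```

For example, `alu_addsub(7, 3, 1)` yields 4, and `alu_addsub(3, 7, 1)` yields the two's complement pattern of –4.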
Overflow Detection and Effects

• Overflow: the result is too large to represent in the number of bits allocated
• When adding operands with different signs, overflow cannot occur! Overflow occurs when
  - adding two positives yields a negative
  - or, adding two negatives gives a positive
  - or, subtracting a negative from a positive gives a negative
  - or, subtracting a positive from a negative gives a positive
• On overflow, an exception (interrupt) occurs
  - Control jumps to a predefined address for the exception
  - Interrupted address (address of the instruction causing the overflow) is saved
    for possible resumption
• Don't always want to detect (interrupt on) overflow
Overflow Detection

• Overflow can be detected by:
  Carry into MSB xor Carry out of MSB
• Logic for overflow detection therefore can be put into ALU31

Examples (4-bit):

    0 1 1 1 = 7          1 1 0 0 = –4
  + 0 0 1 1 = 3        + 1 0 1 1 = –5
    ---------            ---------
    1 0 1 0 = –6         0 1 1 1 = 7
    (overflow!)          (overflow!)
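The carry-into-MSB xor carry-out-of-MSB rule can be sketched in Python (an illustrative check, not from the slides; operands are unsigned bit patterns of the given width):

```python
def add_overflows(a, b, bits=4):
    # Signed overflow iff carry into the MSB differs from carry out of the MSB.
    mask = (1 << bits) - 1
    low = (a & (mask >> 1)) + (b & (mask >> 1))  # add all bits below the MSB
    carry_in_msb = low >> (bits - 1)
    carry_out_msb = ((a & mask) + (b & mask)) >> bits
    return carry_in_msb != carry_out_msb
```

The two 4-bit examples above both overflow: 7 + 3 and (–4) + (–5); mixed-sign additions such as 5 + (–6) do not.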
New MIPS Instructions

Category      Instr                    Op Code    Example             Meaning
Arithmetic    add unsigned             0 and 21   addu $s1,$s2,$s3    $s1 = $s2 + $s3
(R & I        sub unsigned             0 and 23   subu $s1,$s2,$s3    $s1 = $s2 - $s3
format)       add imm. unsigned        9          addiu $s1,$s2,6     $s1 = $s2 + 6
Data          ld byte unsigned         24         lbu $s1,25($s2)     $s1 = Mem($s2+25)
Transfer      ld half unsigned         25         lhu $s1,25($s2)     $s1 = Mem($s2+25)
Cond. Branch  set on less than uns.    0 and 2b   sltu $s1,$s2,$s3    if ($s2<$s3) $s1=1 else $s1=0
(I & R fmt)   set on less than
              imm unsigned             b          sltiu $s1,$s2,6     if ($s2<6) $s1=1 else $s1=0

• Sign extend – addi, addiu, slti, sltiu
• Zero extend – andi, ori, xori
• Overflow detected – add, addi, sub
Tailoring the ALU to MIPS:
Test for Less-than and Equality

• Need to support the set-on-less-than instruction
  - e.g., slt $t0, $t3, $t4
  - remember: slt is an R-type instruction that produces 1 if rs < rt
    and 0 otherwise
  - idea is to use subtraction: rs < rt ⇔ rs – rt < 0. Recall that the msb of a
    negative number is 1
  - two cases after the subtraction rs – rt:
    - if no overflow, then rs < rt ⇔ most significant bit of rs – rt = 1
    - if overflow, then rs < rt ⇔ most significant bit of rs – rt = 0
    - why? e.g., 5ten – 6ten = 0101 – 0110 = 0101 + 1010 = 1111 (msb 1, ok!)
      –7ten – 6ten = 1001 – 0110 = 1001 + 1010 = 0011 (msb 0, overflow!)
  - the Set bit is sent from ALU31 to ALU0 as the Less bit at ALU0; all other
    Less bits are hardwired 0; so Less is the 32-bit result of slt
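The two cases above — take the msb of rs – rt, but flip it if the subtraction overflowed — can be sketched in Python (illustrative model, not from the slides; operands are signed integers in range for the given width):

```python
def slt(rs, rt, bits=32):
    mask = (1 << bits) - 1
    diff = (rs - rt) & mask
    msb = diff >> (bits - 1)                    # sign bit of rs - rt
    # did the subtraction overflow the signed range of `bits` bits?
    overflow = not (-(1 << (bits - 1)) <= rs - rt < (1 << (bits - 1)))
    # Set = msb if no overflow, inverted msb if overflow
    return msb ^ (1 if overflow else 0)
```

With 4-bit operands this reproduces the slide's examples: `slt(5, 6, 4)` is 1, and `slt(-7, 6, 4)` is 1 despite the overflow flipping the msb.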
Supporting slt

[Figure a — 1-bit ALU for the 31 least significant bits: the Operation multiplexor gains a fourth input, Less (Operation = 3 selects it); Binvert, CarryIn and CarryOut are as before. The Less input of the 31 most significant ALUs is always 0.]

[Figure b — 1-bit ALU for the most significant bit: additionally produces a Set output, plus overflow detection logic.]

[Figure — 32-bit ALU built from 31 copies of ALU (a) and 1 copy of ALU (b) in the most significant position. The extra Set bit, to be routed to the Less input of the least significant 1-bit ALU, is computed from the most significant Result bit and the Overflow bit (it is not simply the output of the adder as the figure seems to indicate).]
Tailoring the ALU to MIPS:
Test for Less-than and Equality

• Need to support test for equality
  - e.g., beq $t5, $t6, $t7
  - use subtraction: rs – rt = 0 ⇔ rs = rt
  - do we need to consider overflow?
Supporting Test for Equality

• Combine the CarryIn to the least significant ALU and Binvert into a single
  control line (Bnegate), as both are always either 1 or 0 together

ALU control lines:

  Bnegate  Operation  Function
  0        00         and
  0        01         or
  0        10         add
  1        10         sub
  1        11         slt

• Zero: output is 1 only if all Result bits are 0

[Figure — 32-bit MIPS ALU: ALU0..ALU31 in a ripple-carry chain; all Result bits feed a NOR to produce Zero; ALU31 also produces Set and Overflow]
[Figure — symbol representing the ALU: inputs a, b and ALU operation; outputs Result, Zero, Overflow, CarryOut]
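The control table above maps directly onto a 1-bit ALU model in Python (an illustrative sketch, not from the slides; `CONTROL` and `one_bit_alu` are invented names):

```python
# Control table from the slide: function -> (Bnegate, Operation)
CONTROL = {'and': (0, 0), 'or': (0, 1), 'add': (0, 2), 'sub': (1, 2), 'slt': (1, 3)}

def one_bit_alu(a, b, less, cin, bnegate, operation):
    if bnegate:
        b ^= 1                                   # invert b for subtraction/slt
    s = a ^ b ^ cin                              # adder sum
    cout = (a & b) | (a & cin) | (b & cin)       # adder carry-out
    result = [a & b, a | b, s, less][operation]  # 4-way result multiplexor
    return result, cout
```

For `sub`, the external control must also force the CarryIn of ALU0 to 1, matching the Bnegate line.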
Conclusion

• We can build an ALU to support the MIPS instruction set
  - key idea: use a multiplexor to select the output we want
  - we can efficiently perform subtraction using two's complement
  - we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
  - speed of a gate depends on the number of inputs (fan-in) to the gate
  - speed of a circuit depends on the number of gates in series
    (particularly, on the critical path to the deepest level of logic)
• Speed of MIPS operations
  - clever changes to organization can improve performance
    (similar to using better algorithms in software)
  - we'll look at examples for addition, multiplication and division
Problem: Ripple-carry Adder is Slow

• Is a 32-bit ALU as fast as a 1-bit ALU? Why?
• Is there more than one way to do addition? Yes:
  - one extreme: ripple-carry – the carry ripples through 32 ALUs: slow!
  - other extreme: sum-of-products for each CarryIn bit – super fast!

CarryIn bits (ci is the CarryIn bit into the i-th ALU; c0 is the forced CarryIn
into the least significant ALU):

  c1 = b0.c0 + a0.c0 + a0.b0
  c2 = b1.c1 + a1.c1 + a1.b1
     = a1.a0.b0 + a1.a0.c0 + a1.b0.c0
       + b1.a0.b0 + b1.a0.c0 + b1.b0.c0 + a1.b1     (substituting for c1)
  c3 = b2.c2 + a2.c2 + a2.b2
     = … = sum of 15 4-term products …

• How fast? But not feasible for a 32-bit ALU! Why? Exponential complexity!!
Two-level Carry-lookahead Adder: First Level

• An approach between our two extremes
• Motivation:
  - if we didn't know the value of a carry-in, what could we do?
  - when would we always generate a carry?  (generate)  gi = ai . bi
  - when would we propagate the carry?      (propagate) pi = ai + bi
• Express the carry-in equations in terms of generates/propagates:

  c1 = g0 + p0.c0
  c2 = g1 + p1.c1 = g1 + p1.g0 + p1.p0.c0
  c3 = g2 + p2.c2 = g2 + p2.g1 + p2.p1.g0 + p2.p1.p0.c0
  c4 = g3 + p3.c3 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0 + p3.p2.p1.p0.c0

• Feasible for 4-bit adders – with wider adders, unacceptable complexity
  - solution: build a first level using 4-bit adders, then a second level on top
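The generate/propagate recurrence can be sketched in Python (an illustrative check, not from the slides; bit lists are least-significant-bit first):

```python
def lookahead_carries(a_bits, b_bits, c0=0):
    # gi = ai.bi (generate), pi = ai + bi (propagate); c(i+1) = gi + pi.ci
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    carries = [c0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries  # [c0, c1, ..., cn]
```

The loop is written iteratively for clarity; unrolling each `carries[-1]` reproduces the flat two-level sum-of-products equations above, which is what the hardware computes in parallel.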
Two-level Carry-lookahead Adder: Second Level for a 16-bit Adder

• Propagate signals for each of the four 4-bit adder blocks:

  P0 = p3.p2.p1.p0
  P1 = p7.p6.p5.p4
  P2 = p11.p10.p9.p8
  P3 = p15.p14.p13.p12

• Generate signals for each of the four 4-bit adder blocks:

  G0 = g3 + p3.g2 + p3.p2.g1 + p3.p2.p1.g0
  G1 = g7 + p7.g6 + p7.p6.g5 + p7.p6.p5.g4
  G2 = g11 + p11.g10 + p11.p10.g9 + p11.p10.p9.g8
  G3 = g15 + p15.g14 + p15.p14.g13 + p15.p14.p13.g12
Two-level Carry-lookahead Adder: Second Level for a 16-bit Adder

• CarryIn signals for each of the four 4-bit adder blocks (see the earlier
  carry-in equations in terms of generates/propagates):

  C1 = G0 + P0.c0
  C2 = G1 + P1.G0 + P1.P0.c0
  C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.c0
  C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.c0
[Figure — 16-bit carry-lookahead adder from four 4-bit adders and one carry-lookahead unit: 4bAdder0..4bAdder3 handle (a0..a3, b0..b3) through (a12..a15, b12..b15), producing Result0-3, Result4-7, Result8-11, Result12-15; each block sends its Pi and Gi to the carry-lookahead unit, which computes C1, C2, C3, C4 and CarryOut from them and CarryIn]

[Figure — blow-up of a 4-bit adder: (conceptually) four 1-bit ALUs plus logic to compute all CarryOut bits, one super generate bit and one super propagate bit (c1, c2, c3, c4, P0, G0 for block 0). Each 1-bit ALU is exactly as for ripple-carry, except that c1, c2, c3 for ALUs 1, 2, 3 come from the extra logic]
Two-level Carry-lookahead Adder: Second Level for a 16-bit Adder

• Two-level carry-lookahead logic steps:
  1. compute pi's and gi's at each 1-bit ALU
  2. compute Pi's and Gi's at each 4-bit adder unit
  3. compute Ci's in the carry-lookahead unit
  4. compute ci's at each 4-bit adder unit
  5. compute results (sum bits) at each 1-bit ALU
Multiplication

• Standard shift-add method:

        1000      Multiplicand
      x 1001      Multiplier
      ------
        1000
       0000
      0000
     1000
    --------
    01001000      Product

• m bits x n bits = m+n bit product
• Binary makes it easy:
  - multiplier bit 1 => copy multiplicand (1 x multiplicand)
  - multiplier bit 0 => place 0           (0 x multiplicand)
• 3 versions of multiply hardware & algorithms:
Shift-add Multiplier Version 1

Algorithm:
1. Test Multiplier0
   1a. If Multiplier0 = 1, add the multiplicand to the product and place the
       result in the Product register
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
Repeat until the 32nd repetition; then done.
Shift-add Multiplier Version 1 (hardware)

[Figure: 64-bit Multiplicand register (shift left; the 32-bit multiplicand starts in the right half), 32-bit Multiplier register (shift right), 64-bit ALU, 64-bit Product register (write; initialized to 0), and control test]

• Multiplicand register, Product register and ALU are 64 bits wide;
  the Multiplier register is 32 bits wide
Shift-add Multiplier Version 1 (example)

Example: 0010 * 0011:

  Iteration  Step         Multiplier  Multiplicand  Product
  0          init values  0011        0000 0010     0000 0000
  1          1a           0011        0000 0010     0000 0010
             2            0011        0000 0100     0000 0010
             3            0001        0000 0100     0000 0010
  2          …
Observations on Multiply Version 1

• 1 step per clock cycle => nearly 100 clock cycles to multiply two 32-bit numbers
• Half the bits in the multiplicand register are always 0
  => half of the 64-bit adder is wasted
• 0's are inserted to the right as the multiplicand is shifted left
  => least significant bits of the product never change once formed
• Intuition: instead of shifting the multiplicand left, shift the product right…
Shift-add Multiplier Version 2

Algorithm:
1. Test Multiplier0
   1a. If Multiplier0 = 1, add the multiplicand to the left half of the product
       and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
3. Shift the Multiplier register right 1 bit
Repeat until the 32nd repetition; then done.
Shift-add Multiplier Version 2 (hardware)

[Figure: 32-bit Multiplicand register, 32-bit ALU, 32-bit Multiplier register (shift right), 64-bit Product register (shift right, write; initialized to 0), and control test]

• Multiplicand register, Multiplier register and ALU are 32 bits wide; the
  Product register is 64 bits wide; the multiplicand adds to the left half of
  the Product register
Shift-add Multiplier Version 2 (example)

Example: 0010 * 0011:

  Iteration  Step         Multiplier  Multiplicand  Product
  0          init values  0011        0010          0000 0000
  1          1a           0011        0010          0010 0000
             2            0011        0010          0001 0000
             3            0001        0010          0001 0000
  2          …
Observations on Multiply Version 2

• Each step the Product register wastes space that exactly matches the current
  size of the multiplier
• Intuition: combine the Multiplier register and the Product register…
Shift-add Multiplier Version 3

Algorithm:
1. Test Product0
   1a. If Product0 = 1, add the multiplicand to the left half of the product
       and place the result in the left half of the Product register
2. Shift the Product register right 1 bit
Repeat until the 32nd repetition; then done.

[Figure: 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right, write), and control test]

• No separate Multiplier register; the multiplier is placed on the right side
  of the 64-bit Product register (which is initialized with the multiplier on
  the right)
Shift-add Multiplier Version 3 (example)

Example: 0010 * 0011:

  Iteration  Step         Multiplicand  Product
  0          init values  0010          0000 0011
  1          1a           0010          0010 0011
             2            0010          0001 0001
  2          …
Observations on Multiply Version 3

• 2 steps per bit because multiplier & product are combined
• What about signed multiplication?
  - easiest solution is to make both operands positive and remember whether to
    negate the product when done, i.e., leave out the sign bit, run for 31
    steps, then negate if the multiplier and multiplicand have opposite signs
  - Booth's Algorithm is an elegant way to multiply signed numbers using the
    same hardware – and it is also often quicker…
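Version 3 is compact enough to sketch in Python (an illustrative model, not from the slides; Python's unbounded integers stand in for the 64-bit register plus carry):

```python
def multiply_v3(multiplicand, multiplier, bits=32):
    # Version 3: one 2*bits product register, multiplier loaded in its right half.
    product = multiplier & ((1 << bits) - 1)
    for _ in range(bits):
        if product & 1:                        # step 1: test Product0
            product += multiplicand << bits    # 1a: add multiplicand to left half
        product >>= 1                          # step 2: shift Product right 1 bit
    return product
```

With the slide's 4-bit example, `multiply_v3(0b0010, 0b0011, 4)` gives 0b0110 = 6, and `multiply_v3(0b1000, 0b1001, 4)` gives 0b01001000 = 72.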
MIPS Notes

• MIPS provides two 32-bit registers Hi and Lo to hold a 64-bit product
• mult, multu (unsigned) put the product of two 32-bit register operands into
  Hi and Lo; overflow is ignored by MIPS but can be detected by the programmer
  by examining the contents of Hi
• mflo, mfhi move the contents of Lo or Hi to a general-purpose register
• Pseudo-instructions mul (without overflow), mulo (with overflow), mulou
  (unsigned with overflow) take three 32-bit register operands, putting the
  product of two registers into the third
Divide

                 1001      Quotient
  Divisor 1000 ) 1001010   Dividend
                –1000
                    10
                    101
                    1010
                   –1000
                      10   Remainder

• standard method: see how big a multiple of the divisor can be subtracted,
  creating a quotient digit at each step
• Binary makes it easy => first, try 1 * divisor; if too big, 0 * divisor
• Dividend = (Quotient * Divisor) + Remainder
• 3 versions of divide hardware & algorithm:
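The worked example and the identity above can be checked directly in Python (a quick sanity check, not from the slides):

```python
# 1001010two / 1000two = 1001two remainder 10two,
# and Dividend = Quotient * Divisor + Remainder holds.
dividend, divisor = 0b1001010, 0b1000
quotient, remainder = divmod(dividend, divisor)
assert quotient == 0b1001 and remainder == 0b10
assert dividend == quotient * divisor + remainder
```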
Divide Version 1

Algorithm:
1. Subtract the Divisor register from the Remainder register and place the
   result in the Remainder register
2a. If Remainder ≥ 0, shift the Quotient register to the left, setting the new
    rightmost bit to 1
2b. If Remainder < 0, restore the original value by adding the Divisor register
    to the Remainder register and placing the sum in the Remainder register;
    also shift the Quotient register to the left, setting the new least
    significant bit to 0
3. Shift the Divisor register right 1 bit
Repeat until the 33rd repetition; then done.

Example: 0111 / 0010:

  Iteration  Step  Quotient  Divisor    Remainder
  0          init  0000      0010 0000  0000 0111
  1          1     0000      0010 0000  1110 0111
             2b    0000      0010 0000  0000 0111
             3     0000      0001 0000  0000 0111
  2          …
Divide Version 1 (hardware)

[Figure: 64-bit Divisor register (shift right; the 32-bit divisor starts in the left half), 64-bit ALU, 32-bit Quotient register (shift left; initialized to 0), 64-bit Remainder register (write; initialized with the dividend at right), and control test]

• Divisor register, Remainder register and ALU are 64 bits wide; the Quotient
  register is 32 bits wide
• Why 33 repetitions?
Observations on Divide Version 1

• Half the bits in the divisor are always 0
  => 1/2 of the 64-bit adder is wasted
  => 1/2 of the divisor register is wasted
• Intuition: instead of shifting the divisor right, shift the remainder left…
• Step 1 cannot produce a 1 in the quotient bit – as all bits corresponding to
  the divisor in the remainder register are 0 (remember all operands are 32-bit)
• Intuition: switch the order to shift first and then subtract – can save 1
  iteration…
Divide Version 2

Algorithm:
1. Shift the Remainder register left 1 bit
2. Subtract the Divisor register from the left half of the Remainder register
   and place the result in the left half of the Remainder register
3a. If Remainder ≥ 0, shift the Remainder register to the left, setting the new
    rightmost bit to 0; also shift the Quotient register to the left, setting
    the new rightmost bit to 1
3b. If Remainder < 0, restore the original value by adding the Divisor register
    to the left half of the Remainder register and placing the sum in the left
    half of the Remainder register; also shift the Remainder register to the
    left, setting the new rightmost bit to 0, and shift the Quotient register
    to the left, setting the new rightmost bit to 0
Repeat until the 32nd repetition; then done: shift the left half of the
Remainder right 1 bit. (Why this correction step?)

[Figure: 32-bit Divisor register, 32-bit ALU, 32-bit Quotient register (shift left; initialized to 0), 64-bit Remainder register (shift left, write; initialized with the dividend at right), and control test]

• Divisor register, Quotient register and ALU are 32 bits wide; the Remainder
  register is 64 bits wide
Observations on Divide Version 2

• Each step the Remainder register wastes space that exactly matches the
  current size of the quotient
• Intuition: combine the Quotient register and the Remainder register…
Divide Version 3

Algorithm:
1. Shift the Remainder register left 1 bit
2. Subtract the Divisor register from the left half of the Remainder register
   and place the result in the left half of the Remainder register
3a. If Remainder ≥ 0, shift the Remainder register to the left, setting the new
    rightmost bit to 1
3b. If Remainder < 0, restore the original value by adding the Divisor register
    to the left half of the Remainder register and placing the sum in the left
    half of the Remainder register; also shift the Remainder register to the
    left, setting the new rightmost bit to 0
Repeat until the 32nd repetition; then done: shift the left half of the
Remainder right 1 bit. (Why this correction step?)

[Figure: 32-bit Divisor register, 32-bit ALU, 64-bit Remainder register (shift left, shift right, write; initialized with the dividend at right), and control test]

• No separate Quotient register; the quotient is entered on the right side of
  the 64-bit Remainder register
Divide Version 3 (example)

Example: 0111 / 0010:

  Iteration  Step  Divisor  Remainder
  0          init  0010     0000 0111
             1     0010     0000 1110
  1          2     0010     1110 1110
             3b    0010     0001 1100
  2          …
Observations on Divide Version 3

• Same hardware as Multiply Version 3
• Signed divide:
  - make both divisor and dividend positive and perform the division
  - negate the quotient if divisor and dividend were of opposite signs
  - make the sign of the remainder match that of the dividend; this ensures
    that always
    - dividend = (quotient * divisor) + remainder
    - –quotient(x/y) = quotient(–x/y)   (e.g. 7 = 3*2 + 1 and –7 = –3*2 – 1)
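Version 3 and the signed-divide rules can be sketched in Python (an illustrative model, not from the slides; names are invented):

```python
def divide_v3(dividend, divisor, bits=4):
    # Version 3 (unsigned, restoring): 2*bits remainder register holds the
    # dividend on the right; shift left once, then repeat subtract/restore.
    mask2 = (1 << (2 * bits)) - 1
    rem = (dividend << 1) & mask2               # step 1: initial left shift
    for _ in range(bits):
        rem -= divisor << bits                  # step 2: subtract from left half
        if rem < 0:
            rem += divisor << bits              # 3b: restore...
            rem = (rem << 1) & mask2            # ...shift left, insert 0
        else:
            rem = ((rem << 1) | 1) & mask2      # 3a: shift left, insert 1
    quotient = rem & ((1 << bits) - 1)          # right half
    remainder = (rem >> bits) >> 1              # left half, shifted right 1 bit
    return quotient, remainder

def signed_divide(x, y, bits=4):
    q, r = divide_v3(abs(x), abs(y), bits)
    if (x < 0) != (y < 0):
        q = -q                                  # opposite signs: negate quotient
    if x < 0:
        r = -r                                  # remainder's sign follows dividend
    return q, r
```

For example, `divide_v3(0b0111, 0b0010)` gives (3, 1), and `signed_divide(-7, 2)` gives (–3, –1), matching –7 = –3*2 – 1.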
MIPS Divide Instruction

• Divide generates the remainder in hi and the quotient in lo

  div $s0, $s1    # lo = $s0 / $s1
                  # hi = $s0 mod $s1

  (R-format fields: op, rs, rt, rd, shamt, funct)

• Instructions mfhi rd and mflo rd are provided to move the quotient and
  remainder to (user-accessible) registers in the register file
• As with multiply, divide ignores overflow, so software must determine if the
  quotient is too large. Software must also check the divisor to avoid
  division by 0.
MIPS Notes

• div rs, rt (signed) and divu rs, rt (unsigned), with two 32-bit register
  operands, divide the contents of the operands and put the remainder in the
  Hi register and the quotient in Lo; overflow is ignored in both cases
• pseudo-instructions div rdest, rsrc1, src2 (signed with overflow) and
  divu rdest, rsrc1, src2 (unsigned without overflow), with three 32-bit
  register operands, put the quotient of two registers into the third
Floating Point

Review of Numbers
• Computers are made to deal with numbers
• What can we represent in N bits?
• Unsigned integers: 0 to 2^N – 1
• Signed integers (two's complement): –2^(N–1) to 2^(N–1) – 1
Other Numbers
• What about other numbers?
• Very large numbers? (seconds/century)
  3,155,760,000ten (3.15576ten x 10^9)
• Very small numbers? (atomic diameter)
  0.00000001ten (1.0ten x 10^-8)
• Rationals (repeating pattern): 2/3 (0.666666666…)
• Irrationals: 2^(1/2) (1.414213562373…)
• Transcendentals: e (2.718…), π (3.141…)
• All represented in scientific notation
Scientific Notation (in Decimal)

  6.02ten x 10^23
  (mantissa 6.02, decimal point, radix/base 10, exponent 23)

• Normalized form: no leading 0s (exactly one digit to the left of the decimal point)
• Alternatives for representing 1/1,000,000,000
  • Normalized: 1.0 x 10^-9
  • Not normalized: 0.1 x 10^-8, 10.0 x 10^-10
Scientific Notation (in Binary)

  1.0two x 2^-1
  (mantissa 1.0, "binary point", radix/base 2, exponent –1)

• Computer arithmetic that supports it is called floating point, because it
  represents numbers where the binary point is not fixed, as it is for integers
• Declare such a variable in C as float
Floating Point Representation (1/2)
• Normal format: +1.xxxxxxxxxxtwo * 2^yyyytwo
• Multiple of word size (32 bits):

  31 | 30 ……… 23 | 22 ……………………… 0
  S  | Exponent   | Significand
  1 bit  8 bits     23 bits

• S represents the Sign, Exponent represents the y's, Significand represents the x's
• Represents numbers as small as 2.0 x 10^-38 to as large as 2.0 x 10^38
Floating Point Representation (2/2)
• What if the result is too large? (> 2.0x10^38)
  • Overflow! => exponent larger than can be represented in the 8-bit Exponent field
• What if the result is too small? (> 0, < 2.0x10^-38)
  • Underflow! => negative exponent larger than can be represented in the 8-bit Exponent field
• How to reduce the chances of overflow or underflow?
Double Precision Fl. Pt. Representation
• Next multiple of word size (64 bits):

  31 | 30 ……… 20 | 19 ……………………… 0
  S  | Exponent   | Significand
  1 bit  11 bits    20 bits
  Significand (cont'd) – 32 bits

• Double precision (vs. single precision): C variable declared as double
• Represents numbers almost as small as 2.0 x 10^-308 to almost as large as 2.0 x 10^308
• But the primary advantage is greater accuracy due to the larger significand
IEEE 754 Floating Point Standard (1/4)
• Single precision; DP similar
• Sign bit: 1 means negative, 0 means positive
• Significand:
  • To pack more bits, the leading 1 is implicit for normalized numbers
  • 1 + 23 bits single, 1 + 52 bits double
  • always true: Significand < 1 (for normalized numbers)
• Note: 0 has no leading 1, so reserve exponent value 0 just for the number 0
IEEE 754 Floating Point Standard (2/4)
• Want FP numbers to be used even if no
FP hardware; e.g., sort records with FP
numbers using integer compares
• Could break FP number into 3 parts:
compare signs, then compare exponents,
then compare significands
• Then want order:
• Highest order bit is sign ( negative < positive)
• Exponent next, so big exponent => bigger #
• Significand last: exponents same => bigger #
IEEE 754 Floating Point Standard (3/4)
• Negative Exponent?
• 2's comp? 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)

  1/2  0 1111 1111 000 0000 0000 0000 0000 0000
  2    0 0000 0001 000 0000 0000 0000 0000 0000

• This notation, using an integer compare of 1/2 v. 2, makes 1/2 > 2!
• Instead, pick a notation where 0000 0001 is most negative and 1111 1111 is
  most positive
• 1.0 x 2^-1 v. 1.0 x 2^+1 (1/2 v. 2)

  1/2  0 0111 1110 000 0000 0000 0000 0000 0000
  2    0 1000 0000 000 0000 0000 0000 0000 0000
IEEE 754 Floating Point Standard (4/4)
• Called Biased Notation, where the bias is the number subtracted to get the
  real number
• IEEE 754 uses a bias of 127 for single precision: subtract 127 from the
  Exponent field to get the actual value of the exponent
• 1023 is the bias for double precision
• Summary (single precision):

  31 | 30 …… 23 | 22 ………………… 0
  S  | Exponent  | Significand
  1 bit  8 bits    23 bits

• (-1)^S x (1 + Significand) x 2^(Exponent - 127)
• Double precision identical, except with an exponent bias of 1023
• Floating point numbers approximate values that we want to use.
• The IEEE 754 Floating Point Standard is the most widely accepted attempt to
  standardize the interpretation of such numbers
  • Every desktop or server computer sold since ~1997 follows these conventions
• Summary (single precision):

  31 | 30 …… 23 | 22 ………………… 0
  S  | Exponent  | Significand
  1 bit  8 bits    23 bits

• (-1)^S x (1 + Significand) x 2^(Exponent - 127)
• Double precision identical, bias of 1023
Example: Converting Binary FP to Decimal

  0 0110 1000 101 0101 0100 0011 0100 0010

• Sign: 0 => positive
• Exponent:
  • 0110 1000two = 104ten
  • Bias adjustment: 104 – 127 = –23
• Significand:
  • 1 + 1x2^-1 + 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5 + …
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0ten + 0.666115ten
• Represents: 1.666115ten * 2^-23 ≈ 1.986 * 10^-7 (about 2/10,000,000)
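The decoding recipe can be written as a short Python function and cross-checked against the machine's own IEEE 754 interpretation via the standard `struct` module (an illustrative sketch, not from the slides; valid for normalized numbers only, i.e. exponent field 1..254):

```python
import struct

def decode_single(bits):
    # Field split per the summary: sign(1) | exponent(8) | significand(23)
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    value = (-1) ** s * (1 + f / 2**23) * 2.0 ** (e - 127)
    # cross-check against the hardware/library decoding of the same bit pattern
    assert value == struct.unpack('>f', bits.to_bytes(4, 'big'))[0]
    return value
```

For the pattern above, `decode_single(0b0_01101000_10101010100001101000010)` returns about 1.986e-7, matching the hand calculation.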
Converting Decimal to FP (1/3)
• Simple case: if the denominator is a power of 2 (2, 4, 8, 16, etc.), then
  it's easy.
• Show the MIPS representation of –0.75
  • –0.75 = –3/4
  • –11two/100two = –0.11two
  • Normalized to –1.1two x 2^-1
  • (-1)^S x (1 + Significand) x 2^(Exponent - 127)
  • (-1)^1 x (1 + .100 0000 … 0000) x 2^(126 - 127)

  1 0111 1110 100 0000 0000 0000 0000 0000
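The hand-built pattern for –0.75 can be verified against Python's own single-precision encoding (a quick check, not from the slides):

```python
import struct

# The bit pattern derived above, written as a 32-bit integer:
pattern = 0b1_01111110_10000000000000000000000

# Python's IEEE 754 single-precision encoding of -0.75, big-endian:
packed = struct.unpack('>I', struct.pack('>f', -0.75))[0]
assert packed == pattern
```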
Converting Decimal to FP (2/3)
• Not So Simple Case: if the denominator is
not a power of 2.
• Then we can’t represent number precisely,
but that’s why we have so many bits in
significand: for precision
• Once we have significand, normalizing a
number to get the exponent is easy.
• So how do we get the significand of a
neverending number?
Converting Decimal to FP (3/3)
• Fact: All rational numbers have a
repeating pattern when written out in
decimal.
• Fact: This still applies in binary.
• To finish conversion:
• Write out binary number with repeating
pattern.
• Cut it off after correct number of bits
(different for single v. double precision).
• Derive Sign, Exponent and Significand
fields.
Example: Representing 1/3 in MIPS
• 1/3 = 0.33333…ten
      = 0.25 + 0.0625 + 0.015625 + 0.00390625 + …
      = 1/4 + 1/16 + 1/64 + 1/256 + …
      = 2^-2 + 2^-4 + 2^-6 + 2^-8 + …
      = 0.0101010101…two * 2^0
      = 1.0101010101…two * 2^-2
• Sign: 0
• Exponent = –2 + 127 = 125 = 0111 1101
• Significand = 0101010101…

  0 0111 1101 0101 0101 0101 0101 0101 010
Representation for ± ∞
• In FP, divide by 0 should produce ± ∞,
not overflow.
• Why?
• OK to do further computations with ∞
E.g., X/0 > Y may be a valid comparison
• IEEE 754 represents ± ∞
• Most positive exponent reserved for ∞
• Significands all zeroes
Representation for 0
• Represent 0?
• exponent all zeroes
• significand all zeroes too
• What about sign?
•+0: 0 00000000 00000000000000000000000
•-0: 1 00000000 00000000000000000000000
• Why two zeroes?
• Helps in some limit comparisons
Special Numbers
• What have we defined so far? (Single precision)

  Exponent  Significand  Object
  0         0            0
  0         nonzero      ???
  1-254     anything     +/- fl. pt. #
  255       0            +/- ∞
  255       nonzero      ???
Representation for Not a Number
• What is sqrt(-4.0) or 0/0?
• If ∞ not an error, these shouldn’t be either.
• Called Not a Number (NaN)
• Exponent = 255, Significand nonzero
• Why is this useful?
• Hope NaNs help with debugging?
• They contaminate: op(NaN, X) = NaN
Floating Point Addition

• Algorithm:
1. Compare the exponents of the two numbers. Shift the smaller number to the
   right until its exponent matches the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or
   shifting left and decrementing the exponent (on exponent overflow or
   underflow, raise an exception)
4. Round the significand to the appropriate number of bits; if the result is
   no longer normalized, return to step 3
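Steps 1–3 can be sketched in Python on a simplified representation (an illustrative model, not from the slides: positive operands only, significands as 24-bit integers with the implicit leading 1 made explicit, rounding omitted):

```python
def fp_add(e1, f1, e2, f2, sig_bits=23):
    # Operand value = (1 + f / 2**sig_bits) * 2**e
    m1, m2 = (1 << sig_bits) | f1, (1 << sig_bits) | f2
    # step 1: align by shifting the smaller number's significand right
    if e1 < e2:
        m1 >>= (e2 - e1); e = e2
    else:
        m2 >>= (e1 - e2); e = e1
    m = m1 + m2                          # step 2: add the significands
    while m >= (1 << (sig_bits + 1)):    # step 3: normalize (shift right)
        m >>= 1; e += 1
    return e, m - (1 << sig_bits)        # step 4 (rounding) omitted
```

For example, 1.5x2^0 + 1.0x2^-1 aligns to the same exponent, sums to 2.0, and normalizes to 1.0x2^1.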
Floating Point Addition (hardware)

[Figure: a small ALU computes the exponent difference; multiplexors route the smaller number's significand to a right shifter; the big ALU adds the aligned significands; normalization hardware shifts left or right and increments or decrements the exponent; rounding hardware produces the final sign, exponent and significand]
Floating Point Multiplication

• Algorithm:
1. Add the biased exponents of the two numbers, subtracting the bias from the
   sum to get the new biased exponent
2. Multiply the significands
3. Normalize the product if necessary, shifting it right and incrementing the
   exponent (on exponent overflow or underflow, raise an exception)
4. Round the significand to the appropriate number of bits; if the result is
   no longer normalized, return to step 3
5. Set the sign of the product to positive if the signs of the original
   operands are the same; if they differ, make the sign negative
MIPS Floating Point Architecture (1/4)
• Separate floating point instructions:
  • Single precision: add.s, sub.s, mul.s, div.s
  • Double precision: add.d, sub.d, mul.d, div.d
• These are far more complicated than their integer counterparts and can take
  much longer to execute
MIPS Floating Point Architecture (2/4)
• Problems:
• Inefficient to have different instructions take vastly differing amounts of time.
• Generally, a particular piece of data will not change FP ↔ int within a program.
- Only 1 type of instruction will be used on it.
• Some programs do no FP calculations
• It takes lots of hardware relative to integers to do FP fast
MIPS Floating Point Architecture (3/4)
• 1990 Solution: Make a completely separate chip that handles only FP.
• Coprocessor 1: FP chip
• contains 32 32-bit registers: $f0, $f1, …
• most of the registers specified in .s and .d instructions refer to this set
• separate load and store: lwc1 and swc1 (“load word coprocessor 1”, “store …”)
• Double Precision: by convention, an even/odd pair contains one DP FP number: $f0/$f1, $f2/$f3, … , $f30/$f31
- Even register is the name
MIPS Floating Point Architecture (4/4)
• Today, the FP coprocessor is integrated with the CPU, though cheap chips may leave out FP HW
• Instructions to move data between main processor and coprocessors:
• mfc0, mtc0, mfc1, mtc1, etc.
• Appendix contains many more FP ops
Performance evaluation
Performance Metrics
 Purchasing perspective
 given a collection of machines, which has the
   - best performance?
   - least cost?
   - best cost/performance?
 Design perspective
 faced with design options, which has the
   - best performance improvement?
   - least cost?
   - best cost/performance?
 Both require
 basis for comparison
 metric for evaluation
Two Notions of “Performance”

  Plane             DC to Paris  Top Speed  Passengers  Throughput (pass x mph)
  Boeing 747        6.5 hours    610 mph    470         286,700
  BAD/Sud Concorde  3 hours      1350 mph   132         178,200

• Which has higher performance?
• Time to deliver 1 passenger?
• Time to deliver 400 passengers?
• In a computer, time for 1 job is called Response Time or Execution Time
• In a computer, jobs per day is called Throughput or Bandwidth
• We will focus primarily on execution time for a single job
Performance Metrics
 Normally we are interested in reducing Response time (aka execution time) – the time between the start and the completion of a task
 Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors
 Additional Metrics
 Smallest/lightest
 Longest battery life
 Most reliable/durable (in space)
What is Time?
 Straightforward definition of time:
 Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, ...
 “real time”, “response time” or “elapsed time”
 Alternative: count only the time the processor (CPU) is working on your program (since multiple processes may be running at the same time)
 “CPU execution time” or “CPU time”
 Often divided into system CPU time (in OS) and user CPU time (in user program)
Performance Factors
 Want to distinguish elapsed time and the time spent on our task
 CPU execution time (CPU time) – time the CPU spends working on a task
 Does not include time waiting for I/O or running other programs

  CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
or
  CPU execution time for a program = # CPU clock cycles for a program / clock rate

 Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
 Many techniques that decrease the number of clock cycles also increase the clock cycle time
Review: Machine Clock Rate
 Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR

   10 nsec clock cycle => 100 MHz clock rate
    5 nsec clock cycle => 200 MHz clock rate
    2 nsec clock cycle => 500 MHz clock rate
    1 nsec clock cycle =>   1 GHz clock rate
  500 psec clock cycle =>   2 GHz clock rate
  250 psec clock cycle =>   4 GHz clock rate
  200 psec clock cycle =>   5 GHz clock rate
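Since rate and period are reciprocals, the conversions above are easy to verify in code (an illustrative one-liner; the helper name is my own):

```python
import math

def clock_rate_hz(cycle_time_s):
    """Clock rate is the reciprocal of the clock cycle time."""
    return 1.0 / cycle_time_s

# 2 ns -> 500 MHz, 250 ps -> 4 GHz (within floating-point tolerance)
print(math.isclose(clock_rate_hz(2e-9), 500e6))   # True
print(math.isclose(clock_rate_hz(250e-12), 4e9))  # True
```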
Clock Cycles per Instruction
 Not all instructions take the same amount of time to execute
 One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

  # CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction

 Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
 A way to compare two different implementations of the same ISA

  Instruction class:               A  B  C
  CPI for this instruction class:  1  2  3
Effective CPI
 Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

  Overall effective CPI = Σ (CPIi x ICi),  i = 1 to n

 Where ICi is the count (percentage) of the number of instructions of class i executed
 CPIi is the (average) number of clock cycles per instruction for that instruction class
 n is the number of instruction classes
 The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
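The summation is straightforward to compute directly; a minimal sketch, using a hypothetical instruction mix for classes A, B, and C (the frequencies below are invented for illustration):

```python
def effective_cpi(mix):
    """Overall effective CPI = sum over classes of CPI_i * IC_i,
    where IC_i is class i's fraction of the dynamic instruction count."""
    return sum(cpi * ic for cpi, ic in mix)

# Hypothetical mix: class A (CPI 1) 60%, B (CPI 2) 30%, C (CPI 3) 10%
cpi = effective_cpi([(1, 0.60), (2, 0.30), (3, 0.10)])
# 0.6 + 0.6 + 0.3 = 1.5
```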
THE Performance Equation
 Our basic performance equation is then

  CPU time = Instruction_count x CPI x clock_cycle
or
  CPU time = Instruction_count x CPI / clock_rate

 These equations separate the three key factors that affect performance
 Can measure the CPU execution time by running the program
 The clock rate is usually given
 Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
 CPI varies by instruction type and ISA implementation, for which we must know the implementation details
 instruction count != the number of lines in the code
Determinants of CPU Performance

  CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count  CPI  clock_cycle
  Algorithm               X                  X
  Programming language    X                  X
  Compiler                X                  X
  ISA                     X                  X    X
  Processor organization                     X    X
  Technology                                      X
A Simple Example

  Op      Freq  CPIi  Freq x CPIi  (load CPI=2)  (branch CPI=1)
  ALU     50%   1     .5           .5            .5
  Load    20%   5     1.0          .4            1.0
  Store   10%   3     .3           .3            .3
  Branch  20%   2     .4           .4            .2
                Σ =   2.2          1.6           2.0

 How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
 How does this compare with using branch prediction to shave a cycle off the branch time?
  CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
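The two speedups can be checked with the basic performance equation (a sketch; the function names are mine). Since instruction count and clock cycle are unchanged, the speedup reduces to a ratio of CPIs:

```python
def cpu_time(instr_count, cpi, clock_cycle_s):
    """CPU time = Instruction_count x CPI x clock_cycle."""
    return instr_count * cpi * clock_cycle_s

def speedup(cpi_old, cpi_new, instr_count=1e9, clock_cycle_s=1e-9):
    """IC and clock cycle cancel out, so any fixed values work here."""
    return (cpu_time(instr_count, cpi_old, clock_cycle_s) /
            cpu_time(instr_count, cpi_new, clock_cycle_s))

print(speedup(2.2, 1.6))  # ~1.375 -> faster loads: 37.5% faster
print(speedup(2.2, 2.0))  # ~1.1   -> branch prediction: 10% faster
```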
Comparing and Summarizing Performance
 How do we summarize the performance for a benchmark set with a single number?
 The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

  AM = (1/n) Σ Timei,  i = 1 to n

 Where Timei is the execution time for the ith program of a total of n programs in the workload
 A smaller mean indicates a smaller average execution time and thus improved performance
 Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
SPEC Benchmarks (www.spec.org)

Integer benchmarks:
  gzip     compression
  vpr      FPGA place & route
  gcc      GNU C compiler
  mcf      Combinatorial optimization
  crafty   Chess program
  parser   Word processing program
  eon      Computer visualization
  perlbmk  perl application
  gap      Group theory interpreter
  vortex   Object oriented database
  bzip2    compression
  twolf    Circuit place & route

FP benchmarks:
  wupwise  Quantum chromodynamics
  swim     Shallow water model
  mgrid    Multigrid solver in 3D fields
  applu    Parabolic/elliptic pde
  mesa     3D graphics library
  galgel   Computational fluid dynamics
  art      Image recognition (NN)
  equake   Seismic wave propagation simulation
  facerec  Facial image recognition
  ammp     Computational chemistry
  lucas    Primality testing
  fma3d    Crash simulation fem
  sixtrack Nuclear physics accel
  apsi     Pollutant distribution
Example SPEC Ratings
Other Performance Metrics

Power consumption – especially in the embedded market
where battery life is important (and passive cooling)

For power-limited applications, the most important metric is
energy efficiency
Summary: Evaluating ISAs
 Design-time metrics:
 Can it be implemented, in how long, at what cost?
 Can it be programmed? Ease of compilation?
 Static Metrics:
 How many bytes does the program occupy in memory?
 Dynamic Metrics:
 How many instructions are executed? How many bytes does the processor fetch to execute the program?
 How many clocks are required per instruction (CPI)?
 Best Metric: Time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques (Inst. Count x CPI x Cycle Time)
Problem 5 - Performance
 A certain computer has a CPU running at 500 MHz. Your PINE session on this machine ends up executing 200 million instructions, of the following mix:

  Type  CPI  Freq
  A     4    20%
  B     3    30%
  C     1    40%
  D     2    10%

 What is the average CPI for this session?
Answer: (20% · 4) + (30% · 3) + (40% · 1) + (10% · 2) = 2.3
 How much CPU time was used in this session?
Answer: (200 × 10^6 instructions) · (2.3 cycles/instruction) · (2 ns/cycle) = 0.92 seconds
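Both answers can be reproduced with a few lines of Python (a sketch using the given mix; the variable names are mine):

```python
# Instruction mix for the session: (CPI, frequency) per type A-D
mix = [(4, 0.20), (3, 0.30), (1, 0.40), (2, 0.10)]

# Average CPI is the frequency-weighted sum of the per-type CPIs
avg_cpi = sum(cpi * freq for cpi, freq in mix)      # ~2.3

# CPU time = instruction count x CPI x clock cycle time
instr_count = 200e6        # 200 million instructions
cycle_time_s = 1 / 500e6   # 500 MHz clock -> 2 ns period
cpu_time_s = instr_count * avg_cpi * cycle_time_s   # ~0.92 s
```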
