Transcript Hazards

UNIT III -PIPELINE
1
Outline
•
•
•
•
Pipelining exercise
What is data hazards
Data hazard solutions
Assignment 1 Solution
2
Pipeline Exercise
• Consider a nonpipelined machine with 6 execution stages of lengths 50 ns,
50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.
– Find the instruction latency on this machine.
– How much time does it take to execute 100 instructions?
• Solution:
• Instruction latency = 50+50+60+60+50+50= 320 ns
Time to execute 100 instructions = 100*320 = 32000 ns
3
• Suppose we introduce pipelining on this machine. Assume
that when introducing pipelining, the clock skew adds 5ns of
overhead to each execution stage.
– What is the instruction latency on the pipelined machine?
– How much time does it take to execute 100 instructions?
• Solution:
• Note: the length of the pipe stages must all be the same,
The length of pipelined stage = MAX(lengths of unpipelined
stages) + overhead = 60 + 5 = 65 ns
• Instruction latency = 65 ns
• Time to execute 100 instructions = 65*6 + 65*99 = 390 +
6435 = 6825 ns
4
• What is the speedup obtained from pipelining?
(here we do not consider any stalls introduced by different
types of hazards which we will look at in the next section)
• Solution:
Speedup = Old Execution Time / New Execution Time
= 32000 / 6825
= 4.69
5
Pipelining Hazards
• Hazards prevent next instruction from
executing during its designated clock cycle
– Structural hazards
• caused by hardware resource conflicts
– Data hazards
• arise when an instruction depends on the results of a
previous instruction
– Control hazards
• caused by change of control (e.g. jump)
6
Data Hazards
• Data hazards occur when data is used before it is ready
Time (in clock cycles)
CC 1
Value of
register $2: 10
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
10/– 20
– 20
– 20
– 20
– 20
DM
Reg
Program
execution
order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
IM
Reg
IM
DM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
Reg
Reg
DM
Reg
The use of the result of the SUB instruction in the next three instructions causes a
data hazard, since the register $2 is not written until after those instructions read it.
7
Data Hazards
Execution Order is:
InstrI
InstrJ
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a “Dependence” (in compiler nomenclature).
This hazard results from an actual need for communication.
8
Data Hazards
Execution Order is:
InstrI
InstrJ
Write After Read (WAR)
InstrJ tries to write operand before InstrI reads i
– Gets wrong operand
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
9
Data Hazards
Execution Order is:
InstrI
InstrJ
Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
– Leaves wrong result ( InstrI not InstrJ )
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers
This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW later in more complicated pipes
10
Data Hazards Solutions
• Solutions for Data Hazards
– Stalling
• Add bubbles
– Forwarding
• Connect new value directly to next stage
– Reordering
11
Data Hazard - Stalling
0
add $s0,$t0,$t1
2
IF
4
ID
6
EX
8
MEM
10
W
s0
12
16
18
$s0
written
here
STALL
BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE
STALL
BUBBLE BUBBLE BUBBLE BUBBLE BUBBLE
sub $t2,$s0,$t3
IF
R
s0
EX
MEM
WB
$s0 read
here
12
Data Hazards - Forwarding
• Key idea: connect new value directly to next stage
• Still read s0, but ignore in favor of new result
• Problem: what about load instructions?
13
Data Hazards - Forwarding
• STALL still required for load - data avail. after MEM
• MIPS architecture calls this delayed load, initial
implementations required compiler to deal with this
0
lw $s0 ,20($t1 )
2
IF
4
ID
6
ID
EX
8
MEM
10
12
16
18
W
s0
new value
of s0
STALL
BUBBLE
sub $t2, $s0 ,$t3
BUBBLE
IF
BUBBLE
R
s0
BUBBLE
BUBBLE
EX
MEM
WB
14
Forwarding
Key idea: connect data internally before it's stored
Time (in clock cycles)
CC 1
Value of
register $2: 10
Program
execution
order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
10/– 20
– 20
– 20
– 20
– 20
IF/ID
IM
ID/EX
EX/MEM
Reg
IM
MEM/WB
DM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
IM
Reg
Reg
Reg
DM
Reg
How would you design the forwarding?
15
Data Hazards - Reordering
• Assuming we have data forwarding, what are
the hazards in this code?
lw
lw
sw
sw
$t0,
$t2,
$t2,
$t0,
0($t1)
4($t1)
0($t1)
4($t1)
• Reorder instructions to remove hazard:
lw
lw
sw
sw
$t0,
$t2,
$t0,
$t2,
0($t1)
4($t1)
4($t1)
0($t1)
16
Assignment 1 Solution
17
Question 1
• You are trying to figure out whether to construct a new fabrication
facility for your IBM Power5 chips.
• It costs $1.5 billion to build a new fabrication facility.
• The benefit of the new fabrication is that you predict that you will
be able to sell 3 times as many chips at 2 times the price of the old
chips.
• Assume the wafer has a diameter of 300 mm. In both the old and
new fabrications, it costs $1000 to fabricate a wafer, and the
packaging and testing cost is $20 per chip (after final testing). You
were preciously selling the chips for 40% more than their cost.
Assume a = 4, Wafer Yield = 1.
• The fabrication process’ parameters are shown in the following
table.
Chip
Old
IBM Power5
New
IBM Power5
Die size Estimated defect rate Manufacturing
(mm2)
(per cm2)
feature size (nm)
389
.30
130
Transistors
(millions)
276
186
-
.70
-
18
Question 1
• Cost Formula Summary
IC Cost 
Die Cost  Testing Cost  Packaging Cost
Final Test Yield
Die Cost 
Wafer Cost
Dies Wafer   Die Yield
Dies per Wafer 
 Wafer diameter 2
2
Die Area

 Wafer diameter 
2  Die Area 1
2
Defects per unit area  Die Area 


Dies Yield  Wafer Yield  1 
a


a
19
Question 1
a) [2 Points] With the old fabrication, how many dies can we get from a wafer
before we test individual dies? (Round the result to an integer number)
pi  30 / 2
pi  30
Dies _ per _ wafer 

 182  34  148
3.89
2  3.89
2
Dies per Wafer 
 Wafer diameter 2
2
Die Area

 Wafer diameter 
2  Die Area 1
2
b) [2 Points] With the old fabrication, what is the die yield?
 0.30  3.89 
Yield  1 

4


4
 0.36
Defects per unit area  Die Area 

Dies Yield  Wafer Yield  1 

a


a
c) [2 Points] What is the cost of the old Power5 chip?
Cost _ per _ die 
$1000
 $18.77
148  0.36
Chip _ cos t  $18.77  20  $38.77
Die Cost 
IC Cost 
Wafer Cost
Dies Wafer   Die Yield
Die Cost  Testing Cost  Packaging Cost
Final Test Yield
20
Question 1
d) [2 Points] With the new fabrication, how many dies can we get from a
wafer before we test individual dies? (Round the result to an integer number)
 

300
2
300
Die
_
per
_
Wafer
 
380

49

331
186
2

186
2
Dies per Wafer 
 Wafer diameter 2
2
Die Area

 Wafer diameter 
2  Die Area 1
2
e) [2 Points] With the new fabrication, what is the die yield?

4
Defects per unit area  Die Area 

.
007

186
0


Dies Yield  Wafer Yield  1 
Die
_
Yield

1

1


0
.
32


a


4

a
f) [2 Points] What is the cost of the new Power5 chip?
Die _ Cost 
$1000
 $9.44
331  0.32
Chip _ cos t  $9.44  20  $29.44
Die Cost 
IC Cost 
Wafer Cost
Dies Wafer   Die Yield
Die Cost  Testing Cost  Packaging Cost
Final Test Yield
21
Question 1
g) [2 Points] What was the selling price of each old Power5 chip?
$38.77  (1  40%)  $54.28
h) [2 Points] What is the selling price of each new Power5 chip? What is the
difference between the selling price and the cost of the new chip?
selling _ price  $54.28  2  $108.56
difference _ above _ cos t  $108.56  $29.44  $79.12
i) [4 Points] Suppose 50% of the difference between the selling price and the
chip cost is your profit. If you sold 500,000 old Power5 chips per month, how
long would it take for the accumulated profit to recoup the costs of the new
fabrication facility?
1.5  10 9
 25.28months
79.12
 500000  3
2
22
Question 2
• An application running on a 1GHz pipelined processor has 55%
load-store, 30% arithmetic, and 15% branch instructions. The
individual CPIs of these instructions are 5, 4 and 4, respectively.
• a) [6 Points] Determine the overall CPI of this program execution on
the given processor.
CPI

5

55
%

4

30
%

4

15
%

4
.
55
23
Question 2
• b) A new embedded version of the processor is being modified to operate
at 600 MHz. In this new version, the individual CPIs of load-store and
arithmetic instructions are remaining unchanged. However, the individual
CPI of branch instructions is getting stretched to 6 clock cycles. A new
compiler is also developed for the new processor which eliminates 25% of
load-store and 5% of arithmetic instructions for the given application (e.g.,
if there were 1 million load-store instructions before, now the program
compiled by the new compiler would have only 750000).
• b1) [6 Points] Determine the overall CPI of this program execution on the
new processor together with the new compiler technology.








5

55
%

1

25
%

4

30
%

1

5
%

6

15
%

4
.
84




55
%

1

25
%

30
%

1

5
%

15
%
24
Question 2
• b) A new embedded version of the processor is being modified to operate
at 600 MHz. In this new version, the individual CPIs of load-store and
arithmetic instructions are remaining unchanged. However, the individual
CPI of branch instructions is getting stretched to 6 clock cycles. A new
compiler is also developed for the new processor which eliminates 25% of
load-store and 5% of arithmetic instructions for the given application (e.g.,
if there were 1 million load-store instructions before, now the program
compiled by the new compiler would have only 750000).
• b2) [8 Points] Determine the factor by which the application will run faster
or slower on the new processor with the new compiler technology.
Old
_
Execution
_
Time
Speedup

New
_
Execution
_
Time
I CPI
Clock
_
Cycle
old
 old old
Inew
CPI
Clock
_
Cycle
new
new
 
9
Iold
4
.55
1
10

6
Iold


4

55
%

1

25
%
30
%

1

5
%
15
%
.84
1600
10



0
.67
25
Question 3
• Suppose there is a program which takes 200 seconds to execute. Of
this time, 30% is used for multiplication, 60% for memory access
instructions and 10% for other tasks. Suppose your goal is to
enhance the performance of a processor by 2 times and there are
two ways of doing so: either make multiply instructions run faster
than before, or memory access instructions run faster than before,
but not both.
• a) [6 Points] How much shall we improve the memory access
instructions in order to achieve the performance enhancement? Is
it possible?
1
Speedup


2
60
%


1

60
%

x
x6
Thus, it is possible if we improve the memory access instructions by 6 times.
26
Question 3
• Suppose there is a program which takes 200 seconds to execute. Of
this time, 30% is used for multiplication, 60% for memory access
instructions and 10% for other tasks. Suppose your goal is to
enhance the performance of a processor by 2 times and there are
two ways of doing so: either make multiply instructions run faster
than before, or memory access instructions run faster than before,
but not both.
• b) [6 Points] How much shall we improve the multiplication in order
to achieve the performance enhancement? Is it possible?
1
Speedup


2
30
%


1

30
%

x
x  1.5
Thus, it is impossible.
27
Question 3
• Suppose there is a program which takes 200 seconds to execute. Of
this time, 30% is used for multiplication, 60% for memory access
instructions and 10% for other tasks. Suppose your goal is to
enhance the performance of a processor by 2 times and there are
two ways of doing so: either make multiply instructions run faster
than before, or memory access instructions run faster than before,
but not both.
• c) [8 Points] Suppose we now improve the memory access by 5
times (Time(old execution) / Time(new execution) = 5).
Unfortunately, the design makes multiplication slow down by 20%
(i.e., Time(old execution) / Time(new execution) = 5/6). What will
the overall speedup be?
Speedup 
Told
Told

 1.72
60
%
Tnew 

 30%  1  20%   10%   Told

 5

28
Question 4
•
The following is a set of individual benchmark scores for each of the programs in
the integer portion of the SPEC2000 benchmark.
Benchmark
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
•
Score before improvement
10
14
23
36
9
12
25
18
30
17
7
38
Score after improvement
12
16
28
40
12
120
28
21
28
21
10
42
a) [6 Points] Calculate the speedup after the improvement using the arithmetic
mean.
10  14  23  36  9  12  25  18  30  17  7  38
 19.92
12
12  16  28  40  12  120  28  21  28  21  10  42

 31.5
12
Arithmetic _ meanold 
Arithmetic _ meannew
Speedup = 31.5/19.92 = 1.58
29
Question 4
•
The following is a set of individual benchmark scores for each of the programs in
the integer portion of the SPEC2000 benchmark.
Benchmark
164.gzip
175.vpr
176.gcc
181.mcf
186.crafty
197.parser
252.eon
253.perlbmk
254.gap
255.vortex
256.bzip2
300.twolf
•
Score before improvement
10
14
23
36
9
12
25
18
30
17
7
38
Score after improvement
12
16
28
40
12
120
28
21
28
21
10
42
b) [6 Points] Calculate the speedup after the improvement using the geometric
mean.
Geometric _ meanold  12 10 14  23  36  9 12  25 18  30 17  7  38  17.39
Geometric _ meannew  12 12 16  28  40 12 120  28  21 28  2110  42  24.42
Speedup = 24.42/17.39 = 1.40.
30
Question 4
• c) [8 Points] Note that there is a difference between the above two
improvement ratios, what is the main reason?
• The arithmetic mean is much more sensitive to large changes in one of the
values in the set than the geometric mean.
• Most of the individual benchmarks see relatively small changes as we add
the improvement to the architecture, but 197.parser improves by a factor
of 10. This causes the arithmetic mean to increase by almost 60 percent,
while the geometric mean increases by only 40 percent.
• This reduced sensitivity to individual values is why benchmarking experts
prefer the geometric mean for averaging the results of multiple
benchmarks, since one very good or very bad result in the set of
benchmarks has less of an impact on the overall score.
31
Question 5
•
A given processor has 32 registers, uses 16-bit immediates, and has 142
instructions (corresponding to 142 operation codes) in its ISA.
–
–
–
–
•
20 percent of the instructions take one source register and have one destination register
30 percent of the instructions have two source registers and have one destination register
25 percent of the instructions have one source register and have one destination register, and take
an immediate source operand
25 percent have one immediate source operand and one destination register
a) [10 Points] For each of the four types of instructions, how many bits are
required? Assume that the ISA requires that all instructions be a multiple of 8 bits
in length, and the operation codes (opcodes) are of fixed length (i.e., the ISA does
not use shorter opcodes for some instructions and longer opcodes for others).
8 bits are required to encode 142 instructions (128 < 142 < 256).
5 bits are required to encode 32 registers.
16 bits are required to encode each immediate.
One source register, one destination register: 8 + 5 + 5 = 18 bits, which rounds up to 24 bits.
Two source registers, one destination register: 8 + 5 + 5 + 5 = 23 bits, which rounds up to 24 bits.
One source register, one destination register, and an immediate: 8 + 5 + 5 + 16 = 34 bits, which rounds up to 40
bits.
One immediate, one destination register: 8 + 16 + 5 = 29 bits, which rounds up to 32 bits.
32
Question 5
•
A given processor has 32 registers, uses 16-bit immediates, and has 142
instructions (corresponding to 142 operation codes) in its ISA.
–
–
–
–
•
20 percent of the instructions take one source register and have one destination register
30 percent of the instructions have two source registers and have one destination register
25 percent of the instructions have one source register and have one destination register, and take
an immediate source operand
25 percent have one immediate source operand and one destination register
b) [10 Points] How much less memory does the program take up if a variablelength instruction set encoding is used as opposed to a fixed-length encoding?
Since the longest instruction type requires 40 bits to encode, the fixed-length encoding
will have 40 bits per instruction.
For the variable-length encoding, the average number of bits per instruction is:
24  20% + 24  30% + 40  25% + 32  25% = 30 bits.
(40  30) / 40 = 25%.
Therefore, the variable-length encoding requires 25 percent less space than the fixedlength encoding for this program.
33