Transcript powerpoint

10/11: Lecture Topics
• Execution cycle
• Introduction to pipelining
• Data hazards
Office Hours
• Changing Th 8:30-9:30 to Mo 2-3
• New office hours are
– Mo 10:30-11:30
– Mo 2:00-3:00
– Tu 2:30-3:30
Execution Cycle
IF
ID
EX
MEM
WB
• Five steps to executing an instruction:
1. Fetch
• Get the next instruction to execute from
memory onto the chip
2. Decode
• Figure out what the instruction says to do
• Get values from registers
3. Execute
• Do what the instruction says; for example,
– On a memory reference, add up base and offset
– On an arithmetic instruction, do the math
More Execution Cycle
IF
ID
EX
MEM
WB
4. Memory Access
• If it’s a load or store, access memory
• If it’s a branch, replace the PC with the
destination address
• Otherwise do nothing
5. Write back
• Place the result of the operation in the
appropriate register
add $s0, $s1, $s2
• IF get instruction at PC from memory
– it’s 000000 10001 10010 10000 00000 100000
• ID determine what 000000 … 100000 is
– 000000 … 100000 is add
– get contents of $s1 and $s2 ($s1=7, $s2=12)
• EX add 7 and 12 = 19
• MEM do nothing
• WB store 19 in $s0
lw $t2, 16($s0)
• IF get instruction at PC from memory
– it’s 010111 10000 01000 0000000000010000
• ID determine what 010111 is
– 010111 is lw
– get contents of $s0 and $t2 (we don’t know that
we don’t care about $t2) $s0=0x200D1C00,
$t2=77763
• EX add 16 to 0x200D1C00 = 0x200D1C10
• MEM load the word stored at 0x200D1C10
• WB store loaded value in $t2
Latency & Throughput
1
2
3
4
5
IF
ID
EX
MEM
WB
6
IF
7
ID
8
EX
9
MEM
10
WB
• Latency—the time it takes for an individual
instruction to execute
– What’s the latency for this implementation?
• Throughput—the number of instructions
that execute per minute
– What’s the throughput of this implementation?
inst 1
inst 2
A case for pipelining
• The functional units are being underutilized
– the instruction fetcher is used once every five clock cycles
– why not have it fetch a new instruction every clock cycle?
• Pipelining overlaps the stages of execution so every
stage has something to due each cycle
• A pipeline with N stages could speedup by N times, but
– each stage must take the same amount of time
– each stage must always have work to do
• Also, latency for each instruction may go up, but why don’t
we care?
Unpipelined Assembly Line
• What is the latency of this
assembly line, i.e. for how many
cycles is the plane on the assembly
line?
• What is the throughput of this
assembly line, i.e. how many
planes are manufactured each
cycle?
Pipelined Assembly Line
• The assembly line has 5 stages
• If a plane isn’t ready to go to the next
stage then the pipeline stalls
– that stage and all stages before it freeze
• The gap in the assembly line is known
as a bubble
Pipelined Analysis
• What is the latency?
• What is the throughput?
• What is the speed up? (Speed up = Old
Time / New Time)
Pipeline Example
1 2 3 4 5 6 7 8 9 10
add $s0, $s1, $s2
sub $s3, $s2, $s3
lw
$s2, 20($t0)
sw
$s0, 16($s1)
and $t1, $t2, $t3
IF
ID
EX
MEM
WB
Pipelined Xput and Latency
1
2
3
4
5
6
7
8
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
9
inst 1
inst 2
inst 3
inst 4
WB
inst 5
• What’s the throughput of this implementation?
• What’s the latency of this implementation?
Data Hazards
• What happens in the following code?
add $s0, $s1, $s2
add $s4, $s3, $s0
IF
ID
EX
MEM WB
IF
ID
EX
$s0 is
read here
MEM WB
$s0 is
written here
• This is called as a data dependency
• When it causes a pipeline stall it is
called a data hazard
Solution: Stall
• Stall the pipeline until the result is
available
add s0,s1,s2
add s4,s3,s0
IF
ID
IF
EX
MEM WB
stall
ID
EX
• Stall the pipeline until the result is
available
MEM WB
Solution: Read & Write in
same Cycle
• Write the register in the first part of the clock
cycle
• Read it in the second part of the clock cycle
write $s0
add s0,s1,s2
add s4,s3,s0
IF
ID
IF
EX
MEM WB
stall
ID
read $s0
EX
• A stall of two cycles is still required
MEM WB
Solution: Forwarding
• The value of $s0 is known after cycle 3
(after the first instruction’s EX stage)
• The value of $s0 isn’t needed until cycle 4
(before the second instruction’s EX stage)
• If we forward the result there isn’t a stall
add s0,s1,s2
add s4,s3,s0
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
Another data hazard
• What if the first instruction is lw?
lw
s0,0(s2)
add s4,s3,s0
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
• s0 isn’t known until after the MEM stage
• We can’t forward back into the past
• Either stall or reorder instructions
Solutions to the lw hazard
• We can stall for one cycle, but we hate to stall
lw
s0,0(s2)
IF
add s4,s3,s0
ID
EX
MEM WB
IF
ID
EX
MEM WB
• Try to execute an unrelated instruction between
the two instructions
lw
s0,0(s2)
sub t4,t2,t3
add s4,s3,s0
sub t4,t2,t3
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
Reordering Instructions
• Reordering instructions is a common
technique for avoiding pipeline stalls
• Sometimes the compiler does the
reordering statically
• Almost all modern processors do this
reordering dynamically
– they can see several instructions and they
execute anyone that has no dependency
– this is known as out-of-order execution
and is very complicated to implement
Control Hazards
• Branch instructions cause control hazards
because we don’t know which instruction to
execute next
bne $s0, $s1, next
add $s4, $s3, $s0
...
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
next: sub $s4, $s3, $s0
do we
fetch add
or sub?
we don’t
know until
here