Transcript Pipeline

Pipelining Processor
Natawut Nupairoj
Assembly Language
1
Instruction Cycle
pc = 0;
do {
ir := memory[pc++];
decode(ir);
fetch(operands);
execute;
store(results);
} while(ir != HALT);
Natawut Nupairoj
{
{
{
{
{
Fetch the instruction. }
Decode the instruction. }
Fetch the operands. }
Execute the instruction. }
store the results. }
Assembly Language
2
Pipelining
• Improve the execution speed.
• Divide instruction cycle into “stages”.
• Each stage executes independently and
concurrently.
• Pipelining is natural !!!
(from David Patterson’s lecture note.)
Natawut Nupairoj
Assembly Language
3
Natawut Nupairoj
Assembly Language
4
Natawut Nupairoj
Assembly Language
5
Natawut Nupairoj
Assembly Language
6
Pipelining Lessons
• Pipelining doesn’t help
latency of single task.
• It helps throughput of
entire workload.
• Multiple tasks operating
simultaneously using
different resources.
• Potential speedup =
number pipe stages
Natawut Nupairoj
Assembly Language
7
Pipelining in Modern Processor
• Instruction cycle is divided into five stages:
Fetch
Natawut Nupairoj
Decode
Operand
Fetch
Assembly Language
Execute
Store
8
Pipelining Execution
Time
Inst 1
Inst 2
Inst 3
Natawut Nupairoj
1
F
2
D
F
3
O
D
F
4
E
O
D
Assembly Language
5
S
E
O
6
7
S
E
S
9
Performance of Pipeline
• What do we gain ?
• Suppose we execute 1000 instructions on nonpipelined and pipelined CPUs.
• Clock speed = 500 MHz (1 clock = 2 ns.)
• non-pipelined CPU:
– total time = 2ns/cycle x 5 cycles/inst x 1000 instr.
= 10 ms.
• Perfect pipelined CPU:
– total time = 2ns/cycle x (1 cycle/inst x 1000 instr. + 4
cycles drain)
= 2.008 ms.
Natawut Nupairoj
Assembly Language
10
Nothing is perfect !!!
• Problem with branch.
• Don’t know what to fetch next until decoded.
Time
1
Inst 1
F
Inst 2 : JMP X
Inst X
2
D
F
3
O
D
4
E
O
F
5
S
E
D
6
7
8
S
O
E
S
Branch target address is
not available until here !!!
Natawut Nupairoj
Assembly Language
11
Stalled Pipe
• When pipelining is not smooth, we called it is
“stalled”.
• Branch and others ?
– Subroutine calling
– Memory accessing
– Multi-cycle execution
• Can we do better ? YES…but discuss later.
Natawut Nupairoj
Assembly Language
12
Branching in Sparc
• Sparc uses a 5-stage pipeline.
• Recall: pipe is stalled due to branch !!!
Time
1
Inst 1
F
Inst 2 : JMP X
Inst X
2
D
F
3
O
D
4
E
O
F
5
S
E
D
6
7
8
S
O
E
S
Branch target address is
not available until here !!!
Natawut Nupairoj
Assembly Language
13
Branching in Sparc
• However, Sparc does not stall but will execute
the instruction next to the brach (or call)
instruction BEFORE it actually branches.
• This is called “delay slot”.
Natawut Nupairoj
Assembly Language
14
Delay Slot
Time
1
Inst 1
F
Inst 2 : JMP X
Inst 3 Delay Slot
Inst X
2
D
F
3
O
D
F
4
E
O
D
F
5
S
E
O
D
6
7
8
S
E
O
S
E
S
Branch target address is
not available until here !!!
Natawut Nupairoj
Assembly Language
15
Filling Delay Slots with NOP
main:
.global
save
mov
sub
add
call
nop
add
call
nop
mov
main
%sp, -64, %sp
9, %l0
%l0, 2, %o0
%l0, 14, %o1 !
.mul
!
%l0, 8, %o1
!
.div
!
%o0, %l1
mov
ta
1, %g1
0
Natawut Nupairoj
Instruction before branch
Delay slot => wasted
Instruction before branch
Delay slot => wasted
Assembly Language
16
Filling Delay Slots
main:
.global
save
mov
sub
call
add
call
add
mov
main
%sp, -64, %sp
9, %l0
%l0, 2, %o0
.mul
%l0, 14, %o1 ! Delay slot filled
.div
%l0, 8, %o1
! Delay slot filled
%o0, %l1
mov
ta
1, %g1
0
Natawut Nupairoj
Assembly Language
17
Optimizing Our Second Program
• Can we fill the delay slot ?
...
mov
add
cmp
bl
nop
...
%o0, %l1
%l0, 1, %l0
%l0, 11
loop
! Store it in y
! x++
! x < 11 ?
! Delay slot => wasted
• Not with cmp, not add (cmp depends on add).
• mov can !!! No other instructions after that (and
before bl) depend on this instruction.
Natawut Nupairoj
Assembly Language
18
Optimizing Our Second Program
...
mov
add
cmp
bl
mov
...
%o0,
%l0,
%l0,
loop
%o0,
%l1
1, %l0
11
! Store it in y
! x++
! x < 11 ?
%l1
! Store it in y
• The key is to fill the delay slot with the
instruction that has no other instruction
depends on its result !!!
Natawut Nupairoj
Assembly Language
19
Filling Delay Slot Summary
• After branch and call, there is one delay slot.
• Always feel the delay slot to improve
performance.
• When filling the slot, don’t change the results
the program computes.
• No other instructions (before the branch and the
branch itself) depend on the instruction in the
delay slot.
• You can always fill the slot with “nop”.
Natawut Nupairoj
Assembly Language
20
Do…While Delay Slot
• How to fill the delay slot ?
– Independent instruction
– target instruction with annulled branch
• when you cannot find any independent instruction
• Independent instruction
– see our second program
Natawut Nupairoj
Assembly Language
21
Filling with Target Instruction
loop:
sub
call
sub
call
sub
mov
add
cmp
bl,a
sub
Natawut Nupairoj
%l0, 1, %o0
.mul
%l0, 7, %o1
.div
%l0, 11, %o1
%o0, %l1
%l0, 1, %l0
%l0, 11
loop
%l0, 1, %o0
!(x-1) to %o0, execute once
!(x-7) to %o1, delay slot
!(x-11) to %o1, delay slot
! Store it in y
! x++
! x < 11 ?
!(x-1) to %o0 (delay slot)
Assembly Language
22
Annulled Branch
• Execute an instruction in the delay slot if and
only if branch occurs.
• Program is one instruction longer and waste one
cycle when the loop exits.
• Do not need to find an independent instruction.
• Can be used with any type of branches.
Natawut Nupairoj
Assembly Language
23
While Loop Optimization
• Reduce number of instructions to be executed
inside the loop.
– By first jumping to the comparison at the end of the
loop.
• Then fill the delay slot.
Natawut Nupairoj
Assembly Language
24
While Loop Optimization Example
ba
nop
test
! Initial jump
! Delay slot
add
add
%l0, %l1, %l0
%l2, 1, %l2
! a = a + b
! c++
cmp
ble
nop
%l0, 17
loop
! Check condition
! Repeat if true
! Delay slot
loop:
test:
Natawut Nupairoj
Assembly Language
25
While Loop Optimization Example
loop:
test:
ba
test
cmp %l0, 17
add %l2, 1, %l2
cmp %l0, 17
ble,a loop
add %l0, %l1, %l0
!
!
!
!
!
!
Initial jump
Check condition (DS)
c++
Check condition
Repeat if true
very tricky! (DS)
• Performance Improvement:
– Direct translation = 7*number of loop iterations
– With initial jumping = 5*number of loop iterations
– And filling delay slots = 4*number of loop iterations
Natawut Nupairoj
Assembly Language
26