Transcript ppt

Dependencies
•
Problem with starting next instruction before first is finished
– dependencies that “go backward in time” are data hazards
Time (in clock cycles)
CC 1
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
10
10/–20
–20
–20
–20
–20
IM
Reg
DM
Reg
Value of
register $2:
Program
execution
order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
IM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
Reg
DM
Reg
2004 Morgan Kaufmann Publishers
1
Software Solution
•
•
Have compiler guarantee no hazards
Where do we insert the “nops” ?
sub
and
or
add
sw
•
$2, $1, $3
$12, $2, $5
$13, $6, $2
$14, $2, $2
$15, 100($2)
Problem: this really slows us down!
2004 Morgan Kaufmann Publishers
2
Forwarding
•
Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
Time (in clock cycles)
CC 1
CC 2
Value of register $2:
10
10
Value of EX/MEM:
X
X
Value of MEM/WB:
X
X
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
X
X
10
–20
X
10/–20
X
–20
–20
X
X
–20
X
X
–20
X
X
–20
X
X
DM
Reg
Program
execution
order
(in instructions)
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14,$2 , $2
sw $15, 100($2)
what if this $2 was $13?
IM
Reg
IM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
Reg
DM
Reg
2004 Morgan Kaufmann Publishers
3
Forwarding
•
The main idea (some details not shown)
ID/EX
EX/MEM
MEM/WB
M
u
x
ForwardA
Registers
ALU
M
u
x
Data
memory
M
u
x
ForwardB
Rs
Rt
Rt
Rd
EX/MEM.RegisterRd
M
u
x
Forwarding
unit
MEM/WB.RegisterRd
2004 Morgan Kaufmann Publishers
4
Can't always forward
•
Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction
that writes to the same register.
Time (in clock cycles)
CC 1
CC 2
CC 3
CC 4
CC 5
DM
Reg
CC 6
CC 7
CC 8
CC 9
Program
execution
order
(in instructions)
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
•
IM
Reg
IM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
Reg
DM
Reg
Thus, we need a hazard detection unit to “stall” the load instruction
2004 Morgan Kaufmann Publishers
5
Stalling
•
We can stall the pipeline by keeping an instruction in the same stage
Time (in clock cycles)
CC 1
CC 2
CC 3
CC 4
CC 5
Reg
DM
Reg
CC 6
CC 7
CC 8
CC 9
CC 10
Program
execution
order
(in instructions)
lw $2, 20($1)
IM
bubble
and becomes nop
add $4, $2, $5
or $8, $2, $6
add $9, $4, $2
IM
Reg
IM
DM
Reg
IM
Reg
DM
DM
Reg
IM
Reg
Reg
Reg
DM
Reg
2004 Morgan Kaufmann Publishers
6
Hazard Detection Unit
•
Stall by letting an instruction that won’t write anything go forward
Hazard
detection
unit
ID/EX.MemRead
ID/EX
WB
M
u
x
Control
0
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
M
u
x
Registers
M
u
x
ALU
PC
Instruction
memory
M
u
x
Data
memory
IF/ID.RegisterRs
IF/ID.RegisterRt
IF/ID.RegisterRt
Rt
IF/ID.RegisterRd
Rd
M
u
x
ID/EX.RegisterRt
Rs
Rt
Forwarding
unit
2004 Morgan Kaufmann Publishers
7
Branch Hazards
•
When we decide to branch, other instructions are in the pipeline!
Time (in clock cycles)
CC 1
CC 2
CC 3
CC 4
CC 5
DM
Reg
CC 6
CC 7
CC 8
CC 9
Program
execution
order
(in instructions)
40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
•
IM
Reg
IM
Reg
IM
DM
Reg
IM
Reg
DM
Reg
IM
Reg
DM
Reg
Reg
DM
Reg
We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
2004 Morgan Kaufmann Publishers
8
Flushing Instructions
IF.Flush
Hazard
detection
unit
ID/EX
WB
Control
0
IF/ID
M
u
x
+
EX/MEM
M
WB
EX/MEM
EX
M
WB
+
4
M
u
x
Shift
left 2
Registers
PC
=
M
u
x
Instruction
memory
ALU
M
u
x
Data
memory
M
u
x
Sign
extend
M
u
x
Fowarding
unit
Note: we’ve also moved branch decision to ID stage
2004 Morgan Kaufmann Publishers
9
Branches
•
•
•
•
If the branch is taken, we have a penalty of one cycle
For our simple design, this is reasonable
With deeper pipelines, penalty increases and static branch prediction
drastically hurts performance
Solution: dynamic branch prediction
Taken
Not taken
Predict taken
Predict taken
Taken
Not taken
Taken
Not taken
Predict not taken
Predict not taken
Taken
Not taken
A 2-bit prediction scheme
2004 Morgan Kaufmann Publishers
10
Branch Prediction
•
Sophisticated Techniques:
– A “branch target buffer” to help us look up the destination
– Correlating predictors that base prediction on global behavior
and recently executed branches (e.g., prediction for a specific
branch instruction based on what happened in previous branches)
– Tournament predictors that use different types of prediction
strategies and keep track of which one is performing best.
– A “branch delay slot” which the compiler tries to fill with a useful
instruction (make the one cycle delay part of the ISA)
•
Branch prediction is especially important because it enables other
more advanced pipelining techniques to be effective!
•
Modern processors predict correctly 95% of the time!
2004 Morgan Kaufmann Publishers
11
Improving Performance
•
Try and avoid stalls! E.g., reorder these instructions:
lw
lw
sw
sw
$t0,
$t2,
$t2,
$t0,
0($t1)
4($t1)
0($t1)
4($t1)
•
Dynamic Pipeline Scheduling
– Hardware chooses which instructions to execute next
– Will execute instructions out of order (e.g., doesn’t wait for a
dependency to be resolved, but rather keeps going!)
– Speculates on branches and keeps the pipeline full
(may need to rollback if prediction incorrect)
•
Trying to exploit instruction-level parallelism
2004 Morgan Kaufmann Publishers
12
Advanced Pipelining
•
•
•
•
Increase the depth of the pipeline
Start more than one instruction each cycle (multiple issue)
Loop unrolling to expose more ILP (better scheduling)
“Superscalar” processors
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
•
All modern processors are superscalar and issue multiple
instructions usually with some limitations (e.g., different “pipes”)
•
VLIW: very long instruction word, static multiple issue
(relies more on compiler technology)
•
This class has given you the background you need to learn more!
2004 Morgan Kaufmann Publishers
13
Chapter 6 Summary
•
Pipelining does not improve latency, but does improve throughput
Deeply
pipelined
Multicycle
(Section 5.5)
Pipelined
Multiple issue
with deep pipeline
(Section 6.10)
Multiple issue
with deep pipeline
(Section 6.10)
Multiple-issue
pipelined
(Section 6.9)
Multiple-issue
pipelined
(Section 6.9)
Single-cycle
(Section 5.4)
Deeply
pipelined
Multicycle
(Section 5.5)
Single-cycle
(Section 5.4)
Slower
Pipelined
Faster
Instructions per clock (IPC = 1/CPI)
1
Several
Use latency in instructions
2004 Morgan Kaufmann Publishers
14
Chapter Seven
2004 Morgan Kaufmann Publishers
15
Memories: Review
•
SRAM:
– value is stored on a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
•
DRAM:
– value is stored as a charge on capacitor (must be refreshed)
– very small but slower than SRAM (factor of 5 to 10)
Word line
A
A
B
B
Pass transistor
Capacitor
Bit line
2004 Morgan Kaufmann Publishers
16
Exploiting Memory Hierarchy
•
Users want large and fast memories!
SRAM access times are .5 – 5ns at cost of $4000 to $10,000 per GB.
DRAM access times are 50-70ns at cost of $100 to $200 per GB.
Disk access times are 5 to 20 million ns at cost of $.50 to $2 per GB.
•
2004
Try and give it to them anyway
– build a memory hierarchy
CPU
Level 1
Increasing distance
from the CPU in
access time
Levels in the
Level 2
memory hierarchy
Level n
Size of the memory at each level
2004 Morgan Kaufmann Publishers
17