Transcript Recap

(Recap)
Control Hazards
COMP381 by M. Hamdi
1
Control (Branch) Hazard
A: beqz r2, label
B:
--------label:
P:
Problem:
The outcome of the branch (taken/not taken) is determined deep in the pipeline
Do not know whether to execute B or execute P?
Solutions:
• Stall pipeline for required number of cycles
• Methods to reduce the branch penalty:
1. Speculatively predict branch outcome and recover if mispredicted
2. Delayed Branch: Always execute instruction(s) following the branch
COMP381 by M. Hamdi
2
Branch Hazards: Stall pipeline
A: beqz r2, label
stall
stall
stall
B
----label: P
A
B
P
1
IF
2
ID
Execution sequence:
A, NOP, NOP, NOP, B …..
or A, NOP, NOP, NOP, P
3
EX
4
MEM
5
WB
6
7
8
9
IF
ID
EX
MEM
WB
COMP381 by M. Hamdi
3
Reducing Branch Penalty
1. Stall scheme had a branch penalty of 3 cycles.
CPI assuming branch frequency of 30% is 1.9 (with no data stalls)
2. Predicting branch not taken:
• Speculatively fetch and execute in-line instructions following the branch
• If prediction incorrect flush pipeline of speculated instructions
• Convert these instructions to NOPs by clearing pipeline registers.
• Fetch instructions from the branch target
COMP381 by M. Hamdi
4
Predict branch not taken
B
IF
A
ID
T=2
C
IF
B
ID
T=3
D
IF
C
IF
D
T=1
EX
MEM
WB
A
EX
MEM
WB
ID
B
EX
A
MEM
WB
ID
C
EX
B
MEM
Branch actually not
taken: No stalls
T=4
E
COMP381 by M. Hamdi
A
WB
5
Predict branch not taken
B
IF
A
ID
T=2
C
IF
B
ID
T=3
D
IF
C
ID
T=1
EX
MEM
WB
A
EX
MEM
WB
B
EX
MEM
WB
A
Branch actually taken:
FLUSH pipeline
Make B,C,D NOPS
T=4
P
IF
ID
EX
COMP381 by M. Hamdi
MEM
A
WB
6
Reducing Branch Stall Cycles
1- Find out whether a branch is taken earlier in the pipeline.
2- Compute the taken PC earlier in the pipeline.
In MIPS:
– In MIPS branch instructions BEQZ, BNE, test a register for
equality to zero.
– This can be completed in the ID cycle by moving the zero test
into that cycle.
– Both PCs (taken and not taken) must be computed early.
– Requires an additional adder because the current ALU is not
useable until EX cycle.
– This results in just a single cycle stall on branches.
COMP381 by M. Hamdi
7
Reducing Branch Penalty (contd … )
3. Predicting branch taken:
• Speculatively fetch and execute instructions at the branch target
– But haven’t calculated branch target address in MIPS
• MIPS still incurs 1 cycle branch penalty
• Other machines: branch target known before outcome
COMP381 by M. Hamdi
8
Reducing Branch Penalty (contd … )
4. Delayed Branch:
Branch delay slots: In-line instructions following a branch instruction
conditional branch instruction
sequential successor1
sequential successor2
……..
sequential successorn
branch target if taken
Always executed: Usually 1 delay slot (short pipelines)
• Compiler tries to put useful instructions in delay slot to mask delay
COMP381 by M. Hamdi
9
Delayed Branch
• Instruction in branch delay slot is always executed
• Compiler (tries to) move a useful instruction into delay slot.
(a) From before the Branch: Always helpful when possible
ADD R1, R2, R3
BEQZ R2, L1
DELAY SLOT
L1:
BEQZ R2, L1
ADD R1, R2, R3
L1:
• If the ADD instruction were: ADD R2, R1, R3 the move
would not be possible
COMP381 by M. Hamdi
10
Delayed Branch
(b) From the Target: Helps when branch is taken. May duplicate
instructions
ADD R2, R1, R3
ADD
R2, R1, R3
BEQZ R2, L1
DELAY SLOT
L1: SUB R4, R5, R6
L2:
BEQZ R2, L2
SUB R4, R5, R6
L1:
SUB R4, R5, R6
L2:
Instructions between BEQ and SUB (in fall through) must not use
R4.
COMP381 by M. Hamdi
11
Delayed Branch
( c ) From Fall Through: Helps when branch is not taken.
ADD R2, R1, R3
BEQZ R2, L1
DELAY SLOT
SUB R4, R5, R6
-
ADD R2, R1, R3
BEQZ R2, L1
SUB R4, R5, R6
-
L1:
L1:
Instructions at target (L1 and after) must not use R4 till set again.
• Cancelling (Nullifying) Branch:
Branch instruction indicates direction of prediction.
If mispredicted the instruction in the delay slot is cancelled.
Greater flexibility for compiler to schedule instructions.
COMP381 by M. Hamdi
12
Canceling branch
• Cancelling (Nullifying) Branch:
– Branch instruction indicates direction of prediction.
– Mispredicted branch “annuls” the instruction in delay slot.
Instruction behaves as a no-op.
– Gives more leverage to compiler to select instructions to fill delay
slots.
COMP381 by M. Hamdi
13
Delayed Branch
• Limitations of delayed branch
– Compiler may not find appropriate instructions to
fill delay slots. Then it fills delay slots with noops.
– Visible architectural feature – likely to change
with new implementations
• Pipeline structure is exposed to compiler. Need
to know how many delay slots.
COMP381 by M. Hamdi
14
Delayed Branch
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots
useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: As processor go to deeper
pipelines and multiple issue, the branch delay grows
and need more than one delay slot
– Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic
approaches relatively cheaper
COMP381 by M. Hamdi
15
Dynamic Branch Prediction
• Builds on the premise that history matters
– Observe the behavior of branches in previous instances
and try to predict future branch behavior
– Try to predict the outcome of a branch early on in order
to avoid stalls
– Branch prediction is critical for multiple issue
processors
• In an n-issue processor, branches will come n times faster than
a single issue processor
COMP381 by M. Hamdi
16
Basic Branch Predictor
• Use a 1-bit branch predictor buffer or branch
history table
• 1 bit of memory stating whether the branch was
recently taken or not
• Bit entry updated each time the branch instruction
is executed
T
Prediction
Taken
Not Taken
State
1
0
Branch outcome
Taken Not Taken
1
0
1
0
NT
NT
State 1
State 0
Predict
Taken
Predict
Not Taken
T
COMP381 by M. Hamdi
17
1-bit Branch Prediction Buffer

Problem – even simplest branches are mispredicted
twice
LD R1, #5
Loop:
LD R2, 0(R5)
ADD R2, R2, R4
STORE R2, 0(R5)
ADD R5, R5, #4
SUB R1, R1, #1
BNEZ R1, Loop
First time: prediction = 0
but the branch is taken  change
prediction to 1 miss
Time 2, 3, 4: prediction = 1
and the branch is taken
Time 5: prediction = 1
but the branch is not taken  change
prediction to 0 miss
COMP381 by M. Hamdi
18
Dynamic Branch Prediction Accuracy
COMP381 by M. Hamdi
19