Transcript lec3-2

CS 152
Computer Architecture and Engineering
Lecture 6 – Superpipelining + Branch Prediction
2014-2-6
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
Today: First advanced processor lecture
Super-pipelining: Beyond 5 stages.
Short Break.
Branch prediction: Can we escape control hazards in long CPU pipelines?
From Appendix C: Filling the branch delay slot
Superpipelining
5 Stage Pipeline: A point of departure
Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage.
Processor has no “multi-cycle” instructions (ex: multiply with an accumulate register).
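A quick worked reading of the equation above, with made-up numbers: a program of 10^9 instructions at a CPI of 1.0 on a 500 MHz clock (2 ns period) runs in

    10^9 instructions x 1.0 cycles/instruction x 2 ns/cycle = 2 seconds.

Halving the clock period to 1 ns by deeper pipelining halves the runtime only if the CPI stays at 1.0.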
Superpipelining: Add more stages
Goal: Reduce critical path by adding more pipeline stages.
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Difficulties: Added penalties for load delays and branch misses.
Ultimate Limiter: As logic delay goes to 0, FF clk-to-Q and setup remain. Also, power!
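To see the limiter with made-up numbers: suppose flip-flop clk-to-Q plus setup costs 100 ps. A single 900 ps block of logic gives a 1000 ps clock period; splitting that logic into three 300 ps stages gives a 400 ps period, a 2.5x speedup rather than 3x. As the logic per stage shrinks toward 0, the period approaches the 100 ps flip-flop overhead floor.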
[Diagram: the 5-stage pipeline (IF, ID+RF, EX, MEM, WB, with IR registers between stages) compared with the 8-stage pipeline.]
Note: Some stages now overlap, and some instructions take extra stages.
Superpipelining techniques ...
Split ALU and decode logic over several pipeline stages (see the sketch below).
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove “rarely-used” forwarding networks that are on the critical path. Creates stalls, affects CPI.
Pipeline the wires of frequently used forwarding networks.
Also: Clocking tricks (example: use positive-edge AND negative-edge triggered flip-flops).
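A minimal Verilog sketch of the first technique above (module and signal names are my own, not from the slides): a 32-bit add split across two pipeline stages, so each stage carries only half of the adder delay.

module split_adder (
  input  wire        clk,
  input  wire [31:0] a, b,
  output wire [31:0] sum      // valid two cycles after a/b are presented
);
  reg [16:0] lo_q;            // stage 1: low 16-bit sum plus carry out
  reg [15:0] a_hi_q, b_hi_q;  // high halves delayed to stage 2
  reg [15:0] hi_q;            // stage 2: high 16-bit sum
  reg [15:0] lo_q2;           // low sum delayed to line up with hi_q

  always @(posedge clk) begin
    // Stage 1: add the low halves, keep the carry.
    lo_q   <= {1'b0, a[15:0]} + {1'b0, b[15:0]};
    a_hi_q <= a[31:16];
    b_hi_q <= b[31:16];
    // Stage 2: add the high halves plus the registered carry.
    hi_q   <= a_hi_q + b_hi_q + {15'b0, lo_q[16]};
    lo_q2  <= lo_q[15:0];
  end

  assign sum = {hi_q, lo_q2};
endmodule

The registered carry between the two halves is exactly the kind of cut that shortens the critical path, at the cost of one more pipeline stage.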
Recall: IBM Power Timing Closure
“Pipeline engineering” happens here ... about 1/3 of the project schedule.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J. Res. and Dev., 46:1, Jan 2002, J.D. Warnock et al.
Pipelining a 256 byte instruction memory.
Fully combinational (and slow). Only read behavior shown.
Can we add two pipeline stages?
[Diagram: A7-A0 is the 8-bit read address. The high 3 bits (A7-A5) drive a demux whose outputs enable (OE) the tri-state Q outputs of eight 256-bit registers, each holding 32 bytes (Byte 0-31, Byte 32-63, ..., Byte 224-255). A final mux, selected by 3 more address bits, picks the 32-bit (i.e. 4-byte) data output D0-D31.]
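One possible way to add the two pipeline stages, as a behavioral Verilog sketch (module and signal names are assumptions, and only read behavior is shown, as on the slide): one register captures the decoded address fields, a second captures the selected 256-bit bank, and the output mux stays combinational.

module pipelined_imem (
  input  wire        clk,
  input  wire [7:0]  addr,    // A7-A0
  output wire [31:0] data     // D0-D31, valid two cycles after addr
);
  reg [255:0] bank [0:7];     // eight 32-byte "registers"; contents assumed preloaded

  // Pipeline register 1: the decoded bank and word selects.
  reg [2:0] bank_sel_q, word_sel_q;
  // Pipeline register 2: the selected 256-bit bank output.
  reg [255:0] bank_q;
  reg [2:0]   word_sel_q2;

  always @(posedge clk) begin
    bank_sel_q  <= addr[7:5];         // which 32-byte register (demux select)
    word_sel_q  <= addr[4:2];         // which 32-bit word within it
    bank_q      <= bank[bank_sel_q];  // read the selected bank
    word_sel_q2 <= word_sel_q;
  end

  // Final output mux remains combinational.
  assign data = bank_q[32*word_sel_q2 +: 32];
endmodule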
On a chip: “Registers” become SRAM cells
Architects specify number of rows and columns.
Word and bit lines slow down as array grows larger!
[SRAM array diagram: write drivers feed the parallel data I/O lines; muxes are added at the array edge to select a subset of bits.]
How could we pipeline this memory? See the last slide.
[Die photo: the RISC CPU logic uses 0.65 million devices; the caches use 5.85 million devices.]
IC processes are optimized for small SRAM cells
From the Marvell ARM CPU paper: 90% of the 6.5 million transistors, and 60% of the chip area, are devoted to cache memories.
Implication? SRAM is 6X as dense as logic.
RAM Compilers
On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers.
Compile-time parameters set the number of bits, aspect ratio, ports, etc.
ALU: Pipelining Unsigned Multiply
[Worked example: unsigned binary long multiplication by 1011. Facts to remember.]
Building Block: Full-Adder Variant
1-bit signals: x, y, z, s, Cin, Cout
x: one bit of the multiplicand
y: one bit of the “running sum”
z: one bit of the multiplier
If z = 1, {Cout, s} <= x + y + Cin
If z = 0, {Cout, s} <= y + Cin
Verilog for this “2-bit entity” is sketched below.
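A minimal sketch of that cell (port and module names are my own):

module fa_variant (
  input  wire x,    // one bit of the multiplicand
  input  wire y,    // one bit of the running sum
  input  wire z,    // one bit of the multiplier (gates x)
  input  wire cin,
  output wire s,
  output wire cout
);
  // If z = 1: {cout, s} = x + y + cin; if z = 0: {cout, s} = y + cin.
  assign {cout, s} = (z ? x : 1'b0) + y + cin;
endmodule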
Put it together: Array computes P = A x B
[Array-multiplier diagram built from the adder cells: A bits drive the x inputs, B bits drive the z inputs, and each row's running sum drives the y inputs of the next row.
To pipeline the array: place registers between the adder stages (shown in green on the slide), and add registers to delay the selected A and B bits so they reach their stage in the right cycle.
The implementation shown is fully combinational.]
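To make the recipe concrete, here is a small behavioral Verilog sketch (4-bit operands; names and structure are mine, not the array drawn on the slide): one partial-product add per stage, with registers between the adder stages and delayed copies of A and B carried along.

module pipelined_mult4 (
  input  wire       clk,
  input  wire [3:0] a,   // multiplicand
  input  wire [3:0] b,   // multiplier
  output reg  [7:0] p    // product, valid 4 cycles after a/b
);
  // Stage registers: running sum plus delayed copies of a and b.
  reg [7:0] sum1, sum2, sum3;
  reg [3:0] a1, a2, a3;
  reg [3:0] b1, b2, b3;

  always @(posedge clk) begin
    // Stage 1: partial product for b[0]
    sum1 <= b[0] ? {4'b0, a} : 8'b0;
    a1   <= a;  b1 <= b;
    // Stage 2: add partial product for b[1]
    sum2 <= sum1 + (b1[1] ? {3'b0, a1, 1'b0} : 8'b0);
    a2   <= a1; b2 <= b1;
    // Stage 3: add partial product for b[2]
    sum3 <= sum2 + (b2[2] ? {2'b0, a2, 2'b0} : 8'b0);
    a3   <= a2; b3 <= b2;
    // Stage 4: add partial product for b[3]
    p    <= sum3 + (b3[3] ? {1'b0, a3, 3'b0} : 8'b0);
  end
endmodule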
Adding pipeline stages is not enough ...
MIPS R4000: Simple 8-stage pipeline
Branch stalls are the main reason why pipeline CPI > 1.
2-cycle load delay, 3-cycle branch delay.
(Appendix C, Figure C.52)
Branch Prediction
Add pipeline stages, reduce clock period
Q. Could adding pipeline stages hurt the CPI for an application?
A. Yes, due to these problems (example: the 8-stage ARM XScale):

CPI Problem                              Possible Solution
Taken branches cause longer stalls       Branch prediction, loop unrolling
Cache misses take more clock cycles      Larger caches, add prefetch opcodes to ISA
Recall: Control hazards ...
[Pipeline diagram: IF (Fetch) with the I-Cache, ID (Decode), EX (ALU), MEM, and WB, with IR pipeline registers between stages.]
We avoided stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage.
If we add more early stages, we must stall.
[Pipeline timing diagram for a sample program, on an ISA without a branch delay slot:
  I1: BEQ R4,R3,25
  I2: AND R6,R5,R4
  I3: SUB R1,R9,R8
The EX stage computes whether the branch is taken. If the branch is taken, the instructions fetched after it MUST NOT complete!]
Solution: Branch prediction ...
[Pipeline diagram: a Branch Predictor sits beside the I-Cache in the IF (Fetch) stage. Its predictions answer: Is this a control instr? Taken or not taken? If taken, where to (what PC)? The ID, EX, MEM, and WB stages follow as before.]
We update the PC based on the outputs of the branch predictor. If it is perfect, the pipe stays full!
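A minimal sketch of that PC update, assuming the three predictor outputs named above (all module and signal names are mine):

module next_pc_select (
  input  wire [31:0] pc,
  input  wire        pred_is_branch,   // "a control instr?"
  input  wire        pred_taken,       // "taken or not taken?"
  input  wire [31:0] pred_target,      // "if taken, where to?"
  output wire [31:0] next_pc
);
  // Fetch the predicted target only when the predictor recognizes a
  // branch and predicts taken; otherwise fall through to PC + 4.
  assign next_pc = (pred_is_branch && pred_taken) ? pred_target : pc + 32'd4;
endmodule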
Dynamic Predictors: a cache of branch history
[Pipeline timing diagram, as before: the EX stage computes whether the branch is taken. If we predicted incorrectly, the instructions fetched after the branch MUST NOT complete!]
Branch predictors cache branch history
[Diagram: a 4096-entry Branch Target Buffer (BTB) with an associated Branch History Table (BHT). The address of the branch instruction (e.g. 0b0110[...]01001000, 30 bits) is compared against the 30-bit address tag of each entry; a match is a “Hit”. Each entry also holds the target address (PC + 4 + Loop offset, for a branch like BNEZ R1 Loop) and 2 state bits that predict “Taken” or “Not Taken”. At the EX stage, update the BTB/BHT and kill instructions, if necessary. Drawn as fully associative to focus on the essentials; real designs are almost always direct-mapped.]
Branch predictor: direct-mapped version
[Diagram: the address of the BNEZ instruction (0b011[..]010[..]100) is split into an 18-bit tag and a 12-bit index into 4096 BTB/BHT entries. The indexed entry's 18-bit address tag is compared against the fetch address; a match is a “Hit”. The entry supplies the target address (PC + 4 + Loop offset, for BNEZ R1 Loop) and a “Taken” or “Not Taken” prediction. As in real life ... direct-mapped. Update the BHT/BTB for next time, once the true behavior is known.]
Must check the prediction, and kill the instruction if needed.
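A Verilog sketch of the lookup side of this direct-mapped structure (array and signal names are assumptions; the update path and valid bits are omitted):

module btb_lookup (
  input  wire [31:0] pc,              // address of the instruction in IF
  output wire        hit,             // tag match: we recognize this branch
  output wire [31:0] pred_target,     // predicted target address
  output wire [1:0]  state_bits       // 2-bit history for taken/not-taken
);
  // Word-aligned PC: bits [13:2] index the 4096 entries, bits [31:14] are the 18-bit tag.
  wire [11:0] index = pc[13:2];
  wire [17:0] tag   = pc[31:14];

  reg [17:0] tag_array    [0:4095];   // BTB tags
  reg [31:0] target_array [0:4095];   // BTB targets
  reg [1:0]  bht          [0:4095];   // BHT state bits

  assign hit         = (tag_array[index] == tag);
  assign pred_target = target_array[index];
  assign state_bits  = bht[index];
endmodule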
Simple (“2-bit”) Branch History Table Entry
Bit 1: the prediction for the next branch (1 = take, 0 = not take). Initialize to 0.
Bit 2: was the last prediction correct? (1 = yes, 0 = no). Initialize to 1.
After we “check” the prediction:
Prediction bit: flip it if the prediction was not correct and the “last predict correct” bit is 0.
“Last predict correct” bit: set to 1 if the prediction bit was correct, set to 0 if the prediction bit was incorrect, and set to 1 if the prediction bit flips.
We do not change the prediction the first time it is incorrect. Why?
      ADDI R4,R0,11
loop: SUBI R4,R4,1
      BNE  R4,R0,loop
This branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict “take” the first time through.
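A Verilog sketch of one such entry, implementing the update rules above (names are mine; "taken" is the true outcome known at EX):

module bht_entry (
  input  wire clk,
  input  wire reset,
  input  wire update,   // a resolved branch maps to this entry this cycle
  input  wire taken,    // true outcome, known at the EX stage
  output reg  pred,     // prediction for next time (1 = take, 0 = not take)
  output reg  corr      // was the last prediction correct?
);
  always @(posedge clk) begin
    if (reset) begin
      pred <= 1'b0;                // initialize prediction to 0 (not take)
      corr <= 1'b1;                // initialize "last correct" to 1
    end else if (update) begin
      if (taken == pred)
        corr <= 1'b1;              // prediction was correct
      else if (corr)
        corr <= 1'b0;              // first miss: keep the prediction bit
      else begin
        pred <= ~pred;             // second miss in a row: flip the prediction ...
        corr <= 1'b1;              // ... and set "last correct" back to 1
      end
    end
  end
endmodule

In the loop example, the single not-taken outcome at loop exit only clears the “last predict correct” bit; the prediction bit stays at “take”, so the next pass through the loop is predicted correctly.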
A 4096-entry 2-bit predictor is “80-90% accurate”. (Figure C.19)
Branch Prediction: Trust, but verify ...
[Pipeline diagram: in Instr Fetch, the Branch Predictor and BTB sit beside the I-Cache and PC (+4) logic, producing the Predicted PC and the predictions: Is this a branch instr? Taken or not taken? If taken, where to (what PC)? In Decode & Reg Fetch, note the instruction type and branch target, and pass the prediction info (P) to the next stage. In Execute, the branch logic resolves Taken/Not Taken, checks all predictions, and takes actions if needed (kill instructions, update the predictor).]
Flowchart: control for dynamic branch prediction. (Figure 3.22)
Spatial Predictors
[C code snippet and its compiled assembly, containing three branches: b1, b2, and b3 (a BEQZ). We want to predict this last branch. Can b1 and b2 help us predict it?]
Idea: Devote hardware to four 2-bit predictors for the BEQZ branch.
P1: Use if b1 and b2 not taken.
P2: Use if b1 taken, b2 not taken.
P3: Use if b1 not taken, b2 taken.
P4: Use if b1 and b2 taken.
Track the current taken/not-taken status of b1 and b2, and use it to choose from P1 ... P4 for BEQZ ... How?
Branch History Register: Tracks global history
[Pipeline diagram, as before, with one addition: a Branch History Register feeding the Branch Predictor and BTB in the Instr Fetch stage.]
We choose which predictor to use (and update) based on the Branch History Register.
The Branch History Register is a 2-bit shift register that holds the taken/not-taken status of the last 2 branches.
Spatial branch predictor (BTB, tag not shown)
[Diagram: the PC of the BEQZ R3 L3 instruction (0b0110[...]01001000) is mapped to an index into four Branch History Tables, P1-P4, each entry holding 2 state bits. The Branch History Register (two flip-flops recording the outcomes of the “(aa==2)” branch and the “(bb==2)” branch) drives a mux that chooses which table supplies the “Taken” or “Not Taken” prediction for the “(aa != bb)” branch.]
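Pulling the pieces together, a behavioral Verilog sketch of the scheme (sizes and names assumed; a standard 2-bit saturating counter stands in for the predictor bits, the update reuses the current history register rather than the history captured at prediction time, and initialization is omitted):

module spatial_predictor (
  input  wire        clk,
  input  wire [31:0] pc,          // branch PC being predicted (IF stage)
  output wire        pred_taken,
  input  wire        update,      // a branch resolved this cycle (EX stage)
  input  wire [31:0] update_pc,   // its PC
  input  wire        taken        // its true outcome
);
  reg [1:0] bhr;                  // global history: last two branch outcomes

  // Four 1024-entry tables of 2-bit counters, flattened into one array:
  // the history register picks the table, PC bits pick the row.
  reg [1:0] bht [0:4095];
  wire [11:0] pidx = {bhr, pc[11:2]};
  wire [11:0] uidx = {bhr, update_pc[11:2]};

  wire [1:0] entry = bht[pidx];
  assign pred_taken = entry[1];   // high bit of the counter is the prediction

  always @(posedge clk) begin
    if (update) begin
      // Saturating 2-bit counter update for the selected table entry.
      if (taken  && bht[uidx] != 2'b11) bht[uidx] <= bht[uidx] + 2'b01;
      if (!taken && bht[uidx] != 2'b00) bht[uidx] <= bht[uidx] - 2'b01;
      // Shift the resolved outcome into the Branch History Register.
      bhr <= {bhr[0], taken};
    end
  end
endmodule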
Performance
For more details on branch prediction, see the reference shown on the slide.
4096 vs. 1024? Fair comparison: it matches the total # of bits.
[Figure 3.3: performance of one BHT (4096 entries) vs. the spatial scheme (4 BHTs, each with 1024 entries).]
Predict function returns by stacking call info. (Figure 3.24)
Hardware limits to superpipelining?
[Plot: CPU clock periods, 1985-2005, measured in FO4 delays. Data points include the MIPS R2000 (5 stages), Pentium Pro (10 stages), and Pentium 4 (20 stages). Historical limit: about 12 FO4s. Power wall: the Intel Core Duo has 14 stages.]
FO4: How many fanout-of-4 inverter delays fit in the clock period.
Thanks to Francois Labonte, Stanford.
On Tuesday
We turn our focus to memory system design ...
Have a good weekend!