Transcript lec8-2
CS 152
Computer Architecture and Engineering
Lecture 16 -- Midterm I Review Session
2014-3-13
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L16: Midterm I Review
UC Regents Spring 2014 © UCB
Today - Midterm I Review Session
Study tips, test ground rules
All questions answered (almost ...)
Short break
HW 1, problem by problem ...
Recall: HW 1 was Fall 05 Mid-term I.
On Tuesday
Mid-term I ...
Ground rules ...
When is it? Where is it? Ground rules.
9:30 AM sharp, Tuesday March 18th,
306 Soda.
Every-other-seat seating, except for the
front row, where every-seat is permitted.
No blue-books needed. We will be handing
out a paper test. Pencil is preferred.
Pencils down @ 10:55 AM, so we can
collect papers before next class comes in.
When is it? Where is it? Ground rules.
No use of calculators, smartphones,
laptops, etc ... during the exam.
Closed-book, closed-notes. Just pencils,
erasers. No consulting with students.
Restroom breaks are OK, but you’ll still
need to hand in your exam @ 10:55.
Questions are reserved for serious
concerns about a bug in the question.
What does it cover? First 8 lectures.
Not to be taken legalistically. For example, WAW
hazards were covered in the pipelining lecture, and so
it’s fair for me to show a pipeline and say “does this
have a WAW hazard”, even if that example was also
seen in a later lecture.
Mid-term: How to do well ...
Problem intro often features a lecture slide.
If you have to teach yourself that slide
during the test, you’re starting out behind.
Getting the problem correct requires
thinking on your feet to do a new design
or analyze one given to you.
There will not be “you can only get it if you do
the reading” problems ... but the reading
helps you understand how to think through
the problem.
Mid-term: There may be math ...
No memorization: If we ask about Amdahl’s
Law, we will show its definition lecture slide.
Understanding is needed: A problem may
require you to apply an equation to a design,
etc.
You may need to do: simple algebra and calculus,
add a few numbers by hand, etc.
Cannot use electronic devices.
... more administrative info after we do
some content.
Example starting slides ...
These are meant to be examples, not a complete
list!
A problem may start with a slide that is not in this
part of the presentation!
Lecture 2 – Single Cycle Wrap-up
2014-1-23
Merging data paths ...
Add muxes
[Figure: R-format and I-format data paths merged into one (ignore ALU control).]
How many? 2
Where?
CS 152: Single-Cycle Design
UC Regents Spring 2014 © UCB
Adding branch testing to the data path
[Figure: single-cycle data path. RegFile ports: rs1, rs2, ws (5 bits each), wd, rd1, rd2 (32 bits each), WE. Ext unit. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg.]
Equal (wire into control)
Syntax: BEQ $1, $2, 12
Action: If ($1 != $2), PC = PC + 4
Action: If ($1 == $2), PC = PC + 4 + 48
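The two BEQ actions above can be checked with a small sketch. The function name and the example PC value are made up for illustration; the only facts taken from the slide are that the immediate 12 becomes a byte offset of 48 and that the offset is added to PC + 4.

```python
def beq_next_pc(pc: int, rs_val: int, rt_val: int, imm: int) -> int:
    """Next PC for BEQ, following the two 'Action' lines on the slide."""
    # The immediate counts 32-bit instructions, so shift left by 2
    # to turn it into a byte offset (12 -> 48).
    byte_offset = imm << 2
    if rs_val == rt_val:
        return pc + 4 + byte_offset  # branch taken
    return pc + 4                    # branch not taken


# BEQ $1, $2, 12 fetched from a hypothetical PC of 0x100
print(hex(beq_next_pc(0x100, 7, 7, 12)))  # registers equal
print(hex(beq_next_pc(0x100, 7, 8, 12)))  # registers differ
```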
VLIW: Very Long Instruction Words
Josh Fisher: idea grew out of his Ph.D (1979) in
compilers
Led to a startup (MultiFlow) whose computers
worked, but which went out of business ... the
ideas remain influential.
32-bit & 64-bit semantics different? Yes!
Assume: $7 = 7, $8 = 8, $9 = 9, $10 = 10 (decimal)
32-bit MIPS:
ADD $8 $9 $10;    Result: $8 = 19
ADD $7 $8 $9;     Result: $7 = 28
VLIW:
Instr: ADD $8 $9 $10    ; result $8 = 19
       ADD $7 $8 $9     ; result $7 = 17 (not 28)
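A minimal sketch of why the two results differ, assuming the VLIW operators all read the register state from before the instruction and then write together. The helper names are hypothetical; the register values and expected results are the ones on the slide.

```python
def run_sequential(regs, ops):
    """32-bit MIPS: each ADD sees the writes of the ADD before it."""
    regs = dict(regs)
    for dst, a, b in ops:
        regs[dst] = regs[a] + regs[b]
    return regs


def run_vliw(regs, ops):
    """One VLIW instruction: every operator reads the old register state."""
    old = dict(regs)
    new = dict(regs)
    for dst, a, b in ops:
        new[dst] = old[a] + old[b]
    return new


regs = {7: 7, 8: 8, 9: 9, 10: 10}
ops = [(8, 9, 10), (7, 8, 9)]  # ADD $8 $9 $10 ; ADD $7 $8 $9

print(run_sequential(regs, ops)[7])  # second ADD uses the new $8
print(run_vliw(regs, ops)[7])        # second ADD uses the old $8
```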
Branch policy: All instr operators execute
Instr: BNE $8 $9 Label
       ADD $7 $8 $9
[Figure: one instruction holding two operator slots, each with fields opcode, rs, rt, rd, shamt, funct.]
ADD executes if branch is taken or not taken.
Problem: Large N machines find it hard to
fill all operators with useful work.
Solution: New “predication” operator.
Syntax: SELECT $7 $8 $9 $10
Semantics: If $8 == 0, $7 = $10, else $7 = $9
Permits simple branches to be converted to inline code.
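As an illustration of the SELECT semantics above, here is a small sketch. The Python function and the register-dictionary representation are assumptions made for the example; only the "if $8 == 0 ... else ..." rule comes from the slide.

```python
def select(regs, rd, rs, rt, ru):
    """SELECT rd rs rt ru: if regs[rs] == 0, rd = regs[ru], else rd = regs[rt]."""
    regs = dict(regs)
    regs[rd] = regs[ru] if regs[rs] == 0 else regs[rt]
    return regs


# Branch-free version of: if ($8 == 0) $7 = $10; else $7 = $9;
print(select({7: 0, 8: 0, 9: 90, 10: 100}, 7, 8, 9, 10)[7])
print(select({7: 0, 8: 5, 9: 90, 10: 100}, 7, 8, 9, 10)[7])
```

Both candidate values are computed before SELECT runs, which is what lets a simple branch become inline code.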
Branch nesting in a single instruction ...
Instr: BEQ $8 $9 LabelOne
       BEQ $11 $12 LabelTwo
[Figure: one instruction holding two operator slots, each with fields opcode, rs, rt, rd, shamt, funct.]
Conundrum: How to define the semantics of
multiple branches in one instruction?
Solution: Nested branch semantics
If $8 == $9, branch to LabelOne
Else if $11 == $12, branch to LabelTwo
Lecture 3 – Metrics
2014-1-28
CPU time: Proportional to Instruction Count
CPU time / Program ∝ Machine Instructions / Program
Rationale: Every additional instruction you
execute takes time.
Q. Once ISA is set, who can influence instruction
count?
A. Compiler writer, application developer.
Q. Static count? (lines of program printout)
Or dynamic count? (trace of execution)
A. Dynamic.
Q. How does an architect influence the number of
machine instructions needed to run an algorithm?
A. Create new instructions: instruction set architect.
Recall Lecture 2: Multi-flow VLIW CPU
Seconds / Program = (Instructions / Program) ⨯ (Cycles / Instruction) ⨯ (Seconds / Cycle)
Q. Which right-hand-side term decreases with “N” ?
A. Instructions / Program: this one gets smaller.
A. Seconds / Cycle: we hope this one doesn’t grow.
Instr: ADD $8 $9 $10    Semantics: $8 = $9 + $10
       ADD $7 $8 $9     Semantics: $7 = $8 + $9
[Figure: one instruction holding two operator slots, each with fields opcode, rs, rt, rd, shamt, funct.]
N x 32-bit VLIW yields factor of N speedup!
Multiflow: N = 7, 14, or 28 (3 CPUs in product family)
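A quick numeric sketch of the performance equation for the N = 7 case. The instruction count and clock period below are made-up illustration values, and the sketch assumes every operator slot is filled with useful work, which Lecture 2 notes is hard for large N.

```python
def seconds_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


OPERATORS = 1_400_000   # hypothetical dynamic count of 32-bit operators
CLOCK_PERIOD = 1e-8     # hypothetical seconds per cycle, same for both machines
N = 7                   # operators per VLIW instruction

scalar = seconds_per_program(OPERATORS, 1, CLOCK_PERIOD)
vliw = seconds_per_program(OPERATORS // N, 1, CLOCK_PERIOD)

print(f"speedup = {scalar / vliw:.2f}")
```

Only the Instructions/Program term changes here; any growth in Seconds/Cycle would eat into the factor of N.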
A closer look at fan-out ...
Driving more gates adds delay.
Linear model works for reasonable fan-out.
FO4: Fanout of four delay.
Delay time of an inverter
CS 250 L3: Timing
UC Regents Fall 2013 © UCB
Clock skew also eats into “time budget”
[Figure: circuits clocked by the delayed clock CLKd.]
As T → 0, which circuit fails first?
Lecture 4 – Pipelining
2014-1-30
Hazards: An instruction is not a car ...
[Figure: three-stage pipeline. Stage #1: Instr Fetch. Stage #2: Decode & Reg Fetch. Stage #3. Pipeline registers: IR in each stage, A, B, M.]
New sample program
ADD R4,R3,R2
OR R5,R4,R2
R4 not written yet ... wrong value of R4 fetched from
RegFile, contract with programmer broken! Oops!
An example of a “hazard” -- we must
(1) detect and
(2) resolve all hazards
to make a CPU that matches ISA
Performance Equation and Pipelining
Seconds / Program = (Instructions / Program) ⨯ (Cycles / Instruction) ⨯ (Seconds / Cycle)
[Figure: pipeline with Instr Fetch, Decode & Reg Fetch, and Stage #3. Pipeline registers: IR in each stage, A, B, M.]
Cycles / Instruction: CPI == 1. Once the pipe is full,
one instruction completes per cycle.
Seconds / Cycle: Clock period is shorter. Less work
to do in each cycle.
To get shortest clock period, balance the work
to do in each pipeline stage.
Data Hazards: 3 Types (RAW, WAR, WAW)
Write After Read (WAR) hazards. Instruction I2
expects to write over a data value after an
earlier instruction I1 reads it. But instead, I2
writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction
I2 writes over data an earlier instruction I1
also writes. But instead, I1 writes after I2, and
the final data value is incorrect.
WAR and WAW not possible in our 5-stage
pipeline. But are possible in other pipeline
designs.
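The hazard definitions can be phrased as register comparisons between an earlier instruction I1 and a later instruction I2. This is a sketch under the assumption that each instruction is described by one destination register and a list of source registers; whether a flagged dependence becomes a real hazard depends on the pipeline, as the slide notes for WAR and WAW.

```python
def data_dependences(i1, i2):
    """i1 is earlier in program order. Each is (dest_reg, [source_regs])."""
    d1, s1 = i1
    d2, s2 = i2
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")  # I2 reads what I1 writes
    if d2 is not None and d2 in s1:
        found.add("WAR")  # I2 writes what I1 reads
    if d1 is not None and d1 == d2:
        found.add("WAW")  # both write the same register
    return found


add = ("R4", ["R3", "R2"])   # ADD R4,R3,R2
orr = ("R5", ["R4", "R2"])   # OR  R5,R4,R2
print(data_dependences(add, orr))
```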
Resolving a RAW hazard by forwarding
[Figure: pipeline stages 1 “IF” Stage (Instr Fetch), 2 “ID/RF” Stage (Decode & Reg Fetch), 3 “EX” Stage (Execution). Pipeline registers: IR in each stage, A, B, M, Y. OR R5,R4,R2 is in ID/RF while ADD R4,R3,R2 is in EX.]
Sample program
ADD R4,R3,R2
OR R5,R4,R2
ALU computes R4 in the EX stage, so ...
Just forward it back!
Unlike stalling, does not change CPI. May
hurt cycle time.
Lecture 5 – ISA Design + Microcode + Cost
2014-2-4
Born on this day in 2004: Facebook
1990s technology was ready for RISC
[Figure: register-register machine code for c = a + b;]
Transistors were available
for on-chip instruction cache.
So, larger code size would
not monopolize bandwidth to
off-chip DRAM memory.
Fixed-length instructions made
fast pipelining practical.
For the right target ISA,
compiled code quality could
match hand-coded assembly.
Not really quantitative ...
Instruction modes: “C is a high-level assembler” origin
w += i
w += 3
w += a[100 + i]
w += a[i]
w += a[i + j]
w += a[1001]
w += a[*p]
a[i++]
a[i--]
w += a[100 + i + d*j]
How compiler technology can inform an ISA decision
Example:
f=b+c+d-1
becomes:
Temporaries: a, e
During code generation, a compiler
allocates registers to
temps when available, because
registers are faster than memory.
In the general case, register allocation
task is NP-complete ...
There are good heuristic solutions, but they require 16
free registers (preferably more) to work well.
This line of reasoning quantifies one
advantage of 32 general purpose registers
An example of a complex instruction
8 byte instruction
fetch amortized by
28 byte data move
MOVEM.L D0/D4-D7/A4/A5,40(A6)
Move the 32-bit data stored in
7 registers (D0, D4, D5, D6, D7, A4, A5)
to the region of memory pointed to
by A6, displaced by 28H bytes.
Takes 58 clock cycles to execute.
Requires non-architected state to keep
track of memory and register indices.
But ... what exactly is microcode?
Each clock cycle, we can think of this data path
as executing an 11-bit microcode instruction word:
One microcode instruction - binary format
DN  WE1 WE2 WE3 WE4 WEP OE1 OE2 OE3 OE4 OEP
0   0   0   0   0   1   1   0   0   0   0
Assembler format
OE1, WEP;
a list of “1” columns.
[Figure: data path of 32-bit registers R1 ... R4 and P, each with D and Q ports and its own WE and OE controls (WE1 OE1 ... WE4 OE4, WEP OEP), plus DN.]
353 Haswell CPU dies on this wafer
==> $4.80 per die!
But if die were twice as big ...
$9.60 per die!
This is one reason why die size matters.
[Figure: wafer map. Die counts per column: 11, 21, 27, 31, 33, 35, 37, 35, 33, 31, 27, 21, 11.]
This analysis is optimistic ...
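The die-cost arithmetic can be reproduced in a few lines. The wafer cost is not stated on the slide; it is inferred here as 353 ⨯ $4.80. The sketch also assumes that doubling the die area exactly halves the die count, ignoring edge effects and yield, which is part of why the analysis is optimistic.

```python
DIES_PER_WAFER = 353     # from the slide
COST_PER_DIE = 4.80      # dollars, from the slide

# Inferred, not stated on the slide: total wafer cost.
wafer_cost = DIES_PER_WAFER * COST_PER_DIE


def cost_per_die(wafer_cost, dies_per_wafer):
    """Spread the wafer cost evenly over every die on the wafer."""
    return wafer_cost / dies_per_wafer


# Assumption: a die twice as big means half as many dies fit.
big_die_cost = cost_per_die(wafer_cost, DIES_PER_WAFER / 2)

print(f"wafer: ${wafer_cost:.2f}, big die: ${big_die_cost:.2f}")
```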
Lecture 6 – Superpipelining + Branch Prediction
2014-2-6
Pipelining a 256 byte instruction memory.
Fully combinational (and slow). Only read behavior shown.
Can we add two pipeline stages?
[Figure: A7-A0: 8-bit read address. Three address bits drive a DEMUX that asserts one OE line (OE --> Tri-state Q outputs!) among eight registers: Byte 0-31, Byte 32-63, ..., Byte 224-255. The 256-bit Q outputs feed a MUX selected by three more address bits (A7 A6 A5 A4 A3 A2 are used in total). Data output is 32 bits (D0-D31), i.e. 4 bytes.]
Each register holds 32 bytes (256 bits)
Spatial Predictors
[Figure: a C code snippet with branches b1, b2, and b3, and the code after compilation.]
We want to predict this branch.
Can b1 and b2 help us predict it?
Idea: Devote hardware to four
2-bit predictors for BEQZ branch.
P1: Use if b1 and b2 not taken.
P2: Use if b1 taken, b2 not taken.
P3: Use if b1 not taken, b2 taken.
P4: Use if b1 and b2 taken.
Track the current taken/not-taken
status of b1 and b2, and use it to
choose from P1 ... P4 for BEQZ ...
How?
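One way to picture the P1 ... P4 idea is a table of four predictors indexed by the (b1, b2) outcomes. The slide does not say which 2-bit scheme is used, so the saturating counter below is an assumption, as are the class names and the toy training pattern (b3 taken exactly when b1 and b2 are both taken).

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 1  # start weakly not taken (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


class SpatialPredictor:
    """Four predictors for one branch, selected by the b1/b2 outcomes."""

    def __init__(self):
        self.counters = {(b1, b2): TwoBitCounter()
                         for b1 in (False, True) for b2 in (False, True)}

    def predict(self, b1_taken, b2_taken):
        return self.counters[(b1_taken, b2_taken)].predict()

    def update(self, b1_taken, b2_taken, outcome):
        self.counters[(b1_taken, b2_taken)].update(outcome)


predictor = SpatialPredictor()
for _ in range(2):  # two training passes over every (b1, b2) combination
    for b1 in (False, True):
        for b2 in (False, True):
            predictor.update(b1, b2, b1 and b2)

print(predictor.predict(True, True), predictor.predict(True, False))
```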
Lecture 7 -- Power and Energy
2014-2-11
The Watt: Unit of power. A rate of energy (J/s).
1 W = 1 J/s.
A gas pump hose delivers 6 MW.
120 kW: The power delivered by a Tesla Supercharger.
The Joule: Unit of energy. 1 J = 1 W s.
A 1 Gallon gas container holds 130 MJ of energy.
Tesla Model S has a 306 MJ battery
(good for 265 miles).
And so, we can transform this:
Block processes stereo audio. 1/2
of clocks for “left”, 1/2 for “right”.
P ~ F ⨯ Vdd^2
P ~ 1 ⨯ 1^2
Into this:
Top block processes “left”, bottom “right”.
P ~ #blks ⨯ F ⨯ Vdd^2
P ~ 2 ⨯ 1/2 ⨯ 1/4 = 1/4
Gate delay roughly linear with Vdd
CV^2 power only
This magic trick brought to you by Cory Hall ...
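The scaling argument can be replayed numerically. This sketch uses normalized units (the original design is F = 1, Vdd = 1) and assumes, as the slide does, that only CV^2 switching power counts and that halving the clock lets Vdd be halved because gate delay is roughly linear in Vdd.

```python
def relative_power(blocks, freq, vdd):
    """P ~ #blks x F x Vdd^2, in units where the original design is 1."""
    return blocks * freq * vdd ** 2


one_fast_block = relative_power(blocks=1, freq=1.0, vdd=1.0)
two_slow_blocks = relative_power(blocks=2, freq=0.5, vdd=0.5)

print(one_fast_block, two_slow_blocks)
```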
Lecture 8 -- CPU Verification
2014-2-13
“CPU program” diagnosis is tricky ...
Observation: On a buggy CPU model,
the correctness of every executed
instruction is suspect.
Consequence: One needs to verify the
correctness of instructions that surround
the suspected buggy instruction.
Depends on: (1) number of “instructions in
flight” in the machine, and (2) lifetime of
non-architected state (may be “indefinite”).
CS 250 L11: Design Verification
UC Regents Fall 2012 © UCB
Combinational Unit Testing: 32-bit Adder
[Figure: 32-bit adder with inputs A (32 bits), B (32 bits), Cin and outputs Sum (32 bits), Cout.]
Number of input bits? 65
Total number of possible input values?
2^65 = 3.689e+19
Just test them all?
Exhaustive testing does not “scale”.
“Combinatorial explosion!”
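To get a feel for why exhaustive testing does not scale, the count can be turned into a run time. The test rate of one billion vectors per second is a made-up figure for illustration only.

```python
INPUT_BITS = 32 + 32 + 1          # A, B, and Cin
total_vectors = 2 ** INPUT_BITS   # every possible input value

TESTS_PER_SECOND = 1e9            # hypothetical simulator or tester speed
SECONDS_PER_YEAR = 365 * 24 * 3600

years = total_vectors / TESTS_PER_SECOND / SECONDS_PER_YEAR

print(f"{total_vectors:.3e} vectors, about {years:.0f} years")
```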
Today - Midterm I Review Session
All questions answered (almost ...)
Break
CS152 Midterm I
October 4th 2005
Now at Splunk (log files in the cloud)
Now at
Redux (10second
David Marquardt
Udam Saini
John Lazzaro
(still @ berkeley)
Name:
SSID:
“All the work is my own. I have no prior knowledge
of the exam contents, aside from guidance from
class staff. I will not share the contents with others
in CS152 who have not taken it yet.”
Signature:
Please write clearly, and put your name on each
page. Please abide by word limits. Good luck!
#    Points
1    10
2    15
3    10
4    10
5    15
6    15
7    10
8    15
Tot  100
Q1. Register File Design (10 points)
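[Figure: register file write logic. The 5-bit ws input drives a DEMUX gated by WE; its outputs drive the En inputs of registers R1 ... R31. The 32-bit wd input drives the D input of every register. R0 - The constant 0. All registers are clocked by clk.]
Q1: The actual question ...
[Figure: register file with two write ports: ws1 (5 bits), wd1 (32 bits), WE1 and ws2 (5 bits), wd2 (32 bits), WE2. Registers R1 ... R31 with D, En, Q. R0 - Constant 0. Draw your answer here ...]
[Figure (answer): ws1, gated by WE1, drives a demux with outputs a0 ... a31; ws2, gated by WE2, drives a second demux with outputs b0 ... b31. For each register R1 ... R31, an “or” of its a and b lines drives En, and a two-input mux (inputs labeled 1 and 0) chooses between wd1 and wd2 for D. R0 - Constant 0.]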
Q2: Single Cycle Design (part A)
LWA: Load Word and
Auto-update Index
[Figure: single-cycle data path. R-format fields: op, rs, rt, rd, shamt, funct. I-format fields: op, rs, rt, imm. RegFile ports: rs1, rs2, ws, wd, rd1, rd2, WE. Ext unit on imm. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg. Equal output.]
Mux control: 0 is lower mux input, 1 upper mux input
RegWr, MemWr: 1 = write, 0 = no write.
ExtOp: 1 = sign-extend, 0 = zero-extend.
Q2: Single Cycle Design (part A)
Q2: Single Cycle Design (part B)
SWA: Store Word
and Auto-update
Index
[Figure: single-cycle data path. R-format fields: op, rs, rt, rd, shamt, funct. I-format fields: op, rs, rt, imm. RegFile ports: rs1, rs2, ws, wd, rd1, rd2, WE. Ext unit on imm. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg. Equal output.]
Mux control: 0 is lower mux input, 1 upper mux input
RegWr, MemWr: 1 = write, 0 = no write.
ExtOp: 1 = sign-extend, 0 = zero-extend.
Q2: Single Cycle Design (part B)
Q2: Single Cycle Design (part C)
BEQR: Branch if
EQual to address in
Register
[Figure: single-cycle data path. R-format fields: op, rs, rt, rd, shamt, funct. I-format fields: op, rs, rt, imm. RegFile ports: rs1, rs2, ws, wd, rd1, rd2, WE. Ext unit on imm. Control signals: RegDest, RegWr, ExtOp, ALUsrc, ALUctr, MemWr, MemToReg. Equal output.]
Mux control: 0 is lower mux input, 1 upper mux input
RegWr, MemWr: 1 = write, 0 = no write.
ExtOp: 1 = sign-extend, 0 = zero-extend.
[Figure: instruction fetch. PC register (D, Q, Clk), Instr Mem (Addr, Data), an adder with constant 0x4, a second adder, an Ext unit on imm, a BRSrc mux with inputs Ext(imm) (1) and rd1 (0), and a PCSrc mux. All paths are 32 bits.]
Note: imm is immediate field from I-format bitfield. Ext unit sign
extends, does word->byte shift.
Note: rd1 is output from register file.
Mux control: 0 is lower mux input, 1 upper mux
input
Q2: Single Cycle Design (part C)
Q3: Single-Cycle Branch Delay Slot
Single-Cycle I-Fetch (no delay slot)
[Figure: PC register (D, Q, Clk), Instr Mem (Addr, Data), an adder with constant 0x4, an Extend unit feeding a second adder, and a PCSrc mux. All paths are 32 bits.]
Mux control: 0 is lower mux input, 1 upper mux input
Design Instruction Fetch WITH delay slot
[Figure: components to connect: PC register (D, Q, Clk), Instr Mem (Addr, Data), two adders, Extend unit, PCSrc mux. All paths are 32 bits.]
Design Instruction Fetch WITH delay slot
[Figure (answer): PC register and a second register NEWREG (each D, Q, Clk), Instr Mem (Addr, Data), two adders, Extend unit, two muxes controlled by PCSrc, and constants 0x4 and 0x0. All paths are 32 bits.]
On reset:
PC = 32’d0
NEWREG = 32’d4
Q4: Interpreting Schmoo Plots ...
A “Schmoo” plot for a Cell SPU ...
The energy equations:
E(0->1) = (1/2) C Vdd^2
E(1->0) = (1/2) C Vdd^2
1 Joule of energy is dissipated by a 1 Amp current
flowing through a 1 Ohm resistor for 1 second.
Also, 1 Joule of energy is 1 Watt (1 amp
into 1 ohm) dissipating for 1 second.
[Figure: Schmoo plot with operating points P, Q, and R marked. Each square shows chip temperature (C) and power.]
Q4: Part A ...
Operating point P: 1.3 V, 4.8 GHz, 10 W.
Operating point Q: 1.3 V, 2.4 GHz, 5 W.
Q4: Part A answer
Q4: Part B ...
Operating point P: 1.3 V, 4.8 GHz, 10 W.
Operating point Q: 1.3 V, 2.4 GHz, 5 W.
Operating point R: 0.9 V, 2.4 GHz, 1 W.
Q4: Part B answer
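A small sketch for sanity-checking the three operating points against the P ~ F ⨯ Vdd^2 model from Lecture 7. It is not the exam's official answer: it only computes energy per clock cycle and what the simple model would predict when scaling from one point to another. Function names are made up for the example.

```python
# (Vdd in volts, frequency in Hz, power in watts), from the slide
POINTS = {
    "P": (1.3, 4.8e9, 10.0),
    "Q": (1.3, 2.4e9, 5.0),
    "R": (0.9, 2.4e9, 1.0),
}


def energy_per_cycle(name):
    """Joules per clock cycle = watts / (cycles per second)."""
    _, freq, power = POINTS[name]
    return power / freq


def predicted_power(src, dst):
    """Scale the power at src to dst's F and Vdd using P ~ F x Vdd^2."""
    v0, f0, p0 = POINTS[src]
    v1, f1, _ = POINTS[dst]
    return p0 * (f1 / f0) * (v1 / v0) ** 2


for name in POINTS:
    print(name, f"{energy_per_cycle(name) * 1e9:.3f} nJ/cycle")
print("Q predicted from P:", predicted_power("P", "Q"))
print("R predicted from Q:", round(predicted_power("Q", "R"), 3))
```

Under this model, Q follows from P by frequency scaling alone, while the listed 1 W at R is lower than the CV^2-only prediction; the sketch does not try to explain that gap.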
Q5: Visualizing Stalls and Kills
Note: no forwarding muxes, no “==” ID ALU
[Figure: five-stage pipeline. 1 “IF” Stage (Instr Fetch), 2 “ID” Stage (Decode & Reg Fetch), 3 “EX” Stage (Execution), 4 “MEM” Stage (Memory), 5 WB (Write Back). Pipeline registers: IR in each stage, A, B, M, Y, R. Control: WE, MemToReg, Mux,Logic. A wire runs to branch logic.]
Notes:
In BEQ, the I7 denotes the branch target
instruction (if the branch is taken). Look at the
code to figure out if branch is taken or not.
Use N to denote a stage with a muxed-in
NOP instruction.
Fill out the table until all slots of t13 are
filled in. Do not add and fill in t14, t15, etc.
We filled in I1 to get you started.
Program
I1: OR R5,R1,R2
I2: OR R6,R1,R2
I3: BEQ R6,R5,I7
I4: LW R3 0(R5)
I5: OR R7,R5,R6
I6: OR R0,R3,R7
I7: OR R6,R0,R3
I8: OR R5,R0,R1
I9: OR R11,R9,R9
I10: OR R12,R9,R9
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
IF:   I1
ID:       I1
EX:           I1
MEM:              I1
WB:                   I1
Answer:
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
IF:   I1  I2  I3  I4  I4  I4  I4  I5  I7  I8  I8  I8  I9
ID:       I1  I2  I3  I3  I3  I3  I4  N   I7  I7  I7  I8
EX:           I1  I2  N   N   N   I3  I4  N   N   N   I7
MEM:              I1  I2  N   N   N   I3  I4  N   N   N
WB:                   I1  I2  N   N   N   I3  I4  N   N
Q6: Unified Memory and Pipelines
[Figure: five-stage pipeline. 1 “IF” Stage (Instr Fetch), 2 “ID/RF” Stage (Decode & Reg Fetch), 3 “EX” Stage (Execution), 4 “MEM” Stage (Memory), 5 WB (Write Back). Pipeline registers: PC, IR in each stage, A, B, M, Y, R. Control: WE, MemToReg, Mux,Logic. A wire runs to branch logic. NOP mux into IR not shown. PC update logic not shown.]
Policy: Data reads and writes
take precedence over instruction
fetches.
Use N to denote a stage holding a NOP.
Fill out the table until all slots of t13 are
filled in. Do not add and fill in t14, t15, etc.
We filled in I1 to get you started.
Program
I1: LW R1, 0(R0)
I2: LW R2, 0(R1)
I3: LW R3, 0(R1)
I4: LW R4, 0(R3)
I5: LW R5, 0(R3)
I6: LW R6, 0(R4)
I7: OR R5,R6,R5
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
IF:   I1
ID:       I1
EX:           I1
MEM:              I1
WB:                   I1
Answer:
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
IF:   I1  I2  I3  N   N   I4  I5  N   N   I6  I7  N   N
ID:       I1  I2  I3  I3  I3  I4  I5  I5  I5  I6  I7  I7
EX:           I1  I2  N   N   I3  I4  N   N   I5  I6  N
MEM:              I1  I2  N   N   I3  I4  N   N   I5  I6
WB:                   I1  I2  N   N   I3  I4  N   N   I5
Q7: Forwarding Networks
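Forwarding muxes, with numbered inputs
[Figure: pipeline stages ID (Decode), EX, MEM, WB, each with an IR, plus Mux,Logic. Forwarding muxes with numbered inputs drive the A register (inputs 1, 2, 3, 4) and the M register (inputs 1, 2, 3, 5); inputs come from mux outputs and from WB. An “==” comparator in ID feeds the branch logic. Other registers: Y, R.]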
Opcodes to datapath mapping:
OR ws,rs1,rs2
LW ws,imm(rs1)
BEQ rs1, rs2, branch target label
(1) Fill in IF/ID/EX/MEM/WB rows with instruction
number (I1, I2, etc) or N for a stage that holds a
NOP.
(2) Fill in A# with the selected input of the mux
driving the A register needed to fulfill the
programmer's contract (1, 2, 3, 4, or X for don’t care).
(3) Fill in M# with the selected input of the mux driving
the M register needed to fulfill the programmer's contract
(1, 2, 3, 5, or X for don’t care).
Program
I1: OR R5,R1,R4
I2: OR R4,R1,R2
I3: OR R3,R5,R4
I4: OR R3,R1,R2
I5: BEQ R3,R4,I8
I6: LW R3 0(R3)
I7: OR R6,R9,R3
I8: OR R3,R3,R9
I9: OR R9,R6,R3
I10: OR R3,R6,R6
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
IF:   I1
ID:       I1
EX:           I1
MEM:              I1
WB:                   I1
A#:   X
M#:   X
Answer:
      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10
IF:   I1  I2  I3  I4  I5  I6  I8  I9  I9  I10
ID:       I1  I2  I3  I4  I5  I6  I8  I8  I9
EX:           I1  I2  I3  I4  I5  I6  N   I8
MEM:              I1  I2  I3  I4  I5  I6  N
WB:                   I1  I2  I3  I4  I5  I6
A#:   X   4   4   2   4   1   2   X   2   4
M#:   X   5   5   1   5   3   X   X   5   1
Q8: Forwarding Through Registers
A novel forward scheme (regfile mux)
[Figure: pipeline stages ID (Decode), EX, MEM, WB, each with an IR, plus Mux,Logic. Numbered mux inputs: A register mux (inputs 3, 4), M register mux (inputs 3, 5), and register-file wd mux (inputs 1, 2, 3). Other registers: B, Y, R.]
Opcodes to datapath mapping:
OR ws,rs1,rs2
LW ws,imm(rs1)
Fill in A# with the selected input of the mux driving
the A register needed to fulfill the programmer's
contract (3, 4, or X for don’t care).
Fill in M# with the selected input of the mux driving the
M register needed to fulfill the programmer's contract
(3, 5, or X for don’t care).
Fill in wd with the selected input of the mux driving the
wd register file input (1, 2, 3, or X for “don’t care
because there is no write this cycle”)
IF:
ID:
EX:
MEM:
WB:
A#:
M#:
wd:
Program
I1: OR R5,R1,R2
I2: OR R8,R3,R5
I3: OR R7,R8,R5
I4: LW R4 0(R8)
I5: OR R9,R8,R7
I6: OR R3,R9,R7
I7: OR R2,R4,R3
I8: OR R7,R3,R4
I9: OR R5,R7,R9
t8 t9 t10 t11 t12 t13
t1
t2
t3
t4
t5
t6
t7
I1
I2
I1
I3
I2
I1
I4
I3
I2
I1
I5
I4
I3
I2
I1
I6
I5
I4
I3
I2
I7 I8 I9
I6 I7 I8
I5 I6 I7
I4 I5 I6
I3 I4 I5
X
X
X
I9
I8 I9
I7 I8 I9
I6 I7 I8 I9
IF:
ID:
EX:
MEM:
WB:
A#:
M#:
wd:
Program
I1: OR R5,R1,R2
I2: OR R8,R3,R5
I3: OR R7,R8,R5
I4: LW R4 0(R8)
I5: OR R9,R8,R7
I6: OR R3,R9,R7
I7: OR R2,R4,R3
I8: OR R7,R3,R4
I9: OR R5,R7,R9
t8 t9 t10 t11 t12 t13
t1
t2
t3
t4
t5
t6
t7
I1
I2
I1
I3
I2
I1
I4
I3
I2
I1
4
5
X
4
3
3
3
5
3
I5
I4
I3
I2
I1
4
X
3
I6
I5
I4
I3
I2
4
5
X
I7 I8 I9
I6 I7 I8
I5 I6 I7
I4 I5 I6
I3 I4 I5
4
3
4
3
5
5
2
3
1
X
X
X
I9
I8 I9
I7 I8 I9
I6 I7 I8 I9
3
5
Q8: Forwarding Through Registers
Q9: Simple branch predictor
Address of BNEZ instruction: 0b0110[...]01001000
(upper 28 bits: address tag; next 2 bits: line index)
Branch Target Buffer (BTB)
line index   28-bit address tag   target address
0b00
0b01
0b10         0b0110[...]0100      PC + 4 + Loop
0b11
A tag comparison (=) produces Hit.
BNEZ R1 Loop
Branch History Table (BHT): N and L bits for each line.
Update BHT once taken/not taken status is known.
On a miss, replace BTB for the line
with the new branch tag & target.
Next slide defines initial BHT N and L.
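The address fields can be pulled apart with shifts and masks. This sketch assumes 32-bit, word-aligned branch addresses, so the two lowest bits are always zero and are not part of either field; the function name is made up for the example. The sample addresses are the ones that appear in the trace later in this question.

```python
def split_branch_address(addr):
    """Return (28-bit tag, 2-bit BTB line index) for a 32-bit address."""
    index = (addr >> 2) & 0b11   # bits 3..2 pick one of the four BTB lines
    tag = addr >> 4              # bits 31..4 are compared against the stored tag
    return tag, index


for addr in (0x00000034, 0x0000006C, 0x00000058, 0x00000020):
    tag, index = split_branch_address(addr)
    print(f"0x{addr:08X}: tag 0x{tag:07X}, line 0b{index:02b}")
```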
Simple (”2-bit”) Branch History State
“N bit”: Prediction for Next branch
(1 = take, 0 = not take)
“L bit”: Was Last prediction correct?
(1 = yes, 0 = no)
[Figure: two flip-flops (D, Q) holding N and L.]
old N  old L  branch     new N  new L
0      0      not taken  0      1
0      0      taken      1      1
0      1      not taken  0      1
0      1      taken      0      0
1      0      not taken  0      1
1      0      taken      1      1
1      1      not taken  1      0
When replacing the
tag value for a line,
initialize branch
history state to
(N = 1, L = 1)
(for taken branches)
or to
(N = 0, L = 1)
(for “not taken”
branches).
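The state table and the replacement rule can be combined into a short simulation. This is a sketch, not the exam solution: the function names and the dictionary layout are invented here, the (N = 1, L = 1, taken) case is assumed to stay at (1, 1) since that row is not visible in the table, and the initial state and the seven-branch trace are the ones used in this question.

```python
def next_history(n, l, taken):
    """Update rule read off the state table: flip N only after two misses in a row."""
    predicted_taken = (n == 1)
    if predicted_taken == taken:
        return n, 1          # correct prediction: keep N, set L
    if l == 1:
        return n, 0          # first miss: keep N, clear L
    return 1 - n, 1          # second miss in a row: flip N, set L


def access(btb, addr, target, taken):
    """Look up one branch, then update or replace its BTB/BHT line."""
    tag, index = addr >> 4, (addr >> 2) & 0b11
    line = btb[index]
    if line["tag"] == tag:   # hit: update history bits only
        line["n"], line["l"] = next_history(line["n"], line["l"], taken)
    else:                    # miss: replace tag and target, initialize history
        btb[index] = {"tag": tag, "target": target,
                      "n": 1 if taken else 0, "l": 1}


# State before the first instruction in the trace executes.
btb = {
    0b00: {"tag": 0x0, "target": "Lab1", "n": 0, "l": 0},
    0b01: {"tag": 0x3, "target": "Lab4", "n": 1, "l": 0},
    0b10: {"tag": 0x5, "target": "Lab6", "n": 0, "l": 1},
    0b11: {"tag": 0x7, "target": "Lab8", "n": 1, "l": 1},
}

# (branch address, target label, taken?)
trace = [
    (0x00, "Lab1", True), (0x34, "Lab4", False), (0x6C, "Lab7", False),
    (0x58, "Lab6", True), (0x20, "Lab3", True), (0x34, "Lab4", True),
    (0x6C, "Lab7", False),
]

for addr, target, taken in trace:
    access(btb, addr, target, taken)

for index in sorted(btb):
    print(f"0b{index:02b}", btb[index])
```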
Branch predictor state before first inst. in trace executes:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 000         PC + 4 + Lab1   0  0
0b01        0x 0000 003         PC + 4 + Lab4   1  0
0b10        0x 0000 005         PC + 4 + Lab6   0  1
0b11        0x 0000 007         PC + 4 + Lab8   1  1

1. 0x 0000 0000: BEQ R1 R2 Lab1. Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 000         PC + 4 + Lab1   1  1
0b01        0x 0000 003         PC + 4 + Lab4   1  0
0b10        0x 0000 005         PC + 4 + Lab6   0  1
0b11        0x 0000 007         PC + 4 + Lab8   1  1

2. 0x 0000 0034: BEQ R7 R8 Lab4. Not Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 000         PC + 4 + Lab1   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  1
0b10        0x 0000 005         PC + 4 + Lab6   0  1
0b11        0x 0000 007         PC + 4 + Lab8   1  1

3. 0x 0000 006C: BEQ R13 R14 Lab7. Not Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 000         PC + 4 + Lab1   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  1
0b10        0x 0000 005         PC + 4 + Lab6   0  1
0b11        0x 0000 006         PC + 4 + Lab7   0  1

4. 0x 0000 0058: BEQ R11 R12 Lab6. Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 000         PC + 4 + Lab1   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  1
0b10        0x 0000 005         PC + 4 + Lab6   0  0
0b11        0x 0000 006         PC + 4 + Lab7   0  1

5. 0x 0000 0020: BNE R5 R6 Lab3. Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 002         PC + 4 + Lab3   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  1
0b10        0x 0000 005         PC + 4 + Lab6   0  0
0b11        0x 0000 006         PC + 4 + Lab7   0  1

6. 0x 0000 0034: BEQ R7 R8 Lab4. Taken. State after:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 002         PC + 4 + Lab3   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  0
0b10        0x 0000 005         PC + 4 + Lab6   0  0
0b11        0x 0000 006         PC + 4 + Lab7   0  1

7. 0x 0000 006C: BEQ R13 R14 Lab7. Not Taken.
Q4 Answer: Branch predictor state after 7 branches complete:
line index  28-bit address tag  target address  N  L
0b00        0x 0000 002         PC + 4 + Lab3   1  1
0b01        0x 0000 003         PC + 4 + Lab4   0  0
0b10        0x 0000 005         PC + 4 + Lab6   0  0
0b11        0x 0000 006         PC + 4 + Lab7   0  1
On Tuesday
Mid-term I ...
Good luck !