Transcript Lecture 3

EENG 449bG/CPSC 439bG
Computer Systems
Lecture 3
MIPS Instruction Set
&
Intro to Pipelining
January 20, 2004
Prof. Andreas Savvides
Spring 2004
http://www.eng.yale.edu/courses/eeng449bG
The MIPS Architecture
Features:
• GPRs with a load-store architecture
• Displacement, immediate, and register indirect addressing modes
• Data sizes: 8-, 16-, 32-, 64-bit integers and 64-bit floating point numbers
• Simple instructions: load, store, add, subtract, move register-register, shift
• Compare equal, compare not equal, compare less, branch, jump, call, and return
• Fixed instruction encoding for performance, variable instruction encoding for size
• Provide at least 16 general purpose registers
MIPS Architecture Features
Registers:
• 32 64-bit GPRs (R0, R1…R31)
– Note: R0 is always 0 !!!
• 32 64-bit Floating Point Registers (F0,F1… F31)
Data types:
• 8-bit bytes, 16-bit half words, 32-bit words, 64-bit double words for integer data
• 32-bit single precision and 64-bit double precision floating point numbers
Addressing Modes:
• Immediate (e.g. ADD R4, #3 --- Regs[R4] <- Regs[R4] + 3)
• Displacement (e.g. ADD R4, 100(R1) --- Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]])
• Register indirect (place 0 in the displacement field)
  – E.g. ADD R4, 0(R1)
• Absolute addressing (use R0, which is always 0, as the base register)
  – E.g. ADD R4, 1000(R0)
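As a rough side illustration (not from the lecture), a tiny C sketch of how these modes reduce to one effective-address computation; the register contents and memory values below are made up:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint64_t regs[32] = {0};    /* R0..R31; R0 stays 0 by convention     */
      uint8_t  mem[4096] = {0};   /* toy byte-addressable memory           */

      regs[1] = 100;              /* pretend R1 holds a base address       */
      mem[200] = 7;               /* pretend the byte at address 200 is 7  */

      /* Displacement, e.g. LB R4, 100(R1): EA = 100 + Regs[R1]            */
      regs[4] = mem[100 + regs[1]];

      /* Register indirect is displacement with offset 0: LB R4, 0(R1)     */
      uint64_t indirect = mem[0 + regs[1]];

      /* Absolute uses R0 (always 0) as the base: LB R4, 1000(R0)          */
      uint64_t absolute = mem[1000 + regs[0]];

      printf("%llu %llu %llu\n", (unsigned long long)regs[4],
             (unsigned long long)indirect, (unsigned long long)absolute);
      return 0;
  }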
MIPS Instruction Format
op – opcode (basic operation of the instruction)
rs – first register source operand
rt – second register source operand
rd – register destination operand
shamt – shift amount
funct – function code (selects the specific variant of the operation)
Example: LW $t0, 1200($t1) (I-format)
  decimal: op = 35, rs = 9 ($t1), rt = 8 ($t0), immediate = 1200
  binary:  100011 01001 01000 0000 0100 1011 0000
Note: The numbers for these examples are from "Computer Organization & Design", Chapter 3.
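A small C sketch (my own illustration, not from the lecture) that packs these I-format fields into a 32-bit word using the standard MIPS 6/5/5/16-bit field widths and reproduces the binary above:

  #include <stdint.h>
  #include <stdio.h>

  /* Pack a MIPS I-format instruction: op(6) rs(5) rt(5) immediate(16). */
  static uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, uint32_t imm)
  {
      return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFFu);
  }

  int main(void)
  {
      /* LW $t0, 1200($t1): op = 35, rs = 9 ($t1), rt = 8 ($t0), imm = 1200 */
      uint32_t word = encode_i(35, 9, 8, 1200);
      printf("0x%08X\n", word);   /* prints 0x8D2804B0 */
      return 0;
  }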
MIPS Instruction Format
(op, rs, rt, rd, shamt, and funct are as defined on the previous slide)
Example: ADD $t0, $s2, $t0 (R-format)
  decimal: op = 0, rs = 18 ($s2), rt = 8 ($t0), rd = 8 ($t0), shamt = 0, funct = 32
  binary:  000000 10010 01000 01000 00000 100000
Note: The numbers for these examples are from "Computer Organization & Design", Chapter 3.
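The same kind of sketch for the R-format fields (again my own illustration), packing op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) and matching the binary above:

  #include <stdint.h>
  #include <stdio.h>

  /* Pack a MIPS R-format instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6). */
  static uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                           uint32_t rd, uint32_t shamt, uint32_t funct)
  {
      return (op << 26) | (rs << 21) | (rt << 16) |
             (rd << 11) | (shamt << 6) | funct;
  }

  int main(void)
  {
      /* ADD $t0, $s2, $t0: op = 0, rs = 18 ($s2), rt = 8 ($t0),
         rd = 8 ($t0), shamt = 0, funct = 32 */
      uint32_t word = encode_r(0, 18, 8, 8, 0, 32);
      printf("0x%08X\n", word);   /* prints 0x02484020 */
      return 0;
  }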
MIPS Instruction Format
(op, rs, rt, rd, shamt, and funct are as defined above)
Example: j 10000 (J-format)
  decimal: op = 2, target address = 10000
  binary:  ? ?
You fill it in!
MIPS Operations
Four broad classes supported:
1. Loads and stores (figure 2.28)
   • Different data sizes: LD, LW, LH, LB, LBU …
2. ALU operations (figure 2.29)
   – Add, sub, and, or …
   – They are all register-register operations
3. Control flow instructions (figure 2.30)
   – Branches (conditional) and jumps (unconditional)
4. Floating point operations
Levels of Representation
High Level Language Program:
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
      | Compiler
Assembly Language Program:
  lw $t15, 0($t2)
  lw $t16, 4($t2)
  sw $t16, 0($t2)
  sw $t15, 4($t2)
      | Assembler
Machine Language Program:
  0000 1010 1100 0101 1001 1111 0110 1000
  1100 0101 1010 0000 0110 1000 1111 1001
  1010 0000 0101 1100 1111 1001 1000 0110
  0101 1100 0000 1010 1000 0110 1001 1111
      | Machine Interpretation
Control Signal Specification
Execution Cycle
Instruction Fetch:   obtain instruction from program storage
Instruction Decode:  determine required actions and instruction size
Operand Fetch:       locate and obtain operand data
Execute:             compute result value or status
Result Store:        deposit results in storage for later use
Next Instruction:    determine successor instruction
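To make the cycle concrete, here is a minimal C sketch of an interpreter loop organized around these six steps; the 16-bit instruction format and the two opcodes are invented purely for illustration:

  #include <stdint.h>
  #include <stdio.h>

  /* Toy machine: 16-bit instructions laid out as op(8) | operand(8). */
  enum { OP_HALT = 0, OP_ADDI = 1 };   /* made-up opcodes */

  int main(void)
  {
      uint16_t imem[] = { (OP_ADDI << 8) | 5, (OP_ADDI << 8) | 7, OP_HALT };
      uint32_t acc = 0;
      uint32_t pc  = 0;

      for (;;) {
          uint16_t inst = imem[pc];        /* Instruction Fetch         */
          uint8_t  op   = inst >> 8;       /* Instruction Decode        */
          uint8_t  imm  = inst & 0xFF;     /* Operand Fetch (immediate) */
          if (op == OP_HALT)
              break;
          uint32_t result = acc + imm;     /* Execute                   */
          acc = result;                    /* Result Store              */
          pc  = pc + 1;                    /* Next Instruction          */
      }
      printf("acc = %u\n", acc);           /* prints acc = 12 */
      return 0;
  }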
5 Steps of MIPS Datapath
Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

[Datapath figure: Next PC / Next SEQ PC logic (adder, +4, MUX), instruction memory and address, register file (RS1, RS2, RD), sign extension of the immediate, ALU with Zero? output, data memory (LMD), and the MUXes that select the WB data.]
Announcements
• Homework 1 is out
  – Chapter 1: Problems 1.2, 1.3, 1.17
  – Chapter 2: Problems 2.5, 2.11, 2.12, 2.19
  – Appendix A: Problems A.1, A.5, A.6, A.7, A.11
  – Due Thursday, Feb 5, 2:00pm
• Note the paper on DSP processors on the website
• Reading for this week: Patterson and Hennessy, Appendix A
  – This lecture covers A.1 and A.2; the next lecture will cover the rest of the appendix
• Need to form teams for projects
  – Select a topic
  – Sign up for group appointments with me
List of Possible Projects
• Power saving schemes in embedded microprocessors
• Embedded operating system enhancements and
scheduling schemes for sensor interfaces
– Available operating systems: TinyOS, PALOS, uCOS-II
• Time synchronization in sensor networks and its
hardware implications
• Efficient microcontroller interfaces and control
mechanisms for articulated nodes
• Network protocols and/or data memory management
for sensor networks
• I also encourage you to propose your own project
Introduction to Pipelining
Pipelining – leverage parallelism in hardware by overlapping instruction execution.
Fast, Pipelined Instruction Interpretation
[Figure: the interpretation loop as a pipeline of stages, Instruction Address -> Instruction Fetch -> Instruction Register -> Decode & Operand Fetch -> Operand Registers -> Execute -> Result Registers -> Store Results -> Registers or Mem -> back to Next Instruction. The accompanying timing diagram shows successive instructions overlapped in time, each one stepping through NI, IF, D, E, and W one cycle behind its predecessor.]
Sequential Laundry
[Figure: task order vs. time, 6 PM to midnight. Each load takes a 30-minute wash, a 40-minute dry, and a 20-minute fold; loads A, B, C, and D run strictly one after another.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry
Start work ASAP
[Figure: task order vs. time, 6 PM onward. Load A washes for 30 minutes, then the four loads move through the 40-minute dryer back-to-back, and the last load finishes with its 20-minute fold.]
• Pipelined laundry takes 3.5 hours for 4 loads
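A quick check of the two laundry numbers (my own arithmetic, using the 30/40/20-minute stage times from the charts):

  #include <stdio.h>

  int main(void)
  {
      int wash = 30, dry = 40, fold = 20, loads = 4;   /* minutes */

      /* Sequential: every load runs start to finish before the next begins. */
      int sequential = loads * (wash + dry + fold);    /* 360 min = 6 hours   */

      /* Pipelined: after the first wash, the 40-minute dryer (the slowest
         stage) paces the line; the final load still needs its 20-minute fold. */
      int pipelined = wash + loads * dry + fold;       /* 210 min = 3.5 hours */

      printf("sequential = %d min, pipelined = %d min\n", sequential, pipelined);
      return 0;
  }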
Pipelining Lessons
[Figure: the pipelined laundry chart again, 6 PM to about 9 PM, used to illustrate the lessons below.]
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup
Instruction Pipelining
• Execute billions of instructions, so throughput is what matters
  – except when?
• What is desirable in instruction sets for pipelining?
  – Variable length instructions vs. all instructions the same length?
  – Memory operands part of any operation vs. memory operands only in loads or stores?
  – Register operands in many places in the instruction format vs. registers located in the same place?
Requirements for Pipelining
Goal: Start a new instruction at every cycle
What are the hardware implications?
• Two different tasks should not attempt to use the same datapath resource on the same clock cycle
• Instructions should not interfere with each other
• Need to have separate data and instruction memories
• Need increased memory bandwidth
  – A 5-stage pipeline operating at the same clock rate as the unpipelined version requires five times the memory bandwidth
• Need to introduce pipeline registers
• The register file is used in two places, the ID and WB stages
  – Perform reads in the first half of the cycle and writes in the second half
Pipeline Requirements…
• Register file: read in the first half of the cycle, write in the second half
• Need separate instruction and data memories; otherwise there is a structural hazard
Add registers between pipeline stages:
• Prevent interference between 2 instructions
• Carry data from one stage to the next
• Edge triggered
Pipelining Hazards
Hazards: circumstances that would cause incorrect execution if the next instruction were launched
• Structural hazards: attempting to use the same hardware to do two different things at the same time
• Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
• Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
Common solution: "stall" the pipeline until the hazard is resolved, by inserting one or more "bubbles" in the pipeline
Data Hazards
Occur when the relative timing of instructions is altered because of pipelining.
Consider the following code, where every instruction after the DADD uses R1, the result the DADD produces:

  DADD R1, R2, R3
  DSUB R4, R1, R5
  AND  R6, R1, R7
  OR   R8, R1, R9
  XOR  R10, R1, R11
Data Hazard
Data Hazards: Data Forwarding
Data Hazards Requiring Stalls
  LD   R1, 0(R2)
  DSUB R4, R1, R5
  AND  R6, R1, R7
  OR   R8, R1, R9

HAVE to stall for 1 cycle: the loaded value is not available until the end of the MEM stage, which is too late to forward to the DSUB's EX stage.
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – "Squash" instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven't calculated branch target address in MIPS
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
  – Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n
      ........
      branch target if taken
    (the n sequential successors make up a branch delay of length n)

  – 1 slot delay allows proper decision and branch target address in a 5 stage pipeline
  – MIPS uses this
Delayed Branch
• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when branch taken
  – From fall through: only valuable when branch not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Pipelining Performance Issues
Consider an unpipelined processor with a 1 ns clock cycle:
                                     Frequency
  4 cycles for ALU operations          40%
  4 cycles for branches                20%
  5 cycles for memory operations       40%
Pipelining adds 0.2 ns of overhead to the clock cycle.

For the unpipelined processor:
  Average instruction execution time = Clock cycle x Average CPI
                                     = 1 ns x ((40% + 20%) x 4 + 40% x 5)
                                     = 4.4 ns
Speedup from Pipelining
Now if we had a pipelined processor, we assume that each instruction takes 1 cycle,
BUT we also have the 0.2 ns overhead, so each instruction takes 1 ns + 0.2 ns = 1.2 ns.

  Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined
                          = 4.4 ns / 1.2 ns
                          = 3.7 times
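The same arithmetic as a short C check (my own; the cycle counts, frequencies, and 0.2 ns overhead are the values given above):

  #include <stdio.h>

  int main(void)
  {
      double clock    = 1.0;   /* ns, unpipelined clock cycle      */
      double overhead = 0.2;   /* ns added by pipelining the clock */

      /* Unpipelined: CPI weighted by the instruction frequencies. */
      double cpi           = 0.40 * 4 + 0.20 * 4 + 0.40 * 5;   /* = 4.4    */
      double t_unpipelined = clock * cpi;                      /* = 4.4 ns */

      /* Pipelined: ideal CPI of 1, but the clock stretches by the overhead. */
      double t_pipelined = 1.0 * (clock + overhead);           /* = 1.2 ns */

      printf("unpipelined %.1f ns, pipelined %.1f ns, speedup %.1fx\n",
             t_unpipelined, t_pipelined, t_unpipelined / t_pipelined);  /* ~3.7x */
      return 0;
  }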
Considering the stall overhead
  Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined

                          = (CPI unpipelined x Clock cycle unpipelined) / (CPI pipelined x Clock cycle pipelined)

                          = (CPI unpipelined / CPI pipelined) x (Clock cycle unpipelined / Clock cycle pipelined)

  CPI pipelined = Ideal CPI + Average stall cycles per instruction

With an ideal pipelined CPI of 1 and equal clock cycles:

  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

And when the unpipelined CPI equals the pipeline depth:

  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Cycle time unpipelined / Cycle time pipelined)
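A small helper (my own sketch) for the last formula; the 5-stage depth, 0.5 stall cycles per instruction, and equal cycle times plugged in below are made-up example values:

  #include <stdio.h>

  /* Speedup = (pipeline depth / (1 + stall CPI)) * (unpipelined cycle / pipelined cycle),
     assuming the pipelined processor has an ideal CPI of 1. */
  static double pipeline_speedup(double depth, double stall_cpi,
                                 double cycle_unpipelined, double cycle_pipelined)
  {
      return (depth / (1.0 + stall_cpi)) * (cycle_unpipelined / cycle_pipelined);
  }

  int main(void)
  {
      /* Hypothetical 5-stage pipeline, 0.5 stall cycles per instruction,
         same clock cycle before and after pipelining. */
      printf("speedup = %.2fx\n", pipeline_speedup(5.0, 0.5, 1.0, 1.0));  /* 3.33x */
      return 0;
  }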