Transcript Memory

Lecture 5 Procedures and Processor
Peng Liu
[email protected]
1
Procedure Call and Return
• Procedures are required for structured programming
– Aka: functions, methods, subroutines, …
• Implementing procedures in assembly requires several things to be
done
– Memory space must be set aside for local variables
– Arguments must be passed in and return values passed out
– Execution must continue after the call
• Procedure Call Steps
– Place parameters in a place where the procedure can access
them
– Transfer control to the procedure
– Acquire the storage resources needed for the procedure
– Perform the desired task
– Place the result value in a place where the calling program
can access it
– Return control to the point of origin
2
MIPS Storgae Layout
3
Procedure Activation Record (Frame)
• Each procedure creates an activation record on the stack
4
Register Assignments
Calling Convention
5
Caller vs. Callee Saved Registers
Preserved
No Preserved
Saved register ($s0-$s7)
Temporary register ($t0-t9)
Stack/frame pointer ($sp, $fp, $gp)
Argument registers ($a0-$a3)
Return address ($ra)
Return values ($v0-$v1)
• Preserved registers (Callee Save)
– Save register values on stack prior to use
– Restore registers before return
• Not preserved registers (Caller Save)
– Do what you please and expect callees to do likewise
– Should be saved by the caller if needed after procedure call
6
Call and Return
• Caller
– Save caller-saved register $a0-$a3, $t0-$t9, if necessary
– Load arguments in $a0-$a3, rest passed on stack
– Execute jal instruction
• Callee Setup
– Allocate memory for new frame ($sp = $sp - frame)
– Save callee-saved registers $s0-$s7, $fp, $ra, if necessary
– Set frame pointer ($fp = $sp + frame size - 4)
• Callee Return
–
–
–
–
Place return value in $v0 and $v1
Restore any callee-saved registers
Pop stack ($sp = $sp + frame size)
Return by jr $ra
7
Calling Convention Steps
8
Simple Example
9
How To Build A Processor
(or any complex hardware)
• Break operation down to steps even gates can
understand
• Generally decompose digital systems into two kinds of
operation
– Things that deal with the real data (Datapath)
– Things that control the stuff operating on the real data
(Control)
• Find a decomposition that is simple, and efficient
– Some are obvious, others can be more subtle
• We will start simple
– Add stuff to improve performance
10
How to Execute Instructions in Five Easy Steps
• First we need to:
– Fetch the instruction
• Then we need to:
– Decode instruction/fetch register
• Then we need to:
– Do the operation
• Then we need to:
– Write the result into register file
• Finally we need to:
– Calculate the next instruction address
11
Subset of Instructions
• To simplify our study of processor design, we will focus on
a subset of the MIPS instructions
– Memory: lw and sw
– Arithmetic: addu, subu, and, ori, and slt
– Branch: beq and j
• The method of implementing other instructions should
come naturally from these
12
Starting The Design
• Think of the steps needed for each instruction execution
– Instructions could take multiple cycles or one cycle
• Steps that occur on each instruction:
–
–
–
–
–
Fetch instruction from memory – address is specified by PC
Read one or two registers
Do add/sub/etc. using ALU
Fetch a value from memory
Store results to register-file/memory
• Needed state for single cycle machine:
– Instruction pointer (PC), 32 register, memory
13
Logic Design Basics
• Information encoded in binary
– Low voltage = 0, High voltage = 1
– One wire per bit
• Multi-bit data encoded on multi-wire buses
• Combinational element
– Operate on data
– Output is a function of input
– If you wait long enough you will get the right answer
• State (sequential) elements
– Store information
– Need to separate signals across clock cycles
– This is usually done with flip-flops (or latches)
• We will take about flop-based design in this class
14
Combinational Elements
15
Inverters: A simple transistor model
pFET.
A switch.
“On” if
gate is
grounded.
“1”
This model is
too simple to
be useful ...
“0”
“1”
“1”
“0”
“0”
nFET.
A switch.
“On” if
gate is
at Vdd.
16
Transistors as water valves
If electrons are water molecules,
and a capacitor a bucket ...
“1”
A “on” p-FET fills
up the capacitor
with charge.
“0”
Time
Water level
“1”
A “on” n-FET
empties the bucket.
“0”
Water level
This model is often good enough ...
Time
17
What is the bucket? A gate’s “fan-out”.
“Fan-out”: The number of gate inputs
driven by a gate’s output.
Driving other gates slows a gate down.
Driving wires slows a gate down.
18
A closer look at fan-out ...
Driving
more
gates adds
delay.
Linear
model
works for
reasonable
fan-out
19
Propagation delay graphs ...
1->0
20
Clocking Methodology
• Combinational logic transforms data during clock cycles
– Between clock edges
– Input from state elements, output to state element
– Longest delay determines clock period
21
Sequential Elements
• Flip-flops (or registers): stores data in a circuit
– Uses a clock signal to determine when to update the stored
value
– Edge-triggered: update when Clk changes from 0 to 1
22
Sequential Elements
• Flip-flops with write control
– Only updates on clock edge when write control input is 1
– Used when stored value is required later
23
Single cycle data paths: Assumptions
Processor uses
synchronous logic
design (a “clock”).
f
T
1 MHz
1 μs
10 MHz
100 ns
100 MHz
10 ns
1 GHz
1 ns
D
All state elements act
like positive edgetriggered flip flops.
clk
Q
24
Review: Edge-Triggering in Verilog
D
Q
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
module ff(D, Q, CLK);
CLK
input D, CLK;
output Q;
Module code
has two bugs.
Where?
always @ (CLK)
Q <= D;
endmodule
25
Review: Edge-Triggered D Flip Flops
D
Q
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
module ff(D, Q, CLK);
CLK
input D, CLK;
output Q;
reg Q;
always @ (posedge CLK)
Q <= D;
endmodule
26
Timing Analysis and Logic Delay
Register:
An Array of
Flip-Flops
Combinational Logic
If T > worst-case delay through CL,
does this ensure correct operation?
27
Flip Flops have internal delays ...
D
Q
Value of D is sampled on positive clock edge.
Q outputs sampled value for rest of cycle.
t_setup
CLK
D
Q
t_clk-to-Q
28
Flip-Flop delays eat into “time budget”
Combinational Logic
ALU “time budget”
29
Clock skew also eats into “time budget”
CLKd
CLKd
As T →0,
which circuit
fails first?
CLKd
30
Some Flip Flops have “hold” time ...
t_setup
t_inv
t_hold
CLK
D
Q
D
D must
stay
stable
here
What is the intended function
of this circuit?
CLK
Does flip-flop hold time
affect operation of this
circuit? Under what
conditions?
t_clk-to-Q + t_inv > t_hold
For correct operation.
31
Critical Timing Issues
• Flops work great as long as input is stable when clock
rises
– Called setup and hold windows
– Clock skew can cause some nasty problems
• Hold time violations
– Cycle Time = longest Prop Delay + Setup + Clock Skew
32
Memory Structure
• Memory structures are generally specially designed
– Could build them from flops or latches
• But they would be big, slow, and power hungry
– So circuit designers create the basic design
• Create a module generator for logic designers to use
33
Read from/Write to Memory
• Interface to Memory can be:
– Combinational (asynchronous)
– Clocked (synchronous)
• Combinational memory:
– Read data is valid some delay after address lines settle
• There is no clock.
– Writes are tricky: must supply a write pulse in the middle of
your address and data valid times
• Clocked memory (most common):
– Memory looks like a standard synchronous device.
– Address and control signals are sampled on rising edge of
clock, and data is valid some number of cycles later
34
Memory Timing
35
Memories In Our Design
• They will be combinational
– Otherwise we can’t complete an instruction in one cycle!
• Interface is simple:
– Inputs:
32
Addr
• Address
• DataIn
• WriteEn (WriteEn must be a pulse)
– Outputs:
Data Memory
32
Dout
Din
32
WE
• Dataout
• Register file:
– It has three address, two for reads, and one for write
– It is called a 3-port, since it can perform 3 accesses per cycle
36
MIPS Register file: From the top down
clk
sel(ws)
5
“two read ports”
sel(rs1)
Why is R0 special?
R0 - The constant 0
D
D
E
WE M
.
U
..
X
D
Q
En
R1
Q
En
R2
Q
D
En
R31
.
..
Q
32
rd1
sel(rs2)
5
32
.
..
32
wd
M
U
X
32
.
..
...
5
32
M
U
X
32
rd2
32
37
Register File Schematic Symbol
Why do we need WE?
If we had a MIPS register file w/o WE,
how could we work around it?
5
5
5
RegFile
rs1
rd1
rs2
32
ws
32
wd
32
rd2
WE
38
The First Task: Fetching The Instruction (IF)
• Not that complex
• Instr = Mem[PC]
– Fetch the instruction from memory
• Update program counter for next cycle
– What is the address of the next instruction?
39
Instruction Fetch
40
What Did We Fetch?
41
Nice Characteristics of MIPS Machine Code
• Instructions are fixed length
– Don’t need to decode first instruction to find next one
– Always add 4 bytes to instruction pointer (PC)
• Register specifiers are always in the same place
– Destination moves around some, but
– Source register are always in the same place
• Or you don’t need that register
– Can fetch the registers BEFORE you decode instruction
• Feed bits directly from the instruction memory
42
Register to Register Operations
• In our subset this is only addu and subu
– I do not want to worry about overflow
addu rd, rs, rt
subu rd, rs, rt
• Operation
R[rd] <- R[rs] + R[rt]; Add operation
R[rd] <- R[rd] + R[rt]; Sub operation
43
Datapath: R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
44
OR Immediate RTL
• OR Immediate instruction
– ori rt, rs, imm
– R[rt] <- R[rs] OR ZeroExt(imm);
• Means I need to get instr[15:0] into the datapath, on RT
path
45
Datapath: Immediate Ops
•
•
•
•
•
Extend datapath to support immediate operations
Write register is rt or rd based on instruction
Read data 2 is ignored for immediates
Immediates can be sign or zero extended
ALUsrc and ALU operation set based on instruction
46
Load
• Load instruction
– lw rt, rs, imm
– Addr <- R[rs] + SignExt(imm); Compute memory address
– R[rt] <- Mem[Addr];
Load data into register
• Notice this will use the immediate path as well
47
Datapath: Load
• Extend datapath to support other immediate operations
• Extender handlers either sign or zero extension
• MUX selects between ALU result and Memory output
48
Store
• Store instruction
– sw rt, rs, imm
– Addr <-R[rs] + SignExt(imm); Compute memory addr
– Mem[Addr] <- R[rt];
Load data into register
Bits
6
5
5
16
OP
rs
rt
imm
First
Source
Register
Second
Source
Register
Immediate
49
Datapath: Store
• Read Register 2 is passed on to Memory
• Memory address calculated just as in lw case
50
Branch
• Branch instruction
– beq rs, rt, imm
Cond <- R[rs] – R[rt];
Calculate branch condition
If (cond eq 0)
Test if equal
PC <- PC + 4 + SignExt(imm)*4
else
PC <-PC + 4;
Calculate next address
Bits
6
5
5
16
OP
rs
rt
imm
First
Source
Second
Source
Immediate
51
The Next Address
• PC is byte-addressed into instruction memory
– Sequential
• PC [31:0] = PC[31:0] + 4
– Branch operation
• PC[31:0] = PC[31:0] + 4 + SignExt(imm)x4
• Instruction Addresses
– PC is byte addressed, but instructions are 4 bytes long
– Simplify hardware by using 30 bit PC
• Sequential
• PC [31:2] = PC[31:2] + 1
• Branch operation
• PC[31:2] = PC[31:2] + 1 + SignExt(imm)
52
Datapath: IF + Branch
53
Jump RTL
• Jump instruction
– j target
– PC[31:12]<- PC[31:29] || target[25:0];
Bits
6
OP
26
target
Jump Target Address
54
Datapath: IFU + Jump
• MUX selects pseudodirect jump target
55
Putting it All Together
56
Control
• Since every instruction takes one cycle, control is state
free!
– It is just decoded instruction bits
• There are also few control points
– Control the multiplexers
– Operation type for the ALU
– Write control on the Instruction & data memories
• First part of cycle does not have any control
– Which is good, since we don’t have instruction yet
• Look at setting of the control points for different
instructions
57
What is single cycle control?
Instr
Mem
Combinational Logic
(Only Gates, No Flip Flops)
Equal
32
Addr
Data
Just specify logic functions!
RegDest
PCSrc
RegWr
ExtOp
rs,rt,rd,imm
5
5
5
RegDest
32
ALUsrc
RegFile
rs1
rd1
rs2
32
ws
32
wd
MemToReg
rd2
MemWr
ALUctr
Equal
WE
Ext
RegWr
ExtOp
MemToReg
ALUsrc
MemWr
58
At Beginning of Clock Cycle
59
Control for Arithmetic
60
Instruction Fetch at End
61
Arithmetic Immediate (ori)
62
Control for Load
63
Control for Store
64
Control for Branch (beq)
65
Control for Jump (j)
66
Summary of Control Signals
67
Multilevel Decoding
• Since only the ALU needs the func field
– Pass it to the ALU unit, and have a local decoder there
68
Multilevel Decoding (cont)
69
Putting It All Together
70
Single Cycle Processor
• Advantages
– Single cycle per instruction makes logic and clock simple
• Disadvantages
– Inefficient utilization of memory and functional units since
different instructions take different lengths of time
• ALU only computes values a small amount of the time
– Cycle time is the worst case path -> long cycle times
• Load instruction
– Best possible CPI is 1
71
Two goals when specifying control logic
Bug-free: One “0” that should be a “1”
in the control logic function breaks
contract with the programmer.
Should be easy for humans to
read and understand: sensible
signal names, symbolic
constants ...
Efficient: Logic function specification
should map to hardware with good
performance properties: fast, small, low
power, etc.
72
In practice: Use behavioral Verilog
Advice: Carefully written Verilog will yield
identical semantics in ModelSim and Synplicity.
If you write your code in this way, many “works
in Modelsim but not on Xilinx” issues
disappear.
Always check log files, and inspect output tools produce!
Look for tell-tale Synplicity “warnings
and errors” messages !
“latch generated”, “combinational loop detected”,
etc
Automate with scripts if possible.
73
Increasing Parallelism
• Problem:
– Each functional unit used once per cycle
– Most of the time it is sitting waiting for its turn
• Well it is calculating all the time, but it is waiting for valid data
– There is no parallelism in this arrangement
• Making instructions take more cycles makes machine
faster!
– Increases the parallelism going on in the machine
– We will look at a 5 stage pipeline
• Modern machines have order 20 cycles/instruction
74
Acknowledgements
• These slides contain material from courses:
– UCB CS152
– Stanford EE108B
75
Homework
• Check the web site for HW3
• Read COD Appendices C and Chapter4.1-4.4
76