No Slide Title

Download Report

Transcript No Slide Title

Pipelining Basics
• Assembly line concept
• An instruction is executed in multiple steps
• Multiple instructions overlap in execution
• A step in a pipeline is called a pipe stage or pipe
segment
• The time required for a step is called a Machine Cycle
• Machine cycle in determined by the slowest pipe stage,
usually one clock cycle
Pipelining Benefits
• Ideally speaking
- In perfectly balanced pipe stages,
Tiem per instruciton 
Time per instruciton in unpipeline d machine
Numbers of pipe stages
• But in reality:
- Stages are not perfectly balanced
- There are overheads due to pipelining
• Two ways to look at the improvement due to pipelining
1.
Decrease in CPI
2.
Decrease in Clock Cycle Time
DLX Implementation without Pipelining
•Every DLX instruction is executed in at most 5 clock cycles
1.
2.
Instruction fetch cycle (IF):
IR
Mem [PC]
NPC
PC + 4
Instruction decode/register fetch cycle (ID):
A
Regs [IR6..10];
B
Regs [IR11..15];
Imm
((IR16)16 ## IR16..31)
3.
Execution/effective address cycle (EX);
The ALU operates on the operands prepared in the
prior cycle, performing one of four functions depending
on the DLX instruction type.
•
Memory reference:
ALUOutput
•
A + Imm;
Register - Register ALU instruction:
ALUOutput
A func B;
•
Register - immediate ALU instrucion:
ALUOutput
•
4.
A op Imm;
Branch:
ALUOutput
NPC + Imm;
Cond
(A op 0)
Memory access/branch completion cycle (MEM):
The only DLX instructions active in this cycle are
loads, stores, and branches.
•
Memory reference:
LMD
Mem [ALUOutput] or
Mem [ALUOutput]
•
B;
Branch:
if (cond) PC
5.
ALUOutput else PC
Write - back cycle (WB):
•
Register - Register ALU instruction:
Regs [IR16..20]
•
Register - Immediate ALU instruction:
Regs [IR11..15]
•
ALUOutput;
ALUOutput;
Load instruction:
Regs [IR11..15]
LMD;
NPC
CPI for DLX
Branch and store require only 4 cycles
•
Branch frequency is 12%
• Store frequency is 5%
What is CPI?
CPI = 4 (.12 + .5) + 5 (.83) = 4.83
Can we improve it?
CPI for DLX (Cont’d)
• ALU instructions are idle during MEM cycle
• We can complete ALU instruction in 4 cycles
• Assume ALU instructions frequency is 47%
The improved CPI = 4 (012 + .05 + .47) + 5 (.36)
= 4.36
Improvement
= 4.83 / 4.36 = 1.1
Basic Pipeline for DLX
• Each clock cycle, the hardware initiates a new
instruction
• Each instruction still takes 5 clock cycles to complete
• During each clock cycle, each stage may be working
on a different instruction
• There may be upto 5 different instructions in the
pipeline at a time
• We need to make sure that we do not cause resource
contention
Basic pipeline for DLX (cont’d)
To allow overlapping execution of multiple instruction
• Separate instruction and date memories (to avoid
conflict in CC4)
• Register file, needs 2 reads and 1 write in each cycle
• Increment and store the PC every clock, during IF stage.
(raises the issue of how Branches are handled)
• All operations in a pipe stage must be completed in one
clock cycle.
• Values passed from one stage to the next must be placed
in registers, called pipeline registers or pipeline latches.
Control
• Set the Multiplexer (MUX) controls
• Depends on the instruction in the IR
• Top ALU input MUX is set by whether
instruction in a branch or not (if branch, select
ID/EX.NPC)
• Bottom ALU input MUX is set by whether it is a
register - register ALU instruction or not.
Performance Issues
• Pipelining increases instruction throughput - i.e
the number of instructions completed per unit
time.
• Pipelining does not reduce the execution time of
an individual instruction - it usually increases it
slightly (due to pipelining overhead)
• But increase in throughput means the program
runs faster, even though no single instruction runs
faster!
Pipeline hazards
• Structural Hazards: due to resource contention
• Date Hazards: due to data dependency
• Control Hazards: due to branches, etc
Hazards may require the pipeline to stall
Speedup from Pipelining
Speedup from pipelining 
=
Avereage instruction time unpiplined
Average instruction time pipelined
CPI unpipeline d Clode cycle unpipeline d
X
CPI pipelined
Clode cycle pipelined
Ideal CPI on a pipelined machine is almost always!
CPI pipelined  1  pipeline stall clock cycles per instructio n
Speedup from Pipelining (cont’d)
Ignore the clock cycle overhead due to pipelining, i.e.
Clock cycle unpipelined = Clock cycle pipelined
Speedup =
CPI unpipeline d
1  pipelined stall cycles per instructions
CPI unpipelined = depth of the pipeline
Speedup =
pipeline depth
1  pipelined stall cycles per instructio n