Transcript Execute
More Pipelining and
Instruction Level Parallelism
CE 140 A1/A2
16 July 2003
Pipeline Performance
Effective Pipeline CPI (clock cycles
per instruction) = Ideal Pipeline CPI +
Structural Stalls + RAW Stalls + WAR
Stalls + WAW Stalls + Control stalls
Ideal Pipeline CPI – maximum
performance
Reduce each of the terms on the right
hand side
Three Types of Dependencies
Instructions that are dependent
cannot be executed in parallel
Dependencies must be properly
identified
Types
Data Dependencies
Name Dependencies
Control Dependencies
Data Dependencies
Assumption: Ii < Ij < Ik
Ii produces a result that is used by Ij
Ik is data dependent on Ij and Ij is
data dependent on Ii
Example
ADD R1, R2, R3
ADD R4, R1, R5
ADD R6, R4, R3
; R1 = R2 + R3
; R4 = R1 + R5
; R6 = R4 + R3
Leads to RAW hazards
Data Dependencies
True data dependencies are a
property of programs
Whether the dependencies give rise
to a hazard and whether the hazard
causes a stall is a property of the
pipeline organization
Name Dependencies
Instructions use the same register or
memory location, called a name
Antidependency
Ij writes a register or memory location
that Ii reads WAR hazard
Output dependency
Ii and Ij write the same register or
memory location WAW hazard
Not “true” dependencies
Name Dependencies
Output dependency
MUL R1, R4, 15
ADD R2, R1, 1
MOVE R1, R3
; R1 = R4 * 15
; R2 = R1 + 1
; R1 = R3
Antidependency
ADD R1, R2, 1
MOVE R2, R3
; R1 = R2 + 1
; R2 = R3
Name Dependencies
Instructions with name dependencies
can be reordered if the name in the
instruction is changed to avoid
conflict
Technique: register renaming
Control Dependencies
Determines the ordering of an
instruction with respect to a branch
instruction so that the non-branch is
executed only when it should be
Control Dependencies
if p1 {
S1;
};
if p2 {
S2;
}
S1 is control dependent on p1
S2 is control dependent on p2 but not
on p1
Control Dependencies
Instruction that is control dependent
on a branch cannot be moved before
the branch
Instruction that is not control
dependent on a branch cannot be
moved after the branch.
Dependencies
Two ways of dealing with
dependencies
Maintain the dependency but avoid a
hazard
Eliminate the dependency by
transforming the code
Pipelining
Overlapped instruction execution
Effective ideal CPI (clock cycles per
instruction) = 1
CPI cannot go below 1
Why?
Only one instruction is issued per clock
cycle
Superpipelined Processors
More pipeline stages
Simpler stage finish faster
Advantage: more instructions can be
overlapped
Disadvantage: introduces more data
dependencies stalls
Pipeline with > 8 stages counterproductive
Improving Pipelining CPI
Goal: Decrease CPI to < 1
How?
Be able to issue multiple instructions in
one clock cycle
Solution: Multiple-issue processors
Multiple-Issue Processors
Two Flavors
Superscalar Processors
VLIW (Very Long Instruction Word)
Processors
Also referred to as static superscalar
Multiple Execution Units
Execute
Instruction
INTEGER
Fetch
Instruction
Decode
Instruction
Write
Result
Execute
Instruction
FLOATING
POINT
Superscalar Structure
Fetch
Instruction
Decode
Instruction
Fetch
Instruction
Decode
Instruction
Fetch
Instruction
Decode
Instruction
Fetch
Instruction
Decode
Instruction
ALU
INTEGER
LOAD
STORE
FPU
FLOATING
POINT
Write
Result
Write
Result
Write
Result
Write
Result
Superscalar Operation
Typical superscalar processor: issue 1
to 8 instructions per clock cycle
Instructions must be independent and
satisfy constraints
If an instruction in the sequence has
a dependency, only instructions
appearing before that instruction will
be executed number of instructions
issued per clock cycle varies
Superscalar Operation
Needs instruction fetch unit that can
maintain an instruction queue by
fetching more than one instruction at
a time
Superscalar Operation
Multiple processing units
Multiple-issue
Start more than one execution per clock
cycle
Superscalar Structure
ALU
INTEGER
Fetch
Instruction
Decode
Instruction
Fetch
Instruction
Decode
Instruction
LOAD
STORE
FPU
FLOATING
POINT
Write
Result
Write
Result
Superscalar Operation
CLOCK
I1 (Fadd)
1
2
3
4
5
F1 D1 E1
E1
E1 W1
A
B
I2 (Add)
F2 D2 E2 W2
I3 (Fsub)
F3 D3
I4 (Sub)
F4 D4
6
7
C
E3
E3
A
B
E4 W4
E3 W3
C
8
9
10 11 12
Superscalar Operation
Execution unit: one for integer operations
and one for floating-point operations
Example:
1 Integer Operation
1 Floating-Point Operation
No hazards
Issue the two instructions at the same time
Introduces additional dependencies
Superscalar Operation
Processor picks up two instructions
Issue both instructions if one is an
integer op, and one is a floating-point
op
If not, instructions are issued
sequentially
Superscalar Operation
Little impact on code density
Processor detects whether next
instruction can be issued
No need to lay out instructions to match
issue capability
Unscheduled programs, or those
compiled for older implementations
can still run
Superscalar Operation Issues
Difficult for processor to detect
dependencies among instructions to
be issued
Problem becomes more pronounced
as number of instructions that can be
issued per clock cycle increases
Superscalar Operation
Scheduling of instructions can also be
left to the compiler (static scheduling)
to avoid data hazards
Out-of-Order Execution
Instructions are dispatched in the
order they appear in the program
Instruction execution can complete
out-of-order
PROBLEM: What if an exception
happens?
Exceptions
Situations where the normal execution
order of instruction is changed
Example:
I/O device request
Tracing instruction execution
Divide by zero
Divide overflow
Integer overflow
Page fault
Misaligned memory accesses
Out-of-Order Execution
When an exception occurs for a
particular instruction, one or more
succeeding instructions may already
have been completed (result may
have already been written to a
register)
Imprecise Exception
When instructions following an
exception are permitted to complete
State of machine is not consistent
imprecise
Precise Exception
Results of subsequent instructions
that may have been partially
executed are discarded
May introduce some amount of delay
Delayed Write
CLOCK
I1 (Fadd)
1
2
3
4
5
F1 D1 E1
E1
E1 W1
A
I2 (Add)
F2 D2 E2
I3 (Fsub)
F3 D3
I4 (Sub)
F4 D4
B
6
7
C
W2
E3
E3
A
B
E 3 W3
C
E4 W4
8
9
10 11 12
Out-of-Order Execution
Out-of-order execution is desirable
BUT… in-order completion is
necessary for precise exception
Using Temporary Registers
CLOCK
I1 (Fadd)
1
2
3
F1 D1 E1
4
5
6
E1B
E1C
W1
7
A
I2 (Add)
F2 D2 E2
TW
W2
2
I3 (Fsub)
F3 D3
E3A
E3B
E3 W3
C
I4 (Sub)
F4 D4
E4
TW
4
W4
8
9
10 11 12
Using Temporary Registers
Result is written to temporary register
Commitment step – final write step
After this step, instruction cannot be
reversed
One technique of register renaming
Reorder Buffer
Commitment unit – guarantees inorder-commitment
Reorder buffer – queue of instructions
Instruction at head of queue is retired
temporary registers are written to
permanent registers
VLIW
Very Long Instruction Word
Multiple, independent functional units
Multiple instructions packaged into
one very long instruction
Burden is on compiler
Simple hardware – just executes the
instructions generated by compiler
Statically scheduled
VLIW
128-bit Instruction Word
FADD
ADD
Floating- Integer
Point Unit
ALU
LOAD
BREQ
Load/
Branch
Store Unit
Unit
VLIW Limitations
Bigger code
Complex compilers
Compiler must detect all dependencies
One Stalls, All Instructions Stall
Unless hardware has pipeline interlock to for
hazard resolution
Compiler cannot predict all
dependencies/hazards
Hardware-specific compilation
Compilation is optimized for the underlying
architecture (i.e. number of functional units)
VLIW CPUs
Itanium (IA-64) from Intel
Hardware decides how many instruction
groups it will execute depending on
available resources
Crusoe from Transmeta
Code Morphing software – translates X86
code to Crusoe ISA
Instruction Scheduling
Static Scheduling – compiler
schedules code to minimize
hazards/stalls; complex
compiler/simple hardware
Dynamic Scheduling – hardware
rearranges instruction execution to
reduce the stalls; complex
hardware/simple compiler
Static Scheduling
Minimize stalls by separating
instructions with dependencies so
they will not lead to hazards
Scheduled code can still be run on
dynamically scheduled pipeline
Dynamic Scheduling
Scheduling is performed by the
hardware, as opposed to compiler
scheduling (software)
Instruction begins execution as soon
as operands are available
Leads to out-of-order execution, outof-order completion
Dynamic Scheduling
Cannot remove true data dependencies, but
avoids stalling when dependencies are
present
Example:
DIV F0, F2, F4
;F0 = F2 / F4
ADD F10, F0, F8
;F10 = F0 + F8
SUB F12, F8, F14
;F12 = F8 – F14
Normally, dependency between DIV and ADD
will cause stall and prevent SUB from executing
(if only in-order execution is allowed)
But, SUB can be executed if out-of-order
execution is allowed because it is not dependent
on previous instructions
Dynamic Scheduling with a
Scoreboard
Keeps track of number of instructions
reading a register, and if register is being
written to
If current instruction needs a register
whose value has not yet been computed,
instruction must stall
For simplicity, assume that a functional unit
is always available.
Real scoreboards are more complex; keep
tracks of functional unit usage (busy or not
busy), which functional unit writes to a
register, etc.
Scoreboard: In-order
Scoreboard: Out-of-order
Register Renaming
Out-of-order execution/completion
needs register renaming
To overcome name dependencies,
registers are “renamed” to use
internal registers available to the
processor
Large set of available registers not
available in the ISA level
Predication
New IA-64 feature to deal with
branches
Currently: All instructions are
executed
With Predication: instructions contain
conditions (predicates) that tell when
they should be executed and when
not
Conditional Execution
Precursor of predication
Eliminates conditional branch
Example (C code):
if (R1 == 0)
R2 = R3;
Example (Assembly)
CMP R1, 0
BNE L1
MOV R2, R3
L1:
Example (Conditional Execution)
CMOVZ R2, R3, R1
Conditional Execution
CMOVZ – conditional move if third
operand is zero
CMOVN – conditional move if third
operand is not zero
More Conditional Execution
C Example
if (R1 == 0) {
R2 = R3;
R4 = R5; }
Else {
R6 = R7;
R8 = R9; }
More Conditional Execution
Assembly Example
CMP R1, 0
BNE L1
MOV R2, R3
MOV R4, R5
BR L2
L1: MOV R6, R7
MOV R8, R9
L2:
More Conditional Execution
Conditional Execution Example
CMOVZ R2, R3, R1
CMOVZ R4, R5, R1
CMOVN R6, R7, R1
CMOVN R8, R9, R1
Predicated Execution
C
if (R1 == R2)
R3 = R4 + R5;
else
R6 = R4 – R5
Predicated Execution
Assembly
CMP R1, R2
BNE L1
MOV R3, R4
ADD R3, R5
BR L2
L1: MOV R6, R4
SUB R6, R5
L2:
Predicated Execution
CMPEQ R1, R2, P4
<P4> ADD R3, R4, R5
<P5> SUB R6, R4, R5
IA-64
Every instruction is predicated
All instructions are executed
Upon retirement, if predicate is true then
results are written permanently,
otherwise results are discarded
Speculative Loads
Data is loaded from memory before it
is certain that the LOAD will actually
be executed
Problem: What if speculative load
causes an exception (i.e. cache miss)
IA-64 solution: Handle the exception
only when necessary
Speculative Loads
I8 (load is detected)
and moved upward
over the branch
- “hoisting”
I1
Speculative Load
Load data before it is
needed, if exception
occurs, postpone
reporting, set “poison bit”
I3 (Branch)
I4
I7
I5
I8 (load data)
I6
Check for “poison bit”
in register. If bit is
not set, then OK,
otherwise handle
exception
Speculative Check
I9 (use data)