
More Pipelining and
Instruction Level Parallelism
CE 140 A1/A2
16 July 2003
Pipeline Performance
 Effective Pipeline CPI (clock cycles per instruction) = Ideal Pipeline CPI + Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Control Stalls
 Ideal Pipeline CPI – maximum
performance
 Reduce each of the terms on the right
hand side
Three Types of Dependencies
 Instructions that are dependent
cannot be executed in parallel
 Dependencies must be properly
identified
 Types
 Data Dependencies
 Name Dependencies
 Control Dependencies
Data Dependencies
 Assumption: Ii < Ij < Ik (program order)
 Ii produces a result that is used by Ij
 Ik is data dependent on Ij and Ij is
data dependent on Ii
 Example
 ADD R1, R2, R3 ; R1 = R2 + R3
 ADD R4, R1, R5 ; R4 = R1 + R5
 ADD R6, R4, R3 ; R6 = R4 + R3
 Leads to RAW hazards
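The RAW chain in the example above can be sketched in Python. This is a minimal sketch, not a real dependence analyzer: it assumes a dest-first tuple encoding of instructions and ignores redefinitions of a register between the writer and the reader.

```python
# Sketch: find RAW (read-after-write) dependencies in a straight-line
# sequence. Encoding is an assumption: ("ADD", "R1", "R2", "R3") means
# R1 = R2 + R3 (destination first, sources after).
def raw_dependencies(instructions):
    deps = []
    for i, (_, dest_i, *_srcs_i) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            _, _, *srcs_j = instructions[j]
            if dest_i in srcs_j:
                deps.append((i, j))  # instruction j reads what i wrote
    return deps

program = [
    ("ADD", "R1", "R2", "R3"),  # R1 = R2 + R3
    ("ADD", "R4", "R1", "R5"),  # R4 = R1 + R5 (RAW on R1)
    ("ADD", "R6", "R4", "R3"),  # R6 = R4 + R3 (RAW on R4)
]
print(raw_dependencies(program))  # [(0, 1), (1, 2)]
```

Each pair (i, j) is a dependence that would become a RAW hazard if instruction j reached its operand-read stage before instruction i wrote its result.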
Data Dependencies
 True data dependencies are a
property of programs
 Whether the dependencies give rise
to a hazard and whether the hazard
causes a stall is a property of the
pipeline organization
Name Dependencies
 Instructions use the same register or
memory location, called a name
 Antidependency
 Ij writes a register or memory location that Ii reads → WAR hazard
 Output dependency
 Ii and Ij write the same register or memory location → WAW hazard
 Not “true” dependencies
Name Dependencies
 Output dependency
 MUL R1, R4, 15 ; R1 = R4 * 15
 ADD R2, R1, 1 ; R2 = R1 + 1
 MOVE R1, R3 ; R1 = R3
 Antidependency
 ADD R1, R2, 1 ; R1 = R2 + 1
 MOVE R2, R3 ; R2 = R3
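The two name-dependency cases above can be classified mechanically. A minimal sketch, assuming each instruction is reduced to a (reads, writes) pair of register-name sets (the encoding is an assumption, not part of the slides):

```python
# Sketch: classify name dependencies between Ii (earlier) and Ij (later).
# Each instruction is a tuple (reads, writes) of register-name sets.
def name_dependency(ii, ij):
    reads_i, writes_i = ii
    _reads_j, writes_j = ij
    kinds = []
    if writes_j & reads_i:
        kinds.append("WAR")  # antidependency: Ij writes what Ii reads
    if writes_j & writes_i:
        kinds.append("WAW")  # output dependency: both write same name
    return kinds

# Antidependency from the slide: ADD R1, R2, 1 then MOVE R2, R3
add  = ({"R2"}, {"R1"})
move = (set(),  {"R2"})
print(name_dependency(add, move))  # ['WAR']
```

Running the same check on the output-dependency example (MUL and MOVE both writing R1) would report 'WAW' instead.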
Name Dependencies
 Instructions with name dependencies
can be reordered if the name in the
instruction is changed to avoid
conflict
 Technique: register renaming
Control Dependencies
 Determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be
Control Dependencies
 if p1 {
S1;
};
if p2 {
S2;
}
 S1 is control dependent on p1
 S2 is control dependent on p2 but not
on p1
Control Dependencies
 Instruction that is control dependent
on a branch cannot be moved before
the branch
 Instruction that is not control
dependent on a branch cannot be
moved after the branch.
Dependencies
 Two ways of dealing with
dependencies
 Maintain the dependency but avoid a
hazard
 Eliminate the dependency by
transforming the code
Pipelining
 Overlapped instruction execution
 Effective ideal CPI (clock cycles per
instruction) = 1
 CPI cannot go below 1
 Why?
 Only one instruction is issued per clock
cycle
Superpipelined Processors
 More pipeline stages
 Simpler stages finish faster
 Advantage: more instructions can be overlapped
 Disadvantage: introduces more data dependencies → stalls
 Pipeline with > 8 stages → counterproductive
Improving Pipelining CPI
 Goal: Decrease CPI to < 1
 How?
 Be able to issue multiple instructions in
one clock cycle
 Solution: Multiple-issue processors
Multiple-Issue Processors
 Two Flavors
 Superscalar Processors
 VLIW (Very Long Instruction Word)
Processors
 Also referred to as static superscalar
Multiple Execution Units
[Figure: Fetch Instruction → Decode Instruction → Execute Instruction → Write Result, with two parallel Execute units: INTEGER and FLOATING POINT]
Superscalar Structure
[Figure: four parallel Fetch Instruction / Decode Instruction front ends feeding an INTEGER ALU, a LOAD/STORE unit, and a FLOATING-POINT FPU, with each execution path ending in its own Write Result stage]
Superscalar Operation
 Typical superscalar processor: issue 1
to 8 instructions per clock cycle
 Instructions must be independent and
satisfy constraints
 If an instruction in the sequence has a dependency, only instructions appearing before that instruction will be executed → number of instructions issued per clock cycle varies
Superscalar Operation
 Needs instruction fetch unit that can
maintain an instruction queue by
fetching more than one instruction at
a time
Superscalar Operation
 Multiple processing units
 Multiple-issue
 Start more than one execution per clock
cycle
Superscalar Structure
[Figure: two parallel Fetch Instruction / Decode Instruction front ends feeding an INTEGER ALU, a LOAD/STORE unit, and a FLOATING-POINT FPU, followed by Write Result stages]
Superscalar Operation
[Timing diagram, clock cycles 1–12: I1 (Fadd) F1 D1 E1A E1B E1C W1; I2 (Add) F2 D2 E2 W2; I3 (Fsub) F3 D3 E3A E3B E3C W3; I4 (Sub) F4 D4 E4 W4. The single-cycle integer operations I2 and I4 complete before the multi-cycle floating-point operations I1 and I3.]
Superscalar Operation
 Execution unit: one for integer operations
and one for floating-point operations
 Example:
 1 Integer Operation
 1 Floating-Point Operation
 No hazards
 Issue the two instructions at the same time
 Introduces additional dependencies
Superscalar Operation
 Processor picks up two instructions
 Issue both instructions if one is an
integer op, and one is a floating-point
op
 If not, instructions are issued
sequentially
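The pairing rule above can be sketched in a few lines. A minimal sketch under stated assumptions: instructions are classified by mnemonic, the mnemonic set is invented for illustration, and the hazard check mentioned on the previous slide is omitted.

```python
# Sketch of the dual-issue rule: issue two instructions together only
# when exactly one of them is a floating-point op (the other integer).
# The mnemonic set is an assumption for this example.
FP_OPS = {"FADD", "FSUB", "FMUL", "FDIV"}

def issue_width(op1, op2):
    fp1, fp2 = op1 in FP_OPS, op2 in FP_OPS
    return 2 if fp1 != fp2 else 1  # one int + one fp -> dual issue

print(issue_width("FADD", "ADD"))  # 2
print(issue_width("ADD", "SUB"))   # 1 (two integer ops issue one at a time)
```

When the pair does not match (two integer ops, or two floating-point ops), the processor falls back to issuing the instructions sequentially, which is why the issue rate varies from cycle to cycle.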
Superscalar Operation
 Little impact on code density
 Processor detects whether next
instruction can be issued
 No need to lay out instructions to match
issue capability
 Unscheduled programs, or those
compiled for older implementations
can still run
Superscalar Operation Issues
 Difficult for processor to detect
dependencies among instructions to
be issued
 Problem becomes more pronounced
as number of instructions that can be
issued per clock cycle increases
Superscalar Operation
 Scheduling of instructions can also be
left to the compiler (static scheduling)
to avoid data hazards
Out-of-Order Execution
 Instructions are dispatched in the
order they appear in the program
 Instruction execution can complete
out-of-order
 PROBLEM: What if an exception
happens?
Exceptions
 Situations where the normal execution order of instructions is changed
 Example:
 I/O device request
 Tracing instruction execution
 Divide by zero
 Divide overflow
 Integer overflow
 Page fault
 Misaligned memory accesses
Out-of-Order Execution
 When an exception occurs for a
particular instruction, one or more
succeeding instructions may already
have been completed (result may
have already been written to a
register)
Imprecise Exception
 When instructions following an exception are permitted to complete
 State of machine is not consistent → imprecise
Precise Exception
 Results of subsequent instructions
that may have been partially
executed are discarded
 May introduce some amount of delay
Delayed Write
[Timing diagram, clock cycles 1–12: the same four-instruction stream, but the register writes of the faster integer operations I2 and I4 are delayed so that W1, W2, W3, W4 occur in program order.]
Out-of-Order Execution
 Out-of-order execution is desirable
 BUT… in-order completion is
necessary for precise exception
Using Temporary Registers
[Timing diagram, clock cycles 1–12: I1 (Fadd) F1 D1 E1A E1B E1C W1; I2 (Add) F2 D2 E2 TW2 W2; I3 (Fsub) F3 D3 E3A E3B E3C W3; I4 (Sub) F4 D4 E4 TW4 W4. I2 and I4 write their results to temporary registers (TW2, TW4) as soon as they finish; the permanent writes W2 and W4 are deferred until W1 and W3 have completed, preserving program order.]
Using Temporary Registers
 Result is written to temporary register
 Commitment step – final write step
 After this step, instruction cannot be
reversed
 One technique of register renaming
Reorder Buffer
 Commitment unit – guarantees in-order commitment
 Reorder buffer – queue of instructions
 Instruction at head of queue is retired → temporary registers are written to permanent registers
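The retirement rule above can be sketched directly. A minimal sketch, assuming a dict-based ROB entry format invented for this example; a real reorder buffer also tracks exceptions, branch results, and store data.

```python
# Sketch of in-order commitment with a reorder buffer: results may land
# out of order, but architectural registers are only updated when the
# instruction reaches the head of the queue.
from collections import deque

def commit(rob, regs):
    """Retire completed instructions from the head of the queue; stop at
    the first instruction that has not finished executing."""
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        regs[entry["dest"]] = entry["value"]  # temp -> permanent register

regs = {"R1": 0, "R2": 0}
rob = deque([
    {"dest": "R1", "value": 10, "done": False},  # head: still executing
    {"dest": "R2", "value": 20, "done": True},   # finished out of order
])
commit(rob, regs)
print(regs)  # {'R1': 0, 'R2': 0} -- R2 must wait behind the head

rob[0]["done"] = True
commit(rob, regs)
print(regs)  # {'R1': 10, 'R2': 20}
```

Because nothing past the head is ever committed, an exception at the head can simply discard the rest of the queue, which is what makes the exception precise.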
VLIW
 Very Long Instruction Word
 Multiple, independent functional units
 Multiple instructions packaged into
one very long instruction
 Burden is on compiler
 Simple hardware – just executes the
instructions generated by compiler
 Statically scheduled
VLIW
[Figure: a 128-bit instruction word holding four operations – FADD (Floating-Point Unit), ADD (Integer ALU), LOAD (Load/Store Unit), BREQ (Branch Unit)]
VLIW Limitations
 Bigger code
 Complex compilers
 Compiler must detect all dependencies
 One Stalls, All Instructions Stall
 Unless hardware has a pipeline interlock for hazard resolution
 Compiler cannot predict all
dependencies/hazards
 Hardware-specific compilation
 Compilation is optimized for the underlying architecture (e.g., number of functional units)
VLIW CPUs
 Itanium (IA-64) from Intel
 Hardware decides how many instruction
groups it will execute depending on
available resources
 Crusoe from Transmeta
 Code Morphing software – translates X86
code to Crusoe ISA
Instruction Scheduling
 Static Scheduling – compiler
schedules code to minimize
hazards/stalls; complex
compiler/simple hardware
 Dynamic Scheduling – hardware
rearranges instruction execution to
reduce the stalls; complex
hardware/simple compiler
Static Scheduling
 Minimize stalls by separating
instructions with dependencies so
they will not lead to hazards
 Scheduled code can still be run on
dynamically scheduled pipeline
Dynamic Scheduling
 Scheduling is performed by the
hardware, as opposed to compiler
scheduling (software)
 Instruction begins execution as soon as operands are available
 Leads to out-of-order execution, out-of-order completion
Dynamic Scheduling
 Cannot remove true data dependencies, but
avoids stalling when dependencies are
present
 Example:
 DIV F0, F2, F4 ; F0 = F2 / F4
 ADD F10, F0, F8 ; F10 = F0 + F8
 SUB F12, F8, F14 ; F12 = F8 – F14
 Normally, the dependency between DIV and ADD will cause a stall and prevent SUB from executing (if only in-order execution is allowed)
 But SUB can be executed if out-of-order execution is allowed, because it is not dependent on previous instructions
Dynamic Scheduling with a
Scoreboard
 Keeps track of number of instructions
reading a register, and if register is being
written to
 If current instruction needs a register
whose value has not yet been computed,
instruction must stall
 For simplicity, assume that a functional unit
is always available.
 Real scoreboards are more complex; they keep track of functional unit usage (busy or not busy), which functional unit writes to a register, etc.
Scoreboard: In-order
Scoreboard: Out-of-order
Register Renaming
 Out-of-order execution/completion
needs register renaming
 To overcome name dependencies,
registers are “renamed” to use
internal registers available to the
processor
 Large set of internal registers not visible at the ISA level
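Renaming can be sketched with a simple mapping table. A minimal sketch, assuming the dest-first tuple encoding used earlier and a pool of internal registers named P0, P1, … (the naming is an assumption): each write to an architectural register is given a fresh internal register, which removes WAR and WAW name dependencies while preserving true (RAW) dependencies.

```python
# Sketch of register renaming: sources read the current mapping, then
# the destination is remapped to a fresh internal (physical) register.
def rename(instructions, n_internal=64):
    mapping = {}                       # architectural -> internal name
    free = iter(range(n_internal))     # free-list of internal registers
    renamed = []
    for op, dest, *srcs in instructions:
        srcs = [mapping.get(s, s) for s in srcs]  # read before remap
        mapping[dest] = f"P{next(free)}"          # fresh name per write
        renamed.append((op, mapping[dest], *srcs))
    return renamed

# Output-dependency example from earlier: MUL and MOVE both write R1
prog = [("MUL", "R1", "R4"), ("ADD", "R2", "R1"), ("MOVE", "R1", "R3")]
print(rename(prog))
# [('MUL', 'P0', 'R4'), ('ADD', 'P1', 'P0'), ('MOVE', 'P2', 'R3')]
```

After renaming, MUL writes P0 and MOVE writes P2, so the WAW hazard between them is gone and MOVE may complete in any order, while ADD still correctly reads MUL's result through P0.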
Predication
 New IA-64 feature to deal with
branches
 Currently: All instructions are
executed
 With Predication: instructions contain
conditions (predicates) that tell when
they should be executed and when
not
Conditional Execution
 Precursor of predication
 Eliminates conditional branch
 Example (C code):
 if (R1 == 0)
 R2 = R3;
 Example (Assembly)
 CMP R1, 0
 BNE L1
 MOV R2, R3
 L1:
 Example (Conditional Execution)
 CMOVZ R2, R3, R1
Conditional Execution
 CMOVZ – conditional move if third
operand is zero
 CMOVN – conditional move if third
operand is not zero
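The semantics of these two instructions can be stated as a tiny simulation. A minimal sketch, assuming a register file modeled as a Python dict; the function names mirror the mnemonics above.

```python
# Sketch of conditional-move semantics: the copy happens only when the
# condition register is zero (CMOVZ) or non-zero (CMOVN); there is no
# branch, so control flow never changes.
def cmovz(regs, dst, src, cond):
    if regs[cond] == 0:
        regs[dst] = regs[src]

def cmovn(regs, dst, src, cond):
    if regs[cond] != 0:
        regs[dst] = regs[src]

regs = {"R1": 0, "R2": 5, "R3": 9}
cmovz(regs, "R2", "R3", "R1")  # R1 == 0, so the move happens
print(regs["R2"])  # 9
```

Because the condition is folded into the instruction itself, the `if (R1 == 0) R2 = R3;` example compiles to a single CMOVZ with no branch for the pipeline to predict.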
More Conditional Execution
 C Example
 if (R1 == 0) {
 R2 = R3;
 R4 = R5;
 } else {
 R6 = R7;
 R8 = R9;
 }
More Conditional Execution
 Assembly Example
 CMP R1, 0
 BNE L1
 MOV R2, R3
 MOV R4, R5
 BR L2
 L1: MOV R6, R7
 MOV R8, R9
 L2:
More Conditional Execution
 Conditional Execution Example
 CMOVZ R2, R3, R1
 CMOVZ R4, R5, R1
 CMOVN R6, R7, R1
 CMOVN R8, R9, R1
Predicated Execution
C
 if (R1 == R2)
 R3 = R4 + R5;
 else
 R6 = R4 – R5;
Predicated Execution
 Assembly
 CMP R1, R2
 BNE L1
 MOV R3, R4
 ADD R3, R5
 BR L2
 L1: MOV R6, R4
 SUB R6, R5
 L2:
Predicated Execution
 CMPEQ R1, R2, P4 ; sets predicate P4 (P5 is its complement)
 <P4> ADD R3, R4, R5
 <P5> SUB R6, R4, R5
IA-64
 Every instruction is predicated
 All instructions are executed
 Upon retirement, if predicate is true then
results are written permanently,
otherwise results are discarded
Speculative Loads
 Data is loaded from memory before it
is certain that the LOAD will actually
be executed
 Problem: What if the speculative load causes an exception (e.g., a cache miss)?
 IA-64 solution: Handle the exception
only when necessary
Speculative Loads
[Figure: in the original code, I8 (load data) sits after the branch I3 and its result is used by I9 (use data). The load is detected and moved upward over the branch ("hoisting") to become a Speculative Load before I1: data is loaded before it is needed, and if an exception occurs, reporting is postponed and a "poison bit" is set. A Speculative Check placed before I9 tests the poison bit in the register: if the bit is not set, execution proceeds normally; otherwise the exception is handled.]
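The poison-bit scheme can be sketched as follows. A minimal sketch under stated assumptions: memory is modeled as a dict, a missing key stands in for a faulting access, and the class and function names are invented for illustration rather than taken from the IA-64 ISA.

```python
# Sketch of deferred-exception speculative loads: a faulting speculative
# load sets a poison bit instead of raising immediately; the check
# instruction raises only if the (possibly poisoned) value is used.
class Reg:
    def __init__(self):
        self.value, self.poison = None, False

def speculative_load(reg, memory, addr):
    try:
        reg.value = memory[addr]  # may fault; reporting is postponed
    except KeyError:
        reg.poison = True         # record the deferred exception

def check(reg):
    if reg.poison:
        raise RuntimeError("deferred load exception")  # handle it now
    return reg.value

mem = {0x100: 42}
r = Reg()
speculative_load(r, mem, 0x100)
print(check(r))  # 42 -- the load was valid, no exception is raised
```

If the branch turns out to skip the use of the loaded value, the check never executes and a faulting speculative load costs nothing, which is the point of deferring the report.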