Transcript execution

MAMAS – Computer Architecture 234367
Lecture 9 – Out Of Order (OOO)
Dr. Avi Mendelson
Some of the slides were taken from:
(1) Lihu Rapoport (2) Randi Katz and (3) Petterson
© Avi Mendelson, 5/2005
1
Lecture 9 – OOO execution
Agenda
 Introduction
 Data flow execution
 Out of order execution – principles
 Pentium-(II,III and 6) example
© Avi Mendelson, 5/2005
2
Lecture 9 – OOO execution
Introduction
 The goal is to increase the performance of a system.
 In order to achieve that, so far we discussed
– Adding pipe stages in order to increase the clock rate
– Increase the potential parallelism within the pipe by
 Fetching decoding and executing several instructions in parallel
 Are there any other options?
© Avi Mendelson, 5/2005
3
Lecture 9 – OOO execution
Data flow vs. conventional computer ?
 In theory “data flow machines” has the best
performance
– View the program as a parallel operations wait to be executed (will
be demonstrate next slide)
– Execute instruction as soon as its inputs are ready
 So, why computers are Van-Neumann based and not
Data-Flow based
– Hard to debug
– Hard to write Data-Flow programs (need special programming
language in order to be efficient)
© Avi Mendelson, 5/2005
4
Lecture 9 – OOO execution
Data flow execution – a different approach for
high performance computers
 Data flow execution is an alternative for Van-Neumann
execution. Here, the instructions are executed in the
order of their input dependencies and not in the order
they appears in the program
 Example: assume that we have as many execution
units as we need:
(1)
(2)
(3)
(4)
(5)
(6)
r1
r8
r5
r6
r4
r7
© Avi Mendelson, 5/2005






r4
r1
r5
r6
r5
r8
/
+
+
+
*
r7
r2
1
r3
r6
r4
Data Flow Graph
We could execute it
in 3 cycles
1
3
2
4
5
6
5
Lecture 9 – OOO execution
Data flow execution - cont
 Can we build a machine that will execute the
“data flow graph”?
 In the early 70th several machines were built to
work according to the data-flow graph. They
were called “data flow machines”. They were
vanished due to the reasons we mentioned
before.
 Solution: Let the user think he/she are using
Van-Neumann machine, and let the system
work in “data-flow mode”
© Avi Mendelson, 5/2005
6
Lecture 9 – OOO execution
OOOE - General Scheme
Most of the modern computers are using OOO execution.
Most of them are doing the fetching and the retirement INORDER, but it executes in OUT_OF_ORDER
Fetch &
Decode
Instruction
pool
In-order
Retire
(commit)
In-order
Execute
Out-of-order
© Avi Mendelson, 5/2005
7
Lecture 9 – OOO execution
Out Of Order Execution
Basic idea:
– The fetch is done in the program order (in-order) and fast enough in order
to “fill-out” a window of instructions.
– Out of the instruction window, the system forms a data flow graph and
looks for instructions which are ready to be executed:
 All the data the instructions are depended on, are ready
 Resources are available.
– As soon as the instruction is execution it needs to signal to all the
instructions which are depend on it that it generate new input.
– The instructions are commit in “program’s order” to preserve the “user
view”
 Advantages:
– Help exploit Instruction Level Parallelism (ILP)
– Help cover latencies (e.g., cache miss, divide)
© Avi Mendelson, 5/2005
8
Lecture 9 – OOO execution
How to convert “In-order” instruction
flow into “data flow”
 The problems:
1. Data Flow has only RAW dependencies, while OOOE
has also WAR and WAW dependencies
2. How to guarantee the in-order complition.
 The Solutions:
1. Register Renaming (based on “Tomuselo algorithm”) solves
the WAR and WAW dependencies
2. We need to “enumerate” the instructions at decode
time (in order) so we know in what order to retire them
© Avi Mendelson, 5/2005
9
Lecture 9 – OOO execution
Register Renaming
 Hold a pool of physical registers.
 Architectural registers are mapped into physical registers
– When an instruction writes to an architectural register
 A free physical register is allocated from the pool
 The physical register points to the architectural register
 The instruction writes the value to the physical register
– When an instruction reads from an architectural register
 reads the data from the latest instruction which writes to the same
architectural register, and precedes the current instruction.
 If no such instruction exists, read directly from the architectural register.
– When an instruction commits
 Moves the value from the physical register to the architectural register it
points.
© Avi Mendelson, 5/2005
10
Lecture 9 – OOO execution
OOOE with Register Renaming: Example
Before renaming
(1)
(2)
WAW
(3)
WAW
(4)
(5)
(6)
WAR
(7)
(8)
r1
r2
r1
r3
r1
r4
r5
r6








mem1
r2 +
mem2
r3 +
mem3
r5 +
2
r5 +
After renaming
t1
t2
t3
t4
t5
t6
t7
t8
r1
r1
r1
2








mem1
r2 +
mem2
r3 +
mem3
r5 +
2
t7 +
t1
t3
t5
2
After renaming all the false dependencies
(WAW and WAR) were removed
© Avi Mendelson, 5/2005
11
Lecture 9 – OOO execution
Executing Beyond Branches
 The scheme we saw so far does not search beyond a
branch
 Limited to the parallelism within a basic-block
– A basic-block is ~5 instruction long
(1)
(2)
(3)
(4)
(5)
r1  r4 / r7
r2  r2 + r1
r3  r2 - 5
beq r3,0,300
r8  r8 + 1
If the beq is predicted NT,
Inst 5 can be spec executed
 We would like to look beyond branches
– But what if we execute an instruction beyond a branch and then it
turns out that we predicted the wrong path ?
Solution: Speculative Execution
© Avi Mendelson, 5/2005
12
Lecture 9 – OOO execution
Speculative Execution
• Execution of instructions from a predicted (yet unsure) path
• Eventually, path may turn wrong
• Implementation:
– Hold a pool of all “not yet executed” instructions
– Fetch instructions into the pool from a predicted path
– Instructions for which all operands are “ready” can be executed
– An instruction may change the processor state (commit) only when it is safe
 An instruction commits only when all previous (in-order) instructions
had committed  Instructions commit in-order
 Instructions which follow a branch commit only after the branch
commits
 If a predicted branch is wrong all the instructions which follow it are flushed
• Register Renaming helps speculative execution
– Renamed registers are kept until speculation is verified to be correct
© Avi Mendelson, 5/2005
13
Lecture 9 – OOO execution
The magic of the modern X86
architectures (Intel, AMD, etc.)
 The user view of the X86 machine is as a CISC architecture.
 The machine supports this view by keeping the in-order parts as
close as possible to the X86 view.
 While moving from the In-order part (front-end) to the OOO part
(execution), the hardware translates each X86 instruction into a
set of uop operations, which are the internal machine operations.
These operations are RISC like (load-store based).
 During this translation, the hardware performs the register
renaming. So, during the execution time it uses internal registers
and not the X86 ones. The number of these registers can be
changed from one generation to another.
 While moving back from the OOO part (execution) to the In-Order
part (commit), the hardware translates the registers back to X86,
in order to keep for the user a coherent picture.
© Avi Mendelson, 5/2005
14
Lecture 9 – OOO execution
OOOE Architecture: based on Pentium-II

Bus Interface Unit
ID
Instr. Decode
and rename
Write back
bus
Load/Store
Operations






RS
IFU
Instr. Fetch
on Branch prediction.
2. Decode and rename:
Data
cache
MOB
Instruction
cache
In Order Front-end
1. Fetch from instruction cache, base
Out-Of-Order execution
3. Do in Parallel:

Arithmetic
Operations
Translate to Uops
Use the RAT table for renaming
Put ALL instructions in ROB
Put all “arithmetic instructions” in the
RS queue
Put all Load/Store instructions in
MOB

Load and store operations are
executed based on MOB information
Arithmetic operations are executed
based on RS information.
4. All results are written back to ROB,
RAT
while RS and MOB “still” values
they need
ROB

Retire (commit)
Logic
© Avi Mendelson, 5/2005
15
In Order completion (retirement)
5. The retire logic (commit logic)
moves instructions out of the ROB
and updates the architectural
registers
Lecture 9 – OOO execution
Re-order Buffer (ROB)
 Mechanism for keeping the in-order view of the
user.
 Basic ROB functions
– Provide large physical register space for register renaming
– Keeps intermediate results, some of them may not be commit if
the branch prediction was wrong (we will discuss this
mechanism later on)
– Keeps information on what is the “Real Register” the commit
need to update
© Avi Mendelson, 5/2005
16
Lecture 9 – OOO execution
Reservation station (RS)
 Pool of all “not yet executed” uops
– Holds both the uop attributes as well as the values of the input data
 For each operand, it keeps indication if it is ready
– Operand that need to be retrieved from the RRF is always ready
– Operand that waits for another Uop to generate its value, will “lesson” to
the WB bus. When the value appears on the bus (the value is always
associated with the ROB number it needs to update), all RS entries how
need to consume this value, “still” it from the bus and mark the input as
ready (this is done in parallel to the ROB update.
– Uops whose all operands are ready can be dispatched for execution
– Dispatcher chooses which of the ready uops to execute next. If can also do
“forwarding”; i.e., schedule the instruction at the same cycle the information
is written to the RS entry.
 As soon as Uop completes its execution, it is deleted from
the RS.
 If the RS is full, it stalls the decoder
© Avi Mendelson, 5/2005
17
Lecture 9 – OOO execution
Memory Order Buffer (MOB)
 Goal – Manipulates the Load and Store operations. If
possible, it allows out-of-order among memory operations
 Structure similar in concept to ROB
 Every memory uop allocates new entry in-order.
 Address need to be updated when known
 Problem- Memory dependencies cannot be fully resolved
statically (memory disambiguation)
– store r1,a; load r2,b  can advance load before store
– store r1,[r3]; load r2,b  load should wait till r3 is known
 In most of the modern processors, Loads may pass
loads/stores but Stores must be execute in order (among
stores).
 For simplicity, this course assumes that all MOB operations
are executed in order.
© Avi Mendelson, 5/2005
18
Lecture 9 – OOO execution
An example of OOO Execution
© Avi Mendelson, 5/2005
19
Lecture 9 – OOO execution
RAT
R0
Instruction Q
R1
R2
R3
ROB
RS
MOB
Execute
Retire
© Avi Mendelson, 5/2005
20
Lecture 9 – OOO execution
RAT
R0
R1
R2
R3
ROB
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
RS
Instruction Q
◄
MOB
Execute
Retire
© Avi Mendelson, 5/2005
21
Lecture 9 – OOO execution
RAT
R0
R1
RB0
R2
R3
ROB
LD R1,X
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
RS
M0
Instruction Q
◄
MOB
LD RB0,X
Takes 3
cycles
Execute
Retire
© Avi Mendelson, 5/2005
22
Lecture 9 – OOO execution
RAT
R0
R1
R2
RB0
RB1
R3
ROB
LD R1,X
R2 <- R3
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
◄
RS
RB1 <- R3
M0
RS0
Instruction Q
MOB
LD RB0,X
Execute
Retire
© Avi Mendelson, 5/2005
23
Lecture 9 – OOO execution
RAT
R0
R1
R2
RB2
RB1
R3
ROB
LD R1,X
R2 <- R3
R1 <- R1+R0
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
◄ Instruction Q
RS
RB1 <- R3
M0
RS0
RB2 <- RB0+R0
RS1
RB1 <- R3
MOB
LD RB0,X
Execute
Retire
© Avi Mendelson, 5/2005
24
Lecture 9 – OOO execution
RAT
R0
R1
R2
RB2
RB1
R3
ROB
LD R1,X
R2 <- R3
R1 <- R1+R0
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
◄ Instruction Q
RS
M0
O.K
RB2 <- RB0+R0
RS1
I4
MOB
LD RB0,X
Got the
value now
Execute
Cannot execute since
the data is not ready yet
© Avi Mendelson, 5/2005
Retire
25
Lecture 9 – OOO execution
RAT
R0
R1
R2
RB2
RB1
R3
ROB
LD R1,X
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
Instruction Q
RS
OK
OK
R1 <- RB0+R0 RS1
R2 <- R3
I4
I5
◄
MOB
RB2 <- RB0+R0
I4
RS2
I5
RS3
RB2 <- RB0+R0
Execute
Retire
© Avi Mendelson, 5/2005
26
Lecture 9 – OOO execution
◄
RAT
R0
R1
RB3
R2
R3
ROB
I5
I4
R1 <- R1+R0
R2 <- R3
LD R1,X
Instruction Q
MOB
RS
I6
R1 <- RB1+R0 OK
I4
I5
I6
I4
I5
rs2
rs3
rs0
I4
I5
R2 <- R3
Execute
Retire
LD R1,X
© Avi Mendelson, 5/2005
27
Lecture 9 – OOO execution
Backup
© Avi Mendelson, 5/2005
28
Lecture 9 – OOO execution