Pipelining - La Salle University


Pipelining
CSIT 301 (Blum)
Levels of Instructions
• One high-level → Many assembly-level
– Recall that a single line of code in a high level
language may result in several instructions at
the assembly/machine-code level.
• One assembly-level → Many micro-code
– Furthermore, one assembly/machine-code
instruction consists of several micro-code
instructions.
Micro-coding Load Accumulator A
• Recall our very naïve architecture and
the micro-coding of a simple assembly-level
instruction, Load Accumulator A, for that
architecture.
Bus (Naïve Architecture)
[Diagram: components attached to the bus — keyboard encoder, input ports 1 and 2, program counter, memory address register, memory, MDR, instruction register, Control, Accumulator, ALU, flags, TMP, B and C registers, output ports 3 and 4, display.]
Address State: the value of the program counter (which, recall, is
the address of the line of the program to be performed) is put into
the memory address register.
Increment State: the program counter is incremented, getting it
ready for the next time.
Memory State: the current line of the program is put into the
instruction register (so Control knows what to do).
Recall the instruction consisted of a load command and an
address. A copy of the address is now taken over to the memory
address register.
The value at that address is loaded into Accumulator A.
For the load command, there is no activity during the sixth step. It
is known as a "no operation" step (a "no op" or "nop").
• These 1’s and 0’s are MICRO-CODE.
• A large part of processor design is to get this micro-code to
execute as efficiently as possible.
Instruction Review
• There are many registers so the Load instruction
might be more like
LD r5, (r6)
– Load Register 5 with the value at the address in
Register 6.
• An Add instruction might be
ADD r1, r2, r3
– Add the values in Registers 2 and 3 and place the result
in Register 1.
Stages
• The micro-code steps that go into making an
assembly-level instruction can be thought of as
belonging to various stages:
– Fetch Instruction
– Decode Instruction
– Read Registers
– Execute Instruction
– Write Result Back to Registers
– (Look for interrupt?)
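The stage progression above can be sketched as a tiny timing function. This is a hedged illustration, not the course's micro-code: it assumes a 5-stage pipeline in which each clock tick advances every in-flight instruction one stage and admits a new instruction into Fetch, so instruction i occupies stage t - i at clock t. The names `stage_at`, `FI`, etc. are illustrative.

```c
/* Stages follow the slide's list: Fetch, Decode, Read Registers,
   Execute, Write Back.  (Names and encoding are assumptions.) */
enum { FI = 0, DI, RR, EX, WB, NSTAGES };

/* Stage occupied by instruction i at clock t (instruction i enters
   Fetch at cycle i), or -1 if it is not in the pipeline then. */
int stage_at(int i, int t) {
    int s = t - i;
    return (s >= 0 && s < NSTAGES) ? s : -1;
}
```

For example, at cycle 4 instruction 0 is in Write Back while instruction 4 is being fetched: five instructions are in flight at once, one per stage.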
Stages and Units
• The pipeline is said to be broken into stages.
– So one talks about the Fetch Instruction (FI) stage.
– The terms section and/or segment are sometimes used
in place of stage.
• The circuitry responsible for a stage is known as a
unit.
– So one talks of the Fetch Instruction (FI) Unit.
Idle hands
• Early processors executed one
instruction (all of its stages) at a time.
– Recall the von Neumann bottleneck.
• In this approach, most of the
processor’s parts are inactive most of
the time.
• Note that in the micro-code above, a 1
meant active, and most of the entries are
0’s.
Toward Efficiency
• Division of Labor:
– “A production process in which a worker or group of
workers is assigned a specialized task in order to
increase efficiency.”
• Assembly Line:
– “An arrangement of machines, tools, and workers in
which a product is assembled by having each perform a
specific, successive operation on an incomplete unit as
it passes by in a series of stages organized in a direct
line.”
• (Webster’s)
Assembly Line
Pipelining
• Applying the assembly line idea to a processor’s
execution of instructions is known as pipelining.
• The instruction is broken into various stages and
instead of having one instruction complete all
stages before the next is begun, several
instructions are being executed at once – each in a
different stage.
• The stages are separated by latches, which prevent
information from one stage from passing prematurely
into the next stage.
Slower but faster
• The introduction of the stages separated by
latches actually slows down the execution
of a single instruction.
• However, there is an overall gain in speed
since many instructions are in the midst of
being processed simultaneously.
Staying in sync and the weakest link
• The stage that takes the longest will
determine the speed of the entire process.
– All of the other stages must wait for the slowest
stage to complete before information is passed
to the next stage.
– The processes should be broken into stages that
as much as possible take the same amount of
time.
For example
• If you broke a 26-second process into five stages:
– Four of them taking 5 s, and
– One taking 6 s.
• Then it would take 30 s for an individual
instruction to move through the pipeline (5 stages,
each paced at the slowest stage’s 6 s).
• However, one instruction would be coming out of
the pipeline every 6 s.
– Compared to one instruction every 26 s without
pipelining.
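The arithmetic above can be captured in two small functions. This is a sketch of the slide's reasoning (the function names are illustrative): every stage is clocked at the pace of the slowest stage, so latency is stages times the slowest stage time, while throughput is one instruction per slowest-stage time.

```c
/* Time for one instruction to traverse the whole pipeline. */
int pipeline_latency_s(int nstages, int slowest_stage_s) {
    return nstages * slowest_stage_s;
}

/* Seconds between successive instructions leaving the pipeline. */
int pipeline_throughput_s(int slowest_stage_s) {
    return slowest_stage_s;
}
```

With the 26-second example, pipeline_latency_s(5, 6) gives the 30 s above, while one instruction still issues every pipeline_throughput_s(6) = 6 s.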
Sample question
• If a 32-second process were broken into 8 stages,
(3 requiring 3 seconds, 3 requiring 4 seconds, 1
requiring 5 seconds and 1 requiring 6 seconds),
then it would take ______(number of seconds) for
an instruction to move through the pipeline as
opposed to the original 32 seconds. However, one
instruction would be issuing from the pipeline
every _____ (number of seconds) as opposed to 1
every 32 seconds without pipelining.
Caveat
• The preceding calculation is somewhat
misleading as it makes the stages appear to
be synchronized, with all of the instructions
moving on to the next stage simultaneously.
In fact, between two stages is a latch that
knows whether the previous stage has
completed and holds any information to
be passed to the next stage.
Caveat (Cont.)
• In our discussions of memory we said that
synchronous was better since asynchronous
required some unnecessary waiting.
• Here we will see that certain situations
(“hazards”) require an instruction to be
held up in a particular stage (the instruction
is said to be “stalled”) until an instruction
before it has been completely processed.
Latch Size
• One is tempted to say that the 6-second stage in
the example should be two 3-second stages, but
there can be other issues.
• The latches hold the information produced by one
stage that is passed on to the next stage.
– They are a kind of register.
• Another consideration is that stages should be
chosen in a way that does not require a lot of
information to be passed from one stage to the
next.
– Big latches → more registers → more transistors →
more money, more heat, etc.
Structural Hazard
• Two instructions at different stages might conflict
if they both need the same resource – maybe they
both need to read from cache.
• This is called a structural hazard.
• Using Harvard cache – the separation of cache
into data and instructions – reduces structural
hazards as different stages tend to need different
caches rather than the same cache.
Possible Hazard
• Having multiple instructions in the pipeline means
that the first instruction is not complete before the
second instruction begins.
• This is good for saving time but could cause
problems if the two instructions involve the same
data.
• This situation is known as a data dependency or a
data hazard.
– Somewhat reminiscent of locking ideas in databases.
Dependency Categorizing
• RAR: Read After Read
– 1st instruction reads, 2nd instruction reads
• RAW: Read After Write
– 1st instruction writes, 2nd instruction reads
• WAR: Write After Read
– 1st instruction reads, 2nd instruction writes
• WAW: Write After Write
– 1st instruction writes, 2nd instruction writes
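The four categories above can be expressed as a small classifier. This is a hedged sketch (the function name and the 0 = read / 1 = write encoding are illustrative assumptions): given whether each of two instructions writes a shared register, it names the dependency between them.

```c
/* Classify the dependency between two instructions touching the same
   register.  first_writes / second_writes: nonzero if that instruction
   writes the register, zero if it only reads it. */
const char *classify_dependency(int first_writes, int second_writes) {
    if (!first_writes && !second_writes) return "RAR"; /* read after read   */
    if ( first_writes && !second_writes) return "RAW"; /* read after write  */
    if (!first_writes &&  second_writes) return "WAR"; /* write after read  */
    return "WAW";                                      /* write after write */
}
```

As the following slides explain, only RAW forces a stall; RAR is harmless, and WAR/WAW matter only when latencies differ or execution is out of order.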
No Problemo
• RAR is not a problem.
– ADD r1, r2, r3
– SUB r4, r5, r3
• Both instructions can read the same data
without causing a conflict.
RAW is Hell
• RAW is a problem.
– ADD r1, r2, r3
– SUB r4, r5, r1
• The data should be written before it is read.
Therefore, the second instruction must be stalled
in the pipeline with NOP’s (no-operation).
• The situation is called a data dependency and the
result is called a pipeline stall or bubble.
Name Dependencies
• WAR (Write After Read) is not a problem unless
it turns into RAW (Read After Write).
• This could happen if the second instruction is
executed faster (if they have different latencies) or
if the instructions are executed out of order.
• WAW (Write After Write) is similar.
• These potential hazards are called name
dependencies and occur because there are a finite
number of registers.
What’s the score?
• In order to avoid a RAW hazard, the processor
must know that the first instruction will write to a
particular register.
• A register has a presence bit that indicates
whether it can be read or whether it will be written
to by some instruction in the pipeline and is thus
unavailable.
• This technique is called register scoreboarding.
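A minimal scoreboard along these lines can be sketched with one presence bit per register (the register count and function names here are illustrative assumptions, not a real processor's design): the bit is set while some in-flight instruction has claimed the register as a destination, and cleared when the result is written back.

```c
#include <stdbool.h>

#define NREGS 16                      /* illustrative register count */
static bool pending_write[NREGS];     /* true => a write is outstanding */

bool can_read(int r)     { return !pending_write[r]; }
void issue_write(int r)  { pending_write[r] = true;  } /* dest claimed at issue */
void retire_write(int r) { pending_write[r] = false; } /* result written back   */
```

A reader finding can_read() false has detected a RAW hazard and must stall (or take a bypassed value) until the write retires.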
Anticipating Problems
• Another approach is to have all of the addresses
of data that are involved in the instructions being
worked on in the pipeline (including any registers)
placed in a kind of memory.
• Then as new instructions are added to the pipeline,
the address of the data involved is compared to
what is already stored there to look for potential
conflicts.
– As we discussed with cache, this comparison must be a
fast hardware search/comparison. Thus this is another
use of associative memory or content-addressable
memory.
Bypassing
• If the processor knows that a stage of some
subsequent instruction needs the data from
an instruction that just completed execution,
it can send the information directly to that
stage rather than making that stage go read
it from a register.
• This process is known as bypassing.
A Special Case
• A special case of a RAW hazard (data
dependency) is when the data being written is the
result of a comparison, for example in an if.
– Then the very instructions that should be in the pipeline
will depend on the outcome of the comparison.
• Recall that ifs and loops provide a program its
control flow and so these hazards are known as
control hazards.
• The time spent waiting for the outcome of a
comparison is known as a branch delay.
• One tries to minimize branch delay by practicing
branch prediction.
Super-pipelining and Hyper-pipelining
• Super-pipelining and hyper-pipelining mean
having a long pipeline.
– The Pentium Pro had “super-pipelining” which
was a 14-stage pipeline.
• The Pentium Pro pipeline also allowed out-of-order
processing.
– The Pentium 4 advertises “Hyper Pipelined
Technology”
• 20 stages compared to 10 stages in the Pentium III
BPU
• Recall that having more stages requires
introducing latches, which increases the execution
time of a single instruction. The benefit of
hyper-pipelining is having so many instructions being
worked on at once.
• But the benefits are all lost if the pipeline becomes
stalled or clogged due to a hazard (data
dependency), especially a control hazard.
• Hyper-pipelining is only useful if accompanied by
good branch prediction provided by the Branch
Prediction Unit (BPU).
What’s the harm?
• One approach to branching is just to proceed with
one of the branches (or, if possible, both) before
knowing the outcome of the branch.
• Lining up instructions and even executing them
does not cause a problem unless the results are
written. The results of the wrong branch can just
be forgotten. But if it is the desired branch, you’ve
saved time.
• Executing a branch just in case it might be needed
is called speculative execution.
The Road Most Traveled
• When branching occurs, the processor stores data
about the branch taken.
• This information is placed in the Branch Target
Buffer (BTB).
• It is used to determine which set of instructions to
line up in the pipeline.
– The program will probably branch in the same way it
did before. This is especially true with loops.
– Locality of reference strikes again.
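A toy BTB along these lines can be sketched as follows. The size, the direct-mapped table, and the "predict the same way as last time" policy are illustrative assumptions, not the design of any particular processor's BTB: the table simply remembers the target last taken by each branch and predicts it again.

```c
#define BTB_SIZE 64   /* illustrative table size */
typedef struct { unsigned addr; unsigned target; int valid; } BTBEntry;
static BTBEntry btb[BTB_SIZE];

/* Record the target actually taken by the branch at `addr`. */
void btb_update(unsigned addr, unsigned target) {
    BTBEntry *e = &btb[addr % BTB_SIZE];
    e->addr = addr; e->target = target; e->valid = 1;
}

/* Predicted next fetch address; fall through (addr + 1) on a miss. */
unsigned btb_predict(unsigned addr) {
    BTBEntry *e = &btb[addr % BTB_SIZE];
    return (e->valid && e->addr == addr) ? e->target : addr + 1;
}
```

For a loop branch, every prediction after the first iteration is correct until the loop exits, which is why locality of reference makes this scheme pay off.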
You’re Out of Order!
• Pipelines that allow out-of-order processing
were introduced with the Pentium Pro.
They have three parts:
– The in-order front end (fetches and decodes)
– The out-of-order core
• If order doesn’t matter, execute instructions in the
most efficient order
– The in-order retirement section
• Restores order
Superscalar
• In pipelining, the processor is working on several
instructions at once but each is in a different stage.
• A superscalar processor has more than one
pipeline.
– It is parallel processing on the micro-code level.
• The ALUs associated with these parallel pipelines
may not be equivalent.
– For instance, floating-point operations may have to go
through a particular pipeline whereas an integer
operation may be put through either.
– Floating-point pipelines tend to be longer since those
instructions are more complicated.
Multiple execution units
• One possible use of multiple pipelines is to
“execute” the various branches of code and then
accept the results of the good pipeline when the
condition is determined.
• The execution of the branches may require the
reading and/or writing of data.
• Each pipeline/branch has its own set of registers.
Then when the actual branch is selected, the
values of these internal registers are written to the
external registers shared by all of the pipelines.
– The external/global registers are accessible to the
programmer.
Decisions, decisions
• Just as there were hazards (data
dependencies) that led to stalls in the
pipeline, there are issues one must consider
when dividing work up into multiple
pipelines.
– These tend to be conflicts.
Register renaming
• A part of the processor known as the commit unit
determines when a given pipeline’s registers
should be written to the shared, programmer-accessible registers.
• The internal registers are called rename registers
or rename buffers.
• And the process is known as register renaming.
– One way is to have the programmer-accessible register
point to the pipeline registers of choice.
Loop Unrolling
• One example of what a compiler might do to help
with pipelining is loop unrolling.
• Branching is the bane of pipelining and
parallelism.
• Loops have at least one branch with each iteration.
• Loop unrolling is doing two or more iterations
worth of work in one iteration. It reduces the
number of branch considerations and it can also
promote parallelism.
Loop Unrolling Example
for (i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

for (i = 0; i < 100; i += 2) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}
The unrolled version has half as many branches and so is
easier to pipeline.
The unrolled version will use more independent registers
within each iteration and so takes greater advantage of
pipelining or parallelism.
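The transformation above can be checked directly. This sketch wraps the slide's two loops in functions (the function names are illustrative; the arrays and the size 100 come from the slide) so that both versions can be run on the same inputs: they fill a[] identically, but the unrolled loop tests its branch 50 times instead of 100.

```c
#define N 100

/* Original loop: one branch test per element. */
void add_rolled(int a[N], const int b[N], const int c[N]) {
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

/* Unrolled loop: two elements per iteration, half as many branch tests. */
void add_unrolled(int a[N], const int b[N], const int c[N]) {
    for (int i = 0; i < N; i += 2) {
        a[i]   = b[i]   + c[i];
        a[i+1] = b[i+1] + c[i+1];
    }
}
```

Note the unrolled version relies on the trip count (100) being divisible by the unroll factor (2); a compiler unrolling a general loop must also emit cleanup code for any leftover iterations.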
Don’t try this at home
• Loop unrolling requires knowledge of the
processor’s capabilities (the number of
execution units, the number of stages in the
pipeline, etc.). If the programmer does not
have this knowledge, the unrolling and
other code optimization techniques should
be left to the compiler.
Don’t interrupt me
• Another situation that causes problems for
pipelining is interruptions.
• Recall that the various devices do not immediately
receive the processor’s attention. Rather they
make an interrupt request and the processor
attends to them when it is ready.
• But with an endless stream of instructions in the
pipeline, when is a good time?
When is a good time?
1. Right away? To attend to the interruption
immediately, one would have to store the
state of the entire pipeline so that one
could return to the proper place after
processing the interrupt.
2. In a while? One could allow everything
currently in the pipeline to be completely
processed before processing the interrupt.
Interrupt Compromise
• The early fetching stage of an instruction is easily
undone – just decrement the program counter.
• But if data is stored in a register or memory
location, one loses the previous information
making such a step irreversible.
• In long pipelines, the compromise is to undo the
reversible stages and continue with the irreversible
stages and then process the interrupt.
References
• Computer Architecture, Nicholas Carter
• http://www.pcguide.com
• PC Hardware in A Nutshell, Thompson and
Thompson
• http://www.cs.wvu.edu/~jdm/cs455/notes/tech/instrpipe.html