Pipelining - La Salle University

Pipelining
CSC 370 (Blum)
1
Levels of Instructions
• One high-level → Many assembly-level
– Recall that a single line of code in a high level
language may result in several instructions at
the assembly/machine-code level.
• One assembly-level → Many micro-code
– Furthermore, one assembly/machine-code
instruction consists of several micro-code
instructions.
CSC 370 (Blum)
2
Micro-coding Load Accumulator A
• Recall our very naïve architecture and
the micro-coding of a simple assembly-level
instruction, Load Accumulator A, for that
architecture.
CSC 370 (Blum)
3
Bus
(Naïve Architecture)
[Diagram: the system bus connecting the keyboard encoder, input ports 1 and 2, the program counter, the memory address register, memory, the MDR, the instruction register, Control, the Accumulator, the ALU with its Flags, the TMP, B, and C registers, output ports 3 and 4, and the display.]
CSC 370 (Blum)
4
Address State: the value of the program counter (which, recall, is
the address of the line of the program to be performed) is put into
the memory address register.
CSC 370 (Blum)
5
Increment State: the program counter is incremented, getting it
ready for the next time.
CSC 370 (Blum)
6
Memory State: the current line of the program is put into the
instruction register (so Control knows what to do).
CSC 370 (Blum)
7
Recall that the instruction consists of a load command and an
address. A copy of the address is now taken over to the memory
address register.
CSC 370 (Blum)
8
The value at that address is loaded into Accumulator A.
CSC 370 (Blum)
9
For the load command, there is no activity during the sixth step. It
is known as a "no operation" step (a "no op" or "nop").
CSC 370 (Blum)
10
• These 1’s and 0’s are MICRO-CODE.
• A large part of processor design is to get this micro-code to
execute as efficiently as possible.
CSC 370 (Blum)
11
Instruction Review
• There are many registers so the Load instruction
might be more like
LD r5, (r6)
– Load Register 5 with the value at the address in
Register 6.
• An Add instruction might be
ADD r1, r2, r3
– Add the values in Registers 2 and 3 and place the result
in Register 1.
CSC 370 (Blum)
12
Stages
• The micro-code steps that go into making an
assembly-level instruction can be thought of as
belonging to various stages:
– Fetch Instruction
– Decode Instruction
– Read Registers
– Execute Instruction
– Write Result Back to Registers
– (Look for interrupt?)
CSC 370 (Blum)
13
Stages and Units
• The pipeline is said to be broken into stages.
– So one talks about the Fetch Instruction (FI)
stage.
– The terms section and/or segment are
sometimes used in place of stage.
• The circuitry responsible for a stage is
known as a unit.
– So one talks of the Fetch Instruction (FI) Unit.
CSC 370 (Blum)
14
Idle hands
• Early processors executed one instruction
(all of its stages) at a time.
– Recall the von Neumann bottleneck.
• In this approach, most of the processor’s
parts are inactive most of the time.
CSC 370 (Blum)
15
• Note that in the micro-code above we assumed 1
meant active and that most of the entries are
0's.
CSC 370 (Blum)
16
Toward Efficiency
• Division of Labor:
– “A production process in which a worker or group of
workers is assigned a specialized task in order to
increase efficiency.”
• Assembly Line:
– “An arrangement of machines, tools, and workers in
which a product is assembled by having each perform a
specific, successive operation on an incomplete unit as
it passes by in a series of stages organized in a direct
line.”
• (Webster’s)
CSC 370 (Blum)
17
Assembly Line
CSC 370 (Blum)
18
Pipelining
• Applying the assembly line idea to a processor’s
execution of instructions is known as pipelining.
• The instruction is broken into various stages and
instead of having one instruction complete all
stages before the next is begun, several
instructions are being executed at once – each in a
different stage.
• The stages are separated by latches, which
prevent information from one stage from passing
prematurely into the next stage.
CSC 370 (Blum)
19
Slower but faster
• The introduction of the stages separated by
latches actually slows down the execution
of a single instruction.
• However, there is an overall gain in speed
since many instructions are in the midst of
being processed simultaneously.
CSC 370 (Blum)
20
Staying in sync and the weakest link
• The stage that takes the longest will
determine the speed of the entire process.
– All of the other stages must wait for this stage
to complete before information is passed to the
next stage.
– The processes should be broken into stages that,
as much as possible, take the same amount of
time.
CSC 370 (Blum)
21
For example
• If you broke a 26-second process into five stages,
four of them taking 5 s and one taking 6 s, then it
would take 30 s (5 stages × 6 s, since every stage must
advance at the pace of the slowest) for an individual
instruction to move through the pipeline.
• However, one instruction would be coming out of
the pipeline every 6 s.
– Compared to one instruction every 26 s without
pipelining.
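A small C sketch of this arithmetic; the stage times come from the example above, and the lockstep assumption is that every stage advances at the pace of the slowest one.

#include <stdio.h>

int main(void) {
    /* Stage times in seconds: four 5 s stages and one 6 s stage. */
    int stage[] = {5, 5, 5, 5, 6};
    int n = sizeof stage / sizeof stage[0];

    int sum = 0, slowest = 0;
    for (int i = 0; i < n; i++) {
        sum += stage[i];
        if (stage[i] > slowest) slowest = stage[i];
    }

    printf("Unpipelined: one instruction every %d s\n", sum);              /* 26 */
    printf("Pipelined latency: %d s per instruction\n", n * slowest);      /* 30 */
    printf("Pipelined throughput: one instruction every %d s\n", slowest); /*  6 */
    return 0;
}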
CSC 370 (Blum)
22
Sample question
• If a 32-second process were broken into 8 stages,
(3 requiring 3 seconds, 3 requiring 4 seconds, 1
requiring 5 seconds and 1 requiring 6 seconds),
then it would take ______(number of seconds) for
an instruction to move through the pipeline as
opposed to the original 32 seconds. However, one
instruction would be issuing from the pipeline
every _____ (number of seconds) as opposed to 1
every 32 seconds without pipelining.
CSC 370 (Blum)
23
Caveat
• The preceding calculation is somewhat
misleading, as it makes the stages appear to be
synchronized, with all of the instructions
moving on to the next stage simultaneously.
Between two stages is a latch that knows
whether the previous stage has completed and
holds any information to be passed
to the next stage.
CSC 370 (Blum)
24
Caveat (Cont.)
• In our discussions of memory we said that
synchronous was better since asynchronous
required some unnecessary waiting.
• Here we will see that certain situations
(“hazards”) require an instruction to be held
up in a particular stage (“stalled”) until an
instruction before it has been completely
processed.
CSC 370 (Blum)
25
Latch Size
• One is tempted to say that the 6-second stage in
the example should be two 3-second stages, but
there can be other issues.
• The latches hold the information produced by one
stage that is passed on to the next stage.
– They are a kind of register.
• Another consideration is that stages should be
chosen in a way that does not require a lot of
information to be passed from one stage to the
next.
– Big latches → more registers → more transistors →
more money, more heat, etc.
CSC 370 (Blum)
26
Structural Hazard
• Two instructions at different stages might conflict
if they both need the same resource – maybe
they both need to read from cache.
• This is called a structural hazard.
• Using Harvard cache – the separation of cache
into data and instructions – reduces structural
hazards as different stages tend to need different
caches rather than the same cache.
CSC 370 (Blum)
27
Possible Hazard
• Having multiple instructions in the pipeline means
that the first instruction is not complete before the
second instruction begins.
• This is good for saving time but could cause
problems if the two instructions involve the same
data.
• This situation is known as a data dependency or a
data hazard.
– Somewhat reminiscent of locking ideas in databases.
CSC 370 (Blum)
28
Dependency Categorizing
• RAR: Read After Read
– 1st instruction reads, 2nd instruction reads
• RAW: Read After Write
– 1st instruction writes, 2nd instruction reads
• WAR: Write After Read
– 1st instruction reads, 2nd instruction writes
• WAW: Write After Write
– 1st instruction writes, 2nd instruction writes
CSC 370 (Blum)
29
No Problemo
• RAR is not a problem.
– ADD r1, r2, r3
– SUB r4, r5, r3
• Both instructions can read the same data
without causing a conflict.
CSC 370 (Blum)
30
RAW is Hell
• RAW is a problem.
– ADD r1, r2, r3
– SUB r4, r5, r1
• The data should be written before it is read.
Therefore, the second instruction must be stalled
in the pipeline with NOP’s (no-operation).
• The situation is called a data dependency and the
result is called a pipeline stall or bubble.
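The same dependency written as a hypothetical C fragment, with the registers modeled as variables: the second statement cannot produce the right value until the first has written r1.

#include <stdio.h>

int main(void) {
    /* Hypothetical values; the registers are modeled as C variables. */
    int r2 = 4, r3 = 6, r5 = 10;

    int r1 = r2 + r3;   /* ADD r1, r2, r3  -- writes r1 */
    int r4 = r5 - r1;   /* SUB r4, r5, r1  -- reads r1, so in a pipeline it
                           must be stalled (NOPs) until r1 has been written */

    printf("r1 = %d, r4 = %d\n", r1, r4);
    return 0;
}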
CSC 370 (Blum)
31
Name Dependencies
• WAR (Write After Read) is not a problem unless it
turns into RAW (Read After Write).
• This could happen if the second instruction is
executed faster (if they have different latencies) or
if the instructions are executed out of order.
• WAW (Write After Write) is similar.
• These potential hazards are called name
dependencies and occur because there are a finite
number of registers.
CSC 370 (Blum)
32
What’s the score?
• In order to avoid a RAW hazard, the processor
must know that the first instruction will write to a
particular register.
• A register has a presence bit that indicates
whether it can be read or whether it will be written
to by some instruction in the pipeline and is thus
unavailable.
• This technique is called register scoreboarding.
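A minimal C sketch of the presence-bit idea, assuming one bit per register (not any particular processor's actual scoreboard hardware).

#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 8

/* One presence bit per register: true means the value may be read; false
   means an instruction still in the pipeline will write it. */
static bool present[NUM_REGS];

static void issue_write(int r) { present[r] = false; } /* a writer enters the pipeline */
static void write_back(int r)  { present[r] = true;  } /* its result is written back   */
static bool can_read(int r)    { return present[r];  } /* a reader stalls while false  */

int main(void) {
    for (int r = 0; r < NUM_REGS; r++) present[r] = true;

    issue_write(1);                            /* ADD r1, r2, r3 will write r1   */
    printf("r1 readable? %d\n", can_read(1));  /* 0: SUB r4, r5, r1 must stall   */
    write_back(1);                             /* the ADD's result is written    */
    printf("r1 readable? %d\n", can_read(1));  /* 1: the SUB may now proceed     */
    return 0;
}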
CSC 370 (Blum)
33
Anticipating Problems
• Another approach is to have all of the addresses of
data that are involved in the instructions being
worked on in the pipeline (including any registers)
placed in a kind of memory.
• Then as new instructions are added to the pipeline,
the address of the data involved is compared to
what is already stored there to look for potential
conflicts.
– As we discussed with cache, this comparison must be a
fast hardware search/comparison. Thus this is another
use of associative memory or content-addressable
memory.
CSC 370 (Blum)
34
Bypassing
• If the processor knows that a stage of some
subsequent instruction needs the data from
an instruction that just completed execution,
it can send the information directly to that
stage rather than making that stage go read
it from a register.
• This process is known as bypassing.
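A toy C illustration of the idea, with the register file and the execute-stage result modeled as plain variables; all of the names and values are hypothetical.

#include <stdio.h>

int main(void) {
    /* ADD r1, r2, r3 has just finished its execute stage but has not
       yet written r1 back to the register file. */
    int regfile_r1 = 0;       /* stale value still sitting in the register file   */
    int ex_result  = 4 + 6;   /* the ADD's result, available at the execute stage */

    /* Without bypassing, the SUB would have to stall until write-back;
       reading the register file too early would give the stale value. */
    int stale_read = 10 - regfile_r1;
    /* With bypassing, the execute-stage result is forwarded directly. */
    int bypassed   = 10 - ex_result;

    printf("stale read: %d, bypassed: %d\n", stale_read, bypassed);
    return 0;
}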
CSC 370 (Blum)
35
A Special Case
• A special case of a RAW hazard (data
dependency) is when the data being written is the
result of a comparison, for example in an if.
– Then the very instructions that should be in the pipeline
will depend on the outcome of the comparison.
• Recall that ifs and loops provide a program its
control flow and so these hazards are known as
control hazards.
• The time spent waiting for the outcome of a
comparison is known as a branch delay.
• One tries to minimize branch delay by practicing
branch prediction.
CSC 370 (Blum)
36
Super-pipelining and Hyperpipelining
• Super-pipelining and hyper-pipelining mean
having a long pipeline.
– The Pentium Pro had “super-pipelining” which
was a 14-stage pipeline.
• The Pentium Pro pipeline also allowed out-of-order
processing.
– The Pentium 4 advertises “Hyper Pipelined
Technology”
• 20 stages compared to 10 stages in the Pentium III
CSC 370 (Blum)
37
BPU
• Recall that having more stages requires
introducing latches, which increases the execution
time of a single instruction. The benefit of hyper-pipelining
is having so many instructions being worked on at once.
• But the benefits are all lost if the pipeline becomes
stalled or clogged due to a hazard (data
dependency), especially a control hazard.
• Hyper-pipelining is only useful if accompanied by
good branch prediction provided by the Branch
Prediction Unit (BPU).
CSC 370 (Blum)
38
What’s the harm?
• One approach to branching is just to proceed with
one of the branches (or, if possible, both) before
knowing the outcome of the branch.
• Lining up instructions and even executing them
does not cause a problem unless the results are
written. The results of the wrong branch can just
be forgotten. But if it is the desired branch, you’ve
saved time.
• Executing a branch just in case it might be needed
is called speculative execution.
CSC 370 (Blum)
39
The Road Most Traveled
• When branching occurs, the processor stores data
about the branch taken.
• This information is placed in the Branch Target
Buffer (BTB).
• It is used to determine which set of instructions to
line up in the pipeline.
– The program will probably branch in the same way it
did before. This is especially true with loops.
– Locality of reference strikes again.
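A rough C sketch of a tiny, direct-mapped branch target buffer; the entry count and fields are assumptions for illustration, not any real processor's BTB.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BTB_ENTRIES 16   /* assumed size; real buffers are much larger */

/* One entry per recently seen branch: where it was, where it went,
   and whether it was taken last time. */
struct btb_entry {
    uint32_t branch_pc;
    uint32_t target;
    bool     taken;
    bool     valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Predict: if this branch was seen before, assume it will behave as it did
   last time (loops make this a good bet); otherwise predict fall-through. */
static uint32_t predict(uint32_t pc) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    if (e->valid && e->branch_pc == pc && e->taken)
        return e->target;   /* line up instructions from the stored target */
    return pc + 4;          /* predict not taken: next sequential instruction */
}

/* Once the branch outcome is actually known, record it. */
static void update(uint32_t pc, bool taken, uint32_t target) {
    struct btb_entry *e = &btb[pc % BTB_ENTRIES];
    e->branch_pc = pc;
    e->target    = target;
    e->taken     = taken;
    e->valid     = true;
}

int main(void) {
    uint32_t loop_branch = 0x1010, loop_top = 0x1000;
    printf("first prediction:  0x%x\n", (unsigned)predict(loop_branch)); /* never seen: falls through */
    update(loop_branch, true, loop_top);                                 /* the branch was taken      */
    printf("second prediction: 0x%x\n", (unsigned)predict(loop_branch)); /* now predicts the loop top */
    return 0;
}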
CSC 370 (Blum)
40
You’re Out of Order!
• Pipelines that allow out-of-order processing
were introduced with the Pentium Pro.
They have three parts:
– The in-order front end (fetches and decodes)
– The out-of-order core
• If order doesn’t matter, execute instructions in most
efficient order
– The in-order retirement section
• Restores order
CSC 370 (Blum)
41
Superscalar
• In pipelining, the processor is working on several
instructions at once but each is in a different stage.
• A superscalar processor has more than one
pipeline.
– It is parallel processing on the micro-code level.
• The ALUs associated with these parallel pipelines
may not be equivalent.
– For instance, floating-point operations may have to go
through a particular pipeline, whereas an integer
operation may be put through either.
– Floating-point pipelines tend to be longer since those
instructions are more complicated.
CSC 370 (Blum)
42
Multiple execution units
• One possible use of multiple pipelines is to
“execute” the various branches of code and then
accept the results of the good pipeline when the
condition is determined.
• The execution of the branches may require the
reading and/or writing of data.
• Each pipeline/branch has its own set of registers.
Then when the actual branch is selected, the
values of these internal registers are written to the
external registers shared by all of the pipelines.
– The external/global registers are accessible to the
programmer.
CSC 370 (Blum)
43
Decisions, decisions
• Just as there were hazards (data
dependencies) that led to stalls in the
pipeline, there are issues one must consider
when dividing work up into multiple
pipelines.
– These tend to be conflicts.
CSC 370 (Blum)
44
Register renaming
• A part of the processor known as the commit unit
determines when a given pipeline's registers
should be written to the shared, programmer-accessible registers.
• The internal registers are called rename registers
or rename buffers.
• And the process is known as register renaming.
– One way is to have the programmer-accessible register
point to the pipeline registers of choice.
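A minimal C sketch of the commit step, assuming two speculative paths, each with a private set of rename registers; the sizes and values are made up for illustration.

#include <stdio.h>

#define NUM_ARCH_REGS 4

/* The shared, programmer-accessible registers and one private set of
   rename registers per speculative path. */
static int arch_reg[NUM_ARCH_REGS];
static int rename_buf[2][NUM_ARCH_REGS];

/* Commit: once the branch is resolved, the winning path's private values
   are written to the shared registers; the other path's work is dropped. */
static void commit(int winning_path) {
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        arch_reg[r] = rename_buf[winning_path][r];
}

int main(void) {
    rename_buf[0][1] = 10;  /* path 0 speculatively computed r1 = 10 */
    rename_buf[1][1] = 20;  /* path 1 speculatively computed r1 = 20 */

    commit(1);              /* the condition picked path 1 */
    printf("r1 = %d\n", arch_reg[1]);   /* 20 */
    return 0;
}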
CSC 370 (Blum)
45
Loop Unrolling
• One example of what a compiler might do to help
with pipelining is loop unrolling.
• Branching is the bane of pipelining and
parallelism.
• Loops have at least one branch with each iteration.
• Loop unrolling is doing two or more iterations'
worth of work in one iteration. It reduces the
number of branch considerations and it can also
promote parallelism.
CSC 370 (Blum)
46
Loop Unrolling Example
for (i = 0; i < 100; i++) {
    a[i] = b[i] + c[i];
}

for (i = 0; i < 100; i += 2) {
    a[i] = b[i] + c[i];
    a[i+1] = b[i+1] + c[i+1];
}
The unrolled version has half as many branches and so is
easier to pipeline.
The unrolled version will use more independent registers
within each iteration and so takes greater advantage of
pipelining or parallelism.
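A compilable version of the unrolled loop, with declarations and placeholder data filled in; if the trip count were not a multiple of two, a short remainder loop would also be needed.

#include <stdio.h>

#define N 100   /* trip count from the example; a multiple of the unroll factor */

int main(void) {
    int a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* placeholder data */
        b[i] = i;
        c[i] = 2 * i;
    }

    /* Unrolled by 2: half as many branch tests, and the two statements in the
       body are independent of each other, so they can be overlapped. */
    for (int i = 0; i < N; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }

    printf("a[99] = %d\n", a[99]);  /* 99 + 198 = 297 */
    return 0;
}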
CSC 370 (Blum)
47
Don’t try this at home
• Loop unrolling requires knowledge of the
processor’s capabilities (the number of
execution units, the number of stages in the
pipeline, etc.). If the programmer does not
have this knowledge, the unrolling and
other code optimization techniques should
be left to the compiler.
CSC 370 (Blum)
48
Don’t interrupt me
• Another situation that causes problems for
pipelining is interruptions.
• Recall that the various devices do not immediately
receive the processor’s attention. Rather they
make an interrupt request and the processor
attends to them when it is ready.
• But with an endless stream of instructions in the
pipeline, when is a good time?
CSC 370 (Blum)
49
When is a good time?
1. Right away? To attend to the interruption
immediately, one would have to store the
state of the entire pipeline so that one
could return to the proper place after
processing the interrupt.
2. In a while? One could allow everything
currently in the pipeline to be completely
processed before processing the interrupt.
CSC 370 (Blum)
50
Interrupt Compromise
• The early fetching stage of an instruction is easily
undone – just decrement the program counter.
• But if data is stored in a register or memory
location, one loses the previous information
making such a step irreversible.
• In long pipelines, the compromise is to undo the
reversible stages and continue with the irreversible
stages and then process the interrupt.
CSC 370 (Blum)
51
References
• Computer Architecture, Nicholas Carter
• http://www.pcguide.com
• PC Hardware in A Nutshell, Thompson and
Thompson
• http://www.cs.wvu.edu/~jdm/cs455/notes/tech/instrpipe.html
CSC 370 (Blum)
52