Chapter One: Introduction to Pipelined Processors

Superscalar Processors
• Scalar processors issue one instruction per cycle.
• Superscalar processors use multiple instruction pipelines.
• Purpose: to exploit more instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel.
Superscalar Processors
• The fundamental structure (degree m = 3) is shown in the accompanying figure.
Superscalar Processors
• Here, the instruction decoding and execution resources are increased.
• Example: a dual-pipeline superscalar processor.
Superscalar Processor - Example
• Can issue two instructions per cycle.
• There are two pipelines, each with four processing stages: fetch, decode, execute, and store.
• Both instruction streams are fetched from a single I-cache.
• Assume each stage requires one cycle, except the execution stage.
Superscalar Processor - Example
• The execution stage has four functional units:

Functional Unit    Number of stages
Adder              2
Multiplier         3
Logic              1
Load               1
• Functional units are shared among the pipelines on a dynamic basis.
• A look-ahead window allows out-of-order instruction issue.
Superscalar Performance
• The time required by the scalar base machine (k pipeline stages, N instructions) is

T(1,1) = k + N - 1

• The ideal execution time required by an m-issue superscalar machine is

T(m,1) = k + (N - m)/m

where k is the time required to execute the first m instructions and (N - m)/m is the time required to execute the remaining (N - m) instructions.
Superscalar Performance
• The ideal speedup of the superscalar machine is

S(m,1) = T(1,1) / T(m,1) = m(N + k - 1) / (N + m(k - 1))

• As N → ∞, the speedup S(m,1) → m.
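These formulas are easy to check numerically; a minimal sketch (the function names are illustrative, not from the text):

```python
# Ideal execution times from the text:
#   T(1,1) = k + N - 1        (scalar base machine, k stages, N instructions)
#   T(m,1) = k + (N - m)/m    (m-issue superscalar, N independent instructions)

def t_scalar(k, n):
    return k + n - 1

def t_superscalar(m, k, n):
    return k + (n - m) / m

def speedup_superscalar(m, k, n):
    return t_scalar(k, n) / t_superscalar(m, k, n)

# With k = 4 stages and m = 2 pipelines, the speedup approaches m = 2 as N grows:
for n in (10, 1_000, 1_000_000):
    print(round(speedup_superscalar(2, 4, n), 4))
```

For small N the speedup is well below m, because the pipeline fill time k is paid regardless of issue width.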
Superpipeline Processors
• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.
Superpipeline Performance
• The time to execute N instructions on a superpipelined machine of degree n with k stages is

T(1,n) = k + (N - 1)/n

• The speedup over the base machine is

S(1,n) = T(1,1) / T(1,n) = n(k + N - 1) / (nk + N - 1)

• As N → ∞, S(1,n) → n.
Superpipelined Superscalar Processors
• This machine executes m instructions every cycle, with a pipeline cycle time 1/n of the base cycle.
Superpipelined Superscalar Performance
• The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m,n) is

T(m,n) = k + (N - m)/(mn)

• The speedup over the base machine is

S(m,n) = T(1,1) / T(m,n) = mn(k + N - 1) / (mnk + N - m)

• As N → ∞, S(m,n) → mn.
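Since the base machine is the case m = n = 1, one function covers all three speedups; a small numeric sketch (names illustrative):

```python
# T(m,n) = k + (N - m)/(m*n); the base machine is m = n = 1,
# so S(m,n) = T(1,1) / T(m,n) reproduces every speedup formula in the text.

def speedup(m, n, k, N):
    t_base = k + N - 1
    t_machine = k + (N - m) / (m * n)
    return t_base / t_machine

k, N = 4, 10**6
print(round(speedup(1, 3, k, N), 3))  # superpipelined only: tends to n = 3
print(round(speedup(2, 3, k, N), 3))  # superpipelined superscalar: tends to m*n = 6
```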
Superscalar Processors vs. Superpipelined Processors

Superscalar processors:
• Rely on spatial parallelism
• Run multiple operations concurrently on separate hardware
• Achieve this by duplicating hardware resources such as execution units and register-file ports
• Require more transistors

Superpipelined processors:
• Rely on temporal parallelism
• Overlap multiple operations on common hardware
• Achieve this through more deeply pipelined execution units with faster clock cycles
• Require faster transistors
Systolic Architecture
• Conventional architectures operate through load and store operations to memory.
• This requires many memory references, which slows the system down.
Systolic Architecture
• In systolic processing, the data to be processed flows through a series of operation stages and is finally written back to memory.
Systolic Architecture
• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm.
• E.g. 2-D meshes, hexagonal arrays, etc.
Systolic Architecture
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory circulates among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
• Example: multiplication of two n x n matrices.
Example: Multiplication of two n x n matrices
• Every input element is picked up n times from memory, since it contributes to n elements of the output.
• To reduce these memory accesses, the systolic architecture ensures that each element is pulled from memory only once.
• Consider an example where n = 3
Matrix Multiplication

a11 a12 a13     b11 b12 b13     c11 c12 c13
a21 a22 a23  *  b21 b22 b23  =  c21 c22 c23
a31 a32 a33     b31 b32 b33     c31 c32 c33

Conventional Method: O(n^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
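The conventional method as a runnable sketch, accumulating A[i][k]*B[k][j] into C[i][j]:

```python
# Conventional O(n^3) matrix multiplication: every element of A and B
# is fetched n times over the course of the computation.
def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(matmul(A, A))  # squaring the example matrix used later in the slides
```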
Systolic Method

This will run in O(n) time!
To run in O(n) time we need an n x n array of processing units; in our example n = 3, so we need 9 PEs:

P1  P2  P3
P4  P5  P6
P7  P8  P9
For systolic processing, the input data need to be modified as follows.

Flip columns 1 & 3 of A:

a13 a12 a11
a23 a22 a21
a33 a32 a31

Flip rows 1 & 3 of B:

b31 b32 b33
b21 b22 b23
b11 b12 b13

and finally stagger the data sets for input. Each row of A enters the array from the left, delayed one tick more than the row above; each column of B enters from the top, delayed one tick more than the column to its left:

                         b33
                    b32  b23
               b31  b22  b13
               b21  b12
               b11
                |    |    |
                v    v    v
       a13 a12 a11 ->  P1  P2  P3
   a23 a22 a21     ->  P4  P5  P6
a33 a32 a31        ->  P7  P8  P9
At every tick of the global system clock, each processor receives data from two different directions, multiplies the pair, and accumulates the product into its register.
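This tick-by-tick behaviour can be simulated directly; a minimal sketch, assuming the staggering described above, under which PE (i,j) receives the pair a[i][k], b[k][j] at tick i + j + k (the function name is illustrative):

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array multiplying A by B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]   # one accumulator register per PE
    for t in range(3 * n - 2):        # ticks until the array drains
        for i in range(n):
            for j in range(n):
                k = t - i - j         # index of the pair arriving at PE (i, j)
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
print(systolic_matmul(A, A))
```

With n = 3 the loop runs for 3n - 2 = 7 ticks, matching the seven clock ticks of the worked example.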
We compute

3 4 2     3 4 2     23 36 28
2 5 3  *  2 5 3  =  25 39 34
3 2 5     3 2 5     28 32 37

using a systolic array, feeding the staggered (column-flipped) rows of A from the left and the staggered (row-flipped) columns of B from the top of the P1..P9 grid.
Clock tick : 1

New product: P1 computes 3*3.

P1 = 9    P2 = 0    P3 = 0
P4 = 0    P5 = 0    P6 = 0
P7 = 0    P8 = 0    P9 = 0
Clock tick : 2

New products: P1: 4*2, P2: 3*4, P4: 2*3.

P1 = 9+8 = 17    P2 = 12    P3 = 0
P4 = 6           P5 = 0     P6 = 0
P7 = 0           P8 = 0     P9 = 0
Clock tick : 3

New products: P1: 2*3, P2: 4*5, P3: 3*2, P4: 5*2, P5: 2*4, P7: 3*3.

P1 = 17+6 = 23    P2 = 12+20 = 32    P3 = 6
P4 = 6+10 = 16    P5 = 8             P6 = 0
P7 = 9            P8 = 0             P9 = 0
Clock tick : 4

New products: P2: 2*2, P3: 4*3, P4: 3*3, P5: 5*5, P6: 2*2, P7: 2*2, P8: 3*4. P1 is finished.

P1 = 23           P2 = 32+4 = 36    P3 = 6+12 = 18
P4 = 16+9 = 25    P5 = 8+25 = 33    P6 = 4
P7 = 9+4 = 13     P8 = 12           P9 = 0
Clock tick : 5

New products: P3: 2*5, P5: 3*2, P6: 5*3, P7: 5*3, P8: 2*5, P9: 3*2.

P1 = 23            P2 = 36            P3 = 18+10 = 28
P4 = 25            P5 = 33+6 = 39     P6 = 4+15 = 19
P7 = 13+15 = 28    P8 = 12+10 = 22    P9 = 6
Clock tick : 6

New products: P6: 3*5, P8: 5*2, P9: 2*3.

P1 = 23    P2 = 36            P3 = 28
P4 = 25    P5 = 39            P6 = 19+15 = 34
P7 = 28    P8 = 22+10 = 32    P9 = 6+6 = 12
Clock tick : 7

New product: P9: 5*5.

P1 = 23    P2 = 36    P3 = 28
P4 = 25    P5 = 39    P6 = 34
P7 = 28    P8 = 32    P9 = 12+25 = 37
End

P1 = 23    P2 = 36    P3 = 28
P4 = 25    P5 = 39    P6 = 34
P7 = 28    P8 = 32    P9 = 37

The PE registers now hold the product matrix:

23 36 28
25 39 34
28 32 37
Samba: Systolic Accelerator for
Molecular Biological Applications
This systolic array contains 128 processors distributed across 32 full-custom
VLSI chips. Each chip houses 4 processors, and each processor computes
10 million matrix cells per second.