Pipelining and parallel processing


ELEC692 VLSI Signal Processing Architecture
Lecture 2: Pipelining and Parallel Processing
Techniques for improving performance
• Exploit parallelism to improve performance
• Two ways:
  – Pipelining
    • Uses pipeline latches to reduce the critical path delay
    • Can be exploited to increase the clock speed or sample speed, or to reduce power consumption at the same speed
  – Parallel processing
    • Multiple outputs are computed in parallel in a clock period with parallel hardware
    • The effective sampling speed is increased by the level of parallelism
    • Can also be used to reduce power consumption
Example of a 3-tap FIR filter
[Figure: direct-form 3-tap FIR filter – input x(n) passes through delays D to give x(n-1) and x(n-2); coefficient multipliers A, B, C feed an adder chain producing y(n)]

y(n) = A·x(n) + B·x(n-1) + C·x(n-2)

Let T_M be the delay of a multiplier and T_A the delay of an adder. The sampling period must satisfy

T_sample ≥ T_M + 2·T_A,   so   f_sample ≤ 1 / (T_M + 2·T_A)
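A minimal software sketch of this computation (illustrative, not part of the lecture; the function name fir3 and the coefficient values are assumptions). It makes the critical path visible: each output needs one multiply followed by two chained adds, i.e., T_M + 2·T_A.

```python
# Illustrative direct-form 3-tap FIR: y(n) = A*x(n) + B*x(n-1) + C*x(n-2)
def fir3(x, A, B, C):
    y = []
    x1 = x2 = 0.0              # contents of the two delay elements D: x(n-1), x(n-2)
    for xn in x:
        y.append(A * xn + B * x1 + C * x2)   # one multiply + two chained adds per output
        x2, x1 = x1, xn                      # shift the delay line
    return y

# The impulse response recovers the coefficients A, B, C.
print(fir3([1, 0, 0, 0], A=0.5, B=0.3, C=0.2))   # [0.5, 0.3, 0.2, 0.0]
```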
How can we improve the performance?
Pipelining of FIR digital filter
• By adding latches
[Figure: (left) original 3-tap FIR filter, critical path = T_M + 2·T_A; (right) pipelined version with extra latches D inserted across a feed-forward cutset, internal nodes labeled 1, 2, 3, critical path = T_M + T_A]
Schedule of events in the pipeline
Clock | Input | Node 1        | Node 2        | Node 3  | Output
  0   | x(0)  | Ax(0)+Bx(-1)  | -             | -       | -
  1   | x(1)  | Ax(1)+Bx(0)   | Ax(0)+Bx(-1)  | Cx(-2)  | y(0)
  2   | x(2)  | Ax(2)+Bx(1)   | Ax(1)+Bx(0)   | Cx(-1)  | y(1)
  3   | x(3)  | Ax(3)+Bx(2)   | Ax(2)+Bx(1)   | Cx(0)   | y(2)
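A small simulation sketch that reproduces this schedule (illustrative, not from the slides; the names node1, node2, node3 mirror the labeled nodes in the figure and are otherwise assumptions). Zeros stand in for the '-' entries at clock 0.

```python
# Illustrative 2-stage pipelined 3-tap FIR: one clock of latency, critical path T_M + T_A.
def fir3_pipelined(x, A, B, C):
    x1 = x2 = 0.0              # delay line: x(n-1), x(n-2)
    node2 = node3 = 0.0        # pipelining latches on the feed-forward cutset
    schedule = []
    for clk, xn in enumerate(x):
        y = node2 + node3              # stage 2: add the two values latched last clock
        node1 = A * xn + B * x1        # stage 1: new partial sum (one multiply + one add)
        schedule.append((clk, xn, node1, node2, node3, y))
        node2, node3 = node1, C * x2   # clock edge: latch stage-1 result and C·x(n-2)
        x2, x1 = x1, xn                # shift the delay line
    return schedule

for row in fir3_pipelined([1, 2, 3, 4], A=1.0, B=1.0, C=1.0):
    print(row)                         # the output column lags the input by one clock
```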
Pipelining properties
• An M-level pipeline requires M-1 additional delay elements in every path from input to output
• The increase in speed comes with two penalties:
  – increased system latency
  – an increased number of latches
• Pipelining latches can only be placed across a feed-forward cutset of the graph (signal flow graph / DFG)
Cutset pipelining
• Cutset – a set of edges of a graph such that, if these edges are removed, the graph becomes disjoint.
• Feed-forward cutset – a cutset in which data move in the forward direction on all of its edges, e.g., the dotted line in the previous slide.
• Latches can be placed on a feed-forward cutset of any FIR filter structure without affecting the functionality of the algorithm: data movement between the two disjoint sub-graphs occurs only on the cutset edges, so delaying or advancing the data on all edges of the cutset by the same amount of time does not change the behavior (a small check of this property is sketched below).
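A minimal sketch of the feed-forward-cutset check (illustrative, not from the lecture; the edge-list graph representation and the function name are assumptions): remove the candidate edges, confirm the graph splits into disjoint sub-graphs, and confirm every cutset edge crosses the cut in the same direction.

```python
# Illustrative check that a candidate edge set is a feed-forward cutset of a DFG.
def is_feedforward_cutset(edges, cutset):
    remaining = [e for e in edges if e not in cutset]
    nodes = {n for e in edges for n in e}
    adj = {n: set() for n in nodes}
    for u, v in remaining:               # treat the remaining edges as undirected
        adj[u].add(v)
        adj[v].add(u)
    comp, label = {}, 0                  # label connected components by flood fill
    for start in nodes:
        if start in comp:
            continue
        stack = [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp[n] = label
                stack.extend(adj[n])
        label += 1
    if label < 2:
        return False                     # removing the edges did not disconnect the graph
    # Feed-forward: every cutset edge must cross the cut in the same direction.
    return len({(comp[u], comp[v]) for u, v in cutset}) == 1

# Example: x -> A -> B -> y with the single edge (A, B) as the cutset.
print(is_feedforward_cutset([("x", "A"), ("A", "B"), ("B", "y")], [("A", "B")]))  # True
```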
[Figure: a feed-forward cutset separating sub-graphs SG1 and SG2, with a delay D1 placed on each edge of the cutset]

[Figure: example DFGs built from adders A1–A6 and delays D. Valid pipelining: a delay D is placed on every edge of a feed-forward cutset and the critical path is reduced to 2. Not a valid pipelining: delays are placed on only some of the cutset edges – delays must be placed on all edges in the cutset.]
Data-broadcast structures
• The critical path of the original 3-tap FIR filter can be
reduced without introducing pipelining latches by
transposing the structure
• Transposition – reversing the direction of all the edges in
a given SFG and interchanging the input and output
ports preserves the functionality of the system.
[Figure: (left) SFG of the FIR filter – x(n) drives a delay chain (z^-1, z^-1) whose taps are weighted by a, b, c and summed to form y(n); (right) transposed SFG of the FIR – x(n) is broadcast to the multipliers a, b, c and the products are accumulated through the delay chain (z^-1, z^-1) to form y(n)]
Data-broadcast structures
• In the data-broadcast structure, based on the transposed form, the input data are not stored but are broadcast to all the multipliers simultaneously (a software sketch follows the figure below).
Critical path delay = T_M + T_A

[Figure: data-broadcast 3-tap FIR – x(n) is broadcast to multipliers C, B, A; each product is added to the delayed partial sum from the previous tap, with delays D between the taps, producing y(n)]
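A minimal sketch of the data-broadcast (transposed) form (illustrative; the function name fir3_transposed is an assumption). Between registers, each output needs only one multiply and one add, matching the T_M + T_A critical path above.

```python
# Illustrative transposed (data-broadcast) 3-tap FIR: x(n) is broadcast to all multipliers.
def fir3_transposed(x, A, B, C):
    s1 = s2 = 0.0                   # contents of the two delay registers D
    y = []
    for xn in x:
        y.append(A * xn + s1)       # output tap: A·x(n) plus the delayed partial sum
        s1 = B * xn + s2            # register 1 latches B·x(n) plus the delayed C product
        s2 = C * xn                 # register 2 latches C·x(n)
    return y

# Same impulse response as the direct form: the coefficients A, B, C.
print(fir3_transposed([1, 0, 0, 0], A=0.5, B=0.3, C=0.2))   # [0.5, 0.3, 0.2, 0.0]
```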
Fine-grain pipelining
• Break the functional units down further by pipelining to increase performance.
• E.g., break each multiplier into 2 smaller units, m1 and m2.

[Figure: data-broadcast 3-tap FIR with each multiplier split into stages m1 (delay 6) and m2 (delay 4), separated by pipelining latches D; the adders have delay 2]

Critical path delay = T_m2 + T_A = 4 + 2 = 6
Parallel Processing
• Parallel processing and pipelining are duals of each other
  – Both exploit the concurrency available in the computation
• In parallel processing, multiple outputs are computed per clock period using duplicated hardware
A Parallel FIR System
• E.g., a 3-tap FIR filter, a single-input single-output (SISO) system:
  y(n) = A·x(n) + B·x(n-1) + C·x(n-2)
• A parallel system takes in 3 inputs per clock cycle, i.e., the level of parallel processing is L = 3 (sketched in code below):
  – y(3k)   = A·x(3k)   + B·x(3k-1) + C·x(3k-2)
  – y(3k+1) = A·x(3k+1) + B·x(3k)   + C·x(3k-1)
  – y(3k+2) = A·x(3k+2) + B·x(3k+1) + C·x(3k)
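A sketch of this blocked computation (illustrative; the name fir3_parallel and the zero initial state are assumptions): each pass through the loop consumes one block of three inputs and produces three outputs in one clock cycle.

```python
# Illustrative 3-parallel (block) 3-tap FIR: L = 3 outputs computed per clock cycle.
def fir3_parallel(blocks, A, B, C):
    x1 = x2 = 0.0                          # carried state: x(3k-1), x(3k-2)
    out = []
    for xk0, xk1, xk2 in blocks:           # block k supplies x(3k), x(3k+1), x(3k+2)
        y0 = A * xk0 + B * x1  + C * x2    # y(3k)
        y1 = A * xk1 + B * xk0 + C * x1    # y(3k+1)
        y2 = A * xk2 + B * xk1 + C * xk0   # y(3k+2)
        out.append((y0, y1, y2))
        x1, x2 = xk2, xk1                  # last two samples become the next block's history
    return out

print(fir3_parallel([(1, 0, 0), (0, 0, 0)], A=0.5, B=0.3, C=0.2))
# [(0.5, 0.3, 0.2), (0.0, 0.0, 0.0)] – the impulse response, three samples per block
```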
[Figure: (left) sequential SISO system – x(n) in, y(n) out; (right) 3-parallel MIMO system – x(3k), x(3k+1), x(3k+2) in, y(3k), y(3k+1), y(3k+2) out]
A Parallel FIR System
[Figure: 3-parallel FIR structure – each block input x(3k), x(3k+1), x(3k+2) drives multipliers a, b, c; three adder trees produce y(3k), y(3k+1), y(3k+2), with delays D supplying the previous block's samples]

T_clk ≥ T_M + 2·T_A

T_iter = T_sample = (1/L)·T_clk = (1/3)·(T_M + 2·T_A)

• Parallel system:  T_clk ≠ T_sample
• Pipelined system: T_clk = T_sample
Complete parallel processing system
[Figure: complete 4-parallel processing system – x(n), at a sampling period of T/4, enters a serial-to-parallel converter clocked with period T/4; the converter presents x(4k), x(4k+1), x(4k+2), x(4k+3) to a MIMO system clocked with period T, which produces y(4k), y(4k+1), y(4k+2), y(4k+3); a parallel-to-serial converter reassembles the serial output y(n)]
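A sketch of this serial-to-parallel / MIMO / parallel-to-serial wrapping (illustrative; run_parallel_system and process_block are assumed names, and dropping a final partial block is a simplification):

```python
# Illustrative wrapper: serial input -> blocks of L samples -> MIMO core -> serial output.
def run_parallel_system(x, L, process_block):
    usable = len(x) - len(x) % L                                 # drop any final partial block
    blocks = [tuple(x[i:i + L]) for i in range(0, usable, L)]    # serial-to-parallel converter
    y = []
    for block in blocks:                                         # one MIMO clock cycle per block
        y.extend(process_block(block))                           # parallel-to-serial converter
    return y

# Example with a trivial stateless MIMO core that just scales each sample by 2.
core = lambda blk: tuple(2 * v for v in blk)
print(run_parallel_system([1, 2, 3, 4, 5, 6, 7], L=4, process_block=core))   # [2, 4, 6, 8]
```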
When should we use parallel over pipelined processing?
• There is a fundamental limit to pipelining, imposed by input/output (I/O) bottlenecks.
• Communication-bounded systems:
  – The communication time (input/output pad delay + wire delay) is larger than the computation delay.
  – Pipelining can only reduce the critical-path computation delay, so it cannot help a communication-bound system.
  – Only parallel processing can be used to improve the performance in this case.
  – Further improvement can be achieved by combining pipelining and parallel processing.

[Figure: Chip 1 drives Chip 2 through an output pad, an inter-chip wire, and an input pad; the communication delay T_comm exceeds the on-chip computation delay T_computation]
Low Power Signal Processing
• Higher speed
• Low power
• Dynamic power consumption:  P_dyn = C_total · V_0^2 · f
• Propagation delay:  T_pd = (C_charge · V_0) / (k · (V_0 - V_t)^2)
  – C_charge: the capacitance to be charged/discharged
  – V_0: supply voltage; V_t: threshold voltage
  – k: technology parameter
Pipelining for Low Power
• Sequential system: P_seq = C_total · V_0^2 · f
• After M-level pipelining the critical path is reduced, so the charge capacitance per stage drops to C_charge / M and a lower supply voltage V' = β·V_0 can be used at the same clock rate. The new power is
  P_pipe = C_total · (β·V_0)^2 · f = β^2 · P_seq
• The power-reduction factor β is found by equating the clock periods of the two systems:
  T_seq  = (C_charge · V_0) / (k · (V_0 - V_t)^2)
  T_pipe = (C_charge / M) · (β·V_0) / (k · (β·V_0 - V_t)^2)
  Setting T_pipe = T_seq gives
  M · (β·V_0 - V_t)^2 = β · (V_0 - V_t)^2

[Timing diagram: the sequential critical path of length T_seq at supply V_0, versus the pipelined critical path (M = 3) split into three segments of length T_pipe at supply β·V_0]
Example
[Figure: (left) data-broadcast 3-tap FIR with multipliers A, B, C; (right) fine-grain pipelined version with each multiplier split into m1 (delay 6) and m2 (delay 4), adders of delay 2, and pipelining latches D between the stages]
• Assumptions:
  – The capacitance of a multiplier, C_M, is 5 times that of an adder, C_A
  – Fine-grain pipelining is used, with C_m1 = 3·C_A and C_m2 = 2·C_A
  – V_dd = 5 V and V_t = 0.6 V
Solution
• For the original filter, C_charge = C_M + C_A = 6·C_A
• For the pipelined filter, C_charge = C_m1 = C_m2 + C_A = 3·C_A
• With M = 2 we have 2·(5β - 0.6)^2 = β·(5 - 0.6)^2; solving this quadratic gives β = 0.6033
• The supply voltage of the pipelined filter is V_pipe = β·V_0 ≈ 3 V
• The power consumption ratio is β^2 = 36.4% (checked numerically below)
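A quick numeric check of this solution (illustrative sketch, not part of the lecture):

```python
# Solve M*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for beta, with M = 2, V0 = 5 V, Vt = 0.6 V.
import numpy as np

M, V0, Vt = 2, 5.0, 0.6
# Expanded into a quadratic in beta: M*V0^2*b^2 - (2*M*V0*Vt + (V0 - Vt)^2)*b + M*Vt^2 = 0
coeffs = [M * V0**2, -(2 * M * V0 * Vt + (V0 - Vt)**2), M * Vt**2]
beta = max(np.roots(coeffs))           # the physically meaningful root (beta*V0 > Vt)
print(beta, beta**2)                   # ~0.6033 and ~0.364, i.e. 36.4% of the original power
```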
Parallel Processing for low power
• In an L-parallel architecture the charge capacitance remains the same, but the total capacitance C_total is increased L times.
• The clock rate of the L-parallel architecture is reduced by a factor of L (i.e., f = 1/(L·T_pd)) while maintaining the same sampling rate.
• The supply voltage can therefore be reduced to β·V_0, since more time is available to charge or discharge the same capacitance.
Parallel Processing for low power
T_seq = (C_charge · V_0) / (k · (V_0 - V_t)^2)

T_parallel = L · T_seq = (C_charge · β·V_0) / (k · (β·V_0 - V_t)^2)

which gives  L · (β·V_0 - V_t)^2 = β · (V_0 - V_t)^2

[Timing diagram: the sequential critical path of length T_seq at supply V_0, versus the parallel critical path of length 3·T_seq at supply β·V_0 (L = 3)]
Example: Reduce Power by parallel
• Consider the following FIR filters
[Figure: (left) sequential FIR filter – input x(n), delays D, output y(n); (right) 2-parallel version with inputs x(2k), x(2k+1), outputs y(2k), y(2k+1), and delays D]

Assumptions:
  – C_M = 8·C_A
  – T_M = 8·T_A
  – Both architectures operate at a sampling period of 9·T_A
  – Supply voltage V_0 = 3.3 V and V_t = 0.45 V
Solution
• Charge capacitances:
  – Sequential: C_charge = C_M + C_A = 9·C_A
  – Parallel:   C_charge = C_M + 2·C_A = 10·C_A
• Propagation delays:
  T_seq  = (9·C_A · V_0) / (k · (V_0 - V_t)^2)
  T_para = (10·C_A · β·V_0) / (k · (β·V_0 - V_t)^2)
• Since T_sample = T_seq and T_para = 2·T_sample = 2·T_seq, equating the two expressions gives
  5·β·(V_0 - V_t)^2 = 9·(β·V_0 - V_t)^2
• Solving with V_0 = 3.3 V and V_t = 0.45 V gives β = 0.659, so the power ratio is β^2 = 0.434 (checked numerically below)
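As a quick numeric check (illustrative sketch, not from the slides):

```python
# Solve 9*(beta*V0 - Vt)^2 = 5*beta*(V0 - Vt)^2 for beta, with V0 = 3.3 V, Vt = 0.45 V.
import numpy as np

V0, Vt = 3.3, 0.45
# Expanded into a quadratic in beta: 9*V0^2*b^2 - (18*V0*Vt + 5*(V0 - Vt)^2)*b + 9*Vt^2 = 0
coeffs = [9 * V0**2, -(18 * V0 * Vt + 5 * (V0 - Vt)**2), 9 * Vt**2]
beta = max(np.roots(coeffs))           # the physically meaningful root (beta*V0 > Vt)
print(beta, beta**2)                   # ~0.659 and ~0.434
```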