Pipelining and Parallel Processing

L7: Pipelining and Parallel Processing
VADA Lab.
Introduction (1)
 The pipelining transformation leads to a reduction in the critical path, which can be exploited either to increase the clock speed (sample speed) or to reduce power consumption at the same speed.
 In parallel processing, multiple outputs are computed in parallel in a clock period. Therefore, the effective sampling speed is increased by the level of parallelism.
Introduction (2)
 3-tap FIR digital filter
y(n) = ax(n)+bx(n-1)+cx(n-2)

Sample period (TM and TA are the multiplication and addition times):
$T_{sample} \ge T_M + 2T_A$

Sampling frequency:
$f_{sample} \le \frac{1}{T_M + 2T_A}$
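As a minimal sketch (not part of the original slides), the direct-form filter and its timing bound can be written in Python. The function names and the delay values TM = 10 and TA = 2 are illustrative assumptions.

```python
def fir3(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # delay elements holding x(n-1) and x(n-2), initially zero
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x2, x1 = x1, xn  # shift the delay line
    return y


def min_sample_period(t_m, t_a):
    """Critical path of the direct form: one multiplier followed by two adders."""
    return t_m + 2 * t_a


# Assumed, illustrative delays: T_M = 10 units, T_A = 2 units.
T_SAMPLE_MIN = min_sample_period(10, 2)
F_SAMPLE_MAX = 1 / T_SAMPLE_MIN

print(fir3([1, 2, 3, 4], a=1, b=2, c=3))  # [1, 4, 10, 16]
print(T_SAMPLE_MIN)                       # 14
```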
Pipelining of FIR digital filter
 Pipelined implementation of the 3-tap FIR filter is obtained by placing
2 additional latches.
 The critical path is reduced from TM+2TA to TM+TA.
 The two main drawbacks of pipelining are the increased number of latches and the increased system latency.
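The effect of the two extra latches can be checked with a small cycle-by-cycle simulation. This is only a sketch: the latch placement (one after the first adder, one on the input of the c multiplier) is an assumed feed-forward cutset, since the figure itself is not reproduced here, and all names are hypothetical.

```python
def fir3_direct(x, a, b, c):
    """Reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    pad = [0, 0] + list(x)
    return [a * pad[n + 2] + b * pad[n + 1] + c * pad[n] for n in range(len(x))]


def fir3_pipelined(x, a, b, c):
    """Pipelined 3-tap FIR with two extra latches (assumed placement)."""
    x1 = x2 = 0   # original delay elements: x(n-1), x(n-2)
    p_sum = 0     # pipeline latch after the first adder
    p_x = 0       # pipeline latch on the input of the 'c' multiplier
    y = []
    for xn in x:
        # Stage 2 works on values latched in the previous cycle (T_M + T_A).
        y.append(p_sum + c * p_x)
        # Stage 1 computes the partial sum for this cycle (T_M + T_A).
        new_sum = a * xn + b * x1
        # Clock edge: all registers update together.
        p_sum, p_x = new_sum, x2
        x2, x1 = x1, xn
    return y


x = [1, 2, 3, 4]
print(fir3_direct(x, 1, 2, 3))     # [1, 4, 10, 16]
print(fir3_pipelined(x, 1, 2, 3))  # [0, 1, 4, 10]
```

The pipelined output is the direct-form output delayed by one clock cycle, which is the added latency mentioned above.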
<Pipelined FIR filter>
Pipelining of FIR digital filter (2)
 The critical path (longest path) can be reduced by suitably placing the
pipelining latches in the architecture.
 Pipelining latches can only be placed across a feed-forward cutset of the graph.
 Two graph definitions are needed for pipelining:
Cutset: a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint.
Feed-forward cutset: a cutset in which the data move in the forward direction on all of its edges.
 To obtain a correctly pipelined circuit, pipelining latches must be inserted on all the edges of the feed-forward cutset.
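The two definitions can be turned into a small check. The sketch below assumes that "disjoint" means the graph splits into exactly two weakly connected parts, and that "forward" means every cut edge points from the same part to the other. The node names for the 3-tap FIR graph are hypothetical.

```python
from collections import defaultdict


def components_without(nodes, edges, removed):
    """Weakly connected components after deleting the edges in `removed`."""
    adj = defaultdict(set)
    for u, v in edges:
        if (u, v) not in removed:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def is_feedforward_cutset(nodes, edges, cut):
    """True if `cut` splits the graph in two and all cut edges point the same way."""
    comps = components_without(nodes, edges, set(cut))
    if len(comps) != 2:
        return False
    src = comps[0] if cut[0][0] in comps[0] else comps[1]
    return all(u in src and v not in src for u, v in cut)


# Direct-form 3-tap FIR (hypothetical node names).
NODES = ["x", "d1", "d2", "ma", "mb", "mc", "add1", "add2", "y"]
EDGES = [("x", "ma"), ("x", "d1"), ("d1", "mb"), ("d1", "d2"), ("d2", "mc"),
         ("ma", "add1"), ("mb", "add1"), ("add1", "add2"), ("mc", "add2"),
         ("add2", "y")]
CUT = [("add1", "add2"), ("d2", "mc")]

print(is_feedforward_cutset(NODES, EDGES, CUT))                 # True
print(is_feedforward_cutset(NODES, EDGES, [("add1", "add2")]))  # False
```

The second call fails because removing a single edge leaves the graph connected, so that edge alone is not a cutset.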
Pipelining of FIR digital filter (3)
 Signal-flow graph example
Pipelining of FIR digital filter (4)
 Data-Broadcast Structures

The critical path of the original 3-tap FIR filter can be reduced
without introducing any pipelining latches by transposing the
structure.

Transposition theorem
“Reversing the direction of all the edges in a given SFG (signal-flow graph) and interchanging the input and output ports preserves the functionality of the system.”
Pipelining of FIR digital filter (5)
<SFG representation of the FIR filter>
<Transposed SFG representation of the FIR filter>
Pipelining of FIR digital filter (6)
 Transposed SFG representation leads to the data-broadcast
structure where data are not stored but are broadcast to all
the multipliers simultaneously.
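A minimal sketch of the data-broadcast (transposed) form, with hypothetical names: the input sample feeds all three multipliers at once, and the delay elements hold partial sums rather than past inputs.

```python
def fir3_broadcast(x, a, b, c):
    """Transposed (data-broadcast) 3-tap FIR filter."""
    # At the start of cycle n: s2 = c*x(n-1), s1 = b*x(n-1) + c*x(n-2).
    s1 = s2 = 0
    y = []
    for xn in x:
        y.append(a * xn + s1)         # longest path: one multiply, one add
        s1, s2 = b * xn + s2, c * xn  # registers update on the clock edge
    return y


print(fir3_broadcast([1, 2, 3, 4], a=1, b=2, c=3))  # [1, 4, 10, 16]
```

The output matches y(n) = ax(n)+bx(n-1)+cx(n-2), while every register-to-register path now contains at most one multiplier and one adder (TM+TA).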
<Data-broadcast structure of the FIR filter>
Pipelining of FIR digital filter (7)
 Fine-Grain Pipelining



Let TM = 10 units and TA = 2 units, and let the desired clock period be (TM+TA)/2 = 6 units.
In this case the multiplier is broken into 2 smaller units with processing times of 6 units and 4 units, respectively.
By placing the latches on the horizontal cutset across the multiplier, the desired clock period can be achieved.
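The arithmetic can be checked with a short sketch. It assumes TA = 2 units (the value implied by (TM+TA)/2 = 6) and a starting structure whose critical path is one multiplier plus one adder. The helper name is hypothetical.

```python
def clock_period(stage_delays):
    """The clock period is set by the slowest pipeline stage."""
    return max(sum(stage) for stage in stage_delays)


T_M, T_A = 10, 2   # T_A = 2 is implied by (T_M + T_A) / 2 = 6
M1, M2 = 6, 4      # multiplier split into two pieces, M1 + M2 = T_M

# Before: one multiplier followed by one adder in a single stage.
before = clock_period([[T_M, T_A]])
# After: latch inside the multiplier; stage 2 is the rest of it plus the adder.
after = clock_period([[M1], [M2, T_A]])

print(before, after)  # 12 6
```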
Parallel Processing (1)
 Designing a Parallel FIR System

To obtain a parallel processing structure, the SISO (single-input single-output) system must be converted into a MIMO (multiple-input multiple-output) system. With block size 3, the three outputs computed in clock cycle k are:
y(3k) = ax(3k)+bx(3k-1)+cx(3k-2)
y(3k+1) = ax(3k+1)+bx(3k)+cx(3k-1)
y(3k+2) = ax(3k+2)+bx(3k+1)+cx(3k)

Parallel Processing systems are also referred to as block
processing systems.
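The three block equations map directly onto code. The following sketch (hypothetical function name, zero initial conditions assumed) produces three outputs per loop iteration, where one iteration stands for one clock cycle k.

```python
def fir3_parallel(x, a, b, c):
    """3-parallel (block size 3) FIR: three outputs per clock cycle k."""
    if len(x) % 3 != 0:
        raise ValueError("sketch assumes whole blocks of 3 samples")

    def xs(i):
        return x[i] if i >= 0 else 0  # x(n) = 0 for n < 0

    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y0 = a * xs(n) + b * xs(n - 1) + c * xs(n - 2)      # y(3k)
        y1 = a * xs(n + 1) + b * xs(n) + c * xs(n - 1)      # y(3k+1)
        y2 = a * xs(n + 2) + b * xs(n + 1) + c * xs(n)      # y(3k+2)
        y.extend([y0, y1, y2])
    return y


print(fir3_parallel([1, 2, 3, 4, 5, 6], a=1, b=2, c=3))
# [1, 4, 10, 16, 22, 28]
```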
Parallel Processing (2)
 Parallel processing architecture for a 3-tap FIR filter
(with block size 3)
Parallel Processing (3)
 The critical path of the parallel processing system has remained
unchanged, and the clock period (Tclk) must satisfy:
$T_{clk} \ge T_M + 2T_A$
 But since 3 samples are processed in 1 clock cycle instead of 3 clock cycles, the iteration period is given by
$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$
 In a pipelined system, by contrast, $T_{clk} = T_{sample}$.
Parallel Processing (4)
 Complete parallel processing system with block size 4
Parallel Processing (5)
 Why do we use parallel processing when we can use pipelining?

Because pipelining is subject to a fundamental limit imposed by I/O bottlenecks.
Pipelining can be combined with parallel processing to further increase the speed of the architecture.
By combining parallel processing (block size L = 3) and pipelining (M = 2 stages), the sample period is reduced to
$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} \ge \frac{1}{6}(T_M + 2T_A)$
where Tclk is the clock period of the original, unpipelined system.

Parallel processing is also used to reduce power consumption while using slow clocks.
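The sample-period bounds for the sequential, parallel, and combined cases can be compared numerically. This sketch assumes TM = 10 and TA = 2 units and, as an idealisation, that M-level pipelining divides the critical path TM+2TA evenly. The function name is hypothetical.

```python
from fractions import Fraction


def sample_period(t_m, t_a, L=1, M=1):
    """Lower bound on T_sample for an L-parallel, M-level pipelined 3-tap FIR.

    Assumes the unpipelined critical path T_M + 2*T_A is divided evenly
    among the M pipeline stages.
    """
    return Fraction(t_m + 2 * t_a, L * M)


print(sample_period(10, 2))            # 14    sequential
print(sample_period(10, 2, L=3))       # 14/3  3-parallel
print(sample_period(10, 2, L=3, M=2))  # 7/3   3-parallel, 2-level pipelined
```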
Parallel Processing (6)
< A chip set>
Parallel Processing (7)
<Combined fine-grain pipelining and parallel processing for 3-tap FIR filter>
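Pipelining and Parallel processing for Low power
 There are two main advantages of using pipelining and parallel processing:

Higher speed
Lower power
 For a CMOS circuit, the propagation delay can be written as:
$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
where $C_{charge}$ is the capacitance charged or discharged in a single clock cycle, $V_0$ is the supply voltage, $V_t$ is the threshold voltage, and $k$ is a technology-dependent constant.
 The power consumption of a CMOS circuit can be estimated as:
$P = C_{total} V_0^2 f$
where $C_{total}$ is the total capacitance of the circuit and $f$ is the clock frequency.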
Pipelining for Low power (1)

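 $P_{seq} = C_{total} V_0^2 f$ represents the power consumed in the original filter, where $f = 1/T_{seq}$ and $T_{seq}$ is the clock period of the original sequential filter.
 In the M-level pipelined system, the critical path is reduced to 1/M of its original length, and the capacitance to be charged/discharged in a single clock cycle is reduced to $C_{charge}/M$.
 If the clock frequency is kept the same, only this fraction of the original capacitance has to be charged in the same time, so the supply voltage can be reduced to $\beta V_0$, where $0 < \beta < 1$.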
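Pipelining for Low power (2)
 The power consumption factor, $\beta$, can be determined by examining the relationship between the propagation delay of the original filter and that of the pipelined filter:
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
$T_{pipe} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
 Since both filters run at the same clock period ($T_{seq} = T_{pipe}$), $\beta$ satisfies $M(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$, and the power consumption of the pipelined filter is
$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$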
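Setting the sequential and pipelined propagation delays equal gives a quadratic in the power consumption factor, M(βV0 − Vt)² = β(V0 − Vt)². The sketch below solves it in closed form. The numbers M = 3, V0 = 5 V and Vt = 0.6 V are assumed purely for illustration, and the function name is hypothetical.

```python
import math


def beta_pipelined(M, v0, vt):
    """Solve M*(beta*v0 - vt)**2 = beta*(v0 - vt)**2 for the feasible beta."""
    # Expanded form: qa*beta^2 + qb*beta + qc = 0.
    qa = M * v0 ** 2
    qb = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    qc = M * vt ** 2
    disc = math.sqrt(qb ** 2 - 4 * qa * qc)
    roots = [(-qb + disc) / (2 * qa), (-qb - disc) / (2 * qa)]
    # The reduced supply beta*v0 must stay above the threshold voltage.
    return max(r for r in roots if r * v0 > vt)


beta = beta_pipelined(M=3, v0=5.0, vt=0.6)  # assumed example values
power_ratio = beta ** 2                     # P_pip / P_seq
print(round(beta, 3), round(power_ratio, 3))
```

The second root of the quadratic puts the supply voltage below Vt, so it is discarded.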
Parallel processing for Low power (1)
 Parallel processing, like pipelining, can reduce the power consumption of a system by allowing the supply voltage to be reduced.
 In an L-parallel system, the charging capacitance does not change, while the total capacitance is increased L times.
 In order to maintain the same sample rate, the clock period of the L-parallel circuit must be increased to $L T_{seq}$, where $T_{seq}$ is the propagation delay of the sequential circuit.
 There is more time to charge the same capacitance, so the supply voltage can be reduced to $\beta V_0$.
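Parallel processing for Low power (2)
 The propagation delay of the L-parallel system is given by:
$L T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$
where the propagation delay of the sequential system is
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Combining the two gives $L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$, which determines $\beta$.
 The total capacitance is $L C_{total}$ and the clock frequency is $f/L$, so the power consumption of the L-parallel system is
$P_{par} = (L C_{total}) (\beta V_0)^2 \frac{f}{L} = \beta^2 C_{total} V_0^2 f = \beta^2 P_{seq}$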
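As a numerical cross-check of the parallel case, the supply scaling factor can also be found by bisection directly on the delay formula, without expanding the quadratic. This is a sketch with assumed values (L = 3, V0 = 5 V, Vt = 0.6 V, normalised capacitance and frequency) and hypothetical function names.

```python
def t_pd(c_charge, v, vt, k=1.0):
    """CMOS propagation delay: C_charge * V / (k * (V - Vt)^2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def beta_parallel(L, v0, vt, tol=1e-12):
    """Bisection for beta such that t_pd(beta*v0) equals L * t_pd(v0)."""
    target = L * t_pd(1.0, v0, vt)
    lo, hi = vt / v0 + 1e-9, 1.0  # delay falls as the voltage rises above Vt
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_pd(1.0, mid * v0, vt) > target:
            lo = mid  # still too slow: raise the voltage
        else:
            hi = mid
    return (lo + hi) / 2


L, V0, VT = 3, 5.0, 0.6  # assumed example values
beta = beta_parallel(L, V0, VT)

c_total, f = 1.0, 1.0    # normalised total capacitance and clock frequency
p_seq = c_total * V0 ** 2 * f
p_par = (L * c_total) * (beta * V0) ** 2 * (f / L)
print(round(beta, 3), round(p_par / p_seq, 3))
```

The factor L in the total capacitance cancels against the slower clock f/L, so the power ratio reduces to β².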
Conclusions
 Pipelining

Pipelining latches are placed across the feed-forward cutsets in the SFG, and the computation time of the critical path is reduced.
The clock frequency can be increased, and hence the sampling rate is increased.
 Parallel processing

The hardware of the original serial system is duplicated, and the resulting system is a MIMO parallel system.
The clock frequency stays the same, and the sampling frequency is increased.
 Both schemes can be used for higher speed and for lower power design (using a lower supply voltage).