Pipelining and Parallel Processing
L7: Pipelining and
Parallel Processing
VADA Lab.
Introduction (1)
Pipelining transformation leads
to a reduction in the critical path,
which can be exploited to increase
the clock speed (sample speed), or
to reduce power consumption at
the same speed.
In parallel processing,
multiple outputs are computed in
parallel in a clock period.
Therefore, the effective sampling
speed is increased by the level of
parallelism.
Introduction (2)
3-tap FIR digital filter
y(n) = ax(n)+bx(n-1)+cx(n-2)
Sample period
$T_{sample} \ge T_M + 2T_A$
Sampling frequency
$f_{sample} \le \frac{1}{T_M + 2T_A}$
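As a point of reference for the structures that follow, the filter equation can be sketched in Python. This is a minimal behavioural model, not a hardware description; the function names and the assumption that samples before the first input are zero are illustrative choices, not part of the slides.

```python
def fir_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # x(n-1) and x(n-2), assumed zero before the first sample
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1  # shift the delay line
    return y


def min_sample_period(t_m, t_a):
    """Lower bound on the sample period: one multiply followed by two adds."""
    return t_m + 2 * t_a
```

The bound returned by `min_sample_period` is the critical path of the direct-form structure, which is what pipelining and parallel processing both try to work around.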
Pipelining of FIR digital filter
Pipelined implementation of the 3-tap FIR filter is obtained by placing
2 additional latches.
The critical path is reduced from TM+2TA to TM+TA .
The two main drawbacks of pipelining are an increase in the number
of latches and an increase in system latency.
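The latency cost can be seen in a small cycle-by-cycle sketch. The latch placement below is an assumption: one latch between the two adders and one on the path feeding the c multiplier, which is one cutset that gives a critical path of TM+TA. The names `sum_latch` and `x2_latch` are hypothetical.

```python
def fir_pipelined(x, a, b, c):
    """Cycle-by-cycle model of the 3-tap FIR with two pipelining latches."""
    out = []
    x1 = x2 = 0      # delay line: x(n-1), x(n-2)
    sum_latch = 0    # latch between the two adders
    x2_latch = 0     # latch on the path feeding the c multiplier
    for xn in x:
        # Stage 2 uses values latched in the previous cycle (T_M + T_A).
        out.append(sum_latch + c * x2_latch)
        # Stage 1 computes the partial sum for this cycle (T_M + T_A).
        sum_latch = a * xn + b * x1
        x2_latch = x2
        x1, x2 = xn, x1
    return out


def fir_reference(x, a, b, c):
    """Unpipelined filter, used only to check the pipelined model."""
    padded = [0, 0] + list(x)
    return [a * padded[n + 2] + b * padded[n + 1] + c * padded[n]
            for n in range(len(x))]
```

In this model the output in cycle n+1 equals y(n) of the unpipelined filter, so the values are unchanged and only arrive one clock cycle later.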
<Pipelined FIR filter>
Pipelining of FIR digital filter (2)
The critical path (longest path) can be reduced by suitably placing the
pipelining latches in the architecture.
The pipelining latches can only be placed across any feed-forward
cutset of the graph
Two graph definitions are needed for pipelining.
Cutset: A cutset is a set of edges of a graph such that if these edges are
removed from the graph, the graph becomes disjoint.
Feed-forward cutset: A cutset is called a feed-forward cutset if the data
move in the forward direction on all the edges of the cutset.
To obtain a correctly pipelined circuit, pipelining latches should
be inserted on all the edges in the feed-forward cutset.
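The two definitions can be turned into a small check. This is a sketch under a simplifying assumption: the cut must split the graph into exactly two parts, and every cut edge must cross from the same part to the other. The function name and the node labels in the usage note are hypothetical.

```python
def is_feed_forward_cutset(nodes, edges, cut):
    """Return True if `cut` is a feed-forward cutset of the directed graph."""
    remaining = [e for e in edges if e not in cut]
    # Label weakly connected components after removing the cut edges.
    label = {}
    for start in nodes:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for s, d in remaining:
                for near, far in ((s, d), (d, s)):
                    if near == u and far not in label:
                        label[far] = start
                        stack.append(far)
    if len(set(label.values())) != 2:
        return False  # not split into exactly two disjoint parts
    # Every cut edge must cross between the parts in the same direction.
    directions = {(label[s], label[d]) for s, d in cut}
    return len(directions) == 1 and all(p != q for p, q in directions)
```

For the 3-tap FIR graph with nodes `x, d1, d2, add1, add2`, the cut `[("d1", "d2"), ("add1", "add2")]` passes the check, while cutting a two-node feedback loop does not, because data cross the cut in both directions.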
Pipelining of FIR digital filter (3)
Signal-flow graph example
Pipelining of FIR digital filter (4)
Data-Broadcast Structures
The critical path of the original 3-tap FIR filter can be reduced
without introducing any pipelining latches by transposing the
structure.
Transposition theorem
“Reversing the direction of all the edges in a given SFG (signal-flow graph) and interchanging the input and output ports
preserves the functionality of the system.”
Pipelining of FIR digital filter (5)
< SFG representation
of the FIR filter>
< Transposed SFG representation
of the FIR filter>
Pipelining of FIR digital filter (6)
Transposed SFG representation leads to the data-broadcast
structure where data are not stored but are broadcast to all
the multipliers simultaneously.
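A behavioural sketch of the data-broadcast structure is given below. It assumes the two registers hold partial sums and start at zero; the register names `r1` and `r2` are illustrative.

```python
def fir_data_broadcast(x, a, b, c):
    """Transposed 3-tap FIR: x(n) is broadcast to all three multipliers."""
    y = []
    r1 = r2 = 0  # registers holding partial sums, assumed initially zero
    for xn in x:
        y.append(a * xn + r1)          # critical path: one multiply, one add
        r1, r2 = b * xn + r2, c * xn   # partial sums for the next two outputs
    return y
```

Each output is formed from one multiplication and one addition on the current input plus a stored partial sum, which is why the critical path drops to TM+TA without any extra pipelining latches.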
<Data-broadcast structure of the FIR filter>
Pipelining of FIR digital filter (7)
Fine-Grain Pipelining
Let TM=10 units and TA=2 units, and the desired clock period be
(TM+TA)/2=6 units.
In this case the multiplier is broken into 2 smaller units with
processing times of 6 units and 4 units, respectively.
By placing the latches on the horizontal cutset across the
multiplier, the desired clock speed can be achieved.
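The arithmetic can be checked with a short helper. It assumes that the adder shares a pipeline stage with the last multiplier part, as in the data-broadcast structure; the function name is hypothetical.

```python
def fine_grain_clock_period(t_m_parts, t_a):
    """Clock period when the multiplier is cut into the given parts.

    Assumes the adder shares a stage with the last multiplier part.
    """
    stages = list(t_m_parts)
    stages[-1] += t_a
    return max(stages)
```

With parts of 6 and 4 units and an adder of 2 units, both stages take 6 units. An even 5/5 split would leave the second stage at 7 units, which is why the multiplier is split unevenly.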
Parallel Processing (1)
Designing a Parallel FIR System
To obtain a parallel processing structure, the SISO (single-input
single-output) system must be converted into a MIMO (multiple-input multiple-output) system.
y(3k) = ax(3k)+bx(3k-1)+cx(3k-2)
y(3k+1) = ax(3k+1)+bx(3k)+cx(3k-1)
y(3k+2) = ax(3k+2)+bx(3k+1)+cx(3k)
Parallel Processing systems are also referred to as block
processing systems.
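The three block equations map directly onto code. The sketch below treats each loop iteration as one clock cycle and assumes zero samples before the first block; the function name is illustrative.

```python
def fir_parallel3(x, a, b, c):
    """Process three samples per clock cycle (block size 3)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size")
    y = []
    xm1 = xm2 = 0  # x(3k-1) and x(3k-2), zero before the first block
    for k in range(0, len(x), 3):
        x0, x1, x2 = x[k], x[k + 1], x[k + 2]  # x(3k), x(3k+1), x(3k+2)
        y.append(a * x0 + b * xm1 + c * xm2)   # y(3k)
        y.append(a * x1 + b * x0 + c * xm1)    # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)     # y(3k+2)
        xm1, xm2 = x2, x1                      # carried into the next block
    return y
```

Only two samples have to be carried from one block to the next, while the three outputs of a block are computed independently of each other.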
Parallel Processing (2)
Parallel processing architecture for a 3-tap FIR filter
(with block size 3)
Parallel Processing (3)
The critical path of the parallel processing system has remained
unchanged and the clock period (Tclk) must satisfy:
$T_{clk} \ge T_M + 2T_A$
But since 3 samples are processed in 1 clock cycle instead of 3 clock cycles, the
iteration period is given by
$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$
In a pipelined system: Tclk = Tsample
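The distinction between clock period and sample period in a parallel system can be made concrete with a short calculation. The function name is hypothetical, and the values in the usage note (TM=10, TA=2) are borrowed from the fine-grain pipelining slide.

```python
def parallel_sample_period(t_m, t_a, block_size):
    """Minimum clock and sample periods for the L-parallel 3-tap FIR filter."""
    t_clk = t_m + 2 * t_a          # critical path is unchanged
    t_sample = t_clk / block_size  # L samples are produced per clock cycle
    return t_clk, t_sample
```

For TM=10, TA=2 and block size 3, the clock period stays at 14 units while the sample period drops to 14/3 units.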
Parallel Processing (4)
Complete parallel processing system with block size 4
Parallel Processing (5)
Why use parallel processing when pipelining is available?
Because I/O bottlenecks impose a fundamental limit on pipelining.
Pipelining can be combined with parallel processing to further increase
the speed of the architecture.
By combining parallel processing and pipelining, the sample period has
been reduced to
$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} \ge \frac{1}{6}(T_M + 2T_A)$
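As a worked instance of this bound, take the values from the fine-grain pipelining slide, TM = 10 units and TA = 2 units (the latter implied by the desired clock period of 6 units), with LM = 6:

```latex
T_{sample} \ge \frac{1}{6}\,(T_M + 2T_A) = \frac{1}{6}\,(10 + 2 \cdot 2) = \frac{14}{6} \approx 2.33 \text{ units}
```

This compares with 14 units for the original sequential filter under the same assumed delays.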
Parallel processing is also used for reduction of power consumption while
using slow clocks.
Parallel Processing (6)
< A chip set>
Parallel Processing (7)
<Combined fine-grain pipelining and parallel processing
for 3-tap FIR filter>
Pipelining and Parallel processing
for Low power
There are two main advantages of using pipelining and parallel
processing :
Higher speed
Lower power
For a CMOS circuit, the propagation delay can be written as:
$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
Power consumption of a CMOS circuit can be estimated as:
$P = C_{total} V_0^2 f$
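The trade-off behind the low-power argument can be seen by evaluating both formulas. This is a minimal sketch; the function and parameter names are illustrative, and the numeric values in the usage note are arbitrary assumptions.

```python
def propagation_delay(c_charge, v0, v_t, k):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    return c_charge * v0 / (k * (v0 - v_t) ** 2)


def dynamic_power(c_total, v0, f):
    """P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f
```

Halving the supply voltage cuts the power estimate to a quarter, but the propagation delay grows, for example when going from V0 = 5 to V0 = 2.5 with Vt = 0.6. Pipelining and parallel processing supply the timing slack that makes the lower voltage usable.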
Pipelining for Low power (1)
$P_{seq} = C_{total} V_0^2 f$ represents the power consumed in the original filter,
where $f = 1/T_{seq}$ and Tseq is the clock period of the original sequential filter.
In the M-level pipelined system, the critical path is reduced to 1/M of
its original length and the capacitance to be charged/discharged in a
single clock cycle is reduced to Ccharge / M.
If the clock frequency is kept the same, the supply voltage can be reduced to
$\beta V_0$, where $\beta$ is a positive constant less than 1.
Pipelining for Low power (2)
The power consumption factor, $\beta$, can be determined by examining
the relationship between the propagation delay of the original filter and
the pipelined filter.
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
$T_{pipe} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$
$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
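If the pipelined filter runs at the same clock period as the original, the two delays are equal, and the supply scaling factor (written here as beta) satisfies M(beta·V0 − Vt)² = beta·(V0 − Vt)². The sketch below solves that quadratic; the function name and the example values V0 = 5 V, Vt = 0.6 V, M = 3 are assumptions chosen for illustration.

```python
import math


def pipelined_voltage_factor(m, v0, v_t):
    """Solve M*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for beta with beta*V0 > Vt."""
    qa = m * v0 ** 2
    qb = -(2 * m * v0 * v_t + (v0 - v_t) ** 2)
    qc = m * v_t ** 2
    disc = math.sqrt(qb ** 2 - 4 * qa * qc)
    # The larger root keeps the reduced supply above the threshold voltage.
    return (-qb + disc) / (2 * qa)
```

For the assumed values the factor comes out at roughly 0.47, so the pipelined filter would consume a little over a fifth of the original power at the same sample rate.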
Parallel processing for Low power (1)
Parallel processing, like pipelining, can reduce the power consumption
of a system by allowing the supply voltage to be reduced.
In an L-parallel system, the charging capacitance does not change
while the total capacitance is increased by L times.
In order to maintain the same sample rate, the clock period of the L-parallel circuit must be increased to LTseq, where Tseq is the propagation
delay of the sequential circuit.
There is more time to charge the same capacitance, so the supply voltage
can be reduced to $\beta V_0$.
Parallel processing for Low power (2)
The propagation delay of the L-parallel system is given by:
$L T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$
$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$
$P_{par} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$
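The step from the two delay expressions to the supply scaling factor (written here as beta) can be spelled out. Substituting the sequential delay into the L-parallel delay and cancelling the common terms gives a quadratic in beta, and the power expression follows from the L-fold capacitance running at 1/L of the clock frequency:

```latex
L \cdot \frac{C_{charge} V_0}{k (V_0 - V_t)^2} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}
\;\Longrightarrow\;
L\,(\beta V_0 - V_t)^2 = \beta\,(V_0 - V_t)^2

P_{par} = (L\, C_{total})\,(\beta V_0)^2\, \frac{f}{L} = \beta^2 C_{total} V_0^2 f = \beta^2 P_{seq}
```

Of the two roots of the quadratic, only the one with beta·V0 > Vt is physically meaningful, since the supply must stay above the threshold voltage.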
Conclusions
Pipelining
Pipelining latches are placed across the feed-forward cutsets in the
SFG, and the computation time of the critical path is reduced.
The clock frequency can be increased, and hence the sampling rate
is increased.
Parallel processing
The hardware of the original serial system is duplicated, and the
resulting system is a MIMO parallel system.
The clock frequency stays the same, and the sampling frequency is increased.
Both schemes are used for higher-speed and lower-power
design (using a lower supply voltage).