CircuitOptionsPt2A - The University of Auckland

Download Report

Transcript CircuitOptionsPt2A - The University of Auckland

Reconfigurable Computing Options in Circuit Design
John Morris
Chung-Ang University
The University of Auckland
‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia
Design Options – so far
‘Structural Options’
1. Bit serial
 Most Space efficient
 Slow

One bit of result produced per cycle
 Sometimes this isn’t a problem
 Example


Small efficient adder
Very small multiplier
Serial Circuits
 Bit serial adder
ENTITY serial_add IS
PORT( a, b, clk : IN std_logic;
sum, cout : OUT std_logic );
END ENTITY serial_add;
b
FA
cin
cout
2-bit
register
sum
a
clock
ARCHITECTURE df OF serial_add IS
Note:
SIGNAL cint : std_logic;
BEGIN
The synthesizer will insert
PROCESS( clk )
the latch on the internal signals!
BEGIN
IF clk’EVENT AND clk = ‘1’ THEN
sum <= a XOR b XOR cint;
cint <= (a AND b) OR (b AND cint) OR (a AND cint );
END IF;
Note:
END PROCESS;
Reset or clear needed to
cout <= cint;
frame operands!
END ARCHITECTURE df;
Design Options – so far
‘Structural Options’
1. Bit serial
 Most Space efficient
2. Sequential
 Combinatorial / bit-parallel block + register
 Example

Sequential multiplier – adder + shifter + register
Design Options – so far
‘Structural Options’
1. Bit serial
2. Sequential
3. Pipelined
 High throughput
 High latency too though!
 Need to achieve pipeline balance


Every stage should have similar propagation delay
More later!
 Example

Pipelined multiplier
4. Examine communication patterns
 Example

Eliminate horizontal carry chains in parallel array multiplier
Design Options – so far
‘Structural Options’
1. Bit serial
2. Sequential
3. Pipelined
4. Examine communication patterns
 Example

Eliminate horizontal carry chains in parallel array multiplier
Multipliers
 We can add the partial products with FA blocks
a3
a2
a1
a0
0
Try to use a
b0
more efficient adder
A simpler scheme
in each row?
uses a ‘carry save’
FA
FA
FA
FA
adder – which
b1
pushes
the carry
out’s down to the
next row!
FA
FA
FA
FA
b2
FA
FA
FA
Carry select adder
FA
Note that an extra adder is
needed below the last row to
add the last partial products and
p0
product
bitsabove!
p1 the carries
from
the row
Design Options – so far
‘Structural Options’
1. Bit serial
2. Sequential
3. Pipelined
4. Examine communication patterns
5. Tree structures
 Example
 Combine carries in level below
 Wallace Tree multiplier
Signed digit arithmetic – Avoiding the carries!
 If we use more than one bit to represent each bit of an operand
In binary, the partial products
are trivial –
if multiplier bit = 1, copy the
multiplicand
else 0
Use an ‘and’ gate!
Residue Arithmetic
 Residue Number Systems
 A verse by the Chinese scholar, Sun Tsu, over 1500 years ago
posed this problem
 What number has remainders 2, 3 and 2 when divided by the numbers
7, 5 and 3, respectively?
 This is probably the first documented use of number
representations using multiple residues
 In a residue number system,
a number, x, is represented by the list of its residues (remainders) with
respect to k relatively prime moduli,
mk-1, mk-2, …, m0
 Thus x is represented by
(xk-1, xk-2, …, x0)
where
xi = x mod mi
 So the puzzle may be re-written
What is the decimal representation of (2,3,2) in RNS(7,5,3)?
Residue Number Systems
 The dynamic range of a RNS,
M = mk-1  mk-2  … m0
 For example, in the system RNS(8,7,5,3)
M = 8  7  5  3 = 840
 Thus we have
RNS(8,7,5,3)
Decimal
(0,0,0,0)
0 or 840 or -840 or …
(1,1,1,1)
1 or 841 or -839 or …
(2,2,2,2)
2 or 842 or …
(0,1,3,2)
8 or 848 or …
 Any RNS can be viewed as a weighted representation
 In RNS(8,7,5,3), the weights are:
105 120 336 280
 Thus (1,2,4,0) represents
(105  1 + 120  2 336  4 + 280  0)840 = (1689)840 = 9
Residue Number Systems - Operations
 Complement
 To find –x, complement each of the digits with respect to the
modulus for that digit
21 = (5,0,1,0)
 so
-21 =
(8-5,0,5-1,0) = (3,0,4,0)
 Addition or subtraction is performed on each digit
(
5
,
5
, 0 ,
2
)RNS
(
7
,
6
, 4 ,
2
)RNS
( (5+7)=48, (5+6)=47, 4 , (2+2)=13)RNS
(
4
,
4
, 4 ,
1
)RNS
= 510
= -110
= 410
= 410
 Multiplication is also achieved by operations on each digit
(
5
,
5
, 0 ,
2
)RNS = 510
(
7
,
6
, 4 ,
2
)RNS = -110
( (5x7)=38, (5x6)=27, 0 , (2x2)=13)RNS = -510
(
3
,
2
, 0 ,
1
)RNS = -510
Residue Arithmetic - Advantages
 Parallel independent operations on small numbers of digits
 Significant speed ups
 Especially for multiplication!
 4 bit x 4 bit multiplier (moduli up to 15)
much simpler than 16 bit x 16 bit one
 Carries are strictly confined to small numbers of bits
 Each modulus is only a small number of bits
 Can be implemented in Look Up Tables (LUTs)
 6 bit residues (moduli up to 63)
 64 x 64 x 6 bits required (<4Kbytes)
Residue Arithmetic – Choosing the moduli
 Largest modulus determines the overall speed –
 Try to make it as small as possible
 Simple strategy
 Choose sequence of prime numbers until the dynamic range, M,
becomes large enough
eg Application requires a range of at least 105, ie M  105




For RNS(13,11,7,5,3,2), M = 30,300
Range is too low, so add one more modulus:
RNS(17,13,11,7,5,3,2), M = 510,510
Now
• each modulus requires a separate circuit and
• our range is now ~5 times as large as needed, so remove 5:
 RNS(17,13,11,7,3,2), M = 102,102
 Six residues, requiring
5 + 4 + 4 + 3 + 2 + 1 = 19 bits
 The largest modulus (17 requiring 5 bits) determines the speed,
so …
Residue Arithmetic – Choosing the moduli
Application requires a range of at least 105, ie M  105
 …
 RNS(17,13,11,7,3,2), M = 102,102
 Six residues, requiring
5 + 4 + 4 + 3 + 2 + 1 = 19 bits
 The largest modulus (17 requiring 5 bits) determines the speed,
so combine some of the smaller moduli
(Remember the requirement is that they be relatively prime!)
 Try to produce the largest modulus using only 5 bits –
Pair 2 and 13, 3 and 7
 RNS(26,21,17, 11), M = 102,102
 Four residues, requiring
5 + 5 + 5 + 4 = 19 bits
(no improvement in total bit count, but 2 fewer ALUs!)
 Better …?
Residue Arithmetic – Choosing the moduli
Application requires a range of at least 105, ie M  105
 …
 RNS(26,21,17, 11), M = 102,102
 Four residues, requiring
5 + 5 + 5 + 4 = 19 bits
(no improvement in total bit count, but 2 fewer ALUs!)
 Include powers of smaller primes before primes,
starting with
 RNS(3,2), M = 6
 Note that 22 is smaller than the next prime, 5, so move to
 RNS(22,3), M = 12
(trying to minimize the size of the largest modulus)
 After including 5 and 7, note that 23 and 32 are smaller than 11:
 RNS(32,23,7,5), M = 2,520
 Add 11  RNS(11,32,23,7,5), M = 27,720
 Add 13  RNS(13,11,32,23,7,5), M = 360,360
Residue Arithmetic – Choosing the moduli
Application requires a range of at least 105, ie M  105
 …
 Add 13  RNS(13,11,32,23,7,5), M = 360,360
 M is now 3 larger than needed,
so replace 9 with 3, then combine 5 and 3
 RNS(15,13,11,23,7), M = 360,360
 5 moduli,
 4 + 4 + 4 + 3 + 3 = 18 bits,
 largest modulus has 4 bits
 You can actually do somewhat better than this!
 Reference:
B. Parhami, Computer Arithmetic: Algorithms and Hardware
Designs, Oxford University Press, 2000
Residue Numbers - Conversion
 Inputs and outputs will invariably be in standard binary or
decimal representations,
 conversion to and from them is required
 Conversion from binary | decimal to RNS
 Problem: Given a number, y, find its residues wrt moduli, mi
 Divisions would be too time-consuming!
 Use this equality:
(yk-1yk-2…y1y0)2mi
=   2k-1yk-1  mi + … +  2y1  mi +  y0  mi  mi
 So we only need to precompute the residues  2 j  mi for
each of the moduli, mi, used by the RNS
Residue Numbers - Conversion
For RNS(8,7,5,3) :
• <y>8 is trivially calculated (3 LSB bits)
• For 7, 5 and 3, we need the powers of 2 modulus 7, 5 and 3
j
2j
2 j7
2 j5
2 j3
0
1
1
1
1
1
2
2
2
2
2
4
4
4
1
3
8
1
3
2
4
16
2
1
1
5
32
4
2
2
6
64
1
4
1
7
128
2
3
2
8
256
4
1
1
9
512
1
2
2
Residue Numbers - Conversion
Find 16410 = 1010 01002 = 27 + 25 + 22 in RNS(8,7,5,3) :
• <164>8 is 1002 = 410
j
2j
2 j7
2 j5
2 j3
0
1
1
1
1
1
2
2
2
2
2
4
4
4
1
3
8
1
3
2
4
16
2
1
1
5
32
4
2
2
6
64
1
4
1
7
128
2
3
2
8
256
4
1
1
9
512
1
2
2
<164>7
= <2 + 4 + 4>7
= <10>7
= 3
Note that the
additions are done
in a modular adder!
Worst case:
k additions for each
residue for a k -bit
number
Residue Numbers - Conversion
Residue Arithmetic - Disadvantages




Range is limited
Division is hard!
Comparison <, >, sign (<0?) are hard
Still suitable for some DSP applications
 Only use +, x
 Range is limited
 Result range is known
 Examples: digital filters, Fourier transforms
Multipliers
 ‘Long’ multiplication
x
a3
b3
x
x
a2
b2
x
x
x x x x
x x x x
x x x x x
a1 a0
b1 b0
x x
x
b0
b1
b2
b3
x
x
In binary, the partial products
are trivial –
if multiplier bit = 1, copy the
multiplicand
else 0
Use an ‘and’ gate!
a3
a2
a1
a0
b0
first row of partial products
Multipliers
 We can add the partial products with FA blocks
a3
a2
a1
a0
0
FA
FA
FA
b0
FA
b1
FA
FA
FA
FA
b2
FA
FA
FA
FA
p1
p0
product bits
Parallel Array Adder
 We can build this adder in VHDL with two GENERATE loops
SIGNAL pa, pb, cout : ARRAY( 0 TO n-1 ) OF
ARRAY( 0 TO n-1 ) OF std_logic;
FOR j IN 0 TO n-1 GENERATE -- For each row
FOR j IN 0 TO n-1 GENERATE –- Generate a row
pjk : full_adder PORT MAP( … );
END GENERATE;
This part is straight-forward!
END GENERATE;
… but you need to fill in the PORT MAP
using internal signals!
Multipliers
 We can add the partial products with FA blocks
a3
a2
a1
a0
0
FA
FA
FA
FA
b0
Optimization 1:
Replace this row
of FAs
b1
FA
FA
FA
FA
FA
FA
Time?
What’s the worst
b2 case propagation
delay?
FA
FA
p1
p0
product bits