FPGA-struct - The University of Auckland

Download Report

Transcript FPGA-struct - The University of Auckland

Reconfigurable Computing FPGA structures
John Morris
Chung-Ang University
The University of Auckland
‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia
FPGA Architectures
 Programmable logic takes many forms
 Originally devices contained 10’s of gates and flip-flops
 These early devices were generally called PAL’s (Programmable
Array Logic)
 A typical structure was
FF
Programmable
And-Or array
FF
Outputs (20)
Inputs (20)
FF
FF
 With 10-20 inputs and outputs and ~20 flip-flops,
they could
• implement small state machines and
• replace large amounts of discrete ‘glue’ logic
Programmable Logic
 Memory should also be included in the class of
programmable logic!
 It finds application in LUTs, state machines, ...
 From early UV EPROMs with ~kbytes,
we now have many styles of memory which retains
values when power is removed
and capacities in Mbytes
Memory is an important consideration when designing
reconfigurable systems.
FPGA technology does not provide large amounts of memory
and this can be a constraint especially if you are trying to produce a compact,
single chip solution to your problem!
Modern Programmable Logic
 As technology has evolved, so have programmable
devices
 Today’s FPGAs contain
 Millions of ‘gates’
 Memory
 Support for several I/O protocols - TTL, LVDS, GTL, …
 Arithmetic units - adders, multipliers
 Processor cores
FPGA Architecture
 The ‘core’ architecture of most modern FPGAs
consists of
Logic blocks
Interconnection resources
I/O blocks
Typical FPGA Architecture
 Logic blocks
embedded in a
‘sea’ of connection
resources
 CLB = logic block
IOB = I/O buffer
PSM = programmable
switch matrix
This particular arrangement
is similar to that in Xilinx 4000
(and onwards) chips devices from other manufacturers
are similar in overall
structure
Typical FPGA Architecture
 Logic blocks embedded in a
‘sea’ of connection
resources
 CLB = logic block
IOB = I/O buffer
PSM = programmable
switch matrix
 Interconnections critical
 Transmission gates on
paths
 Flexibility
 Connect any LB to any
other
but
Much slower than
connections within a logic
block
Aside:
This is a ‘universal’ problem not restricted to FPGAs!
Applies to
• custom VLSI,
• ASICs,
• systems,
• parallel processors
Small transistors
 high speed
 high density
 long, wide datapaths
Logic Blocks
 Combination of
 And-or array
or
Look-Up-Table (LUT)
 Flip-flops
 Multiplexors
 General aim
 Arbitrary boolean
function of several
variables
 Storage
 Designers try to estimate
what combination of resources
will produce the most efficient
application  circuit mappings
Xilinx 4000 (and on) CLB
3 LUT blocks•
2 Flip-Flops (Asynch Reset)•
Multiplexors•
Clock / Reset Lines•
Adders
 Adders appear in most designs
 Arithmetic Adders (including subtracters)
 Other arithmetic operators
 eg multipliers, dividers
 Counters (including program counters in processors)
 Incrementors, decrementors, etc
 They also often appear on the critical path
 Adder performance can be crucial for system performance
 Because of their importance, researchers are still searching for better
ways to add!
 Adder structures proposed already






Ripple carry
Carry select
Carry skip
Carry look-ahead
Manchester
… and several dozen more variants
Ripple Carry Adder
 The simplest and most well known adder
carry
out
an-1 bn-1
FA
cout cin
an-2 bn-2
FA
cout cin
a 1 b1
FA
cout cin
a 0 b0
FA
cout cin
sn-1
sn-2
s1
s0
 How long does it take an n-bit adder to produce a result?
 n x propagation delay( FA: (a or b)  carry )
 We can do better than this - using one of many known better structures
but
 What are the advantages of a ripple carry adder?
 Small
 Regular
Very important in packing circuitry into
 Fits easily into a 2-D layout!
fixed 2-D layout of an FPGA!
Ripple Carry Adders
 Ripple carry adder performance is limited by propagation of carries
an-1bn-1
FA
cout cin
an-2bn-2
FA
cout cin
a 3 b3
FA
cout cin
a 2 b2
FA
cout cin
a 1 b1
FA
cout cin
a 0 b0
FA
cout cin
sn-1
sn-2
s3
s2
s1
s0
carry
out LB
LB
On an FPGA,
this link is often
the major source
of time delay
LB
… because one or two
FA blocks will often fit
in a logic block!
Connections within a logic block are fast!
Connections between logic blocks are slower
Using general interconnect
 Interconnections
critical
 Transmission
gates on paths
 Flexibility
 Connect any
LB to any other
but
Much slower
than
connections
within a logic
block
Much slower
than long lines
on an ASIC
Every one of these
connection points
is a transmission gate
This switch matrix is
a mass of transmission
gates too!
‘Fast Carry’ Logic
 Critical delay
 Transmission of carry out from one logic block to the next
 Solution (most modern FPGAs)
 ‘Fast carry’ logic
 Special paths between logic blocks used specifically for
carry out
 Very fast ripple carry adders!
 More sophisticated adders?
 Carry select
 Uses ripple carry blocks - so can use fast carry logic
 Should be faster for wide datapaths?
 Carry lookahead
 Uses large amounts of logic and multiple logic blocks
 Hard to make it faster for small adders!
Logic Blocks and fast carry
Cout Cindown
 Xilinx solution
(in the same column)
 Carry logic
precedes
LUTs
 Fast carry
connections
G
carry
G1-4
 Up
 Down
 Adders must
lie in a column
of the FPGA
 Some
(not
serious?)
constraint
on layout
Direct connections
to CLB above
F
carry
F1-4
Note that carry chains can
run either up or down
(But not sideways!)
Cinup
Cout
Direct connections
to CLB below
(in the same column)
Logic Blocks and fast carry - Altera version
 Altera
Simpler logic element
 One LUT and one flip-flop / logic element
 Four inputs + one output / logic element
 Simpler LE  more LEs / device
Carry logic follows LUT
Carry chains in one direction only
Additional fast link (cascade chain)
 Efficient implementation of high fan-in functions
 eg 4+-input gate spans 2+ LEs with a slow link
 Cascade chain avoids the slow link
• Efficient (fast) high fan-in functions
• eg Potentially much faster carry look-ahead adder?
• Discussed later!
 Is Xilinx better?
Carry Select Adder
a4-7
0
cin
a0-3
b0-3
n-bit
Ripple Carry Adder
b4-7
n-bit
Ripple Carry Adder
cout7
sum04-7
cout3
1
b4-7
n-bit
Ripple Carry Adder
sum0-3
cout7
sum14-7
‘Standard’
0
n-bit ripple carry
1
0
1
adders
n = any suitable value
Here we build
an 8-bit adder
from 4-bit blocks
sum4-7
carry
Carry Select Adder
a4-7
0
cin
a0-3
b0-3
n-bit
Ripple Carry Adder
These two blocks
‘speculate’
b4-7on the value of cout3
n-bit
Ripple Carry Adder
sum04-7
cout3
1
b4-7
n-bit
Ripple Carry Adder
sum0-3
This block adds
the 4 low order bits
After 4*tpd it will
produce a carry out
cout7
cout7
sum14-7
0
1
sum4-7
0 it will1
One assumes
be 0
the other assumes 1
carry
Carry Select Adder
0
cin
a0-3
b0-3
n-bit
Ripple Carry Adder
cout3
1
After
a4-74*tpd we will
b4-7have:
• sum0-3 (final sum bits)
• cout3
cout
(from
n-bit low order block) 7
• sum0
4-7Adder
Ripple
Carry
• cout07
(from
block assuming 0 cin)
sum0
• sum14-7 4-7
• cout17
(from block
b assuming 1 cin)
4-7
n-bit
Ripple Carry Adder
sum0-3
This block adds
the 4 low order bits
After 4*tpd it will
produce a carry out
cout7
sum14-7
0
1
sum4-7
0
1
carry
Carry Select Adder
a4-7
0
cin
a0-3
cout
b0-3
n-bit
Ripple Carry Adder
b4-7
7
n-bit
Ripple Carry
Adder
Cout
3 selects correct
sum4-7 and carry out
sum04-7
cout3
sum0-3
1
b4-7
n-bit
Ripple Carry Adder
cout7
sum14-7
0
All 8 bits + carry are available
after
4*tpd(FA) + tpd(multiplexor)
1
sum4-7
0
1
carry
Carry Select Adder
Each ripple carry block
should use fast carry logic
a4-7
0
cin
a0-3
b0-3
n-bit
Ripple Carry Adder
b4-7
n-bit
Ripple Carry Adder
sum04-7
cout3
sum0-3
cout7
1
b4-7
n-bit
Ripple Carry Adder
cout7
sum14-7
0
Drawback:
These links are relatively slow!
 CSA only faster for n > ?
1
sum4-7
0
1
carry
Carry Select Adder
 This scheme can be generalized to any number of bits
 Select a suitable block size (eg 4, 8)
 Replicate all blocks except the first
 One with cin = 0
 One with cin = 1
 Use final cout from preceding block to select correct set of
outputs for current block
Better adders
 Performance of a CSA is still O(n)
 More sophisticated adders?
 Carry skip
 Variant on carry select idea
 Carry lookahead
 Uses large amounts of logic and multiple logic blocks
 Hard to make it faster for small adders!
Carry Lookahead Adder
 Standard adder expressions
cout = a•b + b•cin + a•cin
sum = a  b  cin
 Define two new symbols
‘Generate’ G = a•b
 If a and b are both 1, then a carry must be generated
‘Propagate’ P = a  b
 If a  b (a  b = 1), then propagate the carry in
 Now
sum = P  cin
cout = G + cinP
 For the ith bit
sumi = Pi  ci
Carry Lookahead Adder
 Expanding the carry expressions
c1 = G0 + c0P0
c2 = G1 + c1P1 = G1 + P1 (G0 + c0P0) = G1 + P1G0 + c0P1P0
c3 = G2 + c2P2 = G2 + P2 (G1 + c1P1) = G2 + P2 (G1 + P1( G0
+ c0P0))
= G2 + P2 G1 + P2P1G0 + c0P2P1P0
c4 = G3 + c3P3 = G3 + P3 (G2 + c2P2) = ...
= G3 + P3 G2 + P3P2G1 + P3P2P1G0 + c0P3P2P1P0
 Note that c4 can be calculated in a logic block that
permits
4 P inputs
4 G inputs
c0 input
Large Carry Lookahead Adders
 Define ‘group’ generate and propagate symbols
For 4-bit ‘groups’
PG = P3P2P1P0
GG = G3 + P3 G2 + P3P2G1 + P3P2P1G0
c5 = G4 + c4P4 = G4 + P4 (G3 + c3P3) = ...
= G4 + P4 G3 + P4P3G2 + P4P3P2G1 + P4P3P2P1G0+
c0P4P3P2P1P0
= G4 + P4 (GG + PGc0)
Note that the expression for c5 has exactly the
same form as the expression for c1 - substituting
(GG + PGc0) for c0
 Thus large CLAs are built by cascading blocks that
compute group G and P signals
Cascading blocks for large CLAs
a16-19 b16-19
a12-15 b12-15
a8-11 b8-11
a4-7
GG PG s16-19
GG PG s12-15 c12
GG PG s8-11 c8
GG PG s4-7 c4
G0 P0
c0
G3 P3
c3
G2 P2
c2
G1 P1
GG PG
b4-7
a0-3
c1
b0-3
GG PG s0-3
G0 P0
GG PG
G0 P0
G1 P1 c1
4-bit adder block
modified to calculate G and P
GG PG
c0
Large Carry Lookahead Adders
 Define ‘group’ generate and propagate symbols
 Thus large CLAs are built by cascading blocks that
compute group G and P signals
 As with CS Adders,
optimum group size will be determined by the
technology used
It will need to match logic block capability
It will need to balance internal LUT propagation
delay with link delay
+ effectiveness of additional capabilities,
eg Altera cascade chains
Some experimentation will be needed to determine
it!
Fast Adders
 Many other fast adder schemes have been proposed
eg
 Carry-skip
 Manchester
 Carry-save
 Carry Look Ahead
 If implementing an adder
(eg in programmable logic)
do a little research first!
Fast Adders
 Challenge: What style of adder is fastest / most compact for any
FPGA technology?
 Answer is not simple
 For small adders (n < ?),
fast carry logic will certainly make a simple ripple carry
adder fastest
 It will also use the minimum resources - but will need to be laid
out as a column or row
 For larger adders ( ? < n < ? ),
carry select styles are likely to be best  They use ripple carry blocks efficiently
 For very large adders ( n > ? ),
a carry look ahead adder may be faster?
 But it will use considerably more resources!
Exploiting a manufacturer’s fast carry logic
 To use the Altera fast carry logic, write your adder like this:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
LIBRARY lpm ;
USE lpm.lpm_components.all ;
ENTITY adder IS
PORT ( c_in : IN
a, b
: IN
sum
: OUT
c_out : OUT
END adderlpm ;
STD_LOGIC ;
STD_LOGIC_VECTOR(15 DOWNTO 0) ;
STD_LOGIC_VECTOR(15 DOWNTO 0) ;
STD_LOGIC ) ;
ARCHITECTURE lpm_structure OF adder IS
BEGIN
instance: lpm_add_sub
GENERIC MAP (LPM_WIDTH => 16)
PORT MAP ( cin => Cin, dataa => a, datab => b,
result => sum, cout => c_out ) ;
END lpm_structure ;
What about that carry in?
 In an ALU, we usually need to do more than just add!
 Subtractions are common also
 Observe
c = a - b
is equivalent to
c = a + (-b)
 So we can use an adder for subtractions if we can
negate the 2nd operand
 Negation in 2’s complement arithmetic?
Adder / Subtractor
 Negation in 2’s complement arithmetic?
 Rule:
Complement each bit
Add 1
eg
Complement
Add 1
Binary
0001
1110
1111
Complement
Add 1
0110
1001
1010
Decimal
1
-1
6
-6
Adder / Subtractor
 Using an adder
Complement each bit using an inverter
Use the carry in to add 1!
a
b
0 1
add/
subtract
cin
FA
FA
FA
carry
c
Example - Generate
ENTITY adder IS
GENERIC ( n : INTEGER := 16 ) ;
PORT ( c_in
: IN
std_ulogic ;
a, b
: IN
std_ulogic_vector(n-1 DOWNTO 0) ;
sum : OUT std_ulogic_vector(n-1 DOWNTO 0) ;
c_out : OUT std_ulogic ) ;
END adder;
ARCHITECTURE rc_structure OF adder IS
SIGNAL c : STD_LOGIC_VECTOR(1 TO n-1) ;
COMPONENT fulladd
PORT ( c_in, x, y : IN std_ulogic ;
s, c_out : OUT std_ulogic ) ;
END COMPONENT ;
BEGIN
FA_0: fulladd PORT MAP ( c_in=>c_in, x=>a(0), y=>b(0),
s=>sum(0), c_out=>c(1) ) ;
G_1: FOR i IN 1 TO n-2 GENERATE
FA_i: fulladd PORT MAP ( c(i), a(i), b(i), sum(i), c(i+1) ) ;
END GENERATE ;
FA_n: fulladd PORT MAP (C(n-1),A(n-1),B(n-1),Sum(n-1),Cout) ;
END rc_structure ;
Serial Circuits
 Space efficient
 Sloooow
One bit of result produced per cycle
Sometimes this isn’t a problem
 Highly parallel problems
• Search
• Many operations on the same data stream
eg search a text database
=key0?for many keywords in parallel
Serial processing needs:
8x Mbits/s - Easy!
Text stream
Data rate:
x MB/s
=key
i?
=key
=keyii?
?
=keyn?
Effective performance may require
comparison with
1000’s of keys
space for key circuits critical!
small, compact bit-serial
comparator ideal!
Serial Circuits
 Bit serial adder
ENTITY serial_add IS
PORT( a, b, clk : IN std_logic;
sum, cout : OUT std_logic );
END ENTITY serial_add;
b
FA
cin
cout
2-bit
register
sum
a
clock
ARCHITECTURE df OF serial_add IS
Note:
SIGNAL cint : std_logic;
BEGIN
The synthesizer will insert
PROCESS( clk )
the latch on the internal signals!
BEGIN
IF clk’EVENT AND clk = ‘1’ THEN
sum <= a XOR b XOR cint;
cint <= (a AND b) OR (b AND cint) OR (a AND cint );
END IF;
END PROCESS;
It will recognize the
cout <= cint;
IF clk’EVENT … pattern!
END ARCHITECTURE df;