FPGA-struct - The University of Auckland
Reconfigurable Computing FPGA structures
John Morris
Chung-Ang University
The University of Auckland
‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia
FPGA Architectures
Programmable logic takes many forms
Originally, devices contained tens of gates and flip-flops
These early devices were generally called PALs (Programmable Array Logic)
A typical structure was:
[Figure: a programmable AND-OR array with ~20 inputs and ~20 outputs, with flip-flops on the outputs]
With 10-20 inputs and outputs and ~20 flip-flops, they could
• implement small state machines and
• replace large amounts of discrete ‘glue’ logic
Programmable Logic
Memory should also be included in the class of
programmable logic!
It finds application in LUTs, state machines, ...
From early UV EPROMs with ~kbyte capacities,
we now have many styles of memory that retain their
values when power is removed,
with capacities in the Mbytes
Memory is an important consideration when designing
reconfigurable systems.
FPGA technology does not provide large amounts of memory,
and this can be a constraint, especially if you are trying to produce a compact,
single-chip solution to your problem!
Modern Programmable Logic
As technology has evolved, so have programmable
devices
Today’s FPGAs contain
Millions of ‘gates’
Memory
Support for several I/O protocols - TTL, LVDS, GTL, …
Arithmetic units - adders, multipliers
Processor cores
FPGA Architecture
The ‘core’ architecture of most modern FPGAs
consists of
Logic blocks
Interconnection resources
I/O blocks
Typical FPGA Architecture
Logic blocks embedded in a 'sea' of connection resources
CLB = logic block
IOB = I/O buffer
PSM = programmable switch matrix
This particular arrangement is similar to that in Xilinx 4000 (and onwards) chips;
devices from other manufacturers are similar in overall structure
Typical FPGA Architecture (continued)
Interconnections are critical
Transmission gates on paths give flexibility
Connect any LB to any other
but
much slower than connections within a logic block
Aside:
This is a ‘universal’ problem not restricted to FPGAs!
Applies to
• custom VLSI,
• ASICs,
• systems,
• parallel processors
Small transistors → high speed, high density → long, wide datapaths
Logic Blocks
Combination of
And-or array
or
Look-Up-Table (LUT)
Flip-flops
Multiplexors
General aim
Arbitrary boolean
function of several
variables
Storage
Designers try to estimate
what combination of resources
will produce the most efficient
application circuit mappings
Xilinx 4000 (and on) CLB
• 3 LUT blocks
• 2 Flip-Flops (asynchronous reset)
• Multiplexors
• Clock / Reset lines
Adders
Adders appear in most designs
Arithmetic adders (including subtractors)
Other arithmetic operators
eg multipliers, dividers
Counters (including program counters in processors)
Incrementors, decrementors, etc
They also often appear on the critical path
Adder performance can be crucial for system performance
Because of their importance, researchers are still searching for better
ways to add!
Adder structures proposed already
Ripple carry
Carry select
Carry skip
Carry look-ahead
Manchester
… and several dozen more variants
Ripple Carry Adder
The simplest and most well known adder
[Figure: n-bit ripple carry adder - a chain of full adders (FA); stage i adds a_i and b_i with the carry in from stage i-1, producing s_i and a carry out that feeds stage i+1; the last stage produces the adder's carry out]
How long does it take an n-bit adder to produce a result?
n × tpd( FA: (a or b) → carry )
We can do better than this - using one of many known better structures
but
What are the advantages of a ripple carry adder?
Small
Regular
Fits easily into a 2-D layout!
Very important in packing circuitry into the fixed 2-D layout of an FPGA!
Ripple Carry Adders
Ripple carry adder performance is limited by propagation of carries
[Figure: the ripple carry adder mapped onto FPGA logic blocks (LB), one or two FA blocks per logic block]
On an FPGA, the carry link from one logic block to the next is often the major source of time delay
… because one or two FA blocks will often fit in a logic block!
Connections within a logic block are fast!
Connections between logic blocks are slower
Using general interconnect
Interconnections are critical
Transmission gates on every path give flexibility
(connect any LB to any other)
but they are
much slower than connections within a logic block
much slower than long lines on an ASIC
Every one of these connection points is a transmission gate
The switch matrix is a mass of transmission gates too!
‘Fast Carry’ Logic
Critical delay
Transmission of carry out from one logic block to the next
Solution (most modern FPGAs)
‘Fast carry’ logic
Special paths between logic blocks used specifically for
carry out
Very fast ripple carry adders!
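As an illustration (not from the original slides; the entity and signal names are my own), a behavioural description is usually all that is needed to get the fast carry logic - most FPGA synthesis tools map the '+' operator onto the dedicated carry chain:

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

ENTITY fast_adder IS
  GENERIC ( n : INTEGER := 16 );
  PORT ( c_in  : IN  std_logic;
         a, b  : IN  unsigned(n-1 DOWNTO 0);
         sum   : OUT unsigned(n-1 DOWNTO 0);
         c_out : OUT std_logic );
END fast_adder;

ARCHITECTURE behaviour OF fast_adder IS
  SIGNAL carry  : unsigned(0 DOWNTO 0);   -- c_in as a 1-bit unsigned
  SIGNAL result : unsigned(n DOWNTO 0);   -- extra bit captures the carry out
BEGIN
  carry(0) <= c_in;
  -- the synthesizer infers a ripple carry adder on the fast carry chain
  result   <= ('0' & a) + ('0' & b) + carry;
  sum      <= result(n-1 DOWNTO 0);
  c_out    <= result(n);
END behaviour;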
More sophisticated adders?
Carry select
Uses ripple carry blocks - so can use fast carry logic
Should be faster for wide datapaths?
Carry lookahead
Uses large amounts of logic and multiple logic blocks
Hard to make it faster for small adders!
Logic Blocks and fast carry
Xilinx solution
[Figure: Xilinx CLB carry logic - the F carry and G carry logic precede the F1-4 and G1-4 LUT inputs; dedicated fast carry connections (Cin up / Cin down, Cout) run directly to the CLB above and to the CLB below in the same column]
Carry logic precedes the LUTs
Fast carry connections run directly to the CLBs above and below (in the same column)
Adders must lie in a column of the FPGA
Some (not serious?) constraint on layout
Note that carry chains can run either up or down
(But not sideways!)
Logic Blocks and fast carry - Altera version
Altera
Simpler logic element
One LUT and one flip-flop / logic element
Four inputs + one output / logic element
Simpler LE → more LEs / device
Carry logic follows LUT
Carry chains in one direction only
Additional fast link (cascade chain)
Efficient implementation of high fan-in functions
eg 4+-input gate spans 2+ LEs with a slow link
Cascade chain avoids the slow link
• Efficient (fast) high fan-in functions
• eg Potentially much faster carry look-ahead adder?
• Discussed later!
Is Xilinx better?
Carry Select Adder
[Figure: an 8-bit carry select adder built from 'standard' n-bit (here 4-bit) ripple carry adder blocks:
one block adds a0-3 + b0-3 with cin; two blocks add a4-7 + b4-7, one with cin = 0 and one with cin = 1;
multiplexors select sum4-7 and the carry out]
'Standard' n-bit ripple carry adders
n = any suitable value
Here we build an 8-bit adder from 4-bit blocks
Carry Select Adder
[Figure: the same 8-bit carry select adder]
The low-order block adds the 4 low order bits;
after 4*tpd it will produce a carry out (cout3)
The two high-order blocks 'speculate' on the value of cout3:
one assumes it will be 0, the other assumes 1
Carry Select Adder
[Figure: the same 8-bit carry select adder]
After 4*tpd we will have:
• sum0-3 (final sum bits)
• cout3 (from the low order block)
• sum0_4-7 and cout0_7 (from the block assuming cin = 0)
• sum1_4-7 and cout1_7 (from the block assuming cin = 1)
Carry Select Adder
[Figure: the same 8-bit carry select adder]
cout3 selects the correct sum4-7 and carry out
All 8 bits + carry are available after
4*tpd(FA) + tpd(multiplexor)
Carry Select Adder
[Figure: the same 8-bit carry select adder]
Each ripple carry block should use fast carry logic
Drawback:
The links into the selecting multiplexors are relatively slow!
CSA only faster for n > ?
Carry Select Adder
This scheme can be generalized to any number of bits
Select a suitable block size (eg 4, 8)
Replicate all blocks except the first
One with cin = 0
One with cin = 1
Use final cout from preceding block to select correct set of
outputs for current block
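A minimal VHDL sketch of one such carry select stage (not from the original slides). It assumes a ripple carry adder component with the same interface as the generated adder shown later in these notes; the name rc_adder and the instance labels are illustrative only:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY csel_adder8 IS
  PORT ( c_in  : IN  std_ulogic;
         a, b  : IN  std_ulogic_vector(7 DOWNTO 0);
         sum   : OUT std_ulogic_vector(7 DOWNTO 0);
         c_out : OUT std_ulogic );
END csel_adder8;

ARCHITECTURE structure OF csel_adder8 IS
  COMPONENT rc_adder                              -- assumed ripple carry adder
    GENERIC ( n : INTEGER := 4 );
    PORT ( c_in  : IN  std_ulogic;
           a, b  : IN  std_ulogic_vector(n-1 DOWNTO 0);
           sum   : OUT std_ulogic_vector(n-1 DOWNTO 0);
           c_out : OUT std_ulogic );
  END COMPONENT;
  SIGNAL sum0_hi, sum1_hi : std_ulogic_vector(3 DOWNTO 0);
  SIGNAL c3, c7_0, c7_1   : std_ulogic;
BEGIN
  -- low-order block: adds bits 0-3 and produces the real cout3
  LO:  rc_adder GENERIC MAP ( 4 )
        PORT MAP ( c_in, a(3 DOWNTO 0), b(3 DOWNTO 0), sum(3 DOWNTO 0), c3 );
  -- two high-order blocks speculate on cout3 = '0' and cout3 = '1'
  HI0: rc_adder GENERIC MAP ( 4 )
        PORT MAP ( '0', a(7 DOWNTO 4), b(7 DOWNTO 4), sum0_hi, c7_0 );
  HI1: rc_adder GENERIC MAP ( 4 )
        PORT MAP ( '1', a(7 DOWNTO 4), b(7 DOWNTO 4), sum1_hi, c7_1 );
  -- cout3 selects the correct high-order sum and carry out
  sum(7 DOWNTO 4) <= sum0_hi WHEN c3 = '0' ELSE sum1_hi;
  c_out           <= c7_0    WHEN c3 = '0' ELSE c7_1;
END structure;

Which block size (and how many levels of selection) actually wins depends on the fast carry and interconnect delays of the target device.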
Better adders
Performance of a CSA is still O(n)
More sophisticated adders?
Carry skip
Variant on carry select idea
Carry lookahead
Uses large amounts of logic and multiple logic blocks
Hard to make it faster for small adders!
Carry Lookahead Adder
Standard adder expressions
cout = a•b + b•cin + a•cin
sum = a ⊕ b ⊕ cin
Define two new symbols
'Generate' G = a•b
If a and b are both 1, then a carry must be generated
'Propagate' P = a ⊕ b
If a ≠ b (a ⊕ b = 1), then propagate the carry in
Now
sum = P ⊕ cin
cout = G + cin•P
For the ith bit
sumi = Pi ⊕ ci
Carry Lookahead Adder
Expanding the carry expressions
c1 = G0 + c0P0
c2 = G1 + c1P1 = G1 + P1 (G0 + c0P0) = G1 + P1G0 + c0P1P0
c3 = G2 + c2P2 = G2 + P2 (G1 + c1P1) = G2 + P2 (G1 + P1(G0 + c0P0))
= G2 + P2G1 + P2P1G0 + c0P2P1P0
c4 = G3 + c3P3 = G3 + P3 (G2 + c2P2) = ...
= G3 + P3 G2 + P3P2G1 + P3P2P1G0 + c0P3P2P1P0
Note that c4 can be calculated in a logic block that
permits
4 P inputs
4 G inputs
c0 input
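A minimal VHDL sketch of such a lookahead block (not from the slides; the entity and port names are illustrative), computing c1..c4 directly from the four P and G inputs and c0, exactly as in the expanded expressions above:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

-- 4-bit lookahead carry block: Pi = ai xor bi, Gi = ai and bi
ENTITY cla_block4 IS
  PORT ( p, g : IN  std_ulogic_vector(3 DOWNTO 0);
         c0   : IN  std_ulogic;
         c    : OUT std_ulogic_vector(4 DOWNTO 1) );
END cla_block4;

ARCHITECTURE dataflow OF cla_block4 IS
BEGIN
  c(1) <= g(0) OR (p(0) AND c0);
  c(2) <= g(1) OR (p(1) AND g(0)) OR (p(1) AND p(0) AND c0);
  c(3) <= g(2) OR (p(2) AND g(1)) OR (p(2) AND p(1) AND g(0))
               OR (p(2) AND p(1) AND p(0) AND c0);
  c(4) <= g(3) OR (p(3) AND g(2)) OR (p(3) AND p(2) AND g(1))
               OR (p(3) AND p(2) AND p(1) AND g(0))
               OR (p(3) AND p(2) AND p(1) AND p(0) AND c0);
END dataflow;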
Large Carry Lookahead Adders
Define ‘group’ generate and propagate symbols
For 4-bit ‘groups’
PG = P3P2P1P0
GG = G3 + P3 G2 + P3P2G1 + P3P2P1G0
c5 = G4 + c4P4 = G4 + P4 (G3 + c3P3) = ...
= G4 + P4G3 + P4P3G2 + P4P3P2G1 + P4P3P2P1G0 + c0P4P3P2P1P0
= G4 + P4 (GG + PGc0)
Note that the expression for c5 has exactly the
same form as the expression for c1 - substituting
(GG + PGc0) for c0
Thus large CLAs are built by cascading blocks that
compute group G and P signals
Cascading blocks for large CLAs
[Figure: a 20-bit CLA built from five 4-bit adder blocks (bits 0-3, 4-7, 8-11, 12-15, 16-19), each modified to calculate its group GG and PG signals; a lookahead block treats the group GG/PG outputs as its G/P inputs and, with c0, produces the carries (c4, c8, c12, …) into each group]
Large Carry Lookahead Adders
Define ‘group’ generate and propagate symbols
Thus large CLAs are built by cascading blocks that
compute group G and P signals
As with CS Adders,
optimum group size will be determined by the
technology used
It will need to match logic block capability
It will need to balance internal LUT propagation
delay with link delay
+ effectiveness of additional capabilities,
eg Altera cascade chains
Some experimentation will be needed to determine
it!
Fast Adders
Many other fast adder schemes have been proposed
eg
Carry-skip
Manchester
Carry-save
Carry Look Ahead
If implementing an adder
(eg in programmable logic)
do a little research first!
Fast Adders
Challenge: What style of adder is fastest / most compact for any
FPGA technology?
Answer is not simple
For small adders (n < ?),
fast carry logic will certainly make a simple ripple carry
adder fastest
It will also use the minimum resources - but will need to be laid
out as a column or row
For larger adders ( ? < n < ? ),
carry select styles are likely to be best
They use ripple carry blocks efficiently
For very large adders ( n > ? ),
a carry look ahead adder may be faster?
But it will use considerably more resources!
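A rough worked example (with purely illustrative delay values, not measured figures): suppose tpd(FA carry) ≈ 1 ns on the fast carry chain and each selecting multiplexor, reached over general interconnect, costs ≈ 2 ns. A 32-bit ripple carry adder then needs ≈ 32 × 1 ns = 32 ns, while a carry select adder built from four 8-bit blocks needs ≈ 8 × 1 ns + 3 × 2 ns = 14 ns. An 8-bit ripple carry adder (≈ 8 ns), however, already beats any scheme that adds multiplexor delays - which is why the crossover points ('?' above) have to be found by experiment on the target device.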
Exploiting a manufacturer’s fast carry logic
To use the Altera fast carry logic, write your adder like this:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
LIBRARY lpm ;
USE lpm.lpm_components.all ;
ENTITY adder IS
  PORT ( c_in  : IN  STD_LOGIC ;
         a, b  : IN  STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         sum   : OUT STD_LOGIC_VECTOR(15 DOWNTO 0) ;
         c_out : OUT STD_LOGIC ) ;
END adder ;

ARCHITECTURE lpm_structure OF adder IS
BEGIN
  -- lpm_add_sub is declared in lpm.lpm_components and maps onto the fast carry chain
  instance: lpm_add_sub
    GENERIC MAP ( LPM_WIDTH => 16 )
    PORT MAP ( cin => c_in, dataa => a, datab => b,
               result => sum, cout => c_out ) ;
END lpm_structure ;
What about that carry in?
In an ALU, we usually need to do more than just add!
Subtractions are common also
Observe
c = a - b
is equivalent to
c = a + (-b)
So we can use an adder for subtractions if we can
negate the 2nd operand
Negation in 2’s complement arithmetic?
Adder / Subtractor
Negation in 2’s complement arithmetic?
Rule:
Complement each bit
Add 1
eg
Decimal  Binary
  1      0001
         Complement → 1110
         Add 1      → 1111   (= -1)
  6      0110
         Complement → 1001
         Add 1      → 1010   (= -6)
Adder / Subtractor
Using an adder
Complement each bit using an inverter
Use the carry in to add 1!
[Figure: adder / subtractor - the add/subtract line controls a 0/1 multiplexor on each b input, selecting b or its complement, and also drives cin to add the 1; a chain of FAs then produces c = a ± b and the carry out]
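A minimal behavioural sketch of this adder / subtractor (not from the slides; the entity and signal names are illustrative): the b operand is complemented under control of the add/subtract line, and the same line is fed in as the carry in to add the 1.

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

ENTITY addsub IS
  GENERIC ( n : INTEGER := 16 );
  PORT ( add_sub : IN  std_logic;                     -- '0' = add, '1' = subtract
         a, b    : IN  std_logic_vector(n-1 DOWNTO 0);
         c       : OUT std_logic_vector(n-1 DOWNTO 0);
         carry   : OUT std_logic );
END addsub;

ARCHITECTURE behaviour OF addsub IS
  SIGNAL b_mux  : std_logic_vector(n-1 DOWNTO 0);
  SIGNAL c_in   : unsigned(0 DOWNTO 0);
  SIGNAL result : unsigned(n DOWNTO 0);               -- extra bit captures the carry out
BEGIN
  -- complement each bit of b when subtracting ...
  b_mux   <= b WHEN add_sub = '0' ELSE NOT b;
  -- ... and use the carry in to add the 1 (2's complement negation)
  c_in(0) <= add_sub;
  result  <= ('0' & unsigned(a)) + ('0' & unsigned(b_mux)) + c_in;
  c       <= std_logic_vector(result(n-1 DOWNTO 0));
  carry   <= result(n);
END behaviour;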
Example - Generate
ENTITY adder IS
  GENERIC ( n : INTEGER := 16 ) ;
  PORT ( c_in  : IN  std_ulogic ;
         a, b  : IN  std_ulogic_vector(n-1 DOWNTO 0) ;
         sum   : OUT std_ulogic_vector(n-1 DOWNTO 0) ;
         c_out : OUT std_ulogic ) ;
END adder ;

ARCHITECTURE rc_structure OF adder IS
  SIGNAL c : STD_LOGIC_VECTOR(1 TO n-1) ;
  COMPONENT fulladd
    PORT ( c_in, x, y : IN  std_ulogic ;
           s, c_out   : OUT std_ulogic ) ;
  END COMPONENT ;
BEGIN
  FA_0: fulladd PORT MAP ( c_in => c_in, x => a(0), y => b(0),
                           s => sum(0), c_out => c(1) ) ;
  G_1:  FOR i IN 1 TO n-2 GENERATE
          FA_i: fulladd PORT MAP ( c(i), a(i), b(i), sum(i), c(i+1) ) ;
        END GENERATE ;
  FA_n: fulladd PORT MAP ( c(n-1), a(n-1), b(n-1), sum(n-1), c_out ) ;
END rc_structure ;
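The fulladd component instantiated above is not shown on this slide; a minimal sketch of what it might look like, using the sum and carry expressions given earlier:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY fulladd IS
  PORT ( c_in, x, y : IN  std_ulogic ;
         s, c_out   : OUT std_ulogic ) ;
END fulladd ;

ARCHITECTURE dataflow OF fulladd IS
BEGIN
  -- single-bit full adder: sum and carry out
  s     <= x XOR y XOR c_in ;
  c_out <= (x AND y) OR (x AND c_in) OR (y AND c_in) ;
END dataflow ;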
Serial Circuits
Space efficient
Sloooow
One bit of result produced per cycle
Sometimes this isn’t a problem
Highly parallel problems
• Search
• Many operations on the same data stream
eg search a text database for many keywords in parallel
[Figure: a text stream arriving at x MB/s feeds a bank of parallel comparators =key0?, =keyi?, …, =keyn?;
serial processing needs only 8x Mbit/s - easy!]
Effective performance may require comparison with 1000's of keys,
so space for the key circuits is critical,
and a small, compact bit-serial comparator is ideal!
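For illustration only (this circuit is not in the slides), a sketch of what one bit-serial keyword comparator might look like, assuming the word arrives most significant bit first and a 'start' pulse marks each word boundary; all names are hypothetical:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

-- Bit-serial comparator: the key circulates in a shift register and is
-- compared with the incoming text stream one bit per clock.
ENTITY serial_cmp IS
  GENERIC ( w : INTEGER := 32 );                  -- key width in bits
  PORT ( clk, start : IN  std_logic;              -- 'start' marks the first (MSB) bit of a word
         stream_bit : IN  std_logic;              -- the serial text stream
         key        : IN  std_logic_vector(w-1 DOWNTO 0);
         match      : OUT std_logic );            -- '1' after w bits if every bit matched
END serial_cmp;

ARCHITECTURE behaviour OF serial_cmp IS
  SIGNAL key_sr   : std_logic_vector(w-1 DOWNTO 0);  -- key, rotated once per clock
  SIGNAL match_sf : std_logic;                       -- 'match so far'
BEGIN
  PROCESS( clk )
  BEGIN
    IF clk'EVENT AND clk = '1' THEN
      IF start = '1' THEN
        -- first bit of a new word: compare it with the key MSB and restart the rotation
        key_sr   <= key(w-2 DOWNTO 0) & key(w-1);
        match_sf <= NOT (key(w-1) XOR stream_bit);
      ELSE
        -- rotate the key and accumulate the bit-by-bit comparison
        key_sr   <= key_sr(w-2 DOWNTO 0) & key_sr(w-1);
        match_sf <= match_sf AND NOT (key_sr(w-1) XOR stream_bit);
      END IF;
    END IF;
  END PROCESS;
  match <= match_sf;
END behaviour;

The whole comparator is one w-bit shift register, a flip-flop and a couple of gates, which is what makes packing thousands of keys onto one device plausible.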
Serial Circuits
Bit serial adder
ENTITY serial_add IS
PORT( a, b, clk : IN std_logic;
sum, cout : OUT std_logic );
END ENTITY serial_add;
[Figure: bit serial adder - a single FA adds one bit of a and b per clock; sum and the carry out pass through a 2-bit register, with the registered carry fed back to cin]

ARCHITECTURE df OF serial_add IS
  SIGNAL cint : std_logic;
BEGIN
  PROCESS( clk )
  BEGIN
    IF clk'EVENT AND clk = '1' THEN
      sum  <= a XOR b XOR cint;
      cint <= (a AND b) OR (b AND cint) OR (a AND cint);
    END IF;
  END PROCESS;
  cout <= cint;
END ARCHITECTURE df;

Note:
The synthesizer will insert the register on the internal signals -
it will recognize the IF clk'EVENT … pattern!