Field-programmable Gate Array
Architectures and Algorithms
Optimized for Implementing
Datapath Circuits
Andy Gean Ye
University of Toronto
1
Motivation: Datapath Regularity
• Larger FPGAs
– Larger applications on FPGAs
– More datapath logic in larger
applications
– Datapath logic is highly regular
• In custom ASICs, regularity is routinely
utilized to increase logic density
• Can regularity also be utilized to
improve the logic density of FPGAs?
2
Previous Work
• Datapath-FPGA (DP-FPGA) study [cher96]
– Yes, datapath regularity can be utilized to
reduce FPGA area by as much as 50%
– Based on a partially specified FPGA
architecture
• Major simplifying assumptions
– All transistors are minimum width
– Datapaths are completely regular
– No inefficiency from the CAD tools
3
This Work – An In-depth Study
on Datapath Regularity
• Designed a new datapath-oriented FPGA
architecture
– With detailed architectural specifications
– With correctly sized transistors
• Utilized realistic datapath benchmarks
– From the Pico-java processor from SUN
• Created a complete set of CAD tools to
support the new architecture
– Taking CAD inefficiency into account
4
Multi-bit FPGA (MB-FPGA)
• Architected to utilize datapath
regularity to generate area savings
• Architectural features
– Capture regularity using special
logic blocks called super-clusters
– Increase logic density through
configuration memory sharing
routing resources
5
MB-FPGA – Overview
[Figure: MB-FPGA overview. An array of logic blocks (L) and switch blocks (S) connected by routing channels; the channels contain conf. mem. shar. routing tracks and conventional routing tracks.]
6
MB-FPGA – Logic Block
Cluster = Bit-Slice
[Figure: MB-FPGA logic block made of Cluster 1 to Cluster 4. Each cluster contains BLEs and a local routing network (LRN). Inset: a basic logic element (BLE) built from a LUT, a DFF, and a MUX.]
7
Capturing Datapath Regularity
[Figure: BLEs grouped into Bit-Slice 1, Bit-Slice 2, and Bit-Slice 3]
8
MB-FPGA – Routing Architecture
[Figure: MB-FPGA routing architecture. The clusters of a logic block connect through a switch block to conventional routing and to conf. mem. shar. routing.]
9
Utilizing Datapath Regularity to
Save Area
[Figure: logic blocks (L), a block labeled M, and conf. mem. shar. tracks]
10
Area Estimation Using Correct
Transistor Sizing
• Based on the fully specified MB-FPGA
architecture
• Detailed Assumptions
– SRAM transistors are min. width
– Tri-state buffers are 5x min. width
– 75% of FPGA area is routing area
• Simplifying assumptions
– Datapaths are completely regular (all
conf. mem. shar. tracks)
– No inefficiency from the CAD tools
11
Area Estimation Using Correct
Transistor Sizing
• Datapath regularity can only be
used to reduce the MB-FPGA
area by 25%
• Down from the 50% area savings
prediction of the DP-FPGA study
[cher96]
12
Benchmark Regularity
• Fifteen benchmark circuits
– From the Pico-java processor
– Implemented on the MB-FPGA
• Measurements after synthesis
– Logic regularity
– Net regularity
13
Logic Regularity
• Classify LUTs and DFFs into two types
– Irregular type
• LUTs and DFFs that do not belong to any 4-bit wide
datapath component
– Regular type
• LUTs and DFFs that belong to a 4-bit wide datapath
component
• A higher share of regular LUTs and DFFs means
– More regular nets
– Greater area savings
14
A Datapath Component
A datapath component is a group of 4 identical LUTs or DFFs
[Figure: four identical LUTs or DFFs labeled S1 to S4]
15
Logic Regularity
Circuit        #LUT + #DFF   #LUT + #DFF in          %LUT & DFF in
                             Datapath Components     Datapath Components
dcu_dpath      1254          1188                    95%
ex_dpath       2917          2740                    94%
icu_dpath      3590          3460                    96%
imdr_dpath     1388          1292                    93%
pipe_dpath     689           575                     83%
smu_dpath      683           618                     90%
ucode_dat      1528          1448                    95%
ucode_reg      156           132                     85%
code_seq_dp    439           204                     46%
exponent_dp    565           384                     68%
incmod         939           836                     89%
mantissa_dp    1070          964                     90%
multmod_dp     1827          1540                    84%
prils_dp       388           324                     84%
rsadd_dp       305           281                     92%
Total          17738         15986                   90%
16
Net Regularity
• Classify two-terminal connections in
each circuit into three types
– Regular 4-bit wide buses
– Regular 4-bit wide control groups
– Irregular
• Two-terminal connections that do not belong
to either a bus or a control group
17
Definition – Net Regularity
[Figure: a 4-bit wide bus (S1 to S4 connected to S1 to S4) and a 4-bit wide control group (S1 to S4)]
Note: Only 4-bit wide buses can be used to increase the
area efficiency of MB-FPGA through
conf. mem. shar. routing tracks
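The three-way classification can be sketched in a few lines. This is a minimal illustration, not the tool used in the study. It reads the figure as follows, which is an assumption: a bus joins bit i of one datapath component to bit i of another for all four bits, and a control group is a single source that fans out to all four bits of one component. The terminal encoding `(component, bit)` and the function name `classify_connections` are hypothetical.

```python
from collections import defaultdict

WIDTH = 4  # datapath components are 4 bits wide in this study


def classify_connections(connections):
    """Label each two-terminal connection 'bus', 'control', or 'irregular'.

    A connection is (source, sink). A terminal is (component, bit) with bit
    in 0..3 if it belongs to a 4-bit datapath component, else (name, None).
    This encoding is an assumption made for the sketch.
    """
    bus_bits = defaultdict(set)      # (src component, dst component) -> bits joined bit-to-bit
    control_bits = defaultdict(set)  # (source terminal, dst component) -> sink bits reached

    for src, dst in connections:
        if dst[1] is None:
            continue  # sink is not part of a datapath component
        if src[1] == dst[1]:
            bus_bits[(src[0], dst[0])].add(dst[1])
        control_bits[(src, dst[0])].add(dst[1])

    labels = []
    for src, dst in connections:
        in_component = dst[1] is not None
        if in_component and src[1] == dst[1] and len(bus_bits[(src[0], dst[0])]) == WIDTH:
            labels.append("bus")
        elif in_component and len(control_bits[(src, dst[0])]) == WIDTH:
            labels.append("control")
        else:
            labels.append("irregular")
    return labels


connections = (
    [(("A", i), ("B", i)) for i in range(4)]         # bit i of A drives bit i of B
    + [(("sel", None), ("B", i)) for i in range(4)]  # one signal drives all four bits of B
    + [(("x", None), ("y", None)), (("A", 0), ("C", 1))]
)
labels = classify_connections(connections)
bus_share = labels.count("bus") / len(labels)
print(labels, bus_share)
```

The share of connections labeled "bus" is the quantity that matters for area, since only buses can use the conf. mem. shar. routing tracks.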
18
Net Regularity
Circuit        Total Two-Terminal   % of Two-Terminal       % of Two-Terminal
               Connections          Connections in          Connections in
                                    4-Bit Wide Buses        Fan-Out 4 Groups
dcu_dpath      2232                 49%                     43%
ex_dpath       6547                 52%                     39%
icu_dpath      8047                 47%                     36%
imdr_dpath     3100                 50%                     36%
pipe_dpath     1049                 48%                     42%
smu_dpath      1167                 48%                     25%
ucode_dat      3143                 52%                     41%
ucode_reg      194                  72%                     21%
code_seq_dp    799                  58%                     18%
exponent_dp    1362                 32%                     23%
incmod         2013                 42%                     33%
mantissa_dp    2533                 47%                     36%
multmod_dp     3380                 39%                     25%
prils_dp       864                  41%                     32%
rsadd_dp       722                  52%                     27%
Total          37152                48%                     35%
19
Area Estimation Based on
Correct Net Regularity
• Assumptions
– SRAM transistors are min. width
– Tri-state buffers are 5x min. width
– 75% of FPGA area is routing area
– 50% of routing tracks are conf. mem.
shar.
– No inefficiency from the CAD tools
• Result
– Datapath regularity can be utilized to
reduce FPGA area by 12% (again down
from 25%)
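The step from 25% to roughly 12% can be reproduced with a simple linear model. The sketch below is an assumption, not the transistor-level calculation behind the slides: it takes the 25% figure for fully shared routing, back-solves the implied reduction in routing area (one third, given that routing is 75% of the total), and scales it by the fraction of tracks that are conf. mem. shar. The function name and the linear scaling are illustrative.

```python
def estimated_area_saving(routing_fraction, cms_track_fraction, routing_reduction):
    """Fraction of total FPGA area saved under a linear model (an assumption).

    routing_fraction   -- share of FPGA area that is routing (0.75 on the slide)
    cms_track_fraction -- share of routing tracks that are conf. mem. shar.
    routing_reduction  -- routing-area reduction if every track were shared
    """
    return routing_fraction * cms_track_fraction * routing_reduction


# Back-solved from the earlier estimate: 25% total saving / 75% routing = 1/3.
ROUTING_REDUCTION = 0.25 / 0.75

all_shared = estimated_area_saving(0.75, 1.0, ROUTING_REDUCTION)
half_shared = estimated_area_saving(0.75, 0.5, ROUTING_REDUCTION)
print(f"{all_shared:.3f} {half_shared:.3f}")
```

With half of the tracks shared, the model gives 12.5%, in line with the approximately 12% quoted above.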
20
Datapath-oriented CAD Flow –
Overview
Flow stages, in order:
• Enhanced Module Compaction Synthesis
• Coarse-grain Node Graph Packing
• Multi-bit FPGA Placement
• Coarse-grain Resource Routing
21
Can Regularity Be Utilized to
Improve Logic Density?
• To achieve best area
– What should be the best number of
clusters per logic block?
– What should be the best number of
conf. mem. shar. routing tracks per
routing channel?
• What is the performance of this
datapath-oriented FPGA?
22
Experiments
• Fifteen benchmark circuits
– From the Pico-java processor
– Implemented on the MB-FPGA
• Experiments
– Granularity (the number of clusters
per logic block) vs. Area
– % conf. mem. shar. tracks vs. area
– % conf. mem. shar. tracks vs.
performance
23
Granularity Vs. Area
• Explored a 2-D architectural space
– First vary granularity
– For each granularity: vary % of
conf. mem. shar. routing tracks per
routing channel
• For each architecture, find the
average area required to implement
the benchmark circuits
• Plot best area for each granularity
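The two-level sweep described above can be sketched as a nested loop. This is only an outline of the procedure: `area_of` stands in for a full pack, place, and route run on one circuit and is hypothetical, and `toy_area` is a made-up function used only to make the sketch runnable, not measured data.

```python
def best_area_per_granularity(granularities, cms_percentages, benchmarks, area_of):
    """Return {granularity: (best average area, % c.m.s. tracks that achieved it)}."""
    best = {}
    for granularity in granularities:
        candidates = []
        for cms_pct in cms_percentages:
            # Average area over all benchmark circuits for this architecture.
            areas = [area_of(circuit, granularity, cms_pct) for circuit in benchmarks]
            candidates.append((sum(areas) / len(areas), cms_pct))
        best[granularity] = min(candidates)  # smallest average area wins
    return best


def toy_area(circuit, granularity, cms_pct):
    # Made-up bowl-shaped stand-in for a real CAD run; NOT experimental data.
    return 100.0 + (granularity - 4) ** 2 + abs(cms_pct - 40) / 10.0 + len(circuit)


result = best_area_per_granularity(
    granularities=[2, 4, 8],
    cms_percentages=[0, 20, 40, 60],
    benchmarks=["ex_dpath", "incmod"],
    area_of=toy_area,
)
best_granularity = min(result, key=lambda g: result[g][0])
print(result, best_granularity)
```

Plotting `result[g][0]` against `g` gives one point per granularity, which is the form of the curve on the next slide.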
24
Granularity Vs. Area
[Plot: average area (10e6) versus granularity; y-axis from 1.4 to 3, x-axis granularity from 2 to 32]
25
% C.M.S. Tracks Vs. Area
• Assume four clusters per logic block for the
MB-FPGA
• For each circuit
– Set a fixed number of conf. mem. shar.
tracks
– Search for minimum number of
additional conv. tracks
• Classify into eight percentile ranges
• Use the minimum area obtainable for each
circuit to calculate average area
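The per-circuit search can be sketched as a scan over the number of conventional tracks. The predicate `routes_successfully` is hypothetical and stands in for a complete routing attempt; the upper bound and the toy predicate below are assumptions made only so that the sketch runs.

```python
def min_conventional_tracks(cms_tracks, routes_successfully, max_conv_tracks=200):
    """Smallest number of conventional tracks that lets the circuit route.

    cms_tracks is held fixed; returns None if no count up to
    max_conv_tracks succeeds.
    """
    for conv_tracks in range(max_conv_tracks + 1):
        if routes_successfully(cms_tracks, conv_tracks):
            return conv_tracks
    return None


def toy_router(cms_tracks, conv_tracks):
    # Made-up rule: the circuit routes once the channel has 30 tracks in total.
    return cms_tracks + conv_tracks >= 30


conv_needed = min_conventional_tracks(cms_tracks=12, routes_successfully=toy_router)
cms_percentage = 100.0 * 12 / (12 + conv_needed)
print(conv_needed, cms_percentage)
```

The resulting percentage of conf. mem. shar. tracks is what places each routed circuit into one of the percentage ranges.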
26
% C.M.S. Tracks Vs. Area
• Also implement the same
benchmarks on a comparable
conventional FPGA
• MB-FPGA area is normalized
against the conventional FPGA
area
27
% C.M.S. Tracks Vs. Area
[Plot: normalized average area (y-axis from 0.88 to 1.00) versus % conf. mem. shar. tracks (x-axis ranges: 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%)]
28
Area (40% - 50% Tracks Are C.M.S.)
Circuit        Conventional       Datapath-oriented   Datapath-oriented
               FPGA Area (10e5)   FPGA Area (10e5)    FPGA Area (Normalized)
icu_dpath      56.0               48.9                0.87
ex_dpath       50.8               38.8                0.76
multmod_dp     22.4               25.0                1.10
imdr_dpath     20.0               17.2                0.86
ucode_dat      18.6               16.1                0.86
mantissa_dp    15.5               14.8                0.96
dcu_dpath      13.2               11.5                0.87
incmod         9.89               11.6                1.17
exponent_dp    7.02               7.66                1.10
smu_dpath      6.72               6.69                1.00
pipe_dpath     5.37               5.19                0.97
prils_dp       4.77               4.67                0.98
code_seq_dp    4.77               4.51                0.95
rsadd_dp       4.16               3.56                0.86
ucode_reg      1.00               1.04                1.00
Avg. Area      16.0               14.5                0.90
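The summary row can be cross-checked from the per-circuit areas. The sketch below assumes that the average row is the arithmetic mean of each area column and that the normalized average is the ratio of the two means; this reading is inferred from the numbers, not stated on the slide. Ratios recomputed from the rounded areas can differ slightly from the listed per-circuit values.

```python
# (circuit, conventional FPGA area, datapath-oriented FPGA area), units of 10e5,
# copied from the table for the 40%-50% c.m.s. track range.
areas = [
    ("icu_dpath", 56.0, 48.9), ("ex_dpath", 50.8, 38.8), ("multmod_dp", 22.4, 25.0),
    ("imdr_dpath", 20.0, 17.2), ("ucode_dat", 18.6, 16.1), ("mantissa_dp", 15.5, 14.8),
    ("dcu_dpath", 13.2, 11.5), ("incmod", 9.89, 11.6), ("exponent_dp", 7.02, 7.66),
    ("smu_dpath", 6.72, 6.69), ("pipe_dpath", 5.37, 5.19), ("prils_dp", 4.77, 4.67),
    ("code_seq_dp", 4.77, 4.51), ("rsadd_dp", 4.16, 3.56), ("ucode_reg", 1.00, 1.04),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Per-circuit normalization: datapath-oriented area over conventional area.
normalized = {name: dp / conv for name, conv, dp in areas}

avg_conv = mean(conv for _, conv, _ in areas)
avg_dp = mean(dp for _, _, dp in areas)
avg_normalized = avg_dp / avg_conv

print(round(avg_conv, 1), round(avg_dp, 1), round(avg_normalized, 2))
```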
29
Performance (Crit. Path Delay)
• Assume carry network delay equal to
local routing network delay
– Over-estimated carry delay
– Results are pessimistic
• Normalized average crit. path delay
over 15 benchmark circuits with
respect to conventional FPGA
30
% C.M.S. Tracks Vs. Crit. Path
[Plot: normalized average critical path delay (y-axis from 1.08 to 1.14) versus % conf. mem. shar. tracks (x-axis ranges: 0%, 0%-10%, 10%-20%, 20%-30%, 30%-40%, 40%-50%, 50%-60%, 60%-70%)]
31
Crit. Path Delay (40%- 50% Tracks Are CMS)
Circuit        Conv. FPGA Crit.   D.P. FPGA Crit.    D.P. FPGA Crit.
               Path Delay (ns)    Path Delay (ns)    Path Delay (Normalized)
code_seq_dp    16.7               15.5               0.93
dcu_dpath      8.27               10.5               1.3
ex_dpath       42.1               47.6               1.1
exponent_dp    21.0               22.5               1.1
icu_dpath      24.9               34.7               1.4
imdr_dpath     46.0               47.2               1.0
incmod         45.5               42.8               0.94
mantissa_dp    13.9               12.3               0.88
multmod_dp     33.3               36.1               1.1
pipe_dpath     13.4               17.2               1.3
prils_dp       23.0               25.5               1.1
rsadd_dp       35.8               39.6               1.1
smu_dpath      29.3               35.3               1.2
ucode_dat      11.3               11.6               1.0
ucode_reg      3.58               4.25               1.2
Avg. Delay     20.2               22.3               1.1
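The average row of the delay table is consistent with a geometric mean over the 15 circuits rather than an arithmetic mean. That reading is an inference from the numbers, not something the slides state; the sketch below simply recomputes it.

```python
import math

# (circuit, conventional FPGA delay, datapath-oriented FPGA delay), in ns,
# copied from the table for the 40%-50% c.m.s. track range.
delays = [
    ("code_seq_dp", 16.7, 15.5), ("dcu_dpath", 8.27, 10.5), ("ex_dpath", 42.1, 47.6),
    ("exponent_dp", 21.0, 22.5), ("icu_dpath", 24.9, 34.7), ("imdr_dpath", 46.0, 47.2),
    ("incmod", 45.5, 42.8), ("mantissa_dp", 13.9, 12.3), ("multmod_dp", 33.3, 36.1),
    ("pipe_dpath", 13.4, 17.2), ("prils_dp", 23.0, 25.5), ("rsadd_dp", 35.8, 39.6),
    ("smu_dpath", 29.3, 35.3), ("ucode_dat", 11.3, 11.6), ("ucode_reg", 3.58, 4.25),
]


def geometric_mean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


geo_conv = geometric_mean(conv for _, conv, _ in delays)
geo_dp = geometric_mean(dp for _, _, dp in delays)

print(round(geo_conv, 1), round(geo_dp, 1), round(geo_dp / geo_conv, 1))
```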
32
Conclusions
• Investigated the question
– Can regularity be effectively utilized to
improve logic density?
• Presented
– A datapath-oriented FPGA architecture
• Fully specified to the level of transistor sizing
– An analysis of datapath regularity
– A brief description of the CAD flow for
the architecture
33
Conclusions
• Detailed architectural specification and
CAD implementation are very important
• Best MB-FPGA architecture
– Granularity = 4
– 40%-50% of tracks are C.M.S.
• Architectural Results
– 10% smaller in area than conv. FPGA
– Much less than the 50% area savings
prediction [cher96]
– Has a 10% performance penalty
34
Discussions
• Under what circumstances will MB-FPGA
be more area efficient?
– Applications with more buses than
our benchmarks
– Wider datapath applications
– Larger than 1x min. width
transistors in SRAM cells
– Smaller than 5x min. width
transistors in tri-state buffers
– SRAM reduction is more important
than area reduction
35
Future Work
• Architecture
– Sharing configuration memory in
logic
– Improve performance
• CAD tools
– Proper modeling of carry network
delay
– Improve performance
– Power modeling
36
Detailed Datapath-oriented CAD
Implementation Issues
Andy Gean Ye
University of Toronto
37
Datapath-oriented CAD Flow –
Overview
Flow stages, in order:
• Enhanced Module Compaction Synthesis
• Coarse-grain Node Graph Packing
• Multi-bit FPGA Placement
• Coarse-grain Resource Routing
38
Input to CAD Flow
• Netlists of datapath components in
Verilog or VHDL
• From a pre-defined library
– Arithmetic operators
– Logic operators
– Multiplexers
• Datapath regularity of the input is
preserved throughout the CAD flow
39
An Example Input Datapath
Circuit
[Figure: a 4-bit example datapath circuit with four multiplexers (inputs a0-a3, b0-b3, c0-c3, shared select sel) and four adders (inputs d0-d3, carry-in cin, carry-out cout, outputs s0-s3)]
40
Synthesis
• Synopsys FPGA compiler has 38%
area inflation when instructed to
preserve datapath regularity
• Two major causes of area inflation
– Duplicated logic across bit-slices
– Bit-slices are too small
• Augmented FPGA compiler with new
algorithms
– Reduced the area inflation to 3%
41
Packing
• Based on the T-VPACK [betz99]
algorithm
• Like T-VPACK – timing driven
• New feature – ability to preserve
datapath regularity
42
After Synthesis and Packing
[Figure: the example circuit after synthesis and packing. Four BLEs take the inputs a0-a3, b0-b3, c0-c3, and sel; a bus connects them to the remaining BLEs, which take d0-d3 and cin and produce s0-s3 and cout.]
43
Placement and Routing
• Based on the VPR tools [betz99]
– Placer: simulated annealing [kirk83]
– Router: congestion negotiation-based
PathFinder [ebel95]
• New features of the placer
– Moves individual clusters if they do not
contain datapath logic
– Moves the entire logic block if it contains
datapath logic, to preserve datapath
regularity
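The placer rule can be sketched as a move generator for the annealer. The data layout (a dict with `name`, `has_datapath`, and `clusters`) and the function `propose_move` are hypothetical; the real placer is built on VPR and is not reproduced here.

```python
import random


def propose_move(logic_blocks, rng):
    """Pick what the annealer should try to relocate next.

    A logic block that holds datapath logic is moved as a whole, so its
    bit-slices stay together; otherwise a single cluster is moved.
    """
    block = rng.choice(logic_blocks)
    if block["has_datapath"]:
        return ("move_logic_block", block["name"])
    cluster = rng.choice(block["clusters"])
    return ("move_cluster", block["name"], cluster)


rng = random.Random(0)
datapath_block = {"name": "LB0", "has_datapath": True, "clusters": [0, 1, 2, 3]}
random_logic_block = {"name": "LB1", "has_datapath": False, "clusters": [0, 1, 2, 3]}

whole_block_move = propose_move([datapath_block], rng)
single_cluster_move = propose_move([random_logic_block], rng)
print(whole_block_move, single_cluster_move)
```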
44
Router
• Contains a new set of expansion cost
functions
– Designed to ease the task of
comparing the cost of using conv.
tracks against the cost of using
conf. mem. shar. tracks
– Composed of delay and congestion
metrics (similar to the conventional
expansion cost)
45
Overall Routing Flow
Route Buses
Route Non-bus Signals
Update Cost Functions
46
Routing Buses
• Route entire buses through conf.
mem. shar. routing tracks
• Route the first bit through conv.
routing tracks – test for delay and
congestion
• Compare expansion costs
• Select the option with the lowest
expansion cost
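The choice between the two bus routing options can be sketched as a cost comparison. The slides say only that the expansion cost combines delay and congestion metrics; the criticality-weighted sum used here follows the style of a timing-driven VPR router and is an assumption, as are the function names and the example numbers.

```python
def expansion_cost(delay, congestion, criticality):
    """Blend delay and congestion; the weighting scheme is an assumption."""
    return criticality * delay + (1.0 - criticality) * congestion


def choose_bus_routing(cms_option, conv_option, criticality):
    """Each option is a (delay, congestion) pair; returns the cheaper option."""
    costs = {
        "cms": expansion_cost(*cms_option, criticality),
        "conv": expansion_cost(*conv_option, criticality),
    }
    return min(costs, key=costs.get), costs


# Made-up numbers: the shared tracks are slower but far less congested.
balanced_choice, balanced_costs = choose_bus_routing((2.0, 1.0), (1.5, 4.0), criticality=0.5)
critical_choice, critical_costs = choose_bus_routing((2.0, 1.0), (1.5, 4.0), criticality=1.0)
print(balanced_choice, balanced_costs)
print(critical_choice, critical_costs)
```

With these numbers a bus of moderate criticality goes to the conf. mem. shar. tracks, while a fully timing-critical one is sent to the conventional tracks because only delay counts.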
47
Routing Non-bus Signals
• Consider the options of routing the
signal through conv. as well as conf.
mem. shar. tracks
• Compare the expansion cost
• Select the option with the lowest
expansion cost
48