Transcript ppt

ECE 636
Reconfigurable Computing
Lecture 13
Mid-term I Review
Lecture 13: Mid-term 1 Review
October 22, 2013
SRAM-based FPGA
Q
Read or Write
Q
P1
P2
P3
P4
Data
Programming Bit
Out
I1 I2
2-Input LUT
• SRAM bits can be programmed many times
• Each programming bit takes up five transistors
• Larger device area reduces speed versus EPROM and
antifuse.
Lecture 13: Mid-term 1 Review
October 22, 2013
Field Programmable Gate Array
Lecture 13: Mid-term 1 Review
October 22, 2013
Connection Box Flexibility
Tracks
Logic
Cluster
IO pin
T0 T1
Out
T0
T1
T2
T2
Out
FC = 3
T0 T1
T2
• Fc -> How many tracks does an input pin connect to?
• If logic cluster is small, FC is large
FC = W
• If logic cluster is large, Fc can be less.
- Approximately 0.2W for Xilinx XC4000EX, Virtex
Lecture 13: Mid-term 1 Review
October 22, 2013
Switchbox Flexibility
0
1
0
0
1
1
0
1
• Switch box provides optimized interconnection area.
• Flexibility found to be not as important as FC
• Six transistors needed for FS= 3
Lecture 13: Mid-term 1 Review
October 22, 2013
Switchbox Issues
Lecture 13: Mid-term 1 Review
October 22, 2013
Bidirectional vs Directional
Lecture 13: Mid-term 1 Review
October 22, 2013
Directional Wiring: Outputs can use switch block muxes
New connectivity
constraint
Single-driver
Wiring!!!
Dir Architecture
Lecture 13: Mid-term 1 Review
October 22, 2013
Fine-grained Approach
16X1
A
Addr
D
16X1
A
LUT1
D
LUT2
•
•
•
•
For 4-input LUTs 16 bits of information available
Can be chained together through programmable network.
Decoder and multiplexer an issue.
Flexibility is a key aspect.
Lecture 13: Mid-term 1 Review
October 22, 2013
Hill Climbing Algorithms
• To avoid getting trapped in local minima, consider “hillclimbing” approach
• Need to accept worse solutions or make “bad” moves to get
global minima.
• Acceptance is probabalistic. Only accept cost-increasing
moves some of the time.
Cost
Solution space
Lecture 13: Mid-term 1 Review
October 22, 2013
Maze Routing
L
S
L
C
• Evaluate shortest feasible paths based on a cost function
• Like row-based device global route allocates channel
bandwidth not specific solutions.
• Formulate cost function as needed to address desired
goal.
Lecture 13: Mid-term 1 Review
October 22, 2013
Routing Tradeoffs
• Bias router to find first, best route.
• Vary number of node expansions using:
pcosti = (1 – a) x pcosti-1 + ncosti + a x disti
Lecture 13: Mid-term 1 Review
October 22, 2013
Architectural Limitation
• Routing
architecture
necessitates
domain selection.
• Bigger effect for
multi-fanout nets
Lecture 13: Mid-term 1 Review
October 22, 2013
Pathfinder
•
•
•
•
Use a non-decreasing history value to represent congestion.
Similarities to multi-commodity flow
Can be implemented efficiently but does require substantial run time
Only update after an interation.
ci = (1 + hn * hfac) * (1 + pn * pfac) + bn, n-1
Lecture 13: Mid-term 1 Review
October 22, 2013
Bipartitioning
• Perhaps biggest problem in multi-FPGA design is partitioning
• Partitioner must deal with logic and pin constraints.
• Could simultaneously attempt partitioning across all devices. Even
“simple” algorithms are O(n3)
• Better to recursively bipartition circuit.
Lecture 13: Mid-term 1 Review
October 22, 2013
KLFM Partitioning
Bin 1
Bin 2
• Identify nodes to swap to reduce overall cut size
• Lock moved nodes
• Algorithm continues until no un-locked node can be moved without
violating size constraints
Lecture 13: Mid-term 1 Review
October 22, 2013
KLFM Partitioning
•
•
•
Key issue is implementing node costs in lists that can be easily
accessed and updated.
Many extensions to consider to speed up overall optimization
Reasonably easy to implement in software
Lecture 13: Mid-term 1 Review
October 22, 2013
Partition Preprocessing: Clustering
• Identify bin size
• Choose a seed block (node)
• Identify node with highest connectivity to join cluster
• Terminate when cluster size met.
• In practical terms cluster size of 4 works best
Lecture 13: Mid-term 1 Review
October 22, 2013
Clustering
• Technology mapping before partitioning is typically ineffective since
frequently area is secondary to interconnect
• Frequently bipartitioning continues after unclustering as well.
Cluster
KLFM
uncluster
KLFM
• This allows for additional fine-grain moves.
Lecture 13: Mid-term 1 Review
October 22, 2013
Logic Replication
• Attempt to reduce cutset by replicating logic.
• Every input of original cell must also input the
replicated cell.
• Replication can either be integrated into the partitioning
process or used as a post-process technique.
Lecture 13: Mid-term 1 Review
October 22, 2013
Logic Emulation
• Emulation takes a sizable amount of resources
• Compilation time can be large due to FPGA compiles
• One application: also direct ties to other FPGA
computing applications.
Lecture 13: Mid-term 1 Review
October 22, 2013
Are Meshes Realistic?
• The number of wires leaving a partition grows with Rent’s Rule
P = KGB
• Perimeter grows as G0.5 but unfortunately most circuits grow at GB where
B > 0.5
• Effectively devices highly pin limited
• What does this mean for meshes?
Lecture 13: Mid-term 1 Review
October 22, 2013
Multi-FPGA Software
• Missing high-level synthesis
• Global placement and routing
similar to intra-device CAD
Lecture 13: Mid-term 1 Review
October 22, 2013
Virtual Wires
• Overcome pin limitations by multiplexing pins and signals
• Schedule when communication will take place.
Lecture 13: Mid-term 1 Review
October 22, 2013
Virtual Wires Software Flow
• Global router enhanced
to include scheduling
and embedding.
• Multiplexing logic
synthesized from FPGA
logic.
Lecture 13: Mid-term 1 Review
October 22, 2013
Why Compiling C is Hard
° General Language
° Not Designed For Describing Hardware
° Features that Make Analysis Hard
• Pointers
• Subroutines
• Linear code
° C has no direct concept of time
° C (and most procedural languages) are inherently
sequential
• Most people think sequentially.
• Opportunities primarily lie in parallel data
Lecture 13: Mid-term 1 Review
October 22, 2013
Variables
°
Handel-C has one basic type - integer
°
May be signed or unsigned
°
Can be any width, not limited to 8, 16, 32 etc.
Variables are mapped to hardware registers.
void main(void)
{
unsigned 6 a;
a=45;
}
a=
1 0 1 1 0 1 = 0x2d
MSB
Lecture 13: Mid-term 1 Review
LSB
October 22, 2013
DeepC Compiler
• Consider loop based computation
to be memory limited
• Computation partitioned across
small memories to form tiles
• Inter-tile communication is
scheduled
• RTL synthesis performed on
resulting computation and
communication hardware
Lecture 13: Mid-term 1 Review
October 22, 2013
DeepC Compiler
• Parallelizes compilation across multiple tiles
• Orchestrates communication between tiles
• Some dynamic (data dependent) routing possible.
Lecture 13: Mid-term 1 Review
October 22, 2013
Control FSM
• Result for each tile is a datapath, state machine,
and memory block
Lecture 13: Mid-term 1 Review
October 22, 2013
Striped Architecture
Condition
Codes
Microprocessor
Interface
FPGA
Fabric
Control
Unit
Address
Control &
Next Addr
Configuration Cache
Configuration
• Same basic approach, pipelined communication,
incremental modification
• Functions as a linear pipeline
• Each stripe is homogeneous to simplify computation
• Condition codes allow for some control flexibility
Lecture 13: Mid-term 1 Review
October 22, 2013
Piperench Internals
• Only multi-bit functional units used
• Very limited resources for interconnect to neighboring programming
elements
• Place and route greatly simplied
Lecture 13: Mid-term 1 Review
October 22, 2013
Convolutional Encoder


Accepts information bits as a continuous stream
Operates on the current b-bit input, where b ranges
from 1 to 6 and some number of immediately
preceding b-bit inputs to produce V output bits,
V
>b
b =1, V =2
+
1
FF
0
0
FF
1
+
0
Definitions

Constraint Length



Code Rate (or) Rate


Number of successive b-bit groups of information
bits for each encoding operation
Denoted by K
b/V
Typical values


K:7
Rate : 1/2, 1/3
The Viterbi Algorithm


Finds a bit-sequence in the set of all
possible transmitted bit-sequences that
most closely resembles the received
data.
Maximum likelihood algorithm


Each bit received by decoder associated
with a measure of correctness.
Practical for short constraint length
convolutional codes
State diagram

State


0/00
Encoder memory
00
Branch
k/ij,
where i and j represent
the output bits
associated with
input bit k
0/11
1/11
1/00

10
01
0/10
0/01
11
1/10
1/01
Trellis Diagram
T=0
00
0
T=1
00 0
T=2
00 2
11
11
01
2
10
00 3
11
3
11 1
00
10
0
2
10
01
3
01
11
ENC IN :
ENC OUT :
RECEIVED:
T=3
0
00
00
1
11
11
1
10
0
10
11
Accumulated metric
2+2,3+0 : 3
0+1,3+1 : 1
2+0,3+1 : 2
0+1,3+1 : 1
K=3
Rate ½
Total number of states = 2K-1
Adaptive Viterbi Algorithm
Motivation


Extremely large memory and logic for Viterbi
Algorithm



Fewer number of paths retained
Reduced memory and computation
Definitions




Path – Bit sequence
Path metric or cost – Accumulated error metric of a path
Survivor – Path which is retained for the subsequent time
step
Adaptive Viterbi Algorithm
Criterion for path survival
1.
2.
A threshold T is introduced such that a path is
retained if and only if current path metric is less
than dm+T, where dm is the minimum cost
among all survivors of the previous time step.
The total number of survivors per time step is
limited to a critical number called Nmax selected
by user.

Only best Nmax paths have to be retained at any time.
Trellis Diagram for AVA
Architecture (contd.)
b1
sum1
Add
di < d m + T
Elimination of sorting
yes
Count
paths
b2
sum2
Add
Count < Nmax
no
di < d m + T
yes
T = T-2
yes
Update
memory
Virtual Router
 Independent routing policies
for each virtual router
 Key challenges
• Isolation
• Performance
• Flexibility
• Scalability
Physical router
Virtual router A
Virtual router B
Routing
Control
Routing
Control
Forwarding Table
Forwarding Table
DEMUX
MUX
42
Virtualization using FPGAs
 A novel network virtualization substrate which
• Uses FPGA to implement high performance virtual routers
• Introduces scalability through virtual routers in host software
• Exploits reconfiguration to customize hardware virtual routers
Virtual
Router 1
Virtual
Router 2
A
FP G
Virtual
Router 3
Virtual
Router 4
43
Partial Reconfiguration
 Use partial reconfiguration to independently configure
virtual routers
44
Full FPGA Reconfiguration
 Two virtual routers (A, B) initially in FPGA
 During reconfiguration router A migrated to software, the
other eliminated
 After reconfiguration two virtual routers (A, B’) again in FPGA
Reduced
Throughput
45
Partial FPGA Reconfiguration
 A remains in hardware and operates at full speed
 20X speedup in reconfiguration down time due to partial
reconfiguration
Sustained
Throughput
46