Presentation on Runtime Reconfiguration

Implementation Approaches with FPGAs
• Compile-time reconfiguration (CTR)
CTR is a static implementation strategy where each application
consists of one configuration.
• Run-time reconfiguration (RTR)
RTR is a dynamic implementation strategy where each
application consists of multiple cooperating configurations.
Compile Time Reconfiguration
• Consists of a single system-wide configuration
• Static hardware configuration remains on the FPGAs for
the duration of the application
• Similar to ASIC from application point of view
• Conventional design tools provide adequate support for
application development
• Examples: Splash, Nano processor
Run Time Reconfiguration
• Applications reconfigure hardware resources during
application execution
• Each configuration implements some fraction of the
application
• Optimizes hardware resources
• Lack of sufficient design tools and a well-defined design
methodology
New Design Problems
• Divide the algorithm into time-exclusive segments that do
not need to (or cannot) run concurrently
– Each segment should remain resident for a reasonable amount of time
– Tasks should be relatively independent from each other
• Coordinate the behavior between configurations
– Pass intermediate results between configurations
Two RTR Approaches (1)
• Global Approach
– Each phase of application is implemented as a single system-wide
configuration; it allocates all hardware resources in each
configuration step
– Relatively simple, coarse grained
• Implementation issues
– Divide the application into roughly equal-sized partitions
– Interfaces between configurations are fixed
• Example: RRANN
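The global approach can be pictured as a loop that loads one full-device configuration per phase and lets every phase read and write a memory that outlives the reconfiguration. The sketch below is only an illustration: `load_configuration`, `run_global_rtr`, and the two stage functions are hypothetical names, and the external memory is modeled as a plain dictionary.

```python
def load_configuration(stage):
    """Stand-in for downloading a full bitstream; returns the 'circuit' to run."""
    return stage


def run_global_rtr(stages, memory):
    """Run each phase as its own system-wide configuration, in order."""
    reconfigurations = 0
    for stage in stages:
        circuit = load_configuration(stage)  # the whole device is rewritten here
        reconfigurations += 1
        circuit(memory)                      # fixed interface: one shared memory
    return reconfigurations


def square_stage(memory):
    # First phase: leaves an intermediate result behind for the next phase.
    memory["squares"] = [v * v for v in memory["input"]]


def sum_stage(memory):
    # Second phase: consumes the intermediate result of the first phase.
    memory["result"] = sum(memory["squares"])
```

Because the interface between configurations is fixed (here, the keys of the shared memory), each phase can be designed without knowing how the others are laid out on the device.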
Two RTR Approaches (2)
• Local Approach
– Applications locally reconfigure subsets of the logic as the
application executes
– Flexible, finer granularity
– Ability to create fine-grained functional operators
• Implementation Issues
– Interfaces are not fixed
– Designers need to ensure both structural compliance and physical
compliance
– No good design tool support
• Example: DISC, RRANN2
Run-time Reconfiguration Paper (1)
• FPGA and Neural Networks
– Implementation of random topologies
– Training versus operation
– Multiple training algorithms
– Run-time reconfiguration
Run-time Reconfiguration Paper (2)
• Problem: Backpropagation Training Algorithm
– Feed-forward stage:
$\mathrm{net}_i = \sum_{j \in A} O_j W_{ji} \quad \forall i : i \in B$
$O_i = f(\mathrm{net}_i) = \frac{1}{1 + e^{-\mathrm{net}_i}}$
– Backpropagation
$\delta_i = f'(\mathrm{net}_i)(T_i - O_i) \quad \forall i : i \in C$
Run-time Reconfiguration Paper (3)
– Backpropagation
$\delta_i = f'(\mathrm{net}_i) \sum_{j \in E} \delta_j W_{ij} \quad \forall i : i \in D$
– Update
$\Delta W_{ij} = C_l O_i \delta_j \quad \forall i, j : i \in A, j \in B$
$W_{ij}^{t+1} = W_{ij}^{t} + \Delta W_{ij}$
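The three stages can be traced in software for a small network with one hidden layer. This is a minimal sketch, not the RRANN hardware: the function name `train_step`, the nested-list weight layout, and the learning-rate argument `lr` (standing in for the constant in the update rule) are assumptions made for illustration. It uses the fact that the logistic function has derivative f'(net) = O(1 - O).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(w1, w2, x, target, lr):
    """One training pass. w1[j][i] links input j to hidden i; w2 likewise."""
    # Feed-forward: net_i is the weighted sum of the previous layer's outputs.
    hidden = [sigmoid(sum(x[j] * w1[j][i] for j in range(len(x))))
              for i in range(len(w1[0]))]
    out = [sigmoid(sum(hidden[j] * w2[j][i] for j in range(len(hidden))))
           for i in range(len(w2[0]))]

    # Backpropagation, output layer: delta_i = f'(net_i) * (T_i - O_i).
    d_out = [o * (1.0 - o) * (t - o) for o, t in zip(out, target)]
    # Backpropagation, hidden layer: delta_i = f'(net_i) * sum_j delta_j * W_ij.
    d_hid = [h * (1.0 - h) * sum(d_out[j] * w2[i][j] for j in range(len(d_out)))
             for i, h in enumerate(hidden)]

    # Update: W_ij <- W_ij + lr * O_i * delta_j (done after all deltas are known).
    for i in range(len(hidden)):
        for j in range(len(d_out)):
            w2[i][j] += lr * hidden[i] * d_out[j]
    for i in range(len(x)):
        for j in range(len(d_hid)):
            w1[i][j] += lr * x[i] * d_hid[j]
    return out
```

Each of the three commented phases uses different arithmetic on the same weights, which is what makes it possible to give each phase its own circuit.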
Run-time Reconfiguration Paper (4)
• Approach 1:
– Combine all three stages of execution into the same circuit module
and configure this module onto FPGAs
– No reconfiguration
• Approach 2:
– Combine the feed-forward and update stages into one circuit and
the backpropagation stage into another.
– Reconfigure twice (per cycle)
Run-time Reconfiguration Paper (5)
• Approach 3:
– Treat feed-forward, backpropagation and update as three circuit
modules
– Need to reconfigure three times per cycle
– Each stage consists of a global controller occupying one FPGA and
many neural processors occupying the balance of the available
FPGAs
– 6 neurons per FPGA
Run-time Reconfiguration Paper (6)
• Global Controller
– Sequences the execution of local hardware subroutines on the
neural processors
– Supplies data to the neural processors
• Neural Processor
– Performs computations
– Contains six hardware neurons, pre- and post-processing, memory
interfacing, local control, and a local RAM
Run-time Reconfiguration Paper (7)
• Multiplexed Interconnection
– A broadcast bus connects the outputs of all neurons on layer m
to the inputs of all neurons on layer m+1
• The Feed-forward Stage
• The Backpropagation Stage
• The Update Stage
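One way to picture the multiplexed interconnection during the feed-forward stage: the neurons of layer m take turns driving the shared bus, and every neuron of layer m+1 multiplies the broadcast value by its own weight and accumulates. The sketch below assumes that turn-taking scheme; `broadcast_feed_forward` and the weight layout are hypothetical, and the activation function is left out to keep the bus behavior visible.

```python
def broadcast_feed_forward(outputs_m, weights):
    """weights[j][i] is the weight from neuron j (layer m) to neuron i (layer m+1)."""
    n_next = len(weights[0])
    accumulators = [0.0] * n_next
    bus_cycles = 0
    for j, value in enumerate(outputs_m):   # one sender owns the bus per cycle
        for i in range(n_next):             # every receiver sees the same value
            accumulators[i] += value * weights[j][i]
        bus_cycles += 1
    return accumulators, bus_cycles
```

Under this assumption the bus needs one cycle per sending neuron regardless of how many receivers listen, so wiring grows with the number of neurons rather than with the number of connections.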
Run-time Reconfiguration Paper (8)
• Implementation
– Xilinx XC3090
– Host PC
• Comparison of space capacity
– Option 1: One hardware neuron per XC3090
– Option 2: Four hardware neurons per XC3090
– Option 3: Six hardware neurons per XC3090
Run-time Reconfiguration Paper (9)
• Comparison of time efficiency
t E  tc  t R
– Option 1: 0 ms reconfiguration time
– Option 2: 14 ms reconfiguration time per pass
– Option 3: 21 ms reconfiguration time per pass
• Time / Space tradeoff
– When more hardware is needed, the same space on an FPGA could
be reused many times through reconfiguration, but doing so
reduces the amount of time that the FPGA could spend executing
Run-time Reconfiguration Paper (10)
• Functional Density Metric D
– Functional density is a composite area-time metric that measures
the computational throughput (operations per second) per unit of
hardware resources
$D = \frac{1}{AT}$
– A (area) is measured in the FPGA cell-count of the circuit;
operating time (T) is measured as the execution time of the system
$D_{RTR} = \frac{1}{A(t_{exec} + t_{config})}$
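The two density formulas can be compared directly. In this sketch the function names and all numeric inputs are hypothetical; `max_config_time` simply rearranges the condition that the run-time reconfigured density is at least the static density, which gives t_config <= A_static * T_static / A_rtr - t_exec.

```python
def functional_density(area_cells, exec_time_s, config_time_s=0.0):
    """D = 1 / (A * T); for RTR, T includes the configuration time."""
    return 1.0 / (area_cells * (exec_time_s + config_time_s))


def max_config_time(static_area, static_time_s, rtr_area, rtr_exec_time_s):
    """Largest configuration time for which the RTR design still matches
    the functional density of the static design."""
    return static_area * static_time_s / rtr_area - rtr_exec_time_s
```

A negative result from `max_config_time` means the reconfigured circuit cannot match the static one even with instantaneous reconfiguration, because its area-time product is already larger.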
RRANN2: Partial Reconfiguration (1)
• Runtime reconfiguration (RTR) is an implementation
approach that divides an application into a series of
sequentially executed stages with each stage implemented
as a separate circuit module
• Partial RTR extends the approach by partitioning these
stages and designing their circuitry such that they exhibit
a high degree of functional and physical commonality
• By leaving common circuitry resident, transition between
configurations can be accomplished by updating only the
difference between configurations
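The idea of updating only the difference can be sketched by treating a configuration as a list of equally addressed frames and downloading only the frames that change. The frame model and the names `partial_update` and `apply_update` are assumptions for illustration; real devices differ in how finely their configuration memory can be addressed.

```python
def partial_update(current, target):
    """Return {frame_index: new_frame} for every frame that differs."""
    if len(current) != len(target):
        raise ValueError("configurations must cover the same frames")
    return {i: new for i, (old, new) in enumerate(zip(current, target))
            if old != new}


def apply_update(current, update):
    """Write only the changed frames; untouched frames stay resident."""
    result = list(current)
    for index, frame in update.items():
        result[index] = frame
    return result
```

The more circuitry two stages share at identical positions, the fewer frames appear in the update, which is why the stages are designed for physical as well as functional commonality.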
RRANN2: Partial Reconfiguration (2)
• Design goal
– To reach the break-even point with fewer neurons per layer
• Advantages
– A smaller reconfiguration bit-stream is faster to download
– Eliminating part of the routing and control circuitry increases
hardware neural density
• Static versus Dynamic Circuitry
RRANN2: Partial Reconfiguration (3)
• Fully static circuitry
– Combinational logic
– Storage devices (preserves both configuration and current value of
the storage device)
• Mostly static circuitry
– Precision: two devices only differ in their precision
– Constant value: two blocks differ by a constant value
– Function: two blocks perform logically different functions but their
construction is almost identical
– Subsets: one block is structurally and functionally contained within
the bounds of the other
RRANN2: Partial Reconfiguration (4)
• Physical design issues
– Each block should contain the same physical implementation and
occupy the same position on the device
– A common logic block is also constrained by the physical context
of its surroundings, many of which might be unknown at design
time
– Further constraints have to be placed on the design to group the
static circuitry so that the resulting bit-stream actually shrinks
– No good design-tool support
RRANN2: Partial Reconfiguration (5)
• Implementation
– Step 1: The circuit modules are placed and routed by hand to
physically map the schematics to corresponding FPGA resources
– Step 2: The physical representation is converted to downloadable
configuration bit streams
• Performance (CLAy31)
– Reconfiguration time: 600 µs
– Training performance: 4 times the performance of RRANN
– FPGA density: 50% more neurons per FPGA than RRANN
Research Issues (1)
• Scheduling designs into a Time-multiplexed FPGA
– An algorithm is proposed to split an FPGA design into multiple
configurations of a time-multiplexed FPGA
– ASAP (as-soon-as-possible) scheduling
– ALAP (as-late-as-possible) scheduling
– Optimize the scheduler by identifying the units not on the critical
path and reschedule their evaluation into other cycles
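ASAP and ALAP levels, and the slack between them, can be computed on a dependency graph as follows. This is a generic sketch of the two schedules, not the proposed algorithm itself; the graph encoding (`deps` maps each unit to the units it depends on) and the function names are assumptions.

```python
def asap(deps):
    """Earliest cycle for each unit: one after its latest predecessor."""
    level = {}

    def visit(node):
        if node not in level:
            level[node] = 1 + max((visit(p) for p in deps[node]), default=-1)
        return level[node]

    for node in deps:
        visit(node)
    return level


def alap(deps, depth):
    """Latest cycle for each unit that still finishes within `depth` cycles."""
    succs = {node: [] for node in deps}
    for node, preds in deps.items():
        for p in preds:
            succs[p].append(node)
    level = {}

    def visit(node):
        if node not in level:
            level[node] = min((visit(s) for s in succs[node]), default=depth) - 1
        return level[node]

    for node in deps:
        visit(node)
    return level


def slack(deps):
    """Units with zero slack are on the critical path; the rest can move."""
    early = asap(deps)
    late = alap(deps, max(early.values()) + 1)
    return {node: late[node] - early[node] for node in deps}
```

Units with positive slack are the candidates for rescheduling into other cycles, for example to balance how much logic each configuration has to hold.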
Research Issues (2)
• Wormhole run-time reconfiguration
– Configuration has so far relied on global control strategies,
which are a fundamental bottleneck on the potential bandwidth
of configuration information flow
– Serial configuration: Xilinx 4000
– Random access configuration: CLAy
– Wormhole run-time reconfiguration
Research Issues (3)
• Interaction of pipeline and reconfigurable FPGAs
– An ideal virtualized FPGA would be capable of executing any
hardware design, regardless of the size of that design. The
execution speed would be proportional to the physical capacity of
the FPGA and inversely proportional to the size of the hardware
design
– Similar to DISC?
– Granularity of swapping unit