Amalgam: a Reconfigurable Processor for Future Fabrication Processes
Nicholas P. Carter
University of Illinois at Urbana-Champaign
Performance = f(architecture, implementation)
[Figure: schedule of LD, MUL, ADD, and ST operations alongside four 1-D IDCT blocks, plotted against time]
Efficient Implementation
• Everything you give up in clock rate you
have to make back in architectural
efficiency
• Wire delay is the big limiting factor in
system architectures today
– Wires get slower relative to transistors as fab.
process improves
• Programmable processors moving to
deeper pipelines
– Not good enough to just prevent wires from
making reconf. logic slower
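The trade-off between clock rate and architectural efficiency can be made concrete with a minimal sketch. It assumes the simplest possible model, execution time = cycle count / clock rate; the function names and the numbers in the usage note are hypothetical and do not come from the talk.

```python
def execution_time_ns(cycles: int, clock_mhz: float) -> float:
    """Execution time in nanoseconds for a given cycle count and clock rate."""
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return cycles * 1000.0 / clock_mhz


def required_cycle_reduction(slow_clock_mhz: float, fast_clock_mhz: float) -> float:
    """Factor by which the slower-clocked design must cut its cycle count
    just to match the faster-clocked design on the same task."""
    return fast_clock_mhz / slow_clock_mhz
```

Under this model, a reconfigurable unit clocked at 250 MHz next to a 1000 MHz programmable core would need to finish the task in a quarter of the cycles just to break even.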
Amalgam
[Figure: Amalgam block diagram showing DRAM, a multi-banked cache, a network, and four PCluster/RCluster pairs]
Reconfigurable Cluster Design
• 4 Register banks
– 8 registers/bank
• 4 Reconfigurable logic segments
– 8 Rows x 32 LBs per segment
• Array control unit
• Network interface
• Counter-clockwise flow of computation through cluster
[Figure: cluster floorplan showing the network interface, four banks, four segments, and the ACU]
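As a quick check on the sizes listed for the reconfigurable cluster, the sketch below totals the resources. It assumes "8 Rows x 32 LBs" means 8 rows of 32 logic blocks each; the class and field names are illustrative and not taken from the Amalgam design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RClusterConfig:
    """Resource counts for one reconfigurable cluster (names are illustrative)."""
    register_banks: int = 4
    registers_per_bank: int = 8
    segments: int = 4
    rows_per_segment: int = 8
    logic_blocks_per_row: int = 32  # assumption: 32 LBs in each row

    @property
    def total_registers(self) -> int:
        return self.register_banks * self.registers_per_bank

    @property
    def logic_blocks_per_segment(self) -> int:
        return self.rows_per_segment * self.logic_blocks_per_row

    @property
    def total_logic_blocks(self) -> int:
        return self.segments * self.logic_blocks_per_segment
```

With the default values this gives 32 registers and 1024 logic blocks per cluster (256 per segment).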
Reconfigurable Clock Rates
Unpipelined Critical Path
• Effect on clock rate varies significantly with fabrication process
– Wires have heavy loads, making them slower than their length would indicate
• Latches in logic blocks only resource for pipelining
• Vertical and horizontal wires carry data between logic blocks
[Figure: unpipelined critical path between flip-flops (FF) through horizontal wires (HWIRE), vertical wires (VWIRE), a register bank, and logic blocks (LB)]
Supporting Pipelining
• Goal: make logic block delay the limiting
factor on clock rate
• Add configurable latches at each wire
intersection
– Problem: different paths may have different
latencies
• Add retiming buffers at logic block
inputs/outputs
• Add network queues to reduce
synchronization overhead
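The latency-mismatch problem named above can be sketched as follows: once wires are pipelined, the inputs of a logic block may arrive in different cycles, so the earlier ones need extra delay. This is a minimal illustration, not the Amalgam configuration flow; the function name, the signal names, and the `max_delay` parameter are assumptions.

```python
def balance_inputs(arrival_cycles: dict[str, int], max_delay: int) -> dict[str, int]:
    """Return the extra delay (in cycles) each input needs so that all
    inputs of a logic block line up with the latest-arriving one."""
    latest = max(arrival_cycles.values())
    delays = {name: latest - cycle for name, cycle in arrival_cycles.items()}
    for name, delay in delays.items():
        # A retiming buffer can only absorb a bounded number of cycles.
        if delay > max_delay:
            raise ValueError(f"input {name!r} needs {delay} cycles, limit is {max_delay}")
    return delays
```

For example, if input `a` arrives in cycle 3 and input `b` in cycle 1, `b` needs two cycles of added delay and `a` needs none.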
Pipelined Critical Path
• Delay of individual wires < logic block delay in all processes studied
• Add configurable pipeline latches at junctions between wires
• Pipeline latches also added on carry chains within rows
[Figure: pipelined critical path with flip-flops (FF) inserted between horizontal wires (HWIRE), vertical wires (VWIRE), the register bank, and logic blocks (LB)]
Retiming Buffers
• 5-deep chain of
latches added to each
logic block input
– Similar structure added
to LB output
• Can “borrow” up to
two cycles of
additional delay from
adjacent input
• Total pipeline register
overhead = 17%
[Figure: chains of latches (FF) forming retiming buffers]
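A retiming buffer of this kind can be modelled behaviourally as a short delay line with a configurable tap. The sketch below is only an illustration under that assumption: the class name and interface are hypothetical, and the ability to borrow cycles from an adjacent input is not modelled.

```python
from collections import deque


class RetimingBuffer:
    """Delay line of `depth` latches with a configurable delay (0..depth cycles)."""

    def __init__(self, depth: int = 5, delay: int = 0):
        if not 0 <= delay <= depth:
            raise ValueError("delay must be between 0 and depth")
        self.delay = delay
        # Index 0 holds the value latched one cycle ago, index 1 two cycles ago, ...
        self._latches = deque([None] * depth, maxlen=depth)

    def clock(self, value):
        """Advance one cycle and return the input seen `delay` cycles earlier."""
        out = value if self.delay == 0 else self._latches[self.delay - 1]
        self._latches.appendleft(value)  # the oldest value falls off the far end
        return out
```

With `delay=2`, feeding `a, b, c, d` on successive cycles yields `None, None, a, b`, so each value emerges two cycles after it entered.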
Register Queues
[Figure: two panels of network writes to register R8. One sequence into a register file: WRITE R8, Val1; Sync. Message; EMPTY R8; WRITE R8, Val2. The other sequence into a register queue: WRITE R8, Val1; WRITE R8, Val2.]
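The difference between the two schemes can be illustrated with a small behavioural model. It assumes the original register carries a full/empty state, so a second write must wait until the first value has been consumed, while a queue accepts back-to-back writes. The class names and the queue capacity are hypothetical; the slides do not give a queue depth.

```python
from collections import deque


class FullEmptyRegister:
    """Single-entry register with a full/empty bit (assumed original scheme)."""

    def __init__(self):
        self.value = None
        self.full = False

    def write(self, value) -> bool:
        # A write to a full register cannot proceed; the sender would first
        # need a synchronization message saying the register was emptied.
        if self.full:
            return False
        self.value, self.full = value, True
        return True

    def read(self):
        if not self.full:
            raise RuntimeError("register is empty")
        self.full = False
        return self.value


class RegisterQueue:
    """FIFO in front of a register; `capacity` is a hypothetical parameter."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._items = deque()

    def write(self, value) -> bool:
        if len(self._items) >= self.capacity:
            return False
        self._items.append(value)
        return True

    def read(self):
        if not self._items:
            raise RuntimeError("queue is empty")
        return self._items.popleft()
```

In this model, writing Val1 and then Val2 to a `FullEmptyRegister` fails on the second write until the first value is read, whereas a `RegisterQueue` accepts both and returns them in order.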
Implementing Pipelined Apps.
• Logical vs. Physical pipelining
– Logical: Program-visible, uses array and
registers
– Physical: Only visible to ACU, uses pipeline
registers on wires, retiming buffers
• Take advantage of decoupling provided by queues
• Applications use same reconfigurable logic
configurations in different fab. processes
– Only FSM in ACU changes
– Applications to portability, managing intra-die
variation
Experimental Methodology
• Programs simulated using Amalsim
– Set each cluster’s clock rate independently
• Benchmarks: IDCT, Rijndael, DNA comparison
– Fine-grained version of each benchmark does one computation
– Medium-grained version performs four independent computations
• Programmable cluster clock rates based on ITRS
– Limit stages to 7 FO4 delay, slightly more aggressive than ITRS
• Logic block latencies, wire lengths taken from circuit-level design of
reconf. cluster in 180nm CMOS
– Convert logic block delay to FO4, scale by FO4 delay of each
fabrication process
– Scale wire length based on fabrication process, simulate wire
delay in SPICE
– Pipeline such that reconf. cluster cycle time is determined by logic
block delay
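The FO4-based scaling described above can be sketched in a few lines. This is an illustration of the arithmetic only, not the scripts used in the study; the function names are assumptions, and any FO4 delay values passed in are the caller's, since the slides do not list them.

```python
def scale_delay_ps(delay_ps_ref: float, fo4_ps_ref: float, fo4_ps_target: float) -> float:
    """Express a delay measured in the reference process as a number of FO4
    delays, then rescale it by the FO4 delay of the target process."""
    delay_in_fo4 = delay_ps_ref / fo4_ps_ref
    return delay_in_fo4 * fo4_ps_target


def clock_rate_ghz(fo4_ps: float, fo4_per_stage: float = 7.0) -> float:
    """Clock rate when each pipeline stage is limited to `fo4_per_stage` FO4 delays."""
    cycle_time_ps = fo4_ps * fo4_per_stage
    return 1000.0 / cycle_time_ps  # 1000 ps per ns, so 1000 / ps gives GHz
```

As a purely hypothetical example, a logic block delay of 900 ps in a process with a 90 ps FO4 delay is 10 FO4; in a process with a 45 ps FO4 delay it would scale to 450 ps.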
Pipelined Clock Rates
Fine-Grained Benchmark Perf.
• Reconfigurable version maintains about 20%
perf. improvement over programmable in all fab.
processes
• Pipelining only small benefit
• Majority of speedup comes from reduction in
memory references
Medium-Grain Benchmark Perf.
• Pipelined architecture sees 2.6x perf
improvement over programmable
• Unpipelined architecture only minor improvement
over programmable
– Greater parallelism means more ability to tolerate
memory delays
Limit Studies
• Believe that memory operations are much of the
benefit for small tasks
– Study limit where memory latency = 1
– Also test theory that streaming benchmarks have
enough parallelism to cover latency
• Understand how much clock rate of
reconfigurable unit affects performance
– Model reconfigurable unit at same clock rate as
programmable clusters
– Completely unreasonable for unpipelined
– Might be indicator of what industry could do with
pipelined
Unpipelined Fine-Grained
• Removing memory latencies makes
programmable performance similar to
reconfigurable
• Latency of reconfig. clusters has large impact on
performance -- no parallelism to cover latency
Pipelined Fine-Grained
• Results similar to unpipelined
– Benefit still mostly from memory reduction
Unpipelined Medium-Grain
• Eliminating memory latencies really helps
programmable
• Latency of reconf. logic an even bigger problem
– Programmable clusters can exploit parallelism through
pipelines
Pipelined Medium-Grain
• Impact of memory system on reconfigurable
performance very small
• Less benefit from increasing reconfigurable
cluster clock rate
– With even small amounts of parallelism, throughput
becomes more important than latency.
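Why throughput starts to dominate latency once there is some parallelism can be seen from a simple cycle-count model. This is a textbook approximation, not the Amalsim simulator; the function names, the term "initiation interval" (cycles between successive task starts), and the numbers in the usage note are assumptions.

```python
def unpipelined_cycles(tasks: int, latency: int) -> int:
    """Each task must finish before the next one starts."""
    return tasks * latency


def pipelined_cycles(tasks: int, latency: int, initiation_interval: int = 1) -> int:
    """The first result appears after `latency` cycles; each further task
    completes `initiation_interval` cycles after the previous one."""
    if tasks <= 0:
        return 0
    return latency + (tasks - 1) * initiation_interval
```

With four independent computations and a hypothetical 10-cycle latency, the unpipelined unit needs 40 cycles while the pipelined one needs 13, and raising the latency to 20 cycles adds 40 cycles to the former but only 10 to the latter.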
Future Directions
• ASIC-like performance with programmable
systems
– ASICs typically get 100x better performance
per unit area than microprocessors
• Application-specific memory systems in a
programmable chip
– Transform memory references into
communication
– Create natural division of programs into regular
and irregular blocks
Conclusion
• Reconfigurable computing must provide
both speedup from custom logic and high
clock rates to succeed
• Amalgam does this by limiting and
tolerating wire delay at multiple levels
– Clustered architecture
– Segmented reconfigurable unit
– Pipeline wire delays
• Result: 2.6x speedup over 8-way CMP in
current and future fabrication processes