Coarse Grain Reconfigurable Architectures


Enabling Technologies for Reconfigurable Computing
November 21, 2001, Tampere, Finland
Reiner Hartenstein
University of Kaiserslautern

Enabling Technologies for Reconfigurable Computing
Part 1: Reconfigurable Computing (RC)
Wednesday, November 21, 8.30 – 10.00 hrs.
Schedule
time slot | session
08.30 – 10.00 | Reconfigurable Computing (RC)
10.00 – 10.30 | coffee break
10.30 – 12.00 | Compilation Techniques for RC
12.00 – 14.00 | lunch break
14.00 – 15.30 | Resources for Stream-based RC
15.30 – 16.00 | coffee break
16.00 – 17.30 | FPGAs: recent developments
Xputer Lab, University of Kaiserslautern
© 2001, [email protected]
http://www.fpl.uni-kl.de
Reconfigurable: why?
• Exploding design costs and shrinking product life cycles of ASICs create a demand for the use of reconfigurable arrays (RAs) to extend product longevity.
• Performance is only one part of the story. The time has come to fully exploit their flexibility, supporting turn-around times of minutes instead of months for real-time in-system debugging, profiling, verification, tuning, field maintenance, and field upgrades.
• A new "soft machine" paradigm and language framework is available for novel compilation techniques that cope with the new market structures, which transfer synthesis from the vendor to the customer.
SoC Alternatives … not including C/C++ CAD Tools [Gordon Bell]
• The blank sheet of paper: FPGA
• Auto design of a basic system: Tensilica
• Standardized, committee-designed components*, cells, and custom IP
• Standard components including more application-specific processors*, IP add-ons and custom
• One chip does it all: SMOP**
*) Processors, Memory, Communication & Memory Links
**) SMOP ??
SoC Alternatives [Gordon Bell]
product | strategy | vendor
FPGA | "sea of uncommitted gate arrays" | Xilinx, Altera
compile a system | unique processor for every application | Tensilica
systolic array | many pipelined or parallel processors + custom |
DSP, VLIW | special purpose processor cores + custom | TI
processor + RAM ASICs | general purpose cores, specialized by I/O, etc. | IBM, Intel
universal micro | multiprocessor array, programmable I/O | Cradle
A Decade of Research in Reconfigurable Computing
• Thanks to the achievements of numerous research projects throughout the 1990s, the breakthrough in commercialization has started, and a fairly comprehensive methodology is already available.
• Dear colleague, the RC scene welcomes your contributions to improve it and to push for its inclusion in contemporary CS&E curricula.
• One of the goals of this talk is to stimulate you with highlights and by introducing some key issues.
No longer a strange niche area
• was "hardware" design for a strange platform
 – CAD, but no compilation
• Emerging awareness:
 – new mind set
 – new curricular embedding
• Coming dichotomy of CS:
 – SW (software) <-> CW (configware)
 – HW (hardware) <-> FW (flexware)
 – computing in time <-> computing in space
Flexibility/universality trade-off
[Figure: efficiency vs. flexibility trade-off; points shown: hardwired, Kress Array (Xplorer), FPGA]
RAs are heading for the mainstream
... will they become indispensable for SoC products?
ASPP, an application-specific programmable product, is:
• an application-specific standard product, and
• embedded programmable logic
[Figure: ASPP die with Flash/RAM, programmable logic and logic blocks]
CSoC, a configurable SoC, is:
• an industry-standard µprocessor,
• an embedded reconfigurable array,
• memory, a dedicated system bus ...
[Figure: CSoC die with DRAM/Flash/SRAM memory banks, microprocessor, reconfigurable accelerator array, analog and logic blocks]
Soap Chip: System on a programmable Chip
Reconfigurable Logic going Mainstream
• Fine grain: FPGAs are killing the ASIC market
• Fastest-growing segment of the semiconductor market
• Substantially improved design flow and libraries
• Coarse grain: several startups
• Comprehensive methodology
• Please lobby for new curricula.
• One of the goals of this talk: to motivate you with key issues and visionary highlights.
Designer-oriented Innovation stalled?
• EDA industry: about 7 billion $
• leverages a > 200 billion $ semiconductor industry
• FPGAs (7 billion $): the fastest-growing segment
• EDA industry constantly redefining itself
• "except logic synthesis, no really significant innovation in the past decade"
• CAD developers can't deliver their ideas effectively
• CAD developers personally don't appreciate the real problems facing designers
EDA: the main bottleneck
Biggest Mistake of EDA
guess it!
>> History
• History
• Paradigm Shift
• Coarse Grain: why?
• Coarse Grain Architectures
• Reconfiguration Architecture
http://www.uni-kl.de
Logic Gate Price Trend
[Chart (source: Altera): price per logic element, normalized to Q1/1993, from Q1 '93 to Q1 '00; about 40% lower per year; data labels 0.261, 0.086, 0.042 and 0.029]
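The "40% lower per year" annotation compounds multiplicatively: after n years the normalized price is 0.6^n. A minimal Python sketch, assuming a perfectly constant 40% yearly decline (the chart plots measured points, not this idealized curve); `relative_price` is a hypothetical helper name:

```python
def relative_price(years: float, annual_drop: float = 0.40) -> float:
    """Price per logic element relative to Q1/1993, assuming a constant yearly decline."""
    return (1.0 - annual_drop) ** years


if __name__ == "__main__":
    # Print the idealized curve for Q1 '93 .. Q1 '00.
    for year in range(1993, 2001):
        print(f"Q1 '{year % 100:02d}: {relative_price(year - 1993):.3f}")
```

Seven years of 40% decline give 0.6^7 ≈ 0.028, close to the chart's last data label (0.029), assuming that label belongs to Q1 '00.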
The History of Paradigm Shifts
[Figure: Makimoto's wave, 1957–2007. Mainstream silicon alternates between standard products (TTL; µprocessors and memory) and custom products (LSI, MSI; ASICs and accelerators), with the 1st and 2nd design crises marked and the waves after 1997 shown as "?"]
"Mainstream Silicon Application is switching every 10 Years"
"The Programmable System-on-a-Chip is the next wave"
Makimoto's 3rd Wave
• Fine grain subsystems (FPGAs):
 – 1st half of the 3rd wave
 – universal (but less efficient)
• Coarse grain subsystems:
 – 2nd half of the 3rd wave
 – domain-specific
 – much more flexible than the 2nd half of the 2nd wave
What's the next Wave?
[Figure: Tredennick's paradigm shifts and Hartenstein's curve, 1957–2007. Hardwired (custom): algorithm fixed, resources fixed. Procedural programming (standard): algorithm variable, resources fixed. Structural programming (FPGAs, then coarse grain RAs from 1997): algorithm variable, resources variable. A 4th wave? No further wave!]
The Impact of Makimoto's Paradigm Shifts
[Figure: Makimoto's wave (Dr. Makimoto, FPL 2000 keynote), 1957–2007, annotated with the kind of personalization: custom (LSI, MSI; ASICs, accelerators): personalization (CAD) before fabrication; standard µprocessors and memory: procedural personalization via a RAM-based machine paradigm; next wave: structural personalization, RAM-based, before run time]
The software industry's secret of success: procedural personalization via a RAM-based machine paradigm.
Repeat this success story with a new machine paradigm!
>> Paradigm Shift
Sequential vs. structural RAM
[Figure: left, a "von Neumann" processor: software (procedural) is downloaded into sequential RAM, which drives an instruction sequencer and a data path, with I/O. Right, FPGA and reconfigurable accelerator(s): logic synthesis, then route and place, then download into structural configuration RAM]
Changing Models of Computing
[Figure: three models. "von Neumann": software (procedural) downloaded into RAM, driving an instruction sequencer and a data path. Contemporary: a host running downloaded software plus hardwired accelerator(s) designed with CAD; the hardware occupies most of the silicon ("the tail wagging the dog"). Reconfigurable computing: a host running software plus reconfigurable accelerator(s) whose RAM receives downloaded configware (flexware, structural)]
The Microprocessor is a Methuselah
... the steam engine of the silicon age
9 technology generations ...
• 1st: 4004
• 2nd: 8008
• 3rd: 8086
• 4th: 80286
• 5th: 80386
• 6th: 80486
• 7th: P5 (Pentium)
• 8th: P6 (Pentium Pro / Pentium II)
• 9th: Pentium III
… Decline of the Wintel Business Model
[Chart, 1997–2002 (sources: Forrester, IDC): US market in billion US-$, billion subscribers worldwide, and million devices delivered in the U.S.; labels include 201 billion, 0.5 billion, 1000 $ and 1500 $]
Basics of Binding Time
[Figure: time of "instruction fetch": at run time for the microprocessor and the parallel computer, at loading time for reconfigurable computing; the axis also marks compile time]
Binding Time vs. Computing Domain
[Figure: binding time of the communication channels (at run time, at loading time, at compile time, at a later fabrication step, before fabrication) versus programming domain (time domain: procedural; time & space: hybrid; space domain: structural). Entries: microprocessor, parallel computer, array processor, reconfigurable computing, systolic arrays, ASICs, full custom ICs]
The KressArray is a generalization of the systolic array.
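To make binding time concrete: a microprocessor binds each operation when its instruction is fetched at run time, while a reconfigurable array binds the operations once, when the configuration is loaded. The Python sketch below models only that difference by counting fetches versus configuration loads. It is a software analogy, not a hardware model, and `von_neumann_run`, `configure` and `stream_run` are hypothetical names:

```python
from typing import Callable, List, Tuple

# A tiny "program": (opcode, constant) pairs applied to every data item.
Program = List[Tuple[str, int]]

OPS = {"add": lambda x, c: x + c, "mul": lambda x, c: x * c}


def von_neumann_run(program: Program, data: List[int]) -> Tuple[List[int], int]:
    """Run-time binding: every operation is fetched and decoded again for each item."""
    fetches = 0
    results = []
    for x in data:
        for opcode, const in program:
            fetches += 1                    # instruction fetch + decode
            x = OPS[opcode](x, const)
        results.append(x)
    return results, fetches


def configure(program: Program) -> Tuple[Callable[[int], int], int]:
    """Load-time binding: the operations are bound once into a fixed pipeline."""
    stages = [lambda x, f=OPS[op], c=c: f(x, c) for op, c in program]

    def pipeline(x: int) -> int:
        for stage in stages:                # no decoding: the structure is fixed
            x = stage(x)
        return x

    return pipeline, len(program)           # configuration words loaded once


def stream_run(program: Program, data: List[int]) -> Tuple[List[int], int]:
    pipeline, config_loads = configure(program)
    return [pipeline(x) for x in data], config_loads
```

For the program `[("mul", 3), ("add", 1)]` over four data items, both versions return `[4, 7, 10, 13]`, but the run-time-bound version performs 8 fetches while the load-time-bound one loads 2 configuration words; the gap grows with the length of the data stream.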
Dataquest Predicts Programmability to be Predominant in SoC
• Application-specific programmable products (ASPPs) will be the next best thing in semiconductor technology
• With programmability as a standard feature, ASPPs will be the predominant system-on-a-chip products in five years
Jordan Selburn, principal analyst, ASICs and system-level integration, Dataquest Inc.'s Semiconductors Group
EETimes 10/21/98, Dataquest Semiconductors '98 conference
Applications
• next generation wireless*
• network processors*
• many other areas*
*) keynotes and papers at FPL 2000, Villach, Austria, August 27 – 30, 2000, http://www.fpl.uni-kl.de/FPL/
The 10th International Conference on Field-programmable Logic and Applications: The Roadmap to Reconfigurable Systems
Applications (2)
• Image processing:
 – for the smart car (collision avoidance, others ...)
 – smart traffic pilots, robotics, fast material inspection
 – smart stub finders, motion detection (MPEG-4, ...)
• Signal processing, speech processing, software radio
• Correlation, encryption, communication switching / protocols
• Innovative consumer electronics:
 – super smart cards, smart mobile phones, wearables
 – portable, set-top, laptop, desktop, embedded, ...
• many others, ...
Applications (3)
• New cellular standard with up to 2 Mbit/s: the new CDMA standard needs > 500 MIPS just for the RF receiver part
• Wide variety of end-user devices: smart mobile phones, palm pilots, laptops, games, camcorder-likes, ... the internet car, and many new types of devices to come ...
• Increasingly wide variety of services available from network providers: download just what a particular customer has subscribed to
• Expert group [Vissers]: > 20% of it will be accelerator code*
Why coarse grain?
[Chart, 1960–2010 (sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld): transistors/chip (memory, microprocessor/DSP), normalized processor speed, algorithmic complexity of wireless generations 1G to 4G (Shannon's Law), computational efficiency in mA/MIPS (SH7752, StrongARM), and battery performance]
Shannon's Law
• In a number of application areas, throughput requirements are growing faster than Moore's law
• Fundamental flaws in software processor solutions
• 32 soft ARM cores fit onto a contemporary FPGA
• Stream-based distributed processing is the way to go
It's a Paradigm Shift!
• Using FPGAs (fine grain reconfigurable) is mainly just classical logic synthesis on a "strange hardware" platform
• Coarse grain reconfigurable arrays (reconfigurable computing), however, mean a truly fundamental paradigm shift
• This is still ignored by CS and EE curricula and by almost all R&D scenes
>> Coarse Grain: why?
It's a General Paradigm Shift!
• Using FPGAs (fine grain reconfigurable): just logic synthesis on a strange platform
• Coarse grain reconfigurable arrays (reconfigurable computing): a fundamental paradigm shift
• Replacing concurrent processes by much more efficient parallelism: stream-based computing arrays
• Ignored by curricula and most R&D scenes
Fine-grained vs. coarse-grained
Fine-grained reconfiguration versus coarse-grained reconfiguration:
• fine grain is general purpose:
 – slow and area-inefficient, but high parallelism
• coarse grain is application-domain-specific:
 – highly area-efficient
 – extremely high performance
Reconfigurability Overhead
[Figure: FPGA logic cells (L) and switches (S); only part of the area is used by the application, the rest are resources needed for reconfigurability, partly for configuration code storage ("hidden RAM", not shown)]
Principle of a Typical FPGA
[Figure: array of CLBs with connection points and taps; every programmable point is controlled by a flip-flop (FF) of the hidden configuration RAM]
Routing Overhead in FPGAs
• > 1000 transistors at each crossbar
• roughly 40 or more transistors at each switching point
• roughly 15 or more transistors at each tap
• every switching point and tap is controlled by flip-flops (FF) that are part of the hidden RAM
• most FPGA vendors' gate count: 1 flip-flop of configuration RAM = 4 gates
• Routing congestion [DeHon]: often 50% or less of the CLBs are used
Why Coarse Grain instead of FPGA?
[Chart, 1980–2010 (sources: Proc. ISSCC, ICSPAT, DAC, DSPWorld): transistors per chip for FPGA physical, FPGA routed and FPGA logical; annotations ~10 000 and ~10]
• reconfigurability overhead reduced by up to ~1000
• drastically smaller configuration memory
• much faster loading
• a lot more benefits
>>> extremely high efficiency
1. avoiding address computation overhead
2. avoiding instruction fetch and interpretation overhead
3. high parallelism, massively multiple deep pipelines
4. much less configuration memory
5. no routing areas to configure functions from CLBs
Configurable Computing Systems
• combine a programmable sequential processor with flexware (structurally programmable "hard"ware):
• capitalize on the strengths of both, flexware and software
• early 1960s: Estrin (UCLA): enabling technology not available
• 1990s: significant increase in research activities (DARPA ...)
• FPGAs are not the enabling technology: hardware skills needed
• Verilog- or VHDL-based systems often result in poor performance
Platforms available
• Soft Data Path Arrays
 – KressArray
 – Xtreme (PACT)
 – ACM (Quicksilver Tech)
 – CHESS Array (Elixent)
 – others
• Compilation techniques feasibility studies:
 – Partitioning Co-Compiler
 – Design Space Explorer
 – others
Also as an autonomous Machine
• New machine paradigm (Xputer)
• is the counterpart of the so-called von Neumann paradigm
• easy to teach: simple machine principles
 – CONS: confuses customers (paradigm switch: the brain hurts)
 – PROS: strong guidance of EDA tool development
 – more effective hardware/software APIs
 – compilation techniques similar to traditional compilation
 – better application development tools accepting C or Java
 – scan patterns (data counter) similar to control flow (program counter)
 – general model of hardware/software co-design
 – fascination of the freak effect: opening up a new R&D discipline
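In the Xputer view, a data counter (address generator) steps through a scan pattern while the configured datapath stays fixed, much as a program counter steps through instructions. A minimal Python sketch, assuming a plain raster scan over interior pixels and a 3×3 window-sum operator; `raster_scan`, `window_sum` and `xputer_run` are hypothetical names, not taken from the Xputer/MoM tools:

```python
from typing import Dict, Iterator, List, Tuple

Image = List[List[int]]


def raster_scan(rows: int, cols: int, step_x: int = 1,
                step_y: int = 1) -> Iterator[Tuple[int, int]]:
    """Data counter: yields the window positions of the scan pattern."""
    for r in range(1, rows - 1, step_y):
        for c in range(1, cols - 1, step_x):
            yield r, c


def window_sum(image: Image, r: int, c: int) -> int:
    """Configured datapath: a fixed 3x3 operator applied at each scan position."""
    return sum(image[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1))


def xputer_run(image: Image) -> Dict[Tuple[int, int], int]:
    """Drive the fixed datapath with the scan pattern; no instruction stream."""
    rows, cols = len(image), len(image[0])
    return {(r, c): window_sum(image, r, c) for r, c in raster_scan(rows, cols)}
```

Changing the computation order means changing only the scan pattern (e.g. `step_x`, `step_y`), not the datapath, which is the analogy to control flow the slide draws.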
>> Coarse Grain Architectures
Some Players in Silicon Valley and ….
company | architecture | business model | markets
Adaptive Silicon | not disclosed | sell cores | –
Chameleon Systems | 32 bit datapath array | sell chips | embedded DSP, networking
Malleable | not disclosed | sell chips | Voice over IP
MorphICs | not disclosed | sell cores | wireless communication
Silicon Spice | not disclosed | sell solutions | networking
Systolix | bit serial systolic array | sell cores | signal conditioning
Triscend | system on chip | sell chips | embedded systems
Network processors: > 20 players
Commercial rDPAs
• XPU family (IP cores, e.g. XPU128): PACT Corp., Munich
• CALISTO: Silicon Spice**
• CS2000 family: Chameleon Systems
• MECA family: Malleable**
• flexible array: MorphICs
• ACM: Quicksilver Tech*
• CHESS array: Elixent
• MorphoSys: Morpho Tech*
• FIPSOC: SIDSA
*) here at SoC
**) bought
PACT Corp
• Xtreme Processor Platform (XPP): a family of IP cores; high-speed, data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories and high-speed I/O ports
• Application development support software featuring a flow-graph-style algorithm mapping language, to minimize training requirements
• XPP's fabrics feature automatic dataflow synchronization and a flagged event network to dynamically configure the execution flow
• Supports dynamic RTR (run-time reconfiguration): hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order
• Automatic event-based task swapping along with data streams: released resources are automatically and immediately reconfigured
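Automatic dataflow synchronization means that an operator fires only when all of its operands have arrived. The generic token-firing rule behind that idea can be sketched in Python. This is an illustration under assumed semantics (unbounded FIFOs, single-output operators), not PACT's actual XPP protocol, and `Node` and `run` are hypothetical names:

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class Node:
    """A dataflow operator that fires only when every input queue holds a token."""

    def __init__(self, name: str, n_inputs: int, fn: Callable[..., int]):
        self.name = name
        self.inputs: List[Deque[int]] = [deque() for _ in range(n_inputs)]
        self.fn = fn
        self.consumers: List[Tuple["Node", int]] = []   # (node, input index)

    def ready(self) -> bool:
        return all(self.inputs)          # every input FIFO is non-empty

    def fire(self) -> int:
        result = self.fn(*(q.popleft() for q in self.inputs))
        for node, port in self.consumers:
            node.inputs[port].append(result)
        return result


def run(nodes: List[Node], sink: Node) -> List[int]:
    """Fire ready nodes until nothing can fire; collect the sink's results."""
    outputs: List[int] = []
    progress = True
    while progress:
        progress = False
        for node in nodes:
            while node.ready():
                value = node.fire()
                progress = True
                if node is sink:
                    outputs.append(value)
    return outputs
```

For `(a + b) * c` with streams a = 1, 2, 3 and b = 10, 20, 30 but only two tokens for c (both 2), `run` yields `[22, 44]`; the third sum (33) waits in the multiplier's input queue until another c token arrives, with no explicit scheduling by the designer.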
rDPA (Reconfigurable Datapath Array)
[Figure: two 4 × 4 arrays of rDPUs (reconfigurable datapath units). Left: rDPUs connected through a reconfigurable interconnect fabric (RIF) in a separate routing area. Right: the RIF is layouted over the rDPUs, so the rDPA is wired by abutment]
Generically defined Fabrics: KressArray Family
[Figure: interconnect variants a) to i); rDPU types include d) routing only and e) routing and function]
Some application areas, like e.g. wireless communication, need extraordinarily powerful communication resources.
Universal RAs are not always feasible
The general purpose (coarse grain) reconfigurable array may turn out to be an illusion ...
... often functional resources are not the throughput bottleneck.
Some application areas, such as wireless communication, need extremely rich communication resources.
Use domain-specific platform generators!
KressArray Family Example
[Figure: tailored KressArray rDPU example, ports labelled 2, 4, 8, 16, 24 and 32; external view: only the NN-port abutment architecture is shown]
http://kressarray.de
KressArray Family generic Fabrics: a few examples
[Figure: rDPU with ports labelled 2, 4, 8, 16, 24 and 32 and a "+" function]
• Select mode, number and width of NNports
• Select function repertory
• Select nearest neighbour (NN) interconnect, for example:
 – route-through only
 – more NNports: rich routing resources
 – route-through and function
• Examples of 2nd level interconnect: layouted over the rDPU cell, no separate routing areas!
CMOS interconnect resources
• Foundries offer up to 8 metal layers and up to 3 poly layers
• Reconfigurable interconnect fabric layouted over the rDPU cell
Super Pipe Networks
property | systolic array | supersystolic rDPA*
applications | regular data dependencies only | no restrictions
pipeline shape | linear only | no restrictions
pipeline resources | uniform only | no restrictions
mapping | linear projection or algebraic synthesis | simulated annealing or P&R algorithm
scheduling (data stream formation) | – | scheduling algorithm (e.g. force-directed)
*) KressArray [ASP-DAC 1995]
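The systolic column of the table (linear shape, uniform resources, regular data dependencies) is captured by the textbook linear systolic FIR filter. A cycle-level Python model, assuming one classic design among several: weights stay in the cells, input samples move right through two registers per cell, and partial sums move right through one, so that y[n] meets x[n-k] in cell k. `systolic_fir` is a hypothetical name:

```python
from typing import List


def systolic_fir(weights: List[int], xs: List[int]) -> List[int]:
    """Cycle-level model of a linear systolic FIR array: y[n] = sum_k w[k] * x[n-k]."""
    k_cells = len(weights)
    xr1 = [0] * k_cells      # first x delay register of each cell
    xr2 = [0] * k_cells      # second x delay register of each cell
    yr = [0] * k_cells       # partial-sum register of each cell
    ys: List[int] = []
    for t in range(len(xs) + k_cells):
        if t >= k_cells:
            ys.append(yr[-1])                    # y[t - k_cells] leaves the array
        # Values seen by each cell in this cycle (read before the clock edge);
        # only nearest-neighbour connections are used.
        x_in = [xs[t] if t < len(xs) else 0] + xr2[:-1]
        y_in = [0] + yr[:-1]
        # Synchronous update of all registers at the clock edge.
        xr2 = xr1[:]
        xr1 = x_in
        yr = [y + w * x for y, w, x in zip(y_in, weights, x_in)]
    return ys
```

`systolic_fir([2, -1], [3, 4, 5])` returns `[6, 5, 6]`, the same as the direct convolution. A supersystolic rDPA drops exactly the restrictions this design depends on: every cell does the same operation, and data only flow along one straight line.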
Communication Resource Requirements
... often functional resources are not the throughput bottleneck.
In some application areas, such as wireless communication, reconfigurable computing arrays need extraordinarily rich and powerful communication resources.
The solution: generators for domain-specific RA platforms
SNN filter KressArray Mapping Example
[Figure: SNN filter mapped onto a KressArray of 10 × 16 = 160 rDPUs. Legend: rDPU not used; used for routing only (route-through); operator and routing; backbus connect; port location marker; not used]
Xplorer Plot: SNN Filter Example [13]
[Figure: Xplorer plot of the SNN filter mapping with 2 horizontal and 3 vertical NNports, 32 bit each. Legend: route-thru-only rDPU, operator (+), operand, result, route thru, backbus connect]
KressArray: try it out yourself!
• You may experiment yourself
• You may use it over the internet
• Map an application onto a KressArray
• Start with a simple example
• Visit http://kressarray.de
• Click the link to Xplorer
• Try Netscape 4.7x
• ... it does not run on Internet Explorer ...
• ... since Bill Gates does not like Java
Michael Herz
Dissertation Michael Herz: Agilent, Sindelfingen
• ... on mapping parallel memory architectures for stream-based arrays onto KressArrays
• ... also transformation of storage schemes to optimize memory bandwidth
• (MoM scan pattern transformations)
Ulrich Nageldinger
Dissertation Ulrich Nageldinger: Infineon Technologies, Munich
• ... on mapping applications onto KressArrays
• ... simultaneous routing and placement by simulated annealing
• Supporting a huge family of KressArrays
• Fuzzy logic improvement proposal generator
• Profiling
• Design space exploration
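The simulated-annealing idea can be illustrated with a simplified placer. This Python sketch does placement only and estimates routing by Manhattan distance (with nearest-neighbour ports only, an edge of length d needs d - 1 route-through rDPUs); it is not the simultaneous placement-and-routing algorithm of the dissertation, and the names, cooling schedule and move set are assumptions:

```python
import math
import random
from typing import Dict, List, Tuple

Pos = Tuple[int, int]


def wire_cost(edges: List[Tuple[str, str]], place: Dict[str, Pos]) -> int:
    """Total Manhattan length of all dataflow edges."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in edges)


def anneal_placement(nodes: List[str], edges: List[Tuple[str, str]], rows: int,
                     cols: int, steps: int = 20000, t0: float = 5.0,
                     seed: int = 1) -> Tuple[Dict[str, Pos], int]:
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    assert len(nodes) <= len(cells), "array too small"
    place = dict(zip(nodes, rng.sample(cells, len(nodes))))   # random start
    cost = wire_cost(edges, place)
    best, best_cost = dict(place), cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-3                  # linear cooling
        a = rng.choice(nodes)
        target = rng.choice(cells)
        trial = dict(place)
        occupant = next((n for n, p in place.items() if p == target), None)
        if occupant is not None:
            trial[occupant] = place[a]                         # swap positions
        trial[a] = target
        new_cost = wire_cost(edges, trial)
        delta = new_cost - cost
        # Metropolis rule: accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            place, cost = trial, new_cost
            if cost < best_cost:
                best, best_cost = dict(place), cost
    return best, best_cost
```

For a 6-operator chain on a 3 × 3 array the lower bound is 5 (every edge between nearest neighbours, no route-through cells); the occasional acceptance of worse moves is what lets the search escape local minima that a greedy placer would get stuck in.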
Rainer Kress
Dissertation Rainer Kress: Infineon Technologies, Munich
• ... on mapping applications onto his KressArray
• DPSS datapath synthesis system
• Including a data scheduler (data stream scheduler)
• Generalization of the systolic array (the KressArray is a super-systolic array)
• 32 bit design via Eurochip support
Jürgen Becker
Dissertation Jürgen Becker: Professor at Univ. Karlsruhe
• ... automatically partitioning co-compiler (configware/software co-compilation)
• Resource-parameter-driven, retargetable
• Profiler-driven optimization
• Accepts the HLL "ALE-X" (extended C subset; pointers not supported)
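Profiler-driven, resource-parameter-driven partitioning can be pictured as a budgeted selection problem: which loops should move from the host processor to the array? The Python sketch below uses a simple greedy heuristic (best time saved per array cell first). It is a hypothetical illustration, not the algorithm of the co-compiler described above; `Loop`, `partition` and all figures are assumptions:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Loop:
    name: str
    cpu_time: float   # profiled time on the host processor (ms)
    speedup: float    # estimated speed-up on the array (must be > 0)
    area: int         # estimated number of array cells (rDPUs) needed


def saved_time(loop: Loop) -> float:
    """Time saved if the loop runs on the array instead of the host."""
    return loop.cpu_time - loop.cpu_time / loop.speedup


def partition(loops: List[Loop], area_budget: int) -> List[str]:
    """Greedy: move loops with the best saving per cell until the area budget is used."""
    chosen: List[str] = []
    used = 0
    for lp in sorted(loops, key=lambda lp: saved_time(lp) / lp.area, reverse=True):
        if lp.speedup > 1 and used + lp.area <= area_budget:
            chosen.append(lp.name)
            used += lp.area
    return chosen
```

With loops `fir` (40 ms, 10×, 30 cells), `fft` (30 ms, 5×, 60 cells) and `ctrl` (20 ms, no speed-up, 10 cells), a budget of 64 cells selects only `fir`, while 100 cells selects `fir` and `fft`; `ctrl` stays in software because it gains nothing. Changing `area_budget` is the "resource parameter" that retargets the result.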
Karin Schmidt
Dissertation Karin Schmidt: DaimlerChrysler Research
• Compilation techniques for Xputers
• Modified loop transformations
• Modified parts of the implementation used for Jürgen Becker's Ph.D. thesis
CHESS Array w. embedded RAM (Elixent)
[Figure: CHESS array of ALUs interleaved with embedded RAM blocks, plus sequencer, memory interface, user registers and clock control; multi-granular, e.g. 16 × 4 bits = 64 bits]
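"Multi-granular, e.g. 16 × 4 bits = 64 bits" means a wide operation is built from narrow ALUs by chaining their carries. A Python sketch, assuming a ripple-carry chain between neighbouring 4-bit ALU slices (how CHESS actually routes its carries is not described here); `alu4_add` and `chained_add` are hypothetical names:

```python
from typing import Tuple

NIBBLE = 4
MASK = (1 << NIBBLE) - 1


def alu4_add(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """One 4-bit ALU slice: returns (4-bit sum, carry out)."""
    total = a + b + carry_in
    return total & MASK, total >> NIBBLE


def chained_add(a: int, b: int, slices: int = 16) -> int:
    """Add two (4 * slices)-bit numbers on a chain of 4-bit slices."""
    carry, result = 0, 0
    for i in range(slices):
        shift = i * NIBBLE
        s, carry = alu4_add((a >> shift) & MASK, (b >> shift) & MASK, carry)
        result |= s << shift
    return result   # final carry out dropped: result is modulo 2**(4 * slices)
```

With the default 16 slices, `chained_add(2**64 - 1, 1)` wraps to 0, exactly like a native 64-bit adder; with `slices=2` the same slices form an 8-bit adder, which is the flexibility the "multi-granular" label refers to.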
Chameleon Systems
• RISC processor and an array of 108 arithmetic processing units; each of these 32-bit processing cores runs at 125 MHz.
• The CS2112 is the industry's first Reconfigurable Communications Processor (RCP), a streaming data processor.
• The vendor claims a performance of 20 billion 16-bit operations per second, 2.4 billion 16-bit multiply-accumulates per second, and 1.6 GBytes/s for its programmable I/O (PIO) banks.
• It also has a PCI interface.
• Tool suite C~SIDE for developing, verifying and optimizing.
Coarse Grain Architectures
project | first publ. | source | architecture | granularity | fabrics | mapping | intended target application
mesh:
DP-FPGA | 1994 | [4] | 2-D array | 1 & 4 bit, multi-granular | inhomogeneous routing channels | switchbox routing | regular datapaths
KressArray | 1995 | [5,11] | 2-D mesh | family: selectable pathwidth | multiple NN & bus segments | (co-)compilation | (adaptable)
Colt | 1996 | [12] | 2-D array | 1 & 16 bit | inhomogeneous | run time reconfiguration | highly dynamic reconfiguration
Matrix | 1996 | [15] | 2-D mesh | 8 bit, multi-granular | 8NN, length 4 & global lines | multi-length | general purpose
RAW | 1997 | [17] | 2-D mesh | 8 bit, multi-granular | 8NN switched connections | switchbox routing | experimental
Garp | 1997 | [16] | 2-D mesh | 2 bit | global & semi-global lines | heuristic routing | loop acceleration
REMARC | 1998 | [18] | 2-D mesh | 16 bit | NN & full length buses | (info not available) | multimedia
MorphoSys | 1999 | [19] | 2-D mesh | 16 bit | NN, length 2 & 3 global lines | manual P&R | (not disclosed)
CHESS | 1999 | [20] | hexagon | 4 bit, multi-granular | 8NN and buses | JHDL compilation | multimedia
DReAM | 2000 | [21] | 2-D array | 8 & 16 bit | NN, segmented buses | co-compilation | next generation wireless communication
CS2000 family | 2000 | [23] | 2-D array | 16 & 32 bit | inhomogeneous array | (not disclosed) | tele- & datacommunication
MECA family | 2000 | [24] | 2-D array | multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
CALISTO | 2000 | [25] | 2-D array | 16 bit, multi-granular | (not disclosed) | (not disclosed) | tele- & datacommunication
FIPSOC | 2000 | [26] | 2-D array | 4 bit, multi-granular | (not disclosed) | (not disclosed) | –
linear:
RaPID | 1996 | [27] | 1-D array | 16 bit | segmented buses | channel routing | pipelining
PipeRench | 1998 | [29] | 1-D array | 128 bit | (sophisticated) | scheduling | pipelining
crossbar:
PADDI | 1990 | [30] | crossbar | 16 bit | central crossbar | routing | DSP
PADDI-2 | 1993 | [32] | crossbar | 16 bit | multiple crossbar | routing | DSP and others
Pleiades | 1997 | [33] | mesh + crossbar | multi-granular | multiple segmented crossbar | switchbox routing | multimedia
Primarily Mesh-based ….
market | project | bits granularity | source
research | KressArray | variable | U. Kaiserslautern
research | Garp | 2 | UC Berkeley
research | CHESS | 4 | Hewlett Packard
research | Matrix | 8 | M.I.T.
research | RAW | 8 | M.I.T.
research | Colt | 1 & 16 | Virginia Tech
research | DReAM | 8 & 16 | TU Darmstadt
research | REMARC | 16 | Stanford
research | MorphoSys | 16 | UC Irvine
commercial | CALISTO | 16 | Silicon Spice
commercial | MECA family | 16 | Malleable
commercial | CS2000 family | 16 & 32 | Chameleon Systems
commercial | FIPSOC | 16 & analog | SIDSA
commercial | XPP XPU128 | 32 | PACT Corp.
UC Berkeley (Jan Rabaey)
market | project | bits granularity | source
research | PADDI | 16 | UC Berkeley
research | PADDI-2 | 16 | UC Berkeley
research | Pleiades | 16 | UC Berkeley
http://www.fpl.uni-kl.de
Xputer Lab
University of Kaiserslautern
Crossbar-based Architectures
16 bit
C
T
L
EXU
1990: UC Berkeley
(Jan Rabaey)
1993: PADY-II
(Jan Rabaey)
1997: Pleiades
(mesh & crossbar)
C
T
L
EXU
C
T
L
EXU
C
T
L
EXU
crossbar switch
I/O
I/O
C
T
L
EXU
C
T
L
EXU
C
T
L
EXU
C
T
L
EXU
32 bit
© 2001, [email protected]
71
http://www.fpl.uni-kl.de
PADDI-II Architecture
[Figure: 48 processing elements P1–P48 grouped into 4-PE clusters. Level-1 network: a 16 × 6 switch matrix per cluster; Level-2 network: 16 × 16b buses with break-switches, 6 × 16b; I/O ports]
MorphoSys
PipeRench Architecture (CMU 1998)
• alternating data/instruction stream
• highly dynamic reconfiguration
MATRIX (1996): Multiple ALU archiTecture with Reconfigurable Interconnect eXperiment
M.I.T., 0.5 µm CMOS, 8 bit, 10 × 10, 1.8 mm², 100 MHz
[Figure: MATRIX basic functional unit (BFU): 256 × 8 bit memory, 8 bit ALU, network ports A and B, ALU and memory function ports, compare/reduce logic with C/R network, Level-1 network and global lines. Operation repertory (opcodes 0–15) includes ×, ×+, ×++, × const, insh, nsh, dsh, csh, +, +0, +1, :=, nand, nor, xor]
RAW (M.I.T. 1997): Reconfigurable Architecture Workbench
[Figure: MIPS-like processor core; labels: crossbar mode, WE, global lines]
MATRIX Interconnect Fabrics
Communication resources are often the bottleneck.
[Figure: a BFU and its neighbouring BFUs]
More Research Projects
published between 1996 and 2000:
• Garp (UC Berkeley)
• RaPiD (U. Washington)
• REMARC (Stanford)
• DReAM (U. Karlsruhe)
• .... and others
Asia / Pacific: also see the embedded tutorials by Prof. Amano (ASP-DAC'99, FPL 2000)
RaPiD Architecture
[Figure: linear RaPiD datapath: a multiplier (MULT), ALUs, RAMs and datapath registers, attached to segmented buses through input multiplexers, output drivers and bus connectors]
REMARC
Future Coarse Grain RA Development
• It is indispensable to operate within the convergence area of compilers, co-compilers, architecture and full-custom-style VLSI design (array cells).
• It is a must that products come with a development platform which encourages users, especially those with a limited hardware background.
>> Reconfiguration Architecture
Dimensions of Reconfigurability
ASIPs* vs. Network Processors
*) Application-Specific Instruction set Processors
[Figure: configuration time axis: design time (ASIP), fabrication time, compile time (statically reconfigurable network processor), run time (dynamically reconfigurable network processor)]
Extremes:
class of product | processor | vendor
ASIP | Tensilica | Tensilica
Network Processor | MECA family | Malleable
Network Processor | CALISTO | SiliconSpice
Network Processor | many others | many others
Configuration Architectures (dynamic vs. static)
[Figure: a host (compiler, mapper, RTOS, etc.) with RAM configures a soft data path in three ways. Straightforward: the host loads the soft data path's configuration RAM directly. Configuration caching*: a configuration RAM backed by a cache RAM. Multi-context: several configuration RAMs per soft data path, switched dynamically]
*) no cache as usual!
Configuration loading resources:
• separate configuration fabrics (e.g. FPGA)
• wormhole routing (KressArray, Colt, PipeRench)
• one RA part computes code for another RA part (dynamic self-reconfiguration)
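The benefit of multi-context configuration memory can be estimated by counting cycles. The Python sketch below assumes that switching to a resident context costs one cycle, that loading a missing context from host RAM costs `load_cycles`, and that the least recently used context is evicted; the replacement policy, costs and the class name `MultiContextFabric` are assumptions (the slide itself stresses that configuration caching is not a cache in the usual sense):

```python
from collections import OrderedDict


class MultiContextFabric:
    """Soft data path with a fixed number of on-chip configuration contexts."""

    def __init__(self, n_contexts: int, load_cycles: int):
        self.n_contexts = n_contexts
        self.load_cycles = load_cycles
        self.resident = OrderedDict()   # context name -> None, oldest first
        self.cycles = 0
        self.loads = 0

    def activate(self, config: str) -> None:
        if config in self.resident:
            self.resident.move_to_end(config)
            self.cycles += 1                          # fast context switch
            return
        if len(self.resident) == self.n_contexts:
            self.resident.popitem(last=False)         # evict least recently used
        self.resident[config] = None
        self.loads += 1
        self.cycles += self.load_cycles               # slow load from host RAM
```

For the sequence fir, fft, fir, fft, dct, fir with two contexts and 100 cycles per load, the fabric performs 4 loads and spends 402 cycles; a single-context fabric would reload on every switch (6 loads, 600 cycles).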
Colt Architecture (P. Athanas 1996)
Studying highly dynamic reconfiguration
[Figure: array of 16 interconnected functional units (IFUs) configured via wormhole routing, with a multiplier, datapaths (DP), I/O pins and a smart crossbar]
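Wormhole routing of configuration data (as listed for KressArray, Colt and PipeRench) can be illustrated on a linear chain of cells: a head flit carrying the destination sets up a path, body flits carrying configuration words follow it, and a tail flit releases the path. The Python sketch assumes this flit format, a contention-free linear chain, and that each flit is handled completely before the next; it is not the actual Colt or KressArray protocol, and `Cell`, `packet` and `send` are hypothetical names:

```python
from typing import List, Tuple

Flit = Tuple[str, int]   # ("head", destination), ("body", word) or ("tail", 0)


class Cell:
    def __init__(self, index: int):
        self.index = index
        self.config: List[int] = []
        self.mode = "idle"        # "consume" or "forward" while a packet passes

    def route(self, flit: Flit) -> bool:
        """Handle one flit; return True if it must go on to the next cell."""
        kind, value = flit
        if kind == "head":        # the head sets up the path
            self.mode = "consume" if value == self.index else "forward"
            return self.mode == "forward"
        forward = self.mode == "forward"
        if kind == "body" and self.mode == "consume":
            self.config.append(value)                 # store configuration word
        if kind == "tail":
            self.mode = "idle"                        # the tail releases the path
        return forward


def packet(dest: int, words: List[int]) -> List[Flit]:
    return [("head", dest)] + [("body", w) for w in words] + [("tail", 0)]


def send(chain: List[Cell], flits: List[Flit]) -> None:
    for flit in flits:
        for cell in chain:
            if not cell.route(flit):
                break
```

Only the head flit carries an address; body flits simply follow the path the head has reserved, so configuration words need no per-word addressing.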
- END -