Design Productivity Crisis
xPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs
Prof. Jason Cong
UCLA Computer Science Department
Motivation
Design complexity is outgrowing the traditional RTL method even in current CMOS technologies
Nanotechnology will enable a 10-100x increase in device density and degree of integration
Need to enable a higher level of design abstraction
• Start from behavior descriptions (e.g. C or SystemC)
• Use and/or re-use more complex functional units (e.g. processor cores instead of standard cells)
ESL Tools – A Lot of Interest …
xPilot: Platform-Based Synthesis System
[Figure: xPilot flow diagram. Labels recovered from the figure:]
Inputs: SystemC/C; Platform Description & Constraints
xPilot Front End; Profiling
SSDM (System-Level Synthesis Data Model)
Analysis; Mapping
Processor & Architecture Synthesis (output: Processor Cores + Executables)
Interface Synthesis (output: Drivers + Glue Logic)
Behavioral Synthesis (output: Custom Logic)
Target: FPSoC
Uniqueness of xPilot
Platform-based synthesis and optimization
Communication-centric synthesis with interconnect optimization
xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC; platform description
Frontend compiler (produces SSDM)
Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / Tree height reduction
• Bitwidth analysis
• Memory analysis …
Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding
Arch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …
Output: RTL + constraints for FPGAs/ASICs
System-Level Exploration Using xPilot for
Heterogeneous MPSoC Platforms
Heterogeneous MPSoCs exploration
Processors
• Heterogeneous vs. homogeneous
• General-purpose vs. application-specific
On-chip communication architecture (OCA)
• Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
Memory hierarchy
[Figure: heterogeneous MPSoC platform. Processing elements (μPs running tasks, OS, and driver; an IP block; an FPGA; a DSP) each connect through network interfaces to a shared Communication Network.]
Outline
xPilot Overview
• Behavior-level synthesis in xPilot
• System-level synthesis in xPilot
Recent Progress in xPilot
• Interface synthesis
• Resource binding based on distributed register architecture
Conclusions
Advantage of Behavior Synthesis
Shorter verification/simulation cycle
Better complexity management, faster time to market
Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality
Example: Better Complexity Management
Shorter verification/simulation cycle
Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
Significant code size reduction
RTL design ~300K lines vs. behavioral design 40K lines [NEC, ASPDAC04]
VHDL code generated by UCLA xPilot targeting Altera Stratix platform
Over 10x code size reduction can be achieved
Unique Features of xPilot (1):
Platform-based Synthesis & Optimization
Platform-based synthesis & optimization
• The quality of an RTL design is platform-dependent
• Designers often lack complete and detailed knowledge of the target platform
Platform: Altera Stratix (RTL synthesis & place-and-route: Altera QuartusII v5.0)

Resource          Area            Delay (ns)
ADDSUB-24b        25 LUTs         2.27
ADDSUB-32b        33 LUTs         2.61
MUX8to1-24b       120 LUTs        2.92
MUX16to1-24b      264 LUTs        4.658
DSPMUL-18bx18b    2 DSP Blocks    3.833
DSPMUL-24bx24b    8 DSP Blocks    7.688

3x3 delay matrix, spanning locations (0,0) to (95,61):
0.58    1.8    2.8
2.0     2.9    3.7
2.8     3.8    4.7
Unique Features of xPilot (2):
Communication-Centric Synthesis & Optimization
System performance and power are dominated by interconnect
It is difficult for designers to consider physical layout at the RT level
[Figure: CDFG with a true/false (T/F) branch, adders add1 and add2, multiplications 2* to 6*, and data-transfer edges, bound onto two multipliers mul1 and mul2.]
Layout-aware performance optimization
• Overlap computation with communication
Layout-aware power optimization
• Binding solution 1: mul1 (2,4,5), mul2 (3,6). Both multipliers stay active
• Binding solution 2: mul1 (2,5,6), mul2 (3,4). mul2 can be powered off when the false branch is taken
Unique Features of xPilot (3):
Highly Scalable and Optimized Synthesis Algorithms
Use of highly scalable and optimized synthesis algorithms for best quality of results
• Interface synthesis: Simultaneous data and communication scheduling for latency minimization
• Scheduling: A unified framework for multi-constraint and multi-objective scheduling based on the system of difference constraints (SDC)
• Resource binding: Use of distributed register architectures for interconnect/communication optimization
• Power optimization: Optimal functional module and voltage binding
…
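The SDC idea named in the scheduling bullet can be illustrated with a small example. In a system of difference constraints every constraint has the form s[u] - s[v] <= c, where s[x] is the start cycle of operation x, and a feasible assignment can be found with Bellman-Ford relaxation. The sketch below is only an illustration of that general technique, not xPilot's scheduler; the function name `solve_sdc`, the operation names, and the latency bound are assumptions made for this example.

```python
def solve_sdc(variables, constraints):
    """Solve constraints (u, v, c) meaning s[u] - s[v] <= c.

    Returns a dict of non-negative start cycles, or None if infeasible.
    """
    # Starting all distances at 0 models a virtual source with a
    # zero-weight edge to every variable.
    dist = {x: 0 for x in variables}
    for _ in range(len(variables)):
        changed = False
        for u, v, c in constraints:
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            break
    else:
        # Still relaxing after |V| passes: negative cycle, so infeasible.
        return None
    base = min(dist.values())
    return {x: d - base for x, d in dist.items()}


ops = ["a", "b", "c", "d"]
# Hypothetical unit-latency dependencies a->b, b->c, a->d:
# the producer must start at least one cycle before the consumer.
deps = [("a", "b", -1), ("b", "c", -1), ("a", "d", -1)]
# Hypothetical timing constraint: c starts at most 2 cycles after a.
schedule = solve_sdc(ops, deps + [("c", "a", 2)])
print(schedule)

# Tightening the bound to 1 cycle contradicts the chain a->b->c.
infeasible = solve_sdc(ops, deps + [("c", "a", 1)])
print(infeasible)
```

Dependency, latency, and relative-timing constraints all fit this one form, which is what makes a unified multi-constraint formulation possible; optimizing an objective on top of feasibility is outside this sketch.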
Behavior and Communication Co-Optimization for Systems with SCM
SCM: Sequential Communication Media
• FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
• Data must be read and written in the same order
Order may have a dramatic impact on performance
• The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
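DCT example (figure): two processes, P1 and P2, mapped to PE1/custom logic 1 and PE2/custom logic 2, connected by a FIFO channel C that carries data[8]
Producer side:
for (int i = 0; i < 8; i++) {
S1: data[i] = …;
}
Consumer side:
int s07 = data[0] + data[7];
int s16 = data[1] + data[6];
…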
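The effect of transmission order on a sequential medium can be shown with a toy model. The sketch below assumes a FIFO that delivers one item per cycle, and a consumer in which each operation starts once all of its inputs have arrived and then needs a fixed number of further cycles (`tail`). The operation names, data names, and cycle counts are hypothetical and are not taken from the DCT example.

```python
from itertools import permutations


def consumer_latency(order, needs, tail):
    """Cycle at which the consumer finishes for a given send order."""
    # Item at position t in the send order arrives at cycle t + 1.
    arrival = {d: t + 1 for t, d in enumerate(order)}
    return max(
        max(arrival[d] for d in needs[op]) + tail[op]
        for op in needs
    )


# Hypothetical consumer: one long dependent chain fed by d3,
# one short operation fed by d0, d1, d2.
needs = {"long_chain": ["d3"], "short_op": ["d0", "d1", "d2"]}
tail = {"long_chain": 4, "short_op": 1}

in_order = ["d0", "d1", "d2", "d3"]
naive = consumer_latency(in_order, needs, tail)

# Exhaustive search is only viable for tiny examples like this one.
best_order = min(
    permutations(in_order),
    key=lambda o: consumer_latency(o, needs, tail),
)
best = consumer_latency(best_order, needs, tail)
print(naive, best_order, best)
```

Sending the critical item first shortens the total latency in this model because the non-critical items no longer delay the long chain.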
SCM Co-Optimization Problem Formulation
Given:
• A set of processes P connected by a set of channels C
• A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
Goal:
• Find the optimal transmission order of each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
• At the same time, generate the drivers and glue logic for each process automatically
Proposed SCM Co-Optimization Design Flow
[Figure: SCOOP design flow. Labels recovered from the figure:]
Inputs: Platform Description & Constraints; Process Network
Front End
System-Level Synthesis Data Model
SCOOP (SCM Co-Optimization)
• Communication order detection
• Indices compression for loop reordering
• Code transformation and interface generation
Outputs: Drivers + Glue Logic; Process Behavior
Communication Order Detection
Step 1. Construct a global CDFG by merging the individual CDFGs of each process
Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: CDFGs of Process 1 and Process 2 with + and * operations, connected by FIFO transfers T1, T2, T3 (Ti: FIFO). Two schedules are shown, one with latency = 5 cycles and one with latency = 7 cycles.]
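The two steps above can be sketched on a toy process network. The example below merges the operations of two processes and their FIFO transfers into one dependency graph, then runs unit-latency list scheduling with the FIFO modeled as a resource that carries one transfer per cycle. List scheduling is a heuristic swapped in here for illustration; the slides do not say which scheduler xPilot uses, and every operation name, resource name, and capacity below is an assumption.

```python
from collections import defaultdict


def list_schedule(deps, resource, capacity):
    """Unit-latency, resource-constrained list scheduling (heuristic)."""
    succs = defaultdict(list)
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # Priority: number of ops on the longest path from op to a sink.
    prio = {}

    def longest(op):
        if op not in prio:
            prio[op] = 1 + max((longest(s) for s in succs[op]), default=0)
        return prio[op]

    for op in deps:
        longest(op)

    start, cycle = {}, 0
    while len(start) < len(deps):
        used = defaultdict(int)
        # Ready: every predecessor started in an earlier cycle.
        ready = [op for op in deps if op not in start
                 and all(p in start and start[p] < cycle for p in deps[op])]
        for op in sorted(ready, key=lambda o: (-prio[o], o)):
            r = resource[op]
            if used[r] < capacity[r]:
                used[r] += 1
                start[op] = cycle
        cycle += 1
    return start


# Global graph: producer ops p_*, FIFO transfers T_*, consumer ops q_*.
deps = {
    "p_a": [], "p_b": [], "p_c": [],
    "T_a": ["p_a"], "T_b": ["p_b"], "T_c": ["p_c"],
    "q_mul": ["T_c"], "q_mul2": ["q_mul"],
    "q_add1": ["T_a", "T_b"], "q_add2": ["q_mul2", "q_add1"],
}
resource = {"p_a": "p1_alu", "p_b": "p1_alu", "p_c": "p1_alu",
            "T_a": "fifo", "T_b": "fifo", "T_c": "fifo",
            "q_mul": "p2_mul", "q_mul2": "p2_mul",
            "q_add1": "p2_alu", "q_add2": "p2_alu"}
capacity = {"p1_alu": 3, "fifo": 1, "p2_mul": 1, "p2_alu": 1}

start = list_schedule(deps, resource, capacity)
order = sorted((op for op in start if op.startswith("T_")), key=start.get)
latency = max(start.values()) + 1
print(order, latency)
```

The transfer feeding the longest consumer chain is sent first, and the resulting order of the transfer nodes is the communication order that both processes would then follow.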
Loop Indices Compression
Given the optimal order, we try to generate restructured loops for code compression
i.e., given the original iteration and reordered iteration, find the minimum number of linear intervals to represent the new iteration space
Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
Need to solve the linear system
\[
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix}
\begin{pmatrix} i \\ j \\ 1 \end{pmatrix}
\]
Solution: i' = j, j' = i
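The 2x2 example can be checked mechanically. The sketch below looks for integer coefficients (a, b, c) such that a*i + b*j + c reproduces each reordered index from the original (i, j). A brute-force search over a small coefficient range stands in for actually solving the linear system; the function name `fit_affine_row` and the search range are assumptions made for this illustration.

```python
from itertools import product


def fit_affine_row(points, targets, coeffs=range(-2, 3)):
    """Find (a, b, c) with a*i + b*j + c == target for every point."""
    for a, b, c in product(coeffs, repeat=3):
        if all(a * i + b * j + c == t
               for (i, j), t in zip(points, targets)):
            return a, b, c
    return None  # no single affine map fits; more intervals are needed


original = [(0, 0), (0, 1), (1, 0), (1, 1)]
reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]

row_i = fit_affine_row(original, [p[0] for p in reordered])
row_j = fit_affine_row(original, [p[1] for p in reordered])
print(row_i, row_j)

# Applying the recovered map to the original iteration space
# must reproduce the reordered one.
mapped = [(row_i[0] * i + row_i[1] * j + row_i[2],
           row_j[0] * i + row_j[1] * j + row_j[2])
          for i, j in original]
print(mapped == reordered)
```

Here one affine map covers the whole reordered space, so a single restructured loop nest suffices. When no such map exists, the iteration space would have to be split into several linear intervals, which this sketch does not attempt.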
Preliminary Experimental Results
Experimental setting
Target communication model: two-process producer-consumer model
Behavioral synthesizer: UCLA xPilot
RTL simulator: Mentor ModelSim

Designs    Total latency (Cycle#)           RAs Compress
           Trad.    SCOOP    Reduction      Before   After
DCT1       325      290      10.77%         0        0
Haar       142      134      5.63%          0        0
DWT        689      617      10.45%         0        0
Mat_mul    408      339      16.91%         96       20
DCT2       483      419      13.25%         80       64
Masking    620      420      32.26%         192      0
Dot        1903     1084     43.04%         300      0

An average of 26% improvement in total latency can be achieved.
Advantage of Register-File Microarchitectures
[Figure: (a) A scheduled DFG with register binding indicated on each variable; (b) binding using discrete registers; (c) binding using a register file.]
Distributed Register-File Microarchitecture
[Figure: DRFM on an FP-SoC, partitioned into islands (A, B, C). Labels inside an island: Local Register File (on-chip memory blocks), Functional Unit Pool (MUL, ALU, ALU') with FUP MUX, Input Buffers, Data-Routing Logic.]

On-chip RAM resource (Virtex II and Stratix)

Xilinx XC-2V      2000    3000    4000    6000    8000
#18Kb BRAM        56      96      120     144     168
Dist. RAM (Kb)    336     448     720     1,056   1,456

Altera EP1        S25     S30     S40     S60     S80
#M512 (512b)      224     295     384     574     767
#M4K (4Kb)        138     171     183     292     364
#M-RAM (512Kb)    2       4       4       6       9
Resource Binding for DRFM
Facts under simplified assumptions
• Operations bound onto an island form a chain in the given scheduled DFG
• Inter-chain data transfers may share a physical inter-island connection
• The number of inter-island connections is crucial to the QoR of a DRFM instance
[Figure: scheduled DFG with variables v1 to v10 over control steps 1 to 4, bound onto islands A, B, C, D.]
Inter-island connections in the example
• (A,B)=(A,D)=1
• (A,C)=1, two data transfers share one connection
• (C,D)=2
Resource Binding Problem for DRFM
General DRFM binding problem
Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the quality of B is optimized.
• Hard to characterize the quality of binding solution B
• The problem is too ad hoc
Relaxed problem – DRFM Binding for Minimizing Inter-Island Connections:
Given a scheduled DFG G and a DRFM M, find a feasible resource binding B(G,M) so that the total number of inter-island connections of B is minimized.
Solution: binding one control step at a time with min-cost bipartite matching
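A rough sketch of binding one control step at a time is given below. For each step, the operations scheduled in that step are matched to distinct islands so that the number of newly required inter-island connections is minimal. Several simplifications are assumed and are not from the slides: any island can execute any operation, a connection is a directed (source island, destination island) pair that later transfers reuse for free, and the min-cost matching is found by exhaustive enumeration, which is only practical for a handful of islands. The function name `bind_by_steps` and the DFG are hypothetical.

```python
from itertools import permutations


def bind_by_steps(steps, preds, islands):
    """Bind ops to islands step by step, minimizing new connections."""
    binding = {}   # operation -> island
    links = set()  # (source island, destination island) connections

    def cost(op, isl):
        # New connections needed if `op` is placed on island `isl`.
        sources = {binding[p] for p in preds.get(op, ())}
        return sum(1 for s in sources
                   if s != isl and (s, isl) not in links)

    for ops in steps:
        # Exhaustive min-cost matching of this step's ops to islands.
        best = min(
            permutations(islands, len(ops)),
            key=lambda assign: sum(cost(o, i) for o, i in zip(ops, assign)),
        )
        for op, isl in zip(ops, best):
            for p in preds.get(op, ()):
                if binding[p] != isl:
                    links.add((binding[p], isl))
            binding[op] = isl
    return binding, links


# Hypothetical scheduled DFG: three control steps, two islands.
steps = [["v1", "v2"], ["v3", "v4"], ["v5"]]
preds = {"v3": ["v1"], "v4": ["v2"], "v5": ["v3", "v4"]}

binding, links = bind_by_steps(steps, preds, ["A", "B"])
print(binding)
print(links)
```

Each island ends up holding a chain of dependent operations, and only one inter-island connection is needed in this example. A production flow would replace the enumeration with a polynomial-time assignment algorithm and account for functional unit types.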
Three Experimental Flows for Comparison
[Figure: experimental flow within the xPilot behavioral synthesis system. Labels recovered from the figure:]
xPilot Frontend
SSDM/CDFG
Scheduling algorithms
Scheduled CDFG (STG)
Binding, using one of three flows:
1) Binding on Discrete-Register Microarchitecture
2) Baseline (Random) DRFM Binding
3) DRFM Binding for Minimizing Inter-Island Connections
RTL generation
Xilinx Virtex II
Experimental Results
Xilinx ISE 7.1; Virtex II; Target clock period: 8ns
The baseline DRFM binding results achieve a 46.70% slice reduction over the discrete-register approach
Optimized DRFM binding reduces slices by a further 12.21%
Overall, more than 2X logic slice reduction with a better clock period (7.8%)
[Figure: two bar charts over benchmarks PR, LEE, CHEN, and DIR comparing Discrete-Reg, DRF-Random, and DRF-Opt. Left: area (slices; DRF solutions use on-chip RAM blocks). Right: clock period (ns).]
Conclusions
xPilot can automatically synthesize behavior-level C or SystemC representations to RTL code with the necessary design constraints
Platform-based synthesis with physical planning provides
• Shorter verification/simulation cycle
• Better complexity management, faster time to market
• Rapid system exploration
• Higher quality of results
xPilot can help to explore the efficient use of (multiple) on-chip processors
xPilot can efficiently optimize the software for reconfigurable processors
We are interested in engaging with selected industrial partners to further validate and enhance the technology
Acknowledgements
We would like to acknowledge support from
National Science Foundation (NSF)
Gigascale Systems Research Center (GSRC)
Semiconductor Research Corporation (SRC)
Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members:
Yiping Fan
Guoling Han
Wei Jiang
Zhiru Zhang