Design Productivity Crisis
Platform-Based
Behavior-Level and System-Level Synthesis
Prof. Jason Cong
UCLA Computer Science Department
Outline
Motivation
xPilot
system framework
Behavior-level
synthesis in xPilot
Advantages of behavioral synthesis
Scheduling
Resource binding
System-level
synthesis in xPilot
Synthesis for ASIP platforms
Design exploration for heterogeneous MPSoCs
Conclusions
ASICs SOC Example: Philips Nexperia
[Figure: Philips Nexperia SoC platform for high-end digital video (DVP system silicon). Courtesy Philips]
General-purpose scalable RISC processor: MIPS CPU (PRxxxx) with I$ and D$, 50 to 300+ MHz, 32-bit or 64-bit
Scalable VLIW media processor: TriMedia CPU (TM-xxxx) with I$ and D$, 100 to 300+ MHz, 32-bit or 64-bit
Nexperia system buses: 32-128 bit (PI buses, DVP memory bus)
Memory: SDRAM, MMI
Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, MPEG video, ACCESS CTL., MSP, …
Field-Programmable SOC Example: Xilinx Virtex-4 FPGA
[Figure: Xilinx Virtex-4 FPGA with embedded processors, IP blocks, and H.264/AVC hardware blocks connected by the IBM CoreConnect bus. Courtesy Xilinx]
MicroBlaze soft-core processor: 180 MHz, 166 DMIPS, < ~1300 LUTs
PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
IC Design Steps
[Figure: IC design flow from system-level specification through behavior-level description, RT-level description, generic logic description, gate/circuit design, and placed & routed design to fabrication and packaging; the steps between levels include synthesis, technology mapping, and physical design. ©Sherwani]
Example of a generic logic description:
X = (AB*CD)+(A+D)+(A(B+C))
Y = (A(B+C)+AC+D+A(BC+D))
xPilot: Platform-Based Synthesis System
Inputs: SystemC/C specification; platform description & constraints
xPilot flow:
• xPilot front end and profiling
• SSDM (System-Level Synthesis Data Model)
• Analysis and mapping
• Processor & architecture synthesis: processor cores + executables
• Interface synthesis: drivers + glue logic
• Behavioral synthesis: custom logic
Output: embedded SoC
Uniqueness of xPilot
Platform-based synthesis and optimization
Communication-centric synthesis with interconnect optimization
xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC; platform description
Frontend compiler
SSDM
Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / tree height reduction
• Bitwidth analysis
• Memory analysis …
Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding
Arch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
RTL + constraints
FPGAs/ASICs
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …
Advantages of Behavioral Synthesis
Shorter verification/simulation cycle
• 100X speed up with behavior-level simulation
Better complexity management, faster time to market
• 10M gate design may require 700K lines of RTL code
Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality
Behavior Synthesis Has Been Tried and Failed – Why?
Reasons for previous failures
Lack of a compelling reason: design complexity was still manageable a decade ago
Lack of a solid RTL foundation
Lack of consideration of physical reality
Lack of widely accepted behavior models
xPilot Advantages
Advanced algorithms for platform-based, communication-centric optimization
Platform-based behavior and system synthesis
Communication/interconnect-centric approach
Complete validation through final P&R on FPGAs
Platform Modeling & Characterization
Target platform specification
High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations
• Functional units: adders, ALUs, multipliers, comparators, etc.
• Connectors: mux, demux, etc.
• Memories: registers, synchronous memories, etc.
Chip layout description
• On-chip resource distributions
• On-chip interconnect delay/power estimation
[Figure: two binding solutions for the same behavior, built from MUX and ALU components]
Which one is better?
Answer is platform-dependent: How large/fast are the MUX and ALU?
3X3 Delay Matrix for Stratix-EP1S40:
0.58  1.8  2.8
2.0   2.9  3.7
2.8   3.8  4.7
Advanced Behavior System Algorithms:
Example: Versatile Scheduling Algorithm Based on SDC
Scheduling problem in behavioral synthesis is NP-Complete
under general design constraints
ILP-based solutions are
versatile but very inefficient
Exponential time complexity
[Figure: example data-flow graph with operations *1, +2, +3, +4, *5 scheduled into control steps CS0 and CS1]
Existing Scheduling Techniques for Behavioral Synthesis
Heuristic approach: Fast, but ad hoc (limited efficiency to specific applications)
Data-flow-based scheduling (Targets data-flow-intensive designs, e.g., DSP applications, image processing applications, etc.)
Control-flow-based scheduling (Targets control-flow-intensive designs, e.g., controllers, network protocol processors, etc.)
Exact approach: Versatile, but inefficient (poor scalability)
ILP-based scheduling, e.g., [Huang et al., TCAD’91], etc.
BDD-based symbolic scheduling, e.g., [Radivojevic and Brewer, TCAD’96] …
Scheduling: Our Approach
Overall approach
Current objective: high-performance
Use a system of integer difference constraints to
express all kinds of scheduling constraints
Represent the design objective in a linear function
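Example: data-flow graph with five operations v1 to v5 (additions and multiplications); x_i is the control step assigned to v_i
Platform characterization:
• adder (+/–): 2ns
• multiplier (*): 5ns
Target cycle time: 10ns
Resource constraint: only ONE multiplier is available
Dependency constraints
• v1 → v3 : x3 – x1 ≥ 0
• v2 → v3 : x3 – x2 ≥ 0
• v3 → v4 : x4 – x3 ≥ 0
• v4 → v5 : x5 – x4 ≥ 0
Frequency constraint
• <v2 , v5> : x5 – x2 ≥ 1
Resource constraint
• <v2 , v3> : x3 – x2 ≥ 1
Matrix form A x ≤ b (the resource constraint on <v2 , v3> subsumes the dependency v2 → v3):
\[
\underbrace{\begin{pmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}}_{x}
\le
\underbrace{\begin{pmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{pmatrix}}_{b}
\]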
Totally unimodular matrix: guarantees integral solutions
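The difference constraints above can be solved without a general ILP solver. The following is a minimal sketch, not the xPilot implementation: it swaps the LP solver for a Bellman-Ford style longest-path relaxation, which yields the earliest (ASAP) integer schedule for a system of difference constraints. The function name `asap_schedule` and the 0-based variable indexing are assumptions made for illustration.

```python
def asap_schedule(num_vars, constraints):
    """Solve difference constraints x[u] - x[v] <= c for the earliest
    non-negative integer solution, or return None if infeasible.

    constraints: list of (u, v, c) tuples with 0-based variable indices.
    """
    x = [0] * num_vars
    # With num_vars variables, a feasible system stabilizes within
    # num_vars - 1 passes; a change in pass num_vars means a positive cycle.
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            # x[u] - x[v] <= c  is equivalent to  x[v] >= x[u] - c
            if x[v] < x[u] - c:
                x[v] = x[u] - c
                changed = True
        if not changed:
            return x
    return None


# Rows of A x <= b from the slide, with v1..v5 mapped to indices 0..4.
slide_constraints = [
    (0, 2, 0),   # x1 - x3 <= 0   (dependency v1 -> v3)
    (1, 2, -1),  # x2 - x3 <= -1  (dependency + resource on v2, v3)
    (2, 3, 0),   # x3 - x4 <= 0   (dependency v3 -> v4)
    (3, 4, 0),   # x4 - x5 <= 0   (dependency v4 -> v5)
    (1, 4, -1),  # x2 - x5 <= -1  (frequency constraint on v2, v5)
]
schedule = asap_schedule(5, slide_constraints)
print(schedule)
```

Under these assumptions the result places v1 and v2 in control step 0 and v3, v4, v5 in control step 1. Because every constraint has integer bounds, the relaxation only ever produces integer values, which mirrors the integrality guarantee stated on the slide.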
UPS Scheduling Overall Framework
CDFG
xPilot scheduler
Constraint equations
generation
User-specified design constraints & assignments
Relative timing constraints
Dependency constraints
Frequency constraints
Resource constraints …
Objective function generation
System of pairwise
difference constraints
Linear programming solver
LP solution interpretation
STG (State
Transition Graph)
Target
platform
modeling
(resource
library &
chip layout)
UPS vs. SPARK: Results on SPARK’s Benchmarks
Mult (*): 2 cycles; Div (/): 5 cycles; Rest: one cycle
Target clock period: 7.5ns
Benchmark        SPARK              UPS                UPS / SPARK
                 State#  W. Cycle#  State#  W. Cycle#
MPEG2-dpframe    32      424        35      352        0.83
GIMP-tiler       27      2234       32      1877       0.84
ADPCM-decoder    15      327        13      278        0.85
ADPCM-encoder    16      133        13      112        0.84
Average Ratio                                          0.84
UPS achieves 16% cycle count reduction over SPARK
Platform-Based Interface Synthesis
Focus on sequential communication media (SCM)
FIFOs (e.g., Xilinx FSLs), Buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
Order may have dramatic impact on performance
• Best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
Interface synthesis for SCM
Consider both behavior and communication to determine the optimal transmission
order
DCT example
[Figure: process P1 (custom logic 1 on PE1) sends data[8] through a FIFO channel C to process P2 (custom logic 2 on PE2)]
P1:
for (int i=0; i <8; i++) {
S1: data[i] = …;
}
P2:
int s07 = data[0] + data[7];
int s16 = data[1] + data[6];
…..
SCM Co-Optimization Problem Formulation
Given:
A set of processes P connected by a set of channels in C
A set of data D = {d1, d2, …, dm} to be transmitted on each
channel cj,
Goal:
Find the optimal transmission order of each process, so that
the overall latency of the process network is minimized
subject to the given design constraints and platform
specifications
At the same time, generate the drivers and glue logic for
each process automatically
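To illustrate why transmission order matters on a sequential channel, the following minimal sketch models the DCT example under simplified assumptions that are not taken from the slides: the FIFO delivers one element per cycle, the consumer executes its additions strictly in program order, and each addition takes one cycle. The function name `consumer_finish_time` is hypothetical.

```python
def consumer_finish_time(send_order, pairs, add_latency=1):
    """Cycle at which an in-order consumer finishes all additions.

    send_order: data indices in the order the producer pushes them into
                the FIFO (one element arrives per cycle, starting at cycle 1).
    pairs:      (a, b) operand indices of each addition, in program order.
    """
    arrival = {idx: cycle for cycle, idx in enumerate(send_order, start=1)}
    finish = 0
    for a, b in pairs:
        # An addition starts once both operands have arrived and the
        # previous addition has finished.
        start = max(arrival[a], arrival[b], finish)
        finish = start + add_latency
    return finish


# s07 = data[0] + data[7], s16 = data[1] + data[6], ...
dct_pairs = [(0, 7), (1, 6), (2, 5), (3, 4)]

original_order = [0, 1, 2, 3, 4, 5, 6, 7]    # producer loop order
reordered = [0, 7, 1, 6, 2, 5, 3, 4]         # operands sent pair by pair

print(consumer_finish_time(original_order, dct_pairs))
print(consumer_finish_time(reordered, dct_pairs))
```

In this toy model the original loop order makes the first addition wait for `data[7]`, the last element sent, while the reordered stream lets each addition start as soon as its pair arrives.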
SystemC/C-to-RTL Design Flow
SystemC/C specification
Front-end compiler
xPilot behavioral
synthesis
SSDM
Platform description
& constraints
(System-Level
Synthesis
Data Model)
SSDM/CDFG
Behavioral synthesis
SSDM/FSMD
RTL generation
FSM with Datapath
in VHDL
Floorplan and/or multicycle path constraints
RTL synthesis
ASICs/FPGAs platform
Preliminary Results of xPilot
Better Complexity Management
Significant code size reduction
RTL design → Behavioral design: 10x code size reduction
VHDL code generated by UCLA xPilot targeting Altera
Stratix platform
Design Exploration for Heterogeneous MPSoC Platforms
Heterogeneous MPSoCs exploration
Processors
• Heterogeneous vs. homogeneous
• General-purpose vs. application-specific
On-chip communication architecture (OCA)
• Bus (e.g. AMBA, CoreConnect), packet switching network
(e.g. Alpha 21364)
Memory hierarchy
[Figure: heterogeneous MPSoC in which processor tiles (μP running tasks, OS, and drivers), an IP block, an FPGA, and a DSP each attach through network interfaces to a shared communication network]
Configurable SoC Platforms
General purpose processor cores +
programmable fabric
Tight integration using extended instructions (ASIPs)
• Example: Altera Nios / Nios II
Loose integration using FIFOs/busses for communications
• Example: Xilinx MicroBlaze, etc.
Custom instruction logic for Nios II
[source: www.altera.com]
Xilinx MicroBlaze
[source: www.xilinx.com]
ASIP Compilation: Problem Statement
Given:
CDFG G(V, E)
The basic instruction set I
Pattern constraints:
• Number of inputs: |PI(p_i)| ≤ N_in
• Number of outputs: |PO(p_i)| = 1
• Total area: \sum_{1 \le i \le N} area(p_i) \le A
Objective:
Generate a pattern library P
Map G to the extended instruction set I ∪ P, so that the total execution time is minimized
Example (* takes 2 clock cycles, + takes 1 clock cycle):
Original code:
t1 = a * b;
t2 = b * c;
t3 = d * e;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;
With extended instructions (ext-inst1 = MAC1: 2 cycles; ext-inst2 = MAC2: 2 cycles):
t4 = ext-inst1(a, b, c);
t5 = ext-inst2(b, c, d, e);
t6 = t4 + t5;
Performance speedup = 9 / 5 = 1.8X
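The speedup figure in this example follows from summing per-instruction latencies on a single-issue core. A minimal sketch of that arithmetic, using the latencies stated on the slide; the dictionary and function names are chosen here for illustration only:

```python
# Latencies in clock cycles, as given in the example.
LATENCY = {"*": 2, "+": 1, "ext-inst1": 2, "ext-inst2": 2}

# Original code: three multiplies followed by three adds.
original_ops = ["*", "*", "*", "+", "+", "+"]
# Mapped code: two extended (MAC-style) instructions and one add.
extended_ops = ["ext-inst1", "ext-inst2", "+"]


def total_cycles(ops):
    """Execution time when instructions issue one at a time."""
    return sum(LATENCY[op] for op in ops)


speedup = total_cycles(original_ops) / total_cycles(extended_ops)
print(total_cycles(original_ops), total_cycles(extended_ops), speedup)
```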
Target Core Processor Model
Core processor model
Classic single-issue pipelined RISC core (fetch / decode / execute / mem /
write-back)
• The number of input and output operands of an instruction is pre-determined
• An instruction reads the core register file during the execute stage, and commits
the result during the write-back stage
[Figure: core processor pipeline (PC, Inst Cache, IF/ID, Reg File with RS1/RS2, ID/EX, ALU with OP1/OP2, EX/MEM, Memory, MEM/WB, MUX) with custom logic attached between the register-file operands and the result]
ASIP Compilation Flow
C code
Front-end compilation
CDFG
1. Pattern generation
• Satisfying input/output constraints (arch constraint)
2. Pattern selection
• Select a subset to maximize the potential speedup while satisfying the resource constraint
Pattern library
3. Application mapping & graph covering
• Graph covering to minimize the total execution time
Optimized CDFG
Backend compilation
Optimized assembly
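Pattern selection under an area budget resembles a 0/1 knapsack problem. The sketch below is an illustrative stand-in, not the selection algorithm used in xPilot: it assumes each candidate pattern has an independent estimated gain and an area cost, and all pattern names and numbers are made up.

```python
def select_patterns(patterns, area_budget):
    """Pick a subset of patterns maximizing total gain within the area budget.

    patterns: list of (name, gain, area) tuples.
    Returns (best_gain, selected_names).
    """
    # best maps "area used" -> (gain, names) for the best subset found so far.
    best = {0: (0, [])}
    for name, gain, area in patterns:
        # Iterate over a snapshot so each pattern is used at most once.
        for used, (g, names) in list(best.items()):
            new_used = used + area
            if new_used > area_budget:
                continue
            candidate = (g + gain, names + [name])
            if new_used not in best or candidate[0] > best[new_used][0]:
                best[new_used] = candidate
    return max(best.values(), key=lambda entry: entry[0])


# Hypothetical candidates: (name, estimated cycles saved, area units).
candidates = [("MAC1", 2, 30), ("MAC2", 3, 50), ("ADD3", 1, 20)]
gain, chosen = select_patterns(candidates, area_budget=80)
print(gain, chosen)
```

A real flow would also have to account for overlap between patterns in the CDFG, which this independent-gain model ignores.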
Experimental Results on Altera Nios
Altera Nios is used for ASIP implementation
5 extended instruction formats
up to 2048 instructions for each format
Small DSP applications are taken as benchmark
Benchmark  Extended      Speedup            Resource Overhead
           Instruction#  Estimation  Nios   LE            Memory           DSP Block
fft_br     9             3.28        2.65   408 (6.06%)   65,536 (9.79%)   16
iir        7             3.18        3.73   255 (3.79%)   4,736 (0.71%)    40
fir        2             2.40        2.14   51 (0.76%)    1,024 (0.15%)    8
pr         2             1.57        1.75   71 (1.05%)    0 (0.00%)        14
dir        2             3.28        3.02   54 (0.80%)    0 (0.00%)        16
mcm        4             4.75        3.22   186 (2.76%)   0 (0.00%)        56
Average                  3.08        2.75   - (2.54%)     - (1.77%)        -
Architecture Extension for ASIPs
Data bandwidth problem
• Limited register file bandwidth (two read ports, one write port)
• ~40% of the ideal performance speedup will be lost
Shadow-register-based architectural extension
Core registers are augmented by an extra set of shadow registers
• Conditionally written during write-back stage
• Low power/area overhead
Novel shadow-register binding algorithms are developed
[Figure: core processor pipeline (PC, Inst Cache, IF/ID, Reg File, ID/EX, ALU, EX/MEM, Memory, MEM/WB) extended with shadow registers SR1 … SRK feeding the custom logic; a hashing unit selects the shadow register, k = hash(j)]
Ongoing Work -- Mapping for Heterogeneous
Integration with Multiple Processing Cores
Given:
A library of processing cores P and communication library C
Task graph G(V, E)
• For each v in V, execution time t(v, pi) on pi
• For each (u, v) in E, communication data size s(u,v)
Throughput constraint
Problem:
Select and instantiate the processing elements and communication channels
from P and C respectively
Map the tasks onto the processing elements and communications to the
channels so that
• The optimal latency is achieved subject to the throughput constraint
• The implementation cost is minimized
Preliminary Results on Motion-JPEG Example
Motion-JPEG pipeline: RAW images → Preprocess → DCT → Quant → Huffman → encoded JPEG images, with Table Modification
Model #1: 5 MicroBlazes, FSL-based communication
Model #2: 4 MicroBlazes + DCT on FPGA fabric (HW-DCT)
System    Cycle#         Fmax (MHz)  Exe Time (ms)  Area (Slice#)
Model #1  23812          126         0.189          4306
Model #2  14800 (-38%)   126         0.117          6345
Platform: Xilinx XUP Board
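The execution times reported for the two models follow directly from cycle count divided by clock frequency. A short check of that arithmetic, using only the numbers reported above; the helper name `exec_time_ms` is an assumption for this sketch:

```python
def exec_time_ms(cycles, fmax_mhz):
    """Execution time in milliseconds for a given cycle count and clock."""
    return cycles / (fmax_mhz * 1e3)  # MHz * 1e3 = cycles per millisecond


model1_ms = exec_time_ms(23812, 126)
model2_ms = exec_time_ms(14800, 126)
cycle_reduction = 1 - 14800 / 23812

print(round(model1_ms, 3), round(model2_ms, 3), round(cycle_reduction * 100))
```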
Conclusions
xPilot has fairly mature and advanced behavior synthesis capability from C or SystemC to RTL code with necessary design constraints
xPilot advantages include
Platform-based behavior and system synthesis
Communication/interconnect-centric approach
Advanced algorithms for platform-based, communication-centric optimization
Promising results demonstrated on available FPGAs
xPilot system synthesis capabilities
Performance simulation of multi-processor systems
Exploration of the efficient use of (multiple) on-chip processors
Compilation and optimization for reconfigurable processors
Acknowledgements
We would like to thank the following for their support:
Gigascale Systems Research Center (GSRC)
National Science Foundation (NSF)
Semiconductor Research Corporation (SRC)
Industrial sponsors under the California MICRO programs (Altera, Xilinx)
Team members:
Yiping Fan
Guoling Han
Wei Jiang
Zhiru Zhang
Electronic System-Level (ESL) Design Automation
Modeling
SystemC -- OpenSource
SystemVerilog
Simulation and Verification
Behavior-level simulation & verification
System-level simulation & verification
SystemC provides behavior-level and system-level synthesis capabilities for
free -- rapidly gaining popularity
Synthesis
Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or
Matlab) to RTL or netlists
System-level synthesis: from system specification to system implementation
ESL Tools – A Lot of Interest …
Communication- and Interconnect-Centric Synthesis:
Example: Use of Distributed Register-File Architectures
[Figure: a scheduled DFG with register binding indicated on each variable (assume one-functional-unit constraint), shown next to two implementations]
Binding using discrete registers
Binding using a register file: more efficient design!
[Figure: distributed register-file micro-architecture with islands A, B, C; each island contains a local register file, data-routing logic, input buffers, a FUP MUX, and a functional unit pool (MUL, ALU, ALU')]
Distributed register-file micro-architecture:
• Efficiently use on-chip embedded memories
• Fully explore operation and data-transfer parallelism
Distributed Register-File Microarchitecture
[Figure: islands A, B, C mapped onto an FP-SoC with on-chip memory blocks; each island contains a local register file, data-routing logic, input buffers, a FUP MUX, and a functional unit pool (MUL, ALU, ALU')]
On-chip RAM resource on Virtex II:
Xilinx XC-2V     2000  3000  4000  6000   8000
#18Kb BRAM       56    96    120   144    168
Dist. RAM (Kb)   336   448   720   1,056  1,456
Resource Binding for DRF-Microarchitecture
[Figure: scheduled DFG with operations v1 to v10 in c-steps 1 to 4, bound to islands (chains) A, B, C, D, distinguishing intra-island transfers from inter-island transfers]
Inter-island connections = 5
• (A,B) = (A,D) = 1
• (A,C) = 1, two data transfers share one connection
• (C,D) = 2
Facts under simplified assumptions
• Operations bound onto an island form a chain in the given scheduled DFG
• Inter-chain data transfers may share a physical inter-island connection
• The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance
DRFM Binding Solution
[Figure: partial binding of the scheduled DFG (v1 to v10, islands A to D) and the weighted bipartite graph between the operations of c-step 3 and the islands]
Overview: use weighted bipartite matching to solve each step optimally, in step-by-step fashion
C-steps 1 and 2 handled. For c-step 3:
• Construct weighted bipartite graph: edge weight = number of newly introduced inter-island connections (IIC)
• Min-weight matching → optimal binding in this step
Solution of this step:
• Matching: v3 → Island A; v9 → Island C
• Newly introduced IIC# = 0
Final inter-island connections = 4
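The per-step binding can be pictured as a small assignment problem. The sketch below is an illustration only: it finds a min-weight matching by brute force rather than with a dedicated matching algorithm, and the edge weights are hypothetical values chosen so that the outcome agrees with the c-step 3 solution stated above.

```python
from itertools import permutations


def min_weight_matching(ops, islands, weight):
    """Assign each operation to a distinct island with minimum total weight.

    weight[(op, island)] is the number of newly introduced inter-island
    connections if op is bound to island. Brute force: fine for tiny inputs.
    """
    best_cost, best_binding = None, None
    for chosen in permutations(islands, len(ops)):
        cost = sum(weight[(op, isl)] for op, isl in zip(ops, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best_binding = cost, dict(zip(ops, chosen))
    return best_cost, best_binding


# Hypothetical edge weights for the two operations of c-step 3.
weights = {
    ("v3", "A"): 0, ("v3", "B"): 1, ("v3", "C"): 2, ("v3", "D"): 1,
    ("v9", "A"): 2, ("v9", "B"): 1, ("v9", "C"): 0, ("v9", "D"): 2,
}
cost, binding = min_weight_matching(["v3", "v9"], ["A", "B", "C", "D"], weights)
print(cost, binding)
```

For realistic island counts a polynomial-time assignment algorithm would replace the permutation loop; the brute-force form is used here only to keep the example short.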
DRF Experimental Results:
Three Experimental Flows for Comparison
xPilot Frontend
xPilot behavioral
synthesis system
1) Binding on
Discrete-Register
Microarchitecture
SSDM/CDFG
Scheduling algorithms
Scheduled CDFG (STG)
2) Baseline (Random)
DRF Binding
RTL generation
Xilinx Virtex II
3) DRF Binding for
Minimizing
Inter-Island Connections
DRF Experimental Results
Xilinx ISE 7.1; Virtex II; Target clock period: 8ns
The baseline DRF binding results achieve 46.70% slice reduction over the discrete-register
approach
Optimized DRF binding reduces slices by a further 12.21%
Overall, more than 2X logic slice reduction with better clock period (7.8%).
[Charts: area in slices (DRF solutions use on-chip RAM blocks) and clock period (ns) for Discrete-Reg, DRF-Random, and DRF-Opt on benchmarks PR, LEE, CHEN, DIR]
Preliminary Result of xPilot
Better QoR (Comparison with UCI/UCSD SPARK)
Designs    SPARK resource usage      SPARK   xPilot resource usage     xPilot  Fmax ratio
           Slice  Slice  Slice  DSP  Fmax    Slice  Slice  Slice  DSP  Fmax    xPilot/SPARK
                  (LUT)  (FF)        (MHz)          (LUT)  (FF)        (MHz)
PR         588    981    247    0    92.85   331    416    564    16   146.84  1.58
WANG       660    1157   265    0    109.29  357    464    588    15   133.51  1.22
LEE        574    996    220    0    109.17  356    484    659    19   131.93  1.21
MCM        1062   1857   479    0    99.40   887    1207   1282   30   110.38  1.11
DIR        1323   2256   494    3    79.30   979    1002   1732   56   98.81   1.25
Ave Ratio  1      1      1      1    1.00    0.66   0.48   2.74   n/a  1.27    1.27
Device setting: Xilinx Virtex-II pro (xc2v4000 -6)
Target frequency: 200 MHz
Proposed SCM Co-Optimization Design Flow
Platform Description &
Constraints
Process Network
Front End
System-Level Synthesis
Data Model
SCOOP (SCM CO-Optimization)
Communication
order detection
Code transformation and
interface generation
Indices compression
for loop reordering
Drivers + Glue
Logics
Process
Behavior
Communication Order Detection
Step 1. Construct a global CDFG by merging the individual CDFGs of each process
Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the
total latency of the global CDFG
[Figure: Process 1 and Process 2 exchange data T1, T2, T3 over a FIFO (Ti: FIFO transfer); two transmission orders of T1, T2, T3 are compared, one giving latency = 5 cycles and the other latency = 7 cycles]
Loop Indices Compression
Given the optimal order, we try to generate restructured loops for
code compression
i.e., given the original iteration and reordered iteration, find the minimum
number of linear intervals to represent the new iteration space
Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
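Need to solve the linear system
\[
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix}
\begin{pmatrix} i \\ j \\ 1 \end{pmatrix}
\]
Solution: i' = j, j' = i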
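For this small example the affine coefficients can be recovered by direct search. The following is a minimal sketch under the assumption that a single affine map with coefficients in {-1, 0, 1} covers the whole iteration space; the actual problem of finding the minimum number of linear intervals is more general, and the function name `find_affine_map` is hypothetical.

```python
from itertools import product


def find_affine_map(original, reordered, coeffs=(-1, 0, 1)):
    """Find (a1, b1, c1), (a2, b2, c2) such that for every k,
    i' = a1*i + b1*j + c1 and j' = a2*i + b2*j + c2 map original[k]
    to reordered[k]. Returns None if no map exists in the search range.
    """
    for a1, b1, c1, a2, b2, c2 in product(coeffs, repeat=6):
        if all(
            (a1 * i + b1 * j + c1, a2 * i + b2 * j + c2) == target
            for (i, j), target in zip(original, reordered)
        ):
            return (a1, b1, c1), (a2, b2, c2)
    return None


original_order = [(0, 0), (0, 1), (1, 0), (1, 1)]
new_order = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(find_affine_map(original_order, new_order))
```

The coefficients found correspond to i' = j and j' = i, that is, a loop interchange.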
Initial Results of Interface Synthesis
Target for sequential communication channels
In particular, FSL in VirtexII
Consider two communicating processes
Designs   Total latency (Cycle#)       RAs Compress
          Trad.  SCOOP  Reduction      Before  After
DCT1      325    290    10.77%         0       0
Haar      142    134    5.63%          0       0
DWT       689    617    10.45%         0       0
Mat_mul   408    339    16.91%         96      20
DCT2      483    419    13.25%         80      64
Masking   620    420    32.26%         192     0
Dot       1903   1084   43.04%         300     0
An average of 26% improvement in total latency can be achieved.
MPEG-4 Simple Profile Decoder: Architecture Profiling
• C specification overview
Module Name         Orig. C Source File     Orig. C line #
Copy Controller     copyControl.c           287
Display Controller  displayControl.c        358
Motion Comp.        MotionCompensation.c    312
Parser/VLD          parser.c                1092
                    texture_vld.c           508
Texture/IDCT        texture_idct.c          1901
Texture Update      textureUpdate.c         220
• Runtime Profiling (PowerPC/XUP board)
Parser/VLD          59.0%
Texture/IDCT        18.1%
Motion Comp.        15.7%
Copy Controller     3.6%
MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
HW block integrated with the single-process PowerPC design: 15% speed improvement
Software blocks running on PowerPC
MPEG-4 Simple Profile Decoder:
Alternate Implementations
Implementations: Single PowerPC; Single PowerPC w/ HW Motion Comp.; Single uBlaze; 7-uBlaze
Throughput (Frame per Second): 0.59, 1.18, 3.06, 3.53
Improvement: -, +209%, +68.4%, +15.3%
• xPilot Synthesis Report of HW blocks
Module          Line counts               Slices (FFs, LUTs)  MUL  Clock        Latency
                C    RTL      RTL                                  period (ns)  (Cycles)
                     SystemC  VHDL
Motion Comp.    210  9903     5655        986 (1111, 1017)    2    7.97         505
Block IDCT      200  9534     2731        1877 (2376, 2438)   26   7.963        280
Texture Update  160  8227     4475        1551 (1696, 1931)   4    7.913        335
Advantages of Our Scheduling Algorithm
A highly versatile scheduling engine (UPS)
Supports a wide spectrum of applications with high complexity
• Data-intensive, control-intensive, memory-intensive, mixed, etc.
Honors a rich set of design constraints
• Resource constraints, relative timing constraints, frequency
constraints, latency constraints, etc.
Offers a variety of optimization techniques
• Operation chaining, pipelined multi-cycle operation, awareness of
repetitions, behavioral templates, speculation, functional/loop
pipelining, multi-cycle communication
Accounts for physical reality
• Optimizes communications simultaneously with computations
Preliminary Results of xPilot
Rapid System Exploration
Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
Example: Motion-JPEG implementation
- All-HW implementation
- All-SW implementation (using embedded processors)
- SW/HW co-design: optimal partitioning?
- Repeated manual RTL coding is not a solution!
Preliminary Results of xPilot
Shorter Simulation/Verification Cycle
From other projects:
Simulation speed on behavior model 100X faster than
RTL-based method [NEC, ASPDAC04]
Our experience:
Motion-compensation module in an MPEG-4 decoder
• Behavior level (in C language) simulation
Less than 1 second per frame
• RTL SystemC simulation
About 310 seconds per frame
Ongoing Work: Design Exploration for MPSoCs
A scalable architecture simulation infrastructure for architecture evaluation & performance/power estimation
Need for structural abstraction of processors and interconnects
• Recent work such as Liberty is an effort along this direction
Complete structural abstraction makes the simulation very slow
• Liberty is about 10X slower than SimpleScalar on Itanium model
Hybrid approach
• Tradeoff between accuracy and simulation time
• Model interconnection accurately using SystemC (for accuracy)
• Cores modeled using SimpleScalar (for simulation speed)
Communication network synthesis
Automatic interface synthesis is required
Physical planning is needed for interconnect latency/power estimation