Design Productivity Crisis

Platform-Based
Behavior-Level and System-Level Synthesis
Prof. Jason Cong
UCLA Computer Science Department
Outline
■ Motivation
■ xPilot system framework
■ Behavior-level synthesis in xPilot
▪ Advantages of behavioral synthesis
▪ Scheduling
▪ Resource binding
■ System-level synthesis in xPilot
▪ Synthesis for ASIP platforms
▪ Design exploration for heterogeneous MPSoCs
■ Conclusions
ASICs SOC Example: Philips Nexperia
[Figure: Philips Nexperia SoC platform for high-end digital video (DVP system silicon): a MIPS CPU (PRxxxx) and a TriMedia CPU (TM-xxxx), each with I$ and D$, attached through PI buses to device IP blocks (including MPEG video), with the DVP memory bus and MMI leading to SDRAM. Courtesy Philips]
■ General-purpose scalable RISC processor (MIPS™)
▪ 50 to 300+ MHz
▪ 32-bit or 64-bit
■ Scalable VLIW media processor (TriMedia™)
▪ 100 to 300+ MHz
▪ 32-bit or 64-bit
■ Library of device IP blocks
▪ Image coprocessors
▪ DSPs
▪ UART
▪ 1394
▪ USB
▪ …
■ Nexperia™ system buses
▪ 32-128 bit
Field-Programmable SOC Example: Xilinx Virtex-4 FPGA
[Figure: Virtex-4 FPGA with PowerPC 405 cores, MicroBlaze soft cores, IP blocks, and H.264/AVC hardware blocks connected by the IBM CoreConnect™ bus. Courtesy Xilinx]
■ PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
■ MicroBlaze soft-core processor: 180 MHz, 166 DMIPS, < ~1300 LUTs
IC Design Steps
[Figure: IC design flow from system-level specification through behavior-level description, synthesis to an RT-level description, generic logic description (illustrated with Boolean equations for X and Y), technology mapping, gate/circuit design, physical design (placed & routed design), and fabrication to packaging. ©Sherwani]
xPilot: Platform-Based Synthesis System
[Figure: xPilot flow. Inputs: SystemC/C and platform description & constraints. The xPilot front end, with profiling, builds the SSDM (System-Level Synthesis Data Model). After analysis and mapping, processor & architecture synthesis produces processor cores + executables, interface synthesis produces drivers + glue logic, and behavioral synthesis produces custom logic; together they form the embedded SoC.]

Uniqueness of xPilot
 Platform-based synthesis and optimization
 Communication-centric synthesis with interconnect optimization
xPilot: Behavioral-to-RTL Synthesis Flow
Inputs: behavioral spec. in C/SystemC; platform description
Frontend compiler → SSDM
Presynthesis optimizations
▪ Loop unrolling/shifting
▪ Strength reduction / Tree height reduction
▪ Bitwidth analysis
▪ Memory analysis …
Core synthesis optimizations
▪ Scheduling
▪ Resource binding, e.g., functional unit binding, register/port binding
Arch-generation & RTL/constraints generation
▪ Verilog/VHDL/SystemC
▪ FPGAs: Altera, Xilinx
▪ ASICs: Magma, Synopsys, …
Output: RTL + constraints for FPGAs/ASICs
Advantages of Behavioral Synthesis
 Shorter verification/simulation cycle
• 100X speed up with behavior-level simulation
 Better complexity management, faster time to market
• 10M gate design may require 700K lines of RTL code
 Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
 Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality
Behavior Synthesis Has Been Tried and Failed – Why?
■ Reasons for previous failures
▪ Lack of a compelling reason: design complexity was still manageable a decade ago
▪ Lack of a solid RTL foundation
▪ Lack of consideration of physical reality
▪ Lack of widely accepted behavior models
xPilot Advantages
■ Advanced algorithms for platform-based, communication-centric optimization
■ Platform-based behavior and system synthesis
▪ Communication/interconnect-centric approach
■ Complete validation through final P&R on FPGAs
Platform Modeling & Characterization
■ Target platform specification
▪ High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations
• Functional units: adders, ALUs, multipliers, comparators, etc.
• Connectors: mux, demux, etc.
• Memories: registers, synchronous memories, etc.
▪ Chip layout description
• On-chip resource distributions
• On-chip interconnect delay/power estimation
[Figure: two binding solutions for the same behavior, built from MUXes and ALUs]
■ Which one is better?
▪ Answer is platform-dependent: how large/fast are the MUX and ALU?
3X3 Delay Matrix for Stratix-EP1S40
0.58   1.8   2.8
2.0    2.9   3.7
2.8    3.8   4.7
Advanced Behavior System Algorithms:
Example: Versatile Scheduling Algorithm Based on SDC
■ Scheduling problem in behavioral synthesis is NP-Complete under general design constraints
■ ILP-based solutions are versatile but very inefficient
▪ Exponential time complexity
[Figure: example data-flow graph of multiplications (*1, *5) and additions (+2, +3, +4) scheduled into control steps CS0 and CS1]
Existing Scheduling Techniques for Behavioral Synthesis
■ Heuristic approach: fast, but ad hoc (efficiency limited to specific applications)
▪ Data-flow-based scheduling (targets data-flow-intensive designs, e.g., DSP applications, image processing applications, etc.)
▪ Control-flow-based scheduling (targets control-flow-intensive designs, e.g., controllers, network protocol processors, etc.)
■ Exact approach: versatile, but inefficient (poor scalability)
▪ ILP-based scheduling, e.g., [Huang et al., TCAD’91], etc.
▪ BDD-based symbolic scheduling, e.g., [Radivojevic and Brewer, TCAD’96] …
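Scheduling: Our Approach
■ Overall approach
▪ Current objective: high performance
▪ Use a system of integer difference constraints to express all kinds of scheduling constraints
▪ Represent the design objective as a linear function
[Figure: example data-flow graph with operations v1 to v5; x_i denotes the control step assigned to v_i]
■ Dependency constraints
• v1 → v3 : x3 – x1 ≥ 0
• v2 → v3 : x3 – x2 ≥ 0
• v3 → v4 : x4 – x3 ≥ 0
• v4 → v5 : x5 – x4 ≥ 0
■ Frequency constraint
• <v2 , v5> : x5 – x2 ≥ 1
■ Resource constraint
• <v2 , v3> : x3 – x2 ≥ 1
■ Platform characterization:
• adder (+/–): 2ns
• multiplier (*): 5ns
■ Target cycle time: 10ns
■ Resource constraint: only ONE multiplier is available
In matrix form, Ax ≤ b (row 2 combines the dependency and resource constraints on <v2 , v3>):
\[
\underbrace{\begin{pmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}}_{x}
\le
\underbrace{\begin{pmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{pmatrix}}_{b}
\]
Totally unimodular matrix: guarantees integral solutions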
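A system of difference constraints like the five-operation example can also be solved without a general LP solver. The sketch below is not the xPilot implementation (the slides name a linear programming solver); it swaps in Bellman-Ford-style longest-path relaxation, a standard technique for difference constraints, to find the earliest feasible control step of each operation. The function name `solve_sdc` and the 0-based operation indices are assumptions made for illustration.

```python
def solve_sdc(num_ops, constraints):
    """Solve constraints of the form x[v] - x[u] >= w over non-negative integers.

    constraints: list of (u, v, w) tuples with 0-based operation indices.
    Returns the earliest (ASAP) control step per operation, or None when the
    constraints are contradictory (a positive-weight cycle exists).
    """
    x = [0] * num_ops  # every operation may start at step 0 unless pushed later
    for _ in range(num_ops):
        changed = False
        for u, v, w in constraints:
            if x[u] + w > x[v]:  # constraint violated: push v later
                x[v] = x[u] + w
                changed = True
        if not changed:
            return x
    return None  # still changing after num_ops passes: infeasible


# Example from the slide, with v1..v5 mapped to indices 0..4.
example = [
    (0, 2, 0),  # dependency v1 -> v3
    (1, 2, 0),  # dependency v2 -> v3
    (2, 3, 0),  # dependency v3 -> v4
    (3, 4, 0),  # dependency v4 -> v5
    (1, 4, 1),  # frequency constraint <v2, v5>
    (1, 2, 1),  # resource constraint <v2, v3>
]
print(solve_sdc(5, example))
```

Under these assumptions the relaxation only minimizes start times; an LP formulation, as on the slide, can encode other linear objectives over the same constraints.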
UPS Scheduling: Overall Framework
[Figure: xPilot scheduler flow]
Inputs: CDFG; user-specified design constraints & assignments; target platform modeling (resource library & chip layout)
▪ Constraint equations generation: relative timing constraints, dependency constraints, frequency constraints, resource constraints …
▪ Objective function generation
▪ System of pairwise difference constraints
▪ Linear programming solver
▪ LP solution interpretation
Output: STG (State Transition Graph)
UPS vs. SPARK: Results on SPARK’s Benchmarks
Mult (*): 2 cycles; Div (/): 5 cycles; Rest: one cycle
Target clock period: 7.5ns
                 SPARK                 UPS                   UPS /
Benchmark        State#   W. Cycle#    State#   W. Cycle#    SPARK
MPEG2-dpframe    32       424          35       352          0.83
GIMP-tiler       27       2234         32       1877         0.84
ADPCM-decoder    15       327          13       278          0.85
ADPCM-encoder    16       133          13       112          0.84
Average Ratio                                                0.84
UPS achieves 16% cycle count reduction over SPARK
Platform-Based Interface Synthesis

Focus on sequential communication media (SCM)
▪ FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx CoreConnect, Altera Avalon, etc.)
▪ Transmission order may have a dramatic impact on performance
• The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission

Interface synthesis for SCM
▪ Consider both behavior and communication to determine the optimal transmission order
[Figure: DCT example. Process P1 (on PE1, custom logic 1) sends data[8] through FIFO channel C to process P2 (on PE2, custom logic 2).]
P1:
for (int i = 0; i < 8; i++) {
S1: data[i] = …;
}
P2:
int s07 = data[0] + data[7];
int s16 = data[1] + data[6];
…..
SCM Co-Optimization: Problem Formulation
■ Given:
▪ A set of processes P connected by a set of channels in C
▪ A set of data D = {d1, d2, …, dm} to be transmitted on each channel cj
■ Goal:
▪ Find the optimal transmission order of each process, so that the overall latency of the process network is minimized subject to the given design constraints and platform specifications
▪ At the same time, generate the drivers and glue logic for each process automatically
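To make the effect of transmission order concrete, here is a small sketch based on the DCT fragment, where the consumer adds data[0] + data[7], data[1] + data[6], and so on. The timing model is an assumption for illustration only, not taken from the slides: the FIFO delivers one word per cycle, and the consumer has a single non-pipelined unit that needs `op_cycles` cycles per sum. The function name `consumer_latency` is hypothetical.

```python
def consumer_latency(order, pairs, op_cycles=2):
    """Cycle at which the consumer finishes all pairwise sums.

    order: sequence in which the producer pushes word indices into the FIFO.
    pairs: (a, b) index pairs the consumer needs together.
    Assumed model: one word arrives per cycle; one unit, op_cycles per sum.
    """
    arrival = {word: pos + 1 for pos, word in enumerate(order)}
    # A sum becomes ready once both of its operands have arrived.
    ready_times = sorted(max(arrival[a], arrival[b]) for a, b in pairs)
    free = 0  # cycle at which the consumer's unit becomes free
    for ready in ready_times:
        free = max(free, ready) + op_cycles
    return free


pairs = [(0, 7), (1, 6), (2, 5), (3, 4)]
natural = list(range(8))           # data[0], data[1], ..., data[7]
paired = [0, 7, 1, 6, 2, 5, 3, 4]  # operands of each sum sent back to back

print(consumer_latency(natural, pairs))  # natural loop order
print(consumer_latency(paired, pairs))   # reordered transmission
```

In this toy model the reordered transmission lets the consumer start earlier and overlap computation with communication, which is the kind of gain the co-optimization targets.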
SystemC/C-to-RTL Design Flow
[Figure: design flow. SystemC/C specification and platform description & constraints → front-end compiler → SSDM/CDFG → behavioral synthesis → SSDM/FSMD → RTL generation → FSM with datapath in VHDL, plus floorplan and/or multicycle path constraints → RTL synthesis → ASICs/FPGAs platform. The steps from the front-end compiler to RTL generation form xPilot behavioral synthesis; SSDM is the System-Level Synthesis Data Model.]
Preliminary Results of xPilot: Better Complexity Management
■ Significant code size reduction
▪ RTL design → behavioral design: 10x code size reduction
▪ VHDL code generated by UCLA xPilot targeting Altera Stratix platform
Design Exploration for Heterogeneous MPSoC Platforms
 Heterogeneous
MPSoCs exploration
 Processors
• Heterogeneous vs. homogeneous
• General-purpose vs. application-specific
 On-chip communication architecture (OCA)
• Bus (e.g. AMBA, CoreConnect), packet switching network
(e.g. Alpha 21364)
 Memory hierarchy
[Figure: heterogeneous MPSoC in which μP, DSP, IP, and FPGA nodes, with tasks running over an OS and driver, attach through network interfaces to a shared communication network]
Configurable SoC Platforms
General purpose processor cores +
programmable fabric
 Tight integration using extended instructions (ASIPs)
• Example: Altera Nios / Nios II
 Loose integration using FIFOs/busses for communications
• Example: Xilinx MicroBlaze, etc.
Custom instruction logic for Nios II
[source: www.altera.com]
Xilinx MicroBlaze
[source: www.xilinx.com]
ASIP Compilation: Problem Statement
■ Given:
▪ CDFG G(V, E)
▪ The basic instruction set I
▪ Pattern constraints:
• Number of inputs |PI(p_i)| ≤ N_in;
• Number of outputs |PO(p_i)| = 1;
• Total area \( \sum_{1 \le i \le N} \mathrm{area}(p_i) \le A \)
■ Objective:
▪ Generate a pattern library P
▪ Map G to the extended instruction set I ∪ P, so that the total execution time is minimized
Example (*: 2 clock cycles; +: 1 clock cycle; ext-inst1 is MAC1, 2 cycles; ext-inst2 is MAC2, 2 cycles):
Original code:        With extended instructions:
t1 = a * b;           t4 = ext-inst1(a, b, c);
t2 = b * c;           t5 = ext-inst2(b, c, d, e);
t3 = d * e;           t6 = t4 + t5;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;
Performance speedup = 9 / 5 = 1.8X
[Figure: data-flow graph over inputs a, b, c, d, e in which ext-inst1 covers the operations producing t4 and ext-inst2 covers those producing t5, followed by the addition producing t6]
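The 9 / 5 speedup can be reproduced by summing per-instruction cycle counts. The sketch below assumes strictly sequential, single-issue execution with the latencies stated on the slide (multiply: 2 cycles, add: 1 cycle, each extended instruction: 2 cycles); the names `CYCLES` and `total_cycles` are illustrative only.

```python
# Per-instruction latencies as stated on the slide.
CYCLES = {"*": 2, "+": 1, "ext-inst1": 2, "ext-inst2": 2}


def total_cycles(instructions):
    """Total cycles for a straight-line sequence executed one instruction at a time."""
    return sum(CYCLES[op] for op in instructions)


# Original code: three multiplications followed by three additions.
original = ["*", "*", "*", "+", "+", "+"]
# After mapping: two multiply-accumulate patterns and one remaining addition.
extended = ["ext-inst1", "ext-inst2", "+"]

speedup = total_cycles(original) / total_cycles(extended)
print(total_cycles(original), total_cycles(extended), speedup)
```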
Target Core Processor Model
Core processor model
 Classic single-issue pipelined RISC core (fetch / decode / execute / mem /
write-back)
• The number of input and output operands of an instruction is pre-determined
• An instruction reads the core register file during the execute stage, and commits
the result during the write-back stage
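[Figure: core processor pipeline with PC, adder (+4), instruction cache, register file (RS1, RS2), ALU with operands OP1 and OP2, memory, result MUX, and the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, with custom logic attached to the core processor]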
ASIP Compilation Flow
[Figure: C code → front-end compilation → CDFG → pattern generation and pattern selection (under the architecture constraints, producing the pattern library) → application mapping & graph covering → optimized CDFG → backend compilation → optimized assembly]
1. Pattern generation: satisfying input/output constraints
2. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint
3. Application mapping: graph covering to minimize the total execution time
Experimental Results on Altera Nios

Altera Nios is used for ASIP implementation
 5 extended instruction formats
 up to 2048 instructions for each format

Small DSP applications are taken as benchmark
            Extended       Speedup               Resource Overhead
Benchmark   Instruction#   Estimation   Nios     LE             Memory            DSP Block
fft_br      9              3.28         2.65     408 (6.06%)    65,536 (9.79%)    16
iir         7              3.18         3.73     255 (3.79%)    4,736 (0.71%)     40
fir         2              2.40         2.14     51 (0.76%)     1,024 (0.15%)     8
pr          2              1.57         1.75     71 (1.05%)     0 (0.00%)         14
dir         2              3.28         3.02     54 (0.80%)     0 (0.00%)         16
mcm         4              4.75         3.22     186 (2.76%)    0 (0.00%)         56
Average                    3.08         2.75     - (2.54%)      - (1.77%)         -
Architecture Extension for ASIPs

Data bandwidth problem
• Limited register file bandwidth (two read ports, one write port)
• ~40% of the ideal performance speedup will be lost
Shadow-register-based architectural extension
 Core registers are augmented by an extra set of shadow registers
• Conditionally written during write-back stage
• Low power/area overhead
 Novel shadow-register binding algorithms are developed
[Figure: the core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) extended with shadow registers SR1 … SRK, selected by a hashing unit (k = hash(j)) and feeding the custom logic]
Ongoing Work -- Mapping for Heterogeneous
Integration with Multiple Processing Cores

Given:
 A library of processing cores P and communication library C
 Task graph G(V, E)
• For each v in V, execution time t(v, pi) on pi
• For each (u, v) in E, communication data size s(u,v)
 Throughput constraint

Problem:
 Select and instantiate the processing elements and communication channels
from P and C respectively
 Map the tasks onto the processing elements and communications to the
channels so that
• The optimal latency is achieved subject to the throughput constraint
• The implementation cost is minimized
Preliminary Results on Motion-JPEG Example
[Figure: Motion-JPEG pipeline from RAW images to encoded JPEG images: Preprocess, DCT (or HW-DCT), Quant, and Huffman, with Table Modification; implemented on a Xilinx XUP board]
Model #1: 5 MicroBlazes, FSL-based communication
Model #2: 4 MicroBlazes + DCT on FPGA fabric
System      Cycle#          Fmax (MHz)   Exe Time (ms)   Area (Slice#)
Model #1    23812           126          0.189           4306
Model #2    14800 (-38%)    126          0.117           6345
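The execution times in the Motion-JPEG results follow from the cycle counts and the clock frequency (time = cycles / Fmax). A minimal sketch of that arithmetic, with the hypothetical helper name `exe_time_ms`:

```python
def exe_time_ms(cycles, fmax_mhz):
    """Execution time in milliseconds for a cycle count at a given clock (MHz)."""
    return cycles / (fmax_mhz * 1e6) * 1e3


model1 = exe_time_ms(23812, 126)  # 5 MicroBlazes
model2 = exe_time_ms(14800, 126)  # 4 MicroBlazes + hardware DCT
cycle_reduction = 1 - 14800 / 23812

print(round(model1, 3), round(model2, 3), round(100 * cycle_reduction))
```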
Conclusions
 xPilot
has fairly mature and advanced behavior synthesis capability
from C or SystemC to RTL code with necessary design constraints
 xPilot
advantages include
 Platform-based behavior and system synthesis
 Communication/interconnect-centric approach
 Advanced algorithms for platform-based, communication-centric optimization
 Promising results demonstrated on available FPGAs
 xPilot
system synthesis capabilities
 Performance simulation of multi-processor systems
▪ Exploration of the efficient use of (multiple) on-chip processors
 Compilation and optimization for reconfigurable processors
Acknowledgements
■ We would like to acknowledge support from
 Gigascale Systems Research Center (GSRC)
 National Science Foundation (NSF)
 Semiconductor Research Corporation (SRC)
 Industrial sponsors under the California MICRO programs (Altera, Xilinx)
 Team members:
Yiping Fan
Guoling Han
Wei Jiang
Zhiru Zhang
Electronic System-Level (ESL) Design Automation
 Modeling
 SystemC -- OpenSource
 SystemVerilog
 Simulation
and Verification
 Behavior-level simulation & verification
 System-level simulation & verification
 SystemC provides behavior-level and system-level synthesis capabilities for
free -- rapidly gaining popularity
 Synthesis
 Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or
Matlab) to RTL or netlists
 System-level synthesis: from system specification to system implementation
ESL Tools – A Lot of Interests …
Communication- and Interconnect-Centric Synthesis:
Example: Use of Distributed Register-File Architectures
[Figure: a scheduled DFG with register binding indicated on each variable (assume one-functional-unit constraint), comparing binding using discrete registers with binding using a register file: more efficient design!]
■ Distributed register-file micro-architecture:
▪ Efficiently use on-chip embedded memories
▪ Fully explore operation and data-transfer parallelism
[Figure: islands A, B, and C, each with a local register file, data-routing logic, input buffers, an FUP MUX, and a functional unit pool (MUL, ALU, ALU’)]
Distributed Register-File Microarchitecture
[Figure: FP-SoC with islands A, B, and C built around on-chip memory blocks; each island contains a local register file, data-routing logic, input buffers, an FUP MUX, and a functional unit pool (MUL, ALU, ALU’)]
On-chip RAM resource on Virtex II
Xilinx XC-2V      2000   3000   4000   6000    8000
#18Kb BRAM        56     96     120    144     168
Dist. RAM (Kb)    336    448    720    1,056   1,456
Resource Binding for DRF-Microarchitecture
[Figure: a scheduled DFG with operations v1 to v10 in c-steps 1 to 4, bound onto islands (chains) A, B, C, and D, distinguishing intra-island transfers from inter-island transfers]
Inter-island connections = 5
▪ (A,B)=(A,D)=1
▪ (A,C)=1, two data transfers share one connection
▪ (C,D)=2
■ Facts under simplified assumptions
▪ Operations bound onto an island form a chain in the given scheduled DFG
▪ Inter-chain data transfers may share a physical inter-island connection
■ The number of inter-island connections (IIC) is crucial to the QoR of a DRFM instance
DRFM Binding Solution
■ Overview:
▪ Use weighted bipartite matching to solve each step optimally
▪ In step-by-step fashion
C-step 1, 2 handled. For c-step 3:
Construct weighted bipartite graph:
▪ Edge weight = number of newly introduced inter-island connections (IIC)
▪ Min-weight matching → optimal binding in this step
Solution of this step:
▪ Matching: v3 → Island A; v9 → Island C
▪ Newly introduced IIC # = 0
▪ Final inter-island connections = 4
[Figure: the scheduled DFG bound onto islands (chains) A to D, next to the weighted bipartite graph between the c-step 3 operations (v3, v9) and the islands]
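The per-step binding can be sketched as a small min-weight bipartite matching between the operations of one c-step and the islands. The edge weights below are hypothetical placeholders, since the figure's actual weights are not recoverable from the transcript; they are chosen so the result matches the slide's stated solution for c-step 3. The brute-force search over permutations is only suitable for tiny instances; a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
from itertools import permutations


def min_weight_binding(ops, islands, weight):
    """Bind each operation to a distinct island, minimizing total edge weight.

    weight[(op, island)] = number of new inter-island connections introduced
    if op is bound to island. Exhaustive search; fine for a handful of nodes.
    """
    best = None
    for chosen in permutations(islands, len(ops)):
        cost = sum(weight[(op, isl)] for op, isl in zip(ops, chosen))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(ops, chosen)))
    return best


ops = ["v3", "v9"]
islands = ["A", "B", "C", "D"]
# Hypothetical weights (not from the slide): new IICs per (operation, island).
weight = {
    ("v3", "A"): 0, ("v3", "B"): 1, ("v3", "C"): 2, ("v3", "D"): 1,
    ("v9", "A"): 1, ("v9", "B"): 2, ("v9", "C"): 0, ("v9", "D"): 2,
}
print(min_weight_binding(ops, islands, weight))
```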
DRF Experimental Results:
Three Experimental Flows for Comparison
[Figure: xPilot behavioral synthesis system: xPilot frontend → SSDM/CDFG → scheduling algorithms → scheduled CDFG (STG) → one of three binding flows → RTL generation → Xilinx Virtex II]
1) Binding on discrete-register microarchitecture
2) Baseline (random) DRF binding
3) DRF binding for minimizing inter-island connections
DRF Experimental Results

Xilinx ISE 7.1; Virtex II; target clock period: 8ns

The baseline DRF binding achieves a 46.70% slice reduction over the discrete-register approach

Optimized DRF binding reduces slices by a further 12.21%

Overall, more than 2X logic slice reduction with a better clock period (7.8%).
[Figure: two bar charts comparing Discrete-Reg, DRF-Random, and DRF-Opt on PR, LEE, CHEN, and DIR: area (slices; DRF solutions use on-chip RAM blocks) and clock period (ns)]
Preliminary Results of xPilot: Better QoR (Comparison with UCI/UCSD SPARK)
            SPARK resource usage                                 xPilot resource usage                                Fmax ratio
Designs     Slice   Slice (LUT)   Slice (FF)   DSP   Fmax (MHz)   Slice   Slice (LUT)   Slice (FF)   DSP   Fmax (MHz)   xPilot/SPARK
PR          588     981           247          0     92.85        331     416           564          16    146.84       1.58
WANG        660     1157          265          0     109.29       357     464           588          15    133.51       1.22
LEE         574     996           220          0     109.17       356     484           659          19    131.93       1.21
MCM         1062    1857          479          0     99.40        887     1207          1282         30    110.38       1.11
DIR         1323    2256          494          3     79.30        979     1002          1732         56    98.81        1.25
Ave Ratio   1       1             1            1     1.00         0.66    0.48          2.74         n/a   1.27         1.27

Device setting: Xilinx Virtex-II Pro (xc2v4000 -6)

Target frequency: 200 MHz
Proposed SCM Co-Optimization Design Flow
[Figure: inputs are the process network and the platform description & constraints; the front end builds the System-Level Synthesis Data Model; SCOOP (SCM CO-Optimization) performs communication order detection, indices compression for loop reordering, and code transformation and interface generation; outputs are the process behavior and the drivers + glue logic]
Communication Order Detection

Step 1. Construct a global CDFG by merging the individual CDFGs of each process

Step 2. Solve a resource-constrained min-latency scheduling problem to optimize the total latency of the global CDFG
[Figure: two schedules of Process 1 and Process 2, which communicate over FIFOs T1, T2, and T3 (Ti: FIFO); one transmission order gives latency = 5 cycles, the other latency = 7 cycles]
Loop Indices Compression
Given the optimal order, we try to generate restructured loops for code compression
▪ i.e., given the original iteration and reordered iteration, find the minimum number of linear intervals to represent the new iteration space
Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
Need to solve the linear system
\[
\begin{pmatrix} i' \\ j' \end{pmatrix}
=
\begin{pmatrix} a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix}
\begin{pmatrix} i \\ j \\ 1 \end{pmatrix}
\]
Solution: i' = j, j' = i
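The coefficients of the affine map from the original to the reordered iteration can be recovered by fitting each output row separately. The sketch below uses a brute-force search over small integer coefficients, which is an illustrative stand-in and not necessarily how SCOOP solves the system; `fit_affine_row` and the coefficient range are assumptions.

```python
from itertools import product


def fit_affine_row(points, targets, coeff_range=range(-2, 3)):
    """Find integers (a, b, c) with a*i + b*j + c == target for every point.

    points: original (i, j) iterations; targets: one coordinate of the
    reordered iterations, in the same sequence. Returns None if no fit exists
    within coeff_range.
    """
    for a, b, c in product(coeff_range, repeat=3):
        if all(a * i + b * j + c == t for (i, j), t in zip(points, targets)):
            return (a, b, c)
    return None


original = [(0, 0), (0, 1), (1, 0), (1, 1)]
reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]

row_i = fit_affine_row(original, [p[0] for p in reordered])  # (a1, b1, c1)
row_j = fit_affine_row(original, [p[1] for p in reordered])  # (a2, b2, c2)
print(row_i, row_j)
```

Here `row_i` encodes i' = j and `row_j` encodes j' = i, in agreement with the solution stated on the slide. When no single affine map fits, the iteration space would have to be split into several linear intervals, which is the minimization the slide describes.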
Initial Results of Interface Synthesis

Target for sequential communication channels
▪ In particular, FSL in Virtex-II

Consider two communicating processes
            Total latency (Cycle#)             RAs Compress
Designs     Trad.    SCOOP    Reduction        Before   After
DCT1        325      290      10.77%           0        0
Haar        142      134      5.63%            0        0
DWT         689      617      10.45%           0        0
Mat_mul     408      339      16.91%           96       20
DCT2        483      419      13.25%           80       64
Masking     620      420      32.26%           192      0
Dot         1903     1084     43.04%           300      0
An average of 26% improvement in total latency can be achieved.
MPEG-4 Simple Profile Decoder: Architecture Profiling
• C specification overview
Module Name          Orig. C Source File      Orig. C line #
Copy Controller      copyControl.c            287
Display Controller   displayControl.c         358
Motion Comp.         MotionCompensation.c     312
Parser/VLD           parser.c                 1092
                     texture_vld.c            508
Texture/IDCT         texture_idct.c           1901
Texture Update       textureUpdate.c          220
• Runtime Profiling (PowerPC/XUP board)
Parser/VLD           59.0%
Texture/IDCT         18.1%
Motion Comp.         15.7%
Copy Controller      3.6%
MPEG-4 Simple Profile Decoder: Hybrid HW/SW Implementation
[Figure: HW block integrated with software blocks running on PowerPC]
PowerPC single-process design: 15% speed improvement
MPEG-4 Simple Profile Decoder:
Alternate Implementations
Single Single PowerPC w/
PowerPC HW Motion Comp.
Single uBlaze
7-uBlaze
Throughput
(Frame per Second)
0.59
1.18
3.06
3.53
Improvement
-
+ 209%
+ 68.4%
+ 15.3%
• xPilot Synthesis Report of HW blocks
                  Line counts
Module            C      RTL SystemC   RTL VHDL   Slices (FFs, LUTs)   MUL   Clock period (ns)   Latency (Cycles)
Motion Comp.      210    9903          5655       986 (1111, 1017)     2     7.97                505
Block IDCT        200    9534          2731       1877 (2376, 2438)    26    7.963               280
Texture Update    160    8227          4475       1551 (1696, 1931)    4     7.913               335
Advantages of Our Scheduling Algorithm
 A highly versatile
scheduling engine (UPS)
 Supports a wide spectrum of applications with high complexity
• Data-intensive, control-intensive, memory-intensive, mixed, etc.
 Honors a rich set of design constraints
• Resource constraints, relative timing constraints, frequency
constraints, latency constraints, etc.
 Offers a variety of optimization techniques
• Operation chaining, pipelined multi-cycle operation, awareness of
repetitions, behavioral templates, speculation, functional/loop
pipelining, multi-cycle communication
 Accounts for physical reality
• Optimizes communications simultaneously with computations
Preliminary Results of xPilot: Rapid System Exploration
■ Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries
Example: Motion-JPEG implementation
- All HW implementation
- All SW implementation (using embedded processors)
- SW/HW co-design: optimal partitioning?
- Repeated manual RTL coding is not a solution!
Preliminary Results of xPilot
Shorter Simulation/Verification Cycle
From other projects:
▪ Simulation speed on behavior model 100X faster than RTL-based method [NEC, ASPDAC04]
Our experience:
▪ Motion-compensation module in an MPEG-4 decoder
• Behavior-level (in C language) simulation: less than 1 second per frame
• RTL SystemC simulation: about 310 seconds per frame
Ongoing Work: Design Exploration for MPSoCs
 A scalable
architecture simulation infrastructure for architecture
evaluation & performance/power estimation
 Need for structural abstraction of processors and interconnects
• Recent work such as Liberty is an effort along this direction
 Complete structural abstraction makes the simulation very slow
• Liberty is about 10X slower than SimpleScalar on Itanium model
 Hybrid approach
• Tradeoff between accuracy and simulation time
• Model interconnection accurately using SystemC (for accuracy)
• Cores modeled using SimpleScalar (for simulation speed)
 Communication
network synthesis
 Automatic interface synthesis is required
 Physical planning is needed for interconnect latency/power estimation