Transcript xPilot

xPilot  A Platform-Based Behavioral Synthesis
System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan,
Guoling Han, Wei Jiang, Zhiru Zhang
August, 2005
Supported by NSF, GSRC, Altera, Xilinx.
Outline
 Motivation
 xPilot
system framework
 Overview
of the synthesis engine
 Scheduling
 Resource binding
 Experimental
results
2
Motivation (1)
 Design
Complexity is outgrowing the traditional RTL
method
 Feasible to build SoC device with 500M transistors; Billiontransistor chips are on the horizon
 Behavioral synthesis  a critical technology for enabling the
move to higher level of abstraction
 Reasons for previous failures
• Lack of a compelling reason: design complexity is still manageable a
•
•
decade of ago
Lack of a solid RTL foundation
Lack of consideration of physical reality
3
Motivation (2)
 Behavioral
Synthesis provides combined advantages
 Better complexity management
• Code size: RTL design ~300KL  Behavioral design 40KL [NEC,
ASPDAC04]
 Shorter verification/simulation cycle
• Simulation speed 100X faster than RTL-based method
 Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
 Higher quality of results
• Full consideration of physical reality
4
xPilot: Platform-Based Behavioral to RTL Synthesis Flow
Behavioral spec.
in C/SystemC
Platform
description





Frontend
compiler


FPGAs/ASICs
Loop unrolling/shifting
Strength reduction / Tree height reduction
Bitwidth analysis
Memory analysis …
Core synthesis optimizations
 Scheduling
 Resource binding, e.g., functional unit
binding register/port binding
SSDM
RTL
Presynthesis optimizations
Arch-generation & RTL/constraints
generation
 Verilog/VHDL/SystemC
 FPGAs: Altera, Xilinx
 ASICs: Magma, Synopsys, …
5
System-level Synthesis Data Model
SSDM
(System-level Synthesis Data Model)
 Hierarchical netlist of concurrent processes and communication
channels
 Each leaf process contains a sequential program which is represented
by an extended LLVM IR with hardware-specific semantics
• Port / IO interfaces, bit-vector manipulations, cycle-level notations
6
Platform Modeling & Characterization
 Target platform specification
 High-level resource library with delay/latency/area/power curve
for various input/bitwidth configurations
• Functional units: adders, ALUs, multipliers, comparators, etc.
• Connectors: mux, demux, etc.
• Memories: registers, synchronous memories, etc.
 Chip layout description
• On-chip resource distributions
• On-chip interconnect delay/power estimation
7
Scheduling  Goals
 A highly versatile
scheduling engine
 Applicable to a wide range of application domains
• Computation-intensive, data/memory-intensive, control-intensive, etc.
• Mixed behavioral & RTL
 Amenable to a rich set of scheduling constraints
• Data dependency constraints
• Resource constraints: IO ports constraints, memory ports constraints,
•
•
Functional unit constraints, etc.
Timing constraints: Frequency constraint, Latency constraints, etc.
Relative IO timing constraints: Cycle-fixed mode, superstate-fixed
mode,
free-floating mode, etc.
 Retargetable to a variety of design objectives
• High performance, small area, low power, etc.
8
Scheduling  Optimization Capabilities
 Offers
a variety of optimization techniques in a unified
framework
 Combinational/Sequential non-pipelined/pipelined
multi-cycle operation
 Unconditional/Conditional operation chaining
 Relative scheduling
 Considerations of branching probabilities and repetitions
 Multi-cycle communication (under development)
 Code motion & speculation (under development)
 Functional / loop pipelining (under development)
 Physical layout integration (to be supported)
9
Scheduling  Current Status
 Design
objective
 Focus on high-performance designs
 Overall
approach
 Use a system of pairwise difference constraints to express all
kinds of scheduling constraints
 Represent the design objective in a linear function
 The system is immediately solvable via any linear programming
solver with integral solutions
10
Scheduling  Design Framework
CDFG
xPilot scheduler
Constraint equations
generation
Userspecified
design
constraints&
assignments
Relative timing constraints
Dependency constraints
Frequency constraints
Resource constraints …
Objective function generation
System of pairwise
difference constraints
Target
platform
modeling
(resource
library &
chip layout)
Linear programming solver
LP solution interpretation
STG (State
Transition Graph)
11
Example : Greatest Common Divisor

GCD C description
BB1
x = inport1;
y = inport2;
while (x != y) {
if ( x > y )
x = x – y;
else y = y – x;
}
*outport = x;
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
T
BB2
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
T
BB3
T
BB4
x_2 = x1 – y1;
cond3 = (x_2 != y_1);
y_2 = y1 – x1;
cond4 = (x_1 != y_2);
T
BB5
x_3 = (x_0, x_1, x_2);
*outport = x_3;
12
Constraints Generation
 Data
dependency constraint
 Operation v is data dependent on operation u, i.e., (u, v)E
s(v) – s(u)  0 where schedule variable s(v) represents the relative schedule of
node v
u: x_1 = (x_0, x_1, x_2);
v: cond2 = (x_1 > y_1);
 Other constraints can be represented in a similar way …
 The
constraint equations form a system of pairwise difference
constraints
 Matrix A is totally unimodular
 Feasibility check can be formulated as a single-source shortest path problem
 Optimizations can be performed via any LP solver; the dual problem is
equivalent to a min-cost network flow problem
13
Solution by LP Solver

Scheduling are
performed across
the basic block
boundaries
BB1
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
0
BB2
T
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
T
BB3
T
BB4
x_2 = x1 – y1;
cond3 = (x_2 != y_1);
1
y_2 = y1 – x1;
cond4 = (x_1 != y_2);
T
BB5
x_3 = (x_0, x_1, x_2);
*outport = x_3;
14
Schedule Interpretation
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
x_2 = x1 - y1;
cond3 = (x_2 != y_1);
y_2 = y1 - x1;
cond4 = (x_1 != y_2);
x_3 = (x_0, x_1, x_2);
*outport = x_3;
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
if (cond1) {
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
if (cond2) {
x_2 = x1 - y1;
cond3 = (x_2 != y_1); }
else {
y_2 = y1 - x1;
cond4 = (x_1 != y_2); }
}
if (!cond1 || !cond3&&!cond4) {
x_3 = (x_0, x_1, x_2);
15
*outport = x_3; }
Deriving State Transition Graph
 Final
STG for GCD
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
if (cond1) {
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
if (cond2) {
x_2 = x1 - y1;
cond3 = (x_2 != y_1); }
else {
y_2 = y1 - x1;
cond4 = (x_1 != y_2); }
}
if (!cond1 || !cond3&&!cond4) {
x_3 = (x_0, x_1, x_2);
*outport = x_3; }
cond3 || cond4
16
Unified Resource Binding
 Provides
an unified resource sharing framework to
optimize for various design objectives
 Simultaneous functional unit binding, register binding and port
binding
 Equipped with advanced techniques to optimized the interconnect
and steering logic networks
 Guided by a flexible cost evaluation engine to achieve different
objectives, e.g., performance, area, power, etc.
 Extendable to exploit physical layout information
17
An FU/Register binding Example
R1
R2
R3
F1
R4
R1
R3
MUX
F2
MUX
R4
MUX
F1
R5
Case 1
R5
Case 2
(a)
R1
R1
R2
F1
R2
R2
Note: Assume all
operations and variables
are compatible for sharing
F2
MUX
F1
R3
Case 1
R3
Case 2
(b)
 Observations:
 Binding has large impact to the resulting performance and cost
 Functional unit and register binding are highly correlated
18
Drawbacks of Previous Work
 Many
existing algorithms focus on functional-unit- or register“number” minimization
 Technology advances – interconnect effect increasing
• 51% of the total dynamic power of a microprocessor in 0.13um tech.
• Up to 80% of the dynamic power in future technologies
 May generate larger amount of multiplexers and interconnects
 Unfavorable performance and cost results
 Optimization
for unrealistic goals
 Minimize “number” of FUs, registers, or multiplexors
• Should have detailed datapath models to guide the optimization
 No technology specific consideration
• Should have platform-specific characterizations
19
Resource Binding in xPilot
STG (State
Transition Graph)
Baseline Register Binding
Userspecified
design
constraints
FU Allocation/Binding
Register Allocation/Binding
Target
platform
(resource
library &
chip layout)
xPilot
architecture exploration
Iteration
Yes
Datapath model for
performance-cost
estimation
Improved??
No
STG + Best Datapath Models
20
Design Space Exploration

 Exploring Node 2:
• (1) (2) two mul
• (1, 2) one mul
<
 ….
C1
’
2*, 3*
3*
4*
5*
2*
4*
C2
5*
<
C2’
6+
Compatible Graphs
A State Transition Graph
(STG)
power
 Exploring Node 4:
•
•
•
•
•
•
•
>
C1
6+
1*
 Exploring Node 3:
• (1) (2) (3) three mul
• (1, 2) (3) two mul
• (1, 3) (2) two mul
1*
>
Exploration phases:
(1) (2) (3) (4)
pruned
(1, 2, 4) (3)
(1, 2) (3, 4)
MUL
MUL
(1, 2) (3) (4)
(1, 3, 4) (2)
(1, 3) (2, 4)
(1, 3) (2) (4)
Datapath for
solution (1, 2, 4) (3)
Datapath Model
delay
Curve for Design
21
Space Pruning
Experimental Results  Benchmark Suite

Benchmark suite
 PR, MCM:
• DSP kernels: pure additions/subtractions and multiplications
 CACHE
• Cache controller: control-intensive designs with cycle-accurate I/O operations
 MOTION:
• Motion compensation algorithm for MPEG-1 decoder: control-intensive with modest
amount of computations
 IDCT:
• JPEG inverse discrete cosine transform: computation intensive
 DWT:
• JPEG2000 discrete wavelet transform: computation intensive with modest control
flow
 EDGELOOP:
• Extracted from H.264 decoder: a very complex design, features a mix of
computation, control, and memory accesses
22
Experimental Results  Code Size Reduction
23
Experimental Results  Comparison with SPARK On Scheduling
 SPARK
[UCI/UCSD, 2004], a state of the art academic highlevel synthesis tool
Synthesis
Designs
MOTION
PR
IDCT
CACHE
Tool/Flow
Altera Quartus II report
Report
state#
reg#
fmax (MHz)
LE
register
mem
dsp
spark
13
18
170.8
666
367
0
4
xpilot
24
11
161.2
888
266
0
4
spark
13
36
130.6
508
491
0
32
xpilot
13
40
178.7
1,349
783
0
0
spark
176
~400
72.01
10,847
4,547
0
138
xpilot
141
413
105.53
11,481
5,627
0
64
xpilot-mem
334
451
162.9
9,351
6,098
1,024
64
spark
xpilot-mem
Memory unsupported
47
16
161.6
371
265
3072
0
24
Experimental Results  Comparison with SPARK On Binding
SPARK
xPilot
LE
COMB
LonelyReg
CombReg
DSP
(MHz)
LE
COMB
LonelyReg
CombReg
DSP
(MHz)
Fmax
Ratio
xPilot/
SPARK
PR
1108
815
0
293
0
123.53
1349
713
84
552
0
178.7
1.45
WANG
1217
942
0
275
0
118.89
1105
527
62
516
8
166.11
1.40
LEE
1367
1052
0
315
0
119.32
1585
691
207
687
4
166.61
1.40
MCM
2808
2248
0
560
0
74.87
2402
981
73
1348
0
152.56
2.04
DIR
2425
2034
0
391
6
69.38
3489
1752
110
1627
4
146.8
2.12
FEIG
16170
13136
0
3034
6
37.17
10539
2295
240
8004
4
173.49
4.67
Total
25095
20227
0
4868
12
543.16
20469
6959
776
12734
20
984.27
1.81
Ave
Ratio
1
1
1
1
1
1
1.17
0.65
n/a*
2.96
n/a*
2.48
2.48
Resource Usage
Designs

Fmax
Resource Usage
Fmax
On average, xPilot resource binding achieves designs with similar area, and 2.48x higher
frequency over Spark
25
Synthesis Results for DWT (JPEG2000)
 Settings
 Target platform: Altera Stratix
 RTL synthesis & place-and-route: Altera QuartusII v5.0
 Simulation: Mentor ModelSim SE6.0
 Design
alternatives
Target cycle time
State# fmax(MHz)
Cycle# Latency (ns)
LE#
DSP#
9ns
34
123.56
4830
39.1
1777
128
7ns
36
147.28
5211
35.4
1862
128
5.5ns
51
183.62
6926
37.8
1926
128
26
Experimental Results: ASIC Flow
Magma
RTL to GDSII flow
Technology
Tradeoff
library: Cadence Generic Standard Cell Library 0.18um
study:
 1st column: delay constraint enforced in xPilot
 2nd column: control step count of xPilot generated RTL
 3rd-5th column: data reported after mapping by Magma tool
DIR
State
#
Cell
count
Area(u2)
Delay(ps)
Fmax(MHz) Latency(ps)
5ns
5
17555
1256584
2111
473.71
10555
10ns
3
23077
1332203
2139
467.51
6417
15ns
2
28381
1486487
2181
458.51
4362
20ns
2
27189
1394451
2514
397.77
5028
30ns
1
27797
1401642
2725
366.97
2725
27
Experimental Results: ASIC Flow (cont.)
LEE
State#
Cell
count
Area(u2)
Delay(ps) Fmax(MHz) Latency(ps)
5ns
8
8242
509807
2066
484.03
16528
10ns
4
15989
708870
2254
443.66
9016
15ns
2
16698
703381
3423
292.14
6846
20ns
2
15256
656147
4226
236.63
8452
30ns
1
16085
697363
5070
197.24
5070
State#
Cell
count
Motion
Area(u2)
Delay(ps) Fmax(MHz) Latency(ps)
10ns
35
16474
909721
2107
474.61
73745
15ns
30
15695
847262
2358
424.09
70740
20ns
28
16463
867898
2498
400.32
69944
30ns
28
15807
852573
2563
390.17
71764
28