Transcript xPilot
xPilot A Platform-Based Behavioral Synthesis
System
Prof. Jason Cong
Students: Deming Chen, Yiping Fan,
Guoling Han, Wei Jiang, Zhiru Zhang
August, 2005
Supported by NSF, GSRC, Altera, Xilinx.
Outline
Motivation
xPilot
system framework
Overview
of the synthesis engine
Scheduling
Resource binding
Experimental
results
2
Motivation (1)
Design
Complexity is outgrowing the traditional RTL
method
Feasible to build SoC device with 500M transistors; Billiontransistor chips are on the horizon
Behavioral synthesis a critical technology for enabling the
move to higher level of abstraction
Reasons for previous failures
• Lack of a compelling reason: design complexity is still manageable a
•
•
decade of ago
Lack of a solid RTL foundation
Lack of consideration of physical reality
3
Motivation (2)
Behavioral
Synthesis provides combined advantages
Better complexity management
• Code size: RTL design ~300KL Behavioral design 40KL [NEC,
ASPDAC04]
Shorter verification/simulation cycle
• Simulation speed 100X faster than RTL-based method
Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
Higher quality of results
• Full consideration of physical reality
4
xPilot: Platform-Based Behavioral to RTL Synthesis Flow
Behavioral spec.
in C/SystemC
Platform
description
Frontend
compiler
FPGAs/ASICs
Loop unrolling/shifting
Strength reduction / Tree height reduction
Bitwidth analysis
Memory analysis …
Core synthesis optimizations
Scheduling
Resource binding, e.g., functional unit
binding register/port binding
SSDM
RTL
Presynthesis optimizations
Arch-generation & RTL/constraints
generation
Verilog/VHDL/SystemC
FPGAs: Altera, Xilinx
ASICs: Magma, Synopsys, …
5
System-level Synthesis Data Model
SSDM
(System-level Synthesis Data Model)
Hierarchical netlist of concurrent processes and communication
channels
Each leaf process contains a sequential program which is represented
by an extended LLVM IR with hardware-specific semantics
• Port / IO interfaces, bit-vector manipulations, cycle-level notations
6
Platform Modeling & Characterization
Target platform specification
High-level resource library with delay/latency/area/power curve
for various input/bitwidth configurations
• Functional units: adders, ALUs, multipliers, comparators, etc.
• Connectors: mux, demux, etc.
• Memories: registers, synchronous memories, etc.
Chip layout description
• On-chip resource distributions
• On-chip interconnect delay/power estimation
7
Scheduling Goals
A highly versatile
scheduling engine
Applicable to a wide range of application domains
• Computation-intensive, data/memory-intensive, control-intensive, etc.
• Mixed behavioral & RTL
Amenable to a rich set of scheduling constraints
• Data dependency constraints
• Resource constraints: IO ports constraints, memory ports constraints,
•
•
Functional unit constraints, etc.
Timing constraints: Frequency constraint, Latency constraints, etc.
Relative IO timing constraints: Cycle-fixed mode, superstate-fixed
mode,
free-floating mode, etc.
Retargetable to a variety of design objectives
• High performance, small area, low power, etc.
8
Scheduling Optimization Capabilities
Offers
a variety of optimization techniques in a unified
framework
Combinational/Sequential non-pipelined/pipelined
multi-cycle operation
Unconditional/Conditional operation chaining
Relative scheduling
Considerations of branching probabilities and repetitions
Multi-cycle communication (under development)
Code motion & speculation (under development)
Functional / loop pipelining (under development)
Physical layout integration (to be supported)
9
Scheduling Current Status
Design
objective
Focus on high-performance designs
Overall
approach
Use a system of pairwise difference constraints to express all
kinds of scheduling constraints
Represent the design objective in a linear function
The system is immediately solvable via any linear programming
solver with integral solutions
10
Scheduling Design Framework
CDFG
xPilot scheduler
Constraint equations
generation
Userspecified
design
constraints&
assignments
Relative timing constraints
Dependency constraints
Frequency constraints
Resource constraints …
Objective function generation
System of pairwise
difference constraints
Target
platform
modeling
(resource
library &
chip layout)
Linear programming solver
LP solution interpretation
STG (State
Transition Graph)
11
Example : Greatest Common Divisor
GCD C description
BB1
x = inport1;
y = inport2;
while (x != y) {
if ( x > y )
x = x – y;
else y = y – x;
}
*outport = x;
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
T
BB2
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
T
BB3
T
BB4
x_2 = x1 – y1;
cond3 = (x_2 != y_1);
y_2 = y1 – x1;
cond4 = (x_1 != y_2);
T
BB5
x_3 = (x_0, x_1, x_2);
*outport = x_3;
12
Constraints Generation
Data
dependency constraint
Operation v is data dependent on operation u, i.e., (u, v)E
s(v) – s(u) 0 where schedule variable s(v) represents the relative schedule of
node v
u: x_1 = (x_0, x_1, x_2);
v: cond2 = (x_1 > y_1);
Other constraints can be represented in a similar way …
The
constraint equations form a system of pairwise difference
constraints
Matrix A is totally unimodular
Feasibility check can be formulated as a single-source shortest path problem
Optimizations can be performed via any LP solver; the dual problem is
equivalent to a min-cost network flow problem
13
Solution by LP Solver
Scheduling are
performed across
the basic block
boundaries
BB1
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
0
BB2
T
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
T
BB3
T
BB4
x_2 = x1 – y1;
cond3 = (x_2 != y_1);
1
y_2 = y1 – x1;
cond4 = (x_1 != y_2);
T
BB5
x_3 = (x_0, x_1, x_2);
*outport = x_3;
14
Schedule Interpretation
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
x_2 = x1 - y1;
cond3 = (x_2 != y_1);
y_2 = y1 - x1;
cond4 = (x_1 != y_2);
x_3 = (x_0, x_1, x_2);
*outport = x_3;
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
if (cond1) {
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
if (cond2) {
x_2 = x1 - y1;
cond3 = (x_2 != y_1); }
else {
y_2 = y1 - x1;
cond4 = (x_1 != y_2); }
}
if (!cond1 || !cond3&&!cond4) {
x_3 = (x_0, x_1, x_2);
15
*outport = x_3; }
Deriving State Transition Graph
Final
STG for GCD
x_0 = inport1;
y_0 = inport2;
cond1 = (x_0 != y_0);
if (cond1) {
x_1 = (x_0, x_1, x_2);
y_1 = (y_0, y_1, y_2);
cond2 = (x_1 > y_1);
if (cond2) {
x_2 = x1 - y1;
cond3 = (x_2 != y_1); }
else {
y_2 = y1 - x1;
cond4 = (x_1 != y_2); }
}
if (!cond1 || !cond3&&!cond4) {
x_3 = (x_0, x_1, x_2);
*outport = x_3; }
cond3 || cond4
16
Unified Resource Binding
Provides
an unified resource sharing framework to
optimize for various design objectives
Simultaneous functional unit binding, register binding and port
binding
Equipped with advanced techniques to optimized the interconnect
and steering logic networks
Guided by a flexible cost evaluation engine to achieve different
objectives, e.g., performance, area, power, etc.
Extendable to exploit physical layout information
17
An FU/Register binding Example
R1
R2
R3
F1
R4
R1
R3
MUX
F2
MUX
R4
MUX
F1
R5
Case 1
R5
Case 2
(a)
R1
R1
R2
F1
R2
R2
Note: Assume all
operations and variables
are compatible for sharing
F2
MUX
F1
R3
Case 1
R3
Case 2
(b)
Observations:
Binding has large impact to the resulting performance and cost
Functional unit and register binding are highly correlated
18
Drawbacks of Previous Work
Many
existing algorithms focus on functional-unit- or register“number” minimization
Technology advances – interconnect effect increasing
• 51% of the total dynamic power of a microprocessor in 0.13um tech.
• Up to 80% of the dynamic power in future technologies
May generate larger amount of multiplexers and interconnects
Unfavorable performance and cost results
Optimization
for unrealistic goals
Minimize “number” of FUs, registers, or multiplexors
• Should have detailed datapath models to guide the optimization
No technology specific consideration
• Should have platform-specific characterizations
19
Resource Binding in xPilot
STG (State
Transition Graph)
Baseline Register Binding
Userspecified
design
constraints
FU Allocation/Binding
Register Allocation/Binding
Target
platform
(resource
library &
chip layout)
xPilot
architecture exploration
Iteration
Yes
Datapath model for
performance-cost
estimation
Improved??
No
STG + Best Datapath Models
20
Design Space Exploration
Exploring Node 2:
• (1) (2) two mul
• (1, 2) one mul
<
….
C1
’
2*, 3*
3*
4*
5*
2*
4*
C2
5*
<
C2’
6+
Compatible Graphs
A State Transition Graph
(STG)
power
Exploring Node 4:
•
•
•
•
•
•
•
>
C1
6+
1*
Exploring Node 3:
• (1) (2) (3) three mul
• (1, 2) (3) two mul
• (1, 3) (2) two mul
1*
>
Exploration phases:
(1) (2) (3) (4)
pruned
(1, 2, 4) (3)
(1, 2) (3, 4)
MUL
MUL
(1, 2) (3) (4)
(1, 3, 4) (2)
(1, 3) (2, 4)
(1, 3) (2) (4)
Datapath for
solution (1, 2, 4) (3)
Datapath Model
delay
Curve for Design
21
Space Pruning
Experimental Results Benchmark Suite
Benchmark suite
PR, MCM:
• DSP kernels: pure additions/subtractions and multiplications
CACHE
• Cache controller: control-intensive designs with cycle-accurate I/O operations
MOTION:
• Motion compensation algorithm for MPEG-1 decoder: control-intensive with modest
amount of computations
IDCT:
• JPEG inverse discrete cosine transform: computation intensive
DWT:
• JPEG2000 discrete wavelet transform: computation intensive with modest control
flow
EDGELOOP:
• Extracted from H.264 decoder: a very complex design, features a mix of
computation, control, and memory accesses
22
Experimental Results Code Size Reduction
23
Experimental Results Comparison with SPARK On Scheduling
SPARK
[UCI/UCSD, 2004], a state of the art academic highlevel synthesis tool
Synthesis
Designs
MOTION
PR
IDCT
CACHE
Tool/Flow
Altera Quartus II report
Report
state#
reg#
fmax (MHz)
LE
register
mem
dsp
spark
13
18
170.8
666
367
0
4
xpilot
24
11
161.2
888
266
0
4
spark
13
36
130.6
508
491
0
32
xpilot
13
40
178.7
1,349
783
0
0
spark
176
~400
72.01
10,847
4,547
0
138
xpilot
141
413
105.53
11,481
5,627
0
64
xpilot-mem
334
451
162.9
9,351
6,098
1,024
64
spark
xpilot-mem
Memory unsupported
47
16
161.6
371
265
3072
0
24
Experimental Results Comparison with SPARK On Binding
SPARK
xPilot
LE
COMB
LonelyReg
CombReg
DSP
(MHz)
LE
COMB
LonelyReg
CombReg
DSP
(MHz)
Fmax
Ratio
xPilot/
SPARK
PR
1108
815
0
293
0
123.53
1349
713
84
552
0
178.7
1.45
WANG
1217
942
0
275
0
118.89
1105
527
62
516
8
166.11
1.40
LEE
1367
1052
0
315
0
119.32
1585
691
207
687
4
166.61
1.40
MCM
2808
2248
0
560
0
74.87
2402
981
73
1348
0
152.56
2.04
DIR
2425
2034
0
391
6
69.38
3489
1752
110
1627
4
146.8
2.12
FEIG
16170
13136
0
3034
6
37.17
10539
2295
240
8004
4
173.49
4.67
Total
25095
20227
0
4868
12
543.16
20469
6959
776
12734
20
984.27
1.81
Ave
Ratio
1
1
1
1
1
1
1.17
0.65
n/a*
2.96
n/a*
2.48
2.48
Resource Usage
Designs
Fmax
Resource Usage
Fmax
On average, xPilot resource binding achieves designs with similar area, and 2.48x higher
frequency over Spark
25
Synthesis Results for DWT (JPEG2000)
Settings
Target platform: Altera Stratix
RTL synthesis & place-and-route: Altera QuartusII v5.0
Simulation: Mentor ModelSim SE6.0
Design
alternatives
Target cycle time
State# fmax(MHz)
Cycle# Latency (ns)
LE#
DSP#
9ns
34
123.56
4830
39.1
1777
128
7ns
36
147.28
5211
35.4
1862
128
5.5ns
51
183.62
6926
37.8
1926
128
26
Experimental Results: ASIC Flow
Magma
RTL to GDSII flow
Technology
Tradeoff
library: Cadence Generic Standard Cell Library 0.18um
study:
1st column: delay constraint enforced in xPilot
2nd column: control step count of xPilot generated RTL
3rd-5th column: data reported after mapping by Magma tool
DIR
State
#
Cell
count
Area(u2)
Delay(ps)
Fmax(MHz) Latency(ps)
5ns
5
17555
1256584
2111
473.71
10555
10ns
3
23077
1332203
2139
467.51
6417
15ns
2
28381
1486487
2181
458.51
4362
20ns
2
27189
1394451
2514
397.77
5028
30ns
1
27797
1401642
2725
366.97
2725
27
Experimental Results: ASIC Flow (cont.)
LEE
State#
Cell
count
Area(u2)
Delay(ps) Fmax(MHz) Latency(ps)
5ns
8
8242
509807
2066
484.03
16528
10ns
4
15989
708870
2254
443.66
9016
15ns
2
16698
703381
3423
292.14
6846
20ns
2
15256
656147
4226
236.63
8452
30ns
1
16085
697363
5070
197.24
5070
State#
Cell
count
Motion
Area(u2)
Delay(ps) Fmax(MHz) Latency(ps)
10ns
35
16474
909721
2107
474.61
73745
15ns
30
15695
847262
2358
424.09
70740
20ns
28
16463
867898
2498
400.32
69944
30ns
28
15807
852573
2563
390.17
71764
28