FPGA Power Reduction Using Configurable Dual-Vdd

Download Report

Transcript FPGA Power Reduction Using Configurable Dual-Vdd

Simultaneous Time Slack Budgeting and Retiming
for Dual-Vdd FPGA Power Reduction
Yu Hu1, Yan Lin1, Lei He1 and Tim Tuan2
1EE
Department, UCLA
2Xilinx
Research Lab
Presented by Yu Hu
Partially supported by NSF.
Address comments to [email protected]
Outline
 Background, Motivation and Problem Formulation
 Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
 Simultaneous Vdd Level Assignment and Retiming
 Experimental Results
 Conclusions
Background
 Existing FPGAs are power inefficient compared to ASICs.
 Interconnect is the dominant component of FPGA power
dissipation (dynamic and leakage). [Li, TCAD‘05]
 Power aware FPGA architectures and CAD algorithms have
been studied extensively.

CAD algorithms to minimize power-delay product

Configuration inversion for leakage reduction

Vdd-programmable FPGA logic blocks
[Lamoureux, ICCAD’03]
[Anderson, FPGA’04]
[Li, FPGA’04] [Li, DAC’04]

Vdd-programmable FPGA interconnects
[Li, ICCAD’04] [Gayasen, FPL’04] [Anderson, ICCAD’04] [Lin, DAC’05]
Vdd Programmable Interconnect Arch.
 Island style and mixed wire segment length.
 Routing switch/connection block (Two PMOS power transistors M3 and
M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.)
[Li, ICCAD’04]
 Level converter free in routing tree (Guarantee that no VddL switch
drives VddH switches.) with LEAST area and power penalty
[Lin, TCAD’06].
Limitation of Existing Approaches
 The most recent work [Lin, DAC'05] for programmable dualVdd FPGA considers timing slack budgeting to maximize
power reduction


Uniform wire segment length was assumed, and can not be extended to
mixed wire segment.
Vdd level assignment was performed in combinational sub-circuits.
 Simultaneous retiming and timing budgeting has been
studied to reduce area or improve performance.
[Yeh, DAC'03] [Yeh, ICCAD'03]
 Power reduction has not been considered.
 Post-layout flip-flop binding constraints were not addressed.
Call for Simultaneous Vdd Assignment and Retiming
 [Lin, DAC'05] performs Vdd level assignment in the
combinational sub-circuit, which limits the searching space.
Interconnect
Delay
3 (1)
4 (0)
3 (1)
Combinational Assign
4 (0)
3 (1)
4 (0)
4 (0)
Sequential
3 (1)
4 (0)
4 (0)
2 units slack needed for a
Timing Slack VddL switch insertion
4 (0)
2 (2)
Assignment
All VddH
Switches !
Movable with
Retiming!
VddL Switch
Inserted !
 Simultaneous retiming and Vdd assignment explores larger
searching space to extract more useful timing slack for VddL
switches insertion.
Major Contributions
Synthesis / Placement / routing
CLB level Vdd
assignment
Vdd level assignment for mixed
wire segments FPGAs.
53% interconnect power reduction
is achieved compared to single Vdd
designs.
Min-clock retiming
Interconnect Vdd
assignment
Power-aware post-layout
re-synthesis processes:
Sequential vs. Simultaneous
Simultaneous retiming and
interconnect Vdd assignment
Global refinement
Simultaneous retiming and
interconnect Vdd assignment with
flip-flop binding constraints. Up to
20% further interconnect power
reduction is achieved compared to
sequential flow.
Problem Formulations
[ Dual-Vdd Level Assignment Problem ]
Given: placement and routing results of a FPGA design
Find: A Vdd-level assignment to each interconnect switch
Objective: Minimize interconnect (dynamic and leakage) power
Constraints:


Meet the delay target Tspec
Vdd-level converters are inserted ONLY at CLB inputs/outputs
[ Simultaneous Retiming and Dual-Vdd Level Assignment Problem ]
Same to Dual-Vdd level assignment problem in addition to:
 Retiming as an extra design freedom
 Satisfy post-layout flip-flop binding constraints.
Outline
 Background, Motivation and Problem Formulation
 Chip-level Vdd-level Assignment Algorithm [for mixed
length wire segments]
 Interconnect Power Reduction Estimation
 LP Based Vdd-level Assignment Algorithm
 Simultaneous Vdd Level Assignment and Retiming
 Experimental Results
 Conclusions
Delay and Power Model for Interconnect
 Delay Model


Intrinsic delay and effective driving resistance of
switch has been pre-characterized using SPICE.
Elmore delay is used to calculate routing delay.
 Interconnect Power Model


Dynamic power Pd(Vddjj)=0.5fclk*C*Vddjj2
Leakage power Pl(Vddjj) is pre-characterized using
SPICE
 Interconnect power reduction estimation is the
essential part of dual-Vdd assignment algorithm.
Review of Vdd Level Assignment Algorithm
b4
Pdr(j)=0.5fclk*C(j)*f(j)*ΔVdd2
b3
b1
Plr(j)=f(j)*ΔPlr(j)
Power reduction estimation
b2
S1=1
S2=3
Vdd assignment
base on estimation
f(1)
f(2)
f(3)
f(4)
=
=
=
=
.5
1
1
1.5
VddL possibility
for switches
[Lin, DAC'05]
b4
b3
b1
b2
S1=1
S2=3
Timing Slack
assigned at sinks
...
Interconnect power reduction estimation
...
Interconnect Vdd
assignment
Chip-level timing slack
allocation (LP)
Net-level Bottom-up
Vdd assignment
Refinement
Problem remained: How to
calculate VddL possibility for
mixed wire segment?
The net-level bottom-up Vdd
assignment guarantees the
legalization of final solutions.
[Lin, DAC’05]
Leverage all extra slack
with VddL switches
[Lin, DAC’05]
VddL Possibility Calculation
 Represent timing slack in number of switches:

si = Li * ( Si / Di )



si is the number of VddL switches can be inserted in the path from
source to jth sink in the routing tree.
Li is the number of switches along this path.
si : how many switches can be turned to VddL along sourceto-sink-i path for the given timing slack Si.
 VddL possiblity for switch j at sink i based on load capacity:


f(i,j) = si * (cij / Ci)
Key idea: distribute timing slack to each switch based on cap.
b4, 16x
b1, 8x
L2 = 3
D2 = 12
s2 = 3*(10/12)=5/2
S1=6
b3, 16x
b2, 8x
S2=10
f(2,2) = 1
f(2,3) = 1
f(2,4) = 1/2
Power Reduction Estimation for Mixed Wire Segments
 The lower bound estimation [Y. Lin, DAC'05] for interconnect
power reduction is no longer valid for mixed wire segments.
1.7 slack left -1.8
needed!
fn(i,1) = 0.9
fn(i,2) = 0.5
lower bound of VddL
switches = 0.9 + .5 = 1.4
b1, 16x, need 1.8
slack
b2, 8x, need 1.0
slack
S = 2.7
Only 1.0 VddL
switch assignment
Consume 1.0
S = 2.7
Problem here: Lower bound > actual number!
 Our solution: develop the upper bound estimation of VddL
switch number
Sum up all VddL
possibility


Consistent upper bound of power reduction
Remove the non-linear term "min" and the corresponding extra
LP constraints from lower bound estimation
LP formulation for dual-Vdd Level Assignment
 Basic timing constraints
Arrival time for prim-output
Arrival time for prim-input
Arrival time constraints
 Slack constraints
Slack upper bound
Slack constraints
Slack non-negative
 Objective function
Leakage power reduction
upper bound
Dynamic power reduction
upper bound
Outline
 Motivation
 Problem Formulations
 Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
 Simultaneous Vdd Level Assignment and Retiming



MILP formulation for retiming FPGA circuits
Extra constraints for post-layout FPGA retiming
Link between MILP retiming to timing budgeting
 Experimental Results
 Conclusions
Retiming for FPGA
 Retiming graph is a directed cyclic graph.
 Given the retiming graph G=(V, E), a retiming is an integervalued vertex-labeling r: V → Z.
 A weight is w(u,v) associated with edge e(u,v) denotes the
number of FFs in that edge.
 After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
FF_NODE
SUBBLK_OPIN
Retiming
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(B)
(C)
Link between MILP retiming & timing budgeting
 Extend MILP formulation in [Leiserson, Algorithmica’91] to link
arrival time with retiming labeling
The real value a(v) assigned in
node v is its arrival time after
retiming
 Timing slack in edge (u,v) is represented by
linearize
R(v) = r(v) + a(v) /c,
 Timing slack in connection from sink Sk to the source
of routing tree Ri
Retiming Constraints 1:
Placement and Flip-Flop Binding Constraints
 Keep both placement and routing unchanged after retiming.


No FFs in global interconnect (inter-CLB)
No FFs in local interconnect (intra-CLB and inter-SLICE)
 Within a single SLICE, only FF_NODE → SUBBLK_OPIN edges
allow FF insertion.
SUBBLK_IPIN
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
SUBBLK_OPIN
SUBBLK_IPIN
No way to assign this FF in
any SLICE physically!
SUBBLK_OPIN
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(A)
FF_NODE
FF_NODE
(B)
FF# can be further
reduced!
 Extra constraints in MILP formulation:
(C)
The only timing edge that
can insert FFs
Retiming Constraints 2:
LUT Delay and FF Setup &Hold Time Constraints
Delay = T_seq_out
Delay = 0
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
FF_NODE
SUBBLK_OPIN
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(A)
Delay = T_comb
FF hold
time
LUT delay
(B)
Delay = T_seq_in
FF# in edge (e)
 Delay constraints for timing edges within SLICE:
FF setup time
+ LUT delay
Outline
 Motivation
 Problem Formulations
 Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
 Simultaneous Vdd Level Assignment and Retiming
 Experimental Results



Dual-Vdd Assignment for FPGAs with Mixed Wire
Segments
Simultaneous Vdd Level Assignment and Retiming
A runtime Efficient Post-Layout Re-Synthesis CAD Flow
 Conclusions
Experimental Setting
 Cluster-based Island Style FPGA Structure




Size-10 cluster and size-4 LUT
100% buffered interconnects, subset switch block
60% length-4 and 40% length-8l wire segments
25x buffer for length-4 and 10x buffer for length-8
 ITRS 100nm technology, 1.3v for VddH and 0.8v for VddL
 Use VPR [Betz-Rose-Marquardt] for placement and routing
 Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation


Considering short-circuit power, glitch power and input vector
8% average error compared to SPICE simulation
 10 biggest sequential MCNC benchmarks are tested
 Use mosek [student license, www.mosek.com] to solve
LP and MILP
Experimental Results: Dual-Vdd Assignment
for FPGAs with Mixed Wire Segments
 EdTLC-LP algorithm achieves 85% VddL assignment.
 EdTLC-LP algorithm achieves 53% interconnect power
reduction for mixed length interconnect wire on average.
S-VDD
DUAL-VDD
Circuit
Cluster#
power
VddL
power
clock
Runtime (s)
tseng
131
10.2
96%
3.9(-62%)
13.1
32
dsip
162
111.4
72%
66.5(-40%)
6.2
44
diffeq
195
10.2
88%
4.0(-61%)
13.5
41
s298
256
23.3
83%
10.9(-53%)
24.3
96
bigkey
294
96.1
73%
57.9(-40%)
7.0
89
elliptic
421
35.9
90%
14.9(-58%)
17.3
229
frisc
595
29.6
98%
11.1(-63%)
23.7
479
s38584.1
704
121.1
92%
51.0(-58%)
11.9
421
s38417
847
125.8
82%
67.2(-47%)
15.7
709
clma
1358
155.4
75%
76.9(-51%)
23.2
2735
ave
496
77.7
85%
36.4(-53%)
15.5
488
Experimental Results – Simultaneous vs. Sequential
 Simultaneous Retiming and Slack Budgeting vs. Sequential
Approach (Delay-Optimal Retiming + Slack Budgeting)
[Those circuits with VddL < 85% are selected]
SEQUE NTI AL APPROACH
SI MULTANE OUS
ci rcui t
VddL
po wer
cl ock
VddL
po wer
dsi p
72 %
66. 5
6. 2( 0 %)
71 %
66. 5( 0 %)
s298
83 %
10. 9
24. 3( 0 %)
83 %
10. 8(- 1 %)
bi gkey
73 %
57. 8
6. 8(- 2 %
)
79 %
46. 1(- 20 %
)
s38417
82 %
66. 7
15. 7( 0 %)
82 %
66. 1(- 1 %)
cl ma
75 %
76. 9
23. 2( 0 %)
75 %
74. 4(- 3 %)
ave
77 %
55. 76
15. 24(- 1 %
)
78 %
52. 8(- 5 %
)
 Simultaneous approach gains 5% on average and up to 20%
further power reduction compared to sequential one.
Runtime Efficient CAD Flow
Synthesis / Placement
/ routing / routing
Synthesis / Placement
 Simultaneous approach has 10x
more runtime overhead compared to
place&route.
CLB level Vdd
CLB level Vdd
assignment
assignment
 DO NOT need to perform
simultaneous approach for every
single design.
Min-clock retiming
 Indicators for simultaneous gain
 High percentage of VddL assignment
Interconnect
Min-clock retiming
Interconnect Vdd
assignment
Vdd
will not lead to gain from simultaneous
assignment
approach
 Little gain from min-clock retiming
indicates little room for improvement by
simultaneous approach
VddL > 85% ?
Min-clock retiming gains little?
No
Yes
SOLUTION
Do Simultaneous Procedure
only when necessary
Simultaneous retiming and
interconnect Vdd assignment
Simultaneous retiming and
interconnect Vdd assignment
Global refinement
Global refinement
Outline
 Motivation
 Problem Formulations
 Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
 Simultaneous Vdd Level Assignment and Retiming
 Experimental Results
 Conclusions
Conclusions
 A chip-level dual-Vdd assignment algorithm for mixed
length wire segment. Experimental results show that
reduces interconnect power by 53% on average
compared to single-Vdd FPGA designs.
 A MILP based simultaneous timing budgeting and
retiming formulation which further reduces interconnect
power up to 20% compared to min-clock retiming
followed Vdd assignment.
 A runtime efficient post-layout re-synthesis CAD flow.

Do simultaneous procedure only when necessary.
Thank you!
Q/A
Extra Slides for Q/A
EdTLC-LP : Net-level Bottom-up Assignment
 Perform bottom-up assignment within each tree to leverage the
allocated slacks
 Bottom-up assignment



Assign low-Vdd to switches in the routing tree in a bottom-up fashion
Slack is reduced by d in each step
Stop the process until no slack left
 Theorem: the bottom-up assignment is optimal
Major Contributions
 Present a tight estimation of power reduction upper bound for
mixed-length interconnect in FPGAs.
 Develop a linear programming (LP) based slack budgeting
and Vdd level assignment algorithm for mixedlength
interconnect FPGAs.

The experimental results show 53% interconnect power
reduction on average compared to singleVdd interconnects.
 Propose a mixed integer and linear programming (MILP)
based simultaneous retiming and slack budgeting for power
reduction while considering placement and flip-flop (FF)
binding constraints.

The experimental results show up to 20% interconnect power
reduction compared to the sequential approach (retiming
followed by slack budgeting).
Vdd Programmable Interconnect Arch.
 Island style routing architecture.
 Mixed wire length (60% length 4 wire and 40% length 8 wire).
 Routing switch/connection block (Two PMOS power transistors M3 and
M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.)
source
 Level converter free (Guarantee that no VddL switch drives VddH switches.)
Timing Slack vs. VddL Switch Number
 Timing Slack Sij of a connection between source and jth sink in
routing tree Ri, = the amount of delay which could be added to
this connection without increasing the cycle time Tspec.
 Timing Slack Sij indicates the number of VddL switches.
 Useful Slack: Timing Slack Sij is bounded due to the number
of switches in connection between source and jth sink in
routing tree Ri,. Extra slack will NOT lead to more VddL
switches!
b4
b1
1 unit slack is
needed for VddL
s1=1
b4
b3
b1
b2
s2=1
s1=2
b3
Useful Slack =
3
b2
s2=4
 Timing Slack Bounding Constraint: 0 ≤ Sij ≤ Dij
[Dij is the delay increase of the path from source to jth sink by
setting VddL to all the switches in this path]
Retiming for LUT based FPGA
 Retiming graph is a directed cyclic graph.
 Given the retiming graph G=(V, E), a retiming is an integervalued vertex-labeling r: V → Z.
 A weight is w(u,v) associated with edge e(u,v) denotes the
number of FFs in that edge.
 After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
LUT
1
C
G
CLB1
I1
PI2
E
D
LUT
2
SLICE
2
SLICE
1
B
B
D
C
H
O1
F
G
O2
PO
H
O1
I3
I3
(a) FPGA Circuit
Retiming
I1
A
FF
PI2
PI1
I2
F
CLB2
PI
2
I2
O2
PO
(b) Retiming Graph
r(G) = 1
r(D) = 1
r(F) = 1
A
B
LUT
1
C
FF
E
D
LUT
2
SLICE
2
I1
A
PI
1
PI
2
I2
SLICE
1
PI
1
G
CLB1
PI1
I2
I1
B
A
F
D
C
H
O1
G
(a) FPGA Circuit
O2
PO
H
O1
I3
I3
CLB2
F
O2
PO
(b) Retiming Graph
MILP Based Retiming Formulation
Extended from MILP formulation [Leiserson, Algorithmica’91]
 Let G = (V, E, d, w) be a synchronous circuits, and let c be a
positive real number. Then there exists a retiming r of G such
that Φ(Gr) ≤ c if and only if there exists an assignment of a real
value a(v) and an integer value r(v) to each vertex v such that the
following conditions are satisfied:
 Let R(v) = r(v) + a(v) /c, then this formula can be rewritten as
Runtime Efficient CAD Flow
 Runtime overhead of post-layout re-synthesis processes
CRT+SB
RTSB
circuit
total
b-Search
total
tseng
43
4
44
dsip
74
9
66
diffeq
192
124
85
s298
654
470
230
bigkey
302
77
254
elliptic
575
145
473
frisc
1286
171
2402
s38584
1072
193
1092
s38417
1687
1029
863
clma
9736
2427
8580
ave
1562
465
1409
 High percentage of VddL assignment
after won‘t lead to gain from RTSB
 Little gain from min-clock retiming
indicates little room for improvement
RTSB
Placement / routing
Chip-level vdd
assignment
Min-clock retiming
Vdd assignment
VddL > 85% ?
Min-clock retiming gains little?
No
Yes
RTSB
by
Global refinement