FPGA Power Reduction Using Configurable Dual-Vdd
Download
Report
Transcript FPGA Power Reduction Using Configurable Dual-Vdd
Simultaneous Time Slack Budgeting and Retiming
for Dual-Vdd FPGA Power Reduction
Yu Hu1, Yan Lin1, Lei He1 and Tim Tuan2
1EE
Department, UCLA
2Xilinx
Research Lab
Presented by Yu Hu
Partially supported by NSF.
Address comments to [email protected]
Outline
Background, Motivation and Problem Formulation
Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
Background
Existing FPGAs are power inefficient compared to ASICs.
Interconnect is the dominant component of FPGA power
dissipation (dynamic and leakage). [Li, TCAD‘05]
Power aware FPGA architectures and CAD algorithms have
been studied extensively.
CAD algorithms to minimize power-delay product
Configuration inversion for leakage reduction
Vdd-programmable FPGA logic blocks
[Lamoureux, ICCAD’03]
[Anderson, FPGA’04]
[Li, FPGA’04] [Li, DAC’04]
Vdd-programmable FPGA interconnects
[Li, ICCAD’04] [Gayasen, FPL’04] [Anderson, ICCAD’04] [Lin, DAC’05]
Vdd Programmable Interconnect Arch.
Island style and mixed wire segment length.
Routing switch/connection block (Two PMOS power transistors M3 and
M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.)
[Li, ICCAD’04]
Level converter free in routing tree (Guarantee that no VddL switch
drives VddH switches.) with LEAST area and power penalty
[Lin, TCAD’06].
Limitation of Existing Approaches
The most recent work [Lin, DAC'05] for programmable dualVdd FPGA considers timing slack budgeting to maximize
power reduction
Uniform wire segment length was assumed, and can not be extended to
mixed wire segment.
Vdd level assignment was performed in combinational sub-circuits.
Simultaneous retiming and timing budgeting has been
studied to reduce area or improve performance.
[Yeh, DAC'03] [Yeh, ICCAD'03]
Power reduction has not been considered.
Post-layout flip-flop binding constraints were not addressed.
Call for Simultaneous Vdd Assignment and Retiming
[Lin, DAC'05] performs Vdd level assignment in the
combinational sub-circuit, which limits the searching space.
Interconnect
Delay
3 (1)
4 (0)
3 (1)
Combinational Assign
4 (0)
3 (1)
4 (0)
4 (0)
Sequential
3 (1)
4 (0)
4 (0)
2 units slack needed for a
Timing Slack VddL switch insertion
4 (0)
2 (2)
Assignment
All VddH
Switches !
Movable with
Retiming!
VddL Switch
Inserted !
Simultaneous retiming and Vdd assignment explores larger
searching space to extract more useful timing slack for VddL
switches insertion.
Major Contributions
Synthesis / Placement / routing
CLB level Vdd
assignment
Vdd level assignment for mixed
wire segments FPGAs.
53% interconnect power reduction
is achieved compared to single Vdd
designs.
Min-clock retiming
Interconnect Vdd
assignment
Power-aware post-layout
re-synthesis processes:
Sequential vs. Simultaneous
Simultaneous retiming and
interconnect Vdd assignment
Global refinement
Simultaneous retiming and
interconnect Vdd assignment with
flip-flop binding constraints. Up to
20% further interconnect power
reduction is achieved compared to
sequential flow.
Problem Formulations
[ Dual-Vdd Level Assignment Problem ]
Given: placement and routing results of a FPGA design
Find: A Vdd-level assignment to each interconnect switch
Objective: Minimize interconnect (dynamic and leakage) power
Constraints:
Meet the delay target Tspec
Vdd-level converters are inserted ONLY at CLB inputs/outputs
[ Simultaneous Retiming and Dual-Vdd Level Assignment Problem ]
Same to Dual-Vdd level assignment problem in addition to:
Retiming as an extra design freedom
Satisfy post-layout flip-flop binding constraints.
Outline
Background, Motivation and Problem Formulation
Chip-level Vdd-level Assignment Algorithm [for mixed
length wire segments]
Interconnect Power Reduction Estimation
LP Based Vdd-level Assignment Algorithm
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
Delay and Power Model for Interconnect
Delay Model
Intrinsic delay and effective driving resistance of
switch has been pre-characterized using SPICE.
Elmore delay is used to calculate routing delay.
Interconnect Power Model
Dynamic power Pd(Vddjj)=0.5fclk*C*Vddjj2
Leakage power Pl(Vddjj) is pre-characterized using
SPICE
Interconnect power reduction estimation is the
essential part of dual-Vdd assignment algorithm.
Review of Vdd Level Assignment Algorithm
b4
Pdr(j)=0.5fclk*C(j)*f(j)*ΔVdd2
b3
b1
Plr(j)=f(j)*ΔPlr(j)
Power reduction estimation
b2
S1=1
S2=3
Vdd assignment
base on estimation
f(1)
f(2)
f(3)
f(4)
=
=
=
=
.5
1
1
1.5
VddL possibility
for switches
[Lin, DAC'05]
b4
b3
b1
b2
S1=1
S2=3
Timing Slack
assigned at sinks
...
Interconnect power reduction estimation
...
Interconnect Vdd
assignment
Chip-level timing slack
allocation (LP)
Net-level Bottom-up
Vdd assignment
Refinement
Problem remained: How to
calculate VddL possibility for
mixed wire segment?
The net-level bottom-up Vdd
assignment guarantees the
legalization of final solutions.
[Lin, DAC’05]
Leverage all extra slack
with VddL switches
[Lin, DAC’05]
VddL Possibility Calculation
Represent timing slack in number of switches:
si = Li * ( Si / Di )
si is the number of VddL switches can be inserted in the path from
source to jth sink in the routing tree.
Li is the number of switches along this path.
si : how many switches can be turned to VddL along sourceto-sink-i path for the given timing slack Si.
VddL possiblity for switch j at sink i based on load capacity:
f(i,j) = si * (cij / Ci)
Key idea: distribute timing slack to each switch based on cap.
b4, 16x
b1, 8x
L2 = 3
D2 = 12
s2 = 3*(10/12)=5/2
S1=6
b3, 16x
b2, 8x
S2=10
f(2,2) = 1
f(2,3) = 1
f(2,4) = 1/2
Power Reduction Estimation for Mixed Wire Segments
The lower bound estimation [Y. Lin, DAC'05] for interconnect
power reduction is no longer valid for mixed wire segments.
1.7 slack left -1.8
needed!
fn(i,1) = 0.9
fn(i,2) = 0.5
lower bound of VddL
switches = 0.9 + .5 = 1.4
b1, 16x, need 1.8
slack
b2, 8x, need 1.0
slack
S = 2.7
Only 1.0 VddL
switch assignment
Consume 1.0
S = 2.7
Problem here: Lower bound > actual number!
Our solution: develop the upper bound estimation of VddL
switch number
Sum up all VddL
possibility
Consistent upper bound of power reduction
Remove the non-linear term "min" and the corresponding extra
LP constraints from lower bound estimation
LP formulation for dual-Vdd Level Assignment
Basic timing constraints
Arrival time for prim-output
Arrival time for prim-input
Arrival time constraints
Slack constraints
Slack upper bound
Slack constraints
Slack non-negative
Objective function
Leakage power reduction
upper bound
Dynamic power reduction
upper bound
Outline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
MILP formulation for retiming FPGA circuits
Extra constraints for post-layout FPGA retiming
Link between MILP retiming to timing budgeting
Experimental Results
Conclusions
Retiming for FPGA
Retiming graph is a directed cyclic graph.
Given the retiming graph G=(V, E), a retiming is an integervalued vertex-labeling r: V → Z.
A weight is w(u,v) associated with edge e(u,v) denotes the
number of FFs in that edge.
After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
FF_NODE
SUBBLK_OPIN
Retiming
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(B)
(C)
Link between MILP retiming & timing budgeting
Extend MILP formulation in [Leiserson, Algorithmica’91] to link
arrival time with retiming labeling
The real value a(v) assigned in
node v is its arrival time after
retiming
Timing slack in edge (u,v) is represented by
linearize
R(v) = r(v) + a(v) /c,
Timing slack in connection from sink Sk to the source
of routing tree Ri
Retiming Constraints 1:
Placement and Flip-Flop Binding Constraints
Keep both placement and routing unchanged after retiming.
No FFs in global interconnect (inter-CLB)
No FFs in local interconnect (intra-CLB and inter-SLICE)
Within a single SLICE, only FF_NODE → SUBBLK_OPIN edges
allow FF insertion.
SUBBLK_IPIN
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
SUBBLK_OPIN
SUBBLK_IPIN
No way to assign this FF in
any SLICE physically!
SUBBLK_OPIN
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(A)
FF_NODE
FF_NODE
(B)
FF# can be further
reduced!
Extra constraints in MILP formulation:
(C)
The only timing edge that
can insert FFs
Retiming Constraints 2:
LUT Delay and FF Setup &Hold Time Constraints
Delay = T_seq_out
Delay = 0
SUBBLK_IPIN
SUBBLK_IPIN
FF_NODE
FF_NODE
SUBBLK_OPIN
SUBBLK_OPIN
SUBBLK_IPIN
SUBBLK_IPIN
(A)
Delay = T_comb
FF hold
time
LUT delay
(B)
Delay = T_seq_in
FF# in edge (e)
Delay constraints for timing edges within SLICE:
FF setup time
+ LUT delay
Outline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Dual-Vdd Assignment for FPGAs with Mixed Wire
Segments
Simultaneous Vdd Level Assignment and Retiming
A runtime Efficient Post-Layout Re-Synthesis CAD Flow
Conclusions
Experimental Setting
Cluster-based Island Style FPGA Structure
Size-10 cluster and size-4 LUT
100% buffered interconnects, subset switch block
60% length-4 and 40% length-8l wire segments
25x buffer for length-4 and 10x buffer for length-8
ITRS 100nm technology, 1.3v for VddH and 0.8v for VddL
Use VPR [Betz-Rose-Marquardt] for placement and routing
Use fpgaEva-LP2 [Lin et al, FPGA’05] for power calculation
Considering short-circuit power, glitch power and input vector
8% average error compared to SPICE simulation
10 biggest sequential MCNC benchmarks are tested
Use mosek [student license, www.mosek.com] to solve
LP and MILP
Experimental Results: Dual-Vdd Assignment
for FPGAs with Mixed Wire Segments
EdTLC-LP algorithm achieves 85% VddL assignment.
EdTLC-LP algorithm achieves 53% interconnect power
reduction for mixed length interconnect wire on average.
S-VDD
DUAL-VDD
Circuit
Cluster#
power
VddL
power
clock
Runtime (s)
tseng
131
10.2
96%
3.9(-62%)
13.1
32
dsip
162
111.4
72%
66.5(-40%)
6.2
44
diffeq
195
10.2
88%
4.0(-61%)
13.5
41
s298
256
23.3
83%
10.9(-53%)
24.3
96
bigkey
294
96.1
73%
57.9(-40%)
7.0
89
elliptic
421
35.9
90%
14.9(-58%)
17.3
229
frisc
595
29.6
98%
11.1(-63%)
23.7
479
s38584.1
704
121.1
92%
51.0(-58%)
11.9
421
s38417
847
125.8
82%
67.2(-47%)
15.7
709
clma
1358
155.4
75%
76.9(-51%)
23.2
2735
ave
496
77.7
85%
36.4(-53%)
15.5
488
Experimental Results – Simultaneous vs. Sequential
Simultaneous Retiming and Slack Budgeting vs. Sequential
Approach (Delay-Optimal Retiming + Slack Budgeting)
[Those circuits with VddL < 85% are selected]
SEQUE NTI AL APPROACH
SI MULTANE OUS
ci rcui t
VddL
po wer
cl ock
VddL
po wer
dsi p
72 %
66. 5
6. 2( 0 %)
71 %
66. 5( 0 %)
s298
83 %
10. 9
24. 3( 0 %)
83 %
10. 8(- 1 %)
bi gkey
73 %
57. 8
6. 8(- 2 %
)
79 %
46. 1(- 20 %
)
s38417
82 %
66. 7
15. 7( 0 %)
82 %
66. 1(- 1 %)
cl ma
75 %
76. 9
23. 2( 0 %)
75 %
74. 4(- 3 %)
ave
77 %
55. 76
15. 24(- 1 %
)
78 %
52. 8(- 5 %
)
Simultaneous approach gains 5% on average and up to 20%
further power reduction compared to sequential one.
Runtime Efficient CAD Flow
Synthesis / Placement
/ routing / routing
Synthesis / Placement
Simultaneous approach has 10x
more runtime overhead compared to
place&route.
CLB level Vdd
CLB level Vdd
assignment
assignment
DO NOT need to perform
simultaneous approach for every
single design.
Min-clock retiming
Indicators for simultaneous gain
High percentage of VddL assignment
Interconnect
Min-clock retiming
Interconnect Vdd
assignment
Vdd
will not lead to gain from simultaneous
assignment
approach
Little gain from min-clock retiming
indicates little room for improvement by
simultaneous approach
VddL > 85% ?
Min-clock retiming gains little?
No
Yes
SOLUTION
Do Simultaneous Procedure
only when necessary
Simultaneous retiming and
interconnect Vdd assignment
Simultaneous retiming and
interconnect Vdd assignment
Global refinement
Global refinement
Outline
Motivation
Problem Formulations
Chip-level Vdd-level Assignment Algorithm
[for mixed length wire segments]
Simultaneous Vdd Level Assignment and Retiming
Experimental Results
Conclusions
Conclusions
A chip-level dual-Vdd assignment algorithm for mixed
length wire segment. Experimental results show that
reduces interconnect power by 53% on average
compared to single-Vdd FPGA designs.
A MILP based simultaneous timing budgeting and
retiming formulation which further reduces interconnect
power up to 20% compared to min-clock retiming
followed Vdd assignment.
A runtime efficient post-layout re-synthesis CAD flow.
Do simultaneous procedure only when necessary.
Thank you!
Q/A
Extra Slides for Q/A
EdTLC-LP : Net-level Bottom-up Assignment
Perform bottom-up assignment within each tree to leverage the
allocated slacks
Bottom-up assignment
Assign low-Vdd to switches in the routing tree in a bottom-up fashion
Slack is reduced by d in each step
Stop the process until no slack left
Theorem: the bottom-up assignment is optimal
Major Contributions
Present a tight estimation of power reduction upper bound for
mixed-length interconnect in FPGAs.
Develop a linear programming (LP) based slack budgeting
and Vdd level assignment algorithm for mixedlength
interconnect FPGAs.
The experimental results show 53% interconnect power
reduction on average compared to singleVdd interconnects.
Propose a mixed integer and linear programming (MILP)
based simultaneous retiming and slack budgeting for power
reduction while considering placement and flip-flop (FF)
binding constraints.
The experimental results show up to 20% interconnect power
reduction compared to the sequential approach (retiming
followed by slack budgeting).
Vdd Programmable Interconnect Arch.
Island style routing architecture.
Mixed wire length (60% length 4 wire and 40% length 8 wire).
Routing switch/connection block (Two PMOS power transistors M3 and
M4 are inserted between the tri-state buffer and VddH, VddL power rails, respectively.)
source
Level converter free (Guarantee that no VddL switch drives VddH switches.)
Timing Slack vs. VddL Switch Number
Timing Slack Sij of a connection between source and jth sink in
routing tree Ri, = the amount of delay which could be added to
this connection without increasing the cycle time Tspec.
Timing Slack Sij indicates the number of VddL switches.
Useful Slack: Timing Slack Sij is bounded due to the number
of switches in connection between source and jth sink in
routing tree Ri,. Extra slack will NOT lead to more VddL
switches!
b4
b1
1 unit slack is
needed for VddL
s1=1
b4
b3
b1
b2
s2=1
s1=2
b3
Useful Slack =
3
b2
s2=4
Timing Slack Bounding Constraint: 0 ≤ Sij ≤ Dij
[Dij is the delay increase of the path from source to jth sink by
setting VddL to all the switches in this path]
Retiming for LUT based FPGA
Retiming graph is a directed cyclic graph.
Given the retiming graph G=(V, E), a retiming is an integervalued vertex-labeling r: V → Z.
A weight is w(u,v) associated with edge e(u,v) denotes the
number of FFs in that edge.
After retiming (re-labeling of vertices): w'(u,v) = w(u,v) + r(v) – r(u)
LUT
1
C
G
CLB1
I1
PI2
E
D
LUT
2
SLICE
2
SLICE
1
B
B
D
C
H
O1
F
G
O2
PO
H
O1
I3
I3
(a) FPGA Circuit
Retiming
I1
A
FF
PI2
PI1
I2
F
CLB2
PI
2
I2
O2
PO
(b) Retiming Graph
r(G) = 1
r(D) = 1
r(F) = 1
A
B
LUT
1
C
FF
E
D
LUT
2
SLICE
2
I1
A
PI
1
PI
2
I2
SLICE
1
PI
1
G
CLB1
PI1
I2
I1
B
A
F
D
C
H
O1
G
(a) FPGA Circuit
O2
PO
H
O1
I3
I3
CLB2
F
O2
PO
(b) Retiming Graph
MILP Based Retiming Formulation
Extended from MILP formulation [Leiserson, Algorithmica’91]
Let G = (V, E, d, w) be a synchronous circuits, and let c be a
positive real number. Then there exists a retiming r of G such
that Φ(Gr) ≤ c if and only if there exists an assignment of a real
value a(v) and an integer value r(v) to each vertex v such that the
following conditions are satisfied:
Let R(v) = r(v) + a(v) /c, then this formula can be rewritten as
Runtime Efficient CAD Flow
Runtime overhead of post-layout re-synthesis processes
CRT+SB
RTSB
circuit
total
b-Search
total
tseng
43
4
44
dsip
74
9
66
diffeq
192
124
85
s298
654
470
230
bigkey
302
77
254
elliptic
575
145
473
frisc
1286
171
2402
s38584
1072
193
1092
s38417
1687
1029
863
clma
9736
2427
8580
ave
1562
465
1409
High percentage of VddL assignment
after won‘t lead to gain from RTSB
Little gain from min-clock retiming
indicates little room for improvement
RTSB
Placement / routing
Chip-level vdd
assignment
Min-clock retiming
Vdd assignment
VddL > 85% ?
Min-clock retiming gains little?
No
Yes
RTSB
by
Global refinement