FPGA Power Reduction Using Configurable Dual-Vdd
Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability
Yan Lin, Fei Li and Lei He
EE Department, UCLA
[email protected]
Partially supported by NSF.
Overview
FPGA architecture evaluation
Area and delay [Rose et al, JSSC’90]
Power [Poon et al, FPL’02][Li et al, FPGA’03]
Vdd programmability for power reduction
Concept in [FPGA’03]
Application to logic [FPGA’04][DAC’04]
Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04]
Novel circuits and architecture evaluation for FPGAs with Vdd-programmability
Reduce power by 50% with 17% area and 3% delay increase
Outline
Power modeling and architecture evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work
Framework fpgaEva-LP
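[Flow diagram: Benchmark circuits → Logic Optimization (SIS) → Tech-Mapping (RASP) → Timing-Driven Packing (T-VPack) → Placement & Routing (VPR), with Arch Spec and Parasitic Extraction blocks; the reported results are Area, Delay, and Power, with Power coming from the Cycle-accurate Power Simulator]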
FPGA Structure and Models
Cluster-based Island Style FPGA Structure
Area and delay models similar to [Betz-Rose-Marquardt]
100% buffered interconnects, subset switch block
input fc = 50%, output fc = 25%
But based on layout and SPICE for 100nm and below
Mixed-level power model from [FPGA’03]
Dynamic power
Capacitive power: functional switching and glitches
Short-circuit power (depends on transition time)
Static power
Sub-threshold leakage
Reverse-biased leakage
Gate leakage
New Power Model in fpgaEva-LP2
Short-circuit power
switching time * switching power
fpgaEva-LP used average signal transition time
fpgaEva-LP2 calculates transition time for each buffer as $t_r = \alpha \cdot t_{buffer}$, where $t_{buffer}$ is the buffer delay
α is NOT a constant 2 as in literature, due to input slew
α is pre-characterized by SPICE
buffer delay | α
< 0.012 ns | 2
< 0.03 ns | 4.4
> 0.03 ns | 7
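The per-buffer transition-time lookup can be sketched as below. This is only an illustration, assuming the transition time is the product of α and the buffer delay; the function names are hypothetical, and the treatment of a delay exactly equal to a threshold is an assumption because the slide does not specify it.

```python
# Thresholds and alpha values come from the slide's table.
# How a delay exactly on a threshold is classified is an assumption.
ALPHA_TABLE = [
    (0.012, 2.0),  # buffer delay < 0.012 ns
    (0.03, 4.4),   # buffer delay < 0.03 ns
]
ALPHA_SLOW = 7.0   # buffer delay > 0.03 ns


def alpha_for(buffer_delay_ns: float) -> float:
    """Return the pre-characterized alpha for a given buffer delay (ns)."""
    for upper_bound_ns, alpha in ALPHA_TABLE:
        if buffer_delay_ns < upper_bound_ns:
            return alpha
    return ALPHA_SLOW


def transition_time_ns(buffer_delay_ns: float) -> float:
    """Estimate the signal transition time as alpha times the buffer delay."""
    return alpha_for(buffer_delay_ns) * buffer_delay_ns
```

A buffer with 0.02 ns delay falls in the middle bin, so its estimated transition time is 4.4 times its delay rather than the constant factor of 2.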
Validation Using SPICE
Validate by comparison for each power-component
High fidelity with average absolute error of 8%
[Figure: FPGA power (watt) from SPICE simulation, fpgaEva-LP, and fpgaEva-LP2 for benchmark circuits b1, parity, cm138a, z4ml, and decode]
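One way to compute an average absolute error against SPICE is sketched below. The slide does not say how the 8% figure is averaged, so the definition used here (mean of per-benchmark relative errors) is an assumption, and the function name is hypothetical.

```python
def average_absolute_error(reference, estimate):
    """Mean of |estimate - reference| / reference over all benchmarks.

    `reference` would hold SPICE power values and `estimate` the power
    model's values for the same benchmarks, in the same order.
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(e - r) / r for r, e in zip(reference, estimate)]
    return sum(errors) / len(errors)
```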
Impact of Random Seeds in VPR
circuit: s38584
[Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for 10 VPR runs with different random seeds; annotated spreads are +12% in delay and +5% in energy]
12% delay variation and 5% energy variation
Min-delay solution among 10 runs is used
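Selecting the min-delay solution among several seeded runs can be sketched as follows. The `Run` record and both helper names are hypothetical, and measuring the spread relative to the smallest value is an assumption; the slide does not state how the variation percentages are defined.

```python
from typing import NamedTuple


class Run(NamedTuple):
    seed: int
    delay_ns: float
    energy_nj: float


def pick_min_delay(runs):
    """Return the run with the smallest critical path delay."""
    return min(runs, key=lambda r: r.delay_ns)


def spread(values):
    """Relative spread (max - min) / min of a list of positive values."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo
```

Called on the ten place-and-route results of one circuit, `pick_min_delay` returns the run whose area, delay, and energy are then reported.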
Evaluation of Single-Vdd FPGAs
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per architecture (N, k)]
Architectures explored
Cluster size N = {6, 8, 10, 12}
LUT size k = {3, 4, 5, 6, 7}
Energy-delay (ED) dominant architectures
Architecture with smaller delay or less energy (compared to any other architecture)
Relaxed ED dominant set may also be valuable
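The ED dominant set reads as a Pareto front over (delay, energy). A minimal sketch under that interpretation follows; the function name and the input layout are hypothetical.

```python
def ed_dominant(points):
    """Keep architectures that no other architecture dominates.

    `points` maps an architecture (N, k) to (delay_ns, energy_nj).
    Architecture B dominates A when B is no worse in both delay and
    energy and strictly better in at least one of them.
    """
    kept = {}
    for arch, (delay, energy) in points.items():
        dominated = any(
            d2 <= delay and e2 <= energy and (d2 < delay or e2 < energy)
            for other, (d2, e2) in points.items()
            if other != arch
        )
        if not dominated:
            kept[arch] = (delay, energy)
    return kept
```

A relaxed set could be obtained by allowing a small tolerance in the dominance test, at the cost of a larger candidate set.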
Energy versus Delay
For 100nm ITRS technology
Min-Energy arch (N,k)=(10,4) or (8,4)
Min-Delay arch (N,k)=(8,7): 0.8x delay but 1.7x power
[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns) for all (N, k) architectures, with the current commercial architecture marked]
Outline
Power modeling and evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work
Vdd-programmable FPGA [FPGA’04][DAC’04][ICCAD’04]
Vdd-programmable logic block
Vdd selection
Power-gating unused blocks
Vdd-programmable switch
Vdd-level conversion is needed when VddL drives VddH, to avoid excessive leakage
Vdd-programmable Routing Switch
Conventional routing switch
Vdd-programmable routing switch
Brute-force design [ICCAD’04]
Two extra SRAM cells for each routing switch
New design
One extra SRAM cell
NAND2 gate: minimum size & high-Vt transistor
Vdd-Programmable Interconnect Connection Block
Brute-force design [ICCAD’04]
2n extra SRAM cells for n connection switches
New design
Only TWO extra SRAM cells for n connection switches
Control logic includes 2n NAND2 and a decoder
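The SRAM counts quoted for the brute-force and new designs can be compared with a small calculation. The function names are hypothetical; the per-switch and per-block counts are the ones stated on the routing-switch and connection-block slides.

```python
def extra_sram_routing(num_switches: int, new_design: bool) -> int:
    """Extra SRAM cells for Vdd-programmable routing switches.

    Brute-force design: two extra cells per routing switch.
    New design: one extra cell per routing switch.
    """
    return num_switches * (1 if new_design else 2)


def extra_sram_connection_block(n: int, new_design: bool) -> int:
    """Extra SRAM cells for a connection block with n connection switches.

    Brute-force design: 2n extra cells.
    New design: two extra cells regardless of n.
    """
    return 2 if new_design else 2 * n
```

With n = 8 connection switches, the brute-force block needs 16 extra cells while the new design needs 2, in exchange for the 2n NAND2 gates and the decoder in its control logic.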
Power and Delay
Vdd-programmable switch uses
4X PMOS power transistor for 7X routing switch
1X PMOS power transistor for 4X connection switch
Compared to conventional switch
1000X less leakage power
Connection box is 28% faster and has 18% less dynamic power
By moving mux from critical path of connection box
Type | Switch delay (ns), w/o power transistor | Switch delay (ns), w/ power transistor | Energy per switch (Joule, Vdd=1.3V), w/o power transistor | Energy per switch (Joule, Vdd=1.3V), w/ power transistor
Routing | 5.9E-11 | 6.5E-11 (+11%) | 3.3E-14 | 3.2E-14 (-2%)
Connection | 2.9E-10 | 2.1E-10 (-28%) | 3.8E-14 | 3.1E-14 (-18%)
Vdd-gateable Routing Switch
Conventional
Vdd-gateable
Two states: normal Vdd or power-gating
Enable power-gating capability w/o extra SRAM cells
Power transistor
Can be replaced by tri-state buffer
Vdd-gateable Connection Block
Conventional
Vdd-gateable
Enable power-gating capability w/ only one extra SRAM and a low leakage decoder
Outline
Power modeling and evaluation methodology
FPGA Circuits for Vdd Programmability
Architecture Evaluation with Vdd programmability
Conclusions and Ongoing Work
FPGA Architecture Classes
Architecture Class | Logic Block | Interconnect
Class0 (baseline) | single-Vdd | single-Vdd
Class1 | programmable dual-Vdd | programmable dual-Vdd, level converters in routing
Class2 | programmable dual-Vdd | VddH and Vdd-gateable
Class3 | programmable dual-Vdd | Class 1, but no level converters in routing
High-Vt is applied to configuration SRAM cells for all the classes
Vdd-level Converters
Class3 removes Vdd-level converters from interconnects in Class1
With the constraint that no VddL drives VddH
We developed a routing approach in which each routing tree has a single Vdd level
But trees with different Vdd levels can share the same wire track
Alternative approaches:
Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04]
Our new work [DAC’05] allows dual Vdd in a tree with a chip-level time slack budgeting for extra power reduction
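The two routing conditions (one Vdd level per routing tree, and no VddL segment driving a VddH segment) can be checked with a sketch like the one below. It is a checker only, not the routing algorithm itself; the data layout and all names are hypothetical.

```python
VDD_L, VDD_H = "VddL", "VddH"


def has_single_vdd(levels):
    """True if every node in one routing tree has the same Vdd level.

    `levels` maps a node name to VDD_L or VDD_H.
    """
    return len(set(levels.values())) <= 1


def violates_level_constraint(edges, levels):
    """True if any (driver, sink) edge has a VddL driver and a VddH sink."""
    return any(
        levels[driver] == VDD_L and levels[sink] == VDD_H
        for driver, sink in edges
    )
```

A tree that passes `has_single_vdd` can never trip `violates_level_constraint`, which is why level converters can be dropped from the interconnect under this scheme.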
Energy versus Delay
[Figure: total FPGA energy per cycle (nJ) versus critical path delay (ns) for Class 0 to Class 3; LUT size 7 architectures form the high-performance end and LUT size 4 architectures the low-energy end]
ED-product reduction
20% by Class1 (Vdd-programmable interconnects w/ level converters)
45% by Class2 (Vdd-gateable interconnects)
50% by Class3 (class1 minus level converters)
Performance degrades 3% due to Vdd programmability
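The ED-product reduction of one architecture relative to a baseline can be computed as sketched below. The function names are hypothetical, and the numbers in the usage note are illustrative rather than taken from the slides.

```python
def ed_product(energy_nj: float, delay_ns: float) -> float:
    """Energy-delay product of one architecture."""
    return energy_nj * delay_ns


def ed_reduction(baseline, candidate) -> float:
    """Fractional ED-product reduction of `candidate` versus `baseline`.

    Each argument is an (energy_nj, delay_ns) pair.
    """
    return 1.0 - ed_product(*candidate) / ed_product(*baseline)
```

For example, halving energy from 4.0 to 2.0 nJ while delay grows from 10.0 to 10.3 ns gives an ED-product reduction of about 48.5%, slightly less than the energy saving alone because of the delay penalty.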
Energy versus Area
[Figure: total FPGA energy per cycle (nJ) versus total FPGA device area for Class0 to Class3, one point per architecture (N, k), with the min-area and min-energy architectures marked]
Average area overhead
118% for Class1 (Vdd-programmable interconnects w/ level converters)
17% for Class2 (Vdd-gateable interconnects)
52% for Class3 (Vdd-programmable interconnects w/o level converters)
Class2 is the best considering both energy and area
Energy Breakdown
[Figure: stacked bars of total FPGA energy (nJ/cycle) for Class0 to Class3, broken down into logic, local interconnect, and global interconnect leakage and dynamic energy; FPGA architecture (N,k) = (12,4)]
Class2 and Class3 dramatically reduce global interconnect leakage
But Class1 fails due to leakage in Vdd-level converters
Area Overhead
FPGA area overhead by component
Block | Component | Overhead
Logic Blocks (3.19%) | Power Transistors & SRAMs (CLBs) | 1.39%
Logic Blocks (3.19%) | Vdd-level Converters (CLBs) | 1.80%
Connection Blocks (10.38%) | Control (Connection Blocks) | 4.82%
Connection Blocks (10.38%) | Power Transistors (Connection Blocks) | 4.96%
Connection Blocks (10.38%) | SRAMs (Connection Blocks) | 0.60%
Routing Switches (3.87%) | Power Transistors (Routing Switches) | 3.87%
Class2: Vdd-gateable interconnects + Vdd-programmable CLBs (12, 4)
17% = 9% for power transistors + 5% for control + 2% for SRAM
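The component figures in the area-overhead chart can be cross-checked against the per-block totals with a short calculation. The pairing of each percentage with its component label is read from the chart residue and should be treated as an assumption; the names below are hypothetical.

```python
# Percent area overhead per component for Class2 at (N, k) = (12, 4).
# The value-to-label pairing is inferred from the chart labels (assumption).
AREA_OVERHEAD_PERCENT = {
    "Logic Blocks": {
        "Power Transistors & SRAMs": 1.39,
        "Vdd-level Converters": 1.80,
    },
    "Connection Blocks": {
        "Control": 4.82,
        "Power Transistors": 4.96,
        "SRAMs": 0.60,
    },
    "Routing Switches": {
        "Power Transistors": 3.87,
    },
}


def block_totals(breakdown):
    """Sum the component overheads within each block."""
    return {
        block: round(sum(parts.values()), 2)
        for block, parts in breakdown.items()
    }


def total_overhead(breakdown):
    """Sum the overhead across all blocks."""
    return round(sum(sum(p.values()) for p in breakdown.values()), 2)
```

Under this pairing the components add up to the labeled block totals (3.19%, 10.38%, 3.87%) and to about 17.4% overall, consistent with the 17% quoted for Class2.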
Conclusions and New Results
Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGA
Vdd-gating offers a better area-power tradeoff than Vdd-selection
45% energy-delay product reduction with 17% area overhead
Architecture with Vdd-programmability
LUT size 4: low energy and area
LUT size 7: best performance
New results [DAC’05]
Time slack allocation for Vdd-programmable interconnects
Device and architecture co-optimization for 77% energy-delay reduction
References and Download
All references and tools at http://eda.ee.ucla.edu
Results in the slides have been updated compared to the paper in ISFPGA’05