FPGA Power Reduction Using Configurable Dual-Vdd


Power Modeling and Architecture Evaluation for FPGA with Novel Circuits for Vdd Programmability
Yan Lin, Fei Li and Lei He
EE Department, UCLA
Partially supported by NSF.

Overview

- FPGA architecture evaluation
    - Area and delay [Rose et al, JSSC’90]
    - Power [Poon et al, FPLA’02][Li et al, FPGA’03]
- Vdd programmability for power reduction
    - Concept in [FPGA’03]
    - Application to logic [FPGA’04][DAC’04]
    - Application to interconnects [ICCAD’04][Anderson et al, ICCAD’04]
- Novel circuits and architecture evaluation for FPGAs with Vdd-programmability
    - Reduce power by 50% with 17% area and 3% delay increase

Outline

- Power modeling and architecture evaluation methodology
- FPGA circuits for Vdd programmability
- Architecture evaluation with Vdd programmability
- Conclusions and ongoing work

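Framework fpgaEva-LP

[Figure: fpgaEva-LP evaluation flow. Benchmark circuits pass through logic optimization (SIS), tech-mapping (RASP), timing-driven packing (T-VPack), and placement & routing (VPR), driven by an architecture spec. The flow reports area and delay; parasitic extraction feeds a cycle-accurate power simulator that reports power.]
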
FPGA Structure and Models

- Cluster-based island-style FPGA structure
    - Area and delay models similar to [Betz-Rose-Marquardt]
        - 100% buffered interconnects, subset switch block
        - Input Fc = 50%, output Fc = 25%
    - But based on layout and SPICE for 100nm and below
    - Mixed-level power model from [FPGA’03]
- Dynamic power
    - Capacitive power
    - Short-circuit power (∝ transition time)
- Capacitive power
    - Functional switch
    - Glitch
- Static power
    - Sub-threshold leakage
    - Reverse-biased leakage
    - Gate leakage

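The power components listed above can be combined in a small sketch. It uses the textbook switching-power expression P = 0.5 · a · C · Vdd² · f for the capacitive part, which is an assumption here and not necessarily the exact formulation inside fpgaEva-LP. All function names, the 1.0 V VddL value, and the other numbers are hypothetical and only illustrate why a lower Vdd reduces capacitive power quadratically.

```python
def capacitive_power_w(activity: float, c_farad: float,
                       vdd_v: float, f_hz: float) -> float:
    """Textbook switching power: 0.5 * a * C * Vdd^2 * f.

    `activity` is the average number of transitions per cycle and is meant
    to include both functional switching and glitches.
    """
    return 0.5 * activity * c_farad * vdd_v ** 2 * f_hz


def total_power_w(capacitive_w: float, short_circuit_w: float,
                  static_w: float) -> float:
    """Total power = dynamic (capacitive + short-circuit) + static."""
    return capacitive_w + short_circuit_w + static_w


# Hypothetical node: 10 fF load, 0.2 transitions per cycle, 100 MHz clock.
p_high = capacitive_power_w(0.2, 10e-15, 1.3, 100e6)  # VddH = 1.3 V
p_low = capacitive_power_w(0.2, 10e-15, 1.0, 100e6)   # VddL = 1.0 V (assumed)
ratio = p_low / p_high                                # equals (1.0 / 1.3) ** 2
```

With these assumed voltages the ratio is about 0.59, so running a non-critical node at VddL would remove roughly 40% of its capacitive power under this simple model.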
New Power Model in fpgaEva-LP2

- Short-circuit power ∝ switching time * switching power
- fpgaEva-LP used average signal transition time
- fpgaEva-LP2 calculates transition time for each buffer as t_r = α · t_buffer, where t_buffer is the buffer delay
    - α is NOT a constant 2 as in literature, due to input slew
    - α is pre-characterized by SPICE

    buffer delay    α
    < 0.012 ns      2
    < 0.03 ns       4.4
    > 0.03 ns       7

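The table lookup above can be sketched as a piecewise function. The thresholds and α values come from the slide; the function names are hypothetical, and the treatment of a delay of exactly 0.03 ns (not covered by the table) is an assumption.

```python
def alpha_for_buffer_delay(t_buffer_ns: float) -> float:
    """Return the pre-characterized alpha for a given buffer delay in ns."""
    if t_buffer_ns < 0.012:
        return 2.0
    if t_buffer_ns < 0.03:
        return 4.4
    # The slide lists "> 0.03 ns"; exactly 0.03 ns is assigned here by assumption.
    return 7.0


def transition_time_ns(t_buffer_ns: float) -> float:
    """Per-buffer transition time t_r = alpha * t_buffer."""
    return alpha_for_buffer_delay(t_buffer_ns) * t_buffer_ns
```

For example, a buffer with a 0.02 ns delay falls in the middle bin, so its transition time is 4.4 × 0.02 ns.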
Validation Using SPICE

- Validate by comparison for each power component
- High fidelity with average absolute error of 8%

[Figure: FPGA power (watt) from SPICE simulation, fpgaEva-LP and fpgaEva-LP2 for benchmark circuits b1, parity, cm138a, z4ml and decode]

Impact of Random Seeds in VPR

[Figure: FPGA energy (nJ/cycle) versus critical path delay (ns) for circuit s38584 over 10 VPR random seeds; the points span +12% in delay and +5% in energy]

- 12% delay variation and 5% energy variation
- Min-delay solution among 10 runs is used

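The selection rule above (keep the min-delay run among several seeds) and the variation figures can be sketched as follows. The `Run` type, the function names, and the three sample runs are hypothetical; they are not data from the slides.

```python
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    seed: int
    delay_ns: float
    energy_nj: float


def relative_spread(values: Sequence[float]) -> float:
    """(max - min) / min, i.e. the variation relative to the best value."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


def pick_min_delay(runs: Sequence[Run]) -> Run:
    """Keep the placement-and-routing result with the smallest critical path delay."""
    return min(runs, key=lambda r: r.delay_ns)


# Hypothetical results for three seeds (illustrative only).
runs = [Run(1, 10.0, 5.0), Run(2, 11.0, 5.2), Run(3, 10.5, 5.25)]
best = pick_min_delay(runs)
delay_variation = relative_spread([r.delay_ns for r in runs])
energy_variation = relative_spread([r.energy_nj for r in runs])
```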
Evaluation of Single-Vdd FPGAs

[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per architecture (N, k)]

- Architectures explored
    - Cluster size N = {6, 8, 10, 12}
    - LUT size k = {3, 4, 5, 6, 7}
- Energy-delay (ED) dominant architectures
    - Architecture with smaller delay or less energy (compared to any other architecture)
    - Relaxed ED dominant set may also be valuable

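One way to read the ED-dominant definition above is as a Pareto front over (delay, energy): an architecture is kept if no other architecture is at least as good in both metrics and strictly better in one. The sketch below implements that reading; the function name and the sample points are hypothetical and are not values taken from the plot.

```python
from typing import Dict, List, Tuple

Arch = Tuple[int, int]           # (N, k)
Point = Tuple[float, float]      # (delay_ns, energy_nj)


def ed_dominant(points: Dict[Arch, Point]) -> List[Arch]:
    """Return architectures that no other architecture dominates."""
    kept = []
    for arch, (d, e) in points.items():
        dominated = any(
            d2 <= d and e2 <= e and (d2 < d or e2 < e)
            for other, (d2, e2) in points.items()
            if other != arch
        )
        if not dominated:
            kept.append(arch)
    return sorted(kept)


# Hypothetical (delay, energy) values for four architectures.
sample = {
    (8, 7): (10.0, 6.0),
    (10, 4): (12.0, 4.0),
    (12, 7): (11.0, 8.0),   # dominated by (8, 7)
    (6, 3): (16.0, 5.0),    # dominated by (10, 4)
}
front = ed_dominant(sample)
```

A relaxed ED-dominant set could be obtained by tolerating a small margin in the comparison, at the cost of keeping more candidates.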
Energy versus Delay

- For 100nm ITRS technology
    - Min-energy arch (N,k) = (10,4) or (8,4)
    - Min-delay arch (N,k) = (8,7): 0.8x delay but 1.7x power

[Figure: total FPGA energy (nJ/cycle) versus critical path delay (ns) for each architecture (N, k), with the current commercial architecture marked]

Outline

- Power modeling and evaluation methodology
- FPGA circuits for Vdd programmability
- Architecture evaluation with Vdd programmability
- Conclusions and ongoing work

Vdd-programmable FPGA [FPGA’04][DAC’04][ICCAD’04]

- Vdd-programmable logic block
    - Vdd selection
    - Power-gating unused blocks
- Vdd-programmable switch
- Vdd-level conversion is needed when VddL drives VddH
    - To avoid excessive leakage

Vdd-programmable Routing Switch

- Conventional routing switch
- Vdd-programmable routing switch
    - Brute-force design [ICCAD’04]
        - Two extra SRAM cells for each routing switch
    - New design
        - One extra SRAM cell
        - NAND2 gate: minimum size & high-Vt transistor

Vdd-Programmable Interconnect Connection Block

- Brute-force design [ICCAD’04]
    - 2n extra SRAM cells for n connection switches
- New design
    - Only TWO extra SRAM cells for n connection switches
    - Control logic includes 2n NAND2 and a decoder

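The SRAM counts above can be compared with a short sketch. The counts (2n versus two extra cells, and 2n NAND2 gates in the new design) are from the slide; the function names and the choice of n = 8 are hypothetical.

```python
def extra_sram_cells(n_switches: int, design: str) -> int:
    """Extra configuration SRAM cells for a connection block with n switches."""
    if design == "brute_force":
        return 2 * n_switches      # two extra cells per connection switch
    if design == "new":
        return 2                   # two extra cells shared by the whole block
    raise ValueError(f"unknown design: {design}")


def nand2_gates_new_design(n_switches: int) -> int:
    """The new design trades SRAM cells for 2n NAND2 gates plus a decoder."""
    return 2 * n_switches


n = 8  # hypothetical number of connection switches
saved = extra_sram_cells(n, "brute_force") - extra_sram_cells(n, "new")
```

The saving grows linearly with n, while the added control logic (NAND2 gates and the decoder) is the price paid for it.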
Power and Delay

- Vdd-programmable switch uses
    - 4X PMOS power transistor for 7X routing switch
    - 1X PMOS power transistor for 4X connection switch
- Compared to conventional switch
    - 1000X less leakage power
    - Connection box is 28% faster and has 18% less dynamic power
        - By moving mux from critical path of connection box

(Vdd = 1.3V)

                Switch delay (s)                            Energy per switch (Joule)
Type            w/o power transistor  w/ power transistor   w/o power transistor  w/ power transistor
Routing         5.9E-11               6.5E-11 (+11%)        3.3E-14               3.2E-14 (-2%)
Connection      2.9E-10               2.1E-10 (-28%)        3.8E-14               3.1E-14 (-18%)

Vdd-gateable Routing Switch

- Conventional
- Vdd-gateable
    - Two states: normal Vdd or power-gating
    - Enable power-gating capability w/o extra SRAM cells
- Power transistor
    - Can be replaced by tri-state buffer

Vdd-gateable Connection Block

- Conventional
- Vdd-gateable
    - Enable power-gating capability w/ only one extra SRAM cell and a low-leakage decoder

Outline

- Power modeling and evaluation methodology
- FPGA circuits for Vdd programmability
- Architecture evaluation with Vdd programmability
- Conclusions and ongoing work

FPGA Architecture Classes

Architecture Class    Logic Block              Interconnect
Class0 (baseline)     single-Vdd               single-Vdd
Class1                programmable dual-Vdd    programmable dual-Vdd, level converters in routing
Class2                programmable dual-Vdd    VddH and Vdd-gateable
Class3                programmable dual-Vdd    Class 1, but no level converters in routing

- High-Vt is applied to configuration SRAM cells for all the classes

Vdd-level Converters

- Class3 removes Vdd-level converters from interconnects in Class1
    - With the constraint that no VddL drives VddH
- We developed a routing algorithm in which each routing tree has a single Vdd level
    - But trees with different Vdd levels can share the same wire track
- Alternative approaches:
    - Combined Vdd-level converter and buffer [Anderson et al, ICCAD’04]
    - Our new work [DAC’05] allows dual Vdd in a tree, with chip-level time slack budgeting for extra power reduction

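The two routing constraints above can be expressed as simple checks on a routing tree. This is only a sketch of the constraints themselves, not the routing algorithm; the data layout (driver/sink edge list plus a node-to-level map) and all names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

VDDL, VDDH = "VddL", "VddH"


def needs_level_converter(driver_vdd: str, sink_vdd: str) -> bool:
    """A level converter is needed only when VddL drives VddH."""
    return driver_vdd == VDDL and sink_vdd == VDDH


def tree_is_legal_without_converters(edges: List[Tuple[str, str]],
                                     vdd_of: Dict[str, str]) -> bool:
    """True if no edge (driver, sink) has a VddL node driving a VddH node."""
    return not any(needs_level_converter(vdd_of[u], vdd_of[v]) for u, v in edges)


def tree_has_single_vdd(nodes: Iterable[str], vdd_of: Dict[str, str]) -> bool:
    """The stricter single-Vdd-per-tree rule: every node shares one level."""
    return len({vdd_of[n] for n in nodes}) == 1


# Hypothetical tree: source s drives a, which drives sinks b and c.
edges = [("s", "a"), ("a", "b"), ("a", "c")]
uniform = {"s": VDDL, "a": VDDL, "b": VDDL, "c": VDDL}
mixed = {"s": VDDH, "a": VDDL, "b": VDDH, "c": VDDL}
```

The single-Vdd rule is sufficient but stricter than necessary: a tree in which VddH only ever drives VddL would also pass the first check, which is the extra freedom that dual-Vdd trees try to exploit.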
Energy versus Delay

[Figure: total FPGA energy/cycle (nJ) versus critical path delay (ns) for Class 0 to Class 3, one point per architecture (N, k); annotated regions are LUT 7 (high performance) and LUT 4 (low energy)]

- ED-product reduction
    - 20% by Class1 (Vdd-programmable interconnects w/ level converters)
    - 45% by Class2 (Vdd-gateable interconnects)
    - 50% by Class3 (Class1 minus level converters)
- Performance degrades 3% due to Vdd programmability

Energy versus Area

[Figure: total FPGA energy/cycle (nJ) versus total FPGA device area for Class0 to Class3, one point per architecture (N, k), with the min-area and min-energy points marked]

- Average area overhead
    - 118% for Class1 (Vdd-programmable interconnects w/ level converters)
    - 17% for Class2 (Vdd-gateable interconnects)
    - 52% for Class3 (Vdd-programmable interconnects w/o level converters)
- Class2 is the best considering both energy and area

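A rough way to see why Class2 comes out ahead is to divide each class's ED-product reduction by its average area overhead. The percentages below are the ones reported on the slides; the ratio itself is a hypothetical figure of merit introduced here for illustration, not a metric used by the authors.

```python
# Percentages reported on the slides.
ed_reduction_pct = {"Class1": 20.0, "Class2": 45.0, "Class3": 50.0}
area_overhead_pct = {"Class1": 118.0, "Class2": 17.0, "Class3": 52.0}


def reduction_per_area(cls: str) -> float:
    """Hypothetical figure of merit: ED reduction per percent of added area."""
    return ed_reduction_pct[cls] / area_overhead_pct[cls]


# Rank the classes from best to worst under this simple ratio.
ranking = sorted(ed_reduction_pct, key=reduction_per_area, reverse=True)
```

Class3 achieves the largest absolute ED reduction, but under this ratio Class2 ranks first because its area overhead is much smaller.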
Energy Breakdown

[Figure: total FPGA energy (nJ/cycle) for Class0 to Class3 with FPGA architecture (N,k) = (12,4), broken down into logic leakage, logic dynamic, local interconnect leakage, local interconnect dynamic, global interconnect leakage and global interconnect dynamic energy]

- Class2 and Class3 dramatically reduce global interconnect leakage
- But Class1 fails due to leakage in Vdd-level converters

Area Overhead

Class2: Vdd-gateable interconnects + Vdd-programmable CLBs (12, 4)

Component                                    FPGA area overhead
Logic blocks                                 3.19%
    Power transistors & SRAMs (CLBs)         1.39%
    Vdd-level converters (CLBs)              1.80%
Connection blocks                            10.38%
    Control (connection blocks)              4.82%
    Power transistors (connection blocks)    4.96%
    SRAMs (connection blocks)                0.60%
Routing switches                             3.87%
    Power transistors (routing switches)     3.87%

- 17% = 9% for power transistors + 5% for control + 2% for SRAM

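The chart's component percentages can be cross-checked against the group subtotals and the 17% total. The pairing of each number with its label follows the order in which they appear in the chart and is an assumption; the dictionary layout and names are hypothetical.

```python
# Area overhead (percent of FPGA area) for Class2, (N, k) = (12, 4).
overhead_pct = {
    "Logic blocks": {
        "Power transistors & SRAMs (CLBs)": 1.39,
        "Vdd-level converters (CLBs)": 1.80,
    },
    "Connection blocks": {
        "Control": 4.82,
        "Power transistors": 4.96,
        "SRAMs": 0.60,
    },
    "Routing switches": {
        "Power transistors": 3.87,
    },
}


def group_total(group: str) -> float:
    """Sum the components of one group, rounded to two decimals."""
    return round(sum(overhead_pct[group].values()), 2)


# Grand total over all groups, rounded to two decimals.
total = round(sum(group_total(g) for g in overhead_pct), 2)
```

Under this pairing the subtotals reproduce the 3.19%, 10.38% and 3.87% labels, and the grand total rounds to the 17% quoted for Class2.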
Conclusions and New Results

- Field programmability is needed for fine-grained dual-Vdd and Vdd-gating in FPGA
- Vdd-gating offers a better area-power tradeoff than Vdd-selection
    - 45% energy-delay product reduction with 17% area overhead
- Architecture with Vdd-programmability
    - LUT size 4: low energy and area
    - LUT size 7: best performance
- New results [DAC’05]
    - Time slack allocation for Vdd-programmable interconnects
    - Device and architecture co-optimization for 77% energy-delay reduction

References and Download

- All references and tools at http://eda.ee.ucla.edu
- Results in the slides have been updated compared to the paper in ISFPGA’05