Measure Twice and Cut Once:
Robust Dynamic Voltage Scaling
for FPGAs
Ibrahim Ahmed, Shuze Zhao, Olivier Trescases and Vaughn Betz
Email:[email protected]
FPGA Power Consumption Challenge
[Figure: supply voltage VDD (V), on a 0 to 2 V axis, versus technology node (150, 130, 90, 65, 40, 28, 14 nm). Annotation: VDD not scaling.]
3
FPGA Power Consumption Challenge
• An obstacle to entering the emerging low-power/mobile (IoT) market
• Must show superior performance per watt to compete in data centers
• Innovation is needed to bring power down
“The future of continued scaling is dependent on adaptive
power management and voltage scaling”, IEEE Fellow Kevin Zhang,
VP of Intel's Technology and Manufacturing Group
4
Worst-case Modelling is Wasteful
• Devices have different delays -> variation
• Delay is temperature dependent (figure label: high temperature)
• Delay is affected by VDD (figure label: lower VDD)
• Aging also affects delay (figure label: end-of-life)
• Static timing analysis (STA) accommodates the tail
• Timing models add margins for:
  • Slow device
  • Worst temperature
  • Worst voltage droop
  • End-of-life effects
  • Guard-bands for noise, etc.
10
How significant are the added margins?
[Figure: FIR filter Fmax on a 60-nm Cyclone IV (1.2 V nominal VDD). Measured Fmax (MHz, 0 to 250) versus supply voltage (mV, 800 to 1400), compared with the CAD-reported Fmax.]
• > 20 % reduction in VDD without reducing Fmax
• Dynamic Voltage Scaling (DVS)
13
Dynamic Voltage Scaling
• Find the minimum VDD that guarantees operation at the required speed
• Lowering VDD reduces both dynamic and static power
  • $P_{dynamic} \propto V_{DD}^2$
  • Static power drops even faster
• DVS has been commercially adopted by CPUs, but not FPGAs
  • FPGA's programmability -> unknown critical path at fabrication time
• This work: exploit programmability to perform design- & chip-specific calibration
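As a rough illustration of the quadratic relation on this slide, the sketch below computes the fraction of dynamic power that remains after lowering VDD at a fixed clock frequency. It assumes dynamic power scales exactly with VDD squared and does not model static power; the function name and the example voltages (1.2 V nominal, 1.0 V scaled) are chosen only for illustration.

```python
def dynamic_power_ratio(vdd_new: float, vdd_nominal: float) -> float:
    """Fraction of dynamic power remaining after scaling the supply.

    Assumes P_dynamic is proportional to VDD^2 at a fixed clock frequency
    and switching activity; static power is not modelled.
    """
    if vdd_new <= 0 or vdd_nominal <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd_new / vdd_nominal) ** 2


ratio = dynamic_power_ratio(1.0, 1.2)
print(f"dynamic power remaining: {ratio:.1%}")
print(f"dynamic power saved:     {1 - ratio:.1%}")
```

Under these assumptions, dropping from 1.2 V to 1.0 V leaves roughly 69 % of the dynamic power.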
14
Outline
• DVS proposal
• Testing Procedure
• FRoC
• Results
• Summary & Future work
15
Conventional Design Cycle
• One measurement by STA
• Flow: application HDL -> passes timing -> application bit-stream -> FPGA
• Program & run application with nominal VDD
17
DVS Proposal Overview
• CAD system: takes the application HDL and produces the application bit-stream and the calibration bit-stream
• 1st measurement by conventional STA (once per application)
• 2nd measurement by on-chip calibration (repeated for each FPGA): program the calibration bit-stream & generate the calibration table (CT)
• CT: V and Fmax entries for each temperature (T = t1, T = t2)
• Program & run the application with DVS; the power stage supplies VDD
[Figure: flow from the CAD system to FPGA calibration and DVS operation. Labels: replicated critical path, critical path, heaters, power stage. An annotation marks the part covered by today's talk.]
21
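The role of the calibration table can be sketched as a simple lookup: given a temperature and a required clock frequency, return the lowest tabulated VDD that is fast enough. Everything concrete here is an assumption made for illustration: the table layout, the numbers in it, the `min_vdd` name, and the choice of using the nearest calibrated temperature.

```python
# Hypothetical calibration table: temperature (deg C) -> list of
# (VDD in volts, Fmax in MHz) rows sorted by ascending VDD. Values are made up.
CT = {
    25: [(0.9, 90.0), (1.0, 125.0), (1.1, 150.0), (1.2, 170.0)],
    85: [(0.9, 80.0), (1.0, 115.0), (1.1, 140.0), (1.2, 160.0)],
}


def min_vdd(ct, temperature, f_required_mhz):
    """Return the lowest tabulated VDD whose Fmax meets the required frequency.

    Uses the rows of the nearest calibrated temperature (an assumption of
    this sketch). Returns None if no tabulated voltage is fast enough.
    """
    nearest = min(ct, key=lambda t: abs(t - temperature))
    for vdd, fmax in ct[nearest]:  # rows are scanned from low to high VDD
        if fmax >= f_required_mhz:
            return vdd
    return None


print(min_vdd(CT, temperature=30, f_required_mhz=120.0))
print(min_vdd(CT, temperature=80, f_required_mhz=120.0))
```

A real controller would also need a policy for temperatures between calibration points; picking the nearest row is only the simplest option.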
Generating the Calibration Bit-stream
• Performed on each FPGA at least once
• To track aging effects, calibrate at every power-up
• Capture all speed-limiting paths
• Invisible to FPGA users
• Fast, Robust, Automated Calibration: the FRoC CAD tool
22
Outline
• Motivation
• DVS proposal
• Testing Procedure
• FRoC
• Results
• Summary & Future work
23
How to measure Fmax
• Stimulate with random inputs and check the output?
  • Does not guarantee exercising the critical path (CP)
• To robustly measure the delay of a path:
  • Off-path inputs must hold a steady non-controlling value
  • Control over the edge transition from input -> output
[Figure: a LUT with the tested path carrying a 1/0 edge and the off-path inputs held at a steady 1/0.]
25
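The non-controlling-value requirement can be made concrete with a small check: for a given LUT function, list the off-path input assignments under which toggling the on-path input changes the output. This is a minimal sketch; the helper name `sensitizing_assignments` and the 3-input AND example are illustrative and not taken from the talk.

```python
from itertools import product


def sensitizing_assignments(lut, num_inputs, on_path):
    """Return the off-path assignments that let the on-path input reach the output.

    lut is a function of num_inputs bits; on_path is the index of the tested input.
    An assignment qualifies if toggling the on-path input flips the LUT output.
    """
    off_path = [i for i in range(num_inputs) if i != on_path]
    result = []
    for values in product((0, 1), repeat=len(off_path)):
        inputs = [0] * num_inputs
        for idx, v in zip(off_path, values):
            inputs[idx] = v
        inputs[on_path] = 0
        out0 = lut(*inputs)
        inputs[on_path] = 1
        out1 = lut(*inputs)
        if out0 != out1:
            result.append(dict(zip(off_path, values)))
    return result


and3 = lambda a, b, c: a & b & c
print(sensitizing_assignments(and3, 3, on_path=0))  # [{1: 1, 2: 1}]
```

For an AND function only the all-ones off-path assignment works, whereas for an XOR function every assignment does, which is one reason XOR-style masks are convenient for path testing.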
Measuring the Delay of a Single Path
• Replicate the application's critical path (CP)
• Change the LUT masks of the replicated LUTs (XOR)
• Control the edge transition (Edge1, Edge2)
• Apply an input stimulus
• Detect timing faults (error detection: XNOR, FF, Error output)
[Figure: the application CP (FFs and LUTs) next to its replica with XOR LUT masks, Edge1/Edge2 controls, an input-stimulus FF and error-detection logic.]
30
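The measurement idea (launch an edge every cycle, capture it at the sink, flag a mismatch) can be mimicked in software. The sketch below is a behavioural toy model, not the on-chip circuit: it ignores setup time and clock skew, treats the path as non-inverting, and the function names and the frequency sweep range are assumptions.

```python
def has_timing_fault(path_delay_ns, clock_period_ns, cycles=8):
    """Toggle a source FF every cycle and check what the sink FF captures."""
    source = 0  # source FF output, toggled every cycle
    for _ in range(cycles):
        previous, source = source, source ^ 1  # launch a new edge
        on_time = path_delay_ns <= clock_period_ns
        captured = source if on_time else previous  # value sampled by the sink FF
        expected = source
        if captured != expected:  # plays the role of the error detector
            return True
    return False


def measure_fmax_mhz(path_delay_ns, f_start_mhz=50.0, f_stop_mhz=400.0, step_mhz=1.0):
    """Raise the clock frequency step by step; return the last fault-free value."""
    fmax = None
    f = f_start_mhz
    while f <= f_stop_mhz:
        if has_timing_fault(path_delay_ns, 1000.0 / f):
            break
        fmax = f
        f += step_mhz
    return fmax


print(measure_fmax_mhz(5.0))  # a 5 ns path passes up to 200 MHz in this model
```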
A Single Path Delay is Not Robust
• Many paths have delay close to the CP
• Within-die variation may cause some other paths to be more critical
• Varying VDD affects the delay of different FPGA elements differently
• Robust: measure the delay of many near-critical paths
• Fast: use 1 calibration bit-stream
31
Testing Disjoint Paths
• Testing many disjoint paths is mostly easy
• Repeat the same procedure as for single-path testing
[Figure: application FFs and paths next to their calibration replicas with XOR (⨁) masks, each replica with its own Error output.]
33
..but What to Do with Overlapping Paths?
• Paths sharing a LUT through different inputs
• To test Path1, fix the off-path input at C
• Path1 & Path2 can't be tested together
• Need 2 separate test phases
  - Add Fix control signals to keep LUT output constant
  - Test controller cycles through test phases sequentially
[Figure: source FFs S1 and S2, LUT A and LUT B with FixA and FixB controls, Path1 and Path2 meeting at LUT C, and a sink FF.]
38
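The overlap condition on this slide can be written as a small predicate: two paths cannot share a test phase if they pass through the same LUT via different inputs. The representation below (a path as a list of `(lut, input_pin)` hops) and the pin numbers are assumptions made for illustration.

```python
def paths_conflict(path_a, path_b):
    """True if the two paths enter a common LUT through different input pins.

    Each path is a list of (lut_name, input_pin) hops; a path is assumed to
    visit any given LUT at most once.
    """
    pins_a = dict(path_a)  # lut_name -> input pin used by path_a
    return any(lut in pins_a and pins_a[lut] != pin for lut, pin in path_b)


# Hypothetical encoding of the slide's example: both paths end in LUT C.
path1 = [("A", 0), ("C", 0)]
path2 = [("B", 0), ("C", 1)]
path3 = [("D", 2)]  # an unrelated, disjoint path

print(paths_conflict(path1, path2))  # True: separate test phases needed
print(paths_conflict(path1, path3))  # False: can be tested together
```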
LUT Masks for Testing
𝐹 = 𝐹𝑖𝑥 ⋅ 𝐼1 ⨁𝐼2 … ⨁𝐼𝐾−2 ⨁ 𝐸𝑑𝑔𝑒 + 𝐹𝑖𝑥
• Fix: fixes off-path inputs and breaks re-convergent fan-outs
• Edge: controls the edge transition
[Figure: K-LUT with inputs I1, I2, …, IK−2, Fix and Edge.]
• 𝐹𝑖𝑥 only added when required
• Developed more LUT masks to test Cyclone IV carry-chains with the
same controllability
39
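To show what such a test mask looks like as LUT contents, the sketch below builds the truth table of a K-LUT whose output is the XOR of the path inputs and Edge unless Fix overrides it. The polarity is an assumption: the extracted formula does not show which Fix term is complemented, so here Fix = 1 is taken to force the output to a constant 1. The function name and the input ordering are also illustrative.

```python
from itertools import product


def build_test_mask(k):
    """Truth table (list of 2**k bits) for a K-LUT test mask.

    Input order: I1 .. I(k-2), Fix, Edge, with Edge as the least significant bit.
    Assumed behaviour: Fix = 1 holds the output at 1 (polarity is a guess);
    Fix = 0 outputs I1 ^ ... ^ I(k-2) ^ Edge.
    """
    if k < 3:
        raise ValueError("need at least one path input plus Fix and Edge")
    mask = []
    for bits in product((0, 1), repeat=k):
        *path_inputs, fix, edge = bits
        parity = edge
        for b in path_inputs:
            parity ^= b
        mask.append(1 if fix else parity)
    return mask


print(build_test_mask(4))
```

With Fix low, flipping Edge inverts the output, which is how the direction of the tested transition can be selected.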
Can’t Test Everything with 1 Bit-stream
• One or two LUT inputs used as control signals
• Fixing LUT output does not break all re-convergent fan-outs
• LAB inputs constraint
• Carry-chains constraints
[Figure: a LUT whose inputs carry paths P1 to P4; once Edge and Fix occupy two inputs, only P1 and P2 remain. A second sketch shows Path1 and Path2 through LUT A, LUT B and LUT C.]
43
CAD System with FRoC
• Application HDL -> Quartus P&R -> application bit-stream
• Quartus STA -> FRoC:
  1) Paths selection
  2) Paths replication
  3) Grouping replicated paths
  4) Test controller generation
• FRoC -> calibration HDL + location & routing constraints -> Quartus -> calibration bit-stream
[Figure: block diagram of the proposed CAD system.]
45
1) Path selection
• Extract near-critical paths from STA
  • {P1, P2, P3, P4, P5}
• Select which paths to test
  • Can’t test {P2, P3, P4} in 1 bit-stream: two inputs of each 4-LUT are reserved for control signals (Fix, Edge)
  • Select the more critical paths
  • {P1, P2, P3, P5}
[Figure: application circuit with FFs, 4-LUTs and paths P1 to P5.]
49
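One way to picture this selection step is a greedy pass over the candidate paths in order of criticality, accepting a path only if every LUT it uses still has a free non-control input. This is a sketch of that idea, not FRoC's actual algorithm; the path delays, pin assignments and names below are invented so that the outcome mirrors the slide.

```python
def select_paths(paths, k=4, reserved=2):
    """Greedily pick the most critical paths that fit in one bit-stream.

    paths: dict name -> (delay_ns, [(lut_name, input_pin), ...]).
    Each K-LUT keeps `reserved` inputs for Fix/Edge, so tested paths may use
    at most k - reserved distinct inputs of any LUT.
    """
    limit = k - reserved
    used = {}  # lut_name -> set of input pins already claimed
    selected = []
    for name, (delay, hops) in sorted(paths.items(), key=lambda kv: -kv[1][0]):
        fits = all(len(used.get(lut, set()) | {pin}) <= limit for lut, pin in hops)
        if fits:
            for lut, pin in hops:
                used.setdefault(lut, set()).add(pin)
            selected.append(name)
    return sorted(selected)


# Hypothetical candidates: P2, P3 and P4 enter the same 4-LUT "X" on three pins.
candidates = {
    "P1": (10.0, [("W", 0)]),
    "P2": (9.8, [("X", 0)]),
    "P3": (9.7, [("X", 1)]),
    "P4": (9.5, [("X", 2)]),
    "P5": (9.6, [("Y", 0)]),
}
print(select_paths(candidates))  # ['P1', 'P2', 'P3', 'P5']
```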
2) Path replication
• Replicate the selected paths {P1, P2, P3, P5} and add control signals
[Figure: the application circuit next to the replicated paths; the replicated 4-LUTs carry control inputs Fix1/Edge1, Fix2/Edge2 and Fix3/Edge3.]
51
3) Grouping replicated paths
• Minimising test phases -> minimises calibration time
• Graph coloring problem
• Tested > 5000 paths using only 17 phases
[Figure: replicated paths P1, P2, P3 and P5 through 4-LUTs with Fix1 to Fix3 and Edge1 to Edge3 controls, grouped step by step into test phases.]
58
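Grouping paths into test phases maps onto graph coloring: paths are vertices, an edge joins two paths that cannot be tested together, and each color is a test phase. The sketch below uses a plain greedy coloring with a highest-degree-first ordering, which is a common heuristic and not necessarily what FRoC does; the conflict graph is hypothetical.

```python
def assign_test_phases(conflicts):
    """Greedy graph coloring.

    conflicts maps each path to the set of paths it cannot be tested with.
    Returns a dict path -> phase index (0-based).
    """
    phase = {}
    # Visit high-degree paths first; ties are broken by name for determinism.
    for path in sorted(conflicts, key=lambda p: (-len(conflicts[p]), p)):
        taken = {phase[n] for n in conflicts[path] if n in phase}
        color = 0
        while color in taken:
            color += 1
        phase[path] = color
    return phase


# Hypothetical conflict graph for the replicated paths.
conflicts = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2"},
    "P5": set(),
}
phases = assign_test_phases(conflicts)
print(phases)
print("number of phases:", max(phases.values()) + 1)
```

Greedy coloring does not guarantee the minimum number of phases in general, but it is fast and always yields a valid grouping.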
4) Test controller generation
• For each test phase:
  • Sets the appropriate control signals
  • Generates input stimulus
  • Detects timing faults
[Figure: test controller supplying control signals and input stimulus to the replicated paths, with sink registers and an Error output.]
59
Benchmarks & Target Chip
• Dual-channel 51-tap low pass FIR filter
• Full crossbar (Xbar) with 16 100-bit-wide-ports
Application   LE utilization   Reported Fmax
FIR filter    67,505 (59 %)    121 MHz
Crossbar      26,579 (23 %)    115 MHz
• Targeting Cyclone IV EP4CE115F29C7 (TSMC 60-nm technology)
• Nominal VDD 1.2 V
61
How Many Edges Are We Covering?
• A timing edge is a connection between:
  • the input and output of a cell (cell delay), or
  • the output of one cell and the input of another cell (connection delay)
• Timing edge criticality = (delay of the longest path using this edge) / (CP delay)
[Figure: timing edge coverage versus criticality (%), for Xbar and FIR with 10000 candidate paths each.]
• Covering more than 90 % of the more critical bins
• FRoC favours testing the more critical edges
62
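The criticality definition on this slide can be computed on a timing graph as: longest delay into the edge's source, plus the edge delay, plus longest delay out of its destination, all divided by the critical-path delay. The sketch below does this for a tiny acyclic graph; the node names and delays are invented, and parallel edges between the same pair of nodes are not handled.

```python
from collections import defaultdict
from functools import lru_cache


def edge_criticality(edges):
    """edges: list of (src, dst, delay). Returns {(src, dst): criticality in (0, 1]}."""
    fanout = defaultdict(list)
    fanin = defaultdict(list)
    for u, v, d in edges:
        fanout[u].append((v, d))
        fanin[v].append((u, d))

    @lru_cache(maxsize=None)
    def longest_to(node):  # longest delay from any source up to this node
        return max((longest_to(u) + d for u, d in fanin[node]), default=0.0)

    @lru_cache(maxsize=None)
    def longest_from(node):  # longest delay from this node to any sink
        return max((longest_from(v) + d for v, d in fanout[node]), default=0.0)

    through = {(u, v): longest_to(u) + d + longest_from(v) for u, v, d in edges}
    cp_delay = max(through.values())
    return {edge: delay / cp_delay for edge, delay in through.items()}


# Hypothetical timing graph (delays in ns).
timing_edges = [
    ("ff1", "lutA", 1.0),
    ("ff2", "lutA", 0.5),
    ("lutA", "ff3", 2.0),
    ("lutA", "ff4", 1.0),
]
for edge, crit in edge_criticality(timing_edges).items():
    print(edge, f"{crit:.2f}")
```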
First, a Sanity Check
• Need to validate the CT values
• Selected benchmarks are feed-forward applications with no buried states
[Figure: BIST setup in which a BIST controller drives the application from an LFSR and compares the MISR result against a reference (Ref = Tested).]
[Figure: measured Fmax (MHz, 0 to 250) of FIR and Xbar versus supply voltage (mV, 800 to 1400), with the CAD-reported Fmax of FIR (121.18 MHz) and Xbar (115.19 MHz).]
63
How Many Paths to Measure?
[Figure: two panels, Xbar and FIR. Fmax (MHz, 60 to 220) versus VDD (V, 0.8 to 1.3) estimated from 1 path, 2000 paths and 10000 paths, compared with the benchmark's actual Fmax.]
• 1 path is not robust
• Fan-out loading effects
64
Fan-out Correction & Guard-banding
• Correct for fan-out using the difference in delay reported by Quartus STA between the calibration and the application bit-streams
  • 1 % for FIR & 5 % for Xbar
• Guard-band for IR-drop and crosstalk effects
  • 5 % for both benchmarks (experimental values)
65
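The effect of these two margins on a calibration measurement can be sketched with simple arithmetic. The percentages come from the slide; applying both as multiplicative derating factors on frequency is an assumption of this sketch, and the 150 MHz input is an arbitrary example value.

```python
FANOUT_CORRECTION = {"FIR": 0.01, "Xbar": 0.05}  # per-benchmark values from the slide
GUARD_BAND = 0.05                                # same for both benchmarks


def guard_banded_fmax(measured_fmax_mhz, benchmark):
    """Derate a measured Fmax by the fan-out correction and the guard-band.

    Treating both margins as multiplicative factors on frequency is an
    assumption; the talk does not spell out how they are combined.
    """
    fanout = FANOUT_CORRECTION[benchmark]
    return measured_fmax_mhz * (1 - fanout) * (1 - GUARD_BAND)


for name in ("FIR", "Xbar"):
    print(name, round(guard_banded_fmax(150.0, name), 3), "MHz")
```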
Generated CT & Power Savings
[Figure: two panels, Xbar and FIR. Fmax (MHz, 60 to 200) versus VDD (V, 0.8 to 1.3): benchmark actual Fmax and guard-banded CT, with the nominal operation point marked.]
• With DVS, run both applications safely at 1 V
• Save > 33 % of total power consumption
69
Summary
• Presented a DVS approach tailored for FPGAs (off-line calibration)
• Created the FRoC tool to automate the calibration procedure
• Achieved more than 33 % total power reduction
71
Future Work
• Reducing guard-bands to enable more power savings
  • Complete fan-out modelling for tested paths
  • Account for IR-drop during calibration
• Determining the number of calibration bit-streams required for full coverage
• Testing hard blocks to find the safest minimum VDD
72