Blade – A Path Towards Average-Case
Silicon via Bundled-Data Resilient Design
Peter A. Beerel
CHIP in Bahia, Salvador Brazil
Sept 4th, 2015
Collaborations and Acknowledgements
USC (USA): Dylan Hand, Fei Huang,
Ramy Tadros, Alan Huang, Yang Zhang,
Mel Breuer, Danlei Chen, Weizhe Hua,
Huimei Cheng, Austin Katzin, Yuqi Song,
Zhe Liu, Jun He
Tsinghua (China):
Benmao Cheng
BITS (India):
Ajay Singhvi
PUCRS (Brazil): Ney Calazans,
Matheus Moreira, Guilherme Heck,
Leandro Heck, Matheus Gibiluka
USCS (Brazil): Frederico Butzke
MOTIVATION | 2
Motivation: Delay Overheads
[Figure: breakdown of the cycle time of clocked logic: logic time (logic gates, flip-flop alignment), PVT margin, and clock margin]
[Dreslinski et al., Proc. IEEE 2010]
Traditional synchronous design suffers from increased margins
• Worse at low and near-threshold regions
MOTIVATION | 2
Motivation: Data Dependent Delays
[Figure: data delay in Plasma CPU; histogram of number of operations vs. delay, marking the average delay and the worst-case delay that sets the cycle time of clocked logic]
Delay variation due to data is rarely exploited in traditional designs
MOTIVATION | 3
Outline
Resiliency Design and its Pitfalls
Blade Design, Operation, and Special Cells
Blade Analysis and Optimization
Blade CAD Flow and Case Study
Conclusions and On-Going Work
OUTLINE | 5
Resilient Design
[Figure: resilient pipeline stage: logic feeding a sequential gate and a shadow sequential element, with error detection signaling the control logic and data passing to the next pipeline stage]
Allow and detect timing errors in datapath
• Correct via architectural replay or gating/pausing clock
Effective approaches have been elusive!
RESILIENT DESIGN | 5
Metastability Issue
[Figure: logic feeding a latch and a shadow FF clocked by CLK; an undefined value (X) reaches both the next pipeline stage and the control logic]
[Bubble Razor, Fojtik, 2013]
The Problem [Beer et al, 2014]
• Flop metastability can propagate to ERR signal
• Metastability in control path can cause system failure
RESILIENT DESIGN | 6
Handling Metastability
[Figure: logic feeding a latch with a transition detector; data (1 or 0) goes to the next pipeline stage and the detector output (X) goes to a synchronizer]
[RazorII, Das, 2009]
Transition detector may exhibit metastability
• Error signal must go through synchronizer, increasing delay
• Uses architectural replay to recover from errors
RESILIENT DESIGN | 7
Hold Time Concerns
[Figure: ring-oscillator clock generator with STOP input, combinational logic, FF with metastability (MS) detector, and internal data latch, shown for CLK = 1 and CLK = 0]
[SafeRazor, Cannizzaro, 2014]
Relies on latch for error correction
Hold times are problematic
RESILIENT DESIGN | 8
Resiliency Landscape
Design Template   Sync/Async   MTBF Safe   Avoids Replay Logic   Hold Time Robust   Low Error Penalty
Bubble Razor      Sync         No          Yes                   Yes                Yes
Razor II          Sync         Yes         No                    No                 No
SafeRazor         Async        Yes*        Yes                   No                 Yes
Blade             Async        Yes         Yes                   Yes                Yes
Blade combines the best features of past resiliency schemes
RESILIENT DESIGN | 9
Blade Template
[Figure: Blade stage: a single-rail datapath of combinational logic and error detecting latches (EDL) from L.data to R.data; an asynchronous Blade controller with L.req/L.ack, R.req/R.ack, LE.req/LE.ack, and RE.req/RE.ack handshakes; reconfigurable delay lines δ and Δ; error detection logic returning a 2-bit Err signal; CLK and Sample outputs]
BLADE TEMPLATE | 11
Blade Template Operation
[Figure: animation of two Blade stages (controllers A and B, each with CLK, Sample, and a Δ delay line) separated by combinational logic and a δ delay line, with EDLs and error detection logic; frames show the Err = 0 and Err = 1 cases]
Send request speculatively before data is guaranteed stable
Timing errors delay handshaking signals
BLADE TEMPLATE | 12
Positive Hold Margins
[Figure: the same two Blade stages (controllers A and B) with δ and Δ delay lines, EDLs, and error detection logic, shown for Err = 0]
Handshaking delays create positive hold margin!
BLADE TEMPLATE | 13
Error Detection Logic
[Figure: EDL built from a D latch and a transition detector with delay t_TD; the detector outputs (X) are collected by a C-element (output CE), which is sampled by a Q-Flop when Sample = 1 to produce Err0 and Err1 for the Blade controller]
C-element stores error signal, which is sampled by Q-Flop
BLADE TEMPLATE | 14
Metastable-Safe Operation
[Figure: the same circuit with the C-element output CE undefined (X) when Sample = 1; the Q-Flop outputs Err0 and Err1 settle some time later]
Q-Flop prevents metastability propagation to control path
BLADE TEMPLATE | 15
Overhead Reduction
[Figure: error collection from EDL transition detectors (delay t_TD) through C-elements and OR gates (delay t_comp) to the Q-Flop, which drives Err0 and Err1 to the Blade controller]
• 4-input C-element collects output from 3 EDLs
• 4-input OR gate collects output from 4 C-elements
• Each Q-Flop covers 12 EDLs
C-element and OR gates amortize overhead over many EDLs
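The fan-in figures on this slide can be turned into a quick cell count. The sketch below is only an illustration: the function name and the assumption of exactly one Q-Flop per OR gate (inferred from "each Q-Flop covers 12 EDLs" = 3 × 4) are not taken from the slides.

```python
import math

def error_collection_cells(num_edls, edls_per_c_element=3, c_elements_per_or=4):
    """Estimate the cells needed to collect error signals from num_edls EDLs.

    Assumptions (hypothetical, for illustration only):
    - each 4-input C-element collects 3 EDL outputs,
    - each 4-input OR gate collects 4 C-element outputs,
    - one Q-Flop samples each OR gate, so it covers 3 * 4 = 12 EDLs.
    """
    c_elements = math.ceil(num_edls / edls_per_c_element)
    or_gates = math.ceil(c_elements / c_elements_per_or)
    q_flops = or_gates
    return c_elements, or_gates, q_flops

for n in (12, 13, 24, 100):
    c, o, q = error_collection_cells(n)
    print(f"{n} EDLs -> {c} C-elements, {o} OR gates, {q} Q-Flops")
```

The count grows in steps of 12 EDLs per Q-Flop, which is the amortization the slide refers to.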
BLADE TEMPLATE | 16
Controller Implementation
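[Figure: Blade controller split into Left, Right, and Bottom parts; external signals L.req, L.ack, LE.req, LE.ack, R.req, R.ack, RE.req, RE.ack, Err[1], Err[0], CLK, and Sample; internal signals goL, goR, goD, edi, edo, and delay; δ and Δ delay lines]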
Three part burst-mode state machine
• Implemented using 3D [Yun, 1992]
[Figure: burst-mode state diagrams for the Left, Right, and Bottom parts of the Blade controller, overlaid on the controller block diagram]
BLADE TEMPLATE | 17
Delay Elements: MUX-DE
Most popular programmable Delay Elements (DEs)
Delay controlled by number of buffers in the signal path
Binary codewords
N. Mahapatra et al., “Comparison and Analysis of
Delay Elements,” in MWSCAS, 2002, pp. 473–476.
DELAY ELEMENTS | 18
Delay Elements: One-Hot DCCS
Lengths of the current source transistors increase linearly
• (1L, 2L, 3L, ..., nL)
Replicate inverting stages to balance rise-fall delays
For a particular codeword, only one of the current source transistors is ON
[Singhvi et al., ISVLSI 2015]
DELAY ELEMENTS | 19
Delay Elements: Delay Shift Inverter
Simple inverter structure
3 flavors for back-biasing
• For RBB, PBIAS > 1V and NBIAS < 0V
• Use flavor (b)
Amount of delay shift controlled by two factors:
• Inverter sizing
• Magnitude of back-bias voltage
[Singhvi et al., ISVLSI 2015]
DELAY ELEMENTS | 20
Delay Lines and Voltage Scaling
Voltage Scaled Delay Ratio (VSDR)
• Delay at nominal voltage over delay at near-threshold voltage
The problem
• VSDR changes with codeword
• Forces needed margin
Solution
• Replicate and size designs to reduce margins
*Delays range: 300ps to 1.4ns
*Sized to match rise/fall delays
[Singhvi et al., VLSID 2015]
DELAY ELEMENTS | 23
Comparison of Characteristics
Energy vs. quantization-error trade-offs exist and can be exploited
*DQE: Delay Quantization Error
*EPT: Energy Per Transition
[Singhvi et al., ISVLSI 2015]
DELAY ELEMENTS | 21
Delay Line Quantization Effects
Recall: 𝑑 = 𝛿 + p ∗ Δ
[Figure: delay plot showing the δ and p ∗ Δ components]
Quantization effects are reduced by the inherent trade-off between nominal delay δ and error penalty p ∗ Δ
DELAY LINE QUANTIZATION | 25
Delay Line Quantization in BD
Bundled Data: 𝑑 = 𝛿
[Figure: delay 𝑑 plotted against δ]
Linear relationship between delay line quantization and average stage delay in Bundled Data
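A small numerical sketch can make the contrast between the two delay formulas concrete. Everything numeric here is hypothetical (a normal logic-delay distribution, a fixed worst-case delay δ + Δ, a 0.05 ns rounding error); the only formulas taken from the slides are 𝑑 = 𝛿 + p ∗ Δ for Blade and 𝑑 = 𝛿 for bundled data.

```python
from statistics import NormalDist

def blade_average_delay(delta, worst_case, dist):
    """d = delta + p * Delta, with Delta = worst_case - delta and
    p = probability that the logic delay exceeds delta."""
    p = 1.0 - dist.cdf(delta)
    return delta + p * (worst_case - delta)

# Hypothetical stage: logic delay ~ Normal(1.0 ns, 0.1 ns), worst case 1.3 ns.
dist = NormalDist(mu=1.0, sigma=0.1)
worst_case = 1.3
delta = 1.07    # intended nominal delay (assumed)
q_err = 0.05    # delay-line quantization error, rounded upward (assumed)

# Blade: delta grows by q_err, but the error probability p shrinks.
blade_increase = (blade_average_delay(delta + q_err, worst_case, dist)
                  - blade_average_delay(delta, worst_case, dist))

# Bundled data: d = delta, so the stage delay grows by the full error.
bd_increase = (worst_case + q_err) - worst_case

print(f"Blade average delay increase: {blade_increase:.4f} ns")
print(f"Bundled-data delay increase:  {bd_increase:.4f} ns")
```

Under these assumptions the Blade stage absorbs only part of the quantization error, while the bundled-data stage pays for all of it.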
DELAY LINE QUANTIZATION | 26
Error Detecting Latch: TDTB
[Figure: Transition Detector (TD) and Error Latch (EL)]
Formed by a transition detector and an error latch (an asymmetric C-element)
[Bowman et al, ISSCC 2007]
ERROR DETECTING LATCH | 27
Error Detecting Latch: Glitch Sensitivity
Transistor sizing of XOR
• OX-TD+EL
Use static C-element
• SOX-TD+EL
Analyze for glitch sensitivity
[Figure: glitch sensitivity across the timing window t_win]
Increased sensitivity enables safe operation even in the presence of glitches
[Moreira et al., ISQED 2015]
ERROR DETECTING LATCH | 28
Resiliency Performance Benefit
[Figure: data delay in Plasma CPU; histogram of number of operations vs. delay, marking the nominal delay δ, the resiliency window Δ, and the worst-case delay]
Key Question: How do we set δ to optimize performance?
ANALYSIS | 30
Delay Models
[Figure: probability vs. logic delay for a normal distribution and a log-normal distribution, alongside a real-world frequency distribution of logic delay]
Our approach
• Analyze the performance of Blade for a variety of delay models
DELAY MODELS | 31
Optimal Average-Case Performance
[Figure: probability density of logic delay with mean μ and standard deviation σ; the area up to δ is PR( d ≤ δ ), and the area between δ and δ+Δ is the probability of error p = 1 − PR( d ≤ δ )]
Definitions
• C : Clock Period / Cycle Time, 𝐶 = 𝛿 + Δ
• EC : Effective Clock Period
• p : Probability of error
• 𝑑 : Average delay of Blade stage, 𝑑 = 𝛿 + p ∗ Δ
Optimal performance achieved by minimizing 𝑑
*Assumes backward latency is hidden via latch retiming
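The definitions on this slide can be explored numerically. The following is a minimal sketch, not the authors' tool: it assumes a normally distributed logic delay, hypothetical values for μ, σ, and the cycle time, and a plain grid search over δ with C = δ + Δ held fixed. The lower bound of half the cycle time reflects the statement elsewhere in this talk that Blade supports a maximum Δ of 50% of the clock cycle.

```python
import math

def normal_cdf(x, mu, sigma):
    """Pr(d <= x) for a normally distributed logic delay d."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def average_stage_delay(delta, cycle, mu, sigma):
    """Return (average delay, error probability) for nominal delay delta.

    The resiliency window is Delta = cycle - delta, so C = delta + Delta.
    """
    p = 1.0 - normal_cdf(delta, mu, sigma)
    return delta + p * (cycle - delta), p

def best_delta(cycle, mu, sigma, steps=2000):
    """Grid-search delta in [cycle/2, cycle] for the smallest average delay."""
    best = None
    for i in range(steps + 1):
        delta = cycle / 2 + (cycle / 2) * i / steps
        d_avg, p = average_stage_delay(delta, cycle, mu, sigma)
        if best is None or d_avg < best[0]:
            best = (d_avg, delta, p)
    return best

# Hypothetical numbers: mean 1.0 ns, sigma 0.1 ns, cycle time at mu + 3*sigma.
d_avg, delta_opt, p_opt = best_delta(cycle=1.3, mu=1.0, sigma=0.1)
print(f"delta = {delta_opt:.3f} ns, p = {p_opt:.3f}, average delay = {d_avg:.3f} ns")
```

Setting δ equal to the cycle time gives p ≈ 0 and an average delay equal to the worst case; the search shows how much a smaller δ can gain under the assumed distribution.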
DELAY MODELS | 32
Optimal Probability of Error - popt
popt observations
• Varies between 20% and 35% for lognormal distributions
• Significantly higher than in sync resiliency
• Constant for normal distributions!
[Figure: popt plot, annotated "Higher Variance"]
DELAY MODELS | 33
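Proof of constant popt
Assume worst-case delay per stage is constant: $K = \delta + \Delta$
Worst-case delay is set by mean, variance, and SER: $K = \mu + m\,\sigma$
Systematic Error Rate ($\xi$) sets the worst-case delay per stage: $m = f(\xi)$
For a Normal distribution:
$$1 - p = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{\delta - \mu}{\sqrt{2}\,\sigma}\right)\right] \quad\Rightarrow\quad 1 - 2p = \operatorname{erf}\!\left(\frac{\delta - \mu}{\sqrt{2}\,\sigma}\right)$$
Taking inverse error function of both sides:
$$\operatorname{erf}^{-1}(1 - 2p) = \frac{\delta - \mu}{\sqrt{2}\,\sigma} \quad\Rightarrow\quad \delta = \sqrt{2}\,\sigma\,\operatorname{erf}^{-1}(1 - 2p) + \mu$$
Recall: $d = \delta + p\,\Delta = (1 - p)\,\delta + p\,K$
Rewrite: $d = (1 - p)\left[\sqrt{2}\,\sigma\,\operatorname{erf}^{-1}(1 - 2p) + \mu\right] + p\,K$
Minimize $d$ by taking the derivative and setting it equal to zero, with $y = 1 - 2p$:
$$\frac{\partial d}{\partial p} = \sigma\left[m - \sqrt{2}\,\operatorname{erf}^{-1}(y) - \sqrt{2}\,(1 + y)\,\frac{\partial \operatorname{erf}^{-1}(y)}{\partial y}\right] = 0$$
Note y and p are independent of σ and μ! m depends only on $\xi$
Implication: Tuning of delay line may target fixed probability!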
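The claim that popt does not depend on μ and σ for normal distributions can be checked numerically. This is a sketch under stated assumptions: the function name, the grid search over p, the restriction to 0 < p < 0.5, and the value m = 3 are all illustrative choices, not values from the talk. It uses `statistics.NormalDist.inv_cdf` from the Python standard library in place of an explicit inverse error function.

```python
from statistics import NormalDist

def optimal_error_probability(mu, sigma, m, steps=999):
    """Grid-search the error probability p that minimizes the average
    stage delay d = (1 - p) * delta + p * K for a normal logic delay."""
    worst_case = mu + m * sigma            # K = mu + m * sigma
    dist = NormalDist(mu, sigma)
    best_p, best_d = None, float("inf")
    for i in range(1, steps + 1):
        p = 0.5 * i / (steps + 1)          # scan 0 < p < 0.5
        delta = dist.inv_cdf(1.0 - p)      # delta with Pr(d <= delta) = 1 - p
        d_avg = (1.0 - p) * delta + p * worst_case
        if d_avg < best_d:
            best_p, best_d = p, d_avg
    return best_p

# Two very different hypothetical delay distributions, same m.
p_a = optimal_error_probability(mu=1.0, sigma=0.1, m=3.0)
p_b = optimal_error_probability(mu=5.0, sigma=0.8, m=3.0)
print(f"popt for (mu=1.0, sigma=0.1): {p_a:.4f}")
print(f"popt for (mu=5.0, sigma=0.8): {p_b:.4f}")
```

Both searches land on the same grid point (up to rounding), while changing m moves the optimum, which matches the note that m depends only on the systematic error rate.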
DELAY MODELS | 34
Optimal Size of Resiliency Window - Δ
Blade supports maximum Δ of 50% of clock cycle
Optimal Δ is larger for designs with high-variance!
OPTIMIZATION | 35
Comparison to Sync Resiliency
[Figure: N-stage ring configurations for the synchronous (flip-flops, F), Bubble Razor (master/slave latches M, S with EDLs), and Blade (EDL) designs; comparison plot for a Normal Distribution with annotations Blade 35%, BR 23%]
Synchronous
• EC set by systematic error rate
Bubble Razor [Zhang, 2014]
• $EC = C\left[2 - (1 - p)^{2N}\right]$
Blade
• $EC = \delta + p_{opt}\,\Delta$
COMPARISON | 36
Automatic Conversion Flow
Convert single-clock sync RTL design to Blade
Re-uses synchronous EDA tools and libraries
[Figure: Blade design flow from RTL Spec to Async Netlist: Sync Synthesis → FF to Latch Conversion → Latch Retiming → Add EDL + Async Control → Simulation]
Seamless integration into existing flows
CASE STUDY | 38
Synthesis
[Figure: Blade design flow, Sync Synthesis step; flip-flops (FF) clocked by CLK]
Synthesize sync RTL design using standard EDA tools
CASE STUDY | 39
Latches
[Figure: Blade design flow, FF to Latch Conversion step; master (M) and slave (S) latches clocked by CLK1 and CLK2]
Replace flip flops with master-slave latches
Two-phase non-overlapping clocking
CASE STUDY | 40
Latch Retiming
[Figure: Blade design flow, Latch Retiming step; master (M) and slave (S) latches repositioned across the logic, clocked by CLK1 and CLK2]
Retime latches to spread logic delay across stages
Allow time borrowing to reduce area overhead
CASE STUDY | 41
EDL Insertion
[Figure: Blade design flow, Add EDL + Async Control step; master and slave latches marked as EDL or time-borrowing (TB), clocked by CLK1 and CLK2]
Replace non-TB latches with error detecting latches
Not all latches need be error detecting
CASE STUDY | 42
Async Control
[Figure: Blade design flow, Add EDL + Async Control step; each EDL and TB latch stage driven by its own Blade controller]
Remove synchronous clock trees
Add Blade controllers and delay lines
CASE STUDY | 43
Simulation
[Figure: Blade design flow, Simulation step; EDL and TB latch stages with their Blade controllers]
Back annotated SDF simulation using final netlist
CASE STUDY | 44
Resynthesis
[Figure: latches and EDLs (nodes A to F) connected through logic, marking the critical path, correlated EDLs, and paths with delay > δ and delay < δ]
Add set_max_delay from A to F = δ
CASE STUDY | 45
Brute Force Resynthesis
Best run reduced error rate by 39% and EDLs by 27% (-1.9% area)
[Figure: results of the resynthesis runs, marking the best run; annotation: 122 correlated EDLs]
Evaluated hundreds of resynthesis runs
• Each run sets a max delay constraint to a single latch
Chose result that led to largest reduction in area and error rate
CASE STUDY | 46
Case Study: Plasma
MIPS OpenCore
3-stage pipeline
28nm FDSOI @ 666MHz (w/ ideal clock and Vdd)

Type             Count
Combinational    11,740
Buf/Inv           1,683
Seq. (Non-RF)       531
RF                2,048
Total            14,319
http://opencores.org/project,plasma
CASE STUDY | 47
Case Study: Area and Performance
[Figure: area overhead breakdown across AND/OR trees, C-elements, Q-Flops, delay elements, controllers, and FF to Latch + EDL; slices of 2%, 6%, 12%, 24%, 24%, and 32%]
[Figure: performance of Plasma running CoreMark]
Overall area overhead is 8.4%
Performance increases
• Average: 19%
• Peak: 42%
CASE STUDY | 48
Performance Comparison with Margins
Must add margins for PVT variation, clock skew / jitter, and/or aging
• Synchronous frequency degraded to accommodate
Margin in Blade design is only imposed when an error occurs
• A 30% error rate reduces impact of margin by ~70%
Margin   Synchronous   Blade (30% ER)   % Advantage
0%       666MHz        800MHz           20%
15%      566MHz        764MHz           35%
30%      466MHz        728MHz           56%
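The figures in this table are consistent with a simple scaling model, sketched below. The model itself is an inference, not something stated on the slide: the synchronous frequency is assumed to scale by (1 − margin), and the Blade frequency by (1 − error rate × margin), since the margin is only paid on cycles with an error. The function names are hypothetical.

```python
def sync_frequency(f_nominal_mhz, margin):
    """Synchronous design pays the margin on every cycle (assumed model)."""
    return f_nominal_mhz * (1.0 - margin)

def blade_frequency(f_nominal_mhz, margin, error_rate):
    """Blade pays the margin only when an error occurs (assumed model)."""
    return f_nominal_mhz * (1.0 - error_rate * margin)

rows = []
for margin in (0.0, 0.15, 0.30):
    f_sync = sync_frequency(666.0, margin)
    f_blade = blade_frequency(800.0, margin, error_rate=0.30)
    advantage = round(100.0 * (f_blade / f_sync - 1.0))
    rows.append((margin, round(f_sync), round(f_blade), advantage))

for margin, f_sync, f_blade, advantage in rows:
    print(f"margin {margin:.0%}: sync {f_sync} MHz, Blade {f_blade} MHz, +{advantage}%")
```

With a 30% error rate the margin term is multiplied by 0.3, which corresponds to the roughly 70% reduction in margin impact quoted above.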
CASE STUDY | 49
Other Related Work
Canary Circuits [Sato, 2007]
• Removes some PVT margins
• But cannot take advantage of data dependency
Bundled Data Designs [Sutherland’89, Nowick’97]
• Speculative completion sensing exploits some data dependency
• Margins impact performance on every cycle
• No observability of errors
Soft Mousetrap [Liu, 2013]
• Hold time constraints remain difficult to meet
CONCLUSIONS | 51
Conclusions
Blade Template
• Achieves higher performance by exploiting data dependency
• Benefits from average vs worst-case MS resolution times
• Reduces impact of margins for PVT variations
• Enables voltage scaling for power savings
• Supported by design flow w/ commercial EDA tools
Plasma Case Study
• Highlights design and CAD techniques for area efficiency
• Achieves 19% increase in performance with 8.4% area overhead
CONCLUSIONS | 52
On-Going Work
Design
• Further design and analysis of efficient DEs and EDLs
Automated Flow
• Support designs specified in System Verilog CSP
• Develop average-case-driven re-synthesis tools
• Automated resilient-aware PnR Flow
Testing Strategy
• On-line tuning of delay lines
• Screen chips with variations too large to correct
Applications
• Encryption Designs (Triple DES and Light Weight Crypto)
ON-GOING WORK | 53
Blade Publications (2015)
D. Hand, M. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. A. Breuer, N. L. V.
Calazans, P. A. Beerel: Blade - A Timing Violation Resilient Asynchronous Template. ASYNC
2015: 21-28
D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. Moreira, M. A. Breuer, N. L. V. Calazans, P. A. Beerel:
Performance Optimization and Analysis of Blade Designs under Delay Variability. ASYNC
2015: 61-68
Y. Zhang, L. S. Heck, M. Moreira, D. Zar, M. A. Breuer, N. L. V. Calazans, P. A. Beerel: Design and
Analysis of Testable Mutual Exclusion Elements. ASYNC 2015: 124-131
M. Moreira, D. Hand, P. A. Beerel, N. Calazans: TDTB error detecting latches: Timing violation
sensitivity analysis and optimization. ISQED 2015: 379-383
G. Heck, L. S. Heck, A. Singhvi, M. Moreira, P. A. Beerel, N. L. V. Calazans: Analysis and
Optimization of Programmable Delay Elements for 2-Phase Bundled-Data Circuits. VLSI
Design 2015: 321-326.
A. Singhvi, M. T. Moreira, R. N. Tadros, N. L. V. Calazans, P. A. Beerel: A Fine-Grained, Uniform,
Energy-Efficient Delay Element for FD-SOI Technologies. ISVLSI 2015
P. A. Beerel, N. Calazans: A Path Towards Average Case Silicon using Resilient Bundled Data
Design, ECCTD 2015 (Invited)
ON-GOING WORK | 55
The Blade Team++
Asynchronous Circuits in Industry
3D Network on chips (STMicroelectronics)
• STMicroelectronics WIOMING 3D-IC
Ethernet Switches (Intel)
• Fulcrum/Intel Ethernet Switch Chip
Ultra high-speed FPGAs (Achronix)
• Achronix FPGA. 1.7 M LUTs. 2.1 Gbps IO
Low-power smart cards (Tiempo)
• Tiempo TAM16 Clockless 16-bit Microcontroller
Neuromorphic Computing (IBM)
• IBM True North Multi-Chip Neuromorphic System
Internet of Things (???)
Thank you!
Do not miss....
22nd IEEE International Symposium
on Asynchronous Circuits and
Systems (ASYNC)
MAY 8 - 11, 2016
PORTO ALEGRE, BRAZIL
Homepage:
http://www.inf.pucrs.br/async2016
We hope to see you there!