SER estimation of SRAM-based FPGAs
Analytical Approach for Soft Error Rate
Estimation of SRAM-Based FPGAs
Ghazanfar (Hossein) Asadi
Test & Reliability Group (TRG)
Department of Electrical & Computer Engineering
Northeastern University
2004 MAPLD/221
Asadi
Outline
Problem Statement & Motivation
Soft Errors Background & Previous work
Error Models in FPGAs
SER Estimation
Experimental Results
Summary & conclusions
Problem Statement
Estimating the soft error rate of FPGAs for a given mapped design:
The probability of system failure due to soft errors
The mean time for a corrupted configuration bit to manifest at primary outputs or flip-flops
Motivation
Need for soft error rate estimation
Exponential growth of vulnerable bits due to Moore’s law
High cost of error-tolerant schemes
To make appropriate cost/reliability trade-offs: where to put redundancy
Why an analytical method?
Previous work: Fault Injection
Time-consuming / Incomplete / Expensive
Needs physical prototype board
Cannot be used in design phases
Background: Error Definitions
Soft Errors:
Intermittent malfunctions of the hardware
Not reproducible
Energetic particles → Single Event Upsets (SEUs) → soft errors → (may cause) system failure
Previous Work
Based on Fault Injection (FI)
1. Inject fault
2. Run several workloads
3. Compare results with fault-free circuit
Exhaustive FI is very time-consuming, so only some candidate locations are selected for FI and the analysis is based on statistics
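The three fault-injection steps can be illustrated with a minimal sketch. The two-gate netlist, the choice of a stuck-at fault on node n1, and the function names are all hypothetical and only serve to show the inject / run / compare loop; they are not taken from any of the cited tools.

```python
import random

def circuit(a, b, c, stuck=None):
    """Hypothetical netlist: n1 = a AND b, out = n1 OR c.
    `stuck` optionally forces the internal node n1 to 0 or 1 (the injected fault)."""
    n1 = a & b
    if stuck is not None:
        n1 = stuck  # step 1: inject the fault
    return n1 | c

def fault_injection(stuck_value, workloads=1000, seed=0):
    """Estimate how often the injected fault changes the output."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(workloads):
        # step 2: run a (random) workload
        a, b, c = (rng.randint(0, 1) for _ in range(3))
        # step 3: compare against the fault-free circuit
        if circuit(a, b, c) != circuit(a, b, c, stuck=stuck_value):
            mismatches += 1
    return mismatches / workloads
```

For this toy circuit the exact answer is known (a stuck-at-1 on n1 is visible when a AND b is 0 and c is 0, i.e. with probability 3/8 under uniform inputs), which makes the statistical nature of the estimate easy to see.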
Previous Work (Cont.)
Radiation-based fault injection
Expensive & not commonly used
Needs physical implementation
Can damage the prototype board (hard error)
Simulation-based fault injection
Cannot be used during design phases
Bit-stream alteration
Needs physical implementation
Bridging errors may lead to hard errors
Error Models in FPGAs
Memory resources:
User bits: flip-flops, RAMs, …
Configuration bits: mux select bits, LUT bits, …
User bits → transient errors
Configuration bits → permanent errors
Error Models in FPGAs (Cont.)
[Figure: SEU (bit flip) effects in a Xilinx Virtex device, showing configuration memory cells (M), a LUT with inputs F1-F4, a flip-flop (ff), BlockRAM, and routing lines E1-E3. © Lima (DAC03)]
Bit flip in a configuration memory cell: permanent error, corrected by reconfiguration
Bit flip in a user bit: transient error, can be corrected at the next load
Short or open circuit in routing: corrected by reconfiguration
Error Models in FPGAs (Cont.)
Transient errors
User flip-flops, Logic gates, Block RAMs
Permanent errors (all configuration bits)
Routing:
MUX select bits
PIP: Short/Open
Buffer: On/Off
LUT
Control/Clocking Bits
Error Models in FPGAs (Cont.)
Only permanent errors are considered
Configuration bits comprise more than:
99% of all memory elements excluding RAM blocks
95% of all memory elements including RAM blocks

Device     # of Config. Bits   # of Flip-Flops   Ratio
XCV50      559,200             3,996             99%
XCV400     2,546,048           22,812            99%
XCV800     4,715,616           43,872            99%
XCV1000    6,127,744           56,832            99%
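The ratio column can be checked with a few lines of arithmetic. This sketch assumes the ratio is configuration bits divided by configuration bits plus flip-flops (RAM blocks excluded); the slide does not state the formula, so that definition and the function name are assumptions.

```python
# Device counts copied from the table above.
DEVICES = {
    "XCV50": (559_200, 3_996),
    "XCV400": (2_546_048, 22_812),
    "XCV800": (4_715_616, 43_872),
    "XCV1000": (6_127_744, 56_832),
}

def config_ratio(config_bits, flip_flops):
    """Share of configuration bits among config bits + flip-flops (assumed definition)."""
    return config_bits / (config_bits + flip_flops)

if __name__ == "__main__":
    for name, (cfg, ffs) in DEVICES.items():
        print(f"{name}: {100 * config_ratio(cfg, ffs):.2f}%")
```

Under that definition every listed device comes out above 99%, in line with the table.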
SER Estimation
Traversing structural paths [Asadi04]
From fault sites to POs
[Figure: propagation of an SEU from the fault site to POs and FFs. On-path signals are drawn as thick lines, off-path signals as thin lines]
SER Estimation in ASIC Designs
S(n): System failure probability (SFP) vector
Si: SFP given node i erroneous
n: total fault sites
Experiments on ISCAS89 show that:
Three orders of magnitude faster than random-input simulation
Average accuracy: 97%
FPGA vs. ASIC in SER Estimation
ASIC: transient error
Only requires propagation probability
FPGA: both transient & permanent errors
Transient errors: the same as in ASICs (propagation probability only)
Permanent errors: need activation probability as well
Nodes with different error rates in FPGAs
Fault sites: all nodes
[Figure: example netlist with nodes n1 and n2 and nets A, B, C, annotated with the configuration bit values]
SER Estimation of FPGAs: Steps
Compute permanent error rates for all nodes
PRi : the permanent error rate of node i
Compute netlist failure probability vector
Ni= failure prob. given node i erroneous
System failure rate vector: $S = PR \cdot N$ (element by element), i.e. $S_i = PR_i \, N_i$ for $i = 1, \dots, n$
n: total number of fault sites
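A minimal sketch of the last step, assuming PR and N are plain Python lists of equal length n. The function name and the numeric values in the usage note are illustrative only.

```python
def system_failure_vector(pr, n):
    """Element-wise product S_i = PR_i * N_i over all fault sites."""
    if len(pr) != len(n):
        raise ValueError("PR and N must have one entry per fault site")
    return [pr_i * n_i for pr_i, n_i in zip(pr, n)]
```

For example, `system_failure_vector([0.06, 0.02], [0.5, 0.25])` scales each node's permanent error rate by the probability that an error at that node becomes a failure.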
How to Compute Ni?
Open & stuck-at errors:
$N_i = SP_i \, PP_i(0) + (1 - SP_i) \, PP_i(1) = PP_i$
Bridging wired-AND error (nets i and j):
$N_i = SP_i (1 - SP_j) \, PP_i(0) + (1 - SP_i) \, SP_j \, PP_j(0)$
Bridging wired-OR error (nets i and j):
$N_i = SP_i (1 - SP_j) \, PP_j(1) + (1 - SP_i) \, SP_j \, PP_i(1)$
PP_i: propagation probability (the method used for ASICs)
SP: signal probability, used as the activation probability
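The three expressions for N_i translate directly into code. This is a sketch under stated assumptions: SP values are the probabilities that a net is 1, and `pp_x0` / `pp_x1` stand for PP_x(0) and PP_x(1), read here as the propagation probability of net x when its erroneous value is 0 or 1. That reading and all function names are assumptions, not the authors' implementation.

```python
def n_open_or_stuck(sp_i, pp_i0, pp_i1):
    """Open / stuck-at error on net i: SP_i*PP_i(0) + (1-SP_i)*PP_i(1)."""
    return sp_i * pp_i0 + (1 - sp_i) * pp_i1

def n_wired_and(sp_i, sp_j, pp_i0, pp_j0):
    """Wired-AND bridge: the net holding 1 is pulled to 0 when the two nets differ."""
    return sp_i * (1 - sp_j) * pp_i0 + (1 - sp_i) * sp_j * pp_j0

def n_wired_or(sp_i, sp_j, pp_i1, pp_j1):
    """Wired-OR bridge: the net holding 0 is pulled to 1 when the two nets differ."""
    return sp_i * (1 - sp_j) * pp_j1 + (1 - sp_i) * sp_j * pp_i1
```

In both bridging cases the error is only activated when the two nets carry different values, which is why each term pairs SP of one net with (1 - SP) of the other.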
How to Compute PRi?
PR(n): permanent error rate vector
$PR_i = r \cdot f$
r: raw error rate of an SRAM cell
f: number of all possible errors at node i
n: total number of fault sites
Example: $PR_{AB} = 6r$
[Figure: nets A and B with the configuration bit values used in the example]
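A small sketch of the permanent error rate, PR_i = r · f. The default r = 0.01 FIT/bit is the value quoted in the experimental setup of this talk; the function name is hypothetical.

```python
R_FIT_PER_BIT = 0.01  # raw SRAM cell error rate used in the experimental setup

def permanent_error_rate(f, r=R_FIT_PER_BIT):
    """PR_i = r * f, where f is the number of possible errors at node i."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return r * f
```

With f = 6, as in the PR_AB example, this gives 6r, i.e. 0.06 FIT at r = 0.01 FIT/bit.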
System Failure Rate
For the first clock:
$SFR = 1 - \prod_{i=1}^{n} (1 - S_i)$
The same probability is valid for the next clock cycles, so for c clock cycles:
$SFR = 1 - \prod_{i=1}^{n} \left(1 - PR_i \left(1 - (1 - N_i)^c\right)\right)$
c: number of clock cycles during which the state of the circuit is checked after the particle hit
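The two expressions can be sketched as follows, assuming the c-cycle form SFR = 1 - ∏(1 - PR_i (1 - (1 - N_i)^c)), which reduces to the first-clock form for c = 1 because S_i = PR_i N_i. The PR_i values are treated here as probabilities in [0, 1]; that interpretation and the function names are assumptions of this sketch.

```python
import math

def sfr_first_clock(s):
    """SFR = 1 - prod(1 - S_i) over all fault sites."""
    return 1 - math.prod(1 - s_i for s_i in s)

def sfr_after_cycles(pr, n, c):
    """SFR after c clock cycles.
    1 - (1 - N_i)**c is the chance that a permanent error at node i
    has reached an output at least once within c cycles."""
    return 1 - math.prod(
        1 - pr_i * (1 - (1 - n_i) ** c) for pr_i, n_i in zip(pr, n)
    )
```

As c grows, each (1 - N_i)^c term shrinks towards zero for any N_i > 0, so the result approaches 1 - ∏(1 - PR_i): a permanent error that can propagate at all eventually does.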
Error List
Mux-open
PIP open
Buffer off
A bit-flip in LUT
Control bit-flip
Experimental Setup
Xilinx Virtex 300 (XCV300)
Xilinx Design Language (XDL)
Benchmark: some ISCAS89 circuits
r = raw failure rate for an SRAM cell: r = 0.01 FIT/bit
1000 clocks executed for each SEU
Platform: Sun Solaris Ultra-10, 256 MB main memory
Results: Sensitive Bits
Number of sensitive SRAM bits for each part

Circuit            s27    s298   s344   s349   s382   s386
Routing            64     459    536    650    807    714
LUT                68     418    392    520    712    660
Control/Clocking   40     140    168    187    207    160
Total              172    1017   1096   1357   1726   1534
Results: Manifestation Time
Mean Time To Manifest (MTTM) errors to outputs

Circuit            s27    s298   s344   s349   s382   s386
Routing            2.07   2.86   2.58   2.91   3.30   3.82
LUT                14.49  20.75  17.33  20.48  22.08  30.07
Control/Clocking   1.18   1.31   1.36   1.40   1.40   1.77

(Results are in terms of clock cycles)
Results: SFR & Estimation Time
System Failure Rate & Estimation Time

Circuit            s27    s298   s344   s349   s382   s386
SFR (FIT)          1.71   9.87   9.99   12.77  16.04  12.11
SP Time (sec)      0.15   0.76   0.91   1.09   1.25   1.05
SFR Time (sec)     0.02   0.09   0.13   0.14   0.19   0.25
Total Time (sec)   0.17   0.85   1.04   1.23   1.44   1.30

Number of clock cycles: 1000
SP Time: signal probability computation time
SFR Time: system failure rate computation time
Summary & Conclusions
A new approach for SER estimation of SRAM-based FPGAs
No physical implementation required
Can be used in early design stages
Very fast estimation time
Can cover all possible faults
Mean Time To Manifest errors to outputs:
MTTM(control/clocking) < MTTM(routing) < MTTM(LUT)
Appendix & Backup
Background: Soft Error Origin
The main sources in terrestrial conditions: alpha particles & neutrons
A soft error occurs if the hitting particles generate more charge than Qcrit
Critical charge (Qcrit): the minimum charge needed to flip the value stored in the cell
Exp. Increase of Soft Errors
[Figure: exp(-Qcrit/Qs) trend with technology scaling, plotted for SRAM and latch cells from 600 nm down to 50 nm on a vertical axis from 0 to 1 (Shivakumar, DSN 2002)]
• Qcrit: the critical charge (depends on characteristics of the circuit)
• Qs: the charge collection efficiency of a particle strike on the device
Particles of lower energies occur far more frequently
Background: Definitions
How to express Soft Error Rate (SER):
MTBF (Mean Time Between Failures)
FIT (Failure-in-Time): 1 failure in a billion hours
1 year MTBF = 114,155 FIT
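The conversion between the two units is simple arithmetic. This sketch assumes a 365-day year (8,760 hours); the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a 365-day year

def mtbf_hours_to_fit(mtbf_hours):
    """FIT = expected failures per 10**9 hours = 10**9 / MTBF (in hours)."""
    return 1e9 / mtbf_hours

def fit_to_mtbf_years(fit):
    """Inverse conversion: MTBF in years for a given FIT rate."""
    return 1e9 / fit / HOURS_PER_YEAR
```

`mtbf_hours_to_fit(HOURS_PER_YEAR)` evaluates to roughly 114,155, matching the figure quoted above.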
Background: Definitions
Failure definition:
(a) Propagation of an erroneous value to at least one flip-flop or primary output
or
(b) Propagation of an erroneous value to at least one primary output
Definition (a) is compatible with (b) if there is no redundant flip-flop in the circuit
Failure Error Rate of LUT
LUT treated as a complex gate, to reduce the number of nodes
AP(t_x): the probability that O = t_x
[Figure: LUT with inputs F1-F4, table entries t0 to t15, and output O]
LUT failure rate:
$S_O = [AP(t_0) + AP(t_1) + \dots + AP(t_{15})] \cdot r \cdot N_O = r \cdot N_O$
since the probabilities AP(t_x) sum to 1
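The simplification to r · N_O rests on the sixteen entry probabilities summing to 1. A minimal sketch, assuming each AP(t_x) is the probability that the four LUT inputs select entry t_x and that the inputs are statistically independent; both the independence assumption and the function names are illustrative only.

```python
from itertools import product

def activation_probabilities(input_sps):
    """AP(t_x) for every LUT entry, given the signal probability of each input."""
    aps = []
    for bits in product((0, 1), repeat=len(input_sps)):
        p = 1.0
        for bit, sp in zip(bits, input_sps):
            p *= sp if bit else 1.0 - sp
        aps.append(p)
    return aps

def lut_failure_rate(input_sps, r, n_o):
    """S_O = [sum of AP(t_x)] * r * N_O, which collapses to r * N_O."""
    return sum(activation_probabilities(input_sps)) * r * n_o
```

Whatever signal probabilities are supplied for F1 to F4, the sum over all entries is 1, so the LUT can be handled as a single node with failure rate r · N_O.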
Xilinx Virtex FPGA Model
[Figure: Xilinx Virtex FPGA model, showing CLBs (logic blocks), IO muxes, switch matrices (SM), line segments, and IOBs]
CLB Architecture
Error Models in FPGAs (Cont.)
Config. bits:
Care bits: all 1s and some of the 0s
Don't-care bits: some of the 0s
Error Models: PIP Short/Open
1 → 0 bit flip: causes an open
0 → 1 bit flip: may cause a short or bridging error
Stuck-open: permanently OFF
Stuck-closed: permanently ON
[Figure: stuck-open and stuck-closed errors in a switch matrix with terminals N1-N3, E1-E3, W1-W3, and S1-S3, including the resulting bridging errors]
Error Models (Cont.)
Buffer on/off
Tri-state buffers
Used in IOBs
[Figure: buffer-on and buffer-off errors caused by a flipped configuration bit]
Look-Up Table
[Figure: bit flip in the contents of a LUT with inputs F1-F4 and output O]