SER estimation of SRAM-based FPGAs


Analytical Approach for Soft Error Rate
Estimation of SRAM-Based FPGAs
Ghazanfar (Hossein) Asadi
Test & Reliability Group (TRG)
Department of Electrical & Computer Engineering
Northeastern University
2004 MAPLD/221
Outline

Problem Statement & Motivation

Soft Errors Background & Previous work

Error Models in FPGAs

SER Estimation

Experimental Results

Summary & conclusions
Problem Statement

Estimating soft error rate in FPGAs

  The probability of system failure
    For a given mapped design
    Due to soft errors
  Mean time to manifest a corrupted configuration bit
    To primary outputs or flip-flops
Motivation

Need for soft error rate estimation
  Exponential growth of vulnerable bits due to Moore’s law
  High cost of error-tolerant schemes
  To make appropriate cost/reliability trade-offs
    Where to put redundancy

Why an analytical method?
  Previous work: fault injection
    Time-consuming / incomplete / expensive
    Needs physical prototype board
    Cannot be used in design phases
Background: Error Definitions

Soft Errors:
  Intermittent malfunctions of the hardware
  Not reproducible

Energetic Particles → Single Event Upsets (SEUs) → Soft Errors → (may cause) System Failure
Previous Work

Based on Fault Injection (FI)
1. Inject fault
2. Run several workloads
3. Compare results with fault-free circuit

Exhaustive FI is very time-consuming
  Select some candidate locations for FI
  Analysis based on statistics
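The three-step fault-injection loop (inject, run workloads, compare) can be illustrated on a toy circuit. This is a minimal sketch and not the tooling used in the cited work: the 2-input LUT, the `eval_lut` helper, and the use of all input vectors as the "workloads" are assumptions made for illustration.

```python
from itertools import product

def eval_lut(truth_table, inputs):
    """Evaluate a LUT: the input bits form the index into the stored bits."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return truth_table[index]

def fault_injection(truth_table, n_inputs):
    """Flip each stored bit in turn and count the input vectors whose output differs."""
    workloads = list(product([0, 1], repeat=n_inputs))
    mismatches = {}
    for site in range(len(truth_table)):
        faulty = list(truth_table)
        faulty[site] ^= 1  # step 1: inject the fault (bit flip)
        # steps 2 and 3: run every workload and compare with the fault-free LUT
        mismatches[site] = sum(
            eval_lut(faulty, w) != eval_lut(truth_table, w) for w in workloads
        )
    return mismatches

# Hypothetical 2-input AND gate stored as a 4-entry truth table.
and_lut = [0, 0, 0, 1]
result = fault_injection(and_lut, 2)
print(result)
```

Even in this tiny case the cost is (number of fault sites) × (number of workloads), which is why exhaustive fault injection does not scale to millions of configuration bits.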
Previous Work (Cont.)

Radiation-based fault injection
  Expensive & not commonly used
  Needs physical implementation
  Can damage prototype board → Hard error

Simulation-based fault injection
  Cannot be used during design phases
  Bit-stream alteration
  Needs physical implementation
  Bridging errors may lead to hard errors
Error Models in FPGAs

Memory resources:
  User bits
    Flip-flops, RAMs, …
  Configuration bits
    Mux select bits, LUT bits, …

User bits → Transient errors
Config. bits → Permanent errors
Error Models in FPGAs (Cont.)

[Figure: Virtex (Xilinx) logic block with a LUT (inputs F1-F4), a flip-flop (ff) with clk, a BlockRAM, routing lines E1-E3, and configuration memory cells (M), showing an SEU (bit flip) in a configuration memory cell. © Lima (DAC03)]

Figure annotations:
  Bit flip: permanent error, corrected by reconfiguration
  Bit flip: transient error, can be corrected at the next load
  Short or open circuit: corrected by reconfiguration
Error Models in FPGAs (Cont.)

Transient errors
  User flip-flops, logic gates, Block RAMs

Permanent errors (all configuration bits)
  Routing:
    MUX select bits
    PIP: Short/Open
    Buffer: On/Off
  LUT
  Control/Clocking bits
Error Models in FPGAs (Cont.)

Only permanent errors considered

Conf. bits comprise more than
  99% of all memory elements excluding RAM blocks
  95% of all memory elements including RAM blocks

Device     # of Config. Bits   # of Flip-Flops   Ratio
XCV50                559,200             3,996     99%
XCV400             2,546,048            22,812     99%
XCV800             4,715,616            43,872     99%
XCV1000            6,127,744            56,832     99%
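The 99% figure can be checked directly from the device counts. The sketch below assumes the ratio is configuration bits divided by configuration bits plus flip-flops (RAM blocks excluded); the slide does not state the formula, so that reading is an assumption.

```python
# Device data copied from the slide: (configuration bits, flip-flops).
DEVICES = {
    "XCV50": (559_200, 3_996),
    "XCV400": (2_546_048, 22_812),
    "XCV800": (4_715_616, 43_872),
    "XCV1000": (6_127_744, 56_832),
}

def config_bit_share(config_bits, flip_flops):
    """Fraction of memory elements that are configuration bits (RAM blocks excluded)."""
    return config_bits / (config_bits + flip_flops)

shares = {name: config_bit_share(*counts) for name, counts in DEVICES.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Under this reading every listed device comes out slightly above 99%, in line with the Ratio column.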
SER Estimation

Traversing structural paths [Asadi04]
  From fault sites to POs

[Figure: circuit with an SEU fault site, a flip-flop (FF), and primary outputs (PO). On-path signals are drawn as thick lines, off-path signals as thin lines.]
SER Estimation in ASIC Designs

S(n): System failure probability (SFP) vector
  Si: SFP given node i erroneous
  n: total fault sites

Experiments on ISCAS89 show that:
  Three orders of magnitude faster
    Compared to random-input simulation
  Average accuracy: 97%
FPGA vs. ASIC in SER Estimation

ASIC: transient error
  Only requires propagation probability

FPGA: both transient & permanent errors
  Transient errors: the same
  Permanent errors: need activation as well

Nodes with different error rates in FPGAs
  Fault sites: all nodes

[Figure: example with nodes n1 and n2 and nets A, B, and C]
SER Estimation of FPGAs: Steps

Compute permanent error rates for all nodes
  PRi: the permanent error rate of node i
  n: total number of fault sites

Compute netlist failure probability vector
  Ni = failure prob. given node i erroneous

System failure rate vector (S) = PR × N
  Si = PRi × Ni
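The last step is an element-wise product over the fault sites. A minimal sketch, with made-up values for three fault sites (the numbers and the function name `system_failure_vector` are illustrative assumptions, not data from the slides):

```python
def system_failure_vector(pr, n):
    """Element-wise product S_i = PR_i * N_i over all fault sites."""
    if len(pr) != len(n):
        raise ValueError("PR and N must have one entry per fault site")
    return [pr_i * n_i for pr_i, n_i in zip(pr, n)]

# Hypothetical values for three fault sites.
PR = [0.06, 0.02, 0.04]  # permanent error rate of each node
N = [0.5, 1.0, 0.25]     # failure probability given that the node is erroneous
S = system_failure_vector(PR, N)
print(S)
```

A node with a high error rate but a low failure probability can contribute less to S than a rarely hit node whose errors always reach an output.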
How to Compute Ni?

Open & stuck-at errors:
  Ni = [SPi × PPi(0) + (1-SPi) × PPi(1)] = PPi
  PPi: Propagation prob. (the method used for ASIC)
  SP: Signal probability is used for activation prob.

Bridging wired-AND error (nets i and j):
  Ni = [SPi(1-SPj)PPi(0)] + [(1-SPi) SPjPPj(0)]

Bridging wired-OR error (nets i and j):
  Ni = [SPi(1-SPj)PPj(1)] + [(1-SPi) SPjPPi(1)]
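The three Ni expressions translate directly into small functions. In this sketch PPi(v) is read as the propagation probability of net i when the erroneous value on it is v; that reading, the function names, and the sample probabilities are assumptions for illustration.

```python
def n_open_or_stuck(sp_i, pp_i0, pp_i1):
    """Open / stuck-at error on net i: weight each erroneous value by its activation."""
    return sp_i * pp_i0 + (1 - sp_i) * pp_i1

def n_bridge_and(sp_i, sp_j, pp_i0, pp_j0):
    """Wired-AND bridge: the net that should carry 1 is pulled to 0."""
    return sp_i * (1 - sp_j) * pp_i0 + (1 - sp_i) * sp_j * pp_j0

def n_bridge_or(sp_i, sp_j, pp_i1, pp_j1):
    """Wired-OR bridge: the net that should carry 0 is pulled to 1."""
    return sp_i * (1 - sp_j) * pp_j1 + (1 - sp_i) * sp_j * pp_i1

# Hypothetical signal and propagation probabilities.
sp_i, sp_j, pp = 0.5, 0.25, 0.8
stuck = n_open_or_stuck(sp_i, pp, pp)
wired_and = n_bridge_and(sp_i, sp_j, pp, pp)
wired_or = n_bridge_or(sp_i, sp_j, pp, pp)
print(stuck, wired_and, wired_or)
```

A bridging error is only activated when the two nets carry different values, so its Ni is scaled down by the probability of that disagreement.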
How to Compute PRi?


PR(n): permanent error rate vector
PRi = r × f
  r: raw error rate of an SRAM cell
  f: number of all possible errors at node i
  n: total number of fault sites

Example: PR_AB = 6 × r
[Figure: nets A and B with the values of the associated configuration bits]
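As a quick numeric illustration of PRi = r × f, the sketch below plugs in f = 6 from the A/B example and the raw rate r = 0.01 FIT/bit quoted on the Experimental Setup slide. Combining those two numbers is an assumption made here for illustration; the slides do not report this product.

```python
R_FIT_PER_BIT = 0.01  # raw SRAM cell error rate from the Experimental Setup slide

def permanent_error_rate(r, f):
    """PR_i = r * f, where f is the number of possible errors at node i."""
    return r * f

# f = 6 for the nets A and B in the slide's example.
pr_ab = permanent_error_rate(R_FIT_PER_BIT, 6)
print(f"PR_AB is about {pr_ab:.2f} FIT")
```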
System Failure Rate

For the first clock:

$$\mathrm{SFR} = 1 - \prod_{i=1}^{n} (1 - S_i)$$

  The same probability is valid for the next clock cycles

For c clock cycles:

$$\mathrm{SFR} = 1 - \prod_{i=1}^{n} \left(1 - PR_i \times \left(1 - (1 - N_i)^{c}\right)\right)$$

  c: Number of clocks checking the state of the circuit after particle hit
  With c = 1 this reduces to the first-clock expression, since Si = PRi × Ni
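Both SFR expressions are products over the fault sites and are easy to evaluate. The sketch assumes the c-cycle formula is read as SFR = 1 − ∏(1 − PRi × (1 − (1 − Ni)^c)), which agrees with the first-clock form at c = 1 because Si = PRi × Ni. The input values are hypothetical and PRi is treated as a probability here.

```python
import math

def sfr_first_clock(s):
    """SFR = 1 - prod(1 - S_i) over all fault sites."""
    return 1 - math.prod(1 - s_i for s_i in s)

def sfr_after_cycles(pr, n, c):
    """SFR over c clock cycles: an error at node i manifests with prob. 1 - (1 - N_i)^c."""
    return 1 - math.prod(
        1 - pr_i * (1 - (1 - n_i) ** c) for pr_i, n_i in zip(pr, n)
    )

# Hypothetical two-node example.
PR = [0.1, 0.2]
N = [0.5, 1.0]
S = [pr_i * n_i for pr_i, n_i in zip(PR, N)]

sfr_1 = sfr_first_clock(S)
sfr_c1 = sfr_after_cycles(PR, N, 1)
sfr_c2 = sfr_after_cycles(PR, N, 2)
print(sfr_1, sfr_c1, sfr_c2)
```

As c grows, (1 − Ni)^c goes to zero for any node with Ni > 0, so the SFR approaches 1 − ∏(1 − PRi): every activated permanent error eventually manifests.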
Error List





Mux-open
PIP open
Buffer off
A bit-flip in LUT
Control bit-flip
Experimental Setup




Xilinx Virtex 300 (XCV300)
Xilinx Design Language (XDL)
Benchmark: some ISCAS89 circuits
r = raw failure rate for an SRAM cell
  r = 0.01 FIT/bit
1000 clocks executed for each SEU
Platform: Sun Solaris Ultra-10
  256 MB main memory
Results: Sensitive Bits
Number of sensitive SRAM bits for each part

Circuit             s27    s298    s344    s349    s382    s386
Routing              64     459     536     650     807     714
LUT                  68     418     392     520     712     660
Control/Clocking     40     140     168     187     207     160
Total               172    1017    1096    1357    1726    1534
Results: Manifestation Time
Mean Time To Manifest (MTTM) errors to outputs
(Results are in terms of cycles)

Circuit             s27    s298    s344    s349    s382    s386
Routing            2.07    2.86    2.58    2.91    3.30    3.82
LUT               14.49   20.75   17.33   20.48   22.08   30.07
Control/Clocking   1.18    1.31    1.36    1.40    1.40    1.77
Results: SFR & Estimation Time
System Failure Rate & Estimation Time

Circuit             s27    s298    s344    s349    s382    s386
SFR (FIT)          1.71    9.87    9.99   12.77   16.04   12.11
SP Time (sec)      0.15    0.76    0.91    1.09    1.25    1.05
SFR Time (sec)     0.02    0.09    0.13    0.14    0.19    0.25
Total Time (sec)   0.17    0.85    1.04    1.23    1.44    1.30

Number of clock cycles: 1000
SP Time: Signal Probability computation time
SFR Time: System Failure Rate computation time
Summary & Conclusions

A new approach for SER estimation
  For SRAM-based FPGAs

No physical implementation required
  Can be used in early design stages
  Very fast simulation time
  Can cover all possible faults

Mean Time To Manifest errors to outputs:
  MTTM(Control/clocking) < MTTM(routing)
  MTTM(routing) < MTTM(LUT)
Appendix & Backup
Background: Soft Error Origin

The main sources in terrestrial conditions:
  Alpha particles & neutrons

Soft error occurs:
  If hitting particles generate more charge than Qcrit

Critical Charge (Qcrit):
  The minimum charge needed to flip the value stored in the cell
Exp. Increase of Soft Errors
[Figure: e^(-Qcrit/Qs) trend with technology scaling (Shivakumar, DSN 2002). Two curves, SRAM: exp(-Qcrit/Qs) and Latch: exp(-Qcrit/Qs), on a vertical axis from 0 to 1, plotted against technology nodes of 600, 350, 250, 180, 130, 100, 70, and 50 nm.]

• Qcrit: the critical charge (depends on characteristics of the circuit)
• Qs: the charge collection efficiency of a particle strike on the device

Particles of lower energies occur far more frequently
Background: Definitions

How to express Soft Error Rate (SER)
  MTBF (Mean Time Between Failures)
  FIT (Failure-in-Time)
    1 FIT = 1 failure in a billion hours
    1 year MTBF = 114,155 FIT
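The 114,155 figure follows from dividing one billion hours by the hours in a year. A small sketch of the conversion, assuming a 365-day year (the helper names are illustrative):

```python
HOURS_PER_YEAR = 24 * 365      # 8,760 hours, ignoring leap years
FIT_HOURS = 1_000_000_000      # FIT counts failures per 10^9 hours

def mtbf_hours_to_fit(mtbf_hours):
    """Convert a mean time between failures (hours) to a FIT rate."""
    return FIT_HOURS / mtbf_hours

def fit_to_mtbf_years(fit):
    """Convert a FIT rate back to an MTBF expressed in years."""
    return FIT_HOURS / fit / HOURS_PER_YEAR

one_year_fit = mtbf_hours_to_fit(HOURS_PER_YEAR)
print(int(one_year_fit))
```

The same helpers give a feel for the results table: a design at 10 FIT, for example, corresponds to an MTBF of roughly 11,400 years.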
Background: Definitions

Failure definition:
  (a) Propagation of an erroneous value to at least one flip-flop or primary output
  or
  (b) Propagation of an erroneous value to at least one primary output

Definition (a) is compatible with (b)
  If there is no redundant flip-flop in the circuit
Failure Error Rate of LUT

To reduce number of nodes
  LUT as a complex gate

P(tx): the probability of O=tx

LUT failure rate
  SO = [AP(t0)+AP(t1)+…+AP(t15)] × r × NO = r × NO

[Figure: 4-input LUT with inputs F1-F4, stored entries t0-t15, and output O]
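The collapse of the bracketed sum can be spelled out in a short derivation. It assumes, as the slide's result implies, that AP(t_x) is the probability that the LUT inputs address entry t_x, so the sixteen cases are mutually exclusive and their probabilities sum to one.

```latex
\begin{aligned}
S_O &= \sum_{x=0}^{15} AP(t_x)\, r\, N_O \\
    &= r\, N_O \sum_{x=0}^{15} AP(t_x) \\
    &= r\, N_O \qquad \text{since } \sum_{x=0}^{15} AP(t_x) = 1 .
\end{aligned}
```

Under that assumption the LUT can be treated as a single node with rate r at its output O, instead of sixteen separate fault sites.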
Xilinx Virtex FPGA Model
[Figure: Xilinx Virtex FPGA model showing CLBs (logic blocks), IO muxes, switch matrices (SM), line segments, and IOBs]
CLB Architecture
Error Models in FPGAs (Cont.)

Config. bits:
  Care bits
    All 1s
    Some of 0s
  Don’t care bits
    Some of 0s
Error Models: PIP Short/Open


1→0: causes open
0→1: may cause short or bridging error

Stuck-open: permanently OFF
Stuck-closed: permanently ON

[Figure: switch matrix with north (N1-N3), east (E1-E3), west (W1-W3), and south (S1-S3) lines, illustrating stuck-open and stuck-closed errors and the resulting bridging errors]
Error Models (Cont.)

Buffer on/off
  Tri-state buffers
  Used in IOBs

[Figure: tri-state buffer in the "buffer on" and "buffer off" states]

Look-Up Table

[Figure: LUT with inputs F1-F4, stored bit values, and output O]