Intermittent Faults: Frequency, Causes and Models


Layali Rashid




◦ Errors that occur in bursts, at the same location, when the fault is activated [WDSN07].
◦ Faults which occur frequently and irregularly for a period of time [ASPLOS08].
◦ A persistent defect that causes zero or more failures, such as a speck of conductive dust partially bridging two traces [EuroSys11].
◦ Bursts of errors that recur nondeterministically [Layali].




257 servers for 1.2 years.
Memory SBE rate:

Number of Errors    Frequency (%)
0                   47.5
1-5                 31.5
6-99                13.3
100-1000            5.8
>1000               1.9

6.2% of the memory subsystems were affected by ifaults.
Processor buses in 2 servers had 15 to 7104 SBE bursts.
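As a quick sanity check on the memory SBE table, the sketch below sums the bins and derives the share of servers that saw at least one SBE. It assumes that "Frequency (%)" is the share of the 257 servers falling in each bin; the variable names are illustrative and not from the slides.

```python
# Share of servers (%) per bin of observed memory SBE counts, copied from the table.
# Assumption: "Frequency (%)" means the percentage of servers in each bin.
sbe_share = {
    "0": 47.5,
    "1-5": 31.5,
    "6-99": 13.3,
    "100-1000": 5.8,
    ">1000": 1.9,
}

total = sum(sbe_share.values())
at_least_one = total - sbe_share["0"]
heavy = sbe_share["100-1000"] + sbe_share[">1000"]

print(f"bins sum to {total:.1f}%")
print(f"servers with at least one SBE: {at_least_one:.1f}%")
print(f"servers with 100 or more SBEs: {heavy:.1f}%")
```

Under that reading, a small group of servers accounts for the very high error counts, which is consistent with errors arriving in bursts rather than uniformly.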

Rates of failures (regardless of the error type).

Failure                   P(1 fail)   P(2 fail | 1 fail)   P(3 fail | 2 fail)
CPU (5 working days)      0.3%        30%                  56%
CPU (30 working days)     0.5%        34%                  59%
DRAM (5 working days)     0.03%       11%                  46%
DRAM (30 working days)    0.05%       8%                   50%
Disk (5 working days)     0.2%        29%                  53%
Disk (30 working days)    0.3%        29%                  59%
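The conditional probabilities in the failure-rate table can be chained to estimate how likely a machine is to reach two or three failures. The sketch below is a rough illustration that assumes the events are nested (a machine with three failures also counts as having had two and one); the function and dictionary names are hypothetical.

```python
# (P(1 fail), P(2 fail | 1 fail), P(3 fail | 2 fail)) from the table, as fractions.
rates = {
    "CPU (5 working days)":   (0.003,  0.30, 0.56),
    "CPU (30 working days)":  (0.005,  0.34, 0.59),
    "DRAM (5 working days)":  (0.0003, 0.11, 0.46),
    "DRAM (30 working days)": (0.0005, 0.08, 0.50),
    "Disk (5 working days)":  (0.002,  0.29, 0.53),
    "Disk (30 working days)": (0.003,  0.29, 0.59),
}

def joint_probabilities(p1, p2_given_1, p3_given_2):
    """Chain rule under the nesting assumption: P(2) = P(1)*P(2|1), P(3) = P(2)*P(3|2)."""
    p_two = p1 * p2_given_1
    p_three = p_two * p3_given_2
    return p_two, p_three

for name, (p1, p21, p32) in rates.items():
    p_two, p_three = joint_probabilities(p1, p21, p32)
    # The last column shows how much a first failure raises the odds of a second.
    print(f"{name:24s} P(2 fail)={p_two:.6f}  P(3 fail)={p_three:.6f}  "
          f"P(2|1)/P(1)={p21 / p1:.0f}x")
```

The ratio in the last column is the point of the table: once a component has failed, a further failure is orders of magnitude more likely than the first one was.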



Rates of ifaults out of total failures:

Fault location    Rate of ifaults
CPU               39%
DRAM              19%
Disk              39%

When do ifaults recur?

Fault location    Recur within 10 days    Recur within a month
CPU               84%                     97%
Disk              86%                     99%

How many times does an ifault recur?
◦ MTTF decreases as more failures occur.
◦ Not exponentially distributed.
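To see why a shrinking MTTF rules out an exponential model: with exponentially distributed times between failures, the expected wait for the next failure is the same no matter how many failures came before. The sketch below simulates that memoryless baseline. The failure rate and machine count are hypothetical values chosen for illustration, not figures from the cited studies.

```python
import random

def mean_gaps(rate, n_machines, n_failures, seed=0):
    """Simulate exponential inter-failure times and return the mean k-th gap, k = 1..n_failures."""
    rng = random.Random(seed)
    sums = [0.0] * n_failures
    for _ in range(n_machines):
        for k in range(n_failures):
            sums[k] += rng.expovariate(rate)
    return [s / n_machines for s in sums]

RATE = 0.01  # hypothetical failure rate per day, i.e. MTTF = 100 days
gaps = mean_gaps(RATE, n_machines=20000, n_failures=3)
for k, gap in enumerate(gaps, start=1):
    print(f"mean wait for failure #{k} after the previous one: {gap:.1f} days")
```

In this baseline all three means stay close to 1/RATE. Field data in which the wait shortens after each failure therefore cannot come from a single exponential distribution.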
Location            Source                                   Result in Ifault?
Wires               Electromigration                         Yes
                    Stress migration                         Not mentioned
                    Crosstalk                                Yes
Transistor          Gate oxide breakdown                     Yes
                    Hot carrier injection                    Not mentioned
                    Negative bias temperature instability    Not mentioned
Package and pins    Thermal cycling                          Not mentioned
Other               Manufacturing defects                    Yes
                    Dust                                     Yes
Gate Oxide Breakdown
[Figure, from Wikipedia: transistor cross-section showing the PolySi gate, SiO2 layer and substrate, with leakage current Ileak.]
Traps exist in SiO2 due to manufacturing defects or high voltage.
Consequences:
◦ Hard breakdown
◦ Soft breakdown
◦ ↑ Leakage current
◦ ↑ Power consumption
Possible solutions:
◦ High-k dielectric.
◦ Burn-in.
Electromigration*
◦ Thinner wires, high current density and temperature.
◦ Metal film imperfections.
Consequences:
◦ Voids → stuck opens
◦ Hillocks → stuck shorts
* University of Kiel
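The dependence of electromigration on current density and temperature is commonly summarized by Black's equation, MTTF = A * J^(-n) * exp(Ea / (k*T)). This model is not on the slide; it is brought in here only to illustrate the trend. The sketch compares lifetimes as a ratio so that the constant A cancels. The exponent n = 2 and activation energy Ea = 0.7 eV are illustrative assumptions that vary with metal and process.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def relative_mttf(j_ratio, t_ref_k, t_new_k, n=2.0, ea_ev=0.7):
    """MTTF(new) / MTTF(ref) under Black's equation.

    j_ratio is J_new / J_ref; temperatures are in kelvin.
    n and ea_ev are assumed, illustrative parameters.
    """
    current_term = j_ratio ** (-n)
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV * (1.0 / t_new_k - 1.0 / t_ref_k))
    return current_term * thermal_term

# Doubling the current density at constant temperature.
print(relative_mttf(2.0, 350.0, 350.0))
# Same current density, 20 K hotter.
print(relative_mttf(1.0, 350.0, 370.0))
```

With n = 2, doubling the current density at a fixed temperature cuts the modeled lifetime to a quarter, which is why thinner wires carrying the same current wear out faster.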

Stress Migration
Thermal stress.
◦ Growth of voids
Contributes to electromigration.
Consequences:
◦ Voids → stuck opens

Crosstalk
Major problem during layout synthesis.
Consequences:
◦ Delays and glitches

Thermal Cycling
Appears in package and die interface (e.g. solder joints).
Large cycles vs. small cycles
Consequences: ?
Other Wearout Mechanisms
Hot Carrier Injection
Negative Bias Temperature Instability


Hot Carrier Injection
Dominant reliability concern for nMOS transistors.
Happens during normal operating temperature ranges.
Consequences:
◦ Decreased drain current
◦ Slower IC
[Figure: nMOS transistor cross-section; labels: Vg, Vs, Vd, Vbs, Ig, n+, n+, p+.]
Negative Bias Temperature Instability
◦ Dominant reliability concern for pMOS.
◦ Happens at high temperature.
Consequences:
◦ Reduces Vt
◦ Reduces IC speed (~20%)
◦ Path delay error
[Figure: pMOS transistor cross-section; labels: Vg, Vs, Vd, Vbs, PolySi, SiO2, H2, H and Si atoms at the oxide interface, p+, p+, n+, Substrate.]
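A slowdown of roughly 20% turns into an error only on paths that have little timing slack. The sketch below makes that concrete under one possible reading of the slide: that "reduces IC speed (~20%)" means path delays grow by about 20%. The function name, clock period and path delays are hypothetical.

```python
def violates_timing(path_delay_ns, clock_period_ns, slowdown=0.20):
    """Return True if a path misses the clock period once its delay grows by `slowdown`."""
    aged_delay = path_delay_ns * (1.0 + slowdown)
    return aged_delay > clock_period_ns

CLOCK_NS = 1.0  # hypothetical clock period
for delay in (0.70, 0.80, 0.90):
    status = "path delay error" if violates_timing(delay, CLOCK_NS) else "still meets timing"
    print(f"fresh delay {delay:.2f} ns -> {status}")
```

Only the path that started close to the clock period fails, so aging of this kind shows up first on critical paths and may be visible only for inputs that exercise them.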
Location            Source                                   Model
Wires               Electromigration                         Short and open
                    Stress migration                         Short and open
                    Crosstalk                                Delay and glitch
Transistor          Gate oxide breakdown                     Ileakage
                    Hot carrier injection                    Path delay
                    Negative bias temperature instability    Path delay
Package and pins    Thermal cycling
Other               Manufacturing defects
                    Dust

Duration:
◦ Supply voltage fluctuation lasts from 5 to 30 cycles.
◦ Temperature effects evolve over hundreds of microseconds or milliseconds.
◦ Soft breakdown evolves over a few days, then becomes hard breakdown.
◦ S: 1x10^4 s + R: 2x10^4 s
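One way to turn these durations into a fault-injection schedule is to mark bursts of active cycles separated by fault-free gaps. The sketch below uses the 5 to 30 cycle range quoted for supply voltage fluctuations as the burst length. The exponentially distributed gap and its mean of 200 cycles are assumptions made for illustration, as are the function names.

```python
import random

def burst_schedule(n_cycles, mean_gap=200, burst_len=(5, 30), seed=0):
    """Return a list of booleans, True on cycles where the intermittent fault is active."""
    rng = random.Random(seed)
    active = [False] * n_cycles
    cycle = 0
    while cycle < n_cycles:
        # Fault-free gap of at least one cycle before the next burst (assumed distribution).
        cycle += 1 + int(rng.expovariate(1.0 / mean_gap))
        length = rng.randint(*burst_len)  # burst length in cycles, both ends inclusive
        for c in range(cycle, min(cycle + length, n_cycles)):
            active[c] = True
        cycle += length
    return active

def burst_lengths(active):
    """Lengths of consecutive runs of active cycles."""
    runs, run = [], 0
    for is_active in active:
        if is_active:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs

schedule = burst_schedule(10_000)
runs = burst_lengths(schedule)
print(f"{len(runs)} bursts, {sum(schedule)} faulty cycles out of {len(schedule)}")
```

A simulator could consult `schedule[cycle]` to decide whether to apply one of the fault models (short, open, delay, glitch) at the chosen location on that cycle.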
[Diagram: transistor fault models; labels: Transistor, Stuck-open, Last output, Stuck-short, IDDQ, Delay]
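The pairing of "Stuck-open" with "Last output" can be read as the usual CMOS memory effect: when a stuck-open transistor leaves neither the pull-up nor the pull-down network conducting, the output node floats and keeps its previous value. That reading is an assumption about the diagram. The sketch below models it for a 2-input NAND gate; the class and attribute names are hypothetical.

```python
class Nand2:
    """2-input CMOS NAND with an optional stuck-open pull-up transistor on input A."""

    def __init__(self, pmos_a_stuck_open=False):
        self.pmos_a_stuck_open = pmos_a_stuck_open
        self.out = 1  # assumed initial output value

    def evaluate(self, a, b):
        # Pull-up: two pMOS in parallel, each conducting when its input is 0.
        pull_up = (a == 0 and not self.pmos_a_stuck_open) or b == 0
        # Pull-down: two nMOS in series, conducting only when both inputs are 1.
        pull_down = a == 1 and b == 1
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        # Otherwise neither network conducts and the output keeps its last value.
        return self.out

good, faulty = Nand2(), Nand2(pmos_a_stuck_open=True)
for a, b in [(1, 1), (0, 1), (0, 0), (0, 1)]:
    print(f"a={a} b={b}  good={good.evaluate(a, b)}  faulty={faulty.evaluate(a, b)}")
```

The faulty gate differs from the good one only on the second pattern, where the previous output was 0. The same input applied after an output of 1 looks correct, so detecting the fault depends on the order of the inputs.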
[Diagram: wire fault models; labels: Wire, Open, Stuck-Open, Last output, Stuck-at, Short, IDDQ, Bridging, Logical AND/OR, Delay]
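The "Bridging" and "Logical AND/OR" labels suggest the common wired-AND and wired-OR bridging models, in which two shorted wires both take the AND (or OR) of the values driven onto them. The sketch below is a minimal illustration of that model; the function names are hypothetical.

```python
def bridge(a, b, model):
    """Return the values seen on two bridged wires under a wired-AND or wired-OR model."""
    if model == "AND":
        value = a & b
    elif model == "OR":
        value = a | b
    else:
        raise ValueError("model must be 'AND' or 'OR'")
    return value, value

def exposing_inputs(model):
    """Driven (a, b) pairs for which the bridge changes at least one wire's value."""
    pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
    return [(a, b) for a, b in pairs if bridge(a, b, model) != (a, b)]

for model in ("AND", "OR"):
    print(model, exposing_inputs(model))
```

Under either model the bridge is visible only when the two wires are driven to opposite values, so a test has to set them differently and observe the wire whose value gets overridden.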


Intermittent faults are loosely defined and their causes are not well explored.
We need more accurate results on the rates of ifaults.
◦ Rates and number of recurrences

Do NBTI, stress migration, thermal cycling and hot carrier injection cause ifaults?
◦ Evidence from scientific studies or field data.
Backup Slides
[Figure] From [AdancesinRadioScience09]
[Figure] From [AdancesinRadioScience09]
Example
[Figure] From Dr. Ivanov's course
[Figure] Copyright 2001, Agrawal & Bushnell
[Diagram: RAM fault models; labels: RAM, Pattern Sensitivity, Coupling, BDS, BDS]
[Diagram: wire fault models; labels: Wire, Open, Stuck-Open, Last output, Stuck-at, Short, Bridging, Wired AND/OR, Dominant, Dominant AND/OR, IDDQ, IDDQ, IDDQ, Delay]
[Wikipedia] Many articles.
[WDSN07] Impact of Intermittent Faults on Nanocomputing Devices, WDSN, 2007.
[D3T] Emphasis on the existence of intermittent faults in embedded systems. IEEE Workshop on Defect and Data Driven Testing (D3T), 2010.
[ASPLOS08] Adapting to intermittent faults in multicore systems.
[EuroSys11] Cycles, Cells and Platters: An Empirical Analysis.
[IEEETrans.onElectronDevices96] Soft breakdown of ultra-thin gate oxide layers.
[ACMSurveys10] Electromigration for Microarchitects, Intel.
[Applied Physics Letters91] Stress-migration related electromigration damage mechanism in passivated, narrow interconnects.
[AdancesinRadioScience09] Impact of negative and positive bias temperature stress on 6T-SRAM cells.