Transcript Chapter 8
Chapter 8
Coping with Physical Failures,
Soft Errors, and
Reliability Issues
EE141
System-on-Chip
Test Architectures
1
Ch. 8 – Physical Failures - P. 1
What is this chapter about?
Gives an overview of the causes of manufacturing defects
and soft errors, and of promising solutions to them
Focus on
Signal Integrity
Defect-Based Tests
Process Sensors and Adaptive Design
Soft Errors
– BISER
– Circuit-Level Approaches
Defect and Error Tolerance
Ch. 8 – Physical Failures - P. 2
Coping with Physical Failures, Soft Errors,
and Reliability Issues
Introduction
Signal Integrity
Manufacture Defects, Process Variations, and
Reliability
Soft Errors
Defect and Error Tolerance
Concluding Remarks
Ch. 8 – Physical Failures - P. 3
Introduction
Defects
Random defects
– Caused by manufacturing imperfections; occur in random places
Systematic defects
– Caused by process or manufacturing variations
Defect level (DL) is a function of process yield (Y) and fault coverage (FC)
$DL = 1 - Y^{(1 - FC)}$
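As a rough illustration of the defect level relation above, the following Python sketch evaluates DL = 1 - Y^(1 - FC). The function name and the sample yield and coverage values are assumptions chosen for illustration; they do not come from the slides.

```python
def defect_level(process_yield: float, fault_coverage: float) -> float:
    """Return DL = 1 - Y**(1 - FC), with Y and FC given as fractions in [0, 1]."""
    if not (0.0 <= process_yield <= 1.0 and 0.0 <= fault_coverage <= 1.0):
        raise ValueError("yield and fault coverage must be fractions in [0, 1]")
    return 1.0 - process_yield ** (1.0 - fault_coverage)


# Illustrative inputs (assumed, not from the slides): 50% yield, two coverage levels.
for fc in (0.95, 0.99):
    dl = defect_level(0.5, fc)
    print(f"FC={fc:.2f}  DL={dl:.4f}  (about {dl * 1e6:.0f} defective parts per million)")
```

Under this relation, raising fault coverage toward 1 drives the defect level toward 0 for any yield.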
Ch. 8 – Physical Failures - P. 4
Concept of Signal Integrity
Signal integrity is the ability of a signal to generate correct responses in a
circuit.
A signal with good integrity stays within safe margins for its voltage amplitude
and transition time.
Ch. 8 – Physical Failures - P. 5
Basic Concept of Integrity Loss
Integrity loss: any portion of a signal that exceeds the
amplitude-safe and time-safe margins.
$IL = \sum_i \int_{b_i}^{e_i} |V_i - f(t)| \, dt$
where V_i is one of the acceptable amplitude levels and [b_i, e_i] is a
time frame during which integrity loss occurs.
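The integrity loss integral can be approximated numerically from a sampled waveform. The sketch below is one possible reading, under stated assumptions: the signal f(t) is given as samples, the acceptable amplitude levels are a single band [v_low, v_high], and only the amplitude margin is modeled. All names are hypothetical.

```python
def integrity_loss(times, samples, v_low, v_high):
    """Approximate the area of the waveform lying outside [v_low, v_high].

    Assumption: the excess beyond the nearest safe level stands in for
    |V_i - f(t)| and is integrated with the trapezoidal rule, so values
    are approximate near the points where the waveform crosses a margin.
    """
    excess = [max(s - v_high, 0.0) + max(v_low - s, 0.0) for s in samples]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (excess[k] + excess[k - 1]) * dt
    return total


# Illustrative waveform (assumed): an overshoot of 0.2 V above a 1.0 V upper margin.
print(integrity_loss([0, 1, 2, 3], [0.0, 1.2, 1.2, 0.0], v_low=0.0, v_high=1.0))
```

A waveform that stays inside the band yields zero loss, which matches the idea that a signal with good integrity stays within its safe margins.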
Ch. 8 – Physical Failures - P. 6
Sources of Integrity Loss
Interconnects
Power Supply Noise
Process Variations
Ch. 8 – Physical Failures - P. 7
Integrity Loss Sensors/Monitors (1)
Current Sensor
Current sensors are often used to detect the completion
of asynchronous circuits.
Ch. 8 – Physical Failures - P. 8
Integrity Loss Sensors/Monitors (2)
Power Supply Noise Sensor
The voltage Vx depends on the power/ground bounce:
the higher the power supply noise (PSN), the longer the
propagation delay and the higher the voltage Vx will be.
Ch. 8 – Physical Failures - P. 9
Integrity Loss Sensors/Monitors (3)
Noise Detector (ND) Sensor
The ND sensor is designed to detect integrity loss due to
voltage violations.
Ch. 8 – Physical Failures - P. 10
Integrity Loss Sensors/Monitors (4)
Integrity Loss Sensor (ILS)
The integrity loss sensor is a delay violation sensor.
Ch. 8 – Physical Failures - P. 11
Integrity Loss Sensors/Monitors (5)
Jitter Monitor
Jitter is often defined as the time deviation of a signal
from its ideal location in time.
Ch. 8 – Physical Failures - P. 12
Integrity Loss Sensors/Monitors (6)
A ring oscillator can work as a Process Variation Sensor
The variation of delay caused by PV-faults in any of the
inverters in the loop results in deviation in the frequency
of the oscillator, which can be detected.
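$f_{RO} = \frac{1}{2 N_{inv} T_{inv}}$, where N_inv is an odd number of inverters
and T_inv is the delay of one inverter.
$f_{RO} \propto \frac{1}{N_{inv} V_{dd} C_{Load}} \left( \frac{\mu \varepsilon W}{2 T_{ox} L_{eff}} \right) (V_{GS} - V_t)^2 (1 + K V_{DS})$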
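To make the ring oscillator relation concrete, here is a small Python sketch of f_RO = 1 / (2 N_inv T_inv) and of the relative frequency shift when the inverter delay changes. The function names and the delay values are assumptions for illustration only.

```python
def ring_oscillator_frequency(n_inv: int, t_inv: float) -> float:
    """Return f_RO = 1 / (2 * N_inv * T_inv) for an odd number of inverters."""
    if n_inv < 1 or n_inv % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    if t_inv <= 0:
        raise ValueError("inverter delay must be positive")
    return 1.0 / (2 * n_inv * t_inv)


def relative_frequency_shift(n_inv: int, t_nominal: float, t_actual: float) -> float:
    """Return (f_actual - f_nominal) / f_nominal for a changed inverter delay."""
    f_nominal = ring_oscillator_frequency(n_inv, t_nominal)
    f_actual = ring_oscillator_frequency(n_inv, t_actual)
    return (f_actual - f_nominal) / f_nominal


# Illustrative values (assumed): 5 inverters, 100 ps nominal delay, 10% slower silicon.
print(ring_oscillator_frequency(5, 100e-12))
print(relative_frequency_shift(5, 100e-12, 110e-12))
```

A 10% increase in inverter delay lowers the frequency by about 9%, which is the kind of deviation a process variation sensor would look for.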
Ch. 8 – Physical Failures - P. 13
Readout Architectures (1)
BIST-Based Architecture
BIST Architecture
Readout Circuitry
When a noise or delay violation occurs (flag=1), the
contents of all scan cells are then scanned out through
Sout for further reliability and diagnosis analysis.
Ch. 8 – Physical Failures - P. 14
Readout Architectures (2)
Scan-Based Architecture
At the driving side of an interconnect, a pattern generation
boundary-scan cell (PGBSC) is used to generate test patterns.
At the receiving side of the interconnect, an observation
boundary-scan cell (OBSC) is used to detect integrity loss.
Ch. 8 – Physical Failures - P. 15
Readout Architectures (3)
Basic Concept of PV-Test Architecture
On-chip ROs with counters, embedded in a test chip, are
used to detect process variation by measuring the ROs'
frequency shifts.
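The counter-based readout can be sketched as follows: each counter value over a fixed window gives a measured frequency, which is compared with a nominal value. This is only a sketch under assumptions; the window length, nominal frequency, tolerance, and oscillator names are all hypothetical.

```python
def measured_frequency(count: int, window_s: float) -> float:
    """Frequency estimate from the number of RO cycles counted in a time window."""
    return count / window_s


def flag_process_variation(counts, window_s, f_nominal, tolerance=0.05):
    """Return {ro_name: relative_shift} for ROs whose shift exceeds the tolerance.

    `counts` maps a (hypothetical) RO name to its counter value.
    """
    flagged = {}
    for name, count in counts.items():
        shift = (measured_frequency(count, window_s) - f_nominal) / f_nominal
        if abs(shift) > tolerance:
            flagged[name] = shift
    return flagged


# Illustrative readout (assumed): 1 microsecond window, 1 GHz nominal frequency.
print(flag_process_variation({"ro_a": 1000, "ro_b": 900}, 1e-6, 1e9))
```

Here only ro_b is reported, since its frequency is about 10% below nominal while ro_a matches it.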
Ch. 8 – Physical Failures - P. 16
Manufacture Defects, Process Variations,
and Reliability
100% single stuck-at fault coverage cannot guarantee
perfect product quality, because there are remaining
defects that are:
Timing-dependent
Sequence-dependent
Attributed to timing-dependent, non-single-stuck-at faults
Ch. 8 – Physical Failures - P. 17
Structural Tests
A Defect-Based Test Architecture
Figure (flow diagram): a defect-based test architecture. Block labels: RTL, Library, Synthesis, Gate-level Netlist, Layout, RC Extraction, Timing Analysis, Path Extractor, Critical Path List, Defect-Based Fault Enumeration, Physical Faults, Modeling, Fault Mapping, Logical Fault List, Fault List, ATPG, Structural Tests, Functional Tests, Defect-Based Fault Simulator, Defect-Based ATPG, Defect-Based Tests
Ch. 8 – Physical Failures - P. 18
Defect-Based Tests
Small Delay Defect Tests
Bridge Defect Tests
N-Detect Tests
Iddq Tests
MinVDD Tests
VLV Tests
Ch. 8 – Physical Failures - P. 19
Reliability Stress
Concept of Infant Mortality
Methods to screen infant mortality
Method I - Burn-in
$ttf = C \, e^{E_A / (kT)}$
where ttf is the time to failure, C is a constant, E_A is the
activation energy (eV), k is Boltzmann's constant, and T is
the absolute temperature.
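The burn-in relation can be explored numerically. The sketch below evaluates ttf = C e^(E_A/(kT)) and the ratio of time to failure at a use temperature to that at a stress temperature. The activation energy and temperatures are assumed values for illustration, and the function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def time_to_failure(c: float, activation_energy_ev: float, temperature_k: float) -> float:
    """Return ttf = C * exp(E_A / (k * T)), with T in kelvin."""
    return c * math.exp(activation_energy_ev / (BOLTZMANN_EV_PER_K * temperature_k))


def acceleration_factor(activation_energy_ev: float, t_use_k: float, t_stress_k: float) -> float:
    """Ratio ttf(T_use) / ttf(T_stress); the constant C cancels out."""
    return (time_to_failure(1.0, activation_energy_ev, t_use_k)
            / time_to_failure(1.0, activation_energy_ev, t_stress_k))


# Illustrative values (assumed): E_A = 0.7 eV, 55 C use, 125 C burn-in.
print(acceleration_factor(0.7, 55 + 273.15, 125 + 273.15))
```

The ratio is greater than 1 whenever the stress temperature exceeds the use temperature, which is why elevated temperature shortens the time needed to expose infant mortality failures.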
Method II - Elevated Voltage Stress
Ch. 8 – Physical Failures - P. 20
Redundancy and Memory Repair
Redundancy:
Spare rows, columns, or blocks
Repair schemes:
Pellston Technology [Wuu 2005]: If repeated errors
are detected, disable the cache line (set the “not to use”
bit)
Perform memory BIST at new operating conditions;
exclude failing cells and resize the cache (the cache size
can grow or shrink, depending on whether the
new conditions are more or less favourable)
Ch. 8 – Physical Failures - P. 21
Process Sensors and Adaptive design
Compared with traditional test structures placed on the
scribe lines, additional process sensors are embedded
on-chip.
On-Chip Process Sensors:
Process Variation Sensor
Thermal Sensor
Dynamic Voltage Scaling
Ch. 8 – Physical Failures - P. 22
Process variation Sensor
Ring oscillators:
Many factors can affect the frequency of the ring oscillator such as
process variation, temperature and voltage.
Analog Process Variation Sensor:
The analog circuit will be sensitive to different process parameters.
Neither can report the process variation at a specific spot
on the die, and neither is likely to allow the data to be
extracted and analyzed in real time.
Ch. 8 – Physical Failures - P. 23
Thermal Sensor
On-chip thermal sensors are the last defence
to prevent system crash or permanent
damage to the chip.
Thermal sensor example:
Figure 8.14: Thermal sensor example (labels: currents I1, I2, I3; resistors R1, R2; nodes Vb, Vc, ΔVf; references Vref-1 to Vref-n, Vref_diode, Vref_TTLEVEL; MUX; comparators; Tx Detect)
Ch. 8 – Physical Failures - P. 24
Dynamic Voltage Scaling
Figure 8.15: Dynamic voltage scaling scheme (axes: frequency from f_MIN to f_MAX, Vcc from VIDmin to VIDnom, time; event: DVS frequency-change request; transitions 1 and 3 are in the range of 100s of ps, transitions 2 and 4 in the range of 100s of μs)
Ch. 8 – Physical Failures - P. 25
Dynamic Voltage Scaling (cont’d)
Use sleep transistors and dynamic biasing to
save power
Use the adaptive test method for smart
binning
Ch. 8 – Physical Failures - P. 26
Soft Errors
Introduction
Sources of Soft Errors and SER
Trends
Coping with Soft Errors
Ch. 8 – Physical Failures - P. 27
Introduction
Soft errors
Soft errors are transient single-event upsets
(SEUs) caused by various types of radiation
Cosmic radiation is the major source of soft
errors, especially in memories.
Terrestrial radiation is another source of soft
errors.
Ch. 8 – Physical Failures - P. 28
Sources of Soft Errors and SER Trends
If a glitch is induced at the junction (red label) in a memory
element, its state can be reversed.
Figure 8.16: Induced soft error on an SRAM cell
Ch. 8 – Physical Failures - P. 29
Sources of Soft Errors and SER Trends
Logic circuits are less susceptible to these glitches
than memories for the following reasons.
The glitch must be of sufficient strength to propagate from
the location of the strike.
The glitch needs to have a functionally sensitized path to be
latched.
The glitch must arrive at a latch during its latching window.
Figure 8.18: Masking factors of soft errors in
combinational logic
Ch. 8 – Physical Failures - P. 30
Coping with Soft Errors
As chips are susceptible to soft errors, many soft
error protection schemes targeting chip designs have
been proposed.
Fault tolerance
Error-resilient microarchitectures
Soft error mitigation
Ch. 8 – Physical Failures - P. 31
Fault Tolerance
Removing the source of soft errors to improve the
reliability of a chip.
Three fundamental fault tolerance schemes:
Hardware (spatial) redundancy
– assumes that defects and radiation particles will
only hit a specific device and not another device
Time (temporal) redundancy
– assumes that a radiation strike will not happen
on the same circuitry again at a slightly later time
Information redundancy
– using error-detecting code or error-correcting code
to represent information contents
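As a minimal illustration of information redundancy, the sketch below uses a single even-parity bit as an error-detecting code. This is the simplest such code and is chosen here only as an assumption for illustration; the slides do not name a specific code. Parity detects any odd number of flipped bits but cannot correct them.

```python
def add_even_parity(data_bits):
    """Append one parity bit so that the total number of 1s is even."""
    return list(data_bits) + [sum(data_bits) % 2]


def parity_error(word):
    """Return True if the stored word violates even parity."""
    return sum(word) % 2 == 1


stored = add_even_parity([1, 0, 1, 1])
print(parity_error(stored))   # word as written

upset = list(stored)
upset[2] ^= 1                 # model a single-event upset flipping one bit
print(parity_error(upset))
```

An error-correcting code would add more check bits so that the position of the flipped bit can also be identified.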
Ch. 8 – Physical Failures - P. 32
Fault Tolerance
Common fault tolerance schemes used in
high-reliability systems
Duplicate and compare
– used in mainframes and high-end servers
Triple modular redundancy
– used for systems that cannot fail
Redundant multithreading
– runs redundant copies of a thread and compares their results
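Triple modular redundancy can be sketched with a bitwise majority voter: three copies of a module compute the same word, and each output bit takes the value on which at least two copies agree. The function name and the sample words are assumptions for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


# Illustrative case (assumed): the third copy suffers a single-bit upset.
good = 0b1010
faulty = good ^ 0b1000
print(bin(majority_vote(good, good, faulty)))
```

The voter masks any error confined to one copy; it does not help if two copies fail in the same bit position.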
Ch. 8 – Physical Failures - P. 33
Error-Resilient Microarchitectures
Two representative error-resilient processor
microarchitectures
DIVA
Razor
DIVA
Dynamic Implementation Verification Architecture (DIVA)
DIVA Checker
– a smaller and simpler shadow processor
– contains a functional checker stage (CHK), a commit stage (CT),
and a watchdog timer (WT)
DIVA Core
– The main processor that fetches, decodes, and executes
instructions, holding their speculative results in the reorder
buffer (ROB)
Ch. 8 – Physical Failures - P. 34
Error-Resilient Microarchitectures
Razor
Dynamic voltage scaling (DVS) is one of the most
effective and widely used methods for power-aware
computing.
The key idea of Razor is to tune the supply voltage by
monitoring the error rate during circuit operation; this is
accomplished with a shadow unit, but this shadow unit
has been pushed all the way down into a Razor flip-flop.
This Razor flip-flop is shown in Figure 8.21a.
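The behavior described above can be sketched as a simple model: the main flip-flop samples the data at the clock edge, the shadow latch samples it again at a delayed clock, and a mismatch raises an error and restores the shadow value. The model assumes the shadow latch always holds the correct value, and the voltage-tuning rule with its step size and target error rate is a hypothetical control policy, not taken from the slides.

```python
def razor_flip_flop(d_at_clk: int, d_at_clk_del: int):
    """Return (q, error) for one cycle of a behavioral Razor flip-flop model."""
    main = d_at_clk          # value captured by the main flip-flop
    shadow = d_at_clk_del    # value captured later by the shadow latch
    error = main != shadow   # comparator output
    q = shadow if error else main  # on error, restore the shadow value
    return q, error


def tune_voltage(vdd: float, error_rate: float, target_rate: float, step: float = 0.01) -> float:
    """Hypothetical policy: raise Vdd if errors exceed the target, otherwise lower it."""
    return vdd + step if error_rate > target_rate else vdd - step


# Illustrative cycle (assumed): late-arriving data is missed by the main flip-flop.
print(razor_flip_flop(d_at_clk=0, d_at_clk_del=1))
print(tune_voltage(1.00, error_rate=0.05, target_rate=0.01))
```

The point of the sketch is the feedback loop: observed timing errors are the signal used to decide whether the supply voltage can be lowered further.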
Ch. 8 – Physical Failures - P. 35
Error-Resilient Microarchitectures
Figure 8.21(a): Schematic of the Razor flip-flop (labels: logic stages L1 and L2, main flip-flop with input D1 and output Q1, 0/1 multiplexer, shadow latch clocked by clk_del, comparator, Error_L, Error, clk)
Ch. 8 – Physical Failures - P. 36
Error-Resilient Microarchitectures
Razor
A reduced overhead Razor flip-flop with the
metastability detection circuit is illustrated in Figure
8.21b.
Figure 8.21(b): Reduced overhead Razor flip-flop with metastability detection circuit (labels: D, Q, clk, clk_b, clk_del, clk_del_b, 0/1 multiplexer, metastability detector with Inv_n and Inv_p, shadow latch, Error_L)
Ch. 8 – Physical Failures - P. 37
Soft Error Mitigation
Soft error mitigation techniques aim to give a design partial
immunity to potential soft errors at a significantly lower cost
than fault tolerance schemes.
There are three soft error mitigation methods:
(1) Built-In Soft-Error Resilience (BISER)
BISER proposed in [Mitra 2005] can be used to allow scan
design to protect a device from soft errors during normal
operation.
Ch. 8 – Physical Failures - P. 38
Soft Error Mitigation
Figure 8.22 shows the BISER scan cell design that
reduces the impact of soft errors affecting storage
elements by more than 20 times.
Figure 8.22: Built-in soft-error resilience (BISER) scan cell (labels: scan portion with latches LA and LB; system flip-flop with latches PH2 and PH1; signals SI, SO, SCA, SCB, CAPTURE, UPDATE, D, CLK, TEST, O1, O2, Q; C-element; keeper)
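The C-element and keeper in the BISER cell can be modeled behaviorally: the output follows the two inputs only when they agree, and otherwise holds its previous value. The sketch below is a simplified model under assumptions; output polarity (inverting or not) is ignored, and the function name is hypothetical.

```python
def c_element(o1: int, o2: int, previous_q: int) -> int:
    """Behavioral C-element with keeper: update only when both inputs agree."""
    if o1 == o2:
        return o1
    return previous_q  # inputs disagree: the keeper holds the last value


# Illustrative sequence (assumed): both storage elements hold 1, then one is upset.
q = c_element(1, 1, previous_q=0)
print(q)
q = c_element(0, 1, previous_q=q)   # soft error flips one copy only
print(q)
```

Because a single upset makes the two copies disagree, the corrupted value does not reach the output, which is the mechanism behind the cell's resilience.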
Ch. 8 – Physical Failures - P. 39
Soft Error Mitigation
Circuit-level approaches
(2) Gate resizing for soft error mitigation [Zhou 2006] is
based on physical-level design modifications.
Figure 8.23 illustrates the effect of gate resizing on the
amplitude and width of a 0-to-1 transient at the output of a
gate.
Figure 8.23: Effect of gate resizing on the
amplitude/width of SETs [Zhou 2006]
Ch. 8 – Physical Failures - P. 40
Soft Error Mitigation
Circuit-level approaches
(3) Netlist transformation for soft error mitigation
[Almukhaizim 2006] is based on logic-level design
modifications.
Figure 8.24: Example of rewiring to
reduce the soft error failure rate
Ch. 8 – Physical Failures - P. 41
Defect and Error Tolerance
Defect Tolerance
Insert redundant circuitry in a circuit under test
so that the circuit can continue correct operation in the
presence of defects.
Error Tolerance
Allow the circuit to continue acceptable operation
in the presence of errors
Ch. 8 – Physical Failures - P. 42
Random Spot Defects
Assume a design consists of N submodules.
Each submodule has n unique positions where a
defect would cause it to fail its tests.
D defects are uniformly distributed over the
submodules.
Number of defects in any submodule is
independent of the number of defects in other
submodules.
Ch. 8 – Physical Failures - P. 43
Defect Probability
Probability that an arbitrary position on a
submodule is associated with a defect is:
p = D / (nN)
Probability of having d defects in a given
submodule is:
$P(d) = C(n,d) \, p^d (1-p)^{n-d}$
where
C(n,d) = n! / (d!(n-d)!)
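The two probabilities above can be checked numerically. The sketch below computes p = D / (nN) and the binomial probability of d defects in a submodule; the function names and the sample values of D, n, and N are assumptions for illustration.

```python
from math import comb


def defect_position_probability(total_defects: int, positions: int, submodules: int) -> float:
    """Return p = D / (n * N)."""
    return total_defects / (positions * submodules)


def prob_defects_binomial(d: int, n: int, p: float) -> float:
    """Return P(d) = C(n, d) * p**d * (1 - p)**(n - d)."""
    return comb(n, d) * p ** d * (1 - p) ** (n - d)


# Illustrative values (assumed): D = 4 defects, n = 10 positions, N = 2 submodules.
p = defect_position_probability(4, 10, 2)
for d in range(4):
    print(d, round(prob_defects_binomial(d, 10, p), 4))
```

Summing P(d) over d = 0..n gives 1, a quick sanity check on the formula.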
Ch. 8 – Physical Failures - P. 44
Poisson Distribution
Because d is binomially distributed, the average number of
defects in an arbitrary submodule is:
E(d) = λ = np = D / N
For large n and small p, the binomial distribution can be
approximated by the Poisson distribution:
$P(d) = \frac{e^{-\lambda} \lambda^d}{d!}$
Ch. 8 – Physical Failures - P. 45
Example
Assume a submodule is equally likely to
be defect-free or defective:
$P(d = 0) = e^{-\lambda} \lambda^0 / 0! = e^{-\lambda} = 0.5$
Thus, λ = ln 2 = 0.693.
Effective yield can increase significantly
if the system can accept some defective
submodules.
Ch. 8 – Physical Failures - P. 46
Probability of Having Exactly d Defects in a Submodule as a Function of Yield (Y)
for Various Values of Failure Rate λ
d   λ=0.105  λ=0.223  λ=0.357  λ=0.511  λ=0.693  λ=0.916  λ=1.204  λ=1.609  λ=2.303
0   Y=0.90   Y=0.80   Y=0.70   Y=0.60   Y=0.50   Y=0.40   Y=0.30   Y=0.20   Y=0.10
1   0.09     0.18     0.25     0.31     0.35     0.37     0.36     0.32     0.23
2   -        0.02     0.04     0.08     0.12     0.17     0.22     0.26     0.27
3   -        -        0.01     0.01     0.03     0.05     0.09     0.14     0.20
4   -        -        -        -        -        0.01     0.03     0.06     0.12
5   -        -        -        -        -        -        0.01     0.02     0.05
6   -        -        -        -        -        -        -        -        0.02
7   -        -        -        -        -        -        -        -        0.01
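The table entries follow from the Poisson approximation with Y = P(d = 0) = e^(-λ), so λ = -ln Y. The sketch below recomputes a few columns as a check; the function names are hypothetical.

```python
import math


def failure_rate_from_yield(y: float) -> float:
    """Return lambda = -ln(Y), since Y = P(d = 0) = exp(-lambda)."""
    return -math.log(y)


def prob_defects_poisson(d: int, lam: float) -> float:
    """Return P(d) = exp(-lambda) * lambda**d / d!."""
    return math.exp(-lam) * lam ** d / math.factorial(d)


for y in (0.9, 0.5, 0.1):
    lam = failure_rate_from_yield(y)
    column = [round(prob_defects_poisson(d, lam), 2) for d in range(4)]
    print(f"Y={y:.2f}  lambda={lam:.3f}  P(0..3)={column}")
```

For Y = 0.50 this gives λ = 0.693, matching the worked example, and a submodule with exactly one defect occurs with probability of about 0.35.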
47
Ch. 8 – Physical Failures - P. 47
Defect Tolerance
Formerly called redundancy repair
A typical defect-tolerant design (figure: three identical
modules M connected to a switch) has:
Two spares (identical modules)
A switch used to select one module
Ch. 8 – Physical Failures - P. 48
Error Tolerance
The main objective of error tolerance is to
increase the effective yield of a process by
identifying defective but acceptable chips
This depends on the development of
An accurate method to estimate error rate
An effective method to predict yield
EE141
System-on-Chip
Test Architectures
49
Ch. 8 – Physical Failures - P. 49
Fault-Oriented Test Methodology
Enhance effective yield based on error-rate
analysis
Estimate error rate of each modeled fault
A set of acceptable faults is identified based on
their error rates
Figure (flow diagram) labels: IC Fabrication, Fault Ranking, Testing, Acceptable Chips, Unacceptable Chips
Ch. 8 – Physical Failures - P. 50
Error-Oriented Test Methodology
Focus on errors produced by defective chips rather
than on modeled faults
Estimate the error rates of these chips
Determine the acceptability of the faulty
chips from the estimated results
Figure (flow diagram) labels: IC Fabrication, Testing, Good Chips, Bad Chips, Error-Rate Estimation, Estimated Error Rate, Classification Based on Estimated Error Rate, Acceptable Chip Set 1, Acceptable Chip Set 2, …, Unacceptable Chips
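The classification step can be sketched as a threshold lookup on the estimated error rate. This is a minimal sketch under assumptions: the thresholds, their number, and the function name are hypothetical, and how the error rate is estimated is outside the sketch.

```python
def classify_chip(estimated_error_rate: float, thresholds) -> str:
    """Assign a failing chip to the first acceptance set whose limit it meets.

    `thresholds` holds assumed upper limits on error rate, one per
    acceptable chip set; a chip above every limit is unacceptable.
    """
    for index, limit in enumerate(sorted(thresholds), start=1):
        if estimated_error_rate <= limit:
            return f"Acceptable Chip Set {index}"
    return "Unacceptable"


# Illustrative thresholds (assumed): two acceptance grades.
for rate in (0.0005, 0.005, 0.5):
    print(rate, classify_chip(rate, [0.001, 0.01]))
```

Each acceptance set would correspond to applications that can tolerate error rates up to that set's limit.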
Ch. 8 – Physical Failures - P. 51
Concluding Remarks
Circuit errors can be caused by manufacturing
defects and soft errors.
Design for Manufacturability (DFM) – Fault avoidance
schemes to cope with physical failures caused by
signal integrity, defects, and process variations during
manufacturing.
Design for Reliability (DFR) – Embedded error
resilience and defect tolerance circuitry on-chip to
tolerate soft errors and manufacturing defects.
Ch. 8 – Physical Failures - P. 52