A Comprehensive Look at
VLSI Fault Diagnosis
David B. Lavo
Dissertation Defense
July 19, 2002
Outline
• Background & Philosophy
• Three-Stage Fault Diagnosis
• IDDQ Fault Diagnosis
• Small Fault Dictionaries
• Conclusions and Future Work
Format of This Talk
• Prior art is introduced throughout
• Innovations are presented as:
– Problem to be solved
– My approach
• Brief exposition:
– Background and prior art
– Particulars and importance of innovations
Format Example
• Problem: This presentation is too long
– 6 conference papers, 5 years of work, 73 slides
• Solution: TalkKompress® technology
– Rapid delivery
– High gloss finish
– Drill-down capability
VLSI Test (In One Slide)
[Figure: A test pattern (vector) of logic values is applied to the circuit
inputs and the test response is captured at the outputs; here outputs 5 and 6
fail, so the fault signature records "1: 5, 6" (test number: failing outputs).
IDDQ Test: Pass = low current, Fail = high current.]
VLSI Fault Diagnosis (in One Slide)
[Figure: Tests applied to the defective circuit produce an observed behavior;
the diagnosis algorithm turns that behavior into a diagnosis (a location or a
fault), which guides physical analysis.]
Traditional Fault Diagnosis
[Figure: Tests applied to the defective circuit yield a behavior signature
(e.g. 010001010100010101010…); a fault simulator produces candidate signatures
(e.g. 010100110000101010100…, 101000100001011101100…); the diagnosis algorithm
compares them and draws a conclusion.]
Fault Models
• A fault model is an abstraction of a type
of defect behavior
• A fault instance is the application of a
model to a circuit wire, node, gate, etc.
• Used to create and evaluate test sets
• For diagnosis, they can be used to
simulate and predict faulty behaviors
Stuck-at Fault Model
• The most-used fault
model (by far)
• Simple to simulate
• Effective for testing, fault
grading, and diagnosis of
some defects
• Many fault scenarios are
not well represented by
the stuck-at model
[Figure: Node A stuck-at 1. A carries fault-free/faulty logic values 0/1,
B carries 1, and the fault propagates to the gate output as 0/1.]
Bridging Fault Model
• Shorts are a
common defect type
in CMOS
• Different bridging
fault models have
varying accuracy
and precision, from
simplistic to very
sophisticated
[Figure: Nodes X and Y bridged. X is driven to 0 and forces Y, whose
fault-free value is 1, to a faulty value of 0 (1/0).]
Statement of Philosophy
• Defect model isn’t known beforehand: A single-model
algorithm will not be robust
• Fault models are necessary: Fault models provide precision
and guidance for physical failure analysis
• Fault models are unreliable: Expecting defect behavior to
correspond exactly with model assumptions and predictions
can cause diagnoses to fail
• Diagnosis is messy: Any data item can be corrupt; don’t
make an irreversible decision based on a single piece of
information
• Be practical: Be wary of expensive solutions, and use all
available data
Outline
• Background & Philosophy
• Three-Stage Fault Diagnosis
• IDDQ Fault Diagnosis
• Small Fault Dictionaries
• Conclusions and Future Work
Three-Stage Fault Diagnosis
• Three stages of increasing precision
• First: Model-independent diagnosis for
complex and multiple defects
• Second: Determine likely fault models
and likely physical areas for analysis
• Third: Diagnosis using multiple specific
fault models
First Stage Fault Diagnosis
• Problem: Size of the diagnosis problem
– Entire circuit is implicated
– Any fault model can be considered
• Solution: iSTAT algorithm
– Model-independent
– Computationally simple
– Implicates a subset of the circuit nodes
– Per-Test approach
Per-Test Fault Diagnosis
• Prior art:
– Waicukauski & Lindbloom (D&T‘89)
– POIROT algorithm (Intel, ITC’00)
– SLAT algorithm (IBM, ITC’01)
• Basic Idea:
– A complex defect will, on some tests,
behave exactly like a stuck-at fault
– The stuck-at faults so implicated tell you
something about the defect
Per-Test Fault Diagnosis
Traditional Diagnosis:
  Signature (1: 1,2,3  2: 3,4,5  3: 1,2,5) → Fault X (best single match)

Per-Test Diagnosis:
  Test 1: {1, 2, 3} → Fault A *
  Test 2: {3, 4, 5} → Fault B *
  Test 3: {1, 2, 5} → no match †
  Final diagnosis: {A, B}
  * Tests 1 & 2 are simple failing tests → implicate faults
  † Test 3 is a complex failing test → no implication
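
To make the per-test idea concrete, below is a minimal Python sketch (the names
and data structures are illustrative, not the dissertation's implementation): a
stuck-at fault is implicated by a failing test only when its simulated failing
outputs exactly match the observed failing outputs on that test.

```python
# Minimal per-test matching sketch (hypothetical names): a stuck-at
# fault is implicated by a failing test only if its simulated failing
# outputs exactly match the observed failing outputs on that test.

def per_test_matches(observed_failures, fault_signatures):
    """observed_failures: {test_id: frozenset(failing_outputs)}
    fault_signatures: {fault_id: {test_id: frozenset(failing_outputs)}}
    Returns {test_id: set of fault_ids that match exactly}."""
    matches = {}
    for test, outputs in observed_failures.items():
        matches[test] = {
            fault for fault, sig in fault_signatures.items()
            if sig.get(test, frozenset()) == outputs
        }
    return matches

# The example above: tests 1 and 2 are simple failing tests that
# implicate faults A and B; test 3 matches no single stuck-at fault.
observed = {1: frozenset({1, 2, 3}), 2: frozenset({3, 4, 5}), 3: frozenset({1, 2, 5})}
sigs = {"A": {1: frozenset({1, 2, 3})}, "B": {2: frozenset({3, 4, 5})}}
print(per_test_matches(observed, sigs))
# {1: {'A'}, 2: {'B'}, 3: set()}  ->  final diagnosis {A, B}
```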
Improving Per-Test Diagnosis
• The problem with the per-test approach
is the number of implicated faults
• Also, most per-test algorithms can’t
handle complex failures or passing tests
• The iSTAT algorithm solves these
problems by treating each test result as
evidence
• Matching faults share per-test evidence
Dempster-Shafer
[Figure: Two tests assign evidence to candidate faults. T1 distributes mass
among A, B, C, and Ø ("no candidate matches"); T2 distributes mass among B, D,
and Ø. Combining the evidence produces multiplets such as {A,B}, {B}, {B,C},
{A,D}, {B,D}, and {C,D}.]
{B} is a minimal covering
{A,D} is an alternative covering
Acceptable coverings (complete & non-redundant): {B}, {A,D}, {C,D}
Creating & Scoring Multiplets
• A multiplet is a set of faults that cover all
failing tests
• Multiplets and scores are created by the Dempster-Shafer (DS) method
– Non-complete combinations are dropped
– Redundant combinations are dropped
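
The Python sketch below illustrates the flavor of this construction under two
simplifying assumptions (union-based combination of per-test evidence, and
naive pruning); the mass values are invented and are not iSTAT's actual
parameters.

```python
# Sketch of multiplet creation in the spirit of Dempster-Shafer: each
# failing test assigns mass to the faults that match it, plus the empty
# set (no candidate explains the test). Combining two tests takes the
# union of implicated sets, so a multiplet's score is the product of
# the evidence it inherits from each test. Masses here are invented.
from itertools import product

def combine(m1, m2):
    """m1, m2: dict mapping frozenset(faults) -> mass."""
    out = {}
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        u = s1 | s2
        out[u] = out.get(u, 0.0) + v1 * v2
    return out

def drop_redundant(masses):
    """Drop multiplets that strictly contain a smaller surviving one."""
    keep = {}
    for s in sorted(masses, key=len):
        if not any(k < s for k in keep):
            keep[s] = masses[s]
    return keep

# T1 implicates faults A, B, C; T2 implicates B, D.
t1 = {frozenset("A"): 0.3, frozenset("B"): 0.3, frozenset("C"): 0.3, frozenset(): 0.1}
t2 = {frozenset("B"): 0.45, frozenset("D"): 0.45, frozenset(): 0.1}
combined = combine(t1, t2)
# Drop non-complete combinations, approximated here as any set that
# fails to cover both tests (a real implementation would track which
# evidence came from the empty set).
complete = {s: v for s, v in combined.items()
            if s and s & set("ABC") and s & set("BD")}
for s, v in sorted(drop_redundant(complete).items(), key=lambda kv: -kv[1]):
    print(set(s), round(v, 4))
# {'B'} 0.21, {'A','D'} 0.135, {'C','D'} 0.135 -- the acceptable coverings
```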
Matching Passing Tests
• Lack of failures can indicate:
– Unsensitized faults
– Interference of propagation of fault effects
• Considering passing tests is important
for resolution:
[Figure: A sa1 vs. C sa1: output fault failures swamp input faults!]
• Passing test evidence is split among
correct predictors, but with high p(Ø)
Matching Complex Failures
• Complex failures can indicate:
– Interference of fault-effect propagation
– Multiple faults
• iSTAT uses a conservative approach: it splits evidence among matching
multiplets and uses a somewhat higher p(Ø)
[Figure: matching vs. non-matching predicted and observed failing outputs.]
Experimental Results
• Simulated defects in industrial design
• Different defect types:
– Single and multiple stuck-ats, bridges
– Multiple faults on one net, gate, and path
• Compared to SLAT algorithm:
– Ave. SLAT diagnosis size: 10.75 multiplets
– Ave. iSTAT diagnosis size: 2.75 multiplets
Experimental Results
• Industrial circuit from TI
• 14 bridge defects introduced by focused
ion beam (FIB)
• Comparison to leading industrial
diagnosis tool (Fastscan - W&L):
– Fastscan: 2 full successes, 9 partial, 3 failures
– iSTAT: 5 full successes, 9 partial, 0 failures
Second Stage Fault Diagnosis
• Problem: Multiplets are hard to use
– A collection of seemingly-unrelated faults
– Don’t relate to any common defect
mechanisms
• Solution: Implicate plausible fault
models and localize fault sets
– Suggest possible defect mechanisms
– Implicate physical or logical circuit areas
The Value of Fault Models
• Relate to known defect mechanisms:
– Shorts to power or ground
– Signal-to-signal shorts
– Opens or breaks
• Increases diagnostic precision
– Physical FA knows what to look for
– Can compare simulation to behavior
Plausibility Metrics
• Search for correlations between faults
• What do the faults in a multiplet have in
common?
– Common gate, wire, path
• What is the upper probability limit that a
multiplet represents a common fault?
– Bridges, transition faults, gate faults
• First-order estimates: how reasonable
to pursue fault model(s)
Proximity Metrics
• How close/related are the faults in a
multiplet?
– Physical distance between implicated nodes
– Bounding box of implicated wires
– Logical distance between node faults
• May be the most important metric, since physical FA is so limited
Experimental Results
• Same 20 simulated defects from before
• 6 possible multiplet classifications:
– Single stuck-at, 2-line bridging fault
– Common node, net, gate, and path
• 10 are a perfect match to the defect
• 5 match stuck-at: test dependent
• 5 are multiple defects & low correlations
Third Stage Fault Diagnosis
• Problem: Need multiple fault models
– Prior art algorithms all target 1 fault model
– What happens if defect doesn’t match?
• Solution: Mixed-model fault diagnosis
– Apply an arbitrary number of fault models
– Pick the best candidate, regardless of model
– More models means more precision and
more robustness
Probabilistic Scoring
• The scoring method used by most
algorithms is unique to the fault model
• If multiple fault model diagnosis is to
work, the scoring must be comparable
across fault models
• The most intuitive method is
probabilistic: What is the most likely
candidate, regardless of model?
Standard Scoring Example
Bridge Candidates:       Score
1. Bridge 12 @ 57        100.98.90
2. Bridge 21 @ 205       96.90.82
3. Bridge 12 @ 114       86.78.72

Stuck-at Candidates:     Score
1. Node 12 sa 0          92n 70m
2. Node 205 sa 1         81n 60m
3. Node 19 sa 1          71n 72m
Probabilistic Scoring Example
Fault Candidates:        Probability
1. Bridge 12 @ 57        0.378
2. Node 12 sa 0          0.301
3. Bridge 21 @ 205       0.227
4. Bridge 12 @ 114       0.082
5. Node 205 sa 1         0.011
6. Node 19 sa 1          0.001
Probabilistic Fault Diagnosis
• Precedents:
– Sheppard & Simpson (VTS’96):
Comprehensive system-level diagnosis
– Henderson & Soden (ITC’97): Probabilistic
physical failure analysis
– Thibeault (VTS’97): Max. likelihood
estimation for IDDQ diagnosis
• Probability becomes the common
component for diagnosis and FA
Bayes Decision Theory
• Rate candidates by max p(Ci | B):
– Ci: Candidate i fault signature (suspect
description or explanation)
– B: Behavior fault signature (evidence or
phenomenon)
• Bayes Rule:
p(Ci | B) = p(Ci) p(B | Ci) / Σj p(Cj) p(B | Cj)
Conditional Probability
• Assuming that all per-vector (pass/fail)
predictions are independent:
p(B|Ci) = Πk p(bk|ck)
• What is the probability that an individual
prediction is correct?
• These rates of prediction error must be
supplied for each fault model
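
A minimal sketch of this scoring, assuming a single prediction-correctness
rate per candidate (all numbers below are invented): the likelihood of the
observed behavior is the product of per-vector probabilities, and Bayes' rule
normalizes across all candidates, regardless of fault model.

```python
# Bayes scoring sketch: likelihood = product of per-vector prediction
# probabilities; posterior = normalized across all candidates.

def posterior(candidates, behavior, p_correct, prior=None):
    """candidates: {name: [predicted pass/fail bit per vector]}
    behavior: [observed pass/fail bit per vector]
    p_correct: {name: prob. that one prediction is right (per model)}"""
    prior = prior or {c: 1.0 / len(candidates) for c in candidates}
    joint = {}
    for name, pred in candidates.items():
        p = prior[name]
        for predicted, observed in zip(pred, behavior):
            p *= p_correct[name] if predicted == observed else 1.0 - p_correct[name]
        joint[name] = p
    total = sum(joint.values())
    return {name: p / total for name, p in joint.items()}

# Invented example: a bridge candidate mispredicting one vector vs. a
# stuck-at candidate mispredicting two, with model-specific error rates.
obs = [1, 1, 0, 0, 1, 0]
cands = {"bridge 12@57": [1, 1, 0, 0, 1, 1], "node 12 sa0": [1, 0, 0, 0, 0, 0]}
print(posterior(cands, obs, {"bridge 12@57": 0.9, "node 12 sa0": 0.95}))
```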
Diagnosis is Inherently Probabilistic!
• Prediction error rates are required but
unknown: not enough statistics
• But, diagnosis systems specify error
rates all the time, only implicitly:
– How much error is ok?
– How important is one type of error vs.
another?
• Turn judgements into explicit parameters!
Three-Model Diagnosis System
• Use stuck-at candidates to diagnose
shorts to VDD/GND, or “charged” opens
• Use inexpensive bridging fault
candidates for signal-line shorts
• Use node faults for transition defects
and dominance bridges
• Probabilistic scoring system
differentiates and ranks all candidates
Industrial Experiment
• Hewlett-Packard ASIC
• Defects inserted by Focused Ion Beam:
12 total shorts to power or ground, 9
signal shorts, 4 open (break) defects
• Only a pass/fail stuck-at dictionary used
• No IFA or realistic bridging faultlist was
available
Results
• Stuck-at results: 9 successes, 1 partial
(bridge ranked highest), 2 ambiguous
(>100 equiv. candidates)
• Bridging results: 6 successes, 3 partial
• Open results: 3 successes (stuck-at
candidates), 1 failure (top candidates
were stuck-at of unknown relation to
open site)
Experimental Detail
[Figure: FIB short between a strong inverter and a weak inverter; the top
candidate is a stuck-at fault on this node.]
Outline
• Background & Philosophy
• Three-Stage Fault Diagnosis
• IDDQ Fault Diagnosis
• Small Fault Dictionaries
• Conclusions and Future Work
IDDQ Testing
• IDDQ testing measures quiescent current
to infer the presence of defects
• IDDQ fault diagnosis has the advantage of
not relying on propagation through logic
• But, how much quiescent current
indicates a defect?
• Modern chips are inherently noisy
IDDQ Diagnosis
• Problem: How to conduct diagnosis
when failure is ambiguous?
• Solution: Probabilistic IDDQ diagnosis
• Prior Art:
– Aitken, Chakravarty & Liu - require definite
pass/fail measures
– SEMATECH experiment - repeatedly adjust
pass/fail thresholds
Threshold Setting
[Figure: per-vector IDDQ measurements, ranging from 0 to 180, plotted against
vector number (0 to 200); no single pass/fail threshold is obvious.]
Threshold Setting (cont.)
• There are many possible valid
thresholds for diagnosis
• SEMATECH method: repeatedly
adjust threshold until an exact match
is found in the fault dictionary
Threshold Setting, SEMATECH
[Figure: the same IDDQ plot with candidate thresholds tried at several levels;
most settings produce no match in the fault dictionary, while one produces an
exact match.]
Improving the Methods
• First, use probabilistic diagnosis to
eliminate the need for exact
matches
• Second, automate the setting of
pass-fail thresholds
Conditional Probability (cont.)
• If a high current state is predicted for a vector:
p(b|c) = p(observed IDDQ | predicted fail)
       = p(failing IDDQ | obs. IDDQ) p(failing IDDQ | pred. fail)    [= 1]
       + p(passing IDDQ | obs. IDDQ) p(passing IDDQ | pred. fail)    [= 0]
       = p(failing IDDQ | obs. IDDQ)
• If low current is predicted:
p(b|c) = p(passing IDDQ | obs. IDDQ)
Threshold Scenarios
• Different diagnostic situations mandate
different ways of setting thresholds
• A predetermined threshold may be
available
• More or less information may be
available about expected IDDQ values
Scenario 1: Perfect Knowledge
• If a threshold has been established
for test, it may be used for diagnosis
• The simplest approach is to assume
this threshold cleanly divides
passing from failing IDDQ values
• This approach lacks robustness
Perfect Knowledge Clustering
[Figure: IDDQ readings with the test threshold applied directly: p(defect)
steps from 0 below the threshold to 1 above it.]
Perfect Knowledge, Improved
[Figure: the same readings with an improved mapping: p(defect) transitions
gradually from 0 to 1 around the threshold instead of stepping.]
Scenario #2: Statistical Knowledge
• The IDDQ of the part may be statistically characterized before testing &
diagnosis
• A technique by Maxwell, et al. (ITC'99) uses the ratio of max to min IDDQ
to establish a 3σ acceptance limit
• Expected (good circuit) distribution can be used as the probability of no
defect, p(no defect)
Statistical Knowledge
[Figure: IDDQ readings compared against the expected good-circuit
distribution; readings within the distribution have high p(no defect).]
Scenario #3: Zero Knowledge
• Use Gattiker-Maly current signatures to identify IDDQ signature clusters
• Each cluster is associated with a defect-induced current path
• Intra-cluster variations are normally distributed
• The lowest cluster defines p(obs. value | pass); other clusters define
p(obs. value | fail)
Zero Knowledge Clustering
[Figure: current signature: per-vector IDDQ (µA, 0 to 200) sorted in
increasing vector order; the readings form distinct clusters, with the lowest
cluster taken as the passing distribution.]
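
A rough Python sketch of this zero-knowledge approach (the gap threshold and
the readings are invented for illustration): sort the per-vector IDDQ
readings, split at large gaps to form clusters, and take the lowest cluster as
the passing distribution.

```python
# Current-signature clustering sketch in the spirit of Gattiker-Maly:
# sorted IDDQ readings are split into clusters at large gaps; the
# lowest cluster models p(obs | pass), the rest model p(obs | fail).

def cluster_iddq(readings_ua, gap_ua=15.0):
    """Return clusters (lists of readings), lowest current first."""
    ordered = sorted(readings_ua)
    clusters, current = [], [ordered[0]]
    for value in ordered[1:]:
        if value - current[-1] > gap_ua:   # a large gap starts a new cluster
            clusters.append(current)
            current = [value]
        else:
            current.append(value)
    clusters.append(current)
    return clusters

readings = [12, 14, 13, 55, 52, 58, 11, 103, 99, 15, 54, 101]
clusters = cluster_iddq(readings)
passing, failing = clusters[0], clusters[1:]
print("pass cluster:", passing, "| fail clusters:", failing)
```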
Experimental Results
• We re-ran the SEMATECH experiments
on 16 parts with thorough FA
• In 15 of 16 cases our method was able
to automatically assign thresholds for
successful diagnoses
• In the remaining case, the SEMATECH
diagnosis was a failure; our diagnosis
resulted in a partial success
Outline
•
•
•
•
•
Background & Philosophy
Three-Stage Fault Diagnosis
IDDQ Fault Diagnosis
Small Fault Dictionaries
Conclusions and Future Work
Small Fault Dictionaries
• Problem: Full fault dictionaries are prohibitively large
– Full-response format gives good diagnostic resolution, but is huge
– Pass-fail format is smaller, but gives less resolution
• Solution: Output-compaction and clustered dictionaries
The Full-Response Dictionary
• For each fault ( f ), store the response to
each test vector ( v )
• One bit per output per vector, pass ( 0 ) or fail ( 1 )
• For each vector, store the expected
output response ( o )
• Total storage requirement: f × v × o bits
The Pass-Fail Dictionary
• For each fault, store only the test vector
responses
• One bit per vector, pass ( 0 ) or fail ( 1 )
• Total storage requirement: f × v bits
• Much smaller than full-response, and
often practical for even very large
circuits
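
A quick arithmetic check of the two formulas (the circuit below is invented;
note that for a real circuit the output count o falls out as the ratio of the
two sizes, e.g. 191,142 / 27,306 = 7 outputs for C432 in the table that
follows).

```python
# Dictionary size arithmetic: full-response stores one bit per fault,
# vector, and output; pass-fail stores one bit per fault and vector.
faults, vectors, outputs = 1000, 400, 32          # invented example circuit
full_response_bits = faults * vectors * outputs   # f x v x o
pass_fail_bits = faults * vectors                 # f x v
print(full_response_bits, pass_fail_bits)         # 12800000 400000
print(full_response_bits // pass_fail_bits)       # 32 (= o)
```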
Pass-Fail vs. Full-Response
Circuit   Full-Resp.        Full-Resp.    Pass-Fail         Pass-Fail
          Faults Ranked #1  Bits          Faults Ranked #1  Bits
C432      2.29                  191,142   2.80                 27,306
C499      1.17                  901,120   1.17                 28,160
C880      1.61                1,512,864   1.66                 56,032
C1355     1.67                2,936,192   1.71                 91,756
C1908     1.82                2,966,700   1.99                118,668
C2670     2.24               18,141,184   3.04                283,456
C3540     2.03                8,963,724   2.10                407,442
C5315     1.89               73,878,966   2.07                600,642
C6288     1.33                7,207,680   1.40                225,240
C7552     1.63              131,224,800   2.12              1,226,400
Ind-A     2.74                232.84 Gb   5.49               15.52 Mb
Ind-B     2.33                929.42 Gb   2.91               46.37 Mb
Ind-C     2.51                297.86 Gb   51.0               21.27 Mb
Ind-D     1.91                  9.08 Gb   2.86                333,822
Dictionary Encodings
• To reduce dictionary size, researchers
have looked at:
– What data is included
– How the data is organized
– How the data is stored or encoded
• Pass-fail and drop-on-k address the first issue
• Boppana, Hartanto & Fuchs (VTS'96) addressed the second
• Data compression addresses the third
What About Data Compression?
• However the data is chosen and encoded, it
can still be compressed afterward (zip, etc.)
• But, compressed dictionaries must be
uncompressed for use!
• We are (mainly) concerned here with what
data is included, not how it is encoded
• Can we create “compressed” fault
signatures that are usable as-is?
Output-Compacted Signatures
[Figure: full-response signature for one fault, with two OR-compactions.]

       O1  O2  O3  O4  O5  O6  O7  O8  O9 | PF
V1      0   0   1   1   0   0   0   0   0 |  1
V2      0   0   1   1   0   0   0   0   1 |  1
V3      0   0   0   0   0   0   0   0   0 |  0
V4      0   0   0   0   0   0   0   0   0 |  0
V5      0   0   0   0   0   0   0   0   0 |  0
V6      0   0   0   0   0   0   0   0   0 |  0
V7      0   0   0   1   0   0   0   0   0 |  1
V8      0   0   0   0   0   0   0   0   0 |  0
V9      0   0   1   1   1   0   0   0   1 |  1
OC      0   0   1   1   1   0   0   0   1

The PF column is the OR of each vector's row (the pass-fail signature); the
OC row is the OR of each output's column (the output-compacted signature).
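
A small Python sketch of both compactions applied to the example above: the
PF bit for a vector is the OR of its row, and the OC bit for an output is the
OR of its column.

```python
# Deriving pass-fail (PF) and output-compacted (OC) signatures from a
# full-response signature (rows = vectors V1..V9, cols = outputs O1..O9).
full = [
    [0, 0, 1, 1, 0, 0, 0, 0, 0],        # V1
    [0, 0, 1, 1, 0, 0, 0, 0, 1],        # V2
    [0] * 9, [0] * 9, [0] * 9, [0] * 9, # V3-V6
    [0, 0, 0, 1, 0, 0, 0, 0, 0],        # V7
    [0] * 9,                            # V8
    [0, 0, 1, 1, 1, 0, 0, 0, 1],        # V9
]
pf = [int(any(row)) for row in full]        # one bit per vector
oc = [int(any(col)) for col in zip(*full)]  # one bit per output
print(pf)  # [1, 1, 0, 0, 0, 0, 1, 0, 1]
print(oc)  # [0, 0, 1, 1, 1, 0, 0, 0, 1]
```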
Reconstructing a Full Signature
[Figure: reconstruction of an approximate full-response signature from the
pass-fail (PF) and output-compacted (OC) signatures. Each failing vector
(PF = 1) is assumed to fail at every output marked in the OC signature; each
passing vector is assumed to fail nowhere. For the example above, V1, V2, V7,
and V9 are each assigned the OC pattern 0 0 1 1 1 0 0 0 1.]
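
A sketch of this reconstruction under the interpretation above: every failing
vector is credited with the full OC failure pattern, giving a superset
approximation of its true failing outputs, which suits the probabilistic
(inexact) matching used elsewhere in this work.

```python
# Reconstruct an approximate full-response signature from PF and OC:
# failing vectors (PF = 1) are assumed to fail at every OC-marked
# output; passing vectors are assumed to fail nowhere.
pf = [1, 1, 0, 0, 0, 0, 1, 0, 1]
oc = [0, 0, 1, 1, 1, 0, 0, 0, 1]
approx = [[bit if fails else 0 for bit in oc] for fails in pf]
for vec, row in enumerate(approx, start=1):
    print(f"V{vec}:", row)   # V1, V2, V7, V9 get the OC pattern
```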
Results with Output Signatures
Circuit   Full-Resp.        Pass-Fail         Pass-Fail + Output Sig.
          Faults Ranked #1  Faults Ranked #1  Faults Ranked #1
C432      2.29              2.80              2.32
C499      1.17              1.17              1.17
C880      1.61              1.66              1.61
C1355     1.67              1.71              1.69
C1908     1.82              1.99              1.82
C2670     2.24              3.04              2.24
C3540     2.03              2.10              2.03
C5315     1.89              2.07              1.89
C6288     1.33              1.40              1.34
C7552     1.63              2.12              1.63
Ind-A     2.74              5.49              2.87
Ind-B     2.33              2.91              2.34
Ind-C     2.51              51.0              2.51
Ind-D     1.91              2.86              1.91
Clustering Output Signatures
• Relations between faults can lead to
identical output signatures, but also nearly
identical signatures
• Similar signatures can be collapsed into a
single signature if some loss is ok
• Data loss equals precision loss at diagnosis
• A probabilistic (inexact) matching algorithm
assures no loss of accuracy
Clustered Output Signatures
Output Signature:            0010101000000000000010011
Clustered Output Signature:  11001
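
A sketch of collapsing an output signature onto output clusters, where each
cluster contributes the OR of its members' bits. The grouping below (five
clusters of five consecutive outputs) is invented, though it does collapse the
example signature to 11001; the real clustering groups outputs with similar
signatures.

```python
# Clustered output signature sketch: one bit per output cluster,
# the OR of the member outputs' bits. Cluster assignment is invented.

def clustered_signature(output_sig, clusters):
    return [int(any(output_sig[i] for i in members)) for members in clusters]

sig = [int(b) for b in "0010101000000000000010011"]
clusters = [range(0, 5), range(5, 10), range(10, 15), range(15, 20), range(20, 25)]
print("".join(map(str, clustered_signature(sig, clusters))))  # 11001
```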
Results - Clustered Signatures
(1000 bits/output signature)
Circuit  Outputs  Full-Resp.  Pass-Fail  Pass-Fail+    Pass-Fail+Output Sig.
                  Faults      Faults     Output Sig.   Clustered
                  Ranked #1   Ranked #1  Faults Rk #1  Faults Ranked #1
Ind-A    15,000   2.74        5.49       2.87          2.87
Ind-B    20,042   2.33        2.91       2.34          2.34
Ind-C    14,003   2.51        51.0       2.51          6.21
Ind-D    27,193   1.91        2.86       1.91          2.03
Outline
• Background & Philosophy
• Three-Stage Fault Diagnosis
• IDDQ Fault Diagnosis
• Small Fault Dictionaries
• Conclusions and Future Work
Conclusions
• Fault diagnosis is inherently
probabilistic
– Based on human judgement, not statistics
– Data and problem are messy and
unpredictable
• It can be made practical by certain
techniques and iteration
• It can be as precise as desired
Future Work
• Address delay-inducing
defects
– Failures are not static but
temporal & sensitive to
various conditions
– Increasingly important
• Expand on physical correlation of faults
– Handle very complex defects
– Physical diagnosis?
• More experiments on production fails
Analysis of Multiple Faults
• A multiplet may represent multiple faults
• Low plausibility scores over all fault
classes may imply multiple faults
• Select for high-correlation subsets:
– Fault class cardinality (n-choose-2 for
bridges)
– High proximity correlation
Output Signature Dictionary Sizes
Circuit   Full-Resp.     Pass-Fail    Pass-Fail +
          Bits           Bits         Output Sig. Bits
C432          191,142       27,306        29,637
C499          901,120       28,260        42,240
C880        1,512,864       56,032        70,720
C1355       2,936,192       91,756       117,740
C1908       2,966,700      118,668       135,718
C2670      18,141,184      283,456       371,520
C3540       8,963,724      407,442       441,014
C5315      73,878,966      600,642       926,100
C6288       7,207,680      225,240       345,368
C7552     131,224,800    1,226,400     1,585,920
Ind-A       232.84 Gb     15.52 Mb     309.51 Mb
Ind-B       929.42 Gb     46.37 Mb     420.24 Mb
Ind-C       297.86 Gb     21.27 Mb     319.13 Mb
Ind-D         9.08 Gb      333,822      66.11 Mb
Storing Output Signatures Once
• “Related” faults often propagate failures to
the same set of outputs:
– Faults on the same net
– Faults in the same functional block
– Faults logically close to outputs
• This results in a lot of redundancy in output
signatures
• We can save space by only storing a
particular signature once and then indexing
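
A minimal sketch of this store-once-and-index idea (hypothetical names): keep
a table of distinct output signatures and a small per-fault index into it.

```python
# Deduplicate output signatures: store each distinct signature once,
# and give every fault an index into the unique-signature table.

def dedup_signatures(fault_to_sig):
    unique, index, seen = [], {}, {}
    for fault, sig in fault_to_sig.items():
        key = tuple(sig)
        if key not in seen:
            seen[key] = len(unique)
            unique.append(sig)
        index[fault] = seen[key]
    return unique, index

sigs = {"f1": [0, 0, 1, 1, 1], "f2": [0, 0, 1, 1, 1], "f3": [1, 0, 0, 0, 1]}
unique, index = dedup_signatures(sigs)
print(len(unique), index)   # 2 {'f1': 0, 'f2': 0, 'f3': 1}
```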
Only Unique Output Sigs
Circuit   Full-Resp.     Pass-Fail    Pass-Fail +       Pass-Fail +
          Bits           Bits         Output Sig.       Output Sig.
                                      Raw Bits          Unique Bits
C432          191,142       27,306        29,637            29,591
C499          901,120       28,260        42,240            41,784
C880        1,512,864       56,032        70,720            64,191
C1355       2,936,192       91,756       117,740           112,696
C1908       2,966,700      118,668       135,718           132,756
C2670      18,141,184      283,456       371,520           307,456
C3540       8,963,724      407,442       441,014           429,074
C5315      73,878,966      600,642       926,100           710,496
C6288       7,207,680      225,240       345,368           268,242
C7552     131,224,800    1,226,400     1,585,920         1,322,702
Ind-A       232.84 Gb     15.52 Mb     309.51 Mb          96.06 Mb
Ind-B       929.42 Gb     46.37 Mb     420.24 Mb         159.51 Mb
Ind-C       297.86 Gb     21.27 Mb     319.13 Mb          55.46 Mb
Ind-D         9.08 Gb      333,822      66.11 Mb          17.27 Mb
Low-Resolution Diagnosis
• Clustering outputs to an arbitrary number of
bits suggests a direction for diagnosis
• Reduce dictionaries to an arbitrarily small
size, especially for very large circuits
• The resolution will depend upon data size
• Use small dictionaries for initial gross
diagnosis
• Iterate through larger and more detailed
dictionaries