Transcript day9

CS137:
Electronic Design Automation
Day 9: October 17, 2005
Fault Detection
CALTECH CS137 Fall2005 -- DeHon
Today
• Faults in Logic
• Error Detection Schemes
• Optimization Problem
Problem
• Gates, wires, memories:
– built out of physical media
– may fail
Device Physics
• Represent a 1 or 0 with charge
– On a gate, in a memory
• Charge may be disrupted
– α-particle (other ionizing particles)
– Ground bounce
– Noise coupling
– Tunneling
– Thermal noise
– Behavior of individual electrons is statistical
DRAMs
• Small cells
• Store charge dynamically on capacitor
• Store about 50,000 electrons
• Must be refreshed
– Data leaks away through parasitic resistance
• α-particle can be 1,000,000 carriers?
System Reliability
• Devices fail with probability $P_{fail}$
• Have $N$ components in system
• All must work for device to work
• $P_{sys} = (1 - P_{fail})^N$
$P_{sys} = 1 - N P_{fail} + \binom{N}{2} P_{fail}^2 - \binom{N}{3} P_{fail}^3 + \ldots$
System Reliability
N 2
Psys  1  N  Pfail     Pfail
2
N 3
    Pfail  ...
3
• If NPfail << 1
 NPfail dominates higher order terms…
Psys  1  N  Pfail
CALTECH CS137 Fall2005 -- DeHon
System Reliability
$P_{sys} \approx 1 - N P_{fail}$
• $P_{sysfail} \approx N P_{fail}$
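The first-order approximation can be checked numerically. The following is a minimal sketch, assuming independent component failures; the function names and the sample values of N and Pfail are illustrative and do not come from the lecture.

```python
import math

def p_sys_exact(n, p_fail):
    """Probability that all n independent components work: (1 - p_fail)**n."""
    # log1p keeps precision when p_fail is tiny
    return math.exp(n * math.log1p(-p_fail))

def p_sys_approx(n, p_fail):
    """First-order approximation, intended for n * p_fail << 1."""
    return 1.0 - n * p_fail

# Illustrative values: the approximation degrades as N*Pfail approaches 1.
for n, p_fail in [(1000, 1e-6), (1000, 1e-4), (1000, 1e-3)]:
    exact = p_sys_exact(n, p_fail)
    approx = p_sys_approx(n, p_fail)
    print(f"N*Pfail={n * p_fail:g}: exact={exact:.6f} approx={approx:.6f}")
```

The gap between the two columns is roughly the dropped second-order term, which is why the approximation is only used when N·Pfail is much less than 1.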
Modern System
• 100 Million to 1 Billion Transistors
– Not to mention wiring…
• > GHz = > 1 Billion Transitions / sec.
• $N = 10^{18}$ per second…
$P_{sys} \approx 1 - N P_{fail}$
As we scale?
• $N$ increases
• Charge/gate decreases
– Fewer electrons
– Higher probability they wander
– Greater variability in behavior
• Voltage levels decrease
– Smaller barriers
• Greater variability in device parameters
$P_{fail}$ increases
$P_{sys} \approx 1 - N P_{fail}$
Exacerbated at Nanoscale
• Small numbers of dopants (10s)
– High variability
• Small numbers of electrons (10-1000s?)
– High variability
– Highly susceptible to noise
• Small number of molecules
– May break, decay…
What do we do about it?
• Tolerate faulty components
• Detect faults
– Not do anything bad
– Try it again: a statistically unlikely error has a high likelihood of not recurring.
• …Focus on detection…
Detect Faults
• Key Idea: redundancy
• Include enough redundancy in computation
– Can tell that an error occurred
What kind of redundancy can we use?
• Multiple copies of logic
• Compute something about result
– Parity on number of outputs
– Count of number of 1’s in output
Error Detection
What do we protect against?
• Any n errors
– Worst-case selection of errors
Single Error Detection
• If $P_{fail}$ small:
– No error: $(1 - P_{fail})^N \approx 1 - N P_{fail}$
– One error: $N P_{fail} (1 - P_{fail})^{N-1} \approx N P_{fail}$
– Two errors: $\frac{N(N-1)}{2} P_{fail}^2 (1 - P_{fail})^{N-2}$
• Probability of an error going undetected
– For $N P_{fail} \ll 1$
– Goes from $\approx N P_{fail}$ to $\approx (N P_{fail})^2$
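The binomial terms above can be evaluated directly to see how much single-error detection buys. This is a minimal sketch assuming independent failures and a detector that catches every single error but no multiple error; `p_k_errors` and the sample values are illustrative only.

```python
from math import comb

def p_k_errors(n, p_fail, k):
    """Binomial probability of exactly k failures among n independent components."""
    return comb(n, k) * p_fail**k * (1.0 - p_fail) ** (n - k)

n, p_fail = 10_000, 1e-7          # illustrative values, N*Pfail = 1e-3

# Unprotected: any error at all goes undetected.
p_any = 1.0 - p_k_errors(n, p_fail, 0)
# Single-error detection: only two or more errors can escape.
p_multi = p_any - p_k_errors(n, p_fail, 1)

print(f"N*Pfail       = {n * p_fail:.3e}")
print(f"P(any error)  = {p_any:.3e}")
print(f"P(>=2 errors) = {p_multi:.3e}")
print(f"(N*Pfail)^2   = {(n * p_fail) ** 2:.3e}")
```

The exact two-or-more probability is about half of (N·Pfail)², since the leading term is N(N-1)/2 · Pfail², so (N·Pfail)² serves as a slightly pessimistic estimate.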
Single Error Detection (Example)
• Probability of an error going undetected
– For $N P_{fail} \ll 1$
– Goes from $\approx N P_{fail}$ to $\approx (N P_{fail})^2$
• $N = 10^{10}$, $P_{fail} = 10^{-20}$
• $N P_{fail} = 10^{-10} \ll 1$ → ~$10^{10}$ cycles MTTF
– Mean Time To Failure
– At 1 GHz = 10 s
• $(N P_{fail})^2 = 10^{-20}$ → $10^{20}$ cycles MTTUF
– Mean Time To Undetected Fault
– $10^{11}$ s ≈ 3000 years
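The arithmetic in this example can be reproduced in a few lines. The sketch below assumes a per-cycle failure probability p, so the mean number of cycles to the first event is 1/p; the constant and function names are illustrative.

```python
CLOCK_HZ = 1e9               # 1 GHz, as in the example
SECONDS_PER_YEAR = 3.156e7   # approximate

def mean_cycles_to_event(p_per_cycle):
    """Mean of a geometric distribution with per-cycle probability p."""
    return 1.0 / p_per_cycle

n, p_fail = 1e10, 1e-20
p_unprotected = n * p_fail            # any error is a failure
p_undetected = (n * p_fail) ** 2      # double error slips past single-error detection

mttf_seconds = mean_cycles_to_event(p_unprotected) / CLOCK_HZ
mttuf_seconds = mean_cycles_to_event(p_undetected) / CLOCK_HZ
mttuf_years = mttuf_seconds / SECONDS_PER_YEAR

print(f"MTTF  = {mttf_seconds:.1f} s")
print(f"MTTUF = {mttuf_seconds:.2e} s = {mttuf_years:.0f} years")
```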
Detection Overhead
• …but: Correction and detection circuitry increase circuit size.
• $N_{detect} > N_{logic}$
• $N_{detect} = c N_{logic}$
• Probability of an error going undetected
– Goes from $\approx N P_{fail}$ to $\approx (c N P_{fail})^2$
– To come out ahead, want: $c^2 \ll 1/(N P_{fail})$
• $c = 3$, $N = 10^{10}$, $P_{fail} = 10^{-20}$
– $(c N P_{fail})^2 = 9 \times 10^{-20}$ → ~$10^{19}$ cycles MTTUF
– $10^{10}$ s ≈ 300 years
Detection Overhead
• …but: Correction and detection circuitry increase circuit size.
• $N_{detect} > N_{logic}$
• $N_{detect} = c N_{logic}$
• Probability of an error going undetected
– Goes from $\approx N P_{fail}$ to $\approx (c N P_{fail})^2$
– To come out ahead, want: $c^2 \ll 1/(N P_{fail})$
• $c = 3$, $N = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$
– $N P_{fail} = 0.3$
– $(c N P_{fail})^2 = 0.81$ → worse
– Neither workable!
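The break-even condition can be tested against both examples. This is a minimal sketch using the slide's approximations (which are only rough once N·Pfail is no longer small); the function names are hypothetical.

```python
def undetected_probabilities(c, n, p_fail):
    """Return (unprotected, protected) per-cycle undetected-error estimates."""
    unprotected = n * p_fail                # no detection: any error escapes
    protected = (c * n * p_fail) ** 2       # detection logic is c times larger
    return unprotected, protected

def detection_helps(c, n, p_fail):
    """True when protection lowers the estimate, i.e. c**2 < 1 / (n * p_fail)."""
    unprotected, protected = undetected_probabilities(c, n, p_fail)
    return protected < unprotected

for c, n, p_fail in [(3, 1e10, 1e-20), (3, 3e10, 1e-11)]:
    unprotected, protected = undetected_probabilities(c, n, p_fail)
    verdict = "helps" if detection_helps(c, n, p_fail) else "worse"
    print(f"N*Pfail={unprotected:.1e}  (cN*Pfail)^2={protected:.2e}  -> {verdict}")
```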
Reliability Tuning
• Want NPfail small
– Want: (cNPfail )2 very small
• Idea:
– Guard subsystems independently
– Make Ns suitably small
– Smaller probability there is a double error
localized in this small subsystem
• That is: as long as compartmentalization
guarantees very small (cNsPfail )2:
– can reduce to single detect case.
CALTECH CS137 Fall2005 -- DeHon
Guarding Subsystems
Composing Subsystems
• $P_{sysundetect} = (N_{sys}/N_s) \, P_{subundetect}$
• $P_{subundetect} = (c N_s P_{fail})^2$
• $P_{sysundetect} = (N_{sys}/N_s) (c N_s P_{fail})^2$
• $P_{sysundetect} = N_{sys} \times N_s \times (c P_{fail})^2$
• Extremes:
– $N_s = N_{sys}$: No benefit
– $N_s = 1$: Maximum benefit, factor of $N_{sys}$
[in practice $c = f(N_s)$]
Composing Subsystems
• $P_{sysundetect} = N_{sys} \times N_s \times (c P_{fail})^2$
• Example: $c = 3$, $N_{sys} = 3 \times 10^{10}$, $P_{fail} = 10^{-11}$
• $N_s = 10^3$
• $3 \times 10^{10} \times 10^3 \times (3 \times 10^{-11})^2$
• $3^3 \times 10^{-9} \approx 3 \times 10^{-8}$ ($\ll 0.81$)
• Still < 1 s MTTUF …
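The effect of the subsystem size can be explored by sweeping $N_s$. The sketch below holds c constant for simplicity, although the slides note that in practice c depends on $N_s$; the function name is illustrative.

```python
def p_sys_undetect(n_sys, n_s, c, p_fail):
    """Undetected-error estimate when guarding subsystems of size n_s independently."""
    n_blocks = n_sys / n_s                  # number of guarded subsystems
    p_sub = (c * n_s * p_fail) ** 2         # double error inside one subsystem
    return n_blocks * p_sub                 # equals n_sys * n_s * (c * p_fail)**2

n_sys, c, p_fail = 3e10, 3, 1e-11           # values from the example
for n_s in (n_sys, 1e6, 1e3, 1):
    print(f"Ns={n_s:.0e}: Psysundetect={p_sys_undetect(n_sys, n_s, c, p_fail):.2e}")
```

With $N_s = N_{sys}$ the result is the unguarded 0.81, and each factor-of-ten reduction in $N_s$ reduces the estimate by the same factor.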
Problem
Motivates Problem:
• Generate logic capable of detecting any single error
Terminology
• Fault-secure: system never produces an incorrect code word
– Either produces correct result
– Or detects the error
• Self-testing: for every fault, there is some input that produces a non-code-word output
– That detects the error
Terminology
• Totally Self Checking: system is both fault-secure and self-testing.
Duplication
Detects any single fault (even in checker)
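Duplicate-and-compare can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the logic block is a 1-bit full adder chosen only for illustration, a fault is modeled as a single flipped output bit in one copy, and faults inside the checker itself are not modeled.

```python
from itertools import product

def logic(a, b, cin):
    """Illustrative logic block: 1-bit full adder returning (sum, carry)."""
    return (a ^ b ^ cin, (a & b) | (cin & (a ^ b)))

def duplicate_and_compare(inputs, fault=None):
    """Run two copies of the logic; fault=(copy, bit) flips one output bit."""
    outs = [list(logic(*inputs)), list(logic(*inputs))]
    if fault is not None:
        copy, bit = fault
        outs[copy][bit] ^= 1
    # One xor per output pair, or-reduced into a single error flag.
    error = any(x ^ y for x, y in zip(outs[0], outs[1]))
    return tuple(outs[0]), error

all_detected = all(
    duplicate_and_compare(inputs, fault)[1]
    for inputs in product((0, 1), repeat=3)
    for fault in product((0, 1), repeat=2)     # (copy index, output bit)
)
no_false_alarms = not any(
    duplicate_and_compare(inputs)[1] for inputs in product((0, 1), repeat=3)
)
print(all_detected, no_false_alarms)
```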
Duplication
• N original gates
• Duplicate: + N
• O outputs
– O xors
– $(O/2) \times 2 \times 2 = 2O$ ors
– Total 3O gates
• Total: 2N+3O
• O<N
• 2<c<5
Duplication
• Total: 2N+3O
• O<N
• Rent's Rule: $O \sim k N^p$
– $p < 1$
• Total: $2N + 3kN^p$
• $c(N) = 2 + 3k / N^{1-p}$
– $N$ small → 5
– $N$ large → 2
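The overhead factor can be tabulated as a function of N. The sketch below assumes Rent parameters k = 1 and p = 0.5, which are illustrative choices and not values given in the lecture.

```python
def duplication_overhead(n, k=1.0, p=0.5):
    """c(N) = (2N + 3O) / N with O = k * N**p outputs (Rent's Rule)."""
    outputs = k * n**p
    return (2 * n + 3 * outputs) / n

for n in (1, 100, 10_000, 1_000_000):
    print(f"N={n:>9}: c={duplication_overhead(n):.3f}")
```

With these assumed parameters the overhead starts at 5 for N = 1 and approaches 2 as N grows, matching the two limits listed above.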
Duplication with PLA
[Figure: PLA holding the original logic and its duplicate]
PLA Duplication
• N product terms in original
• N in duplicate
• 2O product terms for matching
• $O \le N$
• $2 < c < 4$
Can we do better?
• Seems like overkill to compute twice?
Idea
• Encode so outputs have some checkable property
– E.g. parity
Will this work?
[Figure: original logic plus extra cubes that compute a parity output]
Problem
• Single fault may produce multiple output errors
How Fix?
• How do we fix?
No Logic Sharing
• No sharing
• Single fault affects single output
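A tiny example shows why sharing defeats a single parity bit. The sketch below is hypothetical: two outputs both use the product term a&b, the fault is that term stuck at 0, and the parity prediction logic is assumed fault-free.

```python
def circuit(a, b, c, d, shared, stuck_at_0):
    """Two outputs, out0 = a&b | c and out1 = a&b | d, plus a predicted parity.

    shared=True : one physical a&b term feeds both outputs.
    shared=False: each output has its own copy; the fault hits only out0's copy.
    """
    t_good = a & b
    t_bad = 0 if stuck_at_0 else t_good
    t0 = t_bad
    t1 = t_bad if shared else t_good
    outs = (t0 | c, t1 | d)
    predicted_parity = (t_good | c) ^ (t_good | d)   # separate, fault-free cubes
    return outs, predicted_parity

def error_flagged(outs, predicted_parity):
    """Parity check: flag when the observed output parity differs from the prediction."""
    return (outs[0] ^ outs[1]) != predicted_parity

inputs = (1, 1, 0, 0)   # a = b = 1 exercises the faulty term
shared_case = circuit(*inputs, shared=True, stuck_at_0=True)
private_case = circuit(*inputs, shared=False, stuck_at_0=True)
print("shared :", shared_case[0], "flagged =", error_flagged(*shared_case))
print("private:", private_case[0], "flagged =", error_flagged(*private_case))
```

With the shared term both outputs flip, the parity is unchanged, and the wrong result (0, 0) passes the check; with private copies only one output flips and the check fires.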
Parity Checking
• To check parity
– Need xor tree on outputs/parity
– $[(O+1)/2] \times 2 \times 2 = 2(O+1)$ xors
• For PLA
– xor would blow up
– Wrap multiple times
– 2 product terms per xor
– 4O product terms
nanoPLA Wrapped xor
Note: two planes here just for buffering/inversion
Better or Worse than Dual?
Design   Ins  Outs  OrigPterms  Parity  Dual
add4       9     5         135     283   240
ex1010    10    10         284     880   568
inc        7     9          29      53    58
misex1     8     7          12      40    24
rd73       7     3           7      20    10
rd84       8     4         255     389   441
sao2      10     4           9      31    14
squar5     5     8          25      38    49
z5xp1      7    10          63      96   125
(does not include checking)
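The question in the slide title can be answered per design by comparing the Parity and Dual columns. The sketch below simply re-enters the product-term counts from the table; the variable names are illustrative.

```python
# (OrigPterms, Parity, Dual) per design, copied from the table above.
pterms = {
    "add4":   (135, 283, 240),
    "ex1010": (284, 880, 568),
    "inc":    (29, 53, 58),
    "misex1": (12, 40, 24),
    "rd73":   (7, 20, 10),
    "rd84":   (255, 389, 441),
    "sao2":   (9, 31, 14),
    "squar5": (25, 38, 49),
    "z5xp1":  (63, 96, 125),
}

parity_wins = [name for name, (_, parity, dual) in pterms.items() if parity < dual]
dual_wins = [name for name, (_, parity, dual) in pterms.items() if dual < parity]

print("parity smaller:", parity_wins)
print("dual smaller:  ", dual_wins)
```

Neither scheme is uniformly better on these benchmarks: unshared parity is smaller for four designs and duplication is smaller for five.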
Can we allow sharing?
• When?
Multiple Parity Groups
• Can share logic between outputs in different parity groups
• Common error flagged in both groups
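One simple way to form parity groups is to treat outputs that share a product term as conflicting and place them in different groups. The sketch below uses greedy graph coloring for this; it is an illustrative heuristic under that sufficient condition, not necessarily the grouping method used for the results in this lecture, and the input format is hypothetical.

```python
def assign_parity_groups(term_outputs, n_outputs):
    """Greedily assign each output a parity group.

    term_outputs: list of sets; each set holds the outputs fed by one product term.
    Outputs that share a term never land in the same group, so a single faulty
    term flips at most one output per group.
    """
    conflicts = {o: set() for o in range(n_outputs)}
    for outs in term_outputs:
        for o in outs:
            conflicts[o].update(x for x in outs if x != o)

    group = {}
    for o in range(n_outputs):
        used = {group[x] for x in conflicts[o] if x in group}
        g = 0
        while g in used:          # smallest group index not used by a neighbor
            g += 1
        group[o] = g
    return group

# Hypothetical sharing pattern: term 0 feeds outputs 0 and 1, term 1 feeds 1 and 2.
terms = [{0, 1}, {1, 2}, {3}]
groups = assign_parity_groups(terms, n_outputs=4)
print(groups)
```

Here outputs 0, 2, and 3 can share one parity bit while output 1 needs a second, so two parity groups suffice for this sharing pattern.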
Multi-Parity Group Compare (AMD)
Design   grps  Mparity  Orig  Parity  Dual
add4        4      209   135     283   240
ex1010      2      822   284     880   568
inc         6       44    29      53    58
misex1      6       25    12      40    24
rd73        7       10     7      20    10
rd84        1      402   255     389   441
sao2        9       17     9      31    14
squar5      5       36    25      38    49
z5xp1       9      103    63      96   125
(does not include checking)
Best Results from Winter2004 CS137
Design   class  AMD  Orig  Parity  Dual
add4       193  209   135     283   240
ex1010       -  822   284     880   568
inc          -   44    29      53    58
misex1      23   25    12      40    24
rd73        *8   10     7      20    10
rd84       385  402   255     389   441
sao2        13   17     9      31    14
squar5      34   36    25      38    49
z5xp1        -  103    63      96   125
(does not include checking)
Better or Worse than Dual?
• Typical results from Mitra [ITC2002]
– Multi-level gate mapping to LSI std. cell library
(parity here includes multiple parity)
Admin
• Assignment #2 due Friday
• Wednesday reading online
• Friday reading handout
Big Ideas
• Low-level physics imperfect
– Statistical, noisy
• Larger number of devices → greater likelihood of faults
• Redundancy
• Self-checking circuits