ED4I: Error Detection by Diverse Data and Duplicated Instructions

Greg Bronevetsky
Background
• A code transformation system developed at the Stanford Center for Reliable Computing.
• Authors: Nahmsuk Oh, Subhasish Mitra, Edward J. McCluskey
• ED4I allows us to run a program on two slightly different inputs and still be able to compare results at the end.
Motivation
• The simplest way to detect Byzantine Faults is to run the same program on multiple processors and compare results.
• ED4I is Byzantine Fault detection for uniprocessors.
• Must take into account both temporary and permanent faults.
Definitions
• Temporary Faults – any fault that affects a processor only briefly, for about the time it takes to execute several instructions.
• Ex: Radiation hitting wires, frayed wires.
• Permanent Faults – a fault that affects a processor for a long period of time.
• Ex: Spilling Coke on the chip, cut wires.
Problem Statement
• We can detect Byzantine Failures by running each program or procedure twice and comparing the results.
• However, this does not guard against permanent faults, since a permanent fault corrupts both runs in the same way and the results still match.
• Need to make the two runs different so that the same fault will affect the results differently.
• Overhead = 100%.
Key Idea
• Let's feed two different sets of data into the program and then compare the results.
• Key Insight:
• If the program only uses arithmetic operations, we can alter the input by multiplying all input numbers by a constant.
• Then the modified output will be (real output) * (the constant).
• Thus, you can verify that the two computations succeeded AND the two computations will be affected by errors differently.
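A minimal Python sketch of the idea above, not taken from the original slides. The names `faulty_add` and `total`, the 16-bit word size and the choice of stuck bit are all hypothetical. It models a permanent stuck-at-1 fault on one adder output bit: two identical runs agree on the same wrong answer, while a run on inputs multiplied by k = -2 disagrees with k times the original result.

```python
WIDTH = 16                      # assumed word size (illustrative)
MASK = (1 << WIDTH) - 1

def to_signed(bits):
    """Interpret a WIDTH-bit pattern as a two's-complement integer."""
    return bits - (1 << WIDTH) if bits & (1 << (WIDTH - 1)) else bits

def faulty_add(a, b, stuck_bit=None):
    """Add two integers; if stuck_bit is given, that result bit is forced to 1."""
    bits = (a + b) & MASK
    if stuck_bit is not None:
        bits |= 1 << stuck_bit
    return to_signed(bits)

def total(values, stuck_bit=None):
    """A purely arithmetic 'program': sum the inputs on the (possibly faulty) adder."""
    acc = 0
    for v in values:
        acc = faulty_add(acc, v, stuck_bit)
    return acc

K = -2                          # diversity factor chosen for this example
data = [3, 5, 8]
scaled = [K * v for v in data]

# Without a fault, the scaled run is exactly K times the original run.
fault_free_ok = total(scaled) == K * total(data)

# With bit 1 of the adder output stuck at 1:
first = total(data, stuck_bit=1)      # 18, although the correct sum is 16
second = total(data, stuck_bit=1)     # 18 again: plain duplication sees no mismatch
diverse = total(scaled, stuck_bit=1)  # -30, which is not K * 18

print(first == second, diverse == K * first)   # prints: True False
```

The mismatch between `diverse` and `K * first` is what signals the fault in this toy model.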
New Program
• If we alter the input to the program, we must alter the program to work with this modified input.
• The transformation is given the constant k (called the “diversity factor”) and it creates the “k-factor diverse program”.
• The new program will have the same control flow graph as the old program, but all the variables will be k-multiples of the original ones.
Transformations
• If k<0, branches flip directions (> ↔ <, ≥ ↔ ≤)
• All constants in code get multiplied by k.
• Addition and Subtraction of variables unchanged.
• Multiplication: v1*v2*...*vn → (v1*v2*...*vn)/k^(n-1)
• Division: v1/v2 → (v1/v2)*k
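The rules above can be sketched on a small made-up program. The functions `original` and `diverse` are hypothetical and hand-transformed for illustration; they are not output of the actual ED4I tool. Integer division is used, and the example is arranged so that every division is exact.

```python
def original(a, b):
    """A tiny program using addition, multiplication, a branch and a division."""
    s = a + b
    p = a * b
    if s > 4:
        r = p // a
    else:
        r = p + 1
    return r

def diverse(a_k, b_k, k):
    """Hand-written k-factor diverse version; inputs are k*a and k*b."""
    s = a_k + b_k                       # addition: unchanged
    p = (a_k * b_k) // k                # product of n=2 variables: divide by k^(n-1)
    # The constant 4 becomes 4*k; the comparison flips direction when k < 0.
    taken = s < 4 * k if k < 0 else s > 4 * k
    if taken:
        r = (p // a_k) * k              # division: multiply the quotient by k
    else:
        r = p + 1 * k                   # the constant 1 becomes 1*k
    return r

for k in (-2, 3):
    for a, b in ((3, 4), (1, 2), (2, 2)):
        assert diverse(k * a, k * b, k) == k * original(a, b)
```

Both versions follow the same path through the control flow graph for each input pair, and every variable in `diverse` holds k times its counterpart in `original`.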
Fault Detection Probability
• For functional unit hi (such as the adder), fault f and diversity factor k:
$$C_i(k) = \sum_f P(f)\,\frac{|E_i| - |E'_i|}{|X_i|}$$
• Xi = the set of inputs to hi
• Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault.
• E'i = subset of Ei that will escape detection
• Ci(k) = Probability of catching an error in hi.
Data Integrity Probability
• For functional unit hi, fault f and diversity factor k:
$$D_i(k) = \sum_f P(f)\left(1 - \frac{|E'_i|}{|X_i|}\right)$$
• Xi = the set of inputs to hi
• Ei = subset of Xi containing the inputs that will result in erroneous output due to the fault.
• E'i = subset of Ei that will escape detection
• Di(k) = Probability of missing no errors in hi.
Choosing the value of k
• For some functional units we can derive Ci(k) and Di(k) analytically for each k.
• This is too hard in general, so we resort to trying out a range of k's empirically to determine Ci(k) and Di(k).
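As a rough illustration of the empirical approach, the sketch below enumerates every input of a small word-sized unit with a single stuck-at fault on one bit. The fault model, the helper names `corrupt` and `estimate`, and the choice to treat an input as erroneous when either run is corrupted are assumptions made here, not details from the slides. With a single fault, P(f) = 1, so the two sums reduce to (|Ei| - |E'i|)/|Xi| and 1 - |E'i|/|Xi|.

```python
def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits

def corrupt(value, bit, stuck_at, width):
    """Pass a value through a unit whose given bit is stuck at 0 or 1."""
    bits = value & ((1 << width) - 1)
    if stuck_at:
        bits |= 1 << bit
    else:
        bits &= ~(1 << bit)
    return to_signed(bits, width)

def estimate(k, bit, stuck_at, width=12):
    """Return (C, D) for one stuck-at fault, enumerating all non-overflowing inputs."""
    limit = ((1 << (width - 1)) - 1) // abs(k)   # keep k*x inside the word
    inputs = range(-limit, limit + 1)
    erroneous = escaped = 0
    for x in inputs:
        x_bad = corrupt(x, bit, stuck_at, width)        # original run
        y_bad = corrupt(k * x, bit, stuck_at, width)    # transformed run
        if x_bad != x or y_bad != k * x:
            erroneous += 1
            if y_bad == k * x_bad:                      # results still agree: escape
                escaped += 1
    n = len(inputs)
    return (erroneous - escaped) / n, 1 - escaped / n

for k in (1, -1, -2, 3):
    print(k, estimate(k, bit=0, stuck_at=0))
```

In this model k = 1 (no diversity) catches nothing, because both runs are corrupted identically, while k = -1 catches every error caused by bit 0 stuck at 0.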
Bus Signal Line
• Bus wire stuck at either 0 or 1.
• Derived results for a 12-bit bus:
Adder
• Experimental results for a 12-bit ripple carry adder:
• Experimental results for a 12-bit carry lookahead adder:
Multiplier & Divider
• Experimental results for:
• 12-bit array multiplier
• 8-bit Wallace Tree multiplier
• SRT divider
Shifter
• Experimental results for 16-bit multiplexer-based shifter:
Using Benchmarks to pick k
• Need to determine how much each functional unit is used in the average program.
• Add, sub, mult and shift use the obvious functional units.
• “memory access” uses the memory bus
• “branch” uses a carry-lookahead adder
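One plausible way to fold usage statistics into a single program-level figure is a usage-weighted average of the per-unit values. The slides do not spell out the exact formula, so the weighting scheme, the function name `weighted` and all numbers below are hypothetical.

```python
# Hypothetical dynamic instruction counts for one benchmark.
usage = {"adder": 60, "multiplier": 10, "shifter": 5, "memory bus": 25}
# Hypothetical per-unit Di(k) values for one candidate k.
d_unit = {"adder": 0.90, "multiplier": 0.80, "shifter": 1.00, "memory bus": 0.95}

def weighted(per_unit, usage):
    """Average a per-unit probability, weighting each unit by how often it is used."""
    total = sum(usage.values())
    return sum(per_unit[unit] * count for unit, count in usage.items()) / total

print(round(weighted(d_unit, usage), 4))   # 0.9075
```

The same helper could be applied to per-unit Ci(k) values to get a program-level detection probability.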
Benchmarked Data Integrity
• Calculated data integrity Di(k) given the usage statistics above (high Di(k) is the top priority).
• Highlighted columns provide the best data integrity for each benchmark.
Benchmarked Detection Probability
• Calculated detection probability Ci(k) given the usage statistics above.
• Highlighted columns provide the best detection probability for each benchmark.
Optimum k
• Optimum k selected:
• Must maximize the Data Integrity=Di(k).
• Given maximum Di(k), maximize Ci(k).
• For each program, should get an estimate of how it uses the different functional units and pick k accordingly.
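The selection rule above amounts to a lexicographic maximum: data integrity first, detection probability as the tie-breaker. A minimal sketch, where `pick_k` and the candidate values are made up for illustration:

```python
def pick_k(candidates):
    """candidates maps k -> (D, C); prefer the highest D, then the highest C."""
    return max(candidates, key=lambda k: (candidates[k][0], candidates[k][1]))

# Hypothetical program-level (data integrity, detection probability) per k.
measured = {
    -1: (0.97, 0.60),
    2: (0.99, 0.55),
    -2: (0.99, 0.70),
    3: (0.98, 0.90),
}

print(pick_k(measured))   # -2
```

Here k = 3 has the best detection probability but loses because its data integrity is lower than that of k = 2 and k = -2.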
Dealing with Overflow
• By multiplying all variables by k, we may cause them to overflow.
• Can scale variables up to next largest type.
• Scale down variables by dividing by k. Must only check higher order bits when comparing new results to results of original program.
• Can use compile-time range checking to determine vulnerability to overflow and pick k accordingly
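The range-checking option can be sketched as follows, assuming value ranges for each variable are already known (for example from static analysis). The function `safe_factors`, the ranges and the 16-bit width are hypothetical.

```python
def safe_factors(var_ranges, width, candidates):
    """Keep the k values for which every scaled variable fits a signed width-bit word."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    safe = []
    for k in candidates:
        # Scaling is linear, so checking both endpoints of each range is enough.
        if all(lo <= k * v <= hi for bounds in var_ranges for v in bounds):
            safe.append(k)
    return safe

# Hypothetical (min, max) ranges for two program variables.
ranges = [(-100, 100), (0, 10000)]
print(safe_factors(ranges, 16, [-4, -2, 2, 3, 4]))   # [-2, 2, 3]
```

Candidates that survive this filter can then be ranked by Di(k) and Ci(k).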
Floating Point Numbers
• Above technique fails for floating point numbers.
• IEEE 754 format: $(-1)^s \times m \times 2^b,\ 1 \le m < 2$
• k=-2 will only change the sign bit and some bits in the exponent.
• Solution: pick separate k's for the exponent and the mantissa and run the program once with each k.
• Overhead = 200%.
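The k = -2 problem is easy to see by splitting a single-precision value into its bit fields. This is an illustrative sketch using Python's standard `struct` module; the helper `fields` is a name chosen here.

```python
import struct

def fields(x):
    """Split an IEEE 754 single-precision value into (sign, exponent, fraction) fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

x = 1.5
print(fields(x))        # (0, 127, 4194304)
print(fields(-2 * x))   # (1, 128, 4194304)
```

The fraction field is identical in both runs, so a fault in the mantissa datapath would corrupt both results in the same way.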
Picking k for the mantissa
• To find errors in mantissa, pick k to be 3/2.
• A stuck-at-1 fault:
• In original program, variable x's value corrupted to: $x_e = (-1)^s \times (m + e) \times 2^b,\ 1 \le (m + e) < 2$
• In transformed program, $x' = \frac{3}{2}x = (-1)^s \times \frac{3}{2}m \times 2^b$
• Since $1 \le m < 2$, $\frac{3}{2} \le \frac{3}{2}m < 3$
• However, the mantissa must be <2, so if $\frac{3}{2}m \ge 2$ the mantissa is right shifted by 1 and normalized.
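The normalization rule above can be checked numerically. The sketch below uses Python doubles rather than single precision, and the helpers `decompose` and `predicted` are names chosen for this example. The sample values are exactly representable, so the comparison is exact.

```python
import math

def decompose(x):
    """Return (m, b) with x = m * 2**b and 1 <= m < 2, for x > 0."""
    frac, exp = math.frexp(x)        # x = frac * 2**exp with 0.5 <= frac < 1
    return frac * 2, exp - 1

def predicted(m, b):
    """Mantissa and exponent of (3/2)*x according to the rule on the slide."""
    scaled = 1.5 * m
    if scaled < 2:
        return scaled, b             # still normalized
    return scaled / 2, b + 1         # shift right by one and bump the exponent

for x in (1.0, 1.25, 1.5, 1.75, 6.0):
    m, b = decompose(x)
    assert decompose(1.5 * x) == predicted(m, b)
    print(x, (m, b), "->", decompose(1.5 * x))
```

For x = 1.5 the scaled mantissa would be 2.25, so the stored value becomes mantissa 1.125 with the exponent raised by one.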
Transformed variables
• So now, the value in transformed program is:
$(-1)^s \times \left(\frac{3}{2}m\right) \times 2^b$, if $\frac{3}{2}m < 2$
$(-1)^s \times \left(\frac{3}{4}m\right) \times 2^{b+1}$, if $2 \le \frac{3}{2}m < 3$
• Value in original program is: $(-1)^s \times m \times 2^b,\ 1 \le m < 2$
Fault Detection in Mantissa
• If there is a stuck-at-1 fault
• Value in transformed program:
$(-1)^s \times \left(\frac{3}{2}m + e\right) \times 2^b$, if $\frac{3}{2}m < 2$
$(-1)^s \times \left(\frac{3}{4}m + e\right) \times 2^{b+1}$, if $2 \le \frac{3}{2}m < 3$
• Value in original program * k (for checking):
$(-1)^s \times \frac{3}{2}(m + e) \times 2^b$, if $\frac{3}{2}(m + e) < 2$
$(-1)^s \times \frac{3}{4}(m + e) \times 2^{b+1}$, if $2 \le \frac{3}{2}(m + e) < 3$
• We can detect Mantissa errors!
• Note that the error values for the original and the transformed programs are different!
• We actually use $k = -\frac{3}{2}$ in order to flip the sign bit for improved detection capability
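A small single-precision illustration of this check, using k = -3/2. The helper `force_fraction_bit` and the choice of fraction bit 20 are assumptions made for the example; a bit well above the least significant one is used so that every value stays exactly representable.

```python
import struct

def force_fraction_bit(x, bit):
    """Return x as a single-precision value with one fraction bit forced to 1."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    (out,) = struct.unpack(">f", struct.pack(">I", bits | (1 << bit)))
    return out

K = -1.5
x = 1.0

x_e = force_fraction_bit(x, 20)        # original run under the fault: 1.125
y_e = force_fraction_bit(K * x, 20)    # transformed run under the fault: -1.625

# The check compares the transformed result with K times the original result.
print(y_e, K * x_e, y_e != K * x_e)    # -1.625 -1.6875 True
```

The same stuck bit adds 0.125 to the mantissa in both runs, but the original error is then scaled by k during the check, so the two values disagree and the fault is caught.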
k for exponents
• In order to flip all the bits of the exponent, need to transform program to use $k = 2^{0101010101_2}$ and $k = 2^{1010101010_2}$
• If a fault invalidates a bit of the exponent, the fault will be detected by comparing to the exponents of one of the two transformed programs.
Effectiveness for Mantissa
• Effectiveness of $k = -\frac{3}{2} \times 2^{1010101010_2}$ (for IEEE 754 single precision)
Effectiveness for Exponent
• Effectiveness of $k = -\frac{3}{2} \times 2^{0101010101_2}$ (for IEEE 754 single precision)
Summary
• ED4I effectively detects Byzantine Failures in numerical applications on uniprocessors.
• Purely software solution using Data Diversity.
• Detects permanent and temporary faults.
• Works with fixed-point and floating point numbers.
• Compatible with arithmetic and logical operations (probably with any bitwise logical operation if it can be recast into arithmetic)
• High overhead: 100% or 200%.