Software Fault Tolerance

Jimmy John
Contents

Software Fault Tolerance
Why do we need it?
Approach for Fault Tolerance
    Fault Detection
    Fault Containment
    Fault Correction
        Rollbacks
        Forward Recovery
        Imprecise Computations
        Modified form of Forward Recovery

Software Fault Tolerance


By software fault tolerance we mean a set of
application-level software components that
can detect and recover from faults that are
not handled by the hardware or the operating
system.
A failure occurs when the system deviates
from its specification. The cause of a failure
is an error, and the cause of an error is a
fault. A fault only has the potential to
generate errors, i.e. it may or may not
generate any. A system that exhibits errors
therefore contains at least one fault.
Why do we need it?



Many systems today are expected to work
correctly because lives depend on them.
The cost of errors is too high.
Testing is not a suitable measure of reliability.
    It can only establish the presence of errors;
    it cannot assure their absence.
    It relies heavily on manual skill to identify
    test cases and evaluate results.
Approach for Fault Tolerance

Fault Detection



    Self-protection
    Self-checking
    Techniques:
        Timing checks
        Replication checks
        Reasonableness checks
        Using a fault flag
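
The four detection techniques listed above can be illustrated with a short Python sketch. It is a minimal illustration, not a reference implementation: the names FaultDetected, timing_check, replication_check, reasonableness_check and run_with_fault_flag are hypothetical, and the deadline and range values are assumed to come from the application's specification.

```python
import time


class FaultDetected(Exception):
    """Hypothetical exception raised when a check flags a fault."""


def timing_check(func, deadline_s, *args):
    """Run func and flag a fault if it takes longer than deadline_s seconds."""
    start = time.monotonic()
    result = func(*args)
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        raise FaultDetected(f"deadline {deadline_s}s exceeded ({elapsed:.3f}s)")
    return result


def replication_check(replicas, *args):
    """Run every replica on the same input; flag a fault if they disagree."""
    results = [replica(*args) for replica in replicas]
    if any(r != results[0] for r in results[1:]):
        raise FaultDetected(f"replicas disagree: {results}")
    return results[0]


def reasonableness_check(value, low, high):
    """Flag a fault if value lies outside its physically plausible range."""
    if not (low <= value <= high):
        raise FaultDetected(f"{value} outside [{low}, {high}]")
    return value


def run_with_fault_flag(check, *args):
    """Turn a raised FaultDetected into a (result, fault_flag) pair."""
    try:
        return check(*args), False
    except FaultDetected:
        return None, True


# A temperature reading of 500.0 is rejected for an assumed -40..85 range.
print(run_with_fault_flag(reasonableness_check, 500.0, -40.0, 85.0))
```

The fault flag here is just a boolean returned next to the result, so that the caller decides how to contain or correct the fault.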

Fault Containment


This ensures that faults originating in one
module do not propagate to other modules.
Techniques:
    Partitioned address spaces, e.g. DEOS,
    VxWorks AE.
    Access rights.
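
As a rough analogy for partitioned address spaces, an ordinary operating-system process can serve as the partition. The sketch below is only an illustration on a general-purpose OS and does not show how DEOS or VxWorks AE implement partitioning; run_contained is a hypothetical helper name.

```python
import subprocess
import sys


def run_contained(source, timeout_s=5.0):
    """Run Python source in a child process with its own address space.

    Returns (ok, stdout). A crash or hang in the child is reported to the
    caller instead of stopping or corrupting the parent.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


print(run_contained("print(6 * 7)"))  # the module runs normally
print(run_contained("1/0"))           # the fault stays inside the child
```

The parent keeps running after the second call because the division error is confined to the child process.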
Fault Correction


Rollback
Checkpoints are taken at regular intervals.
When a fault is detected, the system is rolled
back to the previous checkpoint and the
checkpoint interval is re-executed.
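
A minimal sketch of rollback, assuming the state can be deep-copied and that faults are signalled by an exception. TransientFault, run_with_rollback and the demo steps are hypothetical names introduced for illustration.

```python
import copy


class TransientFault(Exception):
    """Hypothetical signal raised by whatever fault detection is in use."""


def run_with_rollback(state, steps, checkpoint_every=2, max_retries=3):
    """Apply steps in order, checkpointing every checkpoint_every steps.

    On a fault, restore the previous checkpoint and re-execute the
    checkpoint interval.
    """
    checkpoint = copy.deepcopy(state)
    checkpoint_index = 0
    retries = 0
    i = 0
    while i < len(steps):
        try:
            steps[i](state)
        except TransientFault:
            retries += 1
            if retries > max_retries:
                raise
            state = copy.deepcopy(checkpoint)  # roll back
            i = checkpoint_index               # re-execute the interval
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # take a new checkpoint
            checkpoint_index = i
            retries = 0
    return state


attempts = {"flaky": 0}


def increment(s):
    s["x"] += 1


def flaky(s):
    """Corrupts the state and faults on its first call only."""
    attempts["flaky"] += 1
    if attempts["flaky"] == 1:
        s["x"] = -999
        raise TransientFault("injected fault")
    s["x"] += 1


result = run_with_rollback({"x": 0}, [increment, increment, increment, flaky])
print(result)  # {'x': 4}
```

The corrupted value -999 never reaches the result, because the state is restored from the checkpoint taken after the second step and steps three and four are run again.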


Forward Recovery
A duplicate copy of the process runs alongside
the original, and their checkpoints are compared.
When a fault occurs, the checkpoint comparison
fails. A third process then re-executes the
checkpoint interval while the other two are
allowed to continue.
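
The comparison step can be sketched sequentially in Python. Two points are assumptions not stated on the slide: the third process's result is used as a tie-breaker to decide which copy was faulty, and the copies run one after another here rather than concurrently. The names run_interval, healthy and faulty are hypothetical.

```python
def run_interval(copy_a, copy_b, spare, checkpoint):
    """Execute one checkpoint interval with forward recovery.

    Each copy is a callable mapping the state at the last checkpoint to
    the state at the next one. Returns (agreed_state, faulty_copy_or_None).
    """
    state_a = copy_a(checkpoint)
    state_b = copy_b(checkpoint)
    if state_a == state_b:
        return state_a, None          # checkpoints agree: no fault detected
    state_c = spare(checkpoint)       # third process re-executes the interval
    if state_c == state_a:
        return state_a, "b"
    if state_c == state_b:
        return state_b, "a"
    raise RuntimeError("no two copies agree; roll back instead")


def healthy(state):
    return state + 1


def faulty(state):
    return state + 2  # injected fault


print(run_interval(healthy, healthy, healthy, 10))  # (11, None)
print(run_interval(healthy, faulty, healthy, 10))   # (11, 'b')
```

In a concurrent setting the two original copies would keep computing the next interval while the spare runs, which is what saves time compared with rollback.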

Imprecise Computation


When a fault is detected, there is sometimes
no time to redo any computation. In such
cases an imprecise computation is carried
out that gives an approximate result.
E.g. matrix multiplication.

Precise algorithm:
    Strassen's matrix multiplication algorithm.
Imprecise algorithm:
    We pick a random subset of s columns of X
    to form an n*s matrix S. We also form an s*n
    matrix R out of the corresponding rows of
    Y. The product SR is, up to a scaling factor,
    an entry-by-entry estimator of the product XY.
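
A small pure-Python sketch of the sampled product. Two choices here are assumptions: the subset is drawn uniformly without replacement, and the result is scaled by m/s (m being the inner dimension), which makes the estimate unbiased under that sampling. The precise baseline is the schoolbook product rather than Strassen's algorithm, to keep the example short; exact_matmul and sampled_matmul are hypothetical names.

```python
import random


def exact_matmul(X, Y):
    """Schoolbook matrix product, used as the precise baseline."""
    inner = len(Y)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def sampled_matmul(X, Y, s, rng=random):
    """Estimate X*Y from s randomly chosen column/row pairs.

    Column k of X is paired with row k of Y; the partial sum is scaled
    by m/s so that its expected value equals the exact product.
    """
    m = len(Y)                         # inner dimension
    picked = rng.sample(range(m), s)   # uniform, without replacement
    scale = m / s
    return [
        [scale * sum(X[i][k] * Y[k][j] for k in picked) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
print(exact_matmul(X, Y))        # [[19, 22], [43, 50]]
print(sampled_matmul(X, Y, 1))   # rough estimate from one column/row pair
```

With s equal to the inner dimension every pair is used and the estimate coincides with the exact product; smaller s trades accuracy for less work.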


Modified Forward Recovery
Checkpoints are compared to detect faults.
If a fault has occurred, the third process
is checkpointed and the checkpoint
interval is re-executed.