Software Fault Tolerance
Jimmy John
Contents
Software Fault Tolerance
Why do we need it?
Approach for Fault Tolerance
Fault Detection
Fault Containment
Fault Correction
Rollbacks
Forward Recovery
Imprecise Computations
Modified form of Forward Recovery
Software Fault Tolerance
By software fault tolerance we mean a set of
application-level software components that
can detect and recover from faults that are
not handled by the hardware or operating
system.
A failure occurs when the system deviates
from its specification. The cause of a failure
is an error. A fault has the potential to
generate errors, i.e. it may or may not
generate any errors. A system with errors is
therefore faulty.
Why do we need it?
Many systems today are expected to work
correctly because lives depend on them.
The cost of errors is too high.
Testing is not a suitable measure of reliability:
It can only establish the presence of errors,
not assure their absence.
It relies heavily on manual skill to identify test cases
and evaluate results.
Approach for Fault Tolerance
Fault Detection
Self-protection
Self-checking
Techniques
Timing checks
Replication checks
Reasonableness checks
Using a fault flag
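The detection techniques listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a standard API: the names FaultFlag, timing_check, replication_check and reasonableness_check are hypothetical. The timing check only measures elapsed time after the call returns; a real system would use a watchdog timer that can interrupt a hung computation.

```python
import time


class FaultFlag:
    """Hypothetical shared flag: checks raise it, recovery code polls it."""

    def __init__(self):
        self.raised = False
        self.reasons = []

    def raise_fault(self, reason):
        self.raised = True
        self.reasons.append(reason)


def timing_check(func, deadline_s, flag, *args):
    """Timing check: flag a fault if func(*args) ran longer than deadline_s.

    Only measures after the call returns; it cannot interrupt a hung call.
    """
    start = time.monotonic()
    result = func(*args)
    if time.monotonic() - start > deadline_s:
        flag.raise_fault("timing")
    return result


def replication_check(replicas, flag, *args):
    """Replication check: run every replica, flag a fault on disagreement,
    and return the majority result (assumes results are hashable)."""
    results = [replica(*args) for replica in replicas]
    if len(set(results)) > 1:
        flag.raise_fault("replication")
    return max(set(results), key=results.count)


def reasonableness_check(value, low, high, flag):
    """Reasonableness check: flag a fault if value lies outside [low, high]."""
    if not low <= value <= high:
        flag.raise_fault("reasonableness")
    return value
```

For example, calling replication_check with three replicas of which one disagrees returns the majority value and records "replication" in flag.reasons, so later code can poll the flag instead of handling each check separately.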
Fault Containment
This ensures that faults originating in a
module do not propagate to other
modules.
Techniques
Partitioned address spaces, e.g. DEOS,
VxWorks AE
Access rights
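A rough analogue of a partitioned address space can be sketched with ordinary OS processes. Assumption: each "module" can run as its own process, so the operating system gives it a separate address space and a crash or uncaught exception inside it cannot overwrite the caller's memory. This is not the DEOS or VxWorks AE API; run_contained is a hypothetical helper built on Python's standard subprocess module.

```python
import subprocess
import sys


def run_contained(module_source, timeout_s=5.0):
    """Run module_source (Python code) in a separate child process.

    Returns (ok, stdout). A crash, uncaught exception, or hang in the child
    is reported to the caller as ok=False instead of propagating into it.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", module_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout
```

For example, run_contained("print(6 * 7)") returns (True, "42\n"), while run_contained("raise RuntimeError('boom')") returns ok=False and the calling program keeps running.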
Fault Correction
Rollback
Checkpoints are taken at regular intervals.
When a fault is detected, the system is rolled
back to the previous checkpoint and the
checkpoint interval is re-executed.
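The rollback scheme can be sketched as follows, under simplifying assumptions that are not from the slides: the state is a dict, the work is a list of step functions, and a step reports a detected fault by raising a hypothetical FaultDetected exception. The interval length and the retry limit (after which the fault is treated as permanent) are illustrative parameters.

```python
import copy


class FaultDetected(Exception):
    """Raised by a step when one of its checks detects a fault."""


def run_with_rollback(state, steps, interval=2, max_retries=3):
    """Execute steps on state, checkpointing every `interval` steps.

    On FaultDetected the state is restored from the previous checkpoint
    and the whole checkpoint interval is re-executed, up to max_retries
    times; after that the fault is treated as permanent and re-raised.
    """
    checkpoint = copy.deepcopy(state)          # initial checkpoint
    for start in range(0, len(steps), interval):
        for attempt in range(max_retries + 1):
            try:
                for step in steps[start:start + interval]:
                    step(state)
                break                          # interval finished cleanly
            except FaultDetected:
                if attempt == max_retries:
                    raise
                state.clear()                  # roll back to the checkpoint
                state.update(copy.deepcopy(checkpoint))
        checkpoint = copy.deepcopy(state)      # take the next checkpoint
    return state
```

A transient fault that occurs only on the first attempt is masked: the partially executed interval is discarded and re-run from the checkpoint, so its side effects are not applied twice.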
Forward Recovery
A duplicate copy of the process runs
alongside the primary. When a fault occurs, the
checkpoint comparison fails. A third process re-executes
the checkpoint interval while the other two
are allowed to continue.
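The comparison-and-spare idea can be sketched sequentially in Python. Assumptions, not from the slides: the process is a deterministic function step(state, k) that executes checkpoint interval k, and faults are injected through a hypothetical `faults` set of (copy, interval) pairs, each copy being corrupted by a different offset so that two faulty copies never agree by accident. In a real system the copies run on separate processors and A and B keep executing the next interval while the spare C re-executes; here that concurrency is only noted in a comment.

```python
def forward_recovery(step, initial, n_intervals, faults=frozenset()):
    """Run n_intervals checkpoint intervals with a primary A and duplicate B.

    Returns (final_state, log) where log lists (interval, faulty_copy).
    """
    def run(copy_name, state, k):
        out = step(state, k)
        if (copy_name, k) in faults:
            # Injected corruption, distinct per copy (illustrative only).
            out += 1000 * (1 + "ABC".index(copy_name))
        return out

    agreed = initial           # last checkpoint on which A and B agreed
    log = []
    for k in range(n_intervals):
        a = run("A", agreed, k)
        b = run("B", agreed, k)
        if a != b:
            # Checkpoint comparison failed: spare copy C re-executes the
            # interval from the last agreed checkpoint. In a real system
            # A and B would meanwhile continue with interval k + 1.
            c = run("C", agreed, k)
            if c == a:
                log.append((k, "B"))
            elif c == b:
                log.append((k, "A"))
            else:
                raise RuntimeError(f"no two copies agree on interval {k}")
            a = b = c          # the faulty copy takes over the correct state
        agreed = a
    return agreed, log
```

With step = lambda s, k: s + k and a single injected fault in B at interval 2, the run still ends in state 6 and the log records (2, "B"), identifying the faulty copy by majority vote.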
Imprecise Computation
When a fault is detected, sometimes there
is no time to redo any computation. In
such cases an imprecise computation is
carried out that gives an approximate
result.
E.g. Matrix multiplication.
Precise Algorithm
Strassen's matrix multiplication algorithm.
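For reference, a compact recursive version of Strassen's algorithm is sketched below. It assumes square matrices stored as Python lists of lists whose size n is a power of two, and it recurses all the way down to 1x1 blocks; practical implementations usually switch to the naive method below some block size.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def quadrants(M):
    """Split M into its four h x h quadrants (h = len(M) // 2)."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two n x n matrices, n a power of two, with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = quadrants(A)
    B11, B12, B21, B22 = quadrants(B)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    # Reassemble the four result quadrants into one matrix.
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

For example, strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]) gives [[19, 22], [43, 50]], the same as the schoolbook product.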
Imprecise Algorithm
We pick a random subset of 's' column indices
and take those columns of X to form an n*s matrix S.
We also form an s*n matrix R out of the corresponding rows of
Y. The product SR, scaled by n/s, is an unbiased estimator (entry by
entry) of the product XY.
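The sampling estimator can be sketched in plain Python for square n x n matrices. It picks s of the n inner indices uniformly without replacement, takes those columns of X and the matching rows of Y, and multiplies the product by n/s so that, averaged over all possible subsets, the estimate equals XY. The function names are hypothetical; the rng parameter accepts the random module or a random.Random instance.

```python
import random


def naive_matmul(X, Y):
    """Exact product of two n x n matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sampled_matmul(X, Y, s, rng=random):
    """Approximate X @ Y by sampling s of the n inner indices."""
    n = len(X)
    cols = rng.sample(range(n), s)                     # random subset of indices
    S = [[X[i][k] for k in cols] for i in range(n)]    # n x s: columns of X
    R = [Y[k] for k in cols]                           # s x n: matching rows of Y
    scale = n / s                                      # makes the estimate unbiased
    return [[scale * sum(S[i][t] * R[t][j] for t in range(s)) for j in range(n)]
            for i in range(n)]
```

With s = n the estimate is exact; smaller s trades accuracy for roughly s/n of the multiplications, which is the point of falling back to it when there is no time to redo the precise computation.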
Modified Forward Recovery
Checkpoints are compared to detect
faults. If a fault has occurred, the third
process is checkpointed and the
checkpoint interval is re-executed.