Transcript Slides

Feedback Based
Real-Time Fault Tolerance
Issues and Possible Solutions
Xue Liu, Hui Ding, Kihwal Lee,
Marco Caccamo, Lui Sha
1
Major Issues in Software Reliability
• Software becoming more and more complex
– More features → larger code size
– Rapid evolution → introduction of new code
E.g. Apache
1998 0.8 MLOC
2002 10 MLOC
2004 27 MLOC
E.g. Windows XP 40-50 MLOC
Gray’s estimate: 1 bug / KLOC
2
Growing Software Complexity
Poorly managed or maintained;
• Managed by human operators
– Shortage of skilled operators due to the growing complexity
– Costly
– To err is human
• Faults
Software bugs and errors.
Sources of computing system downtime (cited from Candea, Stanford ’03)
Category          Source of downtime (percentage)
Hardware          20%
Software          40%
Human operators   40%
Complexity adds difficulty to management and breeds bugs.
- Control the complexity in computer systems!
- Build systems that are robust against software bugs
3
Feedback Control Reflection
• Successful track record in controlling
electro/mechanical systems
• Observation 1: Computing systems have been crucial in the success of feedback control
– Digital designs & implementations etc.
• Observation 2: Feedback control has appealing properties
– Tolerance of errors (model/sensing/actuation etc.) in the physical process
• Utilize runtime feedback for error correction
[Figure: Computing Systems, Fault tolerance, Feedback Control]
Reflection: Can feedback control help to solve the fault tolerance problem in computing systems?
4
Targeted applications: Real-time control systems
Q: Feedback control helps to tolerate errors in mechanical systems. Can it also help to tolerate software errors?
[Figure: Feedback Control, linked to "Tolerant of Errors in Mechanical Systems" and "Tolerant of Errors in Software Systems"]
Idea 1: Feedback Control of Software Execution
Mechanical systems: Sense (feedback) -> Control (error correction) -> Actuation
Software systems: Sense (feedback) -> Control (error correction) -> Execution
Idea 2: Using Simplicity to Control Complexity
• A simple and reliable core which gives acceptable
performance;
• The system under complex control software remains in states
that are recoverable by the simple core. (achieve fault tolerance)
5
A Typical Feedback Control Loop for Mechanical
Systems
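[Figure: feedback loop. Reference Input, Controller (Decision), Actuator (Execution), Mechanical System (Plant), with a Sensor (Sensing/error identification) feeding the output back and subtracting it from the Reference Input]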
• Sense: Measure the system output and identify whether an error exists
• Control: Decision
• Actuation: Execution
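The sense, control, actuate cycle above can be sketched in a few lines. This is a minimal illustration only: the toy plant (an integrator), the proportional gain, and the `actuator_scale` parameter that models an imperfect actuator are all assumptions made for the example, not part of the slides.

```python
def run_feedback_loop(reference, steps=50, gain=0.5, actuator_scale=1.0):
    """Drive a toy plant toward `reference` using proportional feedback.

    `actuator_scale` is a hypothetical actuation error: the actuator
    delivers only this fraction of the commanded correction.
    """
    state = 0.0  # plant output, assumed to be directly measurable
    for _ in range(steps):
        measured = state                   # Sense: read the system output
        error = reference - measured       # identify whether an error exists
        command = gain * error             # Control: decide on a correction
        state += actuator_scale * command  # Actuation: apply it to the plant
    return state


if __name__ == "__main__":
    print(run_feedback_loop(1.0))                      # ideal actuator
    print(run_feedback_loop(1.0, actuator_scale=0.8))  # 20% actuation error
```

Both calls settle close to the reference, which is the sense in which runtime feedback tolerates an actuation error: the residual error is measured again on the next cycle and corrected.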
6
Related Work – Simplex Architecture
• A simple reliable core (HAC)
• Diversity in the form of 2 alternatives (HAC, HPC)
• Feedback control of the software execution.
[Figure: Data Flow Block Diagram. A Decision block selects between the simple high assurance control subsystem (HAC) and the complex high performance control subsystem (HPC) to drive the Plant]
Sense (feedback) -> Decision (control/error correction) -> Execution (actuation)
7
Drawbacks of Simplex
• P1: Analytically redundant high assurance controller
(HAC) runs in parallel with complex controller (HPC)
– Lowers system performance, increases operating costs
– Limits the application of Simplex to safety-critical domains
• P2: HAC and HPC must run at the same period
Our new Proposal: On-demand Real-Time Guard (ORTGA)
HAC runs only when a fault occurs!
Design Goals of ORTGA
1. Functionality similar to Simplex
2. Much less resource usage
3. Flexibility
8
ORTGA Architecture: Key Ideas
(1) : Reduce resource usage of Simplex
Solution:
• “On-demand” execution of HAC.
– The HAC is switched in to take over the plant only when the control under HPC is detected as faulty
(2): Flexibility
Solution:
• The periods of HAC and HPC are multiples of a common subperiod
• HAC and HPC can have different periods.
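The two key ideas above can be illustrated with a small release trace. This is a sketch under stated assumptions, not the ORTGA implementation: one loop iteration stands for one subperiod, `fault_tick` is a hypothetical tick at which the HPC is detected as faulty, and the function name `ortga_trace` is made up for the example.

```python
def ortga_trace(ticks, hpc_period, hac_period, fault_tick=None):
    """Return which controller is released at each subperiod tick.

    Both periods are integer multiples of one subperiod (one tick).
    Before `fault_tick` only the HPC is released; from `fault_tick`
    on, only the HAC is released, at its own period.
    """
    trace = []
    for t in range(ticks):
        faulty = fault_tick is not None and t >= fault_tick
        if not faulty:
            released = "HPC" if t % hpc_period == 0 else "-"
        else:
            released = "HAC" if (t - fault_tick) % hac_period == 0 else "-"
        trace.append(released)
    return trace


if __name__ == "__main__":
    # HPC every 2 subperiods, HAC every 3, fault detected at tick 4.
    print(ortga_trace(8, hpc_period=2, hac_period=3, fault_tick=4))
```

In the fault-free case the HAC never appears in the trace, which is where the resource saving over running both controllers in parallel comes from.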
9
Background: Maximum Stability Region
• The largest region of the state space in which the system remains stable under the current controller
[Figure: Maximum Stability Region (Recovery Region) and Stability Region, drawn as Lyapunov functions inside the State Constraints]
10
How to determine the Maximum Stability
Region?
• In the operation of a plant,
there is a set of state
constraints: representing the
safety, device physical
limitations, environmental and
other operation requirements.
• They can be represented as a normalized polytope, $C^T X \le 1$, in the N-dimensional state space. We must be able to
– take the control away from a faulty
[Figure: Operation Constraints and Admissible States, showing the state constraints and the admissible states]
Maximum Stability Region
• A stability region is closed with respect to the operation of the simple controller. It is described by a Lyapunov function inside the polytope.
• The maximum recovery region can be found using linear matrix inequalities (LMI).
[Figure: State Constraints and the switching rule (Lyapunov function)]
System dynamics: $\dot{X} = AX$
Stability condition: $A^T Q + Q A < 0$
Maximum stability region: $\min \log \det Q$ subject to $C^T X < 1$
Switching rule: $X^T Q X < 1$
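The stability condition and the switching rule can be checked numerically. The sketch below is illustrative only: the 2-state matrix `A` and the matrix `Q` are made-up example values (this `Q` solves $A^T Q + Q A = -I$), the negative-definiteness test is specialized to symmetric 2x2 matrices, and no LMI solver is involved.

```python
def quad_form(Q, x):
    """Compute x^T Q x for a square matrix Q and a state vector x."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))


def lyapunov_lhs(A, Q):
    """Compute A^T Q + Q A for square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[k][i] * Q[k][j] + Q[i][k] * A[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def is_negative_definite_2x2(M):
    """Sylvester-style test for a symmetric 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return M[0][0] < 0 and det > 0


def inside_stability_region(Q, x):
    """Switching rule: the state is recoverable while x^T Q x < 1."""
    return quad_form(Q, x) < 1.0


# Hypothetical example values, chosen only for illustration.
A = [[0.0, 1.0], [-2.0, -3.0]]
Q = [[1.25, 0.25], [0.25, 0.25]]

if __name__ == "__main__":
    print(is_negative_definite_2x2(lyapunov_lhs(A, Q)))  # stability condition
    print(inside_stability_region(Q, [0.5, 0.5]))        # inside the ellipsoid
    print(inside_stability_region(Q, [1.0, 1.0]))        # outside the ellipsoid
```

A state that fails `inside_stability_region` can no longer be guaranteed recoverable by the simple controller, so control has to be taken away from the complex controller before that point is reached.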
12
Research Issues of ORTGA
• How to detect faults in HPC
– Timing faults:
• Application level support: A monitor detects missed heartbeat messages
• OS support: The scheduler detects task deadline misses
– Other faults:
• A wide range of traditional fault detection techniques can be used.
• When to recover if a fault in HPC is detected?
– Recover early?
• Too early: False alarms
– Recover late?
• Too late: could not recover in time
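The application-level timing check can be sketched as a heartbeat monitor that tolerates a few misses before asking for recovery. Everything here is an assumption made for illustration: the class name, the explicit timestamps passed in by the caller, and the `max_missed` threshold that stands in for the "not too early" policy.

```python
class HeartbeatMonitor:
    """Counts heartbeats missed by the HPC since the last one received."""

    def __init__(self, period, max_missed):
        self.period = period          # expected time between heartbeats
        self.max_missed = max_missed  # misses tolerated before recovery
        self.last_beat = 0.0          # timestamp of the last heartbeat

    def beat(self, now):
        """Record a heartbeat received at time `now`."""
        self.last_beat = now

    def missed(self, now):
        """Number of whole periods elapsed since the last heartbeat."""
        return int((now - self.last_beat) // self.period)

    def should_recover(self, now):
        """Recover only after more than `max_missed` heartbeats are missing."""
        return self.missed(now) > self.max_missed


if __name__ == "__main__":
    monitor = HeartbeatMonitor(period=1.0, max_missed=2)
    monitor.beat(2.0)
    print(monitor.should_recover(4.5))  # two misses so far: keep waiting
    print(monitor.should_recover(5.5))  # three misses: switch to the HAC
```

A fixed miss count is only a placeholder policy. It avoids reacting to a single delayed heartbeat, but it does not by itself guarantee that recovery still comes in time.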
13
When to recover
• Why not recover too early?
– Control tasks have been shown to tolerate several deadline misses
– Sometimes the system is merely delayed (overload, communication delay, etc.)
– These are not “real” faults
– Try to minimize recoveries due to false alarms
• Why not recover too late?
– If recovery comes too late, there is no time left to make the system stable!
14
Right Time To Recover (RTTR)
• An example of a “desirable” late but timely recovery (under RM)
Assumption: Fault is detected at t=2.0, before its task deadline D=8
[Figure: three schedules over the time axis 0 to 8: (a) Normal schedule of τ1 and τ2; (b) recover τ2 immediately; (c) recover τ2 late]
Observation: Sometimes, a late but timely recovery makes the system more schedulable
Find the RTTR instead of minimizing MTTR!
15
A possible solution to determine RTTR
• Idea
– Recover as late as possible,
– But not too late
• If the state of HPC is going to be out of the HAC-established
stability region, recover!
• Otherwise, wait (maybe the HPC is still OK)
[Figure: When to recover? Stability Region S of Controlled Plant, with heartbeats HB1 (t1) and HB2 (t2), the monitor finding HB3 missing (t3), a prediction of the state from ts, and the recovered threads at tr]
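One way to read the idea above is as a prediction test: extrapolate the plant state from the last two heartbeats and recover once the predicted state, at the moment recovery would complete, is about to leave the stability region. The sketch below assumes linear extrapolation, a known `recovery_time`, and a region of the form $X^T Q X < 1$; the function names and the example numbers are hypothetical, and the slides do not specify the prediction method.

```python
def quad_form(Q, x):
    """Compute x^T Q x for a square matrix Q and a state vector x."""
    n = len(x)
    return sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))


def predict_state(x1, t1, x2, t2, t):
    """Linearly extrapolate the state to time t from two heartbeat samples."""
    rate = [(b - a) / (t2 - t1) for a, b in zip(x1, x2)]
    return [b + r * (t - t2) for b, r in zip(x2, rate)]


def must_recover_now(Q, x1, t1, x2, t2, now, recovery_time):
    """True if waiting any longer risks leaving the region x^T Q x < 1.

    The state is predicted at `now + recovery_time`, the earliest time
    the HAC could be in control if recovery started right now.
    """
    predicted = predict_state(x1, t1, x2, t2, now + recovery_time)
    return quad_form(Q, predicted) >= 1.0


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]             # unit disk as stability region
    hb1, hb2 = ([0.0, 0.0], 1.0), ([0.25, 0.0], 2.0)
    for now in (3.0, 5.0):
        print(now, must_recover_now(Q, hb1[0], hb1[1], hb2[0], hb2[1],
                                    now, recovery_time=1.0))
```

With these example numbers the monitor keeps waiting at t=3.0 and triggers recovery at t=5.0, which matches the "as late as possible, but not too late" intent.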
16
Reduce Resource Usage: On-demand Execution of HAC
Performance Gain of ORTGA
HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};
A total savings of:
Relative saving:
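One plausible way to quantify the gain from these timing parameters is in terms of processor utilization. The sketch below is an assumption, not the slide's own formula: it takes C as execution time and T as period, supposes that Simplex runs both controllers all the time, that ORTGA runs only the HPC in the fault-free case, and it ignores any monitoring overhead.

```python
def ortga_savings(Cp, Tp, Ca, Ta):
    """Return (absolute, relative) utilization saved in the fault-free case.

    Assumed accounting: Simplex uses Cp/Tp + Ca/Ta, ORTGA uses Cp/Tp.
    """
    simplex_util = Cp / Tp + Ca / Ta  # HPC and HAC both always running
    ortga_util = Cp / Tp              # HAC idle until a fault is detected
    saving = simplex_util - ortga_util
    return saving, saving / simplex_util


if __name__ == "__main__":
    # Hypothetical parameters: HPC {2, 8}, HAC {1, 8}.
    saving, relative = ortga_savings(Cp=2.0, Tp=8.0, Ca=1.0, Ta=8.0)
    print(saving, relative)
```

Under this accounting the saving is simply the HAC's utilization, so the relative gain grows as the HAC becomes more expensive compared with the HPC.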
17
Ongoing Work: A proof-of-concept System
Double Inverted Pendulum System
- Double Quanser inverted pendulum with custom-made tracks
- PC/104 sized, i486 compatible system
- Customized Linux 2.6 kernel and root image in flash memory
- ORTGA middleware layer
18
Conclusions
• Feedback Based Real-Time Fault Tolerance
– Leverage feedback control of software execution
• ORTGA Architecture
– On-demand execution of the reliable core (HAC) only when a fault occurs
– Significantly reduces resource usage
• Issues and possible solutions
– How to detect faults
– When to recover to maintain system stability
– How to find the RTTR (instead of minimizing MTTR)
19
Backup Slides
20
Software Fault Model in RT Control systems
• Timing fault: misses its deadlines
• Capability abuse:
– Corrupt others’ code or data
– Unauthorized acquisition of process/resource management capability
• Semantic fault: incorrect results that can lead to:
– Poor control performance
– Instability in the plant
[Figure: fault types and countermeasures. Timing fault: GRMS; Semantic fault: Analytic Redundancy (simple & complex controllers); Capability abuse: Privilege management]
21