i ≠ 1 - The Department of Computer Science

Download Report

Transcript i ≠ 1 - The Department of Computer Science

Self-Stabilizing Systems as a Base
for Autonomic Computing
Shlomi Dolev
Yinnon Haviv, Reuven Yagel,
Olga Brukman
Trustworthy Systems: Why Is It
So Hard?
Corbató’91: "It almost goes without saying that
ambitious systems never quite work as expected“
http://larch-www.lcs.mit.edu:8001/~corbato/turing91/
"You must pay extreme attention to detail here. One
wrong bit will make things fail… "
http://my.execpc.com/~geezer/os/pm.htm
From Pentium’s manual:
“… if the ESP or SP register is 1 when the PUSH
instruction is executed, the processor shuts down
due to a lack of stack space. No exception is
generated to indicate this condition"
Mars Rover - Spirit
…The Spirit rover has a radiation-hardened R6000 CPU from
Lockheed-Martin Federal Systems…The operating system is
Wind River Systems' Vx-Works..
…attempted to allocate more files than the RAM-based directory
structure could accommodate. That caused an exception, which
caused the task that had attempted the allocation to be
suspended…
…Spirit fell silent, alone on the emptiness of Mars, trying and
trying to reboot
http://www.eetimes.com/sys/news/OEG20040220S0046
Linux and Windows do not
Stabilize
Self-Stabilization
Self-healing, Self-managing, Self-*
Recovery Oriented Computing [Berkeley,
Stanford]
Autonomic Computing [IBM]
Self-Stabilization

Self-Stabilizing algorithm for mutual exclusion in a
ring topology [Dijkstra’74]
Well
Established
Theory !
Self-Stabilizing Systems
Elegant fault tolerant approach.
Started at any state, the system convergences to a
desired behavior.
Generally used in distributed systems.

Routing, clock synchronization, leader election, etc.
Overcome transient faults in the system.

Transient faults: soft-errors (“98% of RAM errors are soft
errors”), wrong CRC during communication etc.
Self-Stabilization
The combination and type of faults cannot be
totally anticipated in on-going systems
Any on-going system must be self stabilizing (or
manually monitored)
L
E
First Self-Stabilizing Algorithm:
Token Passing
Token Passing
do forever
if x1=xn then
x1:=(x1+1)mod(n+1)
Pi(i ≠ 1):do forever
if xi≠xi-1 then
xi:=xi-1
1 P1:
2
3
4
5
6
Token Passing Cont.
{0; 0; 0; 0; 0};
{1; 0; 0; 0; 0};
{1; 1; 0; 0; 0};
{1; 1; 1; 0; 0};
Surely works when we start in
x1 = x2 = … = xn = 0.
{1; 1; 1; 1; 0};
One processor may change a
state at a time.
{2; 2; 1; 1; 1};
{1; 1; 1; 1; 1};
{2; 1; 1; 1; 1};
{2; 2; 2; 1; 1};
{2; 2; 2; 2; 1};
{2; 2; 2; 2; 2}
…
Token Passing: Faults
Transient fault, soft errors, wrong CRC,
unexpected temporal severe conditions, etc.
Assigns each processor with an arbitrary state
(in the range of its state space).
For example {3; 4; 4; 1; 0}.
p2; p4; and p5 have tokens!
Will the system ever recover?
Token Passing: Automatic
Recovery
p1 changes state infinitely often,
Otherwise, let s1 be the fixed state of p1,
p2 eventually copies s1 from p1, then
p3 eventually copies s1 from p2, then
...
pn eventually copies s1 from pn-1, then
p1 changes state.
p1 changes state in the order 4; 5; 0; 1;
2; 3; 4; 5; 0; ...
Token Passing: Automatic
Recovery Cont.
In any initial state at least one state is missing,
{4; 4; 1; 0; 2}, 3 and 5 are missing.
Once p1 reaches the missing state e.g., 5, all the
processors must copy 5, before p1 reads 5 from
pn and changes state to 0.
Will It Stabilize With mod (n - 2)?
Mod 3
{0,0,2,1,0} p1 {1,0,2,1,0} p5
{1,0,2,1,1} p4 {1,0,2,2,1} p3
{1,0,0,2,1} p2 {1,1,0,2,1}
+1 mod 3 !
Is Self-Stabilization a Toy?
Talk Outline
Self Stabilizing Microprocessor [DH04]
Self Stabilizing Operating System [DY04]
Self-Stabilization Preserving Compiler[DH05]
Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
Recovery Oriented Programming[BD05]
Self-Stabilizing Microprocessor
Overcoming Soft-Errors
Shlomi Dolev and Yinnon A. Haviv
17th International Conference on Architecture of
Computing Systems (ARCS)
Motivation
Soft-Errors: Single Event Upsets (SEU)
Caused by cosmic ray / other disruptions.
Cause a logical gate to flip its content.
Currently handled only in memories.
Significant impact on the microprocessors.
Soft-Errors - Current Solutions
Obtaining masking using probabilistic
approaches:




Information redundancy (ECC / Parity)
Space redundancy
Time redundancy
Failure detection / recovery.
Known solutions:



IBM S-390
Compaq NonStop Himalaya
IROC
Self-Stabilizing Microprocessor
 Self-stabilizing algorithms assume that the
microprocessor executes them.

Soft-errors may cause the microprocessor to be stuck
in a faulty state.
 A microprocessor is self-stabilizes if:


Started in any internal state, converges in a finite
number of steps into the set of safe states.
Microprocessor’s safe state – in which it performs
“fetch-decode-execute” cycle
Proving Convergence
 Proving that there exists no “bad” cycle in the transition graph
of the microprocessor.
 Too large ! (we must explore the entire graph)
 Using an abstraction:~ Group together states in which the
micro-code program counter is the same.
h
a
D
g
j
b
A
E
i
c
l
d
B
F
k
C
e
f
Self-Stabilizing Microprocessor:
Summary
Soft-errors are here to stay, we should:


Design our systems to mask them.
Self-stabilize following a non-masked error.
We provide methodology for validating selfstabilization property of microprocessors.
Talk Outline
Self Stabilizing Microprocessor [DH04]
Self Stabilizing Operating System [DY04]
Self-Stabilization Preserving Compiler[DH05]
Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
Recovery Oriented Programming[BD05]
Toward Self-Stabilizing Operating
System (SOS)
Shlomi Dolev and Reuven Yagel,
SAACS’04 Workshop, Zaragoza
Talk Outline
The first self-stabilizing algorithm (of Dijkstra)
Self Stabilizing Microprocessor [DH04]
Self Stabilizing Operating System [DY04]
Self-Stabilization Preserving Compiler[DH05]
Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
Recover Oriented Programming[BD05]
Basic Directions
Black-box


Take existing OS (Unix, Windows, RTOS)
Add stabilization layer
Carefully tailoring a tiny kernel



Processor scheduling
Memory management
Device allocation
Assumptions
Every configuration (processor/memory) is
possible
At least some program code is hardwired (in
ROM) and is correct – Harvard Model
Processor:


Instruction manual (e.g. x86\IA-32) defines a
transition function.
Self-stabilizing [DH04]
Black Box
Requirements:


Defining a legal execution is usually impractical
At least - restore original state (variables + code), infinitely often
Periodic Reset Re-install and Execute



Watchdog timer (self-stabilizing)
Periodic processor reset
During bootstraps OS reinstall from ROM
Weak self-stabilization


E = (ci, ai, ci+1, …., RRE, c1, a1, c2, a2, …., ci, ai, ci+1, …., RRE, c1, a1,
c2, a2, ….
Is it always acceptable?
Alternative: Periodic re-install code only, add consistency
check and enforcement
Tailored Kernel
Tiny Scheduler Tiny Memory Manager
Requirements:



Self-stabilizing
Fair
Process stabilization preserving (e.g. validity of
P.C. value)
Tiny SOS Scheduler
~70 lines of a real
machine assembly
code
16bit Real mode &
32bit Protected mode.
Standard build and
emulation tools
(Nasm, ld, Bochs)
Detailed proof of
requirement
preservation
; increase task
10 mov word ax, [currentProc]
11 and ax, PROC_MASK
...
; load task state
...
;restore ip
52 mov ax, [bx+4]
;validate ip
53 and ax, IP_MASK
54 mov word [ss:STACK TOP], ax
;restore general registers
55 mov cx, word [bx+12]
56 mov dx, word [bx+14]
57 mov si, word [bx+16]
58 mov di, word [bx+18]
Tiny SOS Memory Manager
Requirements:


Consistency of memory hierarchy
Self-stabilization preservation
Tiny SOS Scheduler
Clock tick
/ execute next
Any State
Some Error
Establish Scheduler
Consistency
Some Error
Some Error
Process(ing)
Next Process
Validated & Ready
Tiny SOS Scheduler
Clock tick
/ execute next
Any State
NMI / load PC with
scheduler handler
Process(ing)
Establish Scheduler
Consistency
Next Process
Validated & Ready
Sketch of Proof
In every execution E, the code of the scheduler
is started to be executed and is executed from
the first instruction to the last instruction
infinitely often
In every execution E of the scheduler each
process is executed infinitely often
The self-stabilizing scheduler preservers
stabilization of processes.
Talk Outline
Self Stabilizing Microprocessor [DH04]
Self Stabilizing Operating System [DY04]
Self-Stabilization Preserving Compiler[DH05]
Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
Recover Oriented Programming[BD05]
Self-Stabilization Preserving
Compiler
Shlomi Dolev, Yinnon A. Haviv,
Department of Computer Science
Ben-Gurion University, Israel
Mooly Sagiv,
Department of Computer Science
Tel Aviv University, Israel
Motivation
Transient malfunctions.
Single processor:


Hardware glitches.
Soft-Errors.
Distributed environment:


Processor crashes / recoveries.
Link errors.
Resulting in an unpredictable system state.
Coping with Transient Errors
Masking (safety factor) achieved by:


Information redundancy (e.g., ECC).
Time/Space redundancy. (e.g., TMR)
Self-Stabilization [Dijkstra74]:



Assuming any system state (caused by errors).
Recovering by converging into legal behavior.
Existing algorithms for distributed tasks:
 Routing, leader election, mutual exclusion, etc.
Self-Stabilizing Algorithms –
a Solution to Soft-Errors?
Self-Stabilizing algorithm assumes that the
microprocessor executes it.

Soft-Errors may cause the microprocessor to be
stuck in a faulty state.
Composition of self-stabilizing algorithms
creates a self-stabilizing system.

Make the microprocessor eventually fetch-decodeexecute machine code.
The Gap.
Need a transformation between:


Input program P written in a high abstraction
language, e.g., (D)ASM.
Output program Q in a machine language, say, JVM.
Existing compilers?


P and Q behaves the same when started in the
initial state.
What if Q reaches an unexpected state due to
soft-error experienced by microprocessor?
Trivial Example
A statement of the form:
For each i in {0..9} do f(i)
May be compiled to 
Start with cx=12 inside the
loop…
Moreover: Any runtime
mechanism can get stuck /
inconsistent.
mov ax, 10
mov cx, 0
loop1:
push cx
call f
inc cx
cmp cx,ax
jne loop
Stabilization Preserving
Compiler – a closer look
Ensuring that Q eventually behaves as P:
State space of P
State space of Q
Self-Stabilization Preserving
Compiler: Summary
Front end of compiler established.
Typed version of ASM.
JavaCC as a parser generator.
Interpreter (used as a model).
Fast stabilization vs. optimizations.
Self Stabilization preserving compiler.
 Language with clear semantics from any state.
 Innovative demands from compiler.
Talk Outline
Self Stabilizing Microprocessor [DH04]
Self Stabilizing Operating System [DY04]
Self-Stabilization Preserving Compiler[DH05]
Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
Recover Oriented Programming[BD05]
Self-Stabilization and Evolving
Systems
Real world systems cannot be verified exhaustively…
We enforce safety and live-ness specifications
Contract between the client, project manager and
programmers, that is checked on line!
Make sure that the additional (thin) monitoring and
recovering layer is self-stabilizing
A change can be made to the
implementation/specification
to support evolving environments
Self-Stabilizing Recoverer for
Eventual Byzantine Software
Olga Brukman, Shlomi Dolev
Department of Computer Science
Ben-Gurion University, Israel
Hillel Kolodner,
Haifa Research Labs
IBM, Israel
Software Contains Bugs
Heisenbugs, corrupt states, leaked resources are
common…
Correct and faultless SW is hard

Long-lived running programs, e.g., OS
Usually software is tested when starting from
initial state and considering limited time
scenarios.
Fault Model Reflecting
Reality
Software packages can be trusted to work as
required after restart.
Eventual Byzantine software.
System administrators and users use reboot to
deal with faults.
Middleware Architecture
<Preds,RActs>1
<Preds,RActs>2
…
<Preds,RActs>n
OMR
Kernel
OS
Monitor-Restarter for Process
and Subsystem
<Pred,RActs>1
<Pred,RActs>2
…
Restart Actions – Mature
Approach
Subsystem waits for completion of a restart
of its components.
Restart action may vary, depending on
component internal state.



Reschedule
Roll-back
Kill & Restart
Few restart attempts with more drastic
restart actions.
Computational Model: rsf-execution
An execution E is rsf (restart supporting fair)execution iff E is a fair execution in which
every subsystem subi that is initialised during E
respects its specification function ssi.
Requirement: Every rsf-execution E has a
suffix in which the system respects its
specification function ss.
Tools for Implementation – Black
Box Approach
Software package is a black box.
Package is monitored by recording it’s IO (e.g.,
strace in Linux).
Monitors are independent of specific
implementation
Tools for Implementation –
Transparent Box Approach
Software package implementation tool is known.
Run-Time Reflection tools are used to monitor
and restart the package.
Possible in Java, C++, CORBA, COM.
Practical Experience: Printers
Problem
Corrupted pdf, doc or ps file sent to printing
server.
Printer can’t print the file.
Cause retries by printing server

Printer is “stuck” on one job.
Predicate for printing server:
 Restrict number of retries, try format conversions,
send error message to user.
Self-Stabilizing Automatic
Recoverer: Summary
Theory foundations of self-stabilization and
restart techniques could serve as a basis for the
new paradigms.
General framework for design and correctness
proof for autonomic recoverer.
Printers experience coordinated with IBM.
Recovery Oriented Programming
Olga Brukman and Shlomi Dolev
Department of Computer Science
Ben-Gurion University, Israel
Towards Robust Software
Programming

Structural programming, OOD, Design Patterns…
Testing and debugging


Unit testing [JUnit, CppUnit]…
Design By Contract (Eiffel) …
Formal specification languages

ASM, IO Automata, NURPL
Model checking
Online recovery


ROC [PBB02].
Self-Stabilizing Autonomic Recoverer for Eventual Byzantine
Software [BDK03] - black box software packages.
Our Contribution
Program invariants derived from design specifications.
 Checked every time invariant variables are updated.
Automatic code generation for invariant verification
and recovery upon invariant violation.
Invariants are verified during runtime.
 Change of invariant variable is pre-checked in sandbox.
 Violation is prevented and replaced
with a recovery action.
Our Contribution Cont.
Recovery action is chosen depending on the
current state and history.
 Roll back & resume.
 Wait.
 Reschedule.
 Kill & restart.
External Monitoring
Monitoring the whole task to avoid
 transient faults occurrence after which
invariant variables are not changed ( and no
invariant checks are done)
 liveness problem – monitor over time
Producer Consumer Example
Producer
Producer, Consumer –
threads
Queue – a circular
bounded length queue

Queue


Consumer

int queue[size]
int start – position of
current first element in the
queue
int end – position of the
first empty place in the
queue
boolean empty
Invariants for Queue
start, end are in some values range
({0,..,size-1})
Queue is not empty iff start != end
A possible invariant is:
(start in {0,..,size-1}, end in{0,.., size-1}) &&
(start != end => empty == false)
Given Implementation Code
class Queue {
private int[] queue; private int start, end, size; private boolean empty;
public Queue(int size){
queue = new int[size];
start = 0; end =0; empty= true;
}
public synchronized int dequeue() {
int result;
while (empty)
wait();
result = queue[start];
start= (start + 1) % size;
empty = (start == end) ? true;
notifyAll();
return result;
}
public synchronized void enqueue(int value) {
while (start == end && !empty) //while full
wait();
queue[end] = value;
end = (end + 1) % size; empty=false;
notifyAll();
}
}
Code Transformation
class Queue {
invariant: (start in {0,..,size-1}, end in {0,.., size-1}) &&
(start != end => empty == false)
recovery actions: {this = new Queue(); empty = false;}
private int[] queue; private int start, end, size; private boolean empty;
public Queue(int size){
queue = new int[size];
atomic{ start = 0; end =0; empty= true;}
}
public synchronized int get() {
int result;
while (empty)
wait();
result = queue[start];
atomic{ start= (start + 1) % size; empty = (start == end) ? true;}
notifyAll();
return result;
}
Code Transformation Cont.
public synchronized void put(int value) {
while (start == end && !empty) //while full
wait();
queue[end] = value;
atomic{ end = (end + 1) % size; empty=false;}
notifyAll();
}
}
External Monitoring
The possible invariant is:
(start in {0,..,size-1}, end in {0,.., size-1}) &&
(start != end => empty = false) &&
(start = end && empty = false => producer.wait()) &&
(empty = true =>consumer.wait()) &&
[(producer.wait() && !consumer.wait()) ||
(!producer.wait() && consumer.wait())]
New recovery actions: interrupt
producer/consumer and initialize it.
Recovery Oriented
Programming: Summary
Programming with self-stabilization
enforcing.
Eventually safe execution.
Talk Conclusions
Self-Stabilization as an effective paradigm for
creating robust systems.
Rigorous approach for designing basic system
components




Microprocessor
Operating system
Compiler
Evolving and Recovery Oriented