Self-stabilizing Operating Systems

Download Report

Transcript Self-stabilizing Operating Systems

Self-Stabilizing
Device Drivers
Shlomi Dolev and Reuven Yagel,
Computer Science Department Ben-Gurion University of
the Negev, Beer-Sheva, Israel
{dolev|yagel}@cs.bgu.ac.il
11/06
Motivation for
Self-stabilizing OS
• Coping with transient faults
(e.g. soft-error (SEU) in a remote sensor
results unpredictable system states)
• Growing interest in self-* / autonomic
computing systems
• Self-stabilizing algorithms/programs
assume hardware and operating system are
also stabilizing
• Pentium HALTING problem: “… if the ESP or
SP register is 1 when the PUSH instruction
is executed, the processor shuts down…”
2
Proposed solution
• To build according to the well
defined and understood
paradigm of self-stabilization
(traditionally used in
distributed systems)
• Thereby achieving:
trustworthiness,
dependability, self-healing,
automatic recovery,
adaptive systems, …
3
OS Dependability
• Past examples:
– Dijkstra, “THE” Multiprogramming System ’68
(Layered Approach)
– Denning, Lampson ~’70 (Protection)
– KeyKOS ’85 (Capabilities)
– EROS ’92 (Checkpoints)
– Micro-kernel ~‘90, Exo-kernel ’94 (Minimal TCB)
• Recent:
– IBM: K42 (Autonomic Computing), SUN: Solaris 10,
Predictive Self-Healing
– MSR: Singularity (managed code, protection,
verification)
– JHU: The Coyotos Secure Operating System
(Capabilities, type safety)
– HW e.g. Intel : Itanium 2, Machine Check Arch.
4
Goal: Autonomic Computer
• Following any sequence of transient faults, the
(operating) system converges
• Using self stabilization:
– A system can be started in an arbitrary state and
converge to a desired behavior
– Using fair composition to run hardware + OS
layaers
• BGU: Self-stabilizing systems, tools & paradigms
– Microprocessor [DH04]
– Operating System [DY04, DY05, DY06]
– Compiler [DH05]
– Framework: autonomic recoverer [BDK03, BD06]
– Middleware: File System [DK02], Group Comm. [DS01]
5
SOS - Directions
• Black-box
– Take existing (Desktop/Real-time) OS
– Add stabilization layer
– Detailed formal specification needed
• Carefully tailoring a tiny kernel
– Processor scheduling [SAACS04]
– Memory management [SOSP05, SSS05]
– Device drivers [SSS06]
6
A problem has been detected and Windows has been shut down to
prevent damage to your computer.
PFN_LIST_CORRUPT
If this is the first time you've seen this error screen,
restart your computer. If this screen appears again, follow
these steps:
Check to make sure any new hardware or software is properly
installed.
If this is a new installation, ask your hardware or software
manufacturer
for any Windows updates you might need.
If problems continue, disable or remove any newly installed
hardware
or software. Disable BIOS memory options such as caching or
shadowing.
If you need to use Safe Mode to remove or disable components,
restart
your computer, press F8 to select Advanced Startup Options, and
/* Third
then
select Safe Mode.
Technical information:
*** STOP: 0x0000004e (0x00000099, 0x00000000, 0x00000000,
0x00000000)
Edition UNIX, early 1973 */
panic(s)
char *s;
{
Beginning dump of physical memory
Physical memory dump complete.
Contact your system administrator or technical support group for
further
assistance.
prproc();
update();
printf("panic: %s\n", s);
for(;;)
idle();
}
7
Assumptions
• Every configuration (processor registers/
memory/IO controllers) is possible
• (Some) program code is hardwired (in ROM)
and is correct [Castro & Liskov, 2000]
• Stabilization of other layers
• CPU instruction manual (e.g. Pentium) defines
a transition function
– Restrict access to privileged instructions
and registers (e.g. HLT, LIDT)
– Segmentation for enforcing isolation
8
Method
• Specify main (additional) requirements
for each OS function
• Evolve self-stabilizing solutions that
follow computer-architecture/OS
progress
• Detailed proof for self-stabilization of
algorithms AND implementation
• Using processor instruction set
– Don’t rely on existing compilers
9
Device Driver Challenges
• Practically an essential part of any OS
• Co-operation with external controllers
– Adaptation (connecting interfaces)
– Translation (abstracting e.g. speed
variation)
– (and sometimes adding functionality)
• ”a modern Seagate drive contains roughly
400,000 lines of code” [IRON FS]
• “modern disk controllers often have many
megabytes of memory inside the controller”
[Minix]
10
Unreliable Components
• “In Windows XP, for example, device
drivers cause 85% of reported failures”
[Nooks]
• ”Since drivers account for the majority of
the code (over 70% in this release)…
…the error rate for drivers is almost
seven times higher than the error rate for
the rest of the kernel” [Engler et. al.]
11
Where is the Problem?
• Device drivers are loaded into kernel and
run in privileged mode
• No monitoring of operation & resource
usage
• Usually external & less tested code
• Not-stabilizing!
• "malicious driver can program a DMA capable
device to overwrite any part of memory...We
anticipate that future hardware will provide
memory protection for DMA transfers."
[Singularity]
12
What Has Been Done?
• Isolation: Micro-kernel [Mach], user mode
drivers [Minix], Monitoring [Nooks]
• Virtualization [Xen, …]
• Type safety [Singularity, Coyotos]
• Model Checking [SLAM]
• ”our goal is to prevent the vast majority of
driver-caused crashes with little or no change to
existing driver” [Nooks]
13
MSR Sinuglarity Project
• Achieving OS dependability through advances in
languages (compilers), tools (verifiers and
monitors).
• Device drivers isolation through Software
Isolated Processes (SIP) and resource useage
specification
• Recently adding hardware protection:
“increased use of hardware error detection and
correction techniques, such as error correcting
codes (ECC) on memory and data paths, to ensure
the assumption of correct execution that
underlies all program safety” [MSR-TR-2006-43]
14
Example: ATA Standard
• 1990 ATA-1 a de-facto protocol for
storage devices (ANSI 1994)
• T13 Technical Committee
http://www.t13.org/#Projects
• Current draft: ATA-8
• ATA ZOO: AT-Attachment, (E-)IDE,
ATAPI, SATA/PATA
15
ATA Model
16
Host:
ATA Model
:Device
17
ATA Non-stability
18
Device-Driver Requirements
• Error model: state of device driver &
controller program (including PC) might
get any value
• Ping-pong: infinitely I/O commands in
which there is update & read of the device
buffer
• Progress: eventually every (and only) I/O
request is executed completely and
correctly according to the ATA spec.
19
Solution 1: Leasing
20
Solution 1: Leasing
• The driver reaches its idle state
infinitely often
• From any configuration a safe (idle)
configuration is eventually reached –
device reset
• From there ping-pong holds
• Eventually progress holds
21
Solution 2: Consistency Checks
• Device driver contains consistency
checker
• Consistency of driver current state
with a snapshot of the controller
22
Solution 2: Consistency Checks
• Eventually ping-pong holds
• Eventually progress holds
• Combining the 2 solutions:
– varying snapshot sizes according to
degree of device stabilization
– fallback to leases and restarts
23
Future Work
• I/O device drivers
– Protocols for co-operation of more than
one microprocessor
– Detailed driver / General monitoring layer
• Gather the different parts
24
Conclusion
• The work shows theoretical and practical
ways to achieve the goal of a self-stabilizing
OS, particularly the device drivers part
• By supplying an infrastructure for practical
self-stabilizing systems, robust and
dependable systems can be achieved
• http://www.cs.bgu.ac.il/~yagel/sos
25