Robust and Private Cloud

Stabilization
Enabling Technology
Shlomi Dolev
Trustworthy Systems: Why
is it So Hard?
• Corbató’91: "It almost goes without saying
that ambitious systems never quite work as
expected"
http://larch-www.lcs.mit.edu:8001/~corbato/turing91/
• "You must pay extreme attention to detail
here. One wrong bit will make things fail… "
http://my.execpc.com/~geezer/os/pm.htm
• From Pentium’s manual:
"… if the ESP or SP register is 1 when the
PUSH instruction is executed, the processor
shuts down due to a lack of stack space. No
exception is generated to indicate this
condition"
Mars Rover - Spirit
• …The Spirit rover has a radiation-hardened R6000 CPU
from Lockheed-Martin Federal Systems…The operating
system is Wind River Systems' Vx-Works..
• …attempted to allocate more files than the RAM-based
directory structure could accommodate. That caused an
exception, which caused the task that had attempted
the allocation to be suspended…
• …Spirit fell silent, alone on the emptiness of Mars,
trying and trying to reboot
http://www.eetimes.com/sys/news/OEG20040220S0046
Linux and Windows do not
Stabilize
Self-Stabilization
• Self-healing, Self-managing, Self-*
• Recovery Oriented Computing [Berkeley,
Stanford]
• Autonomic Computing [IBM]
• Self-Stabilization
• Self-stabilizing algorithm for mutual
exclusion in a ring topology [Dijkstra’74]
Well-established theory!
Self-Stabilization
• The combination and type of faults cannot
be totally anticipated in on-going systems
• Any on-going system must be self stabilizing
(or manually monitored)
First Self-Stabilizing Algorithm:
Token Passing [Dij74]
P1: do forever
      if x1 = xn then
        x1 := (x1+1) mod (n+1)
Pi (i ≠ 1): do forever
      if xi ≠ x(i-1) then
        xi := x(i-1)
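The token-passing rules above can be tried out with a small simulation. This is a minimal sketch, assuming a randomly chosen privileged processor moves at each step (the slides only say one processor changes state at a time); the function names `privileged`, `step` and `run_until_single_token` are illustrative and not from the talk.

```python
import random

def privileged(x):
    """Return the 0-based indices of processors that currently hold a token."""
    n = len(x)
    holders = [0] if x[0] == x[n - 1] else []
    holders += [i for i in range(1, n) if x[i] != x[i - 1]]
    return holders

def step(x, i):
    """Let processor i execute one atomic step (no-op if it holds no token)."""
    n = len(x)
    if i == 0:
        if x[0] == x[n - 1]:
            x[0] = (x[0] + 1) % (n + 1)
    elif x[i] != x[i - 1]:
        x[i] = x[i - 1]

def run_until_single_token(x, rng, max_steps=10_000):
    """Schedule random token holders until exactly one token remains."""
    for steps in range(max_steps):
        holders = privileged(x)
        if len(holders) == 1:
            return steps
        step(x, rng.choice(holders))
    raise RuntimeError("did not stabilize within max_steps")

rng = random.Random(0)                 # assumed scheduler: uniform random choice
state = [3, 4, 4, 1, 0]                # an arbitrary (corrupted) start
print("token holders at start:", privileged(state))
steps = run_until_single_token(state, rng)
print("single token after", steps, "steps:", state)
```

Index 0 plays the role of p1, so the start configuration reports holders 1, 3 and 4, that is p2, p4 and p5.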
Token Passing Cont.
• Surely works when we start in
x1 = x2 = … = xn = 0.
• One processor may change a state at a time.
{0; 0; 0; 0; 0};
{1; 0; 0; 0; 0};
{1; 1; 0; 0; 0};
{1; 1; 1; 0; 0};
{1; 1; 1; 1; 0};
{1; 1; 1; 1; 1};
{2; 1; 1; 1; 1};
{2; 2; 1; 1; 1};
{2; 2; 2; 1; 1};
{2; 2; 2; 2; 1};
{2; 2; 2; 2; 2}
…
Token Passing: Faults
• Transient faults, soft errors, wrong CRC, unexpected temporary severe conditions, etc.
• A fault assigns each processor an arbitrary state (in the range of its state space).
• For example, {3; 4; 4; 1; 0}.
• p2, p4, and p5 have tokens!
• Will the system ever recover?
Token Passing: Automatic
Recovery
• p1 changes state infinitely often.
• Otherwise, let s1 be the fixed state of p1:
• p2 eventually copies s1 from p1, then
• p3 eventually copies s1 from p2, then
• ...
• pn eventually copies s1 from pn-1, then
• p1 changes state, a contradiction.
• p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...
Token Passing: Automatic
Recovery Cont.
• In any initial state at least one state is
missing, {4; 4; 1; 0; 2}, 3 and 5 are
missing.
• Once p1 reaches the missing state e.g., 5,
all the processors must copy 5, before p1
reads 5 from pn and changes state to 0.
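Will It Stabilize With mod (n-2)?
Mod 3 (n = 5):
{0,0,2,1,0} p1 → {1,0,2,1,0} p5 →
{1,0,2,1,1} p4 → {1,0,2,2,1} p3 →
{1,0,0,2,1} p2 → {1,1,0,2,1}
The last configuration is the first one with +1 mod 3 applied to every entry, so the same schedule can repeat forever!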
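The mod 3 execution on this slide can be replayed mechanically. The sketch below assumes the same token rules as the earlier slides with the modulus replaced by 3; the helper names `tokens`, `step` and `replay` are illustrative. It checks that the schedule p1, p5, p4, p3, p2 shifts every entry by +1 mod 3 and never reduces the number of tokens to one.

```python
def tokens(x):
    """0-based indices of processors that hold a token."""
    n = len(x)
    t = [0] if x[0] == x[n - 1] else []
    return t + [i for i in range(1, n) if x[i] != x[i - 1]]

def step(x, i, k):
    """Processor i moves; k is the modulus used by the first processor."""
    if i not in tokens(x):
        raise ValueError(f"processor {i} holds no token in {x}")
    if i == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[i] = x[i - 1]

def replay(start, schedule, k):
    """Apply the schedule and return the final state and the fewest tokens seen."""
    x = list(start)
    fewest = len(tokens(x))
    for i in schedule:
        step(x, i, k)
        fewest = min(fewest, len(tokens(x)))
    return tuple(x), fewest

start = (0, 0, 2, 1, 0)
schedule = [0, 4, 3, 2, 1]             # p1, p5, p4, p3, p2 from the slide
end, fewest = replay(start, schedule, k=3)
print(end, "fewest tokens:", fewest)
```

Repeating the schedule three times returns the ring to its start configuration, so this execution cycles without ever reaching a single-token state.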
Is Self-Stabilization a
Toy?
Stabilization Stack
• Self-Stabilizing Microprocessor [DH04, DH06]
• Self-Stabilizing Operating System [DY04]
• Self-Stabilization Preserving Compiler [DH05]
• Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03]
• Recovery Oriented Programming [BD05]
Implementation
Bottleneck
• Ask Intel, AMD, IBM to design a
self-stabilizing microprocessor…
• Technology for converting an off-the-shelf
processor to be self-stabilizing [DH06]
• Ask Microsoft, IBM, Red Hat to convert
existing OS code to be self-stabilizing…
• Stabilizing Virtual Machine [DY07]
Enforcing stabilization by
resetting
• Processors behave correctly after reset
• Periodic reset ensures correct behavior
• But damages closure…
• Need careful solutions
Periodic Reset Monitor
• Find a location P in OS code reached at least
every T time
• At P:
– Save necessary information to RAM
– Request a reset and loop forever.
• Stabilizing watchdog accepts request and
resets processor
• Upon reset: restore information and continue
• Stabilizing watchdog verifies that a reset is
performed at least every T + epsilon time
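The interplay between the OS and the watchdog can be illustrated with a toy tick-based simulation. This is only a sketch under stated assumptions: time is counted in discrete ticks, `period` stands for T, `deadline` stands for T + epsilon, and the class and function names are hypothetical, not taken from the implementation described in the talk.

```python
class Watchdog:
    """Counts ticks since the last reset; the counter may start corrupted."""

    def __init__(self, deadline, counter=0):
        self.deadline = deadline
        self.counter = counter

    def tick(self, reset_requested):
        # Clamp first, so an arbitrary counter value cannot postpone a reset.
        self.counter = min(max(self.counter, 0), self.deadline) + 1
        if reset_requested or self.counter > self.deadline:
            self.counter = 0
            return True            # reset the processor now
        return False

def simulate(ticks, period, deadline, hang_at=None):
    """Run the OS for `ticks` ticks and return (reset times, work done)."""
    wd = Watchdog(deadline)
    saved = {"work": 0}            # information saved to RAM at location P
    work, pc, hung = 0, 0, False
    resets = []
    for t in range(ticks):
        request = False
        if t == hang_at:
            hung = True            # injected fault: the OS stops making progress
        if not hung:
            work += 1
            pc += 1
            if pc >= period:       # location P is reached
                saved["work"] = work
                request = True
        if wd.tick(request):
            resets.append(t)
            work, pc, hung = saved["work"], 0, False   # restore and continue
    return resets, work

print(simulate(20, period=5, deadline=7))
print(simulate(20, period=5, deadline=7, hang_at=6))
```

In the second run the OS hangs before reaching P again, and the watchdog forces the reset once the deadline passes.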
Implementation
using Intel XScale core
• Used in numerous processors
– Network, I/O, Handheld, Cellular etc.
• RISC architecture (ARMv5 compatible)
• Debug interface
– Allows interaction between WD and OS
– External debug break used for notifying the upcoming reset
Up to now
• Virtual Self-stabilizing processor on top
of commercial quality processor
• Towards repeating the concept in OSs
and VMMs (enforcing configuration and
protecting critical operations)
Toward Self-Stabilizing
Operating System (SOS)
Shlomi Dolev and Reuven Yagel,
SAACS’04 Workshop, Zaragoza
Basic Directions
• Black-box
– Take existing OS (Unix, Windows, RTOS)
– Add stabilization layer
• Carefully tailoring a tiny kernel
– Processor scheduling
– Memory management
– Device driver
– Hosting Byzantine processes
Assumptions
• Every configuration
(processor/memory) is possible
• At least some program code is
hardwired (in ROM) and is correct –
Harvard Model
• Processor:
– Instruction manual (e.g. x86/IA-32) defines a transition function.
– Self-stabilizing [DH04]
Black Box
• Periodic Reset, Re-install and Execute (RRE)
– Watchdog timer (self-stabilizing)
– Periodic processor reset
– During bootstrap the OS is reinstalled from ROM
• Weak self-stabilization
– E = (ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …, ci, ai,
ci+1, …, RRE, c1, a1, c2, a2, …
– Is it always acceptable?
• Alternative: Periodic re-install code only,
add consistency check and enforcement
Tailored Kernel
• Tiny Scheduler, Tiny Memory Manager
• Requirements:
– Self-stabilizing
– Fair
– Process stabilization preserving (e.g. validity of P.C. value)
Tiny SOS Scheduler
• ~70 lines of real machine assembly code
• 16-bit Real mode & 32-bit Protected mode.
• Standard build and emulation tools (Nasm, ld, Bochs)
• Detailed proof of requirement preservation

; increase task
mov word ax, [currentProc]
and ax, PROC_MASK
...
; load task state
...
; restore ip
mov ax, [bx+4]
; validate ip
and ax, IP_MASK
mov word [ss:STACK_TOP], ax
; restore general registers
mov cx, word [bx+12]
mov dx, word [bx+14]
mov si, word [bx+16]
mov di, word [bx+18]
Sketch of Proof
• In every execution E, the scheduler code is executed from its first instruction to
its last instruction infinitely often
• In every execution E of the scheduler,
each process is executed infinitely often
• The self-stabilizing scheduler preserves the
stabilization of processes.
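The masking used in the scheduler listing (`and ax, PROC_MASK`, `and ax, IP_MASK`) forces a possibly corrupted value back into its legal range before it is used. The Python sketch below illustrates that idea only; the table sizes, the round-robin increment and the name `schedule_next` are assumptions made for the example, and masking works this way only when the sizes are powers of two.

```python
NUM_PROCS = 4                  # assumed: a power of two, so masking yields a valid index
PROC_MASK = NUM_PROCS - 1
CODE_SIZE = 16                 # assumed size of each process's code area
IP_MASK = CODE_SIZE - 1

def schedule_next(current_proc, saved_ip):
    """Pick the next process and a validated instruction pointer.

    current_proc and the entries of saved_ip may hold arbitrary 16-bit
    values after a transient fault; masking maps them to legal values.
    """
    nxt = (current_proc + 1) & PROC_MASK
    ip = saved_ip[nxt] & IP_MASK
    return nxt, ip

saved_ip = [0x1234, 0xFFFF, 3, 0x00F0]   # partly corrupted saved state
cur = 0xBEEF                             # corrupted process index
for _ in range(NUM_PROCS):
    cur, ip = schedule_next(cur, saved_ip)
    print(cur, ip)
```

Whatever value the process index starts with, every process is selected once within the next NUM_PROCS scheduling steps, which is the fairness property the proof sketch relies on.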
Talk Outline
• Self-Stabilizing Microprocessor [DH06]
• Self-Stabilizing Operating System [DY04]
• Self-Stabilization Preserving Compiler [DH05]
• Self-Stabilizing Automatic Recoverer for Eventual Byzantine Software [BDK03]
• Recovery Oriented Programming [BD05]
Self-Stabilization
Preserving Compiler
Shlomi Dolev, Yinnon A. Haviv,
Department of Computer Science
Ben-Gurion University, Israel
Mooly Sagiv,
Department of Computer Science
Tel Aviv University, Israel
The Gap.
• Need a transformation between:
– Input program P written in a high-abstraction language, e.g., (D)ASM.
– Output program Q in a machine language, say, JVM.
• Existing compilers?
– P and Q behave the same when started in the initial state.
– What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?
Trivial Example
• A statement of the form:
For each i in {0..9} do f(i)
• May be compiled to:
mov ax, 10
mov cx, 0
loop1:
push cx
call f
inc cx
cmp cx,ax
jne loop1
• Start with cx=12 inside the loop…
• Moreover: Any runtime mechanism can get stuck / inconsistent.
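The effect of starting inside the loop with cx=12 can be seen in a small model of the compiled loop. This is an illustrative sketch, assuming a 16-bit counter that wraps around; the alternative exit test (leave as soon as cx is at or above the limit) is one possible hardening chosen for the example, not the transformation proposed in the talk.

```python
def run_loop(cx, limit=10, exit_when_equal=True, max_iters=100):
    """Model the loop body from an arbitrary counter value.

    Returns the number of calls to f before the loop exits, or None if it
    is still running after max_iters iterations.
    """
    calls = 0
    while calls < max_iters:
        calls += 1                    # call f
        cx = (cx + 1) & 0xFFFF        # inc cx on a 16-bit register
        if exit_when_equal:
            if cx == limit:           # exit test of the form "jne loop1"
                return calls
        elif cx >= limit:             # assumed alternative: unsigned "below" test
            return calls
    return None

print(run_loop(0))                           # intended start
print(run_loop(12))                          # corrupted counter, equality test
print(run_loop(12, exit_when_equal=False))   # corrupted counter, range test
```

With the equality test the corrupted run keeps calling f until the counter wraps all the way around to 10, while the range test leaves the loop after one iteration.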
Stabilization Preserving
Compiler – a closer look
Ensuring that Q eventually behaves as P:
State space of P
State space of Q
The Transformation
Source program:
Variable declarations
upon <condition_1> do <statement_1>
…
upon <condition_n> do <statement_n>
Compiled program:
Enforce invariants
Scheduler: condition_1 … condition_n
Statement_1 … Statement_n
Self-Stabilization Preserving
Compiler: Summary
• Front end of compiler for ASM.
• Self-stabilization preserving compiler.
• Language with clear semantics from any state.
• New demands for a compiler.
Talk Outline
• Self Stabilizing Microprocessor [DH04]
• Self Stabilizing Operating System
[DY04]
• Self-Stabilization Preserving
Compiler[DH05]
• Self-Stabilizing Automatic Recoverer For
Eventual Byzantine Software [BDK03]
• Recovery Oriented Programming [BD05]
Self-Stabilization and
Evolving Systems
• Real-world systems cannot be verified exhaustively…
• We enforce safety and liveness specifications
• Contract between the client, project manager
and programmers, that is checked online!
• Make sure that the additional (thin) monitoring
and recovering layer is self-stabilizing
• A change can be made to the
implementation/specification
to support evolving environments
Self-Stabilizing Recoverer
for Eventual Byzantine
Software
Olga Brukman, Shlomi Dolev
Department of Computer Science
Ben-Gurion University, Israel
Hillel Kolodner,
Haifa Research Labs
IBM, Israel
Software Contains Bugs
• Heisenbugs, corrupt states, leaked
resources are common…
• Correct and faultless SW is hard
– Long-lived running programs, e.g., OS
• Usually software is tested when starting
from the initial state and considering limited
time scenarios.
Fault Model Reflecting
Reality
• Software packages can be trusted to
work as required after restart.
• Eventual Byzantine software.
• System administrators and users use
reboot to deal with faults.
Middleware Architecture
<Preds,RActs>1
<Preds,RActs>2
…
<Preds,RActs>n
OMR
Kernel
OS
Monitor-Restarter for
Process and Subsystem
<Pred,RActs>1
<Pred,RActs>2
…
Restart Actions – Mature
Approach
• Subsystem waits for completion of a
restart of its components.
• Restart action may vary, depending on
component internal state.
– Reschedule
– Roll-back
– Kill & Restart
• Few restart attempts with more
drastic restart actions.
Computational Model: rsf-execution
• An execution E is rsf (restart supporting
fair)-execution iff E is a fair execution
in which every subsystem subi that is
initialised during E respects its
specification function ssi.
Requirement: Every rsf-execution E
has a suffix in which the system
respects its specification function ss.
Tools for Implementation –
Black Box Approach
• Software package is a black box.
• Package is monitored by recording its I/O
(e.g., strace in Linux).
• Monitors are independent of specific
implementation
Tools for Implementation –
Transparent Box Approach
• Software package implementation tool is
known.
• Run-Time Reflection tools are used to
monitor and restart the package.
• Possible in Java, C++, CORBA, COM.
Practical Experience: Printers
Problem
• Corrupted pdf, doc or ps file sent to
printing server.
• Printer can’t print the file.
• Causes retries by printing server
– Printer is “stuck” on one job.
• Predicate for printing server:
– Restrict number of retries, try format
conversions, send error message to user.
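The printing-server predicate can be sketched as a bounded retry loop. Everything in this example is hypothetical: `MAX_RETRIES`, `handle_job`, and the `send_to_printer` and converter callables stand in for whatever the real printing server exposes.

```python
MAX_RETRIES = 3    # assumed bound on retries per job variant

def handle_job(job, send_to_printer, converters=()):
    """Try to print a job without letting it block the printer forever.

    send_to_printer(data) returns True on success; each converter maps the
    job to another format. Both are supplied by the caller.
    """
    variants = [job]
    variants += [convert(job) for convert in converters]
    for data in variants:
        for _ in range(MAX_RETRIES):
            if send_to_printer(data):
                return "printed"
    return "error reported to user"

def fake_printer(data):
    """Stand-in printer that only accepts data starting with a PDF header."""
    return data.startswith(b"%PDF")

def ps_to_pdf(data):
    """Stand-in format conversion used only for this example."""
    return b"%PDF" + data

print(handle_job(b"%!PS corrupted", fake_printer, [ps_to_pdf]))
print(handle_job(b"garbage", fake_printer))
```

The number of attempts is bounded by MAX_RETRIES times the number of variants, so one bad job cannot keep the printer stuck.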
Recovery Oriented
Programming
Olga Brukman and Shlomi Dolev
Department of Computer Science
Ben-Gurion University, Israel
Towards Robust Software
• Programming
– Structural programming, OOD, Design Patterns…
• Testing and debugging
– Unit testing [JUnit, CppUnit]…
– Design By Contract (Eiffel)…
• Formal specification languages
– ASM, IO Automata, NURPL
• Model checking
• Online recovery
– ROC [PBB02].
– Self-Stabilizing Autonomic Recoverer for Eventual
Byzantine Software [BDK03] - black box software
packages.
Our Contribution
• Program invariants derived from design specifications.
Checked every time invariant variables are updated.
• Automatic code generation for invariant verification
and recovery upon invariant violation.
• Invariants are verified during runtime.
– Change of invariant variable is
pre-checked in sand-box.
– Violation is prevented and replaced
with a recovery action.
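Pre-checking an update in a sand-box can be sketched as applying the change to a copy of the state first. The class and function names below are hypothetical, and the recovery action shown (discard the update and keep the old value) is just one of the options the talk lists; the talk's approach generates such code automatically, which this hand-written sketch does not attempt.

```python
import copy

class GuardedState:
    """Holds invariant variables and checks every update before committing it."""

    def __init__(self, values, invariant, recover):
        self.values = dict(values)
        self.invariant = invariant     # callable: dict -> bool
        self.recover = recover         # callable run instead of a bad update
        self.violations = []           # history of rejected updates

    def update(self, name, value):
        trial = copy.deepcopy(self.values)   # sand-box copy
        trial[name] = value
        if self.invariant(trial):
            self.values = trial              # commit the checked update
            return True
        self.violations.append((name, value))
        self.recover(self, name, value)      # violation replaced by recovery
        return False

def keep_old_value(state, name, value):
    """Recovery action for this example: roll back by discarding the update."""

queue = GuardedState({"head": 0, "tail": 0},
                     invariant=lambda v: 0 <= v["head"] <= v["tail"],
                     recover=keep_old_value)
print(queue.update("tail", 5), queue.values)
print(queue.update("head", 9), queue.values)
```

The second update would move head past tail, so it is rejected and the committed state is unchanged.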

Our Contribution Cont.
• Recovery action is chosen depending on
the current state and history.
 Roll back & resume.
 Wait.
 Reschedule.
 Kill & restart.
External Monitoring
• Monitor the whole task, to catch
transient faults after which
invariant variables are no longer changed
(and so no invariant checks are done)
– Liveness problem: monitor over time
Talk Conclusions
• Self-Stabilization as an effective
paradigm for creating robust systems.
• Rigorous approach for designing basic
system components
– Microprocessor
– Virtual machine monitor
– Operating system
– Compiler
– Evolving and Recovery Oriented
Self-Stabilizing Virtual Machine
Hypervisor Architecture for
Resilient Cloud
Alexander Binun, Shlomi Dolev, Reuven Yagel
{binun,dolev,yagel}@cs.bgu.ac.il
Martin Kahil, Mark Bloch, Boaz Menuhin
{kahilm,mbloch}@post.bgu.ac.il, [email protected]
Ben-Gurion University of the Negev
Marc Lacoste, Thierry Coupaye, Aurelien Wailly
{marc.lacoste,thierry.coupaye,aurelien.wailly}@orange.com
Orange Labs, Paris
INTRODUCTION
Virtualization
• Virtual machines (VM)
– Guest OS runs apps
• Hypervisor (HV)
– Hardware for VMs
– Assume that HV is in host OS (e.g. KVM)
• Malware (e.g. rootkits)
– Get into HV from VM
Terminology
• Transient failures (TFs):
– Yield arbitrary changes of the system state
– SEU (Single Event Upset), limitations of
error detection algorithms
• We do not want to enumerate the causes exactly, as we risk
forgetting a scenario
• It is better to assume an arbitrary resulting
state
Terminology, Cont.
• Admissible execution: minimal requirements
for a system to be recoverable
– e.g., less than one third of the processors are
under attack
– Possible limited faults during the automatic
recovery from the unanticipated events in
the past: CPU failures, message losses
Self-Stabilization
• Legal execution: the desired behavior
• Safe state: every execution starting
from it is legal
• Self-stabilizing algorithm reaches a safe
state in some finite time
– In every admissible execution
– From any arbitrary system state
– Without external intervention
Self-Stabilization, Cont.
• The program is stored in a ROM
– Not subject to changes
– Can be periodically reloaded
• The system state can undergo
unpredicted changes…
– … following which the system should
converge to a safe state
Example: Token Passing
Algorithm
Code: write-protected
P1:
do forever
X1 = XN =>
X1 := (X1+1) mod (N+1)
Pi, i ≠ 1:
do forever
Xi ≠ X(i-1) =>
Xi := X(i-1)
Variables: may be corrupted
[Figure: ring of processors P1 to P4 with variables X1 to X4]
Atomic step
• Legal execution: exactly one Pi changes Xi in infinitely many states
• A safe state: X1 = X2 = … = XN
• The only possible execution is: exactly one Pi changes Xi
Token Passing: Self Stabilization
[Figure: values of x1, x2, x3, …, xN per round number]
Safe:
{0; 0; 0; 0; 0};
{1; 0; 0; 0; 0};
…
{1; 1; 1; 1; 1};
{2; 1; 1; 1; 1};
…
{2; 2; 2; 2; 2}
…
Failure: start from arbitrary values
{3; 7; 2; 3; 0};
…
{4; …};
…
{M; …};
…
{M; M; M; M; M};
• P1 eventually increments x1
– In N rounds x1 propagates along the chain, reaches PN, then increases
• In a state S, x1 gets a unique (not yet encountered) value M
after incrementing x1 several times
• Then M is propagated to the other processors, reaching a safe state
OUR APPROACH – MAIN
PRINCIPLE
Rootkit Activity – Current
State of Art
Our approach: Software
watchdog brings the system
into a safe state
• Watchdog: periodic “I’m Alive” check (frequent); corrupted… => Reboot
• Every software component (host, guests) can be corrupted
– And the watchdog as well…
• Reload system from ROM upon a hardware timer signal
(ROM, SMM, hardware interrupt)
• Consistency check: Tampered => Reboot
• Validator is write-protected by hardware means:
– It is the one that guarantees self-stabilization
– Runs rarely as it is very time-consuming
• Watchdog quickly detects small problems
– Runs frequently and efficiently
Existing Infrastructure
[Figure: a user creates / deletes VMs through the VM Manager; VM1 and VM2 each run a guest OS with user apps and are backed by threads T1 (VM1) and T2 (VM2)]
Existing Infrastructure
[Figure: hypervisor inside the OS with VMTable (VMi, Statei), scheduler, thread pool (T1, T2), I/O drivers and inter-VM traffic between VM1 and VM2; CPU in hardware]
1. Schedule VM
2. Activate VM
3. Run CPU during some time
4. Save CPU state & stop
Our Architecture: Watchdog
[Figure: hypervisor (VMTable: VMi, Statei; scheduler; thread pool T1, T2; I/O drivers) extended with a Stabilization Manager and a watchdog in the OS; CPU in hardware]
• Stabilization Manager: runs on a periodic interrupt; checks traffic state, scheduler state and VM state
• Watchdog: “Safe state?” / “I’m Alive”
• Not alive for a while => reboot
Our Architecture: Hardware-facilitated integrity checking
[Figure: the watchdog architecture extended with a hardware integrity checker and timer]
• Stabilization Manager: periodic interrupt (every second); checks traffic state, scheduler state and VM state
• Watchdog: “Safe state?” / “I’m Alive”; not alive for a while => reboot
• Integrity checker: hardware timer interrupt (every day); integrity failure => reboot
Implementation: Employ
external tools for examining
VMs
• VM => benign output => “I am Alive”
Implementation: Employ
external tools for examining
VMs (cont)
• VM => report malware => alarm => kill, suspend etc.
Future Work
• Test the prototype on real malware
collections
– e.g. TechGainer[1]
• Intelligent safety enforcement
– If the situation is not severely dangerous:
restart only malfunctioning fragments
• Reset a malfunctioning printer instead of rebooting
the computer
– Guarded Commands ([2]) as a basis for the
specification language {(guard,action), … }
• Guard: safety check; action: enforcement
Future Work
• Make the architecture support distributed
cloud infrastructures
– E.g. OpenStack[3]
• Are there competitors?
– Azure [5]: recovery through replication
– The replica synchronization algorithm may suffer
from transient faults too