Demsky Group Research

Download Report

Transcript Demsky Group Research

Bristlecone: A Language
for Robust Software
Systems
Brian Demsky
Alokika Dash
University of California, Irvine
Current Software is All or Nothing


Most current software either executes
perfectly or fails completely
Small errors cause catastrophic failures



Violate fundamental developer assumptions
Violated assumptions prevent continued execution
No clean way to recover from errors


Unclear what parts of the program are affected
Failure may leave key data structures partially
updated
Degraded Service Can Be Desirable




Consider a bug that affects an embedded
application in a single web page
Current browsers often close all browser
windows and exit
Users find this behavior frustrating
Better option is to isolate the failure and
halt only the embedded application
component
Motivating Example: Web Server
Request
r=receiverequest();
logRequest(r);
processRequest(r);
Motivating Example: Web Server

Request
r=receiverequest();
logRequest(r);
CRASH
processRequest(r);
Failure in log operation


Prevents serving this
request
If logging failure is
independent of request,
potentially causes system to
fail to serve any requests
Real World Example

First Flight of the Ariane 5 Rocket







Uncaught integer overflow in computation that
computed horizontal bias
Overflow shutdown the inertial reference system
Inertial reference sent debug information to the
guidance system
Guidance system used these invalid values to set
incorrect nozzle deflections
$120 Million rocket crashed
Horizontal bias value is not even used!
Lesson: Critical system operations coupled to
non-critical operations
Observations about Recovery


Challenging to recover from failure with traditional
program structures
Unclear what code was doing


Are data structures consistent?
What depends on the failed code? What is still
safe to do?

Code structure introduces artificial dependences

In the absence of precise dependence information,
we must assume the worst case

Failures can propagate through artificial
dependences  small errors can cause
catastrophic failures
Where do we lose information?



Specifications describe functionality
requirements
Architecture/implementation phases map
requirements into sequence of operations
Mapping process loses information:




Boundaries of operations (What is A and what is B?)
Temporal dependences (Does B require A?)
Data dependences (Does B use data produced by
A?)
Lost information introduces artificial
dependences
Designing for Robustness



Underlying assumption: All code contains
bugs
Goal: Mitigate the consequences
Approach:


Decompose application into many small tasks
Specify dependences between these tasks




Data dependences
Control dependences
Use transactions to prevent failures from exposing
partially updated data structures
Use dependence information to continue past
failures
Bristlecone Language
 Program
is specified as a set of tasks
 Task specifications describe task
dependences
 Tasks have transactional semantics
 Runtime system reasons about
dependences to execute past failures
(automated recovery)
Web Server Example

Accept Connection

Read Request

Decoupled operations
Log Request and
Send Page tasks are
independent
Failure of one does not
affect the other
Log Request
Send Page
Specifying Object States
Different object states support different functionality
(Type State)
 Use flag construct to label conceptual object states
 Use these flags to determine when to perform
operations
 Can differentiate between operations that have true
data dependences and operations that just operate
on same objects
class WebRequest {
Flag initialized;
Flag send_page;
Flag write_log;
…}

Tagging Objects




Motivation: Consider the web server example
Each connection has:
 A Socket object that provides communication
 A WebRequest object that stores application specific state
Need to pair the correct Socket and WebRequest objects
together
Solution: Tag the group of objects
Connection Tag
Socket Object
WebRequest
Object
Tagging Objects
 Tags
group object instances
Tags provide mechanism
 Tags have types
 Can create many instances of a tag type
 Each instance defines a group

 Can
bind tag instances to objects
 Tags can specify that task parameters
must be in the same group
Task Specifications


Describe data dependences of tasks
Describe affect of tasks on objects
/* This task reads a request from a client. */
task readRequest(WebRequest w in initialized with
connection t, Socket s in IO_Pending with
connection t) {
...
taskexit(w: initialized:=false, send_page:=true,
write_log:=true);
}
Bristlecone Task Semantics
Runtime invokes tasks
 Tasks can be invoked when objects are
available in the heap that satisfy the task’s
parameter guards
 Task have transactional memory
semantics




All operations are executed or none
Task execution appears to occur in a single
instance
Failures cause transactions to abort and
restore consistency
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Error Detection

Catching operating system signals



Library signals



Socket errors
…
Runtime language checks


Arithmetic exceptions
Null pointer exceptions
Array out of bounds exceptions
Assertions


Imperative consistency checks
Declarative data structure specifications
Failure Recovery
 Transactions
restore data structures
to previous consistent state
 Problem: Re-executing the same task
will likely result in the same failure
 Solution: Use task specifications to
determine what other tasks can be
safely executed
Automatic Recovery
Accept Connection
Read Request
CRASH
logRequest
Send Page
Automatic Recovery
Accept Connection
Read Request
Log Request
Send Page
Automatic Recovery Summary
Accept Connection
Read Request
Log Request
Send Page
Language Benefits
 Use
specifications to understand
failure in a meaningful way
 Use task specifications to reason
how to recover from failures
 Task specifications eliminate
artificial dependences
Task Dispatch



Goal: Determine which parameter objects
satisfy task guards
Problem: Brute force search can be
expensive
Our Approach maintains:


Parameter set of objects that satisfy an individual
parameter’s guard
Active task queue of sets of parameter objects
that collectively satisfy all of task’s guards
Task Dispatch

Precisely maintain parameter sets

If an object is in a parameter set




It satisfy the flag component of the guard
Is bound to the correct types of tags
All objects that satisfy parameter’s guard are in
parameter set
Active task queue is conservative


If a set of objects could potentially satisfy all of
task’s guards, it is in the task queue
Must check that set of objects in a task queue
invocation satisfies guards before invoking task
Task Dispatch




When a new object is added to parameter
set, create corresponding task queue
invocations
Search for objects that satisfy tag guards
Idea: Use tags to prune search
When we add an object with a tag guard to
the set, use tags to prune search of other
parameter objects that must be bound to
the same tag
Task Binding Iteration


Structure computation as a list of iterators
over tags and objects
Multiple types of iterators:





Over tags bound to object
Over objects bound to tag
Over objects in parameter set
Want to prune search early – ordering is
important
Statically generate iterator orderings for
each parameter set of each task
Initial Experiences


Implemented Bristlecone compiler and runtime
Have evaluated system on several benchmarks
including:






Web Server
Web Spider
Chat Server
Developed a Bristlecone and Java versions of each
Java versions were designed to use threads to
provide resilience to failures
Randomly injected failures into executions
Web Spider





Workload is a set of 100 web pages
Java version implemented using a thread
pool architecture
100 trials on each version
Randomly injected 3 halting failures into
each execution
With injected failures


Java version fetched average of 6 pages
Bristlecone version fetched average of 91 pages
Web Server





Web Server with support for e-commerce
transactions
Java version spawns a thread for each
connection
200 trials on each version
Randomly injected 50 halting failures into
each execution
With injected failures


Java failed to serve inventory requests in 4.5% of
trials, Bristlecone failed in 1.5%
Java had correct inventory responses in 68.6%,
Bristlecone in 100%
Chat Server






Chat server allows multiple users to chat
Java version spawns a thread for each
connection
100 trials on each version
Workload sent 800 messages
Randomly injected 10 failures into each
execution
With injected failures


Java version failed to serve 39.9% of messages
Bristlecone version failed to serve 19.3% of messages
Related Work

Traditional fault tolerance




Languages







N-version programming
Recovery blocks
Exception handlers
Linda / Tuple spaces
Orc
Actors
Argus
Oz
Erlang
Software and Hardware Transactional Memory
Conclusions


Bristlecone is a exciting approach to
improve application reliability
Initial experiences promising