Demsky Group Research
Download
Report
Transcript Demsky Group Research
Bristlecone: A Language
for Robust Software
Systems
Brian Demsky
Alokika Dash
University of California, Irvine
Current Software is All or Nothing
Most current software either executes
perfectly or fails completely
Small errors cause catastrophic failures
Violate fundamental developer assumptions
Violated assumptions prevent continued execution
No clean way to recover from errors
Unclear what parts of the program are affected
Failure may leave key data structures partially
updated
Degraded Service Can Be Desirable
Consider a bug that affects an embedded
application in a single web page
Current browsers often close all browser
windows and exit
Users find this behavior frustrating
Better option is to isolate the failure and
halt only the embedded application
component
Motivating Example: Web Server
Request
r=receiverequest();
logRequest(r);
processRequest(r);
Motivating Example: Web Server
Request
r=receiverequest();
logRequest(r);
CRASH
processRequest(r);
Failure in log operation
Prevents serving this
request
If logging failure is
independent of request,
potentially causes system to
fail to serve any requests
Real World Example
First Flight of the Ariane 5 Rocket
Uncaught integer overflow in computation that
computed horizontal bias
Overflow shutdown the inertial reference system
Inertial reference sent debug information to the
guidance system
Guidance system used these invalid values to set
incorrect nozzle deflections
$120 Million rocket crashed
Horizontal bias value is not even used!
Lesson: Critical system operations coupled to
non-critical operations
Observations about Recovery
Challenging to recover from failure with traditional
program structures
Unclear what code was doing
Are data structures consistent?
What depends on the failed code? What is still
safe to do?
Code structure introduces artificial dependences
In the absence of precise dependence information,
we must assume the worst case
Failures can propagate through artificial
dependences small errors can cause
catastrophic failures
Where do we lose information?
Specifications describe functionality
requirements
Architecture/implementation phases map
requirements into sequence of operations
Mapping process loses information:
Boundaries of operations (What is A and what is B?)
Temporal dependences (Does B require A?)
Data dependences (Does B use data produced by
A?)
Lost information introduces artificial
dependences
Designing for Robustness
Underlying assumption: All code contains
bugs
Goal: Mitigate the consequences
Approach:
Decompose application into many small tasks
Specify dependences between these tasks
Data dependences
Control dependences
Use transactions to prevent failures from exposing
partially updated data structures
Use dependence information to continue past
failures
Bristlecone Language
Program
is specified as a set of tasks
Task specifications describe task
dependences
Tasks have transactional semantics
Runtime system reasons about
dependences to execute past failures
(automated recovery)
Web Server Example
Accept Connection
Read Request
Decoupled operations
Log Request and
Send Page tasks are
independent
Failure of one does not
affect the other
Log Request
Send Page
Specifying Object States
Different object states support different functionality
(Type State)
Use flag construct to label conceptual object states
Use these flags to determine when to perform
operations
Can differentiate between operations that have true
data dependences and operations that just operate
on same objects
class WebRequest {
Flag initialized;
Flag send_page;
Flag write_log;
…}
Tagging Objects
Motivation: Consider the web server example
Each connection has:
A Socket object that provides communication
A WebRequest object that stores application specific state
Need to pair the correct Socket and WebRequest objects
together
Solution: Tag the group of objects
Connection Tag
Socket Object
WebRequest
Object
Tagging Objects
Tags
group object instances
Tags provide mechanism
Tags have types
Can create many instances of a tag type
Each instance defines a group
Can
bind tag instances to objects
Tags can specify that task parameters
must be in the same group
Task Specifications
Describe data dependences of tasks
Describe affect of tasks on objects
/* This task reads a request from a client. */
task readRequest(WebRequest w in initialized with
connection t, Socket s in IO_Pending with
connection t) {
...
taskexit(w: initialized:=false, send_page:=true,
write_log:=true);
}
Bristlecone Task Semantics
Runtime invokes tasks
Tasks can be invoked when objects are
available in the heap that satisfy the task’s
parameter guards
Task have transactional memory
semantics
All operations are executed or none
Task execution appears to occur in a single
instance
Failures cause transactions to abort and
restore consistency
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Failure-free Execution
Accept Connection
Read Request
Log Request
Send Page
Error Detection
Catching operating system signals
Library signals
Socket errors
…
Runtime language checks
Arithmetic exceptions
Null pointer exceptions
Array out of bounds exceptions
Assertions
Imperative consistency checks
Declarative data structure specifications
Failure Recovery
Transactions
restore data structures
to previous consistent state
Problem: Re-executing the same task
will likely result in the same failure
Solution: Use task specifications to
determine what other tasks can be
safely executed
Automatic Recovery
Accept Connection
Read Request
CRASH
logRequest
Send Page
Automatic Recovery
Accept Connection
Read Request
Log Request
Send Page
Automatic Recovery Summary
Accept Connection
Read Request
Log Request
Send Page
Language Benefits
Use
specifications to understand
failure in a meaningful way
Use task specifications to reason
how to recover from failures
Task specifications eliminate
artificial dependences
Task Dispatch
Goal: Determine which parameter objects
satisfy task guards
Problem: Brute force search can be
expensive
Our Approach maintains:
Parameter set of objects that satisfy an individual
parameter’s guard
Active task queue of sets of parameter objects
that collectively satisfy all of task’s guards
Task Dispatch
Precisely maintain parameter sets
If an object is in a parameter set
It satisfy the flag component of the guard
Is bound to the correct types of tags
All objects that satisfy parameter’s guard are in
parameter set
Active task queue is conservative
If a set of objects could potentially satisfy all of
task’s guards, it is in the task queue
Must check that set of objects in a task queue
invocation satisfies guards before invoking task
Task Dispatch
When a new object is added to parameter
set, create corresponding task queue
invocations
Search for objects that satisfy tag guards
Idea: Use tags to prune search
When we add an object with a tag guard to
the set, use tags to prune search of other
parameter objects that must be bound to
the same tag
Task Binding Iteration
Structure computation as a list of iterators
over tags and objects
Multiple types of iterators:
Over tags bound to object
Over objects bound to tag
Over objects in parameter set
Want to prune search early – ordering is
important
Statically generate iterator orderings for
each parameter set of each task
Initial Experiences
Implemented Bristlecone compiler and runtime
Have evaluated system on several benchmarks
including:
Web Server
Web Spider
Chat Server
Developed a Bristlecone and Java versions of each
Java versions were designed to use threads to
provide resilience to failures
Randomly injected failures into executions
Web Spider
Workload is a set of 100 web pages
Java version implemented using a thread
pool architecture
100 trials on each version
Randomly injected 3 halting failures into
each execution
With injected failures
Java version fetched average of 6 pages
Bristlecone version fetched average of 91 pages
Web Server
Web Server with support for e-commerce
transactions
Java version spawns a thread for each
connection
200 trials on each version
Randomly injected 50 halting failures into
each execution
With injected failures
Java failed to serve inventory requests in 4.5% of
trials, Bristlecone failed in 1.5%
Java had correct inventory responses in 68.6%,
Bristlecone in 100%
Chat Server
Chat server allows multiple users to chat
Java version spawns a thread for each
connection
100 trials on each version
Workload sent 800 messages
Randomly injected 10 failures into each
execution
With injected failures
Java version failed to serve 39.9% of messages
Bristlecone version failed to serve 19.3% of messages
Related Work
Traditional fault tolerance
Languages
N-version programming
Recovery blocks
Exception handlers
Linda / Tuple spaces
Orc
Actors
Argus
Oz
Erlang
Software and Hardware Transactional Memory
Conclusions
Bristlecone is a exciting approach to
improve application reliability
Initial experiences promising