Error Scope on a Computational Grid: Theory and Practice
Douglas Thain and Miron Livny
Computer Sciences Department, University of Wisconsin
HPDC-11, July 2002

Danger Ahead!

Outline
An Exercise: Condor + Java
Bad News: Error Explosion
A Theory of Error Propagation (A Taste)
Condor Revisited
Parting Thoughts

An Exercise: Coupling Condor and Java
The Condor Project, est. 1985.
  – Production high-throughput computing facility.
  – Provides a stable execution environment on a Grid of unstable, autonomous resources.
The Java Language, est. 1991.
  – Production language, compiler, and interpreter.
  – Provides a standard instruction set and libraries on any processor and system.
The Grid
  – Execute any code anywhere at any time.
  – Dependable, consistent, pervasive, inexpensive...
  – Are we there yet?

The Condor High Throughput Computing System
HTC != HPC
  – Measured in sims/week, frames/month, cycles/year.
All participants are autonomous.
  – Users give constraints on usable machines.
  – Machines give constraints on jobs and users.
  – ClassAds: a language for matchmaking.
If you are willing to re-link jobs...
  – Remote system calls for transparent mobility.
  – Binary checkpointing for migration and fault-tolerance.
  – Can’t relink? All other features available.
Special “universes” support software environments.
  – PVM, MPI, Master-Worker, Vanilla, Globus, Java
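
The two-sided constraints above can be pictured with a toy sketch. It is not the ClassAd language: plain Python predicates stand in for ClassAd expressions, and every attribute name (owner, has_java, memory_mb) is invented for illustration.

```python
# Toy two-sided matchmaking. Plain Python predicates stand in for ClassAd
# expressions; every attribute name below is made up for illustration.

def matches(job, machine):
    """Both sides must accept: the job's constraints on the machine,
    and the machine's constraints on the job and its user."""
    return job["requirements"](machine) and machine["requirements"](job)

job = {
    "owner": "alice",
    "requirements": lambda m: m["has_java"] and m["memory_mb"] >= 256,
}

machines = [
    # Accepts any user except one named "mallory".
    {"name": "m1", "has_java": True, "memory_mb": 512,
     "requirements": lambda j: j["owner"] != "mallory"},
    # Accepts anyone, but has no Java, so the job rejects it.
    {"name": "m2", "has_java": False, "memory_mb": 1024,
     "requirements": lambda j: True},
    # Suitable for the job, but only serves the user "bob".
    {"name": "m3", "has_java": True, "memory_mb": 512,
     "requirements": lambda j: j["owner"] == "bob"},
]

candidates = [m["name"] for m in machines if matches(job, m)]
print(candidates)
```

Only the machine that the job accepts and that accepts the job survives; neither side can force a match on the other.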

[Figure: Condor architecture. Labels: Submission Site, Execution Site, MatchMaker, Policy Control (shown twice), User Agent (schedd), Claiming Protocol, Job Agent (shadow), Machine Agent (startd), Execution Protocol, Job Agent (starter), Home File System, The Job, and three Fork arrows.]

Java Universe
Execution:
  – User specifies .class and .jar files.
  – Machine provides the JVM details.
Input and Output:
  – Know all of your files?
      Condor transfers whole files for you.
  – Need online I/O?
      Link program with Chirp I/O Library.
      Execution site provides proxy to home site.

[Figure: Java Universe I/O path. Labels: Submission Site, Execution Site, Job Agent (shadow), Job Agent (starter), Secure Remote I/O, I/O Server, Local System Calls, I/O Proxy, Fork, Local RPC (Chirp), JVM, Wrapper, Home File System, The Job, I/O Library.]

Initial Experience
Bad news! Any kind of error sent the job back to the user with an exception message:
  – NullPointerException - Program is faulty.
  – OutOfMemory - Program outgrew machine.
  – ClassNotFoundError - Machine incorrectly installed.
  – ConnectionRefused - Network temporarily unavailable.
Users were frustrated because they had to evaluate whether the job failed or the system failed.
These were correct in the sense they were true.
These were not bugs. We deliberately trapped all possible errors and passed them up the chain.

What’s the Problem?
To reason about this problem, we began to construct a theory of error propagation.
This theory offers some common definitions and four principles that outline a design discipline.
We re-examined the Java Universe according to this theory.
Our most serious mistake: We failed to propagate errors according to their scope.

We are NOT Talking About:
Fault Tolerance
  – What algorithms are fault-resistant?
  – How many disks can I lose without losing data?
  – How many copies should I make for five nines?
Language Structures
  – Should I use Objects or Strings to represent errors?
  – Should I use Exceptions or Signals to communicate errors?
These are important and valuable questions, but we are asking something different!

We ARE Talking About:
Where is the problem?
How should a program respond to an error?
Who should receive an error message?
What information should an error carry?
How can we even reason about this stuff?

Engineering Perspective
Fault
  – A physical disruption of the machine.
Error
  – An information state that reflects a fault.
Failure
  – A violation of documented/guaranteed behavior.
Fault
  – (A failure in one’s underlying components.)

Interface Perspective
Implicit Error
  – A result presented as valid, but found to be false.
  – Example: sqrt(3) -> 2.
Explicit Error
  – A result describing an inability to carry out the request.
  – Example: open(“file”) -> ENOENT.
Escaping Error
  – A return to a higher level of abstraction.
  – Example: read -> virt mem failure -> process abort.
  – Example: server out of memory -> shutdown socket
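
A minimal sketch of the three error kinds, under some assumptions: bad_sqrt, open_missing, read_page and the EscapingError class are hypothetical names invented here, and a Python exception stands in for the process abort or dropped socket in the slide's examples.

```python
import errno
import math
import os
import tempfile

def bad_sqrt(x):
    """Implicit error: a wrong answer presented as if it were valid."""
    return float(round(math.sqrt(x)))  # rounds the true root to a whole number

def open_missing(path):
    """Explicit error: the interface itself reports that it cannot comply."""
    try:
        os.close(os.open(path, os.O_RDONLY))
        return 0
    except OSError as e:
        return e.errno  # e.g. ENOENT, a documented result of open()

class EscapingError(Exception):
    """Hypothetical stand-in for an abort: control leaves the interface
    and returns to a higher level of abstraction."""

def read_page(memory_ok):
    """read() has no result meaning 'virtual memory failed', so it escapes."""
    if not memory_ok:
        raise EscapingError("virtual memory failure")
    return b"page"

# A path inside a fresh, empty directory is guaranteed not to exist.
missing = os.path.join(tempfile.mkdtemp(), "no-such-file")
print(bad_sqrt(3))
print(open_missing(missing) == errno.ENOENT)
```

The caller of bad_sqrt has no way to notice the problem, the caller of open_missing is told plainly, and the caller of read_page never receives a result at all.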

Principles for Error Design
1 - A program must not generate an implicit error as a result of receiving an explicit error.
2 - An escaping error must be used to convert a potential implicit error into an explicit error at a higher level.
3 - An error must be propagated to the program that manages its scope.
4 - Error interfaces must be concise and finite.
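
Principles 1 and 2 can be illustrated with a small sketch. The names (ConnectionLost, Escape, fetch, read_all_wrong, read_all_right) are hypothetical, and the lower layer is simulated by a list in which None marks the point where the link drops.

```python
class ConnectionLost(Exception):
    """Hypothetical explicit error reported by a lower layer."""

class Escape(Exception):
    """Hypothetical escaping error raised to the next level up."""

def fetch(chunks):
    """Simulated lower layer: yields chunks; None marks a dropped link."""
    for c in chunks:
        if c is None:
            raise ConnectionLost("link dropped")
        yield c

def read_all_wrong(chunks):
    """Violates principle 1: the explicit error is swallowed, and truncated
    data is returned as if it were the complete result (an implicit error)."""
    out = b""
    try:
        for c in fetch(chunks):
            out += c
    except ConnectionLost:
        pass
    return out

def read_all_right(chunks):
    """Follows principle 2: this interface can only return complete data,
    so the would-be implicit error is converted into an escaping error."""
    out = b""
    try:
        for c in fetch(chunks):
            out += c
    except ConnectionLost as e:
        raise Escape("cannot produce a valid result") from e
    return out

print(read_all_wrong([b"ab", None, b"cd"]))  # truncated, yet looks valid
print(read_all_right([b"ab", b"cd"]))        # complete data
```

The design point is that read_all_right never hands back something that merely looks like an answer; whoever catches Escape at the higher level can report it as an explicit error there.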

Error Scope
Definition: The scope of an error is the portion of a system that it invalidates.
Principle 3: An error must be propagated to the program that manages its scope.

Error Scope Examples
“File not found” simply has file scope.
[Figure: nested error scopes and their managers. Scope labels: Job Scope (schedd), Local Resource Scope (shadow), Remote Resource Scope (starter), Virtual Machine Scope (JVM), Program Scope (program). Component labels: Prog Image, User Policy, Prog Args, I/O Server, Input Data, Output Space, Owner Policy, Java Pkg, Mem & CPU, Code, Data.]

Scope in Condor
Detail                     Scope            Handler
Program exited normally.   Program          User
Null pointer exception.    Program          User
Out of memory.             Virtual Machine  JVM
Java misconfigured.        Remote Resource  Starter
Home file system offline.  Local Resource   Shadow
Program image corrupt.     Job              Schedd

Scope in Condor: JVM Exit Code
Detail                     Scope            Handler  Exit Code
Program exited normally.   Program          User     (x)
Null pointer exception.    Program          User     1
Out of memory.             Virtual Machine  JVM      1
Java misconfigured.        Remote Resource  Starter  1
Home file system offline.  Local Resource   Shadow   1
Program image corrupt.     Job              Schedd   1
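
The scope tables above can be read as a lookup from error to scope to handler. The sketch below encodes the rows as data; the Scope enum, the dictionaries, and route() are illustrative names, not Condor code. It also shows why the JVM exit code alone cannot drive that lookup.

```python
from enum import Enum

class Scope(Enum):
    PROGRAM = "program"
    VIRTUAL_MACHINE = "virtual machine"
    REMOTE_RESOURCE = "remote resource"
    LOCAL_RESOURCE = "local resource"
    JOB = "job"

# Handler of each scope, as listed in the tables.
MANAGER = {
    Scope.PROGRAM: "User",
    Scope.VIRTUAL_MACHINE: "JVM",
    Scope.REMOTE_RESOURCE: "Starter",
    Scope.LOCAL_RESOURCE: "Shadow",
    Scope.JOB: "Schedd",
}

# Scope of each error row in the tables.
ERRORS = {
    "null pointer exception": Scope.PROGRAM,
    "out of memory": Scope.VIRTUAL_MACHINE,
    "java misconfigured": Scope.REMOTE_RESOURCE,
    "home file system offline": Scope.LOCAL_RESOURCE,
    "program image corrupt": Scope.JOB,
}

def route(detail):
    """Principle 3: find the scope an error invalidates, then its manager."""
    scope = ERRORS[detail]
    return scope, MANAGER[scope]

# Every one of these errors surfaces as JVM exit code 1.
exit_codes = {detail: 1 for detail in ERRORS}

print(route("java misconfigured"))
print(len(set(exit_codes.values())), "distinct exit code for",
      len({route(d)[1] for d in ERRORS}), "distinct handlers")
```

Five different handlers are needed, but the exit code carries only one value, so the scope has to travel by some other channel.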

What To Do With An Error?
A program cannot possibly know what to do with an error outside its scope.
  – Should sin(x) deal with “math library not available”?
Propagate an error to the manager of the scope as directly as possible.
Sometimes, a direct mechanism:
  – Signal, exception, dropped connection, message.
Sometimes, an indirect mechanism:
  – Touch a file, then exit by any means available.

[Figure: result propagation through a result file. Labels: Job Agent (shadow), Starter Result + Program Result, Result File, Job Agent (starter), JVM Result, JVM, Home File System, Program Result or Error and Scope, Wrapper, The Job, I/O Library.]
[Figure: the same path annotated by scope. Labels: Result File, JVM Result, JVM, Errors of Larger Scope, Errors Inside Program Scope, Wrapper, The Job, I/O Library.]
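
The indirect mechanism (write a result file, then exit by any means available) might look roughly like the sketch below. It is not the Condor wrapper: the JSON record format, the function names, and the choice to treat a missing result file as a remote resource error are all assumptions made for illustration.

```python
import json
import os
import tempfile

def wrapper(program, result_path):
    """Run the program, then record the outcome and its scope in a file.
    The record format is hypothetical."""
    try:
        program()
        record = {"scope": "program", "detail": "exited normally"}
    except MemoryError:
        # The program outgrew the machine: larger than program scope.
        record = {"scope": "virtual machine", "detail": "out of memory"}
    except Exception as e:
        # An exception from the program's own logic stays in program scope.
        record = {"scope": "program", "detail": type(e).__name__}
    with open(result_path, "w") as f:
        json.dump(record, f)

def starter(result_path):
    """Read the result file instead of trusting the exit code. If the file
    is absent, the wrapper never finished, so the error is assumed to lie
    outside the program and the virtual machine."""
    if not os.path.exists(result_path):
        return {"scope": "remote resource", "detail": "no result file"}
    with open(result_path) as f:
        return json.load(f)

def faulty_program():
    raise ZeroDivisionError("bug in the user's code")

path = os.path.join(tempfile.mkdtemp(), "result.json")
print(starter(path))            # nothing has run yet: no result file
wrapper(faulty_program, path)
print(starter(path))            # program scope: belongs to the user
```

Because the scope is written down explicitly, the starter can return program-scope results to the user and route everything else upward, whatever exit code the JVM happens to produce.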

Error Theory
An outline:
  – Definitions of error types.
  – Error relationships discussion.
  – Four principles for error discipline.
  – Error scope.
Unpopular position:
  – Generic (expandable) errors must be exterminated!
Please take a closer look, and feel free to come argue with me!

Open Problems
Related Work
Anh Nguyen-Tuong and Andrew Grimshaw, Legion Reflective Graph and Event Model.
  – Distributed applications keep a model of themselves.
  – Very powerful when the entire system is known to every component.
John B. Goodenough, et al. “Exceptions”
  – Must exceptions be declared in the interface?
  – If not, how do we deal with escaping errors?
Hoare, et al, “Design by Contract”
  – Motivates the distinction between explicit and escaping errors.
  – How should escaping errors be structured?

Conclusion
Small but powerful changes drastically improved the Java Universe.
Our mistake was to represent all possible errors explicitly in the closest interface.
Error scope is an analytic tool that helps the designer decide how to propagate an error.
An error discipline saves precious resources: time and aggravation!

A Parting Thought
Very few existing structures can be lifted into distributed computing without change.

    #!/bin/sh
    gzip file
    exit $?

Can these results be distinguished?
  – sh fails to load (result 1)
  – gzip fails to load (result 1)
  – file does not exist (result 1)
  – file exists (result 0)

For more information...
Douglas Thain
  – [email protected]
Miron Livny
  – [email protected]
Condor Software, Manuals, Papers, and More
  – http://www.cs.wisc.edu/condor
Questions now?