New Wine in Old Bottles: Java and Condor

Systems Seminar Schedule

Monday, 18 February, 4pm:
– “New Wine in Old Bottles” - Douglas Thain

4 March:
– No seminar: Paradyn/Condor Week

Tuesday, 19 March, 3pm:
– “The Microsoft .NET System” - Mike Litzkow

Tuesday, 2 April, 3pm:
– “Condor and the Grid” - Miron Livny

Monday, 15 April, 4pm:
– “Exploiting Gray-Box Knowledge of Buffer-Cache Management” - Nathan Burnett

Monday, 29 April, 4pm:
– “Bridging the Information Gap in Storage Protocol Stacks” - Tim Denehy
New Wine
in Old Bottles:
Java on Condor
Douglas Thain
University of Wisconsin
18 February 2002
Abstract
We have added Java support to Condor. I’ll tell you how it works and how to use it. There are some nifty features for end users.
Adding this code forced us to think about the fundamental problem of coupling systems and representing errors.
A lesson: One must consider the scope of an error as well as its detail.

Disclaimer:
This is still rough around the edges.
(Someone had to go first!)
Outline

Why Java and Condor?
Architecture
Initial Experience
A Little Error Theory
Changes for the Better
Conclusions
Java for Scientific Computing

Java is emerging as a tool for large scale
(Grande) scientific computing.
– More accessible to domain scientists.
– Simplified porting.
– Faster development, debugging.

User communities are forming:
– ACM Java Grande Conference
– The Java Grande Forum
A. Globus, E. Langhirt, M. Livny, R. Ramamurthy, M. Solomon, and S. Traugott.
JavaGenes and Condor: Cycle-Scavenging genetic algorithms. ACM Conf on Java
Grande, 2000.
Limitations

Java floating point and complex arithmetic do not
yet satisfy all of the scientific community.
– Arguments continue between industry and academia.

Java programs are still slower than comparable programs in C/C++/Fortran.
– WAT compilers and JIT compilers are catching up.
– You choose: 2x slowdown vs 5x machines.

Can we really harness 5x machines while still
maintaining platform independence?
Condor for Scientific Computing

Condor creates a high-throughput computing system on a community of computers.
A high-throughput computing system seeks to maximize the amount of work done over a long period of time.
A community of computers may be any collection of machines that agree to work together.
Condor Enables Ordinary Users
[Figure: a condor schedd holding jobs, and many machines each running a condor startd with CPU and RAM, grouped under the INFN Central Manager and the UWCS Central Manager.]
Top 10 Condor Pools:
226 Condor Pools
5576 Condor Hosts
[Figure: bar chart of the top 10 Condor pools; vertical axis from 0 to 800.]
The Hype:
Java:
– “Write once, run anywhere!”
Condor:
– “Submit once, run everywhere!”
The Grid:
– Uniform, dependable, consistent, pervasive, and inexpensive computing.
The Reality

Coupling systems is not trivial!
The easy part:
– Putting java in front of the program name.
The tricky parts:
– Java installation messes.
– Unavailable file systems.
– Distinguishing program errors from environmental errors.
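[Figure: Condor architecture.]
Match Maker: Matchmaking Protocol with the schedd (Job Policies) and the startd (Machine Policies).
schedd and startd: Claiming Protocol, Activation Protocol.
schedd forks the shadow. shadow: Exports the details, policy, and I/O services (Home File System).
startd forks the starter. starter: Creates the execution environment and forks The Job.
shadow and starter: Execution Protocol.

[Figure: Java Universe architecture.]
shadow (I/O Server): Local System Calls to the Home File System.
shadow and starter: Secure Remote I/O.
starter (I/O Proxy): forks the JVM.
JVM: Wrapper, The Job, I/O Library.
I/O Library and I/O Proxy: Local I/O (Chirp).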
User Interface
condor_status -java

Name          JavaVendor  Ver    State    Activity  LoadAv  Mem
aish.cs.wisc. Sun Microsy 1.2.2  Owner    Idle      0.000   249
anfrom.cs.wis Sun Microsy 1.2.2  Owner    Idle      0.030   249
babe.cs.wisc. Sun Microsy 1.2.2  Claimed  Busy      1.120   123
...

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX       514    101      408          5        0           0
Total             514    101      408          5        0           0
User Interface
condor_submit
universe = java
executable = Main.class
jar_files = MyLibrary.jar
input = infile
output = outfile
arguments = Main 1 2 3
queue
I/O Interface

Input, output, and error files are automatically transferred to/from the execution site.
Any other named files may be transferred as well.
To do online I/O without transferring whole files, you must make small changes to the code:
– FileInputStream -> ChirpInputStream
– FileOutputStream -> ChirpOutputStream
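A minimal sketch of that code change, assuming the program reads through the generic java.io.InputStream type. The class LineCounter and its method are hypothetical. ChirpInputStream is named on the slide, but its package and constructor arguments are not shown there, so the Chirp call appears only as a comment.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical example class; not part of Condor.
class LineCounter {
    // Counts lines from any InputStream, so the origin of the stream
    // (local file or remote I/O) does not matter to this method.
    static int countLines(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        int lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Local file version:
        InputStream in = new FileInputStream(args[0]);
        // Online I/O version (assumed form; check the Chirp library for the
        // actual package and constructor signature):
        // InputStream in = new ChirpInputStream(args[0]);
        try {
            System.out.println(countLines(in));
        } finally {
            in.close();
        }
    }
}
```

Only the constructor call changes; code that accepts an InputStream stays the same.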
[Figure: software layers, top to bottom, with the options considered for intercepting I/O.]
Application
Chirp I/O Library
Java Standard Libraries
Java Virtual Machine
JNI
C Standard Library
Operating System

Added a new library on existing interfaces. User must call new constructors.
Java symbols are fully qualified, so transparent replacement of classes is not possible.
Could replace native methods in the JVM, but this ties us to open-source JVMs.
Could trap real system calls, but these are complex (asynchronous, nonblocking, threaded) and may be difficult to distinguish from the JVM’s own operations.
Initial Experience

Bad news: Nearly any unexpected failure
would cause the job to be returned to the
user:
– Out of memory at execution site.
– Java misconfigured at execution site.
– I/O proxy can’t initialize.
– Home file system offline.
Initial Experience

Although this was correct in some sense -- the information was true -- it was very frustrating.
Users want to know when their program fails by design (NullPointerException), but not if it fails due to the environment.
What did we do wrong?
A Little Error Theory

Build on standard definitions from fault tolerance and programming languages.
Some brief examples to get the idea.
Return to Condor and use the theory to understand our design mistakes.
Fault Tolerance Terminology

Failure
– An externally-visible deviation from
specifications.

Error
– An internal data state that leads to a failure.

Fault
– An external event that creates an error.
A. Avizienis and J.C. Laprie. Dependable computing: From concepts to design diversity. Proceedings of the IEEE 74(5), May 1986.
Example
Client: What is sqrt(4)?
Server: Hmm, sqrt(4) is...
FAULT
Server: Hmm, sqrt(9) is... (ERROR)
Server: Answer: 3 (FAILURE)

Implicit errors
– The system claims to have reached a valid result, but an auditor claims it is invalid. Example: sqrt(3)==2
Explicit errors
– The system tells us it cannot complete the desired action. Example: file not found.
Escaping errors
– The system detects an error, but has no method of reporting it, so it escapes by an alternate route. Example: core dump, kernel panic.
John B. Goodenough. Exception handling: Issues and a proposed notation. CACM 18(12), December 1975.
K. Ekanadham and A. Bernstein. Some new transitions in hierarchical level structures. Operating Systems Review 12(4), 1978.
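The three kinds of error can be sketched in Java on a single routine. All names here (ErrorKinds, NegativeInput, the three methods) are hypothetical, and using an unchecked Error for the escaping case is an assumption made for illustration.

```java
// Hypothetical names throughout; a sketch of the three error kinds.
class ErrorKinds {
    // A declared (checked) exception is part of the interface,
    // so raising it is an explicit error.
    static class NegativeInput extends Exception {
        NegativeInput(String message) {
            super(message);
        }
    }

    // Implicit error: claims success but returns an invalid result.
    static int implicitSqrt(int x) {
        if (x < 0) {
            return 0; // silently wrong; only an auditor would notice
        }
        return (int) Math.sqrt(x);
    }

    // Explicit error: the routine says it cannot complete the action.
    static int explicitSqrt(int x) throws NegativeInput {
        if (x < 0) {
            throw new NegativeInput("sqrt of negative number: " + x);
        }
        return (int) Math.sqrt(x);
    }

    // Escaping error: the signature offers no way to report the problem,
    // so it leaves by an alternate route (an unchecked Error here).
    static int escapingSqrt(int x) {
        if (x < 0) {
            throw new Error("cannot compute sqrt of " + x);
        }
        return (int) Math.sqrt(x);
    }
}
```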
[Figure: a Program issues a load to the Virtual Memory System (Physical Memory, Backing Store) and receives data; the Parent Process sees either a Normal Exit or an Abnormal Exit.]
Would like to return an explicit error, but a load instruction has no exit code.
Could return a default value, but that creates an implicit error.
Escaping error: Tell the parent that the program could not complete.
Interface Contracts
int load( int address );
The implementor must either compute a result that conforms to the contract or cause an escaping error.
C. Hoare. An axiomatic basis for computer programming. CACM 12(10):576-580, October 1969.
B. Meyer. Object-Oriented Software Construction. Prentice Hall, 1997.
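A small sketch of the load contract, assuming a toy memory model. The class VirtualMemory, its maps, and the evict and setBackingStoreOnline helpers are hypothetical. Since every int is a legal return value, the sketch has no spare value for reporting a failure and throws an Error instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a virtual memory system; not Condor code.
class VirtualMemory {
    private final Map<Integer, Integer> physical = new HashMap<Integer, Integer>();
    private final Map<Integer, Integer> backing = new HashMap<Integer, Integer>();
    private boolean backingStoreOnline = true;

    void store(int address, int value) {
        physical.put(address, value);
    }

    // Moves a word out of physical memory into the backing store.
    void evict(int address) {
        Integer value = physical.remove(address);
        if (value != null) {
            backing.put(address, value);
        }
    }

    void setBackingStoreOnline(boolean online) {
        backingStoreOnline = online;
    }

    // Contract: return the value stored at address.
    int load(int address) {
        Integer value = physical.get(address);
        if (value == null) {
            if (!backingStoreOnline) {
                // Returning a default such as 0 would be an implicit error,
                // so the only conforming choice is an escaping error.
                throw new Error("backing store offline for address " + address);
            }
            value = backing.get(address);
            if (value == null) {
                value = 0; // assumed: never-written words read as zero
            }
            physical.put(address, value);
        }
        return value;
    }
}
```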
Exceptions
int open( String filename )
throws FileNotFound, AccessDenied;
A language with exceptions provides more
structure to the contract. A declared
exception is an explicit error. Yet, escaping
errors are still possible.
[Figure: a Program calls open on the Virtual File System; the Parent Process sees either a Normal Exit or an Abnormal Exit.]
INTERFACE: Success, FileNotFound, AccessDenied
IMPLEMENTATION (Memory, Disk): MemoryCorrupt, DiskOffline, PigeonLost
Error Scope

In order to be accepted by end users, a
distributed system must be able to
distinguish between errors computed by the
program and errors forced upon it by the
environment.

We use the term scope to draw the
distinction.
Error Scope

The scope of an error is the portion of the system that it invalidates.
An error must be delivered to the process responsible for managing that scope.

Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
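The error, scope, and handler entries above can be written as a lookup, which makes the delivery rule concrete: find the scope an error invalidates, then hand the error to whoever manages that scope. The class ScopeTable and its methods are hypothetical names for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: the slide's examples as a lookup table.
class ScopeTable {
    enum Scope { FILE, PROCESS, MACHINE, PVM_CLUSTER }

    static final class Entry {
        final Scope scope;
        final String handler;

        Entry(Scope scope, String handler) {
            this.scope = scope;
            this.handler = handler;
        }
    }

    private static final Map<String, Entry> TABLE = new LinkedHashMap<String, Entry>();
    static {
        TABLE.put("FileNotFound", new Entry(Scope.FILE, "Calling Function"));
        TABLE.put("RPC Disconnect", new Entry(Scope.PROCESS, "Parent Process"));
        TABLE.put("Cache Coherency Problem", new Entry(Scope.MACHINE, "Hypervisor or Operator"));
        TABLE.put("PVM Node Crash", new Entry(Scope.PVM_CLUSTER, "Parent Process"));
    }

    private static Entry lookup(String error) {
        Entry entry = TABLE.get(error);
        if (entry == null) {
            throw new IllegalArgumentException("unknown error: " + error);
        }
        return entry;
    }

    // The portion of the system that the error invalidates.
    static Scope scopeOf(String error) {
        return lookup(error).scope;
    }

    // The party responsible for managing that scope.
    static String handlerFor(String error) {
        return lookup(error).handler;
    }
}
```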
Error Detail

The detail of an error describes in phenomenological terms the cause of the error.
In the right hands, the detail is useful. In the wrong hands, the detail can be misleading.
Suppose open returns AccessDenied...
– File is not accessible - Ok.
– Library containing ‘open’ is not accessible - Problem!
Lessons

Principle 1:
– A routine must not generate an implicit error as a result
of receiving an explicit error.

Principle 2:
– An escaping error converts a potential implicit error
into an explicit error at a higher level.

Principle 3:
– An escaping error must be propagated to the program
that manages the error’s scope.
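Principle 1 is the easiest to violate in everyday code. The sketch below contrasts a routine that swallows an explicit error with one that keeps it explicit. The names PrincipleOne, Source, readOrZero, and readOrFail are hypothetical.

```java
import java.io.IOException;

// Hypothetical sketch of Principle 1.
class PrincipleOne {
    // A data source that may fail with an explicit error.
    interface Source {
        int read() throws IOException;
    }

    // Violates Principle 1: the explicit IOException is swallowed and a
    // made-up value is returned, which the caller cannot tell from real data.
    static int readOrZero(Source source) {
        try {
            return source.read();
        } catch (IOException e) {
            return 0;
        }
    }

    // Follows Principle 1: the explicit error stays explicit.
    static int readOrFail(Source source) throws IOException {
        return source.read();
    }
}
```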
Java and Condor Revisited
What did we do wrong?
We focussed on error detail without
considering error scope.
Java and Condor Revisited

To fix the system, we revisited the notion of error scope throughout.
Two examples:
– JVM exit code
– I/O errors
JVM Exit Code

Detail                                 Scope            Exit Code
Program exited by completing main      Program          0
Program exited through System.exit(x)  Program          x
Exception: Null pointer.               Program          1
Exception: Out of memory.              Virtual Machine  1
Exception: Java Misconfigured.         Remote Resource  1
Exception: Home file system offline.   Local Resource   1
Exception: Program image corrupt.      Job              1
[Figure: how results flow back to the submitter.]
JVM (Wrapper, The Job, I/O Library) to starter: JVM Result and Result File.
starter to shadow: Starter Result + Program Result.
shadow (Home File System): Result of Execution Attempt + Result of Program, if any.
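Because the JVM reports exit code 1 for every uncaught exception, the exit code alone cannot separate these scopes. A rough sketch of a wrapper that records scope and detail separately is shown below. The class ScopeWrapper, the Job interface, the classification rules, and the result-line format are all assumptions made for illustration; the actual Condor wrapper is not shown in these slides.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical sketch of a wrapper that records an error's scope.
class ScopeWrapper {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE, JOB }

    // Stand-in for the user's main method.
    interface Job {
        void run() throws Throwable;
    }

    // Assumed classification: OutOfMemoryError belongs to the virtual
    // machine, a malformed class file to the job, anything else to the program.
    static Scope classify(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            return Scope.VIRTUAL_MACHINE;
        }
        if (t instanceof ClassFormatError) {
            return Scope.JOB;
        }
        return Scope.PROGRAM;
    }

    // Runs the job and writes one result line holding scope and detail,
    // so a reader of the result does not have to guess from the exit code.
    static void runAndRecord(Job job, Writer result) throws IOException {
        try {
            job.run();
            result.write("scope=PROGRAM detail=completed main\n");
        } catch (Throwable t) {
            result.write("scope=" + classify(t) + " detail=" + t + "\n");
        }
    }
}
```

In a real deployment the Writer would presumably be the result file that the starter reads after the JVM exits.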
I/O Error Scope

All Java I/O operations throw a single exception type -- IOException.
Our mistake: convert all detected errors into IOExceptions and pass them to the program.
Makes sense for FileNotFound, but not for ProxyUnavailable or CredentialsExpired.
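[Figure: routing of I/O errors.]
Error Inside Program Scope: I/O Library to The Job.
Error Outside Program Scope: I/O Library to Wrapper to Result File to starter, alongside the JVM Result.
The I/O Library connects to the I/O Proxy.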
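One way to sketch this routing inside an I/O library: errors inside program scope become IOExceptions that the program can catch, while errors outside it become a Java Error that passes through the program's catch (IOException e) blocks. The class ScopedIo, the EnvironmentError type, and the ProxyStatus values are hypothetical stand-ins; the slides do not show how the proxy actually reports its status.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical I/O library sketch; names are illustrative only.
class ScopedIo {
    // Thrown for failures outside the program's scope. It extends Error,
    // so handlers written for IOException will not intercept it.
    static class EnvironmentError extends Error {
        EnvironmentError(String message) {
            super(message);
        }
    }

    // Stand-in for the outcomes an I/O proxy might report.
    enum ProxyStatus { OK, FILE_NOT_FOUND, PROXY_UNAVAILABLE, CREDENTIALS_EXPIRED }

    // Converts a proxy status into an error of the matching scope.
    static void check(ProxyStatus status, String filename) throws IOException {
        switch (status) {
            case OK:
                return;
            case FILE_NOT_FOUND:
                // Inside program scope: the program asked for this file.
                throw new FileNotFoundException(filename);
            default:
                // Outside program scope: escape past the program.
                throw new EnvironmentError(status + " while opening " + filename);
        }
    }
}
```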
Conclusion

We started building the Java Universe with some naive assumptions about errors.
On encountering practical difficulties, we thought more abstractly about errors and developed the notion of scope and detail.
By routing errors according to their scope, we made the system more robust and usable.
Food for Thought

There isn’t always an easy way to propagate an error to the scope handler.
– Escaping error to parent process: Raise a POSIX signal.
– Escaping error to the starter: Throw a Java Error, trapped by the Wrapper, placed in file, read after process exits.
Food for Thought

The mere use of exceptions in a program does not imply disciplined error management.
For example, throws IOException is a very vague statement about an interface.
What is an implementor allowed to throw?
– Can open() throw FileNotFound? (Probably.)
– Can read() throw FileNotFound? (Asking for trouble.)
– What about ConnectionRefused?
Food for Thought

A contract can govern more than simply the interface specification.
Consider this self-cleaning program:
fd = open("file");
unlink("file");
close(fd);

Works on UNIX, fails on WinNT.
Can an interface (code+docs) really state all the necessary semantic information?
Should it?
Deployment

As of February 14th, the Java Universe is running
on 515 RedHat 7.2 machines.
Will be rolled out as part of Condor 6.3.2 on all
platforms in the regular release schedule.
Sun JDK 1.2.2 on UNIX machines.
Sun JDK 1.3.2 on WinNT machines.
“Is the Java Universe available on my machine?”
– condor_status -java
skywalker.cs.wisc.edu
c2 cluster
tux lab
istat
Acknowledgements

Although we may take credit (or blame) for
the most recent changes, the Condor
architecture has dealt with errors for many
years. Much credit goes to the core
designers, esp. Mike Litzkow, Todd
Tannenbaum, and Derek Wright.
More Info:

The Condor Project:
– http://www.cs.wisc.edu/condor

These slides:
– http://www.cs.wisc.edu/~thain

Douglas Thain
– [email protected]

Questions now?