Atlas - UCSB Computer Science

Atlas: An Infrastructure for
Global Computing
People

Eric Baldeschwieler (UC Berkeley)

Bobby Blumofe (UT Austin)

Eric Brewer (UC Berkeley)
Outline
Introduction
Programming model
Architecture
Examples
Discussion
Limitations & Conclusion

Introduction
Properties of an Internet computing
infrastructure

Scalability: to 10^6 nodes

Heterogeneity: of machines & OSs


Fault tolerance: completion probability
comparable to sequential program
Adaptive parallelism: dynamic set of resources
Properties ...

Safety: Hosts must be secure

Anonymity: Protect the privacy of the client's data & program

Hierarchy: Locality of communication (local
bandwidth typically is higher)

Ease of use: Minimize “costs” of participating.

Reasonable performance: Low overhead, so there is a benefit
even from a small set of machines.
Introduction ...

Atlas combines mechanisms from:
– Cilk
– Java
– with new mechanisms.

Java “ensures”:
– heterogeneity
– safety
Introduction ...
Atlas:

extends Cilk’s work-stealing scheduler
to a hierarchical Internet setting

uses Cilk-NOW’s mechanisms for:
– adaptive parallelism
– fault tolerance
Programming Model

Applications are written in Java

When a native library is used, heterogeneity
is limited to platforms that support it.

Programming model is:
– a Java-based implementation of Cilk:

Non-blocking, explicit continuation passing threads
– a Unix-like URL-based file system & local caching
with coherence.
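To make "non-blocking, explicit continuation passing threads" concrete, here is a minimal single-process sketch of Fib in that style. It is not Atlas or Cilk code: the class `CpsFib`, the `Sum` successor, and the ready queue are all hypothetical names chosen for illustration. A thread never waits for its children. It either sends a value to its continuation or spawns children plus a successor that fires once both values have arrived.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Hypothetical sketch; none of these names come from Atlas or Cilk.
class CpsFib {
    // Ready queue of non-blocking threads (closures that run to completion).
    private final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor thread: waits for two values, then passes their sum onward.
    private final class Sum {
        private final IntConsumer continuation;
        private int missing = 2; // join counter
        private int total = 0;

        Sum(IntConsumer continuation) { this.continuation = continuation; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                final int result = total;
                ready.addLast(() -> continuation.accept(result));
            }
        }
    }

    // Never blocks: base case sends a value, otherwise spawns two children.
    private void fib(int n, IntConsumer continuation) {
        if (n < 2) {
            continuation.accept(n);
            return;
        }
        Sum sum = new Sum(continuation);
        ready.addLast(() -> fib(n - 1, sum::send));
        ready.addLast(() -> fib(n - 2, sum::send));
    }

    static int run(int n) {
        CpsFib scheduler = new CpsFib();
        int[] answer = new int[1];
        scheduler.ready.addLast(() -> scheduler.fib(n, v -> answer[0] = v));
        while (!scheduler.ready.isEmpty()) {
            scheduler.ready.pollLast().run(); // run the newest ready thread first
        }
        return answer[0];
    }

    public static void main(String[] args) {
        System.out.println(run(10)); // prints 55
    }
}
```

Because every closure runs to completion, any ready closure could in principle be handed to another machine, which is the property a work-stealing scheduler relies on.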
Architecture
Basic architecture
[Figure: a Client, a Manager, and several Compute Servers. Layers shown: Application (Java), Runtime library, Java interpreter, Native libraries (C or C++).]
Architecture ...

Client is a Java application
– connects to compute servers on machines other
than its manager’s.

Idle servers steal work from busy ones.
Architecture

Compute server:
– relinquishes control when there is non-Atlas
work (a screensaver?)
– Runs as a daemon that is either:

working, or

pinging its manager & siblings for work to steal
Architecture: Porting Atlas

Requires a Java runtime system

Port:
– natively written URL-based file system
– some support routines.
Hierarchical Work Stealing
[Figure: a tree of Managers, with Compute Servers below them.]
Hierarchical Work Stealing ...

Manager keeps track of when its subtree is
idle

If manager’s subtree is idle,
manager steals work from its siblings

If a subtree has “too much” work,
it “allows” work stealing from above
What is the definition & implementation of “too much”?
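One possible reading of these rules, sketched as a toy model. Everything here is an assumption made for illustration: the class and method names, the fixed threshold `TOO_MUCH = 4`, and the choice that compute servers under the same manager share freely while a different subtree gives up work only when it holds more than the threshold. The slides leave the real definition of “too much” open.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; class names, methods, and the threshold are assumptions.
class HierarchicalSteal {
    static final int TOO_MUCH = 4; // assumed meaning of "too much" queued work

    static class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        final Deque<String> tasks = new ArrayDeque<>(); // filled only on compute servers

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        // Total queued work in the subtree rooted here.
        int load() {
            int n = tasks.size();
            for (Node c : children) n += c.load();
            return n;
        }

        // Remove the oldest task of the first non-empty queue in this subtree.
        String take() {
            if (!tasks.isEmpty()) return tasks.pollFirst();
            for (Node c : children) {
                String t = c.take();
                if (t != null) return t;
            }
            return null;
        }
    }

    // An idle compute server asks its siblings first. While the enclosing
    // subtree is idle, the request moves up and that manager asks its siblings.
    static String steal(Node thief) {
        for (Node n = thief; n.parent != null && n.load() == 0; n = n.parent) {
            boolean sameManager = (n == thief);
            for (Node sibling : n.parent.children) {
                if (sibling == n) continue;
                int load = sibling.load();
                // Another subtree gives up work only above the threshold.
                if (load > 0 && (sameManager || load > TOO_MUCH)) return sibling.take();
            }
        }
        return null; // nothing to steal under this policy
    }

    public static void main(String[] args) {
        Node root = new Node("root", null);
        Node a = new Node("A", root), b = new Node("B", root);
        Node a1 = new Node("a1", a), b1 = new Node("b1", b);
        for (int i = 0; i < 6; i++) b1.tasks.addLast("t" + i);
        System.out.println(steal(a1)); // prints t0: work crosses from subtree B to A
    }
}
```

In this model most steals stay under one manager, and work crosses a subtree boundary only when the victim subtree is heavily loaded, which is one way to localize communication.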
Hierarchical Work Stealing

The authors claim that proven properties of
Cilk hold in this hierarchical setting.

Goals:
– Localize communication
– Sub-trees map to domain hierarchy
Administrators can control thread migration:
– Outflow: Privacy
– Inflow: Host security
Examples
Fib: fine-grained threads
POV-Ray: coarse-grained threads

              Base     1 Node    3 Nodes     8 Nodes
Fib (24)      1.3      80        40 (2.0)    31 (2.6)
POV-Ray       20700    21000     -           2700 (7.8)
Numbers in ( ) are speedups over 1-node case.
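As a worked check of the parenthesized values, read the timings as Fib (24): 80 on 1 node, 40 on 3 nodes, 31 on 8 nodes, and POV-Ray: 21000 on 1 node, 2700 on 8 nodes. The symbols $T_1$, $T_P$, and $S_P$ are notation introduced here, not taken from the slides.

```latex
S_P = \frac{T_1}{T_P}, \qquad
S_3^{\text{Fib}} = \frac{80}{40} = 2.0, \qquad
S_8^{\text{Fib}} = \frac{80}{31} \approx 2.6, \qquad
S_8^{\text{POV-Ray}} = \frac{21000}{2700} \approx 7.8
```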
Examples ...

POV-Ray is not written in Java

Partitioning is done in Java

8 nodes: only 2% overhead.

What about larger P?
Discussion

Scalable: Yes.

Heterogeneity: Incomplete until Atlas
divorces itself from all native libraries.

Safety:
– Java: OK.
– Native libraries: ?
Discussion ...

Fault tolerance: A timed-out thread is
recomputed from a checkpoint maintained by
its subtree (manager?)
– What is the effect of checkpointing on performance?
The subtree rooted at a thread is its
subcomputation.
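The timeout-and-recompute idea can be sketched as follows. This is a local simulation only, under stated assumptions: a "host" is modelled as a hypothetical `IntUnaryOperator`, the "checkpoint" is just the saved input value, and `runWithRecompute` is an invented name. How Atlas actually stores checkpoints is not stated in these slides.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch; hosts are simulated by local functions.
class TimeoutRecompute {
    // Try each host in turn on the same saved input. If a host does not
    // answer within timeoutMs, abandon it and recompute on the next host.
    static int runWithRecompute(IntUnaryOperator[] hosts, int checkpoint, long timeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (IntUnaryOperator host : hosts) {
                Callable<Integer> attempt = () -> host.applyAsInt(checkpoint);
                Future<Integer> pending = pool.submit(attempt);
                try {
                    return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    pending.cancel(true); // give up on this attempt; the saved input is untouched
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            throw new IllegalStateException("every host timed out");
        } finally {
            pool.shutdownNow(); // interrupt any abandoned attempt
        }
    }

    public static void main(String[] args) {
        IntUnaryOperator stalled = x -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return -1;
        };
        IntUnaryOperator healthy = x -> x * x;
        System.out.println(runWithRecompute(new IntUnaryOperator[] {stalled, healthy}, 7, 100)); // prints 49
    }
}
```

Recomputation is safe in this sketch because the task has no side effects. The slides' claim that subcomputations are transactions is what would make the same approach sound for tasks that do.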
Fault Tolerance ...
Subcomputations are transactions:

Authors claim: side effects can be undone

How does this relate to hierarchical work
stealing?
Discussion ...

Anonymity: A host executing a stolen
subtree cannot determine client.
– Managers are assumed to be trustworthy

Hierarchy: Yes, via manager hierarchy.

Ease of use: Interface incomplete.
– clients submit jobs via a special “shell”
Discussion ...

Adaptive parallelism:
– “Owner” (?) of compute server sets a
policy that defines when server is idle.
– How?
– When a compute server becomes unavailable
for Atlas work, all its sub-computations are
moved to another compute server.
Adaptive Parallelism ...

Moving a subcomputation requires updating
information linking subcomputation to its:
– parent
– children
– How long does it take to retreat?
– Is sub-computation restarted? From checkpoint?
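A minimal sketch of the link updates involved in a move, assuming each subcomputation caches the host of its parent and of each child. The class `SubcomputationMove`, its fields, and the host names are hypothetical; the slides do not describe Atlas's actual bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Atlas's real data structures.
class SubcomputationMove {
    static class Sub {
        final String id;
        final Sub parent;
        final List<Sub> children = new ArrayList<>();
        String host;       // compute server currently holding this subcomputation
        String parentHost; // cached address used to send results upward
        final Map<String, String> childHosts = new HashMap<>(); // child id -> cached host

        Sub(String id, String host, Sub parent) {
            this.id = id;
            this.host = host;
            this.parent = parent;
            if (parent != null) {
                parentHost = parent.host;
                parent.children.add(this);
                parent.childHosts.put(id, host);
            }
        }
    }

    // Move sub to newHost and repair every cached link that named the old host.
    static void move(Sub sub, String newHost) {
        sub.host = newHost;
        if (sub.parent != null) sub.parent.childHosts.put(sub.id, newHost); // parent's link to sub
        for (Sub child : sub.children) child.parentHost = newHost;          // children's links to sub
    }

    public static void main(String[] args) {
        Sub root = new Sub("root", "s1", null);
        Sub mid = new Sub("mid", "s2", root);
        Sub leaf = new Sub("leaf", "s3", mid);
        move(mid, "s9");
        System.out.println(root.childHosts.get("mid") + " " + leaf.parentHost); // prints s9 s9
    }
}
```

The number of links to repair grows with the number of children, which bears on the question of how long a retreat takes.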
Limitations

Atlas inherits tree-structured program
limitation from Cilk.
– But this is still a rich set!

Generalizing to non-tree-structured
programs seems hard.

No shared variables among threads.

Global file system is read-only.
Conclusion
Jicos design goals = those for Atlas.
Use JXTA to give Jicos a “file system”
– Then, Jicos becomes Atlas’s heir.