Atlas - UCSB Computer Science
Atlas: An Infrastructure for
Global Computing
People
Eric Baldeschwieler (UC Berkeley)
Bobby Blumofe (UT Austin)
Eric Brewer (UC Berkeley)
Outline
Introduction
Programming model
Architecture
Examples
Discussion
Limitations & Conclusion
Introduction
Properties of an Internet computing
infrastructure
Scalability: to 10^6 nodes
Heterogeneity: of machines & OSs
Fault tolerance: completion probability
comparable to sequential program
Adaptive parallelism: dynamic set of resources
Properties ...
Safety: Hosts must be secure
Anonymity: Secure privacy of client: data & program
Hierarchy: Locality of communication (local
bandwidth typically is higher)
Ease of use: Minimize “costs” of participating.
Reasonable performance: Low overhead. Benefit
from a small set of machines.
Introduction ...
Atlas combines mechanisms from:
– Cilk
– Java
– with new mechanisms.
Java “ensures”:
– heterogeneity
– safety
Introduction ...
Atlas:
extends Cilk’s work-stealing scheduler
to a hierarchical Internet setting
uses Cilk-NOW’s mechanisms for:
– adaptive parallelism
– fault tolerance
Programming Model
Applications are written in Java
When a native library is used, heterogeneity
is limited to platforms that support it.
Programming model is:
– a Java-based implementation of Cilk:
Non-blocking, explicit continuation passing threads
– a Unix-like URL-based file system & local caching
with coherence.
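A minimal sketch of what "non-blocking, explicit continuation passing threads" can look like in plain Java. It is not Atlas code: the names CpsFib, Sum, ready and run are hypothetical, and a single-threaded ready queue stands in for the scheduler. Each thread runs to completion and passes its result to a continuation instead of waiting for its children.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

class CpsFib {
    // Ready queue of threads (hypothetical stand-in for a scheduler).
    // Every entry runs to completion and never blocks.
    static final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor closure: waits for two arguments, then forwards their sum
    // to its own continuation.
    static final class Sum {
        private int missing = 2;
        private int total = 0;
        private final IntConsumer cont;

        Sum(IntConsumer cont) { this.cont = cont; }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                final int result = total;
                ready.push(() -> cont.accept(result));
            }
        }
    }

    // Instead of returning fib(n), the thread sends it to 'cont'.
    static void fib(int n, IntConsumer cont) {
        if (n < 2) {
            cont.accept(n);
            return;
        }
        Sum join = new Sum(cont);
        // Spawn two child threads; each sends its result to the join closure.
        ready.push(() -> fib(n - 1, join::send));
        ready.push(() -> fib(n - 2, join::send));
    }

    // Drains the ready queue until the final continuation has fired.
    static int run(int n) {
        int[] out = new int[1];
        ready.push(() -> fib(n, v -> out[0] = v));
        while (!ready.isEmpty()) {
            ready.pop().run();
        }
        return out[0];
    }

    public static void main(String[] args) {
        System.out.println(run(24)); // prints 46368
    }
}
```

Because no thread ever waits, any entry in the ready queue can be handed to another machine, which is what makes this style a natural fit for work stealing.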
Architecture
Basic architecture
[Figure: basic architecture. Components shown: client, manager, compute servers. Software layers shown: application (Java), runtime library, Java interpreter, native libraries (C or C++).]
Architecture ...
Client is a Java application
– connects to compute servers on machines other
than its manager’s.
Idle servers steal work from busy ones.
Architecture
Compute server:
– relinquishes control when there is non-Atlas
work (a screensaver?)
– Runs as a daemon that is either:
working, or
pinging manager & siblings for work to steal
Architecture: Porting Atlas
A Java runtime system
Port:
– natively written URL-based file system
– some support routines.
Hierarchical Work Stealing
[Figure: hierarchy of managers and compute servers.]
Hierarchical Work Stealing ...
Manager keeps track of when its subtree is
idle
If manager’s subtree is idle,
manager steals work from its siblings
If a subtree has “too much” work,
it “allows” work stealing from above
What is the definition & implementation of “too much”?
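The stealing rules on this slide can be sketched as follows. This is an illustration only, not the Atlas implementation: the classes HierSteal, Node, Server and Manager are hypothetical, tasks are plain strings, and TOO_MUCH is an assumed fixed threshold, since the slides leave "too much" undefined. An idle node first tries its siblings; only when the whole subtree is idle does its manager steal from the manager's own siblings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class HierSteal {
    // Assumption: a subtree "has too much work" at this many queued tasks.
    static final int TOO_MUCH = 2;

    abstract static class Node {
        Manager parent;
        abstract int load();
        abstract boolean allowsSteal();
        abstract String stealOne(); // null when nothing can be taken
    }

    static final class Server extends Node {
        final Deque<String> work = new ArrayDeque<>(); // owner appends at the tail
        int load() { return work.size(); }
        boolean allowsSteal() { return !work.isEmpty(); }
        // A thief takes the oldest task, as in Cilk-style deques.
        String stealOne() { return work.pollFirst(); }
    }

    static final class Manager extends Node {
        final List<Node> children = new ArrayList<>();

        void add(Node child) { child.parent = this; children.add(child); }

        int load() {
            int sum = 0;
            for (Node c : children) sum += c.load();
            return sum;
        }

        // A subtree opens itself to steals from above only past the threshold.
        boolean allowsSteal() { return load() >= TOO_MUCH; }

        // Hand out one task from the busiest child of this subtree.
        String stealOne() {
            Node busiest = null;
            for (Node c : children) {
                if (c.load() > 0 && (busiest == null || c.load() > busiest.load())) busiest = c;
            }
            return busiest == null ? null : busiest.stealOne();
        }

        // Called on behalf of an idle child: try its siblings first; if the
        // whole subtree is idle, escalate so the parent tries this manager's siblings.
        String stealFor(Node thief) {
            Node victim = null;
            for (Node c : children) {
                if (c != thief && c.allowsSteal()
                        && (victim == null || c.load() > victim.load())) victim = c;
            }
            if (victim != null) return victim.stealOne();
            return parent == null ? null : parent.stealFor(this);
        }
    }

    public static void main(String[] args) {
        Manager root = new Manager(), a = new Manager(), b = new Manager();
        Server a1 = new Server(), b1 = new Server();
        root.add(a); root.add(b); a.add(a1); b.add(b1);
        b1.work.addLast("t1"); b1.work.addLast("t2"); b1.work.addLast("t3");
        // a1 is idle and has no busy sibling, so the request climbs to root.
        System.out.println(a.stealFor(a1)); // prints t1
    }
}
```

With this threshold, a subtree holding a single task keeps it, so steals across subtrees (the expensive, non-local ones) happen only when a subtree is clearly overloaded.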
Hierarchical Work Stealing
The authors claim that proven properties of
Cilk hold in this hierarchical setting.
Goals:
– Localize communication
– Sub-trees map to domain hierarchy
Administrators can control thread migration:
– Outflow: Privacy
– Inflow: Host security
Examples
Fib: fine grained threads
POV-Ray: coarse grained threads
              Base     1 Node    3 Nodes     8 Nodes
Fib (24)      1.3      80        40 (2.0)    31 (2.6)
POV-Ray       20700    21000     -           2700 (7.8)
Numbers in ( ) are speedups over 1-node case.
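As a check on the parenthesized figures, the speedup over the 1-node case is the 1-node measurement divided by the P-node measurement. The sketch below recomputes them, taking 80 as the 1-node Fib (24) value, which is the reading consistent with the listed speedups. The class and method names are hypothetical.

```java
class Speedup {
    // Speedup of a P-node run over the 1-node run: T_1 / T_P.
    static double speedup(double oneNode, double pNodes) {
        return oneNode / pNodes;
    }

    // Round to one decimal place, the precision used in the table.
    static double round1(double x) {
        return Math.round(x * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println("Fib, 3 nodes:     " + round1(speedup(80, 40)));      // 2.0
        System.out.println("Fib, 8 nodes:     " + round1(speedup(80, 31)));      // 2.6
        System.out.println("POV-Ray, 8 nodes: " + round1(speedup(21000, 2700))); // 7.8
    }
}
```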
Examples ...
POV-Ray is not written in Java
Partitioning is done in Java
8 nodes: only 2% overhead.
What about larger P?
Discussion
Scalable: Yes.
Heterogeneity: Incomplete until Atlas
divorces itself from all native libraries.
Safety:
– Java: OK.
– Native libraries: ?
Discussion ...
Fault tolerance: A timed-out thread is
recomputed from a checkpoint maintained by
its subtree (manager?)
– What is the effect of checkpointing on performance?
Subtree rooted at a thread is its
subcomputation.
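One way the timeout-and-recompute idea might look, sketched with the standard java.util.concurrent API. It is not Atlas's mechanism: CheckpointRetry and recompute are hypothetical names, each "host" is modelled as a plain function, and the checkpoint is simply the saved input of the subcomputation.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

class CheckpointRetry {
    // Tries each host in turn. If a host does not answer within timeoutMs,
    // its attempt is cancelled and the subcomputation is recomputed from the
    // same checkpoint on the next host.
    static <S, R> R recompute(S checkpoint, List<Function<S, R>> hosts, long timeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (Function<S, R> host : hosts) {
                Callable<R> task = () -> host.apply(checkpoint);
                Future<R> attempt = pool.submit(task);
                try {
                    return attempt.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    attempt.cancel(true); // give up on this host
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
            }
            throw new IllegalStateException("no host completed the subcomputation");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated unresponsive host: sleeps far longer than the timeout.
        Function<Integer, Integer> hung = n -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return -1;
        };
        Function<Integer, Integer> healthy = n -> n * 2;
        System.out.println(recompute(21, List.of(hung, healthy), 100)); // prints 42
    }
}
```

Recomputing is only safe here because the task reads nothing but its checkpoint and has no side effects.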
Fault Tolerance ...
Subcomputations are transactions:
Authors claim: side effects can be undone
How does this relate to hierarchical work
stealing?
Discussion ...
Anonymity: A host executing a stolen
subtree cannot determine client.
– Managers are assumed to be trustworthy
Hierarchy: Yes, via manager hierarchy.
Ease of use: Interface incomplete.
– clients submit jobs via a special “shell”
Discussion ...
Adaptive parallelism:
– “Owner” (?) of compute server sets a
policy that defines when server is idle.
– How?
– When a compute server becomes unavailable
for Atlas work, all its subcomputations are
moved to another compute server.
Adaptive Parallelism ...
Moving a subcomputation requires updating
information linking subcomputation to its:
– parent
– children
– How long does it take to retreat?
– Is sub-computation restarted? From checkpoint?
Limitations
Atlas inherits tree-structured program
limitation from Cilk.
– But this is still a rich set!
Generalizing to non-tree-structured
programs seems hard.
No shared variables among threads.
Global file system is read-only.
Conclusion
Jicos
design goals = those for Atlas.
Use JXTA to give Jicos a “file system”
– Then, Jicos becomes Atlas’s heir.