The IC Wall Collaboration between Computer science + Physics

Parallel Programming on Computational Grids
Outline
• Grids
• Application-level tools for grids
• Parallel programming on grids
• Case study: Ibis
Grids
• Seamless integration of geographically distributed computers,
databases, instruments
– The name is an analogy with power grids
• Highly active research area
– Global Grid Forum (now called Open Grid Forum)
– Globus middleware
– Many European projects
• e.g. Gridlab: Grid Application Toolkit and Testbed
– VL-e (Virtual laboratory for e-Science) project
– ….
Why Grids?
• New distributed applications that use data or instruments across multiple administrative
domains and that need much CPU power
– Computer-enhanced instruments
– Collaborative engineering
– Browsing of remote datasets
– Use of remote software
– Data-intensive computing
– Very large-scale simulation
– Large-scale parameter studies
Web, Grids and e-Science
• Web is about exchanging information
• Grid is about sharing resources
– Computers, data bases, instruments
• e-Science supports experimental science by providing a
virtual laboratory on top of Grids
– Support for visualization, workflows, data management,
security, authentication, high-performance computing
The big picture
[Figure: applications run on top of a Virtual Laboratory layer (application-oriented services, potential generic services, and management of communication & computing), which in turn runs on Grids that harness distributed resources]
Applications
• e-Science experiments generate large amounts of data, which often are distributed and need much (parallel) processing
– High-resolution imaging: ~1 GByte per measurement
– Bio-informatics queries: 500 GByte per database
– Satellite world imagery: ~5 TByte/year
– Current particle physics: 1 PByte per year
– LHC physics (2007): 10-30 PByte per year
Grid programming
• The goal of a grid is to make resource sharing very easy (transparent)
• Practice: grid programming is very difficult
– Finding resources, running applications, dealing with heterogeneity and security, etc.
• Grid middleware (Globus Toolkit) makes this somewhat easier,
but is still low-level and changes frequently
• Need application-level tools
Application-level tools
• Paper (on Blackboard):
– The Grid: Blueprint for a New Computing Infrastructure (2nd edition, eds. I. Foster and C. Kesselman);
Chapter 24: Application-Level Tools (Bal, Casanova, Dongarra, Matsuoka)
• Builds on grid software infrastructure
• Isolates users from dynamics of the grid hardware infrastructure
• Generic (broad classes of applications)
• Easy to use
Taxonomy of existing application-level tools
• Grid programming models
– RPC
– Task parallelism
– Message passing
– Java programming
• Grid application execution environments
– Parameter sweeps
– Workflow
– Portals
Remote Procedure Call (RPC)
• GridRPC: specialize RPC-style (client/server) programming for grids
– Allows coarse-grain task parallelism & remote access
– Extended with resource discovery, scheduling, etc.
• Example: NetSolve
– Solves numerical problems on a grid
• Current development: use web technology (WSDL, SOAP) for grid services
• Web and grid technology are merging
Task parallelism
• Many systems for task parallelism (master-worker, replicated workers) exist for the grid
• Examples
– MW (master-worker)
– Satin: divide&conquer (hierarchical master-worker)
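As a rough illustration of the master-worker (replicated workers) style, and not the MW or Satin API, a minimal sketch in plain Java might look like this; the task granularity and the toy computation are hypothetical:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal master-worker sketch: the master puts independent tasks in a
// shared queue, replicated workers repeatedly take a task, execute it,
// and report the result; the master combines the results at the end.
public class MasterWorker {
  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
    BlockingQueue<Long> results = new LinkedBlockingQueue<>();
    int nTasks = 20, nWorkers = 4;

    // Master: generate the work items up front.
    for (int i = 1; i <= nTasks; i++) tasks.add(i);

    // Replicated workers: keep taking tasks until the queue is empty.
    Runnable worker = () -> {
      Integer n;
      while ((n = tasks.poll()) != null) {
        long sum = 0;                              // toy computation
        for (long i = 1; i <= n * 100_000L; i++) sum += i;
        results.add(sum);
      }
    };

    Thread[] threads = new Thread[nWorkers];
    for (int w = 0; w < nWorkers; w++) (threads[w] = new Thread(worker)).start();
    for (Thread t : threads) t.join();

    // Master: combine the results.
    long total = 0;
    for (int i = 0; i < nTasks; i++) total += results.take();
    System.out.println("total = " + total);
  }
}

On a grid, systems such as MW and Satin play the role of the queue and worker pool, distributing the tasks over remote machines instead of local threads.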
Message passing
• Several MPI implementations exist for the grid
• PACX-MPI (Stuttgart):
– Runs on heterogeneous systems
• MagPIe (Thilo Kielmann):
– Optimizes MPI's collective communication for hierarchical wide-area systems (see the sketch after this list)
• MPICH-G2:
– Similar to PACX and MagPIe, implemented on Globus
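MagPIe's key idea is to make collectives hierarchy-aware: each message crosses the wide area only once per cluster, and the local fan-out uses the fast LAN. A rough sketch of that message pattern in plain Java (not the MagPIe or MPI API; Node and send() are hypothetical placeholders):

import java.util.List;
import java.util.Map;

// Sketch of a cluster-aware broadcast: the root sends one message per
// remote cluster over the slow WAN; each cluster's "coordinator" then
// forwards it to the other members over the fast local network.
class HierarchicalBroadcast {
  interface Node { void send(byte[] msg); }

  // clusters maps a cluster name to its member nodes (first node = coordinator).
  static void broadcast(byte[] msg, String rootCluster,
                        Map<String, List<Node>> clusters) {
    // Wide-area phase: one message per remote cluster.
    for (Map.Entry<String, List<Node>> e : clusters.entrySet()) {
      if (!e.getKey().equals(rootCluster)) {
        e.getValue().get(0).send(msg);             // remote coordinator
      }
    }
    // Local phase: each coordinator forwards to its remaining members.
    // In a real system this loop runs on every coordinator; here it only
    // shows which messages are sent.
    for (List<Node> members : clusters.values()) {
      for (int i = 1; i < members.size(); i++) {
        members.get(i).send(msg);
      }
    }
  }
}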
Java programming
• Java uses bytecode and is very portable
– "Write once, run anywhere"
• Can be used to handle heterogeneity
• Many systems now have Java interfaces:
– Globus (Globus Java Commodity Grid)
– MPI (MPIJava, MPJ, …)
– Gridlab Application Toolkit (Java GAT)
• Ibis and ProActive are Java-centric grid programming systems
Parameter sweep applications
• Computations that are mostly independent
– E.g. run same simulation many times with different parameters
• Can tolerate high network latencies, can easily be made fault-tolerant
• Many systems use this type of trivial parallelism to harness idle desktops
– APST, SETI@home, Entropia, XtremWeb
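The essence of a parameter sweep is that the same computation runs once per parameter value, with no communication in between; a lost run can simply be resubmitted. A minimal sketch in plain Java (simulate() and the parameter range are hypothetical placeholders):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.*;

// Parameter sweep sketch: run an independent (placeholder) simulation
// for many parameter values and collect the results.
public class ParameterSweep {
  static double simulate(double parameter) {
    return Math.sin(parameter) * parameter;        // placeholder kernel
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
    Map<Double, Future<Double>> runs = new LinkedHashMap<>();

    for (double p = 0.0; p < 10.0; p += 0.5) {
      final double param = p;
      runs.put(param, pool.submit(() -> simulate(param)));
    }
    for (Map.Entry<Double, Future<Double>> e : runs.entrySet()) {
      System.out.printf("param=%.1f  result=%.3f%n",
                        e.getKey(), e.getValue().get());
    }
    pool.shutdown();
  }
}

Grid systems such as APST replace the local thread pool with remote (often idle desktop) machines, which is why high latencies and failures are easy to tolerate.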
Workflow applications
• Link and compose diverse software tools and data formats
– Connect modules and data-filters
• Results in coarse-grain, dataflow-like parallelism that
can be run efficiently on a grid
• Several workflow management systems exist
– E.g. Virtual Lab Amsterdam (predecessor of VL-e)
Portals
• Graphical interfaces to the grid
• Often application-specific
• Also portals for resource brokering, file transfers, etc.
Outline
• Grids
• Application-level tools for grids
• Parallel programming on grids
• Case study: Ibis
Distributed supercomputing
• Parallel processing on geographically distributed computing systems (grids)
• Examples:
– SETI@home, RSA-155, Entropia, Cactus
• Currently limited to trivially parallel applications
• Questions:
– Can we generalize this to more HPC applications?
– What high-level programming support is needed?
Grids versus supercomputers
• Performance/scalability
– Speedups on geographically distributed systems?
• Heterogeneity
– Different types of processors, operating systems
– Different networks (Ethernet, Myrinet, WANs)
• General grid issues
– Resource management, co-allocation, firewalls,
security, authorization, accounting, fault tolerance,
malleability ….
Wide-Area Grid Systems
Approaches
• Performance/scalability
– Exploit hierarchical structure of grids
(previous project: Albatross)
• Heterogeneity
– Use Java + JVM (Java Virtual Machine) technology
• General grid issues
– Studied in many grid projects: VL-e, GridLab, GGF
Speedups on a grid?
• Grids usually are hierarchical
– Collections of clusters, supercomputers
– Fast local links, slow wide-area links
• Can optimize algorithms to exploit this hierarchy
– Minimize wide-area communication
• Successful for many applications
– Did many experiments on a homogeneous wide-area test bed (DAS)
Example: N-body simulation
• Much wide-area communication
– Each node needs info about remote bodies
[Figure: two clusters (Amsterdam and Delft), each with CPU 1 and CPU 2; every CPU exchanges body information with the remote cluster over the wide-area link]
Trivial optimization
[Figure: the same two clusters after the optimization, with far fewer messages crossing the wide-area link]
Wide-area optimizations
• Message combining on wide-area links (see the sketch after this list)
• Latency hiding on wide-area links
• Collective operations for wide-area systems
– Broadcast, reduction, all-to-all exchange
• Load balancing
Conclusions:
– Many applications can be optimized to run efficiently on a hierarchical wide-area system
– Need better programming support
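One of the optimizations listed above, message combining, can be sketched as follows (a minimal illustration, not the actual Albatross/Ibis code; sendWideArea() and the flush threshold are hypothetical):

import java.io.ByteArrayOutputStream;

// Sketch of message combining on a wide-area link: small messages for the
// same remote cluster are appended to a buffer and shipped as one combined
// message, so the high WAN latency is paid once per batch instead of once
// per message. sendWideArea() is a placeholder for the real transport.
class CombiningChannel {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final String remoteCluster;
  private final int threshold;            // flush when this many bytes are queued

  CombiningChannel(String remoteCluster, int threshold) {
    this.remoteCluster = remoteCluster;
    this.threshold = threshold;
  }

  synchronized void send(byte[] smallMessage) {
    buffer.write(smallMessage, 0, smallMessage.length);   // queue locally
    if (buffer.size() >= threshold) flush();
  }

  synchronized void flush() {             // one WAN message for many updates
    if (buffer.size() == 0) return;
    sendWideArea(remoteCluster, buffer.toByteArray());
    buffer.reset();
  }

  private void sendWideArea(String cluster, byte[] combined) {
    // Placeholder: hand the combined message to the wide-area transport.
    System.out.println("WAN send to " + cluster + ": " + combined.length + " bytes");
  }
}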
Outline
• Grids
• Application-level tools for grids
• Parallel programming on grids
• Case study: Ibis
– Paper (on Blackboard): van Nieuwpoort et al., AGRIDM'03 (Workshop on Adaptive Grid Middleware, New Orleans, Sept. 2003)
The Ibis system
• High-level & efficient programming support for distributed supercomputing on heterogeneous grids
• Use Java-centric approach + JVM technology
– Inherently more portable than native compilation
• "Write once, run anywhere"
– Requires entire system to be written in pure Java
• Goal: drastically simplify programming and
deployment of grid applications
Global picture
Ibis programming support
• Ibis provides
– Remote Method Invocation (RMI)
– Replicated objects (RepMI) - as in Orca
– Group/collective communication (GMI) - as in MPI
– Divide & conquer (Satin) - as in Cilk
• All integrated in a clean, object-oriented way into Java, using special “marker”
interfaces
– Invoking native library (e.g. MPI) would give up Java’s “run anywhere” portability
Compiling/optimizing programs
[Figure: Java source is translated by a standard Java compiler to bytecode, the Ibis bytecode rewriter transforms the bytecode, and the result runs on ordinary JVMs]
• Optimizations are done by bytecode rewriting
– E.g. compiler-generated serialization
Satin: a parallel divide-and-conquer system on top of Ibis
• Divide-and-conquer is inherently hierarchical
• More general than master/worker
• Satin: Cilk-like primitives (spawn/sync) in Java
• New load balancing algorithm (CRS)
– Cluster-aware random work stealing
[Figure: spawn tree of fib(5); subtrees are executed on cpu 1, cpu 2, and cpu 3]
Example
interface FibInter {
  public int fib(int n);
}

class Fib implements FibInter {
  public int fib(int n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
  }
}
Single-threaded Java
Example
interface FibInter
    extends ibis.satin.Spawnable {
  public int fib(int n);
}

class Fib
    extends ibis.satin.SatinObject
    implements FibInter {
  public int fib(int n) {
    if (n < 2) return n;
    int x = fib(n - 1);   // spawned: fib is marked via the Spawnable interface
    int y = fib(n - 2);   // spawned
    sync();               // wait until the spawned invocations have finished
    return x + y;
  }
}
Java + divide&conquer
GridLab testbed
Ibis implementation
• Want to exploit Java’s “run everywhere” property, but
– That requires a 100% pure Java implementation,
not a single line of native code
– Hard to use native communication (e.g. Myrinet) or native compiler/runtime system
• Ibis approach:
– Reasonably efficient pure Java solution (for any JVM)
– Optimized solutions with native code for special cases
Ibis design
[Figure: Ibis design — applications use RMI, GMI, Satin, MPJ, and RepMI, all implemented on the Ibis Portability Layer (IPL), which in turn has implementations on TCP, SmartSockets, P2P, UDP, GM (Myrinet), and MPI]
Fast communication in pure Java
• Manta system [ACM TOPLAS Nov. 2001]
– RMI at RPC speed, but using native compiler & RTS
• Ibis does similar optimizations, but in pure Java
– Compiler-generated serialization at bytecode level
• 5-9x faster than using runtime type inspection
– Reduce copying overhead
• Zero-copy native implementation for primitive arrays
• Pure-Java requires type-conversion (=copy) to bytes
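The difference between runtime type inspection and generated serialization can be sketched as follows; this is only the idea, not the code the Ibis rewriter actually emits, and the Body class and its fields are hypothetical:

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.Serializable;

// Default Java serialization discovers the fields of Body via reflection
// on every write; a compiler-generated writer is straight-line code that
// writes the known fields directly, avoiding the per-object inspection cost.
class Body implements Serializable {
  double x, y, z;
  double mass;

  // What a generated writer conceptually looks like.
  void generatedWrite(DataOutputStream out) throws IOException {
    out.writeDouble(x);
    out.writeDouble(y);
    out.writeDouble(z);
    out.writeDouble(mass);
  }
}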
Java/Ibis vs. C/MPI on Pentium-3 cluster (using SOR)
Grid deployment
• JavaGAT: Java Grid Application Toolkit
– Make applications independent of underlying grid
– Used for file copying, resource discovery,
job submission & monitoring, user authentication
– API is currently being standardized (SAGA)
• Zorilla P2P system
– Job management, gossiping,
clustering, flood scheduling
Ibis applications / external users
• e-Science (VL-e)
– Brain MEG-imaging
– Mass spectroscopy
• Multimedia content analysis
• Video processing
• Many HPC applications
– SAT-solver, N-body, grammar learning, ….
• D-Grid: Workflow engine for astronomy
• U. Erlangen: grid file system
• INRIA: ProActive on Ibis RMI
• U. Patras: Jylab scientific computing
• UPC Barcelona: Grid Superscalar
• HITACHI: Peta-scale data management
Early Grid experiences with Ibis
• Using Satin divide-and-conquer system
– Implemented with Ibis in pure Java, using TCP/IP
• Application measurements on
– DAS-2 (homogeneous)
– Testbed from EC GridLab project (heterogeneous)
– Grid’5000 (France) – N Queens challenge
Distributed ASCI Supercomputer (DAS) 2
[Figure: DAS-2 clusters — VU (72 nodes), UvA (32), Leiden (32), Utrecht (32), Delft (32), connected by GigaPort (1-10 Gb)]
Satin on wide-area DAS-2
[Figure: speedups of the Satin applications (adaptive integration, TSP, N-queens, prime factors, raytracer, N choose k, IDA*, knapsack, naive Fibonacci, set cover, Fibonacci with threshold) on a single cluster of 64 machines versus 4 clusters of 16 machines]
Satin on GridLab
• Heterogeneous European grid testbed
• Implemented Satin/Ibis on GridLab, using TCP
GridLab
• Latencies:
– 9-200 ms (daytime),
9-66 ms (night)
• Bandwidths:
– 9-4000 KB/s
Configuration
Type          OS       CPU        Location    CPUs
Cluster       Linux    Pentium-3  Amsterdam   8 × 1
SMP           Solaris  Sparc      Amsterdam   1 × 2
Cluster       Linux    Xeon       Brno        4 × 2
SMP           Linux    Pentium-3  Cardiff     1 × 2
Origin 3000   Irix     MIPS       ZIB Berlin  1 × 16
SMP           Unix     Alpha      Lecce       1 × 4
Experiences
• No support for co-allocation yet (done manually)
• Firewall problems everywhere
– Addressed with ‘SmartSockets’ library
• Java indeed runs anywhere, modulo bugs in (old) JVMs
• Need clever load balancing mechanism (CRS)
Cluster-aware Random Stealing
• Use Cilk’s Random Stealing (RS) inside cluster
• When idle
– Send asynchronous wide-area steal message
– Do random steals locally, and execute stolen jobs
– Only 1 wide-area steal attempt in progress at a time
• Prefetching adapts
– More idle nodes → more prefetching
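A rough sketch of the CRS idle loop described above, in Java-style pseudocode; Job, stealLocal(), stealRemoteAsync() and execute() are hypothetical hooks, and the real Satin implementation differs in detail:

import java.util.concurrent.atomic.AtomicBoolean;

// Cluster-aware Random Stealing (CRS) sketch: when a node runs out of work
// it issues at most one asynchronous wide-area steal request, and keeps
// stealing (and executing) jobs locally while that request is outstanding,
// so the high WAN latency is overlapped with useful local work.
abstract class CrsScheduler {
  private final AtomicBoolean wideAreaStealPending = new AtomicBoolean(false);

  void idleLoop() {
    while (!finished()) {
      // At most one wide-area steal attempt in progress at any time.
      if (wideAreaStealPending.compareAndSet(false, true)) {
        stealRemoteAsync(randomRemoteCluster(), job -> {
          if (job != null) execute(job);
          wideAreaStealPending.set(false);     // allow the next WAN steal
        });
      }
      // Meanwhile, keep doing synchronous random steals inside the cluster.
      Job local = stealLocal(randomLocalVictim());
      if (local != null) execute(local);
    }
  }

  // Hypothetical hooks onto the runtime system.
  interface Job {}
  interface StealReply { void deliver(Job jobOrNull); }
  abstract boolean finished();
  abstract int randomLocalVictim();
  abstract String randomRemoteCluster();
  abstract Job stealLocal(int victim);
  abstract void stealRemoteAsync(String cluster, StealReply reply);
  abstract void execute(Job job);
}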
Performance on GridLab
• Problem: how to define efficiency on a grid?
• Our approach:
– Benchmark each CPU with Raytracer on small input
– Normalize CPU speeds (relative to a DAS-2 node)
– Our case: 40 CPUs, equivalent to 24.7 DAS-2 nodes
– Define:
• T_perfect = sequential time / 24.7
• efficiency = T_perfect / actual runtime
– Also compare against single 25-node DAS-2 cluster
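Worked example (using the Raytracer numbers in the table below): T_perfect = 13,564 s / 24.7 ≈ 549 s, so the night-time CRS run of 677 s gives an efficiency of roughly 549 / 677 ≈ 81%, consistent with the 81.3% reported.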
Results for Raytracer
                 Time (sec)   Efficiency (%)
Night   RS           878          62.6
        CRS          677          81.3
Day     RS          2084          26.4
        CRS          693          79.3
1 cluster            580          96.1
sequential        13,564         100

RS = Random Stealing, CRS = Cluster-aware RS
Some statistics
• Variations in execution times:
– RS @ day: 0.5 - 1 hour
– CRS @ day: less than 20 secs variation
• Internet communication (total):
– RS: 11,000 (night) - 150,000 (day) messages
137 (night) - 154 (day) MB
– CRS: 10,000 - 11,000 messages
82 - 86 MB
N-Queens Challenge
• How many ways are there to arrange N non-attacking queens on an N×N chessboard?
Known Solutions
• Currently, all solutions up to N=25 are known
• The result for N=25 was computed using the 'spare time' of 260 machines
• Took 6 months
• Used over 53 CPU-years!

N     # Solutions
19    4,968,057,848
20    39,029,188,884
21    314,666,222,712
22    2,691,008,701,644
23    24,233,937,684,440
24    227,514,171,973,736
25    2,207,893,435,808,352
N-Queens Challenge
• How many solutions can you calculate in 1 hour, using as many machines as you can get?
• Part of the PlugTest 2005 and 2006 held in Sophia Antipolis
Satin & Ibis
• Our submission used a Satin/Ibis application
• Uses Satin (Divide & Conquer) to solve N-Queens recursively
• Satin distributes different parts of the computation over the Grid
– Uses Ibis for communication
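As a plain sequential sketch of the recursive kernel such a solver parallelizes (not the actual contest code), each level places one queen and recurses on the remaining rows; in a Satin version the recursive calls for the first few rows would be spawned and combined after sync():

// Sequential sketch of the recursive N-Queens counting kernel. Bitmasks
// track which columns and diagonals are already attacked; each recursive
// call places a queen on the next row.
public class NQueensCount {
  static long solutions(int columns, int leftDiag, int rightDiag, int all) {
    if (columns == all) return 1;              // every row has a queen
    long count = 0;
    int free = ~(columns | leftDiag | rightDiag) & all;
    while (free != 0) {
      int bit = free & -free;                  // lowest free position
      free -= bit;
      count += solutions(columns | bit,
                         (leftDiag | bit) << 1,
                         (rightDiag | bit) >> 1, all);
    }
    return count;
  }

  public static void main(String[] args) {
    int n = args.length > 0 ? Integer.parseInt(args[0]) : 12;
    int all = (1 << n) - 1;
    System.out.println("N=" + n + ": " + solutions(0, 0, 0, all));
  }
}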
Satin & Ibis
• Used the Grid 5000 testbed to run
– Large testbed distributed over France
• Currently contains some 2500 CPUs
• Will (eventually) contain 5000 CPUs
Largest Satin/Ibis Run
• Our best run used 961 CPUs on
5 clusters for a single application
– Largest number of CPUs of all contestants
• Solved N=22 in 25 minutes
• Second place in the 2005 contest
Recent results
• Ibis wins First IEEE International Scalable
Computing Challenge (SCALE 2008)
– 8th IEEE International Symposium on Cluster
Computing and the Grid, May, Lyon, France
• Ibis wins Data Analysis Challenge for Finding
Supernovae
– IEEE Cluster/Grid 2008 Conference,
29 September - 1 October, Tsukuba, Japan
Data challenge
• Find (potential) supernovae in a large set of images captured by the Subaru telescope
(Mauna Kea, Hawaii)
Another result from last week
Current work: VL-e
• VL-e: Virtual Laboratories for e-Science
• Large Dutch project (2004-2008):
– 40 M€ (20 M€ BSIK funding from the Dutch
government)
• 20 partners
– Academia: Amsterdam, TU Delft, VU, CWI,
NIKHEF, ..
– Industry: Philips, IBM, Unilever, CMG, ....
DAS-3
• Next generation grid in the Netherlands
• Partners:
– NWO & NCF (Dutch science foundation)
– ASCI
– Gigaport-NG/SURFnet: DWDM computer backplane
(dedicated optical group of up to 8 lambdas)
– VL-e and MultimediaN BSIK projects
• Optical 10Gb/s interconnect with Grid’5000
StarPlane project
• Application-specific management of optical networks
• Future applications can:
– dynamically allocate light paths of 10 Gbit/sec each
– control topology through the Network Operations Center
• Gives flexible, dynamic, high-bandwidth links
• Research questions:
– How to provide this flexibility (across domains)?
– How to integrate optical networks with applications?
• Joint project with Cees de Laat (Univ. of Amsterdam), funded by NWO
Summary
• Parallel computing on Grids (distributed supercomputing) is a challenging and promising
research area
• Many grid programming environments exist
• Ibis: a Java-centric Grid programming environment
– Portable (“run anywhere”) and efficient
• Current work: Virtual Laboratory for e-Science (VL-e), DAS-3, StarPlane