
MPJ Express
and Lessons from Java Gadget
Bryan Carpenter
OMII, University of Southampton
Southampton SO17 1BJ, UK
February 7, 2007
[email protected]

Contents

• MPJ Express overview
• The Gadget 2 code
• Making a Java version of Gadget

Acknowledgements

• Work done in collaboration with Mark Baker and Aamir Shafi.
• Most of the work was done by the latter!

MPJ Express

Java for High Performance Computing?

• Various people have argued that Java provides a good programming platform for large-scale computation (e.g. Java Grande, 1997 – c. 2003).
  • Parallel computing has a long history of novel language concepts to support parallelism, multithreading, etc.
  • Java seems to incorporate at least some of these ideas, and has the benefit that it is a “mainstream” language.
• Java encourages better software engineering and is more portable than, say, Fortran or C.
  • There is a large body of supporting software for Java.

Performance

• Modern JVMs compile byte code to native machine code as the Java program is executed.
  • Typical JVMs implement many of the optimizations used by compilers for “traditional” programming languages.
  • Adaptive compilation may even allow optimizations that are impractical for static compilers, because run-time information is available.
  • The quality of compilation is comparable to many C or Fortran compilers.
• But Java has safety features that may limit performance.
  • Difficulty of managing memory explicitly, exact exception processing, …

MPI Background

• MPI was introduced in 1994 as a standard message-passing API for parallel scientific computing:
  • Language bindings for C, C++, and Fortran.
• The original programming model was SPMD (Single Program Multiple Data) ― P copies of a single program run on a fixed number, P, of processors.
  • The only interaction is by message exchange ― no sharing of memory between processors (c.f. CSP).
• MPI introduced the idea of the communicator as the basic communication abstraction (c.f. channels). A communicator is shared by a group of processors.
  • Supports library development and collective operations.

Communicators

[Diagram: a communicator C shared by Proc 0, Proc 1, and Proc 2. Proc 0 calls C.Send(…, 2) to send to rank 2; Proc 2 calls C.Recv(…, 0) to receive from rank 0.]

Collective Communications

[Diagram: a communicator C shared by Proc 0, Proc 1, and Proc 2. All three processes make the same call, C.Bcast(…, 0), broadcasting from root rank 0.]

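In the mpiJava-style binding used by MPJ Express (introduced on the “Minimal mpiJava Program” slide below), the broadcast in the diagram might be coded as follows. A minimal sketch:

import mpi.*;

class BcastExample {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int me = MPI.COMM_WORLD.Rank();

        int[] data = new int[4];
        if (me == 0) {                       // root fills the buffer
            for (int i = 0; i < data.length; i++) data[i] = i * i;
        }

        // Every process in the communicator makes the same call; on
        // return, all of them hold the root's copy of the buffer.
        MPI.COMM_WORLD.Bcast(data, 0, data.length, MPI.INT, 0);

        System.out.println("rank " + me + ": " + java.util.Arrays.toString(data));
        MPI.Finalize();
    }
}
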
MPJ Background

• The Java Grande Message Passing Workgroup defined Java bindings for MPI in 1998.
• Previous efforts followed two approaches:
  • Pure Java approaches:
    - Remote Method Invocation (RMI),
    - Java Sockets.
  • Java Native Interface (JNI) approaches:
    - mpiJava – a Java wrapper to the platform's native MPI.

Minimal mpiJava Program
import mpi.*
class Hello {
static public void main(String[] args) {
MPI.Init(args) ;
int myrank = MPI.COMM_WORLD.Rank() ;
if(myrank == 0) {
char[] message = “Hello, there”.toCharArray() ;
MPI.COMM_WORLD.Send(message, 0, message.length, MPI.CHAR, 1, 99) ;
}
else {
char[] message = new char [20] ;
MPI.COMM_WORLD.Recv(message, 0, 20, MPI.CHAR, 0, 99) ;
System.out.println(“received:” + new String(message) + “:”) ;
}
MPI.Finalize() ;
}
}
11
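With MPJ Express, a program like this is compiled against the mpj.jar from the distribution and launched with the bundled mpjrun script, along the lines of “mpjrun.sh -np 2 Hello” (the exact flags depend on the version and on which communication device is configured).
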
MPJ Express Goals

• Modular design: unify the earlier approaches by pluggable communication devices.
• Support the high-level functionality of MPI-1.2 in pure Java code, as far as possible:
  • Point-to-point communications,
  • Collective communications,
  • Groups, communicators, and contexts,
  • Derived datatypes – buffering API.
• Thread safety ― support Java threads for parallelism within multicore nodes.
• Open source distribution: http://mpj-express.org

MPJ Design

[Layer diagram, top to bottom:]

    MPJ API
    MPJ collective communications (high level)
    MPJ point-to-point communications (base level)
    mpjdev (MPJ device level)
    xdev devices – niodev (Java NIO), gmdev (JNI), smpdev (threads API) – or native MPI via JNI
    Java Virtual Machine (JVM)
    Hardware (NIC, memory, etc.)

MPJ Demos

• There were a few existing demos for MPJ, inherited from the earlier mpiJava project, e.g.:
  • A 2d fluid flow example.
  • Some Monte Carlo simulations of 2d spin systems from condensed matter physics.
• These have nice GUIs, but aren’t very representative of the real codes used by computational scientists.

MPJ demo: CFD – inviscid flow

MPJ demo: Q-state Potts model

Gadget

Gadget-2

• Gadget-2 is a free-software, production code for cosmological N-body (and hydrodynamic) computations: http://www.mpa-garching.mpg.de/gadget
  • Written by Volker Springel, of the Max Planck Institute for Astrophysics, Garching.
• It is written in the C language – already parallelized using MPI.
• Versions have been used in various research papers in the astrophysics literature, including the Millennium Simulation.

Millennium Simulation

• See the paper Simulating the Joint Evolution of Quasars, Galaxies and their Large-Scale Distribution by Springel et al. on the Gadget home page.
• Follows the evolution of 10^10 dark matter “particles” from the early Universe (z = 127) to the current day.
• Performed on 512 nodes of an IBM p690.
  • Used 1 TB of distributed memory.
  • 350,000 CPU hours – 28 days elapsed time.
  • Floating point performance around 0.2 TFLOPS.
• Around 20 TB of data generated in total.

Dynamics in Gadget

• Gadget is “just” simulating the movement of (a lot of) representative particles under the influence of Newton’s law of gravity.
  • Plus some hydrodynamic forces, but these don’t affect the dominant dark matter.
• Co-moving coordinates take account of the expansion of the Universe according to General Relativity, but otherwise this is basically classical mechanics.
• A classical N-body problem.

Gadget Main Loop

• This is a slightly simplified view of the Gadget code:

    … Initialize …
    while (not done) {
        move_particles() ;              // update positions
        domain_Decomposition() ;
        compute_accelerations() ;
        advance_and_find_timesteps() ;  // update velocities
    }

• Most of the interesting work happens in compute_accelerations() (and domain_Decomposition() – see later).

Computing Forces

• The function compute_accelerations() must compute the forces experienced by all particles.
  • In particular it must compute the gravitational force.
• Because gravity is a long-range force, every particle affects every other. Naively, the total cost is O(N^2).
  • With N ≈ 10^10, this is quite infeasible.
• Need some kind of approximation.
  • Intuitive approach: when computing the force on particle i, group together particles in selected regions distant from i, and treat the groups as single particles, located at their centres of mass.

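To make the O(N^2) cost concrete: the naive computation is just an all-pairs double loop. A sketch in units where G = 1 (and without the softening length a real code would need to avoid the singularity at zero separation):

class NaiveGravity {
    // Accumulate the acceleration of every particle due to every other:
    // O(N^2) pair interactions.
    static void accelerations(double[][] pos, double[] mass, double[][] acc) {
        int n = mass.length;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double d = Math.sqrt(dx * dx + dy * dy + dz * dz);
                double a = mass[j] / (d * d * d);  // |a| = m_j / d^2, towards j
                acc[i][0] += a * dx;
                acc[i][1] += a * dy;
                acc[i][2] += a * dz;
            }
    }
}
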
Barnes Hut Tree

• First divide a cubical region of 3d space into 2^3 = 8 regions, halving each dimension.
• For every sub-region that contains any particles, divide again recursively to make an oct-tree, until “leaf” regions have at most one particle.

Two Dimensional Example

• Picture borrowed from www.cs.berkeley.edu/~demmel/cs267/lecture26/lecture26.html

Barnes Hut Force Computation

• To compute the force on a particle i, traverse the tree starting from the root:
  • if a node n is “distant from” i, just add the contribution to the force on i from the centre of mass of n – no need to visit the children of n;
  • if a node n is “close to” i, visit the children of n and recurse.
• Hinges on the definition of distant from/close to.
  • The basic idea is that a node representing some region of space is distant from a particle i if the angle θ it subtends at i is smaller than a threshold opening angle.

[Diagram: node n subtending opening angle θ at particle i.]

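A minimal, self-contained sketch of this walk. The Particle and Node classes and their fields are hypothetical illustrations, not Gadget's actual data structures, and the simple size/distance < θ opening test stands in for Gadget's more elaborate criteria:

class Particle {
    double x, y, z;          // position
    double ax, ay, az;       // accumulated acceleration
}

class Node {
    double size;             // side length of the region this node covers
    double mass;             // total mass of the particles below this node
    double cmx, cmy, cmz;    // centre of mass
    Node[] children;         // null for a leaf holding a single particle
}

class TreeWalk {
    static final double THETA = 0.5;     // threshold opening angle (radians)

    // Add to p's acceleration the contribution of the subtree rooted at n,
    // in units where G = 1.
    static void addForce(Particle p, Node n) {
        double dx = n.cmx - p.x, dy = n.cmy - p.y, dz = n.cmz - p.z;
        double d = Math.sqrt(dx * dx + dy * dy + dz * dz);
        if (d == 0.0) return;            // the leaf containing p itself

        if (n.children == null || n.size / d < THETA) {
            // "Distant from" i (or a leaf): treat the node as a single
            // particle located at its centre of mass.
            double a = n.mass / (d * d * d);
            p.ax += a * dx;
            p.ay += a * dy;
            p.az += a * dz;
        } else {
            // "Close to" i: open the node and recurse on its children.
            for (Node child : n.children)
                if (child != null) addForce(p, child);
        }
    }
}
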
Complexity

• On average, the number of nodes “opened” to compute the force on i is O(log N), as opposed to visiting O(N) particles in the naïve algorithm.
• A huge win when N ≈ 10^10.

Domain Decomposition

• Need to divide space and/or the particle set into domains, where each domain is handled by a single processor.
• The problem is that we can’t just divide space evenly, because some regions will have many more particles than others – poor load balancing.
• Conversely, we can’t just divide the particles evenly, because particles move throughout space, and we want to keep physically close particles on the same processor, as far as practical – a communication problem.

Peano-Hilbert Curve

• Warren and Salmon originally suggested using a space-filling curve:
• Picture borrowed from http://www.mpa-garching.mpg.de/gadget/gadget2-paper.pdf

Peano-Hilbert Key

• Gadget applies the recursion 20 times, logically dividing space into up to 2^20 × 2^20 × 2^20 cells along the Peano-Hilbert curve.
• Each cell can then be labelled by its location along the Peano-Hilbert curve – the 2^60 possible locations comfortably fit into a 64-bit word.

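To illustrate the arithmetic (3 × 20 bits = 60 bits, fitting in a Java long): Gadget's key follows the Peano-Hilbert ordering, whose bit manipulation is intricate, so this sketch uses the simpler Morton (Z-order) key – plain bit interleaving – as a stand-in:

class CellKey {
    static final int BITS = 20;   // recursion depth: 2^20 cells per dimension

    // Interleave the bits of three 20-bit cell coordinates into one
    // 60-bit key (Morton order, not the Peano-Hilbert order Gadget uses).
    static long mortonKey(int ix, int iy, int iz) {
        long key = 0;
        for (int b = 0; b < BITS; b++) {
            key |= ((long) ((ix >> b) & 1)) << (3 * b + 2);
            key |= ((long) ((iy >> b) & 1)) << (3 * b + 1);
            key |= ((long) ((iz >> b) & 1)) << (3 * b);
        }
        return key;               // uses bits 0..59 of the 64-bit word
    }
}
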
Decomposition based on P-H Curve

• Because the number of cells >> the number of particles, segments of the linear Peano-Hilbert curve are sparsely populated.
• Sort the particles by their Peano-Hilbert key, then divide them evenly into P domains.
  • Intuitively – stretch out the P-H curve with the particles dotted along it; segment it into P parts where each part has the same number of particles (see the sketch below).
• Characteristics of this decomposition:
  • Good load balancing.
  • Domains are simply connected and quite “compact” in real space, because particles that are close along the P-H curve are close in real space (the converse is often but not always true).
  • Domains have a relatively simple mapping to BH oct-tree nodes.

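A serial sketch of the sort-and-split step (in the real code the sort is distributed across processors and, as noted on a later slide, only approximated):

import java.util.Arrays;

class Decompose {
    // keys[i] is the 64-bit Peano-Hilbert key of particle i. Returns the
    // P - 1 splitter keys: domain d holds the particles whose keys fall
    // between splitters[d - 1] and splitters[d].
    static long[] domainSplitters(long[] keys, int P) {
        long[] sorted = keys.clone();
        Arrays.sort(sorted);      // order the particles along the curve

        long[] splitters = new long[P - 1];
        for (int d = 1; d < P; d++)
            splitters[d - 1] = sorted[(int) ((long) d * sorted.length / P)];
        return splitters;
    }
}
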
Distribution of BH Tree in Gadget

• Picture borrowed from the Gadget-2 paper (ibid.).

Distributed Representation of Tree

• Every processor holds a copy of the root nodes, and a copy of all child nodes down to the point where all the particles in a node are held on a single remote processor. Such remotely held nodes are called pseudo-particles.
• To compute the force on a single local target particle, traverse the tree from the root as usual, and accumulate the force contributions from locally held particles.
• Build an export list containing the target particle and the hosts of the pseudo-particles encountered in the walk.

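A sketch of the export-list bookkeeping (the names here are hypothetical illustrations, not Gadget's): for each local target particle, record which remote ranks own the pseudo-particle nodes met during its walk.

import java.util.*;

class ExportList {
    // exports.get(rank) = indices of local target particles that need
    // force contributions computed on that remote rank.
    Map<Integer, List<Integer>> exports = new HashMap<>();

    // Called whenever the tree walk for a target particle reaches a
    // pseudo-particle node owned by ownerRank.
    void noteEncounter(int particleIndex, int ownerRank) {
        exports.computeIfAbsent(ownerRank, r -> new ArrayList<>())
               .add(particleIndex);
    }
}
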
Communication

• After the computation for all locally held target particles, process the export list and send the list of local target particles to all hosts that own pseudo-particle nodes needed for those particles.
• All processors do another tree walk to compute their contributions to remotely owned (from their point of view) target particles.
• These contributions are returned to the original processor, and added into the accelerations of the target particles.

Other Communications

• Notably:
  • The domain decomposition itself requires (in principle) a distributed sort of the particle list.
  • Gadget only approximates this sort, but it still needs fairly intricate communications.
• In Gadget all communication uses the standard MPI library. It makes fairly extensive use of the collective communication routines.

Java Gadget

Timeline

• MPJ Express was released in 2005.
• We realized we needed a realistic exemplar code: both as a demo, and to drive further improvements of our software.
• First thoughts that an N-body simulation code would make sense – Summer 2005.
• Learned about Gadget in December that year.
• The effort to make a Java version of Gadget started seriously in early February 2006 – expecting many months of work.
• Non-trivial simulations were running in March.

Gadget Code Statistics

• The public distribution is around 17,000 lines of C, distributed over about 30 source files.
• Dependencies on:
  • the MPI library, for parallelization;
  • the GNU Scientific Library (but only a handful of functions);
  • FFTW – a library for parallel Fourier transforms.
• Comes with a set of initial conditions files that we can use to test our code, including “colliding galaxies” and “cluster formation”.

Translation of Gadget Code

• Manually translated, but in the first cut we deliberately kept the Java code as similar to the C as possible.
  • e.g. nearly all of Java Gadget is currently implemented as static methods on one enormous class called Main.
  • Small supporting classes correspond to the structs of the C code (see the sketch after this list).
• Currently not supporting a number of features, including:
  • the PMTree algorithm (an optimization using Fourier transforms on a mesh to compute the long-range part of the gravitational force),
  • multiple initial conditions files,
  • COMPUTE_POTENTIAL_ENERGY, ISOTHERM_EQS, FLEXSTEPS, OUTPUTPOTENTIAL, FORCETEST, MAKEGLASS, PSEUDOSYMMETRIC, …

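A hypothetical illustration of the translation style described above (fields simplified, not Gadget's exact layout):

// A C struct such as
//
//     struct particle_data {
//         float Pos[3];
//         float Vel[3];
//         float Mass;
//     };
//
// becomes a small supporting class, with the C array of structs becoming
// an array of objects hanging off the one big Main class:
class ParticleData {
    float[] pos = new float[3];
    float[] vel = new float[3];
    float mass;
}

class Main {
    static ParticleData[] P;   // nearly everything is a static member of Main
}
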
Handling Dependencies

• Replace MPI calls with MPJ calls (!)
• The use of FFTW is avoided by disabling the PMTree algorithm (but this may slow down some simulations a lot).
• The few GNU Scientific Library functions that were needed (numeric quadrature, etc.) were hand-translated to Java.

Test Cases

• Restrictions aside, we successfully ran the Colliding Galaxies and Cluster Formation example simulations on varying numbers of processors.
  • These use pure dark matter – the hydrodynamics code has not yet been tested.
• The results are indistinguishable from running the original Gadget.

Colliding Galaxies (Java Gadget)

Cluster Formation (Java Gadget)

Optimizing Java Gadget

• The initial Java version of Gadget was about 3 times slower than the C version.
• Optimizations:
  • Replace communication of Object arrays (corresponding to C structs) with communication of their component primitive elements (floats, ints, etc.).
  • Still wasn’t enough – the final version actually expunges large Object arrays from the program, flattening all struct arrays into arrays of primitive types (see later comments).
  • Extended the MPJ Express API to manipulate Java ByteBuffers directly.

Flattening Object Arrays

• To preserve memory locality, we decided to “flatten” large arrays of small objects (ParticleData) – replace an array of objects with an array of doubles containing the objects’ component fields, for example (see the sketch below).
• This makes the code less readable and maintainable than the original C code – clearly not acceptable. But we wanted speed.
• Need a much better approach here. One possibility is some kind of macro pre-processing (it would be better if the Java language had structs…).

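A hypothetical sketch of the transformation (not Java Gadget's actual layout). An array of small objects scatters its fields across the heap, so a loop over consecutive particles has poor cache behaviour; flattened into primitive arrays, the fields of consecutive particles are contiguous in memory:

class Particles {
    int n;            // number of particles
    double[] pos;     // pos[3*i .. 3*i+2] = x, y, z of particle i
    double[] mass;    // mass[i] = mass of particle i

    Particles(int n) {
        this.n = n;
        pos = new double[3 * n];
        mass = new double[n];
    }

    // Field access becomes index arithmetic:
    double x(int i) { return pos[3 * i]; }

    // versus the original object form:
    //     ParticleData[] P;  ...  P[i].pos[0]
}
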
Benchmarks

• Graphs are taken from Aamir Shafi’s thesis – you will find many more results there.
• Some benchmarks are on up to 8 processors of the Starbug cluster in the Portsmouth Distributed Systems Group.
• Other benchmarks are on up to 32 processors of the NW-GRID cluster at Daresbury Laboratory.

Starbug: Colliding Galaxies with 60,000 Particles

NW-Grid: Structure Formation with Two Million Particles

Conclusions

• It is possible to do large-scale numerical simulations in Java.
• Speed is OK.
• Memory consumption is still an issue (Java Gadget needs perhaps three times more memory than C Gadget).
• Made various improvements to the MPJ Express implementation and API along the way.
• As things stand, the Java Gadget code is not very maintainable – C-like, and then optimized some.
• Needs more work to learn how to write maintainable Java code that is also fast – e.g. how to preserve memory locality in arrays of objects.