2 - Princeton CS

Download Report

Transcript 2 - Princeton CS

Module on Computational Astrophysics
Jim Stone
Department of Astrophysical Sciences
125 Peyton Hall : ph. 258-3815:
[email protected]
www.astro.princeton.edu/~jstone
Lecture 1: Introduction to astrophysics, mathematics, and methods
Lecture 2: Optimization, parallelization, modern methods
Lecture 3: Particle-mesh methods
Lecture 4: Particle-based hydro methods, future directions
Computational Tasks
1. Setup distribution of N particles
2. Compute forces between particles
3. Evolve positions using ODE solver
4. Display/analyze results
Leap-frog scheme
Define positions (x) and forces (F) at time level n
velocities (v)
at time level n+1/2
Then, for ith particle
n-1
x, F
n-1/2
v
n
x, F
n+1/2
v
n+1
x, F
time
To start integration, need initial x and V at two separate time levels.
Specify x0 and v0 and then integrate V to Dt/2 using high-order scheme
Accuracy of leap-frog scheme
Can show the truncation error in leap-frog is second order in Dt
Evolution eqns:
Replace Vn-1/2 in second equation using first
Substitute this back into first equation
Rearrange
This is central difference formula for F=ma.
Let X(t) be the “true” (analytic) solution.
Then
Use a Taylor expansion to compute Xn+1 and Xn-1
Thus
Substitute these back into eq. 1
Truncation error O(Dt2)
EQ. 1
Truncation versus round-off error
Note the error we have just derived is truncation error
Unavoidable result of approximating solution to some order in x or t
Completely unrelated to round-off error, which results from
representing the continuous set of real numbers with a finite number
of bits.
Truncation error can be reduced by using smaller step Dt, or higherorder algorithm
Round-off error can be reduced by using higher precision (64 bit
rather than 32, etc.), and by ordering operations carefully.
In general, truncation error is much larger than round-off error
Stability of leap-frog scheme
Easiest to illustrate with an example. Suppose the force is given by a
harmonic oscillator, that is:
Then “true” (analytic) solution is
Substitute force law into leap-frog FDE for F=ma
Look for oscillatory solutions of the form x = x0eiwt
giving
This is good! Leap-frog (correctly) gets oscillatory solutions, but
at a modified frequency
Note this gives correct solution (W) as Dt --> 0
For WDt > 2, frequency becomes complex.
Real part of w’ gives oscillatory solution, imaginary part gives
exponentially growing (unstable) solution.
So stability limit is Dt < 2/W; or Dt < 2/[(dF/dx)/m]1/2 in general
Above is a simple example of a von-Neumann stability analysis
Consistency of leap-frog scheme
Leap-frog is consistent in the sense that as Dt --> 0, the
difference equations converge to the differential equations
Leap-frog is also a symplectic method (time symmetric).
Scheme has same accuracy for Dt negative.
n-1
x, F
n-1/2
v
n
x, F
n+1/2
v
n+1
x, F
time
Efficiency of leap-frog scheme
Leap-frog is extremely efficient in terms of computational cost
(only 12 flops per particle excluding force evaluation)
Also extremely efficient in terms of memory storage (does not
require storing multiple time levels).
All the work (and memory) is in force evaluation: 10N flops per
particle for direct summation
To update all particle positions in one second on a 1 Gflop
processor requires N < 104
Extra efficiency can be gained by using different timesteps for
each particle (more later).
4. Display and Analysis
Display requires plotting particles as spatial points in 3D.
Sophisticated packages render different particles with different
colors, allow perspective to be rotated, fly-throughs, etc.
QuickTime™ and a
YUV420 codec decompressor
are needed to see this picture.
Quantitative analysis requires following diagnostics such as E, L,
etc. for all particles.
Need to store 6 words of data for each particle (x and v), in
enough data dumps to give good time resolution of dynamics.
Biggest simulations follow 109 particles, 1000 data dumps
requires 48 TB (64 bit words).
Putting it all together.
Should be able to put together a direct N-body code using leap-frog.
•
•
•
•
•
•
•
Initialize x, V for N particles in region of size R
Use constant time step which is << crossing time
Use direct summation to compute F, with softened potential
Update x,V with leap-frog for a few crossing times
Output x for each particle frequently compared to crossing time
Use gnuplot, etc. to plot motion of particles
Experiment with different N
Or, you can just download the
code from the web…
Large collection of N-body codes available from Peter Teuben at
University of Maryland
http://bima.astro.umd.edu/nemo
This include Sverre Aarseth’s original NBODY0 code (first
widely used direct N-body (PP) code used in astrophysics)
Also see http://www.artcompsci.org/vol_1/v1_web/v1_web.html
/*
* LEAPINT.C: program to integrate hamiltonian system using leapfrog.
*/
#include <math.h>
#define MAXPNT
100
void leapstep();
void accel();
void printstate();
/* maximum number of points */
/* routine to take one step */
/* accel. for harmonic osc. */
/* print out system state
*/
void main(argc, argv)
int argc;
char *argv[];
{
int n, mstep, nout, nstep;
double x[MAXPNT], v[MAXPNT], tnow, dt;
/* first, set up initial conditions */
n = 1;
x[0] = 1.0;
v[0] = 0.0;
tnow = 0.0;
/*
/*
/*
/*
set
set
set
set
number of points
initial position
initial velocity
initial time
*/
*/
*/
*/
/* next, set integration parameters */
mstep = 256;
nout = 4;
dt = 1.0 / 32.0;
/* number of steps to take */
/* steps between outputs
*/
/* timestep for integration */
/* now, loop performing integration */
}
for (nstep = 0; nstep < mstep; nstep++) {
if (nstep % nout == 0)
printstate(x, v, n, tnow);
leapstep(x, v, n, dt);
tnow = tnow + dt;
}
if (mstep % nout == 0)
printstate(x, v, n, tnow);
/* loop mstep times in all */
/* if time to output state */
/* then call output routine */
/* take integration step
*/
/* and update value of time */
/* if last output wanted
/* then output last step
*/
*/
/*
* LEAPSTEP: take one step using the leap-from integrator, formulated
* as a mapping from t to t + dt. WARNING: this integrator is not
* accurate unless the timestep dt is fixed from one call to another.
*/
void leapstep(x, v, n, dt)
double x[];
double v[];
int n;
double dt;
{
int i;
double a[MAXPNT];
accel(a, x, n);
for (i = 0; i < n; i++)
v[i] = v[i] + 0.5 * dt * a[i];
for (i = 0; i < n; i++)
x[i] = x[i] + dt * v[i];
accel(a, x, n);
for (i = 0; i < n; i++)
v[i] = v[i] + 0.5 * dt * a[i];
}
/*
/*
/*
/*
positions of all points
velocities of all points
number of points
timestep for integration
*/
*/
*/
*/
/* call acceleration code
*/
/* loop over all points... */
/* advance vel by half-step */
/* loop over points again...*/
/* advance pos by full-step */
/* call acceleration code
*/
/* loop over all points... */
/* and complete vel. step
*/
/*
* ACCEL: compute accelerations for harmonic oscillator(s).
*/
void accel(a, x, n)
double a[];
double x[];
int n;
{
int i;
for (i = 0; i < n; i++)
a[i] = - x[i];
/* accelerations of points
/* positions of points
/* number of points
/* loop over all points... */
/* use linear force law
*/
*/
*/
*/
}
/*
* PRINTSTATE: output system state variables.
*/
void printstate(x, v, n, tnow)
double x[];
double v[];
int n;
double tnow;
{
int i;
/*
/*
/*
/*
positions of all points
velocities of all points
number of points
current value of time
for (i = 0; i < n; i++)
/* loop over all points...
printf("%8.4f%4d%12.6f%12.6f\n", tnow, i, x[i], v[i]);
}
*/
*/
*/
*/
*/
Variable time steps with leap-frog
For efficiency, need to take variable time steps (evolve particles
at center of cluster on smaller timestep than particles at edge).
QuickTime™ and a
YUV420 codec decompressor
are needed to see this picture.
Variable time steps with leap-frog
However, this destroys symmetry of leap-frog; greatly increases
truncation error.
n-1
x, F
n-1/2
v
n
x, F
n+1/2
v
n+1
x, F
time
But, variable timestep leap-frog can be symmetrized
Hut, Makino, & McMillan 1995
Force evaluation with variable time steps.
Now particle positions are known at different time levels.
Greatly complicates force calculation. Must compute derivatives
of force wrt time, and use Taylor expansion to compute total
force on particle at current position.
The Good: Allows higher-order (Hermite) integration methods.
The Bad: This just makes force evaluation even more expensive!
The Ugly: Direct N-body must be optimized if we are to go
beyond 104 particles.
Solving the force problem with hardware.
Special purpose
hardware to
compute force:
Jun Makino, U. Tokyo
GRAPE-6
• The 6th generation of GRAPE
(Gravity Pipe) Project
• Gravity calculation
with 31 Gflops/chip
• 32 chips / board ⇒ 0.99 Tflops/board
• 64 boards of full system is installed in University of Tokyo
⇒ 63 Tflops
• On each board, all particle data are set onto SRAM
memory, and each target particle data is injected into the
pipeline, then acceleration data is calculated
• Gordon Bell Prize at SC2000, SC2001
(Prof. Makino, U. Tokyo)
also nominated at SC2002
Do we really need to compute force from every star for distant objects?
Andromeda – 2 million light years away
Solving the force problem with software -- tree codes
Distance = 25 times size
Organize particles into a tree. In Barnes-Hut algorithm, use a
quadtree in 2D
In 3D, Barnes-Hut uses an octree
If angle subtended by the particles contained in any node of tree is
smaller than some criterion, then treat all particles as one.
Results in an Nlog(N) algorithm.
Alternative to Barnes-Hut is KD tree.
• KD tree is binary - extremely
efficient
• Requires N to be power of 2
• Nnodes = 2N-1
Parallelizing tree code.
Best strategy is to distribute particles across processors. That way,
work of computing forces and integration is distributed across procs.
Challenge is load balancing
– Equal particles  equal work.
• Solution: Assign costs to particles based on the work they do
– Work unknown and changes with time-steps
• Insight : System evolves slowly
• Solution: Count work per particle, and use as cost for next
time-step.
A Partitioning Approach: ORB
• Orthogonal Recursive Bisection:
– Recursively bisect space into subspaces with equal
work
• Work is associated with bodies, as before
– Continue until one partition per processor
•
High overhead for large no. of processors
Another Approach: Costzones
• Insight: Tree already contains an encoding of
spatial locality.
P1
(a) ORB
•
P2
P3
P4
P5 P6 P 7
(b) Costzones
Costzones is low-overhead and very easy to program
P8
Space Filling Curves
Morton Order
Peano-Hilbert Order