“ScicomP 9”
Bologna, March 23 – 26, 2004
Using LAPI and MPI-2 in an N-body cosmological code on IBM SP
M. Comparato, U. Becciani, C. Gheller, V. Antonuccio
INAF Osservatorio Astrofisico di Catania
• The N-Body project at OACT
• The FLY code
• Performance analysis with LAPI and MPI-2
• Questions
FLY Project
• People: V. Antonuccio, U. Becciani, M. Comparato, D. Ferro
• Funds: included in the project “Problematiche Astrofisiche Attuali ed Alta Formazione nel Campo del Supercalcolo” funded by MIUR with more than 500,000 Euros + 170,000 Euros
• INAF provides grants on MPP systems at CINECA
• Resources: IBM SP, SGI Origin systems, Cray T3E
INAF Astrophysical Observatory of Catania
IBM SP POWER3
24 Processors, 222 MHz
Global RAM Memory: 48 GB
Disk Space: 254 GB (72.8 GB HD per node + 36.2 GB HD cws)
Network Topology: SPS scalable Omega switch and Fast Ethernet node interconnection
Bandwidth: 300 MB/s peak bi-directional transfer rate
Programming Languages: C, C++, Fortran 90
Parallel paradigms: OpenMP, MPI, LAPI
NEW SYSTEM: IBM POWER4 P650
8 Processors, 1.1 GHz
Global RAM Memory: 16 GB (2 GB per processor)
Disk Array: 254 GB
L2 cache: 1.5 MB
L3 cache: 128 MB
Gravitational N-body problem
The cosmological simulation
The N-body technique allows one to perform cosmological simulations that describe the evolution of the universe.
With this method matter is represented as a set of particles:
• each particle is characterized by mass, position and velocity,
• the only force considered is gravity,
• the evolution of the system is obtained by numerically integrating the equations of motion over a suitable time interval.
Gravitational N-body problem
The cosmological simulation
• The direct-interaction (P-P) method is conceptually the simplest…
• … but it scales as O(N^2), which makes it impossible to run simulations with more than N ≥ 10^5 particles (a minimal sketch of the method follows this list).
• To overcome this problem, tree- and mesh-based algorithms have been developed; these scale as N log N and N, respectively.
• But only supercomputers, and parallel codes, allow the user to run simulations with N ≥ 10^7 particles.
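To make the O(N^2) cost concrete, here is a minimal direct-summation (P-P) sketch in C. It is illustrative only: FLY itself is Fortran 90, and the softening parameter eps2 is an assumption, not something taken from the slides.

    #include <math.h>

    #define G 1.0  /* gravitational constant in code units */

    /* Direct particle-particle accelerations: every pair is visited,
       hence the O(N^2) cost that limits the P-P method to small N. */
    void pp_accelerations(int n, const double x[][3], const double m[],
                          double a[][3], double eps2)
    {
        for (int i = 0; i < n; i++) {
            a[i][0] = a[i][1] = a[i][2] = 0.0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double dx = x[j][0] - x[i][0];
                double dy = x[j][1] - x[i][1];
                double dz = x[j][2] - x[i][2];
                double r2 = dx*dx + dy*dy + dz*dz + eps2;  /* softened distance */
                double inv_r3 = 1.0 / (r2 * sqrt(r2));
                a[i][0] += G * m[j] * dx * inv_r3;
                a[i][1] += G * m[j] * dy * inv_r3;
                a[i][2] += G * m[j] * dz * inv_r3;
            }
        }
    }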
FLY: Parallel tree N-body code for Cosmological Applications
• Based on the Barnes-Hut algorithm (J. Barnes & P. Hut, Nature, 324, 1986)
• Fortran 90 parallel code
• High-performance code for MPP/SMP architectures using a one-sided communication paradigm: SHMEM – LAPI
• It runs on Cray T3E systems, SGI ORIGIN, and IBM SP.
• Typical simulations require 350 MB of RAM for 1 million particles
Gravitational N-body problem
The Barnes-Hut algorithm
The particles evolve according to the laws of Newtonian physics:
\[
\frac{d^2 \mathbf{x}_i}{dt^2} \;=\; \sum_{j \ne i} \frac{G\, m_j\, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3},
\qquad \mathbf{d}_{ij} = \mathbf{x}_i - \mathbf{x}_j
\]
Considering a region \(\Omega\), the force component on the i-th particle may be computed as:
\[
\sum_{j \in \Omega} \frac{G\, m_j\, \mathbf{d}_{ij}}{|\mathbf{d}_{ij}|^3}
\;\simeq\; \frac{G\, M\, \mathbf{d}_{i,\mathrm{cm}}}{|\mathbf{d}_{i,\mathrm{cm}}|^3}
\;+\; \text{higher-order multipole terms}
\]
where M is the total mass in the region and d_{i,cm} is defined analogously with respect to its centre of mass.
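In code, the leading (centre-of-mass) term of this approximation is a single monopole contribution. A minimal C sketch with illustrative names follows; it is not FLY's actual routine, and the higher-order terms (up to the quadrupole in FLY) are omitted.

    #include <math.h>

    #define G 1.0  /* gravitational constant in code units */

    /* Add to a[] the monopole contribution of a region with total mass M
       and centre of mass cm[] acting on a particle at xi[]. */
    void add_monopole(const double xi[3], const double cm[3], double M,
                      double a[3])
    {
        double dx = cm[0] - xi[0];
        double dy = cm[1] - xi[1];
        double dz = cm[2] - xi[2];
        double r2 = dx*dx + dy*dy + dz*dz;
        double inv_r3 = 1.0 / (r2 * sqrt(r2));
        a[0] += G * M * dx * inv_r3;
        a[1] += G * M * dy * inv_r3;
        a[2] += G * M * dz * inv_r3;
    }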
Gravitational N-body problem
Tree Formation
[Figure: 2D domain decomposition and tree structure, starting from the root cell. The splitting of each sub-domain is carried out until each box contains only one body (a leaf).]
Gravitational N-body problem
Force Computation
Two phases:
• tree walk procedure
• force computation
The force on any particle is computed as the sum of the forces from the nearby particles plus the forces from the distant cells, whose mass distributions are approximated by multipole series truncated, typically, at the quadrupole order.
If Cellsize / d_{i,cm} ≤ θ, mark the cell (its multipole expansion is used instead of opening it); a sketch of this test follows.
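A minimal sketch of this opening test in C, assuming an illustrative cell record with a size and a centre of mass; FLY's real data structures and the exact form of the criterion may differ.

    #include <math.h>

    /* Minimal cell record: geometric size and centre of mass (illustrative). */
    typedef struct { double size; double cm[3]; double mass; } Cell;

    /* Barnes-Hut acceptance test: if the cell, seen from particle i at xi,
       is small enough (size / distance <= theta), it is marked and its
       multipole expansion is used instead of opening the cell. */
    int cell_is_accepted(const Cell *c, const double xi[3], double theta)
    {
        double dx = c->cm[0] - xi[0];
        double dy = c->cm[1] - xi[1];
        double dz = c->cm[2] - xi[2];
        double d  = sqrt(dx*dx + dy*dy + dz*dz);
        return c->size <= theta * d;   /* 1 = add cell to the interaction list */
    }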
FLY Block Diagram
SYSTEM INITIALIZATION
TREE FORMATION and BARRIER
FORCE COMPUTATION (TREE INSPECTION, ACC. COMPONENTS) and BARRIER
UPDATE POSITIONS and BARRIER
STOP
Parallel implementation of FLY
Data distribution
Two main data structures:
• particles
• tree
Particles are distributed in blocks, such that each processor has the same number of contiguous bodies. E.g. with 4 processors, bodies 1, 2, 3, … are assigned in four contiguous blocks, one per processor.
The tree structure is distributed in a cyclic way, such that each processor has the same number of cells. E.g. with 4 processors, cells 1, 5, 9, 13, … go to the first processor, cells 2, 6, 10, 14, … to the second, and so on.
A sketch of the two ownership rules follows.
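A small sketch of the two distribution rules in C (hypothetical helper names; FLY is Fortran 90 and uses its own indexing and layout).

    /* Owner of body i (0-based) under the block distribution:
       each of nprocs PEs holds n/nprocs contiguous bodies.
       Assumes, for simplicity, that n is divisible by nprocs. */
    int body_owner(long i, long n, int nprocs)
    {
        long block = n / nprocs;
        return (int)(i / block);
    }

    /* Owner of tree cell c (0-based) under the cyclic distribution:
       cell 0 -> PE 0, cell 1 -> PE 1, ..., wrapping around. */
    int cell_owner(long c, int nprocs)
    {
        return (int)(c % nprocs);
    }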
Parallel implementation of FLY
Work distribution
• Each processor calculates the force for its local particles.
• To do that, the whole tree structure (which is distributed among the processors) must be accessed asynchronously (one-sided communication is required).
• This leads to a huge communication overhead.
FLY: Tips and tricks
In order to reduce the problems related to communication overhead, we have implemented several “tricks”:
• Dynamical Load Balancing: processors help each other
• Grouping: close particles share the same interaction with far distributions of mass
• Data Buffering
FLY: Data buffering
Data buffering: free RAM segments are dynamically allocated to store remote data (tree cell properties and remote bodies) already accessed during the tree walk procedure; a minimal sketch follows.
Performance improvement: 16 million bodies on a Cray T3E, 32 PEs, 156 MB for each PE
Without buffering: each PE executes 700 GET operations for each local body
Using buffering: each PE executes ONLY 8 – 25 GET operations for each local body
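A minimal sketch of the buffering idea in C. The function fetch_remote_cell() is a hypothetical stand-in for the one-sided GET; cells already fetched during the walk are kept in a local buffer and reused, so repeated GETs for the same cell are avoided.

    #include <stdlib.h>

    #define CELL_WORDS 16          /* illustrative size of one cell record    */
    #define NOT_CACHED -1L

    /* Hypothetical routine performing the one-sided GET of a remote cell. */
    extern void fetch_remote_cell(long cell_id, double *dest);

    static long   *cached_id;      /* which cell id sits in each buffer slot  */
    static double *buffer;         /* free RAM segment used as a cache        */
    static long    nslots;

    void buffer_init(long slots)
    {
        nslots    = slots;
        cached_id = malloc(slots * sizeof(long));
        buffer    = malloc(slots * CELL_WORDS * sizeof(double));
        for (long s = 0; s < slots; s++) cached_id[s] = NOT_CACHED;
    }

    /* Return a pointer to the cell data, issuing the remote GET only the
       first time the cell is requested (simple direct-mapped cache). */
    const double *get_cell(long cell_id)
    {
        long slot = cell_id % nslots;
        if (cached_id[slot] != cell_id) {
            fetch_remote_cell(cell_id, &buffer[slot * CELL_WORDS]);
            cached_id[slot] = cell_id;
        }
        return &buffer[slot * CELL_WORDS];
    }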
Why FLYing from LAPI to MPI-2
• LAPI is a proprietary (IBM) parallel programming library
• Implementing FLY using MPI-2 improves the portability of our code
• The RMA calls introduced in MPI-2 make the porting simple, since there is a direct correspondence between the basic functions (see the sketch below):
  lapi_get(…)  →  mpi_get(…)
  lapi_put(…)  →  mpi_put(…)
• MPI-2 doesn’t have an atomic fetch-and-increment call
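For illustration, a sketch of what the correspondence looks like on the MPI-2 side, assuming the tree arrays are exposed through a window win created elsewhere with MPI_Win_create. FLY itself is Fortran 90; the C fragment below only shows the calls involved in replacing a LAPI get with a passive-target MPI_Get.

    #include <mpi.h>

    /* Fetch count doubles starting at remote offset disp from the window
       owned by target_rank (passive-target one-sided communication). */
    void get_remote(double *local, int count, int target_rank,
                    MPI_Aint disp, MPI_Win win)
    {
        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Get(local, count, MPI_DOUBLE,
                target_rank, disp, count, MPI_DOUBLE, win);
        MPI_Win_unlock(target_rank, win);   /* data is valid only after unlock */
    }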
MPI-2 synchronization
• MPI_Win_fence mechanism
  – when we can separate non-RMA access from RMA access
  – when all the processors access remote data at the same time
• MPI_Win_lock and MPI_Win_unlock mechanism
  – when we need the data just after the call
  – when only one processor accesses remote data
A sketch of a fence epoch follows; the lock/unlock pattern is the one shown in the previous sketch.
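An illustrative C sketch of a fence epoch for the first case; the window win and the transfer itself are placeholders, not FLY code.

    #include <mpi.h>

    /* Fence epoch: collective over all processes attached to the window,
       so it fits phases where every PE accesses remote data at the same
       time and non-RMA work can be kept outside the epoch. */
    void fence_epoch(double *buf, int n, int target, MPI_Win win)
    {
        MPI_Win_fence(0, win);                        /* open the access epoch    */
        MPI_Get(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                        /* close epoch: buf is valid */
    }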
MPI-2 synchronization
• The FLY algorithm requires continuous asynchronous access to remote data
• passive target synchronization is needed
• we have to use the lock/unlock mechanism.
MPI-2 Drawback
Unfortunately, lock and unlock are usually not implemented efficiently (or they are not even implemented at all):
• LAM: not implemented
• MPICH: not implemented
• MPICH2: I am waiting for the next release
• IBM AIX: poor performance
• IBM Turbo MPI2: testing phase
FLY 3.3
Problem: poor performance of the lock/unlock calls
Workaround: rewrite portions of the code (where possible) to separate non-RMA accesses from RMA accesses, in order to use the fence calls
Result: the MPI-2 version runs 2 times faster
Why not port these changes back to the LAPI version? FLY 3.3 was born.
FLY 3.2 VS FLY 3.3
2M particles test
Static part: tree generation, cell properties, …
Dynamic part: interaction list, force computation, …
[Figure: elapsed time in seconds vs. number of processors (1 to 64) for the static and dynamic parts, comparing FLY_3.2 and FLY_3.3.]
FLY 3.2 VS FLY 3.3
2M particles test
Total simulation time
[Figure: total simulation time in seconds vs. number of processors (1 to 64), comparing FLY_3.2 and FLY_3.3.]
Scalability: timing normalized on the number of processors, (t_n × n) / t_1
[Figure: (t_n × n) / t_1 vs. number of processors, comparing FLY_3.2 and FLY_3.3.]
FLY 3.3 VS FLY MPI-2
2M particles test
Static part: tree generation, cell properties, …
Dynamic part: interaction list, force computation, …
[Figure: elapsed time in seconds vs. number of processors for the static and dynamic parts, comparing FLY_3.3 and FLY_MPI.]
Conclusions
Present:
• Lower-performance MPI-2 version of FLY (for now)
• More scalable LAPI version of FLY
Future:
• TurboMPI2
• MPICH2 (porting to Linux clusters)
• OO interface to hydrodynamic codes (FLASH)