Distributed Data Assimilation - A case study Aad J. van der Steen


Distributed Data Assimilation - A case study
Aad J. van der Steen
High Performance Computing Group
Utrecht University
1. The application
2. Parallel implementation
3. Model and experiments
4. Perspectives for distributed implementation
The application - 1
The application, Ensflow, assimilates ocean flow data into a
stochastic ocean flow model.
Many realisations of the model with randomly distributed
parameters, forming an ensemble, are run.
Periodically these runs are integrated with satellite data and an
optimal ensemble average is computed.
The sequence of ensemble averages over time describes the
development of the ocean's currents that best fits the observations.
The application - 2
The region of interest is the southern tip of Africa:
Data from the TOPEX/Nimbus satellite are used for the assimilation.
The purpose is to understand the evolution of streams and eddies in this region.
The application - 3
Because of the stochastic nature of the model, many realisations
of the model with slightly different parameter values are to be
evolved.
The observations of the top layer values are interpolated to
a 251x151 grid.
The ensemble members are allowed to develop independently
for some time and are then combined to find the ensemble mean

    ensemble mean = F + R^T B

with F the best estimate for the model evolution without
observations,
R = matrix of field measurement covariances,
B = matrix of representer coefficients.
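As a concrete illustration of this combination step, the NumPy sketch below corrects the ensemble mean with a representer-style term of the form F + R^T B. All sizes, the observation operator, the observation-error variance and the least-squares solve for the coefficients are assumptions made for the example; they are not taken from the Ensflow code.

```python
import numpy as np

# Toy sizes only (assumed for illustration; not the Ensflow configuration).
n_grid, n_ens, n_obs = 251 * 151, 60, 100
rng = np.random.default_rng(0)

ensemble = rng.standard_normal((n_ens, n_grid))      # ensemble of model states
obs = rng.standard_normal(n_obs)                     # stand-in satellite observations
H = np.zeros((n_obs, n_grid))                        # observation operator: samples grid points
H[np.arange(n_obs), rng.integers(0, n_grid, size=n_obs)] = 1.0
obs_err = 0.1                                        # assumed observation-error variance

F = ensemble.mean(axis=0)                            # best estimate without observations
A = ensemble - F                                     # ensemble anomalies
HA = A @ H.T                                         # anomalies at the measurement locations

R = A.T @ HA / (n_ens - 1)                           # field-measurement covariances (n_grid, n_obs)
C = HA.T @ HA / (n_ens - 1) + obs_err * np.eye(n_obs)  # measurement covariances + noise
B = np.linalg.solve(C, obs - H @ F)                  # representer coefficients

analysis_mean = F + R @ B                            # the F + R^T B combination from the slide
```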
The application - 4
The model performs two computationally intensive tasks:
1. Generation of the ensemble members.
2. The computational flow part that describes the evolution
of the stream function.
Every 240 hourly timesteps an analysis of the ensemble is
done to obtain the optimal estimate for the past period.
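The cycle structure implied by these two tasks can be summarised in a short driver sketch. This is a minimal illustration, not the Ensflow code: the three routines and the 480-step (20-day) run length are assumed placeholders; only the ensemble size of 60 and the 240-step analysis interval come from these slides.

```python
# Schematic driver for the two compute-intensive phases. The ensemble size
# (60) and the 240-step analysis interval come from the slides; the 20-day
# run length (480 hourly steps) and the three routines are assumptions.
N_ENSEMBLE = 60
N_STEPS = 480
ANALYSIS_INTERVAL = 240

def run_cycle(generate_member, step_flow, analyse, observations):
    # Task 1: generation of the ensemble members.
    members = [generate_member(i) for i in range(N_ENSEMBLE)]
    estimates = []
    for step in range(1, N_STEPS + 1):
        # Task 2: evolve the stream function of every member independently.
        members = [step_flow(m) for m in members]
        if step % ANALYSIS_INTERVAL == 0:
            # Combine the ensemble with the observations of the past period
            # to obtain the optimal estimate.
            optimal = analyse(members, observations[step])
            estimates.append(optimal)
    return estimates
```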
Parallel implementation - 1
1. Ensemble members are distributed evenly over the processors.
2. Data of ensemble members are independent and are local to
the processors.
3. Only in the analysis phase, to determine the globally optimal
field, do data have to be exchanged (using MPI).
4. The optimal global field is distributed and a new cycle is
started (see the sketch below).
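A minimal mpi4py sketch of this scheme, assuming for illustration that each member state is a NumPy array, that evolve and analyse are placeholders for the real routines, and that the analysis is performed on rank 0; the actual Ensflow implementation is not reproduced here.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_ENSEMBLE = 60                   # ensemble size from the slides
STATE_SHAPE = (251, 151)          # top-layer grid from the slides

# 1./2. Each processor owns an even share of the members; their data stay local.
my_members = [np.full(STATE_SHAPE, float(i)) for i in range(rank, N_ENSEMBLE, size)]

def evolve(member):
    return member                 # placeholder for the flow solver

def analyse(all_members):
    return np.mean(all_members, axis=0)   # placeholder for the real analysis

for step in range(240):           # members develop independently between analyses
    my_members = [evolve(m) for m in my_members]

# 3. Only in the analysis phase are data exchanged: collect all members on rank 0.
gathered = comm.gather(my_members, root=0)
optimal = None
if rank == 0:
    all_members = [m for part in gathered for m in part]
    optimal = analyse(all_members)

# 4. The optimal global field is distributed again and a new cycle can start.
optimal = comm.bcast(optimal, root=0)
```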
Parallel implementation - 2
The program contains 2 irreducible scalar parts:
1. Initialisation, linearly dependent on the number of ensemble
members n_e, and depending on n_d, the number of gridpoints,
by O(n_d log^2 n_d). Init time = t_i.
2. The analysis part, for which the analysis time scales as
t_a ∝ n_e^3.
On the DAS-2 systems t_i = 59.5 s and t_a = 104 s (for n_e = 60).
Parallel implementation - 3
The time per ensemble member per 24h time step is t_s ≈ 30 s.
This amounts to 20 x 60 x 30 s (20 daily steps x 60 ensemble
members x 30 s) = 36,000 s = 10 h of single-processor time for
the complete 20-day cycle considered.
After the init phase a distribute operation is required, and per
analysis step a collect and a distribute operation.
n n n 1.5 MB
The total amount of data moved is x y l
.
The bandwidth at with this occurs is 120-140 MB/s (using
Myrinet on one cluster). So, the total communication time is
about 0.12s per transfer.
t c 727 s
Total communication time within one run
.
Model and experiments - 1
The timing model has the following form:

    T_p = t_i + t_a + 1200·t_s/p + t_c
        = 59.5 + 104 + 36,000/p + t_c ≈ 173.5 + 36,000/p  s
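The model is easy to evaluate numerically. The helper below simply codes the formula above with the constants from the previous slides; the default t_c of 15 s is the single-cluster value quoted later and is only an assumption here.

```python
# Timing model T_p = t_i + t_a + 1200*t_s/p + t_c with the constants from
# the slides (t_i = 59.5 s, t_a = 104 s, t_s = 30 s, 1200 member-steps).
T_I, T_A, T_S, N_MEMBER_STEPS = 59.5, 104.0, 30.0, 1200

def predicted_runtime(p, t_c=15.0):
    # t_c defaults to the ~15 s single-cluster value (an assumption here).
    return T_I + T_A + N_MEMBER_STEPS * T_S / p + t_c

for p in (1, 2, 4, 8, 16, 32):
    t_p = predicted_runtime(p)
    print(f"p = {p:2d}: T_p ≈ {t_p:8.1f} s, speedup ≈ {predicted_runtime(1) / t_p:5.2f}")
```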
Model and experiments - 2
Remarks:
1. There is a mystery with respect to the communication phases:
for p = 1, t_c ≈ 17 s; for p > 1, t_c ≈ 30 s consistently.
2. For p < 6, using 1 CPU/node is somewhat faster; from
p = 6 on, 2 CPUs/node is marginally faster due to
decreasing competition for memory and faster
intra-node communication.
Model and experiments: Simulation results
Shown is a simulation of 180 daily periods; note the blue
eddies that form counterclockwise in the Atlantic.
Perspectives for distributed implementation - 1
The timing model has the following form:

    T_p = t_i + t_a + 1200·t_s/p + t_c

In the single-cluster implementation t_c is quite small (ca. 15 s)
and virtually independent of p.
For the distributed version this might not be the case:
1. Presently Globus cannot be used in conjunction with
Myrinet's MPI, so communication must be done via IP.
2. The geographical distance between the DAS clusters
introduces non-negligible latencies.
Perspectives for distributed implementation - 2
As can be seen from the figure, the communication time is
still insignificant when distributing the model over two
locations (UU and VU):
Perspectives for distributed implementation - 3
t_c is quite erratic, determined more by synchronisation
than by the communication time proper:
Perspectives for distributed implementation - 4
The results show that this application is excellently suited
for distributed processing. Still, both communication and
the analysis phase may be made more efficient:
1. When it is known which process ids are located where,
intra-cluster communication can be done first, after which
the assembled messages can be exchanged between the
clusters (sketched below).
2. The analysis could be done on the local ensemble
members (remember t_a ∝ n_e^3) and synchronised less
frequently.
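A possible shape for improvement 1, sketched with mpi4py: split the world communicator by cluster, gather within each cluster over the fast local network first, and let only one head process per cluster take part in the wide-area exchange. How a process learns its cluster_id, and the use of plain gather/allgather calls, are assumptions for illustration only.

```python
from mpi4py import MPI

def hierarchical_gather(local_data, cluster_id):
    """Gather within the cluster first, then exchange only the assembled
    messages between clusters (improvement 1)."""
    world = MPI.COMM_WORLD
    # Processes with the same cluster_id share an intra-cluster communicator.
    intra = world.Split(color=cluster_id, key=world.Get_rank())

    # Step 1: cheap intra-cluster (e.g. Myrinet) gather onto a local head node.
    clustered = intra.gather(local_data, root=0)

    # Step 2: only the head node of each cluster (intra rank 0) joins the
    # inter-cluster communicator and does the expensive wide-area exchange.
    is_head = intra.Get_rank() == 0
    inter = world.Split(color=0 if is_head else MPI.UNDEFINED,
                        key=world.Get_rank())

    result = None
    if is_head:
        result = inter.allgather(clustered)   # one assembled message per cluster
    # Make the assembled result available again inside each cluster.
    return intra.bcast(result, root=0)
```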
Perspectives for distributed implementation - 5
Using more sites has a notable effect on the communication.
Again, synchronisation effects are more important than
the communication time proper:

Sites   Exec. time (s)   Comm. time (s)
  1         3310              45.1
  2         3274              62.9
  3         3339             208.5
  4         3299             151.9

12-proc. run: UU, UU+VU, UU+VU+Leiden,
UU+VU+Leiden+Delft
Perspectives for distributed implementation - 6
This case study was a particularly well-suited candidate for
distributed processing. Apart from improving this implementation
we will proceed with three other promising projects:
1. Running two coupled oceanographic models within the
Cactus framework.
2. Inexact sequence matching of genetic material.
3. Pattern recognition on proteomic microarrays.
Acknowledgements:
Fons van Hees, for the single-system parallelisation.