
Implementing Simplified
Molecular Dynamics Simulation
in Different Parallel Paradigms
Chao Mei
April 27th, 2006
CS498LVK
MD Simulation Recall

- Time evolution of a set of interacting atoms
- Interactions between molecules
  - Intra: bonds and bends between atoms
  - Inter: interaction of point charges
- Interactions simply follow Newton's law (F = ma), but only within a cutoff range
Intuitive Sequential Algorithm

- Input: an array of atoms
  - Physical attributes: position, force, velocity, etc.
- Algorithm for one timestep:

    for each atom A in the array
        for each atom B in the array
            if (distance(A, B) <= cutoff)
                compute the force between A and B
    update the physical attributes of all atoms
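
A minimal C sketch of this O(N^2) timestep (the Atom layout is illustrative, each pair is visited once rather than twice, and add_pair_force() and integrate() are hypothetical helpers):

    #include <math.h>

    typedef struct { double pos[3], vel[3], force[3]; } Atom;

    static double atom_distance(const Atom *a, const Atom *b) {
        double d2 = 0.0;
        for (int k = 0; k < 3; k++) {
            double d = a->pos[k] - b->pos[k];
            d2 += d * d;
        }
        return sqrt(d2);
    }

    /* One timestep of the naive O(N^2) algorithm. */
    void timestep(Atom *atoms, int n, double cutoff) {
        for (int a = 0; a < n; a++)
            for (int b = a + 1; b < n; b++)               /* visit each pair once */
                if (atom_distance(&atoms[a], &atoms[b]) <= cutoff)
                    add_pair_force(&atoms[a], &atoms[b]); /* hypothetical force kernel */
        for (int a = 0; a < n; a++)
            integrate(&atoms[a]);                         /* hypothetical attribute update */
    }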
Ideas of Parallelization

- Data decomposition
  - Chunk the atom array
  - Distribute one chunk to each processor
  - Atom information is communicated between every pair of chunks (see the chunking sketch below)
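
For illustration, a common block distribution computes each processor's chunk bounds as follows (a sketch; the names are ours, not from the slides):

    /* [start, end) bounds of processor p's chunk of n atoms (block distribution). */
    void chunk_bounds(int n, int nprocs, int p, int *start, int *end) {
        int base = n / nprocs, extra = n % nprocs;   /* spread the remainder over the first ranks */
        *start = p * base + (p < extra ? p : extra);
        *end   = *start + base + (p < extra ? 1 : 0);
    }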
Implementation in Shared Memory

- Barrier before the next timestep
- Simple in OpenMP (see the sketch below)
  - Only add compiler directives to the sequential code
  - Chunking is done automatically by the compiler
- Not difficult in the others (HPF, UPC, CAF, GA, etc.)
  - Chunking is done by the programmer
  - Remote array accesses take a different code form
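
For example, the outer loop of the sequential sketch above can be parallelized with a single directive (a sketch; accumulate_force() is a hypothetical kernel that writes only to atom a, so each pair is visited twice to avoid write conflicts):

    /* OpenMP chunks the iteration space of the outer loop across threads. */
    #pragma omp parallel for schedule(static)
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            if (a != b && atom_distance(&atoms[a], &atoms[b]) <= cutoff)
                accumulate_force(&atoms[a], &atoms[b]);  /* hypothetical: writes only atoms[a].force */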
Access Conflicts in Shared Data
[Diagram: the atom array partitioned into chunks on processors P1 through P4; two processors may update the same atom's data concurrently]
- Be careful! Avoiding conflicts adds programming complexity
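
A sketch of one way to guard the shared force accumulators in OpenMP, assuming the pair force f[] between atoms a and b has already been computed:

    /* If both atoms of a pair are updated in place, guard the shared accumulators. */
    for (int k = 0; k < 3; k++) {
        #pragma omp atomic
        atoms[a].force[k] += f[k];   /* atom a may be owned by another thread */
        #pragma omp atomic
        atoms[b].force[k] -= f[k];   /* Newton's third law: equal and opposite */
    }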
Implementations in Message Passing

- Basic scheme for one timestep (an MPI sketch follows below):

    for every other processor i
        send the local atom chunk to processor i
    for every other processor i
        receive the atom chunk from processor i
        compute forces between the local chunk and the received one
    update local atoms
    barrier()
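
A hedged MPI sketch of this scheme, assuming <mpi.h> is included, all chunks share the size nlocal, local and remote are Atom buffers, and compute_forces() / update_local_atoms() are hypothetical helpers. MPI_Sendrecv replaces the separate send and receive loops because pairing them avoids the deadlock a send-all-then-receive-all loop can cause with blocking sends:

    /* One timestep: exchange equal-sized chunks pairwise, then compute and update. */
    int rank, nprocs;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int i = 0; i < nprocs; i++) {
        if (i == rank) continue;
        MPI_Sendrecv(local,  nlocal * (int)sizeof(Atom), MPI_BYTE, i, 0,
                     remote, nlocal * (int)sizeof(Atom), MPI_BYTE, i, 0,
                     MPI_COMM_WORLD, &status);
        compute_forces(local, remote, nlocal);  /* hypothetical force kernel */
    }
    update_local_atoms(local, nlocal);          /* hypothetical integrator */
    MPI_Barrier(MPI_COMM_WORLD);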

- Overlapping communication with computation
  - Supported by Charm++ and Global Arrays (GA)
Other Paradigms

- Fit
- Moderately fit
  - BSP
  - STAPL
  - DSM (TreadMarks, CID, CASHMERe)
- Probably not fit
  - Cilk
Implementation Performance

[Chart: execution time (ms, 0-200) of the OpenMP, MPI, and Charm++ implementations on 1, 2, 4, and 8 processors]
Wait… think again!

- The parallelization approach above is not good for high performance
  - Possibly wasteful communication
Better Ideas of Parallelization

- Space decomposition (1-away) based on the cutoff

[Diagram: the simulation system divided into cutoff-sized cells]
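
With cutoff-sized cells, finding an atom's cell is a floor division (a sketch assuming the box origin is at 0):

    /* Cell coordinates of one atom when cells are cutoff-sized. */
    int cx = (int)(atom.pos[0] / cutoff);
    int cy = (int)(atom.pos[1] / cutoff);
    int cz = (int)(atom.pos[2] / cutoff);
    /* Atoms within the cutoff of this atom can only live in this cell
       or in one of its 26 face/edge/corner neighbors (1-away). */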
Pros/Cons of the Better Ideas

- Pro: reduced communication
- Con: updates at every step change the atoms' positions
  - Re-decompose the space at every step?
  - No: do it only every certain number of steps (see the loop sketch below)
- Con: possibly imbalanced computation
  - Every cell contains a different number of atoms
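
The resulting timestep loop re-decomposes only periodically (a sketch; REDECOMP_PERIOD and all helper functions are illustrative):

    for (int step = 0; step < nsteps; step++) {
        if (step % REDECOMP_PERIOD == 0)             /* not every step */
            assign_atoms_to_cells(atoms, n, cutoff); /* hypothetical migration */
        compute_cell_forces();                       /* hypothetical, cells x neighbors */
        integrate_atoms();                           /* hypothetical attribute update */
    }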
What Do We Need for the Implementation?

- A cell array (each cell sized by the cutoff)
- A way to know which two cells need to exchange data
  - A global table (indexed by a pair of cells)
  - A neighbor list associated with each cell (sketched below)
- Load balancing
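
A sketch of building the neighbor list for an interior cell of a 3-D cell grid (boundary handling omitted; neighbors[] is an int array of length 26, and cell_index() is a hypothetical 3-D-to-1-D map):

    /* Enumerate the 26 one-away neighbors of interior cell (x, y, z). */
    int nn = 0;
    for (int dx = -1; dx <= 1; dx++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dz = -1; dz <= 1; dz++) {
                if (dx == 0 && dy == 0 && dz == 0) continue;  /* skip the cell itself */
                neighbors[nn++] = cell_index(x + dx, y + dy, z + dz);
            }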
Charm++ Perfectly Fits the Better Ideas

- Object-oriented programming
- One-sided communication for overlapping communication and computation
- Support for multidimensional arrays
- Data-driven computation
- Built-in load balancing
Summary

- MD simulation is not very difficult to implement in both the shared-memory and message-passing paradigms
- The parallelization technique matters
  - The simple one is not difficult to implement
  - The better one motivates more language features

Thank You!