Parallel Molecular Dynamics - Parallel Programming Laboratory

Parallel Molecular Dynamics
A case study:
Programming for performance
Laxmikant Kale
http://charm.cs.uiuc.edu
Molecular Dynamics
• Collection of [charged] atoms, with bonds
• Newtonian mechanics
• At each time-step
– Calculate forces on each atom
• bonded: bonds, angles, dihedrals, ...
• non-bonded: electrostatic and van der Waals
– Calculate velocities and advance positions
• 1 femtosecond time-step, millions needed!
• Thousands of atoms (1,000 - 100,000)
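As a concrete (if simplified) picture of the loop above, here is a minimal plain-C++ sketch of one time step; the Vec3/Atom types and the two force routines are hypothetical placeholders, not NAMD code:

    #include <vector>

    struct Vec3 { double x, y, z; };
    struct Atom { Vec3 pos, vel, force; double mass, charge; };

    // Placeholder kernels for the two force terms named above (real kernels omitted).
    void computeBondedForces(std::vector<Atom>&) { /* bonds, angles, dihedrals */ }
    void computeNonBondedForces(std::vector<Atom>&) { /* electrostatics + van der Waals */ }

    // One MD time step: compute forces on each atom, then advance velocities and positions.
    void timeStep(std::vector<Atom>& atoms, double dt /* ~1 femtosecond */) {
        for (Atom& a : atoms) a.force = {0, 0, 0};
        computeBondedForces(atoms);
        computeNonBondedForces(atoms);
        for (Atom& a : atoms) {
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }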
Further MD
• Use of cut-off radius to reduce work
– 8 - 14 Å
– Faraway charges ignored!
• 80-95% of the work is in non-bonded force computations
• Some simulations need faraway
contributions
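To make the cutoff idea concrete, here is a sketch of a non-bonded pair loop that simply skips pairs beyond the cutoff; it assumes a hypothetical Atom type and, for brevity, only a Coulomb term:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Vec3 { double x, y, z; };
    struct Atom { Vec3 pos, force; double charge; };

    // Non-bonded forces with a cutoff radius (e.g. 8-14 Angstroms):
    // contributions from pairs farther apart than the cutoff are ignored.
    void nonBondedWithCutoff(std::vector<Atom>& atoms, double cutoff) {
        const double cutoff2 = cutoff * cutoff;
        for (std::size_t i = 0; i < atoms.size(); ++i) {
            for (std::size_t j = i + 1; j < atoms.size(); ++j) {
                const double dx = atoms[j].pos.x - atoms[i].pos.x;
                const double dy = atoms[j].pos.y - atoms[i].pos.y;
                const double dz = atoms[j].pos.z - atoms[i].pos.z;
                const double r2 = dx * dx + dy * dy + dz * dz;
                if (r2 > cutoff2) continue;                     // faraway charges ignored
                const double f = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2));
                atoms[i].force.x -= f * dx;  atoms[j].force.x += f * dx;   // Newton's 3rd law
                atoms[i].force.y -= f * dy;  atoms[j].force.y += f * dy;
                atoms[i].force.z -= f * dz;  atoms[j].force.z += f * dz;
            }
        }
    }

(A real code would also use the spatial boxes described later so that distant pairs are never even examined.)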
Scalability
• The Program should scale up to use a large
number of processors.
– But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
– If I double the number of processors, I should
be able to retain parallel efficiency by
increasing the problem size
Isoefficiency
• Quantify scalability
• How much increase in problem size is
needed to retain the same efficiency on a
larger machine?
• Efficiency: Sequential Time / (P · Parallel Time)
– parallel time =
• computation + communication + idle
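Restating the definition above in symbols (T_seq is the one-processor time, T_par the time on P processors):

    Efficiency E(N, P) = T_seq(N) / (P · T_par(N, P)),   T_par = computation + communication + idle

Isoefficiency then asks how fast the problem size N must grow with P so that E stays constant; the slower that growth, the more scalable the algorithm.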
Traditional Approaches
• Replicated Data:
– All atom coordinates stored on each processor
– Non-bonded Forces distributed evenly
– Analysis: Assume N atoms, P processors
• Computation: O(N/P)
• Communication: O(N log P)
• Communication/Computation ratio: O(P log P)
• Fraction of communication increases with the number of processors, independent of problem size!
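The ratio quoted above follows directly from the two terms (a sketch of the arithmetic; the O(N log P) term models combining and redistributing all N coordinates and forces across P processors):

    communication / computation = O(N log P) / O(N/P) = O(P log P)

Because N cancels, the communication fraction cannot be reduced by making the problem larger.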
Atom decomposition
• Partition the Atoms array across processors
– Nearby atoms may not be on the same processor
– Communication: O(N) per processor
– Communication/Computation: O(P)
Force Decomposition
• Distribute force matrix to processors
– Matrix is sparse, non-uniform
– Each processor has one block
– Communication: O(N/sqrt(P))
– Ratio: O(sqrt(P))
• Better scalability (can use 100+ processors)
– Hwang, Saltz, et al.:
– 6% on 32 PEs, 36% on 128 processors
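Where these terms come from (a sketch, assuming the N x N force matrix is tiled into a sqrt(P) x sqrt(P) grid of blocks): each processor owns one block and therefore needs the coordinates of the roughly 2·N/sqrt(P) atoms that index its rows and columns, so

    communication = O(N/sqrt(P)),   communication / computation = O(N/sqrt(P)) / O(N/P) = O(sqrt(P))

which grows more slowly with P than the O(P) ratio of atom decomposition, hence the better scalability.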
Spatial Decomposition
• Allocate close-by atoms to the same processor
• Three variations possible:
– Partitioning into P boxes, 1 per processor
• Good scalability, but hard to implement
– Partitioning into fixed-size boxes, each a little larger than the cutoff distance
– Partitioning into smaller boxes
• Communication: O(N/P)
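A sketch of the second variation (fixed-size boxes a little larger than the cutoff), assuming a hypothetical rectangular simulation volume; with this box size, every interaction partner of an atom within the cutoff is guaranteed to lie in the atom's own box or one of its 26 neighbors:

    #include <array>
    #include <cmath>
    #include <map>
    #include <vector>

    struct Vec3 { double x, y, z; };
    using BoxIndex = std::array<int, 3>;

    // Bin atoms into cubic boxes ("patches") whose side is slightly larger than the cutoff.
    std::map<BoxIndex, std::vector<int>>
    binAtoms(const std::vector<Vec3>& pos, double cutoff) {
        const double side = 1.1 * cutoff;    // a little larger than the cutoff distance
        std::map<BoxIndex, std::vector<int>> boxes;
        for (int i = 0; i < static_cast<int>(pos.size()); ++i) {
            const BoxIndex b = { static_cast<int>(std::floor(pos[i].x / side)),
                                 static_cast<int>(std::floor(pos[i].y / side)),
                                 static_cast<int>(std::floor(pos[i].z / side)) };
            boxes[b].push_back(i);           // atom i belongs to patch b
        }
        return boxes;
    }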
Spatial Decomposition in NAMD
• NAMD 1 used spatial decomposition
• Good theoretical isoefficiency, but for a
fixed size system, load balancing problems
• For midsize systems, got good speedups up
to 16 processors….
• Use the symmetry of Newton’s 3rd law to
facilitate load balancing
Spatial Decomposition
(figures illustrating the spatial decomposition of the simulation volume into patches)
FD + SD
• Now, we have many more objects to load
balance:
– Each diamond can be assigned to any processor
– Number of diamonds (3D): 14 · number of patches (see the sketch below)
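A sketch of where the factor 14 comes from: in 3D each patch touches 26 neighbors, but with the Newton's-3rd-law symmetry mentioned earlier each pair of patches needs only one force object, so a patch owns its self-interaction plus half of the 26 neighbor interactions. The offsets below are one way to pick that half:

    #include <array>
    #include <iostream>
    #include <vector>

    // Offsets of the interactions a patch owns: itself plus the 13 neighbors that follow it
    // in lexicographic order (the other 13 pairs are owned by the neighboring patches).
    std::vector<std::array<int, 3>> ownedOffsets() {
        std::vector<std::array<int, 3>> offs;
        for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const std::array<int, 3> d = {dz, dy, dx};
                    if (d >= std::array<int, 3>{0, 0, 0})   // self + "upper" half of the 26 neighbors
                        offs.push_back(d);
                }
        return offs;
    }

    int main() {
        std::cout << ownedOffsets().size() << " force objects per patch\n";   // prints 14
    }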
Bond Forces
• Multiple types of forces:
– Bonds (2 atoms), Angles (3), Dihedrals (4), ...
– Luckily, each involves atoms in neighboring
patches only
• Straightforward implementation:
– Send message to all neighbors,
– receive forces from them
– 26*2 messages per patch!
Bonded Forces:
• Assume one patch per processor
(figure: example patches A, B, C showing where each bonded force is computed)
Implementation
• Multiple Objects per processor
– Different types: patches, pairwise forces,
bonded forces,
– Each may have its data ready at different times
– Need ability to map and remap them
– Need prioritized scheduling
• Charm++ supports all of these
Charm++
• Data Driven Objects
• Object Groups:
– global object with a “representative” on each PE
• Asynchronous method invocation
• Prioritized scheduling
• Mature, robust, portable
• http://charm.cs.uiuc.edu
Data driven execution
(figure: each processor runs a scheduler that picks messages off its message queue and delivers them to the target object)
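The picture above can be sketched in plain C++ (this shows the flavor of the model, not real Charm++ syntax): each processor runs a scheduler loop that pulls the best-priority message off its queue and delivers it as a method invocation on the target object.

    #include <functional>
    #include <queue>
    #include <utility>

    struct Message {
        int priority;                        // smaller value = more urgent
        std::function<void()> deliver;       // bound entry method plus its arguments
        bool operator<(const Message& o) const { return priority > o.priority; }
    };

    // Per-processor scheduler: data-driven execution over a prioritized message queue.
    struct Scheduler {
        std::priority_queue<Message> messageQ;

        void enqueue(int prio, std::function<void()> work) {
            messageQ.push(Message{prio, std::move(work)});
        }

        void run() {                         // keep going while messages are available
            while (!messageQ.empty()) {
                Message m = messageQ.top();  // pick the best-priority message
                messageQ.pop();
                m.deliver();                 // invoke the target object's method
            }
        }
    };

In Charm++ terms the messages are asynchronous entry-method invocations on patch and force objects; whichever object has its data ready gets scheduled, and priorities steer execution along the critical path.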
Load Balancing
• Is a major challenge for this application
– especially for a large number of processors
• Unpredictable workloads
– Each diamond (force object) and patch encapsulates a variable amount of work
– Static estimates are inaccurate
• Measurement based Load Balancing
– Very slow variations across timesteps
Bipartite graph balancing
• Background load:
– Patches and angle forces
• Migratable load:
– Non-bonded forces
• Bipartite communication graph
– between migratable and non-migratable objects
• Challenge:
– Balance Load while minimizing communication
Load balancing
• Collect timing data for several cycles
• Run heuristic load balancer
– Several alternative ones
• Re-map and migrate objects accordingly
– Registration mechanisms facilitate migration
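A toy version of the measurement-based step, assuming a simple greedy heuristic (the real balancers also take the communication graph into account): objects carry the times measured over the last few cycles, and each migratable object is placed on the currently least-loaded processor, on top of the non-migratable background load.

    #include <algorithm>
    #include <vector>

    struct ObjTime { int objId; double measuredTime; };   // timing collected over several cycles

    // Greedy remap: heaviest migratable objects first, each onto the least-loaded processor.
    // background[p] is the non-migratable load (patches, bonded forces) already on processor p.
    std::vector<int> greedyRemap(std::vector<ObjTime> objs, std::vector<double> background) {
        std::sort(objs.begin(), objs.end(),
                  [](const ObjTime& a, const ObjTime& b) { return a.measuredTime > b.measuredTime; });
        std::vector<int> newProc(objs.size());             // object ids assumed to be 0..n-1
        for (const ObjTime& o : objs) {
            const int p = static_cast<int>(
                std::min_element(background.begin(), background.end()) - background.begin());
            newProc[o.objId] = p;                          // migrate object o to processor p
            background[p] += o.measuredTime;
        }
        return newProc;
    }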
(two bar charts: time per processor, split into migratable and non-migratable work, for processors 0-15 and the average)
Performance: size of system
Procs    bR (3,762 atoms)      ER-ERE (36,573 atoms)   ApoA-I (92,224 atoms)
         Time      Speedup     Time      Speedup       Time      Speedup
1        1.14      1.0         -         -             -         -
2        0.58      1.97        6.115     (1.97)        -         -
4        0.315     3.61        3.099     3.89          10.76     (3.88)
8        0.158     7.20        1.598     7.54          5.46      7.64
16       0.086     13.2        0.810     14.9          2.85      14.7
32       0.048     23.7        0.397     30.3          1.47      28.4
64       -         -           0.212     56.8          0.729     57.3
128      -         -           0.123     97.9          0.382     109
160      -         -           0.098     123           0.321     130
Performance: various machines
Procs    T3E                 Origin 2000         ASCI-Red            NOWs (HP735/125)
         Time      Speedup   Time      Speedup   Time      Speedup   Time      Speedup
1        -         -         8.28      1.0       28.0      1.0       24.1      1.0
2        6.12      (1.97)    4.20      1.96      13.9      2.01      12.4      1.94
4        3.10      3.89      2.17      3.80      7.24      3.87      6.39      3.77
8        1.60      7.54      1.07      7.74      3.76      7.45      3.69      6.54
16       0.810     14.9      0.542     15.3      1.91      14.7      -         -
32       0.397     30.3      0.271     30.5      1.01      27.9      -         -
64       0.212     56.8      0.152     54.3      0.500     56.0      -         -
128      0.123     97.9      -         -         0.279     100       -         -
160      0.098     123       -         -         0.227     123       -         -
192      -         -         -         -         0.196     143       -         -
Performance: new data
(figure: measured speedup compared against perfect speedup, plotted against the number of processors)
Speedup: recent data
(figure: speedup on ASCI Red as a function of the number of processors)