ppt - Parallel Programming Laboratory

Programming an SMP Desktop using
Charm++
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Supported in part by IACAT
Prologue
• I will present an abbreviated version of the planned
talk
– We are running late.
– Also, I realized that what I really intended to present, with code
examples, will need an hour-long talk.
– We will write that up in a report later (and maybe put it in the Charm++
documentation)
4/9/2016
Charm++ Workshop
2
Outline
– Charm++ designed for portability between shared and distributed
memory
– Optimizing multicore Charm++
• K-neighbor and its description and performance
• What optimizations were carried out
– Abstractions:
• Basic: shared object space, Readonly data
• Plain global variables: these still work; more on their disciplined
use later
• Nodegroups
• Passing pointers to shared data structures, including sections
of arrays.
– Readonly, write-exclusive: permissions by convention or “capability”
Optimizing SMP implementation of Charm++
• Changed memory allocator
– to avoid acquiring a lock per memory allocation
• Reduced the granularity of critical region
• Used thread local storage (__thread) to avoid false
sharing
• Used memory fences instead of locks for pcqueue
• Reduced lock contention by using a separate msg
queue for each of the other cores on the same node
• Simplified the data structure of pcqueue
– Assumes the queue size is adequately large
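The fence-instead-of-lock idea can be sketched in standard C++ as a single-producer/single-consumer ring buffer. This is only an illustration, not the actual Charm++ pcqueue: the class name `SpscQueue` and its interface are hypothetical, and it assumes exactly one sending thread and one receiving thread per queue (which is what one queue per sending core would give you).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical SPSC ring buffer; an illustration, not Charm++'s pcqueue.
template <typename T>
class SpscQueue {
public:
    explicit SpscQueue(std::size_t capacity)
        : slots_(capacity + 1), head_(0), tail_(0) {
        assert(capacity > 0);
    }

    // Called only by the producer thread. Returns false if the queue is full.
    bool push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % slots_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = value;
        // Release: the slot write above is visible before the new tail is.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called only by the consumer thread. Returns nullopt if the queue is empty.
    std::optional<T> pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire: pairs with the release store in push().
        if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;
        T value = slots_[head];
        // Release: the slot read above completes before the slot is handed back.
        head_.store((head + 1) % slots_.size(), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> slots_;
    alignas(64) std::atomic<std::size_t> head_;  // written by the consumer only
    alignas(64) std::atomic<std::size_t> tail_;  // written by the producer only
};
```

No lock is taken on either side; ordering comes only from the acquire/release pairs. The fixed capacity mirrors the "adequately large queue" assumption, and placing `head_` and `tail_` on separate 64-byte lines (an assumed cache-line size) is one way to avoid false sharing between the two threads.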
Results on SMP Performance
• Improvement on K-Neighbor Test (8 cores, Mar’2009)
• Improvement on K-Neighbor Test (24 cores, Mar’2009)
• Improvement on K-Neighbor Test (16 cores, Apr’2009)
[Figure: kNeighbor test on a Power 5 node (15 elements, k=3); average iteration time (us) vs. msg size (bytes); series: non-smp, smp]
We evaluated many of our
applications to test and
demonstrate the efficacy
of the optimized SMP
runtime
Jacobi 2D stencil computation on Power 5
(8000x8000 matrix size)
[Figure: efficiency vs. number of processors (1 to 16)]
ChaNGa:
Barnes-Hut based
production astronomy code
NAMD Scaling with Optimization
Multicore vs net
[Figure: Time (s/step) vs. Number of Cores; series: multicore, non-smp]

Number of Cores    1       3       6       12      24
multicore          1.715   0.6     0.303   0.152   0.0826
non-smp            1.719   0.631   0.321   0.173   0.11

NAMD apoa1 running on upcrc
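To read the NAMD timings above as speedup and parallel efficiency, a small sketch like the following can help. The numbers are copied from the slide. The function names are made up for illustration, and the baseline is assumed to be each build's own 1-core time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Times per step copied from the slide (cores: 1, 3, 6, 12, 24).
constexpr std::array<int, 5> kCores = {1, 3, 6, 12, 24};
constexpr std::array<double, 5> kMulticore = {1.715, 0.6, 0.303, 0.152, 0.0826};
constexpr std::array<double, 5> kNonSmp = {1.719, 0.631, 0.321, 0.173, 0.11};

// Speedup relative to the same build's 1-core time (an assumed baseline).
constexpr double speedup(const std::array<double, 5>& t, std::size_t i) {
    assert(i < t.size());
    return t[0] / t[i];
}

// Parallel efficiency = speedup / number of cores.
constexpr double efficiency(const std::array<double, 5>& t, std::size_t i) {
    return speedup(t, i) / kCores[i];
}

// Prints one line per core count for both builds.
void print_scaling() {
    for (std::size_t i = 0; i < kCores.size(); ++i) {
        std::printf("%2d cores: multicore %.2fx (eff %.2f), non-smp %.2fx (eff %.2f)\n",
                    kCores[i], speedup(kMulticore, i), efficiency(kMulticore, i),
                    speedup(kNonSmp, i), efficiency(kNonSmp, i));
    }
}
```

At 24 cores this gives a speedup of roughly 20.8 for the multicore build and roughly 15.6 for non-smp, so the gap between the two builds widens as the core count grows.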
Summary of
constructs that use
shared memory in
Charm++
Basic Mechanisms
• Chares and chare arrays constitute a “shared object
space”
– Analogous to shared address space
• Readonly globals
– Initialized in main::main or any method called from it
synchronously
• Shared global variables
More powerful mechanisms
• Node groups
• Passing pointers to shared data structures,
including sections of arrays.
– Readonly, write-permission
Node Groups
• Node Groups - a collection of objects (chares)
– Exactly one representative on each node
• Ideally suited for system libraries on SMP
– Similar to arrays:
• Broadcasts, reductions, indexing
– But not completely like arrays:
• Non-migratable; one per node
Conditional packing
• Pass data structure between chares
– Pass pointer (dest. within the node)
– PUP the entire structure (dest. outside the node)
• Who owns the data and frees it?
– Data structure must inherit from CkConditional
• Reference counted
• A data structure can contain info about an
array section
– Useful in cases like in-place sorting (e.g. quicksort)
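The pointer-or-pack decision can be sketched in plain C++. This is not the CkConditional interface (its API is not shown on the slide): `Block`, `Wire`, `send`, and `receive` are hypothetical names, node ids are plain integers, and `std::shared_ptr` stands in for the reference counting.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <variant>
#include <vector>

// Hypothetical payload; stands in for a data structure passed between chares.
struct Block {
    std::vector<double> values;
};

// What reaches the destination: a shared pointer (same node) or packed bytes.
using Wire = std::variant<std::shared_ptr<Block>, std::vector<unsigned char>>;

// Pack only when the destination is on a different node.
Wire send(const std::shared_ptr<Block>& data, int srcNode, int destNode) {
    assert(data != nullptr);
    if (srcNode == destNode) return data;  // pointer pass; reference count goes up
    std::vector<unsigned char> bytes(data->values.size() * sizeof(double));
    if (!bytes.empty())
        std::memcpy(bytes.data(), data->values.data(), bytes.size());
    return bytes;
}

// Receiver side: reuse the shared pointer, or rebuild a private copy from bytes.
std::shared_ptr<Block> receive(const Wire& wire) {
    if (auto p = std::get_if<std::shared_ptr<Block>>(&wire)) return *p;
    const auto& bytes = std::get<std::vector<unsigned char>>(wire);
    auto copy = std::make_shared<Block>();
    copy->values.resize(bytes.size() / sizeof(double));
    if (!bytes.empty())
        std::memcpy(copy->values.data(), bytes.data(), bytes.size());
    return copy;
}
```

Because the last holder of the `shared_ptr` frees the data, the "who owns it and frees it?" question is answered the same way whether the message stayed on the node or was packed.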
Sharing Data and Conditional packing
• Pointers can be sent in “messages”, but they are
replaced by the packed underlying data structures when
going across nodes
– (feature in chare kernel since 1989 or so!)
• Data structure being shared should be
encapsulated, with a read or write “capability”
– If I give you write access, I promise not to modify it, read it, or
grant access to someone else
– If I give you read access, I promise not to change it until you
are done
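One way to make these promises checkable rather than purely conventional is to encode them in types. The sketch below is an assumption-laden illustration in standard C++, not a Charm++ API. `ReadCap` and `WriteCap` are hypothetical names: a write capability is move-only, so granting it gives up the granter's own access, and a read capability is a copyable const view.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

using Data = std::vector<int>;

// Read capability: copyable, const-only view of the shared data.
class ReadCap {
public:
    explicit ReadCap(std::shared_ptr<const Data> d) : data_(std::move(d)) {}
    const Data& get() const { return *data_; }

private:
    std::shared_ptr<const Data> data_;
};

// Write capability: move-only, so handing it over removes the giver's access.
class WriteCap {
public:
    explicit WriteCap(Data d) : data_(std::make_unique<Data>(std::move(d))) {}
    WriteCap(WriteCap&&) = default;
    WriteCap& operator=(WriteCap&&) = default;

    // False once this capability has been granted to someone else.
    bool valid() const { return data_ != nullptr; }

    Data& get() {
        assert(valid());
        return *data_;
    }

    // Give up write access for good and publish the data as read-only.
    ReadCap freeze() && {
        return ReadCap(std::shared_ptr<const Data>(std::move(data_)));
    }

private:
    std::unique_ptr<Data> data_;
};
```

After `WriteCap worker = std::move(owner);` the original holder can no longer reach the data, which matches the write-access promise. Once `freeze()` is called no writer remains, so readers see data that does not change.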
Disciplined Sharing
• My pet idea: shared arrays with restricted modes
– Readonly, write-exclusive, accumulate, and “owner-computes”
– Modes can change at well-defined global synch points
– Captures a large fraction of uses of shared arrays
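A minimal, single-threaded sketch of such a moded array follows. It is only meant to show the access rules, not a real implementation. The class `ModedArray`, the `sync` call standing in for a global synchronization point, and the block distribution used for ownership are all assumptions. The slide does not spell out what each mode allows, so the rules enforced here are one possible reading.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class Mode { ReadOnly, WriteExclusive, Accumulate, OwnerComputes };

class ModedArray {
public:
    ModedArray(std::size_t n, std::size_t workers)
        : data_(n, 0.0), written_(n, false), workers_(workers), mode_(Mode::ReadOnly) {
        assert(workers > 0);
    }

    // Stands in for a global synch point: the only place the mode may change.
    void sync(Mode next) {
        mode_ = next;
        written_.assign(data_.size(), false);
    }

    double read(std::size_t i) const {
        require(mode_ == Mode::ReadOnly, "read outside ReadOnly mode");
        return data_.at(i);
    }

    // Each element may be written at most once per WriteExclusive phase.
    void write(std::size_t i, double v) {
        require(mode_ == Mode::WriteExclusive, "write outside WriteExclusive mode");
        require(!written_.at(i), "element already written in this phase");
        written_[i] = true;
        data_[i] = v;
    }

    // Contributions are added together; no reads during this phase.
    void accumulate(std::size_t i, double v) {
        require(mode_ == Mode::Accumulate, "accumulate outside Accumulate mode");
        data_.at(i) += v;
    }

    // Assumed block distribution: worker w owns one contiguous chunk.
    std::size_t owner(std::size_t i) const { return i * workers_ / data_.size(); }

    // In OwnerComputes mode only the owning worker may touch an element.
    double& own(std::size_t i, std::size_t worker) {
        require(mode_ == Mode::OwnerComputes, "own outside OwnerComputes mode");
        require(i < data_.size() && owner(i) == worker, "element not owned by this worker");
        return data_[i];
    }

private:
    static void require(bool ok, const char* what) {
        if (!ok) throw std::logic_error(what);
    }

    std::vector<double> data_;
    std::vector<bool> written_;
    std::size_t workers_;
    Mode mode_;
};
```

Every access is checked against the current mode and the mode only changes in `sync`, so a violation shows up as an exception at the offending call instead of as a silent data race.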