
Structure-driven Optimizations for
Amorphous Data-parallel Programs
Mario Méndez-Lojo1 Donald Nguyen1
Dimitrios Prountzos1 Xin Sui1 M. Amber Hassaan1
Milind Kulkarni2 Martin Burtscher1 Keshav Pingali1
1 The University of Texas at Austin (USA)
2 Purdue University (USA)
1
Irregular algorithms
• Operate on pointer-based data structures like graphs
– mesh refinement, minimum spanning tree, max-flow…
• Plenty of available parallelism [Kulkarni et al., PPoPP’09]
• Baseline Galois system [Kulkarni et al., PLDI’07]
– uses speculation to exploit this parallelism
– may have high overheads for some algorithms
• Solution explored in paper
– exploit algorithmic structure to reduce overheads of
baseline system
• We will show:
– common algorithmic structures
– optimizations that exploit those structures
– performance results
2
Operator formulation of algorithms
• Algorithm = repeated application of operator to graph
– active node: node where computation is needed
– activity: application of operator to active node
  • can add/remove nodes from graph
– neighborhood: set of nodes and edges read/written to perform activity
  • can be distinct from neighbors in graph
• Focus: algorithms in which order of activities does not matter
• Amorphous data-parallelism
– parallel execution of activities, subject to neighborhood constraints
[Figure: graph with active nodes i1–i5 and their neighborhoods highlighted]
3
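To make the operator formulation concrete, the following is a minimal sequential Java sketch (all names are illustrative; this is not the Galois API): a workset holds the active nodes, and the loop repeatedly applies an operator that may create new active nodes.

import java.util.ArrayDeque;
import java.util.Deque;

public class OperatorFormulationSketch {
    /** An operator applied to one active node; it may push new active nodes. */
    interface Operator {
        void apply(int activeNode, Deque<Integer> workset);
    }

    /** Repeatedly apply the operator until no active nodes remain.
     *  Because activity order does not matter, any removal order is valid. */
    static void run(Deque<Integer> workset, Operator op) {
        while (!workset.isEmpty()) {
            int active = workset.poll();   // pick any active node
            op.apply(active, workset);     // one activity; may add new work
        }
    }

    public static void main(String[] args) {
        Deque<Integer> workset = new ArrayDeque<>();
        workset.add(0);
        // Toy operator: node i activates node i + 1 until 3 is reached.
        run(workset, (node, ws) -> { if (node < 3) ws.add(node + 1); });
        System.out.println("all activities processed");
    }
}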
Delaunay mesh refinement
• Iterative refinement to remove
badly shaped triangles:
add initial bad triangles to workset
while workset is not empty:
    pick a bad triangle
    find its cavity
    retriangulate cavity
    add new bad triangles to workset
• Multiple valid solutions
• Parallelism:
– bad triangles with cavities that do not
overlap can be processed in parallel
– parallelism is dependent on runtime
values
• compilers cannot find this parallelism
4
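The parallelism condition above can be pictured with a small Java sketch, assuming each cavity is represented as a set of triangle ids (a hypothetical representation, not the actual refinement code): two bad triangles may be processed in parallel only if their cavities are disjoint, and this can only be decided from runtime values.

import java.util.Collections;
import java.util.Set;

public class CavityOverlapSketch {
    /** Two activities may run in parallel iff their cavities share no triangle. */
    static boolean canRunInParallel(Set<Integer> cavityA, Set<Integer> cavityB) {
        return Collections.disjoint(cavityA, cavityB);
    }

    public static void main(String[] args) {
        Set<Integer> cavity1 = Set.of(3, 7, 9);   // triangles touched by activity 1
        Set<Integer> cavity2 = Set.of(11, 12);    // triangles touched by activity 2
        Set<Integer> cavity3 = Set.of(9, 14);     // overlaps cavity1 at triangle 9
        System.out.println(canRunInParallel(cavity1, cavity2)); // true
        System.out.println(canRunInParallel(cavity1, cavity3)); // false: conflict
    }
}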
Baseline execution model
Parallel execution model
• shared-memory
• optimistic execution of Galois iterators
Implementation
• threads get active nodes from workset
• apply operator to them
Neighborhood independence
• each node/edge has an associated token
• graph operations → acquire tokens on read/written nodes
• token owned by another thread → conflict → activity rolled back
• software TLS/TM variety
[Figure: program with a Galois iterator ("for node : workset") running on a concurrent graph with active nodes i1–i5]
5
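A sketch of the token-based conflict detection described on this slide, in illustrative Java rather than the Galois implementation: every graph node carries a token, an activity acquires the token of each node it reads or writes, and finding the token owned by another thread signals a conflict that forces the activity to roll back.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class TokenSketch {
    /** One token per graph node; null means the node is currently unowned. */
    static class Node {
        final AtomicReference<Thread> token = new AtomicReference<>(null);
    }

    /** Thrown when a token is owned by another thread; the caller rolls back. */
    static class ConflictException extends RuntimeException {}

    /** Acquire a node's token for the current thread, or signal a conflict. */
    static void acquire(Node n, List<Node> held) {
        Thread me = Thread.currentThread();
        if (n.token.get() == me) return;           // already owned by this activity
        if (!n.token.compareAndSet(null, me))      // owned by another thread
            throw new ConflictException();
        held.add(n);
    }

    /** Release all tokens held by the activity (on commit or abort). */
    static void releaseAll(List<Node> held) {
        for (Node n : held) n.token.set(null);
        held.clear();
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        List<Node> held = new ArrayList<>();
        try {
            acquire(a, held);      // reading/writing node a: take its token
            acquire(b, held);      // reading/writing node b: take its token
            // ... apply the operator here ...
        } catch (ConflictException e) {
            // conflict: undo the activity's modifications (rollback), then retry
        } finally {
            releaseAll(held);      // tokens go back at commit/abort
        }
    }
}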
Sources of overhead
• Dynamic assignment of work
– the centralized workset requires synchronization
• Enforcing neighborhood constraints
– acquiring/releasing tokens on neighborhood
• Copying data for rollbacks
– when an activity modifies a graph element
• Aborted activities
– work is wasted and the activity must be rolled back
[Figure: activity timeline with read (R) and write (W) phases]
6
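The rollback-copying overhead can be illustrated with a small undo-log sketch (illustrative Java, not the actual Galois undo log): before an activity modifies a graph element, the old value is saved so that an abort can restore it.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class UndoLogSketch {
    // node id -> node data (a toy stand-in for graph element state)
    static Map<Integer, Integer> nodeData = new HashMap<>();
    // copies of overwritten values, most recent first
    static Deque<int[]> undoLog = new ArrayDeque<>();

    /** Modify a node, saving its old value so the activity can be rolled back. */
    static void write(int node, int newValue) {
        undoLog.push(new int[] { node, nodeData.get(node) });  // copy for rollback
        nodeData.put(node, newValue);
    }

    /** Abort: restore every saved value in reverse order. */
    static void rollback() {
        while (!undoLog.isEmpty()) {
            int[] entry = undoLog.pop();
            nodeData.put(entry[0], entry[1]);
        }
    }

    public static void main(String[] args) {
        nodeData.put(0, 10);
        write(0, 42);                         // activity modifies node 0
        rollback();                           // conflict detected later: undo it
        System.out.println(nodeData.get(0));  // prints 10
    }
}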
Proposed optimizations
• Baseline execution model is very general
– many irregular algorithms do not need its full generality
– “optimize the common case”
• Identify general-purpose optimizations and
evaluate their performance impact
• Optimizations
– cautious
– one-shot
– iteration coalescing
7
Cautious
Algorithmic structure: operator reads all elements of its neighborhood before modifying any of them
– conflicts detected before modifications occur
Optimizations:
• Enforcing neighborhood constraints
– token acquisition unnecessary after first modification
• Copying data for rollbacks
– unnecessary: conflicts can only occur before the first modification, so writes never need to be undone
examples: Delaunay refinement, Boruvka minimum spanning tree, etc.
[Figure: activity timeline in which all reads (R) precede the first write (W)]
8
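A sketch of how a cautious operator can be run more cheaply, under the structure stated above (illustrative Java, not the Galois API): every token is acquired during the read phase, so a conflict can only abort the activity before anything has been written, and the write phase needs neither token acquisitions nor undo copies.

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CautiousSketch {
    static class Node {
        final AtomicReference<Thread> token = new AtomicReference<>(null);
        int data;
    }
    static class ConflictException extends RuntimeException {}

    static void acquire(Node n) {
        Thread me = Thread.currentThread();
        if (n.token.get() != me && !n.token.compareAndSet(null, me))
            throw new ConflictException();   // abort; nothing written yet, so no undo copies
    }

    /** A cautious activity: read (and lock) the whole neighborhood, then write. */
    static void cautiousActivity(List<Node> neighborhood) {
        int sum = 0;
        for (Node n : neighborhood) {        // read phase: all token acquisitions happen here
            acquire(n);
            sum += n.data;
        }
        for (Node n : neighborhood)          // write phase: no tokens and no undo log needed,
            n.data = sum;                    // because no conflict can occur from here on
        for (Node n : neighborhood)          // commit: release tokens
            n.token.set(null);
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        a.data = 1; b.data = 2;
        cautiousActivity(List.of(a, b));
        System.out.println(a.data + " " + b.data);  // 3 3
    }
}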
One-shot
Algorithmic structure: neighborhood can be predicted before activity begins
Optimizations:
• Enforcing neighborhood constraints
– token acquisition only necessary when activity starts
• Copying data for rollbacks
– unnecessary: once all tokens are acquired, no conflict can force a rollback
• Aborted activities
– waste little computation
examples: preflow-push, survey propagation, stencil codes like Jacobi, etc.
[Figure: activity timeline in which all tokens are acquired before any read (R) or write (W)]
9
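A sketch of the one-shot case (illustrative Java, not the Galois API): because the neighborhood is known up front, all tokens are acquired before any work is done; if any acquisition fails the activity aborts having wasted almost no computation, and once acquisition succeeds no rollback copies are needed.

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class OneShotSketch {
    static class Node {
        final AtomicReference<Thread> token = new AtomicReference<>(null);
        int data;
    }

    /** Try to lock the whole (predicted) neighborhood before doing any work.
     *  Returns false on conflict: the activity aborts having wasted almost nothing. */
    static boolean acquireAll(List<Node> neighborhood) {
        Thread me = Thread.currentThread();
        for (int i = 0; i < neighborhood.size(); i++) {
            if (!neighborhood.get(i).token.compareAndSet(null, me)) {
                for (int j = 0; j < i; j++)              // undo the partial acquisition
                    neighborhood.get(j).token.set(null);
                return false;
            }
        }
        return true;
    }

    /** A one-shot activity: the neighborhood is known up front. */
    static void oneShotActivity(List<Node> neighborhood) {
        if (!acquireAll(neighborhood)) return;   // conflict: retry later, no rollback copies
        for (Node n : neighborhood)              // no conflict possible from here on
            n.data += 1;
        for (Node n : neighborhood)              // commit: release tokens
            n.token.set(null);
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        oneShotActivity(List.of(a, b));
        System.out.println(a.data + " " + b.data);  // 1 1
    }
}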
Iteration coalescing
Algorithmic structure:
– activities generate new active nodes
– same token acquired many times across related activities
• Iteration coalescing = data-centric loop chunking
– place new active nodes in thread-local worksets
– release tokens only on abort/commit
Benefits:
• Dynamic assignment of work
– less contention
• Enforcing neighborhood constraints
– locality: thread probably owns the token
[Figure: timelines of two coalesced activities, each with read (R) and write (W) phases]
10
Iteration coalescing
• Iteration coalescing = data-centric loop chunking
– place new active nodes in thread-local worksets
– release tokens only on abort/commit
Benefits:
• Dynamic assignment of work
– less contention
• Enforcing neighborhood constraints
– locality: thread probably owns the token
Drawback:
• number of tokens held by a thread increases
– higher conflict ratio
11
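A sketch of iteration coalescing under the assumptions of these two slides (illustrative Java, not the Galois implementation): newly generated active nodes go into a thread-local workset instead of the shared one, and tokens are held until the whole chunk commits or aborts, so a related activity that touches the same node usually finds the token already owned by its own thread.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CoalescingSketch {
    static class Node {
        final AtomicReference<Thread> token = new AtomicReference<>(null);
        int data;
        List<Node> successors = new ArrayList<>();
    }
    static class ConflictException extends RuntimeException {}

    /** Acquire a token; it is kept until the whole chunk commits or aborts. */
    static void acquire(Node n, List<Node> held) {
        Thread me = Thread.currentThread();
        if (n.token.get() == me) return;                  // locality: often already ours
        if (!n.token.compareAndSet(null, me)) throw new ConflictException();
        held.add(n);
    }

    /** Run several related activities as one chunk, using a thread-local workset. */
    static void runChunk(Node start) {
        Deque<Node> localWorkset = new ArrayDeque<>();    // thread-local: no shared-workset contention
        List<Node> held = new ArrayList<>();
        localWorkset.add(start);
        try {
            while (!localWorkset.isEmpty()) {
                Node active = localWorkset.poll();
                acquire(active, held);
                active.data += 1;                          // the activity
                localWorkset.addAll(active.successors);   // new active nodes stay local
            }
        } finally {
            for (Node n : held) n.token.set(null);        // release only at commit/abort
        }
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        a.successors.add(b);
        a.successors.add(b);   // related activities touch node b twice: its token is reused
        runChunk(a);
        System.out.println(a.data + " " + b.data);  // 1 2
    }
}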
Evaluation
• Experiments on Niagara (8 cores, 2 threads/core)
• Average % improvement over baseline
Delaunay refinement
• cautious → 15%
• cautious + coalescing → 18%
• one-shot not applicable
• max speedup: 8.4x
Boruvka (MST)
• cautious → 22%
• one-shot → 8%
• coalescing has no impact
• max speedup: 2.7x
12
Evaluation
• Experiments on Niagara (8 cores, 2 threads/core)
• Average % improvement over baseline
Preflow-push (max flow)
• cautious → 33%
• one-shot → 44%
• one-shot + coalescing → 59%
• max speedup: 1.12x
Survey propagation (SAT)
• baseline times out
• one-shot → 28% over cautious
• max speedup: 1.04x
13
Conclusions
• There is structure in irregular algorithms
• Our optimizations exploit this structure
– cautious
– one-shot
– iteration coalescing
• The evaluation confirms the importance of
reducing the overheads of speculation
– other optimizations waiting to be discovered?
14
Thank you!
धन्यवाद!
slides available at
http://www.ices.utexas.edu/~marioml
15