Operator formulation of algorithms

Download Report

Transcript Operator formulation of algorithms

Operator Formulation
of
Irregular Algorithms
Keshav Pingali
University of Texas, Austin
1
Irregular applications
• Organized around large, pointer-based data structures such as
graphs, trees, grids, etc.
– Artificial intelligence: Bayesian inference, SAT solvers: WalkSAT, survey
propagation
– Computational biology: protein homology networks
– Finite-element methods: mesh generation, refinement, partitioning,
sparse linear solvers
– Data-mining: clustering
– Social networks: finding communities
– Simulation: event-driven simulation, Petri-nets
– N-body methods: Barnes-Hut, FMM
• Galois project:
– (Wirth): algorithms + data structures = programs
– understand patterns of parallelism and locality in irregular algorithms
– build systems for exploiting this parallelism on multicore processors
2
High-level message
• Lot of parallelism in irregular algorithms
– amorphous data-parallelism (ADP)
• Compile-time analysis cannot find ADP in most irregular algorithms
– points-to analysis, shape analysis,… 
• Optimistic parallel execution is necessary to find ADP in some
irregular applications
– but speculation is not needed for most irregular algorithms
• There is a lot of structure in irregular algorithms
– can be exploited to optimize parallel execution
– regular algorithms become a special case of irregular algorithms
• Dependence graphs are the wrong abstraction for irregular
algorithms
– data-centric abstractions are crucial
– rethink algorithms: operator formulation
3
Operator formulation of algorithms
• Algorithm = repeated application of
operator to graph
i3
– active element:
i1
• node or edge where computation is needed
– neighborhood:
• set of nodes and edges read/written to
perform activity
• distinct usually from neighbors in graph
i2
– ordering:
• order in which active elements must be executed
in a sequential implementation
– any order
– problem-dependent order
• Amorphous data-parallelism
– parallel execution of activities, subject to
neighborhood and ordering constraints
i4
i5
: active node
: neighborhood
4
Delaunay Mesh Refinement
• Iterative
refinement
Mesh
m = /* read
in meshto*/remove badly
shaped
WorkList
wl; triangles:
while there are bad triangles do {
wl.add(m.badTriangles());
Pick a bad triangle;
while (true) {
Find its cavity;
Retriangulate cavity;
if ( wl.empty()
) break;
// may create new bad triangles
Element e =} wl.get();
• Don’t-care
non-determinism:
if (e no longer
in mesh) continue;
– Cavity
final mesh
on order in which bad new cavity
c = depends
new Cavity(e);//determine
triangles are processed
– c.expand();
applications do not care which mesh is
c.retriangulate();//re-triangulate
produced
region
• Data
structure:
m.update(c);//update
mesh
– wl.add(c.badTriangles());
graph in which nodes represent triangles
and edges represent triangle adjacencies
}
• Parallelism:
– bad triangles with cavities that do not
overlap can be processed in parallel
– parallelism is dependent on runtime values
•
compilers cannot find this parallelism
– (Miller et al) at runtime, repeatedly build
interference graph and find maximal
independent sets for parallel execution
5
Event-driven simulation
•
•
•
•
Stations communicate by sending
messages with time-stamps on FIFO
channels
Stations have internal state that is
updated when a message is processed
Messages must be processed in timeorder at each station
Data structure:
–
•
Messages in event-queue, sorted in timeorder
Parallelism:
–
–
A
3
B
6
4
C
5
activities created in future may interfere
with current activities
Jefferson time-warp
•
•
–
2
station can fire when it has an incoming
message on any edge
requires roll-back if speculative conflict is
detected
Chandy-Misra-Bryant
•
•
conservative event-driven simulation
requires null messages to avoid deadlock
6
Galois programming model (PLDI 2007)
•
Joe programmers
– sequential, OO model
– Galois set iterators: for iterating over
unordered and ordered sets of active
elements
• for each e in Set S do B(e)
– evaluate B(e) for each element in set S
– no a priori order on iterations
– set S may get new elements during
execution
• for each e in OrderedSet S do B(e)
– evaluate B(e) for each element in set S
– perform iterations in order specified by
OrderedSet
– set S may get new elements during
execution
•
for each tr in Set ws do { //unordered Set iterator
if (tr no longer in mesh) continue;
Cavity c = new Cavity(tr);
c.expand();
c.retriangulate();
m.update(c);
ws.add(c.badTriangles()); //bad triangles
}
Stephanie programmers
– Galois concurrent data structure library
•
Mesh m = /* read in mesh */
Set ws;
ws.add(m.badTriangles()); // initialize ws
DMR using Galois iterators
(Wirth) Algorithms + Data structures =
Programs
7
Galois parallel execution model
•
Parallel execution model:
– shared-memory
– optimistic execution of Galois
iterators
•
Implementation:
– master thread begins execution
of program
– when it encounters iterator,
worker threads help by executing
iterations concurrently
– barrier synchronization at end of
iterator
•
Independence of neighborhoods:
– software TLS/TM variety
– logical locks on nodes and edges
•
main()
….
for each …..{
…….
…….
}
.....
Ordering constraints for ordered
set iterator:
Joe Program
Master
i3
i1
i2
i4
i5
Concurrent
Data structure
– execute iterations out of order
but commit in order
– cf. out-of-order CPUs
8
ParaMeter Parallelism Profiles: DMR
• Input mesh:
– Produced by Triangle
(Shewchuck)
– 550K triangles
– Roughly half are badly
shaped
• Available parallelism:
– How many non-conflicting
triangles can be expanded
at each time step?
• Parallelism intensity:
– What fraction of the total
number of bad triangles
can be expanded at each
step?
9
Structural analysis of irregular algorithms
general graph
topology
grid
tree
refinement
morph
coarsening
general
topology-driven
irregular
algorithms
operator
local computation
data-driven
reader
unordered
ordering
ordered
Jacobi: topology: grid, operator: local computation, ordering: unordered
DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
Event-driven simulation: topology: graph, operator: local computation, ordering: ordered 10
Cautious operators (PPoPP 2010)
• Cautious operator implementation:
– reads all the elements in its neighborhood
before modifying any of them
– (eg) Delaunay mesh refinement
• Algorithm structure:
– cautious operator + unordered active
elements
• Optimization: optimistic execution w/o
buffering
– grab locks on elements during read phase
• conflict: someone else has lock, so release
your locks
– once update phase begins, no new locks
will be acquired
• update in-place w/o making copies
• zero-buffering
– note: this is not two-phase locking
11
Eliminating speculation
•
Coordinated execution of activities: if we can build dependence graph
–
Run-time scheduling:
•
•
•
•
–
Just-in-time scheduling:
•
•
–
local computation + topology-driven
(eg) tree walks, sparse MVM (inspector-executor)
Compile-time scheduling:
•
•
•
cautious operator implementation + unordered active elements
execute all activities partially to determine neighborhoods
create interference graph and find independent set of activities
execute independent set of activities in parallel w/o synchronization
previous case + graph is known at compile-time (eg. grid or clique)
make all scheduling decisions at compile-time time
Speculation needed if
–
–
operator implementation is not cautious, or
unordered algorithm that can create new active nodes dynamically
12
Summary
• (Wirth): Algorithms + Data structures = Programs
– deep understanding of structure in algorithms and data
structures is critical for success in parallelization
• Wrong abstractions:
– dependence graphs, dataflow graphs, etc.
• Right abstraction:
– operator formulation of algorithms
– key notions: active nodes, neighborhoods, ordering
• Exposes key properties of algorithms
– amorphous data-parallelism
– structure in algorithms and data structures
13