
Parallelizing Irregular Applications
through the Exploitation of
Amorphous Data-parallelism
Keshav Pingali (UT, Austin)
Milind Kulkarni (Purdue)
Mario Mendez-Lojo (UT, Austin)
Donald Nguyen (UT, Austin)
Objectives
• What are irregular applications?
– how are they different from regular applications?
– why are they so hard to parallelize?
• Is there parallelism in irregular applications?
– operator formulation of algorithms
– amorphous data-parallelism (ADP)
– Galois programming model and baseline execution model
– ParaMeter tool for estimating ADP in Galois programs
• Is there structure in irregular algorithms?
– structural analysis of irregular algorithms
– exploiting structure for efficient parallel implementation
– Galois system
• What are the open research problems?
Schedule
• 1:30 PM-2:30 PM
– Irregular applications, amorphous data parallelism, structure
(Keshav Pingali)
• 2:30 PM-3:00 PM
– ParaMeter (Milind Kulkarni)
• 3:00 PM-3:15 PM
– Break
• 3:15 PM-4:15 PM
– Galois programming model (Mario Mendez-Lojo)
• 4:15 PM-4:50 PM
– Galois performance optimizations (Donald Nguyen)
• 4:50 PM-5:00 PM
– Research problems (Keshav Pingali)
Irregular applications
Definitions
• Regular applications
– key data structures are
• vectors
• dense matrices
– simple access patterns
• (eg) array indices are affine functions of for-loop indices
– examples:
• MMM, Cholesky & LU factorizations, stencil codes, FFT,…
• Irregular applications
– key data structures are
• lists, priority queues
• trees, DAGs, graphs
• usually implemented using pointers or references
– complex access patterns
– examples: see next slide
Examples
Application/domain: Algorithm
• AI: Bayesian inference, survey propagation
• Compilers: iterative and elimination-based dataflow algorithms
• Functional interpreters: graph reduction, static and dynamic dataflow
• Maxflow: preflow-push, augmenting paths
• Computational biology: protein homology networks, phylogenetic reconstruction
• Event-driven simulation: Chandy-Misra-Bryant, Jefferson Timewarp
• Meshing: generation/refinement/partitioning
• Social networks: connecting the dots
• Data-mining: clustering
Problem
• Handwritten parallel implementations exist for some irregular problems
• No systematic way of thinking about parallelism in irregular applications
  – no abstractions for thinking about parallelism
  – many mechanisms exist but it is not clear when these mechanisms are useful
    • static parallelization: dependence analysis and restructuring
    • inspector-executor parallelization
    • maximal independent set computation in interference graphs
    • thread-level speculation (TLS), transactional memory (TM)
    • …..
  ⇒ Each new irregular application is a “new phenomenon” seemingly unrelated to other applications
• Need a science of parallel programming
  – analysis: framework for thinking about parallelism and locality in applications
  – synthesis: guide for producing efficient parallel implementations
Science of electro-magnetism
[Figure: seemingly unrelated phenomena → unifying abstractions → specialized models that exploit structure]
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – ………
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
Seemingly unrelated algorithms
Stencil computation: Jacobi iteration
• Finite-difference method for solving PDEs
  – discrete representation of domain: grid
• Values at interior points are updated using values at neighbors
  – values at boundary points are fixed
• Data structure:
  – dense arrays
• Parallelism:
  – values at next time step can be computed simultaneously
  – parallelism is not dependent on runtime values
• Compiler can find the parallelism
  – spatial loops are DO-ALL loops

//Jacobi iteration with 5-point stencil
//initialize array A
for time = 1, nsteps
  for <i,j> in [2,n-1]x[2,n-1]
    temp(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
  for <i,j> in [2,n-1]x[2,n-1]
    A(i,j) = temp(i,j)

[Figure: Jacobi iteration, 5-point stencil; values at time tn (array A) are used to compute values at time tn+1 (array temp)]
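For concreteness, here is one possible Java rendering of the pseudocode above; it is an illustrative sketch (not code from the talk), with the array names A and temp taken from the slide and the grid size inferred from A:

final class Jacobi {
    // 5-point Jacobi sweep: interior points are updated from their four
    // neighbors; boundary values stay fixed.
    static void jacobi(double[][] A, double[][] temp, int nsteps) {
        int n = A.length;
        for (int time = 0; time < nsteps; time++) {
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    temp[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
            for (int i = 1; i < n - 1; i++)      // both spatial loops are DO-ALL loops
                for (int j = 1; j < n - 1; j++)
                    A[i][j] = temp[i][j];
        }
    }
}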
Delaunay Mesh Refinement
• Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
    Pick a bad triangle;
    Find its cavity;
    Retriangulate cavity; // may create new bad triangles
  }

• Don’t-care non-determinism:
  – final mesh depends on order in which bad triangles are processed
  – applications do not care which mesh is produced
• Data structure:
  – graph in which nodes represent triangles and edges represent triangle adjacencies
• Parallelism:
  – bad triangles with cavities that do not overlap can be processed in parallel
  – parallelism is dependent on runtime values
    • compilers cannot find this parallelism
  – (Gary Miller et al) at runtime, repeatedly build interference graph and find maximal independent sets for parallel execution

Sequential DMR pseudocode:

  Mesh m = /* read in mesh */
  WorkList wl;
  wl.add(m.badTriangles());
  while (true) {
    if (wl.empty()) break;
    Element e = wl.get();
    if (e no longer in mesh) continue;
    Cavity c = new Cavity(e); //determine new cavity
    c.expand();
    c.retriangulate();        //re-triangulate region
    m.update(c);              //update mesh
    wl.add(c.badTriangles());
  }
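The independence condition above (non-overlapping cavities) can be stated concretely. A minimal sketch, assuming each cavity is represented as the set of triangles it touches; the types are hypothetical and the check is illustrative, not the Galois implementation:

import java.util.Collections;
import java.util.Set;

// Two refinement activities are independent if their cavities (neighborhoods)
// share no triangle; such activities can be processed in parallel.
final class CavityConflicts {
    static <T> boolean independent(Set<T> cavityA, Set<T> cavityB) {
        return Collections.disjoint(cavityA, cavityB);
    }
}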
Event-driven simulation
• Stations communicate by sending messages with time-stamps on FIFO channels
• Stations have internal state that is updated when a message is processed
• Messages must be processed in time-order at each station
• Data structure:
  – messages in event-queue, sorted in time-order
• Parallelism:
  – activities created in future may interfere with current activities
    ⇒ interference graph approach will not work
  – Jefferson time-warp
    • station can fire when it has an incoming message on any edge
    • requires roll-back if speculative conflict is detected
  – Chandy-Misra-Bryant
    • conservative event-driven simulation
    • requires null messages to avoid deadlock

[Figure: network of stations A, B, C exchanging time-stamped messages (2, 3, 4, 5, 6)]
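For reference, a minimal sequential sketch of the event-driven loop described above; the Event and Station types are hypothetical, and events are drained from a single priority queue in time-stamp order:

import java.util.PriorityQueue;

// Sequential event-driven simulation: process events in global time-stamp order;
// processing an event may create events with later time-stamps.
final class EventSimulation {
    record Event(long timestamp, int station) implements Comparable<Event> {
        public int compareTo(Event other) { return Long.compare(timestamp, other.timestamp); }
    }

    interface Station {
        Iterable<Event> process(Event e);   // updates internal state, returns new events
    }

    static void run(Station[] stations, PriorityQueue<Event> queue) {
        while (!queue.isEmpty()) {
            Event e = queue.poll();                          // earliest pending event
            for (Event created : stations[e.station()].process(e)) {
                queue.add(created);                          // schedule future work
            }
        }
    }
}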
Remarks
• Diverse algorithms and data structures
– relatively few algorithms use dense arrays
– more common: graphs, trees, lists, priority queues,…
• Exploiting parallelism is very complex
– DMR: (Miller et al) interference graph + maximal independent sets
– Event-driven simulation: (Jefferson) Timewarp algorithm
• Complexity arises from several sources:
– parallelism can be dependent on runtime values
• DMR, event-driven simulation, graph reduction,….
– don’t-care non-determinism
• nothing to do with concurrency
• DMR, graph reduction
– activities created in the future may interfere with current activities
• event-driven simulation…
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – ………
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Research topics
Unifying abstractions
• Should provide a model of parallelism in
irregular algorithms
• Ideally, unified treatment of parallelism in regular
and irregular algorithms
– parallelism in regular algorithms should emerge as a
special case of general model
– (cf.) correspondence principles in Physics
• Abstractions should be effective
– should be possible to write an interpreter to execute
algorithms in parallel
Traditional abstraction
• Dependence graph
  – nodes are computations
  – edges are dependences
• Parallelism
  – width of the computation graph
  – critical path: longest path from START to END
• Effective parallel computation graph model
  – dataflow model of Dennis, Arvind et al
• Inadequate for irregular applications
  – dependences between computations are functions of runtime values
  – don’t-care non-determinism
  – conflicting work may be created dynamically
  – …
• Data structures play almost no role in this abstraction
  – in most programs, parallelism comes from data-parallelism (concurrent operations on data structure elements)
• New abstraction
  – data-centric: data structures play a central role
  – we will use graph ADT to illustrate concepts
Graph ADT
• Parallel algorithms are usually introduced using matrices and matrix algorithms
• More useful abstraction for us: graph
  – abstract data type (ADT):
    • some number of nodes and edges
    • edges may be directed
    • nodes/edges may be labeled with values
  – concrete representation
    • may be different for different applications, platforms and graphs
  – application program written in terms of ADT
    • (cf.) relational databases
• Graphs are sparse in most applications
  – even in the Jacobi example: the 2D grid is a sparse graph in which each internal node has four neighbors!
• Special graphs
  – cliques: isomorphic to square dense matrices
  – labeled graphs w/o edges: isomorphic to sets/multi-sets

[Figure: one graph ADT, multiple concrete representations]
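A minimal sketch of what such a graph ADT might look like to the application; the interface and method names are hypothetical (not the Galois library's API), and concrete representations such as adjacency lists, CSR, or grids would live behind it:

// Hypothetical Joe-level graph ADT: the application sees only nodes, edges and
// labels; the concrete representation can vary per application, platform and graph.
interface Graph<N, E> {
    Iterable<N> nodes();
    Iterable<N> neighbors(N node);            // successors if edges are directed
    boolean addEdge(N src, N dst, E label);   // false if the edge already exists
    E edgeLabel(N src, N dst);
    void removeNode(N node);                  // morph operators may rewrite the graph
}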
Operator formulation of algorithms
• Algorithm = repeated application of operator to graph
  – active element:
    • node or edge where computation is needed
      – DMR: nodes representing bad triangles
      – Event-driven simulation: station with incoming message
  – activity: applying operator to active element
  – neighborhood:
    • set of nodes and edges read/written to perform computation
      – DMR: cavity of bad triangle
      – Event-driven simulation: station
    • usually distinct from neighbors in graph
  – ordering:
    • order in which active elements must be executed in a sequential implementation
      – any order (Jacobi, DMR, graph reduction)
        » unordered algorithms
      – some problem-dependent order (event-driven simulation)
        » ordered algorithms

[Figure: graph with active nodes i1 through i5; each active node is surrounded by its neighborhood]
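The operator formulation can be summarized as a generic worklist loop. A sketch with hypothetical names (Operator, run); for unordered algorithms the worklist may hand back elements in any order, while ordered algorithms would need a priority order instead:

import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;

// Generic sequential shape of an algorithm in the operator formulation:
// repeatedly take an active element and apply the operator to its neighborhood;
// the operator may create new active elements (e.g. new bad triangles in DMR).
final class OperatorLoop {
    interface Operator<T> {
        Collection<T> apply(T activeElement);   // touches only the element's neighborhood
    }

    static <T> void run(Collection<T> initialActive, Operator<T> op) {
        Deque<T> worklist = new ArrayDeque<>(initialActive);
        while (!worklist.isEmpty()) {
            T active = worklist.poll();          // unordered: any element may be picked
            worklist.addAll(op.apply(active));   // newly created active elements
        }
    }
}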
Parallelism
• Amorphous data-parallelism (ADP)
  – Active nodes can be processed in parallel, subject to
    • neighborhood constraints
    • ordering constraints
• How do we make this model effective?
• Unordered algorithms
  – Activities are independent if their neighborhoods do not overlap
  – Independent activities can be processed in parallel
  – More generally,
    • elements in the intersection of neighborhoods are not written (see the sketch below)
    • commutativity relations
  – How do we find independent activities?
• Ordered algorithms
  – Independence is not enough
  – How do we determine what is safe to execute w/o violating ordering?

[Figures: active nodes i1 through i5 with overlapping neighborhoods; stations A, B, C with pending time-stamped messages]
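The more general condition above ("elements in the intersection of neighborhoods are not written") can be phrased as a read/write-set test. A minimal sketch, with hypothetical per-activity read and write sets:

import java.util.Collections;
import java.util.Set;

// Two activities are independent if neither writes an element the other reads
// or writes; elements that are only read may be shared without conflict.
final class IndependenceTest {
    static <E> boolean independent(Set<E> readsA, Set<E> writesA,
                                   Set<E> readsB, Set<E> writesB) {
        return Collections.disjoint(writesA, readsB)
            && Collections.disjoint(writesA, writesB)
            && Collections.disjoint(writesB, readsA);
    }
}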
Galois programming model (PLDI 2007)
• Program written in terms of abstractions in model
• Two-level programming model
• Joe programmers
  – sequential, OO model
  – Galois set iterators: for iterating over unordered and ordered sets of active elements
    • for each e in Set S do B(e)
      – evaluate B(e) for each element in set S
      – no a priori order on iterations
      – set S may get new elements during execution
    • for each e in OrderedSet S do B(e)
      – evaluate B(e) for each element in set S
      – perform iterations in order specified by OrderedSet
      – set S may get new elements during execution
• Stephanie programmers
  – Galois concurrent data structure library
• (Wirth) Algorithms + Data structures = Programs

DMR using Galois iterators:

  Mesh m = /* read in mesh */
  Set ws;
  ws.add(m.badTriangles()); // initialize ws

  for each tr in Set ws do { //unordered Set iterator
    if (tr no longer in mesh) continue;
    Cavity c = new Cavity(tr);
    c.expand();
    c.retriangulate();
    m.update(c);
    ws.add(c.badTriangles()); //bad triangles
  }
Galois parallel execution model
• Parallel execution model:
  – shared-memory
  – optimistic execution of Galois iterators
• Implementation:
  – master thread begins execution of program
  – when it encounters iterator, worker threads help by executing iterations concurrently
  – several work-distribution strategies implemented in Galois (cf. OpenMP)
  – barrier synchronization at end of iterator
• Independence of neighborhoods:
  – software TM variety
  – logical locks on nodes and edges
• Ordering constraints for ordered set iterator:
  – execute iterations out of order but commit in order
  – cf. out-of-order CPUs

[Figure: master thread runs the Joe program; when it reaches a “for each” iterator, worker threads execute iterations i1 through i5 on the concurrent data structure]
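A highly simplified sketch of the optimistic scheme just described; the ConflictException, undo log, lock-release and re-queuing details are hypothetical stand-ins rather than the Galois runtime:

import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Optimistic iteration execution: a logical lock is acquired for each node/edge
// the iteration touches; if a lock is already held by another iteration, the
// iteration's updates are rolled back and it is retried later.
final class OptimisticIteration {
    static final class ConflictException extends RuntimeException {}

    // Called by the iteration body before touching a graph element.
    static void acquire(ReentrantLock elementLock) {
        if (!elementLock.tryLock()) throw new ConflictException();
    }

    // undoLog holds inverse actions recorded by the iteration body as it runs.
    static boolean execute(Runnable body, Deque<Runnable> undoLog) {
        try {
            body.run();
            return true;                        // commit: keep the updates
        } catch (ConflictException e) {
            while (!undoLog.isEmpty()) {
                undoLog.pop().run();            // roll back in reverse order
            }
            return false;                       // retry the iteration later
        }
    }
}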
ParaMeter tool (PPoPP 2009)
• Idealized execution model:
– unbounded number of processors
– applying operator at active node takes one time step
– execute a maximal set of active nodes, subject to
neighborhood and ordering constraints
• Measures amorphous data-parallelism in
irregular program execution
• Useful as an analysis tool
• Milind will talk about this later
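A simplified sketch of the idealized measurement described above, assuming a hypothetical conflictsWith test: at each step a maximal set of mutually non-conflicting active nodes is selected and counted, giving a per-step parallelism profile (new work created by executed nodes is ignored here to keep the sketch short):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BiPredicate;

// Per step, greedily pick a maximal set of mutually non-conflicting active nodes
// and record its size; conflicting nodes are deferred to the next step.
final class ParallelismProfile {
    static <T> List<Integer> profile(Deque<T> active, BiPredicate<T, T> conflictsWith) {
        List<Integer> perStep = new ArrayList<>();
        while (!active.isEmpty()) {
            List<T> step = new ArrayList<>();
            Deque<T> deferred = new ArrayDeque<>();
            for (T a : active) {
                boolean independent = step.stream().noneMatch(b -> conflictsWith.test(a, b));
                if (independent) step.add(a); else deferred.add(a);
            }
            perStep.add(step.size());    // amorphous data-parallelism available this step
            active = deferred;
        }
        return perStep;
    }
}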
Example: DMR
• Input mesh:
– Produced by Triangle
(Shewchuk)
– 550K triangles
– Roughly half are badly
shaped
• Available parallelism:
– How many non-conflicting
triangles can be expanded
at each time step?
• Parallelism intensity:
– What fraction of the total
number of bad triangles
can be expanded at each
step?
Summary of ADP abstraction
• Old abstraction: dependence graphs
  – inadequate for most irregular algorithms
• New abstraction: operator formulation
  – active elements
  – neighborhoods
  – ordering of active elements
• Baseline execution model
  – Galois programming model
    • sequential, OO
    • uses abstractions in model
  – optimistic parallel execution
  – conflict detection scheme: many choices
    • exclusive locks
    • read/write locks
    • commutativity relations
• Amorphous data-parallelism
  – generalizes conventional data-parallelism

[Figure: active nodes i1 through i5 with their neighborhoods]
Organization of talk
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – Graph reduction of functional languages
  – ………
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
• Ongoing work
Structure in irregular algorithms
• Baseline implementation is general but usually inefficient
  – (eg) dynamic scheduling of iterations is not needed for stencil codes since grid structure is known at compile-time
  – (eg) hand-written parallel implementations of DMR and stencil codes do not buffer updates to neighborhood until commit point
• Efficient execution requires exploiting structure in algorithms and data structures
• How do we talk about structure in algorithms?
  – Previous approaches: like descriptive biology
    • Mattson et al book
    • Parallel programming patterns (PPP): Snir et al
    • Berkeley motifs: Patterson, Yelick, et al
    • …
  – Our approach: like molecular biology
    • structural analysis of algorithms
    • based on amorphous data-parallelism framework
Structural analysis of irregular algorithms
Irregular algorithms can be classified along three dimensions:
• topology: general graph, grid, tree
• operator:
  – morph: refinement, coarsening, general
  – local computation: topology-driven, data-driven
  – reader
• ordering: unordered, ordered

Examples:
• Jacobi: topology: grid, operator: local computation, ordering: unordered
• DMR, graph reduction: topology: graph, operator: morph, ordering: unordered
• Event-driven simulation: topology: graph, operator: local computation, ordering: ordered
Exploiting structure to reduce overheads
Cautious operators (PPoPP 2010)
• Cautious operator implementation:
– reads all the elements in its neighborhood
before modifying any of them
– (eg) Delaunay mesh refinement
• Algorithm structure:
– cautious operator + unordered active
elements
• Optimization: optimistic execution w/o
buffering
– grab locks on elements during read phase
• conflict: someone else has lock, so release
your locks
– once update phase begins, no new locks
will be acquired
• update in-place w/o making copies
• zero-buffering
– note: this is not two-phase locking
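A sketch of the zero-buffering optimization just described, with hypothetical lock and element types; the neighborhood is simplified to an upfront list of locks, whereas in the actual scheme locks are acquired as the read phase discovers elements:

import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Cautious operator, zero-buffering: all neighborhood locks are taken during the
// read phase; a conflict releases them and retries; the update phase then
// modifies the graph in place, with no copies and no undo log.
final class CautiousExecution {
    static boolean apply(List<ReentrantLock> neighborhoodLocks, Runnable update) {
        int acquired = 0;
        for (ReentrantLock lock : neighborhoodLocks) {        // read phase
            if (!lock.tryLock()) {                            // someone else holds a lock
                for (int i = 0; i < acquired; i++) neighborhoodLocks.get(i).unlock();
                return false;                                 // release everything, retry later
            }
            acquired++;
        }
        update.run();                                         // update phase: in-place
        for (ReentrantLock lock : neighborhoodLocks) lock.unlock();
        return true;
    }
}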
Graph partitioning (ASPLOS 2008)
[Figure: graph and work-set partitioned across cores]
• Algorithm structure:
  – general graph/grid + unordered active elements
• Optimization I:
  – partition the graph/grid and work-set between cores
  – data-centric work assignment: core gets active elements from its own partition
• Pros and cons:
  – eliminates contention for worklist
  – improves locality and can dramatically reduce conflicts
  – dynamic load-balancing may be needed
• Optimization II:
  – lock coarsening: associate logical locks with partitions, not graph elements
  – reduces overhead of lock management
• Over-decomposition may improve core utilization
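A sketch of data-centric work assignment combined with lock coarsening; the partition maps and types are hypothetical, and handling of activities whose neighborhoods cross partition boundaries is omitted:

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Each core drains the worklist of its own graph partition, and one logical lock
// covers a whole partition instead of individual nodes and edges.
final class PartitionedExecution<T> {
    private final Map<Integer, Queue<T>> workPerPartition;    // partition id -> local worklist
    private final Map<Integer, ReentrantLock> partitionLocks; // one lock per partition

    PartitionedExecution(Map<Integer, Queue<T>> work, Map<Integer, ReentrantLock> locks) {
        this.workPerPartition = work;
        this.partitionLocks = locks;
    }

    // Run by the thread that owns this partition; a neighborhood that crosses into
    // another partition must also take that partition's lock (not shown).
    void drain(int partition, Consumer<T> operator) {
        Queue<T> local = workPerPartition.get(partition);
        ReentrantLock lock = partitionLocks.get(partition);
        for (T active; (active = local.poll()) != null; ) {
            lock.lock();
            try { operator.accept(active); } finally { lock.unlock(); }
        }
    }
}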
Eliminating speculation
• Coordinated execution of activities: possible if we can build the dependence graph
  – Run-time scheduling (see the sketch below):
    • cautious operator + unordered active elements
    • execute all activities partially to determine neighborhoods
    • create interference graph and find independent set of activities
    • execute independent set of activities in parallel w/o synchronization
  – Just-in-time scheduling:
    • local computation + topology-driven (eg) tree walks, sparse MVM
    • inspector-executor approach
  – Compile-time scheduling:
    • previous case + graph is known at compile-time (eg) Jacobi, FFT, MMM, Cholesky
    • make all scheduling decisions at compile-time
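A sketch of the run-time scheduling approach listed above, with a hypothetical neighborhoodOf function: neighborhoods are discovered by partial (read-phase) execution, and a greedy independent set of non-overlapping activities is selected for synchronization-free parallel execution:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Greedily keep activities whose neighborhoods do not overlap any previously
// chosen activity; remaining activities go into the next scheduling round.
final class RuntimeScheduling {
    static <A, E> List<A> independentSet(List<A> activities, Function<A, Set<E>> neighborhoodOf) {
        List<A> chosen = new ArrayList<>();
        List<Set<E>> neighborhoods = new ArrayList<>();
        for (A a : activities) {
            Set<E> n = neighborhoodOf.apply(a);       // partial execution: read phase only
            boolean independent = neighborhoods.stream().allMatch(m -> Collections.disjoint(n, m));
            if (independent) { chosen.add(a); neighborhoods.add(n); }
        }
        return chosen;   // can be executed in parallel without synchronization
    }
}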
DMR Results
• 2 x Xeon X5570 (4 cores, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20 GB heap size
• 0.5M triangles, 0.25M bad triangles
• Best serial: locallifo.flagopt
• Serial time: 17002 ms
• Best parallel time: 3745 ms
• Best speedup: 4.5X

Mario and Donald will talk about the Galois programming model and optimizations.
FAQs
• How is this different from TM?
  – Programming model:
    • TM: explicitly parallel (threads)
      – transactions: synchronization mechanism for threads
      – mostly memory-level conflict detection + boosting
    • Galois: Joe programs are sequential OO programs
      – ADT-level conflict detection
  – Where do threads come from?
    • TM: someone else’s problem
    • Galois project: focus on
      – sources of parallelism in algorithms
      – structure of parallelism in algorithms
  – (Wirth) Algorithms + Data structures = Programs
    • Galois: focus on algorithms and algorithmic structure
    • TM: focus on concurrent data structures
  – Galois: mechanisms customized to algorithm structure
    • Few algorithms require full-blown speculation/optimistic parallelization
FAQs (contd.)
• How is this different from TLS?
– Programming model:
• Galois: separation between ADT and its implementation is critical
– permits separation of Joe and Stephanie layers (cf. relational databases)
– permits more aggressive conflict detection schemes like commutativity relations
• TLS: FORTRAN/C, so no separation between ADT and implementation
– Programming model:
• Galois: don’t-care non-determinism plays a central role
• TLS: FORTRAN/C, so only ordered algorithms
– Exploiting algorithmic structure:
• Galois: exploits algorithmic structure to
– reduce speculative overheads
– focus speculation only where needed
Summary
• Seemingly unrelated parallel algorithms and data structures
  – Stencil codes
  – Delaunay mesh refinement
  – Event-driven simulation
  – ………
• Unifying abstractions
  – Operator formulation of algorithms
  – Amorphous data-parallelism
  – Galois programming model
  – Baseline parallel implementation
• Specialized implementations that exploit structure
  – Structure of algorithms
  – Optimized compiler and runtime system support for different kinds of structure
Schedule
• 1:30 PM-2:30 PM
– Irregular algorithms, amorphous data parallelism, structure
(Keshav Pingali)
• 2:30 PM-3:00 PM
– ParaMeter (Milind Kulkarni)
• 3:00 PM-3:15 PM
– Break
• 3:15 PM-4:15 PM
– Galois programming model (Mario Mendez-Lojo)
• 4:15 PM-4:50 PM
– Galois performance optimizations (Donald Nguyen)
• 4:50 PM-5:00 PM
– Research problems (Keshav Pingali)
Research ideas
• Application studies:
  – Lonestar benchmark suite for irregular programs
    http://iss.ices.utexas.edu/lonestar/
• Development of ADP model
  – locality
  – intra-operator parallelism
• Identifying other algorithmic structure useful in parallel implementations
  – (eg) monotonically growing graphs (see Mario’s paper in OOPSLA 2010)
• Language/compiler analysis
  – how do you find/communicate structure to the implementation?
  – (eg) use program analysis to find cautious operators
• Specializing data structure implementations to particular algorithms
  – can this be done semi-automatically?
  – programmer writes sequential implementation of ADT and tool generates thread-safe version for use in Galois program?
Research ideas (contd.)
• Auto-tuning
  – concrete implementations for graphs
  – conflict detection strategies
  – scheduling strategies
  – contention-management policies
  – …..
• Load-balancing
  – need “data-centric work-stealing”
  – move partitions, not active nodes
  – see Milind’s paper in SPAA 2010
• Understanding performance
• Architecture
  – existing core designs use lots of real estate for caches
  – may not be well-suited for irregular applications:
    • little spatial and temporal locality
    • relatively small amount of computation per memory access
  – one idea:
    • use multi-threading to hide latency
    • NVIDIA GPU-like architecture for irregular algorithms?
• Hardware-software codesign
  – can we generate custom cores for particular applications to exploit ADP?
Science of Parallel Programming
[Figure: seemingly unrelated algorithms → unifying abstractions → specialized models that exploit structure]