slides - UCSB Computer Science
Computation on meshes, sparse matrices, and graphs
Some slides are from David Culler,
Jim Demmel, Bob Lucas, Horst Simon,
Kathy Yelick, et al., UCB CS267
Parallelizing Stencil Computations
• Parallelism is simple
• Grid is a regular data structure
• Even decomposition across processors gives load balance
• Spatial locality limits communication cost
• Communicate only boundary values from neighboring patches
• Communication volume
• v = total # of boundary cells between patches
Two-dimensional block decomposition
• n mesh cells, p processors
• Each processor has a patch of n/p cells
• Block row (or block col) layout: v = 2 * p * sqrt(n)
• 2-dimensional block layout: v = 4 * sqrt(p) * sqrt(n)
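The two volume formulas above can be compared directly. A minimal sketch (function names are illustrative, not from the slides):

```python
import math

def comm_volume_1d(n, p):
    # Block-row layout: p patches, each exchanging boundary rows of
    # sqrt(n) cells, giving v = 2 * p * sqrt(n) boundary cells total.
    return 2 * p * math.sqrt(n)

def comm_volume_2d(n, p):
    # 2D block layout: each patch has 4 sides of sqrt(n/p) cells,
    # giving v = 4 * sqrt(p) * sqrt(n) boundary cells total.
    return 4 * math.sqrt(p) * math.sqrt(n)

# For p > 4 the 2D layout communicates less:
for p in (4, 16, 64):
    print(p, comm_volume_1d(1_000_000, p), comm_volume_2d(1_000_000, p))
```

For p = 4 the two layouts tie; the 2D advantage grows like sqrt(p).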
Ghost Nodes in Stencil Computations
Comm cost = α * (#messages) + β * (total size of messages)
[Figure: green = my interior nodes; blue = my boundary nodes; yellow = neighbors’ boundary nodes = my “ghost nodes”]
• Keep a ghost copy of neighbors’ boundary nodes
• Communicate every second iteration, not every iteration
• Reduces #messages, not total size of messages
• Costs extra memory and computation
• Can also use more than one layer of ghost nodes
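A minimal serial sketch of the ghost-node idea, assuming a 1D 3-point averaging stencil split into two patches (one ghost layer, one exchange per step; the two-layer, every-second-iteration variant works the same way with deeper ghosts):

```python
def exchange_ghosts(left, right):
    # Each patch keeps a ghost copy of its neighbor's boundary node.
    left[-1] = right[1]      # left's right ghost <- right's first owned node
    right[0] = left[-2]      # right's left ghost <- left's last owned node

def stencil_step(patch):
    # 3-point averaging over the patch; endpoints are ghosts or
    # fixed global boundaries and are not updated.
    old = patch[:]
    for i in range(1, len(patch) - 1):
        patch[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0

data = [float(i * i) for i in range(8)]   # global mesh values
left = data[:4] + [0.0]                   # P0's patch + right ghost slot
right = [0.0] + data[4:]                  # left ghost slot + P1's patch
exchange_ghosts(left, right)
stencil_step(left)
stencil_step(right)
combined = left[:-1] + right[1:]          # drop ghosts, reassemble

# Serial reference on the whole mesh, for comparison
ref = data[:]
old = ref[:]
for i in range(1, len(ref) - 1):
    ref[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
```

After one exchange and one step, the reassembled patches match the serial computation exactly.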
Parallelism in Regular Meshes
• Computing a stencil on a regular mesh
• Need to communicate mesh points near the boundary to neighboring processors
• Often done with ghost regions
• Surface-to-volume ratio keeps communication down, but still may be problematic in practice
[Figure: implemented using “ghost” regions; adds memory overhead]
Irregular mesh: NASA Airfoil in 2D
Composite Mesh from a Mechanical Structure
Adaptive Mesh Refinement (AMR)
• Adaptive mesh around an explosion
• Refinement done by calculating errors
Adaptive Mesh
Shock waves in gas dynamics using AMR (Adaptive Mesh Refinement). See: http://www.llnl.gov/CASC/SAMRAI/
Irregular mesh: Tapered Tube (Multigrid)
Challenges of Irregular Meshes for PDEs
• How to generate them in the first place
• For example, Triangle, a 2D mesh generator (Shewchuk)
• 3D mesh generation is harder! For example, QMD (Vavasis)
• How to partition them into patches
• For example, ParMetis, a parallel graph partitioner (Karypis)
• How to design iterative solvers
• For example, PETSc, Aztec, Hypre (all from national labs)
• … Prometheus, a multigrid solver for finite elements on irregular meshes
• How to design direct solvers
• For example, SuperLU, parallel sparse Gaussian elimination
• These are challenging even sequentially, and more so in parallel
Converting the Mesh to a Matrix
Sparse matrix data structure (stored by rows)

Example matrix:
  31   0  53
   0  59   0
  41  26   0

• Full:
• 2-dimensional array of real or complex numbers
• (nrows*ncols) memory

CSR arrays for the example (1-based column indices):
  value:    31  53  59  41  26
  colindex:  1   3   2   1   2

• Sparse:
• compressed row storage
• about (2*nzs + nrows) memory
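A short sketch of how the CSR arrays above are built from the dense example (the code uses 0-based column indices, whereas the slide shows 1-based; a `rowstart` array marks where each row's nonzeros begin):

```python
# Build compressed sparse row (CSR) storage for the 3x3 example matrix.
A = [[31,  0, 53],
     [ 0, 59,  0],
     [41, 26,  0]]

value, colindex, rowstart = [], [], [0]
for row in A:
    for j, a in enumerate(row):
        if a != 0:
            value.append(a)      # nonzero values, row by row
            colindex.append(j)   # 0-based column of each nonzero
    rowstart.append(len(value))  # row i occupies value[rowstart[i]:rowstart[i+1]]

print(value)     # [31, 53, 59, 41, 26]
print(colindex)  # [0, 2, 1, 0, 1]
# Memory: 2*nzs words (value + colindex) plus about nrows (rowstart),
# versus nrows*ncols for the full representation.
```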
Distributed row sparse matrix data structure

[Figure: matrix rows partitioned among processors P0, P1, P2, …, Pn]

Each processor stores:
• # of local nonzeros
• range of local rows
• nonzeros in CSR form
Conjugate gradient iteration to solve A*x = b

x_0 = 0, r_0 = b, d_0 = r_0
for k = 1, 2, 3, . . .
    α_k = (r_{k-1}ᵀ r_{k-1}) / (d_{k-1}ᵀ A d_{k-1})    … step length
    x_k = x_{k-1} + α_k d_{k-1}                        … approx solution
    r_k = r_{k-1} − α_k A d_{k-1}                      … residual
    β_k = (r_kᵀ r_k) / (r_{k-1}ᵀ r_{k-1})              … improvement
    d_k = r_k + β_k d_{k-1}                            … search direction

• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage
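The iteration above translates almost line-for-line into code. A serial sketch in plain Python (a small dense SPD matrix stands in for the sparse A; helper names are illustrative):

```python
def matvec(A, x):
    # One matrix-vector multiplication per iteration
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def axpy(a, x, y):
    # a*x + y, elementwise
    return [a * xi + yi for xi, yi in zip(x, y)]

def cg(A, b, iters=25):
    x = [0.0] * len(b)          # x_0 = 0
    r = b[:]                    # r_0 = b
    d = r[:]                    # d_0 = r_0
    for _ in range(iters):
        rr = dot(r, r)
        if rr == 0.0:           # exact convergence; stop early
            break
        Ad = matvec(A, d)
        alpha = rr / dot(d, Ad)          # step length
        x = axpy(alpha, d, x)            # approx solution
        r = axpy(-alpha, Ad, r)          # residual
        beta = dot(r, r) / rr            # improvement
        d = axpy(beta, d, r)             # search direction
    return x

A = [[4.0, 1.0], [1.0, 3.0]]    # symmetric positive definite
b = [1.0, 2.0]
x = cg(A, b)                    # exact solution is (1/11, 7/11)
```

In exact arithmetic CG converges in at most n iterations; here n = 2.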
Parallel Vector Operations
• Suppose x, y, and z are column vectors
• Data is partitioned among all processors
• Vector add: z = x + y
• Embarrassingly parallel
• Scaled vector add (DAXPY): z = α*x + y
• Broadcast the scalar α, then independent * and +
• Vector dot product (DDOT): β = xᵀy = Σ_j x[j] * y[j]
• Independent *, then + reduction
Broadcast and reduction
• Broadcast of 1 value to p processors in log p time
[Figure: tree broadcast of a scalar α]
• Reduction of p values to 1 in log p time
• Takes advantage of associativity in +, *, min, max, etc.
[Figure: add-reduction of 1 3 1 0 4 -6 3 2]
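The log p bound comes from halving the number of active values each round. A sketch of the add-reduction in the figure, assuming p is a power of two:

```python
def add_reduce(values):
    # Pairwise tree reduction: in each round, "processor" i adds in
    # the value from partner i + half, halving the active set.
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        half = len(vals) // 2
        vals = [vals[i] + vals[i + half] for i in range(half)]
        rounds += 1
    return vals[0], rounds

total, rounds = add_reduce([1, 3, 1, 0, 4, -6, 3, 2])
print(total, rounds)   # sum of 8 values in log2(8) = 3 rounds
```

The same tree works for any associative operation (*, min, max, …).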
Parallel Dense Matrix-Vector Product (Review)
• y = A*x, where A is a dense matrix
[Figure: A partitioned into block rows among P0–P3; x and y partitioned the same way]
• Layout: 1D by rows
• Algorithm:
    For each processor i
        Broadcast X(i)
        Compute Y(i) = A(i)*X
• A(i) is the n/p by n block row that processor Pi owns
• Algorithm uses the formula
    Y(i) = A(i)*X = Σ_j A(i,j)*X(j)
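A serial simulation of the 1D row layout (names are illustrative): each "processor" owns a block row A(i) and a piece X(i); after the broadcasts, every processor has all of x and computes its own piece of y.

```python
def matvec_1d_rows(A_blocks, x_blocks):
    # Broadcast phase: every processor ends up with all of x.
    x = [xj for blk in x_blocks for xj in blk]
    # Compute phase: processor i forms Y(i) = A(i) * x locally.
    return [[sum(a * b for a, b in zip(row, x)) for row in A_i]
            for A_i in A_blocks]

A = [[1,  2,  3,  4],
     [5,  6,  7,  8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
A_blocks = [A[0:2], A[2:4]]      # two processors, 2 rows each
x_blocks = [[1, 0], [0, 1]]      # x distributed the same way
y = matvec_1d_rows(A_blocks, x_blocks)
```

Concatenating the pieces of y gives the usual serial A*x.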
Parallel sparse matrix-vector product
• Lay out matrix and vectors by rows
• y(i) = sum(A(i,j)*x(j))
• Only compute terms with A(i,j) ≠ 0
[Figure: A, x, and y partitioned by rows among P0–P3]
• Algorithm:
    Each processor i:
        Broadcast x(i)
        Compute y(i) = A(i,:)*x
• Optimizations
• Only send each proc the parts of x it needs, to reduce comm
• Reorder matrix for better locality by graph partitioning
• Worry about balancing number of nonzeros / processor, if rows have very different nonzero counts
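The first optimization can be sketched directly: each processor inspects the column indices of its local nonzeros and fetches only those entries of x, instead of a full broadcast (the data layout and names below are illustrative):

```python
def spmv_rows(local_rows, x):
    # local_rows: this processor's rows, each a list of (col, value)
    # pairs (a CSR-like sparse format).
    needed = {j for row in local_rows for j, _ in row}
    x_ghost = {j: x[j] for j in needed}   # the only x entries communicated
    y = [sum(v * x_ghost[j] for j, v in row) for row in local_rows]
    return y, len(needed)

# A 4x4 sparse matrix, rows split between two "processors"
rows = [[(0, 2.0), (3, 1.0)],
        [(1, 4.0)],
        [(0, 1.0), (2, 3.0)],
        [(3, 5.0)]]
x = [1.0, 2.0, 3.0, 4.0]
y0, recv0 = spmv_rows(rows[:2], x)   # P0 only needs x[0], x[1], x[3]
y1, recv1 = spmv_rows(rows[2:], x)   # P1 only needs x[0], x[2], x[3]
```

The sparser the matrix, the larger the saving over broadcasting all of x.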
Other memory layouts for matrix-vector product
• Column layout of the matrix eliminates the broadcast
• But adds a reduction to update the destination – same total comm
• Blocked layout uses a broadcast and reduction, both on only sqrt(p) processors – less total comm
• Blocked layout has advantages in multicore / Cilk++ too
[Figure: 1D row layout (P0–P3), 1D column layout (P0–P3), and 2D blocked layout (P0–P15)]
Graphs and Sparse Matrices
• Sparse matrix is a representation of a (sparse) graph
[Figure: a 6-vertex graph and the corresponding 6×6 sparse matrix of edge weights]
• Matrix entries are edge weights
• Number of nonzeros per row is the vertex degree
• Edges represent data dependencies in matrix-vector multiplication
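The correspondence is easy to make concrete. A sketch with a small illustrative graph (edge weights all 1, stored by rows as a dict per vertex):

```python
# An undirected graph as a row-stored sparse adjacency matrix.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
adj = [dict() for _ in range(n)]   # row u -> {col v: weight}
for u, v in edges:
    adj[u][v] = 1                  # matrix entries are edge weights
    adj[v][u] = 1                  # symmetric: undirected graph

degree = [len(row) for row in adj]  # nonzeros per row = vertex degree
print(degree)   # [2, 2, 3, 1]
```

In y = A*x, computing y[u] reads x[v] for every neighbor v of u, which is exactly the data-dependency statement in the last bullet.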
Graph partitioning
• Assigns subgraphs to processors
• Determines parallelism and locality
• Tries to make subgraphs all the same size (load balance)
• Tries to minimize edge crossings (communication)
• Exact minimization is NP-complete
[Figure: two partitions of the same graph, with edge crossings = 6 and edge crossings = 10]
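Counting edge crossings for a given partition is straightforward; it is choosing the partition that minimizes them (subject to balance) that is NP-complete. A sketch on a small illustrative graph:

```python
def edge_crossings(edges, part):
    # An edge "crosses" when its endpoints land on different processors;
    # each crossing is a communication in matrix-vector product.
    return sum(1 for u, v in edges if part[u] != part[v])

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
good = {0: 0, 1: 0, 2: 1, 3: 1}   # cuts (1,2), (3,0), (0,2): 3 crossings
bad  = {0: 0, 1: 1, 2: 0, 3: 1}   # cuts (0,1), (1,2), (2,3), (3,0): 4
```

Partitioners such as ParMetis search heuristically for low-crossing, balanced partitions.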
Link analysis of the web
[Figure: a 7-page web graph and its 7×7 link matrix]
• Web page = vertex
• Link = directed edge
• Link matrix: Aij = 1 if page i links to page j
Web graph: PageRank (Google)
[Brin, Page]
An important page is one that many important pages point to.
• Markov process: follow a random link most of the time; otherwise, go to any page at random.
• Importance = stationary distribution of Markov process.
• Transition matrix is p*A + (1-p)*ones(size(A)), scaled so each column sums to 1.
• Importance of page i is the i-th entry in the principal eigenvector of the transition matrix.
• But the matrix is 1,000,000,000,000 by 1,000,000,000,000.
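The stationary distribution is found by the power method: repeated matrix-vector products, which is why the sparse-matvec machinery above matters at this scale. A sketch on a tiny link matrix (this uses the equivalent "distribute rank along outgoing links" formulation rather than forming the dense transition matrix; p is the probability of following a link):

```python
def pagerank(A, p=0.85, iters=100):
    # A[i][j] = 1 if page i links to page j (assumes no dangling pages).
    n = len(A)
    rank = [1.0 / n] * n
    out = [sum(row) for row in A]            # outdegree of each page
    for _ in range(iters):
        new = [(1.0 - p) / n] * n            # random-jump part
        for i in range(n):
            for j in range(n):
                if A[i][j]:
                    new[j] += p * rank[i] / out[i]   # follow link i -> j
        rank = new
    return rank

A = [[0, 1, 1],    # page 0 links to pages 1 and 2
     [0, 0, 1],    # page 1 links to page 2
     [1, 0, 0]]    # page 2 links back to page 0
rank = pagerank(A)
```

Page 2, pointed to by both other pages, ends up most important; the ranks sum to 1, as a stationary distribution must.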
A Page Rank Matrix
• Importance ranking of web pages
• Stationary distribution of a Markov chain
• Power method: matvec and vector arithmetic
• Matlab*P page ranking demo (from SC’03) on a web crawl of mit.edu (170,000 pages)
Social Network Analysis in Matlab: 1993
Co-author graph from 1993 Householder symposium
Social Network Analysis in Matlab: 1993
Sparse Adjacency Matrix
Which author has the most collaborators?
>> [count,author] = max(sum(A))
count = 32
author = 1
>> name(author,:)
ans = Golub
Social Network Analysis in Matlab: 1993
Have Gene Golub and Cleve Moler ever been coauthors?
>> A(Golub,Moler)
ans = 0
No.
But how many coauthors do they have in common?
>> AA = A^2;
>> AA(Golub,Moler)
ans = 2
And who are those common coauthors?
>> name( find ( A(:,Golub) .* A(:,Moler) ), :)
ans =
Wilkinson
VanLoan
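The same queries sketched in Python with a dict-of-sets adjacency structure (the author names and links below are illustrative, not the real 1993 Householder data):

```python
# Illustrative co-author adjacency: author -> set of co-authors.
coauthors = {
    "Golub":     {"Wilkinson", "VanLoan", "Demmel"},
    "Moler":     {"Wilkinson", "VanLoan"},
    "Wilkinson": {"Golub", "Moler"},
    "VanLoan":   {"Golub", "Moler"},
    "Demmel":    {"Golub"},
}

# Ever been coauthors?  Matches the A(Golub, Moler) test.
direct = "Moler" in coauthors["Golub"]

# Common coauthors: the nonzeros of A(:,Golub) .* A(:,Moler);
# equivalently, AA = A^2 counts the length-2 paths between them.
common = coauthors["Golub"] & coauthors["Moler"]
print(direct, sorted(common))
```

The set intersection plays the role of the elementwise product of the two matrix columns, and its size is the corresponding entry of A^2.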