CS 290H Administrivia: May 14, 2008
• Course project progress reports due next Wed 21 May.
• Reading in Saad (second edition): Sections 13.1-13.5
Conjugate gradient iteration
x_0 = 0,  r_0 = b,  d_0 = r_0
for k = 1, 2, 3, . . .
    α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    (step length)
    x_k = x_{k-1} + α_k d_{k-1}                          (approx solution)
    r_k = r_{k-1} - α_k A d_{k-1}                        (residual)
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              (improvement)
    d_k = r_k + β_k d_{k-1}                              (search direction)
• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage
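The loop above translates almost line for line into NumPy. A minimal sketch (dense A for clarity, fixed iteration count, names illustrative); note that x, r, d, and Ad are exactly the four n-vectors of working storage:

import numpy as np

def cg(A, b, iters=50):
    # Conjugate gradient for symmetric positive definite A (minimal sketch).
    x = np.zeros_like(b)        # x_0 = 0
    r = b.copy()                # r_0 = b
    d = r.copy()                # d_0 = r_0
    rr = r @ r
    for _ in range(iters):
        Ad = A @ d                     # the one matrix-vector product
        alpha = rr / (d @ Ad)          # step length (dot product 1 of 2)
        x += alpha * d                 # approx solution
        r -= alpha * Ad                # residual
        rr_new = r @ r                 # dot product 2 of 2
        beta = rr_new / rr             # improvement
        d = r + beta * d               # search direction
        rr = rr_new
    return x

# e.g.: cg(np.array([[4., 1.], [1., 3.]]), np.array([1., 2.]))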
Preconditioned conjugate gradient iteration
x_0 = 0,  r_0 = b,  d_0 = B^{-1} r_0,  y_0 = B^{-1} r_0
for k = 1, 2, 3, . . .
    α_k = (y_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    (step length)
    x_k = x_{k-1} + α_k d_{k-1}                          (approx solution)
    r_k = r_{k-1} - α_k A d_{k-1}                        (residual)
    y_k = B^{-1} r_k                                     (preconditioning solve)
    β_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})              (improvement)
    d_k = y_k + β_k d_{k-1}                              (search direction)
• One matrix-vector multiplication per iteration
• One solve with preconditioner per iteration
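Relative to plain CG only a few lines change. A minimal NumPy sketch, where Bsolve stands for whatever routine applies B^{-1} (e.g. the triangular solves of an incomplete factorization; the name is illustrative):

import numpy as np

def pcg(A, b, Bsolve, iters=50):
    # Preconditioned conjugate gradient (minimal sketch).
    x = np.zeros_like(b)        # x_0 = 0
    r = b.copy()                # r_0 = b
    y = Bsolve(r)               # y_0 = B^{-1} r_0
    d = y.copy()                # d_0 = B^{-1} r_0
    yr = y @ r
    for _ in range(iters):
        Ad = A @ d                     # one matrix-vector product
        alpha = yr / (d @ Ad)          # step length
        x += alpha * d                 # approx solution
        r -= alpha * Ad                # residual
        y = Bsolve(r)                  # the one preconditioner solve
        yr_new = y @ r
        beta = yr_new / yr             # improvement
        d = y + beta * d               # search direction
        yr = yr_new
    return x

# e.g. Jacobi preconditioning: pcg(A, b, lambda r: r / np.diag(A))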
Incomplete Cholesky factorization (IC, ILU)
[Figure: A and its incomplete triangular factors R^T and R]
• Compute factors of A by Gaussian elimination,
but ignore fill
• Preconditioner B = R^T R ≈ A, not formed explicitly
• Compute B^{-1} z by two triangular solves (in time O(nnz(A)))
• Total storage is O(nnz(A)), static data structure
• Either symmetric (IC) or nonsymmetric (ILU)
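SciPy's spilu is one concrete incomplete factorization, and its factors are applied exactly this way, by triangular solves. A hedged sketch (the matrix is a stand-in):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.random(n, n, density=0.01, format='csc') + 10 * sp.eye(n, format='csc')
ilu = spla.spilu(A)     # incomplete factors: B = L U ≈ A, B never formed
z = np.ones(n)
y = ilu.solve(z)        # B^{-1} z via forward and back triangular solves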
Incomplete Cholesky and ILU: Variants
• Allow one or more “levels of fill”
• unpredictable storage requirements
• Allow fill whose magnitude exceeds a “drop tolerance”
• may get better approximate factors than levels of fill
• unpredictable storage requirements
• choice of tolerance is ad hoc
• Partial pivoting (for nonsymmetric A)
• “Modified ILU” (MIC): Add dropped fill to diagonal of U or R
• A and R^T R have the same row sums
• good in some PDE contexts
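In SciPy's spilu, the drop-tolerance variant corresponds to the drop_tol parameter, and fill_factor caps the otherwise unpredictable storage. A small sketch of the tradeoff (matrix is a stand-in):

import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.random(n, n, density=0.01, format='csc') + 10 * sp.eye(n, format='csc')
for tol in (1e-1, 1e-3, 1e-5):
    ilu = spla.spilu(A, drop_tol=tol, fill_factor=20)
    print(tol, ilu.nnz)     # smaller drop tolerance => denser (better) factors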
Matrix permutations for iterative methods
• Symmetric matrix permutations don’t change the
convergence of unpreconditioned CG
• Symmetric matrix permutations affect the quality of an
incomplete factorization – poorly understood, controversial
• Often banded (RCM) is best for IC(0) / ILU(0)
• Often min degree & nested dissection are bad for no-fill
incomplete factorization but good if some fill is allowed
• Some experiments with orderings that use matrix values
• e.g. “minimum discarded fill” order
• sometimes effective but expensive to compute
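For instance, the banded (RCM) ordering above is available as SciPy's reverse_cuthill_mckee. A sketch (the random matrix is a stand-in for a real problem):

import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 500
A = sp.random(n, n, density=0.01, format='csr')
A = (A + A.T + n * sp.eye(n)).tocsr()       # symmetrize, heavy diagonal
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]    # banded reordering, e.g. before IC(0)/ILU(0)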
Nonsymmetric matrix permutations
for iterative methods
• Dynamic permutation: ILU with row or column pivoting
• E.g. ILUTP (Saad), Matlab “luinc”
• More robust but more expensive than ILUT
• Static permutation: Try to increase diagonal dominance
• E.g. bipartite weighted matching (Duff & Koster)
• Often effective; no theory for quality of preconditioner
• Field is not well understood, to say the least
Row permutation for heavy diagonal
[Figure: a 5×5 example matrix A and the row-permuted matrix PA, with the matched entries moved onto the diagonal. (Duff & Koster)]
• Represent A as a weighted, undirected bipartite graph
(one node for each row and one node for each column)
• Find matching (set of independent edges) with maximum
product of weights
• Permute rows to place matching on diagonal
• Matching algorithm also gives a row and column scaling that makes all
diagonal elements = 1 and all off-diagonal elements <= 1
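SciPy has a cardinality-only relative of this idea, maximum_bipartite_matching, which makes the permuted diagonal zero-free but ignores the weights; Duff & Koster's maximum-product version with the accompanying scaling (MC64) is not in SciPy. A small sketch:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import maximum_bipartite_matching

A = sp.csr_matrix(np.array([[0., 1., 0.],
                            [1., 0., 1.],
                            [0., 0., 1.]]))
perm = maximum_bipartite_matching(A, perm_type='row')  # row matched to each column
PA = A[perm]        # rows permuted: matched entries now on the diagonal
print(PA.toarray())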
Preconditioned conjugate gradient iteration
x_0 = 0,  r_0 = b,  d_0 = B^{-1} r_0,  y_0 = B^{-1} r_0
for k = 1, 2, 3, . . .
    α_k = (y_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    (step length)
    x_k = x_{k-1} + α_k d_{k-1}                          (approx solution)
    r_k = r_{k-1} - α_k A d_{k-1}                        (residual)
    y_k = B^{-1} r_k                                     (preconditioning solve)
    β_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})              (improvement)
    d_k = y_k + β_k d_{k-1}                              (search direction)
• Several vector inner products per iteration (easy to parallelize)
• One matrix-vector multiplication per iteration (medium to parallelize)
• One solve with preconditioner per iteration (hard to parallelize)
Incomplete Cholesky and ILU: Issues
• Choice of parameters
• good: smooth transition from iterative to direct methods
• bad: very ad hoc, problem-dependent
• tradeoff: time per iteration (more fill => more time)
vs # of iterations (more fill => fewer iters)
• Effectiveness
• condition number usually improves (only) by constant factor
(except MIC for some problems from PDEs)
• still, often good when tuned for a particular class of problems
• Parallelism
• Triangular solves are not very parallel
• Reordering for parallel triangular solve by graph coloring
• Dynamic coloring: ILUM (Saad)
Sparse approximate inverses
[Figure: B^{-1} compared with A]
• Compute B^{-1} ≈ A^{-1} explicitly
• Minimize ||A B^{-1} - I||_F (in parallel, by columns)
• Variants: factored form of B^{-1}, more fill, . . .
• Good: very parallel, seldom breaks down
• Bad: effectiveness varies widely
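A pattern-free dense sketch of the columnwise Frobenius minimization: each column is an independent least-squares problem, which is what makes the construction parallel. (Real sparse-approximate-inverse codes constrain each column to a sparse pattern; without that constraint this just computes A^{-1}.)

import numpy as np

def approx_inverse(A):
    # Minimize ||A M - I||_F one column at a time:
    # column j solves min ||A m_j - e_j||_2, independently of the others.
    n = A.shape[0]
    M = np.zeros((n, n))
    I = np.eye(n)
    for j in range(n):              # independent => parallel by columns
        M[:, j] = np.linalg.lstsq(A, I[:, j], rcond=None)[0]
    return M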
Complexity of linear solvers
Time to solve the model problem (Poisson's equation) on a regular mesh
(2D: n^{1/2} × n^{1/2} grid; 3D: n^{1/3} × n^{1/3} × n^{1/3} grid)

                         2D                         3D
Dense Cholesky:          O(n^3)                     O(n^3)
Sparse Cholesky:         O(n^1.5)                   O(n^2)
CG, exact arithmetic:    O(n^2)                     O(n^2)
CG, no precond:          O(n^1.5)                   O(n^1.33)
CG, modified IC:         O(n^1.25)                  O(n^1.17)
CG, support trees:       O(n^1.20) -> O(n^{1+ε})    O(n^1.75) -> O(n^1.31)
Multigrid:               O(n)                       O(n)
Matrix-vector product: Parallel implementation
• Lay out matrix and vectors by rows
• Hard part is the matrix-vector product y = A*x
[Figure: the rows of A, x, and y partitioned among processors P0, P1, P2, P3]
• Algorithm: each processor j
    broadcasts x(j)
    computes y(j) = A(j,:)*x
• May send more of x than needed
• Partition / reorder matrix to reduce communication
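A hedged mpi4py sketch of this row layout, assuming the global size divides evenly among ranks; Allgather plays the role of the broadcasts (and indeed sends every rank all of x, more than it may need):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n = 8 * nprocs                      # global dimension
rows = n // nprocs                  # rows owned by this rank
A_local = np.random.rand(rows, n)   # this rank's block of rows, A(j,:)
x_local = np.random.rand(rows)      # this rank's piece of x, x(j)

x = np.empty(n)
comm.Allgather(x_local, x)          # every processor j broadcasts x(j)
y_local = A_local @ x               # y(j) = A(j,:)*x

# run with e.g.: mpiexec -n 4 python spmv.py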
Sparse Matrix-Vector Multiplication
Graph partitioning in theory
• If G is a planar graph with n vertices, there exists a set
of at most sqrt(6n) vertices whose removal leaves no
connected component with more than 2n/3 vertices.
(“Planar graphs have sqrt(n)-separators.”)
• “Well-shaped” finite element meshes in 3 dimensions have
n^{2/3}-separators.
• Also some other classes of graphs – trees, graphs of bounded genus,
chordal graphs, bounded-excluded-minor graphs, …
• Mostly these theorems come with efficient algorithms,
but they aren’t used much.
Graph partitioning in practice
• Graph partitioning heuristics have been an active
research area for many years, often motivated by
partitioning for parallel computation. See CS 240A.
• Some techniques:
    • Spectral partitioning (uses eigenvectors of the Laplacian matrix of the graph); see the sketch below
    • Geometric partitioning (for meshes with specified vertex coordinates)
    • Iterative swapping (Kernighan-Lin, Fiduccia-Mattheyses)
    • Breadth-first search (GLN 7.3.3, fast but dated)
• Many popular modern codes (e.g. Metis, Chaco) use
multilevel iterative swapping
• Matlab graph partitioning toolbox: see course web page
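A minimal spectral-partitioning sketch: split the vertices by the median of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). A path graph stands in for the mesh, and the dense eigensolver is only for brevity:

import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

n = 100
A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format='csr')  # path graph
L = laplacian(A)
w, V = eigh(L.toarray())    # real codes use a sparse eigensolver instead
fiedler = V[:, 1]           # eigenvector of the second-smallest eigenvalue
part = fiedler > np.median(fiedler)   # boolean split into the two halves

For the path graph the Fiedler vector is monotone, so the split recovers the two halves of the path, which is the minimum-cut bisection.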
Parallel Incomplete Cholesky and ILU: Issues
• Computing the preconditioner
• Parallel direct methods well developed
• But IC/ILU is sparser => harder to speed up
• Still, you only have to do it once
• Applying the preconditioner
• Triangular solves are not very parallel
• Reordering by graph coloring (see example)
• But the orderings are not great for convergence
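As a toy illustration of the coloring idea, NetworkX's greedy coloring groups the unknowns into independent sets ("levels"); within one level, all updates of the reordered triangular solve can proceed in parallel. The grid graph is a stand-in for the graph of the triangular factor:

import networkx as nx

G = nx.grid_2d_graph(4, 4)      # stand-in for the graph of R + R^T
colors = nx.coloring.greedy_color(G, strategy='largest_first')
k = max(colors.values()) + 1
levels = [[v for v, c in colors.items() if c == i] for i in range(k)]
# Vertices within one level are mutually non-adjacent (independent), so a
# triangular solve reordered by color does each level as one parallel step.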