Transcript Lecture 11

Linear Triangular System
Lx  b
 a11

 a 21
L – lower triangular matrix, nonsingular
0   x1   b1 
    
a 22   x 2   b 2 
x1  b1 / a11
x 2  ( b 2  a 21 x1 ) / a 22
Lx=b
L: nxn nonsingular lower triangular
b: known vector
b(1) = b(1)/L(1,1)
for i = 2:n
    b(i) = (b(i) - L(i,1:i-1)*b(1:i-1))/L(i,i)
end
Forward substitution, row version
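A minimal NumPy sketch of the row-oriented forward substitution above (the function name forward_sub_row and the 0-based indexing are our own choices, not from the slides):

import numpy as np

def forward_sub_row(L, b):
    # Solve L x = b by row-oriented forward substitution.
    # L: n x n nonsingular lower triangular; b is left unmodified.
    n = L.shape[0]
    x = np.array(b, dtype=float)
    x[0] = x[0] / L[0, 0]
    for i in range(1, n):
        # subtract the already-known part of row i, then divide by the diagonal
        x[i] = (x[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

For a quick check, the result should agree with np.linalg.solve(L, b) for a well-conditioned lower triangular L.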
Triangular System
Column version (column sweep method):
As soon as a variable is solved, its effect can be subtracted from subsequent
equations
2

1

 7
0
5
9
0   x1   6 
    
0  x2  2
    
8   x 3   5 
x1  3
5

9
0  x2  2
1    1 



3
    
  

8   x3  5 
 7    16 
Lx = b
for j = 1:n-1
    b(j) = b(j)/L(j,j)
    b(j+1:n) = b(j+1:n) - b(j)*L(j+1:n,j)
end
b(n) = b(n)/L(n,n)
Forward substitution, column version
Column version is more amenable to parallel computing
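A matching NumPy sketch of the column-sweep version (again, forward_sub_col and the 0-based indexing are ours):

import numpy as np

def forward_sub_col(L, b):
    # Solve L x = b by column-oriented (column-sweep) forward substitution:
    # as soon as x_j is known, its effect is subtracted from the remaining rhs.
    n = L.shape[0]
    x = np.array(b, dtype=float)
    for j in range(n - 1):
        x[j] = x[j] / L[j, j]
        x[j + 1:] -= x[j] * L[j + 1:, j]
    x[n - 1] = x[n - 1] / L[n - 1, n - 1]
    return x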
Triangular System: Parallel
[Figure: row blocks of L and b distributed over processors, shown with a plain block layout and with a block-cyclic layout]
As soon as x_i (or a few x_i values) is computed, the value is passed downward to the neighboring CPUs;
as soon as a CPU receives an x_i value, it passes the value downward to its neighboring CPUs;
then the local b vector is updated.
Disadvantage: load imbalance; about 50% of the CPUs are active on average.
Remedy: cyclic or block cyclic distribution of rows.
Triangular System: Inversion
A – NxN lower triangular
Divide A into equal blocks:

A = [A1  0;  A3  A2]

A^(-1) = [A1^(-1)  0;  X  A2^(-1)],   X = -A2^(-1)*A3*A1^(-1)

Can invert A recursively:
    Invert A1;
    Invert A2;
    Compute X by matrix multiplication
=> the work is essentially matrix multiplication
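A recursive NumPy sketch of this block inversion; tri_inv is a hypothetical name, the split point n//2 need not be a power of two, and A is assumed square, lower triangular, and nonsingular:

import numpy as np

def tri_inv(A):
    # Invert a lower triangular matrix recursively:
    # A = [A1 0; A3 A2]  =>  A^(-1) = [inv(A1) 0; X inv(A2)],
    # with X = -inv(A2) @ A3 @ inv(A1).
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0 / A[0, 0]]])
    m = n // 2
    A1_inv = tri_inv(A[:m, :m])
    A2_inv = tri_inv(A[m:, m:])
    out = np.zeros((n, n))
    out[:m, :m] = A1_inv
    out[m:, m:] = A2_inv
    out[m:, :m] = -A2_inv @ A[m:, :m] @ A1_inv   # the matrix-multiplication step
    return out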
Triangular System: Inversion
[Worked example: an 8x8 lower triangular matrix, shown before and after the first phase; in the first phase each diagonal entry is replaced by its reciprocal (e.g. 2 -> 1/2, 3 -> 1/3) while the off-diagonal entries are left unchanged.]
First phase: invert the diagonal elements of A
Second phase: compute the 2x2 diagonal blocks of A^(-1)
…
k-th phase: compute the 2^(k-1) x 2^(k-1) diagonal blocks of A^(-1)
[The same example after the second phase: the 2x2 diagonal blocks of A^(-1) have been computed in place.]
Essentially matrix multiplications;
k-th phase: N/2^(k-1) pairs of 2^(k-2) x 2^(k-2) matrix multiplications
Can be done in parallel on P = K^3 processors
Gaussian Elimination
3x1 + 5x2 = 9
6x1 + 7x2 = 4

Ax = b,  A = [3 5; 6 7],  b = [9; 4]

Eliminate x1 from the second equation (subtract 2 times the first):

3x1 + 5x2 = 9
     -3x2 = -14

[3 5; 6 7] = [1 0; 2 1] [3 5; 0 -3]
A = LU,
L – unit lower triangular
U – upper triangular
Ax = b => LUx = b => solve Ly = b, then Ux = y
Especially useful with multiple right-hand sides, or when the same equations (same coefficient matrix) must be solved many times.
LU Factorization
A = LU
A – nxn matrix
A(1:k,1:k) nonsingular for k=1:n-1
After factorization, L is in strictly
lower triangular part of A, U is in
upper triangular part of A (including
diagonal)
(kij) version:
for k = 1:n-1
    for i = k+1:n
        A(i,k) = A(i,k)/A(k,k)        // A(k,k) is the pivot
        for j = k+1:n
            A(i,j) = A(i,j) - A(i,k)*A(k,j)
        end
    end
end
or, in vectorized form:
for k = 1:n-1
    A(k+1:n,k) = A(k+1:n,k)/A(k,k)
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
end
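A NumPy sketch of the vectorized form above, without pivoting (lu_in_place is our name; 0-based indices):

import numpy as np

def lu_in_place(A):
    # LU factorization without pivoting. On return, the strictly lower
    # triangular part holds L (unit diagonal implied) and the upper
    # triangular part holds U. Assumes A(1:k,1:k) nonsingular for k=1:n-1.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        A[k + 1:, k] /= A[k, k]                                     # multipliers
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])   # rank-1 update
    return A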
Factorization Breakdown
If A(k,k) = 0 at any stage => breakdown; the LU factorization may not exist even if A is nonsingular.
Theorem:
Assume A is an nxn matrix.
(1) A has an LU factorization if A(1:k,1:k) is non-singular for all k=1:n-1.
(2) If the LU factorization exists and A is non-singular, then the LU
factorization is unique.
To avoid method breakdown => pivoting.
Pivoting is also necessary to improve accuracy: a small pivot => increased errors.
Make sure no large entries appear in L or U; use large pivots.
Block LU Factorization
A = [A11  A12;  A21  A22] = [L11  0;  L21  L22] [U11  U12;  0  U22]
A – nxn matrix, n = r*N
A11 – rxr matrix, A22 – (n-r)x(n-r) matrix, A12 – rx(n-r) matrix, A21 – (n-r)xr
A11 = L11*U11
A12 = L11*U12 => solve for U12
A21 = L21*U11 => solve for L21
A22 = L21*U12 + L22*U22 => A22 - L21*U12 = A' = L22*U22
Continue the LU factorization iteratively on A'
Block LU Factorization
A – nxn matrix
A(1:k,1:k) is non-singular for k=1:n-1
1<= r <= n
Upon completion, A(i,j) overwritten by L(i,j) for i>j; A(i,j) overwritten by U(i,j) for i<=j
s = 1
while s <= n
    q = min(n, s+r-1)
    Use the scalar algorithm to LU-factorize A(s:q,s:q) into L and U
    Solve L*Z = A(s:q,q+1:n) for Z and overwrite A(s:q,q+1:n) with Z
    Solve W*U = A(q+1:n,s:q) for W and overwrite A(q+1:n,s:q) with W
    A(q+1:n,q+1:n) = A(q+1:n,q+1:n) - W*Z
    s = q+1
end
Matrix multiplication accounts for significant fraction of operations
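A NumPy/SciPy sketch of the block algorithm above, again without pivoting; block_lu_in_place is our name, the scalar step is done inline, and scipy.linalg.solve_triangular performs the triangular solves for Z and W:

import numpy as np
from scipy.linalg import solve_triangular

def block_lu_in_place(A, r):
    # Block LU factorization with block size r (0-based indices).
    A = np.array(A, dtype=float)
    n = A.shape[0]
    s = 0
    while s < n:
        q = min(n, s + r)
        # scalar LU on the diagonal block A(s:q, s:q)
        for k in range(s, q - 1):
            A[k + 1:q, k] /= A[k, k]
            A[k + 1:q, k + 1:q] -= np.outer(A[k + 1:q, k], A[k, k + 1:q])
        if q < n:
            L11 = np.tril(A[s:q, s:q], -1) + np.eye(q - s)
            U11 = np.triu(A[s:q, s:q])
            # Z = inv(L11) @ A(s:q, q:n)  and  W = A(q:n, s:q) @ inv(U11)
            A[s:q, q:] = solve_triangular(L11, A[s:q, q:], lower=True)
            A[q:, s:q] = solve_triangular(U11.T, A[q:, s:q].T, lower=True).T
            # Schur complement update: the matrix-multiplication-rich step
            A[q:, q:] -= A[q:, s:q] @ A[s:q, q:]
        s = q
    return A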
Permutation Matrix
Permutation matrix: identity matrix with its rows re-ordered.
0

1

P 
0

0
0
0
0
0
0
1
1
0
1

0

0

0
p = [4 1 3 2] encodes permutation matrix P
p(k) is the column index of the “1” in k-th row
PA: row-permuted version of A
AP: column-permuted version of A
Interchange permutation matrix: identity matrix with two rows swapped
0

0

E 
0

1
0
0
1
0
0
1
0
0
1

0

0

0
Row 1 and 4 swapped
EA: swap rows 1 and 4 of A
AE: swap columns 1 and 4 of A
Permutation Matrix
A permutation matrix can be expressed as a series of row interchanges:
P = E_n ... E_2 E_1
If E_k is the interchange permutation matrix with rows k and p(k) interchanged, then P can be encoded by the vector p(1:n).
If x(1:n) is a vector, then Px can be computed using p(1:n):
for k = 1:n
    swap x(k) and x(p(k))
end
p(1:n) vector is useful for pivoting
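A small sketch of applying the interchange-encoded permutation to a vector (apply_p is our name; indices are 0-based, so the p(k) of the slides becomes p[k] here):

import numpy as np

def apply_p(x, p):
    # Compute P x for P = E_n ... E_2 E_1 by applying the interchanges in order.
    x = np.array(x, dtype=float)
    for k in range(len(p)):
        x[k], x[p[k]] = x[p[k]], x[k]
    return x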
Partial Pivoting
Pivoting is crucial to preventing breakdown and improving accuracy
Partial pivoting: choose largest element in a column (or row) and interchange rows
(columns)
3

A 2

 6
17
4
18
10 

2

 12 
Swap rows 1 and 3
6

2

 3
18
4
17
 12 

2

10 
6

0

 0
18
2
8
 12 

2

16 
Swap rows
2 and 3
6

0

 0
8
8
0
 12 

16

6 
6

0

 0
18
8
2
 12 

16

2 
LU Factorization with Row Partial Pivoting
A – nxn matrix
After factorization, strictly lower triangular part of A contains L; upper triangular part
contains U; vector p(1:n-1) contains permutation operations in partial pivoting
Algorithm F2:
for k = 1:n-1
    Determine s with k <= s <= n such that |A(s,k)| is the largest among |A(k:n,k)|
    swap A(k,1:n) and A(s,1:n)
    p(k) = s
    if A(k,k) != 0
        A(k+1:n,k) = A(k+1:n,k)/A(k,k)
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
    end
end
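A NumPy sketch of algorithm F2 (lu_partial_pivot is our name; 0-based indices, and p stores 0-based row indices):

import numpy as np

def lu_partial_pivot(A):
    # LU factorization with row partial pivoting (whole rows are swapped, F2).
    # Returns the overwritten matrix (L strictly below the diagonal, U on and
    # above it) and the pivot vector p of length n-1.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    p = np.zeros(n - 1, dtype=int)
    for k in range(n - 1):
        s = k + np.argmax(np.abs(A[k:, k]))   # row of largest |A(k:n,k)|
        A[[k, s], :] = A[[s, k], :]           # swap entire rows k and s
        p[k] = s
        if A[k, k] != 0:
            A[k + 1:, k] /= A[k, k]
            A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A, p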
How to Use Factorized A
Solve Ax = b
Using LU factorization with row partial pivoting
Need to swap elements of b according to partial pivoting => information in p(1:n-1)
Assume A is LU factorized with row partial pivoting using algorithm F2:
for k = 1:n-1
    swap b(k) and b(p(k))
end
Solve Ly = b
Solve Ux = y
L – unit lower triangular matrix whose strictly lower triangular part is that of the factorized A; U – upper triangular part of A (including the diagonal)
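A sketch of this solve step, intended to pair with the F2 sketch above (solve_with_f2 is our name; uses scipy.linalg.solve_triangular for the two triangular solves):

import numpy as np
from scipy.linalg import solve_triangular

def solve_with_f2(A, p, b):
    # Solve Ax = b given the F2 output (A overwritten with L and U, pivots p).
    n = A.shape[0]
    b = np.array(b, dtype=float)
    for k in range(n - 1):                      # apply the recorded row swaps to b
        b[k], b[p[k]] = b[p[k]], b[k]
    L = np.tril(A, -1) + np.eye(n)              # unit lower triangular factor
    U = np.triu(A)                              # upper triangular factor
    y = solve_triangular(L, b, lower=True)      # Ly = b
    return solve_triangular(U, y, lower=False)  # Ux = y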
LU Factorization With Row Partial Pivoting
A – nxn matrix
After factorization, strictly lower triangular part of A contains multipliers; upper
triangular part contains U; vector p(1:n-1) contains permutation operations in
partial pivoting
Algorithm F1:
for k = 1:n-1
    Determine s with k <= s <= n such that |A(s,k)| is the largest among |A(k:n,k)|
    swap A(k,k:n) and A(s,k:n)    // only difference from F2
    p(k) = s
    if A(k,k) != 0
        A(k+1:n,k) = A(k+1:n,k)/A(k,k)
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
    end
end
How to Use Factorized A
Solve Ax = b
Using LU factorization with row partial pivoting (algorithm F1)
Need to swap elements of b according to partial pivoting => information in p(1:n-1)
Need to apply the stored multipliers => information in the strictly lower triangular part of A
Assume A is LU factorized with partial pivoting using algorithm F1:
for k = 1:n-1
    swap b(k) and b(p(k))
    b(k+1:n) = b(k+1:n) - b(k)*A(k+1:n,k)
end
Solve Ux = b
U - upper triangular part of A (including diagonal)
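A sketch of the F1-style solve, where the swaps and the multiplier updates are interleaved before a single back-substitution (solve_with_f1 is our name; assumes A and p come from an F1-style factorization, 0-based):

import numpy as np
from scipy.linalg import solve_triangular

def solve_with_f1(A, p, b):
    # Forward pass: swap b(k) with b(p(k)), then apply the stored multipliers.
    b = np.array(b, dtype=float)
    for k in range(len(p)):
        b[k], b[p[k]] = b[p[k]], b[k]
        b[k + 1:] -= b[k] * A[k + 1:, k]
    # Back-substitution with U (upper triangular part of A, including diagonal).
    return solve_triangular(np.triu(A), b, lower=False)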
Column Partial Pivoting
Column partial pivoting: search row k for the largest element, exchange that column
with column k.
A – nxn matrix
After factorization, strictly lower triangular part of A contains L; upper triangular
part contains U; vector p(1:n-1) contains permutation operations in partial pivoting
Algorithm G:
for k = 1:n-1
    Determine s with k <= s <= n such that |A(k,s)| is the largest among |A(k,k:n)|
    swap A(1:n,k) and A(1:n,s)
    p(k) = s
    if A(k,k) != 0
        A(k+1:n,k) = A(k+1:n,k)/A(k,k)
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
    end
end
How to Use Factorized A
Solve Ax = b
Using LU factorization with column partial pivoting
Need to swap elements of x according to partial pivoting => information in p(1:n-1)
Assume A is LU factorized with column partial pivoting using algorithm G:
Solve Ly = b
Solve Ux = y
for k = n-1:-1:1
    swap x(k) and x(p(k))
end
L – unit lower triangular matrix whose strictly lower triangular part is that of the factorized A; U – upper triangular part of A (including the diagonal)
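A sketch of the column-pivoting solve, where the recorded column swaps are undone on x at the end, in reverse order (solve_with_g is our name; pairs with an algorithm-G style factorization, 0-based):

import numpy as np
from scipy.linalg import solve_triangular

def solve_with_g(A, p, b):
    n = A.shape[0]
    L = np.tril(A, -1) + np.eye(n)                                 # unit lower triangular
    y = solve_triangular(L, np.array(b, dtype=float), lower=True)  # Ly = b
    x = solve_triangular(np.triu(A), y, lower=False)               # Ux = y
    for k in range(n - 2, -1, -1):                                 # undo column swaps on x
        x[k], x[p[k]] = x[p[k]], x[k]
    return x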
Complete Pivoting
Complete pivoting: the largest element in submatrix A(k:n,k:n) is permuted into
(k,k) as the pivot
Need a row interchange and a column interchange
A – nxn matrix
p(1:n-1) – vector encoding row interchanges
q(1:n-1) – vector encoding column interchanges
After factorization, the strictly lower triangular part of A contains L; the upper triangular part of A (including the diagonal) contains U
LU Factorization with Complete Pivoting
LU factorization with complete pivoting
for k = 1:n-1
    Determine s (k <= s <= n) and t (k <= t <= n) such that |A(s,t)| is the largest among |A(i,j)| for i = k:n, j = k:n
    swap A(k,1:n) and A(s,1:n)
    swap A(1:n,k) and A(1:n,t)
    p(k) = s
    q(k) = t
    if A(k,k) != 0
        A(k+1:n,k) = A(k+1:n,k)/A(k,k)
        A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
    end
end
How to Use Factorized A
Solve Ax = b
By LU factorization with complete pivoting
Suppose A is LU factorized with complete pivoting, p(1:n-1) and q(1:n-1) are
permutation encoding vectors
for k = 1:n-1
    swap b(k) and b(p(k))
end
Solve Ly = b for y
Solve Ux = y for x
for k = n-1:-1:1
    swap x(k) and x(q(k))
end
L and U are lower and upper triangular parts of factorized A
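A sketch of the complete-pivoting solve, combining the two previous patterns: permute b with p up front, and undo the column swaps q on x at the end (solve_with_cp is our name; 0-based):

import numpy as np
from scipy.linalg import solve_triangular

def solve_with_cp(A, p, q, b):
    n = A.shape[0]
    b = np.array(b, dtype=float)
    for k in range(n - 1):                              # row swaps recorded in p
        b[k], b[p[k]] = b[p[k]], b[k]
    y = solve_triangular(np.tril(A, -1) + np.eye(n), b, lower=True)   # Ly = b
    x = solve_triangular(np.triu(A), y, lower=False)                  # Ux = y
    for k in range(n - 2, -1, -1):                      # undo column swaps in q
        x[k], x[q[k]] = x[q[k]], x[k]
    return x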
Parallelization of Gaussian Elimination
Row-wise 1D block decomposition
At step k, the processor holding the pivot sends row k, A(k,k:n), to its bottom neighboring processor;
at each processor, upon receiving data from the top neighbor, forward it immediately to the bottom neighbor, then update its own data, then wait for the next data from the top neighbor.
Disadvantage: load imbalance
Remedy: row-wise block cyclic distribution
Parallelization with Partial Pivoting
Row-wise block/block-cyclic decomposition
Gaussian elimination with column partial pivoting
More difficult with row partial pivoting
The pivot search is done on the processor holding row k; no communication among processors is needed.
The column index of the new pivot element, together with row k, A(k,k:n), needs to be sent out.
On each processor, upon receiving data from the top neighbor, forward it immediately to the bottom neighbor, then:
    swap column k and the new pivot column of its own data;
    update its own data;
    wait for data from the top neighbor.