Transcript: Lecture 15

Numerical Linear Algebra: An Introduction
High Performance Computing 1
Levels of multiplication
• vector-vector: a[i]*b[i]
• matrix-vector: A[i,j]*b[j]
• matrix-matrix: A[i,k]*B[k,j]
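The first two levels written out as plain C loops, just to fix notation (the matrix-matrix loop nest is on the next slide); the function names and row-major storage here are my own assumptions:

    /* Level 1: vector-vector (dot product), O(n) flops on O(n) data */
    double dot(int n, const double *a, const double *b)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Level 2: matrix-vector, y = A*b, O(n^2) flops on O(n^2) data;
       A is stored row-major, so element (i,j) is A[i*n + j] */
    void matvec(int n, const double *A, const double *b, double *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * b[j];
        }
    }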
Matrix-Matrix
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
    {
        C[i][j] = 0.0;
        for (k = 0; k < n; k++)
            C[i][j] = C[i][j] + A[i][k]*B[k][j];
    }
Note: O(n^3) work
Block matrix
• Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B means block matrix multiplication
• Let n be a power of two and generate a recursive algorithm that terminates with an explicit formula for the elementary 2×2 multiplications. This allows for parallelism, and with Strassen's 7-multiplication formula for the 2×2 case the work drops to about O(n^2.8) (see the sketch below).
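A minimal sketch of that recursion in plain C; the names and storage layout are my own assumptions (row-major storage with leading dimension ld, n a power of two, C zeroed by the caller). This is the ordinary 8-multiplication recursion, which is still O(n^3) work; Strassen's rearrangement into 7 half-size products is what brings the count down to about O(n^2.8):

    #include <stddef.h>

    /* C = C + A*B, recursing on n/2-by-n/2 quadrants addressed by
       pointer offsets (no copying).  The 8 recursive calls per level
       are independent, which is where the parallelism comes from.   */
    static void matmul_rec(size_t n, size_t ld,
                           const double *A, const double *B, double *C)
    {
        if (n == 1) {                 /* base case: scalar multiply-add */
            C[0] += A[0] * B[0];
            return;
        }
        size_t h = n / 2;
        const double *A11 = A,        *A12 = A + h,
                     *A21 = A + h*ld, *A22 = A + h*ld + h;
        const double *B11 = B,        *B12 = B + h,
                     *B21 = B + h*ld, *B22 = B + h*ld + h;
        double       *C11 = C,        *C12 = C + h,
                     *C21 = C + h*ld, *C22 = C + h*ld + h;

        /* C11 += A11*B11 + A12*B21, and similarly for the other quadrants */
        matmul_rec(h, ld, A11, B11, C11);  matmul_rec(h, ld, A12, B21, C11);
        matmul_rec(h, ld, A11, B12, C12);  matmul_rec(h, ld, A12, B22, C12);
        matmul_rec(h, ld, A21, B11, C21);  matmul_rec(h, ld, A22, B21, C21);
        matmul_rec(h, ld, A21, B12, C22);  matmul_rec(h, ld, A22, B22, C22);
    }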
Pipeline method
• Pump data left to right and top to bottom (an MPI sketch follows the pseudocode):
recv(&A, P[i,j-1]);
recv(&B, P[i-1,j]);
C = C + A*B;
send(&A, P[i,j+1]);
send(&B, P[i+1,j]);
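A hedged sketch of what one processor in this pipeline does, written with MPI; the process grid, message tags, and the toy scalar "blocks" are assumptions for illustration, not the lecture's code. Boundary ranks inject data, interior ranks receive from the left and from above, accumulate, then forward right and down:

    /* pipeline_sketch.c -- illustrative only; run with q*q ranks,
       e.g. mpirun -np 4.  Each rank owns one "block" of C (a single
       double here, to keep the sketch short). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int q = 2;                                   /* assume a q x q grid   */
        int dims[2] = {q, q}, periods[2] = {0, 0};
        MPI_Comm grid;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
        if (grid == MPI_COMM_NULL) { MPI_Finalize(); return 0; }

        int rank, coords[2], left, right, up, down;
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);
        MPI_Cart_shift(grid, 1, 1, &left, &right);   /* A flows left -> right */
        MPI_Cart_shift(grid, 0, 1, &up, &down);      /* B flows top -> bottom */

        double A, B, C = 0.0;
        for (int step = 0; step < q; step++) {
            /* boundary ranks inject data (placeholder values here);
               interior ranks receive from their left / upper neighbors */
            if (left == MPI_PROC_NULL) A = 1.0;
            else MPI_Recv(&A, 1, MPI_DOUBLE, left, 0, grid, MPI_STATUS_IGNORE);
            if (up == MPI_PROC_NULL)   B = 1.0;
            else MPI_Recv(&B, 1, MPI_DOUBLE, up, 1, grid, MPI_STATUS_IGNORE);

            C = C + A * B;                           /* accumulate local block */

            /* pass the data on (sends to MPI_PROC_NULL are no-ops) */
            MPI_Send(&A, 1, MPI_DOUBLE, right, 0, grid);
            MPI_Send(&B, 1, MPI_DOUBLE, down, 1, grid);
        }

        printf("rank (%d,%d): C = %g\n", coords[0], coords[1], C);
        MPI_Finalize();
        return 0;
    }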
Pipeline method
[Figure: 4x4 grid of processors. The rows A[0,*] … A[3,*] stream in from the left, the columns B[*,0] … B[*,3] stream in from the top, and processor (i,j) accumulates C[i,j].]
Pipeline method
• A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.
A sense of speed – vector ops
Loop   Flops per pass   Operation per pass                  Operation
1      2                v1(i) = v1(i) + a*v2(i)             update
2      8                v1(i) = v1(i) + Σ_k s_k*v_k(i)      4-fold vector update
3      1                v1(i) = v1(i)/v2(i)                 divide
4      2                v1(i) = v1(i) + s*v2(ind(i))        update + gather
5      2                v1(i) = v2(i) - v3(i)*v1(i-1)       bidiagonal
6      2                s = s + v1(i)*v2(i)                 inner product
A sense of speed – vector ops
        J90, cft77 (100 nsec clock)      T90, 1 processor, cft77 (2.2 nsec clock)
Loop    r_∞ (Mflop/s)    n_1/2           r_∞ (Mflop/s)    n_1/2
1       97               19              1428             159
2       163              29              1245             50
3       21               6               219              43
4       72               21              780              83
5       4.9              2               25               9
6       120              202             474              164
Observations
• Simple do loops are not effective
• Cache and the memory hierarchy are the bottlenecks
• For better performance,
– combine loops (see the sketch below)
– minimize memory transfer
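A toy illustration of the "combine loops" point (the variable and function names are made up): fusing an update with an inner product means each element of v1 and v2 is brought through the memory hierarchy once instead of twice, for the same arithmetic.

    /* separate passes: v1 and v2 are each streamed from memory twice */
    double update_then_dot(int n, double a, double *v1, const double *v2)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            v1[i] = v1[i] + a * v2[i];   /* loop 1: update        */
        for (int i = 0; i < n; i++)
            s = s + v1[i] * v2[i];       /* loop 6: inner product */
        return s;
    }

    /* combined pass: same flops, roughly half the memory traffic */
    double fused_update_dot(int n, double a, double *v1, const double *v2)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            v1[i] = v1[i] + a * v2[i];
            s = s + v1[i] * v2[i];
        }
        return s;
    }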
LINPACK
• A library of subroutines for solving linear algebra problems
• Example – LU decomposition and system solve (dgefa and dgesl, respectively)
• In turn, built on the BLAS
• See netlib.org
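The LINPACK pair dgefa/dgesl factors A and then solves; here is a hedged C sketch of the same job using the LAPACK successor routine dgesv through the LAPACKE interface (this assumes LAPACKE is installed, and it is not the LINPACK calling sequence itself):

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* solve A x = b for a small 2x2 system, row-major storage */
        double A[4] = { 4.0, 3.0,
                        6.0, 3.0 };
        double b[2] = { 10.0, 12.0 };
        lapack_int ipiv[2];

        /* dgesv = LU factorization with partial pivoting + solve,
           the LAPACK analogue of LINPACK's dgefa followed by dgesl */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
        if (info != 0) { printf("dgesv failed: info = %d\n", (int)info); return 1; }

        printf("x = (%g, %g)\n", b[0], b[1]);   /* solution overwrites b */
        return 0;
    }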
BLAS
Basic Linear Algebra Subprograms
• A library of subroutines designed to provide efficient computation of commonly used linear algebra operations, like dot products, matrix-vector multiplies, and matrix-matrix multiplies.
• The naming convention is not unlike other libraries – the first letter indicates precision, and the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM.
• The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually in Level 3.
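A hedged calling sketch using the C interface (CBLAS), assuming a BLAS implementation is linked: daxpy is the double-precision Level 1 "a times x plus y", and dgemm is the double-precision Level 3 general matrix-matrix multiply.

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        /* Level 1: daxpy computes y <- alpha*x + y */
        double x[3] = { 1, 2, 3 }, y[3] = { 4, 5, 6 };
        cblas_daxpy(3, 2.0, x, 1, y, 1);
        printf("y = (%g, %g, %g)\n", y[0], y[1], y[2]);

        /* Level 3: dgemm computes C <- alpha*A*B + beta*C
           (2x2 matrices, row-major storage)              */
        double A[4] = { 1, 2, 3, 4 };
        double B[4] = { 5, 6, 7, 8 };
        double C[4] = { 0, 0, 0, 0 };
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
        printf("C = (%g %g; %g %g)\n", C[0], C[1], C[2], C[3]);
        return 0;
    }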
BLAS
• Level 1: vector-vector operations (e.g. SAXPY, dot products)
• Level 2: matrix-vector operations (e.g. SGEMV)
• Level 3: matrix-matrix operations (e.g. SGEMM)
How efficient is the BLAS?
          Routine   Load/store refs   Float ops   Refs/ops
Level 1   SAXPY     3N                2N          3/2
Level 2   SGEMV     MN + N + 2M       2MN         ~1/2
Level 3   SGEMM     2MN + MK + KN     2MNK        ~2/N
Matrix-vector
read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
    read row i of A into fast memory
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory
Matrix-vector
• m = number of slow memory references = n^2 + 3n
• f = number of arithmetic operations = 2n^2
• q = f/m ~ 2
• Matrix-vector multiply is limited by slow memory speed
Matrix-matrix
Matrix Multiply - unblocked
for i = 1 to n
    read row i of A into fast memory
    for j = 1 to n
        read C(i,j) into fast memory
        read column j of B into fast memory
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        write C(i,j) back to slow memory
Matrix Multiply unblocked
Number of slow memory references for unblocked matrix multiply:
m =   n^3      read each column of B n times (once per i)
    + n^2      read each row of A once
    + 2*n^2    read and write each element of C once
  =   n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2)
          ~= 2 for large n, no improvement over matrix-vector multiply
Matrix Multiply blocked
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize.

for i = 1 to N
    for j = 1 to N
        read block C(i,j) into fast memory
        for k = 1 to N
            read block A(i,k) into fast memory
            read block B(k,j) into fast memory
            C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on b-by-b blocks}
        write block C(i,j) back to slow memory
Matrix Multiply blocked
Number of slow memory references for blocked matrix multiply:
m =   N*n^2    read each block of B N^3 times (N^3 * (n/N) * (n/N))
    + N*n^2    read each block of A N^3 times
    + 2*n^2    read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
          ~= n/N = b for large n
Matrix Multiply blocked
So we can improve performance by increasing the blocksize b.
This can be much faster than matrix-vector multiply (q ~= 2).
Limit: all three b-by-b blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large:
3*b^2 <= M, so q ~= b <= sqrt(M/3)
(A plain-C sketch of the blocked loop follows below.)
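A plain-C sketch of the blocked loop above; the naming is my own, with row-major storage, n assumed to be a multiple of the blocksize b, and C assumed to be zeroed by the caller.

    /* C = C + A*B with b-by-b blocking; the inner three loops multiply one
       pair of blocks while (ideally) all three blocks stay in cache, so
       3*b*b doubles should fit in fast memory.                           */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i0 = 0; i0 < n; i0 += b)
            for (int j0 = 0; j0 < n; j0 += b)
                for (int k0 = 0; k0 < n; k0 += b)
                    /* multiply block A(i0,k0) by block B(k0,j0) into C(i0,j0) */
                    for (int i = i0; i < i0 + b; i++)
                        for (int j = j0; j < j0 + b; j++) {
                            double sum = C[i*n + j];
                            for (int k = k0; k < k0 + b; k++)
                                sum += A[i*n + k] * B[k*n + j];
                            C[i*n + j] = sum;
                        }
    }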
More on BLAS
• Industry-standard interface (evolving)
• Vendors and others supply optimized implementations
• History
– BLAS1 (1970s):
• vector operations: dot product, saxpy
• m = 2*n, f = 2*n, q ~ 1 or less
– BLAS2 (mid 1980s)
• matrix-vector operations
• m = n^2, f = 2*n^2, q ~ 2, less overhead
• somewhat faster than BLAS1
More on BLAS
– BLAS3 (late 1980s)
• matrix-matrix operations: matrix-matrix multiply, etc.
• m >= 4*n^2, f = O(n^3), so q can possibly be as large as n, so BLAS3 is potentially much faster than BLAS2
• Good algorithms use BLAS3 when possible (LAPACK)
• www.netlib.org/blas, www.netlib.org/lapack
BLAS on an IBM RS6000/590
Peak speed = 266 Mflops
[Figure: measured Mflop/s for BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy on vectors of length n), compared with the machine peak.]