Linear Algebra – I

Sashi Kumar Penta
GPGPU class presentation
Spring 2007
LA – 1: Outline
• Motivation
• Dense matrices and vectors
– Representation
– Vector-Vector operations
– Vector reduction type operations
– Matrix-Matrix multiplication
• Larsen et al.
• Fatahalian et al.
• Govindaraju et al.
• Introduction to CUDA
– Matrix multiplication
LA – 2: Outline
• Memory requirements for Balanced
computer architectures
• Sparse matrices
– Banded matrices
– Random sparse matrices
• Sparse Matrix-Vector multiplication
• Solving Navier–Stokes on the GPU
• …
Why LA on GPUs?
• Applications of LA
– Scientific computations
– Search (LSI)
– Graphics Simulations
• The GPU is a fast streaming processor
– LA operations are easily “streamable”
• The result of the computation is already on
the GPU and ready for display
– For simulations
Arrays = textures
• 2D arrays are the native GPU data layout
• Maximum array (texture) size on the GPU: 8K x 8K
• Texture coordinates are used to access values stored in the textures (see the index-mapping sketch below)
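To make the mapping concrete, here is a minimal sketch (not from the original slides) of how a 1-D array index is converted into the 2-D texel coordinate used to fetch it; W, N and i are illustrative names.

    /* Pack a length-N array into a roughly square W x H texture and
       address element i by its texel (x, y).  The +0.5 centers the
       coordinate on the texel when unnormalized (texture-rectangle)
       coordinates are used. */
    int W = (int)ceil(sqrt((double)N));
    int x = i % W;                       /* s coordinate */
    int y = i / W;                       /* t coordinate */
    float s = x + 0.5f, t = y + 0.5f;    /* coordinate passed to the shader */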
Representation
[Figure: an N x N matrix is represented as N vectors, each stored in its own 2D texture (N 2D-textures in total). Courtesy Jens Krüger]
Matrix multiplication: Why is it
important?
• Inherently parallel
• Memory intensive
• Benefits greatly from GPUs
• Used for benchmarking hardware
Early Pre-floating point
• Larsen and McAllister described an early
pre-floating point implementation of matrix
multiplies.
• Uses 2D textures and blending operations to perform the matrix product
• NVIDIA GeForce3 GPU: 8-bit fixed-point output
• ~4.4 G byte operations per second
L-M GPU-based Nested Loop
Matrix multiplication
for (i = 0; i < N; i++)
{
    for (j = 0; j < N; j++)
    {
        Z[i,j] = 0;
        // Each iteration of the following loop is a quadrilateral
        // rasterization of size N x N
        for (k = 0; k < N; k++)
            Z[i,j] = Z[i,j] + X[i,k] * Y[k,j];
    }
}
Continued…
Multitexture modulation lets the graphics hardware multiply values from two separate textures in a single pass (the state setup is sketched after the algorithm below).
Continued…
Load texture texA with matrix A.
Load texture texB with matrix B.
Set the multitexturing mode to modulate.
Set the framebuffer write mode to accumulate.
For i = 0 .. n-1
    draw an n x n pixel rectangle with
    texA coords (0,i), (0,i), (n-1,i), (n-1,i) and
    texB coords (i,0), (i,n-1), (i,n-1), (i,0).
End
Screen contains result of A x B
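The modulate and accumulate states named above can be set with standard OpenGL multitexturing and blending calls. A minimal sketch, not taken from the paper; texA and texB are assumed to already hold A and B as n x n 8-bit textures, and the framebuffer is assumed cleared to zero.

    // Unit 0 supplies A's texel unchanged; unit 1 multiplies it by B's texel
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, texA);
    glEnable(GL_TEXTURE_2D);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

    glActiveTexture(GL_TEXTURE1);
    glBindTexture(GL_TEXTURE_2D, texB);
    glEnable(GL_TEXTURE_2D);
    glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);

    // Additive blending accumulates one A[i,k]*B[k,j] term per pass
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);

    // Each pass i then draws the n x n rectangle with the texA/texB
    // coordinates listed in the algorithm above.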
Floating point textures
• Texture targets
– GL_TEXTURE_2D
• Coordinates have to be normalized [0, 1]
– GL_TEXTURE_RECTANGLE_ARB
• Coordinates are not normalized
– GL_LUMINANCE: 1 float per texel, so N values need a sqrt(N) x sqrt(N) texture
– GL_RGBA: 4 floats per texel, so the same N values need only sqrt(N/4) x sqrt(N/4)
Courtesy Jens Krüger
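A minimal allocation sketch, assuming an OpenGL context with the ARB_texture_float and ARB_texture_rectangle extensions; texSize and the data pointer are illustrative.

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex);
    // Nearest filtering: texels are array elements, not colors to interpolate
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    // One float per texel; use GL_RGBA32F_ARB / GL_RGBA to pack four per texel
    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_LUMINANCE32F_ARB,
                 texSize, texSize, 0, GL_LUMINANCE, GL_FLOAT, data);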
Operations
• Vector-Vector Operations
– Reduced to 2D texture operations
– Coded in pixel shaders
• Example: Vector1 + Vector2 → Vector3
[Figure: Vector 1 and Vector 2 are bound to TexUnit 0 and TexUnit 1; a static quad is drawn with a pass-through vertex shader, and the pixel shader returns tex0 + tex1 into the render texture holding Vector 3.]
Vector-vector operation: saxpy
• Scalar alpha * x plus y, i.e. (alpha * x + y)
• Y_new = Y_old + alpha * X
• On CPU:
for (int i = 0; i < N; i++)
    Y[i] = Y[i] + alpha * X[i];
• Calculations are independent
• Computational kernel
Y_new[i] = Y_old[i] + alpha * X[i]
Kernels = shaders
float saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float result;
    float y = tex2D(textureY, coords);
    float x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}
Four-channeled RGBA texture
Four-channeled textures improve efficiency in fetching and computation: the same kernel now processes four values per fragment.
float4 saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float4 result;
    float4 y = tex2D(textureY, coords);
    float4 x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}
Computing = drawing (texture rectangles, unnormalized coordinates)
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
    glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
    glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
glEnd();
Computing = drawing (Texture2D, normalized coordinates)
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
    glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
    glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
glEnd();
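For "computing = drawing" to touch each target texel exactly once, the projection and viewport must map pixels to texels one-to-one. A minimal sketch, assuming a texSize x texSize render target (after the referenced GPGPU tutorial):

    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    gluOrtho2D(0.0, texSize, 0.0, texSize);   // one unit = one pixel = one texel
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glViewport(0, 0, texSize, texSize);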
Reduction-type operations
• Maximum, vector norm, dot products, etc.
• One method: render a 1x1 output
– Only one parallel processing element (PE) is used
– Might exceed the maximum allowed shader length
• Better: compute the local maximum of each 2x2 group of elements in one kernel
– Input: M x M, output: M/2 x M/2
– Recursively repeated until the final 1 x 1 result
– Logarithmic number of iterations (a host-side driver loop is sketched after the maximum kernel below)
Rendering a quad
void drawQuad(int w, int h) {
    glBegin(GL_QUADS);
    // Each vertex carries four texture coordinates, one per element of the
    // 2x2 input group (offsets of +/-0.5 around the doubled output position)
    glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE1,  0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE2, -0.5,  0.5);
    glMultiTexCoord2f(GL_TEXTURE3,  0.5,  0.5);
    glVertex2f(0.0, 0.0);

    glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5,  0.5);
    glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5,  0.5);
    glVertex2f(w, 0.0);

    glMultiTexCoord2f(GL_TEXTURE0, 2*w-0.5, 2*h-0.5);
    glMultiTexCoord2f(GL_TEXTURE1, 2*w+0.5, 2*h-0.5);
    glMultiTexCoord2f(GL_TEXTURE2, 2*w-0.5, 2*h+0.5);
    glMultiTexCoord2f(GL_TEXTURE3, 2*w+0.5, 2*h+0.5);
    glVertex2f(w, h);

    glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h-0.5);
    glMultiTexCoord2f(GL_TEXTURE1,  0.5, 2*h-0.5);
    glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h+0.5);
    glMultiTexCoord2f(GL_TEXTURE3,  0.5, 2*h+0.5);
    glVertex2f(0.0, h);
    glEnd();
}
maximum
float maximum(float2 left  : TEXCOORD0, float2 right  : TEXCOORD1,
              float2 top   : TEXCOORD2, float2 bottom : TEXCOORD3,
              uniform samplerRECT texture) : COLOR
{
    float val1 = texRECT(texture, left);
    float val2 = texRECT(texture, right);
    float val3 = texRECT(texture, top);
    float val4 = texRECT(texture, bottom);
    return max(val1, max(val2, max(val3, val4)));
}
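The maximum kernel produces one output per 2x2 input group, so the host repeats it until a 1x1 texture remains. A hedged sketch of that driver loop: tex[0]/tex[1] are ping-pong render textures, maximumProgram is the compiled Cg program for the kernel above, and bindAsTarget/bindAsSource are illustrative helpers (not from the original slides) that attach a texture for writing and reading respectively.

    int src = 0, dst = 1;
    for (int size = M; size > 1; size /= 2) {
        bindAsTarget(tex[dst]);            // output texture, (size/2) x (size/2)
        bindAsSource(tex[src]);            // input texture,   size x size
        cgGLBindProgram(maximumProgram);   // the "maximum" kernel above
        drawQuad(size / 2, size / 2);      // one fragment per output texel
        src = 1 - src;                     // swap ping-pong roles
        dst = 1 - dst;
    }
    // tex[src] is now 1x1 and holds the global maximum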
Matrix-Matrix multiplication
• On CPUs:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        C[i,j] = 0;
        for (k = 0; k < N; k++)
            C[i,j] += A[i,k] * B[k,j];
    }
• Suffers from poor locality: elements of B are accessed column-wise, not in sequential memory order
• CPU implementations therefore "block" the inputs into sub-matrices that fit in the processor cache (see the sketch below)
• Bandwidth limited: the small amount of per-loop data reuse prevents bandwidth amplification from the closest and fastest levels of the memory hierarchy
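A minimal sketch of the cache blocking mentioned above (illustrative only; BS is an assumed block size, N is assumed to be a multiple of BS, and C is assumed zero-initialized):

    // The three outer loops walk BS x BS tiles that fit in cache;
    // the three inner loops are the ordinary triple loop on one tile
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                for (int i = ib; i < ib + BS; i++)
                    for (int j = jb; j < jb + BS; j++)
                        for (int k = kb; k < kb + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];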
Fatahalian et al. Matrix-matrix multiplication on the GPU
• 2x2 blocks of matrix elements are packed into 4-component texels
• Reads 2 rows of matrix A and 2 columns of B to compute 4 elements of the output matrix
• Each shader program performs 4 inner-loop iterations of the CPU program
• GPU texture caches are organized to capture 2D locality, so sequential access along either axis is cache coherent

for (k = 0; k < N/2; k++)
    C[i,j].xyzw += A[i,k].xxzz * B[k,j].xyxy +
                   A[i,k].yyww * B[k,j].zwzw;
Multipass algorithm
• Single pass: the instruction count exceeds the limits of GPU architectures for sufficiently large matrices
• 4 consecutive elements from a matrix column are packed into a texel
• 4x4 matrix by 4x1 vector products
• 4x reuse of data in texel B
• 6 fetches for 4 MAD operations: 1.5x more fetches

C[i,j].xyzw = accum[i,j].xyzw +
              A[i,k].xyzw   * B[k/4,j].xxxx +
              A[i,k+1].xyzw * B[k/4,j].yyyy +
              A[i,k+2].xyzw * B[k/4,j].zzzz +
              A[i,k+3].xyzw * B[k/4,j].wwww;
Measurements – multiplication of
1024 x 1024 matrices
                  Time (s)   GFLOPS   BW (GB/s)
NV5900 single      0.781      2.75      7.22
NV6800 single      0.283      7.59     15.36
ATI 9800 single    0.445      4.83     12.06
P4 ATLAS           0.289      7.78     27.68
NV6800 multi       0.469      4.58     12.79
Available floating point bandwidth from the closest cache on these GPUs is
up to several times lower than that of current CPUs to their L1 caches
Govindaraju et al. Matrix-matrix multiplication on the GPU
• A memory model for scientific algorithms on
Graphics Processors
• Explicit blocking to avoid hardware
dependencies
• Computation on tiles of size T x T is invoked by drawing quadrilaterals of size T x T on the screen.
• A single fragment program evaluates the dot
product from vectors of size D in X and Y
(inputs).
GPU-based nested loop matrix
multiplication
for (ib = 0; ib < N; ib = ib + T)
    for (jb = 0; jb < N; jb = jb + T)
        for (kb = 0; kb < N; kb = kb + D)
            // the following two loops are invoked by drawing a quadrilateral of size T x T
            for (i = ib; i < ib + T; i = i + 1)
                for (j = jb; j < jb + T; j = j + 1)
                    // the following loop is performed inside a fragment program
                    for (k = kb; k < kb + D; k = k + 1)
                        Z[i,j] = Z[i,j] + X[i,k] * Y[k,j]
Measurements
• NVIDIA 7900 GTX GPU
– 17.6 GFLOPS
– 40 GB/s
• NVIDIA 6800 Ultra
– 8 GFLOPS
Matrix-Matrix multiplication in
CUDA/CTM
• ~ 100 GFLOPS
• How?
– Use of extended memory formats (fetch4) and
deeper loop unrolling (R580)
– Shared memory (G80)
• On-chip shared memory allows you to keep blocks of data close to the ALUs rather than having to ping-pong off-chip to an FBO.
• Shared memory is only exposed through CUDA
Introduction to CUDA
• Thread Batching
– Thread Block (each thread identified by its threadID)
• A batch of threads that can cooperate
• Share data through fast shared memory
• Synchronize their execution
– Grid of Thread Blocks (each block identified by its blockID)
• Blocks that execute the same kernel can be batched together
• Reduced thread cooperation between blocks
Memory model
Matrix multiplication using CUDA
• Each thread block is
responsible for computing
one square sub-matrix
Csub of C.
• The block size of Csub is 16, i.e. each block computes a 16 x 16 tile of C
// Host code: multiply the (hA x wA) matrix A by the (wA x wB) matrix B into C
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load A into device global memory
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    // Load B into device global memory
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C in device global memory
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Execution configuration: one thread per element of C
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
// Device kernel: each thread block computes one BLOCK_SIZE x BLOCK_SIZE
// sub-matrix Csub of C; each thread computes one element of Csub
__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C)
{
    // Block and thread indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index just past the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA;   // = wA * (BLOCK_SIZE * by + 1)
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of Csub computed by this thread
    float Csub = 0;

    // Iterate through all sub-matrices of A and B needed for Csub
    for (int a = aBegin, b = bBegin; a < aEnd; a += aStep, b += bStep)
    {
        // Load one sub-matrix of A and one of B from global memory
        // to shared memory (each thread loads one element of each)
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        __syncthreads();

        // Compute the product of the two sub-matrices
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the product of the two sub-matrices
        // is done before loading the next pair
        __syncthreads();
    }

    // Write the block sub-matrix to global memory
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
Conclusions
• Different ways of representing matrices
and vectors on the GPU
• Various algorithms for matrix-matrix
multiplication and vector-vector operations
on the GPU
• Computation throughput of the GPU
• Memory performance of the GPU
Questions?
Thank you
References
• E. S. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. Supercomputing 2001.
• Dominik Göddeke. GPGPU Tutorials. http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/index.html
• Adam Moravanszky, Dense matrix algebra on the GPU.
2003 http://www.shaderx2.com/shaderx.PDF
• Nico Galoppo, Naga K. Govindaraju, Michael Henson and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. Supercomputing 2005.
• CUDA Programming Guide and CUDA BLAS.
http://developer.nvidia.com/object/cuda.html
References ctd..
• K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication. Graphics Hardware 2004.
• Jens Krüger. Linear Algebra on GPUs. SIGGRAPH 2004 course notes.
• Naga K. Govindaraju, Scott Larsen, Jim Gray, Dinesh Manocha. A Memory Model for Scientific Algorithms on Graphics Processors. Supercomputing 2006.