Programming for Parallelism and Locality with Hierarchically Tiled Arrays

Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger,
Gheorghe Almasi*, Basilio B. Fraguela+, Maria J. Garzaran,
David Padua, Christoph von Praun*

University of Illinois at Urbana-Champaign,
*IBM T. J. Watson Research Center,
+Universidade da Coruña
Motivation
The memory hierarchy
  determines performance
  is increasingly complex: registers - caches (L1..L3) - memory - distributed memory
There is a gap between programs and their mapping onto the memory hierarchy.
Given the importance of the memory hierarchy, programming constructs that allow control of locality are necessary.
7/17/2015
HTA
2
Contributions
Design of a programming paradigm that facilitates
control of locality (in sequential and parallel programs)
communication (in parallel programs)
Implementations in MATLAB, C++ and X10
Implementation of NAS benchmarks in our paradigm
Evaluation of productivity and performance
Outline of the talk
Motivation of Our Research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & Future Work
Hierarchically Tiled Arrays (HTA)
Tiled Array: an array partitioned into tiles
Hierarchically Tiled Array: a tiled array whose tiles are, in turn, HTAs
Hierarchically Tiled Arrays
An example tiling, with sizes and a mapping to the memory hierarchy:
  Outer level: 2 X 2 tiles, size = 64 X 64 elements, mapped to distinct modules of a cluster
  Middle level: 4 X 4 tiles, size = 32 X 32 elements, mapped to the L1 cache
  Inner level: 2 X 2 tiles, size = 8 X 8 elements, mapped to registers
Why tiles as first-class objects?
Many algorithms are formulated in terms of tiles,
  e.g. linear algebra algorithms (in LAPACK, ATLAS)
What is special about HTAs, in particular?
  HTAs provide a convenient way for the programmer to express these types of algorithms.
  Their recursive nature makes it easy to fit them to any memory hierarchy.
HTA Construction
hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])

ha = hta(a, {[1,4],[1,3]}, [2,2]);

This partitions array a at rows 1 and 4 and at columns 1 and 3, and maps the resulting 2 X 2 arrangement of tiles onto a 2 X 2 processor mesh (P1..P4).
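The delimiter-based partitioning can be sketched in plain Python/NumPy. The `tile` helper below is ours, not part of the HTA library; delimiters are the 1-based indices where new tiles begin, as on the slide.

```python
import numpy as np

def tile(a, row_starts, col_starts):
    """Split a 2-D array into a list-of-lists of tiles. Each delimiter
    is the 1-based index where a new tile begins (HTA-style)."""
    r = [s - 1 for s in row_starts] + [a.shape[0]]
    c = [s - 1 for s in col_starts] + [a.shape[1]]
    return [[a[r[i]:r[i + 1], c[j]:c[j + 1]] for j in range(len(c) - 1)]
            for i in range(len(r) - 1)]

a = np.arange(36).reshape(6, 6)
ha = tile(a, [1, 4], [1, 3])      # like ha = hta(a, {[1,4],[1,3]}, ...)
print(len(ha), len(ha[0]))        # 2 2: a 2 x 2 arrangement of tiles
print(ha[0][0].shape)             # (3, 2): rows 1..3, columns 1..2
```

In the real constructor the extra `[2,2]` argument would further scatter these tiles over the processor mesh; the sketch only shows the partitioning.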
HTA Addressing
Hierarchical (tile) access:
  h{1,1:2}
  h{2,1}
Flattened (element) access:
  h(3,4)
Hybrid access, hierarchical then flattened:
  h{2,2}(1,2)
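For a regular tiling, the correspondence between flattened and hybrid addressing is just integer division; this is an illustrative sketch (the helper name is ours, not the library's):

```python
def to_hybrid(i, j, tile_rows, tile_cols):
    """Map 1-based flattened element indices (i, j) to 1-based
    (tile, intra-tile) indices for a regular tiling."""
    ti, oi = divmod(i - 1, tile_rows)
    tj, oj = divmod(j - 1, tile_cols)
    return (ti + 1, tj + 1), (oi + 1, oj + 1)

# With 2 x 2 tiles of 2 x 2 elements, h(3,4) names the same
# element as h{2,2}(1,2):
print(to_hybrid(3, 4, 2, 2))      # ((2, 2), (1, 2))
```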
F90 Conformability
Two operands conform if the right-hand side is a scalar,
  size(rhs) == 1
or if both operands have the same shape:
  dim(lhs) == dim(rhs) .and. size(lhs) == size(rhs)
HTA Conformability
All conformable HTAs can be operated on using the primitive operations (add, subtract, etc.).
HTA Assignment
h{:,:} = t{:,:}
h{1,:} = t{2,:}
h{1,:}(2,:) = t{2,:}(1,:);
When the tiles involved live on different processors, these assignments imply implicit communication.
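On a shared-memory NumPy stand-in (4 x 4 arrays viewed as 2 x 2 tiles of 2 x 2 elements), the three assignments above correspond to ordinary slice copies; in the distributed HTA the same statements would trigger communication:

```python
import numpy as np

h = np.zeros((4, 4))
t = np.arange(16.0).reshape(4, 4)

h[:, :] = t                 # h{:,:} = t{:,:}  : copy everything
h[0:2, :] = t[2:4, :]       # h{1,:} = t{2,:}  : second tile row -> first
h[1, :] = t[2, :]           # h{1,:}(2,:) = t{2,:}(1,:) : one row of a tile row
```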
Data-parallel functions in F90
An intrinsic such as sin, applied to an array, is applied elementwise to every element.
Map
r = map (@sin, h)
map is recursive: @sin is pushed down through the tiles at each level of the hierarchy and is finally applied to the elements of the leaf tiles.
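The recursive semantics of map can be sketched over nested Python lists standing in for the tile hierarchy, with NumPy arrays as leaf tiles; `hta_map` is our name for the sketch, not the library's:

```python
import numpy as np

def hta_map(f, h):
    """Apply f recursively: descend through tiles, apply at the leaves."""
    if isinstance(h, list):
        return [hta_map(f, t) for t in h]
    return f(h)                     # leaf tile: a plain array

h = [[np.zeros((2, 2)), np.ones((2, 2))],
     [np.ones((2, 2)), np.zeros((2, 2))]]
r = hta_map(np.sin, h)              # r = map(@sin, h)
print(r[0][1][0, 0])                # sin(1.0) ≈ 0.8415
```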
Reduce
r = reduce (@max, h)
reduce is also recursive: @max is applied within each leaf tile, and the partial results are then combined upward, level by level, until a single value remains.
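The same nested-list stand-in gives a sketch of the recursive reduction; `hta_reduce` is our name, and `op` is any associative binary operator:

```python
import numpy as np

def hta_reduce(op, h):
    """Reduce each leaf tile, then combine partial results upward
    through the tile hierarchy."""
    if isinstance(h, list):
        parts = [hta_reduce(op, t) for t in h]
        out = parts[0]
        for p in parts[1:]:
            out = op(out, p)
        return out
    out = h.flat[0]                 # leaf: reduce the elements
    for v in list(h.flat)[1:]:
        out = op(out, v)
    return out

h = [[np.array([[1, 5], [2, 3]]), np.array([[9, 4], [0, 7]])],
     [np.array([[6, 6], [8, 2]]), np.array([[3, 1], [4, 4]])]]
print(hta_reduce(max, h))           # 9, like r = reduce(@max, h)
```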
Higher level operations
repmat(h, [1, 3])
circshift( h, [0, -1] )
transpose(h)
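NumPy has elementwise analogues of these operations; the HTA versions apply the same idea at the tile level (whole tiles are replicated, shifted, or transposed). A minimal sketch of the element-level analogues:

```python
import numpy as np

h = np.array([[1, 2],
              [3, 4]])

print(np.tile(h, (1, 3)))       # repmat(h, [1, 3]): a 1 x 3 replication
print(np.roll(h, -1, axis=1))   # circshift(h, [0, -1]): shift columns left
print(h.T)                      # transpose(h)
```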
Outline of the talk
Motivation of Our Research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & Future Work
Blocked Recursive Matrix Multiplication
Blocked-recursive implementation; A{i, k}, B{k, j} and C{i, j} are submatrices of A, B and C.

function C = matmul (A, B, C)
  if (level(A) == 0)
    C = matmul_leaf (A, B, C)
  else
    for i = 1:size(A, 1)
      for k = 1:size(A, 2)
        for j = 1:size(B, 2)
          C{i, j} = matmul(A{i, k}, B{k, j}, C{i, j});
        end
      end
    end
  end
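The same recursion runs as a Python sketch over nested-list tiles with NumPy arrays at the leaves; `matmul` and the `split` helper are our stand-ins for the HTA machinery:

```python
import numpy as np

def matmul(A, B, C):
    """Blocked-recursive multiply: recurse over tiles, multiply at leaves."""
    if not isinstance(A, list):            # level 0: plain arrays
        C += A @ B
        return C
    for i in range(len(A)):
        for k in range(len(A[0])):
            for j in range(len(B[0])):
                C[i][j] = matmul(A[i][k], B[k][j], C[i][j])
    return C

def split(M, t):
    """Partition a square array into t x t equal tiles (helper, ours)."""
    s = M.shape[0] // t
    return [[M[i*s:(i+1)*s, j*s:(j+1)*s].copy() for j in range(t)]
            for i in range(t)]

a = np.random.rand(4, 4); b = np.random.rand(4, 4)
C = matmul(split(a, 2), split(b, 2), split(np.zeros((4, 4)), 2))
# np.block(C) equals a @ b (up to rounding)
```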
Matrix Multiplication
Each call descends one level of the tiling:
matmul (A, B, C)
matmul (A{i, k}, B{k, j}, C{i, j})
matmul (AA{i, k}, BB{k, j}, CC{i, j})
matmul_leaf (AAA{i, k}, BBB{k, j}, CCC{i, j})

At the leaf level:
for i = 1:size(A, 1)
  for k = 1:size(A, 2)
    for j = 1:size(B, 2)
      C(i,j) = C(i,j) + A(i, k) * B(k, j);
    end
  end
end
SUMMA Matrix Multiplication
Outer-product method:
for k=1:n
  for i=1:n
    for j=1:n
      C(i,j) = C(i,j) + A(i,k)*B(k,j);
    end
  end
end

Moving the k loop outermost turns the update into a sum of outer products:
for k=1:n
  C(:,:) = C(:,:) + A(:,k) * B(k,:);
end
SUMMA Matrix Multiplication
function C = summa (A, B, C)
  for k=1:m
    T1 = repmat(A{:, k}, 1, m);
    T2 = repmat(B{k, :}, m, 1);
    C = C + T1 * T2;
  end

Here C{i,j}, A{i,k} and B{k,j} represent submatrices, and the * operator represents tile-by-tile matrix multiplication; repmat replicates the k-th tile column of A and the k-th tile row of B so each product can be formed locally.
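A sequential NumPy sketch of the outer-product structure, with tiles addressed by slicing instead of {} indexing (`m` block steps; names are ours). In serial code the repmat is unnecessary because a full slice already reaches every block; in the HTA version repmat replicates tiles across the processor mesh:

```python
import numpy as np

def summa(A, B, m):
    """SUMMA: accumulate m outer products of block column k of A
    with block row k of B."""
    bs = A.shape[0] // m
    C = np.zeros((A.shape[0], B.shape[1]))
    for k in range(m):
        T1 = A[:, k*bs:(k+1)*bs]   # k-th block column (the repmat of A{:,k})
        T2 = B[k*bs:(k+1)*bs, :]   # k-th block row (the repmat of B{k,:})
        C += T1 @ T2               # tile-by-tile multiply-accumulate
    return C

a = np.random.rand(6, 6); b = np.random.rand(6, 6)
# summa(a, b, 3) equals a @ b (up to rounding)
```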
Outline of the talk
Motivation of Our Research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & Future Work
Implementation
MATLAB extension with HTA
  Global view & single threaded
  Front-end is pure MATLAB
  MPI calls using the MEX interface
  MATLAB library routines at the leaf level
X10 extension
  Emulation only (no experimental results)
C++ extension with HTA (in progress)
  Communication using MPI (& UPC)
Evaluation
Implemented the full set of NAS benchmarks (except SP)
  Pure MATLAB & MATLAB + HTA
  Performance metrics: running time & relative speedup on 1-128 processors
  Productivity metric: source lines of code
  Compared with hand-optimized F77+MPI versions
  Tiles used only for parallelism
Matrix-matrix multiplication using HTAs in C++
  Tiles used for locality
Illustrations
MG
3D stencil convolution
  Primitive HTA operations and assignments implement the communication
  Uses the restrict and interpolate inter-grid operators
MG
Communication (boundary exchange between tiles):
r{1,:}(2,:) = r{2,:}(1,:);

Computation (weighted stencil over the tile interiors, shown for n = 3):
r{:,:}(2:n-1, 2:n-1) = 0.5 * u{:,:}(2:n-1, 2:n-1)
  + 0.25 * (u{:,:}(1:n-2, 2:n-1) + u{:,:}(3:n, 2:n-1)
          + u{:,:}(2:n-1, 1:n-2) + u{:,:}(2:n-1, 3:n))
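A sequential NumPy sketch of the computation step on a single tile (the same five-point weighted average over the interior; the `smooth` name is ours):

```python
import numpy as np

def smooth(u):
    """Five-point weighted average over the interior, as on the slide:
    r = 0.5*center + 0.25*(north + south + west + east)."""
    r = np.zeros_like(u)
    r[1:-1, 1:-1] = (0.5 * u[1:-1, 1:-1]
                     + 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                               + u[1:-1, :-2] + u[1:-1, 2:]))
    return r

u = np.ones((5, 5))
print(smooth(u)[2, 2])      # 0.5*1 + 0.25*4 = 1.5
```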
CG
Sparse matrix-vector multiplication (A*p)
  2D decomposition of A
  replication and 2D decomposition of p
  sum reduction across columns
CG
A = hta(MX, {partition_A}, [M N]);                        % 1. distribute A on an M x N mesh
p = repmat(hta(vector', {partition_p}, [1 N]), [M, 1]);   % 2. partition p and replicate it across tile rows
p = map (@transpose, p);                                  % 3. transpose each replica of p
R = reduce ('sum', A * p, 2, true);                       % 4. tile-by-tile multiply, then sum-reduce across columns
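The four steps above can be checked sequentially with NumPy, using a dense stand-in for the sparse A and an M x N = 2 x 2 tiling (all names here are ours):

```python
import numpy as np

M = N = 2; bs = 3; n = M * bs
A = np.random.rand(n, n)
p = np.random.rand(n)

# Tile (i, j) of A multiplies block j of the replicated p; the
# sum reduction across the tile columns (axis j) yields A*p.
R = np.zeros((M, N, bs))
for i in range(M):
    for j in range(N):
        R[i, j] = A[i*bs:(i+1)*bs, j*bs:(j+1)*bs] @ p[j*bs:(j+1)*bs]
q = R.sum(axis=1).reshape(n)    # the reduce('sum', ..., 2, true) step
# q equals A @ p (up to rounding)
```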
FT
3D Fourier transform
  Corner turn using dpermute
  Fourier transform applied to each tile
h = ft (h, 1);
h = dpermute (h, [2 1]);
h = ft (h, 1);
(The original slide includes a 2D illustration: dpermute transposes a 4 X 4 grid of values, so the second ft again works along dimension 1.)
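In 2D, the transform-permute-transform pattern can be verified with NumPy's FFT (a sequential sketch of the corner-turn idea, not the distributed code):

```python
import numpy as np

a = np.random.rand(8, 8)

h = np.fft.fft(a, axis=0)       # h = ft(h, 1): FFT along dimension 1
h = h.T                         # h = dpermute(h, [2 1]): corner turn
h = np.fft.fft(h, axis=0)       # h = ft(h, 1) again, on the permuted data

# h now equals the transpose of the full 2-D FFT of a
```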
IS
Bucket sort of integers; bucket exchange implemented with tile assignments (h and r are distributed at the top level).
Example keys: 8 2 1 4 7 9 11 16 18 3 6 10 12 13 15 5 14 17
1. Distribute keys
2. Local bucket sort
3. Aggregate
4. Bucket exchange:
for i = 1:n
  for j = 1:n
    r{i}{j} = h{j}{i};
  end
end
5. Local sort (@sort on each tile)
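A sequential Python sketch of the five steps with two tiles standing in for two processors and buckets chosen by value range (all names and the bucket rule are ours):

```python
keys = [8, 2, 1, 4, 7, 9, 11, 16, 18, 3, 6, 10, 12, 13, 15, 5, 14, 17]

n = 2                                    # number of tiles ("processors")
h = [keys[:9], keys[9:]]                 # 1. distribute keys
# 2./3. local bucket sort and aggregate: each tile splits its keys
# into n buckets (keys 1-9 -> bucket 0, keys 10-18 -> bucket 1)
h = [[[k for k in t if (k - 1) // 9 == b] for b in range(n)]
     for t in h]
# 4. bucket exchange: r{i}{j} = h{j}{i}
r = [[h[j][i] for j in range(n)] for i in range(n)]
# 5. local sort of each tile's concatenated buckets
out = [sorted(b[0] + b[1]) for b in r]
print(out[0])    # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```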
Experimental Results
NAS benchmarks (MATLAB); detailed results in the paper:
  CG & FT: acceptable performance and speedup
  MG, LU, BT: poor performance, but fair speedup
C++ results are yet to be obtained.
Matrix Multiplication (C++ implementation)
(Chart: MFLOPS of the HTA version vs. ATLAS for matrix sizes 504, 1008, 2016, 3024 and 4032; y-axis 0-3000 MFLOPS.)
Lines of Code
(Chart: source lines of code for the EP, CG, MG, FT and LU benchmarks.)
Outline of the talk
Motivation of Our Research
Hierarchically Tiled Arrays
Examples
Evaluation
Conclusion & Future Work
Conclusion
HTA is the first attempt to make tiles part of a language.
Treating tiling as a syntactic entity helps improve productivity.
Extending the vector operations to tiles makes programs more readable.
HTAs provide a convenient way to express algorithms with a tiled formulation.
Future work
  C++ library + NAS benchmarks
  Scalability experiments
  Evolving the language for locality
THE END
Backup slides
/* Conjugate Gradient benchmark in C++ */
void conj_grad(SparseHTA A, HTA x, HTA z) {
    z = 0.0; q = 0.0;
    r = x; p = r;
    rho = HTA::dotproduct(r, r);
    for (int i = 0; i < CGITMAX; i++) {
        qt = 0;
        pt = p.ltranspose();
        SparseHTA::lmatmul(A, pt, &qt);              /* compute A*p */
        qt = qt.reduce(plus<double>(), 0.0, 1, 0x2);
        q = qt.transpose();
        alpha = rho / HTA::dotproduct(p, q);
        z = z + alpha * p;  r = r - alpha * q;
        rho0 = rho;
        rho = HTA::dotproduct(r, r);
        beta = rho / rho0;
        p = r + beta * p;
    }
    /* compute the norm */
    qt = 0;
    HTA zt = z.ltranspose();
    SparseHTA::lmatmul(A, zt, &qt);
    qt = qt.reduce(plus<double>(), 0.0, 1, 0x2);
    r = qt.transpose();
    HTA f = x - r;
    f = f * f;
    norm = f.reduce(plus<double>(), 0.0);
    rnorm = sqrt(norm);
}
Related Work
Languages
  ZPL, CAF, UPC, HPF, X10, Chapel
  Parallel MATLABs excluded
Classification
  Global vs. segmented memory
  Single vs. multiple threads of execution
  Language vs. library
Classification
(Figure: CAF, Titanium, UPC, X10, HPF, ZPL, POOMA, HTA, Global Arrays, POET and MPI/PVM placed along the axes of the classification: global view vs. segmented memory, single- vs. multiple-threaded, and language vs. library.)
Uni-processor performance challenges
MATLAB
Limited conversion to vector format
Interpretation costs due to loops
Copy-on-write during assignments
Poor data abstraction with higher dimensionality
C++
Run time costs (e.g. virtual functions)
Extensible, polymorphic design
Syntax constraints leading to proxies
Multi-processor performance challenges
Eliminating temporaries in expression evaluation
Scalable communication operations
  e.g. reduction & transpose in CG
  e.g. permute in FT
  e.g. cumsum & bucket exchange in IS
Many-to-one and other complex mappings
  e.g. stencil convolutions in MG
  e.g. BT
Overlapping computation & communication
  e.g. LU