An Interactive Environment for Combinatorial Supercomputing

John R. Gilbert
University of California, Santa Barbara
Viral Shah (UCSB)
Steve Reinhardt (Silicon Graphics)
with thanks to Alan Edelman (MIT & ISC)
and Jeremy Kepner (MIT-LL)
Support: DOE Office of Science, DARPA, SGI, ISC
Parallel Computing Today
[Images: Columbia, the supercomputer at NASA Ames Research Center, and a lab Beowulf cluster]
Combinatorial Scientific Computing
Emerging large scale, high-performance applications:
• Web search and information retrieval
• Knowledge discovery
• Machine learning
• Computational biology
• Bioinformatics
• Sparse matrix methods
• Geometric modeling
• ...

How will combinatorial methods be used by nonexperts?
Analogy: Matrix Division in Matlab
x = A \ b;
• Works for either full or sparse A
• Is A square?
    no  => use QR to solve least squares problem
• Is A triangular or permuted triangular?
    yes => sparse triangular solve
• Is A symmetric with positive diagonal elements?
    yes => attempt Cholesky after symmetric minimum degree
• Otherwise
    => use LU on A(:, colamd(A))
    (a sketch of this dispatch appears below)
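The dispatch above can be made concrete. The following is a minimal sketch for sparse A; the function names backslash_sketch and lu_colamd are illustrative, and MATLAB's actual mldivide has more cases (including permuted-triangular detection) and tuned heuristics.

% Minimal sketch of the dispatch above, for sparse A (illustrative only).
function x = backslash_sketch(A, b)
    [m, n] = size(A);
    if m ~= n
        [c, R] = qr(A, b, 0);              % rectangular: least squares via sparse QR
        x = R \ c;
    elseif istriu(A) || istril(A)
        x = A \ b;                         % triangular: direct substitution
    elseif issymmetric(A) && all(diag(A) > 0)
        p = symamd(A);                     % symmetric minimum degree ordering
        [R, flag] = chol(A(p, p));
        if flag == 0
            x = zeros(n, 1);
            x(p) = R \ (R' \ b(p));        % Cholesky succeeded
        else
            x = lu_colamd(A, b);           % not positive definite: fall back to LU
        end
    else
        x = lu_colamd(A, b);               % general square case
    end
end

function x = lu_colamd(A, b)
    q = colamd(A);                         % column ordering for sparsity
    [L, U, P] = lu(A(:, q));
    x = zeros(size(A, 2), 1);
    x(q) = U \ (L \ (P * b));
end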
Interactive Data Exploration
RMAT Approximate Power-Law Graph
Strongly Connected Components
Matlab*P
% Power iteration for the dominant eigenvector of a distributed matrix
A = rand(4000*p, 4000*p);
x = randn(4000*p, 1);
y = zeros(size(x));
while norm(x-y) / norm(x) > 1e-11
    y = x;
    x = A*x;
    x = x / norm(x);
end;
Matlab*P History
• Matlab*P 1.0 (1998): Edelman, Husbands, Isbell (MIT)
• Matlab*P 2.0 (2002- ): MIT / UCSB / LBNL
• Star-P (2004- ): Interactive Supercomputing / SGI
Data-Parallel Operations
< M A T L A B >
Copyright 1984-2001 The MathWorks, Inc.
Version 6.1.0.1989a Release 12.1
>> A = randn(500*p, 500*p)
A = ddense object: 500-by-500
>> E = eig(A);
>> E(1)
ans = -4.6711 +22.1882i
>> e = pp2matlab(E);
>> ppwhos
  Name   Size         Bytes   Class
  A      500px500p      688   ddense object
  E      500px1         652   ddense object
  e      500x1         8000   double array (complex)
Task-Parallel Operations
>> quad('4./(1+x.^2)', 0, 1)
ans = 3.14159270703219
>> a = (0:3*p) / 4
a = ddense object: 1-by-4
>> pp2matlab(a)
ans =
        0   0.25000000000000   0.50000000000000   0.75000000000000
>> b = a + .25;
>> c = ppeval('quad', '4./(1+x.^2)', a, b)
c = ddense object: 1-by-4
>> sum(c)
ans = 3.14159265358979
Star-P Architecture
[Architecture diagram. Client side: MATLAB® with ordinary Matlab variables and the Star-P client manager. Server side: the server manager, package manager, and matrix manager running on processors #0 through #n-1, which hold the distributed matrices; server packages include dense/sparse operations, sort, ScaLAPACK, FFTW, an FPGA interface, and MPI/UPC user code.]
Distributed Sparse Array Structure
[Figure: the rows of a sparse matrix distributed across processors P0, P1, P2, …, Pn, with each processor holding its rows in compressed form.]

Each processor stores:
• # of local nonzeros (# local edges)
• range of local rows (local vertices)
• nonzeros in a compressed row data structure (local edges)
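As a concrete picture, here is a hedged sketch of what one processor might store for its block of rows, using a small 3-by-3 example; the struct and field names are illustrative, not Star-P's actual internals.

% Illustrative sketch of one processor's piece of a row-distributed
% sparse matrix, in compressed sparse row (CSR) form.
local = struct( ...
    'firstrow', 1, ...                 % range of local rows (local vertices)
    'lastrow',  3, ...
    'nnz',      5, ...                 % number of local nonzeros (local edges)
    'rowptr',   [1 3 4 6], ...         % row k's nonzeros sit in positions rowptr(k):rowptr(k+1)-1
    'colind',   [1 3 2 1 2], ...       % column indices of the local nonzeros
    'val',      [31 53 59 41 26]);     % nonzero values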
The sparse( ) Constructor
• A = sparse (I, J, V, nr, nc);
• Input: ddense vectors I, J, V, dimensions nr, nc
• Output: A(I(k), J(k)) = V(k)
• Sums values with duplicate indices
• Sorts triples < i, j, v > by < i, j >
• Inverse: [I, J, V] = find(A);
  (a usage sketch follows below)
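A small usage sketch (values invented for illustration): build a directed graph's adjacency matrix from edge triples, noting that the duplicate (2,3) entries are summed.

% Build a 4-vertex directed graph from edge triples; duplicates are summed.
I = [1 2 2 4 4];                 % source vertices
J = [2 3 3 1 3];                 % destination vertices
V = [1 1 1 1 1];                 % edge weights: the repeated (2,3) edge sums to 2
A = sparse(I, J, V, 4, 4);
[I2, J2, V2] = find(A);          % inverse: recover the sorted, summed triples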
Sparse Array and Matrix Operations
• dsparse layout, same semantics as ddense
• Matrix arithmetic: +, max, sum, etc.
• matrix * matrix and matrix * vector
• Matrix indexing and concatenation
      A(1:3, [4 5 2]) = [ B(:, J) C ];
• Linear solvers: x = A \ b; using SuperLU (MPI)
• Eigensolvers: [V, D] = eigs(A); using PARPACK (MPI)
Large-Scale Graph Algorithms
• Graph theory, algorithms, and data structures are ubiquitous in sparse matrix computation.
• Time to turn the relationship around!
• Represent a graph as a sparse adjacency matrix.
• A sparse matrix language is a good start on primitives for computing with graphs.
• Leverage the mature techniques and tools of high-performance numerical computation.
Sparse Adjacency Matrix and Graph

[Figure: a directed graph on vertices 1–7 and its sparse adjacency matrix; multiplying Aᵀ by an indicator vector x selects the neighbors of the vertices in x, giving Aᵀx.]

• Adjacency matrix: sparse array with nonzeros for graph edges
• Storage-efficient implementation from sparse data structures
Breadth-First Search: Sparse mat * vec

[Figure (three build-up slides): the same 7-vertex graph; multiplying the frontier vector x by Aᵀ advances the search one level, giving Aᵀx and then (Aᵀ)²x.]

• Multiply by adjacency matrix → step to neighbor vertices
• Work-efficient implementation from sparse data structures (sketch below)
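The picture above translates directly into code. The following is a hedged MATLAB sketch (our own function name, not Star-P's kernel) of level-synchronous BFS by repeated sparse matrix-vector products; it assumes A(i,j) is nonzero for each edge i → j.

% Level-synchronous BFS by repeated sparse mat*vec (illustrative sketch).
function level = bfs_matvec(A, source)
    n = size(A, 1);
    level = -ones(n, 1);                   % -1 marks unreached vertices
    x = sparse(source, 1, 1, n, 1);        % frontier indicator vector
    d = 0;
    while nnz(x) > 0
        frontier = full(x ~= 0) & (level < 0);
        level(frontier) = d;               % label newly reached vertices
        x = A' * x;                        % step to neighbor vertices (Aᵀx)
        x = x .* (level < 0);              % drop already-visited vertices
        d = d + 1;
    end
end

For example, level = bfs_matvec(A, 1) returns each vertex's BFS level from vertex 1, with -1 marking unreachable vertices.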
Connected Components of a Graph
• Sequential Matlab uses depth-first search (dmperm), which doesn't parallelize well
• Pointer-jumping algorithms (Shiloach/Vishkin & descendants) – see the sketch below
  – repeat
      • Link every (super)vertex to a neighbor
      • Shrink each tree to a supervertex by pointer jumping
  – until no further change
• Other graph kernels in progress:
  – Shortest-path search (after Husbands, LBNL)
  – Bipartite matching (after Riedy, UCB)
  – Strongly connected components (after Pinar, LBNL)
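A hedged sketch of the link-and-shrink loop in MATLAB, in the spirit of Shiloach/Vishkin; the function name and the min-label linking rule are illustrative choices, not the exact Star-P kernel. G is a symmetric sparse adjacency matrix.

% Pointer-jumping connected components (illustrative sketch).
function comp = components_sketch(G)
    n = size(G, 1);
    comp = (1:n)';                           % every vertex starts as its own supervertex
    while true
        % Link: each vertex adopts the smallest label among itself and its neighbors.
        [i, j] = find(G);
        minnbr = accumarray(i, comp(j), [n 1], @min, inf);
        newcomp = min(comp, minnbr);
        % Shrink: compress trees by pointer jumping.
        newcomp = newcomp(newcomp);
        if isequal(newcomp, comp)
            break                            % no further change
        end
        comp = newcomp;
    end
end

At termination comp is constant on each connected component, so equal labels identify vertices in the same component.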
Maximal Independent Set

[Figure (three build-up slides on the 7-vertex example graph):
 (1) starting guess: select some vertices randomly;
 (2) if neighbors are selected, keep only a higher-degree one, and add the selected vertices to the independent set;
 (3) discard neighbors of the independent set and iterate on the rest of the graph.]

degree = sum(G, 2);
prob = 1 ./ (2 * degree);
select = rand(n, 1) < prob;
if any(select & (G * select))
    % keep higher-degree vertices
end
IndepSet = [IndepSet select];
neighbor = neighbor | (G * select);
remain = neighbor == 0;
G = G(remain, remain);
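The slides show one round of the loop. Below is a hedged sketch of a complete Luby-style loop with the bookkeeping filled in; the function name, the score-based tie-breaking, and the handling of the remaining-vertex list are illustrative additions, not Star-P's exact kernel.

% Complete Luby-style maximal independent set loop (illustrative sketch).
function IndepSet = mis_sketch(G)
    n = size(G, 1);
    vertices = (1:n)';                     % global names of the remaining vertices
    IndepSet = [];
    while ~isempty(vertices)
        m = numel(vertices);
        degree = sum(G, 2);
        prob = 1 ./ (2 * degree);          % isolated vertices get prob = Inf (always selected)
        select = rand(m, 1) < prob;
        % Break conflicts: a selected vertex survives only if no selected
        % neighbor has a higher score (degree, with index as tie-breaker).
        score = full(degree) + (1:m)' / (m + 1);
        nbrmax = max(G * spdiags(select .* score, 0, m, m), [], 2);
        keep = select & (score > nbrmax);
        % Add survivors to the independent set, discard them and their
        % neighbors, and iterate on the rest of the graph.
        IndepSet = [IndepSet; vertices(keep)];
        neighbor = keep | (G * double(keep) > 0);
        remain = ~neighbor;
        G = G(remain, remain);
        vertices = vertices(remain);
    end
end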
SSCA#2: “Graph Analysis” Benchmark (spec version 1)

Fine-grained, irregular data access
Searching and clustering

• Many tight clusters, loosely interconnected
• Input data is edge triples < i, j, label(i,j) >
• Vertices and edges permuted randomly
SSCA#2: Scalable Data Generator
• Given “scale” = log2(#vertices)
• Creates edge triples < i, j, label(i,j) >
• Randomly permutes triples and vertex numbers

Scale   #Vertices        #Cliques    #Edges Directed      #Edges Undirected
10      1,024            186         13,212               3,670
15      32,768           2,020       1,238,815            344,116
20      1,048,576        20,643      126,188,649          35,052,403
25      33,554,432       207,082     12,951,350,000       3,597,598,000
30      1,073,741,824    2,096,264   1,317,613,000,000    366,003,600,000

Statistics for SSCA2 spec v1.1
Concise SSCA#2 in Star-P
Kernel 1: Construct graph data structures
• Graphs are dsparse matrices, created by sparse( )
Kernels 2 and 3

Kernel 2: Search by edge labels
• About 12 lines of executable Matlab or Star-P

Kernel 3: Extract subgraphs
• Returns subgraphs consisting of vertices and edges within fixed distance of given starting vertices
• Sparse matrix-matrix product for multiple breadth-first search
• About 25 lines of executable Matlab or Star-P
Kernel 4: Clustering by BFS
• Grow local clusters from many seeds in parallel
• Breadth-first search by sparse matrix * matrix
• Cluster vertices connected by many short paths
% Grow each seed to vertices reached by
% at least k paths of length 1 or 2
C = sparse(seeds, 1:ns, 1, n, ns);
C = A * C;
C = C + A * C;
C = C >= k;
Kernel 4: Clustering by Peer Pressure
[Figure: a 13-vertex example graph used to illustrate peer-pressure clustering.]

Steps in a peer pressure algorithm:
1. Vote for a cluster leader
2. Collect neighbor votes
3. Vote for a new leader (based on neighbor votes)

• Clustering quality depends on details of each step.
• Want relatively few potential leaders, e.g. a maximal independent set. Other choices are possible – for the SSCA2 graph, simpler rules work too.
• Neighbor votes can be combined using various weightings.
• Each version of kernel 4 is about 25 lines of code.
Kernel 4: Clustering by Peer Pressure
[ignore, leader] = max(G);
S = G * sparse(1:n, leader, 1, n, n);
[ignore, leader] = max(S);

[Figure: the example graph with each vertex labeled by its current leader vote.]

• Each vertex votes for its highest-numbered neighbor as its leader
• Number of leaders is approximately the number of clusters (small relative to the number of nodes)
Kernel 4: Clustering by Peer Pressure
[ignore, leader] = max(G);
S = sparse(leader, 1:n, 1, n, n) * G;
[ignore, leader] = max(S);

[Figure: the example graph with each vertex labeled by its leader vote after gathering neighbor votes.]

• Matrix multiplication gathers neighbor votes
• S(i,j) is the number of votes for i from j's neighbors
• In SSCA2 (spec 1.0), most of the graph structure is recovered right away; iteration is needed for harder graphs (a one-sweep sketch follows below)
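The two voting steps can be wrapped into one function. A hedged sketch of a single sweep of the slides' vote/gather/revote sequence (illustrative function name, not the full SSCA#2 kernel):

% One peer-pressure sweep: vote, gather votes by matrix multiply, revote.
function leader = peer_pressure_step(G)
    n = size(G, 1);
    [ignore, leader] = max(G);                  % each vertex's initial leader vote
    S = sparse(leader, 1:n, 1, n, n) * G;       % S(i,j) = votes for i among j's neighbors
    [ignore, leader] = max(S);                  % adopt the most-voted leader
    leader = leader(:);                         % column of leader labels, one per vertex
end

On harder graphs the vote/gather steps are iterated, as the slide notes.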
Scaling Up
Recent results on SGI Altix (up to 128 processors):
• Have run SSCA2 (spec v1.0) on graphs with 2^27 = 134 million vertices and about one billion (10^9) edges
• Have manipulated graphs with 400 million vertices and 4 billion edges
• Timings scale well – for large graphs,
  – 2x problem size → 2x time
  – 2x problem size & 2x processors → same time
• Tracking new SSCA2 draft spec v2.0, in progress

Using this benchmark to tune Star-P sparse array infrastructure
Toolbox for Graph Analysis and Pattern Discovery

Layer 1: Graph Theoretic Tools
• Graph operations
• Global structure of graphs
• Graph partitioning and clustering
• Graph generators
• Visualization and graphics
• Scan and combining operations
• Utilities
Sparse Matrix times Sparse Matrix
• Shows up often as a primitive.
• Graphs are mostly not mesh-like, i.e. they lack geometric locality and good separators.
• On a 2D processor grid, the parallel sparse algorithm looks much like the parallel dense algorithm.
• Redistribute to a round-robin cyclic or random distribution for load balance.
Load Balance Without Redistribution
Load Balance With Redistribution
Compressed Sparse Matrix Storage
    31   0  53
     0  59   0
    41  26   0

• Full storage:
  • 2-dimensional array.
  • (nrows*ncols) memory.

value:    31 41 59 26 53
row:       1  3  2  3  1
colstart:  1  3  5  6

• Sparse storage:
  • Compressed storage by columns (CSC).
  • Three 1-dimensional arrays.
  • (2*nzs + ncols + 1) memory.
  • Similarly, CSR.
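MATLAB itself stores sparse matrices in this CSC form. A small sketch recovering the three arrays for the example above:

% Recover the CSC arrays for the 3-by-3 example above.
A = sparse([31 0 53; 0 59 0; 41 26 0]);
[row, col, value] = find(A);                      % column-major: row = [1 3 2 3 1]', value = [31 41 59 26 53]'
colstart = cumsum([1, full(sum(A ~= 0, 1))]);     % colstart = [1 3 5 6]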
Single Processor MatMul: C = A * B
C(:, :) = 0;
for i = 1:n
    for j = 1:n
        for k = 1:n
            C(i, j) = C(i, j) + A(i, k) * B(k, j);
        end
    end
end
• The n^3 scalar updates can be done in any order.
• Six possible algorithms: ijk, ikj, jik, jki, kij, kji (lots more if you consider blocking for cache)
• Goal: a sparse algorithm with time = O(nonzero flops)
• For sparse, even time = O(n^2) is too slow!
Possible Organizations of Sparse MatMul
Barriers to O(flops) work

Outer product:
    for k = 1:n
        C = C + A(:, k) * B(k, :)
    – Inserting updates into C is too slow

Inner product:
    for i = 1:n
        for j = 1:n
            C(i, j) = A(i, :) * B(:, j)
    – n^2 loop iterations cost too much if C is sparse

Column by column:
    for j = 1:n
        for k where B(k, j) ≠ 0
            C(:, j) = C(:, j) + A(:, k) * B(k, j)
    – Loop k only over nonzeros in column j of B
    – Use sparse accumulator (SPA) for column updates
Sparse Accumulator (SPA)
• Abstract data type for a single matrix column
• Operations on SPA:
    – initialize spa                        O(n) time, O(n) storage
    – spa = spa + scalar * (CSC vector)     O(nnz(vector)) time
    – (CSC vector) = spa                    O(nnz(spa)) time
    – spa = 0                               O(nnz(spa)) time
    – … possibly other ops
• Standard implementation of SPA (many variations) – see the sketch below:
    – n-element array “value”
    – n-element flag array “is-nonzero”
    – linked structure (or other) to gather nonzeros from spa
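A hedged sketch of the standard implementation in MATLAB terms; the field names and the example vector are illustrative, not Star-P's internals.

% One possible SPA layout: dense value array, is-nonzero flags, and a
% list of occupied indices so gather and reset cost O(nnz(spa)).
n = 8;
spa.value    = zeros(n, 1);
spa.occupied = false(n, 1);
spa.nzlist   = zeros(0, 1);                       % indices of current nonzeros

% spa = spa + scalar * (CSC vector v):  O(nnz(v)) time
v = sparse([2 5 7]', 1, [1 2 3]', n, 1);
scalar = 10;
[i, ignore, vals] = find(v);
spa.nzlist = [spa.nzlist; i(~spa.occupied(i))];   % record newly occupied slots
spa.occupied(i) = true;
spa.value(i) = spa.value(i) + scalar * vals;

% (CSC vector) = spa, then spa = 0:  O(nnz(spa)) time
c = sparse(spa.nzlist, 1, spa.value(spa.nzlist), n, 1);
spa.value(spa.nzlist)    = 0;
spa.occupied(spa.nzlist) = false;
spa.nzlist = zeros(0, 1);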
CSC Sparse Matrix Multiplication with SPA

for j = 1:n
    C(:, j) = A * B(:, j)

[Figure: C = A x B computed column by column; each column of B drives a scatter/accumulate into the SPA, which is then gathered into the corresponding column of C.]

All matrix columns and vectors are stored compressed except the SPA.
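A hedged sketch of the loop above, with a dense array standing in for the SPA (illustrative function name; a real SPA also keeps its list of occupied indices, as sketched earlier, so the gather and reset cost O(nnz(spa)) rather than O(n)).

% Column-by-column sparse matrix multiplication with an accumulator.
function C = spmatmul_spa(A, B)
    n = size(B, 2);
    spa = zeros(size(A, 1), 1);             % accumulator for one column of C
    cols = cell(1, n);
    for j = 1:n
        [k, ignore, bkj] = find(B(:, j));   % loop k only over nonzeros of B(:, j)
        for t = 1:numel(k)
            [i, ignore, aik] = find(A(:, k(t)));
            spa(i) = spa(i) + aik * bkj(t); % scatter/accumulate into the SPA
        end
        cols{j} = sparse(spa);              % gather column j of C
        spa(:) = 0;                         % reset (O(n) here; O(nnz) with a real SPA)
    end
    C = [cols{:}];
end

For sparse A and B, C = spmatmul_spa(A, B) matches A * B.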
Remarks
• Tools for combinatorial methods built on parallel sparse matrix infrastructure
• Easy-to-use interactive programming environment
  – Rapid prototyping tool for algorithm development
  – Interactive exploration and visualization of data
  – Alan Edelman: “parallel computing is fun again”
• Sparse matrix * sparse matrix is a key primitive
• Matrices over semirings like (min,+) as well as (+,*)