Transcript

CS294, Lecture #10
Fall, 2011
Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294
How to Compute and Prove
Lower and Upper Bounds
on the
Communication Costs
of Your Algorithm
Part III: Graph analysis
Oded Schwartz
Based on:
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:
Graph expansion and communication costs of fast matrix multiplication.
Previous talk on lower bounds
Communication Lower Bounds:
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction
[Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding
[Irony,Toledo,Tiskin 04],
[Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis
[Hong & Kung 81],
[Ballard, Demmel, Holtz, S. 2011b]
Previous talk on lower bounds:
algorithms with “flavor” of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009],
[Ballard, Demmel, Holtz, S. 2011a]
Following [Irony,Toledo,Tiskin 04]
  n 3 
 
 M
 M 



• BLAS, LU, Cholesky, LDLᵀ, and QR factorizations,
  eigenvalues and singular values, i.e.,
  essentially all direct methods of linear algebra.

• Dense or sparse matrices
In sparse cases: bandwidth is a function of NNZ (the number of nonzeros).
• Bandwidth and latency.
• Sequential, hierarchical, and
parallel – distributed and shared memory models.
• Compositions of linear algebra operations.
• Certain graph optimization problems
In parallel: BW = Ω( (n/√M)³ ∙ M/P )
• Tensor contraction
  [Demmel, Pearson, Poloni, Van Loan, 11]
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form: for all (i,j) ∈ S,
C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)),
               g_{i,j,k2}(A(i,k2), B(k2,j)),
               …,
               other arguments ),
where k1, k2, … ∈ S_ij.
(Classical matrix multiplication is the special case where f_ij is the sum
and g_{i,j,k}(A(i,k), B(k,j)) = A(i,k) ∙ B(k,j).)
But many algorithms just don’t fit
the generalized form!
For example:
Strassen’s fast matrix multiplication
Beyond 3-nested loops
How about the communication costs of algorithms
that have a more complex structure?
Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction
[Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding
[Irony,Toledo,Tiskin 04],
[Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis
[Hong & Kung 81],
[Ballard, Demmel, Holtz, S. 2011b]
Recall: Strassen’s Fast Matrix Multiplication
[Strassen 69]
• Compute 2 x 2 matrix multiplication
using only 7 multiplications (instead of 8).
• Apply recursively (block-wise)
M1 = (A11 + A22) ∙ (B11 + B22)
M2 = (A21 + A22) ∙ B11
M3 = A11 ∙ (B12 - B22)
M4 = A22 ∙ (B21 - B11)
M5 = (A11 + A12) ∙ B22
M6 = (A21 - A11) ∙ (B11 + B12)
M7 = (A12 - A22) ∙ (B21 + B22)
[Figure: 2×2 block partitioning with n/2 × n/2 blocks:
 (C11 C12; C21 C22) = (A11 A12; A21 A22) ∙ (B11 B12; B21 B22).]
T(n) = 7T(n/2) + O(n2)
T(n) = (nlog 7)
2
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
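
To make the recursion concrete, here is a minimal numpy sketch of the scheme above (the power-of-two sizes and the cutoff switch to the classical product are choices of this sketch, not part of the slides):

import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursion for n x n matrices, n a power of two."""
    n = A.shape[0]
    if n <= cutoff:            # below the cutoff, use the classical product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])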
Strassen-like algorithms
• Compute n₀ × n₀ matrix multiplication
  using only n₀^{ω₀} multiplications
  (instead of n₀³).
• Apply recursively (block-wise):
  T(n) = n₀^{ω₀} ∙ T(n/n₀) + O(n²)
  T(n) = Θ(n^{ω₀})   (by the master theorem, since ω₀ > 2)

ω₀     Reference
2.81   [Strassen 69] (works fast in practice)
2.79   [Pan 78]
2.78   [Bini 79]
2.55   [Schönhage 81]
2.50   [Pan, Romani; Coppersmith, Winograd 84]
2.48   [Strassen 87]
2.38   [Coppersmith, Winograd 90]
2.38   [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)
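
The exponent ω₀ is simply the log, base n₀, of the number of multiplications; a small sketch (the function name is mine):

import math

# omega0 = log_{n0}(m) for a base case multiplying n0 x n0 blocks
# with m scalar multiplications.
def omega0(n0, m):
    return math.log(m, n0)

print(omega0(2, 7))   # Strassen: log2(7) = 2.807...
print(omega0(2, 8))   # classical: log2(8) = 3.0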
New lower bound for
Strassen’s fast matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]:
The communication bandwidth lower bound is:

For Strassen's:   BW = Ω( (n/√M)^{log₂ 7} ∙ M );  parallel: BW = Ω( (n/√M)^{log₂ 7} ∙ M/P )
Strassen-like:    BW = Ω( (n/√M)^{ω₀} ∙ M );      parallel: BW = Ω( (n/√M)^{ω₀} ∙ M/P )
Recall for cubic: BW = Ω( (n/√M)^{log₂ 8} ∙ M );  parallel: BW = Ω( (n/√M)^{log₂ 8} ∙ M/P )
The parallel lower bound applies to
2D: M = Θ(n²/P)
2.5D: M = Θ(c∙n²/P)
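
To see the gap between the bounds, plug in sample sizes; a hedged sketch (the values of n and M are arbitrary, and the constants hidden by Ω are ignored):

import math

# Sequential bandwidth lower bound (n/sqrt(M))^omega * M, up to constants.
def bw_lower_bound(n, M, omega):
    return (n / math.sqrt(M)) ** omega * M

n, M = 2**13, 2**20
for name, w in [("cubic", 3.0), ("Strassen", math.log2(7)), ("omega0=2.5", 2.5)]:
    print(f"{name:>10}: {bw_lower_bound(n, M, w):.3e} words")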
Are the lower bounds attainable?
For sequential? Hierarchy?
Yes, existing implementations do!
For parallel 2D? Parallel 2.5D?
Yes: new algorithms.
Sequential and new 2D and 2.5D parallel
Strassen-like algorithms
Sequential and Hierarchy cases:
Attained by the natural recursive implementation.
Also: LU, QR,… (Black-box use of fast matrix multiplication)
[Ballard, Demmel, Holtz, S., Rom 2011]:
New 2D parallel Strassen-like algorithm.
Attains the lower bound.
New 2.5D parallel Strassen-like algorithm.
c^{ω₀/2-1} parallel communication speedup
over 2D implementation (c ∙ 3n² = M∙P);
e.g., c = 16 with ω₀ = log₂ 7 gives a factor of about 3.
[Ballard, Demmel, Holtz, S. 2011b]:
This is as good as it gets.
Implications for sequential architectural scaling
• Requirements so that “most” time is spent doing arithmetic on
n × n dense matrices, n² > M:
• Time to add two rows of largest locally storable square matrix
exceeds reciprocal bandwidth
• Time to multiply 2 largest locally storable square matrices exceeds
latency
CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic                            | M^{1/2} ∙ γ ≥ β               | M^{3/2} ∙ γ ≥ α
Strassen-like                      | M^{ω₀/2-1} ∙ γ ≥ β            | M^{ω₀/2} ∙ γ ≥ α
(γ = time per flop, β = time per word, α = time per message.)
Strassen-like algs do fewer flops & less communication
but are more demanding on the hardware.
If ω₀ → 2, it is all about communication.
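
The balance conditions in the table are easy to check for a concrete machine; a sketch with assumed sample values for γ (time per flop), β (time per word), and α (time per message):

# Check the architectural scaling requirements from the table above.
gamma = 1e-11    # time per flop (assumed sample value)
beta = 1e-9      # time per word moved (assumed sample value)
alpha = 1e-6     # time per message (assumed sample value)
M = 2**20        # fast-memory size in words
omega0 = 2.81    # Strassen's exponent

print("classic bandwidth ok:      ", M**0.5 * gamma >= beta)
print("classic latency ok:        ", M**1.5 * gamma >= alpha)
print("Strassen-like bandwidth ok:", M**(omega0 / 2 - 1) * gamma >= beta)
print("Strassen-like latency ok:  ", M**(omega0 / 2) * gamma >= alpha)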
Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b],
in the spirit of [Hong & Kung 81]
Let G = (V, E) be a d-regular graph. Its edge expansion is

  h(G) = min_{S ⊆ V, |S| ≤ |V|/2}  |E(S, V \ S)| / (d ∙ |S|)

Let A be the normalized adjacency matrix of G, with
eigenvalues 1 = λ₁ ≥ λ₂ ≥ … ≥ λₙ, and let λ = 1 - max{λ₂, |λₙ|}.

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]:

  (1/2) ∙ λ ≤ h(G) ≤ √(2λ)
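
The theorem can be checked numerically on a small example; a sketch that brute-forces h and uses the 1 - λ₂ form of the bound (the 3-regular circulant graph is an arbitrary choice):

import itertools
import math
import numpy as np

# A small d-regular graph: circulant on 10 vertices with steps 1 and 5.
n, d = 10, 3
A = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1, i + n // 2):
        A[i, j % n] = 1

# Edge expansion: h = min over |S| <= |V|/2 of |E(S, V\S)| / (d|S|).
h = min(
    sum(A[i, j] for i in S for j in range(n) if j not in S) / (d * len(S))
    for k in range(1, n // 2 + 1)
    for S in itertools.combinations(range(n), k)
)
lam2 = np.linalg.eigvalsh(A / d)[-2]   # second-largest eigenvalue of A/d
lam = 1 - lam2                         # spectral gap, 1 - lambda_2 form
print(h, lam / 2 <= h <= math.sqrt(2 * lam))   # Cheeger bounds hold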
Expansion (3rd approach)
The Computation Directed Acyclic Graph
[Figure: the computation DAG: input/output vertices, intermediate values, and
dependency edges; a segment S ⊆ V with its read set R_S and write set W_S,
against V \ S.]
Communication cost is graph expansion.
Expansion (3rd approach)
For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG: G = (V, E).
   V = set of computations and inputs,
   E = dependencies.
2. Partition G into segments S of Θ(M^{ω₀/2}) vertices
   (corresponding to adjacency in time / location).
3. Show that every segment S has ≥ 3M vertices with incoming / outgoing edges,
   so it performs ≥ M reads/writes.
4. The total communication bandwidth is
   BW = BW of one segment × #segments
      = Ω(M) × Θ(n^{ω₀}) / Θ(M^{ω₀/2})
      = Ω( n^{ω₀} / M^{ω₀/2-1} ).
[Figure: timeline of one run partitioned into segments S1, S2, S3, …; each
segment is a run of reads, FLOPs, and writes against a fast memory of size M.]
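
Instantiating step 4 for Strassen (ω₀ = log₂ 7) recovers the bound stated earlier:

  BW ≥ Ω(M) ∙ Θ(n^{log₂ 7}) / Θ(M^{(log₂ 7)/2})
     = Ω( n^{log₂ 7} / M^{(log₂ 7)/2 - 1} )
     = Ω( (n/√M)^{log₂ 7} ∙ M ).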
Is it a Good Expander?
Break G into edge-disjoint graphs,
corresponding to the algorithm on M^{1/2} × M^{1/2} matrices.
Consider the expansions of S in each part (they sum up):
  BW = Ω(T(n)) ∙ h(G(M^{1/2}))
[Figure: the recursive CDAG (Dec_{lg n}C on top with n² outputs, Enc_{lg n}A
and Enc_{lg n}B below with n² inputs each, lg n levels), with segments
S1, …, S5 overlaid.]
We need to show that a segment of Θ(M^{ω₀/2}) vertices expands to Ω(M):
  h(G(n)) = Ω( M / M^{ω₀/2} ) for n = Θ(M^{1/2}).
Namely, for every n:
  h(G(n)) = Ω( n² / n^{lg 7} ) = Ω( (4/7)^{lg n} ),
since n² / n^{lg 7} = (2²/7)^{lg n}.
What is the CDAG of Strassen’s algorithm?
The DAG of Strassen, n = 2
M1 = (A11 + A22) ∙ (B11 + B22)
M2 = (A21 + A22) ∙ B11
M3 = A11 ∙ (B12 - B22)
M4 = A22 ∙ (B21 - B11)
M5 = (A11 + A12) ∙ B22
M6 = (A21 - A11) ∙ (B11 + B12)
M7 = (A12 - A22) ∙ (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

[Figure: the CDAG for n = 2: Dec₁C on top with outputs C(1,1), C(1,2), C(2,1),
C(2,2); the seven product vertices 1-7 in the middle; Enc₁A and Enc₁B at the
bottom with the entries of A and B as inputs.]
The DAG of Strassen, n = 4
One recursive level:
• Each vertex splits into four.
• Multiply blocks.
[Figure: the n = 2 construction repeated one level: Dec₁C composed with four
copies of Dec₁C, and likewise for Enc₁A and Enc₁B; the product vertices 1-7
each expand into 2×2 blocks.]
The DAG of Strassen: further recursive steps
[Figure: the full CDAG: Dec_{lg n}C with n² outputs on top, lg n levels,
Enc_{lg n}A and Enc_{lg n}B with n² inputs each at the bottom.]
Recursive construction. Given Dec_i C, construct Dec_{i+1} C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.
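
Going up, each cross-layer multiplies the number of vertices per level by 7 while the duplication multiplies it by 4; a hedged sketch of the resulting level sizes (the closed form is my reading of the construction, counting one vertex per encode/decode node):

# Assumed: level i of Dec_k C, counted from the outputs, has 7^i * 4^(k-i) vertices.
def dec_level_sizes(k):
    return [7**i * 4**(k - i) for i in range(k + 1)]

print(dec_level_sizes(1))   # [4, 7]: Dec1C maps 7 products to 4 outputs
print(dec_level_sizes(3))   # [64, 112, 196, 343]: growth factor 7/4 per level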
The DAG of Strassen
[Figure: Dec_{lg n}C produces C (n² outputs); Enc_{lg n}A and Enc_{lg n}B
consume A and B (n² inputs each); lg n levels in between.]
1. Compute weighted sums of A’s elements.
2. Compute weighted sums of B’s elements.
3. Compute the multiplications m₁, m₂, …, m_{n^{lg 7}}.
4. Compute weighted sums of m₁, m₂, …, m_{n^{lg 7}} to obtain C.
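
One level of this encode/multiply/decode structure (n = 2) fits in a few lines; a minimal numpy sketch, where EncA, EncB, DecC are my names for the coefficient matrices read off M1-M7 and C11-C22 above:

import numpy as np

# Rows of EncA/EncB are the weighted sums feeding the 7 multiplications;
# rows of DecC assemble C = [C11, C12, C21, C22] from m1..m7.
EncA = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
                 [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]])
EncB = np.array([[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
                 [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
DecC = np.array([[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
                 [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]])

rng = np.random.default_rng(0)
a, b = rng.random(4), rng.random(4)   # A and B, flattened row-major
m = (EncA @ a) * (EncB @ b)           # step 3: the 7 multiplications
c = DecC @ m                          # step 4: weighted sums give C
assert np.allclose(c, (a.reshape(2, 2) @ b.reshape(2, 2)).ravel())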
Expansion of a Segment
Two methods to compute the expansion of the recursively constructed graph:
• Combinatorial: estimate the edge / vertex expansion directly
  (in the spirit of [Alon, S., Shapira 08]); or
• Spectral: compute the edge expansion via the spectral gap
  (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00]).
Expansion of a Segment
Main technical challenges:
• Two types of vertices:
  with / without recursion.
• The graph is not regular.
[Figure: the Dec₁C, Enc₁A, Enc₁B gadget for n = 2, as in the CDAG above.]
Estimating the edge expansion: combinatorially
[Figure: a segment S spread over levels S1, S2, S3, …, Sk of the CDAG,
k = lg M ± O(1); Dec₁C gadgets classified as “in S”, “not in S”, or “mixed”.]
• Dec₁C is a consistency gadget:
  a mixed gadget pays ≥ 1/12 of its edges.
• The fraction of S vertices is consistent
  between the 1st level and the four 2nd levels
  (deviations pay linearly).
Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction
[Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding
[Irony,Toledo,Tiskin 04],
[Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis
[Hong & Kung 81],
[Ballard, Demmel, Holtz, S. 2011b]
Open Problems
Find algorithms that attain the lower bounds:
• Sparse matrix algorithms
• for sequential and parallel models
• that auto-tune or are cache oblivious
Address complex heterogeneous hardware:
• Lower bounds and algorithms
[Demmel, Volkov 08],[Ballard, Demmel, Gearhart 11]
Extend the techniques to other algorithms and algorithmic tools:
• Non-uniform recursive structure
• Characterize a communication lower bound for a problem rather
  than for an algorithm.
CS294, Lecture #10
Fall, 2011
Communication-Avoiding Algorithms
How to Compute and Prove
Lower Bounds
on the
Communication Costs
of Your Algorithm
Part III: Graph analysis
Oded Schwartz
Based on:
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz:
Graph expansion and communication costs of fast matrix multiplication.
Thank you!