Transcript ppt

Parallel Spectral Methods:
Solving Elliptic Problems with FFTs
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_s07
03/21/2007
CS267 Lecture 19
References
° Previous CS267 lectures
° Lecture by Geoffrey Fox:
http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/cps615fft00.ppt
° FFTW project
http://www.fftw.org
° Spiral project
http://www.spiral.net
Poisson’s equation arises in many models
3D: 2u/x2 + 2u/y2 + 2u/z2 = f(x,y,z)
2D: 2u/x2 + 2u/y2 = f(x,y)
1D: d2u/dx2 = f(x)
f represents the
sources; also
need boundary
conditions
° Electrostatic or Gravitational Potential: Potential(position)
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Fluid flow: Velocity,Pressure,Density(position,time)
° Elasticity: Stress,Strain(position,time)
° Variations of Poisson have variable coefficients
Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)
Algorithm           Serial              PRAM                      Memory              #Procs
° Dense LU          N^3                 N                         N^2                 N^2
° Band LU           N^2 (N^(7/3))       N                         N^(3/2) (N^(5/3))   N (N^(4/3))
° Jacobi            N^2 (N^(5/3))       N (N^(2/3))               N                   N
° Explicit Inv.     N^2                 log N                     N^2                 N^2
° Conj. Gradients   N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3)) * log N N                   N
° Red/Black SOR     N^(3/2) (N^(4/3))   N^(1/2) (N^(1/3))         N                   N
° Sparse LU         N^(3/2) (N^2)       N^(1/2)                   N log N (N^(4/3))   N
° FFT               N log N             log N                     N                   N
° Multigrid         N                   log^2 N                   N                   N
° Lower bound       N                   log N                     N
PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
Solving Poisson’s Equation with the FFT
° Express any 2D function defined on 0 ≤ x,y ≤ 1 as a series
   φ(x,y) = Σj Σk φjk sin(π jx) sin(π ky)
   • Here the φjk are called the Fourier coefficients of φ(x,y)
° The inverse of this is:
   φjk = 4 ∫₀¹ dx ∫₀¹ dy φ(x,y) sin(π jx) sin(π ky)
° Poisson's equation ∂²φ/∂x² + ∂²φ/∂y² = f(x,y) becomes
   Σj Σk (-π²j² - π²k²) φjk sin(π jx) sin(π ky)
      = Σj Σk fjk sin(π jx) sin(π ky)
° where fjk are the Fourier coefficients of f(x,y)
   and f(x,y) = Σj Σk fjk sin(π jx) sin(π ky)
° This implies the PDE can be solved exactly algebraically:
   φjk = fjk / (-π²j² - π²k²)
Solving Poisson’s Equation with the FFT
° So the solution of Poisson's equation involves the following steps:
° 1) Find the Fourier coefficients fjk of f(x,y) by performing the integral
° 2) Form the Fourier coefficients of φ by
      φjk = fjk / (-π²j² - π²k²)
° 3) Construct the solution by performing the sum φ(x,y)
   (a minimal 1D sketch of these three steps follows below)
° There is another version of this (the Discrete Fourier Transform)
   which deals with functions defined at grid points rather than directly with
   the continuous integral
   • Also the simplest (mathematically) transform uses
     exp(-2πijx) rather than sin(π jx)
   • Let us first consider the 1D discrete version of this case
   • The PDE case normally deals with discretized functions, as these are needed
     for other parts of the problem
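A minimal 1D sketch of the three steps in plain C (the grid size and the sample source term are illustrative assumptions; the sine transforms are written as slow O(n²) sums for clarity, where in practice each would be an FFT-based fast sine transform):

/* Sketch: solve d^2u/dx^2 = f(x) on (0,1) with u(0) = u(1) = 0
 * via a discrete sine transform. Grid points x_i = i*h, h = 1/N. */
#include <math.h>
#include <stdio.h>

#define N 64   /* number of grid intervals (illustrative choice) */

int main(void) {
    const double PI = acos(-1.0);
    double h = 1.0 / N, f[N], fhat[N], uhat[N], u[N];

    /* sample source: f = -pi^2 sin(pi x), whose exact solution is u = sin(pi x) */
    for (int i = 1; i < N; i++) f[i] = -PI * PI * sin(PI * i * h);

    /* Step 1: Fourier (sine) coefficients of f */
    for (int j = 1; j < N; j++) {
        fhat[j] = 0.0;
        for (int i = 1; i < N; i++) fhat[j] += 2.0 * h * f[i] * sin(PI * j * i * h);
    }
    /* Step 2: solve algebraically in coefficient space */
    for (int j = 1; j < N; j++) uhat[j] = fhat[j] / (-PI * PI * j * j);

    /* Step 3: sum the sine series to reconstruct u at the grid points */
    for (int i = 1; i < N; i++) {
        u[i] = 0.0;
        for (int j = 1; j < N; j++) u[i] += uhat[j] * sin(PI * j * i * h);
    }
    printf("u(1/2) = %.6f   (exact solution: 1.000000)\n", u[N / 2]);
    return 0;
}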
Serial FFT
° Let i=sqrt(-1) and index matrices and vectors from 0.
° The Discrete Fourier Transform of an m-element
vector v is:
F*v
Where F is the m×m matrix defined as:
   F[j,k] = ω^(j*k)
Where ω is:
   ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
° ω is a complex number whose mth power ω^m = 1,
   and it is therefore called an mth root of unity
° E.g., for m = 4:
   ω = i,   ω² = -1,   ω³ = -i,   ω⁴ = 1
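For concreteness, a tiny plain-C sketch of this definition (computing F*v directly from F[j,k] = ω^(j*k) costs O(m²); the FFT developed below reorganizes the same sums):

/* Naive O(m^2) DFT straight from the definition, with w = e^(2*pi*i/m). */
#include <complex.h>
#include <math.h>
#include <stdio.h>

static void dft(int m, const double complex *v, double complex *Fv) {
    const double pi = acos(-1.0);
    for (int j = 0; j < m; j++) {
        Fv[j] = 0;
        for (int k = 0; k < m; k++)
            Fv[j] += cexp(2.0 * pi * I * j * k / m) * v[k];   /* w^(j*k) * v[k] */
    }
}

int main(void) {
    double complex v[4] = {1, 2, 3, 4}, Fv[4];   /* m = 4, so w = i */
    dft(4, v, Fv);
    for (int j = 0; j < 4; j++)
        printf("%6.2f %+6.2fi\n", creal(Fv[j]), cimag(Fv[j]));
    return 0;
}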
Using the 1D FFT for filtering
° Signal = sin(7t) + .5 sin(5t) at 128 points
° Noise = random number bounded by .75
° Filter by zeroing out FFT components < .25
Using the 2D FFT for image compression
° Image = 200x320 matrix of values
° Compress by keeping largest 2.5% of FFT
components
° Similar idea used by JPEG
Related Transforms
° Most applications require multiplication by both F
and inverse(F).
° Multiplying by F and by inverse(F) are essentially the
same operation. (inverse(F) is the complex conjugate of F
divided by m.)
° For solving the Poisson equation and various other
applications, we use variations on the FFT
• The sin transform -- imaginary part of F
• The cos transform -- real part of F
° Algorithms are similar, so we will focus on the
forward FFT.
Serial Algorithm for the FFT
° Compute the FFT of an m-element vector v, F*v
(F*v)[j] = Σ_{k=0..m-1} F[j,k] * v[k]
         = Σ_{k=0..m-1} ω^(j*k) * v[k]
         = Σ_{k=0..m-1} (ω^j)^k * v[k]
         = V(ω^j)
° Where V is defined as the polynomial
   V(x) = Σ_{k=0..m-1} x^k * v[k]
Divide and Conquer FFT
° V can be evaluated using divide-and-conquer
V(x) = Σ_{k=0..m-1} x^k * v[k]
     = v[0] + x²*v[2] + x⁴*v[4] + …
       + x*(v[1] + x²*v[3] + x⁴*v[5] + … )
     = Veven(x²) + x*Vodd(x²)
° V has degree m-1, so Veven and Vodd are polynomials
   of degree m/2-1
° We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1
° But this is really just m/2 different points, since
   (ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
° So FFT on m points reduced to 2 FFTs on m/2 points
• Divide and conquer!
Divide-and-Conquer FFT
FFT(v, ω, m)
   if m = 1 return v[0]
   else
      veven = FFT(v[0:2:m-2], ω², m/2)
      vodd  = FFT(v[1:2:m-1], ω², m/2)
      ω-vec = [ω⁰, ω¹, … , ω^(m/2-1)]    (precomputed)
      return [veven + (ω-vec .* vodd),
              veven - (ω-vec .* vodd)]
° The .* above is component-wise multiply.
° The […,…] constructs an m-element vector from two m/2-element vectors.
This results in an O(m log m) algorithm.
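A compact C version of this pseudocode, assuming m is a power of two (the scratch arrays and the sample input in main are illustrative choices, not part of the algorithm statement above):

/* Recursive radix-2 FFT following the pseudocode above; w is the principal
 * mth root of unity, e^(2*pi*i/m). */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static void fft(const double complex *v, double complex w, int m,
                double complex *out) {
    if (m == 1) { out[0] = v[0]; return; }
    double complex *even = malloc((m / 2) * sizeof *even);
    double complex *odd  = malloc((m / 2) * sizeof *odd);
    double complex *ve   = malloc((m / 2) * sizeof *ve);
    double complex *vo   = malloc((m / 2) * sizeof *vo);
    for (int k = 0; k < m / 2; k++) {       /* v[0:2:m-2] and v[1:2:m-1] */
        even[k] = v[2 * k];
        odd[k]  = v[2 * k + 1];
    }
    fft(even, w * w, m / 2, ve);            /* FFT of the even part with w^2 */
    fft(odd,  w * w, m / 2, vo);            /* FFT of the odd  part with w^2 */
    double complex wj = 1.0;                /* the "w-vec" entries w^0 .. w^(m/2-1) */
    for (int j = 0; j < m / 2; j++) {
        out[j]         = ve[j] + wj * vo[j];   /* veven + w-vec .* vodd */
        out[j + m / 2] = ve[j] - wj * vo[j];   /* veven - w-vec .* vodd */
        wj *= w;
    }
    free(even); free(odd); free(ve); free(vo);
}

int main(void) {
    enum { M = 8 };
    double complex v[M], out[M];
    for (int k = 0; k < M; k++) v[k] = k;                /* sample input */
    fft(v, cexp(2.0 * acos(-1.0) * I / M), M, out);      /* w = e^(2*pi*i/M) */
    for (int j = 0; j < M; j++)
        printf("%7.3f %+7.3fi\n", creal(out[j]), cimag(out[j]));
    return 0;
}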
An Iterative Algorithm
° The call tree of the d&c FFT algorithm is a complete binary tree
of log m levels
FFT(0,1,2,3,…,15) = FFT(xxxx)
  even: FFT(0,2,…,14) = FFT(xxx0)                    odd: FFT(1,3,…,15) = FFT(xxx1)
    FFT(xx00)            FFT(xx10)                     FFT(xx01)            FFT(xx11)
      FFT(x000) FFT(x100) FFT(x010) FFT(x110)            FFT(x001) FFT(x101) FFT(x011) FFT(x111)
        FFT(0) FFT(8) FFT(4) FFT(12) FFT(2) FFT(10) FFT(6) FFT(14) FFT(1) FFT(9) FFT(5) FFT(13) FFT(3) FFT(11) FFT(7) FFT(15)
° An iterative algorithm uses loops rather than recursion and
   goes through each level in the tree, starting at the bottom
   • The algorithm overwrites v[i] by (F*v)[bitreverse(i)]
     (a small sketch of bitreverse follows below)
° Practical algorithms combine recursion (for the memory hierarchy)
   and iteration (to avoid function call overhead)
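A small sketch of the bitreverse(i) permutation used above (the helper name and bit count are illustrative):

#include <stdio.h>

/* Reverse the low `bits` bits of index i: used to permute v[i] <-> v[bitreverse(i)]
 * before or after the iterative FFT, depending on the variant. */
static unsigned bitreverse(unsigned i, int bits) {
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);   /* shift the lowest remaining bit of i into r */
        i >>= 1;
    }
    return r;
}

int main(void) {
    printf("%u\n", bitreverse(3, 4));   /* 0011 -> 1100, prints 12 */
    return 0;
}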
Parallel 1D FFT
° Data dependencies in the 1D FFT
  • Butterfly pattern
° A PRAM algorithm takes O(log m) time
  • each step to the right is parallel
  • there are log m steps
° What about communication cost?
° See the LogP paper for details
Block Layout of 1D FFT
° Using a block layout (m/p contiguous elements per processor)
° No communication in the last log(m/p) steps
° Each step requires fine-grained communication in the first log(p) steps
Cyclic Layout of 1D FFT
° Cyclic layout (only 1 element per processor, wrapped)
° No communication in the first log(m/p) steps
° Communication in the last log(p) steps
Parallel Complexity
° m = vector size, p = number of processors
° f = time per flop = 1
° α = startup for a message (in units of f)
° β = time per word in a message (in units of f)
° Time(blockFFT) = Time(cyclicFFT) =
      2*m*log(m)/p
    + log(p) * α
    + m*log(p)/p * β
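For concreteness, the model can be evaluated directly; a small sketch (log is taken base 2, and the values of α and β are whatever machine parameters one assumes, in flop-time units):

#include <math.h>
#include <stdio.h>

/* Sketch: Time(blockFFT) = Time(cyclicFFT)
 *   = 2*m*log2(m)/p + log2(p)*alpha + m*log2(p)/p*beta   (in units of flop time) */
static double fft_time(double m, double p, double alpha, double beta) {
    return 2.0 * m * log2(m) / p     /* butterfly computation   */
         + log2(p) * alpha           /* message startups        */
         + m * log2(p) / p * beta;   /* words sent per process  */
}

int main(void) {
    /* illustrative parameters only: m = 2^20 points, p = 64, alpha = 10^4, beta = 10 */
    printf("%.3e flop times\n", fft_time(1 << 20, 64, 1e4, 10));
    return 0;
}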
FFT With “Transpose”
° If we start with a cyclic layout for the first log(p) steps, there is no communication
° Then transpose the vector for the last log(m/p) steps
° All communication is in the transpose
° Note: This example has log(m/p) = log(p)
  • If log(m/p) > log(p), more phases/layouts will be needed
  • We will work with this assumption for simplicity
Why is the Communication Step Called a Transpose?
° Analogous to transposing an array
° View as a 2D array of m/p by p (see the sketch below)
° Note: same idea is useful for uniprocessor caches
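Viewed as an (m/p) × p array, the communication step is an all-to-all in which each processor sends m/p² elements to every other processor. A hedged sketch using MPI for illustration (a full transpose also needs a local reshuffle of each received block, which is omitted; MPI_C_DOUBLE_COMPLEX assumes an MPI-2.2 or later implementation):

#include <complex.h>
#include <mpi.h>

/* Sketch: the "transpose" communication as an all-to-all. Each of the p
 * processes holds m/p elements and sends a block of m/p^2 elements to every
 * process. Assumes m is a multiple of p*p; compile as part of an MPI program. */
static void transpose_step(double complex *local, double complex *recv,
                           int m, int p, MPI_Comm comm) {
    int blk = m / (p * p);   /* elements destined for each process */
    MPI_Alltoall(local, blk, MPI_C_DOUBLE_COMPLEX,
                 recv,  blk, MPI_C_DOUBLE_COMPLEX, comm);
}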
Complexity of the FFT with Transpose
° If no communication is pipelined (overestimate!)
° Time(transposeFFT) =
      2*m*log(m)/p            same as before
    + (p-1) * α               was log(p) * α
    + m*(p-1)/p² * β          was m*log(p)/p * β
° If communication is pipelined, so we do not pay for
  p-1 message startups, the second term becomes simply α,
  rather than (p-1)*α.
° This is close to optimal. See LogP paper for details.
° See also following papers on class resource page
• A. Sahai, “Hiding Communication Costs in Bandwidth Limited FFT”
• R. Nishtala et al., “Optimizing bandwidth limited problems using one-sided communication”
Comment on the 1D Parallel FFT
° The above algorithm leaves data in bit-reversed order
• Some applications can use it this way, like Poisson
• Others require another transpose-like operation
° Other parallel algorithms also exist
• A very different 1D FFT is due to Edelman (see http://www-math.mit.edu/~edelman)
• Based on the Fast Multipole algorithm
• Less communication for non-bit-reversed algorithm
Higher Dimension FFTs
° FFTs on 2 or 3 dimensions are defined as 1D FFTs on
vectors in all dimensions.
° E.g., a 2D FFT does 1D FFTs on all rows and then all
columns (a serial sketch of this row/column structure follows the list below)
° There are 3 obvious possibilities for the 2D FFT:
• (1) 2D blocked layout for matrix, using 1D algorithms for each row
and column
• (2) Block row layout for matrix, using serial 1D FFTs on rows,
followed by a transpose, then more serial 1D FFTs
• (3) Block row layout for matrix, using serial 1D FFTs on rows,
followed by parallel 1D FFTs on columns
• Option 2 is best, if we overlap communication and computation
° For a 3D FFT the options are similar
• 2 phases done with serial FFTs, followed by a transpose for 3rd
• can overlap communication with 2nd phase in practice
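The serial structure underlying all three options (1D FFTs on all rows, then on all columns) can be sketched self-containedly in C; a naive O(n²) 1D DFT stands in for a real 1D FFT just to keep the example short:

/* Sketch: a 2D transform as 1D transforms over all rows, then all columns. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define N 4   /* illustrative array size */

static void dft1d(double complex *x, int n, int stride) {   /* assumes n <= N */
    double complex tmp[N];
    for (int j = 0; j < n; j++) {
        tmp[j] = 0;
        for (int k = 0; k < n; k++)
            tmp[j] += cexp(2.0 * acos(-1.0) * I * j * k / n) * x[k * stride];
    }
    for (int j = 0; j < n; j++) x[j * stride] = tmp[j];
}

int main(void) {
    double complex a[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) a[i][j] = i + j;      /* sample data  */
    for (int i = 0; i < N; i++) dft1d(&a[i][0], N, 1);    /* all rows     */
    for (int j = 0; j < N; j++) dft1d(&a[0][j], N, N);    /* all columns  */
    printf("a[0][0] = %.1f%+.1fi\n", creal(a[0][0]), cimag(a[0][0]));
    return 0;
}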
FFTW – Fastest Fourier Transform in the West
° www.fftw.org
° Produces FFT implementation optimized for
• Your version of FFT (complex, real, …)
• Your value of n (arbitrary, possibly prime)
• Your architecture
• Close to optimal for serial, can be improved for parallel
° Similar in spirit to PHIPAC/ATLAS/Sparsity
° Won 1999 Wilkinson Prize for Numerical Software
° Widely used for serial FFTs
• Had parallel FFTs in version 2, but no longer supporting them
• Layout constraints from users/apps + network differences are
hard to support
Bisection Bandwidth
° FFT requires one (or more) transpose operations:
• Every processor sends 1/p of its data to every other one
° Bisection Bandwidth limits this performance
• Bisection bandwidth is the bandwidth across the narrowest part
of the network
• Important in global transpose operations, all-to-all, etc.
° “Full bisection bandwidth” is expensive
• Fraction of machine cost in the network is increasing
• Fat-tree and full crossbar topologies may be too expensive
• Especially on machines with 100K and more processors
• SMP clusters often limit bandwidth at the node level
Modified LogGP Model
° LogGP: no overlap
[Timing diagram: a send from P0 to P1 showing osend, L, orecv, and the gap g]
EEL: end-to-end latency (1/2 roundtrip)
g:   minimum time between small message sends
G:   additional gap per byte for larger messages
Historical Perspective
[Bar chart (usec): send overhead, receive overhead, added latency, and ½ round-trip latency for Quadrics/Shm, Quadrics/MPI, Myrinet/GM, Myrinet/MPI, GigE/VIPL, GigE/MPI, IBM/LAPI, IBM/MPI, T3E/Shm, T3E/E-Reg, and T3E/MPI]
• Potential performance advantage for fine-grained, one-sided programs
• Potential productivity advantage for irregular applications
General Observations
° The overlap potential is the difference between the
gap and overhead
• No potential if the CPU is tied up throughout the message send
  - E.g., no send-side DMA
• Grows with message size for machines with DMA
  - Because the per-byte cost is handled by the NIC
• Grows with the amount of network congestion
  - Because the gap grows as the network becomes saturated
° Remote overhead is 0 for machines with RDMA
GASNet Communications System
GASNet offers put/get communication
° One-sided: no remote CPU involvement required
in API (key difference with MPI)
• Message contains remote address
• No need to match with a receive
• No implicit ordering required
° Used in language runtimes (UPC, etc.)
° Fine-grained and bulk transfers
° Split-phase communication
[Stack diagram: Compiler-generated code -> Language-specific runtime -> GASNet -> Network Hardware]
Performance of 1-Sided vs 2-sided Communication: GASNet vs MPI
[Plot: flood bandwidth (KB/s) vs. message size (bytes) for gasnet_put_nbi_bulk and MPI, with relative bandwidth (put_nbi_bulk / MPI flood) on a secondary axis; up is good]
° Comparison on Opteron/InfiniBand – GASNet’s vapi-conduit and OSU MPI 0.9.5
° Up to large message sizes (> 256 KB), GASNet provides up to a 2.2X improvement in
streaming bandwidth
° The half-power point (N½) differs by an order of magnitude
GASNet: Performance for mid-range message sizes
Flood Bandwidth for 4KB messages
[Bar chart: percent of hardware peak (up is good) for MPI vs. GASNet on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed, and Altix]
GASNet usually reaches saturation bandwidth before MPI - fewer costs to amortize
It usually outperforms MPI at medium message sizes - often by a large margin
NAS FT Case Study
° Performance of Exchange (Alltoall) is critical
• Communication-to-computation ratio increases with faster, more optimized 1-D FFTs
• Determined by available bisection bandwidth
• Between 30-40% of the application's total runtime
° Two ways to reduce the Exchange cost
1. Use a better network (higher bisection BW)
2. Overlap the all-to-all with computation (where possible) -
"break up" the exchange
Default NAS FT Fortran/MPI relies on #1
Our approach uses UPC/GASNet and builds on #2
° Started as CS267 project
• 1D partition of 3D grid is a limitation
• At most N processors for N^3 grid
• HPC Challenge benchmark has large 1D FFT (can be viewed as 3D
or more with proper roots of unity)
3D FFT Operation with Global Exchange
[Diagram: 1D-FFTs on columns; transpose + 1D-FFTs on rows; Exchange (Alltoall) sends cachelines to each thread (send to Thread 0, 1, 2, …); divide rows among threads; last 1D-FFT (Thread 0's view)]
° Single Communication Operation (Global Exchange) sends THREADS
large messages
° Separate computation and communication phases
Communication Strategies for 3D FFT
chunk = all rows with same destination
° Three approaches:
• Chunk:
- Wait for 2nd dim FFTs to finish
- Minimize # messages
• Slab:
- Wait for chunk of rows destined for
1 proc to finish
- Overlap with computation
• Pencil:
- Send each row as it completes
- Maximize overlap
- Match natural layout
pencil = 1 row
slab = all rows in a single plane with same destination
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Decomposing NAS FT Exchange into Smaller Messages
° Three approaches:
• Chunk:
- Wait for 2nd dim FFTs to finish
• Slab:
- Wait for chunk of rows destined for 1 proc to finish
• Pencil:
- Send each row as it completes
° Example Message Size Breakdown for
• Class D (2048 x 1024 x 1024)
• at 256 processors
Exchange (Default)                 512 KB
Slabs (set of contiguous rows)      65 KB
Pencils (single row)                16 KB
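These sizes can be sanity-checked with a few lines of arithmetic (the 16-byte double-complex element size and 1024-element row length are assumptions made for illustration):

#include <stdio.h>

int main(void) {
    /* Class D grid (2048 x 1024 x 1024) at 256 processes */
    long long nx = 2048, ny = 1024, nz = 1024, procs = 256, elem = 16;
    long long total    = nx * ny * nz * elem;      /* bytes in the whole grid  */
    long long exchange = total / procs / procs;    /* per-destination message  */
    long long pencil   = ny * elem;                /* one row                  */
    printf("exchange %lld KB, pencil %lld KB\n", exchange / 1024, pencil / 1024);
    return 0;   /* prints: exchange 512 KB, pencil 16 KB */
}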
Overlapping Communication
° Goal: make use of “all the wires”
• Distributed memory machines allow for asynchronous communication
• Berkeley non-blocking extensions expose GASNet's non-blocking operations
° Approach: Break up the all-to-all communication
• Interleave row computations and row communications, since the 1D-FFT is
independent across rows (a sketch of this interleaving follows this list)
• Decomposition can be into slabs (contiguous sets of rows) or
pencils (individual rows)
• Pencils allow:
  - An earlier start for the communication "phase" and improved local cache use
  - But more, smaller messages (same total volume)
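A schematic of the slab-style interleaving (for illustration only: MPI non-blocking sends are used below just to show the overlap structure, whereas the work described here uses UPC/GASNet one-sided puts; fft_rows_of_slab and the one-slab-per-destination mapping are hypothetical placeholders):

#include <mpi.h>

void fft_rows_of_slab(double *slab, int nwords);   /* hypothetical local 1D-FFT step */

/* Sketch: FFT one slab, immediately start sending it, and keep computing the
 * next slab while the network works. Assumes one slab per destination rank
 * (nslabs == number of processes). */
void exchange_with_overlap(double *slabs, int nslabs, int slab_words,
                           MPI_Comm comm) {
    MPI_Request req[nslabs];
    for (int s = 0; s < nslabs; s++) {
        fft_rows_of_slab(slabs + (long)s * slab_words, slab_words);
        MPI_Isend(slabs + (long)s * slab_words, slab_words, MPI_DOUBLE,
                  s /* destination rank */, 0, comm, &req[s]);
    }
    MPI_Waitall(nslabs, req, MPI_STATUSES_IGNORE);   /* all sends complete */
}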
NAS FT: UPC Non-blocking MFlops
MFlop rate in UPC Blocking and Non-blocking FT
[Bar chart: MFlops per thread for UPC Blocking vs. UPC Non-Blocking on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512]
° The Berkeley UPC compiler supports non-blocking UPC extensions
° These produce a 15-45% speedup over the best UPC blocking version
° The non-blocking version requires about 30 extra lines of UPC code
NAS FT Variants Performance Summary
° Shown are the largest classes/configurations possible on each test machine
° MPI not particularly tuned for many small/medium size messages in flight (long
message matching queue depths)
Pencil/Slab optimizations: UPC vs MPI
Fraction of Unoverlapped MPI Communication that UPC Effectively Overlaps with Computation
[Chart: time spent in communication (best MPI) vs. best UPC for each system (Class/NProcs) - Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512; fraction from 0 to 1]
° Same data, viewed in the context of what MPI is able to overlap
° “For the amount of time that MPI spends in communication, how much of that time can
UPC effectively overlap with computation”
° On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication
° On Elan3, UPC obtains more overlap than MPI as the problem scales up
Summary of Overlap in FFTs
° One-sided communication has performance advantages
• Better match for most networking hardware
- Most cluster networks have RDMA support
- Machines with global address space support (X1, Altix) shown
elsewhere
• Smaller messages may make better use of network
- Spread communication over longer period of time
- Postpone bisection bandwidth pain
• Smaller messages can also prevent cache thrashing for packing
- Avoid packing overheads if natural message size is reasonable
FFTW
the “Fastest Fourier Transform in the West”
• C library for real & complex FFTs (arbitrary size/dimensionality)
(+ parallel versions for threads & MPI)
• Computational kernels (80% of code) automatically generated
• Self-optimizes for your hardware (picks best composition of steps)
= portability + performance
free software: http://www.fftw.org/
FFTW performance
power-of-two sizes, double precision
[Performance plots on an 833 MHz Alpha EV6, 2 GHz AMD Opteron, 2 GHz PowerPC G5, and 500 MHz UltraSPARC IIe]
FFTW performance
non-power-of-two sizes, double precision
unusual: non-power-of-two sizes receive as much optimization as powers of two
…because we let the code do the optimizing
[Performance plots on an 833 MHz Alpha EV6 and a 2 GHz AMD Opteron]
FFTW performance
double precision, 2.8GHz Pentium IV: 2-way SIMD (SSE2)
exploiting CPU-specific SIMD instructions (rewriting the code) is easy
…because we let the code write itself
[Performance plots for powers of two and for non-powers-of-two]
Why is FFTW fast?
three unusual features
3. FFTW implements many FFT algorithms:
   A planner picks the best composition
   by measuring the speed of different combinations.
1. The resulting plan is executed
   with explicit recursion:
   enhances locality
2. The base cases of the recursion are codelets:
   highly-optimized dense code
   automatically generated by a special-purpose “compiler”
FFTW is easy to use
#include <fftw3.h>   /* the real FFTW 3 names carry the fftw_ prefix shown below */
{
    fftw_complex x[n];     /* transform n complex values, in place */
    fftw_plan p;
    p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);
    ...
    fftw_execute(p);       /* repeat as needed */
    ...
    fftw_destroy_plan(p);
}
Key fact: usually, many transforms of the same size are required.
Why is FFTW fast?
three unusual features
3. FFTW implements many FFT algorithms:
   A planner picks the best composition
   by measuring the speed of different combinations.
1. The resulting plan is executed
   with explicit recursion:
   enhances locality
2. The base cases of the recursion are codelets:
   highly-optimized dense code
   automatically generated by a special-purpose “compiler”
FFTW Uses Natural Recursion
[Diagram: a size-8 DFT with radix 2 (p = 2) splits into two size-4 DFTs, each of which splits into two size-2 DFTs]
But traditional implementation is non-recursive,
breadth-first traversal:
log2 n passes over whole array
Traditional cache solution: Blocking
[Diagram: the same size-8, radix-2 DFT decomposition into size-4 and size-2 DFTs]
breadth-first, but with blocks of size = cache
…requires program specialized for cache size
Recursive Divide & Conquer is Good
[Singleton, 1967]
(depth-first traversal)
[Diagram: size-8 DFT, radix 2 (p = 2), recursively split into size-4 and size-2 DFTs, traversed depth-first]
eventually small enough to fit in cache
…no matter what size the cache is
Cache Obliviousness
• A cache-oblivious algorithm does not know the cache size
— it can be optimal for any machine
& for all levels of cache simultaneously
• Exist for many other algorithms, too [Frigo et al. 1999]
— all via the recursive divide & conquer approach
Why is FFTW fast?
three unusual features
3. FFTW implements many FFT algorithms:
   A planner picks the best composition
   by measuring the speed of different combinations.
1. The resulting plan is executed
   with explicit recursion:
   enhances locality
2. The base cases of the recursion are codelets:
   highly-optimized dense code
   automatically generated by a special-purpose “compiler”