Ernest Orlando Lawrence Berkeley National Laboratory


UPC Benchmarks
Kathy Yelick
LBNL and UC Berkeley
Joint work with The Berkeley UPC Group:
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove,
Parry Husbands, Costin Iancu, Rajesh Nishtala, Michael Welcome
UPC for the High End
One way to gain acceptance of a new language
• Make it run faster than anything else
Keys to high performance
• Parallelism:
• Scaling the number of processors
• Maximize single node performance
• Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
• Avoid (unnecessary) communication cost
• Latency, bandwidth, overhead
• Avoid unnecessary delays due to dependencies
• Load balance
• Pipeline algorithmic dependencies
NAS FT Case Study
• Performance of Exchange (All-to-all) is critical
• Determined by available bisection bandwidth
• Becoming more expensive as # processors grows
• Between 30-40% of the application's total runtime
• Even higher on a BG/L-scale machine
• Two ways to reduce Exchange cost
1. Use a better network (higher Bisection BW)
2. Spread communication out over longer period of time:
“All the wires all the time”
Default NAS FT Fortran/MPI relies on #1
Our approach builds on #2
3D FFT Operation with Global Exchange
[Figure: 3D FFT with a global exchange, from Thread 0's view. Rows are divided among threads; each thread does a 1D-FFT down the columns, then a transpose + 1D-FFT across the rows; the Exchange (Alltoall) sends cachelines to Thread 0, Thread 1, Thread 2, ...; each thread then performs the last 1D-FFT.]
• Single Communication Operation (Global Exchange) sends THREADS large messages
• Separate computation and communication phases
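
A minimal UPC sketch of this bulk, blocking exchange, assuming each thread has already packed one contiguous chunk per destination into a local send buffer; recv_dir, CHUNK, bulk_exchange, and send_buf are illustrative names, not the benchmark's actual code.

#include <upc.h>
#include <stddef.h>

#define CHUNK (512*1024)   /* bytes each thread sends to every other thread (Class D example) */

/* Directory of per-thread receive buffers; entry t has affinity to thread t. */
shared [] char * shared recv_dir[THREADS];

/* Bulk variant: all local 1D FFTs, then one global exchange, then the last FFTs. */
void bulk_exchange(const char *send_buf)   /* send_buf holds THREADS*CHUNK packed bytes */
{
    if (recv_dir[MYTHREAD] == NULL)
        recv_dir[MYTHREAD] = (shared [] char *) upc_alloc((size_t)THREADS * CHUNK);
    upc_barrier;                           /* every thread has published its buffer */

    for (int dst = 0; dst < THREADS; dst++)            /* one blocking put per destination */
        upc_memput(recv_dir[dst] + (size_t)MYTHREAD * CHUNK,
                   send_buf + (size_t)dst * CHUNK, CHUNK);

    upc_barrier;                           /* exchange complete; the last 1D FFTs may start */
}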
Overlapping Communication
• Several implementations; each processor owns a set of xy slabs (planes)
• 1) Bulk
• Do column/row FFTs, then send 1/pth of data to each, do z FFTs
• This can use overlap between messages
• 2) Slab
• Do column FFTs, then row FFT on first slab, then send it, repeat (see the sketch below)
• When done with xy, wait for the sends to complete and start on z
• 3) Pencil
• Do column FFTs, then row FFTs on first row, send it, repeat for each row and each slab
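
A sketch of the slab pipeline, overlapping each slab's send with the next slab's row FFTs. It assumes the Berkeley UPC non-blocking extensions; the header and function spellings (bupc_memput_async, bupc_waitsync) are from memory and the helper functions (fft_rows_on_slab, slab_dest, slab_src) are hypothetical stand-ins for the benchmark code.

#include <upc.h>
#include <bupc_extensions.h>   /* Berkeley UPC non-blocking extensions (bupc_*_async) */

#define MAX_SLABS 64           /* illustrative upper bound; assumes nslabs <= MAX_SLABS */

/* Hypothetical helpers standing in for the real benchmark code. */
void  fft_rows_on_slab(int s);             /* row 1D-FFTs for slab s (local work)  */
shared [] char *slab_dest(int s);          /* where slab s's packed data must land */
const char     *slab_src(int s);           /* locally packed data for slab s       */
extern size_t   slab_bytes;                /* bytes sent per slab                  */
extern int      nslabs;                    /* slabs owned by this thread           */

void slab_pipeline(void)
{
    bupc_handle_t h[MAX_SLABS];

    for (int s = 0; s < nslabs; s++) {
        fft_rows_on_slab(s);                       /* compute one slab ...           */
        h[s] = bupc_memput_async(slab_dest(s),     /* ... start sending it, and move */
                                 slab_src(s),      /* straight on to the next slab   */
                                 slab_bytes);
    }
    for (int s = 0; s < nslabs; s++)
        bupc_waitsync(h[s]);                       /* drain the outstanding puts     */
    upc_barrier;                                   /* all threads' data has arrived  */
    /* the last 1D FFTs in z can start here */
}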
Decomposing NAS FT Exchange into Smaller Messages
• Example Message Size Breakdown for Class D at 256 Threads
Exchange (Default): 512 Kbytes
Slabs (set of contiguous rows): 65 Kbytes
Pencils (single row): 16 Kbytes
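
A rough sanity check of these sizes, assuming the Class D grid is 2048 x 1024 x 1024 complex doubles (16 bytes each), that a pencil is one 1024-point row, and that a slab piece is about 4 such rows:

2048 * 1024 * 1024 points * 16 bytes / 256 threads = 128 MB owned per thread
128 MB per thread / 256 destination threads        = 512 KB   (one Exchange message)
1024 points * 16 bytes                             = 16 KB    (one pencil / row)
~4 rows * 16 KB                                    = ~64 KB   (one slab piece, the ~65 Kbytes above)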
NAS FT: UPC Non-blocking MFlops
[Figure: MFlop rate in UPC Blocking and Non-blocking FT, in MFlops per Thread (0-1000), comparing UPC Blocking and UPC Non-Blocking on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512.]
• The Berkeley UPC compiler supports non-blocking UPC extensions
• These produce a 15-45% speedup over the best UPC Blocking version
• The non-blocking version requires about 30 extra lines of UPC code
NAS FT: UPC Slabs or Pencils?
[Figures: Exchange Communication time for UPC Non-blocking Slabs and Pencils, in seconds (0-40), split into Row Communication as Slabs/Pencils and Last 1-D FFT with Slabs/Pencils; and UPC Non-blocking comparison decomposing the FFT in Slabs and Pencils, in MFlops per Thread (0-1000), for UPC Non-Blocking with Slabs vs. Pencils. Systems: Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, Elan4 512.]
• In MFlops, pencils (<16 KB messages) are 10% faster
• In communication time, pencils are on average about 15% slower than slabs
• However, pencils recover more time by allowing cache-friendly alignment and a smaller memory footprint in the last transpose + 1D-FFT
Outline
1. Unified Parallel C (UPC effort at LBNL/UCB)
2. GASNet – UPC's Communications System
• One-sided communication on Clusters (Firehose)
• Microbenchmarks
3. Bisection Bandwidth Problem
4. NAS FT: Decomposing communication to reduce Bisection Bandwidth
• Overlapping communication and computation
• UPC-specific NAS FT implementations
• UPC and MPI comparison
NAS FT Implementation Variants
• GWU UPC NAS FT
• Converted from OpenMP, data structures and communication operations unchanged
• Berkeley UPC NAS FT
• Aggressive use of non-blocking messages
• At Class D/256 Threads, each thread sends 4096 messages with FT-Pencils and 1024 messages with FT-Slabs for each 3D-FFT
• Use FFTW 3.0.1 for computation (best portability+performance)
• Add Column pad optimization (up to 4X speedup on Opteron; see the sketch below)
• Berkeley MPI NAS FT
• Reimplementation of Berkeley UPC non-blocking NAS-FT
• Incorporates same FFT and cache padding optimizations
• NAS FT Fortran
• Default NAS 2.4 release (benchmarked version uses FFTW)
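
A sketch of the kind of column pad optimization referred to above, assuming power-of-two row lengths; NX, PAD, and the allocator are illustrative, not the benchmark's actual code. Padding the leading dimension breaks the power-of-two stride that makes column-direction walks collide in the same cache sets.

#include <stdlib.h>
#include <complex.h>

#define NX   1024                    /* power-of-two row length (illustrative)    */
#define PAD  2                       /* a few extra elements per row break the    */
                                     /* power-of-two stride that causes cache-set */
                                     /* conflicts when walking down a column      */
#define LDX  (NX + PAD)              /* padded leading dimension                  */

/* Allocate an ny x NX plane with a padded leading dimension.
 * Element (y, x) lives at plane[y * LDX + x]; the PAD elements at the end
 * of each row are never touched. */
static double complex *alloc_padded_plane(size_t ny)
{
    return malloc(ny * LDX * sizeof(double complex));
}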
Pencil/Slab optimizations: UPC vs MPI
[Figure: Exchange Communication time for Non-blocking NAS FT variants, in seconds (0-50), for UPC Slabs, UPC Pencils, MPI Slabs, and MPI Pencils on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512. One MPI bar is annotated 312.8 seconds (off scale); annotations mark "MPI Crash (Pencils & Slabs)" on two configurations and "MPI Crash (Pencils)" on one.]
• The graph measures the cost of interleaving non-blocking communication with the 1D-FFT computations
• Non-blocking operations are handled uniformly well by UPC, but either crash MPI or cause performance problems (with notable exceptions for Myrinet and Elan3)
• Pencil communication produces less overhead on the largest Elan4/512 configuration
Pencil/Slab optimizations: UPC vs MPI
[Figure: Fraction of Unoverlapped MPI Communication that UPC Effectively Overlaps with Computation, relative to Time Spent in Communication (Best MPI); best MPI and best UPC for each system (Class/NProcs). Y-axis 0-1; systems: Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512.]
• Same data, viewed in the context of what MPI is able to overlap
• “For the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?”
• On InfiniBand, UPC overlaps almost all of the time that MPI spends in communication
• On Elan3, UPC obtains more overlap than MPI as the problem scales up
NAS FT Variants Performance Summary
[Figure: Best MFlop rates for all NAS FT Benchmark versions, in MFlops per Thread (0-1000), comparing Best NAS Fortran/MPI, Best MPI, and Best UPC on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512.]
• Shown are the largest classes/configurations possible on each test machine
• MPI is not particularly well tuned for many small/medium-size messages in flight (long message-matching queue depths)
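
For illustration only (not the benchmark's code), this is the kind of traffic pattern behind that last point: with pencil-sized messages, every rank keeps many non-blocking sends and receives in flight at once before a single wait, so the matching queues grow long. Buffer names, tags, and counts are made up.

#include <mpi.h>
#include <stdlib.h>

/* Each rank posts many small receives and sends at once, as a pencil-style
 * decomposition does; MPI must search its matching queues for every arrival. */
void pencil_style_exchange(double *sendbuf, double *recvbuf,
                           int pencils_per_peer, int pencil_len, MPI_Comm comm)
{
    int nprocs, me;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &me);

    int nmsg = nprocs * pencils_per_peer;
    MPI_Request *req = malloc(2 * (size_t)nmsg * sizeof(MPI_Request));
    int r = 0;

    for (int peer = 0; peer < nprocs; peer++)
        for (int p = 0; p < pencils_per_peer; p++)
            MPI_Irecv(recvbuf + (peer * pencils_per_peer + p) * pencil_len,
                      pencil_len, MPI_DOUBLE, peer, p, comm, &req[r++]);

    for (int peer = 0; peer < nprocs; peer++)
        for (int p = 0; p < pencils_per_peer; p++)
            MPI_Isend(sendbuf + (peer * pencils_per_peer + p) * pencil_len,
                      pencil_len, MPI_DOUBLE, peer, p, comm, &req[r++]);

    MPI_Waitall(2 * nmsg, req, MPI_STATUSES_IGNORE);
    free(req);
}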
Case Study in NAS CG
• Problems in NAS CG are different from those in FT
• Reductions, including vector reductions
• Highlights need for processor team reductions
• Using a one-sided, low-latency model (see the sketch below)
• Performance:
• Comparable (slightly better) performance in UPC than MPI/Fortran
• Current focus on more realistic CG
• Real matrices
• 1D layout
• Optimize across iterations (BeBOP Project)
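
A minimal sketch of a reduction in that one-sided style, assuming a flat (non-team) sum gathered by thread 0; the names are illustrative, and a real vector or team reduction would use a tree or the UPC collectives library rather than this flat loop.

#include <upc.h>

shared double partial[THREADS];   /* element t has affinity to thread t      */
shared double total;              /* scalar result, has affinity to thread 0 */

/* Every thread contributes myval; thread 0 gathers the partial values with
 * one-sided reads, sums them, and publishes the total. */
double onesided_sum(double myval)
{
    partial[MYTHREAD] = myval;    /* purely local write */
    upc_barrier;

    if (MYTHREAD == 0) {
        double s = 0.0;
        for (int t = 0; t < THREADS; t++)
            s += partial[t];      /* remote read for t != 0 */
        total = s;
    }
    upc_barrier;
    return total;                 /* remote read of the result on thread 0 */
}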
Direct Method Solvers in UPC
• Direct methods (LU, Cholesky) have more complicated parallelism than iterative solvers
• Especially with pivoting (unpredictable communication)
• Especially for sparse matrices (dependence graph with holes)
• Especially with overlap to break dependencies (in HPL, not ScaLAPACK)
• Today: Complete HPL/UPC
• Highly multithreaded: UPC threads + user threads + threaded BLAS
• More overlap and more dynamic than MPI version for sparsity
• Overlap limited only by memory size
• Future: Support for Sparse SuperLU-like code in UPC
• Scheduling and high-level data structures in the HPL code are designed for the sparse case, but not yet fully “sparsified”
LU Factorization
[Figure: LU factorization block structure. Blocks are 2D block-cyclic distributed. Shown are the finished part of U, the finished multipliers (computed part of L), the panel being factored (A(i,i), A(j,i)), the part of U to be updated (A(i,k)), and the trailing matrix to be updated (A(j,k)). Panel factorizations involve communication for pivoting; the trailing-matrix update is matrix-matrix multiplication, which can be coalesced.]
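
A small sketch of the 2D block-cyclic ownership rule from the figure, assuming a Pr x Pc process grid, NB x NB blocks, and a row-major ranking of the grid; all names are illustrative.

/* 2D block-cyclic distribution: block (I, J) of the matrix (block row I,
 * block column J) is owned by process (I mod Pr, J mod Pc) on a Pr x Pc grid. */
static int block_owner(int I, int J, int Pr, int Pc)
{
    int prow = I % Pr;                 /* process row                     */
    int pcol = J % Pc;                 /* process column                  */
    return prow * Pc + pcol;           /* linearized rank, row-major grid */
}

/* Element (i, j) with block size NB lives in block (i / NB, j / NB). */
static int element_owner(int i, int j, int NB, int Pr, int Pc)
{
    return block_owner(i / NB, j / NB, Pr, Pc);
}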
Dense Matrix HPL UPC vs. MPI
[Figures: X1 Linpack Performance (GFlop/s, scale 0-1400) and Altix Linpack Performance (GFlop/s, scale 0-160), comparing MPI/HPL and UPC at several processor counts (32, 60, 64, 128).]
• Large scaling: 2.2 TFlops on 512 Itanium/Quadrics
• Remaining issues
• Keeping the machines up and getting large jobs through queues
• Altix issue with Shmem
• BLAS on Opteron and X1
End of Slides