Benchmarks for Parallel Systems
Sources/Credits:
"Performance of Various Computers Using Standard Linear Equations Software", Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
Top500 (courtesy: Jack Dongarra): http://www.top500.org
LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
"The LINPACK Benchmark: Past, Present, and Future", Jack Dongarra, Piotr Luszczek, and Antoine Petitet
NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
LINPACK (Dongarra, 1979)
Dense system of linear equations
Initially used as a user's guide for the LINPACK package
Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark
LINPACK benchmark
Implemented on top of BLAS 1
Two main operations: DGEFA (Gaussian elimination, O(n^3)) and DGESL (solution of Ax = b from the factors, O(n^2))
Major operation (97% of the time): DAXPY, y = y + alpha*x
Its multiply-add is executed about n^3/3 + n^2 times, hence approximately 2n^3/3 + 2n^2 flops in total
Based on 64-bit floating-point arithmetic
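
To make the dominant operation concrete, here is a minimal C sketch of a DAXPY kernel (an illustrative stand-in, not the actual netlib BLAS routine):

#include <stdio.h>

/* DAXPY: y := y + alpha * x -- the BLAS-1 operation that accounts
 * for roughly 97% of the arithmetic in the LINPACK benchmark. */
static void daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];    /* one multiply + one add = 2 flops */
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    daxpy(4, 0.5, x, y);         /* y becomes {4.5, 4.0, 3.5, 3.0} */
    for (int i = 0; i < 4; i++) printf("%g ", y[i]);
    printf("\n");
    return 0;
}

Each element contributes one multiply and one add, which is where the 2 in the 2n^3/3 + 2n^2 flop count comes from.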
LINPACK
N=100: a 100x100 system of equations. No changes to the code are allowed; the user only supplies a timing routine called SECOND, so performance gains come from the compiler alone.
N=1000: a 1000x1000 system. The user may implement any code, as long as it delivers the required accuracy: Towards Peak Performance (TPP). The driver program always credits 2n^3/3 + 2n^2 flops, whatever algorithm is actually used.
"Highly Parallel Computing" benchmark: any software may be used, and the matrix size can be chosen. Used in the Top500.
All three are based on 64-bit floating-point arithmetic.
LINPACK
100x100: exercises inner-loop optimization
1000x1000: exercises three-loop / whole-program optimization
Scalable parallel program: the largest problem that can fit in memory
Template of LINPACK code:
Generate
Solve
Check
Time
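
A schematic C driver following this generate/solve/check/time template is sketched below. It leans on LAPACK's dgesv as a stand-in solver (assumed to be available and linked, e.g. with -llapack); the netlib LINPACK driver differs in detail:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* LAPACK solver for A x = b (column-major), assumed linked in. */
extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                   int *ipiv, double *b, const int *ldb, int *info);

int main(void)
{
    int n = 1000, nrhs = 1, info;
    double *a  = malloc((size_t)n * n * sizeof *a);   /* working copy of A */
    double *a0 = malloc((size_t)n * n * sizeof *a0);  /* pristine A for check */
    double *b  = malloc((size_t)n * sizeof *b);
    double *x  = malloc((size_t)n * sizeof *x);
    int *ipiv  = malloc((size_t)n * sizeof *ipiv);

    /* Generate: random A and b. */
    srand(1);
    for (int i = 0; i < n * n; i++) a[i] = a0[i] = rand() / (double)RAND_MAX - 0.5;
    for (int i = 0; i < n; i++)     b[i] = x[i]  = rand() / (double)RAND_MAX - 0.5;

    /* Solve (timed): x is overwritten with the solution of A x = b. */
    clock_t t0 = clock();
    dgesv_(&n, &nrhs, a, &n, ipiv, x, &n, &info);
    double secs = (clock() - t0) / (double)CLOCKS_PER_SEC;

    /* Check: residual A x - b should be tiny. */
    double rmax = 0.0;
    for (int i = 0; i < n; i++) {
        double r = -b[i];
        for (int j = 0; j < n; j++) r += a0[i + (size_t)j * n] * x[j];
        if (r < 0) r = -r;
        if (r > rmax) rmax = r;
    }

    /* Time: report MFLOPS using the fixed 2n^3/3 + 2n^2 count. */
    double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * (double)n;
    printf("n=%d  time=%.3fs  %.2f MFLOPS  max residual=%g\n",
           n, secs, flops / secs / 1e6, rmax);
    return 0;
}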
HPL (implementation of the HPLinpack benchmark)
HPL algorithm:
- 2-D block-cyclic data distribution
- Right-looking LU factorization
- Panel factorization, with various options:
  - Crout, left-looking, or right-looking recursive variants based on matrix multiply
  - number of sub-panels
  - recursive stopping criteria
  - pivot search and broadcast by binary exchange
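
The 2-D block-cyclic distribution deals blocks of the matrix out to a P x Q process grid, cyclically in both dimensions. A small C sketch of the owner computation (names and parameters are illustrative, not HPL's internals):

#include <stdio.h>

/* Owner of global block (ib, jb) under a 2-D block-cyclic
 * distribution over a P x Q process grid (HPL-style layout). */
static void owner(int ib, int jb, int P, int Q, int *prow, int *pcol)
{
    *prow = ib % P;   /* block rows dealt cyclically to grid rows */
    *pcol = jb % Q;   /* block cols dealt cyclically to grid cols */
}

int main(void)
{
    int P = 2, Q = 4, nb = 64;    /* example grid and block size */
    int i = 300, j = 900;         /* a global matrix element     */
    int prow, pcol;
    owner(i / nb, j / nb, P, Q, &prow, &pcol);
    printf("element (%d,%d) lives on process (%d,%d)\n", i, j, prow, pcol);
    return 0;
}

Cyclic dealing keeps every process busy in every step of the factorization, instead of idling the processes that own already-factored columns.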
HPL algorithm (continued)
- Panel broadcast: a choice of several broadcast algorithms
- Update of trailing matrix: look-ahead pipeline (the next panel is factored and broadcast while the rest of the trailing matrix is still being updated)
- Validity check: the scaled residual, e.g. ||Ax - b|| / (eps * n * ||A|| * ||x||), should be O(1)
Top500 (www.top500.org)
Started in 1993
Published twice a year, in June and November
For each system, Top500 reports Nmax (the problem size at which Rmax is achieved), Rmax (the best achieved LINPACK performance), N1/2 (the problem size achieving half of Rmax), and Rpeak (the theoretical peak performance)
India and Top 500

Rank | Site                                                  | System                                                      | Vendor          | Processors | Rmax (GFlops) | Rpeak (GFlops)
111  | Geoscience (B), India                                 | BladeCenter HS20 Cluster, Xeon EM64T 3.4 GHz, Gig-Ethernet  | IBM             | 1024       | 3755          | 6963.2
204  | Semiconductor Company (L), India                      | eServer, Opteron 2.6 GHz, Gig-Ethernet                      | IBM             | 1024       | 2791          | 5324.8
231  | Semiconductor Company (K), India                      | xSeries x336 Cluster, Xeon EM64T 3.6 GHz, Gig-Ethernet      | IBM             | 730        | 2676.88       | 5256
293  | Institute of Genomics and Integrative Biology, India  | Cluster Platform 3000 DL140G2, Xeon 3.6 GHz, Infiniband     | Hewlett-Packard | 576        | 2156          | 4147.2
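
As a sanity check on these numbers (assuming the listed values are in GFlops and two floating-point operations per cycle per processor, typical of these Xeon and Opteron parts), Rpeak = processors x clock x 2: for the first entry, 1024 x 3.4 GHz x 2 = 6963.2 GFlops, and likewise 576 x 3.6 x 2 = 4147.2 for the last.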
NAS Parallel Benchmarks (NPB)
Also used for the evaluation of supercomputers
A set of 8 programs drawn from CFD: 5 kernels and 3 pseudo-applications
NPB 1: the original benchmarks
NPB 2: NAS's MPI implementation; NPB 2.4 Class D has more work and more I/O
NPB 3: versions based on OpenMP, HPF, and Java
GridNPB3: for computational grids
NPB 3 multi-zone: for hybrid parallelism
NPB 1.0 (March 1994)
Defines Class A and Class B versions
"Paper and pencil" algorithmic specifications
Generic benchmarks, as compared to the MPI-based LINPACK (HPL)
General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
Sample implementations provided
Kernel Benchmarks
EP: embarrassingly parallel (an EP-style sketch follows this list)
MG: multigrid; regular communication
CG: conjugate gradient; irregular long-distance communication
FT: a 3-D PDE solved using FFTs; a rigorous test of long-distance communication
IS: large integer sort
Detailed rules cover:
- a brief statement of the problem
- the algorithm to be used
- validation of results
- where to insert timing calls
- the method for generating random numbers
- submission of results
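
To illustrate the EP pattern (fully independent work per process, a single reduction at the end), here is a minimal MPI sketch in C. It estimates pi by random sampling; this is an illustrative stand-in, not the official NPB EP kernel, which generates Gaussian random deviates:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Independent work: sample random points in the unit square. */
    long n = 1000000, hits = 0;
    srand(rank + 1);                      /* distinct seed per rank */
    for (long k = 0; k < n; k++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }

    /* The only communication: one global sum at the end. */
    long total = 0;
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total / ((double)n * size));
    MPI_Finalize();
    return 0;
}

Because the only communication is a single MPI_Reduce, EP measures per-node floating-point and random-number performance with essentially no network component.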
Pseudo-applications / synthetic CFD codes
Benchmark 1: perform a few iterations of the approximate factorization algorithm (BT)
Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (SP)
Benchmark 3: perform a few iterations of SSOR (LU)
Class A and Class B: problem sizes are specified separately for the sample code, Class A, and Class B versions of each benchmark.
NPB 2.0 (1995)
MPI and Fortran 77 implementations
2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
Class C: a bigger problem size
Benchmark rules: results are reported according to how much of the source code was changed (0%, up to 5%, or more than 5%)
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
EP and IS added
FT rewritten
NPB 2.4: Class D, with a rationale for the Class D sizes
2.4 I/O: a new benchmark problem based on BT (BTIO) to test output capabilities
An MPI implementation of the same (using MPI-IO), with different options, e.g. with or without collective buffering (see the sketch below)
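
A minimal C sketch of the choice BTIO exposes: each rank writes its slice of a shared file through MPI-IO, either independently (MPI_File_write_at) or collectively, where the library may apply collective buffering (MPI_File_write_at_all). The file name and sizes here are illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024];                     /* this rank's slice of output */
    for (int i = 0; i < 1024; i++) buf[i] = rank + i * 1e-6;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "btio.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset off = (MPI_Offset)rank * sizeof buf;
    /* Collective variant: lets the MPI library aggregate writes
     * (collective buffering). The independent alternative is
     * MPI_File_write_at with the same arguments. */
    MPI_File_write_at_all(fh, off, buf, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}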