Multi-core Acceleration of NWP

Multi-core Acceleration of NWP
John Michalakes, NCAR
John Linford, Virginia Tech
Manish Vachharajani, University of Colorado
Adrian Sandu, Virginia Tech
HPC Users Forum, September 10, 2009
Outline
• WRF and multi-core overview
• Cost breakdown and kernel repository
• Two cases
• Path forward
WRF Overview
• Large collaborative effort to develop community weather model
  – 10,000+ registered users
  – Applications
    • Numerical Weather Prediction
    • High-resolution climate
    • Air quality research/prediction
    • Wildfire
    • Atmospheric research
• Software designed for HPC
  – Ported to and in use on virtually all types of system in the Top500
  – 2007 Gordon Bell finalist
• Why acceleration?
  – Exploit fine-grained parallelism
  – Cost performance ($ and e-)
  – Need for strong scaling
[Figure: 5-day global WRF forecast at 20 km horizontal resolution, courtesy Peter Johnsen, Cray]
Multi-/Many-core
• “Traditional” multi-core
  – Xeon 5500, Opteron Istanbul, Power 6/7
  – Much improved memory b/w (5x on Stream*)
  – Hyperthreading/SMT
  – Includes heterogeneity in the form of SIMD units & instructions
  – x86 instruction set; native C, Fortran, OpenMP, ...
  *http://www.advancedclustering.com/company-blog/stream-benchmarking.html
• Cell Broadband Engine
  – PowerXCell 8i
  – PowerPC with 8 co-processors on a chip
  – No shared memory, but relatively large local stores per core
  – Cores separately programmed; all computation and data movement programmer controlled
• Graphics Processing Units
  – NVIDIA GTX280, AMD
  – High-end versions of commodity graphics cards
  – O(100) physical SIMD cores supporting O(1000)-way concurrent threads
  – Separate co-processor to host CPU, PCIe connection
  – Large register files, fast (but very small) shared memory
  – Programmed using special-purpose threading languages: CUDA, OpenCL
  – Higher-level language support in development (e.g. PGI 9)
WRF Cost Breakdown and Kernel Repository
[Figure: pie chart of percentages of total run time (single-processor profile): microphysics, dynamics and other, radiation, planetary boundary layer, cumulus, TKE, surface processes]
www.mmm.ucar.edu/wrf/WG2/GPU
WSM5 Microphysics
• WRF Single Moment 5-Tracer (WSM5)* scheme
• Represents condensation, precipitation, and thermodynamic effects of latent heat release
• Operates independently on each column of the 3D WRF domain
• Expensive and relatively computationally intense (~2 ops per word)
*Hong, S., J. Dudhia, and S. Chen (2004). Monthly Weather Review, 132(1):103-120.
WSM5 Microphysics
[Figure: contributed by Roman Dubtsov, Intel]
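Because each (i, j) column is independent, the natural accelerator decomposition is one thread per column. Below is a minimal CUDA sketch of that mapping only, not the WSM5 CUDA code distributed with WRF; the kernel name, the array t, and the per-level update are hypothetical placeholders.

    #include <cuda_runtime.h>

    // Sketch: one GPU thread per (i,j) column; each thread sweeps its own
    // column in k. The real WSM5 kernel updates many microphysics fields here.
    __global__ void wsm5_columns(float *t, int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx || j >= ny) return;

        for (int k = 0; k < nz; ++k) {
            int idx = (k * ny + j) * nx + i;   // i fastest-varying for coalesced access
            t[idx] += 0.0f;                    // placeholder for the per-level update
        }
    }

    int main(void)
    {
        const int nx = 128, ny = 128, nz = 40;   // illustrative domain size
        float *t;
        cudaMalloc((void **)&t, sizeof(float) * nx * ny * nz);

        dim3 block(32, 4);
        dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
        wsm5_columns<<<grid, block>>>(t, nx, ny, nz);
        cudaDeviceSynchronize();

        cudaFree(t);
        return 0;
    }

Because there is no data dependence between columns, no synchronization between threads is needed; the only coupling is within a thread's own k-loop.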
WSM5 Microphysics
• CUDA version distributed with WRFV3
• Users have seen 1.2-1.3x improvement
  – Case/system dependent
  – Makes other parts of the code run faster (!)
• PGI has implemented the kernel with its 9.0 acceleration directives and seen comparable speedups, along with overheads from transfer cost
[Figure: WRF CONUS 12 km benchmark, total seconds and microphysics seconds, courtesy Brent Leback and Craig Toepfer, PGI]
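The modest whole-application gain from a much faster kernel is what Amdahl's law predicts. As an illustration only (the runtime fraction and kernel speedup below are assumed, not taken from the benchmark), if microphysics were about 25% of total run time and the kernel itself ran roughly 10x faster:

    S_overall = 1 / ((1 - f) + f/s) = 1 / (0.75 + 0.25/10) ≈ 1.29

which lands in the reported 1.2-1.3x range; PCIe transfer overheads reduce the effective kernel speedup s further.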
Kernel: WRF-Chem
• WRF model coupled to atmospheric chemistry* for air quality research and air pollution forecasting
  – Time evolution and advection of tens to hundreds of chemical species being produced and consumed at varying rates in networks of reactions
  – Many times the cost of the core meteorology; seemingly ideal for acceleration
*Grell et al., WRF Chem Version 3.0 User’s Guide, http://ruc.fsl.noaa.gov/wrf/WG11
**Hairer, E. and G. Wanner. Solving ODEs II: Stiff and Differential-Algebraic Problems, Springer, 1996.
***Damian, et al. (2002). Computers & Chemical Engineering 26, 1567-1579.
Kernel: WRF-Chem
• Rosenbrock** solver for stiff system of ODEs at each cell
  – Computation at each cell independent: perfectly parallel
  – Solver itself is not parallelizable
  – 600K fp ops per cell (compare to 5K ops/cell for meteorology)
  – 1 million load-stores per cell
  – Very large footprint: 15KB of state at each cell
Kernel: WRF-Chem
• KPP*** generates a Fortran solver called at each grid cell in the 3D domain
• Multi-core implementation
  – Insert OpenMP directives and multithread the loop over grid cells
      !$OMP PARALLEL DO
      DO J = 1, ...
        DO K = 1, ...
          DO I = 1, ...
            ! KPP-generated solver invoked for cell (I,K,J)
Kernel: WRF-Chem
• Cell BE implementation
  – PPU acts as master, invoking SPEs cell-by-cell
  – SPUs round-robin through the domain, triple buffering
  – Enhancements:
    • SPUs process cells in blocks of four
    • Cell index innermost within blocks, to utilize SIMD
Kernel: WRF-Chem
• GPU implementation
  – Host CPU controls the outer loop over steps in the Rosenbrock algorithm
  – Each step implemented as a kernel over all cells in the domain
  – Thread-shared memory utilized where possible (but not much)
  – Cells are masked out of the computation as they reach convergence
  – Cell index (thread index) innermost for coalesced access to device memory
[Figure: CPU driving the sequence of GPU kernel invocations, one per Rosenbrock step]
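A minimal CUDA sketch of the pattern above (not the actual WRF-Chem GPU code): the host loops over Rosenbrock stages and launches one kernel per stage; each thread owns one cell, skips cells already flagged converged, and addresses device arrays with the cell index innermost. The names rosenbrock_stage, y, converged, and NVAR are hypothetical.

    #include <cuda_runtime.h>

    #define NVAR 64   // chemical species per cell (illustrative)

    // One Rosenbrock stage over all cells: one thread per cell, masked by a
    // per-cell convergence flag. Cell index c is innermost in y (c + v*ncells),
    // so neighboring threads touch neighboring addresses (coalesced).
    __global__ void rosenbrock_stage(float *y, const int *converged, int ncells)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= ncells || converged[c]) return;   // mask out finished cells

        for (int v = 0; v < NVAR; ++v) {
            y[v * ncells + c] *= 1.0f;             // placeholder stage update
        }
    }

    int main(void)
    {
        const int ncells = 1 << 20, nstages = 6;   // illustrative sizes
        float *y;
        int *converged;
        cudaMalloc((void **)&y, sizeof(float) * NVAR * ncells);
        cudaMalloc((void **)&converged, sizeof(int) * ncells);
        cudaMemset(converged, 0, sizeof(int) * ncells);

        dim3 block(256), grid((ncells + block.x - 1) / block.x);
        for (int s = 0; s < nstages; ++s) {        // host controls the outer loop
            rosenbrock_stage<<<grid, block>>>(y, converged, ncells);
        }
        cudaDeviceSynchronize();
        cudaFree(y);
        cudaFree(converged);
        return 0;
    }

Keeping the cell index fastest-varying is what gives the coalesced accesses the slide calls out; the convergence mask avoids wasted arithmetic but leaves warps partially idle as their cells finish at different times.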
Chemistry Performance on GPU
[Figure: chemistry kernel performance results]
Some preliminary conclusions
• Chemistry kinetics
  – Very expensive but not computationally intense, so data movement costs are high and the memory footprint is very large (15K bytes per cell)
  – Each Cell BE core has a local store large enough to allow effective overlap of computation and communication
  – Today’s GPUs have ample device memory bandwidth, but there is not enough fast local memory for the working set
  – Xeon 5500 and Power6 have sufficient cache, concurrency, and bandwidth
• WRF Microphysics
  – More computationally intense, so the GPU has an edge, but the Xeon is closing
  – PCIe transfer costs tip the balance to the Xeon, but this can be addressed, e.g. by overlapping transfers with computation (see the sketch below)
  – Haven’t tried on Cell
• In all cases, the conventional multi-core CPU is easier to program, debug, and optimize
Garcia, J. R. Kelly, and T. Voran. Computing Spectropolarimetric Signals on Accelerator Hardware: Comparing the Cell BE and NVIDIA GPUs. Proceedings of 2009 LCI Conference, 10-12 March 2009, Boulder, CO.
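One standard way to address the PCIe cost flagged above, shown here only as a hedged sketch rather than anything from the WRF code, is to pin host buffers and use CUDA streams so that the transfers for one block of cells overlap the kernel for another. The chunking scheme and names below are hypothetical.

    #include <cuda_runtime.h>

    #define NCHUNKS 4
    #define CHUNK   (1 << 18)   // cells per chunk (illustrative)

    __global__ void physics_chunk(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 1.0f;                 // placeholder per-cell work
    }

    int main(void)
    {
        float *h, *d;
        cudaHostAlloc((void **)&h, sizeof(float) * NCHUNKS * CHUNK,
                      cudaHostAllocDefault);     // pinned host memory
        cudaMalloc((void **)&d, sizeof(float) * NCHUNKS * CHUNK);

        cudaStream_t s[NCHUNKS];
        for (int c = 0; c < NCHUNKS; ++c) cudaStreamCreate(&s[c]);

        // Each chunk's H2D copy, kernel, and D2H copy are queued in its own
        // stream, so one chunk's copies can overlap another chunk's kernel.
        for (int c = 0; c < NCHUNKS; ++c) {
            size_t off = (size_t)c * CHUNK;
            cudaMemcpyAsync(d + off, h + off, sizeof(float) * CHUNK,
                            cudaMemcpyHostToDevice, s[c]);
            physics_chunk<<<(CHUNK + 255) / 256, 256, 0, s[c]>>>(d + off, CHUNK);
            cudaMemcpyAsync(h + off, d + off, sizeof(float) * CHUNK,
                            cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();

        for (int c = 0; c < NCHUNKS; ++c) cudaStreamDestroy(s[c]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }

On 2009-era parts with a single copy engine this overlaps one transfer direction with compute; the point is only that the transfer cost need not be paid serially with the kernel.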
Accelerators for Weather and Climate?
• Considerable conversion effort and maintenance issues, especially for large legacy codes
• What speedup justifies the effort?
• What limits speedups?
  – Fast, close, and large enough memory resources
  – Distance from the host processor
  – A moving baseline: CPUs are getting faster
• Can newer generations of accelerators address these limits?
  – Technically: probably
  – Business case: ???