Transcript ppt - Cosmo

Priority Project
Performance On Massively Parallel
Architectures (POMPA)
Nice to meet you!
COSMO GM10, Moscow
Overview
• Motivation
• COSMO code (as seen by a computer engineer)
• Important bottlenecks
  • Memory bandwidth
  • Scaling
  • I/O
• POMPA overview
Motivation
• What can you do with more computational power?
Resolution (x 1.25)
Lead time (x 2)
# EPS members (x 2)
Model complexity (x 2)
Motivation
• How to increase computational power?
[Diagram: computational power can be increased via the algorithm, the efficiency of the implementation, or the computer; POMPA targets efficiency]
Motivation
• Moore’s law has held since the 1970s and will probably continue to hold (?)
• Up to now we didn’t need to worry much about adapting our codes, so why should we worry now?
Current HPC Platforms
• Research system: Cray XT5 – “Rosa”
  • 3688 AMD hexa-core Opteron @ 2.4 GHz (212 TF)
  • 28.8 TB DDR2 RAM
  • 9.6 GB/s interconnect bandwidth
• Operational system: Cray XT4 – “Buin”
  • 264 AMD quad-core Opteron @ 2.6 GHz (4.6 TF)
  • 2.1 TB DDR RAM
  • 7.6 GB/s interconnect bandwidth
• Old system: Cray XT3 – “Palu”
  • 416 AMD dual-core Opteron @ 2.6 GHz (5.7 TF)
  • 0.83 TB DDR RAM
  • 7.6 GB/s interconnect bandwidth
Source: CSCS
The Thermal Wall
• Power ~ Voltage² × Frequency ~ Frequency³ (supply voltage scales roughly with clock frequency, so power grows with the cube of the frequency)
• Clock frequency will not follow Moore’s Law!
Source: Intel
Moore’s Law Reinterpreted
• Number of cores doubles every year while clock speed
decreases (not increases)
Source: Wikipedia
What are transistors used for?
• AMD Opteron (single-core)
[Annotated die photo: most of the transistors serve memory (latency avoidance), load/store/control (latency tolerance), and the memory and I/O interface]
Source: Advanced Micro Devices Inc.
The Memory Gap
• Memory speed only doubles every 6 years!
Source: Hennessy and Patterson, 2006
“Brutal Facts of HPC”
• Massive concurrency – increase in number of cores, stagnant or decreasing clock frequency
• Less and “slower” memory per thread – memory bandwidth per instruction/second and thread will decrease, more complex memory hierarchies
• Only slow improvements of inter-processor and inter-thread communication – interconnect bandwidth will improve only slowly
• Stagnant I/O sub-systems – technology for long-term data storage will stagnate compared to compute performance
• Resilience and fault tolerance – the mean time to failure of a massively parallel system may be short compared to the time to solution of a simulation, so fault-tolerant software layers are needed
→ We will have to adapt our codes to exploit the power of future HPC architectures!
Source: HP2C
Why a new Priority Project?
• Efficient codes may enable new science and save money
for operations
• We need to adapt our codes to efficiently run on current /
future massively parallel architectures!
• Great opportunity to profit from the momentum and know-how generated by the HP2C or G8 projects and use synergies (e.g. ICON).
• Consistent with goals of the COSMO Science Plan and
similar activities in other consortia.
COSMO Code
• How would a computer engineer look at the COSMO code?
COSMO Code
• 227’389 lines of Fortran 90 code
[Charts: % of code lines vs. % of runtime of a COSMO-2 forecast by component; the dynamics account for a large share of the runtime]
Key Algorithmic Motifs
• Stencil computations
do k=1,ke
  do j=1,je
    do i=1,ie
      ! 3-point stencil in i with weights w1, w2, w3
      ! (bounds simplified for the slide; accessing b(i+1,...) and b(i-1,...) requires halo/interior handling in the real code)
      a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
    end do
  end do
end do
• Tridiagonal solver (vertical, Thomas algorithm)
do j=1,je
  ! forward sweep: modify coefficients
  do k=2,ke
    do i=1,ie
      c(i,j,k) = 1.0 / ( b(i,j,k) - c(i,j,k-1) * a(i,j,k) )
      d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k)
    end do
  end do
  ! back substitution (sequential in k, vectorizable over i)
  do k=ke-1,1,-1
    do i=1,ie
      x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
    end do
  end do
end do
Code / Data Structures
• field(ie,je,ke,nt) – in Fortran the first index is the fastest varying (see the layout sketch below)
• Optimized for minimal computation (precalculations)
• Optimized for vector machines
• Often repeatedly sweeps over the complete grid (bad cache usage)
• A lot of copy-paste for handling different configurations (difficult to maintain)
• Metric terms and different averaging positions make the code complex
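To make the layout concrete, here is a minimal sketch (array sizes and the loop are illustrative, not the actual COSMO declarations): with the (i,j,k,t) ordering, i is contiguous in memory, so a sweep along a vertical column strides by ie*je elements and caches poorly.

program layout_sketch
  implicit none
  integer, parameter :: ie = 520, je = 350, ke = 60, nt = 3
  real, allocatable :: field(:,:,:,:)
  integer :: k

  allocate(field(ie,je,ke,nt))      ! i is the fastest-varying (contiguous) index

  ! Walking a vertical column (fixed i and j) touches elements that are
  ! ie*je reals apart in memory; bad news for k-oriented algorithms
  ! such as the tridiagonal solver shown earlier.
  do k = 1, ke
     field(1,1,k,1) = real(k)       ! stride between iterations: ie*je elements
  end do

  print *, 'column sum =', sum(field(1,1,:,1))
  deallocate(field)
end program layout_sketch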
Parallelization Strategy
• How to distribute work onto O(1000) cores?
• 2D-domain decomposition using MPI library calls
• Example: operational COSMO-2
  Total: 520 x 350 x 60 gridpoints
  Per core: 24 x 16 x 60 gridpoints
  Exchange halo information with MPI
  halo/comp = 0.75 (see the small check below)
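As a quick sanity check on that halo/comp figure, here is a tiny sketch; the halo width of 3 grid points is an assumption (the actual COSMO halo setting may differ), and it lands close to the 0.75 quoted above.

program halo_ratio
  implicit none
  integer, parameter :: ni = 24, nj = 16   ! per-core subdomain from the slide
  integer, parameter :: nb = 3             ! assumed halo width in grid points
  integer :: comp, halo

  comp = ni * nj                                   ! columns computed by this core
  halo = (ni + 2*nb) * (nj + 2*nb) - comp          ! halo columns surrounding them
  print '(a,f5.2)', 'halo/comp = ', real(halo) / real(comp)   ! prints ~0.72
end program halo_ratio

For subdomains this small, the halo is almost as large as the computed region, which is one reason why strong scaling of small problems becomes hard.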
Bottlenecks?
• What are/will be the main bottlenecks of the COSMO code
on current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
Memory scaling
• Problem size 102 x 102 x 60 gridpoints (60 cores, similar to COSMO-2)
• Keep the number of cores constant, vary the number of cores used per node
[Chart: relative runtime, 4 cores = 100%]
HP2C: Feasibility Study
• Goal: Investigate how COSMO would have to be
implemented in order to reach optimal performance on
modern processors
• Tasks
  • understand the code
  • performance model
  • prototype software
  • new software design proposal
• Company: SCS (http://www.scs.ch/)
• Duration: 4 months (3 months of work)
Feasibility Study: Idea
• Focus only on dynamical core (fast wave solver) as it…
• dominates profiles (30% time)
• contains the key algorithmic motifs
(stencils, tridiagonal solver)
• is of manageable size (14’000 lines)
• can be run stand-alone in a meaningful way
• correctness of prototype can be verified
Feasibility Study: Results
[Chart: performance of the prototype vs. the original code]
Key Ingredients
• Reduce the number of memory accesses (less precalculation)
• Change index order from (i,j,k) to (2,k,i/2,j) or (2,k,j/2,i)
  • cache efficiency in the tridiagonal solver
  • don’t load the halo into cache
• Use iterators instead of on-the-fly array position computations
• Merge loops in order to reduce the number of sweeps over the full domain (see the sketch below)
• Vectorize as much of the code as possible
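To illustrate the loop-merging idea, here is a minimal sketch (field names, weights and sizes are made up, not taken from the prototype): two separate sweeps stream the temporary field through memory and back, the merged version does one pass.

program merge_sketch
  implicit none
  integer, parameter :: ie = 102, je = 102, ke = 60
  real, parameter :: w1 = 0.25, w2 = 0.5, w3 = 0.25, dt = 20.0
  real, allocatable :: a(:,:,:), b(:,:,:), tmp(:,:,:)
  integer :: i, j, k

  allocate(a(ie,je,ke), b(ie,je,ke), tmp(ie,je,ke))
  a = 1.0
  b = 0.0

  ! Two separate sweeps over the full domain: the temporary field tmp is
  ! written to memory and then read back again.
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        tmp(i,j,k) = w1*a(i-1,j,k) + w2*a(i,j,k) + w3*a(i+1,j,k)
      end do
    end do
  end do
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        b(i,j,k) = b(i,j,k) + dt * tmp(i,j,k)
      end do
    end do
  end do

  ! Merged sweep: same result, no temporary field, less memory traffic.
  b = 0.0
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        b(i,j,k) = b(i,j,k) + dt * ( w1*a(i-1,j,k) + w2*a(i,j,k) + w3*a(i+1,j,k) )
      end do
    end do
  end do

  print *, 'b(2,2,1) =', b(2,2,1)
end program merge_sketch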
GPUs have O(10) higher bandwidth!
Source: Prof. Aoki, Tokyo Tech
Bottlenecks?
• What are the main bottlenecks of the COSMO code on
current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
“Weak” scaling
• Problem size 1142 x 765 x 90 gridpoints (dt = 8s), “COSMO-2”
Source: Matt Cordery, CSCS
Strong scaling (small problem)
• Problem size 102 x 102 x 60 gridpoints (dt = 20s), “COSMO-2”
Improve Scalability?
• Several approaches can be followed...
  • Improve the MPI parallelization
  • Hybrid parallelization (loop level) – see the setup sketch below
  • Hybrid parallelization (restructure the code)
  • ...
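Any hybrid variant needs MPI and OpenMP to coexist. A minimal setup sketch, assuming the common MPI_THREAD_FUNNELED model (OpenMP threads inside each rank, MPI calls only from the master thread); this is an illustration, not COSMO’s actual initialization:

program hybrid_init_sketch
  use mpi
  implicit none
  integer :: provided, rank, nranks, ierr

  ! Request FUNNELED thread support: only the master thread of each rank
  ! makes MPI calls, which is enough for loop-level OpenMP.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  if (rank == 0 .and. provided < MPI_THREAD_FUNNELED) then
    print *, 'Warning: MPI library provides insufficient thread support'
  end if

  ! ... OpenMP-parallel loops over the rank-local subdomain go here ...

  call MPI_Finalize(ierr)
end program hybrid_init_sketch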
Hybrid Motivation
• NUMA = Non-Uniform Memory Access
• How the node is often viewed vs. the NUMA reality [diagram]
Hybrid Pros / Cons
• Pros
• Eliminates domain decomposition within the node
• Automatic memory coherency within the node
• Lower (memory) latency and faster data movement
within node
• Can synchronize on memory instead of barrier
• Easier on-node load balancing
• Cons
• Benefit for memory bound codes questionable
• Can be hard to maintain
Hybrid: First Results
• OpenMP on loop level (> 600 directives); an illustrative directive is sketched below
[Chart: measured speedup compared against linear speedup]
Source: Matt Cordery, CSCS
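For reference, a loop-level OpenMP directive applied to a stencil loop like the one shown earlier looks roughly like this (a minimal sketch; the actual COSMO directives and loop bounds may differ):

program omp_loop_sketch
  implicit none
  integer, parameter :: ie = 102, je = 102, ke = 60
  real, parameter :: w1 = 0.25, w2 = 0.5, w3 = 0.25
  real, allocatable :: a(:,:,:), b(:,:,:)
  integer :: i, j, k

  allocate(a(ie,je,ke), b(ie,je,ke))
  b = 1.0

  ! Threads of one MPI rank share the outer k loop of the rank-local subdomain.
  !$omp parallel do private(i,j,k)
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        a(i,j,k) = w1*b(i+1,j,k) + w2*b(i,j,k) + w3*b(i-1,j,k)
      end do
    end do
  end do
  !$omp end parallel do

  print *, 'a(2,2,1) =', a(2,2,1)
end program omp_loop_sketch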
Bottlenecks?
• What are the main bottlenecks of the COSMO code on
current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
The I/O Bottleneck
• NetCDF I/O is serial and synchronous
• grib1 output is asynchronous (and probably not in an ideal
way)
• No parallel output exists! (a sketch of one possible approach follows the table below)
• Example: Operational COSMO-2 run
                  REF (s)   NO OUTPUT (s)   DIFF (s)
TOTAL             1889      1676            -212 (-11%)
MPI               571       387             -184
USER              1317      1289            -28
MPI_gather        178       1               -177
cal_conv_ind      22        0               -22
organize_output   3         0               -3
tautsp2d          1         0               -1
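For the parallel-output point above, here is a minimal sketch of what parallel output with MPI-IO can look like: every rank writes its own block of a field into one shared file with a collective call. It illustrates the technique only; the file name, field size and raw-binary layout are made up, and a real solution would still have to produce grib/NetCDF.

program pario_sketch
  use mpi
  implicit none
  integer, parameter :: nlocal = 24*16*60       ! points owned by this rank
  double precision :: buf(nlocal)
  integer :: rank, ierr, fh
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: disp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  buf = rank                                    ! dummy data standing in for a field

  ! All ranks write their contiguous block of the field into one shared file.
  call MPI_File_open(MPI_COMM_WORLD, 'field.bin', &
                     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
  disp = int(rank, MPI_OFFSET_KIND) * nlocal * 8_MPI_OFFSET_KIND
  call MPI_File_write_at_all(fh, disp, buf, nlocal, MPI_DOUBLE_PRECISION, status, ierr)
  call MPI_File_close(fh, ierr)

  call MPI_Finalize(ierr)
end program pario_sketch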
PP-POMPA
• Performance On Massively Parallel Architectures
• Goal: prepare the COSMO code for emerging massively parallel architectures
• Timeframe: 3 years (Sep. 2010 – Sep. 2013)
• Status: a draft of the project plan has been sent around; the STC has approved the project.
• Next step: kickoff meeting and detailed planning of activities with all participants.
Tasks
① Performance analysis
② Redesign memory layout
③ Improving scalability (MPI, hybrid)
④ Massively parallel I/O
   [tasks ①-④ work on the current COSMO code base]
⑤ Adapt physical parametrizations
⑥ Redesign dynamical core
⑦ Explore GPU acceleration
   [tasks ⑤-⑦ explore new code and programming models]
⑧ Update documentation
→ See project plan!
Who is POMPA?
• DWD (Ulrich Schättler, …)
• ARPA-SIMC, USAM & CASPUR (Davide Cesari, Stefano Zampini, David Palella, Piero Lancura, Alessandro Cheloni, Pier Francesco Coppola, …)
• MeteoSwiss, CSCS & SCS (Oliver Fuhrer, Will Sawyer, Thomas Schulthess, Matt Cordery, Xavier Lapillonne, Neil Stringfellow, Tobias Gysi, …)
• And you?
And you?
Questions?
Coming to a supercomputer near you soon!