Priority Project
Performance On Massively Parallel Architectures (POMPA)
Nice to meet you!
COSMO GM10, Moscow
Overview
• Motivation
• COSMO code (as seen by a computer engineer)
• Important Bottlenecks
  • Memory bandwidth
  • Scaling
  • I/O
• POMPA overview
Motivation
• What can you do with more computational power?
• Resolution (x 1.25)
• Lead time (x 2)
• # EPS members (x 2)
• Model complexity (x 2)
Motivation
• How to increase computational power?
[Diagram: computational power can be increased through the algorithm, the efficiency of the implementation, or the computer; POMPA focuses on efficiency]
Motivation
• Moore’s law has held since the 1970s and will probably continue to hold
• Up to now we didn’t need to worry too much about adapting our codes, so why should we worry now?
Current HPC Platforms
• Research system: Cray XT5 – “Rosa”
  • 3688 AMD hexa-core Opteron @ 2.4 GHz (212 TF)
  • 28.8 TB DDR2 RAM
  • 9.6 GB/s interconnect bandwidth
• Operational system: Cray XT4 – “Buin”
  • 264 AMD quad-core Opteron @ 2.6 GHz (4.6 TF)
  • 2.1 TB DDR RAM
  • 7.6 GB/s interconnect bandwidth
• Old system: Cray XT3 – “Palu”
  • 416 AMD dual-core Opteron @ 2.6 GHz (5.7 TF)
  • 0.83 TB DDR RAM
  • 7.6 GB/s interconnect bandwidth
Source: CSCS
The Thermal Wall
• Power ~ Voltage² × Frequency ~ Frequency³ (since voltage scales roughly with frequency)
• Clock frequency will not follow Moore’s Law!
Source: Intel
Moore’s Law Reinterpreted
• Number of cores doubles every year while clock speed
decreases (not increases)
Source: Wikipedia
What are transistors used for?
• AMD Opteron (single-core)
[Annotated die photo: most of the chip area goes to memory (latency avoidance), load/store/control logic (latency tolerance), and the memory and I/O interface]
Source: Advanced Micro Devices Inc.
The Memory Gap
• Memory speed only doubles every 6 years!
Source: Hennessy and Patterson, 2006
“Brutal Facts of HPC”
• Massive concurrency – increase in number of cores, stagnant or decreasing clock frequency
• Less and “slower” memory per thread – memory bandwidth per instruction/second and thread will decrease, more complex memory hierarchies
• Only slow improvements of inter-processor and inter-thread communication – interconnect bandwidth will improve only slowly
• Stagnant I/O sub-systems – technology for long-term data storage will stagnate compared to compute performance
• Resilience and fault tolerance – mean time to failure of a massively parallel system may be short compared to the time to solution of a simulation; fault-tolerant software layers are needed
We will have to adapt our codes to exploit the power of future
HPC architectures!
Source: HP2C
Why a new Priority Project?
• Efficient codes may enable new science and save money
for operations
• We need to adapt our codes to efficiently run on current /
future massively parallel architectures!
• Great opportunity to profit from the momentum and know-how generated by the HP2C or G8 projects and use synergies (e.g. ICON).
• Consistent with goals of the COSMO Science Plan and
similar activities in other consortia.
COSMO Code
• How would a computer engineer look at the COSMO code?
COSMO Code
• 227’389 lines of Fortran 90 code
[Charts: breakdown of % code lines and % runtime (COSMO-2 forecast) by component, e.g. dynamics]
Key Algorithmic Motifs
• Stencil computations

  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
      end do
    end do
  end do
• Tridiagonal solver (vertical, Thomas algorithm)

  do j = 1, je
    ! Modify coefficients (forward sweep)
    do k = 2, ke
      do i = 1, ie
        c(i,j,k) = 1.0 / ( b(i,j,k) - c(i,j,k-1) * a(i,j,k) )
        d(i,j,k) = ( d(i,j,k) - d(i,j,k-1) * a(i,j,k) ) * c(i,j,k)
      end do
    end do
    ! Back substitution, starting from the top level
    do i = 1, ie
      x(i,j,ke) = d(i,j,ke)
    end do
    do k = ke-1, 1, -1
      do i = 1, ie
        x(i,j,k) = d(i,j,k) - c(i,j,k) * x(i,j,k+1)
      end do
    end do
  end do
Code / Data Structures
• field(ie,je,ke,nt) – in Fortran the first index is the fastest varying (see the sketch below)
• Optimized for minimal computation (pre-calculations)
• Optimized for vector machines
• Often sweeps repeatedly over the complete grid (bad cache usage)
• A lot of copy-paste for handling different configurations (difficult to maintain)
• Metric terms and different averaging positions make the code complex
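As an aside, a minimal stand-alone sketch (invented sizes, not COSMO code) of why the field(ie,je,ke,nt) layout rewards loops with i innermost: consecutive i values are adjacent in memory, so the innermost loop walks the array with unit stride and uses each cache line fully.

  program layout_sketch
    implicit none
    integer, parameter :: ie = 520, je = 350, ke = 60, nt = 3
    real, allocatable  :: field(:,:,:,:)
    integer :: i, j, k

    allocate(field(ie,je,ke,nt))

    ! For fixed (j,k,nt) all i values are contiguous in memory, so this
    ! loop order streams through "field" with unit stride.
    do k = 1, ke
      do j = 1, je
        do i = 1, ie
          field(i,j,k,1) = real(i + j + k)
        end do
      end do
    end do

    print *, 'sum = ', sum(field(:,:,:,1))
  end program layout_sketch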
Parallelization Strategy
• How to distribute work onto O(1000) cores?
• 2D domain decomposition using MPI library calls (a minimal sketch follows below)
• Example: operational COSMO-2
  • Total: 520 x 350 x 60 gridpoints
  • Per core: 24 x 16 x 60 gridpoints
  • Exchange information with MPI (halo/comp = 0.75)
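For illustration only (this is not the actual COSMO implementation; all names are invented), a minimal Fortran sketch of how such a 2D domain decomposition can be set up with MPI: a Cartesian process grid is created and each rank looks up its four horizontal neighbours, which are the partners for the halo exchange.

  program decomp_sketch
    use mpi
    implicit none
    integer :: ierr, nprocs, myrank, comm2d
    integer :: dims(2), coords(2)
    integer :: west, east, south, north
    logical :: periods(2)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! Let MPI pick a balanced 2D process grid for the horizontal plane
    dims = 0
    call MPI_Dims_create(nprocs, 2, dims, ierr)

    ! Non-periodic Cartesian communicator; each rank owns one subdomain
    periods = .false.
    call MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)
    call MPI_Comm_rank(comm2d, myrank, ierr)
    call MPI_Cart_coords(comm2d, myrank, 2, coords, ierr)

    ! Neighbours in the i- and j-direction: the targets of the halo exchange
    call MPI_Cart_shift(comm2d, 0, 1, west, east, ierr)
    call MPI_Cart_shift(comm2d, 1, 1, south, north, ierr)

    call MPI_Finalize(ierr)
  end program decomp_sketch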
Bottlenecks?
• What are/will be the main bottlenecks of the COSMO code
on current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
Memory scaling
• Problem size 102 x 102 x 60 gridpoints (60 cores, similar to COSMO-2)
• Keep the number of cores constant, vary the number of cores per node used
[Plot: relative runtime (4 cores = 100%)]
HP2C: Feasibility Study
• Goal: Investigate how COSMO would have to be
implemented in order to reach optimal performance on
modern processors
• Tasks
• understand the code
• performance model
• prototype software
• new software design proposal
• Company: http://www.scs.ch/
• Duration: 4 months (3 months of work)
Feasibility Study: Idea
• Focus only on dynamical core (fast wave solver) as it…
• dominates profiles (30% time)
• contains the key algorithmic motifs
(stencils, tridiagonal solver)
• is of manageable size (14’000 lines)
• can be run stand-alone in a meaningful way
• correctness of prototype can be verified
Feasibility Study: Results
Prototype vs. Original
Key Ingredients
• Reduce the number of memory accesses (less precalculation)
• Change index order from (i,j,k) to (2,k,i/2,j) or (2,k,j/2,i)
  • cache efficiency in the tridiagonal solver
  • don’t load the halo into cache
• Use iterators instead of on-the-fly array position computations
• Merge loops in order to reduce the number of sweeps over the full domain (see the sketch below)
• Vectorize as much of the code as possible
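As a schematic illustration of the loop-merging point (invented fields, written in the same fragment style as the motifs above, not the prototype’s actual code): two separate sweeps over the full grid become one, so the input field is streamed through the cache once instead of twice.

  ! before: two full sweeps, b is loaded from memory twice
  do k = 1, ke
    do j = 1, je
      do i = 1, ie
        u(i,j,k) = c1 * b(i,j,k)
      end do
    end do
  end do
  do k = 1, ke
    do j = 1, je
      do i = 1, ie
        v(i,j,k) = c2 * b(i,j,k)
      end do
    end do
  end do

  ! after: one merged sweep, each element of b is loaded only once
  do k = 1, ke
    do j = 1, je
      do i = 1, ie
        u(i,j,k) = c1 * b(i,j,k)
        v(i,j,k) = c2 * b(i,j,k)
      end do
    end do
  end do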
GPUs have O(10) higher bandwidth!
Source: Prof. Aoki, Tokyo Tech
Bottlenecks?
• What are the main bottlenecks of the COSMO code on
current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
“Weak” scaling
• Problem size 1142 x 765 x 90 gridpoints (dt = 8s), “COSMO-2”
Matt Cordery, CSCS
Strong scaling (small problem)
• Problem size 102 x 102 x 60 gridpoints (dt = 20s), “COSMO-2”
Improve Scalability?
• Several approaches can be followed...
• Improve MPI parallelization
• Hybrid parallelization (loop level)
• Hybrid parallelization (restructure code)
• ...
Hybrid Motivation
• NUMA = Non-Uniform Memory Access
• Node views... vs. reality [diagram]
Hybrid Pros / Cons
• Pros
• Eliminates domain decomposition at node
• Automatic memory coherency at node
• Lower (memory) latency and faster data movement
within node
• Can synchronize on memory instead of barrier
• Easier on-node load balancing
• Cons
• Benefit for memory bound codes questionable
• Can be hard to maintain
Hybrid: First Results
• OpenMP on loop level (> 600 directives); a minimal sketch follows below
linear speedup
Matt Cordery, CSCS
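A schematic example (not taken from COSMO) of what loop-level OpenMP looks like when applied to a stencil loop nest like the one shown earlier; the directive distributes the iterations of the outer k loop over the threads of one node.

  !$omp parallel do private(i,j,k)
  do k = 1, ke
    do j = 1, je
      do i = 2, ie-1
        a(i,j,k) = w1 * b(i+1,j,k) + w2 * b(i,j,k) + w3 * b(i-1,j,k)
      end do
    end do
  end do
  !$omp end parallel do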
Bottlenecks?
• What are the main bottlenecks of the COSMO code on
current/future massively parallel architectures?
• Memory bandwidth
• Scalability
• I/O
The I/O Bottleneck
• NetCDF I/O is serial and synchronous
• grib1 output is asynchronous (and probably not in an ideal
way)
• No parallel output exists!
• Example: Operational COSMO-2 run
                   REF (s)   NO OUTPUT (s)   DIFF (s)
TOTAL                 1889            1676   -212 (-11%)
MPI                    571             387   -184
USER                  1317            1289   -28
MPI_gather             178               1   -177
cal_conv_ind            22               0   -22
organize_output          3               0   -3
tautsp2d                 1               0   -1
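For illustration only (this is not how COSMO does, or plans to do, its output; names and sizes are invented and 4-byte reals are assumed), a minimal sketch of parallel output with MPI-IO: every rank writes its own subdomain directly to a shared file, so no MPI_gather onto a single output task is needed.

  program parallel_output_sketch
    use mpi
    implicit none
    integer, parameter :: nlocal = 24*16*60   ! gridpoints per rank (cf. COSMO-2 example above)
    real    :: buf(nlocal)
    integer :: ierr, myrank, fh
    integer(kind=MPI_OFFSET_KIND) :: offset

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

    buf = real(myrank)   ! stand-in for the local part of one output field

    ! Collective write: each rank writes its block at a rank-dependent offset
    call MPI_File_open(MPI_COMM_WORLD, 'field.dat', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)
    offset = int(myrank, MPI_OFFSET_KIND) * nlocal * 4_MPI_OFFSET_KIND   ! 4-byte reals assumed
    call MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_REAL, &
                               MPI_STATUS_IGNORE, ierr)
    call MPI_File_close(fh, ierr)

    call MPI_Finalize(ierr)
  end program parallel_output_sketch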
PP-POMPA
• Performance On Massively Parallel Architectures
• Goal: Prepare the COSMO code for emerging massively parallel architectures
• Timeframe: 3 years (Sep. 2010 – Sep. 2013)
• Status: A draft of the project plan has been sent around. The STC has approved the project.
• Next step: Kickoff meeting and detailed planning of activities with all participants.
Tasks
① Performance analysis
② Redesign memory layout
③ Improving scalability (MPI, hybrid)
④ Massively parallel I/O
⑤ Adapt physical parametrizations
⑥ Redesign dynamical core
⑦ Explore GPU acceleration
⑧ Update documentation
[Diagram: tasks span both the current COSMO code base and new code / programming models]
See project plan!
Who is POMPA?
• DWD (Ulrich Schättler, …)
• ARPA-SIMC, USAM & CASPUR (Davide Cesari, Stefano Zampini, David Palella, Piero Lancura, Alessandro Cheloni, Pier Francesco Coppola, …)
• MeteoSwiss, CSCS & SCS (Oliver Fuhrer, Will Sawyer, Thomas Schulthess, Matt Cordery, Xavier Lapillonne, Neil Stringfellow, Tobias Gysi, …)
• And you?
Questions?
Coming to a supercomputer near you soon!