Parallel programming trends
CINECA: the Italian HPC infrastructure
and its evolution in the European scenario
Giovanni Erbacci,
Supercomputing, Applications and Innovation Department, CINECA, Italy
[email protected]
www.cineca.it
Agenda
• CINECA: the Italian HPC Infrastructure
• CINECA and the European HPC Infrastructure
• Evolution: Parallel Programming Trends in Extremely Scalable Architectures
CINECA
CINECA is a non-profit consortium made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).
CINECA is the largest Italian computing centre and one of the most important worldwide.
The HPC department:
- manages the HPC infrastructure,
- provides support and HPC resources to Italian and European researchers,
- promotes technology transfer initiatives for industry.
The Story
1969: CDC 6600 (1st system for scientific computing)
1975: CDC 7600 (1st supercomputer)
1985: Cray X-MP / 4 8 (1st vector supercomputer)
1989: Cray Y-MP / 4 64
1993: Cray C-90 / 2 128
1994: Cray T3D 64 (1st parallel supercomputer)
1995: Cray T3D 128
1998: Cray T3E 256 (1st MPP supercomputer)
2002: IBM SP4 512 (1 Teraflops)
2005: IBM SP5 512
2006: IBM BCX (10 Teraflops)
2009: IBM SP6 (100 Teraflops)
2012: IBM BG/Q (> 1 Petaflops)
CINECA and Top 500
[Figure: development trend of CINECA systems in the Top 500, on a performance axis from 1 GF to 10 PF; the most recent systems exceed 2 PF]
HPC Infrastructure for Scientific computing
Logical Name:      SP6 (Sep 2009)            | BGP (Jan 2010)         | PLX (2011)
Model:             IBM P575                  | IBM BG/P               | IBM iDataPlex
Architecture:      SMP                       | MPP                    | Linux Cluster
Processor:         IBM Power6 4.7 GHz        | IBM PowerPC 0.85 GHz   | Intel Westmere EC 2.4 GHz
# of cores:        5376                      | 4096                   | 3288 + 548 GPGPU Nvidia Fermi M2070
# of nodes:        168                       | 32                     | 274
# of racks:        12                        | 1                      | 10
Total RAM:         20 TB                     | 2 TB                   | ~13 TB
Interconnection:   Qlogic Infiniband DDR 4x  | IBM 3D Torus           | Qlogic QDR 4x
Operating System:  AIX                       | SUSE                   | RedHat
Total Power:       ~800 kW                   | ~80 kW                 | ~200 kW
Peak Performance:  > 101 TFlops              | ~14 TFlops             | ~300 TFlops
Visualisation system
Visualisation and computer graphics
Virtual Theater:
- 6 BARCO SIM5 video-projectors
- Audio surround system
- Cylindrical screen 9.4 x 2.7 m, 120° angle
- Workstations + Nvidia cards
- RVN nodes on the PLX system
Storage Infrastructure
System        Available bandwidth (GB/s)   Space (TB)   Connection Technology   Disk Technology
2 x S2A9500   3.2                          140          FCP 4 Gb/s              FC
4 x S2A9500   3.2                          140          FCP 4 Gb/s              FC
6 x DCS9900   5.0                          540          FCP 8 Gb/s              SATA
4 x DCS9900   5.0                          720          FCP 4 Gb/s              SATA
3 x DCS9900   5.0                          1500         FCP 4 Gb/s              SATA
Hitachi Ds    3.2                          360          FCP 4 Gb/s              SATA
3 x SFA1000   10.0                         2200         QDR                     SATA
1 x IBM5100   3.2                          66           FCP 8 Gb/s              FC
Total space: > 5.6 PB
SP Power6 @ CINECA
- 168 compute nodes IBM p575 Power6 (4.7 GHz)
- 5376 compute cores (32 cores/node)
- 128 GB RAM/node (21 TB RAM in total)
- IB x4 DDR (double data rate) interconnect
- Peak performance: 101 TFlops
- Rmax: 76.41 TFlop/s
- Efficiency (workload): 75.83%
- No. 116 in the Top 500 (June 2011)
- 2 login nodes IBM p560
- 21 I/O + service nodes IBM p520
- 1.2 PB raw storage:
  500 TB high-performance working area
  700 TB data repository
BGP @ CINECA
Model: IBM BlueGene / P
Architecture: MPP
Processor Type: IBM PowerPC 0.85 GHz
Compute Nodes: 1024 (quad core, 4096 total)
RAM: 4 GB/compute node (4096 GB total)
Internal Network: IBM 3D Torus
OS: Linux (login nodes)
CNK (compute nodes)
Peak Performance: 14.0 TFlop/s
PLX @ CINECA
IBM dx360M3 server (compute node):
- 2 x Intel Westmere 6-core X5645 2.40 GHz processors, 12 MB cache, DDR3 1333 MHz, 80 W
- 48 GB RAM on 12 x 4 GB DDR3 1333 MHz DIMMs
- 1 x 250 GB SATA HDD
- 1 x QDR Infiniband card, 40 Gb/s
- 2 x NVIDIA M2070 (M2070Q on 10 nodes)
Peak performance: 32 TFlops (3288 cores at 2.40 GHz)
Peak performance: 565 TFlops single precision or 283 TFlops double precision (548 Nvidia M2070)
No. 54 in the Top 500 (June 2011)
Science @ CINECA
Scientific Areas:
- Chemistry
- Physics
- Life Science
- Engineering
- Astronomy
- Geophysics
- Climate
- Cultural Heritage

National Institutions:
- INFM-CNR
- SISSA
- INAF
- INSTM
- OGS
- INGV
- ICTP

Academic Institutions

Main Activities:
- Molecular Dynamics
- Material Science Simulations
- Cosmology Simulations
- Genomic Analysis
- Geophysics Simulations
- Fluid Dynamics Simulations
- Engineering Applications
- Application code development/parallelization/optimization
- Help desk and advanced user support
- Consultancy for scientific software
- Consultancy and research activities support
- Scientific visualization support
The HPC Model at CINECA
From agreements with national institutions to a national HPC agency in a European context:
- Big Science, complex problems
- Support for advanced computational science projects
- HPC support for computational sciences at national and European level
- CINECA calls for advanced national computational projects

ISCRA: Italian SuperComputing Resource Allocation (http://iscra.cineca.it)
Objective: support large-scale, computationally intensive projects that would not be possible or productive without terascale, and in the future petascale, computing.
Class A: Large Projects (> 300,000 CPU hours per project): two calls per year
Class B: Standard Projects: two calls per year
Class C: Test and Development Projects (< 40,000 CPU hours per project): continuous submission scheme; proposals reviewed 4 times per year.
ISCRA: Italian SuperComputing Resource
Allocation
- National scientific committee
- Blind national peer-review system
- Allocation procedure

Systems:
- SP6, 80 TFlops (5376 cores), No. 116 in the Top 500, June 2011
- BGP, 17 TFlops (4096 cores)
- PLX, 142 TFlops (3288 cores + 548 Nvidia M2070), No. 54 in the Top 500, June 2011

iscra.cineca.it
CINECA and Industry
CINECA provides HPC service to Industry:
– ENI (geophysics)
– BMW-Oracle (America's Cup, CFD structure)
– Arpa (weather forecast, meteoclimatology)
– Dompé (pharmaceutical)

CINECA hosts the ENI HPC system:
HP ProLiant SL390s G7, Xeon 6C X5650, Infiniband, HP Linux cluster, 15,360 cores
No. 60 in the Top 500 (June 2011): 163.43 TFlop/s peak, 131.2 TFlop/s Linpack
CINECA Summer schools
Agenda
• CINECA: the Italian HPC Infrastructure
• CINECA and the European HPC Infrastructure
• Evolution: Parallel Programming Trends in Extremely Scalable Architectures
PRACE
The European HPC Ecosystem
PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem
- Tier 0: European (PRACE)
- Tier 1: National
- Tier 2: Local

CINECA:
- represents Italy in PRACE
- is a hosting member in PRACE
- Tier-1 system: > 5% of PLX + SP6
- Tier-0 system in 2012: BG/Q, 2 PFlop/s
- involved in PRACE 1IP and 2IP
- PRACE 2IP prototype EoI

Creation of a European HPC ecosystem involving all stakeholders:
- HPC service providers on all tiers
- Scientific and industrial user communities
- The European HPC hardware and software industry
HPC-Europa 2: Providing access to HPC resources
HPC-Europa 2 (2009 – 2012, FP7-INFRASTRUCTURES-2008-1):
- a consortium of seven European HPC infrastructures
- integrated provision of advanced computational services to the European research community
- provision of transnational access to some of the most powerful HPC facilities in Europe
- opportunities to collaborate with scientists working in related fields at a relevant local research institute
http://www.hpc-europa.eu/
HPCworld
Plug-it
Europlanet
MMM@HPC
Vph-op
Verce
VMUST
HPC-Europa
DEISA
EMI
PRACE
Deep
Montblanc
EESI
EUDAT
Agenda
• CINECA: the Italian HPC Infrastructure
• CINECA and the European HPC Infrastructure
• Evolution: Parallel Programming Trends in Extremely Scalable Architectures
BG/Q in CINECA
The Power A2 core has a 64-bit instruction set (unlike the prior 32-bit PowerPC chips used in BG/L and BG/P).
The A2 core has four threads and uses in-order dispatch, execution, and completion, instead of the out-of-order execution common in many RISC processor designs.
The A2 core has 16 KB of L1 data cache and another 16 KB of L1 instruction cache.
Each core also includes a quad-pumped double-precision floating point unit: each FPU has four pipelines, which can be used to execute scalar floating point instructions, four-wide SIMD instructions, or two-wide complex arithmetic SIMD instructions.

16-core chip @ 1.6 GHz:
- a crossbar switch links the cores and L2 cache memory together
- 5D torus interconnect
HPC Evolution
Moore's law is holding, in the number of transistors:
- Transistors on an ASIC are still doubling every 18 months at constant cost
- 15 years of exponential clock rate growth have ended

Moore's law reinterpreted:
- Performance improvements are now coming from the increase in the number of cores on a processor (ASIC)
- The number of cores per chip doubles every 18 months instead of the clock rate
- 64-512 threads per node will become visible soon

From Herb Sutter <[email protected]>
Paradigm Change in HPC
[Figure: number of cores of the No. 1 system in the Top 500, June 1993 to June 2011; the count rises to several hundred thousand cores (y-axis 0 to 600,000)]
What about applications?
The next HPC systems will have on the order of 500,000 cores.
Real HPC Crisis is with Software
Supercomputer applications and software are usually much longer-lived than the hardware:
- Hardware life is typically four to five years at most
- Fortran and C are still the main programming models

Programming is stuck:
- Arguably it hasn't changed much since the '70s

Software is a major cost component of modern technologies:
- The tradition in HPC system procurement is to assume that the software is free

It's time for a change:
- Complexity is rising dramatically
- Challenges for the applications on Petaflop systems
- Improvement of existing codes will become complex and partly impossible
- The use of O(100K) cores implies a dramatic optimization effort
- New paradigms, such as support for hundreds of threads in one node, imply new parallelization strategies
- Implementing new parallel programming methods in existing large applications does not always have a promising perspective

There is a need for new community codes.
Roadmap to Exascale
(architectural trends)
What about parallel App?
In a massively parallel context, an upper limit for the scalability of
parallel applications is determined by the fraction of the overall
execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.
On 1,000,000 cores this means P = 0.999999, i.e. a serial fraction of only 0.000001.
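As a worked example (a minimal sketch, not taken from the slides), Amdahl's law gives a speedup of 1 / ((1 − P) + P/N) on N cores; the small C program below evaluates it for the figures quoted above.

#include <stdio.h>

/* Amdahl's law: speedup on N cores for a parallel fraction P */
static double amdahl_speedup(double P, double N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    double P = 0.999999;   /* parallel fraction from the slide */
    double N = 1000000.0;  /* one million cores */

    /* asymptotic limit 1/(1-P) is 1,000,000; on 10^6 cores the speedup is ~500,000 */
    printf("limit   = %.0f\n", 1.0 / (1.0 - P));
    printf("speedup = %.0f\n", amdahl_speedup(P, N));
    return 0;
}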
Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP)
• Partitioned Global Address Space Programming (PGAS) Languages
  • UPC, Coarray Fortran, Titanium
• Next Generation Programming Languages and Models
  • Chapel, X10, Fortress
• Languages and Paradigms for Hardware Accelerators
  • CUDA, OpenCL
• Hybrid: MPI + OpenMP + CUDA/OpenCL
Trends
- Scalar applications
- Vector
- MPP systems, message passing: MPI (distributed memory)
- Multi-core nodes: OpenMP (shared memory)
- Accelerators (GPGPU, FPGA): CUDA, OpenCL
- Hybrid codes
Message Passing: domain decomposition
[Figure: several nodes, each with its own CPUs and local memory, connected by an internal high-performance network; the computational domain is partitioned across the nodes]
Ghost Cells - Data exchange
[Figure: two processors holding adjacent sub-domains; updating point (i,j) needs its neighbours (i-1,j), (i+1,j), (i,j-1), (i,j+1), so ghost cells along the sub-domain boundaries are exchanged between processors at every update]
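A minimal sketch of such a ghost-cell (halo) exchange with MPI in C, assuming a 1D row decomposition of a 2D grid; the sizes NX and NYLOC and all variable names are illustrative, not taken from the slides.

#include <mpi.h>
#include <stdlib.h>

#define NX    1024   /* grid width (assumed) */
#define NYLOC 256    /* local rows per rank (assumed) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local block plus one ghost row above and one below */
    double *u = calloc((size_t)(NYLOC + 2) * NX, sizeof(double));

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* ghost-cell exchange, repeated at every update:
       first real row goes up, a ghost row arrives from below, and vice versa */
    MPI_Sendrecv(&u[1 * NX],           NX, MPI_DOUBLE, up,   0,
                 &u[(NYLOC + 1) * NX], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NYLOC * NX],       NX, MPI_DOUBLE, down, 1,
                 &u[0],                NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... stencil update on rows 1..NYLOC using the ghost rows ... */

    free(u);
    MPI_Finalize();
    return 0;
}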
Message Passing: MPI
Main Characteristics
• Library
• Coarse grain
• Inter-node parallelization (few real alternatives)
• Domain partition
• Distributed memory
• Almost all HPC parallel applications

Open Issues
• Latency
• OS jitter
• Scalability
Shared memory
[Figure: a single node in which several CPUs (Thread 0 to Thread 3) work on data held in a common shared memory]
Shared Memory: OpenMP
Main Characteristics
• Compiler directives
• Medium grain
• Intra-node parallelization (pthreads)
• Loop or iteration partition
• Shared memory
• Many HPC applications

Open Issues
• Thread creation overhead
• Memory/core affinity
• Interface with MPI
OpenMP
!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel
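The snippet above is schematic pseudo-code; a self-contained, runnable OpenMP loop in C (a sketch under assumed names, not taken from the slides) showing the same loop/iteration partition idea looks like this.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    static double a[1000000];
    double sum = 0.0;

    /* the iterations of each loop are partitioned among the threads of the node */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * i;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];

    printf("threads = %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}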
Accelerator/GPGPU
[Figure: element-wise sum of two 1D arrays, offloaded to the accelerator]
CUDA sample
void CPUCode( int* input1, int* input2, int* output, int length ) {
    for ( int i = 0; i < length; ++i ) {
        output[ i ] = input1[ i ] + input2[ i ];
    }
}

__global__ void GPUCode( int* input1, int* input2, int* output, int length ) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if ( idx < length ) {
        output[ idx ] = input1[ idx ] + input2[ idx ];
    }
}

Each thread executes one loop iteration.
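A possible host-side driver for the GPUCode kernel above (a hedged sketch: the array size N, the 256-thread block size and all variable names are assumptions, not part of the original sample).

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(int);

    /* host arrays */
    int *h_in1 = (int*)malloc(bytes), *h_in2 = (int*)malloc(bytes), *h_out = (int*)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_in1[i] = i; h_in2[i] = 2 * i; }

    /* device arrays and host-to-device copies */
    int *d_in1, *d_in2, *d_out;
    cudaMalloc((void**)&d_in1, bytes); cudaMalloc((void**)&d_in2, bytes); cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    /* one thread per element: each thread executes one loop iteration */
    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, N);

    /* copy the result back to the host */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
    free(h_in1); free(h_in2); free(h_out);
    return 0;
}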
CUDA, OpenCL
Main Characteristics
• Ad-hoc compiler
• Fine grain
• Offload parallelization (GPU)
• Single iteration parallelization
• Ad-hoc memory
• Few HPC applications

Open Issues
• Memory copy
• Standard
• Tools
• Integration with other languages
Hybrid (MPI + OpenMP + CUDA + ... + Python)
- Takes the positives of all models
- Exploits the memory hierarchy
- Many HPC applications are adopting this model
- Mainly due to developer inertia: it is hard to rewrite millions of lines of source code
Hybrid parallel programming
• Python: ensemble simulations
• MPI: domain partition
• OpenMP: external loop partition
• CUDA: inner loop iterations assigned to GPU threads

Example: Quantum ESPRESSO (http://www.qe-forge.org/)
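A minimal hybrid MPI + OpenMP sketch in C (illustrative only, not taken from Quantum ESPRESSO): MPI partitions the work across processes, while OpenMP partitions the local loop across the threads of each node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_global = 1000000;
    int n_local = n_global / size;        /* MPI: domain partition */
    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP: the local loop is split among the threads of the node */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n_local; ++i)
        local_sum += 1.0;                 /* stand-in for real work */

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("processes = %d, threads/process = %d, total = %.0f\n",
               size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}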
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
Conclusion
Parallel programming trends in extremely scalable architectures
• Exploit millions of ALUs
• Hybrid hardware
• Hybrid codes
• Memory hierarchy
• Flops/Watt (more than Flops/sec)
• I/O subsystem
• Non-volatile memory
• Fault tolerance!