Exascale: challenges and opportunities in a
power constrained world
Carlo Cavazzoni – [email protected]
SuperComputing Applications and Innovation Department
Outline
- Cineca Roadmap
- Exascale challenge
- Applications challenge
- Energy efficiency
GALILEO
Name: Galileo
Model: IBM/Lenovo NeXtScale
x86 based system for the production of medium-scalability applications
Processor type: Intel Xeon Haswell @ 2.4 GHz
Computing Nodes: 516
Each node: 16 cores, 128 GB of RAM
Computing Cores: 8,256
RAM: 66 TByte
Internal Network: Infiniband 4xQDR switches (40 Gb/s)
Accelerators: 768 Intel Phi 7120p (2 per node on 384 nodes) + 80 Nvidia K80
Peak Performance: 1.5 PFlops
• National and PRACE Tier-1 calls
PICO
Storage and processing of large volumes of data
Model: IBM NeXtScale IB linux cluster
Processor type: Intel Xeon E5 2670 v2 @ 2.5 GHz
Computing Nodes: 80
Each node: 20 cores/node, 128 GB of RAM
2 Visualization nodes
2 Big Mem nodes
4 data mover nodes
Storage
50TByte of SSD
5PByte on-line repository
(same fabric as the cluster)
16PByte of tapes
Services
Hadoop & PBS
OpenStack cloud
NGS pipelines
Workflows (weather/sea forecast)
Analytics
High-throughput workloads
Cineca Road-map
Today:
- Tier0: Fermi (BGQ)
- Tier1: Galileo
- BigData: Pico
Q1 2016:
- Tier0: new system (procurement ongoing, HPC Top10)
- BigData: Galileo/Pico
Q1 2019:
- Tier0 + BigData: 50 PFlops, 50 PByte
Dennard scaling law (downscaling)
Core frequency and performance no longer grow following Moore's law.
Old VLSI generation -> new VLSI generation, while Dennard scaling held:
L' = L/2, V' = V/2, F' = 2F, D' = 1/L'^2 = 4D, P' = P
This does not hold anymore; today:
L' = L/2, V' ≈ V, F' ≈ 2F, D' = 1/L'^2 = 4D, P' = 4P
Increase the number of cores to keep the architecture evolution on Moore's law.
The power crisis!
Programming crisis!
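As a check on the numbers above, a worked form of the argument: writing the dynamic power of a chip as P ≈ N·C·V²·F (N transistors of capacitance C switching at frequency F and supply voltage V, leakage neglected), the two regimes give:

```latex
% Dynamic power: P \approx N C V^2 F   (leakage neglected)
\begin{align*}
\text{Dennard scaling holds:}\quad
  & N' = 4N,\ C' = C/2,\ V' = V/2,\ F' = 2F\\
  & P' \approx 4N \cdot \tfrac{C}{2} \cdot \tfrac{V^2}{4} \cdot 2F = P\\[6pt]
\text{Voltage no longer scales:}\quad
  & N' = 4N,\ C' = C/2,\ V' \approx V,\ F' \approx 2F\\
  & P' \approx 4N \cdot \tfrac{C}{2} \cdot V^2 \cdot 2F = 4P
\end{align*}
```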
Moore's Law
The number of transistors per chip doubles every 18 months.
In truth, it doubles every 24 months.
Uh-oh, Houston!
The silicon lattice
Si lattice: 0.54 nm, 50 atoms!
There are still 4-6 cycles (or technology generations) left until we reach the 11-5.5 nm technologies, at which point we hit the downscaling limit, sometime between 2020 and 2030 (H. Iwai, IWJT 2008).
Amdahl's law
The upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
The maximum speedup tends to 1/(1-P), where P is the parallel fraction.
At 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of 0.000001 (worked out in the sketch below).
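A minimal sketch of the arithmetic on this slide: the speedup predicted by Amdahl's law, S(N) = 1/((1-P) + P/N), evaluated for the slide's values P = 0.999999 and up to N = 1,000,000 cores.

```c
#include <stdio.h>

/* Amdahl's law: S(N) = 1 / ((1 - P) + P / N)
 * P = parallel fraction, N = number of cores.
 * Values from the slide: P = 0.999999, N up to 1e6. */
int main(void)
{
    const double P = 0.999999;       /* parallel fraction */
    const double serial = 1.0 - P;   /* serial fraction = 1e-6 */

    for (long n = 1; n <= 1000000; n *= 10) {
        double speedup = 1.0 / (serial + P / (double)n);
        printf("cores = %8ld   speedup = %12.1f\n", n, speedup);
    }
    /* asymptotic upper bound: 1 / (1 - P) = 1,000,000 */
    printf("upper bound (N -> infinity) = %.0f\n", 1.0 / serial);
    return 0;
}
```

Even at 10^6 cores the achieved speedup is only about half the asymptotic bound, which is why a serial fraction of 1/10^9 is quoted later for exascale.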
HPC trends (constrained by the three laws)
Opportunity:
- Peak performance: exaflops (Moore's law)
- FPU performance: gigaflops (Dennard's law)
- Number of FPUs: 10^9 (Moore + Dennard)
Challenge:
- Application parallelism: serial fraction 1/10^9 (Amdahl's law)
Energy trends
"Traditional" RISC and CISC chips are designed for maximum performance on all possible workloads.
A lot of silicon is spent to maximize single-thread performance.
(Chart: Energy, Datacenter Capacity, Compute Power)
Change of paradigm
New chips are designed for maximum performance on a small set of workloads.
Simple functional units, poor single-thread performance, but maximum throughput.
(Chart: Energy, Datacenter Capacity, Compute Power)
Architecture toward exascale
- GPU/MIC/FPGA: the CPU provides single-thread performance, the accelerator (ACC.) provides throughput; the CPU-ACC. link is the bottleneck
- Photonics -> platform flexibility
- TSV -> 3D stacking, active memory
- CPU + ACC.: OpenPower + Nvidia GPU; AMD APU
- SoC: KNL, ARM
Exascale architecture: two models
- Hybrid: CPU + ACC. (Nvidia GPU, AMD APU)
- Homogeneous: SoC (ARM, Intel)
Accelerator/GPGPU
Sum of a 1D array (see the sketch below)
Intel Vector Units
Next to come: AVX-512, up to 16 Multiply-Add operations per clock
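A minimal sketch of the "sum of a 1D array" example: a reduction loop written so the compiler can vectorize it (with AVX-512, 16 single-precision lanes per instruction). The OpenMP SIMD pragma is one portable way to request this.

```c
#include <stddef.h>
#include <stdio.h>

/* Sum of a 1D array: a simple reduction that vector units can chew through. */
static float array_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   /* requires OpenMP 4 SIMD support */
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N];
    for (size_t i = 0; i < N; ++i)
        x[i] = 1.0f;
    printf("sum = %.1f\n", array_sum(x, N));   /* expect 1048576.0 */
    return 0;
}
```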
Applications Challenges
- Programming model
- Scalability
- I/O, Resiliency/Fault tolerance
- Numerical stability
- Algorithms
- Energy Aware/Efficiency
H2020
MaX Center of Excellence
www.quantum-espresso.org
Scalability
The case of Quantum ESPRESSO
QE parallelization hierarchy: OK for petascale, not enough for exascale
CNT10POR8 - CP on BGQ
Ab-initio simulations -> numerical solution of the quantum mechanical equations
(Chart: seconds per step of the main CP kernels (calphi, dforce, rhoofr, updatc, ortho), from 2048 to 32768 real cores (4096 to 65536 virtual cores) and 1 to 16 band groups.)
QE evolution
- New algorithms: CG vs Davidson, communication avoiding
- High throughput / ensemble simulations
- Coupled applications: DSL, LAMMPS + QE
- Task-level parallelism, double buffering, multiple QE instances (see the sketch below)
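A hypothetical sketch of the double-buffering idea mentioned above (not QE's actual code): while one buffer is being computed on, the next block of data is already being received into the other buffer, overlapping communication with work.

```c
#include <mpi.h>
#include <string.h>

#define N 1024

static void compute(double *buf, int n)
{
    for (int i = 0; i < n; ++i)
        buf[i] *= 2.0;                        /* placeholder for real work */
}

int main(int argc, char **argv)
{
    int rank, size;
    double buf[2][N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof buf);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int cur = 0, next = 1;

    for (int step = 0; step < 4; ++step) {
        MPI_Request rreq;
        /* prefetch the next block from the left neighbour ... */
        MPI_Irecv(buf[next], N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &rreq);
        /* ... while computing on the block already in hand */
        compute(buf[cur], N);
        MPI_Send(buf[cur], N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        cur ^= 1; next ^= 1;                  /* swap the two buffers */
    }

    MPI_Finalize();
    return 0;
}
```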
• Reliability
• Completeness
• Robustness
• Standard Interface
Multi-level parallelism
- Workload management: system level, high throughput
- Python: ensemble simulations, workflows
- MPI: domain partition
- OpenMP: node-level shared memory (MPI + OpenMP are combined in the sketch below)
- CUDA/OpenCL/OpenACC/OpenMP4: floating point accelerators
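A minimal sketch of how two of these levels nest, assuming nothing about the application: MPI ranks handle the domain partition across nodes, OpenMP threads share memory within a node; accelerator offload (CUDA/OpenCL/OpenACC/OpenMP4) would add a further level below the threads.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid MPI + OpenMP skeleton: one MPI rank per node (or socket),
 * OpenMP threads inside each rank. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        printf("MPI rank %d/%d, OpenMP thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```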
QE (Al2O3 small benchmark)
Energy to solution – as a function of the clock
Conclusions
• Exascale systems will be there
• Power is the main architectural constraint
• Exascale applications?
• Yes, but…
• Concurrency, fault tolerance, I/O …
• Energy awareness
Backup slides
I/O Challenges
Today: 100 clients, 1000 cores per client; 3 PByte, 3K disks; 100 GByte/sec; 8 MByte blocks; parallel filesystem; one-tier architecture.
Tomorrow: 10K clients, 100K cores per client; 1 Exabyte, 100K disks; 100 TByte/sec; 1 GByte blocks; parallel filesystem; multi-tier architecture.
The I/O subsystems of high performance computers are still deployed using spinning disks, with their mechanical limitations (the spinning speed cannot grow above a certain regime, above which vibrations cannot be controlled), and, like DRAM, they consume energy even when their state does not change. Solid state technology appears to be a possible alternative, but its cost does not allow data storage systems of the same size. Hierarchical solutions can probably exploit both technologies, but this does not solve the problem of spinning disks spinning for nothing.
Storage I/O
• The I/O subsystem is not keeping pace with the CPU
• Checkpointing will not be possible
• Reduce I/O
• On-the-fly analysis and statistics
• Disk only for archiving
• Scratch on non-volatile memory ("close to RAM")
Today
(Diagram: compute cores -> I/O clients -> switches -> I/O servers -> RAID controllers -> disks)
160K cores, 96 I/O clients, 24 I/O servers, 3 RAID controllers
IMPORTANT: the I/O subsystem has its own parallelism!
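One way an application exploits that parallelism is through MPI-IO: a minimal sketch, assuming an illustrative file name and block size, in which every rank writes its own block of a shared file in a single collective call so that I/O clients, servers and RAID controllers can work concurrently.

```c
#include <mpi.h>

#define BLOCK 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[BLOCK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < BLOCK; ++i)
        buf[i] = (double)rank;                   /* rank-local data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);    /* collective write */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```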
Today-Tomorrow
(Diagram: compute cores -> I/O clients -> switches -> I/O servers -> RAID controllers; Tier-1: FLASH, Tier-2: disks)
1M cores, 1000 I/O clients, 100 I/O servers, 10 RAID FLASH/DISK controllers
Tomorrow
(Diagram: compute cores with node-local NVRAM -> I/O clients -> switches -> I/O servers -> RAID controllers; Tier-1: NVRAM (byte addressable?), Tier-2: FLASH, Tier-2/Tier-3: disks (block device))
1G cores, 10K NVRAM nodes, 1000 I/O clients, 100 I/O servers, 10 RAID controllers
Resiliency/Fault tolerance
Checkpoint / restart (sketched below)
Warm start vs cold start
Non-volatile memory -> data pooling
(Diagram: workflows Wf 1-4 running on Nodes 1-4, with their data pooled across nodes in Pool 1 and Pool 2)
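A minimal checkpoint/restart sketch of the warm-start vs cold-start distinction (file name and state layout are hypothetical): the application periodically saves its state; on startup it warm-starts from the last checkpoint if one exists, otherwise it cold-starts from scratch.

```c
#include <stdio.h>

#define CKPT_FILE "state.ckpt"   /* hypothetical checkpoint file */

typedef struct { long step; double value; } state_t;

static void checkpoint(const state_t *s)
{
    FILE *f = fopen(CKPT_FILE, "wb");
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
}

static int restart(state_t *s)
{
    FILE *f = fopen(CKPT_FILE, "rb");
    if (!f) return 0;                            /* cold start */
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok;                                   /* warm start if ok */
}

int main(void)
{
    state_t s = { 0, 0.0 };
    if (restart(&s))
        printf("warm start from step %ld\n", s.step);
    else
        printf("cold start\n");

    for (long i = 0; i < 10; ++i) {
        s.step++; s.value += 1.0;                /* do some work */
        if (s.step % 5 == 0) checkpoint(&s);     /* periodic checkpoint */
    }
    return 0;
}
```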
Energy Aware/Efficiency
EURORA
PRACE prototype experience
Address today's HPC constraints:
- Flops/Watt, Flops/m2, Flops/Dollar
Efficient cooling technology:
- hot water cooling (free cooling);
- measure power efficiency, evaluate PUE & TCO
Improve application performance:
- at the same rate as in the past (~Moore's law);
- new programming models
Evaluate hybrid (accelerated) technology:
- Intel Xeon Phi;
- NVIDIA Kepler
Custom interconnection technology:
- 3D Torus network (FPGA);
- evaluation of accelerator-to-accelerator communications
3,200 MOPS/W – 30 kW
64 compute cards
128 Xeon SandyBridge (2.1 GHz, 95 W and 3.1 GHz, 150 W)
16 GByte DDR3 1600 MHz per node
160 GByte SSD per node
1 FPGA (Altera Stratix V) per node
IB QDR interconnect
3D Torus interconnect
128 accelerator cards (NVIDIA K20 and Intel PHI)
#1 in The Green500 List, June 2013
Monitoring Infrastructure
Data collection "front-end":
- powerDAM (LRZ): monitoring, energy accounting
- Matlab: modelling and feature extraction
Data collection "back-end":
- Node stats (Intel CPUs, Intel MIC, NVidia GPUs): 12-20 ms overhead, update every 5 s
- Rack stats (Power Distribution Unit)
- Room stats (cooling and power supply)
- Job stats (PBS): accounting
What we collect
PMU (per core):
- CPI
- HW frequency
- Load
- Temperature
RAPL (per CPU):
- Pcores
- Ppackage
- Pdram
MIC (Ganglia):
- Temperature (VRs, 3 card sensors, die & GDDR)
- 7 power measurement points
- Core & GDDR frequency
- Core & memory load
- PCIe link bandwidth
GPU (NVML API, see the sketch below):
- Die temperature
- Die power
- GPU load
- MEM load
- GPU frequency
- SMEM frequency
- MEM frequency
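A minimal sketch of how the GPU metrics above can be read through NVML (link with -lnvidia-ml); error handling is kept to the bare minimum and the exact set of supported queries depends on the GPU.

```c
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int temp, power_mw, sm_clock, mem_clock;
    nvmlUtilization_t util;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);

    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp); /* die temp, C  */
    nvmlDeviceGetPowerUsage(dev, &power_mw);                    /* die power, mW */
    nvmlDeviceGetUtilizationRates(dev, &util);                  /* GPU & MEM load, % */
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock);      /* SM freq, MHz  */
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mem_clock);    /* MEM freq, MHz */

    printf("T=%u C  P=%.1f W  gpu=%u%%  mem=%u%%  sm=%u MHz  mem=%u MHz\n",
           temp, power_mw / 1000.0, util.gpu, util.memory, sm_clock, mem_clock);

    nvmlShutdown();
    return 0;
}
```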
Eurora At Work
On-demand governor: per core, a different frequency per core (see the sketch below).
(Plot: average frequency over 24h by nodeID and CoreID, showing 3.1 GHz CPUs, 2.1 GHz CPUs, turbo boost (HW driven), inactive nodes, and nodes where the other CPU is inactive.)
Using the on-demand governor to track the workload.
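On Linux the per-core governor and frequency are exposed through the cpufreq sysfs interface; a minimal sketch reading them for core 0 (paths as commonly exposed; they may differ or require permissions on a given system).

```c
#include <stdio.h>

/* Read one line from a sysfs file into buf; empty string on failure. */
static void read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    buf[0] = '\0';
    if (f) {
        if (fgets(buf, (int)len, f) == NULL)
            buf[0] = '\0';
        fclose(f);
    }
}

int main(void)
{
    char governor[64], freq[64];
    read_line("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
              governor, sizeof governor);
    read_line("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq",
              freq, sizeof freq);
    printf("cpu0 governor: %s", governor);   /* e.g. "ondemand" */
    printf("cpu0 freq (kHz): %s", freq);
    return 0;
}
```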
Eurora At Work
CPI (clocks per instruction) = non-idle clock periods / instructions executed
(Plot: CPI by nodeID and CoreID)
Application
Benchmarks
Quantum ESPRESSO Energy to Solution (PHI)
Time-to-solution (right) and energy-to-solution (left) compared between the Xeon Phi and CPU-only versions of QE on a single node.
Quantum ESPRESSO Energy to Solution (K20)
Time-to-solution (right) and energy-to-solution (left) compared between the GPU and CPU-only versions of QE on a single node.