
Evoluzione delle CPU in relazione all'efficienza energetica
(Evolution of CPUs in relation to energy efficiency)
Michele Michelotto
Real World Problems Taking Us Beyond Petascale

[Chart: projected Top500 performance, 1993-2029, for the #1 system and the sum of the list, from 100 MFlops up to 1 ZFlops; what we can just model today fits in <100 TFlops]

Compute needed for example problems:
• Aerodynamic analysis: 1 PetaFlops
• Laser optics: 10 PetaFlops
• Molecular dynamics in biology: 20 PetaFlops
• Aerodynamic design: 1 ExaFlops
• Computational cosmology: 10 ExaFlops
• Turbulence in physics: 100 ExaFlops
• Computational chemistry: 1 ZettaFlops

Example real-world challenges: full modeling of an aircraft in all conditions, "green" airplanes, genetically tailored medicine, understanding the origin of the universe, synthetic fuels everywhere, accurate extreme weather prediction.

Source: Dr. Steve Chen, "The Growing HPC Momentum in China", June 30th, 2006, Dresden, Germany

Moore's Law and High Performance Computing

[Chart: relative performance (GFlops as the base) and relative transistor performance, 1986-2016, marking the Giga, Tera, Peta and Exa milestones; annotated growth factors of 4,000X and 2.5M X in relative performance against 36X and 250X in transistor performance, with a further ~500X vs. ~30X projected toward Exa]

• Tera: ASCI Red, 9,298 processors
• Peta: today's COTS, 11.5K processors assuming 2.7 GHz

From Peta to Exa: 2X transistor performance, requiring ~30K cores @ 2800 SPI2K.

Source: Intel Labs

A look at CERN's Computing Growth

[Chart: CERN tape space and disk space (PetaBytes) and computing capacity, 2007-2013, growing toward ~120 PB of tape; the computing point corresponds to 21,500 cores @ 1400 SI2K per core; photo of the CERN tape library]

Lots of computing (45% CAGR), lots of data; no upper boundary!

Source: CERN, Sverre Jarp

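As a quick sense of scale (my arithmetic, not on the slide): a 45% compound annual growth rate roughly doubles capacity every two years and gives nearly an order of magnitude over the six years shown,
$$1.45^{2} \approx 2.1, \qquad 1.45^{6} \approx 9.3 .$$
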
Reach Exascale by 2018

From GigaFlops to ExaFlops: GigaFlops ~1987, TeraFlops ~1997, PetaFlops 2008, ExaFlops ~2018.
Note: numbers are based on the Linpack benchmark; dates are approximate.

"The pursuit of each milestone has led to important breakthroughs in science and engineering."

Source: IDC, "In Pursuit of Petascale Computing: Initiatives Around the World," 2007

What is Preventing Us?
Power is Gating Every Part of Computing

An ExaFLOPS machine (2015-18) without power management:
• Compute: 70 MW (170K chips @ ~400 W each)
• Memory: 80 MW (0.1 B/FLOP @ 1.5 nJ per Byte)
• Comm: 70 MW (100 pJ of communication per FLOP)
• Disk: 10 MW (10 EB of disk @ 10 TB/disk @ 10 W)
• Other misc. power consumption (power supply losses, cooling, etc.): 100+ MW?

Voltage is not scaling as in the past.

[Chart: power (kW) of each milestone machine, from MFLOP (1964), GFLOP (1985), TFLOP (1997) and PFLOP (2008) to EFLOP (2015-18), with the EFLOP power still an open question]

The Challenge of Exascale
Source: Intel, for illustration and assumptions, not product representative

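The component budgets follow from the stated assumptions; a rough check of the arithmetic (mine, not Intel's) for a 10^18 FLOP/s machine:
$$\begin{aligned}
\text{Compute: } & 1.7\times10^{5}\ \text{chips} \times 400\ \text{W} \approx 68\ \text{MW},\\
\text{Disk: } & \tfrac{10\ \text{EB}}{10\ \text{TB/disk}} = 10^{6}\ \text{disks} \times 10\ \text{W} = 10\ \text{MW},\\
\text{Comm: } & 10^{18}\ \tfrac{\text{FLOP}}{\text{s}} \times 100\ \tfrac{\text{pJ}}{\text{FLOP}} = 100\ \text{MW},\\
\text{Memory: } & 10^{18}\ \tfrac{\text{FLOP}}{\text{s}} \times 0.1\ \tfrac{\text{B}}{\text{FLOP}} \times 1.5\ \tfrac{\text{nJ}}{\text{B}} = 150\ \text{MW}.
\end{aligned}$$
The last two come out somewhat above the 70 MW and 80 MW quoted on the slide, so the slide evidently applies further derating; the order of magnitude is the point.
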
HPC Platform Power

[Pie chart of platform power breakdown: CPUs 31%, Memory 26%, Planar & VRs 22%, PSUs 11%, with Fans, HDD, PCI+GFX and Peripherals sharing the remaining ~10% (5%, 3%, 2%, 1%)]

Data from P3 Jet Power Calculator, V2.0: DP 80 W Nehalem; memory 48 GB (12 x 4 GB DIMMs); single power supply unit @ 230 Vac.

Need a platform view of power consumption: CPU, memory, VRs, etc.

Exponential Power and Computing Growth

[Charts: relative energy per operation, 1986-2016, falling from the 5 V era through Vcc scaling across the Giga, Tera, Peta and Exa milestones; and relative performance and power (GFlops as the base), with annotated factors of 1M X, 4,000X and 80X]

Power at a glance (assuming 31% of system power goes to the CPUs):
• Today's Peta: 0.7-2 nJ/op
• Today's COTS: 2 nJ/op (assume 100 W / 50 GFlops)
• Unmanaged Exa: if 1 GW, 0.31 nJ/op

Unmanaged growth in power will reach the GigaWatt level at Exascale.

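The nJ/op figures are straightforward to reproduce (my arithmetic, on the slide's assumptions):
$$\text{COTS: } \frac{100\ \text{W}}{50\times10^{9}\ \text{op/s}} = 2\ \text{nJ/op}, \qquad
\text{unmanaged Exa: } \frac{0.31\times10^{9}\ \text{W}}{10^{18}\ \text{op/s}} = 0.31\ \text{nJ/op},$$
where the 0.31 GW is the 31% CPU share of a 1 GW machine.
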
To Reach ExaFlops

[Chart: Flops, 1.E+06 to 1.E+15 (Giga, Tera, Peta), versus year, 1985-2020, tracking the 386 and 486, the Pentium, Pentium II, Pentium III and Pentium 4 architectures, the Intel Core uArch, and a future projection]

Power goal = 200 W/socket; to reach Linpack ExaFlops:
• 5 pJ/op/socket * 40 TFlops: 25K sockets at peak, or 33K sustained, or
• 10 pJ/op/socket * 20 TFlops: 50K sockets at peak (conservative)

Source: Intel
Intel estimates of future trends. Intel estimates are based in part on historical capability of Intel products and projections for capability improvement. Actual capability of Intel products will vary based on actual product configurations.

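The socket counts follow from dividing the target by the per-socket rate (my arithmetic): a 200 W socket at 5 pJ/op sustains
$$\frac{200\ \text{W}}{5\ \text{pJ/op}} = 4\times10^{13}\ \text{op/s} = 40\ \text{TFlops}, \qquad
\frac{10^{18}\ \text{FLOP/s}}{4\times10^{13}\ \text{FLOP/s}} = 25{,}000\ \text{sockets},$$
while at 10 pJ/op the same power buys only 20 TFlops per socket, hence 50K sockets.
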
Parallelism for Energy Efficient Performance

[Chart: relative performance, 1970-2020, from the 8086, 286, 386 and 486 through superscalar, speculative/OOO, multi-threaded and multi-core designs, with a future projection to many-core; eras labelled "Era of Pipelined Architecture", "Era of Instruction Level Parallelism" and "Era of Thread & Processor Level Parallelism"]

Intel estimates of future trends. Intel estimates are based in part on historical capability of Intel products and projections for capability improvement. Actual capability of Intel products will vary based on actual product configurations.

Reduce Memory and Communication Power

• Chip to memory: ~1.5 nJ per Byte (the revised vision below assumes ~300 pJ per Byte)
• Core-to-core: ~10 pJ per Byte
• Chip to chip: ~100 pJ per Byte

Data movement is expensive.

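At exascale this dominates the budget; a quick check (my arithmetic, with the 0.1 B/FLOP assumption used elsewhere in the talk):
$$10^{18}\ \tfrac{\text{FLOP}}{\text{s}} \times 0.1\ \tfrac{\text{B}}{\text{FLOP}} = 10^{17}\ \tfrac{\text{B}}{\text{s}}, \qquad
10^{17}\ \tfrac{\text{B}}{\text{s}} \times 1.5\ \tfrac{\text{nJ}}{\text{B}} = 150\ \text{MW}
\quad \text{vs.} \quad
10^{17}\ \tfrac{\text{B}}{\text{s}} \times 300\ \tfrac{\text{pJ}}{\text{B}} = 30\ \text{MW},$$
which is why cutting the per-byte energy of off-chip traffic is central to the revised exascale vision.
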
Solid State Drive: Future Performance and Energy Efficiency

Assume the capacity of an SSD grows at a CAGR of about 1.5 (historical HDD: 1.6).

[Chart: SSD GigaBytes, 2008-2018, future projection growing from ~50-100 GB toward ~5000 GB]

Vision, 10 ExaBytes at 2018:
• 2 million SSDs vs. ½ million HDDs
• If @ 2.5 W each, total 5 MW
• If HDD (300 IOPS) and SSD (10k IOPS) stay constant: SSDs give 140X the IOPS

Innovations to improve I/O: 2X less power with a 140x performance gain.

Source: Intel, calculations based on today's vision

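A quick check of those two claims (my arithmetic, using the slide's figures):
$$2\times10^{6}\ \text{SSDs} \times 2.5\ \text{W} = 5\ \text{MW} \quad (\text{vs. the 10 MW disk budget of the unmanaged machine, i.e. } 2\times \text{ less}),$$
$$\frac{2\times10^{6} \times 10^{4}\ \text{IOPS}}{5\times10^{5} \times 300\ \text{IOPS}} = \frac{2\times10^{10}}{1.5\times10^{8}} \approx 133 \approx 140\times .$$
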
Reliability, Reliability and Reliability

Density is on the rise:
• Moore's Law provides more transistors
• Many-core provides more computing
• HPC requires a super high socket count: large numbers

Reliability is an issue (computing errors):
• Silent Data Corruption (SDC)
• Detectable Uncorrectable Errors (DUE)

Simplify for reliability (simplification):
• Solid State Drives or diskless nodes
• Fewer cables by using backplanes
• Simpler node design (fewer Voltage Regulator Modules, fewer capacitors, …)

Mean Time Between Failure (MTBF) trends down: (Probably)^(large number) = Probably NOT

Revised Exascale System Power

ExaFLOPS machine without power management:
• Compute: 70 MW (170K chips @ ~400 W each)
• Memory: 80 MW (0.1 B/FLOP @ 1.5 nJ per Byte)
• Comm: 70 MW (100 pJ of communication per FLOP)
• Disk: 10 MW (10 EB of disk @ 10 TB/disk @ 10 W)
• Other misc. power consumption (power supply losses, cooling, etc.): 100+ MW?

ExaFLOPS machine, future vision:
• Compute: 8-16 MW (25K-80K chips @ ~200 W each)
• Memory: 16 MW (0.1 B/FLOP @ 300 pJ per Byte)
• Comm: 7 MW (10 pJ of communication per FLOP)
• SSD: 5 MW (10 EB of SSD @ 5 TB/SSD @ 2.5 W)
• Other misc. power consumption (power supply losses, cooling, etc.): <<100 MW

Source: Intel, for illustration and assumptions, not product representative

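Summing the slide's own columns (my arithmetic): the unmanaged machine needs about 70 + 80 + 70 + 10 = 230 MW before the 100+ MW of supply losses and cooling, while the revised vision lands at roughly (8-16) + 16 + 7 + 5 = 36-44 MW plus overhead, close to an order of magnitude less.
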
Computing in the "years zero"

• Transistors used to increase raw power
• ... which increases global power consumption
• Moore's law

The 'three walls'

While hardware continued to follow Moore's law, the perceived exponential growth of the "effective" computing power faded away on hitting three "walls":
• The memory wall
• The power wall
• The instruction level parallelism (micro-architecture) wall

A turning point was reached and a new paradigm emerged: multicore.

The 'memory wall'

[Diagram: cores 1 … n accessing main memory; a main-memory access costs 200-300 cycles]

Processor clock rates have been increasing faster than memory clock rates. Larger and faster "on chip" cache memories help alleviate the problem but do not solve it. Latency in memory access is often the major performance issue in modern software applications.

The 'power wall'

Processors consume more and more power the faster they go, and the relation is not linear:
• a 73% increase in power gives just a 13% improvement in performance
• (downclocking a processor by about 13% gives roughly half the power consumption)

Many computing centres are today limited by the total electrical power installed and the corresponding cooling/heat-extraction capacity.

How else can we increase the number of instructions per unit time? Go parallel!

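A first-order model of why this happens (my sketch, not the slide's derivation): dynamic CPU power goes roughly as
$$P \approx C\,V^{2}\,f,$$
and since the supply voltage $V$ has to scale more or less with the frequency $f$ in that regime, $P$ grows roughly as $f^{3}$. A 13% downclock alone would then give about $0.87^{3} \approx 0.66$ of the power; the "roughly half" quoted on the slide also reflects the extra voltage margin and leakage that can be shed at the lower frequency.
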
The 'architecture walls'

Longer and fatter parallel instruction pipelines were a main architectural trend in the '90s. Hardware branch prediction, hardware speculative execution, instruction reordering (a.k.a. out-of-order execution), just-in-time compilation and hardware threading are some notable examples of techniques used to boost ILP. In practice, inter-instruction data dependencies and run-time branching limit the amount of achievable ILP.

The Challenge of Parallelization

Exploit all 7 "parallel" dimensions of modern computing architectures for HPC.

Inside a core (climb the ILP wall):
1. Superscalar: fill the ports (maximize instructions per cycle)
2. Pipelined: fill the stages (avoid stalls)
3. SIMD (vector): fill the register width (exploit SSE; see the sketch after this list)

Inside a box (climb the memory wall):
4. HW threads: fill up a core (share core & caches)
5. Processor cores: fill up a processor (share low-level resources)
6. Sockets: fill up a box (share high-level resources)

LAN & WAN (climb the network wall):
7. Optimize scheduling and resource sharing on the Grid

HEP has traditionally been good (only) at the latter.

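As an illustration of dimension 3, a minimal sketch of filling the 128-bit SSE register width from C with intrinsics (the array names, sizes and the computed expression are mine, purely for illustration):

/* SIMD sketch: process four single-precision floats per instruction with SSE.
 * Compile with e.g.: gcc -O2 -std=c99 simd_sketch.c */
#include <xmmintrin.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* The scalar loop would do one multiply-add per iteration; here each
     * intrinsic works on 4 lanes at once.  loadu/storeu avoid alignment
     * requirements at a small cost. */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(_mm_mul_ps(va, vb), va);   /* c = a*b + a */
        _mm_storeu_ps(&c[i], vc);
    }

    printf("c[10] = %f\n", c[10]);   /* expect 10*20 + 10 = 210 */
    return 0;
}

Each intrinsic handles four single-precision values per instruction, a 4x width that scalar HEP loops typically leave unused (see "Where are WE?" below).
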
Where are WE?

See the talks by P. Elmer, G. Eulisse, S. Binet.

HEP code does not exploit the power of current processors:
• one instruction per cycle at best
• little or no use of vector units (SIMD)
• poor code locality
• abuse of the heap

Running N jobs on N=8 cores is still efficient, but:
• memory (and to a lesser extent CPU cycles) is wasted by not sharing
  – "static" condition and geometry data
  – I/O buffers
  – network and disk resources
• caches (memory on the CPU chip) are wasted and thrashed
  – no locality of code and data

This situation is already bad today and will only get worse on future architectures.

Code optimization

Ample opportunities for improving code performance:
• measure and analyze the performance of current LHC physics application software on multi-core architectures
• improve data and code locality (avoid thrashing the caches)
• effective use of vector instructions (improve ILP)
• exploit modern compilers' features (they do the work for you!)

See Paolo Calafiura's talk.

All this is absolutely necessary, but still not sufficient.

Event parallelism

Opportunity: the reconstruction memory footprint shows large condition data.
How to share common data between different processes?
• multi-process vs multi-threaded
• read-only: copy-on-write, shared libraries
• read-write: shared memory, sockets, files

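A minimal sketch of the read-only, copy-on-write option (a toy, not any experiment's framework; names and sizes are illustrative): condition data allocated and filled before fork() stays physically shared among the workers for as long as nobody writes to it.

/* fork()-based event parallelism with copy-on-write sharing of read-only data */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define N_WORKERS 4
#define COND_SIZE (64 * 1024 * 1024)   /* pretend 64 MB of condition data */

int main(void)
{
    /* Load condition/geometry data once, before forking: the kernel keeps
     * these pages shared with the children until someone writes to them. */
    double *conditions = malloc(COND_SIZE);
    if (!conditions) return 1;
    for (size_t i = 0; i < COND_SIZE / sizeof(double); i++)
        conditions[i] = 0.001 * i;

    for (int w = 0; w < N_WORKERS; w++) {
        if (fork() == 0) {              /* child: process its share of "events" */
            double sum = 0.0;
            for (size_t i = w; i < COND_SIZE / sizeof(double); i += N_WORKERS)
                sum += conditions[i];   /* read-only access: pages stay shared */
            printf("worker %d done (checksum %g)\n", w, sum);
            _exit(0);
        }
    }
    for (int w = 0; w < N_WORKERS; w++)
        wait(NULL);                     /* parent collects the workers */
    free(conditions);
    return 0;
}
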
Exploit "Kernel Shared Memory"

KSM is a Linux driver that allows dynamically sharing identical memory pages between one or more processes.
• It was developed as a backend of KVM to help memory sharing between virtual machines running on the same host.
• KSM scans only the memory that was registered with it. Essentially this means that each memory allocation that is a sensible candidate for sharing needs to be followed by a call to a registration function.

Test performed by "retrofitting" TCMalloc with KSM:
• just one single line of code added!

CMS reconstruction of real data (cosmics with the full detector):
• no code change
• 400 MB private data; 250 MB shared data; 130 MB shared code

ATLAS:
• no code change
• in a reconstruction job of 1.6 GB VM, up to 1 GB can be shared with KSM

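The registration function referred to here is, in the mainline kernel, madvise() with the MADV_MERGEABLE flag (my gloss; the slide does not name it). A minimal sketch of registering an allocation with KSM, with illustrative sizes, not the actual TCMalloc patch:

/* Register an anonymous mapping with KSM via madvise(MADV_MERGEABLE).
 * Requires a kernel built with CONFIG_KSM; mmap() guarantees page alignment. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 1024 * 1024;     /* a 16 MB region, for illustration */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The "one line" that makes the region a candidate for page merging: */
    if (madvise(p, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");   /* e.g. kernel without KSM */

    memset(p, 0, len);                 /* identical pages in registered regions
                                          can now be merged by the KSM daemon */
    printf("registered %zu bytes with KSM\n", len);
    munmap(p, len);
    return 0;
}
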
SSD vs HDD on an 8-Node Cluster

See Sergey Panitkin's talk. Courtesy of S. Panitkin.

Solid State Disk: 120 GB for 400 Euro.
• Aggregate (8-node farm) analysis rate as a function of the number of workers per node
• Almost linear scaling with the number of nodes

(Original slide footer: G. Ganis, Parall.-MultiCore Workshop, 15/04/2007)

Nehalem

• HyperThreading OFF: 8 logical CPUs are seen
• HyperThreading ON: 16 logical CPUs are seen
• ~30% increase (limited by the 12 GB of memory?)

How much memory for a Nehalem?

• Old rule: 2 GB per core (16 GB) or 2 GB per logical CPU (32 GB)
• But each processor has 3 memory channels, so for maximum efficiency I have to install 3 x 2 x N GB
• Hence 12, 24 or 48 GB per machine

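Spelled out (my reading of the slide, assuming a dual-socket machine with 2 GB DIMMs populated evenly across channels):
$$2\ \text{sockets} \times 3\ \text{channels} \times N\ \tfrac{\text{DIMMs}}{\text{channel}} \times 2\ \text{GB} = 12N\ \text{GB},$$
giving 12, 24 or 48 GB for N = 1, 2, 4.
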
Shanghai 40W !!
AMD roadmap change

Thanks to

Plenary talk "More Computing with Less Energy" by Dr. Steve Pawlowski (Intel)

Plenary talk "The challenge of adapting HEP physics software applications to run on many-core cpus" by Prof. Vincenzo Innocente (CERN)

Event Processing track:
"CMS Software Performance Strategies" by Dr. Peter Elmer (Princeton University)
"HEP C++ meets reality -- lessons and tips" by Mr. Giulio Eulisse (Northeastern University of Boston (MA), U.S.A.)

Questions?