
Multicore – The future of Computing
Terje Mathisen, Chief Engineer
Moore’s Law
 «The number of transistors we can put on a chip will double every two years»
– Originally from 1965, modified in 1975
– Up to around the turn of the century this meant a doubling in performance every 18 months
– Power has become the worst problem
– Bipolar transistors -> NMOS -> CMOS -> (lots of tweaks) -> 3D
– Voltage scaling
 Today, leakage current is a limiter
– Even CMOS transistors leak when they get really tiny
Moore's Law has held for 40 years
[Chart: transistors per chip, 1975-2015, log scale from 10^4 to 10^10; Haswell: 5.6e9 transistors at 22 nm]
What could we use all the transistors for?
 Increase scalar performance
 Increasingly more complicated CPUs
 Multiple cycles/instruction:
– 8088 (29K) – 80286 (134K) – 80386 (275K)
 Pipelined, one cycle/instruction:
– 80486 (1.2M)
 Superscalar: multiple instructions/cycle:
– Pentium (3.1M) (two in-order pipelines)
 Out-of-order/superscalar/multithreaded:
– Pentium Pro/Pentium III/Pentium 4/Core/etc. (5.5M -> 5.6B)
Pentium 4 had the fastest pipeline ever
 3 GHz clock
– Inner core ran at 2x, i.e. 6 GHz
– Only simple instructions, like ADD/SUB/AND/OR
 Guessing at branches
– if (a > b) {...} else {...}
 Mistakes were very costly, both in time and power
– 10 to 200 wasted instructions each time the CPU guessed wrong!
Core 2: Multiple complicated cores
 Running two individual processes in parallel causes fewer wasted instructions, leading to more power-efficient computing
– Shorter pipelines are better at branching
– Object-oriented programming uses many branches
 Every two years: double the number of cores
– Core 2 -> Core 2 Duo -> Core 2 Quad
– Latest server CPUs have up to 18 cores, using 5.6e9 transistors
Vector operations
 SIMD: work on more data in each instruction
– SSE uses 16-byte vectors (4 floats/2 doubles)
– AVX uses 32-byte vectors (8 floats/4 doubles)
 Each core can do two SSE operations/cycle
– Quad-core CPU: 4 × 2 × 4 = 32 fp operations/cycle
– 64 Gflops @ 2 GHz, 100 Gflops @ 3+ GHz
– A high-end AVX implementation doubles this; 12-18 cores add another multiplier
Other CPU architectures
• Sun Sparc
– 2005: Niagara: 8 cores, 4 threads/core, low clock speed
– Multithreaded server workloads
• Oracle Sparc M7
– 2014: 32 cores, 8 threads/core
– Optimized for DB operations
Other CPU architectures
 Sparc
– Multithreaded server workloads
 IBM/Sony Cell
– 2005: Playstation 3
– 1 PPE + 7-8 SPE cores, each capable of 25 Gflops
– Works on 16-byte vectors (4 floats/2 doubles)
– ~200 Gflops SP -> 14 Gflops DP
– Special HPC version with 100+ Gflops DP
Other CPU architectures
 Sun Sparc
 IBM/Sony Cell
 GPGPU
– Graphics cards with semi-general fp pipelines
Intel Larrabee/Many Integrated Core/Xeon Phi
 Project started 2003
– Architecture review Oct 2006
 Announced 2007
– 64-bit
– x86 compatible
 Similar to Pentium
– Dual in-order pipelines
– More flexible mixing of instructions
 Special graphics instructions, incl. scatter/gather
– S/G are very useful for HPC applications
LRB cont.
 Even longer vectors
– Works with 64-byte blocks (16 floats/8 doubles)
– Combined FMUL/FADD instruction
 More than 50 cores on first product
– 4 threads/core
– 16 × 2 × 51 = 1632 flops/cycle
– 1.3 GHz cores -> ~2 Tflops (seismic cluster is ~10 Tflops)
 First product will be a graphics coprocessor card
 Will use the same 125 watts (max) as a single P4
 New name: Many Integrated Core (MIC)/Knights Corner/Xeon Phi
Future directions
 Heterogeneous CPUs:
– Maybe 2-4 Core2 + 20-60 Larrabee?
– Run single-threaded applications on Core, multithreaded/vector-based ones on Xeon Phi (2013: the fastest computer in the world combined Ivy Bridge + Phi)
– OS threads without fp operations can also use the simple in-order LRB cores
 Power-efficient processing
– Both laptops/mobiles and servers are limited by power use
 Simpler/slower cores with mostly in-order processing can use 80% less power
Conclusion
 Multicore will give us an extra factor of ~10 increase in fp processing power
– Most current forms of simulation become possible on a single workstation with 2-4 CPUs
 MIPS/Watt is crucial
– Easier to make many simple cores than one complex one
– Less wasted work
– Server farms and laptops
What are the consequences?
 High performance requires multithreading
– Currently this is mostly server workloads
– Games are next; today they use 2-4 threads
 High performance requires vector programming
– Can we work on 4, 16 or more variables simultaneously?
 Many programs (and most programmers) don't care!
– If it is fast enough today, it will surely be OK in the future as well?
 Not necessarily, because
– Data grows exponentially!
HPC applications
 Seismic processing
– PC with a complete model of small fields, and reduced-resolution test runs for larger fields
– Deskside server with nearly the same capability as the current 2048-CPU seismic cluster
 Crash simulation
– Everything could fit on a laptop in 2012-2015
 Financial modelling, incl. Monte Carlo risk analysis
 Dynamic global process control
From current Unix cluster … to deskside workstation in 5 years?
Summary
 Multicore will give us an extra factor of ~10 increase in fp processing power
 Moore's Law will go on
 MIPS/Watt is crucial
 Evry is at the leading edge of this development
Thank you!
Do we have the required programmers?
 Will we get them from the universities in the future?
– Possibly
– Today, most graduates learn only Java, which isn't very suitable
 There's hope:
– LRB is on the NTNU CS curriculum today
– Similar situation at most universities
 Can our standard vendors deliver updated SW?
– Eclipse, GeoFrame, Sismage, Ansys, Finite Element
Smaller transistors & slightly larger chips
[Chart: transistor feature size (nm), 1975-2010, log scale from 10 to 10000 nm, with exponential regression f(x) = 1.47E+131 · 0.86^x]