High Performance Computing: An Overview


Alan Edelman
Massachusetts Institute of Technology
Applied Mathematics & Computer Science and AI Labs
(Interactive Supercomputing, Chief Science Officer)
Not said: many powerful computer owners prefer low profiles
Some historical machines
Earth Simulator was #1, now #30
Moore’s Law
• The number of people who point out that “Moore’s Law” is dead is doubling every year.
• Feb 2008: NSF requests $20M for “Science and Engineering Beyond Moore's Law”
  – Ten years out, Moore’s law itself may be dead
• Moore’s law has various forms and versions never stated by Moore, but roughly doubling every 18 months to 2 years:
  – Number of transistors (still good for a while!)
  – Computational power (at risk!)
  – Parallelism!
AMD Opteron quad-core 8350 (Sept 2007); eight cores in 2009?
Intel Clovertown and Dunnington; six cores later in 2008?
Sun Niagara 2
[Block diagram: eight MT UltraSparc cores, each with an FPU and 8K D$, connected by a crossbar switch to a 4MB shared L2 (16 way; 179 GB/s fill, 90 GB/s writethru); 4x128b FBDIMM memory controllers to Fully Buffered DRAM at 42.7GB/s (read) and 21.3 GB/s (write); 1.4 GHz. 16 cores in 2008?]
Accelerators
IBM Cell Blade
[Block diagram: two Cell chips, each with a PPE (512K L2) and eight SPEs (256K local store with MFC each) on an EIB ring network; XDR DRAM at 25.6GB/s per chip; BIF links at <20GB/s each direction.]
NVIDIA
[Block diagram: global thread scheduler over 16 SMs with 16K shared memory each; address units, 8KB L1, and const$/tex$ per group; 128KB L2 const$ & texture$ shared across SMs (crossbar?? ring??); DRAM controllers (6 x 64b) at 86.4 GB/s to 768MB GDDR3 device DRAM.]
SiCortex
• Teraflops from milliwatts
Software
“Give me software leverage and a supercomputer, and I shall solve the world’s problems.”
(apologies to Archimedes)
What’s wrong with this story?
• I can’t get my five-year-old son off my (serial) computer
• I have access to the world’s fastest machines and have nothing cool to show him!
Engineers and Scientists
(The leading indicators)
• Mostly work in serial (still!), just like my five-year-old
• Those working in parallel go to conferences and show off speedups
• Software: MPI (Message Passing Interface); a minimal sketch of the style follows this list
  – Really thought of as the only choice
  – Some say it is the assembler of parallel computing
  – Some say it has allowed code to be portable
  – Others say it has held back progress and performance
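To make the “assembler of parallel computing” remark concrete, here is a minimal sketch (not from the talk; the data and sizes are made up for illustration) of what even a trivial global sum looks like when every message and rank is spelled out by hand:

/* Minimal MPI sketch: each rank sums its own slice and the partial
 * sums are combined with an explicit reduction to rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank owns 1000 elements of a conceptual global array. */
    double local = 0.0;
    for (int i = 0; i < 1000; i++)
        local += (double)(rank * 1000 + i);

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", total);

    MPI_Finalize();
    return 0;
}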
Old Homework
(emphasized for effect)
• Download a parallel program from somewhere.
  – Make it work
• Download another parallel program
  – Now, …, make them work together!
Apples and Oranges
• A: row distributed array (or worse)
• B: column distributed array (or worse)
• C = A + B
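Even the one-line C = A + B hides a redistribution when A and B do not share a layout. A hedged sketch (not from the talk; the square matrix, block sizes, and layouts are assumptions for illustration) of the glue code it takes in MPI:

/* A is distributed by block rows, B by block columns; adding them
 * forces a redistribution first.  Assumes n divisible by the ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int n = 8 * p, nb = n / p;               /* global size, block size */

    /* A_rows: my nb rows of A (nb x n, row-major).
     * B_cols: my nb columns of B (n x nb, row-major).                  */
    double *A_rows = malloc((size_t)nb * n * sizeof *A_rows);
    double *B_cols = malloc((size_t)nb * n * sizeof *B_cols);
    for (int i = 0; i < nb * n; i++) { A_rows[i] = 1.0; B_cols[i] = 2.0; }

    /* Repack A's row blocks into column blocks so A matches B's layout.
     * The block destined for rank r: my rows, columns [r*nb, (r+1)*nb). */
    double *send   = malloc((size_t)nb * n * sizeof *send);
    double *A_cols = malloc((size_t)nb * n * sizeof *A_cols);
    for (int r = 0; r < p; r++)
        for (int i = 0; i < nb; i++)
            memcpy(send + (r * nb + i) * nb,
                   A_rows + i * n + r * nb, nb * sizeof(double));

    /* After the exchange, block r holds global rows [r*nb, (r+1)*nb) of
     * my columns, which is exactly the (n x nb) layout used for B_cols. */
    MPI_Alltoall(send, nb * nb, MPI_DOUBLE,
                 A_cols, nb * nb, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Only now is "C = A + B" a purely local operation. */
    double *C_cols = malloc((size_t)nb * n * sizeof *C_cols);
    for (int i = 0; i < nb * n; i++)
        C_cols[i] = A_cols[i] + B_cols[i];

    if (rank == 0)
        printf("C[0][0] = %f\n", C_cols[0]);  /* expect 3.0 */

    free(A_rows); free(B_cols); free(send); free(A_cols); free(C_cols);
    MPI_Finalize();
    return 0;
}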
MPI Performance vs PThreads
Professional performance study by Sam Williams
[Figure: GFlop/s (0.0 to 4.0) on AMD Opteron and Intel Clovertown for Naïve Single Thread, MPI (autotuned), and Pthreads (autotuned), across test matrices: Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Har, QCD, FEM-Ship, Econom, Epidem, FEM-Accel, Circuit, Webbase, LP, and Median.]
MPI may introduce speed bumps on current architectures
MPI-Based Libraries
• Typical sentence: “… we enjoy using parallel computing libraries such as ScaLAPACK”
• What else? “… you know, such as ScaLAPACK”
• And …? “Well, there is ScaLAPACK”
• (PETSc, SuperLU, MUMPS, Trilinos, …)
• Very few users, still many bugs, immature
• Highly optimized libraries? Yes and no
Natural Question may not be the most important
• How do I parallelize x?
  – First question many students ask
  – The answer is often either fairly obvious or very difficult
  – Can miss the true issues of high performance:
    • These days people are often good at exploiting locality for performance
    • People are not very good at hiding communication and anticipating data movement to avoid bottlenecks (see the sketch after this list)
    • People are not very good at interweaving multiple functions to make the best use of resources
  – Usually misses the issue of interoperability:
    • Will my program play nicely with your program?
    • Will my program really run on your machine?
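As an illustration of hiding communication, a hedged sketch (not from the talk; the 1D halo exchange and sizes are assumptions) that posts nonblocking messages, computes on the interior while they are in flight, and waits only when the boundary points actually need the data:

#include <mpi.h>
#include <stdio.h>

#define N 1024                        /* local interior points per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double u[N + 2], unew[N + 2];     /* u[0] and u[N+1] are ghost cells */
    for (int i = 0; i <= N + 1; i++) u[i] = rank + i * 1e-3;

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Request req[4];
    /* Post the communication first ... */
    MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* ... then do the work that does not need the ghost cells ... */
    for (int i = 2; i <= N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* ... and only wait when the boundary points need the data. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    unew[1] = 0.5 * (u[0] + u[2]);
    unew[N] = 0.5 * (u[N - 1] + u[N + 1]);

    if (rank == 0) printf("unew[1] = %f\n", unew[1]);
    MPI_Finalize();
    return 0;
}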
Real Computations have Dependencies (example: FFT)
Time wasted on the telephone
Modern Approaches
• Allow users to “wrap up” computations into nice packages, often denoted threads
• Express dependencies among threads (a sketch follows this list)
• Threads need not be bound to a processor
• Not really new at all: see Arvind, dataflow, etc.
• Industry has not yet caught up with the damage SPMD and MPI have done
• See Transactional Memories, Streaming Languages, etc.
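A hedged sketch (not from the talk) of this style using OpenMP tasks in C: dependencies are declared and the runtime schedules the tasks, rather than the programmer binding work to processors. The toy variables are made up for illustration.

#include <stdio.h>

int main(void)
{
    double a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                      /* produce a                        */

        #pragma omp task depend(out: b)
        b = 2.0;                      /* produce b, may run alongside a   */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                    /* runs only after a and b are done */

        #pragma omp taskwait          /* wait for the task graph          */
        printf("c = %f\n", c);
    }
    return 0;
}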
Advantages
• Easier on the programmer
• More productivity
• Allows for autotuning
• Can overlap communication with computation
LU Example
Software
“Give me software leverage and a supercomputer, and I shall solve the world’s problems.”
(apologies to Archimedes)
New Standards for Quality of Computation
• Associative law: (a+b)+c = a+(b+c)
• Not true in roundoff
• Mostly didn’t matter in serial
• Parallel computation reorganizes the computation
• Lawyers get very upset!
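A hedged illustration (not from the talk; the values are chosen to make the effect obvious) of why reassociating a sum, as a parallel reduction does, can change the answer in floating point:

#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;

    double left  = (a + b) + c;   /* = 0 + 1 = 1                          */
    double right = a + (b + c);   /* = 1e16 - 1e16 = 0, c is absorbed     */

    printf("(a+b)+c = %.1f\n", left);    /* prints 1.0 */
    printf("a+(b+c) = %.1f\n", right);   /* prints 0.0 */
    return 0;
}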