18.337 Introduction
News you can use
• Hardware
– Multicore chips (2009: mostly 2 and 4 cores; 2010: hexa-core and octo-core; 2011: twelve cores)
– Servers (often many multicore chips sharing memory)
– Clusters (often several to tens of servers, sometimes many more, not sharing memory)
Performance
• Single-processor speeds are, for now, no longer growing.
• Moore’s law still allows for more real estate per core (transistor counts double roughly every two years)
– http://www.intel.com/technology/mooreslaw/index.htm
• People want performance, but it is hard to get
• Slowdowns are often seen before speedups
• Flops (floating point ops / second)
– Gigaflops (10^9), Teraflops (10^12), Petaflops (10^15)
• Compare matmul with matadd. What’s the difference?
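A minimal NumPy sketch of the difference (my illustration; the slides name no language): matadd performs n^2 flops while touching about 3n^2 doubles, so memory bandwidth limits it, whereas matmul performs about 2n^3 flops on the same amount of data and can run near the peak flop rate.

    import time
    import numpy as np

    n = 2000
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    # Matrix addition: n^2 flops over ~3n^2 doubles -- memory-bound.
    t0 = time.perf_counter()
    C = A + B
    t_add = time.perf_counter() - t0

    # Matrix multiplication: ~2n^3 flops over the same ~3n^2 doubles --
    # enough reuse per element to keep the floating-point units busy.
    t0 = time.perf_counter()
    D = A @ B
    t_mul = time.perf_counter() - t0

    print(f"matadd: {n**2 / t_add / 1e9:.2f} Gflop/s")
    print(f"matmul: {2 * n**3 / t_mul / 1e9:.2f} Gflop/s")

On most machines the matmul rate comes out one to two orders of magnitude higher, which is the point of the comparison.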
Some historical machines
Earth Simulator was #1
Some interesting hardware
Nvidia
Cell Processor
SiCortex – “Teraflops from Milliwatts”
http://www.sicortex.com/products/sc648
http://www.gizmag.com/mit-cycling-human-powered-computation/8503/
Programming
• MPI: The Message Passing Interface
– A low-level, “lowest common denominator” interface that the world has stuck with for nearly 20 years
– Can deliver performance, but can also be a hindrance (a minimal sketch follows this list)
• Some say there are those who will pay for a 2x speedup, as long as it is easy
• In reality, many want at least 10x, and more, for a qualitative difference in results
• People forget that serial performance can depend on many bottlenecks, including time to memory
• Performance (and large problems) is the reason for parallel computing, but it is difficult to get the “ease of use” vs. “performance” trade-off right
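A minimal sketch of MPI’s explicit message passing, using the mpi4py Python bindings (my choice of binding; the slides do not prescribe one):

    # Run with: mpiexec -n 2 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's id, 0..size-1
    size = comm.Get_size()   # total number of processes

    if rank == 0:
        # Rank 0 explicitly sends a message to rank 1.
        comm.send("hello from rank 0", dest=1, tag=0)
    elif rank == 1:
        # Rank 1 explicitly receives it; nothing moves implicitly.
        msg = comm.recv(source=0, tag=0)
        print(f"rank 1 of {size} received: {msg}")

Every byte of communication is spelled out by the programmer, which is where both the performance and the “hindrance” come from.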
Places to Look
• Best current news:
– http://www.hpcwire.com/
• Huge Conference:
– http://sc11.supercomputing.org/
Architecture Diagrams from Sam
Williams (formerly) @ Berkeley
Bottom-up performance engineering: understanding the hardware’s implications on performance, up through the software.
Top-down: measuring software and tweaking it, sometimes aware and sometimes unaware of the hardware.
http://www.cs.berkeley.edu/~samw/research/talks/sc07.pdf
Want to delve into hard numerical
algorithms
• Examples:
– FFTs and Sparse Linear Algebra
• At the MIT level:
– A potential “not quite right” question: how do you parallelize these operations?
– Rather: what issues arise, and why is getting performance hard? (A sparse-matvec sketch follows this list.)
• Why is n×n matmul easy? Almost cliché?
• Comfort level in this class to delve in?
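As one illustration of why these kernels are hard (a sketch of mine, assuming SciPy; not from the slides): a sparse matrix-vector product does only two flops per stored nonzero and gathers the input vector through an index array, so irregular memory access, not arithmetic, sets the speed. Dense n×n matmul, by contrast, reuses each operand n times.

    import numpy as np
    import scipy.sparse as sp

    n = 1_000_000
    # Random sparse matrix with ~10 nonzeros per row, in CSR format.
    A = sp.random(n, n, density=1e-5, format="csr")
    x = np.random.rand(n)

    # y = A @ x costs 2*nnz flops, but each multiply reads x at an
    # index taken from A.indices -- a scattered, cache-unfriendly
    # access pattern dominated by memory latency and bandwidth.
    y = A @ x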
Old Homework
(emphasized for effect)
• Download a parallel program from
somewhere.
– Make it work
• Download another parallel program
– Now, …, make them work together!
SIMD
• SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector.)
– The term was coined with one element per processor in mind, but with today’s deep memories and hefty processors, large chunks of the vectors would be added on one processor.
– The term was coined with the broadcasting of an instruction in mind, hence the “single instruction,” but today’s machines are usually more flexible.
– The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or FFT is SIMD or not, but these operations can certainly be built from SIMD operations.
• Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with a data-parallel operation, though this feels wrong to me) when the software appears to run in “lock-step,” with every processor executing the same instruction.
– Usage: “I hear that machine is particularly fast when the program primarily consists of SIMD operations.”
– Graphics processors such as NVIDIA’s seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
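A minimal sketch of the vector-addition picture (my illustration in NumPy; the slides name no language): one logical add applies to every element at once, and the compiled loop underneath maps it onto the processor’s vector units.

    import numpy as np

    a = np.arange(8, dtype=np.float32)
    b = np.ones(8, dtype=np.float32)

    # One logical "add" applied across all elements. NumPy's compiled
    # loop lets the CPU's SIMD units process several floats per
    # instruction instead of one scalar at a time.
    c = a + b

    # The scalar loop that the SIMD-style operation replaces:
    c_scalar = np.empty_like(a)
    for i in range(len(a)):
        c_scalar[i] = a[i] + b[i]

    assert np.array_equal(c, c_scalar)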
SIMD summary
• SIMD (Single Instruction, Multiple Data)
refers to parallel hardware that can
execute the same instruction on multiple
data.
• One may also refer to a SIMD operation
(sometimes but not always historically
synonymous with a Data Parallel
Operation) when the software appears to
run “lock-step” with every processor
executing the same instructions.
The Cloud
• Problems with HPC systems are not what you think
• Users wrote codes that nobody could use
• Systems are hard to install
• The Interactive Supercomputing
Experience
• What the cloud could do
• What are the limitations?