Scaling Parallel Applications
Mukesh Agrawal
Introduction
Parallel systems are ccNUMA
...so is ccNUMA useful?
How much faster is it?
How can we make it faster?
How hard is it?
ccNUMA (review)
Multiple processors
Private physical memories
Shared address space
Hardware support for cache coherence
Scenario
Scientific computation problems (SPLASH-2)
Metric: efficiency = achieved speedup / number of processors
Simulation study (simulate Stanford FLASH)
Experimental study (SGI Origin 2000, 128 proc)
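A minimal sketch of the efficiency metric, assuming speedup is measured as uniprocessor run time divided by parallel run time (the slides do not state which baseline is used). The function name and the timings are hypothetical and serve only as illustration.

```python
def efficiency(t_serial: float, t_parallel: float, num_procs: int) -> float:
    """Parallel efficiency = achieved speedup / number of processors."""
    if num_procs < 1 or t_serial <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive and num_procs >= 1")
    speedup = t_serial / t_parallel  # assumed baseline: uniprocessor run time
    return speedup / num_procs


# Hypothetical timings, not measurements from the study:
# 100 s on one processor, 2.5 s on 64 processors.
print(efficiency(100.0, 2.5, 64))  # prints 0.625
```

An efficiency of 1.0 would mean perfectly linear speedup; the 60% threshold used on the following slides corresponds to a return value of 0.6.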
Efficiency and Size
What is the smallest problem instance to achieve
60% efficiency?
Why might this be a bad metric?
Assumes more efficiency for larger instances
May not happen if data is laid out poorly (cache usage)
Why might larger instances run more efficiently?
Better communication/computation ratio (nearest neighbor)
Less load imbalance (less waiting for others)
Cache capacity (many misses on uniprocessor)
Cache sharing (small problem may share lines)
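The communication/computation point can be made concrete with a rough sketch. It assumes a hypothetical n x n grid with nearest-neighbor updates, split into square blocks over p processors (p a perfect square), and it counts only an interior block that exchanges one row or column with each of its four neighbors. None of these specifics come from the slides.

```python
import math


def comm_to_comp_ratio(n: int, p: int) -> float:
    """Boundary points exchanged per grid point computed, for one interior block."""
    root = math.isqrt(p)
    if root * root != p:
        raise ValueError("this sketch assumes p is a perfect square")
    side = n / root              # edge length of one processor's block
    computation = side * side    # points updated locally each step
    communication = 4 * side     # boundary points sent to four neighbors
    return communication / computation


# Same processor count, growing problem size: the ratio shrinks.
for n in (256, 1024, 4096):
    print(n, comm_to_comp_ratio(n, 16))
```

Under these assumptions the ratio works out to 4 * sqrt(p) / n, so for a fixed number of processors a larger instance spends proportionally less of its time communicating.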
Efficiency and Size (results)
Depends on problem
Some are efficient at reasonable sizes (Barnes-Hut)
Others never efficient (Radix)
Experiments show that real hardware requires larger
instances than simulation predicts
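The "smallest problem instance to achieve 60% efficiency" question can be phrased as a small search over measurements. The sketch below is only an illustration: the function name is hypothetical and the numbers are made up, not results from the simulation or the Origin 2000 runs.

```python
from typing import Dict, Optional


def smallest_efficient_size(measured: Dict[int, float],
                            threshold: float = 0.6) -> Optional[int]:
    """Return the smallest problem size whose efficiency meets the threshold.

    Returns None when no measured size qualifies, which is the
    "never efficient" case.
    """
    qualifying = [size for size, eff in measured.items() if eff >= threshold]
    return min(qualifying) if qualifying else None


# Made-up efficiencies keyed by problem size, for illustration only.
example = {1024: 0.31, 4096: 0.55, 16384: 0.72, 65536: 0.85}
print(smallest_efficient_size(example))  # prints 16384
```

The function does not assume that efficiency rises with size; it simply reports the smallest size that qualifies, so a poorly laid out program whose efficiency dips again at larger sizes would still get an answer.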
Efficiency and Structure
Can we get higher efficiency on small instances
by modifying computation structure?
What might we try?
Reduce communication!
Algorithmic changes
Cache management (keep remote data in cache)
Static partitioning
Most programs can scale after restructuring
Bonus: changes for ccNUMA often help with
SVM (cluster) systems as well
Programming Guidelines
Partition statically; optimize for locality
Load balance should not be compromised
Separate partitions, avoid write sharing
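A minimal sketch of these guidelines, assuming a simple array reduction as the workload (the slides do not prescribe one). The helper names are hypothetical. Python threads will not show real ccNUMA cache effects; the sketch only illustrates the structure: contiguous static partitions, sizes that differ by at most one element, and no location written by more than one worker.

```python
from concurrent.futures import ThreadPoolExecutor


def block_ranges(n, parts):
    """Split range(n) into contiguous blocks whose sizes differ by at most 1."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges


def partitioned_sum(data, parts):
    """Sum data with one worker per static partition and no shared writes."""
    def work(bounds):
        lo, hi = bounds
        return sum(data[lo:hi])  # private accumulator, returned by value

    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(work, block_ranges(len(data), parts)))
    return sum(partials)         # single combine step after all workers finish


print(block_ranges(10, 4))                  # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(partitioned_sum(list(range(100)), 4)) # 4950
```

Each worker reads one contiguous block and returns its own partial result, so the only communication is the final combine step.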
Conclusion
ccNUMA can deliver scalable performance for
scientific computation
Restructuring program usually required
ccNUMA and SVM machines need similar
program modifications
Simulator good for qualitative questions; not so
good for quantitative