Scalable Numerical Algorithms and Methods on the ASCI Machines

Part II
Department of Computer Science
University of the West Indies
Parallel Programming?
ENIAC, University of Pennsylvania, 1946
(http://www.library.upenn.edu/special/gallery/mauchly/jwmintro.html)
The Need For Power
Computational Science
- Traditional scientific and engineering paradigm:
  - Do theory or paper design
  - Perform experiments or build the system
- Replacing both with numerical experiments:
  - Real phenomena are too complicated to model by hand
  - Real experiments are:
    - too hard, e.g., building large wind tunnels
    - too expensive, e.g., building a throw-away passenger jet
    - too slow, e.g., waiting for climate or galactic evolution
    - too dangerous, e.g., weapons, drug design
Computational Science Examples
- Astrophysical thermonuclear flashes
- Nuclear weapons
- Weather prediction
- Climate and atmospheric modeling
- Drug design
- Blood flow
- Fluid dynamics (CFD)
Fluid Dynamics
Forced convective heat transfer
Buoyant convection
Hairpin vortex generation
Rayleigh-Taylor instability
Hairpin Vortices - Transition to Turbulence



Boundary layer flow past a hemispherical roughness element
Re=200-2000 based on hemisphere height
K=512-8168 spectral elements of polynomial degree N=7-15
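To get a feel for the problem sizes behind these runs, the element count K and polynomial degree N can be turned into a node count. This is a rough sketch, assuming hexahedral spectral elements that each carry (N+1)^3 Gauss-Lobatto-Legendre nodes in element-local storage; the pairing of the smallest K with the smallest N (and largest with largest) is only the extremes of the quoted ranges, and sem_points is a hypothetical helper.

```python
def sem_points(num_elements: int, degree: int) -> int:
    """Element-local node count for a 3D spectral element mesh.

    Assumes hexahedral elements with (degree + 1)**3 Gauss-Lobatto-Legendre
    nodes each; nodes on shared faces are counted once per element.
    """
    return num_elements * (degree + 1) ** 3


# Extremes of the quoted ranges for the hemisphere runs (assumed pairing).
for K, N in [(512, 7), (8168, 15)]:
    print(f"K={K:5d}, N={N:2d}: {sem_points(K, N):,} element-local nodes")
```

Under these assumptions the runs span roughly 0.26 million to 33 million element-local nodes.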
Simulation Cost
- Cost is O(Re^3)
- Re=1K simulation: ~1 week on 512 processors of ASCI Red
  - 50 Gflop/s, 64 GB
- Re=10K: ~1 year on all 8192 processors of ASCI Red
  - 800 Gflop/s, 1 TB
- We're really interested in Re=1M ...
- Can't even think of doing the Re=1K problem on a uniprocessor machine, let alone the 10K or 1M problems!
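The O(Re^3) cost model can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate, assuming perfect linear speedup and taking the Re=1K run (about 1 week on 512 processors) as the baseline; est_weeks is a hypothetical name.

```python
# Baseline from the slide: Re=1K takes ~1 week on 512 processors of ASCI Red.
BASE_RE = 1_000
BASE_WEEKS = 1.0
BASE_PROCS = 512


def est_weeks(re: float, procs: int) -> float:
    """Wall-clock weeks, assuming O(Re^3) work and perfect linear speedup."""
    work_ratio = (re / BASE_RE) ** 3
    return BASE_WEEKS * work_ratio * BASE_PROCS / procs


print(f"Re=10K on 8192 procs: {est_weeks(10_000, 8192) / 52:.1f} years")
print(f"Re=1M  on 8192 procs: {est_weeks(1_000_000, 8192) / 52:.2e} years")
print(f"Re=1K  on 1 proc:     {est_weeks(1_000, 1) / 52:.1f} years")
```

Under these assumptions Re=10K needs about 62.5 weeks (roughly 1.2 years) on 8192 processors, consistent with the "~1 year" figure; Re=1M comes out near 1.2 million years; and even Re=1K on one processor takes about 512 weeks (nearly 10 years), before accounting for the 64 GB of memory a single processor would also need.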
The Necessity of Parallel Computing
How fast can a serial computer be?
[Figure: a 1 Tflop, 1 TB sequential machine, with memory at distance r = 0.3 mm from the CPU]
- Consider the 1 Tflop sequential machine
  - data must travel some distance, r, to get from memory to the CPU
  - to get 1 data element per cycle, this must happen 10^12 times per second; at the speed of light, c = 3e8 m/s, that requires
  - r < c/10^12 = 0.3 mm
- Now put 1 TB of storage in a 0.3 mm × 0.3 mm area
  - each byte occupies about 3 Å × 3 Å, the size of a small atom
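The speed-of-light bound and the storage density it implies are simple enough to reproduce directly. This sketch assumes, as the argument does, a one-way trip per cycle and memory packed into an r × r square; the variable names are illustrative.

```python
C = 3e8        # speed of light, m/s
RATE = 1e12    # 1 data element per cycle at 1 Tflop -> 1e12 fetches per second
BYTES = 1e12   # 1 TB of storage

r = C / RATE                            # farthest memory can be from the CPU (m)
area = r ** 2                           # memory packed into an r x r square (m^2)
side_per_byte = (area / BYTES) ** 0.5   # side of the square each byte gets (m)

print(f"r = {r * 1e3:.1f} mm")
print(f"each byte gets a square {side_per_byte * 1e10:.1f} Angstrom on a side")
```

Counting the return trip from memory would only tighten the bound: the round trip 2r must fit in one cycle, which halves r.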
Even if we could make it ...
- ... it'd be too expensive
- Market forces are dictating the use of commercial off-the-shelf (COTS) components
The Solution?
- Add more workers!
- Use a collection of processors and memory modules that work together to solve our problems
  - Supercomputers, MPPs (massively parallel processors), clusters, Beowulfs
Bad News
Still Lots of Work
- Decide on and implement an interconnection network for the processors and memory modules
- Design and implement system software for the hardware
- Devise algorithms and data structures for solving our problems
- Divide the algorithms and data structures up into subproblems
- Identify the communication that will be needed between the subproblems
- Assign subproblems to processors and memory modules
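The divide, identify-communication, and assign steps can be illustrated on the simplest case: a 1D grid updated with a nearest-neighbour (3-point) stencil, split into contiguous blocks with one block per processor. This is a minimal sketch under those assumptions; block_partition and halo_messages are hypothetical helpers, not part of any ASCI software.

```python
from typing import Dict, List, Tuple


def block_partition(n: int, p: int) -> List[Tuple[int, int]]:
    """Divide indices 0..n-1 into p contiguous blocks [lo, hi).

    Block i is assigned to processor (rank) i; the first n % p ranks
    get one extra element so the load differs by at most one.
    """
    base, extra = divmod(n, p)
    bounds, lo = [], 0
    for rank in range(p):
        hi = lo + base + (1 if rank < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def halo_messages(bounds: List[Tuple[int, int]], width: int = 1) -> Dict[Tuple[int, int], int]:
    """Identify the communication for a nearest-neighbour stencil.

    Returns {(sender, receiver): number of ghost values}. Adjacent blocks
    swap `width` edge values each step; assumes every block holds at
    least `width` elements.
    """
    msgs = {}
    for rank in range(len(bounds) - 1):
        msgs[(rank, rank + 1)] = width
        msgs[(rank + 1, rank)] = width
    return msgs


bounds = block_partition(10, 3)
print(bounds)
print(halo_messages(bounds))
```

For 10 grid points on 3 processors this yields blocks [(0, 4), (4, 7), (7, 10)] and four single-value messages, 0↔1 and 1↔2. Contiguous blocks keep communication to the block edges; other decompositions (cyclic, 2D/3D blocks, graph partitioning) trade load balance against message count differently.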
Modern Layered Framework
[Figure: layers of abstraction, from top to bottom]
- Parallel applications: CAD, database, scientific modeling
- Programming models: multiprogramming, shared address, message passing, data parallel
- Compilation or library
- Communication abstraction
  [user/system boundary]
- Operating systems support
  [hardware/software boundary]
- Communication hardware
- Physical communication medium
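Two of the programming models in this stack, shared address and message passing, can be contrasted on a toy parallel sum. The sketch below is only an analogy, assuming Python threads stand in for processors: both versions run in one address space, so it illustrates the programming-model layer, not the communication hardware beneath it. All function names are hypothetical.

```python
import queue
import threading


def split(data, p):
    """Cyclic distribution: worker i gets elements i, i+p, i+2p, ... (as copies)."""
    return [data[i::p] for i in range(p)]


def run_workers(target, parts):
    """Start one thread per part and wait for all of them to finish."""
    threads = [threading.Thread(target=target, args=(part,)) for part in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_address_sum(data, p):
    """Shared address model: workers update one shared variable under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(part):
        nonlocal total
        partial = sum(part)
        with lock:          # serialize updates to the shared variable
            total += partial

    run_workers(worker, split(data, p))
    return total


def message_passing_sum(data, p):
    """Message passing model: workers share nothing and send partial sums."""
    mailbox = queue.Queue()  # the only channel between workers and collector

    def worker(part):
        mailbox.put(sum(part))  # send the partial sum as a message

    run_workers(worker, split(data, p))
    return sum(mailbox.get() for _ in range(p))  # receive exactly p messages


print(shared_address_sum(list(range(1, 101)), 4))
print(message_passing_sum(list(range(1, 101)), 4))
```

Both calls print 5050. The difference is where coordination lives: the shared address version needs a lock around shared state, while the message passing version needs an agreed number of messages and no shared state at all.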