Transcript Balls

Titanium: Language and Compiler
Support for Scientific Computing
Gregory T. Balls
University of California - Berkeley
Alex Aiken, Dan Bonachea, Phillip Colella, David Gay,
Susan Graham, Paul Hilfinger, Arvind
Krishnamurthy, Ben Liblit, Chang Sun Lin, Peter
McCorquodale, Carleton Miyamoto, Geoff Pike, Kar
Ming Tang, Siu Man Yau, Katherine Yelick
Titanium
gtb 1
Target Problems
• Many modeling problems in astrophysics, biology,
material science, and other areas require
– Enormous range of spatial and temporal scales
• To solve interesting problems, one needs:
– Adaptive methods
– Large scale parallel machines
• Titanium is designed for methods with
– Stuctured grids
– Locally-structured grids (AMR)
Titanium
gtb 2
Common Requirements
• Algorithms for numerical PDE computations are
(compared to linear algebra)
– communication intensive
– memory intensive
• AMR makes these harder
– more small messages
– more complex data
structures
– most of the programming effort is debugging
the boundary cases
– locality and load balance trade-off is hard
Titanium
gtb 3
Titanium for Scientific Computing
• The Language
– Java dialect compiled to C
– Extensions for serial programming
– Extensions for parallel programming
• The Compiler
– Uniprocessor optimizations
– Parallel optimizations
– Available architectures
• The Results
Titanium
gtb 4
Java for Scientific Computing
• Computational scientists work on increasingly
complex models
– Popularized C++ features: classes, overloading,
pointer-based data structures
• But C++ is very complicated
– easy to lose performance and readability
• Java is a better C++
– Safe: strongly typed, garbage collected
– Much simpler to implement (research vehicle)
– Industrial interest as well: IBM HP Java
Titanium
gtb 5
Data Types
• Primitive scalar types: boolean, double, int, etc.
– implementations store these in place
– access is fast -- comparable to other languages
• Objects: user-defined and library
– passed by pointer value
– has level of indirection (pointer to) implicit
– simple model, but inefficient for small objects
• Fast Objects (immutable classes)
– similar to structs in C
Titanium
gtb 6
Titanium Object Example
immutable class Complex {
private double real;
private double imag;
public Complex(double r, double i) {
real = r; imag = i; }
public Complex operator+(Complex c) {
return new Complex(c.real + real,
c.imag + imag); }
public double getReal {return real;}
public double getImag {return imag;}
}
Complex c = new Complex(7.1, 4.3);
c = c + c;
Titanium
gtb 7
Arrays in Java
• Arrays in Java are
objects
• Only 1D arrays are
directly supported
• Multidimensional arrays
are slow
2d
array
• Subarrays are important in AMR (e.g., interior
of a grid)
– Even C and C++ don’t support these well
– Hand-coding (array libraries) can confuse
optimizer
Titanium
gtb 8
Multidimensional Arrays in Titanium
• New multidimensional array added to Java
– One array may be a subarray of another
» e.g., a is interior of b, or a is all even elements of b
– Indexed by Points (tuples of ints)
– Constructed over a set of Points, called
Rectangular Domains (RectDomains)
– Points, Domains and RectDomains are built-in
immutable classes
• Support for AMR and other grid computations
– domain operations: intersection, shrink, border
Titanium
gtb 9
Unordered iteration
• Memory hierarchy optimizations are essential
• Compilers can sometimes do these, but hard in
general
• Titanium adds unordered iteration on rectangular
domains
foreach (p in r) { ... }
– p is a Point
– r is a RectDomain or Domain
• Foreach simplifies bounds checking as well
• Additional operations on domains and arrays to
subset and transform
Titanium
gtb 10
Titanium for Scientific Computing
• The Language
– Java dialect compiled to C
– Extensions for serial programming
– Extensions for parallel programming
• The Compiler
– Uniprocessor optimizations
– Parallel optimizations
– Available architectures
• The Results
Titanium
gtb 11
SPMD Model
• All processors start together and execute same
code, but not in lock-step
• Basic control done using
– Ti.numProcs() total number of processors
– Ti.thisProc() number of executing processor
• Bulk-synchronous style
read all particles and compute forces on mine
Ti.barrier();
write to my particles using new forces
Ti.barrier();
• This is neither message passing nor data-parallel
Titanium
gtb 12
Global Address Space
• References (pointers) may be remote
– useful in building adaptive meshes
– easy to port shared-memory programs
– uniform programming model across machines
• Global pointers are more expensive than local
– True even when data is on the same processor
» space (processor number + memory address)
» dereference time (check to see if local)
– Use local declarations in critical sections
Titanium
gtb 13
Example: A Distributed Data Structure
• Data can be accessed
across processor
boundaries
Proc 0
Proc 1
local_grids
all_grids
Titanium
gtb 14
Example: Setting Boundary Conditions
foreach (l in local_grids.domain()) {
foreach (a in all_grids.domain()) {
local_grids[l].copy(all_grids[a]);
}
}
Titanium
gtb 15
Titanium for Scientific Computing
• The Language
– Java dialect compiled to C
– Extensions for serial programming
– Extensions for parallel programming
• The Compiler
– Uniprocessor optimizations
– Communication optimizations
– Available architectures
• The Results
Titanium
gtb 16
Sequential Optimizations
• Current optimizations
– foreach loops
» within 20% of FORTRAN on many loop-intensive codes
• Optimizations in development
– Cache blocking
– Inlining
Titanium
gtb 17
Parallel Optimizations
• Titanium compiler performs parallel optimizations
– communication overlap and aggregation
– fast parallel bulk I/O
• New analyses:
– synchronization analysis: the parallel analog to
control flow analysis for serial code [Gay & Aiken]
– shared variable analysis: the parallel analog to
dependence analysis [Krishnamurthy & Yelick]
– local qualification inference: automatically
inserts local qualifiers [Liblit & Aiken]
Titanium
gtb 18
Architectures
• Titanium runs on many platforms
– SP machines, T3Es, Networks of Workstations
• Titanium on Blue Horizon specifics
– Uses LAPI (not MPI)
– Allows user to specify threads (procs) per node
– Performs conservative distributed garbage
collection
Titanium
gtb 19
Titanium for Scientific Computing
• The Language
– Java dialect compiled to C
– Extensions for serial programming
– Extensions for parallel programming
• The Compiler
– Uniprocessor optimizations
– Communication optimizations
– Available architectures
• The Results
Titanium
gtb 20
AMR Gas Dynamics
• Hyperbolic Solver [McCorquodale & Colella]
– Implementation of Berger-Colella algorithm
– Mesh generation algorithm included
• 2D Example (3D supported)
– Mach-10 shock on solid surface
at oblique angle
Titanium
gtb 21
1.31x10-9
• Finite Difference
based Method of
Local Corrections
0
FD-MLC for Poisson Problem
• Example run on
16 processors
– 1 large highwavenumber
charge
– 2 smaller
star-shaped
charges
Titanium
-6.47x10-9
[Balls & Colella]
gtb 22
Parallel Performance
• Speedup on Ultrasparc SMP
8
• EM3D is small kernel
– relaxation on
unstructured mesh
– shows high parallel
efficiency of Titanium
system
• AMR speedup limited by
– small fixed mesh
– 2-levels, 9 patches
Titanium
7
6
5
4
em3d
3
amr
2
1
0
1
2
4
8
gtb 23
FD-MLC Parallel Performance
• Communication requirement is low (< 5%)
• Scaled speedup experiments are nearly ideal (flat)
IBM SP2 at SDSC
Titanium
Cray T3E at NERSC
gtb 24
Future Work
• Titanium language and compiler developments
– Templates
– Further optimization of serial performance
• Algorithm Development in Titanium
– Self-gravitating gas dynamics
– Immersed boundary methods
• Comparison to library approach
– Performance
– Code size and readability
Titanium
gtb 25
Summary
• Language support
– Arrays, Immutable, Overloading, …
• Compiler optimizations
– Uniprocessor optimizations
– Parallel analyses
• Architectures
– Ported to several different platforms
• Results
– Several algorithms implemented
– Good parallel performance
Titanium
gtb 26