Center for Applied Scientific Computing Month DD, 1997

Download Report

Transcript Center for Applied Scientific Computing Month DD, 1997

The Sisal Programming Language
Pat Miller
Center for Applied Scientific Computing
and erstwhile USF hanger-on
December 18, 2003
Abstract
Richard Feynman recounted a time when, as a graduate student at Princeton, a fellow student came down the hall
beaming and shouting, "Everything I know about physics is wrong! Isn't it wonderful!" The premise of Sisal is that
everything you think you know about computing is wrong, and that (when you discover the inner beauty of Sisal) it is
just wonderful.
The big idea advocated by Sisal is that the concept of memory is the root of all evil. The Von Neumann concept that a
central processor executes instructions that update a fixed set of "memory" cells is the wrong way of looking at the
world. This introduces a bottleneck (the infamous Von Neumann bottleneck) wherein the CPU is trying to feed itself
though a tiny pipe connected to the vast stores of connected memory. If you look at modern computer architectures,
they go to tremendous effort to deal with this very problem: caches, vectors, multiple instruction streams, and PIMs are
all ways to get around the problem.
In Sisal, however, the concept of memory is replaced by one of "definition." This definition is semantically enforced
through single assignment. An entire program can be represented as a dataflowgraph. Since all operations are
functional, there are no side effects, and global analysis is trivial. The compiler can restructure data and control flow
with amazing abandon: fusing and splitting loops while fracturing and restructuring records to optimize for vector or
super scalar architectures. It can assemble heavyweight chunks of work or find hundreds of tiny execution threads
much more easily than the compiler for an imperative language like FORTRAN or C.
We'll talk about the technical core of the Sisal compiler tools, see how to program in a single-assignment language, and
look at some programming examples. We'll look a bit at implementations of Sisal for the good ol' Crays, the
Manchester Dataflow Machine and a systolic array processor. We'll even look at some of the (surmountable)
weaknesses that helped doom Sisal and how they are being addressed in my home hobby language: SLAG. I will
assign a couple of short homework programming assignments for those interested in trying the language out.
CASC
PJM 2
Tanks – An alegory
There was a Pentagon working group looking at tank defense. The group included
generals, defense contractors, technical gurus, etc….
The question on the floor was how the cycle of defense, attack, counter-defense etc…
could be broken (e.g. Attack: self-forging munitions  ablative armor  multiple
sabot, etc….)
Somebody on the floor offered, “Maybe we should get rid
of the tanks?
CASC
PJM 3
So what does that have to do with me?


If we are ever to make more than evolutionary
progress, we must rethink what our objectives and
targets really are.
We must buy into the “conservation of complexity”
argument that states that you can never remove
complexity, but can only sweep it under a rug where it
is someone else’s problem
CASC
PJM 4
Overview





What is Sisal
Sisal computational structures
— functions
— arrays and loops
— conditionals
A sane implementation of an insane semantic
— simple optimization
— reference count optimization
— update-in-place / build-in-place
— parallel partitioning
Evidence of a real success
The Death of Sisal (abbreviated version)
SLAG (Sisal Lives AGain)

CASC
PJM 5
Some Sisal Places

LLNL

University of Santiago, Chile

University of Colorado

University of Sao Paulo

DEC

University of Manchester
(UK)

Adelaide University
(Australia)

USC

University of Puerto Rico
CASC
PJM 6
What is Sisal






Dataflow work at MIT on VAL language
The acronym is for Streams and Iteration in a Single
Assignment Language
Defined in 1983, revised and frozen in 1985
— Sisal 2.0 (CSU) 1986
— Sisal90 (LLNL) 1989
— Frozen in part because the original grammar
tables were not under configuration control :-)
Original collaborators were LLNL, Colorado State U,
University of Manchester, and DEC.
Large (100+) install base (pre-WWW) and still cited
Threaded SMP implementations (e.g. Cray) + weird
stuff
CASC
PJM 7
Objectives






to define a general-purpose functional language
to define a language independent intermediate form
for dataflow graphs
to develop optimization techniques for high
performance parallel applicative computing
to develop a microtasking environment that supports
dataflow on conventional computer systems
to achieve execution performance comparable to
imperative languages
to validate the functional style of programming for
large-scale scientific applications
Note: Citations in the form [NAME 99] are referenced at the end of the talk.
Citations without a year [Miller] simply refer to the person who worked on the project
CASC
[Feo96]
PJM 8
Quick overview of functional languages





all operations are functions
functions are side-effect free
— no changes to arguments nor global state
functions do no hold persistent state
no side effects implies no race conditions in parallel
environment implies recreatable results
determining if two operations can go in parallel is
reduces to a matter of identifying a dataflow
dependency between the operations inputs and
outputs
— There is a WHOLE lot of low level parallelism in
every code we write. The trick is exploiting it!
CASC
PJM 9
Sisal Tools




frontend
— convert to IF1 intermediate
— originally Pascal with M4 macros
— p2c conversion to C and hand hacked
— LL(1) grammar tables
DI (debugger interpreter for IF1)
OSC
— Optimizing Sisal Compiler
— IF1 -> optimizer intermediates -> C
TWINE
— semi-compiled debugger environment
— Thesis Work Is Never Ending
CASC
PJM 10
Basic Sisal






separate compilation, one global namespace
functions are listed in define to export
functions take 0 or more inputs and produce 1 or
more outputs
— strictly typed
— Like FORTRAN “pure” – i.e. no side effects
Arguments are named and typed, outputs are typed
The body is an expression
A well defined error semantic insures orderly,
predictable execution even with divide-by-zero and
bounds errors (sorta)
CASC
PJM 11
Examples
% Hello world!
define main
function main(returns array[character])
“hello world”
end function
% Simple arrays
define main
function main(A: array[integer] returns integer, array[integer])
for element in A
sqr := A*A;
returns
value of sum sqr
array of sqr
end for
end function
CASC
PJM 12
Arrays and loops




Sisal was targeted at scientific computation
Arrays
— vector of vectors syntax
— size is implicit to an instance
— arbitrary lower bound (default 1)
Streams
— non-strict containers
Loops
— Iterative and parallel (do-across) for loops
— “dot” product indices to describe “run-together”
indices
— “cross” product loops to describe nesting
CASC
PJM 13
Examples
for i in 1,10
returns
value of sum i
end for
% Basic integer range
% Simple reduction
x, max_x :=
for v in velocity dot t in time
position := v*t;
returns
array of position
value of greatest position
end for;
CASC
PJM 14
Examples
for i in 0,9 cross j in 0,9 % Implied nested loops range
ij := i + 10*j;
returns
array of ij
% Returns a 9x9 array[array[integer]]
end for
for x in A
x2 := x*x;
returns
array of x2
end for;
for x in B at i,j,k
x2 := x*x;
returns
array of x2
end for;
CASC
% compute across outer dimension
% result is the same size as A
% compute across 3 nested dimensions
% has same size and shape as B
PJM 15
Reductions/Filters
array of x
stream of x
% collectives
value of x
value of sum x
value of product x
value of greatest x
value of least x
% scalar reduction
value of catenate
% join arrays
value of left RRRR x
value of right RRRR x
value of tree RRRR x
% left associative reduction
% right associative
% tree associative
XXXX of x when flag
XXXX of x unless flag
%filtered
CASC
PJM 16
Iterative form





Not all loops are implicitly data parallel
Sisal supports an iterative form that supports the idea
of “loop carried values”
The loop body is allowed to reference the “old” value
of a definition (variable)
An separate body defines the initial values
Loop exits on a boolean control expression (no
break!)
CASC
PJM 17
For Initial
for initial
A := some_value();
% This is the first “iteration”
err := 1.0;
% It sets initial values for variants….
epsilon := 1e-6;
% and loop constant info
while err > epsilon repeat
A := advance( old A );
% Note the use of “old” to denote last value
err :=
for element in A dot element1 in old A % “old A” is still available
diff := element – element1;
returns
value of sum diff*diff % This loop will get fused with advance
end for;
returns
value of A
end for
CASC
PJM 18
For Initial 3-point stencil (Array Replace)
for initial
A := some_value();
% This is the first “iteration”
i := array_liml(A);
while i < array_limh(A) repeat
i := old i + 1;
A := old A[i: ( old A[i-1] + old A[i] + old A[i+1]) / 3.0 ];
returns
value of A
end for
The semantic for the red line says to…
* make a copy of old A
* replace the value of old A[i] with the new value
* bind that value to the name “A”
* i.e. There are 3 full copies of the array hanging around on each iteration
* (But we didn’t implement it this way)
CASC
PJM 19
Conditionals


if-then-[elseif]-else-end if form
— it’s an expression
— must be something in each branch [why?]
tag-case for union (variant records)
if x < 10 then
x+5
elseif y > 10 then
x-3
else
10
end if
CASC
PJM 20
Fibre I/O







structured description of data
integers, reals, double_precision, char, string
array [1: 11 22 33 …. ]
presized array [1,4: 11 22 33 44]
stream {1: 11 22 33 44 … }
record < 3 “hello” [1: 11 22 33] >
union (2: 33)
CASC
PJM 21
Sisal compiles to an intermediate form


IF1 -- dataflow graph
Consider the following function
function f(a,b: integer returns integer)
a+b*3
end function
a
b
3
3
*
+
CASC
PJM 22
More complex things work with
compound nodes


IF/FORALL/ITER are higher order functions
Consider the following function
function f(a,b: integer returns integer)
5*
if a < 10 then
b+3
else
a+b
end if
end function
10
<
3
+
+
5
*
CASC
PJM 23
Common Subexpression Elimination

If two nodes have the same opcode and the same
inputs, they compute the same value
function f(a,b: integer returns integer)
(a+b)*(a+b)
end function
==>
 Similarly
CASC
we get inlining, loop fusion, record fission, ...
PJM 24
Reference Count optimizations



Implicit memory allocation --> implicit deallocation
— OSC used compiler generated reference counting
— GC was not appropriate because of re-use issues
— 1st cut we spent 40% of time reference counting
— RefCnt is a critical section
We can eliminate reference counts if a data
dependency insures object stays alive
Deeper analysis can statically eliminated almost all
reference counting ops and replaced them with
statically defined lifetime
— see [SkedzSimpson 88]
CASC
PJM 25
Update-in-Place

Array replacement (and record replacement) are
prohibitively expensive if implemented naively
A := old A[i: new_element];
R := rec replace [field: val];



Introducing artificial dependencies can push
“readers” in front of “writers” (at the cost of a small
amount of fine grain parallelism)
reference count analysis may be able to statically
guarantee the final writer may overwrite
A dynamic reference count check (non-blocking) can
allow overwrite if the reference count is one
CASC
PJM 26
Build-in-place

In order to minimize data motion (and associated
copying), the “build-in-place” optimizations attempt
to compute a value into its final resting place
first_row := for ….. end for;
middle := for row in 2,array_size(A)-1 …. end for;
last_row := for …. end for;
solution := first_row || middle || last_row;



You want to allocate the full array and then have each
loop fill its portion
CANNOT express this in functional dataflow since
you are calling for a side effect
IF2 re-introduces state (in the form of memory
buffers) to perform these optimizations
CASC
PJM 27
Cost estimates and parallel partitioning

The problem isn’t the dearth of parallelism, it’s the glut of finegrained parallelism
— Early implementations eagerly forked every function and
split every parallel loop and experienced perfect inverse
scaling (parallel slowdown)

We switched to a system that tried to aglomerate big chunks of
work
— abandoned function parallelism for loop parallelism (OK on
4 processor Crays)

We also needed a static performance estimate to decide which
loops to split

We also made dynamic estimates to throttle small loops

Finally, there was a feedback loop that improved static guesses.
CASC
PJM 28
Implementations



The main (LLNL) implementation used the Sisal->IF1
frontend and OSC for conventional shared memory
machines [Cann core extended Miller/Denton/et al]
–Alliant
–Cray (1, 2, XMP, YMP)
–Dec/ULTRIX
–Encore
–Sequent
–SGI
–Solaris
Did not implement error semantics and used strict
streams (i.e arrays)
Current version supports generic PThread & T3[DE]
CASC
PJM 29
Debugger implementations



SDB [ Cann ]
— OSC add-on for simple breakpoint and variable
inspection
DI [Skedz Yates]
— Was used to predict maximum available
parallelism
— Implemented almost the full language
specification
— Provided a command line debugger/inspector
TWINE [Miller Cedeno]
— semi-compiled environment with more complete
debugger information and better speed/memory
usage
CASC
PJM 30
Early and strange implementations






Burton Smith’s HEP [Allan85]
Multi-headed VAX at LLNL (Circus and Dumbo)
Manchester Dataflow Machine [Foley87]
CMU Warp Systolic Array machine [ no citation! ]
Distributed Array implementation [M Miller]
IBM Power4
— Shared Memory
— No hardware cache consistency
— Compiler inserted cache directives [Wolski95]
CASC
PJM 31
Programming Apocrypha





Quadtree matrix [McGraw Skedz]
Livermore Loops [Feo]
Salishan Problems [Feo92]
Cray FFT [Feo]
“Simple” Hydro
— 2D KL Lagrange
— Heavy loop fusion
CASC
PJM 32
The death of Sisal






Project lost LLNL funding in 1996 (except for a small
slice of NSF funding through USC)
Work was suspended and most of the group drifted
away
Was DOE’s first “Open Source” release
The source code was stashed in various parts of the
globe and now lives on at SourceForge
Was referenced at http://www.llnl.gov/sisal until 6
months ago
The remnants of Sisal include
— Sisal Mugs (SC ‘91) and banner
— A video tutorial series
— Some dubious backup tapes holding the mortal
remains of the code
CASC
PJM 33
Some of the reasons

Failure to penetrate the programs stymied by the beliefs that:
— There is but one language, FORTRAN with Cray pointers
— We will never recode our algorithms
— We will never live in a mixed language world (see point 1)
— The CRAYs will live forever and we already know how to
vectorize
— If I really need parallelism (a dubious thought), an automatic
parallel compiler will give me enough

Short term, programmatic focus of about 2 years

Early focus on distributed memory machines was pushed over
to SMP, but….
— SMP/vector work finished and polished right about the time
the Meiko arrived and project was killed as irrelevant 
CASC
PJM 34
Still more reasons...

vector of vector syntax

FIBRE I/O was universally hated

clunky interface to/from FORTRAN
— trying to avoid copies at function call interface

final runtime was no better than hand-tuned FORTRAN
— albeit arrived at in a much shorter time!

no complex type

difficult to add reductions

poor library support (out of the box)

wordy syntax and many keywords

a non-zero learning curve for the language

will you be here in seven years?
CASC
PJM 35
SLAG

Sisal Lives AGain [Miller]

C/C++-like syntax which compiles to XML version of IF1

Very few keywords

Modules and simplified extension interfaces (via Babel?)

Distributed implementation (hopefully)
— The idea is to capture NUMA
— threads on SMP
— MPI-2 Get or SHMEM cross-box

Will have objects, but objects cannot update state
— objects birth new objects with modified state

Maybe a Mentat style framework of cooperating functional objects

My personal target is Sony PS* since it has vector units and will
eventually go parallel (+ streams)
CASC
PJM 36
Slag program
module foo {
using math,runtime;
pi := 3.1415926;
radians_per_proc := pi / runtime::number_of_processors;
double daxpy(...) { …. } in “FORTRAN”
double compute_dt(double* A) { …. } for “Python”
int main(returns string) { // when main() returns string, it is stdout
n := int( argc[0] );
if ( n < 0 ) {
string(n) << “ is negative”
} else {
“it is non-negative”
}
}
}
CASC
PJM 37
Terms





functional language – only real computational
element is function call
imperative language – a language in which
statements modify the contents of memory cells and
set control flow
dataflow – computation is accomplished by input
which flows along edges to nodes which fire when all
inputs arrive.
single assignment – a name is associated with
exactly one value in an execution context
referential transparency – a name and its value are
always interchangeable
CASC
PJM 38
Terms


lazy evaluation – values are computed as needed
eager evaluation - values are computed as soon as
possible
CASC
PJM 39
More stuff...





W Ackerman and JB Dennis. VAL – A value oriented
Algorithmic Language. MIT/LCS/TR-218, June 1979.
D Raymond. SISAL: A Safe and Efficient Language for
Numerical Calculations. Linux Journal #80, December
2000. see
http://www.linuxjournal.com/article.php?sid=4383
J Feo. Sisal. A comparative study of parallel programming
languages: the Salishan problems. June 1992 UCRLJC-110915
S Skedzielewski and R Simpson. A simple method to
remove reference counting in appreciative programs.
UCRL-100156. 1988
J Foley. Manchester Dataflow Machine: Preliminary
Benchmark Test Evaluation. UMCS-87-11-2. 1987
CASC
PJM 40
Still more stuff…



S Allan and R Oldehoeft, HEP SISAL: parallel functional
programming, on Parallel MIMD computation: HEP
supercomputer and its applications, p.123-150, June
1985
R Wolski and D Cann. Compiler Enforced Cache
Coherence Using a Functional Language. Journal of
Scientific Programming, December, 1995.
J Feo. Livermore Loops in Sisal. UCID-21159. 1987
CASC
PJM 41
UCRL-PRES-210560
Work performed under the auspices of the U. S. Department of Energy by Lawrence
Livermore National Laboratory under Contract W-7405-Eng-48
CASC
PJM 42