Whatever Burton wants to talk about


Getting Ready for Mainstream Parallel Computing
Burton Smith
Cray Inc.
Parallel computing is upon us†

• Uniprocessors are nudging against performance limits
  - More transistors/processor increases watts, not speed
• Meanwhile, logic cost ($/gate-Hz) continues to fall
  - Newer microprocessors are multi-core and/or multithreaded
  - So far, it’s just “more of the same” architecturally
• What are we to do with all that hardware?
  - New “killer apps” will probably need more performance
  - Semantic information processing and retrieval
  - Personal robots
  - Better human-computer interfaces
• The big question: what are parallel computers good for?
  - This is a question for both hardware and software
  - Hardware has gotten ahead of software, as usual

† And it’s about time!
Parallel software is almost rudderless

• We don’t have a good parallel language story
  - MPI?
  - PGAS languages?
• We don’t have a good debugging story
  - TotalView?
  - gdb?
• We don’t have a good resource management story
  - Some-weight-or-other kernels?
  - Linux?
  - Lustre?
• We don’t have a good robustness story
  - Malware resistance?
  - User-generated checkpoints (or else)?
Better parallel languages

A better parallel programming language could make programmers more productive by:
• Making data races impossible
• Implementing higher level operations
• Presenting a transparent performance model
• Exploiting architectural support for parallelism
  - Global memory operations
  - Light-weight synchronization
• Enabling a more abstract specification of program locality

Give the programmer the hard parts and nothing else
Are functional languages the answer?

• Imperative languages schedule values into variables
  - Parallel versions of such languages encourage data races
  - The basic problem: stores don’t commute with loads
• Functional languages avoid races by avoiding variables
  - In a sense, they compute new constants from old
  - Data races are impossible (because loads do commute)
  - We can make dead constant reclamation pretty efficient
• Program transformation is enabled by these semantics
  - High-level operations become practical
  - Programmer productivity is much improved
  - Caution: it is pretty easy to opacify performance
• Abstracting locality is much easier in functional languages
  - Copying is safer, for example
The bad news

• There is no notion of state in functional languages
• Attempts to add state while preserving commutativity:
  - Applicative State Transition systems (Backus)
  - Monads (Wadler et al.)
  - M-structures (Arvind et al.)
• A related fact: functional programs are deterministic
  - Introducing state leads to non-determinism (e.g. races)
• Some kinds of nondeterminism are good
  - Any ordering that does not affect final results is OK
  - Only the programmer knows where the opportunities are
• How can we tell good non-determinism from bad?
A (serial) histogramming example

const double in[N];  // data to be histogrammed
const int f(double); // f(x) < M is the bin of x
int hist[M];         // histogram, initially 0
for (i = 0; i < N; i++)
{ /* (∀ int k)(hist[k] = |{j | 0 ≤ j < i ∧ f(in[j]) = k}|) */
    int bin = f(in[i]);
    hist[bin]++;
} /* (∀ int k)(hist[k] = |{j | 0 ≤ j < N ∧ f(in[j]) = k}|) */

Don’t try this in parallel with a functional language!
Histogramming in parallel

const double in[N];  // data to be histogrammed
const int f(double); // f(x) < M is the bin of x
int hist[M];         // histogram, initially 0
forall i in 0..N-1
{ /* (∀ int k)(hist[k] = |{j | j ∈ σ ∧ f(in[j]) = k}|) */
    int bin = f(in[i]);
    lock hist[bin];
    hist[bin]++;
    unlock hist[bin];
} /* (∀ int k)(hist[k] = |{j | 0 ≤ j < N ∧ f(in[j]) = k}|) */

• σ is the set of values i processed “so far”
• The loop instances commute with respect to the invariant
• Premature reads of hist[] get non-deterministic garbage
What do the locks do?

• The locks guarantee the integrity of the invariant
  - They protect whatever makes the invariant temporarily false
  - We have non-determinism “beneath the invariants”
  - In the example, the set σ captures that non-determinism
• As long as invariants describe all we care about in the computation and forward progress is made, all is well
  - Barriers won’t do the job
• Pretty clearly, the locks need to be lightweight
• Can we automate or at least verify lock insertion?
  - If we had a language for the invariants, maybe so
• A constructive step is to let the language handle the locks
  - Efficiency with safety is one reason
Atomic sections

const double in[N];  // data to be histogrammed
const int f(double); // f(x) < M is the bin of x
int hist[M];         // histogram, initially 0
forall i in 0..N-1 do
    atomic {
        int bin = f(in[i]);
        hist[bin]++;
    }

• This abstraction permits implementation mechanisms other than locking
• It works better interprocedurally
• All of the proposed HPCS languages have something like it
Operations on multiple objects

node *m; // a node in an undirected graph
/* (∀ node m)(∀ node n)(n ∈ (m->nbr)* ⇔ m ∈ (n->nbr)*) */
atomic { // remove *m from the mesh
    for (n = m->nbr; n != NULL; n = n->nbr) {
        // remove links between *n and *m
        for (p = n->nbr; p != NULL; ... // etc
}

• A naïve implementation would routinely deadlock
• If a sequence would deadlock or fail, preservation of the invariant requires that it be “undone”, reversing its side effects
• In other words, what we need is linguistic support for nestable multi-object atomic transactions
• Operating system implementers really need this stuff
Better debugging

• Don’t worry about:
  - The ability to single-step any thread
• Instead, worry about:
  - Conditional breakpoints
    - For both programs and data
    - Ad-hoc conditional expressions
  - Whole-program data perusal
    - Declaration awareness
    - Run-time data structures, e.g. queues
    - Ad-hoc data visualization
    - Ad-hoc data verification
  - A much higher level user interface language
    - Don’t make the GUI mandatory (what generates code?)
• Anyone familiar with duel?

gdb> duel #/(root-->(left,right)->key)
Debugging shared memory parallelism

• Data races are a perennial problem
  - A more disciplined programming paradigm, e.g. using transactions systematically, would help a lot
• Another big issue is verifying invariants of data structures
  - Anyone else ever read a dump with this end in view?
  - Denelcor debugged Unix System 3 on the HEP that way
• Invariants could be checked continually, say with daemons
  - Trouble is, the daemons race with the code being checked
  - Of course the daemons can transact along with everyone else
  - If transactions are protecting each invariant’s domain, a daemon will always see a consistent state by definition
• As long as we are at it, why not routinely make sure the invariant is restored whenever a transaction commits?
Better resource management

• Parallel computers have a plethora of resources to manage
  - And well-virtualized resources of a given type are fungible
• Resource allocation is all about enforcing policies while satisfying constraints
  - One policy, e.g. for process scheduling, never fits all users
• On the other hand, policy implementations are usually so mixed up with mechanism that they are tough to tweak
  - It’s certainly not something we want users doing
• Why do we use imperative languages to implement policy?
  - Declarative programming, e.g. CLP, may be a better fit
What the heck is CLP?

• CLP is Constraint Logic Programming: guarded Horn clauses, say, augmented by predicates over some domain
  - CLP(ℜ) adds <, =, etc. over the real numbers ℜ
  - Another flavor is CLP(FD) over a finite domain
  - Unification is extended appropriately to the predicates
• The cool thing about CLP is that it directly enforces the constraints and searches for a solution that satisfies them all
• There is probably a way to make CLP parallel if need be
  - In any event it needs to be able to transact against the current state of the resource requests and allocations
• We are exploring these ideas in Cascade, our HPCS project
Better robustness

• Our computer systems are too easy to attack
  - Prevention is best, but detection is a decent runner-up
• Continuous verification of a system’s properties is a direct way to detect intrusion
  - Violations of resource allocation policy
  - Loss of data structure integrity
  - Transactions are needed to make it work
• Another enemy of robustness is hardware failure
  - Hardware error checking is necessary but hardly sufficient
  - Continually checking invariant properties of programs, with daemons or upon transaction commits, is the rest of the story
  - The debugging discussion applies here
• But what should we do to recover from failures?
  - Checkpoint/restart is the usual answer
How to checkpoint

• Don’t do it
  - Just re-run or re-boot as necessary
• Let the user do it
  - Tell the OS where the restart file(s) and entry point are
• Let the compiler do it
  - Ditto, hopefully at a “good time”
• Let the OS do it
  - Maybe at times specified by the user or the compiler
• Do it directly to disk
• Use 2x the memory and RAID and/or buffer to disk
• Use COW and transactions and do it concurrently and incrementally
• Mix the above strategies
• How do we decide?
Productivity

• Productivity Ψ is utility U divided by cost C
  - What one gets for what one spends
  - We want to maximize it
• Utility is generally a function of the time-to-solution
  - It is positive, non-increasing, and eventually zero
• Here we consider execution time, which depends on:
  - The application α
  - The system σ
• These two parameters also strongly influence the cost

    Ψ(σ) = max over α of  U(t(α, σ)) / C(α, σ)

• We assume the application α is fixed in what follows
The rental execution cost model

• Plausibly, the cost of a run is linear in the time used
• The rate might be the life-cycle cost of the system times the fraction used over the system’s useful life:

    C(σ) = [life_cost(σ) · frac_used(σ) / useful_life(σ)] · t(σ) = R(σ) · t(σ)

  and so

    Ψ = max U(t(σ)) / (R(σ) · t(σ))

• For a given system under the rental model, productivity strictly decreases with compute time until it becomes zero when the utility reaches zero
• The implication is an injective functional relationship between time and productivity under the rental model
Merits of the productivity approach

• Utility functions quantify the value of timeliness
  - One example is a hard deadline (“box”) utility function
• Productivities can be computed from execution times
• Given a probability distribution for productivity, we can compute other quantities readily:
  - The mean and variance of the productivity
  - The probability that the productivity is greater than a particular value
• The impact of policies and system configuration decisions can be assessed, e.g.
  - Is a more expensive but more reliable system worthwhile?
  - How frequently should we checkpoint, if at all?
  - Should memory or disks be added to speed up checkpointing?
Estimating total running time

• The distribution of total running time t is determined by:
  - Failure process parameters, e.g. the failure rate λ
  - The no-failure execution time Tx
  - The execution interval between checkpoints Ti
  - The operating system start-up time To
  - The time to perform a checkpoint Tc
  - The time to perform a restart Tr
• Without checkpointing, job execution looks like this:

  [Timeline figure: repeated attempts, each an OS start-up To followed by execution cut short by a failure, until one run of length Tx completes]

• With checkpointing, job execution looks like this:

  [Timeline figure: work intervals Ti each followed by a checkpoint Tc; after a failure, an OS start-up To and a restart Tr precede the next interval]
A Markov process model

• Here we have a nontrivial Markov process:

  [State diagram: states 0 through n count completed checkpoint intervals, with a primed copy of each state entered after a failure; each transition is labeled with a probability/delay pair, p/a and q/b from unprimed states, p′/c and q′/d from primed states]

  where

    T  = Ti + Tc        q  = prob(τ > T) = 1 − p
    T′ = To + Tr        q′ = prob(τ > T + T′) = 1 − p′

  and the delays associated with the events a–d are

    a: T                c: T + T′
    b: τ, given τ < T   d: τ, given τ < T + T′
Mean total time

• Computing the mean total time and expanding, we get

    t̄ = n·(1 − e^(−λ(Ti+Tc))) / (λ·e^(−λ(Ti+Tc+To+Tr)))
      = (n/λ)·(e^(λ(Ti+Tc+To+Tr)) − e^(λ(To+Tr)))

• This approaches n·(Ti+Tc) when the MTBF λ⁻¹ is large, and approaches

    (n/λ)·e^(λ(Ti+Tc+To+Tr))

  when λ⁻¹ is small compared to Ti+Tc
Optimizing the checkpoint interval

• Setting n = Tx / Ti allows determination of the optimal checkpoint interval T̂i:

    0 = ∂t̄/∂Ti
      = ∂/∂Ti [ (Tx/(λ·Ti)) · (e^(λ(Ti+Tc+To+Tr)) − e^(λ(To+Tr))) ]
      = (Tx/λ) · [ λ·e^(λ(Ti+Tc+To+Tr))/Ti − (e^(λ(Ti+Tc+To+Tr)) − e^(λ(To+Tr)))/Ti² ]

  whence

    λ·T̂i = 1 − e^(−λ(T̂i+Tc))

• This is identical to Daly’s result†

† Equation (9) in Daly, J.T., “A Strategy for Running Large Scale Applications Based on a Model that Optimizes the Checkpoint Interval for Restart Dumps”, Proc. Int. Conf. Supercomputing, St. Malo, France, June 2004
Optimizing checkpoint bandwidth

• Disks can be bought to reduce t̄ by reducing Tc (and Tr)
  - Whether it is worth doing depends on the utility
• Assuming a rental cost model, constant utility, an optimal checkpoint interval, and a disk rental “rate” of D dollars/s per image/s:

    Ψ = U / ((R + D/Tc) · t̄)

• Letting Tc = Tr and basing cost on the expected total time,

    0 = ∂/∂Tc [ (R + D/Tc) · t̄ ]

• The solution looks something like

    T̂c = (−D + √(D² + 4·R·D·f·t̄(Tc, λ))) / (2R)
Conclusions

• It is time to take parallel computing seriously
• We have a lot of architectural work to do in software
  - Not that the hardware is all done!
  - Software architecture is needed to define and refine it
• New ways of thinking about programming may be useful
• New ways of thinking about performance are also needed