Distributed Parallel Programming
Parallel and High Performance Computing
Burton Smith
Technical Fellow
Microsoft
7/17/2015
Agenda
Introduction
Definitions
Architecture and Programming
Examples
Conclusions
INTRODUCTION
“Parallel and High Performance”?
“Parallel computing is a form of computation
in which many calculations are carried out
simultaneously” G.S. Almasi and A. Gottlieb, Highly
Parallel Computing. Benjamin/Cummings, 1994
A High Performance (Super) Computer is:
One of the 500 fastest computers as measured by
HPL: the High Performance Linpack benchmark
A computer that costs 200,000,000 rubles or more
Necessarily parallel, at least since the 1970s
Recent Developments
For 20 years, parallel and high performance
computing have been the same subject
Parallel computing is now mainstream
It reaches well beyond HPC into client systems:
desktops, laptops, mobile phones
HPC software once had to stand alone
Now, it can be based on parallel PC software
The result: better tools and new possibilities
The Emergence of the Parallel Client
Uniprocessor performance is leveling off
Logic density continues to grow (Moore’s Law)
Instruction-level parallelism nears a limit (ILP Wall)
Power is getting painfully high (Power Wall)
Caches show diminishing returns (Memory Wall)
So uniprocessors will collapse in area and cost
Cores per chip need to increase exponentially
We must all learn to write parallel programs
So new “killer apps” will enjoy more speed
The ILP Wall
Instruction-level parallelism preserves the
serial programming model
While getting speed from “undercover” parallelism
For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
At best, we get a few instructions/clock
† Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109−116.
The Power Wall
In the old days, power was kept roughly constant
Dynamic power, equal to CV²f, dominated
Every shrink of .7 in feature size halved transistor area
Capacitance C and voltage V also decreased by .7
Even with the clock frequency f increased by 1.4, power per transistor was cut in half
Now, shrinking no longer reduces V very much
So even at constant frequency, power density doubles
Static (leakage) power is also getting worse
Simpler, slower processors are more efficient
And to conserve power, we can turn some of them off
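A quick arithmetic sketch of the scaling argument above, using normalized units of my own choosing (not from the slides):
// Dynamic power per transistor: P = C * V^2 * f.
// Classic (Dennard-era) shrink: linear dimensions x0.7, so area x0.49,
// C and V x0.7, f x1.4.
#include <cstdio>

int main() {
    const double s = 0.7;                         // linear shrink factor
    double C = 1.0, V = 1.0, f = 1.0, area = 1.0; // normalized starting point

    double P0 = C * V * V * f;                                // 1.00
    double P1 = (s * C) * (s * V) * (s * V) * (1.4 * f);      // ~0.48: halved
    std::printf("power per transistor: %.2f -> %.2f\n", P0, P1);
    std::printf("power density:        %.2f -> %.2f\n",
                P0 / area, P1 / (s * s * area));              // ~0.98: roughly constant

    // Power Wall: V no longer shrinks, so the same shrink now raises density.
    double P2 = (s * C) * V * V * f;                          // 0.70 at constant f
    std::printf("density with V fixed: %.2f\n", P2 / (s * s * area)); // ~1.4 and climbing
    return 0;
}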
The Memory Wall
We can get bigger caches from more transistors
To speed up 2X without changing bandwidth
below the cache, the miss rate must be halved
How much bigger does the cache have to be?†
For dense matrix multiply or dense LU, 4x bigger
For sorting or FFTs, the square of its former size
For sparse or dense matrix-vector multiply, impossible
Does this suffice, or is there a problem scaling up?
Deeper interconnects increase miss latency
Latency tolerance needs memory access parallelism
† H.T. Kung, “Memory requirements for balanced computer architectures,”
13th International Symposium on Computer Architecture, 1986, pp. 49−54.
Overcoming the Memory Wall
Provide more memory bandwidth
Increase DRAM I/O bandwidth per gigabyte
Increase microprocessor off-chip bandwidth
Use architecture to tolerate memory latency
More latency ⇒ more threads or longer vectors
No change in programming model is needed
Use caches for bandwidth as well as latency
Let compilers control locality
Keep cache lines short
Avoid mis-speculation
The End of The von Neumann Model
“Instructions are executed one at a time…”
We have relied on this idea for 60 years
Now it (and things it brought) must change
Serial programming is easier than parallel
programming, at least for the moment
But serial programs are now slow programs
We need parallel programming paradigms
that will make all programmers successful
The stakes for our field’s vitality are high
Computing must be reinvented
DEFINITIONS
Asymptotic Notation
Quantities are often meaningful only within a
constant factor
Algorithm performance analyses, for example
f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |cg(n)|
f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |cg(n)|
f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
Speedup, Time, and Work
The speedup of a computation is how much
faster it runs in parallel compared to serially
If one processor takes T1 and p of them take Tp
then the p-processor speedup is Sp = T1/Tp
The work done is the number of operations
performed, either serially or in parallel
W1 = O(T1) is the serial work, Wp the parallel work
We say a parallel computation is work-optimal if
Wp = O(W1) = O(T1)
We say a parallel computation is time-optimal if
Tp = O(W1/p) = O(T1/p)
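A toy illustration of these definitions with made-up timings (my numbers, not the talk's):
#include <cstdio>

int main() {
    // Hypothetical measurements for the same problem instance.
    double T1 = 40.0;   // serial time in seconds
    double Tp = 5.5;    // time on p processors
    int    p  = 8;

    double Sp = T1 / Tp;                        // speedup, ~7.3
    std::printf("S_p = %.2f (ideal: %d)\n", Sp, p);

    // Work-optimal: W_p = O(W_1), i.e. parallelism adds at most a constant
    //               factor of extra operations.
    // Time-optimal: T_p = O(T_1 / p); here T_1/p = 5.0 s, and 5.5 s is
    //               within a small constant factor of that.
    return 0;
}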
Latency, Bandwidth, & Concurrency
In any system that moves items from input to
output without creating or destroying them,
latency × bandwidth = concurrency
Queueing theory calls this result Little’s law
[Figure: items flowing through a pipeline with bandwidth = 2, latency = 3, and hence concurrency = 6]
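Reading the figure's numbers through the law: two items per time unit times a latency of 3 means 6 items must be in flight. The same law sizes the memory-access parallelism needed to hide latency; a sketch with assumed numbers (100 ns latency, 50 GB/s bandwidth, 64-byte lines), not taken from the slides:
#include <cstdio>

int main() {
    // Little's law: concurrency = latency x bandwidth.
    double latency_s   = 100e-9;   // assumed memory latency
    double bytes_per_s = 50e9;     // assumed sustained bandwidth
    double line_bytes  = 64.0;     // cache-line size

    double lines_per_s = bytes_per_s / line_bytes;   // ~0.78e9 lines/s
    double concurrency = latency_s * lines_per_s;    // ~78 outstanding misses
    std::printf("cache-line requests that must be in flight: %.0f\n", concurrency);
    return 0;
}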
ARCHITECTURE AND PROGRAMMING
Parallel Processor Architecture
SIMD: Each instruction operates concurrently
on multiple data items
MIMD: Multiple instruction sequences execute
concurrently
Concurrency is expressible in space or time
Spatial: the hardware is replicated
Temporal: the hardware is pipelined
Type   Spatial concurrency                          Temporal concurrency
SIMD   PE arrays and VLIW: ILLIAC-IV, Elbrus 3M     Vector pipelined systems: TI ASC, Cray 1
MIMD   Multiprocessors: iPSC/1, PS-2000             Multithreaded systems: HEP, MARS-M, Niagara
Trends in Parallel Processors
Today’s chips are spatial MIMD at top level
Temporal MIMD is also used
SIMD is tending back toward spatial
Intel’s Larrabee combines all three
Temporal concurrency is easily “adjusted”
To get enough performance, even in PCs
Vector length or number of hardware contexts
Temporal concurrency tolerates latency
Memory latency in the SIMD case
For MIMD, branches and synchronization also
Parallel Memory Architecture
A shared memory system is one in which any
processor can address any memory location
Quality of access can be either uniform (UMA) or
nonuniform (NUMA), in latency and/or bandwidth
A distributed memory system is one in which
processors can’t address most of memory
The disjoint memory regions and their associated
processors are usually called nodes
A cluster is a distributed memory system with
more than one processor per node
Nearly all HPC systems are clusters
Parallel Programming Variations
Data Parallelism and Task Parallelism
Functional Style and Imperative Style
Shared Memory and Message Passing
…and more we won’t have time to look at
A parallel application may use all of them
Data Parallelism and Task Parallelism
A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
Applying the same function to every element of a data sequence, for example
A computation is task parallel when dissimilar independent sub-computations are done simultaneously
Controlling the motions of a robot, for example
It sounds like SIMD vs. MIMD, but isn’t quite
Some kinds of data parallelism need MIMD
Functional and Imperative Programs
A program is said to be written in (pure)
functional style if it has no mutable state
Computing = naming and evaluating expressions
Programs with mutable state are usually
called imperative because the state changes
must be done when and where specified:
while (z < x) { x = y; y = z; z = f(x, y);} return y;
Often, programs can be written either way:
let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
Shared Memory and Message Passing
Shared memory programs access data in a
shared address space
When to access the data is the big issue
Subcomputations therefore must synchronize
Message passing programs transmit data
between subcomputations
The sender computes a value and then sends it
The receiver receives a value and then uses it
Synchronization can be built in to communication
Message passing can be implemented very well
on shared memory architectures
Barrier Synchronization
A barrier synchronizes multiple parallel subcomputations by letting none proceed until all
have arrived
It is named after the barrier
used to start horse races
It guarantees everything before the barrier
finishes before anything after it begins
It is a central feature in several data-parallel
languages such as OpenMP
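A minimal sketch of a barrier in OpenMP (my example, not the deck's):
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        std::printf("thread %d: phase 1\n", t);
        #pragma omp barrier
        // No thread prints phase 2 until every thread has finished phase 1.
        std::printf("thread %d: phase 2\n", t);
    }
    return 0;
}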
Mutual Exclusion
This type of synchronization ensures only one subcomputation can do a thing at any time
If the thing is a code block, it is a critical section
It classically uses a lock: a data structure with which subcomputations can stop and start
Basic operations on a lock object L might be
Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
Release(L): yields L and unblocks some Acquire(L)
A lot has been written on these subjects
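In C++ terms, Acquire and Release map onto locking and unlocking a mutex; a minimal sketch with a shared counter as the protected "thing" (mine, not the slide's):
#include <mutex>
#include <thread>
#include <cstdio>

std::mutex L;         // the lock object
long counter = 0;     // shared state protected by L

void work() {
    for (int i = 0; i < 100000; ++i) {
        L.lock();     // Acquire(L): blocks until exclusive ownership is granted
        ++counter;    // critical section: one subcomputation at a time
        L.unlock();   // Release(L): lets some blocked Acquire(L) proceed
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::printf("counter = %ld\n", counter);   // always 200000
    return 0;
}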
Non-Blocking Synchronization
The basic idea is to achieve mutual exclusion
using memory read-modify-write operations
Most commonly used is compare-and-swap:
CAS(addr, old, new) reads memory at addr and if
it contains old then old is replaced by new
Arbitrary update operations at an addr require
{read old; compute new; CAS(addr, old, new);}
be repeated until the CAS operation succeeds
If there is significant updating contention at addr,
the repeated computation of new may be wasteful
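A sketch of that retry loop in C++, where compare_exchange plays the role of CAS (the update function is an arbitrary stand-in of mine):
#include <atomic>

std::atomic<int> value{0};   // the shared location at "addr"

void update() {
    int old_v = value.load();
    int new_v;
    do {
        new_v = 2 * old_v + 1;   // "compute new" from the snapshot of old
        // CAS: succeeds only if value still equals old_v; on failure old_v is
        // reloaded and new_v must be recomputed -- the possibly wasted work
        // mentioned above.
    } while (!value.compare_exchange_weak(old_v, new_v));
}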
Load Balancing
Some processors may be busier than others
To balance the workload, subcomputations can be scheduled on processors dynamically
A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
In guided self-scheduling, the chunk sizes shrink
Analogous imbalances can occur in memory
Overloaded memory locations are called hot spots
Parallel algorithms and data structures must be designed to avoid them
Imbalanced messaging is sometimes seen
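In OpenMP, self-scheduling and guided self-scheduling correspond to the dynamic and guided loop schedules; a small sketch in which the work per iteration is deliberately uneven (my example):
#include <cmath>

void process(double *a, int n) {
    // Chunks of iterations are handed out at run time; with "guided" the
    // chunks start large and shrink, trading imbalance against overhead.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) {
        int reps = 1 + i % 1000;          // uneven work per iteration
        for (int r = 0; r < reps; ++r)
            a[i] = std::sqrt(a[i] + 1.0);
    }
}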
EXAMPLES
A Data Parallel Example: Sorting
void sort(int *src, int *dst, int size, int nvals) {
    int i, j, t1[nvals], t2[nvals];
    for (j = 0; j < nvals; j++) {
        t1[j] = 0;
    }
    for (i = 0; i < size; i++) {
        t1[src[i]]++;
    }
    //t1[] now contains a histogram of the values
    t2[0] = 0;
    for (j = 1; j < nvals; j++) {
        t2[j] = t2[j-1] + t1[j-1];
    }
    //t2[j] now contains the origin for value j
    for (i = 0; i < size; i++) {
        dst[t2[src[i]]++] = src[i];
    }
}
When Is a Loop Parallelizable?
The loop instances must safely interleave
A way to do this is to only read the data
Another way is to isolate data accesses
Look at the first loop:
for (j = 0 ; j < nvals ; j++) {
t1[j] = 0;
}
The accesses to t1[] are isolated from each other
This loop can run in parallel “as is”
Isolating Data Updates
The second loop seems to have a problem:
for (i = 0 ; i < size ; i++) {
t1[src[i]]++;
}
Two iterations may access the same t1[src[i]]
If both reads precede both increments, oops!
A few ways to isolate the iteration conflicts:
Use an “isolated update” (lock prefix) instruction
Use an array of locks, perhaps as big as t1[]
Use non-blocking updates
Use a transaction
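For instance, the non-blocking option can be sketched with C++ atomics, making each increment a single isolated read-modify-write (the other options would use a lock-prefixed instruction, an array of locks, or a transactional construct):
#include <atomic>

// Histogram loop with isolated updates: concurrent iterations cannot lose
// counts because fetch_add is a single atomic read-modify-write.
void histogram(const int *src, int size, std::atomic<int> *t1) {
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        t1[src[i]].fetch_add(1, std::memory_order_relaxed);
}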
Dependent Loop Iterations
The 3rd loop is an interesting challenge:
for (j = 1 ; j < nvals ; j++) {
t2[j] = t2[j-1] + t1[j-1];
}
Each iteration depends on the previous one
This loop is an example of a prefix computation
If • is an associative binary operation on a set
S, the • - prefixes of the sequence x0 ,x1 ,x2 …
of values from S is x0,x0•x1,x0•x1•x2 …
7/17/2015
Prefix computations are often known as scans
Scan can be done in efficiently in parallel
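For reference, the third loop is exactly an exclusive prefix sum, which C++17 exposes directly; a sketch (the parallel policy requires <execution> and a standard library that implements it):
#include <numeric>
#include <execution>

// t2[j] = t1[0] + t1[1] + ... + t1[j-1], computed as a parallel scan.
void prefix_sums(const int *t1, int *t2, int nvals) {
    std::exclusive_scan(std::execution::par, t1, t1 + nvals, t2, 0);
}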
Cyclic Reduction
Each vertical line represents a loop iteration
The associated sequence element is to its right
[Figure: the values held by iterations a through g after each step of the scan]
Step 0: a, b, c, d, e, f, g
Step 1: a, ab, bc, cd, de, ef, fg
Step 2: a, ab, abc, abcd, bcde, cdef, defg
Step 3: a, ab, abc, abcd, abcde, abcdef, abcdefg
On step k of the scan, iteration j prefixes its own value with the value from iteration j − 2^k
Applications of Scan
Linear recurrences like the third loop
Polynomial evaluation
String comparison
High-precision addition
Finite automata
Each xi is the next-state function given the ith input
symbol and • is function composition
APL compress
When only the final value is needed, the
computation is called a reduction instead
It’s a little bit cheaper than a full scan
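Reductions also have direct language support; for example, in OpenMP (a sketch of mine, not from the deck):
// Sum reduction: each thread keeps a private partial sum, and the partial
// sums are combined once at the end -- no full scan is needed.
double total(const double *x, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}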
More Iterations n Than Processors p
Wp = 3n + O(p log p), Tp = 3n / p + O(log p)
OpenMP
OpenMP is a widely implemented extension to C, C++, and Fortran for data† parallelism
It adds directives to serial programs
A few of the more important directives:
#pragma omp parallel for <modifiers>
<for loop>
#pragma omp atomic
<binary op=, ++, or -- statement>
#pragma omp critical <name>
<structured block>
#pragma omp barrier
†And perhaps task parallelism soon
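A small sketch of the first two directives applied to the earlier histogram loops (my example, not from the deck):
// Zero the histogram in parallel, then build it with isolated increments.
void histogram(const int *src, int size, int *t1, int nvals) {
    #pragma omp parallel for
    for (int j = 0; j < nvals; j++)
        t1[j] = 0;

    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        #pragma omp atomic
        t1[src[i]]++;
    }
}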
The Sorting Example in OpenMP
Only the third “scan” loop is a problem
We can at least do this loop “manually”:
int nt = omp_get_max_threads();
int ta[nt], tb[nt];
#pragma omp parallel
{
    int myt = omp_get_thread_num();
    //Set ta[myt] = local sum of nvals/nt elements of t1[]
    #pragma omp barrier
    for (int k = 1; k < nt; k *= 2) {
        tb[myt] = ta[myt];
        #pragma omp barrier
        if (myt >= k) ta[myt] += tb[myt - k];  // every thread still reaches the barriers
        #pragma omp barrier
    }
    int fix = (myt > 0) ? ta[myt - 1] : 0;
    //Set nvals/nt elements of t2[] to fix + local scan of t1[]
}
Parallel Patterns Library (PPL)
PPL is a Microsoft C++ library built on top of the
ConcRT user-mode scheduling runtime
It supports mixed data- and task-parallelism:
parallel_for, parallel_for_each, parallel_invoke
agent, send, receive, choice, join, task_group
Parallel loops use C++ lambda expressions:
parallel_for(0, nvals, [&t1](int j) {
    t1[j] = 0;
});
Updates can be isolated using intrinsic functions
(void)_InterlockedIncrement(&t1[src[i]]);
Microsoft and Intel plan to unify PPL and TBB
Dynamic Resource Management
PPL programs are written for an arbitrary
number of processors, could be just one
Load balancing is mostly done by work stealing
There are two kinds of work to steal:
Work that is unblocked and waiting for a processor
Work that is not yet started and is potentially parallel
Work of the latter kind will be done serially
unless it is first stolen by another processor
This makes recursive divide and conquer easy
There is no concern about when to stop parallelism
A Quicksort Example
void quicksort(vector<int>::iterator first, vector<int>::iterator last) {
    if (last - first < 2) { return; }
    int pivot = *first;
    auto mid1 = partition(first, last, [=](int e) { return e < pivot; });
    auto mid2 = partition(mid1, last, [=](int e) { return e == pivot; });
    parallel_invoke(
        [=] { quicksort(first, mid1); },
        [=] { quicksort(mid2, last); }
    );
}
LINQ and PLINQ
LINQ (Language Integrated Query) extends
the .NET languages C#, Visual Basic, and F#
A LINQ query is really just a functional monad
It queries databases, XML, or any IEnumerable
PLINQ is a parallel implementation of LINQ
Non-isolated functions must be avoided
Otherwise it is hard to tell the two apart
A PLINQ Example
var q = from n in names.AsParallel()
        where n.Name == queryInfo.Name &&
              n.State == queryInfo.State &&
              n.Year >= yearStart &&
              n.Year <= yearEnd
        orderby n.Year ascending
        select n;
Message Passing Interface (MPI)
MPI is a widely used message passing library
for distributed memory HPC systems
Some of its basic functions:
MPI_Init
MPI_Comm_rank
MPI_Comm_size
MPI_Send
MPI_Recv
A few of its “collective communication” functions:
MPI_Barrier
MPI_Gather
MPI_Allgather
MPI_Alltoall
MPI_Reduce
MPI_Allreduce
MPI_Scan
MPI_Exscan
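A minimal sketch of the basic functions in use, assuming at least two ranks (error handling omitted; my example, not from the deck):
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which node am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many nodes are there?

    if (rank == 0) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 of %d received %d\n", size, msg);
    }

    MPI_Finalize();
    return 0;
}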
Sorting in MPI
Roughly, it could work like this on n nodes:
Run the first two loops locally
Use MPI_Allreduce to build a global histogram
Run the third loop (redundantly) at every node
Allocate n value intervals to nodes (redundantly)
Balancing the data per node as well as possible
Run the fourth loop using the local histogram
Use MPI_Alltoall to redistribute the data
Merge the n sorted subarrays on each node
Collective communication is expensive
But sorting needs it (see the Memory Wall slide)
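A sketch of the global-histogram step above, assuming every node has already built a local histogram t1[] of length nvals:
#include <mpi.h>

// Element-wise sum of the per-node histograms; every node gets the result.
void global_histogram(const int *local_t1, int *global_t1, int nvals) {
    MPI_Allreduce(local_t1, global_t1, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}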
Another Way to Sort in MPI
The Samplesort algorithm is like Quicksort
It works like this on n nodes:
Sort the local data on each node independently
Take s samples of the sorted data on each node
Use MPI_Allgather to send all nodes all samples
Compute n − 1 splitters (redundantly) on all nodes
Balancing the data per node as well as possible
Use MPI_Alltoall to redistribute the data
Merge the n sorted subarrays on each node
CONCLUSIONS
Parallel Computing Has Arrived
We must rethink how we write programs
Other things will also need to change, and we are definitely doing that:
Architecture
Operating systems
Algorithms
Theory
Application software
We are seeing the biggest revolution in
computing since its very beginnings