Frédéric Gava
The BSP model
Bulk-Synchronous Parallel
Background
Parallel programming, from implicit to explicit (figure, with BSP placed among these approaches):
• Automatic parallelization
• Skeletons
• Data-parallelism
• Parallel extensions
• Concurrent programming
The BSP model
BSP architecture: a set of processor/memory pairs (P/M), a network, and a unit of synchronization.
Characterized by:
• p: number of processors
• r: processor speed
• L: time of a global synchronization
• g: time of a communication phase in which at most 1 word is sent or received by each processor
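Model of execution
Each super-step i consists of three phases:
• Local computing on each processor
• Global (collective) communications between processors
• Global synchronization: exchanged data available for the next super-step
$\mathrm{Cost}(i) = \left(\max_{0 \le x < p} w^i_x\right) + h_i \times g + L$
(figure: timeline starting at the beginning of super-step i, with durations $w_i$, $g h_i$, $L$ for super-step i, then $w_{i+1}$, $g h_{i+1}$, $L$ for super-step i+1)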
Example of a BSP machine
Cost model
• Cost(program) = sum of the costs of the super-steps
• BSP computation: scalable, portable, predictable
• BSP algorithm design: minimising W (computation time), H (communications), S (number of super-steps)
• Cost(program) = W + g*H + S*L
• g and L can be measured (benchmark), hence the possibility of prediction
• Main principles:
– Load-balancing minimises W
– Data locality minimises H
– Coarse granularity minimises S
• In general, data locality good, network locality bad!
• Typically, problem size n ≫ p (slackness)
• Input/output distribution even, but otherwise arbitrary
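The cost model can be turned into a small calculator. The following C sketch is only illustrative: the `superstep` struct and the two function names are assumptions introduced here, and g and L are expected in the same time unit as the computation times.

```c
#include <stddef.h>

/* One super-step as seen by the cost model (hypothetical type). */
typedef struct {
    double w;  /* longest local computation, max over the processors */
    double h;  /* largest number of words sent or received by one processor */
} superstep;

/* Cost of a single super-step: w + h*g + L. */
double superstep_cost(superstep s, double g, double L)
{
    return s.w + s.h * g + L;
}

/* Cost of a program: sum over its S super-steps, i.e. W + g*H + S*L. */
double program_cost(const superstep *steps, size_t S, double g, double L)
{
    double total = 0.0;
    for (size_t i = 0; i < S; ++i)
        total += superstep_cost(steps[i], g, L);
    return total;
}
```

With measured values of g and L, `program_cost` gives the predicted running time that the benchmarks can then be compared against.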
A libertarian model
No master :
Homogeneous power of the nodes
Global (collective) decision procedure instead
No god :
Confluence (no divine intervention)
Cost predictable
Scalable performances
Practiced but confined
Advantages and drawbacks
• Advantages :
– Allows cost prediction and is deadlock-free
– Structures the execution and thus allows bulk-sending, which can be very efficient (sending one file of 1000 bytes performs better than sending 1000 files of 1 byte) on many architectures (multi-cores, clusters, etc.)
– Abstract architecture = portable
– …?
• Drawbacks :
– Some algorithmic patterns don’t fit well in the BSP model: pipeline, etc.
– Some problems are difficult (or impossible) to fit to a coarse-grained execution model (fine-grained parallelism)
– Abstract architecture = does not exploit some efficient possibilities of some architectures (cluster of multi-cores, grid) and thus needs other libraries or models of execution
– …?
Example : broadcast
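Direct broadcast (one super-step), n words sent to each of the p processors:
BSP cost = png + L
Broadcast with 2 super-steps:
BSP cost = 2ng + 2L (about ng + L for each of the two super-steps)
(figure: processors 0, 1, 2)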
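The two broadcast costs can be compared numerically. This is a minimal sketch, assuming the two-super-step variant is the usual one (the root scatters n/p words to each processor, then a total exchange of the pieces), so that each of its super-steps costs about ng + L. The function names are hypothetical.

```c
/* Direct broadcast: the root sends the n words to each of the p
   processors in one super-step, so h is about p*n. */
double direct_broadcast_cost(double p, double n, double g, double L)
{
    return p * n * g + L;
}

/* Two-super-step broadcast (assumed scatter + total exchange):
   h is about n in each super-step, and two synchronizations are paid. */
double two_phase_broadcast_cost(double n, double g, double L)
{
    return 2.0 * (n * g + L);
}
```

Under these assumptions the two-super-step version wins as soon as (p − 2)·n·g exceeds L, that is, for large messages or many processors.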
BSP algorithms
• Matrices: multiplication, inversion, decomposition, linear algebra, etc.
• Sparse matrices: idem.
• Graphs: shortest path, decomposition, etc.
• Geometry: Voronoi diagram, intersection of polygons, etc.
• FFT: Fast Fourier Transform
• Pattern matching
• Etc.
Parallel prefixes
If we suppose an associative operator (+):
a+(b+c)=(a+b)+c or better
a+(b+(c+d))=(a+b)+(c+d)
Example: (a+b) is computed on a processor and (c+d) on another processor
Parallel Prefixes
Classical log(p) super-steps method (figure: processors 0 to 3):
Cost = log(p) × (Time(op) + Size(d)×g + L)
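The logarithmic method can be simulated sequentially. The sketch below assumes the usual doubling scheme (at each super-step, processor i receives the value of processor i − d and combines it with its own, with d = 1, 2, 4, …); the function name is hypothetical and addition stands in for the associative operator.

```c
#include <stdlib.h>
#include <string.h>

/* Simulates the prefix computation on p >= 1 "processors", each holding
   one value in v[]. On return v[i] = v[0] + ... + v[i]. Returns the
   number of super-steps used, or -1 if memory allocation fails. */
int prefix_sum_log_steps(long *v, int p)
{
    int steps = 0;
    long *prev = malloc((size_t)p * sizeof *prev);
    if (prev == NULL)
        return -1;
    for (int d = 1; d < p; d *= 2) {
        /* Communication: processor i - d sends its current value to i. */
        memcpy(prev, v, (size_t)p * sizeof *prev);
        /* Computation: received value on the left, own value on the right,
           which keeps the order for non-commutative operators. */
        for (int i = d; i < p; ++i)
            v[i] = prev[i - d] + v[i];
        ++steps;
    }
    free(prev);
    return steps;
}
```

For p = 8 the loop runs 3 times, which matches the log(p) factor of the cost formula.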
Parallel Prefixes
Divide-and-conquer method (figure: processors 0 to 3)
Our parallel machine
• Cluster of PCs
– Pentium IV 2.8 GHz
– 512 MB RAM
• A front-end Pentium IV 2.8 GHz, 512 MB RAM
• Gigabit Ethernet cards and switch
• Ubuntu 7.04 as OS
Our BSP Parameters ‘g’
Our BSP Parameters ‘L’
How to read benchmarks
• There are many ways to publish benchmarks:
– Tables
– Graphics
• The goal is to say « it is a good parallel method, see my benchmarks », but it is often easy to arrange the presentation of the graphics to hide the problems
• Using graphics (from the presentation where problems are simplest to hide to the hardest):
1) Increase the size of data for some fixed numbers of processors
2) Increase the number of processors for a typical size of data
3) Acceleration, i.e. Time(seq)/Time(par)
4) Efficiency, i.e. Acceleration/Number of processors
5) Increase the number of processors and the size of the data
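The two derived measures are simple ratios. A minimal C sketch, with hypothetical function names and times expressed in any common unit:

```c
/* Acceleration (speedup): sequential time divided by parallel time. */
double acceleration(double time_seq, double time_par)
{
    return time_seq / time_par;
}

/* Efficiency: acceleration divided by the number of processors.
   A value of 1.0 means a perfectly linear acceleration. */
double efficiency(double time_seq, double time_par, int processors)
{
    return acceleration(time_seq, time_par) / processors;
}
```

For instance, a program taking 120 s sequentially and 20 s on 8 processors has an acceleration of 6 and an efficiency of 0.75.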
Increase number of processors
Acceleration
Efficiency
Increase data and processors
Super-linear acceleration ?
• Better than the theoretical acceleration. Possible if the data fit better in cache memories than in RAM, because each processor holds only a small amount of data.
• Why this impact of caches? Mainly, each processor has a small memory called the cache. Access to this memory is (almost) twice as fast as RAM accesses.
• Take, for example, the multiplication of matrices
“Fast” multiplication
• A straightforward C implementation of res=mult(A,B) (of size N*N) can look like this:
/* res must be zero-initialised beforehand; i, j, k are ints */
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += a[i][k] * b[k][j];
• Consider the following equation:
$(AB)_{ij} = \sum_{k=0}^{N-1} a_{ik}\, b_{kj} = \sum_{k=0}^{N-1} a_{ik}\, (b^{T})_{jk}$
where “T” is the transposition of matrix “b”
“Fast” multiplication
• One can implement this equation in C as:
double tmp[N][N];
/* tmp = transpose of b, so that the inner loop reads both operands row by row */
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    tmp[i][j] = b[j][i];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      res[i][j] += a[i][k] * tmp[j][k];
where tmp is the transpose of “b”
• This new multiplication is 2 times faster. With other cache optimisations, one can obtain a program that is 64 times faster without really modifying the algorithm.
More complicated examples
N-body problem
Presentation
• We have a set of bodies
– coordinates in 2D or 3D
– point mass
• The classic N-body problem is to calculate the gravitational energy of N point masses, that is:
$E = -\sum_{1 \le i < j \le N} \frac{G\, m_i\, m_j}{\lVert r_i - r_j \rVert}$
• Quadratic complexity…
• In practice, N is very big and sometimes it is impossible to keep the set in main memory
Parallel methods
• Each processor has a sub-part of the original set
• Parallel method on each processor:
1) compute local interactions
2) compute interactions with other point masses
3) parallel prefixes of the local interactions
• For 2), simple parallel methods:
– using a total exchange of the sub-sets
– using a systolic loop
Cost of the systolic method :
Systolic loop
(figure: processors 0, 1, 2, 3)
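The systolic loop can be checked against the quadratic method with a sequential simulation. The sketch below makes several assumptions that are not in the slides: the energy is the usual sum over pairs of −m_i·m_j/distance with the gravitational constant taken as 1, bodies are 2D and pairwise distinct, p divides n, and the traveling blocks are addressed by index instead of being really sent. `sqrt_newton` is a stand-in for `sqrt()` so that the sketch has no library dependency.

```c
typedef struct { double x, y, m; } body;   /* 2D point mass */

/* Stand-in for sqrt() from <math.h> (Newton iteration). */
static double sqrt_newton(double v)
{
    double r = v > 1.0 ? v : 1.0;
    for (int k = 0; k < 60; ++k)
        r = 0.5 * (r + v / r);
    return r;
}

/* Energy of one pair: -m_a * m_b / distance (G assumed equal to 1). */
static double pair_energy(body a, body b)
{
    double dx = a.x - b.x, dy = a.y - b.y;
    return -a.m * b.m / sqrt_newton(dx * dx + dy * dy);
}

/* Reference: quadratic sum over all pairs i < j. */
double energy_direct(const body *s, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            e += pair_energy(s[i], s[j]);
    return e;
}

/* Simulated systolic loop: processor q owns s[q*b .. q*b + b - 1]. */
double energy_systolic(const body *s, int n, int p)
{
    int b = n / p;
    double total = 0.0;
    for (int q = 0; q < p; ++q) {
        const body *mine = s + q * b;
        /* 1) local interactions */
        double local = energy_direct(mine, b);
        /* 2) p - 1 super-steps: at step t, processor q holds the block
              that started on processor (q - t) mod p */
        double cross = 0.0;
        for (int t = 1; t < p; ++t) {
            const body *visiting = s + ((q - t + p) % p) * b;
            for (int i = 0; i < b; ++i)
                for (int j = 0; j < b; ++j)
                    cross += pair_energy(mine[i], visiting[j]);
        }
        /* 3) reduction; each cross pair is seen by both of its owners,
              hence the factor 1/2 */
        total += local + 0.5 * cross;
    }
    return total;
}
```

On a real BSP machine, each of the p − 1 steps is one super-step in which every processor sends its traveling block of n/p bodies to its neighbour.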
Benchs and BSP predictions
Parallel methods
• There exist many better algorithms than this one
• Especially when computing all interactions is not needed (distant molecules)
• One classic algorithm is to divide the space into sub-spaces, to compute the n-body problem recursively on each sub-space (so there are sub-sub-spaces) and to consider only the interactions between these sub-spaces. The recursion stops when there are at most two molecules in the sub-space
• That introduces n*log(n) computations
Sieve of Eratosthenes
Presentation
• Classic : find the prime numbers by enumeration
• Pure functional implementation using lists
• Complexity : n×log(n)/log(log(n))
• We used :
– elim : int list → int → int list, which deletes from a list all the integers that are multiples of the given parameter
– final_elim : int list → int list → int list, which iterates elim
– seq_generate : int → int → int list, which returns the list of integers between 2 bounds
– select : int → int list → int list, which gives the first prime numbers of a list.
Parallel methods
• Simple parallel methods :
– using a kind of scan
– using a direct sieve
– using a recursive one
• Different partitions of data (example: the integers 11 to 25 on 3 processors)
– per block (for scan) :
processor 0: 11,12,13,14,15
processor 1: 16,17,18,19,20
processor 2: 21,22,23,24,25
– cyclic distribution :
processor 0: 11,14,17,20,23
processor 1: 12,15,18,21,24
processor 2: 13,16,19,22,25
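The two partitions amount to two ways of mapping the rank k of an element (k = 0 for the first integer) to a processor. A small sketch, with hypothetical function names and assuming that p divides n for the block case:

```c
/* Block distribution: the n elements are cut into p consecutive blocks. */
int block_owner(int k, int n, int p)
{
    return k / (n / p);
}

/* Cyclic distribution: elements are dealt out one by one, round-robin. */
int cyclic_owner(int k, int p)
{
    return k % p;
}
```

With the 15 integers 11 to 25 on 3 processors, the integer x has rank k = x − 11: 16 goes to processor 1 per block, while 14 goes to processor 0 in the cyclic distribution.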
Scan version
• Method using a scan :
– Each processor computes a local sieve (the processor 0
contains thus the first prime numbers)
– then our scan is applied and we eliminate on processor i the integers that are multiples of integers of processors i−1, i−2, etc.
• Cost : as a scan (logarithmic)
Direct version
• Method :
– each processor computes a local sieve
– then the integers that are less than $\sqrt{n}$ are globally exchanged and a new sieve is applied to this list of integers (thus giving prime numbers)
– each processor eliminates, in its own list, the integers that are multiples of these first primes
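The exchange and elimination phases of the direct method can be simulated sequentially. This sketch is an assumption-laden illustration, not the implementation used for the benchmarks: it uses arrays instead of lists, a block distribution of the integers 2..n, a bound of √n for the exchanged integers, and it skips the initial local sieve (which only reduces the number of candidates). The function name is hypothetical.

```c
#include <stdlib.h>

/* Marks the primes of [2, n] in is_prime (n + 1 entries) as p simulated
   processors would do. Returns the number of primes, or -1 if memory
   allocation fails. */
int direct_sieve(int n, int p, unsigned char *is_prime)
{
    int root = 0, nsmall = 0, count = 0;
    while ((long)(root + 1) * (root + 1) <= n)
        ++root;                               /* root = floor(sqrt(n)) */
    int *small = malloc((size_t)(root + 1) * sizeof *small);
    if (small == NULL)
        return -1;
    /* Exchanged list: primes among 2..root (trial division for brevity). */
    for (int x = 2; x <= root; ++x) {
        int prime = 1;
        for (int k = 0; k < nsmall && prime; ++k)
            if (x % small[k] == 0)
                prime = 0;
        if (prime)
            small[nsmall++] = x;
    }
    for (int x = 0; x <= n; ++x)
        is_prime[x] = 0;
    int len = (n - 1 + p - 1) / p;            /* block length, rounded up */
    for (int q = 0; q < p; ++q) {             /* one iteration per processor */
        int lo = 2 + q * len;
        int hi = lo + len - 1 < n ? lo + len - 1 : n;
        for (int x = lo; x <= hi; ++x)
            is_prime[x] = 1;
        for (int k = 0; k < nsmall; ++k) {
            int d = small[k];
            int m = ((lo + d - 1) / d) * d;   /* first multiple of d >= lo */
            if (m < d * d)
                m = d * d;                    /* keeps d itself in the list */
            for (; m <= hi; m += d)
                is_prime[m] = 0;
        }
        for (int x = lo; x <= hi; ++x)
            count += is_prime[x];
    }
    free(small);
    return count;
}
```

Only the small primes travel over the network, so the communication volume stays small compared with the size of each block.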
Inductive version
• Recursive method by induction over n :
– We suppose that the inductive step gives the first primes, those up to $\sqrt{n}$
– We perform a total exchange on them to eliminate the non-primes.
– The end of this induction comes from the BSP cost: we stop when n is small enough that the sequential method is faster than the parallel one
• Cost :
Benchs and BSP predictions
Parallel sample sorting
Presentation
• Each processor holds a set of data (array, list, etc.)
• The goal is that :
– data on each processor are ordered
– data on processor i are smaller than data on processor i+1
– good balancing
• Parallel sorting is not very efficient due to too many communications
• But it is useful and more efficient than gathering all the data on one processor and then sorting them
BSP parallel sorting
Tiskin’s Sampling Sort
Initial data:
processor 0: 1,11,16,7,14,2,20
processor 1: 18,9,13,21,6,12,4
processor 2: 15,5,19,3,17,8,10
Local sort:
processor 0: 1,2,7,11,14,16,20
processor 1: 4,6,9,12,13,18,21
processor 2: 3,5,8,10,15,17,19
Select first samples (p+1 elements with at least first and last ones):
processor 0: 1,7,14,20
processor 1: 4,9,13,21
processor 2: 3,8,15,19
Total exchange of first samples (on each processor):
1,7,14,20,4,9,13,21,3,8,15,19
Local sort of samples (on each processor):
1,3,4,7,8,9,13,14,15,19,20,21
Select second samples (p+1 elements with at least first and last ones); they delimit one interval per processor, and each processor splits its sorted data accordingly:
             interval for processor 0 | interval for processor 1 | interval for processor 2
processor 0: 1,2,7                    | 11,14                    | 16,20
processor 1: 4,6                      | 9,12,13                  | 18,21
processor 2: 3,5                      | 8,10                     | 15,17,19
Fusion of received and sorted elements:
processor 0: 1,2,3,4,5,6,7
processor 1: 8,9,10,11,12,13,14
processor 2: 15,16,17,18,19,20,21
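The trace can be replayed with a sequential simulation. The sketch below is written in the spirit of the sampling sort shown in the trace and is not claimed to be Tiskin's exact algorithm: the sample positions k·(m−1)/p and k·(S−1)/p (with S the total number of samples) are assumptions chosen because they reproduce the samples of the example, `qsort` replaces the final fusion for brevity, and the function name is hypothetical.

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* p >= 1 simulated processors hold m >= 1 values each: processor q owns
   data[q*m .. q*m + m - 1]. On return data is globally sorted and
   counts[q] is the number of values received by processor q.
   Returns 0 on success, -1 if memory allocation fails. */
int sample_sort_sim(int *data, int p, int m, int *counts)
{
    int ns = p * (p + 1), pos = 0;
    int *samples = malloc((size_t)ns * sizeof *samples);
    int *out = malloc((size_t)p * (size_t)m * sizeof *out);
    if (samples == NULL || out == NULL) {
        free(samples);
        free(out);
        return -1;
    }
    /* Local sort, then p + 1 regularly spaced first samples. */
    for (int q = 0; q < p; ++q) {
        qsort(data + q * m, (size_t)m, sizeof *data, cmp_int);
        for (int k = 0; k <= p; ++k)
            samples[q * (p + 1) + k] = data[q * m + k * (m - 1) / p];
    }
    /* Total exchange of the samples, sorted identically everywhere. */
    qsort(samples, (size_t)ns, sizeof *samples, cmp_int);
    /* Second samples bound the interval of each destination d. */
    for (int d = 0; d < p; ++d) {
        int lower = samples[d * (ns - 1) / p];
        int upper = samples[(d + 1) * (ns - 1) / p];
        counts[d] = 0;
        for (int q = 0; q < p; ++q)          /* q sends its share to d */
            for (int i = 0; i < m; ++i) {
                int x = data[q * m + i];
                if ((d == 0 || x > lower) && (d == p - 1 || x <= upper)) {
                    out[pos++] = x;
                    ++counts[d];
                }
            }
        /* Fusion of the received sorted runs (qsort for brevity). */
        qsort(out + pos - counts[d], (size_t)counts[d], sizeof *out, cmp_int);
    }
    for (int i = 0; i < p * m; ++i)
        data[i] = out[i];
    free(samples);
    free(out);
    return 0;
}
```

Run on the 21 values of the example with p = 3, each simulated processor ends up with 7 values, as in the last step of the trace.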
Benchs and BSP predictions
Matrix multiplication
Naive parallel algorithm
• We have two matrices A and B of size n×n
• We suppose
• Each matrix is distributed by blocks of size
• That is, element A(i,j) is on processor
• Algorithm :
Each processor reads twice one block from another processor
Benchs and BSP predictions