Introduction to PETSc

VIGRE Seminar, Wednesday, November 8, 2006
Parallel Computing
How (basically) does it work?
• Assign each processor a number
• The same program goes to all
• Each uses separate memory
• They pass information back and forth as necessary
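
As a minimal sketch of this model (not from the original slides; plain MPI in C), every processor runs the same program, learns its own number, and can act on it:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);               /* the same program starts on every processor */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this processor's number */
  MPI_Comm_size(MPI_COMM_WORLD, &size); /* how many processors there are */
  printf("I am process %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}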
Parallel Computing
Example 1: Matrix-Vector Product
The matrix and the vector are inputs into the program. The three processors (0, 1, 2) start with empty memory.
Parallel Computing
Example 1: Matrix-Vector Product
The control node (0) reads in the matrix and distributes the rows amongst the processors.
0: (a, b, c)
1: (d, e, f)
2: (g, h, i)
Parallel Computing
Example 1: Matrix-Vector Product
The control node also sends the vector to each processor's memory.
0: (a, b, c) ; (j, k, l)
1: (d, e, f) ; (j, k, l)
2: (g, h, i) ; (j, k, l)
Parallel Computing
Example 1: Matrix-Vector Product
Each processor computes its own dot product.
0: (a, b, c) • (j, k, l) = aj+bk+cl
1: (d, e, f) • (j, k, l) = dj+ek+fl
2: (g, h, i) • (j, k, l) = gj+hk+il
Parallel Computing
Example 1: Matrix-Vector Product
The processors send their results to the control node, which outputs the vector (aj+bk+cl, dj+ek+fl, gj+hk+il).
0: (a, b, c) • (j, k, l) = aj+bk+cl
1: (d, e, f) • (j, k, l) = dj+ek+fl
2: (g, h, i) • (j, k, l) = gj+hk+il
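
A hedged sketch of this example in C with plain MPI (the 3x3 values and names are illustrative, and it must be run on exactly 3 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  double A[9] = {1,2,3, 4,5,6, 7,8,9};  /* rows (a,b,c), (d,e,f), (g,h,i) */
  double x[3] = {1,1,1};                /* the vector (j,k,l) */
  double y[3], row[3], local;
  int    rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* the control node (0) distributes one row to each processor */
  MPI_Scatter(A, 3, MPI_DOUBLE, row, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  /* the control node sends the vector to each processor's memory */
  MPI_Bcast(x, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  /* each processor computes its own dot product */
  local = row[0]*x[0] + row[1]*x[1] + row[2]*x[2];
  /* the results are sent back to the control node, which outputs them */
  MPI_Gather(&local, 1, MPI_DOUBLE, y, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("(%g, %g, %g)\n", y[0], y[1], y[2]);
  MPI_Finalize();
  return 0;
}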
Parallel Computing
Example 2: Matrix-Vector Product
Suppose for memory reasons each processor only has part of the vector.
0: (a, b, c) ; j
1: (d, e, f) ; k
2: (g, h, i) ; l
Parallel Computing
Example 2: Matrix-Vector Product
Before the multiply, each processor sends the necessary information elsewhere.
0: (a, b, c) ; j ; (k from 1) ; (l from 2)
1: (d, e, f) ; (j from 0) ; k ; (l from 2)
2: (g, h, i) ; (j from 0) ; (k from 1) ; l
Parallel Computing
Example 2: Matrix-Vector Product
After the multiply, the space is freed again for other uses.
0: (a, b, c) ; j
1: (d, e, f) ; k
2: (g, h, i) ; l
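
The communication step of this example can be sketched with MPI_Allgather (a fragment meant for a program like the one above, assuming each of 3 processes holds one entry of the vector):

double mypiece;   /* this processor's entry: j, k, or l */
double xfull[3];  /* temporary space, freed for other uses after the multiply */

/* each processor sends its entry to the others and receives theirs */
MPI_Allgather(&mypiece, 1, MPI_DOUBLE, xfull, 1, MPI_DOUBLE, MPI_COMM_WORLD);
/* ... local dot product against xfull, as in Example 1 ... */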
Parallel Computing
Example 3: Matrix-Matrix Product
The previous case illustrates how to multiply matrices stored across multiple processors.
0: (a, b, c) ; (j, k, l)
1: (d, e, f) ; (m, n, o)
2: (g, h, i) ; (p, q, r)
Parallel Computing
Example 3: Matrix-Matrix Product
Each column is distributed for processing in turn.
0: 1) (a, b, c)•(j, m, p)=α   2) (a, b, c)•(k, n, q)=β   3) (a, b, c)•(l, o, r)=γ
1: 1) (d, e, f)•(j, m, p)=δ   2) (d, e, f)•(k, n, q)=ε   3) (d, e, f)•(l, o, r)=ζ
2: 1) (g, h, i)•(j, m, p)=η   2) (g, h, i)•(k, n, q)=θ   3) (g, h, i)•(l, o, r)=ι
Parallel Computing
Example 3: Matrix-Matrix Product
The result is a matrix with the same parallel row structure as the first matrix and the same column structure as the second.
0: (α, β, γ)
1: (δ, ε, ζ)
2: (η, θ, ι)
Parallel Computing
Example 3: Matrix-Matrix Product
The original entries could also have been sub-matrices, as long as they were compatible.
0: (α, β, γ)
1: (δ, ε, ζ)
2: (η, θ, ι)
Parallel Computing
Example 4: Block Diagonal Product
Suppose the second matrix is block diagonal.
0: (A, B, C) ; (J, 0, 0)
1: (D, E, F) ; (0, K, 0)
2: (G, H, I) ; (0, 0, L)
Parallel Computing
Example 4: Block Diagonal Product
Much less information needs to be passed between the processors.
0: 1) AJ=α   2) BK=β   3) CL=γ
1: 1) DJ=δ   2) EK=ε   3) FL=ζ
2: 1) GJ=η   2) HK=θ   3) IL=ι
Parallel Computing
When is it worth it to parallelize?
• There is a time cost associated with passing messages
• The amount of message passing is dependent on the problem and the program (algorithm)
• Therefore, the benefits depend more on the structure of the problem and the program than on the size/speed of the parallel network (diminishing returns).
Parallel Networks
How do I use multiple processors?
• This depends on the network, but…
• Most networks use some variation of PBS, a job scheduler, and mpirun or mpiexec.
• A parallel program needs to be submitted as a batch job.
Parallel Networks
• Suppose I have a program myprog, which gets data from data.dat, and which I call in the following fashion when only using one processor:
./myprog -f data.dat
I would write a file myprog.pbs that looks like the following:
Parallel Networks
#PBS -q compute (name of the processing queue [not necessary on all networks])
#PBS -N myprog (the name of the job)
#PBS -l nodes=2:ppn=1,walltime=00:10:00 (number of nodes, number of processes per node, and the maximum time to allow the program to run)
#PBS -o /home/me/mydir/myprog.out (where the output of the program should be written)
#PBS -e /home/me/mydir/myprog.err (where the error stream should be written)
These are the headers that tell the job scheduler how to handle your job.
Parallel Networks
Although what follows depends on the MPI software that the network runs, it should look something like this:
cd $PBS_O_WORKDIR (makes the processors run the program in the directory where myprog.pbs is saved)
mpirun -machinefile $PBS_NODEFILE -np 2 myprog -f mydata.dat (tells the MPI software which processors to use and how many processes to start; notice that command line arguments follow as usual)
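
Putting these pieces together, the whole myprog.pbs might look like this (the queue name, paths, and resource limits are site-specific placeholders):

#PBS -q compute
#PBS -N myprog
#PBS -l nodes=2:ppn=1,walltime=00:10:00
#PBS -o /home/me/mydir/myprog.out
#PBS -e /home/me/mydir/myprog.err
cd $PBS_O_WORKDIR
mpirun -machinefile $PBS_NODEFILE -np 2 myprog -f mydata.dat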
Parallel Networks
• Once the .pbs file is written, it can be submitted to the job scheduler with qsub:
qsub myprog.pbs
• You can check to see if your job is running with the command qstat.
Parallel Networks
• Some systems (but not all) will allow you to simulate running your program in parallel on one processor, which is useful for debugging:
mpirun -np 3 myprog -f mydata.dat
Parallel Networks
What parallel systems are available?
• RTC: Rice Terascale Cluster: 244 processors.
• ADA: Cray XD1: 632 processors.
• caamster: CAAM department exclusive: 8(?) processors.
PETSc
What do I use PETSc for?
• File I/O with “minimal” understanding of MPI
• Vector and matrix based data management (in particular: sparse)
• Linear algebra routines familiar from the famous serial packages
PETSc
• At the moment, ada and caamster (and harvey) have PETSc installed
• You can download and install PETSc on your own machine (requires cygwin for Windows) for educational and debugging purposes
PETSc
• PETSc builds on existing software (BLAS and LAPACK); which implementations to use can be specified at configuration
• Has a (slower) debugging configuration and a (faster, tacit) optimized configuration
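
For example, sketches of a debugging configure and an optimized configure (flag spellings vary between PETSc versions, so check configure's own help listing; the BLAS/LAPACK path is a placeholder):

./config/configure.py --with-debugging=1
./config/configure.py --with-debugging=0 --with-blas-lapack-dir=/usr/lib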
PETSc
• Installation comes with documentation, examples, and manual pages.
• The biggest part of learning how to use PETSc is learning how to use the manual pages.
PETSc
• It is extremely useful to have an environment variable PETSC_DIR in your shell of choice, which gives the path to the installation of PETSc, e.g.
PETSC_DIR=/usr/local/src/petsc-2.3.1-p13/
export PETSC_DIR
PETSc
Makefile
• You can pretty much copy/paste/modify the makefiles in the examples, but here is the basic setup:
PETSc
Makefile
(…) (other definitions for CFLAGS, etc.)
LOCDIR = ~/mydir
include ${PETSC_DIR}/bmake/common/base
(This is why it is useful to have this variable saved.)

myprog: myprog.o chkopts
	-${CLINKER} -o myprog myprog.o ${PETSC_LIB}
	${RM} myprog.o
PETSc
Headers
• #include "petsc.h" in all files, unless the routines that you use need more specific headers.
• How do you know? Consult the manual pages!
PETSc
Data Types
• PETSc has a slew of its own data types: PetscInt, PetscReal, PetscScalar, etc.
• Usually aliases of normal data types: PetscInt ~ int, PetscReal ~ double
• Safer to use for compatibility
PETSc
Usage in C/C++
• The top program should begin:
static char help[] = "Your message here.";
int main(int argc,char **argv){
  (… declarations)
  PetscInitialize(&argc,&argv,PETSC_NULL,help);
PETSc
Usage in C/C++
• The top program should end:
(…)
PetscFinalize();
return(0);
}
PETSc
Usage in C/C++
• When first programming, include the
following variable:
PetscErrorCode ierr;
• Where you’d call a PETSc routine,
Routine(arg);
write instead
ierr=Routine(arg);CHKERRQ(ierr);
PETSc
Usage in C/C++
• When you try to run your program, you will be informed of any problems with incompatible data types/dimensions/etc.
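
Assembled from the pieces above, a minimal complete program is sketched below (using the PETSc 2.3-era calling sequences shown in these slides):

#include "petsc.h"

static char help[] = "Your message here.";

int main(int argc,char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,PETSC_NULL,help);CHKERRQ(ierr);
  /* ... create objects, compute, destroy objects ... */
  ierr = PetscFinalize();CHKERRQ(ierr);
  return 0;
}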
PETSc
Data
• Any data type larger than a scalar has a Create and a Destroy routine.
• If you run ./myprog -log_summary, you get # created and # destroyed for each data type, to find memory leaks.
PETSc
Example: Vec
• Two types: global and local
• Dependent on function: do other processors need to see this data?
• Basic usage:
Vec X;
VecCreate([PETSC_COMM_WORLD/PETSC_COMM_SELF],&X);
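
A Vec made with the generic VecCreate still needs its sizes (and, typically, its type) set before use; a sketch of that path (n is an illustrative size):

Vec            X;
PetscErrorCode ierr;
PetscInt       n = 100;

ierr = VecCreate(PETSC_COMM_WORLD,&X);CHKERRQ(ierr);
ierr = VecSetSizes(X,PETSC_DECIDE,n);CHKERRQ(ierr);   /* local size chosen by PETSc, global size n */
ierr = VecSetFromOptions(X);CHKERRQ(ierr);            /* pick the seq/mpi type from the command line */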
PETSc
Example: Vec
• Advanced usage:
VecCreateSeq(PETSC_COMM_SELF,n,&X);
VecCreateSeqWithArray(PETSC_COMM_SELF,n,vals,&X);
VecLoad(instream,VECSEQ,&X);
VecCreateMPI(PETSC_COMM_WORLD,n,PETSC_DETERMINE,&X);
VecCreateMPI(PETSC_COMM_WORLD,PETSC_DECIDE,N,&X);
VecCreateMPIWithArray(PETSC_COMM_WORLD,n,N,vals,&X);
VecLoad(instream,VECMPI,&X);
VecDuplicate(Y,&X);
MatGetVecs(M,&X,PETSC_NULL);
MatGetVecs(M,PETSC_NULL,&X);
PETSc
Example: Vec
• If not created with an array or loaded from a file, values are still needed
• To copy the values of another Vec with the same parallel structure, use VecCopy(Y,X).
• To set all values to a single scalar value, use VecSet(X,alpha).
PETSc
Example: Vec
• There are other routines for more complicated ways to set values
• PETSc guards the block of data where the actual values are stored very closely
• An assembly routine must be called after these other routines (a sketch follows the list of routines below)
PETSc
Example: Vec
• Other routines:
VecSetValue
VecSetValueLocal (different indexing used)
VecSetValues
VecSetValuesLocal
VecSetValuesBlocked
VecSetValuesBlockedLocal
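
For example, a sketch of VecSetValues followed by the required assembly calls (the indices and values here are illustrative):

PetscInt    ix[2]   = {0, 1};
PetscScalar vals[2] = {3.0, 4.0};

ierr = VecSetValues(X,2,ix,vals,INSERT_VALUES);CHKERRQ(ierr);
/* the vector may not be used until it has been assembled */
ierr = VecAssemblyBegin(X);CHKERRQ(ierr);
ierr = VecAssemblyEnd(X);CHKERRQ(ierr);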
PETSc
Example: Vec
• Once a vector is assembled, there are routines for (almost) every function we could want from a vector: AXPY, dot product, absolute value, pointwise multiplication, etc.
• Call VecDestroy(X) to free its array when it isn’t needed anymore.
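
For instance (a sketch; X and Y are assembled Vecs with the same layout, and the argument forms of VecAXPY and VecDestroy have shifted between PETSc versions, so consult the manual pages for your installation):

PetscScalar dot;
PetscReal   nrm;

ierr = VecAXPY(Y,2.0,X);CHKERRQ(ierr);       /* Y <- 2*X + Y */
ierr = VecDot(X,Y,&dot);CHKERRQ(ierr);       /* dot product */
ierr = VecNorm(X,NORM_2,&nrm);CHKERRQ(ierr); /* 2-norm */
ierr = VecDestroy(X);CHKERRQ(ierr);          /* free the array */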
PETSc
Example: Mat
• Like Vec, a Mat can be global or local (MPI/Seq)
• A Mat can take on a large number of data structures to optimize * (multiply) and \ (solve), even though the same routine is used on all structures.
PETSc
Example: Mat
• Row compressed
• Block row compressed
• Symmetric block row compressed
• Block diagonal
• And even dense
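
As a hedged sketch (2.3-era calling sequences; the size, preallocation, and values are illustrative), here is a sequential row compressed (AIJ) matrix being created, assembled, and applied; MatMult is the same call for every storage format:

Mat         A;
PetscInt    i;
PetscScalar v = 1.0;

/* 4x4 sequential sparse (AIJ) matrix, ~3 nonzeros preallocated per row */
ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,4,4,3,PETSC_NULL,&A);CHKERRQ(ierr);
for (i = 0; i < 4; i++) {
  ierr = MatSetValues(A,1,&i,1,&i,&v,INSERT_VALUES);CHKERRQ(ierr); /* diagonal entries */
}
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatMult(A,X,Y);CHKERRQ(ierr); /* Y <- A*X, whatever the format */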
PETSc
File I/O
• PETSc has equivalent routines to printf, but you must decide if you want every node to print or just the control node
• To ensure clarity when multiple nodes print, use PetscSynchronizedPrintf followed by PetscSynchronizedFlush.
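
A sketch of the difference (rank is this node's number from MPI; note that newer PETSc versions give PetscSynchronizedFlush an extra FILE* argument):

/* printed once, by the control node only */
ierr = PetscPrintf(PETSC_COMM_WORLD,"Starting solve...\n");CHKERRQ(ierr);

/* every node prints, collected in rank order */
ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,"Node %d checking in\n",rank);CHKERRQ(ierr);
ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD);CHKERRQ(ierr);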
PETSc
File I/O
• The equivalent to a stream is a “viewer”, but a viewer organizes data across multiple processors.
• A viewer combines an output location (file/stdout/stderr) with a format.
• Most data types have a View routine, such as MatView(M,viewer)
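
For example, a sketch that views a matrix to an ASCII file in MATLAB format (2.3-era viewer routines; the file name is illustrative):

PetscViewer viewer;

ierr = PetscViewerASCIIOpen(PETSC_COMM_WORLD,"output.m",&viewer);CHKERRQ(ierr);
ierr = PetscViewerSetFormat(viewer,PETSC_VIEWER_ASCII_MATLAB);CHKERRQ(ierr);
ierr = MatView(M,viewer);CHKERRQ(ierr);
ierr = PetscViewerDestroy(viewer);CHKERRQ(ierr);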
PETSc
File I/O
• On a batch server, ASCII I/O can be horrendously slow.
• PETSc only reads data into a parallel format when it is stored in binary form.
• Lots of output data is likely: binary is more compact than ASCII.
PETSc
I have ASCII input data: solution?
• Write a wrapper program
• It runs on one processor
• It creates the data to be used in parallel, and “views” it to a binary input file
• In parallel, the data will then be automatically distributed
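
The core of such a wrapper might look like this (a sketch; "input.dat" is a placeholder, and the file-mode constants and VecLoad signature follow the 2.3-era API):

PetscViewer viewer;

/* serial wrapper: X is a fully assembled sequential Vec */
ierr = PetscViewerBinaryOpen(PETSC_COMM_SELF,"input.dat",FILE_MODE_WRITE,&viewer);CHKERRQ(ierr);
ierr = VecView(X,viewer);CHKERRQ(ierr);
ierr = PetscViewerDestroy(viewer);CHKERRQ(ierr);

/* later, in the parallel program, the same file distributes automatically */
ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD,"input.dat",FILE_MODE_READ,&viewer);CHKERRQ(ierr);
ierr = VecLoad(viewer,VECMPI,&X);CHKERRQ(ierr);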
PETSc
Next Time, Issues for Large Dynamical Systems:
• Time Stepping
• Updating algebraically
• Managing lots of similar equations (Scattering/Gathering)