NUMA Organization - Keio University
NORA/Clusters
AMANO, Hideharu
Textbook pp.140-147
NORA (No Remote Access Memory Model)
No hardware shared memory
Data exchange is done by messages (or packets)
A dedicated synchronization mechanism is provided.
High peak performance
Message passing libraries (MPI, PVM) are provided.
Message passing (blocking: rendezvous)
[Timing diagram: Send and Receive must rendezvous; whichever of the two is executed first blocks until the matching operation starts on the other process.]
Message passing (with buffer)
[Timing diagram: Send copies the message into a buffer and returns; the matching Receive later takes the message from the buffer, so the sender does not wait for the receiver.]
Message passing (non-blocking)
[Timing diagram: Send returns immediately and another job runs while the transfer proceeds; the message is picked up by Receive later.]
PVM (Parallel Virtual Machine)
A buffer is provided for the sender.
Both blocking and non-blocking receives are provided.
Barrier synchronization
MPI (Message Passing Interface)
A superset of PVM for one-to-one communication.
Group communication
Various types of communication are supported.
Error checking with communication tags.
Details will be introduced later.
Programming style using MPI
SPMD (Single Program Multiple Data Streams)
Multiple processes execute the same program.
Independent processing is done based on the process number.
Program execution using MPI
The specified number of processes are generated.
They are distributed to the nodes of the NORA machine or PC cluster.
Communication methods
Point-to-point communication
A sender and a receiver execute the functions for sending and receiving.
The two functions must be strictly matched.
Collective communication
Communication among multiple processes.
The same function is executed by all participating processes.
It can be replaced with a sequence of point-to-point communications, but the collective form is sometimes more efficient (a sketch follows below).
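A minimal sketch of collective communication, assuming the standard MPI_Reduce function and illustrative variable names (neither appears on the slides): every process calls the same function, and the per-process values are summed at process 0.

#include <stdio.h>
#include <mpi.h>

/* Every process calls MPI_Reduce with the same arguments;
   the sum of the contributions arrives only at the root (process 0). */
int main(int argc, char **argv)
{
    int pid, nprocs, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = pid;          /* each process contributes its own rank */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (pid == 0)
        printf("sum of ranks 0..%d = %d\n", nprocs - 1, total);

    MPI_Finalize();
    return 0;
}

The same result could be obtained with nprocs-1 MPI_Send/MPI_Recv pairs, which is the point-to-point replacement mentioned above.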
Fundamental MPI functions
Most programs can be described using six
fundamental functions
MPI_Init() … MPI Initialization
MPI_Comm_rank() … Get the process #
MPI_Comm_size() … Get the total process #
MPI_Send() … Message send
MPI_Recv() … Message receive
MPI_Finalize() … MPI termination
Other MPI functions
Functions for measurement
MPI_Barrier() … barrier synchronization
MPI_Wtime() … get the wall-clock time
Non-blocking functions
Consist of a communication request and a completion check.
Other calculations can be executed while waiting (a sketch follows below).
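A minimal sketch of the non-blocking style, assuming the standard MPI_Isend/MPI_Irecv/MPI_Wait calls and MPI_Wtime/MPI_Barrier for measurement; the variable names are illustrative, and it needs at least two processes.

#include <stdio.h>
#include <mpi.h>

/* Non-blocking communication: post a request, do other work, then check. */
int main(int argc, char **argv)
{
    int pid, val = 0;
    MPI_Request req;
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    MPI_Barrier(MPI_COMM_WORLD);      /* line up all processes before timing */
    t0 = MPI_Wtime();

    if (pid == 0) {                   /* run with at least 2 processes */
        val = 42;
        MPI_Isend(&val, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    } else if (pid == 1) {
        MPI_Irecv(&val, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... other calculation could be executed here while waiting ... */

    if (pid <= 1)
        MPI_Wait(&req, &status);      /* the "check": block until the request completes */

    t1 = MPI_Wtime();
    if (pid == 1)
        printf("received %d in %f s\n", val, t1 - t0);

    MPI_Finalize();
    return 0;
}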
An Example
#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}
Initialize and Terminate
int MPI_Init(
int *argc, /* pointer to argc */
char ***argv /* pointer to argv */ );
mpi_init(ierr)
integer ierr
! return code
The command-line arguments must be passed (via pointers) to argc and argv.
int MPI_Finalize();
mpi_finalize(ierr)
integer ierr
! return code
Communicator functions
It returns the rank (process ID) in the communicator comm.
int MPI_Comm_rank(
MPI_Comm comm, /* communicator */
int *rank /* process ID (output) */ );
mpi_comm_rank(comm, rank, ierr)
integer comm, rank
integer ierr
! return code
It returns the total number of processes in the communicator comm.
int MPI_Comm_size(
MPI_Comm comm, /* communicator */
int *size /* number of process (output) */ );
mpi_comm_size(comm, size, ierr)
integer comm, size
integer ierr
! return code
Communicators are used for sharing a communication space among a subset of
processes. MPI_COMM_WORLD is a pre-defined communicator containing all processes
(a sketch of building a sub-communicator follows below).
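The idea of a subset communicator can be illustrated with the standard MPI_Comm_split function, which is not covered on these slides; the grouping by even and odd rank is just an example.

#include <stdio.h>
#include <mpi.h>

/* MPI_Comm_split builds a sub-communicator for a subset of processes;
   rank and size are then relative to that communicator. */
int main(int argc, char **argv)
{
    int world_rank, sub_rank, sub_size;
    MPI_Comm sub;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Even- and odd-ranked processes go into two different communicators. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub);

    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);
    printf("world rank %d -> rank %d of %d in its sub-communicator\n",
           world_rank, sub_rank, sub_size);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}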
MPI_Send
It sends data to process “dest”.
int MPI_Send(
void *buf, /* send buffer */
int count, /* # of elements to send */
MPI_Datatype datatype, /* datatype of elements */
int dest, /* destination (receiver) process ID */
int tag, /* tag */
MPI_Comm comm /* communicator */ );
mpi_send(buf, count, datatype, dest, tag, comm, ierr)
<type> buf(*)
integer count, datatype, dest, tag, comm
integer ierr
! return code
Tags are used for identification of messages.
MPI_Recv
int MPI_Recv(
    void *buf,              /* receiver buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */ );
mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
<type> buf(*)
integer count, datatype, source, tag, comm, status(mpi_status_size)
integer ierr
! return code
The same tag as the sender's must be passed to MPI_Recv.
Pass a pointer to an MPI_Status variable. It is a structure with three members,
MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender,
the tag, and the error code.
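A minimal sketch (not from the slides) of how the MPI_Status members are read back when the standard wildcards MPI_ANY_SOURCE and MPI_ANY_TAG are used instead of a fixed sender and tag; run with at least two processes.

#include <stdio.h>
#include <mpi.h>

/* Wildcard receive: the MPI_Status members report which process and
   which tag the message actually arrived with. */
int main(int argc, char **argv)
{
    int pid, nprocs, i, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            /* MPI_ANY_SOURCE / MPI_ANY_TAG accept messages in arrival order */
            MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("value %d from process %d (tag %d)\n",
                   data, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        data = pid * 10;
        MPI_Send(&data, 1, MPI_INT, 0, pid /* tag */, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}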
datatype and count
The size of the message is identified with count and datatype.
MPI_CHAR   … char
MPI_INT    … int
MPI_FLOAT  … float
MPI_DOUBLE … double
… etc.
Compile and Execution
% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)
Shared memory model vs. Message passing model
Benefits
Shared memory:
A distributed OS is easy to implement.
Automatic parallelizing compilers.
OpenMP
Message passing:
Formal verification is easy (blocking).
No side effects (a shared variable is itself a side effect).
Small cost
OpenMP
#include <stdio.h>
#include <omp.h>   /* for omp_get_thread_num() / omp_get_num_threads() */

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}
Multiple threads are generated by using the pragma.
Variables declared globally can be shared.
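For comparison with the MPI compile line shown earlier, a typical compile-and-run sequence for this OpenMP example, assuming GCC and the hypothetical file name hello_omp.c; the OMP_NUM_THREADS environment variable selects the number of threads.

% gcc -fopenmp -o hello_omp hello_omp.c
% OMP_NUM_THREADS=4 ./hello_omp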
Convenient pragma for parallel execution
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}
The assignment of loop iterations (values of i) to threads is automatically
adjusted so that the load on each thread becomes even (a self-contained
version follows below).
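A self-contained version of the fragment above, as a sketch with illustrative array names a, b, c (the slide shows only the parallel region):

#include <stdio.h>
#include <omp.h>

#define N 1000

/* The parallel-for pragma splits the loop iterations over the available
   threads; the arrays are shared among them. */
int main()
{
    int i;
    double a[N], b[N], c[N];

    for (i = 0; i < N; i++) {   /* initialize inputs sequentially */
        a[i] = i;
        b[i] = 2 * i;
    }

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}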
Automatic parallelizing Compilers
Automatically translate code written for uniprocessors into code for multiprocessors.
Loop-level parallelism is the main target of parallelization.
Fortran codes have been the main targets:
No pointers
The array structure is simple
Recently, restricted C has also become a target language.
Oscar Compiler (Waseda Univ.), COINS
PC Clusters
Multicomputers
Dedicated hardware (CPU, network)
High performance but expensive
Hitachi’s SR8000, Cray T3E, etc.
WS/PC Clusters
Using standard CPU boards
High Performance/Cost
Standard bus often forms a bottleneck
Beowulf Clusters
Standard CPU boards, standard components
LAN + TCP/IP
Free software
A cluster with a standard System Area Network (SAN) like Myrinet is often
called a Beowulf cluster.
PC clusters in supercomputing
Clusters occupied more than 80% of the Top500 supercomputers as of 2008/11.
Let’s check http://www.top500.org
SAN (System Area Network) for PC clusters
Virtual Cut-through routing
High throughput/Low latency
Out of the cabinet but in the floor
Also used for connecting disk subsystems
Sometimes called System Area Network
Infiniband
Myrinet
Quadrics
GbEthernet / 10Gb Ethernet: store & forward, tree-based topologies
SAN vs. Ethernet
Infiniband
Point-to-point direct serial interconnection.
Using 8b/10b code.
Various types of topologies can be supported.
Multicasting/atomic transactions are supported.
The maximum throughput:
      SDR        DDR        QDR
1X    2Gbit/s    4Gbit/s    8Gbit/s
4X    8Gbit/s    16Gbit/s   32Gbit/s
12X   24Gbit/s   48Gbit/s   96Gbit/s
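As a reading aid, and assuming the usual per-lane signaling rates of 2.5, 5 and 10 Gbit/s for SDR, DDR and QDR (the slide lists only the resulting data rates), the 8b/10b code and the lane count reproduce the table above:

\[
\text{per-lane data rate} = \text{signaling rate}\times\tfrac{8}{10}:\qquad
2.5\times\tfrac{8}{10}=2,\quad 5\times\tfrac{8}{10}=4,\quad 10\times\tfrac{8}{10}=8\ \text{Gbit/s}
\]
\[
n\text{X data rate} = n\times\text{per-lane rate},\qquad \text{e.g. } 12\times 2\ \text{Gbit/s}=24\ \text{Gbit/s for 12X SDR}
\]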
Remote DMA (user level)
[Diagram: local node (sender) and remote node. In the conventional path, data moves from the user-level data source through a system call to the kernel agent and a kernel buffer before reaching the network interface. With user-level RDMA, the protocol engine on each network interface transfers the data directly between the user-level buffer on the local node and the buffer (data sink) on the remote node, bypassing the kernel.]
PC Clusters with Myrinet (RWCP)
RHiNET Cluster
Node
CPU: Pentium III 933MHz
Memory: 1Gbyte
PCI bus: 64bit/66MHz
OS: Linux kernel 2.4.18
SCore: version 5.0.1
RHiNET-2 with 64 nodes
Network: Optical → Gigabit Ethernet
Cluster of supercomputers
The recent trend in supercomputers is to connect powerful components with high-speed SANs like Infiniband.
Roadrunner by IBM
General-purpose CPUs + Cell BE
Grid Computing
Supercomputers connected via the Internet can be treated virtually as a single big supercomputer.
Middleware or a toolkit is used to manage it.
Globus Toolkit
GEO (Global Earth Observation Grid), AIST Japan
Contest/Homework
Assume that there is an array x[100]. Write MPI code that computes the sum of all elements of x using 8 processes.