NUMA Configuration - Keio University

NORA/Clusters
AMANO, Hideharu
Textbook pp.140-147
NORA (No Remote Access Memory Model)

No hardware shared memory
Data exchange is done by messages (or packets)
A dedicated synchronization mechanism is provided
High peak performance
Message passing libraries (MPI, PVM) are provided

Message passing (blocking: rendezvous)
[Figure: timing of Send and Receive; the send completes only when the matching receive is executed]

Message passing (with buffer)
[Figure: timing of Send and Receive; the send finishes once the message is stored in the buffer]

Message passing (non-blocking)
[Figure: the send returns immediately and another job runs before the receive completes]
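
The three modes above map onto standard MPI send calls. The following is a minimal sketch (not taken from the slides, and assuming at least two processes): MPI_Ssend blocks until the matching receive starts (rendezvous), MPI_Bsend copies the message into a user-attached buffer and returns, and MPI_Isend returns immediately so another job can run before MPI_Wait.

#include <stdio.h>
#include <mpi.h>

#define BUF_BYTES (1024 + MPI_BSEND_OVERHEAD)

int main(int argc, char **argv)
{
    int pid, msg = 42, recv_buf, i;
    char bsend_buf[BUF_BYTES];
    void *detached;
    int detached_size;
    MPI_Status status;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (pid == 0) {
        /* Blocking (rendezvous): returns only after the matching receive has started */
        MPI_Ssend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* With buffer: the message is copied into the attached buffer and the call returns */
        MPI_Buffer_attach(bsend_buf, BUF_BYTES);
        MPI_Bsend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Buffer_detach(&detached, &detached_size);

        /* Non-blocking: the call returns at once; another job can run until MPI_Wait */
        MPI_Isend(&msg, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req);
        /* ... other job ... */
        MPI_Wait(&req, &status);
    } else if (pid == 1) {
        for (i = 0; i < 3; i++)
            MPI_Recv(&recv_buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}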

PVM (Parallel Virtual Machine)

A buffer is provided for the sender
Both blocking and non-blocking receive are provided
Barrier synchronization

MPI (Message Passing Interface)

A superset of PVM for 1-to-1 communication
Group communication
Various communication patterns are supported
Error checking with a communication tag
Details will be introduced later

Programming style using MPI

SPMD (Single Program Multiple Data Streams)

Multiple processes execute the same program
Independent processing is done based on the process number

Program execution using MPI

A specified number of processes are generated
They are distributed to the nodes of the NORA machine or PC cluster

Communication methods

Point-to-Point communication

The sender and the receiver execute the functions for sending and receiving
The two functions must be strictly matched

Collective communication

Communication among multiple processes
The same function is executed by all participating processes
It can be replaced with a sequence of point-to-point communications, but the
collective form is sometimes more efficient (see the sketch below)
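
As a hedged illustration (not from the slides), the sketch below uses two common collective functions, MPI_Bcast and MPI_Reduce. Either one could be replaced by a loop of point-to-point sends and receives, but the single collective call is shorter and lets the library use an efficient algorithm.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, nprocs, data = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0)
        data = 100;

    /* Every process executes the same function */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);                /* root 0 -> all  */
    MPI_Reduce(&pid, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); /* all -> root 0  */

    if (pid == 0)
        printf("data = %d, sum of ranks = %d\n", data, sum);

    MPI_Finalize();
    return 0;
}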

Fundamental MPI functions

Most programs can be described using six fundamental functions

MPI_Init() … MPI initialization
MPI_Comm_rank() … get the process number (rank)
MPI_Comm_size() … get the total number of processes
MPI_Send() … message send
MPI_Recv() … message receive
MPI_Finalize() … MPI termination

Other MPI functions

Functions for measurement

MPI_Barrier() … barrier synchronization
MPI_Wtime() … get the wall-clock time

Non-blocking functions

Consist of a communication request and a completion check
Other calculation can be executed while waiting (see the sketch below)
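
A minimal sketch (assumed, not part of the lecture; it needs at least two processes) combining the measurement functions with a non-blocking receive: MPI_Irecv issues the communication request, independent work can be done, and MPI_Wait performs the completion check.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, data = 0;
    double t0, t1;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    MPI_Barrier(MPI_COMM_WORLD);   /* align all processes before timing */
    t0 = MPI_Wtime();

    if (pid == 1) {
        data = 123;
        MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (pid == 0) {
        MPI_Irecv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);  /* request */
        /* ... other calculation can be executed here while waiting ... */
        MPI_Wait(&req, &status);                                   /* check   */
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (pid == 0)
        printf("elapsed: %f seconds, data = %d\n", t1 - t0, data);

    MPI_Finalize();
    return 0;
}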

An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Initialize and Terminate

int MPI_Init(
    int *argc,    /* pointer to argc */
    char ***argv  /* pointer to argv */ );

mpi_init(ierr)
    integer ierr  ! return code

The command-line arguments must be passed to MPI_Init through argc and argv.

int MPI_Finalize();

mpi_finalize(ierr)
    integer ierr  ! return code

Communicator functions

It returns the rank (process ID) within the communicator comm.

int MPI_Comm_rank(
    MPI_Comm comm,  /* communicator */
    int *rank       /* process ID (output) */ );

mpi_comm_rank(comm, rank, ierr)
    integer comm, rank
    integer ierr    ! return code

It returns the total number of processes in the communicator comm.

int MPI_Comm_size(
    MPI_Comm comm,  /* communicator */
    int *size       /* number of processes (output) */ );

mpi_comm_size(comm, size, ierr)
    integer comm, size
    integer ierr    ! return code

Communicators are used for sharing a communication space among a subset of
processes. MPI_COMM_WORLD is a pre-defined communicator that includes all processes.
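
A sketch (not from the slides) of how a subset communicator can be made with MPI_Comm_split: processes that pass the same color end up in the same new communicator and get fresh ranks inside it.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank, color;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    color = world_rank % 2;          /* even and odd ranks form two groups */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);

    printf("world rank %d -> group %d, rank %d\n", world_rank, color, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}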

MPI_Send

It sends data to process “dest”.

int MPI_Send(
    void *buf,              /* send buffer */
    int count,              /* # of elements to send */
    MPI_Datatype datatype,  /* datatype of elements */
    int dest,               /* destination (receiver) process ID */
    int tag,                /* tag */
    MPI_Comm comm           /* communicator */ );

mpi_send(buf, count, datatype, dest, tag, comm, ierr)
    <type> buf(*)
    integer count, datatype, dest, tag, comm
    integer ierr            ! return code

Tags are used for identification of messages.

MPI_Recv

int MPI_Recv(
    void *buf,              /* receive buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */ );
mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
<type> buf(*)
integer count, datatype, source, tag, comm, status(mpi_status_size)
integer ierr
! return code


The same tag as the sender’s one must be passed to MPI_Recv.
Set the pointers to a variable MPI_Status. It is a structure with three members:
MPI_SOURCE, MPI_TAG and MPI_ERROR, which stores process ID of the
sender, tag and error code.
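
A short sketch (assumed, not from the slides) showing the matching rules in practice: the receiver relaxes them with MPI_ANY_SOURCE and MPI_ANY_TAG and then reads the actual sender and tag back from the MPI_Status structure.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, nprocs, i, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            /* Accept a message from any process with any tag */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from process %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        value = pid * 10;
        MPI_Send(&value, 1, MPI_INT, 0, pid, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}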

datatype and count

The size of the message is specified by count and datatype.

MPI_CHAR    char
MPI_INT     int
MPI_FLOAT   float
MPI_DOUBLE  double   … etc.
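
For illustration (not from the slides; it assumes at least two processes), the program below transfers 100 doubles in one call; count is the number of elements of the given datatype, not the number of bytes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int pid, i;
    double x[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (pid == 0) {
        for (i = 0; i < 100; i++)
            x[i] = i * 0.5;
        /* count = 100 elements of MPI_DOUBLE, not 100 bytes */
        MPI_Send(x, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (pid == 1) {
        MPI_Recv(x, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("x[99] = %f\n", x[99]);
    }

    MPI_Finalize();
    return 0;
}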
Compile and Execution
% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)

Shared memory model vs. Message passing model

Benefits

Shared memory:
A distributed OS is easy to implement
Automatic parallelizing compilers
OpenMP

Message passing:
Formal verification is easy (blocking)
No side effects (a shared variable is a side effect in itself)
Small cost

OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

 Multiple threads are generated by using the pragma.
 Variables declared globally can be shared.

Convenient pragma for parallel execution

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

 The iterations of i are automatically assigned to threads so that the load on
each thread becomes even.
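
A related sketch (not in the original slides): with a reduction clause, OpenMP also parallelizes loops that accumulate into a single variable; each thread keeps a private partial sum, and the partial sums are combined at the end.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a[1000], sum = 0.0;
    int i;

    for (i = 0; i < 1000; i++)
        a[i] = i;

    /* Each thread sums a private copy of sum; OpenMP adds them together */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* 0 + 1 + ... + 999 = 499500 */
    return 0;
}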

Automatic parallelizing compilers

Automatically translate code written for uniprocessors into code for multiprocessors
Loop-level parallelism is the main target of parallelization
Fortran codes have been the main targets
 No pointers
 The array structure is simple
Recently, restricted C has also become a target language
 Oscar Compiler (Waseda Univ.), COINS

PC Clusters

Multicomputers
 Dedicated hardware (CPU, network)
 High performance but expensive
 Hitachi’s SR8000, Cray T3E, etc.
WS/PC Clusters
 Using standard CPU boards
 High performance/cost
 The standard bus often forms a bottleneck
Beowulf Clusters
 Standard CPU boards, standard components
 LAN + TCP/IP
 Free software
 A cluster with a standard System Area Network (SAN) like Myrinet is often
   called a Beowulf cluster

PC clusters in supercomputing

Clusters occupy more than 80% of the top 500 supercomputers (as of 2008/11).
Let’s check http://www.top500.org

SAN (System Area Network) for PC clusters

Virtual cut-through routing
High throughput / low latency
Out of the cabinet but within the floor
Also used for connecting disk subsystems (hence sometimes called a System Area Network)

SAN vs. Ethernet
 SAN: Infiniband, Myrinet, Quadrics
 Ethernet: GbEthernet / 10Gb Ethernet, store & forward, tree-based topologies

Infiniband

Point-to-point direct serial interconnection
Uses 8b/10b code
Various types of topologies can be supported
Multicasting/atomic transactions are supported
The maximum throughput:

       SDR        DDR        QDR
 1X    2Gbit/s    4Gbit/s    8Gbit/s
 4X    8Gbit/s    16Gbit/s   32Gbit/s
12X    24Gbit/s   48Gbit/s   96Gbit/s
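
As a check (assuming the standard per-lane signaling rates of 2.5/5/10 Gbit/s for SDR/DDR/QDR and lane counts of 1, 4 and 12), the table values follow from the 8b/10b code:

\[
  \text{data rate} = \text{lanes} \times \text{raw lane rate} \times \tfrac{8}{10}
\]
\[
  \text{e.g.\ 4X QDR: } 4 \times 10~\text{Gbit/s} \times \tfrac{8}{10} = 32~\text{Gbit/s}
\]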

Remote DMA (user level)

[Figure: user-level remote DMA between a local node (sender / data source) and a
remote node (data sink). Data is moved buffer-to-buffer by the protocol engines on
the network interfaces at user level, instead of going through system calls and a
kernel agent.]
PC Clusters with Myrinet
(RWCP, using Myrinet)

RHiNET Cluster
(RHiNET-2 with 64 nodes)

Node
 CPU: Pentium III 933MHz
 Memory: 1Gbyte
 PCI bus: 64bit/66MHz
 OS: Linux kernel 2.4.18
 SCore: version 5.0.1
Network
 Optical → Gigabit Ethernet

Cluster of supercomputers

A recent trend in supercomputers is to connect powerful components with
high-speed SANs such as Infiniband
Roadrunner by IBM
 General-purpose CPUs + Cell BE

Grid Computing

Supercomputers connected through the Internet can be treated virtually as a
single big supercomputer
Middleware or a toolkit is used to manage it
 Globus Toolkit
 GEO (Global Earth Observation Grid)
  AIST, Japan

Contest/Homework

Assume that there is an array x[100]. Write MPI code that computes the sum of
all elements of x with 8 processes.