shared int *p2

Download Report

Transcript shared int *p2

Unified Parallel C (UPC)
Costin Iancu
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell,
P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome, K. Yelick
http://upc.lbl.gov
Slides edited by K. Yelick, T. El-Ghazawi, P. Husbands, C.Iancu
Context
• Most parallel programs are written using either:
– Message passing with a SPMD model (MPI)
• Usually for scientific applications with C++/Fortran
• Scales easily
– Shared memory with threads in OpenMP, Threads+C/C++/F or Java
• Usually for non-scientific applications
• Easier to program, but less scalable performance
• Partitioned Global Address Space (PGAS) Languages take the
best of both
– SPMD parallelism like MPI (performance)
– Local/global distinction, i.e., layout matters (performance)
– Global address space like threads (programmability)
• 3 Current languages: UPC (C), CAF (Fortran), and Titanium (Java)
• 3 New languages: Chapel, Fortress, X10
Partitioned Global Address Space
Global
address space
Thread0 Thread1
•
•
•
•
•
•
X[0]
ptr:
X[1]
ptr:
Threadn
X[P]
Shared
ptr:
Private
Shared memory is logically partitioned by processors
Remote memory may stay remote: no automatic caching implied
One-sided communication: reads/writes of shared variables
Both individual and bulk memory copies
Some models have a separate private memory area
Distributed array generality and how they are constructed
Partitioned Global Address Space
Languages
• Explicitly-parallel programming model with SPMD parallelism
– Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
– Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
– Local vs. remote memory (two-level hierarchy)
• Programmer control over performance critical decisions
– Data layout and communication
• Performance transparency and tunability are goals
– Initial implementation can use fine-grained shared memory
Current Implementations
• A successful language/library must run everywhere
• UPC
– Commercial compilers: Cray, SGI, HP, IBM
– Open source compilers: LBNL/UCB (source-to-source), Intrepid (gcc)
• CAF
– Commercial compilers: Cray
– Open source compilers: Rice (source-to-source)
• Titanium
– Open source compilers: UCB (source-to-source)
• Common tools
– Open64 open source research compiler infrastructure
– ARMCI, GASNet for distributed memory implementations
– Pthreads, System V shared memory
Talk Overview
• UPC Language Design
–
–
–
–
Data Distribution (layout, memory management)
Work Distribution (data parallelism)
Communication (implicit,explicit, collective operations)
Synchronization (memory model, locks)
• Programmi ng in UPC
–
–
–
–
Performance (one-sided communication)
Application examples: FFT, PC
Productivity (compiler support)
Performance tuning and modeling
UPC Overview and Design
• Unified Parallel C (UPC) is:
– An explicit parallel extension of ANSI C with common and
familiar syntax and semantics for parallel C and simple
extensions to ANSI C
– A partitioned global address space language (PGAS)
– Based on ideas in Split-C, AC, and PCP
• Similar to the C language philosophy
– Programmers are clever and careful, and may need to get
close to hardware
• to get performance, but
• can get in trouble
• SPMD execution model: (THREADS, MYTHREAD),
static vs. dynamic threads
Data Distribution
Data Distribution
Global
Address Space
Thread0 Thread1
X[0]
X[1]
Threadn
X[P]
Shared
ours:
ptr:
ptr:
ptr:
mine:
mine:
mine:
Private
• Distinction between memory spaces through extensions of the
type system (shared qualifier)
shared int ours;
shared int X[THREADS];
shared int *ptr;
int mine;
• Data in shared address space:
– Static:
scalars (T0), distributed arrays
– Dynamic: dynamic memory management
(upc_alloc, upc_global_alloc, upc_all_alloc)
Data Layout
• Data layout controlled through extensions of the type system (layout
specifiers):
– [0] or [] (indefinite layout, all on 1 thread):
shared [] int *p;
– Empty (cyclic layout) :
shared int array[THREADS*M];
– [*] (blocked layout):
shared [*] int array[THREADS*M];
– [b] or [b1][b2]…[bn] = [b1*b2*…bn] (block cyclic)
shared [B] int array[THREADS*M];
• Element array[i] has affinity with thread
(i / B) % THREADS
• Layout determines pointer arithmetic rules
• Introspection (upc_threadof, upc_phaseof,
upc_blocksize)
UPC Pointers Implementation
• In UPC pointers to shared objects have three fields:
– thread number
– local address of block
– phase (specifies position in the block)
Virtual Address
Thread Phase
• Example: Cray T3E implementation
Phase
63
Thread
49 48
Virtual Address
38 37
• Pointer arithmetic can be expensive in UPC
0
UPC Pointers
Where does the pointer point?
Where
does the
pointer
reside?
Local
Private PP (p1)
Shared
PS (p3)
Shared SP (p2)
SS (p4)
int *p1;
/*
shared int *p2; /*
int *shared p3; /*
shared int *shared
private pointer to local memory */
private pointer to shared space */
shared pointer to local memory */
p4; /* shared pointer to
shared space */
UPC Pointers
Global
address space
Thread0 Thread1
Threadn
p3:
p3:
p3:
p4:
p4:
p4:
p1:
p1:
p1:
p2:
p2:
p2:
Shared
Private
int *p1;
/* private pointer to local memory */
shared int *p2; /* private pointer to shared space */
int *shared p3; /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to
shared space */
Pointers to shared often require more storage and are more costly to
dereference; they may refer to local or remote memory.
Common Uses for UPC Pointer Types
int *p1;
• These pointers are fast (just like C pointers)
• Use to access local data in part of code performing local work
• Often cast a pointer-to-shared to one of these to get faster access
to shared data that is local
shared int *p2;
• Use to refer to remote data
• Larger and slower due to test-for-local + possible communication
int *shared p3;
• Not recommended
shared int
*shared p4;
• Use to build shared linked structures, e.g., a linked list
• typedef is the UPC programmer’s best friend
UPC Pointers Usage Rules
• Pointer arithmetic supports blocked and non-blocked
array distributions
• Casting of shared to private pointers is allowed but
not vice versa !
• When casting a pointer-to-shared to a pointer-tolocal, the thread number of the pointer to shared may
be lost
• Casting of shared to local is well defined only if the
object pointed to by the pointer to shared has affinity
with the thread performing the cast
Work Distribution
Work Distribution
upc_forall()
• Owner computes rule: loop over all, work on those owned by you
• UPC adds a special type of loop
upc_forall(init; test; step; affinity)
statement;
• Programmer indicates the iterations are independent
– Undefined if there are dependencies across threads
• Affinity expression indicates which iterations to run on each thread. It
may have one of two types:
– Integer: affinity%THREADS == MYTHREAD
– Pointer: upc_threadof(affinity) ==
MYTHREAD
• Syntactic sugar for:
for(i=MYTHREAD; i<N; i+=THREADS)
…
for(i=0; i<N; i++)
if (MYTHREAD == i%THREADS)
Inter-Processor
Communication
Data Communication
• Implicit (assignments)
shared int *p;
…
*p = 7;
• Explicit (bulk synchronous) point-to-point
(upc_memget, upc_memput, upc_memcpy, upc_memset)
• Collective operations http://www.gwu.edu/~upc/docs/
– Data movement: broadcast, scatter, gather, …
– Computational: reduce, prefix, …
– Interface has synchronization modes: (??)
• Avoid over-synchronizing (barrier before/after is simplest
semantics, but may be unnecessary)
• Data being collected may be read/written by any thread
simultaneously
Data Communication
• The UPC Language Specification V 1.2 does not
contain non-blocking communication primitives
• Extensions for non-blocking communication available
in the BUPC implementation
• UPC V1.2 does not have higher level communication
primitives for point-to-point communication.
• See BUPC extensions for
– scatter, gather
– VIS
• Should non-blocking communication be a first class
language citizen?
Synchronization
Synchronization
• Point-to-point synchronization: locks
– opaque type : upc_lock_t*
– dynamically managed: upc_all_lock_alloc,
upc_global_lock_alloc
• Global synchronization:
– Barriers (unaligned) : upc_barrier
– Split-phase barriers:
upc_notify; this thread is ready for barrier
do computation unrelated to barrier
upc_wait;
wait for others to be ready
Memory Consistency in UPC
• The consistency model defines the order in which one
thread may see another threads accesses to memory
– If you write a program with un-synchronized accesses, what
happens?
– Does this work?
data = …
flag = 1;
while (!flag) { };
… = data;
// use the data
• UPC has two types of accesses:
– Strict: will always appear in order
– Relaxed: May appear out of order to other threads
• Consistency is associated either with a program scope
(file, statement)
{ #pragma upc strict flag = 1; }
or with a type
shared strict int flag;
Sample UPC Code
Matrix Multiplication in UPC
• Given two integer matrices A(NxP) and
B(PxM), we want to compute C =A x B.
• Entries cij in C are computed by the formula:
p
c
ij
  ail  blj
l 1
Serial C code
#define N 4
#define P 4
#define M 4
int a[N][P] =
{1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void) {
int i, j , l;
for (i = 0 ; i<N ; i++) {
for (j=0 ; j<M ;j++) {
c[i][j] = 0;
for (l = 0 ; l<P ; l++)
c[i][j] += a[i][l]*b[l][j];
}
}
}
Domain Decomposition
• Exploits locality in matrix multiplication
• A (N  P) is decomposed rowwise into blocks of size (N  P) /
THREADS as shown below:
• B(P  M) is decomposed column
wise into M/ THREADS blocks as
shown below:
Thread THREADS-1
Thread 0
P
M
0 .. (N*P / THREADS) -1
Thread 0
(N*P / THREADS)..(2*N*P / THREADS)-1
Thread 1
N
P
((THREADS-1)N*P) / THREADS ..
(THREADS*N*P / THREADS)-1
Thread THREADS-1
•Note: N and M are assumed to be multiples
of THREADS
Columns 0:
(M/THREADS)-1
Columns ((THREAD-1) 
M)/THREADS:(M-1)
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P] = {1,..,16}, c[N][M];
// data distribution: a and c are blocked shared
matrices
shared [M/THREADS] int b[P][M] = {0,1,0,1,… ,0,1};
void main (void) {
int i, j , l; // private variables
upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
//work distribution
for (j=0 ; j<M ;j++) {
c[i][j] = 0;
for (l= 0 ; l<P ; l++)
//implicit communication
c[i][j] += a[i][l]*b[l][j];
}
}
}
UPC Matrix Multiplication With Block Copy
#include <upc_relaxed.h>
shared [N*P /THREADS] int a[N][P], c[N][M];
// a and c are blocked shared matrices
shared[M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
int i, j , l; // private variables
//explicit bulk communication
upc_memget(b_local, b, P*M*sizeof(int));
//work distribution (c aligned with a??)
upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
for (j=0 ; j<M ;j++) {
c[i][j] = 0;
for (l= 0 ; l<P ; l++)
c[i][j] += a[i][l]*b_local[l][j];
}
}
}
Programming in UPC
Don’t ask yourself what can my compiler do for me, ask
yourself what can I do for my compiler!
Principles of Performance Software
To minimize the cost of communication
• Use the best available communication mechanism on
a given machine
• Hide communication by overlapping
(programmer or compiler or runtime)
• Avoid synchronization using data-driven execution
(programmer or runtime)
• Tune communication using performance models
when they work (??); search when they don’t
(programmer or compiler/runtime)
Best Available Communication Mechanism
• Performance is determined by overhead, latency
and bandwidth
• Data transfer (one-sided communication) is often
faster than (two sided) message passing
• Semantics limit performance
–
–
–
–
In-order message delivery
Message and tag matching
Need to acquire information from remote host processor
Synchronization (message receipt) tied to data transfer
One-Sided vs Two-Sided: Theory
host
CPU
two-sided message
message id
data payload
one-sided put message
address
network
interface
data payload
memory
• A two-sided messages needs to be matched with a receive to
identify memory address to put data
– Offloaded to Network Interface in networks like Quadrics
– Need to download match tables to interface (from host)
• A one-sided put/get message can be handled directly by a
network interface with RDMA support
– Avoid interrupting the CPU or storing data from CPU (preposts)
GASNet: Portability and High-Performance
25
MPI ping-pong
24.2
22.1
GASNet put+sync
Roundtrip Latency (usec)
(down is good)
8-byte Roundtrip Latency
20
15
18.5
17.8
14.6
13.5
11.8
9.6
10
6.6
9.5
9.5
8.3
6.6
4.5
5
0
Elan3
Elan4
Myrinet
G5/IB
Opteron/IB
SP/Fed
XT3
GASNet better for overhead and latency across machines
UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
Flood Bandwidth for 4KB messages
231
80%
70%
MPI
223
763
90%
Percent HW peak
(up is good)
100%
702
190
714
679
GASNet
765
152
GASNet excels at mid-range
sizes: important for overlap
60%
420
50%
750
40%
526
547
252
30%
20%
10%
0%
Elan3
Elan4
Myrinet
G5/IB
Opteron/IB
SP/Fed
XT3
100%
Percent HW peak (BW in MB)
(up is good)
Flood Bandwidth for 2MB messages
90%
857 858
244
225 228
80%
610
70%
799
795
255
1504 1490 1106 1109
630
GASNet at least as high
(comparable) for large
messages
60%
50%
MPI
40%
GASNet
30%
20%
10%
0%
Joint work
with UPC Group; GASNet design by Dan Bonachea
Elan3
Elan4
Myrinet
G5/IB
Opteron/IB
SP/Fed
XT3
One-Sided vs. Two-Sided: Practice
900
2.5
600
1.5
500
NERSC Jacquard
machine with
Opteron
processors
400
300
1
GASNet put (nonblock)"
MPI Flood
GASNet/MPI
200
100
12
8K
25
6K
64
K
32
K
16
K
8K
4K
2K
1K
51
2
25
6
12
8
64
32
0
16
0
0.5
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N ½ ) differs by one order of magnitude
• This is not a criticism of the implementation!
Yelick,Hargrove, Bonachea
Speedup (GASNet / MPI)
2
700
Bandwidth (MB/s)
(up is good)
800
Overlap
Hide Communication by Overlapping
• A programming model that decouples data transfer
and synchronization (init, sync)
• BUPC has several extensions: (programmer)
– explicit handle based
– region based
– implicit handle based
• Examples:
– 3D FFT (programmer)
– split-phase optimizations (compiler)
– automatic overlap (runtime)
Performing a 3D FFT
• NX x NY x NZ elements spread across P processors
• Will Use 1-Dimensional Layout in Z dimension
– Each processor gets NZ / P “planes” of NX x NY elements
per plane
Example: P = 4
NZ
NZ/P
1D Partition
NX
p3
p2
NY
Bell, Nishtala, Bonachea, Yelick
p1
p0
Performing a 3D FFT (part 2)
• Perform an FFT in all three dimensions
• With 1D layout, 2 out of the 3 dimensions are local
while the last Z dimension is distributed
Step 1: FFTs on the columns
(all elements local)
Step 2: FFTs on the rows
(all elements local)
Step 3: FFTs in the Z-dimension
(requires communication)
Bell, Nishtala, Bonachea, Yelick
Performing the 3D FFT (part 3)
• Can perform Steps 1 and 2 since all the data is
available without communication
• Perform a Global Transpose of the cube
– Allows step 3 to continue
Transpose
Bell, Nishtala, Bonachea, Yelick
Communication Strategies for 3D FFT
• Three approaches:
–Chunk:
chunk = all rows with same destination
• Wait for 2nd dim FFTs to finish
• Minimize # messages
–Slab:
• Wait for chunk of rows destined for
1 proc to finish
• Overlap with computation
–Pencil:
• Send each row as it completes
pencil = 1 row
• Maximize overlap and
• Match natural layout
slab = all rows in a single plane with
same destination
Bell, Nishtala, Bonachea, Yelick
NAS FT Variants Performance
Summary
Chunk (NAS FT
with FFTW)
.5 Tflops
Best MFlop rates for all NAS FT Benchmark versions
Best MPIBest
(always
slabs)
NAS Fortran/MPI
1000
Best
MPI
Best UPC
(always
pencils)
Best UPC
MFlops per Thread
MFlops per Thread
800
600
400
200
0
56
et 6 4
nd 2
a
B
i
Myr in
Infin
3 256
Elan
3 512
Elan
4 256
Elan
4 512
Elan
• Slab is always best for MPI; small message cost too
high
• Myrinet
Pencil isInfiniband
always bestElan3
for UPC;Elan3
more overlap
Elan4
Elan4
#procs 64
256
256
512
256
512
Bisection Bandwidth Limits
• Full bisection bandwidth is (too) expensive
• During an all-to-all communication phase
–Effective (per-thread) bandwidth is fractional share
–Significantly lower than link bandwidth
–Use smaller messages mixed with computation to avoid
swamping the network
Bell, Nishtala, Bonachea, Yelick
Compiler Optimizations
• Naïve scheme (blocking call for each load/store) not
good enough
• PRE on shared expressions
– Reduce the amount of unnecessary communication
– Apply also to UPC shared pointer arithmetic
• Split-phase communication
– Hide communication latency through overlapping
• Message coalescing
– Reduce number of messages to save startup overhead and achieve
better bandwidth
Chen, Iancu, Yelick
Benchmarks
• Gups
– Random access (read/modify/write) to distributed array
• Mcop
– Parallel dynamic programming algorithm
• Sobel
– Image filter
• Psearch
– Dynamic load balancing/work stealing
• Barnes Hut
– Shared memory style code from SPLASH2
• NAS FT/IS
– Bulk communication
% improvement over unoptimized
Performance Improvements
Chen, Iancu, Yelick
Data Driven Execution
Data-Driven Execution
• Many algorithms require synchronization with remote
processor
– Mechanisms: (BUPC extensions)
• Signaling store: Raise a semaphore upon transfer
• Remote enqueue: Put a task in a remote queue
• Remote execution: Floating functions (X10 activities)
• Many algorithms have irregular data dependencies
(LU)
– Mechanisms (BUPC extensions)
• Cooperative multithreading
Matrix Factorization
Blocks 2D
block-cyclic
distributed
Completed part of U
Completed part of L
Panel factorizations
involve communication
for pivoting
A(i,j)
A(i,k)
A(j,i)
A(j,k)
Trailing matrix
to be updated
Panel being factored
Husbands,Yelick
Matrixmatrix
multiplication
used here.
Can be coalesced
Three Strategies for LU Factorization
• Organize in bulk-synchronous phases (ScaLAPACK)
• Factor a block column, then perform updates
• Relatively easy to understand/debug, but extra synchronization
• Overlapping phases (HPL):
• Work associated with on block column factorization can be overlapped
• Parameter to determine how many (need temp space accordingly)
• Event-driven multithreaded (UPC Linpack):
•
•
•
•
•
•
Each thread runs an event handler loop
Tasks: factorization (w/ pivoting), update trailing, update upper
Tasks my suspend (voluntarily) to wait for data, synchronization, etc.
Data moved with remote gets (synchronization built-in)
Must “gang” together for factorizations
Scheduling priorities are key to performance and deadlock avoidance
Husbands,Yelick
UPC-HP Linpack Performance
Opteron
cluster
UPC vs.
MPI/HPL
X1 UPC vs. MPI/HPL
1400
MPI/HPL
1200
UPC
Altix UPC.
Vs.
MPI/HPL
UPC vs.
ScaLAPACK
160
140
1000
120
400
150
100
MPI/HPL
UPC
80
100
60
80
60
40
MPI/HPL
UPC
50
200
0
0
60
X1/64
X1/128
Opt/64
40
20
20
0
ScaLAPACK
UPC
GFlops
600
GFlop/s
800
GFlop/s
GFlop/s
200
Alt/32
0
2x4 pr oc gr i d
•Comparable to HPL (numbers from HPCC database)
•Faster than ScaLAPACK due to less synchronization
•Large scaling of UPC code on Itanium/Quadrics (Thunder)
• 2.2 TFlops on 512p and 4.4 TFlops on 1024p
Husbands, Yelick
4x4 pr oc gr i d
Performance Tuning
Iancu, Strohmaier
Efficient Use of One-Sided
• Implementations: need to be efficient and have scalable
performance
• Application level use of NB benefits from new design
techniques: finer grained decompositions and overlap
• Overlap exercises the system in “un-expected” ways
• Prototyping of implementations for large scale systems is a
hard problem: non-linear behavior of networks,
communication scheduling is NP-hard
• Need methodology for fast prototyping:
– understand interaction network/CPU at large scale
Performance Tuning
• Performance is determined by overhead, latency
and bandwidth, computational characteristics and
communication topology
• It’s all relative: Performance characteristics are
determined by system load
• Basic principles
– Minimize communication overhead
– Avoid congestion:
• control injection rate (end-point)
• avoid hotspots (end-point, network routes)
• Have to use models.
• What kind of answers can a model answer?
Example: Vector-Add
shared double *rdata;
double *ldata, *buf;
upc_memget(buf, rdata, N);
for(i=0; i<N; i++)
ldata[i]+= buf[i];
b
b
b
for(i=0; i<N/B; i++)
h[i]=upc_memget_nb(buf+i*B,rdata+i*B,B);
for(i=0; i<N/B; i++) {
sync(h[i]);
for(j=0;j<B; j++)
ldata[i*B+j]+=buf[i*B+j];
}
Which implementation is faster?
What is B,b?
GET_nb(B0)
:
GET_nb(Bb)
GET_nb(Bb+1)
:
GET_nb(B2b)
sync(B0)
compute(B0)
:
sync(Bb)
compute(Bb)
GET_nb(B2b+1)
:
GET_nb(B3b)
:
sync(BN)
compute(BN)
Prototyping
• Usual approach: use time accurate performance model
(applications, automatically tuned collectives)
– Models (LogP..) don’t capture important behavior (parallelism,
congestion, resource constraints, non-linear behavior)
– Exhaustive search of the optimization space
– Validated only at low concurrency (~tens of procs), might break at high
concurrency, might break for torus networks
• Our approach:
– Use performance model for ideal implementation
– Understand hardware resource constraints and the variation of
performance parameters (understand trends not absolute values)
– Derive implementation constraints to satisfy both optimal
implementation and hardware constraints
– Force implementation parameters to converge towards optimal
Performance
• Network: bandwidth and overhead
Overhead is determined by
message size, communication
schedule, hardware flow of control
Bandwidth is determined
by message size, communication
schedule, fairness of allocation
• Application: communication pattern and schedule (congestion),
computation
Iancu, Strohmaier
Validation
• Understand network behavior in the presence of non-blocking
communication (Infiniband, Elan)
• Develop performance model for scenarios widely encountered in
applications (p2p, scatter, gather, all-to-all) and a variety of aggressive
optimization techniques (strip mining, pipelining, communication schedule
skewing)
• Use both micro-benchmarks and application kernels to
validate approach
Iancu, Strohmaier
Findings
•
•
Can choose optimal values for implementation parameters
Time accurate model for an implementation: hard to develop, inaccurate at
high concurrency
•
Methodology does not require exhaustive search of the optimization space
(only p2p and qualitative behavior of gather)
•
In practice one can produce templatized implementations for an algorithm and
use our approach to determine optimal values: code generation (UPC), automatic
tuning of collective operations, application development
•
Need to further understand the mathematical and statistical properties
End
Avoid synchronization: Data-driven
Execution
• Many algorithms require
synchronization with remote processor
• What is the right mechanism in a PGAS
model for doing this?
• Is it still one-sided?
Part 3: Event-Driven Execution Models
Mechanisms for Event-Driven
Execution
• “Put” operation does a send side notification
– Needed for memory consistency model ordering
• Need to signal remote side on completion
• Strategies:
– Have remote side do a “get” (works in some
algorithms)
– Put + strict flag write: do a put, wait for completion,
then do another (strict) put
– Pipelined put + put-flag: works only on ordered
networks
– Signaling put: add new “store” operation that
embeds signal (2nd remote address) into single
message
Mechanisms for Event-Driven
Execution
Signalling Store
memput signal
upc memput + strict flag
pipelined memput (unsafe)
MPI n-byte ping/pong
MPI send/rec
60
time (usec)
50
40
30
20
10
0
1
10
100
1000
10000
payload size (Bytes)
Preliminary results on Opteron + Infiniband
100000