
Integrating New Capabilities into NetPIPE
Dave Turner, Adam Oline, Xuehua Chen,
and Troy Benjegerdes
Scalable Computing Laboratory of Ames Laboratory
This work was funded by the MICS office of the US Department of Energy
NetPIPE, the Network Protocol Independent Performance Evaluator, runs on:
+ 2-sided protocols: MPI (MPICH, LAM/MPI, MPI/Pro, MP_Lite), PVM, and TCGMSG (which itself runs on ARMCI or MPI).
+ Native software layers: TCP (workstations, PCs, clusters), GM (Myrinet cards), InfiniBand (Mellanox VAPI), ARMCI (over TCP, GM, VIA, Quadrics, LAPI), LAPI (IBM SP), and SHMEM (Cray T3E, SGI systems).
+ 1-sided protocols: MPI-2 1-sided (MPI_Put or MPI_Get), SHMEM 1-sided puts and gets, and ARMCI & GPSHMEM.
+ Internal systems: memcpy.
+ Basic send/recv with options to guarantee pre-posting or use MPI_ANY_SOURCE (see the sketch below).
+ Option to measure performance without cache effects.
+ One-sided communications using either Get or Put, with or without fence calls.
+ Measure performance or do an integrity test.
http://www.scl.ameslab.gov/Projects/NetPIPE/
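
To make the pre-posting option concrete, here is a minimal sketch (not NetPIPE's actual source; the buffer size and tag are arbitrary) contrasting a pre-posted receive with a wildcard MPI_ANY_SOURCE receive:

    #include <mpi.h>

    #define NBYTES 1024   /* arbitrary message size for illustration */

    /* Pre-posted receive: the receive is posted before the matching send
     * arrives, so the message can land directly in the user buffer. */
    void preposted_recv(char *buf, int src, MPI_Comm comm)
    {
        MPI_Request req;
        MPI_Irecv(buf, NBYTES, MPI_BYTE, src, 0, comm, &req);
        /* ... the sender is signaled to start only after this point ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    /* Wildcard receive: the source is unknown in advance, so the library
     * may have to buffer the message internally before matching it. */
    void any_source_recv(char *buf, MPI_Comm comm)
    {
        MPI_Recv(buf, NBYTES, MPI_BYTE, MPI_ANY_SOURCE, 0, comm,
                 MPI_STATUS_IGNORE);
    }

Pre-posting typically avoids an extra copy through the library's internal buffers, which is why NetPIPE offers both modes.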
The NetPIPE utility
NetPIPE does a series of ping-pong tests between two nodes.
Message sizes are chosen at regular intervals, and with slight perturbations, to fully test
the communication system for idiosyncrasies.
Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
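
A minimal sketch of this kind of ping-pong measurement, written against plain MPI for concreteness (the real NetPIPE modules, message-size perturbations, and repetition counts differ):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* One ping-pong measurement between ranks 0 and 1.  The returned
     * latency is half the average round-trip time, as described above. */
    double pingpong(int nbytes, int nrepeat)
    {
        int rank;
        char *buf = malloc(nbytes);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < nrepeat; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * nrepeat);  /* half the round trip */

        free(buf);
        return t;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Sweep message sizes; throughput in Mbps = 8*nbytes / (t * 1e6). */
        for (int nbytes = 1; nbytes <= (1 << 20); nbytes *= 2) {
            double t = pingpong(nbytes, 100);
            if (rank == 0)
                printf("%8d Bytes  %10.2f us  %10.2f Mbps\n",
                       nbytes, t * 1e6, 8.0 * nbytes / (t * 1e6));
        }
        MPI_Finalize();
        return 0;
    }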
Some typical uses
Measuring the overhead of message-passing protocols.
Helping to tune the optimization parameters of message-passing libraries.
Optimizing driver and OS parameters (socket buffer sizes, etc.).
Identifying dropouts in networking hardware and drivers.
What is not measured
NetPIPE cannot measure the load on the CPU yet.
The effects of the different methods for maintaining message progress.
Scalability with system size.
Recent additions to NetPIPE
Can do an integrity test instead of measuring performance.
Streaming mode measures performance in 1 direction only.
Must reset sockets to avoid effects from a collapsing window size.
A bi-directional ping-pong mode has been added (-2).
One-sided Get and Put calls can be measured (MPI or SHMEM).
Can choose whether to use an intervening MPI_Fence call to synchronize.
Messages can be bounced between the same buffers (the default mode),
or they can be started from a different area of memory each time (see the sketch below).
There are lots of cache effects in SMP message-passing.
InfiniBand can show similar effects since memory must be registered with the card.
[Diagram: Process 0 and Process 1 exchanging messages from buffer regions labeled 0-3.]
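
A rough sketch of how the no-cache option can be arranged (the pool size and offsets here are illustrative, not NetPIPE's actual layout): instead of reusing one hot buffer, each iteration works from a different region of a larger allocation, so the data must come from main memory each time.

    #include <stdlib.h>

    #define NREGIONS 4   /* illustrative; enough to defeat reuse of a hot buffer */

    /* Allocate one pool several times larger than the message. */
    char *make_pool(size_t nbytes)
    {
        return malloc(NREGIONS * nbytes);
    }

    /* Buffer to use on iteration i: walks through the pool so consecutive
     * sends and receives never touch the same cache lines. */
    char *region(char *pool, size_t nbytes, int i)
    {
        return pool + (size_t)(i % NREGIONS) * nbytes;
    }

In the default mode the ping-pong loop reuses one buffer every iteration; in the no-cache mode it uses something like region(pool, nbytes, i) instead, which also forces fresh memory registration on interconnects such as InfiniBand.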
Current projects
Overlapping pair-wise ping-pong tests.
Must consider synchronization if not using bi-directional communications.
[Diagram: nodes n0-n3 communicating pair-wise through an Ethernet switch, comparing line-speed performance to end-point limited performance.]
Investigate other methods for testing the global network.
Evaluate the full range from simultaneous nearest neighbor communications to all-to-all.
Performance on Mellanox InfiniBand cards
A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv).
Burst mode preposts all receives to duplicate the Mellanox test.
The no-cache performance is much lower when the memory has to be registered with the card.
An MP_Lite InfiniBand module will be incorporated into LAM/MPI.
[Graph: Throughput in Mbps vs. message size in Bytes for IB VAPI burst mode, IB VAPI Send/Recv, MVAPICH 0.9.1 (7.5 us latency), and MVAPICH without cache effects.]
10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards on a 133 MHz PCI-X bus, single-mode fiber, Intel ixgb driver.
Can only achieve 2 Gbps now, with a latency of 75 us.
Streaming mode delivers up to 3 Gbps.
Much more development work is needed.
[Graph: Throughput in Mbps vs. message size in Bytes for 10 GigE (75 us latency) and 10 GigE in streaming mode.]
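
The streaming numbers come from NetPIPE's one-directional mode; below is a bare-bones sketch of that style of measurement over an already connected TCP socket (the repetition count, partial-write handling, and closing acknowledgement are simplifying assumptions, not NetPIPE's code).

    #include <sys/time.h>
    #include <unistd.h>

    /* Stream nrepeat messages of nbytes in one direction, then wait for a
     * one-byte acknowledgement so the timing covers the whole transfer.
     * Returns throughput in Mbps. */
    double stream_send(int sock, char *buf, int nbytes, int nrepeat)
    {
        struct timeval t0, t1;
        char ack;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < nrepeat; i++) {
            int sent = 0;
            while (sent < nbytes) {              /* write() may be partial */
                int n = write(sock, buf + sent, nbytes - sent);
                if (n <= 0) return -1.0;
                sent += n;
            }
        }
        read(sock, &ack, 1);                     /* receiver confirms completion */
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
        return 8.0 * (double)nbytes * nrepeat / (secs * 1e6);
    }

Because a long one-way stream can collapse the TCP window, NetPIPE resets the sockets between streaming measurements; that step is omitted here.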
Channel-bonding Gigabit Ethernet
for better communications between nodes
Channel bonding in a cluster
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster.
GigE cards cost ~$40 each and 24-port switches cost ~$1400, or roughly $100 per computer.
This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.
[Diagram: two PCs, each with CPUs, cache, and memory feeding two NICs over the PCI bus, with all NICs connected through the network switch.]
Performance for channel-bonded Gigabit Ethernet
GigE can deliver 900 Mbps with latencies of 25-62 us for PCs with 64-bit / 66 MHz PCI slots.
Channel-bonding 2 GigE cards per PC using MP_Lite doubles the performance for large messages. Adding a 3rd card does not help much.
Channel-bonding 2 GigE cards per PC using Linux kernel-level bonding actually results in poorer performance.
The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better. Any message-passing system could then make use of channel-bonding on Linux systems.
[Graph: Throughput in Mbps vs. message size in Bytes, channel-bonding multiple GigE cards using MP_Lite (2 and 3 GigE) and Linux kernel bonding (2 GigE), compared to a single GigE card.]
Channel-bonding in MP_Lite
[Diagram: on node 0, MP_Lite splits the application's message into parts a and b in user space; each part passes through its own large socket buffer and TCP/IP stack in kernel space, then through dev_q_xmit and DMA into its own GigE card's device queue.]
Flow control may stop a given stream at several places.
With MP_Lite channel-bonding, each stream is independent of the others.
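
A highly simplified sketch of that striping idea (not MP_Lite's actual code; the sockets are assumed to be connected and set non-blocking elsewhere, and a real implementation would use select() rather than spinning):

    #include <errno.h>
    #include <unistd.h>

    /* Stripe one message across nstreams sockets, one per GigE card
     * (up to 8 streams here).  Each stream gets its own contiguous
     * slice; non-blocking writes let a flow-controlled stream wait
     * without stalling the others. */
    int striped_send(int *socks, int nstreams, const char *buf, int nbytes)
    {
        int chunk = (nbytes + nstreams - 1) / nstreams;
        int sent[8] = {0};                     /* bytes written per stream */
        int done = 0;

        while (done < nstreams) {
            done = 0;
            for (int s = 0; s < nstreams; s++) {
                int off = s * chunk;
                int len = nbytes - off;
                if (len > chunk) len = chunk;
                if (len < 0)     len = 0;
                if (sent[s] >= len) { done++; continue; }
                int n = write(socks[s], buf + off + sent[s], len - sent[s]);
                if (n > 0)
                    sent[s] += n;
                else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                    return -1;                 /* real error on this stream */
            }
        }
        return 0;
    }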
Linux kernel channel-bonding
[Diagram: on node 0, the application's message passes through a single large socket buffer and TCP/IP stack; bonding.c then splits the stream (via dev_q_xmit) across the device queues and DMA engines of the two GigE cards.]
A full device queue will stop the flow at bonding.c to both device queues.
Flow control on the destination node may stop the flow out of the socket buffer.
In both of these cases, problems with one stream can affect both streams.
Comparison of high-speed interconnects
Atoll delivers 1890 Mbps with a 4.7 us latency.
SCI delivers 1840 Mbps with only a 4.2 us latency.
Myrinet performance reaches 1820 Mbps with an 8 us latency.
Channel-bonded GigE offers 1800 Mbps for very large messages.
Gigabit Ethernet delivers 900 Mbps with a 25-62 us latency.
10 GigE only delivers 2 Gbps with a 75 us latency.
InfiniBand can deliver 4500-6500 Mbps at a 7.5 us latency.
[Graph: Throughput in Mbps vs. message size in Bytes for InfiniBand RDMA (7.5 us), InfiniBand without cache effects, Atoll (4.7 us), SCI (4.2 us), Myrinet (8 us), 2xGigE (62 us), and GigE (62 us).]
Conclusions
• NetPIPE provides a consistent set of analytical tools in the same flexible
framework for many message-passing and native communication layers.
• New modules have been developed.
– 1-sided MPI and SHMEM
– GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
– Internal tests like memcpy
• New modes have been incorporated into NetPIPE.
– Streaming and bi-directional modes.
– Testing without cache effects.
– The ability to test integrity instead of performance.
Current projects
• Developing new modules.
– ATOLL
– IBM Blue Gene/L
– I/O performance
• Need to be able to measure CPU load during communications.
• Expanding NetPIPE to do multiple pair-wise communications.
– Can measure the backplane performance on switches.
– Compare the line speed to end-point limited performance.
• Working toward measuring more of the global properties of a network.
– The network topology will need to be considered.
Contact information
Dave Turner - [email protected]
http://www.scl.ameslab.gov/Projects/MP_Lite/
http://www.scl.ameslab.gov/Projects/NetPIPE/
One-sided Puts between two Linux PCs
MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
LAM/MPI has no message progress, so a fence is required.
ARMCI uses a polling method, and therefore does not require a fence.
MPI-2 implementations of MPICH and MPI/Pro are under development.
[Graph: Throughput in Mbps vs. message size in Bytes for raw TCP, MP_Lite, LAM/MPI, and ARMCI over Netgear GA620 fiber GigE cards (32/64-bit, 33/66 MHz PCI, AceNIC driver).]
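
For reference, a minimal sketch of the fenced one-sided pattern being timed here, using standard MPI-2 calls (the window layout, datatype, and epoch structure are illustrative, not the NetPIPE module itself):

    #include <mpi.h>

    /* Expose a preallocated buffer as an RMA window (every rank calls this). */
    MPI_Win make_window(double *buf, int n)
    {
        MPI_Win win;
        MPI_Win_create(buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        return win;
    }

    /* One-sided put into the target rank's window.  With a library that
     * makes independent progress (SIGIO- or polling-based), the data can
     * move as soon as MPI_Put returns; with LAM/MPI it is the fences that
     * actually drive the transfer, which is what the curves above show. */
    void put_with_fence(MPI_Win win, const double *src, int n, int target)
    {
        MPI_Win_fence(0, win);                        /* open access epoch */
        MPI_Put((void *)src, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                        /* complete the put  */
    }

NetPIPE can also time the puts without the intervening fence call, which shows how much of the cost comes from the synchronization itself.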
The MP_Lite message-passing library
[Diagram: applications use either a subset of the MPI commands or the MP_Lite syntax on top of MP_Lite, which in turn runs on InfiniBand (Mellanox VAPI), VIA OS-bypass (Giganet hardware, M-VIA Ethernet), TCP (workstations, PCs), an SMP shared-memory segment, mixed systems of distributed SMPs, SHMEM one-sided functions (Cray T3E, SGI Origins), or MPI to retain portability for the MP_Lite syntax.]
• A light-weight MPI implementation
• Highly efficient for the architectures supported
• Designed to be very user-friendly
• Ideal for performing message-passing research
http://www.scl.ameslab.gov/Projects/MP_Lite/
A NetPIPE example: Performance on a Cray T3E
Raw SHMEM delivers 2600 Mbps with a 2-3 us latency.
MP_Lite delivers 2600 Mbps with a 9-10 us latency.
Cray MPI originally delivered 1300 Mbps with a 20 us latency.
The new Cray MPI delivers 2400 Mbps with a 20 us latency.
[Graph: Throughput in Mbps vs. message size in Bytes for MP_Lite, raw SHMEM, the new Cray MPI, and the old Cray MPI.]
The tops of the spikes are where the message size is divisible by 8 Bytes.
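
For comparison with the raw-SHMEM curve, here is a minimal flag-based SHMEM ping-pong sketch in the classic Cray/SGI style (the header name, two-PE assumption, message size, and repetition count are all illustrative):

    #include <mpp/shmem.h>   /* classic Cray/SGI SHMEM header */

    #define NBYTES  8192             /* arbitrary message size     */
    #define NREPEAT 100              /* arbitrary repetition count */

    static char buf[NBYTES];         /* symmetric data buffer      */
    static long flag = 0;            /* symmetric completion flag  */

    int main(void)
    {
        start_pes(0);
        int me    = _my_pe();        /* shmem_my_pe() on newer libraries */
        int other = 1 - me;          /* assumes exactly two PEs */

        for (long i = 1; i <= NREPEAT; i++) {
            if (me == 0) {
                shmem_putmem(buf, buf, NBYTES, other);    /* push the data    */
                shmem_fence();                            /* order data, flag */
                shmem_long_p(&flag, i, other);            /* then raise flag  */
                shmem_wait_until(&flag, SHMEM_CMP_EQ, i); /* wait for reply   */
            } else {
                shmem_wait_until(&flag, SHMEM_CMP_EQ, i);
                shmem_putmem(buf, buf, NBYTES, other);
                shmem_fence();
                shmem_long_p(&flag, i, other);
            }
        }
        shmem_barrier_all();
        return 0;
    }

Timing this loop and halving the round-trip time gives the kind of latency figures quoted above.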