Performance Evaluation of
Gigabit Ethernet-Based
Interconnects for HPC Clusters
Paweł Pisarczyk [email protected]
Jarosław Węgliński [email protected]
Cracow, 16 October 2006
Agenda
Introduction
HPC cluster interconnects
Message propagation model
Experimental setup
Results
Conclusions
Who we are
joint stock company
founded in 1994; operating since 1991 as a department within PP ATM
IPO in September 2004 (Warsaw Stock Exchange)
major shares owned by founders (Polish citizens)
no state capital involved
financial data
share capital of about €6 million
2005 sales: €29.7 million
about 230 employees
Mission
building business value through innovative
information & communication technology initiatives
creating new markets in Poland and abroad
ATM's competitive advantage is based on combining three
key competences:
integration of comprehensive IT systems
telecommunication services
consulting and software development
Achievements
1991 Poland’s first company connected to the Internet
1993 Poland’s first commercial ISP
1994 Poland’s first LAN with ATM backbone
1994 Poland’s first supercomputer on Dongarra’s Top 500 list
1995 Poland’s first MAN in ATM technology
1996 Poland’s first corporate network with voice & data integration
2000 Poland’s first prototype Interactive TV system over a public network
2002 Poland’s first validated MES system for a pharmaceutical factory
2003 Poland’s first commercial, public Wireless LAN
2004 Poland’s first public IP content billing system
Client base
telecommunications 40.4%
finance 17.7%
other 11.4%
academia 9.6%
media 5.9%
utilities 5.7%
manufacturing 4.9%
public 4.4%
(based on 2005 sales revenues)
HPC clusters developed by ATM
2004 - Poznan Supercomputing and Networking Center
238 Itanium2 CPUs, 119 x HP rx2600 nodes with Gigabit Ethernet interconnect
2005 - University of Podlasie
34 Itanium2 CPUs, 17 x HP rx2600 nodes with Gigabit Ethernet interconnect and Lustre 1.2 filesystem
2005 - Poznan Supercomputing and Networking Center
86 dual-core Opteron CPUs, 42 x Sun SunFire v20z and 1 x Sun SunFire v40z with Gigabit Ethernet interconnect
2006 - Military University of Technology, Faculty of Engineering, Chemistry and Applied Physics
32 Itanium2 CPUs, 16 x HP rx1620 nodes with Gigabit Ethernet interconnect
2006 - Gdansk University of Technology, Department of Pharmaceutical Technology and Chemistry
22 Itanium2 CPUs (11 x HP rx1620) with Gigabit Ethernet interconnect
Selected software projects related to
distributed systems
Distributed Multimedia Archive in Interactive Television (iTVP)
Project
scalable storage for the iTVP platform with the ability to process the stored content
ATM Objects
scalable storage for multimedia content distribution platform
system for Cinem@n company (founded by ATM and Monolith)
Cinem@n will introduce digital distribution services for high-quality movies, news and entertainment content
Spread Screens Manager
platform for POS TV
the system is currently used by Zabka (retail chain) and Neckermann (travel service)
about 300 terminals presenting multimedia content, located in many Polish cities
Selected current projects
ATMFS
distributed filesystem for petabyte scale storage based on COTS
based on variable-sized chunks
advanced replication and enhanced error detection
dependability evaluation based on a software fault injection technique
FastGig
RDMA stack for Gigabit Ethernet-based interconnects
reduces message passing latency
increases application performance
Uses of computer networks in HPC
clusters
Exchange of messages between cluster nodes to
coordinate distributed computation
requires both high peak throughput and low latency
inefficiency is observed when the time spent in a single computation step is comparable to the message passing time (see the ping-pong sketch below)
Access to shared data through a network or cluster file system
requires high bandwidth when transferring data in blocks of a defined size
filesystem and storage drivers try to reduce the number of I/O operations issued (by buffering data and aggregating transfers)
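To make the latency-bound case concrete, here is a minimal MPI ping-pong sketch in C (not the NetPipe benchmark used later in this talk); it estimates one-way latency as half of the averaged round-trip time for a small fixed-size message. The message size, iteration count and use of blocking MPI_Send/MPI_Recv are illustrative choices only.

/* Minimal MPI ping-pong latency sketch: rank 0 and rank 1 bounce a small
 * message; half the average round-trip time approximates one-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char buf[64];                       /* small message, latency-bound case */
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency estimate: %.2f us\n",
               (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

Run with one rank per node, e.g. mpirun -np 2 ./pingpong, so the message actually crosses the interconnect.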
Comparison of characteristics of
interconnect technologies
* Brett M. Bode, Jason J. Hill, and Troy R. Benjegerdes, “Cluster Interconnect Overview”, Scalable Computing Laboratory, Ames Laboratory
Gigabit Ethernet interconnect
characteristic
Popular technology for low cost cluster interconnects
Popular technology for low-cost cluster interconnects
Satisfactory throughput for long frames (1000 bytes and longer)
High latency and low throughput for small frames
These drawbacks are mostly caused by the design of existing network interfaces
What is the influence of the network stack implementation on communication latency?
Message propagation model
Latency between handing a message to/from the MPI library and passing the data to/from the network stack
Time difference between the sendto()/recvfrom() functions and the driver start_xmit/interrupt functions
Execution time of the driver functions
Processing time of the network interface
Propagation latency and latency introduced by active network elements (see the labelling sketch below)
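The conclusions later refer to delays labelled T3 and T5; the original figure with those labels is not reproduced in this transcript, so the sketch below simply assumes the five stages above are numbered T1 through T5 in the order listed and sums them into the end-to-end one-way latency. The structure and field names are illustrative, not the authors' actual model code.

/* Hypothetical decomposition of one-way message latency into the stages
 * listed on this slide; the T1..T5 labels are assumed to follow the order
 * given above. */
struct msg_latency {
    double t1_mpi_to_stack_us;    /* MPI library <-> network stack handoff   */
    double t2_stack_to_driver_us; /* sendto()/recvfrom() <-> start_xmit/IRQ  */
    double t3_driver_exec_us;     /* execution time of driver functions      */
    double t4_nic_processing_us;  /* processing time of the network card     */
    double t5_wire_and_switch_us; /* propagation + active network elements   */
};

static double total_latency_us(const struct msg_latency *m)
{
    return m->t1_mpi_to_stack_us + m->t2_stack_to_driver_us +
           m->t3_driver_exec_us + m->t4_nic_processing_us +
           m->t5_wire_and_switch_us;
}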
Experimental setup
Two HP rx2600 servers
2 x Intel Itanium2 1.3 GHz, 3 MB cache
Debian GNU/Linux Sarge 3.1 operating system (kernel 2.6.8-2mckinley-smp)
Gigabit Ethernet interfaces
Broadcom BCM5701 chipset connected using PCI-X device bus
To eliminate the possibility of additional delays introduced by external active network devices, the servers were connected using crossover cables
Two NIC drivers were tested: tg3 (polling NAPI driver) and bcm5700 (interrupt-driven driver)
Tools used for measurements
NetPipe package for measuring throughput and latency for
TCP and several implementations of MPI
For low-level testing, test programs working directly on Ethernet frames were developed (see the sketch below)
The test programs and NIC drivers were modified to allow measuring, inserting and transferring timestamps
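As an illustration of such a low-level test program (the actual measurement code is not published here), the following C sketch sends a single raw Ethernet frame over an AF_PACKET socket. The interface name eth0, the experimental EtherType 0x88B5 and the peer MAC address are placeholders.

/* Send one raw Ethernet frame; SOCK_DGRAM lets the kernel build the
 * Ethernet header, which keeps the sketch short. Requires root. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <netpacket/packet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int s = socket(AF_PACKET, SOCK_DGRAM, htons(0x88B5));
    if (s < 0) { perror("socket"); return 1; }

    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(0x88B5);            /* experimental EtherType  */
    addr.sll_ifindex  = if_nametoindex("eth0");   /* interface name assumed  */
    addr.sll_halen    = ETH_ALEN;
    unsigned char peer[ETH_ALEN] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55}; /* placeholder MAC */
    memcpy(addr.sll_addr, peer, ETH_ALEN);

    unsigned char payload[46] = {0};              /* minimum Ethernet payload */
    if (sendto(s, payload, sizeof(payload), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("sendto");
    return 0;
}

The real test programs may well construct full frames with SOCK_RAW instead; the kernel-built header is just a convenience for this sketch.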
Throughput characteristic for tg3 driver
Latency characteristic for tg3 driver
Results for tg3 driver
The overhead introduced by the MPI library is relatively low
There is a big difference between transmission latencies in the ping-pong and streaming modes
The latency introduced for small frames is similar to the latency introduced by a 115 kbps UART (when transmitting a single byte)
We can deduce that there is some mechanism in the transmission path that delays the transmission of single packets
What is the difference between the NAPI and interrupt-driven drivers?
Interrupt-driven driver vs NAPI driver (throughput characteristic)
Interrupt-driven driver vs NAPI driver (latency characteristic)
Interrupt-driven driver vs NAPI driver (latency characteristic) - details
Comparison of bcm5700 and tg3 drivers
With the default configuration, the bcm5700 driver has worse characteristics than tg3
The interrupt-driven driver (default configuration) cannot achieve more than 650 Mb/s of throughput for frames of any size
After disabling interrupt coalescing, the performance of the bcm5700 driver exceeded the results obtained by the tg3 driver (see the coalescing sketch below)
Disabling polling can improve the characteristics of the network driver, but NAPI is not the major cause of the transmission delay
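Interrupt coalescing can be disabled either through driver module parameters or, as in the illustrative C sketch below, through the standard ethtool ioctl; it is not claimed that the experiment used this exact method, and eth0 is a placeholder interface name.

/* Read, then zero, the RX interrupt-coalescing settings of an interface
 * via the standard ethtool ioctl (SIOCETHTOOL). Requires root. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ifreq ifr;
    struct ethtool_coalesce ec;

    memset(&ifr, 0, sizeof(ifr));
    memset(&ec, 0, sizeof(ec));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* interface name assumed */
    ifr.ifr_data = (char *)&ec;

    ec.cmd = ETHTOOL_GCOALESCE;                   /* read current settings  */
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_GCOALESCE"); return 1; }
    printf("rx-usecs=%u rx-frames=%u\n",
           ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);

    ec.cmd = ETHTOOL_SCOALESCE;                   /* disable RX coalescing  */
    ec.rx_coalesce_usecs = 0;
    ec.rx_max_coalesced_frames = 1;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("ETHTOOL_SCOALESCE"); return 1; }

    close(fd);
    return 0;
}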
Tools for message processing time
measurement
Timestamps were inserted into the message at each processing stage
Processing stages on the transmitter side
sendto() function
bcm5700_start_xmit()
interrupt notifying frame transmit
Processing stages on the receiver side
interrupt notifying frame receipt
netif_rx()
recvfrom() function
The CPU clock cycle counter was used as a high-precision timer (resolution of 0.77 ns = 1/1.3 GHz); a read sketch follows below
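For reference, a minimal C sketch of such a cycle-counter read on ia64 is shown below; the helper name read_itc() and the surrounding main() are illustrative, and only the ar.itc access itself reflects the mechanism described on the slide.

/* On Itanium the interval time counter (ar.itc) ticks at the CPU
 * frequency, so one tick is 1/1.3 GHz, i.e. about 0.77 ns here. */
#include <stdio.h>

static inline unsigned long read_itc(void)
{
#if defined(__ia64__)
    unsigned long t;
    __asm__ __volatile__("mov %0=ar.itc" : "=r"(t));
    return t;
#else
    return 0;  /* other architectures would use their own cycle counter */
#endif
}

int main(void)
{
    unsigned long t0 = read_itc();
    /* ... code under measurement ... */
    unsigned long t1 = read_itc();
    printf("elapsed: %lu cycles (~%.2f ns at 1.3 GHz)\n",
           t1 - t0, (t1 - t0) / 1.3);
    return 0;
}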
Transmitter latency in streaming mode
[Timing diagram: Send 17 μs, Answer 17 μs, 2 μs]
Distribution of delays in transmission
path between cluster nodes
Conclusions
We estimate that RDMA-based communication can reduce the MPI message propagation time from 43 μs to 23 μs (nearly doubling the performance for short messages)
There is also a possibility of reducing the T3 and T5 latencies by changing the configuration of the network interface (transmit and receive thresholds)
In the conducted research we did not consider differences between network interfaces (the T3 and T5 delays may be longer or shorter than measured)
The latency introduced by a switch is also omitted
The FastGig project includes not only a communication library but also a measurement and communication-profiling framework