Low Latency Message Passing Over Gigabit Ethernet

Low Latency Messaging
Over Gigabit Ethernet
Keith Fenech
CSAW
24 September 2004
Why Cluster Computing?

Ideal for computationally intensive applications.

Multi-threaded processes allow jobs to be processed in parallel
over multiple CPUs.

High-bandwidth interconnects allow clustered nodes to achieve
supercomputer-class performance.

Networks of Workstations (NOWs) [1]

Easily available (commodity platforms)
Relatively cheap
Nodes may be used independently or as a cluster
Better utilization of idle computing resources.
High Performance Networking

Commodity networks dominated by IP over Ethernet

Performance is directly affected by:

Hardware – bus & network bandwidths
Latency – delay incurred in communicating a message from source to destination
Overhead – length of time that a processor is engaged in transmitting or receiving each message
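
How these three factors interact can be illustrated with a simple additive cost model. This sketch is not from the talk: it assumes transfer time = overhead + latency + size / bandwidth, and the 20 µs overhead and 30 µs latency defaults are placeholder values chosen only to show the trend.

```python
def transfer_time_s(size_bytes, bandwidth_bps=1e9, latency_s=30e-6, overhead_s=20e-6):
    """Time to deliver one message under a simple additive cost model.

    The default latency and overhead are illustrative assumptions,
    not measurements.
    """
    return overhead_s + latency_s + (size_bytes * 8) / bandwidth_bps


def effective_throughput_bps(size_bytes, **kwargs):
    """Payload bits delivered per second for back-to-back messages of one size."""
    return (size_bytes * 8) / transfer_time_s(size_bytes, **kwargs)


for size in (64, 1500, 64 * 1024, 1024 * 1024):
    mbps = effective_throughput_bps(size) / 1e6
    print(f"{size:>8} B  ->  {mbps:8.1f} Mbit/s")
```

Under these assumptions, small messages are dominated by the fixed per-message costs, while large messages approach the link bandwidth.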

Fine-grain threads communicate frequently using small messages.

A high-performance (HP) communication architecture should:

be transparent to the application layer
allow high throughput for bandwidth-intensive applications
provide low latencies for frequently communicating threads
minimise protocol processing overhead on the host machine

Gigabit performance is not achievable at the application layer. Why?
Conventional NICs & Protocols

Receiver node

Ethernet controller receives frame
Check CRC for frame
Filter MAC destination address
NIC generates HW interrupt to notify host
PCI transfer to host memory
CPU suspends current task & launches interrupt handler to service high-priority interrupt
Check network layer (IP) header & verify checksum
Parse routing tables & store valid IP datagrams in IP buffer
Reassemble fragmented datagrams in host memory
Call transport layer (TCP/UDP) functions
Deliver packet to application layer
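
One of the host-side steps listed above, verifying the IP header checksum, can be sketched as follows. This is a minimal illustration of the standard Internet checksum (RFC 1071), not code from the talk; the function names and the sample header are assumptions made for the example.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_ok(header: bytes) -> bool:
    """A header that includes a correct checksum field sums to zero."""
    return internet_checksum(header) == 0


# Illustrative 20-byte IPv4 header (UDP, 192.168.0.1 -> 192.168.0.199).
header = bytes.fromhex(
    "45000073" "00004000" "4011b861" "c0a80001" "c0a800c7"
)
print(ipv4_header_ok(header))
```

With conventional NICs the host CPU does this work for every received datagram, which is part of the per-packet cost the following slides discuss.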
Problems With Conventional
Protocols & Architectures

NIC generates a CPU interrupt for each frame

Servicing interrupts involves an expensive vertical switch to kernel space.

Software interrupts to pass IP datagrams to upper layers

Servicing incoming packets results in high host CPU load

Risk of Receiver Livelock scenarios (as in Denial of Service attacks)

PCI bus startup overheads for each message

Layered protocols imply expensive memory-to-memory buffer copies
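
To give a sense of scale for the one-interrupt-per-frame problem, the sketch below computes how many frames per second a saturated Gigabit Ethernet link can deliver. It uses standard Ethernet framing overheads. The 9000-byte jumbo payload is a common convention rather than part of the standard, and is an assumption here.

```python
# Per-frame on-wire overhead in bytes for standard Ethernet framing:
# preamble + start-of-frame delimiter (8), MAC header (14),
# frame check sequence (4), inter-frame gap (12).
WIRE_OVERHEAD = 8 + 14 + 4 + 12


def frames_per_second(payload_bytes, link_bps=1_000_000_000):
    """Maximum frame rate on a saturated link for a given payload size."""
    return link_bps / ((payload_bytes + WIRE_OVERHEAD) * 8)


for label, payload in (
    ("minimum (46 B)", 46),
    ("standard MTU (1500 B)", 1500),
    ("jumbo (9000 B, assumed)", 9000),
):
    print(f"{label:<24} {frames_per_second(payload):>12,.0f} frames/s")
```

With one interrupt per frame, each of these figures is also the interrupt rate the host CPU would have to sustain.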
Available Techniques

Bypass kernel for critical data paths

Buffer & protocol processing moved to user-space

User-level hardware access

Zero-copy techniques

Scatter/Gather techniques

Larger MTUs (Jumbo frames)

Larger DMA transfers avoid PCI startup overheads

Interrupt coalescing

Message descriptors & polling replace interrupts
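
Interrupt coalescing, one of the techniques listed above, can be illustrated with a small simulation. This is a sketch under assumed parameters, not a model of any particular NIC: an interrupt fires once `max_frames` frames are pending, or once `max_delay_us` has elapsed since the first pending frame arrived. Both names and their defaults are hypothetical.

```python
def count_interrupts(arrivals_us, max_frames=8, max_delay_us=100):
    """Count interrupts raised when frames are batched before notifying the host.

    arrivals_us  -- sorted frame arrival times in microseconds
    max_frames   -- raise an interrupt once this many frames are pending
    max_delay_us -- or once this long has passed since the first pending frame
    """
    interrupts = 0
    pending = 0
    first_pending = None
    for t in arrivals_us:
        # The delay timer for the current batch expired before this arrival.
        if pending and t - first_pending >= max_delay_us:
            interrupts += 1
            pending = 0
        if pending == 0:
            first_pending = t
        pending += 1
        if pending == max_frames:
            interrupts += 1
            pending = 0
    if pending:
        interrupts += 1          # timer flushes the final partial batch
    return interrupts


arrivals = [i * 12 for i in range(1000)]   # one frame every 12 µs (assumed load)
print("per-frame interrupts:", count_interrupts(arrivals, max_frames=1))
print("coalesced interrupts:", count_interrupts(arrivals))
```

The trade-off is that coalescing lowers CPU load at the cost of added latency for the first frame in each batch, which matters for the small-message case.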
Current Solutions






Enabled by programmable NICs
Virtual Interface Architecture (VIA) [2]
U-Net [3] (ATM)
Myrinet GM [4] and Illinois FM [5] (Myrinet)
QsNet [6] (Quadrics)
EMP [7] (Ethernet)
Our Proposal

NOWs running over Gigabit Ethernet

Use Tigon2 programmable NIC features (onboard CPU, memory, DMA)

Design a reliable lightweight communication protocol for GE


Reliable network (ordered & lossless packet delivery)

Low-overhead

Low-latency

Offload protocol processing from host CPU onto NIC CPU

Interrupt-free architecture (message descriptor queues + polling)

OS bypass: user applications & NIC hardware communicate through pinned-down shared memory.

Zero Copy

Dynamic MTUs & DMA sizes – reduce PCI startup overheads

Tackle two application scenarios

Small messages – latency is critical

Bandwidth-intensive transfers – throughput is critical
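
The interrupt-free idea of message descriptor queues plus polling over shared memory can be sketched in a few lines. This is a toy, single-threaded illustration and not the proposed protocol: the `DescriptorRing` class, its slot sizes, and the use of a `bytearray` to stand in for pinned shared memory are all assumptions made for the example.

```python
class DescriptorRing:
    """Fixed-size single-producer/single-consumer ring of message descriptors."""

    def __init__(self, slots=8, slot_size=2048):
        self.slots = slots
        self.slot_size = slot_size
        self.pool = bytearray(slots * slot_size)  # stands in for pinned shared memory
        self.lengths = [0] * slots                # one descriptor (length) per slot
        self.head = 0   # next slot the producer (NIC side) fills
        self.tail = 0   # next slot the consumer (application side) reads

    def post(self, payload: bytes) -> bool:
        """Producer: write the payload in place, then publish its descriptor."""
        if self.head - self.tail == self.slots or len(payload) > self.slot_size:
            return False                          # ring full or payload too large
        slot = self.head % self.slots
        start = slot * self.slot_size
        self.pool[start:start + len(payload)] = payload
        self.lengths[slot] = len(payload)
        self.head += 1                            # publish only after data is in place
        return True

    def poll(self):
        """Consumer: return a zero-copy view of the next message, or None."""
        if self.tail == self.head:
            return None                           # nothing pending
        slot = self.tail % self.slots
        start = slot * self.slot_size
        view = memoryview(self.pool)[start:start + self.lengths[slot]]
        self.tail += 1
        return view


ring = DescriptorRing(slots=4, slot_size=64)
ring.post(b"small message")
message = ring.poll()
print(bytes(message))
```

The application learns about new messages by checking `head` against `tail` rather than by taking an interrupt, and it reads the payload through a view instead of copying it. A real design would also need memory barriers between NIC and host and an explicit way to return slots before they are reused.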
Conclusion

Provide a high performance communication API

Replace PVM [9] & MPI [8] protocols

Fine-grained thread communication

High Bandwidth applications

Remove network communication bottleneck in user-level thread
messaging.

Interface with SMASH [10]


user-level thread scheduler

Multi-threaded applications can run seamlessly over a cluster of SMPs.

Achieve higher throughput with minimal usage of host CPU resources.
References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing, 1997.
2. Compaq, Intel, and Microsoft. Virtual Interface Architecture Specification, draft revision 1.0 edition, December 1997.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: a user-level network interface for parallel and distributed computing. In Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 40–53. ACM Press, 1995.
4. Myricom Inc. Myrinet GM – the low-level message-passing system for Myrinet networks.
5. Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. 1995.
6. Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg. Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
7. Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing. 2001.
8. Message Passing Interface Forum. MPI2: A Message Passing Interface standard. International Journal of High Performance Computing Applications, 12(1–2):1–299, 1998.
9. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A User’s Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass., 1994.
10. Kurt Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Master’s thesis, University of Malta, 2001.
Thank you!