Low Latency Message Passing Over Gigabit Ethernet
Low Latency Messaging Over Gigabit Ethernet
Keith Fenech
CSAW
24 September 2004
Why Cluster Computing?
Ideal for computationally intensive applications.
Multi-threaded processes allow jobs to be processed in parallel over multiple CPUs.
High bandwidth allows interconnected nodes to achieve supercomputer performance.
Networks of Workstations (NOWs) [1]
Easily available (commodity platforms)
Relatively cheap
Nodes may be used independently or as a cluster
Better utilization of idle computing resources.
High Performance Networking
Commodity networks dominated by IP over Ethernet
Performance is directly affected by:
Hardware – bus & network bandwidths
Latency – delay incurred in communicating a message from source to destination
Overhead – length of time that a processor is engaged in tx/rx of each message
Fine-grain threads communicate frequently using small messages.
A high-performance communication architecture should feature:
transparency to the application layer
high throughput for bandwidth-intensive applications
low latencies for frequently communicating threads
minimal protocol processing overhead on the host machine
Gigabit performance is not achievable at the application layer. Why?
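To make the question concrete, consider a simple linear cost model in the spirit of the well-known LogP family (an illustration, not something stated in the talk). The end-to-end time for a message of s bytes is roughly

    T(s) ≈ o_send + L + s/B + o_recv

where L is the network latency, B the link bandwidth, and o_send/o_recv the per-message software overheads at each end. For the small messages exchanged by fine-grain threads the s/B term is negligible, so host-side overhead and latency, not the gigabit link itself, dominate performance.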
Conventional NICs & Protocols
Receiver node
Ethernet controller receives frame
Check CRC for frame
Filter MAC destination address
NIC generates HW interrupt to notify host
PCI transfer to host memory
CPU suspends current task & launches interrupt handler to service the high-priority interrupt
Check network layer (IP) header & verify checksum
Parse routing tables & store valid IP datagrams in IP buffer
Reassemble fragmented datagrams in host memory
Call transport layer (TCP/UDP) functions
Deliver packet to application layer
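Every one of the steps above costs host CPU time for every frame. The following sketch, which is illustrative only and not code from this work, models that per-frame path in user space; frame_t, ip_checksum_ok() and deliver_to_socket() are hypothetical stand-ins for the real driver and protocol-stack routines.

/* Illustrative model of the conventional per-frame receive path (hypothetical). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint8_t dst_mac[6];        /* filtered by the Ethernet controller   */
    uint8_t payload[1500];     /* frame data DMAed over the PCI bus     */
    size_t  len;
} frame_t;

/* Stand-in for the real 16-bit ones'-complement IP header checksum. */
static int ip_checksum_ok(const uint8_t *pkt, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += pkt[i];
    return sum != 0;
}

/* Stand-in for the transport-layer and socket delivery steps. */
static void deliver_to_socket(const uint8_t *pkt, size_t len) {
    (void)pkt;
    printf("delivered %zu bytes to the application\n", len);
}

/* Invoked once per frame: the work the host CPU does in interrupt context. */
static void handle_frame_interrupt(const frame_t *nic_frame) {
    uint8_t host_buf[1500];
    memcpy(host_buf, nic_frame->payload, nic_frame->len); /* copy into host memory      */
    if (!ip_checksum_ok(host_buf, nic_frame->len))        /* IP header & checksum check */
        return;
    /* routing lookup, reassembly and TCP/UDP processing would follow here */
    deliver_to_socket(host_buf, nic_frame->len);          /* copy up to the application */
}

int main(void) {
    frame_t f = { .len = 64 };
    memset(f.payload, 0xAB, f.len);
    for (int i = 0; i < 3; i++)    /* three synthetic frames, three "interrupts" */
        handle_frame_interrupt(&f);
    return 0;
}

Even in this toy form, the copies and per-frame bookkeeping hint at why interrupt-driven, layered processing saturates the host CPU long before it saturates the link.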
Problems With Conventional Protocols & Architectures
NIC generates a CPU interrupt for each frame
Servicing interrupts involves expensive vertical switch to kernel space.
Software interrupts to pass IP datagrams to upper layers
Servicing incoming packets results in high host CPU load
Risk of Receiver Livelock scenarios (as in Denial of Service attacks)
PCI bus startup overheads for each message
Layered protocols imply expensive memory-to-memory buffer copies
Available Techniques
Bypass kernel for critical data paths
Buffer & protocol processing moved to user-space
User-level hardware access
Zero-copy techniques
Scatter/Gather techniques
Larger MTUs (Jumbo frames)
Larger DMA transfers avoid PCI startup overheads
Interrupt coalescing
Message descriptors & polling replace interrupts
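The last item lends itself to a short illustration. The sketch below is a minimal model under assumed names (the descriptor layout and poll_rx_ring() are not an existing driver interface): the host discovers completed frames by polling a descriptor ring shared with the NIC instead of taking an interrupt per frame.

/* Hypothetical receive-descriptor ring: the NIC marks a slot ready after DMAing
 * a frame; the host finds it by polling, so no per-frame interrupt is raised.  */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8

typedef struct {
    volatile int ready;      /* set by the NIC, cleared by the host */
    uint32_t     length;     /* bytes DMAed into the pinned buffer  */
    uint8_t      buf[1514];  /* pinned receive buffer               */
} rx_desc_t;

static rx_desc_t ring[RING_SIZE];
static unsigned  head;       /* next slot the host expects to complete */

/* Host-side polling pass: consumes every completed slot, returns the count. */
static int poll_rx_ring(void) {
    int n = 0;
    while (ring[head].ready) {                 /* a memory read, not an interrupt */
        printf("frame of %u bytes in slot %u\n",
               (unsigned)ring[head].length, head);
        ring[head].ready = 0;                  /* hand the slot back to the NIC   */
        head = (head + 1) % RING_SIZE;
        n++;
    }
    return n;
}

int main(void) {
    /* Stand-in for the NIC firmware: pretend two frames were DMAed in. */
    ring[0].length = 64;   ring[0].ready = 1;
    ring[1].length = 1500; ring[1].ready = 1;
    printf("consumed %d frames by polling\n", poll_rx_ring());
    return 0;
}

Combined with larger DMA transfers and interrupt coalescing, this removes the per-frame interrupt and context-switch cost from the critical path.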
Current Solutions
Enabled by programmable NICs
Virtual Interface Architecture (VIA) [2]
U-Net [3] (ATM)
Myrinet GM [4] and Illinois FM [5] (Myrinet)
QsNet [6] (Quadrics)
EMP [7] (Ethernet)
Our Proposal
NOWs running over Gigabit Ethernet
Use Tigon-2 programmable NIC features (onboard CPU, memory, DMA)
Design a reliable lightweight communication protocol for GE
Reliable network (ordered & lossless packet delivery)
Low-overhead
Low-latency
Offload protocol processing from host CPU onto NIC CPU
Interrupt-free architecture (message descriptor queues + polling)
OS Bypass: user applications & NIC hardware communicate through pinned-down shared memory.
Zero Copy
Dynamic MTUs & DMA sizes – reduce PCI startup overheads
Tackle 2 application scenarios
Small messages – Latency is critical
Large bandwidth – Throughput is critical
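As an illustration of how the OS-bypass and descriptor-queue points might fit together on the host side, the sketch below posts a send descriptor directly from user space; tx_desc_t, post_send() and the queue layout are assumptions made for the example, not the protocol actually proposed here.

/* Hypothetical OS-bypass send queue: the application fills a descriptor that
 * points at its own pinned buffer; the NIC firmware DMAs the data from there,
 * so no system call and no intermediate copy sit on the send path.            */
#include <stdint.h>
#include <stdio.h>

#define TXQ_SIZE 16

typedef struct {
    volatile int owned_by_nic;  /* 1 = posted; firmware DMAs the data and clears it */
    const void  *buf;           /* user buffer, assumed pinned/registered           */
    uint32_t     len;           /* message length in bytes                          */
} tx_desc_t;

static tx_desc_t txq[TXQ_SIZE]; /* would live in pinned memory mapped to the NIC */
static unsigned  tail;          /* next free descriptor slot                      */

/* Post a message without entering the kernel: fill a descriptor and flag it. */
static int post_send(const void *buf, uint32_t len) {
    tx_desc_t *d = &txq[tail % TXQ_SIZE];
    if (d->owned_by_nic)
        return -1;              /* queue full: caller polls for completions and retries */
    d->buf = buf;
    d->len = len;
    d->owned_by_nic = 1;        /* hand ownership of the slot to the NIC */
    tail++;
    return 0;
}

int main(void) {
    static const char msg[] = "small latency-critical message";
    if (post_send(msg, sizeof msg) == 0)
        printf("descriptor posted; NIC will DMA %zu bytes\n", sizeof msg);
    return 0;
}

Because the NIC reads the payload straight from the application's pinned buffer, the same mechanism can serve both scenarios: small messages skip the kernel entirely, while large ones are moved in few, large DMA transfers.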
Conclusion
Provide a high performance communication API
Replace the PVM [9] & MPI [8] message-passing libraries
Fine-grained thread communication
High Bandwidth applications
Remove the network communication bottleneck in user-level thread messaging.
Interface with the SMASH [10] user-level thread scheduler
Multi-threaded applications can run seamlessly over a cluster of SMPs.
Achieve higher throughput with minimal usage of host CPU resources.
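To give the API goal a concrete shape, one possible user-level interface is sketched below; these names and signatures are illustrative guesses at what such an API could look like, not the interface defined by this work.

/* Hypothetical user-level messaging API sitting between the thread scheduler
 * and the NIC; all names, types and semantics are assumptions for this sketch. */
#include <stddef.h>
#include <stdio.h>

typedef int ll_endpoint_t;   /* handle to a registered, pinned communication endpoint */

/* Open an endpoint and register its pinned buffers with the NIC (stubbed). */
static int ll_open(ll_endpoint_t *ep) { *ep = 1; return 0; }

/* Post a message to a remote node; returns 0 once the descriptor is queued (stubbed). */
static int ll_send(ll_endpoint_t ep, int node, const void *buf, size_t len) {
    (void)buf;
    printf("endpoint %d -> node %d: %zu bytes queued for NIC DMA\n", ep, node, len);
    return 0;
}

/* Poll the receive descriptor queue; returns bytes received, or 0 if none (stubbed). */
static size_t ll_poll(ll_endpoint_t ep, void *buf, size_t cap) {
    (void)ep; (void)buf; (void)cap;
    return 0;
}

int main(void) {
    ll_endpoint_t ep;
    char msg[] = "hello";
    if (ll_open(&ep) == 0 && ll_send(ep, /*node=*/2, msg, sizeof msg) == 0)
        (void)ll_poll(ep, msg, sizeof msg);
    return 0;
}

A polling receive call rather than a blocking one would let a user-level thread scheduler such as SMASH [10] run another ready thread while a message is in flight.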
References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing, 1997.
2. Compaq, Intel, and Microsoft. Virtual Interface Architecture Specification, draft revision 1.0, December 1997.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 40–53. ACM Press, 1995.
4. Myricom Inc. Myrinet GM – the low-level message-passing system for Myrinet networks.
5. S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. 1995.
6. F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
7. P. Shivam, P. Wyckoff, and D. Panda. EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing. 2001.
8. Message Passing Interface Forum. MPI-2: A Message Passing Interface Standard. International Journal of High Performance Computing Applications, 12(1–2):1–299, 1998.
9. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine – A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass., 1994.
10. K. Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Master's thesis, University of Malta, 2001.
Thank you!