Faster! Faster!
Vidhyashankar Venkataraman
CS614 Presentation
U-Net : A User-Level Network
Interface for Parallel and
Distributed Computing
Background – Fast Computing
Emergence of MPPs (Massively Parallel Processors) in the early 90s
Repackage hardware components to form a dense configuration of very large parallel computing systems
But they require custom software
Alternative : NOW (Berkeley) – Network Of Workstations
Formed by inexpensive workstations connected by low-latency, high-bandwidth, scalable interconnects
Interconnected through fast switches
Challenge: To build a scalable system that is able to use the
aggregate resources in the network to execute parallel programs
efficiently
Issues
Problem with traditional networking architectures
The software path through the kernel involves several copies – processing overhead
With faster networks, applications may not get speed-ups commensurate with network performance
Observations:
Small messages: processing overhead dominates network latency
Most applications use small messages
E.g., UCB NFS trace: 50% of the bits sent were in messages of 200 bytes or less
Issues (contd.)
Flexibility concerns:
Protocol processing is done in the kernel
Greater flexibility if application-specific information is integrated into protocol processing
Can tune the protocol to the application's needs
E.g., customized retransmission of video frames
U-Net Philosophy
Achieve flexibility and performance by
Removing kernel from the critical path
Placing entire protocol stack at user level
Allowing protected user-level access to
network
Supplying full bandwidth to small messages
Supporting both novel and legacy protocols
Do MPPs do this?
Parallel machines like Meiko CS-2, Thinking
Machines CM-5
Have tried to solve the problem of providing user-level access to the network
But they use custom networks and network interfaces – no flexibility
U-Net targets applications on standard
workstations
Using off-the-shelf components
Basic U-Net architecture
Virtualize the network device so that each process has the illusion of owning the NI
A mux/demux device virtualizes the NI
Offers protection!
Kernel removed from
critical path
Kernel involved only in
setup
The U-Net Architecture
Building Blocks
Application endpoints
Communication Segment (CS) – a region of memory that holds the message data
Message queues (send, receive, and free)
Sending
Assemble the message in the CS
Enqueue a message descriptor
Receiving
Poll-driven or event-driven
Dequeue the message descriptor
Consume the message
Enqueue the buffer in the free queue
(The send and receive paths are sketched in the code below.)
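A minimal sketch, in C, of how an endpoint and its queues might be laid out and used; the type and function names (unet_endpoint, unet_send, unet_poll) and the fixed-size queues are illustrative assumptions, not the actual U-Net interface.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NBUF   64            /* buffers per communication segment */
#define BUFSZ  2048          /* bytes per buffer                  */

struct unet_desc {           /* message descriptor kept in a queue */
    uint32_t tag;            /* channel tag, e.g. an ATM VCI       */
    uint32_t offset;         /* buffer offset inside the CS        */
    uint32_t len;
};

struct unet_endpoint {
    uint8_t          cs[NBUF * BUFSZ];  /* communication segment: pinned,
                                           holds the message data itself */
    struct unet_desc sendq[NBUF];       /* queues carry descriptors only */
    struct unet_desc recvq[NBUF];
    uint32_t         freeq[NBUF];       /* offsets of free CS buffers    */
    /* head/tail indices for the three queues are omitted for brevity */
};

/* Send: assemble the message in the CS, then enqueue a descriptor.
 * The NI polls the send queue and DMAs the data onto the wire. */
static void unet_send(struct unet_endpoint *ep, uint32_t tag,
                      const void *msg, size_t len)
{
    uint32_t off = ep->freeq[0];                 /* pop a free buffer   */
    memcpy(ep->cs + off, msg, len);              /* assemble in the CS  */
    ep->sendq[0] = (struct unet_desc){ tag, off, (uint32_t)len };
}

/* Receive (poll-driven): dequeue a descriptor, consume the message,
 * then return the buffer to the free queue. */
static void unet_poll(struct unet_endpoint *ep,
                      void (*consume)(const void *, size_t))
{
    struct unet_desc d = ep->recvq[0];           /* check head/tail here */
    consume(ep->cs + d.offset, d.len);           /* application handler  */
    ep->freeq[0] = d.offset;                     /* recycle the buffer   */
}

The point the slide makes is visible here: message data lives in the CS, the queues carry only small descriptors, and the kernel never touches either on the fast path.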
U-Net Architecture (contd.)
More on event-handling (upcalls)
Can be a UNIX signal handler or a user-level interrupt handler
Amortize the cost of upcalls by batching message receptions
Mux/Demux:
Each endpoint is uniquely identified by a tag (e.g., a VCI in ATM)
The OS performs initial route setup and security checks and registers a tag in U-Net for that application
The message tag is mapped to a communication channel (demultiplexing sketched below)
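A tiny sketch of the tag-based demultiplexing step, assuming a flat tag-to-endpoint table that the OS fills in at route-setup time; the table shape and names are assumptions, not the U-Net implementation.

#include <stdint.h>

#define MAX_TAGS 1024

/* Filled in by the OS when it registers a tag for an application;
 * 0 means "no endpoint registered", endpoints are numbered from 1. */
static int tag_to_endpoint[MAX_TAGS];

/* Called for every incoming message: the tag (e.g. an ATM VCI) names
 * the communication channel, which identifies exactly one endpoint. */
static int unet_demux(uint32_t tag)
{
    if (tag >= MAX_TAGS || tag_to_endpoint[tag] == 0)
        return -1;                       /* unregistered tag: discard */
    return tag_to_endpoint[tag];         /* deliver to this endpoint  */
}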
Observations
Have to preallocate buffers – memory overhead!
Protected user-level access to the NI: ensured by demarcating protection boundaries
Defined by endpoints and communication channels
Applications cannot interfere with each other because
Endpoints, CSes, and message queues are user-owned
Outgoing messages tagged with originating endpoint address
Incoming messages demuxed by U-Net and sent to correct
endpoint
Zero-copy and True zero-copy
Two levels of sophistication, depending on whether a copy is made in the CS (contrasted in the sketch below)
Base-level architecture
Zero copy: data is staged in an intermediate buffer in the CS
CSes are allocated, aligned, and pinned to physical memory
Optimized for small messages
Direct-access architecture
True zero copy: data is sent directly out of the application's data structures
The sender also specifies the offset at which data is to be deposited at the receiver
The CS spans the entire process address space
Limitations in I/O addressing force one to resort to the base-level zero-copy architecture
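A sketch contrasting the two send paths; the descriptor fields (addr, len, rx_offset) and function names are hypothetical and only illustrate where the copy does or does not happen.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct send_desc {
    const void *addr;        /* where the NI should read the data from */
    size_t      len;
    size_t      rx_offset;   /* where the receiver wants it deposited  */
};

/* Base-level ("zero copy"): stage the payload in a pinned CS buffer,
 * then describe that buffer. One copy, but cheap for small messages. */
static void send_base_level(uint8_t *cs_buf, const void *msg, size_t len,
                            struct send_desc *d)
{
    memcpy(cs_buf, msg, len);            /* the single staging copy */
    *d = (struct send_desc){ cs_buf, len, 0 };
}

/* Direct access ("true zero copy"): the CS spans the whole address
 * space, so the descriptor points straight at the application's data
 * structure and names the offset where it should land at the receiver. */
static void send_direct_access(const void *app_data, size_t len,
                               size_t rx_offset, struct send_desc *d)
{
    *d = (struct send_desc){ app_data, len, rx_offset };   /* no copy */
}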
Kernel emulated end-point
Communication segments and message queues are scarce resources
Optimization: provide a single kernel-emulated endpoint
Cost: performance overhead
U-Net Implementation
U-Net was implemented on two systems
Using the Fore Systems SBA-100 and SBA-200 ATM network interfaces
But why ATM?
Setup: SPARCstations 10 and 20 running SunOS 4.1.3, connected by an ASX-200 ATM switch over 140 Mbps fiber links
SBA-200 firmware
On-board 25 MHz i960 processor, 256 KB RAM, DMA capabilities
Complete redesign of firmware
Device Driver
Protection is offered through the VM system (CSes)
Also through <VCI, communication channel> mappings
U-Net Performance
RTT and bandwidth measurements
Small messages: ~65 μs RTT (with an optimization for single-cell messages)
The fiber is saturated at message sizes of about 800 B
U-Net Active Messages Layer
An RPC-like primitive that can be implemented efficiently on a wide range of hardware
A basic communication primitive in NOW
Allows overlapping of communication with computation
A message contains data and a pointer to its handler (see the sketch below)
Reliable message delivery
The handler moves the data into the data structures of some ongoing operation
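A minimal sketch of the active-message idea the slide describes: the message carries a pointer to the handler that consumes it on arrival. The names are illustrative, not the UAM API, and the handler pointer is assumed to be meaningful on the receiver because both sides run the same program image.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*am_handler_t)(const void *args, size_t len);

struct active_msg {
    am_handler_t handler;     /* run on the receiver when the message arrives */
    uint8_t      args[32];    /* small payload, e.g. fits in one ATM cell     */
    size_t       len;
};

/* Example handler: deposit the data into some ongoing operation's
 * data structures instead of blocking a receiver thread. */
static void deposit_handler(const void *args, size_t len)
{
    (void)args;
    printf("deposited %zu bytes into the ongoing computation\n", len);
}

/* The receive loop simply dispatches the named handler: there is no
 * separate receive/match step, so communication overlaps computation. */
static void am_deliver(const struct active_msg *m)
{
    m->handler(m->args, m->len);
}

int main(void)
{
    struct active_msg m = { deposit_handler, {0}, 16 };
    am_deliver(&m);           /* on real hardware this runs on the remote node */
    return 0;
}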
AM – Micro-benchmarks
Single-cell RTT
RTT ~71 μs for a 0–32 B message
Overhead of 6 μs over raw U-Net – why?
Block store bandwidth
80% of the maximum with 2 KB blocks
Almost saturated at 4 KB
Good performance!
Split-C application benchmarks
Parallel Extension to C
Implemented on top of
UAM
Tested on 8 processors
ATM cluster performs
close to CS-2
TCP/IP and UDP/IP over U-Net
Good performance necessary to show flexibility
Traditional IP-over-ATM shows very poor performance
E.g., TCP achieves only 55% of the maximum bandwidth
TCP and UDP over U-Net show much better performance
Primarily because of tighter application–network coupling
IP-over-U-Net:
IP-over-ATM does not correspond exactly to IP-over-U-Net
Demultiplexing for the same VCI is not possible
Performance Graphs
UDP Performance
Saw-tooth behavior for Fore UDP
TCP Performance
Conclusion
U-Net provides a virtual view of the network interface to enable user-level access to high-speed communication devices
The two main goals were to achieve performance and flexibility
By keeping the kernel off the critical path
Achieved? Look at the table below…
Lightweight Remote Procedure
Calls
Motivation
Small-kernel OSes have most services implemented as separate user-level processes
Separate, communicating user processes
Improve the modular structure
Offer more protection
Ease system design and maintenance
Cross-domain and cross-machine communication are treated the same – problems?
Fails to isolate the common case
Performance and Simplicity considerations
Measurements
Measurements show cross-domain predominance
V System – 97%
Taos Firefly – 94%
Sun UNIX+NFS (diskless) – 99.4%
But how expensive are RPCs?
On Taos, a Null() cross-domain call has a theoretical minimum of 109 μs, but the actual RPC takes 464 μs – roughly 3.5x overhead
Most interactions are simple, with small numbers of arguments
This can be exploited for optimization
Overheads in Cross-domain Calls
Stub Overhead – Additional execution path
Message buffer overhead – Cross-domain calls
can involve four copy operations for any RPC
Context switch – VM context switch from client’s
domain to the server’s and vice versa on return
Scheduling – the abstract threads seen by the programmer must be mapped onto concrete kernel threads, adding scheduling work
Available solutions?
Eliminating kernel copies (DASH system)
Handoff scheduling (Mach and Taos)
In SRC RPC :
Message buffers globally shared!
Trades safety for performance
Solution proposed : LRPCs
Written for the Firefly system
A mechanism for communication between protection domains on the same machine
Motto: strive for performance without forgoing safety
Basic idea: similar to RPCs, but
Do not context-switch to a server thread
Instead, run the client's thread in the server's domain to reduce overhead
Overview of LRPCs
Design
Client calls server through kernel trap
Kernel validates caller
Kernel dispatches client thread directly to server’s domain
Client provides server with a shared argument stack and its own
thread
Return through the kernel to the caller
Implementation - Binding
[Binding diagram: the server's clerk exports the interface and registers it with the name server; the client traps into the kernel to import it and waits; the kernel notifies the server's clerk, which sends back the PDL; the kernel then processes the PDL – building the A-stack list, allocating A-stacks and linkage records, and creating a Binding Object (BO) – and sends the BO to the client.]
Data Structures used and created
The kernel receives a Procedure Descriptor List (PDL) from the clerk
It contains a PD for each procedure
Each PD holds the procedure's entry address, among other information
The kernel allocates argument stacks (A-stacks), shared by the client and server domains, for each PD
It allocates a linkage record for each A-stack to record the caller's return address
It allocates a Binding Object (BO) – the client's key to access the server's interface (sketched below)
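A rough sketch in C of the binding-time data structures just listed; field names and sizes are assumptions for illustration, not the Firefly implementation.

#include <stddef.h>
#include <stdint.h>

struct pd {                        /* procedure descriptor (one per procedure) */
    void  (*entry)(void);          /* entry address in the server's domain     */
    size_t  astack_size;           /* how big this procedure's A-stacks are    */
    /* ... other per-procedure information ... */
};

struct pdl {                       /* procedure descriptor list from the clerk */
    size_t     count;
    struct pd *procs;
};

struct linkage {                   /* one linkage record per A-stack           */
    void *caller_return_addr;      /* where the client thread resumes          */
};

struct astack {                    /* argument stack shared by client & server */
    uint8_t data[256];             /* actual size chosen per PD at bind time   */
};

struct binding_object {            /* the client's key to the server interface */
    uint32_t        id;            /* validated by the kernel on every call    */
    struct pdl     *pdl;
    struct astack  *astacks;       /* pairwise-allocated A-stacks              */
    struct linkage *linkages;      /* one per A-stack                          */
};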
Calling
The client stub traps into the kernel after
Pushing the arguments onto the A-stack
Storing the BO, the procedure identifier, and the A-stack address in registers
Kernel
Validates the client, verifies the A-stack, and locates the PD and a linkage record
Stores the return address in the linkage record and pushes it onto a stack
Switches the client thread's context to the server by running it on a new execution stack (E-stack) from the server's domain
Calls the server stub corresponding to the PD
Server
The client thread runs in the server's domain using the E-stack
It can access the parameters in the A-stack
Return values are passed back in the A-stack
It returns to the kernel through the stub
(The whole sequence is simulated in the sketch below.)
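A user-level simulation, in C, of the calling sequence above; it only models control flow through stub, "kernel", and server (no real domains, traps, or protection), and all names are illustrative assumptions.

#include <stdio.h>
#include <stdint.h>

struct astack  { int32_t arg; int32_t result; };  /* shared argument stack    */
struct linkage { const char *caller; };           /* records the return path  */

/* Server-domain procedure, run on the *client's* thread using the
 * shared A-stack; results go back through the same A-stack. */
static void server_stub_add_one(struct astack *as)
{
    as->result = as->arg + 1;
}

/* "Kernel": validate the Binding Object and A-stack, record the
 * linkage, then dispatch the client thread into the server's domain. */
static int kernel_lrpc_call(uint32_t binding_id, struct astack *as,
                            struct linkage *lk)
{
    if (binding_id != 42)                     /* validate the caller   */
        return -1;
    lk->caller = "client";                    /* fill linkage record   */
    server_stub_add_one(as);                  /* run in "server" ctx   */
    return 0;                                 /* return via linkage    */
}

/* "Client stub": push arguments into the A-stack, trap, read results. */
int main(void)
{
    struct astack  as = { .arg = 7, .result = 0 };
    struct linkage lk;
    if (kernel_lrpc_call(42, &as, &lk) == 0)  /* stands in for the trap */
        printf("result = %d, returned to %s\n", (int)as.result, lk.caller);
    return 0;
}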
Stub Generation
LRPC stubs are automatically generated in assembly language for the simple execution paths
Sacrifices portability for performance
Maintains local and remote stubs
The first instruction in the local stub is a branch statement
What is optimized here?
Using the same thread in different domains reduces overhead
Avoids scheduling decisions
Saves the cost of saving and restoring thread state
Pairwise A-stack allocation guarantees protection from third-party domains
What about within the pair? Asynchronous updates?
Validating the client using the BO provides security
Elimination of redundant copies through use of the A-stack!
One copy, against four in traditional cross-domain RPCs
Sometimes two copies are needed; optimizations apply
Argument copy
But… Is it really good enough?
Trades memory management costs for the reduction in call overhead
A-stacks have to be allocated at bind time
But their size is generally small
Will LRPC still work if a server migrates from a remote machine to the local machine?
Other Issues – Domain Termination
An LRPC in a terminated server domain should be returned to the client
An LRPC should not be returned to the caller if the caller has terminated
Use binding objects:
Revoke the binding objects
For threads running LRPCs in the terminated domain, restart new threads in the corresponding caller
Invalidate active linkage records – each thread is returned to the first domain with an active linkage record
Otherwise it is destroyed
Multiprocessor Issues
LRPC minimizes the use of shared data structures on the critical path
Guaranteed by pairwise allocation of A-stacks
Cache domain contexts on idle processors
Threads idle in the server's context on idle processors
When a client thread makes an LRPC to that server, the two threads swap processors
Reduces context-switch overhead
Evaluation of LRPC
Performance of four test programs (times in μs)
(run on the C-VAX Firefly, averaged over 100,000 calls)
Cost Breakdown for the Null LRPC
Minimum refers to the inherent minimum overhead
18 μs are spent in the client stub and 3 μs in the server stub
25% of the time is spent on TLB misses
Throughput on a multiprocessor
Tested on a Firefly with four C-VAX processors and one MicroVAX II I/O processor
Speedup of 3.7 with 4 processors as against 1 processor
Speedup of 4.3 with 5 processors
SRC RPC shows inferior performance due to a global lock held during the critical transfer path
Conclusion
LRPC Combines
The control transfer and communication model of capability systems
The programming semantics and large-grained protection model of RPCs
Enhances performance by isolating the
common case
NOW
We will see ‘NOW’ later in one of the subsequent
614 presentations