
Scalable Networking for Next-Generation Computing Platforms

Yoshio Turner*, Tim Brecht*‡, Greg Regnier§, Vikram Saletore§, John Janakiraman*, Brian Lynn*

* Hewlett Packard Laboratories
§ Intel Corporation
‡ University of Waterloo
Outline
• Motivation: enable applications to scale to next-generation network and I/O performance on standard computing platforms
• Proposed technology strategy:
  – Embedded Transport Acceleration (ETA)
  – Asynchronous I/O (AIO) programming model
• Web server application evaluation vehicle
• Evaluation plan
• Conclusions
14 Feb 2004, SAN-3 workshop – HPCA-10
Motivation: Next-Generation Platform Requirements
• Low-overhead packet and protocol processing for next-generation commodity interconnects (e.g., 10 GigE)
  – Current systems: performance impeded by interrupts, context switches, data copies
  – Existing proposals include:
    • TCP Offload Engines (TOE): special hardware; cost and time-to-market issues
    • RDMA: new protocol, requires support at both endpoints
• Increased I/O concurrency for high link utilization
  – I/O bandwidth is increasing
  – I/O latency is fixed or slowly decreasing toward a limit
  ⇒ Need a larger number of in-flight operations to fill the pipe
Proposed Technology Strategy
• Embedded Transport Acceleration (ETA) architecture
  – Intel Labs project: the prototype architecture dedicates one or more processors, called "Packet Processing Engines" (PPEs), to perform all network packet processing
  – Low-overhead processing: the PPE interacts with network interfaces and applications directly via cache-coherent shared memory (bypassing the OS kernel)
  – Application interface: VIA-style user-level communication
• Asynchronous I/O (AIO) programming model
  – Split, two-phase file/socket operations:
    • Post an I/O operation request: non-blocking call
    • Asynchronously receive completion event information
  – High I/O concurrency even for a single-threaded application
  – Initial focus: ETA socket AIO (future extensions to file AIO)
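The split, two-phase model above can be sketched as a toy Python simulation. The class and method names here are illustrative stand-ins, not the actual DUSI API; the `progress()` call plays the role of the PPE completing work asynchronously.

```python
from collections import deque

class EventQueue:
    """Completion events land here; the application retrieves them later."""
    def __init__(self):
        self.events = deque()

class AioSocket:
    """Illustrative split-phase socket: post now, complete later."""
    def __init__(self, event_queue):
        self.event_queue = event_queue
        self.pending = deque()

    def post_send(self, buf, tag):
        # Phase 1: non-blocking post; the call returns immediately.
        self.pending.append(("send", buf, tag))

    def progress(self):
        # Stand-in for the PPE: completes one pending operation and
        # delivers a completion event to the bound event queue.
        if self.pending:
            op, buf, tag = self.pending.popleft()
            self.event_queue.events.append((op, len(buf), tag))

eq = EventQueue()
sock = AioSocket(eq)
sock.post_send(b"hello", tag=7)   # phase 1: returns immediately
assert not eq.events              # nothing has completed yet
sock.progress()                   # "PPE" completes the operation
print(eq.events[0])               # phase 2: ('send', 5, 7)
```

The point of the split is that the application never blocks between the two phases, so one thread can keep many operations in flight.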
Key Advantages
• Potentially enables Ethernet and TCP to approach the latency and throughput performance of System Area Networks
• Uses standard system processor/memory resources:
  – Automatically tracks semiconductor cost-performance trends
  – Leverages microarchitecture trends: multiple cores, hardware multi-threading
  – Leverages standard software development environments ⇒ rapid development
• Extensibility: a fully programmable PPE to support evolving data center functionality
  – Unified IP-based fabric for all I/O
  – RDMA
• AIO increases network-centric application scalability
Overview of the ETA Architecture
• Partitioned server architecture:
  – Host: application execution
  – Packet Processing Engine (PPE)
• Host-PPE Direct Transport Interface (DTI)
  – VIA/InfiniBand-like queuing structures in cache-coherent shared host memory (OS bypass)
  – Optimized for sockets/TCP
• Direct User Socket Interface (DUSI)
  – Thin software layer to support user-level applications
ETA Overview: Partitioned Architecture
[Figure: partitioned architecture. User applications and kernel applications (file system, iSCSI) run on the host CPU(s) and reach the ETA host interface through legacy sockets or direct access, over shared memory. The PPE runs the TCP/IP stack and driver and connects to the network fabric, which carries LAN, storage, and IPC traffic.]
ETA Overview: Direct Transport Interface (DTI) Queuing Structure
[Figure: the host and the Packet Processing Engine communicate through shared host memory, which holds per-DTI doorbells, a Tx queue, an Rx queue, an event queue, data buffers, and an anonymous buffer pool.]
• Asynchronous socket operations: connect, accept, listen, etc.
• TCP buffering semantics: the anonymous buffer pool supports non-pre-posted or out-of-order (OOO) receive packets
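A toy model of one DTI can make the anonymous-buffer-pool behavior concrete. In ETA these queues live in cache-coherent shared host memory; here they are plain Python containers, and `ppe_deliver` is a hypothetical stand-in for the PPE's receive path, not real ETA code.

```python
from collections import deque

class DTI:
    """Toy model of one Direct Transport Interface (illustrative only)."""
    def __init__(self):
        self.tx_queue = deque()       # posted send descriptors
        self.rx_queue = deque()       # pre-posted receive buffers
        self.event_queue = deque()    # completions written by the PPE
        self.anon_pool = [bytearray(2048) for _ in range(4)]  # for unposted/OOO data
        self.doorbell = 0             # host "rings" this to notify the PPE

    def ring_doorbell(self):
        self.doorbell += 1

def ppe_deliver(dti, payload):
    """Stand-in PPE receive path: use a pre-posted buffer if one exists,
    otherwise fall back to the anonymous buffer pool (TCP semantics)."""
    if dti.rx_queue:
        buf = dti.rx_queue.popleft()
        buf[:len(payload)] = payload
        dti.event_queue.append(("recv", "posted", len(payload)))
    else:
        buf = dti.anon_pool.pop()
        buf[:len(payload)] = payload
        dti.event_queue.append(("recv", "anonymous", len(payload)))

dti = DTI()
ppe_deliver(dti, b"early data")       # nothing pre-posted: anonymous pool
dti.rx_queue.append(bytearray(2048))  # application pre-posts a receive buffer
ppe_deliver(dti, b"later data")       # lands in the posted buffer
print([e[1] for e in dti.event_queue])  # ['anonymous', 'posted']
```

The anonymous pool is what lets TCP data that arrives before the application posts a buffer (or arrives out of order) be accepted rather than dropped.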
API for Asynchronous I/O (AIO)
• Layer a socket AIO API above the ETA architecture
  – Investigate the impact of AIO API features on application structure and performance
• Initial focus: the ETA Direct User Socket Interface (DUSI) API
  – Provides asynchronous socket operations: connect, listen, accept, send, receive
• AIO examples:
  – File/socket: Windows AIO with completion ports, POSIX AIO
  – File I/O: Linux AIO (recently introduced)
  – Socket I/O with OS bypass: ETA DUSI, Open Group Sockets API Extensions
ETA Direct User Socket Interface (DUSI) AIO API
• Queuing structure setup for sockets:
  – One Direct Transport Interface (DTI) per socket
  – Event queues: created separately from DTIs
• Memory registration:
  – Pin user-space memory regions and provide address translation information to ETA for zero-copy transfers
  – Provide access keys (protection tags)
• The application posts socket I/O operation requests to the DTI Tx and Rx work queues
• The PPE delivers operation completion events to DTI event queues
• Both operation posting and event delivery are lightweight (no OS involvement)
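Memory registration with protection tags can be sketched as a lookup table. This is a conceptual toy, not the DUSI calls: the class and key format are invented for illustration, and real registration would also pin pages and hand the PPE address-translation entries.

```python
class Registry:
    """Toy registration table. In ETA, registration pins user pages and
    gives the PPE translation info plus an access key (protection tag)."""
    def __init__(self):
        self.regions = {}

    def register(self, buf):
        key = "tag-%d" % len(self.regions)   # illustrative protection tag
        self.regions[key] = buf
        return key

    def lookup(self, key):
        # The PPE would reject a transfer carrying an unknown or forged key.
        if key not in self.regions:
            raise PermissionError("bad protection tag")
        return self.regions[key]

reg = Registry()
buf = bytearray(4096)          # region the application wants for zero-copy I/O
key = reg.register(buf)
assert reg.lookup(key) is buf  # valid tag: the PPE may DMA into this region
try:
    reg.lookup("forged-tag")
except PermissionError as err:
    print(err)                 # bad protection tag
```

The tag check is what lets the PPE safely access user memory without the OS mediating each transfer.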
AIO Event Queue Binding
• AIO API design issue: assignment of events to event queues
  – Flexible binding enables applications to separate or group events to facilitate operation scheduling
• DUSI: each DTI work queue can be bound at socket creation to any event queue
  – Allows separating or grouping events from different sockets
  – Allows separating events by type (transmit, receive)
• Alternatives for event queue binding:
  – Windows: per-socket
  – Linux and POSIX AIO: per-operation
  – Open Group Sockets API Extensions: per-operation-type
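DUSI's flexible binding can be illustrated with a toy model (invented names, not the DUSI API): here two sockets route all receive completions to one queue and all transmit completions to another, which none of the per-socket or per-operation alternatives can express directly.

```python
from collections import deque

class EventQueue:
    def __init__(self, name):
        self.name, self.events = name, deque()

class Socket:
    """At creation, each work queue (tx, rx) is bound to any event queue,
    mirroring DUSI's per-work-queue binding."""
    def __init__(self, sid, tx_eq, rx_eq):
        self.sid, self.tx_eq, self.rx_eq = sid, tx_eq, rx_eq

    def complete(self, kind):
        # Deliver a completion to whichever queue this work queue is bound to.
        eq = self.tx_eq if kind == "tx" else self.rx_eq
        eq.events.append((self.sid, kind))

# Group events by type across sockets: receives on one queue, transmits
# on another, regardless of which socket they came from.
rx_q, tx_q = EventQueue("rx"), EventQueue("tx")
s1 = Socket(1, tx_eq=tx_q, rx_eq=rx_q)
s2 = Socket(2, tx_eq=tx_q, rx_eq=rx_q)
s1.complete("rx"); s2.complete("tx"); s2.complete("rx")
print(list(rx_q.events))  # [(1, 'rx'), (2, 'rx')]
print(list(tx_q.events))  # [(2, 'tx')]
```

An application could equally bind both work queues of one socket to a dedicated queue, grouping by connection instead of by type.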
Retrieving AIO Completion Events
• AIO API design issue: the application interface for retrieving events
• DUSI: a lightweight mechanism bypassing the OS
  – Event queues in shared memory
  – Callbacks: similar to Windows
  – Event tags
• Application monitoring of multiple event queues
  – Poll for events (OK for a small number of queues)
  – No events ⇒ block in the OS on multiple queues
    • Uncommon case in a busy server ⇒ acceptable in this case to use the OS signaling mechanism
    • Useful for simultaneous use of different AIO APIs
  – Race conditions: user-level responsibility
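The poll-first, block-as-fallback pattern can be sketched as follows. The names are illustrative; `threading.Event` stands in for the OS signaling mechanism that a real implementation would block on.

```python
from collections import deque
import threading

queues = [deque(), deque()]        # event queues in "shared memory"
have_events = threading.Event()    # stand-in for the OS signaling mechanism

def deliver(qi, ev):
    queues[qi].append(ev)          # PPE writes the event to shared memory...
    have_events.set()              # ...and signals in case anyone is blocked

def next_event(spin=1000):
    while True:
        # Common case on a busy server: cheap user-level polling finds work.
        for _ in range(spin):
            for q in queues:
                if q:
                    return q.popleft()
        # Uncommon case: nothing pending anywhere, so block in the OS.
        have_events.wait()
        have_events.clear()

# Simulate the PPE delivering an event shortly after we start waiting.
threading.Timer(0.01, deliver, args=(1, ("recv", 42))).start()
ev = next_event()
print(ev)
```

As the slide notes, the handoff between polling and blocking is racy by nature, and resolving such races (e.g., re-polling after waking) is a user-level responsibility here.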
AIO for Files and Sockets
• File AIO support
  – OS (e.g., Linux AIO, POSIX AIO)
  – Future: ETA support for file I/O (e.g., via iSCSI or DAFS)
• Unified application processing of file/socket events
  – The ETA PPE and the OS kernel may both supply event queues
    • Blocking on event queues of different types is facilitated by use of the OS signal mechanism (as in DUSI)
    • Unified event queues may be desirable but require efficient coordination of ETA and OS access to the event queues
  – Support for zero-copy sendfile(): integration of ETA with OS management of the shared file buffer in system memory
Initial Demonstration Vehicle: Web Server Application
• Plan: demonstrate the value of ETA/AIO for network-centric applications
• Initial target: a web server application
  – A single request may require multiple I/Os
  – Stresses system resources (especially OS resources)
  – Must multiplex thousands or tens of thousands of concurrent connections
• Web server architecture alternatives:
  – SPED (single-process event-driven)
  – MP (multi-process) or MT (multi-threaded)
  – Hybrid approach: AMPED (asymmetric multi-process event-driven)
  ⇒ The AIO model favors SPED for raw performance
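The SPED structure that AIO enables can be sketched as a single-threaded dispatch loop over completion events (a toy simulation with invented event tuples, not the userver's actual code):

```python
from collections import deque

# A SPED server is one process driving many connections from one event
# loop; this toy loop dispatches pre-recorded completion events by type.
events = deque([
    ("accept", 1), ("recv", 1), ("accept", 2), ("send", 1), ("close", 1),
])
open_conns, served = set(), []

def handle(ev):
    kind, conn = ev
    if kind == "accept":
        open_conns.add(conn)       # new connection admitted
    elif kind == "recv":
        served.append(conn)        # parse request, post the response send
    elif kind == "close":
        open_conns.discard(conn)   # connection torn down

while events:                      # single thread; no per-connection blocking
    handle(events.popleft())
print(sorted(open_conns), served)  # [2] [1]
```

Because every operation was posted asynchronously, no handler ever blocks, which is why one process per CPU suffices even at tens of thousands of connections.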
The userver
• Open-source micro web server
• Extensive tracing and statistics facilities
• SPED model: run one process per host CPU
• Previous support for Unix non-blocking socket I/O and event notification via Linux epoll()
• Modified to support socket AIO (eventually file AIO)
  – Generic AIO interface: can be mapped to a variety of underlying AIO APIs (DUSI, Linux AIO, etc.)
• Comparison: web server performance with and without the ETA engine
  – With standard Linux: processes share the file buffer cache, using sendfile() for zero-copy file transfer
  – With ETA: mmap() files into a shared address space
Web Server Event Scheduling
• Balance accepting new connections with the processing of existing connections
• Scheduling:
  – Separate queues for accept(), read(), and write()/close() completion events
  – Process based on current queue lengths
• Early results with non-blocking I/O: the frequency of accept processing matters
[Figure: throughput impact of the frequency of accepting new connections]
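The queue-length-based policy above can be sketched in a few lines. The "drain the longest queue first" rule here is one illustrative choice, not necessarily the userver's actual policy:

```python
from collections import deque

# Separate queues for accept, read, and write/close completion events,
# as on the slide; the scheduler picks work based on current lengths.
accept_q = deque(["a1", "a2"])
read_q = deque(["r1", "r2", "r3"])
write_q = deque(["w1"])

def schedule():
    # Illustrative rule: serve the currently longest queue, so neither
    # new-connection acceptance nor existing-connection work starves.
    q = max((accept_q, read_q, write_q), key=len)
    return q.popleft() if q else None

order = [schedule() for _ in range(6)]
print(order)
```

Biasing this rule (e.g., weighting the accept queue) is exactly the knob whose throughput impact the figure explores.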
Evaluation Plans
• Goal: evaluate the approach and compare it to design alternatives
• Construct a functional prototype of the proposed stack (Linux)
  – Extend the existing ETA prototype's kernel-level interface to user level with OS bypass (DUSI)
  – Extend the userver to use socket AIO, with a mapping layer to DUSI
  – Evaluate on a 10 GigE-based client/server setup using a SPECweb-type workload
• Current ETA prototype: promising kernel-level microbenchmark performance
• Expectation: ETA + AIO will show significantly higher scalability than the existing Linux network implementation
Proposed Stack/Comparison
[Figure: the proposed stack alongside the conventional one. At user level, the userver runs either over Linux sockets (via the Linux sockets library) or over AIO (via the AIO mapping layer to the ETA Direct User Sockets Interface, DUSI). At kernel level, the Linux kernel provides TCP, UDP, and raw IP, while the ETA kernel agent handles the control path; the data path runs through DTIs to the ETA Packet Processing Engine and packet driver, down to the network interfaces.]
Kernel-Level ETA Prototype
[Figures: two microbenchmark charts comparing Linux and ETA, plotting idle CPU fraction and throughput in gigabits against transmit size and against receive size (256 bytes to 64 KB).]
Evaluation Plans: Analyses and Comparisons
• Compare the proposed stack to a well-tuned conventional system: checksum offload, TCP segmentation offload, interrupt moderation (NAPI)
• Examine microarchitectural impacts: use VTune/OProfile to measure CPU, memory, and cache usage, interrupts, data copies, context switches
• Compare to TOE
• Extend the analysis to application domains beyond the web server: e.g., storage, transaction processing
• Port a highly scalable user-level threading package (the UC Berkeley Capriccio project) to ETA
  – Benefit: a familiar threaded programming model with efficient "under the hood" AIO and OS bypass
Summary
• Proposed a technology strategy combining ETA and AIO to enable industry-standard platforms to scale to next-generation network performance
• Cost-performance, time-to-market, and flexibility advantages over alternative approaches
• Ethernet/TCP can approach the performance levels of today's SANs, moving toward a unified data center I/O fabric based on commodity hardware
• Status:
  – Promising initial experimental results for kernel-level ETA
  – Prototype implementation of the proposed stack nearly complete
  – Testing environment setup based on 10 GigE