High-Performance Object Access in OSD Storage Subsystem
Yingping Lu
Outline
OSD Overview
Problem and common approaches
Related work
Initial Proposal
Issues
Design Objectives of OSD
Scalability (local area-enterprise-global)
High-performance (high throughput, low latency)
Cross platform
High availability (resilient to device, machine failure)
Support permanent, mobile, and even disconnected clients
Security (authentication, access control, transmission and data storage encryption)
Data sharing
Manageability?
Communication Entities:
•Client
•Metadata Manager
•OSD device
[Figure: several regions, each with MetaData Managers and laptop or desktop clients, connected through an IP network]
Communication Paths:
•Client to metadata server
•Client to OSD device
•Metadata to OSD device
•Metadata to Metadata
Problem
Network bandwidth keeps increasing (10 Gb/s is on the way).
OSD applications require high performance.
How can object data be delivered efficiently between the OSD device and the client?
Potential Measures
Potential performance improvement measures
– Locality-based
Migration (reduce transmission time)
– Migrate to a location closer to the client.
Replication (reduce transmission time)
– Replicate a copy within the client’s proximity.
– Can replicate data object or metadata.
Cache (reduce disk access time/transmission time)
– Where: client, metadata server, object device, etc.
– What: data object, metadata, locking.
– How long: TTL, lease, renewal.
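The "how long" question for cached objects can be made concrete with a small lease-style cache. This is a minimal sketch, not part of the proposal: the class name `LeaseCache`, the fixed `ttl`, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    value: bytes
    expires_at: float  # absolute time at which the lease lapses


class LeaseCache:
    """Cache whose entries are valid only while their lease is live (hypothetical)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.entries: Dict[str, Lease] = {}

    def put(self, key: str, value: bytes) -> None:
        self.entries[key] = Lease(value, self.clock() + self.ttl)

    def get(self, key: str) -> Optional[bytes]:
        lease = self.entries.get(key)
        if lease is None:
            return None
        if self.clock() >= lease.expires_at:
            del self.entries[key]  # expired: caller must refetch
            return None
        return lease.value

    def renew(self, key: str) -> bool:
        """Extend a live lease by one TTL; an expired lease cannot be renewed."""
        lease = self.entries.get(key)
        if lease is None or self.clock() >= lease.expires_at:
            self.entries.pop(key, None)
            return False
        lease.expires_at = self.clock() + self.ttl
        return True
```

The same structure could hold data objects, metadata, or lock state; only the value type changes.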
Performance Improvement Measures (cont.)
Improvement measures
– Aggregation (device grouping)
Improve the aggregate I/O throughput and reliability
Works like a RAID system
– Data path-based
Decouple the control path from the data path
Reduce the length of the critical path at the data access level.
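One way to picture device grouping is RAID-0 style striping of an object across several OSD devices. The sketch below assumes plain round-robin striping with no redundancy; the function names and the `stripe_size` and `num_devices` parameters are illustrative, not taken from the slides.

```python
from typing import List, Tuple


def locate(offset: int, stripe_size: int, num_devices: int) -> Tuple[int, int]:
    """Map a byte offset within an object to (device index, offset on that device)."""
    stripe_index = offset // stripe_size
    device = stripe_index % num_devices   # round-robin placement
    row = stripe_index // num_devices     # how many full rows precede this stripe
    return device, row * stripe_size + offset % stripe_size


def split_read(offset: int, length: int, stripe_size: int,
               num_devices: int) -> List[Tuple[int, int, int]]:
    """Break one logical read into (device, device_offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        device, dev_off = locate(offset, stripe_size, num_devices)
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((device, dev_off, chunk))
        offset += chunk
    return pieces
```

Because the pieces land on different devices, they can be fetched in parallel, which is where the aggregate throughput gain comes from.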
Performance Constraints
Consistency (in updating, reconciliation)
Locking and serialization
Security
Small data size access
Crash recovery
Leveraging Data Access Path
• Streamline the end system
Zero copy/RDMA
User level programming/OS bypass
TCP offloading
• Improve the transport system
Large window size
Explicit congestion notification
Selective acknowledgement
Connection splitting (mobile)
eXplicit Control Protocol (XCP)
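The case for a large window follows from the bandwidth-delay product: the sender must keep bandwidth × RTT bytes in flight to fill the link. A short worked example, assuming a 1 ms round-trip time (the slides give no RTT) and the 10 Gb/s figure mentioned earlier:

```python
def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> int:
    """Bytes that must be in flight to keep the link full."""
    return int(bits_per_second * rtt_seconds / 8)


def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bits/s) when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_seconds


# Assumed numbers: 10 Gb/s link, 1 ms RTT.
needed = bandwidth_delay_product(10e9, 0.001)   # about 1.25 MB
# A 65,535-byte window (TCP without window scaling) caps throughput far below 10 Gb/s.
capped = max_throughput(65535, 0.001)           # about 524 Mb/s
```

Under these assumptions a window of roughly 1.25 MB is needed, about twenty times the unscaled TCP maximum.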
What’s Wrong With End System
Streamlining end systems
– Problems: the end system cannot deliver the potential bandwidth to applications.
Memory copy
Context switching
Interrupt service
Checksum generation
Protocol processing
End System Overhead
Streamlining the end system
– Overhead
Per packet
– Protocol processing (execute code, allocate/release buffer)
– Access control
– Interrupt service time for each received packet
– Kernel context switching
Per byte
– Checksum generation
– Memory copy
– Data transmission
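The per-packet versus per-byte split suggests a simple cost model: total cost = packets × per-packet cost + bytes × per-byte cost. The sketch below uses arbitrary cost units and made-up constants purely to show the shape of the trade-off; none of the numbers are measurements.

```python
def transfer_cost(num_bytes: int, payload_per_packet: int,
                  per_packet_cost: int, per_byte_cost: int) -> int:
    """Total end-system cost (arbitrary units) to move num_bytes."""
    packets = -(-num_bytes // payload_per_packet)  # ceiling division
    return packets * per_packet_cost + num_bytes * per_byte_cost


# Hypothetical constants: same data, standard vs. jumbo payload size.
standard = transfer_cost(9000, 1500, per_packet_cost=10, per_byte_cost=1)
jumbo = transfer_cost(9000, 9000, per_packet_cost=10, per_byte_cost=1)
```

Larger packets and interrupt coalescing attack the first term; zero copy and checksum offload attack the second.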
Streamlining End System
Solutions
– RDMA (Zero copy)
– One system-wide buffer pool
– User level networking (bypassing kernel)
– TCP offloading
– Jumbo packets
– Interrupt coalescing
– Scatter/gather list
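A scatter/gather list lets one I/O call operate on several separate buffers, so a header and a payload never have to be concatenated first. The sketch below shows the API shape with Python's `socket.sendmsg` and `socket.recvmsg_into` on a Unix socket pair; it assumes a Unix-like platform, and the helper names are illustrative. The kernel still copies data here, so this shows the interface rather than a true zero-copy path.

```python
import socket
from typing import Tuple


def send_gathered(sock: socket.socket, header: bytes, payload: bytes) -> int:
    """Gather: transmit two separate buffers with a single call."""
    return sock.sendmsg([header, payload])


def recv_scattered(sock: socket.socket, header_len: int,
                   payload_len: int) -> Tuple[bytes, bytes]:
    """Scatter: fill a header buffer first, then a payload buffer."""
    header = bytearray(header_len)
    payload = bytearray(payload_len)
    # MSG_WAITALL asks the kernel to wait until both buffers are full.
    sock.recvmsg_into([header, payload], 0, socket.MSG_WAITALL)
    return bytes(header), bytes(payload)
```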
Related work
Previous work:
– I/O Lite
– VI (Myrinet, Servernet)
– SDP
– InfiniBand
– SRP
– DAT (Direct Access Transport collaborative)
– DAFS (SNIA)
– NFS/RDMA (SNIA)
– RDMA over TCP/IP
I/O Lite
Purpose: Reduce memory copy
Approach: Maintain a global buffer pool in the system
Allow application, IPC, file system, network subsystem to share one copy of data
Pros:
– Reduces memory copy
– Useful for read-only buffers
Cons:
– The system must be rewritten
– Buffer update is difficult
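The "one shared copy" idea, and why updates are awkward, can be seen with read-only views over a single immutable buffer. This is only an analogy in Python, not I/O Lite's actual interface; `SharedBuffer` and `view` are hypothetical names.

```python
from typing import Optional


class SharedBuffer:
    """One immutable copy of the data, handed out as read-only views."""

    def __init__(self, data: bytes):
        self._data = data  # immutable backing store, never copied below

    def view(self, start: int = 0, stop: Optional[int] = None) -> memoryview:
        # Slicing a memoryview yields a new view over the same memory.
        return memoryview(self._data)[start:stop]


buf = SharedBuffer(b"object-data-0123")
fs_view = buf.view(0, 6)     # e.g. what the file system sees
net_view = buf.view(7, 11)   # e.g. what the network subsystem sees
```

Every consumer reads the same bytes without copying. Writing through a view fails, so an update means building a new buffer, which mirrors the "buffer update is difficult" drawback.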
RDMA
Extends DMA semantics across the machine boundary
Two operations: RDMA Read, RDMA Write
Memory registration: memory needs to be “pinned”
A descriptor carries the source address, destination address, and length
Special hardware (the NIC) handles the RDMA operation.
Pros:
– Zero copy
– Offloads CPU processing
Cons:
– Needs special hardware
– Needs reprogramming
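The descriptor and registration rules can be modeled in a few lines. This is a toy simulation in ordinary process memory, not a real RDMA API: `Host`, `Descriptor`, and `rdma_write` are hypothetical names, and the slice assignment stands in for the transfer the NIC would perform.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Descriptor:
    src_addr: int    # offset in the local host's memory
    dest_addr: int   # offset in the remote host's memory
    length: int


class Host:
    def __init__(self, size: int):
        self.memory = bytearray(size)
        self.registered: Dict[int, int] = {}  # base address -> length

    def register(self, addr: int, length: int) -> None:
        """Stand-in for pinning: only registered ranges may be touched by RDMA."""
        if addr < 0 or addr + length > len(self.memory):
            raise ValueError("range outside host memory")
        self.registered[addr] = length

    def is_registered(self, addr: int, length: int) -> bool:
        return any(base <= addr and addr + length <= base + size
                   for base, size in self.registered.items())


def rdma_write(local: Host, remote: Host, d: Descriptor) -> None:
    """Place local bytes directly into remote memory, without remote CPU involvement."""
    if not local.is_registered(d.src_addr, d.length):
        raise PermissionError("source range not registered")
    if not remote.is_registered(d.dest_addr, d.length):
        raise PermissionError("destination range not registered")
    remote.memory[d.dest_addr:d.dest_addr + d.length] = \
        local.memory[d.src_addr:d.src_addr + d.length]
```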
Remote DMA Scenario
[Figure: Host A (CPU, Buffer A, RDMA engine on the NIC) and Host B (CPU, Buffer B, RDMA engine on the NIC), with the transfer marked as steps 1, 2, and 3]
Virtual Interface Architecture (VIA)
Goal: low latency and high throughput through direct access to the NIC and zero copy
Programming abstraction: VI (queue pair)
Components: consumer, VI provider (UA, KA, NIC)
Operations: RDMA, Send/Receive
Presents a standard for RDMA operations and the VI abstraction
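The queue-pair abstraction can be sketched as a send queue and a receive queue plus a completion queue. The model below is a simplification for illustration, not the VIPL interface: the `process` method plays the role of the NIC, and all names are hypothetical.

```python
from collections import deque


class QueuePair:
    """Toy VI: the consumer posts work; process() stands in for the NIC."""

    def __init__(self, name: str):
        self.name = name
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completions = deque()
        self.peer = None

    def post_recv(self, buffer: bytearray) -> None:
        self.recv_queue.append(buffer)   # buffer must be posted before data arrives

    def post_send(self, data: bytes) -> None:
        self.send_queue.append(data)

    def process(self) -> int:
        """Match queued sends with the peer's posted receive buffers."""
        done = 0
        while self.send_queue and self.peer.recv_queue:
            if len(self.send_queue[0]) > len(self.peer.recv_queue[0]):
                raise ValueError("posted receive buffer too small")
            data = self.send_queue.popleft()
            buf = self.peer.recv_queue.popleft()
            buf[:len(data)] = data
            self.completions.append(("send", len(data)))
            self.peer.completions.append(("recv", len(data)))
            done += 1
        return done


def connect(a: QueuePair, b: QueuePair) -> None:
    a.peer, b.peer = b, a
```

A send with no matching posted receive simply stays queued, which is why receivers pre-post buffers in this model.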
InfiniBand
An emerging I/O interconnect technology
Decouples I/O from the CPU
Adopts a serial, switch-based fabric
Provides a unified communication mechanism (4 layers)
Provides VI support (Verb, QP, RDMA, etc.)
Implements the VI concept in a standard network
SCSI RDMA Protocol (SRP)
Goal: provide SCSI access across the IB fabric
Exploits IB RDMA to transfer SCSI data
Enables a SAN based on IB
It is targeted specifically at IB and is not suitable for IP
It is block-level (SCSI) access (can it be object-level?)
DAFS and NFS/RDMA
DAFS
– Being developed by the DAFS consortium
– A lightweight file sharing protocol for local data sharing
– Leverages NFS 4.0
– Exploits the RDMA mechanism to transfer file data
NFS/RDMA
– Being developed by the SNIA NFS/RDMA group
– Enables NFS to exploit the new networking technology (VIA, IB)
– Makes changes to RPC/XDR to use RDMA semantics
– Targeted at the local area environment
Sockets Direct Protocol (SDP)
Microsoft’s solution for the datacenter (2000)
Retains the same socket programming interface
Bypasses TCP/IP processing in the kernel
Supports RDMA semantics
Not routable; works within a data center or cluster
[Figure 1: WSD and SAN Architectural Model. Two panels, "Traditional Model" and "Winsock Direct Model". In the traditional model a socket application goes through the WinSock API, the TCP/IP/Sockets provider, the TCP/IP transport driver, NDIS, and the NDIS driver to the NIC. The Winsock Direct model adds a Winsock Direct switch and a SAN provider with a kernel-bypass path to a kernel-bypass-capable NIC.]
RDMA over TCP/IP
Developed by the RDMA Consortium
Supports RDMA over a TCP/IP network
Consists of three components: RDMAP, DDP, MPA
RDMAP: provides RDMA operations
DDP: direct data placement
MPA: handles framing
SCTP: Stream Control Transmission Protocol
[Figure: protocol stack, top to bottom: ULP, RDMAP, DDP, then either MPA over TCP or SCTP, then IP]
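Framing is needed because TCP delivers a byte stream with no message boundaries, while direct data placement needs to know where each message starts and ends. The sketch below uses a length prefix and a trailing CRC to show the idea. It is a deliberate simplification and not the MPA wire format: the field sizes and the use of `zlib.crc32` are assumptions made for this example.

```python
import struct
import zlib
from typing import List, Tuple


def frame(payload: bytes) -> bytes:
    """Wrap a payload as: 2-byte length | payload | 4-byte CRC (simplified)."""
    header = struct.pack("!H", len(payload))
    crc = struct.pack("!I", zlib.crc32(header + payload))
    return header + payload + crc


def deframe(stream: bytes) -> Tuple[List[bytes], bytes]:
    """Split a byte stream into complete payloads plus any leftover partial frame."""
    messages = []
    pos = 0
    while len(stream) - pos >= 2:
        (length,) = struct.unpack_from("!H", stream, pos)
        end = pos + 2 + length + 4
        if end > len(stream):
            break  # frame not fully received yet
        body = stream[pos:pos + 2 + length]
        (crc,) = struct.unpack_from("!I", stream, pos + 2 + length)
        if zlib.crc32(body) != crc:
            raise ValueError("CRC mismatch")
        messages.append(body[2:])
        pos = end
    return messages, stream[pos:]
```

A receiver keeps the leftover bytes and retries once more of the stream has arrived.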
Summary
Link-level
– No routing info carried
– Relies on the underlying link-level switch to forward
– Restricted to data center, cluster environment
– Examples: VIA, InfiniBand, SRP, SDP, DAFS, NFS/RDMA
Transport-level
– Carries TCP/IP header
– Can traverse IP networks
– Processes framing, direct data placement.
OSD Requirements
Direct delivery from object device
– Direct transmission between initiator and target device
– This is the critical data path
Secure delivery
– No secure channel is assumed, so encryption of the transmitted object is necessary
QoS requirement
– An object may have a specific QoS requirement
Mobile client
– The client may connect, disconnect, and connect again.
– Errors can occur during transmission
Initial Proposal: OSD/Secure RDMA
This is a ULP-based RDMA
– Leverages RDMA over TCP/IP
– Extends the communication to the IP network
– The OSD device initiates the RDMA request
Security-enabled RDMA
– The RDMA is tightly integrated with the OSD protocol
– The underlying transport supports security
QoS support
– Virtual Lane-type mechanism to provide QoS support
OSD/Secure RDMA Architecture
[Figure: the OSD client (application, buffers, OSD, VIPL, VI NIC driver, NIC) and the OSD device (OSD controller, buffers, OSD, VIPL, object manager, VI NIC driver, NIC, disk driver), connected through the IP network]
Protocol Stacks
OSD/RDMA maps OSD to RDMA
DDP provides the direct data placement
The underlying transport can be either SCTP or MPA with TCP.
IPSec is used as the security protocol (object encryption)
[Figure: protocol stack, top to bottom: OSD Consumer, OSD Protocol, OSD/RDMA, DDP, then either MPA over TCP or SCTP, then IP/IPSec. Side labels: Consumer, OSD, VIPL, Intelligent NIC]
Data Access Case – Get an Object
[Figure: message sequence between OSD Client and OSD Device]
OSD Client to OSD Device: request an object with object id, credential, descriptor (1*)
OSD Device to OSD Client: RDMA write, sent as data packets (2*)
OSD Device to OSD Client: RDMAWrCompl
1*:
•The client must first get access permission and establish a session.
•Register memory
•Post a send request
2*:
•Validate the request.
•Register a memory buffer
•Fetch the object from disk or cache to the buffer
•Post an RDMA write request
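The get-object exchange can be walked through with an in-process simulation. This is a sketch under several assumptions that the slides leave open: the credential is modeled as an HMAC over the object id with a shared secret, the "RDMA write" is a direct copy into the client's buffer, and every class, method, and status name except `RDMAWrCompl` is hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Tuple

SECRET = b"shared-secret"  # hypothetical key shared with the device


def make_credential(object_id: str) -> bytes:
    """Assumed credential format: HMAC-SHA256 over the object id."""
    return hmac.new(SECRET, object_id.encode(), hashlib.sha256).digest()


class OSDDevice:
    def __init__(self, objects: Dict[str, bytes]):
        self.objects = dict(objects)

    def handle_get(self, object_id: str, credential: bytes,
                   client_buffer: bytearray) -> Tuple[str, int]:
        # Step 2*: validate the request.
        if not hmac.compare_digest(credential, make_credential(object_id)):
            return "DENIED", 0
        data = self.objects.get(object_id)
        if data is None:
            return "NOT_FOUND", 0
        if len(data) > len(client_buffer):
            return "BUFFER_TOO_SMALL", 0
        staging = bytearray(data)                # register a buffer, fetch the object
        client_buffer[:len(staging)] = staging   # stands in for the RDMA write
        return "RDMAWrCompl", len(staging)


class OSDClient:
    def get(self, device: OSDDevice, object_id: str, size: int) -> bytes:
        # Step 1*: permission is assumed to have been obtained already.
        buffer = bytearray(size)                 # register memory
        credential = make_credential(object_id)
        status, n = device.handle_get(object_id, credential, buffer)  # send request
        if status != "RDMAWrCompl":
            raise IOError(status)
        return bytes(buffer[:n])
```

Note that the device, not the client, moves the data: the client only supplies a descriptor for its registered buffer and waits for the completion.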
Issues to be solved
Elaborate the OSD object transfer protocol.
– Should we simply consider SCSI/OSD?
– What would be the new requirements, e.g. security?
The integration of iSCSI over RDMA.
The establishment of a session
– OSD session/iSCSI session/RDMA connection/TCP connection
– Sequence?
– Persistent vs. transient?
Define the format of the OSD/RDMA packet
– Memory descriptor
– Commands (login, logout, CMD)
– Flow control
Issues
Integration of RDMA with OSD (cont.)
– Define a set of standard APIs for OSD/RDMA
Create a session
Register memory
Post a work queue element
Query status, etc.
Integration with security
– IPSec vs. SSL?
Handle QoS requirements
– QoS attributes: how to specify them in an object
– QoS assurance: credit-based flow control?
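Credit-based flow control, raised above as a possible QoS mechanism, can be sketched as follows: the receiver grants credits, and the sender transmits one packet per credit and queues the rest. The class and method names are hypothetical, and how credits would map to an object's QoS attributes is left open, as in the slides.

```python
from collections import deque
from typing import List


class CreditedSender:
    """Sends only while it holds credits granted by the receiver."""

    def __init__(self, initial_credits: int):
        self.credits = initial_credits
        self.pending = deque()
        self.sent: List[bytes] = []   # stands in for packets put on the wire

    def submit(self, packet: bytes) -> None:
        self.pending.append(packet)
        self._drain()

    def grant(self, n: int) -> None:
        """Called when the receiver frees buffers and returns n credits."""
        self.credits += n
        self._drain()

    def _drain(self) -> None:
        while self.credits > 0 and self.pending:
            self.sent.append(self.pending.popleft())
            self.credits -= 1
```

Since the sender can never exceed the credits it holds, the receiver's buffers cannot overflow, and giving different streams different credit rates is one possible way to differentiate service.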