Federated DAFS: Scalable Cluster-based Direct Access File Servers
Murali Rangarajan, Suresh Gopalakrishnan
Ashok Arumugam, Rabita Sarker
Rutgers University
Liviu Iftode
University of Maryland

Network File Servers

[Diagram: clients connect over TCP/IP to an NFS file server]

- OS involvement increases latency and overhead
  - TCP/UDP protocol processing
  - Memory-to-memory copying

User-level Memory Mapped Communication

[Diagram: traditional send/receive paths go through the OS, while the application accesses the NIC directly]

- Application has direct access to the network interface
- OS involved only in connection setup, to ensure protection
- Performance benefits: zero-copy, low overhead

Virtual Interface Architecture

[Diagram: the application posts to send, receive, and completion queues through the VI Provider Library; setup and memory registration go through the kernel agent to the VI NIC]

- Data transfer from user space
- Setup and memory registration through the kernel
- Communication models (see the sketch below)
  - Send/Receive: a pair of descriptor queues
  - Remote DMA (RDMA): receive operation not required on the remote side
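
The sketch below is a minimal, self-contained model of the structures this slide describes: a VI endpoint with send, receive, and completion queues holding descriptors that point into registered memory. It is not the real VIPL API; all names and layouts are illustrative assumptions.

```c
/* Conceptual model of a Virtual Interface endpoint (illustrative only,
 * not the VIPL API): three descriptor queues plus registered buffers. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 64

typedef struct {
    void    *addr;       /* address inside a registered memory region */
    uint32_t length;     /* bytes to send or maximum bytes to receive */
    uint32_t mem_handle; /* handle returned by memory registration    */
    int      done;       /* set when the NIC completes the descriptor */
} vi_descriptor;

typedef struct {
    vi_descriptor *slots[QUEUE_DEPTH];
    size_t head, tail;
} vi_queue;

typedef struct {
    vi_queue send_q;     /* descriptors posted for transmission       */
    vi_queue recv_q;     /* descriptors posted for incoming data      */
    vi_queue comp_q;     /* completed descriptors, polled by the user */
} vi_endpoint;

/* Post a descriptor: the application enqueues it directly; no system call. */
static int vi_post(vi_queue *q, vi_descriptor *d)
{
    size_t next = (q->tail + 1) % QUEUE_DEPTH;
    if (next == q->head)
        return -1;               /* queue full */
    q->slots[q->tail] = d;
    q->tail = next;
    return 0;
}

/* Poll a queue: returns the oldest descriptor or NULL if empty. */
static vi_descriptor *vi_poll(vi_queue *q)
{
    if (q->head == q->tail)
        return NULL;
    vi_descriptor *d = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    return d;
}

int main(void)
{
    static char registered_buf[4096];   /* stands in for a registered region */
    vi_endpoint vi = {0};
    vi_descriptor d = { registered_buf, sizeof registered_buf, 1, 0 };

    vi_post(&vi.send_q, &d);            /* user-level post, no OS involvement */
    /* ... the NIC would DMA the buffer and move the descriptor to comp_q ... */
    d.done = 1;
    vi_post(&vi.comp_q, &d);
    vi_descriptor *c = vi_poll(&vi.comp_q);
    printf("completed descriptor of %u bytes (done=%d)\n",
           (unsigned)c->length, c->done);
    return 0;
}
```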

Direct Access File System Model

[Diagram: a user-level DAFS client (application buffers, file access API, VIPL) and a DAFS file server (buffers, VIPL or KVIPL) transfer data directly between VI NICs, with only the drivers residing in the kernel]

Goal: High-performance DAFS Server

- Cluster-based DAFS server
  - Direct access to network-attached storage distributed across the server cluster
  - Clusters of commodity computers: good performance at low cost
- User-level communication for server clustering
  - Low-overhead mechanism
  - Lightweight protocol for file access across the cluster

Outline

- Portable DAFS client and server implementation
- Clustering DAFS servers – Federated DAFS
- Performance Evaluation

User-space DAFS Implementation

[Diagram: applications issue DAFS API requests through a user-space DAFS client, which exchanges requests and responses over the VI network with a user-space DAFS server backed by the local file system]

- DAFS client and server in user space
- DAFS API primitives translate to RPCs on the server
- Staged event-driven architecture
- Portable across Linux, FreeBSD and Solaris

DAFS Server

[Diagram: a connection manager accepts client connection requests and hands DAFS API requests to a pool of protocol threads, which return responses to the client]

Client-Server Communication

[Diagram: an application calls dafs_read(file, buf) / dafs_write(file, buf); the DAFS client sends a request over the VI network, and the DAFS server, backed by the local file system, returns a response into the application buffer]

- VI channel established at client initialization
- VIA Send/Receive used for all operations except dafs_read
- Zero-copy data transfers (see the sketch below)
  - Emulation of RDMA Read used for dafs_read
  - Scatter/gather I/O used in dafs_write
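
As a rough illustration of the kind of request such an exchange implies, the layout below has a read request advertise a registered client buffer so the server can place the data directly into it (the emulated RDMA read), while a write carries its payload as a second scatter/gather element of the same send. The field names and layout are assumptions for clarity, not the DAFS wire format.

```c
/* Hypothetical request layout for dafs_read/dafs_write (illustrative only;
 * not the actual DAFS protocol). */
#include <stdint.h>
#include <stdio.h>

enum dafs_op { DAFS_OP_READ = 1, DAFS_OP_WRITE = 2 };

typedef struct {
    uint32_t op;                /* DAFS_OP_READ or DAFS_OP_WRITE        */
    uint32_t file_handle;       /* handle from a previous open          */
    uint64_t offset;            /* file offset                          */
    uint32_t length;            /* bytes to read or write               */
    /* read only: where the server should deposit the data              */
    uint64_t client_buf_addr;
    uint32_t client_mem_handle;
} dafs_request;

/* Build a read request: no payload follows; the server writes length
 * bytes directly into (client_buf_addr, client_mem_handle). */
static dafs_request make_read_req(uint32_t fh, uint64_t off, uint32_t len,
                                  void *buf, uint32_t mem_handle)
{
    dafs_request r = { DAFS_OP_READ, fh, off, len,
                       (uint64_t)(uintptr_t)buf, mem_handle };
    return r;
}

/* Build a write request header; the data would be attached as a second
 * scatter/gather element of the same VIA send, avoiding a copy. */
static dafs_request make_write_req(uint32_t fh, uint64_t off, uint32_t len)
{
    dafs_request r = { DAFS_OP_WRITE, fh, off, len, 0, 0 };
    return r;
}

int main(void)
{
    char buf[4096];
    dafs_request rd = make_read_req(7, 0, sizeof buf, buf, 42);
    dafs_request wr = make_write_req(7, 4096, sizeof buf);
    printf("read req: op=%u len=%u   write req: op=%u off=%llu\n",
           rd.op, rd.length, wr.op, (unsigned long long)wr.offset);
    return 0;
}
```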

Asynchronous I/O Implementation

- Applications use I/O descriptors to submit asynchronous read/write requests
- Read/write call returns immediately to the application
- Result stored in the I/O descriptor on completion
- Applications use the I/O descriptor to wait or poll for completion (see the sketch below)
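
A minimal sketch of what such an I/O descriptor interface could look like (all names are illustrative assumptions, not the paper's API): submission returns at once, a completion routine fills in the result, and the application polls the descriptor later.

```c
/* Illustrative asynchronous I/O descriptor: submission returns
 * immediately; the caller later polls or waits for the result. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef struct {
    int          file_handle;  /* which open file the request targets     */
    void        *buf;          /* application buffer for the data         */
    size_t       length;       /* bytes requested                         */
    volatile int done;         /* set by the client library on completion */
    ssize_t      result;       /* bytes transferred, or -1 on error       */
} dafs_io_desc;

/* Stand-in completion: in a real client this would run when the server's
 * response arrives on the VI channel. */
static void *completion_thread(void *arg)
{
    dafs_io_desc *d = arg;
    usleep(1000);                    /* stand-in for network + disk time */
    d->result = (ssize_t)d->length;  /* pretend the full read succeeded  */
    d->done = 1;
    return NULL;
}

/* Asynchronous read: hand the descriptor off and return immediately. */
static void dafs_read_async(dafs_io_desc *d, pthread_t *t)
{
    d->done = 0;
    pthread_create(t, NULL, completion_thread, d);
}

/* Poll the descriptor until its result is available. */
static ssize_t dafs_io_wait(dafs_io_desc *d, pthread_t t)
{
    while (!d->done)                 /* a real client could also block   */
        ;                            /* busy-poll on the completion flag */
    pthread_join(t, NULL);
    return d->result;
}

int main(void)
{
    char buf[8192];
    dafs_io_desc d = { 3, buf, sizeof buf, 0, 0 };
    pthread_t t;

    dafs_read_async(&d, &t);         /* returns immediately              */
    /* ... the application overlaps other work here ...                  */
    printf("read completed: %zd bytes\n", dafs_io_wait(&d, t));
    return 0;
}
```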

Benefits of Clustering

[Diagram: a standalone DAFS server compared with single DAFS servers and with clustered DAFS servers on a cluster; each clustered server interposes a clustering layer between the DAFS server and its local file system, and multiple clients connect over VI]

Clustering DAFS Servers Using FedFS

[Diagram: multiple DAFS servers perform file I/O through FedFS over the SAN]

- Federated File System (FedFS)
  - Federation of the local file systems on the cluster nodes
  - Extends the benefits of DAFS to cluster-based servers
  - Low-overhead protocol over the SAN

FedFS Goals

- Global name space across the cluster
  - Created dynamically for each distributed application
- Load balancing
- Dynamic reconfiguration

Virtual Directory (VD)

- Union of all local directories with the same pathname
- Each VD is mapped to a manager node
  - Manager determined using a hash function on the pathname (see the sketch below)
  - Manager constructs and maintains the VD

[Diagram: the virtual directory /usr, containing file1 and file2, is the union of the local /usr directories on two cluster nodes]
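
A minimal sketch of how such a pathname-hash mapping could work; the particular hash function, modulo placement, and node count below are illustrative assumptions, not the paper's exact scheme.

```c
/* Map a pathname to its manager node with a hash, as FedFS does for
 * virtual directories and files. The djb2-style hash and modulo
 * placement are illustrative choices. */
#include <stdio.h>

#define NUM_NODES 8   /* size of the server cluster */

static unsigned long hash_path(const char *path)
{
    unsigned long h = 5381;
    for (; *path; path++)
        h = h * 33 + (unsigned char)*path;
    return h;
}

/* Manager of a pathname: every node computes the same answer locally,
 * so no communication is needed to locate the manager. */
static int manager_of(const char *path)
{
    return (int)(hash_path(path) % NUM_NODES);
}

int main(void)
{
    const char *paths[] = { "/usr", "/usr/file1", "/usr/file2" };
    for (int i = 0; i < 3; i++)
        printf("%-12s -> manager node %d\n", paths[i], manager_of(paths[i]));
    return 0;
}
```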

Constructing a VD

- Constructed on first access to the directory
- Manager performs a dirmerge to merge real directory info from the cluster nodes into a VD (see the sketch below)
  - Summary of real directory info is generated and exchanged at initialization
  - Cached in memory and updated on directory-modifying operations
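
The sketch below shows one way a dirmerge could combine per-node directory summaries into a single virtual directory listing; the data structures and duplicate handling are assumptions for illustration.

```c
/* Illustrative dirmerge: union the entries of the real /usr directories
 * held on each cluster node into one virtual directory listing. */
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 64

typedef struct {
    const char *names[MAX_ENTRIES];  /* entry names reported by one node */
    int         count;
    int         node_id;             /* which node holds these entries   */
} dir_summary;

typedef struct {
    const char *name;
    int         node_id;             /* where the real entry lives       */
} vd_entry;

/* Merge the node summaries into the VD, skipping duplicate names. */
static int dirmerge(const dir_summary *nodes, int n_nodes,
                    vd_entry *vd, int max_vd)
{
    int vd_count = 0;
    for (int n = 0; n < n_nodes; n++) {
        for (int i = 0; i < nodes[n].count; i++) {
            int dup = 0;
            for (int j = 0; j < vd_count; j++)
                if (strcmp(vd[j].name, nodes[n].names[i]) == 0)
                    dup = 1;
            if (!dup && vd_count < max_vd) {
                vd[vd_count].name = nodes[n].names[i];
                vd[vd_count].node_id = nodes[n].node_id;
                vd_count++;
            }
        }
    }
    return vd_count;
}

int main(void)
{
    dir_summary nodes[] = {
        { { "file1" }, 1, 0 },       /* node 0 holds /usr/file1 */
        { { "file2" }, 1, 1 },       /* node 1 holds /usr/file2 */
    };
    vd_entry vd[MAX_ENTRIES];

    int n = dirmerge(nodes, 2, vd, MAX_ENTRIES);
    for (int i = 0; i < n; i++)
        printf("/usr/%s (on node %d)\n", vd[i].name, vd[i].node_id);
    return 0;
}
```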

File Access in FedFS

[Diagram: the DAFS server receiving a request for file f1 asks manager(f1) for the file's location, then accesses f1 from home(f1) over the VI network]

- Each file is mapped to a manager
  - Determined using a hash on the pathname
  - Manager maintains information about the file
- Ask the manager for the location (home) of the file
- Access the file from its home

Optimizing File Access

- Directory Table (DT) to cache file information
  - File information cached after the first lookup
  - Cache of the name space distributed across the cluster
- Block-level in-memory data cache (see the sketch below)
  - Data blocks cached on first access
  - LRU replacement
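
A minimal sketch of a block-level cache with LRU replacement of the kind described; the block size, capacity, and lookup structure are illustrative assumptions.

```c
/* Illustrative block-level in-memory cache with LRU replacement. */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE   4096
#define CACHE_BLOCKS 4

typedef struct {
    int  valid;
    int  file_handle;
    long block_no;                 /* block index within the file */
    long last_used;                /* logical timestamp for LRU   */
    char data[BLOCK_SIZE];
} cache_block;

static cache_block cache[CACHE_BLOCKS];
static long clock_tick;

/* Return cached data for (fh, block), or NULL on a miss. */
static char *cache_lookup(int fh, long block)
{
    for (int i = 0; i < CACHE_BLOCKS; i++)
        if (cache[i].valid && cache[i].file_handle == fh &&
            cache[i].block_no == block) {
            cache[i].last_used = ++clock_tick;
            return cache[i].data;
        }
    return NULL;
}

/* Insert a block fetched from its home node, evicting the LRU victim. */
static void cache_insert(int fh, long block, const char *data)
{
    int victim = 0;
    for (int i = 1; i < CACHE_BLOCKS; i++)
        if (!cache[i].valid ||
            cache[i].last_used < cache[victim].last_used)
            victim = i;
    cache[victim].valid = 1;
    cache[victim].file_handle = fh;
    cache[victim].block_no = block;
    cache[victim].last_used = ++clock_tick;
    memcpy(cache[victim].data, data, BLOCK_SIZE);
}

int main(void)
{
    char block[BLOCK_SIZE] = "data fetched from the block's home node";
    if (!cache_lookup(7, 0))            /* first access misses ...      */
        cache_insert(7, 0, block);      /* ... so fetch and cache it    */
    printf("second access hits: %s\n", cache_lookup(7, 0));
    return 0;
}
```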

Communication in FedFS

[Diagram: between two DAFS servers, one VI channel carries Send/Receive request/response messages and the other carries RDMA responses with data into pre-registered buffers]

- Two VI channels between any pair of server nodes
  - Send/Receive for request/response
  - RDMA exclusively for data transfer
- Descriptors and buffers registered at initialization (see the sketch below)
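
A rough sketch of the pre-registration idea: buffers are allocated and registered once at initialization and then reused for every request and response, so no registration cost is paid on the data path. The register_memory() call is a stand-in, not a real VIPL function.

```c
/* Illustrative pre-registered buffer pool for inter-server communication. */
#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE 8
#define BUF_SIZE  4096

typedef struct {
    void    *buf;           /* registered buffer                     */
    unsigned mem_handle;    /* handle the NIC uses to address it     */
    int      in_use;
} reg_buffer;

static reg_buffer pool[POOL_SIZE];

/* Stand-in for memory registration with the VI NIC (done via the kernel). */
static unsigned register_memory(void *addr, size_t len)
{
    (void)addr; (void)len;
    static unsigned next = 1;
    return next++;
}

/* One-time setup: allocate and register every buffer before any I/O. */
static void pool_init(void)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i].buf = malloc(BUF_SIZE);
        pool[i].mem_handle = register_memory(pool[i].buf, BUF_SIZE);
        pool[i].in_use = 0;
    }
}

/* On the data path, just grab a free pre-registered buffer: no syscalls. */
static reg_buffer *pool_get(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;
}

static void pool_put(reg_buffer *b) { b->in_use = 0; }

int main(void)
{
    pool_init();
    reg_buffer *b = pool_get();     /* buffer for an outgoing request */
    printf("using pre-registered buffer with handle %u\n", b->mem_handle);
    pool_put(b);
    return 0;
}
```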

Performance Evaluation

[Diagram: multiple application clients running the DAFS client connect over the VI network to DAFS servers running FedFS over their local file systems]

Experimental Platform

- Eight-node server cluster
  - 800 MHz PIII, 512 MB SDRAM, 9 GB 10K RPM SCSI
- Clients
  - Dual-processor (300 MHz PII), 512 MB SDRAM
- Linux 2.4
- Servers and clients equipped with Emulex cLAN adapters
- 32-port Emulex switch in full-bandwidth configuration

SAN Performance Characteristics

- VIA latency and bandwidth
  - poll used for the latency measurement, wait for the bandwidth measurement

Packet Size (Bytes)   Roundtrip Latency (µs)   Bandwidth (MB/s)
256                   23.3                      56
512                   27.3                      85
1024                  36.9                     108
2048                  56.0                     109
4096                  91.2                     110

Workloads

- Postmark – synthetic benchmark
  - Short-lived small files
  - Mix of metadata-intensive operations
- Benchmark outline
  - Create a pool of files
  - Perform transactions – READ/WRITE paired with CREATE/DELETE
  - Delete the created files

Workload Details

- Each client performs 30,000 transactions
- Each transaction – READ paired with CREATE/DELETE (see the sketch below)
  - READ = open, read, close
  - CREATE = open, write, close
  - DELETE = unlink
- Multiple clients used to reach maximum throughput
- Clients distribute requests to servers using a hash function on pathnames
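
As a concrete reading of that transaction mix, the loop below performs one READ (open, read, close) paired with either a CREATE (open, write, close) or a DELETE (unlink) per iteration, using plain POSIX calls on local files. It is a simplified stand-in for Postmark, not the benchmark itself, and the pool size, file size, and transaction count are arbitrary.

```c
/* Simplified Postmark-style transaction loop. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_TRANSACTIONS 1000
#define POOL_FILES       100
#define FILE_SIZE        2048

static void read_file(const char *path)          /* READ = open, read, close */
{
    char buf[FILE_SIZE];
    int fd = open(path, O_RDONLY);
    if (fd >= 0) { read(fd, buf, sizeof buf); close(fd); }
}

static void create_file(const char *path)        /* CREATE = open, write, close */
{
    char buf[FILE_SIZE] = {0};
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) { write(fd, buf, sizeof buf); close(fd); }
}

int main(void)
{
    char path[64];

    /* Create the initial pool of small files. */
    for (int i = 0; i < POOL_FILES; i++) {
        snprintf(path, sizeof path, "pm_%d.dat", i);
        create_file(path);
    }

    /* Transactions: READ paired with CREATE or DELETE, chosen at random. */
    for (int t = 0; t < NUM_TRANSACTIONS; t++) {
        snprintf(path, sizeof path, "pm_%d.dat", rand() % POOL_FILES);
        read_file(path);
        if (rand() % 2)
            create_file(path);
        else
            unlink(path);                         /* DELETE = unlink */
    }

    /* Delete whatever remains in the pool. */
    for (int i = 0; i < POOL_FILES; i++) {
        snprintf(path, sizeof path, "pm_%d.dat", i);
        unlink(path);
    }
    return 0;
}
```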

Base Case (Single Server)

- Maximum throughput
  - 5075 transactions/second (≈ 197 µs per transaction)
- Average time per transaction
  - At the client: ~200 µs
  - On the server: ~100 µs

Postmark Throughput

[Plot: Postmark throughput (txns/sec) vs. number of servers (1–8), for file sizes of 2 K, 4 K, 8 K and 16 K]

# Servers   Speedup
2           1.75
4           3
8           5

FedFS Overheads

- Files are physically placed on the node that receives the client requests
- Only metadata operations may involve communication
  - first open(file)
  - delete(file)
- Observed communication overhead
  - Average of one roundtrip message among servers per transaction

Other Workloads

- No client request sent to the file's correct location
  - All files created outside Federated DAFS
  - Only READ operations (open, read, close)
  - Potential increase in communication overhead
- Optimized coherence protocol minimizes communication
  - Avoids communication at open and close in the common case
- Data caching helps reduce the frequency of communication for remote data access

Postmark Read Throughput

- Each transaction = READ

[Plot: Postmark read throughput (txns/sec) for 2 and 4 servers, comparing Federated DAFS with and without the data cache]

Communication Overhead Without Caching

- Without caching, each read results in a remote fetch
  - Each remote fetch costs ~65 µs
  - Request message (< 256 B) + response message (4096 B)

# Servers   # Clients for Max. Throughput   # Transactions   # Remote Reads on each server
2           10                              300,000          150,000
4           20                              600,000          150,000

Work in Progress

- Study other application workloads
- Optimized coherence protocols to minimize communication in Federated DAFS
- File migration
  - Alleviate performance degradation from communication overheads
  - Balance load
- Dynamic reconfiguration of the cluster
- Study DAFS over a wide area network

Conclusions

- Efficient user-level DAFS implementation
- Low-overhead user-level communication used to provide a lightweight clustering protocol (FedFS)
- Federated DAFS minimizes overheads by reducing communication among the server nodes in the cluster
- Speedups of 3 on 4-node and 5 on 8-node clusters demonstrated using Federated DAFS
Thanks
Distributed Computing Laboratory
http://discolab.rutgers.edu

DAFS Performance

[Plot: Postmark throughput (txns/sec) vs. number of servers for 4 K files]