Transcript Slide 1
Performance Evaluation of
InfiniBand NFS/RDMA for Linux
Benjamin Allan, Helen Chen, Scott Cranford, Ron
Minnich, Don Rudish, and Lee Ward
Sandia National Laboratories
This work was supported by the United States Department
of Energy, Office of Defense Programs. Sandia is a
multiprogram laboratory operated by Sandia Corporation, a
Lockheed-Martin Company, for the United States Department of
Energy under contract DE-AC04-94-AL85000.
Talk Outline
• Read/Write performance
• Application Profile
• System Profile
• Network Profile
• Infinitely fast file and disk I/O
• Infinitely Fast Network
Sandia Motivation for looking at NFS/RDMA
• Why NFS/RDMA, and why is Sandia looking at it?
– Use for HPC Platforms
– Transparent solution for applications
– In the mainstream kernel
– Increased performance over normal NFS
Reads vs. Writes, TCP vs. RDMA
• NFS/TCP read/write ratio: 2:1
• NFS/RDMA read/write ratio: 5:1
• Previous work:
  http://www.chelsio.com/nfs_over_rdma.html
FTQIO Application
• FTQ (fixed time quantum)
  – Simply put, rather than doing a fixed unit of work and measuring how long that work took, FTQ measures the amount of work done in a fixed time quantum.
• FTQ I/O (FTQIO)
  – FTQ modified to measure file system performance by writing data and recording statistics.
  – A high-resolution benchmark.
• For more information, visit:
  – http://rt.wiki.kernel.org/index.php/FTQ
• How it works
  – One thread of the program writes blocks of allocated memory to disk.
  – A second thread records the number of bytes written and, optionally, Supermon data (more on Supermon later).
• Basic operation (a minimal sketch follows this slide)
  – The loop counts work done until it reaches a fixed end point in time.
  – It then records the starting point of the loop and the amount of work that was done.
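Below is a minimal user-space sketch of the FTQ measurement loop just described. It is illustrative only, not the real FTQ/FTQIO code (see the URL above): the 440 µs quantum matches the plots that follow, gettimeofday() stands in for FTQ's more careful timing, and the "work" counter would be bytes written in FTQIO.

/* ftq_sketch.c -- illustrative only; not the actual FTQ/FTQIO source.
 * For each fixed time quantum, count how much work gets done and
 * record (quantum start, work count) pairs for later plotting.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

#define QUANTUM_US 440          /* fixed time quantum, as in the plots */
#define NSAMPLES   1000

static uint64_t now_us(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (uint64_t)tv.tv_sec * 1000000ULL + tv.tv_usec;
}

int main(void)
{
        static uint64_t start[NSAMPLES], count[NSAMPLES];
        volatile uint64_t sink = 0;

        for (int i = 0; i < NSAMPLES; i++) {
                uint64_t t0 = now_us(), end = t0 + QUANTUM_US, work = 0;

                /* Do unit work until the fixed end point in time. */
                while (now_us() < end) {
                        sink += work;   /* stand-in for real work (FTQIO: write a block) */
                        work++;
                }
                start[i] = t0;          /* record the starting point of the loop ... */
                count[i] = work;        /* ... and the amount of work done           */
        }

        for (int i = 0; i < NSAMPLES; i++)
                printf("%llu %llu\n", (unsigned long long)start[i],
                       (unsigned long long)count[i]);
        return 0;
}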
FTQIO Application Profile
• Red dots represent the data written in each 440-microsecond interval.
• Every 440 microseconds, FTQIO counts how many bytes were written and plots that value.
[Plot: bytes written over a 27-second FTQIO run]
Application Profile with VMM Data
• Red dots represent the bytes recorded in each 440-microsecond interval.
• Blue dots represent the number of dirty pages (see the sampling sketch after this slide).
• Purple dots represent the number of pages in the writeback queue.
• Black dots represent when the application goes to sleep.
[Plot: bytes written over a 27-second FTQIO run, with VMM data overlaid]
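The slides do not say how the dirty-page and writeback counts were sampled (Supermon is the likely collector); as a rough illustration, a sampler along these lines could poll the nr_dirty and nr_writeback counters that Linux exports in /proc/vmstat. The 440 µs interval and the fixed sample count are arbitrary choices for the sketch.

/* vmm_sample_sketch.c -- illustrative sampler, not the Supermon code.
 * Polls nr_dirty and nr_writeback from /proc/vmstat at a fixed interval.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        for (int i = 0; i < 1000; i++) {
                FILE *f = fopen("/proc/vmstat", "r");
                char key[64];
                unsigned long long val, dirty = 0, writeback = 0;

                if (!f)
                        return 1;
                while (fscanf(f, "%63s %llu", key, &val) == 2) {
                        if (strcmp(key, "nr_dirty") == 0)
                                dirty = val;
                        else if (strcmp(key, "nr_writeback") == 0)
                                writeback = val;
                }
                fclose(f);
                printf("nr_dirty=%llu nr_writeback=%llu\n", dirty, writeback);
                usleep(440);            /* match the 440 us plot interval */
        }
        return 0;
}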
Added bytes transmitted from IB card
• Red dots represent the bytes recorded in each 440-microsecond interval.
• Blue dots represent the number of dirty pages.
• Purple dots represent the number of pages in the writeback queue.
• Black dots represent when the application goes to sleep.
• Green dots represent the number of bytes transmitted on the InfiniBand card (see the counter-reading sketch after this slide).
[Plot: bytes written over a 27-second FTQIO run, with VMM and InfiniBand transmit data overlaid]
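Likewise, the slides do not show how the InfiniBand transmit bytes were gathered. One way, sketched below, is to read the port_xmit_data counter that the IB stack exposes in sysfs; the device name mlx4_0 and port 1 are assumptions, and port_xmit_data counts 4-byte words, hence the factor of 4.

/* ib_xmit_sketch.c -- illustrative only; device name and port are guesses.
 * Samples the InfiniBand port transmit counter and prints the per-interval
 * delta in bytes (port_xmit_data is reported in 4-byte words).
 */
#include <stdio.h>
#include <unistd.h>

#define COUNTER "/sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data"

static unsigned long long read_xmit_bytes(void)
{
        unsigned long long words = 0;
        FILE *f = fopen(COUNTER, "r");

        if (f) {
                if (fscanf(f, "%llu", &words) != 1)
                        words = 0;
                fclose(f);
        }
        return words * 4;               /* 4-byte words -> bytes */
}

int main(void)
{
        unsigned long long prev = read_xmit_bytes();

        for (int i = 0; i < 1000; i++) {
                usleep(440);            /* match the 440 us plot interval */
                unsigned long long cur = read_xmit_bytes();
                printf("%llu bytes transmitted\n", cur - prev);
                prev = cur;
        }
        return 0;
}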
Baseline Approach
[Diagram: baseline NFS/RDMA write path. On the client, the application writes through the VFS into the page cache; the kernel NFS client hands the data to the RPC layer, which goes out over TCP/IP or RDMA, across PCI to the HCA, and onto the IB fabric. On the server, the HCA and RPC layer feed the NFS server, which writes through the VFS and page cache to the file system, block layer, PCI, disk controller, and disk. "Short circuit" patch points are marked on both the client and the server and are detailed on the following slides.]
Look at the Code
• Where to look? The Linux Cross Reference: http://lxr.linux.no
• Ftrace (see the sketch below)
  – Comes with the kernel; no userspace programs required
  – Controlled through debugfs
  – http://rt.wiki.kernel.org/index.php/Ftrace
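The next slide jumps straight to sample output. As a rough sketch of how such a trace can be captured, the program below switches the current tracer to function_graph through the debugfs tracing files and dumps whatever was traced. It assumes debugfs is mounted at /sys/kernel/debug and root privileges; the exact file names vary slightly across kernel versions.

/* ftrace_sketch.c -- illustrative only.  Enables the function_graph
 * tracer via debugfs and prints whatever has been traced so far.
 */
#include <stdio.h>
#include <unistd.h>

#define TRACING "/sys/kernel/debug/tracing/"

static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fputs(val, f);
        fclose(f);
        return 0;
}

int main(void)
{
        char line[512];
        FILE *trace;

        if (write_file(TRACING "current_tracer", "function_graph"))
                return 1;       /* debugfs not mounted, or not root */

        sleep(1);               /* let some activity get traced */

        trace = fopen(TRACING "trace", "r");
        if (!trace)
                return 1;
        while (fgets(line, sizeof(line), trace))
                fputs(line, stdout);
        fclose(trace);

        /* Switch back to no tracer when done. */
        write_file(TRACING "current_tracer", "nop");
        return 0;
}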
# tracer: function_graph
#
# CPU  TASK/PID      DURATION           FUNCTION CALLS
# |    |              |   |              |   |   |   |
 0)    dd-2280   |               |  schedule_tail() {
 0)    dd-2280   |   0.000 us    |    finish_task_switch();
 0)    dd-2280   |   0.000 us    |    __might_sleep();
 0)    dd-2280   |   0.000 us    |    _cond_resched();
 0)    dd-2280   |   0.000 us    |    __task_pid_nr_ns();
 0)    dd-2280   |   0.000 us    |    down_read_trylock();
 0)    dd-2280   |   0.000 us    |    __might_sleep();
 0)    dd-2280   |   0.000 us    |    _cond_resched();
 0)    dd-2280   |   0.000 us    |    find_vma();
 0)    dd-2280   |               |    handle_mm_fault() {
 0)    dd-2280   |               |      kmap_atomic() {
 0)    dd-2280   |               |        kmap_atomic_prot() {
 0)    dd-2280   |   0.000 us    |          page_address();
 0)    dd-2280   |   0.000 us    |        }
 0)    dd-2280   |   0.000 us    |      }
 0)    dd-2280   |   0.000 us    |      _spin_lock();
 0)    dd-2280   |               |      do_wp_page() {
 0)    dd-2280   |   0.000 us    |        vm_normal_page();
 0)    dd-2280   |   0.000 us    |        reuse_swap_page();
 0)    dd-2280   |               |        unlock_page() {
 0)    dd-2280   |   0.000 us    |          __wake_up_bit();
 0)    dd-2280   |   0.000 us    |        }
 0)    dd-2280   |               |        kunmap_atomic() {
 0)    dd-2280   |   0.000 us    |          arch_flush_lazy_mmu_mode();
 0)    dd-2280   |   0.000 us    |        }
 0)    dd-2280   |               |        anon_vma_prepare() {
 0)    dd-2280   |   0.000 us    |          __might_sleep();
 0)    dd-2280   |   0.000 us    |          _cond_resched();
 0)    dd-2280   |   0.000 us    |        }
 0)    dd-2280   |               |        __alloc_pages_internal() {
 0)    dd-2280   |   0.000 us    |          __might_sleep();
 0)    dd-2280   |   0.000 us    |          _cond_resched();
 0)    dd-2280   |               |          get_page_from_freelist() {
 0)    dd-2280   |   0.000 us    |            next_zones_zonelist();
 0)    dd-2280   |   0.000 us    |            next_zones_zonelist();
 0)    dd-2280   |   0.000 us    |            zone_watermark_ok();
 0)    dd-2280   |   0.000 us    |          }
 0)    dd-2280   |   0.000 us    |        }
Infinitely fast file and disk I/O
• When the NFS server wants to write to a file, claim success (without touching the file system or disk).
[Diagram: the same client/server stack as in the baseline slide, with the short circuit applied in the server's NFS write path, ahead of the file system, block, PCI, controller, and disk layers.]

In fs/nfsd/nfs3proc.c:

/*
 * Write data to a file
 */
static __be32
nfsd3_proc_write(struct svc_rqst *rqstp, struct nfsd3_writeargs *argp,
                 struct nfsd3_writeres *resp)
{
        __be32  nfserr;

        /* Short-circuit patch: claim the write succeeded without doing it. */
        if (foobar_flag != '0') {
                resp->count = argp->count;
                RETURN_STATUS(0);
        }

        fh_copy(&resp->fh, &argp->fh);
        resp->committed = argp->stable;
        nfserr = nfsd_write(rqstp, &resp->fh, NULL,
                            argp->offset,
                            rqstp->rq_vec, argp->vlen,
                            argp->len,
                            &resp->committed);
        resp->count = argp->count;
        RETURN_STATUS(nfserr);
}

Throughput (MB/s) by payload and application record size:

Payload (bytes)   32-KB record   512-KB record   1-MB record
32768                   285.60          283.40        281.60
65536                   377.00          350.50        293.00
131072                  387.50          363.50        306.00
262144                  401.40          335.80        305.00
524288                  425.00          376.50        312.50
Infinitely Fast Network
• Remove the RDMA transport from the NFS write path.
• Max throughput of 1.25 GB/sec.
• Nothing goes out on the network.
• RPC transmit
  – Factor of roughly 3 improvement (1.25 GB/sec vs. 377 MB/sec) compared with sending the data over the wire.
  – Returns claiming that the transmit completed and that it now has the reply.
• Tells the NFS client service that the page was committed.
[Diagram: the same client/server stack, with the short circuit applied in the client's RPC transport, before the RDMA/TCP layer and the HCA.]
Recap And Conclusion
[Diagram: the client/server NFS/RDMA stack annotated with the measured rates: 377 MB/sec through the client NFS/RPC write path when data goes over the wire, 1.25 GB/sec with the RDMA transport short-circuited (infinitely fast network), and 1.8 GB/sec at the PCI/HCA level.]