Latency Reduction
Techniques for
Remote Memory
Access in ANEMONE
Mark Lewandowski
Department of Computer Science
Florida State University
Outline
Introduction
Architecture / Implementation
Adaptive NEtwork MemOry engiNE (ANEMONE)
Reliable Memory Access Protocol (RMAP)
Two Level LRU Caching
Early Acknowledgments
Experimental Results
Future Work
Related Work
Conclusions
Introduction
Virtual memory performance is bound by slow disks
The state of computers today lends itself to the idea of shared memory
Gigabit Ethernet
Machines on a LAN have lots of free memory
Improvements to ANEMONE yield higher performance than both disk and the original ANEMONE system
[Figure: memory hierarchy, with ANEMONE between Memory and Disk: Registers, Cache, Memory, ANEMONE, Disk]
Contributions
Pseudo Block Device (PBD)
Reliable Memory Access Protocol (RMAP): replaces NFS
Early Acknowledgments: shortcut the communication path
Two-Level LRU-Based Caching: at the Client and the Memory Engine
ANEMONE Architecture

ANEMONE (NFS):
  Client: NFS swapping
  Memory Engine: no caching; must wait for the server to receive the page
  Memory Server: communicates with the Memory Engine

ANEMONE:
  Client: Pseudo Block Device (PBD); Client Cache below the swap daemon
  Memory Engine: Engine Cache; early ACKs
  Memory Server: communicates with the Memory Engine
Architecture
[Figure: architecture diagram showing the Client Module, the RMAP protocol link, and the Engine Cache]
Pseudo Block Device
Provides a transparent interface for the swap daemon and ANEMONE
Is not a kernel modification
Handles READ/WRITE requests in order of arrival; no expensive elevator algorithm
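The in-order handling above amounts to a plain FIFO queue. A minimal sketch, assuming hypothetical names (`pbd_request`, `pbd_submit`, `pbd_next` are illustrative, not ANEMONE's actual symbols):

```c
#include <stddef.h>

/* Sketch of FIFO request handling: requests are served strictly in
 * arrival order, with no elevator-style reordering by offset. */

enum pbd_op { PBD_READ, PBD_WRITE };

struct pbd_request {
    enum pbd_op op;
    unsigned long offset;        /* page offset on the device */
    struct pbd_request *next;
};

struct pbd_queue {
    struct pbd_request *head;    /* dequeue from here (oldest) */
    struct pbd_request *tail;    /* enqueue here (newest) */
};

/* Enqueue at the tail: no sorting by offset, unlike an elevator scheduler. */
void pbd_submit(struct pbd_queue *q, struct pbd_request *req)
{
    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}

/* Dequeue from the head: strict arrival order. */
struct pbd_request *pbd_next(struct pbd_queue *q)
{
    struct pbd_request *req = q->head;
    if (req) {
        q->head = req->next;
        if (!q->head)
            q->tail = NULL;
    }
    return req;
}
```

Skipping the elevator pass is reasonable here because remote memory, unlike a disk, has no seek cost that reordering could amortize.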
Reliable Memory Access Protocol (RMAP)
Lightweight
Reliable
Flow Control
The protocol sits next to the IP layer to allow the swap daemon quick access to pages
[Figure: protocol stack showing RMAP beside IP, below the swap daemon and transport layers, above Ethernet]
RMAP
• Window-based protocol
• Requests are served as they arrive
• Messages:
  • REG/UNREG – register/unregister the client with the ANEMONE cluster
  • READ/WRITE – send/receive data from ANEMONE
  • STAT – retrieves statistics from the ANEMONE cluster
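The message set above could be carried by a small header plus a send window. This is a hedged sketch only: the field widths, message codes, and names (`rmap_hdr`, `rmap_window`, `rmap_can_send`) are assumptions, not RMAP's actual wire format.

```c
#include <stdint.h>

/* Hypothetical RMAP message types, mirroring the slide's message list. */
enum rmap_type { RMAP_REG, RMAP_UNREG, RMAP_READ, RMAP_WRITE, RMAP_STAT };

/* Illustrative on-the-wire header; real field sizes are not given. */
struct rmap_hdr {
    uint8_t  type;      /* one of enum rmap_type */
    uint32_t seq;       /* sequence number, for reliability */
    uint32_t offset;    /* page offset, for READ/WRITE */
    uint16_t len;       /* payload length (0 for REG/UNREG/STAT) */
} __attribute__((packed));

/* Window-based flow control: at most `window` unacknowledged
 * requests may be outstanding at any time. */
struct rmap_window {
    uint32_t next_seq;  /* next sequence number to send */
    uint32_t acked;     /* highest cumulative ACK received */
    uint32_t window;    /* window size */
};

int rmap_can_send(const struct rmap_window *w)
{
    return w->next_seq - w->acked < w->window;
}
```

A sender would call `rmap_can_send` before transmitting and stall until an ACK advances `acked`, which is what bounds the number of in-flight pages.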
Why do we need cache?
A natural counterpart to on-disk buffers
Caching reduces network traffic
Decreases latency
Write latencies benefit the most: requests are buffered before they are sent over the wire
Basic Cache Structure
[Figure: FIFO queue (head/tail) of cache_entry structs, indexed through a hash function into a hash table]

struct cache_entry {
        struct list_head queue;  /* points to the linked list that makes up the cache */
        unsigned long offset;    /* offset of the page */
        u8 *page;                /* the page */
        int write;
        struct sk_buff *skb;     /* This may or may not point to an sk_buff. If it does,
                                  * then the cache must take care to call kfree_skb when
                                  * the page is kicked out of memory (this is to avoid a
                                  * memcpy). */
        int answered;
};
FIFO Queue is used to keep track of LRU page
Hashtable is used for fast page lookups
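A minimal sketch of how the FIFO queue and hash table might work together, assuming hypothetical names (`entry`, `cache_insert`, `cache_lookup`, `cache_evict`) and omitting the page data and sk_buff fields of the real `struct cache_entry`:

```c
#include <stddef.h>

#define NBUCKETS 64

/* Illustrative only: a recency-ordered queue (head = LRU victim)
 * combined with a hash table keyed on page offset for fast lookups. */
struct entry {
    unsigned long offset;
    struct entry *prev, *next;   /* recency queue links */
    struct entry *hnext;         /* hash-chain link */
};

struct cache {
    struct entry *head, *tail;   /* head = least recently used */
    struct entry *bucket[NBUCKETS];
};

static unsigned bucket_of(unsigned long off) { return off % NBUCKETS; }

/* Insert a new entry as most recently used and index it in the table. */
void cache_insert(struct cache *c, struct entry *e)
{
    unsigned h = bucket_of(e->offset);
    e->hnext = c->bucket[h];
    c->bucket[h] = e;
    e->prev = c->tail;
    e->next = NULL;
    if (c->tail)
        c->tail->next = e;
    else
        c->head = e;
    c->tail = e;
}

/* Hash lookup; on a hit the entry moves to the MRU end of the queue. */
struct entry *cache_lookup(struct cache *c, unsigned long off)
{
    struct entry *e;
    for (e = c->bucket[bucket_of(off)]; e; e = e->hnext)
        if (e->offset == off)
            break;
    if (!e || e == c->tail)
        return e;
    /* unlink from the queue ... */
    if (e->prev)
        e->prev->next = e->next;
    else
        c->head = e->next;
    e->next->prev = e->prev;
    /* ... and relink at the MRU end */
    e->prev = c->tail;
    e->next = NULL;
    c->tail->next = e;
    c->tail = e;
    return e;
}

/* Evict the LRU entry (queue head) and remove it from the hash table. */
struct entry *cache_evict(struct cache *c)
{
    struct entry *e = c->head, **pp;
    if (!e)
        return NULL;
    c->head = e->next;
    if (c->head)
        c->head->prev = NULL;
    else
        c->tail = NULL;
    for (pp = &c->bucket[bucket_of(e->offset)]; *pp != e; pp = &(*pp)->hnext)
        ;
    *pp = e->hnext;
    return e;
}
```

The queue makes victim selection O(1), and the hash table keeps hit-path lookups from scanning the whole queue.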
ANEMONE Cache Details
Client Cache
16 MB
Write-Back
Memory allocation at load time
Engine Cache
80 MB
Write-Through
Partial memory allocation at load time
sk_buffs are copied when they arrive at the Engine
Early Acknowledgments
• Reduces client wait time
• Can reduce write latency by up to 200 µs per write request
• Early ACK performance is slowed by the small RMAP window size
• A small pool (~200) of sk_buffs is maintained for forward ACKing
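One way to picture the early-ACK path, as a hedged sketch (the names `handle_write` and `server_confirmed` and the pool bookkeeping are illustrative; the real engine works on preallocated sk_buffs): a WRITE is acknowledged to the client from the pool before the page reaches the memory server, falling back to normal ACKing when the pool is empty.

```c
#define EARLY_ACK_POOL 200   /* ~200 preallocated buffers, per the slide */

struct engine {
    int free_acks;           /* preallocated buffers available for early ACKs */
};

/* Returns 1 if the write was early-ACKed, 0 if the engine must wait
 * for the memory server's acknowledgment instead. */
int handle_write(struct engine *eng)
{
    if (eng->free_acks > 0) {
        eng->free_acks--;    /* consume a preallocated buffer */
        /* ACK the client now, then forward the page to the server */
        return 1;
    }
    /* pool exhausted: forward first, ACK only after the server confirms */
    return 0;
}

/* When the server confirms the forwarded page, return the buffer. */
void server_confirmed(struct engine *eng)
{
    if (eng->free_acks < EARLY_ACK_POOL)
        eng->free_acks++;
}
```

The client's saving is the engine-to-server round trip, which is where the up-to-200 µs per write request comes from.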
[Figure: early ACK message flow between Client, Memory Engine, and Memory Server]
Experimental Testbed
Experimental testbed configured with 400,000 blocks (4 KB pages) of memory (~1.6 GB)
Experimental Description
Latency
  100,000 sequential/random read/write requests
Application Run Times
  Quicksort / POV-Ray
  Single/multiple processes: execution times
Cache Performance
  Measured Client / Engine cache hit rates
[Figures: latency results for sequential read, sequential write, random read, and random write]
Single Process Performance
Increase single process size by 100 MB for each iteration
Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE
POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE
Multiple Process Performance
Increase the number of 100 MB processes by 1 for each iteration
Quicksort: 710% increase over disk, and 117% increase over the original ANEMONE
POV-Ray: 835% increase over disk, and 115% increase over the original ANEMONE
Client Cache Performance
Hits save ~500 µs
POV-Ray hit rate saves ~270 seconds for the 1200 MB test
Quicksort hit rate saves ~45 seconds for the 1200 MB test
Swap daemon prefetching interferes with cache hit rates
Engine Cache Performance
Cache performance levels out at ~10%
POV-Ray does not exceed 10% because it performs over 3x the number of page swaps that Quicksort does
The engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test
Future Work
More extensive testing
Aggressive caching algorithms
Data Compression
Page Fragmentation
P2P
RDMA over Ethernet
Scalability and Fault tolerance
Related Work
Global Memory System [feeley95]
Implements a global memory management algorithm over ATM
Does not directly address Virtual Memory
Reliable Remote Memory Pager [markatos96], Network RAM Disk
[flouris99]
Samson [stark03]
TCP Sockets
Myrinet
Does not perform caching
Remote Memory Model [comer91]
Implements custom protocol
Guarantees in-order delivery
Conclusions
ANEMONE does not modify client OS or
applications
Performance increases by up to 263% for single processes (over the original ANEMONE)
Performance increases by up to 117% for multiple processes (over the original ANEMONE)
Improved caching is a promising line of research, but more aggressive algorithms are required
Questions?
Appendix A: Quicksort Memory
Access Patterns
Appendix B: POV-Ray Memory
Access Patterns