Transcript Slide 1

Latency Reduction Techniques for Remote Memory Access in ANEMONE
Mark Lewandowski
Department of Computer Science
Florida State University

Outline
• Introduction
• Architecture / Implementation
  • Adaptive NEtwork MemOry engiNE (ANEMONE)
  • Reliable Memory Access Protocol (RMAP)
  • Two-Level LRU Caching
  • Early Acknowledgments
• Experimental Results
• Future Work
• Related Work
• Conclusions

Introduction
• Virtual memory performance is bound by slow disks
• The state of computers today lends itself to the idea of shared memory:
  • Gigabit Ethernet
  • Machines on a LAN have lots of free memory
• Improvements to ANEMONE yield higher performance than disk and than the original ANEMONE system

[Figure: memory hierarchy, fastest to slowest — Registers, Cache, Memory, ANEMONE, Disk]

Contributions
• Pseudo Block Device (PBD)
• Reliable Memory Access Protocol (RMAP)
  • Replaces NFS
• Early Acknowledgments
  • Shortcut the communication path
• Two-Level LRU-Based Caching
  • Client / Memory Engine caches

ANEMONE Architecture
[Figure: both systems use the Client → Memory Engine → Memory Server path]
• ANEMONE (NFS): NFS swapping; swap daemon cache; no engine caching; must wait for the server to receive each page; client communicates with the Memory Engine
• ANEMONE: Pseudo Block Device (PBD); client cache; engine cache; early ACKs; client communicates with the Memory Engine

Architecture
[Figure: system architecture — client module, RMAP protocol, engine cache]

Pseudo Block Device (PBD)
• Provides a transparent interface for the swap daemon and ANEMONE
• Is not a kernel modification
• Handles READ/WRITE requests in order of arrival
  • No expensive elevator algorithm (see the sketch below)
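
As a concrete illustration, here is a minimal sketch of such a device, assuming the 2.6-era Linux block API of ANEMONE's time; anemone_rw() is a hypothetical stand-in for the RMAP send path, and gendisk setup is omitted:

#include <linux/module.h>
#include <linux/blkdev.h>

static struct request_queue *pbd_queue;
static DEFINE_SPINLOCK(pbd_lock);

/* Hypothetical helper: moves one run of sectors over RMAP,
 * returns nonzero on success. */
extern int anemone_rw(int write, sector_t sector, char *buf, unsigned nsect);

/* Serve requests strictly in order of arrival: pop the next request
 * and complete it immediately; no elevator-style reordering. */
static void pbd_request(struct request_queue *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        int ok = anemone_rw(rq_data_dir(req), req->sector,
                            req->buffer, req->current_nr_sectors);
        end_request(req, ok);
    }
}

static int __init pbd_init(void)
{
    pbd_queue = blk_init_queue(pbd_request, &pbd_lock);
    if (!pbd_queue)
        return -ENOMEM;
    /* A full driver would also alloc_disk()/add_disk() a gendisk. */
    return register_blkdev(0, "pbd") < 0 ? -EIO : 0;
}
module_init(pbd_init);
MODULE_LICENSE("GPL");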

Reliable Memory Access Protocol (RMAP)
• Lightweight
• Reliable
• Flow control
• The protocol sits next to the IP layer to allow the swap daemon quick access to pages

[Figure: protocol stack — Application/Swap daemon above Transport, RMAP beside IP, Ethernet below]
RMAP
• Window-based protocol
• Requests are served as they arrive
• Messages (a hypothetical header sketch follows below):
  • REG/UNREG – register/unregister the client with the ANEMONE cluster
  • READ/WRITE – send/receive data from ANEMONE
  • STAT – retrieve statistics from the ANEMONE cluster
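
The slides do not reproduce the wire format; the following is a hypothetical sketch of a header carrying these message types (every field name and width here is an assumption, not the published layout):

#include <linux/types.h>

enum rmap_op {
    RMAP_REG,    /* register the client with the ANEMONE cluster */
    RMAP_UNREG,  /* unregister the client                        */
    RMAP_READ,   /* request a page from ANEMONE                  */
    RMAP_WRITE,  /* send a page to ANEMONE                       */
    RMAP_STAT,   /* retrieve statistics from the cluster         */
    RMAP_ACK,    /* acknowledgment (including early ACKs)        */
};

/* Hypothetical RMAP header, sitting directly above Ethernet, beside IP. */
struct rmap_hdr {
    __u8  op;       /* one of enum rmap_op                   */
    __u8  window;   /* sender's current flow-control window  */
    __u16 seq;      /* sequence number, for reliability      */
    __u32 offset;   /* 4 KB page offset within remote memory */
} __attribute__((packed));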

Why do we need a cache?
• It is the natural analogue of on-disk buffers
• Caching reduces network traffic
• Caching decreases latency
  • Write latencies benefit the most
• The cache buffers requests before they are sent over the wire

Basic Cache Structure
[Figure: FIFO queue of cache_entry structs (head/tail), indexed by a hash function into a hash table]

struct cache_entry {
    struct list_head queue;   /* links the entry into the list that makes up the cache */
    unsigned long offset;     /* offset of the page */
    u8 *page;                 /* the page itself */
    int write;                /* nonzero if this entry is a write */
    struct sk_buff *skb;      /* may or may not point to an sk_buff; if it does,
                               * the cache must take care to call kfree_skb() when
                               * the page is kicked out of memory (this avoids a
                               * memcpy on the receive path) */
    int answered;             /* nonzero once the request has been answered */
};

• The FIFO queue is used to keep track of the LRU page
• The hash table is used for fast page lookups (an insert/evict sketch follows below)
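
A minimal sketch of lookup, insertion, and LRU eviction over this structure, assuming the entry gains a hash_link member for the bucket chain (not shown on the slide); the bucket count and capacity are illustrative:

#include <linux/types.h>
#include <linux/list.h>
#include <linux/hash.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

#define BUCKET_BITS   10
#define CACHE_BUCKETS (1 << BUCKET_BITS)
#define CACHE_MAX     4096   /* 16 MB of 4 KB pages (client cache size) */

/* The slide's cache_entry, extended (assumption) with a bucket link: */
struct cache_entry {
    struct list_head queue;      /* FIFO link                  */
    struct list_head hash_link;  /* hash-bucket link (assumed) */
    unsigned long offset;
    u8 *page;
    int write;
    struct sk_buff *skb;
    int answered;
};

static LIST_HEAD(cache_fifo);    /* head = oldest entry = LRU victim */
static struct list_head cache_index[CACHE_BUCKETS];
static int cache_count;

static void cache_init(void)
{
    int i;
    for (i = 0; i < CACHE_BUCKETS; i++)
        INIT_LIST_HEAD(&cache_index[i]);
}

/* Fast page lookup through the hash table. */
static struct cache_entry *cache_lookup(unsigned long offset)
{
    struct cache_entry *e;
    list_for_each_entry(e, &cache_index[hash_long(offset, BUCKET_BITS)],
                        hash_link)
        if (e->offset == offset)
            return e;
    return NULL;
}

/* Evict the FIFO head; per the slide, an entry still holding its
 * sk_buff must be released with kfree_skb(). */
static void cache_evict_oldest(void)
{
    struct cache_entry *victim =
        list_first_entry(&cache_fifo, struct cache_entry, queue);

    list_del(&victim->queue);
    list_del(&victim->hash_link);
    if (victim->skb)
        kfree_skb(victim->skb);
    else
        kfree(victim->page);
    kfree(victim);
    cache_count--;
}

static void cache_insert(struct cache_entry *e)
{
    if (cache_count >= CACHE_MAX)
        cache_evict_oldest();
    list_add_tail(&e->queue, &cache_fifo);   /* newest at the tail */
    list_add(&e->hash_link,
             &cache_index[hash_long(e->offset, BUCKET_BITS)]);
    cache_count++;
}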

ANEMONE Cache Details
• Client cache
  • 16 MB
  • Write-back
  • Memory allocated at load time
• Engine cache
  • 80 MB
  • Write-through (the two policies are contrasted in the sketch below)
  • Partial memory allocation at load time
• sk_buffs are copied when they arrive at the Engine
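
To make the policy difference concrete, a hypothetical sketch reusing cache_insert() from the sketch above; forward_to_server() and the flush-on-evict behavior are assumptions:

/* Write-back (client cache): the dirty page stays local and only
 * goes over the wire when it is evicted from the cache. */
static void client_cache_write(struct cache_entry *e)
{
    cache_insert(e);   /* eviction of a dirty entry would push it
                        * to the engine before freeing it */
}

/* Write-through (engine cache): keep a copy for future reads, but
 * forward the page to the memory server immediately. */
extern void forward_to_server(struct cache_entry *e);

static void engine_cache_write(struct cache_entry *e)
{
    cache_insert(e);
    forward_to_server(e);
}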

Early Acknowledgments
• Reduce client wait time
• Can reduce write latency by up to 200 µs per write request
• Early ACK performance is slowed by the small RMAP window size
• A small pool (~200) of sk_buffs is maintained for forward ACKing (see the sketch below)

[Figure: Client, Memory Engine, and Memory Server exchanging a write and its early ACK]
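
A hypothetical sketch of the engine's WRITE path with early ACKs; every helper name below is an assumption:

#include <linux/skbuff.h>

#define ACK_POOL_SIZE 200   /* "small pool (~200) of sk_buffs" */

static struct sk_buff *ack_pool[ACK_POOL_SIZE];  /* prefilled with
                                                  * alloc_skb() at init
                                                  * time (not shown)   */
static int ack_avail;       /* preallocated ACKs remaining */

extern void rmap_build_ack(struct sk_buff *ack, struct sk_buff *req);
extern void rmap_send(struct sk_buff *skb);
extern void engine_cache_store(struct sk_buff *req);
extern void forward_to_server(struct sk_buff *req);

static void engine_handle_write(struct sk_buff *req)
{
    /* Early ACK: answer the client before the page reaches the
     * memory server, saving up to ~200 µs of wait time per write. */
    if (ack_avail > 0) {
        struct sk_buff *ack = ack_pool[--ack_avail];
        rmap_build_ack(ack, req);
        rmap_send(ack);
    }
    /* If the pool is empty, the client simply waits for the normal
     * ACK after the server receives the page. */

    engine_cache_store(req);   /* write-through engine cache */
    forward_to_server(req);
}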

Experimental Testbed
• Configured with 400,000 blocks (4 KB pages) of memory (~1.6 GB)

Experimental Description
• Latency
  • 100,000 read/write requests
  • Sequential/random
• Application run times
  • Quicksort / POV-Ray
  • Single/multiple processes
  • Execution times
• Cache performance
  • Measured cache hit rates
  • Client / Engine

[Figures: latency results for sequential read, sequential write, random read, and random write]

Single Process Performance
• The single process's size is increased by 100 MB for each iteration
• Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE
• POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE

Multiple Process Performance
• The number of 100 MB processes is increased by 1 for each iteration
• Quicksort: 710% increase over disk, 117% increase over the original ANEMONE
• POV-Ray: 835% increase over disk, 115% increase over the original ANEMONE

Client Cache Performance
• Hits save ~500 µs each
• The POV-Ray hit rate saves ~270 seconds for the 1200 MB test
• The Quicksort hit rate saves ~45 seconds for the 1200 MB test
• The swap daemon interferes with cache hit rates
  • Prefetching
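For scale: at ~500 µs saved per hit, the ~270 second saving implies roughly 540,000 client-cache hits over the 1200 MB POV-Ray run, and the ~45 second saving implies roughly 90,000 hits for Quicksort.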

Engine Cache Performance
• Cache performance levels out at ~10%
• POV-Ray does not exceed 10% because it performs over 3x the number of page swaps that Quicksort does
• The engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test

Future Work
• More extensive testing
• Aggressive caching algorithms
• Data compression
• Page fragmentation
• P2P
• RDMA over Ethernet
• Scalability and fault tolerance

Related Work
• Global Memory System [feeley95]
  • Implements a global memory management algorithm over ATM
  • Does not directly address virtual memory
• Reliable Remote Memory Pager [markatos96], Network RAM Disk [flouris99]
  • TCP sockets
• Samson [stark03]
  • Myrinet
  • Does not perform caching
• Remote Memory Model [comer91]
  • Implements a custom protocol
  • Guarantees in-order delivery

Conclusions
• ANEMONE does not modify the client OS or applications
• Performance increases by up to 263% for single processes
• Performance increases by up to 117% for multiple processes
• Improved caching is a promising line of research, but more aggressive algorithms are required
Questions?

Appendix A: Quicksort Memory Access Patterns

Appendix B: POV-Ray Memory Access Patterns