FaRM: Fast Remote Memory
Problem Context
- DRAM prices have decreased significantly
- Cost-effective to build commodity servers with hundreds of GBs of DRAM
  - E.g., a cluster of 100 machines can hold tens of TBs of main memory
- Removes the overhead of disk/flash
- Enables small random data accesses
- Network communication is still a bottleneck!
  - Fast networks alone won't remove this bottleneck as long as systems keep using TCP/IP networking
Problem Context (continued)
- Remote Direct Memory Access (RDMA): allows computers in a network to exchange data in main memory without involving the processor, cache, or OS of either computer
- Provides reliable user-level reads/writes of remote memory
- Achieves low latency and high throughput
- Bypasses the kernel
  - Avoids complex protocol-stack overheads
  - Frees up resources
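A minimal sketch of what a one-sided RDMA read looks like at the verbs level (illustrative, not FaRM's code); it assumes `qp` is an already-connected reliable (RC) queue pair, `mr` is a locally registered buffer, and `remote_addr`/`rkey` were exchanged out of band when the remote memory was registered:

```cpp
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

// Post a one-sided RDMA read: the remote CPU, cache, and OS are not involved.
int post_rdma_read(ibv_qp* qp, ibv_mr* mr,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(mr->addr);  // local destination buffer
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr{};
    ibv_send_wr* bad_wr = nullptr;
    wr.opcode              = IBV_WR_RDMA_READ;   // one-sided read
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion on the local CQ
    wr.wr.rdma.remote_addr = remote_addr;        // remote virtual address
    wr.wr.rdma.rkey        = rkey;               // remote memory key

    return ibv_post_send(qp, &wr, &bad_wr);      // 0 on success
}
```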
The Solution: FaRM
- FaRM: a main-memory distributed computing platform
- Exploits RDMA to improve latency and throughput
  - More than an order of magnitude higher throughput than state-of-the-art main-memory systems that use TCP/IP
- Simplified programming model
  - All of the memory of the machines in the cluster is a shared address space
  - Sufficient for most application code
  - Applications use transactions to allocate, read, write, and free objects in the address space with location transparency
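A hedged sketch of the shape of this programming model; the `Tx`/`Addr` interface and method names below are illustrative assumptions, not the paper's exact API, and the bodies are stubs so the sketch is self-contained:

```cpp
#include <cstdint>

struct Addr { uint32_t region; uint32_t offset; };   // object address in the shared space

struct Account { int64_t balance = 0; };

class Tx {
public:
    // In FaRM a read would be a local access or a one-sided RDMA read; stubbed here.
    template <typename T> T read(Addr) { return T{}; }
    // Writes are buffered locally and applied at commit time; stubbed here.
    template <typename T> void write(Addr, const T&) {}
    // Commit validates the versions of everything read, then applies the writes.
    bool commit() { return true; }
};

// Move `amount` between two objects living anywhere in the cluster; the code is
// the same whether the objects are local or remote (location transparency).
bool transfer(Tx& tx, Addr from, Addr to, int64_t amount) {
    Account a = tx.read<Account>(from);
    Account b = tx.read<Account>(to);
    a.balance -= amount;
    b.balance += amount;
    tx.write(from, a);
    tx.write(to, b);
    return tx.commit();   // false would mean the transaction aborted due to a conflict
}
```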
FaRM: Communication Primitives
- Uses one-sided RDMA reads for direct data access
- Uses RDMA writes to implement a fast message-passing primitive
  - A circular buffer implements a unidirectional channel
  - The buffer is stored on the receiver
  - One buffer for each sender/receiver pair
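A simplified sketch of the receiver side of such a channel (assumptions, not FaRM's code): the sender RDMA-writes a length-prefixed message into the receiver's buffer, and the receiver polls the length word at its head position, on the assumption that the non-zero length word becomes visible only after the payload:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

class RecvChannel {
public:
    explicit RecvChannel(size_t size) : buf_(size, 0) {}

    // Poll once; returns true and fills `out` if a complete message is at head.
    bool poll(std::vector<uint8_t>& out) {
        uint32_t len = 0;
        std::memcpy(&len, &buf_[head_], sizeof(len));
        if (len == 0) return false;                        // nothing new yet
        out.assign(buf_.begin() + head_ + sizeof(len),
                   buf_.begin() + head_ + sizeof(len) + len);
        std::memset(&buf_[head_], 0, sizeof(len) + len);   // zero the slot for reuse
        head_ = (head_ + sizeof(len) + len) % buf_.size(); // advance (wrap-around of
                                                           // partial messages omitted)
        bytes_consumed_ += sizeof(len) + len;              // periodically reported back so
        return true;                                       // the sender won't overrun us
    }

private:
    std::vector<uint8_t> buf_;   // registered with the NIC; the sender RDMA-writes here
    size_t head_ = 0;            // receiver's read position
    size_t bytes_consumed_ = 0;
};
```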
FaRM: Architecture
- Communication primitives are fast, but accesses to local main memory still achieve up to a 23x higher request rate
- FaRM is therefore designed to enable further performance gains by collocating data and computation on the same machine
- FaRM machines store data in main memory and also execute application threads
- The memory of all machines in the cluster is exposed as a shared address space
FaRM: Distributed Memory Management
- The shared address space consists of many 2 GB shared memory regions
  - A region is the unit of address mapping, recovery, and RDMA registration with the NIC
- Address of an object = 32-bit region identifier + 32-bit offset relative to the start of the region
- Object location is resolved with consistent hashing, which maps a region identifier to the machine that stores the object
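A small sketch of this address layout and a toy consistent-hash lookup (illustrative; the `Ring` structure and hash functions are assumptions, not FaRM's implementation):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Address = 32-bit region id | 32-bit offset within the 2 GB region.
inline uint64_t make_addr(uint32_t region, uint32_t offset) {
    return (static_cast<uint64_t>(region) << 32) | offset;
}
inline uint32_t region_of(uint64_t addr) { return static_cast<uint32_t>(addr >> 32); }
inline uint32_t offset_of(uint64_t addr) { return static_cast<uint32_t>(addr); }

// Toy consistent-hash ring: machines are hashed onto a ring; a region is
// owned by the first machine clockwise from the region id's hash point.
struct Ring {
    std::map<uint64_t, std::string> points;   // hash point -> machine name

    void add_machine(const std::string& m) {
        points[std::hash<std::string>{}(m)] = m;
    }
    const std::string& owner(uint32_t region) const {
        uint64_t h = std::hash<uint32_t>{}(region);
        auto it = points.lower_bound(h);
        if (it == points.end()) it = points.begin();   // wrap around the ring
        return it->second;
    }
};
```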
FaRM: Lock-free operations
- An application is guaranteed to read a consistent object state, even if the read is concurrent with writes to the same object
- Relies on cache-coherent DMA
- lockFreeRead: reads the object with RDMA and checks that the header version is unlocked and matches all of the cache-line versions
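A sketch of the validation step, assuming a layout with a lock bit plus version in the object header and a matching version word at the start of every cache line (the field widths and bit positions here are assumptions, not the paper's exact layout):

```cpp
#include <cstdint>
#include <cstring>

constexpr size_t kCacheLine = 64;

// Returns true if the snapshot in `obj` (fetched with one RDMA read) is
// consistent: the header is unlocked and every cache-line version matches it.
// If this returns false, lockFreeRead would simply retry the RDMA read.
bool consistent(const uint8_t* obj, size_t size) {
    uint64_t header;
    std::memcpy(&header, obj, sizeof(header));
    if ((header & 1) != 0) return false;          // assumed lock bit: object being written
    uint64_t version = header >> 1;               // assumed version field
    for (size_t off = kCacheLine; off < size; off += kCacheLine) {
        uint64_t word;
        std::memcpy(&word, obj + off, sizeof(word));
        if ((word >> 1) != version) return false; // cache-line version mismatch: torn read
    }
    return true;
}
```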
FaRM: Hashtables
- FaRM provides a general key-value store interface
- Implemented as a hashtable on top of the shared address space
- Used to obtain pointers to shared objects
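A simplified lookup sketch: FaRM's actual hashtable is a hopscotch-hashing variant designed so a lookup usually needs a single RDMA read, but the bucket layout and names below are illustrative assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>

struct Slot { uint64_t key; uint64_t value_addr; };   // value_addr points into the shared space
constexpr size_t kNeighbourhood = 8;                  // slots fetched with one read

struct Bucket { Slot slots[kNeighbourhood]; };

// `read_bucket` stands in for a one-sided RDMA read of the bucket the key
// hashes to; here it is just a callable parameter.
template <typename ReadBucket>
std::optional<uint64_t> lookup(uint64_t key, size_t num_buckets, ReadBucket read_bucket) {
    size_t b = std::hash<uint64_t>{}(key) % num_buckets;
    Bucket bucket = read_bucket(b);                    // single remote read
    for (const Slot& s : bucket.slots)
        if (s.key == key) return s.value_addr;         // pointer to the shared object
    return std::nullopt;                               // not found (overflow handling omitted)
}
```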
Evaluation
- FaRM's performance is compared to a baseline system that uses TCP/IP for messaging
- Performs better than MemC3, the best main-memory key-value store reported in the literature
- Achieves an order of magnitude higher throughput and lower latency than the baseline
- These results hold over a wide range of settings
Related Work: Pilaf
- Pilaf: a key-value store
  - Uses send/receive verbs to send update operations to the server
  - Uses one-sided RDMA reads to implement lookups
  - Provides linearizability using 64-bit CRCs (cyclic redundancy checks) to detect inconsistent reads
- FaRM:
  - Technique for detecting inconsistent reads is more general
  - Better hashtable performance: uses fewer RDMA reads per lookup and achieves higher space utilization
Related Work: RAMCloud
- RAMCloud: describes techniques for logging and recovery in a main-memory key-value store
  - Doesn't provide much detail about normal-case operation
- FaRM: uses similar techniques for logging and recovery, but extends them
  - Deals with transactions on general data structures in a shared address space
  - Focuses on techniques to achieve good performance in the normal case
Limitations
- Requires a major application overhaul: TCP/IP is no longer used, so applications must be rewritten against the FaRM API
- Requires overhauling the existing datacenter infrastructure
  - Need RDMA NICs on every server
  - Need InfiniBand for data centers larger than ~100 servers, because RoCE doesn't scale well
- 2 GB memory regions can lead to resource fragmentation
Next Steps
- The holy grail in this area would be a drop-in replacement for TCP/IP that existing applications could use without modification
- This would let applications better utilize the network bandwidth available with modern hardware