Infiniswap. Efficient Memory Disaggregation. Mosharaf Chowdhury. with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

1 Infiniswap Efficient Memory Disaggregation Mosharaf Chowdhury with Juncheng Gu, Youngmoon Lee, Yiwen Zhang, and Kang G. Shin

2 Rack-Scale Computing Datacenter-Scale Computing Geo-Distributed Computing Coflow Networking Open Source Apache Spark Open Source Cluster File System Facebook Proactive Analytics Before You Think! Resource Allocation DAG Scheduling Cluster Caching Microsoft Apache YARN Alluxio Fast Analytics Over the WAN

3 Rack-Scale Computing (< 0.01 ms), Datacenter-Scale Computing (~ 1 ms), Geo-Distributed Computing (> 100 ms)

4 Memory-Intensive Applications. The volume of data we want to make sense of is increasing. Memory is getting bigger and cheaper. Many workloads fit in memory. In-memory * is all the rage!

5 Perform Great! [Figure: TPC-C on VoltDB; TPS (thousands) vs. in-memory working set, ticks include 75% and 50%]

6 Perform Great Until Memory Runs Out. [Figure: TPC-C on VoltDB; TPS (thousands) vs. in-memory working set, ticks include 75% and 50%]

7 Perform Great Until Memory Runs Out. [Figure, two panels, each plotted against in-memory working set (ticks include 75% and 50%): TPC-C on VoltDB, TPS (thousands); FB workload on Memcached, ops (thousands)]

8 Perform Great Until Memory Runs Out. [Figure, three panels, each plotted against in-memory working set (ticks include 75% and 50%): TPC-C on VoltDB, TPS (thousands); FB workload on Memcached, ops (thousands); PageRank on PowerGraph, completion time (s)]

9 50% Less Memory Causes Slowdown of [Figure, three panels, each plotted against in-memory working set (ticks include 75% and 50%): TPC-C on VoltDB, TPS (thousands); FB workload on Memcached, ops (thousands); PageRank on PowerGraph, completion time (s)]

10 Between a Rock and a Hard Place. Underallocation leads to severe performance loss vs. overallocation leads to underutilization.

11 Memory Underutilization at Google [1]. [Figure: fraction of memory allocated vs. used over time (days); the plot marks 0.8 and 0.5] [1] Reiss, Charles, et al. "Heterogeneity and dynamicity of clouds at scale: Google trace analysis." SoCC '12.

12 Memory Load Imbalance. Measured as the ratio of 99th-percentile to median memory utilization. [Figure: imbalance ratio for a Google cluster and a Facebook cluster, compared with perfect balance]
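The imbalance metric on this slide can be computed from per-machine utilization samples. The following is a minimal sketch, assuming a nearest-rank percentile definition; the helper names `percentile` and `imbalance_ratio` are illustrative and do not come from the talk.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, with p an integer in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


def imbalance_ratio(utilizations):
    """Ratio of 99th-percentile to median memory utilization (1.0 = perfect balance)."""
    return percentile(utilizations, 99) / statistics.median(utilizations)


if __name__ == "__main__":
    # Hypothetical cluster: most machines lightly used, a few heavily used.
    cluster = [0.2] * 98 + [0.8, 0.9]
    print(imbalance_ratio(cluster))
```

A ratio of 1.0 means the most loaded machines use as much memory as the median machine; larger values indicate more imbalance.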

13 How Can We Recover This Memory?

14 Infiniswap Disaggregates Memory Exposes memory across server boundaries in a scalable, fault-tolerant, and efficient manner without modifying any applications, operating systems, or hardware

15 Memory Disaggregation. [Diagram: Machine 1 through Machine N, each with used and free memory; free memory on other machines is exposed as remote, disaggregated memory]

16 Design Goals. Improve application performance and cluster efficiency. Minimize deployment overhead: no new hardware, no software modification. Tolerate failures: machine crash, network disconnection. Manage remote memory at scale.

17 Selected Prior Efforts. Properties compared: No H/W Design, No App Modification, Fault-Tolerant, Scalable. Systems compared: Memory Blade [ISCA 09]; HPBD [CLUSTER 05] / nbdx [1]; RDMA key-value service (HERD [SIGCOMM 14], FaRM [NSDI 14]); Intel Rack Scale Architecture (RSA) [2]; Infiniswap. [1] [2]

18 Infiniswap. Exposes free remote memory as swap devices in a decentralized manner w/o affecting remote processes. 1. Infiniswap Block Device: finds free remote memory, maps pages, and provides fault tolerance without any central coordination. 2. Infiniswap Daemon: proactively evicts remote pages to ensure transparent, best-effort service.

19 Infiniswap in One Slide. [Diagram: Machine-1 through Machine-N each run containers and an Infiniswap daemon in user space, with the Virtual Memory Manager (VMM) and an Infiniswap block device in kernel space. A page fault sends an individual page from the VMM to the Infiniswap block device, which has async and sync paths to the local disk and the RNIC. Container A on Machine-1 is mapped to the memory of Machine-X.]

20 Are We There Yet? Goals: improve application performance and cluster efficiency; minimize deployment overhead (no new hardware, no software modification); tolerate failures (machine crash, network disconnection); manage remote memory at scale. Choices so far: remote memory paging over RDMA; async. backup to disk; ?
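The combination of remote paging with an asynchronous disk backup can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `PagingSketch`, its methods, and the in-memory dictionaries standing in for remote memory and the local disk are all hypothetical, and real RDMA or block I/O is not modeled.

```python
from collections import deque


class PagingSketch:
    """Toy model: pages go to remote memory, with a queued backup to local disk."""

    def __init__(self):
        self.remote = {}          # stands in for remote memory reached over RDMA
        self.disk = {}            # stands in for the local disk
        self.pending = deque()    # disk writes not yet applied (the async part)
        self.remote_alive = True

    def write(self, page_id, data):
        # Page-out: store remotely right away, queue the disk backup for later.
        if self.remote_alive:
            self.remote[page_id] = data
        self.pending.append((page_id, data))

    def flush_backups(self):
        # Apply queued disk writes in order; a background worker would do this.
        while self.pending:
            page_id, data = self.pending.popleft()
            self.disk[page_id] = data

    def fail_remote(self):
        # Model a remote machine crash or network disconnection.
        self.remote_alive = False
        self.remote.clear()

    def read(self, page_id):
        # Page-in: prefer remote memory, fall back to the disk backup on failure.
        if self.remote_alive and page_id in self.remote:
            return self.remote[page_id]
        self.flush_backups()
        return self.disk[page_id]
```

In this model a read after `fail_remote()` still succeeds because every written page also has a disk copy, which is the fault-tolerance role the backup plays on the slide.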

21 Scalability Challenges. How to find remote memory in the cluster? Too many pages lead to too much management overhead. A centralized solution can be slow and expensive.

22 Decentralized Mapping. Use large slabs instead of pages for memory management. Power of two choices. Select from new machines. After activity crosses a threshold, select the less loaded of the two machines to map the slab. [Diagram: an Infiniswap block device mapping slab S to one of several Infiniswap daemons]
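The power-of-two-choices placement can be sketched as follows. This is a simplified illustration, not the Infiniswap implementation: the function name `choose_machine_for_slab`, the `free_memory` dictionary, and the `already_mapped` set are assumptions, and querying a remote daemon is replaced by a dictionary lookup.

```python
import random


def choose_machine_for_slab(free_memory, already_mapped, rng=random):
    """Probe two random machines not used yet; return the one with more free memory.

    free_memory: dict of machine name -> free memory (any consistent unit).
    already_mapped: machines that already hold one of this device's slabs.
    Returns None when no new machine is available.
    """
    candidates = [m for m in free_memory if m not in already_mapped]
    if not candidates:
        return None
    probes = rng.sample(candidates, k=min(2, len(candidates)))
    # "Less loaded" is modeled here as "more free memory".
    return max(probes, key=lambda m: free_memory[m])


if __name__ == "__main__":
    cluster = {"m1": 10, "m2": 50, "m3": 30, "m4": 5}
    print(choose_machine_for_slab(cluster, already_mapped={"m2"}))
```

Probing only two machines keeps each mapping decision cheap and needs no central coordinator, while still steering slabs away from the most loaded machines.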

23 Scalability Challenges. How to find remote memory in the cluster? Too many pages lead to too much management overhead. A centralized solution can be slow and expensive. Which remote mapping should we evict? Eviction should avoid affecting remote applications' performance. Problem: estimating paging activity is hard because one-sided RDMA operations do not involve the remote CPU.

24 Batch Eviction. Power of many choices. Approximate LFU without contacting all slabs. When free memory falls below a threshold, contact up to E+E′ machines to evict E slabs. [Diagram: an Infiniswap daemon contacting several Infiniswap block devices]
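Batch eviction can be sketched as sampling a few more slabs than need to be evicted and then dropping the least active ones in the sample. The code below is a minimal illustration under assumptions: `evict_batch`, `num_evict`, `num_extra`, and the `query_activity` callback (standing in for asking the owning block device how active a slab is) are hypothetical names.

```python
import random


def evict_batch(candidate_slabs, query_activity, num_evict, num_extra, rng=random):
    """Pick num_evict victims by querying only num_evict + num_extra sampled slabs.

    candidate_slabs: slabs currently hosted by this daemon.
    query_activity: callable returning a slab's access frequency (one contact each).
    Returns the least-active slabs among the sample, approximating LFU.
    """
    sample_size = min(len(candidate_slabs), num_evict + num_extra)
    sampled = rng.sample(list(candidate_slabs), k=sample_size)
    # One query per sampled slab; unsampled slabs are never contacted.
    activity = {slab: query_activity(slab) for slab in sampled}
    return sorted(sampled, key=lambda slab: activity[slab])[:num_evict]


if __name__ == "__main__":
    freq = {"s1": 40, "s2": 3, "s3": 17, "s4": 8, "s5": 25}
    print(evict_batch(list(freq), freq.__getitem__, num_evict=2, num_extra=1))
```

Sampling more candidates makes the result closer to true least-frequently-used eviction, at the cost of more queries per eviction round.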

25 Infiniswap Design Choices. Goals: improve application performance and cluster efficiency; minimize deployment overhead (no new hardware, no software modification); tolerate failures (machine crash, network disconnection); manage remote memory at scale. Choices: remote memory paging over RDMA; async. backup to disk; decentralized mapping and eviction.

26 Evaluation. Deployment and evaluation on a 32-node, 56-Gbps InfiniBand network on CloudLab using memory-intensive applications. 1. Does it improve performance? 2. Does it improve utilization? 3. Does it scale? 4. Can it handle failure? Answer: YES

27 Even on 50% Memory, Slowdown is [Figure, three panels, each plotted against in-memory working set (ticks include 75% and 50%): TPC-C on VoltDB, TPS (thousands); FB workload on Memcached, ops (thousands); PageRank on PowerGraph, completion time (s)]

28 Higher & More Balanced Memory Utilization. [Figure: memory utilization (%) vs. rank of machines, with Infiniswap and w/o Infiniswap]

29 Higher & More Balanced Memory Utilization. [Figure: memory utilization (%) vs. rank of machines, with Infiniswap and w/o Infiniswap; annotation: higher utilization]

30 Three Followups. #1 Performance Isolation: between multiple tenants; in VMM and RDMA API. #2 Avoid Disk Backups: performance during failures; handle large paging bursts. #3 Rethink Paging Subsystem: for high-speed block devices; Infiniswap & NVMe devices.

31 Infiniswap Disaggregates Memory Exposes memory across server boundaries in a scalable, fault-tolerant, and efficient manner without modifying any applications, operating systems, or hardware

32 Infiniswap Disaggregates Memory. Learn more in our NSDI '17 paper. Try it from Contact us at Juncheng Gu, Youngmoon Lee, Yiwen Zhang

34 Infiniswap Microbenchmarks. [Figure, two panels vs. block size (4K, 16K, 64K, 256K): bandwidth (MB/s) for Infiniswap write, Infiniswap read, nbdx write, nbdx read; % CPU usage of 32 vCPUs for Infiniswap and nbdx] Higher I/O bandwidth. NO remote CPU usage.

35 Host Performance Unaffected. [Figure: memory utilization (%) of local and remote memory over time (s), with proactive eviction; ops (thousands) for baseline (94.1) and Infiniswap] NO impact on performance.
