ReFlex: Remote Flash ≈ Local Flash

1 ReFlex: Remote Flash ≈ Local Flash. Ana Klimovic, Heiner Litz, Christos Kozyrakis. October 28, 2016. IAP Cloud Workshop.

2 Flash in Datacenters: Flash provides orders of magnitude higher throughput and lower latency than disk. PCIe Flash: 1,000,000 IOPS, 10s of µs latency. Flash is underutilized due to imbalanced resource requirements.

3 Datacenter Flash Use-Case: [Diagram: clients in the application tier send get(k) and put(k,val) requests over TCP/IP to servers in the datastore tier. Each datastore server runs a key-value store (software) on a NIC, CPU, RAM, and Flash (hardware).]

4 Imbalanced Resource Utilization: Sample utilization of Facebook servers hosting a Flash-based key-value store over 6 months [EuroSys 16]. [EuroSys 16] Flash storage disaggregation. Klimovic, A., Kozyrakis, C., Thereska, E., John, B., Kumar, S.

7 Imbalanced Resource Utilization: Flash capacity and IOPS are underutilized for long periods of time. [EuroSys 16] Flash storage disaggregation. Klimovic, A., Kozyrakis, C., Thereska, E., John, B., Kumar, S.

8 Local Flash Architecture: [Diagram: clients in the application tier send get(k) and put(k,val) requests over TCP/IP to servers in the datastore tier. Each datastore server runs a key-value store (software) on a NIC, CPU, RAM, and Flash (hardware).] Flash and CPU are provisioned in a dependent manner.

9 Disaggregated Flash Architecture: [Diagram: clients in the application tier send get(k) and put(k,val) requests over TCP/IP to datastore-tier servers that run the key-value store (software) on a NIC, CPU, and RAM (hardware). The datastore servers issue read(blk) and write(blk,data) requests over a protocol such as iSCSI to a separate Flash tier, where a remote block service (software) runs on a CPU, NIC, RAM, and Flash (hardware).]
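The read(blk) and write(blk,data) interface between the datastore tier and the Flash tier could be carried over TCP with a small fixed header. The sketch below is only an illustration of that idea: the opcode values, the field widths, and the names `encode_request` and `decode_request` are assumptions made for this example and are not taken from ReFlex or iSCSI.

```python
import struct

# Hypothetical wire format (an assumption, not ReFlex's or iSCSI's):
# opcode (1 byte), block address (8 bytes), payload length (4 bytes),
# all in network byte order, followed by the payload bytes.
HEADER = struct.Struct("!BQI")
OP_READ, OP_WRITE = 0, 1

def encode_request(opcode, blk, data=b""):
    """Serialize one block request into bytes."""
    return HEADER.pack(opcode, blk, len(data)) + data

def decode_request(buf):
    """Parse bytes produced by encode_request into (opcode, blk, data)."""
    opcode, blk, length = HEADER.unpack_from(buf)
    data = buf[HEADER.size:HEADER.size + length]
    return opcode, blk, data
```

A read request carries no payload, so it is just the 13-byte header; a write request appends the block data after it.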

10 Existing Approaches: Why not apply methods for remote disk or remote memory to access remote flash? There are 2 main issues: 1. Performance overhead. 2. Interference on shared remote flash device.

11 Issue 1: Performance Overhead: [Plot: p95 read latency (µs) vs. IOPS (thousands) for 4kB random reads; series: Local Flash, iSCSI (1 core), libaio+libevent (1 core); annotations: higher latency, 75% throughput drop.] Traditional network storage protocols (e.g. iSCSI) and conventional Linux mechanisms have high overhead.

12 Issue 2: Interference: [Plot: p95 read latency (µs) vs. total IOPS (thousands); series: 100%, 99%, 95%, 90%, 75%, 50% read.] Latency depends on IOPS load. Writes impact read tail latency: Flash read performance degrades with increasing % write. To share flash, need performance isolation mechanisms.

13 Requirements: Low Latency, Cost, Scalability, QoS & Perf Isolation. Traditional I/O protocols (e.g. iSCSI): X X. Distributed storage systems (e.g. GFS, BigTable): X X. RDMA (e.g. NVMe over Fabrics): ~ ~ X. ReFlex.

14 ReFlex: A remote flash system that provides remote ≈ local flash performance over commodity networks. 1. Low latency, high throughput, low compute overhead. 2. Enforce tail latency and throughput guarantees for clients sharing flash.

15 Contributions: A remote flash system that provides remote ≈ local flash performance over commodity networks. 1. Efficient dataplane execution model → low latency and high throughput at low compute overhead. 2. Novel I/O scheduler → enforce tail latency and throughput guarantees for tenants sharing flash.

16 ReFlex Design: Separate control & data planes. Control plane: resource management (cores, network, Flash); allocate for IOPS, capacity, and latency requirements; example SLO: 1ms tail read latency at 1K IOPS. Dataplane: integrate storage & network stack; low latency, high throughput, high efficiency; enforce QoS with I/O scheduling.
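One way to picture the control plane's allocation step is as an admission check: a new tenant is accepted only if the device can still meet the strictest latency requirement at the combined load. The sketch below is a simplified illustration under stated assumptions, not the talk's actual policy. The function name `can_admit`, the tuple layout, and the calibration callback `capacity_at_latency` are hypothetical, and the check ignores the read/write mix and capacity requirements.

```python
def can_admit(reserved, new_tenant, capacity_at_latency):
    """Decide whether a new SLO fits next to the already admitted ones.

    reserved:            list of (iops, latency_us) SLOs already admitted
    new_tenant:          (iops, latency_us) SLO being requested
    capacity_at_latency: hypothetical calibration lookup that maps a tail
                         latency bound (us) to the load the device sustains
    """
    tenants = reserved + [new_tenant]
    # The device is shared, so every tenant sees the same tail latency;
    # the strictest latency SLO therefore bounds the total load.
    strictest = min(latency for _, latency in tenants)
    total_iops = sum(iops for iops, _ in tenants)
    return total_iops <= capacity_at_latency(strictest)
```

For example, with a made-up calibration curve `lambda lat: 100_000 if lat < 1000 else 300_000`, a tenant asking for 200,000 IOPS at 1000 µs fits next to an existing 50,000 IOPS reservation at 2000 µs, but not next to one at 500 µs.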

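17 System architecture: [Diagram: the control plane and an application linked with libix run in ring 3; the IX dataplane runs in guest ring 0; the Linux kernel with Dune runs in host ring 0. Dataplane cores have NIC RX/TX queues.]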

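18 System architecture: [Diagram: the control plane and applications linked with libix run in ring 3; IX and ReFlex run in guest ring 0; the Linux kernel with Dune runs in host ring 0. Dataplane cores have NIC RX/TX queues and NVMe submission/completion queues (SQ/CQ).]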

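19 Execution Model: [Diagram: the ReFlex server runs in ring 3 on libix and exchanges event conditions and batched syscalls with the dataplane in guest ring 0, which holds the TCP/IP, NVMe, and scheduler stages between the NIC RX/TX queues and the NVMe SQ/CQ. Steps 1-4 follow an incoming request from the NIC RX queue toward the NVMe submission queue.]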

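20 Execution Model: [Diagram: the ReFlex server runs in ring 3 on libix and exchanges event conditions and batched syscalls with the dataplane in guest ring 0, which holds the TCP/IP, NVMe, and scheduler stages between the NIC RX/TX queues and the NVMe SQ/CQ. Steps 5-8 follow the completion from the NVMe completion queue back toward the NIC TX queue.]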

21 Principle 1: Process to completion. Improve data-cache locality. [Diagram: ReFlex execution model.]

22 Principle 2: Batch adaptively. Match batch size to system load. Improve instruction-cache locality and prefetching. [Diagram: ReFlex execution model.]
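Matching the batch size to system load can be sketched as "take whatever is already queued, up to a cap, and never wait for more": under light load batches stay small, so latency is not inflated, and under heavy load they grow toward the cap. The function name `next_batch` and the cap of 64 are assumptions for illustration; the slide does not give a bound.

```python
from collections import deque

def next_batch(pending, max_batch=64):
    """Dequeue the requests that are ready now, bounded by max_batch.

    The batch is never padded by waiting: if only one request is pending,
    the batch has size one. max_batch is a hypothetical cap.
    """
    size = min(len(pending), max_batch)
    return [pending.popleft() for _ in range(size)]
```

With 100 queued requests, the first call returns 64 of them and the next call returns the remaining 36.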

23 Principle 3: Avoid data copies. Forward data directly from NIC → Flash. Reduce latency, cache pollution, and compute overhead. [Diagram: ReFlex execution model.]

24 Principle 4: Schedule I/O. Control rate at which each tenant issues I/O to shared flash device. Provide Quality of Service (QoS) guarantees. [Diagram: ReFlex execution model.]

25 Request Cost Model: [Plots: p95 read latency (µs) vs. total IOPS (thousands) and vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read.] Determine relative I/O cost during calibration. Cost = relative impact on tail latency of a concurrent (4kB) read, in units of tokens. Example: 4kB read cost = 1 token, 4kB write cost = X tokens. Find X that unifies curves for all rd/wr ratios: scale write IOPS by cost factor X, so that Rd IOPS + Wr IOPS becomes Rd IOPS + X * (Wr IOPS). For this device, X = 10.
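The weighted-IOPS formula on this slide, and one possible way to find X from calibration data, can be sketched as follows. `weighted_iops` is a direct transcription of Rd IOPS + X * (Wr IOPS). `fit_write_cost` is an assumption about how "find X that unifies curves" might be done: it takes, for each read/write mix, the total IOPS at which the device reaches one chosen tail latency, and picks the X that makes the weighted load of every mix match the read-only load in the least-squares sense. The function names and the fitting method are hypothetical, not the authors' calibration procedure.

```python
def weighted_iops(read_iops, write_iops, write_cost):
    """Token rate when a 4kB read costs 1 token and a 4kB write costs write_cost."""
    return read_iops + write_cost * write_iops

def fit_write_cost(read_only_iops, samples):
    """Least-squares estimate of the write cost X (illustrative method).

    read_only_iops: IOPS of the 100% read curve at the chosen tail latency
    samples:        (read_fraction, total_iops) pairs, each measured at the
                    same tail latency for a mix that includes writes
    """
    numerator = 0.0
    denominator = 0.0
    for read_fraction, total_iops in samples:
        reads = read_fraction * total_iops
        writes = (1.0 - read_fraction) * total_iops
        # We want reads + X * writes == read_only_iops for every sample.
        numerator += writes * (read_only_iops - reads)
        denominator += writes * writes
    if denominator == 0.0:
        raise ValueError("need at least one sample that includes writes")
    return numerator / denominator
```

As a usage example, a tenant issuing 90,000 reads/s and 10,000 writes/s with a write cost of 10 consumes `weighted_iops(90_000, 10_000, 10)` = 190,000 tokens/s.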

26 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotation: tail latency SLO (in ms).]

27 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotations: tail latency SLO (in ms); Device max IOPS: 485K.]

28 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotations: tail latency SLO (in ms); Tenant reserves 2K; Device max IOPS: 485K.]

30 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotations: tail latency SLO (in ms); Tenant reserves 2K; Device max IOPS: 485K.] For latency-critical tenants: generate tokens based on IOPS in SLO; avoid overhead by giving token credit limit; rate limit large bursts that exceed IOPS reservation; don't allow idle tenants to accumulate tokens.
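The rules for latency-critical tenants resemble a capped token bucket, which the following sketch illustrates. It is a simplification under assumptions, not the ReFlex scheduler itself: the class name `LatencyCriticalTenant`, its methods, and the single `credit_limit` cap are hypothetical. Tokens arrive at the rate reserved in the SLO, the cap keeps an idle tenant from hoarding tokens, and a burst larger than the available tokens waits in the queue.

```python
from collections import deque

class LatencyCriticalTenant:
    """Capped token bucket for one tenant (illustrative simplification)."""

    def __init__(self, slo_tokens_per_sec, credit_limit):
        self.rate = slo_tokens_per_sec    # token generation rate from the SLO
        self.credit_limit = credit_limit  # cap on accumulated tokens
        self.tokens = 0.0
        self.queue = deque()              # pending request costs, FIFO order

    def add_tokens(self, elapsed_sec):
        """Generate tokens for the elapsed time, never exceeding the cap."""
        earned = self.rate * elapsed_sec
        self.tokens = min(self.tokens + earned, self.credit_limit)

    def submit(self, cost):
        """Queue a request whose cost is given in tokens."""
        self.queue.append(cost)

    def schedule(self):
        """Issue queued requests in order while tokens cover their cost."""
        issued = []
        while self.queue and self.tokens >= self.queue[0]:
            cost = self.queue.popleft()
            self.tokens -= cost
            issued.append(cost)
        return issued
```

A tenant that was idle for a full second still holds only `credit_limit` tokens afterwards, so it cannot issue an arbitrarily large burst when it wakes up.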

31 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotations: tail latency SLO (in ms); Tenant reserves 2K; Device max IOPS: 485K.] For best-effort tenants: generate tokens based on left-over IOPS; schedule only if have enough tokens; round-robin for fairness.
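The best-effort rules can be sketched as a round-robin pass over tenant queues that spends only the tokens left over after latency-critical tenants are served. The function name `schedule_best_effort` and the data layout are assumptions for illustration. A real scheduler would also rotate which tenant goes first between rounds, which this sketch omits.

```python
from collections import deque

def schedule_best_effort(queues, leftover_tokens):
    """Round-robin over best-effort tenants using left-over tokens.

    queues:          dict mapping tenant name to a deque of request costs
    leftover_tokens: tokens not consumed by latency-critical tenants
    Returns the list of (tenant, cost) issued and the tokens remaining.
    """
    issued = []
    progress = True
    while progress:
        progress = False
        for tenant, queue in queues.items():
            # Issue at most one request per tenant per pass, and only if
            # the remaining tokens cover its full cost.
            if queue and queue[0] <= leftover_tokens:
                cost = queue.popleft()
                leftover_tokens -= cost
                issued.append((tenant, cost))
                progress = True
    return issued, leftover_tokens
```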

32 Token-based Scheduling: [Plot: p95 read latency (µs) vs. weighted IOPS (×10^3 tokens/s); series: 100%, 99%, 95%, 90%, 75%, 50% read; annotations: tail latency SLO (in ms); Tenant reserves 2K; Device max IOPS: 485K.] Distributed scheduler: each core manages tokens for own tenants; coordinate only to share unused tokens; scales well for multicore.
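The distributed design can be sketched with per-core token balances that need no locking, plus one shared pool that cores touch only when donating or claiming unused tokens. The class names `SharedTokenPool` and `CoreScheduler` and their methods are hypothetical, and the lock-protected counter stands in for whatever coordination mechanism the real system uses.

```python
import threading

class SharedTokenPool:
    """Global pool used only to exchange unused tokens between cores."""

    def __init__(self):
        self._lock = threading.Lock()
        self.tokens = 0

    def donate(self, amount):
        with self._lock:
            self.tokens += amount

    def claim(self, wanted):
        """Take up to `wanted` tokens and return how many were granted."""
        with self._lock:
            granted = min(wanted, self.tokens)
            self.tokens -= granted
            return granted

class CoreScheduler:
    """Per-core scheduler state; the local balance is private to the core."""

    def __init__(self, pool, tokens=0):
        self.pool = pool
        self.tokens = tokens

    def release_unused(self):
        """Hand tokens this core did not spend to the shared pool."""
        self.pool.donate(self.tokens)
        self.tokens = 0

    def acquire(self, cost):
        """Spend local tokens first; visit the shared pool only on a shortfall."""
        if self.tokens < cost:
            self.tokens += self.pool.claim(cost - self.tokens)
        if self.tokens < cost:
            return False  # keep what was gathered for a later attempt
        self.tokens -= cost
        return True
```

In the common case a core has enough local tokens and never touches the shared pool, which is what keeps cross-core coordination rare.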

33 Results: Local ≈ Remote Latency: [Plot: p95 read latency (µs) vs. IOPS (thousands); series: Local-1T, ReFlex-1T, Libaio-1T.] Linux: 75K IOPS/core. ReFlex: 850K IOPS/core.

34 Results: Local ≈ Remote Latency: [Plot: p95 read latency (µs) vs. IOPS (thousands); series: Local-1T, ReFlex-1T, Libaio-1T.] Unloaded latency: Local Flash 78 µs, ReFlex 99 µs, Libaio 121 µs.

35 Results: Local ≈ Remote Latency: [Plot: p95 read latency (µs) vs. IOPS (thousands); series: Local-1T, Local-2T, ReFlex-1T, ReFlex-2T, Libaio-1T, Libaio-2T.] ReFlex saturates Flash device.

36 Results: Performance Isolation: [Plots: read p95 latency (µs) and IOPS (thousands) per tenant with the I/O scheduler disabled and enabled, with the latency SLO and the IOPS SLOs of Tenants A and B marked. Tenant A: 100% rd, Tenant B: 80% rd, Tenant C: 95% rd, Tenant D: 25% rd.] Tenants A & B: latency-critical; Tenants C & D: best effort.

37 Results: Performance Isolation: [Plots: read p95 latency (µs) and IOPS (thousands) per tenant with the I/O scheduler disabled and enabled, with the latency SLO and the IOPS SLOs of Tenants A and B marked. Tenant A: 100% rd, Tenant B: 80% rd, Tenant C: 95% rd, Tenant D: 25% rd.] Tenants A & B: latency-critical; Tenants C & D: best effort. Without scheduler: latency and bandwidth QoS for A/B are violated.

38 Results: Performance Isolation: [Plots: read p95 latency (µs) and IOPS (thousands) per tenant with the I/O scheduler disabled and enabled, with the latency SLO and the IOPS SLOs of Tenants A and B marked. Tenant A: 100% rd, Tenant B: 80% rd, Tenant C: 95% rd, Tenant D: 25% rd.] Tenants A & B: latency-critical; Tenants C & D: best effort. Without scheduler: latency and bandwidth QoS for A/B are violated. Scheduler rate limits best-effort tenants to enforce SLOs.

39 Conclusion: ReFlex enables flash disaggregation: achieves remote ≈ local flash by rethinking the OS & network storage software stack; provides QoS on shared flash with I/O scheduler; uses commodity networking, has low CPU overhead. Future work: client interface (transactions, databases?); control plane (policies for resource management); QoS with host-side Flash Translation Layer (FTL).
