LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang

Size: px

Start display at page:

Download "LITE Kernel RDMA. Support for Datacenter Applications. Shin-Yeh Tsai, Yiying Zhang"

Bernice James
6 years ago
Views:

1 LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang

2 Time 2

3 Berkeley Socket Userspace Kernel Hardware Time

4 Berkeley Socket TCP Offload engine Arrakis & mtcp IX Userspace Kernel Hardware Time s 2014? U-Net RDMA in HPC RDMA in Datacenters 2

5 RDMA (Remote Direct Memory Access) CPU User Kernel Memory RDMA Directly read/write remote memory Bypassing kernel Memory zero copy Benefits Low latency High throughput Low CPU utilization 3

6 Things have worked well in HPC Special hardware Few applications Cheaper developer 4

7 RDMA-Based Datacenter Applications 5

8 RDMA-Based Datacenter Applications Pilaf [ATC 13] HERD-RPC [ATC 16] Cell [ATC 16] FaRM [NSDI 14] Wukong [OSDI 16] FaSST [OSDI 16] HERD [SIGCOMM 14] Hotpot [SoCC 17] NAM-DB [VLDB 17] RSI [VLDB 16] DrTM [SOSP 15] APUS [SoCC 17] Octopus [ATC 17] DrTM+R [EuroSys 16] FaRM+Xact Mojim [SOSP 15] [ASPLOS 15] 5

9 Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 6

10 Userspace Native RDMA Hardware User Space Conn Mgmt User-Level RDMA App send recv Connections Queues Keys node, lkey, rkey addr Mem Mgmt Memory space Kernel Space OS Library Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 7

11 Userspace Native RDMA Hardware User Space Conn Mgmt User-Level RDMA App send recv Connections Queues Keys node, lkey, rkey addr Mem Mgmt Memory space Library Kernel Space OS Kernel Bypassing Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 7

12 Userspace Native RDMA Hardware User Space Conn Mgmt User-Level RDMA App send recv Connections Queues Keys node, lkey, rkey addr Mem Mgmt Memory space Library Kernel Space OS Kernel Bypassing Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 7

13 Userspace Hardware Low-level Difficult to use High-level Easy to use 8

14 Userspace Hardware Low-level Difficult to use Developers want High-level Easy to use 8

15 Userspace Hardware Low-level Difficult to use Developers want High-level Easy to use Socket 8

16 Userspace Hardware Low-level Difficult to use RDMA Developers want High-level Easy to use Socket 8

17 Userspace Hardware Low-level Difficult to use Difficult to share RDMA Socket Developers want High-level Easy to use Resource share Isolation 8

18 Userspace Hardware Low-level Difficult to use Difficult to share RDMA Socket Developers want High-level Easy to use Resource share Isolation Abstraction Mismatch 8

19 Userspace Fat applications No resource sharing Hardware Low-level Difficult to use Difficult to share RDMA Socket Developers want High-level Easy to use Resource share Isolation Abstraction Mismatch 8

20 Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 9

21 Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 9

22 Things have worked well in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 9

23 Userspace Native RDMA Hardware User Space Conn Mgmt User-Level RDMA App send recv Connections Queues Keys node, lkey, rkey addr Mem Mgmt Memory space Library Kernel Space OS Kernel Bypassing Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 10

24 Userspace Hardware 11

25 Userspace On-NIC SRAM 1. Fetches and caches page table entries 2. Stores secret keys for every consecutive memory region Hardware 11

26 Userspace Hardware On-NIC SRAM 1. Fetches and caches page table entries 2. Stores secret keys for every consecutive memory region Requests /us Write-64B Write-1K Total Size (MB) 11

27 Userspace Hardware Requests /us Expensive, unscalable On-NIC SRAM 1. Fetches and caches page table entries 2. Stores secret keys for every consecutive memory region Total Size (MB) Write-64B Write-1K hardware 11

28 Things have been good in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 12

29 Things have been good in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 12

30 Things have been good in HPC Special hardware Few applications Cheaper developer What about datacenters? Commodity, cheaper hardware Many (changing) applications Resource sharing and isolation 12

31 13

32 Fat applications No resource sharing Expensive, unscalable hardware 13

33 Fat applications No resource sharing Expensive, unscalable hardware Are we removing too much from kernel? 13

34 Outline Introduction and motivation Overall design and abstraction LITE internals LITE applications Conclusion 14

35 Without Kernel High-level abstraction Resource sharing Protection Performance isolation 15

36 Without Kernel High-level abstraction Resource sharing Protection Performance isolation 15

37 Without Kernel High-level abstraction Resource sharing Protection Performance isolation 15

38 Without Kernel High-level abstraction Resource sharing Protection Performance isolation 15

39 Without Kernel High-level abstraction Resource sharing Protection Performance isolation 15

40 LITE - Local Indirection TiEr High-level High-level abstraction abstraction Resource Resource sharing sharing Protection Performance isolation Performance isolation Protection 15

41 All problems in computer science can be solved by another level of indirection Butler Lampson 16

42 User Space Conn Mgmt User-Level RDMA App send recv Connections Queues Keys node, lkey, rkey addr Mem Mgmt Memory space Library Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 17

43 User-Level RDMA App User Space Conn Mgmt send recv node, lkey, rkey addr LITE APIs Memory APIs RPC/Msg APIs Sync APIs Mem Mgmt Kernel Space LITE Connections Queues Keys Memory space Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 18

44 Simpler applications User-Level RDMA App User Space Conn Mgmt send recv node, lkey, rkey addr LITE APIs Memory APIs RPC/Msg APIs Sync APIs Mem Mgmt Kernel Space LITE Connections Queues Keys Memory space Hardware RNIC Permission check Address mapping Cached PTEs lkey 1 lkey n rkey 1 rkey n 18

45 Simpler applications User-Level RDMA App User Space Conn Mgmt send recv node, lkey, rkey addr Mem Mgmt LITE APIs Memory APIs RPC/Msg APIs Sync APIs Kernel Space LITE Connections Queues Keys Permission check Address mapping Global lkey Memory space Global rkey Hardware RNIC Global lkey Global rkey Cheaper hardware Scalable performance 19

46 Simpler applications User-Level RDMA App User Space Conn Mgmt send recv node, lkey, rkey addr LITE APIs Memory APIs RPC/Msg APIs Sync APIs Mem Mgmt Kernel Space LITE Connections Queues Keys Permission check Address mapping RDMA Verbs Global lkey Memory space Global rkey Hardware RNIC Global lkey Global rkey Cheaper hardware Scalable performance 19

47 User Space Simpler applications Conn Mgmt User-Level RDMA App send recv node, lkey, rkey addr Mem Mgmt LITE APIs Memory APIs RPC/Msg APIs Sync APIs Kernel Space LITE Connections Queues Keys Permission check Address mapping RDMA Verbs Global lkey Memory space Global rkey Hardware RNIC Global lkey Global rkey Cheaper hardware Scalable performance 19

48 Implementing Remote memset Native RDMA 20

49 Implementing Remote memset Native RDMA LITE 20

50 Implementing Remote memset Native RDMA LITE 20

51 Implementing Remote memset Native RDMA LITE 20

52 All problems in computer science can be solved by another level of indirection Butler Lampson 21

53 All problems in computer science can be solved by another level of indirection Butler Lampson David Wheeler 21

54 All problems in computer science can be solved by another level of indirection Butler Lampson David Wheeler except for the problem of too many layers of indirection David Wheeler 21

55 Main Challenge: How to preserve the performance benefit of RDMA? 22

56 Design Principles 1.Indirection only at local for one-sided RDMA CPU User Memory Kernel Berkeley Socket CPU User Kernel Memory RDMA Userspace Kernel Hardware 23

57 Design Principles 1.Indirection only at local for one-sided RDMA CPU User Memory Kernel CPU User Memory Kernel CPU User Memory Kernel Berkeley Socket RDMA LITE Userspace Kernel Hardware 23

58 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection Kernel Space LITE Hardware RNIC Address mapping Permission check 24

59 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection Kernel Space LITE Address mapping Permission check Hardware RNIC Address mapping Permission check 24

60 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection Kernel Space LITE Address mapping Permission check Hardware RNIC No redundant indirection Scalable performance 24

61 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection 3.Hide kernel cost 25

62 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection 3.Hide kernel cost except for the problem of too many layers of indirection David Wheeler 25

63 Design Principles 1.Indirection only at local for one-sided RDMA 2.Avoid hardware indirection 3.Hide kernel cost except for the problem of too many layers of indirection David Wheeler Great Performance and Scalability 25

64 Outline Introduction and motivation Overall design and abstraction LITE internals LITE applications Conclusion 26

65 LITE - Architecture Mgmt User-Level App User-Level App User-Level RPC Function LITE Abstraction OS Kernel App Verbs Abstraction RNIC Driver RNIC global lkey global rkey 27

66 LITE - Architecture Mgmt User-Level App User-Level App User-Level RPC Function LITE Abstraction OS Kernel App Verbs Abstraction LITE 1-Side RDMA global lkey global rkey RNIC Driver lh1 lh2 Permission check Address mapping addr1 addr2 RNIC global lkey global rkey 27

67 LITE - Architecture Mgmt User-Level App User-Level App User-Level RPC Function LITE Abstraction OS Kernel App Verbs Abstraction LITE 1-Side RDMA global lkey global rkey RNIC Driver lh1 lh2 Permission check Address mapping addr1 addr2 LITE RPC RDMA Buffer Mgmt send RPC Client RPC Server Connections Queues poll recv RNIC global lkey global rkey 27

68 LITE - Architecture Mgmt User-Level App User-Level App User-Level RPC Function LITE Abstraction OS LITE APIs Kernel App mgmt mem synch msging RPC Verbs Abstraction LITE 1-Side RDMA global lkey global rkey RNIC Driver lh1 lh2 Permission check Address mapping addr1 addr2 LITE RPC RDMA Buffer Mgmt send RPC Client RPC Server Connections Queues poll recv RNIC global lkey global rkey 27

69 LITE - Architecture Mgmt User-Level App User-Level App User-Level RPC Function LITE Abstraction Verbs Abstraction OS LITE 1-Side RDMA LITE APIs global lkey global rkey RNIC Driver lh1 Kernel App lh2 mgmt mem synch msging RPC Permission check Address mapping LITE RPC addr1 addr2 RDMA Buffer Mgmt send RPC Client RPC Server Connections Queues poll recv RNIC global lkey global rkey 27

70 Onload Costly Operations LITE Connections Queues Keys Memory space OS RNIC Permission check Address mapping 28

71 Onload Costly Operations LITE OS Connections Queues Keys Permission check Address mapping Memory space RNIC Perform address mapping and protection in kernel 28

72 Avoid Hardware Indirection LITE OS Connections Queues Keys Permission check Address mapping Memory space RNIC lkey 1 lkey n rkey 1 rkey n Cached PTEs Challenge: How to eliminate hardware indirection without changing hardware? 29

73 Avoid Hardware Indirection LITE OS Connections Queues Keys Permission check Address mapping Memory space RNIC lkey 1 lkey n rkey 1 rkey n Cached PTEs Challenge: How to eliminate hardware indirection without changing hardware? Register with physical address no need for any PTEs 29

74 Avoid Hardware Indirection LITE OS Connections Queues Keys Permission check Address mapping Memory space RNIC lkey 1 lkey n rkey 1 rkey n Challenge: How to eliminate hardware indirection without changing hardware? Register with physical address no need for any PTEs 29

75 Avoid Hardware Indirection LITE OS Connections Queues Keys Permission check Address mapping Memory space RNIC lkey 1 lkey n rkey 1 rkey n Challenge: How to eliminate hardware indirection without changing hardware? Register with physical address no need for any PTEs Register whole memory at once one global key 29

76 Avoid Hardware Indirection LITE OS Connections Queues Keys Permission check Address mapping Global lkey Memory space Global rkey RNIC Global lkey Global rkey Challenge: How to eliminate hardware indirection without changing hardware? Register with physical address no need for any PTEs Register whole memory at once one global key 29

77 LITE LMR and RDMA Userspace application LITE in Kernel Network Remote nodes 30

78 LITE LMR and RDMA LMR Userspace application LITE in Kernel Network Remote nodes 30

79 LITE LMR and RDMA LMR Node Phy Addr 1 0x45 4 0x27 Userspace application LITE in Kernel Network Remote nodes 30

80 LITE LMR and RDMA LMR Node Phy Addr 1 0x45 Node 1 0x45 4 0x27 Node 4 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

81 LITE LMR and RDMA lh LMR Node Phy Addr 1 0x45 Node 1 0x45 4 0x27 Node 4 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

82 LITE LMR and RDMA lh LITE_read(lh, offset, size) LMR Node Phy Addr 1 0x45 Node 1 0x45 4 0x27 Node 4 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

83 LITE LMR and RDMA lh LITE_read(lh, offset, size) LMR Node Phy Addr 1 0x45 Node 1 Permission check QoS 4 0x27 Node 4 0x45 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

84 LITE LMR and RDMA lh LITE_read(lh, offset, size) LMR Node Phy Addr 1 0x45 Node 1 Permission check Userspace application QoS Offset 4 0x27 LITE in Kernel Network Node 4 0x27 Remote nodes 0x45 30

85 LITE LMR and RDMA lh LITE_read(lh, offset, size) LMR Node Phy Addr 1 0x45 Node 1 Permission check QoS 4 0x27 Node 4 0x45 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

86 LITE LMR and RDMA lh LITE_read(lh, offset, size) LMR Node Phy Addr 1 0x45 Node 1 Permission check QoS 4 0x27 Node 4 0x45 Userspace application LITE in Kernel Network 0x27 Remote nodes 30

87 LITE RDMA:Size of MR Scalability Requests /us Write-64B LITE_write-64B Write-1K LITE_write-1K Total Size (MB) 31

88 LITE RDMA:Size of MR Scalability Requests /us Write-64B LITE_write-64B Write-1K LITE_write-1K Total Size (MB) 31

89 LITE RDMA:Size of MR Scalability Requests /us Write-64B LITE_write-64B Write-1K LITE_write-1K 0 LITE 1 scales 4 much 16 better 64 than 256 native 1024 RDMA wrt MR Total size Size (MB) and numbers 31

90 LITE-RDMA Latency 60 Latency (us) user space kernel space K 32K Request Size (B) 32

91 LITE-RDMA Latency 60 Latency (us) user space kernel space K 32K Request Size (B) 32

92 LITE-RDMA Latency 60 Latency (us) user space kernel space K 32K Request Size (B) 32

93 LITE-RDMA Latency 60 Latency (us) user space kernel space K 32K Request Size (B) 32

94 LITE-RDMA Latency 60 Latency (us) user space kernel space LITE 15only adds a very slight overhead even when native RDMA doesn t have 0 scalability issues K 32K Request Size (B) 32

95 LITE RPC RPC communication using two RDMA-write-imm One global busy poll thread Separate LMRs at server for different RPC clients Hide syscall cost behind performance critical path Benefits Low latency Low memory utilization Low CPU utilization 33

96 Outline Introduction and motivation Overall design and abstraction LITE internals LITE applications Conclusion 34

97 LITE Application Effort Application LOC LOC using LITE Student Days LITE-Log LITE-MapReduce 600* 49 4 LITE-Graph LITE-Kernel-DSM LITE-Graph-DSM Simple to use Needs no expert knowledge Flexible, powerful abstraction Easy to achieve optimized performance * LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition 35

98 MapReduce Results LITE-MapReduce adapted from Phoenix [1] Runtime (sec) Hadoop Phoenix LITE 2 0 Phoenix 2-node 4-node 8-node [1]: Ranger etal., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA 07) 36

99 MapReduce Results LITE-MapReduce adapted from Phoenix [1] Runtime (sec) Hadoop Phoenix LITE LITE-MapReduce 2 outperforms Hadoop by 4.3x to 5.3x 0 Phoenix 2-node 4-node 8-node [1]: Ranger etal., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA 07) 36

100 Graph Results LITE-Graph built directly on LITE using PowerGraph design Grappa and PowerGraph 10 Runtime (sec) LITE-Graph Grappa PowerGraph 0 4 nodes x 4threads 7x4 4 nodes x 4 threads 7 nodes x 4 threads 37

101 Graph Results LITE-Graph built directly on LITE using PowerGraph design Grappa and PowerGraph 10 Runtime (sec) LITE-Graph Grappa PowerGraph LITE-Graph 2 outperforms PowerGraph 0 by 3.5x to 5.6x 4 nodes x 4threads 7x4 4 nodes x 4 threads 7 nodes x 4 threads 37

102 Conclusion LITE virtualizes RDMA into flexible abstraction LITE preserves RDMA s performance benefits Indirection not always degrade performance! 38

103 Conclusion LITE virtualizes RDMA into flexible abstraction LITE preserves RDMA s performance benefits Indirection not always degrade performance! Division across user space, kernel, and hardware 38

104 Thank you Questions? Get LITE at: wuklab.io

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation

Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation Yiying Zhang Datacenter 3 Monolithic Computer OS / Hypervisor 4 Can monolithic Application Hardware