LITE Kernel RDMA Support for Datacenter Applications Shin-Yeh Tsai, Yiying Zhang
Timeline of network stacks across userspace, kernel, and hardware: Berkeley Socket (1983); U-Net (1995); TCP offload engine and RDMA in HPC (2000s); Arrakis, mTCP, and IX (2014); RDMA in datacenters (2017); what next?
RDMA (Remote Direct Memory Access): directly read/write remote memory, bypassing the kernel, with zero memory copy. Benefits: low latency, high throughput, low CPU utilization.
Things have worked well in HPC: special hardware, few applications, cheaper developers.
RDMA-Based Datacenter Applications: Pilaf [ATC 13], HERD-RPC [ATC 16], Cell [ATC 16], FaRM [NSDI 14], Wukong [OSDI 16], FaSST [OSDI 16], HERD [SIGCOMM 14], Hotpot [SoCC 17], NAM-DB [VLDB 17], RSI [VLDB 16], DrTM [SOSP 15], APUS [SoCC 17], Octopus [ATC 17], DrTM+R [EuroSys 16], FaRM+Xact [SOSP 15], Mojim [ASPLOS 15]
Things have worked well in HPC: special hardware, few applications, cheaper developers. What about datacenters? Commodity, cheaper hardware; many (changing) applications; resource sharing and isolation.
Userspace native RDMA. User space: the user-level RDMA app and its library handle connection management (connections, queues, keys) and memory management (memory space), and issue send/recv with node, lkey, rkey, and addr. Kernel space: the OS is bypassed (kernel bypassing). Hardware: the RNIC performs the permission check and address mapping, holding cached PTEs and per-region keys (lkey 1 to lkey n, rkey 1 to rkey n).
Abstraction mismatch. Hardware offers RDMA: low-level, difficult to use, difficult to share. Developers want something like a socket: high-level, easy to use, with resource sharing and isolation. Userspace ends up with fat applications and no resource sharing.
Things have worked well in HPC: special hardware, few applications, cheaper developers. What about datacenters? Commodity, cheaper hardware; many (changing) applications; resource sharing and isolation.
On-NIC SRAM: (1) fetches and caches page table entries; (2) stores secret keys for every consecutive memory region. The result is expensive, unscalable hardware. [Figure: native RDMA throughput (requests/us, 0 to 6) vs. total size (1 MB to 1024 MB) for Write-64B and Write-1K.]
Things have been good in HPC: special hardware, few applications, cheaper developers. What about datacenters? Commodity, cheaper hardware; many (changing) applications; resource sharing and isolation.
Fat applications; no resource sharing; expensive, unscalable hardware. Are we removing too much from the kernel?
Outline: Introduction and motivation; Overall design and abstraction; LITE internals; LITE applications; Conclusion
Without the kernel, we lose: high-level abstraction; resource sharing; protection; performance isolation.
LITE - Local Indirection TiEr: high-level abstraction; resource sharing; protection; performance isolation.
"All problems in computer science can be solved by another level of indirection" (Butler Lampson)
LITE overview. User space: the user-level RDMA app no longer does its own connection and memory management; it calls LITE APIs (memory APIs, RPC/messaging APIs, sync APIs), which makes applications simpler. Kernel space: LITE manages connections, queues, keys, and memory space, performs the permission check and address mapping, and issues RDMA verbs with a global lkey and a global rkey. Hardware: the RNIC holds only the global lkey and global rkey, which allows cheaper hardware and scalable performance.
Implementing remote memset: native RDMA vs. LITE
"All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection" (David Wheeler; commonly attributed to Butler Lampson)
Main challenge: how to preserve the performance benefit of RDMA?
Design Principles: 1. Indirection only at the local node for one-sided RDMA. [Figure: data paths through CPU, user space, kernel, and memory for Berkeley Socket, RDMA, and LITE.]
Design Principles (continued): 2. Avoid hardware indirection. Address mapping and the permission check are done by LITE in kernel space rather than on the RNIC: no redundant indirection, scalable performance.
Design Principles (continued): 3. Hide kernel cost. Together the three principles answer Wheeler's caveat ("except for the problem of too many layers of indirection"): great performance and scalability.
Outline: Introduction and motivation; Overall design and abstraction; LITE internals; LITE applications; Conclusion
LITE - Architecture. User space: a management app, user-level apps, and user-level RPC functions use the LITE abstraction. OS: kernel apps and the LITE APIs (mgmt, mem, synch, msging, RPC). LITE 1-side RDMA: LMR handles (lh1, lh2) go through the permission check and address mapping to addresses (addr1, addr2), using the global lkey and global rkey. LITE RPC: RDMA buffer management, RPC client and RPC server, send, recv, poll, connections, and queues. Below LITE: the Verbs abstraction, the RNIC driver, and the RNIC holding the global lkey and global rkey.
Onload costly operations: LITE in the OS manages connections, queues, keys, and memory space, and performs address mapping and protection (permission check) in the kernel instead of on the RNIC.
Avoid hardware indirection. Challenge: how to eliminate hardware indirection without changing hardware? Register with physical addresses, so the RNIC needs no PTEs. Register the whole memory at once, so there is one global key. LITE in the OS keeps connections, queues, keys, memory space, permission check, and address mapping; the RNIC holds only the global lkey and global rkey.
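To make the scaling argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the 4 KiB page size, the one-PTE-per-page count, and the one-key-pair-per-region count are simplifying assumptions made for illustration, not measurements of any RNIC, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def native_nic_entries(region_sizes, page_size=PAGE_SIZE):
    """Rough count of items a native-RDMA NIC may need to track.

    Assumes one PTE per page of every registered region and one
    (lkey, rkey) pair per region.
    """
    ptes = sum(-(-size // page_size) for size in region_sizes)  # ceiling division
    keys = 2 * len(region_sizes)
    return ptes + keys


def lite_nic_entries(region_sizes):
    """With one physical-address registration of all memory, the NIC
    tracks only one global lkey and one global rkey, however many
    regions applications create."""
    return 2


if __name__ == "__main__":
    regions = [1 << 20] * 1024  # 1024 regions of 1 MiB each
    print("native:", native_nic_entries(regions))
    print("LITE:  ", lite_nic_entries(regions))
```

Under these assumptions the native count grows with both the total registered size and the number of regions, while the LITE count stays constant.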
LITE LMR and RDMA. A userspace application holds an LMR (LITE memory region) handle, lh, and calls LITE_read(lh, offset, size). LITE in the kernel performs the permission check and QoS, maps the LMR to remote physical memory (in the example, node 1 at physical address 0x45 and node 4 at physical address 0x27), applies the offset, and issues the RDMA over the network to the remote nodes.
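The lookup path described on this slide can be sketched as a toy model in Python. This is not LITE's kernel code: the class names, the per-process reader set, and the chunk sizes are hypothetical, QoS is omitted, and treating the two table rows as consecutive chunks of one LMR is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    node: int       # remote node id
    phys_addr: int  # physical address on that node
    size: int       # length in bytes (assumed; the slide gives no sizes)


class LiteKernel:
    """Toy stand-in for the kernel-side LMR table."""

    def __init__(self):
        self._lmrs = {}   # lh -> (list of chunks, set of pids allowed to read)
        self._next_lh = 1

    def register(self, chunks, readers):
        lh = self._next_lh
        self._next_lh += 1
        self._lmrs[lh] = (list(chunks), set(readers))
        return lh

    def lite_read(self, pid, lh, offset, size):
        """Return the (node, phys_addr, length) RDMA reads for this request."""
        if lh not in self._lmrs:
            raise KeyError(f"unknown handle {lh}")
        chunks, readers = self._lmrs[lh]
        if pid not in readers:                      # permission check
            raise PermissionError(f"pid {pid} may not read lh {lh}")
        total = sum(c.size for c in chunks)
        if offset < 0 or size <= 0 or offset + size > total:
            raise ValueError("request outside the LMR")
        ops, base = [], 0
        for c in chunks:                            # address mapping
            lo = max(offset, base)
            hi = min(offset + size, base + c.size)
            if lo < hi:
                ops.append((c.node, c.phys_addr + (lo - base), hi - lo))
            base += c.size
        return ops


if __name__ == "__main__":
    kernel = LiteKernel()
    lh = kernel.register([Chunk(1, 0x45, 64), Chunk(4, 0x27, 64)], readers={7})
    print(kernel.lite_read(7, lh, 60, 8))  # -> [(1, 129, 4), (4, 39, 4)]
```

The application only ever sees the handle, an offset, and a size; node ids and physical addresses stay inside the kernel-side table, which is what lets the RNIC work with a single global key.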
LITE RDMA: MR size scalability. [Figure: throughput (requests/us, 0 to 6) vs. total size (1 MB to 1024 MB) for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K.] LITE scales much better than native RDMA with respect to MR size and number.
LITE-RDMA latency. [Figure: latency (us, 0 to 60) vs. request size (8 B to 32 KB); surviving series labels: user space, kernel space.] LITE adds only a very slight overhead even when native RDMA doesn't have scalability issues.
LITE RPC: RPC communication uses two RDMA write-imm operations; one global busy-poll thread; separate LMRs at the server for different RPC clients; syscall cost hidden behind the performance-critical path. Benefits: low latency, low memory utilization, low CPU utilization.
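The two-write-imm exchange can be illustrated with a small in-process simulation in Python. Everything here is a stand-in: `Node`, `write_imm`, and `poll_once` are hypothetical names, and using the immediate value to carry an RPC function id is an assumption made for the sketch rather than something stated on the slide.

```python
from collections import deque


class Node:
    """Toy endpoint: one receive buffer per peer and one shared completion queue."""

    def __init__(self, name):
        self.name = name
        self.buffers = {}  # peer name -> bytes last written by that peer
        self.cq = deque()  # completions: (peer name, immediate value)

    def write_imm(self, target, payload, imm):
        """Stand-in for RDMA write-with-immediate: the payload lands in the
        target's buffer for this sender and the target sees a completion."""
        target.buffers[self.name] = payload
        target.cq.append((self.name, imm))


def poll_once(server, handlers, peers):
    """One iteration of the single shared polling loop on the server."""
    if not server.cq:
        return False
    client_name, func_id = server.cq.popleft()
    request = server.buffers[client_name]      # per-client buffer
    reply = handlers[func_id](request)
    server.write_imm(peers[client_name], reply, func_id)  # second write-imm
    return True


if __name__ == "__main__":
    server, client = Node("server"), Node("c1")
    client.write_imm(server, b"ping", 0)       # first write-imm: the request
    poll_once(server, {0: lambda b: b.upper()}, {"c1": client})
    print(client.buffers["server"])
```

Keeping a separate buffer per client means one client's request cannot overwrite another's, and a single polling loop serves all clients instead of one thread per connection.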
Outline: Introduction and motivation; Overall design and abstraction; LITE internals; LITE applications; Conclusion
LITE Application Effort

Application        LOC     LOC using LITE   Student days
LITE-Log           330     36               1
LITE-MapReduce     600*    49               4
LITE-Graph         1400    20               7
LITE-Kernel-DSM    3000    45               26
LITE-Graph-DSM     1300    0                5

* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition.

Simple to use; needs no expert knowledge; flexible, powerful abstraction; easy to achieve optimized performance.
MapReduce Results: LITE-MapReduce is adapted from Phoenix [1]. [Figure: runtime (sec) of Hadoop, Phoenix, and LITE; x-axis groups: Phoenix, 2-node, 4-node, 8-node.] LITE-MapReduce outperforms Hadoop by 4.3x to 5.3x. [1]: Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems (HPCA 07).
Graph Results: LITE-Graph is built directly on LITE using the PowerGraph design and compared with Grappa and PowerGraph. [Figure: runtime (sec, 0 to 10) of LITE-Graph, Grappa, and PowerGraph at 4 nodes x 4 threads and 7 nodes x 4 threads.] LITE-Graph outperforms PowerGraph by 3.5x to 5.6x.
Conclusion: LITE virtualizes RDMA into a flexible abstraction. LITE preserves RDMA's performance benefits. Indirection does not always degrade performance! Division across user space, kernel, and hardware.
Thank you. Questions? Get LITE at: https://github.com/wuklab/lite and wuklab.io