Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack Dan Li ( 李丹 ) Tsinghua University
Data Center Network Performance
Hardware Capability of Modern Servers
- Multi-core CPUs
- PCIe 3.0 / 4.0 / 5.0
- 100G~400Gbps NICs
The Linux kernel stack becomes the performance bottleneck!
Limitations of the Linux Kernel
- Interrupt-based I/O under high-speed traffic
- Coupling of sockets with VFS
- Lack of connection locality
- Shared accept queue
CPU usage breakdown of a web server (Lighttpd serving a 64-byte file): 83% of CPU usage is spent inside the kernel (TCP/IP 34%, packet I/O 4%, the remaining 45% in other kernel code).
Prior Works
- Improvements to the Linux kernel: latest Linux 4.14, Fastsocket, MegaPipe, Affinity-Accept, IsoStack, StackMap. Problem: the kernel-stack problems remain, except for the per-core accept queue.
- User-level I/O: DPDK, PF_RING, netmap, PSIO.
- User-level TCP stacks: mTCP, IX, mOS, Seastar, F-Stack. Problem: the application source code must be modified.
Light Design Goals: a user-level TCP stack with
- High performance: high throughput, low (tail) latency.
- Full compatibility: no need to touch the application code at all.
Challenges Caused by Full Compatibility
- Performance interference between application and stack (polling-mode I/O).
- Taking over network-related APIs.
- Distinguishing FD spaces, e.g. in read() and write().
- User-level blocking APIs: send(), recv(), epoll_wait(), etc.
- Fault detection and resource recycling.
Architecture Overview (1): Three Components of Light
- FM (Frontend Module): provides the POSIX API for applications; linked into each app process.
- BM (Backend Module): polls the Command Queue and processes the commands sequentially.
- PPM (Protocol Process Module): undertakes the major processing logic of the TCP/IP/Ethernet protocols.
Each Light process (BM + PPM) runs on its own core on top of DPDK, and the NIC spreads packets across Light processes via RSS. App processes and Light processes communicate through shared hugepage memory, which holds the Light sockets, Light epoll objects, and per-connection queues (Accept Ready, Close Ready, TX Ready, RX Ready and Command Queues).
Architecture Overview (2): Light-App Separation
- Run the Light stack and the apps on separate cores, with a one-to-many or many-to-one match between stack cores and app cores.
- This eliminates the performance interference between application and stack.
Design for Full Compatibility (1): Taking over Network-related APIs
Network-related APIs called by the application are hijacked via LD_PRELOAD and served by the Light FM library; all other APIs are resolved by the dynamic linker to the GNU C library, which the FM library reaches through dlsym().
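The takeover can be sketched as a shim compiled into a shared library and injected with LD_PRELOAD. `light_socket()` is a hypothetical stand-in for the Light FM entry point, stubbed here so the sketch is self-contained:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the Light FM entry point; the real one
 * would allocate a socket object in shared hugepage memory. */
static int light_socket(int domain, int type, int protocol)
{
    (void)domain; (void)type; (void)protocol;
    return 1 << 22;                     /* pretend FD from Light's FD space */
}

/* Resolved lazily to the next implementation in link order (glibc). */
static int (*real_socket)(int, int, int);

int socket(int domain, int type, int protocol)
{
    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

    /* Only TCP/IPv4 sockets are taken over; everything else
     * falls through to the kernel implementation. */
    if (domain == AF_INET && type == SOCK_STREAM)
        return light_socket(domain, type, protocol);

    return real_socket(domain, type, protocol);
}
```

In a real deployment the shim would be built as a shared object (e.g. `gcc -shared -fPIC -o libfm.so shim.c`) and started as `LD_PRELOAD=./libfm.so nginx`, so the unmodified binary transparently picks up the interposed symbols.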
Design for Full Compatibility (2): Distinguishing FD Spaces
For an API such as ssize_t read(int fd, void *buf, size_t count), Light must decide which implementation owns the FD. Ordinary FDs are maintained by the kernel and allocated bottom-up from 0 (served by the glibc implementation); network-related FDs are maintained by Light and allocated top-down (served by the Light implementation), so the two spaces never collide.
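The two FD spaces can then be told apart by a simple range check. The split point and `light_read()` below are assumptions for illustration, with the Light path stubbed out:

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed split point: anything at or above this value is a Light FD.
 * Kernel FDs grow bottom-up from 0 and stay below it in practice,
 * since it sits far above any sane RLIMIT_NOFILE. */
#define LIGHT_FD_BASE (1 << 22)

/* Hypothetical stub for Light's own read path. */
static ssize_t light_read(int fd, void *buf, size_t count)
{
    (void)fd; (void)buf; (void)count;
    return 0;            /* the real path would copy from the RX Ready Queue */
}

/* The hijacked read(): dispatch purely by FD range. */
ssize_t compat_read(int fd, void *buf, size_t count)
{
    if (fd >= LIGHT_FD_BASE)
        return light_read(fd, buf, count);   /* Light-managed socket */
    return read(fd, buf, count);             /* ordinary kernel FD */
}
```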
Design for Full Compatibility (3): User-Level Blocking APIs
epoll_wait() can monitor both network-related FDs and non-network FDs with blocking semantics:
- On epoll_create(), Light creates both a Light epoll instance (for socket FDs) and a kernel epoll instance (for non-network FDs, plus a FIFO FD owned by Light).
- On epoll_ctl(), socket FDs are registered with the Light epoll and non-network FDs with the kernel epoll.
- On epoll_wait(), network-related events are collected from the Light epoll; while the app blocks in the kernel's epoll_wait(), a new network event makes Light write to the FIFO FD, so kernel event collection wakes up on either a non-network readable event or the FIFO event.
Other blocking APIs (send(), recv(), etc.) leverage epoll_wait() to realize the blocking semantics.
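The wakeup path can be sketched with a pipe standing in for Light's FIFO FD: the app blocks in a single kernel epoll_wait() that watches both its non-network FDs and the FIFO FD, and the stack side writes one byte when a network event becomes ready. Names and layout here are assumptions, not Light's actual code:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>

static int wake_fd[2];    /* pipe standing in for Light's FIFO FD */

/* Stack side: a network event became ready -- wake the blocked app. */
static void *stack_side(void *arg)
{
    (void)arg;
    usleep(10 * 1000);                 /* pretend the event arrives later */
    write(wake_fd[1], "x", 1);
    return NULL;
}

/* App side: one blocking wait covers both FD kinds. Returns 1 if the
 * wakeup came from Light (network event), 0 for a non-network event. */
int wait_any(int epfd)
{
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, -1) != 1)
        return -1;
    if (ev.data.fd == wake_fd[0]) {
        char c;
        read(wake_fd[0], &c, 1);       /* drain the wakeup byte */
        /* ...then collect the real events from the Light epoll... */
        return 1;
    }
    return 0;
}
```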
Design for Full Compatibility (4): Fault Detection and Resource Recycling
Each app process holds an IPC socket to Light, and an epoll-based monitor watches all of these sockets. When an app exits or crashes, the kernel closes its end of the IPC socket; the monitor receives the resulting event and Light recycles the resources held by that app.
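The detection idea can be sketched with a socketpair: the kernel closes a dead app's end automatically, which surfaces as a hangup event on the monitor's epoll instance (the recycle call is hypothetical):

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Monitor side: block until some app's IPC socket hangs up, then
 * report which FD (and therefore which app) needs recycling. */
int wait_for_dead_app(int epfd)
{
    struct epoll_event ev;
    if (epoll_wait(epfd, &ev, 1, -1) != 1)
        return -1;
    if (ev.events & (EPOLLHUP | EPOLLERR | EPOLLRDHUP)) {
        /* recycle_resources(ev.data.fd);  -- hypothetical cleanup hook */
        return ev.data.fd;
    }
    return -1;
}
```

Because epoll always reports EPOLLHUP and EPOLLERR regardless of the subscribed event mask, the monitor needs no cooperation from the app: process death alone is enough to trigger cleanup.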
Design for High Performance (1)
(1) Benefits from DPDK: general techniques (PMD, zero-copy, hugepages, etc.); lockless shared-queue based IPC.
(2) TCB Management: local listen table and established table; dedicated accept queues.
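The lockless shared-queue IPC can be sketched as a single-producer/single-consumer ring in C11 atomics. Light's actual queue layout (and DPDK's rte_ring) differ in detail; this is a generic sketch of the technique:

```c
#include <stdatomic.h>
#include <stddef.h>

#define QSIZE 256    /* must be a power of two */

/* One producer (e.g. an app's FM) and one consumer (the BM) share
 * this structure in hugepage memory; no locks are needed. */
struct spsc_queue {
    _Atomic size_t head;      /* consumer position */
    _Atomic size_t tail;      /* producer position */
    void *slots[QSIZE];
};

int spsc_enqueue(struct spsc_queue *q, void *cmd)
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE)
        return -1;                            /* full */
    q->slots[t & (QSIZE - 1)] = cmd;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 0;
}

int spsc_dequeue(struct spsc_queue *q, void **cmd)
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return -1;                            /* empty */
    *cmd = q->slots[h & (QSIZE - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return 0;
}
```

The release store on `tail` pairs with the acquire load in the consumer, so a dequeued command is always fully written before it becomes visible; this is the property that lets the BM poll the Command Queue without any locking.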
Design for High Performance (2)
(3) Full Connection Locality
- Core locality for passive connections.
- Core locality for active connections: use soft-rss to compute and record the stack core index in the socket object, so that reply packets are steered to the same core as the original packets.
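soft-rss presumably recomputes the NIC's RSS hash in software. A sketch of the standard Toeplitz hash follows, using the well-known default RSS key as a placeholder; a real soft-rss must use the exact key programmed into the NIC:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder: the widely used default RSS key (value assumed here;
 * in practice read the key actually configured on the NIC). */
static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every set bit of the input, XOR in the 32-bit
 * window of the key starting at that bit position. */
uint32_t toeplitz_hash(const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)rss_key[0] << 24) |
                      ((uint32_t)rss_key[1] << 16) |
                      ((uint32_t)rss_key[2] << 8)  |
                       (uint32_t)rss_key[3];
    size_t next_bit = 32;                     /* next key bit to shift in */

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            window <<= 1;
            if (next_bit < 8 * sizeof(rss_key) &&
                (rss_key[next_bit / 8] & (0x80u >> (next_bit % 8))))
                window |= 1;
            next_bit++;
        }
    }
    return hash;
}
```

For a TCP/IPv4 flow the 12-byte input is source IP, destination IP, source port, destination port, each in network byte order; the stack core index to record in the socket object would then be something like `hash % num_stack_cores`, matching the queue the NIC's RSS will pick for reply packets.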
Implementation
- System configuration: Ubuntu 18.04 (kernel 4.15.0-13-generic), DPDK 17.02.
- Code: 18,263 lines of C code (excluding the DPDK library and the protocol stack ported from the kernel).
- APIs: most TCP-related APIs have been realized.
Evaluation (1): Network Throughput and Multi-core Scalability
We use two powerful machines: one runs wrk to generate a high workload of HTTP requests; the other runs Nginx on the kernel stack or on the Light stack (wrk client issuing requests to the Nginx server and receiving responses).
Evaluation (2): Network Throughput and Multi-core Scalability
Nginx on Light gets 56% higher throughput on 8 CPU cores and achieves a linear speedup ratio of 0.89 in network throughput. (Figure: RPS of Nginx on Light and on the Linux kernel stack vs. the number of CPU cores used; message size 64 bytes.)
Evaluation (3): Network Throughput and Multi-core Scalability
Nginx on Light consistently achieves more than 50% higher RPS than on the kernel stack. (Figure: RPS of Nginx on Light and on the Linux kernel stack vs. message size; 8 CPU cores used.)
Evaluation (4): Network Latency (1)
Two machines: one runs wrk to generate a high workload of HTTP requests; the other runs Nginx on the kernel stack or on the Light stack.
Evaluation (5): Network Latency (1)
Light can reduce the tail latency by two orders of magnitude compared to the kernel stack. (Figure: CDF of round-trip latency for Nginx on Light and on the kernel stack.)
Evaluation (6): Network Latency (2)
We use two machines to run the NetPIPE server and the NetPIPE client respectively, both on the Light stack or both on the kernel stack.
Evaluation (7): Network Latency (2)
Compared with the Linux kernel stack, Light reduces the average latency by over 40%, with a maximum reduction of 52%. (Figure: one-way latency for NetPIPE on Light and on the kernel stack.)
Light in DMM (1)
To integrate with DMM, Light develops an adapter library (Light-adapter) for communication between DMM and Light. The Light-adapter must implement the interfaces defined by DMM, including the socket APIs, epoll APIs, fork APIs and the resource-recycle APIs. Light also integrates the adapter library developed by DMM (nstack adapter), which uses DMM's plug-in interface to provide rich features such as resource (shared memory) and event management.
Light in DMM (2): Key Techniques
- Distributed and centralized nrd deployment (LRD & CRD) provides end-to-end protocol orchestration.
- Stack-transparent protocol routing (stack orchestrator) with flexible socket API redirection and mapping.
- POSIX-compatible socket APIs (via LD_PRELOAD).
- Flexible APIs (EAL) for integrating third-party stacks, with support for multiple stack instances and multiple I/O engines.
(Figure: applications such as web apps, video streaming and online gaming call POSIX socket APIs through the DMM socket layer (Socket MUX, Socket Bridge (SBR), protocol orchestrator, nrd, Honeycomb REST), which routes connections to user-space stacks (VPP host stack, TLDK, Light data plane and third-party stacks over DPDK) or to the kernel stack.)
Future Work
- Network operating system outside the kernel.
- Redesign of the PPM module: new transport protocols, new congestion control mechanisms.
- Virtualization / container environments.
- Integrating Light into the DMM framework.
Thanks!