Light: A Scalable, High-performance and Fully-compatible User-level TCP Stack. Dan Li ( 李丹 ) Tsinghua University


Data Center Network Performance

Hardware Capability of Modern Servers
- Multi-core CPU
- PCIe 3.0, 4.0, 5.0
- 100G~400Gbps NIC
The Linux kernel stack becomes the performance bottleneck!

Limitations of the Linux Kernel
- Interrupt-based I/O in high-speed traffic
- Coupling sockets with VFS
- Lack of connection locality
- Shared accept queue
[Figure: CPU usage breakdown of a web server (Lighttpd serving a 64-byte file): 83% of CPU usage is spent inside the kernel, with TCP/IP taking 34% and packet I/O 4%; the rest of the kernel time is spent outside TCP/IP.]

Prior Works
Improvements to the Linux kernel: latest Linux 4.14, Fastsocket, MegaPipe, Affinity-Accept, IsoStack, StackMap
- Problem: the problems of the kernel stack remain, except for the per-core accept queue
User-level I/O: DPDK, PF_RING, netmap, PSIO
User-level TCP stacks: mTCP, IX, mOS, Seastar, F-Stack
- Problem: need to modify the application source code

Light Design Goals
A user-level TCP stack with:
- High performance: high throughput, low (tail) latency
- Full compatibility: no need to touch the application code at all

Challenges Caused by Full Compatibility
- Performance interference between the application and the stack (polling-mode I/O)
- Taking over network-related APIs
- Distinguishing FD spaces: read(), write()
- User-level blocking APIs: send(), recv(), epoll()
- Fault detection and resource recycling

Architecture Overview (1)
Three components of Light:
- FM (Frontend Module): provides the POSIX API for applications.
- BM (Backend Module): polls the Command Queue and processes the commands sequentially.
- PPM (Protocol Process Module): undertakes the major processing logic of the TCP/IP/Ethernet protocols.
[Figure: application processes (program logic + POSIX API + Frontend Module) on cores 2 and 3 communicate with Light processes (Backend Module + Protocol Process Module) on cores 0 and 1 through shared hugepage memory holding Light socket and Light epoll objects, Command Queues, and Accept/Close/TX/RX Ready Queues; the Light processes run on DPDK in user space, and the NIC steers packets to them with RSS.]
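A minimal sketch of how a POSIX call might travel from the Frontend Module to the Backend Module through the queues named above; all identifiers (light_cmd, cmd_queue_enqueue, tx_ready_queue_wait) are hypothetical placeholders, not Light's actual internal API.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical command descriptor posted into the shared-memory Command Queue. */
struct light_cmd {
    int         op;     /* e.g. SOCKET, BIND, SEND, CLOSE */
    int         fd;
    const void *data;
    size_t      len;
};

extern int  cmd_queue_enqueue(const struct light_cmd *c); /* hypothetical: lockless ring in hugepage memory */
extern long tx_ready_queue_wait(int fd);                  /* hypothetical: wait for the TX completion */

/* Frontend Module: invoked from the application's hijacked send(). */
ssize_t fm_send(int fd, const void *buf, size_t len)
{
    struct light_cmd c = { 1 /* SEND */, fd, buf, len };
    if (cmd_queue_enqueue(&c) < 0)      /* hand the request to the Backend Module */
        return -1;
    return tx_ready_queue_wait(fd);     /* BM + PPM perform the actual TCP processing */
}
```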

Architecture Overview (2)
Light-App Separation:
- Run the Light stack and the applications on separate cores.
- One-to-many and many-to-one matching between stack cores and application cores.
- Eliminates the performance interference between the application and the stack.
[Figure: applications run on app cores 0-2 while the Light stack runs on stack cores 0-1; the NIC distributes packets to the stack cores via RSS.]
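The separation boils down to pinning application threads and stack threads to disjoint cores. A minimal sketch using standard Linux affinity calls; the core numbers are illustrative.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

/* Pin a thread to one core so a polling stack thread never competes
 * with application logic for CPU time. */
static int pin_to_core(pthread_t t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* e.g. pin_to_core(stack_thread, 0); pin_to_core(app_thread, 2); */
```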

Design for Full Compatibility (1)
Taking over Network-related APIs: LD_PRELOAD + dlsym.
[Figure: the application's network-related API calls are hijacked by LD_PRELOAD into the Light FM library, which reaches the GNU C library through dlsym when needed; other APIs go through the dynamic linker to the GNU C library as usual.]
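A minimal sketch of this interposition pattern, built as a shared library and loaded with LD_PRELOAD; light_socket(), light_close() and light_is_light_fd() are hypothetical placeholders for Light FM entry points, not its real API.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>

extern int light_socket(int domain, int type, int protocol);  /* hypothetical */
extern int light_is_light_fd(int fd);                         /* hypothetical */
extern int light_close(int fd);                                /* hypothetical */

int socket(int domain, int type, int protocol)
{
    /* TCP/IPv4 sockets are taken over by the user-level stack... */
    if (domain == AF_INET &&
        (type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) == SOCK_STREAM)
        return light_socket(domain, type, protocol);

    /* ...everything else falls through to the original glibc symbol. */
    static int (*real_socket)(int, int, int);
    if (!real_socket)
        real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");
    return real_socket(domain, type, protocol);
}

int close(int fd)
{
    if (light_is_light_fd(fd))
        return light_close(fd);

    static int (*real_close)(int);
    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    return real_close(fd);
}
```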

Design for Full Compatibility (2)
Distinguishing FD Spaces, e.g. ssize_t read(int fd, void *buf, size_t count):
- Other FDs, maintained by the kernel, are allocated bottom-up from 0 and served by glibc.
- Network-related FDs, maintained by Light, are allocated top-down from the top of the FD space and served by the Light implementation.
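With the two ranges disjoint, a single comparison routes each call. A sketch under the assumption of a fixed split point; LIGHT_FD_BASE and light_read() are illustrative names, not Light's actual constants or functions.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

#define LIGHT_FD_BASE (1 << 20)   /* assumed split point between the two FD spaces */

extern ssize_t light_read(int fd, void *buf, size_t count);    /* hypothetical */

ssize_t read(int fd, void *buf, size_t count)
{
    if (fd >= LIGHT_FD_BASE)       /* top-down range: a Light socket FD */
        return light_read(fd, buf, count);

    /* bottom-up range: an ordinary kernel FD, forwarded to glibc */
    static ssize_t (*real_read)(int, void *, size_t);
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    return real_read(fd, buf, count);
}
```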

Design for Full Compatibility (3)
User-Level Blocking APIs:
- epoll_wait(): can monitor both network-related FDs and non-network FDs with blocking semantics.
- Other blocking APIs: leverage epoll_wait() to realize the blocking semantics.
[Figure: epoll_create() sets up both a Light epoll, whose listened FDs are the socket FDs, and a kernel epoll, whose listened FDs are the non-network FDs plus a FIFO FD; epoll_ctl() registers socket FDs with the Light epoll and non-network FDs with the kernel epoll; epoll_wait() collects network-related events from the Light epoll and non-network events from the kernel epoll, with the FIFO used by Light to wake the kernel epoll_wait when a network-related event becomes ready.]
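For the other blocking calls, a non-blocking attempt followed by a wait on the user-level epoll is enough. A minimal sketch; the light_*() calls are hypothetical placeholders for Light internals.

```c
#include <errno.h>
#include <sys/types.h>

extern ssize_t light_recv_nonblock(int fd, void *buf, size_t len, int flags); /* hypothetical */
extern int     light_epoll_wait_readable(int fd, int timeout_ms);             /* hypothetical */

/* Blocking recv() layered on the user-level epoll. */
ssize_t blocking_recv(int fd, void *buf, size_t len, int flags)
{
    for (;;) {
        ssize_t n = light_recv_nonblock(fd, buf, len, flags);
        if (n >= 0 || errno != EAGAIN)
            return n;                              /* data, EOF, or a hard error */
        /* Nothing to read yet: sleep until the stack reports the FD readable. */
        if (light_epoll_wait_readable(fd, -1) < 0)
            return -1;
    }
}
```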

Design for Full Compatibility (4)
Fault Detection and Resource Recycle:
- Fault detection: each application process is connected to Light through an IPC socket, and an epoll monitor in Light watches these sockets.
- Resource recycle: when an application exits abnormally, the kernel closes its IPC socket; the epoll monitor receives the event and Light recycles the resources held by that application.
[Figure: the epoll monitor watches IPC socket 1 (App 1) and IPC socket 2 (App 2); when App 2 dies, the kernel raises an event on IPC socket 2, triggering the recycle.]
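The kernel side of this is standard: a peer process exiting makes its Unix-domain socket report a hang-up. A minimal sketch of such a monitor loop; light_recycle_resources() is a hypothetical placeholder, and the IPC sockets are assumed to be registered with EPOLLRDHUP.

```c
#include <sys/epoll.h>

extern void light_recycle_resources(int ipc_fd);   /* hypothetical */

/* Watch one IPC socket per application; a dead peer shows up as HUP/RDHUP. */
void fault_monitor_loop(int epfd)
{
    struct epoll_event evs[64];
    for (;;) {
        int n = epoll_wait(epfd, evs, 64, -1);
        for (int i = 0; i < n; i++) {
            if (evs[i].events & (EPOLLHUP | EPOLLRDHUP | EPOLLERR)) {
                /* The peer application is gone: free its sockets, queues, buffers. */
                light_recycle_resources(evs[i].data.fd);
            }
        }
    }
}
```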

Design for High Performance (1)
(1) Benefits from DPDK:
- General techniques: PMD, zero-copy, hugepages, etc.
- Lockless shared-queue based IPC
(2) TCB Management:
- Local listen table and established table
- Dedicated accept queues
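A sketch of what a lockless shared queue between an application core and a stack core can look like with DPDK's rte_ring (single-producer/single-consumer flags make enqueue/dequeue lock-free); the queue name and command struct are illustrative, not Light's actual layout, and rte_eal_init() is assumed to have run.

```c
#include <rte_ring.h>
#include <rte_lcore.h>

struct light_cmd { int op; int fd; void *payload; };   /* hypothetical descriptor */

static struct rte_ring *cmd_q;

void create_cmd_queue(void)
{
    /* One app core produces, one stack core consumes: SP/SC ring,
     * allocated in hugepage-backed memory by the EAL. */
    cmd_q = rte_ring_create("light_cmd_q", 1024, rte_socket_id(),
                            RING_F_SP_ENQ | RING_F_SC_DEQ);
}

void backend_poll_loop(void)
{
    void *obj;
    for (;;) {
        if (rte_ring_dequeue(cmd_q, &obj) == 0) {
            struct light_cmd *cmd = obj;
            /* Process the command sequentially, then post the result
             * to the corresponding ready queue (not shown). */
            (void)cmd;
        }
    }
}
```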

Design for High Performance (2)
(3) Full Connection Locality:
- Core locality for passive connections.
- Core locality for active connections: use soft-RSS to compute and record the stack core index in the socket object, so that the reply packets are steered to the same core as the original packets.
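Soft-RSS means computing in software the same Toeplitz hash the NIC uses, so the stack can predict which core RSS will deliver a flow's reply packets to. A minimal self-contained sketch; the key must be the 40-byte RSS key programmed into the NIC, and the plain modulo stands in for the NIC's RSS indirection table (DPDK also provides rte_softrss() for this purpose).

```c
#include <stdint.h>

/* Software Toeplitz hash over `data` using `key`, as specified for RSS. */
static uint32_t toeplitz_hash(const uint8_t *key, int key_len,
                              const uint8_t *data, int data_len)
{
    uint32_t hash = 0;
    /* 32-bit window over the key, advanced one bit per input bit. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    int next_key_byte = 4;

    for (int i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            window <<= 1;
            if (next_key_byte < key_len && (key[next_key_byte] & (1u << bit)))
                window |= 1;
        }
        next_key_byte++;
    }
    return hash;
}

/* tuple = src IP, dst IP, src port, dst port in network byte order, laid
 * out as the NIC sees the incoming (reply) packets of the connection. */
static uint32_t core_of_reply(const uint8_t tuple[12],
                              const uint8_t *rss_key, int key_len,
                              uint32_t n_stack_cores)
{
    return toeplitz_hash(rss_key, key_len, tuple, 12) % n_stack_cores;
}
```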

Implementation
System configuration: Ubuntu 18.04 (kernel version 4.15.0-13-generic), DPDK 17.02
Code: 18,263 lines of C code (excluding the DPDK library and the protocol stack ported from the kernel)
APIs: most TCP-related APIs have been implemented.

Evaluation (1): Network Throughput and Multi-core Scalability
We use two powerful machines: one runs wrk to generate a high workload of HTTP requests; the other runs Nginx on either the kernel stack or the Light stack.
[Figure: wrk sends requests to, and receives responses from, the Nginx server.]

Evaluation (2): Network Throughput and Multi-core Scalability
Nginx on Light achieves 56% higher throughput on 8 CPU cores and a linear speedup ratio of 0.89 in terms of network throughput.
[Figure: RPS of Nginx running on Light and on the Linux kernel stack against the number of CPU cores used; the message size is set to 64 bytes.]

Evaluation (3): Network Throughput and Multi-core Scalability
Nginx on Light consistently achieves more than 50% higher RPS than the kernel stack across message sizes.
[Figure: RPS of Nginx running on Light and on the Linux kernel stack against the message size, using 8 CPU cores.]

Evaluation (4): Network Latency (1)
Two machines: one runs wrk to generate a high workload of HTTP requests; the other runs Nginx on either the kernel stack or the Light stack.
[Figure: wrk sends requests to, and receives responses from, the Nginx server.]

Evaluation (5): Network Latency (1)
Light reduces the tail latency by two orders of magnitude compared to the kernel stack.
[Figure: CDF of round-trip latency for Nginx on Light and on the kernel stack.]

Evaluation (6): Network Latency (2)
We use two machines to run a NetPIPE server and a NetPIPE client respectively, both either on the Light stack or on the kernel stack.
[Figure: requests and responses between the NetPIPE client and the NetPIPE server.]

Evaluation (7): Network Latency (2)
Compared with the Linux kernel stack, Light reduces the average latency by more than 40%, with a maximum reduction of 52%.
[Figure: one-way latency for NetPIPE on Light and on the kernel stack.]

Light in DMM (1)
- Light should develop an adapter library (Light-adapter) for DMM to integrate, used for communication with Light. The Light-adapter must implement the interfaces defined by DMM, including the socket APIs, epoll APIs, fork APIs and the resource-recycle APIs.
- Light should integrate the stack adapter library (nstack adapter) developed by DMM. The library utilizes the plug-in interface to provide rich features, such as resource (shared memory) and event management.
[Figure: DMM architecture with the nsocket API and nrd at the top, the kernel adapter, Light-adapter and nstack adapter in the middle, and the Light stack, HAL and NIC below.]
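Such adapters are typically wired up as a table of callbacks that the framework dispatches through. The struct below is only a hypothetical illustration of that pattern, grouping the hook categories the slide lists (socket, epoll, fork, resource recycle); it does not reflect DMM's actual interface definitions.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical adapter ops table; field names are illustrative only. */
struct stack_adapter_ops {
    /* socket APIs */
    int     (*socket_open)(int domain, int type, int protocol);
    ssize_t (*send)(int fd, const void *buf, size_t len, int flags);
    ssize_t (*recv)(int fd, void *buf, size_t len, int flags);
    int     (*close)(int fd);
    /* epoll APIs */
    int     (*ep_ctl)(int epfd, int op, int fd, void *event);
    int     (*ep_wait)(int epfd, void *events, int maxevents, int timeout);
    /* fork and resource-recycle APIs */
    int     (*on_fork)(pid_t child);
    void    (*recycle)(pid_t dead_app);
};

/* The Light-adapter would fill this table with Light FM entry points and
 * register it with DMM through the framework's plug-in interface. */
```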

Light in DMM (2)
Key Techniques:
- Distributed and centralized nrd deployment (LRD & CRD) provides end-to-end protocol orchestration.
- Stack-transparent protocol routing (stack orchestrator).
- POSIX-compatible socket APIs; flexible socket API redirection and mapping.
- Flexible APIs (SBR, Socket Bridge) for integration of third-party stacks.
- Support for multiple stack instances and multiple I/O engines.
[Figure: the DMM data plane in user space offers a socket layer (POSIX socket-compatible API via LD_PRELOAD) to applications such as web, video streaming and online gaming; a socket MUX / Socket Bridge (SBR) and a protocol orchestrator, controlled by nrd through a Honeycomb REST interface, route traffic to pluggable L2~L4 stacks (VPP host stack with IPv4/IPv6 input/output, TLDK, Light, third-party stacks) running over the EAL/DPDK and the NIC, alongside the kernel stack in kernel space.]

Future Work
- Network operating system out of the kernel
- Redesign of the PPM module: new transport protocols, new congestion control mechanisms
- Virtualization / container environments
- Integrating Light into the DMM framework

Thanks!