Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Speeding up Linux TCP/IP with a Fast Packet I/O Framework
Michio Honda, Advanced Technology Group, NetApp (michio@netapp.com)
With acknowledgments to Kenichi Yasukata, Douglas Santry and Lars Eggert

Motivation

Linux TCP/IP:
- State-of-the-art features that cope with a wide range of network conditions and traffic patterns: FACK, FRTO, RACK, DSACK, Fast Open, DCTCP
- Various security enhancements (e.g., RFC 5961)
- Out-of-tree extensions: MPTCP, TcpCrypt

User-space TCP/IP (e.g., Seastar):
- Fast because a NIC is dedicated to the app (netmap, DPDK)
- App-driven NIC I/O and network-stack execution
- Direct packet buffer access

Goal: integrate the best aspects of both worlds.

Problems

Request-response traffic with small messages/packets at high rates and many concurrent TCP connections suffers from queueing delays: after one epoll_wait(), the app issues a read() and a write() per ready connection, and each response queues behind the TCP/IP processing (tcp_sendmsg()) of the earlier ones before reaching the NIC (sketched below).

[Figure: number of descriptors returned by epoll_wait(), and mean/99th-percentile latency; rx-usecs 1 (default), 1024 B response message]
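The queueing effect comes from this per-connection syscall pattern. Below is a minimal sketch (not taken from the slides; the epoll instance epfd, buffer size, and response are illustrative assumptions) of the conventional epoll_wait()/read()/write() server loop the diagram refers to: each descriptor returned by epoll_wait() costs its own read() and write(), so the last connections in the batch wait behind the processing of all earlier ones.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 1024

void serve_loop(int epfd, const char *resp, size_t resp_len)
{
    struct epoll_event ev[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, ev, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            char buf[2048];
            int fd = ev[i].data.fd;

            /* one syscall per connection for the request ... */
            if (read(fd, buf, sizeof(buf)) <= 0) {
                close(fd);
                continue;
            }
            /* ... and another per connection for the response.
             * The last descriptors returned by epoll_wait() wait for all
             * earlier ones to pass through tcp_sendmsg() and the NIC queue. */
            write(fd, resp, resp_len);
        }
    }
}
```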

Design Principles

- Dedicate a NIC to a privileged app (similar to fast user-space TCP/IP stacks)
- Use the TCP/IP stack in the kernel
- Regular apps must be able to run on other NICs
- When the privileged app crashes, the system and the other apps must survive

StackMap Overview

- The app registers a NIC
- Socket API for control: socket(), bind(), listen(), etc.
- netmap API for the datapath (replaces read()/write()), as sketched below

[Figure: architecture — a regular app uses the socket API over TCP/IP/Ethernet and the Linux packet I/O path, while a StackMap app uses the socket API for control and the netmap API/framework with shared packet buffers on top of the NIC drivers]
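A rough sketch of that split, assuming a hypothetical server on port 80 and a NIC named eth0: the control path is plain POSIX socket(), bind(), listen(), while the datapath is opened through the netmap framework (shown here with the stock nm_open() helper; StackMap's actual registration call and port string may differ). Error checks are omitted.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

int setup(void)
{
    /* Control path: unchanged socket API */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(80);
    bind(lfd, (struct sockaddr *)&sin, sizeof(sin));
    listen(lfd, SOMAXCONN);

    /* Datapath: register the dedicated NIC through the netmap framework
     * (ring-based I/O replaces per-socket read()/write()). */
    struct nm_desc *nmd = nm_open("netmap:eth0", NULL, 0, NULL);
    return nmd ? lfd : -1;
}
```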

StackMap Datapath

- Packet buffers are mapped to the NIC rings, the app, and pre-allocated skbuffs
- The app triggers NIC I/O via a netmap API syscall
- That syscall runs TCP/IP processing on the data/packets before (TX) or after (RX) the NIC I/O

[Figure: same architecture diagram as the previous slide]
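The sketch below shows the general ring-based pattern, written against the plain netmap API (poll() plus the NIOCTXSYNC ioctl) rather than StackMap's extended API; in StackMap the same syscall additionally runs TCP/IP processing over the whole batch before TX and after RX, which is not shown here. Single ring pair, simplified buffer handling.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>

void datapath_loop(struct nm_desc *nmd)
{
    struct pollfd pfd = { .fd = nmd->fd, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);              /* one syscall drains a whole RX batch */

        struct netmap_ring *rx = NETMAP_RXRING(nmd->nifp, 0);
        struct netmap_ring *tx = NETMAP_TXRING(nmd->nifp, 0);

        while (!nm_ring_empty(rx) && !nm_ring_empty(tx)) {
            struct netmap_slot *rs = &rx->slot[rx->cur];
            struct netmap_slot *ts = &tx->slot[tx->cur];

            /* Zero-copy "echo": hand the received buffer to TX by swapping
             * buffer indices instead of copying the payload. */
            uint32_t tmp = ts->buf_idx;
            ts->buf_idx = rs->buf_idx;
            ts->len = rs->len;
            rs->buf_idx = tmp;
            ts->flags |= NS_BUF_CHANGED;
            rs->flags |= NS_BUF_CHANGED;

            rx->head = rx->cur = nm_ring_next(rx, rx->cur);
            tx->head = tx->cur = nm_ring_next(tx, tx->cur);
        }
        ioctl(nmd->fd, NIOCTXSYNC, NULL);   /* push the whole TX batch at once */
    }
}
```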

Experimental Results

Implementation:
- Linux 4.2 with 188 LoC changes
- netmap with 68 LoC changes
- A new kernel module with 22 LoC

Setup:
- Two machines with Xeon E5-2680 v2 (2.8 GHz) and an Intel 82599 10 GbE NIC
- Server: Linux (rx-usecs 1) or StackMap
- Client: Linux with the wrk HTTP benchmark tool
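For reference, the server's rx-usecs value is the NIC's interrupt moderation interval, normally set with `ethtool -C <if> rx-usecs <n>`. The sketch below (interface name and error handling are simplified assumptions) shows the equivalent ETHTOOL_SCOALESCE ioctl that command issues.

```c
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int set_rx_usecs(const char *ifname, unsigned int usecs)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    /* Read the current coalescing parameters, then change only rx-usecs. */
    ret = ioctl(fd, SIOCETHTOOL, &ifr);
    if (ret == 0) {
        ec.cmd = ETHTOOL_SCOALESCE;
        ec.rx_coalesce_usecs = usecs;
        ret = ioctl(fd, SIOCETHTOOL, &ifr);
    }
    close(fd);
    return ret;
}
```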

Basic Performance

Serving a 1024 B HTTP OK response with a single CPU core.

[Figure: descriptors returned per epoll_wait(), latency (mean and 99th percentile), and throughput [Gb/s] for Linux vs. StackMap]

Memcached Performance

Memcached with 10% set and 90% get (1024 B objects, single CPU core).
[Figure: throughput [Gb/s] of Linux, Seastar and StackMap]

Memcached with 10% set and 90% get (64 B objects, 6 concurrent TCP connections).
[Figure: throughput [Gb/s] of Linux, Seastar and StackMap vs. number of CPU cores (1, 4, 8)]

Conclusion

- Linux TCP/IP protocol processing is fast
- We can bring most of the techniques from user-space TCP stacks into Linux TCP/IP

What makes StackMap fast?
- All the advantages of the netmap framework: syscall batching, its memory allocator, and a static but flexible packet buffer pool whose buffers can be dynamically linked to a NIC ring (without dma_(un)map_single())
- I/O batching (more aggressive than xmit_more)
- No skb (de)allocation, no VFS layer
- Synchronous execution of app and protocol processing
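The "syscall batching" point can be illustrated with a standard Linux analogue rather than StackMap's own API: sendmmsg() submits many messages in one kernel crossing, amortizing the per-syscall cost in the same way the netmap syscall submits a whole batch of ring slots. The socket fd, batch size, and iovec array below are illustrative assumptions.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

int send_batch(int fd, struct iovec *iov /* BATCH entries */)
{
    struct mmsghdr msgs[BATCH];

    for (int i = 0; i < BATCH; i++) {
        msgs[i].msg_hdr = (struct msghdr){ .msg_iov = &iov[i], .msg_iovlen = 1 };
        msgs[i].msg_len = 0;
    }
    /* One syscall covers all BATCH messages instead of BATCH write() calls. */
    return sendmmsg(fd, msgs, BATCH, 0);
}
```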

Base Latency

Single HTTP request (97 B) and response (1024 B) latency:

Configuration                                            Latency    Throughput
Linux: rx-usecs 0 with epoll_wait(timeout=0)             23.5 µs    43.64 MB/s
Linux: rx-usecs 0 with epoll_wait(timeout=-1)            25.54 µs   39.49 MB/s
Linux: rx-usecs 1 with epoll_wait(timeout=0)             56.6 µs    18.11 MB/s
Linux: rx-usecs 1 with epoll_wait(timeout=-1)            56.67 µs   18.11 MB/s
Linux: rx-usecs 1 with net.core.busy_poll=5 (poll())     23.2 µs    43.76 MB/s
StackMap (NIC polling)                                   21.94 µs   45.8 MB/s
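The two epoll_wait() variants in the table differ only in how they wait: timeout=0 returns immediately, so the app spins in user space, while timeout=-1 sleeps until the NIC interrupt fires, which with rx-usecs 1 is subject to interrupt moderation (consistent with the ~56 µs rows). A minimal sketch of both modes, assuming an existing epoll instance epfd:

```c
#include <sys/epoll.h>

int wait_ready(int epfd, struct epoll_event *ev, int maxev, int busy_poll)
{
    if (busy_poll) {
        int n;
        do {
            n = epoll_wait(epfd, ev, maxev, 0);   /* returns immediately: spin */
        } while (n == 0);
        return n;
    }
    return epoll_wait(epfd, ev, maxev, -1);       /* block until an interrupt */
}
```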