Kernel Bypass. Sujay Jayakar (dsj36) 11/17/2016

Similar documents
Arrakis: The Operating System is the Control Plane

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

FlexNIC: Rethinking Network DMA

École Polytechnique Fédérale de Lausanne. Porting a driver for the Intel XL710 40GbE NIC to the IX Dataplane Operating System

High Performance Packet Processing with FlexNIC

Arrakis: The Operating System is the Control Plane

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Arrakis: The Operating System is the Control Plane

Combining Dataplane Operating Systems and Containers for Fast and Isolated Datacenter Applications

Arrakis: The Operating System Is the Control Plane

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Advanced Computer Networks. End Host Optimization

Xen and the Art of Virtualization. CSE-291 (Cloud Computing) Fall 2016

Lecture 7. Xen and the Art of Virtualization. Paul Braham, Boris Dragovic, Keir Fraser et al. 16 November, Advanced Operating Systems

What is an Operating System? A Whirlwind Tour of Operating Systems. How did OS evolve? How did OS evolve?

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

Much Faster Networking

Virtualization, Xen and Denali

OpenOnload. Dave Parry VP of Engineering Steve Pope CTO Dave Riddoch Chief Software Architect

Fast packet processing in the cloud. Dániel Géhberger Ericsson Research

Netchannel 2: Optimizing Network Performance

HIGH-PERFORMANCE NETWORKING :: USER-LEVEL NETWORKING :: REMOTE DIRECT MEMORY ACCESS

Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency

ReFlex: Remote Flash Local Flash

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

No Tradeoff Low Latency + High Efficiency

QuickSpecs. HP Z 10GbE Dual Port Module. Models

FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

Learning with Purpose

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Towards a Software Defined Data Plane for Datacenters

Supporting Fine-Grained Network Functions through Intel DPDK

Virtual Machines Disco and Xen (Lecture 10, cs262a) Ion Stoica & Ali Ghodsi UC Berkeley February 26, 2018

FaSST: Fast, Scalable, and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs

Efficient zero-copy mechanism for intelligent video surveillance networks

238P: Operating Systems. Lecture 5: Address translation. Anton Burtsev January, 2018

CSE 120 Principles of Operating Systems

Design Overview of the FreeBSD Kernel CIS 657

references Virtualization services Topics Virtualization

Achieve Low Latency NFV with Openstack*

Container Adoption for NFV Challenges & Opportunities. Sriram Natarajan, T-Labs Silicon Valley Innovation Center

Distributed Systems Operation System Support

Chapter 13: I/O Systems

Performance Considerations of Network Functions Virtualization using Containers

Task Scheduling of Real- Time Media Processing with Hardware-Assisted Virtualization Heikki Holopainen

Light & NOS. Dan Li Tsinghua University

CS533 Concepts of Operating Systems. Jonathan Walpole

Review: Hardware user/kernel boundary

ReFlex: Remote Flash Local Flash

Advanced Operating Systems (CS 202) Virtualization

Data Path acceleration techniques in a NFV world

Predictable Interrupt Management and Scheduling in the Composite Component-based System

Maximum Performance. How to get it and how to avoid pitfalls. Christoph Lameter, PhD

Operating System Architecture. CS3026 Operating Systems Lecture 03

Unify Virtual and Physical Networking with Cisco Virtual Interface Card

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK

Lecture 15: I/O Devices & Drivers

What s Wrong with the Operating System Interface? Collin Lee and John Ousterhout

Design Overview of the FreeBSD Kernel. Organization of the Kernel. What Code is Machine Independent?

Towards High-Performance Application-Level Storage Management

An Intelligent NIC Design Xin Song

1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects

Background. IBM sold expensive mainframes to large organizations. Monitor sits between one or more OSes and HW

Device-Functionality Progression

Chapter 12: I/O Systems. I/O Hardware

PASTE: A Networking API for Non-Volatile Main Memory

Bridging the Gap between Software and Hardware Techniques for I/O Virtualization

Chapter 13: I/O Systems

libvnf: building VNFs made easy

SoftRDMA: Rekindling High Performance Software RDMA over Commodity Ethernet

Custom UDP-Based Transport Protocol Implementation over DPDK

Virtual Machines. Part 2: starting 19 years ago. Operating Systems In Depth IX 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Introduction. CS3026 Operating Systems Lecture 01

Software Routers: NetMap

Agilio CX 2x40GbE with OVS-TC

143A: Principles of Operating Systems. Lecture 6: Address translation. Anton Burtsev January, 2017

19: I/O. Mark Handley. Direct Memory Access (DMA)

Support for Smart NICs. Ian Pratt

Tolerating Malicious Drivers in Linux. Silas Boyd-Wickizer and Nickolai Zeldovich

DPDK Summit China 2017

IO-Lite: A Unified I/O Buffering and Caching System

Memory Management Strategies for Data Serving with RDMA

CS 152 Computer Architecture and Engineering

FPGA Augmented ASICs: The Time Has Come

Energy Proportionality and Workload Consolidation for Latency-critical Applications

Virtual Machine Monitors (VMMs) are a hot topic in

Performance Evaluation of Myrinet-based Network Router

Motivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4

EXTENDING AN ASYNCHRONOUS MESSAGING LIBRARY USING AN RDMA-ENABLED INTERCONNECT. Konstantinos Alexopoulos ECE NTUA CSLab

Xen Network I/O Performance Analysis and Opportunities for Improvement

Ethernet transport protocols for FPGA

Containers Do Not Need Network Stacks

Gateware Defined Networking (GDN) for Ultra Low Latency Trading and Compliance

HIGH PACKET RATES IN GSTREAMER

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

Iris: An optimized I/O stack for low-latency storage devices

IsoStack Highly Efficient Network Processing on Dedicated Cores

Today: Protection. Sermons in Computer Science. Domain Structure. Protection

PacketShader: A GPU-Accelerated Software Router

CS510 Operating System Foundations. Jonathan Walpole

Transcription:

Kernel Bypass Sujay Jayakar (dsj36) 11/17/2016

Kernel Bypass Background Why networking? Status quo: Linux Papers Arrakis: The Operating System is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, and Thomas Anderson, Timothy Roscoe. OSDI 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, and Christos Kozyrakis, Edouard Bugnion. OSDI 2014.

Why networking? Bigger pipes: Intel XL710 chipset (40 Gbps) http://www.intel.com/content/www/us/en/ethernet-products/converged-network-adapters/ethernet-xl710.html

Why networking? Faster storage: Intel P3700 (20 Gbps, 100 µs reads)

Why networking?

Why networking? Software requirements

Status Quo: Linux

Status Quo: Linux 0.34 µs 0.54 µs 1.26 µs 1.05 µs = 3.36 µs (+2.23µs to 6.19 µs if off-core)

Status Quo: Linux Ethernet frame: 84 bytes to 1538 bytes (1500 MTU) 10 Gb/s = 1.25 GB/s = 800K to 15M packets/sec Service time is only 67 ns to 1.25 µs per packet!

Status Quo: Linux Ethernet frame: 84 bytes to 1538 bytes (1500 MTU) 40 Gb/s = 5 GB/s = 3.25M to 60M packets/sec Service time is only 17 ns to 307 ns per packet!

Status Quo: Linux Ethernet frame: 84 bytes to 1538 bytes (1500 MTU) 100 Gb/s = 12.5 GB/s = 8M to 150M packets/sec Service time is only 6.7 ns to 125 ns per packet!

Status Quo: Linux Network stack: demultiplexing, checks, scheduling synchronization needed for multi-core POSIX limitations Copy required for send(2) and recv(2) File descriptor migration among processes Context switch overhead

Arrakis: The Operating System is the Control Plane. Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. OSDI 2014.

Network virtualization VNICs managed by control plane in kernel NIC responsible for bandwidth allocation & demultiplexing Intel 82599 chipset Only supports filtering by MAC address, want arbitrary header fields Weighted round-robin bandwidth allocation At most 64 VNICs

Storage virtualization VSIC: SCSI/ATA queue with rate limiter Virtual Storage Area (VSA): Persistent disk segment Each VSIC has many VSAs, and each VSA can be mapped into many VSICs (for interprocess sharing) Don t have a device like this, emulated in software with a dedicated core

Arrakis

Arrakis: Control Plane VIC management: queue pair creation & deletion Doorbells: IPC endpoint for VIC notifications Hardware control Demultiplexing filters (just layer 2 for now) Rate specifiers for rate limiting

Arrakis: Doorbells SR-IOV virtualizes the MSI-x interrupt registers libos driver handles interrupts in user space if oncore, otherwise goes through control plane API is just a regular POSIX file descriptor

Arrakis: User APIs Extaris POSIX sockets Arrakis/N Zero-copy rings send_packet(queue, array) Ethernet receive_packet(queue) packet packet_done(packet) doorbells on completion

IX: A Protected Dataplane Operating System for High Throughput and Low Latency. Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. OSDI 2014.

IX: More protection Network stack in user-space unacceptable Firewall, ACL management in data plane Sending raw packets requires root Does zero-copy send require memory protection? Approach: use CPU virtualization to get three-way protection between kernel, networking, and user

IX: Protection can be cheap FlexSC (OSDI 10): Why are syscalls expensive? Direct effects: User-mode pipeline flush, mode switch, stashing registers. Only ~150 cycles!

IX: Protection can be cheap FlexSC (OSDI 10): Why are syscalls expensive? Direct effects: User-mode pipeline flush, mode switch, stashing registers. Only ~150 cycles! Indirect effects: TLB flush (if no tagged TLB), cache pollution with kernel instructions/data IPC i-cache d-cache L2 L3 d-tlb pwrite(2) 0.18 50 373 985 3160 44

IX: Protection can be cheap Dune (OSDI 12): Run Linux processes as VMX non-root ring0 with syscalls replaced with hypercalls Untrusted code can run in ring3, can cheaply mode switch back into ring0 Processes can then manipulate page tables, interrupts, hardware, privilege,

IX: More protection

Why not both? IX s use of VMX non-root ring0 is clever. Its embedding in Linux (with Dune) is nice. But if they used SR-IOV, they wouldn t have needed the Toeplitz hash hack and monopolizing a queue. And then they would have gotten (fast!) virtualized interrupts to not have to poll.

Arrakis: Evaluation 0.34 µs 0.54 µs 1.26 µs 1.05 µs = 3.36 µs (+2.23µs to 6.19 µs if off-core)

Arrakis: Evaluation 0.21 µs 0.17 µs = 0.38 µs

Arrakis: memcached

IX: Evaluation

IX: Give up on POSIX No to POSIX: Non-blocking API with batching Packets processed to completion Elastic vs. background threads No internal buffering Slow consumers lead to delayed ACKs

IX: Packet processing