PASTE: Fast End System Networking with netmap


PASTE: Fast End System Networking with netmap
Michio Honda, Giuseppe Lettieri, Lars Eggert and Douglas Santry
BSDCan 2018
Contact: @michioh, micchie@sfc.wide.ad.jp
Code: https://github.com/micchie/netmap/tree/stack

This talk is about: what the problems with the current network stack are, and how we solve them.

This talk is NOT about: "user-space network stacks are awesome".

Problem 1: The current socket API is slow.
Workload: request (1400 B) and response (64 B) over HTTP and TCP.
Setup: the server has a Xeon 2640v4 2.4 GHz (only one core is used) and an Intel X540 10 GbE NIC; the client has a Xeon 2690v4 2.6 GHz and runs the wrk HTTP benchmark tool.
Event loop on the server:

    n = kevent(fds);
    for (i = 0; i < n; i++) {
        read(fds[i], buf);
        ...
        write(fds[i], res);
    }

[Plot annotations: 2.8 Gbps; 400 us vs. 23 us]
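The loop on the slide is abbreviated; a minimal, self-contained sketch of the same per-socket pattern on FreeBSD (hypothetical server code, not the authors' benchmark) could look like the following. Each wakeup costs one kevent() call plus a read() and a write() system call for every ready connection, which is the overhead the measurement above captures.

    #include <sys/event.h>
    #include <string.h>
    #include <unistd.h>

    #define MAXEV 64

    void serve(int kq)
    {
        struct kevent evs[MAXEV];
        char buf[2048];
        const char *res = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n";

        for (;;) {
            int n = kevent(kq, NULL, 0, evs, MAXEV, NULL); /* one syscall per batch */
            for (int i = 0; i < n; i++) {
                int fd = (int)evs[i].ident;
                if (read(fd, buf, sizeof(buf)) <= 0)       /* one syscall per socket */
                    continue;
                /* ... parse the HTTP request ... */
                write(fd, res, strlen(res));               /* another syscall per socket */
            }
        }
    }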

Problem 2: The current stack cannot utilize Non-Volatile Main Memory (NVMM) efficiently.
Review: NVMMs offer fast, byte-addressable persistence. Memory/storage hierarchy:
- Byte access with load/store: CPU caches (5-50 ns), main memory (70-1000s ns)
- Block access with system calls: HDD / SSD (100-1000s us)


Problem 2 (cont.): Durable-write workload: request (1400 B) and response (64 B) over HTTP and TCP, with every request persisted to NVMM.
Event loop on the server:

    n = kevent(fds);
    for (i = 0; i < n; i++) {
        read(fds[i], buf);
        ...
        memcpy(nvmm, buf);
        clflush(nvmm);
        ...
        write(fds[i], res);
    }

Same setup as before: server with a Xeon 2640v4 2.4 GHz; client with a Xeon 2690v4 2.6 GHz running the wrk HTTP benchmark tool.
Result (plot annotations): throughput drops to almost half and latency almost doubles compared to the non-durable case.
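For the durable case, the extra per-request work is the copy into NVMM plus cache-line flushes. A minimal sketch of that helper (hypothetical name; nvmm is assumed to point into an mmap()-ed NVMM file) is:

    #include <immintrin.h>
    #include <string.h>

    #define CACHE_LINE 64

    static void durable_copy(char *nvmm, const char *buf, size_t len)
    {
        memcpy(nvmm, buf, len);                      /* extra data copy into NVMM */
        for (size_t off = 0; off < len; off += CACHE_LINE)
            _mm_clflush(nvmm + off);                 /* write each cache line back */
        _mm_sfence();                                /* order the flushes before replying */
    }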

Summary of the problems:
- Per-socket system calls and I/O requests must be avoided.
- Data copies (even to NVMM) must be avoided.

Getting the architecture right: how do we address these problems while preserving the benefits offered by the current stack and socket API today?

PASTE: a scalable, flexible end-system networking architecture.
- True zero copy (even to NVMM)
- System call and I/O batching across multiple sockets
- Support for kernel TCP/IP
- Protocol independence
What we keep from today's stack:
- Blocking and busy-polling socket API
- Protection

PASTE building blocks: two netmap extensions.
- stack port: integrates the kernel TCP/IP implementation; same level of abstraction as pipe and vale ports.
- extmem subsystem: supports an arbitrary (user virtual address) memory region for netmap objects; an mmap()-ed file in NVMM can be used.

PASTE in action.
[Diagram: an app thread uses the netmap API on a ring of slots [0]-[7] (cur pointer), whose slots reference buffers (indices 20-27) in a shared memory region; the kernel TCP/IP stack and NIC sit below the user/kernel boundary.]

Step 1: Run NIC I/O and TCP/IP. poll() triggers NIC I/O and TCP/IP processing. [Same diagram as above.]

Step 1 (cont.): Imagine 7 packets are received. [Same diagram.]

Step 1 (cont.): The packets are in-order TCP segments, so the kernel places them into the app ring slots: a zero-copy swap with the buffers currently in the app ring, after which the tail pointer is advanced to indicate new app data. [Diagram: the ring slots now hold buffer indices 0-6.]

Step 1 (end): poll() returns. [Same diagram.]

Step 2: Read data on the ring. The app reads the buffers referenced by the ring slots from cur to tail. [Same diagram.]

Step 3: Update the ring pointer. The app advances cur; the buffers in slots 0-6 (whose buffer indices are also 0-6 in this case) are returned to the kernel at the next poll(). A code sketch of this receive loop follows below. [Same diagram.]
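A sketch of steps 1-3, assuming the standard netmap ring macros plus the PASTE-specific per-slot fd and offset fields shown later on the app-code slide (consume() is a hypothetical application function):

    struct netmap_ring *ring = NETMAP_RXRING(nmd->nifp, 0);
    unsigned int i = ring->cur;

    while (i != ring->tail) {                        /* new data sits between cur and tail */
        struct netmap_slot *slot = &ring->slot[i];
        char *payload = NETMAP_BUF(ring, slot->buf_idx) + slot->offset; /* past headers */
        int fd = slot->fd;                           /* the TCP connection this data belongs to */

        consume(fd, payload, slot->len);
        i = nm_ring_next(ring, i);
    }
    ring->head = ring->cur = i;                      /* step 3: slots go back to the kernel at next poll() */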

Zero-copy write. Steps: 1. Run NIC I/O and TCP/IP; 2. Read data on ring; 3. Record data; 4. Swap out buf(s); 5. Update ring pointer.
The app does not have to return buffers to the kernel, which is useful e.g. for key-value stores: it records references to the received buffers (buffer index, offset, length, e.g. (1, 96, 120), (2, 96, 987), (6, 96, 512)) in a B+tree in the shared memory region and hands other free buffers back via the ring. [Diagram: the ring slots now hold spare buffer indices (e.g. 8-10) in place of the retained ones.]
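A sketch of "record data" and "swap out buf(s)", assuming a spare-buffer pool and a hypothetical kv_insert() for the B+tree; NS_BUF_CHANGED is the standard netmap flag that tells the kernel the slot's buffer index was replaced:

    uint32_t keep = slot->buf_idx;                        /* retain the received buffer */

    kv_insert(btree, key, keep, slot->offset, slot->len); /* record (buffer, offset, length) */
    slot->buf_idx = free_bufs[--nfree];                   /* hand the kernel a spare buffer instead */
    slot->flags |= NS_BUF_CHANGED;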

Durable zero-copy write. Steps: 1. Run NIC I/O and TCP/IP; 2. Read data on ring; 3. Flush buf(s); 4. Flush metadata; 5. Swap out buf(s); 6. Update ring pointer.
Setup: mmap() a file on NVMM (/mnt/pm/pp on the /mnt/pm file system), create the netmap objects in it with extmem, and create the B+tree in NVMM as well (/mnt/pm/plog). [Diagram: as in the previous slide, with the shared memory region and B+tree backed by files on /mnt/pm.]
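A sketch of the NVMM setup and the durable variant of the write path, under the same assumptions as above (pool_size, flush_range() and kv_insert() are hypothetical; flush_range() stands in for a clflush/clwb loop followed by a fence, and the exact nm_open() extmem option syntax is omitted):

    #include <sys/mman.h>
    #include <fcntl.h>

    /* Map the buffer pool file on the NVMM file system; this mapping is
     * what gets handed to netmap as the extmem region at nm_open() time. */
    int pfd = open("/mnt/pm/pp", O_RDWR);
    char *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE, MAP_SHARED, pfd, 0);

    /* Per-request durable write path (steps 3-6). */
    flush_range(payload, slot->len);                           /* 3: flush the data buffer     */
    ent = kv_insert(btree, key, slot->buf_idx, slot->offset, slot->len);
    flush_range(ent, sizeof(*ent));                            /* 4: flush the B+tree metadata */
    slot->buf_idx = free_bufs[--nfree];                        /* 5: swap out the buffer       */
    slot->flags |= NS_BUF_CHANGED;
    /* 6: advance cur/head as before */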

What the app code looks like (use of extmem can be specified at nm_open()):

    nmd = nm_open("stack:0");
    ioctl(nmd->fd, ..., "stack:em0");            /* attach the NIC to the stack port */
    s = socket(); bind(s); listen(s);
    int fds[2] = {nmd, s};
    for (;;) {
        poll(fds, 2, ...);
        if (fds[1] & POLLIN)
            ioctl(nmd, ..., accept(fds[1]));     /* register the new connection */
        if (fds[0] & POLLIN) {
            for (slot in nmd->rxring) {
                int fd = slot->fd;
                char *p = NETMAP_BUF(slot) + slot->offset;
            }
        }
    }

What is going on inside poll():

    1. poll(app_ring)
    2. for (bufi in nic_rxring) {                /* netmap */
           nmb = NMB(bufi);
           m = m_gethdr();
           m->m_ext.ext_buf = nmb;
           ifp->if_input(m);
       }
    3. mysoupcall(so) { mark_readable(so->so_rcv); }            /* TCP/UDP/SCTP/IP impl. */
    4. for (bufi in readable) { set(bufi, fd(so), app_ring); }  /* netmap */

PASTE performance: single CPU core, with and without storing the data in NVMM. Setup: the server has a Xeon 2640v4 2.4 GHz, an Intel X540 10 GbE NIC and an HPE NVDIMM; the client has a Xeon 2690v4 2.6 GHz and the same NIC, and runs the wrk HTTP benchmark tool. [Plots omitted.]

PASTE performance: multiple CPU cores. Setup as above: server with a Xeon 2640v4 2.4 GHz, Intel X540 10 GbE NIC and HPE NVDIMM; client with a Xeon 2690v4 2.6 GHz and the same NIC, running wrk. [Plots omitted.]

PASTE performance: Redis, with YCSB read-mostly and update-heavy workloads. Setup: server with a Xeon 2640v4 2.4 GHz, Intel X540 10 GbE NIC and HPE NVDIMM; client with a Xeon 2690v4 2.6 GHz and the same NIC. [Plots omitted.]

Changes needed in the FreeBSD core:

    @@ -1101,6 +1101,8 @@ soclose(struct socket *so)
     drop:
         if (so->so_proto->pr_usrreqs->pru_close != NULL)
             (*so->so_proto->pr_usrreqs->pru_close)(so);
    +    if (so->so_dtor != NULL)
    +        so->so_dtor(so);
         SOCK_LOCK(so);

    @@ -111,6 +111,7 @@ struct socket {
         int so_ts_clock;                /* type of the clock used for timestamps */
         uint32_t so_max_pacing_rate;    /* (f) TX rate limit in bytes/s */
    +    void (*so_dtor)(struct socket *so); /* (a) optional destructor */
         union {
             /* Regular (data flow) socket. */
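For context, a hypothetical (not part of the patch) way a subsystem such as PASTE could use the new hook: register a destructor when it takes ownership of a socket, and soclose() then invokes it during teardown.

    /* Hypothetical consumer of the so_dtor hook. */
    static void
    paste_sodtor(struct socket *so)
    {
        /* release per-socket state the subsystem attached, e.g. ring/buffer references */
    }

    static void
    paste_attach_socket(struct socket *so)
    {
        so->so_dtor = paste_sodtor;    /* invoked from soclose() via the new hook */
    }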

Summary: PASTE integrates the kernel TCP/IP implementation and emerging NVMM with the netmap API.
Status: in the process of being upstreamed to netmap (with Giuseppe Lettieri).
Academic paper: Michio Honda, Giuseppe Lettieri, Lars Eggert and Douglas Santry, "PASTE: A Network Programming Interface for Non-Volatile Main Memory", USENIX NSDI 2018.
Code: https://github.com/micchie/netmap/tree/stack
Contact: @michioh, micchie@sfc.wide.ad.jp