PASTE: A Network Programming Interface for Non-Volatile Main Memory


PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

Review: Memory Hierarchy. Traditionally, persistence is slow and block-oriented: CPU caches (5-50 ns) and main memory (70 ns) offer byte access with load/store instructions, while HDDs/SSDs (100-1000s of µs) require block access through system calls.

Review: Memory Hierarchy. NVMM adds fast, byte-addressable persistence: it sits alongside main memory (70 ns to 1000s of ns) and is accessed with load/store instructions, unlike HDDs/SSDs (100-1000s of µs), which still require block access through system calls.

Networking is faster than disks/SSDs. For a 1.2 KB durable write over TCP/HTTP, the network path between client and server (cables, NICs, TCP/IP, socket API) takes 23 µs, while the storage path (system call, PCIe bus, physical media) to the SSD takes 1300 µs.

Networking is slower than NVMM. For the same 1.2 KB durable write over TCP/HTTP, the network path still takes 23 µs, but the storage path (memcpy, memory bus, physical media) to NVMM takes only 2 µs.

Networking is slower than NVMM. With many clients, the server's event loop interleaves socket I/O with a copy into NVMM for every request:

    nevts = epoll_wait(fds);
    for (i = 0; i < nevts; i++) {
        read(fds[i], buf);
        ...
        memcpy(nvmm, buf);
        ...
        write(fds[i], reply);
    }
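Fleshed out, that loop looks roughly like the sketch below (a minimal illustration; the DAX mapping, buffer sizes, and msync-based durability are assumptions, not from the talk). Every request crosses the socket API into a user buffer and is then copied again into NVMM before it can be made durable:

    /* conventional_server.c: epoll loop that copies request data into NVMM */
    #include <sys/epoll.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <string.h>

    #define MAX_EVENTS 64
    #define BUF_SIZE   2048

    void serve(int epfd, int nvmm_fd, size_t nvmm_size)
    {
        struct epoll_event evts[MAX_EVENTS];
        char buf[BUF_SIZE];
        /* NVMM exposed as a DAX file, mapped once at startup */
        char *nvmm = mmap(NULL, nvmm_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, nvmm_fd, 0);
        size_t off = 0;

        for (;;) {
            int nevts = epoll_wait(epfd, evts, MAX_EVENTS, -1);
            for (int i = 0; i < nevts; i++) {
                int fd = evts[i].data.fd;
                ssize_t n = read(fd, buf, sizeof(buf)); /* copy 1: kernel -> buf */
                if (n <= 0)
                    continue;
                memcpy(nvmm + off, buf, n);             /* copy 2: buf -> NVMM */
                msync(nvmm, off + n, MS_SYNC);          /* make the write durable */
                off += n;
                write(fd, "OK", 2);                     /* send the reply */
            }
        }
    }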

Innovations in both stacks. Network stack: MegaPipe [OSDI 12], Seastar, mTCP [NSDI 14], IX [OSDI 14], StackMap [ATC 16]. Storage stack: NV-Tree [FAST 15], NVWAL [ASPLOS 16], NOVA [FAST 16], Decibel [NSDI 17], LSNVMM [ATC 17].

But the two stacks are isolated, and moving data between them has a cost that neither line of work addresses.

Bridging the gap: PASTE sits between the network stack and the storage stack, connecting the two.

PASTE Design Goals
- Durable zero copy: DMA directly to NVMM
- Selective persistence: exploit modern NICs' DMA to the L3 cache
- Persistent data structures: indexed, named packet buffers backed by a file (see the sketch below)
- Generality and safety: TCP/IP in the kernel and the netmap API
- Best practices from modern network stacks: run-to-completion, blocking, busy-polling, batching, etc.
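For the "indexed, named packet buffers" goal, the walkthrough that follows shows Plog entries as (pbuf, off, len) tuples. A rough illustration of such a record (the field names and widths are assumptions, not the actual PASTE definitions):

    #include <stdint.h>

    /* Illustrative Plog record naming a persistent packet buffer. Because
       Pbufs live in a file (/mnt/pm/pp) whose base address is static, an
       index plus offset/length remains meaningful after a reboot. */
    struct plog_entry {
        uint32_t pbuf; /* index of the Pbuf in the Ppool file */
        uint16_t off;  /* payload offset within the Pbuf (e.g., 96, past headers) */
        uint16_t len;  /* payload length in bytes */
    };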

PASTE in Action. An app thread in user space shares two regions with the in-kernel TCP/IP stack and NIC driver: a Pring, a ring of slots (with cur and tail pointers) that reference packet buffers, and a Ppool, a shared-memory pool of Pbufs backed by a file (/mnt/pm/pp) on a persistent-memory file system (/mnt/pm). The application additionally maintains a Plog (/mnt/pm/plog), a log of (pbuf, off, len) entries. Data moves with zero copy throughout. The receive path takes six steps:

1. Run NIC I/O and TCP/IP: the app issues the poll() system call; the kernel stack receives, say, 6 in-order TCP segments and sets their Pbufs into Pring slots before poll() returns.
2. Read data on the Pring: the app processes payloads in place, directly from the Pbufs.
3. Flush Pbuf(s): the Pbuf data is flushed from the CPU cache to the DIMM with the clflush(opt) instruction.
4. Flush Plog entry(ies): the app appends entries such as (1, 96, 120) and flushes them too. A Pbuf is a persistent data representation: its base address is static, i.e., a file (/mnt/pm/pp), so buffers can be recovered after reboot.
5. Swap out Pbuf(s): each logged Pbuf is replaced in its Pring slot by a fresh buffer from the pool (e.g., Pbuf 1 is swapped for 8), preventing the kernel from recycling buffers that the Plog references; the same is done for Pbufs 2 and 6, logged as (2, 96, 768) and (6, 96, 987).
6. Update the Pring: the app advances cur, returning the buffers now in slots 0-6 to the kernel at the next poll().

The Plog shown here is a write-ahead log, but applications can organize various data structures in the Plog, e.g., a B+tree whose leaves point at (pbuf, off, len) entries.
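Putting the six steps together: the sketch below is schematic, not the real PASTE API. The struct layouts and the pbuf_addr/ppool_alloc/plog_append/process helpers are hypothetical stand-ins; only poll() and the clflushopt/mfence instructions come directly from the talk.

    #include <poll.h>
    #include <stdint.h>
    #include <immintrin.h>

    #define CACHE_LINE 64

    struct pslot { uint32_t pbuf; uint16_t off, len; };
    struct pring { uint32_t cur, tail, num_slots; struct pslot slot[]; };
    struct ppool;                       /* Pbufs mapped from /mnt/pm/pp  */
    struct plog;                        /* WAL mapped from /mnt/pm/plog  */

    /* Hypothetical helpers standing in for the PASTE library: */
    extern char *pbuf_addr(struct ppool *pool, uint32_t pbuf);
    extern uint32_t ppool_alloc(struct ppool *pool);
    extern void plog_append(struct plog *log, uint32_t pbuf,
                            uint16_t off, uint16_t len);
    extern void process(const char *data, size_t len);

    /* Steps 3/4: flush [p, p+len) from the CPU cache to the DIMM. */
    static void flush_range(void *p, size_t len)
    {
        for (size_t i = 0; i < len; i += CACHE_LINE)
            _mm_clflushopt((char *)p + i);
        _mm_mfence();
    }

    void rx_loop(int fd, struct pring *ring, struct ppool *pool, struct plog *log)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        for (;;) {
            poll(&pfd, 1, -1);                          /* 1. run NIC I/O and TCP/IP */
            while (ring->cur != ring->tail) {
                struct pslot *s = &ring->slot[ring->cur];
                char *buf = pbuf_addr(pool, s->pbuf);
                process(buf + s->off, s->len);          /* 2. read data on the Pring */
                flush_range(buf + s->off, s->len);      /* 3. flush the Pbuf */
                plog_append(log, s->pbuf, s->off, s->len); /* 4. append+flush Plog entry */
                s->pbuf = ppool_alloc(pool);            /* 5. swap out the Pbuf */
                ring->cur = (ring->cur + 1) % ring->num_slots; /* 6. advance cur */
            }
            /* the swapped-in buffers in consumed slots return to the kernel
               at the next poll() */
        }
    }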

Evaluation. Three questions:
1. How does PASTE outperform existing systems?
2. Is PASTE applicable to existing applications?
3. Is PASTE useful for systems other than file/DB storage?

How does PASTE outperform existing systems? [Figures: WAL throughput for 64 B and 1280 B values, PASTE vs. existing systems.]

What if we use more complex data structures? [Figures: B+tree (all writes) throughput for 64 B and 1280 B values.]

Is PASTE applicable to existing applications? [Figures: Redis throughput under YCSB, read-mostly and update-heavy workloads.]

Is PASTE useful for systems other than DB/file storage?
- Packet logging prior to forwarding: fault-tolerant middleboxes [SIGCOMM 15], traffic recording
- Extending mSwitch [SOSR 15], a scalable NFV backend switch

Conclusion. PASTE is a network programming interface that:
- Enables durable zero copy to NVMM
- Helps apps organize persistent data structures on NVMM
- Lets apps use TCP/IP and be protected
- Offers a high-performance network stack even without NVMM
Code: https://github.com/luigirizzo/netmap/tree/paste
Contact: micchie@sfc.wide.ad.jp or @michioh

Multicore Scalability. [Figure: WAL throughput as the number of cores grows.]

Further Opportunity with Co-designed Stacks. What if the NVMM has higher access latency (e.g., 3D XPoint)? PASTE can overlap flushes with request processing: issue clflushopt after examining each request, then execute a single mfence just before the system call that triggers packet I/O, so the loop waits for all outstanding flushes only once per batch before sending responses. See the paper for results.
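A minimal sketch of that overlap (the request type and examine() are illustrative stand-ins): clflushopt is weakly ordered and does not stall the pipeline, so each request's flushes proceed while the next request is examined, and one mfence before the system call waits for them all at once.

    #include <immintrin.h>
    #include <stddef.h>

    struct req { char *payload; size_t len; };
    extern void examine(struct req *r);  /* application logic per request */

    void process_batch(struct req *reqs, int n)
    {
        for (int i = 0; i < n; i++) {
            examine(&reqs[i]);
            /* start flushing this request's NVMM writes; the flush overlaps
               with the next iteration's examine() because clflushopt does
               not block */
            for (size_t off = 0; off < reqs[i].len; off += 64)
                _mm_clflushopt(reqs[i].payload + off);
        }
        _mm_mfence(); /* wait once for all outstanding flushes */
        /* now issue the system call that triggers packet I/O and sends
           the responses */
    }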

Experiment Setup. Intel Xeon E5-2640v4 (2.4 GHz), HPE 8 GB NVDIMM (NVDIMM-N), Intel X540 10 GbE NIC. Comparison: Linux and StackMap [ATC 16] (the current state of the art); the comparison is fair because all systems use the same kernel TCP/IP implementation.