MegaPipe: A New Programming Interface for Scalable Network I/O


MegaPipe: A New Programming Interface for Scalable Network I/O
Sangjin Han, in collaboration with Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy
University of California, Berkeley / Yahoo! Research

tl;dr? MegaPipe is a new network programming API for message-oriented workloads, designed to avoid the performance issues of the BSD Socket API.

Two Types of Network Workloads
1. Bulk-transfer workload: one-way, large data transfers. Ex: video streaming, HDFS. Very cheap: half a CPU core is enough to saturate a 10G link.
2. Message-oriented workload: short connections or small messages. Ex: HTTP, RPC, DB, key-value stores, ... CPU-intensive!

BSD Socket API Performance Issues

    n_events = epoll_wait(...);          // wait for I/O readiness
    for (...) {
        new_fd = accept(listen_fd);      // new connection
        bytes = recv(fd2, buf, 4096);    // new data for fd2
        ...
    }

Issues with message-oriented workloads:
- System call overhead
- Shared listening socket
- File abstraction overhead
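To make the cost per message concrete, here is a minimal, self-contained sketch (not from the slides) of the readiness-model event loop that the snippet above abbreviates; the port number and buffer size are arbitrary choices. Every message costs at least one epoll_wait() plus one recv()/send(), and every connection adds accept(), epoll_ctl(), and close() on top.

    /* Minimal readiness-model (epoll) echo loop, sketched to show the
     * system-call cost per message; error handling is omitted for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MAX_EVENTS 64

    int main(void)
    {
        int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(8080),          /* arbitrary port */
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(listen_fd, SOMAXCONN);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);                 /* system call */

        struct epoll_event events[MAX_EVENTS];
        char buf[4096];

        for (;;) {
            int n = epoll_wait(ep, events, MAX_EVENTS, -1);           /* system call */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    int new_fd = accept(listen_fd, NULL, NULL);       /* system call */
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = new_fd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, new_fd, &cev);       /* system call */
                } else {
                    ssize_t bytes = recv(fd, buf, sizeof(buf), 0);    /* system call */
                    if (bytes <= 0)
                        close(fd);                                    /* system call */
                    else
                        send(fd, buf, bytes, 0);                      /* system call */
                }
            }
        }
    }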

Microbenchmark: How Bad? An RPC-like test on an 8-core Linux server (with epoll): 768 clients each open a new TCP connection, exchange 64 B requests and 64 B responses for 10 transactions, then tear the connection down. Varied parameters: 1. message size, 2. connection length, 3. number of cores.

1. Small Messages Are Bad. [Chart: throughput (Gbps) and CPU usage (%) vs. message size, 64 B to 16 KiB: small messages yield low throughput at high CPU overhead]

2. Short Connections Are Bad. [Chart: throughput (1M transactions/s) vs. number of transactions per connection, 1 to 128: single-transaction connections are 19x slower]

3. Multi-Core Will Not Help (Much). [Chart: throughput (1M transactions/s) vs. number of CPU cores, 1 to 8: actual scaling falls well short of ideal scaling]

MEGAPIPE DESIGN

Design Goals
- Concurrency as a first-class citizen
- Unified interface for various I/O types: network connections, disk files, pipes, signals, etc.
- Low overhead & multi-core scalability (the main focus of this presentation)

Overview

    Problem                        Cause                      Solution
    Low per-core performance       System call overhead       System call batching
    Poor multi-core scalability    Shared listening socket    Listening socket partitioning
                                   VFS overhead               Lightweight socket

Key Primitives
- Handle: similar to a file descriptor (a TCP connection, pipe, disk file, ...), but only valid within a channel.
- Channel: a per-core, bidirectional pipe between user and kernel that multiplexes the I/O operations of its handles.

Sketch: How Channels Help (1/3). [Diagram: many user-level handles multiplexed over a single per-core channel between user and kernel, enabling I/O batching]

Sketch: How Channels Help (2/3). [Diagram: new connections arriving at an accept queue shared by cores 1 to 3, replaced by listening socket partitioning]

Sketch: How Channels Help (3/3). [Diagram: per-core sockets going through the VFS layer, replaced by lightweight sockets that bypass it]

MegaPipe API Functions
- mp_create() / mp_destroy(): create/close a channel
- mp_register() / mp_unregister(): register a handle (regular FD or lwsocket) into a channel
- mp_accept() / mp_read() / mp_write() / ...: issue an asynchronous I/O command for a given handle
- mp_dispatch(): dispatch an I/O completion event from a channel
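For orientation only, here is a rough C-style rendering of these entry points. The function names come from this slide, but every type and parameter list below is an assumption inferred from how the ping-pong example at the end of the deck calls them (it uses named arguments such as mask=my_cpu_id and cookie=conn); this is not MegaPipe's actual header.

    /* Hypothetical declarations: types and parameter lists are assumptions
     * inferred from the ping-pong example later in this deck, not the real API. */
    #include <stddef.h>

    typedef struct mp_channel mp_channel_t;   /* per-core channel (assumed opaque type) */
    typedef int mp_handle_t;                  /* channel-local, FD-like handle (assumed) */

    struct mp_event {                         /* completion event (fields assumed) */
        int          cmd;                     /* ACCEPT, READ, WRITE, DISCONNECT, ... */
        mp_handle_t  handle;
        void        *cookie;                  /* opaque pointer given at registration */
        size_t       size;                    /* bytes transferred, for READ/WRITE */
        int          fd;                      /* new FD, for ACCEPT completions */
    };

    mp_channel_t *mp_create(void);                 /* create a per-core channel */
    void          mp_destroy(mp_channel_t *ch);    /* close the channel */

    /* Register a regular FD or lwsocket with a channel and get back a handle;
     * 'cookie' is returned with completions, and 'mask' selects the accept-queue
     * partition for listening sockets (cf. mask=my_cpu_id in the example). */
    mp_handle_t mp_register(mp_channel_t *ch, int fd, void *cookie, int mask);
    void        mp_unregister(mp_channel_t *ch, mp_handle_t h);

    /* Issue asynchronous I/O commands; completions arrive via mp_dispatch(). */
    int mp_accept(mp_handle_t h);
    int mp_read(mp_handle_t h, void *buf, size_t len);
    int mp_write(mp_handle_t h, const void *buf, size_t len);

    /* Dispatch the next I/O completion event from the channel. */
    struct mp_event *mp_dispatch(mp_channel_t *ch);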

Completion Notification Model
BSD Socket API: Wait-and-Go (readiness model)

    epoll_ctl(fd1, EPOLLIN);
    epoll_ctl(fd2, EPOLLIN);
    epoll_wait(...);
    ret1 = recv(fd1, ...);
    ret2 = recv(fd2, ...);

MegaPipe: Go-and-Wait (completion notification)

    mp_read(handle1, ...);
    mp_read(handle2, ...);
    ev = mp_dispatch(channel);
    ev = mp_dispatch(channel);

The completion model enables batching, is easy and intuitive, and is compatible with disk files.

1. I/O Batching: transparent batching that exploits the parallelism of independent handles. [Diagram: the application issues individual (non-batched) MegaPipe API calls, e.g., read data from handle 6, accept a new connection, read data from handle 3, write data to handle 5, to the MegaPipe user-level library; the library forwards them to the MegaPipe kernel module as batched system calls, and completion events (new connection arrived, write done to handle 5, read done from handle 6) come back batched as well]
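As a rough illustration of the idea, and not MegaPipe's actual code, the sketch below shows a user-level wrapper that queues commands and crosses into the kernel once per batch; struct mp_request, mp_queue(), mp_flush_to_kernel(), and BATCH_MAX are all hypothetical names standing in for the library/kernel-module interface pictured above.

    /* Conceptual sketch of transparent I/O batching (not MegaPipe's code):
     * queue asynchronous commands in user space, cross into the kernel once. */
    #include <stddef.h>
    #include <stdio.h>

    enum mp_cmd { CMD_ACCEPT, CMD_READ, CMD_WRITE };

    struct mp_request {
        enum mp_cmd cmd;     /* which operation                 */
        int         handle;  /* channel-local handle it targets */
        void       *buf;     /* data buffer (reads/writes)      */
        size_t      len;     /* buffer length                   */
    };

    #define BATCH_MAX 32

    static struct mp_request pending[BATCH_MAX];
    static int n_pending;

    /* Stand-in for the single batched kernel crossing. */
    static void mp_flush_to_kernel(void)
    {
        if (n_pending == 0)
            return;
        printf("one kernel crossing carrying %d commands\n", n_pending);
        n_pending = 0;
    }

    /* Library side: queue a command instead of trapping into the kernel now. */
    static void mp_queue(enum mp_cmd cmd, int handle, void *buf, size_t len)
    {
        pending[n_pending++] = (struct mp_request){ cmd, handle, buf, len };
        if (n_pending == BATCH_MAX)      /* flush when the batch is full ...   */
            mp_flush_to_kernel();        /* ... or when the app waits for events */
    }

    int main(void)
    {
        char buf[4096];
        /* Three logically independent operations become one kernel crossing. */
        mp_queue(CMD_READ,   6, buf, sizeof(buf));
        mp_queue(CMD_ACCEPT, 1, NULL, 0);
        mp_queue(CMD_WRITE,  5, buf, 64);
        mp_flush_to_kernel();            /* e.g., when the app calls dispatch */
        return 0;
    }

Batching amortizes the mode-switch and cache-pollution cost of a system call over many independent operations, which is exactly what the completion-notification model above makes possible.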

2. Listening Socket Partitioning: a per-core accept queue for each channel, instead of the globally shared accept queue. [Diagrams: mp_register() attaches the listening socket to a per-core channel, giving it its own accept queue in the kernel; new connections are then steered to these per-core accept queues and retrieved with mp_accept()]

3. lwsocket: Lightweight Socket. A common-case optimization for sockets: sockets are ephemeral and rarely shared, so bypass the VFS layer and convert an lwsocket into a regular file descriptor only when necessary. [Diagrams: a regular socket FD reaches the TCP socket through a file instance (states), dentry, and inode (what would the file name of a socket even be?), and dup()/fork()/open() can create additional descriptors and file instances on top; an lwsocket handle instead points directly at the TCP socket, skipping the VFS objects]

EVALUATION

Microbenchmark 1/2: throughput improvement with various message sizes. [Chart: throughput improvement (%) over the baseline vs. number of CPU cores (1 to 8), for message sizes of 64 B, 256 B, 512 B, 1 KiB, 2 KiB, and 4 KiB]

Microbenchmark 2/2: multi-core scalability with various connection lengths (# of transactions per connection). [Charts: parallel speedup vs. number of CPU cores (1 to 8) for baseline and MegaPipe, with connection lengths from 1 to 128 transactions]

Macrobenchmark
- memcached: in-memory key-value store; limited scalability, since the object store is shared by all cores under a global lock.
- nginx: web server; highly scalable, since nothing is shared across cores except the listening socket.

memcached: memaslap with 90% GET / 10% SET, 64 B keys, 1 KB values. [Chart: throughput (1k requests/s) vs. number of requests per connection (1 to 10), MegaPipe vs. baseline, with improvements of 1.3x to 3.9x; the global lock becomes the bottleneck for longer connections]

memcached: memaslap with 90% GET / 10% SET, 64 B keys, 1 KB values. [Chart: throughput (1k requests/s) vs. number of requests per connection (1 to 256), comparing Baseline, MegaPipe, Baseline-FL, and MegaPipe-FL]

nginx: based on Yahoo! HTTP traces (6.3 KiB and 2.3 transactions per connection on average). [Chart: throughput (Gbps) and improvement (%) vs. number of CPU cores (1 to 8), MegaPipe vs. baseline]

CONCLUSION

Related Work
- Batching: exception-less system calls [FlexSC, OSDI '10] [libflexsc, ATC '11]; MegaPipe also solves the scalability issues.
- Partitioning: per-core accept queues [Affinity-Accept, EuroSys '12]; MegaPipe provides explicit control over partitioning.
- VFS scalability [Mosbench, OSDI '10]: MegaPipe bypasses the issues rather than mitigating them.

Conclusion
Short connections or small messages mean high CPU overhead and poor scaling across multi-core CPUs.
MegaPipe's key abstraction is the per-core channel, enabling three optimization opportunities: batching, partitioning, and lwsocket. Result: 15+% improvement for memcached, 75% for nginx.

BACKUP SLIDES

Small Messages with MegaPipe. [Chart: throughput (1M trans/s), baseline vs. MegaPipe, vs. number of transactions per connection (1 to 128)]

1. Small Messages Are Bad: Why? The number of messages matters, not the volume of traffic: the per-message cost far exceeds the per-byte cost (a 1 KB message is only 2% more expensive than a 64 B one). A 10G link carrying 1 KB messages means roughly 1M IOPS (10 Gbit/s is about 1.25 GB/s, divided by 1 KB per message), and thus 1M+ system calls per second. System calls are expensive [FlexSC, 2010]: mode switching between kernel and user, and CPU cache pollution.

Short Connections with MegaPipe. [Chart: throughput (Gbps) and CPU usage (%), baseline vs. MegaPipe, vs. message size (64 B to 16 KiB)]

2. Short Connections Are Bad: Why? Connection establishment is expensive: three-way handshake / four-way teardown, more packets, and more system calls (accept(), epoll_ctl(), close(), ...). A socket is represented as a file in UNIX, adding file overhead and VFS overhead.

Multi-Core Scalability with MegaPipe. [Chart: throughput (1M trans/s) and per-core efficiency (%) vs. number of CPU cores (1 to 8), baseline vs. MegaPipe]

3. Multi-Core Will Not Help: Why? Shared accept queue issues [Affinity-Accept, 2012]: contention on the listening socket and poor connection affinity. [Diagram: user threads on different cores calling accept() on a single shared listening socket, with TX and RX paths in the kernel]

3. Multi-Core Will Not Help: Why? File/VFS multi-core scalability issues. [Diagram: in the application process, the file descriptor table is shared by threads; in VFS, the file instance, dentry, and inode are globally visible in the system; below sits the TCP socket in the TCP/IP stack]

Overview. [Architecture diagram: a user application thread issues MegaPipe API calls on handles to the MegaPipe user-level library; per-core channels carry batched asynchronous I/O commands into the Linux kernel and batched completion events back; each per-core channel instance in the kernel holds lwsocket handles, file handles, and pending completion events, alongside VFS and TCP/IP]

Ping-Pong Server Example

    ch = mp_create()                                       # per-core channel
    handle = mp_register(ch, listen_sd, mask=my_cpu_id)    # listening socket, partitioned by CPU id
    mp_accept(handle)

    while true:
        ev = mp_dispatch(ch)
        conn = ev.cookie
        if ev.cmd == ACCEPT:
            mp_accept(conn.handle)
            conn = new Connection()
            conn.handle = mp_register(ch, ev.fd, cookie=conn)   # register the accepted connection
            mp_read(conn.handle, conn.buf, READSIZE)
        elif ev.cmd == READ:
            mp_write(conn.handle, conn.buf, ev.size)            # echo the received data back
        elif ev.cmd == WRITE:
            mp_read(conn.handle, conn.buf, READSIZE)            # wait for the next message
        elif ev.cmd == DISCONNECT:
            mp_unregister(ch, conn.handle)                      # clean up
            delete conn

Contribution Breakdown

memcached latency. [Chart: latency (µs) vs. number of concurrent client connections (0 to 1536), showing 50th and 99th percentile latency for Baseline-FL and MegaPipe-FL]

Clean-Slate vs. Dirty-Slate
MegaPipe takes a clean-slate approach with new APIs: quick prototyping for various optimizations, and the performance improvement is worthwhile.
Can we apply the same techniques back to the BSD Socket API? Each technique has its own challenges, and embracing all of them could be even harder. Future Work™