MegaPipe: A New Programming Interface for Scalable Network I/O
Sangjin Han, in collaboration with Scott Marshall, Byung-Gon Chun, Sylvia Ratnasamy
University of California, Berkeley / Yahoo! Research
tl;dr? MegaPipe is a new network programming API for message-oriented workloads, to avoid the performance issues of the BSD Socket API.
Two Types of Network Workloads
1. Bulk-transfer workload: one-way, large data transfers. Ex: video streaming, HDFS. Very cheap: half a CPU core is enough to saturate a 10G link.
2. Message-oriented workload: short connections or small messages. Ex: HTTP, RPC, DB, key-value stores. CPU-intensive!
BSD Socket API Performance Issues

n_events = epoll_wait(...);        // wait for I/O readiness
for (...) {
    new_fd = accept(listen_fd);    // new connection
    bytes = recv(fd2, buf, 4096);  // new data for fd2

Issues with message-oriented workloads:
System call overhead
Shared listening socket
File abstraction overhead

(A fuller sketch of this baseline event loop follows below.)
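For reference, here is a minimal, compilable sketch of that baseline event loop, assuming listen_fd has already been added to epoll_fd and with error handling omitted. Every ready descriptor costs at least one extra system call, which is the per-message overhead the rest of the talk targets.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void baseline_event_loop(int epoll_fd, int listen_fd)
{
    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        int n = epoll_wait(epoll_fd, events, 64, -1);            /* 1 syscall per batch of events */
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int new_fd = accept(listen_fd, NULL, NULL);      /* 1 syscall per new connection */
                struct epoll_event ev = { .events = EPOLLIN, .data.fd = new_fd };
                epoll_ctl(epoll_fd, EPOLL_CTL_ADD, new_fd, &ev); /* and another to start watching it */
            } else {
                ssize_t bytes = recv(fd, buf, sizeof(buf), 0);   /* 1 syscall per message */
                if (bytes <= 0)
                    close(fd);                                   /* and one more on teardown */
            }
        }
    }
}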
Microbenchmark: How Bad?
RPC-like test on an 8-core Linux server (with epoll).
768 clients each open a new TCP connection to the server, exchange 10 transactions (64 B request, 64 B response), then tear the connection down.
Varied parameters: 1. message size, 2. connection length (transactions per connection), 3. number of cores.
1. Small Messages Are Bad
[Chart: throughput (Gbps) and CPU usage (%) vs. message size, 64 B to 16 KB.]
Low throughput, high overhead for small messages.
2. Short Connections Are Bad
[Chart: throughput (millions of transactions/s) vs. number of transactions per connection, 1 to 128.]
Throughput with the shortest connections is 19x lower than with long connections.
3. Multi-Core Will Not Help (Much)
[Chart: throughput (millions of transactions/s) vs. number of CPU cores, 1 to 8; ideal scaling vs. actual scaling.]
MEGAPIPE DESIGN
Design Goals
Concurrency as a first-class citizen
Unified interface for various I/O types: network connections, disk files, pipes, signals, etc.
Low overhead & multi-core scalability (the main focus of this presentation)
Overview
Problem: low per-core performance. Cause: system call overhead. Solution: system call batching.
Problem: poor multi-core scalability. Causes: shared listening socket; VFS overhead. Solutions: listening socket partitioning; lightweight socket (lwsocket).
Key Primitives
Handle: similar to a file descriptor (TCP connection, pipe, disk file, ...), but only valid within a channel.
Channel: per-core, bidirectional pipe between user and kernel; multiplexes the I/O operations of its handles.
Sketch: How Channels Help (1/3)
[Diagram: multiple handles in user space multiplexed over a single per-core channel into the kernel.]
I/O batching.
Sketch: How Channels Help (2/3)
[Diagram: cores 1 to 3 all call accept() on a single shared accept queue to take new connections.]
Sketch: How Channels Help (2/3, continued)
[Diagram: new connections are steered to per-core queues instead.]
Listening socket partitioning.
Sketch: How Channels Help (3/3)
[Diagram: per-core connections bypass the shared VFS layer.]
Lightweight socket (lwsocket).
MegaPipe API Functions
mp_create() / mp_destroy(): create / close a channel
mp_register() / mp_unregister(): register a handle (regular FD or lwsocket) into a channel
mp_accept() / mp_read() / mp_write() / ...: issue an asynchronous I/O command for a given handle
mp_dispatch(): dispatch an I/O completion event from a channel
(A sketch of a handle's lifecycle follows below; the backup slides contain a full ping-pong server example.)
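A brief sketch of a handle's lifecycle using these calls, written as C-flavored pseudocode: the slides give only the function names, so the argument lists and the mp_channel / mp_handle / mp_event types used here are illustrative assumptions, loosely modeled on the ping-pong example in the backup slides.

/* C-flavored pseudocode; prototypes and types are assumptions, not the
 * library's actual signatures. */
mp_channel *ch = mp_create();                        /* one channel per core */

mp_handle *h = mp_register(ch, conn_fd, /* cookie = */ conn_ctx);
mp_read(h, conn_buf, READSIZE);                      /* post an asynchronous read command */

mp_event ev = mp_dispatch(ch);                       /* later: fetch one completion event */
if (ev.cmd == READ)
    mp_write(h, conn_buf, ev.size);                  /* echo back whatever arrived */

mp_unregister(ch, h);                                /* detach the handle before closing */
mp_destroy(ch);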
Completion Notification Model
BSD Socket API: wait-and-go (readiness model)
    epoll_ctl(fd1, EPOLLIN);
    epoll_ctl(fd2, EPOLLIN);
    epoll_wait(...);
    ret1 = recv(fd1, ...);
    ret2 = recv(fd2, ...);
MegaPipe: go-and-wait (completion notification)
    mp_read(handle1, ...);
    mp_read(handle2, ...);
    ev = mp_dispatch(channel);
    ev = mp_dispatch(channel);
Advantages of the completion model: batching, easy and intuitive, compatible with disk files.
1. I/O Batching
Transparent batching: exploits the parallelism of independent handles.
[Diagram: the application issues individual (non-batched) MegaPipe API calls such as "read data from handle 6", "accept a new connection", "read data from handle 3", "write data to handle 5"; the MegaPipe user-level library forwards them to the MegaPipe kernel module as batched system calls, and completion events ("new connection arrived", "write done to handle 5", "read done from handle 6") come back batched as well.]
(A conceptual sketch of this batching path follows below.)
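The slides do not show the user/kernel interface itself, so the following is only a conceptual, compilable sketch of the batching idea: commands are buffered in user space per channel and handed to the kernel in a single crossing, amortizing the system-call cost over many operations. The struct layout and the commented-out submit call are hypothetical, not MegaPipe's actual ABI.

#include <stddef.h>

struct mp_cmd    { int handle; int op; void *buf; size_t len; };  /* hypothetical command record */
struct cmd_batch { struct mp_cmd cmds[64]; int count; };          /* per-channel user-space buffer */

static void queue_cmd(struct cmd_batch *b, struct mp_cmd c)
{
    if (b->count < 64)
        b->cmds[b->count++] = c;      /* buffered in user space: no system call yet */
}

static void flush_batch(struct cmd_batch *b)
{
    /* One user/kernel crossing for all pending commands; the real MegaPipe
     * channel interface is not shown on the slides, so the call below is a
     * placeholder name, not an actual syscall:
     *     sys_mp_submit(channel_fd, b->cmds, b->count);
     */
    b->count = 0;
}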
2. Listening Socket Partitioning
Per-core accept queue for each channel, instead of the globally shared accept queue.
[Diagrams: mp_register() attaches the listening socket to a channel and creates a per-core accept queue in the kernel; with one queue per core, mp_accept() then takes new connections from that core's own queue.]
(See the per-core registration sketch below.)
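A sketch of how a worker thread would partition the listening socket, again as C-flavored pseudocode with assumed prototypes; the mask argument (the worker's CPU ID) follows the backup-slide ping-pong example and is what gives each core its own accept queue.

/* Illustrative per-core worker setup; mp_* signatures are assumptions. */
void worker_thread(int listen_sd, int my_cpu_id)
{
    mp_channel *ch = mp_create();                          /* per-core channel */

    /* Registering the shared listening socket with this core's mask creates a
     * per-core accept queue, so cores stop contending on one global queue. */
    mp_handle *lh = mp_register(ch, listen_sd, /* mask = */ my_cpu_id);

    mp_accept(lh);                                         /* async accept on this core's queue */
    for (;;) {
        mp_event ev = mp_dispatch(ch);                     /* only this core's completions */
        /* ... handle ACCEPT / READ / WRITE / DISCONNECT as in the backup slide ... */
    }
}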
3. lwsocket: Lightweight Socket
Common-case optimization for sockets: sockets are ephemeral and rarely shared.
Bypass the VFS layer; convert into a regular file descriptor only when necessary.
[Diagrams: a regular socket file descriptor reaches the TCP socket through a VFS file instance (state), dentry, and inode, even though a socket has no file name; dup() or fork() adds more descriptors pointing at the same file instance, and open() creates additional file instances; an lwsocket handle instead points directly at the TCP socket.]
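One practical implication, as a hedged reading of these build slides: a connection accepted on a channel can remain an lwsocket for its whole life as long as it stays private to that core, which is the common case; only when regular file-descriptor semantics are actually needed (for example, sharing the socket across processes via dup() or fork()) is it converted into a full FD and the VFS cost paid.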
EVALUATION
Microbenchmark 1/2
Throughput improvement with various message sizes.
[Chart: throughput improvement over the baseline (%) vs. number of CPU cores (1 to 8), for message sizes of 64 B, 256 B, 512 B, 1 KiB, 2 KiB, and 4 KiB.]
Microbenchmark 2/2
Multi-core scalability with various connection lengths (number of transactions per connection).
[Charts: parallel speedup vs. number of CPU cores (1 to 8) for Baseline and MegaPipe, with connection lengths of 1, 2, 4, 8, 16, 32, 64, and 128 transactions.]
Macrobenchmark
memcached: in-memory key-value store; limited scalability, since the object store is shared by all cores with a global lock.
nginx: web server; highly scalable, since nothing is shared between cores except the listening socket.
memcached
memaslap with 90% GET, 10% SET, 64 B keys, 1 KB values.
[Chart: throughput (thousands of requests/s) vs. number of requests per connection (1 to 10), MegaPipe vs. Baseline; annotated improvements of 3.6x, 3.9x, 2.4x, and 1.3x, shrinking as the global-lock bottleneck takes over.]
memcached (continued)
Same workload, adding the -FL variants (a modified memcached with finer-grained locking in place of the global lock).
[Chart: throughput (thousands of requests/s) vs. number of requests per connection (1 to 256) for Baseline, MegaPipe, Baseline-FL, and MegaPipe-FL.]
nginx
Workload based on Yahoo! HTTP traces: 6.3 KiB and 2.3 transactions per connection on average.
[Chart: throughput (Gbps) and improvement (%) vs. number of CPU cores (1 to 8), MegaPipe vs. Baseline.]
CONCLUSION
Related Work
Batching: [FlexSC, OSDI '10], [libflexsc, ATC '11]: exception-less system calls; MegaPipe solves the scalability issues.
Partitioning: [Affinity-Accept, EuroSys '12]: per-core accept queues; MegaPipe provides explicit control over partitioning.
VFS scalability: [Mosbench, OSDI '10]: MegaPipe bypasses the issues rather than mitigating them.
Conclusion
Short connections or small messages: high CPU overhead, and poor scaling with multi-core CPUs.
MegaPipe's key abstraction is the per-core channel, enabling three optimization opportunities: batching, partitioning, and lwsocket.
15+% improvement for memcached, 75% for nginx.
BACKUP SLIDES
Small Messages with MegaPipe
[Chart: throughput (Gbps) and CPU usage (%) vs. message size (64 B to 16 KB), Baseline vs. MegaPipe.]
1. Small Messages Are Bad → Why?
The number of messages matters, not the volume of traffic: per-message cost >>> per-byte cost (a 1 KB message is only 2% more expensive than a 64 B message).
A 10G link with 1 KB messages → 1M IOPS! Thus 1M+ system calls.
System calls are expensive [FlexSC, 2010]: mode switching between kernel and user, CPU cache pollution.
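Back-of-envelope arithmetic behind that figure (my calculation, not from the slides): 10 Gbps is about 1.25 GB/s, and 1.25 GB/s divided by roughly 1 KB per message is on the order of 1.2 million messages per second, so saturating the link implies a million-plus receive system calls per second in the baseline, before counting sends and epoll calls.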
Short Connections with MegaPipe
[Chart: throughput (millions of transactions/s) vs. number of transactions per connection (1 to 128), Baseline vs. MegaPipe.]
2. Short Connections Are Bad → Why?
Connection establishment is expensive: three-way handshake / four-way teardown.
More packets, and more system calls: accept(), epoll_ctl(), close(), ...
A socket is represented as a file in UNIX: file overhead, VFS overhead.
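To make the per-connection cost concrete (my tally based on the baseline loop, not a figure from the slides): serving a single one-transaction connection already takes at least accept(), an epoll_ctl() to add the new descriptor, recv(), send(), and close(), i.e. five or more system calls to move 128 bytes of application data, on top of the handshake and teardown packets.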
Multi-Core Scalability with MegaPipe
[Chart: throughput (millions of transactions/s) and per-core efficiency (%) vs. number of CPU cores (1 to 8), Baseline vs. MegaPipe.]
3. Multi-Core Will Not Help → Why?
Shared queue issues [Affinity-Accept, 2012]:
Contention on the listening socket
Poor connection affinity
[Diagram: user threads on all cores call accept() on one listening socket; the kernel's RX and TX paths funnel through it.]
3. Multi-Core Will Not Help → Why? (continued)
File/VFS multi-core scalability issues.
[Diagram: an application process's file descriptor table is shared by its threads; each socket's VFS objects (file instance, dentry, inode) are globally visible in the system, sitting between the descriptor table and the TCP socket in TCP/IP.]
Overview (architecture)
[Diagram: a user application thread calls the MegaPipe API on handles; the MegaPipe user-level library exchanges batched asynchronous I/O commands and batched completion events with the Linux kernel over per-core channels (Core 1 ... Core N); each in-kernel channel instance holds lwsocket handles, file handles, and pending completion events, alongside the VFS and TCP/IP stacks.]
Ping-Pong Server Example

ch = mp_create()                                       # per-core channel
handle = mp_register(ch, listen_sd, mask=my_cpu_id)    # partition the listening socket to this core
mp_accept(handle)

while true:
    ev = mp_dispatch(ch)                               # wait for the next completion event
    conn = ev.cookie
    if ev.cmd == ACCEPT:
        mp_accept(conn.handle)                         # post another accept
        conn = new Connection()
        conn.handle = mp_register(ch, ev.fd, cookie=conn)
        mp_read(conn.handle, conn.buf, READSIZE)
    elif ev.cmd == READ:
        mp_write(conn.handle, conn.buf, ev.size)       # echo back what was read
    elif ev.cmd == WRITE:
        mp_read(conn.handle, conn.buf, READSIZE)
    elif ev.cmd == DISCONNECT:
        mp_unregister(ch, conn.handle)
        delete conn
Contribution Breakdown
memcached Latency
[Chart: latency (µs, 0 to 3500) vs. number of concurrent client connections (0 to 1536); 50th and 99th percentile latency for Baseline-FL and MegaPipe-FL.]
Clean-Slate vs. Dirty-Slate
MegaPipe is a clean-slate approach with new APIs: quick prototyping for various optimizations, and the performance improvement is worthwhile!
Can we apply the same techniques back to the BSD Socket API? Each technique has its own challenges, and embracing all of them could be even harder. Future Work™