MegaPipe: A New Programming Interface for Scalable Network I/O


1 MegaPipe: A New Programming Interface for Scalable Network I/O. Sangjin Han, in collaboration with Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. University of California, Berkeley / Yahoo! Research

2 tl;dr? MegaPipe is a new network programming API for message-oriented workloads that avoids the performance issues of the BSD Socket API.

3-4 Two Types of Network Workloads. 1. Bulk-transfer workload: one-way, large data transfer (e.g., video streaming, HDFS). Very cheap: half a CPU core is enough to saturate a 10G link. 2. Message-oriented workload: short connections or small messages (e.g., HTTP, RPC, DB, key-value stores). CPU-intensive!

5-7 BSD Socket API Performance Issues
    n_events = epoll_wait(...);        // wait for I/O readiness
    for (...) {
        new_fd = accept(listen_fd);    // new connection
        bytes = recv(fd2, buf, 4096);  // new data for fd2
    }
Issues with message-oriented workloads: system call overhead, shared listening socket, and file abstraction overhead.
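
For context, here is a minimal sketch of the readiness-model loop that the snippet above abbreviates, using the standard Linux epoll and socket calls (the echo behavior and the omitted error handling are illustrative choices, not from the slides). Every connection and every message costs at least one system call, which is where the overheads listed above accumulate.

    /* Sketch of a readiness-model (epoll) server loop; assumes a non-blocking
     * listen_fd already set up with socket()/bind()/listen(). */
    #include <sys/types.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void event_loop(int listen_fd)
    {
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);        /* one syscall */

        struct epoll_event events[64];
        char buf[4096];
        for (;;) {
            int n = epoll_wait(ep, events, 64, -1);          /* wait for I/O readiness */
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {                       /* new connection */
                    int new_fd = accept(listen_fd, NULL, NULL);
                    ev.events = EPOLLIN;
                    ev.data.fd = new_fd;
                    epoll_ctl(ep, EPOLL_CTL_ADD, new_fd, &ev);
                } else {                                     /* new data on fd */
                    ssize_t bytes = recv(fd, buf, sizeof(buf), 0);
                    if (bytes <= 0) { close(fd); continue; } /* connection teardown */
                    send(fd, buf, bytes, 0);                 /* echo it back */
                }
            }
        }
    }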

8 Microbenchmark: How Bad? RPC-like test on an 8-core Linux server (with epoll), driven by 768 clients. Each connection: new TCP connection, then 64 B request / 64 B response exchanges (10 transactions per connection in the diagram), then teardown. Three variables: 1. message size, 2. connection length, 3. number of cores.

9 1. Small Messages Are Bad. Chart: throughput (Gbps) and CPU usage (%) vs. message size (64 B to 16 KB): low throughput and high CPU overhead for small messages.

10 2. Short Connections Are Bad. Chart: throughput (1M transactions/s) vs. number of transactions per connection: throughput drops sharply for short connections.

11 3. Multi-Core Will Not Help (Much). Chart: throughput (1M transactions/s) vs. number of CPU cores: actual scaling falls well short of ideal scaling.

12 MEGAPIPE DESIGN

13 Design Goals. Concurrency as a first-class citizen. Unified interface for various I/O types: network connections, disk files, pipes, signals, etc. Low overhead and multi-core scalability (the main focus of this presentation).

14 Overview
Problem: Low per-core performance. Cause: System call overhead. Solution: System call batching.
Problem: Poor multi-core scalability. Causes: Shared listening socket; VFS overhead. Solutions: Listening socket partitioning; Lightweight socket.

15 Key Primitives. Handle: similar to a file descriptor (TCP connection, pipe, disk file, ...), but only valid within a channel. Channel: a per-core, bidirectional pipe between user and kernel that multiplexes the I/O operations of its handles.

16 Sketch: How Channels Help (1/3). Diagram: user-level handles are multiplexed over a per-core channel into the kernel, enabling I/O batching.

17-18 Sketch: How Channels Help (2/3). Diagram: with a single shared accept queue, cores 1-3 all contend on accept() for new connections; listening socket partitioning instead gives each core its own queue of new connections.

19-20 Sketch: How Channels Help (3/3). Diagram: regular sockets go through the VFS on every core; lightweight sockets bypass the VFS layer.

21 MegaPipe API Functions. mp_create() / mp_destroy(): create/close a channel. mp_register() / mp_unregister(): register a handle (regular FD or lwsocket) into a channel. mp_accept() / mp_read() / mp_write() / ...: issue an asynchronous I/O command for a given handle. mp_dispatch(): dispatch an I/O completion event from a channel.
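
As a rough illustration of how these functions compose (the full pseudocode is the ping-pong server in the backup slides), the following C-flavored sketch shows a completion-dispatch loop. The mp_* names come from the slides, but the mp_channel/mp_handle types, the struct mp_event fields, the MP_* command constants, and the argument orders are all assumptions made for this sketch.

    /* Hypothetical C rendering of a MegaPipe completion loop; types, event
     * fields, and MP_* constants are assumed, not the real MegaPipe header. */
    #include <stdlib.h>

    #define READSIZE 4096                            /* arbitrary read size */

    struct conn { mp_handle handle; char buf[READSIZE]; };

    static void dispatch_loop(mp_channel *ch, mp_handle listen_handle)
    {
        for (;;) {
            struct mp_event ev = mp_dispatch(ch);    /* next completion event */
            struct conn *c = ev.cookie;
            switch (ev.cmd) {
            case MP_ACCEPT:                          /* a new connection arrived */
                mp_accept(listen_handle);            /* keep accepting */
                c = malloc(sizeof(*c));
                c->handle = mp_register(ch, ev.fd, c /* cookie */, 0 /* no mask */);
                mp_read(c->handle, c->buf, READSIZE);
                break;
            case MP_READ:                            /* read completed: echo it */
                mp_write(c->handle, c->buf, ev.size);
                break;
            case MP_WRITE:                           /* write completed: read again */
                mp_read(c->handle, c->buf, READSIZE);
                break;
            case MP_DISCONNECT:                      /* peer went away */
                mp_unregister(ch, c->handle);
                free(c);
                break;
            }
        }
    }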

22 Completion Notification Model
BSD Socket API, Wait-and-Go (readiness model):
    epoll_ctl(fd1, EPOLLIN); epoll_ctl(fd2, EPOLLIN);
    epoll_wait(...);
    ret1 = recv(fd1, ...); ret2 = recv(fd2, ...);
MegaPipe, Go-and-Wait (completion notification):
    mp_read(handle1, ...); mp_read(handle2, ...);
    ev = mp_dispatch(channel);
    ev = mp_dispatch(channel);
Advantages of the completion model: batching, easy and intuitive, and compatible with disk files.

23 1. I/O Batching. Transparent batching that exploits the parallelism of independent handles. Diagram: the application issues non-batched MegaPipe API calls (read data from handle 6, accept a new connection, read data from handle 3, write data to handle 5); the MegaPipe user-level library forwards them to the MegaPipe kernel module as batched system calls, and completions (new connection arrived, write done to handle 5, read done from handle 6) come back batched as well.
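
One way to picture the transparent batching (a sketch under assumptions, not the actual MegaPipe ABI): the user-level library only queues each command locally, and a single kernel crossing later submits the whole batch and collects a batch of completions.

    /* Hypothetical sketch of command batching in a user-level library.
     * Struct layouts and the MP_SUBMIT_BATCH ioctl are invented for
     * illustration; the real MegaPipe channel interface may differ. */
    #include <stddef.h>
    #include <sys/ioctl.h>

    #define MAX_BATCH       64
    #define MP_SUBMIT_BATCH 0xA0                   /* hypothetical request code */

    struct mp_cmd   { int op; int handle; void *buf; size_t len; };
    struct mp_batch { int n_cmds; struct mp_cmd cmds[MAX_BATCH]; };
    struct channel  { int kernel_fd; struct mp_batch pending; };

    /* One kernel crossing submits every pending command at once, amortizing
     * the per-system-call cost across independent handles. */
    static void flush_batch(struct channel *ch)
    {
        if (ch->pending.n_cmds == 0)
            return;
        ioctl(ch->kernel_fd, MP_SUBMIT_BATCH, &ch->pending);
        ch->pending.n_cmds = 0;
    }

    /* mp_read()/mp_write()/mp_accept() in the library just append a command;
     * nothing enters the kernel until mp_dispatch() (or a full queue) flushes. */
    static void queue_cmd(struct channel *ch, struct mp_cmd c)
    {
        if (ch->pending.n_cmds == MAX_BATCH)
            flush_batch(ch);
        ch->pending.cmds[ch->pending.n_cmds++] = c;
    }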

24-26 2. Listening Socket Partitioning. A per-core accept queue for each channel, instead of the globally shared accept queue. Diagram: mp_register() attaches the listening socket to each channel, giving that channel its own accept queue in the kernel; mp_accept() then takes new connections from the channel's own queue.
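
A brief sketch of how an application could exploit this (mp_* signatures assumed, as before): each worker thread creates its own channel and registers the shared listening socket with its CPU id as the partitioning mask, so accepted connections land in that core's private queue.

    /* Hypothetical per-core setup for listening-socket partitioning. */
    #include <pthread.h>

    static int listen_sd;                        /* shared listening socket */

    static void *worker_thread(void *arg)
    {
        int cpu = (int)(long)arg;
        mp_channel *ch = mp_create();
        /* Register the listening socket on this channel only, keyed to this
         * core, so it gets its own accept queue instead of the shared one. */
        mp_handle lh = mp_register(ch, listen_sd, NULL /* cookie */, cpu /* mask */);
        mp_accept(lh);                           /* async accept on this core's queue */
        dispatch_loop(ch, lh);                   /* loop as in the earlier sketch */
        return NULL;
    }

    /* In main(), one worker per core, e.g.:
     *   for (int i = 0; i < ncores; i++)
     *       pthread_create(&tid[i], NULL, worker_thread, (void *)(long)i);
     */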

27-30 3. lwsocket: Lightweight Socket. A common-case optimization for sockets: sockets are ephemeral and rarely shared, so bypass the VFS layer and convert to a regular file descriptor only when necessary. Diagram: a regular socket sits behind a file descriptor, a file instance (state), a dentry, and an inode before reaching the TCP socket; dup()/fork() adds another descriptor and open() adds another file instance to that chain, which is why sockets normally carry the full VFS machinery. An lwsocket instead maps the handle directly to the TCP socket.
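
To make the VFS bypass concrete, here is a heavily simplified, hypothetical kernel-side sketch: an lwsocket handle is just an index into a per-channel table that points straight at the TCP socket object, so no file instance, dentry, or inode is allocated in the common case. The names below are illustrative only and do not reflect the actual Linux or MegaPipe data structures.

    /* Hypothetical sketch of lwsocket lookup inside the kernel module. */
    #define MAX_LWSOCKETS 1024

    struct channel_ctx {
        struct socket *lwsockets[MAX_LWSOCKETS];   /* direct socket pointers */
    };

    /* Resolve a handle without touching the process fd table or the VFS
     * (no file instance, dentry, or inode on this path). */
    static struct socket *lwsocket_lookup(struct channel_ctx *ch, int handle)
    {
        if (handle < 0 || handle >= MAX_LWSOCKETS)
            return NULL;
        return ch->lwsockets[handle];
    }

    /* Only when the application actually needs a regular descriptor (e.g. to
     * share the socket via dup()/fork(), or hand it to legacy code) would the
     * lwsocket be converted into a full VFS-backed file descriptor. */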

31 EVALUATION

32-34 Microbenchmark 1/2. Chart: throughput improvement (%) over baseline vs. number of CPU cores, for various message sizes (64 B, 256 B, 512 B, 1 KiB, 4 KiB).

35 Microbenchmark 2/2. Charts: parallel speedup (multi-core scalability) vs. number of CPU cores for various connection lengths (number of transactions per connection), baseline vs. MegaPipe.

36 Macrobenchmark. memcached: in-memory key-value store; limited scalability, because the object store is shared by all cores under a global lock. nginx: web server; highly scalable, since nothing is shared across cores except the listening socket.

37 memcached. memaslap with 90% GET, 10% SET, 64 B keys, 1 KB values. Chart: throughput (1k requests/s) vs. number of requests per connection, MegaPipe vs. baseline; MegaPipe's advantage (labeled 3.9x, 2.4x, 1.3x) shrinks as connections get longer and the global lock becomes the bottleneck.

38 memcached. Same memaslap workload (90% GET, 10% SET, 64 B keys, 1 KB values). Chart: throughput (1k requests/s) vs. number of requests per connection for MegaPipe-FL, Baseline-FL, MegaPipe, and Baseline.

39 nginx. Workload based on Yahoo! HTTP traces: 6.3 KiB objects, 2.3 transactions per connection on average. Chart: throughput (Gbps) and improvement (%) vs. number of CPU cores, MegaPipe vs. baseline.

40 CONCLUSION

41 Related Work. Batching: [FlexSC, OSDI 10] and [libflexsc, ATC 11] propose exception-less system calls; MegaPipe also addresses the scalability issues. Partitioning: [Affinity-Accept, EuroSys 12] provides a per-core accept queue; MegaPipe provides explicit control over partitioning. VFS scalability: [Mosbench, OSDI 10]; MegaPipe bypasses the issues rather than mitigating them.

42 Conclusion. Short connections or small messages cause high CPU overhead and scale poorly with multi-core CPUs. MegaPipe's key abstraction is the per-core channel, which enables three optimization opportunities: batching, partitioning, and lwsocket. Results: 15+% improvement for memcached, 75% for nginx.

43 BACKUP SLIDES

44 Small Messages with MegaPipe. Chart: throughput (1M transactions/s) vs. number of transactions per connection, baseline vs. MegaPipe.

45 1. Small Messages Are Bad → Why? The number of messages matters, not the volume of traffic: per-message cost >>> per-byte cost (a 1 KB message is only 2% more expensive than a 64 B message). A 10G link with 1 KB messages means roughly 1M IOPS, and thus 1M+ system calls. System calls are expensive [FlexSC, 2010]: mode switching between kernel and user, and CPU cache pollution.

46 Short Connections with MegaPipe. Chart: throughput (Gbps) and CPU usage (%) vs. message size (up to 16 KB), baseline vs. MegaPipe.

47 2. Short Connections Are Bad → Why? Connection establishment is expensive: three-way handshake / four-way teardown, more packets, and more system calls (accept(), epoll_ctl(), close(), ...). In addition, a socket is represented as a file in UNIX, adding file and VFS overhead.
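
To isolate the per-connection slice of the readiness-model loop sketched earlier, the server-side system call cost of a short connection looks roughly like this (standard BSD/epoll calls; the exact sequence depends on the application):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Per-connection system call overhead under the readiness model, on top
     * of the handshake/teardown packets themselves (sketch, no error handling). */
    static void handle_one_connection(int ep, int listen_fd)
    {
        int fd = accept(listen_fd, NULL, NULL);             /* 1: accept() */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);              /* 2: epoll_ctl(ADD) */
        /* ... then one epoll_wait() wakeup plus a recv()/send() pair for each
         * of the few transactions this short connection carries ... */
        close(fd);                                          /* 3: close(); the kernel
                                                               also drops the epoll entry */
    }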

48 Multi-Core Scalability with MegaPipe. Chart: throughput (1M transactions/s) and per-core efficiency (%) vs. number of CPU cores, baseline vs. MegaPipe.

49 3. Multi-Core Will Not Help → Why? Shared accept queue issues [Affinity-Accept, 2012]: contention on the listening socket and poor connection affinity. Diagram: all cores call accept() on one listening socket while TX/RX processing happens in the kernel.

50 3. Multi-Core Will Not Help → Why? File/VFS multi-core scalability issues. Diagram: the process file descriptor table is shared by threads, and the VFS objects in front of the TCP socket (file instance, dentry, inode) are globally visible in the system.

51 Overview. Architecture diagram: on each core (1..N), a user application thread uses the MegaPipe API through handles provided by the MegaPipe user-level library; the per-core channel carries batched asynchronous I/O commands down and batched completion events up. Inside the Linux kernel, each channel instance holds lwsocket handles, file handles, and pending completion events, sitting above the VFS and TCP/IP stacks.

52 Ping-Pong Server Example
    ch = mp_create()
    handle = mp_register(ch, listen_sd, mask=my_cpu_id)
    mp_accept(handle)
    while true:
        ev = mp_dispatch(ch)
        conn = ev.cookie
        if ev.cmd == ACCEPT:
            mp_accept(conn.handle)
            conn = new Connection()
            conn.handle = mp_register(ch, ev.fd, cookie=conn)
            mp_read(conn.handle, conn.buf, READSIZE)
        elif ev.cmd == READ:
            mp_write(conn.handle, conn.buf, ev.size)
        elif ev.cmd == WRITE:
            mp_read(conn.handle, conn.buf, READSIZE)
        elif ev.cmd == DISCONNECT:
            mp_unregister(ch, conn.handle)
            delete conn

53 Contribution Breakdown

54 memcached Latency. Chart: latency (µs) vs. number of concurrent client connections, showing 99th and 50th percentile latency for Baseline-FL and MegaPipe-FL.

55 Clean-Slate vs. Dirty-Slate. MegaPipe is a clean-slate approach with new APIs: it allows quick prototyping of various optimizations, and the performance improvement is worthwhile. Can we apply the same techniques back to the BSD Socket API? Each technique has its own challenges, and embracing all of them could be even harder. Future Work (TM).
