Exception-Less System Calls for Event-Driven Servers

Similar documents
FlexSC. Flexible System Call Scheduling with Exception-Less System Calls. Livio Soares and Michael Stumm. University of Toronto

A Scalable Event Dispatching Library for Linux Network Servers

MegaPipe: A New Programming Interface for Scalable Network I/O

CISC2200 Threads Spring 2015

I/O Models. Kartik Gopalan

Asynchronous Events on Linux

Assignment 2 Group 5 Simon Gerber Systems Group Dept. Computer Science ETH Zurich - Switzerland

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

BENCHMARKING LIBEVENT AGAINST LIBEV

Introduction to Asynchronous Programming Fall 2014

Servers: Concurrency and Performance. Jeff Chase Duke University

SCONE: Secure Linux Container Environments with Intel SGX

Prepared by Prof. Hui Jiang Process. Prof. Hui Jiang Dept of Electrical Engineering and Computer Science, York University

Process. Prepared by Prof. Hui Jiang Dept. of EECS, York Univ. 1. Process in Memory (I) PROCESS. Process. How OS manages CPU usage? No.

PROCESSES AND THREADS THREADING MODELS. CS124 Operating Systems Winter , Lecture 8

An AIO Implementation and its Behaviour

I/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13

WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS

Chapter 4: Threads. Operating System Concepts. Silberschatz, Galvin and Gagne

Input / Output. Kevin Webb Swarthmore College April 12, 2018

Inter-Syscall Parallelism

For use by students enrolled in #71251 CSE430 Fall 2012 at Arizona State University. Do not use if not enrolled.

Capriccio : Scalable Threads for Internet Services

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Comparing and Measuring Network Event Dispatch Mechanisms in Virtual Hosts

Storage Performance Development Kit (SPDK) Daniel Verkamp, Software Engineer

Operating Systems CMPSCI 377 Spring Mark Corner University of Massachusetts Amherst

Operating Systems: Internals and Design Principles. Chapter 4 Threads Seventh Edition By William Stallings

Operating Systems. Operating System Structure. Lecture 2 Michael O Boyle

CSC209H Lecture 11. Dan Zingaro. March 25, 2015

Chapter 13: I/O Systems

COMP/ELEC 429/556 Introduction to Computer Networks

December 8, epoll: asynchronous I/O on Linux. Pierre-Marie de Rodat. Synchronous/asynch. epoll vs select and poll

Concurrent Programming

Kernel Synchronization I. Changwoo Min

OPERATING SYSTEM. Chapter 4: Threads

Master s Thesis (Academic Year 2015) Improving TCP/IP stack performance by fast packet I/O framework

Process Scheduling Queues

Eleos: Exit-Less OS Services for SGX Enclaves

Thread. Disclaimer: some slides are adopted from the book authors slides with permission 1

Chapter 4: Multithreaded Programming. Operating System Concepts 8 th Edition,

Operating System. Chapter 4. Threads. Lynn Choi School of Electrical Engineering

EbbRT: A Framework for Building Per-Application Library Operating Systems

Operating Systems 2010/2011

Guido van Rossum 9th LASER summer school, Sept. 2012

MegaPipe: A New Programming Interface for Scalable Network I/O

Beyond Block I/O: Rethinking

Chapter 4: Multithreaded

CSE 333 Lecture non-blocking I/O and select

CSC369 Lecture 2. Larry Zhang

FlexSC: Flexible System Call Scheduling with Exception-Less System Calls

I/O. Disclaimer: some slides are adopted from book authors slides with permission 1

Device-Functionality Progression

Chapter 12: I/O Systems. I/O Hardware

Python Twisted. Mahendra

CSci 4061 Introduction to Operating Systems. (Thread-Basics)

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

Threads. CS3026 Operating Systems Lecture 06

Speeding up Linux TCP/IP with a Fast Packet I/O Framework

I/O Systems. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

The Performance of µ-kernel-based Systems

Network File System (NFS)

Improving Scalability of Processor Utilization on Heavily-Loaded Servers with Real-Time Scheduling

Anne Bracy CS 3410 Computer Science Cornell University

Overview. Thread Packages. Threads The Thread Model (1) The Thread Model (2) The Thread Model (3) Thread Usage (1)

Network File System (NFS)

Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services

Concurrent Programming

Chapter 4: Threads. Chapter 4: Threads

Lecture 8: Other IPC Mechanisms. CSC 469H1F Fall 2006 Angela Demke Brown

Topics. Lecture 8: Other IPC Mechanisms. Socket IPC. Unix Communication

RxNetty vs Tomcat Performance Results

SPIN Operating System

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

SPDK NVMe In-depth Look at its Architecture and Design Jim Harris Intel

Multithreaded Programming

CSC369 Lecture 2. Larry Zhang, September 21, 2015

Kernel Bypass. Sujay Jayakar (dsj36) 11/17/2016

Processes and Threads

Comprehensive Kernel Instrumentation via Dynamic Binary Translation

Lecture 21 Concurrent Programming

Remote Procedure Call (RPC) and Transparency

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

The benefits and costs of writing a POSIX kernel in a high-level language

File Systems: Consistency Issues

Chapter 4: Threads. Chapter 4: Threads

Definition Multithreading Models Threading Issues Pthreads (Unix)

Chapter 4: Threads. Operating System Concepts 9 th Edition

Threads. What is a thread? Motivation. Single and Multithreaded Processes. Benefits

ECE 650 Systems Programming & Engineering. Spring 2018

CSE 333 Lecture fork, pthread_create, select

User-level threads. with threads. Paul Turner

Background: Operating Systems

W4118: interrupt and system call. Junfeng Yang

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

One Server Per City: Using TCP for Very Large SIP Servers. Kumiko Ono Henning Schulzrinne {kumiko,

CS370 Operating Systems

CS240: Programming in C

Operating Systems (2INC0) 2018/19. Introduction (01) Dr. Tanir Ozcelebi. Courtesy of Prof. Dr. Johan Lukkien. System Architecture and Networking Group

Outline. Overview. Linux-specific, since kernel 2.6.0

Transcription:

Exception-Less System Calls for Event-Driven Servers Livio Soares and Michael Stumm University of Toronto

Talk overview At OSDI'10: exception-less system calls Technique targeted at highly threaded servers Doubled performance of Apache Event-driven servers are popular Faster than threaded servers We show that exception-less system calls make event-driven server faster memcached speeds up by 25-35% nginx speeds up by 70-120% 2

Event-driven server architectures Supports I/O concurrency with a single execution context Alternative to thread based architectures At a high-level: Divide program flow into non-blocking stages After each stage register interest in event(s) Notification of event is asynchronous, driving next stage in the program flow To avoid idle time, applications multiplex execution of multiple independent stages 3

Example: simple network server void server() { fd = accept(); read(fd); write(fd); close(fd); } 4

Example: simple network server void server() { S1 fd = accept(); S2 read(fd); S3 write(fd); S4 close(fd); S5 } S1 UNIX options: S2 S3 S4 Non-blocking I/O poll() select() epoll() Async I/O S5 5

Requests/sec. Performance: events vs. threads 14000 12000 10000 8000 6000 4000 2000 0 0 ApacheBench nginx (events) Apache (threads) 100 200 300 400 500 Concurrency nginx delivers 1.7x the throughput of Apache; gracefully copes with high loads 6

Issues with UNIX event primitives Do not cover all system calls Mostly work with file-descriptors (files and sockets) Overhead Tracking progress of I/O involves both application and kernel code Application and kernel communicate frequently Previous work shows that fine-grain mode switching can half processor efficiency 7

FlexSC component overview FlexSC and FlexSC-Threads presented at OSDI 2010 This work: libflexsc for event-driven servers 1) memcached throughput increase of up to 35% 2) nginx throughput increase of up to 120% 8

Benefits for event-driven applications 1) General purpose Any/all system calls can be asynchronous 2) Non-intrusive kernel implementation Does not require per syscall code 3) Facilitates multi-processor execution OS work is automatically distributed 4) Improved processor efficiency Reduces frequent user/kernel mode switches 9

Summary of exception-less syscalls 10

Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; SUBMIT while (entry->status!= DONE) DONE do_something_else(); return entry->return_code; 11

Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; SUBMIT SUBMIT while (entry->status!= DONE) DONE do_something_else(); return entry->return_code; 12

Exception-less interface: syscall page write(fd, buf, 4096); entry = free_syscall_entry(); /* write syscall */ entry->syscall = 1; entry->num_args = 3; entry->args[0] = fd; entry->args[1] = buf; entry->args[2] = 4096; entry->status = SUBMIT; SUBMIT DONE while (entry->status!= DONE) DONE do_something_else(); return entry->return_code; 13

Syscall threads Kernel-only threads Part of application process Execute requests from syscall page Schedulable on a per-core basis 14

Dynamic multicore specialization Core 0 Core 1 Core 2 Core 3 1) FlexSC makes specializing cores simple 2) Dynamically adapts to workload needs 15

libflexsc: async syscall library Async syscall and notification library Similar to libevent But operates on syscalls instead of file-descriptors Three main components: 1) Provides main loop (dispatcher) 2) Support asynchronous syscall with associated callback to notify completion 3) Cancellation support 16

Main API: async system call 1 struct flexsc_cb { 2 void (*callback)(struct flexsc_cb *); /* event handler */ 3 void *arg; /* auxiliary var */ 4 int64_t ret; /* syscall return */ 5 } 6 7 int flexsc_##syscall(struct flexsc_cb *, /*syscall args*/); 8 /* Example: asynchronous accept */ 9 struct flexsc_cb cb; 10 cb.callback = handle_accept; 11 flexsc_accept(&cb, master_sock, NULL, 0); 12 13 void handle_accept(struct flexsc_cb *cb) { 14 int fd = cb >ret; 15 if (fd!= 1) { 16 struct flexsc_cb read_cb; 17 read_cb.callback = handle_read; 18 flexsc_read(&read_cb, fd, read_buf, read_count); 19 } 20 } 17

memcached port to libflexsc memcached: in-memory key/value store Simple code-base: 8K LOC Uses libevent Modified 293 LOC Transformed libevent calls to libflexsc Mostly in one file: memcached.c Most memcached syscalls are socket based 18

nginx port to libflexsc Most popular event-driven webserver Code base: 82K LOC Natively uses both non-blocking (epoll) I/O and asynchronous I/O Modified 255 LOC Socket based code already asynchronous Not all file-system calls were asynchronous e.g., open, fstat, getdents Special handling of stack allocated syscall args 19

Evaluation Linux 2.6.33 Nehalem (Core i7) server, 2.3GHz 4 cores Client connected through 1Gbps network Workloads memslap on memcached (30% user, 70% kernel) httperf on nginx (25% user, 75% kernel) Default Linux ( epoll ) vs. libflexsc ( flexsc ) 20

memcached on 4 cores Throughput (requests/sec.) 140000 120000 100000 30% improvement 80000 60000 40000 flexsc epoll 20000 0 0 200 400 600 800 1000 Request Concurrency 21

memcached processor metrics Relative Performance (flexsc/epoll) 1.2 Kernel User 1 0.8 0.6 0.4 0.2 0 L2 CPI i-cache d-cache L2 CPI i-cache d-cache 22

httperf on nginx (1 core) Throughput (Mbps) 120 flexsc epoll 100 80 60 40 100% improvement 20 0 0 10000 20000 30000 40000 50000 60000 Requests/s 23

nginx processor metrics Relative Performance (flexsc/epoll) 1.2 User 1 Kernel 0.8 0.6 0.4 0.2 0 L2 CPI i-cache d-cache Branch CPI d-cache Branch L2 i-cache 24

Concluding remarks Current event-based primitives add overhead I/O operations require frequent communication between OS and application libflexsc: exception-less syscall library 1) General purpose 2) Non-intrusive kernel implementation 3) Facilitates multi-processor execution 4) Improved processor efficiency Ported memcached and nginx to libflexsc Performance improvements of 30-120% 25

Exception-Less System Calls for Event-Driven Servers Livio Soares and Michael Stumm University of Toronto

Backup Slides

Difference in improvements Why does nginx improve more than memcached? 1) Frequency of mode switches: Server Frequency of syscalls (in instructions) memcached nginx 3,750 1,460 2) nginx uses greater diversity of system calls More interference in processor structures (caches) 3) Instruction count reduction nginx with epoll() has connection timeouts 28

Limitations Scalability (number of outstanding syscalls) Interface: operations perform linear scan Implementation: overheads of syscall threads non-negligible Solutions Throttle syscalls at application or OS Switch interface to scalable message passing Provide exception-less versions of async I/O Make kernel fully non-blocking 29

Latency (ApacheBench) 30 epoll flexsc Latency (ms) 25 20 15 50% latency reduction 10 5 0 1 core 2 cores 30