A Scalable Event Dispatching Library for Linux Network Servers


A Scalable Event Dispatching Library for Linux Network Servers
Hao-Ran Liu and Tien-Fu Chen, Dept. of CSIE, National Chung Cheng University

Traditional server: Multiple Process (MP) server
A dedicated process serves every connection; concurrency is provided by the OS.
Disadvantages: context-switching overhead, synchronization overhead, higher TLB miss rate.
Example: Apache. (Figure: two processes above the kernel, each serving one connection.)

Modern server: Single Process Event Driven (SPED) server
User-level concurrency within a single process, using nonblocking I/O and a mechanism to detect I/O events.
Disadvantages: performance is limited by the inefficient design of the event dispatching mechanism in the OS kernel, and a single page fault or disk read suspends the whole server.
Example: the Squid web cache.
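As a minimal illustration (not taken from the slides), a SPED server marks each connection socket nonblocking before registering it with the event-dispatch mechanism; the helper name below is our own:

    /* Sketch: make a connection socket nonblocking, as a SPED server would
     * do before handing it to the event-dispatch loop. */
    #include <fcntl.h>

    static int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }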

Event-driven server architecture
(Figure: a single process above the kernel holds one user-thread or connection-processing context per connection.)

Problem statement and our goal
The problem: the inefficient design of the event dispatching mechanism.
Our goal: improve the performance of SPED servers from user mode.

A list of event dispatching mechanisms
Scalable: /dev/poll (Solaris 8), RT signals (Linux), I/O completion ports (Windows 2000).
Non-scalable: select() (POSIX), poll() (POSIX).

poll() and select(): event dispatching mechanisms in Linux
select() and poll() scale poorly with a large set of file descriptors: 30% of CPU time is spent in select() in an overloaded proxy server (Banga 99).

Interface of select() and poll()
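The body of this slide is not preserved in the transcription. For reference, a minimal call of each POSIX interface looks roughly like this; the wrapper names are ours:

    #include <sys/select.h>
    #include <poll.h>

    /* select(2): int select(int nfds, fd_set *r, fd_set *w, fd_set *e, struct timeval *tv);
     * poll(2):   int poll(struct pollfd *fds, nfds_t nfds, int timeout); */

    int wait_readable_select(int fd)
    {
        fd_set readfds;
        struct timeval tv = { 1, 0 };          /* 1 second timeout */
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        return select(fd + 1, &readfds, NULL, NULL, &tv);
    }

    int wait_readable_poll(int fd)
    {
        struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
        return poll(&p, 1, 1000);              /* timeout in milliseconds */
    }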

Source of overhead on poll(), step by step
Read/write handler runs -> poll() is called (array copy into the kernel) -> linear scan in the kernel, calling the driver's poll callback -> poll() returns (array copy back) -> linear scan in the application -> read or write handler runs.
(Figure: the pollfd array (ufds) crosses the user/kernel boundary to the device driver's poll() and the results come back as ev_array.)

Source of overhead on poll()
Linear scan of ev_array and a call to each device driver's poll callback function.
Linear scan in the server application to detect network events.
Most of this work is wasted on idle connections.

POSIX.4 Real Time Signals: event dispatching mechanisms in Linux
The POSIX.4 RT extension allows signal delivery with a payload (sigwaitinfo(), sigtimedwait()); the payload can carry the sender PID, UID, etc.
The Linux extension of POSIX.4 RT signals allows delivery of socket readiness via a particular real-time signal.
Pro: official support in the Linux kernel since version 2.4.
Con: Linux-specific.
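A sketch of how this mechanism is typically driven on Linux 2.4 (not code from the paper; error handling is omitted and the choice of SIGRTMIN+1 is arbitrary):

    #define _GNU_SOURCE            /* for F_SETSIG */
    #include <fcntl.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    #define IO_SIG (SIGRTMIN + 1)

    /* Block IO_SIG so it is consumed by sigtimedwait(), not by a handler. */
    static void block_io_signal(void)
    {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, IO_SIG);
        sigprocmask(SIG_BLOCK, &set, NULL);
    }

    /* Ask the kernel to deliver readiness of fd as a queued RT signal. */
    static void arm_rt_signal(int fd)
    {
        fcntl(fd, F_SETOWN, getpid());       /* deliver signals to this process */
        fcntl(fd, F_SETSIG, IO_SIG);         /* use an RT signal, not plain SIGIO */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_ASYNC | O_NONBLOCK);
    }

    /* One dispatch step: dequeue a single event; the payload identifies the fd. */
    static void wait_one_event(void)
    {
        sigset_t set;
        siginfo_t info;
        struct timespec ts = { 1, 0 };       /* 1 second timeout */

        sigemptyset(&set);
        sigaddset(&set, IO_SIG);
        if (sigtimedwait(&set, &info, &ts) == IO_SIG) {
            int fd      = info.si_fd;        /* which descriptor */
            long events = info.si_band;      /* POLLIN/POLLOUT-style bits */
            (void)fd; (void)events;          /* read or write the fd here */
        }
    }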

Flowchart of POSIX.4 Real Time Signals
(Figure: network events are queued in the kernel RT signal queue as (signal number, fd number, event type) entries; the user process dequeues them with sigtimedwait() and then reads or writes the fd.)

Problems in RT signals
Edge-triggered readiness notification: the signal queue may contain multiple events for one fd, including stale events.
The kernel signal queue is limited to 2048 entries; when the RT signal queue overflows, some connections can fall into a deadlocked state.
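Because notification is edge-triggered, a queued event may already be stale when it is dequeued, so read/write handlers must treat EAGAIN as normal. A sketch with hypothetical helper names:

    #include <errno.h>
    #include <unistd.h>

    void process_request(int fd, const char *buf, ssize_t len);   /* hypothetical */

    void handle_readable(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0) {
                process_request(fd, buf, n);
            } else if (n == 0) {
                close(fd);                    /* peer closed the connection */
                return;
            } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return;                       /* stale or already-drained event */
            } else {
                close(fd);                    /* real error */
                return;
            }
        }
    }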

Solutions to RT signal queue overflow
Application solution: the server falls back to traditional select() or poll().
Kernel solution: signal-per-fd [Chandra 01] collapses events of the same file descriptor, so the signal queue size equals the maximum number of files a process can open and the queue can never overflow.
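A sketch of the application-side fallback, under the assumption that the server treats a plain SIGIO (which Linux delivers when an RT signal cannot be queued) as the overflow indicator; both helpers are hypothetical:

    #include <signal.h>
    #include <string.h>

    void drain_rt_signal_queue(void);       /* hypothetical: discard queued events */
    void scan_all_fds_with_select(void);    /* hypothetical: one full select()/poll() pass */

    static volatile sig_atomic_t rtsig_overflow;

    static void sigio_handler(int sig)
    {
        (void)sig;
        rtsig_overflow = 1;                 /* RT signal queue overflowed */
    }

    static void install_overflow_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = sigio_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGIO, &sa, NULL);
    }

    /* Called once per event-loop iteration. */
    static void recover_if_overflowed(void)
    {
        if (rtsig_overflow) {
            rtsig_overflow = 0;
            drain_rt_signal_queue();
            scan_all_fds_with_select();
        }
    }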

Our solution to reduce poll() overhead: a scalable event dispatching library
Motivation: web connections are idle most of the time, and the network events within a single HTTP transaction are bursty.
Our strategy: save the time wasted on idle connections for more useful work. The frequency of calling poll() on an idle connection is decreased, so the average number of file descriptors passed to each poll() drops. We exploit the temporal locality among events on a connection to reduce the frequency of calling poll() on its fd.

Live counter and polling sets: keys to exploiting locality in poll()
A live counter is associated with each file descriptor. On every poll() of the fd, the counter is incremented if an event is detected and decremented otherwise.
All fds in the server are divided into three sets according to their live counters, and each set is polled at a different frequency.
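A hypothetical sketch of the live-counter rule; the slides give the rule but not the code, so the names and thresholds below are assumptions:

    enum fdm_set { FDM_ACTIVE, FDM_DOZE, FDM_IDLE };

    struct fdm_entry {
        int          fd;
        int          live;      /* live counter, starts at the default value N */
        enum fdm_set set;       /* which polling set the fd currently sits in */
    };

    #define FDM_DEFAULT_LIVE 4  /* assumed default value of N */

    /* Called after each poll() that included this fd. */
    static void fdm_update_live(struct fdm_entry *e, int had_event)
    {
        if (had_event)
            e->live++;
        else if (e->live > 0)
            e->live--;

        /* Assumed classification: high counter -> active, zero -> idle. */
        if (e->live >= FDM_DEFAULT_LIVE)
            e->set = FDM_ACTIVE;
        else if (e->live > 0)
            e->set = FDM_DOZE;
        else
            e->set = FDM_IDLE;
    }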

Architecture view of the event dispatching library (FDM)
The library is modular. A hash table (key = fd, value = which set) tracks set membership. The interface functions fdm_reg_fd(int fd), fdm_update_fd_mode(struct pollfd *fds, void *payload) and fdm_wait_event() manage the active, doze and idle polling sets in user space; the actual polling set is handed to sys_poll() in kernel space.

Library interface design logic
Reduce copying of the fd array: the interesting set is built gradually.
A separate routine fetches events, reducing the linear scan of the fd array in server code.
The timeout timestamp and payload associated with a file descriptor are maintained inside the library.
The listening socket descriptor can be locked into the active polling set.

Proposed library interface
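The contents of this slide were not preserved in the transcription. Based on the function names shown on the architecture slide, the interface is roughly of the following shape; the exact signatures and return types are assumptions:

    #include <poll.h>

    /* Register a (listening or connection) descriptor with the library. */
    int fdm_reg_fd(int fd);

    /* Update the interest set (events) and the per-fd payload for a descriptor. */
    int fdm_update_fd_mode(struct pollfd *fds, void *payload);

    /* Block until at least one event is ready; ready descriptors and their
     * payloads are then fetched through a separate retrieval routine. */
    int fdm_wait_event(void);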

Performance evaluation
We compare the event dispatching mechanisms available in Linux 2.4, including our FDM library.
Test server: dphttpd + event threading + FDM; all tests use the multi-accept implementation; Pentium III 600 MHz, 128 MB RAM.
Test clients: httperf, an HTTP workload generator, on three Pentium III 1 GHz machines with 512 MB RAM.

Dispatching overhead
Goal: observe the overhead under a large number of idle connections, with the request rate fixed at a fairly light load.

Server reply rate with fixed light load
(Figure: reply rate versus number of idle connections; 1000 req/s, N=3, 1 KB reply size; comparing poll, FDM and RT signals.)

Server response time with fixed light load
(Figure: response time versus number of idle connections; 1000 req/s, N=3, 1 KB reply size; comparing poll, FDM and RT signals.)
Annotations: FDM is more scalable than poll; FDM's polling-set maintenance overhead >= the polling time saved; RT signals still need to maintain a timeout array.

Dispatching throughput
Goal: observe the throughput under a fixed number of idle connections and an overloaded request rate.

Server reply rate with 1000 idle connections
(Figure: reply rate versus request rate; 1000 idle connections, N=3, 1 KB reply size; comparing poll, FDM and RT signals (poll).)
The 100 Mb Ethernet saturates; poll performs well because of multi-accept.

Server response time with 1000 idle connections
(Figure: response time versus request rate; 1000 idle connections, N=3, 1 KB reply size; comparing poll, FDM and RT signals (poll).)
Even at light load the overhead of 1000 idle connections is visible; FDM suffers too, since it depends on poll(). Note the difference between FDM and poll.

Server reply rate with 6000 idle connections
(Figure: reply rate versus request rate; 6000 idle connections, N=3, 1 KB reply size; comparing poll, FDM, RT signals (poll) and RT signals (select).)
RTsig+select is better than RTsig+poll, but signal queue overflow still limits scalability. The RT signal queue starts to overflow here.

Server response time with 6000 idle connections
(Figure: response time versus request rate; 6000 idle connections, N=3, 1 KB reply size; comparing poll, FDM, RT signals (poll) and RT signals (select).)
When the signal queue overflows, the server falls back to poll(); the poll() overhead at 6000 idle connections is much larger, which is why this behavior cannot be observed at 1000 idle connections.

Summary of performance evaluation
At light load, FDM incurs low overhead. At heavy load, FDM improves on poll(). The fallback protection against RT signal queue overflow should use select(), not poll().

Conclusion
The FDM library improves on poll() by exploiting the temporal locality of events on a file descriptor.
It performs better than RT signals when the RT signal queue overflows.
For recovery from RT signal queue overflow, select() is a better choice than poll().
The FDM library and the test web server are available at http://arch1.cs.ccu.edu.tw/~lhr89/fdm/

Backup slides follow

Keys to scalable event dispatching mechanisms
The interesting set of file descriptors is built gradually inside the kernel, and building the interesting set is separated from event retrieval: a file descriptor is returned only if there is an event.
Events are collected efficiently: cache the device driver's poll result instead of running the driver's poll() at every poll(), or maintain an event queue and collect events gradually at every call to the TCP/IP event handler.

Summary of event dispatching mechanisms

Features: (1) scalable to a large set of file descriptors, (2) event collapsing,
(3) dequeues multiple events per system call, (4) event queue can overflow,
(5) kernel falls back to traditional poll() when the queue overflows,
(6) returns the initial state when interest in an fd is declared,
(7) multiple interest sets maintained in the kernel per process.

Mechanism         (1)  (2)  (3)  (4)  (5)  (6)  (7)
select()          No   NA   Yes  NA   NA   NA   NA
poll()            No   NA   Yes  NA   NA   NA   NA
/dev/poll         Yes  NA   Yes  NA   NA   NA   Yes
RT signals        Yes  No   No   Yes  No   No   Yes
RT sig-per-fd     Yes  Yes  No   No   NA   No   Yes
Declare_interest  Yes  Yes  Yes  Yes  Yes  Yes  No

Polling frequency of a set
The frequency depends on the default value N of the live counter: the active polling set is polled on every fdm_wait_event(), the doze polling set on every Nth fdm_wait_event(), and the idle polling set on every N^2-th fdm_wait_event().
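A hypothetical sketch of this schedule; the counter, the helper and the set names are ours, not the library's actual code:

    enum fdm_set { FDM_ACTIVE, FDM_DOZE, FDM_IDLE };

    void poll_set(enum fdm_set s);        /* hypothetical: poll() the fds of one set */

    static unsigned long call_count;

    void fdm_wait_event_schedule(unsigned long n)   /* n = default live-counter value N */
    {
        call_count++;

        poll_set(FDM_ACTIVE);             /* active set: polled on every call */
        if (call_count % n == 0)
            poll_set(FDM_DOZE);           /* doze set: polled every N calls */
        if (call_count % (n * n) == 0)
            poll_set(FDM_IDLE);           /* idle set: polled every N*N calls */
    }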

Server performance with different default values of the live counter
(Figure: dispatching throughput and response time versus the live counter default N, from 1 to 100000; 1000 connections, 500 req/s, 1 KB reply size.)
Annotations: a small N is good; as N grows, fds move from the idle set through the doze set into the active set; a very large N is useless.