IO-Lite: A Unified I/O Buffering and Caching System
Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel (Rice University)
Presented by Chuanpeng Li
2005-4-25, CS458 Presentation
IO-Lite Motivation
- Network hardware is very fast, but user-perceived speed is not always so fast; it also depends on the performance of server systems
- General-purpose OSes are inadequate: inefficient buffering and caching schemes
- Example: when a web server sends a file, three copies of the data exist: the file cache, the server application buffer, and the network subsystem buffer
Outline
- Motivation
- Problems in detail
- Traditional approaches
- IO-Lite design
- IO-Lite implementation
- Performance evaluation
Problems in more detail
- Each major I/O subsystem has its own buffering and caching mechanism: the file system cache, network subsystem buffers, IPC buffers (e.g., pipes), application buffers, and other I/O subsystems
- Redundant data copying: CPU overhead
- Multiple buffering: wastes memory and raises the miss rate in the file system cache
- Lack of cross-subsystem optimization, e.g., TCP checksum caching [4] and support for application-specific cache replacement policies [5]
Traditional approaches
- Memory-mapped files: avoid double buffering between the file system and the application, but data is still buffered in the network subsystem
- sendfile system call (Linux, FreeBSD, Windows NT): sends a file through a socket directly, but does not support dynamic content
- Fbufs: a copy-free cross-domain transfer and buffering facility [3]; mainly for handling network streams, does not support a file system cache, and was built for a non-UNIX environment
IO-Lite overview
- A unified I/O buffering and caching system
- Almost all subsystems share a single physical copy of the data
- Principles: immutable buffers, mutable buffer aggregates
Immutable buffers (IO-Lite design)
- Buffers are allocated with their initial content; if shared, the content cannot be modified (read-only sharing)
- Eliminates problems with synchronization, protection, consistency, and so on
- Efficient data transfer across protection domain boundaries: all subsystems safely refer to a single physical copy of the data
Buffer, buffer aggregate, and slice
(figure from [1]: an aggregate is a list of <pointer, length> pairs, each pair denoting a slice of an underlying buffer)
Buffer aggregates (IO-Lite design)
- Data in immutable buffers cannot be modified in place
- Buffer aggregates are built on top of buffers; all data accesses go through buffer aggregates
- An aggregate is an ordered list of <pointer, length> pairs and is mutable
- How to modify: a newly allocated buffer holds the modified content (copy-on-write)
- Reference counting of buffers allows safe reclamation
Aggregate modification (figure from [1])
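The aggregate mechanism above can be sketched in a few lines of C. This is a minimal illustration, not the real IO-Lite implementation: the struct layouts are assumptions, the real system adds reference counting, VM mapping, and ACLs, and a "modification" here simply swings one slice of the mutable aggregate to a freshly allocated buffer while the original immutable buffer is left untouched.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {            /* one <pointer, length> pair (a "slice") */
    const char *ptr;
    size_t      len;
} IOL_Slice;

typedef struct {            /* mutable aggregate over immutable buffers */
    IOL_Slice *slices;
    size_t     n;
} IOL_Agg;

/* Total number of bytes the aggregate refers to. */
size_t iol_agg_size(const IOL_Agg *a) {
    size_t total = 0;
    for (size_t i = 0; i < a->n; i++) total += a->slices[i].len;
    return total;
}

/* Flatten the aggregate into a caller-supplied contiguous buffer. */
void iol_agg_gather(const IOL_Agg *a, char *out) {
    for (size_t i = 0; i < a->n; i++) {
        memcpy(out, a->slices[i].ptr, a->slices[i].len);
        out += a->slices[i].len;
    }
}

/* "Modify" slice i by pointing it at a newly allocated buffer holding
 * the replacement data; the original buffer is never written. */
void iol_agg_replace_slice(IOL_Agg *a, size_t i, const char *data, size_t len) {
    char *fresh = malloc(len);
    if (!fresh) return;
    memcpy(fresh, data, len);
    a->slices[i].ptr = fresh;   /* old buffer stays valid for other holders */
    a->slices[i].len = len;
}
```

Other aggregates still referencing the old slice continue to see the old contents, which is exactly why read-only sharing needs no synchronization.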
IO-Lite and applications (IO-Lite design)
- IO-Lite API:
  - IOL_read(int fd, IOL_Agg **aggr, size_t size): read data into a buffer aggregate
  - IOL_write(int fd, IOL_Agg *aggr): write the data in a buffer aggregate to a file
- Linking with modified I/O libraries (e.g., stdio)
- Legacy applications are not affected
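To make the API shape concrete, here is a toy user-level mock of the two calls, built on plain POSIX read/write. It is only an illustration of the calling convention: the single-slice IOL_Agg struct is an assumption, and in the real system the aggregate would reference buffers in the shared IO-Lite window rather than a private malloc'd copy.

```c
#include <stdlib.h>
#include <unistd.h>

typedef struct { char *ptr; size_t len; } IOL_Agg;  /* toy: single slice */

/* Read up to size bytes into a freshly allocated "immutable" buffer and
 * hand the caller an aggregate describing it. */
ssize_t IOL_read(int fd, IOL_Agg **aggr, size_t size) {
    IOL_Agg *a = malloc(sizeof *a);
    a->ptr = malloc(size);
    ssize_t n = read(fd, a->ptr, size);
    a->len = (n > 0) ? (size_t)n : 0;
    *aggr = a;
    return n;
}

/* Write the bytes the aggregate refers to; the payload itself is not
 * copied again at this layer. */
ssize_t IOL_write(int fd, IOL_Agg *aggr) {
    return write(fd, aggr->ptr, aggr->len);
}
```

An application passes the same aggregate it got from IOL_read straight to IOL_write, which is how a server forwards file data without touching the bytes.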
Buffer allocation (IO-Lite design)
- Immutable buffers are allocated in a reserved region of the virtual address space: the IO-Lite window
- The window appears in the address space of all protection domains, including the kernel
- During allocation, the owner has temporary write permission to the buffer so it can be initialized
- After initialization, the buffer is immutable
Access control (IO-Lite design)
- Access control and protection are at the granularity of a process
- IO-Lite maintains cached pools of buffers with the same ACL (Access Control List); each pool has its own ACL
- Programs determine the ACL of a data object before storing it in memory
- The ACL of the data determines the choice of a pool for a new buffer
Interprocess communication (IO-Lite design)
- When I/O data is transferred, buffer aggregates are passed by value and buffers are passed by reference
- IPC is based on page remapping and shared memory (as in fbufs [3])
- When an immutable buffer is transferred, VM remapping is done in the receiver
- When a buffer is deallocated, it is added to a buffer pool; mappings persist and buffers can be reused
- When a buffer is reused, no mapping changes are required
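The per-ACL pooling and reuse described in the last two slides can be sketched as follows. This is a simplification under stated assumptions: an ACL is reduced to an integer tag, and the pool is an in-process free list, whereas the real system compares full access-control lists and keeps cross-domain VM mappings alive so a reused buffer needs no remap.

```c
#include <stdlib.h>

typedef struct Buf { struct Buf *next; int acl; char data[4096]; } Buf;
typedef struct { int acl; Buf *free_list; } Pool;

#define NPOOLS 8
static Pool pools[NPOOLS];
static int npools = 0;

/* Find (or create) the pool whose ACL matches the data's ACL. */
static Pool *pool_for_acl(int acl) {
    for (int i = 0; i < npools; i++)
        if (pools[i].acl == acl) return &pools[i];
    pools[npools].acl = acl;        /* first buffer with this ACL: new pool */
    pools[npools].free_list = NULL;
    return &pools[npools++];
}

/* Allocate a buffer for data with the given ACL, reusing a pooled one
 * (whose mappings would still be in place) when possible. */
Buf *iol_alloc(int acl) {
    Pool *p = pool_for_acl(acl);
    if (p->free_list) {
        Buf *b = p->free_list;
        p->free_list = b->next;
        return b;                   /* reuse: no mapping changes needed */
    }
    Buf *b = malloc(sizeof *b);
    b->acl = acl;
    return b;
}

/* Deallocation returns the buffer to its ACL's pool; mappings persist. */
void iol_free(Buf *b) {
    Pool *p = pool_for_acl(b->acl);
    b->next = p->free_list;
    p->free_list = b;
}
```

The key property: allocating again with the same ACL hands back the same physical buffer, which is why repeated transfers along the same path become cheap after the first one.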
IO-Lite and file system (IO-Lite design)
- Data access is through buffer aggregates
- The file cache maintains a mapping <file-id, offset, length> --> buffer aggregate
- Write operations cause buffer replacement
- Example: IOL_write() after IOL_read()
Example: write after read in file cache (figure from [1])
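The write-after-read example can be captured in a toy cache. This sketch assumes a tiny linear lookup table keyed by <file-id, offset> (the real mapping also carries a length and points at an aggregate, not a raw buffer): a write does not modify the cached buffer in place, it swings the cache entry to the newly written buffer, so a reader still holding the old buffer keeps seeing the old, immutable contents.

```c
#include <string.h>

typedef struct { int file; long off; const char *buf; } CacheEntry;

#define CACHE_SLOTS 16
static CacheEntry cache[CACHE_SLOTS];
static int nentries = 0;

/* Return the cached buffer for <file, off>, or NULL on a miss. */
const char *cache_lookup(int file, long off) {
    for (int i = 0; i < nentries; i++)
        if (cache[i].file == file && cache[i].off == off)
            return cache[i].buf;
    return NULL;
}

/* A write installs a new buffer under the same key; the old buffer is
 * not touched, so holders of earlier references are unaffected. */
void cache_update(int file, long off, const char *newbuf) {
    for (int i = 0; i < nentries; i++)
        if (cache[i].file == file && cache[i].off == off) {
            cache[i].buf = newbuf;   /* buffer replacement, not overwrite */
            return;
        }
    cache[nentries++] = (CacheEntry){ file, off, newbuf };
}
```

In the real system, reference counting (slide on buffer aggregates) reclaims the old buffer once the last reader drops it.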
IO-Lite and network (IO-Lite design)
- Buffer aggregates are used to store and manipulate network packets
- The network device driver determines the ACL of the data before storing it in memory
- Drivers obtain the ACL from packet headers using a packet filter: early demultiplexing
Issues to consider
- Cache replacement and paging
- Impact of immutable buffers
- Cross-subsystem optimization
- Operation in web servers
Cache replacement and paging (1)
- Cached data may be concurrently accessed by multiple applications
- Data can be shared in complex ways
Cache replacement and paging (2)
- Simple strategy: cache entries are maintained in a list ordered by last access time, giving approximate LRU replacement
- How cache eviction is triggered: more than half of the VM pages selected for replacement are in the file cache
- The file cache is enlarged when there is a miss in the cache
- IO-Lite buffers reside in pageable virtual memory; an evicted page may need to be written back to more than one backing store
Impact of immutable buffers
- Data modification requires allocating a new buffer
- Every word is modified (full rewriting): the cost is buffer allocation
- A subset of words is modified: logically combine the unmodified and modified portions; the cost is buffer allocation + chaining
- Modifications are widely scattered: use the mmap interface to support modification in place
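The middle case above (a subset modified) is where chaining pays off. A minimal sketch, assuming a fixed-size toy aggregate and a same-length replacement: only the changed bytes are copied into a new buffer, and the unmodified prefix and suffix of the original immutable buffer are chained around it.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } Slice;
typedef struct { Slice s[3]; int n; } Agg;   /* toy: at most 3 slices */

/* Replace bytes [off, off+repl_len) of an immutable buffer by chaining
 * its untouched prefix and suffix around a small freshly copied buffer. */
Agg chain_modify(const char *orig, size_t orig_len,
                 size_t off, const char *repl, size_t repl_len) {
    char *fresh = malloc(repl_len);          /* only the change is copied */
    memcpy(fresh, repl, repl_len);
    Agg a = { .n = 0 };
    if (off > 0)
        a.s[a.n++] = (Slice){ orig, off };                  /* old prefix */
    a.s[a.n++] = (Slice){ fresh, repl_len };                /* new middle */
    if (off + repl_len < orig_len)
        a.s[a.n++] = (Slice){ orig + off + repl_len,
                              orig_len - off - repl_len };  /* old suffix */
    return a;
}

/* Flatten the chained aggregate into a contiguous output buffer. */
size_t agg_gather(const Agg *a, char *out) {
    size_t total = 0;
    for (int i = 0; i < a->n; i++) {
        memcpy(out + total, a->s[i].ptr, a->s[i].len);
        total += a->s[i].len;
    }
    return total;
}
```

The cost is one small allocation plus the chaining bookkeeping, instead of copying the whole buffer.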
Cross-subsystem optimization
- Optimizations leverage the ability to uniquely identify a particular I/O data object throughout the system
- TCP checksum caching: the Internet checksum is cached for each slice, avoiding repeated checksum calculation
- The generation number of a buffer is incremented every time it is reallocated; the generation number and the buffer address together uniquely identify the content of a buffer
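The checksum-caching idea can be sketched with a cache keyed by <buffer address, generation>. Two assumptions are labeled in the code: a trivial additive stand-in replaces the real Internet checksum, and a small direct-mapped table replaces whatever lookup structure the real system uses. Because the generation number is bumped on every reallocation, a stale entry can never match reused memory.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const void *addr;      /* buffer (slice) address */
    uint32_t    gen;       /* buffer generation when the sum was cached */
    uint16_t    sum;
    int         valid;
} CkEntry;

static CkEntry ck_cache[64];
static unsigned long ck_computed = 0;   /* counts real computations */

/* Stand-in for the Internet checksum (assumption: not the real one). */
static uint16_t compute_sum(const uint8_t *p, size_t len) {
    uint32_t s = 0;
    ck_computed++;
    for (size_t i = 0; i < len; i++) s += p[i];
    return (uint16_t)(s + (s >> 16));
}

/* Return the checksum for a buffer, recomputing only when the
 * <address, generation> pair is not in the cache. */
uint16_t cached_sum(const void *addr, uint32_t gen, size_t len) {
    size_t slot = ((uintptr_t)addr >> 4) % 64;
    CkEntry *e = &ck_cache[slot];
    if (e->valid && e->addr == addr && e->gen == gen)
        return e->sum;                  /* hit: no recomputation */
    uint16_t s = compute_sum(addr, len);
    *e = (CkEntry){ addr, gen, s, 1 };
    return s;
}
```

Retransmitting the same slice (same address, same generation) thus costs a lookup, not a checksum pass over the data.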
Operation in web servers (1): static content
- Traditionally, a document may be stored both in the file cache and in the TCP transmission buffers
- With IO-Lite, all data copying and multiple buffering are eliminated
- Buffer aggregates are passed between the file cache, the server application, and the network subsystem
- TCP checksums can be reused
Operation in web servers (2): dynamic content
- The CGI process transfers data to the web server process via IPC
- Traditionally, there is multiple buffering in the CGI process, the server process, and the TCP transmission buffers
- With IO-Lite, only buffer aggregates are passed, and only one copy of the data is used
- TCP checksums are reused for portions of dynamic content that are repeatedly transmitted
Outline
- Motivation
- Problems in detail
- Traditional approaches
- IO-Lite design
- IO-Lite implementation
- Performance evaluation
Implementation (1)
- A loadable kernel module in FreeBSD 2.2.6; a user-level library provides buffer aggregate manipulation routines and stubs for the IO-Lite system calls
- Network subsystem: IO-Lite buffers are encapsulated inside the BSD network buffer abstraction (mbufs); the external mbuf interface is unchanged, so the TCP/IP stack source code is unchanged
Implementation (2)
- File system: an IO-Lite file cache module replaces the original BSD buffer cache
- VM system: IO-Lite buffers are allocated in the IO-Lite window, a region of virtual address space present in all processes and the kernel; no significant changes to page-in and page-out; the replacement policy for IO-Lite buffers is implemented by the page-out handler
- IPC: the BSD pipe implementation is modified to transfer buffer aggregates instead of data, and to ensure the IO-Lite buffers are readable in the receiving domain
Experimental environment
- Pentium II 333 MHz, 128 MB RAM, 5 network adapters, 100 Mbps Ethernet
- Flash: a high-performance web server, one of the fastest servers available, with an event-driven model
- Flash-Lite is an IO-Lite version of Flash
- Apache (process-based) is used for comparison
Web servers with static files and CGI
- Static vs. CGI
- Nonpersistent vs. persistent connections
(figures from [1])
Trace-based evaluation
- Rice trace (figures from [1])
WAN effects (figures from [1])
Other applications (figures from [1])
Conclusion
- IO-Lite provides an efficient, unified framework for I/O buffering and caching
- Experiments show that IO-Lite can improve the performance of web servers and other I/O-intensive applications by 40%-80%
References
1. V. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A Unified I/O Buffering and Caching System. ACM TOCS, 2000.
2. V. Pai, P. Druschel, and W. Zwaenepoel. Flash: An Efficient and Portable Web Server. USENIX, 1999.
3. P. Druschel and L. Peterson. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. SOSP, 1993.
4. M. F. Kaashoek et al. Application Performance and Flexibility on Exokernel Systems. SOSP, 1997.
5. P. Cao and E. Felten. Implementation and Performance of Application-controlled File Caching. OSDI, 1994.