VM and I/O. IO-Lite: A Unified I/O Buffering and Caching System (Vivek S. Pai, Peter Druschel, Willy Zwaenepoel). Software Prefetching and Caching for TLBs (Kavita Bala, M. Frans Kaashoek, William E. Weihl).

General themes: CPU speed and network bandwidth are increasing rapidly, while main memory and IPC cannot keep up, and the trend toward microkernels increases the number of IPC transactions. One remedy is to increase the speed/bandwidth of IPC data (data moving between processes).

fbufs: an attempt to increase bandwidth within the network subsystem. In a nutshell, fbufs provide immutable buffers shared among the processes of that subsystem, implemented using shared memory and page remapping in a specialized OS, the x-kernel.
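To make the sharing idea concrete, here is a minimal user-level approximation in C, assuming POSIX shared memory; the real fbufs remap pages between protection domains inside the x-kernel, and the names (FBUF_NAME, FBUF_SIZE, fbuf_create_and_fill) are purely illustrative. Error handling is omitted.

```c
/* Rough POSIX approximation of the fbuf idea: a producer fills a shared
 * buffer once, then marks it read-only so later stages can only consume
 * it.  This is a sketch, not the x-kernel mechanism. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define FBUF_NAME "/demo_fbuf"
#define FBUF_SIZE 4096

static void *fbuf_create_and_fill(const char *data, size_t len)
{
    int fd = shm_open(FBUF_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, FBUF_SIZE);
    void *buf = mmap(NULL, FBUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    memcpy(buf, data, len);                 /* producer writes exactly once */
    mprotect(buf, FBUF_SIZE, PROT_READ);    /* buffer is now immutable */
    close(fd);
    return buf;                             /* consumers map it read-only */
}
```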

fbufs, details: incoming packet data units (PDUs) are passed to higher protocols in fbufs, and PDUs are assembled into application data units by use of an aggregation ADT.

fbufs, details: the fbuf interface does not support writes after the producer fills a buffer (PDU); fbufs can be reused once the consumer is finished, which leads to sequential use of fbufs. The assumption is that applications shouldn't have to modify the data anyway. This is a LIMITATION, especially in a more general system.

Enter IO-Lite: take fbufs, but make them more general, i.e., accessible to the filesystem in addition to the network subsystem, more versatile, and usable on standard OSes (not just the x-kernel). IO-Lite solves a more general problem: rapidly increasing CPU speed, not just network bandwidth.

Before comparing IO-Lite to fbufs, the problems with the old way of doing things: redundant data copying, redundant copies of data lying around, and no special optimizations between subsystems.

IO-Lite at a high level: IO-Lite must provide system-wide buffers to prevent multiple copies. UNIX allocates the filesystem buffer cache from a different pool of kernel memory than, say, network buffers and application-level buffers.

[Diagram, built up over several slides: a web server with a CGI program, the file system, and TCP/IP; later builds show data A appearing at the file system, web server, and TCP/IP stages.]

Access Control Lists: processes must be granted permission to view buffers; each buffer pool has an ACL for this purpose, i.e., for each buffer space, a list of the processes granted permission to access it.
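A minimal sketch of that check in C, assuming each pool simply stores the PIDs allowed to map its buffers; the structures and names are illustrative, not IO-Lite's actual data layout.

```c
#include <stdbool.h>
#include <sys/types.h>

#define MAX_ACL 8

struct buffer_pool {
    pid_t acl[MAX_ACL];   /* processes granted access to this pool */
    int   acl_len;
};

static bool pool_allows(const struct buffer_pool *pool, pid_t pid)
{
    for (int i = 0; i < pool->acl_len; i++)
        if (pool->acl[i] == pid)
            return true;
    return false;         /* not on the ACL: do not map the buffer */
}
```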

Consequence of ACLs: the producer must know the data path to the consumer. This gets slightly tricky with incoming network packets, which require early demultiplexing (mentioned as a common enough technique).

[Diagram: processes P1 (file system), P2 (CGI), P3 (web server), and P4 (TCP/IP) sharing data A. Buffer 1 has ACL {P1, P2}, buffer 2 has ACL {P1, P3, P4}, and buffer 3 has ACL {P4}.]

Pipelining: abstractly, it represents good modularity; conceptually, data moves through the pipeline from producer to consumer. IO-Lite comes close to implementing this in practice when the path is known ahead of time; context switches are the biggest overheads in the pipeline.

Immutable --> mutable: data in an OS must be manipulated in various ways, by network protocols (same as fbufs) or by modifying cached files (e.g., to send them to various clients over the network, or to write checksums). IO-Lite must support concurrent buffer use among sharing processes.

[Diagram, built up over several slides: immutable fbufs vs. mutable IO-Lite; in IO-Lite, a buffer aggregate in the user process references Buffer 1 and Buffer 2 in the file cache.]

Consequences of mutable buffers. (1) Whole buffers are rewritten: same as if there were no IO-Lite; same penalty as a data copy. (2) Bits and pieces of files are rewritten: what this system was designed for; the aggregate ADT handles modified sections nicely. (3) Too many bits and pieces are rewritten: IO-Lite uses mmap to make the data contiguous, which usually results in a kernel memory copy.
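A sketch of the aggregate ADT behind case (2), assuming an aggregate is an ordered list of <pointer, length> slices over immutable buffers; modifying a region allocates a fresh buffer for the changed bytes and splices it in, while the unchanged tail keeps pointing at the shared data. Names and the simplified splitting logic are illustrative.

```c
#include <stdlib.h>
#include <string.h>

struct slice {
    const char   *data;   /* points into some immutable I/O buffer */
    size_t        len;
    struct slice *next;
};

/* Replace the first new_len bytes of slice s with new_data, assuming
 * new_len <= s->len (further splitting logic omitted for brevity). */
static void aggregate_overwrite_head(struct slice *s,
                                     const char *new_data, size_t new_len)
{
    char *fresh = malloc(new_len);          /* new buffer for changed bytes */
    memcpy(fresh, new_data, new_len);

    struct slice *rest = malloc(sizeof *rest);
    rest->data = s->data + new_len;         /* unchanged tail is still shared */
    rest->len  = s->len - new_len;
    rest->next = s->next;

    s->data = fresh;                        /* head now points at the new buffer */
    s->len  = new_len;
    s->next = rest;
}
```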

Evicting I/O pages: apply an LRU policy to unreferenced buffers (if one exists); otherwise, apply LRU to referenced buffers, and since buffers can have multiple references, this might require multiple write-backs to disk. There is a tradeoff between the size of the I/O cache and the number of VM pages: if more than 50% of replaced pages are IO-Lite pages, evict an IO-Lite page to reduce that number.
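A sketch of that eviction order, assuming each buffer carries a reference count and a last-use timestamp; this only implements the two-step LRU choice described above, not the write-backs or the 50% balancing check.

```c
#include <stddef.h>

struct iobuf {
    int           refcount;
    unsigned long last_use;   /* lower = older */
    struct iobuf *next;
};

static struct iobuf *pick_victim(struct iobuf *list)
{
    struct iobuf *lru_unref = NULL, *lru_ref = NULL;

    for (struct iobuf *b = list; b != NULL; b = b->next) {
        if (b->refcount == 0) {
            if (!lru_unref || b->last_use < lru_unref->last_use)
                lru_unref = b;            /* preferred: unreferenced LRU */
        } else {
            if (!lru_ref || b->last_use < lru_ref->last_use)
                lru_ref = b;              /* fallback: referenced LRU */
        }
    }
    return lru_unref ? lru_unref : lru_ref;
}
```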

The bad news: applications must be modified to use special IO-Lite read/write calls, and both applications at either end of a UNIX pipe must use the library to gain the benefits of IO-Lite's IPC.
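For a sense of what those modified calls look like, here is an illustrative aggregate-based read/write shape in C; the names iol_agg, iol_read, and iol_write are stand-ins rather than IO-Lite's exact API, and the mmap-based body is only a toy backend standing in for the real cache-backed implementation.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct iol_agg {            /* toy aggregate: a single <pointer, length> slice */
    const void *data;
    size_t      len;
};

/* "Read" by mapping the file region; the caller gets a reference, not a copy. */
static ssize_t iol_read(int fd, struct iol_agg *agg, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    agg->data = p;
    agg->len  = size;
    return (ssize_t)size;
}

/* Hand the referenced bytes on; a real IO-Lite write would pass page
 * references to the socket rather than copying the payload. */
static ssize_t iol_write(int fd, const struct iol_agg *agg)
{
    return write(fd, agg->data, agg->len);
}
```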

The good news: many applications can take further advantage of the IPC, e.g., computing packet checksums only once; a <generation #, addr> pair identifies the I/O buffer data, so values derived from it can be cached and reused.
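A sketch of that checksum reuse, assuming the network code keeps a small table keyed by the buffer's <generation #, address> pair; repeated sends of the same cached data hit the table and skip recomputation. The table layout and names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define CKSUM_SLOTS 256

struct cksum_entry {
    uint64_t  generation;   /* bumped whenever the underlying buffer is reused */
    uintptr_t addr;
    uint16_t  cksum;
    bool      valid;
};

static struct cksum_entry cksum_cache[CKSUM_SLOTS];

static bool cksum_lookup(uint64_t gen, uintptr_t addr, uint16_t *out)
{
    struct cksum_entry *e = &cksum_cache[addr % CKSUM_SLOTS];
    if (e->valid && e->generation == gen && e->addr == addr) {
        *out = e->cksum;    /* hit: reuse the previously computed checksum */
        return true;
    }
    return false;           /* miss: compute it, then record it below */
}

static void cksum_insert(uint64_t gen, uintptr_t addr, uint16_t cksum)
{
    cksum_cache[addr % CKSUM_SLOTS] =
        (struct cksum_entry){ gen, addr, cksum, true };
}
```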

Flash-Lite: the Flash web server modified to use IO-Lite. HTTP: up to 43% faster than Flash and up to 137% faster than Apache. Persistent HTTP (less TCP overhead): up to 90% network saturation. Dynamic pages have an advantage because of the IPC between the server and the CGI program.

[Performance graphs: HTTP/PHTTP throughput, and PHTTP with CGI.]

Something else fbufs can't do: non-network applications, and fewer memory copies across IPC.

On to prefetching/caching: once again, CPU speeds far exceed main memory speeds. There is a tradeoff: prefetch too early --> less cache space; cache too long --> less room for prefetching. Try to strike a balance.

Let's focus on the TLB: microkernel modularity pays a price, namely more TLB misses. The solution is in software, with no hardware modifications, and handles only kernel misses, which are about 50% of the total.

[Diagram, built up over several slides: the user address space, the user page tables, kernel data structures, and the next level of page tables, i.e., the layers whose mappings can themselves miss in the TLB.]

Prefetching: prefetch on the IPC path, since concurrency in separate domains increases misses; fetch the L2 mappings for the process's stack, code, and data segments. A generic trap handles misses the first time and caches them in a flat PTLB for future hash lookups.
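A sketch of the prefetch step, assuming the IPC path has already walked the destination's page tables for its stack, code, and data segments and now stashes those mappings in a flat, direct-mapped software PTLB; all names and structures are illustrative, not the paper's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTLB_SLOTS 128

struct ptlb_entry { uintptr_t vpn; uintptr_t pte; bool valid; };
static struct ptlb_entry ptlb[PTLB_SLOTS];

/* Called on the IPC path with the <vpn, pte> pairs gathered by walking
 * the destination domain's L2 page tables. */
static void ptlb_prefetch(const uintptr_t *vpns, const uintptr_t *ptes, int n)
{
    for (int i = 0; i < n; i++)
        ptlb[vpns[i] % PTLB_SLOTS] =
            (struct ptlb_entry){ vpns[i], ptes[i], true };
}

/* The generic trap handler later probes this table with a hash lookup
 * before doing a full page-table walk. */
static bool ptlb_lookup(uintptr_t vpn, uintptr_t *pte_out)
{
    struct ptlb_entry *e = &ptlb[vpn % PTLB_SLOTS];
    if (e->valid && e->vpn == vpn) { *pte_out = e->pte; return true; }
    return false;
}
```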

Caching: the goal is to avoid cascaded misses in page table entries. Entries evicted from the TLB are cached in an STLB, which adds a 4-cycle overhead to most misses in the general trap handler. When using the STLB, don't prefetch L3 mappings; doing so usually evicts useful cached entries. In fact, using both caching and prefetching only improves performance with a lot of IPCs, such as in servers.
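A sketch of the STLB idea, assuming entries evicted from the hardware TLB are inserted into a hashed software table that the generic trap handler probes before falling back to a full page-table walk; names and the hash are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define STLB_SLOTS 512

struct stlb_entry { uintptr_t vpn; uintptr_t pte; bool valid; };
static struct stlb_entry stlb[STLB_SLOTS];

/* Called when the hardware TLB evicts an entry. */
static void stlb_insert(uintptr_t vpn, uintptr_t pte)
{
    stlb[vpn % STLB_SLOTS] = (struct stlb_entry){ vpn, pte, true };
}

/* Fast path of the generic miss handler: a hit here avoids a cascaded
 * walk through the page tables, at a few cycles of extra overhead. */
static bool stlb_lookup(uintptr_t vpn, uintptr_t *pte_out)
{
    struct stlb_entry *e = &stlb[vpn % STLB_SLOTS];
    if (e->valid && e->vpn == vpn) {
        *pte_out = e->pte;
        return true;
    }
    return false;   /* miss: fall through to the full page-table walk */
}
```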

[Performance graphs: PTLB performance, and overall performance. However, no overall graph is given for the number of penalties.]

Amdahl's Law in action: overall performance is only marginally better.

Summary: bridging the gap between memory speeds and CPU speed is worthwhile. Microkernels have fallen out of favor but could come back, and relatively slow memory is still a problem. Sharing resources between processes without placing too many restrictions on the data is a good approach.