CSE 506: Operating Systems. The Page Cache. Don Porter


The Page Cache. Don Porter.

Logical Diagram. [Figure: course map spanning the user, kernel, and hardware layers, with boxes for Binary Formats, Memory Allocators, Threads, System Calls, RCU, File System, Networking, Sync, Memory Management, Device Drivers, CPU Scheduler, Interrupts, Disk, Net, and Consistency. Today's lecture: kernel-level memory management.]

Recap of previous lectures. Page tables: translate virtual addresses to physical addresses. VM Areas (Linux): track what should be mapped where in the virtual address space of a process. Hoard/Linux slab: efficient allocation of objects from a superblock/slab of pages.

Background. Lab 2: track physical pages with an array of PageInfo structs. Each struct contains a reference count, and a free list is layered over this array. Just like JOS, Linux represents physical memory with an array of page structs. The contents are not exactly the same, but the idea is. Pages can be allocated to processes, or used to cache file data in memory.

Today's Problem. Given a VMA or a file's inode, how do I figure out which physical pages are storing its data? Next lecture: we will go the other way, from a physical page back to the VMA or file inode.

The address space abstraction. Unifying abstraction: each file inode has an address space (0 to file size). So do block devices that cache data in RAM (0 to device size). The (anonymous) virtual memory of a process has an address space (0 to 4 GB on x86). In other words, all page mappings can be thought of as an (object, offset) tuple. Make sense?

Address spaces for: VM Areas (VMAs) and files.

Start Simple. Anonymous memory: no file backing it. E.g., the stack for a process. Not shared between processes. (Sharing and swapping will be discussed later.) How do we figure out the virtual-to-physical mapping? Just walk the page tables! Linux doesn't do anything outside of the page tables to track this mapping.

File mappings. A VMA can also represent a memory-mapped file. The kernel can also map file pages to service read() or write() system calls. Goal: we only want to load a file into memory once!

Logical View. [Figure: Processes A, B, and C, the Foo.txt inode, a page containing "Hello!", and the disk; the links between them are marked with question marks.]

VMA to a file. Also easy: a VMA includes a file pointer and an offset into the file. A VMA may map only part of the file. The offset must be at page granularity. Anonymous mapping: the file pointer is null. The file pointer is an open file descriptor in the process's file descriptor table. We will discuss file handles later.
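The step from a virtual address inside a file-backed VMA to a page of the file can be sketched as follows. This is a minimal illustration, not the kernel's data structure: the `VMA` class, its field names, and `file_page_index` are hypothetical, and 4 KB pages are assumed.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096  # assumed page size


@dataclass
class VMA:
    """Hypothetical stand-in for a VM area (field names are illustrative)."""
    start: int                 # first virtual address covered
    end: int                   # one past the last virtual address covered
    file: Optional[str]        # None models an anonymous mapping
    file_offset: int = 0       # offset into the file, page-aligned


def file_page_index(vma: VMA, addr: int) -> Optional[int]:
    """Return which page of the file backs `addr`, or None if anonymous."""
    if not (vma.start <= addr < vma.end):
        raise ValueError("address is outside this VMA")
    if vma.file is None:
        return None  # anonymous memory: only the page tables know
    if vma.file_offset % PAGE_SIZE != 0:
        raise ValueError("offset must be at page granularity")
    return (vma.file_offset + (addr - vma.start)) // PAGE_SIZE


# A 4-page mapping that starts 2 pages into foo.txt.
vma = VMA(start=0x400000, end=0x404000, file="foo.txt", file_offset=2 * PAGE_SIZE)
index = file_page_index(vma, 0x401234)  # second page of the mapping
```

The resulting page index is the key that the file's address space is later asked to resolve to a physical page.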

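Logical View. [Figure: Processes A, B, and C reach the Foo.txt inode through a file descriptor table; FDs are process-specific. The link from the inode to the "Hello!" page and the disk is still marked with a question mark.]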

Tracking file pages. What data structure should we use for a file? There are no page tables for files. For example: which page stores the first 4 KB of file foo? Hint: files can be small, or very, very large.

The Radix Tree. A space-optimized trie. Trie: rather than store the entire key in each node, traversal of the parent(s) builds a prefix, and the node just stores the suffix. Especially useful for strings. The prefix is less important for file offsets, but it does bound key storage space. More important: a tree with a branching factor k > 2 gives faster lookup for large files (especially with tricks). Note: Linux's use of the radix tree is constrained.

From Understanding the Linux Kernel. [Figure: (a) a radix tree of height 1: radix_tree_root (height = 1) points through rnode to a single radix_tree_node (count = 2, slots 0 to 63) whose slots[0] and slots[4] point to the pages with index 0 and index 4. (b) a radix tree of height 2: radix_tree_root (height = 2) points through rnode to a top-level radix_tree_node (count = 2) whose slots[0] and slots[2] point to second-level nodes; the first (count = 2) holds the pages with index 0 and index 4 in slots[0] and slots[4], and the second (count = 1) holds the page with index 131 in slots[3].]

A bit more detail. Assume an upper bound on file size when building the radix tree. We can rebuild later if we are wrong. Specifically: max size is 256 KB and branching factor (k) = 64. 256 KB / 4 KB pages = 64 pages, so we need a radix tree of height 1 to represent these pages.

Tree of height 1. The root has 64 slots; each can be null or a pointer to a page. Lookup of address X: shift off the low 12 bits (the offset within the page). Use the next 6 bits as an index into these slots (2^6 = 64). If the pointer is non-null, go to the child node (the page). If it is null, the page doesn't exist.

Tree of height n. Similar story: shift off the low 12 bits. At each node, shift off 6 bits from the middle (starting at 6 * (distance to the bottom - 1) bits) to find which of the 64 potential children to go to. Use the fixed height to figure out where to stop and which bits to use for the offset. Observations: the key at each node is implicit, based on its position in the tree. Lookup time depends only on the height of the tree. In a general-purpose radix tree, a lookup may have to check all k children, for a higher lookup cost.
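The lookup described on the last two slides can be sketched in a few lines. This is an illustrative model, not the Linux implementation: nodes are plain 64-element lists, pages are strings, and the names `lookup`, `PAGE_SHIFT`, and `SLOT_BITS` are assumptions made for the example.

```python
PAGE_SHIFT = 12                    # low 12 bits: offset within a 4 KB page
SLOT_BITS = 6                      # 6 bits per level: 64 slots per node
SLOT_MASK = (1 << SLOT_BITS) - 1


def lookup(root, height, addr):
    """Return the page caching byte `addr` of the file, or None if absent."""
    index = addr >> PAGE_SHIFT                 # shift off the in-page offset
    if index >> (SLOT_BITS * height):
        return None                            # beyond what this height covers
    node = root
    for level in range(height, 0, -1):         # level = distance to the bottom
        if node is None:
            return None                        # sparse tree: subtree missing
        shift = SLOT_BITS * (level - 1)
        node = node[(index >> shift) & SLOT_MASK]
    return node


# Height-2 tree holding page index 131 = 2 * 64 + 3: slots[2], then slots[3].
leaf = [None] * 64
leaf[3] = "page-131"
root = [None] * 64
root[2] = leaf

found = lookup(root, 2, 131 * 4096 + 17)
```

No keys are stored or compared anywhere: the slot chosen at each level is computed directly from the address, which is why the cost is one array index per level.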

Fixed heights. If the file size grows beyond the maximum for the current height, we must grow the tree. Relatively simple: add another root; the previous tree becomes its first child. Scaling in height:
1: 2^((6*1) + 12) = 256 KB
2: 2^((6*2) + 12) = 16 MB
3: 2^((6*3) + 12) = 1 GB
4: 2^((6*4) + 12) = 64 GB
5: 2^((6*5) + 12) = 4 TB
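The height table and the grow-by-adding-a-root step can be checked with a short sketch. The helper names (`max_bytes`, `height_for`, `grow`) are hypothetical, and nodes are again modeled as 64-element lists.

```python
def max_bytes(height):
    """Largest file size a tree of this height can index: 2^(6*height + 12)."""
    return 1 << (6 * height + 12)


def height_for(size_bytes):
    """Smallest height whose capacity covers a file of `size_bytes`."""
    height = 1
    while max_bytes(height) < size_bytes:
        height += 1
    return height


def grow(root, height):
    """Add a new root; the previous tree becomes its first child (slot 0)."""
    new_root = [None] * 64
    new_root[0] = root
    return new_root, height + 1


old_root = [None] * 64
old_root[4] = "page-4"
new_root, new_height = grow(old_root, 1)
```

Placing the old tree in slot 0 keeps every existing page index valid, because the newly added high-order 6 bits are zero for all pages that were already in the tree.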

Back to address spaces. Each address space for a file cached in memory includes a radix tree. The radix tree is sparse: pages not in memory are missing. The radix tree also supports tags, such as dirty. A tree node is tagged if at least one child also has the tag. Example: I tag a file page dirty, so I must tag each parent in the radix tree as dirty. When I am finished writing the page back, I must check all siblings; if none are dirty, clear the parent's dirty tag.
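The tagging rule can be sketched as two small walks up the tree. This is a simplified model under stated assumptions: the `Node` class, its `dirty` set of slot indices, and the parent pointers are illustrative and not the kernel's layout.

```python
class Node:
    """Hypothetical radix tree node with a per-slot dirty tag."""

    def __init__(self, parent=None, parent_slot=None):
        self.slots = [None] * 64
        self.dirty = set()              # slots whose subtree holds a dirty page
        self.parent = parent
        self.parent_slot = parent_slot  # which slot of the parent points here


def tag_dirty(node, slot):
    """Tag a page dirty, then tag every ancestor on the path to the root."""
    while node is not None:
        if slot in node.dirty:
            break                       # ancestors are already tagged
        node.dirty.add(slot)
        node, slot = node.parent, node.parent_slot


def clear_dirty(node, slot):
    """After write-back, clear the tag; stop once a sibling is still dirty."""
    while node is not None:
        node.dirty.discard(slot)
        if node.dirty:
            break                       # a sibling keeps the ancestors tagged
        node, slot = node.parent, node.parent_slot


root = Node()
leaf = Node(parent=root, parent_slot=2)
root.slots[2] = leaf

tag_dirty(leaf, 3)
tag_dirty(leaf, 5)
clear_dirty(leaf, 3)
still_tagged = set(root.dirty)          # slot 5 is still dirty under slots[2]
clear_dirty(leaf, 5)
```

The payoff is on the read side: a write-back pass can skip any subtree whose slot is not tagged, instead of visiting every cached page.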

Address Space Logical View. [Figure: Processes A, B, and C all reach the Foo.txt inode, whose radix tree points to the cached "Hello!" page, which is backed by the disk.]

Recap. Anonymous page: just use the page tables. File-backed mapping: VMA -> open file descriptor -> inode; inode -> address space (radix tree) -> page.

Problem 2: Dirty pages. Most OSes do not write file updates to disk immediately. (Later lecture: the OS tries to optimize disk arm movement.) The OS instead tracks dirty pages and ensures that write-back isn't delayed too long, lest data be lost in a crash. An application can force immediate write-back with the sync system calls (and some open/mmap options).

Sync system calls. sync(): flush all dirty buffers to disk. fsync(fd): flush all dirty buffers associated with this file to disk (including changes to the inode). fdatasync(fd): flush only the dirty data pages for this file to disk; don't bother with the inode.

How to implement sync? Goal: keep the overhead of finding dirty blocks low. A naïve scan of all pages would work, but it is expensive because there are lots of clean pages. Idea: keep track of dirty data to minimize overhead. This costs a bit of extra work on the write path, of course.

How to implement sync? (continued) Background: each file system has a superblock. All superblocks are kept in a list. Each superblock keeps a list of dirty inodes. Inodes and superblocks are both marked dirty upon use.

FS Organization. [Figure: one superblock (SB) per file system, shown for /, /floppy, and /d1; each superblock has a dirty list of inodes.] Inodes and radix nodes/pages are marked dirty separately.

Simple traversal.
for each s in superblock list:
    if (s->dirty) writeback s
    for i in inode list:
        if (i->dirty) writeback i
        if (i->radix_root->dirty):
            // Recursively traverse tree, writing
            // dirty pages and clearing dirty flag
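A runnable version of this traversal, as a sketch under simplifying assumptions: the classes `Page`, `Inode`, and `SuperBlock` are hypothetical, the radix tree is modeled as nested lists, "write back" just appends to a log, and the walk visits every subtree rather than pruning on dirty tags.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    index: int
    dirty: bool = False


@dataclass
class Inode:
    name: str
    dirty: bool = False
    radix_root: list = field(default_factory=list)  # nested lists of Page/None


@dataclass
class SuperBlock:
    name: str
    dirty: bool = False
    inodes: list = field(default_factory=list)


def write_dirty_pages(node, log):
    """Recursively walk the tree, writing dirty pages and clearing the flag."""
    if node is None:
        return
    if isinstance(node, Page):
        if node.dirty:
            log.append(("page", node.index))
            node.dirty = False
        return
    for child in node:
        write_dirty_pages(child, log)


def sync_all(superblocks):
    """Write back dirty superblocks, inodes, and pages; return what was written."""
    log = []
    for sb in superblocks:
        if sb.dirty:
            log.append(("superblock", sb.name))
            sb.dirty = False
        for inode in sb.inodes:
            if inode.dirty:
                log.append(("inode", inode.name))
                inode.dirty = False
            write_dirty_pages(inode.radix_root, log)
    return log


foo = Inode("foo.txt", dirty=True,
            radix_root=[[Page(0), Page(1, dirty=True)], None])
root_fs = SuperBlock("/", dirty=True, inodes=[foo])
first = sync_all([root_fs])
second = sync_all([root_fs])  # nothing left to write
```

The inode and its pages are checked independently, matching the point that inodes and radix nodes/pages are marked dirty separately.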

Asynchronous flushing. Kernel thread(s): pdflush. A kernel thread is a task that only runs in the kernel's address space. There are 2 to 8 threads, depending on how busy or idle the threads are. When pdflush runs, it is given a target number of pages to write back. The kernel maintains a total count of dirty pages. The administrator configures a target dirty ratio (say 10%).

pdflush. When pdflush is scheduled, it figures out how many dirty pages are above the target ratio. It writes back pages until it meets its goal or can't write more back. (Some pages may be locked; just skip those.) Same traversal as sync(), plus a count of written pages. Usually quits earlier.
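The goal computation and the early-exit loop can be sketched as follows. This is an illustration only: pages are modeled as dictionaries, and `pages_over_target` and `flush` are hypothetical names, not kernel functions.

```python
def pages_over_target(total_pages, dirty_pages, target_percent=10):
    """How many dirty pages exceed the configured target dirty ratio."""
    target = total_pages * target_percent // 100
    return max(0, dirty_pages - target)


def flush(pages, goal):
    """Write back dirty, unlocked pages until `goal` pages have been written."""
    written = 0
    for page in pages:
        if written >= goal:
            break                    # goal met: usually quits before the end
        if page["locked"] or not page["dirty"]:
            continue                 # skip locked or clean pages
        page["dirty"] = False        # stands in for the actual disk write
        written += 1
    return written


# 1000 pages, 150 of them dirty; the first dirty page happens to be locked.
pages = [{"dirty": True, "locked": i == 0} for i in range(150)]
pages += [{"dirty": False, "locked": False} for _ in range(850)]

goal = pages_over_target(len(pages), 150)
written = flush(pages, goal)
```

With a 10% target on 1000 pages, the thread only needs to bring 150 dirty pages down to 100, so it stops long before it has walked every page.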

How long dirty? Linux has some inode-specific bookkeeping about when things were dirtied. pdflush also checks for any inodes that have been dirty longer than 30 seconds and writes these back even if the quota was met. Not the strongest guarantee I've ever seen.

But where to write? OK, so I see how to find the dirty pages. How does the kernel know where on disk to write them? And which disk, for that matter? The superblock tracks the device. The inode tracks the mapping from file offset to sector.

Block size mismatch. Most disks have 512-byte blocks; pages are generally 4 KB. (Some new "green" disks have 4 KB blocks.) So each page in the cache usually holds 8 disk blocks. When the sizes don't match, what do we do? Simple answer: just write all 8! But this is expensive: if only one block changed, we only want to write one block back.

Buffer head. Simple idea: for every page backed by disk, store an extra data structure for each disk block, called a buffer_head. If a page stores 8 disk blocks, it has 8 buffer heads. Example: a write() system call for the first 5 bytes. Look up the first page in the radix tree. Modify the page and mark it dirty. Only mark the first buffer head dirty.
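The write() example can be modeled with one dirty flag per disk block. This is a sketch under stated assumptions: `CachedPage` and its methods are hypothetical, with 4 KB pages and 512-byte blocks, and a plain list of booleans standing in for the buffer heads.

```python
PAGE_SIZE = 4096
BLOCK_SIZE = 512


class CachedPage:
    """Hypothetical cached file page with one dirty flag per disk block."""

    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.dirty = False
        self.buffer_dirty = [False] * (PAGE_SIZE // BLOCK_SIZE)  # 8 flags

    def write(self, offset, payload):
        """Copy `payload` into the page and mark only the touched blocks."""
        if offset < 0 or offset + len(payload) > PAGE_SIZE:
            raise ValueError("write does not fit in this page")
        if not payload:
            return
        self.data[offset:offset + len(payload)] = payload
        self.dirty = True
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        for block in range(first, last + 1):
            self.buffer_dirty[block] = True

    def blocks_to_write_back(self):
        """Indices of the disk blocks that write-back must send to disk."""
        return [i for i, d in enumerate(self.buffer_dirty) if d]


page = CachedPage()
page.write(0, b"Hello")                 # the first 5 bytes of the file
after_first = page.blocks_to_write_back()
page.write(510, b"abcd")                # straddles blocks 0 and 1
after_second = page.blocks_to_write_back()
```

Only one of the page's eight blocks needs to reach the disk after the 5-byte write, which is the saving that the per-block bookkeeping buys.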

From Understanding the Linux Kernel. [Figure 15-2: a buffer page including four buffers and their buffer heads. The page descriptor's private field points to the first buffer head; the buffer heads are linked through b_this_page, and each one points back to the page descriptor through b_page and to its buffer within the page through b_data.]

More on buffer heads. On write-back (sync, pdflush, etc.), only write the dirty buffer heads. To look up a given disk block of a file, divide the block number by the number of buffer heads per page. Example: disk block 25 of a file is in page 3 in the radix tree (25 / 8 = 3, remainder 1). Note: memory-mapped files mark all 8 buffer_heads dirty. Why? The kernel can only detect written regions via page faults.
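The block-to-page arithmetic is a single division with remainder. A minimal sketch, assuming 4 KB pages and 512-byte blocks; `locate_block` is a hypothetical helper name.

```python
def locate_block(file_block, page_size=4096, block_size=512):
    """Return (page index, buffer head index within that page) for a block."""
    heads_per_page = page_size // block_size   # 8 with the defaults
    return divmod(file_block, heads_per_page)


page_index, head_index = locate_block(25)
```

The quotient selects the page to look up in the radix tree, and the remainder selects which of that page's buffer heads describes the block.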

Summary. We have seen how mappings of files/disks to cache pages are tracked, and how dirty pages are tagged. Radix tree basics. When and how dirty data is written back to disk. How differences between disk sector and page sizes are handled.