VFS, Continued. Don Porter CSE 506

Similar documents
RCU. ò Walk through two system calls in some detail. ò Open and read. ò Too much code to cover all FS system calls. ò 3 Cases for a dentry:

CSE506: Operating Systems CSE 506: Operating Systems

RCU. ò Dozens of supported file systems. ò Independent layer from backing storage. ò And, of course, networked file system support

Virtual File System. Don Porter CSE 306

Virtual File System. Don Porter CSE 506

CSE 506: Opera.ng Systems. The Page Cache. Don Porter

The Page Cache 3/16/16. Logical Diagram. Background. Recap of previous lectures. The address space abstracvon. Today s Problem.

Ext3/4 file systems. Don Porter CSE 506

Fall 2017 :: CSE 306. File Systems Basics. Nima Honarmand

CSE506: Operating Systems CSE 506: Operating Systems

Lecture 19: File System Implementation. Mythili Vutukuru IIT Bombay

Read-Copy Update (RCU) Don Porter CSE 506

Logical disks. Bach 2.2.1

UNIX File System. UNIX File System. The UNIX file system has a hierarchical tree structure with the top in root.

CSE 410: Systems Programming

Page Frame Reclaiming

CSE 333 SECTION 3. POSIX I/O Functions

ò Very reliable, best-of-breed traditional file system design ò Much like the JOS file system you are building now

Operating systems. Lecture 7

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

File Systems. CS170 Fall 2018

CS 140 Project 4 File Systems Review Session

PROJECT 6: PINTOS FILE SYSTEM. CS124 Operating Systems Winter , Lecture 25

Kernel Synchronization I. Changwoo Min

Operating Systems Design Exam 2 Review: Spring 2011

The Art and Science of Memory Allocation

CS510 Operating System Foundations. Jonathan Walpole

SYSTEM CALL IMPLEMENTATION. CS124 Operating Systems Fall , Lecture 14

CS 416: Opera-ng Systems Design March 23, 2012

File Systems: Consistency Issues

FILE SYSTEMS, PART 2. CS124 Operating Systems Winter , Lecture 24

Memory Management. Fundamentally two related, but distinct, issues. Management of logical address space resource

NFS. Don Porter CSE 506

Introduction to File Systems. CSE 120 Winter 2001

Process Address Spaces and Binary Formats

Lecture files in /home/hwang/cs375/lecture05 on csserver.

Lecture 4: Memory Management & The Programming Interface

CSE 333 Lecture 9 - storage

Block Device Scheduling. Don Porter CSE 506

Block Device Scheduling

Files and the Filesystems. Linux Files

Basic OS Progamming Abstrac2ons

CSE506: Operating Systems CSE 506: Operating Systems

Recall: Address Space Map. 13: Memory Management. Let s be reasonable. Processes Address Space. Send it to disk. Freeing up System Memory

CS370 Operating Systems

CSC369 Lecture 2. Larry Zhang

Scheduling, part 2. Don Porter CSE 506

CSE 410 Final Exam 6/09/09. Suppose we have a memory and a direct-mapped cache with the following characteristics.

File System Implementation

CSE 333 SECTION 3. POSIX I/O Functions

File Systems. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Section 3: File I/O, JSON, Generics. Meghan Cowan

CMPS 105 Systems Programming. Prof. Darrell Long E2.371

Problem Set 2. CS347: Operating Systems

Native POSIX Thread Library (NPTL) CSE 506 Don Porter

Verification & Validation of Open Source

Recitation 8: Tshlab + VM

Basic OS Progamming Abstrac7ons

we are here Page 1 Recall: How do we Hide I/O Latency? I/O & Storage Layers Recall: C Low level I/O

Applications of. Virtual Memory in. OS Design

Memory Management Topics. CS 537 Lecture 11 Memory. Virtualizing Resources

ò Server can crash or be disconnected ò Client can crash or be disconnected ò How to coordinate multiple clients accessing same file?

CSC369 Lecture 2. Larry Zhang, September 21, 2015

ECE 550D Fundamentals of Computer Systems and Engineering. Fall 2017

Read-Copy Update (RCU) Don Porter CSE 506

Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY Fall 2008.

Chapter 6. File Systems

FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 23

Lecture 3. Introduction to Unix Systems Programming: Unix File I/O System Calls

CS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017

File Systems. Todays Plan. Vera Goebel Thomas Plagemann. Department of Informatics University of Oslo

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Quiz II Solutions MASSACHUSETTS INSTITUTE OF TECHNOLOGY Fall Department of Electrical Engineering and Computer Science

File Systems. CSE 2431: Introduction to Operating Systems Reading: Chap. 11, , 18.7, [OSC]

Lab 4 File System. CS140 February 27, Slides adapted from previous quarters

Emulation 2. G. Lettieri. 15 Oct. 2014

Operating Systems CMPSC 473 File System Implementation April 10, Lecture 21 Instructor: Trent Jaeger

How to hold onto things in a multiprocessor world

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2018 Lecture 23

CS140 Operating Systems Final December 12, 2007 OPEN BOOK, OPEN NOTES

Advanced Database Systems

File Systems. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse]

Multi-level Translation. CS 537 Lecture 9 Paging. Example two-level page table. Multi-level Translation Analysis

INTERNAL REPRESENTATION OF FILES:

CS240: Programming in C

ECE 650 Systems Programming & Engineering. Spring 2018

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

Lab 09 - Virtual Memory

Address spaces and memory management

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 23

we are here I/O & Storage Layers Recall: C Low level I/O Recall: C Low Level Operations CS162 Operating Systems and Systems Programming Lecture 18

CS 43: Computer Networks. 07: Concurrency and Non-blocking I/O Sep 17, 2018

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

[537] Fast File System. Tyler Harter

RCU in the Linux Kernel: One Decade Later

Midterm Exam Answers

Process Time. Steven M. Bellovin January 25,

OS COMPONENTS OVERVIEW OF UNIX FILE I/O. CS124 Operating Systems Fall , Lecture 2

UNIX File Systems. How UNIX Organizes and Accesses Files on Disk

Operating System Labs. Yuanbin Wu

Transcription:

VFS, Continued Don Porter CSE 506

Logical Diagram Binary Formats Memory Allocators System Calls Threads User Today s Lecture Kernel RCU File System Networking Sync Memory Management Device Drivers CPU Scheduler Interrupts Disk Net Consistency Hardware

Previous lectures ò Basic VFS abstractions ò Including data structures ò And programming model (file system) ò And APIs ò Some system call examples ò Walk through some system calls ò Plus synchronization issues

Today s goal: Synthesis ò Walk through two system calls in some detail ò Open and read ò Too much code to cover all FS system calls

Quick review: dentry ò What purpose does a dentry serve? ò Essentially maps a path name to an inode ò More in 2 slides on how to find a dentry ò Dentries are cached in memory ò Only recently accessed parts of a directory are in memory; others may need to be read from disk ò Dentries can be freed to reclaim memory (like pages)

Dentry caching ò 3 Cases for a dentry: ò In memory (exists) ò Not in memory (doesn t exist) ò Not in memory (on disk/evicted for space or never used) ò How to distinguish last 2 cases? ò Case 2 can generate a lot of needless disk traffic ò Negative dentry Dentry with a NULL inode pointer

Dentry tracking ò Dentries are stored in four data structures: ò A hash table (for quick lookup) ò A LRU list (for freeing cache space wisely) ò A child list of subdirectories (mainly for freeing) ò An alias list (to do reverse mapping of inode -> dentries) ò Recall that many directories can map one inode

Open summary ò Key kernel tasks: ò Map a human-readable path name to an inode ò Check access permissions, from / to the file ò Possibly create or truncate the file (O_CREAT, O_TRUNC) ò Create a file descriptor

Open arguments ò int open(const char *path, int flags, int mode); ò Path: file name ò Flags: many (see manual page), include read/write perms ò Mode: If a file is created, what permissions should it have? (e.g., 0755) ò Return value: File handle index (>= 0 on success) ò Or (0 errno) on failure

Absolute vs. Relative Paths ò Each process has a current root and working directory ò Stored in current->fs-> (fs, pwd---respectively) ò Specifically, these are dentry pointers (not strings) ò Note that these are shared by threads ò Why have a current root directory? ò Some programs are chroot jailed and should not be able to access anything outside of the directory

More on paths ò An absolute path starts with the / character ò E.g., /home/porter/foo.txt, /lib/libc.so ò A relative path starts with anything else: ò E.g., vfs.pptx,../../etc/apache2.conf ò First character dictates where in the dcache to start searching for a path

Search ò Executes in a loop, starting with the root directory or the current working directory ò Treats / character in the path as a component delimiter ò Each iteration looks up part of the path ò E.g., /home/porter/foo would look up home, porter, then foo, starting at /

Detail (iteration 1) ò For current dentry (/), dereference the inode ò Check access permission (recall, mode is stored in inode) ò Use a permission() function pointer associated with the inode can be overridden by a security module (such as SeLinux, or AppArmor), or the file system ò If ok, look at next path component (/home)

Detail (2) ò Some special cases: ò If next component is a., just skip to next component ò If next component is a.., try to move up to parent ò ò If not a. or.. : Catch the special case where the current dentry is the process root directory and treat this as a no-op ò Compute a hash value to find bucket in d_hash table ò Hash is based on full path (e.g., /home/foo, not foo ) ò Search the d_hash bucket at this hash value

Detail (3) ò If there isn t a dentry in the hash bucket, calls the lookup() method on parent inode (provided by FS), to read the dentry from disk ò Or the network, or kernel data structures ò If found, check whether it is a symbolic link ò ò If so, call inode->readlink() (also provided by FS) to get the path stored in the symlink Then continue next iteration ò If not a symlink, check if it is a directory ò If not a directory and not last element, we have a bad path

Iteration 2 ò We have dentry/inode for /home, now finding porter ò Check permission in /home ò Hash /home/porter, find dentry ò Confirm not.,.., or a symlink ò Confirm is a directory ò Recur with dentry/inode for /home/porter, search for foo

Symlink problems ò What if /home/porter/foo is a symlink to foo? ò Kernel gets in an infinite loop ò Can be more subtle: ò foo -> bar ò bar -> baz ò baz -> foo

Preventing infinite recursion ò More simple heuristics ò If more than 40 symlinks resolved, quit with ELOOP ò If more than 6 symlinks resolved in a row without a nonsymlink inode, quit with ELOOP ò Maybe add some special logic for obvious self-references ò Can prevent execution of a legitimate 41 symlink path ò Generally considered reasonable

Back to open() ò Key tasks: ò Map a human-readable path name to an inode ò Check access permissions, from / to the file ò Possibly create or truncate the file (O_CREAT, O_TRUNC) ò Create a file descriptor ò We ve seen how steps 1 and 2 are done

Creation ò Handled as part of search; treat last item specially ò Usually, if an item isn t found, search returns an error ò If last item (foo) exists and O_EXCL flag set, fail ò If O_EXCL is not set, return existing dentry ò If it does not exist, call fs create method to make a new inode and dentry ò This is then returned

File descriptors ò User-level file descriptors are an index into a processlocal table of struct files ò A struct file stores a dentry pointer, an offset into the file, and caches the access mode (read/write/both) ò The table also tracks which entries are valid ò Open marks a free table entry as in use ò If full, create a new table 2x the size and copy old one ò Allocates a new file struct and puts a pointer in table

Truncation ò The O_TRUNC flag causes the file to be truncated to zero bytes at the end of opening ò This is done with a routine that frees cached pages, updates inode size, and calls an FS-provided truncate() hook ò This routine generally updates on-disk data, freeing stored blocks

Open questions?

Now on to read ò int read(int fd, void *buf, size_t bytes); ò fd: File descriptor index ò buf: Buffer kernel writes the read data into ò bytes: Number of bytes requested ò Returns: bytes read (if >= 0), or errno

Simple steps ò Translate int fd to a struct file (if valid) ò Check cached permissions in the file ò Increase reference count ò Validate that sizeof(buf) >= bytes requested ò And that buf is a valid address ò Do read() routine associated with file (FS-specific) ò Drop refcount, return bytes read

Hard part: Getting data ò In addition to an offset, the file structure caches a pointer to the address space associated with the file ò Recall: this includes the radix tree of in-memory pages ò Search the radix tree for the appropriate page of data ò If not found, or PG_uptodate flag not set, re-read from disk ò If found, copy into the user buffer (up to inode->i_size)

Requesting a page read ò First, the page must be locked ò Atomically set a lock bit in the page descriptor ò If this fails, the process sleeps until page is unlocked ò Once the page is locked, double-check that no one else has re-read from disk before locking the page ò Also, check that no one has freed the page while we were waiting (by changing the mapping field) ò Invoke the address_space->readpage() method (set by FS)

Generic readpage ò Recall that most disk blocks are 512 bytes, yet pages are 4k ò Block size stored in inode (blkbits) ò Each file system provides a get_block() routine that gives the logical block number on disk ò Check for edge cases (like a sparse file with missing blocks on disk)

More readpage ò If the blocks are contiguous on disk, read entire page as a batch ò If not, read each block one at a time ò These block requests are sent to the backing device I/O scheduler (recall lecture on I/O schedulers)

After readpage ò Mark the page accessed (for LRU reclaiming) ò Unlock the page ò Then copy the data, update file access time, advance file offset, etc.

Copying data to user ò Kernel needs to be sure that buffer is a valid address ò How to do it? ò Can walk appropriate page table entries ò What could go wrong? ò Concurrent munmap from another thread ò Page might be lazy allocated by kernel

Trick ò What if we don t do all of this validation? ò Looks like kernel had a page fault ò Usually REALLY BAD ò Idea: set a kernel flag that says we are in copy_to_user ò If a page fault happens for a user address, don t panic ò Just handle demand faults ò If the page is really bad, write an error code into a register so that it breaks the write loop; check after return

Benefits ò This trick actually speeds up the common case (buf is ok) ò Avoids complexity of handling weird race conditions ò Still need to be sure that buf address isn t in the kernel

Summary ò Goal: Synthesize key VFS concepts, data structures, and optimizations with concrete examples ò Understand key steps in open and read system calls