File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018


Today's Goals. Supporting multiple file systems in one name space. Schedulers not just for CPUs, but disks too! Caching and prefetching disk blocks. Organizing a file system for performance.

File Systems so far. So far: how a FS is structured on disk (file system metadata, file metadata/inodes with block maps, directory entries, and data blocks). Still missing: how users/programs interact with them!

Userspace Perspective. Userspace processes make system calls to interact with files. A user process P1 calls open("file_x", ...); the request crosses from userspace into the OS kernel, which hands it to the file system. The kernel records the open file in P1's descriptor table and returns a file descriptor (3 in this example) to the process, which then uses it for subsequent calls: read(3, ...), write(3, ...), close(3). This picture: one process, one file system.
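To make this interface concrete, here is a minimal sketch of the userspace side. The file name "file_x" comes from the slide; the descriptor value 3 is just illustrative, since the kernel chooses it.

```c
/* Sketch of the userspace view: the process only ever sees a small integer
 * file descriptor; the kernel's descriptor table maps it to the open file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];
    int fd = open("file_x", O_RDONLY);        /* e.g., returns 3 */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof(buf));   /* read(3, ...) */
    if (n > 0)
        write(STDOUT_FILENO, buf, (size_t)n); /* echo the bytes we got */
    close(fd);                                /* close(3) */
    return 0;
}
```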

Challenges. What if we have multiple disks? What if those disks use different file systems? Example: the CS department uses NFS for common files and /local for local disks; CS also has a separate NFS mount for /scratch, and Ameet has a research NFS mount. The path in the file tree does NOT say anything about which FS the files live on.

Single name space, multiple backing stores. One directory tree (/, home, usr, local, and below them directories like soni, kwebb, fontes, TODO, CS45) can span several file systems mounted at different points. Even crazier (not recommended): you can mount a FS on top of a non-empty directory.
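As a hedged illustration of attaching a second backing store into the name space, the sketch below uses the Linux mount(2) system call; the device name and mount point are hypothetical, it needs privilege, and in practice you would normally run the mount command instead.

```c
/* Sketch: attach an ext4 file system on a (hypothetical) second disk to the
 * existing name space at /local. Minimal error handling; requires privilege. */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    if (mount("/dev/sdb1", "/local", "ext4", 0, "") != 0) {
        perror("mount");
        return 1;
    }
    /* From now on, paths under /local resolve inside the newly mounted FS. */
    return 0;
}
```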

If you were tasked with solving this problem, what would you do? That is, how would you design the OS such that multiple file systems, possibly on different physical disks, can all share one name space?

Virtual File System (VFS) Layer. Userspace processes make the same system calls to interact with files, but inside the kernel their FS requests now pass through a VFS abstraction layer that sits above the individual file systems (File System 1, 2, 3).

VFS Layer. Unifies the file name space and paths. Paths all start from a common root (/) and are passed to the VFS layer. The VFS layer records which paths correspond to which FS, and translates application requests into the appropriate low-level FS calls. Other benefits? Drawbacks? Analogy: the Instruction Set Architecture (ISA) interface.
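One common way to build such a layer is a per-file-system table of function pointers, with the VFS recording which table each mount point dispatches into. The sketch below is a simplified illustration of that idea, not the actual Linux data structures; all names here are made up.

```c
/* Simplified sketch of VFS-style dispatch: each concrete file system fills in
 * a table of function pointers, and the VFS calls through whichever table
 * belongs to the mount point that a path resolves to. */
#include <stddef.h>
#include <sys/types.h>

struct vfs_ops {
    int     (*open)(const char *path, int flags);
    ssize_t (*read)(int fd, void *buf, size_t len);
    ssize_t (*write)(int fd, const void *buf, size_t len);
    int     (*close)(int fd);
};

struct mount_point {
    const char *prefix;         /* e.g., "/", "/home", "/local" */
    const struct vfs_ops *ops;  /* the file system mounted there */
};

/* Dispatch an open() to the mount whose prefix matches the path longest. */
static int vfs_open(const struct mount_point *mounts, size_t nmounts,
                    const char *path, int flags) {
    const struct mount_point *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < nmounts; i++) {
        size_t len = 0;
        while (mounts[i].prefix[len] && mounts[i].prefix[len] == path[len])
            len++;
        if (mounts[i].prefix[len] == '\0' &&
            (len == 1 /* root "/" */ || path[len] == '/' || path[len] == '\0') &&
            len >= best_len) {
            best = &mounts[i];
            best_len = len;
        }
    }
    return best ? best->ops->open(path, flags) : -1;
}
```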

Having this VFS layer A. Is good for performance (why?) B. Is bad for performance (why?) C. Doesn't really mean much for performance (why not?)

Virtual File System (VFS) Layer. This picture: one process, multiple file systems; the process's FS requests are routed by the VFS layer to whichever file system holds the file.

FUSE (Filesystem in Userspace). With FUSE, the VFS layer can also forward FS requests to a file system implemented by a separate userspace process, rather than by code inside the kernel. Reminiscent of microkernels?
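For a sense of what a userspace file system looks like, here is a minimal sketch written against the FUSE 2.x C API (an assumption; FUSE 3 changes some signatures). It serves a single read-only file whose bytes live in the FUSE process itself.

```c
/* Minimal sketch of a userspace file system, assuming the FUSE 2.x API.
 * It exposes one read-only file, /hello, stored in this process's memory. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_data = "hello from a userspace file system\n";

static int hello_getattr(const char *path, struct stat *st) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = (off_t)strlen(hello_data);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi) {
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, "hello", NULL, 0);
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi) {
    (void)fi;
    size_t len = strlen(hello_data);
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (size > len - (size_t)offset)
        size = len - (size_t)offset;
    memcpy(buf, hello_data + offset, size);
    return (int)size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[]) {
    /* The kernel's VFS layer forwards requests under the mount point to this
     * process; other processes just see an ordinary directory. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}
```

Mounted at some directory, other processes can then open and read the file with the ordinary system calls from the previous slides, never knowing the data comes from a userspace process.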

Multiple Concurrent Disk Requests. This picture: multiple processes, one file system. Several userspace processes make system calls to interact with files, so the kernel may have FS requests from many of them outstanding at once.

Disk Request Queueing. Recall I/O-bound process behavior and the CPU scheduler: if a process is making lots of I/O calls, let it run first; it will probably block. Implication: if there are processes doing disk I/O on the system, they will often have an outstanding disk request. If there are multiple such processes, there will often be a queue of disk requests waiting for service from the disk.

Wait a second. In this scenario, we have multiple processes, all wanting to use the same resource (the disk); only one process can use the resource at a time; and the OS needs to decide which one. This is a scheduling problem! Might similar scheduling policies be relevant?

How about a simple policy like FIFO? A. Good idea for disk scheduling. Why? B. Bad idea for disk scheduling. Why? C. It depends. On what?

Disk Scheduling. Like CPU scheduling, disk scheduling is a policy decision: what should happen if multiple processes all want to access the disk? Like CPU scheduling, the choice of metric influences the policy: Priority: if a process is important, execute its requests first. Throughput: maximize data transfer from the disk. Fairness: give each process the same opportunity to access the disk. In some cases, there are trade-offs between these.

Disk Scheduling. Unlike CPU scheduling, disk characteristics vary significantly: CPUs have different ISAs, but they mostly behave the same. For certain types of disks (solid state), FIFO might make a lot of sense (when targeting throughput): the disk has no moving parts, so the fastest thing to do is just issue requests immediately as they come in. What about traditional spinning disks?

Disk Arm. To simplify the discussion, let's assume the disk arm can move back and forth, from left to right and right to left, across the tracks. In the request diagrams, the horizontal axis runs from Track 0 to Track max and the vertical axis is time; a request (e.g., @ Track 20) appears at its track position, and a wider horizontal distance means a longer seek time.

FIFO Scheduling. Performance is highly variable: it depends on the order and track locations of the requests.

What might a better disk scheduling policy do? (for spinning disks) A policy to maximize throughput? A policy to maximize fairness? A balanced policy? What might the diagram look like for your policies?

Shortest Seek Time First (SSTF). Always choose the request that is closest to the current arm position. Goal: minimize arm movement. Good: less seeking, more throughput! Bad: potentially long delays for far-away locations (starvation).
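A hedged sketch of the SSTF selection step: given the current head position and the pending requests (represented here simply as track numbers), pick the one with the smallest seek distance.

```c
/* Sketch of SSTF: scan the pending-request array and return the index of the
 * request whose track is closest to the current head position. */
#include <stdlib.h>

static int sstf_pick(const int *pending_tracks, int n, int head_track) {
    int best = -1;
    int best_dist = 0;
    for (int i = 0; i < n; i++) {
        int dist = abs(pending_tracks[i] - head_track);
        if (best == -1 || dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;  /* -1 if there are no pending requests */
}
```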

Elevator (SCAN). Move end-to-end in one direction, then go back the other way. Goal: balance. Intuition: like an elevator in a tall building.
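Under the same assumptions, a sketch of the elevator decision: keep sweeping in the current direction as long as any request lies that way, and only then reverse. Strictly speaking this turns around at the last request rather than the physical end of the disk, so it is really the LOOK variant mentioned below.

```c
/* Sketch of elevator (LOOK-style) scheduling: serve the nearest request in
 * the current sweep direction; reverse only when nothing remains that way.
 * direction is +1 (toward Track max) or -1 (toward Track 0). */
static int elevator_pick(const int *pending_tracks, int n, int head_track,
                         int *direction) {
    for (int pass = 0; pass < 2; pass++) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            int delta = (pending_tracks[i] - head_track) * (*direction);
            if (delta >= 0 && (best == -1 ||
                delta < (pending_tracks[best] - head_track) * (*direction)))
                best = i;
        }
        if (best != -1)
            return best;
        *direction = -*direction;  /* end of the sweep: turn around */
    }
    return -1;  /* no pending requests */
}
```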

Many other variants: Circular SCAN (C-SCAN), LOOK, Circular LOOK (C-LOOK). Some care about variance, fairness, performance, whatever.

Linux's Default Scheduler: CFQ. Completely Fair Queueing (CFQ), not to be confused with the Completely Fair Scheduler (CFS) for the CPU. Keep a disk request queue for each process (P1, P2, ..., PN). Move requests from the process queues to the disk dispatch queue in round-robin fashion (if equal priority). The dispatch queue can re-order requests for throughput.
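A very rough sketch of the per-process queueing idea follows; it is not the real Linux CFQ code, and all names and sizes are made up. Requests wait in one FIFO per process, and the dispatcher drains them round-robin into a shared dispatch queue that a lower layer may reorder for throughput.

```c
/* Rough sketch of CFQ-style round-robin dispatch (not the real Linux code):
 * each process has its own FIFO of requests; move one request per process per
 * round into the shared dispatch queue. No overflow/full checks. */
#include <stddef.h>

#define MAX_PROCS 8
#define QUEUE_LEN 64

struct request { int track; };

struct req_queue {
    struct request reqs[QUEUE_LEN];
    int head, tail;  /* monotonically increasing counters; index mod QUEUE_LEN */
};

static int queue_empty(const struct req_queue *q) { return q->head == q->tail; }

static struct request queue_pop(struct req_queue *q) {
    return q->reqs[q->head++ % QUEUE_LEN];
}

static void queue_push(struct req_queue *q, struct request r) {
    q->reqs[q->tail++ % QUEUE_LEN] = r;
}

/* Move up to `budget` requests into the dispatch queue, taking at most one
 * from each process per round and skipping processes with nothing queued. */
static void dispatch_round_robin(struct req_queue per_proc[MAX_PROCS],
                                 struct req_queue *dispatch, int budget) {
    static int next = 0;                   /* where the last round left off */
    for (int moved = 0, scanned = 0;
         moved < budget && scanned < MAX_PROCS;
         next = (next + 1) % MAX_PROCS) {
        if (!queue_empty(&per_proc[next])) {
            queue_push(dispatch, queue_pop(&per_proc[next]));
            moved++;
            scanned = 0;                   /* found work: reset the empty scan */
        } else {
            scanned++;                     /* count consecutive empty queues */
        }
    }
}
```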

The story so far. Finally: multiple processes, multiple file systems. Userspace processes make system calls (open(), ...) to interact with files, and the kernel's VFS layer routes each FS request to the appropriate file system.

Why do we have an open() call, as opposed to just read()ing or write()ing a file path? A. To check file permissions B. To improve performance C. To look up the file's location on disk D. Two of the above E. All of the above

Improving Performance 1. Take advantage of locality! 2. Changes to FS structure.


Where should we do FS / disk caching? A. On the disk device B. In the OS file system implementation C. In the OS VFS layer D. In applications E. Somewhere else (where?)

What sort of information should we cache?

Caches. Directory cache: caches the results of directory lookups. Inode cache: inode data for frequently/recently used files. Block cache: data blocks for frequently/recently used files.

Directory Cache. Recall path lookup: /home/kwebb/TODO requires reading the root inode, the root directory's entries, then the inode and directory entries of each directory along the path (home, kwebb) before finally reaching the file's inode. This is an extremely expensive operation: several disk accesses. In this example, at least seven disk reads just to get the file's inode (metadata).

Directory Cache. Solution: in the VFS layer, keep a small table in software, a combination of a hash table (for fast lookup) and linked lists (for LRU replacement), that maps commonly used paths to the final inode number. (The structure of this cache is similar to that of the block cache, coming up soon.)

Directory Cache. In this example, every system in the CS cluster probably knows where /home is all the time; everybody uses /home! My desktop (sesame) should always know where /home/kwebb is, too.

Inode Cache. If a file is accessed frequently, it will likely have its inode read/written frequently too. This is especially true when writing: update the size, block count, modification time. It would be crazy to have to read the inode from the disk every time! Solution: keep another table, similar in structure to the directory cache and block cache, that maps inode numbers to inode data contents.

Block Cache. An in-memory cache of recently used disk data (blocks). On a read or write operation: check if the block is in the cache; if it is, great! No disk access necessary. If not in the cache, access the disk and place the data in the cache. Eventually, like any cache, it will fill up, so we need a replacement policy. LRU: keep a list ordered by access; remove the old block at the head of the list, and put new blocks at the tail.

Comparing disk caching and virtual memory. In both cases: the user requests some data. For VM, it's a page of memory; for a file, it's a block on the disk. In both cases: the data is either in memory or on disk. In both cases: the data in memory is a limited subset, and we need a replacement scheme that exploits locality.

Why does it make sense to use full LRU for the disk block cache, but for page replacement, we wanted an approximation? A. The block cache is more important than memory. B. Disk accesses are slower than memory accesses. C. A system call has to be made for disk data accesses. D. The disk block cache will be smaller, so less state will be necessary. E. Some other reason(s) (what?).

Block Cache. Two primary concerns: 1. Fast access (is the block in the cache?) 2. Block replacement (which block should we evict?) Data structures: hashed access (for 1) and a linked LRU list (for 2). Independent from one another!

Searching the Block Cache. Blocks live in two structures at once: a hash table (block number mod 4 selects a bucket, giving fast access) and a linked LRU list running from head (least recently used) to tail (most recently used).

Read(36): 36 % 4 = 0, so search hash list 0 for block 36. Cache hit! No disk access needed; update the LRU list.

Read(47): 47 % 4 = 3, so search hash list 3 for block 47. Cache miss! Retrieve 47 from the disk, add it to list 3, and update the LRU list.

Read(15): 15 % 4 = 3, so search hash list 3 for block 15.

Which block will we evict? What should the picture look like after 15 is added? A: 27 B: 47 C: 42 D: 80 E: Some other block

Read(15), continued: cache miss! Evict block 42 (the least recently used block, at the head of the LRU list), retrieve 15 from the disk, and update both structures.

Cache Structure. This same general structure, a hash table for fast lookup plus a linked LRU list for replacement, is used for the other caches too. Some of you probably did this in your lab 4 LRU implementation.
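A hedged sketch of this combined structure in C: each cached block is chained into a hash bucket for fast lookup and into a doubly linked LRU list for eviction. Names and sizes are made up for illustration (four buckets, matching the mod-4 example above).

```c
/* Sketch of a block cache entry that lives on two independent structures at
 * once: a hash-bucket chain (fast "is it cached?" lookup) and a doubly linked
 * LRU list (head = least recently used, tail = most recently used). */
#include <stddef.h>

#define NBUCKETS   4
#define BLOCK_SIZE 4096

struct cache_block {
    int block_num;
    char data[BLOCK_SIZE];
    struct cache_block *hash_next;            /* bucket chain */
    struct cache_block *lru_prev, *lru_next;  /* LRU list links */
};

struct block_cache {
    struct cache_block *buckets[NBUCKETS];
    struct cache_block *lru_head, *lru_tail;
};

/* Look up a block: on a hit, move it to the LRU tail (most recently used). */
static struct cache_block *cache_lookup(struct block_cache *c, int block_num) {
    for (struct cache_block *b = c->buckets[block_num % NBUCKETS];
         b != NULL; b = b->hash_next) {
        if (b->block_num == block_num) {
            /* Unlink from its current LRU position... */
            if (b->lru_prev) b->lru_prev->lru_next = b->lru_next;
            else c->lru_head = b->lru_next;
            if (b->lru_next) b->lru_next->lru_prev = b->lru_prev;
            else c->lru_tail = b->lru_prev;
            /* ...and append it at the tail. */
            b->lru_prev = c->lru_tail;
            b->lru_next = NULL;
            if (c->lru_tail) c->lru_tail->lru_next = b;
            else c->lru_head = b;
            c->lru_tail = b;
            return b;                          /* cache hit */
        }
    }
    return NULL;  /* cache miss: the caller reads the disk and inserts */
}
```

On a miss, the insert path would evict the block at lru_head (if the cache is full), unlink it from its bucket, and add the new block to both structures, mirroring the walkthrough above.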

Prefetching ("readahead"). Observation: the disk is slow to react when we ask it for something (moving parts), but once it finds the data, it can deliver it quickly. Idea: look for patterns in file accesses. Simple example: an application calls read() asking for data from the file's first block, then the second block, then the third block. That looks sequential: request the 4th and 5th blocks from the disk now. If the guess is wrong, we've only displaced old, unused data (LRU); if it's right, the next read is served with minimal latency.
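Applications can also hand the OS a readahead hint directly. The sketch below uses the POSIX posix_fadvise() call to declare that a file (the name here is hypothetical) will be read sequentially, which typically lets the kernel prefetch more aggressively.

```c
/* Sketch: hint that we will read this file sequentially, so the kernel can
 * read ahead into the block cache before we actually ask for the data. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("big_input.dat", O_RDONLY);   /* hypothetical file */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Advise the whole file (offset 0, length 0 means "to end of file"). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[4096];
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* process buf ... */
    }
    close(fd);
    return 0;
}
```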

Consistency. After you perform a write, the next read should retrieve the most recently written changes. This can be a very complex problem in a distributed system; it's not as difficult when we're talking about one disk. Caching can introduce some important decisions: a write is safe or reliable only if it's written to the disk, but for performance, we'd like to write the data to the in-memory block cache.

What about writes? When should we write data to the disk to ensure reliability? A. Write blocks to disk immediately, to ensure safety. B. Write blocks when we evict them from the cache, to maximize performance. C. Write blocks when the user explicitly tells us to, since the user/application knows best. D. Write blocks to disk at some other time.

USB Disk Drives (why you should "safely eject" before unplugging: cached writes may not have reached the device yet).

Writing to Disk. When writing metadata, write to the disk immediately (write-through): inode changes, directory entries, free block lists. Metadata is usually small (update one block) and critical for reliability. When writing data, write it to the memory cache, and flush it to the disk when: 1. the user says to do so, 2. the block is getting evicted from the cache, 3. the FS is being unmounted, or 4. enough time has passed (approx. 30 seconds).
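Case 1 ("the user says to do so") is what the fsync() system call exposes. A minimal sketch, with a hypothetical file name:

```c
/* Sketch: write() only reaches the in-memory block cache; fsync() forces the
 * cached data (and metadata) for this file out to the disk. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    const char *entry = "important record\n";
    if (write(fd, entry, strlen(entry)) < 0)
        perror("write");
    if (fsync(fd) != 0)   /* don't report success until it's on the disk */
        perror("fsync");
    close(fd);
    return 0;
}
```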

Improving Performance 1. Take advantage of locality! 2. Changes to FS structure.

Block Size. What should the block size be? Larger block size: better throughput (the disk reads/writes larger chunks). Smaller block size: less wasted space. Technology trends: disk density is increasing faster than disk speed, so make disk blocks larger: 512 B -> 4096 B. Underlying assumption (typically true for spinning disks): data that is contiguous can be accessed more quickly (reduced seeking).
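To see what block size a mounted file system actually advertises, here is a short sketch using statvfs(); the path queried is arbitrary.

```c
/* Sketch: print the block size reported by whatever FS contains "/". */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void) {
    struct statvfs sv;
    if (statvfs("/", &sv) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("block size: %lu bytes\n", (unsigned long)sv.f_bsize);
    return 0;
}
```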

Block map: extents. Current block map: one pointer, one block; blocks are potentially scattered around the disk. Extents: each pointer contains more information, a start and a length, so fewer pointers are needed if the disk has large free regions, and the data is contiguous on disk.
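A sketch of what an extent-based block map might look like (the names are hypothetical): each entry covers a contiguous run of blocks, and translating a file block number to a disk block number is a walk over the extents.

```c
/* Sketch of an extent-based block map: each extent maps a contiguous run of
 * file blocks onto a contiguous run of disk blocks. */
#include <stdint.h>

struct extent {
    uint64_t disk_start;   /* first disk block of this run */
    uint32_t length;       /* number of contiguous blocks in the run */
};

/* Translate a logical file block number to a disk block number.
 * Returns 0 on success, -1 if the offset is past the end of the map. */
static int extent_lookup(const struct extent *map, int nextents,
                         uint64_t file_block, uint64_t *disk_block) {
    for (int i = 0; i < nextents; i++) {
        if (file_block < map[i].length) {
            *disk_block = map[i].disk_start + file_block;
            return 0;
        }
        file_block -= map[i].length;   /* skip past this extent */
    }
    return -1;
}
```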

Organizing Metadata. Observations: Why allocate all the inodes in advance? Why store them in a fixed-size table?

B+Tree Structure. A generalized BST structure: if the tree is balanced, items can be found quickly. Commonly used in databases to create an index for records. FS insight: use this structure for file metadata (inodes), directory entries, and file block maps. Requires fewer disk accesses for name lookups. Widely used today; popularized by Hans Reiser's ReiserFS in 2001.

Other FS Structures. Log-structured FS: treat all disk blocks as a circular array, and always append any written data or metadata to the end, like a log file. Writes are always sequential, giving good throughput; reads are more difficult. Useful for database transaction logs and crash recovery. Cluster FS: treat the disks of a large compute cluster as one giant logical disk, with files replicated for redundancy and performance. Popularized by Google's GFS, designed to store ALL the web crawling data.

Summary. The VFS layer separates the user-facing file interface from the FS implementation. The OS's scheduling of disk accesses impacts throughput and fairness. Disks are slow, so we cache directories, inodes, and disk data, using combined data structures for fast access and LRU replacement. Many other FS structures exist, with wildly different characteristics.