Operating Systems: Filesystems

Operating Systems: Filesystems Shankar April 25, 2015

Outline fs interface Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

Overview: Filesystem Interface fs interface Persistent structure of directories and files File: named sequence of bytes Directory: zero or more pointers to directories/files Persistent: non-volatile storage device: disk, SSD,... Structure organized as a tree or acyclic graph nodes: directories and files root directory: / path to a node: /a/b/... acyclic: more than one path to a file (but not directory) problematic for disk utilities, backup Node metadata: owner, creation time, allowed access,... Users can create/delete nodes, modify content/metadata Examples: FAT, UFS, NTFS, ZFS,..., PFAT, GOSFS, GSFS2

Node fs interface Name(s) a name is the last entry in a path to the node eg, path /a/b vs name b subject to size and char limits Type: directory or file or device or... Size: subject to limit Directory may have a separate limit on number of entries. Time of creation, modification, last access,... Content type (if file): eg, text, pdf, jpeg, mpeg, executable,... Owner Access for owner, other users,...: eg, r, w, x, setuid,...

Operations on Filesystems fs interface Format(dev) create an empty filesystem on device dev (eg, disk, flash,...) Mount(fstype, dev) attach (to computer) filesystem of type fstype on device dev returns a path to the filesystem (eg, volume, mount point,...) after this, processes can operate on the filesystem Unmount(path) detach (from computer) filesystem at path // finish all io after this, the filesystem is inert in its device, inaccessible

Operations on Attached Filesystem 1 fs interface Create(path), CreateDir(path) create a file/directory node at given path Link(existingPath, newpath) create a (hard) link to an existing file node Delete(path) // aka Unlink(path) delete the given path to the file node at path delete file node if no more links to it DeleteDir(path) delete the directory node at path // must be empty Change attributes (name, metadata) of node at path eg, stat, touch, chown/chgrp, chmod, rename/mv
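
As a rough illustration, these operations correspond to familiar POSIX calls; the sketch below is a minimal example (the /tmp/demo paths are made up and error checking is mostly omitted).

```c
/* Hedged sketch: the slide's Create/Link/Delete/DeleteDir operations mapped
 * onto the POSIX calls they resemble; paths are illustrative. */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    mkdir("/tmp/demo", 0755);                                /* CreateDir(path)     */
    int fd = open("/tmp/demo/a", O_CREAT | O_WRONLY, 0644);  /* Create(path)        */
    if (fd >= 0) close(fd);
    link("/tmp/demo/a", "/tmp/demo/b");                      /* Link(existing, new) */
    unlink("/tmp/demo/a");                                   /* Delete/Unlink(path) */
    chmod("/tmp/demo/b", 0600);                              /* change attributes   */
    unlink("/tmp/demo/b");
    rmdir("/tmp/demo");                                      /* DeleteDir(path)     */
    return 0;
}
```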

Operations on Attached Filesystem 2 fs interface Open(path, access), OpenDir(path, access) open the node at path with given access (r, w,...) returns a file descriptor // prefer: node descriptor after this, node can be operated on Close(nd), CloseDir(nd) close the node associated with node descriptor nd Read(nd, file range, buffer), ReadDir(nd, dir range, buffer), read the given range from open node nd into given buffer returns number of bytes/entries read Write(nd, file range, buffer) write buffer contents into the given range of open file node nd returns number of bytes written
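
A minimal sketch of the open/write/read/close cycle using the POSIX calls these operations resemble; the file name is illustrative and error handling is minimal.

```c
/* Hedged sketch of Open, Write, Read, Close on a single node descriptor. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int nd = open("notes.txt", O_RDWR | O_CREAT, 0644);   /* Open -> node descriptor */
    if (nd < 0) return 1;
    ssize_t nw = write(nd, "hello\n", 6);                 /* Write(nd, range, buffer) */
    char buf[16];
    lseek(nd, 0, SEEK_SET);                               /* rewind before reading    */
    ssize_t nr = read(nd, buf, sizeof buf);               /* Read(nd, range, buffer)  */
    printf("wrote %zd bytes, read %zd bytes\n", nw, nr);
    close(nd);                                            /* Close(nd)                */
    return 0;
}
```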

Operations on Attached Filesystem 3 fs interface Seek(nd, file location), SeekDir(nd, entry) move r/w head to given location/entry MemMap(nd, file range, mem range) map the given range of file nd to given range of memory MemUnmap(nd, file range, mem range) unmap the given range of file nd from given range of memory Sync(nd) complete all pending io for open node nd
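
A hedged sketch of Seek, MemMap/MemUnmap and Sync in terms of lseek, mmap/munmap, msync and fsync; the file name and the 4 KB mapping size are assumptions for illustration.

```c
/* Hedged sketch: seek, memory-map, modify through memory, sync, unmap. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int nd = open("data.bin", O_RDWR | O_CREAT, 0644);
    if (nd < 0) return 1;
    ftruncate(nd, 4096);                            /* make sure the mapped range exists */
    lseek(nd, 100, SEEK_SET);                       /* Seek(nd, file location)           */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, nd, 0);              /* MemMap(nd, file range, mem range) */
    if (p != MAP_FAILED) {
        memcpy(p, "mapped write", 12);              /* modify the file through memory    */
        msync(p, 4096, MS_SYNC);                    /* flush the mapped range            */
        munmap(p, 4096);                            /* MemUnmap                          */
    }
    fsync(nd);                                      /* Sync(nd): complete all pending io */
    close(nd);
    return 0;
}
```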

Consistency of Shared Files fs interface Shared file one that is opened by several processes concurrently Consistency of the shared file: when does a write by one process become visible in the reads by the other processes Various types of consistency (from strong to weak) visible when write returns visible when sync returns visible when close returns Single-processor system all types of consistency easily achieved Multi-processor system strong notions are expensive/slow to achieve

Reliability fs interface Disks, though non-volatile, are susceptible to failures magnetic / mechanical / electronic parts wear out power loss in the middle of a filesystem operation Filesystem should work in spite of failures to some extent Small disk-area failures no loss/degradation Large disk-area failures filesystem remains a consistent subtree eg, files may be lost, but no undetected corrupted files/links Power outage each ongoing filesystem operation remains atomic ie, either completes or has no effect on filesystem

Outline storage devices Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

Disk Drive: Geometry storage devices Platter(s) fixed to rotating spindle Spindle speed Platter has two surfaces Surface has concentric tracks Track has sectors more in outer than inner Sector: fixed capacity Eg, laptop disk (2011) 1 or 2 platters 4200-15000 rpm, 15-4 ms/rotation diameter: 2.5 in track width: < 1 micron sector: 512 bytes buffer: 16 MB Movable arm with fixed rw heads, one per surface Buffer memory

Disk Access Time storage devices Disk access time = seek + rotation + transfer Seek time: (moving rw head) + (electronic settling) minimum: target is next track; settling only maximum: target is at other end rotation delay: half-rotational delay on avg transfer time: platter to buffer transfer time: buffer to host memory Eg, laptop disk (2011) min seek: 0.3-1.5 ms max seek: 10-20 ms rotation delay: 7.5-2 ms platter to buffer: 50 (inner) - 100 (outer) MB/s buffer to host memory: 100-300 MB/s
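
Worked example (illustrative, using mid-range numbers from the laptop disk above): a random read of one 4 KB chunk costs roughly 10 ms of seek, about 4 ms of rotation delay, and only 4 KB / 75 MB/s ≈ 0.05 ms of transfer, i.e. about 14 ms in total, or roughly 0.3 MB/s of effective throughput; sequential reads pay the seek and rotation once and then stream at the platter-to-buffer rate, which is why filesystems try hard to lay related data out contiguously.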

Disk IO storage devices Disk IO is in blocks (sectors): slower, more bursty (wrt memory) Causes the filesystem to be implemented in blocks also Disk geometry strongly affects filesystem layout Disk blocks usually become corrupted/unusable over time Filesystem layout must compensate for this redundancy (duplication) error-correcting codes migration without stopping

Disk Scheduling storage devices FIFO: terrible: lots of head movement SSTF (shortest seek time first) favors middle requests; can starve edge requests SCAN (elevator) sweep from inner to outer, until no requests in this direction sweep from outer to inner, until no requests in this direction CSCAN: like SCAN but in only one direction fairer, less chance of starving a sparsely-requested track R-SCAN / R-CSCAN allow minor deviations in direction to exploit rotational delays
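
A small sketch of the SCAN (elevator) policy: sort the pending track requests, sweep outward from the head position, then sweep back. The request list and head position are illustrative; a real scheduler works incrementally on a live queue.

```c
/* Hedged sketch of SCAN (elevator) scheduling over a fixed request list. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Print the service order for one outward-then-inward sweep. */
static void scan(int head, int *req, int n) {
    qsort(req, n, sizeof *req, cmp_int);
    int i = 0;
    while (i < n && req[i] < head) i++;        /* first request at or beyond head */
    for (int j = i; j < n; j++)                /* sweep toward the outer edge     */
        printf("service track %d\n", req[j]);
    for (int j = i - 1; j >= 0; j--)           /* reverse and sweep back inward   */
        printf("service track %d\n", req[j]);
}

int main(void) {
    int pending[] = {98, 183, 37, 122, 14, 124, 65, 67};   /* illustrative tracks */
    scan(53, pending, 8);                                   /* head at track 53    */
    return 0;
}
```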

Flash Devices storage devices RAM with a floating insulated gate No moving parts more robust than disk to impacts consumes less energy uniform access time // good for random access Wears out with repeated usage Unused device can lose charge over time (years?) Read/write operations done in blocks (12KB to 128KB) tens of microseconds // if erased area available for writes Write requires erased area Erase operation erasure block (few MB) tens of milliseconds so maintain a large erased area

Outline fs implementation Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

Overview of Fs Implementation 1 fs implementation Define a mapping of filesystem to device blocks and implement filesystem operations performance: random and sequential data access reliability: in spite of device failures accommodate different devices (blocks, size, speed, geometry,...) The term filesystem refers to two distinct entities 1. filesystem interface, ie, filesystem seen by users 2. filesystem implementation Here are some abbreviations (used as qualifiers) fs: filesystem fs-int: filesystem interface // eg, fs-int node is directory or file fs-imp: filesystem implementation

Overview of Fs Implementation 2 fs implementation Fs implementation usually starts at the second block in the device [The first device block is usually reserved for the boot sector] Fs implementation organizes the device in fs blocks 0, 1, ... Fs block = one or more device blocks, or vice versa henceforth, block without qualifier means fs block Fs implementation is a graph over blocks, rooted at a special block (the superblock) [In contrast, fs interface is a graph over files and directories] Each file or directory x maps to a subgraph of the fs-imp graph, attached to the fs-imp graph at a root block subgraph's blocks hold x's data, x's metadata, pointers to subgraph's blocks pointers to the roots of other subgraphs if x is a directory

Overview of Fs Implementation 3 fs implementation [Figure: the fs interface (a graph over directory/file nodes, with operations such as mount/unmount, create/link/unlink/delete, open/read/write/sync/close) maps to the fs implementation (a graph over fs blocks rooted at the superblock, where each fs block holds metadata, data, pointers to data, and, for a directory-node subgraph, pointers to other subgraphs), which in turn maps to device blocks following the boot block.]

Superblock fs implementation First fs block read by OS when mounting a filesystem Usually in the first few device blocks after the boot sector Identifies the filesystem type and the info to mount it magic number, indicating filesystem type fs blocksize (vs device blocksize) size of disk (in fs blocks). pointer (block #) to the root of / directory's subgraph pointer to list of free blocks (perhaps) pointer to array of roots of subgraphs...
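
A made-up in-memory superblock struct mirroring the fields listed above; this is only a sketch of the idea, since each real filesystem (FFS, ext2, NTFS, ...) defines its own on-disk layout and magic numbers.

```c
/* Hedged sketch: illustrative superblock fields, not any real on-disk format. */
#include <stdint.h>

struct superblock {
    uint32_t magic;            /* identifies the filesystem type              */
    uint32_t fs_block_size;    /* fs blocksize (may differ from device's)     */
    uint64_t total_fs_blocks;  /* size of the disk in fs blocks               */
    uint64_t root_dir_block;   /* block # of the / directory's subgraph root  */
    uint64_t free_list_block;  /* block # of the free-block list (perhaps)    */
    uint64_t subgraph_roots;   /* block # of an array of subgraph roots       */
};
```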

Subgraph for fs-int node (file or directory) fs implementation Node name(s) // multiple names/paths due to links Node metadata size owner, access,... creation time,...... Pointers to fs blocks containing node's data If node is a file, the data is the file's data If node is a directory, the data is a table of directory entries table may be unordered, sorted, hashed,... depends on number and size of entries, desired performance,... A directory entry points to the entry's subgraph

Data Pointers Organization 1 fs implementation Organization of the pointers from subgraph root to data blocks Contiguous data is placed in consecutive fs blocks subgraph root has starting fs block and data size pros: random access cons: external fragmentation; poor tolerance to bad blocks Linked-list each fs block of data has a pointer to the next fs block of data (or nil, if no more data) subgraph root has starting fs block and data size pros: no fragmentation; good tolerance to bad blocks cons: poor random access

Data Pointers Organization 2 fs implementation Separate linked-list keep all the pointers (w/o data) in one or more fs blocks first load those fs blocks into memory then the fs block for any data chunk is located by traversing a linked-list in memory (rather than disk) pros: no fragmentation; good tolerance to bad blocks; good random access (as long as file size is not too big) Indexed have an array of pointers have multiple levels of pointers for accommodating large files ie, a pointer points to a fs block of pointers pros: no fragmentation; good tolerance to bad blocks; good random access (logarithmic cost in data size)
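
A sketch of indexed data pointers: translating a byte offset into a block number through 12 direct pointers plus one single-indirect block. The 4 KB block size, the mini_inode struct, and the read_block helper are assumptions for illustration, and the offset is assumed to fall within the single-indirect range.

```c
/* Hedged sketch of an FFS-style indexed lookup (direct + single-indirect). */
#include <stdint.h>

#define BLOCK_SIZE   4096u
#define PTRS_PER_BLK (BLOCK_SIZE / sizeof(uint32_t))   /* 1024 pointers per block */
#define NDIRECT      12

struct mini_inode {
    uint32_t direct[NDIRECT];   /* block #s of the first 12 data blocks */
    uint32_t single_indirect;   /* block # of a block full of pointers  */
};

/* Assumed helper: read one fs block from disk into buf. */
void read_block(uint32_t block_no, void *buf);

/* Map a byte offset in the file to the fs block holding it. */
uint32_t offset_to_block(const struct mini_inode *ino, uint64_t offset) {
    uint64_t idx = offset / BLOCK_SIZE;
    if (idx < NDIRECT)
        return ino->direct[idx];              /* direct pointer: no extra read */
    idx -= NDIRECT;
    uint32_t ptrs[PTRS_PER_BLK];
    read_block(ino->single_indirect, ptrs);   /* one extra disk read           */
    return ptrs[idx];                         /* single-indirect pointer       */
}
```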

For Good Performance fs implementation Want large numbers of consecutive reads or writes So put related info in nearby blocks Large buffers (to minimize read misses) Batched writes: need a large queue of writes

Reliability: overcoming block failures fs implementation Redundancy within a disk error-detection/correction code (EDC/ECC) in each disk block map each fs block to multiple disk blocks dynamically remap within a disk to bypass failing areas Redundancy across disks map each fs block to blocks in different disks EDC/ECC in each disk block Or EDC/ECC for a fs block across disks // eg, RAID

Reliability: Atomicity via Copy-On-Write fs implementation When a user modifies a file or directory f identify the part, say X, of the fs-imp graph to be modified write the new value of X in fresh blocks, say Y attach Y to the fs-imp graph // step with low prob of failure detach X and garbage collect its blocks note that X typically includes part of the path(s) from the fs-imp root to f's subgraph Consider a path a_0, ..., a_N, where a_0 is the fs-imp root and a_N holds data modified by the user Then for some point a_J in the path the new values of the suffix a_J, ..., a_N are in new blocks the new values of the prefix a_0, ..., a_{J-1} are in the old blocks
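
A hedged sketch of the copy-on-write update along a path a_0, ..., a_N: write the new leaf into a fresh block, rewrite each ancestor bottom-up to point at its fresh child, and finally install the new root pointer in one small step. All helper functions (alloc_block, copy_block_with_child, write_new_data_block, install_root_pointer) and the fixed path depth are assumptions.

```c
/* Hedged sketch of a copy-on-write path update; helpers are assumed. */
#include <stdint.h>

#define PATH_LEN 4   /* a_0 (root) ... a_3 (data block); illustrative depth */

uint32_t alloc_block(void);                                           /* fresh block #     */
void write_new_data_block(uint32_t new_blk, const void *data, uint32_t len);
void copy_block_with_child(uint32_t old_blk, uint32_t new_child, uint32_t new_blk);
void install_root_pointer(uint32_t new_root_blk);                     /* the atomic step   */

void cow_update(uint32_t path[PATH_LEN], const void *data, uint32_t len) {
    uint32_t fresh[PATH_LEN];
    for (int i = 0; i < PATH_LEN; i++)
        fresh[i] = alloc_block();

    /* 1. write the new leaf (a_N) into a fresh block */
    write_new_data_block(fresh[PATH_LEN - 1], data, len);

    /* 2. rewrite each ancestor bottom-up so it points at its fresh child */
    for (int i = PATH_LEN - 2; i >= 0; i--)
        copy_block_with_child(path[i], fresh[i + 1], fresh[i]);

    /* 3. single low-probability-of-failure step: switch to the new root;
     *    the old blocks path[0..N] can then be garbage collected */
    install_root_pointer(fresh[0]);
}
```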

Reliability: Atomicity via Logging fs implementation Maintain log of requested operations When user issues a sequence of disk operations add records (one for each operation) to log add commit record after last operation asynchronously write them to disk erase from log after completion Upon recovery from crash, (re)do all operations in log operations are idempotent (repeating is ok)
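
A hedged sketch of the logging idea: each record describes an idempotent block write, a commit record closes the batch, and recovery simply redoes committed records. The record format and the log_append/write_block helpers are illustrative assumptions.

```c
/* Hedged sketch of redo logging with a commit record; helpers are assumed. */
#include <stdint.h>
#include <string.h>

#define LOG_DATA   1
#define LOG_COMMIT 2

struct log_rec {
    uint32_t type;            /* LOG_DATA or LOG_COMMIT          */
    uint32_t block_no;        /* target fs block (LOG_DATA only) */
    uint8_t  data[4096];      /* new block contents (LOG_DATA)   */
};

void log_append(const struct log_rec *r);          /* durable append to the log area */
void write_block(uint32_t block_no, const void *buf);

/* 1. log each operation; buf is assumed to hold one full fs block. */
void logged_write(uint32_t block_no, const void *buf) {
    struct log_rec r = { .type = LOG_DATA, .block_no = block_no };
    memcpy(r.data, buf, sizeof r.data);
    log_append(&r);
}

/* 2. commit record after the last operation; in-place writes may then
 *    proceed asynchronously, and log entries are erased after completion. */
void commit(void) {
    struct log_rec c = { .type = LOG_COMMIT };
    log_append(&c);
}

/* Recovery: redo every committed LOG_DATA record; repeating is safe. */
void replay_one(const struct log_rec *r) {
    if (r->type == LOG_DATA)
        write_block(r->block_no, r->data);
}
```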

Hierarchy in Filesystem fs implementation Virtual filesystem: optional memory-only framework on which to mount real filesystems Mounted Filesystem(s) real filesystems, perhaps of different types Block cache cache filesystem blocks: performance, sharing,... Block device wrapper for the various block devices with filesystems Device drivers for the various block devices

GeekOS: Hierarchy in Filesystem fs implementation

Outline FAT Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

FAT32 Overview 1 FAT FAT: MS-DOS filesystem simple, but no good for scalability, hard links, reliability,... currently used only on simple storage devices: floppy, flash,... Disk divided into following regions Boot sector: device block 0 BIOS parameter block OS boot loader code Filesystem info sector: device block 1 signatures, fs type, pointers to other sections fs blocksize, # free fs blocks, # last allocated fs block... FAT: fs blocks 0 and 1; corresponds to the superblock Data region: rest of the disk, organized as an array of fs blocks holds the data of the fs-int nodes (ie, files and directories)

FAT32 Overview 2 FAT Each block in the data region is either free or bad or holds data (of a file or directory) FAT: array with an entry for each block in data region entries j_0, j_1, ... form a chain iff blocks j_0, j_1, ... hold successive data (of a fs-int node) Entry n contains constant, say FREE, if block n is free constant, say BAD, if block n is bad (ie, unusable) 32-bit number, say x, if block n holds data (of a fs-int node) and block x holds the succeeding data (of the fs-int node) constant, say END, if block n holds the last data chunk Root directory table: typically at start of data region (block 2)
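
A small sketch of following a FAT chain in an in-memory copy of the FAT; the sentinel constants are simplified stand-ins for the reserved 32-bit values real FAT32 uses for free, bad, and end-of-chain entries, and the chain is assumed to be well formed.

```c
/* Hedged sketch: walk a FAT chain from a file's first data block. */
#include <stdint.h>
#include <stdio.h>

#define FAT_FREE 0x00000000u
#define FAT_BAD  0x0FFFFFF7u   /* simplified stand-in for the BAD marker        */
#define FAT_END  0x0FFFFFF8u   /* treat anything >= this as end of chain        */

void walk_chain(const uint32_t *fat, uint32_t first_block) {
    uint32_t b = first_block;
    while (b < FAT_BAD) {            /* stop at the BAD or END sentinels */
        printf("data block %u\n", b);
        b = fat[b];                  /* entry n names the next block in the chain */
    }
}
```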

FAT32 Overview 3 FAT Directory entry: 32 bytes name (8) extension (3) attributes (1) read-only, hidden, system, volume label, subdirectory, archive, device reserved (10) last modification time (2) last modification date (2) starting fs block # of the entry's data size of entry's data (4) Hard links??
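
A hedged C rendering of the 32-byte directory entry described above. It follows the slide's field list; real FAT32 actually stores the starting block number as a 32-bit value split across two 16-bit fields, so treat this as a simplification.

```c
/* Hedged sketch of a 32-byte FAT directory entry (simplified field layout). */
#include <stdint.h>

#pragma pack(push, 1)
struct fat_dirent {
    char     name[8];        /* name                                */
    char     ext[3];         /* extension                           */
    uint8_t  attributes;     /* read-only, hidden, system, ...      */
    uint8_t  reserved[10];   /* reserved                            */
    uint16_t mtime;          /* last modification time              */
    uint16_t mdate;          /* last modification date              */
    uint16_t start_block;    /* first fs block of the entry's data  */
    uint32_t size;           /* size of the entry's data in bytes   */
};                           /* 8+3+1+10+2+2+2+4 = 32 bytes         */
#pragma pack(pop)
```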

Outline FFS Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

FFS Layout FFS Boot blocks // few blocks at start Superblock // after boot blocks magic number filesystem geometry (eg, locations of groups) filesystem statistics/tuning params Groups // cylinder groups each consisting of backup copy of superblock header with statistics free space bit map... array of inodes holds metadata and data pointers of fs-int nodes # inodes fixed at format time array of data blocks

FFS Inodes FFS Inodes are numbered sequentially starting at 0 inodes 0 and 1 are reserved inode 2 is the root directory's inode An inode is either free or not Non-free inode holds metadata and data pointers of a fs-int node owner id type (directory, file, device,...) access modes reference count // # of hard links size and # blocks of data 15 pointers to data blocks 12 direct 1 single-indirect, 1 double-indirect, 1 triple-indirect
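
Worked example (illustrative, assuming 4 KB fs blocks and 4-byte block pointers, so 1024 pointers per indirect block): the 12 direct pointers cover 48 KB, the single-indirect block adds 1024 x 4 KB = 4 MB, the double-indirect adds 1024^2 x 4 KB = 4 GB, and the triple-indirect adds 1024^3 x 4 KB = 4 TB; small files need no extra block reads, while the largest files cost at most three extra reads per data block.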

FFS: fs-int node asymmetric tree (max depth 4) FFS [Figure: the inode's metadata and 15 pointers form an asymmetric tree: direct pointers reach data blocks immediately, while the single-, double-, and triple-indirect pointers go through one, two, or three levels of indirect blocks.]

FFS Directory entries FFS The data blocks of a directory node hold directory entries A directory entry is not split across data blocks [Figure: directory entry -> inode -> data/pointer blocks] Directory entry for a fs-int node # of the inode of the fs-int node // hard link size of node data length of node name (up to 255 bytes) entry name Multiple directory entries can point to the same inode

User Ids FFS Every user account has a user id (uid) Root user (aka superuser, admin) has uid of 0 Processes and filesystem entries have associated uids indicates owners determines access processes have to filesystem entries determines which processes can be signalled by a process

Process Ids FFS Every process has two associated uids effective user id (euid) uid of user on whose behalf it is currently executing determines its access to filesystem entries real uid (ruid) uid of the process's owner determines which processes it can signal: x can signal y only if x is superuser or x.ruid = y.ruid

Process Ids FFS Process is created: ruid/euid ← creating process's euid Process with euid 0 executes SetUid(z): ruid/euid ← z no effect if process has non-zero euid Example SetUid usage login process has euid 0 (to access auth info files) upon successful login, it starts a shell process (with euid 0) shell executes SetUid(authenticated user's uid) When a process executes a file f with setuid bit set: its euid is set to f's owner's uid while it is executing f. Upon bootup, the first process (init) runs with uid of 0 it spawns all other processes directly or indirectly
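
A minimal sketch of the login example: a process running with euid 0 authenticates a user, permanently drops to that user's uid with setuid, and then executes the user's shell. The uid value and shell path are illustrative.

```c
/* Hedged sketch: drop from euid 0 to an authenticated user's uid, then exec. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    uid_t user = 1000;                        /* uid of the authenticated user (illustrative) */
    if (geteuid() != 0) {
        fprintf(stderr, "need euid 0 to switch to another uid\n");
        return 1;
    }
    if (setuid(user) != 0) {                  /* ruid/euid <- user; cannot be undone */
        perror("setuid");
        return 1;
    }
    execl("/bin/sh", "sh", (char *)NULL);     /* the shell now runs with the user's ids */
    perror("execl");
    return 1;
}
```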

Directory entry's uids and permissions FFS Every directory entry has three classes of users: owner (aka user) group (owner need not be in this group) others (users other than owner or group) Each class's access is defined by three bits: r, w, x For a file: r: read the file w: modify the file x: execute the file For a directory: r: read the names (but not attributes) of entries in the directory w: modify entries in the directory (create, delete, rename) x: access an entry's contents and metainfo When a directory entry is created: attributes are set according to the creating process's attributes (euid, umask, etc)
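
A small illustration of the three permission classes using chmod and stat; the file name and mode are made up (0640 gives the owner rw-, the group r--, and others no access).

```c
/* Hedged sketch: set and read back the owner/group/others r,w,x bits. */
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    if (chmod("report.txt", 0640) != 0) {     /* owner: rw-, group: r--, others: --- */
        perror("chmod");
        return 1;
    }
    struct stat st;
    if (stat("report.txt", &st) == 0)
        printf("mode bits: %o\n", (unsigned)(st.st_mode & 0777));
    return 0;
}
```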

Directory entry's setuid bit FFS Each directory entry also has a setuid bit. If an executable file has setuid set and a process (with execute access) executes it, the process's euid changes to the file's owner's uid while executing the file. Typically, the executable file's owner is root, allowing a normal user to get root privileges while executing the file This is a high-level analog of system calls

Directory entry's sticky bit FFS Each directory entry also has a sticky bit. Executable file with sticky bit set: hint to the OS to retain the text segment in swap space after the process executes For a directory with sticky bit set: a user with wx access to the directory can rename/delete an entry x in it only if the user is x's owner (or superuser) Usually set on /tmp directory.

Directory entry's setgid bit FFS Unix has the notion of groups of users A group is identified by a group id, abbreviated gid A gid defines a set of uids A user account can be in different groups, i.e., have multiple gids Process has effective gid (egid) and real gid (rgid) play a similar role as euid and ruid A directory entry has a setgid bit plays a similar role to setuid for executables plays an entirely different role for directories

Outline NTFS Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text

NTFS Index Structure NTFS Master File Table (MFT) corresponds to FFS inode array holds an array of 1KB MFT records MFT Record: sequence of variable-size attribute records Std info attribute: owner id, creation/mod/... times, security,... File name attribute: file name and number Data attribute record data itself (if small enough) // resident or list of data extents (if not small enough) // non-resident Attribute list pointers to attributes in this or other MFT records pointers to attribute extents needed if attributes do not fit in one MFT record eg, highly-fragmented and/or highly-linked
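
A purely illustrative toy struct for the ideas in an MFT record (attributes that are either resident or non-resident); this is not the real on-disk NTFS layout, just a sketch of the concepts named above.

```c
/* Hedged, toy model of an MFT record: NOT the real NTFS on-disk format. */
#include <stdint.h>

struct extent { uint64_t start_block, length; };   /* one run of data blocks */

struct attr_record {
    uint16_t type;            /* std info, file name, data, attribute list, ... */
    uint8_t  non_resident;    /* 0: value stored here; 1: extents stored here   */
    uint32_t value_len;       /* bytes of resident value                        */
    union {
        uint8_t       value[256];   /* resident: data kept inside the record    */
        struct extent runs[8];      /* non-resident: extents holding the data   */
    } u;
};

struct mft_record {
    uint64_t record_no;            /* index in the MFT array                    */
    uint16_t nattrs;
    struct attr_record attrs[4];   /* variable-size in reality; fixed-size here */
};                                 /* real MFT records are 1 KB each            */
```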

Example: Files with single MFT record NTFS [Figure: in the Master File Table, File A's MFT record holds std info, file name, and its data as a resident attribute inside the record; File B's record holds std info, file name, and a non-resident data attribute whose start/length pairs point to its two data extents on disk.]

Example: File with three MFT records NTFS [Figure: File C's base MFT record holds std info and an attribute list (itself partly in an extent) whose entries point to attributes stored in two further MFT records; the file name and data attributes in those records in turn point to File C's data extents.]

Outline COW Filesystem Interface Persistent Storage Devices Filesystem Implementation FAT Filesystem FFS: Unix Fast Filesystem NTFS: Microsoft Filesystem Copy-On-Write: Slides from OS:P&P text