Operating Systems: Filesystems Shankar April 25, 2015
Outline  [fs interface]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
Overview: Filesystem Interface
- Persistent structure of directories and files
  - File: named sequence of bytes
  - Directory: zero or more pointers to directories/files
  - Persistent: non-volatile storage device: disk, SSD, ...
- Structure organized as a tree or acyclic graph
  - nodes: directories and files
  - root directory: /
  - path to a node: /a/b/...
  - acyclic: more than one path to a file (but not to a directory)
    - problematic for disk utilities, backup
- Node metadata: owner, creation time, allowed access, ...
- Users can create/delete nodes, modify content/metadata
- Examples: FAT, UFS, NTFS, ZFS, ..., PFAT, GOSFS, GSFS2
Node
- Name(s)
  - a name is the last entry in a path to the node
    - eg, path /a/b vs name b
  - subject to size and char limits
- Type: directory or file or device or ...
- Size: subject to limit
  - a directory may have a separate limit on number of entries
- Time of creation, modification, last access, ...
- Content type (if file): eg, text, pdf, jpeg, mpeg, executable, ...
- Owner
- Access for owner, other users, ...: eg, r, w, x, setuid, ...
Operations on Filesystems
- Format(dev)
  - create an empty filesystem on device dev (eg, disk, flash, ...)
- Mount(fstype, dev)
  - attach (to computer) filesystem of type fstype on device dev
  - returns a path to the filesystem (eg, volume, mount point, ...)
  - after this, processes can operate on the filesystem
- Unmount(path)
  - detach (from computer) filesystem at path  // finish all io
  - after this, the filesystem is inert in its device, inaccessible
Operations on Attached Filesystem 1
- Create(path), CreateDir(path)
  - create a file/directory node at the given path
- Link(existingPath, newPath)
  - create a (hard) link to an existing file node
- Delete(path)  // aka Unlink(path)
  - delete the given path to the file node at path
  - delete the file node if no more links to it
- DeleteDir(path)
  - delete the directory node at path  // must be empty
- Change attributes (name, metadata) of node at path
  - eg, stat, touch, chown/chgrp, chmod, rename/mv
Operations on Attached Filesystem 2
- Open(path, access), OpenDir(path, access)
  - open the node at path with given access (r, w, ...)
  - returns a file descriptor  // prefer: node descriptor
  - after this, node can be operated on
- Close(nd), CloseDir(nd)
  - close the node associated with node descriptor nd
- Read(nd, file range, buffer), ReadDir(nd, dir range, buffer)
  - read the given range from open node nd into the given buffer
  - returns number of bytes/entries read
- Write(nd, file range, buffer)
  - write buffer contents into the given range of open file node nd
  - returns number of bytes written
Operations on Attached Filesystem 3
- Seek(nd, file location), SeekDir(nd, entry)
  - move r/w position to the given location/entry
- MemMap(nd, file range, mem range)
  - map the given range of file nd to the given range of memory
- MemUnmap(nd, file range, mem range)
  - unmap the given range of file nd from the given range of memory
- Sync(nd)
  - complete all pending io for open node nd
Consistency of Shared Files
- Shared file: one that is opened by several processes concurrently
- Consistency of the shared file: when does a write by one process become visible in the reads by the other processes
- Various types of consistency (from strong to weak)
  - visible when write returns
  - visible when sync returns
  - visible when close returns
- Single-processor system: all types of consistency easily achieved
- Multi-processor system: strong notions are expensive/slow to achieve
Reliability
- Disks, though non-volatile, are susceptible to failures
  - magnetic / mechanical / electronic parts wear out
  - power loss in the middle of a filesystem operation
- Filesystem should work in spite of failures, to some extent
  - small disk-area failures: no loss/degradation
  - large disk-area failures: filesystem remains a consistent subtree
    - eg, files may be lost, but no undetected corrupted files/links
  - power outage: each ongoing filesystem operation remains atomic
    - ie, either completes or has no effect on the filesystem
Outline  [storage devices]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
Disk Drive: Geometry
- Platter(s) fixed to rotating spindle
- Platter has two surfaces
- Surface has concentric tracks
- Track has sectors: more in outer tracks than inner
- Sector: fixed capacity
- Movable arm with fixed r/w heads, one per surface
- Buffer memory
- Eg, laptop disk (2011)
  - 1 or 2 platters
  - spindle speed: 4200-15000 rpm, 15-4 ms/rotation
  - diameter: 2.5 in
  - track width: < 1 micron
  - sector: 512 bytes
  - buffer: 16 MB
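As a rough sketch of how the geometry above determines capacity (ignoring zoned recording, where outer tracks hold more sectors than inner ones; the drive parameters below are hypothetical):

```python
def disk_capacity(platters, tracks_per_surface, sectors_per_track, sector_bytes=512):
    """Simplified capacity model: every track has the same sector count."""
    surfaces = 2 * platters  # each platter has two recording surfaces
    return surfaces * tracks_per_surface * sectors_per_track * sector_bytes

# hypothetical small drive: 2 platters, 100k tracks/surface, 1000 sectors/track
cap = disk_capacity(platters=2, tracks_per_surface=100_000, sectors_per_track=1000)
# cap == 204_800_000_000 bytes, i.e. about 204.8 GB
```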
Disk Access Time
- Disk access time = seek + rotation + transfer
- Seek time: (moving r/w head) + (electronic settling)
  - minimum: target is next track; settling only
  - maximum: target is at other end
- Rotation delay: half-rotation delay on average
- Transfer time: platter <-> buffer
- Transfer time: buffer <-> host memory
- Eg, laptop disk (2011)
  - min seek: 0.3-1.5 ms
  - max seek: 10-20 ms
  - rotation delay: 7.5-2 ms
  - platter <-> buffer: 50 (inner) - 100 (outer) MB/s
  - buffer <-> host memory: 100-300 MB/s
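The seek + rotation + transfer sum above can be sketched numerically; the parameters in the example call (10 ms average seek, 7200 rpm, 100 MB/s) are hypothetical, not from the slide:

```python
def avg_access_time_ms(avg_seek_ms, rpm, bytes_to_read, transfer_mb_per_s):
    """Average time to read one chunk: seek + half a rotation + transfer."""
    rotation_ms = (60_000 / rpm) / 2                      # avg rotational delay
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1e6) * 1e3
    return avg_seek_ms + rotation_ms + transfer_ms

t = avg_access_time_ms(avg_seek_ms=10, rpm=7200, bytes_to_read=4096,
                       transfer_mb_per_s=100)
# roughly 14.2 ms: seek dominates, transfer of one 4KB block is negligible
```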
Disk IO
- Disk IO is in blocks (sectors): slower, more bursty (wrt memory)
  - causes the filesystem to be implemented in blocks also
- Disk geometry strongly affects filesystem layout
- Disk blocks usually become corrupted/unusable over time
  - filesystem layout must compensate for this
    - redundancy (duplication)
    - error-correcting codes
    - migration without stopping
Disk Scheduling
- FIFO: terrible: lots of head movement
- SSTF (shortest seek time first)
  - favors middle requests; can starve edge requests
- SCAN (elevator)
  - sweep from inner to outer, until no requests in this direction
  - sweep from outer to inner, until no requests in this direction
- CSCAN: like SCAN but in only one direction
  - fairer: less chance of starving a sparsely-requested track
- R-SCAN / R-CSCAN
  - allow minor deviations in direction to exploit rotational delays
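A minimal sketch of the C-SCAN ordering described above (cylinder numbers only; seek/rotation costs are ignored):

```python
def cscan_order(head, requests):
    """C-SCAN: service requests at or above the head position in increasing
    order, then wrap around and service the remaining low cylinders,
    again in increasing order (one sweep direction only)."""
    upward = sorted(r for r in requests if r >= head)
    wrapped = sorted(r for r in requests if r < head)
    return upward + wrapped

cscan_order(50, [10, 70, 20, 90])   # [70, 90, 10, 20]
```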
Flash Devices
- RAM with a floating insulated gate
- No moving parts
  - more robust than disk to impacts
  - consumes less energy
  - uniform access time  // good for random access
- Wears out with repeated usage
- Unused device can lose charge over time (years?)
- Read/write operations done in blocks (12KB to 128KB)
  - tens of microseconds  // if erased area available for writes
- Write requires erased area
- Erase operation
  - erasure block (few MB)
  - tens of milliseconds
  - so maintain a large erased area
Outline  [fs implementation]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
Overview of Fs Implementation 1
- Define a mapping of the filesystem to device blocks and implement the filesystem operations
  - performance: random and sequential data access
  - reliability: in spite of device failures
  - accommodate different devices (blocks, size, speed, geometry, ...)
- The term "filesystem" refers to two distinct entities
  1. filesystem interface, ie, filesystem seen by users
  2. filesystem implementation
- Here are some abbreviations (used as qualifiers)
  - fs: filesystem
  - fs-int: filesystem interface  // eg, fs-int node is directory or file
  - fs-imp: filesystem implementation
Overview of Fs Implementation 2
- Fs implementation usually starts at the second block in the device
  - [the first device block is usually reserved for the boot sector]
- Fs implementation organizes the device in fs blocks 0, 1, ...
  - fs block = one or more device blocks, or vice versa
  - henceforth, "block" without qualifier means fs block
- Fs implementation is a graph over blocks, rooted at a special block (the superblock)
  - [in contrast, the fs interface is a graph over files and directories]
- Each file or directory x maps to a subgraph of the fs-imp graph, attached to the fs-imp graph at a root block
  - subgraph's blocks hold x's data, x's metadata, pointers to the subgraph's blocks
  - pointers to the roots of other subgraphs if x is a directory
Overview of Fs Implementation 3
[Figure: the fs interface (graph over nodes /, a, b, c, ...) maps to the fs implementation (graph over fs blocks: superblock, /'s subgraph, a's subgraph, ...; each fs block holds metadata, data, pointers to data, and, for a dir-node subgraph, pointers to other subgraphs), which maps to device blocks 0, 1, 2, ... after the boot sector; interface operations (mount, unmount, create, link, unlink, delete, open, read, write, sync, close) map to their implementations]
Superblock
- First fs block read by the OS when mounting a filesystem
- Usually in the first few device blocks, after the boot sector
- Identifies the filesystem type and the info to mount it
  - magic number, indicating filesystem type
  - fs blocksize (vs device blocksize)
  - size of disk (in fs blocks)
  - pointer (block #) to the root of the / directory's subgraph
  - pointer to list of free blocks (perhaps)
  - pointer to array of roots of subgraphs
  - ...
Subgraph for a fs-int node (file or directory)
- Node name(s)  // multiple names/paths due to links
- Node metadata
  - size
  - owner, access, ...
  - creation time, ...
- Pointers to fs blocks containing the node's data
  - if the node is a file, the data is the file's data
  - if the node is a directory, the data is a table of directory entries
    - table may be unordered, sorted, hashed, ...
    - depends on number and size of entries, desired performance, ...
  - a directory entry points to the entry's subgraph
Data Pointers Organization 1
- Organization of the pointers from subgraph root to data blocks
- Contiguous
  - data is placed in consecutive fs blocks
  - subgraph root has starting fs block and data size
  - pros: random access
  - cons: external fragmentation; poor tolerance to bad blocks
- Linked-list
  - each fs block of data has a pointer to the next fs block of data (or nil, if no more data)
  - subgraph root has starting fs block and data size
  - pros: no fragmentation; good tolerance to bad blocks
  - cons: poor random access
Data Pointers Organization 2
- Separate linked-list
  - keep all the pointers (w/o data) in one or more fs blocks
  - first load those fs blocks into memory
  - then the fs block for any data chunk is located by traversing a linked-list in memory (rather than on disk)
  - pros: no fragmentation; good tolerance to bad blocks; good random access (as long as file size is not too big)
- Indexed
  - have an array of pointers
  - have multiple levels of pointers to accommodate large files
    - ie, a pointer points to a fs block of pointers
  - pros: no fragmentation; good tolerance to bad blocks; good random access (logarithmic cost in data size)
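The random-access costs above can be compared by counting disk reads needed to reach logical block N of a file; this is a back-of-the-envelope sketch, not any particular filesystem's layout:

```python
import math

def reads_linked(block_no):
    """On-disk linked list: must read every block in the chain up to the target."""
    return block_no + 1

def reads_indexed(block_no, ptrs_per_block):
    """Multi-level index: one read per index level plus the data block itself.
    Level count grows logarithmically in the block number."""
    if block_no < ptrs_per_block:
        levels = 1
    else:
        levels = math.ceil(math.log(block_no + 1, ptrs_per_block))
    return levels + 1

# reaching block 2000 of a file: ~2001 reads linked, but only 3 indexed
# (with 1024 pointers per index block)
```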
For Good Performance
- Want large numbers of consecutive reads or writes
- So put related info in nearby blocks
- Large buffers (to minimize read misses)
- Batched writes: need a large queue of writes
Reliability: Overcoming Block Failures
- Redundancy within a disk
  - error-detection/correction code (EDC/ECC) in each disk block
  - map each fs block to multiple disk blocks
  - dynamically remap within a disk to bypass failing areas
- Redundancy across disks
  - map each fs block to blocks in different disks
  - EDC/ECC in each disk block
  - or EDC/ECC for a fs block across disks  // eg, RAID
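The across-disks case can be illustrated with XOR parity in the style of RAID-4/5: a parity block is the XOR of the data blocks, and any single lost block is rebuilt by XOR-ing the parity with the survivors. A minimal sketch with two-byte toy blocks:

```python
def parity_block(blocks):
    """XOR all blocks together, byte by byte (all blocks same length)."""
    p = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            p[i] ^= byte
    return bytes(p)

d0, d1, d2 = b'\x01\x02', b'\x0f\x00', b'\x10\x20'
p = parity_block([d0, d1, d2])            # stored on a separate disk
recovered_d1 = parity_block([d0, d2, p])  # rebuild d1 after losing its disk
```

XOR works here because XOR-ing a value in twice cancels it out, so parity XOR survivors leaves exactly the missing block.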
Reliability: Atomicity via Copy-On-Write
- When a user modifies a file or directory f
  - identify the part, say X, of the fs-imp graph to be modified
  - write the new value of X in fresh blocks, say Y
  - attach Y to the fs-imp graph  // step with low prob of failure
  - detach X and garbage collect its blocks
  - note that X typically includes part of the path(s) from the fs-imp root to f's subgraph
- Consider a path a_0, ..., a_N, where a_0 is the fs-imp root and a_N holds data modified by the user
  - then for some point a_J in the path:
    - the new values of the suffix a_J, ..., a_N are in new blocks
    - the new values of the prefix a_0, ..., a_{J-1} are in the old blocks
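The path-rewriting step above can be sketched on a toy block store (a dict mapping hypothetical block ids to directory/file nodes; block allocation, on-disk layout, and garbage collection are all elided):

```python
import itertools

def cow_write(store, root, names, new_data, fresh=itertools.count(100)):
    """Copy-on-write update of the node reached from `root` via `names`.
    Old blocks are never modified; a whole new root-to-leaf path is written
    into fresh blocks, and the new root id is returned. Publishing that
    root id is the single (low-failure-probability) atomic step."""
    # resolve the path root -> leaf through the old blocks
    ids = [root]
    for name in names:
        ids.append(store[ids[-1]]['children'][name])
    # write the new leaf, then copy each ancestor into a fresh block,
    # repointing it at the freshly written child
    new_id = next(fresh)
    store[new_id] = {'children': {}, 'data': new_data}
    for old_id, name in zip(reversed(ids[:-1]), reversed(names)):
        copy = {'children': dict(store[old_id]['children']),
                'data': store[old_id]['data']}
        copy['children'][name] = new_id
        new_id = next(fresh)
        store[new_id] = copy
    return new_id
```

A crash before the new root is published leaves the old tree fully intact, which is exactly the atomicity property claimed above.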
Reliability: Atomicity via Logging
- Maintain a log of requested operations
- When a user issues a sequence of disk operations
  - add records (one for each operation) to the log
  - add a commit record after the last operation
  - asynchronously write them to disk
  - erase from the log after completion
- Upon recovery from a crash, (re)do all operations in the log
  - operations are idempotent (repeating is ok)
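A minimal sketch of the recovery step: replay only operations that belong to committed sequences. The log-record shapes (`('write', txn, block, value)` and `('commit', txn)`) are assumptions for illustration.

```python
def replay(log, disk):
    """Redo recovery: apply writes only from transactions whose commit
    record made it into the log. Writes set a block to a value, so they
    are idempotent and replaying the log twice is harmless."""
    committed = {rec[1] for rec in log if rec[0] == 'commit'}
    for rec in log:
        if rec[0] == 'write' and rec[1] in committed:
            _, txn, block, value = rec
            disk[block] = value
    return disk
```

An uncommitted transaction's writes are simply skipped, so a crash mid-sequence leaves the filesystem as if the sequence never started.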
Hierarchy in Filesystem
- Virtual filesystem: optional memory-only framework on which to mount real filesystems
- Mounted filesystem(s): real filesystems, perhaps of different types
- Block cache: caches filesystem blocks: performance, sharing, ...
- Block device: wrapper for the various block devices with filesystems
- Device drivers for the various block devices
GeekOS: Hierarchy in Filesystem
[Figure: the corresponding layering in GeekOS]
Outline  [FAT]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
FAT32 Overview 1
- FAT: MS-DOS filesystem
  - simple, but no good for scalability, hard links, reliability, ...
  - currently used only on simple storage devices: floppy, flash, ...
- Disk divided into the following regions
  - Boot sector: device block 0
    - BIOS parameter block
    - OS boot loader code
  - Filesystem info sector: device block 1
    - signatures, fs type, pointers to other sections
    - fs blocksize, # free fs blocks, # last allocated fs block, ...
  - FAT: fs blocks 0 and 1; corresponds to the superblock
  - Data region: rest of the disk, organized as an array of fs blocks
    - holds the data of the fs-int nodes (ie, files and directories)
FAT32 Overview 2
- Each block in the data region is either free or bad or holds data (of a file or directory)
- FAT: array with an entry for each block in the data region
  - entries j_0, j_1, ... form a chain iff blocks j_0, j_1, ... hold successive data (of a fs-int node)
- Entry n contains
  - a constant, say FREE, if block n is free
  - a constant, say BAD, if block n is bad (ie, unusable)
  - a 32-bit number, say x, if block n holds data (of a fs-int node) and block x holds the succeeding data (of the fs-int node)
  - a constant, say END, if block n holds the last data chunk
- Root directory table: typically at start of data region (block 2)
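Reading a file then amounts to walking its chain through the in-memory FAT array; a minimal sketch (the sentinel values are placeholders, as in the slide, not the real FAT32 encodings):

```python
FREE, BAD, END = -1, -2, -3   # stand-ins for the slide's FREE/BAD/END constants

def fat_chain(fat, start):
    """Return the list of block numbers holding a node's data, in order,
    by following FAT entries from the starting block to the END marker."""
    chain, n = [], start
    while n != END:
        chain.append(n)
        n = fat[n]           # entry n names the block with the next chunk
    return chain
```

Note that random access to block k of a file still costs k array lookups, but they happen in memory, not on disk — this is the "separate linked-list" organization from earlier.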
FAT32 Overview 3
- Directory entry: 32 bytes
  - name (8)
  - extension (3)
  - attributes (1)
    - read-only, hidden, system, volume label, subdirectory, archive, device
  - reserved (10)
  - last modification time (2)
  - last modification date (2)
  - fs block # of starting fs block of the entry's data
  - size of entry's data (4)
- Hard links??
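The 32-byte layout above can be unpacked directly; this sketch assumes a 2-byte starting-block field to make the sizes total 32 (real FAT32 actually splits the starting cluster number into high and low 16-bit halves, with the high half stored among the bytes the slide calls reserved):

```python
import struct

# name(8) ext(3) attr(1) reserved(10) mtime(2) mdate(2) start(2) size(4) = 32
ENTRY = struct.Struct('<8s3sB10sHHHI')

def parse_dirent(raw):
    """Unpack one simplified 32-byte FAT directory entry."""
    name, ext, attr, _rsvd, mtime, mdate, start, size = ENTRY.unpack(raw)
    return {'name': name.rstrip(b' ').decode(),
            'ext': ext.rstrip(b' ').decode(),
            'attr': attr, 'start_block': start, 'size': size}
```

Given the starting block, the FAT chain walk from the previous slide yields the rest of the file's blocks.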
Outline  [FFS]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
FFS Layout
- Boot blocks  // few blocks at start
- Superblock  // after boot blocks
  - magic number
  - filesystem geometry (eg, locations of groups)
  - filesystem statistics/tuning params
- Groups  // cylinder groups
  - each consisting of:
    - backup copy of superblock
    - header with statistics
    - free space bitmap
    - array of inodes
      - holds metadata and data pointers of fs-int nodes
      - # inodes fixed at format time
    - array of data blocks
FFS Inodes
- Inodes are numbered sequentially starting at 0
  - inodes 0 and 1 are reserved
  - inode 2 is the root directory's inode
- An inode is either free or not
- A non-free inode holds metadata and data pointers of a fs-int node
  - owner id
  - type (directory, file, device, ...)
  - access modes
  - reference count  // # of hard links
  - size and # blocks of data
  - 15 pointers to data blocks
    - 12 direct
    - 1 single-indirect, 1 double-indirect, 1 triple-indirect
FFS: fs-int node is an asymmetric tree (max depth 4)
[Figure: inode (metadata + 15 pointers) -> indirect blocks -> data blocks]
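The asymmetry of that tree comes from how a logical file block number is routed through the 15 inode pointers; a sketch of the mapping (1024 pointers per indirect block is an assumption, corresponding to 4KB blocks with 4-byte pointers):

```python
def block_path(file_block, ptrs=1024, n_direct=12):
    """Return which inode pointer chain reaches logical block `file_block`:
    ('direct', i), ('single', i), ('double', i, j), or ('triple', i, j, k),
    where the indices select pointers at each indirection level."""
    if file_block < n_direct:
        return ('direct', file_block)
    file_block -= n_direct
    if file_block < ptrs:
        return ('single', file_block)
    file_block -= ptrs
    if file_block < ptrs * ptrs:
        return ('double', file_block // ptrs, file_block % ptrs)
    file_block -= ptrs * ptrs
    return ('triple', file_block // (ptrs * ptrs),
            (file_block // ptrs) % ptrs, file_block % ptrs)
```

Small files (the common case) resolve in one step via the direct pointers; only very large files pay the extra indirect-block reads.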
FFS Directory Entries
- The data blocks of a directory node hold directory entries
- A directory entry is not split across data blocks
- [Figure: directory entry -> inode -> data/pointer blocks]
- Directory entry for a fs-int node
  - # of the inode of the fs-int node  // hard link
  - size of the entry
  - length of node name (up to 255 bytes)
  - entry name
- Multiple directory entries can point to the same inode
User Ids
- Every user account has a user id (uid)
- Root user (aka superuser, admin) has uid 0
- Processes and filesystem entries have associated uids
  - indicates owners
  - determines access processes have to filesystem entries
  - determines which processes can be signalled by a process
Process Ids
- Every process has two associated uids
- Effective user id (euid)
  - uid of the user on whose behalf it is currently executing
  - determines its access to filesystem entries
- Real uid (ruid)
  - uid of the process's owner
  - determines which processes it can signal:
    - x can signal y only if x is superuser or x.ruid = y.ruid
Process Ids
- Process is created: ruid/euid <- creating process's euid
- Process with euid 0 executes SetUid(z): ruid/euid <- z
  - no effect if the process has a non-zero euid
- Example SetUid usage
  - login process has euid 0 (to access auth info files)
  - upon successful login, it starts a shell process (with euid 0)
  - shell executes SetUid(authenticated user's uid)
- When a process executes a file f with the setuid bit set:
  - its euid is set to f's owner's uid while it is executing f
- Upon bootup, the first process (init) runs with uid 0
  - it spawns all other processes, directly or indirectly
Directory Entry's uids and Permissions
- Every directory entry has three classes of users:
  - owner (aka user)
  - group (owner need not be in this group)
  - others (users other than owner or group)
- Each class's access is defined by three bits: r, w, x
- For a file:
  - r: read the file
  - w: modify the file
  - x: execute the file
- For a directory:
  - r: read the names (but not attributes) of entries in the directory
  - w: modify entries in the directory (create, delete, rename)
  - x: access an entry's contents and metainfo
- When a directory entry is created:
  - attributes are set according to the creating process's attributes (euid, umask, etc)
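The three 3-bit fields pack into the familiar octal mode (eg 0o754 = rwxr-xr--); a small sketch of how one class's bit is tested:

```python
def can_access(mode, user_class, want):
    """Test one rwx bit of a Unix permission mode.
    user_class in {'owner', 'group', 'other'}; want in {'r', 'w', 'x'}."""
    shift = {'owner': 6, 'group': 3, 'other': 0}[user_class]
    bit = {'r': 4, 'w': 2, 'x': 1}[want]
    return bool((mode >> shift) & bit)

can_access(0o754, 'group', 'r')   # True: the group digit 5 means r-x
```

The kernel picks which class applies by comparing the process's euid (and gids) to the entry's owner and group, then checks that class's bits as above.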
Directory Entry's setuid Bit
- Each directory entry also has a setuid bit
- If an executable file has setuid set and a process (with execute access) executes it, the process's euid changes to the file's owner's uid while executing the file
- Typically, the executable file's owner is root, allowing a normal user to get root privileges while executing the file
- This is a high-level analog of system calls
Directory Entry's sticky Bit
- Each directory entry also has a sticky bit
- Executable file with sticky bit set: hint to the OS to retain the text segment in swap space after the process executes
- Directory with sticky bit set: a user with wx access to the directory can rename/delete an entry x in the directory only if it is x's owner (or superuser)
- Usually set on the /tmp directory
Directory Entry's setgid Bit
- Unix has the notion of groups of users
  - a group is identified by a group id, abbreviated gid
  - a gid defines a set of uids
  - a user account can be in different groups, ie, have multiple gids
- A process has an effective gid (egid) and a real gid (rgid)
  - these play a similar role as euid and ruid
- A directory entry has a setgid bit
  - plays a similar role to setuid for executables
  - plays an entirely different role for directories
Outline  [NTFS]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text
NTFS Index Structure
- Master File Table (MFT)
  - corresponds to FFS inode array
  - holds an array of 1KB MFT records
- MFT record: sequence of variable-size attribute records
  - Std info attribute: owner id, creation/mod/... times, security, ...
  - File name attribute: file name and number
  - Data attribute record
    - data itself (if small enough)  // resident
    - or list of data extents (if not small enough)  // non-resident
  - Attribute list
    - pointers to attributes in this or other MFT records
    - pointers to attribute extents
    - needed if attributes do not fit in one MFT record
      - eg, highly-fragmented and/or highly-linked
Example: Files with a Single MFT Record
[Figure: two MFT records. File A (data resident): std info, file name, data held in the record itself, free space. File B (2 data extents): std info, file name, non-resident data attribute holding (start, length) pointers to two data extents, free space]
Example: File with Three MFT Records
[Figure: File C's base MFT record holds std info and an attribute list whose extents point to two further MFT records; those records hold file name and data attributes whose extents point to the data]
Outline  [COW]
- Filesystem Interface
- Persistent Storage Devices
- Filesystem Implementation
- FAT Filesystem
- FFS: Unix Fast Filesystem
- NTFS: Microsoft Filesystem
- Copy-On-Write: Slides from OS:P&P text