Advanced UNIX File Systems. Berkeley Fast File System, Logging File System, Virtual File Systems


Classical Unix File System The traditional UNIX file system keeps i-node information separate from the data blocks, so accessing a file involves a long seek from the i-node to its data. Files are grouped into directories, but i-nodes for files in the same directory are not allocated consecutively, so reading related files requires non-consecutive block reads. Allocation of data blocks is sub-optimal because the next sequential block of a file often lands on a different cylinder. Reliability is an issue because there is only one copy of the super-block and the i-node list. The free block list quickly becomes randomized as files are created and destroyed; this randomness forces many cylinder seek operations and slows down performance. There was no way to defragment the file system short of dump, redefine, restore. Due to the small block size (512-1024 bytes), small files scattered across many random blocks meant many reads and low bandwidth utilization on the system bus.
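
As a concrete reference for the discussion above, here is a minimal C sketch of the classic on-disk inode layout (mode, owner, times, twelve direct blocks, and three levels of indirect blocks). The field names and widths are illustrative assumptions, not any particular kernel's struct dinode:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative classic UNIX on-disk inode; simplified field widths. */
    struct classic_inode {
        uint16_t mode;                 /* file type and permission bits */
        uint16_t nlink;                /* reference (link) count */
        uint16_t uid, gid;             /* owner and group */
        uint32_t size;                 /* file size in bytes */
        uint32_t atime, mtime, ctime;  /* access / modify / change times */
        uint32_t direct[12];           /* first 12 data block numbers */
        uint32_t single_indirect;      /* block holding block numbers */
        uint32_t double_indirect;      /* block holding single-indirect block numbers */
        uint32_t triple_indirect;      /* block holding double-indirect block numbers */
    };

    int main(void) {
        printf("sketch inode size: %zu bytes\n", sizeof(struct classic_inode));
        return 0;
    }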

Berkeley Fast File System (FFS) BSD Unix did a redesign in the mid-80s that they called the Fast File System (FFS): McKusick, Joy, Leffler, and Fabry. It improved disk utilization and decreased response time. The minimal block size is 4096 bytes, which allows files as large as 4 GB to be created with only two levels of indirection. The disk partition is divided into one or more areas called cylinder groups.
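
A quick back-of-the-envelope check of that claim, assuming 4096-byte blocks, 4-byte block numbers, and twelve direct pointers (constants taken from the classic layout above, not read from any real superblock):

    #include <stdint.h>
    #include <stdio.h>

    /* With 4 KB blocks, a block of pointers holds 1024 block numbers,
     * so two levels of indirection alone reach 1024 * 1024 * 4 KB = 4 GB. */
    int main(void) {
        uint64_t block  = 4096;
        uint64_t ptrs   = block / 4;            /* 1024 pointers per block */
        uint64_t direct = 12 * block;           /* 48 KB via direct blocks */
        uint64_t single = ptrs * block;         /* +4 MB via single indirect */
        uint64_t dbl    = ptrs * ptrs * block;  /* +4 GB via double indirect */
        printf("direct: %llu KB\n", (unsigned long long)(direct >> 10));
        printf("single: %llu MB\n", (unsigned long long)(single >> 20));
        printf("double: %llu GB\n", (unsigned long long)(dbl >> 30));
        return 0;
    }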

Data Block and Inode Placement
The original Unix FS had two placement problems. First, data blocks become randomly allocated in aging file systems:
- Blocks for the same file are allocated sequentially when the FS is new
- As the FS ages and fills, new blocks must be allocated from wherever deleted files freed them
- Deleted files are essentially randomly placed, so blocks for new files end up scattered across the disk
Second, inodes are allocated far from data blocks:
- All inodes sit at the beginning of the disk, far from the data
- Traversing file name paths and manipulating files and directories requires going back and forth between inodes and data blocks
This is FRAGMENTATION. Both of these problems generate many long seeks (roughly 100x slower than a sequential read).

Cylinder Groups
BSD FFS addressed these problems using the notion of a cylinder group:
- The disk partition is divided into groups of cylinders
- Data blocks of the same file are allocated in the same cylinder group
- Files in the same directory are allocated in the same cylinder group
- Inodes for files are allocated in the same cylinder group as the files' data blocks
- All of this minimizes seek times
Free space requirements:
- To allocate according to cylinder groups, the disk must keep free space scattered across cylinders
- 10% of the disk is reserved just for this purpose and is usable only by root
- This is why df can report >100% usage, and why a filesystem is effectively full at 90%

Cylinder Groups Each cylinder group is one or more consecutive cylinders on a disk. Each cylinder group contains a redundant copy of the super block, plus the inode information and free block list pertaining to that group. Each group is allocated a static number of inodes; the default policy is to allocate one inode for every 2048 bytes of space in the cylinder group, on average. The idea of the cylinder group is to keep the inodes of files close to their data blocks to avoid long seeks, and to keep files from the same directory together. FFS places this bookkeeping information at a varying offset from the beginning of each group, so that all of the crucial data does not sit in one place on the disk; the space between the start of the group and that offset is used for data blocks.
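
A minimal C sketch of the per-group bookkeeping just described. The struct fields and the offset formula are illustrative assumptions, not the real BSD struct cg or its rotation calculation:

    #include <stdint.h>
    #include <stdio.h>

    /* What each cylinder group carries: a rotated superblock copy plus
     * its own inode table and free-space bookkeeping (names invented). */
    struct cg_sketch {
        uint32_t sb_copy;        /* redundant superblock copy */
        uint32_t inode_table;    /* this group's static inodes */
        uint32_t free_block_map; /* free data blocks in this group */
        uint32_t data_start;     /* first data block of the group */
    };

    /* The bookkeeping sits at a different offset in every group so all
     * crucial data is not in one place; the skipped space holds data. */
    static uint32_t group_offset(uint32_t group, uint32_t blocks_per_track) {
        return (group * 16) % blocks_per_track;  /* made-up rotation */
    }

    int main(void) {
        for (uint32_t g = 0; g < 4; g++)
            printf("group %u: metadata at offset %u blocks\n",
                   g, group_offset(g, 64));
        return 0;
    }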


File System Block Size
Problem: small blocks (1 KB) cause two problems:
- Low bandwidth utilization
- Small maximum file size (a function of block size)
FFS solution: use a larger block (4 KB):
- Very large files: only two levels of indirection are needed to reach a 2^32-byte file
- Improved bandwidth: as block size increases, the efficiency of a single transfer increases, and the space taken up by inodes and block lists decreases
Problem: as block size increases, space is wasted due to internal fragmentation.
FFS solution: introduce fragments (1 KB pieces of a block), as the sketch after this list illustrates:
- A block is divided into fragments, each individually addressable
- Fragment size is specified at file system creation
- The lower bound on fragment size is the disk sector size
Problem: media failures.
FFS solution: replicate the superblock and subdivide the inode list and directory entries among cylinder groups.
Problem: device obliviousness.
FFS solution: parameterize the FS according to device characteristics (see smart disk drives).
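
This small C program illustrates why fragments matter, comparing the space wasted by a small file with whole 4 KB blocks versus 1 KB fragments for its tail (the file and fragment sizes are just the examples used above):

    #include <stdio.h>

    /* Bytes wasted storing `file` bytes when the last partial block is
     * rounded up to `frag`-sized units within a `block`-sized block. */
    static long waste(long file, long block, long frag) {
        long tail = file % block;               /* bytes in last partial block */
        if (tail == 0) return 0;
        long frags = (tail + frag - 1) / frag;  /* fragments needed for tail */
        return frags * frag - tail;             /* allocated minus used */
    }

    int main(void) {
        long file = 1100;  /* a small 1100-byte file */
        printf("4K blocks, no fragments: %ld bytes wasted\n",
               waste(file, 4096, 4096));
        printf("4K blocks, 1K fragments: %ld bytes wasted\n",
               waste(file, 4096, 1024));
        return 0;
    }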

File System Integrity Due to hardware or power failures, a file system can enter an inconsistent state. Metadata operations (directory entry updates, inode updates, free block list updates, etc.) should therefore be synchronous: write-through rather than cached. The specific sequence of operations also matters. We cannot guarantee that the last operation before a failure succeeds, but we want to minimize the damage so that the file system can be more easily restored to a consistent state. Consider file creation, which must allocate an inode, write the new inode to disk, write the directory entry, and update the directory's inode. What order of these synchronous operations minimizes the damage that may occur due to failures? The correct order is: 1) allocate the inode (if needed) and write it to disk; 2) update the directory on disk; 3) update the inode of the directory on disk.
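
A sketch of that ordering in C. The write_* functions are hypothetical stand-ins for synchronous metadata writes, not a real kernel API; the point is the order and why it limits crash damage:

    #include <stdio.h>

    static void write_new_inode(void) { puts("1. allocate + write new inode"); }
    static void write_dir_entry(void) { puts("2. write directory entry"); }
    static void write_dir_inode(void) { puts("3. update directory's inode"); }

    int main(void) {
        /* Crashing after step 1 merely leaks an inode, which fsck can
         * reclaim. Doing step 2 first could leave a directory entry
         * pointing at an unallocated inode, which is far worse. */
        write_new_inode();
        write_dir_entry();
        write_dir_inode();
        return 0;
    }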

FFS Considerations
Synchronous metadata updates: these writes must be synchronous, and thus require seeks to the inodes. In FFS this is not a great problem as long as files are relatively small, because a directory, its files' data blocks, and their inodes should all be in the same cylinder group.
Write-back of dirty blocks: because the file access pattern is random (different applications use different files at the same time), the dirty blocks are not guaranteed to be in the same cylinder group.
Still more problems with FFS:
- Even with cylinder groups or extent allocation, placement is improved, but there are still many small seeks: possibly related files are physically separated
- Inodes are separated from their files (small seeks)
- Directory entries are separate from inodes
- Metadata requires synchronous writes: with small files, most writes are to metadata, and synchronous writes are very slow
Solution: the Logging File System. The concept: everything is a log, including metadata; think of a database transaction recovery log. Everything is written as a sequential log entry.

FFS Problems, LFS Solutions
Problem: caching is enough for good read performance, but writes are the real performance bottleneck. Writing back cached user blocks may require many random disk accesses, and write-through for metadata reliability denies optimizations.
Solution: logging. Each write, both data and control, is appended to the end of the sequential log.
Problem: how to locate files and data efficiently for random access by reads.
Solution: use a floating file map (a virtual journal in memory).
Problem: FFS uses inodes to locate data blocks; inodes are pre-allocated in each cylinder group, and directories contain the locations of inodes. LFS appends inodes to the end of the log just like data, BUT that makes them hard to find.
Solution (sketched in code after this list):
- Use another level of indirection: inode maps
- Inode maps map file numbers to inode locations
- The locations of inode map blocks are kept in a checkpoint region
- The checkpoint region has a fixed location on disk
- Inode maps are cached in memory for performance
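
A toy C illustration of the lookup chain this creates (checkpoint region to inode map to inode). The tables are tiny in-memory stand-ins and the log offsets are fabricated:

    #include <stdint.h>
    #include <stdio.h>

    #define NFILES 4

    /* The checkpoint region (fixed location) records where the inode
     * map lives; the inode map records where each file's inode lives. */
    static uint64_t checkpoint_imap_addr = 1000;
    static uint64_t imap[NFILES] = {2000, 2100, 2200, 2300};

    int main(void) {
        unsigned file_no = 2;
        printf("checkpoint -> inode map at log offset %llu\n",
               (unsigned long long)checkpoint_imap_addr);
        printf("imap[%u]   -> inode at log offset %llu\n",
               file_no, (unsigned long long)imap[file_no]);
        return 0;
    }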

Logging File System (LFS)
The Log-structured File System was developed by Rosenblum & Ousterhout (1991-92). LFS treats the disk as a single log for appending: all information written to disk (data blocks, attributes, inodes, directories, etc.) is appended to the log. LFS was designed in response to three trends in technology: smart drives, disk bandwidth, and main memory. LFS takes advantage of these to increase FS performance:
- Disk bandwidth is scaling significantly (40% a year), while disk latency is not
- Smart disks improve sequential access: addressing by sector number only, no C/H/S
- Machines have large main memories: large buffer caches absorb a large fraction of read requests, and can be used for writes as well, combining many small writes into large ones
LFS collects writes in the disk cache and writes out the entire collection in one large disk request. This leverages disk bandwidth, with no seeks (assuming the head is at the end of the log).
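
A minimal sketch of that write batching, assuming an in-memory segment buffer that is flushed as one sequential write (the segment size and flush function are invented for illustration):

    #include <stdio.h>
    #include <string.h>

    #define SEG_SIZE (512 * 1024)  /* assumed segment size: 512 KB */

    static char   segment[SEG_SIZE];
    static size_t seg_used = 0;

    /* One large sequential write at the log head: no seeks. */
    static void flush_segment(void) {
        printf("flush: %zu bytes written sequentially\n", seg_used);
        seg_used = 0;
    }

    /* Small writes accumulate in the buffer instead of seeking. */
    static void lfs_write(const void *buf, size_t len) {
        if (seg_used + len > SEG_SIZE)
            flush_segment();
        memcpy(segment + seg_used, buf, len);
        seg_used += len;
    }

    int main(void) {
        char block[4096] = {0};
        for (int i = 0; i < 200; i++)  /* 200 small writes, 2 big flushes */
            lfs_write(block, sizeof block);
        flush_segment();
        return 0;
    }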

LFS Problems, Solutions
LFS has two challenges it must address for it to be practical:
1) Locating data written to the log: FFS places files at known locations, while LFS writes data at the end of the log.
2) Managing free space on the disk: the disk is finite, so the log is finite, and we cannot always append. Append-only LFS quickly runs out of disk space, so deleted blocks in old parts of the log must be recovered.
Approach (sketched in code below):
- Fragment the log into segments
- Thread segments on disk: segments can be anywhere
- Reclaim space by cleaning segments: read a segment, copy its live data to the end of the log, and the now-free segment can be reused
Segment cleaning is still a big problem: the cleaner (a background process) adds costly overhead.
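
A toy C sketch of segment cleaning: scan a segment, copy only its live blocks to the log head, and reclaim the segment. Liveness here is a hard-coded bitmap, not real segment-summary metadata:

    #include <stdio.h>

    #define BLOCKS_PER_SEG 8

    /* Which blocks of this segment are still live (toy data). */
    static int live[BLOCKS_PER_SEG] = {1, 0, 0, 1, 0, 1, 0, 0};

    int main(void) {
        int copied = 0;
        for (int b = 0; b < BLOCKS_PER_SEG; b++)
            if (live[b]) {
                printf("copy live block %d to end of log\n", b);
                copied++;
            }
        printf("segment reclaimed: %d of %d blocks were live\n",
               copied, BLOCKS_PER_SEG);
        return 0;
    }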

Other File System Types
The first commercial UNIX FS was V6FS. The first FFS was FFS for BSD 4.2. The first journaled FS was JFS1 (IBM AIX). Other journaled file systems:
- jfs: the Journaled File System, a filesystem contributed to Linux by IBM
- xfs: a filesystem contributed to open source by SGI
- reiserfs: developed by Namesys; the default filesystem for SUSE Linux
Some Linux FS features:
- Ext: the original native file system
- Ext2: an FFS-like filesystem using block group allocation in place of FFS cylinder groups
- Ext3: a journaled version of Ext2
- Ext4: an extended version of Ext3 using extents, multiblock allocation, delayed allocation, larger file sizes, and checksumming for the journal and inode lists (fast fsck)
A fascinating comparison of OS filesystems is kept here: http://en.wikipedia.org/wiki/comparison_of_file_systems

Virtual File Systems (VFS) To support a multitude of filesystems, the operating system provides an abstraction called the VFS, or Virtual Filesystem: a kernel-level interface that presents all underlying file systems in one common in-memory format. The VFS receives system calls from user programs (open, write, stat, link, truncate, close) and interacts with the specific filesystem's support code at each mountpoint. The VFS translates between a particular FS format (local disk FS, NFS) and the VFS data in memory. It also receives other kernel requests, usually for memory management. The underlying filesystem remains responsible for all physical filesystem management: user data, directories, and metadata.
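
A simplified C sketch of VFS-style dispatch: kernel-facing code calls through a table of function pointers, and each mounted filesystem supplies its own implementations. This mirrors the idea behind Linux's file_operations but is not the real interface; all names here are invented:

    #include <stdio.h>

    struct vfs_ops {
        int (*open)(const char *path);
        int (*read)(int fd, void *buf, int n);
    };

    /* One concrete filesystem's implementations (names invented). */
    static int ext2ish_open(const char *path) {
        printf("ext2ish: open %s\n", path);
        return 3;  /* pretend file descriptor */
    }
    static int ext2ish_read(int fd, void *buf, int n) {
        (void)buf;
        printf("ext2ish: read %d bytes from fd %d\n", n, fd);
        return n;
    }

    static const struct vfs_ops ext2ish_ops = { ext2ish_open, ext2ish_read };

    int main(void) {
        /* Callers see only vfs_ops; the FS is chosen at mount time. */
        const struct vfs_ops *ops = &ext2ish_ops;
        char buf[64];
        int fd = ops->open("/etc/passwd");
        ops->read(fd, buf, sizeof buf);
        return 0;
    }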

LINUX RamDisk A RAM disk is a filesystem in RAM (the inverse of swap, which is RAM on disk). RAM disks have fixed sizes and are treated like regular disk partitions. Access time is much faster for a RAM disk than for a real, physical disk. RAM disks are used to store temporary data in a high-performance directory. All RamDisk data is lost when the system is powered off or rebooted. Ramdisks are useful for a number of things, like Live CD implementations (see UnionFS). Linux kernels beginning with version 2.4 have built-in support for ramdisks. You can check for RamDisk support with the commands:
    dmesg | grep RAMDISK
    ls -al /dev/ram*
You can increase the initial RAMDISK size by adding the ramdisk_size=16000 option to the kernel entry of grub.conf. The RamDisk memory is permanently set aside once defined (it cannot be reclaimed by the kernel), so be careful with the size allocated for each RAMDisk. RamDisk devices always exist, but you must explicitly create the filesystem and mountpoint at every boot to access them. For example:
    mke2fs -m 0 /dev/ram0
    mkdir /mnt/rd0
    mount /dev/ram0 /mnt/rd0
    chown root /mnt/rd0
    chmod 0750 /mnt/rd0

SWAP Space Swap versus page terminology: swap is RAM on disk. Disk is a million times slower than RAM, and RAM is cheap, so DON'T SWAP!!!!!
- RAM utilization: top, vmstat, free
- Show swap: swapon -s
- Swap in use: free -mt
- Swap uses its own area format (created with mkswap) and its own partition type: 82
- Turn a swap area on with swapon and off with swapoff; see the manpage
If low on virtual memory, you can allocate temporary swap space on an existing filesystem without a reboot (see lab), but this performs even worse than regular swap. You can combine swap on a filesystem with a RamDisk on solid state drives for almost memory-grade performance. Why? Some OSes, software, or hardware platforms have memory address limitations.

UnionFS: File System Namespace Unification UnionFS is an extension of the VFS that merges the contents of two or more directories/filesystems to present a unified view as a single mountpoint. It unifies directories, NOT DEVICES!!! A union file system combines one or more read-only base directories with a writable overlay. Initially, the view contains the base filesystem as R/O; the overlay is added second as R/W. Any updates to the mountpoint are written to the overlay directory. Uses: Live CDs merging a RAMDisk with a CDROM (LINUX, KNOPPIX), diskless NFS clients, server consolidation. Examples: Sun TLS, BSD, MacOSX (from BSD), Linux funionfs (FUSE), aufs (SourceForge). UnionFS can be compiled into the kernel or installed separately with the Linux FUSE product (Filesystem in Userspace); note the other options under FUSE. When compiled natively into the kernel, unionfs shows up as a filesystem type under mount:
    mount -t unionfs -o dirs=/dir1=rw:/dir2=ro none /mountpoint
When installed separately as a product (funionfs under Linux):
    funionfs none -o dirs=/dir1=rw:/dir2=ro /mountpoint

LINUX Filesystem in Userspace (FUSE) The FUSE project sits under the VFS and makes filesystems usable by ordinary users. Commands: fusermount, funionfs, sshfs

Other mount options
mount --bind and mount --rbind can make a single directory act like a unionfs; that is, you can re-mount a portion of a file system at another mountpoint. This is not supported on all UNIX versions. These can be made permanent in /etc/fstab. Examples:
    mount --bind olddir newdir
    mount --rbind olddir newdir   (includes submounts)
    mount --move olddir newdir    (moves a mountpoint)
Labeled mounts: you can mount a device by a label. First label the device:
    e2label /dev/sda /home
Then add to /etc/fstab:
    LABEL=/home /home ext3 defaults 1 2
mount also accepts the -L label option. See the manpage for mount for other options.