Filesystems Overview
ext2, NTFS, ReiserFS, and the Linux Virtual Filesystem Switch

mdeters@cs.wustl.edu
www.cs.wustl.edu/ doc/
Fall 2003 Seminar on Storage-Based Supercomputing

Outline
  - UNIX file API in a nutshell
  - Layout of, algorithms for, and trickery in filesystems
    - ext2: disk layout, journaling, trickery
    - NTFS: disk layout
    - ReiserFS v3: disk layout, comparison to ext2, trickery, Reiser4 preview
  - Linux's Virtual Filesystem Switch (VFS)

UNIX File API in a Nutshell

  open(path, flags, mode)       open a file
  creat(path, mode)             create a file
  close(fd)                     close an open file
  read(fd, buf, count)          read from an open file
  write(fd, buf, count)         write to an open file
  lseek(fd, offset, whence)     seek within an open file
  truncate(path, length)        truncate/extend a file
  ftruncate(fd, length)         truncate/extend an open file
  link(oldpath, newpath)        create a new link to a file
  unlink(path)                  remove link (maybe delete file)

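As a rough illustration of the calls listed above, here is a minimal sketch that uses Python's `os` module, which exposes thin wrappers over the same system calls. The scratch directory and the file name `demo.txt` are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical scratch location; any writable path would do.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo.txt")

# open(path, flags, mode): create the file and open it read/write.
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)

# write(fd, buf, count): returns the number of bytes written.
written = os.write(fd, b"hello, filesystem")

# lseek(fd, offset, whence): move the file offset to byte 7.
os.lseek(fd, 7, os.SEEK_SET)

# read(fd, buf, count): read up to 10 bytes from the current offset.
tail = os.read(fd, 10)

# ftruncate(fd, length): shrink the open file to 5 bytes.
os.ftruncate(fd, 5)
size_after = os.fstat(fd).st_size

# close(fd), then unlink(path) removes the only link to the file.
os.close(fd)
os.unlink(path)
os.rmdir(workdir)
```

Low-level `os.open` is used here rather than the built-in `open` so that each line maps onto one call from the slide.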
ext2

Basic Layout on Media (ext2)

  [Figure: boot block followed by block groups laid end to end; each group holds
   super block | group descs | data bmap | inode bmap | inode table | data blocks]

  - Filesystem composed of G block groups (0..G-1)
  - Superblock and group descriptors replicated
  - Inode and data block bitmaps are always one block
  - One inode bitmap and one data block bitmap per block group
  - Assuming 4 KiB = 2^12 byte blocksize...
    - a bitmap represents 2^15 total inodes/blocks
    - therefore each block group has 2^27 bytes = 128 MiB of data

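The block-group arithmetic on this slide can be checked with a few lines. This is only a sketch: `groups_needed` is a hypothetical helper, and it ignores the space taken by superblock copies, descriptors, bitmaps, and inode tables.

```python
BLOCK_SIZE = 4 * 1024                      # 4 KiB = 2**12 bytes, as on the slide

# A bitmap occupies exactly one block, so it has one bit per tracked object.
bits_per_bitmap = BLOCK_SIZE * 8           # 2**15 inodes or data blocks

# One data block bitmap per group bounds the data a group can hold.
group_data_bytes = bits_per_bitmap * BLOCK_SIZE   # 2**15 blocks * 2**12 bytes


def groups_needed(fs_bytes):
    """Hypothetical helper: block groups needed to cover fs_bytes of data."""
    return -(-fs_bytes // group_data_bytes)   # ceiling division
```

For example, under these assumptions a 1 GiB volume would need `groups_needed(2**30)` groups, which works out to 8.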
Block Group Layout

  [Figure: super block | group descs | data bmap | inode bmap | inode table | data blocks]

  - Data block bitmap indicates data blocks in use
  - Data blocks are unformatted chunks of file data, or pointers to other data blocks
  - Group descriptors contain
    - block number of bitmaps and inode table
    - count of directories in the block group
    - count of free inodes in block group
    - count of free data blocks in block group

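The group descriptor fields listed above can be pictured as a small fixed-size record. The sketch below assumes a 32-byte little-endian layout modelled on Linux's `struct ext2_group_desc`; the slide does not give field widths or ordering, so treat those as assumptions, and the sample values are made up.

```python
import struct
from collections import namedtuple

# Assumed layout: three 32-bit block numbers, three 16-bit counters,
# then 14 bytes of padding/reserved space (32 bytes in total).
GROUP_DESC = struct.Struct("<IIIHHH14x")

GroupDesc = namedtuple(
    "GroupDesc",
    "block_bitmap inode_bitmap inode_table free_blocks free_inodes used_dirs",
)


def parse_group_desc(raw):
    """Decode one descriptor from the start of a bytes object."""
    return GroupDesc(*GROUP_DESC.unpack(raw[:GROUP_DESC.size]))


# Made-up descriptor: bitmaps in blocks 3 and 4, inode table starting at block 5.
sample = GROUP_DESC.pack(3, 4, 5, 1200, 300, 17)
desc = parse_group_desc(sample)
```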
Block Group Layout (continued)

  [Figure: super block | group descs | data bmap | inode bmap | inode table | data blocks]

  - All inodes allocated statically up front
  - Inode bitmap indicates inodes in use
  - Inodes contain
    - file type & mode, file owner, link count, access/creation/modification timestamps,
    - pointers to blocks
      - direct pointers
      - indirect
      - doubly indirect
      - triply indirect

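To make the four pointer kinds concrete, the sketch below maps a file's logical block index to the kind of pointer that reaches it. It assumes 12 direct pointers and 4-byte block numbers, which are the usual ext2 values but are not stated on the slide; `pointer_level` and `max_blocks` are hypothetical helper names.

```python
DIRECT_POINTERS = 12   # assumption: usual ext2 value, not given on the slide
POINTER_SIZE = 4       # assumption: 32-bit block numbers


def pointer_level(block_index, block_size=4096):
    """Return which kind of pointer reaches the given logical block."""
    per_block = block_size // POINTER_SIZE   # pointers held by one block
    if block_index < DIRECT_POINTERS:
        return "direct"
    block_index -= DIRECT_POINTERS
    span = per_block                         # blocks reachable at this level
    for level in ("indirect", "doubly indirect", "triply indirect"):
        if block_index < span:
            return level
        block_index -= span
        span *= per_block                    # each extra level multiplies reach
    raise ValueError("block index beyond triply indirect range")


def max_blocks(block_size=4096):
    """Total blocks addressable through all four pointer kinds."""
    n = block_size // POINTER_SIZE
    return DIRECT_POINTERS + n + n**2 + n**3
```

With 4 KiB blocks each pointer block holds 1024 entries, so small files never touch an indirect block at all.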
Inside an Inode
  - File type & mode
    - type bits 0170000
    - mode bits 07777
  - Owner and group IDs
    - 16 -> 32 bits
  - Size (both bytes and blocks)
  - Version (for use by NFS etc.)
  - (And also a few other things)

  UNIX file types: regular, directory, symlink, block special, char special, named pipe, socket

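The two masks above can be tried out with Python's standard `stat` module, whose `S_IFMT` and `S_IMODE` helpers apply 0170000 and 07777 respectively. The `describe` function and the sample mode values are illustrative only.

```python
import stat

# Map the type bits onto the UNIX file types named on the slide.
TYPE_NAMES = {
    stat.S_IFREG: "regular",
    stat.S_IFDIR: "directory",
    stat.S_IFLNK: "symlink",
    stat.S_IFBLK: "block special",
    stat.S_IFCHR: "char special",
    stat.S_IFIFO: "named pipe",
    stat.S_IFSOCK: "socket",
}


def describe(mode):
    """Split a raw mode word into (file type name, permission bits in octal)."""
    file_type = stat.S_IFMT(mode)    # mode & 0o170000
    perms = stat.S_IMODE(mode)       # mode & 0o7777
    return TYPE_NAMES.get(file_type, "unknown"), oct(perms)
```

For instance, `describe(0o100644)` gives `("regular", "0o644")`.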
Allocating an inode
  - If new inode represents a directory
    - scatter directories through partially-used block groups
    - find group with maximum count of free data blocks, of all groups with greater-than-average free inode count
  - Otherwise
    - starting from parent directory's group p, search log(n) groups as given by
        { (p + 2^i - 1) mod G | 0 <= i, 2^i < G }
    - then try exhaustive linear search of groups starting from p

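The two allocation policies can be sketched as follows. This is a simplified model, not the kernel's code: it assumes per-group free counts are available as plain lists, and it assumes the probe offsets from the parent's group p are 2^i - 1 taken modulo the group count G. All function names are hypothetical.

```python
def pick_directory_group(free_inodes, free_blocks):
    """Directories: among groups with above-average free inodes,
    choose the one with the most free data blocks."""
    average = sum(free_inodes) / len(free_inodes)
    candidates = [g for g, n in enumerate(free_inodes) if n > average]
    if not candidates:
        return None
    return max(candidates, key=lambda g: free_blocks[g])


def probe_groups(p, num_groups):
    """Groups at offsets 2**i - 1 from p, for as long as 2**i < num_groups."""
    probes = []
    i = 0
    while (1 << i) < num_groups:
        probes.append((p + (1 << i) - 1) % num_groups)
        i += 1
    return probes


def pick_file_group(p, free_inodes):
    """Other inodes: try the short probe sequence first,
    then fall back to a linear scan of every group starting from p."""
    num_groups = len(free_inodes)
    linear = [(p + k) % num_groups for k in range(num_groups)]
    for g in probe_groups(p, num_groups) + linear:
        if free_inodes[g] > 0:
            return g
    return None   # no free inode anywhere
```

With 8 groups and p = 5, the probe sequence is groups 5, 6, and 0; only if all three are full does the linear scan run.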
Allocating a data block
  - Favors blocks near previous block
  - Falls back to block group containing inode
  - Failing that, allocates wherever it can find a free block

How Directories Are Stored
  - "." and ".." links stored explicitly
  - Linear, unsorted map of link names to inode numbers
  - Records padded to 4-byte boundaries
  - Type byte allows kernel optimizations (don't have to read inode)
  - Large directories (10,000+ entries) unwieldy and inefficient

  field    inode   rec_len   name_len   type   name
  bytes    4       2         1          1      varies
  ---------------------------------------------------------
           100     12        1          D      .\0\0\0
           90      12        2          D      ..\0\0
           103     12        3          D      usr\0
           105     32        7          F      vmlinuz\0
           0       16        6          F      foobar\0\0
           175     16        8          D      myphotos

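The record format can be exercised by rebuilding the slide's example directory in memory and walking it by `rec_len`. This is a sketch under assumptions: little-endian fields with the widths shown (4, 2, 1, 1 bytes), and type codes 1 for a regular file and 2 for a directory, which are the usual ext2 values rather than something the slide states.

```python
import struct

HEADER = struct.Struct("<IHBB")   # inode, rec_len, name_len, type
REG, DIR = 1, 2                   # assumed type codes (F and D on the slide)


def make_entry(inode, rec_len, ftype, name):
    """Build one record; the name is NUL-padded to a 4-byte boundary."""
    padded = (len(name) + 3) & ~3
    return HEADER.pack(inode, rec_len, len(name), ftype) + name.ljust(padded, b"\0")


def walk(block):
    """Yield (inode, type, name) for live entries, hopping by rec_len."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len, ftype = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            raise ValueError("corrupt rec_len")
        if inode != 0:   # inode 0 marks an unused record
            start = offset + HEADER.size
            yield inode, ftype, block[start:start + name_len].decode()
        offset += rec_len


# The slide's example: "foobar" has inode 0, and vmlinuz's rec_len of 32
# spans its own 16 bytes plus the 16 bytes of the dead "foobar" record.
block = b"".join([
    make_entry(100, 12, DIR, b"."),
    make_entry(90, 12, DIR, b".."),
    make_entry(103, 12, DIR, b"usr"),
    make_entry(105, 32, REG, b"vmlinuz"),
    make_entry(0, 16, REG, b"foobar"),
    make_entry(175, 16, DIR, b"myphotos"),
])
names = [name for _, _, name in walk(block)]
```

Because the walk follows `rec_len` rather than the physical record size, the dead record is skipped without ever being examined.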
Deleting files
  - Remove entry from parent directory
    - set inode to zero
    - increase previous record's rec_len
  - Decrement inode's link count
  - If link count is zero
    - mark data blocks free (in data block bitmap)
    - mark inode free (in inode bitmap)

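The link-count behaviour described above is visible from user space on any filesystem with hard links. The following sketch, with made-up file names in a scratch directory, creates a second link, removes the first, and shows that the data survives until the last link goes.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()              # hypothetical scratch directory
first = os.path.join(workdir, "first")
second = os.path.join(workdir, "second")

with open(first, "w") as f:
    f.write("payload")

# link(): a second directory entry for the same inode.
os.link(first, second)
links_after_link = os.stat(first).st_nlink

# unlink(): removes one entry and decrements the link count.
os.unlink(first)
links_after_unlink = os.stat(second).st_nlink

# The inode and its data blocks are still reachable through the other name.
with open(second) as f:
    survivor = f.read()

# Removing the last link lets the filesystem free the inode and its blocks.
os.unlink(second)
os.rmdir(workdir)
```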
File Holes
  - If truncate() or lseek() expands a file, ext2 makes a hole
    - no data blocks are actually allocated
    - reads of blocks that have NULL pointers return all zeros
  - Useful especially for any application storing large hashes in files
    - databases, etc.

  [Figure: an ext2 disk inode whose data block pointers lead to data blocks, with one pointer left NULL to form a hole]

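A hole is easy to create from user space: seek past the end of a file and write. The sketch below assumes a temporary file on whatever filesystem backs the default temp directory; whether the skipped range is really left unallocated depends on that filesystem, so the code only checks what POSIX promises, namely that the range reads back as zeros.

```python
import os
import tempfile

HOLE = 1 << 20                        # skip 1 MiB before writing anything

fd, path = tempfile.mkstemp()         # hypothetical scratch file

# lseek() past end-of-file, then write: the skipped range becomes a hole.
os.lseek(fd, HOLE, os.SEEK_SET)
os.write(fd, b"tail")

# Reading inside the hole returns zero bytes.
os.lseek(fd, 0, os.SEEK_SET)
head = os.read(fd, 4096)

info = os.fstat(fd)
# Where available, st_blocks (512-byte units) shows how much is truly allocated.
allocated_units = getattr(info, "st_blocks", None)

os.close(fd)
os.unlink(path)
```

On a filesystem with hole support, `allocated_units * 512` would typically be far smaller than `info.st_size`.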
ext2 Extensibility
  - ext2 has been extended for ACLs, file undeletion, journaling...
  - superblock contains several compatibility flags
    - compatible feature set
      - changes to the filesystem are fully backward-compatible
    - incompatible feature set
      - changes to the filesystem are not backward-compatible
    - read-only compatible feature set
      - changes to the filesystem are read-compatible, but a system not recognizing any of these flags shouldn't attempt to write

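The mount-time decision implied by the three feature sets can be sketched as a pair of bitmask tests. The flag names and bit values below are invented for illustration and do not correspond to real ext2 constants.

```python
# Hypothetical feature bits, for illustration only.
INCOMPAT_A = 0x1
INCOMPAT_B = 0x2
RO_COMPAT_A = 0x1
RO_COMPAT_B = 0x2


def mount_mode(incompat, ro_compat, known_incompat, known_ro_compat):
    """Decide how a driver may mount a filesystem given its feature flags.

    Compatible features never restrict mounting, so they are not consulted.
    """
    if incompat & ~known_incompat:
        return "refuse"        # an unknown incompatible feature is present
    if ro_compat & ~known_ro_compat:
        return "read-only"     # safe to read, unsafe to write
    return "read-write"
```

A driver that knows only `INCOMPAT_A` and `RO_COMPAT_A` would mount a volume carrying `RO_COMPAT_B` read-only, and refuse one carrying `INCOMPAT_B`.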
Journaling (ext3)
  - On disk mount, superblock marked uncleanly unmounted
    - on umount, superblock marked cleanly unmounted
  - When system boots, unclean disks are fscked as necessary
  - For large disks, fscking is a real pain
  - Can be sped up with data journaling
    - every block of data written twice: once to journal, once to file
    - on unclean boot, data consistency is ensured by replaying the journal
  - or metadata journaling
    - data integrity isn't ensured, but directory information and inode structures are
    - can be ordered (data blocks are flushed to disk before the metadata that refers to them is committed)

ext2 Trickery
  - On disk
    - storing small symlinks in inode
    - compatibility & extensibility
  - In memory
    - superblock and bitmap caching

NTFS

Basic Layout on Media (NTFS)
  - Boot sector
    - first 16 sectors on disk
  - Master File Table (MFT)
    - each file/directory has a record
    - itself a file
    - analogous (sort of) to the inode table
  - File attributes
    - resident vs. non-resident
    - inode information
    - identified by code/name

NTFS Attribute Types
  - Standard information: timestamp, link count
  - Non-resident attribute list
  - File name: long and short names
  - Owner/permissions (ACLs)
  - Unnamed/named data extents
  - Object identifier
  - Logged tool stream
  - Reparse point
  - Index root
  - Index allocation
  - Bitmap
  - Volume information
  - Volume label

NTFS5 Extensions (Windows 2000/XP)
  - Encryption
  - Disk quotas
  - Sparse files (file holes)
    - works even with compressed files
  - Reparse points
  - Volume mount points

NTFS Trickery
  - File record inlining of small files/directories
  - Naming of file attributes provides potential extensibility

ReiserFS v3

Basic Layout on Media (ReiserFS v3)
  - Everything in balanced trees
    - controversial in the past
    - ReiserFS demonstrates the efficacy of balanced trees

Advantages Over Stock ext2
  - Journaling
  - Efficient large directories
  - Small file efficiency (tail packing)
  - Block access

Balanced Trees

  [Figure sequence: a balanced binary tree, shown level by level as M | F R | E J Q T, receives the keys C, A, and B in turn.
   After A is inserted, a rotation yields M | F R | C J Q T | A E.
   After B is inserted, a second rotation yields M | C R | A F Q T | B E J.]

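The rotation sequence on these slides can be reproduced with a textbook AVL tree. This is a sketch that assumes the figure depicts AVL-style rebalancing of a binary tree; it is meant to show what a rotation does, not how ReiserFS lays out its multi-key tree blocks.

```python
from collections import deque


class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance(node):
    return height(node.left) - height(node.right)


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y      # x moves up, y becomes its right child
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x       # y moves up, x becomes its left child
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key and rebalance on the way back up; returns the subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node                   # duplicate keys are ignored
    update(node)
    if balance(node) > 1:             # left-heavy
        if balance(node.left) < 0:
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance(node) < -1:            # right-heavy
        if balance(node.right) > 0:
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def level_order(root):
    keys, queue = [], deque([root] if root else [])
    while queue:
        node = queue.popleft()
        keys.append(node.key)
        queue.extend(child for child in (node.left, node.right) if child)
    return keys


root = None
for key in "MFREJQT":                 # the starting tree from the first slide
    root = insert(root, key)

snapshots = {}
for key in "CAB":                     # the three insertions shown on the slides
    root = insert(root, key)
    snapshots[key] = "".join(level_order(root))
```

Each entry in `snapshots` is the tree's level-order listing after one insertion, which can be compared against the corresponding slide.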
Balanced Tree Node Implementation

  [Figure: internal node laid out as block head | keys | pointers | free]
  - Internal nodes within the tree point to other nodes

  [Figure: leaf node laid out as block head | item heads (contain keys) | free | items]
  - Leaf nodes at bottom of tree point to items

  - Block head identifies block level, number of constituents, free space, right delimiting key (for leaves)
  - Everything sorted by key, both within and between blocks

ReiserFS Keys
  - In order:
    1. parent directory ID
    2. object ID (inode #)
    3. offset within object
    4. item type
  - Implications
    - items belonging to the same file are together in tree
    - items belonging to the same directory are together in tree

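The locality implied by this key ordering falls out of plain lexicographic comparison. The sketch below models a key as a four-field tuple in the order listed; the item-type codes and the sample IDs and offsets are invented for the example.

```python
import random
from collections import namedtuple

# Field order matches the key order: comparison is lexicographic.
Key = namedtuple("Key", "dir_id object_id offset item_type")

# Hypothetical item-type codes, for illustration only.
STAT, INDIRECT, DIRECT = 0, 1, 2

items = [
    Key(2, 10, 0, STAT),          # file 10 in directory 2: metadata
    Key(2, 10, 1, INDIRECT),      # file 10: pointers to full blocks
    Key(2, 10, 8193, DIRECT),     # file 10: tail
    Key(2, 11, 0, STAT),          # another file in directory 2
    Key(5, 7, 0, STAT),           # a file in a different directory
]

shuffled = items[:]
random.shuffle(shuffled)

# Sorting by key brings each file's items, and each directory's files, together.
in_tree_order = sorted(shuffled)
```

However the items arrive, sorting restores the same order, with all of directory 2 ahead of directory 5 and all of file 10 contiguous.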
Items
  - Tree leaves contain item heads and items
  - Item heads contain item key, type, size...
  - Items are directory, direct, indirect, stat data
    - directory items
    - direct items: stored completely within leaf node
    - indirect items: stored in unformatted blocks (formatted tail)
    - stat data: like an inode without address blocks

Representation of files and directories
  - So what is a file in ReiserFS? Three distinct parts:
    - stat data item (for file metadata)
    - plus some number of indirect items (depending on file size)
    - plus direct item (for the tail)
  - ...and a directory?
    - set of directory items

ReiserFS v3 Trickery
  - Consistency of representation
  - Tail packing
    - but apparently a lot of people turn it off

What's the fuss about Reiser4?
  - Plugins
  - UNIX view of files taken to the extreme
    - composability, filtering,...
  - Fixes various mistakes in ReiserFS v3
  - In beta (still?)

The Virtual Filesystem Switch

The Virtual Filesystem Switch (VFS)

  [Figure: four vfsmount structures, for "/", "/home", "/var", and "/usr", each with the fields
   parent, children, mountpoint, and superblock. Each vfsmount refers to a super_block with the
   fields fs_type, operations, and device. The filesystem types shown are ext2, NFS, ReiserFS,
   and NTFS; the devices shown are /dev/hda1, /dev/hda3, and /dev/sda1.]

  * This is a simplified view for presentation; it is based on Linux 2.4.21 but is incomplete, and field names have been altered.

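The structures in the figure can be modelled in a few lines to show how a path is routed to a filesystem-specific operation. This is a toy model using the slide's simplified field names; the pairing of mount points with filesystem types and devices, the NFS export name, and the single `read_inode` operation are all assumptions made for the example.

```python
class SuperBlock:
    def __init__(self, fs_type, device, operations):
        self.fs_type = fs_type
        self.device = device
        self.operations = operations      # per-filesystem function table


class VfsMount:
    def __init__(self, mountpoint, superblock, parent=None):
        self.mountpoint = mountpoint
        self.superblock = superblock
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def make_sb(fs_type, device):
    """Build a superblock whose operations are tagged with the filesystem type."""
    ops = {"read_inode": lambda ino: f"{fs_type}: read inode {ino} from {device}"}
    return SuperBlock(fs_type, device, ops)


def find_mount(mount, path):
    """Descend the mount tree to the deepest mount whose mountpoint covers path."""
    for child in mount.children:
        prefix = child.mountpoint
        if path == prefix or path.startswith(prefix + "/"):
            return find_mount(child, path)
    return mount


# Hypothetical pairing of mount points, types, and devices.
root = VfsMount("/", make_sb("ext2", "/dev/hda1"))
home = VfsMount("/home", make_sb("nfs", "server:/export/home"), parent=root)
usr = VfsMount("/usr", make_sb("reiserfs", "/dev/hda3"), parent=root)
var = VfsMount("/var", make_sb("ntfs", "/dev/sda1"), parent=root)

# Generic code picks the mount, then dispatches through its operations table.
target = find_mount(root, "/usr/bin/env")
message = target.superblock.operations["read_inode"](42)
```

The caller never names a filesystem: the same `read_inode` lookup reaches different code depending on which mount covers the path, which is the point of the switch.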
Discussion