Linux Filesystems: Ext2, Ext3
Nafisa Kazi

What is a Filesystem?
A filesystem:
- Stores files and the data in the files
- Organizes data for easy access
- Stores information about files, such as size, file permissions, owner, and creation time
- May use a storage device such as a hard disk or CD-ROM
  - This involves maintaining the physical location of the files
- Could be virtual and exist only as an access method for virtual data or for data over a network (e.g. NFS)

Linux File System History
- Minix: the first file system for Linux
  - Restrictive and lacked performance
  - Filenames longer than 14 characters not allowed
  - Maximum file size was 64 MB
- EXT (Extended File System): the first file system designed specifically for Linux
  - Introduced in April 1992
  - Still lacked performance
- In 1993, the Second Extended File System, or Ext2, was added
- In 1999, the Third Extended File System, or Ext3, was developed by Stephen Tweedie

Linux File System History (cont'd.)
- VFS (Virtual File System): developed when the EXT filesystem was added
- VFS allows Linux to support different file systems
  - Each file system presents a common software interface to the VFS
  - All the details of the various file systems are translated by software
  - All file systems appear identical to the rest of the Linux kernel

VFS
- For example: cp /floppy/test /tmp/test

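The cp example can be illustrated with a minimal sketch of the dispatch idea, assuming a toy in-memory model rather than the kernel's real data structures. The names FileSystem, MemoryFS, VFS, mount, resolve and copy are hypothetical and chosen only for illustration. Each filesystem implements the same interface, and the VFS picks the filesystem by the longest matching mount point.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical common interface that every filesystem presents to the VFS."""

    @abstractmethod
    def read(self, path):
        ...

    @abstractmethod
    def write(self, path, data):
        ...


class MemoryFS(FileSystem):
    """Toy filesystem backed by a dict; stands in for Ext2, MS-DOS, NFS, etc."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


class VFS:
    """Keeps a table of mounted filesystems and forwards each call to one of them."""

    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    def resolve(self, path):
        # Choose the longest mount point that is a prefix of the path.
        candidates = [
            m for m in self.mounts
            if path == m or path.startswith(m.rstrip("/") + "/")
        ]
        best = max(candidates, key=len)
        relative = path[len(best):] or "/"
        if not relative.startswith("/"):
            relative = "/" + relative
        return self.mounts[best], relative

    def copy(self, src, dst):
        # The caller never needs to know which filesystem types are involved.
        src_fs, src_path = self.resolve(src)
        dst_fs, dst_path = self.resolve(dst)
        dst_fs.write(dst_path, src_fs.read(src_path))
```

With a filesystem mounted at / and another at /floppy, a call such as copy("/floppy/test", "/tmp/test") reads from one and writes to the other through the same two methods, which is the property the VFS layer provides.
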
VFS: Superblocks and Inodes
- VFS describes the system's files in terms of superblocks and inodes
- The VFS inodes:
  - Describe files and directories within the system
- The VFS superblocks:
  - As each file system is initialized, it registers itself with the VFS at boot time
  - Each file system type's superblock read routine maps the filesystem's topology onto a VFS superblock
  - VFS keeps a list of the mounted file systems and their VFS superblocks
  - Each VFS superblock contains a pointer to the first VFS inode on the file system
- As the system's processes access directories and files, system routines are called that traverse the VFS inodes

Logical Diagram of VFS
[Figure: logical diagram of VFS]

Caching in VFS
- Inode cache:
  - Repeatedly accessed inodes are kept in the inode cache for quicker access
- Directory cache:
  - VFS also keeps a cache of directory lookups so that the inodes for frequently used directories can be found quickly
  - Stores the mapping from directory name to inode

Caching in VFS (cont'd.)
- Buffer cache:
  - Caches data buffers from the devices to help speed up access
  - Makes the Linux file systems independent of the underlying media and of the device drivers that support them
  - Is integrated with the block device interface
  - Read requests from a filesystem result in block device drivers reading physical blocks from the devices that they control
  - These blocks are saved in the global buffer cache and are shared by all filesystems
  - Buffers are identified by their block number and a unique identifier for the device that read them
  - Filesystems don't have to go to the device if a block is in the cache

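The lookup rule for the buffer cache can be sketched as follows. This is a toy model, not the kernel's implementation: BlockDevice, BufferCache and the physical_reads counter are hypothetical names, and eviction is left out. The point it shows is that a buffer is keyed by the pair (device identifier, block number), so a second read of the same block never reaches the device.

```python
class BlockDevice:
    """Hypothetical block device that counts how many physical reads it serves."""

    def __init__(self, dev_id, blocks):
        self.dev_id = dev_id
        self.blocks = blocks          # block number -> bytes
        self.physical_reads = 0

    def read_block(self, block_no):
        self.physical_reads += 1
        return self.blocks[block_no]


class BufferCache:
    """Global cache shared by all filesystems (no eviction in this sketch)."""

    def __init__(self):
        self.buffers = {}             # (device id, block number) -> bytes

    def read(self, device, block_no):
        key = (device.dev_id, block_no)
        if key not in self.buffers:
            # Cache miss: ask the driver to read the physical block.
            self.buffers[key] = device.read_block(block_no)
        return self.buffers[key]
```

Reading block 7 of the same device twice through BufferCache.read leaves physical_reads at 1, while the same block number on a different device is cached under a separate key.
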
Ext2 Disk Data Structures
- The first block in each Ext2 partition is reserved for the partition boot sector
- The rest of the space is split into block groups, each of which has the same layout
- All the block groups have the same size and are stored sequentially
  - The kernel can derive the location of a block group on disk simply from its integer index

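Because the groups are equal in size and stored back to back, locating a group is plain arithmetic. The sketch below assumes a hypothetical helper named block_group_start and a first_data_block parameter that defaults to 1, the value used when the block size is 1024 bytes and block 0 is the boot block; with larger block sizes the value is 0.

```python
def block_group_start(group_index, blocks_per_group, block_size, first_data_block=1):
    """Return (first block number, byte offset) of a block group.

    first_data_block is an assumption of this sketch: 1 for 1024-byte
    blocks (block 0 holds the boot sector), 0 for larger block sizes.
    """
    first_block = first_data_block + group_index * blocks_per_group
    return first_block, first_block * block_size


# Group 3 on a filesystem with 1024-byte blocks and 8192 blocks per group.
start_block, byte_offset = block_group_start(3, 8192, 1024)
```

Here start_block is 1 + 3 * 8192 = 24577 and byte_offset is 24577 * 1024 = 25166848.
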
Ext2 Superblock
- Contains a description of the file system
- Duplicated in each block group
- The superblock and the group descriptors in block group 0 are used when the filesystem is mounted
- Some important information that this block holds:
  - Magic Number: identifies the filesystem type
  - Block Group Number: the number of the block group that holds this copy of the superblock
  - Block Size: the size of a block for this file system, in bytes
  - Blocks per Group: the number of blocks in a group; this is fixed when the file system is created
  - Free Blocks: the number of free blocks in the file system
  - Free Inodes: the number of free inodes in the file system
  - First Inode: the inode number of the first inode in the file system. The first inode in an Ext2 root file system would be the directory entry for the '/' directory

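Several of these fields can be read with a few lines of code. The sketch below decodes only the first 58 bytes of the primary superblock, which starts 1024 bytes into the partition, and checks the Ext2 magic number 0xEF53. The function names parse_superblock and make_fake_image are hypothetical, and the image is synthetic data built for the example, not a real filesystem.

```python
import struct

EXT2_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024             # primary superblock starts here
# Thirteen 32-bit fields followed by three 16-bit fields, little-endian.
_HEAD = struct.Struct("<13I3H")


def parse_superblock(image):
    """Decode a few superblock fields from a raw partition image."""
    fields = _HEAD.unpack_from(image, SUPERBLOCK_OFFSET)
    magic = fields[15]
    if magic != EXT2_MAGIC:
        raise ValueError("not an Ext2 superblock: magic 0x%04X" % magic)
    return {
        "magic": magic,
        "free_blocks": fields[3],
        "free_inodes": fields[4],
        # The block size is stored as a shift count relative to 1024 bytes.
        "block_size": 1024 << fields[6],
        "blocks_per_group": fields[8],
    }


def make_fake_image(log_block_size=2, blocks_per_group=32768,
                    free_blocks=1000, free_inodes=200, magic=EXT2_MAGIC):
    """Build a synthetic image holding only the fields decoded above."""
    image = bytearray(SUPERBLOCK_OFFSET + 1024)
    values = [0] * 16
    values[3] = free_blocks
    values[4] = free_inodes
    values[6] = log_block_size
    values[8] = blocks_per_group
    values[15] = magic
    _HEAD.pack_into(image, SUPERBLOCK_OFFSET, *values)
    return bytes(image)
```

Calling parse_superblock(make_fake_image()) yields a block size of 4096 bytes, since 1024 shifted left by 2 is 4096. On a real system the same bytes could be read from a partition or image file, which normally requires appropriate privileges.
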
Ext2 Group Descriptor and Bitmap
- The group descriptors for all of the block groups are duplicated in each block group
- Each group descriptor gives, for its block group, the location of:
  - Blocks Bitmap
  - Inode Bitmap
  - Inode Table
- The bitmaps are sequences of bits
  - Value 0 specifies that the corresponding inode or data block is free
  - Value 1 specifies that the corresponding inode or data block is used

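Testing and setting bits in such a bitmap can be sketched as below. The helper names is_used, mark_used and first_free are hypothetical. The sketch assumes the usual Ext2 convention that bit n lives in byte n // 8 at bit position n % 8, counting from the least significant bit.

```python
def is_used(bitmap, index):
    """True if the inode or block at this index is marked as used (bit = 1)."""
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def mark_used(bitmap, index):
    """Set the bit for this index; bitmap must be a mutable bytearray."""
    bitmap[index // 8] |= 1 << (index % 8)


def first_free(bitmap, count):
    """Return the lowest index whose bit is 0, or None if all are used."""
    for index in range(count):
        if not is_used(bitmap, index):
            return index
    return None


# Sixteen entries; the first three are already in use.
bitmap = bytearray([0b00000111, 0b00000000])
slot = first_free(bitmap, 16)
mark_used(bitmap, slot)
```

After this runs, slot is 3 and the next call to first_free(bitmap, 16) returns 4.
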
Inodes
- Every file and directory in the file system is described by one inode
- The inodes for each block group are kept in the inode table together with a bitmap
- The inode contains the following fields:
  - Mode: the permissions that users have
  - Owner Information
  - Size: the size of the file in bytes
  - Timestamps: the time that the inode was created and the last time that it was modified
  - Data blocks: pointers to the blocks that contain the data that this inode is describing

Inode structure
[Figure: inode structure]

Consistency Check Problem with Ext2
- Updates to filesystem blocks are kept in dynamic memory before being flushed to disk
  - A power failure might leave the filesystem in an inconsistent state
- To overcome this problem, each filesystem is checked (and fixed) before it is mounted
  - The utility is called fsck
  - Runs upon reboot after a system crash
- Does not scale well
  - With today's large disks and filesystems, fsck can take many hours to perform a consistency check
  - Totally unacceptable in a production environment

Ext3 Filesystem
- Ext3 is a journaling filesystem
- Goal of a journaling filesystem:
  - To avoid time-consuming consistency checks during system start-up after ungraceful termination
- Main idea:
  - First write blocks to a special area of the disk called the journal
  - Then write blocks from the journal to the filesystem
- Other examples of journaling file systems:
  - SGI's XFS and IBM's JFS
- Ext3 is as compatible as possible with the Ext2 filesystem
  - Fairly easy to migrate between Ext2 and Ext3

Journaling Filesystem (JFS)
- Two-step procedure for performing a high-level change to the filesystem:
  - Step 1: Committing to the journal
    - Keeps track of the information to be written to the hard drive in a journal
    - A copy of the blocks to be written is stored in the journal
  - Step 2: Committing to the filesystem
    - When the I/O transfer to the journal is completed, the blocks are written to the filesystem
    - When the I/O transfer to the filesystem is completed, the copies of the blocks in the journal are discarded
- The journal allows quick recovery of the filesystem after a crash
  - No need to scan the entire disk; only the journal area is scanned

System Recovery with JFS
- Two cases for system recovery
- Case 1: the system failure occurred before a commit to the journal
  - Either the copies of the blocks relative to the change are missing from the journal or they are incomplete
  - In both cases, fsck ignores them
  - Result: the high-level change to the filesystem is lost, but the filesystem state is still consistent
- Case 2: the system failure occurred after a commit to the journal
  - The copies of the blocks are valid, and fsck writes them into the filesystem
  - Result: fsck applies the whole change, thus fixing every inconsistency due to unfinished I/O data transfers into the filesystem

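The two-step write procedure and the two recovery cases can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not how Ext3 stores its journal: Disk, journal_write, checkpoint and recover are hypothetical names, the journal is a Python list, and a crash is modelled by simply not setting the committed flag.

```python
class Disk:
    """Toy disk: a filesystem area plus a journal area."""

    def __init__(self):
        self.blocks = {}     # block number -> data (the filesystem proper)
        self.journal = []    # entries waiting to be written to the filesystem


def journal_write(disk, updates, crash_before_commit=False):
    """Step 1: copy the blocks into the journal, then mark the entry committed."""
    entry = {"records": dict(updates), "committed": False}
    disk.journal.append(entry)
    if crash_before_commit:
        return entry         # simulated crash: the commit mark is never written
    entry["committed"] = True
    return entry


def checkpoint(disk, entry):
    """Step 2: write the blocks to the filesystem, then discard the journal copy."""
    disk.blocks.update(entry["records"])
    disk.journal.remove(entry)


def recover(disk):
    """After a crash: replay committed entries, ignore incomplete ones."""
    for entry in disk.journal:
        if entry["committed"]:
            disk.blocks.update(entry["records"])
    disk.journal.clear()


disk = Disk()
disk.blocks = {10: b"old"}

# Case 1: crash before the commit reached the journal. The change is lost.
journal_write(disk, {10: b"new"}, crash_before_commit=True)
recover(disk)
case1 = dict(disk.blocks)

# Case 2: crash after the commit but before the checkpoint. The change is replayed.
journal_write(disk, {10: b"new", 11: b"more"})
recover(disk)
case2 = dict(disk.blocks)
```

In case 1 the filesystem still holds the old block, and in case 2 it holds both new blocks. Recovery looks only at the journal, never at the whole disk.
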
Journaling Modes
- Logging blocks to the journal leads to a significant performance penalty
- Ext3 therefore lets the administrator decide which kinds of blocks are logged
- This gives rise to three journaling modes:
  - Journal
  - Ordered
  - Writeback
- The journaling mode is specified as an option to the mount command
  - Example: mount -t ext3 -o data=writeback /dev/wd0a /jdisk

The Journal Journaling Mode
- All filesystem data and metadata are logged into the journal
  - Metadata includes superblocks, inodes, data bitmap blocks, inode bitmap blocks, etc.
- Minimizes loss of updates made to each file
- Requires additional disk accesses
  - Example: when a new file is created, all its data blocks are duplicated as log records
- Safest but slowest mode

Ordered Journaling Mode
- Only changes to filesystem metadata are logged to the journal
- Metadata and related data blocks are grouped
  - Data blocks are written to disk before the metadata is written to disk
- Two cases of changes to a file
  - Case 1: appending to a file
    - If the system crashes after the data blocks are written to disk but before the metadata is committed, the metadata will not reflect the change
    - Hence the file is consistent, though the changes to the file are lost
  - Case 2: overwriting part of a file
    - No guarantee that blocks are written to disk in order
    - Thus, one cannot assume that because overwritten block x was updated, overwritten block x-1 was updated as well
    - No changes to metadata (block allocation bitmap)
    - Hence no way of knowing whether the file is consistent or not
- Default journaling mode for the Ext3 filesystem
- Works out fine in practice, as appending to a file is much more common than overwriting in the middle of a file

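The append case can be sketched with a toy model in which the inode's block list and size stand for the journaled metadata. The names make_fs, append_ordered and read_file are hypothetical, and the crash is simulated by returning before the metadata update. The sketch only shows the ordering rule, data first and metadata second.

```python
def make_fs():
    """Toy filesystem: raw data blocks plus one inode acting as the metadata."""
    return {
        "blocks": {},                        # block number -> data on disk
        "inode": {"blocks": [], "size": 0},  # metadata, updated via the journal
        "next_free": 0,                      # stands in for the allocation bitmap
    }


def append_ordered(fs, data, crash_after_data=False):
    block_no = fs["next_free"]
    # Step 1: the data block reaches the disk first.
    fs["blocks"][block_no] = data
    if crash_after_data:
        return                               # simulated crash before the metadata commit
    # Step 2: only then is the metadata committed.
    fs["inode"]["blocks"].append(block_no)
    fs["inode"]["size"] += len(data)
    fs["next_free"] += 1


def read_file(fs):
    """A reader sees only the blocks that the metadata refers to."""
    return b"".join(fs["blocks"][n] for n in fs["inode"]["blocks"])


fs = make_fs()
append_ordered(fs, b"first ")
append_ordered(fs, b"second", crash_after_data=True)
contents = read_file(fs)
```

After the simulated crash, contents is b"first " and the recorded size is 6. The second append is lost, but the metadata never points at a block that was not written.
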
Writeback Journaling Mode
- Only changes to filesystem metadata are logged
- Does not wait for associated changes to file data to be written
- Example: files may exhibit metadata inconsistencies
  - The block allocation bitmap will show data blocks as occupied, even though the updated data was not written when the system went down
  - This isn't fatal, but can be disappointing to users
- Fastest mode

Journaling Block Device Layer
- The Ext3 journal is stored in a hidden file, .journal, in the root directory of the filesystem
- The journal is handled by a kernel layer called the Journaling Block Device (JBD)
- The Ext3 filesystem invokes JBD routines to ensure that disk data structures don't get corrupted in case of system failure

Interaction Between Ext3 and JBD
- JBD uses the same disk to log changes performed by the Ext3 filesystem
- Thus JBD must protect itself from system failures that could corrupt the journal
- Hence, the interaction between Ext3 and JBD is based on three fundamental units:
  - Log Record
  - Atomic Operation Handles
  - Transactions
- Log Record
  - Describes a single update of a disk block
  - Describes a low-level operation issued by the filesystem
  - Represented inside the journal as blocks of data or metadata

Atomic Operation Handles
- Log records of a set of low-level operations that correspond to a high-level change to the filesystem
- Example: appending a block of data to a file involves many low-level operations
  - If a system failure occurs in the middle, the result is an inconsistency
- Hence, when recovering from a system failure, either the whole high-level operation is applied or none of it

Transactions
- All log records belonging to several atomic operation handles are grouped into a single transaction
- All log records are stored in consecutive blocks of the journal
- JBD handles each transaction as a whole
  - Reclaims the blocks used by a transaction only after all data in its log records are committed to the filesystem

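How the three units nest can be sketched with plain data classes. This is an illustrative model only: LogRecord, Handle, Transaction, written_to_fs and can_reclaim are hypothetical names and do not mirror the kernel's JBD structures.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    """One update of one disk block."""
    block_no: int
    data: bytes
    written_to_fs: bool = False


@dataclass
class Handle:
    """All log records of one high-level change; applied as a whole or not at all."""
    description: str
    records: list = field(default_factory=list)


@dataclass
class Transaction:
    """Groups the log records of several handles."""
    handles: list = field(default_factory=list)

    def log_records(self):
        return [record for handle in self.handles for record in handle.records]

    def can_reclaim(self):
        # Journal blocks may be reused only when every record is on the filesystem.
        return all(record.written_to_fs for record in self.log_records())


append = Handle("append one block", [LogRecord(5, b"bitmap"), LogRecord(9, b"inode")])
rename = Handle("rename a file", [LogRecord(12, b"dirent")])
txn = Transaction([append, rename])

reclaimable_before = txn.can_reclaim()
for record in txn.log_records():
    record.written_to_fs = True
reclaimable_after = txn.can_reclaim()
```

Here reclaimable_before is False and reclaimable_after is True: the journal space of the transaction is released only once all three records have reached the filesystem.
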
References
- http://www.tldp.org/ldp/tlk/fs/filesystem.html
- Safari Books Online: Understanding the Linux Kernel
- http://web.mit.edu/tytso/www/linux/ext2intro.html
- http://uranus.it.swin.edu.au/~jn/explore2fs/es2fs.htm
- http://www.lugatgt.org/articles/filesystems/?print=html
- http://www.redhat.com/support/wpapers/redhat/ext3/index.html
- http://www.gentoo.org/doc/en/articles/l-afig-p8.xml
- http://olstrans.sourceforge.net/release/ols2000-ext3/ols2000-ext3.html