Case study: ext2 FS 1

Similar documents
Case study: ext2 FS 1

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

4/19/2016. The ext2 file system. Case study: ext2 FS. Recap: i-nodes. Recap: i-nodes. Inode Contents. Ext2 i-nodes

UNIX File Management. COMP3231/COMP s2

UNIX File Management. UNIX File Management. UNIX index node (inode) Inode Contents. Inode Contents

ECE 598 Advanced Operating Systems Lecture 18

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

ECE 598 Advanced Operating Systems Lecture 17

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Operating Systems

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

Computer Systems Laboratory Sungkyunkwan University

Linux Filesystems Ext2, Ext3. Nafisa Kazi

CS 4284 Systems Capstone

File System Internals. Jo, Heeseung

Chapter 10: Case Studies. So what happens in a real operating system?

File System Implementation

Filesystems Overview

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

W4118 Operating Systems. Instructor: Junfeng Yang

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

Local File Stores. Job of a File Store. Physical Disk Layout CIS657

UNIX File Systems. How UNIX Organizes and Accesses Files on Disk

File Systems. File system interface (logical view) File system implementation (physical view)

File Systems 1. File Systems

File Systems 1. File Systems

Main Points. File layout Directory layout

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

File Systems. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse]

COS 318: Operating Systems. Journaling, NFS and WAFL

Operating System Concepts Ch. 11: File System Implementation

Chapter 10: File System Implementation

Files and File Systems

C13: Files and Directories: System s Perspective

Chapter 11: File System Implementation. Objectives

File Systems. Chapter 11, 13 OSPP

UNIX File System. UNIX File System. The UNIX file system has a hierarchical tree structure with the top in root.

NTFS Recoverability. CS 537 Lecture 17 NTFS internals. NTFS On-Disk Structure

Chapter 11: Implementing File Systems

Long-term Information Storage Must store large amounts of data Information stored must survive the termination of the process using it Multiple proces

File Layout and Directories

File System Implementation. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Review: FFS [McKusic] basics. Review: FFS background. Basic FFS data structures. FFS disk layout. FFS superblock. Cylinder groups

Chapter 12: File System Implementation

Introduction to OS. File Management. MOS Ch. 4. Mahmoud El-Gayyar. Mahmoud El-Gayyar / Introduction to OS 1

File system internals Tanenbaum, Chapter 4. COMP3231 Operating Systems

Advanced File Systems. CS 140 Feb. 25, 2015 Ali Jose Mashtizadeh

OPERATING SYSTEM. Chapter 12: File System Implementation

Shared snapshots. 1 Abstract. 2 Introduction. Mikulas Patocka Red Hat Czech, s.r.o. Purkynova , Brno Czech Republic

COMP 530: Operating Systems File Systems: Fundamentals

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

V. File System. SGG9: chapter 11. Files, directories, sharing FS layers, partitions, allocations, free space. TDIU11: Operating Systems

Chapter 11: Implementing File

CS370 Operating Systems

File Systems. What do we need to know?

Topics. " Start using a write-ahead log on disk " Log all updates Commit

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

Review: FFS background

CSE506: Operating Systems CSE 506: Operating Systems

File System Implementation

UNIX File Management (continued)

File System Implementation

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

Caching and reliability

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

File System (Internals) Dave Eckhardt

Operating Systems Design Exam 2 Review: Fall 2010

Announcements. Persistence: Crash Consistency

Operating Systems 2010/2011

Outline. Operating Systems. File Systems. File System Concepts. Example: Unix open() Files: The User s Point of View

Motivation. Operating Systems. File Systems. Outline. Files: The User s Point of View. File System Concepts. Solution? Files!

CS3600 SYSTEMS AND NETWORKS

Chapter 12: File System Implementation

FS Consistency & Journaling

Chapter 11: Implementing File Systems

File Systems Management and Examples

Fast File System (FFS)

SCSI overview. SCSI domain consists of devices and an SDS

CS 318 Principles of Operating Systems

File Systems: Fundamentals

What is a file system

Advanced UNIX File Systems. Berkley Fast File System, Logging File System, Virtual File Systems

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

File Systems: Consistency Issues

Lecture 19: File System Implementation. Mythili Vutukuru IIT Bombay

CSE 153 Design of Operating Systems

Evolution of the Unix File System Brad Schonhorst CS-623 Spring Semester 2006 Polytechnic University

File Systems: Fundamentals

Operating Systems. File Systems. Thomas Ropars.

CS370 Operating Systems

Operating Systems. Operating Systems Professor Sina Meraji U of T

Logical disks. Bach 2.2.1

Chapter 12: File System Implementation

Transcription:

Case study: ext2 FS 1

The ext2 file system Second Extended Filesystem The main Linux FS before ext3 Evolved from Minix filesystem (via Extended Filesystem ) Features Block size (1024, 2048, and 4096) configured at FS creation inode-based FS Performance optimisations to improve locality (from BSD FFS) Main Problem: unclean unmount e2fsck Ext3fs keeps a journal of (meta-data) updates Journal is a file where updates are logged Compatible with ext2fs

Recap: i-nodes Each file is represented by an inode on disk Inode contains the fundamental file metadata Access rights, owner,accounting info (partial) block index table of a file Each inode has a unique number System oriented name Try ls i on Unix (Linux) Directories map file names to inode numbers Map human-oriented to system-oriented names 3

Recap: i-nodes

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect Ext2 i-nodes Mode Uid Gid Type Regular file or directory Access mode rwxrwxrwx User ID Group ID 5

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect Inode Contents atime Time of last access ctime Time when file was created mtime Time when file was last modified 6

mode uid gid atime ctime mtime size block count reference count direct blocks (12) Inode Contents - Size What does size of a file really mean? The space consumed on disk? With or without the metadata? The number of bytes written to the file? The highest byte written to the file? 0 N single indirect File system double indirect triple indirect 4 7 8 2 5 1 6 3 7

mode uid gid atime ctime mtime size block count reference count direct blocks (12) Inode Contents - Size What does size of a file really mean? The space consumed on disk? With or without the metadata? The number of bytes written to the file? The highest byte written to the file? 0 N single indirect File system double indirect triple indirect 7 8 5 1 8

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect Inode Contents - Size What does size of a file really mean? The space consumed on disk? With or without the metadata? The number of bytes written to the file? The highest byte written to the file? 0 N File system 8 9

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect Inode Contents Size Offset of the highest byte written Block count Number of disk blocks used by the file. Note that number of blocks can be much less than expected given the file size Files can be sparsely populated E.g. write(f, hello ); lseek(f, 1000000); write(f, world ); Only needs to store the start and end of file, not all the empty blocks in between. Size = 1000005 Blocks = 2 + overheads 10

mode uid gid atime ctime mtime size block count reference count direct blocks (12) 40,58,26,8,12, 44,62,30,10,42,3,21 single indirect double indirect triple indirect Inode Contents Direct Blocks Block numbers of first 12 blocks in the file Most files are small We can find blocks of file directly from the inode 0 3 0 56 8 2 9 1 10 4 5 7 6 7 11 63 File 11 10 9 8 7 6 5 4 3 2 1 0 Disk 11

Problem How do we store files with data at offsets greater than 12 blocks? Adding significantly more direct entries in the inode results in many unused entries most of the time. 12

mode uid gid atime ctime mtime size block count reference count direct blocks (12) 40,58,26,8,12, 44,62,30,10,42,3,21 single indirect: 32 double indirect triple indirect Inode Contents Single Indirect Block Block number of a block containing block numbers 28 29 38 46 61 43 0 3 SI 0 56 8 2 9 1 10 17 4 12 5 Disk 13 16 7 14 15 6 7 11 63 13 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Single Indirection Requires two disk access to read One for the indirect block; one for the target block Max File Size Assume 1Kbyte block size, 4 byte block numbers 12 * 1K + 1K/4 * 1K = 268 KiB For large majority of files (< 268 KiB), given the inode, only one or two further accesses required to read any block in file. 14

mode uid gid atime ctime mtime size block count reference count direct blocks (12) 40,58,26,8,12, 44,62,30,10,42,3,21 single indirect: 32 double indirect triple indirect Inode Contents Double Indirect Block Block number of a block containing block numbers of blocks containing block numbers 15

mode uid gid atime ctime mtime size block count reference count direct blocks (12) 40,58,26,8,12, 44,62,30,10,42,3,21 single indirect: 32 double indirect triple indirect Inode Contents Double Indirect Block Block number of a block containing block numbers of blocks containing block numbers Triple Indirect Block number of a block containing block numbers of blocks containing block numbers of blocks containing block numbers 16

UNIX Inode Block Addressing Scheme 17

UNIX Inode Block Addressing Scheme Assume 8 byte blocks, containing 4 byte block numbers => each block can contain 2 block numbers (1-bit index) Assume a single direct block number in inode

A B C D E F G H I J K L M N O 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Block numbers for corresponding file offset Level 3 Triple Indirect Block Level 2 Double Indirect Blocks 30 10 42 3 23 28 29 38 Single Indirect Block 8 12 44 62 31 47 45 53 Level 2 Triple Indirect Block 58 26 18 34 Level 1 Double Indirect Block Level 1 Triple Indirect Block 57 50 0 D I Ti1 Di2 C Si Di2 A J Ti2 56 Ti2 B direct blocks (1) 40 single indirect: 32 double indirect:6 triple indirect: 16 K Subset of Inode E M N F Ti3 Ti3 Di1 H O G 7 L Ti3 Ti3 63 Disk 19

Max File Size Assume 4 bytes block numbers and 1K blocks The number of addressable blocks Direct Blocks = 12 Single Indirect Blocks = 256 Double Indirect Blocks = 256 * 256 = 65536 Triple Indirect Blocks = 256 * 256 * 256 = 16777216 Max File Size 12 + 256 + 65536 + 16777216 = 16843020 blocks 16 GB 20

Where is the data block number stored? Assume 4K blocks, 4 byte block numbers, 12 direct blocks A 1 byte file produced by lseek(fd, 1048576, SEEK_SET) /* 1 megabyte */ write(fd, x, 1) What if we add lseek(fd, 5242880, SEEK_SET) /* 5 megabytes */ write(fd, x, 1) 21

Where is the block number is this tree? 22

Solution? 4K blocks, 4 byte block numbers => 1024 block numbers in indirect blocks (10 bit index) Block # range location 0 ---11 Direct blocks 12 --- 1035 (11 + 1024) Single-indirect blocks 1036 --- 1049611 (1035 + 1024 * 1024) Double-indirect blocks 1049612 ---???? Triple-indirect blocks File (not to scale) 0 N

Solution Address = 1048576 ==> block number=1048576/4096=256 Single indirect offset = 256 12 = 244 Block # range location 0 ---11 Direct blocks 12 --- 1035 Single-indirect blocks 1036 --- 1049611 Double-indirect blocks 1049612 ---???? Triple-indirect blocks

Where is the block number is this tree? 244 th entry 25

Address = 5242880 ==> Block number = 5242880/4096 =1280 Double indirect offset (20-bit) = 1280 1036 = 244 Top 10 bits = 0 Lower 10 bits = 244 Solution Block # range location 0 ---11 Direct blocks 12 --- 1035 Single-indirect blocks 1036 --- 1049611 Double-indirect blocks 1049612 ---???? Triple-indirect blocks

Where is the block number is this tree? 0 th entry 244 th entry 27

Some Best and Worst Case Access Patterns Assume Inode already in memory To read 1 byte Best: 1 access via direct block Worst: 4 accesses via the triple indirect block To write 1 byte Best: 1 write via direct block (with no previous content) Worst: 4 reads (to get previous contents of block via triple indirect) + 1 write (to write modified block back) 28

Worst Case Access Patterns with Unallocated Indirect Blocks Worst to write 1 byte 4 writes (3 indirect blocks; 1 data) 1 read, 4 writes (read-write 1 indirect, write 2; write 1 data) 2 reads, 3 writes (read 1 indirect, read-write 1 indirect, write 1; write 1 data) 3 reads, 2 writes (read 2, read-write 1; write 1 data) Worst to read 1 byte If reading writes a zero-filled block on disk Worst case is same as write 1 byte If not, worst-case depends on how deep is the current indirect block tree. 29

Inode Summary The inode (and indirect blocks) contains the on-disk metadata associated with a file Contains mode, owner, and other bookkeeping Efficient random and sequential access via indexed allocation Small files (the majority of files) require only a single access Larger files require progressively more disk accesses for random access Sequential access is still efficient Can support really large files via increasing levels of indirection 30

Where/How are Inodes Stored Boot Block Super Block Inode Array Data Blocks System V Disk Layout (s5fs) Boot Block contain code to bootstrap the OS Super Block Contains attributes of the file system itself e.g. size, number of inodes, start block of inode array, start of data block area, free inode list, free data block list Inode Array Data blocks 31

Some problems with s5fs Inodes at start of disk; data blocks end Long seek times Must read inode before reading data blocks Only one superblock Corrupt the superblock and entire file system is lost Block allocation was suboptimal Consecutive free block list created at FS format time Allocation and de-allocation eventually randomises the list resulting in random allocation Inode free list also randomised over time Directory listing resulted in random inode access patterns 32

Berkeley Fast Filesystem (FFS) Historically followed s5fs Addressed many limitations with s5fs ext2fs mostly similar 33

Layout of an Ext2 FS Boot Block Block Group 0. Block Group n Partition: Reserved boot block, Collection of equally sized block groups All block groups have the same structure 34

Layout of a Block Group Super Block Group Descriptors Data Block Bitmap Inode Bitmap Inode Table 1 blk n blks 1 blk 1 blk m blks k blks Data blocks Replicated super block For e2fsck Group descriptors Bitmaps identify used inodes/blocks All block groups have the same number of data blocks Advantages of this structure: Replication simplifies recovery Proximity of inode tables and data blocks (reduces seek time) 35

Superblocks Size of the file system, block size and similar parameters Overall free inode and block counters Data indicating whether file system check is needed: Uncleanly unmounted Inconsistency Certain number of mounts since last check Certain time expired since last check Replicated to provide redundancy to aid recoverability 36

Group Descriptors Location of the bitmaps Counter for free blocks and inodes in this group Number of directories in the group 37

Performance considerations EXT2 optimisations Block groups cluster related inodes and data blocks Pre-allocation of blocks on write (up to 8 blocks) 8 bits in bit tables Better contiguity when there are concurrent writes Aim to store files within a directory in the same group 38

Thus far Inodes representing files laid out on disk. Inodes are referred to by number!!! How do users name files? By number? 39

Ext2fs Directories inode rec_len name_len type name Directories are files of a special type Consider it a file of special format, managed by the kernel, that uses most of the same machinery to implement it Inodes, etc Directories translate names to inode numbers Directory entries are of variable length Entries can be deleted in place inode = 0 Add to length of previous entry 40

Ext2fs Directories f1 = inode 7 file2 = inode 43 f3 = inode 85 7 12 2 f 1 0 0 43 16 5 f i l e 2 0 0 0 85 12 2 f 3 0 0 0 Inode No Rec Length Name Length Name 41

Hard links Note that inodes can have more than one name Called a Hard Link Inode (file) 7 has three names f1 = inode 7 file2 = inode 7 f3 = inode 7 7 12 2 f 1 0 0 7 16 5 f i l e 2 0 0 0 7 12 2 f 3 0 0 0 Inode No Rec Length Name Length Name 42

mode uid gid atime ctime mtime size block count reference count direct blocks (12) 40,58,26,8,12, 44,62,30,10,42,3,21 single indirect: 32 double indirect triple indirect Inode Contents We can have many names for the same inode. When we delete a file by name, i.e. remove the directory entry (link), how does the file system know when to delete the underlying inode? Keep a reference count in the inode Adding a name (directory entry) increments the count Removing a name decrements the count If the reference count == 0, then we have no names for the inode (it is unreachable), we can delete the inode (underlying file or directory) 43

Hard links (a) Situation prior to linking (b) After the link is created (c)after the original owner removes the file

Symbolic links A symbolic link is a file that contains a reference to another file or directory Has its own inode and data block, which contains a path to the target file Marked by a special file attribute Transparent for some operations Can point across FS boundaries

Ext2fs Directories Deleting a filename rm file2 7 12 2 f 1 0 0 7 16 5 f i l e 2 0 0 0 7 12 2 f 3 0 0 0 Inode No Rec Length Name Length Name 46

Ext2fs Directories Deleting a filename rm file2 Adjust the record length to skip to next valid entry 7 32 2 f 1 0 0 7 12 2 f 3 0 0 0 Inode No Rec Length Name Length Name 47

FS reliability Disk writes are buffered in RAM OS crash or power outage ==> lost data Commit writes to disk periodically (e.g., every 30 sec) Use the sync command to force a FS flush FS operations are non-atomic Incomplete transaction can leave the FS in an inconsistent state

FS reliability dir entries i-nodes data blocks 2 3 1 Example: deleting a file 1.Remove the directory entry 2.Mark the i-node as free 3.Mark disk blocks as free

FS reliability dir entries i-nodes data blocks 2 3 1 Example: deleting a file 1.Remove the directory entry--> crash 2.Mark the i-node as free 3.Mark disk blocks as free The i-node and data blocks are lost

FS reliability dir entries i-nodes data blocks 2 3 1 Example: deleting a file 1.Mark the i-node as free --> crash 2.Remove the directory entry 3.Mark disk blocks as free The dir entry points to the wrong file

FS reliability dir entries i-nodes data blocks 2 3 1 Example: deleting a file 1.Mark disk blocks as free --> crash 2.Remove the directory entry 3.Mark the i-node as free The file randomly shares disk blocks with other files

e2fsck FS reliability Scans the disk after an unclean shutdown and attempts to restore FS invariants Journaling file systems Keep a journal of FS updates Before performing an atomic update sequence, write it to the journal Replay the last journal entries upon an unclean shutdown Example: ext3fs