CS5460: Operating Systems
Lecture 20: File System Reliability

File System Optimizations
  Goal: Reduce or hide expensive disk operations

  Era       Technique               Effect
  Modern    Disk buffer cache       Eliminates problem
  Modern    Aggregated disk I/O     Reduces seeks
  Modern    Prefetching             Overlap/hide disk access
  Historic  Disk head scheduling    Reduces seeks
  Historic  Disk interleaving       Reduces rotational latency

Buffer/Page Cache
  Idea: Keep recently used disk blocks in kernel memory
  Process reads from a file:
    If blocks are not in buffer cache
      » Allocate space in buffer cache (Q: What do we purge and how?)
      » Initiate a disk read
      » Block the process until disk operations complete
    Copy data from buffer cache to process memory
    Finally, system call returns
  Usually, a process does not see the buffer cache directly
    mmap() maps buffer cache pages into process RAM
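
The read path above can be sketched in a few lines. This is a toy model, not how any particular kernel does it: the `BufferCache` class, the dict standing in for the disk, and the choice of least-recently-used (LRU) eviction as the answer to "what do we purge" are all assumptions made for illustration.

```python
from collections import OrderedDict


class BufferCache:
    """Toy buffer cache: block number -> bytes, evicting the least recently used block."""

    def __init__(self, disk, capacity):
        self.disk = disk              # hypothetical backing store: dict of block number -> bytes
        self.capacity = capacity      # maximum number of cached blocks
        self.blocks = OrderedDict()   # oldest entry first, most recently used last
        self.disk_reads = 0           # how many reads actually went to the "disk"

    def read(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)   # hit: mark as most recently used
            return self.blocks[block_no]
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)     # purge the least recently used block
        self.disk_reads += 1                    # miss: a real kernel would block the process here
        data = self.disk[block_no]
        self.blocks[block_no] = data
        return data


disk = {n: f"block-{n}".encode() for n in range(8)}
cache = BufferCache(disk, capacity=2)
for n in (0, 1, 0, 2, 0, 1):
    cache.read(n)
print(cache.disk_reads, list(cache.blocks))     # prints: 4 [0, 1]
```

Six reads cost four disk accesses here; block 0 stays cached because it keeps being touched, while blocks 1 and 2 evict each other.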

Buffer/Page Cache
  Process writes to a file:
    If blocks are not in the buffer cache
      » Allocate pages
      » Initiate disk read
      » Block process until disk operations complete
    Copy written data from process RAM to buffer cache
  Default: writes create dirty pages in the cache, then the system call returns
    Data gets written to device in the background
    What if the file is unlinked before it goes to disk?
  Optional: Synchronous writes which go to disk before the system call returns
    Really slow!
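
A minimal sketch of the two write modes, assuming whole-block writes so that the read-before-write step can be skipped. The `WriteBackCache` class and its method names are hypothetical; the dict again stands in for the disk.

```python
class WriteBackCache:
    """Toy write path: writes dirty a cached page; the disk is updated later or on request."""

    def __init__(self, disk):
        self.disk = disk        # hypothetical backing store: dict of block number -> bytes
        self.pages = {}         # cached block contents
        self.dirty = set()      # blocks modified in memory but not yet on disk

    def write(self, block_no, data, sync=False):
        self.pages[block_no] = data
        self.dirty.add(block_no)
        if sync:
            self.flush_block(block_no)   # synchronous write: reach the disk before returning

    def flush_block(self, block_no):
        self.disk[block_no] = self.pages[block_no]
        self.dirty.discard(block_no)

    def background_flush(self):
        # Stand-in for the background writer that cleans dirty pages.
        for block_no in sorted(self.dirty):
            self.disk[block_no] = self.pages[block_no]
        self.dirty.clear()


disk = {}
wb = WriteBackCache(disk)
wb.write(3, b"lazy")                  # default: returns with the page still dirty
on_disk_after_default = dict(disk)
wb.write(4, b"now", sync=True)        # synchronous: on disk when the call returns
on_disk_after_sync = dict(disk)
wb.background_flush()
```

A crash between the default write and the background flush would lose block 3, which is the price paid for the fast return.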

Performing Large File I/Os
  Idea: Try to allocate contiguous chunks of file in large contiguous regions of the disk
    Disks have excellent bandwidth, but lousy latency!
    Amortize expensive seeks over many block reads/writes
  Question: How?
    Maintain free block bitmap (cache parts in memory)
    When you allocate blocks, use a modified best fit algorithm, rather than allocating a block at a time (pre-allocate even)
  Problem: Hard to do this when disk full/fragmented
    Solution A: Keep a reserve (e.g., 10%) available at all times
    Solution B: Run a disk defragger occasionally
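
One way to picture allocation from a free block bitmap is the sketch below. It uses plain best fit (the smallest free run that is large enough), not the "modified" variant the slide mentions, whose details are not given here. The function names and the list-of-booleans bitmap are assumptions.

```python
def free_runs(bitmap):
    """Yield (start, length) for each maximal run of free blocks (False = free)."""
    start = None
    for i, used in enumerate(bitmap):
        if not used and start is None:
            start = i
        elif used and start is not None:
            yield start, i - start
            start = None
    if start is not None:
        yield start, len(bitmap) - start


def allocate_best_fit(bitmap, count):
    """Mark `count` contiguous blocks used in the smallest free run that fits.

    Returns the first block number, or None if no run is large enough.
    """
    candidates = [(length, start) for start, length in free_runs(bitmap) if length >= count]
    if not candidates:
        return None
    _, start = min(candidates)
    for i in range(start, start + count):
        bitmap[i] = True
    return start


# True = used, False = free. Free runs: blocks 1-2, 4-7, and 9-11.
bitmap = [True, False, False, True, False, False, False, False, True, False, False, False]
first = allocate_best_fit(bitmap, 3)    # takes the 3-block run at 9, leaving 4-7 intact
second = allocate_best_fit(bitmap, 2)   # takes the 2-block run at 1
third = allocate_best_fit(bitmap, 4)    # takes the 4-block run at 4
fourth = allocate_best_fit(bitmap, 1)   # disk is now full
```

Choosing the tightest run keeps the large run available for the later 4-block request, which a first-fit policy would have split.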

Prefetching
  Idea: Read blocks from disk ahead of user request
    Goal: Reduce number of seeks visible to user
    If block read before request -> hits in file buffer cache
  [Figure: timeline of User requests (Read 0, Read 1, Read 2) and File System reads (Read 0, Read 1, Read 2)]
  Problem: What blocks should we prefetch?
    Easy: Detect sequential access and prefetch ahead N blocks
    Harder: Detect periodic/predictable random accesses
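
The "easy" case can be sketched as a small detector. The `Prefetcher` class, the `trigger` threshold, and the default window size are hypothetical choices for illustration; real file systems tune these differently.

```python
class Prefetcher:
    """Detect sequential reads on one file and suggest blocks to read ahead."""

    def __init__(self, window=4, trigger=2):
        self.window = window      # N: how many blocks to read ahead (assumed value)
        self.trigger = trigger    # consecutive sequential reads required before prefetching
        self.last_block = None
        self.run_length = 0

    def on_read(self, block_no):
        """Record a read and return the list of block numbers worth prefetching."""
        if self.last_block is not None and block_no == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 0           # pattern broken: start counting again
        self.last_block = block_no
        if self.run_length >= self.trigger:
            return list(range(block_no + 1, block_no + 1 + self.window))
        return []


p = Prefetcher(window=4, trigger=2)
suggestions = [p.on_read(b) for b in (0, 1, 2, 9)]
print(suggestions)   # prints: [[], [], [3, 4, 5, 6], []]
```

The jump to block 9 resets the detector, so a random access pattern costs no wasted prefetches.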

Fault Tolerance and Reliability

Fault Tolerance
  What kinds of failures do we need to consider?
    OS crash, power failure
      » Data not on disk is lost; rarely, partial writes
    Disk media failure
      » Data on disk corrupted or unavailable
    Disk controller failure
      » Large swaths of data unavailable temporarily or permanently
    Network failure
      » Clients and servers cannot communicate (transient failure)
      » Only have access to stale data (if any)
    (what else?)

Techniques to Tolerate Failure
  Careful disk writes and fsck
    Leave disk in recoverable state even if not all writes finish
    Run disk check program to identify/fix inconsistent disk state
  RAID: Redundant Array of Independent (originally "Inexpensive") Disks
    Write each block on more than one independent disk
    If disk fails, can recover block contents from non-failed disks
  Logging
    Rather than overwrite-in-place, write changes to log file
    Use two-phase commit to make log updates transactional
  Clusters
    Replicate data at the server level

Careful Writes
  Order writes so that disk state is recoverable
    Accept that disk contents may be inconsistent or stale
    Run sanity check program to detect and fix problems
  Properties that should hold at all times
    All blocks pointed to are not marked free
    All blocks not pointed to are marked free
    No block belongs to more than one file
  Goal: Avoid major inconsistency
  Not a goal: Never lose data
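
The three properties above are mechanical enough to check in code. The sketch below is a toy stand-in for a sanity check program, assuming a hypothetical in-memory view of the disk: a dict mapping file names to block lists, and a free bitmap where True means "marked free".

```python
def check_invariants(files, free_bitmap):
    """Return a list of human-readable violations of the three careful-write properties.

    files: dict of file name -> list of block numbers the file points to
    free_bitmap: list of bool, where True means the block is marked free
    """
    problems = []
    owners = {}
    for name in sorted(files):
        for b in files[name]:
            if b in owners:
                problems.append(f"block {b} belongs to both {owners[b]} and {name}")
            else:
                owners[b] = name
    for b, is_free in enumerate(free_bitmap):
        if b in owners and is_free:
            problems.append(f"block {b} is in use by {owners[b]} but marked free")
        elif b not in owners and not is_free:
            problems.append(f"block {b} is unreferenced but marked used")
    return problems


consistent = check_invariants({"a": [0], "b": [2]}, [False, True, False])
broken = check_invariants({"a": [0, 1], "b": [1, 2]}, [False, False, True, False])
for line in broken:
    print(line)
```

Of the three violations in the broken case, the unreferenced-but-used block only leaks space, while the shared block and the in-use block marked free can corrupt data, which is why write ordering aims to make only the first kind possible.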

Careful Writes Example
  To create a file, you must:
    Allocate and initialize an inode
    Allocate and initialize some data blocks
    Modify the directory file of the directory containing the file
    Modify the directory file's inode (last modified time, size)
  In what order should we do these writes?
  How to add transactional (all or nothing) semantics?
  How do careful writes interact with optimizations?
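
One way to reason about the ordering question is to simulate a crash after every prefix of a proposed write order and ask whether the surviving state is acceptable. The sketch below does that for file creation. The step names and the two rules in `recoverable` are simplifying assumptions (nothing reachable may be uninitialized or marked free); the ordering shown is one candidate, not necessarily the intended answer.

```python
def recoverable(done):
    """Is the disk acceptable if exactly the writes in `done` reached it?"""
    done = set(done)
    # An on-disk inode must not point at blocks that are uninitialized or still marked free.
    if "inode" in done and not {"bitmaps", "data"} <= done:
        return False
    # A directory entry must not point at an inode that was never written.
    if "dir_entry" in done and "inode" not in done:
        return False
    return True


def crash_states_ok(order):
    """True if a crash after any prefix of `order` leaves a recoverable state."""
    return all(recoverable(order[:k]) for k in range(len(order) + 1))


careful = ["bitmaps", "data", "inode", "dir_entry", "dir_inode"]
careless = ["dir_entry", "inode", "data", "bitmaps", "dir_inode"]
print(crash_states_ok(careful), crash_states_ok(careless))   # prints: True False
```

With the careful order, a crash can still leave blocks marked used that nothing points to; a disk check program can reclaim those, so the state counts as recoverable.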

Careful Writes Exercise
  To delete a file, you must:
    Deallocate the file's inode
    Deallocate the file's disk blocks
    Modify the directory file of the directory containing the file
    Update the directory file's inode
  In what order should we do these operations?
    Consider what intermediate states are recoverable via fsck

Soft Update Rules
  Never point to a block before initializing it
  Never reuse a block before nullifying pointers to it
  Never reset last pointer to live block before setting a new one
  Always mark a block's free-block bitmap entry as used before making the directory entry point to it

Careful Writes: More Exercises
  To write a file, you must:
    Modify (and perhaps allocate) the file's disk blocks
    Modify the file's inode (size and last modified time)
    Maybe, modify indirect block(s)
  To move a file between directories, you must:
    Modify the source directory
    Modify the destination directory
    Modify the inodes of both directories

RAID
  Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk
  [Figure: CPU, I/O bus, RAID ctlr.]
  Issues to consider:
    Multiple disks -> higher aggregate throughput (more spindles)
    Multiple disks -> (hopefully) independent failure modes
    Multiple disks -> vulnerable to individual disk failures (MTTF)
    Writing to multiple disks for replication -> higher write overhead

Possible Uses of Multiple Disks
  Striping
    Spread pieces of a single file across multiple disks
    Advantages:
      » Can service multiple independent requests in parallel
      » Can service single large requests in parallel
    Issues:
      » Interleave factor
      » How the data is striped across disks
  Redundancy (replication)
    Store multiple copies of blocks on independent disks
    Advantages:
      » Can tolerate partial system failure -> How much?
    Issues:
      » How widely do you want to spread the data?

Types of RAID

  RAID level  Description
  0           Data striping w/o redundancy
  1           Disk mirroring
  2           Parallel array of disks w/ error correcting disk (checksum)
  3           Bit-interleaved parity
  4           Block-interleaved parity
  5           Block-interleaved, distributed parity

RAID Level 0
  Striping
    Spread contiguous blocks of a file across multiple spindles
    Simple round-robin distribution
  Non-redundant
    No fault tolerance
  Advantages
    Higher throughput
    Larger storage
  Disadvantages
    Lower reliability: any drive failure destroys the file system
    Added cost
  [Figure: CPU, I/O bus, RAID ctlr.]
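
Round-robin distribution comes down to one division. The sketch assumes a stripe unit of exactly one block; `raid0_locate` is a hypothetical helper name, and real arrays usually stripe in larger chunks.

```python
def raid0_locate(logical_block, num_disks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks


layout = [raid0_locate(b, 4) for b in range(6)]
print(layout)   # prints: [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```

Consecutive logical blocks land on different disks, so a large sequential request keeps all four spindles busy at once.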

RAID Level 1
  Mirroring
    Write complete copies of all blocks to multiple disks
    How many copies -> how much reliability
  No striping
    No added write bandwidth
    Potential for pipelined reads
  Advantage: Can tolerate disk failures ("availability")
  Disadvantage: High cost (extra disks and RAID controller)
  Q: How to recover from drive failure?
  [Figure: CPU, I/O bus, RAID ctlr.]

RAID Level 5
  Striping + distributed parity (no mirroring)
    Spread contiguous blocks of a file across multiple spindles
    Adds parity information
      » Example: XOR of other blocks
    Combines features of 0 & 1: striped throughput plus redundancy
  Advantages
    Higher throughput
    Lower cost (than level 1)
    Any single disk can fail
  Disadvantages
    More complexity in RAID controller
    Slower recovery time than RAID 1
  RAID 6: two parity blocks per stripe, so any two disks can fail
  [Figure: CPU, I/O bus, RAID ctlr.]
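
The XOR example can be made concrete. The sketch below computes the parity block for one stripe and then rebuilds a lost data block from the survivors; the block contents are made-up two-byte values and `xor_blocks` is a hypothetical helper name. It ignores how parity is rotated across disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus their parity block.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(data)

# The disk holding data[1] fails: XOR the surviving blocks with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(parity.hex(), rebuilt == data[1])   # prints: 96a4 True
```

Rebuilding needs a read from every surviving disk in the stripe, which is one reason recovery is slower than copying from a mirror.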

RAID Tradeoffs
  Space efficiency
  Minimum number of disks
  Number of simultaneous failures tolerated
  Read performance
  Write performance
  Time to recover from a failed disk
  Complexity of controller

RAID Discussion
  RAID can be implemented by hardware or software
    Hardware RAID implemented by RAID controller
      » Often supports hot swapping using hot spare disks
      » Not totally clear that cheap RAID HW is worth it
    Software RAID implemented by OS kernel (device driver)
  Multiple parity disks can handle multiple errors
  Nested RAID
    Can use a RAID array as a disk in a higher level RAID
      » RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
      » RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays

RAID Discussion
  What are the risks due to purchasing a large number of disks at the same time for use in a RAID?
  Hot spares can be useful
  What does a RAID look like to the file system code?
  RAID summary
    Tolerates failed disks
    May not deal well with correlated failure modes
    Can improve sustained transfer rate
    Does not improve individual seek latencies

Observations: Logging / Journaling
  Recreating consistent disk after failure is problematic
  Conventional file systems optimized for large contiguous reads
  File buffer cache eliminates reads -> writes often bottleneck
    » Recall careful writes -> cannot defer metadata writes indefinitely
    » Metadata ops access non-contiguous parts of disk (file, inode, dir)
  Idea: redesign the file system around a log
    Contiguous log structure -> append at end
    Usage is similar to a database transaction log
    Eliminate random seeks in the critical path
  Sweeper process
    Copies data from log to real locations
    Kicked off periodically (e.g., log filling up)
  [Figure: log layout: StartTransaction <transaction info> EndTransaction | StartTransaction <transaction info> EndTransaction]

Example: File Creation
  Conventional file system:
    Allocate and initialize inode
    Write inode to disk
    Load directory file
    Load directory inode
    Update directory file
    Write directory file to disk
    Update directory inode
    Write directory inode to disk
    Later: Flush free inode bitmap
    Lots of seeks
    Lots of small writes
  Log-based file system:
    Allocate and initialize inode
    Load directory file
    Load directory inode
    Write:
      » BeginTransaction (FileCreate)
      » Filename: /tmp/foo
      » Inode#: 1234
      » Inode Contents:
      » Directory Contents:
      » EndTransaction (FileCreate)
    Later: Copy data from log to real structures
    Few seeks + one big write
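
The log-based path can be sketched as building the whole transaction in memory and appending it in one go. The record names follow the slide; the `Log` class, the tuple encoding, and the sample inode and directory contents are hypothetical placeholders, since the slide leaves those fields blank.

```python
def file_create_transaction(filename, inode_no, inode_contents, directory_contents):
    """Build the records for one FileCreate transaction, in log order."""
    return [
        ("BeginTransaction", "FileCreate"),
        ("Filename", filename),
        ("Inode#", inode_no),
        ("Inode Contents", inode_contents),
        ("Directory Contents", directory_contents),
        ("EndTransaction", "FileCreate"),
    ]


class Log:
    """Toy append-only log; a list stands in for the contiguous on-disk region."""

    def __init__(self):
        self.records = []
        self.appends = 0     # number of sequential write operations issued

    def append(self, transaction):
        self.records.extend(transaction)   # one contiguous write at the tail of the log
        self.appends += 1


log = Log()
# The inode and directory contents below are made-up placeholder values.
log.append(file_create_transaction("/tmp/foo", 1234, {"size": 0}, ["foo -> 1234"]))
print(log.appends, len(log.records))       # prints: 1 6
```

All six records reach the log with a single append, in contrast to the separate inode, directory file, and directory inode writes of the conventional path.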

Using the Operation Log
  Issue: Inconsistency between log contents and real contents (for anything not yet copied back)
  Questions:
    What problems can this cause?
      » Cannot simply read data/metadata from real locations
      » Need to check log contents on any lookup/read
    How do you get around these problems?
      » Maintain index of logged-but-not-flushed state in DRAM
      » Always check index first whenever you want to read data/metadata
  Issue: What if I re-modify file/inode before flush?
    Correct: Simply flush changes in order they appear in log
    Optimized: If 2nd change negates first, only flush 2nd -> be careful!
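
The index-first read rule and the in-order flush can be sketched together. `LoggedStore` and its fields are hypothetical; string keys stand in for disk locations, and the sweep implements only the "correct" policy, not the optimized one.

```python
class LoggedStore:
    """Toy store where updates go to a log first and are copied back later."""

    def __init__(self):
        self.home = {}       # "real" on-disk locations: key -> value
        self.log = []        # ordered (key, value) updates not yet copied back
        self.index = {}      # in-DRAM index: key -> newest logged value

    def write(self, key, value):
        self.log.append((key, value))
        self.index[key] = value

    def read(self, key):
        if key in self.index:          # always check the index first
            return self.index[key]
        return self.home.get(key)

    def sweep(self):
        for key, value in self.log:    # flush changes in the order they appear in the log
            self.home[key] = value
        self.log.clear()
        self.index.clear()


store = LoggedStore()
store.home["inode 7"] = "size=0"
store.write("inode 7", "size=10")
store.write("inode 7", "size=20")      # re-modified before the first change was flushed
seen_before_sweep = store.read("inode 7")
stale_home_copy = store.home["inode 7"]
store.sweep()
```

Before the sweep, the real location still holds the stale value, so a reader that skipped the index would see old metadata.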

What About File Data Writes?
  Option one: Write the new data into a log
    Later copy data from log to real disk blocks
  Option two: Write new data to real disk blocks right away
  Tradeoffs?

Crash Recovery
  Question: How do you recover after a crash?
    What inconsistencies are possible?
    How do you detect and correct inconsistencies?
  Answer: Run a log sweeper (a la fsck/chkdsk)
    Search through the log to find oldest valid record
    Walk log from oldest to newest:
      » If complete transaction present in the log -> complete (if necessary)
      » If incomplete transaction found -> abort/undo it
  Recovery analogous to transaction logs in database systems
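
The log walk can be sketched as follows, assuming a hypothetical record format of ("start",), ("update", location, value), and ("end",) tuples, and assuming nothing from an incomplete transaction has reached its real location, so aborting it means simply ignoring it.

```python
def recover(log):
    """Return the updates of every complete transaction, oldest first.

    Updates belonging to a transaction with no matching end record are dropped.
    """
    committed = []
    current = None
    for record in log:
        kind = record[0]
        if kind == "start":
            current = []                 # a new transaction begins; any unfinished one is dropped
        elif kind == "update" and current is not None:
            current.append(record[1:])
        elif kind == "end" and current is not None:
            committed.extend(current)    # transaction is complete: its updates must be applied
            current = None
    return committed


crashed_log = [
    ("start",), ("update", "inode 1", "A"), ("end",),
    ("start",), ("update", "inode 2", "B"),          # crash before the end record
]
print(recover(crashed_log))   # prints: [('inode 1', 'A')]
```

Recovery time depends on the length of the log rather than the size of the disk, which is what makes this faster than a full disk check.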

Logging vs. Not
  Advantages of logging:
    Fast metadata operations -> one big synchronous write
    Efficient for small write operations (if normal writes are logged)
    Clean, fast recovery mechanism
  Disadvantages of logging:
    Space overhead -> log and in-memory structures
    Complexity -> transactions, extra data structures, sweeper process
    Duplication of effort -> write to both log and real locations

Logging Filesystems in Practice
  NTFS uses a log
  Recent versions of UFS+ use a log
  Linux EXT2 does not use a log
    Works using techniques we discussed through the last lecture
  Linux EXT3 is log-based, and is forward-compatible
    You can take an EXT2 filesystem and start using it as EXT3 by adding a log
    EXT3 can be converted back to EXT2
  EXT4 is more sophisticated than EXT3 but still retains back-compatibility
  Btrfs does not use a journal for consistency (it uses copy-on-write instead)

Questions?