CS5460: Operating Systems Lecture 20: File System Reliability


File System Optimizations
Goal: Reduce or hide expensive disk operations

  Technique (modern unless noted)   Effect
  Disk buffer cache                 Eliminates the disk access entirely
  Aggregated disk I/O               Reduces seeks
  Prefetching                       Overlaps/hides disk access
  Disk head scheduling              Reduces seeks
  Disk interleaving (historic)      Reduces rotational latency

Buffer/Page Cache
Idea: Keep recently used disk blocks in kernel memory
Process reads from a file:
- If the blocks are not in the buffer cache:
  » Allocate space in the buffer cache (Q: what do we purge, and how?)
  » Initiate a disk read
  » Block the process until the disk operation completes
- Copy data from the buffer cache to process memory
- Finally, the system call returns
Usually a process does not see the buffer cache directly; mmap() maps buffer cache pages into process RAM

Buffer/Page Cache
Process writes to a file:
- If the blocks are not in the buffer cache:
  » Allocate pages
  » Initiate a disk read
  » Block the process until the disk operation completes
- Copy the written data from process RAM to the buffer cache
Default: writes create dirty pages in the cache, then the system call returns
- Data gets written to the device in the background
- What if the file is unlinked before it goes to disk?
Optional: synchronous writes, which go to disk before the system call returns
- Really slow!
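The read and write paths above can be sketched as a toy write-back cache. This is a simplified illustration, not any real kernel's design: the class name, the dict-as-disk, and the LRU policy are all assumptions for the sketch; a real buffer cache would issue asynchronous I/O and block the process on a miss.

```python
from collections import OrderedDict

class BufferCache:
    """Toy write-back buffer cache with LRU eviction (illustrative only)."""

    def __init__(self, disk, capacity=64):
        self.disk = disk              # stands in for the device: block# -> bytes
        self.capacity = capacity
        self.cache = OrderedDict()    # block# -> (data, dirty); LRU order

    def _evict_if_full(self):
        while len(self.cache) > self.capacity:
            victim, (data, dirty) = self.cache.popitem(last=False)  # evict LRU
            if dirty:
                self.disk[victim] = data      # write back before purging

    def read(self, blockno):
        if blockno not in self.cache:         # miss: "initiate disk read"
            self.cache[blockno] = (self.disk[blockno], False)
            self._evict_if_full()
        self.cache.move_to_end(blockno)       # mark most recently used
        return self.cache[blockno][0]

    def write(self, blockno, data):
        self.cache[blockno] = (data, True)    # dirty page; "syscall returns" now
        self.cache.move_to_end(blockno)
        self._evict_if_full()

    def sync(self):
        """Background write-back: flush all dirty pages to the device."""
        for blockno, (data, dirty) in list(self.cache.items()):
            if dirty:
                self.disk[blockno] = data
                self.cache[blockno] = (data, False)
```

Note how `write()` returns before the device is touched; only `sync()` (the background flusher, or an explicit fsync) makes the data durable, which is exactly the window in which a crash loses data.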

Performing Large File I/Os
Idea: Allocate contiguous chunks of a file in large contiguous regions of the disk
- Disks have excellent bandwidth but lousy latency!
- Amortize expensive seeks over many block reads/writes
Question: How?
- Maintain a free-block bitmap (cache parts of it in memory)
- When allocating, use a modified best-fit algorithm rather than allocating one block at a time (even pre-allocate)
Problem: Hard to do this when the disk is full or fragmented
- Solution A: Keep a reserve (e.g., 10%) available at all times
- Solution B: Run a disk defragmenter occasionally
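A minimal sketch of best-fit allocation over a free-block bitmap follows. The function names are invented for illustration, and "best fit" here means the smallest free run that still satisfies the request, which tends to preserve the large runs for large allocations:

```python
def best_fit(bitmap, want):
    """Return the start of the smallest free run >= want blocks, or None.

    bitmap is a list of booleans: True = free, False = allocated.
    """
    best, best_len = None, None
    i = 0
    while i < len(bitmap):
        if bitmap[i]:                      # start of a free run: measure it
            j = i
            while j < len(bitmap) and bitmap[j]:
                j += 1
            run = j - i
            if run >= want and (best_len is None or run < best_len):
                best, best_len = i, run
            i = j
        else:
            i += 1
    return best

def allocate(bitmap, want):
    """Claim a contiguous extent of `want` blocks; return its start or None."""
    start = best_fit(bitmap, want)
    if start is not None:
        for k in range(start, start + want):
            bitmap[k] = False              # mark the extent allocated
    return start
```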

Prefetching
Idea: Read blocks from disk ahead of user requests
Goal: Reduce the number of seeks visible to the user
- If a block is read before it is requested → it hits in the file buffer cache
(figure: timeline in which the user issues Read 0, Read 1, Read 2 while the file system stays one read ahead)
Problem: What blocks should we prefetch?
- Easy: Detect sequential access and prefetch N blocks ahead
- Harder: Detect periodic or otherwise predictable random accesses
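The "easy" sequential case can be sketched in a few lines. The class and its fields are hypothetical; a real implementation would issue asynchronous disk reads where this sketch merely records which blocks it would fetch:

```python
class Prefetcher:
    """Detect sequential access and prefetch the next `depth` blocks (sketch)."""

    def __init__(self, read_block, depth=4):
        self.read_block = read_block   # underlying block-read function
        self.depth = depth             # how far ahead to read
        self.last = None               # last block the user requested
        self.prefetched = set()        # blocks we would have fetched already

    def read(self, blockno):
        data = self.read_block(blockno)
        if self.last is not None and blockno == self.last + 1:
            # Sequential pattern detected: stay `depth` blocks ahead.
            for b in range(blockno + 1, blockno + 1 + self.depth):
                self.prefetched.add(b)   # a real FS would issue an async read
        self.last = blockno
        return data
```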

Fault Tolerance and Reliability

Fault Tolerance
What kinds of failures do we need to consider?
- OS crash, power failure
  » Data not yet on disk is lost; rarely, partial writes
- Disk media failure
  » Data on disk is corrupted or unavailable
- Disk controller failure
  » Large swaths of data unavailable, temporarily or permanently
- Network failure
  » Clients and servers cannot communicate (transient failure)
  » Only have access to stale data (if any)
- (What else?)

Techniques to Tolerate Failure
Careful disk writes and fsck
- Leave the disk in a recoverable state even if not all writes finish
- Run a disk-check program to identify/fix inconsistent disk state
RAID: Redundant Array of Inexpensive (or Independent) Disks
- Write each block on more than one independent disk
- If a disk fails, recover block contents from the non-failed disks
Logging
- Rather than overwriting in place, write changes to a log file
- Use two-phase commit to make log updates transactional
Clusters
- Replicate data at the server level

Careful Writes
Order writes so that the disk state is always recoverable
- Accept that disk contents may be inconsistent or stale
- Run a sanity-check program to detect and fix problems
Properties that should hold at all times:
- All blocks pointed to are not marked free
- All blocks not pointed to are marked free
- No block belongs to more than one file
Goal: Avoid major inconsistency
Not a goal: Never lose data

Careful Writes Example
To create a file, you must:
- Allocate and initialize an inode
- Allocate and initialize some data blocks
- Modify the directory file of the directory containing the file
- Modify the directory file's inode (last-modified time, size)
In what order should we do these writes?
How do we add transactional (all-or-nothing) semantics?
How do careful writes interact with optimizations?
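One defensible ordering, following the pointer-last discipline of the soft-update rules covered later in the lecture, can be sketched as below. The function and the dict-as-disk are hypothetical; the point is only the order of the synchronous writes:

```python
def create_file(disk, inode, data_blocks, dirent, dir_inode):
    """Careful-write ordering for file creation (illustrative sketch).

    The directory entry -- the pointer that makes the file reachable --
    is written after the structures it points to, so a crash after any
    prefix of these writes leaves only allocated-but-unreferenced
    blocks/inodes, which fsck can safely reclaim.
    """
    writes = [
        ("data",      data_blocks),   # 1. initialize the data blocks
        ("inode",     inode),         # 2. initialize the inode (points at data)
        ("dirent",    dirent),        # 3. directory entry points at the inode
        ("dir_inode", dir_inode),     # 4. directory inode (mtime, size)
    ]
    for key, value in writes:         # each write completes before the next
        disk[key] = value
    return [k for k, _ in writes]
```

Reversing steps 2 and 3 would be the dangerous order: a crash in between would leave a directory entry pointing at an uninitialized inode, which fsck cannot distinguish from valid garbage.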

Careful Writes Exercise
To delete a file, you must:
- Deallocate the file's inode
- Deallocate the file's disk blocks
- Modify the directory file of the directory containing the file
- Update the directory file's inode
In what order should we do these operations?
- Consider which intermediate states are recoverable via fsck

Soft Update Rules
- Never point to a block before initializing it
- Never reuse a block before nullifying the pointers to it
- Never reset the last pointer to a live block before setting a new one
- Always mark free-block bitmap entries as used before making a directory entry point to them

Careful Writes: More Exercises
To write a file, you must:
- Modify (and perhaps allocate) the file's disk blocks
- Modify the file's inode (size and last-modified time)
- Maybe modify indirect block(s)
To move a file between directories, you must:
- Modify the source directory
- Modify the destination directory
- Modify the inodes of both directories

RAID
Goal: Organize multiple physical disks into a single high-performance, high-reliability logical disk
(figure: CPU connected over the I/O bus to a RAID controller and its disks)
Issues to consider:
- Multiple disks → higher aggregate throughput (more spindles)
- Multiple disks → (hopefully) independent failure modes
- Multiple disks → more vulnerable to individual disk failures (MTTF)
- Writing to multiple disks for replication → higher write overhead

Possible Uses of Multiple Disks
Striping
- Spread pieces of a single file across multiple disks
- Advantages:
  » Can service multiple independent requests in parallel
  » Can service a single large request in parallel
- Issues:
  » Interleave factor
  » How the data is striped across disks
Redundancy (replication)
- Store multiple copies of blocks on independent disks
- Advantages:
  » Can tolerate partial system failure → how much?
- Issues:
  » How widely do you want to spread the data?

Types of RAID
  Level   Description
  0       Data striping without redundancy
  1       Disk mirroring
  2       Parallel array of disks with an error-correcting disk (checksum)
  3       Bit-interleaved parity
  4       Block-interleaved parity
  5       Block-interleaved, distributed parity

RAID Level 0: Striping
- Spread contiguous blocks of a file across multiple spindles (simple round-robin distribution)
- Non-redundant: no fault tolerance
Advantages:
- Higher throughput
- Larger storage
Disadvantages:
- Lower reliability: any drive failure destroys the file system
- Added cost (RAID controller)

RAID Level 1: Mirroring
- Write complete copies of all blocks to multiple disks
- How many copies → how much reliability
- No striping: no added write bandwidth, but potential for pipelined reads
Advantage: Can tolerate disk failures ("availability")
Disadvantage: High cost (extra disks and RAID controller)
Q: How do we recover from a drive failure?

RAID Level 5: Striping + Distributed Parity
- Spread contiguous blocks of a file across multiple spindles
- Adds parity information
  » Example: XOR of the other blocks in the stripe
- Combines features of levels 0 and 1
Advantages:
- Higher throughput
- Lower cost than level 1
- Any single disk can fail
Disadvantages:
- More complexity in the RAID controller
- Slower recovery time than RAID 1
RAID 6: two parity disks
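The XOR-parity idea behind RAID levels 3 through 5 fits in a few lines: the parity block is the XOR of the data blocks, so XOR-ing the survivors with the parity reproduces any one lost block. The helper names are invented for this sketch:

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte strings together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def parity(data_blocks):
    """Compute the parity block for one stripe."""
    return xor_blocks(data_blocks)

def reconstruct(surviving_blocks, parity_block):
    """Recover the single lost data block from the survivors plus parity.

    Works because a ^ a = 0: XOR-ing everything except the lost block
    against the parity cancels all the surviving terms.
    """
    return xor_blocks(list(surviving_blocks) + [parity_block])
```

This is also why RAID 5 writes are expensive: a small write must read the old data and old parity to compute the new parity (the "read-modify-write" penalty), and why RAID 6 needs a second, differently computed parity block to survive two failures.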

RAID Tradeoffs
- Space efficiency
- Minimum number of disks
- Number of simultaneous failures tolerated
- Read performance
- Write performance
- Time to recover from a failed disk
- Complexity of the controller

RAID Discussion
RAID can be implemented in hardware or software
- Hardware RAID is implemented by a RAID controller
  » Often supports hot swapping using hot-spare disks
  » Not totally clear that cheap RAID hardware is worth it
- Software RAID is implemented by the OS kernel (device driver)
Multiple parity disks can handle multiple errors
Nested RAID: use a RAID array as a "disk" in a higher-level RAID
- RAID 1+0: RAID 0 (striping) run across RAID 1 (mirrored) arrays
- RAID 0+1: RAID 1 (mirroring) run across RAID 0 (striped) arrays

RAID Discussion
What are the risks of purchasing a large number of disks at the same time for use in a RAID?
- Hot spares can be useful
What does a RAID look like to the file system code?
RAID summary:
- Tolerates failed disks
- May not deal well with correlated failure modes
- Can improve sustained transfer rate
- Does not improve individual seek latencies

Observations: Logging / Journaling
Recreating a consistent disk after a failure is problematic
- Conventional file systems are optimized for large contiguous reads
- The file buffer cache eliminates most reads → writes are often the bottleneck
  » Recall careful writes → cannot defer metadata writes indefinitely
  » Metadata ops access non-contiguous parts of the disk (file, inode, directory)
Idea: Redesign the file system around a log
- Contiguous log structure → append at the end
- Usage is similar to a database transaction log
- Eliminates random seeks from the critical path
Sweeper process:
- Copies data from the log to its real locations
- Kicked off periodically (e.g., when the log fills up)
Log layout: StartTransaction <transaction info> EndTransaction | StartTransaction <transaction info> EndTransaction | ...

Example: File Creation
Conventional file system:
- Allocate and initialize the inode; write the inode to disk
- Load the directory file and the directory inode
- Update the directory file; write it to disk
- Update the directory inode; write it to disk
- Later: flush the free-inode bitmap
- Result: lots of seeks, lots of small writes
Log-based file system:
- Allocate and initialize the inode
- Load the directory file and the directory inode
- Write to the log:
  » BeginTransaction (FileCreate)
  » Filename: /tmp/foo
  » Inode#: 1234
  » Inode contents:
  » Directory contents:
  » EndTransaction (FileCreate)
- Later: copy data from the log to the real structures
- Result: few seeks + one big write
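The log-based path above can be sketched as a single contiguous append. The record layout and function name are invented for the sketch; the essential property is that one sequential write carries the whole transaction, bracketed by begin/end markers:

```python
def log_create(log, filename, inum, inode, dir_contents):
    """Append a complete FileCreate transaction to the log (sketch).

    The log stands in for the on-disk journal; appending a list is the
    analogue of one big sequential write with no intervening seeks.
    """
    record = [
        ("BeginTransaction", "FileCreate"),
        ("Filename", filename),
        ("Inode#", inum),
        ("InodeContents", inode),
        ("DirectoryContents", dir_contents),
        ("EndTransaction", "FileCreate"),
    ]
    log.extend(record)          # one contiguous append
    return len(record)
```

The EndTransaction marker is what makes recovery unambiguous: if a crash truncates the append mid-record, the missing end marker tells the recovery sweep to discard the partial transaction.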

Using the Operation Log
Issue: Inconsistency between log contents and real contents (for anything not yet copied back)
Questions:
- What problems can this cause?
- How do you get around these problems?
Issue: What if I re-modify a file/inode before the flush?

Using the Operation Log
Issue: Inconsistency between log contents and real contents (for anything not yet copied back)
- What problems can this cause?
  » Cannot simply read data/metadata from the real locations
  » Need to check the log contents on any lookup/read
- How do you get around these problems?
  » Maintain an index of logged-but-not-flushed state in DRAM
  » Always check the index first whenever you want to read data/metadata
Issue: What if I re-modify a file/inode before the flush?
- Correct: Simply flush the changes in the order they appear in the log
- Optimized: If the second change negates the first, only flush the second → be careful!
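The check-the-index-first read path can be sketched as follows. The class and its two dicts are hypothetical stand-ins for the in-DRAM index and the real on-disk locations:

```python
class LoggedFS:
    """Sketch: reads consult the in-memory log index before the real disk."""

    def __init__(self, disk):
        self.disk = disk       # "real" locations: key -> value
        self.index = {}        # logged-but-not-flushed state (in DRAM)

    def write(self, key, value):
        self.index[key] = value        # goes to the log; real location is stale

    def read(self, key):
        if key in self.index:          # always check the log index first,
            return self.index[key]     # or we would return stale data
        return self.disk[key]

    def flush(self):
        """Sweeper: copy logged state to the real locations, then clear."""
        self.disk.update(self.index)
        self.index.clear()
```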

What About File Data Writes?
Option one: Write the new data into the log
- Later, copy the data from the log to the real disk blocks
Option two: Write new data to the real disk blocks right away
Tradeoffs?

Crash Recovery
Question: How do you recover after a crash?
- What inconsistencies are possible?
- How do you detect and correct inconsistencies?
Answer: Run a log sweeper (à la fsck/chkdsk)
- Search through the log to find the oldest valid record
- Walk the log from oldest to newest:
  » If a complete transaction is present in the log → complete it (if necessary)
  » If an incomplete transaction is found → abort/undo it
Recovery is analogous to transaction logs in database systems
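The sweep itself can be sketched in a few lines over a simplified log of "Begin"/"End" markers and update records; the representation is an assumption made for illustration:

```python
def recover(log):
    """Walk the log oldest-to-newest; keep complete transactions only.

    A transaction missing its End marker (e.g., the crash happened
    mid-append) is discarded, implementing abort-by-omission.
    """
    committed, current = [], None
    for rec in log:
        if rec == "Begin":
            current = []                    # start collecting a transaction
        elif rec == "End":
            if current is not None:
                committed.append(current)   # complete: safe to replay
            current = None
        elif current is not None:
            current.append(rec)             # an update inside a transaction
    return committed                        # trailing open transaction dropped
```

Replaying `committed` in order then brings the real on-disk structures up to date, which is why the slide calls this analogous to database redo logging.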

Logging vs. Not
Advantages of logging:
- Fast metadata operations → one big sequential write
- Efficient for small write operations (if normal writes are logged)
- Clean, fast recovery mechanism
Disadvantages of logging:
- Space overhead → log and in-memory structures
- Complexity → transactions, extra data structures, sweeper process
- Duplication of effort → writes go to both the log and the real locations

Logging Filesystems in Practice
- NTFS uses a log
- Recent versions of UFS+ use a log
- Linux EXT2 does not use a log; it works using the techniques we discussed through the last lecture
- Linux EXT3 is log-based and forward-compatible: you can take an EXT2 filesystem and start using it as EXT3 by adding a log, and EXT3 can be converted back to EXT2
- EXT4 is more sophisticated than EXT3 but still retains backward compatibility
- Btrfs does not use logging

Questions?