CS3210: Crash consistency

Similar documents
Lecture 10: Crash Recovery, Logging

Lecture 11: Linux ext3 crash recovery

CSE506: Operating Systems CSE 506: Operating Systems

FS Consistency & Journaling

CS 318 Principles of Operating Systems

Ext3/4 file systems. Don Porter CSE 506

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

File System Consistency. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

Announcements. Persistence: Crash Consistency

ò Very reliable, best-of-breed traditional file system design ò Much like the JOS file system you are building now

Caching and reliability

File System Consistency

CS3210: Virtual memory applications. Taesoo Kim

Lecture 18: Reliable Storage

Journaling versus Soft-Updates: Asynchronous Meta-data Protection in File Systems

CS3210: Virtual memory. Taesoo Kim w/ minor updates K. Harrigan

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

CrashMonkey: A Framework to Systematically Test File-System Crash Consistency. Ashlie Martinez Vijay Chidambaram University of Texas at Austin

Journaling. CS 161: Lecture 14 4/4/17

EECS 482 Introduction to Operating Systems

Linux Filesystems Ext2, Ext3. Nafisa Kazi

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

CHAPTER 11: IMPLEMENTING FILE SYSTEMS (COMPACT) By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

Operating Systems. File Systems. Thomas Ropars.

File Systems: Consistency Issues

CS5460: Operating Systems Lecture 20: File System Reliability

CS3210: File Systems Tim Andersen

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

CSE 451: Operating Systems Winter Module 17 Journaling File Systems

CS3600 SYSTEMS AND NETWORKS

CS3210: Virtual memory applications

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

Tricky issues in file systems

Case study: ext2 FS 1

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

CS3210: Processes and switching 1. Taesoo Kim

Push-button verification of Files Systems via Crash Refinement

CSE 333 Lecture 9 - storage

[537] Journaling. Tyler Harter

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

Case study: ext2 FS 1

November 9 th, 2015 Prof. John Kubiatowicz

Announcements. Persistence: Log-Structured FS (LFS)

Advanced File Systems. CS 140 Feb. 25, 2015 Ali Jose Mashtizadeh

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

CS 318 Principles of Operating Systems

FFS: The Fast File System -and- The Magical World of SSDs

W4118 Operating Systems. Instructor: Junfeng Yang

File System Implementation

ECE 598 Advanced Operating Systems Lecture 18

Review: FFS background

CS3210: Isolation Mechanisms

Towards Efficient, Portable Application-Level Consistency

CS370 Operating Systems

Review: FFS [McKusic] basics. Review: FFS background. Basic FFS data structures. FFS disk layout. FFS superblock. Cylinder groups

CS370: Operating Systems [Fall 2018] Dept. Of Computer Science, Colorado State University

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Journaling and Log-structured file systems

CS370: Operating Systems [Fall 2018] Dept. Of Computer Science, Colorado State University

What is a file system

Block Device Scheduling. Don Porter CSE 506

Block Device Scheduling

The Google File System

ext3 Journaling File System

C13: Files and Directories: System s Perspective

Chapter 11: File System Implementation. Objectives

CS 4284 Systems Capstone

File Systems II. COMS W4118 Prof. Kaustubh R. Joshi hdp://

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

COS 318: Operating Systems. Journaling, NFS and WAFL

V. File System. SGG9: chapter 11. Files, directories, sharing FS layers, partitions, allocations, free space. TDIU11: Operating Systems

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University

6.828 Fall 2007 Quiz II Solutions

Operating Systems. Operating Systems Professor Sina Meraji U of T

CS 318 Principles of Operating Systems

OPERATING SYSTEM. Chapter 12: File System Implementation

Lecture 9: File System. topic: file systems what they are how the xv6 file system works intro to larger topics

NPTEL Course Jan K. Gopinath Indian Institute of Science

Optimizing MySQL performance with ZFS. Neelakanth Nadgir Allan Packer Sun Microsystems

Chapter 11: Implementing File Systems

1 / 23. CS 137: File Systems. General Filesystem Design

Computer Engineering II Solution to Exercise Sheet 9

CS3210: Multiprocessors and Locking

Files and File Systems

I/O and file systems. Dealing with device heterogeneity

2. PICTURE: Cut and paste from paper

Chapter 11: Implementing File

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

CS 537 Fall 2017 Review Session

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

CS122 Lecture 15 Winter Term,

FILE SYSTEMS, PART 2. CS124 Operating Systems Winter , Lecture 24

ECE 598 Advanced Operating Systems Lecture 19

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

Transcription:

CS3210: Crash consistency Kyle Harrigan 1 / 45

Administrivia Lab 4 Part C Due Tomorrow Quiz #2. Lab3-4, Ch 3-6 (read "xv6 book") Open laptop/book, no Internet 2 / 45

Summary of cs3210 Power-on -> BIOS -> bootloader -> kernel -> user programs OS: abstraction, multiplexing, isolation, sharing Design: monolithic (xv6) vs. micro kernels (jos) Abstraction: process, system calls, [[files]], IPC, networking (lab6) Isolation mechanisms: CPL, segmentation, paging File systems: structure (superblock, inode, log, block), API, buffer cache 3 / 45

Why crash recovery (power failure)? 4 / 45

Why crash recovery (bugs)? 5 / 45

What happens after a FS crash? Is it possible that AAAA doesn't exist? (yes/no?) Then, BBBB? Is it possible that BBBB contains junks? (yes/no?) Is it possible that BBBB is empty? (yes/no?) Is it possible that BBBB contains "hello"? (yes/no?) Is it possible that BB exists in the current directory? (yes/no?) $ cat AAAA hello world! $ cp AAAA BBBB [panic]... [reboot] 6 / 45

Why crash recovery? Then, is your file system still usable? Main problem: Crash during multi-step operation Leaves FS invariants violated (Q: examples?) Can lead to ugly FS corruption Worse yet, media corruption (very frequent!) is out-of-scope Ex. bit rot, silent corruption Media error vs memory error? Detect? Correct? ECC memory, ZFS, etc. 7 / 45

Example: inconsistent file systems Breakdowns of create(): create new dirent allocate file inode Crash: dirent points to free inode -- disaster! Crash: inode not free but not used -- not so bad 8 / 45

Today's Lecture Problem: crash recovery crash leads to inconsistent on-disk file system on-disk data structure has "dangling" pointers Solutions: synchronous write delayed writes (e.g., write-back cache, soft updates) logging 9 / 45

What can we hope for? (after recovery) 1. FS internal invariants maintained e.g., no block is both in free list and in a file 2. All but last few operations preserved on disk e.g., data written yesterday are preserved 3. No order anomalies echo 99 > result ; echo done > status 10 / 45

Simplifying assumption: disk is "failstop" Disk executes the writes FS sends it, and does nothing else Perhaps doesn't perform the very last write no wild writes no decay of sectors 11 / 45

Correctness vs. performance Safety -> write to disk ASAP Speed -> don't write the disk (e.g., batch, write-back cache) Two approaches: synchronous meta-data update + fsck (linux ext2) logging (xv6 and linux ext3/4) meta-data: other than actual file contents (i.e., data block) 12 / 45

Synchronous-write solution Synchronous meta-data update: an old approach to crash recovery simple, slow, incomplete Most problem cases look like dangling references inode -> free block dirent -> free inode 13 / 45

Idea: always initialize on disk before creating any reference "synchronous writes" is implemented by 1. doing the initialization write 2. waiting for it to complete 3. and then doing the referencing write 14 / 45

Example: file creation Q: what's the right order of synchronous writes (dirent -> free inode)? 15 / 45

Example: file creation Q: what's the right order of synchronous writes (dirent -> free inode)? 1. mark inode as allocated 2. create directory entry 16 / 45

What will be true after crash+reboot? create(): 1. mark inode as allocated <- Q: what if failed after ialloc()? 2. create directory entry 17 / 45

Idea: fix FS when mounting (if crashed) To free unreferenced inodes and blocks (orphan) To clean-up an interrupted rename() 18 / 45

Problems with sync. meta-data update Very slow during normal operation (Q: why?) Very slow during recovery (Q: why? e.g., 100 MB/sec on 2TB HDD) 19 / 45

How to get better performance? Use RAM (e.g., write-back cache) Exploit disk sequential throughput (100 MB/sec) Keep track of dependencies among buffer caches Q: cycle dependencies? Q: still need slow fsck? 20 / 45

Storage performance Q: HDD vs. SSD? faster? bandwidth? Q: which one is faster? read vs. write? Q: in sequential vs. random? (ref. http://www.pcgamer.com/hard-drive-vs-ssd-performance/2/) 21 / 45

Chart1: Sequential read 22 / 45

Chart2: Sequential write 23 / 45

Chart3: Random read 24 / 45

Chart4: Random write 25 / 45

Better idea: "logging" How can we get both speed and safety? write only to cache somehow remember relationships among writes e.g., don't send #1 to disk w/o #2 and #3 26 / 45

Goals of logging 1. Atomic system calls w.r.t. crashes 2. Fast recovery (no hour-long fsck) 3. Speed of write-back cache for normal operations 27 / 45

Basic approach: "write-ahead" logging Atomicity: transaction either fails or succeeds 1. record all writes to the log 2. record "done" 3. do the real writes 4. clear "done" On crash+recovery: if "done" in log, replay all writes in log if no "done", ignore log 28 / 45

xv6's simple logging 01 + beg_op(); 02 bp = bread(dev, bn); 03 // modify bp=>data[] 04 - bwrite(buf); 05 + log_write(bp); 06 brelse(bp); 07 + end_op(); 29 / 45

xv6's simple logging 01 + beg_op(); 02 bp = bread(dev, bn); 03 // modify bp=>data[] 04 - bwrite(buf); 05 + log_write(bp); 06 brelse(bp); 07 + end_op(); What is good about this design? 30 / 45

xv6's simple logging 01 + beg_op(); 02 bp = bread(dev, bn); 03 // modify bp=>data[] 04 - bwrite(buf); 05 + log_write(bp); 06 brelse(bp); 07 + end_op(); What is good about this design? Correctness due to write-ahead log Good disk throughput (Q: why? why not?) Faster recovery without slow fsck Q: What about concurrency? xv6: no concurrency to make our life easier 31 / 45

Disk structure for logging log n superblock data blocks logheader n block[logsize]... 32 / 45

Example: writing a block (bn = 100) bn=100 AA... n=0 superblock DD... logheader data blocks... n = 0 block[logsize] log... 33 / 45

Step1: writing to a log bn=100 AA... n=0 log superblock DD... logheader data blocks... n = 0 block[logsize] AA...... 34 / 45

Step2: flushing the logheader (committing) bn=100 AA... n=1 log superblock DD... logheader data blocks... n = 0 block[logsize] AA...... n = 1 block[0] = 100 35 / 45

Step3: overwriting the data block bn=100 AA... n=1 log superblock AA... logheader data blocks... n = 1 block[logsize] AA...... n = 1 block[0] = 100 36 / 45

Step4: cleaning up the logheader n=0 bn=100 log superblock AA... logheader data blocks... n = 0 block[logsize] AA...... n = 1 block[0] = 100 n = 0 block[0] = 0 37 / 45

What if failed (say power-o! and reboot)? Does FS contain "AA.." (❶) or "BB.." (❷)? Step1: writing to a log (❶/❷?) Step2: flushing the logheader (❶/❷?) Step3: overwriting the data block (❶/❷?) Step4: cleaning up the logheader (❶/❷?) 38 / 45

Step1: writing to a log bn=100 AA... n=0 log superblock DD... logheader data blocks... n = 0 block[logsize] AA...... 39 / 45

Step2: flushing the logheader bn=100 AA... n=1 log superblock DD... logheader data blocks... n = 0 block[logsize] AA...... n = 1 block[0] = 100 40 / 45

Step3: overwriting the data block bn=100 AA... n=1 log superblock AA... logheader data blocks... n = 1 block[logsize] AA...... n = 1 block[0] = 100 41 / 45

Step4: cleaning up the logheader? n=0 bn=100 log superblock AA... logheader data blocks... n = 0 block[logsize] AA...... n = 1 block[0] = 100 n = 0 block[0] = 0 42 / 45

DEMO: dumplog.c 01 static void commit() { 02 if (log.lh.n > 0) { 03 write_log(); // Write modified blocks from cache to log 04 // Q1: panic("after writing to log!"); 05 write_head(); // Write header to disk -- the real commit 06 // Q2: panic("after writing the loghead!"); 07 install_trans(); // Now install writes to home locations 08 // Q3: panic("after the transaction!"); 09 log.lh.n = 0; 10 write_head(); // Erase the transaction from the log 11 // Q4: panic("after cleaning the loghead!"); 12 } 13 } 43 / 45

A few complications How to write larger data that doesn't fit to the log region? How to handle concurrency? How to avoid 2x writing (redundant)? How to log partial data (changes on a few bits)? 44 / 45

References Intel Manual UW CSE 451 OSPP MIT 6.828 Wikipedia The Internet Previous charts from Taesoo Kim and Tim Andersen 45 / 45