Announcements. Persistence: Crash Consistency

Similar documents
[537] Journaling. Tyler Harter

FS Consistency & Journaling

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

Announcements. Persistence: Log-Structured FS (LFS)

File System Consistency. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

File System Consistency

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Operating Systems. File Systems. Thomas Ropars.

Journaling and Log-structured file systems

CS 318 Principles of Operating Systems

CS 318 Principles of Operating Systems

File Systems: Consistency Issues

[537] Fast File System. Tyler Harter

Journaling. CS 161: Lecture 14 4/4/17

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

CS 318 Principles of Operating Systems

Lecture 18: Reliable Storage

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

Caching and reliability

JOURNALING FILE SYSTEMS. CS124 Operating Systems Winter , Lecture 26

EECS 482 Introduction to Operating Systems

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

CS5460: Operating Systems Lecture 20: File System Reliability

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

W4118 Operating Systems. Instructor: Junfeng Yang

*-Box (star-box) Towards Reliability and Consistency in Dropbox-like File Synchronization Services

File Systems: Recovery

Chapter 11: File System Implementation. Objectives

Lecture 11: Linux ext3 crash recovery

File Systems II. COMS W4118 Prof. Kaustubh R. Joshi hdp://

Linux Filesystems Ext2, Ext3. Nafisa Kazi

Case study: ext2 FS 1

CS3210: Crash consistency

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

COS 318: Operating Systems. Journaling, NFS and WAFL

File Systems. CS 4410 Operating Systems. [R. Agarwal, L. Alvisi, A. Bracy, M. George, E. Sirer, R. Van Renesse]

Review: FFS background

Ext3/4 file systems. Don Porter CSE 506

CSE 153 Design of Operating Systems

CS 537 Fall 2017 Review Session

File. File System Implementation. File Metadata. File System Implementation. Direct Memory Access Cont. Hardware background: Direct Memory Access

Review: FFS [McKusic] basics. Review: FFS background. Basic FFS data structures. FFS disk layout. FFS superblock. Cylinder groups

Case study: ext2 FS 1

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

Physical Disentanglement in a Container-Based File System

CrashMonkey: A Framework to Systematically Test File-System Crash Consistency. Ashlie Martinez Vijay Chidambaram University of Texas at Austin

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Lecture 10: Crash Recovery, Logging

ViewBox. Integrating Local File Systems with Cloud Storage Services. Yupu Zhang +, Chris Dragga + *, Andrea Arpaci-Dusseau +, Remzi Arpaci-Dusseau +

Chapter 11: Implementing File Systems

Announcements. P4: Graded Will resolve all Project grading issues this week P5: File Systems

Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. University of Wisconsin - Madison

November 9 th, 2015 Prof. John Kubiatowicz

ò Very reliable, best-of-breed traditional file system design ò Much like the JOS file system you are building now

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

I/O and file systems. Dealing with device heterogeneity

File. File System Implementation. Operations. Permissions and Data Layout. Storing and Accessing File Data. Opening a File

Topics. " Start using a write-ahead log on disk " Log all updates Commit

File Layout and Directories

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

CS-537: Final Exam (Fall 2013) The Worst Case

Midterm evaluations. Thank you for doing midterm evaluations! First time not everyone thought I was going too fast

FFS: The Fast File System -and- The Magical World of SSDs

42. Crash Consistency: FSCK and Journaling

NTFS Recoverability. CS 537 Lecture 17 NTFS internals. NTFS On-Disk Structure

Operating Systems. Operating Systems Professor Sina Meraji U of T

File Systems Management and Examples

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

CSE506: Operating Systems CSE 506: Operating Systems

File systems CS 241. May 2, University of Illinois

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 18: Naming, Directories, and File Caching

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Computer Systems Laboratory Sungkyunkwan University

Crash Consistency: FSCK and Journaling

ECE 598 Advanced Operating Systems Lecture 18

Logging File Systems

CS3600 SYSTEMS AND NETWORKS

2. PICTURE: Cut and paste from paper

CS 4284 Systems Capstone

Introduction to OS. File Management. MOS Ch. 4. Mahmoud El-Gayyar. Mahmoud El-Gayyar / Introduction to OS 1

SCSI overview. SCSI domain consists of devices and an SDS

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

OPERATING SYSTEMS CS136

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

CSE380 - Operating Systems

What is a file system

To Everyone... iii To Educators... v To Students... vi Acknowledgments... vii Final Words... ix References... x. 1 ADialogueontheBook 1

Ricardo Rocha. Department of Computer Science Faculty of Sciences University of Porto

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Towards Efficient, Portable Application-Level Consistency

C13: Files and Directories: System s Perspective

CS 111. Operating Systems Peter Reiher

Implementation should be efficient. Provide an abstraction to the user. Abstraction should be useful. Ownership and permissions.

File Systems Ch 4. 1 CS 422 T W Bennet Mississippi College

Operating System Concepts Ch. 11: File System Implementation

CSC369 Lecture 9. Larry Zhang, November 16, 2015

Chapter 10: Case Studies. So what happens in a real operating system?

Files and File Systems

Advanced File Systems. CS 140 Feb. 25, 2015 Ali Jose Mashtizadeh

Transcription:

Announcements P4 graded: In Learn@UW by end of day P5: Available - File systems Can work on both parts with project partner Fill out form BEFORE tomorrow (WED) morning for match Watch videos; discussion section Part a : file system checker NOT in xv6 code base Read as we go along! Chapter 42 CS 537 Introduction to Operating Systems UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau Persistence: Crash Consistency Questions answered in this lecture: What benefits and complexities exist because of data redundancy? What can go wrong if disk blocks are not updated consistently? How can file system be checked and fixed after crash? How can ing be used to obtain atomic updates? How can the performance of ing be improved? 1

Data Redundancy Definition: if A and B are two pieces of data, and knowing A eliminates some or all values B could be, there is redundancy between A and B RAID examples: mirrored disk (complete redundancy) parity blocks (partial redundancy) File system examples: Superblock: field contains total blocks in FS Inodes: field contains pointer to data block Is there redundancy across these two fields? Why or why not? File System Redundancy Example Superblock: field contains total number of blocks in FS DATA = N Inode: field contains pointer to data block; possible DATA? DATA in {0, 1, 2,, N - 1} Pointers to block N or after are invalid! Total-blocks field has redundancy with inode pointers 2

Question for You Give 5 examples of redundancy in FFS (or file systems in general) Dir entries AND inode table Dir entries AND inode link count Data bitmap AND inode pointers Data bitmap AND group descriptor Inode file size AND inode/indirect pointers Pros and CONs of Redundancy Redundancy may improve: - reliability RAID-5 parity Superblocks in FFS - performance RAID-1 mirroring (reads) FFS group descriptor FFS bitmaps Redundancy hurts: - capacity - consistency Redundancy implies certain combinations of values are illegal Illegal combinations: inconsistency 3

Consistency Examples Assumptions: Superblock: field contains total blocks in FS. DATA = 1024 Inode: field contains pointer to data block. DATA in {0, 1, 2,, 1023} Scenario 1: Consistent or not? Superblock: field contains total blocks in FS. DATA = 1024 Inode: field contains pointer to data block. DATA = 241 Consistent Scenario 2: Consistent or not? Superblock: field contains total blocks in FS. DATA = 1024 node: field contains pointer to data block. DATA = 2345 Inconsistent Why is consistency challenging? File system may perform several disk writes to redundant blocks If file system is interrupted between writes, may leave data in inconsistent state What can interrupt write operations? - power loss - kernel panic - reboot System (file system and disk) can even REORDER some writes for better performance 4

append to /foo/bar data inode root foo bar root foo bitmap bitmap inode inode inode data data bar data read write read write write Question for You File system is appending to a file and must update 3 blocks: - inode - data bitmap - data block What happens if crash after only updating some blocks? a) bitmap: b) data: c) inode: d) bitmap and data: e) bitmap and inode: f) data and inode: lost block nothing bad point to garbage (what?), another file may use lost block point to garbage another file may use same data block 5

Solution #1: How can file system fix Inconsistencies? FSCK = file system checker Strategy: After crash, scan whole disk for contradictions and fix if needed Keep file system off-line until FSCK completes For example, how to tell if data bitmap block is consistent? Read every valid inode+indirect block If pointer to data block, the corresponding bit should be 1; else bit is 0 Fsck Checks Hundreds of types of checks over different fields Do superblocks match? Do directories contain. and..? Do number of dir entries equal inode link counts? Do different inodes ever point to same block? How to repair problems? Look at last two checks 6

Link Count (example 1) Dir Entry inode link_count = 1 Dir Entry How to fix to have consistent file system? Link Count (example 1) Dir Entry Dir Entry inode link_count = 2 Simple fix! 7

Link Count (example 2) inode link_count = 1 How to fix??? Link Count (example 2) Dir Entry fix! ls -l / total 150 drwxr-xr-x 401 18432 Dec 31 1969 afs/ drwxr-xr-x. 2 4096 Nov 3 09:42 bin/ drwxr-xr-x. 5 4096 Aug 1 14:21 boot/ dr-xr-xr-x. 13 4096 Nov 3 09:41 lib/ dr-xr-xr-x. 10 12288 Nov 3 09:41 lib64/ drwx------. 2 16384 Aug 1 10:57 lost+found/... inode link_count = 1 8

Data Bitmap inode link_count = 1 block (number 123) data bitmap 0011001100 How to fix? for block 123 Data Bitmap inode link_count = 1 block (number 123) data bitmap 0011001101 Simple fix! for block 123 9

Duplicate Pointers inode link_count = 1 block (number 123) inode link_count = 1 How to fix???? Duplicate Pointers inode link_count = 1 block (number 123) copy inode link_count = 1 block (number 789) 10

Duplicate Pointers inode link_count = 1 block (number 123) inode link_count = 1 Simple fix! block (number 789) But is this correct? Bad Pointer inode link_count = 1 9999 super block tot-blocks=8000 How to fix??? 11

Bad Pointer inode link_count = 1 Remove pointer to non-existent data Simple fix! (But is this correct?) super block tot-blocks=8000 Problems with fsck Problem 1: Not always obvious how to fix file system image Don t know correct state, just consistent one Easy way to get consistency: reformat disk! 12

Problem 2: fsck is very slow Checking a 600GB disk takes ~70 minutes ffsck: The Fast File System Checker Ao Ma, EMC Corporation and University of Wisconsin Madison; Chris Dragga, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin Madison Consistency Solution #2: Journaling Goals Do some recovery work after crash, but don t read entire disk Don t move fs to just any consistent state, get correct state Strategy: Atomicity Definition of atomicity for concurrency operations in critical sections are not interrupted by operations on related critical sections Definition of atomicity for persistence collections of writes are not interrupted by crashes; either (all new) or (all old) data is visible 13

Consistency vs Correctness Say a set of writes moves the disk from state A to B empty all states consistent states A B fsck just gives consistency Atomicity gives A or B. Journaling General Strategy Never delete ANY old data, until ALL new data is safely on disk Ironically, add redundancy to fix the problem caused by redundancy 14

Fight Redundancy with Redundancy Want to replace X with Y. Original: X DISK redundant f(x) Fight Redundancy with Redundancy Want to replace X with Y. Original: X DISK f(x) Good time to crash? good time to crash 15

Fight Redundancy with Redundancy Want to replace X with Y. Original: DISK Good time to crash? Y f(x) bad time to crash Fight Redundancy with Redundancy Want to replace X with Y. Original: Y DISK f(y) Good time to crash? good time to crash 16

Fight Redundancy with Redundancy Want to replace X with Y. With : X DISK f(x) Good time to crash? good time to crash Fight Redundancy with Redundancy Want to replace X with Y. With : DISK X Y f(x) good time to crash 17

Fight Redundancy with Redundancy Want to replace X with Y. With : DISK X Y f(x) f(y) good time to crash Fight Redundancy with Redundancy Want to replace X with Y. With : DISK Y Y f(x) f(y) good time to crash 18

Fight Redundancy with Redundancy Want to replace X with Y. With : DISK Y Y f(y) f(y) good time to crash Fight Redundancy with Redundancy Want to replace X with Y. With : DISK Y f(y) f(y) good time to crash 19

Fight Redundancy with Redundancy Want to replace X with Y. With : DISK Y f(y) good time to crash Fight Redundancy with Redundancy Want to replace X with Y. With : DISK Y f(y) With ing, it s always a good time to crash! Challenge after crash: what version to read? 20

Break What was the best thing you did over Thanksgiving Break? What was the best thing about this past semester for you? Question for You Develop algorithm to atomically update two blocks: Write 10 to block 0; write 5 to block 1 Assume these are the only blocks in file system Time Block 0 Block 1 extra extra extra 1 12 3 0 0 0 2 12 5 0 0 0 3 10 5 0 0 0 don t crash here! Usage Scenario: Block 0 stores Alice s bank account; Block 1 stores Bob s bank account; transfer $2 from Alice to Bob Wrong algorithm leads to inconsistent states (non-atomic updates) 21

Initial Solution: Journal New Data Time Block 0 Block 1 0 1 valid 1 12 3 0 0 0 2 12 3 10 0 0 3 12 3 10 5 0 4 12 3 10 5 1 5 10 3 10 5 1 6 10 5 10 5 1 7 10 5 10 5 0 Crash here? à Old data Crash here? ànew data Note: Understand behavior if crash after each write void update_accounts(int cash1, int cash2) { write(cash1 to block 2) // Alice backup write(cash2 to block 3) // Bob backup write(1 to block 4) // backup is safe write(cash1 to block 0) // Alice write(cash2 to block 1) // Bob write(0 to block 4) // discard backup } void recovery() { if(read(block 4) == 1) { write(read(block 2) to block 0) // restore Alice write(read(block 3) to block 1) // restore Bob write(0 to block 4) // discard backup } } 22

Terminology Extra blocks are called a Writes to are transaction Last valid bit written is commit block Problem with Initial APPROACH: Journal Size 0 N-1 N 2N-1 2N Disadvantages? - slightly < half of disk space is usable - transactions copy all the data (1/2 bandwidth!) 23

Fix #1: Small Journals Still need to first write all new data elsewhere before overwriting new data Goal: Reuse small area as backup for any block How? Store block numbers in a transaction header New Layout 5,2 A B 0 transaction: write A to block 5; write B to block 2 24

New Layout B A 5,2 A B 1 transaction: write A to block 5; write B to block 2 Checkpoint: Writing new data to in-place locations New Layout B A 5,2 A B 0 25

New Layout B A 5,2 A B 0 transaction: write C to block 4; write T to block 6 New Layout B A 4,6 A B 0 transaction: write C to block 4; write T to block 6 26

New Layout B A 4,6 C B 0 transaction: write C to block 4; write T to block 6 New Layout B A 4,6 C T 0 transaction: write C to block 4; write T to block 6 27

New Layout B C A T 4,6 C T 1 transaction: write C to block 4; write T to block 6 Checkpoint: Writing new data to in-place locations New Layout B C A T 4,6 C T 0 transaction: write C to block 4; write T to block 6 28

Optimizations 1. Reuse small area for 2. Barriers 3. Checksums 4. Circular 5. Logical Correctness depends on Ordering B C A T 4,6 C T 0 transaction: write C to block 4; write T to block 6 write order: 9, 10, 11, 12, 4, 6, 12 Enforcing total ordering is inefficient. Why? Random writes Instead: Use barriers w/ disk cache flush at key points (when??) 29

Ordering B C A T 4,6 C T 0 transaction: write C to block 4; write T to block 6 write order: 9,10,11 12 4,6 12 Use barriers at key points in time: 1) Before commit, ensure transaction entries complete 2) Before checkpoint, ensure commit complete 3) Before free, ensure in-place updates complete Optimizations 1. Reuse small area for 2. Barriers 3. Checksums 4. Circular 5. Logical 30

Checksum Optimization B C A T 4,6 C T 0 write order: 9,10,11 12 4,6 12 How can we get rid of barrier between (9, 10, 11) and 12??? Checksum Optimization B C A T 4,6 C T (ck) write order: 9,10,11,12 4,6 12 In last transaction block, store checksum of rest of transaction 12 = Cksum(9, 10, 11) During recovery: If checksum does not match transaction, treat as not valid 31

Optimizations 1. Reuse small area for 2. Barriers 3. Checksums 4. Circular 5. Logical Write Buffering Optimization Note: after write, there is no rush to checkpoint If system crashes, still have persistent copy of written data! Journaling is sequential (fast!), checkpointing is random (slow!) Solution? Delay checkpointing for some time Difficulty: need to reuse space Solution: keep many transactions for un-checkpointed data 32

Circular Buffer Journal: T1 T2 T3 T4 0 128 MB Keep data also in memory until checkpointed on disk Circular Buffer Journal: T2 T3 T4 0 128 MB checkpoint and cleanup 33

Circular Buffer Journal: T5 T2 T3 T4 0 128 MB transaction! Circular Buffer Journal: T5 T3 T4 0 128 MB checkpoint and cleanup 34

Optimizations 1. Reuse small area for 2. Barriers 3. Checksums 4. Circular 5. Logical Physical Journal TxB length=3 blks=4,6,1 0000000000 0000000000 0000000000 0000100000 inode addr[?]=521 data block TxE (checksum) 35

Physical Journal TxB length=3 blks=4,6,1 0000000000 0000000000 0000000000 0000100000 inode addr[?]=521 data block TxE (checksum) Actual changed data is much smaller! Logical Journal TxB length=1 list of changes TxE (checksum) Logical s record changes to bytes, not contents of new blocks On recovery: Need to read existing contents of in-place data and (re-)apply changes 36

Optimizations 1. Reuse small area for 2. Barriers 3. Checksums 4. Circular 5. Logical File System Integration FS Journal Scheduler Disk 37

Review: Ordering B C A T 4,6 C T 0 transaction: write C to block 4; write T to block 6 write order: 9,10,11 12 4,6 12 Use I/O barriers (wait for disk to complete write) at key points: 1) Before commit, ensure transaction entries complete 2) Before checkpoint, ensure commit completes 3) Before free, ensure in-place updates complete Review: Circular Journal Journal: T1 T2 T3 T4 0 128 MB Keep data also in memory until checkpointed on disk 38

How to avoid writing all disk blocks Twice? Observation: Some disk blocks (e.g., user data) are less important Strategy: Journal only metadata, including: superblock, bitmaps, inodes, indirect blocks, directories Guarantees metadata is consistent even if crash occurs Won t leak blocks or re-allocate blocks to multiple files Regular data, write it to in-place locations whenever convenient Files may contain garbage if crash and recover Approach 1: Writeback Journal B I? 0 transaction: append data to file with inode I 39

Writeback Journal B I? TxB B I 0 transaction: append data to file with inode I Writeback Journal B I? TxB B I TxE transaction: append data to file with inode I 40

Writeback Journal B I? TxB B I TxE transaction: append data to file with inode I Update in-place copy of I Writeback Journal B I? TxB B I TxE transaction: append data to file with inode I Update in-place copy of B what if we crash now? Solutions? 41

Approach 2: Ordered Journaling Still only metadata But write data before the transaction No leaks of sensitive data! Ordered Journal B I? 0 transaction: append to inode I 42

Ordered Journal B I D 0 transaction: append to inode I What happens if crash now? B indicates D currently free, I does not point to D; Lose D, but that might be acceptable Ordered Journal B I D TxB I B 0 transaction: append to inode I 43

Ordered Journal B I D TxB I B TxE transaction: append to inode I Ordered Journal B I D TxB I B TxE transaction: append to inode I 44

Ordered Journal B I D TxB I B TxE transaction: append to inode I Conclusion Most modern file systems use s ordered-mode for meta-data is very popular (default mode) FSCK is still useful for weird cases - bit flips - FS bugs Some file systems don t use s, but still (usually) write new data before deleting old (copy-on-write file systems) 45