Certifying a file system: Correctness in the presence of crashes

1 / 28 Certifying a file system: Correctness in the presence of crashes Tej Chajed, Haogang Chen, Stephanie Wang, Daniel Ziegler, Adam Chlipala, Frans Kaashoek, and Nickolai Zeldovich MIT CSAIL

2 / 28 File systems are complex and have bugs
File systems are complex (e.g., Linux ext4 is 60,000 lines of code) and have many bugs.
[Figure: cumulative number of patches for file-system bugs in Linux (ext3, ext4, xfs, reiserfs, jfs, btrfs), Jan 2004 through Jan 2011; data from Lu et al., FAST '13]
New file systems (and bugs) are introduced over time.
Some bugs are serious: security exploits, data loss, etc.

3 / 28 Much research in avoiding bugs in file systems
Most research is on finding bugs (reduces the number of bugs):
- Crash injection (e.g., EXPLODE [OSDI '06])
- Symbolic execution (e.g., EXE [Oakland '06])
- Design modeling (e.g., in Alloy [ABZ '08])
Some elimination of bugs by proving (but incomplete, and no crashes):
- FS without directories [Arkoudas et al. 2004]
- BilbyFS [Keller et al. 2014]
- Flashix [Ernst et al. 2015]

4 / 28 File system must preserve data after crash
Crashes occur due to power failures, hardware failures, or software bugs.
Difficult because crashes expose many different partially-updated states.

commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -504,7 +503,25 @@ int cleanup_journal_tail(journal_t *journal)
 		spin_unlock(&journal->j_state_lock);
 		return 1;
 	}
+	spin_unlock(&journal->j_state_lock);
+
+	/*
+	 * We need to make sure that any blocks that were recently written out
+	 * --- perhaps by log_do_checkpoint() --- are flushed out before we
+	 * drop the transactions from the journal. It's unlikely this will be
+	 * necessary, especially with an appropriately sized journal, but we
+	 * need this to guarantee correctness. Fortunately
+	 * cleanup_journal_tail() doesn't get called all that often.
+	 */
+	if (journal->j_flags & JFS_BARRIER)
+		blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
+	spin_lock(&journal->j_state_lock);
+	if (!tid_gt(first_tid, journal->j_tail_sequence)) {
+		spin_unlock(&journal->j_state_lock);
+		/* Someone else cleaned up journal so return 0 */
+		return 0;
+	}
 	/* OK, update the superblock to recover the freed space.
 	 * Physical blocks come first: have we wrapped beyond the end of
 	 * the log? */

5 / 28 File system must preserve security after crash
Mistakes in crash handling can also lead to data disclosure.
Two optimizations in Linux ext4: direct data write and log checksum.
Subtle interaction: a new file can contain other users' data after a crash.
Bug introduced in 2008, fixed in 2014 (six years later!).

Author: Jan Kara <jack@suse.cz>
Date: Tue Nov 25 20:19:17 2014 -0500
ext4: forbid journal_async_commit in data=ordered mode
"Option journal_async_commit breaks guarantees of data=ordered mode as it sends only a single cache flush after writing a transaction commit block. Thus even though the transaction including the commit block is fully stored on persistent storage, file data may still linger in drive's caches and will be lost on power failure. Since all checksums match on journal recovery, we replay the transaction thus possibly exposing stale user data. [...]"

6 / 28 Goal: certify a complete file system under crashes
A file system with a machine-checkable proof that its implementation meets its specification, under normal execution and under any sequence of crashes, including crashes during recovery.

7 / 28 Contributions
CHL: Crash Hoare Logic for persistent storage
- Crash conditions and recovery semantics
- CHL automates parts of the proof effort
- Proofs mechanically checked by Coq
FSCQ: the first certified crash-safe file system
- Basic Unix-like file system (not parallel)
- Simple specification for a subset of POSIX (e.g., no hard links or symbolic links)
- About 1.5 years of work, including learning Coq

8 / 28 FSCQ runs standard Unix programs: mv, git, make, ...
TCB includes Coq's extractor, Haskell compiler and runtime, ...

9 / 28 How to specify what is correct?
Need a specification of correct behavior before we can prove anything.
Look it up in the POSIX standard?
"[...] a power failure [...] can cause data to be lost. The data may be associated with a file that is still open, with one that has been closed, with a directory, or with any other internal system data structures associated with permanent storage. This data can be lost, in whole or part, so that only careful inspection of file contents could determine that an update did not occur." (IEEE Std 1003.1, 2013 Edition)
POSIX is vague about crash behavior:
- POSIX's goal was to specify common-denominator behavior
- File system implementations have different interpretations
- Leads to bugs in higher-level applications [Pillai et al., OSDI '14]

10 / 28 A starting point: correct is transactional
Run every file-system call inside a transaction:

    def create(d, name):
        log_begin()
        newfile = allocate_inode()
        newfile.init()
        d.add(name, newfile)
        log_commit()

log_begin and log_commit implement a write-ahead log on disk.
After a crash, replay any committed transaction in the write-ahead log.
Q: How to formally specify both normal-case and crash behavior?
Q: How to specify that it's safe to crash during recovery itself?
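To make the transactional idea concrete, here is a minimal write-ahead-log sketch in the same pseudocode style. It is not FSCQ's code: the disk is modeled as a dictionary from block addresses to values, a header block plus a record region hold the log, and writing a single block is assumed to be atomic. A real log would also need disk_sync calls between the steps; those are omitted here.

    class Log:
        def __init__(self, disk, log_start, log_len):
            self.disk = disk            # block array: disk[addr] = value
            self.log_start = log_start  # header block; record blocks follow it
            self.log_len = log_len      # capacity of the log region
            self.pending = []           # in-memory (addr, value) pairs of the active txn

        def log_begin(self):
            self.pending = []

        def log_write(self, addr, value):
            self.pending.append((addr, value))

        def log_commit(self):
            assert len(self.pending) <= self.log_len
            # 1. Write the transaction's records into the log region.
            for i, (addr, value) in enumerate(self.pending):
                self.disk[self.log_start + 1 + i] = (addr, value)
            # 2. One atomic header write commits the transaction.
            self.disk[self.log_start] = ('committed', len(self.pending))
            # 3. Apply the records to their home locations, then truncate the log.
            for addr, value in self.pending:
                self.disk[addr] = value
            self.disk[self.log_start] = ('empty', 0)
            self.pending = []

In this sketch, create as shown on the slide would become log_begin(), the individual updates routed through log_write, and a final log_commit(); after a crash, recovery replays any transaction whose header says 'committed' (a matching recovery sketch appears after the log-recovery slide below).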

11 / 28 Approach: Hoare Logic specifications
{pre} code {post}

SPEC    disk_write(a, v)
PRE     a |-> v0
POST    a |-> v

12 / 28 CHL extends Hoare Logic with crash conditions
{pre} code {post} {crash}

SPEC    disk_write(a, v)
PRE     a |-> v0
POST    a |-> v
CRASH   a |-> v0 \/ a |-> v

CHL's disk model matches what most other file systems assume:
- writing a single block is an atomic operation
- no data corruption
Specifications for disk_write, disk_read, and disk_sync represent our disk model.
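As a toy illustration of this disk model (not CHL itself), the following enumerates the states a crash can expose during a sequence of block writes, assuming writes are applied in program order; with a volatile write cache and explicit disk_sync the set of crash states would be larger. It shows why a crash condition must cover every partially-updated state.

    def crash_states(disk, writes):
        # disk: dict addr -> value; writes: list of (addr, value) in program order.
        # Each single-block write is atomic, so a crash leaves some prefix applied.
        states = []
        current = dict(disk)
        states.append(dict(current))        # crash before any write
        for addr, value in writes:
            current[addr] = value           # this block write lands atomically
            states.append(dict(current))    # crash right after this write
        return states

    # Two writes from one operation can leave the disk in any of three states:
    # {}, {1: 'A'}, or {1: 'A', 2: 'B'}.
    print(crash_states({}, [(1, 'A'), (2, 'B')]))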

13 / 28 Certifying larger procedures

    def bmap(inode, bnum):
        if bnum >= NDIRECT:
            indirect = log_read(inode.blocks[NDIRECT])
            return indirect[bnum - NDIRECT]
        else:
            return inode.blocks[bnum]

[Diagram: the body of bmap (an if with a call to log_read and two returns), annotated with its pre, post, and crash conditions]
- Need pre/post/crash conditions for each called procedure
- CHL's proof automation chains pre- and postconditions
- CHL's proof automation combines crash conditions
- Remaining proof effort: changing representation invariants

14 / 28 Common pattern: representation invariant

SPEC    log_write(a, v)
PRE     disk: log_rep(activetxn, start_state, old_state)
        old_state: a |-> v0
POST    disk: log_rep(activetxn, start_state, new_state)
        new_state: a |-> v
CRASH   disk: log_rep(activetxn, start_state, any)

log_rep is a representation invariant:
- Connects logical transaction state to an on-disk representation
- Describes the log's on-disk layout using many primitives
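A representation invariant can be read as a predicate over the raw disk. The following toy predicate is not FSCQ's log_rep; it uses the illustrative header-plus-records layout from the earlier write-ahead-log sketch, and holds exactly when the log region encodes a given logical state.

    def log_rep(disk, log_start, status, records):
        # True iff the log region of `disk` represents the logical state (status, records).
        hdr_status, hdr_len = disk[log_start]
        if hdr_status != status or hdr_len != len(records):
            return False
        # Every logical record must sit at its expected slot in the log region.
        return all(disk[log_start + 1 + i] == rec for i, rec in enumerate(records))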

15 / 28 Specifying an entire system call (simplified)

SPEC    create(dnum, fn)
PRE     disk: log_rep(notxn, start_state)
        start_state: dir_rep(tree)
        exists path, tree[path].inode = dnum /\ fn not in tree[path]
POST    disk: log_rep(notxn, new_state)
        new_state: dir_rep(new_tree)
        new_tree = tree.update(path, fn, empty_file)
CRASH   disk: log_rep(notxn, start_state) \/ log_rep(notxn, new_state)
        \/ (exists s, log_rep(activetxn, start_state, s))
        \/ log_rep(committedtxn, start_state, new_state) \/ ...

16 / 28 Specifying log recovery

SPEC    log_recover()
PRE     disk: log_intact(last_state, committed_state)
POST    disk: log_rep(notxn, last_state) \/ log_rep(notxn, committed_state)
CRASH   disk: log_intact(last_state, committed_state)

log_recover is idempotent:
- Crash condition implies precondition
- OK to run log_recover again after a crash
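A minimal, illustrative log_recover in the style of the earlier write-ahead-log sketch (again, not FSCQ's implementation). It is idempotent in exactly the sense the slide describes: if a crash interrupts it, the disk still satisfies its precondition, so running it again from the start is safe, because re-applying committed records and rewriting the header are harmless to repeat.

    def log_recover(disk, log_start):
        status, n = disk[log_start]
        if status == 'committed':
            # Re-applying these writes is harmless if some of them already reached disk.
            for i in range(n):
                addr, value = disk[log_start + 1 + i]
                disk[addr] = value
        # Truncate the log; this single atomic block write finishes recovery.
        disk[log_start] = ('empty', 0)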

17 / 28 CHL's recovery semantics
create is atomic, if log_recover runs after every crash:

SPEC        create(dnum, fn)
ON CRASH    log_recover()
PRE         disk: log_rep(notxn, start_state)
            start_state: dir_rep(tree)
            exists path, tree[path].inode = dnum /\ fn not in tree[path]
POST        disk: log_rep(notxn, new_state)
            new_state: dir_rep(new_tree)
            new_tree = tree.update(path, fn, empty_file)
RECOVER     disk: log_rep(notxn, start_state) \/ log_rep(notxn, new_state)
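The recovery semantics can be pictured with a small harness; the harness and the CrashError exception are assumptions for this sketch, not CHL's formal semantics. Run the operation, and after any crash, including a crash during recovery itself, rerun log_recover until it completes.

    class CrashError(Exception):
        pass

    def run_with_recovery(op, disk, log_start):
        try:
            return op()
        except CrashError:
            # After any crash, including one during recovery, rerun recovery.
            while True:
                try:
                    log_recover(disk, log_start)
                    return None   # disk now satisfies RECOVER: old state or new state
                except CrashError:
                    continue      # safe to retry because log_recover is idempotent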

18 / 28 CHL summary
Key ideas: crash conditions and recovery semantics.
CHL benefit: enables precise failure specifications.
- Allows for automatic chaining of pre/post/crash conditions
- Reduces proof burden
CHL cost: must write a crash condition for every function, loop, etc.
- Crash conditions are often simple (above the logging layer)

19 / 28 FSCQ: building a file system on top of CHL
File system design is close to v6 Unix, plus logging, minus symbolic links.
Implementation aims to reduce proof effort.

20 / 28 Evaluation
- What bugs do FSCQ's theorems eliminate?
- How much development effort is required for FSCQ?
- How well does FSCQ perform?

21 / 28 FSCQ's theorems eliminate many bugs
One data point: once the theorems were proven, no implementation bugs.
Did find some mistakes in the spec, as a result of end-to-end checks.
E.g., forgot to specify that extending a file should zero-fill.
Common classes of bugs found in Linux file systems:

  Bug class                                 Eliminated in FSCQ?
  Violating file or directory invariants    Yes
  Improper handling of corner cases         Yes
  Returning incorrect error codes           Some
  Resource-allocation bugs                  Some
  Mistakes in logging and recovery logic    Yes
  Misusing the logging API                  Yes
  Bugs due to concurrent execution          No concurrency
  Low-level programming errors              Yes

22 / 28 Implementing CHL and FSCQ in Coq
Total of 30,000 lines of verified code, specs, and proofs.
Comparison: the xv6 file system is 3,000 lines of code.

23 / 28 Change effort proportional to scope of change
- Reordering disk writes: 1,000 lines in FSCQLOG
- Indirect blocks: 1,500 lines in the inode layer
- Buffer cache: 300 lines in FSCQLOG, 600 lines in the rest of FSCQ
- Optimize log layout: 150 lines in FSCQLOG
Modest incremental effort, partially due to CHL's proof automation and FSCQ's internal layers.

24 / 28 Performance comparison
File-system-intensive workload:
- Software development: git, make
- LFS benchmark
- mailbench: qmail-like mail server
Compare with other (non-certified) file systems:
- xv6 (similar design, written in C)
- ext4 (widely used on Linux), in non-default synchronous mode to match FSCQ's guarantees
Running on an SSD on a laptop.

25 / 28 Running time for benchmark workload
[Bar chart: running time in seconds (0-80) for the benchmark workload on fscq, xv6, ext4, and ext4-async]
- FSCQ is slower than xv6 due to the overhead of extracted Haskell
- FSCQ is slower than ext4 due to its simple write-ahead logging design
- Opportunity: changing semantics to defer durability (ext4's default mode, ext4-async) allows for a big improvement

26 / 28 Directions for future research
- Certifying deferred durability (fsync)
  Preliminary results: much of the effort is in coming up with a good specification;
  FSCQ on par with ext4-async in terms of disk I/O (took 1 year of extra work)
- Certifying a parallel (multi-core) file system
- Certifying applications with CHL (mail server, key-value store, ...)
- Reducing TCB size and generating efficient executable code

27 / 28 Eventual goal: provable end-to-end app security
Formal verification has turned a corner; it may soon be practical for engineering secure systems.
Can we prove end-to-end security for a complete application?
E.g., messages typed by Alice should appear only on Bob's phone.
Many challenges at the intersection of systems and verification:
- Hardware support for isolating unverified code
- Composing different types of specifications and proofs
- Verification of software updates
Big opportunity to address fundamental security problems.

28 / 28 Conclusions
CHL helps specify and prove crash safety:
- Crash conditions
- Recovery execution semantics
FSCQ: the first certified crash-safe file system
- Usable performance
- 1.5 years of effort, including learning Coq and building CHL
Many open problems and potential for fundamental contributions.
https://github.com/mit-pdos/fscq-impl