Junfeng Yang. EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Stanford University

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors Junfeng Yang Stanford University Joint work with Can Sar, Paul Twohey, Ben Pfaff, Dawson Engler and Madan Musuvathi

Mom, Google Ate My GMail! 2

Why check storage systems? Storage system errors are some of the most serious: machine crashes, data loss, data corruption. The code is complicated and hard to get right, with conflicting goals: speed and reliability (recover from any failures and crashes). The typical ways to find these errors are ineffective: manual inspection is strenuous and erratic; randomized testing (e.g. unplugging the power cord) is blindly throwing darts; otherwise you wait for error reports from mad users. 4

Goal: build tools to automatically find storage system errors. Sub-goals: comprehensive, lightweight, general. 5

EXPLODE [OSDI06]. Comprehensive: adapts ideas from model checking. General and real: checks live systems; if it can run (on Linux, BSD), it can be checked, even without source code. Fast and easy: checking a new storage system takes about 200 lines of C++ code; porting to a new OS takes one kernel module plus optional kernel modifications. Effective: checked 17 storage systems (10 Linux file systems, Linux NFS, software RAID, 3 version control systems, Berkeley DB, VMware) and found serious data loss in all of them. Subsumes FiSC [OSDI04, best paper]. 6

Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 7

A long-lived bug in the IBM Journaling File System (JFS), fixed in 2 days. Serious: loss of an entire file system! Fixed in 2 days once we supplied our complete trace. Hard to find: 3 years old, present ever since the first version. Dave Kleikamp (IBM JFS): "I really appreciate your work finding and recreating this bug. I'm sure this has bitten us before, but it's usually hard to go back and find out what causes the file system to get messed up so bad." 8

Events to trigger the JFS bug: creat("/f") (the 5-character system call name is not a typo); flush "/f"; crash!; then fsck.jfs, the file system recovery utility run after reboot. (Diagram: buffer cache in memory vs. disk; after recovery the orphan file f is removed, which is legal behavior for file systems.) 9

Events to trigger the JFS bug: creat("/f"); under low memory (a design flaw), JFS flushes "/" instead; crash!; then fsck.jfs, the recovery utility run after reboot. (Diagram: buffer cache vs. disk; on disk, "/" now contains a dangling pointer to f; fsck.jfs fixes it by zeroing, and the entire FS is gone!) 10

Overview. (Diagram: the user-written checker sits on top of the EXPLODE runtime; below them are JFS, EKM, the Linux kernel and the hardware. User code vs. our code; EKM = EXPLODE kernel module. For each possible crash point, EXPLODE produces a crash disk containing f, runs fsck.jfs on it, and calls check() on the recovered /root.)

Toy checker: a crash right after creat("/f") should not lose any old file that is already persistent on disk. User-written checkers can be either very sophisticated or very simple.

    void mutate() {
        creat("/old");
        sync();
        creat("/f");
        check_crash_now();
    }

    void check() {
        int fd = open("/old", O_RDONLY);
        if (fd < 0)
            error("lost old!");
        close(fd);
    }

11

Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 12

One core idea from model checking: explore all choices. Bugs are often triggered by corner cases; how do we find them? Drive execution down to these tricky corner cases. Principle: when execution reaches a point in the program that can do one of N different actions, fork execution; in the first child do the first action, in the second do the second, and so on. Result: rare events appear as often as common ones. 13

Crashes (overview slide revisited). (Diagram: for each possible crash, EXPLODE takes the crash disk containing f, runs fsck.jfs on it, and calls check() on the recovered /root; the user-written checker and EXPLODE runtime sit above JFS, EKM, the Linux kernel and hardware.) 14

External choices: fork and do every possible operation (creat, mkdir, unlink, rmdir, ...) on the current file system, and explore the generated states as well. Users write code to check that the FS is valid; EXPLODE amplifies it across all explored states. (Diagram: /root containing a, expanded by each operation.) 15

Internal choices: fork and explore all internal choices (cache hit vs. miss, disk read succeeds vs. fails).

    struct block* read_block(int i) {
        struct block *b;
        if ((b = cache_lookup(i)))
            return b;
        return disk_read(i);
    }

16

Users expose choices using choose(n). To explore an N-choice point, users instrument the code with choose(n) (also used in other model checkers). choose(n) is a conceptual N-way fork that returns K in the Kth child.

    struct block* read_block(int i) {
        struct block *b;
        if ((b = cache_lookup(i)))
            return b;
        return disk_read(i);
    }

    cache_lookup(int i) {
        if (choose(2) == 0)
            return NULL;
        // ... normal lookup ...
    }

This instrumentation is optional; we instrumented only 7 places in Linux. 17
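
A minimal illustration of the N-way fork semantics, written as a hypothetical user-space choose() built on fork(). This is only a sketch of the idea; EXPLODE's real choose() lives in its runtime and kernel module and explores choices by replaying recorded choice sequences rather than literally forking the kernel.

    // Hypothetical sketch: choose(n) as an n-way fork. Each child gets a
    // distinct return value in 0..n-1, so one run of the caller explores
    // every branch.
    #include <unistd.h>
    #include <sys/wait.h>
    #include <cstdio>

    static int choose(int n) {
        for (int k = 0; k < n - 1; k++) {
            pid_t pid = fork();
            if (pid == 0)          // k-th child explores branch k
                return k;
            waitpid(pid, NULL, 0); // parent waits, then forks the next branch
        }
        return n - 1;              // the parent itself explores the last branch
    }

    int main() {
        // Example: a 2-way internal choice, like the cache_lookup()
        // instrumentation above (0 = simulate a cache miss).
        if (choose(2) == 0)
            printf("branch: simulated cache miss\n");
        else
            printf("branch: normal cache lookup\n");
        return 0;
    }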

Crash choices x external choices x internal choices. (Diagram: the explored state space combines all three kinds of choices.) 18

Speed: skip same states. (Diagram: from /root containing a, doing creat("/b") then mkdir("/c") reaches the same state as doing mkdir("/c") then creat("/b").) Abstract and hash each state; discard it if it has been seen before. 19
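
A toy sketch of the abstract-and-hash idea, assuming the file-system state can be abstracted to the set of paths it contains; EXPLODE's real abstraction is checker-specific and richer (sizes, link counts, contents).

    // Hypothetical sketch: discard states whose abstraction we have
    // already seen. The abstraction here is just the sorted set of paths.
    #include <set>
    #include <string>
    #include <vector>
    #include <algorithm>
    #include <functional>

    using State = std::vector<std::string>;   // paths in the test FS

    static size_t abstract_hash(State s) {
        std::sort(s.begin(), s.end());        // order-insensitive
        std::string flat;
        for (const auto &p : s) { flat += p; flat += '\0'; }
        return std::hash<std::string>{}(flat);
    }

    static std::set<size_t> seen;

    // Returns true if the state is new and should be explored further.
    static bool is_new_state(const State &s) {
        return seen.insert(abstract_hash(s)).second;
    }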

Outline: Overview; Checking process; Implementation (FiSC, the file system checker [OSDI04, best paper], and EXPLODE, the storage system checker [OSDI06]); Example check: crashes during recovery are recoverable; Results. 20

Checking process:

    S0 = checkpoint()
    enqueue(S0)
    while (queue not empty) {
        S = dequeue()
        restore(S)
        for each action in S {
            do action
            S' = checkpoint()
            if (S' is new)
                enqueue(S')
        }
    }

How do we checkpoint and restore a live OS? 21

FiSC: jam the OS into the tool. (Diagram: the user-written checker and JFS run inside User Mode Linux on top of FiSC, with the Linux kernel and hardware underneath; user code vs. our code.) Pros: comprehensive and effective; no model, it checks the code itself; checkpoint and restore are easy. Cons: intrusive; we had to build a fake environment; hard to check anything new (months for a new OS, about a week for a new FS); so many tricks and so complicated that we won best paper at OSDI 04. 22

EXPLODE: jam the tool into the OS. (Diagram: instead of running JFS and the checker inside User Mode Linux on FiSC, the user-written checker sits on the EXPLODE runtime, and JFS plus EKM run inside the real Linux kernel on real hardware. User code vs. our code; EKM = EXPLODE kernel module.) 23

EKM lines of code:

    OS             Lines of code
    Linux 2.6      1,915
    FreeBSD 6.0    1,210

EXPLODE kernel modules (EKM) are small and easy to write. 24

How to checkpoint and restore a live OS kernel? It is hard to checkpoint live kernel memory. Use a virtual machine? No: VMware has no source, Xen is not portable, and both are heavyweight. There is a better solution for storage systems. (Diagram: checker and EXPLODE runtime above JFS, EKM, the Linux kernel and hardware.) 25

Checkpoint: save actions instead of bits. A state is a list of actions: checkpoint(S) = save the actions (creat, cache miss), and restore = re-initialize, then redo creat and the cache miss. Re-initialize = unmount (the utility that clears the in-memory state of a storage system) + mkfs (the utility that creates an empty FS). The redo-to-restore-state idea was used in model checking to trade time for space (stateless search); we use it only to reduce intrusiveness. 26
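
A toy sketch of the redo-to-restore idea, with a trivial in-memory stand-in for the checked storage system; in a real checker, reinitialize() would be unmount + mkfs and run_action() would redo real operations.

    // Toy sketch: a "state" is the list of actions that produced it, so
    // checkpoint() just copies the action trace and restore() re-initializes
    // the system and redoes the trace. The std::set below stands in for the
    // real storage system.
    #include <set>
    #include <string>
    #include <vector>

    static std::set<std::string> fake_fs;             // stand-in for the checked system
    static std::vector<std::string> trace;            // actions run so far

    static void reinitialize() { fake_fs.clear(); }   // "unmount + mkfs"
    static void run_action(const std::string &a) { fake_fs.insert(a); }  // redo one op

    void do_action(const std::string &a) {            // called as the checker runs
        run_action(a);
        trace.push_back(a);
    }

    std::vector<std::string> checkpoint() { return trace; }   // save actions, not bits

    void restore(const std::vector<std::string> &cp) {        // re-init, then redo
        reinitialize();
        trace = cp;
        for (const std::string &a : cp)
            run_action(a);
    }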

Deterministic replay. A storage system is a mostly isolated subsystem, but several sources of non-determinism remain. Non-deterministic kernel scheduling decisions: opportunistic fix using priorities. Non-deterministic interrupts: fix by using RAM disks, so the checked system sees no interrupts. Non-deterministic kernel choose() calls made by other code: fix by filtering on thread IDs; no choose() in interrupt context. This worked well in practice: replay is mostly deterministic, and in the worst case we auto-detect and ignore non-repeatable errors. 27

Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 28

Why check crashes during recovery? Crashes are highly correlated: they are often caused by kernel bugs or hardware errors, and after a reboot you hit the same bug or error again. What to check? Running fsck once should be equivalent to running fsck, crashing, and re-running fsck. If fsck(crash-disk) run to completion recovers /a, but fsck(crash-disk) interrupted by a crash and then re-run loses /a, that is a bug! This is a powerful heuristic and found interesting bugs (wait for the results). 29

How to check crashes during recovery? (Diagram: the EXPLODE runtime runs fsck.jfs on the crash-disk, crashes it at various points to produce crash-crash-disks, runs fsck.jfs on each of those, and asks whether the result is the same as running fsck.jfs to completion.) Problem: with N blocks there are 2^N crash-crash-disks, which is too many! But many crash-crash-disks can be pruned. 30

Simplified example: a 3-block disk (B1, B2, B3), each block either 0 or 1; crash-disk = 000 (B1 to B3). The trace of fsck(000):

    Read(B1) = 0
    Write(B2, 1)    buffer cache: B2=1
    Write(B3, 1)    buffer cache: B2=1, B3=1
    Read(B3) = 1
    Write(B1, 1)    buffer cache: B2=1, B3=1, B1=1

So fsck(000) = 111. 31

Naïve strategy: 7 crash-crash-disks. crash-disk = 000, fsck(000) = 111, and after the trace (Read(B1)=0, Write(B2,1), Write(B3,1), Read(B3)=1, Write(B1,1)) the buffer cache holds B2=1, B3=1, B1=1. Every subset of these dirty blocks yields a crash-crash-disk, e.g. 000 + {B2=1} = 010, so naively we must check: fsck(010) == 111? fsck(001) == 111? fsck(011) == 111? fsck(100) == 111? fsck(110) == 111? fsck(101) == 111? fsck(111) == 111? 32

Optimization: exploiting determinism. crash-disk = 000, fsck(000) = 111, with trace Read(B1)=0, Write(B2,1), Write(B3,1), Read(B3)=1, Write(B1,1). For all practical purposes, fsck is deterministic: it reads the same blocks and writes the same blocks. Consider 000 + {B2=1} = 010: does fsck(010) == 111? fsck(000) doesn't read B2, so fsck(010) = 111. 33

What blocks does fsck(000) actually depend on? In the trace (Read(B1)=0, Write(B2,1), Write(B3,1), Read(B3)=1, Write(B1,1)), the read of B3 returns what fsck itself just wrote, so the result cannot depend on the on-disk B3. fsck(000) reads, and therefore depends on, only B1; it doesn't matter what the crash wrote to the other blocks. Hence fsck(0**) = 111. 34

Prune crash-crash-disks matching 0**. With crash-disk = 000, dirty buffer cache {B2=1, B3=1, B1=1}, and fsck(000) = 111 depending only on B1, the candidates fsck(010), fsck(001) and fsck(011) all match 0** and can be pruned; only fsck(100), fsck(110), fsck(101) and fsck(111) still need to be run. We can further optimize using this and other ideas. 35
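
A toy sketch of the pruning rule on the 3-block example above (hypothetical code, not EXPLODE's implementation): a crash-crash-disk can be skipped whenever the blocks it changes are disjoint from the set of blocks the reference fsck run actually read from disk.

    // Toy model of the 3-block example: a disk is 3 bits (string order
    // B1 B2 B3), the reference run fsck(000) read only B1 and produced
    // 111. Any crash-crash-disk that differs from 000 only in blocks
    // fsck never read must produce the same result, so it is pruned.
    #include <bitset>
    #include <string>
    #include <cstdio>

    int main() {
        const int N = 3;
        std::bitset<N> crash_disk("000");   // B1 B2 B3, all zero
        std::bitset<N> read_set("100");     // fsck(000) read only B1
        // Enumerate the 7 non-empty subsets of the dirty blocks
        // {B1, B2, B3} that a crash during recovery could have written.
        for (unsigned w = 1; w < (1u << N); w++) {
            std::bitset<N> crash_crash_disk = crash_disk ^ std::bitset<N>(w);
            bool touches_read_set =
                ((crash_crash_disk ^ crash_disk) & read_set).any();
            printf("disk %s: %s\n",
                   crash_crash_disk.to_string().c_str(),
                   touches_read_set ? "must re-run fsck"
                                    : "pruned (same result as fsck(000))");
        }
        return 0;
    }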

Outline Overview Checking process Implementation Example check: crashes during recovery are recoverable Results 36

Bugs caused by crashes during recovery. We found data-loss bugs in all three file systems that use logging (ext3, JFS, ReiserFS), 5 in total. Strict order under normal operation: first write the operation to the log and commit, then apply the operation to the actual file system. Strict (reverse) order during recovery: first replay the log to patch the actual file system, then clear the log. If the order is not enforced: a corrupted FS and no log left to patch it! 37

Bug in fsck.ext3:

    recover_ext3_journal(...) {
        // ...
        retval = -journal_recover(journal);
        // ...
        // clear the journal
        e2fsck_journal_release(...);
        // ...
    }

    journal_recover(...) {
        // replay the journal
        // ...
        // sync modifications to disk
        fsync_no_super(...);
    }

    // Error! Empty macro, doesn't sync data!
    #define fsync_no_super(dev) do {} while (0)

The code was adapted directly from the kernel, but fsync_no_super was defined as a no-op: hard to implement. 38
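
A hedged sketch of what a fix needs to do, not the actual e2fsprogs patch: in user space the replayed journal updates can be flushed by fsync()ing the device file descriptor, instead of relying on the kernel helper fsync_no_super(), which the user-space port had reduced to an empty macro.

    // Hypothetical fix sketch (assumption, not the real patch): flush the
    // journal-replay writes by fsync()ing the descriptor through which the
    // device was written, before the journal is cleared.
    #include <unistd.h>

    static int sync_journal_replay(int dev_fd) {
        return fsync(dev_fd);   // flush replayed metadata to the device
    }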

FiSC results (can reproduce in EXPLODE). Bugs per error type across VFS, ext2, ext3, JFS, ReiserFS, and total:

    Data loss:     N/A, N/A, 1, 8, 1; total 10
    False clean:   N/A, N/A, 1, 1; total 2
    Security:      2, 2, 1; total 3 + 2
    Crashes:       1, 10, 1; total 12
    Other:         1, 1, 1; total 3
    Total:         2, 2, 5, 21, 2; total 32

32 in total, 21 fixed, 9 of the remaining 11 confirmed. 39

EXPLODE checkers: lines of code and errors found.

    Storage System Checked       Checker lines   Bugs
    10 file systems              5,477           18
    Storage applications:
      CVS                        68              1
      Subversion                 69              1
      EXPENSIVE                  124             3
      Berkeley DB                202             6
    Transparent subsystems:
      RAID                       FS + 137        2
      NFS                        FS              4
      VMware GSX/Linux           FS              1
    Total                        6,008           36

About 6 bugs per 1,000 lines of checker code. 40

Related work: FS testing, static (compile-time) analysis, software model checking. 41

Conclusion. EXPLODE is comprehensive (adapts ideas from model checking), general and real (checks live systems in situ, without source code), and fast and easy (simple C++ checking interface). Results: we checked 17 widely used, well-tested, real-world storage systems (10 Linux file systems, Linux NFS, software RAID, 3 version control systems, Berkeley DB, VMware), found serious data-loss bugs in all of them, over 70 bugs in total, and many bug reports led to immediate kernel patches. 42

Future work: apply EXPLODE to other systems (e.g. distributed systems); design and build easy-to-check systems; apply program analysis techniques to other system-building stages (e.g. debugging). 43

Other projects Automatically generating malicious disks using symbolic execution. J. Yang, C. Sar, P. Twohey, C. Cadar, and D. Engler [Oakland Security 06] MECA: an extensible, expressive system and language for statically checking security properties. J. Yang, T. Kremenek, Y. Xie, and D. Engler [ACM CCS 03] An empirical study of operating system errors. A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. [SOSP 01] Kinesis: a case for multi-hash replica placement in distributed storage systems. J. MacCormick, N. Murphy, V. Ramasubramanian, U. Wieder, J. Yang, and L. Zhou. [MSR TR 2006] Correlation exploitation in error ranking. T. Kremenek, K. Ashcraft, J. Yang, and D. Engler [FSE 2004] 44

Conclusion. EXPLODE: comprehensive, lightweight, general, effective. Checked 17 real-world storage systems and found serious data loss in every system checked. Future work: apply EXPLODE to other systems (e.g. distributed systems); work out how to design and build easy-to-check systems; adapt program analysis techniques to other system-building stages (e.g. debugging). 46

Crude benchmarking result. Physical machine: Pentium 4 at 3.2 GHz with 2 GB memory; virtual machine: 1 GB memory. In 20 hours: 230,744 crashes checked (about 3 crashes/second) across 327 different FS topologies. 47

FS sync checking results (table entries mark observed data loss). Applications (such as your mailer) rely on sync operations, yet these operations are broken. 48

ext2 fsync bug due to a design flaw. Events to trigger the bug: truncate A; creat B; write B; fsync B; crash! (Diagram: in-memory vs. on-disk state; on disk, A's inode still references the block B now uses, as an indirect block, and fsck.ext2 runs after the crash.) 49
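
A minimal sketch of the trigger sequence as user-level code, assuming two test files on a mounted ext2 partition; the paths are illustrative, the point is the order of operations.

    // Hypothetical trigger sequence for the ext2 fsync bug described above:
    // truncate A, create and write B, fsync only B, then crash.
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <cstdio>

    int main() {
        // A is an existing large file on the ext2 test partition.
        if (truncate("/mnt/ext2test/A", 0) != 0)
            perror("truncate A");

        int fd = open("/mnt/ext2test/B", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("creat B"); return 1; }

        const char buf[] = "important data";
        if (write(fd, buf, sizeof buf) < 0)
            perror("write B");

        // fsync(B) returns success, yet if the machine crashes here the
        // recovered file system can corrupt or lose B's data.
        if (fsync(fd) != 0)
            perror("fsync B");
        close(fd);
        return 0;
    }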

Checking crashes. The EXPLODE kernel log records buffer-cache operations, e.g. mark_dirty(b1), make_request(b2), end_request(b1, b2). We use this log to recreate the buffer-cache contents at each point and then generate crash disks from them. (Diagram: the evolving sets of dirty blocks {1}, {1, 2}, {} and the crash disks generated from each.) 51

Checking crashes (continued). Example log: mark_dirty(b1), mark_dirty(b1), make_request(b2), end_request(b1, b2), giving dirty sets {1}, {1}, {1, 2}, {}. When generating crash disks, blocks that were not written keep their old value. (Diagram: the crash disks generated from each dirty set.) 52
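
A hypothetical sketch of the crash-disk generation step, assuming the logged buffer-cache state has been reduced to a set of pending block writes; the types and values are toy placeholders.

    // Hypothetical sketch: every subset of the pending (dirty or in-flight)
    // writes may have reached the disk at the moment of the crash, and every
    // block not in the chosen subset keeps its old on-disk value.
    #include <map>
    #include <utility>
    #include <vector>
    #include <cstdio>

    using BlockNo = int;
    using Disk = std::map<BlockNo, int>;   // block number -> block value (toy)

    std::vector<Disk> generate_crash_disks(
            const Disk &on_disk,
            const std::vector<std::pair<BlockNo, int>> &pending) {
        std::vector<Disk> crash_disks;
        size_t n = pending.size();
        for (size_t mask = 0; mask < (1u << n); mask++) {  // each subset of writes
            Disk d = on_disk;                              // unwritten blocks keep old value
            for (size_t i = 0; i < n; i++)
                if (mask & (1u << i))
                    d[pending[i].first] = pending[i].second;
            crash_disks.push_back(d);
        }
        return crash_disks;
    }

    int main() {
        Disk on_disk = {{1, 0}, {2, 0}};                   // old values of b1, b2
        std::vector<std::pair<BlockNo, int>> pending = {{1, 1}, {2, 1}};  // dirty b1, b2
        for (const Disk &d : generate_crash_disks(on_disk, pending))
            printf("crash disk: b1=%d b2=%d\n", d.at(1), d.at(2));
        return 0;
    }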

Classic application mistake: atomic rename. All three version control applications made this mistake. The intent is to update file A atomically to avoid corruption:

    fd = creat(a_tmp, ...);
    write(fd, ...);
    fsync(fd);   // missing!
    close(fd);
    rename(a_tmp, A);

Problem: rename() guarantees nothing about the data. 53
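
A minimal sketch of the safer pattern under the usual POSIX assumptions; the file names are illustrative. The key is to fsync the temporary file before rename(), and optionally to fsync the containing directory so the rename itself is durable.

    // Hypothetical sketch of an atomic file update that actually syncs the
    // data before rename(). Error handling is kept minimal on purpose.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    static int atomic_update(const char *path, const char *tmp_path,
                             const char *data, size_t len) {
        int fd = open(tmp_path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }   // the line the apps forgot
        close(fd);
        if (rename(tmp_path, path) != 0) return -1;

        // Optionally fsync the parent directory so the rename is durable too
        // (paths here are relative to the current directory).
        int dirfd = open(".", O_RDONLY);
        if (dirfd >= 0) { fsync(dirfd); close(dirfd); }
        return 0;
    }

    int main() {
        const char msg[] = "new contents of A\n";
        if (atomic_update("A", "A_tmp", msg, strlen(msg)) != 0)
            perror("atomic_update");
        return 0;
    }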

Checking a storage stack. (Diagram: the checker on the EXPLODE runtime, checking JFS running on RAID, with EKM inside the Linux kernel on real hardware. User code vs. our code; EKM = EXPLODE kernel module.) 54

What EXPLODE provides. check_crash_now(): check all crashes that can happen at the current moment (the paper discusses more ways of checking crashes). error(): record a trace for deterministic replay. choose(n): a conceptual N-way fork that returns K in the Kth child execution. 55

What users do. Checker: mutate() drives JFS to do something; check() verifies that what JFS did was correct. Storage component (JFS, RAID): initialize, set up, repair and tear down JFS or RAID; written once per system. Checking stack: assemble the components (e.g. JFS on top of RAID) into the stack to be checked. 56

Checker: mutate(). (Diagram: mutate() uses choose() (choose(4) in the figure) to pick which operation to drive the JFS component at the top of the stack with: creat file, rm file, mkdir, rmdir, sync or fsync.) 57
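
A hypothetical sketch of such a mutate(), assuming a test file system mounted at /mnt/test and reusing the fork-based choose() sketched earlier; the operation list and paths are illustrative, not EXPLODE's actual checker.

    // Hypothetical mutate() sketch: pick one file-system operation per
    // explored branch and apply it to the test mount point.
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int choose(int n);   // see the fork-based sketch after the choose(n) slide

    void mutate() {
        switch (choose(6)) {
        case 0: close(open("/mnt/test/f", O_CREAT | O_WRONLY, 0644)); break; // creat file
        case 1: unlink("/mnt/test/f");      break;                           // rm file
        case 2: mkdir("/mnt/test/d", 0755); break;                           // mkdir
        case 3: rmdir("/mnt/test/d");       break;                           // rmdir
        case 4: sync();                     break;                           // sync
        case 5: {                                                            // fsync
            int fd = open("/mnt/test/f", O_RDONLY);
            if (fd >= 0) { fsync(fd); close(fd); }
            break;
        }
        }
    }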

Checker: check(). (Diagram: check() verifies that the expected files exist and that their contents match.) This found yet another JFS fsync bug, caused by re-using a directory inode as a file inode. Checkers can be simple (50 lines) or very complex (5,000 lines): whatever you can express in C++, you can check. 58
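
A hypothetical check() along those lines, assuming the checker keeps an in-memory model of the files it expects to survive; expected_files and the exact error() signature are assumptions.

    // Hypothetical check() sketch: verify that every file the checker
    // expects to be persistent exists and has the expected contents.
    // expected_files is a model the checker maintains as mutate() runs.
    #include <fstream>
    #include <iterator>
    #include <map>
    #include <string>

    static std::map<std::string, std::string> expected_files;  // path -> contents

    void error(const char *msg);   // EXPLODE's error() primitive (signature assumed)

    void check() {
        for (const auto &kv : expected_files) {
            std::ifstream in(kv.first, std::ios::binary);
            if (!in) { error("expected file is missing"); continue; }
            std::string contents((std::istreambuf_iterator<char>(in)),
                                 std::istreambuf_iterator<char>());
            if (contents != kv.second)
                error("file contents do not match");
        }
    }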

Storage component (e.g. the JFS component): initialize, repair, set up, and tear down your system, mostly as wrappers around existing utilities (mkfs, fsck, mount, umount). threads() returns the list of kernel thread IDs used for deterministic error replay. Written once per system and reused to form stacks. Easy to write: proof on the next slide. 59

(Diagram: the JFS component code promised as proof on the previous slide.) 60

Assembling a checking stack (e.g. JFS on RAID): let EXPLODE know how the subsystems are connected together, so it can initialize, set up, tear down, and repair the entire stack. Easy to write: proof on the next slide. 61

(Diagram: the code assembling the JFS-on-RAID checking stack, promised as proof on the previous slide.) 62

Internal choices: fork and explore all internal choices (cache hit vs. miss, disk read succeeds vs. fails). (Diagram: the states generated from /root containing a.) 63

Naïve strategy: 7 crashes. crash-disk = 000 gives fsck(000) = 111 with the trace Read(B1)=0, Write(B2,1), Write(B3,1), Read(B3)=1, Write(B1,1); crash-crash-disk = 001 gives fsck(001) = ? with the same trace. The set of disk blocks fsck has read grows along the trace: empty, {B1=0}, {B1=0}, {B1=0}, {B1=0, B3=1}, {B1=0, B3=1}. 64

Optimization: exploiting determinism. For all practical purposes, fsck is deterministic: it reads the same blocks and writes the same blocks. We want to prune, without running fsck, every crash-crash-disk such that fsck(crash-crash-disk) == fsck(crash-disk). A crash-crash-disk is crash-disk + W, where W is a subset of the blocks fsck modified. If W doesn't change any block that fsck(crash-disk) reads, then fsck(crash-disk + W) == fsck(crash-disk). 65

Future work. Short-term: EXPLODE follow-on projects. Long-term: checking throughout the system-building stages (implementation, deployment, design, testing, patching, monitoring, debugging, crash analysis). 66

Future work. Short-term: EXPLODE follow-on projects. Long-term: checking across design, implementation, debugging and testing; or whatever good ideas strike me. 67

Optimization: exploiting determinism. For all practical purposes, fsck is deterministic: it reads the same blocks and writes the same blocks. We can therefore prune, without running fsck, any crash-crash-disk such that fsck(crash-crash-disk) == fsck(000). Example: fsck(010) == fsck(000), because 010 = 000 + {B2=1} and {B2=1} doesn't change anything fsck(000) reads; fsck(010) will read the same blocks as fsck(000) and therefore write the same blocks. 68