Towards Efficient, Portable Application-Level Consistency

Towards Efficient, Portable Application-Level Consistency
Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Joo-Young Hwang, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

File System Crash Consistency
- What happens if there is a system crash during a file system update?
- File system crash consistency: make sure the file system's metadata is logically consistent, even if there is a crash
- Multiple techniques: FSCK, Soft Updates, Journaling, Copy-On-Write

Application-Level Crash Consistency (ALC)
- What happens to user data if there is a crash?
- Consistency of user data: Application-Level Crash Consistency (ALC)
- This work: a study of what happens to user data

Result
- State of the art: for effective application-level consistency, application developers depend on specific details of file system implementation
- This is bad: many file systems are in use (Linux: ext3, ext4, btrfs, xfs, zfs)

Outline
- Background: Application-Level Consistency (ALC)
- Goals, Methodology of Study
- File System Behavior
- ALC Bugs
- ALC Performance
- Summary

Application-Level Data Structures
- Modern applications store many data structures
- Google Chrome initialization: 500+ files
- History, cookies, web page cache

Application-Level Consistency (ALC)
- Applications impose invariants on data
  - Web page cache: should contain complete entries
  - Photo application: thumbnails match pictures
- Invariants should hold across system crashes
  - Violation: application failures, silent corruption
- Requires complex implementations
  - EXPLODE [OSDI '06], Eat My Data [Stewart Smith]

Example: Atomic File Rewrite
File grub.conf (original):
    kernel vmlinuz
    initrd initrd.img
File grub.conf (updated):
    print Hello
    kernel vmlinuz
    initrd initrd.img
- The file should always be in either the fully original state or the fully updated state
- The file should never contain garbage, be empty, or be filled with zeroes

Atomic File Rewrite: Correct Protocol
    fd = creat("temp")
    write(fd)
    fsync(fd)
    rename("temp", "grub.conf")
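The four steps of the correct protocol can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the function name `atomic_rewrite` and the `.temp` suffix are assumptions made here, and the sketch mirrors only the four calls on the slide (it does not sync the parent directory).

```python
import os
import tempfile


def atomic_rewrite(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data` via temp file + rename."""
    tmp = path + ".temp"  # hypothetical temp-file name in the same directory
    # Step 1: create (or truncate) the temporary file.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Step 2: write the complete new contents (os.write may be partial).
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        # Step 3: force the data to stable storage before the rename.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Step 4: switch the name over to the new contents.
    os.rename(tmp, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "grub.conf")
        atomic_rewrite(target, b"kernel vmlinuz\ninitrd initrd.img\n")
        atomic_rewrite(target, b"print Hello\nkernel vmlinuz\ninitrd initrd.img\n")
        with open(target, "rb") as f:
            print(f.read().decode(), end="")
```

The temporary file lives in the same directory as the target so that the rename stays within one file system.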

Atomic File Rewrite: Wrong Protocol
    fd = creat("temp")
    write(fd)
    fsync(fd)    <- the wrong protocol omits this step
    rename("temp", "grub.conf")
- The failure occurs because file systems can re-order write() and rename()
Possible states after crash:
    grub.conf (original):
        kernel vmlinuz
        initrd initrd.img
    grub.conf (updated):
        print Hello
        kernel vmlinuz
        initrd initrd.img
    grub.conf (zeroes):
        000000000000000
        000000000000000
        000000000000000

Atomic File Rewrite Depends on FS
- The wrong protocol is commonly used. Why?
  - Bug (invalid assumption)
  - Correctness sacrificed for performance
- Works under most common file systems
  - Ext4, btrfs, etc. explicitly ensure correctness
  - Though not required by the standard FS interface
- Observation: FS implementation affects applications

Goals
- Study the relationship of FS implementation with
  - ALC correctness
  - ALC performance
- Characterize common file systems
  - Deduce high-level properties affecting ALC

Methodology
- Case study: two applications (SQLite, LevelDB)
  - Find new bugs, analyze existing bugs (manual system call trace analysis, Bugzilla)
  - Find any correctness-performance tradeoffs
- Extract FS implementation details affecting bugs
  - Convert details to high-level properties
- Characterize file systems
  - Understanding source code

Post-Crash Property
- Post-crash property (True / False): does a system call sequence only result in a given, desirable set of post-crash states?
Safe-rename property:
    1. fd = creat("FileA.temp")
    2. write(fd)
    3. fsync(fd)
    4. rename("FileA.temp", "FileA")
Desirable states: grub.conf (updated) or grub.conf (original)
Undesirable states: grub.conf (garbage) or grub.conf (zeroes)
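One way to read the True / False definition is as a subset test: the property holds when every state a crash can leave behind is in the desirable set. The sketch below encodes that reading. The names `property_holds`, `fs_a`, and `fs_b` are hypothetical, and the two sets of reachable states are made up for illustration rather than measured on any real file system.

```python
def property_holds(reachable_states, desirable_states):
    """Return True only if every reachable post-crash state is desirable."""
    return set(reachable_states) <= set(desirable_states)


ORIGINAL = "kernel vmlinuz\ninitrd initrd.img\n"
UPDATED = "print Hello\n" + ORIGINAL
ZEROES = "0" * len(UPDATED)  # stand-in for a zero-filled file

DESIRABLE = {ORIGINAL, UPDATED}

# Hypothetical file system A: a crash leaves only the old or new contents.
fs_a = {ORIGINAL, UPDATED}
# Hypothetical file system B: a crash can also expose a zero-filled file.
fs_b = {ORIGINAL, UPDATED, ZEROES}

print(property_holds(fs_a, DESIRABLE))  # True
print(property_holds(fs_b, DESIRABLE))  # False
```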

File System Comparison
- Different configurations of the ext3 file system
- Different versions of the ext4 file system
[Table: ext3 ordered, ext3 writeback, ext4 ordered, ext4 ordered (original version), and btrfs against the safe rename property]

Ordered Appends
Ordered appends property:
    1. Append(LogA)
    2. Append(LogB)
Both logs start with the record "0.00 Started". Append(LogA) adds "1.00 Msg"; Append(LogB) adds "2.00 FAULT".
Post-crash states allowed by the property:
    LogA: 0.00 Started              LogB: 0.00 Started
    (or)
    LogA: 0.00 Started, 1.00 Msg    LogB: 0.00 Started
    (or)
    LogA: 0.00 Started, 1.00 Msg    LogB: 0.00 Started, 2.00 FAULT
Post-crash state that violates the property:
    LogA: 0.00 Started              LogB: 0.00 Started, 2.00 FAULT
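The LogA / LogB example can be worked through mechanically. The sketch below assumes a simplified model in which each append is either fully persisted or not persisted at all. Under ordered appends, only prefixes of the issued sequence may survive a crash; without the property, any subset may. All function and variable names are illustrative.

```python
from itertools import product

# Appends in the order the application issued them: (file, record).
ISSUED = [("LogA", "1.00 Msg"), ("LogB", "2.00 FAULT")]
INITIAL = {"LogA": ("0.00 Started",), "LogB": ("0.00 Started",)}


def apply_appends(ops):
    """Return the on-disk state after the given appends have persisted."""
    state = {name: list(lines) for name, lines in INITIAL.items()}
    for name, record in ops:
        state[name].append(record)
    return tuple((name, tuple(state[name])) for name in sorted(state))


def ordered_states(issued):
    """States allowed by ordered appends: every prefix of the issued sequence."""
    return {apply_appends(issued[:k]) for k in range(len(issued) + 1)}


def all_states(issued):
    """States possible if each append may persist independently of the others."""
    result = set()
    for mask in product([False, True], repeat=len(issued)):
        result.add(apply_appends([op for op, keep in zip(issued, mask) if keep]))
    return result


# States that can only appear on a file system without ordered appends.
violations = all_states(ISSUED) - ordered_states(ISSUED)
for state in violations:
    print(dict(state))
```

In this model the only violating state is the one where LogB contains "2.00 FAULT" while LogA is still missing "1.00 Msg".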

File System Comparison
- Ordered appends: appends get persisted in the issued order
[Table: ext3 ordered, ext3 writeback, ext4 ordered, ext4 original, and btrfs against the safe rename and ordered appends properties]

More Properties
- Ordered dir-ops: directory operations (creat, unlink, rename) get persisted in the issued order
- Safe appends: when a file is appended, the appended portion will never contain garbage
- Safe new file: after fsync() on a new file, another fsync() on the parent directory is not needed
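An application that does not want to rely on the safe new file property can sync the parent directory itself. The sketch below shows that pattern under the assumption of a POSIX system such as Linux, where a directory can be opened read-only and passed to fsync(). The function name `create_durably` is hypothetical.

```python
import os
import tempfile


def create_durably(path: str, data: bytes) -> None:
    """Create a new file, then sync both the file and its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
        os.fsync(fd)  # persists the file's data
    finally:
        os.close(fd)
    # Without "safe new file", the directory entry may still be missing after
    # a crash, so sync the parent directory as well (assumes POSIX semantics).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        create_durably(os.path.join(d, "LogA"), b"0.00 Started\n")
        print(sorted(os.listdir(d)))
```

On a file system that does provide safe new file, the second fsync() is redundant but harmless, which is the portability-versus-performance tradeoff the talk examines.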

File System Comparison
- Ext3-ordered: safest for applications
- Safe new file: manpages explicitly warn against relying on this property
[Table: ext3 ordered, ext3 writeback, ext4 ordered, ext4 original, and btrfs against the properties safe rename, ordered appends, ordered dir-ops, safe appends, and safe new file]

Bugs: LevelDB Guarantees
- LevelDB is a key-value database
- Put(key, value, synchronous)
  - Atomic
  - Ordered: if a Put() can be retrieved, all previous Put() calls can also be retrieved
  - Synchronous = true: durable
- No corruption

Bugs: Guarantees vs Post-Crash Properties
Post-Crash Property             LevelDB Guarantee Affected
Ordered append                  Re-ordering, Corruption
Ordered directory operations    Re-ordering, Corruption
Safe new file                   Corruption
Safe rename                     Corruption (previously fixed bug)

Performance Optimizations
- SQLite: "In particular, we suspect that most modern filesystems exhibit the safe append property and that many of them might support atomic sector writes. But until this is known for certain, SQLite will take the conservative approach and assume the worst."
- Five configuration options
- Evaluated performance for each option
  - On top of ext3-ordered

Performance: SQLite Configurations
[Bar chart: normalized throughput (axis from 0 to 4) per configuration option: atomic writes, safe appends, sequential writes, power-safe overwrites, sync directory]
- Ext3-ordered: 40 to 50% improvement
- 250% improvement with data journaling

Summary
- Bugs
  - Four new bugs in LevelDB
  - Past bugs: one in LevelDB, three in SQLite
  - All bugs exposed on some file system
- Performance
  - Wildly differing performance when optimized for exact file system behavior

Conclusion
- State of the art: for effective application-level consistency, application developers depend on specific details of file system implementation

Thank you! Questions or suggestions?