Lightweight Application-Level Crash Consistency on Transactional Flash Storage

Similar documents
SFS: Random Write Considered Harmful in Solid State Drives

File System Consistency. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

TxFS: Leveraging File-System Crash Consistency to Provide ACID Transactions

Topics. File Buffer Cache for Performance. What to Cache? COS 318: Operating Systems. File Performance and Reliability

File System Consistency

Optimistic Crash Consistency. Vijay Chidambaram Thanumalayan Sankaranarayana Pillai Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau

CS122 Lecture 15 Winter Term,

EECS 482 Introduction to Operating Systems

CS 318 Principles of Operating Systems

File Systems: Consistency Issues

A File-System-Aware FTL Design for Flash Memory Storage Systems

Database Technology. Topic 8: Introduction to Transaction Processing

Boosting Quasi-Asynchronous I/Os (QASIOs)

Optimistic Crash Consistency. Vijay Chidambaram Thanumalayan Sankaranarayana Pillai Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau

Beyond Block I/O: Rethinking

Optimistic Crash Consistency. Vijay Chidambaram Thanumalayan Sankaranarayana Pillai Andrea Arpaci-Dusseau Remzi Arpaci-Dusseau

Crash Consistency: FSCK and Journaling. Dongkun Shin, SKKU

Understanding Write Behaviors of Storage Backends in Ceph Object Store

COS 318: Operating Systems. Journaling, NFS and WAFL

Lecture 18: Reliable Storage

Database Management Systems Introduction to DBMS

Torturing Databases for Fun and Profit

Advanced file systems: LFS and Soft Updates. Ken Birman (based on slides by Ben Atkin)

Topics. " Start using a write-ahead log on disk " Log all updates Commit

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

The Dangers and Complexities of SQLite Benchmarking. Dhathri Purohith, Jayashree Mohan and Vijay Chidambaram

Physical Disentanglement in a Container-Based File System

ijournaling: Fine-Grained Journaling for Improving the Latency of Fsync System Call

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

COURSE 1. Database Management Systems

From Crash Consistency to Transactions. Yige Hu Youngjin Kwon Vijay Chidambaram Emmett Witchel

CSC 261/461 Database Systems Lecture 20. Spring 2017 MW 3:25 pm 4:40 pm January 18 May 3 Dewey 1101

Outline. Failure Types

*-Box (star-box) Towards Reliability and Consistency in Dropbox-like File Synchronization Services

Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. University of Wisconsin - Madison

ECE 598 Advanced Operating Systems Lecture 18

Design of Flash-Based DBMS: An In-Page Logging Approach

Case study: ext2 FS 1

Caching and reliability

Journaling. CS 161: Lecture 14 4/4/17

Caching and consistency. Example: a tiny ext2. Example: a tiny ext2. Example: a tiny ext2. 6 blocks, 6 inodes

CS3210: Crash consistency

System Software for Persistent Memory

SSDs vs HDDs for DBMS by Glen Berseth York University, Toronto

SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory

Transactions. A Banking Example

January 28-29, 2014 San Jose

CSL373/CSL633 Major Exam Solutions Operating Systems Sem II, May 6, 2013 Answer all 8 questions Max. Marks: 56

I/O Stack Optimization for Smartphones

Case study: ext2 FS 1

15: Filesystem Examples: Ext3, NTFS, The Future. Mark Handley. Linux Ext3 Filesystem

CSE506: Operating Systems CSE 506: Operating Systems

mode uid gid atime ctime mtime size block count reference count direct blocks (12) single indirect double indirect triple indirect mode uid gid atime

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

TRANSACTIONAL FLASH CARSTEN WEINHOLD. Vijayan Prabhakaran, Thomas L. Rodeheffer, Lidong Zhou

CS5460: Operating Systems Lecture 20: File System Reliability

FStream: Managing Flash Streams in the File System

CrashMonkey: A Framework to Systematically Test File-System Crash Consistency. Ashlie Martinez Vijay Chidambaram University of Texas at Austin

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

Database Management System

ARIES (& Logging) April 2-4, 2018

Optimizing Fsync Performance with Dynamic Queue Depth Adaptation

EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors. Junfeng Yang, Can Sar, Dawson Engler Stanford University

CMPT 354: Database System I. Lecture 11. Transaction Management

CS3600 SYSTEMS AND NETWORKS

PERSISTENCE: FSCK, JOURNALING. Shivaram Venkataraman CS 537, Spring 2019

Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass Woonhak Kang, and Taesoo Kim

Push-button verification of Files Systems via Crash Refinement

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Recovery System These slides are a modified version of the slides of the book Database System Concepts (Chapter 17), 5th Ed McGraw-Hill by

Lecture X: Transactions

ECE 650 Systems Programming & Engineering. Spring 2018

Transaction Management & Concurrency Control. CS 377: Database Systems

<Insert Picture Here> Btrfs Filesystem

Operating Systems. File Systems. Thomas Ropars.

File Systems. Chapter 11, 13 OSPP

LevelDB-Raw: Eliminating File System Overhead for Optimizing Performance of LevelDB Engine

CTWP005: Write Abort Handling for Cactus Technologies Industrial-Grade Flash-Storage Products

Tricky issues in file systems

Lecture 11: Linux ext3 crash recovery

ACID Properties. Transaction Management: Crash Recovery (Chap. 18), part 1. Motivation. Recovery Manager. Handling the Buffer Pool.

Designing a True Direct-Access File System with DevFS

Rethink the Sync 황인중, 강윤지, 곽현호. Embedded Software Lab. Embedded Software Lab.

Deep Dive: InnoDB Transactions and Write Paths

Barrier Enabled IO Stack for Flash Storage

Orphans, Corruption, Careful Write, and Logging, or Gfix says my database is CORRUPT or Database Integrity - then, now, future

NTFS Recoverability. CS 537 Lecture 17 NTFS internals. NTFS On-Disk Structure

Deep Dive: InnoDB Transactions and Write Paths

Recoverability. Kathleen Durant PhD CS3200

UNIT 9 Crash Recovery. Based on: Text: Chapter 18 Skip: Section 18.7 and second half of 18.8

Database Management Systems Reliability Management

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

CSL373/CSL633 Major Exam Operating Systems Sem II, May 6, 2013 Answer all 8 questions Max. Marks: 56

) Intel)(TX)memory):) Transac'onal) Synchroniza'on) Extensions)(TSX))) Transac'ons)

Looking for Ways to Improve the Performance of Ext4 Journaling

Non-Volatile Memory Through Customized Key-Value Stores

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

Topics in Reliable Distributed Systems

Operating Systems. Operating Systems Professor Sina Meraji U of T

Transcription:

Lightweight Application-Level Crash Consistency on Transactional Flash Storage Changwoo Min, Woon-Hak Kang, Taesoo Kim, Sang-Won Lee, Young Ik Eom Georgia Institute of Technology Sungkyunkwan University

Application's data is not consistent after random failures Mobile Phone Bank Account Hang while booting Financial loss

Application's data is not consistent after random failures Mobile Phone Bank Account Power Outage Hardware Errors Software Panics (OS, Device Driver) Hang while booting Financial loss

Example: inserting records in two databases Application write(/db1, new ); write(/db2, new );

Example: inserting records in two databases Application File System write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Storage Device /db1 bitmap inode data /db2

Example: inserting records in two databases Application File System write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Storage Device /db1 bitmap inode data /db1 data block is allocated. /db2

Example: inserting records in two databases Application File System write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Storage Device /db1 bitmap inode data /db1 data block is allocated. /db2

Example: inserting records in two databases Application File System write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Storage Device /db1 bitmap inode data /db1 data block is allocated. /db2

Example: inserting records in two databases Application File System write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Storage Device /db1 bitmap inode data /db1 data block is allocated. /db2

Example: inserting records in two databases Application File System Storage Device write(/db1, new ); write(/db2, new ); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Random failures? Data corruption! /db1 bitmap inode data /db1 data block is allocated. /db2

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } Logging (i.e., journaling) & Crash Recovery Write logs first before writing data in place

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } Logging (i.e., journaling) & Crash Recovery Write logs first before writing data in place Maintaining three log files : for each DB and their master : 3 create() & 3 unlink()

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } Logging (i.e., journaling) & Crash Recovery Write logs first before writing data in place Maintaining three log files : for each DB and their master : 3 create() & 3 unlink() Redundant write : 7 write()

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } Logging (i.e., journaling) & Crash Recovery Write logs first before writing data in place Ordering & durability : 11 fsync() Maintaining three log files : for each DB and their master : 3 create() & 3 unlink() Redundant write : 7 write()

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } // create master journal open(/master.jnl); write(/master.jnl, /db1,/db2 ); fsync(/master.jnl); fsync(/); // update db1 open(/db1.jnl); write(/db1.jnl, old ); fsync(/db1.jnl); fsync(/); write(/db1.jnl, master.jnl ); fsync(/db1.jnl); write(/db1, new ); fsync(/db1); // update db2 open(/db2.jnl); write(/db2.jnl, old ); fsync(/db2.jnl) fsync(/); write(/db2.jnl, master.jnl ); fsync(/db2.jnl); write(/db2, new ); fsync(/db2); // clean up journals unlink(/master.jnl); fsync(/); unlink(/db1.jnl); unlink(/db1.jnl);

How to Achieve Crash Consistency : The Case of SQLite atomic_update { write(/db1, new ); write(/db2, new ); } // create master journal open(/master.jnl); write(/master.jnl, /db1,/db2 ); fsync(/master.jnl); fsync(/); // update db1 open(/db1.jnl); write(/db1.jnl, old ); fsync(/db1.jnl); fsync(/); write(/db1.jnl, master.jnl ); fsync(/db1.jnl); write(/db1, new ); fsync(/db1); // update db2 open(/db2.jnl); write(/db2.jnl, old ); fsync(/db2.jnl) fsync(/); write(/db2.jnl, master.jnl ); fsync(/db2.jnl); write(/db2, new ); fsync(/db2); // clean up journals unlink(/master.jnl); fsync(/); unlink(/db1.jnl); unlink(/db1.jnl);

Cost of Crash Consistency : The Case of SQLite 40 Written Pages 30 20 Database update: 2 pages 10 0 ext4

Cost of Crash Consistency : The Case of SQLite 40 Written Pages 30 20 SQLite journaling: Database update: 3 pages 2 pages 10 0 ext4

Cost of Crash Consistency : The Case of SQLite 40 Written Pages Ext4 journaling+metadata: 33.6 pages 30 20 SQLite journaling: Database update: 3 pages 2 pages 10 0 ext4 Write Amplification = 38.6/2 = 19.3

Cost of Crash Consistency : The Case of SQLite 40 Written Pages Ext4 journaling+metadata: 33.6 pages 30 SQLite journaling: 3 pages 20 10 Database update: Redundant disk writes 2 pages per single data page write 0 ext4 Write Amplification = 38.6/2 = 19.3

Cost of Crash Consistency : The Case of SQLite 15 Disk Flush fsync() + FS journaling: 12.7 10 5 0 ext4

Cost of Crash Consistency : The Case of SQLite Notoriously slow! 15 xsyncfs [Nightingale:OSDI06] OptFS [Chidambaram:SOSP13] DuraSSD [Kang:SIGMOD14]... 10 Disk Flush fsync() + FS journaling: 12.7 5 0 ext4

Problem: complex, redundant software stack for crash consistency How can we simplify mechanisms for application crash consistency?

Problem: complex, redundant software stack for crash consistency How can we simplify mechanisms for application crash consistency? Can we use atomic updates of multi pages provided by transactional flash?

Transactional Flash 101 NAND Flash SSD No in-place update Log-structured write Mapping table: logical address physical address Mapping Table LBN Pn PBN P1 Flash Chips P1... Pn Old copy of P1,, Pn

Transactional Flash 101 NAND Flash SSD No in-place update Log-structured write Mapping table: logical address physical address Mapping Table LBN Pn P1 PBN write(p1) Flash Chips P1... Pn P1 Old copy of P1,, Pn

Transactional Flash 101 NAND Flash SSD No in-place update Log-structured write Mapping table: logical address physical address Mapping Table LBN Pn P1 PBN write(p1) Flash Chips P1... Pn P1 Old copy of P1,, Pn

Transactional Flash 101 NAND Flash SSD No in-place update Log-structured write Mapping table: logical address physical address Mapping Table LBN Pn P1 PBN write(pn) write(p1) Flash Chips P1... Pn P1... Pn Old copy of P1,, Pn New copy of P1,, Pn

Transactional Flash 101 Transactional Flash SSD Atomic multi-page write by atomically updating the mapping table at commit request write(txid, page), commit(txid), abort(txid) H/W implementation or S/W emulation Mapping Table LBN Pn PBN P1 Flash Chips P1... Pn Old copy of P1,, Pn

Transactional Flash 101 Transactional Flash SSD Atomic multi-page write by atomically updating the mapping table at commit request write(txid, page), commit(txid), abort(txid) H/W implementation or S/W emulation Mapping Table LBN Pn P1 PBN write(txid, P1) Flash Chips P1... Pn P1 Old copy of P1,, Pn New copy of P1,, Pn

Transactional Flash 101 Transactional Flash SSD Atomic multi-page write by atomically updating the mapping table at commit request write(txid, page), commit(txid), abort(txid) H/W implementation or S/W emulation Mapping Table LBN Pn P1 PBN commit(txid) write(txid, Pn) write(txid, P1) Flash Chips P1... Pn P1... Pn Old copy of P1,, Pn New copy of P1,, Pn

Transactional Flash 101 Transactional Flash SSD Atomic multi-page write by atomically updating the mapping table at commit request write(txid, page), commit(txid), abort(txid) H/W implementation or S/W emulation Mapping Table LBN Pn PBN commit(txid) P1 Flash Chips P1... Pn P1... Pn Old copy of P1,, Pn New copy of P1,, Pn

Transactional Flash 101 Transactional Flash SSD Atomic multi-page write by atomically updating the mapping table at commit request write(txid, page), commit(txid), abort(txid) H/W implementation or S/W emulation Mapping Table LBN Pn PBN commit(txid) P1 Flash Chips P1... Pn P1... Pn Old copy of P1,, Pn New copy of P1,, Pn

Our Solution: CFS, a new file system using transactional flash Simplifying applications' crash consistency using atomic multi-page write of transactional flash

Our Solution: CFS atomic_update { write(/db1, new ); write(/db2, new ); } 40 30 20 Written Pages Disk Flush Performance FS Jnl + Metadata App Jnl App 15 10 20 15 10 10 5 5 0 ext4 CFS 0 ext4 CFS 0 ext4 CFS

Our Solution: CFS atomic_update { write(/db1, new ); write(/db2, new ); } + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); 40 30 20 Written Pages Disk Flush Performance FS Jnl + Metadata App Jnl App 15 10 20 15 10 16.7x 10 5 12.9x 12.7x 5 0 ext4 CFS 0 ext4 CFS 0 ext4 CFS

Outline Introduction CFS Design Evaluation Conclusion

CFS Architecture Application (e.g., DBMS, K/V) Journaling cfs_begin() cfs_commit() cfs_abort() Application (e.g., DBMS, K/V) File System (e.g., ext4) Block Storage (e.g., HDD/SSD) Journaling CFS Transactional Flash write(txid, LBA) commit(txid) abort(txid) Removing redundant journaling with new primitives provided by transactional flash

Four Challenges 1. Finding a set of pages for atomic updates 2. File system metadata consistency in every case 3. Concurrency control among atomic updates 4. Legacy application support without any modification

#1. Unit of Atomic Write : Atomic Propagation Group Application File System + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) bitmap inode data Transactional Flash /db1 /db2

#1. Unit of Atomic Write : Atomic Propagation Group Application File System + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) bitmap inode data Transactional Flash /db1 /db2

#1. Unit of Atomic Write : Atomic Propagation Group Application File System + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) bitmap inode data Transactional Flash /db1 /db2

#1. Unit of Atomic Write : Atomic Propagation Group Application File System + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); For each database Allocates new data block (bitmap) Fills data block with user data (data) Sets location of data block (inode) Updated data and file system metadata pages need to be atomically updated. Use atomic multi-page write operations Atomic Propagation Group bitmap inode data Transactional Flash /db1 /db2

Example: Two apps. are running Application + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); bitmap inode data /db1 Transactional Flash /db2 /db3

Example: Two apps. are running Application + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); + cfs_begin(); write(/db3, new );... bitmap inode data /db1 Transactional Flash /db2 /db3

Example: Two apps. are running Application + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); + cfs_begin(); write(/db3, new );... What if /db3 bitmap locates in the /db2 bitmap? /db1 bitmap inode data Transactional Flash /db2 /db3

Example: Two apps. are running Application + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); + cfs_begin(); write(/db3, new );... What if /db3 bitmap locates in the /db2 bitmap? /db1 bitmap inode data Transactional Flash /db2 /db3 /db3 data block is allocated.

Example: Two apps. are running Application + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); + cfs_begin(); write(/db3, new );... What if /db3 bitmap locates in the /db2 bitmap? /db1 bitmap inode data Transactional Flash /db2 /db3 /db3 data block is allocated.

Problem: entangled disk pages bitmap inode data 1. cfs_commit() - /db1 and /db2 are persistent /db1 /db2 /db3

Problem: entangled disk pages bitmap inode data 1. cfs_commit() - /db1 and /db2 are persistent /db1 /db2 /db3 2. Sudden Power Outage - /db3 changes are lost.

Problem: entangled disk pages bitmap inode data 1. cfs_commit() - /db1 and /db2 are persistent /db1 /db2 /db3 2. Sudden Power Outage - /db3 changes are lost.

Problem: entangled disk pages bitmap inode data 1. cfs_commit() - /db1 and /db2 are persistent /db1 /db2 /db3 2. Sudden Power Outage - /db3 changes are lost. /db3 data block is allocated. Incorrect bitmap - Free block is marked as allocated.

Problem: entangled disk pages bitmap /db1 /db2 False Sharing of Metadata Pages /db3 + cfs_begin(); write(/db1, new ); write(/db2, new ); + cfs_commit(); + cfs_begin(); write(/db3, new );... - sudden power outage

#2. Metadata False Sharing : In-Memory Metadata Logging Operational Logging for in-memory metadata change toggle_bit(free_block_bitmap, LBA) sub(free_block_count, 1) Maintain two versions of in-memory metadata to selectively propagate only relevant changes to storage Memory version: on-going modification Storage version: committed version, used for storage IO

REDO or UNDO operational logs cfs_commit() { storage version += REDO(logs); write(txid, storage version); commit(txid); } cfs_abort() { memory version -= UNDO(logs); abort(txid); }

Example: Two apps. are running In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version Transactional Flash

Example: Two apps. are running In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() Transactional Flash

Example: Two apps. are running In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() Transactional Flash

Example: Two apps. are running Bitmap for db3 is not written. In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() Transactional Flash

Example: Two apps. are running Bitmap for db3 is not written. In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() cfs_abort() Transactional Flash

Example: Two apps. are running Bitmap for db3 is is not reverted. written. In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() cfs_abort() Transactional Flash

Example: Two apps. are running Bitmap for db3 is is not reverted. written. In-Memory Metadata Memory Version bitmap Operational Log for Atomic Propagation Group App 1 App 2 Turn on a bit Turn on a bit Storage Version cfs_commit() cfs_abort() Transactional Flash Selective Propagation of Metadata Change

#3. Isolation? Concurrency Control? : Leave It to Application cfs_begin(); write(/my_file, 200 );... /my_file 100

#3. Isolation? Concurrency Control? : Leave It to Application cfs_begin(); write(/my_file, 200 );... /my_file 100 200 cfs_begin(); val = read(/my_file);...

#3. Isolation? Concurrency Control? : Leave It to Application cfs_begin(); write(/my_file, 200 );... /my_file 100 200 cfs_begin(); val = read(/my_file);... What val should be either of 100 or 200?

#3. Isolation? Concurrency Control? : Leave It to Application cfs_begin(); write(/my_file, 200 );... /my_file 100 200 cfs_begin(); val = read(/my_file);... What val should be either of 100 or 200? It depends on application semantics.

#3. Isolation? Concurrency Control? : Leave It to Application > START TRANSACTION > UPDATE account SET deposit='200' >... > COMMIT /bank_account 100 200 Deposit of bank account must be 200. > START TRANSACTION > SELECT deposit FROM account >... > COMMIT

#3. Isolation? Concurrency Control? : Leave It to Application > START TRANSACTION > UPDATE account SET deposit='200' >... > COMMIT /bank_account 100 200 Deposit of bank account must be 200. KV begin_transaction(...); KV set( account.likes, 200 );... KV end_transaction(...); > START TRANSACTION > SELECT deposit FROM account >... > COMMIT /sns_account KV begin_transaction(...); KV get( account.deposit ); 100 200... KV end_transaction(...); Likes of SNS account can be either of 100 or 200.

#3. Isolation? Concurrency Control? : Leave It to Application Isolation and crash-consistency are orthogonal. Even the SQL standard defines four different isolation levels. CFS does not provide its own concurrency control mechanism. If needed, use existing synchronization primitives (e.g., mutex, RW lock, etc).

#4. Legacy Application Support System-Wide Atomic Propagation Group Every update from legacy applications belongs. Automatically committed by sync() or page flusher CFS supports legacy applications without any modification.

Outline Introduction CFS Design Evaluation Conclusion

Implementation CFS 5.8k LoC modification of ext4 on Linux 3.10 Capture logs by inserting 182 places in ext4 Transactional Flash OpenSSD: 8KB, 128 pages/block, 8GB w/ SATA2 X-FTL/SSD [Kang:SIGMOD'13]

Real Application & Workloads Mobile Database + RLBench + Facebook Text Editor SQL Database + SysBench + LinkBench Key/Value Store Package Installer + kctreetest + db_bench apt-get

Real Application & Workloads Mobile Database + RLBench + Facebook Text Editor SQL Database CFS is + easy-to-use! SysBench + LinkBench : 317 LoC changes out of 3.5 million Key/Value Store Package Installer + kctreetest + db_bench apt-get

SQLite + Facebook App. SQL Trace Write Amplification Disk Flush Performance 4 3 2 1 FS Jnl + Metadata App Jnl App 4.1x 5 4 3 2 1 4x 5 4 3 2 1 4.8x 0 ext4 CFS 0 ext4 CFS 0 ext4 CFS Ext4: ordered journal mode SQLite: rollback journal mode

MariaDB + LinkBench Write Amplification Disk Flush Performance 2 FS Jnl + Metadata App Jnl App 20 15 2 2.1x 1 2x 10 5 17.6x 1 0 ext4 CFS 0 ext4 CFS 0 ext4 CFS Ext4: ordered journal mode

KyotoCabinet + db_bench Write Amplification Disk Flush Performance 3 2 1 FS Jnl + Metadata App Jnl App 2.2x 3 2 1 2.7x 5 4 3 2 1 4.8x 0 ext4 CFS 0 ext4 CFS 0 ext4 CFS Ext4: ordered journal mode

Conclusion Current mechanisms for crash consistency is complex, slow, and error-prone. CFS simplifies application's crash consistency using transactional flash. Atomic propagation group In-memory metadata logging Our evaluation shows Less write: 2 ~ 4x Less disk flush: 3 ~ 17x Higher performance: 2 ~ 5x

Thank you! Changwoo Min changwoo@gatech.edu Woon-Hak Kang, Taesoo Kim, Sang-Won Lee, Young Ik Eom Georgia Institute of Technology Sungkyunkwan University Questions?