Rethink the Sync (presented by 황인중, 강윤지, 곽현호)
Authors: Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn
Published at the USENIX Symposium on Operating Systems Design and Implementation (OSDI '06)
System Structure Overview
(Figure: the user-level application layer sits above the kernel-level virtual file system, which dispatches to the individual file systems; below them are device drivers and the storage devices.)
Synchronous vs. Asynchronous File Systems
Trade-off between the two: durability versus performance.
- Synchronous FS: data will not be lost due to a power failure, and the ordering of modifications is guaranteed; however, every operation waits for mechanical disk operations, so it is slow.
- Asynchronous FS: does not block the calling application, so it is fast; however, ordering is not guaranteed and data is not safe (the application must use fsync() to transfer all modified data to disk).
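The fsync() escape hatch mentioned above can be shown in a few lines. This is a minimal sketch (the path and function name are arbitrary), not code from the paper:

```python
import os

def durable_write(path, data):
    """Write data and force it to stable storage before returning.

    Without the os.fsync() call, the write may sit in the OS buffer
    cache and be lost on a power failure (the asynchronous case).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, data)
        os.fsync(fd)   # block until the kernel has flushed the file
    finally:
        os.close(fd)

durable_write("/tmp/demo.txt", b"committed")
```

Note that, as the evaluation section of the paper observes, even fsync() can be defeated by an on-disk write cache that reorders blocks unless write barriers are used.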
Related Work
- Battery-backed main memory (bb-DRAM) can make writes persistent; the Conquest file system is a disk / persistent-RAM hybrid.
- eNVy is a file system that stores data on flash-based NVRAM. Although reads from NVRAM were fast, writes were prohibitively slow, so it used a battery-backed RAM write cache to achieve reasonable write performance.
- Early file systems such as FFS and the original UNIX file system introduced a main-memory buffer cache that holds writes until they are asynchronously written to disk, but they suffered from potential corruption when the computer lost power or the OS crashed.
- Cedar and LFS added the complexity of a write-ahead log to enable fast, consistent recovery of file-system state. However, journaling data to a write-ahead log is insufficient to prevent file-system corruption if the drive cache reorders block writes.
Motivation
Synchronous I/O gives durability; asynchronous I/O gives performance. External synchrony resolves the tension between the two:
- It provides the reliability and simplicity of synchronous I/O: data will not be lost due to a power failure, and the ordering of modifications is guaranteed.
- It closely approaches the performance of asynchronous I/O.
Changing the Viewpoint
Shift the point of reference from the application to the user. From the viewpoint of an external observer, such as a user or an application running on another computer, the guarantees provided by externally synchronous I/O are identical to those provided by a traditional file system mounted synchronously. An external observer never sees output that depends on uncommitted modifications; yet, because externally synchronous I/O rarely blocks applications, its performance approaches that of asynchronous I/O.
(Figure: synchronous I/O enforces its guarantees at the application/OS boundary, the application-centric view; externally synchronous I/O enforces them between the user and the external interface, the user-centric view.)
Xsyncfs
- Built on mechanisms developed as part of the Speculator project.
- When a process performs a synchronous I/O operation, xsyncfs validates the operation, adds the modifications to a file-system transaction, and returns control to the calling process without waiting for the transaction to commit.
- Commit dependency: the process is not allowed to externalize any output until the transaction commits; if the process writes to an external interface, its output is buffered by the OS.
- Output-triggered commits: track the causal relationship between external output and file-system modifications to decide when to commit data.
- Results are very positive: on I/O benchmarks (PostMark and an Andrew-style build), the performance of xsyncfs is within 7% of the default asynchronous implementation of ext3, and xsyncfs is up to two orders of magnitude faster than a version of ext3 that guards against losing data on power failure.
Design Overview
The design of external synchrony is based on two principles:
- Externally synchronous I/O is defined by its externally observable behavior rather than by its implementation.
- Application state is an internal property of the computer system; the OS can implement user-centric guarantees because it controls access to external devices.
(Figure: the application-centric view draws the boundary at the system call between application and kernel; the user-centric view draws it at the external interface, leaving internal state hidden from the observer.)
Example of Externally Synchronous File I/O
The synchronous and externally synchronous executions are equivalent because:
a. the output values are the same, and
b. the output occurs in the same causal order.
Two optimizations improve performance:
a. multiple modifications are group-committed as a single file-system transaction, and
b. screen output is buffered until the disk commit, at which point it is externalized.
Grouping & Buffering
If Op1 creates a file and Op3 deletes it within the same transaction, neither Op1 nor Op3 needs to reach the disk. The buffered output of Op2 and Op4 obeys the causal ordering: it is released once the single transaction containing Op1 through Op4 commits.
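The cancelling of a create/delete pair inside one grouped transaction can be sketched as a toy simulation; the tuple-based operation format here is hypothetical, not the paper's data structure:

```python
def coalesce(ops):
    """Group pending operations into one transaction, cancelling a
    create that is later deleted within the same transaction: neither
    operation needs to reach the disk.  Ops are (verb, name) tuples.
    """
    pending = []
    for verb, name in ops:
        if verb == "delete" and ("create", name) in pending:
            pending.remove(("create", name))  # cancel the pair
        else:
            pending.append((verb, name))
    return pending  # committed as a single transaction, in order

# Op1 = create(tmp), Op2 = write(a), Op3 = delete(tmp), Op4 = write(b)
print(coalesce([("create", "tmp"), ("write", "a"),
                ("delete", "tmp"), ("write", "b")]))
# → [('write', 'a'), ('write', 'b')]
```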
Commit Dependency Inheritance
This design requires that the OS track the causal relationship between file-system modifications and external output:
- When a process writes to the file system, it inherits a commit dependency on the uncommitted data that it wrote.
- When a process with commit dependencies modifies another kernel object by executing a system call, the OS marks the modified object with the same commit dependencies.
(Figure: Speculator tracks each process speculation with checkpoints and per-object undo logs, e.g. for inodes.)
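Dependency propagation across kernel objects can be modeled with a few lines of bookkeeping. This is a toy sketch of the inheritance rule, not Speculator's actual kernel data structures; all names are invented for illustration:

```python
class DepTracker:
    """Toy commit-dependency propagation: a process that writes
    uncommitted data inherits a dependency on that transaction, and
    any kernel object it then modifies (a pipe, another process)
    inherits the same dependency set."""

    def __init__(self):
        self.deps = {}                  # object name -> set of txn ids

    def write_fs(self, proc, txn):
        self.deps.setdefault(proc, set()).add(txn)

    def modify(self, src, dst):         # e.g. src writes to a pipe
        self.deps.setdefault(dst, set()).update(self.deps.get(src, set()))

t = DepTracker()
t.write_fs("procA", txn=1)     # procA depends on transaction 1
t.modify("procA", "pipe")      # the pipe inherits the dependency
t.modify("pipe", "procB")      # and so does the pipe's reader
print(sorted(t.deps["procB"]))
# → [1]
```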
Output-Triggered Commits
Group-commit strategies trade off latency against throughput. Latency is unimportant if no external entity is observing the result, so with output-triggered commits the OS improves throughput by delaying a commit until some output that depends on the transaction is buffered. This maximizes throughput whenever output is not being displayed.
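The policy can be sketched as a toy simulation in which writes never block and a commit fires only when output that depends on uncommitted data must be externalized. The class and method names are hypothetical, not the paper's implementation:

```python
class XsyncSim:
    """Toy model of output-triggered commits: modifications join the
    active transaction immediately; a group commit is triggered only
    when buffered output depending on the transaction is shown."""

    def __init__(self):
        self.active = []        # uncommitted modifications
        self.committed = []
        self.commits = 0
        self.screen = []        # externalized output

    def write(self, mod):
        self.active.append(mod)  # returns without waiting (fast path)

    def print_output(self, text):
        if self.active:          # output depends on uncommitted data
            self.committed += self.active
            self.active = []
            self.commits += 1    # one group commit covers all of them
        self.screen.append(text)

sim = XsyncSim()
for i in range(4):
    sim.write(f"op{i}")          # no commits yet: nothing is observed
sim.print_output("done")         # triggers a single group commit
print(sim.commits, len(sim.committed))
# → 1 4
```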
Limitations
1. Catastrophic media failure: external synchrony complicates application-specific recovery from catastrophic media failure.
2. Temporal expectations: the user may expect modifications to be committed to disk within some time bound (xsyncfs commits within 5 seconds at most).
3. Multiple file systems: modifications to data in two different file systems cannot easily be committed with a single disk transaction, so output depending on both may remain blocked longer.
Implementation: Speculator
Speculator improves the performance of distributed file systems by hiding the performance cost of remote operations. Rather than block during a remote operation, the file system predicts the operation's result, then uses Speculator to checkpoint the state of the calling process and speculatively continue its execution based on the predicted result. If the prediction is correct, the checkpoint is discarded; if it is incorrect, the calling process is restored to the checkpoint and the operation is retried.
Speculator in Detail
- A checkpoint forks the process but does not place the child on the run queue; it also saves the state of any open file descriptors and copies any signals pending for the checkpointed process.
- If the prediction fails, Speculator uses the undo log to restore the process to the state captured during the checkpoint.
- If the prediction is correct, the speculation is simply discarded.
- Embedded checkpoints allow new speculations to begin while earlier ones are still outstanding.
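The checkpoint / discard / restore cycle can be sketched as follows. A deepcopy stands in for Speculator's fork-based checkpoint, and all API names are invented for illustration:

```python
import copy

class Speculator:
    """Toy checkpoint/rollback in the spirit of Speculator: snapshot
    process state before speculating on a predicted result, discard
    the snapshot if the prediction holds, restore it otherwise."""

    def __init__(self, state):
        self.state = state
        self.checkpoint = None

    def speculate(self, predicted, apply_fn):
        self.checkpoint = copy.deepcopy(self.state)  # stand-in "fork"
        apply_fn(self.state, predicted)              # run ahead

    def resolve(self, actual, predicted):
        if actual == predicted:
            self.checkpoint = None        # correct: discard checkpoint
        else:
            self.state = self.checkpoint  # wrong: roll back, retry
        return self.state

s = Speculator({"balance": 100})
s.speculate(predicted="ok", apply_fn=lambda st, r: st.update(balance=90))
print(s.resolve(actual="fail", predicted="ok"))  # rolled back
# → {'balance': 100}
```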
Speculator Example
Each create_speculation registers a reverse operation, which is applied on fail_speculation. (See Speculative Execution in a Distributed File System, SOSP '05.)
Ext3 Journaling (JBD)
- Guarantees file-system consistency through transaction-level atomicity: write-ahead logging to the journal area at transaction granularity.
- A journal thread performs commits periodically in the background.
- Handle: the data and metadata modified by a single system call.
- Transaction: a set of handles.
- Active (running) transaction: exactly one exists per file system; it can still accept more handles.
- Committing transactions: transactions being written to the journal area.
Ext3 Journaling (JBD)
Journaling order (data mode):
1. Write a journal descriptor block to the journal area.
2. Record in the journal the home locations of the metadata and data about to be logged.
3. Write the metadata and data blocks to the journal area.
4. Write a commit block to the journal area.
5. Write the metadata and data to their home locations.
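Why the commit block matters can be seen in a toy recovery routine: only transactions whose commit record reached the journal are replayed to their home locations. This is an illustrative sketch, not JBD's on-disk format:

```python
def replay(journal):
    """Recover a toy data-mode journal.  Each transaction appears as:
    a descriptor record, (home_block, data) pairs, then a commit
    record.  On recovery, only fully committed transactions are
    replayed to home locations; a partial transaction is dropped."""
    disk, txn, in_txn = {}, [], False
    for rec in journal:
        if rec == ("descriptor",):
            txn, in_txn = [], True
        elif rec == ("commit",):
            disk.update(txn)        # safe: commit block is present
            in_txn = False
        elif in_txn:
            txn.append(rec)         # a (home_block, data) pair
    return disk

# The second transaction lost its commit block in the crash.
log = [("descriptor",), ("blk1", "A"), ("commit",),
       ("descriptor",), ("blk2", "B")]
print(replay(log))
# → {'blk1': 'A'}
```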
File-System Support for External Synchrony
- Ext3 ordered mode (the default) journals only metadata; it does not provide ordering for data, since data modifications are not journaled.
- Ext3 journaled mode journals both data and metadata.
- Xsyncfs uses journaled mode: it guarantees ordering (within a transaction, blocks may be written in any order), and it informs Speculator whenever a new journal transaction is created.
Rethink the Sync: Explicit Synchronization
When an application issues an explicit synchronization operation (sync, fdatasync), xsyncfs creates a commit dependency between the calling process and the active transaction; if there is no dependency, the call returns almost instantaneously. Group commit is provided transparently by xsyncfs without modifying the application: Speculator checks dependencies as operations move from the active transaction into the committing transaction.
Evaluation
The evaluation answers the following questions:
- How does the durability of xsyncfs compare to current file systems?
- How does the performance of xsyncfs compare to current file systems?
- How does xsyncfs affect the performance of applications that synchronize explicitly?
- How much do output-triggered commits improve the performance of xsyncfs?
Methodology:
- 3.02GHz Pentium 4 processor with 1GB of RAM.
- A single Western Digital WD-XL40 hard drive (7200RPM, 120GB, ATA-100, with a 2MB on-disk cache).
- Red Hat Enterprise Linux version 3 (kernel version 2.4.21).
- 400MB journal size for both ext3 and xsyncfs.
Evaluation: Durability
Without write barriers, ext3 does not guarantee durability in either journaled or ordered mode. This holds whether ext3 is mounted synchronously or asynchronously, and even if fsync is issued after every write. Even worse, despite ext3's use of journaling, a loss of power can corrupt both data and metadata stored in the file system.
The Benchmarks
- PostMark: 10,000 files, 10,000 transactions (reads, writes, creates, ...).
- The Apache build benchmark: compiling the 2.0.48 source tree.
The Benchmarks
- The MySQL benchmark.
- The SPECweb99 benchmark.
Benefit of Output-Triggered Commits
For comparison, an eager commit strategy for xsyncfs triggers a commit whenever the file system is modified. It still allows group commit, since multiple modifications are grouped into a single file-system transaction while the previous transaction is committing, and it minimizes the latency of individual file-system operations; however, it sacrifices the opportunity to improve throughput.
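The throughput difference between the two policies can be sketched by counting commits over a trace of events. This is a toy model for intuition, not the paper's measurement methodology:

```python
def count_commits(events, eager):
    """Count disk commits under two toy policies: eager commits on
    every modification; output-triggered commits only when an
    'output' event observes uncommitted writes."""
    commits, dirty = 0, False
    for ev in events:
        if ev == "write":
            dirty = True
            if eager:
                commits, dirty = commits + 1, False
        elif ev == "output" and dirty:
            commits, dirty = commits + 1, False
    return commits

# Eight writes followed by one observed output.
trace = ["write"] * 8 + ["output"]
print(count_commits(trace, eager=True),
      count_commits(trace, eager=False))
# → 8 1
```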
Conclusion
It is challenging to develop simple and reliable software systems if the foundations upon which those systems are built are unreliable. Asynchronous I/O is a prime example of such an unreliable foundation: OS crashes and power failures can lead to loss of data, file-system corruption, and out-of-order modifications. Nevertheless, current file systems present an asynchronous I/O interface by default because of performance. The paper proposes a new abstraction, external synchrony, that preserves the simplicity and reliability of a synchronous I/O interface, yet performs approximately as well as an asynchronous I/O interface.
Subsequent Studies & Discussion
- Operating System Support for Application-Specific Speculation (EuroSys '11): separates the two elements of speculation, policy and mechanism; policy is decided by the application, while the mechanism is provided by the operating system.
- I/O Speculation for the Microsecond Era (ATC '14): surveys how speculation can address the challenges that microsecond-scale devices will bring.
Discussion: Can the speculation method break through the I/O bottleneck? Can the speculation time be minimized?
Aerie: Flexible File-System Interfaces to Storage-Class Memory (presented by 강윤지, 곽현호, 황인중)
Storage-Class Memory (SCM)
- Persistent storage near the speed of DRAM: PCM, STT-RAM, flash-backed DRAM.
- Memory-like interface: byte-addressable, accessible with ordinary load/store instructions.
- Short access time.
Storage-Class Memory (SCM)
Recent SCM work uses it as a persistent write buffer or to hold small data. SCM file systems such as BPFS, SCMFS, and PMFS can improve file-system performance considerably, but the fixed and inefficient POSIX interface can limit the benefits. In fact, SCM does not need a kernel file system:
- SCM enables direct access from user mode.
- SCM requires no driver for data access, since it is reached with standard loads and stores.
- SCM needs no I/O scheduling, as there are no long seek or rotational delays.
Overhead of the File-System Interface
The POSIX file-system API's abstractions (file descriptors, inodes, dentry objects) become expensive relative to fast SCM. Costs of the abstraction:
- Entry function: the main routine of each VFS operation (including the system call).
- File descriptors: the cost of managing file descriptors.
- Synchronization: the cost of synchronization mechanisms such as RCU and locks.
- Memory objects: the cost of in-memory inodes and dentries.
- Naming: the cost of hierarchical names.
The Abstraction Cost of a File
(Figure: the file-system software stack makes file access roughly 25x slower than raw PCM access.)
File-System Interface
Other work exposes SCM directly to programmers, letting applications access SCM with loads and stores, but those systems lose important file-system features. The file-system interface provides useful features for easy access and for protecting data so that it can be shared securely between applications.
Introduction: Aerie
- The kernel handles only coarse-grained allocation and protection.
- Untrusted user-mode libraries implement and provide the file-system interface and functionality.
- Benefits: low-latency access to data with no layers of kernel code on the access path, and flexibility to exploit application semantics.
Introduction: Goals
The main goals of Aerie are high-performance access to SCM and giving applications the flexibility to define their own file-system interface. Two example file systems demonstrate this:
- PXFS: a POSIX-style file system implemented in user mode.
- FlatFS: a customized file system providing small-file access through put/get.
Design: Decentralized Architecture
- Untrusted user-mode library (libfs): the file-system interface that applications use, plus the functionality to find and access data (file names, file metadata, indexing by byte offset).
- Trusted file-system service (TFS): a user-mode process reached via RPC; it enforces metadata integrity and synchronization, and runs a distributed lock service that hands out leases to clients.
- SCM manager (kernel): the storage allocator and protection (permissions).
File-System Features
- Naming: 64-bit storage object IDs (OIDs).
- Collection: a directory-like object supporting key-value pairs, backed by a hash table.
- mfile: metadata for data extents, with indirect blocks for larger files.
File-System Interfaces on Aerie
- PXFS (POSIX-like file system): open/read/write/close; files are backed by mfiles with a fixed, page-sized extent size; a collection maps file names to OIDs; each client keeps a per-client name cache of path names.
- FlatFS (key-value store interface): put/get/erase; a single extent holds an entire file; a flat key-based namespace.
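FlatFS's put/get/erase interface can be sketched as a flat key-to-extent map. This toy model mirrors only the interface shape, not Aerie's SCM layout; the class is hypothetical:

```python
class FlatFS:
    """Toy model of FlatFS's key-value interface: a flat namespace in
    which each key maps to a single extent holding the whole file
    (no directories, no per-file open/close)."""

    def __init__(self):
        self.extents = {}           # key -> bytes (one extent each)

    def put(self, key, data):
        self.extents[key] = bytes(data)

    def get(self, key):
        return self.extents.get(key)

    def erase(self, key):
        self.extents.pop(key, None)

fs = FlatFS()
fs.put("config", b"x=1")
print(fs.get("config"))     # → b'x=1'
fs.erase("config")
print(fs.get("config"))     # → None
```

The design point this illustrates: a small-file workload can skip the open/read/close sequence entirely, touching the namespace and the data in a single operation.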
Experimental Setup
- 2.4GHz Intel Xeon E5645, six cores (12 hyper-threads), 40GB DRAM, x86-64 Linux kernel 3.2.2.
- SCM emulation: DRAM accesses are delayed to model SCM; 24GB of memory is used as SCM.
- Workloads: file systems RamFS, ext3, ext4, PXFS, and FlatFS; a microbenchmark exercising common POSIX APIs; Filebench modified to call the libfs API.
Evaluation
(Figures: benchmark results, scalability with the number of threads, and sensitivity to emulated memory latency.)
Conclusion
Software-interface overheads handicap fast SCM. Aerie's library file systems help remove these generic overheads for higher performance.
Discussion
Fast storage beyond SCM:
- NVMe SSDs. Ref: Bjørling, Matias, et al. "Linux Kernel Abstractions for Open-Channel Solid State Drives." Non-Volatile Memories Workshop, 2015.
Other kernel layers:
- Networking stack overheads. Ref: Peter, Simon, et al. "Arrakis: The Operating System Is the Control Plane." ACM Transactions on Computer Systems (TOCS) 33.4 (2015): 11.