Designing a True Direct-Access File System with DevFS

Designing a True Direct-Access File System with DevFS
Sudarsun Kannan, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
Yuangang Wang, Jun Xu, Gopinath Palani (Huawei Technologies)

Modern Fast Storage Hardware
Faster nonvolatile memory technologies such as NVMe and 3D XPoint:

                Hard Drives    PCIe-Flash    3D XPoint
    BW:         2.6 MB/s       250 MB/s      1.3 GB/s
    H/W Lat:    7.1 ms         68 us         12 us
    S/W cost:   8 us           8 us          6 us
    OS cost:    5 us           5 us          4 us

Bottlenecks shift from hardware to software (the file system)

Why Use OS File System?
Millions of applications use an OS-level file system (FS)
- Guarantees integrity, concurrency, crash-consistency, and security
Object stores have been designed to reduce OS cost [HDFS, CEPH]
- But developers are unwilling to move away from the POSIX interface
- They need faster file systems, not a new interface
User-level POSIX-based file systems fail to satisfy these fundamental properties

Device-level File System (DevFS)
- Move the file system into the device hardware
- Use device-level CPU and memory for DevFS
- Applications bypass the OS for both the control and data plane
- DevFS handles integrity, concurrency, crash-consistency, and security
- Achieves true direct access
[Figure: with a kernel FS, every read/write crosses into the kernel, which checks security and updates data and metadata; with DevFS, the application issues reads/writes directly to DevFS inside the NVMe device, which performs the same checks and updates]

Challenges of Hardware File System
- Limited memory inside the device
  - Reverse-cache inactive file system structures to host memory
- DevFS lacks visibility into OS state (e.g., process permissions)
  - Make the OS share the required (process) information with a down-call

Performance
- Emulate DevFS at the device-driver level
- Compare DevFS with the state-of-the-art NOVA file system
- Benchmarks: more than 2X write and 1.8X read throughput
- Snappy compression application: up to 22% higher throughput
- Memory-optimized design reduces file system memory use by 5X

Outline Introduction Background Motivation DevFS Design Evaluation Conclusion

Traditional S/W Storage Stack
- The kernel FS maintains security and manages integrity, crash-consistency, and concurrency
[Figure: the application's read/write requests go through the kernel FS, which checks security and updates data and metadata before they reach the NVMe device]

Traditional S/W Storage Stack
- A user-to-kernel switch is required for every data-plane operation
- High software-indirection latency before storage access
[Figure: same stack as before, highlighting the kernel crossing on the read/write path]

Holy Grail of Storage Research
- Challenge 1: How to bypass the OS and provide direct storage access?
- Challenge 2: How to provide direct access without compromising integrity, concurrency, crash-consistency, and security?
[Figure: the application links an FS library and reads/writes the SSD directly, keeping metadata and data out of the kernel FS]

Classes of Direct-Access File Systems
Prior approaches have attempted to provide user-level direct access. We categorize them into four classes:
- Hybrid user-level
- Hybrid user-level with trusted server (microkernel approach)
- Hybrid device
- Full device-level file system (proposed)

Hybrid User-level File System
- Split the file system into a user-level library and a kernel component
- The kernel FS handles the control plane (e.g., file creation) and coordinates sharing and protection
- The library handles the data plane (e.g., read, write) and manages metadata
- Well-known hybrid approaches: Arrakis (OSDI '14), Strata (SOSP '17)
[Figure: the application creates files through the kernel FS but reads/writes data directly through the FS library]

Hybrid Device File System
- The file system is split across a user-level library, the kernel, and the hardware
- Control- and data-plane operations work as in the hybrid user-level FS
- However, some functionality is moved inside the hardware (e.g., transactions, permission checks)
- Well-known hybrid approaches: Moneta-D (ASPLOS '12), TxDev (OSDI '08)
[Figure: the FS library manages metadata and issues reads/writes; the kernel FS handles file creation, sharing, and protection; the device hardware implements transactions and permission checks]

Outline Introduction Background Motivation DevFS Design Evaluation Conclusion

File System Properties
- Integrity: correctness of FS metadata for single and concurrent access
- Crash-consistency: FS metadata remains consistent after a failure
- Security: no permission violation for either control- or data-plane operations
  - An OS-level file system checks permissions for both the control and data plane

Hybrid User-level FS Integrity Problem
- In Arrakis (OSDI '14) and Strata (SOSP '17), the library manages metadata and has direct access for the data plane, while the kernel FS handles file creation and coordinates sharing and protection
[Figure: the application's FS library touches metadata and data directly; only control-plane operations go through the kernel FS]

Hybrid User-level FS Integrity Problem
- The FS library is untrusted (it may be buggy or malicious)
- It can compromise metadata integrity and impact crash consistency
- Data-plane security is compromised
[Figure: the untrusted library modifies metadata and data directly, bypassing the kernel FS that coordinates sharing and protection]

Concurrent Access?
- Two applications each call Append(F1, buff, 4k) on the same file through their FS libraries; if one library skips locking, both set the same bit in the free-block bitmap and update the inode (e.g., inode { size = 4K, m_time = 21 }), corrupting metadata
- As a result, Arrakis and Strata trap into the OS for the data plane as well as the control plane: no direct access
[Figure: App. 1 and App. 2 both set the bitmap, append a block, and update the inode for file F1]

Approaches Summary
[Table: compares each class of file system on integrity, crash consistency, security, concurrency, POSIX support, and direct access]
- Kernel-level FS: NOVA
- Hybrid user-level FS: Arrakis, Strata
- Microkernel: Aerie
- Hybrid-device FS: Moneta-D, TxDev
- FUSE: Ext4-FUSE
- Device FS: DevFS

Outline Introduction Background Motivation DevFS Design Evaluation Conclusion

Device-level File System (DevFS)
- Move the file system into the device hardware
- Use device-level CPU and memory for DevFS
- Applications bypass the OS for both the control and data plane
- DevFS handles integrity, concurrency, crash-consistency, and security
- Achieves true direct access
[Figure: the application issues reads/writes directly to DevFS inside the NVMe device; DevFS checks security and updates data and metadata, with no kernel FS on the path]

DevFS Internals
[Figure: DevFS runs on the device controller CPU; global structures (superblock, bitmaps, inodes, dentries) exist both as in-memory metadata and as on-disk file metadata, alongside per-file structures]

DevFS Internals
- Modern storage devices contain multiple CPUs and support up to 64K I/O queues
- To exploit this concurrency, each file has its own I/O queues and its own journal
- A per-file filemap structure links the dentry, inode, submission/completion queues, in-memory journal, and on-disk journal
[Figure: an in-memory filemap tree (/root, /root/proc, /root/dir) points to per-file submission queues (SQ), completion queues (CQ), journals, and data blocks, alongside the global superblock, bitmaps, inodes, and dentries]
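The slide's filemap pseudocode can be read as a C structure along the following lines. This is only a sketch: the queue and journal types, field layouts, and names beyond those shown on the slide are illustrative assumptions, not the paper's actual definitions.

    #include <stddef.h>

    /* Hypothetical per-file queue and journal types; only their roles
     * (per-file SQ/CQ, in-memory and on-disk journals) come from the talk. */
    struct devfs_queue {
        void     *entries;      /* ring of NVMe-style commands      */
        unsigned  head, tail;   /* producer/consumer positions      */
    };

    struct devfs_journal {
        void   *buffer;         /* journal records                  */
        size_t  size;
    };

    /* Per-file DevFS state, mirroring the slide's filemap fields. */
    struct devfs_filemap {
        struct dentry        *dentry;        /* name -> inode mapping     */
        struct inode         *inode;         /* per-file metadata         */
        struct devfs_queue    sq;            /* per-file submission queue */
        struct devfs_queue    cq;            /* per-file completion queue */
        struct devfs_journal  mem_journal;   /* kept in device memory     */
        struct devfs_journal  disk_journal;  /* on-disk journal blocks    */
    };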

DevFS Internals
- The user-level FS library first calls Vaddr = CreateBuffer() to obtain an OS-allocated command buffer through which it talks to DevFS
[Figure: the OS-allocated command buffer sits between the user FS library and the DevFS controller; DevFS keeps the same global and per-file structures as in the previous slide]

DevFS I/O Operation
- To open a file, the library writes an Open(f1) command into the OS-allocated command buffer; DevFS picks it up and sets up the file's in-memory state, including its per-file journal and queues
[Figure: the Open command flows from the FS library through the command buffer to the DevFS controller, which adds an entry to the in-memory filemap tree]

DevFS I/O Operation
- As more files are opened, DevFS adds a journal and queues for each of them in the in-memory filemap tree
[Figure: same layout as the previous slide, with an additional per-file journal]

DevFS I/O Operation
- Data-plane operations follow the same path: the library writes a Write(fd, buff, 4k, off=3) command into the command buffer, and DevFS executes it against the per-file queues, journal, and data blocks without any kernel involvement
[Figure: the Write command flows from the FS library through the OS-allocated command buffer to the DevFS controller]
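A rough sketch of how the user-level library might format such a command. The devfs_cmd layout and devfs_submit() helper are hypothetical and only meant to illustrate the command-buffer path described above; they are not the actual DevFS interface.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical command layout; only the fields named on the slides
     * (operation, fd, buffer, size, offset) are taken from the talk. */
    enum devfs_op { DEVFS_OPEN, DEVFS_READ, DEVFS_WRITE, DEVFS_CLOSE };

    struct devfs_cmd {
        uint32_t op;        /* enum devfs_op                        */
        int32_t  fd;        /* descriptor returned by a prior open  */
        uint64_t user_buf;  /* address of the application's buffer  */
        uint64_t size;      /* number of bytes                      */
        uint64_t offset;    /* file offset                          */
    };

    /* Copy a command into the OS-allocated, device-visible buffer.
     * 'cmd_buf' is the Vaddr returned by CreateBuffer(); 'slot' is a
     * free entry that the DevFS controller polls. */
    static void devfs_submit(struct devfs_cmd *cmd_buf, unsigned slot,
                             const struct devfs_cmd *cmd)
    {
        memcpy(&cmd_buf[slot], cmd, sizeof(*cmd));
        /* A real implementation would also ring a doorbell or set a
         * valid bit so the device controller notices the new command. */
    }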

Capacitance Benefits Inside H/W
- Writing journals to storage has high overhead
- Modern storage devices have device-level capacitors that can safely flush memory state to storage after a power failure
- DevFS therefore keeps file system state in device memory
  - It can avoid writing the in-memory state to an on-disk journal
  - This overcomes the double-write problem
- Capacitance support improves performance
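A minimal sketch of this idea, under our own assumptions (a has_capacitor flag and simple journal buffers; none of the names are DevFS's): with capacitor backing, a commit only needs to reach device memory, otherwise the record must also be written to the on-disk journal.

    #include <string.h>

    /* Placeholder journal regions; in DevFS these would live in device
     * memory and in on-disk journal blocks, respectively. */
    struct journal { char buf[4096]; size_t used; };

    /* Hypothetical commit path showing the capacitance optimization. */
    static void devfs_commit(struct journal *mem_journal,
                             struct journal *disk_journal,
                             const void *record, size_t len,
                             int has_capacitor)
    {
        /* The record always lands in the in-memory journal first. */
        memcpy(mem_journal->buf + mem_journal->used, record, len);
        mem_journal->used += len;

        if (!has_capacitor) {
            /* Without capacitor backing, device memory is volatile, so the
             * record must also be persisted on disk before acknowledging
             * the operation (the double-write problem). */
            memcpy(disk_journal->buf + disk_journal->used, record, len);
            disk_journal->used += len;
        }
        /* With capacitors, device memory is flushed to storage on power
         * failure, so the second (on-disk) journal write can be skipped. */
    }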

Challenges of Hardware File System
- Limited memory inside the storage device (today's focus)
  - Reverse-cache inactive file system structures to host memory
- DevFS lacks visibility into OS state (e.g., process permissions)
  - Make the OS share the required information with a down-call
  - Please see the paper for more details

Device Memory Limitation
- Device RAM size is constrained by cost ($) and power consumption
- RAM is used mainly by the flash translation layer (FTL)
  - RAM size is proportional to the FTL's logical-to-physical block mapping
  - Example: a 512 GB SSD uses 2 GB of RAM to support translations
- Unlike a kernel FS, the device FS footprint must be kept small

Memory-Consuming File Structures
- Our analysis shows four in-memory structures using 90% of the memory:
  - Inode (840 bytes): created at file open, not freed until deletion
  - Dentry (192 bytes): created at file open, kept in a cache
  - File pointer (256 bytes): released when the file is closed
  - Others (156 bytes): e.g., the DevFS filemap structure
- Simple workload: open and close 1 million files
  - DevFS memory consumption is ~1.2 GB (60% of device memory)
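As a rough check (our arithmetic, assuming the released file pointers are excluded): each opened-and-closed file retains roughly the inode, dentry, and remaining structures, i.e., about 840 + 192 + 156 = 1188 bytes, so 1 million files come to roughly 1.2 GB, which matches the reported consumption.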

Reducing Memory Usage
- On-demand allocation of structures
  - Structures such as the filemap are not used after a file is closed
  - Allocate them at the first write and release them when the file is closed
- Reverse caching
  - Move inactive structures to host memory

Reverse-Caching to Reduce Memory
- Move inactive inode and dentry structures to host memory
- Host-side inode and dentry caches are reserved during mount
- On close(file), DevFS moves the file's inode and dentry to the host cache
- On a later open(file), DevFS checks the host cache for the dentry and inode, moves them back to device memory, and deletes the host copies (sketched below)
[Figure: device memory holds the inode, dentry, and file-pointer lists; host memory holds the inode and dentry caches]
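A minimal, self-contained sketch of the close/open path described above. The structure sizes are taken from the earlier slide; everything else (names, the single-slot host cache, malloc/free as stand-ins for device-memory management) is an illustrative assumption.

    #include <stdlib.h>
    #include <string.h>

    /* Simplified stand-ins for the DevFS in-memory structures; the real
     * inode is ~840 bytes and the dentry ~192 bytes. */
    struct d_inode  { char data[840]; };
    struct d_dentry { char data[192]; };

    /* Host-side cache reserved during mount. */
    struct host_cache {
        struct d_inode  inode_slot;
        struct d_dentry dentry_slot;
        int             valid;
    };

    /* On close: copy the inactive inode and dentry to host memory and
     * free the device-memory copies. */
    static void reverse_cache_on_close(struct d_inode **ino,
                                       struct d_dentry **de,
                                       struct host_cache *hc)
    {
        hc->inode_slot  = **ino;
        hc->dentry_slot = **de;
        hc->valid = 1;
        free(*ino); *ino = NULL;   /* stand-in for freeing device RAM */
        free(*de);  *de  = NULL;
    }

    /* On open: if cached on the host, move the structures back into
     * device memory and invalidate the host copy. */
    static int reverse_cache_on_open(struct d_inode **ino,
                                     struct d_dentry **de,
                                     struct host_cache *hc)
    {
        if (!hc->valid)
            return -1;                 /* miss: read on-disk metadata */
        *ino = malloc(sizeof(**ino));
        *de  = malloc(sizeof(**de));
        **ino = hc->inode_slot;
        **de  = hc->dentry_slot;
        hc->valid = 0;
        return 0;
    }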

Decompose FS Structures
- Reverse caching is complicated for the inode
- Some inode fields are accessed even after the file is closed (e.g., during directory traversal)
- Frequently moving the inode between the host cache and the device can be expensive!
- Our solution: split file system structures (e.g., the inode) into a host part and a device part

Decompose FS Structures

DevFS inode structure (840 bytes):

    struct devfs_inode_info {
        inode_list
        page_tree
        journals
        ...
        struct inode vfs_inode
    }

Decomposed DevFS structure (593 bytes):

    struct devfs_inode_info {
        /* always kept in device */
        struct inode_device *inode_device;
        /* moved to host after close */
        struct inode_host   *inode_host;
    }

Outline Introduction Background Motivation DevFS Design Evaluation Conclusion

Evaluation
- Benchmarks and applications
  - Filebench
  - Snappy: a widely used multi-threaded file compression application
- Evaluation comparison
  - NOVA: state-of-the-art in-kernel NVM file system
  - DevFS-naïve: DevFS without direct access
  - DevFS-cap: without direct access but with capacitor support
  - DevFS-cap-direct: capacitor support + direct access
- For direct access, the benchmarks and applications run as a driver

Filebench - Random Write
[Figure: random-write throughput (100K ops/second) for NOVA, DevFS-naïve, DevFS-cap, and DevFS-cap-direct at 1 KB, 4 KB, and 16 KB I/O sizes; chart annotations mark 2.4X and 27% gains]
- DevFS-naïve suffers from high journaling overhead
- DevFS-cap uses capacitors to avoid on-disk journaling
- DevFS-cap-direct achieves true direct access by bypassing the OS

Snappy Compression Performance
- Workload: read a file, compress it, write the output, sync the file
[Figure: throughput (100K ops/second) for NOVA, DevFS-naïve, DevFS-cap, and DevFS-cap-direct at file sizes from 1 KB to 256 KB; up to a 22% gain]
- Gains even for a compute + I/O intensive application

Memory Reduction Benefits
- Filebench file-create workload (create 1M files and close them)
[Figure: memory usage (MB) broken down into filemap, dentry, and inode for four configurations: Cap (no memory reduction), Demand (on-demand FS structures), Dentry (reverse caching of dentries), and Inode + Dentry (reverse caching of dentries and inodes)]
- Demand allocation reduces memory consumption by 156 MB (14%)
- Inode and dentry reverse caching reduces memory by 5X

Memory Reduction Performance Impact
[Figure: throughput (100K ops/sec) for Cap, Demand, Dentry, Inode + Dentry, and Inode + Dentry + Direct]
- Dentry and inode reverse-caching overhead is less than 14%
- The overhead is mainly due to the cost of moving structures

Summary
- Motivation
  - Eliminating OS overhead and providing direct access is critical
  - Hybrid user-level file systems compromise fundamental properties
- Solution
  - We design DevFS, which moves the file system into the storage H/W
  - It provides direct access without compromising FS properties
  - To reduce DevFS's memory footprint, we design reverse caching
- Evaluation
  - Emulated DevFS shows up to 2X I/O performance gains
  - It reduces memory usage by 5X with a 14% performance impact

Conclusion
- We are moving toward a storage era with microsecond latency
- Eliminating software (OS) overhead is critical, but without compromising fundamental storage properties
- Near-hardware access latency requires embedding S/W into H/W
- We take a first step toward moving the file system into H/W
- Several challenges, such as H/W integration and support for RAID, snapshots, and deduplication, are yet to be addressed

Permission Checking
- Step 1: a process is scheduled to a host CPU
- Step 2: the OS sets that process's credential in the DevFS credential region, indexed by host CPU (e.g., CPU 0 -> Task1.cred, CPU 1 -> Task1.cred, CPU 24 -> Task2.cred)
- Step 3: the user-level FS issues Write(UID, buff, 4k, off=1); the command carries payload = buff, ops = WRITE, UID = 1, off = 1, size = 4K
- Step 4: the DevFS permission manager compares the scheduled task's credential with the inode's credential:

    t_cred = get_task_cred(cpuid)
    inode_cred = get_inode_cred(fd)
    compare_cred(t_cred, inode_cred)
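A compact sketch of the check described above. The per-CPU credential region and helper names follow the slide's pseudocode, but the concrete layout (a UID-only credential, a fixed-size array) is our own assumption.

    #include <stdbool.h>

    #define MAX_HOST_CPUS 64

    /* Simplified credential; a real system would carry UID/GID,
     * capabilities, etc. */
    struct cred { unsigned uid; };

    /* Per-host-CPU credential region that the OS updates via a
     * down-call whenever a task is scheduled (steps 1-2). */
    static struct cred devfs_cred_region[MAX_HOST_CPUS];

    void devfs_set_cred(unsigned cpuid, struct cred c)
    {
        devfs_cred_region[cpuid] = c;
    }

    /* Permission manager (steps 3-4): compare the credential of the
     * task running on 'cpuid' with the owner of the target inode. */
    bool devfs_check_permission(unsigned cpuid, unsigned inode_uid)
    {
        struct cred t_cred = devfs_cred_region[cpuid];  /* get_task_cred */
        return t_cred.uid == inode_uid;                 /* compare_cred  */
    }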

Concurrent Access
[Figure: throughput (100K ops/second) for NOVA, DevFS [+cap], and DevFS [+cap +direct] as the number of instances grows from 1 to 16]
- Limited CPUs inside the device: DevFS uses only 4 device CPUs
- The limited number of device CPUs restricts DevFS scaling

Slow CPU Impact
[Figure: Snappy 4 KB throughput (100K ops/sec) for DevFS [+cap] and DevFS [+cap +direct] as the device CPU frequency varies from 1.2 GHz to 2.6 GHz]

Thanks! Questions?