FStream: Managing Flash Streams in the File System
Eunhee Rho, Kanchan Joshi, Seung-Uk Shin, Nitesh Jagadeesh Shetty, Joo-Young Hwang, Sangyeun Cho, Daniel DG Lee, Jaeheon Jeong
Memory Division, Samsung Electronics Co., Ltd.
Table of Contents
- Flash-based SSDs
- Garbage Collection & WAF
- Multi-stream
- FStream
- Workload Analysis & Experimental Results
- Conclusion
Flash-based Solid State Drives
- Replacement for HDDs: faster, more energy efficient, more reliable
- The Flash Translation Layer (FTL) allows SSDs to keep the traditional block interface
[Figure: application / OS / file system / block layer stack on an HDD vs. on an SSD, where the FTL sits below the common block interface]
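The FTL's core trick behind the block interface can be sketched as a page-level mapping table with out-of-place writes. Below is a minimal illustrative model (not any vendor's actual FTL); the class and field names are invented for this sketch:

```python
# Minimal page-mapping FTL sketch (illustrative only, not a real FTL):
# overwrites land on fresh physical pages, and the mapping table hides
# this behind an ordinary block interface.

class PageMapFTL:
    def __init__(self, num_pages):
        self.mapping = {}       # logical page number -> physical page number
        self.next_free = 0      # next free physical page (append-only log)
        self.valid = set()      # physical pages holding live data
        self.num_pages = num_pages

    def write(self, lpn):
        """Out-of-place write: the old physical copy becomes invalid."""
        old = self.mapping.get(lpn)
        if old is not None:
            self.valid.discard(old)   # overwrite invalidates the old copy
        ppn = self.next_free
        self.next_free += 1
        self.mapping[lpn] = ppn
        self.valid.add(ppn)
        return ppn

ftl = PageMapFTL(num_pages=1024)
ftl.write(0)   # first write of logical page 0 -> physical page 0
ftl.write(0)   # overwrite: lands on physical page 1
print(ftl.mapping[0], len(ftl.valid))   # -> 1 1
```

The invalidated pages left behind by such overwrites are exactly what garbage collection (next slide) must clean up.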
Garbage Collection & WAF
- Garbage Collection (GC) overheads
  - Reclaiming space for empty blocks requires copying still-valid pages
  - GC amplifies media writes
  - Shortens SSD lifetime and hampers performance
- Write Amplification Factor (WAF)
  - Ratio of actual media writes to user I/O writes
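The WAF definition above reduces to a one-line ratio; the numbers below are made up purely for illustration:

```python
# WAF = total media writes / host writes. GC's valid-page copies inflate
# the numerator: the device writes the host data plus every page it had
# to relocate during garbage collection.

def waf(host_pages_written, gc_pages_copied):
    media_pages_written = host_pages_written + gc_pages_copied
    return media_pages_written / host_pages_written

print(waf(1000, 0))     # no GC copies: ideal WAF of 1.0
print(waf(1000, 600))   # heavy GC: WAF of 1.6
```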
Multi-stream
- Manages data placement on an SSD with streams
- Maps data to separate streams by life expectancy
- Standardization status: T10 (SCSI) standard & NVMe 1.3 directives
Multi-stream Cont'd
Data placement comparison
[Figure: hot pages H1-H4 and cold pages C1-C4, written in the order H1, C1, H2, C2, C3, H3, H4, C4]
- Conventional SSD: hot and cold pages share blocks, so rewriting the hot pages (H1-H4) fragments free space; the still-valid cold pages must be copied out before a block can be reclaimed
- Multi-streamed SSD: hot pages go to stream X and cold pages to stream Y; rewriting the hot data invalidates a whole block, which returns to the free pool with no valid-page copies
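The figure's comparison can be reproduced with a toy model (block size, page names, and the helper function are all invented for this sketch):

```python
# Toy model of the slide's figure: 4 hot (H) and 4 cold (C) pages in blocks
# of 4 pages. After the hot pages are overwritten, count the valid-page
# copies GC needs to reclaim every block that contains dead pages.

def gc_copies(blocks, dead):
    """Pages that must be relocated from blocks holding any dead page."""
    copies = 0
    for blk in blocks:
        if any(p in dead for p in blk):                      # block must be erased
            copies += sum(1 for p in blk if p not in dead)   # live pages move out
    return copies

# Conventional SSD: the write order interleaves hot and cold into shared blocks.
conventional = [["H1", "C1", "H2", "C2"], ["C3", "H3", "H4", "C4"]]
# Multi-streamed SSD: one stream (hence one set of blocks) per lifetime class.
streamed = [["H1", "H2", "H3", "H4"], ["C1", "C2", "C3", "C4"]]

dead = {"H1", "H2", "H3", "H4"}       # hot data overwritten
print(gc_copies(conventional, dead))  # 4: every cold page must be copied
print(gc_copies(streamed, dead))      # 0: the hot block is erased in place
```

This is exactly the WAF gap the paper targets: the streamed layout reclaims its free space with zero extra media writes.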
FStream
- Motivation
  - We need an easier, more general method of stream assignment
  - The block device layer has limited information about data lifetime
  - File system metadata has a different lifetime from user data and needs to be separated [1]
- Our approach
  - File-system-level stream assignment
  - Separate streams for file system metadata, journal, and user data
  - Implemented FStream in existing file systems [2]
[1] Kang, J.-U. et al., The Multi-streamed Solid-State Drive, HotStorage '14
[2] Yang, Jingpei et al., AutoStream: Automatic Stream Management for Multi-streamed SSDs, SYSTOR '17
Ext4
- EXT4 metadata and journaling
  - EXT4 on-disk layout: block groups containing data and the metadata related to it
  - EXT4 journal: write ordering in data=ordered mode
Ext4Stream Mount Options
- journal_stream: separate journal writes
- inode_stream: separate inode writes
- dir_stream: separate directory blocks
- misc_stream: inode/block bitmaps and group descriptors
- fname_stream: distinct stream for file(s) with specific names
- extn_stream: file-extension-based stream
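The option list above amounts to a policy table from write type to stream ID. A conceptual sketch follows; the real patch tags writes inside ext4's kernel code, and the specific stream numbers here are invented for illustration:

```python
# Conceptual sketch of Ext4Stream's per-write-type policy. The mapping
# mirrors the mount options on this slide; stream IDs are illustrative.

STREAMS = {
    "journal": 1,   # journal_stream: journal writes
    "inode": 2,     # inode_stream: inode writes
    "dir": 3,       # dir_stream: directory blocks
    "misc": 4,      # misc_stream: bitmaps + group descriptors
    "fname": 5,     # fname_stream: files matching a configured name pattern
    "extn": 6,      # extn_stream: files with a configured extension
}
DEFAULT_STREAM = 0  # everything else: ordinary user data

def stream_for(write_type):
    """Pick the stream ID attached to a write of the given type."""
    return STREAMS.get(write_type, DEFAULT_STREAM)

print(stream_for("journal"), stream_for("data"))  # -> 1 0
```

Because each type gets its own stream, metadata with short, regular lifetimes never shares flash blocks with long-lived user data.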
XFS
- XFS metadata and journaling
  - Parallel metadata operations; metadata buffering (page cache not used)
  - Mixture of logical and physical journaling
  - Minimum inode update size is a chunk of 64 inodes
XFStream Mount Options
- log_stream: separate journal writes
- inode_stream: separate inode writes
- fname_stream: distinct stream for file(s) with specific names
Application-Specific Data Separation
- Stream for Cassandra's commit log file
- fname_stream option: commitlog-*
Upon a write request:
1) Write to the commit log (for error recovery)
2) Write to the memtable (in memory)
3) Return success
4) Memtable data is flushed to an SSTable
5) SSTables go through compaction
[Figure: write path from a write request through the memtable and commit log to SSTables and compaction]
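The fname_stream match on commitlog-* is a simple glob test; a sketch of that classification follows (the pattern comes from this slide, while the function name and stream IDs are invented for illustration):

```python
# Sketch of fname_stream classification for Cassandra's commit log:
# file names matching the configured glob get their own stream.
from fnmatch import fnmatch

FNAME_PATTERN = "commitlog-*"   # pattern from the slide
FNAME_STREAM = 5                # illustrative stream ID
DATA_STREAM = 0                 # everything else: regular user data

def stream_for_file(name):
    """Route a file's writes by matching its name against the pattern."""
    return FNAME_STREAM if fnmatch(name, FNAME_PATTERN) else DATA_STREAM

print(stream_for_file("commitlog-1234.log"))  # -> 5
print(stream_for_file("sstable-21.db"))       # -> 0
```

The short-lived, append-only commit log thus never mixes with long-lived SSTable data in the same flash blocks.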
Experimental Setup
- OS: Linux kernel v4.5 with io-streamid support
- System: Dell PowerEdge R720 server with 32 cores and 32GB memory
- SSD: Samsung PM963 480GB with streams support
- Benchmarks:
  - Filebench: Varmail & Fileserver
  - YCSB on Cassandra
Filebench Workload Analysis
Write breakdown by type (from the pie charts):

Varmail    | Data  | Journal | Inode | Directory | Other meta
Ext4       | 26.8% | 61.0%   | 8.0%  | 4.0%      | 0.2%
Ext4-NJ    | 63.0% | -       | 21.0% | 15.8%     | 0.2%
XFS        | 31%   | 60%     | 9%    | -         | -

Fileserver | Data  | Journal | Inode | Directory | Other meta
Ext4       | 55%   | 26%     | 16%   | 3%        | 0%
XFS        | 52%   | 16%     | 32%   | -         | -

- XFS writes more inodes (random writes) than Ext4
Workload conditions:
- Filled 80% of the disk before each test
- Number of test files: 900,000 (14GB)
- Varmail: fsync-intensive
- Runtime: 2 hours
Filebench: Performance
- FStream achieved 5~35% performance improvements
[Chart: Varmail ops/sec for Ext4, Ext4Stream, Ext4-NJ, Ext4Stream-NJ, XFS, XFStream; Fileserver ops/sec for Ext4, Ext4Stream, XFS, XFStream]
Filebench: WAF
- FStream achieved a WAF close to one
- Ext4's WAF < Ext4-NJ's WAF: the journal is written in a circular fashion, so it is invalidated periodically
WAF by configuration (from the bar charts):
- Varmail: Ext4 1.3, Ext4Stream 1.04, Ext4-NJ 1.6, Ext4Stream-NJ 1.09, XFS 1.2, XFStream 1.05
- Fileserver: Ext4 1.1, Ext4Stream 1.02, XFS 1.2, XFStream 1.05
YCSB on Cassandra Results
- Data-intensive workload
  - Load phase: 1KB records x 120 million inserts
  - Run phase: 1KB records x 80 million inserts
[Chart: ops/sec for Ext4, Ext4Stream, XFS, XFStream]
- WAF (from the bar chart): Ext4 2, Ext4Stream 1.1, XFS 2, XFStream 1.2
Conclusion and Acknowledgements
- SSD performance & lifetime
  - The lower the FTL garbage-collection overhead, the longer an SSD lasts and the faster it performs
- Streams: an SSD interface for separating data with different lifetimes
- FStream: stream assignment in the file system
  - Separate streams for file system metadata, journal, and user data
  - Provides filename- and extension-based user data separation
  - Achieved 5~35% performance improvement and near-1 WAF on Filebench
- Acknowledgements
  - We thank Cristian Ungureanu, our shepherd, and the anonymous reviewers for their feedback
Thank You