ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices
Jiacheng Zhang, Jiwu Shu, Youyou Lu
Tsinghua University

Outline
- Background and Motivation
- ParaFS Design
- Evaluation
- Conclusion

Solid State Drives: Internal Parallelism
- Four levels of internal parallelism: channel level, chip level, die level, plane level.
- Chips in one package share the same 8/16-bit I/O bus, but have separate chip enable (CE) and ready/busy (R/B) control signals.
- Each die has its own internal R/B signal.
- Each plane contains thousands of flash blocks and one data register.
- Internal parallelism enables high bandwidth.
[Figure: SSD internal architecture, from the host interconnect, H/W interface, and FTL down to channels, chips, dies, planes, blocks, and registers.]
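To make the hierarchy concrete, here is a small illustrative C sketch (not from the paper; the struct and field names are assumptions) of the flash geometry a host-side file system would need in order to reason about this parallelism:

```c
/* Illustrative flash geometry for the parallelism hierarchy above.
 * All names and the example numbers are hypothetical. */
#include <stdio.h>

struct flash_geometry {
    unsigned channels;          /* channel level: independent buses */
    unsigned chips_per_channel; /* chips share the I/O bus, separate CE and R/B signals */
    unsigned dies_per_chip;     /* each die has its own internal R/B signal */
    unsigned planes_per_die;    /* each plane: thousands of blocks plus one data register */
    unsigned blocks_per_plane;
    unsigned pages_per_block;
    unsigned page_size;         /* bytes */
};

/* Units that can, in principle, operate concurrently. */
unsigned parallel_units(const struct flash_geometry *g)
{
    return g->channels * g->chips_per_channel * g->dies_per_chip * g->planes_per_die;
}

int main(void)
{
    struct flash_geometry g = { 32, 2, 2, 2, 2048, 256, 8192 };
    printf("parallel units: %u\n", parallel_units(&g));
    return 0;
}
```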

File Systems: Log-structured File System on an FTL
- Duplicate functions: space allocation, garbage collection.
- Semantic isolation: FTL abstraction, block I/O interface, log on log.
[Figure: log-structured file system (allocation, namespace, GC) over a READ/WRITE/TRIM interface and an FTL (mapping, allocation, GC, WL, ECC) across flash channels.]

Observation: F2FS vs. Ext4 (under heavy write traffic)
- YCSB: 10 million random read and update operations.
- 16 GB flash space + 24 GB write traffic.
- F2FS has poorer performance than Ext4 on SSDs.
[Figure: normalized throughput, GC efficiency (%), and number of recycled blocks for Ext4 and F2FS with 1, 4, 8, 16, and 32 channels.]

Problem: Internal Parallelism Conflicts
- Broken data grouping: grouped data are broken up and dispatched to different locations.
- Uncoordinated GC operations: the GC processes at the two levels run out of order with respect to each other.
- Ineffective I/O scheduling: erase operations block read/write operations, and writes delay reads.
The flash storage architecture blocks these optimizations.

Current Approaches (1): Log-structured File System
- Duplicate functions: space allocation, garbage collection.
- Semantic isolation: FTL abstraction, block I/O interface, log on log.
- Examples: NILFS2, F2FS [FAST 15], SFS [FAST 12], Multi-streamed SSD [HotStorage 14], DFS [FAST 10], Nameless Write [FAST 12].
[Figure: the same FS-over-FTL stack as on the previous slide.]

Current Approaches (2): Object-based File System [FAST 13]
- Moves the semantics from the FS into the FTL.
- Difficult to adopt due to dramatic interface changes.
- Internal parallelism remains under-explored.
[Figure: log-structured FS over an FTL vs. object-based FS over an object-based FTL with a GET/PUT interface.]

ParaFS Architecture
Goal: exploit the internal parallelism of flash devices while ensuring effective data grouping, efficient garbage collection, and consistent performance.
[Figure: conventional log-structured FS over an FTL vs. ParaFS (namespace, 2-D allocation, coordinated GC, parallelism-aware scheduler) over an S-FTL (block mapping, WL, ECC) with a READ/WRITE/ERASE interface.]

Outline
- Background and Motivation
- ParaFS Design
  - ParaFS Architecture
  - 2-D Data Allocation
  - Coordinated Garbage Collection
  - Parallelism-Aware Scheduling
- Evaluation
- Conclusion

ParaFS Architecture: Simplified FTL (S-FTL)
- Exposes the physical layout to the FS via ioctl: number of flash channels, flash block size, flash page size.
- Static block mapping: block-level, rarely modified.
- Data allocation functionality is removed.
- The GC process is simplified to an erase process; an Erase command replaces TRIM.
- WL and ECC remain in the S-FTL: functions that need hardware support.
[Figure: ParaFS over the S-FTL (block mapping, WL, ECC), connected by a READ/WRITE/ERASE interface plus ioctl.]
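As a rough illustration of this split, the following C sketch (hypothetical names; a sketch under stated assumptions, not the authors' implementation) shows an FTL that exposes its layout, keeps only a static block-level map, and turns FS-issued erase requests directly into flash block erases:

```c
/* Minimal S-FTL-style sketch; all names are hypothetical. */
#include <stdint.h>

struct sftl_layout {          /* handed to the FS via an ioctl */
    uint32_t nr_channels;
    uint32_t block_size;      /* bytes per flash block */
    uint32_t page_size;       /* bytes per flash page  */
};

struct sftl {
    struct sftl_layout layout;
    uint32_t *block_map;      /* static, block-level: logical block -> physical block */
};

/* Erase request from the FS (replaces TRIM): look up the physical block
 * through the static map and erase it directly -- no FTL-level GC. */
int sftl_erase(struct sftl *ftl, uint32_t logical_block,
               int (*flash_erase)(uint32_t physical_block))
{
    uint32_t phys = ftl->block_map[logical_block];
    return flash_erase(phys);
}
```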

ParaFS Architecture
ParaFS keeps allocation, garbage collection, and scheduling in the file system:
- 2-D Allocation: Region 0 ... Region N.
- Coordinated GC: Thread 0 ... Thread N.
- Parallelism-Aware Scheduler: one request queue per channel, holding read, write, and erase requests.
[Figure: ParaFS namespace and these three components over the S-FTL (block mapping, WL, ECC) and the flash channels.]

Problem #1: Grouping vs. Parallelism
The FS groups hot and cold data into separate segments (hot segment, cold segment), but the FTL stripes the pages across chips, dies, and planes, so the grouping is broken at the flash level.
[Figure: FS view of grouped data A-H vs. FTL placement of those pages across Chip 0/1, Die 0/1, Plane 0/1.]

1. 2-D Data Allocation: Current Schemes - Page Stripe
Segment pages are striped across the parallel blocks in page units. Exploiting internal parallelism with page-unit striping causes heavy garbage collection overhead.
[Figure: FS segments 0/1 (data A-H) striped page by page across parallel blocks 0-3; the recycle unit mixes pages from different segments.]

1. 2-D Data Allocation: Current Schemes - Block Stripe
Striping in block units does not fully exploit the internal parallelism of the device and performs poorly under small synchronous writes (e.g., a mail server workload).
[Figure: FS segments mapped to whole parallel blocks; grouping is preserved but parallelism is limited.]

1. 2-D Data Allocation: Current Schemes - Super Block
Pages are striped across a super block spanning the parallel blocks; the larger GC unit causes extra garbage collection overhead.
[Figure: FS segments striped across parallel blocks 0-3 with a multi-block recycle unit.]

1. 2-D Data Allocation: Aligned Data Layout in ParaFS
- Alignment: Region <-> Channel, Segment <-> Flash Block, Page <-> Flash Page.
- Channel-level dimension: write requests are striped across regions in page units.
- Hotness-level dimension: each region has multiple log-structured allocation heads (Alloc. Head 0 ... Alloc. Head m), one per hotness level, plus a free-segment list.
[Figure: Regions 0..N over Channels 0..N, each with allocation heads, allocated space, and free space.]
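A minimal sketch of the two-dimensional allocator, assuming hypothetical structure names and a three-level hotness classification (the slide does not fix the number of allocation heads):

```c
/* 2-D allocation sketch: dimension 1 picks a region (one per flash channel),
 * dimension 2 picks an allocation head by data hotness, so grouping survives
 * page-unit striping. Hypothetical names, not the authors' implementation. */
#include <stdint.h>

#define NR_HOTNESS_LEVELS 3    /* e.g. hot / warm / cold -- assumed value */

struct alloc_head {
    uint32_t segment;          /* current log-structured segment (maps to a flash block) */
    uint32_t next_page;        /* next free page inside that segment */
};

struct region {                /* one region per flash channel */
    struct alloc_head heads[NR_HOTNESS_LEVELS];
};

struct para_alloc {
    struct region *regions;
    uint32_t nr_regions;       /* == number of flash channels */
    uint32_t next_region;      /* round-robin cursor for page-unit striping */
};

/* Returns (region, segment, page) for one written page of the given hotness. */
void alloc_page_2d(struct para_alloc *a, unsigned hotness,
                   uint32_t *region, uint32_t *segment, uint32_t *page)
{
    *region = a->next_region;                          /* dimension 1: channel */
    a->next_region = (a->next_region + 1) % a->nr_regions;

    struct alloc_head *h = &a->regions[*region].heads[hotness]; /* dimension 2: hotness */
    *segment = h->segment;
    *page = h->next_page++;
    /* segment-full handling and free-segment list management omitted */
}
```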

1. 2-D Data Allocation: Scheme Comparison

Scheme           Stripe Granularity   Parallelism Level   GC Granularity    Grouping Maintained   GC Overhead
Page Stripe      Page                 High                Block             No                    High
Block Stripe     Block                Low                 Block             Yes                   Low
Super Block      Page                 High                Multiple Blocks   Yes                   Medium
2-D Allocation   Page                 High                Block             Yes                   Low

Problem #2: GC vs. Parallelism - Uncoordinated Garbage Collection
Case 1: the FTL starts GC early, before the FS.
[Figure: FS view of data A-H vs. FTL placement across Chip 0/1 when FTL-level GC runs first.]

Problem #2: GC vs. Parallelism - Uncoordinated Garbage Collection
Case 2: the FS starts GC early; its first and second victim choices are made without knowledge of the flash-block layout.
[Figure: the same data layout, with the FS's first and second victim choices marked.]

2. Coordinated Garbage Collection
Garbage collection at two levels brings overheads:
- The FTL only learns of page invalidation after Trim commands, due to the FS's no-overwrite pattern; pages that are already invalid in the FS may still be moved during FTL-level GC.
- The timing of GC at the two levels is uncoordinated: FTL GC can start before FS GC, blocking FS I/O and causing unexpected I/O latency.
- Wasted space: each level keeps its own over-provisioned space for GC efficiency, and each level keeps metadata to record page and block (segment) utilization and to remap.

2. Coordinated Garbage Collection: coordinating the GC process across the two levels
FS-level GC:
- Foreground or background GC (trigger condition, victim selection policy).
- Migrate the valid pages out of the victim segments.
- Checkpoint, to survive a crash during GC.
- Send an Erase request to the S-FTL.
FTL-level GC:
- Find the corresponding flash block through the static block mapping table.
- Erase the flash block directly.
Multi-threaded optimization (see the sketch below):
- One GC thread per region (i.e., per channel).
- One manager thread performs the checkpoint after GC.
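A compact sketch of this coordinated flow, with hypothetical helper functions standing in for the real FS internals:

```c
/* Coordinated GC sketch. The helper prototypes below are assumptions about
 * what exists elsewhere in the FS and the S-FTL; none of this is the authors' code. */
void migrate_valid_pages(unsigned region, unsigned segment);
void fs_checkpoint(void);
void send_erase_request(unsigned region, unsigned segment);
unsigned static_block_map_lookup(unsigned logical_block);
void flash_erase_block(unsigned physical_block);

/* FS-level GC for one victim segment in one region (one GC thread per region). */
void fs_level_gc(unsigned region, unsigned victim_segment)
{
    migrate_valid_pages(region, victim_segment);  /* copy live pages to a new segment */
    fs_checkpoint();                              /* survive a crash during GC */
    send_erase_request(region, victim_segment);   /* forwarded to the S-FTL */
}

/* S-FTL side: an erase request maps straight to a flash block erase. */
void ftl_handle_erase(unsigned logical_block)
{
    unsigned phys = static_block_map_lookup(logical_block);
    flash_erase_block(phys);                      /* no valid-page copying in the FTL */
}
```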

3. Parallelism-Aware Scheduling: Request Dispatching Phase
A new write request is dispatched to the least busy channel. The busyness of a channel is estimated from the pending requests in its queue:

W_channel = Sum( W_read x Size_read + W_write x Size_write )

i.e., a weighted sum of the sizes of the pending read and write requests.
[Figure: a new write request queue and per-channel request queues 0-3 holding read and write requests.]
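A small sketch of the dispatching rule, assuming the channel weight is the weighted sum of the pending read and write request sizes in its queue (the weight constants here are placeholders, not the paper's values):

```c
/* Least-busy-channel dispatch sketch; names and constants are hypothetical. */
#include <stddef.h>

struct chan_queue {
    size_t pending_read_bytes;
    size_t pending_write_bytes;
};

#define W_READ  1   /* assumed relative cost of pending read bytes  */
#define W_WRITE 2   /* assumed relative cost of pending write bytes */

size_t channel_weight(const struct chan_queue *q)
{
    return W_READ * q->pending_read_bytes + W_WRITE * q->pending_write_bytes;
}

/* Pick the least busy channel for a new write request. */
unsigned dispatch_write(const struct chan_queue *queues, unsigned nr_channels)
{
    unsigned best = 0;
    for (unsigned i = 1; i < nr_channels; i++)
        if (channel_weight(&queues[i]) < channel_weight(&queues[best]))
            best = i;
    return best;
}
```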

3. Parallelism-Aware Scheduling: Request Scheduling Phase
- Separate time slices for read request scheduling and for write/erase request scheduling.
- Whether to schedule a write or an erase on a channel depends on the channel's space utilization and the number of concurrently erasing channels:

e = a x f + b x N_e

where f is the channel's space utilization and N_e is the normalized number of concurrently erasing channels. In the implementation a = 2 and b = 1 (i.e., e = 2f + N_e), and e is compared against a threshold of 1.

3. Parallelism-Aware Scheduling: Request Scheduling Phase - Example
With space utilizations f = 75%, 25%, 30%, 30% on Channels 0-3:
- e_0 = 2 x 75% + 0 > 1
- e_1 = 2 x 25% + 0 < 1
- e_2 = 2 x 30% + 25% < 1
- e_3 = 2 x 30% + 50% > 1
[Figure: free/used space bars for Channels 0-3 with pending write and erase requests.]
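The same decision metric can be checked against the example's numbers; the following sketch (assuming, per the slides, a = 2, b = 1, and a threshold of 1) reproduces the four values:

```c
/* Worked example of the scheduling metric e = 2*f + N_e from the slides;
 * the interpretation of the threshold follows the slide's > 1 / < 1 annotations. */
#include <stdio.h>

double erase_metric(double f, double n_e)
{
    return 2.0 * f + 1.0 * n_e;   /* a = 2, b = 1 as in the implementation */
}

int main(void)
{
    double f[4]   = { 0.75, 0.25, 0.30, 0.30 };   /* space utilization per channel */
    double n_e[4] = { 0.00, 0.00, 0.25, 0.50 };   /* concurrently erasing channels (normalized) */

    for (int i = 0; i < 4; i++) {
        double e = erase_metric(f[i], n_e[i]);
        printf("Channel %d: e = %.2f (%s 1)\n", i, e, e > 1.0 ? ">" : "<");
    }
    /* Matches the slide: e0 = 1.50 > 1, e1 = 0.50 < 1, e2 = 0.85 < 1, e3 = 1.10 > 1 */
    return 0;
}
```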

Outline
- Background and Motivation
- ParaFS Design
- Evaluation
- Conclusion

Evaluation: Setup
- ParaFS is implemented on Linux kernel 2.6.32, based on F2FS.
- Compared file systems: Ext4 (in-place update), BtrFS (CoW), back-ported F2FS (log-structured update), and F2FS_SB (back-ported F2FS with a large-segment configuration).
- Testbed FTLs: PFTL and S-FTL.
- Workloads:

Workload     Pattern                                   R:W     FSYNC   I/O Size
Fileserver   random read and write of files            33/66   N       1 MB
Postmark     create, delete, read, and append files    20/80   Y       512 B
MobiBench    random record updates in SQLite           1/99    Y       4 KB
YCSB         read and update records in MySQL          50/50   Y       1 KB

Evaluation: Light Write Traffic
Throughput is normalized to F2FS in the 8-channel case, for BtrFS, Ext4, F2FS, F2FS_SB, and ParaFS on Fileserver, Postmark, MobiBench, and YCSB.
- ParaFS outperforms the other file systems in all cases.
- It improves performance by up to 13% (Postmark, 32 channels).
[Figure: normalized throughput bars for the 8-channel and 32-channel configurations.]

Evaluation: Heavy Write Traffic
- ParaFS achieves the best throughput among all evaluated file systems.
- It outperforms Ext4 by 1.0x-2.5x, F2FS by 1.6x-3.1x, and F2FS_SB by 1.2x-1.5x.
[Figure: throughput on Fileserver, Postmark, MobiBench, and YCSB with 1, 4, 8, 16, and 32 channels.]

Evaluation: Endurance
ParaFS decreases the write traffic to the flash memory by 31.7% to 58.1% compared with F2FS.
[Figure: number of recycled blocks, GC efficiency, write traffic from the FS, and write traffic from the FTL, with 1, 4, 8, 16, and 32 channels.]

Evaluation: Performance Consistency
Two optimizations are evaluated separately:
- MGC: multi-threaded GC.
- PS: parallelism-aware scheduling.
Results:
- Multi-threaded GC improves throughput by 18.5% (MGC vs. Base).
- Parallelism-aware scheduling gives a 20% improvement in the first stage (PS vs. Base) and much more consistent performance in the GC stage (ParaFS vs. MGC).

Conclusion
- Internal parallelism has not been leveraged at the FS level. The mechanisms affected are data grouping, garbage collection, and I/O scheduling; the architectural cause is semantic isolation and redundant functions.
- ParaFS bridges the semantic gap and exploits internal parallelism at the FS level:
  (1) 2-D allocation: page-unit striping together with data grouping.
  (2) Coordinated GC: improves GC efficiency and speed.
  (3) Parallelism-aware scheduling: more consistent performance.
- ParaFS improves performance by up to 3.1x and reduces write traffic by up to 58.1% compared to F2FS.

Thanks
ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices
Jiacheng Zhang, Jiwu Shu, Youyou Lu
luyouyou@tsinghua.edu.cn
http://storage.cs.tsinghua.edu.cn/~lu