ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Devices
Jiacheng Zhang, Jiwu Shu, Youyou Lu
Tsinghua University
Outline: Background and Motivation, ParaFS Design, Evaluation, Conclusion
Solid State Drives: Internal Parallelism
Internal parallelism exists at four levels: channel, chip, die, and plane. Chips in one package share the same 8/16-bit I/O bus but have separate chip-enable (CE) and ready/busy (R/B) control signals. Each die has its own internal R/B signal. Each plane contains thousands of flash blocks and one data register. Internal parallelism brings high bandwidth.
[Figure: SSD internals, from the host interconnect and FTL down to channels, chips, dies, planes, blocks, and per-plane registers.]
File Systems: Log-structured File System
Duplicate functions: space allocation and garbage collection exist in both the FS and the FTL. Semantic isolation: the FTL abstraction and block I/O interface hide the device, leading to the log-on-log problem.
[Figure: a log-structured FS (namespace, allocation, GC) issuing READ/WRITE/TRIM to an FTL (mapping, allocation, GC, WL, ECC) over channels 0..N.]
Observation: F2FS vs. Ext4 (under heavy write traffic)
YCSB: 10 million random read and update operations; 16 GB flash space + 24 GB write traffic.
F2FS has poorer performance than Ext4 on SSDs.
[Figure: normalized throughput, GC efficiency (%), and number of recycled blocks for Ext4 and F2FS with 1, 4, 8, 16, and 32 channels.]
Problem: Internal Parallelism Conflicts
Broken data grouping: grouped data are broken up and dispatched to different locations. Uncoordinated GC operations: the GC processes at the two levels run out of order. Ineffective I/O scheduling: erase operations always block reads and writes, and writes always delay reads. The flash storage architecture blocks these optimizations.
Current Approaches (1): Log-structured File System
Duplicate functions: space allocation and garbage collection. Semantic isolation: FTL abstraction, block I/O interface, log on log.
[Figure: the same log-structured FS over FTL stack.]
NILFS2, F2FS [FAST 15]; SFS [FAST 12], Multi-streamed SSD [HotStorage 14]; DFS [FAST 10], Nameless Write [FAST 12]
Current Approaches (2): Object-based File System [FAST 13]
Moves the semantics from the FS to the FTL. Difficult to adopt due to the dramatic changes required, and internal parallelism remains under-explored.
[Figure: a log-structured FS over a conventional FTL vs. an object-based FS issuing GET/PUT to an object-based FTL.]
ParaFS Architecture
Goal: exploit the internal parallelism of flash devices while ensuring effective data grouping, efficient garbage collection, and consistent performance.
[Figure: a conventional log-structured FS over an FTL vs. ParaFS (namespace, 2-D allocation, coordinated GC, parallelism-aware scheduler) issuing READ/WRITE/ERASE to an S-FTL (block mapping, WL, ECC) over channels 0..N.]
Outline: Background and Motivation; ParaFS Design (ParaFS Architecture, 2-D Data Allocation, Coordinated Garbage Collection, Parallelism-Aware Scheduling); Evaluation; Conclusion
ParaFS Architecture: Simplified FTL (S-FTL)
Exposes the physical layout to the FS via ioctl: the number of flash channels, the flash block size, and the flash page size. Static block mapping: block-level and rarely modified. The data allocation functionality is removed, and the GC process is simplified to a pure erase process: the Trim command in the interface is replaced by an Erase command. WL and ECC stay in the S-FTL, as these functions need hardware support.
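The S-FTL's role can be illustrated with a minimal sketch (assumed structures, not the authors' code): a static block-granularity mapping built once, geometry exposed to the FS, and FS erase requests translated directly into flash-block erases with no FTL-side GC. All geometry values below are made up for illustration.

```python
class SFTL:
    def __init__(self, num_channels=4, blocks_per_channel=8):
        self.num_channels = num_channels
        self.blocks_per_channel = blocks_per_channel
        # Static block mapping: logical block -> (channel, physical block).
        # Built once and rarely modified thereafter.
        self.block_map = {
            lb: (lb % num_channels, lb // num_channels)
            for lb in range(num_channels * blocks_per_channel)
        }
        self.erased = []   # record of issued flash-block erases

    def geometry(self):
        # What ParaFS queries via ioctl: channel count, block size,
        # page size (sizes omitted in this sketch).
        return {"channels": self.num_channels,
                "blocks_per_channel": self.blocks_per_channel}

    def erase(self, logical_block):
        # No victim selection, no copying: the FS decides what to erase;
        # the S-FTL just erases the statically mapped flash block.
        channel, pblock = self.block_map[logical_block]
        self.erased.append((channel, pblock))
        return channel, pblock
```

Because the mapping is static and block-granular, the mapping table is tiny compared with a page-mapping FTL, and an FS-level erase always lands on a known channel.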
ParaFS Architecture
ParaFS keeps allocation, garbage collection, and scheduling in the FS: 2-D allocation (Region 0..N), coordinated GC (Thread 0..N), and a parallelism-aware scheduler (one request queue per channel, holding read, write, and erase requests), on top of an S-FTL with block mapping, WL, and ECC.
[Figure: ParaFS components above the READ/WRITE/ERASE interface to the S-FTL.]
Problem #1: Grouping vs. Parallelism
Hot/cold data grouping: the FS sees a hot segment (A, B, C, D) and a cold segment (E, F, G, H), but the FTL stripes these pages across planes, dies, and chips, mixing hot and cold pages in the same flash blocks.
[Figure: FS vision of grouped segments vs. FTL placement across Chip 0/1, Die 0/1, Plane 0/1.]
1. 2-D Data Allocation: Current Data Allocation Schemes — Page Stripe
Exploiting internal parallelism with page-unit striping causes heavy garbage collection overhead.
[Figure: pages A..H striped page-by-page across parallel blocks 0..3; the recycle unit is one flash block.]
1. 2-D Data Allocation: Current Data Allocation Schemes — Block Stripe
Block-unit striping does not fully exploit the internal parallelism of the device and performs badly under small synchronous writes (e.g., a mail server workload).
[Figure: pages striped block-by-block across parallel blocks 0..3.]
1. 2-D Data Allocation: Current Data Allocation Schemes — Super Block
The larger GC unit (multiple flash blocks) causes extra garbage collection overhead.
[Figure: pages striped within a super block spanning parallel blocks 0..3.]
1. 2-D Data Allocation: Aligned Data Layout in ParaFS
Region → channel, segment → flash block, page → flash page. In the channel-level dimension, write requests are striped page-by-page across regions; in the hotness-level dimension, each region keeps multiple allocation heads (one per hotness level), each appending to a log-structured segment, plus a free segment list.
[Figure: regions 0..N mapped onto channels 0..N, each with allocation heads 0..m over allocated space and a free segment list.]
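The two dimensions above can be sketched in a few lines (an illustrative model with assumed structures, not ParaFS source): dimension 1 picks a region (one per flash channel), dimension 2 picks an allocation head by data hotness, so pages are striped across channels while each segment stays hotness-pure.

```python
class Region:
    def __init__(self, region_id, num_hotness_levels, segs):
        self.id = region_id
        self.free_segs = list(segs)               # free segment list
        self.heads = [None] * num_hotness_levels  # one log head per hotness
        self.fill = {}                            # segment -> pages written

    def alloc_page(self, hotness, seg_pages=4):
        # Hotness-level dimension: append to this hotness level's head,
        # opening a fresh segment from the free list when the head is full.
        seg = self.heads[hotness]
        if seg is None or self.fill[seg] == seg_pages:
            seg = self.free_segs.pop(0)
            self.heads[hotness] = seg
            self.fill[seg] = 0
        self.fill[seg] += 1
        return seg, self.fill[seg] - 1            # (segment, page slot)

def allocate(regions, next_channel, hotness):
    # Channel-level dimension: stripe pages across regions/channels.
    region = regions[next_channel % len(regions)]
    return region.alloc_page(hotness)
```

Since a segment equals one flash block and never mixes hotness levels, grouping survives all the way down to the flash block, which page-stripe FTLs destroy.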
1. 2-D Data Allocation: Data Allocation Schemes Comparison

  Scheme          Stripe Granularity  Parallelism Level  GC Granularity   Grouping Maintained  GC Overhead
  Page Stripe     Page                High               Block            No                   High
  Block Stripe    Block               Low                Block            Yes                  Low
  Super Block     Page                High               Multiple Blocks  Yes                  Medium
  2-D Allocation  Page                High               Block            Yes                  Low
Problem #2: GC vs. Parallelism — Uncoordinated Garbage Collection
FTL GC may start early, before FS GC, and migrate flash pages without knowing which of them the FS still considers valid.
[Figure: FTL GC early — FS pages A..H mapped onto blocks across Chip 0/1.]
Problem #2: GC vs. Parallelism — Uncoordinated Garbage Collection
FS GC may start early, choosing between victim segments (first choice vs. second choice) without knowledge of the FTL's physical layout.
[Figure: FS GC early — first and second victim choices mapped across Chip 0/1.]
2. Coordinated Garbage Collection
Garbage collection at two levels brings overheads. The FTL only learns of page invalidations after Trim commands, due to the FS's no-overwrite pattern, so pages that are already invalid in the FS may still be moved during FTL-level GC. The timing of GC at the two levels is also uncoordinated: FTL GC may start before FS GC, blocking FS I/O and causing unexpected latency. Space is wasted as well: each level keeps its own over-provisioning space for GC efficiency, and each level keeps metadata to track page and block (segment) utilization and to remap.
2. Coordinated Garbage Collection: coordinated GC process at the two levels
FS-level GC: foreground or background GC (with their trigger conditions and victim selection policies); migrate the valid pages out of the victim segments; checkpoint in case of crash; send an Erase request to the S-FTL. FTL-level GC: find the corresponding flash block via the static block mapping table and erase it directly. Multi-threaded optimization: one GC thread per region (channel), plus one manager thread that performs the checkpoint after GC.
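The FS-level side of these steps can be sketched as follows (structure assumed from the steps above, not taken from the ParaFS code): one GC worker per region picks a victim segment, migrates its valid pages, checkpoints, and issues an Erase; the S-FTL then erases the statically mapped flash block with no copying of its own.

```python
def gc_region(segments, erase_fn, checkpoint_fn):
    """One FS-level GC pass over a region.

    segments: dict seg_id -> set of valid page ids in that segment.
    erase_fn: sends an Erase request for a segment to the S-FTL.
    checkpoint_fn: persists FS state for crash consistency.
    """
    # Greedy victim selection: the segment with the fewest valid pages
    # (one plausible policy; the slides leave the policy open).
    victim = min(segments, key=lambda s: len(segments[s]))
    migrated = sorted(segments.pop(victim))   # migrate valid pages out
    checkpoint_fn()                           # checkpoint before erase
    erase_fn(victim)                          # S-FTL erases the block
    return victim, migrated
```

Because the FS knows page validity directly, no valid page is ever copied twice (once by FS GC and again by FTL GC), which is exactly the overlap the coordination removes.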
3. Parallelism-Aware Scheduling: Request Dispatching Phase
A new write request is dispatched to the least busy channel, where the busyness of a channel is estimated from its pending requests:

  W_channel = f(W_read × Size_read, W_write × Size_write)

[Figure: a new write request choosing among request queues 0..3 holding pending reads and writes.]
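A minimal sketch of this dispatching rule (the weight function and the W_READ/W_WRITE values below are illustrative assumptions, not taken from the paper): sum the weighted sizes of each channel's pending reads and writes, and send the new write to the channel with the smallest total.

```python
W_READ, W_WRITE = 1, 2   # assumed: a pending write costs more than a read

def channel_weight(queue):
    """queue: list of (op, size) pending requests for one channel."""
    return sum((W_READ if op == "read" else W_WRITE) * size
               for op, size in queue)

def dispatch_write(queues):
    # Pick the channel whose request queue has the smallest total weight.
    return min(range(len(queues)), key=lambda c: channel_weight(queues[c]))
```

Usage: with queues `[[("read", 4), ("write", 4)], [("read", 2)], [("write", 8)]]`, the weights are 12, 2, and 16, so the new write goes to channel 1.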
3. Parallelism-Aware Scheduling
Request dispatching phase: select the least busy channel for each write request. Request scheduling phase: time slices alternate between read scheduling and write/erase scheduling. Whether to schedule a write or an erase in a channel depends on its space utilization and the number of concurrently erasing channels:

  e = a × f + b × N_e

where f is the channel's free-space ratio and N_e the fraction of channels currently erasing. The implementation uses a = 2 and b = 1, comparing e against the threshold 1: an erase is scheduled when e < 1.
3. Parallelism-Aware Scheduling: Request Scheduling Phase Example
With free-space ratios f = 75%, 25%, 30%, 30% for channels 0–3:

  e_0 = 2 × 75% + 0   = 1.50 > 1  → schedule write
  e_1 = 2 × 25% + 0   = 0.50 < 1  → schedule erase
  e_2 = 2 × 30% + 25% = 0.85 < 1  → schedule erase
  e_3 = 2 × 30% + 50% = 1.10 > 1  → schedule write

[Figure: per-channel free/used space bars with the resulting Write/Erase decisions.]
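The example can be replayed with a tiny helper (a sketch of this reading of the slide, not the authors' code): with a = 2 and b = 1, an erase is scheduled in a channel when e = a·f + b·N_e falls below 1, so low free space urges erasing while many concurrent erases defer it.

```python
def schedule_erase(f, n_e, a=2.0, b=1.0, threshold=1.0):
    """f: channel free-space ratio; n_e: fraction of channels erasing.

    Returns (e, do_erase): do_erase is True when the erase should be
    scheduled now instead of a write.
    """
    e = a * f + b * n_e
    return e, e < threshold

# Channels 0..3 from the example: f = 75%, 25%, 30%, 30%.
decisions = [schedule_erase(0.75, 0.00),   # e = 1.50 -> write
             schedule_erase(0.25, 0.00),   # e = 0.50 -> erase
             schedule_erase(0.30, 0.25),   # e = 0.85 -> erase
             schedule_erase(0.30, 0.50)]   # e = 1.10 -> write
```

The n_e term is what bounds the number of channels erasing at once, which keeps erases from blocking reads and writes across the whole device simultaneously.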
Outline: Background and Motivation, ParaFS Design, Evaluation, Conclusion
Evaluation
ParaFS is implemented on Linux kernel 2.6.32, based on F2FS. Compared systems: Ext4 (in-place update), BtrFS (CoW), back-ported F2FS (log-structured update), and F2FS_SB (back-ported F2FS with a large segment configuration). Testbed FTLs: PFTL and S-FTL.

Workloads:

  Workload    Pattern                                R:W    FSYNC  I/O Size
  Fileserver  random read and write files            33/66  N      1 MB
  Postmark    create, delete, read and append files  20/80  Y      512 B
  MobiBench   random update records in SQLite        1/99   Y      4 KB
  YCSB        read and update records in MySQL       50/50  Y      1 KB
Evaluation: Light Write Traffic
Throughput is normalized to F2FS in the 8-channel case. ParaFS outperforms the other file systems in all cases, improving performance by up to 13% (Postmark, 32 channels).
[Figure: normalized throughput of BtrFS, Ext4, F2FS, F2FS_SB, and ParaFS on Fileserver, Postmark, MobiBench, and YCSB with 8 and 32 channels.]
Evaluation: Heavy Write Traffic
ParaFS achieves the best throughput among all evaluated file systems, outperforming Ext4 by 1.0x–2.5x, F2FS by 1.6x–3.1x, and F2FS_SB by 1.2x–1.5x.
[Figure: throughput on Fileserver, Postmark, MobiBench, and YCSB with 1, 4, 8, 16, and 32 channels.]
Evaluation: Endurance
ParaFS decreases the write traffic to the flash memory by 31.7%–58.1% compared with F2FS.
[Figure: recycled block count, GC efficiency, and write traffic from the FS and from the FTL with 1, 4, 8, 16, and 32 channels.]
Evaluation: Performance Consistency
Two optimizations are evaluated: MGC (multi-threaded GC) and PS (parallelism-aware scheduling). Multi-threaded GC improves performance by 18.5% (MGC vs. Base). Parallelism-aware scheduling brings a 20% improvement in the first stage (PS vs. Base) and much more consistent performance in the GC stage (ParaFS vs. MGC).
Conclusion
Internal parallelism has not been leveraged at the FS level, due both to the mechanisms (data grouping, garbage collection, I/O scheduling) and to the architecture (semantic isolation and redundant functions). ParaFS bridges the semantic gap and exploits internal parallelism at the FS level: (1) 2-D allocation combines page-unit striping with data grouping; (2) coordinated GC improves GC efficiency and speed; (3) parallelism-aware scheduling delivers more consistent performance. ParaFS improves performance by up to 3.1x and reduces write traffic by 58.1% compared to F2FS.
Thanks
ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Devices
Jiacheng Zhang, Jiwu Shu, Youyou Lu
luyouyou@tsinghua.edu.cn
http://storage.cs.tsinghua.edu.cn/~lu