Strata: A Cross Media File System
Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson
Let's build a fast server
NoSQL store, database, file server, mail server
Requirements:
- Small updates (1 KB) dominate
- Dataset scales up to 10 TB
- Updates must be crash consistent
Storage diversification

            Latency   $/GB
 DRAM       100 ns    8.6
 NVM (soon) 300 ns    4.0
 SSD        10 us     0.25
 HDD        10 ms     0.02

Devices higher in the table offer better performance; devices lower offer higher capacity at lower cost.
- NVM is byte-addressable: cache-line granularity IO
- SSDs have large erasure blocks that must be written sequentially; random writes see a 5-6x slowdown due to GC [FAST 15]
A fast server on today's file system
Requirements: small updates (1 KB) dominate; dataset scales up to 10 TB; updates must be crash consistent.
Setup: application on a kernel file system (NOVA [FAST 16, SOSP 17]) over NVM.
Small, random IO is slow!
[Figure: latency of a 1 KB write (~6 us total); 91% of it is kernel code rather than the write to the device]
NVM is so fast that the kernel is the bottleneck.
A fast server on today's file system
Need huge capacity, but NVM alone is too expensive ($40K for 10 TB).
For low-cost capacity with high performance, must leverage multiple device types.
A fast server on today's file system
Setup: application on a kernel file system with block-level caching over NVM, SSD, and HDD.
Block-level caching manages data in blocks, but NVM is byte-addressable: an extra level of indirection.
[Figure: 1 KB write latency (us), NOVA vs. block-level caching; lower is better]
Block-level caching IO is too slow.
A fast server on today's file system
[Figure: crash vulnerabilities (0-10) found in SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, and Git; Pillai et al., OSDI 2014]
Applications struggle for crash consistency.
Problems in today's file systems
- Kernel mediates every operation: NVM is so fast that the kernel is the bottleneck
- Tied to a single type of device: for low-cost capacity with high performance, must leverage multiple device types (NVM (soon), SSD, HDD)
- Aggressive caching in DRAM, writing to the device only when you must (fsync): applications struggle for crash consistency
Strata: a cross media file system
- Performance, especially for small, random IO: fast user-level device access
- Low-cost capacity: leverage NVM, SSD & HDD; transparent data migration across different storage media; efficiently handle device IO properties
- Simplicity: intuitive crash consistency model; in-order, synchronous IO; no fsync() required
Strata's main design principle
- LibFS: log operations to NVM at user level
  Performance: kernel bypass, but private
  Simplicity: intuitive crash consistency
- KernelFS: digest and migrate data in kernel
  Apply log operations to shared data
  Coordinate multi-process accesses
Outline
- LibFS: log operations to NVM at user level (fast user-level access; in-order, synchronous IO)
- KernelFS: digest and migrate data in kernel (asynchronous digest; transparent data migration; shared file access)
- Evaluation
Log operations to NVM at user level
An unmodified application calls the POSIX API; LibFS appends file operations (data & metadata: creat, write, rename) to a private operation log in NVM, bypassing the kernel.
- Fast writes: direct access to fast NVM; sequential appends; cache-line granularity; blind writes
- Crash consistency: on crash, the kernel replays the log
Intuitive crash consistency
Synchronous IO: when each system call returns,
- data/metadata is durable
- updates are in order
- writes are atomic (up to the log size)
fsync() is a no-op.
Fast synchronous IO comes from NVM and kernel bypass.
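The semantics above can be sketched in a few lines. This is a hypothetical, simplified illustration, not Strata's actual API: the OpLog/LibFS names and the list standing in for the NVM log region are assumptions for exposition.

```python
# Sketch of synchronous-IO semantics: each file operation is appended to a
# per-process operation log and is durable (here: stored) before the call
# returns, in call order, so fsync() has nothing left to do.

class OpLog:
    def __init__(self):
        self.entries = []              # stands in for the NVM log region

    def persist(self, entry):
        # On real NVM this would be cache-line flushes plus a store fence.
        self.entries.append(entry)

    def append(self, op, path, data=b""):
        # In-order: entries are persisted in the order calls are made.
        self.persist((op, path, data))

class LibFS:
    def __init__(self):
        self.log = OpLog()

    def creat(self, path):
        self.log.append("creat", path)

    def write(self, path, data):
        self.log.append("write", path, data)   # durable on return

    def fsync(self, path):
        pass   # no-op: every preceding call is already durable

fs = LibFS()
fs.creat("/db/file")
fs.write("/db/file", b"v1")
fs.fsync("/db/file")                           # nothing left to flush
print([op for op, _, _ in fs.log.entries])     # ['creat', 'write']
```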
Outline
- LibFS: log operations to NVM at user level (fast user-level access; in-order, synchronous IO)
- KernelFS: digest and migrate data in kernel (asynchronous digest; transparent data migration; shared file access)
- Evaluation
Digest data in kernel
The application writes through LibFS to its private operation log in NVM; KernelFS digests the log into the NVM shared area.
- Visibility: make the private log visible to other applications
- Data layout: turn the write-optimized log into a read-optimized format (extent tree)
- Large, batched IO
- Coalesce the log
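A minimal sketch of the digest step, under stated assumptions: a per-file dict of offset-to-bytes stands in for Strata's extent trees, and the entry format is invented for illustration.

```python
# Hypothetical digest sketch: KernelFS replays private-log entries into a
# shared, read-optimized layout in one batched pass.

def digest(log):
    shared = {}                          # path -> {offset: bytes}
    for entry in log:
        op, path = entry[0], entry[1]
        if op == "creat":
            shared.setdefault(path, {})
        elif op == "write":
            _, _, off, data = entry
            shared.setdefault(path, {})[off] = data
        elif op == "unlink":
            shared.pop(path, None)
    return shared

log = [("creat", "/f"), ("write", "/f", 0, b"aa"), ("write", "/f", 2, b"bb")]
print(digest(log))   # {'/f': {0: b'aa', 2: b'bb'}}
```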
Digest optimization: log coalescing
SQLite and mail servers make updates crash consistent with write-ahead logging:
  create journal file -> write data to journal file -> write data to database file -> delete journal file
Digest eliminates unneeded work by removing temporary durable writes; only "write data to database file" remains.
Throughput optimization: log coalescing saves IO while digesting.
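The optimization above can be sketched as follows. This is a simplified illustration (it ignores, e.g., deleting and then recreating a file), and the entry format is an assumption, not Strata's log layout.

```python
# Log coalescing sketch: before digesting, drop every operation on a file
# that is deleted later in the same log (e.g. SQLite's journal file), so
# only the surviving entries cost device IO.

def coalesce(log):
    deleted = {path for op, path, *_ in log if op == "unlink"}
    out = []
    for op, path, *rest in log:
        if path in deleted:
            continue          # journal create/write/delete: all elided
        out.append((op, path, *rest))
    return out

wal_update = [
    ("creat", "journal"),
    ("write", "journal", b"new row"),
    ("write", "db", b"new row"),
    ("unlink", "journal"),
]
print(coalesce(wal_update))   # [('write', 'db', b'new row')]
```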
Digest and migrate data in kernel
- Low-cost capacity: KernelFS migrates cold data from the NVM shared area to lower layers (SSD shared area, then HDD shared area)
- Handle device IO properties: migrate 1 GB blocks to avoid SSD garbage collection overhead
- Resembles a log-structured merge (LSM) tree
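A sketch of the tiered migration idea, assuming an LRU-style policy over coarse blocks; the Tier class, tiny capacities, and eviction order are illustrative stand-ins, not Strata's actual mechanism.

```python
# When a layer fills, its coldest blocks move down one layer in large
# sequential units (Strata migrates ~1 GB blocks to sidestep SSD GC).
from collections import OrderedDict

class Tier:
    def __init__(self, name, capacity, lower=None):
        self.name, self.capacity, self.lower = name, capacity, lower
        self.blocks = OrderedDict()      # iteration order: cold -> hot

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)           # mark hot
        while len(self.blocks) > self.capacity:
            cold_id, cold = self.blocks.popitem(last=False)
            if self.lower:                          # migrate cold data down
                self.lower.put(cold_id, cold)

hdd = Tier("hdd", capacity=1000)
ssd = Tier("ssd", capacity=4, lower=hdd)
nvm = Tier("nvm", capacity=2, lower=ssd)

for i in range(4):
    nvm.put(i, b"...")
print(sorted(nvm.blocks), sorted(ssd.blocks))   # [2, 3] [0, 1]
```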
Read: hierarchical search
Search order: (1) private operation log, (2) NVM shared area, (3) SSD shared area, (4) HDD shared area.
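The search order above can be sketched as a first-hit lookup; the dict-per-layer representation is a simplifying assumption.

```python
# Hierarchical read sketch: check the private log first, then the NVM,
# SSD, and HDD shared areas in order; the first hit wins, so freshly
# logged updates shadow older copies in the shared areas.

def read(path, layers):
    """layers: list of dicts ordered private log -> NVM -> SSD -> HDD."""
    for layer in layers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

private_log = {"/a": b"new"}
nvm_shared  = {"/a": b"old", "/b": b"b"}
ssd_shared  = {"/c": b"c"}
hdd_shared  = {}

layers = [private_log, nvm_shared, ssd_shared, hdd_shared]
print(read("/a", layers))   # b'new' (the log shadows the shared copy)
print(read("/c", layers))   # b'c'
```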
Shared file access
Leases grant access rights to applications [SOSP 89]:
- Required for files and directories
- Function like locks, but revocable
- Exclusive writer, shared readers
On revocation, LibFS digests the leased data: private data is made public before the lease is lost. Leases serialize concurrent updates.
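A toy sketch of the revocation rule above, assuming a single exclusive-writer lease per file; the LeaseManager/Proc names and the synchronous revocation call are illustrative, not Strata's implementation.

```python
# Revoking a writer's lease first forces the holder to digest its private
# log, so other processes see the data before they can write it.

class LeaseManager:
    def __init__(self):
        self.writer = {}              # path -> current exclusive writer

    def acquire_write(self, path, proc):
        holder = self.writer.get(path)
        if holder and holder is not proc:
            holder.digest(path)       # private updates made public first
            del self.writer[path]
        self.writer[path] = proc

class Proc:
    def __init__(self, name, shared):
        self.name, self.shared, self.log = name, shared, {}

    def write(self, path, data):
        self.log[path] = data         # private until digested

    def digest(self, path):
        if path in self.log:
            self.shared[path] = self.log.pop(path)

shared = {}
leases = LeaseManager()
p1, p2 = Proc("p1", shared), Proc("p2", shared)

leases.acquire_write("/f", p1)
p1.write("/f", b"v1")                 # visible only to p1
leases.acquire_write("/f", p2)        # revokes p1, forcing a digest
print(shared["/f"])                   # b'v1'
```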
Outline
- LibFS: log operations to NVM at user level (fast user-level access; in-order, synchronous IO)
- KernelFS: digest and migrate data in kernel (asynchronous digest; transparent data migration; shared file access)
- Evaluation
Experimental setup
- 2x Intel Xeon E5-2640 CPU, 64 GB DRAM
- 400 GB NVMe SSD, 1 TB HDD
- Ubuntu 16.04 LTS, Linux kernel 4.8.12
- Emulated NVM: 40 GB of DRAM, with latency & throughput throttled in software following a performance model [Y. Zhang et al., MSST 2015]
Evaluation questions
- Latency: does Strata efficiently support small, random writes? Does asynchronous digest impact latency?
- Throughput: Strata writes data twice (logging and digesting); can it sustain high throughput? How well does Strata perform when managing data across storage layers?
Related work
- NVM file systems: PMFS [EuroSys 14] (in-place updates), NOVA [FAST 16] (log-structured), EXT4-DAX (NVM support for EXT4)
- SSD file systems: F2FS [FAST 15] (log-structured)
Microbenchmark: write latency
Strata logs to NVM; compared against the NVM kernel file systems PMFS, NOVA, and EXT4-DAX.
- Strata and NOVA: in-order, synchronous IO with atomic writes
- PMFS and EXT4-DAX: no atomic writes
[Figure: write latency (us) for 128 B, 1 KB, 4 KB, and 16 KB IO sizes across the four file systems; lower is better]
Strata: average latency 26% better, tail latency 43% better.
Latency: persistent RPC
Persistent RPC is the foundation of most servers: persist RPC data before sending the ACK to the client. RPC over RDMA on a 40 Gb/s InfiniBand NIC.
[Figure: latency (us) vs. RPC size (1 KB, 4 KB, 64 KB) for Strata, NOVA, PMFS, EXT4-DAX, and no-persist; lower is better]
For small IO (1 KB), Strata is 25% slower than no-persist, 35% faster than PMFS, and 7x faster than EXT4-DAX.
Latency: LevelDB
LevelDB on NVM: 16 B keys, 1 KB values, 300,000 objects. The workload causes asynchronous digests.
[Figure: latency (us) for write sync., write seq., write rand., overwrite, and read rand. across Strata, PMFS, NOVA, and EXT4-DAX; lower is better]
- Fast user-level logging: random writes 25% better than PMFS
- Random reads: tied with PMFS
- IO latency is not impacted by asynchronous digest
Throughput: Varmail
Mail server workload from Filebench, using only NVM: 10,000 files, 1:1 read/write ratio, write-ahead logging.
Log coalescing: digest eliminates unneeded work by removing temporary durable writes (create journal file, write data to journal, write data to database file, delete journal file reduces to a single write to the database file).
[Figure: throughput (op/s), 0K-400K, for Strata, PMFS, NOVA, and EXT4-DAX; higher is better. Strata is 29% better.]
- Log coalescing eliminates 86% of log entries, saving 14 GB of IO
- No kernel file system has both low latency and high throughput: PMFS has better latency, NOVA better throughput
- Strata achieves both low latency and high throughput
Throughput: data migration
File server workload from Filebench: the working set starts in NVM and grows to SSD and HDD; 1:2 read/write ratio.
Baselines:
- User-level migration: LRU at whole-file granularity, treating each file system as a black box (NVM: NOVA, SSD: F2FS, HDD: EXT4)
- Block-level caching: Linux LVM cache, formatted with F2FS
[Figure: average throughput (ops/s), 0K-10K, for Strata, user-level migration, and block-level caching]
Strata is 2x faster than block-level caching and 22% faster than user-level migration, thanks to cross-layer optimization: placing hot metadata in faster layers.
Conclusion
Server applications need fast, small, random IO on vast datasets with intuitive crash consistency. Strata, a cross media file system, addresses these concerns:
- Performance: low latency and high throughput via a novel split of LibFS and KernelFS, with fast user-level access
- Low-cost capacity: leverages NVM, SSD & HDD via asynchronous digest and transparent data migration with large, sequential IO
- Simplicity: intuitive crash consistency model with in-order, synchronous IO
Source code is available at https://github.com/ut-osa/strata
Backup slides
Device management overhead
SSD and HDD prefer large sequential IO.
[Figure: SSD random-write throughput (MB/s) vs. SSD utilization for 64 MB to 1024 MB writes; a 5-6x throughput difference caused by hardware GC]
Sequential writes avoid management overhead.