Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Transcription:

A Cross Media File System Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson 1

Let's build a fast server: a NoSQL store, database, file server, or mail server. Requirements: small updates (1 KB) dominate, the dataset scales up to 10 TB, and updates must be crash consistent. 2

Storage diversification

                Latency   $/GB
    DRAM        100 ns    8.6
    NVM (soon)  300 ns    4.0
    SSD         10 us     0.25
    HDD         10 ms     0.02

Moving up the table gives better performance; moving down gives higher capacity. NVM is byte-addressable, with cache-line granularity IO. SSDs have large erasure blocks that must be written sequentially: random writes suffer a 5-6x slowdown due to garbage collection [FAST 15]. 3

A fast server on today's file system: small updates (1 KB) dominate, the dataset scales up to 10 TB, and updates must be crash consistent. The application runs on a kernel file system (NOVA [FAST 16, SOSP 17]) over NVM. Small, random IO is slow: for a 1 KB write, 91% of the IO latency is spent in kernel code rather than the write to the device (IO latency chart, 0-6 us scale). NVM is so fast that the kernel is the bottleneck. 4

A fast server on today's file system, continued: the same requirements apply. The application needs huge capacity, but NVM alone is too expensive ($40K for 10 TB). For low-cost capacity with high performance, the file system must leverage multiple device types. 5

A fast server on today's file system, continued: adding block-level caching under the kernel file system spans NVM, SSD, and HDD, but block-level caching manages data in blocks while NVM is byte-addressable, adding an extra level of indirection. A latency chart compares NOVA with block-level caching for a 1 KB write (IO latency, 0-12 us, lower is better): block-level caching IO is too slow. 6

A fast server on today's file system, continued: updates must be crash consistent. A chart of crash vulnerabilities per application (Pillai et al., OSDI 2014) shows that SQLite, HDFS, ZooKeeper, LevelDB, HSQLDB, Mercurial, and Git each have several (up to about 10). Applications struggle for crash consistency. 7

Problems in today's file systems: (1) the kernel mediates every operation, and NVM is so fast that the kernel is the bottleneck; (2) each file system is tied to a single type of device, but for low-cost capacity with high performance it must leverage multiple device types: NVM (soon), SSD, HDD; (3) aggressive caching in DRAM, writing to the device only when you must (fsync), means applications struggle for crash consistency. 8

Strata: A Cross Media File System. Performance, especially for small, random IO: fast user-level device access. Low-cost capacity: leverage NVM, SSD & HDD, with transparent data migration across different storage media that efficiently handles device IO properties. Simplicity: an intuitive crash consistency model with in-order, synchronous IO; no fsync() required. 9

Main design principle: LibFS logs operations to NVM at user level (performance: kernel bypass, but the log is private; simplicity: intuitive crash consistency), while KernelFS digests and migrates data in the kernel (applying log operations to shared data and coordinating multi-process accesses). 10

Outline LibFS: Log operations to NVM at user-level Fast user-level access In-order, synchronous IO KernelFS: Digest and migrate data in kernel Asynchronous digest Transparent data migration Shared file access Evaluation 11

Log operations to NVM at user-level: an unmodified application uses the POSIX API, and LibFS appends file operations (data & metadata: creat, write, rename) to a private operation log in NVM, bypassing the kernel. Fast writes: directly access fast NVM, sequentially append data, cache-line granularity, blind writes. Crash consistency: on crash, the kernel replays the log. 12
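For illustration, here is a minimal sketch of how a user-level library might append an operation record to a per-process log in NVM. It assumes the log region is mapped into the application's address space and that CLWB/SFENCE are available; the record layout and the names log_append, op_record, and persist are invented for this sketch, not Strata's actual code.

    /* Hypothetical sketch: append a file operation to a private NVM log. */
    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>

    struct op_record {               /* illustrative record layout */
        uint32_t type;               /* e.g., OP_WRITE, OP_CREATE, OP_RENAME */
        uint32_t inode;
        uint64_t offset;
        uint32_t len;
        char     payload[];          /* data and/or path names */
    };

    static void persist(const void *addr, size_t len)
    {
        for (size_t i = 0; i < len; i += 64)            /* flush each cache line */
            _mm_clwb((void *)((const char *)addr + i));
        _mm_sfence();                                    /* order before the commit */
    }

    /* Append one record; the commit marker is persisted last so a crash
     * never exposes a half-written record. */
    void log_append(char *log_tail, const struct op_record *rec, size_t rec_len)
    {
        memcpy(log_tail, rec, rec_len);
        persist(log_tail, rec_len);
        *(volatile uint8_t *)(log_tail + rec_len) = 1;   /* commit marker */
        persist(log_tail + rec_len, 1);
    }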

Intuitive crash consistency unmodified application POSIX API LibFS Kernelbypass Synchronous IO NVM Private operation log 13

Intuitive crash consistency NVM Kernelbypass unmodified application POSIX API LibFS Private operation log Synchronous IO When each system call returns: Data/metadata is durable In-order update Atomic write Limited size (log size) 13

Intuitive crash consistency NVM Kernelbypass unmodified application POSIX API LibFS Private operation log Synchronous IO When each system call returns: Data/metadata is durable In-order update Atomic write Limited size (log size) fsync() is no-op 13

Intuitive crash consistency NVM Kernelbypass unmodified application POSIX API LibFS Private operation log Synchronous IO When each system call returns: Data/metadata is durable In-order update Atomic write Limited size (log size) fsync() is no-op Fast synchronous IO: NVM and kernel-bypass 13
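To make the programming model concrete, here is the classic atomic-update pattern an application might use. Under the in-order, synchronous semantics described above, each call is durable when it returns, so the fsync() calls a traditional file system would require become unnecessary. The file names and helper are only an example, not code from the paper.

    /* Example: atomically replace a file with new contents.
     * On a traditional file system this needs fsync() on the temp file
     * (and often on the directory); with in-order, synchronous IO the
     * operations below are already durable, in order, when they return. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int atomic_replace(const char *path, const char *tmp,
                       const void *buf, size_t len)
    {
        int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        /* fsync(fd) would be a no-op here: the write is already durable. */
        close(fd);
        return rename(tmp, path);   /* in-order: the rename never precedes the data */
    }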

Outline LibFS: Log operations to NVM at user-level Fast user-level access In-order, synchronous IO KernelFS: Digest and migrate data in kernel Asynchronous digest Transparent data migration Shared file access Evaluation 14

Digest data in kernel: the application writes through the POSIX API to LibFS, which appends to the private operation log in NVM; KernelFS digests the log into the NVM shared area. Digesting provides visibility (it makes the private log visible to other applications), converts the write-optimized log into a read-optimized data layout (an extent tree), issues large, batched IO, and coalesces the log. 15
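A rough sketch of what the kernel-side digest step could look like: replay committed log records into each file's index in the shared area (the paper uses an extent tree), then issue the result as large, batched IO. All helper names (record_inode, lookup_file, apply_to_extent_tree, flush_shared_area) are placeholders, not the actual KernelFS interface.

    /* Hypothetical digest loop: make a private log visible in the shared area,
     * turning the write-optimized log into a read-optimized layout. */
    struct op_record;                  /* record format produced by LibFS */
    struct extent_tree;                /* per-file index in the shared area */

    extern unsigned            record_inode(const struct op_record *r);
    extern struct op_record   *next_committed(struct op_record *r);
    extern struct extent_tree *lookup_file(unsigned inode);
    extern void apply_to_extent_tree(struct extent_tree *t,
                                     const struct op_record *r);
    extern void flush_shared_area(void);   /* one large, batched write */

    void digest(struct op_record *head, struct op_record *tail)
    {
        for (struct op_record *r = head; r != tail; r = next_committed(r))
            apply_to_extent_tree(lookup_file(record_inode(r)), r);
        flush_shared_area();
        /* after this point the digested log space can be reclaimed */
    }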

Digest optimization: log coalescing. SQLite and the mail server make crash-consistent updates using write-ahead logging: create journal file, write data to journal file, write data to database file, delete journal file. The digest eliminates unneeded work by removing temporary durable writes, so only "write data to database file" reaches the shared area. Throughput optimization: log coalescing saves IO while digesting. 16
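To make the coalescing idea concrete, a simplified pass over the log might cancel out records whose effects are superseded within the same digest window, e.g. any creates and writes to a journal file that is deleted before the digest runs. This is only a sketch of the policy described on the slide, with invented helper names.

    /* Simplified log coalescing: decide whether a record can be skipped
     * during digest because a later record in the same window undoes it. */
    #include <stdbool.h>

    struct op_record;
    extern bool is_delete(const struct op_record *r);
    extern bool same_file(const struct op_record *a, const struct op_record *b);
    extern bool is_superseding_write(const struct op_record *later,
                                     const struct op_record *earlier);

    bool coalesced_away(struct op_record **log, int i, int n)
    {
        for (int j = i + 1; j < n; j++) {
            if (is_delete(log[j]) && same_file(log[j], log[i]))
                return true;      /* file removed later: its creates/writes vanish */
            if (is_superseding_write(log[j], log[i]))
                return true;      /* a later write fully overwrites this one */
        }
        return false;
    }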

Digest and migrate data in kernel: low-cost capacity comes from KernelFS migrating cold data to lower layers. Logs are digested into the NVM shared area, and NVM data is migrated down to the SSD shared area and then to the HDD shared area. Migration handles device IO properties by moving data in 1 GB blocks, avoiding SSD garbage collection overhead. The overall structure resembles a log-structured merge (LSM) tree. 17-18
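A sketch of the migration policy the slide describes: when the NVM shared area fills up, cold data is pushed to the SSD in large units (around 1 GB) so the SSD only sees sequential writes. The watermark, chunk size constant, and helper names are assumptions for illustration.

    /* Hypothetical cold-data migration: evict least-recently-used data from
     * the NVM shared area to the SSD shared area in large sequential chunks. */
    #include <stddef.h>

    #define MIGRATE_CHUNK   (1UL << 30)      /* about 1 GB per migration unit */
    #define NVM_HIGH_WATER  90               /* % utilization that triggers it */

    extern size_t nvm_utilization_pct(void);
    extern size_t collect_cold_extents(void *buf, size_t max_bytes); /* LRU order */
    extern void   ssd_append_sequential(const void *buf, size_t len);
    extern void   nvm_free_extents(size_t len);

    void migrate_if_needed(void *staging)
    {
        while (nvm_utilization_pct() > NVM_HIGH_WATER) {
            size_t len = collect_cold_extents(staging, MIGRATE_CHUNK);
            if (len == 0)
                break;
            ssd_append_sequential(staging, len);   /* one large, sequential IO */
            nvm_free_extents(len);                 /* reclaim NVM space */
        }
    }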

Read: hierarchical search. LibFS searches in order: (1) the private operation log (log data), (2) the NVM shared area (NVM data), (3) the SSD shared area (SSD data), (4) the HDD shared area (HDD data). 19
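The read path above can be summarized as a simple fall-through search, newest data first. The sketch below uses placeholder lookup functions and ignores the case where a range spans several layers; it is an illustration of the search order, not the real implementation.

    /* Hypothetical read path: search newest-to-oldest locations until the
     * requested range is found. */
    #include <stddef.h>
    #include <sys/types.h>

    extern ssize_t log_read(int ino, void *buf, size_t len, off_t off);
    extern ssize_t nvm_shared_read(int ino, void *buf, size_t len, off_t off);
    extern ssize_t ssd_shared_read(int ino, void *buf, size_t len, off_t off);
    extern ssize_t hdd_shared_read(int ino, void *buf, size_t len, off_t off);

    ssize_t hierarchical_read(int ino, void *buf, size_t len, off_t off)
    {
        ssize_t n;
        if ((n = log_read(ino, buf, len, off)) >= 0)        return n; /* 1. private log */
        if ((n = nvm_shared_read(ino, buf, len, off)) >= 0) return n; /* 2. NVM area   */
        if ((n = ssd_shared_read(ino, buf, len, off)) >= 0) return n; /* 3. SSD area   */
        return hdd_shared_read(ino, buf, len, off);                   /* 4. HDD area   */
    }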

Shared file access: leases grant access rights to applications [SOSP 89]. They are required for files and directories, function like locks but are revocable, and allow an exclusive writer or shared readers. On revocation, LibFS digests the leased data, so private data is made public before the lease is lost. Leases serialize concurrent updates. 20
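A sketch of how lease handling could look from the library's side: acquire a lease before operating on a file, and on revocation digest private updates before giving the lease back. The kernel_* calls are placeholders for whatever interface KernelFS actually exposes.

    /* Hypothetical lease handling in the user-level library: leases act like
     * revocable locks (one writer or many readers). Before a lease is taken
     * away, the holder digests its private log so its updates become public. */
    #include <stdbool.h>

    enum lease_mode { LEASE_READ, LEASE_WRITE };

    extern bool kernel_acquire_lease(int ino, enum lease_mode m);  /* may block */
    extern void kernel_release_lease(int ino);
    extern void digest_private_log_for(int ino);   /* publish pending updates */

    /* Called by the library when the kernel revokes a lease held by this process. */
    void on_lease_revoked(int ino)
    {
        digest_private_log_for(ino);   /* private data made public first */
        kernel_release_lease(ino);     /* only then give up the lease */
    }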

Outline LibFS: Log operations to NVM at user-level Fast user-level access In-order, synchronous IO KernelFS: Digest and migrate data in kernel Asynchronous digest Transparent data migration Shared file access Evaluation 21

Experimental setup: 2x Intel Xeon E5-2640 CPUs, 64 GB DRAM, 400 GB NVMe SSD, 1 TB HDD; Ubuntu 16.04 LTS, Linux kernel 4.8.12. NVM is emulated using 40 GB of DRAM with a performance model [Y. Zhang et al., MSST 2015] that throttles latency and throughput in software. 22
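The software throttling mentioned here can be approximated by spinning after each persist to account for NVM's higher write latency relative to DRAM. The numbers and helper below are assumptions for illustration, not the exact model from the cited paper.

    /* Rough illustration of emulating slower NVM on DRAM: after persisting a
     * write to the DRAM-backed "NVM" region, busy-wait for the extra latency
     * real NVM would add (constants are placeholders, not the paper's model). */
    #include <stdint.h>
    #include <x86intrin.h>

    #define EXTRA_NVM_WRITE_NS  200ULL      /* assumed NVM-minus-DRAM latency */
    #define CPU_GHZ             2.5         /* cycles per nanosecond, roughly */

    static void emulate_nvm_delay(void)
    {
        uint64_t start  = __rdtsc();
        uint64_t cycles = (uint64_t)(EXTRA_NVM_WRITE_NS * CPU_GHZ);
        while (__rdtsc() - start < cycles)
            _mm_pause();                    /* spin out the modeled latency */
    }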

Evaluation questions Latency: Does Strata efficiently support small, random writes? Does asynchronous digest have an impact on latency? Throughput: Strata writes data twice (logging and digesting). Can Strata sustain high throughput? How well does Strata perform when managing data across storage layers? 23

Related work: NVM file systems: PMFS [EuroSys 14], an in-place update file system; NOVA [FAST 16], a log-structured file system; EXT4-DAX, NVM support for EXT4. SSD file system: F2FS [FAST 15], a log-structured file system. 24

Microbenchmark: write latency. Strata logs to NVM and is compared to NVM kernel file systems: PMFS, NOVA, and EXT4-DAX. Strata and NOVA provide in-order, synchronous IO with atomic writes; PMFS and EXT4-DAX have no atomic write. The chart plots latency (us, lower is better) for IO sizes of 128 B, 1 KB, 4 KB, and 16 KB. Versus NOVA, Strata's average latency is 26% better and its tail latency is 43% better. 25

Latency: persistent RPC, the foundation of most servers: persist RPC data before sending the ACK to the client. RPC runs over RDMA on a 40 Gb/s InfiniBand NIC. The chart plots latency (us, lower is better) for RPC sizes of 1 KB, 4 KB, and 64 KB across Strata, NOVA, PMFS, EXT4-DAX, and a no-persist baseline. For small IO (1 KB), Strata is 25% slower than no-persist, 35% faster than PMFS, and 7x faster than EXT4-DAX. 26

Latency: LevelDB on NVM, with 16 B keys, 1 KB values, and 300,000 objects; the workload causes asynchronous digests. The chart plots latency (us, lower is better) for write sync., write seq., write rand., overwrite, and read rand. across Strata, PMFS, NOVA, and EXT4-DAX. Thanks to fast user-level logging, random write is 25% better than PMFS and random read is tied with PMFS; IO latency is not impacted by the asynchronous digest. 27

Evaluation questions Latency: Does Strata efficiently support small, random writes? Does asynchronous digest have an impact on latency? Throughput: Strata writes data twice (logging and digesting). Can Strata sustain high throughput? How well does Strata perform when managing data across storage layers? 28

Throughput: Varmail, a mail server workload from Filebench, using only NVM: 10,000 files, read/write ratio 1:1, write-ahead logging (create journal file, write data to journal, write data to database file, delete journal file). The digest eliminates unneeded work by removing temporary durable writes, leaving only the database write. Strata's throughput (op/s) is 29% better than the best kernel file system; log coalescing eliminates 86% of log entries, saving 14 GB of IO. No kernel file system has both low latency and high throughput (PMFS has better latency, NOVA better throughput); Strata achieves both. 29

Throughput: data migration, a file server workload from Filebench whose working set starts in NVM and grows to SSD and HDD, with a read/write ratio of 1:2. Baselines: user-level migration (LRU at whole-file granularity, treating each file system as a black box: NVM: NOVA, SSD: F2FS, HDD: EXT4) and block-level caching (Linux LVM cache, formatted with F2FS). In average throughput (ops/s), Strata is 2x faster than block-level caching and 22% faster than user-level migration, thanks to cross-layer optimization: placing hot metadata in faster layers. 30

Conclusion: server applications need fast, small random IO on vast datasets with intuitive crash consistency. Strata, a cross media file system, addresses these concerns. Performance: low latency and high throughput from a novel split of LibFS and KernelFS with fast user-level access. Low-cost capacity: leverage NVM, SSD & HDD via asynchronous digest and transparent data migration with large, sequential IO. Simplicity: an intuitive crash consistency model with in-order, synchronous IO. Source code is available at https://github.com/ut-osa/strata 31

Backup 32

Device management overhead: SSD and HDD prefer large sequential IO. For example, with random writes an SSD's throughput drops by 5-6x as utilization grows, due to hardware garbage collection; the chart plots SSD throughput (MB/s) against SSD utilization (0.1 to 1.0) for write sizes from 64 MB to 1024 MB. Sequential writes avoid this management overhead. 33