Understanding Manycore Scalability of File Systems. Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim

Applications must parallelize I/O operations
- Single-core CPU scaling is dead: clock frequencies have plateaued, so performance now comes from more physical cores per machine (Xeon E7 class).
- Storage has moved from mechanical HDDs to flash SSDs with far higher IOPS, and non-volatile memory (e.g., 3D XPoint) is faster still.
- But file systems become a scalability bottleneck.

Problem: lack of understanding of internal scalability behavior
- Workload: Exim mail server on a RAMDISK, an embarrassingly parallel application.
- [Figure: Exim throughput (messages/sec) vs. number of cores for btrfs, ext4, F2FS, and XFS.]
- The curves show three behaviors: saturated, collapsed, and never scaling.
- Test machine: a multi-socket, manycore Intel Xeon E7 server with an SSD and an HDD.

Even on slower storage media, the file system becomes a bottleneck
- [Figure: Exim mail server throughput (messages/sec) at a fixed core count on RAMDISK, SSD, and HDD, for btrfs, ext4, F2FS, and XFS.]

Outline
- Background
- FxMark design: a file system benchmark suite for manycore scalability
- Analysis of five Linux file systems
- Pilot solution
- Related work
- Summary

Research questions
- What file system operations are not scalable?
- Why are they not scalable?
- Is it a problem of implementation or of design?

Technical challenges
- Applications usually get stuck on a few bottlenecks, so the next level of bottlenecks cannot be seen until those are resolved. This makes overall scalability behavior difficult to understand.
- How can we systematically stress file systems to understand their scalability behavior?

FxMark: evaluate and analyze manycore scalability of file systems
- Workloads: micro-benchmarks and applications.
- File systems: tmpfs (memory FS); ext4 with and without journaling (J/NJ) and XFS (journaling FS); btrfs (CoW FS); F2FS (log-structured FS).
- The storage medium (e.g., SSD) and the number of cores are varied across runs.

Microbenchmark: unveil hidden scalability bottlenecks
- Example operation: data block read.
- [Figure: processes performing the operation on files and blocks at three sharing levels: low, medium, and high.]

Stress different components with various sharing levels
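One way to picture the sharing levels for the data block read case is as a rule that maps each worker to a (file, block) target. The sketch below is illustrative only: the function names, the 4 KB block size, and the exact low/medium/high definitions (private file, private block of a shared file, same block of a shared file) are assumptions for this example and not FxMark's actual code.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size for this sketch


def read_target(level, worker_id):
    """Return the (file_index, block_index) that a worker reads.

    Assumed definitions of the sharing levels:
      low    -> each worker reads block 0 of its own private file
      medium -> all workers share file 0, each reads its own block
      high   -> all workers read block 0 of shared file 0
    """
    if level == "low":
        return worker_id, 0
    if level == "medium":
        return 0, worker_id
    if level == "high":
        return 0, 0
    raise ValueError(f"unknown sharing level: {level}")


def run_reads(level, num_workers, directory):
    """Create the files, then perform one block read per worker."""
    targets = [read_target(level, w) for w in range(num_workers)]
    num_files = max(f for f, _ in targets) + 1
    num_blocks = max(b for _, b in targets) + 1
    paths = []
    for f in range(num_files):
        path = os.path.join(directory, f"{level}-file-{f}")
        with open(path, "wb") as fh:
            # Fill every block with its block index so reads can be checked.
            for b in range(num_blocks):
                fh.write(bytes([b % 256]) * BLOCK_SIZE)
        paths.append(path)
    results = []
    for file_idx, block_idx in targets:
        with open(paths[file_idx], "rb") as fh:
            fh.seek(block_idx * BLOCK_SIZE)
            data = fh.read(BLOCK_SIZE)
        results.append((file_idx, block_idx, data[0]))
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for level in ("low", "medium", "high"):
            print(level, run_reads(level, 4, tmp))
```

The reads here run one after another to keep the mapping easy to follow. A real benchmark would run each worker as a separate process pinned to its own core and measure throughput as the worker count grows.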

Evaluation: data block read, low sharing
- [Figure: throughput vs. number of cores for btrfs, ext4, ext4NJ, F2FS, tmpfs, and XFS, showing linear scalability.]

Outline
- Background
- FxMark design
- Analysis of five Linux file systems: what are the scalability bottlenecks?
- Pilot solution
- Related work
- Summary

Summary of results: file systems are not scalable
[Figure: grid of throughput-vs-cores plots for btrfs, ext4, ext4NJ, F2FS, tmpfs, and XFS. Panels: the microbenchmarks DRBL, DRBM, DRBH, DWOL, DWOM, DWAL, DWTL, DWSL, MRPL, MRPM, MRPH, MRDL, MRDM, MWCL, MWCM, MWUL, MWUM, MWRL, MWRM, the O_DIRECT variants of DRBL, DRBM, DWOL, and DWOM, and the applications Exim (messages/sec), RocksDB (ops/sec), and DBENCH (GB/sec).]

Data block read
- DRBL (low sharing): all file systems scale linearly.
- DRBM (medium sharing): XFS shows performance collapse.
- DRBH (high sharing): all file systems show performance collapse.

The page cache is maintained for efficient access to file data
- A task asks the OS kernel to read a file block.
- The kernel looks up the page cache.
- On a cache miss, the kernel reads the page from disk into the page cache.
- The page is copied to the task.

Page cache hit
- The task reads a file block and the kernel looks up the page cache.
- On a cache hit, the page is copied to the task without touching the disk.

Pages in the page cache can be evicted to free memory, but only when they are not being accessed
- Reference counting is used to track the number of tasks accessing a page.
- A page's counter is raised while a task reads and copies it, and lowered afterwards:

access_a_page(...) {
    atomic_inc(&page->_count);
    ...
    atomic_dec(&page->_count);
}
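The lookup, miss, and pinning steps can be put together in a small user-space model. This is a sketch under stated assumptions: PageCache, Page, and the dict-backed "disk" are hypothetical names, and a Python lock stands in for the kernel's atomic operations.

```python
import threading


class Page:
    def __init__(self, data):
        self.data = data
        self.count = 0                # number of tasks accessing the page
        self.lock = threading.Lock()  # stands in for atomic inc/dec


class PageCache:
    def __init__(self, disk):
        self.disk = disk    # block number -> bytes (hypothetical disk)
        self.pages = {}     # block number -> Page
        self.lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def read_block(self, block_no):
        """Look up the cache, fill it on a miss, then copy the page."""
        with self.lock:
            page = self.pages.get(block_no)
            if page is None:
                self.misses += 1
                page = Page(self.disk[block_no])  # read a page from disk
                self.pages[block_no] = page
            else:
                self.hits += 1
            with page.lock:
                page.count += 1   # pin, like atomic_inc(&page->_count)
        try:
            return bytes(page.data)  # copy the page to the caller
        finally:
            with page.lock:
                page.count -= 1   # unpin, like atomic_dec(&page->_count)

    def evict(self, block_no):
        """Evict a page only if no task is accessing it."""
        with self.lock:
            page = self.pages.get(block_no)
            if page is None:
                return False
            with page.lock:
                if page.count > 0:
                    return False
            del self.pages[block_no]
            return True


if __name__ == "__main__":
    cache = PageCache({0: b"abc"})
    cache.read_block(0)   # miss, filled from disk
    cache.read_block(0)   # hit
    print(cache.hits, cache.misses, cache.evict(0))
```

Pinning happens while the cache lock is held, so eviction cannot remove a page between the lookup and the pin. Every reader of the same page still updates the same counter, which is the pattern the slide's access_a_page() fragment shows.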

Reference counting becomes a scalability bottleneck
- [Figure: DRBH throughput and CPI (cycles-per-instruction) vs. number of cores.]
- The atomic_inc()/atomic_dec() pair in access_a_page() causes high contention on a page reference counter, which shows up as a huge memory stall.
- Many more cases exist: directory entry cache, XFS inode, etc.

Lessons learned
- High locality can cause performance collapse.
- Cache hits should be scalable: when cache hits are dominant, the scalability of the cache-hit path matters.
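One commonly discussed way to make a hot counter scale is to split it into per-core (here, per-thread) slots so that increments and decrements do not all hit one memory location. The sketch below illustrates that idea only. It is not a technique evaluated in the talk, ShardedRefCount is a hypothetical name, and Python cannot reproduce the cache-line effects themselves.

```python
import threading


class ShardedRefCount:
    """Reference count split across slots; each thread updates one slot."""

    def __init__(self, num_slots):
        self.slots = [0] * num_slots
        self.locks = [threading.Lock() for _ in range(num_slots)]

    def _slot(self):
        # Map the calling thread to a slot (assumption: thread id as a
        # stand-in for a CPU id).
        return threading.get_ident() % len(self.slots)

    def inc(self):
        i = self._slot()
        with self.locks[i]:
            self.slots[i] += 1

    def dec(self):
        i = self._slot()
        with self.locks[i]:
            self.slots[i] -= 1

    def total(self):
        """Slow path: sum all slots. Not an atomic snapshot across slots."""
        total = 0
        for i, lock in enumerate(self.locks):
            with lock:
                total += self.slots[i]
        return total


def worker(rc, iterations):
    for _ in range(iterations):
        rc.inc()
        rc.dec()
    rc.inc()  # leave one reference held per worker


def demo(num_threads=4, iterations=1000):
    rc = ShardedRefCount(num_slots=8)
    threads = [
        threading.Thread(target=worker, args=(rc, iterations))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rc.total()


if __name__ == "__main__":
    print(demo())
```

The trade-off is that the fast path (inc/dec) touches only one slot, while reading the exact count requires visiting every slot. That fits a page cache, where pinning is frequent and the eviction check is rare.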

Data block overwrite
- DWOL (low sharing): ext4, F2FS, and btrfs show performance collapse.
- DWOM (medium sharing): all file systems degrade gradually.

Btrfs is a copy-on-write (CoW) file system
- Directs a write to a block to a new copy of the block.
- Never overwrites the block in place.
- Maintains multiple versions of a file system image.
- [Figure: a write producing a new version of the file system image at each successive time step.]

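CoW triggers disk block allocation for every write
- [Figure: each write between successive versions requires a new block allocation.]
- Disk block allocation becomes a bottleneck.
- Corresponding bottlenecks in the other file systems: journaling in ext4, checkpointing in F2FS.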

Lessons learned
- Overwriting could be as expensive as appending. This is critical for log-structured file systems (F2FS) and CoW file systems (btrfs).
- Consistency guarantee mechanisms should be scalable: scalable journaling, a scalable CoW index structure, and parallel log-structured writing.
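The cost difference between in-place and copy-on-write overwrites can be seen in a toy model that simply counts block allocations. The classes below are hypothetical and ignore the index updates, journaling, and checkpointing that a real file system performs.

```python
class BlockAllocator:
    """Hands out block numbers and counts how often it is asked."""

    def __init__(self):
        self.next_block = 0
        self.allocations = 0

    def alloc(self):
        self.allocations += 1
        block = self.next_block
        self.next_block += 1
        return block


class InPlaceFile:
    """Overwrites reuse the block that already backs the offset."""

    def __init__(self, allocator, num_blocks):
        self.allocator = allocator
        self.storage = {}  # block number -> data
        self.block_map = [allocator.alloc() for _ in range(num_blocks)]

    def overwrite(self, index, data):
        self.storage[self.block_map[index]] = data

    def read(self, index):
        return self.storage.get(self.block_map[index])


class CowFile(InPlaceFile):
    """Overwrites go to a freshly allocated block; old blocks are kept."""

    def overwrite(self, index, data):
        new_block = self.allocator.alloc()  # allocation on every write
        self.storage[new_block] = data
        self.block_map[index] = new_block   # repoint the file to the new copy


def allocations_for_overwrites(file_cls, num_blocks, num_writes):
    """Count allocations caused by overwrites only, not by file creation."""
    allocator = BlockAllocator()
    f = file_cls(allocator, num_blocks)
    before = allocator.allocations
    for i in range(num_writes):
        f.overwrite(i % num_blocks, b"x")
    return allocator.allocations - before


if __name__ == "__main__":
    print("in-place:", allocations_for_overwrites(InPlaceFile, 4, 100))
    print("CoW:     ", allocations_for_overwrites(CowFile, 4, 100))
```

In this model the allocator is a single shared object, so every CoW overwrite from every core funnels through it, while in-place overwrites never touch it after file creation.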

Data block overwrite, medium sharing (DWOM)
- All file systems degrade gradually.

The entire file is locked regardless of update range
- All tested file systems hold an inode mutex for write operations.
- Range-based locking is not implemented.

***_file_write_iter(...) {
    mutex_lock(&inode->i_mutex);
    ...
    mutex_unlock(&inode->i_mutex);
}

Lessons learned
- A file cannot be concurrently updated. This is critical for VMs and DBMSs, which manage large files.
- Techniques used in parallel file systems, e.g., range-based locking, need to be considered.
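A minimal sketch of range-based locking, assuming half-open byte ranges [start, end) and writers that block only when their ranges overlap. RangeLock and write_range are hypothetical user-space names built on threading.Condition, not a kernel API, and fairness between waiters is not handled.

```python
import threading
from contextlib import contextmanager


class RangeLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # (start, end) half-open ranges currently locked

    def _overlaps(self, start, end):
        return any(s < end and start < e for s, e in self._held)

    def acquire(self, start, end):
        """Block until no held range overlaps [start, end)."""
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def try_acquire(self, start, end):
        """Non-blocking variant: return False on overlap with a holder."""
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()  # wake waiters to re-check their ranges

    @contextmanager
    def locked(self, start, end):
        self.acquire(start, end)
        try:
            yield
        finally:
            self.release(start, end)


def write_range(buf, lock, offset, data):
    """Update buf[offset:offset+len(data)] while holding only that range."""
    with lock.locked(offset, offset + len(data)):
        buf[offset:offset + len(data)] = data


if __name__ == "__main__":
    buf = bytearray(16)
    lock = RangeLock()
    # Two writers on disjoint halves of the same "file" do not block each other.
    t1 = threading.Thread(target=write_range, args=(buf, lock, 0, b"A" * 8))
    t2 = threading.Thread(target=write_range, args=(buf, lock, 8, b"B" * 8))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(bytes(buf))
```

Compared with a single per-file mutex, writers to disjoint ranges proceed in parallel, at the price of scanning the list of held ranges on every acquire. Real implementations typically use an interval tree for that lookup.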

Summary of findings
- High locality can cause performance collapse.
- Overwriting could be as expensive as appending.
- A file cannot be concurrently updated.
- All directory operations are sequential.
- Renaming is system-wide sequential.
- Metadata changes are not scalable.
- Non-scalability often means wasting CPU cycles.
- Scalability is not portable.
Many of these findings are unexpected and counter-intuitive. They stem from contention at the file system level to maintain data dependencies. See our paper for details.

Outline
- Background
- FxMark design
- Analysis of five Linux file systems
- Pilot solution: if we remove contention in a file system, is that file system scalable?
- Related work
- Summary

RocksDB on a multi-partitioned RAMDISK scales better
- [Figure: RocksDB throughput (ops/sec) vs. number of cores on a single-partitioned RAMDISK and on a multi-partitioned RAMDISK, for btrfs, ext4, F2FS, tmpfs, and XFS.]
- Tested workload: DB_BENCH overwrite.
- Reduced contention on file systems helps improve performance and scalability.

But partitioning makes performance worse on HDD
- [Figure: RocksDB throughput (ops/sec) on a single-partitioned HDD and on a multi-partitioned HDD, for btrfs, ext4, F2FS, and XFS.]
- Tested workload: DB_BENCH overwrite.
- Reduced spatial locality degrades performance.
- Medium-specific characteristics (e.g., spatial locality) should be considered.

Related work
- Scaling operating systems: mostly use a memory file system to opt out of the effect of I/O operations.
- Scaling file systems:
  - Scalable file system journaling: ScaleFS [MIT:MSThesis], SpanFS [ATC]
  - Parallel log-structured writing on NVRAM: NOVA [FAST]

Summary
- Comprehensive analysis of the manycore scalability of five widely used file systems using FxMark.
- Manycore scalability should be of utmost importance in file system design.
- New challenges in scalable file system design: minimizing contention, scalable consistency guarantees, spatial locality, etc.
- FxMark is open source: https://github.com/sslab-gatech/fxmark