High-Performance Transaction Processing in Journaling File Systems Y. Son, S. Kim, H. Y. Yeom, and H. Han

High-Performance Transaction Processing in Journaling File Systems. Y. Son, S. Kim, H. Y. Yeom, and H. Han. Seoul National University, Korea / Dongduk Women's University, Korea

Contents: Motivation and Background, Design and Implementation, Evaluation, Conclusion

Motivation and Background: Storage technology. High-performance storage devices (e.g., SSDs) provide low latency, high throughput, and high I/O parallelism. [Figures: highly parallel SSDs, Intel NVMe SSD and Samsung NVMe SSD.] High-performance SSDs are widely used in cloud platforms, social network services, and so on.

Motivation and Background: Motivational evaluation on highly parallel SSDs. Performance does not scale well, or even decreases, as the number of cores increases. Experimental setup: 72 cores, Intel P3700, EXT4 file system. [Graphs: ordered mode, data journaling mode.]

Motivation and Background: Existing coarse-grained locking and single-threaded I/O in transaction processing. Locks in EXT4/JBD2 transaction processing (72 cores, Intel P3700, EXT4 data journaling, sysbench with 72 threads and 72 GiB of random writes in total):
Total write time: 52220 s (100%)
j_checkpoint_mutex (mutex lock): 17946 s (34.40%), a hot lock
j_list_lock (spin lock): 6140 s (11.75%), a hot lock
j_state_lock (r/w lock): 102 s (0.19%)
[Chart: execution-time breakdown across others, j_checkpoint_mutex, j_list_lock, and j_state_lock.]

Motivation and Background: Overall existing locking and I/O procedure. [Diagram: application threads issue creat()/write() and add journal heads (jh) to the transaction buffer list of the running transaction (TxID 1), serialized by the j_list_lock spin lock. During commit, the single journal thread writes the corresponding buffer heads (bh) to the journal area while application threads block on the spin lock. During checkpointing, buffers on the checkpoint list are written back to the original area under the j_checkpoint_mutex, again blocking application threads.]

Motivation and Background: Coarse-grained locking limits the scalability of multi-core systems [diagram: insert, fetch, and remove operations on the journaling list (transaction buffer list or checkpoint list) are serialized behind a single spin lock], and I/O performed by a single thread limits the I/O parallelism of SSDs [diagram: the buffers on the journaling list are written as one batched, serialized I/O].

Design and Implementation. Goal: optimizing transaction processing (running, committing, and checkpointing) in journaling file systems. Our schemes: (1) concurrent updates on data structures, by adopting lock-free data structures and operations built on atomic instructions: a lock-free linked list with lock-free insert, remove, and fetch, using atomic_add()/atomic_read()/atomic_set()/compare_and_swap() (see the sketch below); (2) parallel I/O in a cooperative manner, enabling application threads to perform the journal and checkpoint I/O operations instead of blocking them: fetching buffers from the shared linked lists, issuing the I/Os, and completing them in parallel.
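The atomic primitives named above are generic; as a point of reference, here is a minimal userspace sketch of what they might correspond to with C11 atomics. The actual work lives inside the kernel, which uses its own atomic_t, xchg(), and cmpxchg() helpers, so the wrapper names below are purely illustrative assumptions.

#include <stdatomic.h>

/* Illustrative userspace counterparts of the primitives listed on the slide
 * (assumptions for a standalone sketch, not the kernel code). */
static inline long ex_atomic_add(_Atomic long *v, long n)  { return atomic_fetch_add(v, n) + n; }
static inline long ex_atomic_read(_Atomic long *v)         { return atomic_load(v); }

/* atomic_set as used on the following slides behaves like an exchange:
 * it stores a new value and returns the old one. */
static inline void *ex_atomic_set(_Atomic(void *) *p, void *newv)
{
    return atomic_exchange(p, newv);
}

static inline int ex_compare_and_swap(_Atomic(void *) *p, void *oldv, void *newv)
{
    return atomic_compare_exchange_strong(p, &oldv, newv);   /* 1 on success, 0 otherwise */
}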

Design and Implementation: Overall proposed schemes. [Diagram: the same three phases (running, committing, and checkpointing of TxID 1), but journal heads are added to the transaction buffer list and the checkpoint list through concurrent updates rather than under the j_list_lock spin lock and j_checkpoint_mutex, and both the commit writes to the journal area and the checkpoint writes back to the original area are performed as parallel I/O by the application threads together with the journal thread.]

Design and Implementation: Concurrent updates on data structures. Concurrent insert operations, using an atomic set (exchange) instruction:

add_buffer(jh, head, tail)
{
    jh->prev = atomic_set(tail, jh);   /* atomically swap tail to jh, returning the old tail */
    if (jh->prev == NULL)
        head = jh;                     /* list was empty: jh also becomes the head */
    else
        jh->prev->next = jh;           /* link the old tail to jh */
}

Design and Implementation: Concurrent updates on data structures. Concurrent remove operations (two-phase removal): a buffer is first marked as removed, then appended to a removal list for later cleanup rather than being unlinked from the journaling list in place.

del_buffer(jh, head, tail)
{
    atomic_set(jh->remove, remove);     /* phase 1: mark the buffer as removed */
    jh->gc_prev = atomic_set(tail, jh); /* phase 2: append jh to the removal (GC) list */
    if (jh->gc_prev == NULL)
        head = jh;
    else
        jh->gc_prev->gc_next = jh;
}

Design and Implementation: Concurrent updates on data structures. Concurrent fetch operations:

journal_io_start(...)
{
    while ((jh = head) != NULL) {
        if (atomic_cas(head, jh, jh->next) != jh)
            continue;                  /* another thread fetched this buffer; retry */
        if (atomic_read(jh->removed) == removed)
            continue;                  /* skip buffers marked removed */
        submit_io(...);                /* issue the I/O for the fetched buffer */
    }
}
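Putting the three operations together, the following is a minimal, self-contained userspace model of the lock-free journaling list (insert by swapping the tail, two-phase removal by marking, fetch by CAS on the head), written with C11 atomics. The struct fields and behavior follow the slides, but memory reclamation, the separate GC list, and the insert/fetch race on an emptying list are deliberately simplified; this is a sketch of the idea, not the JBD2 patch.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct jh {                          /* stand-in for a journal head */
    struct jh *_Atomic next;
    _Atomic int removed;             /* set by del_buffer, honored by the fetch loop */
    int blocknr;
};

static struct jh *_Atomic list_head; /* shared journaling list (transaction buffer list) */
static struct jh *_Atomic list_tail;

/* Concurrent insert: atomically swap ourselves into the tail, then link from the old tail. */
static void add_buffer(struct jh *jh)
{
    struct jh *prev = atomic_exchange(&list_tail, jh);
    if (prev == NULL)
        atomic_store(&list_head, jh);    /* list was empty: jh also becomes the head */
    else
        atomic_store(&prev->next, jh);
}

/* Two-phase removal, phase 1: only mark the buffer. The real scheme also appends it
 * to a separate removal (GC) list for later unlinking, which this sketch omits. */
static void del_buffer(struct jh *jh)
{
    atomic_store(&jh->removed, 1);
}

/* Concurrent fetch: pop the current head with CAS and skip buffers marked removed. */
static struct jh *fetch_buffer(void)
{
    for (;;) {
        struct jh *jh = atomic_load(&list_head);
        if (jh == NULL)
            return NULL;
        if (!atomic_compare_exchange_weak(&list_head, &jh, atomic_load(&jh->next)))
            continue;                    /* another thread popped this buffer; retry */
        if (atomic_load(&jh->removed))
            continue;                    /* removed buffer: skip it */
        return jh;
    }
}

int main(void)
{
    for (int i = 0; i < 4; i++) {        /* single-threaded demo of the API */
        struct jh *jh = calloc(1, sizeof(*jh));
        jh->blocknr = i;
        add_buffer(jh);
    }
    struct jh *jh;
    while ((jh = fetch_buffer()) != NULL)
        printf("submit I/O for block %d\n", jh->blocknr);
    return 0;
}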

Design and Implementation: Parallel I/O operations in a cooperative manner. Allowing the application threads to join the I/Os instead of blocking them: fetching buffers from the shared linked list concurrently, issuing the I/Os in parallel, and completing the I/Os in parallel using per-thread lists (see the sketch below).
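A rough sketch of this cooperative pattern, reusing the fetch_buffer() model from the sketch above; submit_io() and wait_for_my_ios() are hypothetical stand-ins (not functions from the paper or the kernel) for issuing an asynchronous write and draining a per-thread completion list.

#include <pthread.h>
#include <stdio.h>

struct jh;                                   /* journal head type from the previous sketch */
extern struct jh *fetch_buffer(void);        /* lock-free pop from the shared list (previous sketch) */

static _Thread_local int my_inflight;        /* I/Os issued by this thread */

/* Hypothetical stand-in: the real code would submit block I/O here. */
static void submit_io(struct jh *jh)
{
    (void)jh;
    my_inflight++;
}

/* Hypothetical stand-in: complete the I/Os tracked on this thread's list. */
static void wait_for_my_ios(void)
{
    printf("completed %d I/Os on this thread\n", my_inflight);
    my_inflight = 0;
}

/* The journal thread and any application thread that would otherwise block on
 * the commit all run this worker, so buffers are fetched, issued, and
 * completed in parallel instead of by a single thread. */
static void *journal_io_worker(void *arg)
{
    struct jh *jh;
    (void)arg;
    while ((jh = fetch_buffer()) != NULL)
        submit_io(jh);
    wait_for_my_ios();
    return NULL;
}

/* Illustrative use: spawn a few helper threads alongside the caller. */
static void cooperative_commit(int helpers)
{
    pthread_t tid[8];
    if (helpers > 8)
        helpers = 8;
    for (int i = 0; i < helpers; i++)
        pthread_create(&tid[i], NULL, journal_io_worker, NULL);
    journal_io_worker(NULL);                 /* the calling thread joins the work too */
    for (int i = 0; i < helpers; i++)
        pthread_join(tid[i], NULL);
}

The point of the pattern is that the threads which previously slept while the journal thread drained the list now do part of the draining themselves, which is what turns the serialized commit and checkpoint writes into parallel I/O.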

Experimental Setup.
Hardware: 72-core machine with four Intel Xeon E7-8870 processors (without hyperthreading), 16 GiB DRAM, PCIe 3.0 interface, 800 GiB Intel P3700 NVMe SSD (18 channels).
Software: Linux kernel 4.9.1, EXT4/JBD2. Compared configurations: an optimized EXT4 with parallel I/O (P-EXT4) and a fully optimized EXT4 (O-EXT4).
Benchmarks:
  Tokubench (micro)  | Metadata-intensive (file creation)          | Files: 30,000,000, I/O size: 4 KiB
  Sysbench (micro)   | Data-intensive (random write)               | Files: 72, each file 1 GiB, I/O size: 4 KiB
  Varmail (macro)    | Metadata-intensive (read/write ratio = 1:1) | Files: 300,000, directory width: 10,000
  Fileserver (macro) | Data-intensive (read/write ratio = 1:2)     | Files: 1,000,000, directory width: 10,000

Performance Evaluation: Tokubench. Ordered mode: improvement of up to 1.9x (P-EXT4) and up to 2.2x (O-EXT4). Data journaling mode: improvement of up to 1.73x (P-EXT4) and up to 1.88x (O-EXT4). [Graphs: ordered mode, data journaling mode.]

Performance Evaluation: Sysbench. Ordered mode: improvement of up to 13.8% (P-EXT4) and up to 16.3% (O-EXT4). Data journaling mode: improvement of up to 1.17x (P-EXT4) and up to 2.1x (O-EXT4). [Graphs: ordered mode, data journaling mode.]

Performance Evaluation: Varmail. Ordered mode: improvement of up to 1.92x (P-EXT4) and up to 2.03x (O-EXT4). Data journaling mode: improvement of up to 31.3% (P-EXT4) and up to 39.3% (O-EXT4). [Graphs: ordered mode, data journaling mode.]

Performance Evaluation: Fileserver. Ordered mode: improvement of up to 4.3% (P-EXT4) and up to 9.6% (O-EXT4). Data journaling mode: improvement of up to 1.45x (P-EXT4) and up to 2.01x (O-EXT4). [Graphs: ordered mode, data journaling mode.]

Performance Evaluation: Comparison with a scalable file system (SpanFS, ATC '15). Ordered mode: improvement of up to 1.45x; the performance of O-EXT4 is similar to or slower than SpanFS at small core counts. Data journaling mode: improvement of up to 1.51x. [Graphs: ordered mode (varmail), data journaling mode (fileserver).]

Performance Evaluation: Experimental analysis. EXT4 vs. P-EXT4: bandwidth improved by 16.3% and write time by 15.7%. EXT4 vs. O-EXT4: bandwidth improved by 2.06x and write time by 2.08x.
Device-level bandwidth and total execution time of the main locks in data journaling mode (sysbench):
  File systems       | EXT4            | P-EXT4          | O-EXT4
  Device-level BW    | 692 MB/s        | 805 MB/s        | 1426 MB/s
  Write time         | 52220 s (100%)  | 45124 s (100%)  | 25078 s (100%)
  j_checkpoint_mutex | 17946 s (34.4%) | 0               | 0
  j_list_lock        | 6132 s (11.7%)  | 4890 s (10.8%)  | 0
  j_state_lock       | 102 s (0.2%)    | 87 s (0.2%)     | 182 s (0.7%)
  others             | 28040 s (53.7%) | 40147 s (89%)   | 24896 s (99.3%)

Conclusion. Motivation and background: the data structures for transaction processing are protected by non-scalable locks, and I/O operations are serialized through a single thread. Approaches: concurrent updates on data structures and parallel I/O in a cooperative manner. Evaluation: improvements of up to 2.2x in ordered mode and up to 2.1x in data journaling mode. Future work: optimizing the locking mechanism for other resources such as files, the page cache, etc.