Optimizing Local File Accesses for FUSE-Based Distributed Storage

Optimizing Local File Accesses for FUSE-Based Distributed Storage. Shun Ishiguro (1), Jun Murakami (1), Yoshihiro Oyama (1,3), Osamu Tatebe (2,3). 1: The University of Electro-Communications, Japan; 2: University of Tsukuba, Japan; 3: Japan Science and Technology Agency.

Background. Highly scalable distributed file systems are a key component of data-intensive science: low latency, high capacity, and high scalability of storage are crucial to application performance. Many distributed file systems manage numerous large files by federating multiple storage nodes, e.g., Lustre, GlusterFS, Ceph, and Gfarm. Several of these systems allow users to mount the file system on UNIX clients by using FUSE.

FUSE. A framework for implementing user-level file systems. Once mounted, a distributed file system can be accessed with standard system calls such as open. Drawback: considerable overhead on I/O, which degrades the overall performance of data-intensive HPC applications. Large overhead has been reported in multiple papers [Bent et al., SC09], [Narayan et al., Linux OLS '10], [Rajgarhia et al., '10]; for example, Bent et al. reported 20% overhead on file-system bandwidth.

Structure of FUSE. FUSE is composed of a kernel module and a userland daemon; implementing a file system means implementing a daemon. [Figure: applications issue read/write system calls, the kernel FUSE module forwards the requests to the userland FUSE daemon, and the daemon accesses local storage or a remote server and returns the result.]

Source of Overhead. Context switches (between processes, and between userland and kernel) and redundant memory copies. All of these occur even if the file is in local storage, and several distributed file systems, such as HDFS, make use of local storage.

Goal. Propose a mechanism that allows local storage to be accessed directly through FUSE, bypassing the redundant context switches and memory copies; accesses to remote files are outside the scope of this work. Demonstrate the effectiveness using the Gfarm distributed file system [Tatebe et al., '10]. [Figure: the FUSE module accesses local storage directly, bypassing the round trip to the FUSE daemon.]

Gfarm Architecture. Gfarm aggressively uses local storage: an I/O server and a client can exist on the same node. [Figure: a metadata server holds file locations, validity, and file information; open requests go to the metadata server, while open, read, and write operations go to the I/O servers, each co-located with a client.]

Mounting a Gfarm File System. gfarm2fs is the FUSE daemon for mounting a Gfarm file system. It communicates with the metadata servers and I/O servers by invoking the Gfarm API.

How to Access a Mounted Gfarm FS through gfarm2fs. [Figure: the user application's requests pass through the kernel FUSE module to gfarm2fs (the FUSE daemon), which talks to the metadata server, to local and remote I/O servers, and to local storage.] If the accessed file is managed by a remote I/O server, gfarm2fs receives the file from that remote I/O server. If the accessed file is managed by the local I/O server, gfarm2fs bypasses the server: gfarm2fs itself executes system calls to access the file, as in the sketch below.
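A minimal sketch of this decision in the read path of the unmodified gfarm2fs, assuming a hypothetical per-file state structure; the names g2fs_open_file, read_remote, and g2fs_read are placeholders, not the actual gfarm2fs identifiers:

    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct g2fs_open_file {
        int   local_fd;        /* >= 0 if a replica of the file is on local storage */
        void *gfarm_handle;    /* handle for remote access through the Gfarm library */
    };

    /* Placeholder for the Gfarm library call used for remote replicas. */
    static ssize_t read_remote(void *handle, char *buf, size_t size, off_t off)
    {
        (void)handle; (void)buf; (void)size; (void)off;
        return -ENOSYS;        /* real code would call into the Gfarm library here */
    }

    static ssize_t g2fs_read(struct g2fs_open_file *of, char *buf,
                             size_t size, off_t off)
    {
        if (of->local_fd >= 0)
            /* Local replica: the daemon itself issues the system call. */
            return pread(of->local_fd, buf, size, off);
        /* Remote replica: fetch the data from the remote I/O server. */
        return read_remote(of->gfarm_handle, buf, size, off);
    }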

Elaborating the Overhead of FUSE. [Figure: the data path of a local file access passes from the application to the kernel FUSE module, up to the FUSE daemon, and back down through the kernel to local storage.]

Proposed Mechanism. Our modified FUSE module directly accesses local storage: no context switch between processes, only one round trip between kernel and user, and reduced memory copies. The page cache created by the FUSE daemon is no longer needed; the page cache is created only by the local file system, such as ext3. Remote file accesses are unchanged.

Overview of the Mechanism. [Figure: the application, the FUSE daemon, the FUSE module, and local storage; the modified FUSE module accesses local storage directly instead of forwarding the request to the daemon.]

Design (1/2). Key idea: gfarm2fs passes the file descriptor (FD) of a locally opened file to the FUSE module; the FUSE module obtains the file structure associated with the FD and uses that file structure for the following read/write operations. [Figure: gfarm2fs (the FUSE daemon) hands the FD to the FUSE module, which resolves it to a file structure backed by local storage.]
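A minimal kernel-side sketch of this lookup, assuming the 2.6.18-era in-kernel API: fget() and fput() are the real kernel helpers for turning an FD into a struct file reference, while fuse_attach_local_file, fuse_detach_local_file, and the ff->file member follow the slides and are not the exact patch code. The sketch also assumes the lookup runs in the daemon's process context while its reply is processed, so fget() resolves the FD against gfarm2fs's file table.

    #include <linux/file.h>   /* fget(), fput() */
    #include <linux/fs.h>
    /* struct fuse_file is the FUSE module's internal structure (fuse_i.h);
     * the 'file' member is the one added by this work (see slide 16). */

    static void fuse_attach_local_file(struct fuse_file *ff,
                                       struct file *fuse_filp, int fd)
    {
        if (fd >= 0)
            ff->file = fget(fd);   /* local file: reference ext3's struct file */
        else
            ff->file = fuse_filp;  /* remote file: keep the original FUSE struct file */
    }

    /* Called on release: drop the reference taken above, local files only. */
    static void fuse_detach_local_file(struct fuse_file *ff,
                                       struct file *fuse_filp)
    {
        if (ff->file && ff->file != fuse_filp)
            fput(ff->file);
    }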

Design (2/2). gfarm2fs sends -1 as the FD for remote files, so the FUSE module can tell whether a file is local or remote. The FUSE module forwards read and write requests against remote files to gfarm2fs in the same way as before. All operations other than read and write are still executed through gfarm2fs: open, close, rename, truncate, stat, seek, ...

Implementation. Modified programs: gfarm2fs, the FUSE module, the FUSE library, and the Gfarm library (minor modification). Modified parts: the communication between gfarm2fs and the FUSE module (we added a new member to struct fuse_file_info for passing an FD), the code that obtains and manages file structures, and the read/write file operations.
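On the userspace side, a sketch of the change described above, assuming a placeholder member name local_fd for the field added to struct fuse_file_info (the real member name and the surrounding gfarm2fs code differ, so the structure is renamed here to avoid suggesting exact patch code):

    #include <stdint.h>

    /* Illustrative stand-in for the patched libfuse structure; only the
     * members relevant to the sketch are shown. */
    struct fuse_file_info_patched {
        uint64_t fh;        /* file handle, as in stock libfuse */
        int local_fd;       /* added member: FD of the locally opened file, or -1 */
    };

    /* Simplified gfarm2fs open handler: after opening the file through the
     * Gfarm library, record whether a local replica was opened directly. */
    static int gfarm2fs_open_sketch(struct fuse_file_info_patched *fi,
                                    uint64_t gfarm_handle,
                                    int local_fd /* -1 if the file is remote */)
    {
        fi->fh = gfarm_handle;    /* as in the original gfarm2fs */
        fi->local_fd = local_fd;  /* new: the FUSE module reads this from the reply */
        return 0;
    }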

Managing a File Structure. A new member, file, is added to struct fuse_file, which is reached through the private_data field of struct file. For local files, ext3's file structure can be obtained by referencing the file member of struct fuse_file. For remote files, file points to the original file structure.
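For concreteness, a sketch of the modified kernel structure as described on this slide; the real struct fuse_file in fuse_i.h has additional members, and the structure is renamed here because this is illustration, not the actual patch:

    #include <linux/fs.h>

    struct fuse_file_sketch {
        /* ... existing members of struct fuse_file ... */
        struct file *file;  /* local file: ext3's struct file;
                             * remote file: the original FUSE struct file */
    };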

File Operations. Both FUSE and ext3 (ext4) invoke the same functions to execute read/write: generic_file_read (ext3) / do_sync_read (ext4) and generic_file_write (ext3) / do_sync_write (ext4); these are the default callback functions in the FUSE framework. We replace FUSE's read/write callback functions with our interception functions, so read/write operations are intercepted.
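A sketch of how the callback table might be rewired, assuming the interception functions fuse_generic_file_read (shown on the next slide) and a write-side counterpart fuse_generic_file_write; the real fuse_file_operations table has more entries than shown here:

    #include <linux/fs.h>

    /* Forward declarations of the interception functions (the read version is
     * shown on the next slide; the write version is analogous). */
    ssize_t fuse_generic_file_read(struct file *filp, char __user *buf,
                                   size_t len, loff_t *ppos);
    ssize_t fuse_generic_file_write(struct file *filp, const char __user *buf,
                                    size_t len, loff_t *ppos);

    static const struct file_operations fuse_file_operations_patched = {
        .read  = fuse_generic_file_read,    /* replaces FUSE's default read callback */
        .write = fuse_generic_file_write,   /* replaces FUSE's default write callback */
        /* .open, .release, .mmap, .llseek, ... unchanged: still served via the daemon */
    };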

Example of Interception Code.

    ssize_t fuse_generic_file_read(struct file *filp, char __user *buf,
                                   size_t len, loff_t *ppos)
    {
        struct fuse_file *ff = filp->private_data;
        /* ff->file: for local files, it is a file structure of ext3;
         * for remote files, it is the same as filp (the original structure). */
        return generic_file_read(ff->file, buf, len, ppos);  /* calls the original function */
    }

Experiments. We measured I/O throughput with the IOzone benchmark (file size: 100 GB, record size: 8 MB) and compared three configurations: the original FUSE, the proposed mechanism, and direct access without FUSE (ext3).

Platform. A cluster of 4 nodes connected with InfiniBand QDR: 1 node runs the metadata server, and 3 nodes run an I/O server and a client. Per node: CPU: Intel Xeon E5645 2.40 GHz (6 cores) × 2; memory: 48 GB; HDD: SAS 600 GB, 15,000 rpm; OS: CentOS 5.6 64-bit (kernel-2.6.18-238.el5); local file system: ext3. Software versions: Gfarm 2.4.2, FUSE 2.7.4, gfarm2fs 1.2.3.

Throughput. [Chart: throughput in MB/s of the original FUSE, the proposed mechanism, and direct access without FUSE for sequential read, sequential write, random read, and random write; the per-workload deltas relative to the original FUSE are labeled -21%, 63%, 26%, and 57%.] The proposed mechanism gives a 26-63% improvement over the original FUSE (except for read) and achieves 97-99% of the direct-access throughput. Read throughput is affected by the OS kernel's read-ahead mechanism.

Further Experiments. We measured memory consumption for the page cache: the page cache size before and after reading a 1 GB file, comparing the original FUSE with the proposed mechanism.

Memory Usage for Page Cache. Before and after reading a 1 GB file, the increase in page cache is 2 GB with the original FUSE and 1 GB with our mechanism. With the original FUSE, both the FUSE module and ext3 create their own page cache; with our mechanism, only ext3's cache exists. [Chart: page cache size in GB before and after the read for the original FUSE and the proposed mechanism.]
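The slides do not say how the page-cache size was sampled; one simple way on Linux, shown here as an assumption rather than the authors' method, is to read the Cached field of /proc/meminfo before and after the 1 GB read:

    #include <stdio.h>
    #include <string.h>

    /* Returns the "Cached:" value from /proc/meminfo in KiB, or -1 on error. */
    static long cached_kib(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long kib = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "Cached:", 7) == 0) {
                sscanf(line + 7, "%ld", &kib);   /* field is reported in kB */
                break;
            }
        }
        fclose(f);
        return kib;
    }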

Related Work (1/2). LDPLFS [Wright et al., '12] uses a dynamically loaded library to retarget POSIX file operations to the functions of a parallel file system; it can accelerate file reads and writes without modifying application code, but it does not work well with statically linked binaries. [Narayan et al., '10] use a stackable FUSE module to reduce the context-switch overhead of FUSE; they apply the approach to an encryption file system and do not discuss its application to distributed file systems.

Related Work (2/2). Ceph [Weil '06] is a distributed file system that provides highly scalable storage; because the Ceph client is merged into the Linux kernel, a Ceph file system can be mounted without FUSE, but client nodes are distinct from storage nodes. Lustre is a distributed file system consisting of a metadata server and object storage targets, which store the contents of files; Lustre client nodes are also assumed to be distinct from the object storage targets. In contrast, Gfarm aims to improve performance by using local storage effectively when a client and an I/O server share the same node.

Summary and Future Work. We proposed a mechanism for accessing local storage directly from a modified FUSE module and applied it to Gfarm. Applications running with our mechanism can read and write a local Gfarm file system without the intervention of the FUSE daemon. We measured a 26-63% increase in read and write throughput; in a sense, this shows an upper bound on the improvement in the ideal case. Future work: develop a kernel module (driver) that also allows access to remote I/O servers without the intervention of the FUSE daemon.