Structuring PLFS for Extensibility

Structuring PLFS for Extensibility Chuck Cranor, Milo Polte, Garth Gibson PARALLEL DATA LABORATORY Carnegie Mellon University

What is PLFS? Parallel Log Structured File System: an interposed filesystem between applications and backing storage, developed by Los Alamos National Laboratory, CMU, and EMC. Target: HPC checkpoint files. PLFS transparently transforms a highly concurrent write access pattern into a pattern more efficient for distributed filesystems. First paper: Bent et al., Supercomputing 2009. http://github.com/plfs, http://institute.lanl.gov/plfs/

Checkpoint Write Patterns. The two main checkpoint write patterns are N-1, where all N processes write to one shared file, and N-N, where each process writes to its own file. N-1: concurrent I/O to a single file is often unscalable, and the small, unaligned, clustered traffic is problematic. N-N: there is overhead in inserting many files into a single directory, but the pattern is easier for a distributed filesystem once the files are created; archival and management are more difficult. Initial PLFS focus: improve the N-1 case.
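
To make the N-1 strided pattern concrete, here is a minimal sketch of how each process might compute its write offsets; the rank, block size, and stride count are hypothetical parameters, and a plain POSIX pwrite path stands in for whatever I/O library a real checkpointing application would use.

```cpp
// Sketch of an N-1 strided checkpoint write: every process writes
// interleaved blocks into one shared file. Parameters are hypothetical;
// a real checkpoint would typically go through MPI-IO.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

void write_n1_strided(int rank, int nprocs, const char *shared_file,
                      size_t block_size, int num_strides) {
    int fd = open(shared_file, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return;                        // error handling elided
    std::vector<char> block(block_size, 'x');  // stand-in checkpoint data

    for (int s = 0; s < num_strides; s++) {
        // Block s of rank r lands at ((s * nprocs) + r) * block_size:
        // many ranks issue small writes clustered around nearby offsets,
        // which is the hard case for a distributed filesystem.
        off_t offset = (static_cast<off_t>(s) * nprocs + rank) * block_size;
        pwrite(fd, block.data(), block_size, offset);
    }
    close(fd);
}
```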

PLFS Transforms Workloads. PLFS improves N-1 performance by transforming it into an N-N workload. Via FUSE or MPI-IO it is a transparent solution: no application changes are required.

PLFS Converts N-1 to N-N (diagram): processes on host1, host2, and host3 write to the single logical file /foo through the PLFS virtual layer; on the physical underlying parallel file system those writes land in per-host subdirectories (/foo/hostdir.1/, hostdir.2/, hostdir.3/), each write appended to a per-process data log (data.131, data.132, ...) with a matching index log (indx.131, indx.132, ...).
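
The index logs are what let PLFS reassemble the logical file on reads. As a rough sketch (the actual on-disk format in the PLFS code differs), each index record maps a range of the logical file onto a byte range inside one of the data logs; the struct and lookup below are illustrative only.

```cpp
// Illustrative sketch of a PLFS-style index record and lookup, not the
// real on-disk format: each record says where a logical byte range of
// /foo actually lives inside some per-process data log.
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    uint64_t logical_offset;   // offset within the logical file (/foo)
    uint64_t length;           // number of bytes covered by this record
    std::string data_log;      // e.g. "hostdir.2/data.279" (hypothetical path)
    uint64_t physical_offset;  // offset of the bytes inside that data log
};

// Find the record covering a logical offset so a read can be redirected
// to the right data log; linear scan kept for clarity.
const IndexEntry *lookup(const std::vector<IndexEntry> &index, uint64_t off) {
    for (const auto &e : index)
        if (off >= e.logical_offset && off < e.logical_offset + e.length)
            return &e;
    return nullptr;  // hole in the logical file
}
```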

PLFS N-1 Bandwidth Speedups (chart): measured N-1 write bandwidth speedups fall roughly between 10X and 100X.

The Price of Success. The original PLFS was limited to one workload: N-1 checkpoint on a mounted POSIX filesystem, with all data stored in PLFS container logs. It was ported first to MPI-IO/ROMIO to feasibly deploy on leadership-class machines. Success with LANL apps raised the question of actual adoption, which requires maintainability and roadmap evolution: develop a team (LANL, EMC, CMU) and revisit the code with maintainability in mind.

PLFS Extensibility Architecture (diagram): the HPC application calls the PLFS high-level API (libplfs), which exposes three extension points: a Logical FS interface (flat file, small file, container), an Index API (byte-range, pattern, distributed via MDHIM w/leveldb), and an I/O Store interface with backends for posix, pvfs, iofsl, and hdfs (via libhdfs/jvm and hdfs.jar).

Case Study: HPC in the Cloud. The emergence of Hadoop brought converged storage via HDFS, the Hadoop Distributed Filesystem. Key attributes: a single sequential writer per file (not POSIX, no pwrite); not VFS mounted, accessed through a Java API; local storage on compute nodes (converged); data replicated ~3 times (local + remote1 + remote2). HPC in the Cloud: can we run N-1 checkpoints on HDFS? Observation: PLFS log-structured I/O fits HDFS semantics.
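
To illustrate why PLFS's append-only logs fit, here is a minimal sketch of writing a data log through libhdfs, the C binding to the HDFS Java client (the namenode address and path are hypothetical, error handling is elided): a file is opened once by a single writer and bytes can only be appended, which is exactly how a PLFS data log behaves.

```cpp
// Sketch of appending a PLFS-style data log through libhdfs.
// Namenode address, port, and path are hypothetical.
#include <hdfs.h>    // libhdfs C API (wraps the Java HDFS client via JNI)
#include <fcntl.h>

void append_log_record(const char *buf, size_t len) {
    hdfsFS fs = hdfsConnect("namenode.example", 8020);
    // HDFS files have one sequential writer and no pwrite(): writes only
    // go to the current end of the file, which matches how PLFS appends
    // to its per-process data logs.
    hdfsFile log = hdfsOpenFile(fs, "/plfs/foo/hostdir.1/data.131",
                                O_WRONLY, 0 /*bufferSize: default*/,
                                0 /*replication: default*/,
                                0 /*blocksize: default*/);
    hdfsWrite(fs, log, buf, static_cast<tSize>(len));
    hdfsFlush(fs, log);
    hdfsCloseFile(fs, log);
    hdfsDisconnect(fs);
}
```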

PLFS Backend Limitations. PLFS was hardwired to the POSIX API: it needs a kernel-mounted filesystem, uses integer file descriptors, and memory-maps index files to read them. HDFS does not fit these assumptions. Solution: the I/O Store, a layer of indirection inserted above the PLFS backend and modeled after the POSIX API to minimize code changes.
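
A rough sketch of what such an indirection layer could look like; the class and method names below are illustrative, not the actual PLFS I/O Store interface, but they show the idea of a POSIX-shaped abstract backend that POSIX, HDFS, or PVFS stores can each implement.

```cpp
// Illustrative I/O Store sketch: a POSIX-modeled abstract backend so the
// PLFS container code never calls open()/pwrite()/mmap() directly.
// Names are hypothetical; the real PLFS interface differs in detail.
#include <sys/types.h>
#include <memory>
#include <string>

class IOSHandle {                 // replaces a raw integer file descriptor
public:
    virtual ~IOSHandle() = default;
    virtual ssize_t Write(const void *buf, size_t len) = 0;      // append
    virtual ssize_t Pread(void *buf, size_t len, off_t off) = 0;
    virtual int Close() = 0;
};

class IOStore {                   // one instance per backend (posix, hdfs, pvfs)
public:
    virtual ~IOStore() = default;
    virtual std::unique_ptr<IOSHandle> Open(const std::string &path,
                                            int flags, mode_t mode) = 0;
    virtual int Mkdir(const std::string &path, mode_t mode) = 0;
    virtual int Unlink(const std::string &path) = 0;
    // Backends that cannot mmap (e.g. HDFS) can read index files
    // through Pread instead of memory mapping them.
};
```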

PLFS I/O Store Architecture (diagram): both PLFS FUSE and PLFS MPI I/O sit on libplfs and the PLFS container code, which talks to an I/O Store; the posix I/O store uses the POSIX libc API against a mounted filesystem, while the HDFS I/O store goes through lib{hdfs,jvm} and hdfs.jar into Java code.

PLFS/HDFS Benchmark. Testbed: PRObE (www.nmc-probe.org); each node has dual 1.6GHz AMD cores, 16GB RAM, a 1TB drive, and gigabit ethernet, running Ubuntu Linux with HDFS 0.21.0, PLFS, and OpenMPI. Benchmark: the LANL FS Test Suite (fs_test), simulating a strided N-1 checkpoint. Filesystems tested: PVFS (OrangeFS 2.8.4 with 64MB stripe size), PLFS/HDFS with 1 replica (local disk), and PLFS/HDFS with 3 replicas (local disk + remote1 + remote2). Block sizes: 47001 bytes, 48K, and 1M. Checkpoint size: 32GB written by 64 nodes.

Benchmark Operation (diagram): in the write phase, nodes 0..3 write strided blocks into the shared file, continuing the pattern for the remaining strides; in the read phase the node-to-block assignment is shifted (2 3 0 1), so no node reads back the data it wrote. We unmount and cache-flush the data filesystem between the write and read phases.
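
A minimal sketch of the shifted read phase, assuming the same hypothetical parameters as the write sketch earlier: each rank reads the blocks another rank wrote, so local caches and data locality cannot help.

```cpp
// Sketch of the shifted read phase: rank r reads the blocks that rank
// (r + shift) % nprocs wrote, defeating any benefit from reading locally
// cached or locally stored data. Parameters are hypothetical.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

void read_n1_shifted(int rank, int nprocs, const char *shared_file,
                     size_t block_size, int num_strides, int shift) {
    int fd = open(shared_file, O_RDONLY);
    if (fd < 0) return;                        // error handling elided
    std::vector<char> block(block_size);
    int victim = (rank + shift) % nprocs;      // whose blocks we read back

    for (int s = 0; s < num_strides; s++) {
        off_t offset = (static_cast<off_t>(s) * nprocs + victim) * block_size;
        pread(fd, block.data(), block_size, offset);
    }
    close(fd);
}
```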

PLFS Implementation Architecture (diagram): PLFS is packaged both as a FUSE filesystem and as a middleware library (MPI). In the FUSE path, application processes issue I/O through the VFS/POSIX API, the kernel FUSE module upcalls to the user-level PLFS FUSE daemon (which embeds the PLFS lib), and backing-store I/O goes to a local or distributed fs. In the MPI path, application processes link the PLFS/MPI libs directly and coordinate via MPI sync calls over the interconnect, with I/O going to disk, the network, and other nodes.

PLFS/HDFS Write Bandwidth (chart): write bandwidth in Mbytes/s (0 to 2000) for PVFS-write, PLFS/HDFS1-write, and PLFS/HDFS3-write at access unit sizes of 47001 bytes, 48K, and 1M.

PLFS/HDFS Write Bandwidth (same chart): PLFS/HDFS performs well; note that HDFS1 writes to local disk.

PLFS/HDFS Write Bandwidth (same chart): note that HDFS3 makes 3 copies of the data.

PLFS/HDFS Read Bandwidth (chart): read bandwidth in Mbytes/s (0 to 1000) for PVFS-read, PLFS/HDFS1-read, and PLFS/HDFS3-read at access unit sizes of 47001 bytes, 48K, and 1M; HDFS with small access sizes benefits from PLFS log grouping.

PLFS/HDFS Read Bandwidth (same chart): HDFS3 with large access sizes suffers from read imbalance.

HDFS 1 vs 3: I/O Scheduling (chart): network counters show the HDFS3 read imbalance; the plot shows total size of data served (MB, 0 to 1000) per node number (0 to 60) for PLFS/HDFS1 and PLFS/HDFS3.

I/O Store Status. The initial I/O Store prototype was rewritten as production-level code supporting multiple concurrent I/O Store instances, and the entire backend I/O path was re-plumbed. POSIX, HDFS, and PVFS stores were prototyped, an IOFSL store was done by EMC, and the code was regression tested at LANL. The I/O Store is now part of the released PLFS code: https://github.com/plfs

Conclusions. PLFS extensions for workload transformation: a Logical FS interface (not just container logs; packing small files, burst buffer); an I/O Store layer (non-POSIX backends such as HDFS, IOFSL, and PVFS; compression, write buffering, I/O forwarding); and container index extensions. PLFS is open source and available on github: http://github.com/plfs. Developer email: plfs-devel@lists.sourceforge.net