CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Hao Wang, Jian Huang, Dhabaleswar K. Panda
Department of Computer Science & Engineering, The Ohio State University

2 Outline: Introduction and Motivation, Checkpoint Profiling and Analysis, CRFS: A Lightweight User-Level Filesystem, Performance Evaluation, Conclusions and Future Work

3 Introduction: High Performance Computing (HPC) keeps growing in scale and complexity, so the mean time between failures (MTBF) is shrinking and fault tolerance is becoming imperative. Checkpoint/Restart (C/R) is the most widely adopted fault-tolerance approach. Phase 1: build a globally consistent state (suspend communications). Phase 2: create a snapshot of every process and save it to a shared parallel filesystem. Phase 3: resume communications and execution. (A minimal sketch of these three phases follows below.)
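
As a reading aid, here is a minimal sketch of the three-phase coordinated checkpoint flow described above. It is an illustration only: the function names (suspend_communication, write_snapshot, resume_communication) and the snapshot path are hypothetical placeholders, not the API of any particular MPI stack or checkpoint library.

```c
#include <stdio.h>

/* Hypothetical stubs for the three phases (placeholders, not a real API). */
static void suspend_communication(void)
{
    /* Phase 1: drain in-flight messages so all processes reach a
     * globally consistent state before any snapshot is taken. */
    printf("phase 1: communications suspended\n");
}

static void write_snapshot(const char *path)
{
    /* Phase 2: every process writes its own memory image to the
     * shared parallel filesystem (the heavy I/O step). */
    printf("phase 2: snapshot written to %s\n", path);
}

static void resume_communication(void)
{
    /* Phase 3: re-establish channels and continue execution. */
    printf("phase 3: execution resumed\n");
}

int main(void)
{
    suspend_communication();
    write_snapshot("/parallel-fs/ckpt/rank0.img");  /* hypothetical path */
    resume_communication();
    return 0;
}
```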

4 Problems with Basic C/R: Checkpoint/Restart mechanisms incur intensive I/O overhead: a sheer amount of data, simultaneous IO streams that lead to severe contention, and large variations in individual process completion times. Many studies tackle this I/O bottleneck, but they are performed inside a specific MPI stack, checkpoint library, or application, so they are not portable and are constrained to certain MPI stacks.

5 Problem Statements: What are the primary causes of the intensive I/O overhead of Checkpoint/Restart? How can we design a portable, generic solution with optimizations that improve C/R performance? Can such a portable solution benefit a wide range of MPI stacks, and what are the performance benefits?

6 Outline: Introduction and Motivation, Checkpoint Profiling and Analysis, CRFS: A Lightweight User-Level Filesystem, Performance Evaluation, Conclusions and Future Work

7 MVAPICH/MVAPICH2 Software: MVAPICH is MPI over InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE), with MVAPICH (MPI-1) and MVAPICH2 (MPI-2) variants. It is used by more than 1,650 organizations worldwide (in 63 countries) and empowers many TOP500 clusters (7th, 17th). It is available with the software stacks of many IB, 10GigE/iWARP, RoCE, and server vendors, including the OpenFabrics Enterprise Distribution (OFED), and with Red Hat and SuSE distributions (http://mvapich.cse.ohio-state.edu/). It has supported Checkpoint/Restart and Process Migration for the last several years and is already used by many organizations.

8 Checkpoint Writing Profiling (1): The checkpoint writing information [1] shows lots of small/medium writes, i.e., inefficient IO. Setup: NAS Parallel Benchmark LU.C.64; compute nodes: dual quad-core Xeon; MVAPICH2-1.6 with Checkpoint/Restart support; checkpoint to ext3. Checkpoint size: 1,472 MB; VFS write calls per node: 7,800. [1] X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," ICPP 2009.

9 Checkpoint Writing Profiling (2): The checkpoint write time (LU.C.64, 64 processes on 8 nodes) shows contention caused by concurrent writes and a wide variation in completion time: faster processes have to wait for their slower counterparts, slowing down the checkpoint as a whole.

10 Optimize IO at Different Layers: Optimizations in specific MPI stacks only benefit certain MPI implementations. Optimizations in the checkpoint library require changes to kernel modules and are not portable. How do we get both performance and portability at the same time?

11 Our Approach: CRFS, a generic Checkpoint/Restart Filesystem. CRFS is a user-level filesystem optimized for checkpoint I/O. Its user-level design makes it portable, and optimizations inside the filesystem transparently benefit a wide range of MPI stacks and applications.

12 Outline: Introduction and Motivation, Checkpoint Profiling and Analysis, CRFS: A Lightweight User-Level Filesystem, Performance Evaluation, Conclusions and Future Work

13 CRFS Design Strategies: CRFS is based on FUSE, a user-level filesystem framework. It intercepts VFS write system calls and aggregates many writes into bigger (and fewer) chunks for better IO efficiency. An internal IO thread pool asynchronously writes the bigger data chunks to the back-end filesystem (ext3, NFS, Lustre, etc.), reducing IO contention. (A sketch of the aggregation idea follows below.)
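
Below is a minimal sketch, in plain C, of the write-aggregation idea: many small writes are coalesced into fixed-size chunks, and only whole chunks reach the backend. It is an illustration under stated assumptions, not CRFS source code; the 128 KB chunk size matches the value evaluated later in the talk, while the helper names (crfs_buffered_write, crfs_flush) and the output path are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE (128 * 1024)     /* 128 KB chunks, as evaluated later in the talk */

static char   chunk[CHUNK_SIZE];
static size_t chunk_used = 0;

/* Push the currently buffered data to the backend file as one large write. */
static void crfs_flush(int backend_fd)
{
    if (chunk_used > 0) {
        (void)write(backend_fd, chunk, chunk_used);
        chunk_used = 0;
    }
}

/* Intercepted write path: absorb small writes, emit large sequential ones. */
static void crfs_buffered_write(int backend_fd, const char *buf, size_t len)
{
    while (len > 0) {
        size_t room = CHUNK_SIZE - chunk_used;
        size_t n = (len < room) ? len : room;
        memcpy(chunk + chunk_used, buf, n);
        chunk_used += n;
        buf += n;
        len -= n;
        if (chunk_used == CHUNK_SIZE)
            crfs_flush(backend_fd);             /* chunk is full: write it out */
    }
}

int main(void)
{
    int fd = open("/tmp/ckpt.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    char small[4096] = {0};
    for (int i = 0; i < 1000; i++)              /* many 4 KB application writes ... */
        crfs_buffered_write(fd, small, sizeof(small));
    crfs_flush(fd);                             /* ... leave the buffer as few 128 KB writes */

    close(fd);
    return 0;
}
```

The point of the sketch is the shape of the I/O: 1,000 scattered 4 KB writes become about 32 sequential 128 KB writes to the backend, which is the "bigger and fewer chunks" behavior the slide describes.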

14 CRFS Design [architecture diagram: MPI processes and the checkpoint library issue VFS writes, which FUSE routes into CRFS; CRFS stages the data in a buffer pool and a work queue, and an IO thread pool writes it to the backend filesystems (ext3, Lustre, NFS)]. Design choices: buffer pool size and IO thread pool size. (A sketch of the work queue and IO thread pool follows below.)
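
The diagram's work queue and IO thread pool can be sketched with POSIX threads as below. This is an assumption-laden illustration, not CRFS code: the structures and names (struct chunk, enqueue_chunk, io_worker) are invented for the example, and only the overall pattern reflects the design described in the talk: a producer enqueues full chunks, and a small pool of threads drains the queue and issues the backend writes asynchronously.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_IO_THREADS 4            /* 4 IO threads performed best in the evaluation */

struct chunk {                      /* one aggregated write destined for the backend */
    int    fd;
    off_t  offset;
    size_t len;
    char  *data;
    struct chunk *next;
};

static struct chunk   *queue_head = NULL, *queue_tail = NULL;
static int             shutting_down = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Producer side: the intercepted write path hands over a filled chunk. */
static void enqueue_chunk(struct chunk *c)
{
    c->next = NULL;
    pthread_mutex_lock(&qlock);
    if (queue_tail) queue_tail->next = c; else queue_head = c;
    queue_tail = c;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* Consumer side: each IO thread pops chunks and writes them to the backend. */
static void *io_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (!queue_head && !shutting_down)
            pthread_cond_wait(&qcond, &qlock);
        struct chunk *c = queue_head;
        if (c) {
            queue_head = c->next;
            if (!queue_head) queue_tail = NULL;
        }
        pthread_mutex_unlock(&qlock);
        if (!c)
            break;                                        /* shutting down, queue drained */
        (void)pwrite(c->fd, c->data, c->len, c->offset);  /* bulk write to ext3/NFS/Lustre */
        free(c->data);
        free(c);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_IO_THREADS];
    for (int i = 0; i < NUM_IO_THREADS; i++)
        pthread_create(&tid[i], NULL, io_worker, NULL);

    /* Stand-in for the FUSE write path: enqueue one 128 KB chunk. */
    int fd = open("/tmp/ckpt.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct chunk *c = malloc(sizeof(*c));
    c->fd = fd; c->offset = 0; c->len = 128 * 1024;
    c->data = calloc(1, c->len);
    enqueue_chunk(c);

    /* Signal shutdown; workers drain whatever is left before exiting. */
    pthread_mutex_lock(&qlock);
    shutting_down = 1;
    pthread_cond_broadcast(&qcond);
    pthread_mutex_unlock(&qlock);
    for (int i = 0; i < NUM_IO_THREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}
```

Because the write path only touches the in-memory queue, the checkpoint library's write() calls return quickly, and the heavy backend I/O is overlapped by the worker threads, which is the contention-reducing, asynchronous bulk-IO behavior the slide describes.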

15 Outline: Introduction and Motivation, Checkpoint Profiling and Analysis, CRFS: A Lightweight User-Level Filesystem, Performance Evaluation, Conclusions and Future Work

16 Experimental Setup. Environment: dual-socket quad-core Xeon nodes, InfiniBand DDR, Linux 2.6.30, FUSE 2.8.5; NAS Parallel Benchmarks (NPB) 3.3, LU with class B/C/D inputs; MVAPICH2-1.6rc3, OpenMPI 1.5.1, and MPICH2 1.3.2p1, all with Checkpoint/Restart support and with no modifications to any MPI stack required; backend filesystems: ext3, NFSv3, Lustre 1.8.3 (3 OSS, 1 MDS, o2ib transport). Experiments: single-node raw IO bandwidth (to select proper design parameters); checkpoint time with the 3 MPI stacks (to evaluate CRFS performance); CRFS scalability with varying levels of IO multiplexing; CRFS's capability to reduce the variation in checkpoint completion time.

17 CRFS Raw Write Bandwidth: with 8 processes writing concurrently on a node, a 16 MB buffer pool generates good throughput, a 128 KB chunk size performs well, and 4 IO threads yield the best performance in most cases.

18 Checkpoint Sizes

19 Checkpoint Time (MVAPICH2): On Lustre, CRFS is 5.5X / 4.5X / 1.3X faster than native with class B/C/D inputs. On ext3, CRFS is 2.8X / 2.2X / 1.1X faster than native with class B/C/D inputs. With NFS, a single NFS server cannot handle the heavy IO, so the FUSE overhead is manifested.

20 Checkpoint Time (MPICH2): On Lustre, CRFS is 11X / 9.3X / 1.3X faster than native with class B/C/D inputs. On ext3, CRFS is 7X / 8X / 6.9X faster than native with class B/C/D inputs.

21 Checkpoint Time (OpenMPI): On Lustre, CRFS is 11.5X / 1.4X faster than native with class B/D inputs. On ext3, CRFS is 5.5X / 5.2X / 1.6X faster than native with class B/C/D inputs.

22 CRFS: Multiplexing Scalability: LU.D, varying the number of processes per node, run with MVAPICH2-1.6, checkpointing to Lustre with and without CRFS. CRFS gives a ~30% reduction in checkpoint time: it effectively reduces node-level IO multiplexing contention and diminishes the checkpoint writing overhead.

23 Checkpoint Completion Time: LU.C.64, run with MVAPICH2, shows a wide variation of completion time without CRFS. CRFS diminishes IO contention, reducing the wait for completion and allowing faster resumption of execution.

24 Outline: Introduction and Motivation, Checkpoint Profiling and Analysis, CRFS: A Lightweight User-Level Filesystem, Performance Evaluation, Conclusions and Future Work

25 Conclusions: Checkpoint writing incurs intensive IO overhead: a sheer amount of data, contention from concurrent writes, and wide variation in write completion times; existing optimizations are neither portable nor generic. CRFS is a user-level filesystem that addresses this. Portable: as a user-level filesystem it works with any MPI stack without modifications. High performance: write aggregation, reduced contention, and asynchronous bulk IO. Generic: the optimizations live inside the filesystem, so they can work with any MPI stack or IO-intensive application.

26 Future Work: how to optimize inter-node concurrent IO using CRFS, and how to extend CRFS to benefit other generic IO-intensive applications.

27 Thank you! http://mvapich.cse.ohio-state.edu {ouyangx, rajachan, besseron, wangh, huangjia, panda}@cse.ohio-state.edu Network-Based Computing Laboratory

28 Related Work: PLFS [1] (Parallel Log Filesystem) deals with the N-1 scenario and cannot handle MPI system-level checkpoints (N-N). Optimizations inside the MPI library: [2] performs write aggregation in the MPI and BLCR libraries but requires modifications to a kernel module, so it is not portable; [3] performs node-level buffering of data but is specific to only one MPI stack.
[1] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: A Checkpoint Filesystem for Parallel Applications," in Proc. of SC '09, 2009.
[2] X. Ouyang, K. Gopalakrishnan, and D. K. Panda, "Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems," ICPP 2009.
[3] J. Hursey, J. Squyres, T. Mattox, and A. Lumsdaine, "The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI," in 12th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, 2007.