Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand


Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand. Qi Gao, Weikuan Yu, Wei Huang, Dhabaleswar K. Panda. Network-Based Computing Laboratory, Department of Computer Science & Engineering, The Ohio State University

Introduction Clusters continue to grow in size to achieve higher performance, but high performance does not necessarily mean high productivity. The failure rate of a system grows rapidly with its size, and system failures are becoming an important factor limiting the productivity of large-scale clusters.

Motivation Most end applications are parallelized, and many are written in MPI, which makes them more susceptible to failures. Many research efforts (e.g., MPICH-V, LAM/MPI, FT-MPI, C3) address fault tolerance in MPI. Newly deployed clusters are often equipped with high-speed interconnects for high performance. InfiniBand is an open industry standard for high-speed interconnects, used by many large clusters in the Top 500 list, and clusters with tens of thousands of cores are being deployed. How to achieve fault tolerance for MPI on InfiniBand clusters, providing both high performance and robustness, is therefore an important issue.

Outline Introduction & Motivation Background: InfiniBand; Checkpointing & Rollback Recovery Checkpoint/Restart for MPI over InfiniBand Evaluation Framework Experimental Results Conclusions and Future Work

InfiniBand Native InfiniBand transport services: protocol offloading to the channel adapter (HCA/NIC) and high-performance RDMA operations. Queue-based model: Queue Pairs (QPs) and Completion Queues (CQs), with OS-bypass. Protection and authorization: Protection Domains (PDs), Memory Regions (MRs) and their access keys. [Figures: InfiniBand stack and queuing model, courtesy of the IB Specification.]
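To make the queue-based model concrete, the following is a minimal sketch (not from the slides) of allocating the resources named above, PD, MR, CQ, and QP, with the libibverbs API. The buffer size, queue depths, and absence of error handling are illustrative choices, and the code stops short of connecting the QP to a peer.

```c
/* Minimal libibverbs sketch of the resources in the slide's queuing model.
 * Error handling and connection establishment are omitted for brevity. */
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);

    /* Protection Domain: groups resources for protection/authorization. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Memory Region: pins a buffer and yields lkey/rkey access keys. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Completion Queue: collects work completions without OS involvement. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    /* Queue Pair: send/receive work queues for one connection. */
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;          /* reliable connection transport */
    attr.cap.max_send_wr  = 64;
    attr.cap.max_recv_wr  = 64;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    /* ... connect the QP to a peer, post work requests, poll the CQ ... */

    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```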

Checkpointing & Rollback Recovery Checkpointing and rollback recovery is a commonly used method to achieve fault tolerance. Which checkpointing method is suitable for clusters with high-speed interconnects like InfiniBand? Categories of checkpointing:
- Coordinated. Pros: easy to guarantee consistency. Cons: coordination overhead; all processes must roll back upon a failure.
- Uncoordinated. Pros: no global coordination. Cons: domino effect or message-logging overhead.
- Communication-induced. Pros: guarantees consistency without global coordination. Cons: requires per-message processing; high overhead.

Checkpointing & Rollback Recovery (Cont.) Implementation of checkpointing:
- System level. Pros: can be transparent to user applications; checkpoints can be initiated independently of the application's progress. Cons: must handle the consistency issue.
- Application level. Pros: the content of checkpoints can be customized; portable checkpoint files. Cons: application source code must be rewritten against the checkpointing interface.
- Compiler assisted. Pros: application-level checkpointing without source-code modification. Cons: requires special compiler techniques for consistency.
Our current approach: coordinated, system-level, application-transparent checkpointing.

Outline Introduction & Motivation Background Checkpoint/Restart for MPI over InfiniBand Evaluation Framework Experimental Results Conclusions and Future Work

Overview Checkpoint/Restart for MPI programs over InfiniBand: Berkeley Lab's Checkpoint/Restart (BLCR) is used to take snapshots of individual processes on a single node. A coordination protocol checkpoints and restarts the entire MPI job; it is totally transparent to user applications and does not interfere with the critical path of data communication. The InfiniBand communication channel in the MPI library is suspended and reactivated upon a checkpoint request: network connections over InfiniBand are torn down, channel consistency is maintained, and the whole process is transparent to the upper layers of the MPI library.

Checkpoint/Restart (C/R) Framework [Figure: a global C/R coordinator in the MPI job console communicates, through the process manager and control messages, with a local C/R controller in each MPI process; each process contains a control message manager, a communication channel manager for the point-to-point data connections over the data network, and the C/R library.] In our current implementation: Process Manager: Multi-Purpose Daemon (MPD), developed at ANL, extended with C/R messaging support. C/R Library: Berkeley Lab's Checkpoint/Restart (BLCR).

Global View: Procedure of Checkpointing [Figure: the global C/R coordinator at the MPI job console sends a checkpoint request through the process manager to the local C/R controller in each MPI process.] Upon a checkpoint request, each process moves through the following phases: Running, Initial Synchronization, Pre-checkpoint Coordination, Local Checkpointing, Post-checkpoint Coordination, and back to Running.
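The following is a minimal sketch of what this sequence of phases might look like inside each MPI process. Every helper below (sync_with_peers, suspend_ib_channel, take_local_checkpoint, and so on) is a hypothetical stub introduced only to mark where each phase from the slide would happen; none of them are real MVAPICH2, MPD, or BLCR interfaces.

```c
#include <stdio.h>

static void sync_with_peers(const char *phase) { printf("sync: %s\n", phase); }
static void drain_in_flight_messages(void) { /* flush pending sends/receives */ }
static void suspend_ib_channel(void)       { /* release QPs, CQs, MRs, PDs */ }
static void take_local_checkpoint(void)    { /* hand the process image to the C/R library, e.g. BLCR */ }
static void reactivate_ib_channel(void)    { /* re-register buffers, re-establish connections */ }

/* Run by each MPI process when its local C/R controller receives a
 * checkpoint request from the global coordinator. */
static void on_checkpoint_request(void)
{
    sync_with_peers("initial synchronization");

    /* Pre-checkpoint coordination: make the channel quiescent and consistent,
     * then drop HCA resources that cannot live inside a process image. */
    drain_in_flight_messages();
    suspend_ib_channel();

    /* Local checkpointing: the C/R library snapshots this process. */
    take_local_checkpoint();

    /* Post-checkpoint coordination: rebuild connections and resume MPI. */
    reactivate_ib_channel();
    sync_with_peers("post-checkpoint coordination");
}

int main(void)
{
    on_checkpoint_request();
    return 0;
}
```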

Global View: Procedure of Restarting [Figure: the restart request flows from the MPI job console through the process manager and control messages to each restarted MPI process.] Upon a restart request, each process goes through the phases Restarting, Post-checkpoint Coordination, and then Running.

Local View: InfiniBand Channel in MPI [Figure: the state of the MPI InfiniBand channel across the phases Running, Initial Synchronization, Pre-checkpoint Coordination, Local Checkpointing, and Post-checkpoint Coordination. Above the channel sit the user application and the MPI upper layers. The channel itself holds registered user buffers, channel progress information, dedicated communication buffers, and network connection information (QPs, MRs, CQs, PDs) that map onto the InfiniBand host channel adapter (HCA); peer MPI processes are reached through their HCAs over the InfiniBand fabric, and checkpoints are written to storage.]
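One practical consequence of this picture is that the opaque HCA handles (QPs, CQs, MRs, PDs) cannot be saved in a process image, so the channel has to remember enough about its registered user buffers to rebuild them after a restart. The sketch below illustrates that kind of bookkeeping under assumed names (reg_buf, rebuild_memory_regions); it is not the MVAPICH2 implementation.

```c
/* Hedged sketch: bookkeeping that lets an MPI InfiniBand channel rebuild HCA
 * state after a restart.  The channel records buffer address/length so user
 * buffers can be re-registered once a new PD exists; the old MR handles and
 * keys are stale in the restarted process.  Names are illustrative only. */
#include <stddef.h>
#include <infiniband/verbs.h>

#define MAX_REG_BUFS 1024

struct reg_buf {
    void          *addr;   /* user buffer address (still valid after restart) */
    size_t         len;    /* buffer length                                   */
    struct ibv_mr *mr;     /* stale after restart; must be re-registered      */
};

static struct reg_buf reg_tab[MAX_REG_BUFS];
static int            reg_cnt;

/* Called during post-checkpoint coordination or restart, once a new PD has
 * been allocated on the (re)opened HCA. */
static int rebuild_memory_regions(struct ibv_pd *pd)
{
    for (int i = 0; i < reg_cnt; i++) {
        reg_tab[i].mr = ibv_reg_mr(pd, reg_tab[i].addr, reg_tab[i].len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
        if (!reg_tab[i].mr)
            return -1;      /* registration failed */
    }
    /* New lkeys/rkeys must then be re-exchanged with peers. */
    return 0;
}
```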

Outline Introduction & Motivation Background Checkpoint/Restart for MPI over InfiniBand Evaluation Framework Experimental Results Conclusions and Future Work

OSU MPI over InfiniBand Open-source, high-performance implementations: MPI-1 (MVAPICH) and MPI-2 (MVAPICH2). They have enabled a large number of production InfiniBand clusters all over the world to take advantage of InfiniBand, the largest being the Sandia Thunderbird cluster (4,512 nodes with 9,024 processors). They have been directly downloaded and used by more than 390 organizations worldwide (in 30 countries). Time-tested and stable code base with novel features. Available in the software stack distributions of many vendors, and in the OpenFabrics (OpenIB) Gen2 stack and OFED. More details at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

Evaluation Framework The implementation is based on MVAPICH2 version 0.9.0 and will be released with a newer version of MVAPICH2 soon. Test bed: an InfiniBand cluster with 12 nodes, each with dual Intel Xeon 3.4 GHz CPUs and 2 GB memory, running Red Hat Linux AS 4 with kernel version 2.6.11; an ext3 file system on top of local SATA disks; Mellanox InfiniHost MT23108 HCA adapters. Experiments: analysis of the overhead of taking one checkpoint and restart (NAS Parallel Benchmarks), and the performance impact on applications when checkpointing periodically (NAS Parallel Benchmarks, HPL benchmark, GROMACS).

Outline Introduction & Motivation Background Checkpoint/Restart for MPI over InfiniBand Evaluation Framework Experimental Results Conclusions and Future Work

Checkpoint/Restart Overhead Storage overhead: the checkpoint size is essentially the memory used by the process. Checkpoint size per process: LU.C.8, 126 MB; BT.C.9, 213 MB; SP.C.9, 193 MB. [Figure: time for checkpoint and restart of lu.c.8, bt.c.9, and sp.c.9, broken down into file access and coordination, on a 0-8 second scale.] The time reported is the delay from the issuance of a checkpoint/restart request until the program resumes execution; the checkpoint file is synced to local disk before the program continues. File access time is the dominating factor in checkpoint/restart overhead.

Performance Impact on Applications: NAS Benchmarks [Figure: execution time (0-500 seconds) of lu.c.8, bt.c.9, and sp.c.9 with checkpointing intervals of 1, 2, and 4 minutes, and with no checkpointing.] NAS benchmarks LU, BT, and SP, Class C, on 8-9 processes. For each checkpoint taken, the execution time increases by about 2-3%.

Performance Impact to Applications Performance (GFLOPS) 30 25 20 15 10 5 0 HPL Benchmark 2 min (6) 4 min (2) 8 min (1) None Checkpointing Interval & No. of checkpoints HPL benchmarks, 8 processes. Performs same as original MVAPICH2 when taking no checkpoints For each checkpoint, the performance degradation is about 4%.

Benchmarks vs. Target Applications Benchmarks run for seconds to minutes (checkpointed here every few minutes), load all their data into memory at the beginning, and therefore have a high ratio of memory usage to running time. The target applications are long-running: days, weeks, or months (checkpointed hourly, daily, or weekly); they are computation-intensive or load data into memory gradually, so their ratio of memory usage to running time is low. The benchmarks thus reflect close to a worst-case scenario: checkpointing overhead largely depends on the checkpoint file size (process memory usage), and the relative overhead is very sensitive to this ratio, as the estimate below illustrates.
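As a rough, back-of-the-envelope estimate (not taken from the slides), the relative overhead of periodic checkpointing can be written as

\[
\text{relative overhead} \;\approx\; \frac{N_{\text{ckpt}} \cdot T_{\text{ckpt}}}{T_{\text{run}}},
\qquad
T_{\text{ckpt}} \;\approx\; T_{\text{coord}} + \frac{\text{checkpoint size}}{\text{file-write bandwidth}} .
\]

For example, assuming a checkpoint image of roughly 200 MB per process and a sustained local-disk write bandwidth of about 50 MB/s (both assumptions for illustration), one checkpoint costs a few seconds. Against a NAS benchmark run of a few hundred seconds that is on the order of the 2-4% per checkpoint reported above, whereas against an application running for hours or days the same few seconds become negligible.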

Performance Impact on Applications: GROMACS [Figure: GROMACS execution time (0-1000 seconds) and relative performance impact (0-10%) with checkpointing intervals of 1 minute (14 checkpoints), 2 minutes (7), 4 minutes (3), and 8 minutes (1), and with no checkpointing.] GROMACS is a molecular dynamics package for biochemical analysis; the DPPC dataset was run on 10 processes. It has small memory usage with a relatively long running time. For each checkpoint taken, the execution time increases by less than 0.3%.

Outline Introduction & Motivation Background Checkpoint/Restart for MPI over InfiniBand Evaluation Framework Experimental Results Conclusions and Future Work

Conclusions and Future Work We have designed and implemented a framework to checkpoint and restart MPI programs over InfiniBand that is totally transparent to MPI applications. Evaluations based on NAS, HPL, and GROMACS show that the checkpointing overhead is not significant. Future work: reduce the checkpointing overhead, design a more sophisticated framework for fault tolerance in MPI, and integrate the work into the MVAPICH2 release.

Acknowledgements Our research is supported by the following organizations. [Slide shows logos of the current funding and equipment sponsors.]

Web Pointers {gaoq, yuw, huanwei, panda}@cse.ohio-state.edu; http://nowlab.cse.ohio-state.edu/; MVAPICH web page: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/