Progress Report on Transparent Checkpointing for Supercomputing

Jiajun Cao, Rohan Garg
College of Computer and Information Science, Northeastern University
{jiajun,rohgarg}@ccs.neu.edu
August 21, 2015

Overview

1. Checkpointing: current status
2. The DMTCP Approach
   - DMTCP and the Software Engineering Burden
3. Towards checkpointing for supercomputing
   - Why are we doing this?
   - Experiments on the MGHPCC cluster
   - Experiments at Stampede (TACC)
   - Micro-benchmarks: improving performance
   - Experiences while scaling up

Checkpointing: current status

- Application-specific checkpointing: requires modifying the user code; difficult to port to new applications.
- BLCR: a single-process, kernel-level solution; MPI implementations and resource managers must add extra checkpoint-restart modules; requires tuning for different kernel versions.
- CRIU for containers: the Linux kernel is configured to expose certain of its internals to user space; difficult to use with MPI.
- DMTCP: no need to modify the OS or the user program; transparent support for distributed applications (TCP/InfiniBand).

The DMTCP Approach

- Direct support for distributed applications out of the box.
- Runs entirely in user space; no root privileges required.
- Transparently interoperates with:
  1. the major MPI implementations (MVAPICH2, Open MPI, Intel MPI, MPICH2);
  2. resource managers (e.g., SLURM, Torque; LSF planned), via the resource-manager (batch-queue) plugin;
  3. high-performance networks, via the InfiniBand plugin;
  4. MPI process managers (e.g., Hydra, PMI, mpispawn, ibrun);
  5. all recent versions of the Linux kernel.
- Extensible: supports application-specific plugins (a minimal sketch follows this slide), for example:
  - a cut-out plugin: if a memory region is all zeros, don't save it;
  - checkpoint only when specified by the application.
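To make the plugin mechanism concrete, below is a minimal sketch of an application-specific plugin in the style of the DMTCP 2.x event-hook interface. The header name, event constants, and the DMTCP_NEXT_EVENT_HOOK macro are written from recollection and should be treated as assumptions; the example plugins shipped with DMTCP are the authoritative reference.

```c
/* Minimal sketch of an application-specific DMTCP plugin (DMTCP 2.x-style
 * event-hook interface; exact names are assumptions, not verified here). */
#include <stdio.h>
#include "dmtcp.h"

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{
    switch (event) {
    case DMTCP_EVENT_WRITE_CKPT:
        /* Runs just before the checkpoint image is written; a "cut-out"
         * plugin could release or skip zero-filled memory regions here. */
        printf("plugin: about to write a checkpoint\n");
        break;
    case DMTCP_EVENT_RESTART:
        /* Runs after restart; re-create resources (files, network state)
         * that could not be captured in the checkpoint image. */
        printf("plugin: restarted from a checkpoint image\n");
        break;
    default:
        break;
    }
    DMTCP_NEXT_EVENT_HOOK(event, data);
}
```

Such a plugin is compiled as a shared library and loaded into the application when it is launched under DMTCP; the InfiniBand and resource-manager plugins mentioned above use the same plugin mechanism.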

DMTCP and the Software Engineering Burden

Lines of code serve as a proxy for the required software effort (all measurements made with cloc):

Component                C/C++ source                 Headers
DMTCP (core)             22,276 (+ 1,657 comments)    6,693 (+ 2,173 comments)
InfiniBand plugin         8,500 (+ 705 comments)        342 (+ 229 comments)
Resource-manager plugin   3,859 (+ 451 comments)        591 (+ 753 comments)

Why are we doing this?

DMTCP-style transparent checkpointing writes full memory dumps. Whether that is practical depends on the storage architecture of the day:

- 1990s (Beowulf clusters): one disk per node; a full memory dump is efficient.
- 2000s (back-end storage): more RAM per compute node, and many compute nodes writing to a small number of storage nodes; full memory dumps become less efficient.
- Future (local SSD storage): a few compute nodes writing to one SSD; a full memory dump is efficient again.

Conclusion: DMTCP-style checkpointing on supercomputers will be practical in the future. The time to plan for this is now!

Experiments on MGHPCC: Runtime overhead

MGHPCC: Massachusetts Green High-Performance Computing Center, http://www.mghpcc.org/

[Figure: Runtime (s) versus number of MPI processes (64 to 2048) for NAS LU classes C, D, and E, run natively and under DMTCP with Open MPI. For jobs running at least 50 seconds, the runtime overhead was small (< 2%); for jobs running less than 50 seconds, the DMTCP startup time was a significant fraction of the total.]

Experiments on MGHPCC: Checkpoint overhead

NAS benchmark   Nodes   Processes per node   Ckpt time (s)   Ckpt size (MB)
LU.E             128            4                 70.8            350
LU.E              64            8                136.6            356
LU.E              32           16                222.6            355
LU.E             128           16                 70.2            117

Table: Checkpoint times and image sizes for the same NAS benchmark under different configurations. The checkpoint image size is for a single MPI process. The checkpoint time is roughly proportional to the total size of the checkpoint images on a single node.

NOTE: These are preliminary results with Open MPI 1.6 and an NFS shared filesystem.

Experiments at Stampede (TACC): Runtime overhead

[Figure: Point-to-point micro-benchmarks as a function of message size (2^0 to 2^24 bytes): latency (us), bandwidth (MB/s), bidirectional bandwidth (MB/s), and multiple bandwidth (MB/s), comparing the configurations "w/o DMTCP", "w/o IB", and "DMTCP + IB".]
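For context, the latency panel reflects the standard ping-pong pattern between two ranks. The sketch below shows the general shape of such a micro-benchmark; it is not the benchmark used on Stampede, and the iteration count and message-size range are arbitrary choices. Running it natively and again under DMTCP (with and without the InfiniBand plugin) gives the kind of comparison plotted above.

```c
/* Generic MPI ping-pong latency sketch (illustrative only; not the
 * benchmark used in these experiments). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* needs two active ranks */

    for (long size = 1; size <= (1L << 24); size <<= 1) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* report one-way latency in microseconds */
            printf("%ld bytes: %.2f us\n", size,
                   (t1 - t0) * 1e6 / (2.0 * iters));
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```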

Experiments at Stampede (TACC): Runtime overhead

NAS benchmark   Processes   Native (s)      With DMTCP (s)   Overhead (%)
LU.C                    2    354.7           356.7             0.6
LU.C                    4    179.9           180.5             0.3
LU.C                    8     95.3            95.8             0.5
LU.D                   16   1104.0          1104.9             0.0
LU.D                   32    497.0           508.4             2.2
LU.D                   64    293.5           296.5             1.0
LU.D                  128    156.9           159.0             1.3
LU.E                  256   1604.8          1635.2             1.9
LU.E                  512    619.3           628.9             1.6
LU.E                 1024    535.3           544.6             1.7
LU.E                 2048    168.7 (162.0)   184.7 (165.4)     9.5 (2.0)
LU.E                 4096     94.4 (86.7)    167.5 (91.3)     77.4 (5.0)

Table: Runtime overhead for the NAS LU benchmark. MVAPICH2-1.9 with the MV2_ON_DEMAND_THRESHOLD environment variable set was used up to 1024 cores; MVAPICH2-2.1 was used for 2048 and 4096 cores. The numbers in parentheses are the actual computation times as reported by the benchmark.

Experiments at Stampede (TACC): Checkpoint overhead

Checkpoint times are approximately proportional to the image size. For a given problem class, the image size decreases as the number of processes grows.

NAS benchmark   Processes   Ckpt time (s)   Ckpt size (MB)   Restart time (s)
LU.C                    2       16.0              330               7.5
LU.C                    4        8.6              174               4.3
LU.C                    8        5.0               98               3.2
LU.D                   16       32.0              650              12.9
LU.D                   32       16.9              350               8.1
LU.D                   64        9.9              190               6.5
LU.D                  128        6.1              115               4.9
LU.E                  256       33.8              700              18.9
LU.E                  512       19.6              370              19.1
LU.E                 1024       12.4              210              11.3
LU.E                 2048   (in progress)    (in progress)    (in progress)
LU.E                 4096   (in progress)    (in progress)    (in progress)

Table: Checkpoint times and image sizes for the NAS LU benchmark.

Micro-benchmarks: improving performance

- A performance bug in DMTCP was found through micro-benchmarks (valloc); see the sketch after this list.
- A performance bug in the InfiniBand plugin was found through micro-benchmarks (malloc); a fix is planned.
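The slides do not show the micro-benchmarks themselves. As an illustration of the kind of test that exposes such overheads, the sketch below times repeated malloc()/valloc() calls; comparing the native timing against a run under DMTCP highlights wrapper or allocation-handling costs of the sort referenced above. The sizes and iteration counts are arbitrary.

```c
/* Illustrative allocation micro-benchmark: time malloc()/valloc() pairs.
 * Run natively and again under DMTCP to compare per-call overhead.
 * (Not the authors' benchmark; sizes and counts are arbitrary.) */
#define _DEFAULT_SOURCE   /* for valloc() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int iters = 100000;
    const size_t size = 64 * 1024;

    double t0 = now();
    for (int i = 0; i < iters; i++) {
        void *a = malloc(size);    /* regular allocation */
        void *b = valloc(size);    /* page-aligned allocation */
        free(a);
        free(b);
    }
    double t1 = now();

    printf("avg time per malloc+valloc pair: %.3f us\n",
           (t1 - t0) * 1e6 / iters);
    return 0;
}
```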

Experiences while scaling up

- 4, 32, and 64 cores:
  - Lustre filesystem bug (https://jira.hpdd.intel.com/browse/lu-6528); workaround: verify each mkdir by calling stat (see the sketch after this list).
  - Multiple processes on a compute node: fixes to DMTCP's code that handles shared memory segments.
  - Raised the value of MV2_ON_DEMAND_THRESHOLD.
- 128 cores: the external PMI socket had to be added to the list of external sockets.
- 2048 cores: initialization was too slow with MVAPICH2-1.9; switched to MVAPICH2-2.1.
- 16,384 cores: ???
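The mkdir workaround can be illustrated with a small sketch: after mkdir() returns, the code confirms with stat() that the directory is really visible before relying on it. This is an illustrative reconstruction under our own naming (mkdir_verified is a hypothetical helper), not DMTCP's actual code.

```c
/* Sketch of the mkdir-then-stat workaround for the Lustre issue above.
 * mkdir_verified() is a hypothetical helper name; retry counts and delays
 * are arbitrary. */
#define _DEFAULT_SOURCE   /* for usleep() */
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int mkdir_verified(const char *path, mode_t mode)
{
    struct stat st;

    if (mkdir(path, mode) != 0 && errno != EEXIST) {
        perror("mkdir");
        return -1;
    }

    /* The directory may not be immediately visible; poll with stat(). */
    for (int attempt = 0; attempt < 10; attempt++) {
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
            return 0;               /* directory confirmed */
        usleep(100 * 1000);         /* wait 100 ms and retry */
    }
    fprintf(stderr, "%s still not visible after mkdir\n", path);
    return -1;
}

int main(void)
{
    /* Example usage with a hypothetical checkpoint-directory name. */
    return mkdir_verified("./ckpt_dir", 0700) == 0 ? 0 : 1;
}
```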

Thanks!

Questions?