Proactive Process-Level Live Migration in HPC Environments

Proactive Process-Level Live Migration in HPC Environments
Chao Wang, Frank Mueller (North Carolina State University); Christian Engelmann, Stephen L. Scott (Oak Ridge National Laboratory)
SC'08, Nov. 20, Austin, Texas

Outline
- Problem vs. Our Solution
- Overview of LAM/MPI and BLCR (Berkeley Lab Checkpoint/Restart)
- Our Design and Implementation
- Experimental Framework
- Performance Evaluation
- Conclusion and Future Work
- Related Work

Problem Statement
- Trends in HPC: high-end systems with > 100,000 processors; MTBF/I becomes shorter
- MPI is widely accepted in scientific computing, but the MPI standard provides no fault-recovery method
- Transparent C/R (checkpoint/restart):
  - coordinated: LAM/MPI with BLCR [LACSI 03]
  - uncoordinated, log-based: MPICH-V [SC 2002]
- Non-transparent C/R: explicit invocation of checkpoint routines, e.g. LA-MPI [IPDPS 2004], FT-MPI [EuroPVM-MPI 2000]
- Frequently deployed C/R helps, but:
  - up to 60% overhead for C/R [I. Philp, HPCRI 05]; a 100-hour job takes 251 hours
  - all job tasks must be restarted: inefficient if only one (or a few) node(s) fails
  - staging overhead
  - requeuing penalty

Our Solution: Proactive Live Migration
- Failure prediction with a prior warning window achieves high accuracy: up to 70% reported [Gu et al., ICDCS 08] [Sahoo et al., KDD 03]; an active research field and the premise for live migration
- Processes on healthy nodes remain active; only processes on unhealthy nodes are migrated live to spare nodes
- (Diagram: a job started with lamboot/mpirun on nodes n0-n3; when a failure is predicted on one node, its process is migrated live to a spare.)
- Hence we avoid: high C/R overhead, restart of all job tasks, staging overhead, job requeue penalty, LAM RTE reboot

Proactive FT Complements Reactive FT [J.W. Young, Commun. ACM 74]
- Tc: time interval between checkpoints
- Ts: time to save checkpoint information (mean Ts for BT/CG/FT/LU/SP Class C on 4/8/16 nodes is 23 seconds)
- Tf: MTBF, 1.25 hours [I. Philp, HPCRI 05]
- Assume 70% of faults [Sahoo et al., KDD 03] can be predicted and handled proactively
- Proactive FT cuts the checkpoint frequency roughly in half (worked numbers below)
- Future work: (1) a better fault model; (2) Ts/Tf measured on a bigger cluster to quantify the complementary effect
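As a back-of-the-envelope check (a sketch using Young's first-order model, which the slide cites, together with the Ts/Tf values above; the slide itself defers a more refined fault model to future work):

T_c = \sqrt{2\,T_s\,T_f}

Reactive FT only:
T_c = \sqrt{2 \cdot 23\,\mathrm{s} \cdot 4500\,\mathrm{s}} \approx 455\,\mathrm{s}

With 70% of faults handled proactively, only 30% remain for C/R, so the effective MTBF grows to T_f / 0.3 = 15000 s:
T_c' = \sqrt{2 \cdot 23\,\mathrm{s} \cdot 15000\,\mathrm{s}} \approx 831\,\mathrm{s} \approx 1.8\,T_c

i.e., checkpoints are taken roughly half as often.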

LAM/MPI Overview
- Modular, component-based architecture with two major layers
- Daemon-based RTE: lamd
- C/R plugs into MPI via the SSI framework: coordinated C/R with BLCR support
- Acronyms: RTE = Run-Time Environment, SSI = System Services Interface, RPI = Request Progression Interface, MPI = Message Passing Interface, LAM = Local Area Multicomputer
- Example: a two-way MPI job on two nodes

BLCR Overview
- Kernel-based C/R: can save/restore almost all resources
- Implemented as a Linux kernel module, which allows upgrades and bug fixes without reboot
- Process-level C/R facility: a single MPI application process
- Provides hooks used for distributed C/R of LAM/MPI jobs

Our Design & Implementation (LAM/MPI)
- Per-node health monitoring mechanism: baseboard management controller (BMC), Intelligent Platform Management Interface (IPMI)
- NEW: decentralized scheduler integrated into lamd
  - notified by BMC/IPMI
  - determines the migration destination
  - triggers migration (a monitoring sketch follows below)
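A minimal sketch of the kind of per-node health-monitoring loop the decentralized scheduler relies on. The talk integrates this into lamd and is notified through BMC/IPMI; here, purely for illustration, the reading is obtained by shelling out to ipmitool, and the threshold, polling interval, naive output parsing, and trigger_migration() hook are assumptions, not the actual implementation.

/* Illustrative per-node health monitor feeding a migration trigger.
 * All names and thresholds are assumptions for this sketch. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TEMP_THRESHOLD_C 70.0   /* assumed deterioration threshold */
#define POLL_INTERVAL_S  5

/* Placeholder: the real scheduler inside lamd would pick a spare node
 * and start the BLCR-based migration here. */
static void trigger_migration(double temp_c)
{
    fprintf(stderr, "health: %.1f C exceeds %.1f C, migrating processes\n",
            temp_c, TEMP_THRESHOLD_C);
}

/* Highest temperature reported by the BMC, read via ipmitool.
 * Example output line: "CPU Temp | 30h | ok | 3.1 | 45 degrees C" */
static double max_temperature(void)
{
    FILE *p = popen("ipmitool sdr type Temperature", "r");
    char line[256];
    double max = -1.0, t;
    if (!p)
        return max;
    while (fgets(line, sizeof line, p)) {
        char *last = strrchr(line, '|');     /* reading is in the last field */
        if (last && sscanf(last + 1, " %lf", &t) == 1 && t > max)
            max = t;
    }
    pclose(p);
    return max;
}

int main(void)
{
    for (;;) {
        double t = max_temperature();
        if (t > TEMP_THRESHOLD_C)
            trigger_migration(t);
        sleep(POLL_INTERVAL_S);
    }
}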

Live Migration Mechanism (LAM/MPI & BLCR)
(Diagram: nodes n0-n3, each running lamd with the scheduler; phases over time: MPI RTE setup, MPI job running, live migration, job execution resume.)
- Step 3 is optional: live migration (with step 3) vs. frozen migration (without it)

Live Migration vs. Frozen Migration
(Diagram: live migration with precopy performs iterative precopy rounds between source and destination node before the final stop&copy; frozen migration skips precopy and performs a stop&copy only.)

Live Migration - BLCR
- A new process is created on the destination node, which creates a thread
- Precopy: transfer dirty pages iteratively (barrier)
- Stop&copy: stop normal execution, save and transfer the remaining dirty pages and registers/signals; the destination restores the registers/signals (barriers in between; in-kernel steps shown as dashed lines/boxes in the diagram)
- Page-table dirty-bit scheme: (1) the dirty bit of the PTE is duplicated; (2) kernel-level functions are extended to set the duplicated bit, without additional overhead
- A sketch of the precopy loop follows below
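A condensed sketch of the iterative precopy loop described on this slide. The helpers (dirty_page_count, send_dirty_pages, and so on) are hypothetical stand-ins for the BLCR kernel extensions that track the duplicated PTE dirty bit and ship pages to the destination, and the termination bounds are assumptions; the real logic lives inside the BLCR kernel module.

/* Sketch of the iterative precopy loop driving live migration.
 * Helper names and termination bounds below are assumptions. */
#include <stddef.h>

size_t dirty_page_count(void);          /* pages dirtied since the last scan   */
void   send_dirty_pages(int dst_sock);  /* transfer them, clear duplicated bits */
void   freeze_process(void);            /* stop normal execution               */
void   send_final_state(int dst_sock);  /* remaining pages + registers/signals */

enum { MAX_ROUNDS = 30, WRITE_SET_FLOOR = 64 /* pages */ };

void migrate(int dst_sock, int do_precopy)
{
    if (do_precopy) {                       /* live migration */
        size_t prev = (size_t)-1;
        for (int round = 0; round < MAX_ROUNDS; round++) {
            size_t dirty = dirty_page_count();
            /* Stop iterating once the write set is small or no longer shrinking. */
            if (dirty <= WRITE_SET_FLOOR || dirty >= prev)
                break;
            send_dirty_pages(dst_sock);
            prev = dirty;
        }
    }
    /* Frozen migration skips straight to this stop&copy phase. */
    freeze_process();
    send_final_state(dst_sock);
}

Frozen migration is simply the do_precopy == 0 path: the process is stopped immediately and all state is copied once.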

Frozen Migration - BLCR
- Choosing live vs. frozen migration (the same conditions also terminate precopy):
  1. thresholds, e.g., a temperature threshold
  2. available network bandwidth, determined by dynamic monitoring
  3. size of the write set
- Future work: a heuristic algorithm based on these conditions (one possible rule is sketched below)
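The slide leaves the heuristic itself as future work; purely as an illustration, one rule combining the three conditions might look like the following sketch, where every helper name and constant is an assumption.

/* Hypothetical decision rule combining the three conditions above.
 * Not the talk's algorithm; all names and constants are assumptions. */
#include <stdbool.h>
#include <stddef.h>

double seconds_until_predicted_failure(void);  /* prior warning window          */
double available_bandwidth_bytes_per_s(void);  /* from dynamic monitoring       */
size_t write_set_bytes(void);                  /* current size of the write set */

/* Return true to attempt live migration (with precopy), false to freeze
 * immediately (stop&copy only). health_critical models condition 1,
 * e.g. a temperature already past a hard threshold. */
bool choose_live_migration(bool health_critical)
{
    if (health_critical)
        return false;

    double warn = seconds_until_predicted_failure();
    double bw   = available_bandwidth_bytes_per_s();
    size_t wset = write_set_bytes();

    /* Rough time to drain the current write set once over the network. */
    double precopy_s = (bw > 0.0) ? (double)wset / bw : 1e9;

    /* Go live only if a few precopy rounds comfortably fit into the
     * prior warning window; otherwise precopy would merely delay the
     * unavoidable stop&copy. */
    return 3.0 * precopy_s < warn;
}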

Experimental Framework
- Experiments conducted on the Opt cluster: 17 nodes (dual-core, dual Opteron 265), 1 Gbps Ethernet, Fedora Core 5 Linux x86_64
- LAM/MPI + BLCR with our extensions
- Benchmarks: NAS Parallel Benchmarks V3.2.1 (MPI version), using BT, CG, FT, LU, and SP (the EP, IS, and MG runs are too short)

Job Execution Time for NPB
(Figure: job execution time for NPB Class C on 16 nodes, comparing no-migration, live, and frozen migration.)
- Migration overhead: the difference in job run time with and without migration

Migration Overhead and Duration
(Figures: migration overhead and migration duration for live and frozen migration; S&C = frozen.)
- Live: 0.08-2.98% overhead; frozen: 0.09-6% of benchmark runtime
- Penalty for the shorter downtime of live migration: prolonged precopy; no significant impact on job run time, but a longer prior warning window is required

Migration Duration and Memory Transferred
(Figures: migration duration and memory transferred.)
- Migration duration is consistent with the amount of memory transferred

Problem Scaling
(Figure: problem-scaling overhead on 16 nodes; S&C = frozen.)
- BT/FT/SP: overhead increases with problem size
- CG/LU: the small downtime is subsumed by the variance of job run time

Task Scaling
(Figure: task-scaling overhead of NPB Class C; S&C = frozen.)
- In most cases, overhead decreases with the number of tasks
- Otherwise no clear trend: the relatively minor downtime is subsumed by job variance

Speedup
(Figure: speedup normalized to 4 nodes for NPB Class C, with and without one migration; the gap between the curves is the lost-in-speedup.)
- FT shows a 0.21 lost-in-speedup: relatively large overhead (8.5 s) vs. a short run time (150 s)
- Limit of migration overhead: proportional to the memory footprint, bounded by the system hardware

Page Access Pattern & Iterative Migration
(Figures: page access pattern of FT; iterative live migration of FT.)
- Page write patterns are in accord with the aggregate amount of transferred memory
- FT: 138/384 MB, i.e., 1200/4600 pages, per 0.1 second

Process-Level vs. Xen Virtualization Migration
- Xen virtualization live migration [A. B. Nagarajan & F. Mueller, ICS 07]
- NPB BT/CG/LU/SP: common benchmarks measured with both solutions on the same hardware
- Xen virtualization: 14-24 seconds for live migration, 13-14 seconds for frozen migration
  - includes a minimum overhead of 13 seconds to transfer the entire memory image of the inactive guest VM (rather than just a subset of the OS image), the price of full transparency
  - requires 13-24 seconds of prior warning to successfully trigger live migration
- Our solution: 2.6-6.5 seconds for live migration, 1-1.9 seconds for frozen migration
  - requires only 1-6.5 seconds of prior warning (which reduces the false-alarm rate)

Conclusion and Future Work
- The design is generic for any MPI implementation and process-level C/R; implemented over LAM/MPI with BLCR
- Cuts the number of checkpoints in half when 70% of faults are handled proactively
- Low overhead: live 0.08-2.98%, frozen 0.09-6%
- No job requeue overhead, less staging cost, no LAM reboot
- Future work: a heuristic algorithm for the tradeoff between live and frozen migration; back migration upon node recovery; measuring how proactive FT complements reactive FT

Related Work
- Transparent C/R: LAM/MPI with BLCR [Sankaran et al., LACSI 03]
- Process migration by scanning and updating checkpoint files [Cao et al., ICPADS 05]: still requires a restart of the entire job
- Log-based (message logging + temporal ordering): MPICH-V [SC 2002]
- Non-transparent C/R with explicit invocation of checkpoint routines: LA-MPI [IPDPS 2004], FT-MPI [EuroPVM-MPI 2000]
- Failure prediction / predictive management: [Gujrati et al., ICPP 07], [Gu et al., ICDCS 08], [Sahoo et al., KDD 03]
- Fault model: evaluation of FT policies [Tikotekar et al., Cluster 07]
- Process migration: MPI-Mitten [CCGrid 06]
- Proactive FT: Charm++ [Chakravorty et al., HiPC 06], etc.

Questions? Thank you!
This work was supported in part by:
- NSF grants CCR-0237570, CNS-0410203, and CCF-0429653
- Office of Advanced Scientific Computing Research, DOE grants DE-FG02-05ER25664 and DE-FG02-08ER25837, and DOE contract DE-AC05-00OR22725
Project websites:
- MOLAR: http://forge-fre.ornl.gov/molar/
- RAS: http://www.fastos.org/ras/