
Rollback-Recovery Protocols for Send-Deterministic Applications
Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello

Fault Tolerance in HPC Systems is Mandatory
- Resilience is a key problem at very large scale: an MTBF of a few hours is expected at exascale.
- Rollback-recovery is needed to allow applications to terminate: information is saved on reliable storage, based on checkpoints.
- Power consumption is another major issue: limit the amount of rolled-back computation in the event of a failure.

System Model
- MPI applications: a finite set of processes.
- Asynchronous distributed system: processes communicate by exchanging messages; causal dependencies between process states follow Lamport's happened-before relation (recalled below).
- FIFO, reliable channels.
- Fail-stop failure model; multiple concurrent failures.
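For reference, the happened-before relation used here is the standard Lamport definition, nothing specific to this work:

    % Lamport's happened-before relation (standard definition, for reference)
    e \rightarrow e' \iff
    \begin{cases}
      e \text{ and } e' \text{ are events of the same process and } e \text{ precedes } e', \text{ or}\\
      e \text{ is the sending of a message } m \text{ and } e' \text{ is its reception, or}\\
      \exists\, e'' \;:\; e \rightarrow e'' \text{ and } e'' \rightarrow e'.
    \end{cases}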

Coordinated Checkpointing is not the Solution
[Space-time diagram: processes P0-P3 exchanging messages m0-m5.]

Coordinated Checkpointing is not the Solution
- Synchronization at checkpoint time ensures a consistent global state.
- Easy to implement; efficient garbage collection; works for non-deterministic applications.

Coordinated Checkpointing is not the Solution
- All checkpoints are written to reliable storage at the same time, putting high stress on the file system.

Coordinated Checkpointing is not the Solution
- A single failure makes all processes roll back; recovery is costly in terms of power consumption.

The Existing Alternatives also have Drawbacks
[Space-time diagram: processes P0-P3 with uncoordinated checkpoints.]
- Uncoordinated checkpointing: checkpoints can be scheduled.

The Existing Alternatives also have Drawbacks
- Uncoordinated checkpointing: checkpoints can be scheduled.
- Suffers from the domino effect: orphan messages have to be rolled back.
- Recovery and garbage collection are complex.

The Existing Alternatives also have Drawbacks
- Message logging protocols can be combined with uncoordinated checkpointing without the domino effect; only a subset of the processes has to roll back after a failure.
- Message contents and delivery events have to be saved.
- Assumption: piecewise deterministic applications, i.e. the only non-deterministic events are message reception events.

Many HPC Applications are Send-Deterministic
- Definition of send-determinism: given a set of input parameters, the sequence of message sendings is the same in any correct execution; the message reception order does not change the processes' behavior (illustrated in the sketch below).
- Static analysis of 27 HPC applications (Cappello 2010): NAS benchmarks, 6 NERSC benchmarks, 2 USQCD benchmarks, 6 Sequoia benchmarks, SpecFEM3D, Nbody, Ray2mesh, ScaLAPACK SUMMA.
- 25 out of 27 are send-deterministic.
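To make the property concrete, here is a minimal MPI sketch (my own illustration, not one of the applications from the study): rank 0 receives partial results with MPI_ANY_SOURCE, so the reception order can vary between runs, yet for a given input every correct execution produces the same sequence of sends. The program is therefore send-deterministic even though it is not piecewise deterministic.

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank + 1;                    /* stands in for real work  */
        int total = local;
        if (rank == 0) {
            for (int i = 1; i < size; i++) {     /* arrival order may differ */
                int part;
                MPI_Recv(&part, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                total += part;                   /* sum is order-insensitive */
            }
            /* The sends below (destinations, tags, contents) are identical
             * in every correct run: send-deterministic, although the
             * reception order above is not fixed. */
            for (int i = 1; i < size; i++)
                MPI_Send(&total, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
        } else {
            MPI_Send(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&total, 1, MPI_INT, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }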

Contributions
- An uncoordinated checkpointing protocol without domino effect: a small subset of logged messages, partial restart, high performance on failure-free execution (MX).
- Taking applications' communication patterns into account: improving the protocol using clustering techniques, about 50% of rolled-back processes on average, at most 50% of messages logged.
- A new hierarchical cluster-based protocol: message logging between clusters limits the rolled-back processes to one cluster.

Uncoordinated Checkpointing without Domino Effect
[Space-time diagram: processes P0-P3, messages m0-m5.]
- Consequence of send-determinism: orphan messages don't need to be rolled back, so the domino effect is avoided.

Uncoordinated Checkpointing without Domino Effect
- Avoiding logging all messages: processes roll back to send the missing messages again.


Uncoordinated Checkpointing without Domino Effect
- Logging the messages that could lead to a domino effect, using sender-based message logging.

Uncoordinated Checkpointing without Domino Effect
[Space-time diagram: processes P0-P3, each moving from Epoch=1 to Epoch=2; messages m0-m5.]
- Logging the messages that could lead to a domino effect, using sender-based message logging.
- Using epoch numbers: a message is logged when Sender Epoch < Receiver Epoch (see the sketch below).
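A minimal sketch of the epoch-based, sender-side logging decision (my own illustration with hypothetical names, not the MPICH2 code):

    #include <stdlib.h>

    /* Hypothetical sketch: a sender-side copy is kept as a log entry only
     * when the message travels from a lower epoch to a higher one. */
    typedef struct {
        int    dest;            /* destination rank                */
        int    sender_epoch;    /* sender's epoch at send time     */
        size_t len;
        void  *copy;            /* sender-side copy of the payload */
    } envelope_t;

    /* Called once the receiver's epoch is known (e.g. from an ack). */
    static void on_receiver_epoch(envelope_t *env, int receiver_epoch)
    {
        if (env->sender_epoch < receiver_epoch) {
            /* Crosses into a later epoch: keep the copy as the message log,
             * so it can be replayed without rolling the receiver back. */
            return;
        }
        /* Otherwise the sender itself would roll back far enough to re-send
         * this message, so no log is needed and the copy can be released. */
        free(env->copy);
        env->copy = NULL;
    }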

Managing Causal Dependencies During Recovery
[Space-time diagram: processes P0-P3; messages m0, m1, m3, m4, m5.]

Managing Causal Dependencies During Recovery
- Restarting from an inconsistent state: messages that are causally dependent could be sent at the same time.

Managing Causal Dependencies During Recovery
- Restarting from an inconsistent state: messages that are causally dependent could be sent at the same time.
- Causal dependency paths are broken by checkpoints and by logged messages.

Managing Causal Dependencies During Recovery
[Space-time diagram: the same execution annotated with phase numbers 1 to 4 on processes P0-P3.]
- Using phase numbers to show when a causality path is broken.
- Similar to Lamport clocks: incremented when a causality path is broken (see the sketch below).
- Allows ordering of causally dependent messages during recovery.
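A hypothetical sketch of the kind of update rule the slide describes; the exact rules are given in the authors' technical report, this only illustrates that the counter behaves like a Lamport clock that advances wherever a causality path is broken:

    /* Hypothetical sketch, not the actual protocol rules. */
    typedef struct { int phase; } proc_state_t;

    /* A checkpoint breaks the causality path through this process. */
    static void on_checkpoint(proc_state_t *p)            { p->phase += 1; }

    /* The current phase is piggybacked on every outgoing message. */
    static int  phase_to_piggyback(const proc_state_t *p) { return p->phase; }

    /* On reception, inherit causal knowledge as with Lamport clocks; a
     * logged message also counts as a break in the causality path. */
    static void on_receive(proc_state_t *p, int msg_phase, int msg_was_logged)
    {
        if (msg_phase > p->phase)
            p->phase = msg_phase;
        if (msg_was_logged)
            p->phase += 1;
    }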

Our Prototype in MPICH2
- Communication management in CH3/Nemesis: implementation over TCP and Myri-10G (MX), message logging.
- Rollback-recovery management in HYDRA (the process manager): uncoordinated process checkpointing (BLCR), computation of the set of processes to roll back using a centralized process.
- Process restart (ongoing work with the MPICH2 team): restarting one process without restarting the whole application.

Message Logging
- Acknowledgements are used to detect which messages to log, by comparing the epoch of the sender with the epoch of the receiver.
- Sending an ack for every message is too costly for latency, hence an optimized implementation (sketched below):
  - small messages (< 1 KB) are copied without waiting for the ack;
  - acks are piggybacked on messages;
  - only logged messages generate an explicit ack.
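A possible shape for the optimized send path (illustrative sketch, not the CH3/Nemesis code; the threshold is the 1 KB value from the slide):

    #include <stdlib.h>
    #include <string.h>

    #define EAGER_COPY_THRESHOLD 1024   /* 1 KB, as on the slide */

    typedef struct {
        int    dest;
        int    sender_epoch;
        size_t len;
        void  *copy;            /* NULL until a copy is taken */
    } pending_send_t;

    static void prepare_send(pending_send_t *ps, const void *buf, size_t len,
                             int dest, int sender_epoch)
    {
        ps->dest = dest;
        ps->sender_epoch = sender_epoch;
        ps->len = len;
        if (len < EAGER_COPY_THRESHOLD) {
            /* Small message: copy eagerly, decide later (keep or free) once
             * the receiver's epoch arrives, piggybacked or in an explicit ack. */
            ps->copy = malloc(len);
            memcpy(ps->copy, buf, len);
        } else {
            /* Large message: defer the copy decision until the ack arrives,
             * so unlogged bulk data is never duplicated (sketch only; the real
             * protocol must also keep the user buffer valid until then). */
            ps->copy = NULL;
        }
    }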

Experimental Setup
- Lille cluster of Grid5000: 45 nodes, each with 2 Intel Xeon E5440 quad-core processors, 8 GB of memory, a 10G-PCIE-8A-C Myri-10G NIC, Linux kernel 2.6.26.
- NetPipe ping-pong test over MX: latency and bandwidth.
- Performance evaluation on 3 NAS benchmarks.

Performance Evaluation
- At most 0.5 µs (15%) latency overhead for small messages.
- 40% bandwidth reduction for large messages when they are logged.

Performance Evaluation
- NAS performance over MX: almost no impact without logging, at most 5% overhead when all messages are logged.

Computation of the Recovery Line
- Off-line computation: processes flush data about the messages they send every 30 s, and the recovery line is computed off-line considering every single-failure scenario.
- For 1 failure, on 6 NAS benchmarks (class D, 128 processes): all processes roll back in almost every case.

Conclusion (First Part)
- An uncoordinated checkpointing protocol for send-deterministic applications: no domino effect (easy garbage collection), few messages logged, partial restart allowed.
- Prototype evaluation: good performance on failure-free execution (MX), but all processes roll back in case of one failure.
- Technical report: full description of the protocol and proof.
- Summary: checkpoint coordination is avoided, but recovery is costly.

Taking Communication Patterns Into Account
- Improving our uncoordinated protocol using process clustering.
- Ongoing work: a new hierarchical checkpointing protocol based on process clustering for send-deterministic applications.

Improving our Protocol
- Our protocol logs based on epochs: a message going from epoch E1 to epoch E2 is logged if E1 < E2.
- Basic idea: create clusters of processes and force message logging between clusters by assigning them different epoch numbers, taking communication patterns into account to minimize the number of logged messages, as sketched below.
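A hypothetical sketch (illustrative names, not the MPICH2 code) of cluster-based epoch assignment: every process of cluster c starts in epoch base_epoch[c], so inter-cluster messages satisfy E1 < E2 in one direction and are logged, while intra-cluster traffic is not. The values mirror the epoch numbers on the clustering slide that follows.

    #include <stdbool.h>

    #define NUM_CLUSTERS 5
    static const int base_epoch[NUM_CLUSTERS] = { 0, 2, 4, 6, 10 };

    /* Initial epoch of every process in a given cluster. */
    static int initial_epoch(int cluster_id)
    {
        return base_epoch[cluster_id];
    }

    /* Same rule as in the base protocol: log iff the sender's epoch is smaller. */
    static bool must_log(int sender_epoch, int receiver_epoch)
    {
        return sender_epoch < receiver_epoch;
    }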

Process Clustering
[Figure: processes grouped into clusters, each cluster with its own epoch number (Ep 0, Ep 2, Ep 4, Ep 6, Ep 10); messages between clusters are logged, so there is no rollback propagation between clusters.]

Theoretical Limits
- Average number of clusters to roll back (p clusters): a failure in the 1st cluster makes p clusters roll back, a failure in the 2nd cluster makes p-1 clusters roll back, ..., a failure in the last cluster makes 1 cluster roll back, i.e. (p+1)/2 clusters on average (derivation below).
- Maximum number of logged messages: let A be the set of intra-cluster messages, B the set of logged inter-cluster messages and C the set of non-logged inter-cluster messages. If B held more than 50% of the messages then C would hold less than 50%, and reversing the epoch order of the clusters swaps B and C, so at most 50% of the messages ever need to be logged.
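The (p+1)/2 figure is simply the mean over the p failure locations, assuming a failure is equally likely to hit any of the p clusters:

    \frac{1}{p}\sum_{i=1}^{p}(p-i+1)
      \;=\; \frac{1}{p}\sum_{k=1}^{p}k
      \;=\; \frac{1}{p}\cdot\frac{p(p+1)}{2}
      \;=\; \frac{p+1}{2}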

Experimental Results
Class D NAS benchmarks, 128 processes; %log: percentage of logged messages, %rl: percentage of rolled-back processes.

    Cluster size:      32            16             8
                  %log   %rl    %log   %rl    %log   %rl
    BT            13.0   62.6   25.2   56.4   36.7   53.3
    CG             2.9   62.5    3.4   56.3   15.0   43.8
    FT            37.3   62.5   43.6   56.0   46.8   53.0
    LU            10.3   62.5   24.1   56.3   25.9   42.1
    MG             9.5   62.5   17.1   56.3   25.4   42.1


Conclusion
- An uncoordinated checkpointing protocol for send-deterministic applications: no domino effect, no process synchronization, partial restart, good performance on failure-free execution (MX).
- Process clustering: limits the number of processes to roll back to about 50% on average (reducing energy consumption) while logging only a small fraction of the messages (at most 50%).

Ongoing Work
- A hierarchical protocol based on process clustering: message logging between clusters limits the consequences of a failure to one cluster.
- Based on send-determinism: the logged messages are still valid after a rollback; phase numbers are needed to deal with causal dependencies.
- Coordinated or uncoordinated checkpointing inside a cluster.
- Implementation in MPICH2, in collaboration with the Charm++ team (Esteban Meneses, Zhihui Dai); considering a single failure; first prototype working.

Rollback-Recovery Protocols for Send-Deterministic Applications
Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir and Franck Cappello