Scalable and Fault Tolerant Failure Detection and Consensus


EuroMPI'15, Bordeaux, France, September 21-23, 2015

Scalable and Fault Tolerant Failure Detection and Consensus

Amogh Katti, Giuseppe Di Fatta, University of Reading, UK
Thomas Naughton, Christian Engelmann, Oak Ridge National Laboratory, USA

Funded by: Felix Scholarship and US Department of Energy, Advanced Scientific Computing Research

Outline
- Motivation
- Overview of related work
- Proposed approach
- Experimental results
- Conclusion
- Future work

Motivation
- The need for resilience in High Performance Computing (HPC)
- Algorithm Based Fault Tolerance (ABFT) can help
- ABFT needs a fault tolerant MPI
- User Level Failure Mitigation (ULFM) is being proposed
- MPI_Comm_shrink and MPI_Comm_agree need a failure detection and consensus protocol

The need for resilience in HPC
- Resilience is a critical challenge
- Decreasing component reliability and increasing component counts result in frequent faults, errors and failures
- Software complexity is increasing
- Parallel application correctness and efficiency are essential for the success of extreme-scale systems

ABFT can help
- Global checkpoint-restart, the dominant resilience strategy, will be less efficient at scale
- Application-specific techniques, such as Algorithm Based Fault Tolerance (ABFT), can be more effective
- Loss of application state can be handled through reconfiguration and adaptation
- ABFT applications incorporate the needed fault tolerance logic
- Some ABFT techniques that can be used:
  - Error correction using data redundancy or encoding
  - Re-execution using local checkpoints
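The encoding idea behind ABFT error correction can be illustrated with a minimal checksum sketch (hypothetical illustration code, not taken from the talk): redundancy is added to the data so that the contribution of one lost or corrupted element can be reconstructed.

```python
# Illustrative checksum-style encoding for ABFT-like error correction.
# Hypothetical sketch: append one checksum element so the encoded
# vector sums to zero; a single corrupted element at a known position
# (e.g. the rank detected as failed) can then be recovered.

def encode(data):
    """Append a checksum element so the encoded vector sums to zero."""
    return data + [-sum(data)]

def detect_and_correct(encoded, suspect_index):
    """If the checksum no longer holds, repair the element at
    suspect_index using the residual."""
    residual = sum(encoded)
    if residual == 0:
        return encoded                  # checksum holds: no error
    corrected = list(encoded)
    corrected[suspect_index] -= residual
    return corrected

vec = encode([3, 1, 4, 1, 5])           # -> [3, 1, 4, 1, 5, -14]
vec[2] += 7                             # simulate a fault corrupting one element
fixed = detect_and_correct(vec, 2)      # recover the original value
assert fixed[:5] == [3, 1, 4, 1, 5]
```

Matrix variants of this idea (checksum rows and columns) are what classical ABFT linear-algebra schemes build on.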

ABFT needs a fault tolerant MPI
- Failure detection and notification
- Reconfiguration without global restart, based on consensus on detected failures

[Slide diagram: layered stack, bottom to top - Underlying System; MPI Implementation providing failure notification and reconfiguration services; MPI Application (Algorithm Based Fault Tolerant)]

User Level Failure Mitigation
- MPI's Fault Tolerance Working Group (FTWG) has proposed User Level Failure Mitigation (ULFM)
- Specifies semantics/interfaces for an MPI implementation's behaviour in the presence of process failures
- Fail-stop model: failed processes stop communicating
- In a conformant MPI implementation:
  - No operation hangs in the presence of failures; it completes by returning an error
  - Global knowledge of failures can be achieved whenever necessary
  - Local failure detection is left to the implementations

ULFM requires a failure detection and consensus protocol
- MPI_Comm_shrink
  - Creates a new communicator by excluding the failed processes of the old communicator
  - An agreement is reached on the failed processes
- MPI_Comm_agree
  - Agrees on a value among the non-failed processes
- Both operations need to be supported by a fault-tolerant failure detection and consensus algorithm

Related work in failure detection and consensus protocols
- Coordinator-based protocols
  - Assume failures are pre-detected at each process
  - Use a consensus algorithm to construct the global list of failed processes
  - Typically good logarithmic scaling
  - Not completely fault tolerant (failures occurring during the failure detection are not detected)
- Completely distributed Gossip-based protocols
  - Consistent failure detection in phases: failure suspicion, failure detection and consensus
  - Very poor scalability
  - Completely fault tolerant

Approach
- Gossip-based failure detection and consensus
- Assumptions
- Algorithm 1: Consensus using global knowledge
- Algorithm 2: Efficient heuristic consensus

Gossip-based failure detection and consensus
- Gossiping is a randomized communication scheme
- Gossip-based protocols are intrinsically fault tolerant and extremely scalable
- Two Gossip-based failure detection and consensus algorithms are proposed:
  1. Maintaining global knowledge
  2. Efficient heuristic consensus
- Both are based on a combined method for detecting failures locally and quickly disseminating detections to achieve consensus using Gossip
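The scalability claim rests on how fast gossip spreads information: the informed set can at most double per cycle, and random peer selection gets close to that, so dissemination completes in O(log n) cycles. A toy push-gossip simulation (an illustrative sketch only, without the failure-detection payload the paper's protocol carries) makes this concrete:

```python
import random

def gossip_rounds(n, seed=0):
    """Push gossip: each informed process tells one uniformly random
    peer per cycle; return the number of cycles until all n processes
    are informed (process 0 starts with the information)."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        for p in list(informed):
            informed.add(rng.randrange(n))   # random target; may duplicate
        rounds += 1
    return rounds

# At most a doubling per cycle, so at least ceil(log2(n)) cycles are
# needed; in practice dissemination finishes in O(log n) cycles.
rounds = gossip_rounds(1024)
assert rounds >= 10   # ceil(log2(1024)) is a hard lower bound
```

Randomness also gives the fault tolerance: no single message path is critical, so losing a few processes only delays, rather than blocks, dissemination.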

Assumptions
- Fail-stop failures are detected
- Reliable communication medium
- Failures are permanent
- Synchronous system with bounded message delay
- Failures during the algorithm stop at some point, allowing the algorithm to complete with successful consensus detection
- A process once detected as failed is eventually detected as failed by all processes

Algorithm 1: Consensus using global knowledge
- Each process p maintains a fault matrix Fp[n, n], where n is the system size
  - Fp[r, c] is the view at process p of the status of process c as detected by process r: 0 if alive, 1 otherwise
- Algorithm executed at each process:
  - Initialization: assume all processes are alive
  - At each Gossip cycle:
    - Direct failure detection using stochastic pinging
      - Send a PING gossip message carrying Fp to a random process and post a timeout event for receiving the REPLY gossip message
      - On a timeout event with no REPLY received, the PINGed process is directly detected as failed; update Fp to reflect the detection
    - Gossip reception event: merge fault matrices (indirect local failure detection and propagation)
  - Consensus detection: when all fault-free processes have detected the failed process
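The matrix bookkeeping above can be sketched as follows. This is a minimal illustration under assumed conventions (element-wise OR on merge, the receiver recording its own indirect detection), not the paper's implementation:

```python
# Sketch of Algorithm 1's fault-matrix bookkeeping (illustrative only).
# F[r][c] == 1 means "process r has detected process c as failed".
n = 4
failed = {2}                      # ground truth, for the consensus check

def new_matrix():
    return [[0] * n for _ in range(n)]

def merge(p, F, G):
    """At process p, merge a received matrix G into the local F
    (element-wise OR), then record p's own indirect detection of any
    failure reported in the merged matrix."""
    for r in range(n):
        for c in range(n):
            F[r][c] |= G[r][c]
    for c in range(n):
        if any(F[r][c] for r in range(n)):
            F[p][c] = 1

def consensus_on(F, c):
    """Consensus on the failure of c: every fault-free row marks it."""
    return all(F[r][c] == 1 for r in range(n) if r not in failed)

F = {p: new_matrix() for p in range(n) if p not in failed}
F[0][0][2] = 1                    # process 0: ping timeout on process 2
merge(1, F[1], F[0])              # gossip 0 -> 1 propagates the detection
merge(3, F[3], F[1])              # gossip 1 -> 3
assert consensus_on(F[3], 2)      # process 3 already sees global agreement
assert not consensus_on(F[1], 2)  # process 1 does not yet (row 3 missing)
```

Note that process 3 detects consensus before process 1 does, which mirrors the asynchronous consensus detection this algorithm exhibits in the results.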

Algorithm 2: Efficient heuristic consensus
- Each process p maintains a fault list Lp = {<r, ccnt>, ...}
  - An entry is a 2-tuple <r, ccnt>, where r is the rank of a failed process and ccnt is the consensus count associated with it
- Algorithm executed at each process:
  - Initialization: assume all processes are alive
  - At each Gossip cycle:
    - Direct failure detection using stochastic pinging
      - Send a PING gossip message carrying Lp to a random process and post a timeout event for receiving the REPLY gossip message
      - On a timeout event with no REPLY received, the PINGed process is directly detected as failed; add it to Lp with ccnt set to 0
    - Gossip reception event: merge fault lists (indirect failure detection and propagation)
  - Consensus detection:
    - Wait for log(n) cycles
    - When ccnt for an entry <r, ccnt> reaches MIN_CCNT, which indicates with high probability that the failure of r is recognized by all processes, consensus on the failure of r is reached
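A sketch of the fault-list heuristic follows. The merge and threshold rules here (bumping the count when both sides already know the failure, MIN_CCNT = 3, a deterministic exchange pattern instead of random peers) are assumptions made for illustration, not the paper's exact protocol:

```python
import math

# Illustrative sketch of Algorithm 2's fault-list heuristic.
# A fault list maps failed rank -> consensus count (ccnt).
MIN_CCNT = 3   # assumed threshold; the paper tunes this heuristically

def merge_lists(L_local, L_remote):
    """Merge a received fault list into the local one. If both sides
    already know a failure, bump its consensus count (evidence the
    detection has spread); a new entry is adopted with the remote count."""
    for r, ccnt in L_remote.items():
        if r in L_local:
            L_local[r] = max(L_local[r], ccnt) + 1
        else:
            L_local[r] = ccnt

def consensus_reached(L, r, cycle, n):
    """Heuristic: wait log2(n) cycles, then require the count for r to
    reach MIN_CCNT before declaring consensus on its failure."""
    return cycle >= math.log2(n) and L.get(r, -1) >= MIN_CCNT

n = 8
lists = {p: {} for p in range(n) if p != 5}   # process 5 has failed
lists[0][5] = 0                               # direct detection via ping timeout
for cycle in range(1, 6):                     # deterministic gossip for the sketch
    for p in list(lists):
        peer = (p + cycle) % n
        if peer in lists:                     # failed peer never responds
            merge_lists(lists[peer], lists[p])

assert all(consensus_reached(lists[p], 5, 5, n) for p in lists)
```

Compared with Algorithm 1, each process stores O(f) entries for f failures instead of an n x n matrix, which is where the memory and bandwidth savings reported in the results come from.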

Results
- Overview of implementations
- Overview of the use of xsim
- Overview of the hardware environment
- Overview of the simulation environment
- Results for Algorithm 1
- Results for Algorithm 2
- Comparison of Algorithm 1 and Algorithm 2

Overview of implementations
- The algorithms have been implemented as MPI applications using point-to-point operations
- The fault matrix in Algorithm 1 is implemented as an integer matrix
- The failed process id in Algorithm 2 is implemented as an integer
- Failures were simulated by restraining a process from participating in communications

Overview of the use of xsim
- The Extreme-scale Simulator (xsim) is an application performance and resilience investigation toolkit
- It sits between the MPI application and the MPI library
- Parallel discrete event simulation (PDES) per MPI process enables evaluating the algorithms (using threads) at significantly larger scale than the available physical system

Overview of the hardware environment
- Experiments ran on a Linux cluster with one head node and 16 compute nodes
- Head node: two AMD Opteron 4386 3.1 GHz processors with eight cores per processor and 64 GB RAM
- Compute nodes: one Intel Xeon E3-1220 3.1 GHz processor with four cores and 16 GB RAM
- Nodes are connected by Gigabit Ethernet
- The system runs Ubuntu 12.04 LTS and Open MPI 1.6.5

Overview of the simulation environment
- Simulations using the Extreme-scale Simulator (xsim) atop the Linux cluster
- One simulator MPI process per physical processor core
- Multiple simulated MPI processes per simulator MPI process (oversubscription)
- The processor model is set with a 1-to-1 performance match to the physical AMD processor core
- The network interconnect model uses a basic star topology, 1 µs link latency, and infinite bandwidth
- The processor and network models are set to evaluate the algorithms, not the system the algorithms run on

Results for Algorithm 1: Consensus using global knowledge
- Consensus on a single failure with 2^4-2^11 MPI ranks
- Consensus on four failures with 2^4-2^11 MPI ranks

Results for Algorithm 1: Consensus using global knowledge
- Exponential consensus propagation
- Asynchronous consensus detection

Results for Algorithm 1: Consensus using global knowledge
- Fault tolerant consensus propagation and detection

Results for Algorithm 2: Efficient heuristic consensus
- Consensus on a single failure with 2^4-2^20 MPI ranks

Comparison of Algorithm 1 vs. Algorithm 2
- Consensus on a single failure with 2^2-2^20 MPI ranks

Conclusion
- Failure detection and consensus for a fault-tolerant MPI enable HPC applications to adopt ABFT
- Two novel Gossip-based failure detection and consensus algorithms were presented:
  1. Global knowledge at each process
  2. Efficient heuristic consensus
- The results confirm their scalability and fault tolerance
- The second algorithm uses significantly less memory and bandwidth and achieves perfect consensus synchronization

Future work
- A better method to efficiently detect consensus
- Mechanisms to avoid false positives
- Further experimental analysis and comparison with other methods

Questions