Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale*
jliffl2@illinois.edu, emeneses@pitt.edu, {gplkrsh2,mille121}@illinois.edu, sriram@pnnl.gov, kale@illinois.edu
*University of Illinois Urbana-Champaign (UIUC); University of Pittsburgh; Pacific Northwest National Laboratory (PNNL)
September 23, 2014

Deterministic Replay & Fault Tolerance Fault tolerance often crosses over into replay territory! Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander 2 / 33 S

Deterministic Replay & Fault Tolerance Fault tolerance often crosses over into replay territory! Popular uses Online fault tolerance Parallel debugging Reproducing results Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander 2 / 33 S

Deterministic Replay & Fault Tolerance Fault tolerance often crosses over into replay territory! Popular uses Online fault tolerance Parallel debugging Reproducing results Types of replay Data-driven replay Application/system data is recorded Content of messages sent/received, etc. Control-driven replay The ordering of events is recorded Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander 2 / 33 S

Deterministic Replay & Fault Tolerance Our Focus Fault tolerance often crosses over into replay territory! Popular uses Online fault tolerance Parallel debugging Reproducing results Types of replay Data-driven replay Application/system data is recorded Content of messages sent/received, etc. Control-driven replay The ordering of events is recorded Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander 3 / 33 S

Online Fault Tolerance: Hard Failures
- Researchers have predicted that hard faults will increase at exascale
- Machines are getting larger: projected to house more than 200,000 sockets
- Hard failures may be frequent and affect only a small percentage of nodes

Online Fault Tolerance: Approaches
- Checkpoint/restart (C/R)
  - Well-established method
  - Save a snapshot of the system state
  - Roll back to the previous snapshot in case of failure
- Motivation beyond C/R
  - If a single node experiences a hard fault, why must all the nodes roll back?
  - Recovering with C/R is expensive at large machine scales
  - The cost is complicated to predict because it depends on many factors (e.g., checkpointing frequency)
- Solutions
  - Application-specific fault tolerance
  - Other system-level approaches
  - Message logging!

Hard Failure System Model
- P processes that communicate via message passing
- Communication is over non-FIFO channels
  - Messages are sent asynchronously and may arrive out of order
  - A message is guaranteed to arrive sometime in the future if the recipient process has not failed
- Fail-stop model for all failures
  - Failed processes do not recover from failures
  - Processes do not behave maliciously (non-Byzantine failures)

Sender-Based Causal Message Logging (SB-ML)
- A combination of data-driven and control-driven replay
  - Data-driven: messages sent are recorded
  - Control-driven: determinants are recorded to store the order of events
- Incurs time and storage overhead during forward execution
- Periodic checkpoints reduce the storage overhead
  - Recovery effort is limited to work executed after the latest checkpoint
  - Data stored before the checkpoint can be discarded
- Scalable implementation in Charm++

Example Execution with SB-ML
[Figure: timeline of tasks A-E exchanging messages m1-m7, annotated with a checkpoint, a failure, the forward path, restart, and recovery]

Motivation: Overheads with SB-ML
[Figure: application progress over time with and without fault tolerance (No FT vs. FT), annotated with performance overhead, progress slowdown, checkpoint, failure, and recovery time]

Forward Execution Overhead with SB-ML
- Logging the messages
  - Requires only saving a pointer and not deallocating the message
  - Increases memory pressure
- Determinants: 4-tuples of the form <SPE, SSN, RPE, RSN> (sketched in code below)
  - Sender processor (SPE), sender sequence number (SSN), receiver processor (RPE), receiver sequence number (RSN)
  - Must be stored stably, based on the reliability requirements: propagated to n processors
  - Unacknowledged determinants are piggybacked onto new messages (to avoid frequent synchronizations)
- Recovery: messages must be replayed in a total order
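To make the bookkeeping concrete, here is a minimal C++ sketch of the sender-side log and the <SPE, SSN, RPE, RSN> determinant described above. All names are hypothetical illustrations, not the Charm++ message-logging API.

    // Hypothetical sketch of SB-ML bookkeeping; names are illustrative only.
    #include <cstdint>
    #include <vector>

    struct Determinant {      // 4-tuple created when a message is delivered
      uint32_t spe;           // sender processor
      uint64_t ssn;           // sender sequence number
      uint32_t rpe;           // receiver processor
      uint64_t rsn;           // receiver sequence number
    };

    struct Message {
      uint32_t src, dst;
      uint64_t ssn;                         // assigned by the sender
      std::vector<Determinant> piggyback;   // unacknowledged determinants
      std::vector<char> payload;
    };

    class SenderLog {
     public:
      // Logging a send only keeps a pointer; the message is not deallocated.
      void record(const Message* m) { log_.push_back(m); }
      // Everything sent before the latest checkpoint can be discarded.
      void truncate_at_checkpoint() { log_.clear(); }
     private:
      std::vector<const Message*> log_;     // this is what raises memory pressure
    };

    class Receiver {
     public:
      // Delivery assigns the next receive sequence number and emits a determinant.
      Determinant deliver(const Message& m, uint32_t my_pe) {
        return Determinant{m.src, m.ssn, my_pe, ++rsn_};
      }
     private:
      uint64_t rsn_ = 0;
    };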

Forward Execution Microbenchmark (SB-ML)

  Component                         Overhead (%)
  Determinants                      84.75
  Bookkeeping                       11.65
  Message-envelope size increase     3.10
  Message storage                    0.50

- Using the LeanMD (molecular dynamics) benchmark
- Measured on 256 cores of Ranger
- The largest source of overhead is determinants: creating, storing, sending, etc.

Benchmarks: Runtime System (Charm++)
- Decompose the parallel computation into objects that communicate
  - More objects than processors
  - Objects communicate by sending messages
  - The computation is oblivious to the processors
- Benefits: load balancing, message-driven execution, fault tolerance, etc.

Benchmarks: Configuration & Experimental Setup

  Benchmark                       Configuration
  Stencil3D                       matrix: 4096^3, chunk: 64^3
  LeanMD (mini-app for NAMD)      600K atoms, 2-away XY, 75 atoms/cell
  LULESH (shock hydrodynamics)    matrix: 1024x512^2, chunk: 16x8^2

- All experiments run on the IBM Blue Gene/P (BG/P) system Intrepid
  - 40960-node system; each node has one quad-core 850 MHz PowerPC 450 and 2 GB DDR2 memory
- Compiler: IBM XL C/C++ Advanced Edition for Blue Gene/P, V9.0
- Runtime: Charm++ 6.5.1

Forward Execution Overhead with SB-ML
[Figure: percent overhead (0-20%) of SB-ML for Stencil3D, LeanMD, and LULESH at 8k, 16k, 32k, 64k, and 132k cores]
- The finer-grained benchmarks, LeanMD and LULESH, suffer significant overhead

Reducing the Overhead of Determinants: Design Criteria
- We must maintain full determinism
- We must degrade gracefully in all cases (even highly non-deterministic programs)
- We need to account for tasks or lightweight objects

Reducing the Overhead of Determinants: Intrinsic Determinism
Many researchers have observed that programs have internal determinism:
- Causality tracking (Fidge, 1988: "Partial orders for parallel debugging")
- Racing messages (Netzer et al., 1992: "Optimal tracing and replay for debugging message-passing parallel programs")
- Theoretical races (Damodaran-Kamal, 1993: "Nondeterminacy: testing and debugging in message passing parallel programs")
- Block races (Clemencon, 1995: "An implementation of race detection and deterministic replay with MPI")
- MPI and non-determinism (Kranzlmuller, 2000: "Event graph analysis for debugging massively parallel programs")
- ...
- Send-determinism (Guermouche et al., 2011: "Uncoordinated checkpointing without domino effect for send-deterministic MPI applications")

Our Approach
- In many cases, only a partial order must be stored for full determinism
- Program = internal determinism + non-determinism + commutative events
- Internal determinism requires no determinants!
- Commutative events require no determinants!
- Approach: use determinants to store a partial order only for the non-deterministic events that are not commutative

Ordering Algebra: Ordered Sets, O
- O(n, d): a set of n events and d dependencies
  - Can be accurately replayed from a given starting point
  - The dependencies d can be among the events in the set, or on preceding events
  - Intuitively, these are ordered sets of events
- Define a sequencing operation, ∘:
    O(1, d_1) ∘ O(1, d_2) = O(2, d_1 + d_2 + 1)
  Intuitively, if we have two atomic events, we need a single additional dependency to tell us which one comes first.
  Generalization: O(n_1, d_1) ∘ O(n_2, d_2) = O(n_1 + n_2, d_1 + d_2 + 1)

Ordering Algebra: Unordered Sets, U
- U(n, d): an unordered set of n events and d dependencies
- Example: several messages sent to a single endpoint
  - Depending on the order of arrival, the eventual state will differ
- We decompose the set into atomic events with an additional dependency between each successive pair:
    U(n, d) = O(1, d_1) ∘ O(1, d_2) ∘ ... ∘ O(1, d_n) = O(n, d + n - 1),  where d = Σ_i d_i
- Result: n - 1 additional dependencies are required to fully order n events

Ordering Algebra: Interleaving Multiple Independent Sets (the ∥ operator)

Lemma. Any possible interleaving of two ordered sets of events A = O(m, d) and B = O(n, e), where A ∩ B = ∅, is given by:
    O(m, d) ∥ O(n, e) = O(m + n, d + e + min(m, n))

Lemma. Any possible ordering of n ordered sets of events O(m_1, d_1), O(m_2, d_2), ..., O(m_n, d_n), when ∩_i O(m_i, d_i) = ∅, can be represented as:
    ∥_{i=1..n} O(m_i, d_i) = O(m, d + m - max_i m_i),  where m = Σ_i m_i and d = Σ_i d_i
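The rules above can be read as a small cost accounting: each rule says how many explicit dependencies must be recorded to replay a composite set of events. A minimal C++ sketch of that accounting follows; it is illustrative only, and the names are hypothetical.

    #include <algorithm>
    #include <vector>

    // Illustrative accounting for the ordering algebra: an O(n, d) value counts
    // n events and the d explicit dependencies needed to replay them.
    struct Ordered { long n; long d; };

    // Sequencing: O(n1, d1) then O(n2, d2) needs one extra dependency.
    Ordered seq(Ordered a, Ordered b) { return {a.n + b.n, a.d + b.d + 1}; }

    // Fully ordering an unordered set U(n, d) costs n - 1 extra dependencies.
    Ordered order_unordered(long n, long d) { return {n, d + n - 1}; }

    // Interleaving two disjoint ordered sets: min(m, n) extra dependencies.
    Ordered interleave(Ordered a, Ordered b) {
      return {a.n + b.n, a.d + b.d + std::min(a.n, b.n)};
    }

    // n-way generalization: O(m, d + m - max_i m_i).
    Ordered interleave_all(const std::vector<Ordered>& sets) {
      long m = 0, d = 0, mx = 0;
      for (const Ordered& s : sets) { m += s.n; d += s.d; mx = std::max(mx, s.n); }
      return {m, d + m - mx};
    }

For two sets, interleave_all reduces to the pairwise rule, since m1 + m2 - max(m1, m2) = min(m1, m2).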

Internal Determinism, D
- D(n) = O(n, 0): n deterministically ordered events are structurally equivalent to an ordered set of n events with no explicit dependencies!
- What happens if we interleave internal determinism with something else?
  - k interruption points => O(k, k - 1)

Commutative Events, C
- Some events in programs are commutative: regardless of the execution order, the resulting state is identical
- All existing message-logging schemes record a total order on them
- However, we can reduce a commutative set to C(n) = O(2, 1): a begin event and an end event, sequenced together (see the sketch below)
  - Sequencing other sets of events around the region simply places them before or after it
  - Interleaving other events places them in three buckets: (1) before the begin event, (2) during the commutative region, (3) after the end event
  - This corresponds exactly to an ordered set of two events!
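Continuing the illustrative accounting sketched earlier (same hypothetical Ordered type, repeated here so the snippet stands alone), internally deterministic and commutative regions contribute almost no explicit dependencies:

    // Same illustrative Ordered type as in the earlier sketch.
    struct Ordered { long n; long d; };

    // Internally deterministic region: D(n) = O(n, 0), no explicit dependencies.
    Ordered deterministic(long n) { return {n, 0}; }

    // Commutative region: C(n) reduces to O(2, 1), a sequenced begin/end pair,
    // regardless of how many events it contains.
    Ordered commutative(long /*n_events*/) { return {2, 1}; }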

Applying the Theory: PO-REPLAY, a Partial-Order Message Identification Scheme
- Properties
  - Tracks causality with Lamport clocks
  - Uniquely identifies a sent message, whether or not its order is transposed
  - Requires exactly the number of determinants and dependencies produced by the ordering algebra
- Determinant composition (3-tuple): <SRN, SPE, CPI> (a rough sketch follows)
  - SRN: sender region number, incremented for every send outside a commutative region and incremented once when a commutative region starts
  - SPE: sender processor endpoint
  - CPI: commutative path identifier, a sequence of bits representing the path to the root of the commutative region
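A rough C++ sketch of the smaller PO-REPLAY determinant, again with hypothetical names and simplified semantics: the slide only summarizes how SRN is maintained, and the handling of CPI bits for nested commutative regions is not shown here.

    #include <cstdint>
    #include <vector>

    // <SRN, SPE, CPI>: smaller than the SB-ML 4-tuple because receive order is
    // only constrained where the partial order demands it.
    struct PODeterminant {
      uint64_t srn;            // sender region number
      uint32_t spe;            // sender processor endpoint
      std::vector<bool> cpi;   // commutative path identifier (bit path)
    };

    class POSender {
     public:
      // SRN is incremented for every send outside a commutative region...
      PODeterminant on_send(uint32_t my_pe) {
        if (!in_commutative_) ++srn_;
        return {srn_, my_pe, cpi_};
      }
      // ...and incremented once when a commutative region starts.
      void begin_commutative_region() { ++srn_; in_commutative_ = true; }
      void end_commutative_region()   { in_commutative_ = false; }
     private:
      uint64_t srn_ = 0;
      bool in_commutative_ = false;
      std::vector<bool> cpi_;  // path to the root of the enclosing region
    };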

Experimental Results: Forward Execution Overhead, Stencil3D
[Figure: percent overhead (0-20%) of Stencil3D-PartialDetFT vs. Stencil3D-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores]
- Coarse-grained; shows a small improvement over SB-ML

Experimental Results: Forward Execution Overhead, LeanMD
[Figure: percent overhead (0-20%) of LeanMD-PartialDetFT vs. LeanMD-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores]
- Fine-grained; overhead reduced from 11-19% to under 5%

Experimental Results: Forward Execution Overhead, LULESH
[Figure: percent overhead (0-20%) of LULESH-PartialDetFT vs. LULESH-FullDetFT at 8k, 16k, 32k, 64k, and 132k cores]
- Medium-grained with many messages; overhead reduced from 17% to under 4%

Experimental Results: Fault Injection
- Measure the recovery time for the different protocols
- We inject a simulated fault on a random node, approximately in the middle of the checkpoint period
- We calculate the optimal checkpoint period duration using Daly's formula (see below)
  - Assuming a socket count of 64K-1M
  - Assuming an MTBF of 10 years
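The slide does not show the expression used; for reference, a commonly quoted first-order form of Daly's estimate for the optimal checkpoint period, given checkpoint cost delta and system MTBF M, is:

    \tau_{opt} \approx \sqrt{2\,\delta M} - \delta,
    \qquad M_{\text{system}} \approx \frac{M_{\text{socket}}}{N_{\text{sockets}}}

If the 10-year MTBF is interpreted per socket (an assumption; the slide does not say), the resulting system MTBF is roughly 80 minutes at 64K sockets and roughly 5 minutes at 1M sockets.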

Experimental Results: Recovery Time Speedup over C/R
[Figure: recovery-time speedup (0-4x) over checkpoint/restart for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]
- LeanMD shows the largest speedup due to its fine-grained, overdecomposed nature
- We achieve a speedup in recovery time in all cases

Experimental Results: Recovery Time Speedup over SB-ML
[Figure: recovery-time speedup (0-2.5x) over SB-ML for LeanMD, Stencil3D, and LULESH at 8192, 16384, 32768, 65536, and 131072 cores]
- Speedup increases with scale, due to the expense of coordinating determinants and ordering in SB-ML

Experimental Results: Summary
- Our new message-logging protocol has under 5% overhead for the benchmarks tested
- Recovery is significantly faster than with C/R or causal message logging
- Depending on the frequency of faults, it may perform better than C/R

Future Work
- More benchmarks
- Study a broader range of programming models
- The memory overhead of message logging makes it infeasible for some applications
- Automated extraction of ordering and interleaving properties
- Programming language support?

Conclusion
- A comprehensive approach for reasoning about execution orderings and interleavings
- We observe that the information stored can be reduced in proportion to the knowledge of order flexibility
- Programming paradigms should make this cost model clearer!

Questions?