DMTCP: Fixing the Single Point of Failure of the ROS Master

Similar documents
Checkpointing using DMTCP, Condor, Matlab and FReD

Applications of DMTCP for Checkpointing

First Results in a Thread-Parallel Geant4

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Extending DMTCP Checkpointing for a Hybrid Software World

Autosave for Research Where to Start with Checkpoint/Restart

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Overview Job Management OpenVZ Conclusions. XtreemOS. Surbhi Chitre. IRISA, Rennes, France. July 7, Surbhi Chitre XtreemOS 1 / 55

NPTEL Course Jan K. Gopinath Indian Institute of Science

Improving the Efficiency of Fuzz Testing Using Checkpointing

Advanced Memory Management

arxiv: v2 [cs.os] 3 Feb 2014

An introduction to checkpointing. for scientific applications

Linux-CR: Transparent Application Checkpoint-Restart in Linux

MySQL InnoDB Cluster. New Feature in MySQL >= Sergej Kurakin

URDB: A Universal Reversible Debugger Based on Decomposing Debugging Histories

Q.1. (a) [4 marks] List and briefly explain four reasons why resource sharing is beneficial.

A Behavior Based File Checkpointing Strategy

Tardigrade: Leveraging Lightweight Virtual Machines to Easily and Efficiently Construct Fault-Tolerant Services

Application Fault Tolerance Using Continuous Checkpoint/Restart

Wind River. All Rights Reserved.

The Google File System

Elastic HPC on the MOC

Progress Report on Transparent Checkpointing for Supercomputing

MIGSOCK A Migratable TCP Socket in Linux

The ROS 2 Vision For Advancing the Future of Robotics Development

Today: Fault Tolerance. Reliable One-One Communication

Scale out Read Only Workload by sharing data files of InnoDB. Zhai weixiang Alibaba Cloud

Support for resource sharing Openness Concurrency Scalability Fault tolerance (reliability) Transparence. CS550: Advanced Operating Systems 2

Scalable In-memory Checkpoint with Automatic Restart on Failures

csdesign Documentation

Care and Feeding of Oracle Rdb Hot Standby

24-vm.txt Mon Nov 21 22:13: Notes on Virtual Machines , Fall 2011 Carnegie Mellon University Randal E. Bryant.

System-level Transparent Checkpointing for OpenSHMEM

Services in the Virtualization Plane. Andrew Warfield Adjunct Professor, UBC Technical Director, Citrix Systems

ECE 574 Cluster Computing Lecture 8

COMMUNICATION IN DISTRIBUTED SYSTEMS

Research Article Optimizing Checkpoint Restart with Data Deduplication

Recovering Device Drivers

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

RADIC: A FaultTolerant Middleware with Automatic Management of Spare Nodes*

Fault Tolerance. Distributed Systems IT332

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY Fall 2008.

INTRODUCTION TO XTREMIO METADATA-AWARE REPLICATION

Ravana: Controller Fault-Tolerance in SDN

Veritas Volume Replicator Administrator s Guide

Linux-CR: Transparent Application Checkpoint-Restart in Linux

Progress Report Toward a Thread-Parallel Geant4

Administrative Details. CS 140 Final Review Session. Pre-Midterm. Plan For Today. Disks + I/O. Pre-Midterm, cont.

Migratory TCP (MTCP) Transport Layer Support for Highly Available Network Services

Distributed Systems COMP 212. Lecture 18 Othon Michail

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

The Google File System

The Google File System (GFS)

Avida Checkpoint/Restart Implementation

MEMBRANE: OPERATING SYSTEM SUPPORT FOR RESTARTABLE FILE SYSTEMS

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Distributed Systems CSCI-B 534/ENGR E-510. Spring 2019 Instructor: Prateek Sharma

IO virtualization. Michael Kagan Mellanox Technologies

Map-Reduce. Marco Mura 2010 March, 31th

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint?

C-JDBC Tutorial A quick start

Today: Fault Tolerance. Replica Management

CA485 Ray Walshe Google File System

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

Postgres-XC PostgreSQL Conference Michael PAQUIER Tokyo, 2012/02/24

The Design Complexity of Program Undo Support in a General Purpose Processor. Radu Teodorescu and Josep Torrellas

PolarDB. Cloud Native Alibaba. Lixun Peng Inaam Rana Alibaba Cloud Team

Lecture 7: February 10

Snapify: Capturing Snapshots of Offload Applications on Xeon Phi Manycore Processors

Reducing downtimes by replugging device drivers

Industrial hardware and QEMU. LinuxCon Europe 2012, Barcelona Alberto Garcia

Distributed Information Processing

Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

Zhang Chen Zhang Chen Copyright 2017 FUJITSU LIMITED

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello

Problem Set: Processes

Workflow Fault Tolerance for Kepler. Sven Köhler, Thimothy McPhillips, Sean Riddle, Daniel Zinn, Bertram Ludäscher

Checkpoint-Restart for a Network of Virtual Machines

Towards Fault-Tolerant Energy-Efficient High Performance Computing inthe Cloud

The Google File System

Rootless Containers with runc. Aleksa Sarai Software Engineer

Checkpoint (T1) Thread 1. Thread 1. Thread2. Thread2. Time

The Power of Snapshots Stateful Stream Processing with Apache Flink

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

52 Remote Target. Simulation. Chapter

escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows

Packaging and Deploying Java Based Solutions to WebSphere Message Broker V7

Process Migration via Remote Fork: a Viable Programming Model? Branden J. Moor! cse 598z: Distributed Systems December 02, 2004

Virtualization for Desktop Grid Clients

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Chapter 8: Virtual Memory. Operating System Concepts

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Fault Tolerance. Distributed Systems. September 2002

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

MASSIVE SCALE USB DEVICE DRIVER FUZZ WITHOUT DEVICE. HC Tencent s XuanwuLab

MODELS OF DISTRIBUTED SYSTEMS

Chapter 1: Introduction Operating Systems MSc. Ivan A. Escobar

The Art, Science, and Engineering of Programming

Transcription:

DMTCP: Fixing the Single Point of Failure of the ROS Master Tw i n k l e J a i n j a i n. t @ h u s k y. n e u. e d u G e n e C o o p e r m a n g e n e @ c c s. n e u. e d u C o l l e g e o f C o m p u t e r a n d I n f o r m a t i o n S c i e n c e N o r t h e a s t e r n U n i v e r s i t y, B o s t o n, U S A September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 1

Roadmap Core Problem: ROS master failure Possible Solutions Effective solution: Transparent checkpointing (DMTCP) DMTCP s workflow Barrier: DMTCP didn t support Pseudoterminals Pseudoterminal plugin Extending the DMTCP Live demo DMTCP s growing number of use cases DMTCP, ROS2, and the future! Summary September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 2

Core Problem: ROS master failure ROS master contributes in: Provides name and parameter services to the rest of the nodes Control over system, graphs, nodes Helps a node to discover other nodes Why is the ROS master s failure a problem? No way to recover the system after crash Nodes need to reload System reinitialization Whole system will compromise The amount of loss can be huge in a robotics application See talk by our collaborator, Matthieu Amy, at ROSCon-2016 Ros Master Node 1 Communication Node 2 September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 3

Possible Solutions Switch to a more reliable hardware Expensive solution Chances of failure decreases but crash can happen Reduce the dependency on the ROS master Currently unavailable in ROS1 Can be applied in the future versions of ROS Checkpoint and Restart (presented here) Save the state periodically Resume from the last checkpointed state September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 4

Effective solution: Transparent Checkpointing (DMTCP) DMTCP: Distributed Multi-Threaded CheckPointing (http://dmtcp.sourceforge.net/) Active project for 13 years Over 10,000 downloads of source, unkown # downloads of binary package Open source package for transparent checkpointing Checkpointing saving restorable application state like a snapshot Transparent application remains unmodified User space kernel remains unmodified Easily extensible Plugin-based architecture Both manual and periodic checkpointing options are available NEWS: Being extended to support NVIDIA GPUs September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 5

SIGUSR2 SIGUSR2 DMTCP s workflow: Checkpointing Start the application with DMTCP DMTCP COORDINATOR Normal execution Gain control over threads Initiate checkpoint CKPT MSG CKPT MSG Suspend user threads CKPT THREAD CKPT THREAD Save application state USER THREAD A USER THREAD B USER PROCESS 1 SOCKET CONNECTION USER THREAD C USER PROCESS 2 Write checkpoint image to a file Resume user threads Node 1 Node 2 September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 6

SIGUSR2 SIGUSR2 DMTCP s workflow: Restart Restart the application with DMTCP DMTCP COORDINATOR Start a new process on each node CKPT MSG CKPT MSG Exec into MTCP restart Overwrite memory with ckpt image CKPT THREAD USER THREAD A USER THREAD B USER PROCESS 1 SOCKET CONNECTION CKPT THREAD USER THREAD C USER PROCESS 2 Node 1 Node 2 Restore application state Resume user threads Normal execution Ready for the checkpoint again September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 7

Barrier: DMTCP didn t support Pseudoterminals! When DMTCP tried to checkpoint a ROS application, it didn t work! Pseudoterminals Pair of two virtual devices connected with a bidirectional IPC channel Kernel space Emulates a device terminal for terminal oriented programs (e.g., vi editor) What is the problem? User space A typical ROS application uses ptys DMTCP doesn t handle pty s checkpointing correctly The state of a pty-based application was not same at the time of checkpoint, resume after checkpoint and at restart Fig. General approach for a pty-based application September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 8

Pseudoterminal plugin Wrapper functions for the pty related functions Packet or non-packet mode Raw or cooked mode Terminal settings Drain the kernel buffer Whether process running in the foreground or background Characteristics of bidirectional IPC channel between master and slave devices Order of refilling the kernel buffer September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 9

Extending DMTCP: Checkpoint & Restart September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 10

Live Demo: checkpoint a ROS application ROS application details: 3 processes running in 3 different terminal windows roscore Listener node Talker node DMTCP demo details: DMTCP coordinator running in the fourth terminal window Manual checkpointing and restart Let s start! September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 11

DMTCP s growing number of use cases Fault tolerance Process migration Skip past long startup times Debugging Ultimate bug report Speculative execution Easily Scalable over 32,000 nodes checkpointed at ICPADS 16 (Over 50 refereed publications by external teams, describing their use of DMTCP) September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 12

DMTCP, ROS2, and the future! An RMW(ROS MiddleWare) based on a ROS master and DMTCP DMTCP is transparent to the end user because it is built into the RMW Combining DMTCP with a small log-and-replay implementation allows rollback and replay during recovery for full reliability. This allows an RMW developer to provide a small, reliable RMW, while delegating to DMTCP developers the burden of guaranteeing fault tolerance. DMTCP is free and open-source. It uses GNU LGPL (non-contagious): no license fees and can be combined with proprietary corporate software. September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 13

Summary DMTCP now solves the single point of failure problem of the ROS master. Can now checkpoint-restart a typical ROS application. Plugin-based architecture helps in extending it easily. Can be bundle with larger application package. In the future, we hope to work along with upcoming ROS2 framework. September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 14

Thank you This work is also the result of a joint collaboration with Jean-Charles Fabre, Michael Lauer, and Matthieu Amy of the dependable computing and fault tolerance team At LAAS, Toulouse, France. September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 15

Appendix: Using DMTCP with ROS Requires developer version of DMTCP and Linux kernel version above 3.8.0 Execute on <COORD_HOST>: dmtcp_coordinator --port 7779 Execute once for each ROS master/node (ckpt interval of 5 seconds): dmtcp_launch --join --interval 5 --port 7779 \ <path_to_the_ros_executable> Execute once for each ROS master/node: dmtcp_restart --join --port 7779 --host <COORD_HOST> \ <path_to_each_checkpoint_file> September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 16

Appendix: Using DMTCP with ROS (cont.) Can change 7779 (the default port) of the DMTCP coordinator to another TCP port, as desired. On each ROS node, the checkpoint files will be of the form ckpt_*.dmtcp. Note that ROS sometimes creates additional Python processes, resulting in checkpoint image files for Python. Be sure to include the related Python checkpoint files when restarting the roscore/master. Note that if we checkpointed in three different Linux terminals (e.g., master, talker, listener), then we must invoke dmtcp_restart in the three different terminals, with the appropriate ckpt_*.dmtp files for the corresponding ROS node. If one prefers manual checkpointing (no --interval flag), then one can execute c (checkpoint) in a terminal window running the DMTCP coordinator, optionally followed by k (kill). The DMTCP coordinator is stateless. It s fine to kill any old coordinator and start a new one on restart. September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 17

References: 1. Ansel, Jason, Kapil Arya, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009 2. Jean-Charles Fabre, Michael Lauer, Matthieu Amy. Adaptive Fault Tolerance on ROS: A Component-Based Approach ROSCon 2016. 3. Kerrisk, Michael. The Linux programming interface. No Starch Press, 2010. September 21, 2017 DMTCP: Fixing The Single Point Of Failure Of The ROS Master Gene Cooperman and Twinkle Jain 18