Workflow Fault Tolerance for Kepler. Sven Köhler, Timothy McPhillips, Sean Riddle, Daniel Zinn, Bertram Ludäscher


Introduction
Scientific workflows:
- Automate scientific pipelines
- Have long-running computations
- Often contain stateful actors
Workflow execution can crash because of:
- Hardware failures
- Power outages
- Buggy / malicious actors
Current approach: start the workflow from the beginning.
UC Davis: S. Koehler, T. McPhillips, S. Riddle, D. Zinn, B. Ludaescher, 2/16/2011

Current Fault Tolerance Solutions
Manage actor failures or sub-workflow failures AND their effects:
- Atomicity and provenance support for pipelined scientific workflows [Wang et al.]
- Ptolemy's backtracking [Feng et al.]
Use caching strategies for faster re-execution:
- W.A.T.E.R.S. memoization [Hartman et al.]
- Skip-over strategy [Podhorszki et al.] (CPES)

Our Fault Tolerance Approach
Recovery based on readily available provenance:
1. Create a uniform model for workflow descriptions and provenance
2. Record actor state in provenance in relation to invocations
3. After a workflow crash: use the provenance data in our uniform model and start recovery
Different strategies for recovery are available.
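The core of the approach is that actor state is recorded alongside invocations in the provenance log. A minimal sketch of that idea (the record fields and class names are illustrative assumptions, not the actual Kepler provenance schema):

```python
# Sketch: a provenance log that stores actor state with each invocation,
# so the latest recorded state doubles as a checkpoint. Hypothetical names.
from dataclasses import dataclass, field

@dataclass
class InvocationRecord:
    actor: str            # actor name
    invocation: int       # invocation count for this actor
    inputs: list          # tokens consumed
    outputs: list         # tokens produced
    state: object = None  # serialized actor state, if the actor is stateful

@dataclass
class ProvenanceLog:
    records: list = field(default_factory=list)

    def record(self, actor, invocation, inputs, outputs, state=None):
        self.records.append(
            InvocationRecord(actor, invocation, inputs, outputs, state))

    def last_state(self, actor):
        """Most recent recorded state for an actor: its checkpoint."""
        for rec in reversed(self.records):
            if rec.actor == actor and rec.state is not None:
                return rec.invocation, rec.state
        return None

log = ProvenanceLog()
log.record("B", 1, inputs=[1], outputs=[1], state={"sum": 1})
log.record("B", 2, inputs=[2], outputs=[3], state={"sum": 3})
print(log.last_state("B"))  # (2, {'sum': 3})
```

After a crash, `last_state` is the hook the recovery strategies below would query.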

Model for Workflows and Provenance

Our Recovery Strategies
- Naive: restart the workflow without using provenance; re-executes everything.
- Replay: use basic provenance to speed up recovery; re-execute stateful actors with their inputs from provenance (replay), restore all queues, and resume the workflow according to the model of computation.
- Checkpoint: an extension of the replay strategy; use checkpoints (actor state stored in provenance) to reset stateful actors to their recorded state, replay successful invocations after the checkpoint, restore queue content, and resume the workflow.
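The difference between the strategies is how much of the schedule is touched again after a crash. An illustrative sketch (the linear schedule and function names are assumptions, not Kepler code):

```python
# Sketch: which invocations each recovery strategy revisits after a crash,
# for a simple linear invocation schedule. Illustrative only.

def naive_recovery(schedule, completed):
    """Restart from scratch: every invocation is re-executed."""
    return list(schedule)

def replay_recovery(schedule, completed):
    """Completed invocations are replayed from provenance (cheap);
    only the remainder is truly re-executed."""
    return [inv for inv in schedule if inv not in completed]

def checkpoint_recovery(schedule, completed, checkpoint):
    """Actors are reset to checkpointed state, so only invocations after
    the checkpoint that did not complete need attention."""
    after = schedule[schedule.index(checkpoint) + 1:]
    return [inv for inv in after if inv not in completed]

schedule = ["A:1", "B:1", "C:1", "A:2", "B:2", "C:2"]
completed = {"A:1", "B:1", "C:1", "A:2"}  # crash during B:2
print(naive_recovery(schedule, completed))               # all six
print(replay_recovery(schedule, completed))              # ['B:2', 'C:2']
print(checkpoint_recovery(schedule, completed, "C:1"))   # ['B:2', 'C:2']
```

With a checkpoint near the crash point, the work to redo is bounded by the distance to the last checkpoint rather than by the length of the whole run, which is what makes the checkpoint strategy roughly constant-time.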

Example: Checkpoint in SDF
A workflow with a mix of stateful and stateless actors, and the corresponding schedule of the workflow with a fault during invocation B:2.

Execution with a Failure
Execution of the previous workflow:
- Checkpoints for actors B and D, but not for C
- Crash at invocation B:2
- Tokens t4 and t7: in queue
- Token t9: to be restored
- Token t10: to be deleted
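The queue contents can be reconstructed from provenance by looking at who produced and who consumed each token: a token belongs back on its queue exactly when its producer invocation completed but its consumer did not. A sketch of that rule (the producer/consumer assignments for t4, t7, and t9 are illustrative assumptions):

```python
# Sketch: reconstructing queue contents from provenance after a crash.
# A token is restored iff its producer completed and its consumer did not
# (or it was never consumed); tokens from the failed invocation are dropped.

def restore_queues(tokens, completed):
    """tokens: list of (token_id, producer_invocation, consumer_or_None)."""
    queued, deleted = [], []
    for tok, producer, consumer in tokens:
        if producer not in completed:
            deleted.append(tok)   # produced by the crashed invocation
        elif consumer is None or consumer not in completed:
            queued.append(tok)    # produced but not (successfully) consumed
    return queued, deleted

# Mirroring the slide's example: crash during B:2.
completed = {"A:1", "A:2", "B:1", "C:1", "D:1"}
tokens = [
    ("t4", "A:2", None),    # still in queue
    ("t7", "B:1", None),    # still in queue
    ("t9", "A:1", "B:2"),   # consumed by the failed B:2 -> restore
    ("t10", "B:2", None),   # produced by the failed B:2 -> delete
]
queued, deleted = restore_queues(tokens, completed)
print(queued, deleted)  # ['t4', 't7', 't9'] ['t10']
```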

Stages of Checkpoint Recovery

Prototype Implementation in Kepler
Using Kepler with the Provenance Recorder.
Extensions to the Provenance Recorder:
- Record serialized tokens
- Extend the provenance schema
- Add queries
Recovery extension in the SDF director:
- Serialize actor states after each iteration of the SDF schedule
- Black-list to prevent capturing transient actor information
- White-list for actors annotated with state information
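The black-list/white-list filtering can be sketched as follows (a hypothetical helper, not the actual Kepler Provenance Recorder API; `pickle` stands in for Kepler's token serialization):

```python
# Sketch: checkpointing actor state at an SDF iteration boundary, honoring
# a white-list of actors annotated as stateful and a black-list of
# transient fields that must not be captured in provenance. Illustrative.
import pickle

def checkpoint_actors(actors, whitelist, blacklist=()):
    """Serialize the state of white-listed actors, dropping black-listed
    (transient) fields before they reach the provenance store."""
    snapshot = {}
    for name, state in actors.items():
        if name not in whitelist:
            continue
        kept = {k: v for k, v in state.items() if k not in blacklist}
        snapshot[name] = pickle.dumps(kept)  # serialized checkpoint record
    return snapshot

actors = {
    "B": {"running_sum": 42, "open_file_handle": "<transient>"},
    "C": {"count": 3},  # stateless by annotation: not white-listed
}
snap = checkpoint_actors(actors, whitelist={"B"},
                         blacklist={"open_file_handle"})
print(pickle.loads(snap["B"]))  # {'running_sum': 42}
```

Filtering at capture time keeps the checkpoints small and avoids persisting handles or other state that would be meaningless after a restart.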

Prototype Implementation in Kepler (cont.)
Upon restart:
- The SDF director checks provenance information
- The SDF director calls the recovery engine
Recovery:
- Restore the internal state of actors
- Replay successful invocations using input tokens from provenance
- Restore the content of all queues
- Return to the SDF director with information about where to resume
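The restart sequence can be sketched end to end as follows (an illustrative driver with assumed data shapes, not the Kepler recovery engine):

```python
# Sketch: the restart sequence as one function. Illustrative structure only:
# restore checkpointed state, replay completed invocations from recorded
# input tokens, restore queues, and report where the director should resume.

def recover(provenance):
    """provenance: dict with 'checkpoints', 'invocations', 'queues'."""
    # 1. Reset stateful actors to their checkpointed state.
    actor_state = dict(provenance["checkpoints"])
    # 2. Replay invocations that completed after the checkpoint, feeding
    #    them recorded input tokens instead of recomputing upstream actors.
    replayed = [inv["id"] for inv in provenance["invocations"]
                if inv["completed"] and inv["after_checkpoint"]]
    # 3. Restore queue contents recorded in provenance.
    queues = dict(provenance["queues"])
    # 4. Resume at the first invocation that did not complete.
    resume = next((inv["id"] for inv in provenance["invocations"]
                   if not inv["completed"]), None)
    return actor_state, replayed, queues, resume

prov = {
    "checkpoints": {"B": {"sum": 3}, "D": {"n": 1}},
    "invocations": [
        {"id": "A:2", "completed": True,  "after_checkpoint": True},
        {"id": "B:2", "completed": False, "after_checkpoint": True},
    ],
    "queues": {"A->B": ["t9"], "B->C": ["t4", "t7"]},
}
state, replayed, queues, resume = recover(prov)
print(resume)  # B:2
```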

Evaluation: Synthetic Workflow Results

Conclusion
Advantages of our strategy:
- Efficient workflow recovery using readily available information
- Quick, constant-time recovery (checkpoint strategy)
- A generalized approach, saving labor
- Robustness
Disadvantages of previous strategies:
- Required labor-intensive, customized systems
- A failure required restarting long-running workflows from the beginning
- Caching only works for stateless actors
- Caching only provides partial recovery