A High-Level Distributed Execution Framework for Scientific Workflows


A High-Level Distributed Execution Framework for Scientific Workflows
Jianwu Wang¹, Ilkay Altintas¹, Chad Berkley², Lucas Gilbert¹, Matthew B. Jones²
¹ San Diego Supercomputer Center, UCSD, U.S.A.
² National Center for Ecological Analysis and Synthesis, UCSB, U.S.A.

Outline
- Introduction
- Background
- Scientific Workflow Specification Structure
- Requirements for Distributed Execution using Scientific Workflows
- Our Conceptual Architecture
- Working Mechanisms
- Case Study
- Conclusion and Future Work

Introduction
- Scientific workflows help domain scientists solve scientific problems.
- Most workflow systems centralize execution, which often creates a performance bottleneck.
- Distributed execution of scientific workflows is a growing and promising way to achieve better execution performance and efficiency.

Scientific Workflow Specification Structure
Focus: the basic scientific workflow specification structure
- tasks
- data dependencies
- control dependencies

Requirements for Distributed Execution using Scientific Workflows
- Execution of Tasks on Remote Nodes
- Distributed Node Discovery
- Peer-to-Peer Data Transfer
- Provenance of Distributed Execution
- Distributed Monitoring
- Transparent Implementation
- Reuse of Existing Workflows
- Failure Recovery

Our Goals
- Easy-to-use
- Comprehensive
- Adaptable
- Extensible
- Efficient

Conceptual Architecture

Interaction sequence of a distributed scientific workflow execution

Working Mechanisms (1/5)
Decoupling of the Workflow Specification from the Execution Model
- Existing workflow specifications can be used with both centralized and distributed execution models (i.e., workflow engines) by simply replacing the Director in Kepler.
Peer-to-Peer Data Transfer
- A corresponding pipeline is created for each data dependency.
- Data flows from the source Slave to the destination Slave(s) directly.
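The peer-to-peer transfer idea can be sketched in a few lines: each data dependency in the workflow graph becomes a direct pipe between two Slaves, so intermediate data never passes through the Master. This is a minimal illustration of the concept, not Kepler's actual implementation; the `Slave` class and its methods are hypothetical.

```python
import queue

class Slave:
    """Hypothetical Slave node: exchanges data over direct Slave-to-Slave pipes."""
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()   # data arriving from upstream Slaves

    def send(self, dest, item):
        # One pipe per data dependency; the Master never touches the payload.
        dest.inbox.put((self.name, item))

# A single data dependency: slave-A produces, slave-B consumes.
source = Slave("slave-A")
dest = Slave("slave-B")
source.send(dest, "intermediate-result.dat")
sender, payload = dest.inbox.get()
```

In a real deployment the in-process queue would be a network channel, but the topology is the same: the Master only coordinates, while data moves point-to-point.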

Working Mechanisms (2/5)
Transparent Implementation
- Define technology selection rules; detect and adapt to the context of the actual runtime environment.
Ease of Deployment
- Each node running a workflow instance can act as an execution endpoint in either the Master or Slave role.
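A technology selection rule of the kind mentioned above can be thought of as a mapping from detected context to a concrete mechanism. The rule below is our own toy example (the context keys and technology names are invented, not Kepler's), meant only to show the shape of such a rule.

```python
def select_technology(context):
    """Toy technology-selection rule: pick a data-transfer mechanism
    from the detected network context (all names are illustrative)."""
    if context.get("same_host"):
        return "shared-filesystem"     # cheapest option when co-located
    if context.get("firewalled"):
        return "relay-through-master"  # fall back when direct links fail
    return "direct-socket"             # default: peer-to-peer transfer

tech = select_technology({"firewalled": True})
```

The point of keeping this logic in one rule is transparency: users write the same workflow regardless of which mechanism the framework ends up choosing at run time.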

Working Mechanisms (3/5)
Capability-Based Slave Registration
- Slave Execution Capability Metamodel
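Capability-based registration means each Slave describes what it can do (resources, installed tasks) to the Master, which keeps a registry it can later query. The sketch below assumes a very small capability metamodel of our own invention; the real metamodel in the framework is richer.

```python
class Master:
    """Hypothetical Master: keeps a registry of Slave execution capabilities."""
    def __init__(self):
        self.registry = {}

    def register(self, slave_name, capabilities):
        # Capabilities follow a simple metamodel: resources plus runnable tasks.
        self.registry[slave_name] = capabilities

    def slaves_supporting(self, task):
        # Query the registry for Slaves able to execute a given task.
        return [s for s, c in self.registry.items() if task in c["tasks"]]

master = Master()
master.register("slave-1", {"cpus": 4, "memory_gb": 8, "tasks": {"blast", "align"}})
master.register("slave-2", {"cpus": 2, "memory_gb": 4, "tasks": {"align"}})
candidates = master.slaves_supporting("blast")
```

Registration happens when a Slave joins, so the Master can schedule tasks without probing nodes at execution time.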

Working Mechanisms (4/5)
Automatic Constraint-Based Task Scheduling
- Match user requirements against Slave execution capabilities to find optimal task scheduling solutions.
- Meet both functional requirements and non-functional constraints.
- New task scheduling algorithms are needed: the run times of some tasks vary with different input configurations, so scheduling must take each task's input and configuration values into account.
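Constraint-based matching can be sketched as a filter over the capability registry followed by a selection policy. The constraint keys and the least-loaded policy here are illustrative assumptions, not the scheduling algorithm the paper proposes.

```python
def schedule(task, constraints, registry):
    """Match a functional requirement (the Slave must run `task`) and a
    non-functional constraint (minimum memory) against Slave capabilities;
    return the least-loaded matching Slave, or None if nothing matches."""
    matches = [
        (name, cap) for name, cap in registry.items()
        if task in cap["tasks"]
        and cap["memory_gb"] >= constraints.get("min_memory_gb", 0)
    ]
    if not matches:
        return None
    # Toy selection policy: prefer the Slave with the lightest current load.
    return min(matches, key=lambda m: m[1]["load"])[0]

registry = {
    "slave-1": {"tasks": {"interpolate"}, "memory_gb": 16, "load": 3},
    "slave-2": {"tasks": {"interpolate"}, "memory_gb": 8, "load": 1},
}
chosen = schedule("interpolate", {"min_memory_gb": 8}, registry)
```

A production scheduler would replace the load heuristic with a cost model that accounts for each task's input and configuration values, as the slide notes.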

Working Mechanisms (5/5)
Broker-Based Provenance Management
- Centralized: inefficient for storing the data content
- Decentralized: efficient, but the data is difficult to query and integrate later
- Broker-based: a tradeoff between functionality and efficiency
  - Each Slave records the data locally and registers it with the Provenance Manager.
  - The Provenance Manager records only the reference information.
  - The Master node can retrieve the data content from the corresponding Slaves.
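The broker pattern can be illustrated with a Provenance Manager that holds only references while the bytes stay on the Slaves. All class and function names below are hypothetical stand-ins for the components the slide names.

```python
class ProvenanceManager:
    """Hypothetical broker: stores only references, never the data itself."""
    def __init__(self):
        self.refs = {}

    def register(self, data_id, slave, local_path):
        self.refs[data_id] = (slave, local_path)

class SlaveNode:
    def __init__(self, name):
        self.name = name
        self.store = {}     # data kept locally on the Slave

    def record(self, pm, data_id, content):
        path = f"/local/{data_id}"
        self.store[path] = content
        pm.register(data_id, self, path)   # the broker gets the reference only

def fetch(pm, data_id):
    # The Master resolves the reference and pulls content from the Slave.
    slave, path = pm.refs[data_id]
    return slave.store[path]

pm = ProvenanceManager()
s = SlaveNode("slave-1")
s.record(pm, "run42/output", b"result-bytes")
content = fetch(pm, "run42/output")
```

This keeps queries centralized (one place to ask "what was produced?") while bulk data stays decentralized, which is exactly the tradeoff the slide describes.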

Case Study
Scenario: A group of three scientists collaboratively construct a workflow with tasks in their sub-domains. The workflow cannot be executed as a whole on any of their computers. They want to:
- Connect their computers (Computer 1, Computer 2, Computer 3) to execute the workflow
- Track the provenance information
Solution:

Conclusion and Future Work
- A high-level distributed execution framework, based on requirements from the Kepler community
- Discussed its main working mechanisms
- Main focus on usability, in terms of adoption in our community
Future work:
- Refine the design details
- Finish the implementation in Kepler
- Evaluate it with applications

A Quick Demo
Simhofi workflow: terrestrial ecology (with Parviez Hosseini, Princeton University)

Thanks! Questions?
For More Information:
- Distributed Execution Interest Group of Kepler: https://dev.kepler-project.org/developers/interestgroups/distributed
- Contact: jianwu@sdsc.edu

Related Work
Several scientific workflow systems support distributed execution.
- Triana: peer-to-peer execution; intuitive graphical user interface
- Pegasus: executes workflows in Grid environments; provenance support
- ASKALON: service repository to share services; data repository to share data