Accelerating the Scientific Exploration Process with Kepler Scientific Workflow System

Similar documents
Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

San Diego Supercomputer Center, UCSD, U.S.A. The Consortium for Conservation Medicine, Wildlife Trust, U.S.A.

A High-Level Distributed Execution Framework for Scientific Workflows

Accelerating Parameter Sweep Workflows by Utilizing Ad-hoc Network Computing Resources: an Ecological Example

Kepler and Grid Systems -- Early Efforts --

Hands-on tutorial on usage the Kepler Scientific Workflow System

A High-Level Distributed Execution Framework for Scientific Workflows

Procedia Computer Science

Kepler: An Extensible System for Design and Execution of Scientific Workflows

USING THE BUSINESS PROCESS EXECUTION LANGUAGE FOR MANAGING SCIENTIFIC PROCESSES. Anna Malinova, Snezhana Gocheva-Ilieva

Using Web Services and Scientific Workflow for Species Distribution Prediction Modeling 1

UCLA Grid Portal (UGP) A Globus Incubator Project

Introduction to Grid Computing

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

The Materials Data Facility

UGP and the UC Grid Portals

2/12/11. Addendum (different syntax, similar ideas): XML, JSON, Motivation: Why Scientific Workflows? Scientific Workflows

KEPLER: Overview and Project Status

High Throughput, Low Impedance e-science on Microsoft Azure

Workflow Fault Tolerance for Kepler. Sven Köhler, Thimothy McPhillips, Sean Riddle, Daniel Zinn, Bertram Ludäscher

Analysis and summary of stakeholder recommendations First Kepler/CORE Stakeholders Meeting, May 13-15, 2008

Cloud Computing. Up until now

Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data

Scientific Workflows

ACET s e-research Activities

Kepler Scientific Workflow and Climate Modeling

Climate Data Management using Globus

Juliusz Pukacki OGF25 - Grid technologies in e-health Catania, 2-6 March 2009

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Clouds: An Opportunity for Scientific Applications?

Data publication and discovery with Globus

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

Grid Middleware and Globus Toolkit Architecture

Approaches to Distributed Execution of Scientific Workflows in Kepler

Workflow Exchange and Archival: The KSW File and the Kepler Object Manager. Shawn Bowers (For Chad Berkley & Matt Jones)

OpenSees on Teragrid

ISSN: Supporting Collaborative Tool of A New Scientific Workflow Composition

Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

Hortonworks Data Platform

Managing Exploratory Workflows

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

Hierarchical FSMs with Multiple CMs

Getting Started with XSEDE. Dan Stanzione

JAVA Projects. 1. Enforcing Multitenancy for Cloud Computing Environments (IEEE 2012).

The Gigascale Silicon Research Center

Semantic SOA - Realization of the Adaptive Services Grid

The BOINC Community. PC volunteers (240,000) Projects. UC Berkeley developers (2.5) Other volunteers: testing translation support. Computer scientists

High Performance Computing Resources at MSU

Update on EZ-Grid. Priya Raghunath University of Houston. PI : Dr Barbara Chapman

SAS Platform Strategy Prepared for FANS usergroup. Mike Frost, Director, Product Management Fiona McNeill, Global Product Marketing

Opal: Wrapping Scientific Applications as Web Services

FuncX: A Function Serving Platform for HPC. Ryan Chard 28 Jan 2019

A(nother) Vision of ppod Data Integration!?

Component-Based Design of Embedded Control Systems

Workflow, Planning and Performance Information, information, information Dr Andrew Stephen M c Gough

Gatlet - a Grid Portal Framework

Opal: Simple Web Services Wrappers for Scientific Applications

Introduction to HPC Resources and Linux

Chapter 2 Introduction to the WS-PGRADE/gUSE Science Gateway Framework

Index Introduction Setting up an account Searching and accessing Download Advanced features

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

7 th International Digital Curation Conference December 2011

Provenance for MapReduce-based Data-Intensive Workflows

biokepler: A Comprehensive Bioinforma2cs Scien2fic Workflow Module for Distributed Analysis of Large- Scale Biological Data

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Virtualization of Workflows for Data Intensive Computation

Parsl: Developing Interactive Parallel Workflows in Python using Parsl

LAMBDA The LSDF Execution Framework for Data Intensive Applications

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

NeuroLOG WP1 Sharing Data & Metadata

UC Berkeley Mobies Technology Project

NSF Transition to Practice Challenges. Anita Nikolich National Science Foundation Program Director, Advanced Cyberinfrastructure November, 2015

The Problem of Grid Scheduling

Automatic Transformation from Geospatial Conceptual Workflow to Executable Workflow Using GRASS GIS Command Line Modules in Kepler *

File Transfer: Basics and Best Practices. Joon Kim. Ph.D. PICSciE. Research Computing 09/07/2018

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments *

Science-as-a-Service

Grid and Component Technologies in Physics Applications

SUG Breakout Session: OSC OnDemand App Development

Introduction to Data Management for Ocean Science Research

A geoinformatics-based approach to the distribution and processing of integrated LiDAR and imagery data to enhance 3D earth systems research

DATA SCIENCE USING SPARK: AN INTRODUCTION

Integrated Machine Learning in the Kepler Scientific Workflow System

Big Data Applications using Workflows for Data Parallel Computing

Introduction to BioHPC

Workflow applications on EGI with WS-PGRADE. Peter Kacsuk and Zoltan Farkas MTA SZTAKI

Sliding Window Calculations on Streaming Data using the Kepler Scientific Workflow System

Visualization for Scientists. We discuss how Deluge and Complexity call for new ideas in data exploration. Learn more, find tools at layerscape.

System-Level Design Languages: Orthogonalizing the Issues

Introduction to BioHPC New User Training

ALICE Grid Activities in US

Bio-Workflows with BizTalk: Using a Commercial Workflow Engine for escience

PoS(ISGC 2011 & OGF 31)119

Web Services for Visualization

GROWL Scripts and Web Services

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4

Data Governance for the Connected Enterprise

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel

Transcription:

Accelerating the Scientific Exploration Process with Kepler Scientific Workflow System Jianwu Wang, Ilkay Altintas Scientific Workflow Automation Technologies Lab SDSC, UCSD project.org UCGrid Summit, Kepler Tutorial, Apr-09

Outline Scientific Workflow and Kepler Kepler in UCGrid Use Cases Ecology Use Case Chemistry Use Case project.org UCGrid Summit, Kepler Tutorial, Apr-09 2

Part I: Scientific Workflow Systems and Kepler project.org UCGrid Summit, Kepler Tutorial, Apr-09 3

Scientific Workflow Systems Mission of scientific workflow systems Promote scientific discovery by providing tools and methods to generate larger, automated "scientific process" Provide an extensible and customizable graphical user interface for scientists from different scientific domains Support workflow design, execution, sharing, reuse and provenance Design efficient ways to connect to the existing data and integrate heterogeneous data from multiple resources project.org UCGrid Summit, Kepler Tutorial, Apr-09 4

Scientific Workflow Capture how a scientist works with data and analytical tools data access, transformation, analysis, visualization possible worldview: dataflow-oriented (cf. controlflow-oriented) Scientific workflow (wf) benefits (v.s. script-based approaches): wf & component reuse, sharing, adaptation, archiving wf design, documentation built-in (model) concurrency provenance support distributed & parallel exec: Grid & cluster support wf fault-tolerance, reliability Why a W/F System? Higher-level language vs. assembly-language nature of scripts project.org UCGrid Summit, Kepler Tutorial, Apr-09 5

Kepler Scientific Workflow System project.org Kepler is a cross-project collaboration: over 20 diverse projects and multiple disciplines. Open-source project; latest release available from the website Builds upon the open-source Ptolemy II framework Vergil is the GUI, but Kepler also runs in non-gui and batch modes. initiated August 2003 1 st release: May 13 th, 2008 More than 20 thousand downloads! Ptolemy II: A laboratory for investigating design KEPLER: A problem-solving support environment for Scientific Workflow development, execution, maintenance KEPLER = Ptolemy II + X for Scientific Workflows project.org UCGrid Summit, Kepler Tutorial, Apr-09 6

Actors are the Processing Components Actor-Oriented Design Actor Port Encapsulation of parameterized actions Interface defined by ports and parameters Communication between input and output data Without call-return semantics Relation Links from output Ports to input Ports Could be 1:1, m:n. Actor Examples Web service Actor Matlab Actor File Read Actor Local Execution Actor Job Submission Actor Adapted from the *.ppt slides by Edward A. Lee, UC Berkeley project.org UCGrid Summit, Kepler Tutorial, Apr-09 7

Atomic and Composite Actors atomic actors: perform a single specific independent task. composite actors: collections or sets of atomic/composite actors bundled together to perform more complex operations. project.org UCGrid Summit, Kepler Tutorial, Apr-09 8

Some actors in place for Currently more than 200 Kepler actors added! Generic Web Service Client Customizable RDBMS query and update Command Line wrapper tools (local, ssh, scp, ftp, etc.) Some Grid actors-globus Job Runner, GridFTP-based file access, Proxy Certificate Generator SRB support Native R and Matlab support Interaction with Nimrod and APST Grid Environments Imaging, Gridding, Vis Support Textual and Graphical Output Python, JNI more generic and domain-oriented actors project.org UCGrid Summit, Kepler Tutorial, Apr-09 9

Directors are the WF Engines that Implement different computational models Define the semantics of execution of actors and workflows interactions between actors Ptolemy and Kepler are unique in combining different execution models in heterogeneous models! Kepler is extending Ptolemy directors with specialized ones for distributed workflows. Dataflow Time Triggered Synchronous/reactive model Discrete Event Wireless Process Networks Rendezvous Publish and Subscribe Continuous Time Finite State Machines project.org UCGrid Summit, Kepler Tutorial, Apr-09 10

Kepler Modeling with GUI Actor Search Data Search Actor ontology and semantic search for actors Search -> Drag and drop -> Link via ports Metadata-based search for datasets project.org UCGrid Summit, Kepler Tutorial, Apr-09 11

Kepler Execution From GUI: click execution button From Kepler Web Service: for detached execution Synchronous: executbycontent, executebyuri, Asynchronous: startexebycontent, getstatus, get Result, Batch Mode: useful for command line and job submission Kepler.sh [config] workflow.xml project.org UCGrid Summit, Kepler Tutorial, Apr-09 12

Provenance of Workflow Related Data Provenance: A concept from art history and library Inputs, outputs, intermediate results, workflow design, workflow run Collected information Can be used in a number of ways Validation, reproducibility, fault tolerance, etc Can be recorded in a number of ways System.out, text file, databases, etc Viewable and searchable from outside of Kepler project.org UCGrid Summit, Kepler Tutorial, Apr-09 13

Running Provenance Recorder Circonspect Workflow By Madhusudan and Ilkay from SDSC. In CAMERA Project Funded by the Gordon and Betty Moore Foundation. project.org UCGrid Summit, Kepler Tutorial, Apr-09 14

Part II: Kepler in UC Grid project.org UCGrid Summit, Kepler Tutorial, Apr-09 15

Master-Slave Distributed Execution Framework Utilize distributed resources to accelerate workflow execution Smooth transition between different execution environments, such as local, ad-hoc network, cluster, grid and cloud PN Domain Composite Actor 1 Distributed Composite Actor 2 4 Master Composite Actor 3 Slave 1 Slave 2 project.org UCGrid Summit, Kepler Tutorial, Apr-09 16

Cluster Job Submission Actors Adaptable for different cluster schedulers, such as SGE and PBS Adaptable for local execution and ssh execution project.org UCGrid Summit, Kepler Tutorial, Apr-09 17

Example of Job Submission Actors Job Submission Workflow. By Norbert Podhorszki from UC Davis. In SDM Project Funded by the DOE SciDac Award No. DE-FC02-07ER25811. project.org UCGrid Summit, Kepler Tutorial, Apr-09 18

Grid Actors Actors: Grid Authentication, Globus Job, Grid Proxy, GridFTP, Support both Pre-WS and WS Globus Resource Invocation project.org UCGrid Summit, Kepler Tutorial, Apr-09 19

Collaboration of Kepler and UCGrid UCGrid provides abundant computing and software resources for scientists Kepler provides a bridge for scientists to easily utilize the above resources according to their domain problems Scientists compose individual tasks by Kepler workflows and run them in UCGrid project.org UCGrid Summit, Kepler Tutorial, Apr-09 20

Usage Modes of Kepler in UCGrid Kepler Application in UCGrid: Users model workflows from Kepler GUI, upload them to UCGrid portal, and execute them through Kepler batch-mode command Kepler Globus Web Service in UCGrid: With UCGrid authentication, We can integrate user applications with UCGrid, their tasks be executed through deployed Kepler WS Direct Execution from Kepler GUI: With UCGrid authentication, users can model workflows that submit jobs to UCGrid, and execute them from Kepler GUI project.org UCGrid Summit, Kepler Tutorial, Apr-09 21

Part III: Use Cases project.org UCGrid Summit, Kepler Tutorial, Apr-09 22

Theoretical Ecology Use Case It is a spatial stochastic birth-death process that simulates the dynamics of Mycoplasma gallisepticum in House Finches (Carpodacus mexicanus) The simulation code is written in GNU C++, and involves file reads, relatively complex mathematical operations The execution results were visualized using the R statistical system It needs to be run with a broad range of parameter sweep, namely the computing code may be iterated for over hundreds times with different parameter configurations Collaboration with Parviez R. Hosseini (Princeton Univ.), Derik Barseghian (UCSB) In REAP (Realtime Environment for Analytical Processing) project (http://reap.ecoinformatics.org/) Funded by NSF CEO:P Award No. DBI 0619060 project.org UCGrid Summit, Kepler Tutorial, Apr-09 23

Conceptual and Kepler Workflow Parameter Sweep Simhofi Program R App. Result Display Parameter Preparation Computation Tasks Result Display Conceptual Workflow sub-workflow to be executed on multiple nodes. Kepler Workflow project.org UCGrid Summit, Kepler Tutorial, Apr-09 24

Configuration and Experiments Interaction for execution environment transition Experiment data project.org UCGrid Summit, Kepler Tutorial, Apr-09 25

Computational Chemistry Use Case The whole goal is to (re)design existing enzymes to catalyze a novel chemical reaction The workflow will provide an automated way of generating enzyme designs from a model allows scientists to focus on creating better models rather than fussing with a number of different programs Each execution will generate over 4000 Protein Data Bank files which could be processed concurrently Collaboration with Scott Johnson, Seonah Kim, Prakashan Korambath, Kejian Jin (UCLA) and Shava Smallen (SDSC). project.org UCGrid Summit, Kepler Tutorial, Apr-09 26

Enzyme Design Workflow in Kepler project.org UCGrid Summit, Kepler Tutorial, Apr-09 27

Main Work For Enzyme Design Workflow Three versions of Enzyme Design Workflow Execute the Enzyme programs directly and locally Done Wrap the programs and submit as SGE jobs at Hoffman2 cluster Done Wrap the programs and submit as Globus jobs at UCGrid On Going Accelerate Workflow with UCGrid With Kepler Cluster Job Submission Actor and Hoffman2 cluster, the execution time is reduced from 2000 mins (in theory) to 80 mins Using Kepler with Grid resources will enable better parallel execution among multiple Grid nodes and reduce the whole execution time largely Provenance Support Each workflow execution will generate over 4000 pdb files and scientists need the workflow to executed for many times with different input model Provenance can help scientists to track the data efficiently in the future project.org UCGrid Summit, Kepler Tutorial, Apr-09 28

Thanks! & Questions Jianwu Wang jianwu@sdsc.edu +1 (858) 534-5110 Kepler Download: https://kepler-project.org/users/downloads Kepler Documents: https://kepler-project.org/users/documentation project.org UCGrid Summit, Kepler Tutorial, Apr-09 29