Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

Similar documents
Accelerating the Scientific Exploration Process with Kepler Scientific Workflow System

Kepler and Grid Systems -- Early Efforts --

San Diego Supercomputer Center, UCSD, U.S.A. The Consortium for Conservation Medicine, Wildlife Trust, U.S.A.

Hands-on tutorial on usage the Kepler Scientific Workflow System

A High-Level Distributed Execution Framework for Scientific Workflows

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

Kepler: An Extensible System for Design and Execution of Scientific Workflows

A(nother) Vision of ppod Data Integration!?

KEPLER: Overview and Project Status

biokepler: A Comprehensive Bioinforma2cs Scien2fic Workflow Module for Distributed Analysis of Large- Scale Biological Data

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

Integrated Machine Learning in the Kepler Scientific Workflow System

A geoinformatics-based approach to the distribution and processing of integrated LiDAR and imagery data to enhance 3D earth systems research

A High-Level Distributed Execution Framework for Scientific Workflows

Introduction to Grid Computing

Workflow Fault Tolerance for Kepler. Sven Köhler, Thimothy McPhillips, Sean Riddle, Daniel Zinn, Bertram Ludäscher

Workflow Exchange and Archival: The KSW File and the Kepler Object Manager. Shawn Bowers (For Chad Berkley & Matt Jones)

2/12/11. Addendum (different syntax, similar ideas): XML, JSON, Motivation: Why Scientific Workflows? Scientific Workflows

The iplant Data Commons

Enhanced Access to High-Resolution LiDAR Topography through Cyberinfrastructure-Based Data Distribution and Processing

7 th International Digital Curation Conference December 2011

Sliding Window Calculations on Streaming Data using the Kepler Scientific Workflow System

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

Knowledge-based Grids

Big Data Applications using Workflows for Data Parallel Computing

Using Web Services and Scientific Workflow for Species Distribution Prediction Modeling 1

OPeNDAP: Accessing HYCOM (and other data) remotely

Digital Curation and Preservation: Defining the Research Agenda for the Next Decade

Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences

Provenance for MapReduce-based Data-Intensive Workflows

Data publication and discovery with Globus

Mitigating Risk of Data Loss in Preservation Environments

Virtualization of Workflows for Data Intensive Computation

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

IRODS: the Integrated Rule- Oriented Data-Management System

Approaches to Distributed Execution of Scientific Workflows in Kepler

Provenance Collection Support in the Kepler Scientific Workflow System

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data

Web Services Based Instrument Monitoring and Control

Scientific Workflows

Reproducible & Transparent Computational Science with Galaxy. Jeremy Goecks The Galaxy Team

A Three Tier Architecture for LiDAR Interpolation and Analysis

Where we are so far. Intro to Data Integration (Datalog, mediators, ) more to come (your projects!): schema matching, simple query rewriting

Pegasus Workflow Management System. Gideon Juve. USC Informa3on Sciences Ins3tute

Kepler Scientific Workflow and Climate Modeling

OOI CyberInfrastructure Architecture & Design

Kepler scientific workflow orchestrator as a tool to build the computational workflows.

LASDA: an archiving system for managing and sharing large scientific data

Co-existence: Can Big Data and Big Computation Co-exist on the Same Systems?

Interoperability in Science Data: Stories from the Trenches

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Introduction to Data Management for Ocean Science Research

Cyberinfrastructure!

BIG DATA CHALLENGES A NOAA PERSPECTIVE

Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)

Accelerating Parameter Sweep Workflows by Utilizing Ad-hoc Network Computing Resources: an Ecological Example

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Web Services for Visualization

The Hadoop Paradigm & the Need for Dataset Management

The International Journal of Digital Curation Volume 7, Issue

Situations and Ontologies: helping geoscientists understand and share the semantics surrounding their computational resources

The Materials Data Facility

Galaxy. Daniel Blankenberg The Galaxy Team

Wade Sheldon. Georgia Coastal Ecosystems LTER University of Georgia CUAHSI Virtual Workshop Field Data Management Solutions

Universities Access IBM/Google Cloud Compute Cluster for NSF-Funded Research

DataONE: Open Persistent Access to Earth Observational Data

irods usage at CC-IN2P3: a long history

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

Heterogeneous Workflows in Scientific Workflow Systems

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

APAC: A National Research Infrastructure Program

DSpace Fedora. Eprints Greenstone. Handle System

ORION/OOI CYBERINFRASTRUCTURE

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

EUDAT- Towards a Global Collaborative Data Infrastructure

Distributed Repository for Biomedical Applications

Analysis and summary of stakeholder recommendations First Kepler/CORE Stakeholders Meeting, May 13-15, 2008

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments *

ISSN: Supporting Collaborative Tool of A New Scientific Workflow Composition

Distributed Data Management with Storage Resource Broker in the UK

Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

UNIFY DATA AT MEMORY SPEED. Haoyuan (HY) Li, Alluxio Inc. VAULT Conference 2017

Workflow applications on EGI with WS-PGRADE. Peter Kacsuk and Zoltan Farkas MTA SZTAKI

Software + Services for Data Storage, Management, Discovery, and Re-Use

Dennis Gannon Data Center Futures Microsoft Research

Rutgers Discovery Informatics Institute (RDI2)

Visualization for Scientists. We discuss how Deluge and Complexity call for new ideas in data exploration. Learn more, find tools at layerscape.

OOI CyberInfrastructure Conceptual and Deployment Architecture

Building blocks: Connectors: View concern stakeholder (1..*):

Data Movement & Tiering with DMF 7

DATA SCIENCE USING SPARK: AN INTRODUCTION

Massive Online Analysis - Storm,Spark

SAS Platform Strategy Prepared for FANS usergroup. Mike Frost, Director, Product Management Fiona McNeill, Global Product Marketing

Automatic Transformation from Geospatial Conceptual Workflow to Executable Workflow Using GRASS GIS Command Line Modules in Kepler *

Kepler User Manual Version October, 2010

Wade Sheldon. Georgia Coastal Ecosystems LTER University of Georgia

Introduction to Text Mining. Hongning Wang

CloudMan cloud clusters for everyone

Clouds: An Opportunity for Scientific Applications?

Transcription:

Scientific Workflow Tools Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego 1

escience Today Increasing number of Cyberinfrastructure (CI) technologies Data Repositories: Network File Systems, Databases, Web Services, SRB/iRODS Job Execution: Cloud Computing, Grid, Cluster, Adhoc Domain-specific analysis tools Difficult to orchestrate CI components to conduct escience 2

Scientific workflows emerged as an answer to the need to combine multiple Cyberinfrastructure components in automated process networks. So, what is a scientific workflow? 3

The Big Picture: Supporting the Scientist Phylogeny Analysis Workflow From Napkin Drawings to Executable Workflows Local Disk Multiple Sequence Alignment Phylogeny Analysis Tree Visualization 4

Advantages of Scientific Workflow Systems Formalization of the scientific process Easy to share, adapt and reuse Deployable, customizable, extensible Management of complexity and usability Support for hierarchical composition Interfaces to different technologies from a unified interface Can be annotated with domain-knowledge Tracking provenance of the data and processes Keep the association of results to processes Make it easier to validate/regenerate results and processes Enable comparison between different workflow versions Execution monitoring and fault tolerance Interaction with multiple tools and resources at once 5

Kepler Scientific Workflow System http://www.kepler-project.org Kepler is a cross-project collaboration Latest release available from the website Kepler 2.1 released on 30 September 2010 Builds upon the open-source Ptolemy II framework Vergil is the GUI, but Kepler also runs in non-gui and batch modes. Ptolemy II: A laboratory for investigating design KEPLER: A problem-solving support environment for Scientific Workflow development, execution, maintenance KEPLER = Ptolemy II + X for Scientific Workflows 6

Actor-Oriented Modeling Actors Single component or task Well-defined interface (signature) Given input data, produces output data Configured with parameters Composite actor for sub-workflows Ports Each actor has a set of input and output ports Denote the actor s signature Produce/consume data (a.k.a. tokens) Can be semantically annotated with domain-specific concepts 7

Dataflow Connections Actor communication channels Directed edges Connect output ports with input ports Directors Execution models, define the execution semantics of workflow graphs Executes workflow graph (some schedule) Sub-workflows may have different directors SC07/Kepler Tutorial/V8bc/Nov-07 8 8

Kepler Actors Generic Web Service Clients- SOAP, REST, MS.Net Customizable RDBMS query and update Command Line wrapper tools (local, ssh, scp, ftp, etc.) Grid actors: Globus, GridFTP, Proxy Certificate Generator SRB and irods R and Matlab Interaction with MapReduce Communication with streaming data buffers- DataTurbine, ORB Imaging, Gridding, Viz Support Textual and Graphical Output Specialized actor for fault tolerance additional generic and domain-oriented actors 9

Vergil is the GUI for Kepler Actor Search Data Search Actor ontology and semantic search for actors Search -> Drag and drop -> Link via ports Metadata-based search for datasets 10

Actor Search Kepler Actor Ontology Used in searching actors and creating conceptual views (= folders) Currently there are more than 200 Kepler actors! 11

Data Search and Usage of Results EarthGrid Discovery of data resources through local and remote services: SRB, Grid and Web Services, DB connections Registry of datasets on-thefly using workflows 12

Distributed Execution Master-Slave Execute sub-workflows on slave nodes Map-Reduce Map and reduce sub-workflows executed in Hadoop cloud Job actors for PBS, LSF, SGE, Globus, etc. Kepler web service 13

Master-Slave Framework Single workflow created, sub-workflows seamlessly run on other resources Data is automatically distributed to Slave nodes and results returned Different behavior with different computation models Master Master SDF Domain PN Domain Composite Actor Distributed Composite Actor Composite Actor Composite Actor Distributed Composite Actor Composite Actor 1 2 Slave 1 1 3 Slave 1 3 Slave 2 4 2 4 Slave 2 14

Provenance of Workflow Related Data Provenance: A concept from art history and library Inputs, outputs, intermediate results, workflow design, workflow run Collected information Can be used in a number of ways Validation, reproducibility, fault tolerance, etc Linked to the data resources Viewable and searchable from outside Kepler 15

Kepler Provenance Framework What provenance is recorded: Workflow Specification: actors, ports, connections, parameters, etc. Workflow Evolution: parameter values that change over time, addition/removal of actors, ports, etc. Workflow Execution: Start/stop of workflow, individual actor executions Data exchanged between actors: data lineage Where provenance recorded: Modular interface supports saving to different output types. SQL, XML, or Open Provenance Model 16

Kepler is a Team Effort Some CI projects using Kepler: SEEK (ecology) SciDAC SDM (astrophysics, bio,...) CPES (plasma simulation, combustion) GEON (geosciences) CiPRes (phylogenetics) ROADnet (real-time data) Processing Phylodata (ppod) REAP (streaming data) Digital preservation (DIGARCH) COMET (environmental science) ITER (fusion) OOI CI - ORION (ocean observing CI) LOOKING (oceanography) CAMERA (metagenomics) Resurgence (computational chemistry) ChIP-chip (genomics) Cheshire Digital Library (archival) Cell Biology (Scripps) DART (X-Ray crystallography) Ocean Life Assembling the Tree of Life project NEES (earthquake engineering)... 17

Real-time Environment for Analytical Processing (REAP) Management and Analysis of Observatory Data using Kepler Scientific Workflows Overall goal: To bring together, for the first time, seamless access to sensor data from real- time data grids with analytical tools and sophisticated modeling capabilities of scientific workflow environments Funded 2006-2010 NSF CEO:P Jones, Altintas, Baru, Ludaescher, Schildhauer Partners: UCSB, SDSC/UCSD, UCDavis, UCLA, OpenDAP, OSU Lead institution: NCEAS/UCSB http://reap.ecoinformatics.org/ 18

Sea Surface Temperature (SST) Match-up Workflows Quantitative evaluation and integration of SST data sets Allows researchers to find data sets for a given spacetime window Builds match-up data sets from various sources, e.g., NOAA, JPL, FSU, using OPeNDAP Performs a variety of statistical comparisons and visualizations on match-ups using R and Matlab Collaborators: Peter Cornillon, Univ. of Rhode Island Nathan Potter, James Gallagher, OPeNDAP Inc. 19

Summary Scientific workflows help scientists manage diverse CI technologies Kepler is an open-source system and collaboration Grows by application requirements from contributors More information: http://kepler-project.org Acknowledgements: NSF award 0619060 for Real-time Environment for Analytical Processing NSF award 0941692 for Distributed Ocean Monitoring via Integrated Data Analysis of Coordinated Buoyancy Drogues DOE award DE-FC02-01ER25486 for SDM Center 20

Thanks! & Questions Daniel Crawl crawl@sdsc.edu Ilkay Altintas altintas@sdsc.edu 21