Scientific Workflows

Overview
- More background on workflows
- Kepler details
- Example scientific workflows
- Other workflow systems

Recap from last time
- Background: What is a scientific workflow?
- Goals: automate a scientist's repetitive data management and analysis tasks
- Typical phases:
  - Data access, scheduling, generation, transformation, aggregation, analysis, visualization
  - Design, test, share, deploy, execute, reuse SWFs
- Overview and demo of Kepler
Adapted from B. Ludaescher

Scientific Workflows: Some Findings
- Very different granularities: from high-level design to lowest-level plumbing
- More dataflow than (business control) workflow
- Need for programming extensions
  - Iterations over lists (foreach), filtering, functional composition, generic and higher-order operations (zip, map(f))
- Need for abstraction and nested workflows
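As a rough illustration of the programming extensions listed above, the following sketch shows filtering, zip, map(f), and functional composition using only Python built-ins. The sample names, the readings, and the `compose` helper are made up for illustration and do not come from any workflow system.

```python
from functools import reduce

def compose(*funcs):
    """Compose single-argument functions, applied left to right."""
    return lambda value: reduce(lambda acc, f: f(acc), funcs, value)

sample_ids = ["s1", "s2", "s3"]   # hypothetical sample names
readings = [12, -4, 30]           # hypothetical raw values

paired = list(zip(sample_ids, readings))              # zip two lists
valid = [pair for pair in paired if pair[1] >= 0]     # filtering
rescale = compose(lambda v: v * 100, lambda v: v // 40)  # functional composition
results = list(map(lambda pair: (pair[0], rescale(pair[1])), valid))  # map(f)

print(results)  # [('s1', 30), ('s3', 75)]
```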

Scientific Workflows: Findings (continued)
- Need for data transformations
- Need for rich user interaction and workflow steering
  - Pause/revise/resume
  - Select & branch, e.g., web browser capability at specific steps as part of a coordinated SWF
- Need for high-throughput data transfers and CPU cycles
  - Grid-enabling, streaming
- Need for persistence of intermediate products and provenance

Data-flow vs. Control-flow
- Useful for
  - Specification (language, model)
  - Synthesis (scheduling, optimization)
  - Validation (simulation, formal verification)
- Rough classification:
  - Control
    - Don't know when data arrive (quick reaction)
    - Time of arrival often matters more than value
  - Data
    - Data arrive in regular streams (samples)
    - Value matters most

Data-flow vs. Control-flow
- Specification, synthesis, and validation methods tend to emphasize
- For control:
  - Event/reaction relation
  - Response time (real-time scheduling for deadline satisfaction)
  - Priority among events and processes

Data-flow vs. Control-flow
- For data:
  - Functional dependency between input and output
  - Memory/time efficiency (dataflow scheduling for efficient pipelining)
  - All events and processes are equal

Business Workflows vs. Scientific Workflows
- Business workflows
  - Task oriented: travel reservations, credit approval, etc.
  - Tasks, documents, etc. undergo modifications (e.g., a flight reservation goes from reserved to ticketed), but modified WF objects remain identifiable throughout
  - Complex control flow, complex process composition
  - Dataflow and control-flow are often divorced

Business Workflows vs. Scientific Workflows
- Scientific workflows
  - Dataflow and data transformations
  - Data problems: volume, complexity, heterogeneity
  - Grid aspects:
    - Distributed computation
    - Distributed data
  - User interactions/WF steering
  - Data, tool, and analysis integration
  - Dataflow and control-flow are often married

SWF User Requirements
- Design tools, especially for non-expert users
  - Need to look into how scientists define processes
- Ease of use
  - Fairly simple user interface, with more complex features hidden in the background
- Reusable generic features
  - Generic enough to serve different communities but specific enough to serve one domain
- Extensibility for the expert user
  - Almost a visual programming interface
- Registration and publication of data products and process products (workflows); provenance

SWF Technical Requirements
- Error detection and recovery from failure
- Logging information for each workflow
- Allow data-intensive and compute-intensive tasks (maybe at the same time)
- Data management/integration
- Allow status checks and on-the-fly updates
- Visualization
- Semantics and metadata-based dataset access
- Certification, trust, security

Challenges/Requirements
- Seamless access to resources and services
  - Web services are a simple solution but don't address harder problems, e.g., web service orchestration, third-party transfers
- Service composition & reuse and workflow design
  - How to compose simple services to perform complex tasks
  - Design components that are easily reusable, not application-specific

Challenges/Requirements
- Scalability
  - Some workflows require large amounts of data and/or high-end computational resources
  - Require interfaces to Grid middleware components
- Detached execution
  - Allow long-running workflows to run in the background on a remote server
- Reliability and fault tolerance
  - e.g., a workflow could fail if a web service fails

Challenges/Requirements
- User interaction
  - e.g., users may inspect intermediate results
- Smart re-runs
  - Changing a parameter after intermediate results without executing the workflow from scratch
- Smart semantic links
  - Assist in workflow design by suggesting which components might fit together
- Data provenance
  - Which data products and tools created a derived data product
  - Log sequence of steps, parameter settings, etc.
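One way to picture smart re-runs and provenance logging together is a cache keyed on each step's name, inputs, and parameters. The sketch below is only an illustration under that assumption; `StepCache`, `normalize`, and `threshold` are hypothetical names, not part of Kepler or any other workflow tool.

```python
import hashlib
import json

class StepCache:
    """Hypothetical helper: reuses step results and keeps a provenance log."""

    def __init__(self):
        self._results = {}
        self.log = []  # one provenance record per step invocation

    def run(self, name, func, inputs, params):
        # The key changes whenever the step name, inputs, or parameters change.
        payload = json.dumps([name, inputs, params], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()
        reused = key in self._results
        if not reused:
            self._results[key] = func(inputs, **params)
        self.log.append({"step": name, "params": params, "reused": reused})
        return self._results[key]

def normalize(values):
    top = max(values)
    return [v / top for v in values]

def threshold(values, cutoff):
    return [v for v in values if v >= cutoff]

def workflow(cache, data, cutoff):
    scaled = cache.run("normalize", normalize, data, {})
    return cache.run("threshold", threshold, scaled, {"cutoff": cutoff})

cache = StepCache()
first = workflow(cache, [2, 4, 8], cutoff=0.5)
second = workflow(cache, [2, 4, 8], cutoff=0.9)  # only the threshold step re-executes

print(first, second)
print([record["reused"] for record in cache.log])
```

On the second call only the parameter of the last step changed, so the normalize result is taken from the cache, and the log records which steps ran with which settings.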

Why is a GUI useful?
- No need to learn a programming language
- Visual representation of what the workflow does
- Allows you to monitor workflow execution
- Enables user interaction
- Facilitates sharing workflows

Kepler Details
- Director/Actor metaphor
  - Actors are executable components of a workflow
  - Director controls execution of the workflow
- Workflows are saved as XML files
  - Workflows can easily be shared/published

Directors
- Many different models of computation are possible
  - Synchronous: processing occurs one component at a time
  - Parallel: one or more components run simultaneously
- Every Kepler workflow needs a director

Actors
- Reusable components that execute a variety of functions
- Communicate with other actors in the workflow through ports
- Composite actor: aggregation of actors
- A composite actor may have a local director

Parameters
- Values that can be attached to a workflow or to individual directors/actors
- Accessible by all actors in a workspace
- Facilitate workflow configuration
- Analogous to global variables

Ports
- Ports are used to produce and consume data and to communicate with other actors in the workflow
  - Input port: data consumed by actor
  - Output port: data produced by actor
  - Input/output port: data both produced and consumed
- Ports can be singular or multiple

Relations
- Direct the same input or output to more than one other port
- Example: direct an output to a display actor to show intermediate results, and to an operational actor for further processing
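The fan-out role of a relation can be sketched in a few lines. This is a toy model, not Kepler's implementation: the `Relation` class and the two sinks (a stand-in "display" list and a stand-in processing step) are hypothetical.

```python
class Relation:
    """Toy relation: forwards every token to all connected input ports."""

    def __init__(self):
        self.sinks = []

    def connect(self, sink):
        self.sinks.append(sink)

    def send(self, token):
        for sink in self.sinks:
            sink(token)

displayed = []   # stands in for a display actor showing intermediate results
processed = []   # stands in for an operational actor doing further work

relation = Relation()
relation.connect(displayed.append)
relation.connect(lambda token: processed.append(token * 2))

for token in [1, 2, 3]:
    relation.send(token)

print(displayed, processed)  # [1, 2, 3] [2, 4, 6]
```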

Other Kepler features
- Can call external functions
- Can implement your own actors
- Incremental development for rapid prototyping
  - If inputs and outputs are defined, can incorporate actors into the workflow
  - Example: dummy composite actor
- Components can be designed and tested separately

Focus on Actor-Oriented Design
- Object orientation: class name, data, methods (call, return)
  - What flows through an object is sequential control
- Actor orientation: actor name, data (state), parameters, ports (input data, output data)
  - What flows through an object is streams of data

Object-Oriented vs. Actor-Oriented Interface Definitions
- Object oriented: TextToSpeech
  - initialize(): void
  - notify(): void
  - isReady(): boolean
  - getSpeech(): double[]
  - The OO interface definition gives procedures that have to be invoked in an order not specified as part of the interface definition
- Actor oriented: Text to Speech
  - text in, speech out
  - The AO interface definition says "Give me text and I'll give you speech"

Models of Computation
- Semantic interpretations of the abstract syntax
- Different models, different semantics, different execution
- One class: producer/consumer
  - Are actors active? Passive? Reactive?
  - Are communications timed? Synchronized? Buffered?

Directors: Semantics for Component Interaction
- Some directors:
  - CT: continuous time modeling
  - DE: discrete event systems
  - FSM: finite state machines
  - PN: process networks
  - SDF: synchronous dataflow

Polymorphic Actors: Working Across Data Types and Domains
- Recall the add/subtract actor from last time
- Actor data polymorphism:
  - Add numbers (int, float, double, complex)
  - Add strings (concatenation)
  - Add complex types (arrays, records, matrices)
  - Add user-defined types
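Data polymorphism can be illustrated with a single function that relies on the `+` operator of whatever type it receives. The sketch below is a loose Python analogy, not the Kepler add/subtract actor itself; `add_actor` and `Money` are hypothetical names.

```python
from functools import reduce
import operator

def add_actor(*inputs):
    """Add all inputs using the + operator defined by their own type."""
    return reduce(operator.add, inputs)

class Money:
    """Hypothetical user-defined type that supports addition."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        return Money(self.cents + other.cents)

print(add_actor(1, 2.5))             # numbers: 3.5
print(add_actor(1 + 2j, 3))          # complex numbers: (4+2j)
print(add_actor("ab", "cd"))         # strings: abcd
print(add_actor([1], [2, 3]))        # arrays (lists): [1, 2, 3]
print(add_actor(Money(150), Money(50)).cents)  # user-defined type: 200
```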

Polymorphic Actors (continued)
- Actor behavioral polymorphism
  - In the synchronous dataflow model (SDF), add when all inputs have data
  - In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
  - In a time-triggered model, add when the clock ticks
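To make behavioral polymorphism concrete, the sketch below drives the same `add` function with two toy schedulers: one fires only while every input queue holds a token (in the spirit of SDF), the other fires on each clock tick using the most recent value seen on each input. Both schedulers are simplified assumptions for illustration, and the process-network case (a blocking thread per actor) is left out.

```python
from collections import deque

def add(a, b):
    return a + b

def sdf_run(left, right):
    """Dataflow-style: fire only while both input queues hold a token."""
    left, right = deque(left), deque(right)
    out = []
    while left and right:
        out.append(add(left.popleft(), right.popleft()))
    return out

def latest(samples, tick, default=0):
    """Most recent value whose timestamp is at or before the tick."""
    value = default
    for time, sample in samples:
        if time <= tick:
            value = sample
    return value

def time_triggered_run(left, right, ticks):
    """Time-triggered style: fire once per tick, whatever data has arrived."""
    return [add(latest(left, t), latest(right, t)) for t in ticks]

print(sdf_run([1, 2, 3], [10, 20]))                                # [11, 22]
print(time_triggered_run([(0, 1), (2, 5)], [(1, 10)], [0, 1, 2]))  # [1, 11, 15]
```

The actor logic (`add`) is identical in both runs; only the rule for when it fires differs.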

Benefits of Polymorphism
- Some observations:
  - Can define actors without defining input types
  - Can define actors without defining the model of computation
- Why is this useful?
  - Increases reusability
  - But need to ensure that the actor works in every circumstance

Actor Implementation
- Details beyond the scope of this class
- Idea: each actor implements several methods:
  - initialize(): initializes state variables
  - prefire(): indicates if the actor wants to fire
  - fire(): main point of execution
    - Read inputs, produce outputs, read parameter values
  - postfire(): update persistent state, see if execution is complete
  - wrapup()
- Each director calls these methods according to its model
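The lifecycle above can be sketched as follows. This is a Python analogy under stated assumptions, not Kepler's actual Java API: the `Actor` base class, the `Counter` actor, and the `run_sequential` director are hypothetical, and the director shown is the simplest possible one (a single actor fired repeatedly).

```python
class Actor:
    """Hypothetical base class mirroring the five lifecycle methods."""

    def initialize(self):
        pass

    def prefire(self):
        return True   # True means the actor is ready to fire

    def fire(self):
        pass

    def postfire(self):
        return True   # True means execution should continue

    def wrapup(self):
        pass

class Counter(Actor):
    """Emits 0, 1, 2, ... until a limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.outputs = []
        self.finished = False

    def initialize(self):
        self.count = 0                    # initialize state variables

    def fire(self):
        self.outputs.append(self.count)   # produce output

    def postfire(self):
        self.count += 1                   # update persistent state
        return self.count < self.limit    # False once execution is complete

    def wrapup(self):
        self.finished = True

def run_sequential(actor):
    """Toy director: calls the lifecycle methods in a fixed order."""
    actor.initialize()
    while actor.prefire():
        actor.fire()
        if not actor.postfire():
            break
    actor.wrapup()

counter = Counter(limit=3)
run_sequential(counter)
print(counter.outputs)  # [0, 1, 2]
```

A different director could call the same methods on a different schedule, which is what lets one actor run under several models of computation.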

Third-party transfers
- Problem:
  - Many workflows access data from one web service S1 and pass the output on to service S2
  - Current web services do not provide a mechanism to transfer directly from S1 to S2
  - Data is moved around more than necessary

Third-party transfers
[Figure: message diagram among client, S1, S2, and S3, with arrows labeled "ship request", "ship reply", and "execute service"]

Handle-oriented approach
- Idea: instead of shipping the actual data, a web service sends a handle (pointer to the data)
- Web services need support for handles
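A minimal sketch of the handle idea, assuming both services can reach a shared data store: S1 deposits its result and returns only a short handle, and S2 resolves the handle itself, so the client never carries the bulk data. `DataStore`, `service_s1`, and `service_s2` are hypothetical stand-ins, not real web service APIs.

```python
import uuid

class DataStore:
    """Hypothetical shared store that both services can reach."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        handle = uuid.uuid4().hex   # short opaque pointer to the data
        self._blobs[handle] = data
        return handle

    def get(self, handle):
        return self._blobs[handle]

store = DataStore()

def service_s1(query):
    """Produces a large result but returns only a handle to it."""
    result = [query.upper()] * 1000
    return store.put(result)

def service_s2(handle):
    """Resolves the handle itself instead of receiving data from the client."""
    data = store.get(handle)
    return len(data)

handle = service_s1("acgt")   # the client holds 32 characters, not 1000 records
count = service_s2(handle)
print(len(handle), count)     # 32 1000
```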

Scientific Workflow Examples
- Promoter Identification
- Mineral Classification
- Environmental Modeling
- Blast-ClustalW Workflow

Promoter Identification Workflow
- Designed to help a biologist compare a set of genes that exhibit similar expression levels
- Goal: find the set of promoter modules responsible for this behavior
- A promoter is a subsequence of a chromosome that sits close to a gene and regulates its activity

Promoter Identification Workflow
1. Input list of gene IDs
2. For each gene, construct the likely upstream region by finding sequences that significantly overlap the input gene
   1. Use GenBank to get the sequence for each gene ID
   2. Use BLAST to find similar sequences
3. Find transcription factor binding sites in each of the sequences
   1. Run a Transfac search on each sequence to identify binding sites
4. Align them and display
   1. Use ClustalW to align
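The control structure of the steps above (a loop over gene IDs, a nested loop over similar sequences, then a final alignment) can be sketched with the four external tools passed in as plain functions. Everything below the workflow function is a toy stand-in: the sequences are invented, and `toy_fetch`, `toy_similar`, `toy_sites`, and `toy_align` do not call or imitate GenBank, BLAST, Transfac, or ClustalW.

```python
def promoter_workflow(gene_ids, fetch_sequence, find_similar, find_binding_sites, align):
    """Wire the four steps together; the tools are supplied by the caller."""
    annotated = []
    for gene_id in gene_ids:                        # step 1: input gene IDs
        sequence = fetch_sequence(gene_id)          # step 2.1: sequence lookup
        for candidate in find_similar(sequence):    # step 2.2: similar sequences
            sites = find_binding_sites(candidate)   # step 3: binding sites
            annotated.append((gene_id, candidate, sites))
    alignment = align([candidate for _, candidate, _ in annotated])  # step 4
    return alignment, annotated

# Toy stand-ins with invented data, for illustration only.
TOY_SEQUENCES = {"geneA": "GGTATAAC", "geneB": "TATAGC"}

def toy_fetch(gene_id):
    return TOY_SEQUENCES[gene_id]

def toy_similar(sequence):
    return [sequence, sequence[::-1]]

def toy_sites(sequence, motif="TATA"):
    last_start = len(sequence) - len(motif) + 1
    return [i for i in range(last_start) if sequence.startswith(motif, i)]

def toy_align(sequences):
    width = max(len(s) for s in sequences)
    return [s.ljust(width, "-") for s in sequences]

alignment, annotated = promoter_workflow(
    ["geneA", "geneB"], toy_fetch, toy_similar, toy_sites, toy_align
)
for row in alignment:
    print(row)
```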

[Figure: Promoter Identification Workflow]

Mineral Classification Workflow
- Samples are selected from a database holding mineral compositions of igneous rocks
- This data, along with a set of classification diagrams, is fed to a classifier
- The process of classifying samples involves determining the position of sample values in a series of diagrams
- When the location of a point in a diagram of order n is determined, consult the corresponding diagram of order n+1
- Repeat until the terminal level of diagrams is reached

[Figure: Mineral Classification Workflow. Components shown: Rock dataset, Classifier, GetPoint, NextDiagram, Result, Diagrams, DiagramToPolygons, PointInPolygon]
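The classification loop described for this workflow (locate the sample's point in a diagram, then move to the next-level diagram until a terminal one is reached) can be sketched with a standard ray-casting point-in-polygon test. The diagrams, axis names, region shapes, and labels below are invented placeholders and carry no geological meaning.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a rightward ray crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):   # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical diagrams: each region leads to another diagram or a final label.
DIAGRAMS = {
    "level1": {"axes": ("sio2", "alkali"), "regions": [
        ([(0, 0), (50, 0), (50, 20), (0, 20)], ("diagram", "level2")),
        ([(50, 0), (100, 0), (100, 20), (50, 20)], ("label", "class A")),
    ]},
    "level2": {"axes": ("mgo", "feo"), "regions": [
        ([(0, 0), (10, 0), (10, 30), (0, 30)], ("label", "class B")),
        ([(10, 0), (40, 0), (40, 30), (10, 30)], ("label", "class C")),
    ]},
}

def classify(sample, diagrams, start="level1"):
    current = start
    while True:
        diagram = diagrams[current]
        point = tuple(sample[axis] for axis in diagram["axes"])   # GetPoint
        for polygon, (kind, target) in diagram["regions"]:
            if point_in_polygon(point, polygon):                   # PointInPolygon
                break
        else:
            return None          # point falls outside every region
        if kind == "label":
            return target        # terminal level reached
        current = target         # NextDiagram

print(classify({"sio2": 45, "alkali": 5, "mgo": 20, "feo": 10}, DIAGRAMS))  # class C
print(classify({"sio2": 70, "alkali": 5}, DIAGRAMS))                          # class A
```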

CORIE Environmental Observation and Forecasting System
- Daily forecasts of bodies of water throughout the coastal United States
- Simulation program models physical properties of water (e.g., salinity, temperature, velocity)
- Scripts generate images, plots, and animations from raw simulation outputs

Example CORIE Workflows
[Figure: a simulation run (simulation model) produces simulation outputs (>300 MB: salt, temp, vert); data product tasks turn them into data products. Labels shown: model stations, isolines (salt, temp), transects (temp)]

Blast-ClustalW Workflow
- Goal: run BLASTN against DDBJ with a given DNA sequence, then compare alignment regions of similar sequences using ClustalW
1. Run BLAST service with input sequence
2. Run GetEntry to get sequences of each hit
3. Cut off corresponding area
4. Run ClustalW

Some Scientific Workflow Tools
- Kepler
- SCIRun
- Triana
- Taverna
- Some commercial tools:
  - Windows Workflow Foundation
  - Mac OS X Automator

SCIRun
- Computational workbench to interactively design and modify simulations
- Emphasis on visualization
- Scientists can interactively change models and parameters
- Fine-grained dataflow to improve computational efficiency

Some SCIRun Images
[Figures: granular compaction simulation; C-Safe integrated fire/container simulation]

Triana
- Problem-solving environment combining a visual interface and data analysis tools
- Emphasis on P2P and Grid computing environments
- Distributed functionality

[Figure: Triana]

Taverna
- Emphasis on bioinformatics workflows
- Enables coordination of local and remote resources
- Provides a GUI and access to bioinformatics web services
- Records provenance information

[Figure: Taverna]

A brief aside: BioMOBY
- MOBY: Model Organism Bring Your own Database
- Messaging standard to automatically discover and interact with biological data and service providers
- Automatic manipulation of data formats

BioMOBY (continued)
- Ontology of bioinformatics data types
  - Define data syntax
- Create an open API over this ontology
  - Define web service inputs/outputs
  - Register services
- Many clients being deployed
  - Clients for some workflow tools, e.g., Taverna, in development

Executing Kepler on the Grid
- Many challenges to Grid workflows, including:
  - Authentication
  - Data movement
  - Remote service execution
  - Grid job submission
  - Scheduling and resource management
  - Fault tolerance
  - Logging and provenance
  - User interaction
- May be difficult for domain scientists

Example Grid Workflow
- Stage-execute-fetch:
1. Stage local files to remote server
2. Execute computational experiment on remote resource
3. Fetch results back to local environment
[Figure: local server and remote server]
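The stage-execute-fetch pattern can be mimicked on a single machine by letting two temporary directories play the local and remote servers. This is only a sketch of the three steps' ordering; real Grid workflows would involve authentication, transfer tools, and job submission, none of which are modeled here. The function names and the toy "experiment" (upper-casing a file) are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

def stage(local_file, remote_dir):
    """Step 1: copy a local file to the (simulated) remote server."""
    return Path(shutil.copy(local_file, remote_dir))

def execute(remote_input):
    """Step 2: run a stand-in experiment next to the staged input."""
    result = remote_input.with_name("result.txt")
    result.write_text(remote_input.read_text().upper())
    return result

def fetch(remote_file, local_dir):
    """Step 3: copy the result back to the local environment."""
    return Path(shutil.copy(remote_file, local_dir))

def run_demo():
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as remote:
        source = Path(local) / "input.txt"
        source.write_text("acgt")
        staged = stage(source, remote)
        remote_result = execute(staged)
        fetched = fetch(remote_result, local)
        return fetched.read_text()

print(run_demo())  # ACGT
```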

Why not use a script?
- A script does not specify low-level task scheduling and communication
- May be platform-dependent
- Can't be easily reused

Some Kepler Grid Actors
- Copy: copy files from one resource to another during execution
  - Stage actor: local to remote host
  - Fetch actor: remote to local host
- Job execution actor: submit and run a remote job
- Monitoring actor: notify user of failures
- Service discovery actor: import web services from a service repository or web site