Scientific Experiments as Workflows and Scripts. Vanessa Braganholo



The experiment life cycle
[Figure: the experiment life cycle, with the phases Composition (Conception, Reuse), Analysis (Query, Discovery, Visualization), and Execution (Distribution, Monitoring) arranged around Provenance Data.]
Source: Marta Mattoso, IJBPIM, 2010
Vanessa Braganholo E-Science

Agenda
Abstract Representation of Scientific Experiments
Workflows
Scripts
Black Boxes X White Boxes
Workflow Management Systems
Provenance Management Systems for Scripts

Composition: Conceiving Scientific Experiments
Scientists usually design an experiment using a high abstraction level representation that is later mapped into a workflow or script.

Phylogeny Analysis Experiment (Abstract Workflow)
[Figure: abstract workflow with the activities MSA Construction, Evolutionary Model Election, Construction of Phylogenetic Tree, MSA Conversion, MSA Concatenation, Selection of Evolutionary Model, and Construction of Phylogeny Tree.]
Source: MARINHO et al. Deriving scientific workflows from algebraic experiment lines: A practical approach. Future Generation Computer Systems, v. 68, p. 111-127, 2017.

Abstract x Concrete
The abstract workflow is later mapped into a concrete workflow or script.

Source: Freire et al., 2008. Provenance for Computational Tasks: A Survey.

Scientific Workflow
A scientific workflow is a chain of activities organized in the form of a data flow.

Data Flow
In a data flow, the execution is guided by the data.
As soon as all the input data of an activity is available, it starts executing.
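The rule above (an activity starts as soon as all of its inputs are available) can be sketched in a few lines of Python. This is only an illustrative scheduler, not how any particular workflow system is implemented; the function `run_dataflow` and the activity names `read`, `camera`, and `render` are hypothetical.

```python
from collections import deque


def run_dataflow(activities, dependencies):
    """Run activities in data-flow order.

    activities:   dict mapping a name to a callable that receives the
                  outputs of its upstream activities, in declared order.
    dependencies: dict mapping a name to the list of upstream names.
    """
    # Inputs each activity is still waiting for.
    pending = {name: set(dependencies.get(name, [])) for name in activities}
    # Activities with no inputs can start right away.
    ready = deque(sorted(name for name, deps in pending.items() if not deps))
    outputs, order = {}, []
    while ready:
        name = ready.popleft()
        inputs = [outputs[dep] for dep in dependencies.get(name, [])]
        outputs[name] = activities[name](*inputs)
        order.append(name)
        # The new output may complete the inputs of downstream activities.
        for other, deps in pending.items():
            if name in deps:
                deps.discard(name)
                if not deps:
                    ready.append(other)
    return outputs, order


# Hypothetical three-activity workflow: two sources feed one consumer.
activities = {
    "read": lambda: [1, 2, 3],
    "camera": lambda: "front",
    "render": lambda data, cam: f"{len(data)} points from {cam}",
}
dependencies = {"render": ["read", "camera"]}
outputs, order = run_dataflow(activities, dependencies)
```

Here `read` and `camera` have no inputs, so they run first, and `render` runs only after both outputs exist.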

Example
Activities vtkStructuredPointsReader and vtkCamera do not depend on other activities' data, so they can start executing right away.
Source: Freire et al., 2008. Provenance for Computational Tasks: A Survey.

Script
The definition is controversial.
One of the most accepted definitions is that a script language is a programming language that does not require an explicit compilation step.
In other words, scripts are usually written in languages that are interpreted instead of compiled.
Examples: Python, R, MATLAB, etc.

Script
Execution follows a control flow instead of a data flow.
Commands explicitly define the execution order.

    DRY_RUN = ...

    def process(number):
        while number >= 10:
            new_number, str_number = 0, str(number)
            for char in str_number:
                new_number += int(char) ** 2
            number = new_number
        return number

    def show(number):
        if number not in (1, 7):
            return "unhappy number"
        return "happy number"

    n = 2 ** 4000
    final = process(n)
    if DRY_RUN:
        final = 7
    print(show(final))

Source: Pimentel et al., 2016. Fine-grained Provenance Collection over Scripts Through Program Slicing

Running an Experiment
A workflow or script is just part of an experiment.
In order to prove or refute a hypothesis, it is usually necessary to run the workflow or script several times, varying inputs, parameters, and programs.
Each of those runs is called a trial of the experiment.
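As a rough illustration of trials, the sketch below runs one function once per parameter combination and records each run. The names `experiment` and `run_trials` and the parameters `scale` and `offset` are hypothetical stand-ins, not part of any system mentioned in these slides.

```python
from itertools import product


def experiment(scale, offset):
    # Hypothetical stand-in for one run of a workflow or script.
    return scale * 10 + offset


def run_trials(func, parameter_grid):
    """Run func once per parameter combination; each run is one trial."""
    names = sorted(parameter_grid)
    combinations = product(*(parameter_grid[name] for name in names))
    trials = []
    for number, values in enumerate(combinations, start=1):
        params = dict(zip(names, values))
        trials.append(
            {"trial": number, "params": params, "result": func(**params)}
        )
    return trials


trials = run_trials(experiment, {"scale": [1, 2], "offset": [0, 5]})
```

Keeping the parameters next to the result of each trial is the minimum needed to later ask which inputs produced which output.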

New experiment!
Could you check if the precipitation of Rio de Janeiro remains constant across years?

1st Iteration
H1: The precipitation for each month remains constant across years
Data: 2013, 2014 [BDMEP]
Project: experiment.py, precipitation.py, p13.dat, p14.dat

    import numpy as np
    from precipitation import read, sum_by_month
    from precipitation import create_bargraph

    months = np.arange(12) + 1

    d13, d14 = read("p13.dat"), read("p14.dat")

    prec13 = sum_by_month(d13, months)
    prec14 = sum_by_month(d14, months)

    create_bargraph("out.png", months,
                    ["2013", "2014"],
                    prec13, prec14)

Trial
    $ now run -e Tracker experiment.py
Project: experiment.py, precipitation.py, p13.dat, p14.dat, out.png

out.png
Conclusion: Drought in 2014

2nd Iteration
H2: The precipitation for each month remains constant across years if there is no drought
Data: 2012, 2013, 2014 [BDMEP]
Project: experiment.py, precipitation.py, p12.dat, p13.dat, p14.dat

    import numpy as np
    from precipitation import read, sum_by_month
    from precipitation import create_bargraph

    months = np.arange(12) + 1
    d12 = read("p12.dat")
    d13, d14 = read("p13.dat"), read("p14.dat")
    prec12 = sum_by_month(d12, months)
    prec13 = sum_by_month(d13, months)
    prec14 = sum_by_month(d14, months)

    create_bargraph("out.png", months,
                    ["2012", "2013", "2014"],
                    prec12, prec13, prec14)

Trial
    $ now run -e Tracker experiment.py
Project: experiment.py, precipitation.py, p12.dat, p13.dat, p14.dat, out.png

Version Model
[Figure: version model relating the trial history (Trial 1, Trial 2) to the product space (experiment.py, precipitation.py, p12.dat, p13.dat, p14.dat, out.png, provenance) and the version space, with read and write links between trials and file versions.]

out.png
Conclusion: 2012 was similar to 2013

I don't think it's enough to compare just these years. Could you add data from 2015?

The same can be done for workflows
Each of these can originate several trials.
Source: MARINHO, A. Algebraic Experiment Line: an approach to represent scientific experiments based on workflows. PhD Thesis. UFRJ, 2015.

Trials in Workflows

History Graph (VisTrails)
Source: Freire et al., 2008. Provenance for Computational Tasks: A Survey.

Several ways to go from abstract to concrete
When using scripts, there are several ways to go from an abstract workflow to a concrete one:
Activities are implemented one after the other in the script (no functions)
Activities are mapped into functions (each activity becomes one or more functions)
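The two options above can be sketched side by side. The abstract workflow used here (Load, Clean, Summarize) and all names in the code are hypothetical; the point is only that the same activities can appear either as consecutive statements or as functions.

```python
# Hypothetical abstract workflow: Load -> Clean -> Summarize

# Option 1: activities implemented one after the other (no functions).
raw = [3.0, None, 4.5, 1.5]                           # Load
cleaned = [value for value in raw if value is not None]  # Clean
total_inline = sum(cleaned)                           # Summarize


# Option 2: each activity is mapped into a function.
def load():
    return [3.0, None, 4.5, 1.5]


def clean(values):
    return [value for value in values if value is not None]


def summarize(values):
    return sum(values)


total_functions = summarize(clean(load()))
```

In the second form the activity boundaries are explicit in the code, which can make it easier to relate the script back to the abstract workflow.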

Black Box X White Box
In workflow systems, activities are black boxes: what goes in and out is known, but what happens inside is not.
In scripts, activities can be black boxes or white boxes.
An activity in a script can call an external program, and in this case the activity is a black box.
When the function is implemented in Python, it is a white box.
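A small sketch of the distinction, under stated assumptions: the same computation is written once as a plain Python function (white box) and once as a call to an external program (black box). To keep the example portable, a child Python interpreter stands in for the external program; the function names are hypothetical.

```python
import subprocess
import sys


def square_white_box(x):
    # Implemented in Python: every step is visible to a tracer.
    return x * x


def square_black_box(x):
    # Calls an external program: only the argument and the output are
    # visible from the script. A child Python interpreter is used here
    # as a stand-in for any executable.
    completed = subprocess.run(
        [sys.executable, "-c",
         "import sys; print(int(sys.argv[1]) ** 2)", str(x)],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout)


white = square_white_box(12)
black = square_black_box(12)
```

Both calls return the same value, but only the first exposes its internal steps to a tool that inspects the running script.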

Black Box X White Box
Black boxes have implications for provenance analysis.

    DRY_RUN = ...

    def process(number):
        while number >= 10:
            new_number, str_number = 0, str(number)
            for char in str_number:
                new_number += int(char) ** 2
            number = new_number
        return number

    def show(number):
        if number not in (1, 7):
            return "unhappy number"
        return "happy number"

    n = 2 ** 4000
    final = process(n)
    if DRY_RUN:
        final = 7
    print(show(final))

Which values influence the result printed by this script (variable final)?
Source: Pimentel et al., 2016. Fine-grained Provenance Collection over Scripts Through Program Slicing

For the same script:
If DRY_RUN is true, final is set to 7, so it depends only on DRY_RUN.
If not, then final also depends on n.
Source: Pimentel et al., 2016. Fine-grained Provenance Collection over Scripts Through Program Slicing
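The dependency reasoning above can be made concrete with a hand-annotated version of the script. This is not program slicing and not how any provenance tool collects dependencies; the set `depends_on` is filled in manually, and the wrapper `run` is a hypothetical helper added only for illustration.

```python
def process(number):
    # Same digit-squaring loop as in the script on the slide.
    while number >= 10:
        new_number, str_number = 0, str(number)
        for char in str_number:
            new_number += int(char) ** 2
        number = new_number
    return number


def run(dry_run, n=2 ** 4000):
    # The test on DRY_RUN always influences which value final receives.
    depends_on = {"DRY_RUN"}
    final = process(n)
    if dry_run:
        final = 7              # the value derived from n is overwritten
    else:
        depends_on.add("n")    # the value derived from n is kept
    return final, depends_on


final_dry, deps_dry = run(True)
final_real, deps_real = run(False)
```

With `dry_run` true the returned set contains only `DRY_RUN`; otherwise it also contains `n`.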

Implications of Black Boxes
If process(number) were a black box, anything could happen inside it.
It could, for example, read a file that influences the value returned by the function, so dependencies would be missed.
This is a common case of implicit provenance, which is missed by several provenance capturing approaches.
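A minimal sketch of such a hidden dependency, assuming a hypothetical file `offset.txt` that the function reads without receiving it as an argument. The file name, the helper `write_offset`, and this variant of `process` are all made up for illustration.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
HIDDEN_FILE = os.path.join(workdir, "offset.txt")  # hypothetical file


def process(number):
    # Seen from outside, only `number` goes in and one value comes out.
    # The file read below is an implicit input.
    with open(HIDDEN_FILE) as handle:
        offset = int(handle.read())
    return number + offset


def write_offset(value):
    with open(HIDDEN_FILE, "w") as handle:
        handle.write(str(value))


write_offset(1)
first = process(10)
write_offset(5)
second = process(10)
```

The two calls receive the same argument but return different values, so an approach that records only arguments and return values would miss the dependency on the file.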

Implicit Provenance
[Figure: a workflow with the activities UnzipFile, ApplyFilter, CutInterestingArea, and IdentifyPhenomenon in the SWfMS/Workflow domain, with the parameters c:\wksp\imgcut.bmp and {circle, 2}, and the files Img.bmp and ImgCut.bmp in the OS domain.]
Sources: Neves et al., 2017. Managing Provenance of Implicit Data Flows in Scientific Experiments. Marinho et al., 2011. Challenges in managing implicit and abstract provenance data: experiences with ProvManager.

Implicit Provenance
OS-based approaches are able to capture this kind of provenance.
Other approaches need special components to handle it (e.g. PROVMONITOR).

Overview of Existing Systems
Workflow Management Systems
Provenance Management Systems for Scripts

Workflow Management Systems
VisTrails
Taverna
Kepler
Swift
SciCumulus
Pegasus

VisTrails
Visual drag and drop interface for workflow composition
Captures history of changes in the workflow structure
Allows comparing results side-by-side
Focus on visualization


Taverna
Focus on bioinformatics
Several ready-to-use bioinformatics services
Drag and drop graphical interface for workflow composition
http://www.taverna.org.uk/


Kepler
Drag and drop graphical interface for workflow composition
Different directors rule how the workflow is executed
Kepler workflows are not restricted to DAGs
https://kepler-project.org/


Swift, SciCumulus and Pegasus
Focus on high performance
Workflows are specified in XML (no graphical interface) in SciCumulus and Pegasus
In Swift, workflows are specified as scripts in a specific language
http://swift-lang.org/main/index.php
https://scicumulusc2.wordpress.com/
https://pegasus.isi.edu/

Provenance Management Systems for Scripts
noWorkflow captures provenance for Python scripts
RDataTracker captures provenance for R scripts
Sumatra captures provenance for Python, R and MATLAB scripts

Exercise
Choose one of the systems presented in today's class and search the Web to find:
The format in which provenance is stored
Whether it exports provenance in the PROV format

Provenance of these slides
MARINHO, A.; WERNER, C. M. L.; MATTOSO, M. L. Q.; BRAGANHOLO, V.; MURTA, L. G. P. Challenges in managing implicit and abstract provenance data: experiences with ProvManager. In: USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2011, Heraklion, Crete, Greece, p. 1-6.
MATTOSO, M. L. Q.; WERNER, C. M. L.; TRAVASSOS, G. H.; BRAGANHOLO, V.; MURTA, L. G. P.; OGASAWARA, E.; OLIVEIRA, D.; CRUZ, S.; MARTINHO, W. Towards Supporting the Life Cycle of Large Scale Scientific Experiments. International Journal of Business Process Integration and Management (Print), v. 5, p. 79-92, 2010.
NEVES, V. C.; OLIVEIRA, D.; OCANA, K. A.; BRAGANHOLO, V.; MURTA, L. G. P. Managing Provenance of Implicit Data Flows in Scientific Experiments. ACM Transactions on Internet Technology, 2017.
PIMENTEL, J. F. N.; FREIRE, J.; BRAGANHOLO, V.; MURTA, L. G. P. Tracking and Analyzing the Evolution of Provenance from Scripts. In: International Provenance and Annotation Workshop (IPAW), 2016, Washington, D.C., v. 9672, p. 16-28.
PIMENTEL, J. F. N.; FREIRE, J.; MURTA, L. G. P.; BRAGANHOLO, V. Fine-grained Provenance Collection over Scripts Through Program Slicing. In: International Provenance and Annotation Workshop (IPAW), 2016, Washington, D.C., v. 9672, p. 199-203.