A(nother) Vision of pPOD Data Integration!?

Scientific Workflows: A(nother) Vision of pPOD Data Integration!?
Bertram Ludäscher, Shawn Bowers, Timothy McPhillips, Dave Thau
Dept. of Computer Science & UC Davis Genome Center, University of California, Davis

Overview
- Scientific Workflow: Overview, Vision
- Examples using Kepler (from NSF/ITR SEEK)
- Provenance in Scientific Workflows: from single runs to project histories
- pPOD & Kepler: next steps

Different Kinds of (Data) Integration
Traditional Information (& Data) Integration
- syntactic & structural heterogeneities, schema mappings, schema matching, query rewriting (parsing, matching, [G]LAV, Chase [+IC], Resolution)
- dealing with fundamentally the same (largely overlapping) information
- find ways to integrate different representations
Scientific Information Integration (SII)
- includes the above, but often deals with combining fundamentally different information
- more than one way to combine and integrate the data
- integration invokes scientific theories and models that cannot be inferred from data, schema, and ontologies alone
- joining of data and chaining of analysis steps happens in the scientist's head (y := f(x); z := g(x,y);)
- make these analysis pipelines first-class citizens
- scientific workflows can provide an end-to-end framework
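The idea of turning the chain y := f(x); z := g(x,y) into a first-class object can be illustrated with a minimal Python sketch. This is not Kepler code: the Step and Pipeline classes and the concrete functions f and g are assumptions made up for illustration. The point is only that once the steps are recorded as data, the pipeline can be run, described, and shared rather than living in the scientist's head.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One analysis step: output := func(*inputs). Hypothetical structure."""
    output: str
    inputs: Tuple[str, ...]
    func: Callable[..., object]


@dataclass
class Pipeline:
    """An explicit, inspectable chain of analysis steps."""
    steps: List[Step] = field(default_factory=list)

    def add(self, output: str, inputs: List[str], func: Callable[..., object]) -> "Pipeline":
        self.steps.append(Step(output, tuple(inputs), func))
        return self

    def run(self, **initial: object) -> Dict[str, object]:
        # Evaluate the steps in order, looking up each input by name.
        values: Dict[str, object] = dict(initial)
        for step in self.steps:
            args = [values[name] for name in step.inputs]
            values[step.output] = step.func(*args)
        return values

    def describe(self) -> List[str]:
        # The pipeline documents itself, one assignment per step.
        return [f"{s.output} := {s.func.__name__}({', '.join(s.inputs)})" for s in self.steps]


# Placeholder analysis functions (assumptions, not from the slides).
def f(x):
    return x + 1


def g(x, y):
    return x * y


pipeline = Pipeline().add("y", ["x"], f).add("z", ["x", "y"], g)
result = pipeline.run(x=3)
print(pipeline.describe())
print(result)
```

Because the steps are plain data, the same pipeline object can be rerun on other inputs or serialized for archival, which is the property the slide asks of scientific workflows.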

Types of Information Integration
- Conventional information integration: schema-based, view-based, at the instance level
  [Figure: federated schema architecture with the layers Federated schema, Export schema, Component schema, Local schema, Data Source]
- Spatial (co-)registration/overlay of different data from 2D, 3D, 4D (x,y,z,t), (4+n)D: GIS++
- Extended DI approaches using ontologies, controlled vocabularies, metadata, annotations
- Scientific Information Integration = data + process/application integration
- scientific workflows can include all the others, and statistics, data mining, visualization

Scientific Workflows = Cyberinfrastructure UPPERWARE
[Figure: cyberinfrastructure layers: Upperware, Upper Middleware, Middleware, Underware]

Science Environment for Ecological Knowledge (SEEK)
EcoGrid (distributed data network)
- Access distributed environmental, ecological, and systematics data
- Enable data sharing & reuse
- Enhance data discovery at global scales
Kepler
- Design, reuse, and execute scientific analyses
- Enable communication and collaboration for analysis
- Enable reuse of analytical components and analyses
- Integrated data access
SMS / OBOE / Taxonomic concept services
- Data discovery and integration
- Addressing a variety of semantic data heterogeneity issues
- Ontology and controlled-vocabulary development
- Semantic data and actor annotations
- Resolve taxonomic ambiguities

Kepler Data Access via the EcoGrid
- Lightweight API for providers & clients
- Implemented via web services
- Common metadata query syntax
- Common mechanism for accessing ecological (KNB), museum specimen (DiGIR), environmental (SRB), and geological (GEON) data
- Catalog-based integration, NOT a single CDM: leave the integration to the workflow designer!

Scientific Workflow
- Capture how a scientist works with data and analytical tools: data access, transformation, analysis, visualization
- Possible worldview: dataflow-oriented
Scientific workflow (wf) benefits (compare w/ script-based approaches):
- wf automation
- wf & component reuse
- wf design, documentation
- wf archival, sharing
- built-in concurrency (task-, pipeline-parallelism)
- built-in provenance support
- distributed execution (Grid) support

Kepler Collaboration (alive and evolving)
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from: SEEK, SciDAC SDM, Ptolemy, GEON, ROADNet, Resurgence, AToL: CIPRES, POD
- Goals: create powerful analytical tools that are useful across disciplines (Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy)
[Logos: Ptolemy II, Phyl-O'Data (POD), Natural Diversity Discovery Project]

Basic Kepler User Interface
[Screenshot labels: Tool Bar, Quick Search, Workflow Canvas, Actor Libraries, Thumbnail Navigation]

Kepler Data Access via the EcoGrid
- Data Quick Search tab
- Metadata keyword search
- Access multiple EcoGrid sources
- Return data sets as actors to drag and drop onto the canvas

Input/Output Semantic Annotation
Actor input/output port annotation:
- Each port can be annotated with multiple classes from multiple ontologies
- Annotations are stored with actor metadata (MOML)
- Actors can be discovered, validated, etc., via their semantic types
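How port annotations can support discovery and validation is sketched below in Python. Everything here is an assumption for illustration: the toy ontology in SUBCLASS_OF, the Actor class, and the actor names are hypothetical and do not reflect Kepler's MOML format or its actual annotation API. The sketch only shows the general technique of matching ports through subclass reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical toy ontology: class -> direct superclass.
SUBCLASS_OF = {
    "DNASequenceAlignment": "SequenceAlignment",
    "SequenceAlignment": "Dataset",
    "PhylogeneticTree": "Dataset",
}


def ancestors(cls: str) -> Set[str]:
    """Return cls together with all of its superclasses."""
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        seen.add(cls)
    return seen


@dataclass
class Actor:
    """An actor whose ports carry sets of ontology classes (semantic types)."""
    name: str
    inputs: Dict[str, Set[str]] = field(default_factory=dict)
    outputs: Dict[str, Set[str]] = field(default_factory=dict)


def accepts(actor: Actor, data_class: str) -> bool:
    """Discovery: does any input port accept data of this class (or a superclass)?"""
    wanted = ancestors(data_class)
    return any(annotations & wanted for annotations in actor.inputs.values())


def compatible(upstream: Actor, out_port: str, downstream: Actor, in_port: str) -> bool:
    """Validation: is some produced class a subclass of some expected class?"""
    produced: Set[str] = set()
    for cls in upstream.outputs[out_port]:
        produced |= ancestors(cls)
    return bool(produced & downstream.inputs[in_port])


# Hypothetical actors for illustration.
infer = Actor("TreeInference",
              inputs={"alignment": {"SequenceAlignment"}},
              outputs={"tree": {"PhylogeneticTree"}})
viewer = Actor("TreeViewer", inputs={"tree": {"PhylogeneticTree"}})

print(accepts(infer, "DNASequenceAlignment"))
print(compatible(infer, "tree", viewer, "tree"))
```

A port annotated with SequenceAlignment accepts a DNASequenceAlignment because the check walks up the subclass chain; a real system would use an ontology reasoner instead of this single-inheritance dictionary.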

Actor Annotations for Indexing & Classification
- New actors can be annotated and indexed into the component library (e.g., specializing generic actors)
- Existing components can also be revised, annotated, and indexed (hiding previous versions)
- Quick search leverages metadata, including annotations & ontologies

Kepler Demo: Building a simple workflow
Select actors from the Kepler actor library:
- Local or remote actors
- View actor metadata/documentation (not shown)
- Drag desired actor to canvas
- Connect actor ports
[Figure label: other actor examples]

Kepler Demo: Building a simple workflow
Select input data:
- Shown here is an EcoGrid for bacterial abundance
- Connect data actors to workflow inputs
[Figure label: many ways to import data]

Kepler Demo: Building a simple workflow
Using EcoGrid data sources:
- Display metadata (EML)
- Query data via SQL/QBE interface, even if it is a tab-delimited file (see above)

Kepler Demo: Building a simple workflow
- Run the workflow
- Also set parameters, select & configure director, run window, etc.

SEEK Ecological Niche Modeling Workflows
- Complex workflows with many levels of nesting (sub-workflows)
- Predict species locations from presence data and environmental layers
- Designed to support different prediction algorithms (reusability)
- Currently uses GARP (Genetic Algorithm for Rule-Set Prediction)
[Figure label: n levels down]

Drilling down: Calculate Best Rulesets
[Figure label: climate change data]

SEEK Ecological Niche Modeling Workflows
- Includes a number of workflows for automating special-purpose data-integration tasks
- Integration of multiple data sets and data types
- Workflows for local caching of data, format and content conversions
- Rescale grid data, adjust resolutions and extents, merge grids
- Integrate Hydro1K North and South American data, including warp/projection, format conversion, etc.
[Figure label: rescaling]
pPOD @ NESCENT, Sept 07

The Joy of Exa-Scale Cyberinfrastructure
- Are we working at the right level of abstraction?
- Are we optimizing the right thing?
- Optimize human cycles, not just CPU cycles! (cf. John McCarthy, of AI/LISP fame)
- Make data & scientific workflows effectively (re-)usable for scientists
- Make workflows first-class, shareable knowledge artifacts
- Support user-oriented provenance queries

(Data) Provenance & Scientific Workflows
(Data) provenance: data lineage, processing history
- Query the lineage of a data product: what data it is derived from, and how
- Evaluate the results of a workflow: is the approach correct?
- Reuse intermediate or final products of one workflow in another
- Explain unexpected results
- Discover all results derived from a given data set
- Accurately prepare the methods section of a publication
- Archive scientific results in a repository
- Replicate the results reported by another researcher

Inferring a phylogenetic tree from disparate data
[Figure labels: Datasets (aligned DNA sequences, discrete morphological data, continuous characters); Actors (maximum likelihood tree (DNA), maximum parsimony tree, maximum likelihood tree (continuous characters)); Integrate; Consensus Tree(s); Provenance Store]

Scientific provenance questions (single run)
- What DNA sequences were input (phylogenetic trees were output) by the workflow?
- What intermediate phylogenetic trees were created?
- Which actor created this phylogenetic tree?
- Which input sequences does this consensus tree depend on?
- Which input sequences were not used to derive any consensus tree?
- What sequence alignment (key intermediate data) was used to infer this tree?

A (very) simple phylogenetics workflow

Data lineage + processing history for a consensus tree
[Figure: provenance graph; recoverable edge labels include TextFileReader:1, NexusFileParser:1, PhylipPars:1, and PhylipConsense:1]
- Derivation (processing history) of a data item in a scientific workflow run (a DAG)
- Nodes = data items the workflow run operated on or created
- Edges = "was directly used in", labeled by the actor invocation that performed this computation
- Different (emerging) provenance extensions to Kepler
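The graph model described here (data items as nodes, "was directly used in" edges labeled by actor invocations) lends itself to simple lineage queries. The Python sketch below is an illustration under stated assumptions, not Kepler's provenance store: the edge list, the data item names, and the helper functions are invented, and invocation labels such as PhylipPars:1 and PhylipConsense:1 merely echo the style of the slide's figure.

```python
from collections import defaultdict

# Hypothetical provenance edges: (input data item, actor invocation, output data item).
EDGES = [
    ("nexus_file", "TextFileReader:1", "nexus_text"),
    ("nexus_text", "NexusFileParser:1", "alignment"),
    ("alignment", "PhylipPars:1", "tree_a"),
    ("alignment", "PhylipPars:1", "tree_b"),
    ("tree_a", "PhylipConsense:1", "consensus"),
    ("tree_b", "PhylipConsense:1", "consensus"),
    ("unused_file", "TextFileReader:2", "unused_text"),
]


def build_parents(edges):
    """Index the DAG by output item: item -> [(input item, invocation), ...]."""
    parents = defaultdict(list)
    for src, invocation, dst in edges:
        parents[dst].append((src, invocation))
    return parents


def lineage(item, parents):
    """All data items that the given item transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for src, _ in parents.get(current, []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


def creators(item, parents):
    """Which actor invocation(s) produced this item?"""
    return {invocation for _, invocation in parents.get(item, [])}


def workflow_inputs(edges):
    """Items that were used but never produced by any invocation."""
    produced = {dst for _, _, dst in edges}
    return {src for src, _, _ in edges if src not in produced}


parents = build_parents(EDGES)
depends_on = lineage("consensus", parents)
unused_inputs = workflow_inputs(EDGES) - depends_on

print(sorted(depends_on))
print(creators("consensus", parents))
print(unused_inputs)
```

The three printed results correspond to three of the single-run questions: what a consensus tree depends on, which actor created it, and which inputs were not used to derive it.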

Provenance: Single Run

Provenance: Multiple Runs

Conceptual workflows: series of subworkflows

Manual, data visualization, and quality assessment steps are interleaved with automated steps

Projects comprise multiple conceptual workflows

Workflows are run multiple times with different parameter settings

How Kepler is used today
- Aware of only one workflow, one run at a time
- Data, workflows, and provenance records reside outside the system between runs
- Users must perform most data and provenance management outside of the system
- Workflows must be modified or reconfigured to operate on different input data
[Figure labels: p1, p2, p3]

Support for project folders & histories
- Data is registered
- Project folders allow users to organize data.
- Project history records and depicts past workflow runs and the flow of data between runs.
- Data is staged from the project folders (and project history).
- Run outputs appear in the project history (along with the input) if the run is committed.
- All or part of the output of a run may be used to update the project folders.
- Workflows can be applied to different data sets without ...

Project history relieves need to perform data versioning via project folders
- Recomputed data can replace old versions, be stored elsewhere in folders, or simply be left in the project history.
- Replaced data are always accessible via project history.
- Provenance queries provide access to all data regardless of location.
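The interplay of project folders and project history can be sketched in a few lines of Python. This is an assumed, simplified model for illustration only: the Run and Project classes, their method names, and the infer_tree placeholder are hypothetical and are not the design from the slides or any Kepler API. It shows data being staged from folders, committed runs entering the history, and replaced data staying reachable through that history.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, Iterable, List


@dataclass
class Run:
    """Record of one workflow run: what was staged in and what came out."""
    run_id: int
    workflow: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]


@dataclass
class Project:
    folders: Dict[str, str] = field(default_factory=dict)  # current data, by name
    history: List[Run] = field(default_factory=list)       # committed runs only
    _ids: count = field(default_factory=lambda: count(1))

    def execute(self, workflow: str, func: Callable[[Dict[str, str]], Dict[str, str]],
                input_names: Iterable[str]) -> Run:
        # Stage inputs from the project folders, then run the workflow on them.
        staged = {name: self.folders[name] for name in input_names}
        return Run(next(self._ids), workflow, staged, func(staged))

    def commit(self, run: Run) -> None:
        # Only committed runs appear in the project history.
        self.history.append(run)

    def update_folders(self, run: Run, names: Iterable[str]) -> None:
        # All or part of a run's output may replace data in the folders.
        for name in names:
            self.folders[name] = run.outputs[name]

    def versions(self, name: str) -> List[str]:
        # Every recorded version of a data item, oldest first.
        return [run.outputs[name] for run in self.history if name in run.outputs]


def infer_tree(staged: Dict[str, str]) -> Dict[str, str]:
    """Placeholder workflow: pretends to infer a tree from an alignment."""
    return {"tree": f"tree({staged['alignment']})"}


project = Project(folders={"alignment": "ACGT"})
first = project.execute("infer", infer_tree, ["alignment"])
project.commit(first)
project.update_folders(first, ["tree"])

project.folders["alignment"] = "ACGA"  # new input data, same unmodified workflow
second = project.execute("infer", infer_tree, ["alignment"])
project.commit(second)
project.update_folders(second, ["tree"])

print(project.folders["tree"])
print(project.versions("tree"))
```

After the second run the folder holds only the newest tree, while versions("tree") still returns both, which is the sense in which the history relieves the folders of versioning duties.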

Managing workflow evolution
- Workflow library is not a flat list of available workflows.
- Workflows evolve throughout a project, and previous versions must be retained for reference and for further use.
- Workflow evolution view complements run history.
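One way to picture a workflow library that is not a flat list is a version tree in which every version remembers its parent. The Python sketch below is an assumption for illustration; the WorkflowLibrary class, its methods, and the version names are hypothetical and not taken from Kepler.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WorkflowVersion:
    version_id: str
    parent: Optional[str]  # None for the original workflow
    note: str


class WorkflowLibrary:
    """Keeps every workflow version and the derivation links between them."""

    def __init__(self) -> None:
        self._versions: Dict[str, WorkflowVersion] = {}

    def add(self, version_id: str, parent: Optional[str] = None, note: str = "") -> None:
        if parent is not None and parent not in self._versions:
            raise KeyError(f"unknown parent version: {parent}")
        self._versions[version_id] = WorkflowVersion(version_id, parent, note)

    def ancestry(self, version_id: str) -> List[str]:
        """The chain from a version back to the original workflow."""
        chain: List[str] = []
        current: Optional[str] = version_id
        while current is not None:
            chain.append(current)
            current = self._versions[current].parent
        return chain

    def children(self, version_id: str) -> List[str]:
        """Versions derived directly from the given one."""
        return sorted(v.version_id for v in self._versions.values() if v.parent == version_id)


library = WorkflowLibrary()
library.add("v1", note="initial tree-inference workflow")
library.add("v2", parent="v1", note="changed parameter settings")
library.add("v3", parent="v1", note="alternative inference method")
library.add("v4", parent="v2", note="added consensus step")

print(library.ancestry("v4"))
print(library.children("v1"))
```

Old versions are never overwritten, so a past run can always be matched with the exact workflow version that produced it.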

Summary & Next Steps
Kepler today
- used in ecoinformatics (SEEK), ChIP-chip, geoinformatics
- data catalog, data grid
- workflows for data integration
- data annotation and semantic extensions
Kepler next steps (planned deliverables):
PHYLOGENETIC SCIENTIFIC WORKFLOWS
- Develop use cases / conceptual workflows: tree construction (understood); post-tree analysis, supertree/matrix construction (exciting :); community-driven!
- Implement subset of those in Kepler
- Generate actor library targeting community use cases
PROJECT HISTORIES SUPPORT (cf. DILS'07 paper)
- Extend use cases to exploit project histories / provenance
- Implement those
pPOD REPOSITORY (Orchestra!?)
- Extend Kepler to use pPOD data repository

Consilience: The Unity of Knowledge (E. O. Wilson)
"Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." -- E. O. Wilson
- e-Science, Cyberinfrastructure: mechanisms to make progress
- Scientific Workflows: crucial elements to get the most mileage out of CI to fuel e-Science, accelerating knowledge discovery
- Identify the real bottlenecks in this quest!
"We must know, we will know." -- David Hilbert
"Whoever has visions should go see a doctor." (Wer Visionen hat, sollte zum Arzt gehen) -- Helmut Schmidt on Willy Brandt

Questions?
kepler-project.org

References
Niche Modeling
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological niche modeling using the Kepler workflow system. Workflows for e-Science: Scientific Workflows for Grids, Springer-Verlag, 2007.
- Ecological Niche Modeling in Kepler. User Manual. Draft, 2007.
Semantic Annotation
- S Bowers, B Ludaescher. A calculus for propagating semantic annotations through scientific workflow queries. QLQP, 2006.
- S Bowers, B Ludaescher. Actor-oriented design of scientific workflows. ER, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating semantics in scientific workflow authoring. SSDBM, 2005.
- S Bowers, B Ludaescher. An ontology-driven framework for data transformation in scientific workflows. DILS, 2004.

References
Provenance in Workflows
- S Bowers, T McPhillips, M Wu, B Ludaescher. Project histories: Managing data provenance across collection-oriented scientific workflow runs. DILS, 2007.
- S Bowers, T McPhillips, B Ludaescher. Provenance in collection-oriented workflows. Concurrency and Computation: Practice and Experience, 2007.
- B Ludaescher, N Podhorszki, I Altintas, S Bowers, T McPhillips. From computation models to models of provenance: The RWS approach. Concurrency and Computation: Practice and Experience, 2007.

Additional Related Publications
Semantic Type Annotation
- S Bowers, B Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
- S Bowers, B Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
- B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
- S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
- S Bowers, K Lin, B Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
- S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
- S Bowers, B Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. International Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
Workflow Design and Modeling
- T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2006.
- S Bowers, T McPhillips, B Ludaescher, S Cohen, SB Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW), LNCS, 2006.
- S Bowers, B Ludaescher, AHH Ngu, T Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
- S Bowers, B Ludaescher. Actor-Oriented Design of Scientific Workflows. International Conference on Conceptual Modeling (ER), LNCS, 2005.
- T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
Kepler
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer-Verlag, to appear.
- W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
- S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.