A(nother) Vision of ppod Data Integration!?

Scientific Workflows: A(nother) Vision of ppod Data Integration!?
Bertram Ludäscher, Shawn Bowers, Timothy McPhillips, Dave Thau
Dept. of Computer Science & UC Davis Genome Center
University of California, Davis

Overview
- Scientific Workflow: Overview
- Vision
- Examples using Kepler (from NSF/ITR SEEK)
- Provenance in Scientific Workflows: from single runs to project histories
- ppod & Kepler: next steps

Different Kinds of (Data) Integration
- Traditional Information (& Data) Integration
  - syntactic & structural heterogeneities, schema mappings, schema matching, query rewriting (parsing, matching, [G]LAV, Chase [+IC], Resolution)
  - dealing with fundamentally the same (largely overlapping) information
  - find ways to integrate different representations
- Scientific Information Integration (SII)
  - includes the above, but often deals with combining fundamentally different information
  - more than one way to combine and integrate the data
  - integration invokes scientific theories and models that cannot be inferred from data, schemas, and ontologies alone
  - joining of data and chaining of analysis steps happen in the scientist's head (y := f(x); z := g(x,y);)
  - make these analysis pipelines first-class citizens
  - scientific workflows can provide an end-to-end framework
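
The chain y := f(x); z := g(x,y) above can be written down as data rather than kept in the scientist's head. The following is a minimal sketch of that idea, not Kepler code: `Step`, `run_pipeline`, and the bodies of `f` and `g` are hypothetical names and placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One analysis step: computes `output` from the named `inputs`."""
    output: str
    inputs: Tuple[str, ...]
    func: Callable[..., Any]


def run_pipeline(steps: List[Step], initial: Dict[str, Any]) -> Dict[str, Any]:
    """Run the steps in order, binding each result to its output name."""
    values = dict(initial)
    for step in steps:
        args = [values[name] for name in step.inputs]
        values[step.output] = step.func(*args)
    return values


# Hypothetical stand-ins for the analyses f and g.
def f(x):
    return x * 2


def g(x, y):
    return x + y


# The pipeline itself is an ordinary value: it can be stored, shared, and rerun.
PIPELINE = [
    Step("y", ("x",), f),
    Step("z", ("x", "y"), g),
]

result = run_pipeline(PIPELINE, {"x": 3})
print(result)
```

Because the pipeline is a list of `Step` records, it can be inspected, archived, or rerun on different inputs without editing the analysis functions.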

Types of Information Integration
- Conventional information integration: schema-based, view-based, at the instance level
  [Figure: federated database architecture; layers from bottom to top: Data Source, Local schema, Component schema, Export schema, Federated schema]
- Spatial (co-)registration/overlay of different data from 2D, 3D, 4D (x,y,z,t), (4+n)D: GIS++
- Extended DI approaches using ontologies, controlled vocabularies, metadata, annotations
- Scientific Information Integration = data + process/application integration
  - scientific workflows can include all the others, plus statistics, data mining, visualization

Scientific Workflows = Cyberinfrastructure UPPERWARE
[Figure: cyberinfrastructure layers, top to bottom: Upperware, Upper Middleware, Middleware, Underware]
Science Environment for Ecological Knowledge (SEEK)

Science Environment for Ecological Knowledge (SEEK)
- EcoGrid
  - Access distributed environmental, ecological, and systematics data
  - Enable data sharing & reuse
  - Enhance data discovery at global scales
  - Distributed data network
- Kepler
  - Design, reuse, and execute scientific analyses
  - Enable communication and collaboration for analysis
  - Enable reuse of analytical components and analyses
  - Integrated data access
- SMS / OBOE / Taxonomic concept services
  - Data discovery and integration
  - Addressing a variety of semantic data heterogeneity issues
  - Ontology and controlled-vocabulary development
  - Semantic data and actor annotations
  - Resolve taxonomic ambiguities

Kepler Data Access via the EcoGrid
- Lightweight API for providers & clients
  - Implemented via web services
  - Common metadata query syntax
- Common mechanism for accessing ecological (KNB), museum specimen (DiGIR), environmental (SRB), and geological (GEON) data
- Catalog-based integration, NOT a single CDM
  - leave the integration to the workflow designer!

Scientific Workflow
- Capture how a scientist works with data and analytical tools
  - data access, transformation, analysis, visualization
  - possible worldview: dataflow-oriented
- Scientific workflow (wf) benefits (compare w/ script-based approaches):
  - wf automation
  - wf & component reuse
  - wf design, documentation
  - wf archival, sharing
  - built-in concurrency (task-, pipeline-parallelism)
  - built-in provenance support
  - distributed execution (Grid) support

Kepler Collaboration (alive and evolving)
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from: SEEK, SciDAC SDM, Ptolemy, GEON, ROADNet, Resurgence, AToL: CIPRES, POD (Phyl-O'Data), Natural Diversity Discovery Project
- Goals
  - Create powerful analytical tools that are useful across disciplines: Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy

Basic Kepler User Interface
[Screenshot labels: Tool Bar, Quick Search, Workflow Canvas, Actor Libraries, Thumbnail Navigation]

Kepler Data Access via the EcoGrid
- Data Quick Search Tab
- Metadata Keyword Search
- Access Multiple EcoGrid Sources
- Return Data Sets as Actors to Drag-Drop to Canvas

Input/Output Semantic Annotation
- Actor input/output port annotation:
  - Each port can be annotated with multiple classes from multiple ontologies
  - Annotations are stored with actor metadata (MOML)
- Actors can be discovered, validated, etc., via their semantic types
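
To make the idea of discovering and validating actors via semantic types concrete, here is a small sketch. It does not use Kepler's API or the MOML format: the ontology, the class names, and the `Port`, `Actor`, `compatible`, and `find_consumers` helpers are all assumptions made for illustration. The check treats an output port as compatible with an input port when every class the input requires is matched by an equal or more specific class on the output.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

# Hypothetical ontology: class name -> parent class (None for a root class).
ONTOLOGY: Dict[str, Optional[str]] = {
    "Measurement": None,
    "Abundance": "Measurement",
    "BacterialAbundance": "Abundance",
    "Temperature": "Measurement",
}


def is_subclass(cls: str, ancestor: str) -> bool:
    """True if `cls` equals `ancestor` or descends from it in ONTOLOGY."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = ONTOLOGY.get(current)
    return False


@dataclass(frozen=True)
class Port:
    name: str
    semantic_types: FrozenSet[str]  # a port may carry several ontology classes


@dataclass
class Actor:
    name: str
    inputs: List[Port] = field(default_factory=list)
    outputs: List[Port] = field(default_factory=list)


def compatible(out_port: Port, in_port: Port) -> bool:
    """Every class required by the input is satisfied by some output class."""
    return all(
        any(is_subclass(have, need) for have in out_port.semantic_types)
        for need in in_port.semantic_types
    )


def find_consumers(out_port: Port, actors: List[Actor]) -> List[str]:
    """Names of actors with at least one input port that accepts `out_port`."""
    return [a.name for a in actors if any(compatible(out_port, p) for p in a.inputs)]


source = Port("data", frozenset({"BacterialAbundance"}))
library = [
    Actor("AbundancePlotter", inputs=[Port("in", frozenset({"Abundance"}))]),
    Actor("TemperatureFilter", inputs=[Port("in", frozenset({"Temperature"}))]),
]
consumers = find_consumers(source, library)
print(consumers)
```

The same subclass test can serve both purposes named on the slide: searching a library for actors that accept a given data set, and validating a connection before a workflow is run.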

Actor Annotations for Indexing & Classification
- New actors can be annotated and indexed into the component library (e.g., specializing generic actors)
- Existing components can also be revised, annotated, and indexed (hiding previous versions)
- Quick search leverages metadata, including annotations & ontologies

Kepler Demo: Building a simple workflow
- Select actors from Kepler actor library:
  - Local or remote actors
  - View actor metadata/documentation (not shown)
  - Drag desired actor to canvas
  - Connect actor ports
- other actor examples

Kepler Demo: Building a simple workflow
- Select input data:
  - Shown here is an EcoGrid for bacterial abundance
  - Connect data actors to workflow inputs
- many ways to import data

Kepler Demo: Building a simple workflow
- Using EcoGrid data sources:
  - Display metadata (EML)
  - Query data via SQL/QBE interface, even if it is a tab-delimited file (see above)
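
Querying a tab-delimited file through SQL can be illustrated outside Kepler with the Python standard library. This is only a sketch of the general technique, not how the EcoGrid data actors are implemented; the table name `samples`, the column names, and the values are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited data set (column names and values are made up).
TSV = "site\tdepth_m\tabundance\nA\t1\t120\nA\t5\t80\nB\t1\t200\n"


def load_tsv(conn: sqlite3.Connection, table: str, text: str) -> None:
    """Create `table` from the header row and insert the remaining rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)


conn = sqlite3.connect(":memory:")
load_tsv(conn, "samples", TSV)

# Values are loaded as text, so cast before aggregating.
rows = conn.execute(
    "SELECT site, SUM(CAST(abundance AS INTEGER)) "
    "FROM samples GROUP BY site ORDER BY site"
).fetchall()
print(rows)
```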

Kepler Demo: Building a simple workflow
- Run the workflow
- Also set parameters, select & configure director, run window, etc.

SEEK Ecological Niche Modeling Workflows
- Complex workflows with many levels of nesting (sub-workflows)
- Predict species locations from presence data and environmental layers
- Designed to support different prediction algorithms (reusability)
- Currently uses GARP (Genetic Algorithm for Rule-Set Prediction)
[Figure annotation: n levels down]

Drilling down: Calculate Best Rulesets
[Figure label: climate change data]

SEEK Ecological Niche Modeling Workflows
- Includes a number of workflows for automating special-purpose data-integration tasks
  - Integration of multiple data sets and data types
  - Workflows for local caching of data, format and content conversions
  - Rescale grid data, adjust resolutions and extents, merge grids
  - Integrate Hydro1K North and South American data, including warp/projection, format conversion, rescaling, etc.
ppod @ NESCENT, Sept 07

The Joy of Exa-Scale Cyberinfrastructure
- Are we working at the right level of abstraction?
- Are we optimizing the right thing?
  - Optimize human cycles, not just CPU cycles! (cf. John McCarthy, of AI/LISP fame)
- Make data & scientific workflows effectively (re-)usable for scientists
  - Make workflows first-class, shareable knowledge artifacts
  - Support user-oriented provenance queries

(Data) Provenance & Scientific Workflows
- (Data) provenance: data lineage, processing history
- Query the lineage of a data product: what data it is derived from and how
- Evaluate the results of a workflow: is the approach correct?
- Reuse intermediate or final products of one workflow in another
- Explain unexpected results
- Discover all results derived from a given data set
- Accurately prepare the methods section of a publication
- Archive scientific results in a repository
- Replicate the results reported by another researcher

Inferring a phylogenetic tree from disparate data
[Figure labels:
 Datasets: Aligned DNA sequences; Discrete morphological data; Continuous characters
 Actors: Maximum likelihood tree (DNA); Maximum parsimony tree; Maximum likelihood tree (continuous characters); Integrate
 Output: Consensus Tree(s)
 Provenance Store]

Scientific provenance questions (single run)
- Which DNA sequences were input to the workflow, and which phylogenetic trees were output by it?
- What intermediate phylogenetic trees were created?
- Which actor created this phylogenetic tree?
- Which input sequences does this consensus tree depend on?
- Which input sequences were not used to derive any consensus tree?
- What sequence alignment (key intermediate data) was used to infer this tree?

A (very) simple phylogenetics workflow

Data lineage + processing history for a consensus tree
[Figure: provenance graph of one workflow run; edge labels include TextFileReader:1, Nexusfileparser:1, PhylipPars:1, PhylipConsense:1]
- Derivation (processing history) of a data item in a scientific workflow run (a DAG)
  - Nodes = data items the workflow run operated on or created
  - Edges = "was directly used in", labeled by the actor invocation that performed this computation
- Different (emerging) provenance extensions to Kepler
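
The DAG described above (data items as nodes, "was directly used in" edges labeled by actor invocation) supports lineage queries by simple graph traversal. The sketch below is an illustration under assumptions, not one of the Kepler provenance extensions: the data item names (`seq1`, `tree1`, and so on) and the helper functions are hypothetical, and only the invocation labels follow the style seen in the figure.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Each record: (input item, actor invocation, output item).
# Item names are hypothetical; invocation labels use the "Actor:n" style.
EDGES: List[Tuple[str, str, str]] = [
    ("seq1", "PhylipPars:1", "tree1"),
    ("seq2", "PhylipPars:1", "tree1"),
    ("seq2", "PhylipPars:2", "tree2"),
    ("tree1", "PhylipConsense:1", "consensus1"),
    ("tree2", "PhylipConsense:1", "consensus1"),
]
WORKFLOW_INPUTS: Set[str] = {"seq1", "seq2", "seq3"}


def build_index(edges: List[Tuple[str, str, str]]):
    """Map each item to the items it was directly derived from and to its creator."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    creator: Dict[str, str] = {}
    for src, invocation, dst in edges:
        parents[dst].add(src)
        creator[dst] = invocation
    return parents, creator


def lineage(item: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All items that `item` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents, creator = build_index(EDGES)

# Which input sequences does this consensus tree depend on?
depends_on = lineage("consensus1", parents) & WORKFLOW_INPUTS
# Which input sequences were not used to derive the consensus tree?
unused = WORKFLOW_INPUTS - depends_on
# Which actor invocation created this tree?
made_by = creator["consensus1"]

print(sorted(depends_on), sorted(unused), made_by)
```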

Provenance: Single Run

Provenance: Multiple Runs

Conceptual workflows: series of subworkflows

Manual, data visualization, and quality assessment steps are interleaved with automated steps

Projects comprise multiple conceptual workflows

Workflows are run multiple times with different parameter settings

How Kepler is used today
- Aware of only one workflow, one run at a time
- Data, workflows, and provenance records reside outside the system between runs
- Users must perform most data and provenance management outside of the system
- Workflows must be modified or reconfigured to operate on different input data

Support for project folders & histories
- Data is registered.
- Project folders allow users to organize data.
- Project history records and depicts past workflow runs and the flow of data between runs.
- Data is staged from the project folders (and project history).
- Run outputs appear in the project history (along with the input) if the run is committed.
- All or part of the output of a run may be used to update the project folders.
- Workflows can be applied to different data sets without modification.

Project history relieves need to perform data versioning via project folders
- Recomputed data can replace old versions, be stored elsewhere in folders, or simply be left in the project history.
- Replaced data are always accessible via project history.
- Provenance queries provide access to all data regardless of location.
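
The relationship between project folders and project history can be sketched as two separate structures: a mutable folder mapping and an append-only list of committed runs. This is an illustrative model only, not the design from the DILS'07 paper or any Kepler component; the `Run` and `Project` classes, the folder path, and the data identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Run:
    run_id: int
    workflow: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]


@dataclass
class Project:
    folders: Dict[str, str] = field(default_factory=dict)  # folder path -> current data item
    history: List[Run] = field(default_factory=list)       # append-only list of committed runs

    def commit(self, workflow: str, inputs: Dict[str, str], outputs: Dict[str, str]) -> Run:
        """Record a run, with its inputs and outputs, in the project history."""
        run = Run(len(self.history) + 1, workflow, dict(inputs), dict(outputs))
        self.history.append(run)
        return run

    def update_folder(self, path: str, run: Run, output_name: str) -> None:
        """Replace whatever is stored at `path` with one output of `run`."""
        self.folders[path] = run.outputs[output_name]

    def versions(self, output_name: str) -> List[Tuple[int, str]]:
        """Every recorded value of an output, oldest first, even if replaced in folders."""
        return [
            (run.run_id, run.outputs[output_name])
            for run in self.history
            if output_name in run.outputs
        ]


project = Project()
first = project.commit("infer_tree", {"alignment": "aln_v1"}, {"tree": "tree_A"})
project.update_folder("trees/best", first, "tree")
second = project.commit("infer_tree", {"alignment": "aln_v2"}, {"tree": "tree_B"})
project.update_folder("trees/best", second, "tree")

print(project.folders["trees/best"])
print(project.versions("tree"))
```

The folder holds only the latest tree, while the history still returns both, which is the sense in which users no longer need to version data by hand in folders.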

Managing workflow evolution
- Workflow library is not a flat list of available workflows.
- Workflows evolve throughout a project, and previous versions must be retained for reference and for further use.
- Workflow evolution view complements run history.
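
One simple way to keep a workflow library from being a flat list is to store each version with a pointer to the version it was derived from. The sketch below assumes such a parent-pointer model purely for illustration; `WorkflowVersion`, `WorkflowLibrary`, and the version identifiers are hypothetical and do not describe Kepler's actual storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class WorkflowVersion:
    version_id: str
    parent: Optional[str]  # version this one was derived from, None for the first
    note: str


class WorkflowLibrary:
    """Keeps every version; nothing is overwritten when a workflow evolves."""

    def __init__(self) -> None:
        self._versions: Dict[str, WorkflowVersion] = {}

    def add(self, version_id: str, parent: Optional[str], note: str) -> None:
        if version_id in self._versions:
            raise ValueError(f"version already exists: {version_id}")
        if parent is not None and parent not in self._versions:
            raise KeyError(f"unknown parent version: {parent}")
        self._versions[version_id] = WorkflowVersion(version_id, parent, note)

    def ancestry(self, version_id: str) -> List[str]:
        """The version followed by its predecessors, newest first."""
        chain: List[str] = []
        current: Optional[str] = version_id
        while current is not None:
            chain.append(current)
            current = self._versions[current].parent
        return chain

    def children(self, version_id: str) -> List[str]:
        """Versions derived directly from `version_id`."""
        return sorted(v.version_id for v in self._versions.values() if v.parent == version_id)


library = WorkflowLibrary()
library.add("v1", None, "initial tree inference workflow")
library.add("v2", "v1", "added consensus step")
library.add("v2-alt", "v1", "alternative inference method")
library.add("v3", "v2", "tuned parameters")

print(library.ancestry("v3"))
print(library.children("v1"))
```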

Summary & Next Steps
- Kepler today
  - used in ecoinformatics (SEEK), ChIP-chip, geoinformatics
  - data catalog, data grid
  - workflows for data integration
  - data annotation and semantic extensions
- Kepler next steps (planned deliverables):
  - PHYLOGENETIC SCIENTIFIC WORKFLOWS
    - Develop use cases / conceptual workflows:
      - tree construction (understood)
      - post-tree analysis, supertree/matrix construction (exciting :)
      - community-driven!
    - Implement subset of those in Kepler
    - Generate actor library targeting community use cases
  - PROJECT HISTORIES SUPPORT (cf. DILS'07 paper)
    - Extend use cases to exploit project histories / provenance
    - Implement those
  - ppod REPOSITORY (Orchestra!?)
    - Extend Kepler to use ppod data repository

Consilience: The Unity of Knowledge (E. O. Wilson)
- "Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." (E. O. Wilson)
- e-Science, Cyberinfrastructure: mechanisms to make progress
- Scientific Workflows: crucial elements to get the most mileage out of CI to fuel e-Science, accelerating knowledge discovery
- Identify the real bottlenecks in this quest!
- "We must know, we will know." -- David Hilbert
- "Anyone who has visions should go see a doctor." (German original: "Wer Visionen hat, sollte zum Arzt gehen") -- Helmut Schmidt on Willy Brandt

Questions
kepler-project.org

References
Niche Modeling
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological niche modeling using the Kepler workflow system. Workflows for e-science: Scientific Workflows for Grids, Springer-Verlag, 2007.
- Ecological Niche Modeling in Kepler. User Manual. Draft, 2007.
Semantic Annotation
- S Bowers, B Ludaescher. A calculus for propagating semantic annotations through scientific workflow queries. QLQP, 2006.
- S Bowers, B Ludaescher. Actor-oriented design of scientific workflows. ER, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating semantics in scientific workflow authoring. SSDBM, 2005.
- S Bowers, B Ludaescher. An ontology-driven framework for data transformation in scientific workflows. DILS, 2004.
Scientific Workflows, B. Ludäscher

References
Provenance in Workflows
- S Bowers, T McPhillips, M Wu, B Ludaescher. Project histories: Managing data provenance across collection-oriented scientific workflow runs. DILS, 2007.
- S Bowers, T McPhillips, B Ludaescher. Provenance in collection-oriented workflows. Concurrency and Computation: Practice and Experience, 2007.
- B Ludaescher, N Podhorszki, I Altintas, S Bowers, T McPhillips. From computation models to models of provenance: The RWS approach. Concurrency and Computation: Practice and Experience, 2007.

Additional Related Publications
Semantic Type Annotation
- S Bowers, B Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
- S Bowers, B Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
- B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
- S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
- S Bowers, K Lin, B Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
- S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
- S Bowers, B Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. International Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
Workflow Design and Modeling
- T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2006.
- S Bowers, T McPhillips, B Ludaescher, S Cohen, SB Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW), LNCS, 2006.
- S Bowers, B Ludaescher, AHH Ngu, T Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
- S Bowers, B Ludaescher. Actor-Oriented Design of Scientific Workflows. International Conference on Conceptual Modeling (ER), LNCS, 2005.
- T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
Kepler
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-science, Springer-Verlag, to appear.
- W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
- S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.