ECS289F Winter 05 Scientific Data Management

Similar documents
Scientific Data & Workflow Engineering. Outline

KEPLER: Overview and Project Status

Final Project Assignments. Promoter Identification Workflow (PIW)

Scientific Data Integration: From the Big Picture to some Gory Details

Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

Accelerating the Scientific Exploration Process with Kepler Scientific Workflow System

A High-Level Distributed Execution Framework for Scientific Workflows

Using Web Services and Scientific Workflow for Species Distribution Prediction Modeling 1

Scientific Workflows

A(nother) Vision of ppod Data Integration!?

7 th International Digital Curation Conference December 2011

Knowledge-based Grids

Workflow Exchange and Archival: The KSW File and the Kepler Object Manager. Shawn Bowers (For Chad Berkley & Matt Jones)

Automatic Transformation from Geospatial Conceptual Workflow to Executable Workflow Using GRASS GIS Command Line Modules in Kepler *

DataONE: Open Persistent Access to Earth Observational Data

A Three Tier Architecture for LiDAR Interpolation and Analysis

Hybrid-Type Extensions for Actor-Oriented Modeling (a.k.a. Semantic Data-types for Kepler) Shawn Bowers & Bertram Ludäscher

Where we are so far. Intro to Data Integration (Datalog, mediators, ) more to come (your projects!): schema matching, simple query rewriting

A High-Level Distributed Execution Framework for Scientific Workflows

Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life

Data Integration and Data Warehousing Database Integration Overview

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

Kepler and Grid Systems -- Early Efforts --

Introduction to Grid Computing

San Diego Supercomputer Center, UCSD, U.S.A. The Consortium for Conservation Medicine, Wildlife Trust, U.S.A.

Enhanced Access to High-Resolution LiDAR Topography through Cyberinfrastructure-Based Data Distribution and Processing

A geoinformatics-based approach to the distribution and processing of integrated LiDAR and imagery data to enhance 3D earth systems research

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

Situations and Ontologies: helping geoscientists understand and share the semantics surrounding their computational resources

Kepler: An Extensible System for Design and Execution of Scientific Workflows

DSpace Fedora. Eprints Greenstone. Handle System

Opus: University of Bath Online Publication Store

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

and the bringing cabig cancer data to the.net Developer and Microsoft Office User Communities

ITM DEVELOPMENT (ITMD)

Parameter Space Exploration using Scientific Workflows

Managing the Emerging Semantic Risks

Semantic Web Technologies

Interoperability and eservices

DataONE Enabling Cyberinfrastructure for the Biological, Environmental and Earth Sciences

Java Development and Grid Computing with the Globus Toolkit Version 3

Juliusz Pukacki OGF25 - Grid technologies in e-health Catania, 2-6 March 2009

Data publication and discovery with Globus

The Virtual Observatory and the IVOA

GMA-PSMH: A Semantic Metadata Publish-Harvest Protocol for Dynamic Metadata Management Under Grid Environment

Com S/Geron 415X Gerontechnology in Smart Home Environments Implementation. Dr. Hen-I Yang Computer Science Department, ISU

Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation

Introduction to Grid Technology

The International Journal of Digital Curation Issue 1, Volume

Agent-Enabling Transformation of E-Commerce Portals with Web Services

Overview. Scientific workflows and Grids. Kepler revisited Data Grids. Taxonomy Example systems. Chimera GridDB

An Eclipse-based Environment for Programming and Using Service-Oriented Grid

Hands-on tutorial on usage the Kepler Scientific Workflow System

Overview of Web Mining Techniques and its Application towards Web

Integration of INSPIRE & SDMX data infrastructures for the 2021 population and housing census

Science-as-a-Service

IRS-III: A Platform and Infrastructure for Creating WSMO-based Semantic Web Services

Mitigating Risk of Data Loss in Preservation Environments

Developing InfoSleuth Agents Using Rosette: An Actor Based Language

Managing Rapidly-Evolving Scientific Workflows

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

Tools, tips, and strategies to optimize BEx query performance for SAP HANA

Semantics and Ontologies For EarthCube

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

Distributed Data Management with Storage Resource Broker in the UK

Implementing Trusted Digital Repositories

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

Provenance Collection Support in the Kepler Scientific Workflow System

Research on the Interoperability Architecture of the Digital Library Grid

Enabling Seamless Sharing of Data among Organizations Using the DaaS Model in a Cloud

Data Immersion : Providing Integrated Data to Infinity Scientists. Kevin Gilpin Principal Engineer Infinity Pharmaceuticals October 19, 2004

Introduction to Federation Server

Teiid Designer User Guide 7.5.0

Jeffery S. Horsburgh. Utah Water Research Laboratory Utah State University

Development of Contents Management System Based on Light-Weight Ontology

A Dream of Software Engineers -- Service Orientation and Cloud Computing

metamatrix enterprise data services platform

Reducing Consumer Uncertainty

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

IBM API Connect: Introduction to APIs, Microservices and IBM API Connect

Opal: Simple Web Services Wrappers for Scientific Applications

The NeuroLOG Platform Federating multi-centric neuroscience resources

Who we are: Database Research - Provenance, Integration, and more hot stuff. Boris Glavic. Department of Computer Science

Multiple Broker Support by Grid Portals* Extended Abstract

Context Aware Computing

Grid Computing Systems: A Survey and Taxonomy

Executive Committee Meeting

Component-Based Software Engineering TIP

Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)

Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data

Networking European Digital Repositories

Sensor Data Management

Minsoo Ryu. College of Information and Communications Hanyang University.

Semantic Web. Semantic Web Services. Morteza Amini. Sharif University of Technology Spring 90-91

Call for Participation in AIP-6

A Knowledge-Based Approach to Scientific Workflow Composition

Webinar Annotate data in the EUDAT CDI

Distributed Repository for Biomedical Applications

Introduction Data Integration Summary. Data Integration. COCS 6421 Advanced Database Systems. Przemyslaw Pawluk. CSE, York University.

Context-Aware Actors. Outline

Transcription:

ECS289F Winter 05 Scientific Data Management Data Integration Bertram Ludäscher Associate Professor Dept. of Computer Science & Genome Center University of California, Davis ludaesch@ucdavis.edu Process Integration Scientific Data & Workflow Engineering Knowledge Representation UC DAVIS Department of Computer Science San Diego Supercomputer Center Logistics Class: MWF, 4:10-5pm, 244 Olson http://www.sdsc.edu/~ludaesch/ecs289f-w05.html (or Google with Bertram ECS ; then navigate) Notes will be posted there (& hand-outs available) Office hours: General: Wednesday & Friday, 11am-noon, room 3051, Kemper Hall Extra: email ludaesch@ucdavis.edu, subject ECS289F Grading: Project 50%, presentation 30%, homework 20% Projects Implementation Projects (Hands-on) with scientific data management and workflow tools: KEPLER scientific workflow system SDSC Storage Resource Broker e.g. run a number of compute-intensive jobs (say from cheminformatics) on a cluster computer via KEPLER, or access a real-time seismic sensor network client application using KEPLER and SRB Research Projects (Theory): Readings in data integration: understand how a database mediator works i.e., query rewriting Readings in knowledge representation: understand how ontology languages are used to capture and reason with domain knowledge Combined Projects (Theory & Hands-on) Implementation of algorithms for query rewriting 1

Research Assistant (RA)-ships Openings at the CS department and new Genome Center available; option for summer internship at the San Diego Supercomputer Center! Mix of theory and practice ideal: Databases (& Information Systems) background helps a lot No fear of ontologies or workflows (that s why we have this class ;-) Problem-solving (thinking!) and Programming skills Come to my office hours (WF, 11am-noon, 3051 Kemper Hall) for details Today s Outline Semantics & Scientific Data Integration Semantics & Scientific Workflow Management Conclusions Domain Science Driver Ecology (LTER), biodiversity, Analysis & Modeling System Design & execution of ecological models & analysis ( scientific workflows ) {application,upper}-ware Kepler system Semantic Mediation System Data Integration of hard-to-relate sources and processes Semantic Types and Ontologies upper middleware Sparrow Toolkit EcoGrid Access to ecology data and tools {middle,under}-ware unified API to SRB/MCAT, MetaCat, DiGIR, datasets Anatomy of the Science Environment for Ecological Knowledge (SEEK) Collaboratory sample CS problem [DILS 04] Common Collaboratories / Distributed Science / Cyberinfrastructure Pieces Seamless and uniform data access ( Data-Grid ) data & metadata registry distributed and high performance computing platform ( Compute-Grid ) service registry User-friendly workbench / problem-solving environment scientific workflow system A common problem: integrating (or at least linking) data from multiple sites, investigators, communities,, scales,, species, Federated, integrated, mediated databases often use of semantic extensions (e.g. ontologies) 2

Interoperability & Integration Challenges reconciling S 5 heterogeneities gluing together resources bridging information and knowledge gaps computationally System aspects: Grid Middleware distributed data & computing, SOA web services, WSDL/SOAP, WSRF, OGSA, sources = functions, files, data sets Syntax & Structure: (XML-Based) Data Mediators wrapping, restructuring (XML) queries and views sources = (XML) databases Semantics: Model-Based/Semantic Mediators conceptual models and declarative views Knowledge Representation: ontologies, description logics (RDF(S),OWL...) sources = knowledge bases (DB+CMs+ICs) Synthesis: Scientific Workflow Design & Execution Composition of declarative and procedural components into larger workflows (re)sources = services, processes, actors, Semantic extensions needed here as well! Information Integration Challenges: S 4 Heterogeneities System aspects platforms, devices, data & service distribution, APIs, protocols, Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote resources, Syntax & Structure heterogeneous data formats (one for each tool...) heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, ) heterogeneous schemas (one for each DB...) Database mediation and warehousing technologies + XML-based data exchange, integrated views, transparent query rewriting, Semantics descriptive metadata, different terminologies, implicit assumptions & hidden semantics ( context ) of experiments, simulations, observation, Knowledge representation & semantic mediation technologies + smart data discovery & integration + e.g. ask about X ( mafic ); find data about Y ( diorite ); be happy anyways! Information Integration Challenges: S 5 Heterogeneities Synthesis of applications, analysis tools, data & query components, into scientific workflows How to make use of these wonderful things & put them together to solve a scientist s problem? Scientific Problem Solving Environments (PSEs) Portals,Workbench ( scientist s view, end user) + ontology-enhanced data registration, discovery, manipulation + creation and registration of new data products from existing ones, Scientific Workflow System ( engineer s view, tool maker) + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools + e.g., creation of new datasets from existing ones, dataset registration, Not discussed here: the 6th S : Social challenges Our Focus Scientific Data Integration: need DB/DI + KR ( semantic mediation ) Automation of Scientific Data Analysis, Process & Application Integration need for scientific workflow systems need for semantic extensions But first: Some data & information integration problems 3

An Online Shopper s Information Integration Problem El Cheapo: Where can I get the cheapest copy (including shipping cost) of Wittgenstein s Tractatus Logicus-Philosophicus within a week? A Home Buyer s Information Integration Problem What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population? addall.com? Mediator (virtual DB) Information (vs. Datawarehouse) Integration NOTE: non-trivial data engineering challenges! One-World Mediation? Information Integration Multiple-Worlds Mediation amazon.com barnes&noble.com half.com A1books.com Realtor Realtor Crime Crime Stats Stats School School Rankings Rankings Demographics Demographics Information Integration from a Database Perspective Information Integration Problem Given: data sources S 1,..., S k (databases, web sites,...) and user questions Q 1,..., Q n that can in principle be answered using the information in the S i Find: the answers to Q 1,..., Q n The Database Perspective: source = database S i has a schema (relational, XML, OO,...) S i can be queried define virtual (or materialized) integrated (or global) view G over local sources S 1,..., S k using database query languages (SQL, XQuery,...) questions become queries Q i against G(S 1,..., S k ) Standard Mediator Architecture 1. Query Q ( G (S 1,..., S k ) ) Integrated View Definition G(..) S 1 (..)S k (..) USER/Client Integrated Global (XML) View G MEDIATOR 2. 5. Query Post rewriting processing 3. Q1 Q2 Q3 4. {answers(q1)} {answers(q2)} {answers(q3)} (XML) View Wrapper S 1 (XML) View Wrapper S 2 6. {answers(q)} (XML) View Wrapper S k web services as wrapper APIs 4

Query Planning in Data Integration Given: Declarative user query Q: answer() G... &{ G S } global-as-view (GAV) & { S G} local-as-view (LAV) & { ic() SG} integrity constraints (ICs) Find: equivalent (or minimal containing, maximal contained) query plan Q : answer() S query rewriting (logical/calculus, algebraic, physical levels) Results: A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP,, undecidable hot research area in core CS (database community) A Neuroscientist s Information Integration Problem What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents? Inter-source links: unclear for the non-scientists hard for the scientist protein localization sequence info (NCMIR) (CaPROT)? Information Integration morphometry (SYNAPSE) Biomedical Informatics Research Network http://nbirn.net Complex Multiple-Worlds Mediation neurotransmission (SENSELAB) Scientific Data Integration using Semantic Extensions 5

Example: Geologic Map Integration Given: Geologic maps from different state geological surveys (shapefiles w/ different data schemas) Different ontologies: Geologic age ontology (e.g. USGS) Rock classification ontologies: Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC) Single hierarchy from British Geological Survey (BGS) Problem: Support uniform queries across all maps possibly using different ontologies Support registration w/ ontology A, querying w/ ontology B Schema Integration ( registering local schemas to the global schema) Sources Arizona Colorado Utah Nevada Wyoming New Mexico ABBREV PERIOD NAME PERIOD TYPE PERIOD Formation Age Formation Age Formation Age FMATN Formation TIME_UNIT Age NAME PERIOD NAME PERIOD Formation Age Formation Age FORMATION Formation Montana E. PERIOD Age Formation Age Composition Fabric Texture Integration Schema Formation Age Composition Fabric Texture FORMATION AGE LITHOLOGY Idaho Livingston formation FORMATION Tertiary- Cretaceous AGEMontana West LITHOLOGY andesitic sandstone Sources Multihierarchical Rock Classification Ontology (Taxonomies) for Thematic Queries (GSC) Ontology-Enabled Application Example: Geologic Map Integration Genesis domain knowledge Fabric Composition Knowledge representation Geologic Age ONTOLOGY Nevada Show Show formations where where AGE AGE = Paleozic (without age age ontology) Show Show formations where where AGE AGE = Paleozic (with age age ontology) +/- a few hundred million years Texture 6

Querying by Geologic Age Querying by Geologic Age: Results Semantic Mediation (via semantic registration of schemas and ontology articulations) Schema elements and/or data values are associated with concept expressions from the target ontology conceptual queries through the ontology Articulation ontology source registration to A, querying through B Semantic mediation: query rewriting w/ ontologies Different views on State Geological Maps Database1 semantic registration ontology articulations Ontology A Concept-based ( semantic ) queries Database2 semantic registration Ontology B 7

Sedimentary Rocks: BGS Ontology Sedimentary Rocks: GSC Ontology Some Thoughts Translate this idea of multiple conceptual (ontology) views to your domain! e.g. datasets biological pathways registration Your data is valuable (time & $$$ spent in producing it) data (re-)usability Data Semantics and Ontologies should be useful for Humans and The Machine Metadata helps to discover, localize, assess relevant data sets, given particular scientific questions & queries Does your system understand what to do with the metadata? Capturing more semantics of a data set in a way that humans and systems can exploit it is an investment in reusability We are producing more and more data Today we can store everything! But can we use anything? (i.e., is anyone looking at the data after the initial creation?) Design system, interfaces, data and metadata models with reusability in mind (think archives and time capsules ) This may even be pushed to the experiment/simulation/workflow design 8

Example: Domain Knowledge to glue SYNAPSE & NCMIR Data Semantic Source Browsing : Domain Maps/Ontologies (left) & conceptually linked data (right) Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release). domain expert knowledge B. Ludaescher, formalized ECS289F-W05, as domain Topics Scientific map/ontology Data Management Made usable for the system using Description Logic A Semantic Mediation Result View Source Contextualization through Ontology Refinement In addition to registering ( hanging off ) data relative to existing concepts, a source may also refine the mediator s domain map... sources can register new concepts at the mediator... increase your data usability 9

Outline Semantics & Scientific Data Integration Semantics & Scientific Workflow Management Conclusions What is a Scientific Workflow (SWF)? Goals: automate a scientist s repetitive data management and analysis tasks typical phases: data access, scheduling, generation, transformation, aggregation, analysis, visualization design, test, share, deploy, execute, reuse, SWFs Promoter Identification Workflow Source: NIH BIRN (Jeffrey Grethe, UCSD) Source: Matt Coleman (LLNL) 10

Ecology: GARP Analysis Pipeline for Invasive Species Prediction Registered Ecogrid Database Species presence & absence points (native range) EcoGrid (a) Query +A1 + A3 +A2 Sample Data Training sample (d) Test sample (d) Native GARP range rule set prediction (e) map (f) Data Map Calculation Generation Validation Registered Ecogrid Database Integrated layers (native range) (c) Model quality parameter (g) Registered Ecogrid Database Registered Ecogrid Database EcoGrid Query Environmental layers (native range) (b) Environmental layers (invasion area) (b) Species presence &absence points (invasion area) (a) Layer Integration Layer Integration Integrated layers (invasion area) (c) Map Generation Validation Invasion area prediction map (f) Model quality parameter (g) Archive To Ecogrid User Selected prediction maps (h) Generate Metadata Source: NSF SEEK (Deana Pennington et. al, UNM) Commercial & Open Source Scientific Workflow (often Dataflow) Systems SCIRun: Problem Solving Environments for Large-Scale Scientific Computing Kensington Discovery Edition from InforSense Taverna Triana SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations and visualizations Component model, based on generalized dataflow programming Steve Parker (cs.utah.edu) 11

Why Ptolemy II (and thus KEPLER)? Ptolemy II Ptolemy II Objective: The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation. read! read! see! see! Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming) actor-orientation, actor reuse User-Orientation Workflow design & exec console (Vergil GUI) Application/Glue-Ware excellent modeling and design support run-time support, monitoring, not a middle-/underware (we use someone else s, e.g. Globus, SRB, ) but middle-/underware is conveniently accessible through actors! PRAGMATICS Ptolemy II is mature, continuously extended & improved, well-documented (500+pp) open source system Ptolemy II folks actively participate in KEPLER try! try! Source: Source: Edward Edward Lee Lee et et al. al. http://ptolemy.eecs.berkeley.edu/ptolemyii/ http://ptolemy.eecs.berkeley.edu/ptolemyii/ KEPLER/CSP: Contributors, Sponsors, Projects KEPLER: An Open Collaboration (or loosely coupled Communicating Sequential Persons ;-) Ilkay Altintas SDM, Resurgence Kim Baldridge Resurgence, NMI Chad Berkley SEEK Shawn Bowers SEEK Terence Critchlow SDM Tobin Fricke ROADNet Jeffrey Grethe BIRN Christopher H. Brooks Ptolemy II Zhengang Cheng SDM Dan Higgins SEEK Efrat Jaeger GEON Matt Jones SEEK Werner Krebs, EOL Edward A. Lee Ptolemy II Kai Lin GEON Bertram Ludaescher SDM, SEEK, GEON, BIRN, ROADNet Mark Miller EOL Steve Mock NMI Steve Neuendorffer Ptolemy II Jing Tao SEEK Mladen Vouk SDM Xiaowen Xin SDM Yang Zhao Ptolemy II Bing Zhu SEEK www.keplerwww.kepler-project.org Initiated by members from DOE SDM/SPA and NSF SEEK; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, ) Open Source (BSD-style license) Intensive Communications: Web-archived mailing lists IRC (!) Meetings, Hackathons Ptolemy II Co-development: via shared CVS repository joining as a new co-developer (currently): get a CVS account (read-only) local development + contribution via existing KEPLER member be voted in as a member/co-developer Software & social engineering How to better accommodate new groups/communities? How to better accommodate different usage/contribution models (core dev special purpose extender user)? 12

Ptolemy II/KEPLER GUI (Vergil) Directors define the component interaction & execution semantics Web Services Actors (WS Harvester) 1 2 4 Large, polymorphic component ( Actors ) and Directors libraries (drag & drop) 3 Minute-made (MM) WS-based application integration Similarly: MM workflow design & sharing w/o implemented components Rapid Web Service-based Prototyping (Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg) An early example: Promoter Identification SSDBM, AD 2003 Source: Ilkay Altintas, SDM, NLADR ROADNet: Vernon, Orcutt et al Web services: Tony Fountain et al Scientist models application as a workflow of connected components ( actors ) If all components exist, the workflow can be automated/ executed Different directors can be used to pick appropriate execution model (often pipelined execution: PN director) 13

PIW Workflow Today Run Window Enter initial inputs, Run and Display results Job Management (here: NIMROD) Job management infrastructure in place Results database: under development Goal: 1000 s of GAMESS jobs (quantum mechanics) Custom Visualization 14

Some Recent Actor Additions in KEPLER (w/ editable script) Source: Dan Higgins, Kepler/SEEK in KEPLER (interactive session) Blurring Design (ToDo) and Execution Source: Dan Higgins, Kepler/SEEK 15

Some Scientific Workflow Challenges Typical Features data-intensive and/or compute-intensive plumbing-intensive (consecutive web services won t fit) dataflow-oriented distributed (remote data, remote processing) user-interaction in the middle, vs. (C-z; bg; fg)-ing ( detach and reconnect) advanced programming constructs (map(f), zip, takewhile, ) logging, provenance, registering back (intermediate) products Scientific Workflows & Semantics Registering data to ontologies: semantic types (in addition to structural data types) Smarter data set discovery & integration Now also: Smarter workflow design More intelligent (semantics-aware) component composition Improved (re-)usability of data, services (actors), and workflows Given semantic type of my input ports, what other data sets / actors produce such input Reengineering a Geoscientist s Mineral Classification Workflow Beginnings: Ontology-based Actor/Service Discovery Ontology based actor (service) and dataset search Result Display Add semantic types to ports!! 16

Semantics & Scientific Workflows Data comes from heterogeneous sources Real-world observations Spatial-temporal contexts Collection/measurement protocols and procedures Many representations for the same information (count, area, density) Schematically heterogeneous Data discovered and synthesized manually Hard to reuse/repurpose existing analytical steps (another form of heterogeneity) A KR+DI+Scientific Workflow Problem Services can be semantically compatible, but structurally incompatible Semantic Type P s Source Service Incompatible Structural Type P s ( ) P s Ontologies (OWL) Compatible ( ) δ δ(p s ) ( ) Desired Connection Structural Type P t P t Semantic Type P t Target Service Source: [Bowers-Ludaescher, DILS 04] Ontology-Informed Data Transformation ( Structure-Shim ) Ontologies (OWL) Scientific Data Integration Outline Semantic Type P s Registration Mapping (Output) Compatible ( ) Registration Mapping (Input) Semantic Type P t Scientific Workflow Management Musings & Conclusions Structural Type P s Correspondence Structural Type P t Generate δ(p s ) Source Service P s Transformation Desired Connection P t Target Service Source: [Bowers-Ludaescher, DILS 04] 17

GEON Dataset Generation & Registration (a co-development in KEPLER) % Makefile Makefile $> $> ant ant run run Further Reading Matt,Chad, Dan et al. (SEEK) SQL database access (JDBC) Efrat (GEON) Ilkay (SDM) Yang (Ptolemy) Xiaowen (SDM) Edward et al.(ptolemy) under review available upon request from ludaesch@sdsc.edu Related Publications Semantic Data Registration and Integration On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece. A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003. Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003. The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003. Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003. Query Planning and Rewriting Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04) Paris, France, June 2004. Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher., 9th Intl. Conference on Extending Database Technology (EDBT'04) Heraklion, Crete, Greece, March 2004, LNCS 2992. Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04) Boston, IEEE Computer Society, April 2004. Related Publications Scientific Workflows Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece. Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004. An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004 Leipzig, Germany, LNCS 2994. A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludaescher, A. Memon, In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004. 18