Towards Semantically-enabled Exploration and Analysis of Environmental Ecosystems


Deborah L. McGuinness (1), Tetherless World Senior Constellation Chair, Rensselaer Polytechnic Institute
In conjunction with Ping Wang (1,3), Linyun Fu (1), Evan W. Patton (1), F. Joshua Dein (2), and R. Sky Bristol (2)
(1) RPI, (2) USGS, (3) Amazon
eScience 2012, Chicago, IL, October 10, 2012

Motivation
- A Semantic eScience class generated a water quality portal project.
- Presented at Environmental informatics and AGU; attracted the interest of wildlife managers at USGS (among other places).
- We generalized the semantically-enabled water quality portal into a semantic foundation for monitoring environmental and ecological data. This talk covers the expanded, more general version of our system.
- Wildlife and their habitats are deteriorating around the world; e.g., 40% of US freshwater fish are at risk [1].
- Scientists and resource managers need tools to integrate, interpret, and present data from both government and non-government sources.
- Tools are needed to support the collaborative effort of multiple disciplines to attain optimal health for people, animals, and the environment [2].
[1] National Fish Habitat Board, Through a Fish's Eye: The Status of Fish Habitats in the United States 2010, Washington D.C., 2010.
[2] The American Veterinary Medical Association, One Health Initiative Task Force, One Health: A New Professional Imperative, July 15, 2008.

Diverse (and growing) Data Sources
- Supporting resource managers requires integration of different data sources:
  - Environmental quality data
  - Pollution regulatory data
  - Species counts
  - Health effects of pollutants on species
- A large number of observations are spread across different databases with different schema interpretations.
- Data integration requires an automated process that can provide an understanding of the data and incorporate that understanding into a shared schema.

EPA Enforcement and Compliance History Online
- EPA has two separate database formats for storing compliance history:
  - Permit Compliance System (PCS) stores tabular data in fixed-width ASCII tables, with a data dictionary for code lookups
  - Integrated Compliance Information System (ICIS) stores tabular data as comma-separated values (CSV)
- Different states report to different systems, so multiple methods of data conversion are required to aggregate the data into a single system that can be queried.
- Meaning is separated out into supplementary files (i.e., the data dictionaries).
- SemantAqua uses both compliance data and regulation data.
- EPA also maintains documentation on the health effects caused by the pollutants tracked by the agency.
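The two-format problem on this slide can be illustrated with a minimal Python sketch. It is not the project's converter: the column layout, the parameter codes, the dictionary entries, and the field names are all hypothetical, chosen only to show fixed-width records (whose meaning sits in a separate data dictionary) and CSV records being read into one shared record shape.

```python
import csv
import io

# Hypothetical fixed-width layout: (field name, start offset, end offset).
PCS_LAYOUT = [("facility_id", 0, 9), ("param_code", 9, 14), ("value", 14, 22)]

# Hypothetical data dictionary; in the fixed-width source the code meanings
# are kept in supplementary files rather than in the data rows.
PARAM_DICTIONARY = {"00400": "pH", "01051": "Lead, total"}


def to_common(facility_id, param_code, value):
    """Build one record in the shared schema, carrying the decoded meaning inline."""
    return {
        "facility_id": facility_id,
        "parameter": PARAM_DICTIONARY.get(param_code, param_code),
        "value": float(value),
    }


def parse_fixed_width(text):
    """Slice each fixed-width line into fields using the layout table."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = {name: line[start:end].strip() for name, start, end in PCS_LAYOUT}
        records.append(to_common(row["facility_id"], row["param_code"], row["value"]))
    return records


def parse_csv(text):
    """Read CSV rows (with a header line) into the same shared record shape."""
    reader = csv.DictReader(io.StringIO(text))
    return [to_common(r["facility_id"], r["param_code"], r["value"]) for r in reader]


pcs_sample = "RI0000001" + "00400" + "    7.20" + "\n"
icis_sample = "facility_id,param_code,value\nNY0000002,01051,0.015\n"
combined = parse_fixed_width(pcs_sample) + parse_csv(icis_sample)
```

After both parsers run, `combined` holds records that can be queried together regardless of which system a state reported to.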

National Water Information System
- Provides water quality measurements for a large number of water bodies across the United States.
- Data are retrieved as comma-separated values files using the US Geological Survey (USGS) National Water Information System (NWIS) web service.
- Data are denormalized, so information is often repeated between measurements.
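To make the denormalization point concrete, here is a small Python sketch that splits repeated site details from the measurements that reference them. The column names, site numbers, and values are invented for illustration and do not reflect the actual NWIS response format.

```python
import csv
import io

# Hypothetical denormalized extract: site details repeat on every measurement row.
SAMPLE = """site_no,site_name,latitude,longitude,date,parameter,value
S001,Example Site A,41.98,-71.50,2011-06-01,pH,7.1
S001,Example Site A,41.98,-71.50,2011-06-01,Arsenic,0.004
S002,Example Site B,41.89,-71.38,2011-06-02,pH,6.8
"""


def normalize(text):
    """Return (sites, measurements): each site stored once, keyed by site_no."""
    sites = {}
    measurements = []
    for row in csv.DictReader(io.StringIO(text)):
        site_no = row["site_no"]
        # Keep the first copy of the site details; later rows only repeat them.
        sites.setdefault(site_no, {
            "name": row["site_name"],
            "lat": float(row["latitude"]),
            "lon": float(row["longitude"]),
        })
        measurements.append({
            "site_no": site_no,
            "date": row["date"],
            "parameter": row["parameter"],
            "value": float(row["value"]),
        })
    return sites, measurements


sites, measurements = normalize(SAMPLE)
```

Each site then becomes a single object that measurements link to, which is the same promotion step the semantic conversion performs when it turns repeated table values into linkable resources.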

Avian Knowledge Network
- Aggregates records of bird counts provided by government and non-government organizations; the records can be used to determine bird populations over time.
- Tab-delimited files are retrieved via an application programming interface (API).
- Includes a number of provenance-related items:
  - Protocols used by the observing party
  - Whether submitted entries have been reviewed
  - Original reference system used by the observation team and its transformed coordinates in the final dataset
- Requires attribution of data to the appropriate parties.

Wildpro
- Includes information on various species laid out taxonomically.
- Provides references for habitat and migratory patterns of species.
- Provides partial health effect information concerning contaminant effects on wildlife.

(Selected) Value of Semantics
- Semantic technologies alleviate many of the issues of these different formats by incorporating the meaning and interpretation of the data into the data themselves.
- The conversion process we use incorporates the information from the data dictionaries so that artificial intelligence (AI) reasoners can make use of that external information.
- Common information related to individual table entries is extracted and promoted (i.e., normalized) into objects that can be linked to and further described by others on the web.
- Possibly the biggest value is support for quick prototyping: students got an initial version up in a month.

Reusable Ontologies
- The pollution ontology describes the relationship between a regulation violation (a measurement), a polluted thing, and a polluted site.
- Combined with other ontologies (e.g., W3C Geo), users can ask: "Tell me all of the polluted things within 1 mile of my location."
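The "within 1 mile" question can be sketched in plain Python as a distance filter over latitude/longitude pairs. This is only a stand-in for what the ontologies and a query engine would do in the portal; the record fields, the sample coordinates, and the names are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def polluted_within(things, lat, lon, radius_miles=1.0):
    """Names of polluted things no farther than radius_miles from (lat, lon)."""
    return [
        t["name"]
        for t in things
        if t["polluted"] and haversine_miles(lat, lon, t["lat"], t["lon"]) <= radius_miles
    ]


# Hypothetical records; in the portal these would come from linked data.
things = [
    {"name": "Well A", "lat": 42.7350, "lon": -73.6800, "polluted": True},
    {"name": "Well B", "lat": 42.8000, "lon": -73.6800, "polluted": True},
    {"name": "Well C", "lat": 42.7310, "lon": -73.6800, "polluted": False},
]
nearby = polluted_within(things, 42.7300, -73.6800)
```

Here `nearby` contains only the polluted thing inside the radius; the far one and the unpolluted one are excluded.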

Ontologies
- The water quality ontology extends the pollution ontology to describe water-related pollution.
- It is further extended by regulation ontologies to provide regulation violation inference.
- This allows the reasoner to match specific regulations to the measurements that violate them.
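The matching of regulations to violating measurements can be pictured with a short Python sketch. In the system this inference is done by a reasoner over the regulation ontologies; the code below only mimics the outcome with a threshold comparison, and the regulation identifiers and limits are hypothetical.

```python
# Hypothetical regulation limits; real limits would come from the regulation ontologies.
REGULATIONS = [
    {"id": "reg-federal-arsenic", "contaminant": "Arsenic", "max_mg_per_l": 0.010},
    {"id": "reg-state-arsenic", "contaminant": "Arsenic", "max_mg_per_l": 0.005},
    {"id": "reg-federal-lead", "contaminant": "Lead", "max_mg_per_l": 0.015},
]


def violated_regulations(measurement, regulations):
    """Return the ids of regulations whose limit the measurement exceeds."""
    return [
        reg["id"]
        for reg in regulations
        if reg["contaminant"] == measurement["contaminant"]
        and measurement["value_mg_per_l"] > reg["max_mg_per_l"]
    ]


measurement = {"site": "S001", "contaminant": "Arsenic", "value_mg_per_l": 0.007}
violated = violated_regulations(measurement, REGULATIONS)
```

The same measurement can violate one jurisdiction's regulation and satisfy another's, which is why each regulation is matched separately.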

Connecting Ontologies
- The Extensible Observation Ontology (OBOE) [1] provides a common schema for sharing scientific observations.
- SemantEco provides a mapping ontology to recode from OBOE to the pollution ontology and vice versa, supporting interoperability with tools that support OBOE.
[1] J. Madin, S. Bowers, M. Schildhauer, S. Krivov, D. Pennington, and F. Villa, An ontology for describing and synthesizing ecological observation data, Ecological Informatics, vol. 2, no. 3, pp. 279-296, 2007.

Ontologies
- GeoSpecies [1] provides a semantic representation of species in the Linnaean taxonomy.
- Wildpro [2] provides semantic encodings of the relationships between species and the observable health effects of exposure to pollutants.
[1] GeoSpecies Knowledge Base. [Online]. Available: http://lod.geospecies.org/
[2] Wildpro: The electronic encyclopedia and library for wildlife. [Online]. Available: http://wildpro.twycrosszoo.org/

Architecture
- Data are converted from tabular formats into Resource Description Framework (RDF) using csv2rdf4lod [1].
- The converter produces RDF and provenance metadata that traces back to the data provider, and it includes an RDF representation of the processes used to translate the data from tabular form to structured form, for repeatability.
[1] T. Lebo and G. T. Williams, Converting governmental datasets into linked data, in Proceedings of the 6th International Conference on Semantic Systems, 2010, pp. 38:1-38:3.
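As a rough picture of what a tabular-to-RDF conversion with provenance produces, the following Python sketch emits N-Triples lines from CSV rows and attaches a source link to each row. It does not use or reproduce csv2rdf4lod: the `example.org` namespace, the row and vocabulary URI patterns, and the choice of `prov:wasDerivedFrom` as the provenance predicate are all assumptions made for this example.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for rows and columns
PROV_DERIVED = "http://www.w3.org/ns/prov#wasDerivedFrom"  # stand-in provenance predicate


def rows_to_ntriples(text, source_url):
    """Turn each CSV row into a subject with one literal triple per column,
    plus one triple pointing back at the file the row came from."""
    lines = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        subject = f"<{BASE}row/{i}>"
        for column, value in row.items():
            # Escape backslashes and quotes so the literal stays valid N-Triples.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}vocab/{column}> "{escaped}" .')
        lines.append(f"{subject} <{PROV_DERIVED}> <{source_url}> .")
    return lines


sample = "site,value\nA1,7.2\nB2,6.8\n"
triples = rows_to_ntriples(sample, "http://example.org/source/data.csv")
```

The sketch assumes column names are already safe to place in a URI; a real converter would also type the literals and record the conversion process itself.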

[Figure: architecture diagram]

SemantEco/SemantAqua
- Enable/empower citizens and scientists to explore pollution sites, facilities, regulations, and health impacts along with provenance.
- Demonstrates semantic monitoring possibilities.
- Map presentation of analysis.
- Explanations and provenance available.
- http://was.tw.rpi.edu/swqp/map.html and http://aquarius.tw.rpi.edu/projects/semantaqua
1. Map view of analyzed results
2. Explanation of pollution
3. Possible health effect of contaminant (from EPA)
4. Filtering by facet to select type of data
5. Link for reporting problems
6. Now joint with USGS resource managers; expanded to endangered species; now more virtual observatory style

[Figure: interface screenshots]

Performance
- Restructuring the data using OBOE required changing the structure of what pollution is in the ontology.
- Such restructuring can adversely affect reasoning performance, so a handful of counties were chosen and their data were analyzed under the new schema.
- We found up to a 25% increase in triples and an order-of-magnitude increase in some reasoning times.

Performance Tradeoffs
- Semantic engines were used for quick prototypes; however, scaling to 6 billion triples (to cover the entire country) caused performance problems.
- We began exploring performance enhancement strategies.
- Custom rules that match certain characteristics of the ontology with the instance data could improve performance over the DL scenario by considering only relevant properties.
- Obtained a 40x speedup in some cases.
[Figure: reasoning time (s) by county, Pellet vs. rules]
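The idea of "only considering relevant properties" can be sketched in Python as follows: triples whose predicate the rule never inspects are discarded before the rule is evaluated. This is an illustration of the general approach, not the project's rule set; the predicate names, the sample triples, and the limit table are hypothetical, and the sketch makes no claim about reproducing the reported speedup.

```python
from collections import defaultdict

# Hypothetical predicate names; only these matter to the violation rule.
RELEVANT = {"hasContaminant", "hasValue"}


def index_relevant(triples):
    """Group the relevant properties by subject and drop every other triple."""
    index = defaultdict(dict)
    for subject, predicate, obj in triples:
        if predicate in RELEVANT:
            index[subject][predicate] = obj
    return index


def find_violations(triples, limits):
    """Apply one hand-written rule: value above the contaminant's limit."""
    hits = []
    for subject, props in index_relevant(triples).items():
        contaminant = props.get("hasContaminant")
        value = props.get("hasValue")
        if contaminant in limits and value is not None and value > limits[contaminant]:
            hits.append(subject)
    return sorted(hits)


triples = [
    ("m1", "hasContaminant", "Arsenic"),
    ("m1", "hasValue", 0.02),
    ("m1", "recordedBy", "agency-x"),  # irrelevant to the rule, never indexed
    ("m2", "hasContaminant", "Arsenic"),
    ("m2", "hasValue", 0.001),
    ("m2", "hasComment", "routine sample"),
]
violating = find_violations(triples, {"Arsenic": 0.010})
```

A general-purpose description logic reasoner has to consider every axiom and property; a targeted rule like this one touches only the handful it needs.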

Current and Future Work
- Incorporating air quality:
  - Water quality is only one factor in the health of species
  - Air quality data are readily available through the EPA's CASTNET project, among others
- Connecting with CUAHSI via the DataONE project.
- Additional sources of water data from a coalition of non-governmental institutions, in particular the Santa Barbara Coastal (SBC) LTER.
- Further investigating reasoning tradeoffs to improve performance.

Conclusions
- Semantic technologies provide:
  - A systematic means of integrating sparse data from various sources
  - Rapid prototyping of extensible applications via shared schemas for describing data
- Enhancing interoperability using mapping ontologies can decrease reasoning performance.
- Performance can be improved by restricting rule application to the rules that are essential for answering user queries.

Conclusions, continued
- Integrating many different types of data into a single access point allows stakeholders to ask more complex questions, e.g.: "Find all water bodies and nearby facilities around Lake Michigan where Mallard Ducks have been observed in the past 3 years and where pollutants that lead to mercury toxicity have been observed in the water."
- Sources: water bodies from USGS; facilities from EPA; places from USGS Geonames; ducks from the species facet; time from the time facet; health impacts from Wildpro; contaminants from EPA or USGS; services to access the data.

Questions?
Acknowledgements: Support for the Tetherless World Constellation comes from Fujitsu, Lockheed Martin, LGS, Microsoft Research, and Qualcomm, in addition to sponsored research from DARPA, IARPA, NASA, NIST, NSF, and USGS. Evan Patton is supported by an NSF Graduate Research Fellowship.
Contact info: Evan Patton, pattoe@cs.rpi.edu; Deborah McGuinness, dlm@cs.rpi.edu

Extra slides

[Figure: interface screenshot]

Performance
- Using OBOE to encode the data in a more true-to-form representation increases the number of triples.
- The complexity required in the ontology to support the pollution reasoning caused execution time to increase 14- to 54-fold.

Performance
- Rewriting portions of the regulation ontologies as rules improves performance: a 2-10x improvement.
- Automated methods of simplifying ontologies into rulesets when certain semantic restrictions are used could improve other systems that use semantic technologies to integrate data.

Performance
[Figure: number of triples by county (Kent RI, Suffolk MA, Yates NY, San Francisco CA), with and without OBOE]

Performance
[Figure: reasoning time (s) by county (Kent RI, Suffolk MA, Yates NY, San Francisco CA), with and without OBOE]

Performance
[Figure: time (s) by county (Kent RI, Suffolk MA, Yates NY, San Francisco CA), Pellet vs. rules]