Implementing a Data Quality Strategy to simplify access to data

Similar documents
Implementing a Data Quality Strategy to simplify access to data

The NCI High Performance Computing (HPC) and High Performance Data (HPD) Platform to Support the Analysis of Petascale Environmental Data Collections

Clare Richards, Benjamin Evans, Kate Snow, Chris Allen, Jingbo Wang, Kelsey A Druken, Sean Pringle, Jon Smillie and Matt Nethery. nci.org.

The Changing Role of Data Stewardship in Creating Trustworthy, Transdisciplinary High Performance Data Platforms for the Future

Making data access easier with OPeNDAP. James Gallapher (OPeNDAP TM ) Duan Beckett (BoM) Kate Snow (NCI) Robert Davy (CSIRO) Adrian Burton (ARDC)

HDF Product Designer: A tool for building HDF5 containers with granule metadata

High Performance Data Efficient Interoperability for Scientific Data

Uniform Resource Locator Wide Area Network World Climate Research Programme Coupled Model Intercomparison

Scaling Weather Climate and Environmental Science Applications, and experiences with Intel Knights Landing

Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences

eresearch Collaboration across the Pacific:

CSIRO and the Open Data Cube

THE GEOSS PLATFORM TOWARDS A BIG EO DATA SYSTEM LINKING GLOBAL USERS AND DATA PROVIDERS

The Logical Data Store

Bruce Wright, John Ward, Malcolm Field, Met Office, United Kingdom

Ocean Color Data Formats and Conventions:

The Many Facets of THREDDS Thematic Real-time Environmental Distributed Data Services

Using the Network Common Data Form for storage of atmospheric data

The GeoPortal Cookbook Tutorial

Production Petascale Climate Data Replication at NCI Lustre and our engagement with the Earth Systems Grid Federation (ESGF)

IMOS/AODN ocean portal: tools for data delivery. Roger Proctor, Peter Blain, Sebastien Mancini IMOS

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

HDF Product Designer Documentation

GSKY: A scalable, distributed geospatial data-server

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Toward the Development of a Comprehensive Data & Information Management System for THORPEX

Technical documentation. SIOS Data Management Plan

OGC at KNMI: Current use and plans

Oceanic Observatory for the Iberian Shelf

Introduction to SDIs (Spatial Data Infrastructure)

Working with Scientific Data in ArcGIS Platform

GEOSS Data Management Principles: Importance and Implementation

GEOSS Common Infrastructure and the Big Data challenges

FAIR-aligned Scientific Repositories: Essential Infrastructure for Open and FAIR Data

SEXTANT 1. Purpose of the Application

GSCB-Workshop on Ground Segment Evolution Strategy

Lidar Radar Open Software Environment LROSE and the Python ARM Radar Toolkit Py-ART

INTAROS Integrated Arctic Observation System

THE ENVIRONMENTAL OBSERVATION WEB AND ITS SERVICE APPLICATIONS WITHIN THE FUTURE INTERNET Project introduction and technical foundations (I)

C3S Data Portal: Setting the scene

NetCDF-4: : Software Implementing an Enhanced Data Model for the Geosciences

Reproducibility and FAIR Data in the Earth and Space Sciences

IMPROVING THE INFRASTRUCTURE FOR EARTH OBSERVATION, ACTIONS AND UPDATES FROM THE SENTINEL ALPINE OBSERVATORY AT EURAC RESEARCH

SCSODC: Integrating Ocean Data for Visualization Sharing and Application

Distributed Online Data Access and Analysis

DataONE: Open Persistent Access to Earth Observational Data

Catalog-driven, Reproducible Workflows for Ocean Science

NeAT Business Plan Component Data Integration and Annotation Services in Biodiversity (DIAS-B) 1. Service Description

ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P.

Towards a pan-european infrastructure for marine and ocean data management + Importance of standards

Online intercomparison of models and observations using OGC and community standards

AWIPS Technology Infusion Darien Davis NOAA/OAR Forecast Systems Laboratory Systems Development Division April 12, 2005

Achieving Interoperability using the ArcGIS Platform. Satish Sankaran Roberto Lucchi

4 th Working Group on Geospatial Information

NetCDF and HDF5. NASA Earth Science Data Systems Working Group October 20, 2010 New Orleans. Ed Hartnett, Unidata/UCAR, 2010

Index Introduction Setting up an account Searching and accessing Download Advanced features

- C3Grid Stephan Kindermann, DKRZ. Martina Stockhause, MPI-M C3-Team

Python: Working with Multidimensional Scientific Data. Nawajish Noman Deng Ding

Data Centre NetCDF Implementation Pilot

Building a Global Data Federation for Climate Change Science The Earth System Grid (ESG) and International Partners

Deliverable 6.4. Initial Data Management Plan. RINGO (GA no ) PUBLIC; R. Readiness of ICOS for Necessities of integrated Global Observations

Data discovery and access via the SeaDataNet CDI system

European Marine Data Exchange

Leveraging metadata standards in ArcGIS to support Interoperability. David Danko and Aleta Vienneau

Desarrollo de una herramienta de visualización de datos oceanográficos: Modelos y Observaciones

Robin Wilson Director. Digital Identifiers Metadata Services

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

Ocean, Atmosphere & Climate Model Assessment for Everyone

Multi-disciplinary Interoperability: the EuroGEOSS Operating Capacities

PARR for the Course: GIS and Public Access to NOAA Fisheries Research Data

Scientific and Multidimensional Raster Support in ArcGIS

Data Curation Practices at the Oak Ridge National Laboratory Distributed Active Archive Center

Reducing Consumer Uncertainty

Developing data catalogue extensions for metadata harvesting in GIS

Compass INSPIRE Services. Compass INSPIRE Services. White Paper Compass Informatics Limited Block 8, Blackrock Business

How to use Water Data to Produce Knowledge: Data Sharing with the CUAHSI Water Data Center

LTC 2017 Practical lesson

The AusGIN Geoscience Portal:

Paving the Rocky Road Toward Open and FAIR in the Field Sciences

Copernicus Climate Change Service

ArcGIS 9.2 Works as a Complete System

HDF Product Designer Documentation

The GEO Discovery and Access Broker

FP7-INFRASTRUCTURES Grant Agreement no Scoping Study for a pan-european Geological Data Infrastructure D 4.4

Digital Earth Australia: First Look

Unidata and data-proximate analysis and visualization in the cloud

Portfolio of Services. NATIONAL COMPUTATIONAL Portfolio INFRASTRUCTURE

Interoperability in Science Data: Stories from the Trenches

The Future of ESGF. in the context of ENES Strategy

OPeNDAP: Accessing HYCOM (and other data) remotely

EUDAT. Towards a pan-european Collaborative Data Infrastructure

Fedora Commons: Taking on the Challenge of the Next Generation of Scholarly Communication

The CEDA Web Processing Service for rapid deployment of earth system data services

Data and information sharing WMO global systems

The Common Framework for Earth Observation Data. US Group on Earth Observations Data Management Working Group

e-infrastructure: objectives and strategy in FP7

Striving for efficiency

HDF Product Designer Documentation

The C3S Climate Data Store and its upcoming use by CAMS

Data discovery mechanisms and metadata handling in RAIA Coastal Observatory

Transcription:

IN43D-07 AGU Fall Meeting 2016 Implementing a Quality Strategy to simplify access to data Kelsey Druken, Claire Trenham, Ben Evans, Clare Richards, Jingbo Wang, & Lesley Wyborn National Computational Infrastructure, Canberra, Australia

NCI Australia NCI hosts one of Australia s largest repositories (10+ PBytes) of research data collections Spanning data collections from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences

NCI Australia Key to maximizing benefit of NCI s collections and computational capabilities: Ensuring seamless interoperable access to these datasets NCI Australia 2016 Kelsey.Druken@anu.edu.au

Quality Strategy (DQS) Key to maximizing benefit of NCI s collections and computational capabilities: Ensuring seamless interoperable access to these datasets NCI Australia 2016 Kelsey.Druken@anu.edu.au

The Goal Combining data Visualising How can we enable this type of easy access and use? NCI Australia 2016 Kelsey.Druken@anu.edu.au

Collections are being accessed and utilised from a broad range of options Direct access on system Web and data services portals Virtual labs (e.g., virtual desktops) How data collections are accessed ereefs online analysis portal

Key Challenges Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used Within these disciplines, data span a wide range of: - Gridded - Non-gridded (i.e., trajectories/pros, point data) - Coordinate reference projections - Resolutions

Motivation: Management Maturity Program Shelley Stall Assistant Director, Enterprise Management Program http://dataservices.agu.org/dmm/ DMM Capability 25 Processes to Perform, Manage, Define 1. Management Strategy Process Area 1. Management Strategy 2. Communications 3. Management Function 4. Grant Strategy/Business Case 5. Funding 2. Governance Process Area 6. Governance Management 7. Vocabulary/Glossary 8. Metadata Management 3. Quality Process Area 9. Quality Strategy 10. Profiling 11. Quality Assessment 12. Cleansing and Curation 4. Operations Process Area 13. Requirements Definition 14. Lifecycle Management 15. Contribution / Provider Management 5. Platform and Architecture Process Area 16. Architectural Standards 17. Architectural Approach 18. Management Platform 19. Integration / Linking 20. Archiving and Preservation 6. Infrastructure Support Practices 21. Measurement and Analysis 22. Process Management 23. Process Quality Assurance 24. Risk Management 25. Configuration Management

Quality Strategy (DQS) Quality Strategy (DQS): What does it involve? 1. Underlying High Performance (HPD) format 2. Close collaboration with data custodians and managers Planning, designing, and assessing the data collections 3. Quality control through compliance with recognised community standards 4. assurance through demonstrated functionality across common platforms, tools, and services

Quality Strategy (DQS) Quality Strategy (DQS): What does it involve? 1. Underlying High Performance (HPD) format 2. Close collaboration with data custodians and managers Planning, designing, and assessing the data collections 3. Quality control through compliance with recognised community standards 4. assurance through demonstrated functionality across common platforms, tools, and services

Where to start? 1. Climate/ESS Model Assets and Products 2. Earth and Marine Observations and Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Collections CMIP5, CORDEX, ACCESS Models Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR Digital Elevation, Bathymetry Onshore/Offshore Geophysics Seasonal Climate Bureau of Meteorology Observations Bureau of Meteorology Ocean-Marine Terrestrial Ecosystem Reanalysis products Approx. Capacity 5 Pbytes 2 Pbytes 1 Pbytes 700 Tbytes 350 Tbytes 350 Tbytes 290 Tbytes 100 Tbytes

NetCDF/HDF Common Formats 1. Climate/ESS Model Assets and Products 2. Earth and Marine Observations and Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Collections CMIP5, CORDEX, ACCESS Models Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR Digital Elevation, Bathymetry Onshore/Offshore Geophysics Seasonal Climate Bureau of Meteorology Observations Bureau of Meteorology Ocean-Marine Terrestrial Ecosystem Reanalysis products NetCDF common data format Approx. Capacity 5 Pbytes 2 Pbytes 1 Pbytes 700 Tbytes 350 Tbytes 350 Tbytes 290 Tbytes 100 Tbytes

National Environmental Research Interoperability Platform (NERDIP) Biodiversity & Climate Change VL Climate & Weather Science Lab emast Speddexes ereefs AGDC VL All Sky Virtual Observatory VGL Globe Claritas VHIRL Open Nav Surface Workflow Engines, Virtual Laboratories (VL s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS Models Fortran, C, C++, MPI, OpenMP Python, R, MatLab, IDL Visualisation Drishti ANDS/RDA AODN/IMOS TERN AuScope. gov.au Digital Bathymetry & Elevation Tools s National Environmental Research Interoperability Platform (NERDIP) Services Layer Direct Access Fast whole-of-library catalogue CS-W RDF, LD WMS WFS WCS W*PS SWE W*TS Open DAP Vocab Service PROV Service Conventions netcdf-cf ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL API Layers Climate Weather Oceans Bathy EO [HDF4- EOS] [Airborne Geophysics] [SEG-Y] [FITS] [LAS LiDAR] HP Library Layer HDF5 Lustre HDF5?? Other Storage (e.g., HDFS)

National Environmental Research Interoperability Platform (NERDIP) Biodiversity & Climate Change VL Climate & Weather Science Lab emast Speddexes ereefs AGDC VL All Sky Virtual Observatory VGL Globe Claritas VHIRL Open Nav Surface Workflow Engines, Virtual Laboratories (VL s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS Models Fortran, C, C++, MPI, OpenMP Python, R, MatLab, IDL Visualisation Drishti ANDS/RDA AODN/IMOS TERN AuScope. gov.au Digital Bathymetry & Elevation Tools s National Environmental Research Interoperability Platform (NERDIP) Services Layer Direct Access Fast whole-of-library catalogue CS-W RDF, LD WMS WFS WCS W*PS SWE W*TS Open DAP Vocab Service PROV Service Conventions netcdf-cf ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL API Layers Climate Weather Oceans Bathy EO [HDF4- EOS] [Airborne Geophysics] [SEG-Y] [FITS] [LAS LiDAR] HP Library Layer HDF5 Lustre HDF5?? Other Storage (e.g., HDFS)

National Environmental Research Interoperability Platform (NERDIP) Biodiversity & Climate Change VL Climate & Weather Science Lab emast Speddexes ereefs AGDC VL All Sky Virtual Observatory VGL Globe Claritas VHIRL Open Nav Surface Workflow Engines, Virtual Laboratories (VL s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS Models Fortran, C, C++, MPI, OpenMP Python, R, MatLab, IDL Visualisation Drishti ANDS/RDA AODN/IMOS TERN AuScope. gov.au Digital Bathymetry & Elevation Tools s National Environmental Research Interoperability Platform (NERDIP) Services Layer Direct Access Fast whole-of-library catalogue CS-W RDF, LD WMS WFS WCS W*PS SWE W*TS Open DAP Vocab Service PROV Service Conventions netcdf-cf ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL API Layers Climate Weather Oceans Bathy EO [HDF4- EOS] [Airborne Geophysics] [SEG-Y] [FITS] [LAS LiDAR] HP Library Layer HDF5 Lustre HDF5?? Other Storage (e.g., HDFS)

National Environmental Research Interoperability Platform (NERDIP) Biodiversity & Climate Change VL Climate & Weather Science Lab emast Speddexes ereefs AGDC VL All Sky Virtual Observatory VGL Globe Claritas VHIRL Open Nav Surface Workflow Engines, Virtual Laboratories (VL s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS Models Fortran, C, C++, MPI, OpenMP Python, R, MatLab, IDL Visualisation Drishti ANDS/RDA AODN/IMOS TERN AuScope. gov.au Digital Bathymetry & Elevation Tools s National Environmental Research Interoperability Platform (NERDIP) Services Layer Direct Access Fast whole-of-library catalogue CS-W RDF, LD WMS WFS WCS W*PS SWE W*TS Open DAP Vocab Service PROV Service Conventions netcdf-cf ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL API Layers Climate Weather Oceans Bathy EO [HDF4- EOS] [Airborne Geophysics] [SEG-Y] [FITS] [LAS LiDAR] HP Library Layer HDF5 Lustre HDF5?? Other Storage (e.g., HDFS)

National Environmental Research Interoperability Platform (NERDIP) Biodiversity & Climate Change VL Climate & Weather Science Lab emast Speddexes ereefs AGDC VL All Sky Virtual Observatory VGL Globe Claritas VHIRL Open Nav Surface Workflow Engines, Virtual Laboratories (VL s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS Models Fortran, C, C++, MPI, OpenMP Python, R, MatLab, IDL Visualisation Drishti ANDS/RDA AODN/IMOS TERN AuScope. gov.au Digital Bathymetry & Elevation Tools s National Environmental Research Interoperability Platform (NERDIP) Services Layer Direct Access Fast whole-of-library catalogue CS-W RDF, LD WMS WFS WCS W*PS SWE W*TS Open DAP Vocab Service PROV Service Conventions netcdf-cf ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL API Layers Climate Weather Oceans Bathy EO [HDF4- EOS] [Airborne Geophysics] [SEG-Y] [FITS] [LAS LiDAR] HP Library Layer HDF5 Lustre HDF5?? Other Storage (e.g., HDFS)

Quality Strategy (DQS) Quality Strategy (DQS): What does it involve? 1. Underlying High Performance (HPD) format 2. Close collaboration with data custodians and managers Planning, designing, and assessing the data collections 3. Quality control through compliance with recognised community standards 4. assurance through demonstrated functionality across common platforms, tools, and services

Quality Strategy (DQS) Quality Strategy (DQS): What does it involve? 1. Underlying High Performance (HPD) format 2. Close collaboration with data custodians and managers Planning, designing, and assessing the data collections 3. Quality control through compliance with recognised community standards 4. assurance through demonstrated functionality across common platforms, tools, and services

Many levels of metadata Collection & dataset-levels (e.g., parent-child metadata) Collection ISO-19115, ANZLIC, etc. sets sets sets File (granule)-level Contains 2 types of metadata: (1) Variable-level (CF-Convention) (2) Global -level (ACDD**) **Can link to collection/dataset metadata File-level Variablelevel(s)

NCI s Current NetCDF Holdings FORMAT By collection

NCI s Current NetCDF Holdings The motivation: reduce none FORMAT By collection

Existing Community Standards CF Conventions Climate and Forecast Conventions and Metadata: http://cfconventions.org/ ACDD Attribute Convention for Discovery: http://wiki.esipfed.org/index.php/attribute_convention_for Discovery Together, these two standards define several categories of metadata ensuring: Usage, discoverability, and understanding of the data contents

Compliance checker Want to adopt or utilise existing community checkers if possible Two main options: UK Reading (CF-Convention website links to this one) IOOS (growing fast, designed to be modified and extended) Our own modifications Needed our own wrapper to enable collection-level scans Tailor our output and reporting

Compliance checker Summarised version on the compliance status. The break down compliance scores and also measure of consistency across the collection Providing attack plan for improvements: Make it easy for data managers to efficiently address and meet baseline compliance

The result: win-win for all Quality Strategy In Action Progressive improvement in the quality of the data across the different subject domains Improves the ease by which users can access, utilise and combine the datasets from across NCI's holdings

Quality Strategy (DQS) Quality Strategy (DQS): What does it involve? 1. Underlying High Performance (HPD) format 2. Close collaboration with data custodians and managers Planning, designing, and assessing the data collections 3. Quality control through compliance with recognised community standards 4. assurance through demonstrated functionality across common platforms, tools, and services

Functionality tests Extend to test usability across wide spectrum of scientific tools and data services Commonly used libraries (e.g., netcdf, HDF, GDAL, etc.) Accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer) Validation against scientific analysis and programming platforms (e.g., Python, Matlab, R, QGIS) Visualization tools (e.g., ParaView, IDV, WMS-viewers)

Functionality tests Primary motivation: Positive experience for our users. Expectation that advertised collections and services are usable.

Bonus results Bonus results: Feedback to the local and international communities The more we test and test, the more we learn Functionality tests lead to reference and training material for our user community

Bonus: User reference material NCI Australia 2016 Kelsey.Druken@anu.edu.au

Bonus results Bonus results: Feedback to the local and international communities The more we test and test, the more we learn Functionality tests lead to reference and training material for our user community Benefits of standardised and interoperable data formats

Summary/Future Work What s next? Automating and extending these measures and tests across our full collection What about the broader formats? Staying connected and working with international communities E.g., NSF Funded Advancing netcdf-cf for the Geoscience Community (EarthCube) https://www.earthcube.org/content/advancing-netcdf-cf-geoscience-community