Data Exchange in the Earth Sciences
Perspective of a multidisciplinary data facility
Kerstin Lehnert, Columbia University, lehnert@ldeo.columbia.edu
[Diagram. Stakeholders: Publishers/Journals, Funders, Researchers, Repositories. Labels: Open Data, Access to Data, Transparency & Reproducibility, New science, Return on Investment, Credit]
FAIR Data
- Findable: easy to find by both humans and computer systems, based on mandatory descriptions (metadata) that allow discovery
- Accessible: stored for the long term such that they can be easily accessed and/or downloaded, with well-defined license and access conditions (Open Access when possible), whether at the level of metadata or at the level of the actual data content
- Interoperable: ready to be combined with other datasets by humans as well as computer systems
- Reusable: ready to be used for future research and to be processed further using computational methods
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018, doi: 10.1038/sdata.2016.18 (2016).
Topics
- The Research Data Lifecycle
- Enhance re-usability: provenance, integration
- Enhance access: linking data & literature
[Figure: FAIR / FARI. Data levels along a "Value" axis: findable (identification, persistence); accessible (protection, protocols); re-usable (context, provenance); integrated / interoperable (harmonized, machine-readable). Other labels: Data Curation Standards, Domain-specific Data Standards, Domain Repositories]
Data Facilities: Objectives
- Advance & accelerate scientific discoveries
- Easy & fast access to data
- Comprehensive knowledge base
- Integration of data for interdisciplinary research
- Ensure quality of data and trust in scientific results
- Preserve irreproducible observations & results
- Support research ethics
IEDA: A Multi-Disciplinary Data Facility
- geochemistry, marine geophysics, marine geology, geochronology, and more
- sensor data and sample-based observations & experiments
- field data, lab data, processed data, samples
- gridded data, point data, time-series data, maps, photos, and more
- long-tail to big data
[Figure labels: Geochemistry, Marine Geophysics, Samples, Antarctic Science]
IEDA Services
- Trusted repository services
- Ensure citability (DOI registration)
- Long-term preservation
- Links to publications
- Domain-specific data curation
- QA/QC
- Metadata standards development
- Science-specific user interfaces & software tools
- Data products (synthesis)
- Cross-disciplinary data access
- Programmatic, standards-based interfaces
- User support & training
www.iedadata.org
IEDA Repositories
[Diagram labels: Investigators, Data, Metadata Catalog, EarthChem Data Managers]
FINDABLE & ACCESSIBLE
- DOI registration
- Long-term archiving
- CC license
- Guidelines for data reporting
- Provenance metadata
- Formats
- QC by data managers
Guidance for Investigators
QA/QC: Data Reporting Standards
Accessible in the EarthChem Library
DOI to allow proper citation of data
- Link to publications
- Link to funding source
Publishers' Concern: Reproducibility
M. McNutt, K. Lehnert, B. Hanson, B. A. Nosek, A. M. Ellison, J. L. King. SCIENCE Policy Forum, 04 MAR 2016
Recent Alignment by Publishers, Repositories, and Funders Around Data
- TOP (Transparency and Openness Promotion guidelines): 538 journals
- COPDESS.org (Coalition on Publishing Data in the Earth and Space Sciences): Statement of Commitment endorsed by most publishers and repositories in the Earth and space sciences
- Joint Declaration of Data Citation Principles: endorsed by 109 organizations, including most major publishers
- Reproducibility conferences and outcomes (AAAS and other orgs)
- Certification standards for repositories (WDS, DSA, ISO)
TOP's 8 Standards
- Data citation
- Design transparency
- Research materials transparency
- Data transparency
- Analytic methods (code) transparency
- Preregistration of studies
- Preregistration of analysis plans
- Replication
3 Tiers: Disclose, Require, Verify
https://www.force11.org/group/joint-declaration-data-citation-principles-final
Certification standards for repositories
From Helen Glaves and Gary Baker
Coalition on Publishing Data in the Earth and Space Sciences (COPDESS.org)
Connecting Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
- Formed in October 2014
- Endorsed a Statement of Commitment, 2015
COPDESS Goals
- Consistent policies across publishers/journals
- Increase development and enforcement of data best practices
- Reduce effort of metadata QC
- Increase flow of small data into repositories
www.copdess.org
New Publishing Paradigms
Future of Linking Literature & Data
http://dliservice.research-infrastructures.eu/
http://www.scholix.org
The Challenge of Data Integration
System | Metadata Standard | Database Schema | Data Types
MGDL | DataCite, ISO 19115-2, DIF, SeismicXML | expedition-based | Multidisciplinary marine geoscience sensor field data (e.g. towed bathymetry, side scan sonar, controlled source seismic, seafloor photos and photomosaics). Derived data products (e.g. DEMs, microseismicity catalogs, magnetic and gravity gridded compilations, geologic interpretations). Diverse complementary sensor data (e.g. temperature and chemical probes, optical backscatter)
ECL | DataCite, Dublin Core, ISO 19115-2 | dataset-based | Geochemical, geochronological, petrographic, petrological datasets, code for geochemical data reduction and modelling
USAP-DC | DataCite, DIF | NSF award-based | Multi-disciplinary data types spanning Antarctic research (e.g. volcano observatory video, penguin counts, paleo-geologic maps, meteorological model outputs)
ASP@UTIG | DataCite, SeismicXML | expedition-based | Controlled source seismic field (legacy) and processed data
SESAR | IGSN Description Metadata | Custom (sample-based) | Sample metadata for Earth science samples (e.g. rock, fossil, sediment, soil, fluid)
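One common way to bridge systems that use different metadata standards is a field-level crosswalk. The sketch below is only an illustration: the key names are simplified stand-ins for DataCite and Dublin Core elements, and the record, the DOI placeholder, and the `expedition` field are hypothetical, not taken from any IEDA system.

```python
# Simplified, assumed mapping from DataCite-style keys to Dublin Core-style keys.
# Real crosswalks cover many more elements and nested structures.
DATACITE_TO_DC = {
    "identifier": "identifier",
    "creators": "creator",
    "titles": "title",
    "publisher": "publisher",
    "publicationYear": "date",
}


def crosswalk(record, mapping):
    """Translate a flat metadata record; return (translated, unmapped_keys)."""
    translated = {}
    unmapped = []
    for key, value in record.items():
        if key in mapping:
            translated[mapping[key]] = value
        else:
            # Keep track of fields that have no target in the other standard.
            unmapped.append(key)
    return translated, sorted(unmapped)


# Hypothetical record; the identifier is a placeholder, not a real DOI.
record = {
    "identifier": "doi:10.0000/example",
    "creators": ["Doe, J."],
    "titles": ["Example dataset"],
    "publisher": "Example Repository",
    "publicationYear": 2016,
    "expedition": "X1",  # system-specific field with no target in this mapping
}

dc_record, unmapped = crosswalk(record, DATACITE_TO_DC)
print(dc_record)
print("unmapped:", unmapped)
```

Reporting the unmapped keys makes visible where an expedition-based or sample-based schema carries information that a generic standard does not capture.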
Toward Integration of Systems
EarthCube GeoLink Building Block
Find and integrate resources across repositories via Semantic Web (Linked Data):
Awards, Expeditions, Datasets, Documents, Features, Instruments, Measurements, Organizations, Papers, Persons, Platforms, Presentations, Programs, Repositories, Samples
Partners: (more)
Slide courtesy of R. Arko, Columbia University
GeoLink + IGSNs
GeoLink design
1. Shared Ontology
   - classes + properties that describe resources, sufficient for discovery
2. Linked Data
   - Resources are on the Web, open license
   - Resources are structured and use non-proprietary formats and languages (RDF, SPARQL)
   - Resources have HTTP URIs
   - Resources are linked to other Resources
3. Canonical Resources
   - agreement on reference set of resources to anchor mappings
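The Linked Data idea can be sketched without any RDF library: resources are named by HTTP URIs and connected by subject-predicate-object triples, and discovery means following those links. Everything below is an assumption made for illustration: the `example.org` URIs, the predicate names, and the helper functions are hypothetical and are not part of the GeoLink ontology. A real deployment would store RDF and query it with SPARQL.

```python
# Hypothetical HTTP URIs; none of these identify real GeoLink resources.
EX = "http://example.org/"

# Each triple is (subject, predicate, object), the basic unit of RDF.
TRIPLES = [
    (EX + "expedition/X1", EX + "vocab/fundedBy", EX + "award/A1"),
    (EX + "dataset/D1", EX + "vocab/collectedOn", EX + "expedition/X1"),
    (EX + "dataset/D2", EX + "vocab/collectedOn", EX + "expedition/X1"),
    (EX + "sample/S1", EX + "vocab/describedIn", EX + "dataset/D1"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def datasets_for_award(triples, award):
    """Follow links from an award to its expeditions, then to their datasets."""
    expeditions = [s for s, _, _ in match(triples, p=EX + "vocab/fundedBy", o=award)]
    found = []
    for expedition in expeditions:
        found += [s for s, _, _ in match(triples, p=EX + "vocab/collectedOn", o=expedition)]
    return sorted(found)


print(datasets_for_award(TRIPLES, EX + "award/A1"))
```

Because every resource has its own URI, the datasets in this sketch could live in different repositories and still be reached from the same award.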
Example: Linked Resources
[Diagram nodes: ORCID, Expedition @R2R, Award @NSF, Paper, Dataset @ECL, Dataset @BCO-DMO, Samples @SESAR]
Data Synthesis: Forming a picture
Concept: Data Mining
- Measurements stored in relational database
- Linked to context & provenance metadata for search and filter
- Harmonized and quality controlled data & metadata
- Data output allows immediate data analysis
- Users generate new data compilations across any number of data sets
- Downloadable spreadsheets with all context & provenance information
- Use of unique sample ID links data for individual samples across datasets to support further exploration
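The concept above can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions: the table layout, column names, sample IDs, citations, and values are all made up, and it does not reproduce the schema of PetDB or any other IEDA database.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (sample_id TEXT PRIMARY KEY, rock_type TEXT);
CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, citation TEXT);
CREATE TABLE measurement (
    sample_id  TEXT REFERENCES sample(sample_id),
    dataset_id INTEGER REFERENCES dataset(dataset_id),
    variable TEXT, value REAL, unit TEXT, method TEXT
);
""")

# Made-up rows: sample S-001 is reported in two different datasets.
conn.executemany("INSERT INTO sample VALUES (?, ?)",
                 [("S-001", "basalt"), ("S-002", "peridotite")])
conn.executemany("INSERT INTO dataset VALUES (?, ?)",
                 [(1, "Hypothetical et al. 2001"), (2, "Placeholder and Example 2010")])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?)", [
    ("S-001", 1, "SiO2", 50.1, "wt%", "XRF"),
    ("S-001", 2, "Sr", 120.0, "ppm", "ICP-MS"),
    ("S-002", 1, "MgO", 40.2, "wt%", "XRF"),
])

COLUMNS = ["sample_id", "variable", "value", "unit", "method", "citation"]


def compile_by_rock_type(conn, rock_type):
    """Filter on sample context and join values with their provenance."""
    query = """
        SELECT s.sample_id, m.variable, m.value, m.unit, m.method, d.citation
        FROM measurement AS m
        JOIN sample AS s ON s.sample_id = m.sample_id
        JOIN dataset AS d ON d.dataset_id = m.dataset_id
        WHERE s.rock_type = ?
        ORDER BY s.sample_id, m.variable
    """
    return conn.execute(query, (rock_type,)).fetchall()


def to_csv(rows):
    """Serialize a compilation as spreadsheet-ready CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


rows = compile_by_rock_type(conn, "basalt")
print(to_csv(rows))
```

The shared `sample_id` is what lets values from two separate datasets end up side by side in one compilation, each row still carrying its method and source citation.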
Data Synthesis: PetDB
Global compilation of geochemical data for igneous rocks from the ocean floor & mantle xenoliths
- > 2,280 data sets/publications
- > 87,600 samples
- > 3.3 million observed values
http://www.earthchem.org/petdb
Syntactic: ODM2
ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky
ODM2 Benefits
- Integration of different observational data types: specimen-based single observations, time-series, arrays
- Alignment with OGC Conceptual Data Model Observations & Measurements (ISO 19156:2011)
- Comprehensive capture of provenance (but not aligned with W3C PROV)
- Controlled vocabularies
- Interoperability at the level of sampling features
- Web services (OGC standards), XML (e.g. GeoSciML)
Interoperability with Library of Experimental Phase Relations
Interoperability with Modeling Tools
Data Synthesis: A (too?) BIG Effort
Transfer of data and metadata from publications into databases by data curators:
- Time consuming
- Requires deep understanding of data
- Lack of recognition for the job
- Partial automation possible: currently not satisfactory, expensive development effort
Automation: Example
Best Reference: DeepDive: A Data Management System for Automatic Knowledge Base Construction. Ce Zhang. Ph.D. Dissertation, University of Wisconsin-Madison, 2015.
Integrating Heterogeneous Data
You think it's easy?
Take home messages
- Relevance of standards
- Community best practices:
  - Data type specific documentation of provenance
  - Data exchange protocols & APIs
- Repository best practices
- Benefits of collaboration & participation in research data organizations (RDA, WDS, etc.)
- Connections to publishers & editors
- Difficulty of migrating older (legacy) systems