State of the Art in Ethno/ Scientific Data Management

Similar documents
State of the Art in Data Citation

Implementing the RDA Data Citation Recommendations for Long Tail Research Data. Stefan Pröll

Approaches to Making Dynamic Data Citeable Recommendations of the RDA Working Group Andreas Rauber

COALITION ON PUBLISHING DATA IN THE EARTH AND SPACE SCIENCES: A MODEL TO ADVANCE LEADING DATA PRACTICES IN SCHOLARLY PUBLISHING. Source: NSF.

Reproducibility and FAIR Data in the Earth and Space Sciences

FAIR-aligned Scientific Repositories: Essential Infrastructure for Open and FAIR Data

Services to Make Sense of Data. Patricia Cruse, Executive Director, DataCite Council of Science Editors San Diego May 2017

ISMTE Best Practices Around Data for Journals, and How to Follow Them" Brooks Hanson Director, Publications, AGU

CODATA: Data Citation Workshop Perspectives from Editors and Publishers. Brooks Hanson Director, Publications, AGU

Making Sense of Data: What You Need to know about Persistent Identifiers, Best Practices, and Funder Requirements

Persistent Identifier the data publishing perspective. Sünje Dallmeier-Tiessen, CERN 1

Adoption of Data Citation Outcomes by BCO-DMO

Data Curation Practices at the Oak Ridge National Laboratory Distributed Active Archive Center

Checklist and guidance for a Data Management Plan, v1.0

National Snow and Ice Data Center. Plan for Reassessing the Levels of Service for Data at the NSIDC DAAC

National Snow and Ice Data Center. Plan for Reassessing the Levels of Service for Data at the NSIDC DAAC

Data Citation and Scholarship

FREYA Connected Open Identifiers for Discovery, Access and Use of Research Resources

Improving a Trustworthy Data Repository with ISO 16363

Data Citation Then and Now

GEOSS Data Management Principles: Importance and Implementation

Mercè Crosas, Ph.D. Chief Data Science and Technology Officer Institute for Quantitative Social Science (IQSS) Harvard

Indiana University Research Technology and the Research Data Alliance

Data Citation. DataONE Community Engagement & Outreach Working Group

For Attribution: Developing Data Attribution and Citation Practices and Standards

DOI for Astronomical Data Centers: ESO. Hainaut, Bordelon, Grothkopf, Fourniol, Micol, Retzlaff, Sterzik, Stoehr [ESO] Enke, Riebe [AIP]

LIBER Webinar: A Data Citation Roadmap for Scholarly Data Repositories

PDS, DOIs, and the Literature. Anne Raugh, University of Maryland Edwin Henneken, Harvard-Smithsonian Center for Astrophysics

Persistent Identifiers for Earth Science Provenance

Dataverse and DataTags

Research Elsevier

Paving the Rocky Road Toward Open and FAIR in the Field Sciences

EUDAT- Towards a Global Collaborative Data Infrastructure

DATA SHARING FOR BETTER SCIENCE

PERSISTENT IDENTIFIERS FOR THE UK: SOCIAL AND ECONOMIC DATA

Data Citation Working Group P7 March 2 nd 2016, Tokyo

A Data Citation Roadmap for Scholarly Data Repositories

The Role of Repositories and Journals in the Astronomy Research Lifecycle

5/16/2018. Researcher Challenges with Data Use. AGU s position statement on data affirms that

Data Exchange in the Earth Sciences

Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

Reflections on Three Decades in Internet Time

Enabling Open Science: Data Discoverability, Access and Use. Jo McEntyre Head of Literature Services

RADAR A Repository for Long Tail Data

DOIs for Research Data

THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel

Data Citation. Mark Parsons, Ruth Duerr and the Federation of Earth Science Information Partners (ESIP)

CODE AND DATA MANAGEMENT. Toni Rosati Lynn Yarmey

Applying Archival Science to Digital Curation: Advocacy for the Archivist s Role in Implementing and Managing Trusted Digital Repositories

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

DATAVERSE FOR JOURNALS

Data Curation Handbook Steps

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

How to make your data open

Executive Committee Meeting

Data Discovery - Introduction

RN Workshop Series on Innovations in Scholarly Communication: plementing the Benefits of OAI (OAI3)

Developing a Research Data Policy

A Brief Introduction to the Data Curation Profiles

Data publication and discovery with Globus

Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)

Deliverable 6.4. Initial Data Management Plan. RINGO (GA no ) PUBLIC; R. Readiness of ICOS for Necessities of integrated Global Observations

EUDAT Common data infrastructure

Towards a joint service catalogue for e-infrastructure services

Robin Wilson Director. Digital Identifiers Metadata Services

Data Curation Profile Human Genomics

Flexible Framework for Mining Meteorological Data

INTEGRATION OPTIONS FOR PERSISTENT IDENTIFIERS IN OSGEO PROJECT REPOSITORIES

Writing a Data Management Plan A guide for the perplexed

Why CERIF? Keith G Jeffery Scientific Coordinator ERCIM Anne Assserson eurocris. Keith G Jeffery SDSVoc Workshop Amsterdam

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

Data Archival and Dissemination Tools to Support Your Research, Management, and Education

The State of Arctic Data the IPY experience

Data Curation Profile Plant Genomics

Description Cross-domain Task Force Research Design Statement

The Research Data Alliance Creating the culture and technology for an international data infrastructure

University at Buffalo's NEES Equipment Site. Data Management. Jason P. Hanley IT Services Manager

From Sensor to Archive: Observational Data Services at NCAR s Earth Observing Laboratory

Scalable Data Citation in Dynamic, Large Databases: Model and Reference Implementation

The International Journal of Digital Curation Issue 1, Volume

Linking data and publications the past, present, and future. Dr. Hylke Koers, Head of Content Innovation, Elsevier

Data Management Plans. Sarah Jones Digital Curation Centre, Glasgow

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

Data Management Checklist

INTAROS Integrated Arctic Observation System

The Materials Data Facility

GSAW The Earth Observing System (EOS) Ground System: Leveraging an Existing Operational Ground System Infrastructure to Support New Missions

Data Curation Profile Movement of Proteins

Welcome to the Pure International Conference. Jill Lindmeier HR, Brand and Event Manager Oct 31, 2018

NISO STS (Standards Tag Suite) Differences Between ISO STS 1.1 and NISO STS 1.0. Version 1 October 2017

Software + Services for Data Storage, Management, Discovery, and Re-Use

DIAS_Satellite_MODIS_SurfaceReflectance dataset

How to share research data

Big Data infrastructure and tools in libraries

Data Publication. HGF Alliance: Remote Sensing and Earth System Dynamics

Roy Lowry, Gwen Moncoiffe and Adam Leadbetter (BODC) Cathy Norton and Lisa Raymond (MBLWHOI Library) Ed Urban (SCOR) Peter Pissierssens (IODE Project

Enabling Precise Identification and Citabilityof Dynamic Data. Recommendations of the RDA Working Group on Data Citation

Session Two: OAIS Model & Digital Curation Lifecycle Model

Carelyn Campbell, Ben Blaiszik, Laura Bartolo. November 1, 2016

Transcription:

State of the Art in Ethno/ Scientific Data Management Ruth Duerr This work is licensed under a Creative Commons Attribution v4.0 License.

Overview Data management layers Data lifecycle Levels of data Citation and the data landscape *

Renear, A. H., Sacchi, S., & Wickett, K. M. (2010). Definitions of dataset in the scientific and technical literature. Proceedings of the American Society for Information Science and Technology, 47(1), 1 4.

Data Management Layers Layers Examples Implication for PI Curation Preservation Future JHU Data Archive and other DCS instances, NSIDC Sea Ice Index, ICPSR comparison sets JHU Data Archive Portico ICPSR New query capabilities Cross-disciplinary re-use possible Ability to use own data in the future (e.g. 5 yrs) Data sharing Archiving Storage CUAHSI FigShare Dataverse Server in Lab, Project Website, Amazon S3 System provides identifiers for sharing, references, fixity, backups, etc. Responsible for: Restore Sharing Staffing

Data Management Layers Layers Examples Implication for PI Curation NSIDC Sea Ice Index, ICPSR comparison sets, Protein Data Bank New query capabilities Cross-disciplinary re-use Preservation JHU Data Archive Portico ICPSR Ability to use own data in the future (e.g. 5 yrs) Data sharing Archiving Storage CUAHSI FigShare Dataverse Server in Lab, Project Website, Amazon S3 System provides identifiers for sharing, references, fixity, backups, etc. Responsible for: Restore Sharing Staffing

The Remote Sensing Record Satellite-based Passive Microwave sensors have been measuring sea ice since 1972 Consistent collection of data started in 1978 with the SMMR series of instruments Why passive microwave? Distinguishing sea ice from ocean is straightforward Passive microwave works through clouds and in the dark Initial user base was cryospheric scientists Arctic sea ice concentration in April 2004, calculated from data measured by the Special Sensor Microwave/Imager (SSM/I) on the Defense Meteorological Satellite Program (DMSP) satellite. The image is centered over the North Pole, with continents shown in green. - Image courtesy of Florence Fetterer and Ken Knowles, National Snow and Ice Data Center, University of Colorado, Boulder, CO. Evolution of sea ice data products at NSIDC, presented by Ruth Duerr March 10, 2015, RDA 5th Plenary, San Diego

Evolution of sea ice data products at NSIDC, presented by Ruth Duerr March 10, 2015, RDA 5th Plenary, San Diego

I2S2 Partners, Idealized Scientific Research Activity Lifecycle Model [2011]

Levels of data There may have been several steps in processing data to create the data products that are actually used by a particular community In many cases the "raw" data may only be useful to specialists (e.g., algorithm developers) Typically, more processed data are useful to different and often broader audiences NASA defined processing levels for their satellite data decades ago Now other fields are beginning to think about reusing this concept for their own types of data

Data levels for NASA data vs textual criticism Level NASA Text Criticism 0 unprocessed instrument data at full resolutions. 1 unprocessed instrument data at full resolution, time referenced, and annotated with ancillary information, including radiometric and geometric calibration coefficients and georeferencing parameters (i.e., platform ephemeris) computed and appended but not applied to the Level 0 data. 2 Derived environmental variables (e.g., ocean wave height, soil moisture, ice concentration) at the same resolution and location as the Level 1 source data. 3 Variables mapped on uniform spacetime grid scales, usually with some completeness and consistency properties (e.g., missing points interpolated, complete regions mosaicked together from multiple orbits)." Unprocessed text images Unprocessed text images, annotated with the identification of the hardware and software used, any configuration or calibration information, the time and place of the scanning, and organization or persons conducting the imaging, and a [nondescriptive] identification of the object imaged. Derived representation of text content and structure, mapped to locations in the Level 1 source data Representation of textual content and structure mapped on to and expansion of abbreviations, interpolation of missing text. (perhaps multiple) carriers with described structure (e.g, physical bibliography), Renear, A. H., Dolan, M., Trainor, K. and Cragin, M. H. (2009), Towards a cross-disciplinary notion of data level in data curation. Proc. Am. Soc. Info. Sci. Tech., 46: 1 8. doi: 10.1002/meet.2009.14504603103

A brief history of data citation Data citation used to be common practice What!!?

A brief history of data citation Data was in the literature! In Books and Technical Reports

A brief history of data citation Data was in the literature! and Journals

A brief history of data citation That started changing with the advent of digital data At first because the publications were still paper Why would you want to make your data less accessible to the computers needed to analyze it? Now how do you represent a multi-dimensional data set in a twodimensional medium? Later because often the data was voluminous

A brief history of data citation Digital data repositories came into being in the final decades of the 20 th century Many collocated with existing data centers (e.g., World Data Centers set up during the International Geophysical Year 1957/8) Many have been promoting data citation for decades

A brief history of data citation By 2013 many groups had been working on data citation guidelines and principles for many years Adapted from a slide by Maryann Martone

A brief history of data citation Photo: Flickr Paul Uhlir...a plea to come together

Joint Declaration of Data Citation Principles Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications. Credit and Attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data. Evidence: In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited. Unique Identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community. Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data. Persistence: Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe. Specificity and Verifiability: Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited. Interoperability and flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities. Data Citation Synthesis Group. (2014). Joint Declaration of Data Citation Principles. http://force11.org/datacitation

Data Citation Implementer's Group Work in 4 areas: NISO JATS. Identifiers and associated metadata. Common repository interfaces. Putting together and analyzing some exemplar journal workflows with suggestions on how the editorial process can deal with data citations, to provide context and analysis of commonality for the other tasks.

Data Citation in the NISO-JATS DTD NISO-JATS is an open standard for representing full text articles in XML Used widely, but not limited to, in life sciences. Technical Workshop: June 2014, London 18 (publishers, JATS users, and JATS committee reps) Workshop Goals JATS recommendations to support structured data citations according to the F11 Data Citation Principles Decide adoption and implementation strategy by publishers

Implications of NISO-JATS support for data citation Enabling the citation of data to be treated with the same respect as article citations Journals empowered to structure the citation of data in machine-actionable form ultimately supporting development of new applications and processes Agreements on implementation best practice will become important as uptake grows (Data Citation Principles!) For more info: mcentyre@ebi.ac.uk

Data Citation Implementer's Group

Moving Forward Research Data Alliance has several working groups working on data citation also Data bibliometrics Data services Data Workflows in conjunction with Force 11 group Cost recovery for data centers Dynamic data citation 26 Data Citation: ESIP, AGU, and NSIDC, Feb. 2014, Ruth Duerr

Moving Forward in Earth Sciences Brooks Hanson (AGU) and Kerstin Lehnert (IEDA) held a publisher's round table Statement of Commitment from Earth and Space Science Publishers and Data Facilities Data management policies Training/Ethics - E.g., for NSF program managers Ongoing collaboration between publishers and data centers Index of data facilities Seeking endorsement at AGU winter meeting Nancy Ritchey (NCDC) session on defining what data publication means for data centers 27 Data Citation: ESIP, AGU, and NSIDC, Feb. 2014, Ruth Duerr

Moving Forward Integrating Domain Repositories into the National Data Infrastructure How do each of the proposed infrastructure interact with existing repositories Shared Access Research Ecosystem (SHARE) (Clifford Lynch) (Libraries) Clearinghouse for the Open Research of the United States (CHORUS) (Publishers) National Data Service - supercomputer centers 28 Data Citation: ESIP, AGU, and NSIDC, Feb. 2014, Ruth Duerr

Making Dynamic Data Citeable Data Citation: Data + Means-of-access Data time-stamped & versioned (aka history) Researcher creates working-set via some interface: Access assign PID to QUERY, enhanced with Time-stamping for re-execution against versioned system Re-writing for normalization, unique-sort, mapping to history Hashing result-set: verifying identity/correctness PID leads to a query specific landing page S. Pröll, A. Rauber. Scalable Data Citation in Dynamic Large Databases: Model and Reference Implementation. In IEEE Intl. Conf. on Big Data 2013 (IEEE BigData2013), 2013 http://www.ifs.tuwien.ac.at/~andi/publications/pdf/pro_ieeebigdata13.pdf

Making Dynamic Data Citeable Building blocks of supporting dynamic data citation: - Uniquely identifiable data records (for unique sort) - Versioned data, marking changes as insertion/deletion - Time stamps of data insertion / deletions - Query language for constructing subsets Add modules: - Persistent query store: queries, timestamp, hash, metadata including creator of subset - Query rewriting module - PID assignment to queries - Landing page design, citation text Stable across data source migrations (e.g. diff. DBMS), scalable, machine-actionable Page 14

Data Citation Deployment Researcher uses workbench to identify subset of data Upon executing selection ( download ) user gets Data (package, access API, ) This is an important advantage over PID (e.g. DOI) (Query is time-stamped and stored) traditional approaches relying on, e.g. Hash value storing computed a list of over identifiers!!! the data for local storage Recommended citation text (e.g. BibTeX) PID resolves to landing page Provides detailed metadata, link to parent data set, subset, Option to retrieve original data OR current version OR changes Upon activating PID associated with a data citation Query is re-executed against time-stamped and versioned DB Results as above are returned

ESIP View of Dynamic Citation ESIP has had guidelines for citation of dynamic data for many years Doe, J. and R. Roe. 2001, updated daily. The FOO Gridded Time Series Data Set. Version 3.2. Oct. 2007- Sep. 2008, 84 N, 75 W; 44 N, 10 W. The FOO Data Center.http://dx.doi.org/10.xxxx/notfoo. 547983. Accessed 1 May 2011. The question is can a reproducible subset identifier be generated to replace the red bit.