Data Citation Then and Now

Data Citation Then and Now. Mark A. Parsons, with help from Ruth Duerr and Peter Fox. 17 June 2014, GeoData 2014, Boulder, CO. Unless otherwise noted, the slides in this presentation are licensed by Mark A. Parsons under a Creative Commons Attribution-Share Alike 3.0 License.

Then: The Evolution of Data Citation. Data was part of the literature (tables, maps, monographs, etc.) and we cited accordingly. (Some data were still hoarded.) Digital data becomes the norm. It's messier and we forget how to cite it routinely. Initial efforts to define digital data citation in the 90s and early 00s: right idea, little traction; partially conflated with the issue of citing URLs. A blossoming in the mid-to-late 00s: multiple disciplines start developing approaches and guidelines. DOI a big driver, especially for DataCite, but other identifiers used too (Handles, LSIDs, UNFs, ARKs, and good ol' URI/Ls). A slightly competitive atmosphere. GeoData 2011 a milestone for the ESIP Citation Guidelines, the most sophisticated to date (http://dx.doi.org/10.7269/p34f1nnj).

Now: The Evolution of Data Citation. Now a consensus phase. Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data, 2013 (http://dx.doi.org/10.2481/dsj.osom13-043). Draft Global Joint Declaration of Data Citation Principles, 2013 (http://www.force11.org/datacitation).

Next: The Evolution of Data Citation. Implementation phase just begun. ESIP Guidelines adopted by a variety of NASA and NOAA data centers and internationally by GEOSS. AGU Publishing Committee is developing author guidelines based on ESIP. Other disciplines, notably social science, have relationships with publishers. Several data centers are partnering with publishers, e.g. Elsevier's Article of the Future. New PLOS data policy. Joint engagement activity following on the joint principles. It happens locally and requires culture change, so debates will continue.

Outline of 2011 Talk: Purpose of Data Citation; How it's currently done; Basic citation form and content; Identifiers and locators; Microcitation.

Purpose of data citation

Then: Purpose of Data Citation. Credit for data creators and stewards. Track impact of data set. Accountability for creators and stewards. Aid reproducibility through direct, unambiguous connection to the precise data used. A location/reference mechanism, not a discovery mechanism per se.

Now: Purpose of Data Citation. Aid scientific reproducibility through direct, unambiguous connection to the precise data used. Credit for data authors and stewards. Accountability for creators and stewards. Track impact of data set. Help identify data use (e.g., trackbacks): data authors can verify how their data are being used, and users can better understand the application of the data. A locator/reference mechanism, not a discovery mechanism per se.

The Noble Eightfold Path to Citing Data: 1. Importance 2. Credit and attribution 3. Evidence 4. Unique Identification 5. Access 6. Persistence 7. Specificity and verifiability 8. Interoperability and flexibility. Joint Declaration of Data Citation Principles (Overview): http://force11.org/datacitation. Principles are supplemented with a glossary, references and examples.

Next: Purpose of Data Citation. A locator/reference/linking mechanism. This helps: aid scientific reproducibility through direct, unambiguous connection to the precise data used; identify and track data use. It contributes but is NOT central to: credit for data authors and stewards; accountability for creators and stewards; tracking impact of data set.

"Our ordinary conceptual system, in terms of which we both think and act, is fundamentally metaphorical in nature." (Lakoff and Johnson, 1980). Is data publication the right metaphor? M. A. Parsons & P. A. Fox, Data Science Journal, 2013, http://dx.doi.org/10.2481/dsj.wds-042. "If we hear the same language over and over, we will think more and more in terms of the frames and metaphors activated by that language." (Lakoff, 2008).

How it's currently done

Then: How data citation is currently done. Citation of a traditional publication that actually contains the data, e.g. a parameterization value. Not mentioned, just used, e.g., in tables or figures. Reference to name or source of data in text. URL in text (with variable degrees of specificity). Citation of a related paper (e.g. CRU temperature records recommend citing two old journal articles which do not contain the actual data or a full description of methods). Citation of the actual data set, typically using the recommended citation given by the data center. Citation of the data set including a persistent identifier/locator, typically a DOI.

Now: How data citation is currently done. Citation of a traditional publication that actually contains the data, e.g. a parameterization value. Not mentioned, just used, e.g., in tables or figures. Reference to name or source of data in text. URL in text (with variable degrees of specificity). Citation of a related paper (e.g. CRU temperature records recommend citing two old journal articles which do not contain the actual data or a full description of methods). Citation of the actual data set, typically using the recommended citation given by the data center. Citation of the data set including a persistent identifier/locator, typically a DOI.

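Then: [Figure: MODIS Snow Cover Data in Google Scholar. Bar chart of Formal Citation versus Total Entries by year (horizontal axis 0 to 600). Percentage labels by year: 2009 1.7%, 2008 1.3%, 2007 0.9%, 2006 1.3%, 2005 0.7%, 2004 0.7%; 2003 and 2002 carry the labels 1.0% and 1.3%.]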

Now, Next: How data citation is currently done. Implementation phase just begun. ESIP Guidelines adopted by a variety of NASA and NOAA data centers and internationally by GEOSS. AGU Publishing Committee is developing author guidelines based on ESIP. Other disciplines, notably social science, have relationships with publishers. Several data centers are partnering with publishers, e.g. Elsevier's Article of the Future. New PLOS data policy. Joint engagement activity following on the joint principles. It happens locally and requires culture change, so debates will continue.

Basic content

Then, Now, Next: Basic data citation form and content.
Per DataCite: Creator. PublicationYear. Title. [Version]. Publisher. [ResourceType]. Identifier.
Per ESIP: Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/or Distributor. Locator. [date/time accessed]. [subset used].

An Example Citation (Then, Now, Next): Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements ver. 2.0. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://dx.doi.org/10.5060/d4h41pbp.
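The ESIP form above can be assembled mechanically from its elements. The following is a minimal sketch, not ESIP's or NSIDC's own tooling: the `DataCitation` class, its field names, and the "Edited by" and "Accessed" wording are assumptions made for illustration, and the element order follows the generic ESIP form rather than the exact wording of the example citation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Elements of the ESIP-style form; optional fields mirror the bracketed elements."""
    authors: str
    release_date: str
    title: str
    archive: str
    locator: str
    version: Optional[str] = None
    editors: Optional[str] = None
    accessed: Optional[str] = None
    subset: Optional[str] = None

    def render(self) -> str:
        # Title and version share one element: "Title, Version".
        title = self.title if self.version is None else f"{self.title}, {self.version}"
        parts = [self.authors, self.release_date, title]
        if self.editors:
            parts.append(f"Edited by {self.editors}")  # wording is an assumption
        parts.append(self.archive)
        parts.append(self.locator)
        if self.accessed:
            parts.append(f"Accessed {self.accessed}")  # wording is an assumption
        if self.subset:
            parts.append(self.subset)
        # End every element with exactly one period.
        return " ".join(p.rstrip(".") + "." for p in parts)


clpx = DataCitation(
    authors="Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston",
    release_date="2002, updated 2003",
    title="CLPX-Ground: ISA snow depth transects and related measurements",
    version="Version 2.0",
    editors="M. Parsons and M. J. Brodzik",
    archive="Boulder, CO: National Snow and Ice Data Center",
    locator="http://dx.doi.org/10.5060/d4h41pbp",
    accessed="2008-05-14",
)
print(clpx.render())
```

Keeping the elements as separate fields, rather than storing only the rendered string, is what would let the same record also be emitted in the DataCite order.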

Identifiers and locators

Then, Now: An assessment of identification schemes for digital Earth science data. [Table: ID schemes (URL/N/I, PURL, XRI, Handle, DOI, ARK, LSID, OID, UUID) rated Good, Fair, or Poor as Unique Identifier, Unique Locator, Citable Locator, and Scientifically Unique ID, each at data set and item level. Ratings appear as cell colors in the original slide.] Adapted from Duerr, R. E., et al. 2011. On the utility of identification schemes for digital Earth science data: An assessment and recommendations. Earth Science Informatics. 4:139-160. http://dx.doi.org/10.1007/s12145-011-0083-6

Then, Now: An assessment of identification schemes for digital Earth science data. [Table as on the previous slide, with URL/N/I and PURL labeled Locators and XRI, Handle, DOI, ARK, LSID, OID, and UUID labeled Identifiers. Columns: Unique Identifier, Unique Locator, Citable Locator, Scientifically Unique ID, each at data set and item level; ratings Good, Fair, or Poor appear as cell colors in the original slide.] Adapted from Duerr, R. E., et al. 2011. On the utility of identification schemes for digital Earth science data: An assessment and recommendations. Earth Science Informatics. 4:139-160. http://dx.doi.org/10.1007/s12145-011-0083-6

What needs an identifier/locator? What needs to be cited? Everything needs an identifier. Most things need locators. Intellectual content needs citation. Different versions of things may need different identifiers/locators. Subsets may need identifiers or clear reference to the sub-setting process (e.g. space and time). Different representations (conceptual models) may need different identifiers/locators, e.g. maps.

Then, Now: Why the DOI? Not perfect but well understood by publishers. Thomson Reuters collaborating with DataCite to get data citations in their index. But... What is the citable unit? How do we handle different versions? What about retired data? When is a DOI assigned?

Now: Versioning approach recommended by DCC. As DOIs are used to cite data as evidence, the dataset to which a DOI points should also remain unchanged, with any new version receiving a new DOI. There are two possible approaches the data repository can take: time slices and snapshots.

Now: When to assign a DOI? First principle: Data should be citable as soon as they are available for use by anyone other than the original authors. But... Most people (falsely) believe that a DOI implies permanence, so how do we cite transient data? Some believe that a DOI should not be assigned until the data has undergone some level of review (e.g. Lawrence et al. 2010). So how do we cite data used before the review? Data are often used by friends and collaborators in a raw, unpublished state. Should this use be cited with a DOI? Near real time or preliminary data may only be available for a short uncurated period. There may not be a good match between the submission package and the distribution package. What gets the DOI? When?

Versioning and locators: some suggestions from NSIDC. Version form: major version.minor version.[archive version]. Individual stewards need to determine which are major vs. minor versions and describe the nature and file/record range of every version. Assign DOIs to major versions. Old DOIs should be maintained and point to some appropriate page that explains what happened to the old data if they were not archived. A new major version leads to the creation of a new collection-level metadata record that is distributed to appropriate registries. The older metadata record should remain, with a pointer to the new version and an explanation of the status of the older version's data. Major and minor versions (after the first version) should be exposed with the data set title and recommended citation. Minor versions should be explained in documentation, ideally in file-level metadata. Applying UUIDs or ARKs to individual files upon ingest aids in tracking minor versions and historical citations.
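The version form and the "assign DOIs to major versions" rule can be expressed in a few lines. This is a sketch under stated assumptions: `DataVersion`, `parse_version`, and `needs_new_doi` are hypothetical names, versions are assumed to be dot-separated integers, and the rule encoded is only the one stated above (a new DOI when the major version increases), not NSIDC's actual software.

```python
from typing import NamedTuple, Optional


class DataVersion(NamedTuple):
    """major.minor.[archive], with the archive component optional."""
    major: int
    minor: int
    archive: Optional[int] = None


def parse_version(text: str) -> DataVersion:
    """Parse strings such as '2.0' or '5.3.1' (leading zeros, as in '005.3', are accepted)."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.archive], got {text!r}")
    return DataVersion(*parts)


def needs_new_doi(old: DataVersion, new: DataVersion) -> bool:
    """Only a change of major version triggers a new DOI under the rule above."""
    return new.major > old.major


current = parse_version("2.0")
print(needs_new_doi(current, parse_version("2.1")))  # minor revision
print(needs_new_doi(current, parse_version("3.0")))  # major revision
```

Under this rule a minor revision keeps the existing DOI, which is why the slide asks for minor versions to be exposed in the title, recommended citation, and documentation instead.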

Microcitation

Then, Now, Next: Basic data citation form and content. Author(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/or Distributor. Locator. [date/time accessed]. [subset used]. The best solution is to have unique identifiers or query IDs for subsets, but that won't be available for most data sets for a long time, so we need alternative solutions...

February 8, 2011, 4:45 PM. Page Numbers for Kindle Books: an Imperfect Solution. Neither solution is perfect (locations or page numbers) because the problem is unsolvable. The best we can hope for is a choice... Amazon's Kindle will have page numbers that correspond to real books and locations by passage. http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/

Chapter and Verse (Then, Now, Next): Bible, Koran, Bhagavad-Gita and Ramayana, other sacred texts. A structural index.

Then, Now, Next: Doing it as best we can...?
Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007-Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.
Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007-Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1). Boulder, Colorado USA: National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/10.1234/xxx.
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003. CLPX-Ground: ISA snow depth transects and related measurements, Version 2.0, shapefiles. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://dx.doi.org/10.5060/D4H41PBP.

Next: Just-in-time citation. Approach being developed by an RDA Working Group. Ensure data is time-stamped and versioned. Assign a PID to the time-stamped query/selection expression.
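One way to picture the just-in-time approach is to derive an identifier from the query text, the data version, and the execution time taken together. The sketch below is illustrative only and is not the RDA Working Group's specification: `CitedQuery`, `cite_query`, the whitespace-only normalization, the SHA-256 digest, and the `example-pid` prefix are all assumptions, and a real system would register the identifier with a PID service instead of just computing a string.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CitedQuery:
    pid: str            # hypothetical identifier, not a registered PID
    query: str          # normalized query/selection expression
    data_version: str   # version of the data the query ran against
    executed_at: str    # ISO 8601 timestamp in UTC


def normalize(query: str) -> str:
    """Collapse whitespace so trivially different spellings share one identifier."""
    return " ".join(query.split())


def cite_query(query: str, data_version: str, executed_at: datetime,
               prefix: str = "example-pid") -> CitedQuery:
    if executed_at.tzinfo is None:
        raise ValueError("executed_at must be timezone-aware")
    stamp = executed_at.astimezone(timezone.utc).isoformat()
    canonical = normalize(query)
    # The digest ties the identifier to query + version + time stamp together.
    payload = f"{canonical}|{data_version}|{stamp}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return CitedQuery(f"{prefix}/{digest}", canonical, data_version, stamp)


when = datetime(2008, 11, 1, 12, 0, tzinfo=timezone.utc)
record = cite_query("tiles IN (15,2; 16,0)   AND date >= 2007-10-01", "5.3", when)
print(record.pid)
```

Because the stored record keeps the query, version, and time stamp, re-running the query against versioned data at that time stamp is what would reproduce the cited subset.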

Content negotiation: the details of identifier resolution

So, you have a DOI, or a handle, or URI? Resolution service. Landing pages (for metadata, citation recommendations, ...). Content negotiation (conneg) for embedding in: other pages; applications; ingest, object type repositories, communities.

Landing Pages: for humans and machines

Landing page, a short form: http://data.rpi.edu/repository/handle/10833/24

Landing page, long form: http://data.rpi.edu/repository/handle/10833/24?show=full

Content Negotiation (Conneg). Many examples, but what follows is roughly from http://www.crosscite.org/cn/. What is it? Est-ce que vous parlez français? Do you speak HTML or JSON or RDF? For embedding in: other pages; applications; ingest, object type repositories, communities.
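In practice, content negotiation means sending the same DOI URL with a different HTTP Accept header and getting back a different representation. The sketch below uses only the Python standard library; the function names, the short format keys, and the use of https://doi.org/ as the resolver are choices made here for illustration. The media types shown are ones documented for DOI content negotiation, but which of them a given DOI supports depends on its registration agency, so treat any particular request as something to verify.

```python
import urllib.request

# Media types documented for DOI content negotiation; support varies by
# registration agency, so a given DOI may not answer all of them.
CONTENT_TYPES = {
    "csl-json": "application/vnd.citationstyles.csl+json",
    "bibtex": "application/x-bibtex",
    "rdf-xml": "application/rdf+xml",
    "apa-text": "text/x-bibliography; style=apa",
}


def build_conneg_request(doi: str, fmt: str,
                         resolver: str = "https://doi.org/") -> urllib.request.Request:
    """Build (but do not send) a request asking the resolver for one representation."""
    return urllib.request.Request(resolver + doi,
                                  headers={"Accept": CONTENT_TYPES[fmt]})


def fetch_representation(doi: str, fmt: str, timeout: float = 10.0) -> str:
    """Send the request; needs network access and a DOI that supports the format."""
    with urllib.request.urlopen(build_conneg_request(doi, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


request = build_conneg_request("10.5060/d4h41pbp", "bibtex")
print(request.full_url, request.get_header("Accept"))
```

A request without an Accept header would normally be redirected to the human-readable landing page, which is the "for humans and machines" split the landing-page slides describe.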

Conneg

Supported content types...

[Diagram: DCO Object Registration and Deposit. DCO Research Community Network: join network, share knowledge. DCO Object Deposit: register metadata (title, author, author email, licence, subject, keyword, data type) and upload raw data (dataset, CDF). DCO-ID Request: allocate a universally accessible DCO-ID.]

Further integration...

Get involved!
RDA Working Group on citing dynamic data: http://rd-alliance.org/working-groups/data-citation-wg.html
RDA WG on identifier types and PID Interest Group: https://rd-alliance.org/working-groups/pid-information-types-wg.html and https://rd-alliance.org/internal-groups/pid-interest-group.html
ESIP Preservation and Stewardship Committee: http://wiki.esipfed.org/index.php/preservation_and_stewardship
Implementation Team for the Joint Citation Principles: http://www.force11.org/node/4849

Update of 2011 Talk. Purpose of Data Citation: evolving into more specific concerns (slowly). How it's currently done: consensus on needs and approach, but no substantial progress on implementation. Basic citation form and content: basics are solid and generally consistent. Identifiers and locators: consensus emerging, now looking at identifiers for everything and what to resolve. Micro-citation: good technical progress, hindered by social implementation and the concept of citation vs. linking.

Overall Summary. Use many metaphors and be cautious of them all. We know how to cite data (for the most part), we just need to make it a cultural practice. Just do it. Location and identity are different but can be the same. Separate concerns. Data citation is not literature citation. Micro-citation is more important and the citation needs to be machine interpretable and operable. Machine interpretable and operable. Everything in data stewardship needs an identifier, but due diligence is the underlying requirement. These sorts of socio-technical problems require collaborative, networked solutions. Participate.