GUID Guide for Data Providers

Similar documents
idigbio Data Ingestion Requirements and Guidelines

Workflow Detail: Data Capture (for flat sheets and packets)

For those of you who may not have heard of the BHL let me give you some background. The Biodiversity Heritage Library (BHL) is a consortium of

Foundations for a digitization project Terminology and standards

Developing Online Databases and Serving Biological Research Data

Unique Identifiers Assessment: Results. R. Duerr

Smartphones: changing the way invasive species are reported Chuck Bargeron

Technical and Financial Progress Report

Feedback from plant e-flora and occurrence session

Data Curation Profile Botany / Plant Taxonomy

Extraction and Parsing of Herbarium Specimen Data:

Writing a Data Management Plan A guide for the perplexed

Data Standards and Protocols for Biological Collections Data

Transcribing instructions for BioGaps. February 2017

The TDWG Ontology and the Darwin Core Namespace Policy. Steve Baskauf TDWG VoMaG and RDF Task Groups

Interactive Machine Learning (IML) Markup of OCR Generated Text by Exploiting Domain Knowledge: A Biodiversity Case Study

University at Buffalo's NEES Equipment Site. Data Management. Jason P. Hanley IT Services Manager

The DOI Identifier. Drexel University. From the SelectedWorks of James Gross. James Gross, Drexel University. June 4, 2012

Registry Interchange Format: Collections and Services (RIF-CS) explained

This presentation is on issues that span most every digitization project.

Software/Tool Comparison Worksheet

Fulfillment User Guide FULFILLMENT

Contribution of OCLC, LC and IFLA

Network Activity D - Developing and Maintaining Databases. Report D Detailed UI architecture study including html mock-up

Corso di Biblioteche Digitali

Web Interface Tool How To Guidelines

Deposit-Only Service Needs Report (last edited 9/8/2016, Tricia Patterson)

Quality Assured (QA) data

Appendix REPOX User Manual

idigbio TCN IT Needs Assessment Results José Fortes, Shari Ellis, Andréa Matsunaga

A Beginner s Guide to Persistent Identifiers Version 1.0

DRS Policy Guide. Management of DRS operations is the responsibility of staff in Library Technology Services (LTS).

NeAT Business Plan Component Data Integration and Annotation Services in Biodiversity (DIAS-B) 1. Service Description

SISSA Library Maria Pia Calandra & Marco Marin. SISSA Digital Library Research Products Archive

ARKive-ERA Project Lessons and Thoughts

Google indexed 3,3 billion of pages. Google s index contains 8,1 billion of websites

Teiid Designer User Guide 7.7.0

Developing a Research Data Policy

Making Information Findable

Building on to the Digital Preservation Foundation at Harvard Library. Andrea Goethals ABCD-Library Meeting June 27, 2016

Metadata Issues in Long-term Management of Data and Metadata

An RDF guide for the Darwin Core standard

Virtualizing Lifemapper Software Infrastructure for Biodiversity Expedition

Plant Atlas Web Site Overview What can you do at this web site? Search the Herbarium Database

FEMA Data Standard: Declaration String. Proposal for Adoption by the FEMA Data Governance Council

Europeana, the prototype EDLfoundation Europeana Network Europeana, vs. 1.0 ThoughtLab Technical requirements

Dagmar Triebel, Peter Grobe, Anton Güntsch, Gregor Hagedorn, Joachim Holstein, Carola Söhngen, Claus Weiland, Tanja Weibulat.

University of Bath. Publication date: Document Version Publisher's PDF, also known as Version of record. Link to publication

Corso di Biblioteche Digitali

Search Interoperability, OAI, and Metadata

Collection Policy. Policy Number: PP1 April 2015

An overview of the OAIS and Representation Information

Archivists Toolkit: Description Functional Area

PERSISTENT IDENTIFIERS FOR THE UK: SOCIAL AND ECONOMIC DATA

CADEX CATIA CADAM DRAFTING

COAR Interoperability Roadmap. Uppsala, May 21, 2012 COAR General Assembly

BRAHMS v8. An update on development JANUARY 2016

Word Overview Page 3 Tables Page 5 Labels Page 9 Mail Merge Page 12. Excel Overview Page 19 Charts Page 22

Persistent identifiers, long-term access and the DiVA preservation strategy

data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database.

A tool for Entering Structural Metadata in Digital Libraries

Networked Access to Library Resources

BHL-EUROPE: Biodiversity Heritage Library for Europe. Jana Hoffmann, Henning Scholz

DORSA Virtual Museum and Phonothek

Digital Library Curriculum Development Module 4-b: Metadata Draft: 6 May 2008

ECHA Accounts Manual for Industry Users

A Dublin Core Application Profile in the Agricultural Domain

How to Create a Custom Ingest Form

Information and documentation Records management. Part 1: Concepts and principles AS ISO :2017 ISO :2016

The Trustworthiness of Digital Records

RPS Technology Standards Grades 9 through 12 Technology Standards and Expectations

Enterprise Data Catalog for Microsoft Azure Tutorial

Development and Implementation of an Efficient, Accessible Online Museum Site and Database with Open Access. Justin Woods, Art Chadwick

EUDAT. Towards a pan-european Collaborative Data Infrastructure

EUROPEANA METADATA INGESTION , Helsinki, Finland

THE PALPARES RELATIONAL DATABASE: AN INTEGRATED MODEL FOR LACEWING RESEARCH

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

Shell Collection Management Program Version 2.0

Introduction to Programming Nanodegree Syllabus

ADDING VALUE TO BIODIVERSITY IMAGES THROUGH COMMUNITY ANNOTATION

Introduction to Archivists Toolkit Version (update 5)

Implementing the Army Net Centric Data Strategy in a Service Oriented Environment

Towards Open Innovation with Open Data Service Platform

Data Curation Profile Movement of Proteins

Utilizing PBCore as a Foundation for Archiving and Workflow Management

Elevating Natural History Museums Cultural Collections to the Linked Data Cloud

Teiid Designer User Guide 7.5.0

Semantics, Metadata and Identifying Master Data

Data publication and discovery with Globus

Lesson 22 Enhancing Presentations with Multimedia Effects

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

The Sunshine State Digital Network

The Butterfly Effect. A proposal for distribution and management for butterfly data programs. Dave Waetjen SESYNC Butterfly Workshop May 10, 2012

SANBI POLICY DOCUMENT

Open Access & Open Data in H2020

Abstract. Background. 6JSC/ALA/Discussion/5 31 July 2015 page 1 of 205

NEW ENGLISH FILE INTERMEDIATE ONLINE - OXFORD UNIVERSITY PRESS ENGLISH FILE TEACHER'S SITE TEACHING RESOURCES OXFORD

OpenAIRE Guidelines for Data Archive Managers 1.0 December 2012

FREYA Connected Open Identifiers for Discovery, Access and Use of Research Resources

The UC Davis BIBFLOW Project

Transcription:

GUID Guide for Data Providers 2013-06- 26 Preface A Globally Unique Identifier (GUID) is a unique reference number used as an identifier. Complexities associated with specimens and associated, dynamic data in natural history collections have led to differences in opinions about how to create GUIDs and where they are required vs. recommended. This guide is intended to give data providers the information necessary to assign GUIDs that will allow idigbio to ingest and share data. Suggestions for improvements to this guide are welcome, and can be inserted as comments on https://www.idigbio.org/content/guid- guide. A more general discussion on the use of identifiers in natural history collections is appropriate in idigbio s community forums: https://www.idigbio.org/forums/identifiers- natural- history- collections. GUIDs for idigbio GUIDs are needed by data providers, intermediate data aggregators, and data consumers. A data provider is a person who sends biodiversity collections data to idigbio to be ingested on behalf of the institution that owns the data. A data aggregator is an entity like idigbio, Morphbank, and VertNet, which ingests data from providers or other aggregators and serves it to consumers. A consumer is a person who extracts data from a portal such as that provided by idigbio for use in research, education and outreach activities. GUIDS are necessary to identify unequivocally specimen- based records so that uses of the data, and recommended annotations (such as corrected species identifications or addition of georeference data) from aggregators or consumers, can be supplied back to the provider. To enable this, a digital record for the specimen 1 and digital records for related objects (e.g., image or audio recording of the specimen) must have their own GUIDs. 1 A "specimen" may be defined differently among various types of taxonomic collections. It may refer, for example, to (a) an individual (a mammal skin), (b) a 'lot such as a vial containing several University of Florida Florida Museum of Natural History Dickinson Hall (Museum Rd. & Newell Dr.) Gainesville, FL 32611 352-273- 1906 idigbio is funded by a grant from the National Science Foundation's Advancing Digitization of Biodiversity Collections Program (#EF1115210)

A digital record for the specimen includes information usually contained in a collection database: a taxonomic name, a catalog number and/or barcode, locality, date of collection, etc. For this information to be submitted to idigbio, the record should include, as a minimum, a GUID and a taxonomic name (usually a scientific name, but it can be a genus, family, or other taxonomic name). Note: idigbio will accept data without GUIDs assigned by the provider. However, without GUIDs, one will not be able to update records (i.e., they can only be read) making it impossible to create and track history, generate usage statistics or relate records to other records or data. In addition, recommended annotations to the dataset and information on uses of data cannot be easily captured and reliably returned to the provider. Please see Appendix 1. I. How to generate GUIDs idigbio recommends generating GUIDs as URIs (Universal Resource Identifiers), which are strings that begin with a scheme name (or protocol). Registered scheme names that may be used to develop URIs include http, doi, and urn. idigbio further recommends URI schemes that rely on UUIDs (Universally Unique Identifiers). UUIDs are generated using web services or standard software packages and the corresponding URIs are globally unique. 1. Examples of UUIDs are: A simple UUID: f47ac10b- 58cc- 4372- a567-0e02b2c3d479 A UUID using URI syntax: urn:uuid:f47ac10b- 58cc- 4372- a567-0e02b2c3d479 2. Another example of an acceptable URI scheme is Archival Resource Key (ARK), which originates from the library, archive and museum community. The EZID project of the California Digital Library (http://www.cdlib.org/services/uc3/ezid/) provides services in support of ARK UUIDs. An ARK- generated UUID has the following form, where the embedded 87286 is the Name Assigning Authority for the biodiversity community. specimens from the same locality and date, (c) all of the various parts of a plant on an herbarium sheet, or (4) a group of fossils in a single concretion. Page 2 of 8

ark:/ 87286/f47ac10b- 58cc- 4372- a567-0e02b2c3d479 3. Other examples of GUID- generating resources are given in Appendix 2. Remember: Digital records for the specimen and each related digital object must have their own GUIDs in order for annotations and other usage data to be sent to the provider. A digital record for the specimen might have this URN- generated UUID: urn:uuid:f47ac10b- 58cc- 4372- a567-0e02b2c3d479 a record for an image of that specimen would have a separately generated URN UUID: urn:uuid:6796b640-4c50-4714- aad6-818346008282 and a record for a video recording of that specimen would have a third URN UUID. II. How to add GUIDs to a database for export to idigbio Step 1. GUIDs from providers should be entered in a field named idigbio:recordid (Fig. 1) These identifiers do not replace catalog numbers, bar codes, or other collection management strategies that are used to connect database records with physical specimens. It is not necessary for GUIDs to be on labels or otherwise attached to physical specimens. Step 2. Use of the ac:relatedresourceid field (Fig. 1) To clarify the relationship between a digital record for the specimen and another digital object record, e.g., for an image, put the GUID for the digital record for the specimen in a field named ac:relatedresourceid when the image record is submitted. (If an image or other related object record is submitted before the specimen record, put the GUID for the related digital object in the ac:relatedresourceid field when the digital record for the specimen is submitted.) Step 3. Use of the idigbio:collectionobjectid field (Fig. 1) To maintain a global reference to all records related to a physical specimen, a second GUID can be added to a field named idigbio:collectionobjectid. This could be used, for example, to Page 3 of 8

distinguish among several specimens on a slide or in a vial, or among fossils in a single concretion (in these instances, the GUID in the idigbio:recordid field refers to the slide, the vial, or the concretion, the encompassing cataloged collection object). Figure 1. Relationships among GUIDs and names of fields for export to idigbio Page 4 of 8

III. How to export record identifiers to idigbio a) If using the GBIF IPT tool, add the related resource extension and create the field mapping like this: 1. dwc:relatedresourceid = GUID entered in idigbio:recordid 2. dwc:relationshipofresource = representedin This additional mapping is necessary because there is no recordid field in the GBIF IPT tool. b) If submitting records in a spreadsheet: 1. Create a field called idigbio:recordid and populate that field with the digital (specimen or media) record GUIDs from the collections database. 2. If a collectionobjectid is used, create another field called idigbio:collectionobjectid and populate it with a GUID as discussed above. (Note: For providers using Audubon Core fields for ingesting media records, this means adding the dcterms:identifier field and filling it with the primary identifier of the image record.) In summary: A GUID should be assigned to each digital record of a specimen, and a different GUID assigned to each related digital object submitted to idigbio. These GUIDs are to be submitted in a field named idigbio:recordid (Fig. 1). It is recommended that providers place the GUID of the digital record for the specimen in a field named ac:relatedresourceid (Fig. 1) when a record for a related digital object (image) is submitted. The GUID in the idigbio:collectionobjectid field can be used together with the idigbio:recordid to maintain the global reference back to a particular physical specimen (e.g., in cases where it is necessary to relate different specimens to the same 'lot'). Page 5 of 8

Appendix 1 Why Use a GUID? It is critical to be able to identify unequivocally a piece of information about a digital collection object, including who the provider is, regardless of the route it may have taken before arriving at the idigbio portal; e.g., from one or many data aggregators or directly from the source (the provider). Ideally, every provider would add a globally unique identifier (in a field called recordid), to each digital collection object record. When this happens, different kinds of valuable services are available to the provider and to the consumer: 1) Write- back - This is the process where an annotation to a record by a consumer can be pushed back to the provider for presentation and possible update. Additionally, any other place where the specimen record appears in the global data universe, can be also be updated, as long as its identifier is persistent, i.e., no intervening aggregator has subverted, concealed, or reused for a different purpose the provider s GUID. 2) Impact tracking - this is the process where a provider can trace via the universe of globally linked data where their specimen records have been used; e.g., in research publications or environmental impact statements. 3) Validation - when the portal is able to run data validation reports against authority files, e.g., place names, taxonomy, collector names, it will be important to get the suggested corrections back to the source. 4) Data quality/research value/transparency - many consumers of the data are researchers who depend on the quality of the data, which includes proper accounting for statistical models. A GUID assigned at the source by the provider improves the quality of the data in general; e.g., alleviating issues caused by the number of duplicated collection object records. For reasons given above, it is highly desirable that GUIDs be assigned by the data provider. An internal idigbio GUID is assigned to each record received for use by idigbio, regardless of whether the provider has provided a GUID. These idigbio GUIDs are visible too, and can be used by providers; however, that requires that they be returned to the Page 6 of 8

provider and entered into the original dataset by the provider at the institution, which is likely to be more difficult than adding GUIDs initially. Consequences of Submitting Data to idigbio without a GUID In an effort to accommodate the broadest range of available data sources, idigbio is relaxing its initial requirement of every record coming into the system with a GUID. idigbio will make an effort to integrate records that have no identifiers into our system, but many of the features idigbio has or intends to build may be disabled for these data. Specifically, annotations and systems which depend on annotation such as functionality (validation/quality assessments, impact reporting at the record level) will likely be wholly or partly disabled for datasets without identifiers. For datasets that are unable to store or export even stable locally unique identifiers (CSVs with unstable row ordering, databases with duplicate catalog numbers and no primary keys, databases with fixed export formats that do not include identifiers), idigbio may have to disable core system features such as versioning and relationships in order to incorporate them. Providers who find themselves in this category are advised to contact idigbio for advice on modifying their existing system or migrating to a different cataloging system entirely, as systems in this category are likely to be increasingly isolated from the rest of the data universe. Appendix 2 - Resources dwc: Refers to 'Darwin Core,' a body of standards that "includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries. The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information. Included are documents describing how these terms are managed, how the set of terms can be extended for new purposes, and how the terms can be used" (http://www.tdwg.org). ac: "The Audubon Core metadata schema ("AC") is a representation- neutral metadata vocabulary for describing biodiversity- related multimedia resources and collections. Multimedia Resources are digital or physical artifacts which normally comprise more than text. These include pictures, artwork, drawings, photographs, sound, video, animations, presentation materials, interactive online media including identification tool packages involving text and other media. A multimedia collection is an assemblage of such objects whether curated or not and whether electronically accessible or not. For the purposes of Page 7 of 8

this schema we regard a collection of multimedia resources itself as a 'multimedia resource'"(http://www.tdwg.org). UUID http://tools.ietf.org/pdf/rfc4122.pdf - canonical technical definition of UUID http://en.wikipedia.org/wiki/globally_unique_identifier - the Wikipedia definition Online GUID generator: http://www.guidgenerator.com/online- guid- generator.aspx http://henbo.wordpress.com/2007/12/02/the- mystery- of- upper- case- and- lower- case- guid- values/ o NOTE: 6.5.4 Software generating the hexadecimal representation of a UUID shall not use upper case letters. It is recommended that the hexadecimal representation used in all human- readable formats be restricted to lower- case letters. Software processing this representation is, however, required to accept both upper and lower case letters as specified in 6.5.2. MSSQL, PostgreSQL, and MySQL all provide native UUID datatypes. GBIF Best place to start: http://www.gbif.org/communications/news- and- events/showsingle/article/a- beginners- guide- to- persistent- identifiers- published/ http://rs.tdwg.org/dwc/terms/ http://terms.gbif.org/wiki/audubon_core_term_list_(1.0_normative) Audubon Core term list GUID http://en.wikipedia.org/wiki/globally_unique_identifier http://links.gbif.org/persistent_identifiers_guide_en_v1.pdf http://imsgbif.gbif.org/cms_new/get_file.php?file=24d1fe88b849e4225f3117fac03d6 c LSID http://en.wikipedia.org/wiki/lsid DOI http://www.doi.org http://www.handle.net EZID http://www.cdlib.org/services/uc3/ezid/ Page 8 of 8