Managing Data in the long term. 11 Feb 2016


Outline: What is needed for managing our data? What is an archive?


Motivation: Researchers often have funds for data management during the project lifetime, but limited time to manage data once the project has completed. Essentially it is not the researcher's job. But there is value in ensuring data are available beyond the end of the project: value to peers, and potential value to researchers in other areas (cross-discipline).

Motivation: Edmond Halley (18th century) used historical data to determine the trajectory of a comet and to provide validation of Newton's theory of gravitation. Because the comet's period is ~70 years, historical data were essential. He needed to use data that had been collected for different purposes (e.g. propaganda, religious records).

Motivation: [Figure: download, view and publication counts for the Moser et al. Nobel Prize data, taken from Google Scholar, 2014-2016.]


Motivation: Currently ~50 TB of data are archived: 44 datasets (49 TB) from the climate community, 4 datasets from biology and 1 from computer science, with very little prompting. This demonstrates that researchers do see value in their data and in a service capable of managing it.


An Important Distinction: An archive is a service that provides long-term access to data; long-term usually means more than 5 years. An archive is not a backup. A backup is a snapshot of data that may change over time (e.g. last week's backup of file X may differ from this week's backup of file X). Once data reach a mature state (i.e. they no longer change) they can be considered for archiving.

Roles: The Norstore archive recognises 5 different types of user: Creator, Contributor, Data Manager, Rights Holder and Access User. Each type can be a person or an organisation (although in the case of an organisation a contact person is needed). The Access User does not need to be defined (unless data access is restricted). The different types can resolve to the same person or organisation. It is important to assign these roles to the dataset in case of questions.

Creator and Contributor Roles: A person uploading data into the archive takes the role of Contributor; there can be more than one Contributor for a dataset. The Contributor uploads the data and fills in the metadata for the dataset, and shares the responsibility of ensuring that the dataset is complete, that it abides by the Terms and Conditions, and that the metadata is accurate. The Creator is the person or group that created the data.

Data Manager: To address the problem of datasets being used in situations different from those originally anticipated, we need an expert or contact person for the dataset. The Data Manager is responsible for fielding questions or comments regarding the dataset during its lifetime. The Contributor does not need to maintain a connection with the dataset (e.g. the Contributor could be a postdoc or PhD student). The Data Manager does not have to be an expert on the dataset, but should know whom to contact. This is similar to what happens with publications, where a contact person or corresponding author is named.

Rights Holder: The Rights Holder is the person or group that controls or owns the rights to the dataset, including intellectual property and copyright. There may be more than one Rights Holder for a dataset. If access restrictions exist on the use of the dataset, the Rights Holder will need to be contacted for permission to use it. In most cases (those abiding by the NLOD or CC v4 licence) the role of the Rights Holder is less important (but still relevant). It is IMPORTANT that you check with your institution or funding agency as to who has rights over your dataset.

Access User: Any person querying the archive or using the data in the archive assumes the role of an Access User. Metadata for all published datasets is accessible to all Access Users. Datasets themselves are accessible with only an email address, to which the download link is sent. Using a dataset assumes you abide by its access licence.

Archiving Data: Archiving is part of the data lifecycle and requires information from previous phases (how the data were collected, processed, etc.). It needs to be taken into consideration at project proposal time, which motivates the need for a data management plan.

Research Data Management Plan: Founded on three criteria for the research project: successful data collection, successful data use, and successful data sharing with the target audience, throughout the data lifecycle. The plan will also help in provisioning resources. Norstore is working on a template to address these criteria that recognises researchers are not data management experts and uses best practices from the UK Digital Curation Centre, DANS and other agencies.

Research Data Management Plan: [Diagram: the Data Management Plan connects the Research Council of Norway, researchers, research institutions and the service providers Norstore and Nortur.]

Research Data Management Plan: We are currently drafting a template for the plan; the intention is to provide pre-prepared text as much as possible. The template will be ready very soon. Next steps: review the template draft internally and seek feedback from stakeholders.

Datasets (archive): A dataset is a collection of related data, usually related in terms of use (e.g. cloud simulation data); it is up to the researcher to define the dataset. Datasets eligible for archiving should be 'closed' or 'complete': datasets such as those that resulted in publications, or those that are considered a natural conclusion to a project. All datasets should be considered of lasting value to the community.

Datasets (archive): Data ideally need to be in a standard or open format that makes migration possible in case of obsolescence. Licensing: who can use the data, and under what restrictions. Contact persons: in case of questions concerning the data. Integrity: checksums should be provided along with the data. Metadata: a description of the data.
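As an illustration of the integrity point above, a checksum manifest for a dataset can be generated with a few lines of standard-library Python. This is a sketch, not Norstore tooling; SHA-256 is an assumed choice of algorithm.

```python
import hashlib
from pathlib import Path

def checksum_manifest(dataset_dir: str) -> dict:
    """Compute a SHA-256 digest for every file under a dataset directory,
    keyed by the file's path relative to the dataset root."""
    manifest = {}
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(dataset_dir))] = digest
    return manifest
```

The resulting mapping can be written to a text file and deposited alongside the data, so the archive (or a later user) can verify that nothing changed in transit.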

Datasets (Archive): There are some popular approaches to arranging data. The Internet Engineering Task Force has a proposal for structuring related data: BagIt (http://tools.ietf.org/html/draft-kunze-bagit-10), used by a variety of institutions (e.g. the Library of Congress). Essentially:

Dataset (Archive): The BagIt data directory contains a sub-structure. We suggest dividing it into: doc for documentation (including a table of contents of the layout); src for any source code needed to read the data (and possibly that generated the data); aux for auxiliary data files; <data type> for data files of that data type; or any other layout. But try to provide a doc directory containing documentation and a src directory containing source code. You can then zip or tar the BagIt hierarchy and upload it to the archive.
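The suggested layout can be sketched in code. This is a rough illustration only: the bagit.txt declaration follows the draft BagIt specification referenced above, while the helper name make_bag, the SHA-256 manifest and the example payload paths are choices made for the sketch, not Norstore tooling.

```python
import hashlib
from pathlib import Path

def make_bag(bag_dir: str, payload: dict) -> None:
    """Lay out a minimal BagIt bag: a bag declaration (bagit.txt), a data/
    payload directory, and a checksum manifest over every payload file.
    `payload` maps paths relative to data/ (e.g. 'doc/README.txt') to bytes."""
    bag = Path(bag_dir)
    bag.mkdir(parents=True, exist_ok=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    manifest_lines = []
    for relpath, content in sorted(payload.items()):
        target = bag / "data" / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{relpath}")
    (bag / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
```

Once the bag is assembled, the whole hierarchy can be tarred or zipped and uploaded, as the slide suggests.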

Metadata: What is this? What was it used for?

Metadata and Datasets: Metadata is essential to successfully use the dataset: it describes what the dataset is, where it came from, and how to use it. Metadata is created throughout the data lifecycle, and different phases of the lifecycle require different types of metadata. Perhaps data are initially stored in a primitive format and then processed.

Metadata and Datasets: Metadata can be divided into 3 classes: descriptive (what the data is, its features, etc.), structural (how the data is arranged, formats, etc.) and administrative (how to manage the data, checksums, rights, etc.). Many domains have complex, detailed metadata...

Metadata: Seeing Standards: A Visualisation of the Metadata Universe. J. Riley, D. Becker

Metadata: Metadata schemes for many communities are at different stages of evolution, and can be quite detailed. It is very difficult for Norstore to support all metadata schemes, so we look for the lowest common denominator.

Norstore Archive Metadata: Many metadata schemes have Dublin Core as a basis or have a strong overlap with it. Dublin Core is an ISO standard; the standard has 15 terms, and extended Dublin Core has more. The Norstore Archive uses Dublin Core as a basis, with additional metadata terms added that are not covered by DC but are generic enough for all communities. OAI-PMH is based on DC, so the archive is automatically compliant. Metadata is a separate entity from the dataset.
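As a sketch of what a simple Dublin Core description looks like, the following renders a flat set of DC terms as XML using the standard dc element namespace. The example values are hypothetical, and this is an illustration rather than the archive's actual serialisation.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def dc_record(terms: dict) -> str:
    """Serialise a flat dict of simple Dublin Core terms as an XML fragment."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for term, value in terms.items():
        ET.SubElement(root, f"{{{DC_NS}}}{term}").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical example values -- not a real Norstore record.
record = dc_record({
    "title": "Cloud simulation dataset",
    "creator": "Example Researcher",
    "identifier": "doi:10.1000/182",
})
```

A record like this is what an OAI-PMH harvester would expect to receive for each published dataset.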

Norstore Archive Metadata:
Descriptive Information (mandatory): Category, Description, Identifier, Internal Identifier, Journal Article, Language, Phase, State, Subject, Title.
Administrative Information (mandatory): Access Rights, Contributor, Created, Creator, Data Manager, License, Lifetime, Preservation Level, Published on, Publisher, Rights, Rights Holder, Submitted, Terms and Conditions for Deposit.
Structural Information (mandatory): File Checksum, File Name, File Size, File Type.
Descriptive Information (optional): Bibliographic Citation, Conforms to, Comment, Geolocation, Label, Project, Provenance, Source, Temporal Coverage.
On the original slide, bold terms are Dublin Core recommended terms, terms in italics are filled in automatically by the archive, and the top three boxes contain the mandatory metadata. Only ~14 terms need to be defined by the user.

Norstore Archive Metadata: Norstore metadata is intended to be as generic as possible: sufficient to locate data and understand how to use it. More detailed information should be contained in domain-specific catalogues, which can be referenced within the descriptive metadata. In the future we could envisage the archive holding a reference to the domain catalogue, though we need to be aware that the archive's lifetime may be longer than the domain catalogue's. Domain-specific catalogues can use the DOI as a handle to the data (resolving the link will provide access to the data).

Norstore and Domain Metadata: [Diagram: data service, Norstore archive metadata service, DOI resolver, domain metadata service.] A domain metadata catalogue can have a DOI registered, and could then invoke the DOI resolver to provide access to the archive metadata and data.

Norstore Metadata: Currently metadata must be supplied using the web interface, in two stages: before data upload, the mandatory metadata needed by the archive to manage the data (e.g. contact information, title of the dataset, etc.); after data upload, the remaining mandatory descriptive information and the optional information. There is a 3-month time limit to fill in the metadata, and the user will be reminded during this period of the need to complete it. At the discretion of the Archive Manager the dataset may be deleted if the metadata remains incomplete after this limit. Typically metadata is completed within 2 weeks.

Completing Norstore Metadata

Tips for Norstore Metadata: Avoid duplication if the information is contained in the publication or other referenced material. Consider what information is needed to reanalyse the data: libraries, operating systems, workflows, manuals, any other data. A good test is to ask a person new to the community to document what they need to make use of the data. Are any features in the data worth mentioning? Is how the data was collected described? What about the environment the data was collected in, such as instrument settings?

Tips for Norstore Metadata: Use the description field to describe what the dataset is and how to use it. Use the journal metadata to provide a reference to the article that describes the dataset. If there is a lot of documentation it can be included as part of the dataset, with the description saying where to find it; in that case the description can be more succinct. If the dataset has temporal or spatial information, consider using the optional metadata to capture it: it provides a visual aid to the description of your dataset.

Tips for Norstore Metadata: Links to external references with more information are good, but beware of longevity: will the reference last the lifetime of the dataset? Beware of jargon or terminology; perhaps run the description by novice users to see if it is clear.

Norstore Metadata Plans: We recognise that many projects have metadata catalogues, and the ability to extract a subset that matches some of the Norstore metadata terms would be useful. We are working on a REST API for the metadata catalogue, currently looking at search, though it can be extended to the ingest of metadata. This allows you to script the extraction and loading of some of the Norstore metadata automatically, which is useful for projects with many datasets. We also plan to implement metadata errata, allowing traceable corrections of metadata.
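Scripted access could then look something like the sketch below. The endpoint and parameter names are invented for illustration, since the API was still being designed at the time.

```python
from urllib.parse import urlencode

# Hypothetical base URL -- the real endpoint had not been published.
SEARCH_BASE = "https://archive.norstore.example/api/datasets"

def search_url(**params) -> str:
    """Build a query URL for a REST metadata search.
    Parameters are sorted so the URL is reproducible in scripts."""
    return f"{SEARCH_BASE}?{urlencode(sorted(params.items()))}"

url = search_url(q="climate", subject="Earth Science")
# url == "https://archive.norstore.example/api/datasets?q=climate&subject=Earth+Science"
```

A project with many datasets could loop over such queries, pull back the matching records, and map fields from its own catalogue onto the Norstore terms.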

Archive: [Architecture diagram: web and CLI interfaces connect users to the Norstore catalogue and to iRODS-managed disk and tape storage in Oslo and Tromso; external users and the project area both feed into the archive.]

The Archive: Components: user interface (web and CLI), iRODS, metadata catalogue, and storage (disk and tape, in Oslo and Tromso). The archive is designed so that any component can be replaced with minimal impact.

User Interface: The primary user interface is web-based; a command-line interface is used for large-dataset interaction with the project area. The interface talks to the Norstore metadata catalogue, which is also used for metadata search: a PostgreSQL database holding all metadata and state information. It also interfaces to the data management system (iRODS).

iRODS Data Management System: A rule-oriented data management system that abstracts the details of distributed storage by providing a logical layer. The logical-to-physical mapping is held in the iRODS metadata catalogue, a PostgreSQL database. iRODS provides access control and interfaces to authentication mechanisms such as GSI and Kerberos. Norstore uses just one archive user to manage the data; users don't interact directly with iRODS, but through the web interface or command-line tools.

iRODS Data Management System: iRODS allows policies to be placed on the data. Norstore has a policy to replicate data to 3 resources, a policy to remove data from one resource and replicate it to a new resource, and a policy to regularly checksum the data.

Archiving a Dataset: Datasets can be archived from a researcher's local computer or from the Norstore project area. Local-computer uploads are achieved via the Filesender service; datasets < 1 TB in size can be uploaded (this limit can be increased). The project area requires users to be registered with a valid project, and data are uploaded via command-line scripts. Once the dataset is uploaded, the metadata needs to be filled in via the web interface.

Norstore Archive workflow: identify data → seek approval → identify metadata → fill in metadata → upload data → verify → request publication → verify metadata → ensure approval → assign DOI → publish.

Project Area Upload: Select 'Project Area Upload' and you receive an email containing the dataset UUID. Create a dataset manifest file: a valid argument for find <dir> ! -type d <file pattern>, e.g. find /projects/ns9999k -name '*.tgz'. Run ArchiveDataset UUID <manifest file>; the job is submitted to a queue and an email is sent when it finishes. Query the status with: ListArchiveDataset UUID.

Publishing Data: Publishing is necessary in order to be able to cite datasets. We currently use the DataCite node in Denmark to issue Digital Object Identifiers. A DOI is a standard, unique identifier that can be used to identify a resource; originally developed for documents, DOIs are now also used for data. Each DOI must point to metadata about the object and may contain a link to the dataset itself. Resolver services are used to resolve the DOI to a URI. The structure of a DOI is meaningful: in doi:10.1000/182, 10 refers to the DOI registry, 1000 refers to the entity that registered the data, and 182 refers to the actual object. Once a dataset is published it cannot be modified, although some metadata may be updated.
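The DOI structure described above can be pulled apart mechanically. A small sketch (the helper names are ours, not part of the DOI system; doi.org is the public resolver):

```python
def parse_doi(doi: str):
    """Split a DOI such as doi:10.1000/182 into the '10.1000' prefix
    (directory code plus registrant) and the '182' object suffix."""
    body = doi.removeprefix("doi:")
    prefix, _, suffix = body.partition("/")
    return prefix, suffix

def resolver_url(doi: str) -> str:
    """Form the URL a resolver service would dereference for this DOI."""
    return "https://doi.org/" + doi.removeprefix("doi:")
```

For a published dataset, dereferencing the resolver URL lands on the dataset's landing page rather than on the data itself.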


Landing page: The landing page is a permanent metadata record for the dataset; all access via the DOI resolves to this page. The page contains links to additional metadata and to the data. A landing page also exists for terminated datasets, called a tombstone record: the link to the data is removed, and it contains additional metadata on when the data was removed and the reason for removal.

Planned Functionality: Imminent: a REST API for searching datasets, providing command-line access to metadata and allowing harvesting of metadata (OpenSearch and OAI-PMH planned). Imminent: versioning of datasets, to accommodate cases where a researcher wishes to update the data (either the data has been migrated to a different format, mistakes were made, or the metadata needs updating). A link back to the previous version will be visible from the landing page; the new version will have a new DOI, and the previous version will remain accessible unless explicitly terminated.

Future Functionality: Subsets of datasets: researchers may be interested in downloading only part of a dataset; via the table of contents it is possible to identify the subset of interest and tag the relevant files for download. Collections of datasets: there may be a logical grouping of datasets (e.g. a series of datasets), and collections can make it easier to link related datasets.

Preserving Datasets: Digital preservation attempts to ensure digital material remains accessible and usable by future users. This is addressed by: ensuring bit-level integrity through data replication; ensuring data is understandable (which may require adding or updating metadata on how to use and interpret the data); ensuring data is discoverable (equipped with the right and relevant metadata and description); and ensuring data is in a usable format (which may require migration from obsolete formats to new formats, or virtual environments).
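The bit-level integrity point can be illustrated with a simple fixity check that compares current file digests against a stored manifest. This is a sketch assuming SHA-256 digests, not the archive's actual replication and checksumming machinery:

```python
import hashlib
from pathlib import Path

def verify_fixity(dataset_dir: str, manifest: dict) -> list:
    """Return the relative paths whose current SHA-256 digest no longer
    matches the one recorded in the manifest, or which are missing."""
    failures = []
    for relpath, expected in manifest.items():
        path = Path(dataset_dir) / relpath
        actual = (hashlib.sha256(path.read_bytes()).hexdigest()
                  if path.is_file() else None)
        if actual != expected:
            failures.append(relpath)
    return failures
```

Run periodically against each replica, a check like this is what lets an archive detect silent corruption and repair a damaged copy from one of the other replicas.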

Migration and Virtualisation: Things to be aware of for migration: What is the best format (most durable, popular, open)? What features in the data need to be maintained, and how can we check that they are? Migration pros/cons: it is easy to use new tools with old data and easier to integrate data into new or current workflows; but it is a one-way street, you may lose features or functionality in the migration that only become relevant later, and it requires experts to assess which features need to be kept and whether they indeed are. Things to be aware of for virtualisation: What type of virtual machine to use (licensing, rendering, performance)? Are all the resources required by the application contained within the VM?

Migration and Virtualisation: Virtualisation pros/cons: it preserves the original features and functionality (little risk of missing something), but it can be difficult to integrate with newer tools, and for large volumes of data it may not be a scalable option. The choice depends on your circumstances and needs.

Auditing: We aim to pass the Data Seal of Approval (http://www.datasealofapproval.org/en/), which ensures the archive conforms to best practice and allows users to assess how reliable the archive is.