Data management: Backgrounds and steps to implementation; a pragmatic approach.
Research and data management through the years: Find the differences
Research and data management through the years: Find the similarities
Research and data management through the years: Similarities. Lots of data (and increasing explosively); lots of information; unstructured; poorly accessible from outside; poorly searchable (dependent on the index used); items hard to find (depending on the system used); barely administered; unsafe (content and carrier); ownership based on geographical location.
Research and data management through the years: As a result, research data are poorly accessible; poorly findable; poorly searchable; poorly reproducible; poorly verifiable; poorly reusable; unsafe.
How do we solve that?
First some definitions, theories and thoughts
Definitions: Data versus information. Data are facts. When data are processed, organised, structured or presented in a specific context to make them useful, they are called information.
Definitions: What are research data? Research data are data collected during research in order to be analysed and thereby produce original results, e.g. measurements, pictures, models, chromatograms, surveys.
Definitions: What are Big Data, and what makes those data BIG? Gartner: 1. High volume 2. High velocity 3. High variety
Definitions: What are metadata? Metadata are data describing characteristics of data. So, metadata are data about data.
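As a minimal sketch of "data about data", the record below describes a hypothetical data file without containing any of its measurements. The class name, the field names and the example values are illustrative assumptions, not a prescribed metadata standard.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetMetadata:
    """Descriptive metadata for one data file (field names are illustrative)."""
    title: str
    creator: str
    date_collected: str  # ISO 8601 date, e.g. "2014-03-01"
    file_format: str
    description: str


# Hypothetical example values; none of them come from a real study.
record = DatasetMetadata(
    title="Reaction time pilot",
    creator="A. Researcher",
    date_collected="2014-03-01",
    file_format="CSV",
    description="Hypothetical example: response times per participant.",
)

# Serialise the record so it can be stored next to the data file.
metadata_json = json.dumps(asdict(record), indent=2)
print(metadata_json)
```

The record says what the file is, who made it and in which format, which is what makes the data findable and reusable later; the measurements themselves stay in the data file.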
Definitions: What is an archive? An accumulation of historical records, or the physical place they are located. Archives contain primary source documents that have accumulated over the course of an individual or organization's lifetime, and are kept to show the function of that person or organization. Source: Wikipedia
Definitions: What is a repository? A place where, or in which, things are or may be stored.
Definitions: What is data deposition? Placing data somewhere, usually in a repository, for safekeeping or as proof over a longer period of time. In law, a deposition is the out-of-court oral testimony of a witness that is reduced to writing for later use in court or for discovery purposes.
Definitions: What is a Dataverse? A Dataverse is a container for research data studies, customized and managed by its owner. A study is a container for a research data set. It includes cataloging information, data files and complementary files. Source: thedata.harvard.edu
Theory: Structuring and arranging: the research data lifecycle. [Figure: lifecycle diagram with the stages Study, concept & design; Data collection; Data processing; Data access & dissemination; Analysis; KT cycle; Research outcomes; plus Data discovery and Data repurposing. The diagram is split into a dynamic (work in progress) part and a static (archive) part.] Data dissemination is the distribution or transmitting of data to end users. Source: Charles Humphrey and Elizabeth Hamilton (2004)
Theory: Knowledge transfer cycle. [Figure: cycle of stages, each with typical outputs.] Conceptualising: e-mails, letters, literature reviews. Initialising: grant applications, meeting minutes. Analysis: presentations, conferences, seminars. Initial results: grant reports, technical reports, thesis. Formalising: journal articles, books, curricula content, policy. Popularising: popular literature, newspapers, practice. Value: teaching/research, numeracy, informed public. Source: Charles Humphrey and Elizabeth Hamilton (2004)
Thoughts: data volume per researcher per three years. Raw data: 1 TB, of which 750 GB remains after cleaning. 8-10 such datasets: ~7.5 TB. Metadata, analyses and reports: 1 GB. That is approx. 8 TB per researcher per three years, so 30 researchers generate about 240 TB in three years.
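The estimate above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 750 GB is read as the cleaned size of one dataset, the upper bound of 10 datasets is used, decimal units are assumed (1 TB = 1000 GB), and the per-researcher figure is rounded up to whole terabytes for planning. The function and variable names are made up for the example.

```python
import math

GB_PER_TB = 1000  # assumption: decimal units


def storage_per_researcher_tb(datasets=10, cleaned_gb_per_dataset=750, metadata_gb=1):
    """Cleaned data plus metadata/reports for one researcher over three years, in TB."""
    total_gb = datasets * cleaned_gb_per_dataset + metadata_gb
    return total_gb / GB_PER_TB


per_researcher = storage_per_researcher_tb()  # 7.501 TB
planning_figure = math.ceil(per_researcher)   # rounded up to whole TB
faculty_total = 30 * planning_figure          # 30 researchers

print(per_researcher, planning_figure, faculty_total)
```

Rounding up before multiplying is a deliberate choice for capacity planning: it leaves headroom instead of assuming every researcher stays exactly on the estimate.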
Recap: Research data were, and still are, poorly accessible; poorly findable; poorly searchable; poorly reproducible; poorly verifiable; poorly reusable; unsafe.
May I introduce
Solution: the data management plan. In a data management plan, a researcher sets out how the research data will be stored, administered, documented, protected and shared.
A data management plan describes:

Information about data & data format
Include a description of data to be produced by the project. This might include (but is not limited to) data that are:
How will the data be acquired? When and where will they be acquired? After collection, how will the data be processed? Include information about the software and algorithms used. Describe the file formats that will be used, justify those formats, and describe the naming conventions used. Identify the quality assurance and quality control measures that will be taken during sample collection, analysis, and processing. If existing data are used, what are their origins? How will the data collected be combined with existing data? What is the relationship between the data collected and existing data? How will the data be managed in the short term? E.g. version control; backing up data and data products; security and protection of data and data products; who will be responsible for management.

Metadata content and format
What metadata are needed? How will the metadata be created and/or captured? Examples include lab notebooks. What format will be used for the metadata? Consider the metadata standards commonly used in the scientific discipline that contains your work.

Policies for access, sharing, and re-use
Describe any obligations that exist for sharing data collected. These may include obligations from funding agencies, institutions, other professional organizations, and legal requirements. Include information about how data will be shared, including when the data will be accessible, how long the data will be available, how access can be gained, and any rights that the data collector reserves for using data. Address any ethical or privacy issues with data sharing. Who owns the copyright? What are the institutional, publisher, and/or funding agency policies associated with intellectual property? Are there embargoes for political, commercial, or patent reasons? Describe the intended future uses/users for the data. Indicate how the data should be cited by others. How will the issue of persistent citation be addressed? For example, if the data will be deposited in a public archive, will the dataset have a digital object identifier (DOI) assigned to it?

Long-term storage and data management
Researchers should identify an appropriate archive for long-term preservation of their data. By identifying the archive early in the project, the data can be formatted, transformed, and documented appropriately to meet the requirements of the archive. Researchers should consult colleagues and professional societies in their discipline to determine the most appropriate database, and include a backup archive in their data management plan in case their first choice goes out of existence. Early in the project, the primary researcher should identify what data will be preserved in an archive. Usually, preserving the data in its most raw form is desirable, although data derivatives and products can also be preserved.

Budget
Data management and preservation costs may be considerable, depending on the nature of the project. By anticipating costs ahead of time, researchers ensure that the data will be properly managed and archived. Potential expenses that should be considered are: personnel time for data preparation, management, documentation, and preservation; hardware and/or software needed for data management, backing up, security, documentation, and preservation; costs associated with submitting the data to an archive. The data management plan should include how these costs will be paid.
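The five parts of a data management plan listed above can be treated as a simple checklist. The sketch below is illustrative only: the section names are taken from the slide, while the function name, the dictionary layout and the draft text are assumptions made for the example.

```python
# Section names follow the slide; the checking logic is an illustrative assumption.
DMP_SECTIONS = [
    "Information about data & data format",
    "Metadata content and format",
    "Policies for access, sharing, and re-use",
    "Long-term storage and data management",
    "Budget",
]


def missing_sections(plan):
    """Return the DMP sections that are absent or left empty in `plan`."""
    return [name for name in DMP_SECTIONS if not plan.get(name, "").strip()]


# Hypothetical draft: two sections filled in, one left empty, two not started.
draft_plan = {
    "Information about data & data format": "Survey data in CSV; named per project convention.",
    "Metadata content and format": "",
    "Budget": "Personnel time and archive submission costs.",
}

todo = missing_sections(draft_plan)
print(todo)
```

A check like this only shows that every section has been addressed; whether the answers are adequate still needs a human reviewer.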
Recap: Research data lifecycle. [Figure: lifecycle diagram with the stages Study, concept & design; Data collection; Data processing; Data access & dissemination; Analysis; KT cycle; Research outcomes; plus Data discovery and Data repurposing. The diagram is split into a dynamic (work in progress) part and a static (archive) part.] Source: Charles Humphrey and Elizabeth Hamilton (2004)
How does this fit into the (former) models?
How to get there, and how we at FPN did it
A. Define a guideline, DHMR (Data handling & methods reporting), describing regulations for data packages:
   1. Metadata
   2. Data collection
   3. RAW data file definition
   4. Data storage
   5. Materials
   6. Statistical processing
   7. Processed data file
   8. Access and verification
   9. Retention
B. Let the (faculty) board decide on implementing the guideline
C. Define rules/regulations for a DHMR commission
D. Establish a DHMR commission* with a number of tasks:
   a. (Re)defining and updating the DHMR guidelines.
   b. Overseeing compliance with DHMR by means of internal evaluations and/or audits of research.
   c. Drawing up an annual DHMR report.
   d. Providing the dean with solicited and unsolicited advice about: aspects of DHMR (e.g. infrastructure, open access); (legal) aspects of data management; DHMR training (staff, facilities, programmes); research culture and ethics.
E. Install the DHMR commission
F. Implement and start working according to DHMR
* Consists of senior researchers; at FPN ORa
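The guideline sections on the RAW data file and on access and verification imply that a stored raw file can later be shown to be unchanged. One common way to support that is to record a checksum when the file is archived and recompute it at audit time. This is offered purely as an illustration and not as the procedure the DHMR guideline prescribes; the file name, its content and the helper name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw_data.csv"  # hypothetical raw data file
    raw_file.write_text("participant,rt_ms\n1,512\n", encoding="utf-8")

    recorded = sha256_of(raw_file)  # stored when the package is archived
    unchanged = sha256_of(raw_file) == recorded

    # Simulate a later edit to the raw file and check that it is detected.
    raw_file.write_text("participant,rt_ms\n1,999\n", encoding="utf-8")
    modified_detected = sha256_of(raw_file) != recorded

print(unchanged, modified_detected)
```

A checksum shows only that a file differs from the recorded version, not who changed it or why, so it complements rather than replaces the audits described under task b.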