Data Curation Profile Human Genomics

Profile Author: J. Carlson, N. Brown
Author Institution: Purdue University
Contact: J. Carlson, jrcarlso@purdue.edu
Date of Creation: October 27, 2009
Date of Last Update:
Version: 1.0
Discipline / Sub-Discipline:

Purpose: Data Curation Profiles are designed to capture requirements for specific data generated by a single scientist or scholar, as articulated by the scientist himself or herself. They are also intended to enable librarians and others to make informed decisions in working with data of this form, from this research area or sub-discipline. Data Curation Profiles employ a standardized set of fields to enable comparison; however, they are designed to be flexible enough for use in any domain or discipline.

Context: A profile is based on the reported needs and preferences for these data. Profiles are derived from several kinds of information, including interview and document data, disciplinary materials, and standards documentation.

Sources of Information:
- An initial interview with the scientist conducted in June 2008.
- A second interview with the scientist conducted in January 2009.
- A questionnaire completed by the scientist as a part of the second interview.
- A published paper explaining the research and the methodology used to gather, process, and analyze the data set in question.

Scope Note: The scope of individual profiles will vary, based on the author's and participating researcher's background, experiences, and knowledge, as well as the materials available for analysis.

Editorial Note: Any modifications of this document will be subject to version control, and annotations require a minimum of creator name, date, and identification of related source documents.

Author's Note: This Human Genomics data curation profile is based on analysis of interviews, a completed worksheet, and information gathered from a publication, all collected from a researcher working in this research area or sub-discipline. Some subsections of the profile were left blank; this occurs when there was no relevant data in the interview or available documents used to construct this profile.

URL: http://www.datacurationprofiles.org
http://dx.doi.org/10.5703/1288284315006

Brief summary of data curation needs

The data set consists of a MySQL database and several text files containing data used for reference purposes. The data are currently posted on the scientist's personal web site after the results of her work have been published. However, she has stated that she does not have the resources or expertise needed to maintain the data adequately. Once the scientist is confident that the results of her research are sound, she would like to integrate her data into other genomic data repositories. The scientist believes that there is a need to archive her data and to provide researchers at her institution with resources and mechanisms for archiving data.

Note: The scientist views data management and curation primarily from a technical perspective. Although the scientist is happy that the Libraries are interested in data, she perceives the primary responsibility for addressing data issues as falling to the campus IT department.

Overview of the research

Research area focus: The scientist describes her work as generating the criteria needed to do experiments in human genomic research. Her research focuses on the discovery of codes that regulate the expression of human genes. This research is done by collecting and analyzing 9-mers (segments of DNA code, A, G, C, T, that are 9 bases long) from specific regions and predicting which ones seem to function as specific codes that turn a gene's function on or off. As her research is computational rather than experimental in nature, much of her work centers on collecting and analyzing data that are publicly available in human genome browsers.

Intended audiences: Other researchers who are working in the field of human genomics.

Funding sources: None, although the scientist may seek funding for her work in this area in the future. As she has not received outside funding for this research, the scientist is not mandated to generate a data management plan or share her data with others outside of her lab.

Data kinds and stages

Data narrative: The first step in the scientist's research process is to generate a reference table of all possible 9-mers that may occur in DNA. Next, two sets of 9-mers were obtained from the National Center for Biotechnology Information's (NCBI) nucleotide database and the UCSC genomic browser. The first set of 9-mers was selected from DNA sequences containing known promoters with accurately defined transcription initiation sites as identified by other researchers. This data set is smaller but more likely to provide valid results. The second set of 9-mers was selected using proximal promoters that were based on predicted transcription initiation sites. The 9-mers were then cleaned using Perl scripts. This process included checking that the beginning of each gene was assigned accurately in the first set, and eliminating any incomplete or ambiguous codes, as well as any redundancies, in the second set.
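The first two steps of the narrative above (building the reference table and cleaning the collected 9-mers) can be illustrated with a minimal sketch. The scientist's own scripts are written in Perl and are not reproduced in this profile; the Python below is only an illustration, and the function names `all_kmers` and `clean_kmers` are hypothetical. It assumes that "incomplete" means a code shorter or longer than 9 bases and that "ambiguous" means a code containing any character other than A, C, G, or T.

```python
from itertools import product

BASES = "ACGT"
K = 9  # length of the codes studied in this profile


def all_kmers(k=K):
    """Yield every possible DNA k-mer in lexicographic order."""
    for combo in product(BASES, repeat=k):
        yield "".join(combo)


def clean_kmers(kmers, k=K):
    """Drop incomplete or ambiguous k-mers and remove duplicates.

    Assumption: 'incomplete' = wrong length, 'ambiguous' = contains a
    character outside A/C/G/T. First-seen order is preserved.
    """
    seen = set()
    cleaned = []
    for kmer in kmers:
        kmer = kmer.strip().upper()
        if len(kmer) != k or set(kmer) - set(BASES):
            continue  # incomplete or ambiguous code
        if kmer in seen:
            continue  # redundant entry
        seen.add(kmer)
        cleaned.append(kmer)
    return cleaned


reference = list(all_kmers())
print(len(reference))  # 4**9 = 262144 possible 9-mers
```

With four bases and nine positions, the reference table holds 4^9 = 262,144 entries, which is small enough to keep in a single text file or database table.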

Once the data sets were clean, they were placed into a MySQL database running on Linux. Data collection, access, retrieval, management, and analysis were done using Perl scripts and modules obtained from BioPerl (http://www.bioperl.org). The data were then mathematically filtered using Perl scripts. Processing consists of comparing the 9-mers present in the whole human genome with the 9-mers located within genes themselves, along with how often they occur. 9-mers that occur more frequently in proximity to gene promoters and do not appear frequently in the genome are identified as candidates for analysis.

After processing, the scientist analyzes the 9-mers by generating density plots showing which 9-mers are most often found in proximal promoters. These density plots are then compared to the density of the number of occurrences of each 9-mer in human DNA to see if there are any statistically significant differences between the two sets. A statistically significant difference indicates that a particular 9-mer may be involved in gene regulation and warrants further investigation.

The scientist was unable to answer questions regarding the size of the files, as she does not manage her files directly. The categories in the data stage column listed in the table below were developed by the authors of this data curation profile. The data specifically designated by the scientist to make publicly available are indicated by the rows shaded in gray.

Data Stages

Data stage: Reference
  Output: A reference table of all possible 9-mers
  Typical file size: 5.53 MB
  Format: .txt
  Other / Notes: Computationally generated; all possible 9-mers that may appear in DNA.

Data stage: Raw
  Output: Two base sets of 9-mers
  Typical file size:
  Format: .xls
  Other / Notes: Data gathered from genomic browsers according to specific criteria.

Data stage: Cleaned
  Output: MySQL database of 9-mers meeting the scientist's criteria
  Typical file size:
  Format: sql
  Other / Notes: Raw data are checked and cleaned using Perl scripts and deposited into a MySQL database.

Data stage: Processed
  Output: 9-mers identified according to their location and frequency
  Typical file size:
  Format: sql, .txt
  Other / Notes: Data are prepared for statistical analysis.

Data stage: Analyzed
  Output: Density plots
  Typical file size:
  Format: sql
  Other / Notes: The analyzed statistics include density plots generated from tools available from a genomic data repository.

Supplementary Data Files for Reference

Data stage: Augmentative Data
  Output: Collections of filtered data that serve as reference documents for the scientist.
  Typical file size:
  Format: .txt
  Other / Notes: Files are publicly available through the scientist's web site.

Note: Empty cells represent cases in which information was not collected or the scientist could not provide a response.
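The processing step in the data narrative (finding 9-mers that are frequent near promoters but infrequent in the genome as a whole) can be sketched as follows. This is an illustration only, not the scientist's method: her Perl scripts and statistical criteria are not given in this profile. The function names, the enrichment ratio, the `min_ratio` threshold, and the add-one pseudocount are all assumptions made for the example; the minimum count of three is borrowed from the profile's mention of 9-mers found three or more times in proximal promoters.

```python
from collections import Counter


def count_kmers(sequences, k=9):
    """Count every unambiguous k-mer found in a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with ambiguous bases
                counts[kmer] += 1
    return counts


def candidate_kmers(promoter_counts, genome_counts,
                    min_promoter_count=3, min_ratio=2.0):
    """Return k-mers enriched in promoters relative to the genome.

    Assumptions (hypothetical, for illustration): enrichment is the ratio
    of relative frequencies, and an add-one pseudocount avoids division
    by zero for k-mers absent from the genome sample.
    """
    promoter_total = sum(promoter_counts.values())
    genome_total = sum(genome_counts.values())
    candidates = {}
    for kmer, n in promoter_counts.items():
        if n < min_promoter_count:
            continue  # too rare in promoters to consider
        promoter_freq = n / promoter_total
        genome_freq = (genome_counts.get(kmer, 0) + 1) / (genome_total + 1)
        ratio = promoter_freq / genome_freq
        if ratio >= min_ratio:
            candidates[kmer] = ratio
    return candidates
```

A candidate returned by a filter of this kind is only a starting point; the profile describes a further comparison of density plots and a test for statistical significance before a 9-mer is considered for follow-up.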

Target data for sharing

The MySQL database developed by the scientist was made available from her website, but is currently inaccessible. The scientist stated that she lacks the resources to maintain this database effectively. To support this research, the scientist developed files listing all possible 9-base elements that may occur in DNA (the reference data set), the 9-mers that were found three or more times in proximal promoters of human genes, and the predicted correspondence of the collected 9-mers to potential transcription factor binding sites. These files are also publicly available and are currently accessible.

Use/re-use value of the data

According to the scientist, her work thus far has been primarily exploratory in nature. Therefore she has not yet contributed her data to genome browsers such as the one at UCSC. She wants to conduct further evaluations to provide more evidence and to feel confident that her predictions are accurate. She does plan to contribute her data (the database and the density plots) to existing genomic browsers in the future. It is unclear if the scientist would be willing to make the density plots available to others before submitting them to genomic browsers. However, the data that she generates are of value to other researchers engaged in similar areas of study. The work that the scientist has done to prepare genomic data for research in gene regulation could be used by others to do their own calculations and make their own discoveries in this area. The Perl scripts used to prepare the data would also be of use to others doing similar research.

Contextual narrative

The scientist's research produces a number of outputs related to the data in multiple formats. Data are collected from genomic browsers in spreadsheets and stored in this format until cleaned. The centerpiece of her data is the MySQL database containing clean data sets for her purposes. In addition to the reference data file, the scientist has also created several supplementary text files. These text files contain subsections of the data that have been relevant to the scientist's work, and could be used by others doing similar types of research. These text files include the 9-mers that were found three or more times in proximal promoters of human genes and the predicted correspondence of the collected 9-mers to potential transcription factor binding sites. Other outputs include the density plots generated using C++, some of which have been reproduced in publications, and the Perl scripts used to clean the data and prepare it for analysis.

The genomic data gathered from genomic repositories are described by the scientist as experimental. As such, the Perl scripts are used to mathematically filter the data according to specific criteria to normalize the data in preparation for analysis. The scientist has published a journal article which explains her methodologies in gathering and working with her data. The scientist has another article in press that focuses on the density plots, and a previous book chapter based upon this research.

Intellectual property context and information

Data owner(s): The researcher does not feel that she owns the data; in fact, she feels that no one individual or institution really owns the data at all. According to her, the culture of the human genomic field is that there is no one owner of the research data. The research culture supports, even expects, that data will be shared with others. The field of human genomic research is so new, potentially vast, and ripe for exploration that access to data is vital for the field to grow and develop. Any restrictions on access to genomic data would stifle progress and slow the development of an important area of research.

Stakeholders: The systems manager who designed the Perl scripts may be a stakeholder, although the scientist did not characterize him in this manner during discussions.

Terms of use (conditions for access and (re)use): None stated. The data are currently made available from the scientist's personal website without any statements from the scientist concerning access or re-use.

Attribution: The ability to cite the dataset in publications is a high priority for the scientist.

Organization and description of data for ingest (incl. metadata)

Overview of data organization and description: There are two different data types within the data set: a database and text files. There is also a set of Perl scripts included with the data that were used to process it. The scientist stated that she uploaded the data to the genomic browser at UCSC to use its tools to analyze her data, which may imply that the structure of the database is compatible with the structure of the UCSC browser. The scientist would like to integrate her data with the UCSC browser at some point in the future, when she is more confident in her results. The text files contain the citation to the paper at the top of the file and some explanatory text about the data. The application of standardized genomics metadata to the dataset is a high priority for the scientist.

Formal standards used: Unknown. The metadata fields in this database may be compatible with the genomic browser at UCSC; however, this cannot be confirmed as of this writing because the scientist's database is currently offline.

Locally developed standards: Unknown.

Crosswalks: Unknown. It is unclear at this time if some form of crosswalk would be needed to connect the scientist's data with the UCSC genomic browser.

Documentation of data organization/description: The methodology behind data acquisition, processing, and analysis is described in a published article. Although this description does not provide detailed information on the organization or description of the data, the scientist believes that it does provide enough information for others to understand and use the data properly. As the published paper(s) serve as the documentation of the data, the scientist would like to be able to link the data to the publication in some manner.

Ingest

The ability for the scientist to submit the data to a repository herself is a higher priority for her than the ability to automate the process of data submission into a repository.

Access

Willingness / Motivations to share: The scientist is willing to share the data once the findings from the research have been published. Her primary reluctance to share her data any earlier is that she wants to be sure that the data are usable and have value. The scientist does currently post her data, and the Perl scripts used in her research, publicly on her personal website for others to discover and access. As of this writing, however, access to the database was not functioning properly.

Embargo: Once the publication is out, the scientist does not require any embargo of the data.

Access control: Once the findings have been published, she is willing to share the data with anyone without restriction (although she would like others to cite the publication that uses the data, if they use the data themselves).

Secondary (Mirror) site: A secondary data access (mirror) site that would allow continued access to the data if the repository is offline is a high priority for the scientist.

Discovery

Although the scientist has made her data publicly available through her website, she does not have a sense of how people discover the data. She does refer people to the data at conferences and to her website in her publications. The ability for researchers within her field and outside of her field to easily find the data is a high priority for the scientist. The ability for people to find the data through internet search engines is also a high priority. The scientist makes her data accessible from her web page in a MySQL database, which could provide insight about the desired search functionality for her data. However, the database is offline as of this writing.

Tools

The Perl scripts used by the scientist to analyze the data are made available alongside the data. The MySQL database was developed in a Linux environment. There are tools available through genomic browsers at NCBI, UCSC, and others to process and analyze this type of data. The ability to connect the data to these tools is a high priority for the scientist. The scientist selected "I don't know or NA" in response to the question on selecting a level of priority for a data repository to support the use of web services APIs.

Interoperability

Research in genomics is highly dependent upon access to previous work and data gathered by others in the field. According to the scientist, there is a cultural expectation in the genomics field that data will be made available for analysis by other researchers. Linkages between datasets and tools to process and/or analyze the data are a high priority for the scientist. According to the scientist, the particular dataset under discussion is too specialized and too experimental in its current state of development to be incorporated into the NCBI or another genomic repository. However, once the scientist is confident that her predictions are accurate, she does intend to deposit her data analysis with UCSC. Therefore, the scientist's data may need to be aligned with the structures and standards used at UCSC if this has not been done already. As the published paper(s) serve as the documentation of the data, the scientist would like to be able to link the data to the publication in some manner. The scientist responded "I don't know or NA" to the question about the ability for the repository to support the use of web services APIs.

Measuring impact

Usage statistics: The ability to view usage statistics is a high priority for the scientist. The scientist currently employs a usage counter on the database she has made publicly available from her website.

Gathering information about users: Gathering information about the users of her data was not addressed by the scientist.

Data management

Security / Back-ups: The database is currently backed up on a regular basis (although the scientist was not sure how frequently back-ups occurred). The scientist is aware of the need to plan for data migration but has not taken steps to draft such plans.

Secondary storage site: A secondary data storage site is a high priority for the scientist. A secondary data storage site at a different geographic location is also a high priority.

Preservation

Duration of preservation: The scientist would like her data to be preserved for more than 5 years but less than 10 years, depending on the data's intrinsic value over time.

Data provenance: The scientist selected "I don't know or NA" in response to the question on assigning a priority level for documenting any and all changes made to her data over time.

Data audits: The ability to audit the dataset is a high priority for the scientist.

Version control: The ability of the repository to provide version control for this data set is a high priority for the scientist.

Format migration: The ability to migrate the dataset into new formats over time is a high priority for the scientist.

Personnel

This section is to be used to document roles and responsibilities of the people involved in the stewardship of this data. For this particular profile, information was gathered as a part of a study directed by human subject guidelines and therefore we are not able to populate the fields in this section.

Primary data contact (data author or designate):
Data steward (ex. library / archive personnel):
Campus IT contact:
Other contacts:
Notes on personnel: