NFAIS Open Data Fostering Open Science June 20, 2016 Dataverse and DataTags Mercè Crosas, Ph.D. Chief Data Science and Technology Officer Institute for Quantitative Social Science Harvard University @mercecrosas
Research data publishing is the release of research data, associated metadata, accompanying documentation, and software code (in cases where the raw data have been processed or manipulated) for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way. RDA Data Publishing Workflows Working Group; 10.5281/zenodo.34542
Data Publishing is sharing data that are: Findable, Accessible, Interoperable, Reusable
Why publish data?
Researchers: Get credit for their data
Publishers and Journals: Verify published work
Federal funding agencies: Make public assets public
Science: Validate, reuse and extend previous work
Ways of Publishing Data Journal's data policy Scholarly Article Data in Repository Data Descriptor or Data Paper Data in Repository Scholarly Article Published Dataset in Repository Scholarly Article
A data repository system for sharing and archiving research data A Solution for Publishing FAIR research data: Findable, Accessible, Interoperable, Reusable
http://dataverse.org Created and developed at Harvard's Institute for Quantitative Social Science Harvard Dataverse: Generic data repository open to researchers worldwide http://dataverse.harvard.edu
Dataverse Today: A Growing Community
Dataverse Project:
Dataverse installations: 19, serving > 200 universities
User Community group: 294 members
Open-source software: 29 contributors
Dataverse Community Meeting (July 2016): 107 registered so far
Twitter: 2,940 followers
Harvard Dataverse Repository:
Registered users: 13,795; 300 new per month
Dataverses: 1,677; 50 new per month
Journal Dataverses: 91
Datasets: 61,781; 400 new per month
Data Files: 330,462; 3,000 new per month
Dataverses contain datasets or dataverses.
Datasets contain metadata and data files.
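The containment rule on this slide can be sketched as a small recursive data structure. This is only an illustration of the nesting described above, not Dataverse's actual data model; the class names, field names, and example titles are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str


@dataclass
class Dataset:
    title: str
    # A dataset holds metadata and data files.
    metadata: dict[str, str] = field(default_factory=dict)
    files: list[DataFile] = field(default_factory=list)


@dataclass
class Dataverse:
    name: str
    # A dataverse holds datasets and, recursively, other dataverses.
    children: list[Dataverse | Dataset] = field(default_factory=list)

    def count_datasets(self) -> int:
        """Count datasets in this dataverse and all nested dataverses."""
        total = 0
        for child in self.children:
            if isinstance(child, Dataset):
                total += 1
            else:
                total += child.count_datasets()
        return total


# Hypothetical example: a university dataverse with one nested lab dataverse.
lab = Dataverse(
    "Example Lab",
    [Dataset("Survey 2015", {"author": "A. Researcher"}, [DataFile("survey.tab")])],
)
root = Dataverse("Example University", [lab, Dataset("Campus Census")])
dataset_total = root.count_datasets()
```

Here `root` contains one dataset directly and one through the nested `lab` dataverse, so the recursive count covers both.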
Dataverse follows best practices for FAIR Data Publishing
Best Practices
Data Citation: Reference, locate and attribute
Metadata: Discover and reuse
Access Control and Rules: Access protecting privacy
APIs and Standards: Interoperate
Data Citation in Dataverse
Components of a dataset citation: Authors; Published Year; Dataset Title; Global Persistent Identifier; Repository (= Data Publisher); Version (or time range)
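As a rough illustration, the citation components listed on this slide can be assembled into a single string. The separators and quoting below are assumptions made for the sketch, not the exact format Dataverse emits, and the authors, title, identifier, and repository name are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DataCitation:
    authors: list[str]
    year: int
    title: str
    persistent_id: str  # global persistent identifier, e.g. a DOI
    repository: str     # the repository acts as the data publisher
    version: str        # version (or time range)

    def format(self) -> str:
        """Join the components in the order shown on the slide."""
        author_part = "; ".join(self.authors)
        return (
            f'{author_part}, {self.year}, "{self.title}", '
            f"{self.persistent_id}, {self.repository}, {self.version}"
        )


# Placeholder values for illustration only.
citation = DataCitation(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2016,
    title="Example Survey Data",
    persistent_id="doi:10.5072/FK2/EXAMPLE",
    repository="Example Dataverse",
    version="V1",
)
citation_text = citation.format()
```

Keeping the components as separate fields rather than a pre-built string means the same record can also be exported to other citation formats.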
Data Citation Basics
The dataset landing page is accessible and guaranteed by the repository (data publisher), even when data are restricted or deaccessioned.
(Force11, Joint Declaration of Data Citation Principles, 2014; Starr et al., 2015)
Metadata in Dataverse
Citation Metadata. Fields: author, title, repository, year published, version, etc. Standards: Dublin Core, DataCite.
Domain-specific Metadata. Fields: data collection info (methods, organism, observation, survey, experiment, etc.). Standards: DDI (social sciences), ISA-Tab, BioCaddie (biomed), Virtual Observatory (astro), + custom metadata blocks.
File-level Metadata. Fields: metadata inside the data file (variables, instrument details, geospatial info, etc.). Standards: DDI (for variables), + more to be determined.
Dataverse JSON Schema
Tiered Access
Open (default): CC0. Metadata: Open. Files: Open. How to access: Click to download.
GuestBook. Metadata: Open. Files: Open. How to access: Fill in guestbook before download.
Terms of Use. Metadata: Open. Files: Open. How to access: Click through terms of use before download.
Data Restricted. Metadata: Open. Files: Restricted. How to access: Request access via click-through.
Data Restricted. Metadata: Open. Files: Restricted. How to access: Request access via application.
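The access tiers on this slide can be written down as a small lookup table. This is a sketch for illustration, not Dataverse's permission model; the class and function names are hypothetical, and the two "Data Restricted" rows are given distinguishing suffixes only so they can be told apart in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessTier:
    name: str
    metadata_open: bool
    files_open: bool
    how_to_access: str


# One entry per row of the slide, in the same order.
TIERS = [
    AccessTier("Open (default): CC0", True, True, "Click to download"),
    AccessTier("GuestBook", True, True, "Fill in guestbook before download"),
    AccessTier("Terms of Use", True, True,
               "Click through terms of use before download"),
    AccessTier("Data Restricted (click-through)", True, False,
               "Request access via click-through"),
    AccessTier("Data Restricted (application)", True, False,
               "Request access via application"),
]


def restricted_tiers(tiers):
    """Return the names of tiers whose files are not openly downloadable."""
    return [tier.name for tier in tiers if not tier.files_open]


restricted = restricted_tiers(TIERS)
```

Note that metadata stays open in every tier; only file access changes from one tier to the next.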
Data Publishing Workflows
Create Dataset (landing page restricted)
Review (collaborators or anonymous review)
Publish v. 1
Minor change (metadata only): Publish v. 1.1
Major change (might include new data file): Publish v. 2
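The versioning rule in this workflow (a minor change bumps the second number, a major change bumps the first) can be sketched as a short function. The function names are hypothetical, and the sketch writes the slide's "v. 1" and "v. 2" as 1.0 and 2.0 for uniformity.

```python
from __future__ import annotations


def next_version(current: tuple[int, int] | None,
                 major_change: bool) -> tuple[int, int]:
    """Return the version to publish next.

    current is None for a dataset that has never been published.
    """
    if current is None:
        return (1, 0)               # first publication
    major, minor = current
    if major_change:
        return (major + 1, 0)       # e.g. a new data file was added
    return (major, minor + 1)       # metadata-only change


def label(version: tuple[int, int]) -> str:
    return f"v. {version[0]}.{version[1]}"


v1 = next_version(None, major_change=False)
v1_1 = next_version(v1, major_change=False)
v2 = next_version(v1_1, major_change=True)
history = [label(v) for v in (v1, v1_1, v2)]
```

The resulting `history` follows the same sequence as the slide: first publication, a metadata-only update, then a major update.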
Learn more at dataverse.org guides
Current Research Grants
Privacy tools to share sensitive data
Data provenance
Social Science Big Data
Journal articles connected to data
Data Privacy
Biomedical large-scale data
How can we maximize data publishing of sensitive data while being mindful of privacy?
The DataTags System Sweeney L, Crosas M, Bar-Sinai M. Sharing Sensitive Data with Confidence: The DataTags System. Technology Science. 2015101601. October 16, 2015. http://techscience.org/a/2015101601
A datatag is a set of security features and access requirements for file handling. A datatags repository is one that stores and shares data files in accordance with a standardized and ordered set of levels of security and access requirements.
Datatags Levels
Blue (Public). Security features: clear storage, clear transmission. Access requirements: open.
Green (Controlled public). Security features: clear storage, clear transmission. Access requirements: email, OAuth verified registration.
Yellow (Accountable). Security features: clear storage, encrypted transmit. Access requirements: password, registered, approval, click DUA.
Orange (More accountable). Security features: encrypted storage, encrypted transmit. Access requirements: password, registered, approval, signed DUA.
Red (Fully accountable). Security features: encrypted storage, encrypted transmit. Access requirements: two-factor authentication, approval, signed DUA.
Crimson (Maximally restricted). Security features: multi-encrypt store, encrypted transmit. Access requirements: two-factor authentication, approval, signed DUA.
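Because the six tags form an ordered scale, they map naturally onto an ordered enumeration. The sketch below is an illustration of that ordering only, not the DataTags implementation: the names `DataTag`, `POLICY`, `strictest`, and `requires_encrypted_storage` are hypothetical, and the idea that a mixed collection takes the strictest tag among its files is an assumption made for the example.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """Tags ordered from least to most restrictive, as on the slide."""
    BLUE = 1      # Public
    GREEN = 2     # Controlled public
    YELLOW = 3    # Accountable
    ORANGE = 4    # More accountable
    RED = 5       # Fully accountable
    CRIMSON = 6   # Maximally restricted


# (storage, transmission, access requirements) per tag, copied from the slide.
POLICY = {
    DataTag.BLUE: ("clear", "clear", "Open"),
    DataTag.GREEN: ("clear", "clear", "Email, OAuth verified registration"),
    DataTag.YELLOW: ("clear", "encrypted",
                     "Password, Registered, Approval, Click DUA"),
    DataTag.ORANGE: ("encrypted", "encrypted",
                     "Password, Registered, Approval, Signed DUA"),
    DataTag.RED: ("encrypted", "encrypted",
                  "Two-factor authentication, Approval, Signed DUA"),
    DataTag.CRIMSON: ("multi-encrypt", "encrypted",
                      "Two-factor authentication, Approval, Signed DUA"),
}


def strictest(tags):
    """Assumed rule: a mixed collection is handled at its strictest tag."""
    return max(tags)


def requires_encrypted_storage(tag):
    return POLICY[tag][0] != "clear"


combined = strictest([DataTag.GREEN, DataTag.RED, DataTag.YELLOW])
```

Using `IntEnum` makes comparisons such as `DataTag.ORANGE > DataTag.YELLOW` work directly, which is what lets `max` pick the most restrictive tag.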
DataTags Workflow in a Dataverse Repository (under development)
Diagram labels: Sensitive Dataset; Data File Ingestion; Automatic Interview; Review Board Approval; Direct Access; Privacy Preserving Access; Two-factor Authentication; Signed DUA
http://datatags.org http://privacytools.seas.harvard.edu
Example of DataTags Interview: A sequence of questions from an expert system
Example of DataTags Interview: Final datatag human-readable and machine-actionable policy
Summary
Data sharing is good for researchers, journals, funding agencies, and science.
Dataverse is open-source software for building data repositories to share research data.
Data citation and rich metadata support are key to Dataverse, and enable FAIR data publishing.
Dataverse also supports tiered access to data, and review and versioning workflows for data publishing.
DataTags generates human-readable and machine-actionable policies to support sensitive datasets in data repositories.
Join us at this year's Dataverse Community Meeting
References
@mercecrosas and http://scholar.harvard.edu/mercecrosas
http://dataverse.org
http://dataverse.harvard.edu
http://datatags.org
Wilkinson et al., 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship, Scientific Data
Altman, Borgman, Crosas, Martone, 2015, An Introduction to the Joint Data Citation Principles, Bulletin of the Association for Information Science and Technology
Starr et al., 2015, Achieving Human and Machine Accessibility of Cited Data in Scholarly Publications, PeerJ Computer Science
Meyer et al., 2016, Data Publication with the Structural Biology Grid Supports Live Analysis, Nature Communications
Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System, Technology Science