Data publication and discovery with Globus

Questions and comments to outreach@globus.org

The Globus data publication and discovery services make it easy for institutions and projects to establish collections, which are dataset repositories with customized curation policies, metadata schema, access policies, etc.; for researchers to create and populate collections with datasets comprising (big) data and metadata; and for researchers to search, browse, and access datasets contained within any collections they are authorized to access. A software-as-a-service (SaaS) approach means that the costs associated with establishing, accessing, and operating repositories and collections are low.

Globus data publication and discovery services were first demonstrated in prototype form at the GlobusWorld conference in April 2015. A video of the demonstration is available at www.globus.org/data-publication. The initial production version of the services was released in October 2015.

Globus data publication features: SaaS for publishing large data; bring your own storage; extensible metadata; publication and curation workflows; public and restricted collections; rich discovery model.

Globus background

Globus leverages cloud-based software-as-a-service (SaaS) methods to deliver powerful research data management capabilities over the network. For researchers, Globus accelerates discovery by providing easy-to-use, self-service capabilities for rapid, reliable, and secure data management. For campuses and other resource providers, Globus enhances the value, usability, and manageability of network and storage system resources. For tool developers, Globus can serve as a platform, allowing applications to outsource time-consuming tasks such as federated user and group management, and data management.

Globus currently supports file transfer, sharing, and publication. As of April 2017, it has 56,000 registered users, more than 10,000 active endpoints (i.e., storage systems connected to Globus), and has moved more than 256 petabytes in 36 billion files. It is recommended by major research facilities such as NSF's XSEDE and DOE's NERSC. Globus development and operations are supported by the DOE, NIH, NSF, Amazon, Sloan Foundation, and University of Chicago. A subscription model for premium features provides for long-term sustainability (details on Globus Subscriptions are at www.globus.org/subscriptions).

Data publication and discovery

A combination of federal mandates and good research practice is driving increased interest in data publication capabilities. Few tools are currently available to librarians, digital media managers, and others in campus organizations tasked with managing data publication. Typical approaches involve developing, installing, and configuring various software components, and integrating these with existing campus identity and storage systems. This is a costly and time-consuming activity that few can afford.

In response to these challenges, Globus publication offers features that make it easy to identify, describe, curate, verify, access, and preserve data at appropriate levels of durability. Additional planned features will enable rich discovery by making it possible to search, browse, and access large published datasets, irrespective of where they may be stored.

Globus data publication services benefit many stakeholders, including researchers, librarians, and campus IT managers.

Researchers have a straightforward way to publish their data; they can define specific and relevant metadata; they can easily annotate and describe large datasets with the help of automation; and they can discover relevant datasets published by others, both within and beyond their field of study.

Librarians can leverage robust, scalable infrastructure to curate and preserve diverse data collections at modest cost; they have complete control over the curation process; and they do not need to invest in custom solutions or acquire unnecessary technical skills.

Campus IT managers can use existing storage resources as repositories and existing campus identity systems for authentication; and they can provide advanced services to students, faculty, and staff with minimal new investment.

System Overview

Globus publication capabilities are delivered through a hosted service, meaning that no software need be installed or operated to run the service. Published data and associated metadata are stored on campus, institutional, or group resources that are managed and operated by their respective administrators. A copy of the metadata is also stored in the cloud and indexed to support rich discovery models, including cross-collection and community searches.

To enable a storage resource as a repository for data publication, administrators configure a Globus endpoint for sharing and then associate the endpoint with the data collection through Globus. Creating a Globus endpoint is a straightforward process requiring just a few commands to install and configure Globus Connect Server (www.globus.org/globus-connect-server).

Datasets are published into collections, which in turn can be organized into communities (see Figure 1). For example, an Argonne National Laboratory community has several member collections: Advanced Photon Source; Center for Nanoscale Materials; and Computing, Environment and Life Sciences, to name a few. Often, collections will map to a department or group within an institution, but this is not required. Globus users can create and manage their own communities and collections through the data publication service. A collection enables the submission of datasets with policies regarding access.
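The containment hierarchy described above (datasets in collections, collections in communities) can be sketched as a small data model. This is only an illustration: the class names, the endpoint labels, and in particular the rule that a collection without its own access policy falls back to its community's policy are assumptions made for this sketch, not a description of how the Globus service is implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class AccessPolicy:
    """Hypothetical user- and group-based access rule."""
    users: Set[str] = field(default_factory=set)
    groups: Set[str] = field(default_factory=set)
    public: bool = False

    def allows(self, user: str, user_groups: Set[str]) -> bool:
        # Access is granted if the policy is public, names the user,
        # or names at least one group the user belongs to.
        return self.public or user in self.users or bool(self.groups & user_groups)


@dataclass
class Dataset:
    """A dataset comprises data (file paths here) and metadata."""
    title: str
    metadata: Dict[str, str]
    files: List[str]


@dataclass
class Collection:
    name: str
    endpoint: str  # label of the storage endpoint backing this collection
    policy: Optional[AccessPolicy] = None  # None means "use the community policy"
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Community:
    name: str
    policy: AccessPolicy
    collections: Dict[str, Collection] = field(default_factory=dict)

    def add_collection(self, collection: Collection) -> None:
        self.collections[collection.name] = collection

    def can_access(self, collection_name: str, user: str, user_groups: Set[str]) -> bool:
        # Assumed rule: a collection-level policy overrides the community policy.
        collection = self.collections[collection_name]
        policy = collection.policy if collection.policy is not None else self.policy
        return policy.allows(user, user_groups)


# Example using the community and collection names from the text;
# the group name and endpoint labels are made up.
argonne = Community("Argonne National Laboratory", AccessPolicy(groups={"anl-staff"}))
argonne.add_collection(Collection("Advanced Photon Source", endpoint="aps-storage"))
argonne.add_collection(
    Collection("Center for Nanoscale Materials", endpoint="cnm-storage",
               policy=AccessPolicy(public=True))
)
```

With this model, a restricted collection and a public collection can coexist inside one community, which mirrors the "public and restricted collections" feature listed earlier.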

Figure 1: Globus Publishing Model

A dataset comprises data and metadata. Policies can be set on entire communities and/or individual collections to manage: metadata (schema, requirements); access control (user and group based); curation workflow; submission and distribution license; and storage (which determines, e.g., durability).

Data Publication

Datasets undergo curation based on a workflow defined by the community that will publish the data. Each community may customize the workflow to capture its specific metadata and to reflect its review process. In a typical workflow, a scientist who has just published a paper wishes to publish the data associated with that publication. Using Globus:

(1) the scientist selects the community that will publish the data, and the collection that will house the datasets;

(2) the scientist describes the submission using both publication (Dublin Core) and collection-specific scientific metadata (e.g., see Figure 2), as defined by the collection administrator;

(3) a unique Globus endpoint is created and only the scientist is granted permission to write to the endpoint for this submission;

(4) the scientist assembles a dataset on this endpoint by transferring files from one or more systems, e.g., from the campus computing cluster where the final analysis was run, and from the high-resolution microscope used to capture the raw experiment data (the scientist can assemble this dataset over a long period of time and can return to the submission workflow when all the desired data are in place);

(5) the scientist agrees to the submission license specified by the community for that collection and submits the data for publication (the license allows the submitting user to grant rights to the collection and the Globus system to manage and disseminate the dataset based on the agreed-upon policies);

(6) the curators/reviewers for the community are notified that a submission awaits their approval, and are granted permission to access the submission;

(7) the curators view (and, if desired, edit) the metadata and assembled files of the dataset, and approve the submission;

(8) the submission is published in the collection and assigned a DOI.

Figure 2: The user supplies domain-specific metadata

Most steps in this process may be customized to meet the requirements of the community. Globus supports a wide range of data publication workflows, ranging from formal archival publication to informal sharing within a group of collaborators.
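The eight-step workflow above can be read as a small state machine with permission checks at each transition. The following is a minimal sketch of that reading, not the service's actual implementation: the class, the state names, and the decision to revoke the submitter's write permission at submission time are assumptions introduced here, and the DOI is passed in by the caller as a placeholder string rather than minted.

```python
from enum import Enum, auto


class State(Enum):
    DESCRIBED = auto()   # steps 1-2: collection chosen, metadata entered
    ASSEMBLING = auto()  # steps 3-4: endpoint open, files being transferred
    SUBMITTED = auto()   # steps 5-6: license accepted, curators notified
    PUBLISHED = auto()   # steps 7-8: approved and assigned a DOI


class Submission:
    """Hypothetical model of one dataset submission."""

    def __init__(self, submitter, collection, metadata):
        self.submitter = submitter
        self.collection = collection
        self.metadata = dict(metadata)
        self.files = []
        self.writers = set()  # who may write to the submission endpoint
        self.readers = set()  # who may review the submission
        self.doi = None
        self.state = State.DESCRIBED

    def _require(self, expected):
        if self.state is not expected:
            raise RuntimeError(f"expected {expected.name}, but state is {self.state.name}")

    def open_endpoint(self):
        # Step 3: only the submitter may write to the new endpoint.
        self._require(State.DESCRIBED)
        self.writers = {self.submitter}
        self.state = State.ASSEMBLING

    def add_file(self, user, path):
        # Step 4: files may arrive over a long period and from several systems.
        self._require(State.ASSEMBLING)
        if user not in self.writers:
            raise PermissionError(f"{user} may not write to this submission")
        self.files.append(path)

    def submit(self, license_accepted, curators):
        # Steps 5-6: the license must be accepted; curators gain access.
        self._require(State.ASSEMBLING)
        if not license_accepted:
            raise ValueError("the submission license must be accepted")
        self.writers = set()  # assumption: contents are frozen during review
        self.readers = set(curators)
        self.state = State.SUBMITTED

    def approve(self, curator, doi, metadata_edits=None):
        # Steps 7-8: a curator may edit metadata, then approves and publishes.
        self._require(State.SUBMITTED)
        if curator not in self.readers:
            raise PermissionError(f"{curator} is not a curator for this submission")
        self.metadata.update(metadata_edits or {})
        self.doi = doi
        self.state = State.PUBLISHED
```

A usage pass would call `open_endpoint()`, several `add_file(...)` calls, `submit(True, curators)`, and finally `approve(...)`; calling any of them out of order raises `RuntimeError`, which is one way a community-specific workflow could enforce its review process.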

Data discovery

Globus discovery capabilities make it easy to efficiently search and browse collections. After a dataset is published, it is discoverable using both free text and faceted search that allows the researcher to progressively filter results and rapidly focus in on the data of interest.

Figure 3: Users can search within and across collections

The interface for discovery of datasets is not unlike Amazon's interface for discovery of products. The first step of discovery is to define the context in which the user wants to search. The user may use free text search terms, key-value terms, or even range queries. A user may limit the search context to a single collection, to collections owned by a community, or to all collections to which the user has access, including collections published by other communities. Future plans call for the search context to include both collections (i.e., published datasets) and endpoints (i.e., uncurated, but accessible, datasets).

Search results show a brief summary of each published dataset, including information about the publication time, collection, summary information about the data files, name, authors, description, and a set of keyword tags as well as key-value tags. Each field can be used to search for a particular dataset. Results are ranked according to their relevance to the search criteria. The researcher can iteratively refine the search by expanding the number of search terms and/or by adding more specific values to the query (see Figure 3). Once data of interest are found, the researcher may transfer them to another Globus endpoint for further inspection and processing.

DOIs and ORCIDs

We expect that many data repository managers will want to assign Digital Object Identifiers (DOIs: www.doi.org) to datasets and associate Open Researcher and Contributor Identifiers (ORCIDs: www.orcid.org) for data authors with datasets. The Globus data publication and discovery service supports both DOIs and ORCIDs.

Linking to publishers

An increasing number of academic publishers allow researchers to associate datasets with publications, typically by supplying DOIs. As part of our work with the National Data Service (www.nationaldataservice.org), we are working with publishers to determine how to incorporate such linkages into Globus data publication workflows.

Implementation

The Globus data publication and discovery prototype leverages and extends elements of the Globus service for user and file management, and DSpace (www.dspace.org) for publication and curation workflows, all hosted in the Amazon Web Services cloud.

For more information

See www.globus.org/data-publication for the latest on Globus data publication and discovery. That page also contains pointers to a talk and slides showing a demonstration of an early version of data publication and discovery functionality. Please give the service a try at https://trial.publish.globus.org/.
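To make the discovery model from the data discovery section concrete, here is a minimal in-memory sketch of a search that combines free text terms, key-value terms, and range queries, scoped to a chosen set of collections, plus the facet counts a faceted interface would display. The function names, record layout, and sample records are all invented for illustration; the sketch does no relevance ranking and is not the Globus search API.

```python
from collections import Counter


def matches(record, text, equals, ranges, collections):
    """Return True if one dataset record satisfies every search criterion."""
    # Search context: restrict to the given collections, if any.
    if collections is not None and record["collection"] not in collections:
        return False
    # Free text: every term must appear in the title or description.
    if text:
        haystack = f'{record["title"]} {record["description"]}'.lower()
        if not all(term in haystack for term in text.lower().split()):
            return False
    metadata = record["metadata"]
    # Key-value terms: exact match on metadata fields.
    for key, value in equals.items():
        if metadata.get(key) != value:
            return False
    # Range queries: inclusive bounds on metadata fields.
    for key, (low, high) in ranges.items():
        if key not in metadata or not low <= metadata[key] <= high:
            return False
    return True


def search(records, text=None, equals=None, ranges=None, collections=None):
    """Filter records; results keep their input order (no ranking)."""
    equals = equals or {}
    ranges = ranges or {}
    return [r for r in records if matches(r, text, equals, ranges, collections)]


def facet_counts(results, key):
    """Count the values of one metadata field across a result set."""
    return Counter(r["metadata"][key] for r in results if key in r["metadata"])


# Made-up sample records, for illustration only.
SAMPLE = [
    {"title": "Nanowire diffraction scans",
     "description": "X-ray diffraction of silicon nanowires",
     "collection": "Advanced Photon Source",
     "metadata": {"material": "silicon", "temperature_k": 300}},
    {"title": "Thin film annealing",
     "description": "Annealing study of silicon thin films",
     "collection": "Center for Nanoscale Materials",
     "metadata": {"material": "silicon", "temperature_k": 900}},
    {"title": "Copper grain growth",
     "description": "Microscopy of copper grains",
     "collection": "Center for Nanoscale Materials",
     "metadata": {"material": "copper", "temperature_k": 650}},
]

hot = search(SAMPLE, ranges={"temperature_k": (600, 1000)})
by_material = facet_counts(hot, "material")
```

Iterative refinement then amounts to calling `search` again with an extra term, for example adding `equals={"material": "silicon"}` to the range query above.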