Building a Materials Data Facility (MDF)

Similar documents
The Materials Data Facility

Building an Interoperable Materials Infrastructure

Globus Platform Services for Data Publication. Greg Nawrocki University of Chicago & Argonne National Lab GeoDaRRS August 7, 2018

Design patterns for data-driven research acceleration

Data publication and discovery with Globus

Climate Data Management using Globus

Welcome! Presenters: STFC January 10, 2019

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Managing Protected and Controlled Data with Globus. Vas Vasiliadis

Big Data infrastructure and tools in libraries

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

Leveraging the Globus Platform in your Web Applications. GlobusWorld April 26, 2018 Greg Nawrocki

Roy Lowry, Gwen Moncoiffe and Adam Leadbetter (BODC) Cathy Norton and Lisa Raymond (MBLWHOI Library) Ed Urban (SCOR) Peter Pissierssens (IODE Project

Digital The Harold B. Lee Library

Welcome to the Pure International Conference. Jill Lindmeier HR, Brand and Event Manager Oct 31, 2018

Slide 1 & 2 Technical issues Slide 3 Technical expertise (continued...)

Toward Scalable Monitoring on Large-Scale Storage for Software Defined Cyberinfrastructure

Federated Services for Scientists Thursday, December 9, p.m. EST

BEXIS Release Notes

I data set della ricerca ed il progetto EUDAT

Data management and discovery

A VO-friendly, Community-based Authorization Framework

Data Curation Handbook Steps

Grid Programming: Concepts and Challenges. Michael Rokitka CSE510B 10/2007

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

Town Hall Meeting on the National and International Data Infrastructure. Jim Warren, NIST Lisa Friedersdorf, NNCO

THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel

Data Staging: Moving large amounts of data around, and moving it close to compute resources

Leveraging Globus Identity for the Grid. Suchandra Thapa GlobusWorld, April 22, 2016 Chicago

Science-as-a-Service

Data Staging: Moving large amounts of data around, and moving it close to compute resources

Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe

Building the Modern Research Data Portal using the Globus Platform. Rachana Ananthakrishnan GlobusWorld 2017

Writing a Data Management Plan A guide for the perplexed

Visualization for Scientists. We discuss how Deluge and Complexity call for new ideas in data exploration. Learn more, find tools at layerscape.

Beyond File Transfer. Steve Tuecke NCAR September 5, 2018

RENKU - Reproduce, Reuse, Recycle Research. Rok Roškar and the SDSC Renku team

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

SHARING YOUR RESEARCH DATA VIA

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

Managing Data in the long term. 11 Feb 2016

Lessons Learned. Implementing Rosetta in the Harold B. Lee Library

Guillimin HPC Users Meeting June 16, 2016

Deploying the TeraGrid PKI

Implementing a Data Publishing Service via DSpace. Jon W. Dunn, Randall Floyd, Garett Montanez, Kurt Seiffert

Sessions 3/4: Member Node Breakouts. John Cobb Matt Jones Laura Moyers 7 July 2013 DataONE Users Group

EUDAT & SeaDataCloud

Big Data, Big Compute, Big Interac3on Machines for Future Biology. Rick Stevens. Argonne Na3onal Laboratory The University of Chicago

File Transfer: Basics and Best Practices. Joon Kim. Ph.D. PICSciE. Research Computing 09/07/2018

IM&T Data Management Strategy Overview March 2013 CSIRO INFORMATION MANAGEMENT & TECHNOLOGY

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST)

Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science

Parsl: Developing Interactive Parallel Workflows in Python using Parsl

Developing a Research Data Policy

EUDAT - Open Data Services for Research

Transferring a Petabyte in a Day. Raj Kettimuthu, Zhengchun Liu, David Wheeler, Ian Foster, Katrin Heitmann, Franck Cappello

Call for Participation in AIP-6

A Reproducible Framework Powered by Globus Tanu Malik, Kyle Chard*, Ian Foster

globus online Software-as-a-Service for Research Data Management

Research Data Preserve, Share, Reuse, Publish, or Perish Mark Gahegan Director, Centre for eresearch 24 Symonds St

Cheshire 3 Framework White Paper: Implementing Support for Digital Repositories in a Data Grid Environment

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Assessment of product against OAIS compliance requirements

Fundamentals of Data Infrastructures

DOI for Astronomical Data Centers: ESO. Hainaut, Bordelon, Grothkopf, Fourniol, Micol, Retzlaff, Sterzik, Stoehr [ESO] Enke, Riebe [AIP]

vfire 9.8 Release Notes Version 1.5

DRS Policy Guide. Management of DRS operations is the responsibility of staff in Library Technology Services (LTS).

Research Elsevier

Introduction to GT3. Introduction to GT3. What is a Grid? A Story of Evolution. The Globus Project

B2SAFE metadata management

COMPUTE CANADA GLOBUS PORTAL

Certification. F. Genova (thanks to I. Dillo and Hervé L Hours)

XSEDE Infrastructure as a Service Use Cases

VI-SEEM Data Repository. Presented by: Panayiotis Charalambous

Implementation of the Data Seal of Approval

LAMBDA The LSDF Execution Framework for Data Intensive Applications

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments *

Make the most of your access to ScienceDirect

Knowledge-based Grids

Introduction to Grid Computing

BPMN Processes for machine-actionable DMPs

Scientific data management

Making data publication a first class research output

Leveraging the Globus Platform in your Web Applications

Implementation of the CoreTrustSeal

Nancy Wilkins-Diehr San Diego Supercomputer Center (SDSC) University of California at San Diego

EUDAT. Towards a pan-european Collaborative Data Infrastructure

Building the Modern Research Data Portal. Developer Tutorial

EUDAT. Towards a pan-european Collaborative Data Infrastructure. KNMI Workshop, Utrecht, Netherlands

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider

SobekCM Digital Repository : A Retrospective

Day 1 : August (Thursday) An overview of Globus Toolkit 2.4

IEMS 5722 Mobile Network Programming and Distributed Server Architecture

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Robin Wilson Director. Digital Identifiers Metadata Services


EUDAT. Towards a pan-european Collaborative Data Infrastructure - A Nordic Perspective? -

Site Administrator Help

re3data.org - Making research data repositories visible and discoverable

Transcription:

Building a Materials Data Facility (MDF) Ben Blaiszik (blaiszik@uchicago.edu) Ian Foster (foster@uchicago.edu) Steve Tuecke, Kyle Chard, Rachana Ananthakrishnan, Jim Pruyne (UC) Kandace Turner-Jones, John Towns (UIUC NCSA) materialsdatafacility.org globus.org

What is MDF? We aim to make it simple for materials datasets to be... Published Identified Described Curated Verifiable Accessible Preserved and SRD Ref Data Resource Data Published Results Discovered Searched Browsed Shared Recommended Accessed Publishable Results Derived Data Working Data [1] Adapted from Warren et al. 2

Materials Data Publication/Discovery is Often a Challenge Data Collection Data Storage and Process Publication 3

Materials Data Publication/Discovery is Often a Challenge Data Collection Want to Publish Want to Discover / Use Data Storage and Process Publication Need networked storage, sometimes many TB Need to uniquely identify data for search/cite Need custom metadata descriptions Need a data curation workflow Need automation capabilities??? 4

Materials Data Publication/Discovery is Often a Challenge Data Collection Want to Publish Want to Discover / Use Data Storage and Process Publication Need storage, sometimes many TB Need to uniquely identify data for search/cite Need custom metadata descriptions Need a data curation workflow Need automation capabilities??? 5

Quick Globus Definitions... Endpoint E.g. laptop, server, campus research computing, multiuser HPC resource Enables advanced file transfer and sharing Currently GridFTP, future GridFTP + HTTP secure endpoint, e.g. laptop A transfer Globus moves the data for you secure endpoint, e.g. midway B Key Features REST API for automation and interoperability Web UI for convenience Optimizes and verifies transfers You submit a transfer request Globus notifies you once the transfer is complete Handles auto-restarts Battle tested with big data 6

MDF Collection Model Collections have specified Might be a research group or a research topic... Mapping to storage endpoint Currently handled as automatically created shared endpoints Metadata schemas Access control policies Licenses Curation workflows Collections contain Datasets Data Metadata Metadata Persistence Metadata log file with dataset Metadata replicated in search index 7

Publish Large Datasets Leverages Globus production capabilities for file transfer, user authentication, and groups Ensure file integrity during transfer Handle huge files with ease (e.g. automated restart) 100 TB of reliable storage @ NCSA, and more storage at Argonne Globus endpoint at ncsa#mdf Expandable to PBs as necessary Automated tape backup (under development) Optionally use your own local or institutional storage 8

Uniquely Identify Datasets Associate a unique identifier with a dataset DOI, Handle Improve dataset discovery and citability Aligning incenhves and understanding the culture will be crihcal to driving adophon Dataset Downloads Time Future... Your work has been cited 153 times in the last year Researchers from 30 institutions have downloaded your datasets 9

Share Data with Flexible ACLs Share data publicly, with a set of users, or keep data private Leverage Curation Workflows Collection administrators can specify the level of curation workflow required for a given collection e.g. No curahon CuraHon of metadata only CuraHon of metadata and files 10

Customize Metadata Build a custom metadata schema for your specific research data Re-use existing metadata schemas Working in conjunction with NIST researchers to define these schemas [1] Future... Can we build a system that allows schema: Inheritance E.g. a schema polymers might inherit and expand upon the base material of NIST Versioning E.g. Understand contextually how to map fields between versions Dependence E.g. Allows the ability to build consensus around schemas [1] Workshop on metadata schemas next week at NIST see, Carrie for details 11

Discover Research Datasets Search on file metadata, custom metadata, and indexed file-level data Goal: Intuitive, Google-style search Future... Prototype advanced dataset discovery 12

MaterialsDataFacility.org 13

Quick Walkthrough of an MDF Submission 14

MDF Collection Home 15

MDF Collections Recall: Policies Set at the Collection Level Required metadata, schemas Data storage location Metadata curation policies 16

MDF Metadata Entry Scientist or representative describes the data they are submitting For this collection Dublin Core and a custom metadata template are required 17

MDF Custom Metadata Scientist or representative describes the data they are submitting For this collection Dublin Core and a custom metadata template are required 18

Dataset Assembly Shared endpoint is auto-created on collection-specified data store Scientist transfers dataset files to a unique publish endpoint Dataset may be assembled over any period of time When submission is finished, dataset will be rendered immutable via checksum 19

Dataset Curation Optionally specified in collection configuration Can be approved or rejected (i.e. sent back to the submitter) 20

Mint a Permanent Identifier Can op'onally be DOI or Handle 21

Dataset Record 22

Dataset Discovery 23

First Steps Year 1 Year 2 Year 3 Thread 1: Initial publication, search Thread 2: Expand search capabilities Thread 3: Expand publication capabilities Thread 4: Build and operate Materials Data Facility Establish MDF with storage at Argonne and NCSA Identify key use cases to pilot MDF publication pipelines Please talk to us if you have data you want to share, publish, discover, 24

Thanks to Our Sponsors! U.S. DEPARTMENT OF ENERGY 25