The iplant Data Commons

Similar documents
Reproducible & Transparent Computational Science with Galaxy. Jeremy Goecks The Galaxy Team

Data publication and discovery with Globus

A High-Level Distributed Execution Framework for Scientific Workflows

Building a Dutch National Research Infrastructure IRODS UGM 2017

Data Discovery - Introduction

Science-as-a-Service

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

SEAD Data Services. Jim Best Practices in Data Infrastructure Workshop. Cooperative agreement #OCI

Reproducibility and Reuse of Scientific Code Evolving the Role and Capabilities of Publishers

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

Building on to the Digital Preservation Foundation at Harvard Library. Andrea Goethals ABCD-Library Meeting June 27, 2016

BIG DATA CHALLENGES A NOAA PERSPECTIVE

Content-based Comparison for Collections Identification

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

Striving for efficiency

SMART Guidance for Notes Migrations

EGI federated e-infrastructure, a building block for the Open Science Commons

GEOSS Data Management Principles: Importance and Implementation

Writing a Data Management Plan A guide for the perplexed

irods at TACC: Secure Infrastructure for Open Science Chris Jordan

Case Study: CyberSKA - A Collaborative Platform for Data Intensive Radio Astronomy

Scientific Workflow Tools. Daniel Crawl and Ilkay Altintas San Diego Supercomputer Center UC San Diego

Reflections on Three Decades in Internet Time

Microsoft SharePoint Server 2013 Plan, Configure & Manage

Merging Enterprise Applications with Docker* Container Technology

irods - An Overview Jason Executive Director, irods Consortium CS Department of Computer Science, AGH Kraków, Poland

Paving the Rocky Road Toward Open and FAIR in the Field Sciences

Supporting Data Stewardship Throughout the Data Life Cycle in the Solid Earth Sciences

Red Hat Atomic Details Dockah, Dockah, Dockah! Containerization as a shift of paradigm for the GNU/Linux OS

Data Curation Handbook Steps

EUDAT- Towards a Global Collaborative Data Infrastructure

GRIDS INTRODUCTION TO GRID INFRASTRUCTURES. Fabrizio Gagliardi

CIS : Computational Reproducibility

Docker CaaS. Sandor Klein VP EMEA

EUDAT Data Services & Tools for Researchers and Communities. Dr. Per Öster Director, Research Infrastructures CSC IT Center for Science Ltd

CODE AND DATA MANAGEMENT. Toni Rosati Lynn Yarmey

Persistent Identifier the data publishing perspective. Sünje Dallmeier-Tiessen, CERN 1

for Modernization Accelerate Your Modernization Project Faster return on your investment goals

European Cloud Initiative: implementation status. Augusto BURGUEÑO ARJONA European Commission DG CNECT Unit C1: e-infrastructure and Science Cloud

BoF: Grafeas Using Artifact Metadata to Track and Govern Your Software Supply Chain

Sunil Shah SECURE, FLEXIBLE CONTINUOUS DELIVERY PIPELINES WITH GITLAB AND DC/OS Mesosphere, Inc. All Rights Reserved.

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Powering Official Statistics at Statistics New Zealand with DDI-L and Colectica

Beyond 1001 Dedicated Data Service Instances

The Materials Data Facility

The Data exacell DXC. J. Ray Scott DXC PI May 17, 2016

Research Data Preserve, Share, Reuse, Publish, or Perish Mark Gahegan Director, Centre for eresearch 24 Symonds St

Big Data infrastructure and tools in libraries

Growing Variety and Volume of Remote Sensing and In Situ Data

Data Archival and Dissemination Tools to Support Your Research, Management, and Education

FAIR-aligned Scientific Repositories: Essential Infrastructure for Open and FAIR Data

White Paper. EVERY THING CONNECTED How Web Object Technology Is Putting Every Physical Thing On The Web

XSEDE s Campus Bridging Project Jim Ferguson National Institute for Computational Sciences

CU Boulder Research Cyberinfrastructure plan 1

OpenShift Roadmap Enterprise Kubernetes for Developers. Clayton Coleman, Architect, OpenShift

DataONE: Open Persistent Access to Earth Observational Data

TEXT MINING: THE NEXT DATA FRONTIER

Walkthrough OCCAM. Be on the lookout for this fellow: The callouts are ACTIONs for you to do!

ORACLE DATABASE LIFECYCLE MANAGEMENT PACK

Pegasus WMS Automated Data Management in Shared and Nonshared Environments

Research Elsevier

SIMPLIFY, AUTOMATE & TRANSFORM YOUR BUSINESS

CloudMan cloud clusters for everyone

AssetWise to OpenText PoC Closeout Report

The Physiome Model Repository. Poul Nielsen

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

ACF Interoperability Human Services 2.0 Overview. August 2011 David Jenkins Administration for Children and Families

Academic Workflow for Research Repositories Using irods and Object Storage

Containerizing GPU Applications with Docker for Scaling to the Cloud

Red Hat Roadmap for Containers and DevOps

The Canadian CyberSKA Project

Metadata Models for Experimental Science Data Management

Mercè Crosas, Ph.D. Chief Data Science and Technology Officer Institute for Quantitative Social Science (IQSS) Harvard

I data set della ricerca ed il progetto EUDAT

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

JBoss DNA. Randall Hauch Principal Software Engineer JBoss Data Services

How to make your data open

UC Irvine LAUC-I and Library Staff Research

Building for the Future

Cloud Computing For Researchers

SharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Social Environment

Policy Based Distributed Data Management Systems

Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)

Scheduling Computational and Storage Resources on the NRP

FIVE BEST PRACTICES FOR ENSURING A SUCCESSFUL SQL SERVER MIGRATION

XML-based production of Eurostat publications

FuncX: A Function Serving Platform for HPC. Ryan Chard 28 Jan 2019

Data Staging and Data Movement with EUDAT. Course Introduction Helsinki 10 th -12 th September, Course Timetable TODAY

Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments *

<Insert Picture Here> Enterprise Data Management using Grid Technology

Leveraging High Performance Computing Infrastructure for Trusted Digital Preservation

The Data Curation Profiles Toolkit: Interview Worksheet

Fedora, Islandora, & Samvera: Requirements and Gaps. David Wilcox,

Requirements for data catalogues within facilities

Some Big Data Challenges

EUDAT and Cloud Services

Next-Generation SOA Infrastructure. An Oracle White Paper May 2007

IT Town Hall Meeting

Indiana University Research Technology and the Research Data Alliance

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

Transcription:

The iplant Data Commons Using irods to Facilitate Data Dissemination, Discovery, and Reproducibility Jeremy DeBarry, jdebarry@iplantcollaborative.org Tony Edgin, tedgin@iplantcollaborative.org Nirav Merchant, nirav@iplantcollaborative.org The iplant Collaborative

The iplant Collaborative We are a Cyberinfrastructure Platforms, tools, datasets Storage and compute Training and support

The iplant Collaborative And a virtual organization Developer Expertise Computational Capacity Science Domain Expertise Training Administrative and Organization

Why Data Commons (by Phil Bourne) The Commons is a pilot experiment in the efficient storage, manipulation, analysis, and sharing of research output, from all parts of the research lifecycle. Should The Commons be successful we would achieve a level of comprehensive access and interoperability across the research enterprise far beyond what is possible today. Some key attributes: Should support Sharing & Accessibility International to be maximally successful Should allow data science become more cost-effective, hence more sustainable Replicability, opportunity and ability to reproduce, or at least replicate, experiments Discoverability of research output through indexing or other methods Phil Bourne is the Associate Director of Data Science (ADDS) at NIH https://pebourne.wordpress.com/2014/10/07/the-commons/

iplant Collaborative Products Ready to use Platforms Ease of use Extensible Services Established CI Components Foundational Capabilities

Data Store: Sharing Data peer to peer ACL based irods tickets public links, anonymous read-only access through URLs Community Data, large dataset sharing

Sharing Usage Statistics User Statistics 27000 user accounts 4900 users with data 2600 users (53% of users with data) made at least 1 share 2100 shares per user 42 million files (58% shared) 59 million (1.1 million/month) shares Community Data Statistics 5 million files 55 million (1.0 million/month) shares ~1PB of data A share is an instance of a user sharing a single file with a single user or group.

Indexing and Search irods amqptopicsend.p y RabbitMQ Broker DE web app indexing rules incremental indexer elasticsearch DE web services Rules Engine crawling crawling indexers indexers PostgreSQL ICAT DB irods incremental indexing tags incremental indexing crawler indexing

Data Analysis in the DE Data analysis tools are installed on Condor cluster or HPC. (user can request new tools) An app is an analysis template for a tool usable from the DE. (user can create and publish new apps) A workflow combines apps into an analysis pipeline (user can create and publish new workflows). Batches of files may be processed by apps and workflows. Ability to annotate results with metadata, comments and relationships to support reproducibility and manage large data sets

Current Tool Deployment Results reproduction is difficult outside of our infrastructure. Updates can make results reproduction impossible. Deployment requires support staff. Software conflicts may make certain tool combinations impossible.

Containerized Tool Deployment* (or Docker to the rescue!) Containers are portable, so results can be reproduced outside of CI. Different versions of tool are placed in different containers. Old tool results are still reproducible after an upgrade. Users can bring custom tools in own containers. No conflicts occur between tools, since each in its own container. * Coming July 2015

The iplant Data Commons It is a part of the Data Store for high-value, public datasets with links to external repositories. Its data are available throughout the CI. Its data are searchable and discoverable. Its data will be available and useful to the community, not buried in Data Store.

Data Commons Project Management Associate collaborators with a project. Add data and metadata to a project Organize data using standardized and projectspecific metadata. Suggest analyses based on data type. Track history of operations. (provenance)

Data Commons Staging Area Distill project artifacts into package for publication to Data Commons and external repository. Combine and edit project metadata. Select appropriate licenses and persistent ids for the data and for the chosen repository.

Acknowledgements This project is funded by NSF. Thanks to all of the staff at the iplant Collaborative. The Data Commons would not exist without the efforts of Kapeel Chougule, Jeremy DeBarry, Maria Esteva, Matthew Hanlon, Nirav Merchant, Stephen Mock, Sriram Srinivasan, Ramona Walls, and Liya Wang. If you have additional questions, please don t hesitate to contact me at tedgin@iplantcollaborative.org.