Dataverse: Modular Storage and Migration to the Cloud

Gustavo Durand, Dataverse Technical Lead / Architect
Leonid Andreev, Dataverse Senior Developer

Dataverse

Overview
- An open-source platform to publish, cite, and archive research data
- Built to support multiple types of data, users, and workflows
- Developed at Harvard's Institute for Quantitative Social Science (IQSS) since 2006
- Development funded by IQSS and with grants, in collaboration with institutions around the world
- 15 on the core team: developers, designers, UI/UX, metadata specialists, curation manager

Dataverse Features - Data
- Persistent IDs / URLs
  - DataCite
  - Handle
- Automatically Generated Citations with attribution
- Compliant with FAIR and data citation principles
- Domain-specific Metadata
- Versioning
- File Storage
  - Local
  - Swift (OpenStack)
  - S3 (Amazon)

Dataverse Features - Users
- Multiple Sign In options
  - Native
  - Shibboleth
  - OAuth (ORCID)
- Dataverses within Dataverses
- Branding
- Widgets

Dataverse Features - Workflows
- Permissions
- Access Controls and Terms of Use
- Publishing Workflows
- Private URLs
- Upload / Download Workflows
  - Browser
  - Dropbox
  - Rsync (for big data packages)

Dataverse Features - Interoperability
- APIs
  - SWORD
  - Native
- Harvesting (OAI-PMH)
  - Client
  - Server
- Modular External Tools
  - Explore
  - Configure

Dataverse Technology
- Glassfish Server 4.1
- Java SE 8
- Java EE 7
  - Presentation: JSF (PrimeFaces), RESTful API
  - Business: EJB, Transactions, Asynchronous, Timers
  - Storage: JPA (Entities), Bean Validation
- Storage: Postgres, Solr, File System / Swift / S3

Dataverse Development Process
Inbox -> Backlog -> This Sprint -> Development -> Code Review -> QA -> Done
https://waffle.io/iqss/dataverse

(some) Collaborations
- SBGrid Data: Large Data and Support
- Massachusetts Open Cloud: Big Data Storage and Compute Access (OpenStack)
- DANS/CIMMYT: Handles Support
- ResearchSpace: API Java Client Library
- Provenance: W3C PROV

Dataverse Community 34 installations around the world

Dataverse Community
- 75+ code contributors outside of the Core Team
- Hundreds of members of the Dataverse Community: developers, researchers, librarians, data scientists
- Dataverse Google Group
- Dataverse Community Calls
- Dataverse Community Meeting
- Global Dataverse Community Consortium

Modularity: External Tools

Compute/Explore Access

External Tools: Two Ravens and World Map

External Tools: Data Explorer

External Tools: PSI Budgeteer
The budgeteer lets users select which statistics they would like to calculate and gives them estimates of how accurately each statistic can be computed. Users can also redistribute their privacy budget according to which statistics they think are most valuable in their dataset.

Data Storage How data files are handled in Dataverse

(one real life design and development story)

Let's talk about common pitfalls when designing complex applications
- Quick hacks save time but incur costs later.
- Overengineering. (Are you designing too far into the future? Is it an investment into making future development easier or a waste of resources?)
The design and development story behind this presentation may be an example of a reasonably balanced mix of expandability and simplicity.

Datasets = Metadata + Files!

Typical metadata of a Dataverse dataset

Data Files in a researcher's dataset

How do we store these things? - Early design prototype
- Metadata: stored in a SQL database
- Files: stored on the filesystem
- Implementation: (very much simplified!)
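A rough illustration of what such a direct-to-filesystem prototype could look like follows. It is a sketch and not the authors' code: the `PrototypeFileStore` class, its method names, and the single root directory are all hypothetical. The point it makes is that every caller is tied to `java.nio.file`, which is what the later abstraction layer removes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical prototype: the application writes straight to local disk,
// so storage details leak into every piece of calling code.
class PrototypeFileStore {
    // Assumed root directory under which all data files are kept.
    private final Path filesDirectory;

    PrototypeFileStore(Path filesDirectory) {
        this.filesDirectory = filesDirectory;
    }

    // Copies an uploaded stream to <filesDirectory>/<storageName>,
    // replacing any existing file of the same name.
    Path save(String storageName, InputStream upload) throws IOException {
        Files.createDirectories(filesDirectory);
        Path target = filesDirectory.resolve(storageName);
        Files.copy(upload, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    // Opens a previously saved file for reading.
    InputStream open(String storageName) throws IOException {
        return Files.newInputStream(filesDirectory.resolve(storageName));
    }
}
```

Supporting a second storage type in this shape would mean touching every caller of `save` and `open`, which is the cost the modular design avoids.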

But then we thought: let's make it modular!
- StorageIO: an added layer of abstraction between application and file storage
- Individual drivers for specific types of physical storage

Real life use case, early years
All the files were stored on a local filesystem (so did we even need any of that modularity?)

Then suddenly cloud happened!
Exciting new projects and collaborations, and the need to support new data storage methods.
- MassOpenCloud: cloud computing collaboration, Swift support needed
- Harvard Dataverse migration to AWS: S3 support needed
- SBGrid Databank: a collaboration with Harvard Medical School; Big Data/complex file package model
(... and we were prepared)

New storage drivers added With the StorageIO framework in place, it was possible to quickly add driver implementations for AWS S3 and OpenStack Swift.
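One way a framework like this can choose a driver at runtime is to key drivers on a scheme prefix in the stored file location. The sketch below illustrates that idea only. The `StorageDriver` interface, the `DriverRegistry` class, and the `file://`, `s3://`, and `swift://` prefixes are assumptions made for this example and are not presented as the Dataverse API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical minimal driver contract; the real StorageIO class is richer.
interface StorageDriver {
    String describe();
}

// Hypothetical registry mapping a location scheme to a driver factory.
class DriverRegistry {
    private final Map<String, Function<String, StorageDriver>> factories = new LinkedHashMap<>();

    void register(String scheme, Function<String, StorageDriver> factory) {
        factories.put(scheme, factory);
    }

    // Splits "scheme://rest" and passes "rest" to the matching factory.
    StorageDriver forLocation(String location) {
        int sep = location.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("No storage scheme in: " + location);
        }
        String scheme = location.substring(0, sep);
        Function<String, StorageDriver> factory = factories.get(scheme);
        if (factory == null) {
            throw new IllegalArgumentException("No driver registered for: " + scheme);
        }
        return factory.apply(location.substring(sep + 3));
    }
}

class DriverRegistryDemo {
    // Registers three toy drivers; adding a storage type is one more line here.
    static DriverRegistry defaultRegistry() {
        DriverRegistry registry = new DriverRegistry();
        registry.register("file", path -> () -> "local file " + path);
        registry.register("s3", path -> () -> "S3 object " + path);
        registry.register("swift", path -> () -> "Swift object " + path);
        return registry;
    }

    public static void main(String[] args) {
        System.out.println(defaultRegistry().forLocation("s3://bucket/key").describe());
    }
}
```

With this shape, supporting a new storage type means writing one driver and registering it; the lookup code and its callers stay unchanged.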

Some code examples... Top level StorageIO interface

    package edu.harvard.iq.dataverse.dataaccess;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.Path;

    // Storage-agnostic contract; each storage type supplies its own subclass.
    public abstract class StorageIO {
        public abstract void open(DataAccessOption... option) throws IOException;
        public abstract WritableByteChannel getWriteChannel() throws IOException;
        public abstract InputStream getInputStream() throws IOException;
        public abstract OutputStream getOutputStream() throws IOException;
        public abstract void delete() throws IOException;
        public abstract void savePath(Path fileSystemPath) throws IOException;
        public abstract void saveInputStream(InputStream inputStream) throws IOException;
    }

Code sample: FileAccessIO (Filesystem storage driver implementation)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        // Resolve where this file lives on the local filesystem.
        Path outputPath = getFileSystemPath();
        if (outputPath == null) {
            throw new FileNotFoundException("FileAccessIO: Could not locate physical file for writing.");
        }
        Files.copy(fileSystemPath, outputPath, StandardCopyOption.REPLACE_EXISTING);
    }

Code sample: SwiftAccessIO (Swift storage driver implementation)
It's not that much code, really (and that's the point!)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        try {
            File inputFile = fileSystemPath.toFile();
            swiftFileObject.uploadObject(inputFile);
        } catch (Exception ioex) {
            // Keep the original exception as the cause so the failure can be diagnosed.
            throw new IOException("SwiftAccessIO: Unknown exception occurred while uploading a local file into a Swift StoredObject", ioex);
        }
    }

Code sample: S3AccessIO (AWS S3 storage driver implementation)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        try {
            File inputFile = fileSystemPath.toFile();
            s3.putObject(new PutObjectRequest(bucketName, key, inputFile));
        } catch (SdkClientException ioex) {
            // Keep the original exception as the cause so the failure can be diagnosed.
            throw new IOException("S3AccessIO: Unknown exception occurred while uploading a local file into S3Object", ioex);
        }
    }
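The three `savePath` samples differ only in how the bytes reach their destination, so code that calls them does not need to know which driver it holds. The sketch below shows that property with a cut-down stand-in. `MiniStorage`, `LocalMiniStorage`, `InMemoryMiniStorage` (a toy substitute for a remote object store), and `IngestService` are hypothetical names invented for this example; only the `savePath` and `getInputStream` method names are borrowed from the samples above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical, reduced stand-in for the StorageIO abstraction.
abstract class MiniStorage {
    abstract void savePath(Path fileSystemPath) throws IOException;
    abstract InputStream getInputStream() throws IOException;
}

// Driver 1: keeps the file on the local filesystem.
class LocalMiniStorage extends MiniStorage {
    private final Path target;

    LocalMiniStorage(Path target) {
        this.target = target;
    }

    @Override
    void savePath(Path fileSystemPath) throws IOException {
        Files.copy(fileSystemPath, target, StandardCopyOption.REPLACE_EXISTING);
    }

    @Override
    InputStream getInputStream() throws IOException {
        return Files.newInputStream(target);
    }
}

// Driver 2: in-memory stand-in for a remote object store.
class InMemoryMiniStorage extends MiniStorage {
    private byte[] content = new byte[0];

    @Override
    void savePath(Path fileSystemPath) throws IOException {
        content = Files.readAllBytes(fileSystemPath);
    }

    @Override
    InputStream getInputStream() {
        return new ByteArrayInputStream(content);
    }
}

class IngestService {
    // Application code sees only MiniStorage, never a concrete driver.
    // Stores the upload, reads it back, and returns the number of bytes read.
    static long storeAndMeasure(MiniStorage storage, Path upload) throws IOException {
        storage.savePath(upload);
        try (InputStream in = storage.getInputStream()) {
            return in.readAllBytes().length;
        }
    }
}
```

Passing either driver to `IngestService.storeAndMeasure` yields the same result for the same upload, which is the behavior the abstraction layer is meant to guarantee.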

Files, as seen by Dataverse users (They all look the same to us!)

File records in the database, living side by side...

Dataverse and extended data storage models in practice.

Our users like to upload files...

SBGrid Databank A collaboration between SBGrid/Harvard Medical School, Dataverse and Globus (with support from the Helmsley Trust Biomedical Research Infrastructure).

SBDB: Big Data Support and Multiple Storage Locations, package files

Cloud Dataverse: a collaboration with Massachusetts Open Cloud and Boston University
MOC is a public/open research cloud that runs on OpenStack

MOC Dataverse: Integration with a Big Data Analytics Platform
- Big Data Analytics using OpenStack Nova (Compute) and Sahara
- Sahara: cluster provisioning of data processing frameworks
  - Hadoop/Spark/Pig/Hive/Storm
  - Abstraction for easy job submission
  - Direct Swift I/O integration

Thank you!
Please get in touch with us! Google Group, GitHub, IRC, Twitter - dataverse.org/contact
support@dataverse.org