Dataverse: Modular Storage and Migration to the Cloud
Gustavo Durand, Dataverse Technical Lead / Architect
Leonid Andreev, Dataverse Senior Developer
Dataverse
Overview
- An open-source platform to publish, cite, and archive research data
- Built to support multiple types of data, users, and workflows
- Developed at Harvard's Institute for Quantitative Social Science (IQSS) since 2006
- Development funded by IQSS and by grants, in collaboration with institutions around the world
- 15 on the core team: developers, designers, UI/UX, metadata specialists, curation manager
Dataverse Features - Data
- Persistent IDs / URLs
  - DataCite
  - Handle
- Automatically Generated Citations with attribution
- Compliant with FAIR and data citation principles
- Domain-specific Metadata
- Versioning
- File Storage
  - Local
  - Swift (OpenStack)
  - S3 (Amazon)
Dataverse Features - Users
- Multiple Sign In options
  - Native
  - Shibboleth
  - OAuth (ORCID)
- Dataverses within Dataverses
- Branding
- Widgets
Dataverse Features - Workflows
- Permissions
- Access Controls and Terms of Use
- Publishing Workflows
- Private URLs
- Upload / Download Workflows
  - Browser
  - Dropbox
  - Rsync (for big data packages)
Dataverse Features - Interoperability
- APIs
  - SWORD
  - Native
- Harvesting (OAI-PMH)
  - Client
  - Server
- Modular External Tools
  - Explore
  - Configure
Dataverse Technology
- Glassfish Server 4.1
- Java SE8
- Java EE7
  - Presentation: JSF (PrimeFaces), RESTful API
  - Business: EJB, Transactions, Asynchronous, Timers
  - Storage: JPA (Entities), Bean Validation
- Storage: Postgres, Solr, File System / Swift / S3
Dataverse Development Process
Inbox, Backlog, This Sprint, Development, Code Review, QA, Done
https://waffle.io/iqss/dataverse
(some) Collaborations
- SBGrid Data: Large Data and Support
- Massachusetts Open Cloud: Big Data Storage and Compute Access (OpenStack)
- DANS/CIMMYT: Handles Support
- ResearchSpace: API Java Client Library
- Provenance: W3C PROV
Dataverse Community 34 installations around the world
Dataverse Community
- 75+ code contributors outside of the Core Team
- Hundreds of members of the Dataverse Community: developers, researchers, librarians, data scientists
- Dataverse Google Group
- Dataverse Community Calls
- Dataverse Community Meeting
- Global Dataverse Community Consortium
Modularity: External Tools
Compute/Explore Access
External Tools: Two Ravens and World Map
External Tools: Data Explorer
External Tools: PSI Budgeteer
The budgeteer lets users select which statistics they would like to calculate and gives them estimates of how accurately each statistic can be computed. They can also redistribute their privacy budget according to which statistics they think are most valuable in their dataset.
Data Storage: How data files are handled in Dataverse
(one real life design and development story)
Let's talk about common pitfalls when designing complex applications
- Quick hacks save time but incur costs later.
- Overengineering. (Are you designing too far into the future? Is it an investment into making future development easier or a waste of resources?)
The design and development story behind this presentation may be an example of a reasonably balanced mix of expandability and simplicity.
Datasets = Metadata + Files!
Typical metadata of a Dataverse dataset
Data Files in a researcher's dataset
How do we store these things? Early design prototype
- Metadata: stored in a SQL database
- Files: stored on the filesystem
- Implementation: (very much simplified!)
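A minimal sketch of what a filesystem-only prototype of this kind could look like, assuming files are copied into a single storage directory. The class and method names (`PrototypeFileStore`, `save`, `read`) are hypothetical and are not taken from the Dataverse code base; the point is only that the application talks to the local filesystem directly.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of a non-modular design: application code
// is hard-wired to the local filesystem.
class PrototypeFileStore {
    private final Path filesDirectory;

    PrototypeFileStore(Path filesDirectory) {
        this.filesDirectory = filesDirectory;
    }

    // Copies an uploaded file into the storage directory under the given name.
    Path save(String storageName, Path uploadedFile) throws IOException {
        Path target = filesDirectory.resolve(storageName);
        Files.copy(uploadedFile, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    // Reads the stored file back as bytes.
    byte[] read(String storageName) throws IOException {
        return Files.readAllBytes(filesDirectory.resolve(storageName));
    }
}

class PrototypeDemo {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("prototype-store");
        Path upload = Files.createTempFile("upload", ".tab");
        Files.write(upload, "a\tb\n".getBytes(StandardCharsets.UTF_8));

        PrototypeFileStore store = new PrototypeFileStore(dir);
        store.save("datafile-1", upload);
        System.out.print(new String(store.read("datafile-1"), StandardCharsets.UTF_8));
    }
}
```

Every caller of a class like this implicitly assumes a `java.nio.file.Path` exists for each file, which is the assumption that stops holding once files live in object storage.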
But then we thought: let's make it modular!
- StorageIO: an added layer of abstraction between the application and file storage
- Individual drivers for specific types of physical storage
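The idea of an abstraction layer with per-storage drivers can be sketched roughly as follows. This is a cut-down illustration, not the actual StorageIO class: `SketchStorage`, `SketchFileDriver`, and `SketchIngest` are hypothetical names, and only three operations are modeled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical, cut-down stand-in for the abstraction layer.
abstract class SketchStorage {
    abstract void savePath(Path fileSystemPath) throws IOException;
    abstract InputStream getInputStream() throws IOException;
    abstract void delete() throws IOException;
}

// Driver for one type of physical storage: a file on the local filesystem.
class SketchFileDriver extends SketchStorage {
    private final Path location;

    SketchFileDriver(Path location) {
        this.location = location;
    }

    @Override
    void savePath(Path fileSystemPath) throws IOException {
        Files.copy(fileSystemPath, location, StandardCopyOption.REPLACE_EXISTING);
    }

    @Override
    InputStream getInputStream() throws IOException {
        return Files.newInputStream(location);
    }

    @Override
    void delete() throws IOException {
        Files.deleteIfExists(location);
    }
}

// Application code depends only on the abstraction, never on a driver class.
class SketchIngest {
    static long storeAndCount(SketchStorage storage, Path upload) throws IOException {
        storage.savePath(upload);
        try (InputStream in = storage.getInputStream()) {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        }
    }
}

class SketchDemo {
    public static void main(String[] args) throws IOException {
        Path upload = Files.createTempFile("upload", ".txt");
        Files.write(upload, "hello".getBytes(StandardCharsets.UTF_8));
        Path stored = Files.createTempDirectory("sketch-store").resolve("datafile-1");

        SketchStorage storage = new SketchFileDriver(stored);
        System.out.println(SketchIngest.storeAndCount(storage, upload) + " bytes stored");
        storage.delete();
    }
}
```

Because `SketchIngest` only sees `SketchStorage`, a driver for a different kind of storage can be substituted without touching it.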
Real life use case, early years
All the files were stored on a local filesystem (so did we even need any of that modularity?)
Then suddenly cloud happened!
Exciting new projects and collaborations, and the need to support new data storage methods.
- MassOpenCloud: cloud computing collaboration, Swift support needed
- Harvard Dataverse migration to AWS: S3 support needed
- SBGrid Databank: a collaboration with Harvard Medical School; Big Data/complex file package model
(... and we were prepared)
New storage drivers added
With the StorageIO framework in place, it was possible to quickly add driver implementations for AWS S3 and OpenStack Swift.
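One way to picture why adding drivers can be cheap is a registry that picks a driver from a prefix on each file's storage location. The sketch below is an assumption for illustration only: the `scheme://location` identifier format, `BlobDriver`, and `DriverRegistry` are hypothetical and are not described on the slides, and the drivers just return a description instead of talking to real storage.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical driver contract; real drivers would expose I/O operations.
interface BlobDriver {
    String describe();
}

// Hypothetical registry: maps a storage prefix to a driver factory.
class DriverRegistry {
    private final Map<String, Function<String, BlobDriver>> factories = new LinkedHashMap<>();

    // Adding support for a new storage type is one registration call.
    void register(String scheme, Function<String, BlobDriver> factory) {
        factories.put(scheme, factory);
    }

    // Picks the driver from the prefix of an identifier such as "s3://bucket/key".
    BlobDriver open(String storageIdentifier) {
        int sep = storageIdentifier.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("No driver prefix in: " + storageIdentifier);
        }
        String scheme = storageIdentifier.substring(0, sep);
        Function<String, BlobDriver> factory = factories.get(scheme);
        if (factory == null) {
            throw new IllegalArgumentException("No driver registered for: " + scheme);
        }
        return factory.apply(storageIdentifier.substring(sep + 3));
    }
}

class RegistryDemo {
    public static void main(String[] args) {
        DriverRegistry registry = new DriverRegistry();
        registry.register("file", location -> () -> "local file " + location);
        // Later additions: existing callers of open() do not change.
        registry.register("s3", location -> () -> "S3 object " + location);
        registry.register("swift", location -> () -> "Swift object " + location);

        System.out.println(registry.open("file://data/datafile-1").describe());
        System.out.println(registry.open("s3://bucket/datafile-2").describe());
        System.out.println(registry.open("swift://container/datafile-3").describe());
    }
}
```

With this shape, files stored in different places can sit side by side, since each record carries enough information to select its own driver.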
Some code examples... Top level StorageIO interface

    package edu.harvard.iq.dataverse.dataaccess;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.Path;

    // DataAccessOption is assumed to live in this same package.
    public abstract class StorageIO {
        public abstract void open(DataAccessOption... option) throws IOException;
        public abstract WritableByteChannel getWriteChannel() throws IOException;
        public abstract InputStream getInputStream() throws IOException;
        public abstract OutputStream getOutputStream() throws IOException;
        public abstract void delete() throws IOException;
        public abstract void savePath(Path fileSystemPath) throws IOException;
        public abstract void saveInputStream(InputStream inputStream) throws IOException;
        // (excerpt: the slide shows only part of the class)
    }
Code sample: FileAccessIO (Filesystem storage driver implementation)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        // Resolve where this file lives on the local filesystem.
        Path outputPath = getFileSystemPath();
        if (outputPath == null) {
            throw new FileNotFoundException("FileAccessIO: Could not locate physical file for writing.");
        }
        // Copy the source file into place, overwriting any existing copy.
        Files.copy(fileSystemPath, outputPath, StandardCopyOption.REPLACE_EXISTING);
    }
Code sample: SwiftAccessIO (SWIFT storage driver implementation)
It's not that much code, really (and that's the point!)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        try {
            File inputFile = fileSystemPath.toFile();
            // Upload the local file into the Swift stored object.
            swiftFileObject.uploadObject(inputFile);
        } catch (Exception ioex) {
            throw new IOException("SwiftAccessIO: Unknown exception occurred while uploading a local file into a Swift StoredObject");
        }
    }
Code sample: S3AccessIO (AWS S3 storage driver implementation)

    @Override
    public void savePath(Path fileSystemPath) throws IOException {
        try {
            File inputFile = fileSystemPath.toFile();
            // Upload the local file to the configured bucket under this object's key.
            s3.putObject(new PutObjectRequest(bucketName, key, inputFile));
        } catch (SdkClientException ioex) {
            throw new IOException("S3AccessIO: Unknown exception occurred while uploading a local file into S3Object");
        }
    }
Files, as seen by Dataverse users (They all look the same to us!)
File records in the database, living side by side...
Dataverse and extended data storage models in practice.
Our users like to upload files...
SBGrid Databank A collaboration between SBGrid/Harvard Medical School, Dataverse and Globus (with support from the Helmsley Trust Biomedical Research Infrastructure).
SBDB: Big Data Support and Multiple Storage Locations, package files
Cloud Dataverse: a collaboration with Massachusetts Open Cloud and Boston University
MOC is a public/open research cloud that runs on OpenStack
MOC Dataverse: Integration with a Big Data Analytics Platform
- Big Data Analytics using OpenStack Nova (Compute) and Sahara
- Sahara: cluster provisioning of data processing frameworks
- Hadoop/Spark/Pig/Hive/Storm
- Abstraction for easy job submission
- Direct Swift I/O integration
Thank you! Please get in touch with us!
Google Group, GitHub, IRC, Twitter - dataverse.org/contact
support@dataverse.org