THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel

Similar documents
Introduction to Grid Computing

Data publication and discovery with Globus

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

CANARIE: Providing Essential Digital Infrastructure for Canada

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France

FAIR-aligned Scientific Repositories: Essential Infrastructure for Open and FAIR Data

Cyberinfrastructure Framework for 21st Century Science & Engineering (CIF21)

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST)

SEAD Data Services. Jim Best Practices in Data Infrastructure Workshop. Cooperative agreement #OCI

Summary of Data Management Principles

EUDAT- Towards a Global Collaborative Data Infrastructure

Big Data infrastructure and tools in libraries

ACCI Recommendations on Long Term Cyberinfrastructure Issues: Building Future Development

Edward Seidel Director, National Center for Supercomputing Applications Founder Prof. of Physics, U of Illinois

Paving the Rocky Road Toward Open and FAIR in the Field Sciences

Data Curation Handbook Steps

Striving for efficiency

The Virtual Observatory and the IVOA

Data Discovery - Introduction

Welcome to the Pure International Conference. Jill Lindmeier HR, Brand and Event Manager Oct 31, 2018

The NIH Big Data to Knowledge Initiative: Raising the Prominence of Data

The Materials Data Facility

Illinois Data Bank FIRST YEAR REVIEW COMPILED BY RESEARCH DATA SERVICE STAFF

Scientific Data Curation and the Grid

The Role of Repositories and Journals in the Astronomy Research Lifecycle

Data Curation Profile Movement of Proteins

A VO-friendly, Community-based Authorization Framework

RADAR A Repository for Long Tail Data

Science 2.0 VU Big Science, e-science and E- Infrastructures + Bibliometric Network Analysis

Introducing the Springer Nature Data Support Services

A PERSPECTIVE ON FACILITIES STEWARDSHIP

Reproducibility and FAIR Data in the Earth and Space Sciences

Research Elsevier

VIRTUAL OBSERVATORY TECHNOLOGIES

Mercè Crosas, Ph.D. Chief Data Science and Technology Officer Institute for Quantitative Social Science (IQSS) Harvard

einfrastructures Concertation Event

Perspectives on Open Data in Science Open Data in Science: Challenges & Opportunities for Europe

Reflections on Three Decades in Internet Time

EECS3421 Introduction to Database Management Systems. Thanks to John Mylopoulos and Ryan Johnson for material in these slides

Data management Backgrounds and steps to implementation; A pragmatic approach.

Science-as-a-Service

Software + Services for Data Storage, Management, Discovery, and Re-Use

Data Curation Profile Human Genomics

Global Data Sharing The Research Data Alliance


EarthCube and Cyberinfrastructure for the Earth Sciences: Lessons and Perspective from OpenTopography

Advancing Library Cyberinfrastructure for Big Data Sharing and Reuse. Zhiwu Xie

EUDAT. Towards a pan-european Collaborative Data Infrastructure

CU Boulder Research Cyberinfrastructure plan 1

The Data exacell DXC. J. Ray Scott DXC PI May 17, 2016

LASDA: an archiving system for managing and sharing large scientific data

The IPAC Research Archives. Steve Groom IPAC / Caltech

A Brief Introduction to the Data Curation Profiles

The DOI Identifier. Drexel University. From the SelectedWorks of James Gross. James Gross, Drexel University. June 4, 2012

DATA SHARING FOR BETTER SCIENCE

Engagement With Scientific Facilities

The Smithsonian/NASA Astrophysics Data System

The EUDAT Collaborative Data Infrastructure

Services to Make Sense of Data. Patricia Cruse, Executive Director, DataCite Council of Science Editors San Diego May 2017

Introduction to Data Management for Ocean Science Research

James Marsteller and Von Welch

DCC Disciplinary Metadata

Data Management at NIST

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

EGI federated e-infrastructure, a building block for the Open Science Commons

Data Curation Profile Plant Genomics

National Data Sharing and Accessibility Policy-2012 (NDSAP-2012)

ICME: Status & Perspectives

Data Archival and Dissemination Tools to Support Your Research, Management, and Education

OUR VISION To be a global leader of computing research in identified areas that will bring positive impact to the lives of citizens and society.

The Computation and Data Needs of Canadian Astronomy

Data Management Checklist

Business Model for Global Platform for Big Data for Official Statistics in support of the 2030 Agenda for Sustainable Development

EUDAT & SeaDataCloud

Collaborations and Partnerships. John Broome CODATA-International

Some Big Data Challenges

National Materials Data Initiatives

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members

Climate Data Management using Globus

Scaling Local Markets Together. Christian A. Hendricks President

Developing Data Management Plans (DMP) Scholarly Communication Initiative Mississippi State University Libraries March 25, 2015

Solving the Enterprise Data Dilemma

DataONE Enabling Cyberinfrastructure for the Biological, Environmental and Earth Sciences

Data Staging and Data Movement with EUDAT. Course Introduction Helsinki 10 th -12 th September, Course Timetable TODAY

Certification. F. Genova (thanks to I. Dillo and Hervé L Hours)

Astronomy Dataverse: enabling astronomer data publishing.

Application of Big Data Technology to Library data:a review

Cyberinfrastructure!

Indiana University Research Technology and the Research Data Alliance

Open Access & Open Data in H2020

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

Invenio: A Modern Digital Library for Grey Literature

Towards a Strategy for Data Sciences at UW

For Attribution: Developing Data Attribution and Citation Practices and Standards

Design patterns for data-driven research acceleration

Rutgers Discovery Informatics Institute (RDI2)

Dataverse and DataTags

Using Caltech s Institutional Repository to Track OA Publishing in Chemistry

DDN Annual High Performance Computing Trends Survey Reveals Rising Deployment of Flash Tiers & Private/Hybrid Clouds vs.

Transcription:

THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel National Center for Supercomputing Applications University of Illinois Urbana-Champaign

Data-enabled Transformation of Science Astronomy 1500-2000: Single scientist looks through telescope Record KB of data in notebook Require reproducibility Sloan Digital Sky Survey 2000+ Record data for decade (40TB) Serve to entire world Thousands of scientists work together DES (now) 200GB/night PB in decade LSST (6 years) Record data for decade SDSS/night! 200 PB/decade

Data-enabled Transformation of Science How can I publish, discover, verify data in this new world? Astronomy 1500-2000: Single scientist looks through telescope Record KB of data in notebook Require reproducibility Sloan Digital Sky Survey 2000+ Record data for decade (40TB) Serve to entire world Thousands of scientists work together DES (now) 200GB/night PB in decade LSST (6 years) Record data for decade SDSS/night! 200 PB/decade

Big Data vs The Long Tail of Science Many Big Data projects are special Highly organized, singular sources of data, professionally curated, a lot attention paid What about the Long Tail (the other 99%)? 1000s of biologists sequencing communities of organisms Thousands of chemists and materials scientists developing a materials genome Characteristics: Heterogeneous, perhaps hand generated Not curated, reused, served, etc 4

Big Data vs The Long Tail of Science Many Big Data projects are special Highly organized, singular sources of data, professionally curated, a lot attention paid What about Fundamental the Long Tail (the other 99%)? Observation: Scientists 1000s communicate of biologists by sequencing sharing communities of organisms data Thousands of chemists and materials scientists developing a materials genome Characteristics: Heterogeneous, perhaps hand generated Not curated, reused, served, etc 5

In 2021 the LIGO gravitational wave observatory detects a transient burst event. An alert is issued. A Data Scenario Communities share data, software, knowledge, in real time 6

A Data Scenario Scientists (many of whom have never worked together) engage discovery services, find data from the IceCube neutrino observatory, isolating the source in the sky. Communities share data, software, knowledge, in real time 7

A Data Scenario Discovery services also connect researchers to data of DES and LSST to look for E&M precursors Communities share data, software, knowledge, in real time 8

A Data Scenario Literature searches find publications describing similar detections; Elsevier, APS publications and arxiv preprints, supporting NDS data linking, lead them to the data underlying the analyses Communities share data, software, knowledge, in real time 9

A Data Scenario NDS transfer services retrieve supercomputer simulation data, containing unpublished neutrino emission predictions, to DataScope to compare observations with theoretical models. Analysis suggests a new class of stellar object Communities share data, software, knowledge, in real time 10

A Data Scenario LIGO data, IceCube detections, images from LSST, and analyses of simulation data are brought together as NDS metadata generation tools help them organize a new collection Communities share data, software, knowledge, in real time 11

A Data Scenario A publication is submitted to an OA journal, with identifiers for the new data collection included in the paper... Once the paper is accepted, the linked NDS data collection is sent to a campus repository for longer-term curated management... Communities share data, software, knowledge, in real time 12

A Data Scenario Readers have access to the underlying data, enabling them to verify and extend results. Data are further available to educators, who bring the discovery to a broad audience by updating astronomy e- textbooks. Communities share data, software, knowledge, in real time 13

Digital Data-driven Research Virtually all science is driven scientific research by growing data, the more digital we will data But it is becoming un-reproducible! We need to take steps to make scientific research data more liquid. The more we move towards open as the default for get out of the research enterprise. It is time to take deliberate steps to make that a reality. Mike Stebbins, White House Growing call for open, sharable data OSTP In many fields, scientists are used to pulling data from online archives NSF, NIH, other agencies in US, world requiring more and more White House Open Data Policy supports sharable data at scale Despite this, there are no uniform mechanisms to store, share and re-use data

Inaccessibility of data Scientific results are shared through literature No uniform mechanisms to publish data that results are based on Can t cite data just as readily Some disciplines/communities are quite advanced in making data available Discipline-specific federations address peculiar problems of metadata, data types, data formats NEON, LTER, VAO, SEAD, EarthCube, HASTAC, We ll hear from three tomorrow Not for all data, nor for all disciplines

Long-tail data problem Major surveys, missions, and projects provide standard archived products Many more smaller collections remain inaccessible Datasets created by small research groups Datasets (painstakingly) refined by further processing for specific science questions New datasets created by synthesizing existing datasets No universal tools to create, publish data collections

National Data Service(s) Urgent need for national infrastructures for data Integrated set of national-scale services Storing, sharing, finding, verifying, publishing, citing, reusing Open and federating architecture Building on infrastructure at discipline/community level Enable providers to make data accessible in national environment Allow new and community-produced tools and resources to be plugged in Enable revolutionary new research Rapid discovery and easy re-use of data Unprecedented cross-disciplinary research Greater collaboration Make science verification more routine

Connects data resources into an interoperable web Federations University Libraries Major Projects MREFC, DIBBs, etc. National and Regional Labs and Computing Centers Publishers Connect/unify data capabilities: Data discovery Data publishing Data movement Data-Literature linking The NDS Ecosystem

What should NDS do for researchers? Help researchers find data Cross-disciplinary searching: federations, projects, archives, and other repositories Find data related to a publication Leverage specialized community-specific services Help researchers use data Download data, browse metadata, track provenance Move data to processing platforms for specialized (re-)processing and analysis

What should NDS do for researchers? Help researchers share and publish data NDS Repository: platform for publication Sharing privately with collaborators prior to publishing Tools to help organize the data for publishing Automatically ensure links to literature: Assign DOIs, provide links to publishers, synchronize data publishing with papers Recommend appropriate discipline/community repository for long-term preservation NDS Repository as archive of last resort

NDS Consortium NDS vision requires collaboration of many institutions Consortium to guide the development of the NDS Coordinate separately funded efforts to build components Ensure interoperability Integrate existing tools and resources Forum for developing new partnerships and proposals A number of organizations already with us DataONE, LTER, LIGO Lab, IceCube, Many others with strong interest, poised to join What should the NDS Consortium look like? Break out session today, facilitated discussion tomorrow

NDS Relationship with RDA RDA & NDS Consortium are connected No attempt to duplicate or compete with RDA activities Work closely with RDA groups to implement NDS NDS focus: create infrastructure framework, build on/interoperating with existing activities Narrow focus on functions, leveraging existing capabilities Tools that can be built and deployed within a few years National structures: HPC centers, XSEDE, Internet2 Libraries, publishers ensure appropriate links to literature Specific communities & projects plug in

What s Next: Let s Get to Work! Much of this is possible now Lacking is Framework to integrate National integration structures to deploy Need doers and visionaries both! This meeting: let s look at What should an NDS do? How should it be built? Inventory what is already there, what is missing Demonstrations to build momentum: much can be done now! How do we fund it? NCSA, others fully committed Organize coordinated proposals to build missing components of services! Form a consortium Future meetings