EUDAT. Towards a pan-european Collaborative Data Infrastructure

Similar documents
EUDAT. Towards a pan-european Collaborative Data Infrastructure

EUDAT. Towards a pan-european Collaborative Data Infrastructure - A Nordic Perspective? -

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

EUDAT. Towards a pan-european Collaborative Data Infrastructure. Damien Lecarpentier CSC-IT Center for Science, Finland EUDAT User Forum, Barcelona

EUDAT. Towards a pan-european Collaborative Data Infrastructure

EUDAT. Towards a Collaborative Data Infrastructure. Ari Lukkarinen CSC-IT Center for Science, Finland NORDUnet 2012 Oslo, 18 August 2012

I data set della ricerca ed il progetto EUDAT

EUDAT- Towards a Global Collaborative Data Infrastructure

European Collaborative Data Infrastructure EUDAT - Training on EUDAT Principles -

Using EUDAT services to replicate, store, share, and find cultural heritage data

EUDAT Data Services & Tools for Researchers and Communities. Dr. Per Öster Director, Research Infrastructures CSC IT Center for Science Ltd

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members

EUDAT Towards a Collaborative Data Infrastructure

The EUDAT Collaborative Data Infrastructure

EUDAT & SeaDataCloud

Data management and discovery

EUDAT Common data infrastructure

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

EUDAT & AAI. Daan Broeder MPI for Psycholinguistics

EUDAT - Open Data Services for Research

Inge Van Nieuwerburgh OpenAIRE NOAD Belgium. Tools&Services. OpenAIRE EUDAT. can be reused under the CC BY license

EUDAT and Cloud Services

EUDAT. Towards a pan-european Collaborative Data Infrastructure. KNMI Workshop, Utrecht, Netherlands

USE CASES IN SEISMOLOGY. Alberto Michelini INGV

Data Discovery - Introduction

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

META-SHARE : the open exchange platform Overview-Current State-Towards v3.0

European Cloud Initiative: implementation status. Augusto BURGUEÑO ARJONA European Commission DG CNECT Unit C1: e-infrastructure and Science Cloud

EPOS: European Plate Observing System

Horizon 2020 and the Open Research Data pilot. Sarah Jones Digital Curation Centre, Glasgow

META-SHARE: An Open Resource Exchange Infrastructure for Stimulating Research and Innovation

B2FIND: EUDAT Metadata Service. Daan Broeder, et al. EUDAT Metadata Task Force

ODC and future EIDA/ EPOS-S plans within EUDAT2020. Luca Trani and the EIDA Team Acknowledgements to SURFsara and the B2SAFE team

Deliverable 6.4. Initial Data Management Plan. RINGO (GA no ) PUBLIC; R. Readiness of ICOS for Necessities of integrated Global Observations

clarin:el an infrastructure for documenting, sharing and processing language data

e-infrastructure: objectives and strategy in FP7

Safe Replication and Data Staging

European digital repository certification: the way forward

Data publication and discovery with Globus

Towards FAIRness: some reflections from an Earth Science perspective

Key Elements of Global Data Infrastructures

Welcome to the Pure International Conference. Jill Lindmeier HR, Brand and Event Manager Oct 31, 2018

Science Europe Consultation on Research Data Management

Ways to improve Global Data Sharing and Re-use

GATE: Big Data for Smart Society Dessislava Petrova-Antonova Sofia University St. Kliment Ohridski Faculty of Mathematics and Informatics

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

CLARIN s central infrastructure. Dieter Van Uytvanck CLARIN-PLUS Tools & Services Workshop 2 June 2016 Vienna

Federated Identity Management for Research Collaborations. Bob Jones IT dept CERN 29 October 2013

EPOS a long term integration plan of research infrastructures for solid Earth Science in Europe

Moving e-infrastructure into a new era the FP7 challenge

InfraStructure for the European Network for Earth System modelling. From «IS-ENES» to IS-ENES2

Data Staging and Data Movement with EUDAT. Course Introduction Helsinki 10 th -12 th September, Course Timetable TODAY

EUDAT B2FIND A Cross-Discipline Metadata Service and Discovery Portal

Striving for efficiency

Implementation of the Data Seal of Approval

On the way to Language Resources sharing: principles, challenges, solutions

DCH-RP Trust-Building Report

Helix Nebula The Science Cloud

How can CLARIN archive and curate my resources?

Trust and Certification: the case for Trustworthy Digital Repositories. RDA Europe webinar, 14 February 2017 Ingrid Dillo, DANS, The Netherlands

PASIG Directions & Issues

irods workflows for the data management in the EUDAT pan-european infrastructure

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

DANUBIUS-RI, the pan-european distributed research infrastructure supporting interdisciplinary research on river-sea systems

Developing a social science data platform. Ron Dekker Director CESSDA

Writing a Data Management Plan A guide for the perplexed

IT Challenges and Initiatives in Scientific Research

Fair data and open data: differences and consequences

Data Curation Handbook Steps

e-infrastructures in FP7 INFO DAY - Paris

CLARIN for Linguists Portal & Searching for Resources. Jan Odijk LOT Summerschool Nijmegen,

Indiana University Research Technology and the Research Data Alliance

Future Core Ground Segment Scenarios

Fundamentals of Data Infrastructures

INSPIRE & Environment Data in the EU

Helix Nebula, the Science Cloud

Open Science Commons: A Participatory Model for the Open Science Cloud

Pascal Gilles H-EOP-GT. Meeting ESA-FFG-Austrian Actors ESRIN, 24 th May 2016

Assessing the FAIRness of Datasets in Trustworthy Digital Repositories: a 5 star scale

PID System for eresearch

e-infrastructures in Horizon 2020 e-infrastructures for data and computing

Current status of the Romanian EIDA (European Integrated Data Archive)

NDSA Web Archiving Survey

Session Two: OAIS Model & Digital Curation Lifecycle Model

Digital repositories as research infrastructure: a UK perspective

Managing very large Multimedia Archives and their Integration into Federations

How to share research data

Scientific Data Curation and the Grid

Survey of Research Data Management Practices at the University of Pretoria

The National Digital Library Finna Among Digital Research Infrastructures in Finland

BHL-EUROPE: Biodiversity Heritage Library for Europe. Jana Hoffmann, Henning Scholz

An overview of the OAIS and Representation Information

Big Data infrastructure and tools in libraries

Copernicus Space Component. Technical Collaborative Arrangement. between ESA. and. Enterprise Estonia

Global Data Sharing The Research Data Alliance

The MovingLife Project

Promoting semantic interoperability between public administrations in Europe

Global Research Infrastructures for Biodiversity and Ecosystems Research

EOSC Services & Architecture: the EOSC-hub approach Tiziana Ferrari, Project Coordinator, EGI Founda?on

e-research Infrastructures for e-science Axel Berg SARA national HPC & e-science support center RAMIRI, June 15, 2011

Transcription:

EUDAT Towards a pan-european Collaborative Data Infrastructure Damien Lecarpentier CSC-IT Center for Science, Finland CESSDA workshop Tampere, 5 October 2012

EUDAT Towards a pan-european Collaborative Data Infrastructure Damien Lecarpentier CSC-IT Center for Science, Finland CESSDA workshop Tampere, 5 October 2012

2

EUDAT Consortium 3

Data centers and Communities 4

EPOS: European Plate Observatory System Research infrastructure and e-science for data and observatories on earthquakes, volcanoes, surface dynamics and tectonics Distributed data sensors Large-scale statistics Metadata schema Reference architecture

CLARIN: Common Language Resources and Technology Infrastructure CLARIN is a large-scale pan-european collaborative effort to create, coordinate and make language resources and technology available and usable Around 200 EU centres Require PIDs and metadata infrastructure ISOcat, SCHEMcat The Virtual Language Observatory http://www.clarin.eu/vlo/

ENES: Service for Climate Modelling in Europe ENES provides information and services to foster intricate simulations of the climate system using high-performance computers as well as the distribution and dissemination of data produced by such simulations About 20 EU centres Uses data infrastructure at the German climate centre Uses CIM data model Uses DOIs and EPIC handles Metadata schema based on ISO 11179

LifeWatch: Biodoversity Data and Observatories LifeWatch will construct and bring into operation the facilities, hardware, software and governance structures for all aspects of biodiversity research: facilities for data generation and processing, data integration and interoperability; a network of observatories, virtual laboratories; a Service Centre supporting scientific and policy users Involving most nature infrastructures Interoperability requirements Distributed data sensors Metadata standardisation Common reference model

VPH: The Virtual Physiological Human VPH aims to support and progress European research in biomedical modelling and simulation of the human body. This will improve our ability to predict, diagnose and treat disease, and have a dramatic effect on the future of healthcare, the pharmaceutical and medical device industries Pilot project with 5 hospitals Central datacentre Metadata aggregation DICOM, JPEG headers

The CDI concept 10

Communities data centres Service provision Technology appraisal & matching Requirements, service cases 11

How do achieve this?

Building Blocks of the CDI EUDAT Portal Integrated APIs and harmonized access to EUDAT facilities Metadata Catalog Aggregated EUDAT metadata domain. Data inventory Data Staging Safe Replication Simple Store Dynamic replication to HPC workspace for processing Data curation and access optimization Researcher data store (simple upload, share and access) AAI Network of trust among authentication and authorization actors

INFRASTRUCTURE 14

OPERATION TEAM 15

How about data aquisition and licence agreements? 16

Some community examples: CLARIN Type of Data: text, audio, video, multimedia, etc. Aquisition, use, and re-use of data depends very much on the source of the data: Newspapers and publishing houses: licences available to work with their data. Analysis belongs to the researcher (or his/her institution); but the underlying source remain property of the publishing house. Text, audio, or video interviews created by researchers: exploitation right of the data (and maybe also copyright, depending on the situation) belongs to the researchers, but due to personality rights (of the persons interviewed), access to data can be restricted. Web-crawled text corpora: permission of each website owner is required to distribute the texts In most cases, copyright, personality right, and/or exploitation right are involved, i.e. hardly any linguistic data are completely freely available to the public. (except Wikipedia). Licence agreements (for annotated resources) cannot be standardised easily => uses of the resource needs the permission of the publishers or subjects. Researchers want to keep control of their data (for the reasons mentioned above). Also many researchers spent a lot of time for manually collecting data. Trust plays an important role data owners need to keep control over their data. 17

Some community example: EPOS Type of data: from recording equipments (sensors, etc.) and lab experiments Acquisition, use, and re-use of data depends very much on the source of the data AND research groups : Seismologists and geodesists obtaining their data from recording equipements installed in the field are generally very open to sharing the data because they also need the others' data => community effort. Lab experiments often require hosting costly and often unique machinery => attitude can be more protective Volcano scientists can be very protective of their data belonging to 'their' volcano. Data restrictions exists: some data is free and accessible in standard formats (e.g. data coming from permanent field equipments; other is restricted (e.g. data coming from labs). MoU exist between institutions to share data Researchers want to be innovative but also want to protect their work => because there is so much effort behind (i.e., for acquiring, quality checking, organising, etc.). EPOS seeks to address these issues and to combine the different data resources and create a single e- infrastructure where data provenance, acknowledgment and right of use are integral parts of the whole effort. 18

Some community example: LIFEWATCH Type of data: geospatial data, DNA sequences, coming from various disciplines. Generally it makes sense to make a difference between different categories of data: Simple data strings (DNA sequences): open and free through the EBI portal Signals (from sensors or Earth observation via satellites or plane): Feeling of ownership is stronger because of previous investments. Processed (filtered and transformed) data are also becoming more open and free available, but not in real time. Processed data with often human interpretations (observations, monitoring): data mostly generated by academic and related research institutions (e.g. data about species presence and abundance, population densities, life stages of organisms, etc). As these data sets require curation, the ownership and owners responsibility must be clear. Licences: International agreements on in place for some data: Open access (e.g. GBIF portal for sharing primary species presence data) Free Access to metadata (e.g. LTER-Europe) Technical, semantic and legal interoperability is crucial for lifewatch 19

Some community example: ENES Type of data: climate modelling data Aquisition, use, and re-use of data depends very much on the source of the data Data produced by the scientific community itself as climate model data: Climate model data are available to the scientific community for academic use without restrictions but not anonymously. About 2/3 of the modeling groups make their data available also for commerical applications according to the US model for data access. Observational data like meteorological stations and satellites which are disseminated by agencies: Data from observations and from satellites are available to the scientific community for academic use for free or on self-cost basis. Commercial applications have to pay on an applications basis. The idea is that agencies will refinanced by selling data for commercial purposes. 20

Hierarchy of data needs Value creation, though openness and sharing Combining data Sharing data Combining data in imaginative new ways to solve problems Sharing data across communities relies on more basic data needs being met first Improving data usability and reusability Keeping data safe and accessible Storing and archiving data Metadata catalogue, persistent identifiers Data security, data standards, data curation Data archive, data storage facilities LSDMA Symposium

Principles where we want to be 1. Data deposited with the EUDAT CDI will be preserved in perpetuity 2. Data are best curated in their own communities 3. Access to data in the EUDAT CDI is free at the point of use 4. EUDAT will operate as a federation of community-facing repositories and back office hosting providers 5. EUDAT services and infrastructure must be a suitable target for TDR outsourcing (cf. datasealofapproval.org) 6. EUDAT will not assert ownership of any data it holds EUDAT - towards sustainability - current thinking

Welcome to the 1st EUDAT Conference! 23

Contact us! eudat-info@postit.csc.fi www.eudat.eu facebook.com/eudat twitter.com/eudat_eu Project Coordinator: Kimmo Koski kimmo.koski@csc.fi Scientific Coordinator: Peter Wittenburg peter.wittenburg@mpi.nl Project Manager: Damien Lecarpentier damien.lecarpentier@csc.fi

EUDAT Consortium 3

Data centers and Communities 4

EPOS: European Plate Observatory System Research infrastructure and e-science for data and observatories on earthquakes, volcanoes, surface dynamics and tectonics Distributed data sensors Large-scale statistics Metadata schema Reference architecture

CLARIN: Common Language Resources and Technology Infrastructure CLARIN is a large-scale pan-european collaborative effort to create, coordinate and make language resources and technology available and usable Around 200 EU centres Require PIDs and metadata infrastructure ISOcat, SCHEMcat The Virtual Language Observatory http://www.clarin.eu/vlo/

ENES: Service for Climate Modelling in Europe ENES provides information and services to foster intricate simulations of the climate system using high-performance computers as well as the distribution and dissemination of data produced by such simulations About 20 EU centres Uses data infrastructure at the German climate centre Uses CIM data model Uses DOIs and EPIC handles Metadata schema based on ISO 11179

LifeWatch: Biodoversity Data and Observatories LifeWatch will construct and bring into operation the facilities, hardware, software and governance structures for all aspects of biodiversity research: facilities for data generation and processing, data integration and interoperability; a network of observatories, virtual laboratories; a Service Centre supporting scientific and policy users Involving most nature infrastructures Interoperability requirements Distributed data sensors Metadata standardisation Common reference model

VPH: The Virtual Physiological Human VPH aims to support and progress European research in biomedical modelling and simulation of the human body. This will improve our ability to predict, diagnose and treat disease, and have a dramatic effect on the future of healthcare, the pharmaceutical and medical device industries Pilot project with 5 hospitals Central datacentre Metadata aggregation DICOM, JPEG headers

The CDI concept 10

Communities data centres Service provision Requirements, service cases Technology appraisal & matching 11

How do achieve this?

Building Blocks of the CDI EUDAT Portal Integrated APIs and harmonized access to EUDAT facilities Metadata Catalog Aggregated EUDAT metadata domain. Data inventory Data Staging Safe Replication Simple Store Dynamic replication to HPC workspace for processing Data curation and access optimization Researcher data store (simple upload, share and access) AAI Network of trust among authentication and authorization actors

INFRASTRUCTURE 14

OPERATION TEAM 15

How about data aquisition and licence agreements? 16

Some community examples: CLARIN Type of Data: text, audio, video, multimedia, etc. Aquisition, use, and re-use of data depends very much on the source of the data: Newspapers and publishing houses: licences available to work with their data. Analysis belongs to the researcher (or his/her institution); but the underlying source remain property of the publishing house. Text, audio, or video interviews created by researchers: exploitation right of the data (and maybe also copyright, depending on the situation) belongs to the researchers, but due to personality rights (of the persons interviewed), access to data can be restricted. Web-crawled text corpora: permission of each website owner is required to distribute the texts In most cases, copyright, personality right, and/or exploitation right are involved, i.e. hardly any linguistic data are completely freely available to the public. (except Wikipedia). Licence agreements (for annotated resources) cannot be standardised easily => uses of the resource needs the permission of the publishers or subjects. Researchers want to keep control of their data (for the reasons mentioned above). Also many researchers spent a lot of time for manually collecting data. Trust plays an important role data owners need to keep control over their data. 17

Some community example: EPOS Type of data: from recording equipments (sensors, etc.) and lab experiments Acquisition, use, and re-use of data depends very much on the source of the data AND research groups : Seismologists and geodesists obtaining their data from recording equipements installed in the field are generally very open to sharing the data because they also need the others' data => community effort. Lab experiments often require hosting costly and often unique machinery => attitude can be more protective Volcano scientists can be very protective of their data belonging to 'their' volcano. Data restrictions exists: some data is free and accessible in standard formats (e.g. data coming from permanent field equipments; other is restricted (e.g. data coming from labs). MoU exist between institutions to share data Researchers want to be innovative but also want to protect their work => because there is so much effort behind (i.e., for acquiring, quality checking, organising, etc.). EPOS seeks to address these issues and to combine the different data resources and create a single e- infrastructure where data provenance, acknowledgment and right of use are integral parts of the whole effort. 18

Some community example: LIFEWATCH Type of data: geospatial data, DNA sequences, coming from various disciplines. Generally it makes sense to make a difference between different categories of data: Simple data strings (DNA sequences): open and free through the EBI portal Signals (from sensors or Earth observation via satellites or plane): Feeling of ownership is stronger because of previous investments. Processed (filtered and transformed) data are also becoming more open and free available, but not in real time. Processed data with often human interpretations (observations, monitoring): data mostly generated by academic and related research institutions (e.g. data about species presence and abundance, population densities, life stages of organisms, etc). As these data sets require curation, the ownership and owners responsibility must be clear. Licences: International agreements on in place for some data: Open access (e.g. GBIF portal for sharing primary species presence data) Free Access to metadata (e.g. LTER-Europe) Technical, semantic and legal interoperability is crucial for lifewatch 19

Some community example: ENES Type of data: climate modelling data Aquisition, use, and re-use of data depends very much on the source of the data Data produced by the scientific community itself as climate model data: Climate model data are available to the scientific community for academic use without restrictions but not anonymously. About 2/3 of the modeling groups make their data available also for commerical applications according to the US model for data access. Observational data like meteorological stations and satellites which are disseminated by agencies: Data from observations and from satellites are available to the scientific community for academic use for free or on self-cost basis. Commercial applications have to pay on an applications basis. The idea is that agencies will refinanced by selling data for commercial purposes. 20

Hierarchy of data needs Value creation, though openness and sharing Combining data Sharing data Combining data in imaginative new ways to solve problems Sharing data across communities relies on more basic data needs being met first Improving data usability and reusability Keeping data safe and accessible Storing and archiving data Metadata catalogue, persistent identifiers Data security, data standards, data curation Data archive, data storage facilities LSDMA Symposium

Principles where we want to be 1. Data deposited with the EUDAT CDI will be preserved in perpetuity 2. Data are best curated in their own communities 3. Access to data in the EUDAT CDI is free at the point of use 4. EUDAT will operate as a federation of community-facing repositories and back office hosting providers 5. EUDAT services and infrastructure must be a suitable target for TDR outsourcing (cf. datasealofapproval.org) 6. EUDAT will not assert ownership of any data it holds EUDAT - towards sustainability - current thinking

Welcome to the 1st EUDAT Conference! 23

Contact us! eudat-info@postit.csc.fi www.eudat.eu facebook.com/eudat twitter.com/eudat_eu Project Coordinator: Kimmo Koski kimmo.koski@csc.fi Scientific Coordinator: Peter Wittenburg peter.wittenburg@mpi.nl Project Manager: Damien Lecarpentier damien.lecarpentier@csc.fi