Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement, Helsinki, September 10th 2013

Transcription:

Claudio Cacciari, c.cacciari@cineca.it

Outline
The issue
Starting point
Which kind of service?
Which kind of users?
Flexibility
Different transfer strategies
Policies
Performance
Different federation strategies
PID and registered data

Data trends: the challenges
Exponential growth: gigabytes, terabytes, petabytes, exabytes, zettabytes.
Increasing complexity and variety.
Where to store it? How to find it? How to make the most of it? How to ensure service interoperability?

Collaborative Data Infrastructure
To move data across heterogeneous systems and organisations, the keys are interoperability at the organisational, semantic and technical levels, scalability, and reduced complexity of data management.
Data generators and users: user functionalities, data capture and transfer, virtual research environments.
Community support services: data discovery and navigation, workflow generation, annotation, interpretability.
Common data services: persistent storage, identification, authenticity, workflow execution, mining.
Diagram side labels: Data Curation, Trust.

Just another infrastructure?
If there are hundreds of Research Infrastructures, how many different data management systems can we sustain?

EUDAT project
EUDAT is a three-year project that will deliver a Collaborative Data Infrastructure (CDI).
Co-funded within the 7th Framework Programme.
Launched on 1 October 2011.

Principles: where we want to be
1. Data deposited will be preserved in perpetuity.
2. Data are best curated in their own communities.
3. Access to data in the Collaborative Data Infrastructure is free at the point of use.
4. The Collaborative Data Infrastructure will not assert ownership of any data it holds.

Data movement: staging or replication?
Safe Replication: data preservation and access optimization.
Data Staging: dynamic replication to HPC workspace for processing.
Diagram labels: PID, Data.

Replication

Once upon a time
"Replicate my collection X to three data centres and store the collection safely for 10 years."

Are you talking about clouds?
"I already do it every day in my cloud space."
Image: http://www.flickr.com/photos/theaucitron/5810163712

We are talking about a robust, safe, highly available Replication Service, which is not a personal cloud space.

What about trust?
Can you find where your data are physically stored on the cloud? Or who can access them?
No, because clouds are opaque, while a collaborative data infrastructure is transparent.
Image: http://www.flickr.com/photos/andercismo/2349098787

"Replicate my collection X to three data centres"
Diagram: EUDAT centers and community centers, with the labels EPCC, VPH, ENES, CLARIN, Lifewatch, CINECA, BSC, EPOS.

Then is it a complex mechanism suited only for expert data managers?
No. It is a flexible approach able to encompass quite different usage scenarios.
Image: http://www.flickr.com/photos/joshmichtom/4311112953

Horizontal flexibility
Different strategies for different usage scenarios:
Single researcher: upload and forget.
Team: upload, add metadata, share.
Community: periodic transfers, quality checks.

EUDAT service overview
Metadata Catalogue: aggregated EUDAT metadata domain; data inventory.
Safe Replication: data preservation and access optimization.
Data Staging: dynamic replication to HPC workspace for processing.
Simple Store: researcher data store (simple upload, share and access).
AAI: network of trust among authentication and authorization actors.
PID: anchor for identification and integrity.
Diagram labels: Data, PID, Metadata.

Service availability does not mean user awareness.
Image: http://xkcd.com/949/

Transfer approaches
Diagram: autonomous and planned transfers against data size (small, medium, large, huge).

Coming back
"Replicate my collection X to three data centres and store the collection safely for 10 years."
Apparently a simple statement. But you need to plan it, and for that you need policies!

Vertical flexibility

The EUDAT consortium is working on policies for the management of the Collaborative Data Infrastructure. We can call them internal policies.
Image: http://www.flickr.com/photos/frank3/6053973411

How to manage software updates? What about security bugs? Which systems are monitored?

Each community has one or more doors to connect to the infrastructure.
Diagram labels: EUDAT center, community center, ingestion point.

"Replicate my collection X to three data centres and store the collection safely for 10 years",
updating the sub-collection X1 weekly
and the sub-collection X2 hourly,
and keeping online the data uploaded during the last six months.

So far so good: we have our infrastructure and our policies. What is missing? Performance?
Image: http://www.flickr.com/photos/matthewpaulson/6024340932
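The replication request above (three data centres, ten years of retention, weekly and hourly sub-collection updates, six months kept online) can be written down as a small data structure. What follows is only an illustrative sketch: the class names, field names and the 183-day approximation of "six months" are assumptions made for this example, not EUDAT's actual policy format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SubCollectionRule:
    """How often one sub-collection must be re-synchronised (hypothetical type)."""
    name: str
    update_interval: timedelta


@dataclass(frozen=True)
class ReplicationPolicy:
    """Machine-readable form of the request on the slide (hypothetical type)."""
    collection: str
    replica_count: int
    retention_years: int
    online_window: timedelta
    rules: tuple

    def is_due(self, rule_name: str, last_sync: datetime, now: datetime) -> bool:
        """Return True when the named sub-collection should be synchronised again."""
        for rule in self.rules:
            if rule.name == rule_name:
                return now - last_sync >= rule.update_interval
        raise KeyError(rule_name)

    def keep_online(self, uploaded: datetime, now: datetime) -> bool:
        """Return True while an upload is still inside the online window."""
        return now - uploaded <= self.online_window


# The values come from the slide; "six months" is approximated as 183 days.
POLICY = ReplicationPolicy(
    collection="X",
    replica_count=3,
    retention_years=10,
    online_window=timedelta(days=183),
    rules=(
        SubCollectionRule("X1", timedelta(weeks=1)),
        SubCollectionRule("X2", timedelta(hours=1)),
    ),
)
```

Writing the request this way makes the planning questions explicit: each number in the sentence becomes a field that someone has to agree on and that a scheduler can check.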

If you produce 1 TB of data per day and you want to store it remotely, and it takes one week to move the data, then any policy is useless.
Image: http://www.flickr.com/photos/johnragai/8694960199

High Performance Transfers
Transfer tools: HTTP, iRODS, GridFTP.
Diagram axes: bandwidth, speed.
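The 1 TB per day example can be checked with a little arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and reads "one week to move the data" as one day's output taking seven days to transfer; both are assumptions for illustration, and the function names are made up.

```python
SECONDS_PER_DAY = 86_400
TB = 1e12  # decimal terabyte in bytes (assumption)


def sustained_mbit_per_s(bytes_moved: float, seconds: float) -> float:
    """Average throughput in Mbit/s needed to move bytes_moved in the given time."""
    return bytes_moved * 8 / seconds / 1e6


def backlog_tb(days: int, produced_tb_per_day: float, moved_tb_per_day: float) -> float:
    """Data still waiting to be transferred after the given number of days."""
    return max(0.0, days * (produced_tb_per_day - moved_tb_per_day))


# Throughput needed to keep up with production.
needed = sustained_mbit_per_s(TB, SECONDS_PER_DAY)
# Throughput implied by "one week to move one day's data".
actual = sustained_mbit_per_s(TB, 7 * SECONDS_PER_DAY)

print(f"needed: {needed:.2f} Mbit/s, actual: {actual:.2f} Mbit/s")
print(f"backlog after 7 days: {backlog_tb(7, 1.0, 1 / 7):.1f} TB")
```

Under these assumptions the link delivers about a seventh of the required rate, so the backlog grows by roughly 6/7 TB every day and no retention or update policy can be honoured.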

EUDAT transfer options
Globus Online: third-party transfer, web portal.
Globus client: client to server, server to server.
iRODS-specific client: client to server, server to server.
HTTP options: work in progress.

To join or not to join
Join: automated data movement, server to server.
Use: client to server.
Diagram labels: Data center, Community repository.

Persistent Identifiers (PID): definition
See the previous presentation, "Persistent Identifiers" by Maurizio Lunghi, for the general concept.
EUDAT relies on the EPIC service to associate persistent identifiers with digital objects (http://www.pidconsortium.eu).
EPIC is an identifier system using the Handle infrastructure.
Diagram: Handle resolution.

... and focus
Its focus is the registration of data at an early stage of the scientific process, where lots of data are generated and have to become referable so that other scientific groups or communities can collaborate on them, but where it is still unclear which small part of the data should remain available for a long time.
http://www.ands.org.au/services/pid-policy.html
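Since EPIC identifiers are Handles, resolution can be illustrated with the general Handle conventions: a Handle has the form prefix/suffix, and the public proxy at hdl.handle.net redirects to the registered location. The sketch below only builds the resolver URL and makes no network call. The example identifier is made up for illustration and is not a registered EUDAT PID; the function names are likewise hypothetical.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net"  # public Handle proxy resolver


def split_handle(handle: str) -> tuple:
    """Split 'prefix/suffix' (optionally written as 'hdl:prefix/suffix')."""
    text = handle.strip()
    if text.lower().startswith("hdl:"):
        text = text[4:]
    # Only the first slash separates prefix and suffix; the suffix may contain more.
    prefix, separator, suffix = text.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a Handle of the form prefix/suffix: {handle!r}")
    return prefix, suffix


def resolver_url(handle: str, proxy: str = HANDLE_PROXY) -> str:
    """Build the URL at which the proxy resolves the given Handle."""
    prefix, suffix = split_handle(handle)
    return f"{proxy}/{quote(prefix)}/{quote(suffix)}"


# Made-up identifier, used only to show the shape of the result.
print(resolver_url("hdl:12345/collection-X/file-001"))
```

Because the identifier stays fixed while the registered location can be updated, a citation that uses the Handle keeps working when the data move between data centres.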

Registered data
Diagram labels: data acquisition, generation, description, enrichment, processing, reduction, analysis, preservation; temporary data, global registration, referenceable data, citable publication; domain of globally referenceable data.

Safe replication explained
Diagram labels: EPIC service, PID, Community repository, three Data Center Stores, EUDAT CDI, domain of registered data.

Finally, all together
Setting up the automated data transfer between the EPOS community and EUDAT touched all the aspects mentioned so far:
Persistent Identifiers registration.
EPOS joined the EUDAT CDI.
We defined a specific policy with them: daily back-up, near-real-time sync (ongoing).
We tuned the transfer tools to achieve the best performance, but the high-performance tool (GridFTP) was useless since the bottleneck was the bandwidth.
So we chose a more flexible tool, the iRODS irsync protocol. To achieve an hourly synchronization we exploited its checksum sync and file age limit options.

Why we love data replication
Data bit-stream preservation.
More optimal data curation.
Better accessibility of data.
Identification of data through Persistent Identifiers (PIDs).
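The idea behind combining a checksum comparison with a file age limit can be sketched in a few lines. This is not irsync and does not use iRODS. It is a plain local-filesystem illustration with hypothetical function names, showing why the combination keeps an hourly run cheap: only recently modified files are examined, and only those whose content differs are copied.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_sync(source: Path, target: Path, max_age_seconds: float, now=None) -> list:
    """List source files modified within the age limit whose copy is missing or differs."""
    now = time.time() if now is None else now
    pending = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        if now - src.stat().st_mtime > max_age_seconds:
            continue  # older than the age limit: assumed handled by an earlier run
        dst = target / src.relative_to(source)
        if not dst.exists() or sha256_of(dst) != sha256_of(src):
            pending.append(src)
    return pending


def sync(source: Path, target: Path, max_age_seconds: float, now=None) -> list:
    """Copy every pending file to the target and return the paths written."""
    copied = []
    for src in files_to_sync(source, target, max_age_seconds, now):
        dst = target / src.relative_to(source)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

For an hourly run, an age limit slightly above one hour would leave some overlap between consecutive runs. The checksum test makes that overlap harmless, because files that were already copied are detected as identical and skipped.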

EUDAT is the solution, or maybe not.
We need your help!
Image: http://www.flickr.com/photos/hikingartist/3555349324

Contact: eudat-info@postit.csc.fi
e-IRG Workshop, Dublin, Ireland

Thank you!