EUDAT
Towards a Collaborative Data Infrastructure
Daan Broeder - MPI for Psycholinguistics - EUDAT - CLARIN - DASISH
Bielefeld, 10th International Conference
Data
  These days it is very easy to create data, but still far less easy to manage it.
    Experiment data
    Sensor-produced data
    Simulations
    Digital libraries
    The Web
  How to store, administrate, find, enrich, link, process, share, reuse and publish it?
  For this we need a data infrastructure, one that is efficient, sustainable and cost-effective.
Data creation cycle
  [Diagram: cycle with the stages "analysis & enrichment" and "registration & preservation"; data states shown: raw data, temp. data, referable data, citable data, citable publication]
The current data infrastructure landscape
  Long history of data management in Europe: several existing data infrastructures serve established and growing user communities (e.g., ESO, ESA, EBI, CERN).
  New Research Infrastructures (ESFRI roadmap) are emerging and are also trying to build data infrastructure solutions to meet their needs (CLARIN, EPOS, ELIXIR, ESS, etc.).
  However, most of these infrastructures and initiatives primarily address the needs of a specific discipline and user community.
  Challenges
    Compatibility and interoperability for cross-disciplinary research
    Data growth in volume and complexity has a strong impact on costs, threatening the sustainability of the infrastructure
  Opportunities
    Synergies do exist: although disciplines have different workflows and ambitions, they have common basic needs and requirements that can be matched with generic services supporting multiple communities
  A strategy is needed at pan-European level
Collaborative Data Infrastructure
EUDAT short fact list
  Project name: EUDAT European Data
  Start date: 1st October 2011
  Duration: 36 months
  Budget: 16.3 M€ (including 9.3 M€ from the EC)
  EC call: Call 9 (INFRA-2011-1.2.2): Data infrastructure for e-science (11.2010)
  Participants: 25 partners from 13 countries (national data centers, technology providers, research communities, and funding agencies)
  Objectives: To deliver a cost-efficient and high-quality Collaborative Data Infrastructure (CDI) with the capacity and capability to meet researchers' needs in a flexible and sustainable way, across geographical and disciplinary boundaries.
Consortium
Research Communities
Research fields
  Environmental Science: ENES, EPOS, LifeWatch, EMSO, IAGOS-ERI, ICOS, Euro-Argo
  Social Sciences and Humanities: CLARIN, CESSDA, DARIAH
  Biological and Medical Science: VPH, ELIXIR, BBMRI, ECRIN, DiXA
  Physical Sciences and Engineering: WLCG, ISIS, PanData
  Material Science: ESS
  EUDAT targets all scientific disciplines (discipline neutral):
    To capture and identify cross-discipline requirements
    To involve the scientists of all the communities in shaping the infrastructure and its services
EUDAT service design activities
  1. Capturing Communities Requirements (WP4)
    1st round of interviews with the five initial communities (Oct. 2011 - Dec. 2012)
      Understand how data is organised in each community
      Collect first wishes and specific requirements for a common data service layer
    Next phase: refine the analysis and expand it to other communities
  2. Building the corresponding services (WP5)
    Technology appraisal (ongoing)
      What is already available at partners' sites to build the corresponding services?
      What are the gaps and market failures that should be addressed by EUDAT?
    Next phase: developing candidate services
      Adapt services to match the requirements
      Integrate with community and SP services
      Test and evaluate with communities
  3. Deploying the services and operating the federated infrastructure (WP6)
    Designing the federated infrastructure and the interfaces for cross-site operations (ongoing)
    Next phase: integrating and coordinating resource provision, operations and support
EUDAT Core Service Areas
  Community-oriented services
    Simple data access and upload
    Long-term preservation
    Shared workspaces
    Execution and workflow
    Joint metadata and data visibility
    Simple storage facility for individual scientists and small projects
  Core services are the building blocks of EUDAT's Common Data Infrastructure, mainly located in the bottom layer of data services
  Enabling services (making use of existing services where possible)
    Persistent identifier service (EPIC, DataCite, ...)
    Federated AAI service (NRENs, eduGAIN)
    Network services
    Monitoring and accounting
Data Management Service Cases
  Safe Replication
    Replicate data between selected centers
    Based on user-specified policies
    For LTA, for easy access, ...
    Technology: iRODS
  Dynamic Replication (data staging)
    Moving data to HPC workspaces and storing the results
    Technology: iRODS + grid tools
  Usable PID framework
    Facilitate administrating data replication
    Allow identifying parts of objects
    Data verifiability, ...
    Technology: HS + EPIC and DataCite
  Center registry
    Listing EUDAT services, centers and their capabilities
Data Management Service Cases
  Joint metadata domain
    A metadata catalogue for (all?) research data
    Interdisciplinary (re-)use of data
    Semantic interoperability: explicit semantics and flexible relations, or hard-wired mappings, ...
    Granularity: include individual resources or data sets only?
    Commenting function
    Platform permitting data-set promotion
    Proper acknowledgements for data creators
    Technologies: icat, mercury, OAI-PMH, XSD, RDF, ...
  Simple Store
    A safe repository for all research data in need
    YouTube or Dropbox model
    (Detailed?) metadata
    Sharing
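Since OAI-PMH is listed among the candidate technologies for the joint metadata domain, a harvesting step might look roughly like the sketch below. It builds a standard `ListRecords` request and extracts identifiers and Dublin Core titles from a response. The endpoint URL, the sample response and the helper names `list_records_url` and `extract_records` are illustrative assumptions, not part of any EUDAT service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample response; a real harvester would fetch this over HTTP.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample corpus</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a community endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_records(xml_text):
    """Return identifier and title for every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    return records


if __name__ == "__main__":
    # catalogue.example.org is a placeholder endpoint, not a real service.
    print(list_records_url("https://catalogue.example.org/oai"))
    print(extract_records(SAMPLE_RESPONSE))
```

A joint catalogue would run such a harvest against each community endpoint and merge the results; mapping between community metadata schemas is the harder part and is not covered here.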
EUDAT Architecture
  [Diagram: EUDAT community centers, EUDAT data centers with LTA facilities, an EUDAT HPC center and a PRACE HPC center with HPC workspaces, the EUDAT PID service, the EUDAT metadata service (harvesting metadata), the EUDAT center registry and the EUDAT Simple Store; data objects are marked "D"]
Collaborations
  With the ESFRI (cluster) projects
  With service providers: EPIC, DataCite, ...
  EUDAT <-> EGI collaboration (& competition)
  US DataNET: DataONE, Data Conservancy, ...
  DAITF - Data Access & Interoperability Task Force
    This task will contribute to the efforts to establish an international task force.
    This work will be carried out in collaboration with OpenAIRE and other relevant initiatives/projects focusing on data.
Thank you for your attention
Interlinking data and publications
  [Diagram: actors (data curator, data depositor, reviewer, author, editor) carry identifiers for actors (ORCID); datasets & metadata and publications are each reached through an API and carry identifiers for data & publications (HS, DOI, URN)]
Organizations guiding data management infrastructure building
  ICSU, CODATA, WDC, COAR, EUDAT, iCORDI, OpenAIRE, DAITF
Move to DAITF & iCORDI, inspired by OpenAIRE and EUDAT
  [Diagram labels:
    Top-down process about strategies and needs, driven by science
    Bottom-up process towards solutions, driven by science
    HLSF process: workshops, working groups; domain of top scientists, senior technologists, policy makers
    DAITF process: conferences, working groups, hands-on training; data scientists, young scientists, technologists; domain of data infrastructure practitioners
    DAITF steering board
    iCORDI programmes: analysis programme, knowledge exchange programme, workshop programme, prototype programme
    Organisations: NSF, EC, CNRS, KNAW, MPG, CNR, DFG, NWO, STFC
    Horizontal data infrastructures; discipline/domain data infrastructures
    Relations: informing, influencing, interacting
    Other stakeholders: RCs, ROs, funders, etc.]
  1st workshop March 2012, Copenhagen; next workshop in October, Washington
  How to organize and support this process? IETF? DWF?
What has been done so far?
  [Timeline, 2006/8 to 2012. Events shown: UIPIU, Data2012 Workshop, DataNet1, DAITF Prepar. Workshop, ASIST Workshop, DataNet2, DWF Concept. Phases: tackling first data topics; brainstorming on data issues, need for global action & first focussed actions; global interaction in place]
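EUDAT Services for Communities!?
  [Table: check-mark matrix of communities (rows shown include DARIAH and DiXA) against EUDAT services; the column assignment of the check marks is not preserved]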
Safe Replication Use Case
  Objective: Allow communities to replicate data to selected data centers for storage in a robust, reliable and highly available manner, respecting existing conventions on stewardship and security.
  Using user-defined policies: e.g. make 4 copies, don't copy to the UK, ...
  Application: move data to locations where (1) curation and/or LTP services are present, (2) processing requiring HPC can take place, or (3) user data accessibility is improved.
  Replicated digital objects are identified through a single PID, with multiple locations associated with the PID record; one location per copy.
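The policy example on this slide ("make 4 copies, don't copy to the UK") and the single-PID rule can be illustrated with a small sketch. It is a toy model under stated assumptions, not the iRODS rule set EUDAT uses: the classes `DataCenter`, `ReplicationPolicy` and `PidRecord`, the center names, the `example.org` URLs and the PID string are all hypothetical, and no data is actually moved.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataCenter:
    name: str
    country: str  # ISO 3166-1 alpha-2 code; the UK is "GB"


@dataclass
class ReplicationPolicy:
    copies: int
    excluded_countries: frozenset = frozenset()


@dataclass
class PidRecord:
    pid: str
    locations: list = field(default_factory=list)  # one URL per copy


def select_targets(policy, centers):
    """Pick the first `policy.copies` centers that the policy does not exclude."""
    allowed = [c for c in centers if c.country not in policy.excluded_countries]
    if len(allowed) < policy.copies:
        raise ValueError("not enough eligible centers to satisfy the policy")
    return allowed[: policy.copies]


def replicate(pid, object_path, policy, centers):
    """Plan the copies and return one PID record listing a location per copy."""
    record = PidRecord(pid=pid)
    for center in select_targets(policy, centers):
        # Hypothetical URL scheme; a real deployment would register actual locations.
        record.locations.append(f"https://{center.name}.example.org/{object_path}")
    return record


# Hypothetical center list for illustration only.
CENTERS = [
    DataCenter("center-a", "DE"),
    DataCenter("center-b", "GB"),
    DataCenter("center-c", "NL"),
    DataCenter("center-d", "FI"),
    DataCenter("center-e", "IT"),
]

if __name__ == "__main__":
    policy = ReplicationPolicy(copies=4, excluded_countries=frozenset({"GB"}))
    record = replicate("11100/demo-0001", "collection/file.dat", policy, CENTERS)
    print(record.pid)
    for location in record.locations:
        print("  ", location)
```

Keeping all copies under one PID record means a resolver can hand out any of the listed locations, while the policy decides where those locations may be.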
Dynamic Replication Service Case
  Move an entire data set (i.e. data collection) back and forth between an EUDAT node and a non-EUDAT node: PRACE or EGI facilities
  Keep the data replicas at the non-EUDAT nodes in sync with the EUDAT nodes
  Ingest/register relevant simulation results back at the EUDAT nodes
  Candidate technologies: iRODS, Globus Online, FTS, UNICORE FTP, gtransfer
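Keeping a staged replica in sync with its EUDAT source amounts to comparing the two sides and transferring only what differs. The sketch below shows one common way to do that with checksum manifests; it is an illustration under assumptions, not how any of the candidate technologies work internally. File contents are held in memory, and the names `manifest` and `plan_sync` are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def manifest(files: dict) -> dict:
    """Map each file name to the checksum of its content."""
    return {name: checksum(content) for name, content in files.items()}


def plan_sync(source: dict, replica: dict) -> dict:
    """Compare two manifests and list what to copy to or delete from the replica."""
    to_copy = sorted(
        name for name, digest in source.items() if replica.get(name) != digest
    )
    to_delete = sorted(name for name in replica if name not in source)
    return {"copy": to_copy, "delete": to_delete}


if __name__ == "__main__":
    # Hypothetical collection on an EUDAT node and its copy in an HPC workspace.
    eudat_node = {"a.nc": b"1", "b.nc": b"2", "c.nc": b"3"}
    hpc_workspace = {"a.nc": b"1", "b.nc": b"old", "tmp.log": b"x"}
    plan = plan_sync(manifest(eudat_node), manifest(hpc_workspace))
    print(plan)  # {'copy': ['b.nc', 'c.nc'], 'delete': ['tmp.log']}
```

Simulation results produced at the HPC side would travel the other way: the same comparison with source and replica swapped yields the files to ingest and register back at the EUDAT node.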
[Diagram: layered service stack, from community-specific to generic
  Community specific: CLARIN LT web service infrastructure
  SSH communities wide (DASISH): DASISH common SSH metadata catalog, replication & preservation
  Data replication & preservation, publication: EUDAT
  HPC, grid services: PRACE, EGI
  Network services: GEANT
  Federated Identity Management
  Communities and projects shown: CLARIN, DARIAH, CESSDA, LifeWatch, ENVRI]