EUDAT - Open Data Services for Research Johannes Reetz EUDAT operations Max Planck Computing & Data Centre Science Operations Workshop 2015 ESO, Garching 24-27th November 2015 EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-infrastructures. Contract No. 654065 www.eudat.eu
Research Infrastructures: Where are we going? Research Infrastructure trends & challenges: Internationalisation; Diversification; Increasingly relying on ICT; Data deluge; Complexity; Trust, Authenticity; Citation, Credits; Open Access; Open Data. European RIs: around 500; 100 billion investment. Timeline: Middle Ages, 19th century, 20th century, 21st century 2
An e-infrastructure solution for pan-European Research Data Challenges. All research communities and RIs are facing similar data challenges: Where to store (big) data? How to keep it meaningful over time? How to share data? How to publish it? How to register data objects? How to connect them? How to find it? How to access it? How to transfer it? Solutions needed at global level: collaboration needed. Exploitation of synergies: some services are common to many communities and research domains; reduce investment and operational cost; collaborate on optimizing standards for APIs, MD, DO, identity profiles, policies 3
EUDAT consortium (2011, 2015) EUDAT offers common data services, supporting multiple research communities as well as individuals, through a geographically distributed, resilient network of 35 European organisations e-science Data Factory 4
HLEG 2010: Collaborative Data Infrastructure. Data Curation; Trust. Data Generators and Users: user functionalities, data capture & transfer, virtual research environments. Community Services: data discovery & navigation, workflow generation, annotation, interpretability. Common Data Services: persistent storage, identification, authenticity, workflow execution, mining 5
Community-Driven: Biomedical & Medical Sciences; Materials & Analytical Facilities; MAPPER; Physical Sciences & Engineering 6
EUDAT collaborating with other e-infrastructures Policy & guidelines Data management plans Service integration OpenAIRE RDA Policy & networking Output adoption Test beds GEANT LERU LIBER PRACE Cross-infra services & ops Common protocols, APIs HPC/HTC/Clouds EGI Helix Nebula Data Cloud 7
B2 Services 8
B2Services and the Research Data Pyramid 9
Participate by Using the CDI (1): Community; Thematic (Community) Centre. Using the CDI via standardised APIs. Community policies independent from the CDI. Community centre remains the main actor for community data stewardship. EUDAT CDI 10
Participate by Using the CDI (2): Community; Thematic (Community) Centre. Using the CDI via standardised APIs. Community policies independent from the CDI. Some responsibility for data stewardship delegated to a CDI centre (ingest node). EUDAT CDI 11
Participate by Joining the CDI: Community; Community Centre. Community centre installs EUDAT (B2SAFE) middleware. Common CDI policies concerning PID configuration, MD handling, security and other operational procedures apply. EUDAT CDI 12
Metadata support within the CDI: data and metadata via HTTP/JSON descriptions; data and metadata as separate objects; data and metadata in defined packages; data with embedded metadata descriptions (e.g. NetCDF, HDF5 file formats). Diagram labels: D = data, MD = metadata 13
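The four ways of pairing data and metadata listed on this slide can be sketched as a small data model. This is only an illustration: the enum values, the `describe` helper and the JSON layout are hypothetical and are not taken from any EUDAT specification.

```python
import json
from enum import Enum


class MetadataMode(Enum):
    # Hypothetical labels for the four cases named on the slide.
    JSON_DESCRIPTION = "http-json"   # data and metadata via HTTP/JSON descriptions
    SEPARATE_OBJECTS = "separate"    # data and metadata as separate objects
    PACKAGE = "package"              # data and metadata in a defined package
    EMBEDDED = "embedded"            # metadata inside the data file (NetCDF, HDF5)


def describe(data_ref, metadata, mode):
    """Return a JSON description of one digital object (assumed layout)."""
    doc = {"mode": mode.value, "data": data_ref}
    if mode is MetadataMode.EMBEDDED:
        # The metadata travels inside the file, so nothing is attached here.
        doc["metadata"] = None
    else:
        doc["metadata"] = metadata
    return json.dumps(doc, sort_keys=True)


if __name__ == "__main__":
    print(describe("obs.nc", {"title": "night 1"}, MetadataMode.EMBEDDED))
    print(describe("obs.dat", {"title": "night 1"}, MetadataMode.SEPARATE_OBJECTS))
```

The point of the sketch is that a client of an abstraction layer needs to know which of the four modes applies before it can locate the metadata for a given object.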
EUDAT CDI API Abstraction Layer. Community Services (e.g. specific discovery services); EUDAT API library; Abstraction Layer: GridFTP, HTTP. DO support (GridFTP): separate relevant metadata and data objects; as packages; embedded metadata. DO support (HTTP): HTTP/JSON descriptions; separate metadata and data objects; as packages; embedded metadata 14
Metadata Description Support Defined Templates Interpretable for EUDAT Uninterpretable for EUDAT 15
Production Infrastructure, Operational Services. Operational and support services: Central Registry of Sites & Services (creg.eudat.eu); Monitoring (cmon.eudat.eu); Data Project Resource Coordination (rct.eudat.eu); Helpdesk (helpdesk.eudat.eu). Community; Community Centre/Repository, Data Provider; General data centre (many HTC/HPC service providers); PID provider, providing PIDs (most of them are epic partners and can issue Handle prefixes) 16
IDM Integration: B2ACCESS (https://b2access.eudat.eu). Primary identities: X.509 IdP (PKI); SAML IdP; social IdP; OpenID IdPs from RIs (e.g. ESGF, ENES). Identity sources shown: Google, Facebook, LinkedIn, eduGAIN, RIs (e.g. CLARIN), PRACE, EGI, WLCG, ORCID, ResearcherID, Scopus, OpenID. B2ACCESS AAI functions: multi-protocol identity management with LoA support, powered by Unity IDM; OAuth 2 authorization server; EUDAT CA; EUDAT federation database; B2ACCESS IdP; user profile. Credentials: access token, X.509, SAML. EUDAT service endpoints: B2SHARE (OAuth 2), B2SAFE (X.509), B2STAGE (X.509), B2DROP (SAML), B2HANDLE (SAML), Data Project Coordination Portal, Helpdesk TTS, Site & Service Registry 17
IDM Integration, production October 2015. Primary identities: PKI; SAML IdP; social IdP. Identity sources shown: Google, eduGAIN, OpenID. B2ACCESS AAI functions: multi-protocol identity management, powered by Unity IDM; OAuth 2 authorization server; EUDAT CA; EUDAT federation database; B2ACCESS IdP; user profile. Credentials: access token, X.509, SAML. EUDAT service endpoints: B2SHARE (OAuth 2), B2SAFE (X.509), B2STAGE (X.509), B2DROP (SAML), B2HANDLE (SAML), Data Project Coordination Portal, Helpdesk TTS, Site/Repository & Service Registry 18
B2ACCESS EUDAT IDM http://b2access.eudat.eu 19
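Since B2ACCESS includes an OAuth 2 authorization server (used here by B2SHARE), the first step a client service takes can be sketched as building a standard authorization-code request. The query parameters are the generic OAuth 2 ones; the endpoint path, client ID, redirect URI and scope below are placeholders assumed for illustration, not values taken from B2ACCESS.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorization_url(authz_endpoint, client_id, redirect_uri,
                            scope="profile", state=None):
    """Build an OAuth 2 authorization-code request URL.

    Returns the URL and the `state` value, which the client must compare
    with the `state` echoed back on the redirect.
    """
    state = state or secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",   # authorization-code grant
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    })
    return f"{authz_endpoint}?{query}", state


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    url, state = build_authorization_url(
        "https://idm.example.org/oauth2/authorize",
        client_id="my-service",
        redirect_uri="https://service.example.org/callback",
    )
    print(url)
```

After the user authenticates with one of the primary identities, the authorization server redirects back with a code that the service exchanges for the access token shown on the slide.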
Example: Safe Replication (eudat.eu/b2safe). The ideal solution to: replicate research data into secure data stores; archive and preserve research data in the long term; bring data permanently close to powerful compute resources; co-locate data with different communities; benefit from economies of scale. Features: large-scale storage; robust and highly available; permanent PIDs 21
B2SAFE Use Case from CLARIN ERIC: Replication of Linguistic Data. Diagram labels: PID, iRODS, SAM-QFS, GPFS, dCache, HPSS, DMF 22
Who? Groups, communities and centres who want to make their data referenceable in a stable way. What? Follows policies to register data and make it referable and citable in the long term. Reliability through mutual PID mirroring; Handle Prefix Registrars from epic or other DONA MPAs are partners of EUDAT. Provides the abstraction layer between a globally unique persistent identifier and the physical location of data objects. Machine readable via an HTTP RESTful API. Why Handles? Stable, globally unique IDs; stable cross-links; technology agnostic; simple integration. Development activity: develop policies for the B2HANDLE service (e.g. PID namespace management); consolidate the PID record profile for the CDI; define PID Information Types for data, metadata, collection records; integrate with Data Type Registry service; consolidate B2HANDLE API library with EUDAT API library 23
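To make the "abstraction layer between identifier and physical location" concrete, here is a minimal sketch of reading a Handle record as JSON. It assumes the public Handle proxy REST interface at `hdl.handle.net/api/handles/` and its usual response shape (`responseCode`, `values`, `type`, `data.value`); the handle, URL and checksum entry in the sample record are invented for illustration, and no network call is made.

```python
from urllib.parse import quote

# Public Handle proxy REST endpoint (assumed here; a site may run its own).
HANDLE_PROXY = "https://hdl.handle.net/api/handles/"


def resolution_url(handle):
    """Return the REST URL that resolves `handle` (prefix/suffix)."""
    return HANDLE_PROXY + quote(handle, safe="/")


def locations(record):
    """Extract the physical locations (URL values) from a parsed Handle record."""
    if record.get("responseCode") != 1:   # 1 means the handle was resolved
        return []
    return [v["data"]["value"]
            for v in record.get("values", [])
            if v.get("type") == "URL"]


# Invented sample record in the shape the proxy returns.
SAMPLE_RECORD = {
    "responseCode": 1,
    "handle": "11111/example-0001",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string",
                  "value": "https://repo.example.org/objects/0001"}},
        {"index": 2, "type": "CHECKSUM",   # hypothetical extra field
         "data": {"format": "string",
                  "value": "d41d8cd98f00b204e9800998ecf8427e"}},
    ],
}

if __name__ == "__main__":
    print(resolution_url(SAMPLE_RECORD["handle"]))
    print(locations(SAMPLE_RECORD))
```

If the data object moves, only the `URL` value in the record changes; the identifier that users cite stays the same.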
Worldwide PID system, a DNS for Data, being built. DONA (dona.net): representatives from all continents; stewards of the Handle System. Worldwide MPAs with RAs: IDF, epic, CNRI, etc. A worldwide system to register digital objects via Handles, similar to the assignment of FQDNs to compute resources that have IP addresses. Need to be able to identify and test integrity and authenticity of data, relations to metadata, apps, services etc. The Handle System offers a powerful solution. DONA is a foundation under Swiss law to make the Handle System independent from CNRI. A federation of Multi-Primary Administrators (MPAs) and Prefix Registration Authorities is being established 24
EUDAT Production Environment: Data Management Project Enabling; Helpdesk & Support; Network, Configuration; Compute Resources; Service Hosting, Service on Demand; Service Deployment; Storage, Storage Services; Security Team; Service and Resource Provisioning & Coordination; Operational & Central Services. 14 generic centres, 15 PB committed, 5-10 Gb/s per site (potential of > 1000 PB aggregated) 25
Configuration of Handle PIDs as linked list A1
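The slide content is cut off here, so the following is only a generic sketch of what "PIDs as a linked list" can mean: each replica's PID record carries a pointer to the PID it was copied from, and following the pointers leads back to the original. The `parent` field name and the PIDs in the example are hypothetical and are not taken from the EUDAT PID record profile.

```python
def chain_to_origin(records, pid, parent_field="parent"):
    """Follow parent pointers from `pid` to the original object.

    `records` maps each PID to its record (a dict). The field holding the
    pointer is an assumption made for this sketch.
    Returns the list of PIDs visited, starting with `pid`.
    """
    seen = set()
    chain = []
    while pid is not None:
        if pid in seen:
            # A well-formed list never loops; stop instead of spinning.
            raise ValueError(f"cycle detected at {pid}")
        seen.add(pid)
        chain.append(pid)
        pid = records[pid].get(parent_field)
    return chain


if __name__ == "__main__":
    # Invented PIDs: an original and two successive replicas.
    records = {
        "11111/orig": {},
        "22222/copy-1": {"parent": "11111/orig"},
        "33333/copy-2": {"parent": "22222/copy-1"},
    }
    print(chain_to_origin(records, "33333/copy-2"))
```

In a real deployment each lookup in `records` would be a Handle resolution rather than a dictionary access.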