dCache: challenges and opportunities when growing into new communities. Paul Millar, on behalf of the dCache team


dCache: challenges and opportunities when growing into new communities
Paul Millar, on behalf of the dCache team
EMI is partially funded by the European Commission under Grant Agreement RI-261611

Contents
- Orientation: what is dCache?
- Storage for the non-HEP user
- Cella Nova: the new storage
- Summary

1. Orientation

What is dCache?
- Data storage system
  - Upload files, get at uploaded bytes again
  - Files can be deleted, renamed, moved but not updated or appended; subdirectories can be created, deleted, moved, renamed.
- Separates front-end nodes, storage nodes and namespace (makes it scale)
- Supports multiple protocols: *FTP, HTTP/WebDAV, NFS 4.1, xrootd & *dcap.
- Runs on multiple platforms (just needs a JVM)
- Many advanced features
  - Fine-grain control over data placement (on write, on stage)
  - Supports pools that are read-only, write-only, stage-only or any combination thereof
  - Dynamic hot-spot replication
  - Supports tape back-ends
  - Can maintain redundant internal copies of data
  - Flexible approach for establishing users' identity
  - Supports data integrity assurance
  - Many aspects may be customised by writing plugins
  - plus more...
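The file semantics listed above (upload, read, rename, delete, but no update or append) can be pictured with a small in-memory model. This is only an illustrative sketch: the class `WriteOnceNamespace` and its method names are hypothetical and are not part of dCache.

```python
class WriteOnceNamespace:
    """Toy model of write-once file semantics (hypothetical, not dCache code)."""

    def __init__(self):
        self._files = {}  # path -> immutable bytes

    def upload(self, path, data):
        # A path that already holds data cannot be overwritten or appended to.
        if path in self._files:
            raise FileExistsError(f"{path} exists; delete it before re-uploading")
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]

    def rename(self, old, new):
        # Renaming or moving changes the name only, never the stored bytes.
        if new in self._files:
            raise FileExistsError(new)
        self._files[new] = self._files.pop(old)

    def delete(self, path):
        del self._files[path]


ns = WriteOnceNamespace()
ns.upload("/data/run1.raw", b"abc")
ns.rename("/data/run1.raw", "/archive/run1.raw")
```

In this model, "changing" a file means deleting it and uploading a new copy, which is the write pattern that end-user software has to accommodate.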

dCache evolution
- 2000-2005: the site-centric era
  - Providing storage for local users. Users authenticate against site-local systems.
- 2005-2011: the Grid era
  - Deployed at sites throughout the world as a Storage Element using X.509 identification.
- 2011-...: the SaaS era (Storage as a Service)
  - A single dCache can provide storage for multiple end-user groups, auto-provisioning users, who identify themselves in various ways, providing different qualities of service (Amazon S3-like service, DropBox-like service, federated storage, ...)
- NB. these dates are very approximate

2. Storage for the non-HEP user

LOFAR
- Juelich and SARA use dCache to provide storage for LOFAR
- SARA currently provides ~1PB of storage
- Used for LOFAR's LTA: long-term archive
- Data accessed using SRM + GridFTP, users identified with X.509
- SARA is investigating HTTP/WebDAV
  - No space tokens, but different QoS provided (d1t0, d1t1 and d0t1).
  - X.509 and username+pw authentication.
- LOFAR have developed integration software
  - Generally treats EGEE/BiG Grid and Astro-WISE as separate domains
  - Metadata (hosted in Astro-WISE) is common and LTA data is accessible from both domains.
- LOFAR users cope with (but don't like using) X.509 user certificates. Normal authentication is with LDAP.
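The QoS labels d1t0, d1t1 and d0t1 can be decoded mechanically. The sketch below assumes the usual grid reading of such labels, where the number after "d" is the count of disk copies and the number after "t" is the count of tape copies; the names `Qos` and `parse_qos` are hypothetical helpers, not part of any dCache interface.

```python
import re
from typing import NamedTuple


class Qos(NamedTuple):
    disk_copies: int
    tape_copies: int


def parse_qos(label: str) -> Qos:
    """Parse a label such as 'd1t0' (assumed meaning: 1 disk copy, 0 tape copies)."""
    match = re.fullmatch(r"d(\d+)t(\d+)", label.strip().lower())
    if match is None:
        raise ValueError(f"not a dNtM label: {label!r}")
    return Qos(int(match.group(1)), int(match.group(2)))


def describe(label: str) -> str:
    qos = parse_qos(label)
    if qos.disk_copies and qos.tape_copies:
        return "kept on disk and on tape"
    if qos.disk_copies:
        return "kept on disk only"
    if qos.tape_copies:
        return "kept on tape, staged to disk on demand"
    return "no guaranteed copy"


for label in ("d1t0", "d1t1", "d0t1"):
    print(label, "->", describe(label))
```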

FEL example
- Free electron laser facility, currently being built at DESY
- Software design is currently under development
- dCache will be used, likely to provide archival storage
- A potential barrier to broader use is end-user software's write patterns and possibly immutability.
- Metadata is key for most users' work-flow
  - Discovery of data is through the metadata
  - Metadata is held outside dCache, with a web-portal to allow browsing and searching.
  - Web access is initially via the portal, but redirected to dCache for accessing data.

EUDAT example
- Relatively short project (3 years)
  - Needs to take ready-to-use software and deploy it, with minimum integration.
- User communities already have large amounts of data:
  - Software must work alongside what already exists.
- Unclear to what extent dCache will be used (although SARA is a member), but their requirements are interesting.

EUDAT Core Services
- Note how (in general) the underlying storage isn't mentioned, it's assumed.
- This relies on easy integration of storage with higher-level functionality.
Slide adapted from Damien Lecarpentier's presentation, KE Research Data Working Group Meeting, Copenhagen, 14th August 2012

EUDAT: requirements
[Table: services (SR, DR, MD, SS, PID, AAI) by community (CLARIN, ENES, EPOS, VPH, LifeWatch); individual cell values not recoverable]
- Note that AAI (Authentication) is a common requirement, and that all communities either require or are interested in PID (persistent identifiers).
Slide adapted from Damien Lecarpentier's presentation, KE Research Data Working Group Meeting, Copenhagen, 14th August 2012

Summary of friction points
- Evaluation stage:
  - Does a new project even know about dCache?
  - Do they understand dCache's flexibility?
- Missing features:
  - Missing functionality within dCache (e.g., mutability?)
  - Necessary hooks for easy integration with higher-level components
- Authentication and identity management.
- Authorisation: if not based on filesystem permissions.
- Boundary activity: data ingest, egress and management.
- Desire for turn-key solutions

3. Cella Nova: the new storage

Evaluation stage
- People can only evaluate what they know about
- How do people know to evaluate dCache?
  - Word-of-mouth
  - EMI, EGI, ScienceSoft, ...
- But, as a general message:
  - If you're building something that needs reliable, flexible, powerful storage, have a look at dCache.
  - If you find a limitation, get in touch with the developers <support@dcache.org>; we might already be planning to work on it (or it might be easy to fix).
- Would it make sense for the EU to have a registry of EU-funded software projects?

Identity management
- Federated Identity Management (FIM): OpenID, SAML ("Shibboleth"), OAuth2, ...
- gPlazma is powerful enough to support all these
  - Its use of plugins is ideal, just need to write the plugins :-)
- HTTP access needs updating to provide new login possibilities
  - OpenID login, Web-profile SAML, ...
- There is still a problem with non-HTTP access:
  - Moonshot is the most promising approach; it's also being investigated by other projects (Contrail, EUDAT, ...)
- Need to handle provisioning: creating accounts automatically.
- Decommissioning is problematic: it's still generally an unsolved problem in FIM.

Authorisation
- Currently dCache supports UNIX permissions and NFS ACLs
  - Users have a UID & GIDs
  - Permissions decided by ownership of files & directories and their modes.
  - ACLs allow a more flexible description.
- For grid users, their DN and FQAN(s) define their UID and GIDs.
  - Current mapping is somewhat awkward, but work is underway to fix this.
- Many projects have a roughly similar approach:
  - User presents group-membership token(s), which are mapped to GIDs.
- Other projects may wish to make decisions completely outside of dCache
  - One approach is for users to supply an authz token with a request
  - Another approach is to call out for each operation (e.g., XACML)
  - Some support already exists, but it is not uniform and only for the ALICE approach.
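The DN and FQAN mapping described above can be pictured as two table lookups. This is a minimal sketch under stated assumptions: the tables, DNs, FQANs and numeric IDs are invented for illustration, and the function `map_grid_user` is hypothetical; it does not reproduce gPlazma's actual configuration or plugin interface.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical site-local tables; real deployments hold these in configuration.
DN_TO_UID = {"/C=NL/O=Example/CN=Alice Example": 1001}
FQAN_TO_GID = {"/lofar": 2000, "/lofar/lta": 2001}


@dataclass(frozen=True)
class LocalIdentity:
    uid: int
    gids: Tuple[int, ...]  # first entry acts as the primary group


def map_grid_user(dn: str, fqans: Sequence[str]) -> LocalIdentity:
    """Translate certificate DN and FQANs into a UID and GIDs."""
    if dn not in DN_TO_UID:
        raise PermissionError(f"unknown DN: {dn}")
    # Unknown FQANs are ignored; the order of known ones is preserved.
    gids = tuple(FQAN_TO_GID[f] for f in fqans if f in FQAN_TO_GID)
    if not gids:
        raise PermissionError("no FQAN maps to a local group")
    return LocalIdentity(uid=DN_TO_UID[dn], gids=gids)


identity = map_grid_user(
    "/C=NL/O=Example/CN=Alice Example",
    ["/lofar/lta", "/lofar", "/other-vo"],
)
print(identity)
```

Once the request carries a UID and GIDs, the usual ownership and mode checks (or ACLs) apply unchanged, which is why group-membership tokens from other communities can reuse the same path.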

Boundary activities
- Data ingest and egress:
  - Trigger activity when data is uploaded (e.g., update catalogue, extract metadata)
  - Trigger activity when data is downloaded (e.g., redacting or anonymising)
  - Should these activities happen inside dCache or outside, triggered by dCache?
- User may have a non-modifiable analysis application
  - Can't modify (no source code) or don't want to modify
- dCache's use of standard protocols (NFS, HTTP, WebDAV, FTP)
  - Better chance of dCache being accessible to the client's application.
  - Get the clients for free (or almost for free?)
- Community comes with additional protocol requirements
  - More than just upload and download: can add support for a new protocol.
- Management of data
  - dCache provides SRM as a standard management interface.
  - Other interfaces provide a subset of SRM functionality.
  - Do user concepts match dCache management concepts?
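One way to picture the "outside, triggered by dCache" option for ingest and egress is a small event registry in which external handlers subscribe to upload and download events. This is a sketch only: the class `TransferEvents`, the event names and the handlers are hypothetical and do not correspond to an existing dCache interface.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TransferEvents:
    """Minimal publish/subscribe registry for storage events (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], str]]] = defaultdict(list)

    def on(self, event: str, handler: Callable[[str], str]) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: str, path: str) -> List[str]:
        # Run every handler registered for this event, in registration order.
        return [handler(path) for handler in self._handlers[event]]


catalogue: List[str] = []


def update_catalogue(path: str) -> str:
    catalogue.append(path)
    return f"catalogued {path}"


def anonymise(path: str) -> str:
    return f"anonymised copy of {path} served"


events = TransferEvents()
events.on("uploaded", update_catalogue)
events.on("downloaded", anonymise)

print(events.emit("uploaded", "/data/run42.h5"))
print(events.emit("downloaded", "/data/run42.h5"))
```

Keeping the handlers outside the storage system leaves community-specific logic (catalogues, metadata extraction, redaction) in the community's own code, at the cost of needing a reliable notification path.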

Integration
- Storage is a minimum service
  - Often seen as some hidden back-end to higher-level functionality
- How much functionality should be in dCache?
  - Storing user-supplied metadata
  - Persistent identifiers?
- How flexible should dCache be internally?
  - As RDF triple-store? With SPARQL end-point? With a reasoner? What complexity class?
  - Should it embed a domain-specific language triggered by activity within dCache?
- Need to provide sufficient hooks to allow easy integration with higher-level services
  - What dCache activity should trigger these hooks?
  - Work on this has already started within EMI
- What should the reverse interaction (external systems triggering dCache activity) look like?

Summary
- Presented examples of non-HEP communities with strong data requirements
- Although dCache is being used by non-HEP users, there are points that hinder their adoption of dCache
- We are working on these points, allowing people to better use dCache.

...and my thanks to Ron Trompert and Shaun de Witt for their help and input. Thanks for listening...

Questions? Discussion?

Backup slides

gPlazma: new

HTTP / WebDAV
- How do we support non-HEP users?
- dcap, SRM, rfio, xrootd
  - Nobody outside HEP has heard of these
- HTTP & WebDAV
  - Everyone has a web-browser
  - WebDAV is commonly available on platforms
  - Deployed in production: DESY, PIC, BNL, ...
  - Used by Microsoft's SkyDrive service
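Because HTTP and WebDAV are standard protocols, a client needs nothing storage-specific to talk to such an endpoint. The sketch below builds the three basic requests (PUT to upload, MKCOL to create a collection, GET to download) with Python's standard library. The base URL is a placeholder, the helper names are hypothetical, and authentication is left out; the requests are constructed but not sent.

```python
import urllib.request

# Placeholder endpoint; substitute the WebDAV door of a real installation.
BASE_URL = "https://dcache.example.org/data"


def build_upload(name: str, payload: bytes) -> urllib.request.Request:
    """HTTP PUT: store the payload under the given name."""
    req = urllib.request.Request(f"{BASE_URL}/{name}", data=payload, method="PUT")
    req.add_header("Content-Type", "application/octet-stream")
    return req


def build_mkdir(name: str) -> urllib.request.Request:
    """WebDAV MKCOL: create a collection (directory)."""
    return urllib.request.Request(f"{BASE_URL}/{name}/", method="MKCOL")


def build_download(name: str) -> urllib.request.Request:
    """HTTP GET: read the stored bytes back."""
    return urllib.request.Request(f"{BASE_URL}/{name}", method="GET")


upload = build_upload("run42.h5", b"example bytes")
print(upload.get_method(), upload.full_url)
# Passing a request to urllib.request.urlopen() would send it to the server.
```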

NFS v4.1 / pNFS
- Industry standard protocol
- It is available NOW: RHEL/SL 6.x, Fedora, Debian ("Wheezy"), Ubuntu, Windows, Solaris, ...
- In production (at DESY) for over a year
- Fermi REX dept. evaluated dCache NFSv4.1 for Fermilab Intensity Frontier:
  - Results look promising, throughput scales well with number of pool nodes
- Authn: trusted-host or Kerberos

NFS 4.1 with X.509
- HEP uses X.509 client certificates for authn and authz decisions (everyone else is using Kerberos)
- NFS 4.1 doesn't support this, currently
- Support is needed for HEP jobs to use NFS.
- Collaborating with CERN/DPM to solve this
- Linux has pluggable authn, so this is fixable.