Scientific data management

Similar documents
Introduction to Programming and Computing for Scientists

Introduction to Programming and Computing for Scientists

Client tools know everything

Lessons Learned in the NorduGrid Federation

Grid Data Management

glite Grid Services Overview

Introduction Data Management Jan Just Keijser Nikhef Grid Tutorial, November 2008

Introduction to Programming and Computing for Scientists

Grid services. Enabling Grids for E-sciencE. Dusan Vudragovic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

A scalable storage element and its usage in HEP

Introduction to SRM. Riccardo Zappi 1

Data Storage. Paul Millar dcache

Operating the Distributed NDGF Tier-1

Philippe Charpentier PH Department CERN, Geneva

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Chelonia. a lightweight self-healing distributed storage

Data Replication: Automated move and copy of data. PRACE Advanced Training Course on Data Staging and Data Movement Helsinki, September 10 th 2013

Architecture Proposal

Grid Computing. MCSN - N. Tonellotto - Distributed Enabling Platforms

Understanding StoRM: from introduction to internals

ODC and future EIDA/ EPOS-S plans within EUDAT2020. Luca Trani and the EIDA Team Acknowledgements to SURFsara and the B2SAFE team

Distributing storage of LHC data - in the nordic countries

dcache: challenges and opportunities when growing into new communities Paul Millar on behalf of the dcache team

dcache Introduction Course

Usage statistics and usage patterns on the NorduGrid: Analyzing the logging information collected on one of the largest production Grids of the world

ARC NOX AND THE ROADMAP TO THE UNIFIED EUROPEAN MIDDLEWARE

ARC integration for CMS

High Performance Computing Course Notes Grid Computing I

Grid Infrastructure For Collaborative High Performance Scientific Computing

Information and monitoring

A Simple Mass Storage System for the SRB Data Grid

Data Management for the World s Largest Machine

EGI-InSPIRE. ARC-CE IPv6 TESTBED. Barbara Krašovec, Jure Kranjc ARNES. EGI-InSPIRE RI

Andrea Sciabà CERN, Switzerland

Data Access and Data Management

Grid Architectural Models

Guidelines on non-browser access

and the GridKa mass storage system Jos van Wezel / GridKa

PoS(ISGC 2016)035. DIRAC Data Management Framework. Speaker. A. Tsaregorodtsev 1, on behalf of the DIRAC Project

Outline. ASP 2012 Grid School

Distributed Data Management on the Grid. Mario Lassnig

Performance of the NorduGrid ARC and the Dulcinea Executor in ATLAS Data Challenge 2

A distributed tier-1. International Conference on Computing in High Energy and Nuclear Physics (CHEP 07) IOP Publishing. c 2008 IOP Publishing Ltd 1

The Grid Monitor. Usage and installation manual. Oxana Smirnova

Prototype DIRAC portal for EISCAT data Short instruction

DIRAC data management: consistency, integrity and coherence of data

g-eclipse A Framework for Accessing Grid Infrastructures Nicholas Loulloudes Trainer, University of Cyprus (loulloudes.n_at_cs.ucy.ac.

Managing Scientific Computations in Grid Systems

I data set della ricerca ed il progetto EUDAT

N. Marusov, I. Semenov

Promoting Open Standards for Digital Repository. case study examples and challenges

Future trends in distributed infrastructures the Nordic Tier-1 example

Grid Data Management

EMI Deployment Planning. C. Aiftimiei D. Dongiovanni INFN

The glite middleware. Ariel Garcia KIT

NeuroLOG WP1 Sharing Data & Metadata

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

The NorduGrid Architecture and Middleware for Scientific Applications

I Tier-3 di CMS-Italia: stato e prospettive. Hassen Riahi Claudio Grandi Workshop CCR GRID 2011

LCG-2 and glite Architecture and components

Swedish National Storage Infrastructure for Academic Research with irods

Service Availability Monitor tests for ATLAS

Progress in building the International Lattice Data Grid

LCG data management at IN2P3 CC FTS SRM dcache HPSS

Efficient HTTP based I/O on very large datasets for high performance computing with the Libdavix library

The LHC Computing Grid

Data Management 1. Grid data management. Different sources of data. Sensors Analytic equipment Measurement tools and devices

Interoperating AliEn and ARC for a distributed Tier1 in the Nordic countries.

Experience of Data Grid simulation packages using.

Ivane Javakhishvili Tbilisi State University High Energy Physics Institute HEPI TSU

ARC middleware. The NorduGrid Collaboration

EUDAT & AAI. Daan Broeder MPI for Psycholinguistics

dcache Introduction Course

Insight: that s for NSA Decision making: that s for Google, Facebook. so they find the best way to push out adds and products

( PROPOSAL ) THE AGATA GRID COMPUTING MODEL FOR DATA MANAGEMENT AND DATA PROCESSING. version 0.6. July 2010 Revised January 2011

CHIPP Phoenix Cluster Inauguration

HEP Grid Activities in China

Oracle Secure Backup 12.1 Technical Overview

Data Management. Enabling Grids for E-sciencE. Vladimir Slavnic Scientific Computing Laboratory Institute of Physics Belgrade, Serbia

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Distributed Computing Framework. A. Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille

PoS(EGICF12-EMITC2)106

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

AMGA metadata catalogue system

Grid Middleware and Globus Toolkit Architecture

Managing Petabytes of data with irods. Jean-Yves Nief CC-IN2P3 France

Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft. Presented by Manfred Alef Contributions of Jos van Wezel, Andreas Heiss

irods usage at CC-IN2P3: a long history

SharePoint 2016 Administrator's Survival Camp

Experiences with the new ATLAS Distributed Data Management System

The Materials Data Facility

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

The SweGrid Accounting System

EUDAT Data Services & Tools for Researchers and Communities. Dr. Per Öster Director, Research Infrastructures CSC IT Center for Science Ltd

ARC-XWCH bridge: Running ARC jobs on the XtremWeb-CH volunteer

An Introduction to GPFS

The Grid: Processing the Data from the World s Largest Scientific Machine

Monitoring the Usage of the ZEUS Analysis Grid

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Edinburgh (ECDF) Update

Transcription:

Scientific data management

Storage and data management components Application database Certificate Certificate Authorised users directory Certificate Certificate Researcher Certificate Policies Information services File Catalogue Grid job management service Certificate Special data management components: File catalogue File transfer service Grid tools Certificate Grid tools Researcher Certificate Storage Grid job services deal with: management service Access control storage Accounting Information 2014 Oxana Smirnova, Dept. of Physics 93

More than just storage Scientific data management is much more than just storage need to be stored need to be made available for processing have to be preserved for future re-processing need to be shared with other researchers 2014 Oxana Smirnova, Dept. of Physics 94

management on YouTube (by NYU) 2014 Oxana Smirnova, Dept. of Physics 95

Storage requirements Large enough storage is needed From Gigabytes for some to Petabytes for some others Large enough bandwidth between the apparatus, storage, data processing facility and the user Terabits per second these days Controlled write, read and list access From password to certificates and VOs to federated identities Encryption: for data, or transfer, or both, or none Adequate space management quotas, space recovery utilities Transfer, replication and migration tools Per file, per logical group, per database 2014 Oxana Smirnova, Dept. of Physics 96

Storage requirements (continued) Backup a large range of requirements: from basic RAID to multiple replicas, local and remote, tapes or other media, with recovery of older versions in case of accidental modifications etc Indexing of what is stored From basic POSIX information listing to metadata catalogs, adequately protected from unauthorised access Logical organization in terms of metadata-based grouping possibly reflected in physical grouping for optimization Monitoring and statistics collection (accounting) Various usage parameters as a function of time, access monitoring and such - all protected from unauthorised access 2014 Oxana Smirnova, Dept. of Physics 97

processing requirements Accessibility of data from a computing resource streaming and/or caching copying for a direct access: single files, logical groups, per job or per site querying in case of databases Persistent and unique identifiers per file, per logical group, per database and such Mechanisms to match/resolve identifiers to physical addresses For streaming, copying, querying and such Well-defined (formalized) data formats/structures and tools for conversion between (at least some of) them 2014 Oxana Smirnova, Dept. of Physics 98

Preservation requirements Continuous storage upgrade and media migration Identifiable authorship of the data per file, logical group, database etc Provenance information original data taking conditions, possible modifications, changes of ownership, changes of access rights etc Preservation of data format description possibly encapsulated in data 2014 Oxana Smirnova, Dept. of Physics 99

Preservation requirements (continued) Preservation of processing algorithms and/or workflows Preservation of computing environments used to produce or process the data Preservation of metadata Preservation of accessibility: if possible, preservation of protocols when protocols change, consistent re-mapping of data identifiers to new protocols long-term access rights management: granting write access to curators, migration to new AAI technologies, revoking access rights of non-authorized individuals, opening up for public read and list access 2014 Oxana Smirnova, Dept. of Physics 100

Sharing requirements Access control managed by authorized users From dedicated managers for large research groups, to individual researchers who produce the data Networked access via industry-standard protocols and means like http/webdav today, remotely mounted or synched file systems, etc Discovery tools relying on metadata Including authorship, provenance and other info Upload tools From simple copy or file-by-file transfer, to portals and other utilities dealing with logical groups, databases etc 2014 Oxana Smirnova, Dept. of Physics 101

What does Grid offer today for data management Storage Elements (SE) Disk/tape storage pools managed by storage middleware (e.g. dcache)» Internal space management, shares, some backup etc» Grid access control» Storage federation (common name space across different SEs)» Accounting and information File transfer service (FTS by EMI) and metadata indexing services Simple file catalogue (LFC by EMI) Application-specific metadata catalogues» Attempts to create generic catalogues failed Client tools for the above 2014 Oxana Smirnova, Dept. of Physics 102

Graphics adapted from Patrick Fuhrmann dcache storage solution Clients: PCs ARC CE Worker nodes etc SRM NFS4.1 dcache storage system Tertiary storage: Tapes Clouds http(s) Head node WebDAV Pool Pool gsiftp xroot Pool Pool dcap 2014 Oxana Smirnova, Dept. of Physics 103

Graphics by Patrick Fuhrmann dcache details a.k.a GridFTP 2014 Oxana Smirnova, Dept. of Physics 104

Story of Storage Resource Manager (SRM) The first Grid standard Introduced as an abstraction on top of the multitude of transfer protocols (gsiftp, https etc) SRM-enabled service is meant to provide: Transfer protocol negotiation Dynamic Transfer URL allocation Uniform access to heterogeneous storage, access permissions Access to permanent and temporary types of storage Space reservation Reliable transfer services The specification was never implemented to a full extent Gradually losing importance, as https starts to dominate 2014 Oxana Smirnova, Dept. of Physics 105

Graphics by StoRM Grid file names LFN example: data2014-1-raw GUID: 26851250-b9f8-11e3-a5e2-0800200c9a66 Globally Unique Identifier; can denote a dataset or a single file SURL: srm://dcache.swegrid.se/lund/astro/data2014-1-raw.xls TURL: https://server5.liu.se/pool3/12nsd3/data2014-1-raw.xls Mapping managed by SRM 2014 Oxana Smirnova, Dept. of Physics 106

Some other data management services LHC File Catalogue (LFC) maps LFNs to GUIDs to SURLs Can store some file metadata (size, checksum etc)» metadata: data about data File Transfer Service (FTS) Performs massive file transfers between Storage Elements on behalf of Grid users Makes use of X509, SRM, GridFTP, https Both LHC and FTS are rarely used outside CERN Are designed for handling very large amounts of files 2014 Oxana Smirnova, Dept. of Physics 107

ARC data management tools ARC comes with a set of basic file management tools Use X509 Grid security Support most known protocols, including SRM» Can also interact with LFC and some area-specific data catalogues Can not interact with FTS, neither can be used for storage management Commands: arcls, arccp, arcrm, arcmkdir 2014 Oxana Smirnova, Dept. of Physics 108

Nordic Grid infrastructure: distributed storage Scientists Authorised users directory (VOMS) ARC Indexing servers Cache Cache Cache Cache Copenhagen Oslo Linköping Umeå SGAS accounting server Cache Stockholm Cache Cache Cache Helsinki Bergen Uppsala Cache Lund Site-BDII servers ARC computing services Operators CERN dcache storage cloud dcache server FTS server 2014 Oxana Smirnova, Dept. of Physics 109

SweStore uses Grid technologies SweStore is a Swedish national long-term storage for various researchers http://snicdocs.nsc.liu.se/wiki/swestore Has storage pools distributed over several centers Part of it in LUNARC; central services in Linköping Uses dcache, and recently irods irods stands for Integrated Rule-Oriented System» Provides higher-level functionality than dcache (can use dcache as storage)» Makes no use of Grid Security Infrastructure Practically any research group in Sweden can apply for a storage space at SweStore Some groups have dedicated storages within SweStore 2014 Oxana Smirnova, Dept. of Physics 110

Exercises Goals: use arc* command line tools to upload/download files Start: browse SweStore using a regular browser (requires certificate in the browser) https://webdav.swestore.se/nordugrid arcproxy S nordugrid.org:/nordugrid.org/tutorial/role=student arcls gsiftp://gsiftp.swestore.se/nordugrid/tutorial arcls l gsiftp://gsiftp.swestore.se/nordugrid/tutorial arcls --metadata gsiftp://gsiftp.swestore.se/nordugrid/tutorial/arc-logo.png arccp --recursive=999 gsiftp://gsiftp.swestore.se/nordugrid/tutorial/xrsl/ xrsl/ The following should fail:» arcmkdir gsiftp://gsiftp.swestore.se/nordugrid/tutorial/newdir» arcrm gsiftp://gsiftp.swestore.se/nordugrid/tutorial/arc-logo.png 2014 Oxana Smirnova, Dept. of Physics 111

Exercises For advanced students: try to install ARC Graphical Clients from http://sourceforge.net/projects/arc-gui-clients/ Needs cmake, libqt4-dev and nordugrid-arc-dev For those who did not manage to submit jobs last time, use the downloaded.xrsl files to do so: arcsub c arc-iridium.lunarc.lu.se hello_grid.xrsl arcls <jobid> arcstat <jobid> 2014 Oxana Smirnova, Dept. of Physics 112