ATLAS DQ2 to Rucio renaming infrastructure

C. Serfon (1), M. Barisits (1,2), T. Beermann (1), V. Garonne (1), L. Goossens (1), M. Lassnig (1), A. Molfetas (1,3), A. Nairz (1), G. Stewart (1), R. Vigne (1), on behalf of the ATLAS Collaboration

(1) CERN PH-ADP-CO/DDM, 1211 Geneva, Switzerland
(2) University of Innsbruck, Innsbruck, Austria
(3) University of Melbourne, Melbourne, Australia

E-mail: cedric.serfon@cern.ch

ATL-SOFT-PROC-2013-016, 09 October 2013

Abstract. To prepare the migration to the new ATLAS Data Management system, called Rucio, a renaming campaign of all the physical files produced by ATLAS is needed. This represents around 300 million files split between 120 sites with 6 different storage technologies. It must be done transparently, so as not to disrupt ongoing computing activities. An infrastructure to perform this renaming has been developed and is presented in this paper, together with its performance.

1. Introduction
Rucio [1] is the evolution of the ATLAS Data Management system and will replace the current system, called DQ2 [2]. The new system brings many improvements, but it breaks compatibility with DQ2. One of the biggest changes is that no external replica catalog such as the LCG File Catalog (LFC) is used in Rucio, whereas the LFC was one of the central components of DQ2. In Rucio, the physical path of a file is extracted from the Logical File Name via a deterministic function. Therefore, all file replicas produced by ATLAS have to be renamed before migrating to Rucio. This amounts to around 300 million files split between 120 sites with 6 different storage technologies, and an infrastructure to perform the renaming is needed. After briefly presenting the new naming convention, we describe the constraints on this massive renaming and the infrastructure that has been implemented, as well as the technologies used. Finally, we show the performance at the sites where the files have already been renamed.

2. The Rucio naming convention
The old naming convention (also known as the DQ2 naming convention) for the Physical File Name (PFN) was based on the filename and the dataset name. Since a file can belong to many distinct datasets, it can have many different PFNs, and thus there is no deterministic way to go from a filename to a PFN. Moreover, the convention changed more than 3 times during the last 6 years. That is why an external file catalog such as the LFC is mandatory for DQ2. The new naming convention (also known as the Rucio naming convention) is based on a deterministic function that transforms every Logical File Name (LFN) into a PFN. An LFN in Rucio is composed of 2 parts: a scope and a name. The scope partitions the namespace for different users. It consists of a string of at most 30 characters (e.g.

data12_8TeV, user.jdoe, group.phys-susy...). The name is a string of at most 255 characters (e.g. myanalysis.xyz.root.1, AOD.491965._0042.pool.root.1) and is unique within a scope. An LFN is the combination of a scope and a name, for instance user.jdoe:myanalysis.xyz.root.1 or data11_7TeV:AOD.491965._0042.pool.root.1. The PFN can be easily constructed from the LFN with the following algorithm:
- The first part is constructed from the scope, where the dots delimit subdirectories. In our examples: user.jdoe -> user/jdoe, group.phys-susy -> group/phys-susy, and data12_8TeV -> data12_8TeV.
- The second part is built from the first 2 bytes (4 hexadecimal characters) of the md5 sum of the LFN, which form a 2-level directory structure: md5(user.jdoe:myanalysis.xyz.root.1) = 68a2ea3ea370b43f437d62955fba1435 -> 68/a2. A cryptographic hash such as md5 is chosen so that the files are well balanced across the directories.
- The last part is the filename.
Therefore, the LFN user.jdoe:myanalysis.xyz.root.1 is translated into the PFN user/jdoe/68/a2/myanalysis.xyz.root.1. The PFN is prefixed by the protocol and a site-specific part to give a Uniform Resource Identifier (URI). With this convention, there is no need for an external file catalog: the information in the internal Rucio catalog and the deterministic function are enough to build the URI. The drawback is that all physical files registered in DQ2 (more than 300 million replicas) have to be renamed according to this new convention. To perform this work, a renaming infrastructure is crucial.
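The mapping can be illustrated with a short Python sketch. This is a minimal illustration of the convention described above, not the actual Rucio code; the function name lfn_to_pfn is ours.

    import hashlib

    def lfn_to_pfn(scope: str, name: str) -> str:
        """Build the site-relative PFN for an LFN (scope:name) following the
        Rucio convention: the scope dots become subdirectories, the first two
        bytes of md5("scope:name") give a two-level hash directory, and the
        filename is appended."""
        # "user.jdoe" -> "user/jdoe"; a scope without dots maps to itself.
        scope_path = scope.replace('.', '/')
        # First two bytes (four hex characters) of the md5 of the full LFN.
        digest = hashlib.md5(f'{scope}:{name}'.encode()).hexdigest()
        return f'{scope_path}/{digest[0:2]}/{digest[2:4]}/{name}'

    # The example from the text: prints user/jdoe/68/a2/myanalysis.xyz.root.1
    print(lfn_to_pfn('user.jdoe', 'myanalysis.xyz.root.1'))

Prefixing the result with the protocol and the site-specific part (e.g. a hypothetical https://se.example.org/atlas/) yields the full URI.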
3. Renaming infrastructure

3.1. Constraints on the renaming
The renaming infrastructure must satisfy several constraints. First of all, it must be fully automated and require as little work as possible from the sites. Secondly, the renaming must be a transparent process that does not interfere with ongoing computing activities such as Monte Carlo simulation and user analysis. Moreover, during the transition phase to Rucio the old DQ2 system must remain usable, so the renaming must include both the physical renaming of the files and the renaming in the LFC. The infrastructure must be able to run on most of the storage technologies used by the sites supporting ATLAS. It must be fault tolerant: in case of a major failure at a site, it must be able to stop and later recover from where it stopped. Finally, it must be fast enough to rename 300 million files in one year, which means a renaming rate of about 10 Hz sustained over a year.

3.2. Description of the renaming infrastructure
The renaming infrastructure that has been developed is based on gearman [3]. It consists of different workers dedicated to specific tasks (DQ2 lookup, LFC lookup...). Each worker is an independent process that gets its payload from a gearman server, which in turn is fed by a main process, called the supervisor, that controls the workflow (see Figure 1). Thanks to this approach, the load can be spread across multiple machines if needed. The workflow consists of 6 steps:

1. The supervisor queries the DQ2 catalog that stores the dataset replicas to get the list of datasets at a given site. This is a single query and does not need to be parallelized between multiple workers.
2. The supervisor sends the list of datasets to the DQ2 lookup workers, which query the DQ2 catalog that stores the dataset content. One query is performed per dataset, so this step benefits from multiple workers. The result is a list of logical files identified by a GUID (Globally Unique Identifier).
3. The list of GUIDs is then used as input by the LFC lookup agents, which identify the replicas (URIs) associated with each GUID at the target site.
4. The list of replicas extracted by the LFC lookup is used as input by the LFC addreplica agent, which, for each replica that follows the old naming convention, registers in the LFC a new replica with the new convention. Replicas that already follow the new convention are skipped.
5. The list of files with the old and new conventions is used as input by the Storage rename agent, which interacts with the Storage Element. It renames the files on the Storage Element from the old convention to the new one using the WebDAV [4] protocol.
6. The last part of the renaming is to delete the old replicas from the LFC. This is done by the Cleaner agents.

Figure 1. Workflow of the renaming infrastructure. The blue part represents the catalogs, the red part the different workers, and the green part the supervisor. The numbers indicate the different steps of the renaming.

By proceeding this way, the files physically present on the Storage Element are always available in the catalogs. Moreover, the workers get their jobs (a job consists of a task, such as an LFC lookup or a storage renaming, applied to several inputs) from the gearman server, and if some jobs fail, the server ensures that they are reprocessed. In case of a massive failure (for instance if the storage system of a site goes down), the renaming can be stopped at any step; when it restarts, the infrastructure is designed to resume from where it broke, which ensures fault tolerance and robustness. The renaming process is fully automated: once a supervisor is started, it processes all the files at the site.
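As an illustration of this supervisor/worker pattern, here is a minimal sketch using the python-gearman client library. The task name, the JSON payload format and the server address are our own assumptions for the example, not the production setup.

    import json
    import gearman

    GEARMAN_SERVERS = ['gearman.example.org:4730']  # hypothetical server

    # Worker side: an independent process that registers one task with the
    # gearman server; several such processes can run on several machines.
    def lfc_lookup(worker, job):
        guids = json.loads(job.data)
        # ... query the LFC for the replicas of each GUID (omitted) ...
        replicas = {guid: [] for guid in guids}  # placeholder result
        return json.dumps(replicas)

    def run_worker():
        worker = gearman.GearmanWorker(GEARMAN_SERVERS)
        worker.register_task('lfc_lookup', lfc_lookup)
        worker.work()  # blocks, processing jobs as the server hands them out

    # Supervisor side: feeds the server with jobs and collects the results;
    # a failed job can simply be submitted again, which is how retries work.
    def run_supervisor(guids):
        client = gearman.GearmanClient(GEARMAN_SERVERS)
        request = client.submit_job('lfc_lookup', json.dumps(guids))
        return json.loads(request.result)

Because the server, the supervisor and the workers are decoupled, adding renaming throughput is simply a matter of starting more worker processes.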

Figure 2. Total renaming rate (Hz) at a site, over 250k files. It represents the frequency of the joint physical (storage renaming) and logical (LFC renaming) operations. The number of Storage rename workers used for this site is 20.

For the physical renaming, the protocol used is WebDAV. It is supported by DPM [5], dCache [6], StoRM [7] and EOS [8], which covers more than 90% of the ATLAS sites; sites whose storage technologies do not support WebDAV will be asked to rename their files using a local protocol. Each physical renaming requires at most 7 interactions: 2 to test whether the destination and source files are available, 2 to test the existence of the directories that will host the destination file, 2 to create them if they do not exist, and 1 for the actual renaming. This sequence is sketched below.
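A minimal sketch of this per-file sequence, using Python's requests library with standard WebDAV verbs (RFC 4918 [4]). The endpoint and paths are hypothetical, and a production deployment would also need X.509 certificate handling.

    import requests

    BASE = 'https://se.example.org/atlas'  # hypothetical storage endpoint

    def rename(session, old_path, new_path):
        """Rename one replica from the old to the new convention
        (at most 7 WebDAV interactions, as described above)."""
        src, dst = f'{BASE}/{old_path}', f'{BASE}/{new_path}'
        # 1-2: the source must exist and the destination must not.
        if session.head(src).status_code != 200:
            return False
        if session.head(dst).status_code == 200:
            return False
        # 3-6: check the two destination directory levels, create if missing.
        for depth in (2, 1):
            directory = dst.rsplit('/', depth)[0]
            if session.request('PROPFIND', directory,
                               headers={'Depth': '0'}).status_code == 404:
                session.request('MKCOL', directory)
        # 7: the actual renaming, a WebDAV MOVE with a Destination header.
        status = session.request('MOVE', src,
                                 headers={'Destination': dst}).status_code
        return status in (201, 204)

    ok = rename(requests.Session(),
                'olddir/mydataset/myanalysis.xyz.root.1',
                'user/jdoe/68/a2/myanalysis.xyz.root.1')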

4. Performance
The performance of the system is monitored with Graphite [9], a tool that stores time series and visualises them in real time. The slowest part of the renaming is the physical renaming, since renaming a single file with WebDAV takes from O(100) ms to a few seconds. This mainly depends on the Round Trip Time, which can reach up to 250 ms for some sites in North America or the Asia-Pacific zone. However, many storage renaming processes can run simultaneously, allowing renaming rates between 10 and 40 Hz (joint physical and logical operations) to be reached, as shown in Figure 2. The physical renaming rate increases linearly with the number of Storage rename workers up to 20-40 parallel processes; for higher values the renaming speed scales sublinearly. The number of Storage rename workers is therefore chosen to stay in the linear domain (between 20 and 40, depending on the site). For the other tasks, the number of workers has been tuned to 20 workers for the DQ2 lookup, 4 for the LFC lookup, 4 for the LFC rename and 4 for the LFC cleanup.

We first focused on renaming files at some big sites (Tier-1s and big Tier-2s). As of the beginning of September 2013, more than 50% of the files (around 200 million) already follow the new convention, as shown in Figure 3. The total number of files renamed per day can easily reach 1.5 million using only one machine (a dedicated node with an Intel Xeon L5520 CPU at 2.27 GHz and 2 GB of RAM) to run the whole infrastructure, and by design the system can scale horizontally. What currently limits the renaming rate is the deployment of WebDAV at the sites: not all sites have enabled it yet.

Figure 3. Percentage of files following the old (red) and the new (blue) convention, extracted from the central LFC. The absolute number of files fluctuates with time; at the beginning of September 2013 it was 400 million files.

5. Conclusion
The renaming infrastructure developed to prepare the transition to Rucio has demonstrated that it satisfies all the requirements: it is automated, robust, transparent for the users, fault tolerant, storage technology agnostic, and requires no work from the sites other than configuring WebDAV. At the time of writing, it has already successfully renamed hundreds of millions of files without disrupting ongoing computing activities, and if all sites provide WebDAV access, it should be possible to finish the renaming campaign in time for the Rucio deployment.

References
[1] V. Garonne et al., Rucio - The next generation of large scale distributed system for ATLAS Data Management, these proceedings.
[2] V. Garonne et al., The ATLAS Distributed Data Management project: Past and Future, Journal of Physics: Conference Series 396 (032045), IOP 2012.
[3] Gearman: http://gearman.org/
[4] The IETF Trust, HTTP Extensions for Web Distributed Authoring and Versioning (WebDAV), RFC 4918, 2007.
[5] A. Alvarez et al., DPM: Future Proof Storage, Journal of Physics: Conference Series 396 (032015), IOP 2012.
[6] M. Ernst et al., dCache, a distributed data storage caching system, Proceedings of Computing in High Energy Physics, 2001.
[7] A. Carbone et al., Performance studies of the StoRM Storage Resource Manager, Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, 10-13 Dec. 2007.
[8] A. J. Peters and L. Janyst, Exabyte Scale Storage at CERN, Journal of Physics: Conference Series 331 (052015), IOP 2011.
[9] Graphite: http://graphite.wikidot.com/