A comparison of data-access platforms for BaBar and ALICE analysis computing model at the Italian Tier1



17th International Conference on Computing in High Energy and Nuclear Physics (CHEP09), IOP Publishing
Journal of Physics: Conference Series 219 (2010) 072023, doi:10.1088/1742-6596/219/7/072023

A comparison of data-access platforms for BaBar and ALICE analysis computing model at the Italian Tier1

A Fella 1, F Furano 2, L Li Gioi 1, F Noferini 1, M Steinke 4, D Andreotti 3, A Cavalli 1, A Chierici 1, L dell'Agnello 1, D Gregori 1, A Italiano 1, E Luppi 3, B Martelli 1, A Prosperini 1, P Ricci 1, E Ronchieri 1, D Salomoni 1, V Sapunenko 1 and D Vitlacil 1

1 INFN-CNAF, Bologna, Italy
2 Conseil Européen pour la Recherche Nucléaire (CERN), Geneva, Switzerland
3 INFN Ferrara, Ferrara, Italy
4 Ruhr-Universität Bochum, Bochum, Germany

E-mail: armando.fella@cnaf.infn.it, luigi.ligioi@cnaf.infn.it, francesco.noferini@bo.infn.it

Abstract. Performance, reliability and scalability in data access are key issues in the context of Grid computing and High Energy Physics (HEP) data analysis. We present the technical details and the results of a large-scale validation and performance measurement carried out at the CNAF Tier1, the central computing facility of the Italian National Institute for Nuclear Research (INFN). The aim of this work is the evaluation of data access activity during analysis tasks within the BaBar and ALICE computing models against two of the most widely used data handling systems in the HEP scenario: GPFS and Scalla/Xrootd.

1. Introduction
Data handling and file access systems are key topics in a distributed computing model such as the Grid context in which the work presented here was developed. The High Energy Physics (HEP) community, facing the challenges of the new scientific program starting at the Large Hadron Collider (LHC), needs to identify the best choice in terms of reliability, availability and performance for high-throughput data services. The study of data handling systems in real analysis use cases is one of the main works in progress at the time of writing. The test session took place at the INFN-CNAF Italian Tier-1 computing centre and was designed to be representative of a typical HEP experiment in production status. It consists of a comparative evaluation of two data-access solutions in use at Tier-1 and Tier-2 centres, namely the General Parallel File System (GPFS) [1] and the Scalla/Xrootd system [2]. At the time of writing GPFS is an IBM product distributed under license, whereas Scalla/Xrootd is an open source platform free of license fees. Both the BaBar [3] and ALICE [4] computing models are based on the Scalla/Xrootd data access system. In August 2007 BaBar chose to move from Scalla/Xrootd to the GPFS data access solution at CNAF; a comparison of the two data access models was performed in a complete test session in September 2008 [5]. We also cite the first work on data access platform comparison performed at the same centre [6].

© 2010 IOP Publishing Ltd

Figure 1. Schematic representation of the test-bed layout.

The ALICE interest in adopting Scalla/Xrootd for data access at Tier-1 scale led us to develop a test design able to describe in depth the performance of the data access systems. This provides useful information for enabling ALICE at CNAF. We used real physics analyses running on real physics datasets in the BaBar case and on a fully simulated dataset in the ALICE case. Since 2004 the BaBar collaboration has included the INFN-CNAF Tier-1 in its distributed computing model, moving a subset of physics data and the related analysis effort from SLAC to Italy. Moreover, the Italian Tier-1 represents an important node in the ALICE computing model, in particular for the Italian component of the collaboration, which is strongly involved in the physics program of the experiment. Unlike the September 2008 test, in which the measurements were performed in a controlled environment, the work we present shows the results of a test in a pure production scenario. This choice makes it possible to study the behavior of the data access systems through measurements involving farm heterogeneity in terms of worker node architectures and job differentiation.

2. Test-Bed Layout
A sketch of the test-bed is depicted in Fig. 1. In our setup, 8 disk-servers were integrated into a Storage Area Network (SAN) via FC links, while the communication of the disk-servers with the computing nodes was provided via Gigabit LAN. As disk-storage hardware we used two EMC CX3-80 SAN systems, with a total of 160 TB of raw-disk space, assembled by aggregating 1-TB SATA disks.

Each EMC system comprised two storage controllers (with 8 GB of RAM each), connected to a Brocade 48000 Fibre Channel fabric director provided with 8 links, for a total theoretical bandwidth of 32 Gb/s. The 8 Dell PowerEdge M600 disk-servers were equipped with two quad-core 2.33 GHz Xeon processors, 16 GB of RAM, a dual-port QLogic 2432 4 Gb Fibre Channel to PCI Express Host Bus Adapter (HBA) and one Gigabit Ethernet link. As an Operating System (OS) they ran Scientific Linux CERN (SLC) version 4.6, with a 2.6.9-67.0.15.EL.cernsmp kernel operating at 64 bits. The whole farm services and infrastructure were shared with the usual Tier1 activity, including worker node job slots, DNS, LDAP authentication, network and the local batch system managed by Load Sharing Facility (LSF). The worker node job allocation is ruled by an overloading policy that permits 1.2 jobs per core to run. For these tests a specific LSF queue was configured to be able to exploit computational resources equal to the sum of 80% of the ALICE and BaBar shares. In this scenario we could successfully manage synchronous job starts while working within the CNAF production submission environment. The CNAF farm is composed of the following Worker Node (WN) hardware architectures, with the associated occurrences: Woodcrest 2000, Intel Xeon E5420 (70%); Woodcrest 2660, Intel Xeon E5440 (20%); AMD Opteron 252 (10%). The operating system of all WNs was Scientific Linux CERN 4.5 with the i386 kernel 2.6.9-67.0.15.EL.cernsmp. Each node was connected to a Gigabit switch, one switch per rack, and each switch had a 2 x 1 Gb/s trunked up-link to the core network switch.

2.1. Monitoring Tools
In order to monitor the actual throughput of the various components of the system during the tests, both network and disk, we used several products, hence cross-checking in a coherent view all the information available from different sources: MRTG was used to monitor the network throughput as seen by the Gigabit network switches, while the Lemon [7] system was used to monitor the network interfaces of the disk-servers and the throughput to/from the back-end storage systems. Every 10 seconds, read/write throughput information was generated by the Lemon system and stored in its database component. A C++ utility based on ROOT [8] was developed to process this information and generate graphs.

3. Data-Access Platforms
GPFS is a general purpose distributed file-system developed by IBM. It provides file-system services to parallel and serial applications. GPFS allows parallel applications to access the same files concurrently, ensuring global coherence, from any node on which the GPFS file system is locally mounted. GPFS is particularly appropriate in an environment where the aggregate I/O peak exceeds the capability of a single file-system server. Differently from the other solution employed in our tests, from the point of view of the applications the file-system is accessed through a standard POSIX interface, without the need to compile the client code against ad-hoc libraries, as illustrated in the sketch below. In our tests we used the production BaBar GPFS cluster; specifically, we worked against 10 disk-servers (8 data servers and 2 cluster admin nodes), aggregating all the available SAN LUNs in one global file-system. The GPFS version adopted was 3.2.1-4.
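As an illustration of this access mode, the minimal sketch below (not taken from the production setup; the mount point, file path and tree name are hypothetical) shows that an analysis application reads a file hosted on GPFS exactly as it reads any local POSIX file, both through the standard C library and through ROOT, with no Scalla/Xrootd-specific client library involved.

// gpfs_posix_read.C - minimal sketch: a file on a GPFS mount is read through
// the ordinary POSIX interface and through ROOT's local-file access.
// The path /gpfs/babar/data/run1.root and the tree name "events" are hypothetical.
#include <cstdio>
#include "TFile.h"
#include "TTree.h"

void gpfs_posix_read()
{
   // Plain POSIX/C access: GPFS behaves like any locally mounted file system.
   FILE *f = std::fopen("/gpfs/babar/data/run1.root", "rb");
   if (f) {
      char buf[4096];
      std::fread(buf, 1, sizeof(buf), f);   // ordinary buffered read
      std::fclose(f);
   }

   // ROOT access: the same path is opened as a local file, no root:// URL is needed.
   TFile *rf = TFile::Open("/gpfs/babar/data/run1.root");
   if (rf && !rf->IsZombie()) {
      TTree *t = static_cast<TTree*>(rf->Get("events"));   // hypothetical tree name
      if (t) std::printf("entries: %lld\n", t->GetEntries());
      rf->Close();
   }
}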
The Scalla/Xrootd platform is a pure disk data handling system developed as a collaboration between SLAC and INFN, with other contributors from the HEP software community; in particular, the last two releases were developed by the ALICE experiment. Scalla/Xrootd is designed to provide fault-tolerant location of and access to files distributed throughout cluster and WAN environments by employing peer-to-peer-like mechanisms. The architecture provided for this test was based on the latest Scalla/Xrootd production version, 2926.1632, and notably included two admin nodes (also known as redirectors) and one Scalla/Xrootd server on each of the 8 disk-servers. The Scalla/Xrootd disk back-end was composed of XFS-formatted SAN LUNs. A new BaBar software release was deployed ad hoc to permit a well-founded and coherent comparison between the two experiments' test results.

The client-side software for both experiments included the use of ROOT 5.21/06. A single test run was performed using the typical BaBar deployment based on ROOT 5.14/00e to obtain client-side comparison information. The hybrid solution Scalla/Xrootd with a GPFS back-end (Scalla/Xrootd over GPFS in the following) was tested too. In this case all data files were stored in one GPFS file system mounted on all the Scalla/Xrootd-managed disk servers.

3.1. Scalla/Xrootd setup
The data used in the ALICE analysis were imported at CNAF via the Scalla/Xrootd protocol, in order to register them in the AliEn catalogue [9], and were subsequently replicated on the GPFS file system. AliEn is a Grid framework developed and adopted by the ALICE collaboration that provides an interface to all Grid services, including data-access authentication through the Scalla/Xrootd protocol. The ALICE data handling model includes a central Scalla/Xrootd server, named global redirector, working as an upper-level file location broker. The Scalla/Xrootd package used in the BaBar runs is the same as the ALICE one, apart from the configuration, which was modified to match the BaBar computing model as closely as possible. Table 1 reports the detailed configuration parameters for both experiments. The ALICE parameters have been assigned standard values obtained from previous optimizations (a sketch of how such client-side parameters can be set from ROOT is given below, after the description of the analysis workload).

Table 1. ALICE and BaBar Scalla/Xrootd setup.
Experiment   Monitor system   Outbound service    Read-ahead [KB]   Cache [MB]
ALICE        MonALISA         Global redirector   5                 1
BaBar        No               No                  0                 0

4. Data Analysis
We performed real physics data analyses to evaluate the data access behavior in a real production scenario, using the analysis frameworks of the ALICE and BaBar experiments [11, 3]. The test consisted of running several bunches of analysis jobs, each one accessing different input files. The ALICE analysis was performed on a set of about 55k PbPb events from the 2007 ALICE Monte Carlo production, corresponding to 2.6 TB of disk space. These events were produced by detector response simulation in order to provide a data structure as close as possible to the future real data. The whole data set was analyzed by 1000 parallel jobs, so the data were split into 1000 homogeneous collections of files. The analysis was also repeated with a smaller number of jobs (700). The analysis performed on the data was a typical two-particle correlation study. The software used in the analysis is AliRoot [10], a ROOT extension developed by the ALICE collaboration. The job authentication phase is managed by AliEn during the data access process via the Scalla/Xrootd protocol. The BaBar jobs run on files containing data collected by the BaBar detector. Each job analyzes 17k physics events, searching for the occurrence of a D*+ → [K_S π− π+]_D0 π+ decay in each event, with a procedure consisting of the initial event reading followed by the execution of the decay selection algorithm. In detail, the process parses only the portions of the file related to the event information the specific analysis is interested in. We started a run consisting of 1000 synchronized jobs analyzing 1.2 TB of data. Data files managed by the Scalla/Xrootd system were spread uniformly over all the disk partitions served by the related disk-servers.
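As an illustration of the client-side parameters of Table 1, the sketch below shows how read-ahead and cache sizes can be set from a ROOT 5.x session before opening a file through a Scalla/Xrootd redirector. The XNet.* names are the configuration keys exposed by ROOT's XrdClient of that era; the redirector host name, file path and numeric values are illustrative assumptions and do not reproduce the exact production settings.

// xrootd_client_setup.C - sketch of client-side Scalla/Xrootd tuning from ROOT.
// Host name, file path and numeric values are illustrative, not the production ones.
#include "TEnv.h"
#include "TFile.h"

void xrootd_client_setup(bool babarLike = true)
{
   if (babarLike) {
      // BaBar-like setup: read-ahead disabled, no client-side cache (cf. Table 1).
      gEnv->SetValue("XNet.ReadAheadSize", 0);               // bytes
      gEnv->SetValue("XNet.ReadCacheSize", 0);               // bytes
   } else {
      // ALICE-like setup: prefetch enabled with optimized values (illustrative numbers).
      gEnv->SetValue("XNet.ReadAheadSize", 512 * 1024);      // bytes
      gEnv->SetValue("XNet.ReadCacheSize", 100 * 1024 * 1024);
   }

   // The file is located through the redirector; the data are then served by
   // one of the eight Scalla/Xrootd disk-servers.
   TFile *f = TFile::Open("root://redirector.example.cnaf.infn.it//store/data/run1.root");
   if (f && !f->IsZombie()) f->Close();
}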

4.1. Execution Time
One of the most useful quantities for comparing the different data access solutions is the job duration. Fig. 2 and Fig. 3 show the execution time distributions for 700 and 1000 ALICE parallel jobs. Since the 700-job run reads 87% of the amount of data of the 1000-job run, the job duration differs between the two runs: run 700 is 20% longer than run 1000. Since we ran the tests in a real production scenario, no difference in performance between Scalla/Xrootd and GPFS could be resolved. We conclude that the behavior of the two data access solutions was the same during the two ALICE runs. The run with Scalla/Xrootd over GPFS, instead, appears to suffer from a kind of data-access-system overlapping effect, which causes the rise of a two-peak shape.

Figure 2. Execution time distribution for 700 ALICE parallel jobs.
Figure 3. Execution time distribution for 1000 ALICE parallel jobs.

Fig. 4 shows the execution time distribution for 1000 BaBar parallel jobs. The slightly better performance of GPFS is compatible with the one observed during the September 2008 test session. The two-peak shape in the Scalla/Xrootd case is due to an unlucky distribution of the jobs over the different worker node architectures (a sketch of how such a distribution can be decomposed by architecture is given at the end of this subsection). No data-access-system overlapping effect is present in the case of Scalla/Xrootd over GPFS. Fig. 5 shows the same distributions for the BaBar runs with two different versions of the Scalla/Xrootd client: ROOT 5.21/06 and ROOT 5.14/00e. We saw no difference in the behavior of the two Scalla/Xrootd clients except for the different distribution over the worker node architectures.

Figure 4. Execution time distribution for 1000 BaBar parallel jobs.
Figure 5. Execution time distribution for 1000 BaBar parallel jobs with two different Scalla/Xrootd client versions: ROOT 5.21/06 and ROOT 5.14/00e.
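A minimal sketch of how such execution-time distributions can be built, and decomposed by worker node architecture to expose effects like the two-peak shape, is given below. The job-record structure, histogram binning and architecture labels are illustrative assumptions, not the actual code used to produce Figs. 2-5.

// exec_time_dist.C - sketch: execution-time distributions per data-access
// platform, decomposed by worker-node architecture.
#include <map>
#include <string>
#include <vector>
#include "TH1F.h"
#include "THStack.h"
#include "TCanvas.h"

struct JobRecord {
   std::string platform;   // "gpfs", "xrootd" or "xrootd_on_gpfs"
   std::string arch;       // hypothetical labels, e.g. "WOOD2660", "WOOD2000", "AMD"
   double      wallTime;   // seconds, from the batch-system accounting
};

void exec_time_dist(const std::vector<JobRecord> &jobs)
{
   // One histogram per (platform, architecture) pair, 0-3000 s as in Figs. 2-3.
   std::map<std::string, TH1F*> h;
   for (const auto &j : jobs) {
      const std::string key = j.platform + "|" + j.arch;
      if (!h.count(key))
         h[key] = new TH1F(key.c_str(),
                           (key + ";Execution time [s];Number of jobs").c_str(),
                           60, 0., 3000.);
      h[key]->Fill(j.wallTime);
   }

   // Stacking the per-architecture components of one platform shows how an
   // uneven spread of jobs over slow and fast architectures produces two peaks.
   THStack *stack = new THStack("xrootd_by_arch",
                                "Scalla/Xrootd;Execution time [s];Number of jobs");
   for (auto &kv : h)
      if (kv.first.rfind("xrootd|", 0) == 0) stack->Add(kv.second);
   TCanvas *c = new TCanvas("c", "execution time");
   stack->Draw("hist");
   c->SaveAs("exec_time_xrootd.png");
}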

4.2. Worker nodes hardware architecture
In order to better understand the effect of the production environment, a detailed study has been performed separately for the different worker node architecture types. In Figs. 6-8 and Figs. 9-11 the fraction of completed jobs as a function of the execution time is shown for the ALICE and BaBar runs respectively.

Figure 6. Fastest WN architecture (ALICE, 700 jobs).
Figure 7. Main WN architecture (ALICE, 700 jobs).
Figure 8. Slowest WN architecture (ALICE, 700 jobs).

During the ALICE runs, the three tested solutions perform in a very similar way for all the hardware architectures. The Scalla/Xrootd system seems to take advantage of the most recent hardware architecture.

Figure 9. Fastest WN architecture (BaBar, 1000 jobs).
Figure 10. Main WN architecture (BaBar, 1000 jobs).
Figure 11. Slowest WN architecture (BaBar, 1000 jobs).

The difference between the ALICE and BaBar computing models, and in particular the client-side prefetch parameters (Tab. 1), impacts the Scalla/Xrootd performance: in fact, in the BaBar case, as shown in Figs. 9-11, GPFS performs better than Scalla/Xrootd for all the architectures. These results are compatible with the ones shown in Fig. 12 and Fig. 13, where the same issue is treated considering the wall-clock time. In these plots we consider only the cases in which the same job, accessing the same data collections, runs on the same worker node architecture (a minimal sketch of this pairing is given at the end of this subsection). Fig. 13 shows a performance worsening of Scalla/Xrootd over GPFS with respect to the pure Scalla/Xrootd run. The corresponding BaBar case, shown in Fig. 14 and Fig. 15, confirms that GPFS outperforms Scalla/Xrootd. The Scalla/Xrootd run suffers from a suspected production-environment interaction (Fig. 10). In Fig. 15 the Scalla/Xrootd over GPFS case is 10% slower than the GPFS one. The results just discussed are confirmed by the CPU-time analysis presented in Fig. 16 and Fig. 17 for the ALICE runs. Focusing on CPU time reveals the Scalla/Xrootd versus GPFS client-side comparison: the Scalla/Xrootd client side appears to be more CPU-time consuming than the GPFS one, while the hybrid Scalla/Xrootd over GPFS case is clearly more CPU-time consuming than the other two.
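The pairing used for Figs. 12-17 can be sketched as follows: the same logical job (i.e. the same input data collection) is compared across two runs only when it landed on the same worker node architecture, and the per-job time difference is histogrammed per architecture. The data structures and binning below are illustrative assumptions, not the actual analysis code.

// walltime_diff.C - sketch: per-job wall-time differences between two runs,
// matched by input data collection and by worker-node architecture.
#include <cstdio>
#include <map>
#include <string>
#include "TH1F.h"

struct JobInfo {
   std::string arch;      // worker-node architecture the job ran on
   double      wallTime;  // seconds
};

// Key: identifier of the input data collection analysed by the job.
using RunMap = std::map<std::string, JobInfo>;

void walltime_diff(const RunMap &xrootdRun, const RunMap &gpfsRun)
{
   std::map<std::string, TH1F*> perArch;          // one histogram per architecture
   TH1F *hAll = new TH1F("all_archs", "all archs;t_{xrootd} - t_{gpfs} [s];Jobs",
                         100, -1000., 1000.);

   for (const auto &kv : xrootdRun) {
      auto it = gpfsRun.find(kv.first);
      if (it == gpfsRun.end()) continue;                  // job missing in one run
      if (kv.second.arch != it->second.arch) continue;    // different architecture: skip
      const double diff = kv.second.wallTime - it->second.wallTime;
      hAll->Fill(diff);
      if (!perArch.count(kv.second.arch))
         perArch[kv.second.arch] =
            new TH1F(kv.second.arch.c_str(),
                     (kv.second.arch + ";t_{xrootd} - t_{gpfs} [s];Jobs").c_str(),
                     100, -1000., 1000.);
      perArch[kv.second.arch]->Fill(diff);
   }
   std::printf("matched jobs: %d, mean difference: %.1f s\n",
               int(hAll->GetEntries()), hAll->GetMean());
}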

Figure 12. The job wall-time duration difference distribution, per worker node architecture, between the GPFS and Scalla/Xrootd runs of 1000 ALICE jobs.
Figure 13. The job wall-time duration difference distribution, per worker node architecture, between the Scalla/Xrootd over GPFS and Scalla/Xrootd runs of 1000 ALICE jobs.
Figure 14. The job wall-time duration difference distribution, per worker node architecture, between the GPFS and Scalla/Xrootd runs of 1000 BaBar jobs.
Figure 15. The job wall-time duration difference distribution, per worker node architecture, between the Scalla/Xrootd over GPFS and GPFS runs of 1000 BaBar jobs.

4.3. Throughput
The second useful quantity for comparing the different data access solutions is the aggregated throughput recorded at the Ethernet cards of the disk-servers (a sketch of how the integrated and plateau throughput can be derived from the monitoring samples is given below). Since the Scalla/Xrootd client was configured differently for the two experiments, we expect different results between ALICE and BaBar.

4.3.1. ALICE results
The ALICE aggregated throughput as a function of time is shown in Fig. 18 for the 1000-job run and in Fig. 19 for the 700-job run. As shown in Fig. 22, the integral over time of the aggregated throughput is the same for all data access solutions in the 700-job run. In the case of 1000 parallel jobs it is a little bit smaller for the Scalla/Xrootd over GPFS run.
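A simple sketch of how the two figures of merit used in this section, the integral over time and the plateau value of the aggregated throughput, can be extracted from the 10-second monitoring samples is given below. The plateau estimator and the sample values are illustrative simplifications, not the actual procedure behind Fig. 22.

// throughput_summary.cxx - sketch: integral and plateau of the aggregated
// throughput from periodic monitoring samples (10-second spacing for Lemon).
#include <algorithm>
#include <cstdio>
#include <vector>

struct ThroughputSummary {
   double integralMB;    // total MB transferred during the run
   double plateauMBps;   // typical rate in the stationary region
};

// samples: aggregated throughput in MB/s, one entry every dt seconds.
ThroughputSummary summarize(const std::vector<double> &samples, double dt = 10.0)
{
   ThroughputSummary s{0.0, 0.0};
   for (double r : samples) s.integralMB += r * dt;      // rectangle-rule integral

   // Simplified plateau estimate: median of the samples above 10% of the peak,
   // which removes the ramp-up and the tail of the run.
   const double peak = samples.empty() ? 0.0
                     : *std::max_element(samples.begin(), samples.end());
   std::vector<double> stable;
   for (double r : samples)
      if (peak > 0.0 && r > 0.1 * peak) stable.push_back(r);
   if (!stable.empty()) {
      std::sort(stable.begin(), stable.end());
      s.plateauMBps = stable[stable.size() / 2];
   }
   return s;
}

int main()
{
   // Hypothetical ramp-up / plateau / tail profile.
   const std::vector<double> samples = {20, 150, 310, 320, 315, 318, 322, 316, 200, 40};
   const ThroughputSummary s = summarize(samples);
   std::printf("integral: %.0f MB, plateau: %.0f MB/s\n", s.integralMB, s.plateauMBps);
   return 0;
}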

Figure 16. The job CPU-time duration difference distribution, per worker node architecture, between the GPFS and Scalla/Xrootd runs of 1000 ALICE jobs.
Figure 17. The job CPU-time duration difference distribution, per worker node architecture, between the Scalla/Xrootd over GPFS and Scalla/Xrootd runs of 1000 ALICE jobs.

The smaller integral of the Scalla/Xrootd over GPFS run is affected by a larger number of job failures in this run: 6%, compared with the null job-failure rate of the other runs. The different job-failure rate is also one of the causes of the lower plateau throughput of the Scalla/Xrootd over GPFS run. The other reason is the higher fraction of jobs running on the slowest worker node architecture (20%).

Figure 18. Aggregated throughput as a function of time in the case of 1000 ALICE parallel analysis jobs.
Figure 19. Aggregated throughput as a function of time in the case of 700 ALICE parallel analysis jobs.

4.3.2. BaBar results
The plateau and the integral values over time of the aggregated throughput using Scalla/Xrootd, as shown in Fig. 22, appear to be less than 50% of the GPFS ones, while performing exactly the same jobs in the two scenarios. This behavior is due to the distinct prefetch configurations of the two systems. It also explains the different shape of the aggregated throughput as a function of time shown in Fig. 20. At the beginning of the run an initial peak is visible in both cases, but with different magnitude. The Scalla/Xrootd peak is due to the jobs concurrently accessing the detector conditions database at start time. This effect appears also in the GPFS case, but here a large initial read-ahead is also present. After the peak the graph enters a stationary region. The different tails in the last region of the throughput graph can be explained by the different distribution of the jobs within the worker node architectures.

Figure 20. Aggregated throughput as a function of time in the case of 1000 BaBar parallel analysis jobs.
Figure 21. Aggregated throughput as a function of time in the case of 1000 BaBar parallel analysis jobs with two Scalla/Xrootd client versions: ROOT 5.21/06 and ROOT 5.14/00e.

This also reflects the different job duration distribution shapes. During the Scalla/Xrootd over GPFS run the Scalla/Xrootd prefetch policy drives the throughput, so the shape is very close to the Scalla/Xrootd one. Fig. 21 shows the aggregated throughput as a function of time in the case of 1000 BaBar parallel analysis jobs using two Scalla/Xrootd client versions: ROOT 5.21/06 and ROOT 5.14/00e. We saw no difference in the behavior of the two Scalla/Xrootd clients except for the different distribution within the worker node architectures.

4.3.3. Throughput summary
The summary of the integrated throughput and plateau values for all runs is shown in Fig. 22. All values are normalized to the Scalla/Xrootd runs. This plot clearly shows the effect of the different Scalla/Xrootd prefetch algorithms between ALICE and BaBar. Setting the read-ahead value to zero, BaBar uses about one half of the bandwidth with respect to GPFS. The throughput curve of the Scalla/Xrootd over GPFS run turns out to be very close to the Scalla/Xrootd one (the measurements are based on data recorded on the disk-server side). We saw no difference in the throughput behavior due to the updated Scalla/Xrootd client version except for the different distributions within the worker node architectures. Otherwise, with the default prefetch settings, the ALICE bandwidth usage is quite the same in all cases except for the Scalla/Xrootd over GPFS run with 1000 ALICE jobs, which is affected by a 6% failure rate.

5. Conclusions
We performed a comparison between two of the most used data access systems in HEP, GPFS and Scalla/Xrootd, addressing relevant questions such as performance and reliability. We used a large-scale test-bed, composed of two EMC CX3-80 storage systems providing 160 TB of raw-disk space via a SAN, and 12 disk-servers with Gigabit Ethernet connectivity. We tested three data-access configurations (GPFS, Scalla/Xrootd, and Scalla/Xrootd over GPFS) with HEP data analysis workloads, giving in this way a useful view of the actual capabilities of the systems in a real production scenario. The test was performed in a production environment, so we faced the difficulties proper to this scenario: heterogeneous worker node architectures, production and test jobs running concurrently on the same client nodes, and shared base services (e.g. network and authentication). However, this made it possible to collect information at a working scale representative of Tier-1 and Tier-2 centres. Both platforms appeared to be valid data access systems for the HEP environment, especially looking at the results obtained with the new-generation worker node hardware architectures.

Figure 22. Throughput summary: integrated and plateau throughput for GPFS, Scalla/Xrootd over GPFS and Scalla/Xrootd with ROOT v5.14, shown as ratios with respect to the pure Scalla/Xrootd run, for the 700 ALICE jobs, 1000 ALICE jobs and 1000 BaBar jobs runs.

The behavior of the two data access platforms concerning the job duration was the same during the two ALICE runs. The small observed differences were due to the production scenario characteristics. The Scalla/Xrootd over GPFS run, on the other hand, appears to suffer from a kind of data-access-system overlapping effect. During the BaBar runs we observed a slightly better performance running on GPFS, which is compatible with the results we obtained during the September 2008 test session, and we saw no overlapping effects in the Scalla/Xrootd over GPFS run. The different Scalla/Xrootd prefetch settings of ALICE and BaBar result in a different bandwidth usage with respect to GPFS: setting the read-ahead value to zero, BaBar uses about one half of the bandwidth. The ALICE bandwidth usage, instead, is quite the same in all cases. We found that Scalla/Xrootd is more CPU consuming than GPFS on both the client and the server side. The latter presents I/O wait values about 2% higher than the almost-null GPFS ones. The BaBar computing model performance was not affected by the use of different client-side ROOT versions: 5.14/00e and 5.21/06 reported the same behavior during both the job duration and the throughput measurements.

References
[1] General Parallel File System Documentation [Online]. Available: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfsbooks.html
[2] A. Dorigo, P. Elmer, F. Furano and A. Hanushevsky, WSEAS Trans. Comput., Apr. 2005.
[3] Cosmo G, Nuclear Physics B - Proceedings Supplements, Volume 78, Number 1, August 1999, pp. 732-737.
[4] P. Saiz, L. Aphecetche, P. Buncic, R. Piskac, J. E. Revsbech and V. Sego [ALICE Collaboration], Nucl. Instrum. Meth. A 502, 437 (2003).
[5] A. Fella, L. Li Gioi, D. Andreotti, E. Luppi and V. Sapunenko, 2008 IEEE Nuclear Science Symposium Conference Record, N29-8, 1984 (2008).
[6] M. Bencivenni et al., IEEE Transactions on Nuclear Science (June 2008), Volume 55, Issue 3, Part 3, pp. 1621-1630.
[7] G. Cancio et al., in Proc. Computing in High Energy Physics (CHEP04), Interlaken, Switzerland, 2004, CD-ROM.
[8] ROOT [Online]. Available: http://root.cern.ch
[9] P. Saiz et al., Nucl. Instrum. Meth. A 502 (2003) 437-440.
[10] ALICE ROOT [Online]. Available: http://aliceinfo.cern.ch/offline/aliroot/manual.html
[11] The ALICE Offline Bible [Online]. Available: http://aliceinfo.cern.ch/export/sites/aliceportal/offline/galleries/download/offlinedownload/offlinebible.pdf