
Bhimji, W., Bland, J., Clark, P. J., Mouzeli, E. G., Skipsey, S., and Walker, C. J. (2011) Tuning grid storage resources for LHC data analysis. In: International Conference on Computing in High Energy and Nuclear Physics (CHEP 2010), 18-22 Oct 2010, Taipei, Taiwan.

Copyright: The Authors. Published under licence by IOP Publishing Ltd.

Deposited on 17 July 2014 in Enlighten, the University of Glasgow research publications repository: http://eprints.gla.ac.uk/95116/

Tuning grid storage resources for LHC data analysis

W. Bhimji (1), J. Bland (2), P. J. Clark (1), E. G. Mouzeli (1), S. Skipsey (3) and C. J. Walker (4)

(1) University of Edinburgh, School of Physics & Astronomy, James Clerk Maxwell Building, The Kings Buildings, Mayfield Road, Edinburgh EH9 3JZ, UK
(2) University of Liverpool, Oliver Lodge Laboratory, P.O. Box 147, Oxford Street, Liverpool L69 3BX, UK
(3) University of Glasgow, Department of Physics and Astronomy, Glasgow G12 8QQ, UK
(4) Queen Mary University of London, Department of Physics, Mile End Road, London E1 4NS, UK

E-mail: wbhimji@staffmail.ed.ac.uk

Abstract. Grid Storage Resource Management (SRM) and local file-system solutions are facing significant challenges in supporting efficient analysis of the data now being produced at the Large Hadron Collider (LHC). We compare the performance of different storage technologies at UK grid sites, examining the effects of tuning and of recent improvements in the I/O patterns of experiment software. Results are presented both for live production systems and for technologies not currently in widespread use. Performance is studied using tests, including real LHC data analysis, that can be used to aid sites in deploying or optimising their storage configuration.

1. Introduction
The work presented here focuses primarily on Tier 2 WLCG sites, which have limited financial resources but are also expected to host a considerable portion of the user analysis of data for the LHC experiments. The large data volumes and the random-like access to that data stress such systems, hitting limits in, for example, disk seeks or network bandwidth. This results in certain setups displaying CPU efficiencies of less than 50% with correspondingly low event rates. Improving these efficiencies could make considerably better use of existing computing resources, yielding savings in the substantial costs involved in provisioning LHC computing hardware. This issue needs to be addressed by the LHC community, especially as we enter an era of rapidly growing data volumes and increasing analysis activity.

1.1. Tuning storage
There are a number of ways to tackle the bottlenecks and poor efficiencies described above. Some approaches are outlined below and examples of these are explored in this paper.

Applications: the way data is accessed has a significant impact on the ability of the storage systems to deliver it effectively. As the data stored by LHC experiments at these sites uses almost exclusively ROOT-based [1] formats, there is scope for improvements to be made in that access layer. We examine some recent attempts to make more sophisticated use of ROOT features.

Filesystems/Protocols: there are many filesystems and protocols used by LHC experiments to access data. Each can offer benefits but requires specific tuning. We test in detail the RFIO protocol of the Disk Pool Manager (DPM) [2], the most widely used SRM solution in the UK, as well as exploring alternative and experimental technologies.

Hardware: upgrading hardware is often the obvious way of resolving a bottleneck from the perspective of a site that may be unaware of the details of the application. Current options include the use of 10 Gbit networking or storage that can sustain high levels of I/O operations per second (IOPS), such as Solid State Drives (SSDs). This is not discussed in great detail in this report, but a comparison of SSDs with other solutions is performed elsewhere [3].

Section 2 of this paper describes the test setup used, while Sections 3 and 4 explore the issues described above and Section 5 considers some alternative technologies for the future.

2. Description of tests performed
2.1. Data
The tests performed here replicate LHC data analysis usage. Experimental data in the form of ATLAS AOD files [4] are chosen, as this format is commonly used at these sites. Larger file sizes (2-3 GB) are selected in order to make it less likely that a file can be entirely cached in RAM, which would not test the underlying storage. Part of this study includes understanding how recent changes in the ATLAS data format impact file access performance, so the samples include identical data in older formats as well as data with the latest optimisations.

2.2. Tests
A simple framework was developed incorporating three tests:
(i) File copy: many ATLAS jobs currently copy data to the worker node to run. Performing a file copy using the available local file protocols therefore models the interaction such jobs have with the data servers and provides an easy test to run.
(ii) Reading events in ROOT: this test reads through the main collection tree (a sketch is given at the end of this section). It provides a portable test that does not require an experiment's software to be installed. It may, however, miss subtleties of real analysis jobs, such as the reading of multiple trees in the same file.
(iii) ATLAS Athena jobs: we use D3PDMaker, which is commonly used to write data out in a different format for further analysis. This provides a realistic test of the I/O involved in reading events but requires the ATLAS software infrastructure.

In addition to this framework, use was made of the HammerCloud tool [5], which provides automatic resubmission of jobs together with collection of statistics. The HammerCloud tests were configured to use the same D3PDMaker job and files as above, and the raw results were filtered to ensure the same CPU type and similar numbers of simultaneously running jobs for each test.
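As an illustration of test (ii), the following PyROOT fragment is a minimal sketch of such a read test. It is not the framework actually used for the results in this paper; the file name is a placeholder (ATLAS AODs store their main collection tree as CollectionTree).

    # Minimal sketch of test (ii): read every entry of the main collection tree
    # and report the wall-clock event rate. File name is an example only.
    import time
    import ROOT

    filename = "AOD.test.root"            # hypothetical 2-3 GB ATLAS AOD file
    f = ROOT.TFile.Open(filename)
    tree = f.Get("CollectionTree")        # main collection tree in ATLAS AODs

    start = time.time()
    nevents = tree.GetEntries()
    for i in range(nevents):
        tree.GetEntry(i)                  # reads (and decompresses) all branches
    elapsed = time.time() - start

    print("read %d events in %.1f s (%.1f events/s)" % (nevents, elapsed, nevents / elapsed))
    f.Close()

The same loop can be pointed at a remote protocol (for example an RFIO or GPFS path) simply by changing the file name, which is what makes this test portable across the storage systems compared here.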

2.3. Sites
Three UK Tier 2 sites were used for tests. These were:

Glasgow: a Tier 2 site with Intel Xeon E5420 (2.5 GHz) CPUs selected for these tests and 1 PB of disk, with disk servers containing 24x1TB SATA disks and connected with dual-bonded gigabit ethernet. The storage solutions tested were the DPM used in production and a DPM-FUSE test installation.

Edinburgh Compute and Data Facility (ECDF): Intel Xeon X5450 (3.0 GHz) CPUs were selected here; production disk servers contain 15x1TB SATA disks and use dual-bonded gigabit ethernet. The storage solutions tested were StoRM with GPFS and DPM, both used in production, and testbeds for the HDFS and Ceph filesystems.

QMUL: Intel Xeon E5420 (2.5 GHz) CPUs were tested here; disk servers contain 3x1TB SATA disks and use 10 GbE, with dual-bonded gigabit ethernet for the worker nodes. The storage solution was StoRM with the Lustre filesystem.

Tests were run using the production infrastructure of the sites. This has the advantage of presenting studies on large-scale resources that have already been tuned to some extent for LHC analysis, and it represents a realistic environment under which users' jobs would currently be running. A disadvantage, however, is that some comparisons between protocols are made on different site setups, though we note where that is the case. Another disadvantage is that the sites were running other jobs during the tests; to mitigate this, all tests were repeated several times. The results presented should therefore not be taken as precise measurements but, given the large scale of some of the effects observed, definite conclusions can still be drawn.

3. Application tuning
3.1. ROOT I/O
Recently there have been a number of improvements in the LHC experiments' use of ROOT features, including basket reordering and the TTreeCache. These are described in detail elsewhere [7]. Both have the effect of reducing the random-like access to data files and the number of requests made to the underlying filesystem, and can therefore significantly improve storage performance. We measure the impact of these changes in more detail in the following sections.

3.1.1. Basket reordering. The ATLAS experiment has recently written out files in which previously unordered buffers are arranged by event number. In Figure 1 we show the effect of this change when accessing data from local disk (see [6] for more details on the variables plotted in this figure). Six ATLAS Athena jobs are run on different 2 GB files on the same filesystem. The thick lines in the disk I/O plot for unordered files illustrate that the application has to seek to various positions in the files, given by the disk offset value; this reaches the available seek limits, as shown in the seek-count plot. For the new files it can be seen that the scatter in offsets is much smaller and so the seek counts are substantially reduced.

Figure 1. Disk access pattern and seeks per second for six Athena jobs running on files with unordered baskets (left) or after basket reordering (right) (plots from seekwatcher [6]).

Figure 2 shows the improvements that can be gained from reordering when data is read from remote servers, using a ROOT test with different protocols at the same site. Much less time is taken to read the unordered file on the GPFS filesystem than using DPM/RFIO, but the best performance for both protocols is achieved with the reordered data. Figure 3 shows larger-scale stress tests with multiple simultaneous jobs (using a MuonAnalysis HammerCloud test); again there is a clear indication that the reordered files lead to significantly better CPU efficiencies for both Lustre and DPM/RFIO systems. It also appears that direct access on the Lustre filesystem offers better performance than RFIO.

Figure 2. The time for a ROOT test reading the same unordered or reordered file for GPFS and RFIO.

Figure 3. Stress tests running on multiple files with basket reordering (white) or without (blue, filled) on DPM/RFIO (left) (with an RFIO buffer size of 0.5 MB) or Lustre (right).
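The seekwatcher measurements above profile the block device itself. A complementary, application-side view of the same effect can be obtained with ROOT's TTreePerfStats class, which records the number and size of the read calls a job issues. This was not part of the test framework used here; the sketch below is illustrative only and the file name is a placeholder.

    # Sketch (not from the paper): record the read calls issued while scanning a tree.
    import ROOT

    f = ROOT.TFile.Open("AOD.test.root")        # hypothetical ATLAS AOD file
    tree = f.Get("CollectionTree")              # main collection tree

    stats = ROOT.TTreePerfStats("ioperf", tree) # records reads made through this tree
    for i in range(tree.GetEntries()):
        tree.GetEntry(i)

    stats.Print()                               # bytes read, number of read calls, read speed
    stats.SaveAs("ioperf.root")                 # can be inspected later with stats.Draw()

Run once on an unordered file and once on a reordered one, the drop in the number of read calls mirrors the reduction in disk seeks seen in Figure 1.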

3.1.2. TTreeCache. The TTreeCache is a file cache for the data of interest, where that data is determined from a learning phase. Using it with the RFIO protocol on DPM required fixes for a couple of minor bugs in DPM (available since version 1.7.4-7) and in ROOT (backported to v5.26c, used in ATLAS releases after 15.9). Figure 4 shows the further improvements in event rate over basket reordering alone for an indicative single Athena job, and Figure 5 shows the improvements in CPU efficiency for multiple Athena jobs run simultaneously. That figure also shows that smaller RFIO buffer sizes can bring improvements in efficiency, which is discussed in the next section.

Figure 4. Event rates for an Athena test on the same unordered or reordered file with the TTreeCache (TTC).

Figure 5. CPU efficiency for multiple Athena jobs running on the same complete, reordered, dataset with the TTreeCache (blue, filled) or without (white) on DPM/RFIO (left) (with RFIO buffer sizes of 1 MB and 4k (red, filled)) or Lustre (right).
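For reference, enabling the TTreeCache from user code is straightforward. The following PyROOT fragment is a minimal sketch; the file name, cache size and learning-phase length are illustrative, not the values used in the tests above.

    import ROOT

    f = ROOT.TFile.Open("AOD.test.root")      # hypothetical file; the same calls apply
    tree = f.Get("CollectionTree")            # when reading over RFIO or another protocol

    tree.SetCacheSize(30 * 1024 * 1024)       # allocate a 30 MB TTreeCache for this tree
    tree.SetCacheLearnEntries(10)             # first 10 entries decide which branches to cache
    # tree.AddBranchToCache("*", True)        # alternatively, cache all branches explicitly

    for i in range(tree.GetEntries()):
        tree.GetEntry(i)                      # reads are now grouped into large cached requests

    tree.PrintCacheStats()                    # reports cache efficiency and bytes read

Experiment frameworks such as Athena enable the cache through their own I/O layer rather than user code, but the underlying mechanism is the same.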

4. Filesystem/Protocol tuning
4.1. RFIO buffer sizes and read-aheads
The RFIO protocol has a configurable read-ahead buffer size. Figure 5 shows that, even with the reordered files or with the TTreeCache in use, higher CPU efficiencies are seen with very small buffer sizes, indicating that data access is still not sequential enough to make effective use of the buffered data. Figure 6 shows the network traffic when the same ATLAS Athena job is run with successively increasing buffer sizes (with Linux file caches dropped between successive runs); it illustrates that larger volumes of unused data are indeed transferred. However, using smaller buffers leads to more IOPS on the disk servers and so can cause considerable load, which would impact all other users. Therefore, unless they have disk servers that can handle this level of IOPS, many sites choose a higher value of the RFIO buffer size (say 0.5 MB) to limit the IOPS at the expense of higher network traffic. UK grid sites have also found that larger values for the block-device read-ahead on disk servers further reduce this load (a sketch of such a change is given at the end of this section). Finally, it is still preferable for DPM sites to copy files to the worker node, as Figure 7 shows that high CPU efficiencies can be obtained in that case without the level of IOPS caused by direct RFIO access.

Figure 6. Network traffic for Athena tests with different values of the RFIO buffer size.

Figure 7. CPU efficiency when the data file is copied to the worker node compared to that for direct RFIO access with a 4k buffer size.

Copying the file can, however, move the bottleneck to the worker node disk. As shown earlier in Figure 1, multiple Athena jobs running on unordered data saturate the IOPS on a single disk. We measured that one Athena job running on a worker node with a single SATA disk would achieve around 90% efficiency, but this decreases to around 70% when the node has 8 cores occupied. This can be improved by using a pair of disks in a RAID configuration (as is done in the test shown in Figure 7) or by using SSDs (see [3] for a detailed comparison).
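The block-device read-ahead mentioned above is a kernel setting on each disk server rather than anything in DPM itself. The sketch below shows one way an administrator might raise it through sysfs; the device name and value are placeholders only (the same change can be made with the blockdev utility), and no particular value is being recommended here.

    # Sketch: raise the block-device read-ahead on a storage pool disk (run as root).
    # "sdb" and 4096 KB are placeholders; the kernel default is typically 128 KB.
    DEVICE = "sdb"
    READ_AHEAD_KB = 4096

    path = "/sys/block/%s/queue/read_ahead_kb" % DEVICE
    with open(path, "w") as f:
        f.write(str(READ_AHEAD_KB))

    with open(path) as f:
        print("read_ahead_kb for %s is now %s" % (DEVICE, f.read().strip()))

As with the RFIO buffer size, the appropriate value depends on the balance a site wants between per-request throughput and the IOPS load its disk servers can sustain.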

5. Alternative technologies
5.1. FUSE with DPM
POSIX I/O for accessing files considerably aids users and allows use of Linux file caching. We test one way to enable this for DPM through an already existing, but not widely used, GFAL Filesystem in Userspace (FUSE) module. When using this module RFIO is the backend protocol, but applications benefit from the page cache. As shown in Figure 8, we find that while the performance for copying files is not as good as using the RFIO protocol directly, the performance for reading files directly, either in ROOT or Athena, appears to be better. However, there is currently a limitation of the module, believed to be in GFAL, that causes a crash when a file is opened twice. This affects any Athena job with AutoConfiguration, including the D3PDMaker job used here (for testing, it was worked around by opening the file via RFIO for the configuration step), and would need to be fixed before this FUSE library could be used for ATLAS Athena jobs. Even without a fix, however, this module could be useful for easing access to user data.

Figure 8. Time taken to complete tests reading a file directly with RFIO or via the GFAL FUSE module.

5.2. HDFS and Ceph
A number of distributed filesystems are gaining increasing interest outside the LHC community. Among the more popular of these are the Hadoop Filesystem (HDFS) [8] and Ceph [9]. HDFS is widely used in industry and is particularly valuable for aggregating the disk in worker nodes given its fault tolerance; it has already been used for LHC data at US CMS Tier 2 sites. Ceph, while not yet considered production-ready by its developers, has a client that has been included in the Linux kernel since v2.6.34. A testbed was set up with these filesystems, full details of which are given in [10]. For ATLAS data, as shown in Figure 9, both filesystems were found to offer performance for reordered data similar to the site's GPFS and DPM systems (shown in Figure 2). Ceph also showed good performance for the random access required for unordered ATLAS files; the HDFS test did not complete on those files, but it is believed that the default read-ahead values in HDFS could be tuned better for this kind of access.

Figure 9. The time for a ROOT test reading an unordered or reordered file with the Ceph or HDFS filesystems.

6. Conclusions
Storage setups used for LHC data analysis can benefit from detailed study and tuning to improve efficiencies and event rates. Considerable progress has been made recently: we have shown that improvements in the experiments' use of ROOT I/O, through basket reordering and the TTreeCache, significantly improve performance on all the systems tested. We have also shown that tuning of those systems, such as read-aheads and the choice of protocol, can yield further benefits, and that there are many promising emerging technologies. However, this tuning depends on the available site hardware and the type of jobs, so there will be a continuing need for the tools and understanding developed here to be applied through ongoing testing and monitoring.

Acknowledgments
The authors wish to thank Jens Jensen and others in the GridPP Storage Group for comments and support, and Daniel van der Ster for help with HammerCloud tests.

References
[1] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81
[2] DPM - Disk Pool Manager URL https://svnweb.cern.ch/trac/lcgdm/wiki/dpm
[3] Skipsey S, Bhimji W and Kenyon M 2011 Establishing applicability of SSDs to LHC Tier-2 hardware configuration. To be published in J. Phys.: Conf. Series
[4] Duckeck G et al. [ATLAS Collaboration] 2005 CERN-LHCC-2005-022
[5] Vanderster D C et al. 2010 Functional and large-scale testing of the ATLAS distributed analysis facilities with Ganga J. Phys.: Conf. Series 219 072021
[6] Seekwatcher URL http://oss.oracle.com/~mason/seekwatcher/
[7] Vukotic I, Bhimji W, Biscarat C, Brandt G, Duckeck G, van Gemmeren P, Peters A and Schaffer R D 2010 Optimization and performance measurements of ROOT-based data formats in the ATLAS experiment. ATL-COM-SOFT-2010-081. To be published in J. Phys.: Conf. Series
[8] HDFS URL http://hadoop.apache.org/hdfs
[9] Ceph URL http://ceph.newdream.net/
[10] Mouzeli E G 2010 Open Source Distributed Storage Solutions for the ATLAS Project, MSc thesis, University of Edinburgh URL http://www2.ph.ed.ac.uk/~wbhimji/gridstorage/opensourcefilesystems-em.pdf