Data Management for the World's Largest Machine

Sigve Haug 1, Farid Ould-Saada 2, Katarina Pajchel 2, and Alexander L. Read 2

1 Laboratory for High Energy Physics, University of Bern, Sidlerstrasse 5, CH-3012 Bern, Switzerland, sigve.haug@lhep.unibe.ch
2 Department of Physics, University of Oslo, Postboks 1048 Blindern, NO-0316 Oslo, Norway, {farid.ould-saada, katarina.pajchel, a.l.read}@fys.uio.no, http://www.fys.uio.no/epf

Abstract. The world's largest machine, the Large Hadron Collider, will have four detectors whose output is expected to answer fundamental questions about the universe. The ATLAS detector is expected to produce 3.2 PB of data per year, which will be distributed to storage elements all over the world. In 2008 the resource need is estimated to be 16.9 PB of tape, 25.4 PB of disk, and 50 MSI2k of CPU. Grids are used to simulate, access, and process the data. Sites in several European and non-European countries are connected with the Advanced Resource Connector (ARC) middleware of NorduGrid. In the first half of 2006 about 10^5 simulation jobs with 27 TB of distributed output, organized in some 10^5 files and 740 datasets, were performed on this grid. ARC's data management capabilities, the Globus Replica Location Service, and ATLAS software were combined to achieve a comprehensive distributed data management system.

1 Introduction

At the end of 2007 the Large Hadron Collider (LHC) in Geneva, often referred to as the world's largest machine, will start to operate [1]. Its four detectors aim to collect data which are expected to give some answers to fundamental questions about the universe, e.g. what is the origin of mass. The data acquisition system of one of these detectors, the ATLAS detector, will write the recorded information of the proton-proton collision events at a rate of 200 events per second [2]. Each event's information will require 1.6 MB of storage space [3].
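The event rate and event size fix the raw recording rate, and together with the quoted 3.2 PB per year they imply roughly 10^7 seconds of data taking per year. A back-of-the-envelope cross-check (not from the paper; decimal units assumed, 1 PB = 10^9 MB):

```python
# Cross-check of the quoted figures: raw data rate from 200 events/s
# at 1.6 MB/event, and the machine live time implied by 3.2 PB/year.
# Decimal units assumed (1 PB = 10^9 MB).
event_rate_hz = 200
event_size_mb = 1.6
rate_mb_per_s = event_rate_hz * event_size_mb     # 320 MB/s
live_seconds_per_year = 3.2e9 / rate_mb_per_s     # 3.2 PB = 3.2e9 MB
print(rate_mb_per_s, live_seconds_per_year)
```

The result, about 10^7 live seconds per year, is the conventional assumption for a year of collider operation.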
Taking the operating time of the machine into account, this will yield 3.2 PB of recorded data per year. Simulated and reprocessed data come in addition. The estimated computing resource needs for 2008 are 16.9 PB of tape storage, 25.4 PB of disk storage, and 50.6 MSI2k of CPU. The ATLAS experiment uses three grids to store, replicate, simulate, and process the data all over the planet: the LHC Computing Grid (LCG), the Open Science Grid (OSG), and NorduGrid [4] [5] [6]. Here we report on the recent experience with the present distributed simulation and data management system used by the ATLAS experiment on NorduGrid.

B. Kågström et al. (Eds.): PARA 2006, LNCS 4699, pp. 480-488, 2007. © Springer-Verlag Berlin Heidelberg 2007

Fig. 1. Geographical snapshot of sites connected with the ARC middleware (as of Dec. 2005). Many sites are also organized into national and/or organizational grids, e.g. Swegrid and the Swiss ATLAS Grid.

A geographical map of the sites connected by NorduGrid's middleware, the Advanced Resource Connector (ARC), is shown in Figure 1. The network of sites which also have the necessary ATLAS software installed, and thus are capable of running ATLAS computing tasks, will in the following be called the ATLAS ARC Grid.

First, a description of the distributed simulation and data management system follows. Second, a report on the system performance in the period from November 2005 to June 2006 is presented. Then future usage, limitations, and needed
improvements are discussed. Finally, we recapitulate the performance of the ATLAS ARC Grid in this period and draw some conclusions.

2 The Simulation and Data Management System

The distributed simulation and data management system on the ATLAS ARC Grid can be divided into three main parts. First, there is the production database, which is used for the definition and tracking of the simulation tasks [7]. Second, there is the Supervisor-Executor instance, which pulls tasks from the production database and submits them to the ATLAS ARC Grid. Finally, there are the ATLAS data management databases, which collect the logical file names into datasets [8]. The Supervisor is common to all three grids. The Executor is unique to each grid and contains the code to submit, monitor, postprocess, and clean the grid jobs. In the case of the ATLAS ARC Grid, this simple structure relies on the full ARC grid infrastructure, in particular a Globus Replica Location Service (RLS) which maps logical to physical file names [9].

The production database is an Oracle instance where job definitions, job input locations, and job output names are kept. Furthermore, jobs' estimated resource needs, status, etc. are stored. The Supervisor-Executor is a Python application which is run by a user whose grid certificate is accepted at all ATLAS ARC sites. The Supervisor communicates with the production database and passes simulation jobs to the Executor in XML format. The Executor then translates the job descriptions into ARC's extended resource specification language (XRSL). Job brokering is performed with attributes specified in the XRSL job description and information gathered from the computing clusters with the ARC information system. In particular, clusters must have the required ATLAS runtime environment installed. This is an experiment-specific software package of about 5 GB which is frequently released. When a suitable cluster is found, the job is submitted.
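The translation step can be sketched as follows. This is not the production Executor code: the job values, file names, and URLs are hypothetical, and only a few common XRSL attributes are shown.

```python
# Minimal sketch of the XML-to-XRSL translation step described above.
# NOT the production Executor; job values and URLs are hypothetical.

def to_xrsl(job):
    """Render a job-description dict as an ARC XRSL string."""
    parts = [
        f'(executable="{job["executable"]}")',
        f'(jobName="{job["name"]}")',
        # Brokering matches only clusters whose ARC information system
        # advertises the required ATLAS runtime environment.
        f'(runTimeEnvironment="{job["rte"]}")',
    ]
    for lfn, url in job["inputs"].items():
        # Logical name -> source URL; the physical replica is resolved
        # through the RLS indexing service at download time.
        parts.append(f'(inputFiles=("{lfn}" "{url}"))')
    for lfn, url in job["outputs"].items():
        parts.append(f'(outputFiles=("{lfn}" "{url}"))')
    return "&" + "".join(parts)

job = {
    "executable": "run_atlas.sh",
    "name": "csc.005001.recon._00042",
    "rte": "APPS/HEP/ATLAS-11.0.5",
    "inputs": {"EVNT._00042.pool.root":
               "rls://atlasrls.nordugrid.org/EVNT._00042.pool.root"},
    "outputs": {"AOD._00042.pool.root":
                "rls://atlasrls.nordugrid.org/AOD._00042.pool.root"},
}
print(to_xrsl(job))
```

Real job descriptions carry many more attributes (CPU-time and memory requests, log directories, reruns); the point here is only the shape of the translation from a per-job description to the XRSL string that ARC brokers on.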
The ARC grid-manager on the front-end of the cluster downloads the input files, submits jobs to the local batch system, monitors them to their completion, and uploads the output of successful jobs. In this process the RLS is used to index both input and output files. The physical storage element (SE) for an output file is provided automatically by a storage service which obtains a list of potential SEs indexed by the RLS. Thus neither the grid job executing on the batch node nor the Executor performs any data movement, and neither needs to know explicitly where the physical inputs come from or where the physical outputs are stored.

When the Executor finds a job finished, it registers the metadata of the job output files, e.g. a globally unique identifier and the creation date, in the RLS. It sets the desired grid access control list (gacl) on the files and reports back to the Supervisor and the production database. Finally, the production database is periodically queried for finished tasks. For these, the logical file names and their dataset affiliation are retrieved in order to register the available datasets, their file content, state, and locations in the ATLAS dataset databases. Hence, datasets can subsequently be looked up
for replication and analysis. The dataset catalogs provide the logical file names and the indexing service (from among the more than 20 index servers for the three grids of which the ATLAS computing grid is comprised) for the dataset to which the logical file is attached. The indexing service, i.e. the RLS on the ATLAS ARC Grid, provides the physical file location.

In short, the production on the ATLAS ARC Grid is by design a fully automatic and lightweight system which takes advantage of the inherent job-brokering and data management capabilities of the ARC middleware (RLS for indexing logical to physical file names and storing metadata about files) and the ATLAS distributed data management system (a set of catalogs allowing replication and analysis on a dataset basis). See References [10] and [11] for detailed descriptions of the ATLAS and ARC data management systems.

3 Recent System Performance on the ATLAS ARC Grid

The preparation for the ATLAS experiment relies on detailed simulations of the physics processes, from the proton-proton collision, via the particle propagation through the detector material, to the full reconstruction of the particles' tracks. To a large extent this has been achieved in carefully planned periods of operation, so-called Data Challenges. Many ARC sites have been providing resources for these large-scale production operations [12].

Table 1. ARC clusters which contributed to the ATLAS simulations in the period from November 2005 to June 2006. The number of jobs per site and the fraction of successful jobs are shown.

     Cluster                  Number of jobs  Efficiency
 1   ingrid.hpc2n.umu.se                6596        0.94
 2   benedict.grid.aau.dk               5838        0.88
 3   hive.unicc.chalmers.se            14211        0.84
 4   pikolit.ijs.si                    34106        0.83
 5   bluesmoke.nsc.liu.se               9141        0.83
 6   hagrid.it.uu.se                    6654        0.81
 7   grid00.unige.ch                     624        0.79
 8   morpheus.dcgc.dk                   1329        0.76
 9   grid.uio.no                        2878        0.75
10   lheppc10.unibe.ch                  3978        0.73
11   hypatia.uio.no                     1542        0.70
12   sigrid.lunarc.lu.se               12038        0.70
13   alice.grid.upjs.sk                    3        0.67
14   norgrid.ntnu.no                      31        0.48
15   grid01.unige.ch                     284        0.35
16   norgrid.bccs.no                     286        0.35
17   grid.tsl.uu.se                        6        0.00

Table 2. ARC storage elements and their contributions to the ATLAS Computing System Commissioning. The third column shows the number of files stored by the ATLAS production in the period; the fourth lists the total space occupied by these files. The numbers were extracted from the Replica Location Service rls://atlasrls.nordugrid.org on 2006-06-13.

Storage Element                  Location     Files      TB
ingrid.hpc2n.umu.se              Umeå          1217     0.2
se1.hpc2n.umu.se                 Umeå         14078     1.3
ss2.hpc2n.umu.se                 Umeå         70656     5.6
ss1.hpc2n.umu.se                 Umeå         74483     6.2
hive-se2.unicc.chalmers.se       Göteborg     10412     0.8
harry.hagrid.it.uu.se            Uppsala      38226     2.9
hagrid.it.uu.se                  Uppsala      12620     1.6
storage2.bluesmoke.nsc.liu.se    Linköping     6254     0.6
sigrid.lunarc.lu.se              Lund         14425     1.9
swelanka1.it.uu.se               Sri Lanka        1   < 0.1
grid.uio.no                      Oslo           856   < 0.1
grid.ift.uib.no                  Bergen           1   < 0.1
morpheus.dcgc.dk                 Aalborg        252   < 0.1
benedict.grid.aau.dk             Aalborg       9426     1.3
pikolit.ijs.si:2811              Slovenia     25094     2.0
pikolit.ijs.si                   Slovenia     21239     2.7
Total                                        299240    27.1

At the present time the third Data Challenge, or the Computing System Commissioning (CSC), is entering a phase of more or less constant production. As part of this constant production, about 100 000 simulation jobs were run on ATLAS-enabled ARC sites in the period from mid November 2005 to mid June 2006, where the end date merely reflects the time of this report. Up to 17 clusters comprising about 1000 CPUs were used as a single resource for these jobs. In Table 1 the clusters and their executed job shares are listed. The number of jobs per cluster varies with its size, access policy, and the competition with local users. In this period six countries provided resources. The Slovenian cluster, pikolit.ijs.si, was the largest contributor, followed by the Swedish resources. The best clusters have efficiencies close to 90% (total ATLAS and grid middleware efficiency).
This number reflects what can be expected in a heterogeneous grid environment, where not only are different jobs and evolving software in use, but the operational efficiency of the numerous computing clusters and storage services is also a significant factor.

In Table 2 the number of output files and their integrated sizes are listed by storage element and location. About 300 000 files with a total of
27 TB were produced and stored on disks at 11 sites in five different countries. This gives an average file size of 90 MB. The integrated storage contribution per country is shown in Figure 2.[1]

Fig. 2. TB per country. The graph visualizes the numbers in Table 2. In the period from November 2005 to June 2006, Sweden and Slovenia were the largest storage contributors to the ATLAS Computing System Commissioning. Only ARC storage is considered.

In the ATLAS production of simulated data (future data analysis will produce a different and more chaotic pattern), simulation is done in three steps, with varying input and output sizes per step. In the first step the physics in the proton-proton collisions is simulated, so-called event generation. These jobs have practically no input and an output of about 0.1 GB per job. In the second step the detector response to the particle interactions is simulated. These jobs use the output from the first step as input and produce about 1 GB of output per job. This output is in turn used as input for the last step, where the reconstruction of the detector response is performed. A reconstruction job takes about 10 GB of input in 10 files and produces an output of typically 1 GB. In order to minimize the number of files, it is foreseen to increase the file sizes (from 1 to 10 GB) as network capacity, disk sizes, and tape systems evolve. The outputs are normally replicated to at least one other storage element in one of the other grids and, in the case of reconstruction outputs (the starting point of most physics analyses), to all the other large computing sites spread throughout the ATLAS grid. The output remains on the storage elements until a central ATLAS decision is made about deletion, most probably after several years.

[1] This distribution is not representative of the previous data challenges.
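The bookkeeping numbers above can be cross-checked directly from Tables 1 and 2. The following sketch transcribes the table figures (it is not the paper's own tooling); the job-weighted mean efficiency, about 0.81, is simple arithmetic on Table 1 and not a number quoted in the text:

```python
# Cross-checks of the production numbers quoted in Section 3.
# Figures transcribed from Tables 1 and 2 above.

# Table 1: jobs per cluster and their success fractions.
jobs = [6596, 5838, 14211, 34106, 9141, 6654, 624, 1329,
        2878, 3978, 1542, 12038, 3, 31, 284, 286, 6]
effs = [0.94, 0.88, 0.84, 0.83, 0.83, 0.81, 0.79, 0.76,
        0.75, 0.73, 0.70, 0.70, 0.67, 0.48, 0.35, 0.35, 0.00]
total_jobs = sum(jobs)                      # ~100 000, as quoted
weighted_eff = sum(n * e for n, e in zip(jobs, effs)) / total_jobs

# Table 2: total files and volume; implied average file size.
total_files = 299240
total_tb = 27.1
avg_mb = total_tb * 1e6 / total_files       # ~90 MB, as quoted

print(total_jobs, round(weighted_eff, 2), round(avg_mb))
```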
Table 3. ATLAS datasets on ARC storage elements as of 2006-06-13

Category   ARC   Total   ARC/Total   Description
All        739    3171        0.23   CSC + CTB + MC
CSC        489    2179        0.22   Computing System Commissioning
CTB          7      86        0.08   Combined Test Beam Production
MC         242     906        0.27   MC Production

Finally, the output files were logically collected into datasets, the objects of analysis and replication. The 300 000 ATLAS files produced in this period and stored on ARC storage elements belong to 739 datasets. The average number of files per dataset was thus roughly 400, the actual numbers ranging from 50 to 10 000. Table 3 shows the categories of datasets and their respective shares of the total numbers. The numbers in the ARC column were collected with the ATLAS DQ2 client, the numbers in the Total column with the PANDA monitor (http://gridui02.usatlas.bnl.gov:25880/server/pandamon/query). Since in the considered period the ATLAS ARC Grid's contribution to the total ATLAS Grid production is estimated to have been about 11 to 13%, the numbers indicate that the jobs processed were rather shorter than average.[2]

4 Perspective, Limitations and Improvements

The limitations of the system must be considered in the context of its desired capabilities. At the moment the system manages some 10^3 jobs per day, where each job typically needs less than a day to finish. The number of output files is about three times larger. In order to provide the ATLAS experiment with a significant production grid, the ATLAS ARC Grid should aim to cope with job numbers another order of magnitude larger. In this perspective the ATLAS ARC Grid has no fundamental scaling limitations. However, in order to meet this ambition several improvements are needed. First, the available amount of resources must increase; the present operation almost exhausts the existing resources.
Since the resources are shared and increasingly attractive to users, fair-sharing of the resources between local and grid use, and between different grid users, needs to be implemented. At the moment local users always have implicit first priority, and the grid users are often mapped to a single local account, so that they are effectively served on a first-come, first-served basis.

Second, the crucial Replica Location Service provides the desired functionality, with mapping from logical to physical file names, certificate authentication, and bulk operations, and is expected to be able to handle the planned scaling-up of the system. However, its lack of perfect stability is an important problem which remains to be solved. Meanwhile, the persons running the Supervisor-Executor instances should probably be given some administration privileges, e.g. the possibility to restart the service.

Third, further development should aim at tolerating a few hours of database unavailability. Both the production database and the data management databases occasionally have a few hours of downtime. This should cause no problems other than delays in database registrations.

Continuous improvements in the ARC middleware ease the operation. However, the ATLAS ARC Grid contains many independent clusters which are in production mode and not dedicated to ATLAS. It is therefore impractical to negotiate frequent middleware upgrades on all of them. Hence, the future system should rely as much as possible on the present features.

[2] The Nordic share of the ATLAS computing resources is 7.5%, according to a memorandum of understanding.

5 Conclusions

As part of the preparations for the ATLAS experiment at the Large Hadron Collider, large amounts of data are simulated on grids. The ATLAS ARC Grid, the sites connected with NorduGrid's Advanced Resource Connector which have the ATLAS software installed and configured for use by grid jobs, now continuously contributes to this global effort. In the period from November 2005 to June 2006 about 300 000 output files were produced on the ATLAS ARC Grid. Up to 17 sites in five different countries were used as a single batch facility to run about 100 000 jobs.

Compared to previous usage, another layer of organization was introduced in the data management system. This enabled the concept of datasets, i.e. conglomerations of files, which are used as the objects of data analysis and replication. The 27 TB of output was collected into 740 datasets, with the physical output distributed over eight significant sites in four countries. Present experience shows that the system design can be expected to cope with the future load.
Provided enough resources are available, one person should be able to supervise about 10^4 jobs per day with a few GB of input and output data each. The present implementation of the ATLAS ARC Grid lacks the ability to replicate ATLAS datasets to and from other grids via the ATLAS distributed data management tools [8], and there is no support for tape-based storage elements. These shortcomings will be addressed in the near future.

Acknowledgments. The indispensable work of the contributing resources' system administrators is highly appreciated.

References

1. The LHC Study Group: The Large Hadron Collider, Conceptual Design, CERN-AC-95-05 LHC (1995)
2. ATLAS Collaboration: Detector and Physics Performance Technical Design Report, CERN-LHCC-99-14 (1999)
3. ATLAS Collaboration: ATLAS Computing Technical Design Report, CERN-LHCC-2005-022 (2005)
4. Knobloch, J. (ed.): LHC Computing Grid - Technical Design Report, CERN-LHCC-2005-024 (2005)
5. Open Science Grid Homepage: http://www.opensciencegrid.org
6. NorduGrid Homepage: http://www.nordugrid.org
7. Goossens, L., et al.: ATLAS Production System in ATLAS Data Challenge 2, CHEP 2004, Interlaken, contribution no. 501
8. ATLAS Collaboration: ATLAS Computing Technical Design Report, CERN-LHCC-2005-022, p. 115 (2005)
9. Nielsen, J., et al.: Experiences with Data Indexing Services Supported by the NorduGrid Middleware, CHEP 2004, Interlaken, contribution no. 253
10. Konstantinov, A., et al.: Data Management Services of NorduGrid, CERN-2005-002, vol. 2, p. 765 (2005)
11. Branco, M.: Don Quijote - Data Management for the ATLAS Automatic Production System, CERN-2005-002, p. 661 (2005)
12. NorduGrid Collaboration: Performance of the NorduGrid ARC and the Dulcinea Executor in ATLAS Data Challenge 2, CERN-2005-002, vol. 2, p. 1095 (2005)