Open access article published in Journal of Physics: Conference Series. To cite this article: A Vaniachine et al 2014 J. Phys.: Conf. Ser. 513 032101.

Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns

A Vaniachine 1, D Golubkov 2 and D Karpenko 3 on behalf of the ATLAS Collaboration

1 High Energy Physics Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, United States of America
2 Experimental Physics Department, Institute for High Energy Physics, Protvino, 142281, Russia
3 Department of Physics, University of Oslo, P.b. 1048 Blindern, 0316 Oslo, Norway

E-mail: vaniachine@anl.gov

Abstract. During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data being reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, and Grid heterogeneity makes failures hard to diagnose and repair quickly. As a result, Big Data processing on the Grid must tolerate a continuous stream of failures, errors and faults. While ATLAS fault-tolerance mechanisms improve the reliability of Big Data processing on the Grid, their benefits come at a cost and introduce delays that make performance prediction difficult. Reliability Engineering provides a framework for a fundamental understanding of Big Data processing on the Grid, which is not a desirable enhancement but a necessary requirement. In ATLAS, cost monitoring and performance prediction became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. In addition, our Reliability Engineering approach supported continuous improvements in data reprocessing throughput during LHC data taking: the throughput doubled in the 2011 vs. the 2010 reprocessing, then quadrupled in the 2012 vs. the 2011 reprocessing. We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns, providing the foundation needed to scale up Big Data processing technologies beyond the petascale.

1. Introduction
During the first run period of the Large Hadron Collider (LHC) [1] at CERN in 2010-2012, the ATLAS experiment [2] recorded 7.4 PB of raw pp collision data. Data reconstruction is the starting point for physics analysis. Following the prompt reconstruction, the ATLAS data are reprocessed, which allows the data to be reconstructed with updated software and conditions/calibrations, improving the quality of the reconstructed data for physics analysis. The large-scale data reprocessing campaigns are conducted on the Worldwide LHC Computing Grid (WLCG) [3]. Timely completion is critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. The WLCG computing centres provide tens of thousands of CPU cores and petabytes of disk storage for faster reprocessing throughput. During Run 1, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of raw data being reprocessed every year. Depending on conditions, it takes 3-7 × 10^6 CPU-hours to reconstruct one petabyte of ATLAS raw data.
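To make these numbers concrete, the back-of-the-envelope sketch below (Python, not ATLAS production code) estimates the wall-clock duration of a campaign from the quoted CPU cost per petabyte; the slot count and CPU efficiency are illustrative assumptions, not measured values.

```python
# Back-of-the-envelope campaign duration estimate, assuming (hypothetically)
# that the quoted 3-7 million CPU-hours per petabyte scale linearly and that
# a fixed pool of single-core Grid job slots is continuously available.

def campaign_days(input_pb, cpu_hours_per_pb, concurrent_slots, efficiency=0.85):
    """Wall-clock days to reprocess `input_pb` petabytes on `concurrent_slots`
    continuously occupied job slots at the given CPU efficiency (assumed)."""
    total_cpu_hours = input_pb * cpu_hours_per_pb
    wall_hours = total_cpu_hours / (concurrent_slots * efficiency)
    return wall_hours / 24.0

if __name__ == "__main__":
    # Illustrative numbers only: 2 PB at 7e6 CPU-hours/PB on 30,000 slots.
    print(f"{campaign_days(2, 7e6, 30_000):.1f} days")   # ~22.9 days
```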

Scheduled LHC upgrades will increase the data-taking rates tenfold [4], which increases the demands on computing resources. However, a tenfold increase in WLCG resources for LHC upgrade needs is not an option. A comprehensive end-to-end solution for the composition and execution of the reprocessing within given CPU and storage constraints is necessary to accommodate the physics needs of the next LHC run. The ATLAS experiment needs to exercise due diligence in evolving its Computing Model to make optimal use of the required resources. Reliability Engineering provides the foundation needed to scale up ATLAS data reprocessing beyond the petascale.

2. Data Processing
The raw data from the ATLAS data acquisition system are processed to produce the reconstructed data for physics analysis. Events acquired within a few minutes are collected in one raw data file. Files with events that are close in time (from the same "run") are collected in one dataset. During reconstruction, ATLAS applications process the raw event data with sophisticated algorithms to identify and reconstruct physics objects such as charged-particle tracks. Figure 1 shows the data processing flow of the raw data and conditions/calibrations used in reconstruction operations. Since the data are composed of independent events, massively parallel reconstruction applications process one event at a time. The first-pass processing of the raw event data at the ATLAS Tier-0 site promptly provides the data for quality assessment and analysis. Later, the quality of the reconstructed data is improved by further optimizing the software algorithms and conditions/calibrations. For data processing with improved software and/or conditions and calibrations (reprocessing) we use WLCG computing resources.

Figure 1. Data from the ATLAS data acquisition system (top arrow) and conditions/calibrations are processed through prompt reconstruction at the Tier-0 site at CERN (top) and in reprocessing on the Grid (bottom).
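The grouping just described (events collected into raw files, files from the same run collected into a dataset, and event-level independence enabling massive parallelism) can be sketched as follows; the type and field names are purely illustrative and do not correspond to any ATLAS software.

```python
# Minimal sketch (not ATLAS code) of the data organisation described above:
# events are grouped into raw files, files from the same run into a dataset,
# and reconstruction can proceed event by event, independently.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class RawFile:
    name: str
    events: List[bytes]          # raw event records acquired within a few minutes

@dataclass
class Dataset:
    run_number: int
    files: List[RawFile] = field(default_factory=list)

def reconstruct_dataset(ds: Dataset, reconstruct_event: Callable[[bytes], dict]) -> List[dict]:
    # Because events are independent, this loop can be split across
    # arbitrarily many Grid jobs (one or more files per job).
    return [reconstruct_event(ev) for f in ds.files for ev in f.events]
```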

2.1. Tasks vs. Jobs
In petascale data processing the task became the main unit of computation. A task is a collection of similar jobs from the same dataset (such as from one data-taking "run") that can be executed in parallel. The task-based approach delivers six-sigma quality performance for petascale data processing campaigns with thousands of tasks [5]. The success of the task-based approach in ATLAS data processing resulted in a double-exponential growth in the number of tasks requested for processing [6].

3. Analysis
In reprocessing on the Grid, failures can occur for a variety of reasons, while Grid heterogeneity makes failures hard to diagnose and repair quickly. As a result, ATLAS data reprocessing tolerates a continuous stream of failures, errors and faults. Transient job failures can be recovered by automatic re-tries. While our fault-tolerance mechanisms improve the reliability of data reprocessing, their benefits come at a cost and introduce delays that make performance prediction difficult. Designing fault-tolerance strategies that minimize the duration of petascale data processing on the Grid is an active area of research. Reliability Engineering provides a framework for a fundamental understanding of data reprocessing on the Grid, which is not a desirable enhancement but a necessary requirement.

Figure 2. Tasks ranked by CPU time used to recover from transient failures.

3.1. Cost of Recovery from Failures
Job re-tries avoid data loss at the expense of the CPU-hours used by the failed jobs. Figure 2 shows that the distribution of tasks ranked by CPU-hours used to recover from transient failures is not uniform: most of the CPU-hours required for recovery were used in a small fraction of tasks. Table 1 compares the cost of failure recovery for the three major ATLAS data reprocessing campaigns. In the 2010 reprocessing, the CPU-hours used to recover from transient failures were 6% of the total CPU-hours used for reconstruction. In the 2011 reprocessing, the CPU-hours used to recover from transient failures were reduced to 4% of the total CPU-hours used for reconstruction. Figure 3 shows that most of the improvement came from a reduction in failures of file transfers at the end of a job [8].
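The ranking behind Figure 2 amounts to a few lines of bookkeeping; the sketch below assumes a hypothetical flat record of job attempts (task id, final status, CPU-hours) rather than the actual ATLAS production database schema.

```python
# Sketch of the bookkeeping behind Figure 2, assuming a hypothetical flat
# record of finished jobs: (task_id, status, cpu_hours). CPU time spent in
# failed attempts that were later retried is counted as recovery cost.
from collections import defaultdict

def recovery_cost_per_task(job_records):
    """job_records: iterable of (task_id, status, cpu_hours); status is
    'finished' or 'failed'. Returns {task_id: CPU-hours lost to failures}."""
    cost = defaultdict(float)
    for task_id, status, cpu_hours in job_records:
        if status == "failed":
            cost[task_id] += cpu_hours
    return dict(cost)

def top_fraction(cost_by_task, fraction=0.1):
    """Share of total recovery CPU-hours concentrated in the most expensive
    `fraction` of tasks (the ranking used for Figure 2)."""
    ranked = sorted(cost_by_task.values(), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0
```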

The observed distribution approximately follows the behaviour of the Weibull distribution, which often models processes with a variety of failure modes having independent failure times (the weakest-link model). Figure 4 shows the similarity of the distributions of tasks vs. recovery cost for all three data reprocessing campaigns.

Table 1. Cost of recovery from transient failures for the reconstruction jobs.

Reprocessing campaign   Input data volume (PB)   CPU time used for reconstruction (10^6 h)   Fraction of CPU time used for recovery (%)
2010                    1                        2.6                                         6.0
2011                    1                        3.1                                         4.2
2012                    2                        14.6                                        5.6

Figure 3. Distribution of tasks vs. CPU-hours used to recover from transient job failures follows a multi-mode Weibull distribution [7].

Figure 4. Distribution of the number of tasks per petabyte vs. log10(CPU-hours) used to recover from transient job failures follows log-normal behaviour.
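A standard way to test Weibull behaviour, used to produce Figure 5 below, is to plot ln(-ln(1 - ecdf)) against the logarithm of the data, which is linear for an exact Weibull distribution. The sketch below is illustrative only, assuming `costs` holds the per-task recovery CPU-hours; it is not the code used for the paper's figures.

```python
# Sketch of the Weibull linearisation behind Figure 5, assuming `costs` holds
# the per-task recovery CPU-hours. For a Weibull CDF
# F(x) = 1 - exp(-(x/lam)**k), we have ln(-ln(1 - F(x))) = k*ln(x) - k*ln(lam),
# so the points fall on a straight line if the data are Weibull-distributed.
import numpy as np

def weibull_plot_coordinates(costs):
    x = np.sort(np.asarray(costs, dtype=float))
    x = x[x > 0]                                  # log requires positive values
    n = len(x)
    ecdf = (np.arange(1, n + 1) - 0.5) / n        # midpoint plotting positions avoid ecdf == 1
    return np.log10(x), np.log(-np.log(1.0 - ecdf))
```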

As the latest campaign provided additional data for analysis, Figure 5 shows the deviation of the empirical cumulative distribution function (ecdf) from the straight line corresponding to the case of a simple Weibull distribution. Indeed, the symmetry of the Figure 4 distribution indicates log-normal behaviour rather than a (skewed) Weibull distribution. Table 2 compares the results of maximum-likelihood fits to the data using the two-parameter log-normal (1) and Weibull (2) distributions:

f(x) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right)    (1)

f(x) = \frac{k}{\lambda} \left( \frac{x}{\lambda} \right)^{k-1} \exp\!\left( -\left( \frac{x}{\lambda} \right)^{k} \right)    (2)

Figure 5. Weibull test, plotting ln(-ln(1 - ecdf)) vs. log10(CPU-hours), shows deviation from the straight line.

The Kolmogorov-Smirnov (K-S) test values show that the log-normal fit describes the data better.

Table 2. Comparison of log-normal vs. Weibull fits to data.

Reprocessing   Log-normal fit                       Weibull fit
campaign       μ           σ           K-S test     k            λ           K-S test
2010           3.29±0.09   2.63±0.06   0.07         0.41±0.01    96.6±8.3    0.05
2011           2.77±0.09   2.50±0.07   0.05         0.43±0.01    54.5±4.9    0.10
2012           2.92±0.07   2.90±0.05   0.04         0.35±0.01    79.4±6.0    0.08
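For reference, a fit of this kind can be reproduced with standard tools; the sketch below (illustrative, not the paper's actual analysis code) uses scipy maximum-likelihood fits with the location parameter fixed at zero, which matches the two-parameter forms (1) and (2), and reports the K-S statistic for each fit. Here `costs` is assumed to hold the per-task recovery CPU-hours of one campaign.

```python
# Minimal sketch of the fit comparison in Table 2, assuming `costs` holds the
# per-task recovery CPU-hours of one campaign. scipy's parametrisation maps to
# eqs. (1)-(2) as sigma = s, mu = ln(scale) for the log-normal, and k = c,
# lambda = scale for the Weibull (location fixed to zero in both fits).
import numpy as np
from scipy import stats

def compare_fits(costs):
    x = np.asarray(costs, dtype=float)
    x = x[x > 0]

    s, loc_ln, scale_ln = stats.lognorm.fit(x, floc=0)        # maximum-likelihood fit
    c, loc_wb, scale_wb = stats.weibull_min.fit(x, floc=0)

    ks_ln = stats.kstest(x, "lognorm", args=(s, loc_ln, scale_ln)).statistic
    ks_wb = stats.kstest(x, "weibull_min", args=(c, loc_wb, scale_wb)).statistic

    return {
        "lognormal": {"mu": np.log(scale_ln), "sigma": s, "KS": ks_ln},
        "weibull":   {"k": c, "lambda": scale_wb, "KS": ks_wb},
    }
```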

The log-normal behaviour of the distribution of tasks vs. CPU-hours used for failure recovery is unexpected. Although log-normal distributions are often observed in Reliability Engineering analyses of failures, such as the mean time to failure of Grid hardware, they are expected when some multiplicative process is present, such as multiplicative degradation [9]. As we were unable to identify a multiplicative process describing our case, where the CPU time used for recovery is added upon each job retry, we compared our data with two necessary conditions under which an additive process may result in a log-normal distribution (when the number of variables is limited) [10]:

\frac{\langle (\ln z - \langle \ln z \rangle)^3 \rangle}{\langle (\ln z - \langle \ln z \rangle)^2 \rangle^{3/2}} < \frac{\langle (z - \langle z \rangle)^3 \rangle}{\langle (z - \langle z \rangle)^2 \rangle^{3/2}}    (3)

\frac{\langle (\ln z - \langle \ln z \rangle)^4 \rangle - 3 \langle (\ln z - \langle \ln z \rangle)^2 \rangle^2}{\langle (\ln z - \langle \ln z \rangle)^2 \rangle^2} < \frac{\langle (z - \langle z \rangle)^4 \rangle - 3 \langle (z - \langle z \rangle)^2 \rangle^2}{\langle (z - \langle z \rangle)^2 \rangle^2}    (4)

Here z is the random variable (in our case the CPU-hours used to recover from transient job failures) and the angle brackets denote the mean. Table 3 shows that conditions (3) and (4) are met in our case, since for all reprocessing campaigns the right-side expressions are much greater than the left-side expressions.

Table 3. Conditions under which the additive process may result in the log-normal distribution.

Reprocessing campaign   Left side of (3)   Right side of (3)   Left side of (4)   Right side of (4)
2010                    0.27               7.90                0.23               79.6
2011                    0.26               7.17                0.31               74.0
2012                    0.06               10.1                0.46               153.5
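Under one plausible reading of eqs. (3)-(4), the left and right sides are the skewness and excess kurtosis of ln z and of z, respectively; the sketch below computes the four quantities of Table 3 for a sample of per-task recovery CPU-hours. It is an illustrative reconstruction, not the original analysis code.

```python
# Illustrative reconstruction of the Table 3 quantities: eqs. (3)-(4) compare
# the third and fourth standardized central moments (skewness and excess
# kurtosis) of ln(z) with those of z, where z is the per-task recovery
# CPU-hours. Assumes z is a 1-D array of positive values.
import numpy as np
from scipy import stats

def table3_quantities(z):
    z = np.asarray(z, dtype=float)
    z = z[z > 0]                      # ln(z) requires positive values
    lnz = np.log(z)
    return {
        "left of (3)":  stats.skew(lnz),       # skewness of ln z
        "right of (3)": stats.skew(z),         # skewness of z
        "left of (4)":  stats.kurtosis(lnz),   # excess kurtosis of ln z
        "right of (4)": stats.kurtosis(z),     # excess kurtosis of z (Fisher definition)
    }
```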

3.2. Reprocessing Duration
Since jobs generally complete or fail without producing side effects, the simplest recovery mechanism is the automatic retry of failed jobs. Unfortunately, the automatic approach results in an unpredictable delay in task completion time, dominated by repeated retries. Transient job failures and re-tries therefore delay the reprocessing. Optimization of fault-tolerance techniques to speed up the completion of thousands of interconnected tasks on the Grid is an active area of Reliability Engineering research.

Figure 6. Number of jobs concurrently running on the WLCG clouds during the ATLAS petascale data processing campaigns in November-December 2010, August-September 2011 and November 2012.

3.3. Results
During LHC data taking, Reliability Engineering analysis became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. To deliver results for the 2012 Moriond Conference, the processing of 2 PB of ATLAS data (which required twice as many CPU-hours per petabyte) was completed in four weeks, achieving a four-fold improvement in data processing power in comparison to the 2011 data processing (Figure 6). Earlier, optimization of the reprocessing workflow and other improvements cut the delays and halved the duration of petascale data processing on the Grid, from almost two months in 2010 to less than four weeks in 2011 [5].

4. Conclusions
During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid. Reliability Engineering analysis supported continuous improvements in data reprocessing throughput during LHC data taking: the throughput doubled in the 2011 vs. the 2010 reprocessing, then quadrupled in the 2012 vs. the 2011 reprocessing. Scheduled LHC upgrades will increase the data-taking rates tenfold, increasing the demands on computing resources. A comprehensive end-to-end solution for the composition and execution of the reprocessing within given CPU and storage constraints is necessary to accommodate the physics needs of future LHC runs. Reliability Engineering analysis provides the foundation needed to scale up ATLAS data reprocessing on the Grid beyond the petascale.

Acknowledgments
We wish to thank all our colleagues who contributed to ATLAS data reprocessing activities. This work was funded in part by the U.S. Department of Energy, Office of Science, High Energy Physics under Contract No. DE-AC02-06CH11357.

References
[1] http://lhc.web.cern.ch
[2] The ATLAS Collaboration, Aad G et al. 2008 The ATLAS experiment at the CERN Large Hadron Collider J. Inst. 3 S08003
[3] http://wlcg.web.cern.ch
[4] The ATLAS Collaboration 2012 Letter of Intent for the Phase-II Upgrade of the ATLAS Experiment, Scientific Committee Paper CERN-LHCC-2012-022; LHCC-I-023, http://cds.cern.ch/record/1502664
[5] Vaniachine A V for the ATLAS Collaboration 2011 ATLAS detector data processing on the Grid IEEE Nuclear Science Symposium and Medical Imaging Conference pp 104-107
[6] Golubkov D et al. 2012 ATLAS Grid Data Processing: system evolution and scalability J. Phys.: Conf. Ser. 396 032049
[7] Weibull W 1951 A Statistical Distribution Function of Wide Applicability ASME Journal of Applied Mechanics pp 293-297
[8] Vaniachine A V on behalf of the ATLAS and CMS Collaborations 2013 Advancements in Big Data Processing in the ATLAS and CMS Experiments Preprint arXiv:1303.1950
[9] Kolmogorov A N 1941 On the log-normal distribution of particle sizes during break-up process Dokl. Akad. Nauk SSSR 31 99-101
[10] Mouri H 2013 Lognormal distribution from a process that is not multiplicative but is additive Phys. Rev. E 88 042124