ATLAS Offline Data Quality Monitoring

ATL-SOFT-PROC-2009-003, 24 July 2009

J. Adelman (9), M. Baak (3), N. Boelaert (6), M. D'Onofrio (1), J.A. Frost (2), C. Guyot (8), M. Hauschild (3), A. Hoecker (3), K.J.C. Leney (5), E. Lytken (6), M. Martinez-Perez (1), J. Masik (7), A.M. Nairz (3), P.U.E. Onyisi (4), S. Roe (3), S. Schaetzel (3), M.G. Wilson (3)

(1) Institut de Física d'Altes Energies, Universitat Autònoma de Barcelona, Edifici Cn, ES-08193 Bellaterra (Barcelona), Spain
(2) Cavendish Laboratory, University of Cambridge, J J Thomson Avenue, Cambridge CB3 0HE, United Kingdom
(3) CERN, CH-1211 Geneva 23, Switzerland
(4) University of Chicago, Enrico Fermi Institute, 5640 S. Ellis Avenue, Chicago, IL 60637, United States of America
(5) Oliver Lodge Laboratory, University of Liverpool, P.O. Box 147, Oxford Street, Liverpool L69 3BX, United Kingdom
(6) Lunds Universitet, Fysiska Institutionen, Box 118, SE-221 00 Lund, Sweden
(7) School of Physics and Astronomy, University of Manchester, Manchester M13 9PL, United Kingdom
(8) CEA, DSM/DAPNIA, Centre d'Etudes de Saclay, FR-91191 Gif-sur-Yvette, France
(9) Yale University, Department of Physics, PO Box 208121, New Haven, CT 06520-8121, United States of America
Now at SLAC National Accelerator Laboratory, Stanford, California 94309, United States of America

E-mail: ponyisi@hep.uchicago.edu

Abstract. The ATLAS experiment at the Large Hadron Collider reads out 100 million electronic channels at a rate of 200 Hz. Before the data are shipped to storage and analysis centres across the world, they have to be checked to be free from irregularities which would render them scientifically useless. Offline data quality monitoring provides prompt feedback from the full first-pass event reconstruction at the Tier-0 computing centre and can unveil problems in the detector hardware and in the data processing chain. Detector information and reconstructed proton-proton collision event characteristics are distilled into a few key histograms and numbers which are automatically compared with a reference. The results of the comparisons are saved as status flags in a database and are published together with the histograms on a web server. They are inspected by a 24/7 shift crew who can notify on-call experts in case of problems and, in extreme cases, signal that data taking should be aborted.

1. Introduction

Monitoring of all aspects of an experiment's operation is important for its success. Examination of the data taken by the detector can reveal irregularities which may need to be corrected later or which may require immediate action by the experimentalists. Such monitoring should be comprehensive while providing feedback as early as possible.

This paper describes the offline data quality monitoring system of the ATLAS experiment. It provides feedback from fully reconstructed events processed at the Tier-0 farm, with histograms becoming available generally less than an hour after the start of an experimental run. A sophisticated suite of automatic checks is run on a subset of the histograms produced from the data and identifies problematic distributions. The histograms and the results of the automatic checks are stored for future access. Summary flags are stored in a database and are used in physics analyses to determine the detector status.

Figure 1. The flow of monitoring and data quality information in ATLAS. Here we are concerned with the offline component.

It should be noted that ATLAS has additional data quality monitoring systems, which are described in other contributions to these proceedings and are depicted in Figure 1. The online monitoring system provides near-real-time feedback on quantities that are available given the limited computing power of the online cluster. An additional system checks the records of the detector configuration system to determine whether the parameters are acceptable for good-quality data (fraction of active channels, high voltage, etc.). Beyond these, various ATLAS subdetectors have specific monitoring systems that are outside the present scope. The flow of information in the offline data quality monitoring system is depicted in Figure 2.

2. Data Processing and Histogram Production at Tier-0

The primary application of the offline data quality monitoring is to give quick feedback on the prompt reconstruction of ATLAS data at the Tier-0 computer farm at CERN. The ATLAS computing model [1] foresees the first full reconstruction of all data occurring at Tier-0. This must be completed before the relevant data are migrated to tape to make room for incoming files, a period on the order of several days. In this time, calibrations and alignment have to be verified by rapidly reconstructing a smaller express subset of the data. The first express reconstruction occurs as soon as data for a run are available and the necessary detector conditions information has been replicated to the Tier-0, generally within an hour of the start of a run. Additional reconstruction iterations occur as necessary until the calibrations are signed off for full reconstruction. Data quality monitoring is run during each of these passes to ensure that the initial data are satisfactory (first pass) or that the calibrations are valid (second and higher passes).

Every 15 minutes, and at the end of reconstruction, the available files from different computing nodes are merged to sum up the statistics. During the merging process, additional post-processing code can be run on the produced histograms. The histogram files produced by the end-of-run merging are registered as a Grid dataset for world-wide access.

Automatic data quality checks at ATLAS are based primarily on histograms, whose size and memory requirements scale well with the number of events. In production, the histograms are produced by tools that run in the ATLAS software framework Athena. These tools are active during prompt reconstruction and can also be used on already-reconstructed files, although transient information available during reconstruction cannot be monitored in the latter case. Various tools are implemented that monitor information ranging from readout errors to reconstructed physics quantities such as dimuon masses for J/ψ and Z reconstruction.

User monitoring code is controlled by a manager class. Histograms are registered with the manager, which handles their placement in an output ROOT file [3] and metadata such as the algorithm to use when merging histograms across files. Histograms can be declared with time granularities shorter than a run, and the manager takes care of the necessary bookkeeping. The manager also implements the trigger awareness facility, which specifies that user code should only be invoked when certain triggers or sets of triggers have fired.
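
The periodic merging of partial histogram files lends itself to a short illustration. The following PyROOT sketch is only a rough stand-in for the Tier-0 merging jobs described above; the file names are hypothetical, and the real system additionally honours the per-histogram merge metadata registered with the manager class.

    # Rough stand-in for the periodic merging step: sum partial monitoring
    # histogram files from worker nodes into one run-level file.
    # File names are hypothetical; the real merging is steered by Tier-0 jobs.
    import glob
    import ROOT

    merger = ROOT.TFileMerger(False)                    # do not copy inputs locally first
    merger.OutputFile("run_monitoring_merged.root", "RECREATE")
    for fname in glob.glob("monitoring_node*.root"):    # partial files from worker nodes
        merger.AddFile(fname)
    if not merger.Merge():                              # default: add histograms bin by bin
        raise RuntimeError("histogram merging failed")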

Figure 2. Information flow in ATLAS offline data quality monitoring. The programs han (histogram analyser) and handi (web display creator) are discussed further in Sections 3 and 4.

3. Data Quality Monitoring Framework

Both the ATLAS offline and online automatic data quality monitoring use a common core software infrastructure, the Data Quality Monitoring Framework (DQMF) [2]. The underlying concept is a data quality tree, in which the leaf nodes are specific histogram checks and the interior nodes combine the results of their daughter nodes; the data quality results propagate up towards the root of the tree. The four currently implemented data quality flags are green, yellow, red, and undetermined (the last is used if, for example, the statistics are too poor for a check to be performed). In the future an additional "detector disabled" flag is foreseen.

For the leaf nodes, input histograms are specified along with check algorithms, thresholds, and reference histograms. Check algorithms range in complexity from simple tests that histograms are empty (e.g. for readout errors) to χ² tests and shape comparisons. Check algorithms can also publish extra information on the histograms in addition to the overall flags. Interior nodes combine the information of their daughters using summary algorithms, which perform various logical combinations of the flags.

Reference [2] primarily discusses the online realization of the system. The offline variant differs in the input of histograms, the output of results, and the configuration format, driven by the different requirements of its environment. The offline latency and speed requirements are much more relaxed: the checks are run at most once every few minutes instead of continuously, as they are online. Therefore, instead of an always-on, distributed application as used online, the offline implementation is a non-networked, standalone program (Histogram Analyzer, or han) that runs as part of a Tier-0 job. The input to han is ROOT files, and the configurations are stored as single binary files that include all necessary information (such as reference histograms). Conversely, the offline environment places stricter requirements on the provenance and preservation of the results. As a consequence, the output of the han application includes all information that would be necessary to reproduce the checks (the input histograms, the configuration, and any references) in addition to the flags and additional algorithm results. This output is archived for future reference.
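
To make the tree of checks more concrete, the sketch below shows the flavour of the logic involved, without using the actual DQMF classes or configuration format: a leaf-node-style χ² comparison of a histogram against a reference, with hypothetical thresholds, and a simple worst-case summary of daughter flags for an interior node.

    # Illustration only: the flavour of a DQMF leaf check and an interior-node
    # summary, not the DQMF API. Thresholds and the worst-case combination are
    # hypothetical examples of possible configurations.
    import ROOT

    FLAGS = ["Undetermined", "Red", "Yellow", "Green"]   # ordered worst to best

    def chi2_check(hist, reference, green_thr=0.05, red_thr=0.001, min_entries=100):
        """Leaf-node-style check: compare a histogram to a reference histogram."""
        if hist.GetEntries() < min_entries:
            return "Undetermined"                        # too little data to judge
        pvalue = hist.Chi2Test(reference, "UU")          # chi-squared test p-value
        if pvalue > green_thr:
            return "Green"
        if pvalue > red_thr:
            return "Yellow"
        return "Red"

    def worst_case_summary(daughter_flags):
        """Interior-node-style summary: propagate the worst daughter flag upwards."""
        return min(daughter_flags, key=FLAGS.index)

    # Example: combine two leaf results towards the root of the tree.
    # root_flag = worst_case_summary([chi2_check(h1, ref1), chi2_check(h2, ref2)])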

4. Presentation

4.1. Web Display

Once the results are available in the han output ROOT file, they must be made available to experimental shifters and other users. The primary means of visualizing the results is the han Display (handi) application. It reads the ROOT files and produces a set of static HTML pages with the flags, the algorithm results, and images of the histograms and their associated references. The handi configuration (specifically, the histogram display options and any additional information such as histogram descriptions and the actions to take if checks fail) is encoded as options in the han configuration file and propagated to handi via the han output file. A sample display from a mock data exercise is shown in Figure 3.

Figure 3. Web display of the dimuon mass spectrum in the J/ψ region, created during a mock data exercise.

4.2. Conditions Database

Top-level summary information, at the level of subdetector units, is stored in the ATLAS conditions database [1], which is implemented using the LCG COOL technology [4]. The checks are mapped to COOL channels and stored with intervals of validity set by the time granularity of the respective check. This information can then be accessed via standard ATLAS tools, including a web interface (shown in Figure 4).

Figure 4. Web interface to data quality information stored in the ATLAS conditions database.

4.3. Web Service

The information contained in the han output files is of general long-term interest; for example, it contains run-by-run information on residual means and widths. This information should be made available without requiring the scraping of the handi output HTML pages; in addition, it should preferably be independent of the exact storage format, allowing internal changes to the han ROOT output format or a switch to an entirely different backing technology, such as a relational database. To address these concerns, an XML-RPC web service is in the process of being deployed. It allows distributed, language-neutral use of the results of the DQMF checks, and has been used to create plots of the history of any monitored quantity. Currently it is backed by SQLite databases that are updated when han output files are archived; this provides an order-of-magnitude speed improvement over direct access to the han files.
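
A minimal sketch of such a history service is given below, using Python's standard XML-RPC server backed by SQLite. The table schema, method name, and port are assumptions made for illustration; they do not describe the deployed ATLAS service.

    # Minimal sketch of an XML-RPC history service backed by SQLite, in the
    # spirit of the service described above. The database schema, method name,
    # and port are assumptions for illustration only.
    import sqlite3
    from xmlrpc.server import SimpleXMLRPCServer

    DB_FILE = "dq_results.db"    # hypothetical archive filled when han outputs are archived

    def get_history(quantity, first_run, last_run):
        """Return [(run, value), ...] for a monitored quantity over a run range."""
        with sqlite3.connect(DB_FILE) as conn:
            cursor = conn.execute(
                "SELECT run, value FROM dq_results "
                "WHERE quantity = ? AND run BETWEEN ? AND ? ORDER BY run",
                (quantity, first_run, last_run))
            return cursor.fetchall()

    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(get_history, "get_history")
    server.serve_forever()       # clients in any language can now query check histories

A client in any language can then retrieve, for example, the run-by-run history of a residual width with a single remote call and plot it.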

5. Additional Uses

The histogram comparison and display code is relatively lightweight and, in particular, is not tied to the Tier-0 reconstruction; it is available wherever the ATLAS codebase is deployed. This makes it useful for other purposes.

5.1. Grid Data Reprocessing

In the ATLAS computing model [1], reprocessing of data to take advantage of improved calibrations, alignment, and reconstruction algorithms takes place at Tier-1 sites. It is important to check the output of these reprocessing campaigns. The same monitoring code can be used here as in the Tier-0 processing, and the histogram file merging is handled by jobs scheduled by the ATLAS production system. In the setup used for the reprocessing of ATLAS 2008 cosmic ray data, the histogram datasets produced at the various Tier-1 sites are replicated to storage at CERN, where han and handi are run.

5.2. Monte Carlo Production Validation

The standard monitoring code can be run on the files produced by the ATLAS full Monte Carlo (MC) simulation. These files can then be compared with the results of previous MC runs, or even with the corresponding histograms from data. The technical challenges of Grid operation are very similar to those for data reprocessing.
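
As a rough illustration of how such a comparison might be scripted with plain PyROOT (not the production validation tools), the following loops over the histograms in a newly produced monitoring file and reports those that differ significantly from the corresponding histograms of a previous production; the file names and the test threshold are hypothetical, and only the top-level directory is scanned for brevity.

    # Rough illustration, not the production validation tools: compare the
    # monitoring histograms of a new MC production with those of a reference
    # production and report significant shape differences.
    import ROOT

    def compare_files(new_name, ref_name, threshold=0.01):
        new_file = ROOT.TFile.Open(new_name)
        ref_file = ROOT.TFile.Open(ref_name)
        for key in new_file.GetListOfKeys():
            obj = key.ReadObj()
            if not obj.InheritsFrom("TH1"):
                continue                                  # only compare histograms
            ref = ref_file.Get(key.GetName())
            if not ref:
                print("missing in reference:", key.GetName())
                continue
            pvalue = obj.KolmogorovTest(ref)              # shape comparison
            if pvalue < threshold:
                print("differs (p = %.3g): %s" % (pvalue, key.GetName()))

    # compare_files("mc_new_monitoring.root", "mc_previous_monitoring.root")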

5.3. Code Validation

ATLAS runs many tests of its software on a nightly basis to catch problems introduced during development. These tests include a full test of the MC production chain and the full reconstruction of a sample of cosmic ray data. In principle, any changes in the output of these tests from one build to the next should be fully understood in terms of code changes; in particular, the monitoring histograms should be identical except where they were intended to change. They can be compared between nightly builds using the data quality monitoring tools, with a histogram comparison algorithm that flags as red any difference between input and reference histograms.

6. 2008 Production Experience

The offline data quality monitoring for Tier-0 operations was deployed in anticipation of 2008 data taking and was operational at LHC startup. After the September 19 LHC incident, ATLAS continued with a combined cosmic-ray data run which exercised many aspects of the data acquisition and monitoring infrastructure, including offline monitoring. Additional features were added at user request. The most rapidly evolving part of the system is the check configuration, which changes as experience is gained with the ATLAS subdetectors and the parameters of normal operation are mapped out.

7. Summary

The ATLAS offline data quality monitoring infrastructure has been deployed and tested for the monitoring of Tier-0 prompt reconstruction. It provides status flags and algorithm results for monitored histograms, which are made available to users through several interfaces. The core applications are available anywhere the ATLAS offline software framework is installed and can be used in any validation task where histogram checks are needed.

References
[1] Duckeck G et al. [ATLAS Collaboration] 2005 ATLAS Computing: Technical Design Report, Preprint CERN-LHCC-2005-022
[2] Corso-Radu A, Hadavand H, Hauschild M, Kehoe R and Kolos S 2008 IEEE Trans. Nucl. Sci. 55 417
[3] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81
[4] Valassi A et al., http://lcgapp.cern.ch/project/conddb/