Overview of ATLAS PanDA Workload Management

T. Maeno (1), K. De (2), T. Wenaus (1), P. Nilsson (2), G. A. Stewart (3), R. Walker (4), A. Stradling (2), J. Caballero (1), M. Potekhin (1), D. Smith (5), for the ATLAS Collaboration

(1) Brookhaven National Laboratory, NY, USA
(2) University of Texas at Arlington, TX, USA
(3) Department of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, UK
(4) Ludwig-Maximilians-Universität München, München, Germany
(5) SLAC National Accelerator Laboratory, CA, USA

tmaeno@bnl.gov

Abstract. The Production and Distributed Analysis system (PanDA) plays a key role in the ATLAS distributed computing infrastructure. All ATLAS Monte Carlo simulation and data reprocessing jobs pass through the PanDA system. We describe how PanDA manages job execution on the grid using dynamic resource estimation and data replication together with intelligent brokerage in order to meet the scaling and automation requirements of ATLAS distributed computing. PanDA is also the primary ATLAS system for processing user and group analysis jobs, bringing further requirements for quick, flexible adaptation to the rapidly evolving analysis use cases of the early data-taking phase, in addition to the high reliability, robustness and usability needed to provide efficient and transparent utilization of the grid for analysis users. We describe how PanDA meets ATLAS requirements, the evolution of the system in light of operational experience, how the system has performed during the first LHC data-taking phase, and plans for the future.

1. Introduction
The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at LHC data-processing scale. ATLAS production and analysis place challenging requirements on throughput, scalability, robustness, minimal operational manpower, and efficient integrated data/processing management. The key features of PanDA are as follows:

- Apache- and Oracle-based central task queue and management
- Use of grid and/or farm batch queues to pre-stage job wrappers (pilots) to worker nodes
- Tight integration with the ATLAS Distributed Data Management (DDM) system [1], working with datasets
- An integrated monitoring system [2] for production and analysis operations, user analysis interfaces, data access, and site status
- Support for the OSG, EGEE/EGI and NDGF middleware stacks
- A highly automated system requiring little operational manpower
- Well-developed security mechanisms [3]
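
To illustrate the central task queue and the Python/HTTP client interface described in section 2, a job submission might conceptually look like the following minimal sketch. The server URL, endpoint path and field names are invented for this sketch and are not the actual PanDA client API; grid-certificate handling is omitted.

    import json
    import urllib.parse
    import urllib.request

    # Hypothetical server URL, endpoint and field names, for illustration only.
    PANDA_SERVER = "https://pandaserver.example.org:25443/server/panda"

    def submit_jobs(job_specs):
        """POST a list of job specifications to the central task queue."""
        data = urllib.parse.urlencode({"jobs": json.dumps(job_specs)}).encode()
        with urllib.request.urlopen(PANDA_SERVER + "/submitJobs", data=data) as resp:
            return resp.read().decode()       # e.g. assigned job IDs and initial states

    job = {
        "taskID": 12345,                      # illustrative field names only
        "transformation": "Reco_tf.py",
        "inputDataset": "mc10.simulated.sample",
        "outputDataset": "user.jdoe.test.output",
        "priority": 1000,
    }
    print(submit_jobs([job]))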

PanDA was originally developed for ATLAS and is being extended as a generic high-level workload manager usable by non-ATLAS OSG activities as well. This paper presents an overview of the PanDA system, its workload management for production and analysis, the current status, and future plans.

2. PanDA System Overview
Figure 1 shows a schematic view of the PanDA system. Jobs are submitted to the PanDA server via a simple Python/HTTP client interface. The same interface is used for ATLAS production, distributed analysis, and other general job submission. The PanDA server is the main component; it provides a task queue managing all job information centrally. The PanDA server receives jobs through the client interface into the task queue, upon which a brokerage module operates to prioritize and assign work on the basis of job type, priority, input data and its locality, and available CPU resources. AutoPyFactory pre-schedules pilots to OSG and EGEE/EGI grid sites using Condor-G. Pilots retrieve jobs from the PanDA server so that jobs run as soon as CPU slots become available. Pilots use resources efficiently: they exit immediately if no job is available, and the submission rate is regulated according to the workload. Each pilot executes a job, detects zombie processes, reports job status to the PanDA server, and recovers failed jobs. Ref. [4] describes the pilots in detail. For NDGF, the ARC Control Tower [5] retrieves jobs from the PanDA server and sends the jobs, together with pilot wrappers, to NDGF sites using ARC middleware.

Figure 1. Schematic view of the PanDA system.
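
The pilot workflow just described can be summarized by the following sketch: poll the server for an activated job, run the payload, send heartbeats, and report the final state. The endpoint names, parameters and the parse_job helper are hypothetical placeholders, not the actual pilot code [4].

    import subprocess
    import time
    import urllib.parse
    import urllib.request

    PANDA_SERVER = "https://pandaserver.example.org:25443/server/panda"  # hypothetical

    def call_server(method, **params):
        """POST a request to the PanDA server and return the raw response (illustrative)."""
        data = urllib.parse.urlencode(params).encode()
        with urllib.request.urlopen(PANDA_SERVER + "/" + method, data=data) as resp:
            return resp.read().decode()

    def parse_job(job_blob):
        """Placeholder: extract a job ID and payload command from the server response."""
        fields = dict(urllib.parse.parse_qsl(job_blob))
        return fields.get("jobID", "0"), fields.get("command", "true")

    def run_pilot(site_name):
        # Ask the central task queue for one activated job matching this site.
        job = call_server("getJob", siteName=site_name)
        if not job:
            return                               # exit immediately if no job is available
        job_id, command = parse_job(job)
        call_server("updateJob", jobId=job_id, state="running")
        proc = subprocess.Popen(command, shell=True)
        while proc.poll() is None:               # heartbeat while the payload runs
            time.sleep(300)
            call_server("updateJob", jobId=job_id, state="running")
        state = "finished" if proc.returncode == 0 else "failed"
        call_server("updateJob", jobId=job_id, state=state, exitCode=proc.returncode)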

2.1. Workload Management for Central and Group Production
The purpose of central and group production is to produce data for the ATLAS collaboration as a whole and/or for a particular physics working group. Production tasks are centrally defined on the basis of physics needs and resource allocation in ATLAS. ATLAS has one Tier-0 site, ten Tier-1 sites and many Tier-2 sites. A cloud is composed of one Tier-1 site and its associated Tier-2 sites, or, at CERN, contains only the Tier-0 site. Tasks are assigned to clouds based on the amount of disk space available on the Tier-0/1 storage element (SE), the locality of input data, available CPU resources, the Memorandum of Understanding (MoU) share which specifies the contribution expected from each participating institute, and downtime of the Tier-0/1. The amount of queued work with equal or higher priority is also taken into account, so that high-priority tasks can jump ahead of low-priority tasks even if the cloud is already occupied by low-priority tasks. Tasks are automatically converted into many jobs for parallel execution. Once a task is assigned to a cloud, jobs are assigned to sites in the cloud based on the CPU and I/O usage pattern of the jobs, software availability, the amount of disk space available on the local SE, the sizes of the scratch disk and memory on worker nodes, site downtime, the ratio of running to queued jobs, and the number of input files already present on the local SE.

Assignment of jobs to sites is followed by the dispatch of input files to those sites. The system has a pipeline structure running data transfer and job execution in parallel. The ATLAS Distributed Data Management (DDM) system [1] takes care of the actual data transfer: the PanDA server sends requests to DDM to dispatch input files to Tier-2s or to pre-stage input files from tape storage at the Tier-0/1. DDM transfers or pre-stages the files and then sends notifications to the PanDA server. Jobs wait in the task queue after they are submitted to PanDA and are activated when all their input files are ready. Pilots pick up activated jobs and can run them immediately, since DDM has already staged the input files. Similarly, pilots copy output files to local storage and exit immediately to release CPUs. The PanDA server then sends requests to DDM to aggregate the output files to the Tier-1 SE over the grid. That is, pilots occupy CPUs only while running jobs and accessing the local SE.
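
The site brokerage just described combines several heterogeneous criteria. A minimal sketch of how such a weighted choice might be expressed is shown below; the attributes, hard cuts and weighting formula are invented for illustration and do not reproduce the actual PanDA brokerage code.

    from dataclasses import dataclass

    @dataclass
    class SiteStatus:
        name: str
        free_disk_gb: float      # space on the local SE
        running_jobs: int
        queued_jobs: int
        input_files_on_se: int   # input files already present locally
        has_release: bool        # required software available
        in_downtime: bool

    def site_weight(site: SiteStatus, total_input_files: int) -> float:
        """Higher weight = better candidate for the next job (illustrative formula)."""
        if site.in_downtime or not site.has_release or site.free_disk_gb < 10:
            return 0.0                                   # illustrative hard cuts
        # Prefer sites with spare capacity (few queued jobs per running job) ...
        load = site.queued_jobs / max(site.running_jobs, 1)
        # ... and with most of the input data already on the local SE.
        locality = site.input_files_on_se / max(total_input_files, 1)
        return (1.0 + locality) / (1.0 + load)

    def choose_site(sites, total_input_files):
        return max(sites, key=lambda s: site_weight(s, total_input_files))

    sites = [SiteStatus("SITE_A", 500, 100, 300, 8, True, False),
             SiteStatus("SITE_B", 2000, 50, 10, 2, True, False)]
    print(choose_site(sites, total_input_files=10).name)
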
2.2. Workload Management for Analysis
PanDA is designed for both production and analysis. This means that production and analysis use the same software, monitoring system and facilities, so no duplicated manpower is required for maintenance. However, one of the most significant differences is the policy for data transfers: DDM transfers files over the grid for production, while analysis does not trigger file transfers, since analysis jobs are sent only to sites where the input files are already available. This is mainly because analysis jobs are typically I/O intensive and run on many files. Priorities and quotas are calculated per user, or per working group if the user submits jobs on behalf of a particular working group. Working groups can be defined in VOMS or in the PanDA database. In addition to official resources, each site can provide regional CPU/storage resources using budgets beyond its ATLAS MoU share, and each site, country or cloud can decide how those resources are used. A typical usage policy is to run jobs for users in a particular country group when their jobs are queued, and otherwise to run other users' jobs.

Figure 2 shows a schematic view of the workflow for analysis jobs. The user submits a user task (job set) that is split into multiple job subsets according to the localities of the input datasets and the available CPU resources at sites. A job subset is sent to a site where the input files are available, i.e. if the input files are distributed over multiple sites there will be multiple job subsets, sent to multiple sites. Each job subset is composed of one compile job and many execution jobs. The compile job receives source files from the user through the PanDA server and produces binary files. The execution jobs are triggered once the binary files have been put into the local storage element (SE) by the compile job. The execution jobs retrieve the input files and the binary files from the local SE, run in parallel, and produce and add output files to an output dataset. Output datasets are added to an output dataset container, and users can access the dataset container using standard DDM tools/services [6].
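
As an illustration of the splitting just described, the following sketch groups a user task's input files by the sites that host them and emits one compile job plus execution jobs per site. The data structures and names are hypothetical, not the actual PanDA implementation.

    from collections import defaultdict

    def split_task(input_files, file_locations, files_per_job=5):
        """input_files: list of file names; file_locations: dict file -> hosting site."""
        files_by_site = defaultdict(list)
        for f in input_files:
            files_by_site[file_locations[f]].append(f)   # send work to where the data is

        job_subsets = []
        for site, files in files_by_site.items():
            subset = {"site": site, "jobs": [{"type": "compile"}]}   # one compile job first
            for i in range(0, len(files), files_per_job):
                subset["jobs"].append({"type": "execute",
                                       "inputs": files[i:i + files_per_job]})
            job_subsets.append(subset)
        return job_subsets

    files = ["f1", "f2", "f3"]
    print(split_task(files, {"f1": "SITE_A", "f2": "SITE_A", "f3": "SITE_B"}, files_per_job=2))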

As mentioned, analysis jobs go to sites where the input data is available. This means that the job distribution depends intimately on the data distribution, which follows the ATLAS computing model. It is difficult to distribute popular data uniformly, since the usage pattern of data is highly unpredictable. The policy-based data distribution therefore causes an unbalanced job distribution and also creates redundant replicas that are never actually used; that is, CPU and storage resources are not used very efficiently. To solve these problems, PanDA Dynamic Data Placement (PD2P) has been developed. The idea is that PanDA triggers replication of input data based on the usage pattern of analysis jobs. The fundamental strategy of PD2P is to make many replicas of popular data and to send data to idle sites. Stuck jobs are periodically reassigned to other sites, since PD2P may have replicated their datasets to idle sites while the jobs were waiting in the queue.

Figure 2. Workflow for analysis jobs.
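
A minimal sketch of the kind of usage-driven replication decision PD2P makes is shown below; the popularity threshold, names and load metric are illustrative assumptions, not the real PD2P algorithm.

    # Illustrative PD2P-style decision: replicate popular datasets to idle sites.
    def pd2p_decision(dataset, usage_count, replicas, site_loads, popularity_threshold=3):
        """Return the site to which a new replica of `dataset` should be sent, or None.

        usage_count: number of recent analysis jobs requesting the dataset
        replicas:    set of sites that already hold the dataset
        site_loads:  dict site -> queued/running ratio (lower = more idle)
        """
        if usage_count < popularity_threshold:
            return None                       # not popular enough to justify a new replica
        candidates = {s: load for s, load in site_loads.items() if s not in replicas}
        if not candidates:
            return None                       # dataset is already everywhere
        return min(candidates, key=candidates.get)   # send the data to the idlest site

    # Example: a popular dataset held only at SITE_A is replicated to the idlest other site.
    print(pd2p_decision("mc10.sample.AOD", usage_count=7,
                        replicas={"SITE_A"},
                        site_loads={"SITE_A": 2.0, "SITE_B": 0.1, "SITE_C": 0.8}))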

3. Current Status and Future Plans
Figure 3 shows the number of production jobs running concurrently in 2009 and 2010. The maximum number of running jobs decreased from 50k to 40k. This is because CPU resources have been shared with other VOs, which have become more active, and because ATLAS analysis activity has increased by a factor of 10. There are large fluctuations in 2010 due to a lack of defined tasks: with the start of LHC data taking, the ATLAS software was updated frequently, and once a new version was released it had to be validated before bulk production, which caused a temporary shortage of tasks.

The single-job failure rate of production jobs was about 9% in 2010, with 3.6% due to ATLAS software problems and 5.4% due to site problems. Although site failures are still significant, automatic retries help in many cases, so these failures are hidden from the task requester.

Figure 4 shows the number of analysis jobs running concurrently in 2009 and 2010, a factor-of-10 rise compared with one year earlier. Analysis activity on PanDA has increased steadily, which clearly shows the usefulness of PanDA for physics analysis. The single-job failure rate of analysis jobs improved to 21% in 2010, down from 36% in 2009, thanks to good user support by the ATLAS distributed analysis support team. The major causes of failure are software problems, since end users often submit problematic jobs.

Figure 3. The number of production jobs concurrently running in 2009-2010.
Figure 4. The number of analysis jobs concurrently running in 2009-2010.
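
The automatic retry behaviour mentioned above can be pictured with the following sketch, which resubmits jobs that fail for site-related reasons up to a retry limit; the error categories and the limit are illustrative assumptions, not PanDA's actual retry logic.

    # Illustrative automatic-retry sketch; error categories and retry limit are assumptions.
    MAX_RETRIES = 3
    SITE_ERRORS = {"stage-in failure", "lost heartbeat", "batch system error"}

    def handle_failed_job(job):
        """Decide whether a failed job is retried transparently or reported to the requester."""
        if job["error"] in SITE_ERRORS and job["attempt"] < MAX_RETRIES:
            job["attempt"] += 1
            job["state"] = "activated"        # back in the queue; brokerage may pick
            return "retried"                  # another site, hiding the failure from the user
        job["state"] = "failed"
        return "reported"

    print(handle_failed_job({"error": "lost heartbeat", "attempt": 1, "state": "failed"}))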

As described above, the PanDA system is in good shape, but development is not complete and there are still many improvements to come. The changes should, however, be incremental and feedback-driven to avoid disruption to ongoing activities. First, more flexibility should be added for group production. Second, PD2P should be enhanced for Tier-1 to Tier-1 data distribution. Third, a combination of remote I/O and a local caching mechanism should be evaluated for analysis jobs. Finally, output file merging should be implemented to solve an outstanding problem with the transfer of small files produced by analysis jobs.

4. Conclusions
The PanDA system has been developed for ATLAS and is being extended as a generic high-level workload manager usable by the wider grid community. PanDA is performing very well for both production and analysis in ATLAS, producing high-volume Monte Carlo samples and making huge computing resources available for individual analysis. There are still many new challenges to come, but no big-bang migration is expected and the changes should be incremental and transparent to users.

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

References
[1] Garonne V., Status, news and update of the ATLAS Distributed Data Management software project: DQ2, in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)
[2] Potekhin M., The ATLAS PanDA Monitoring System and its Evolution, in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)
[3] Caballero J., Improving Security in the ATLAS PanDA System, in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)
[4] Nilsson P., The ATLAS PanDA Pilot in Operation, in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)
[5] Filipcic A., arcControlTower, the system for ATLAS production and analysis on ARC, in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)
[6] Klimentov A., ATLAS Data Transfer Request Package (DaTRI), in Proc. of the 18th Int. Conf. on Computing in High Energy and Nuclear Physics (CHEP2010)