The PanDA System in the ATLAS Experiment

Paul Nilsson 1,a, Jose Caballero b, Kaushik De a, Tadashi Maeno b, Maxim Potekhin b, Torre Wenaus b, on behalf of the ATLAS collaboration

a University of Texas at Arlington, Science Hall, PO Box 19059, Arlington, TX 76019, United States
b Brookhaven National Laboratory, PO Box 5000, Upton, NY 11973, United States

The PanDA system was developed by US ATLAS to meet the requirements for high-throughput production and distributed analysis processing for the ATLAS Experiment at CERN. The system provides an integrated service architecture with late binding of jobs, maximal automation through layered services, tight binding with the ATLAS Distributed Data Management system, and advanced job recovery and error discovery functions, among other features. This paper presents the components and performance of the PanDA system, which has been in production in the US since early 2006 and ATLAS-wide since the end of 2007.

XII Advanced Computing and Analysis Techniques in Physics Research, Erice, Italy, 3-7 November 2008

1 Corresponding author, e-mail: Paul.Nilsson@cern.ch

Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it

1. Introduction

PanDA is the Production and Distributed Analysis system for the ATLAS experiment [1, 2]. It was designed to meet the ATLAS requirements for a data-driven workload management system capable of operating at the LHC data processing scale. PanDA has a single task queue and uses pilots to execute jobs. The pilots retrieve jobs from an Apache-based central server as soon as CPUs become available, leading to low latencies that are especially useful for user analysis, which has a chaotic workload pattern. The PanDA system is highly automated and therefore requires little maintenance and operations manpower, aided by an integrated monitoring system.

1.1 A brief history of PanDA

The development of PanDA started in the summer of 2005. Initially developed for US-based production and analysis, the software infrastructure was deployed in late 2005. Since September 2006 PanDA has also been a principal component of the US Open Science Grid program in just-in-time (pilot-based) workload management. In October 2007 it was adopted by the ATLAS collaboration as the sole system for distributed production across the collaboration. PanDA has been successfully deployed at all WLCG sites and is currently being integrated with NorduGrid.

2. Workflow

Jobs are submitted to PanDA via a simple Python client with which users define job sets and their associated datasets. The job specifications are sent to the PanDA server over secure HTTPS, authenticated with a grid certificate proxy. The client interface has been used to implement the PanDA front ends for ATLAS production, distributed analysis and US regional production. The PanDA server receives work from these front ends and places it in a global job queue, upon which the brokerage module operates to assign and prioritize work on the basis of job type, priority, input data and its locality, available CPU resources and other criteria.

The allocation of job sets to sites is followed by the dispatch of the corresponding input datasets, which is handled by a data service interacting with the ATLAS Distributed Data Management (DDM) system. Pre-placement of data is a strict precondition for job execution. For analysis jobs there is no dispatch of data; the data must pre-exist locally. Jobs are not released for processing until their input data are available at the site. When the data dispatch has finished, jobs are made available to a job dispatcher from which the pilot jobs can download them.

The pilots are delivered to the worker nodes by an independent subsystem via a number of different scheduling systems. When a pilot has been launched on a worker node, it reports the available memory and disk space to the job dispatcher and asks for an available job that is appropriate for the worker node. If there is no job available at the moment, it waits for a minute and then asks again. If there is still no job available, the pilot will exit.
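The request-and-retry behaviour just described can be sketched as follows. This is an illustrative outline only, not actual pilot code: the dispatcher URL, the parameter names and the response format are assumptions, and the requests library stands in for whatever HTTPS tooling the pilot really uses.

```python
import time

import requests  # assumed HTTPS client; the real pilot uses its own tooling

# Hypothetical dispatcher endpoint and parameter names, for illustration only.
DISPATCHER_URL = "https://pandaserver.example.org/server/panda/getJob"

def request_job(site, mem_mb, disk_mb, proxy_path):
    """Report the worker node's resources and ask the dispatcher for a matching job."""
    resp = requests.post(
        DISPATCHER_URL,
        data={"siteName": site, "mem": mem_mb, "diskSpace": disk_mb},
        cert=proxy_path,  # grid certificate proxy authenticates the request
    )
    resp.raise_for_status()
    payload = resp.json()  # response format is an assumption
    return payload if payload.get("PandaID") else None

def poll_for_job(site, mem_mb, disk_mb, proxy_path, retry_wait=60):
    """Ask once, wait a minute, ask again, then give up, as described above."""
    for attempt in range(2):
        job = request_job(site, mem_mb, disk_mb, proxy_path)
        if job is not None:
            return job
        if attempt == 0:
            time.sleep(retry_wait)
    return None  # the pilot exits when no suitable job is available
```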

For interactive analysis it is desirable to have minimal latency from job submission to execution, so it is of particular importance that the pilot dispatch mechanism bypasses any latency in the scheduling system used for submitting and launching the pilot. Pilots carry a generic production grid proxy with an additional pilot VOMS attribute indicating a pilot job. Analysis pilots may use glexec [3] to switch their identity to that of the job submitter on the worker node. The PanDA system is illustrated in Fig. 1.

Fig. 1. The PanDA system.

3. PanDA components

3.1 Core components

The following constitute the core components of the PanDA system:

PanDA server. Central hub composed of several components that make up the core of PanDA (task buffer, job dispatcher, etc.).
PanDA DB. Oracle database for PanDA.
PanDA client. PanDA job submission and interaction client.
Pilot. Execution environment for PanDA jobs. Pilots request and receive job payloads from the dispatcher, perform setup and run the jobs themselves, while regularly reporting the job status to PanDA during execution.
AutoPilot. Pilot submission, management and monitoring system.
SchedConfig. Database table used to configure resources.
Monitor. Web-based monitoring and browsing system that provides an interface to PanDA for operators and users.
Logger. Logging system allowing PanDA components to log incidents in a database via the standard Python logging module.
Bamboo. Interface between PanDA and the ATLAS production database.

Sections 3.2 to 3.4 describe a selection of these components in more detail.

3.2 PanDA server

The PanDA server is implemented as a stateless multi-process REST web service with Apache mod_python and an Oracle back-end. The interaction with the clients is done via HTTP for passive read operations and HTTPS for active operations such as job submission and pilot interaction. Secure HTTPS is authenticated using grid certificates with Apache mod_gridsite. The server consists of the following components:

Task buffer. PanDA job queue manager which keeps track of all active jobs in the system.
Brokerage. Matching of job attributes with site and pilot attributes, i.e. a job will only be sent to a pilot that reports a compatible site label. It manages the dispatch of input data to processing sites, and implements PanDA's data pre-placement requirement (a minimal matching sketch follows this list).
Job dispatcher. Receives requests for jobs from pilots and dispatches job payloads. Jobs are assigned depending on how they match the capabilities of the site and worker node, e.g. data availability, disk space, memory, etc.
Data service. Data dispatch to and retrieval from sites.
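The attribute matching performed by the brokerage and the job dispatcher can be illustrated with a small sketch. This is not PanDA's actual brokerage code; the job and site attribute names are assumptions chosen for the example.

```python
# Illustrative sketch of attribute matching between a queued job and a pilot's site report.
# Attribute names (site label, memory, disk, resident datasets) are assumptions.

def job_matches_site(job, site, pilot_report):
    """Return True if the job can run at the site reported by this pilot."""
    # A job is only sent to a pilot that reports a compatible site label.
    if job["site_label"] != pilot_report["site_label"]:
        return False
    # Data pre-placement: all input datasets must already be resident at the site.
    if not set(job["input_datasets"]) <= set(site["resident_datasets"]):
        return False
    # Worker-node capabilities: enough memory and scratch disk for the payload.
    if job["min_memory_mb"] > pilot_report["memory_mb"]:
        return False
    if job["min_disk_mb"] > pilot_report["disk_mb"]:
        return False
    return True

def pick_job(queued_jobs, site, pilot_report):
    """Pick the highest-priority queued job that matches the site and worker node."""
    candidates = [j for j in queued_jobs if job_matches_site(j, site, pilot_report)]
    return max(candidates, key=lambda j: j["priority"], default=None)
```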

3.3 AutoPilot

The AutoPilot is a simple and generic implementation of the PanDA pilot and pilot scheduler, for use in more varied environments than the ATLAS-specific production pilots and schedulers. It governs the submission of pilot wrapper scripts to target sites using Condor-G [4]. The AutoPilot uses a number of databases to manage information on queues, pilot wrappers, etc. One of these databases is the SchedConfig table, which holds information about queue configurations. For example, pilot wrapper submission to a site that experiences a problem can easily be turned off by updating its status in this database; doing so effectively removes the site from production. Many other site configurations can be updated quickly through this table as well. The AutoPilot also provides a pilot implementation that contains no ATLAS-specific content and is used by CHARMM, a protein structure application that uses the OSG VO.
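Taking a problematic site offline in this way amounts to a single update of its SchedConfig entry. The sketch below is illustrative only: the column names, the status value and the use of a generic DB-API connection (e.g. cx_Oracle) are assumptions, not the actual PanDA schema or tooling.

```python
# Illustrative sketch: take a queue offline by updating its SchedConfig status.
# Column names ("status", "siteid") and the status value are assumptions.

def set_queue_status(conn, site_id, status="offline"):
    """Update the SchedConfig row for a site; pilot submission to it then stops."""
    cur = conn.cursor()
    try:
        cur.execute(
            "UPDATE schedconfig SET status = :status WHERE siteid = :siteid",
            {"status": status, "siteid": site_id},
        )
        conn.commit()
    finally:
        cur.close()
```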

3.4 Pilot

The AutoPilot wrapper script performs a number of preliminary checks and downloads the site configuration from the SchedConfig DB before the main pilot code is downloaded from a cache and unpacked. When the PanDA pilot [5] is launched, it collects information about the worker node and sends it to the job dispatcher. If a matched job does not exist, the pilot will end. If a job does exist, the pilot forks the job wrapper and starts monitoring it. The job wrapper prepares the run time environment, performs stage-in, executes the payload, stages out the output files for a successful job, and then finishes by removing the input files (only) from the work directory.

All information and actions performed by the pilot and the job wrapper are written to a pilot log file. For both successful and failed jobs, the pilot creates a job log (tar ball) of the remaining work directory and transfers it to the local storage element (SE). All output files and logs are available from the PanDA monitor [6], which allows for fast access to production and user jobs and is especially important for debugging problems. Since the job logs will not be available in the case of transfer problems during stage-out, debugging then relies on the pilot error reported to the dispatcher (error code and a relevant extract of the payload and pilot log files) and on the batch system log, which contains the pilot log as well as all stdout messages printed after the job log was created.

A payload that is not producing any output within a certain time will be killed and failed by the pilot. The time limit can be set in the job definition. The pilot also monitors the size of the payload work directory. A job that is producing too much output is aborted by the pilot, which is better than having the batch system kill the job, since the latter is much harder to debug as it leaves no hints about what went wrong.

3.4.1 Data transfers

Since the input data has already been transferred to the site by the DDM system before the pilot and payload start, the pilot only has to copy the files from the local storage to the worker node (and vice versa for the output files). The pilot uses a copy tool that is relevant for the site and is registered in the SchedConfig DB. Special wrapper classes, known as site movers, have been written for a variety of copy tools (cp, mv, dccp, rfcp, uberftp, gridftp, lcg-cp, dq2-get/put, and locally defined copy tools). Files can normally not be opened directly on the worker node from the SE. However, for dCache, Castor and Xrootd the access protocols are supported by PanDA and the ATLAS software framework (Athena [7]), which can open and read the input files directly from the remote SE. Another practical feature is the time-out control for hanging transfers: the pilot aborts the copy after a given time. File transfers are verified by comparing either the adler32 or md5 checksums of the local and remote file.
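The copy-tool selection, transfer time-out and checksum verification can be sketched as follows. The adler32 computation is standard (zlib); the copy-command table is a simplified placeholder, not the real site-mover classes.

```python
import subprocess
import zlib

# Simplified stand-in for the site-mover registry; the real pilot selects the copy
# tool from the SchedConfig DB and wraps it in a dedicated site-mover class.
COPY_COMMANDS = {
    "cp": ["cp"],
    "dccp": ["dccp"],
    "lcg-cp": ["lcg-cp"],
}

def adler32_of(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file, as used to verify transfers."""
    value = 1  # adler32 initial value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)

def stage_in(copy_tool, source, destination, expected_adler32, timeout_s=3600):
    """Copy one input file to the worker node and verify its checksum."""
    cmd = COPY_COMMANDS[copy_tool] + [source, destination]
    # Time-out control for hanging transfers: abort the copy after a given time.
    subprocess.run(cmd, check=True, timeout=timeout_s)
    if adler32_of(destination) != expected_adler32:
        raise RuntimeError("checksum mismatch after transfer: %s" % destination)
```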
3.4.2 Job recovery

When there are service problems (contacting the PanDA server for essential job updates) or site problems (with the local file or storage system), valid data from completed jobs can be stranded on the worker nodes or even be deleted by cleanup scripts running at the site. To handle these issues, the pilot has been equipped with a recovery mechanism that attempts to salvage these otherwise lost jobs without a rerun of the original job being necessary. If a job is prevented from transferring its output to the SE, it leaves the output files, job log and a job state file on the worker node. A later pilot running on the same node checks for such stranded jobs and retries the data movement if necessary. If it fails again, it leaves the remains for yet another pilot. Pilots will try a maximum of 15 times before failing the job permanently. The PanDA server can also fail the job if the recovery process has not finished within three days.
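A stranded-job scan of this kind might look roughly like the sketch below. It is illustrative only: the job state file name, its format and the directory layout are assumptions; only the 15-attempt limit is taken from the description above.

```python
import json
import os

MAX_RECOVERY_ATTEMPTS = 15  # after this many retries the job is failed permanently

def find_stranded_jobs(scratch_root):
    """Look for job state files left behind by earlier pilots on this worker node."""
    for entry in os.scandir(scratch_root):
        state_file = os.path.join(entry.path, "jobState.json")  # assumed name/format
        if entry.is_dir() and os.path.exists(state_file):
            with open(state_file) as f:
                yield entry.path, json.load(f)

def try_recovery(workdir, state, transfer_output):
    """Retry the output transfer of one stranded job; leave it for a later pilot on failure."""
    state["attempts"] = state.get("attempts", 0) + 1
    if state["attempts"] > MAX_RECOVERY_ATTEMPTS:
        return "failed"           # give up permanently
    try:
        transfer_output(workdir)  # callback performing the stage-out retry
    except Exception:
        with open(os.path.join(workdir, "jobState.json"), "w") as f:
            json.dump(state, f)   # record the attempt count for the next pilot
        return "deferred"
    return "recovered"
```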

A later pilot trying to recover an already failed job will be notified by the server to give up and remove the remains. The job recovery mechanism, which is optional, has proved to be highly effective in waiting out service disruptions for later recovery; CPU cycles that would otherwise be lost are thus recovered.

4. Security

PanDA services use the standard GSI grid security model of authentication and authorization based on X509 grid certificates, which is implemented via OpenSSL and secure HTTPS. Interactions with the PanDA server require secure HTTPS for anything other than passive read actions. The Virtual Organization Membership Service (VOMS) attributes of the proxy are checked to ensure that the user is a member of a virtual organization that is authorized to use PanDA. The authenticated user's distinguished name is part of the metadata of a PanDA job, so the user id is known and tracked throughout PanDA operations.

Execution of production jobs and file management relies on production certificates, with the pilot carrying a special pilot VOMS role to control its rights. Analysis jobs also run under a production proxy unless glexec is employed in identity switching mode. glexec is a tool provided by the Enabling Grids for E-sciencE (EGEE) [8] project which runs the payload under the identity of the end user. Optional glexec-based identity change on the worker node to the submitter identity for user jobs is currently under testing. The end-user proxies are pre-staged in the MyProxy credential caching service [9], while the access information needed by the pilot is stored in the PanDA DB. The proper identity under which to run the job can then be extracted by glexec from the user's proxy.

An optional pilot feature is proxy hiding, where the proxy is removed from disk during the execution of the payload and only restored after the payload has finished. We have also defined a means of securing the payload against tampering in the PanDA DB, by encrypting the payload with an RSA key pair and performing decoding and validation in the pilot prior to execution. Furthermore, a token-based system is currently being implemented: the pilot will only be allowed to download a new job from the job dispatcher (and make job updates) if it presents a valid token that was passed on to the pilot from the PanDA server, where it was originally generated, through the batch system.
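One generic way to realize such a token check is sketched below. It is not the mechanism actually implemented in PanDA, just an illustration using an HMAC over a per-submission identifier with a server-side secret; all names are assumptions.

```python
import hashlib
import hmac

# Illustrative token scheme, not PanDA's actual implementation: the server derives
# a token from a per-submission identifier using a server-side secret, hands it to
# the pilot through the batch system, and later verifies it on job requests.

def issue_token(server_secret: bytes, submission_id: str) -> str:
    """Generate the token shipped to the pilot with its batch submission."""
    return hmac.new(server_secret, submission_id.encode(), hashlib.sha256).hexdigest()

def verify_token(server_secret: bytes, submission_id: str, presented: str) -> bool:
    """Check the token presented by a pilot before dispatching a job or accepting updates."""
    expected = issue_token(server_secret, submission_id)
    return hmac.compare_digest(expected, presented)
```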

5. Performance

As of March 2009, more than 18 million jobs have been processed by the PanDA system, at a rate of up to 500k production jobs per week at about 100 sites around the world. During the same period, PanDA has also served almost 900 users, with about 5 million jobs during the last six months. In February 2009 PanDA reached more than 35k concurrent jobs in the system (Fig. 2). As ATLAS data taking ramps up over the next few years, job counts are estimated to reach the order of 500k jobs per day, with the greatest increase coming from analysis.

Fig. 2. Concurrent jobs running in the PanDA system, color coded by cloud.

6. Distributed analysis

Production and analysis jobs run on the same infrastructure in PanDA, which means that they use the same software, monitoring and facilities; it also means that no duplicated manpower is needed for maintenance. The production and analysis jobs use different computing resources and thus do not have to compete with each other. The data transfer policies are also different: for production jobs, the input files are dispatched to the Tier-2s and the output files are aggregated to the Tier-1s via the DDM asynchronously, while analysis jobs do not trigger any data transfer. Analysis jobs only go to sites that hold the input files. Users can choose between two different tools to submit analysis jobs to PanDA, pathena [10] and Ganga [11].

7. Conclusion

The PanDA system provides an integrated service architecture with late binding of jobs, maximal automation through layered services, and tight binding with the ATLAS Distributed Data Management system. It has a single task queue and uses pilots to execute the jobs. The pilots retrieve the payloads from an Apache-based central server as soon as CPUs are available, which leads to low latencies that are especially useful for user analysis with its chaotic workload pattern. The PanDA system has been in production in the US since early 2006 and ATLAS-wide since the end of 2007. Production across ATLAS is running smoothly, and PanDA stands ready to provide a stable and robust service for real data taking in the ATLAS experiment.

References

[1] ATLAS Technical Proposal, CERN/LHCC/94-43 (1994).
[2] T. Maeno, PanDA: Distributed production and distributed analysis system for ATLAS, J. Phys. Conf. Ser. 119 (2008) 062036.
[3] D. Groep, O. Koeroo, G. Venekamp, glexec: gluing grid computing to the Unix world, J. Phys. Conf. Ser. 119 (2008) 062032.
[4] Condor Project, http://www.cs.wisc.edu/condor
[5] P. Nilsson, Experience from a pilot based system for ATLAS, J. Phys. Conf. Ser. 119 (2008) 062038.
[6] PanDA monitor, http://panda.cern.ch
[7] Athena Core Software, http://atlas-computing.web.cern.ch/atlascomputing/packages/athenacore/athenacore.php, 2008.
[8] Enabling Grids for E-sciencE, http://www.eu-egee.org/
[9] J. Novotny, S. Tuecke, V. Welch, An Online Credential Repository for the Grid: MyProxy, in Proceedings of the Tenth International Symposium on High Performance Distributed Computing (HPDC-10), IEEE Press, August 2001, pp. 104-111.
[10] pathena, https://twiki.cern.ch/twiki/bin/view/atlas/pandaathena
[11] F. Brochu et al., Ganga: a tool for computational-task management and easy access to Grid resources, submitted to Computer Physics Communications (arXiv:0902.2685v1).