Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science


Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science
T. Maeno, K. De, A. Klimentov, P. Nilsson, D. Oleynik, S. Panitkin, A. Petrosyan, J. Schovancova, A. Vaniachine, T. Wenaus, D. Yu, on behalf of the ATLAS collaboration
Brookhaven National Laboratory, USA; University of Texas at Arlington, USA; Argonne National Laboratory, USA
CHEP2013, Amsterdam, Netherlands, Oct 14, 2013

Outline
- Introduction
- Overview of the BigPanDA project
- Current status and plans
- Conclusions

Introduction
- PanDA = Production and Distributed Analysis System
- Designed to meet ATLAS production/analysis requirements for a data-driven workload management system capable of operating at LHC data-processing scale
- Highly automated, low operational manpower, integrated monitoring system
- History: Aug 2005: project started; Sep 2005: first prototype; Dec 2005: production in US ATLAS; 2008: adopted as the workload management system for the entire ATLAS collaboration; recently, AMS and CMS have deployed their own PanDA instances
- Performed well during LHC Run 1 for data processing, simulation and analysis, while actively evolving to meet rapidly changing physics needs
- Successfully managing more than 100 sites, about 5 million jobs per week, and about 1500 users

PanDA System architecture (diagram): production managers define the task/job repository (Production DB); the job submitter (bamboo) and end-user analysis jobs submit work to the PanDA server over HTTPS; pilots on OSG, EGEE/EGI and NDGF worker nodes (the latter via the ARC interface, aCT) pull jobs from the server over HTTPS; pilots are launched by the pilot scheduler (AutoPyFactory) through condor-g; the server interacts with the Data Management System (DQ2), Local Replica Catalogs (LFC) and the logging system.
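The diagram centers on the pilot "pull" model: a pilot running on a worker node asks the PanDA server over HTTPS whether a job is available for its site. The minimal sketch below illustrates that interaction only; the server host, endpoint path, parameters and response handling are illustrative placeholders, not the actual PanDA server API.

```python
# Minimal sketch of the pilot "pull" model shown in the architecture diagram.
# Host, endpoint path, parameters and response format are illustrative
# placeholders, not the real PanDA server API.
import requests

PANDA_SERVER = "https://pandaserver.example.org:25443"  # hypothetical host


def pull_job(site_name, prod_source_label="managed"):
    """Ask the server for a job matched to this site; return None if nothing is queued."""
    resp = requests.post(
        f"{PANDA_SERVER}/server/panda/getJob",    # illustrative endpoint path
        data={"siteName": site_name, "prodSourceLabel": prod_source_label},
        timeout=60,
    )
    resp.raise_for_status()
    job = resp.json()                             # simplified response handling
    return job if job.get("StatusCode") == 0 else None


if __name__ == "__main__":
    job = pull_job("EXAMPLE_SITE")
    print("got a job" if job else "no job available")
```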

[Plot: ATLAS jobs running concurrently, 2012-2013; ~160k running jobs at peak]

Overview of the BigPanDA project

The BigPanDA project: Evolving PanDA for Advanced Scientific Computing
- The interest in PanDA from other big-data sciences provided the primary motivation to generalize the PanDA system
- A project to extend PanDA as a meta-application, providing location transparency of processing and data management, for HEP and other data-intensive sciences, and a wider exascale community
- 3 FTE for 3 years, starting in 2012
- Three dimensions of evolution:
  - Making PanDA available beyond ATLAS and High Energy Physics
  - Extending beyond the Grid (Leadership Computing Facilities, clouds, university clusters)
  - Integrating the network as a resource in workload management

Work Plan
- Three-year plan:
  - Year 1: set up the collaboration, define algorithms and metrics
    - Hiring process was completed in June 2013; the development team is formed (3 FTE)
  - Year 2: prototyping and implementation
  - Year 3: production and operations
- Four work packages:
  - WP1: Factorizing the core
  - WP2: Extending the scope
  - WP3: Leveraging intelligent networks
  - WP4: Usability and monitoring

Work Packages (1/4)
WP1 (Factorizing the core)
- Factorize the core components of PanDA to enable adoption by a wide range of exascale scientific communities
- Take the core components of PanDA and package them as an experiment-neutral package
  - General components plus customizable layers
  - Experiment-specific layers as plug-ins plus configuration files
  - Advanced features will have sensible defaults and can be turned on for demanding applications

Work Packages (2/4)
WP2 (Extending the scope)
- Evolve PanDA to support extreme-scale computing clouds and Leadership Computing Facilities
- Historically the HEP community has used Grid infrastructure for large-scale production; Leadership Computing Facilities, for example, were not used very extensively
- Adds extra resources and further expands the potential user community and the resources available to it

Work Packages (3/4)
WP3 (Leveraging intelligent networks)
- Builds on the many research efforts into dynamic network provisioning, quality of service and traffic management
- Integrate network services and real-time data access into the PanDA workflow
- Integrate these services within existing and evolving infrastructures; e.g., PanDA currently does not interface with any of the advanced network-provisioning technologies available
- Automate the discovery and use of such services transparently to the scientists; e.g., instead of exposing intelligent network services directly to scientists, PanDA could be updated to interface with them directly, without any involvement of the scientists

Work Packages (4/4)
WP4 (Usability and monitoring)
- PanDA monitoring has been identified as requiring a special effort for factorization and generalization, since each experiment has its own workflow and visualization needs
- Design a generic PanDA monitor browser view and skeleton from which experiment-specific (ATLAS and other) browser views are derived as customizations
- Provide generic components and APIs for user communities to easily implement and customize their monitoring views and workflow visualizations

BigPanDA and ATLAS
- ATLAS remains the largest user community and provides the primary focus of PanDA development, although PanDA is no longer only for ATLAS
- Working very closely with the ATLAS PanDA developers
- One set of software packages for all experiments, including ATLAS: core packages plus experiment-specific plug-ins
- The effort has to be incremental and coherent with the many challenging developments underway for ATLAS

Current Status and Plans

Factorization of Core Components
- Well underway: decomposition of experiment-specific code into plug-ins and separate packaging
- Example in the PanDA server (diagram): the generic AdderGen module in the panda-server package replaces the ATLAS-specific Adder2 module and loads experiment plug-ins; AdderAtlasPlugin lives in the panda-atlas package and AdderCmsPlugin in the panda-cms package (see the sketch after this slide)
- See also Paul Nilsson's poster: Nilsson P., "Next Generation PanDA Pilot for ATLAS and Other Experiments"
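To make the core/plug-in split concrete, here is a simplified, hypothetical sketch of how a generic AdderGen-style module could dispatch to an experiment plug-in. The import paths, class interfaces and selection logic are assumptions for illustration, not the actual panda-server code.

```python
# Hypothetical sketch: a generic core module delegates experiment-specific
# post-processing to a plug-in selected by the job's virtual organization.
# Import paths and selection logic are illustrative, not panda-server's own.
def get_adder_plugin(vo):
    if vo == "atlas":
        from pandaatlas.adder import AdderAtlasPlugin   # hypothetical import path
        return AdderAtlasPlugin()
    if vo == "cms":
        from pandacms.adder import AdderCmsPlugin       # hypothetical import path
        return AdderCmsPlugin()
    raise ValueError(f"no Adder plug-in registered for VO {vo!r}")


class AdderGen:
    """Generic post-processing step; experiment specifics live in the plug-in."""

    def __init__(self, job):
        self.job = job
        self.plugin = get_adder_plugin(job["vo"])

    def run(self):
        # The core only orchestrates; the plug-in implements execute(job).
        self.plugin.execute(self.job)
```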

VO-independent PanDA Server
- PanDA server with a MySQL database backend
  - Multiple-backend support: Oracle or MySQL can be selected in the config file (see the sketch after this slide)
  - Code is being merged into the main branch
- PanDA server @ Amazon EC2
  - Serves experiments which don't want to maintain their own PanDA server
  - Uses the MySQL backend
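A minimal sketch of what config-driven backend selection can look like is below. The config section, key names and driver modules are assumptions chosen for illustration, not the actual panda-server configuration.

```python
# Hypothetical illustration of selecting the database backend from a config
# file; section/key names and driver choices are assumptions.
import configparser


def get_db_connection(config_path="panda_server.cfg"):
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    backend = cfg["database"]["backend"]          # e.g. "oracle" or "mysql"
    if backend == "oracle":
        import cx_Oracle
        return cx_Oracle.connect(cfg["database"]["dsn"])
    if backend == "mysql":
        import MySQLdb
        return MySQLdb.connect(
            host=cfg["database"]["host"],
            user=cfg["database"]["user"],
            passwd=cfg["database"]["password"],
            db=cfg["database"]["name"],
        )
    raise ValueError(f"unsupported backend {backend!r}")
```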

Google Compute Engine (GCE) preview project
- Google allocated resources to ATLAS for free: ~5M CPU hours, 4000 cores for ~2 months
- Transparent inclusion of cloud resources into the ATLAS grid: resources were organized as an HTCondor-based PanDA queue
- Delivered to ATLAS as a production resource, not as an R&D platform
- See Sergey Panitkin's talk for details: Panitkin S., "ATLAS Cloud Computing R&D"

PanDA + HPC
- In collaboration with the ORNL Leadership Computing Facility
- Gain experience with all relevant aspects of the platform and workload: job submission, security, storage, monitoring, etc.
- Developing an appropriate pilot/agent model for Titan (a sketch of the idea follows this slide)
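One way to picture a pilot/agent model for an HPC facility is an agent running on a login or service node that wraps each payload in a batch script and hands it to the site's local batch system, instead of executing it directly on a worker node as the Grid pilot does. The sketch below is a hedged illustration of that idea under those assumptions; the batch directives, queue commands and script contents are placeholders, not the model actually adopted for Titan.

```python
# Hedged sketch of an HPC pilot/agent: wrap a payload in a batch script and
# submit it through a PBS/Torque-style "qsub". Directives, commands and
# resource requests are illustrative assumptions.
import subprocess
import tempfile

BATCH_TEMPLATE = """#!/bin/bash
#PBS -l nodes={nodes}
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
{payload_cmd}
"""


def submit_payload(payload_cmd, nodes=1, walltime="02:00:00"):
    """Write a batch script for the payload, submit it, and return the batch job id."""
    script = BATCH_TEMPLATE.format(nodes=nodes, walltime=walltime,
                                   payload_cmd=payload_cmd)
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        script_path = f.name
    out = subprocess.run(["qsub", script_path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()   # batch system's job identifier
```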

Leveraging the Network
- Synchronized with two other efforts: the integration of FAX (Federated Xrootd for ATLAS) with PanDA, and the ANSE project
- Three-layered software architecture:
  - Collector: collects network performance information from various sources such as perfSONAR, the Grid sites status board, FAX, etc.
  - AGIS (ATLAS Grid Information System): information pool
  - Calculator: calculates the weights which the brokerage takes into account for site selection
- The site-selection algorithm in the PanDA brokerage has been improved to consider FAX cost metrics (a weighting sketch follows this slide)
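To make the collector/calculator/brokerage layering concrete, the sketch below folds a network cost metric into an otherwise capacity-based per-site weight. The weighting formula and field names are illustrative assumptions, not the actual PanDA brokerage algorithm.

```python
# Illustrative (not actual) brokerage weighting: combine free capacity with a
# network cost metric gathered by the collector layer and published through
# the information pool.
def site_weight(site, network_cost):
    """Higher is better; penalize sites that are expensive to reach for input data."""
    free_slots = max(site["max_jobs"] - site["running_jobs"], 0)
    # network_cost: e.g. normalized transfer cost to the site, 0 (best) .. 1 (worst)
    return free_slots * (1.0 - 0.5 * network_cost)


def choose_site(sites, costs):
    """Pick the candidate site with the largest weight."""
    return max(sites, key=lambda s: site_weight(s, costs.get(s["name"], 1.0)))


sites = [
    {"name": "SITE_A", "max_jobs": 1000, "running_jobs": 900},
    {"name": "SITE_B", "max_jobs": 500, "running_jobs": 100},
]
costs = {"SITE_A": 0.1, "SITE_B": 0.8}
print(choose_site(sites, costs)["name"])   # SITE_B: its free capacity outweighs its higher cost
```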

Conclusions
- The PanDA system played a key role in LHC Run 1 data processing, simulation and analysis with great success, while actively evolving to meet rapidly changing physics needs
- The interest in PanDA from other big-data sciences provided the primary motivation to generalize the PanDA system
- The BigPanDA project gives us a great opportunity to evolve PanDA beyond ATLAS and HEP
- Progress in many areas: networking, a VO-independent PanDA instance, cloud computing, HPC
- Strong interest from several experiments and scientific centers in making this a collaborative project