The ATLAS Software Installation System v2
Alessandro De Salvo, Mayuko Kataoka, Arturo Sanchez Pineda, Yuri Smirnov
CHEP 2015

Outline: Overview, Architecture, Performance

LJSFi Overview
- LJSFi stands for Light Job Submission Framework.
- Developed in ATLAS since 2003 as a job submission framework for the validation of software releases and other installation-related tasks.
- Evolved over time to cope with the increased load, the move from the WMS to PanDA, and high-availability (HA) requirements.
- Built on a plugin architecture, so that any other backend can be plugged in in the future (a minimal sketch follows this slide).
- Multi-VO enabled: LJSFi can handle multiple VOs, even on the same set of servers.
- Web User Interface: the main LJSFi interface is web-based. Users interact with the system in different ways depending on their role: anonymous users have limited access, while registered users, identified by their personal certificate, have deeper access.
- Fast job turnaround, scalability and high availability: LJSFi copes with hundreds of resources and thousands of releases, with a submission-phase turnaround of the order of minutes. Horizontal scalability is obtained by adding instances of each component class; HA is provided by the DB infrastructure and the embedded facilities of the LJSFi components.
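The plugin-based backend design mentioned above can be illustrated with a minimal Python sketch; the names used here (register_backend, SubmissionBackend, PandaBackend) are hypothetical and do not reflect LJSFi's actual interfaces, only the general pattern of registering interchangeable submission backends.

    # Minimal sketch of a plugin-style backend registry; class and function
    # names are illustrative, not LJSFi's real API.
    from abc import ABC, abstractmethod

    _BACKENDS = {}

    def register_backend(name):
        """Class decorator that records a submission backend under a name."""
        def wrapper(cls):
            _BACKENDS[name] = cls
            return cls
        return wrapper

    class SubmissionBackend(ABC):
        @abstractmethod
        def submit(self, task):
            """Submit a validation/installation task, return a job identifier."""

    @register_backend("panda")
    class PandaBackend(SubmissionBackend):
        def submit(self, task):
            # A real backend would call the PanDA client here.
            return "panda-job-for-%s" % task

    def get_backend(name):
        return _BACKENDS[name]()

    if __name__ == "__main__":
        print(get_backend("panda").submit("validate-release"))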

LJSFi Components
- The main components of the LJSFi infrastructure are:
  - the LJSFi Server
  - the Request Agents
  - the Installation Agents
- The LJSFi Server is built out of several sub-systems: the HA Installation DB, the InfoSys, the Web Interface, the monitoring facilities, and the APIs.
- The Request Agents and Installation Agents connect to the closest Servers to request and process tasks, and to clean up and notify the admins.
- In this configuration the failure of a component is not fatal: only the logfiles hosted on a failed server become inaccessible.

The LJSFi v2 Architecture
[Architecture diagram: users reach the LJSFi Servers (each with memcached) through an HA alias; Task Request Agents and Installation Agents, running at the same or different sites, talk to the servers, which redirect DB calls to the HA multi-master Install DB, a Percona XtraDB Cluster whose nodes (DB1..DBn) are spread over the WAN and other sites; the Installation Agents use PanDA and DDM (Rucio) clients from CVMFS; HA facilities are embedded in the components.]

LJSFi HA Database
- Based on Percona XtraDB Cluster: an extension of the MySQL engine with WSREP (Write-Set Replication) patches, providing a true multi-master, WAN-enabled engine.
- A cluster of 9 DB machines in Roma plus 2 at CERN. The suggested minimum is 3 to keep quorum, and we wanted to be on the safe side. More machines may be added, even at other sites, for better redundancy.
- No powerful hardware is needed: at least 4 GB of RAM, 100 GB of disk and standard network connectivity (lower latencies are better). VMs are used at CERN, where we run on the Agile Infrastructure; no performance issue has been seen so far, including from the WAN latency.
- The cluster hosts the main DBs used by LJSFi (any node can serve clients, as sketched after this slide):
  - the Installation DB: the source of release definitions for CVMFS and the full driver of the ATLAS installations;
  - the InfoSys DB: the database used for resource discovery and matchmaking.
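As a hedged illustration of how a client could reach the multi-master cluster, the Python sketch below (assuming the pymysql library) tries a list of cluster nodes in turn; hostnames, credentials and database name are placeholders, and in the real deployment the servers front the cluster with an HA alias and haproxy rather than a client-side node list.

    # Sketch only: client-side failover across Percona XtraDB Cluster nodes.
    # Hostnames, credentials and database name are placeholders.
    import pymysql

    CLUSTER_NODES = ["db1.example.org", "db2.example.org", "db-remote.example.org"]

    def connect(nodes=CLUSTER_NODES, user="ljsfi", password="secret", db="installdb"):
        last_error = None
        for host in nodes:
            try:
                return pymysql.connect(host=host, user=user, password=password,
                                       database=db, connect_timeout=5)
            except pymysql.MySQLError as exc:
                last_error = exc  # node unreachable, try the next one
        raise RuntimeError("no cluster node reachable: %s" % last_error)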

LJSFi InfoSys
- Used for resource discovery and matchmaking; connected to AGIS (ATLAS Grid Information System) and PanDA.
- Mirrors the needed AGIS data once every 2 hours (tunable).
- Data freshness checks: no interaction with the InfoSys is possible if the data age is > 4 h (tunable; a sketch of this guard follows).
- May use more parameters in the matchmaking than those currently present in AGIS, e.g. OS type/release/version (filled in by the installation agents via callback). These parameters can be sent back to AGIS if needed, as we already do for the CVMFS attributes.
- Sites can be disabled from the internal matchmaking if needed, for example HPC and opportunistic resources (BOINC), where the validations should not run automatically as soon as the resources are discovered.
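The data-freshness guard can be sketched as follows; the 4-hour threshold comes from the slide, while the function and exception names are invented for illustration.

    # Sketch of the InfoSys freshness check: refuse matchmaking queries when the
    # mirrored AGIS snapshot is older than the (tunable) threshold.
    import time

    MAX_AGE_SECONDS = 4 * 3600  # 4 h, tunable

    class StaleInfoSysError(Exception):
        pass

    def check_freshness(last_mirror_timestamp, max_age=MAX_AGE_SECONDS):
        age = time.time() - last_mirror_timestamp
        if age > max_age:
            raise StaleInfoSysError("AGIS mirror is %.1f h old (limit %.1f h)"
                                    % (age / 3600.0, max_age / 3600.0))
        return age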

LJSFi APIs
- LJSFi provides two ways to interact with the servers: the Python APIs and the REST APIs.
- The Python APIs are used by the LJSFi CLI for the end-users, and by the Installation Agents and Request Agents as well.
- The REST APIs are used for a broader spectrum of activities: callbacks from running jobs (see the example below), external monitoring, CLI commands / Installation Agents, and internal server activities.
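A job callback over the REST APIs could look roughly like the Python snippet below (using the requests library); the endpoint path and payload fields are assumptions, only the general pattern of a certificate-authenticated HTTPS call reflects the slide.

    # Hypothetical job-status callback to an LJSFi server; URL layout and
    # payload are invented for illustration.
    import requests

    def report_status(server, job_id, status, cert=("usercert.pem", "userkey.pem")):
        url = "https://%s/api/jobs/%s/status" % (server, job_id)
        resp = requests.post(url, json={"status": status}, cert=cert, timeout=30)
        resp.raise_for_status()
        return resp.json()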

LJSFi Request Agents
- The LJSFi Request Agents are responsible for discovering new software releases and inserting validation requests into the DB.
- They use the InfoSys and the matchmaker to discover resources that are not currently offline.
- They handle the pre-requirements of the tasks, such as installation prerequisites: OS type, architecture, and the maximum number of allowed concurrent jobs on the resources (multicore resources).
- The Request Agents periodically loop over all the releases set in auto-deployment mode (sketched below). Currently the loop runs every 2 hours, but it will be shortened as soon as the request agents become multi-threaded.
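The periodic discovery loop can be sketched as below; the 2-hour cycle is from the slide, while the infosys/installdb objects and their methods are placeholders for the real LJSFi interfaces.

    # Sketch of the request-agent cycle: match auto-deployment releases against
    # online resources and queue validation requests. Method names are invented.
    import time

    CYCLE_SECONDS = 2 * 3600  # current loop period

    def request_agent_loop(infosys, installdb):
        while True:
            resources = infosys.online_resources()        # skip offline sites
            for release in installdb.auto_deploy_releases():
                for site in resources:
                    if infosys.matches(release, site):    # OS, arch, max jobs, ...
                        installdb.queue_validation(release, site)
            time.sleep(CYCLE_SECONDS)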

LJSFi Installation Agents [1]
- Used to follow the whole job lifecycle:
  - processing task requests from the database; collision control among multiple agents is performed via central locks on tasks (see the sketch below);
  - pre-tagging a site as having the given software before sending the jobs, done only if the site uses CVMFS (almost all sites are CVMFS-enabled, with a few exceptions such as the HPC resources);
  - job submission;
  - job status check and output retrieval;
  - tag handling (AGIS-based): tags are removed if the validation jobs fail, and added/checked on success.
- The installation agents are fully multi-threaded, able to send several jobs in parallel while following the other operations.
- In case of problems, operation timeouts are provided either by the embedded commands used or by the generic timeout facility in the agents themselves.
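Collision control via central locks could work along the lines of the sketch below: an agent claims a pending task with an atomic conditional UPDATE, so only one agent wins the race. Table and column names are invented; the real schema may differ.

    # Sketch: claim a task atomically so that two agents never process the same one.
    def claim_task(conn, task_id, agent_name):
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE tasks SET status = 'claimed', agent = %s "
                "WHERE id = %s AND status = 'pending'",
                (agent_name, task_id))
            conn.commit()
            return cur.rowcount == 1  # True only for the agent that won the race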

LJSFi Installation Agents [2]
- Several installation agents can run at the same site or at different sites. Each agent is linked to an LJSFi server, but when an HA alias is used it can be delocalized.
- Each server redirects the DB calls via haproxy to the closest DB machines, taking advantage of the WAN multi-master HA properties of the DB cluster.
- The agents serve all the ATLAS grids (LCG/EGI, NorduGrid, OSG), the Cloud resources, the HPCs and the opportunistic facilities via PanDA.
- The logfiles of every job are kept for about a month in the system for debugging purposes (a retention sketch follows). Logfiles are sent by the agents to their connected servers; each server knows where the logfiles are and redirects any local logfile request to the appropriate server.
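The one-month log retention could be enforced by a simple cleanup pass such as the sketch below; the directory layout and the retention constant are assumptions, not LJSFi's actual mechanism.

    # Sketch of log retention: drop job logfiles older than ~30 days.
    import os
    import time

    RETENTION_SECONDS = 30 * 24 * 3600

    def purge_old_logs(log_dir):
        cutoff = time.time() - RETENTION_SECONDS
        for name in os.listdir(log_dir):
            path = os.path.join(log_dir, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)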

LJSFi Web Interface
- The LJSFi Web interface has been designed for simplicity and clarity: https://atlas-install.roma1.infn.it/atlas_install
- Most of the input boxes use hints rather than combo boxes.
- Links to AGIS and PanDA are provided for the output resources.
- Friendly page navigation (HTML5) and online help.
- Each server has a separate Web interface, but the interaction with the system is consistent whichever server you use.

Performance
- The system can scale up to several thousand jobs per day. Horizontal scaling is achieved by adding more agents in parallel and by increasing the number of database cluster nodes.
- To keep performance under control, the number of jobs handled by the currently running agents is limited to 4000, and new requests are processed before the others to allow a fast turnaround of urgent tasks (see the sketch below).
- Generally only a few minutes pass between a task request and the actual job submission by the agents.
- The system handles a large number of releases and sites: currently > 500 different resources and > 1600 software releases or patches.
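One way to express the 4000-job cap and the newest-first policy is sketched below; the interpretation of the cap (free slots = limit minus currently active jobs) and the request fields are assumptions made for illustration.

    # Sketch of throttling + prioritisation: submit the newest pending requests
    # first, without exceeding the per-agent job cap.
    MAX_ACTIVE_JOBS = 4000

    def select_requests(pending_requests, active_jobs):
        free_slots = max(0, MAX_ACTIVE_JOBS - len(active_jobs))
        newest_first = sorted(pending_requests,
                              key=lambda r: r["created"], reverse=True)
        return newest_first[:free_slots]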

Conclusions
- LJSFi has been in use by ATLAS since 2003, evolving over time from the WMS to PanDA.
- Open system, multi-VO enabled: the infrastructure can be shared by several VOs, even hosted on the same server.
- Currently handling all the validation jobs on all the Grid/Cloud/HPC sites of ATLAS (> 500 resources and > 1600 software releases): LCG/EGI, NorduGrid, OSG, Cloud sites, HPC sites and opportunistic resources (BOINC).
- A fully featured system, able to cope with a high load, scalable and highly available, with no single point of failure.