Journal of Physics: Conference Series 608 (2015) 012014 (Open Access)

A self-configuring control system for storage and computing departments at INFN-CNAF Tier1

Daniele Gregori, Stefano Dal Pra, Pier Paolo Ricci, Michele Pezzi, Andrea Prosperini, Vladimir Sapunenko
INFN-CNAF, Viale Berti-Pichat 6/2, 40127 Bologna, Italy
E-mail: daniele.gregori@bo.infn.it

Abstract. The storage and farming departments at the INFN-CNAF Tier1 [1] manage about a thousand computing nodes and several hundred servers that provide access to the disk and tape storage. In particular, the storage servers must provide the following services: efficient access to about 15 petabytes of disk space organized in several GPFS file system clusters, data transfers between LHC Tier sites (Tier0, Tier1 and Tier2) via a GridFTP cluster and the XRootD protocol, and the writing and reading of data on the magnetic tape backend. One of the most important requirements for a reliable service is a control system that warns when problems arise and can perform automatic recovery operations in case of service interruptions or major failures. Moreover, configurations change during daily operations: for example, the roles of the GPFS cluster nodes can be modified, so obsolete nodes must be removed from the control system and new servers added to those already present. Managing all these changes manually is difficult when many changes occur, can take a long time and is prone to human error and misconfiguration. For these reasons we have developed a control system that reconfigures itself whenever a change occurs. This system has been in production for about a year at the INFN-CNAF Tier1 with good results and hardly any major drawback. The system relies on three key elements. The first is a software configuration service (e.g. Quattor or Puppet) for the server machines to be monitored; this service must ensure the presence of the appropriate sensors and custom scripts on the nodes to be checked, and must be able to install and update software packages on them. The second is a database that stores, in a suitable format, information on all the machines in production and can provide for each of them the principal details, such as the hardware type, the network switch to which the machine is connected, whether the machine is physical or virtual, the hypervisor it possibly belongs to, and so on. The third is a control system software (in our implementation we chose Nagios) capable of assessing the status of servers and services, attempting to restore the working state, restarting or inhibiting software services, and sending suitable alarm messages to the site administrators. The integration of these three elements is achieved with custom scripts that allow the self-configuration of the system according to a decisional logic; the combination of all the above components is discussed in detail in this paper.

1. Introduction
The Storage and Farming departments at the INFN-CNAF Tier1 administer around a thousand computing nodes and some hundreds of servers.
A control system that checks the status of services and hosts, sends email alerts and runs automatic recovery actions in case of an error is an essential tool.

Given the large number of machines to be controlled, it becomes very difficult to keep the control system aligned with the machines actually in production. For this reason we decided to build a control mechanism able to configure itself according to which machines are really in production and to their role, i.e. the type of server (e.g. IBM General Parallel File System (GPFS) node, GridFTP server, Worker Node, User Interface). For this purpose we used a set of standard open-source tools: Quattor [2], Nagios [3] and Docet [4], a PostgreSQL database containing information on the servers of the computing centre. These tools are integrated by a series of Bash scripts that read the information on the machines in production and create the updated Nagios configuration. In the following sections the tools used and the logic of their integration are described.

2. Adopted Tools
2.1. Quattor
Quattor is a generic open-source toolkit used to install, configure and manage computers; it ensures that all machines to be monitored by Nagios are configured to perform the required checks.

2.2. Nagios
Nagios is an open-source application for computer system, network and infrastructure monitoring. It offers monitoring and alerting services for servers, switches, applications and services: it alerts the users when things go wrong and alerts them again when the problem has been resolved.

2.3. Docet
Docet is a PostgreSQL database (DB) that contains the hardware information of all the machines in production at the Tier1. The information extracted from the DB is used to determine which machines are actually in production and therefore need to be controlled via Nagios.

3. Project Constraints
The system design took into account the specific requirements of each department. For the Farming department, the large number of computing nodes (about 1000) must be taken into account. For the Storage department, the configuration of the GPFS clusters must be considered in order to control the specific roles of the GPFS server nodes, such as quorum. The system must be able to control thousands of Worker Nodes and, in case of problems, perform automatic recovery actions. It must automatically learn which machines (Worker Nodes or servers) should be monitored and how the GPFS clusters are configured, and it must be quickly restorable in case of failure of the control server (disaster recovery).

On the Worker Nodes of the Farming department a list of running daemons and mounted filesystems must be checked; in case of problems the control system must restart the services or mount any filesystem that is not mounted. Given the high number of nodes and checks to be performed, Nagios is not able to run active (synchronous) checks via the NRPE plugin [5]; we therefore employ passive (asynchronous) checks via the NSCA plugin [6]. On each Worker Node a series of sensors checking the daemons and the filesystems is executed within every hour, using a random delay instead of a fixed time. The sensors send their results to the Nagios server, which reads the status value of the checked service from a named pipe.
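As a minimal sketch, assuming a host name for the Nagios server and example daemon and filesystem names, such a Worker Node sensor could look like the following; it only reports the status, while the restart attempts described below are triggered from the Nagios side.

#!/bin/bash
# Hypothetical Worker Node sensor submitting passive (NSCA) check results.
# The Nagios host name, the checked daemon and filesystem, and the service
# names are illustrative assumptions, not the actual CNAF scripts.

NAGIOS_SERVER="nagios.example.cnaf.infn.it"   # assumed host name
HOST="$(hostname -s)"

# Spread the reports over the hour: random delay between 0 and 3599 seconds.
sleep $(( RANDOM % 3600 ))

results=""

# Check that an example daemon is running (0 = OK, 2 = CRITICAL).
if pgrep -x ntpd > /dev/null; then
    results+="${HOST}\tdaemon_ntpd\t0\tntpd is running\n"
else
    results+="${HOST}\tdaemon_ntpd\t2\tntpd is not running\n"
fi

# Check that an example filesystem is mounted.
if mountpoint -q /storage/gpfs_data; then
    results+="${HOST}\tfs_gpfs_data\t0\t/storage/gpfs_data mounted\n"
else
    results+="${HOST}\tfs_gpfs_data\t2\t/storage/gpfs_data not mounted\n"
fi

# Ship the tab-separated results to the NSCA daemon on the Nagios server,
# which writes them into the Nagios external command pipe.
printf "%b" "$results" | /usr/sbin/send_nsca -H "$NAGIOS_SERVER" -c /etc/nagios/send_nsca.cfg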

If no status report is sent for a service for more than an hour, Nagios changes its status to Unknown and a mail is sent to the administrators of that service. If a Worker Node sends an Error status, Nagios reacts immediately and proactively via the NRPE plugin, requesting an attempt to restart the service and the submission of the new status. For this type of machine Nagios learns which hosts should be checked through a query to the Docet DB. Besides the list of Worker Nodes in production, the query also returns the hostname, the hardware type, whether the machines are physical or virtual, and the network switch to which they are connected. Based on this information, the Bash scripts construct the Nagios configuration and perform an initial check of the syntax and of duplicated IP addresses. At this point the reload of the configuration is left to the Nagios administrator.

The requirements of the Storage department are different. There are numerous GPFS clusters to be controlled: a server can be added to or removed from a cluster, or may change its role as quorum manager. This information is not contained in the Docet DB, so in order to learn the cluster configuration one of the cluster servers must communicate with Nagios. The script /opt/gpfs-nagios-info-sender.sh is added to the cluster nodes via Quattor; it sends the output of the mmlscluster [7] command directly to the Nagios server. In this way the Nagios configuration scripts can build the configuration for all GPFS clusters, taking into account the quorum nodes and creating a cluster service for these particular machines.

To restore the control server in case of disaster, Quattor is used: it is able to reinstall the machine from scratch in about 30 minutes. Once the Nagios server is restored, it can recreate its configuration: first the machines of the GPFS clusters send their configuration with the script /opt/gpfs-nagios-info-sender.sh, then the Docet DB is queried for the hardware characteristics. Overall, the reconstruction of the configuration takes a few minutes. In Figure 1 we present the schema of the self-configuring control system with the characteristics of its elements.

Figure 1. Overview of the key elements of the control system.
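The paper does not specify how the mmlscluster output reaches the Nagios server; the sketch below is one possible shape of /opt/gpfs-nagios-info-sender.sh, with the destination host, the spool directory and the use of scp introduced as assumptions.

#!/bin/bash
# Hypothetical sketch of /opt/gpfs-nagios-info-sender.sh: collects the GPFS
# cluster layout with mmlscluster and ships it to the Nagios server.
# The destination host, target directory and use of scp are assumptions.

NAGIOS_SERVER="nagios.example.cnaf.infn.it"        # assumed host name
DEST_DIR="/var/spool/nagios-gpfs/cluster-info"     # assumed spool directory

CLUSTER_NAME="$(/usr/lpp/mmfs/bin/mmlscluster | awk '/cluster name:/ {print $NF}')"
OUTFILE="/tmp/${CLUSTER_NAME}.mmlscluster"

# Dump the full cluster description (nodes and designations such as quorum/manager).
/usr/lpp/mmfs/bin/mmlscluster > "$OUTFILE"

# Copy it to the Nagios server, where the configuration scripts will parse it.
scp -q "$OUTFILE" "${NAGIOS_SERVER}:${DEST_DIR}/" && rm -f "$OUTFILE"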

4. The configuration system operation
Once the Nagios server has received the information from all the GPFS clusters, the script that creates the configuration can be run. Two scripts are executed in succession: the first is /opt/nagios-gpfs/read_cluster_info.sh and the second is /opt/nagios-gpfs/create_all.sh. The first reads the files sent by the GPFS clusters and creates an ordered list of servers per cluster, making sure that no machine belongs to more than one cluster, which is forbidden by GPFS. The second creates the Nagios configuration itself, ignoring any server found in a blacklist; it queries the database to determine the hardware type of each server in order to add specific hardware checks (IPMI, RAID, power supply). The following steps are performed:
(i) Check the information received from the GPFS clusters and create a file containing the servers.
(ii) Save the current configuration for possible recovery.
(iii) Add, through queries to the Docet DB, the other types of machines (GridFTP, GPFS, etc.); for these machines the type information is contained in Docet.
(iv) Verify that each server found in the preceding steps is not present in the blacklist, in which case it is ignored.
(v) From the information gathered in the previous steps, create the Nagios configuration files for groups, hosts, commands and services.
(vi) Check the correctness of the configuration with the command nagios -v /etc/nagios/nagios.cfg.
(vii) Nagios is ready to restart or reload the configuration.
For the Farming department the steps are similar, excluding the first one since there is no GPFS cluster to control; the machines that require information from Docet are Worker Nodes, User Interfaces and other servers.
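The sketch below condenses the procedure above into a single flow. It is an illustration under stated assumptions, not the actual CNAF implementation: the paths, the blacklist file, the Nagios template names, the check_gpfs_quorum command and the Docet table and column names are all hypothetical.

#!/bin/bash
# Hedged sketch of the configuration-generation flow (read_cluster_info.sh
# plus create_all.sh combined). All names below are illustrative assumptions.

INFO_DIR="/var/spool/nagios-gpfs/cluster-info"   # where mmlscluster dumps arrive (assumed)
CONF_DIR="/etc/nagios/conf.d/auto"               # generated configuration (assumed)
BLACKLIST="/etc/nagios/blacklist.txt"            # one hostname per line (assumed)
BACKUP_DIR="/var/backups/nagios/$(date +%Y%m%d-%H%M)"

# (ii) Save the current configuration for possible recovery.
mkdir -p "$BACKUP_DIR" && cp -a "$CONF_DIR" "$BACKUP_DIR"

# (i) Parse the mmlscluster dumps: node lines start with a node number, the
#     second field is the daemon node name, quorum nodes are flagged.
mkdir -p "$CONF_DIR"
> /tmp/gpfs_servers.txt
for f in "$INFO_DIR"/*.mmlscluster; do
    awk '/^ *[0-9]+ / { role = ($0 ~ /quorum/) ? "quorum" : "node"; print $2, role }' "$f" \
        >> /tmp/gpfs_servers.txt
done
sort -u /tmp/gpfs_servers.txt -o /tmp/gpfs_servers.txt

while read -r host role; do
    # (iv) Skip blacklisted servers.
    grep -qx "$host" "$BLACKLIST" 2>/dev/null && continue

    # (iii) Ask Docet (PostgreSQL) for the hardware type of this host;
    #       database, table and column names are assumptions.
    hwtype=$(psql -h docet.example.cnaf.infn.it -U nagios -d docet -tA \
             -c "SELECT hw_model FROM hosts WHERE hostname = '$host';")

    # (v) Emit a host definition plus a quorum-specific service where needed.
    cat > "$CONF_DIR/${host}.cfg" <<EOF
define host {
    use        gpfs-server
    host_name  ${host}
    _HWTYPE    ${hwtype:-unknown}
}
EOF
    [ "$role" = "quorum" ] && cat >> "$CONF_DIR/${host}.cfg" <<EOF
define service {
    use                  generic-service
    host_name            ${host}
    service_description  GPFS quorum node
    check_command        check_gpfs_quorum
}
EOF
done < /tmp/gpfs_servers.txt

# (vi) Validate the configuration; (vii) the administrator can then reload Nagios.
nagios -v /etc/nagios/nagios.cfg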

5. Some NagiosGraph tests
In order to increase the amount of information available from the Nagios server, we created sensors that measure network and memory parameters of the clients. These data are sent to the Nagios server, which stores them in a Round Robin Database and displays them via the NagiosGraph plugin [8]. With a granularity of 5 minutes and each quantity kept for one year, some hundreds of MB of data are collected. An example of a NagiosGraph plot is shown in Figure 2.

Figure 2. Network NagiosGraph plot.

6. Conclusions
The automated system for Storage and Farming has been in production for over a year on two separate servers. During this period it has proven to be a reliable tool, significantly reducing both the time spent managing Nagios and the possibility of manual errors. A possible development is a cross-check between the two servers and, in case of a failure, the automatic activation of a third Nagios instance that takes the place of the server disabled by a hardware failure.

References
[1] D. Gregori et al., INFN-CNAF Activity in the Tier-1 and Grid for LHC Experiments, 2009, E-ISBN: 978-1-4244-3750-4
[2] http://www.quattor.org/
[3] http://www.nagios.org/
[4] S. Dal Pra and A. Crescente, The data operation centre tool. Architecture and population strategies, 2012, Journal of Physics: Conference Series 396 (2012) 042014
[5] http://exchange.nagios.org/directory/addons/monitoring-agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details
[6] http://exchange.nagios.org/directory/addons/passive-checks/NSCA--2D-Nagios-Service-Check-Acceptor/details
[7] General Parallel File System (GPFS) Administration and Programming Reference, Version 4 Release 1.0.4, SA23-1452-01
[8] http://nagiosgraph.sourceforge.net/