Utilizing Slurm and Passive Nagios Plugins for Scalable KNL Compute Node Monitoring

Similar documents
Monitoring a HPC Cluster with Nagios

Automated system and service monitoring with openqrm and Nagios

NWC 2011 Monitoring a Cloud Infrastructure in a Multi-Region Topology

Monitoring ARC services with GangliARC

The Art of Container Monitoring. Derek Chen

Purpose. Target Audience. Prerequisites. What Is An Event Handler? Nagios XI. Introduction to Event Handlers

Directions in Workload Management

Scheduler Optimization for Current Generation Cray Systems

Next Generation Monitoring: Moving Beyond Nagios

Event Console Integration , Sven Panne Check_MK Conference #4

A self-configuring control system for storage and computing departments at INFN-CNAF Tierl

Intel Xeon PhiTM Knights Landing (KNL) System Software Clark Snyder, Peter Hill, John Sygulla

Design, Development and Improvement of Nagios System Monitoring for Large Clusters

Hadoop Map Reduce 10/17/2018 1

Slurm Overview. Brian Christiansen, Marshall Garey, Isaac Hartung SchedMD SC17. Copyright 2017 SchedMD LLC

Monitoring Grid Virtual Machine deployments

Modification Alerts. Feature Deep Dive

MongoDB and Mysql: Which one is a better fit for me? Room 204-2:20PM-3:10PM

HONEYPOT BASED INTRUSION MANAGEMENT SYSTEM: FROM A PASSIVE ARCHITECTURE TO AN IPS SYSTEM

Linux Clusters Institute: Monitoring

Architectural Flexibility

Out-of-band that actually works.

SLURM Event Handling. Bill Brophy

MONITOR YOUR ENTIRE IT INFRASTRUCTURE WITH NAGIOS XI

Real-time Monitoring, Inventory and Change Tracking for. Track. Report. RESOLVE!

HPCC Monitoring and Reporting (Technical Preview) Boca Raton Documentation Team

Federated Cluster Support

Monitoring (and) IPv6

Effective Monitoring for Demanding Operations Environments

Learning Nagios 4. Wojciech Kocjan. Chapter No.1 "Introducing Nagios"

Reducing Cluster Compatibility Mode (CCM) Complexity

PlaFRIM. Technical presentation of the platform

A WEB-BASED SOLUTION TO VISUALIZE OPERATIONAL MONITORING LINUX CLUSTER FOR THE PROTODUNE DATA QUALITY MONITORING CLUSTER

Nagios XI Host and Service Details Overview

SLURM Operation on Cray XT and XE

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes:

Kerberos & HPC Batch systems. Matthieu Hautreux (CEA/DAM/DIF)

Network Management & Monitoring Overview

Slurm Version Overview

2012 Microsoft Corporation. All rights reserved. Microsoft, Active Directory, Excel, Lync, Outlook, SharePoint, Silverlight, SQL Server, Windows,

Data Sheet. Monitoring Automation for Web-Scale Networks MONITORING AUTOMATION FOR WEB-SCALE NETWORKS -

Enterprise print management in VMware Horizon

Monitoring Agent for Unix OS Version Reference IBM

Application Level Protocols

VMware vsphere. Administration VMware Inc. All rights reserved

Network Monitoring and Management Introduction to Networking Monitoring and Management

Docker Container Access Reference Design

Aggregated Monitoring Chapter One: RouterOS. Bartłomiej Kos MUM Slovenia Feb 2016

Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory

Avoiding the Cost of Confusion: SQL Server Failover Cluster Instances versus Basic Availability Group on Standard Edition

GRID monitoring with NetSaint

HTCondor Week 2015: Implementing an HTCondor service at CERN

1 Bull, 2011 Bull Extreme Computing

Heterogeneous Job Support

Plugin Monitoring for GLPI

Elasticsearch Scalability and Performance

TARGETPROCESS JIRA INTEGRATION GUIDE

CURRENT STATE OF ICINGA

Request support: ecentral.creo.com. Virtualized Systems as a Basis for Redundancy

PrepAwayExam. High-efficient Exam Materials are the best high pass-rate Exam Dumps

Europeana DSI 2 Access to Digital Resources of European Heritage

Keep an eye on your PostgreSQL clusters

Cori (2016) and Beyond Ensuring NERSC Users Stay Productive

Improve WordPress performance with caching and deferred execution of code. Danilo Ercoli Software Engineer

Research Project 1. Failover research for Bright Cluster Manager

Exadata Monitoring and Management Best Practices

Concurrency and Recovery

Network Management & Monitoring Overview

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments

Introduction to Slurm

Install some base packages. I recommend following this guide as root on a new VPS or using sudo su, it will make running setup just a touch easier.

Making Remote Network Visibility Affordable for the Distributed Enterprise

Analytics and Visualization

NERSC Site Update. National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory. Richard Gerber

Course 6231A: Maintaining a Microsoft SQL Server 2008 Database

Linux Administration

Linux Clusters Institute: Monitoring. Zhongtao Zhang, System Administrator, Holland Computing Center, University of Nebraska-Lincoln

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

DB2 (Linux Unix & Windows)

An Introduction APRICOT 2008 Network Management Workshop February Taipei, Taiwan

Trending with Purpose. Jason Dixon

How do I patch custom OEM images? Are ESXi patches cumulative? VMworld 2017 Do stateless hosts keep SSH & SSL identities after reboot? With Auto Deplo

Deploying Ceph clusters with Salt

Course 6231A: Maintaining a Microsoft SQL Server 2008 Database

Lecture 8: Internet and Online Services. CS 598: Advanced Internetworking Matthew Caesar March 3, 2011

Important DevOps Technologies (3+2+3days) for Deployment

What s New in PowerSchool 9.0

Porting SLURM to the Cray XT and XE. Neil Stringfellow and Gerrit Renker

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

CLUSTERING. What is Clustering?

Sigma Tile Workshop Guide. This guide describes the initial configuration steps to get started with the Sigma Tile.

High Noon at AWS. ~ Amazon MySQL RDS versus Tungsten Clustering running MySQL on AWS EC2

Divide & Recombine (D&R) with Tessera: High Performance Computing for Data Analysis.

Smart Monitoring

Using CLC Genomics Workbench on Turing

WANSyncHA Microsoft Exchange Server. Operations Guide

Connect with Remedy: SmartIT: Social Event Manager Webinar Q&A

IBM InfoSphere Streams v4.0 Performance Best Practices

CIS Controls Measures and Metrics for Version 7

Prediction Project. Release draft (084e399) OPNFV

Transcription:

Utilizing Slurm and Passive Nagios Plugins for Scalable KNL Compute Node Monitoring Tony Quan, Basil Lalli twquan@lbl.gov, bdlalli@lbl.gov SLUG 2017, NERSC - LBNL

About NERSC, Cori, Edison: 11 Nagios Core (4.x) servers Aggregated by Thruk Mostly NRPE and custom plugins, some NSCA Cori - ~12K nodes, ~9.6K Knight s Landing nodes Edison - ~5.5K nodes (Ivybridge) Nagios servers are split per-group/purpose (NGF, FSG, clusters, etc.) Notifications provided by Slack/email Heavy use of custom plugins

The Problem: Scale! Active/NRPE plugins serve us well in most groups, but cannot handle the scale of our latest systems. It s not going to get better. Traditional remote plugins cause unnecessary intrusion on the internal clusters. With KNL nodes, reboots (for boot feature changes) are routine. Easily can be thousands a day. Previous approaches are no longer viable.

Considerations: Easily maintainable. Integrate into our existing workflows, monitoring, and infrastructure. K.I.S.S. Must be able to accommodate KNL reboot events gracefully. Per-node monitoring granularity for ticketing purposes. Must not spam notifications!!! Forward-thinking with regard to the scale of future systems and automation.

What Do We Really Need Here? Traditional plugins are unnecessary and/or meaningless: PING, DISK, LOAD, etc. NHC takes care of this. All data necessary to monitor compute nodes effectively can be provided from single locations en masse. We don t want/need to do anything 12,000 times especially if other tools already do.

The Solution P1: check_nodestatus xtprocadmin, xtcli, sinfo, sacct output is gathered from the SMW and pushed out to the nagios server via SSH/cron. Plugin runs on the nagios server, correlates the output, and provides local, passive check results for per-node services. Plugin maintains a state file and only returns changes each run. Master service is updated every run, provides node counts, acts as parent service, and allows freshness checking. Provides xtproc/xtcli/slurm state, job id/user, boot features, and slurm downtime/reason if applicable.

The Solution P2: Thruk provides basic regex searching and mass actions, allowing easy human interaction. An in-house notification digest collects and provides alerts at regular intervals. Guaranteed spam-free. RESTful communication with slurm (more on this later) informs nagios process of known reboot events and triggers downtimes.

Advantages: Extremely scalable, able to support our next system. Tested up to 100K services, updated every two minutes. Absolute minimum intrusion into the cluster. We do not touch the compute nodes. Processing is done on the nagios end. Outgoing SSH only from the cluster avoids security concerns. No need for another nagios helper-daemon (e.g. NSCA). Relies only on pre-existing tools and Python. Maintainable. Able to detect node changes and generate nagios configuration files. Cronjobs can easily take advantage of HA pairs, data can be drawn from several locations, minimum points of failure.

Other Applications: XT nagios services across the internal non-compute nodes: Uses same data, provides basic service node monitoring. Datawarp/Burstbuffer monitoring: Sessions, instances, fragments, etc. are linked together and reported on the lead node of each instance. HPSS monitoring: Tape events are communicated via a pair of non-sequential logs. Plugin detects new entries and groups them into several services according to a set of regex rules. Supports volatile and traditional alerts.

The Future: (In progress) Livestatus queries will allow a daemon to group together node events by job, time, reason, etc. to provide automatic ticket creation/resolution. Can be extended to query multiple nagios servers and correlate more complex events. Enhancements to the RESTful mechanism may be able to replace the SSH/cron method of gathering data.

Thank You