Linux Clusters Institute: Monitoring


Nathan Rini, National Center for Atmospheric Research (NCAR), nate@ucar.edu
Kyle Hutson, System Administrator, Kansas State University, kylehutson@ksu.edu

Why monitor?

Service Level Agreement (SLA)
- Which services must be provided by you?
- Which services must be provided to you?
- Regulatory requirements
- Contractual requirements
- Business requirements
Common Deliverables
- Availability of services (uptime)
- Mean time between failures (MTBF)
- Mean time to repair or mean time to recovery (MTTR)
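
The three deliverables can be computed from a list of outage records. The sketch below is a minimal illustration, not the presenters' method: the `Outage` record, the 720-hour reporting window, and the sample outages are all assumptions, and MTBF is taken here as total uptime divided by the number of failures (one common definition).

```python
from dataclasses import dataclass


@dataclass
class Outage:
    # Hypothetical record: hours since the start of the reporting window.
    start_h: float
    end_h: float


def sla_metrics(window_h, outages):
    """Return (availability, MTBF in hours, MTTR in hours) for one window."""
    downtime = sum(o.end_h - o.start_h for o in outages)
    uptime = window_h - downtime
    failures = len(outages)
    availability = uptime / window_h
    # With no failures, MTBF is unbounded and there is nothing to repair.
    mtbf = uptime / failures if failures else float("inf")
    mttr = downtime / failures if failures else 0.0
    return availability, mtbf, mttr


# Assumed example: a 30-day (720 h) window with a 4 h and a 2 h outage.
outages = [Outage(100.0, 104.0), Outage(400.0, 402.0)]
availability, mtbf, mttr = sla_metrics(720.0, outages)
print(f"availability={availability:.4f} MTBF={mtbf:.1f}h MTTR={mttr:.1f}h")
```

Whatever definitions are used, they should match the wording of the SLA itself, since the SLA decides what counts as downtime.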

Monitoring and Notification: Basic Flow
- Metric collection, metric aggregation: automated gathering of metrics
- Metric transformation, metric analysis: evaluate metrics against requirements
- Presentation, notification: prepare and share metrics for stakeholders

What to Collect (Metrics)
- Overall cluster health
  - Queue size
  - Jobs running
  - Jobs queued
  - Overall network usage
  - Number of responding nodes
- Individual node health
  - Load average
  - Memory used
  - Network bandwidth
  - CPU usage
  - Temperature
- Storage
  - Capacity
  - Degraded status
  - Connectivity
- Security
  - Logs of everything
- Power status
- Temperatures
  - Cold aisle
  - Switch exhausts
  - CPU temperatures

Metric Collection
Collection Tools (Common Tools)
  - Ganglia
  - Collectd
  - Perfmon
  - Performance Co-Pilot (PCP)
  - Nagios
  - Unified Fabric Manager (UFM)
  - Cacti
  - Syslog
  - TACC stats
  - Scripts
- Collection tools already exist to capture most metrics.
- No single tool will do everything you need unless you write it yourself.
- Try to avoid reinventing the wheel.

Metric Aggregation
Aggregation Tools
  - Ganglia
  - Collectd
  - Performance Co-Pilot (PCP)
  - Nagios
  - Unified Fabric Manager (UFM)
  - Cacti
  - Syslog
  - Round Robin Database (RRD)
- Metrics need to be gathered from all over the cluster to a single place for analysis and storage.
- Most metrics should travel over the management Ethernet so that monitoring traffic does not interfere with job performance on the low-latency interconnect.
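
The idea behind a round robin database is fixed-size storage: once the archive is full, each new sample overwrites the oldest one. The sketch below illustrates only that idea with a bounded `deque`. It is not the RRDtool file format or API, and the class name, slot count, and sample values are assumptions made for illustration.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size time series: the oldest sample is dropped when full."""

    def __init__(self, slots):
        # deque(maxlen=...) discards from the left once the limit is reached.
        self.samples = deque(maxlen=slots)

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


# Assumed example: keep only the three most recent load-average samples.
series = RoundRobinSeries(slots=3)
for timestamp, load in enumerate([0.5, 1.0, 1.5, 2.0]):
    series.add(timestamp, load)

print(list(series.samples))
print(series.average())
```

Because storage never grows, the aggregation host can keep one such series per metric per node without its disk usage changing over time.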

Metric Analysis and Transformation
Monitoring Conundrum
- Data is useless unless we do something with it
- We can collect much more data than we can analyse
- We generally won't know what data we need until we need it
  - Exception: data we must provide for SLA requirements
- Limited storage and processing capacity for metric analysis
  - This is less of an issue as drives get cheaper, but they also aren't getting much faster

Notification
Notification Tools
  - Nagios
  - Icinga
  - Zenoss
  - Zabbix
  - PRTG
  - OpenNMS
  - OP5
  - Pandora FMS
  - Unified Fabric Manager (UFM)
- Basic functionality of all alerts: red, yellow, green
- Many notification tools are forks of Nagios (Icinga, for example) or follow its model
- Notification tools can be passive or active in querying the status of the cluster

Notification
Monitoring for known evil
- Basis for all notifications
- Only alert if something known bad happens
Metrics -> Notifications
- Most tools will require extensive configuration to be useful
- Most tools will have a way to query metrics and create alerts
  - Some tools, such as Nagios, have this entire process built in
  - Others will have ways to bolt on this functionality
    - Nagios can query Ganglia
    - Ganglia can query Nagios
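
The "metrics -> notifications" step amounts to comparing each collected value against thresholds for a known-bad condition. The sketch below is a tool-independent illustration under assumed values: the metric names, the thresholds, and the `evaluate` and `alerts` helpers are hypothetical and do not correspond to any particular tool's configuration.

```python
GREEN, YELLOW, RED = "green", "yellow", "red"


def evaluate(value, warn, crit):
    """Map one metric value to a traffic-light state."""
    if value >= crit:
        return RED
    if value >= warn:
        return YELLOW
    return GREEN


def alerts(metrics, rules):
    """Return (metric, state) pairs for every metric that is not green."""
    found = []
    for name, value in metrics.items():
        if name not in rules:
            continue  # no known-bad condition defined, so never alert on it
        state = evaluate(value, *rules[name])
        if state != GREEN:
            found.append((name, state))
    return found


# Assumed thresholds as (warning, critical) and assumed current readings.
rules = {"load1": (16, 32), "mem_used_pct": (85, 95), "cpu_temp_c": (80, 90)}
metrics = {"load1": 18.2, "mem_used_pct": 71.0, "cpu_temp_c": 93.0}
print(alerts(metrics, rules))
```

Metrics without a rule are collected but never alerted on, which matches the "only alert if something known bad happens" principle.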

How should we get notified?
- Emergency: fire and smoke exiting machine
- Urgent: email, text, or phone call
  - Define this carefully
- Not-so-urgent:
  - Web page updates (especially helpful for historical data)
  - Email (filtered)
  - End-user support requests

SLA-based Alerts
Alerts on Deliverables
- Availability of services (uptime)
  - Example: alert if less than 98% of batch nodes are online
- Mean time between failures (MTBF)
  - Example: send email report of time between failures
- Mean time to repair or mean time to recovery (MTTR)
  - Example: alert if a down node does not come back online after 4 hours
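
The availability and MTTR examples above can be sketched as two small checks. The function names, the node names, and the way down-times are recorded (a mapping from node name to the epoch second at which it went down) are assumptions for illustration. A real deployment would pull this state from the scheduler or the monitoring tool.

```python
def availability_alert(nodes_online, nodes_total, min_fraction=0.98):
    """True if fewer than min_fraction of the batch nodes are online."""
    if nodes_total <= 0:
        raise ValueError("nodes_total must be positive")
    return nodes_online / nodes_total < min_fraction


def overdue_nodes(down_since, now, max_hours=4.0):
    """Return nodes that have been down for longer than max_hours.

    down_since maps node name -> epoch seconds when the node went down.
    """
    return sorted(
        node
        for node, went_down in down_since.items()
        if (now - went_down) / 3600.0 > max_hours
    )


# Assumed state: 97 of 100 nodes online; node01 down 5 h, node02 down 2 h.
now = 5 * 3600
down_since = {"node01": 0, "node02": 3 * 3600}
print(availability_alert(97, 100))
print(overdue_nodes(down_since, now))
```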

How often to alert?
SLA requirements
- If your SLA requires it, you may need to get called off-hours or even on holidays
You will quickly get a feel for this
- Too much info is often worse than too little info
- The urgent: continually
- The not-so-urgent: anywhere from a few times per day to once per week
- There's nothing wrong with trial and error
- Consider aggregated reports for the not-so-urgent
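
One way to keep the not-so-urgent from drowning out the urgent is to send urgent events immediately and fold everything else into a periodic digest. The sketch below assumes a hypothetical event format (dictionaries with `host`, `check`, and `severity` keys) and is only an illustration of the aggregation idea.

```python
from collections import Counter


def route(events):
    """Split events into immediate notifications and a counted digest."""
    urgent = [e for e in events if e["severity"] == "urgent"]
    digest = Counter(
        (e["host"], e["check"]) for e in events if e["severity"] != "urgent"
    )
    return urgent, digest


def format_digest(digest):
    """One summary line per (host, check) pair, repeats collapsed to a count."""
    return [
        f"{host} {check}: {count} occurrence(s)"
        for (host, check), count in sorted(digest.items())
    ]


# Assumed events gathered since the last report.
events = [
    {"host": "node07", "check": "disk", "severity": "warning"},
    {"host": "node07", "check": "disk", "severity": "warning"},
    {"host": "node12", "check": "ib_link", "severity": "urgent"},
    {"host": "node03", "check": "ntp", "severity": "warning"},
]
urgent, digest = route(events)
digest_lines = format_digest(digest)
print(urgent)
print("\n".join(digest_lines))
```

How often the digest goes out (a few times per day, daily, weekly) is the part to tune by trial and error.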

Security Alerts
Securing the cluster
- Security alerts may need to go to specific groups or people instead of normal operations
- Regulations and security rules may apply to the cluster and must be enforced
  - Compliance with regulations: Sarbanes-Oxley, FISMA, HIPAA, etc.
  - Active response, such as blocking IPs, may be required
Security status updates
- Alerts on security failures
- sudo reports
- Network login failures (e.g. fail2ban)
- crontab failures
- Logfile errors (customize to fit)
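
As an illustration of alerting on network login failures, the sketch below counts failed SSH password attempts per source address and reports addresses at or above a threshold. It is not how fail2ban is implemented. The log line format is assumed to be the usual OpenSSH "Failed password for ... from ..." message, and the sample lines, host names, and threshold of three are made up for the example.

```python
import re
from collections import Counter

# Assumed OpenSSH log message; adjust the pattern to the local log format.
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def failed_logins(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def offenders(counts, threshold=3):
    """Addresses with at least `threshold` failures, sorted for stable output."""
    return sorted(ip for ip, n in counts.items() if n >= threshold)


# Made-up sample lines using documentation-only IP ranges.
sample = [
    "Jan 10 03:12:01 login1 sshd[2211]: Failed password for invalid user admin from 203.0.113.7 port 52311 ssh2",
    "Jan 10 03:12:04 login1 sshd[2211]: Failed password for invalid user admin from 203.0.113.7 port 52311 ssh2",
    "Jan 10 03:12:09 login1 sshd[2213]: Failed password for root from 203.0.113.7 port 52340 ssh2",
    "Jan 10 08:30:15 login1 sshd[3001]: Failed password for alice from 198.51.100.4 port 40022 ssh2",
    "Jan 10 08:30:21 login1 sshd[3001]: Accepted password for alice from 198.51.100.4 port 40022 ssh2",
]
counts = failed_logins(sample)
print(offenders(counts))
```

The list of offenders could feed a report to the security group or an active response such as blocking the address.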

Example: Nagios
Nagios/NRPE (Nagios Remote Plugin Executor)
- Generic executable that runs plugins
- Plugins can monitor just about anything you can think of monitoring
- Even works with Windows
Nagios (http://www.nagios.org/) is by far the most common monitoring system
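
A plugin is simply a program that prints one status line and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The sketch below is a minimal disk-usage check in that style, with optional performance data after the `|`. The thresholds, the `DISK` label, and the function names are assumptions for illustration, not a plugin shipped with Nagios.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_disk(used_pct, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for a disk-usage percentage."""
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    # Text before '|' is for humans; after it is perfdata:
    # label=value[unit];warn;crit;min;max
    line = (
        f"DISK {LABELS[code]} - {used_pct:.0f}% used "
        f"| used={used_pct:.0f}%;{warn:.0f};{crit:.0f};0;100"
    )
    return code, line


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = check_disk(100.0 * usage.used / usage.total)
    print(line)
    return code


# A deployed plugin would end with: sys.exit(main())
```

Because the contract is only "one line of output plus an exit code", the same script can be run locally, through NRPE, or by any other tool that understands the convention.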

Example: Icinga
Icinga (https://www.icinga.org/)
- Can use NRPE
- (New) version 2 has its own client
- Uses database backend for history
- Multi-threaded and multihomed

Example: Ganglia
Ganglia (http://ganglia.sourceforge.net/): for historical and resource monitoring
- Ours are public
- RRD files give historical data (a.k.a. "lots of pretty graphs")

Monitoring Future
- Large data analysis using machine learning