Best Practices for Alert Tuning

This white paper will provide best practices for alert tuning to ensure two related outcomes:

1. Monitoring is in place to catch critical conditions and alert the right people.
2. Noise is reduced so people are not needlessly woken up.

These outcomes are essential to a successful monitoring strategy. What follows are some critical issues to consider when designing a new monitoring system or reviewing an existing one.

Components of a Monitored Item

Different monitoring solutions use different terminology for the items they monitor. Usually there is a hierarchical system: high-level groups such as a data center, then lower-level groups and individual hosts below that. At the host level there are monitoring categories such as disk, CPU, and so on, and below those the specific instances to be monitored, such as /dev/sda mounted at /. Whatever the organizational structure of your monitoring solution, each monitored item should have the following components defined:

Metric: The measure being monitored, such as CPU percent used.
Threshold: The definition of when the metric is considered to be in a less than optimal state.
Alert Level: The level of urgency associated with a given state, usually warning, error, or critical.
Action to Resolve: The action items associated with the alert.
Alert Routing: Where and how the alert is to be delivered.

Failure to consider these components when developing a monitoring solution increases the likelihood of an alert system failure. Alert systems fail when they generate too much noise, which can lead staff to miss or ignore critical alerts, but they can also fail by not catching critical conditions, whether through a lack of flexibility, missing monitoring capability, or bad configuration. Setting the correct threshold when rolling out a monitoring system can be difficult. Good monitoring tools have sensible defaults out of the box, but these will have to be tuned over time.
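As a rough sketch, the five components above can be modeled as a simple structure. The class and field names here are illustrative, not any particular monitoring tool's API:

```python
from dataclasses import dataclass


@dataclass
class MonitoredItem:
    """Hypothetical model of one monitored instance and its alert config."""
    metric: str          # e.g. "cpu.percent_used"
    threshold: float     # value at which the metric is unhealthy
    alert_level: str     # "warning", "error", or "critical"
    action: str          # runbook step(s) shown to the recipient
    routing: str         # team or channel that receives the alert

    def evaluate(self, value: float):
        """Return an alert dict if the threshold is crossed, else None."""
        if value >= self.threshold:
            return {
                "metric": self.metric,
                "level": self.alert_level,
                "value": value,
                "action": self.action,
                "route_to": self.routing,
            }
        return None


cpu = MonitoredItem("cpu.percent_used", 90.0, "warning",
                    "Check runnable processes with 'ps'/'top'", "ops-oncall")
print(cpu.evaluate(96.0))  # threshold crossed -> alert dict
print(cpu.evaluate(40.0))  # healthy -> None
```

The point of the structure is that an item missing any one of the five fields is incompletely specified: an alert with no action is noise, and an alert with no routing reaches no one.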
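Threshold tuning from history can also be sketched. This is a minimal baseline rule, assuming a simple mean-plus-three-standard-deviations model over recent samples; real monitoring tools use more sophisticated dynamic-threshold models:

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Set an alert threshold k standard deviations above the recent mean."""
    return mean(samples) + k * stdev(samples)


# Hypothetical recent CPU-usage samples for one host, in percent.
history = [42, 45, 40, 47, 44, 43, 46, 41]
print(round(baseline_threshold(history), 1))  # → 50.8
```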
For example, CPU usage of 96% is not necessarily a bad thing for all systems, but it will be for some. Additionally, some monitoring tools can automatically configure alert thresholds based on historical metric measurements. This provides a threshold baseline and, ideally, forecasts future issues based on threshold trends. Alert levels also need to be tuned over time. Again, good defaults help, but a warning alert on one set of systems could be a critical alert on others. One of the most commonly overlooked requirements, and one that leads to alerts being ignored, is the need to define action items. Good monitoring solutions have reasonable default descriptions that indicate the likely cause of a particular condition. In my preferred monitoring solution, LogicMonitor, we see the following message associated with an alert on CPU usage:

The 5 minute load average on server123 is now 99, putting it in a state of critical. This started at 2016-05-11 11:38:42 PDT. See which processes are consuming CPU (use 'ps' command to see which processes are in runnable and waiting state; 'top' command will show individual cpu core usage). Troubleshoot or adjust the threshold.

As a default message this is pretty useful: it tells the recipient the basic steps for troubleshooting. These messages can be customized for known situations, such as high traffic requiring more application servers to be added to a cluster. Avoid implementing monitoring alerts that have no actionable response. They just add noise and increase the frustration of those receiving the alerts.

Avoid Alert Spamming

[Figure: Dashboard in LogicMonitor showing a NOC Overview]

Too many alerts going off too frequently creates alert spamming, sometimes referred to as an alert storm. This is extremely dangerous and can have many adverse consequences, up to and including the failure of system monitoring. Too many alerts can result in real critical issues going unnoticed, and can lead to critical alerts being accidentally disabled in an effort to cut down the noise. Every alert must be meaningful and have an associated action.

Correctly routing alerts is also important. If the action item is to call someone else, then the alert is not meaningful. Don't turn your admins into receptionists! This is particularly important in a DevOps culture where developers are on-call and different teams are responsible for different systems. Certain metrics will indicate a problem at the application layer, where developers are best placed to resolve the situation, while problems relating to the database should be handled by the DBA. While it makes sense for your operations people to be the first line for system failures, if their only action item is to call a developer or a DBA, this quickly becomes demoralizing. Similarly, if a developer is woken up for what is obviously a system failure, that is also a problem. There will always be situations where one team has to escalate to another, but these should be the exception rather than the norm.
In LogicMonitor, you can avoid alert spamming by setting up custom escalation chains and alert rules that route alerts to the right people when an issue occurs.
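The routing idea described above can be sketched as a small rule table. The metric prefixes and team names here are hypothetical, not LogicMonitor's actual alert-rule syntax:

```python
# Ordered routing rules: first matching metric prefix wins.
# An empty prefix matches everything, so it acts as the default route.
ALERT_RULES = [
    ("application.", "dev-oncall"),   # app-layer problems go to developers
    ("database.",    "dba-oncall"),   # database problems go to the DBA
    ("",             "ops-oncall"),   # everything else goes to operations
]


def route(metric: str) -> str:
    """Return the on-call team responsible for a given metric."""
    for prefix, team in ALERT_RULES:
        if metric.startswith(prefix):
            return team
    return "ops-oncall"


print(route("application.error_rate"))    # → dev-oncall
print(route("database.replication_lag"))  # → dba-oncall
print(route("host.disk_usage"))           # → ops-oncall
```

Routing at alert-creation time like this, rather than relaying pages by phone, keeps each team's escalations the exception rather than the norm.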

Managing Downtime

Your chosen monitoring solution should offer fine-grained control for setting downtime, ranging from disabling all alerts during a full-downtime deployment to controlling individual hosts and monitored instances for ongoing maintenance. A fine-grained approach will, for example, let you set downtime for a single hard drive on a given host. If you are inadvertently alerted for a host that is not currently in production, you may decide to disable all metrics while still ensuring that the host is up and responding to network pings. Disabling all monitoring for a host may also be desirable: if a host has been removed from a production cluster but not fully decommissioned, you may decide not to monitor it at all.

Host groups are useful for managing sets of related systems. For example, dev, QA, and production groups give a good high-level organization for managing downtime, because during QA deployments you may decide to disable all hosts in the QA group. For automated deployments, a good monitoring system will provide an API that allows automation tools to specify downtime. For scheduled maintenance, the monitoring solution should let you schedule repeatable maintenance windows in advance.

Downtime should be scheduled at the most granular level possible. If you are increasing the size of a disk, for example, don't set the entire host to downtime. If you know it will take 24 hours to fix the issue, schedule 24 hours of downtime. Downtime should expire automatically to avoid forgotten settings.

Missed Alerts

[Figure: View of Alerts in LogicMonitor]

Even the best monitoring systems will occasionally let a system failure go undetected. This is bad at any time, but particularly just after replacing one monitoring system with another. No one remembers the thousands of false alerts produced by your home-rolled Nagios when your shiny new SaaS system misses a production-down incident.

To counter the skepticism this will inevitably create towards your new system, it is essential to have good blameless postmortem practices in place for every outage. A useful checklist of questions to ask in such a postmortem includes:

1. What was the root cause of the failure?
2. What state were the systems in prior to the failure?
3. How much CPU was consumed, and how much memory?
4. Were we swapping?
5. Were there alerts that should have triggered?
6. Were the thresholds correct?

Data graphing should assist with this: compare the graphs under normal system usage with those just prior to the failure. With LogicMonitor, you can do this by building dashboards that display relevant graphs and metrics across your entire infrastructure. After discussing all of the above, the next set of questions relates to what can be done to prevent a recurrence. The issue should not be marked as resolved until monitoring is in place to prevent a repeat.

Avoiding Email Overload

Most places I have worked go to great lengths to prevent false alerts that could page people and wake them up in the night. Email alerts, however, are often another story. Some companies allow thousands of email alerts to fire for warning-level situations that often self-correct, drowning out warnings that are genuinely useful, such as disk usage warnings that let a systems administrator act before an alert reaches critical and wakes someone during the night. Unfortunately, it is not realistic to manually scan a thousand emails for something that may actually be useful. Good monitoring tools like LogicMonitor provide dashboards to indicate alert status.

Alert Forecasting

If you want to take your alert strategy to the next level, you should add alert forecasting to the mix.
Standalone alerts notify you of specific changes in metrics, but alert forecasting lets you predict how a metric will behave in the future. Forecasting enables you to spot metric trends and know when a metric will reach a designated alert threshold. With this information, you can prioritize the urgency of performance issues, refine budget projections, and plan resource allocation. Useful examples of alert forecasting include predicting when a disk will be at 95% capacity and predicting your total monthly AWS bill. LogicMonitor includes alert forecasting, which you can use to predict alerts up to three months in the future.
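As an illustration of the disk-capacity example, a minimal forecast can be built from a least-squares linear trend over recent measurements. This is a sketch under simplifying assumptions (a linear trend, one sample per day); real forecasting features use more sophisticated models, and the function name here is hypothetical:

```python
def days_until_threshold(samples, threshold):
    """Fit a straight line to daily samples and estimate how many days
    after the last sample the metric will cross the threshold.
    Returns None if the metric is flat or falling."""
    n = len(samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    # Least-squares slope and intercept.
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) \
        / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    cross = (threshold - intercept) / slope   # day index where the trend crosses
    return max(0.0, cross - (n - 1))          # days beyond the last sample


disk = [70, 72, 74, 76, 78]            # disk usage %, one sample per day
print(days_until_threshold(disk, 95))  # → 8.5 (growing 2% per day)
```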

Reporting

Good reporting can help with the email overload described above. One valuable recommendation I received in a LogicMonitor training session was to generate a nightly report of all warning-level alerts and disable their email delivery. Admins can then check the morning report and address any warnings before they escalate to error or critical levels. Another helpful report lists all alert thresholds. This is useful for checking the currently set thresholds, but most importantly it will show any disabled thresholds. A common response to excess alert noise is to disable alerting for a metric, or even an entire host. This is often done with the intention of addressing the issue in the morning, but higher priorities can intervene and it is forgotten until there is some catastrophe that isn't alerted on.

Conclusion

Modern monitoring solutions can provide a robust infrastructure monitoring framework that alerts when necessary while avoiding alert overload. It is important to realize, however, that even the best tools require careful implementation and ongoing tuning and improvement. This is not only the responsibility of the operations people maintaining the monitoring systems. As new features are designed, careful thought should be given to how they will be monitored. What metrics indicate the normal functioning of the application or the overall system? What are the acceptable thresholds for a given metric? At what point should we be warned, and what do we do to prevent a warning escalating to a critical situation? The complex nature of web and SaaS platforms today means that monitoring cannot be seen as something operations teams simply apply to production infrastructure. It requires cross-team collaboration to ensure we are monitoring the right things and alerting the right people. This is the key to a successful monitoring strategy.
This content is brought to you by LogicMonitor, the automated SaaS performance monitoring platform that provides IT Ops teams with end-to-end visibility and actionable metrics to manage today's sophisticated on-premises, hybrid, and cloud infrastructures. Sign up for a free 14-day trial today.

[Figure: Custom Threshold Report in LogicMonitor]