Cloud Monitoring as a Service. Built On Machine Learning

Similar documents
Elastic Efficient Execution of Varied Containers. Sharma Podila Nov 7th 2016, QCon San Francisco

WHITEPAPER AMAZON ELB: Your Master Key to a Secure, Cost-Efficient and Scalable Cloud.

The Art of Container Monitoring. Derek Chen

Using AppDynamics with LoadRunner

Automating Elasticity. March 2018

Real-time Monitoring, Inventory and Change Tracking for. Track. Report. RESOLVE!

TUTORIAL: WHITE PAPER. VERITAS Indepth for the J2EE Platform PERFORMANCE MANAGEMENT FOR J2EE APPLICATIONS

Table of Contents HOL-SDC-1317

AWS Reference Architecture - CloudGen Firewall Auto Scaling Cluster

VMware vrealize Operations for Horizon Administration

Best Practices for Alert Tuning. This white paper will provide best practices for alert tuning to ensure two related outcomes:

PrepAwayExam. High-efficient Exam Materials are the best high pass-rate Exam Dumps

Deploying and Operating Cloud Native.NET apps

Elastic Load Balancing

Documentation. This PDF was generated for your convenience. For the latest documentation, always see

Oracle Enterprise Manager 12c IBM DB2 Database Plug-in

Monitoring & Tuning Azure SQL Database

Architectural overview Turbonomic accesses Cisco Tetration Analytics data through Representational State Transfer (REST) APIs. It uses telemetry data

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN. PostgresConf US

AWS Solution Architecture Patterns

Deploying and Operating Cloud Native.NET apps

ENTERPRISE INTEGRATION AND DIGITAL ROBOTICS

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Cisco Tetration Analytics

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN

AppDynamics Lite vs. Pro Edition

Correlative Analytic Methods in Large Scale Network Infrastructure Hariharan Krishnaswamy Senior Principal Engineer Dell EMC

Oracle Enterprise Manager 12c Sybase ASE Database Plug-in

Using AWS to Build a Large Scale Dockerized Microservices Architecture. Dr. Oliver Wahlen moovel Group GmbH Frankfurt, 30.

Amazon Web Services (AWS) Solutions Architect Intermediate Level Course Content

Two years of on Kubernetes

Price Performance Analysis of NxtGen Vs. Amazon EC2 and Rackspace Cloud.

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee

First Look at Built-in Autoscaling and Alerting. Paul blog.paulbouwer.com

VMware vcloud Architecture Toolkit Cloud Bursting

Build a system health check for Db2 using IBM Machine Learning for z/os

Veritas Storage Foundation for Windows by Symantec

Expert Reference Series of White Papers. Introduction to Amazon Auto Scaling

STATE OF MODERN APPLICATIONS IN THE CLOUD

Logging, Monitoring, and Alerting

Level 3 SM Enhanced Management - FAQs. Frequently Asked Questions for Level 3 Enhanced Management

Challenges of Capacity Management in Large Mixed Organizations

POSTGRESQL ON AWS: TIPS & TRICKS (AND HORROR STORIES) ALEXANDER KUKUSHKIN. PGConf.EU 2017, Warsaw

Performance Analysis of Virtual Machines on NxtGen ECS and Competitive IaaS Offerings An Examination of Web Server and Database Workloads

BIG-IP Analytics: Implementations. Version 13.1

Load Dynamix Enterprise 5.2

Pass4test Certification IT garanti, The Easy Way!

How can you implement this through a script that a scheduling daemon runs daily on the application servers?

Zero to Microservices in 5 minutes using Docker Containers. Mathew Lodge Weaveworks

ControlUp v7.1 Release Notes

InterAct User. InterAct User (01/2011) Page 1

Getting Started With Amazon EC2 Container Service

AWS Integration Guide

VMAX: PERFORMANCE MADE SIMPLE

VMWARE VREALIZE OPERATIONS MANAGEMENT PACK FOR. NetApp Storage. User Guide

Sunil Shah SECURE, FLEXIBLE CONTINUOUS DELIVERY PIPELINES WITH GITLAB AND DC/OS Mesosphere, Inc. All Rights Reserved.

Interoperability First Published On: Last Updated On:

Scalability Engine Guidelines for SolarWinds Orion Products

Lies, Damn Lies and Performance Metrics. PRESENTATION TITLE GOES HERE Barry Cooks Virtual Instruments

Data Sheet. Monitoring Automation for Web-Scale Networks MONITORING AUTOMATION FOR WEB-SCALE NETWORKS -

70-532: Developing Microsoft Azure Solutions

Key metrics for effective storage performance and capacity reporting

Cisco Tetration Analytics

Intelligent QoS Grid for Virtualized Workloads Gaurav Gupta Tata Consultancy Services

VMware Cloud on AWS Technical Deck VMware, Inc.

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

@joerg_schad Nightmares of a Container Orchestration System

ENHANCE APPLICATION SCALABILITY AND AVAILABILITY WITH NGINX PLUS AND THE DIAMANTI BARE-METAL KUBERNETES PLATFORM

Server Status Dashboard

70-532: Developing Microsoft Azure Solutions

Microsoft SQL Server Fix Pack 15. Reference IBM

MQ Monitoring on Cloud

Auto Management for Apache Kafka and Distributed Stateful System in General

Analytics and Visualization

4 Effective Tools for Docker Monitoring. By Ranvijay Jamwal

VMware vrealize Operations. Management Pack for. PostgreSQL

AALOK INSTITUTE. DevOps Training

Retrospective: The Magento Commerce Cloud at Work

VMware vrealize Operations for Horizon Administration

Vmware VCP550PSE. VMware Certified Professional on vsphere 5.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

APPLICATION NOTE. XCellAir s Wi-Fi Radio Resource Optimization Solution. Features, Test Results & Methodology

March 10 11, 2015 San Jose

CA ecometer. Overview. Benefits. agility made possible. Improve data center uptime and availability through better energy management

How to keep capacity predictions on target and cut CPU usage by 5x

Training on Amazon AWS Cloud Computing. Course Content

Container Orchestration on Amazon Web Services. Arun

Simplifying Data Management. With DataStax Enterprise (DSE) OpsCenter

VirtualWisdom â ProbeNAS Brief

BUYER S GUIDE TO AWS AND AZURE RESERVED INSTANCES

Amazon Web Services Training. Training Topics:

Measuring HEC Performance For Fun and Profit

Enroll Now to Take online Course Contact: Demo video By Chandra sir

Datasheet FUJITSU Software Cloud Monitoring Manager V2.0

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona

CLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters

Intelligent QoS Grid for Virtualized Workloads

Enterprise2014. GPFS with Flash840 on PureFlex and Power8 (AIX & Linux)

Performance Sentry VM Provider Objects April 11, 2012

SMB 3.0 Performance Dan Lovinger Principal Architect Microsoft

Transcription:

Cloud Monitoring as a Service Built On Machine Learning

Table of Contents 1 2 3 4 5 6 7 8 9 10 Why Machine Learning Who Cares Four Dimensions to Cloud Monitoring Data Aggregation Anomaly Detection Algorithms Preset Best Practices Capacity Analysis Cost Analysis Dynamic Resource Allocation Conclusion

Why Machine Learning 1 2 3 Don t Waste Time Gain Extra Insights Don t Forget Capacity & Cost Administering and scaling your own monitoring stack Creating your own integrations Setting thresholds one metric at a time Aggregate data by cluster and by application See what s abnormal in your environment Leverage industry best practices for alerting and reporting Reuse the same agent to gather capacity and performance data Rely on preset metrics to measure multiple dimension of capacity AWS cost analysis gets complicated fast

Who Cares DevOps Manager DevOps Engineer I want automation and I don t want to wake up in the middle of the night I want my team to be efficient, I want to avoid outages, and I don t want to spend weekends going through AWS bills Executive I want our customer-facing applications to be available and fast, and our cloud hosting costs to be low

Four Dimensions To Cloud Monitoring A modern and extensible platform for checking the status of server, process, port up/down Latency and throughput for cloud, middleware and custom application metrics Resource utilization measurement in multiple dimensions including I/O, memory, network, connection pool, etc. Simplify the complexity of cloud billing including resizing, reservation, spot instances, and auto-scaling

Data Aggregation A Function for Each Purpose For an ever-increasing counter, you want to know the delta over time For latency, you want to know the weighed average For errors, you want to know the sum during an interval Filter and Group Metrics User-specified metric-level aggregation is powerful For example, you could average CPU utilization across all EC2s with a given application tag in a particular AWS region Or the average latency of a method call across a cluster of Docker containers More sophisticated expressions account for attribute meta-data such as hardware rating for IOPS for an m3.xlarge EC2 to identify unusual process queue for I/O, or the number of CPU cores that can be used to normalize the Linux load

Anomaly Detection Algorithms Green Band: Single-Variate. What is the value usually at this time? Blue Band: Multi-Variate. What should it be as a function of other correlated metrics? Multi-variate regression analysis discovers the regression weights and correlation coefficients to determine where a metric should be as a function of other inter-dependent metrics, for a more accurate estimation. Clusters Anomaly detection on load-balanced clusters of nodes or containers is best accomplished by analyzing the aggregated values across the entire cluster vs. at the ephemeral node level. The aggregated values conform to the workload patterns driven by the application usage traffic. Sudden Change: Is the trend-line drastically changing up or down? What if you don t have multiple metrics to correlate? What if your green band of normalcy is too wide due to high standard deviation? In those cases, you can detect sudden changes in the trend-lines. What s important is how steep was the rise and fall and how long it lasted.

Preset Best Practices Accurate Anomaly Detection Domain knowledge must complement statistical analysis to achieve an actionable level of accuracy in anomaly detection. Industry Best Practices Preset configuration of dashboards and alerts based on industry best practices avoid the need for manual configuration by end-users upon activation of integrations with common technologies, such as the examples listed below: Accuracy Multi-Variable Statistical Analysis Learn behavior accurately Accurate Anomaly Detection Control Domain Knowledge Preset Best Practices Look for combination of deviations Multi-Conditional Policies DON T WAKE ME UNLESS Multi-conditional alerting policies result in control, accuracy, and reasoning that are required to deliver actionable alerts. It is recommended for un-interrupted night of sleep. 1 I/O wait > 20% Database ELB round-trip 2 connections drop 3 latency jumps

Capacity Analysis Your application platform has finite computing capacity (0 to 100%), but there are multiple types of resources. Some are intuitive to measure (ex. CPU). Some can be measured if you know the meta-data attribute (ex. 1G vs. 10G network card). Others will remain subjective, but a high water mark can help estimate (ex. we can process 10k concurrent users). Use only one agent to collect both performance and capacity time-series data. CPU MEM NET DISK DB QUEUE LB LATENCY UX BIZ SQL Box You would miss a short-lived spike in IOPS if you Whisker Whisker take the average across a 1 or 5 minute interval. For accurate capacity planning, you must roll-up data while preserving: standard deviation, min, max, percentiles, average and median. Min Lower Quartile Q1 Median Q2 Upper Quartile Q3 Max

Cost Analysis Group cost and capacity utilization by AWS tags or attributes Filter by AWS tags and attributes. Filter according to cost and capacity utilization Cross-Analyze Cross-analyze capacity utilization with AWS cost data, and use business intelligence oriented concepts to filter and group by meta-data. It helps you attribute cost to application s microservices, plan your AWS reservation by instance type, or be notified of sudden changes in utilization or cost. Dots on the chart represent capacity utilization Bars represent AWS cost RELY ON EMAILED REPORTS TO PUSH VS. PULL SAVINGS IDEAS Automate You can automate the process of searching through thousands of EC2 configurations to match the optimal EC2 configuration according to each individual instance s workload patterns, including memory utilization and I/O.

Dynamic Resource Allocation Whether you are using Auto-Scaling Groups (ASG), Elastic Container Service (ECS), or Kubernetes, it helps to: 1. Identify & collect KPIs 2. Analyze Capacity 1. Identify Key Performance Indicators for scaling (ex. queue depth, average cluster IOPS, number of user sessions). 2. Use a tool that allows what-if analysis simulation based on your historic data. Minimum Node Count Increased Node Count For Peak Traffic 3. Analyze Cost 3. Use the above analysis to automatically update your ASG rule-set, Cloud Formation template, or Terraform module. 4. Adjust Capacity hourly Estimated cost savings compared to current configuration Simulated utilization levels across the cluster Recommended hourly node count based on simulation of historic data A what-if analysis simulator for auto-scaling leverages your historic data to identify workload patterns and recommend optimal hourly node count according to your risk tolerance.

Conclusion Use Monitoring as a Service to stay focused on managing your main application stack instead of your monitoring stack and rely on help from technical support. Use machine learning to automate tedious manual tasks. Not only does it save time, but you will appreciate that certain types of analysis are not humanly possible. Competition is pushing vendors to simplify and automate. Machine learning, pre-configured dashboards, and alerts make it easy to start in minutes.

Start a free trial at