Federated Prometheus Monitoring at Scale

Similar documents
Operating Within Normal Parameters: Monitoring Kubernetes

Kubernetes. An open platform for container orchestration. Johannes M. Scheuermann. Karlsruhe,

Kubernetes introduction. Container orchestration

Triangle Kubernetes Meet Up #3 (June 9, 2016) From Beginner to Expert

Two years of on Kubernetes

Introduction to Kubernetes

Package your Java Application using Docker and Kubernetes. Arun

Kubernetes: Twelve KeyFeatures

An Introduction to Kubernetes

The Art of Container Monitoring. Derek Chen

Monitoring Cloud Native applications with Prometheus. Aaron Weaveworks

GoDocker. A batch scheduling system with Docker containers

How to build scalable, reliable and stable Kubernetes cluster atop OpenStack.

Continuous delivery while migrating to Kubernetes

Prometheus For Big & Little People Simon Lyall

Building a Kubernetes on Bare-Metal Cluster to Serve Wikipedia. Alexandros Kosiaris Giuseppe Lavagetto

Important DevOps Technologies (3+2+3days) for Deployment

Containers Infrastructure for Advanced Management. Federico Simoncelli Associate Manager, Red Hat October 2016

Understanding and Evaluating Kubernetes. Haseeb Tariq Anubhavnidhi Archie Abhashkumar

Kubernetes - Load Balancing For Virtual Machines (Pods)

Container Orchestration on Amazon Web Services. Arun

Building an on premise Kubernetes cluster DANNY TURNER

Question: 2 Kubernetes changed the name of cluster members to "Nodes." What were they called before that? Choose the correct answer:

VMware Integrated OpenStack with Kubernetes Getting Started Guide. VMware Integrated OpenStack 4.1

A REFERENCE ARCHITECTURE FOR DEPLOYING WSO2 MIDDLEWARE ON KUBERNETES

Multiple Networks and Isolation in Kubernetes. Haibin Michael Xie / Principal Architect Huawei

OpenShift 3 Technical Architecture. Clayton Coleman, Dan McPherson Lead Engineers

Developing Kubernetes Services

Kubernetes 101. Doug Davis, STSM September, 2017

Hacking and Hardening Kubernetes

Launching StarlingX. The Journey to Drive Compute to the Edge Pilot Project Supported by the OpenStack

TensorFlow on vivo

Kubernetes 1.9 Features and Future

Code: Slides:

More Containers, More Problems


Using Custom Resources to Provide Cloud Native API Management Frank B Greco Jr, Cloud Native Engineer, Northwestern Mutual

Project Calico v3.2. Overview. Architecture and Key Components. Project Calico provides network security for containers and virtual machine workloads.

KUBERNETES IN A GROWN ENVIRONMENT AND INTEGRATION INTO CONTINUOUS DELIVERY

Kubernetes: What s New

Building Kubernetes cloud: real world deployment examples, challenges and approaches. Alena Prokharchyk, Rancher Labs

Implementing SaaS on Kubernetes

DevOps Technologies. for Deployment

$ wget V SOLUTIONS.tar.bz2 \ --user=lftraining --password=penguin2014

Kuberiter White Paper. Kubernetes. Cloud Provider Comparison Chart. Lawrence Manickam Kuberiter Inc

Evolution of Kubernetes in One Year From Technical View

Think Small to Scale Big

Full Scalable Media Cloud Solution with Kubernetes Orchestration. Zhenyu Wang, Xin(Owen)Zhang

Kubernetes objects on Microsoft Azure

Hynek Schlawack. Get Instrumented. How Prometheus Can Unify Your Metrics

Maximizing Network Throughput for Container Based Storage David Borman Quantum

Installation and setup guide of 1.1 demonstrator

Kubernetes Integration with Virtuozzo Storage

Kubernetes. Introduction

IBM 开源技术微讲堂容器技术与微服务系列

Open-Falcon A Distributed and High-Performance Monitoring System. Yao-Wei Ou & Lai Wei 2017/05/22

agenda PAE Docker Docker PAE

CONTAINERS AND MICROSERVICES WITH CONTRAIL

CHALLENGES IN A MICROSERVICES AGE: MONITORING, LOGGING AND TRACING ON OPENSHIFT. Martin Etmajer Technology May 4, 2017

Kubernetes 1.8 and Beyond

S Implementing DevOps and Hybrid Cloud

Continuous Delivery of Micro Applications with Jenkins, Docker & Kubernetes at Apollo

Using Prometheus with InfluxDB for metrics storage

Getting Started with VMware Integrated OpenStack with Kubernetes. VMware Integrated OpenStack 5.1

Table of Contents. Section 1: Overview 3 NetScaler Summary 3 NetScaler CPX Overview 3

DEVELOPER INTRO

VMWARE PKS. What is VMware PKS? VMware PKS Architecture DATASHEET

Wrapp. Powered by AWS EC2 Container Service. Jude D Souza Solutions Wrapp Phone:

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Project Calico v3.1. Overview. Architecture and Key Components

Kubernetes Integration Guide

Real-time Monitoring, Inventory and Change Tracking for. Track. Report. RESOLVE!

Scheduling in Kubernetes October, 2017

Dan Williams Networking Services, Red Hat

A Comparision of Service Mesh Options

Kubernetes made easy with Docker EE. Patrick van der Bleek Sr. Solutions Engineer NEMEA

The InfluxDB-Grafana plugin for Fuel Documentation

StreamSets Control Hub Installation Guide

Blockchain on Kubernetes

Ray Tsang Developer Advocate Google Cloud Platform

Introduction to the Open Service Broker API. Doug Davis

Bringing Security and Multitenancy. Lei (Harry) Zhang

What s New in Kubernetes 1.12

Taming Distributed Pets with Kubernetes

OpenShift Roadmap Enterprise Kubernetes for Developers. Clayton Coleman, Architect, OpenShift

Kubernetes Container Networking with NSX-T Data Center Deep Dive

Cloud Native Java with Kubernetes

Managing your microservices with Kubernetes and Istio. Craig Box

Service discovery in Kubernetes with Fabric8

Internals of Docking Storage with Kubernetes Workloads

& the architecture along the way!

Kubernetes Love at first sight?

K8s(Kubernetes) and SDN for Multi-access Edge Computing deployment

Datacenter Network Solutions Group

Blockchain on Kubernetes

Top Nine Kubernetes Settings You Should Check Right Now to Maximize Security

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Table of Contents HOL CNA

Prometheus. A Next Generation Monitoring System. Brian Brazil Founder

Alertmanager and high availability Frederic Branczyk

Transcription:

Federated Prometheus Monitoring at Scale LungChih Tung Oath Nandhakumar Venkatachalam Oath

Team Core Platform Team powering all Yahoo Media Products Yahoo Media Products Homepage, News Finance Sports, Fantasy

Journey to Kubernetes STAGE colo1 ETCD Master Nodes Nodes Nodes Nodes CANARY ETCD Master colo1 Nodes Nodes Nodes Nodes PROD colo1 ETCD Master Nodes Nodes Nodes Nodes

Journey to Kubernetes Monitoring Core system monitoring Heapster InFluxDB sink Prometheus 1.x Remote write plugin Missing insight on cluster metrics

Growing Kubernetes Clusters 12 independent clusters STAGE 2K+ Worker Nodes CANARY CANARY CANARY CANARY CANARY CANARY Colo n 12K+ Pods 50K+ Containers 200+ Application deployments PROD PROD PROD PROD PROD PROD PROD Colo n Peak requests/sec - 400K+

Demand for visibility How is the cluster utilization? Who is the big consumer? How is the etcd/ control plane health? How is the pod/container cpu/mem utilization? more...

Monitoring Requirements Cluster Health: ETCD Controller Manager Scheduler Kubernetes API server Kubelet Kubelet CAdvisor Kube DNS Any Add-ons... Application Health: Namespace Deployment Pod Container

Prometheus 2.0 Huge performance improvement specifically in storage Simple syntax for aggregation and alerting rules Great documentation Scrape all prometheus metrics

Metrics Collection Easy configuration ex: Prometheus example yaml API server metrics/ Controller/ Scheduler ETCD Kubelet and cadvisor Kube-state-metrics kube dns/ heapster etc etc Endpoint discovery is simple. annotation prometheus.io/scrape=true Simple File based discovery for apiservers

Metrics Collection - API Server Prometheus example yaml Endpoint discovery is simple. annotation prometheus.io/scrape=true Simple File based discovery for apiservers Adding datacenter label

Metrics Proxy ETCD ports are accessible only by master machines Scheduler and controller ports bind to 127.0.0.1 ETCD node Metrics Proxy forward only /metrics Port: 5443 X Port: 2379,2380

Metrics Proxy - ETCD

Metrics Proxy - Controller

Prometheus Federation

Federation Cluster 1 Cluster 2 Selective and Aggregated metrics from each cluster Federated Promo Cluster n Aggregated time series data Longer retention period Permanent storage Unified display of data

Federation Configuration Federate all control plane components Selective aggregated metrics

Aggregation Rules Ton of rules from prometheus operator team Built CPU and Memory utilization by Colo(cluster) level Namespace/ Deployment/ Pod and Container level

Aggregation - Colo level # colo level cpu utilization - record: colo:cpu_percentage:rate expr: 100 * CPU usage of all containers per colo sum(label_replace(irate(container_cpu_usage_seconds_total{container_name!="", container_name!="pod"}[5m]), "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo) / sum(label_replace(container_spec_cpu_shares{container_name!="", container_name!="pod"}, "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo) * 1000 CPU shares allocated to all containers per colo

Aggregation - Namespace level # namespace level cpu utilization - record: colo_namespace:cpu_percentage:rate expr: 100 * CPU usage of all containers per namespace sum(label_replace(irate(container_cpu_usage_seconds_total{container_name!="", container_name!="pod"}[5m]), "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace) / sum(label_replace(container_spec_cpu_shares{container_name!="", container_name!="pod"}, "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace) * 1000 CPU shares allocated to all containers per namespace

Aggregation - Controller level # controller level cpu utilization - record: colo_namespace_controller:cpu_percentage:rate expr: 100 * CPU usage of all containers per controller sum(label_replace(irate(container_cpu_usage_seconds_total{container_name!="", container_name!="pod"}[5m]), "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller) / sum(label_replace(container_spec_cpu_shares{container_name!="", container_name!="pod"}, "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller) * 1000 CPU shares allocated to all containers per controller

Aggregation - Pod level # pod level cpu utilization - record: colo_namespace_controller_pod:cpu_percentage:rate CPU usage of all containers per pod expr: 100 * sum(label_replace(irate(container_cpu_usage_seconds_total{container_name!="", container_name!="pod"}[5m]), "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller, pod_name) / sum(label_replace(container_spec_cpu_shares{container_name!="", container_name!="pod"}, "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller, pod_name) * 1000 CPU shares allocated to all containers per pod

Aggregation - Container level # container level cpu utilization - record: colo_namespace_controller_pod_container:cpu_percentage:rate CPU usage per container expr: 100 * sum(label_replace(irate(container_cpu_usage_seconds_total{container_name!="", container_name!="pod"}[5m]), "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller, pod_name, container_name) / sum(label_replace(container_spec_cpu_shares{container_name!="", container_name!="pod"}, "controller", "$1", "pod_name", "^(.*)-[a-z0-9]+")) BY (colo, namespace, controller, pod_name, container_name) * 1000 CPU shares allocated to container

Alert Manager

Alert Manager Alert manager defines how to handle alerts (email, slack notification, etc.) Grouping alert Silences Inhibition

Alert Rules - alert: K8SNodeNotReady expr: kube_node_status_condition{condition="ready",status="true"} == 0 for: 1h labels: severity: warning colo: bf1 environment: production annotations: description: The Kubelet on {{ $labels.node }} has not checked in with the API, or has set itself to NotReady, for more than an hour summary: Node status is NotReady

Alert Rules - alert: PodCPUPercentage expr: colo_namespace_controller_pod:cpu_percentage:rate{namespace=~"kube-.*"} > 75 for: 10m labels: severity: critical colo: bf1 environment: production annotations: description: 'Pod cpu usage is above 75 for {{ $value }}. Please find out the cause of the spike and if required increase CPU allocation' summary: Pod cpu usage is above threshold

Alert Rules Simple alert template

Alerting on Prometheus Federated prometheus monitors individual prometheus A cron job monitors federated prometheus

Dashboards Showcase

Cluster View

Namespace

Deployment

Controller

Scheduler

API server

Kubelet kubenode

ETCD n

Package version

Volume of the metrics

Thank you Q&A Twitter: @vnandha Link to dashboards and configurations: https://github.com/vnandha/prometheus-kubernetes