Scalable Online Analytics for Monitoring


Scalable Online Analytics for Monitoring
LISA15, Nov. 13, 2015, Washington, D.C.
Heinrich Hartmann, PhD, Chief Data Scientist, Circonus

I'm Heinrich
Heinrich.Hartmann@Circonus.com
From Mainz, Germany
Studied math in Mainz, Bonn, Oxford
PhD in Algebraic Geometry
Sensor data mining at Univ. Koblenz
Freelance IT consultant
Chief Data Scientist at Circonus

Circonus is...
a monitoring and telemetry analysis platform
scalable to millions of incoming metrics
available as public and private SaaS
equipped with built-in histograms, forecasting, anomaly detection, ...

This talk is about...
I. The Future of Monitoring
II. Patterns of Telemetry Analysis
III. Design of the Online Analytics Engine Beaker

Part I - The Future of Monitoring

The cloud monitoring challenge 2015
Monitor this:
100 containers in one fleet
200 metrics per container
50 code pushes a day
For each push, all containers are recreated

Line plots work well for small numbers of nodes
[Figure: CPU utilization for a DB cluster, Node 1 to Node 6. Source: Internal.]

... but can get polluted easily
[Figure: CPU utilization for a DB cluster. Source: Internal.]

"Information is not a scarce resource. Attention is."
Herbert A. Simon
Source: http://www.circonus.com/the-future-of-monitoring-qa-with-john-allspaw/

Store Histograms and Alert on Percentiles
[Figure: CPU utilization of a DB service with a variable number of nodes, 90th percentile.]
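One way to see why histograms suit a fleet whose size keeps changing: per-node histograms merge by adding bucket counts, and a percentile can then be read off the merged histogram. The sketch below is only an illustration under stated assumptions. The fixed 10-unit buckets and the names to_histogram, merge and percentile are hypothetical and do not represent the Circonus histogram format.

```python
from collections import Counter

def to_histogram(samples, bucket_width=10):
    """Count samples per bucket; a bucket is identified by its lower bound."""
    return Counter((int(s) // bucket_width) * bucket_width for s in samples)

def merge(histograms):
    """Merging is just adding bucket counts, so the number of nodes may vary."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def percentile(hist, p, bucket_width=10):
    """Return the upper bound of the bucket containing the p-th percentile."""
    n = sum(hist.values())
    if n == 0:
        raise ValueError("empty histogram")
    seen = 0
    for lower in sorted(hist):
        seen += hist[lower]
        # Integer comparison avoids floating-point rounding at the boundary.
        if seen * 100 >= p * n:
            return lower + bucket_width

# CPU utilization samples (percent) from three nodes of varying sample count.
node_a = to_histogram([12, 15, 22, 35])
node_b = to_histogram([18, 41, 95])
node_c = to_histogram([5, 27, 33])

fleet = merge([node_a, node_b, node_c])
p90 = percentile(fleet, 90)
print(p90)  # 50: nine of the ten samples fall below 50
```

An alert rule on p90 stays meaningful when nodes join or leave, because only the merged counts matter.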

Anomaly Detection for Surfacing Relevant Data
[Figure: Mockup of a metric overview page. Source: Circonus]

Part II - Patterns of Telemetry Analysis

Telemetry analysis comes in two forms
Online Analytics (Stream Processing):
Anomaly detection
Percentile alerting
Smart alerting rules
Smart dashboards
Offline Analytics (Big Data):
Post mortem analysis
Assisted thresholding

New Components in Circonus: Beaker, Bunsen

Beaker & Bunsen in the Circonus Architecture
[Figure: architecture diagram with the components metrics, Q, Beaker (Online AE), Ernie (Alerting Service), Snowth (Metric Store), Bunsen (Offline AE), alerts, Web-UI.]

Pattern 1: Windowing
    windows = window(y_stream)
    def results():
        for w in windows:
            yield method(w)  # one output value z per window
Examples:
Supervised Machine Learning
Fourier Transformation
Anomaly Detection (etsy)
Remarks:
Tradeoff for window size: rich features vs. memory
Tradeoff for overlap: latency vs. CPU
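The pseudocode above leaves window() undefined. A minimal sketch of one possible implementation follows, assuming count-based windows. The parameters size and step are assumptions introduced here to make the two tradeoffs concrete: size bounds the memory held per stream, and a smaller step means more overlap, so results arrive sooner at the cost of more calls to method.

```python
from collections import deque
from statistics import mean

def window(stream, size, step):
    """Yield lists of `size` consecutive samples, advancing by `step` samples."""
    buf = deque(maxlen=size)   # memory cost grows with the window size
    since_last = 0
    for y in stream:
        buf.append(y)
        since_last += 1
        if len(buf) == size and since_last >= step:
            since_last = 0
            yield list(buf)

def results(stream, method, size=4, step=2):
    """Apply `method` to every window of the stream."""
    for w in window(stream, size, step):
        yield method(w)

out = list(results(range(10), mean))
print(out)  # [1.5, 3.5, 5.5, 7.5]
```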

Pattern 3: Processing Units
    processing_unit = {
      state = ...,
      update = function(self, y) ... end
    }
Examples:
Exponential Smoothing
Holt-Winters forecasting
Anomaly Detection (Circonus)
Remarks:
Fast updates
Fully general
Cost: maintains state

A Processing Unit for Exponential Smoothing
    local q = 0.9
    exponential_smoothing = {
      s = 0,
      update = function(self, y)
        self.s = (y * q) + (self.s * (1 - q))
        return self.s
      end
    }
[Figure: Exponential smoothing applied to a dns-duration metric.]
For more examples, see http://heinrichhartmann.com/.../generative-models-for-time-series

Processing units are convenient
Primitive transformations are readily implemented: Arithmetic, Smoothing, Forecasting, ...
Fully general: they allow window-based processing as well
Composable: compose several PUs to get a new one!
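Composition can be sketched in a few lines. The Python below mirrors the update(self, y) interface of the Lua processing unit shown earlier. The Scale and Pipeline classes are hypothetical names used for illustration, not part of Beaker.

```python
class ExponentialSmoothing:
    """Stateful PU: same update rule as the Lua example."""
    def __init__(self, q=0.9):
        self.q = q
        self.s = 0.0

    def update(self, y):
        self.s = y * self.q + self.s * (1 - self.q)
        return self.s

class Scale:
    """Stateless arithmetic PU, e.g. for a unit conversion."""
    def __init__(self, factor):
        self.factor = factor

    def update(self, y):
        return y * self.factor

class Pipeline:
    """A PU built from other PUs: each output feeds the next unit."""
    def __init__(self, *units):
        self.units = units

    def update(self, y):
        for unit in self.units:
            y = unit.update(y)
        return y

pu = Pipeline(Scale(2), ExponentialSmoothing(q=0.5))
outputs = [pu.update(y) for y in [1, 2, 2]]
print(outputs)  # [1.0, 2.5, 3.25]
```

Because Pipeline exposes the same update method, it can itself be nested inside another Pipeline.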

Circonus Analytics Query Language (CAQL)
Create your own customized processing unit from primitives
UNIX-inspired syntax with pipes
Native support for histogram metrics

CAQL: Example 1 - Low-frequency AD
Pre-process a metric before feeding it into anomaly detection
    metric:average(<>) | rolling:mean(30m) | anomaly_detection()

CAQL: Example 2 - Histogram Aggregation
Histograms are first-class citizens
    metric:average(<uuid>) | window:histogram(1h) | histogram:percentile(95)

Part III - The Design of the Online Analytics Engine Beaker

Beaker v. 0.1: Simple stream processing
1. Read messages from a message queue
2. Execute processing units over incoming metrics
3. Publish computed values to a message queue
[Figure: Beaker: Basic Data Flow. input Q -> Beaker Service -> output Q]

Challenge 1: Roll up metrics by the minute
Metrics arrive asynchronously on the input queue
PUs expect exactly one sample per minute
Rollup logic needs to allow for:
late arrival, e.g. when the broker is behind
out-of-order arrival
errors in system time (time zones, drifts)
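The rollup requirements above can be sketched as follows. This is a minimal illustration, not Beaker's rollup logic: the class name MinuteRollup, the 30-second grace period, and the choice of the mean as rollup function are all assumptions. A minute is only closed once the grace period has elapsed, which tolerates late and out-of-order samples. Clock errors and minutes without any samples are not handled here.

```python
from collections import defaultdict

class MinuteRollup:
    """Average samples per minute; close a minute only after a grace period."""

    def __init__(self, grace_seconds=30):
        self.grace = grace_seconds
        self.open = defaultdict(list)  # minute start (s) -> samples
        self.closed_until = None       # latest minute already emitted

    def add(self, timestamp, value):
        """Accept a sample unless its minute has already been emitted."""
        minute = timestamp - timestamp % 60
        if self.closed_until is not None and minute <= self.closed_until:
            return False               # arrived too late, drop it
        self.open[minute].append(value)
        return True

    def flush(self, now):
        """Emit (minute, mean) for every minute whose grace period is over."""
        ready = sorted(m for m in self.open if m + 60 + self.grace <= now)
        out = []
        for m in ready:
            values = self.open.pop(m)
            out.append((m, sum(values) / len(values)))
            self.closed_until = m
        return out

r = MinuteRollup(grace_seconds=30)
r.add(61, 10.0)                # samples arrive out of order
r.add(5, 2.0)
r.add(50, 4.0)
first = r.flush(now=80)        # minute 0 is still within its grace period
r.add(30, 6.0)                 # late, but still accepted
second = r.flush(now=95)       # minute 0 closes at t = 90
late = r.add(59, 100.0)        # minute 0 was already emitted
third = r.flush(now=150)
```

The grace period trades completeness for latency: a longer wait admits more stragglers but delays every output by the same amount.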

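Consequence: No real-time processing in Beaker
Rolling up data in 1m periods causes an average delay of 30 sec for data processing: a sample arriving at a uniformly random point within the minute waits, on average, half a minute for its window to close.
Real-time threshold-based alerting is still available.
Approach: Avoid rolling up data for stateless PUs.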

Challenge 2: Multiple input slots
[Figure: Beaker: Logical Data Flow. Input metrics feed processing units, and each processing unit produces one output metric.]

Challenge 3: Synchronizing roll-ups for multiple inputs is tricky
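To illustrate where the difficulty lies, here is a minimal sketch of a synchronizer that releases a minute to a multi-input processing unit only once every input slot has delivered its rolled-up value. The class SlotSynchronizer and the slot names are hypothetical. A production version would also need a timeout for slots that never report, since otherwise the pending table grows without bound.

```python
class SlotSynchronizer:
    """Release a minute only when all input slots have reported a value."""

    def __init__(self, slots):
        self.slots = tuple(slots)
        self.pending = {}  # minute -> {slot: value}

    def add(self, slot, minute, value):
        """Record a value; return the full input tuple once it is complete."""
        if slot not in self.slots:
            raise KeyError(slot)
        row = self.pending.setdefault(minute, {})
        row[slot] = value
        if len(row) == len(self.slots):
            del self.pending[minute]
            return tuple(row[s] for s in self.slots)
        return None

sync = SlotSynchronizer(["requests", "errors"])
a = sync.add("requests", 0, 200)  # minute 0 is incomplete
b = sync.add("errors", 60, 3)     # minute 60 overtakes minute 0
c = sync.add("errors", 0, 10)     # minute 0 is now complete
print(a, b, c)  # None None (200, 10)
```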

Challenge 4: Fault Tolerance
Definition (Birman): A service is fault tolerant if it tolerates the failure of nodes without producing wrong output messages.
Failed nodes must be able to recover from a crash and rejoin the cluster.
The time to recovery should be as low as possible.
Source: K. Birman - Reliable Distributed Systems, Springer, 2005

Beaker v. 0.5: Automated restarts
a. Service Management Facility (svcadm) in OmniOS
b. systemd or watchdogs in Linux
But recovery can take a long time: state has to be rebuilt from the input stream.
Need a way to recover faster from errors:
a. Persist processing unit state (software updates!)
b. Access persisted metric data

Beaker v. 0.5: Use db to rebuild state on startup
[Figure: input Q -> Beaker Service -> output Q, with the Beaker Service reading from the Snowth db.]
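Rebuilding state on startup can be sketched as a replay of recent history into a fresh processing unit. Everything named here is an assumption for illustration: fetch_history stands in for a metric-store query and is not the Snowth API, and the store is a plain dictionary.

```python
from collections import deque

class RollingMean:
    """PU whose state is fully determined by the last `size` samples."""
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def update(self, y):
        self.buf.append(y)
        return sum(self.buf) / len(self.buf)

def fetch_history(store, metric, n):
    """Hypothetical stand-in for a metric-store query: last n samples."""
    return store[metric][-n:]

def warm_start(pu, store, metric, n):
    """Replay persisted samples into a fresh PU before it goes live."""
    for y in fetch_history(store, metric, n):
        pu.update(y)
    return pu

store = {"cpu": [10, 20, 30, 40, 50]}

long_running = RollingMean(3)          # never crashed, saw every sample
for y in store["cpu"]:
    long_running.update(y)

restarted = warm_start(RollingMean(3), store, "cpu", n=3)

a = long_running.update(60)
b = restarted.update(60)
print(a, b)  # 50.0 50.0
```

For a PU with unbounded memory, such as exponential smoothing, replaying a finite history only approximates the lost state, which is one reason to persist the state itself as well.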

Challenge 5: High Availability
Definition (Birman): A service is highly available if it continues to publish valid messages during a node failure after a small reconfiguration period.
For Beaker we require a reconfiguration period of less than 1 minute. During this time, messages may be delayed (e.g. by 30 sec) and duplicated messages may be published.
Source: K. Birman - Reliable Distributed Systems, Springer, 2005

Beaker v. 0.5: HA Cluster
[Figure: Beaker HA Cluster. input Q -> Beaker Master -> output Q, with two Beaker Slaves and the Snowth db.]
On failover: master election

Challenge 6: Scalability
Beaker needs to scale in the following dimensions:
Number of processing units: up to an unlimited amount
Number of incoming metrics: up to ~100M metrics

Beaker v. 0.6: Multiple HA Clusters
[Figure: a Meta Service and Msg Brokers connect three Beaker HA clusters that share the Snowth db: Cluster I (master BI-M, 5 slaves BI-S1 ... BI-S5), Cluster II (master BII-M, 3 slaves), Cluster III (master BIII-M, 1 slave).]

Done! Great. This works...

Can we do better?

Beaker v. 1.0: Divide and Conquer
[Figure: Beaker Service. Q -> Rollup Service -> Processing Service -> Dedup Service -> Q]
Divide Beaker into multiple services
Avoid master election in the processing service by allowing duplicates
Upside: only the processing service needs to scale out
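Allowing duplicates moves the burden to the dedup service. A minimal sketch of such a filter follows, assuming that replicas of the same processing unit produce identical output for the same minute, so that a message is identified by the pair (PU id, minute). The class name Deduplicator, the message layout, and the bounded cache are assumptions made for this illustration.

```python
from collections import OrderedDict

class Deduplicator:
    """Pass the first message per (pu_id, minute); drop later copies."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.seen = OrderedDict()  # insertion-ordered set of recent keys

    def accept(self, pu_id, minute):
        key = (pu_id, minute)
        if key in self.seen:
            return False
        self.seen[key] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)  # forget the oldest key
        return True

dedup = Deduplicator()
messages = [
    ("pu-1", 0, 4.0),   # from worker A
    ("pu-1", 0, 4.0),   # same result from worker B
    ("pu-2", 0, 7.5),
    ("pu-1", 60, 4.2),
]
published = [m for m in messages if dedup.accept(m[0], m[1])]
print(published)
```

Since any replica's copy is acceptable, no worker needs to know whether it is the primary, which is what removes the need for an election.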

Scaling the Processing Service is simplified
No rollup logic in workers
All workers publish messages
No failover logic necessary
Replicate each PU on multiple workers
No crosstalk between nodes (crosstalk coefficient = 0 in USL)
[Figure: Processing Service: Data Flow. Meta Service; Q -> Worker, Worker, ..., Worker -> Q; snowth db]
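"USL" here presumably refers to Gunther's Universal Scalability Law; the symbol on the slide did not survive transcription. In one common notation (the letters vary between presentations, so they are an assumption here), the relative capacity C of N workers is

```latex
C(N) = \frac{N}{1 + \alpha\,(N - 1) + \beta\,N\,(N - 1)}
```

where \alpha models contention for shared resources and \beta models crosstalk (coherency) between nodes. With \beta = 0 the quadratic term vanishes, leaving C(N) = N / (1 + \alpha (N - 1)), which grows monotonically in N. Adding workers then never reduces throughput, and with \alpha = 0 as well the scaling is linear.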

Benchmark results on prototype

PU type            PU count   Throughput
anomaly_detection  100        15 kHz
anomaly_detection  10k        4.2 kHz

Machine: 6-core Xeon E5-2630 v2 at 2.6 GHz

Conclusion
Stateful processing units allow the implementation of the next generation of monitoring analytics
Use CAQL to build your own processing units
Service orientation facilitates scaling
Beaker will be out soon

Credits
Joint work with: Jonas Kunze, Theo Schlossnagle
Image Credits:
Example of Kummer Surface, Claudio Rocchini, CC-BY-SA, https://en.wikipedia.org/wiki/file:kummer_surface.png
Indian truck, by strudelt, CC-BY, https://commons.wikimedia.org/wiki/file:truck_in_india_-_overloaded.jpg
Kafka, Public Domain, https://en.wikipedia.org/wiki/file:kafka_portrait.jpg
Thinker, Public Domain, https://www.flickr.com/photos/mustangjoe/5966894496