Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data

FAST 2017, Santa Clara
Florian Lautenschlager, Michael Philippsen, Andreas Kumlehn, and Josef Adersberger
Florian.Lautenschlager@qaware.de, flolaut

Detecting Anomalies in Running Software Matters

Various kinds of anomalies:
- Resource consumption: anomalous memory consumption, high CPU usage, ...
- Sporadic failure: blocking state, deadlock, dirty read, ...
- Security: port scanning activity, short frequent login attempts, ...

Economic or reputation loss.

Detection is a complex task:
- Multiple components: Database, Service Discovery, Configuration Service, ...
- Different technologies: Go, Java, JavaScript, Python, ...
- Various transport protocols: HTTP, Protocol Buffers, Thrift, JSON, ...

Anomaly Detection Tool Chain for Operational Data

Application -> Operational Data -> Collection Framework -> Time Series Database -> Analysis Framework

Types of operational data:
- Metrics: scalar values, e.g., rates, runtimes, total hits, counters, ...
- Events: single occurrences, e.g., a user's login, product order, ...
- Traces: sequences within a software system, e.g., the called methods, ...

Anomaly Detection Tool Chain for Operational Data

- Collection Framework: collects operational data from a running application.
- Time Series Database: stores the time series data, for example:

  Timestamp                 V1       V2
  25.10.2016 00:00:01.546   218.34   51

- Analysis Framework: asks the database for data and analyzes the data.

Anomaly Detection Tool Chain for Operational Data

- Collection Framework: domain specific sensors and adaptors.
- Time Series Database:
  - General-purpose TSDB: brake shoe, resource hog, productivity obstacle.
  - Chronix: domain specific TSDB.
- Analysis Framework: domain specific analysis algorithms and tools.

State of the art: General-purpose TSDBs in Anomaly Detection

Systems compared: Graphite, InfluxDB, OpenTSDB, KairosDB, Prometheus, Chronix.

- Generic data model: no support for data types = productivity obstacle.
- Analysis support: no support for analyses = productivity obstacle = brake shoe.
- Lossless long term storage:
  - High memory footprint = performance hog.
  - High storage demands = performance hog.
  - Loss of historical data = brake shoe.

7 Bullets for the domain of Anomaly Detection

Collection Framework -> Chronix -> Analysis Framework

1. Option to pre-compute an extra representation of the data
2. Optional timestamp compression for almost-periodic time series
3. Records that meet the needs of the domain
4. Compression technique that suits the domain's data
5. Underlying multi-dimensional storage
6. Domain specific query language with server-side evaluation
7. Domain specific commissioning of configuration parameters

Running Example: Almost-periodic time series with operational data

Timestamp                 Value    Metric          Process    Host
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:16.964   218.52   ingester\time   SmartHub   QAMUC

1 Option to pre-compute data to speed up analyses

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC   B
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC   C
25.10.2016 00:00:16.964   218.52   ingester\time   SmartHub   QAMUC   D

- Chronix is lossless: it keeps all details because the analyses are ad-hoc and may need them.
- Chronix offers a programming interface for adding extra domain specific columns. Examples: Fourier transformation, Symbolic Aggregate approXimation (SAX), etc.
- Added columns speed up anomaly detection queries.
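To make the idea of a pre-computed extra column more concrete, here is a minimal sketch of a simplified SAX transformation. It is not Chronix's implementation or its programming interface, which the slides do not show. The function name `sax`, its parameters, and the equal-probability Gaussian breakpoints are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(values, segments, alphabet="ABCD"):
    """Simplified SAX: z-normalize, average per segment, map to symbols.

    Hypothetical helper for illustration; requires segments <= len(values).
    """
    mu, sigma = mean(values), pstdev(values)
    # z-normalization (a constant series maps to all zeros)
    normalized = [(v - mu) / sigma if sigma else 0.0 for v in values]

    # Piecewise aggregate approximation: one mean value per segment
    n = len(normalized)
    paa = [
        mean(normalized[s * n // segments:(s + 1) * n // segments])
        for s in range(segments)
    ]

    # Breakpoints that split the standard normal into equally likely regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, x)] for x in paa)


# A steadily rising series walks through the alphabet
print(sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4))  # ABCD
```

A symbol string like this is far smaller than the raw values, which is why storing it next to the data can speed up pattern-oriented queries.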

2 Optional timestamp compaction

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
5.172                     218.37   ingester\time   SmartHub   QAMUC   B
- (space saved)           218.49   ingester\time   SmartHub   QAMUC   C
- (space saved)           218.52   ingester\time   SmartHub   QAMUC   D

- It suffices to be able to reconstruct approximate timestamps for almost-periodic time series (Date-Delta-Compaction).
- Chronix is functionally lossless as it keeps all relevant details.
- The tolerable degree of inaccuracy is a configuration parameter (see 7).

Date-Delta-Compaction

Original timestamps:
  25.10.2016 00:00:01.546
  25.10.2016 00:00:06.718
  25.10.2016 00:00:11.891
  25.10.2016 00:00:16.964

Step 1: Calculate deltas (space saved).
  25.10.2016 00:00:01.546   5.172   5.173   5.073

Step 2: Compute diffs between the deltas.
  25.10.2016 00:00:01.546   5.172   0.001   0.1

Step 3: Drop diffs below the threshold (space saved).
  25.10.2016 00:00:01.546   5.172   -       -

Step 4: If the accumulated drift > threshold, store the delta. This gives an upper bound on the inaccuracy.
  25.10.2016 00:00:01.546   5.172   -       -
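The compaction steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Chronix source: timestamps are integer milliseconds, a dropped delta is represented as `None`, and the function names `ddc_compact` and `ddc_reconstruct` are hypothetical. The sketch bounds the error by comparing each real timestamp with the one a reader would reconstruct, so the accumulated drift never exceeds the threshold.

```python
def ddc_compact(timestamps_ms, threshold_ms):
    """Return (first_timestamp, deltas); a delta of None was dropped."""
    if not timestamps_ms:
        return None, []
    first = timestamps_ms[0]
    deltas = []
    last_delta = None      # most recently stored delta
    reconstructed = first  # timestamp a reader would rebuild so far
    for current in timestamps_ms[1:]:
        if last_delta is not None:
            expected = reconstructed + last_delta
            if abs(current - expected) <= threshold_ms:
                # Drift is tolerable: drop the delta
                deltas.append(None)
                reconstructed = expected
                continue
        # Drift too large (or first delta): store it, which resets the drift
        last_delta = current - reconstructed
        deltas.append(last_delta)
        reconstructed = current
    return first, deltas


def ddc_reconstruct(first, deltas):
    """Rebuild approximate timestamps by reusing the last stored delta."""
    if first is None:
        return []
    timestamps = [first]
    last_delta = None
    for delta in deltas:
        if delta is not None:
            last_delta = delta
        timestamps.append(timestamps[-1] + last_delta)
    return timestamps


# Millisecond offsets of the running example: 01.546, 06.718, 11.891, 16.964
ts = [1546, 6718, 11891, 16964]
first, deltas = ddc_compact(ts, threshold_ms=200)
print(deltas)                          # [5172, None, None]
print(ddc_reconstruct(first, deltas))  # [1546, 6718, 11890, 17062]
```

With a threshold of 200 the last reconstructed timestamp is off by 98 ms, which stays within the configured bound.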

Domain specific data characteristics

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
5.172                     218.37   ingester\time   SmartHub   QAMUC   B
-                         218.49   ingester\time   SmartHub   QAMUC   C
-                         218.52   ingester\time   SmartHub   QAMUC   D

- Many anomaly detection tasks need blocks of data rather than lines.
- Some compression techniques work better than others.
- Columns with repetitive values.

3 Records that meet the needs of the domain

Therefore: Record := Attributes + Start + End + Type + Data Chunk

Lines before "chunk & convert":

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
5.172                     218.37   ingester\time   SmartHub   QAMUC   B
-                         218.49   ingester\time   SmartHub   QAMUC   C
-                         218.52   ingester\time   SmartHub   QAMUC   D

Record after "chunk & convert":

  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:
  type:    metric
  data:
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.52   D

- Chronix offers a programming interface to implement time series specific records.
- Chronix exploits repetitiveness and bundles lines into data chunks (BLOB).
- The chunk size is a configuration parameter (see 7).
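The record definition above (Attributes + Start + End + Type + Data Chunk) can be sketched as a small data structure. This is an illustration only: the `Record` class and `chunk_into_records` helper are hypothetical names, and the chunk size here counts points, which is a simplifying assumption. Timestamps are integer milliseconds.

```python
from dataclasses import dataclass


@dataclass
class Record:
    attributes: dict  # e.g. metric, process, host; visible to the storage
    start: int        # timestamp of the first point in the chunk
    end: int          # timestamp of the last point in the chunk
    type: str         # e.g. "metric" or "trace"
    data: list        # the data chunk: (timestamp_ms, value) points


def chunk_into_records(points, attributes, type_, chunk_size):
    """Bundle consecutive points of one time series into records."""
    records = []
    for i in range(0, len(points), chunk_size):
        chunk = points[i:i + chunk_size]
        records.append(Record(
            attributes=dict(attributes),  # repeated once per record, not per line
            start=chunk[0][0],
            end=chunk[-1][0],
            type=type_,
            data=chunk,
        ))
    return records


points = [(1546, 218.34), (6718, 218.37), (11891, 218.49), (16964, 218.52)]
attrs = {"metric": "ingester\\time", "process": "SmartHub", "host": "QAMUC"}
for record in chunk_into_records(points, attrs, "metric", chunk_size=2):
    print(record.start, record.end, len(record.data))
```

Storing the repetitive attribute columns once per record instead of once per line is where the chunking saves space.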

4 Compression technique that suits the domain's data

Record before "serialize & compress":

  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:
  type:    metric
  data:
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.52   D

Record after "serialize & compress":

  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:
  type:    metric
  data:    00105e0 e6b0 343b 9c74 080 7bc 0804 e7d5 0804 00105f0   (compressed BLOB)

- Chronix exploits that domain data often has small increments, recurring patterns, etc.
- Chronix uses a lossless compression technique that minimizes (record sizes + index sizes).
- The choice of compression technique is a configuration parameter (see 7).
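The "serialize & compress" step can be sketched with the Python standard library. JSON is used here only as a stand-in serialization format, which is an assumption. gzip is chosen because the slides later report it as the best technique for the tested query mix. The helper names are hypothetical.

```python
import gzip
import json


def compress_chunk(points):
    """Serialize a data chunk and compress it into a BLOB."""
    payload = json.dumps(points, separators=(",", ":")).encode("utf-8")
    return gzip.compress(payload)


def decompress_chunk(blob):
    """Inverse of compress_chunk; the round trip is lossless."""
    return [tuple(p) for p in json.loads(gzip.decompress(blob).decode("utf-8"))]


# An almost-periodic series with a recurring value compresses well
points = [(1546 + 5172 * i, 218.34) for i in range(1000)]
blob = compress_chunk(points)
raw_size = len(json.dumps(points, separators=(",", ":")))
print(raw_size, len(blob))
```

The attributes stay outside the BLOB so that the storage can index them; only the data chunk is compressed.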

5 Underlying multi-dimensional storage

Record 1:
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:
  type:    metric
  data:    00105e0 e6b0 343b 9c74 080 7bc 0804 e7d5 0804 00105f0

Record 2:
  metric:  ingester\methods
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:
  type:    trace
  data:    d65fa01 7ab2 433c 7c8e f123 2ca 0713 a8f5 926b 01006e1

Example query:
  q=host:qamuc AND metric:ingester* AND type:[metric OR trace] AND end:now-7month

- By using a multi-dimensional storage, Chronix supports explorative analyses. Attributes are visible to the storage and indexed. Users can use any combination to find a record.
- Chronix supports correlating analyses. Every type of data can be stored. Queries can use and combine types.

6 Domain specific query language with server-side evaluation

[Figure: overview of query functions, grouped into "basic functions" and functions "also needed for anomaly detection"]

- Chronix offers not just basic functions but also high-level built-in domain specific analysis functions.
- Chronix evaluates functions server-side for speed.
- Chronix offers a plug-in interface to add functions.

6 Domain specific query language with server-side evaluation

Chronix (1x query, 1x latency):
  Query 1: q=metric:ingester\time & cf=outlier   (outlier is a high-level function)

General-purpose time series database (2x query, 2x latency, extra code):
  Query 1: select q(0.25,time),q(0.75,time) from ingester
  Extra code: calculate threshold
  Query 2: select time from ingester where time >= threshold

Chronix achieves more programming comfort and fast results.
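The extra client-side work on the general-purpose path can be sketched in memory. The slide only says "calculate threshold", so the rule used here (third quartile plus 1.5 times the interquartile range) is an assumption, as are the function name `find_outliers` and the sample values.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Emulate the two-query workflow of a general-purpose TSDB."""
    # "Query 1": fetch the 0.25 and 0.75 quantiles
    q1, _, q3 = quantiles(values, n=4)
    # Extra code: derive the threshold (assumed rule: Q3 + k * IQR)
    threshold = q3 + k * (q3 - q1)
    # "Query 2": select all values at or above the threshold
    return [v for v in values if v >= threshold]


values = [218.34, 218.37, 218.49, 218.52, 218.40, 218.45, 218.38, 218.50, 260.0]
print(find_outliers(values))  # [260.0]
```

With a server-side function such as `cf=outlier`, the quantile computation, the threshold, and the filtering all happen next to the data, so only one round trip is needed.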

Operational data of 5 industry projects

Description:
- P1: Application for searching car maintenance and repair instructions (8 app servers, 20 search servers).
- P2: Retail application for orders, billing, and customer relations (1 database, 2 app servers).
- P3: Sales application of a car manufacturer (1 database, 2 app servers).
- P4: Service application for modern cars (music streaming).
- P5: Manage the compatibility of software components in a car.

Project   Interval (sec)   Pairs (mio)                         Time series
P1        30               2.4                                 1,080
P2        60               331.4                               8,567
P3        30               162.6                               4,538
P4        1                metric 3.9, lsof 0.4, strace 12.1   500
P5        60               3,762.3                             24,055
Total                      4,275.1                             38,740

P1 to P3 are used for 7; P4 and P5 are used for the evaluation.

7 Best threshold for the Date-Delta-Compaction

Result: DDC = 200

7 Operational data of 3 (of 5) industry projects

Description:
- P1: Application for searching car maintenance and repair instructions (8 app servers, 20 search servers).
- P2: Retail application for orders, billing, and customer relations (1 database, 2 app servers).
- P3: Sales application of a car manufacturer (1 database, 2 app servers).

Project   Interval (sec)   Pairs (mio)   Time series
P1        30               2.4           1,080
P2        60               331.4         8,567
P3        30               162.6         4,538
Total (all 5 projects)     4,275.1       38,740

Query mix (r = range in days, q = number of queries):

r      q
0.5    15
1      30
7      30
21     5
28     3
56     1
91     2

7 Best compression technique and best chunk size for the query mix

Result: chunk size C = 128 KB, compression technique t = gzip

Evaluation: Operational data of 2 (of 5) industry projects

Description:
- P4: Service application for modern cars (music streaming).
- P5: Manage the compatibility of software components in a car.

Project   Interval (sec)   Pairs (mio)                         Time series
P4        1                metric 3.9, lsof 0.4, strace 12.1   500
P5        60               3,762.3                             24,055
Total (all 5 projects)     4,275.1                             38,740

Query mix (r = range in days, q = number of queries, b = number of basic queries, h = number of high-level queries):

r      q    b    h
0.5    1    1    2
1      11   6    6
7      15   5    10
14     8    7    8
21     12   2    6
28     5    4    6
56     1    4    3
91     2    1    2
180    2    2    0

Quantitative comparison

TSDBs under test:
- General-purpose TSDBs (productivity obstacle, brake shoe, resource hog): InfluxDB, OpenTSDB, KairosDB
- Domain specific TSDB: Chronix

Comparisons:
a) Memory footprint
b) Storage demand
c) Data retrieval times
d) Query mix runtimes

a) Memory footprint

Memory footprint of the databases (in MB)

                                                     InfluxDB   OpenTSDB   KairosDB   Chronix
Initially after startup (processes up and running)         33      2,726      8,763       446
Maximal memory usage during import                     10,336     10,111     18,905     7,002
Maximal memory usage during query                       8,269      9,712     11,230     4,792

Chronix has a 34%-69% smaller memory footprint.

b) Storage demand

Storage demand (in GB)

            Raw data   InfluxDB   OpenTSDB   KairosDB   Chronix
Project 4        1.2        0.2        0.2        0.3       0.1
Project 5      107.0       10.7       16.9       26.5       8.6
Total          108.2       10.9       17.1       26.8       8.7

Chronix saves 20%-68% of the storage space.

c) Data retrieval times

Data retrieval times for 20 x 58 queries (in s)

r       q    InfluxDB   OpenTSDB   KairosDB   Chronix
0.5     2         4.3        2.8        4.4       0.9
1       11        5.5        5.6        6.6       5.3
7       15       34.1       17.4       26.8       7.0
14      8        36.2       14.2       25.5       4.0
21      12       76.5       29.8       55.0       6.0
28      5         7.9        3.9        5.6       0.5
56      1        35.4       12.4       24.1       1.2
91      2        47.5       15.5       33.8       1.1
180     2        96.7       36.7       66.6       1.1
Total           343.8      138.3      248.4      27.1

Chronix saves 80%-92% on data retrieval times.

d) Query mix runtimes

Runtimes of 20 x 75 b- and h-queries (in s)

Basic (b) queries:

q    Function    InfluxDB   OpenTSDB   KairosDB   Chronix
4    avg              0.9        6.1        9.8       4.4
5    max              1.3        8.4        9.1       6.0
3    min              0.7        2.7        5.3       2.8
3    stddev.          6.7       16.7       21.1       2.3
5    sum              0.7        6.0       12.0       2.0
4    count            0.8        5.5       10.5       1.0
8    perc.           10.2       25.8       34.5       8.6

High-level (h) queries (marked "more important" on the slide):

q    Function    InfluxDB   OpenTSDB   KairosDB   Chronix
12   outlier         30.7       29.1      117.6      18.9
14   trend          162.7       50.4      100.6      30.2
11   frequency       47.3       23.9       45.7      16.3
3    grpsize        218.9     2927.8      206.3      29.6
3    split          123.1     2893.9       47.9      37.2

75   total          604.0     5996.3      620.4     159.3

Chronix saves 73%-97% of the runtime of analyzing queries.
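The reported savings ranges for storage, data retrieval, and query mix runtimes can be checked against the totals in the tables. The sketch below assumes that "saves" means 1 - (Chronix total / other database's total). The helper name `savings` is hypothetical.

```python
def savings(chronix_total, other_totals):
    """Return the (smallest, largest) fraction saved relative to the others."""
    fractions = [1 - chronix_total / other for other in other_totals]
    return min(fractions), max(fractions)


# Totals in the order InfluxDB, OpenTSDB, KairosDB
totals = {
    "storage (GB)": (8.7, [10.9, 17.1, 26.8]),
    "data retrieval (s)": (27.1, [343.8, 138.3, 248.4]),
    "query mix (s)": (159.3, [604.0, 5996.3, 620.4]),
}
for name, (chronix, others) in totals.items():
    low, high = savings(chronix, others)
    print(f"{name}: {low:.1%} to {high:.1%}")
```

Under this assumption the results agree with the slide figures to within a percentage point of rounding.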

Chronix unleashes Anomaly Detection tasks

7 domain specific levers to unleash Anomaly Detection:
1. Option to pre-compute an extra representation of the data
2. Optional timestamp compression for almost-periodic time series
3. Records that meet the needs of the domain
4. Compression technique that suits the domain's data
5. Underlying multi-dimensional storage
6. Domain specific query language with server-side evaluation
7. Domain specific commissioning of configuration parameters

4 beneficial performance effects:
- Chronix has a 34%-69% smaller memory footprint.
- Chronix saves 20%-68% of the storage space.
- Chronix saves 80%-92% on data retrieval time.
- Chronix saves 73%-97% of the runtime of analyzing queries.

www.chronix.io (open source)