New Data Architectures For Netflow Analytics NANOG 74. Fangjin Yang - Imply

New Data Architectures For Netflow Analytics NANOG 74 Fangjin Yang - Cofounder @ Imply

Overview
- The Problem
- Comparing technologies
- Operational analytic databases
- Try this at home

The Problem
- Netflow data continues to grow in scale and complexity
- This data is critical for network analysts, operators, and monitoring teams to understand performance and troubleshoot issues
- Real-time and historical analysis are both important

The Data
- Flows come in many different forms: NetFlow, sFlow, IPFIX, etc.
- Network data is often enriched with application/user data as well
- All of this data has 3 components in common:
  - Timestamp: when the flow was created
  - Attributes (dimensions): properties that describe the flow
  - Measures (metrics): numbers to aggregate (# flows, # packets, bps, etc.)

The Use Case
- We want to monitor the network and proactively detect issues
- When issues occur, we want to resolve them as fast as possible
- We want our analytics system to be cheap and efficient

Demo
In case the internet didn't work, pretend you saw something cool

Challenges
Storing and computing the data:
- Scale: millions of events/sec (batch and real-time)
- Complexity: high dimensionality & high cardinality
- Structure: semi-structured (nested, evolving schemas, etc.)
Accessing the data:
- UIs: need both reporting and exploring capabilities
- Troubleshooting: need fast drill-downs to quickly pinpoint issues
- Multi-tenancy: support up to thousands of concurrent users across different networking teams

What technology to use?
- Logsearch systems (Elasticsearch)
- Analytic query engines (Presto, Vertica)
- Timeseries databases (InfluxDB, OpenTSDB)
- Something new?

Logsearch systems
Pros:
- Real-time ingest
- Flexible schemas
- Fast search and filter
Cons:
- Poor performance for high dimensional, high cardinality data
- Not designed for numerical aggregation

Analytic query engines
Pros:
- Column-oriented storage: fast aggregations
- Supports large scale groupbys
- Supports complex aggregations
Cons:
- Not designed for real-time ingest
- Inflexible schemas
- Slow search and filter

Time series databases
Pros:
- Time-optimized partitioning and storage
- Able to quickly aggregate time series
Cons:
- Slow for groupings on dimensions that are not time
- Slow search and filter

Operational Analytics Database
Combines benefits of all 3 classes of systems:
- Ingest and search/filter capabilities of logsearch
- Column-oriented storage and query capabilities of analytic query engines
- Time-specific optimizations from time series databases

Operational Analytics Database
Examples:
- Apache Druid (incubating) (open source community)
- Scuba (from Facebook)
- Pinot (from LinkedIn)
- Palo (from Baidu)
- ClickHouse (from Yandex)
Operational analytic databases have near-identical storage formats and similar architectures. We'll use Druid to explain the architecture.

Druid in Production
- Alibaba: https://strata.oreilly.com.cn/hadoop-big-data-cn/public/schedule/detail/52345
- Cisco: http://www.networkworld.com/article/3086250/cisco-subnet/under-the-hood-of-cisco-s-tetration-analytics-platform.html
- eBay: https://www.slideshare.net/gianmerlino/druid-for-real-time-monitoring-analytics-at-ebay
- Netflix: https://www.youtube.com/watch?v=dlqj34l2upk
- PayPal: https://dataworkssummit.com/san-jose-2018/session/paypal-merchant-ecosystem-using-apache-spark-hive-druid-and-hbase/
- Walmart: https://medium.com/walmartlabs/event-stream-analytics-at-walmart-with-druid-dcf1a37ceda7
- Wikimedia Foundation: https://conferences.oreilly.com/strata/strata-ny-2017/public/schedule/detail/60986
- Airbnb, Lyft, Slack, Snapchat, Target, Tencent, Verizon + many more
See http://druid.io/druid-powered.html for more info

Storage Format

Raw data

timestamp             Action  Protocol  Flows
2011-01-01T00:01:35Z  ACCEPT  TCP       10
2011-01-01T00:03:03Z  ACCEPT  TCP       1
2011-01-01T00:04:51Z  REJECT  UDP       10
2011-01-01T00:05:33Z  REJECT  UDP       10
2011-01-01T00:05:53Z  REJECT  TCP       1
2011-01-01T00:06:17Z  REJECT  TCP       10
2011-01-01T00:23:15Z  ACCEPT  TCP       1
2011-01-01T00:38:51Z  REJECT  UDP       10
2011-01-01T00:49:33Z  REJECT  TCP       10
2011-01-01T00:49:53Z  REJECT  TCP       1

Rollup

timestamp             Action  Protocol  Flows
2011-01-01T00:01:35Z  ACCEPT  TCP       10
2011-01-01T00:03:03Z  ACCEPT  TCP       1
2011-01-01T00:04:51Z  REJECT  UDP       10
2011-01-01T00:05:33Z  REJECT  UDP       10
2011-01-01T00:05:53Z  REJECT  TCP       1
2011-01-01T00:06:17Z  REJECT  TCP       10
2011-01-01T00:23:15Z  ACCEPT  TCP       1
2011-01-01T00:38:51Z  REJECT  UDP       10
2011-01-01T00:49:33Z  REJECT  TCP       10
2011-01-01T00:49:53Z  REJECT  TCP       1

Rolled up to hourly granularity:

timestamp             Action  Protocol  Flows
2011-01-01T00:00:00Z  ACCEPT  TCP       12
2011-01-01T00:00:00Z  REJECT  TCP       22
2011-01-01T00:00:00Z  REJECT  UDP       30
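The rollup on this slide can be reproduced in a few lines. This is only a sketch of the idea (truncate each timestamp to the hour, then sum Flows per unique combination of timestamp, Action and Protocol); the function names are made up for illustration and this is not how Druid implements rollup internally.

```python
from collections import defaultdict
from datetime import datetime

# Raw rows copied from the slide: (timestamp, Action, Protocol, Flows)
RAW = [
    ("2011-01-01T00:01:35Z", "ACCEPT", "TCP", 10),
    ("2011-01-01T00:03:03Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:04:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:33Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:05:53Z", "REJECT", "TCP", 1),
    ("2011-01-01T00:06:17Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:23:15Z", "ACCEPT", "TCP", 1),
    ("2011-01-01T00:38:51Z", "REJECT", "UDP", 10),
    ("2011-01-01T00:49:33Z", "REJECT", "TCP", 10),
    ("2011-01-01T00:49:53Z", "REJECT", "TCP", 1),
]

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def truncate_to_hour(ts):
    """Drop minutes and seconds so all rows in one hour share a timestamp."""
    parsed = datetime.strptime(ts, TS_FORMAT)
    return parsed.replace(minute=0, second=0).strftime(TS_FORMAT)


def rollup(rows):
    """Sum Flows for each (hour, Action, Protocol) combination."""
    totals = defaultdict(int)
    for ts, action, protocol, flows in rows:
        totals[(truncate_to_hour(ts), action, protocol)] += flows
    return sorted(key + (total,) for key, total in totals.items())


rolled = rollup(RAW)
for row in rolled:
    print(*row)
```

Ten raw rows collapse to three, and the sums match the second table on the slide (12, 22 and 30).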

Sharding/partitioning data

timestamp             Action  Protocol  Flows

1st hour segment:
2011-01-01T00:00:00Z  ACCEPT  TCP       12
2011-01-01T00:00:00Z  REJECT  TCP       22
...

2nd hour segment:
2011-01-01T01:00:00Z  ACCEPT  TCP       12
2011-01-01T01:00:00Z  REJECT  TCP       22
...

3rd hour segment:
2011-01-01T02:00:00Z  ACCEPT  TCP       12
2011-01-01T02:00:00Z  REJECT  TCP       22
...
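A rough sketch of time-based partitioning, using the rows shown on this slide. The helper names are hypothetical and the "segment" here is just a tuple of rows; the point is only that a query over one hour needs to touch one partition.

```python
from collections import defaultdict

# Rolled-up rows from the slide: (timestamp, Action, Protocol, Flows)
ROLLED = [
    ("2011-01-01T00:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T00:00:00Z", "REJECT", "TCP", 22),
    ("2011-01-01T01:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T01:00:00Z", "REJECT", "TCP", 22),
    ("2011-01-01T02:00:00Z", "ACCEPT", "TCP", 12),
    ("2011-01-01T02:00:00Z", "REJECT", "TCP", 22),
]


def segment_key(ts):
    """Hour bucket of an ISO-8601 timestamp, e.g. '2011-01-01T01'."""
    return ts[:13]


def partition_by_hour(rows):
    """Group rows into one segment per hour."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[segment_key(row[0])].append(row)
    # Tuples stand in for segments that are not modified after creation.
    return {hour: tuple(bucket) for hour, bucket in sorted(buckets.items())}


def segments_for_interval(segments, start_hour, end_hour):
    """Return the keys of segments whose hour lies in [start_hour, end_hour)."""
    return [hour for hour in segments if start_hour <= hour < end_hour]


segments = partition_by_hour(ROLLED)
touched = segments_for_interval(segments, "2011-01-01T01", "2011-01-01T02")
print(len(segments), "segments;", "query touches", touched)
```

Because ISO-8601 timestamps sort lexicographically, plain string comparison is enough to prune segments outside the query interval.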

Segments
- Fundamental storage unit in Druid
- Immutable once created
- No contention between reads and writes
- One thread scans one segment

Columnar storage - compression

timestamp             Action  Protocol  Flows
2011-01-01T00:00:00Z  ACCEPT  TCP       10
2011-01-01T00:03:03Z  ACCEPT  TCP       1
2011-01-01T00:04:51Z  REJECT  UDP       10
2011-01-01T00:05:33Z  REJECT  UDP       10
2011-01-01T00:05:53Z  REJECT  TCP       1
2011-01-01T00:06:17Z  REJECT  TCP       10

Create IDs:
- Accept: 0, Reject: 1
- TCP: 0, UDP: 1
Store:
- Action   [0 0 1 1 1 1]
- Protocol [0 0 1 1 0 0]
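The ID assignment on this slide is dictionary encoding. A minimal sketch, assuming IDs are handed out in order of first appearance (which happens to reproduce the slide's numbering; a real system may order its dictionary differently):

```python
def dictionary_encode(values):
    """Replace each distinct string with a small integer ID.

    Returns (dictionary, encoded_column). IDs are assigned in order of
    first appearance, which is an assumption made for this example.
    """
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)
        encoded.append(ids[value])
    return ids, encoded


# Columns taken from the six rows on the slide.
action = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
protocol = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]

action_ids, action_column = dictionary_encode(action)
protocol_ids, protocol_column = dictionary_encode(protocol)

print(action_ids, action_column)
print(protocol_ids, protocol_column)
```

Each column is now a short dictionary plus a list of small integers, which is far more compact than repeating the strings and compresses well.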

Columnar storage - fast search and filter

timestamp             Action  Protocol  Flows
2011-01-01T00:00:00Z  ACCEPT  TCP       10
2011-01-01T00:03:03Z  ACCEPT  TCP       1
2011-01-01T00:04:51Z  REJECT  UDP       10
2011-01-01T00:05:33Z  REJECT  UDP       10
2011-01-01T00:05:53Z  REJECT  TCP       1
2011-01-01T00:06:17Z  REJECT  TCP       10

Value             Rows          Bitmap
ACCEPT            [0, 1]        [110000]
REJECT            [2, 3, 4, 5]  [001111]
ACCEPT OR REJECT                [111111]

Compression!
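The bitmaps on this slide can be sketched with plain Python integers, one bit per row. This is an illustration of the idea only: the helper names are hypothetical, and production systems use compressed bitmap formats rather than raw integers.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit `row` is set if that row matches."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def to_bits(bitmap, n_rows):
    """Render a bitmap with row 0 on the left, as on the slide."""
    return "".join("1" if (bitmap >> row) & 1 else "0" for row in range(n_rows))


def matching_rows(bitmap, n_rows):
    """List the row numbers whose bit is set."""
    return [row for row in range(n_rows) if (bitmap >> row) & 1]


# Columns from the six rows on the slide.
action = ["ACCEPT", "ACCEPT", "REJECT", "REJECT", "REJECT", "REJECT"]
protocol = ["TCP", "TCP", "UDP", "UDP", "TCP", "TCP"]
flows = [10, 1, 10, 10, 1, 10]
N = len(action)

action_index = build_bitmap_index(action)
protocol_index = build_bitmap_index(protocol)

# Boolean filters become bitwise operations on the bitmaps.
accept_or_reject = action_index["ACCEPT"] | action_index["REJECT"]
reject_and_tcp = action_index["REJECT"] & protocol_index["TCP"]

print(to_bits(action_index["ACCEPT"], N))
print(to_bits(action_index["REJECT"], N))
print(to_bits(accept_or_reject, N))

# Only the matching rows of the metric column need to be read.
reject_tcp_flows = sum(flows[row] for row in matching_rows(reject_and_tcp, N))
print(to_bits(reject_and_tcp, N), reject_tcp_flows)
```

A filter such as Action = REJECT AND Protocol = TCP is answered by ANDing two bitmaps, after which only rows 4 and 5 of the Flows column are scanned.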

Approximate algorithms
- Many aggregations in netflow use cases don't require exactness:
  - Count distinct
  - Histograms and quantiles
  - Set analysis
- Approximate algorithms are very powerful for fast queries while minimizing storage

Rollup revisited

timestamp             Action  Protocol  Flows
2011-01-01T00:01:35Z  ACCEPT  TCP       10
2011-01-01T00:03:03Z  ACCEPT  TCP       1
2011-01-01T00:04:51Z  REJECT  UDP       10
2011-01-01T00:05:33Z  REJECT  UDP       10
2011-01-01T00:05:53Z  REJECT  TCP       1
2011-01-01T00:06:17Z  REJECT  TCP       10
2011-01-01T00:23:15Z  ACCEPT  TCP       1
2011-01-01T00:38:51Z  REJECT  UDP       10
2011-01-01T00:49:33Z  REJECT  TCP       10
2011-01-01T00:49:53Z  REJECT  TCP       1

Rolled up to hourly granularity:

timestamp             Action  Protocol  Flows
2011-01-01T00:00:00Z  ACCEPT  TCP       12
2011-01-01T00:00:00Z  REJECT  TCP       22
2011-01-01T00:00:00Z  REJECT  UDP       30

High cardinality dimensions impact rollup

timestamp             Action  device_id   Protocol  Flows
2011-01-01T00:01:35Z  ACCEPT  4312345532  TCP       10
2011-01-01T00:03:03Z  ACCEPT  3484920241  TCP       1
2011-01-01T00:04:51Z  REJECT  9530174728  UDP       10
2011-01-01T00:05:33Z  REJECT  4098310573  UDP       10
2011-01-01T00:05:53Z  REJECT  5832058870  TCP       1
2011-01-01T00:06:17Z  REJECT  5789283478  TCP       10
2011-01-01T00:23:15Z  ACCEPT  4730093842  TCP       1
2011-01-01T00:38:51Z  REJECT  9530174728  UDP       10
2011-01-01T00:49:33Z  REJECT  4930097162  TCP       10
2011-01-01T00:49:53Z  REJECT  3081837193  TCP       1

Approximate counts enable rollup again

timestamp             Action  Protocol  Flows  unique_device_count
2011-01-01T00:00:00Z  ACCEPT  TCP       12     [sketch]
2011-01-01T00:00:00Z  REJECT  TCP       22     [sketch]
2011-01-01T00:00:00Z  REJECT  UDP       30     [sketch]
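To illustrate why a sketch column lets rollup work again, here is a minimal K-Minimum-Values (KMV) distinct-count sketch. KMV is swapped in here because it fits in a few lines; it is not claimed to be the sketch Druid uses. The value of K, the hash choice and the function names are all assumptions for the example. The key property is that two sketches can be merged without going back to the raw device IDs.

```python
import hashlib

K = 1024                  # sketch size (assumed); larger K means lower error
MAX_HASH = float(2 ** 64)


def _hash(value):
    """Deterministic 64-bit hash of a value."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_sketch(values, k=K):
    """Keep only the k smallest distinct hash values."""
    return sorted({_hash(v) for v in values})[:k]


def merge(a, b, k=K):
    """Union of two sketches is itself a valid sketch of the combined data."""
    return sorted(set(a) | set(b))[:k]


def estimate(sketch, k=K):
    """Estimated number of distinct values behind the sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k hashes seen: count is exact
    return (k - 1) / (sketch[k - 1] / MAX_HASH)


# device_id values from the previous slide, grouped by (Action, Protocol).
DEVICES = {
    ("ACCEPT", "TCP"): [4312345532, 3484920241, 4730093842],
    ("REJECT", "TCP"): [5832058870, 5789283478, 4930097162, 3081837193],
    ("REJECT", "UDP"): [9530174728, 4098310573, 9530174728],
}

sketches = {group: make_sketch(ids) for group, ids in DEVICES.items()}
for group, sketch in sketches.items():
    print(group, estimate(sketch))

# Query-time merge: unique devices across all REJECT rows.
reject_all = merge(sketches[("REJECT", "TCP")], sketches[("REJECT", "UDP")])
print("REJECT (any protocol):", estimate(reject_all))
```

With only a handful of devices per row the counts come out exact; the approximation only kicks in once a row has seen more than K distinct devices, at which point the sketch stays a fixed size no matter how many IDs were ingested.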

Architecture

Architecture (Ingestion)
[Diagram: Files and Streams -> Indexers -> Segments -> Historicals]

Architecture
[Diagram: Files and Streams -> Indexers -> Segments -> Historicals; Queries -> Brokers -> Historicals]

Querying
Query libraries:
- JSON over HTTP
- SQL
- R
- Python
- Ruby
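As a rough illustration of "JSON over HTTP", the sketch below builds a groupBy query as a JSON document and wraps it in an HTTP POST request without sending it. The broker address, the `netflow` datasource name and the column names are assumptions for the example, and the field names follow Druid's native query format as best understood here; check them against the Druid documentation for your version.

```python
import json
import urllib.request

# Assumed broker address; adjust for a real deployment.
BROKER_URL = "http://localhost:8082/druid/v2"


def build_groupby_query(datasource, interval):
    """Hourly sum of Flows grouped by Action and Protocol."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,          # hypothetical datasource name
        "granularity": "hour",
        "intervals": [interval],
        "dimensions": ["Action", "Protocol"],
        "aggregations": [
            {"type": "longSum", "name": "total_flows", "fieldName": "Flows"}
        ],
    }


def build_request(query, url=BROKER_URL):
    """Wrap the query in an HTTP POST request (not sent here)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = build_groupby_query(
    "netflow", "2011-01-01T00:00:00Z/2011-01-01T03:00:00Z"
)
request = build_request(query)
print(json.dumps(query, indent=2))

# Against a live cluster one would send it with:
#   with urllib.request.urlopen(request) as response:
#       rows = json.load(response)
```

The request is only constructed, not executed, so the example runs without a cluster.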

By the numbers
Supports high scale:
- 300B events daily on a 10K core Druid cluster
- 100 trillion events in the total database
- From https://metamarkets.com/2017/benefits-of-a-full-stack-solution/
On 10TB of data (~50B rows):
- Avg query latency of 550 ms
- 90% of queries return in < 1s, 95% in < 2s, and 99% in < 10s
- 22,914.43 events/second/core on a datasource with 30 dimensions and 19 metrics, running on an Amazon cc2.8xlarge instance
- From Druid whitepaper (http://static.druid.io/docs/druid.pdf)
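Some back-of-the-envelope arithmetic on these figures. The results are averages only, assuming load is spread evenly over the day and that 10TB means 10^13 bytes; neither assumption comes from the slide.

```python
EVENTS_PER_DAY = 300e9      # "300B events daily"
CLUSTER_CORES = 10_000      # "10K core Druid cluster"
SECONDS_PER_DAY = 24 * 60 * 60

# Average cluster-wide event rate, assuming a perfectly even load.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_per_second_per_core = events_per_second / CLUSTER_CORES

DATA_BYTES = 10e12          # "10TB", assumed to be decimal terabytes
ROWS = 50e9                 # "~50B rows"
bytes_per_row = DATA_BYTES / ROWS

print(f"{events_per_second:,.0f} events/sec across the cluster")
print(f"{events_per_second_per_core:,.1f} events/sec per core")
print(f"{bytes_per_row:.0f} bytes per row")
```

This works out to roughly 3.5 million events per second on average, which lines up with the "millions of events/sec" scale named earlier in the talk. The per-core average is not comparable to the 22,914 events/second/core figure, since the cluster's cores also serve queries.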

End to end data architecture

End-to-end data stack
[Diagram: Events -> Message bus -> Stream Processor / Batch Processor -> Druid -> Apps]

Connect
Works well with:
- Kafka
- Hadoop
- S3
- Spark Streaming
- Storm
- Samza
- I bet some other things that also start with S

Do try this at home

Download http://druid.io/downloads.html or https://imply.io/download

Thanks! fj@imply.io @fangjin