Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver


Outline
- A Threat Analytic Big Data product
- The SDACK Architecture
- Akka Streams and data pipelines
- Cassandra data model for time series data

Who am I
- Apache Bigtop PMC member
- Software Engineer @ Trend Micro
- Develops big data apps & infra

A Threat Analytic Big Data Product

Target Scenario InfoSec Team Security information and event management (SIEM)

Problems
- Too many logs to investigate
- Lack of actionable, prioritized recommendations

Threat Analytic System
[Diagram: log sources (AD, Windows event, DNS, Web Proxy, Splunk, ArcSight) feed Filtering, Anomaly Detection, Expert Rule, Prioritization, and Notification]

Deployment

On-Premises
[Diagram: the analytic engine (Filtering, Anomaly Detection, Expert Rule, Prioritization) deployed next to the InfoSec team's security information and event management (SIEM)]

On-Premises
[Diagram: AD, Windows Event Log, and DNS logs feed the on-premises engine (Filtering, Anomaly Detection, Expert Rule, Prioritization) used by the InfoSec team]

On the cloud? AD Windows Event Log DNS InfoSec Team

There is too much PII data in the logs

System Requirements
- Identify anomalous activities based on data
- Look up extra info such as LDAP, Active Directory, hostname, etc.
- Correlate with other data, e.g. a C&C connection followed by an anomalous login
- Support both streaming and batch analytic workloads
  - Run streaming detections to identify anomalies in a timely manner
  - Run batch detections with arbitrary start and end times
- Scalable
- Stable

PM: How exactly are we going to build this product? RD: No problem. I have a solution.

SMACK Stack: a toolbox for a wide variety of data processing scenarios

SMACK Stack
- Spark: fast and general-purpose engine for large-scale data processing
- Mesos: cluster resource management system
- Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
- Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
- Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Medium-sized Enterprises AD Windows Event Log DNS InfoSec Team

PM: Can't we simplify it? RD: No problem. I have a solution.

SDACK Stack

SDACK Stack
- Spark: fast and general-purpose engine for large-scale data processing
- Docker: deployment and resource management
- Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
- Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
- Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Docker - How it works
Three key techniques in Docker:
- Uses the Linux kernel's namespaces to create isolated resources
- Uses the Linux kernel's cgroups to constrain resource usage
- A union file system that supports git-like image creation

The nice things about it
- Lightweight
- Fast creation
- Efficient to ship
- Easy to deploy
- Portability: runs on *any* Linux environment

Medium-sized Enterprises AD Windows Event Log DNS InfoSec Team

Large Enterprises AD Windows Event Log DNS InfoSec Team

Fortune 500 AD Windows Event Log DNS InfoSec Team

Threat Analytic System Architecture

[Architecture diagram: logs flow through Akka actors into the DB; detection results are served by the API Server to the Web Portal]

Web Portal API Server

Docker Compose

kafka:
  build: .
  ports:
    - "9092:9092"
spark:
  image: spark
  ports:
    - "8080:8080"

(The API Server and Web Portal services are defined similarly.)

Akka Streams and data pipelines

[Pipeline diagram, built up step by step: receiver → buffer → transformation and external-info lookup → streaming and batch consumers]

Why not Spark Streaming?
- Each streaming job occupies at least one core
- There are many types of logs: AD, Windows Event, DNS, Proxy, Web Application, etc.
- Kafka offsets are lost when the code changes
- No back-pressure (back then); estimated back-pressure became available in Spark 1.5
- Integration with Akka-based info lookup services

Akka
- High-performance concurrency framework for Java and Scala
- Actor model for message-driven processing
- Asynchronous by design to achieve high throughput
- Each message is handled in a single-threaded context (no locks or synchronization needed, see the sketch below)
- Let-it-crash model for fault tolerance and auto-healing
- Clustering mechanism available
Reference: The Road to Akka Cluster, and Beyond
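
As a concrete illustration of that single-threaded message handling, here is a minimal Akka actor sketch; the LogParser actor and its message format are hypothetical examples, not code from the talk:

import akka.actor.{Actor, ActorSystem, Props}

// Messages to one actor are processed one at a time, so the mutable
// counter below needs no locks or synchronization.
class LogParser extends Actor {
  var parsedCount = 0
  def receive = {
    case line: String =>
      parsedCount += 1
      sender() ! s"parsed event #$parsedCount: ${line.take(30)}"
  }
}

val system = ActorSystem("threat-analytics")
val parser = system.actorOf(Props[LogParser], "ad-log-parser")
parser ! "2016-05-09 00:00:01 AD logon event ..."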

Akka Streams
- A DSL library for doing streaming computation on top of Akka
- Source, Flow, and Sink building blocks
- A Materializer turns each step into actors
- Back-pressure enabled by default
Reference: The Reactive Manifesto
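
A minimal sketch of the Source / Flow / Sink vocabulary, written against the Akka 2.4-era API in use around the time of this talk; the stages and values are illustrative only:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system = ActorSystem("streams-demo")
implicit val materializer = ActorMaterializer()  // turns each stage into actors

val source = Source(1 to 100)                    // upstream producer
val parse  = Flow[Int].map(n => s"event-$n")     // transformation stage
val store  = Sink.foreach[String](println)       // downstream consumer

// Back-pressure is propagated from the Sink upstream by default:
// the Source only emits as fast as the Sink signals demand.
source.via(parse).runWith(store)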

Reactive Kafka
- Akka Streams wrapper for Kafka
- Commits processed offsets back into Kafka
- Provides an at-least-once message delivery guarantee
https://github.com/akka/reactive-kafka

Message Delivery Semantics
- Actor model: at-most-once
- Akka Persistence: at-least-once; persists the log to external storage (like a WAL)
- Reactive Kafka: at-least-once + back-pressure; writes offsets back into Kafka
- At-least-once + idempotent writes = exactly-once

Implementing Data Pipelines
Reactive Kafka + Akka Streams: kafka source → Parsing Phase → Storing Phase → manual commit
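
A sketch of that kafka source → parse → store → manual commit shape, written against the newer akka-stream-kafka (Alpakka Kafka) API rather than the reactive-kafka 0.8 API used in the talk; parseAdLog and storeToCassandra are hypothetical helpers standing in for the real parsing and storing phases:

import akka.Done
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

implicit val system = ActorSystem("pipeline")
implicit val materializer = ActorMaterializer()
import system.dispatcher

case class AdEvent(raw: String)
def parseAdLog(line: String): AdEvent = AdEvent(line)                      // hypothetical parsing phase
def storeToCassandra(e: AdEvent): Future[Done] = Future.successful(Done)   // hypothetical storing phase

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("kafka:9092")
    .withGroupId("ad-log-pipeline")

Consumer.committableSource(consumerSettings, Subscriptions.topics("ad-log"))
  .map(msg => (parseAdLog(msg.record.value), msg.committableOffset))       // parsing phase
  .mapAsync(1) { case (event, offset) =>
    storeToCassandra(event).map(_ => offset)                               // storing phase
  }
  .mapAsync(1)(offset => offset.commitScaladsl())                          // manual commit only after the store succeeds
  .runWith(Sink.ignore)                                                    // (older Alpakka API; newer versions use Committer.sink)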

Scale Up
kafka source → Parsing Phase (mapAsync router fanning out to multiple parsers) → Storing Phase → manual commit
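
Scaling up the parsing phase is then mostly a matter of raising the parallelism of the async stage; a small sketch reusing the hypothetical parseAdLog/AdEvent helpers from the previous snippet:

import akka.NotUsed
import akka.stream.scaladsl.Flow
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Up to 8 log lines are parsed concurrently; mapAsync preserves downstream
// ordering (mapAsyncUnordered would not). parseAdLog and AdEvent are the
// hypothetical helpers defined in the previous sketch.
val parseInParallel: Flow[String, AdEvent, NotUsed] =
  Flow[String].mapAsync(parallelism = 8)(line => Future(parseAdLog(line)))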

Scale Out
Scale out by creating more Docker containers in the same Kafka consumer group

Be careful when using Reactive Kafka's manual commit

A bug that periodically reset the committed offset
[Diagram: one kafka source with manual commit; committed offsets: partition 0 = 50, partition 1 = 90, partition 2 = 70]

A bug that periodically reset the committed offset
[Diagram: partitions 0 and 1 have advanced to offsets 300 and 250, while partition 2 shows 70 in one consumer's committed offsets and 350 in another, i.e. its committed offset keeps being reset]

Normal: the red line (consumer lag) keeps declining

Abnormal: the red line (consumer lag) keeps getting reset

PR merged into the 0.8 branch; apply the patch if your version is <= 0.8.7
https://github.com/akka/reactive-kafka/pull/144/commits

Akka Cluster
- Connects Akka runtimes into a cluster
- No SPOF
- Gossip protocol
- ClusterSingleton library available

Lookup Service Integration
[Diagram, built up across slides: Akka Streams pipelines in an Akka Cluster discover and query the LDAP-lookup and hostname-lookup services, each running as a ClusterSingleton]

Async Lookup Process
[Diagram, built up across slides: each Akka Streams pipeline queries the hostname-lookup ClusterSingleton asynchronously and keeps a local cache, so lookups do not block the stream]
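
A sketch of this asynchronous, cached lookup against a ClusterSingleton; the HostnameQuery/HostnameResult protocol, actor paths, and cache are hypothetical illustrations of the pattern, not the talk's actual code:

import akka.actor.ActorSystem
import akka.cluster.singleton.{ClusterSingletonProxy, ClusterSingletonProxySettings}
import akka.pattern.ask
import akka.util.Timeout
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.Future
import scala.concurrent.duration._

case class HostnameQuery(ip: String)                              // hypothetical protocol
case class HostnameResult(ip: String, hostname: Option[String])

implicit val system = ActorSystem("pipeline")
implicit val timeout = Timeout(3.seconds)
import system.dispatcher

// Proxy that routes messages to wherever the single lookup actor currently runs.
val lookupProxy = system.actorOf(
  ClusterSingletonProxy.props(
    singletonManagerPath = "/user/hostname-lookup",
    settings = ClusterSingletonProxySettings(system)),
  name = "hostname-lookup-proxy")

// Per-node cache in front of the singleton; misses are resolved asynchronously,
// so the stream is never blocked waiting for a lookup.
val cache = new ConcurrentHashMap[String, HostnameResult]()

def lookup(ip: String): Future[HostnameResult] =
  Option(cache.get(ip)) match {
    case Some(hit) => Future.successful(hit)
    case None =>
      (lookupProxy ? HostnameQuery(ip)).mapTo[HostnameResult].map { res =>
        cache.put(ip, res)
        res
      }
  }

// Inside the pipeline this becomes a non-blocking enrichment stage, e.g.
//   Flow[String].mapAsync(4)(ip => lookup(ip))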

Cassandra Data Model for Time Series Data

Event Time vs. Processing Time

Problem
- We need to do analytics based on event time instead of processing time (example: modelling user behaviour)
- Logs may arrive late: slow network, busy machine, service restart, etc.
- Logs may arrive out of order: multiple sources, different network routing, etc.

Problem
Spark Streaming can't handle event time (e.g. events arriving with timestamps 12:01, 12:02, 12:03, 11:50, 12:04)

Solution
- Store time series data in Cassandra
- Query Cassandra based on event time
- Get the most accurate result at the time the query is executed
- Late-arrival tolerance is flexible
- Simple

Cassandra Data Model
- Retrieving an entire row is fast
- Retrieving a range of columns is fast
- Reads are contiguous blocks on disk (no random seeks)

Data Model for AD log: use shardid to distribute the logs across partitions

Data Model for AD log: the partition key supports frequent queries for a specific hour

Data Model for AD log: clustering key for range queries
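
The exact schema is not shown in this transcription, so here is a hypothetical sketch of such a table (columns inferred from the Query APIs slide: shardid, ddmmyyhh, eventid, clientaddr), created through the DataStax Java driver; the siem keyspace and ad_log table names are assumptions:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("cassandra").build()
val session = cluster.connect()

// shardid spreads one hour of logs over several partitions, ddmmyyhh keeps a
// per-hour query on a small number of partitions, and the clustering columns
// (ts, eventid) give cheap event-time range scans within a partition.
session.execute(
  """CREATE TABLE IF NOT EXISTS siem.ad_log (
    |  shardid    int,
    |  ddmmyyhh   text,
    |  ts         timestamp,
    |  eventid    text,
    |  clientaddr text,
    |  raw        text,
    |  PRIMARY KEY ((shardid, ddmmyyhh), ts, eventid)
    |) WITH CLUSTERING ORDER BY (ts ASC)""".stripMargin)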

Wrap up

Recap: SDACK Stack
- Spark: streaming and batch analytics
- Docker: shipping, deployment, and resource management
- Akka: resilient, scalable, flexible data pipelines
- Cassandra: batch queries based on event time
- Kafka: durable buffer, stream processing

Thank you! Questions?

Backup

Query APIs

val conn = new CassandraDataConnector(sc)
val rdd = conn.getData(timeframe, table)
val rdd = conn.getData(timeframe, table, eventId, clientAddr)
val rdd = sc.cassandraTable(keyspace, table)
  .select(columns: _*)
  .where("shardid in (?, ?, ?)", 0, 1, 2)
  .where("ddmmyyhh in (?)", "09-05-2016-00")

Sync Lookup
[Diagram: Akka Streams pipeline calling the hostname-lookup ClusterSingleton synchronously, with a cache]

No Back-pressure
[Diagram: a fast Source overwhelms a slow Sink]

With Back-pressure
[Diagram: the slow Sink requests 3 elements at a time, so the fast Source only emits what is demanded]

- Distributed NoSQL key-value storage, no SPOF
- Fast on writes, suitable for time series data
- Decent read performance, if designed right
- Build the data model around your queries
- Spark Cassandra Connector
- Configurable C/A trade-off (CAP theorem): choose A over C and vice versa
Reference: Dynamo: Amazon's Highly Available Key-value Store
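
For instance, the consistency level used by the Spark Cassandra Connector is configurable; a sketch that favors availability by reading and writing at ONE (the host, keyspace, and table names are hypothetical):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("batch-detection")
  .set("spark.cassandra.connection.host", "cassandra")
  // Favor availability over consistency: a single replica is enough to answer.
  .set("spark.cassandra.input.consistency.level", "ONE")
  .set("spark.cassandra.output.consistency.level", "ONE")

val sc = new SparkContext(conf)
val adLogs = sc.cassandraTable("siem", "ad_log")   // hypothetical keyspace and table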

[Pipeline diagram: receiver → buffer → transformation and external-info lookup → streaming and batch consumers]

Sweet Spots
- Choose availability over consistency: we don't do updates, hence no strong consistency is required
- Native, efficient, and flexible time-range queries
- CQL to support research and log investigations