Fluentd + MongoDB + Spark = Awesome Sauce


Nishant Sahay, Sr. Architect, Wipro Limited
Bhavani Ananth, Tech Manager, Wipro Limited

Wipro Open Source Practice: Vision & Mission
Vision: Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity.
Mission: Wipro's Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers' needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro's comprehensive suite of strategic and technology services will be delivered with passion and precision.

Wipro Open Source Practice Offerings
- Advisory
- Enterprise-wide adoption strategies
- Best-fit analysis & recommendation
- Business case advisory
- Governance
- Technical consulting
- Productized services
- Legacy migration services
- Greenfield development
- Open source stack setup
- Open App
- Cross-industry solutions and process stacks
- Support
- Application and infrastructure
- DevOps
- Architecture, development
- Open source community

Connected Warehouse Platform
[Figure: Connected Warehouse Platform (labels: CSC, SCP). Warehouse mobility apps and dashboards: carrier, vendor, facility, inventory & operations, orders, alerts & notifications, warehouse KPIs, performance tracker, equipment monitor. Platform core: master data, transaction data, web services, integration mapping, FTP (flat file/XML), subscriber queues, publisher queues, automation enabler. Labeled data flows: sales orders, route plan / carrier tracking, associate performance, put/pick status, purchase orders, master data, with frequencies marked real-time, almost real-time, and scheduled. Connected systems: OMS, TMS, WMS, LMS, WCS, IoT, ERP/host. Endpoints: direct to customer, warehouses, equipment, retailer, supplier.]

The Awesome Sauce: ANALYTICS & PREDICTION

Clickstream Analytics
- User behavior analysis
- Product affinity
- Website resource allocation
- Prediction & recommendation

PREDICTION & RECOMMENDATION
Prediction using machine learning
- Content recommendation
- Conversion prediction
- Visitor segmentation
- Demand forecasting

Sauce Raw Material: LOGS

Logs, Logs Everywhere!
- Syslog
- Clickstream data
- Social media feeds
- Packet data
- Sensor data
- CDR
- Device logs
- Custom app (C, Ruby, Python)
- Payment data
- Application server logs
- Web access logs
- Database logs

What can be done with logs?
- Real-time monitoring
- Root cause analysis
- Anomaly detection and predictive monitoring
- Debugging
- Troubleshooting/support

Challenges with Log Analytics
- No standard log formats
- Multiple logging frameworks
- Logs are highly decentralized
- Limited real-time visualization capability
- Scalability issues
- Normalizing and correlating logs from disparate sources
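The "no standard log formats" and "normalizing and correlating" challenges can be illustrated with a small sketch. It is not taken from the talk: the application log format, the output field names, and the rule that maps HTTP 5xx to ERROR are all assumptions made for illustration. The idea is only that two differently shaped lines end up in one schema with UTC timestamps, so they can be correlated.

```python
import re
from datetime import datetime, timezone

# Apache-style access log line, e.g.
# 10.0.0.1 - - [10/Oct/2023:15:55:36 +0200] "GET /cart HTTP/1.1" 500 512
APACHE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical application log format: "2023-10-10T13:55:40Z ERROR some message"
APP_RE = re.compile(
    r'(?P<time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z (?P<level>[A-Z]+) (?P<message>.*)'
)


def normalize(line):
    """Return a record in a shared schema, or None if no parser matches."""
    m = APACHE_RE.match(line)
    if m:
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        return {
            "source": "web_access",
            # Convert to UTC so records from different hosts line up.
            "timestamp": ts.astimezone(timezone.utc).isoformat(),
            # Assumption: treat 5xx responses as errors, everything else as info.
            "level": "ERROR" if m["status"].startswith("5") else "INFO",
            "message": f'{m["method"]} {m["path"]} {m["status"]}',
        }
    m = APP_RE.match(line)
    if m:
        ts = datetime.strptime(m["time"], "%Y-%m-%dT%H:%M:%S")
        return {
            "source": "app",
            "timestamp": ts.replace(tzinfo=timezone.utc).isoformat(),
            "level": m["level"],
            "message": m["message"],
        }
    return None
```

Once both sources share the `timestamp` and `level` fields, a web 500 and an application error a few seconds apart can be matched with an ordinary time-window query.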

What can be done with logs? (Business PoV)
- Input data analytics
- User interactions/behavior
- End-user experience/improvements

Awesome Foursome: The Ingredients

The Ingredients: FLUENTD

Why Fluentd
- Unified logging
- Simple and flexible
- Proven
- Minimal resources
- Reliable
- Open source community

Fluentd Plugin Architecture
- Input: Input (udp, tcp, http, tail), Parser (regexp, apache2)
- Filter: Filter (grep, enrich, delete, mask)
- Output: Buffer, Format, Output (out_mongo)
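To make the stage order concrete, here is a toy model of the input, parser, filter, buffer, output flow in plain Python. It is not Fluentd's API or configuration syntax; every function and class name below is hypothetical, and the key=value input format is an assumption. It only shows how records pass through parse, a chain of filters, and a chunked buffer before reaching an output.

```python
import re


def parse_kv(line):
    """Parser stage: 'k1=v1 k2=v2' -> {'k1': 'v1', 'k2': 'v2'}."""
    return dict(pair.split("=", 1) for pair in line.split() if "=" in pair)


def grep(pattern, field):
    """Filter stage: keep only records whose field matches the pattern."""
    rx = re.compile(pattern)
    return lambda record: record if rx.search(record.get(field, "")) else None


def enrich(**extra):
    """Filter stage: add fixed fields (for example the originating host)."""
    return lambda record: {**record, **extra}


def mask(field):
    """Filter stage: hide a sensitive field if it is present."""
    return lambda record: {**record, field: "****"} if field in record else record


class BufferedOutput:
    """Buffer + output stage: collect records into chunks, flush chunks to a sink."""

    def __init__(self, sink, chunk_limit=2):
        self.sink = sink              # callable that receives a list of records
        self.chunk_limit = chunk_limit
        self.chunk = []

    def emit(self, record):
        self.chunk.append(record)
        if len(self.chunk) >= self.chunk_limit:
            self.flush()

    def flush(self):
        if self.chunk:
            self.sink(list(self.chunk))
            self.chunk.clear()


def run_pipeline(lines, filters, output):
    """Input stage: feed raw lines through parser, filters, and buffered output."""
    for line in lines:
        record = parse_kv(line)
        for apply_filter in filters:
            record = apply_filter(record)
            if record is None:        # a filter dropped the record
                break
        if record is not None:
            output.emit(record)
    output.flush()                    # flush the final partial chunk
```

In this model the sink passed to `BufferedOutput` plays the role that an output plugin such as out_mongo plays on the slide: it receives whole chunks rather than single records.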

HA Fluentd topology
At-most-once and at-least-once transfers
[Figure: log forwarders (a Fluentd instance reading the log file on each of Node1, Node2, Node3) PUSH to the log aggregators, an active Fluentd and a backup Fluentd, which PUSH to the destinations, MongoDB and Amazon S3.]

Fluentd Failure Scenarios
- Forwarder goes down
- Aggregator goes down
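One way to reason about the "aggregator goes down" case is a forwarder that keeps records in a buffer until some aggregator accepts them, trying the active one first and the backup second. The sketch below is a simplified model under that assumption, not Fluentd code; `Forwarder`, `AggregatorDown`, and the send callables are hypothetical names.

```python
class AggregatorDown(Exception):
    """Raised by a send function when its aggregator is unreachable."""


class Forwarder:
    def __init__(self, aggregators):
        # Ordered list of send callables: active first, backup second.
        self.aggregators = aggregators
        self.buffer = []

    def emit(self, record):
        """Buffer the record, then try to deliver everything buffered so far."""
        self.buffer.append(record)
        return self.flush()

    def flush(self):
        """Try each aggregator in order; keep the buffer if all of them fail."""
        if not self.buffer:
            return True
        for send in self.aggregators:
            try:
                send(list(self.buffer))
            except AggregatorDown:
                continue              # fail over to the next aggregator
            self.buffer.clear()       # clear only after a successful send
            return True
        return False                  # nothing reachable: retry on a later flush
```

Because the buffer is cleared only after a send succeeds, a record is never dropped while any retry remains, which is the at-least-once side of the trade-off: if a send succeeded but the forwarder did not learn about it, the same records would be sent again.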

The Ingredients: KAFKA

Kafka: distributed streaming platform
- Publish and subscribe to streams of records
- Store streams of records in a fault-tolerant way
- Process streams of records
[Figure: a central Kafka cluster surrounded by producers (apps), connectors (DBs), stream processors (apps), and consumers (apps).]

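Kafka Terms
- Topic
- Partition
- Producer
- Consumer
- Consumer group
[Figure: a producer writes to a topic split into Partition-1, Partition-2, and Partition-3, each an ordered sequence of offsets (0, 1, 2, ...); brokers hold partitions p1, p2, p3 and replicas R1, R2, R3; consumers C1 and C2 read as members of consumer groups.]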
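The relationship between keys, partitions, and consumer groups can be sketched in a few lines. This is an illustration only: `zlib.crc32` stands in for whatever hash a real producer uses, and the round-robin assignment is just one possible strategy. Both function names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition.

    crc32 is a stand-in hash for illustration; the point is only that the
    same key always lands in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Spread the partitions of a topic over the consumers of one group.

    Each partition goes to exactly one consumer in the group (round-robin here).
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

Since all records with one key go to one partition, and one partition is read by one consumer per group, per-key ordering carries through from producer to consumer.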

Why Kafka
- Ideal unified platform for handling real-time data feeds
- High throughput, to support high-volume event streams such as log aggregation
- Deals well with high-volume data loads from offline systems
- Fault tolerant and scalable
- Low-latency delivery, as needed for traditional messaging use cases

Kafka decouples data pipelines
[Figure: several producers write to the Kafka broker; several consumers read from it independently.]

Kafka Guarantees
- Messages sent by a producer to a particular topic partition are appended in the order they are sent
- A consumer instance sees messages in the order they are stored in the log
- A topic with replication factor N can tolerate up to N-1 broker failures without losing committed messages

Kafka Replication
[Figure: producers write logs to a cluster of four brokers (Broker1 to Broker4). Topic1-part1 and Topic1-part2 each have one leader replica and follower replicas placed on other brokers.]

ZooKeeper
- ZooKeeper enables highly reliable distributed coordination
- Kafka bundles a single-node ZooKeeper instance
- Metadata includes broker addresses and message offsets
[Figure: producers and consumers exchange metadata with ZooKeeper and messages with the Kafka cluster.]

Kafka Persistence - File System
- Sequential file I/O is very fast
- Uses the OS page cache for data storage
- Batching of messages speeds up disk operations, network transfers, and in-memory iterations
[Figure source: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg]

Batch Processing
- One of the big drivers of efficiency
- Producers accumulate data in memory and send larger batches in a single request
- Cap the size of a batch: batch.size (a limit in bytes, not a message count)
- Wait no longer than a fixed latency bound: linger.ms
- Trade a small amount of latency for better throughput
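The interaction of the two knobs can be modeled with a small accumulator: a batch is sent as soon as it is full or as soon as its oldest record has waited long enough, whichever comes first. This is a simplified sketch, not the Kafka producer; the class, its parameter names, and the injected clock are assumptions. Note that the size limit is expressed in bytes, as batch.size is.

```python
import time


class BatchAccumulator:
    def __init__(self, send, batch_size_bytes, linger_s, clock=time.monotonic):
        self.send = send                      # callable that receives a list of payloads
        self.batch_size_bytes = batch_size_bytes
        self.linger_s = linger_s
        self.clock = clock                    # injectable so the sketch can be tested
        self.batch = []
        self.batch_bytes = 0
        self.first_append = None

    def append(self, payload: bytes):
        """Add one record to the current batch and send if a limit is hit."""
        if not self.batch:
            self.first_append = self.clock()  # linger is measured from the first record
        self.batch.append(payload)
        self.batch_bytes += len(payload)
        self.poll()

    def poll(self):
        """Send the batch if it is full or has lingered long enough."""
        if not self.batch:
            return
        full = self.batch_bytes >= self.batch_size_bytes
        expired = self.clock() - self.first_append >= self.linger_s
        if full or expired:
            self.send(list(self.batch))
            self.batch.clear()
            self.batch_bytes = 0
```

A larger linger bound gives the accumulator more time to fill a batch, which raises throughput at the cost of added latency for the first record in each batch.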

Log Compaction
- Per-record retention, rather than the coarser-grained time-based retention
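Per-record retention means that the log keeps at least the latest value for each key instead of discarding whole time ranges. A minimal model of that idea follows; the function name and the tuple layout are assumptions, and treating a `None` value as a delete marker (tombstone) that is removed immediately is a simplification.

```python
def compact(log):
    """Compact a log given as a list of (offset, key, value) tuples.

    Keeps only the most recent record for each key, drops keys whose most
    recent value is None (a delete marker), and preserves offset order.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)    # later records overwrite earlier ones
    survivors = [rec for rec in latest.values() if rec[2] is not None]
    return sorted(survivors, key=lambda rec: rec[0])
```

Surviving records keep their original offsets, so a consumer reading the compacted log still sees records in log order, just with gaps where older values were removed.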

Fluentd Kafka Integration
- Kafka Fluentd Consumer
- Fluentd kafka plugin
[Figure: log forwarders (Fluentd) PUSH into the Kafka clusters; consumers from the Kafka ecosystem and Fluentd PULL from Kafka; Fluentd then PUSHes to the destinations, MongoDB and Amazon S3.]

Advantage - Fluentd-Kafka
- Backpressure: pull versus push
- Reliable, flexible data pipeline

Connected Warehouse Kafka Cluster Architecture
[Figure: the Fluentd-Kafka plugin feeds one Kafka cluster spanning two active data centers. Data Center 1 (active): Kafka Broker 1 with Topic 1, partitions 0..n, and ZK 1 (leader). Data Center 2 (active): Kafka Broker 2 with Topic 1, partitions n+1..n+n, and ZK 2 (follower). ZK 1 and ZK 2 form the ZooKeeper ensemble.]

The Ingredients: MONGODB

Why MongoDB
- Cross-platform, document-oriented NoSQL database
- Simple and flexible data model
- Field-level indexing
- Built-in query capabilities
- High performance

System Architecture With Shards
[Figure: data sources connect through three mongos routers, backed by a config server, to five shards; each shard is a replica set with one primary and two secondaries.]

MongoDB For Analytics
- Denormalization, supported by embedded documents
- Connectors for almost all kinds of data sources
- Aggregation framework
- Text search queries
- Range queries, key-value queries
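As an illustration of how embedded documents, range queries, and the aggregation framework combine, the sketch below builds an aggregation pipeline as plain Python data. The collection layout is hypothetical: it assumes clickstream documents with a `timestamp` field and an embedded `page` document that has a `path` field. The stage operators (`$match`, `$group`, `$sort`, `$limit`) are standard aggregation stages.

```python
from datetime import datetime


def top_pages_pipeline(start, end, top_n=10):
    """Build a pipeline that counts views per page within a time range."""
    return [
        # Range query on the event timestamp.
        {"$match": {"timestamp": {"$gte": start, "$lt": end}}},
        # Group on a field of the embedded "page" document (dot notation).
        {"$group": {"_id": "$page.path", "views": {"$sum": 1}}},
        # Most viewed pages first.
        {"$sort": {"views": -1}},
        {"$limit": top_n},
    ]


pipeline = top_pages_pipeline(datetime(2023, 10, 1), datetime(2023, 11, 1), top_n=5)
```

With a driver such as PyMongo, a list like this could be passed to a collection's `aggregate` method; no database connection is made here, so the sketch runs on its own.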

The Ingredients: SPARK

Spark Logical Architecture
- Languages: Scala, Java, Python, R
- Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
- Core: Apache Spark
- Spark MongoDB Connector

Putting It All Together
[Figure: end-to-end pipeline for Click Stream + Inventory Mgmt Micro-Service, with stages labeled Collection, Ingestion, Processing, and Data Sync.]

QUESTIONS & ANSWERS

Thank you

www.modsummit.com www.developersummit.com