Fluentd + MongoDB + Spark = Awesome Sauce

Nishant Sahay, Sr. Architect, Wipro Limited; Bhavani Ananth, Tech Manager, Wipro Limited

Wipro Open Source Practice: Vision & Mission
Vision: Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity.
Mission: Wipro's Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers' needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro's comprehensive suite of strategic and technology services will be delivered with passion and precision.

Wipro Open Source Practice Offerings Advisory Enterprise-wide adoption strategies Best fit analysis & recommendation Business Case Advisory Governance Technical Consulting Productized Services Legacy Migration Services Greenfield Development Open Source Stack Setup Open App Cross Industry Solutions and Process Stacks Support Application and Infrastructure Dev Ops Architecture, Development Open Source Community

Connected Warehouse Platform [Architecture diagram. Labels: CSC, SCP. Dashboards: Warehouse Mobility & Dashboards, Alerts & Notification, Warehouse KPIs, Performance Tracker, Equipment Monitor. Entities: Carrier, Vendor, Facility, Inventory & Operations, Orders. Platform data: Master Data, Transaction Data. Integration: Webservices, Mapping, FTP (flat file/XML), Subscriber Queues, Publisher Queues, Automation Enabler. Feeds: Sales Orders (real-time), Route Plan / Carrier Tracking (almost real-time), Associate Performance, PUT/PICK Status, Purchase Orders, Master Data (scheduled). Connected systems: OMS, TMS, WMS, LMS, WCS, IoT, ERP/HOST. Parties: Direct to Customer, Warehouses, Equipment, Retailer, Supplier.]

The Awesome Sauce ANALYTICS & PREDICTION

Clickstream Analytics: User Behavior Analysis; Product Affinity; Website Resource Allocation; Prediction & Recommendation

PREDICTION & RECOMMENDATION: Prediction Using Machine Learning; Content Recommendation; Conversion Prediction; Visitor Segmentation; Demand Forecasting

Sauce Raw Material LOGS

Logs, Logs Everywhere! Syslog; Clickstream Data; Social Media Feeds; Packet Data; Sensor Data; CDR; Device Logs; Custom App (C, Ruby, Python); Payment Data; Application Server Logs; Web Access Logs; Database Logs

What can be done with logs? Real-time monitoring; Root cause analysis; Anomaly detection and predictive monitoring; Debugging; Troubleshooting/Support

Challenges with Log Analytics: No standard log formats; Multiple logging frameworks; Logs highly decentralized; Limited real-time visualization capability; Scalability issues; Normalizing and correlating logs from disparate sources

What can be done with logs from a business point of view? Input Data Analytics; User Interactions/Behavior; End-User Experience/Improvements

Awesome Foursome The Ingredients

The Ingredients FLUENTD

Why Fluentd: Unified Logging; Simple and Flexible; Proven; Minimal Resources; Reliable; Open Source Community

Fluentd Plugin Architecture: Input (udp, tcp, http, tail) with Parser (regexp, apache2); Filter (grep, enrich, delete, mask); Output with Buffer and Format (for example out_mongo)
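
The parser and filter stages on the slide above can be illustrated with a small sketch. This is not Fluentd code: it is plain Python that mimics what a regexp parser and grep/mask/enrich filters do to one record. The regular expression is a simplified pattern for Apache-style access lines, and the function names, the `source` field and the `/products` rule are hypothetical.

```python
import re

# Simplified pattern for an Apache-style access log line (an assumption,
# not the pattern Fluentd's apache2 parser uses).
APACHE_LINE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
)

def parse(line):
    """Parser stage: turn a raw line into a structured record, or None."""
    match = APACHE_LINE.match(line)
    return match.groupdict() if match else None

def grep(record, key, pattern):
    """Filter stage: keep the record only if record[key] matches pattern."""
    return record if re.search(pattern, record[key]) else None

def mask(record, key):
    """Filter stage: hide a sensitive field without mutating the input."""
    masked = dict(record)
    masked[key] = "***"
    return masked

def enrich(record, **extra):
    """Filter stage: add fields such as the originating host."""
    return {**record, **extra}

def run_pipeline(lines):
    """Chain parse -> grep -> mask -> enrich; the result is what an
    output plugin would receive."""
    out = []
    for line in lines:
        record = parse(line)
        if record is None:
            continue  # unparseable line
        record = grep(record, "path", r"^/products")
        if record is None:
            continue  # filtered out
        out.append(enrich(mask(record, "user"), source="web-1"))
    return out
```

In Fluentd itself these stages are declared in configuration rather than written as functions; the sketch only shows the order in which a record is transformed.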

HA Fluentd Topology: at-most-once and at-least-once transfers. [Diagram: log files on Node1, Node2 and Node3 are read by Fluentd log forwarders, which PUSH to Fluentd log aggregators (one active, one backup), which PUSH to the destinations MongoDB and Amazon S3.]

Fluentd Failure Scenarios: Forwarder goes down; Aggregator goes down

The Ingredients KAFKA

Kafka distributed streaming platform: Publish and subscribe to streams of records; Store streams of records in a fault-tolerant way; Process streams of records. [Diagram: a Kafka cluster surrounded by producers (apps), connectors (DBs), stream processors (apps) and consumers (apps).]

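Kafka Terms: Topic; Partition; Producer; Consumer. [Diagram: a producer writes to topics split into Partition-1, Partition-2 and Partition-3, each an ordered sequence of offsets (0, 1, 2, ...); brokers host partitions p1, p2, p3 and replicas R1, R2, R3; consumers C1 and C2 in consumer groups read from the partitions.]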
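
How records map to partitions and partitions map to consumers in a group can be sketched as follows. This is an illustration under stated assumptions, not Kafka client code: `zlib.crc32` stands in for the murmur2 hash used by the Java client's default partitioner for keyed records, and the round-robin assignment is only one of several possible strategies. The function names are hypothetical.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition. The same key always lands in the
    same partition, which is what gives per-key ordering.
    crc32 is a stand-in hash chosen for this sketch."""
    return zlib.crc32(key) % num_partitions

def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.
    Each partition gets exactly one owner within the group."""
    owners = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owners[consumers[index % len(consumers)]].append(partition)
    return owners
```

For example, `assign([0, 1, 2], ["C1", "C2"])` gives C1 partitions 0 and 2 and C2 partition 1, so no partition is read by two members of the same group.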

Why Kafka: Ideal unified platform for handling real-time data feeds; High throughput to support high-volume event streams such as log aggregation; Deals well with high-volume data loads from offline systems; Fault tolerant and scalable; Delivers the low latency expected of traditional messaging systems

Kafka decouples data pipelines [Diagram: multiple producers publish to the Kafka broker; multiple consumers read from it.]

Kafka Guarantees: Messages sent by a producer to a particular topic partition are appended in the order they are sent; A consumer instance sees messages in the order they are stored in the partition; A topic with replication factor N can tolerate up to N-1 broker failures without losing committed messages

Kafka Replication [Diagram: producers write logs to a cluster of four brokers (Broker1 to Broker4). Each partition of Topic1 (part1, part2) has one leader replica and follower replicas placed on other brokers.]

ZooKeeper: ZooKeeper enables highly reliable distributed coordination; Kafka bundles a single-node ZooKeeper instance; Metadata includes broker addresses and message offsets. [Diagram: metadata flows between ZooKeeper and the producers, consumers and Kafka cluster; messages flow from producers through the Kafka cluster to consumers.]

Kafka Persistence - File System: Sequential file I/O is very fast; Uses the OS page cache for data storage; Batching of messages speeds up disk operations, network transfers and in-memory iterations. Figure: http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg

Batch Processing: One of the big drivers of efficiency; Producers accumulate data in memory and send larger batches in a single request; Cap the size of a batch in bytes with batch.size; Wait no longer than a fixed latency bound with linger.ms; Trades a small amount of latency for better throughput
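
The interplay of the two settings can be seen in a minimal simulation. This is a sketch of the batching idea only, not the Kafka producer: the class and method names are hypothetical, time is passed in explicitly as milliseconds, and "sending" just appends the batch to a list.

```python
class BatchAccumulator:
    """Collects records until a size bound or a latency bound is reached."""

    def __init__(self, batch_size_bytes, linger_ms):
        self.batch_size = batch_size_bytes  # analogous to batch.size
        self.linger_ms = linger_ms          # analogous to linger.ms
        self.pending = []
        self.pending_bytes = 0
        self.first_ts = None                # arrival time of oldest pending record
        self.sent = []                      # batches "sent" so far

    def append(self, record: bytes, now_ms: int):
        # Flush first if this record would overflow the current batch.
        if self.pending and self.pending_bytes + len(record) > self.batch_size:
            self.flush()
        if not self.pending:
            self.first_ts = now_ms
        self.pending.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.batch_size:
            self.flush()

    def poll(self, now_ms: int):
        # Latency bound: do not hold a partial batch longer than linger_ms.
        if self.pending and now_ms - self.first_ts >= self.linger_ms:
            self.flush()

    def flush(self):
        if self.pending:
            self.sent.append(list(self.pending))
            self.pending = []
            self.pending_bytes = 0
            self.first_ts = None
```

A larger size bound or a longer linger time yields fewer, bigger requests at the cost of holding records slightly longer before they leave the producer.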

Log Compaction: Per-record retention, rather than the coarser-grained time-based retention
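
Per-record retention means the log keeps at least the latest value for each key. A minimal sketch of that idea, assuming a log modelled as a list of `(offset, key, value)` tuples and a `None` value acting as a delete marker (both are modelling choices for this example; a real broker compacts in the background and keeps delete markers for a configurable period before dropping them):

```python
def compact(log):
    """Keep only the most recent record per key, preserving offset order.
    Keys whose latest value is None (a delete marker) are dropped."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    survivors = [record for record in latest.values() if record[2] is not None]
    return sorted(survivors, key=lambda record: record[0])
```

Offsets of surviving records are unchanged, so a consumer position stays valid after compaction.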

Fluentd Kafka Integration: Kafka Fluentd Consumer; Fluentd kafka plugin. [Diagram: Fluentd log forwarders PUSH to the Kafka clusters; Fluentd consumers PULL from Kafka and PUSH to the destinations MongoDB and Amazon S3; other Kafka ecosystem consumers also read from the clusters.]

Advantage - Fluentd-Kafka: Backpressure (pull versus push); Reliable, flexible data pipeline
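
The pull-versus-push point can be made concrete with a toy model. It is a deliberately simplified sketch, not a model of Fluentd or Kafka internals: the push receiver has a small bounded buffer and drops what does not fit (real Fluentd buffers and retries before giving up), while the pull consumer reads from a retained log at its own pace using an offset. All names and parameters are hypothetical.

```python
from collections import deque

def push_delivery(events, buffer_capacity, drain_every):
    """Sender pushes every event; the receiver only drains one event
    every `drain_every` pushes. Returns (delivered, dropped)."""
    buffer = deque()
    delivered, dropped = [], []
    for count, event in enumerate(events, start=1):
        if len(buffer) < buffer_capacity:
            buffer.append(event)
        else:
            dropped.append(event)  # receiver is overwhelmed
        if count % drain_every == 0 and buffer:
            delivered.append(buffer.popleft())
    delivered.extend(buffer)  # drain what is left once the sender stops
    return delivered, dropped

def pull_delivery(log, max_fetch):
    """Consumer fetches at most `max_fetch` records per request and
    tracks its own offset, so a slow consumer falls behind but loses nothing."""
    offset = 0
    delivered = []
    while offset < len(log):
        batch = log[offset:offset + max_fetch]
        delivered.extend(batch)
        offset += len(batch)
    return delivered
```

With a slow receiver the push model has to drop, block or buffer elsewhere, whereas in the pull model the retained log itself absorbs the difference in speed.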

Connected Warehouse Kafka Cluster Architecture [Diagram: the Fluentd-Kafka plugin feeds a Kafka cluster spanning Data Center 1 (active) and Data Center 2 (active). Kafka Broker 1 holds Topic 1, partitions 0..n; Kafka Broker 2 holds Topic 1, partitions n+1..n+n. The ZooKeeper ensemble consists of ZK 1 (leader) and ZK 2 (follower).]

The Ingredients MONGODB

Why MongoDB: Cross-platform, document-oriented NoSQL database; Simple and flexible data model; Field-level indexing; Built-in query capabilities; High performance

System Architecture With Shards [Diagram: data sources connect through three mongos routers, supported by a config server, to the shards; the shards are shown as primaries with secondaries (five primaries, ten secondaries).]

MongoDB For Analytics: Denormalization through support for embedded documents; Connectors for almost all kinds of data sources; Aggregation Framework; Text search queries; Range queries and key-value queries
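
As an illustration of the Aggregation Framework on clickstream data, the sketch below builds a pipeline that counts hits per page. The field names `time` and `path` and the function name are assumptions about how the log records might be stored; they are not taken from the talk. The stage operators (`$match`, `$group`, `$sort`, `$limit`) are standard MongoDB aggregation stages.

```python
def top_pages_pipeline(since, limit=10):
    """Build an aggregation pipeline: most-visited paths since a given time.
    `time` and `path` are hypothetical field names."""
    return [
        {"$match": {"time": {"$gte": since}}},               # range query on time
        {"$group": {"_id": "$path", "hits": {"$sum": 1}}},   # count per page
        {"$sort": {"hits": -1}},                             # busiest first
        {"$limit": limit},
    ]

# With a driver such as pymongo, the list would be passed to
# collection.aggregate(top_pages_pipeline(since)).
```

Because the pipeline is plain data, it can be built and inspected without a database connection and reused across collections that share the same fields.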

The Ingredients SPARK

Spark Logical Architecture [Diagram: Apache Spark core with the Spark SQL, Spark Streaming, MLlib and GraphX libraries; APIs in Scala, Java, Python and R; Spark MongoDB Connector.]

Putting It All Together: Click Stream + Inventory Mgmt Micro-Service. [Diagram layers: Data Sync, Processing, Ingestion, Collection.]

QUESTIONS & ANSWERS

Thank you