Fluentd + MongoDB + Spark = Awesome Sauce. Nishant Sahay, Sr. Architect, Wipro Limited; Bhavani Ananth, Tech Manager, Wipro Limited
Wipro Open Source Practice: Vision & Mission. Vision: Wipro will be the world leader in solving customer problems through the use of innovative and practical open source solutions. We will be a steward of every open source community in which we engage, and always act with sensitivity and integrity. Mission: Wipro's Open Source mission is to be the guide and partner to companies seeking to leverage the strategic, financial, organizational and technological benefits of open source software and methods. Wipro will anticipate and solve customers' needs through a commitment to research, and by taking a balanced approach to legacy and innovative technologies. Wipro's comprehensive suite of strategic and technology services will be delivered with passion and precision.
Wipro Open Source Practice Offerings: Advisory; Enterprise-wide adoption strategies; Best fit analysis & recommendation; Business Case Advisory; Governance; Technical Consulting; Productized Services; Legacy Migration Services; Greenfield Development; Open Source Stack Setup; Open App; Cross Industry Solutions and Process Stacks; Support; Application and Infrastructure; DevOps; Architecture, Development; Open Source Community
Connected Warehouse Platform CSC SCP Warehouse Mobility & Dashboards Carrier Vendor Facility Inventory & Operations Orders Alerts & Notification Warehouse KPIs Performance Tracker Equipment Monitor Dashboards Master Data Connected Warehouse Platform Transaction Data Webservices Integration Mapping FTP (Flat file/xml) Subscriber Queues Automation Enabler Publisher Queues Sales Orders [Real-Time] Route Plan / Carrier Tracking Almost Real-Time Associate Performance PUT/PICK Status Purchase Orders Master Data [Scheduled] OMS TMS WMS LMS WCS IOT ERP/HOST Direct to Customer Warehouses Equipment Retailer Supplier
The Awesome Sauce ANALYTICS & PREDICTION
Clickstream Analytics User Behavior Analysis Product Affinity Website Resource Allocation Prediction & recommendation
PREDICTION & RECOMMENDATION Prediction Using Machine Learning Content Recommendation Conversion Prediction Visitor Segmentation Demand Forecasting
Sauce Raw Material LOGS
Logs, Logs Everywhere! SysLog, Clickstream Data, Social Media Feeds, Packet Data, Sensor Data, CDR, Device Logs, Custom App (C, Ruby, Python), Payment Data, Application Server Logs, Web Access Logs, Database Logs
What can be done with logs? Real time monitoring Root cause analysis Anomaly Detection and Predictive Monitoring Debugging Troubleshooting/Support
Challenges with Log Analytics No standard log formats Multiple logging frameworks Logs highly decentralized Limited real time visualization capability Scalability Issues Normalizing and correlating logs from disparate sources
What can be done with logs from a business PoV? Input data for analytics; user interactions and behavior; end-user experience improvements
Awesome Foursome The Ingredients
The Ingredients FLUENTD
Why Fluentd Unified Logging Simple and Flexible Proven Minimal Resources Reliable Open Source Community
Fluentd Plugin Architecture: Input (udp, tcp, http, tail); Parser (regexp, apache2); Filter (grep, enrich, delete, mask); Output (out_mongo) with Buffer and Format
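The input, parser, filter, and format stages listed above can be sketched in plain Python. This is not Fluentd code or configuration: the regular expression is a simplified stand-in for an Apache access-log parser, and the function names, the "server errors only" rule, and the masked field are assumptions chosen for illustration.

```python
import json
import re

# Simplified Apache access-log pattern (an assumption, not Fluentd's apache2 parser).
APACHE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
)


def parse(line):
    """Parser stage: turn a raw line into a record dict, or None if it does not match."""
    match = APACHE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["code"] = int(record["code"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def keep(record):
    """Filter stage (grep-like): keep only server errors. The rule is hypothetical."""
    return record["code"] >= 500


def mask(record):
    """Filter stage (mask): hide the last octet of the client address."""
    masked = dict(record)
    masked["host"] = re.sub(r"\.\d+$", ".xxx", record["host"])
    return masked


def run_pipeline(lines):
    """Input -> parser -> filters -> formatter: yield one JSON string per kept record."""
    for line in lines:
        record = parse(line)
        if record is not None and keep(record):
            yield json.dumps(mask(record), sort_keys=True)
```

Each stage is a small function over one record, which mirrors why a plugin chain is easy to extend: adding an enrichment step means adding one more function between the parser and the formatter.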
HA Fluentd topology: at-most-once and at-least-once transfers. Log forwarders (Fluentd on Node1, Node2 and Node3, each reading a log file) PUSH to log aggregators (Fluentd active, Fluentd backup), which PUSH to the destinations (MongoDB, Amazon S3)
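The difference between at-most-once and at-least-once transfer can be illustrated with a small simulation. This is a sketch under stated assumptions, not Fluentd's forwarding protocol: `ScriptedSink` is a hypothetical destination whose failures are scripted per write attempt, and the retry limit is arbitrary.

```python
class ScriptedSink:
    """Hypothetical destination; each write attempt follows a scripted outcome."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)  # "ok", "drop", or "ack_lost" per attempt
        self.stored = []

    def write(self, event):
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "drop":
            raise ConnectionError("event never reached the sink")
        self.stored.append(event)
        if outcome == "ack_lost":
            raise ConnectionError("event stored, but the acknowledgement was lost")


def send_at_most_once(sink, events):
    """No retry: an event may be lost, but it is never duplicated."""
    for event in events:
        try:
            sink.write(event)
        except ConnectionError:
            pass


def send_at_least_once(sink, events, max_retries=5):
    """Retry until acknowledged: an event may be duplicated, but it is not lost."""
    for event in events:
        for _ in range(max_retries):
            try:
                sink.write(event)
                break  # acknowledged, move on to the next event
            except ConnectionError:
                continue
        else:
            raise RuntimeError("gave up after max_retries attempts")
```

With at-least-once delivery the destination has to tolerate duplicates, for example by writing with a unique event identifier.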
Fluentd Failure Scenarios Forwarder goes down Aggregator goes down
The Ingredients KAFKA
Kafka: a distributed streaming platform. Publish and subscribe to streams of records; store streams of records in a fault-tolerant way; process streams of records. [Diagram: producer apps, consumer apps, connectors to DBs, and stream processors arranged around a Kafka cluster]
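Kafka Terms: Topic, Partition, Producer, Consumer, Broker, Consumer Group. [Diagram: a producer writes to topics split into Partition-1, Partition-2 and Partition-3, each an ordered sequence of offsets (0, 1, 2, ...); brokers are labelled p1, p2, p3 and R1, R2, R3; consumer groups contain consumers C1 and C2]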
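How keys map to partitions, and how partitions are shared inside one consumer group, can be sketched as follows. Both functions are illustrative assumptions: CRC32 stands in for the hash (Kafka's default partitioner uses a different one), and the round-robin assignment is only one possible strategy.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition; the same key always lands on the same partition.

    CRC32 is a stand-in hash chosen for this sketch, not the one Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions round-robin over the consumers of one group.

    Each partition gets exactly one owner within the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

Because a key always maps to the same partition, and a partition has a single owner per group, records for one key are processed in order by one consumer.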
Why Kafka: an ideal unified platform for handling real-time data feeds. High throughput to support high-volume event streams such as log aggregation. Deals well with high-volume data loads from offline systems. Fault tolerant and scalable. Delivers the low latency expected of traditional messaging systems
Kafka decouples data pipelines [diagram: multiple producers publish to the Kafka broker; multiple consumers read from it]
Kafka Guarantees: messages sent by a producer to a given topic partition are appended in the order they are sent. A consumer instance sees messages in the order they are stored in the log. A topic with replication factor N can tolerate up to N-1 broker failures without losing committed messages
Kafka Replication [diagram: producers write logs to partition leaders; Topic1-part1 and Topic1-part2 each have one leader and several follower replicas spread across Broker1, Broker2, Broker3 and Broker4]
ZooKeeper: ZooKeeper enables highly reliable distributed coordination. Kafka bundles a single-node ZooKeeper instance. Metadata includes broker addresses and message offsets. [Diagram: producers and consumers exchange metadata with ZooKeeper and messages with the Kafka cluster]
Kafka Persistence - File System: sequential file I/O is very fast. Kafka uses the OS page cache for data storage. Batching of messages speeds up disk operations, network transfers and in-memory iterations. http://deliveryimages.acm.org/10.1145/1570000/1563874/jacobs3.jpg
Batch Processing: one of the big drivers for efficiency. Producers accumulate data in memory and send larger batches in a single request. Cap the size of a batch, in bytes - batch.size. Wait no longer than a fixed latency bound - linger.ms. Trade off a small amount of latency for better throughput
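The interplay between a size cap and a latency bound can be sketched with a small batcher. This is an illustration of the idea, not the Kafka producer: the class, its method names, and the numeric values are assumptions, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class Batcher:
    """Accumulate records until a byte budget or a linger deadline is reached."""

    def __init__(self, send, batch_size=16384, linger_ms=5):
        self.send = send              # callable that receives a list of records
        self.batch_size = batch_size  # byte budget per batch (illustrative value)
        self.linger_ms = linger_ms    # longest a partial batch may wait (illustrative value)
        self.buffer = []
        self.buffered_bytes = 0
        self.first_append_ms = None

    def append(self, record, now_ms):
        # Flush first if this record would push the batch over its byte budget.
        if self.buffer and self.buffered_bytes + len(record) > self.batch_size:
            self.flush()
        if not self.buffer:
            self.first_append_ms = now_ms
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.batch_size:
            self.flush()

    def poll(self, now_ms):
        """Flush a partial batch once it has waited at least linger_ms."""
        if self.buffer and now_ms - self.first_append_ms >= self.linger_ms:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer = []
            self.buffered_bytes = 0
            self.first_append_ms = None
```

A larger byte budget or a longer linger time produces fewer, bigger requests at the cost of added delay for the first record in each batch.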
Log Compaction: per-record retention, rather than the coarser-grained time-based retention. Kafka retains at least the latest value for each message key within a topic partition
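Per-key retention can be shown with a short function that compacts an in-memory log. It is a simplified sketch: the tuple layout and the example keys are assumptions, and delete markers are dropped immediately here, whereas Kafka keeps them for a configurable period.

```python
def compact(log):
    """Keep only the latest record per key, in offset order.

    `log` is a list of (offset, key, value) tuples. A value of None is treated
    as a delete marker (tombstone) for that key.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.values(), key=lambda record: record[0])
    # Simplification: tombstones are removed right away in this sketch.
    return [record for record in survivors if record[2] is not None]


# Hypothetical inventory updates keyed by SKU.
example_log = [
    (0, "sku-1", 5),
    (1, "sku-2", 7),
    (2, "sku-1", 3),
    (3, "sku-3", 1),
    (4, "sku-3", None),
]
compacted = compact(example_log)
```

Offsets of surviving records are unchanged, so consumers that stored an offset can still resume from it after compaction.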
Fluentd Kafka Integration: Fluentd kafka plugin and Kafka Fluentd Consumer. Log forwarders (Fluentd) PUSH to the Kafka clusters; consumers (Fluentd, alongside the rest of the Kafka ecosystem) PULL from Kafka and PUSH to the destinations (MongoDB, Amazon S3)
Advantages of Fluentd-Kafka: backpressure handling (pull versus push); reliable, flexible data pipeline
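The pull-versus-push point can be illustrated with two small models. Both are sketches under assumptions: the push model uses a bounded in-memory buffer that rejects events when full, and the pull model is a hypothetical consumer that reads from a retained log at its own pace by tracking an offset.

```python
import queue


def push_delivery(events, buffer_size):
    """Push model: the sender sets the pace; a full receiver buffer rejects events."""
    inbox = queue.Queue(maxsize=buffer_size)
    rejected = []
    for event in events:
        try:
            inbox.put_nowait(event)
        except queue.Full:
            rejected.append(event)
    return inbox.qsize(), rejected


class PullConsumer:
    """Pull model: the consumer fetches from a retained log when it is ready."""

    def __init__(self, log, max_records):
        self.log = log                  # list standing in for a retained partition log
        self.max_records = max_records  # how much the consumer takes per fetch
        self.offset = 0

    def poll(self):
        batch = self.log[self.offset:self.offset + self.max_records]
        self.offset += len(batch)
        return batch
```

In the pull model a slow consumer only falls behind; nothing is rejected, because the log keeps the records until the consumer asks for them.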
Connected Warehouse Kafka Cluster Architecture [diagram: the Fluentd-Kafka plugin feeds a Kafka cluster spanning Data Center 1 (active) and Data Center 2 (active); Kafka Broker 1 holds Topic 1, Partition 0..n, with ZK 1 (leader); Kafka Broker 2 holds Topic 1, Partition n+1, n+n, with ZK 2 (follower); ZK 1 and ZK 2 form the ZooKeeper ensemble]
The Ingredients MONGODB
Why MongoDB: cross-platform, document-oriented NoSQL database. Simple and flexible data model. Field-level indexing. Built-in query capabilities. High performance
System Architecture With Shards [diagram: data sources connect through three mongos routers, backed by a config server, to the shards; the diagram shows five primaries and ten secondaries]
MongoDB For Analytics: denormalization through embedded documents. Connectors for almost every kind of data source. Aggregation framework. Text search queries. Range queries and key-value queries
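A small example may help connect embedded documents to the aggregation framework. The documents, field names, and the question being asked (views per product category) are hypothetical. The `pipeline` list uses the standard `$match`, `$group` and `$sort` stages and is shown only as data; the helper below it is a pure-Python equivalent so the expected result can be checked without a database server.

```python
from collections import Counter

# Hypothetical clickstream documents with an embedded (denormalized) product.
clicks = [
    {"user": "u1", "action": "view", "product": {"sku": "A", "category": "shoes"}},
    {"user": "u2", "action": "view", "product": {"sku": "B", "category": "shoes"}},
    {"user": "u1", "action": "purchase", "product": {"sku": "A", "category": "shoes"}},
    {"user": "u3", "action": "view", "product": {"sku": "C", "category": "bags"}},
]

# Aggregation pipeline as it could be handed to a driver's aggregate call.
pipeline = [
    {"$match": {"action": "view"}},
    {"$group": {"_id": "$product.category", "views": {"$sum": 1}}},
    {"$sort": {"views": -1}},
]


def views_per_category(documents):
    """Pure-Python equivalent of the pipeline above: count views per category, highest first."""
    counts = Counter(
        doc["product"]["category"] for doc in documents if doc["action"] == "view"
    )
    return [
        {"_id": category, "views": views}
        for category, views in counts.most_common()
    ]


result = views_per_category(clicks)
```

Because the product is embedded in each click document, the grouping needs no join; the dotted path `$product.category` reaches straight into the sub-document.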
The Ingredients SPARK
Spark Logical Architecture: language APIs (Scala, Java, Python, R); libraries (Spark SQL, Spark Streaming, MLlib, GraphX) on top of Apache Spark; Spark MongoDB Connector
Putting It All Together Click Stream + Inventory Mgmt Micro-Service Data Sync Processing Ingestion Collection
QUESTIONS & ANSWERS
Thank you
www.modsummit.com www.developersummit.com