MODERN BIG DATA DESIGN PATTERNS CASE-DRIVEN DESIGNS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com
HI, I'M SUJEE MANIYAM Founder / Principal @ ElephantScale Consulting & Training in Big Data Spark / Hadoop / NoSQL / Data Science Author: "Hadoop Illuminated" (open-source book), "HBase Design Patterns" Open-source contributor: github.com/sujee sujee@elephantscale.com www.elephantscale.com
WHO IS THIS TALK FOR? Data Managers / Data Architects / Developers Thinking about Big Data infrastructure
A LOOK AT THE BIG DATA ECOSYSTEM Source : datafloq.com
HADOOP ECOSYSTEM Source : hortonworks.com
WHAT IS A GOOD DESIGN / ARCHITECTURE? Source : fox.com
WORKS ON MY LAPTOP "I just got XYZ working on my laptop in 3 hours! Let's build this!!"
WHAT WORKS ON A LAPTOP MAY NOT WORK AT SCALE!
AT SCALE NOTHING WORKS AS ADVERTISED
BIG DATA DESIGN PATTERNS ARE EMERGING We are gaining experience in using Big Data tools We hear about other people's experience Conferences Meetups Failure stories are still hard to come by :-)
BIG DATA TECHNOLOGIES : A QUICK LOOK
1st Gen (Big Data) : 2011, Hadoop v1 (batch); 2013, Hadoop v2
2nd Gen (Fast Data) : 2015, beyond batch / streaming: Spark, NiFi, Flink, Kafka
HADOOP IN 30 SECONDS The original Big Data platform Very well field-tested Scales to petabytes of data Enables analytics at massive scale
HADOOP ECOSYSTEM Real Time Batch
HADOOP ECOSYSTEM BY FUNCTION HDFS provides distributed storage Map Reduce Pig Provides distributed computing High level MapReduce Hive SQL layer over Hadoop HBase NoSQL storage for real-time queries
SPARK IN 30 SECONDS Open-source cluster computing engine Very fast: in-memory ops up to 100x faster than MR, on-disk ops up to 10x faster than MR General purpose: MR, SQL, streaming, machine learning, analytics Compatible: runs over Hadoop, Mesos, YARN, or standalone Works with HDFS, S3, Cassandra, HBase, and more Easier to code: word count in 2 lines Spark's roots: came out of the Berkeley AMP Lab Now a top-level Apache project Version 1.5 released in Sept 2015 "First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" (stratio.com)
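The "word count in 2 lines" claim refers to Spark's RDD API (roughly: read the file, flatMap each line into words, map each word to a (word, 1) pair, then reduceByKey with addition). As a rough sketch of that same flatMap / map / reduceByKey shape, without assuming a Spark install, here is a plain-Python analogue:

```python
from collections import defaultdict

def word_count(lines):
    """Mimic Spark's flatMap -> map -> reduceByKey word count locally."""
    # flatMap: split every line into words
    words = (w for line in lines for w in line.split())
    # map + reduceByKey: pair each word with 1, then sum counts per key
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)

print(word_count(["to be or not to be"]))
```

In actual Spark the same chain runs in parallel across the cluster; the local version only illustrates the shape of the computation.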
SPARK ILLUSTRATED Libraries: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing), all on Spark Core Cluster managers: standalone, YARN, Mesos Data storage: HDFS, S3, Cassandra, ...
HADOOP VS. SPARK Hadoop Spark
SPARK / HADOOP
Hadoop : distributed storage + distributed compute; MapReduce framework; usually data on disk (HDFS); not ideal for iterative work; batch processing; mostly Java; no unified shell
Spark : distributed compute only; generalized computation; on disk / in memory; great at iterative workloads (machine learning, etc.): up to 10x faster for data on disk, up to 100x faster for data in memory; compact code; Java, Python, Scala supported; shell for ad-hoc exploration
HADOOP + YARN : OS FOR DISTRIBUTED COMPUTING Batch (mapreduce) Streaming (storm, spark) In-memory (spark) Applications YARN HDFS Cluster Management Storage
Use Cases
USE CASES Batch Use case 1 : ETL / Batch query (Single Silo) Use case 2 : distributed log aggregation Batch + real time Use case 3 : real time data store Use case 4 : real time data store + batch analytics Real time / Streaming Use case 5 : Streaming
Use case 1 : ETL & Batch Analytics (Single Silo)
USE CASE 1 : ETL AND BATCH ANALYTICS @ SCALE Data collected in various databases Data is scattered across multiple silos! Need a single silo to bring all the data together and analyze it
USE CASE 1 : CONSIDERATIONS Batch analytics is OK We will use core Hadoop components This is the most common use case
USE CASE 1 : DESIGN
USE CASE 1 : DESIGN REVIEW We are using core Hadoop components No vendor lock in (works on all Hadoop distributions) Use HDFS (Hadoop File System) for storage Data Ingest with Sqoop Processing done by Map Reduce & Cousins Results are exported back to DB
USE CASE 1 : DESIGN REVIEW HDFS as Single Silo Great for storing large amounts of data (100s of terabytes to petabytes) Content agnostic (text / binary / no schema) Source : hortonworks
USE CASE 1 : DESIGN REVIEW HDFS protects data very well Five-nines to seven-nines of data durability
USE CASE 1 : DESIGN REVIEW Moving data between DB and Hadoop : Sqoop / ETL tools
Sqoop is a tool to interface databases & Hadoop Can connect to any JDBC-compliant DB (or use custom connectors) Import from DB → Hadoop Export from Hadoop → DB

Tool        | Description                        | Open Source / Premium
Sqoop       | Migrates data between DB & Hadoop  | OS; part of most Hadoop ecosystems
Talend      | Native Hadoop support              | OS
Informatica | Hadoop support (?)                 | Premium
Many more   |                                    |
USE CASE 1 : DESIGN REVIEW Processing is batch mode (minutes / hours)

Processing Engine             | Description                            | Sample Use Case
Java MapReduce (engine: MR)   | Native low-level API to MapReduce      | Complex data processing (image processing / video encoding, etc.)
Pig (engine: MR)              | High-level data flow language / engine | ETL workflows
Hive (engine: MR / Tez)       | SQL layer on Hadoop                    | Ad-hoc queries
Spark (engine: Spark + YARN)  | Generic programming model              | Complex workflows (RDD programming), SQL querying (DataFrames / Spark SQL)
SQL ENGINES FOR HADOOP

Engine        | Description                                                                                                  | Distribution Support
Hive          | First SQL layer for Hadoop                                                                                   | All Hadoop distributions
Presto        | Developed by Facebook                                                                                        | All?
Impala        | Developed by Cloudera. Focus on low-latency queries. Very fast. Open source, but tightly integrated with the Cloudera distribution | Cloudera
Tez / Stinger | Hortonworks initiative. Provides a new run-time / execution engine for Hive and others. Focus on speed / scale / SQL. Work in progress | Hortonworks (might work on Cloudera?)
Spark         | Can query data in Hive tables / HDFS. Uses Spark as execution engine. Can be very fast (10x)                 | All
USE CASE 1 : ETL WORK FLOW
SPARK SQL VS. HIVE Fast on same HDFS data!
USE CASE 1 : DESIGN RECAP Hadoop is COMPLEMENTARY to existing data warehouses, not a replacement Hadoop can be a SINGLE SILO Facilitates analytics at massive scale Lots of choices for each task Data movement : Sqoop / ETL tools Data processing : MapReduce (Java / Pig / Hive), Spark, ETL tools SQL engines : Hive / Impala / Hive + Tez / Presto / Spark Mix & match Hadoop & Spark
Use Case 2 : Aggregate Data From Multiple Sources (Near Real Time)
USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES Data coming in from multiple sources. Data is streaming in Capture data in Hadoop Do batch analytics
USE CASE 2 : FUNCTIONAL SKETCH
USE CASE 2 : DESIGN
DESIGN 2 : REVIEW Flume : brings in logs from multiple sources Distributed, reliable way to collect and move data If uplinks are disconnected, Flume agents will store and forward data HDFS : Flume can write data directly to HDFS Files are segmented or rolled by size / time, e.g. Data-2015-01-01_10-00-00.log Data-2015-01-01_11-00-00.log Data-2015-01-01_12-00-00.log
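A minimal sketch of the time-based file rolling described above. The `rolled_filename` helper and the `Data-` prefix are illustrative only; real Flume HDFS sinks make the rolling interval and name pattern configurable:

```python
from datetime import datetime, timedelta

def rolled_filename(ts, prefix="Data"):
    """Name a log segment after the hour it covers (hypothetical naming,
    mirroring the hourly-rolled files shown on the slide)."""
    return ts.strftime(f"{prefix}-%Y-%m-%d_%H-00-00.log")

# Three consecutive hourly segments starting 2015-01-01 10:00
start = datetime(2015, 1, 1, 10)
files = [rolled_filename(start + timedelta(hours=h)) for h in range(3)]
print(files)
```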
DESIGN 2 : ANALYTICS Analytics stack : Pig / Hive / Oozie / Spark (same as in Use Case 1) Oozie : workflow manager "run this workflow every 1 hour" "run this workflow when data shows up in the input directory" Can manage complex workflows Sends alerts when processes fail, etc.
DESIGN 2 : REVIEW How can we process only the new data? E.g. logs that came in today Option 1) Use timestamped files log-2015-01-01_10-00.log log-2015-01-01_13-00.log ... log-2015-01-02_10-00.log Use wildcards to load files : log-2015-01-01*.log Option 2) Hive partitions
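Option 1 can be sketched with Python's `fnmatch`, standing in for the wildcard loading that Pig / Hive perform on HDFS paths (the filenames are the ones from the slide):

```python
import fnmatch

files = [
    "log-2015-01-01_10-00.log",
    "log-2015-01-01_13-00.log",
    "log-2015-01-02_10-00.log",
]
# Select only the files for 2015-01-01, the same way a wildcard
# load path like log-2015-01-01*.log would on HDFS
todays = fnmatch.filter(files, "log-2015-01-01*.log")
print(todays)
```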
DESIGN 2 : REVIEW Hive partitions Data is partitioned over a dimension (time) Hive only scans data in the selected partitions at query time : SELECT * FROM <table> WHERE dt = '2015-01-01'
Use Case 3 : Real Time Store
USE CASE 3 : REAL TIME DATA STORE Events are coming in Need to store the events Can be billions of events And query them in real time e.g. last 10 events by user
USE CASE 3 : DESIGN HDFS is not ideal for updating data in real time And it is not ideal for random access We need a scalable real-time store → HBASE as operational store (new storage engine from Cloudera : Kudu)
USE CASE 3 : DESIGN
DESIGN 3 : REVIEW HBase supports real-time updates Data comes trickling in (as a stream) Saved data becomes queryable immediately Use HBase APIs (Java / REST) to build dashboards Data can be queried in real time (milliseconds) E.g. on a 6-node HBase cluster with 3 billion rows of data, querying a single row takes 1-20 ms
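One common way HBase serves queries like "last 10 events by user" quickly is row-key design: prefix the key with the user id and append a reversed timestamp, so a user's newest rows sort first and the query becomes a short forward scan. This is a hypothetical sketch in plain Python, not the HBase API; `MAX_TS` and the key format are assumptions:

```python
MAX_TS = 10**13  # assumed upper bound for millisecond timestamps

def row_key(user_id, ts_ms):
    """user_id + reversed timestamp: rows for one user sort newest-first,
    so 'last N events' is a short scan from the user's key prefix."""
    return f"{user_id}|{MAX_TS - ts_ms:013d}"

# Three events for user u1; HBase stores rows sorted by key
events = [("u1", 1000), ("u1", 3000), ("u1", 2000)]
keys = sorted(row_key(u, t) for u, t in events)
print(keys[0])  # first key in scan order belongs to the newest event (ts=3000)
```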
Use Case 4 : Real Time + Batch Analytics
USE CASE 4 : REAL TIME + BATCH Building on use case 3 We want to do extensive analysis on data on HBase E.g. : scoring user models flagging credit card transactions
USE CASE 4 : DESIGN HBase is the real-time store Analytics is done via the MapReduce stack (Pig / Hive) Can we do them in a single stack? May not be a good idea Don't mix real-time and batch analytics : batch analytics will impede real-time performance
REAL TIME & BATCH DON'T MIX
USE CASE 4 : DESIGN (SEPARATE REAL TIME & BATCH)
USE CASE 4 : DESIGN REVIEW How to replicate data? 1 : periodic synchronization of data between clusters 2 : data goes to both clusters at the same time
USE CASE 4 : DESIGN REVIEW How to replicate data between clusters? HBase active sync Data in HDFS can be synchronized using utilities like DistCp How to import data into both clusters at the same time? Build a data pipeline to send data to both Use tools like Flume
Use Case 5 : Streaming
BIG DATA EVOLUTION Decision times : batch (hours / days) Use cases : modeling, ETL, reporting
MOVING TOWARDS FAST DATA Decision time : (near) real time, seconds (or milliseconds) Use cases : alerts (medical / security), fraud detection Streaming is becoming more prevalent : connected devices, Internet of Things Beyond batch : we need faster processing / analytics
STREAMING ARCHITECTURE (OVERSIMPLIFIED :-))
STREAMING ARCHITECTURE : DATA BUCKET The data bucket captures incoming data Acts as a buffer, smooths out bursts So even if our processing is offline, we won't lose data Data bucket choices : Kafka, MQs (RabbitMQ, etc.), Amazon Kinesis
KAFKA ARCHITECTURE Producers write data to brokers Consumers read data from brokers All of this is distributed / parallel Failure tolerant Data is stored as topics sensor_data alerts emails
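The producer / broker / consumer flow above can be illustrated with a toy in-memory model (this is a sketch of the concepts, not the Kafka API; `ToyBroker` and its methods are hypothetical):

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic;
    each consumer tracks its own read offset (toy model, not Kafka)."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, msg):
        # Producers append messages to a named topic
        self.topics[topic].append(msg)

    def consume(self, topic, offset):
        # Consumers read from their last offset; return messages + next offset
        log = self.topics[topic]
        return log[offset:], len(log)

broker = ToyBroker()
broker.produce("sensor_data", {"temp": 21})
broker.produce("sensor_data", {"temp": 22})
msgs, offset = broker.consume("sensor_data", 0)
print(len(msgs), offset)
```

In real Kafka the log is partitioned and replicated across brokers, which is what makes it distributed, parallel, and failure tolerant.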
STREAMING ARCHITECTURE PROCESSING ENGINE Need to process events with low latency So many to choose from! Choices Storm Spark NiFi Flink
STREAMING SYSTEMS FEATURE COMPARISON

Feature              | Storm                                            | Spark Streaming | Flink                     | NiFi
Processing model     | Event-based by default (micro-batch via Trident) | Micro-batch     | Event-based + micro-batch | Event-based (?)
Windowing operations | Supported via Trident                            | Yes             | Yes                       | ?
Latency              | Milliseconds                                     | Seconds         | Milliseconds              | Milliseconds
At-least-once        | YES                                              | YES             | YES                       | YES
At-most-once         | YES                                              | NO              | YES                       | ?
Exactly-once         | YES, with Trident                                | YES             | YES                       | ?
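A practical note on the guarantees above: at-least-once delivery means duplicates can arrive, and a common mitigation is to make the processing idempotent so the end result matches exactly-once semantics. A sketch under that assumption (the event ids and handler are hypothetical):

```python
seen = set()
total = 0

def process(event_id, value):
    """Idempotent handler: re-delivered events (at-least-once delivery)
    are detected by id and skipped, so the result matches exactly-once."""
    global total
    if event_id in seen:
        return  # duplicate delivery, already counted
    seen.add(event_id)
    total += value

# Simulate at-least-once delivery: event "a" arrives twice
for eid, v in [("a", 5), ("b", 7), ("a", 5)]:
    process(eid, v)
print(total)
```

In production the `seen` set would live in durable storage (or be replaced by deterministic upserts keyed on the event id), but the principle is the same.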
STREAMING ARCHITECTURE : DATA STORE Where processed data ends up Needs to absorb data in real time Usually a NoSQL store HBase Cassandra Lots of NoSQL stores
DATA STORAGE OPTIONS
DATA STORAGE CHOICES "Forever" storage : scalable distributed file systems : Hadoop! (HDFS, actually) Real-time store : traditional RDBMS won't work (don't scale well, or too expensive; rigid schema layout) : NoSQL!
LAMBDA ARCHITECTURE
LAMBDA ARCHITECTURE EXPLAINED 1. All new data is sent to both the batch layer and the speed layer 2. Batch layer Holds the master data set (immutable, append-only) Answers batch queries 3. Serving layer Updates batch views so they can be queried ad hoc 4. Speed layer Handles new data Facilitates fast / real-time queries 5. Query layer Answers queries using batch & real-time views
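The query layer's merge of batch and real-time views can be sketched as follows (a toy model; the view contents and the `query` helper are illustrative only):

```python
# Toy Lambda architecture: a precomputed (slightly stale) batch view plus
# a speed view covering events since the last batch run, merged at query time.
batch_view = {"clicks": 100}  # produced by the batch layer, e.g. hourly
speed_view = {"clicks": 3}    # events that arrived since the last batch run

def query(metric):
    """Query layer: answer using batch + real-time views combined."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))
```

When the next batch run completes, its view absorbs the recent events and the corresponding speed-view entries are discarded.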
INCORPORATING LAMBDA ARCHITECTURE
ARCHITECTURE REVIEW Each component is scalable Each component is fault tolerant Incorporates best practices All open source!
SUMMARY We looked at a bunch of use cases Batch analytics DB → Hadoop Multiple sources → Hadoop Real time + batch Real-time data store using HBase HBase + batch analytics Streaming Real time Lots of choices!
SUMMARY / BEST PRACTICES Start small Test with large amounts of data as soon as possible Iterate / iterate / iterate The only benchmark that matters is YOURS! Build in lots of metrics collection Host-level metrics are readily collected by monitoring systems Application-level metrics (the most useful) have to be implemented by YOU e.g. "This request is taking 2000 ms... where is the time spent?" Let loose the chaos monkey
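Application-level metrics like "where did the 2000 ms go?" can be collected with a small per-phase timing helper. One possible sketch (the phase names and `timed` helper are hypothetical):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(phase):
    """Record wall-clock milliseconds per phase, so a slow request
    can be broken down into its components."""
    start = time.perf_counter()
    yield
    timings[phase] = (time.perf_counter() - start) * 1000  # ms

# Instrument the phases of a (simulated) request
with timed("db_lookup"):
    time.sleep(0.01)
with timed("render"):
    time.sleep(0.005)
print(sorted(timings))
```

Shipping these per-phase numbers to your monitoring system is what turns "the request is slow" into an actionable answer.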
THANKS AND QUESTIONS? Sujee Maniyam Founder / Principal @ ElephantScale Expert consulting + training in Big Data technologies sujee@elephantscale.com ElephantScale.com Sign up for upcoming trainings : ElephantScale.com/training Hadoop to Spark webinar @ ElephantScale.com/webinars/