MODERN BIG DATA DESIGN PATTERNS: CASE DRIVEN DESIGNS


1 MODERN BIG DATA DESIGN PATTERNS: CASE DRIVEN DESIGNS. SUJEE MANIYAM, FOUNDER / ELEPHANT SCALE, sujee@elephantscale.com


2 HI, I'M SUJEE MANIYAM
- Founder / Principal @ ElephantScale
- Consulting & Training in Big Data
- Spark / Hadoop / NoSQL / Data Science
- Author: "Hadoop illuminated" (open source book), "HBase Design Patterns"
- Open Source contributor: github.com/sujee
- sujee@elephantscale.com

3 WHO IS THIS TALK FOR? Data Managers / Data Architects / Developers Thinking about Big Data infrastructure

4 A LOOK AT BIG DATA ECO SYSTEM Source : datafloq.com

5 HADOOP ECO SYSTEM Source : hortonworks.com

6 WHAT IS A GOOD DESIGN / ARCHITECTURE? Source : fox.com

7 WORKS ON MY LAPTOP "I just got XYZ working on my laptop in 3 hours! Let's build this!!"

8 WHAT WORKS ON A LAPTOP MAY NOT WORK AT SCALE!

9 AT SCALE NOTHING WORKS AS ADVERTISED

10 BIG DATA DESIGN PATTERNS ARE EMERGING
- We are gaining experience in using Big Data tools
- We hear about other people's experience: conferences, meetups
- Failure stories are still hard to come by :)

11 BIG DATA TECHNOLOGIES : A QUICK LOOK
[Timeline figure]
- 1st Gen (Big Data): batch. Hadoop (2011), Hadoop v2 (2013)
- 2nd Gen (Fast Data): beyond batch / streaming. Spark, NiFi, Flink, Kafka

12 HADOOP IN 30 SECONDS
- The original Big Data platform
- Very well field tested
- Scales to petabytes of data
- Enables analytics at massive scale

13 HADOOP ECO SYSTEM Real Time Batch

14 HADOOP ECOSYSTEM BY FUNCTION
- HDFS: provides distributed storage
- Map Reduce: provides distributed computing
- Pig: high level MapReduce
- Hive: SQL layer over Hadoop
- HBase: NoSQL storage for real-time queries
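To make the Map Reduce entry above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word count. It is plain Python and not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data at scale"])
```

The point of the sketch is the division of labor: the developer writes only the map and reduce functions, while grouping by key is the framework's job.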

16 SPARK IN 30 SECONDS
- Open source cluster computing engine
- Very fast: in-memory ops up to 100x faster than MR; on-disk ops up to 10x faster than MR
- General purpose: MR, SQL, streaming, machine learning, analytics
- Compatible: runs over Hadoop, Mesos, YARN, standalone; works with HDFS, S3, Cassandra, HBase
- Easier to code: word count in 2 lines
- Spark's roots: came out of Berkeley AMP Lab; now a top-level Apache project; version 1.5 released in Sept 2015
- First Big Data platform to integrate batch, streaming and interactive computations in a unified framework
Source : stratio.com

17 SPARK ILLUSTRATED
[Stack diagram]
- Libraries: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing)
- Engine: Spark Core
- Cluster managers: Standalone, YARN, Mesos
- Data storage: S3, HDFS, Cassandra, ???

18 HADOOP VS. SPARK Hadoop Spark


19 SPARK / HADOOP
Hadoop:
- Distributed storage + distributed compute
- MapReduce framework
- Usually data on disk (HDFS)
- Not ideal for iterative work
- Batch process
- Mostly Java
- No unified shell
Spark:
- Distributed compute only
- Generalized computation
- On disk / in memory
- Great at iterative workloads (machine learning, etc.)
  - Up to 10x faster for data on disk
  - Up to 100x faster for data in memory
- Compact code
- Java, Python, Scala supported
- Shell for ad-hoc exploration

20 HADOOP + YARN : OS FOR DISTRIBUTED COMPUTING
[Layer diagram]
- Applications: batch (MapReduce), streaming (Storm, Spark), in-memory (Spark)
- Cluster management: YARN
- Storage: HDFS

21 Use Cases

22 USE CASES
- Batch
  - Use case 1 : ETL / batch query (single silo)
  - Use case 2 : distributed log aggregation
- Batch + real time
  - Use case 3 : real time data store
  - Use case 4 : real time data store + batch analytics
- Real time / streaming
  - Use case 5 : streaming

23 Use case 1 : ETL & Batch Analytics (Single Silo)


24 USE CASE 1 : ETL AND BATCH ANALYTICS @ SCALE
- Data is collected in various databases
- Data is scattered across multiple silos!
- Need a single silo to bring all data together and analyze it

25 USE CASE 1 : CONSIDERATIONS
- Batch analytics is OK
- We will use Hadoop core components
- This is the most common use case

26 USE CASE 1 : DESIGN

27 USE CASE 1 : DESIGN REVIEW
- We are using core Hadoop components
- No vendor lock-in (works on all Hadoop distributions)
- Use HDFS (Hadoop Distributed File System) for storage
- Data ingest with Sqoop
- Processing done by Map Reduce & cousins
- Results are exported back to the DB

28 USE CASE 1 : DESIGN REVIEW
HDFS as single silo
- Great for storing large amounts of data (100s of terabytes to petabytes)
- Content agnostic (text / binary / no schema)
Source : hortonworks

29 USE CASE 1 : DESIGN REVIEW HDFS protects data very well Five-nines to seven-nines of availability

30 USE CASE 1 : DESIGN REVIEW
Moving data between DB and Hadoop: Sqoop, ETL tools
- Sqoop is a tool to interface databases & Hadoop
- Can connect to any JDBC compliant DB (or custom connectors)
- Import from DB → Hadoop
- Export from Hadoop → DB
Tool | Description | Open Source / Premium
Sqoop | Migrates data between DB & Hadoop. Part of most Hadoop eco systems | Open source
Talend | Native Hadoop support | Open source
Informatica | Hadoop support (?) | Premium
Many more

31 USE CASE 1 : DESIGN REVIEW
Processing is batch mode (minutes / hours)
Processing engine | Description | Sample use case
Java Map Reduce (engine : MR) | Native low level API to Map Reduce | Complex data processing (image processing / video encoding, etc.)
Pig (engine : MR) | High level data flow language / engine | ETL work flows
Hive (engine : MR / Tez) | SQL layer on Hadoop | Ad-hoc queries
Spark (engine : Spark + YARN) | Generic programming model | Complex workflows (RDD programming); SQL querying (DataFrames / Spark SQL)

32 SQL ENGINES FOR HADOOP
Engine | Description | Distribution support
Hive | First SQL layer for Hadoop. | All Hadoop distributions
Presto | Developed by Facebook. | All?
Impala | Developed by Cloudera. Focus on low latency queries. Very fast. Open source, but tightly integrated with the Cloudera distribution. | Cloudera
Tez / Stinger | Hortonworks initiative. Provides a new run-time / execution engine for Hive and others. Focus on speed / scale / SQL. Work in progress. | Hortonworks (might work on Cloudera?)
Spark | Can query data on Hive tables / HDFS. Uses Spark as execution engine. Can be very fast (10x). | All

33 USE CASE 1 : ETL WORK FLOW

34 SPARK SQL VS. HIVE Fast on same HDFS data!

35 SPARK SQL VS. HIVE Fast on same data on HDFS

36 USE CASE 1 : DESIGN RECAP
- Hadoop is COMPLEMENTARY to the existing data warehouse, not replacing it
- Hadoop can be a SINGLE SILO
- Facilitates analytics at massive scale
- Lots of choices for each task
  - Data movement : Sqoop / ETL tools
  - Data processing : Map Reduce (Java / Pig / Hive), Spark, ETL tools
  - SQL engines : Hive / Impala / Hive + Tez / Presto / Spark
- Mix & match Hadoop & Spark

37 Use Case 2 : Aggregate Data From Multiple Sources (Near Real Time)

38 USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES
- Data is coming in from multiple sources
- Data is streaming in
- Capture the data in Hadoop
- Do batch analytics

39 USE CASE 2 : FUNCTIONAL SKETCH

40 USE CASE 2 : DESIGN

41 DESIGN 2 : REVIEW
Flume
- Brings in logs from multiple sources
- Distributed, reliable way to collect and move data
- If uplinks are disconnected, Flume agents will store and forward data
HDFS
- Flume can write data directly to HDFS
- Files are segmented or "rolled" by size / time, e.g. Data _ log, Data _ log, Data _ log
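The idea of rolling output files by size or by time can be sketched in a few lines. This is a plain-Python illustration, not Flume's API or configuration: the class name `RollingSegments` and the parameters `max_bytes` and `max_age_seconds` are hypothetical, and segments are kept in memory instead of being written to HDFS.

```python
class RollingSegments:
    """Collects events into segments, rolling by byte size or by age."""

    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.closed = []            # finished segments (lists of events)
        self._current = []          # the segment currently being written
        self._current_bytes = 0
        self._opened_at = None

    def append(self, event, now):
        """Add one event; `now` is a timestamp in seconds."""
        # Roll first if the open segment is too old.
        if self._current and now - self._opened_at >= self.max_age_seconds:
            self._roll()
        if not self._current:
            self._opened_at = now
        self._current.append(event)
        self._current_bytes += len(event.encode("utf-8"))
        # Roll if the open segment has reached the size limit.
        if self._current_bytes >= self.max_bytes:
            self._roll()

    def _roll(self):
        """Close the open segment and start a fresh one."""
        self.closed.append(self._current)
        self._current = []
        self._current_bytes = 0
        self._opened_at = None

segments = RollingSegments(max_bytes=10, max_age_seconds=60)
segments.append("aaaaa", now=0)    # 5 bytes, segment stays open
segments.append("bbbbb", now=1)    # 10 bytes, rolled by size
segments.append("c", now=2)        # opens a new segment
segments.append("d", now=100)      # open segment is 98 s old, rolled by time
```

Rolling matters downstream: batch jobs can safely pick up closed segments while the open one is still being written.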

42 DESIGN 2 : ANALYTICS
Analytics stack : Pig / Hive / Oozie / Spark (same as in Use Case 1)
Oozie
- Work flow manager
- "Run this work flow every 1 hour"
- "Run this work flow when data shows up in the input directory"
- Can manage complex work flows
- Sends alerts when processes fail, etc.

43 DESIGN 2 : REVIEW
How can we process only new data? E.g. logs that came in today
- Option 1) Use timestamped files: log _10-00.log, log _13-00.log, ... log _10-00.log. Use wildcards to load files : log *.log
- Option 2) Hive partitions

44 DESIGN 2 : REVIEW
Hive partitions
- Data is partitioned over a dimension (time)
- Hive only reads data in the selected partitions at query time
- select * from ... where dt = ...
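Both options for picking up only new data can be illustrated with a small Python sketch. The file names, dates and the `dt=...` directory layout below are made up for illustration and are not taken from the slides; the second half only mimics the idea of partition pruning and does not call Hive.

```python
import fnmatch

# Option 1: timestamped file names, selected with a wildcard.
# These file names are hypothetical.
files = [
    "log_2015-09-01_10-00.log",
    "log_2015-09-01_13-00.log",
    "log_2015-09-02_10-00.log",
]
todays_files = fnmatch.filter(files, "log_2015-09-02_*.log")

# Option 2: partitions. Data is laid out in one directory per value of the
# partition column (dt), so a query on dt only touches matching directories.
partitions = {
    "dt=2015-09-01": ["a.log", "b.log"],
    "dt=2015-09-02": ["c.log"],
}

def files_for(dt):
    """Return only the files in the partition for the given date."""
    return partitions.get(f"dt={dt}", [])
```

In both cases the saving comes from never opening the older files at all, instead of reading everything and filtering afterwards.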

45 Use Case 3 : Real Time Store

46 USE CASE 3 : REAL TIME DATA STORE
- Events are coming in
- Need to store the events (can be billions of events)
- And query them in real time, e.g. last 10 events by user

47 USE CASE 3 : DESIGN
- HDFS is not ideal for updating data in real time
- And it is not ideal for random access to data
- We need a scalable real time store → HBase as operational store
- (New storage from Cloudera : Kudu)

48 USE CASE 3 : DESIGN

49 DESIGN 3 : REVIEW
- HBase supports real time updates
- Data comes trickling in (as a stream)
- Saved data becomes queryable immediately
- Use HBase APIs (Java / REST) to build dashboards
- Data can be queried in real time (milliseconds)
- Example: 6 node HBase cluster, 3 billion rows of data, query a single row in 1-20 ms
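A query such as "last 10 events by user" is fast in a sorted key-value store only if the row key is designed for it. The sketch below shows one common idiom, a user id followed by a reversed timestamp, so that a short prefix scan returns the newest events first. This key layout is an assumption added here for illustration and is not described in the slides; `SortedStore` is a tiny in-memory stand-in, not the HBase client API, and `MAX_TS` is an assumed upper bound for millisecond timestamps.

```python
import bisect

MAX_TS = 10**13  # assumed upper bound for millisecond timestamps

def row_key(user_id, ts_millis):
    """Key = user id + reversed timestamp, so newer events sort first."""
    return f"{user_id}#{MAX_TS - ts_millis:013d}"

class SortedStore:
    """Tiny in-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix, limit):
        """Return up to `limit` values whose key starts with `prefix`."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(out) == limit:
                break
            out.append(self._rows[key])
        return out

store = SortedStore()
for ts in range(1, 26):                       # 25 events for one user
    store.put(row_key("alice", ts), f"event-{ts}")
store.put(row_key("bob", 5), "bob-event")

last_10 = store.scan_prefix("alice#", limit=10)
```

Because the scan starts at the user's prefix and stops after ten rows, its cost does not grow with the total number of events stored.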

50 Use Case 4 : Real Time + Batch Analytics

51 USE CASE 4 : REAL TIME + BATCH
- Building on use case 3
- We want to do extensive analysis on the data in HBase
- E.g. scoring user models, flagging credit card transactions

52 USE CASE 4 : DESIGN
- HBase is the real time store
- Analytics is done via the Map Reduce stack (Pig / Hive)
- Can we do them in a single stack? May not be a good idea
- Don't mix real time and batch analytics
- Batch analytics will impede real time performance

53 REAL TIME & BATCH DON'T MIX

54 USE CASE 4 : DESIGN (SEPARATE REAL TIME & BATCH)

55 USE CASE 4 : DESIGN REVIEW How to replicate data? 1 : periodic synchronization of data between clusters 2 : data goes to both clusters at the same time

56 USE CASE 4 : DESIGN REVIEW
How to replicate data between clusters?
- HBase Active Sync
- Data in HDFS can be synchronized using utilities like DistCp
How to import data into both clusters at the same time?
- Build a data pipeline to send data to both
- Use tools like Flume

57 Use Case 5 : Streaming

58 BIG DATA EVOLUTION Decision times : batch ( hours / days) Use cases: Modeling ETL Reporting


59 MOVING TOWARDS FAST DATA
- Decision time : (near) real time, seconds (or milliseconds)
- Use cases: alerts (medical / security), fraud detection
- Streaming is becoming more prevalent: connected devices, Internet of Things
- Beyond batch: we need faster processing / analytics

60 STREAMING ARCHITECTURE (OVERSIMPLIFIED :)

61 STREAMING ARCHITECTURE : DATA BUCKET
"Data bucket"
- Captures incoming data
- Acts as a buffer, smooths out bursts
- So even if our processing is offline, we won't lose data
Data bucket choices
- * Kafka
- MQ (RabbitMQ, etc.)
- Amazon Kinesis

62 KAFKA ARCHITECTURE
- Producers write data to brokers
- Consumers read data from brokers
- All of this is distributed / parallel
- Failure tolerant
- Data is stored as topics, e.g. sensor_data, alerts
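The buffering role of the data bucket can be sketched as an append-only log where each consumer tracks its own read position. This is a single-process illustration in plain Python and not the Kafka client API; `TopicLog`, `produce` and `consume` are hypothetical names, and partitions, replication and retention are ignored.

```python
class TopicLog:
    """Append-only log for one topic; consumers track their own offsets."""

    def __init__(self):
        self._messages = []

    def produce(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def consume(self, offset, max_messages):
        """Read up to max_messages starting at offset; nothing is deleted."""
        batch = self._messages[offset:offset + max_messages]
        return batch, offset + len(batch)

sensor_data = TopicLog()

# A burst arrives while the processing side is offline.
for reading in [21.5, 21.7, 35.0, 21.6]:
    sensor_data.produce(reading)

# When the consumer comes back it resumes from its last known offset
# and works through the backlog at its own pace.
offset = 0
first_batch, offset = sensor_data.consume(offset, max_messages=3)
second_batch, offset = sensor_data.consume(offset, max_messages=3)
```

Since reading does not remove messages, a second consumer could start at offset 0 and see the same data, which is how one stream can feed several processing systems.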

63 STREAMING ARCHITECTURE : PROCESSING ENGINE
- Need to process events with low latency
- So many to choose from!
- Choices: Storm, Spark, NiFi, Flink

64 STREAMING SYSTEMS FEATURE COMPARISON
Feature | Storm | Spark Streaming | Flink | NiFi
Processing model | Event based by default (micro batch using Trident) | Micro batch | Event based + micro batch based | Event based (?)
Windowing operations | Supported by Trident | Yes | Yes | ?
Latency | Milliseconds | Seconds | Milliseconds | Milliseconds
At-least-once | YES | YES | YES | YES
At-most-once | YES | NO | YES | ?
Exactly-once | YES with Trident | YES | YES | ?
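As a rough illustration of what a windowing operation computes, here is a tumbling (fixed, non-overlapping) window count in plain Python. It is not the API of Storm, Spark Streaming, Flink or NiFi; the function name and the event format, `(timestamp_seconds, key)` pairs, are assumptions made for this sketch.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) for fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) pairs.
    """
    counts = Counter()
    for ts, key in events:
        # Every timestamp maps to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "login"), (3, "login"), (9, "error"), (10, "login"), (14, "error")]
per_window = tumbling_window_counts(events, window_seconds=10)
```

A real engine would emit each window's result as soon as the window closes instead of waiting for the whole input, and would also have to decide what to do with late events.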

65 STREAMING ARCHITECTURE DATA STORE Where processed data ends up Need to absorb data in real time Usually a NoSQL storage HBase Cassandra Lots of NoSQL stores

66 DATA STORAGE OPTIONS

67 DATA STORAGE CHOICES
"Forever" storage
- Scalable distributed file systems
- Hadoop! (HDFS actually)
"Real time" store
- Traditional RDBMS won't work
  - Doesn't scale well (or too expensive)
  - Rigid schema layout
- NoSQL!

68 LAMBDA ARCHITECTURE

69 LAMBDA ARCHITECTURE EXPLAINED
1. All new data is sent to both the batch layer and the speed layer
2. Batch layer: holds the master data set (immutable, append-only); answers batch queries
3. Serving layer: updates batch views so they can be queried ad hoc
4. Speed layer: handles new data; facilitates fast / real-time queries
5. Query layer: answers queries using batch & real-time views
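The way the layers combine at query time can be sketched with a simple "events per user" count. This is a toy, in-memory illustration under stated assumptions: the function names and the event format are hypothetical, and in a real deployment the batch view would be produced by a batch job and the real-time view by a streaming engine.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute the view from the full, immutable data set."""
    return Counter(event["user"] for event in master_dataset)

def update_realtime_view(view, event):
    """Speed layer: update incrementally as each new event arrives."""
    view[event["user"]] += 1

def query(user, batch_view, realtime_view):
    """Query layer: combine the batch view with the real-time view."""
    return batch_view[user] + realtime_view[user]

# Events already covered by the last batch run.
master_dataset = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
batch_view = build_batch_view(master_dataset)

# Events that arrived after the last batch run.
realtime_view = Counter()
for event in [{"user": "alice"}, {"user": "carol"}]:
    update_realtime_view(realtime_view, event)

alice_total = query("alice", batch_view, realtime_view)
```

Once the next batch run has absorbed the newer events, the real-time view for that period can be discarded, which keeps the speed layer small.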

70 INCORPORATING LAMBDA ARCHITECTURE

71 ARCHITECTURE REVIEW Each component is scalable Each component is fault tolerant Incorporates best practices All open source!

72 SUMMARY
We looked at a bunch of use cases
- Batch analytics
  - DB → Hadoop
  - Multiple sources → Hadoop
- Real time + batch
  - Real time data store using HBase
  - HBase + batch analytics
- Streaming
  - Real time
Lots of choices!


73 SUMMARY / BEST PRACTICES
- Start small
- Test with a large amount of data as soon as possible
- Iterate / iterate / iterate
- The only benchmark that matters is YOURS!
- Build in lots of metrics collection
  - Host level metrics are readily collected by monitoring systems
  - Application level metrics (most useful) have to be implemented by YOU
  - E.g. a request is taking 2000 ms. Where is the time spent?
- Let loose the chaos monkey
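Application level metrics of the "where is the time spent?" kind can start out very small. The sketch below times named steps of a request with a context manager; the step names, the `timed` helper and the `time.sleep` calls standing in for real work are all illustrative assumptions, and a production system would ship these numbers to a monitoring system instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings_ms = defaultdict(list)   # step name -> list of durations in ms

@contextmanager
def timed(step):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[step].append((time.perf_counter() - start) * 1000.0)

def handle_request():
    with timed("parse"):
        time.sleep(0.01)          # stand-in for real work
    with timed("db_lookup"):
        time.sleep(0.03)          # stand-in for real work
    with timed("render"):
        time.sleep(0.005)         # stand-in for real work

handle_request()
slowest_step = max(timings_ms, key=lambda step: sum(timings_ms[step]))
```

With per-step timings in hand, a slow request stops being a single opaque number and becomes a breakdown that points at the component to investigate.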

74 THANKS AND QUESTIONS?
Sujee Maniyam, Founder / ElephantScale
Expert Consulting + Training in Big Data technologies
sujee@elephantscale.com, Elephantscale.com
Sign up for upcoming trainings : ElephantScale.com/training
Hadoop to Spark : ElephantScale.com/webinars/
