MODERN BIG DATA DESIGN PATTERNS: CASE-DRIVEN DESIGNS

SUJEE MANIYAM, FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com

HI, I'M SUJEE MANIYAM Founder / Principal @ ElephantScale. Consulting & training in Big Data: Spark / Hadoop / NoSQL / Data Science. Author: Hadoop Illuminated (open source book), HBase Design Patterns. Open source contributor: github.com/sujee sujee@elephantscale.com www.elephantscale.com

WHO IS THIS TALK FOR? Data Managers / Data Architects / Developers Thinking about Big Data infrastructure

A LOOK AT THE BIG DATA ECOSYSTEM Source: datafloq.com

HADOOP ECOSYSTEM Source: hortonworks.com

WHAT IS A GOOD DESIGN / ARCHITECTURE? Source : fox.com

WORKS ON MY LAPTOP I just got XYZ working on my laptop in 3 hours! Let's build this!!

WHAT WORKS ON A LAPTOP MAY NOT WORK AT SCALE!

AT SCALE NOTHING WORKS AS ADVERTISED

BIG DATA DESIGN PATTERNS ARE EMERGING We are gaining experience in using Big Data tools We hear about other people's experiences at conferences and meetups Failure stories are still hard to come by :-)

BIG DATA TECHNOLOGIES: A QUICK LOOK 2011: Hadoop v1, batch processing (1st Gen: Big Data). 2013: Hadoop v2. 2015: beyond batch / streaming: Spark, NiFi, Flink, Kafka (2nd Gen: Fast Data)

HADOOP IN 30 SECONDS The original Big Data platform. Very well field-tested. Scales to petabytes of data. Enables analytics at massive scale

HADOOP ECOSYSTEM Real Time / Batch

HADOOP ECOSYSTEM BY FUNCTION HDFS: provides distributed storage. MapReduce: provides distributed computing. Pig: high-level MapReduce. Hive: SQL layer over Hadoop. HBase: NoSQL storage for real-time queries

SPARK IN 30 SECONDS Open source cluster computing engine. Very fast: in-memory ops up to 100x faster than MR, on-disk ops up to 10x faster than MR. General purpose: MR-style processing, SQL, streaming, machine learning, analytics. Compatible: runs over Hadoop, Mesos, YARN, or standalone; works with HDFS, S3, Cassandra, HBase, ... Easier to code: word count in 2 lines. Spark's roots: came out of the Berkeley AMP Lab; now a top-level Apache project; version 1.5 released in Sept 2015. The first Big Data platform to integrate batch, streaming, and interactive computations in a unified framework. Source: stratio.com
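The "word count in 2 lines" claim refers to Spark's RDD API (roughly sc.textFile(f).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)). As a cluster-free illustration of the same map/reduce pattern, a plain-Python sketch (the word_count helper is hypothetical, not from the talk):

```python
from collections import Counter
from itertools import chain

def word_count(lines):
    # "map" step: split each line into words;
    # "reduce" step: Counter sums the occurrences per word
    return Counter(chain.from_iterable(line.split() for line in lines))

print(word_count(["to be or", "not to be"]))  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```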

SPARK ILLUSTRATED Components: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing), all on Spark Core. Cluster managers: Standalone, YARN, Mesos. Data storage: HDFS, S3, Cassandra, ...

HADOOP VS. SPARK Hadoop Spark

SPARK / HADOOP Hadoop: distributed storage + distributed compute; MapReduce framework; data usually on disk (HDFS); not ideal for iterative work; batch processing; mostly Java; no unified shell. Spark: distributed compute only; generalized computation; on disk / in memory; great at iterative workloads (machine learning, etc.) - up to 10x faster for data on disk, up to 100x faster for data in memory; compact code; Java, Python, and Scala supported; shell for ad-hoc exploration

HADOOP + YARN : OS FOR DISTRIBUTED COMPUTING Applications: batch (MapReduce), streaming (Storm, Spark), in-memory (Spark). YARN: cluster management. HDFS: storage

Use Cases

USE CASES Batch Use case 1 : ETL / Batch query (Single Silo) Use case 2 : distributed log aggregation Batch + real time Use case 3 : real time data store Use case 4 : real time data store + batch analytics Real time / Streaming Use case 5 : Streaming

Use case 1 : ETL & Batch Analytics (Single Silo)

USE CASE 1 : ETL AND BATCH ANALYTICS @ SCALE Data collected in various databases Data is scattered across multiple silos! Need a single silo to bring all the data together and analyze it

USE CASE 1 : CONSIDERATIONS Batch analytics is OK We will use core Hadoop components This is the most common use case

USE CASE 1 : DESIGN

USE CASE 1 : DESIGN REVIEW We are using core Hadoop components No vendor lock-in (works on all Hadoop distributions) Use HDFS (Hadoop File System) for storage Data ingest with Sqoop Processing done by MapReduce & cousins Results are exported back to the DB

USE CASE 1 : DESIGN REVIEW HDFS as Single Silo Great for storing large amounts of data (hundreds of terabytes to petabytes) Content agnostic (text / binary / no schema) Source: hortonworks

USE CASE 1 : DESIGN REVIEW HDFS protects data very well Five-nines to seven-nines of availability

USE CASE 1 : DESIGN REVIEW Moving data between DB and Hadoop: Sqoop or ETL tools. Sqoop is a tool to interface databases & Hadoop: it can connect to any JDBC-compliant DB (or use custom connectors); import from DB → Hadoop; export from Hadoop → DB. Tool options: Sqoop (migrates data between DB & Hadoop; open source, part of most Hadoop ecosystems). Talend (native Hadoop support; open source). Informatica (Hadoop support (?); premium). Many more
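As a sketch of what the Sqoop import/export flow above looks like on the command line (the connection string, credentials, table names, and HDFS paths are hypothetical placeholders, not from the talk):

```shell
# Import a table from a JDBC database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/raw/transactions

# Export processed results from HDFS back to the database
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table daily_summary \
  --export-dir /data/out/daily_summary
```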

USE CASE 1 : DESIGN REVIEW Processing is batch mode (minutes / hours). Processing engines: Java MapReduce (engine: MR): native low-level MapReduce API; for complex data processing (image processing / video encoding, etc.). Pig (engine: MR): high-level data flow language / engine; for ETL workflows. Hive (engine: MR / Tez): SQL layer on Hadoop; for ad-hoc queries. Spark (engine: Spark + YARN): generic programming model; for complex workflows (RDD programming) and SQL querying (DataFrames / Spark SQL)

SQL ENGINES FOR HADOOP Hive: first SQL layer for Hadoop; supported by all Hadoop distributions. Presto: developed by Facebook; all (?). Impala: developed by Cloudera; focus on low-latency queries; very fast; open source, but tightly integrated with the Cloudera distribution; Cloudera. Tez / Stinger: Hortonworks initiative; provides a new run-time / execution engine for Hive and others; focus on speed / scale / SQL; work in progress; Hortonworks (might work on Cloudera?). Spark: can query data in Hive tables / HDFS; uses Spark as the execution engine; can be very fast (10x); all

USE CASE 1 : ETL WORK FLOW

SPARK SQL VS. HIVE Fast on the same data on HDFS!

USE CASE 1 : DESIGN RECAP Hadoop is COMPLEMENTARY to the existing data warehouse, not a replacement Hadoop can be a SINGLE SILO Facilitates analytics at massive scale Lots of choices for each task: data movement (Sqoop / ETL tools), data processing (MapReduce via Java / Pig / Hive, Spark, ETL tools), SQL engines (Hive / Impala / Hive + Tez / Presto / Spark) Mix & match Hadoop & Spark

Use Case 2 : Aggregate Data From Multiple Sources (Near Real Time)

USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES Data coming in from multiple sources. Data is streaming in Capture data in Hadoop Do batch analytics

USE CASE 2 : FUNCTIONAL SKETCH

USE CASE 2 : DESIGN

DESIGN 2 : REVIEW Flume: brings in logs from multiple sources; a distributed, reliable way to collect and move data; if uplinks are disconnected, Flume agents will store and forward the data. HDFS: Flume can write data directly to HDFS; files are segmented or rolled by size / time, e.g. Data-2015-01-01_10-00-00.log Data-2015-01-01_11-00-00.log Data-2015-01-01_12-00-00.log

DESIGN 2 : ANALYTICS Analytics stack: Pig / Hive / Oozie / Spark (same as in Use Case 1). Oozie: workflow manager ("run this workflow every 1 hour", "run this workflow when data shows up in the input directory"). Can manage complex workflows, send alerts when processes fail, etc.
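The "run this workflow every 1 hour" pattern above maps to an Oozie coordinator. A minimal sketch, with hypothetical names, dates, and paths:

```xml
<!-- Hypothetical Oozie coordinator: trigger the ETL workflow hourly -->
<coordinator-app name="hourly-etl" frequency="${coord:hours(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>hdfs:///apps/etl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```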

DESIGN 2 : REVIEW How can we process only new data? E.g. logs that came in today. Option 1) Use timestamped files: log-2015-01-01_10-00.log log-2015-01-01_13-00.log... log-2015-01-02_10-00.log Use wildcards to load files: log-2015-01-01*.log Option 2) Hive partitions

DESIGN 2 : REVIEW Hive partitions: data is partitioned over a dimension (time). Hive only picks up data in the selected partitions at query time: select * from ... where dt = '2015-01-01'
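A minimal HiveQL sketch of the partitioning idea above (table and column names are hypothetical):

```sql
-- Hypothetical partitioned table: one partition per day
CREATE TABLE logs (user_id STRING, action STRING)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Only the 2015-01-01 partition is scanned (partition pruning),
-- not the whole table
SELECT * FROM logs WHERE dt = '2015-01-01';
```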

Use Case 3 : Real Time Store

USE CASE 3 : REAL TIME DATA STORE Events are coming in; need to store the events (can be billions of events) and query them in real time, e.g. the last 10 events by user

USE CASE 3 : DESIGN HDFS is not ideal for updating data in real time, and it is not ideal for random access. We need a scalable real-time store → HBase as the operational store (new storage engine from Cloudera: Kudu)

USE CASE 3 : DESIGN

DESIGN 3 : REVIEW HBase supports real-time updates Data comes trickling in (as a stream); saved data becomes queryable immediately Use HBase APIs (Java / REST) to build dashboards Data can be queried in real time (milliseconds): a 6-node HBase cluster with 3 billion rows of data can query a single row in 1-20 ms
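A common way to make "last 10 events by user" a fast HBase query is a composite row key of user ID plus a reversed timestamp, so a prefix scan returns the newest events first. A plain-Python sketch of that key design (the scheme and helper names are illustrative, not from the talk; real code would use the HBase Java / REST API):

```python
import bisect

MAX_TS = 10**13  # hypothetical epoch ceiling used to reverse timestamps

def row_key(user_id, ts_ms):
    # HBase sorts rows lexicographically; a zero-padded reversed
    # timestamp makes a user's newest event sort first
    return f"{user_id}#{MAX_TS - ts_ms:013d}"

def last_events(sorted_keys, user_id, n=10):
    # emulate an HBase prefix scan over the sorted key space
    prefix = user_id + "#"
    start = bisect.bisect_left(sorted_keys, prefix)
    hits = []
    for k in sorted_keys[start:]:
        if not k.startswith(prefix) or len(hits) == n:
            break
        hits.append(MAX_TS - int(k.split("#", 1)[1]))  # recover the timestamp
    return hits

keys = sorted([row_key("u1", 1000), row_key("u1", 3000),
               row_key("u1", 2000), row_key("u2", 500)])
print(last_events(keys, "u1", 2))  # prints [3000, 2000]
```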

Use Case 4 : Real Time + Batch Analytics

USE CASE 4 : REAL TIME + BATCH Building on use case 3 We want to do extensive analysis on the data in HBase E.g. scoring user models, flagging credit card transactions

USE CASE 4 : DESIGN HBase is the real-time store Analytics is done via the MapReduce stack (Pig / Hive) Can we do both in a single stack? May not be a good idea: don't mix real-time and batch analytics, because batch analytics will impede real-time performance

REAL TIME & BATCH DON'T MIX

USE CASE 4 : DESIGN (SEPARATE REAL TIME & BATCH)

USE CASE 4 : DESIGN REVIEW How to replicate data? Option 1: periodic synchronization of data between clusters. Option 2: data goes to both clusters at the same time

USE CASE 4 : DESIGN REVIEW How to replicate data between clusters: HBase active sync; data in HDFS can be synchronized using utilities like DistCp. How to import data into both clusters at the same time: build a data pipeline that sends data to both, using tools like Flume

Use Case 5 : Streaming

BIG DATA EVOLUTION Decision times: batch (hours / days). Use cases: modeling, ETL, reporting

MOVING TOWARDS FAST DATA Decision time: (near) real time, seconds (or milliseconds). Use cases: alerts (medical / security), fraud detection. Streaming is becoming more prevalent: connected devices, Internet of Things. Beyond batch: we need faster processing / analytics

STREAMING ARCHITECTURE OVERSIMPLIFIED :-)

STREAMING ARCHITECTURE DATA BUCKET The data bucket captures incoming data and acts as a buffer, smoothing out bursts. So even if our processing is offline, we won't lose data. Data bucket choices: Kafka, MQ (RabbitMQ, etc.), Amazon Kinesis

KAFKA ARCHITECTURE Producers write data to brokers Consumers read data from brokers All of this is distributed / parallel and failure-tolerant Data is stored as topics, e.g. sensor_data, alerts, emails
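The producer / broker / consumer model above can be sketched as an append-only log per topic, with consumers pulling from an offset they track themselves. A toy in-memory illustration (not the real Kafka API; class and method names are hypothetical):

```python
from collections import defaultdict

class Broker:
    """Toy sketch of Kafka's model: one append-only log per topic;
    records are retained, and consumers read from an offset."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)      # append-only log
        return len(self.topics[topic]) - 1      # offset of the new record

    def consume(self, topic, offset):
        # consumers pull from a given offset; reading does not delete data
        return self.topics[topic][offset:]

b = Broker()
b.produce("sensor_data", {"temp": 21})
b.produce("sensor_data", {"temp": 22})
print(b.consume("sensor_data", 1))  # prints [{'temp': 22}]
```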

STREAMING ARCHITECTURE PROCESSING ENGINE Need to process events with low latency. So many to choose from! Choices: Storm, Spark, NiFi, Flink

STREAMING SYSTEMS FEATURE COMPARISON Storm: event-based processing by default (micro-batch via Trident); windowing supported by Trident; millisecond latency; at-least-once: YES; at-most-once: YES; exactly-once: YES with Trident. Spark Streaming: micro-batch; windowing: yes; latency in seconds; at-least-once: YES; at-most-once: NO; exactly-once: YES. Flink: event-based + micro-batch; windowing: yes; millisecond latency; at-least-once: YES; at-most-once: YES; exactly-once: YES. NiFi: event-based (?); windowing: ?; millisecond latency; at-least-once: YES; at-most-once: ?; exactly-once: ?

STREAMING ARCHITECTURE DATA STORE Where the processed data ends up. Needs to absorb data in real time. Usually NoSQL storage: HBase, Cassandra, lots of NoSQL stores

DATA STORAGE OPTIONS

DATA STORAGE CHOICES Forever storage: scalable distributed file systems Hadoop! (HDFS, actually). Real-time store: traditional RDBMS won't work (don't scale well, or too expensive; rigid schema layout) NoSQL!

LAMBDA ARCHITECTURE

LAMBDA ARCHITECTURE EXPLAINED 1. All new data is sent to both the batch layer and the speed layer 2. Batch layer: holds the master data set (immutable, append-only); answers batch queries 3. Serving layer: updates the batch views so they can be queried ad hoc 4. Speed layer: handles new data; facilitates fast / real-time queries 5. Query layer: answers queries using batch & real-time views
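The query layer's merge of batch and real-time views can be sketched in a few lines (view contents and names are hypothetical):

```python
# Toy sketch of the Lambda query layer: combine a (stale but complete)
# batch view with the speed layer's deltas since the last batch run.
batch_view = {"user_a": 100, "user_b": 40}   # recomputed periodically
speed_view = {"user_a": 3, "user_c": 1}      # recent events only

def query(key):
    # the query layer answers from batch + real-time views combined
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("user_a"))  # prints 103: batch count plus recent increments
```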

INCORPORATING LAMBDA ARCHITECTURE

ARCHITECTURE REVIEW Each component is scalable Each component is fault tolerant Incorporates best practices All open source!

SUMMARY We looked at a bunch of use cases: Batch analytics (DB → Hadoop; multiple sources → Hadoop). Real time + batch (real-time data store using HBase; HBase + batch analytics). Streaming / real time. Lots of choices!

SUMMARY / BEST PRACTICES Start small Test with large amounts of data as soon as possible Iterate / iterate / iterate The only benchmark that matters is YOURS! Build in a lot of metrics collection: host-level metrics are readily collected by monitoring systems; application-level metrics (the most useful) have to be implemented by YOU, e.g. "a request is taking 2000 ms. Where is the time spent?" Let loose the chaos monkey

THANKS AND QUESTIONS? Sujee Maniyam Founder / Principal @ ElephantScale Expert consulting + training in Big Data technologies sujee@elephantscale.com Elephantscale.com Sign up for upcoming trainings: ElephantScale.com/training Hadoop to Spark webinar @ ElephantScale.com/webinars/