MODERN BIG DATA DESIGN PATTERNS: CASE DRIVEN DESIGNS
1 MODERN BIG DATA DESIGN PATTERNS: CASE DRIVEN DESIGNS SUJEE MANIYAM FOUNDER / ELEPHANT SCALE sujee@elephantscale.com
2 HI, I'M SUJEE MANIYAM Founder / ElephantScale Consulting & training in Big Data: Spark / Hadoop / NoSQL / Data Science Author: Hadoop Illuminated (open source book), HBase Design Patterns Open source contributor: github.com/sujee sujee@elephantscale.com
3 WHO IS THIS TALK FOR? Data Managers / Data Architects / Developers Thinking about Big Data infrastructure
4 A LOOK AT BIG DATA ECO SYSTEM Source : datafloq.com
5 HADOOP ECO SYSTEM Source : hortonworks.com
6 WHAT IS A GOOD DESIGN / ARCHITECTURE? Source : fox.com
7 WORKS ON MY LAPTOP "I just got XYZ working on my laptop in 3 hours! Let's build this!!"
8 WHAT WORKS ON A LAPTOP MAY NOT WORK AT SCALE!
9 AT SCALE NOTHING WORKS AS ADVERTISED
10 BIG DATA DESIGN PATTERNS ARE EMERGING We are gaining experience in using Big Data tools We hear about other people's experiences at conferences and meetups Failure stories are still hard to come by :-)
11 BIG DATA TECHNOLOGIES : A QUICK LOOK
  1st Gen (Big Data) - Batch: Hadoop v1 (2011), Hadoop v2 (2013)
  2nd Gen (Fast Data) - Beyond batch / streaming: Spark, NiFi, Flink, Kafka
12 HADOOP IN 30 SECONDS The original Big Data platform Very well field-tested Scales to petabytes of data Enables analytics at massive scale
13 HADOOP ECO SYSTEM Real Time Batch
14 HADOOP ECOSYSTEM BY FUNCTION
  HDFS - provides distributed storage
  MapReduce - provides distributed computing
  Pig - high-level MapReduce
  Hive - SQL layer over Hadoop
  HBase - NoSQL storage for real-time queries
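The division of labor in MapReduce can be sketched in plain Python (a local simulation for illustration, not the Hadoop Java API): mappers emit (word, 1) pairs, the framework's shuffle groups pairs by key, and reducers sum each group.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big plans", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result)  # {'big': 3, 'data': 2, 'plans': 1}
```

In real Hadoop the three phases run distributed across the cluster, with HDFS supplying the input splits and storing the reducer output.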
16 SPARK IN 30 SECONDS Open source cluster computing engine Very fast: in-memory ops 100x faster than MR, on-disk ops 10x faster than MR General purpose: MR, SQL, streaming, machine learning, analytics Compatible: runs over Hadoop, Mesos, YARN, or standalone; works with HDFS, S3, Cassandra, HBase, etc. Easier to code: word count in 2 lines Spark's roots: came out of the Berkeley AMP Lab, now a top-level Apache project; version 1.5 released in Sept 2015 "First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" - stratio.com
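The "word count in 2 lines" claim refers to Spark's chained RDD API, roughly `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)`. A minimal local stand-in for that chain, with a toy RDD class in place of a real SparkContext (illustration only, not the PySpark API), looks like:

```python
from operator import add

class LocalRDD:
    """Toy in-memory stand-in for a Spark RDD (illustration only)."""
    def __init__(self, data):
        self.data = list(data)
    def flatMap(self, f):
        return LocalRDD(x for item in self.data for x in f(item))
    def map(self, f):
        return LocalRDD(f(item) for item in self.data)
    def reduceByKey(self, f):
        # Combine all values sharing a key, like Spark's reduceByKey
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return LocalRDD(acc.items())
    def collect(self):
        return self.data

rdd = LocalRDD(["to be or not to be"])
counts = rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add).collect()
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The point of the slide is the last line: one fluent chain replaces the mapper class, reducer class, and driver boilerplate of classic MapReduce.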
17 SPARK ILLUSTRATED
  Libraries: Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning), GraphX (graph processing)
  Engine: Spark Core
  Cluster managers: Standalone / YARN / Mesos
  Data storage: S3 / HDFS / Cassandra / ...
18 HADOOP VS. SPARK Hadoop Spark
19 SPARK / HADOOP
  Hadoop: Distributed storage + distributed compute; MapReduce framework; data usually on disk (HDFS); not ideal for iterative work; batch processing; mostly Java; no unified shell
  Spark: Distributed compute only; generalized computation; on disk / in memory; great at iterative workloads (machine learning, etc.) - up to 10x faster for data on disk, up to 100x faster for data in memory; compact code; Java, Python, Scala supported; shell for ad-hoc exploration
20 HADOOP + YARN : OS FOR DISTRIBUTED COMPUTING Batch (mapreduce) Streaming (storm, spark) In-memory (spark) Applications YARN HDFS Cluster Management Storage
21 Use Cases
22 USE CASES
  Batch: Use case 1 - ETL / batch query (single silo); Use case 2 - distributed log aggregation
  Batch + real time: Use case 3 - real time data store; Use case 4 - real time data store + batch analytics
  Real time / streaming: Use case 5 - streaming
23 Use case 1 : ETL & Batch Analytics (Single Silo)
24 USE CASE 1 : ETL AND BATCH SCALE Data collected in various databases Data is scattered across multiple silos! Need a single silo to bring all data together and analyze
25 USE CASE 1 : CONSIDERATIONS Batch analytics is ok We will use Hadoop core components This is most common use case
26 USE CASE 1 : DESIGN
27 USE CASE 1 : DESIGN REVIEW We are using core Hadoop components No vendor lock in (works on all Hadoop distributions) Use HDFS (Hadoop File System) for storage Data Ingest with Sqoop Processing done by Map Reduce & Cousins Results are exported back to DB
28 USE CASE 1 : DESIGN REVIEW HDFS as single silo Great for storing large amounts of data (100s of terabytes to petabytes) Content agnostic (text / binary / no schema) Source : hortonworks
29 USE CASE 1 : DESIGN REVIEW HDFS protects data very well Five-nines to seven-nines of availability
30 USE CASE 1 : DESIGN REVIEW Moving data between DB and Hadoop: Sqoop / ETL tools Sqoop is a tool to interface databases & Hadoop Can connect to any JDBC-compliant DB (or custom connectors) Import from DB → Hadoop Export from Hadoop → DB
  Tool - Description - Open Source / Premium:
  Sqoop - Migrates data between DB & Hadoop - Open source, part of most Hadoop ecosystems
  Talend - Native Hadoop support - Open source
  Informatica - Hadoop support (?) - Premium
  Many more
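A Sqoop import is normally driven from the command line. As a hedged sketch, assembling such a command in Python (the connection URL, table, and directory below are illustrative placeholders; check the flags against your Sqoop version's docs):

```python
# Build (but do not run) a Sqoop import command.
# All connection values below are illustrative placeholders.
def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    return [
        "sqoop", "import",
        "--connect", jdbc_url,         # any JDBC-compliant database
        "--table", table,              # source table to import
        "--target-dir", target_dir,    # destination directory in HDFS
        "--num-mappers", str(mappers), # parallelism of the import
    ]

cmd = sqoop_import_cmd("jdbc:mysql://dbhost/sales", "orders", "/data/orders")
print(" ".join(cmd))
```

In practice this command would be handed to the shell (or a workflow manager like Oozie), and the matching `sqoop export` moves results back to the database.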
31 USE CASE 1 : DESIGN REVIEW Processing is batch mode (minutes / hours)
  Processing engine - Description - Sample use case:
  Java MapReduce (engine: MR) - Native low-level API to MapReduce - Complex data processing (image processing / video encoding, etc.)
  Pig (engine: MR) - High-level data flow language / engine - ETL workflows
  Hive (engine: MR / Tez) - SQL layer on Hadoop - Ad-hoc queries
  Spark (engine: Spark + YARN) - Generic programming model - Complex workflows (RDD programming), SQL querying (DataFrames / Spark SQL)
32 SQL ENGINES FOR HADOOP
  Engine - Description - Distribution support:
  Hive - First SQL layer for Hadoop - All Hadoop distributions
  Presto - Developed by Facebook - All?
  Impala - Developed by Cloudera; focus on low-latency queries; very fast; open source, but tightly integrated with the Cloudera distribution - Cloudera
  Tez / Stinger - Hortonworks initiative; provides a new run-time / execution engine for Hive and others; focus on speed / scale / SQL; work in progress - Hortonworks (might work on Cloudera?)
  Spark - Can query data in Hive tables / HDFS; uses Spark as execution engine; can be very fast (10x) - All
33 USE CASE 1 : ETL WORK FLOW
34 SPARK SQL VS. HIVE Fast on same HDFS data!
35 SPARK SQL VS. HIVE Fast on same data on HDFS
36 USE CASE 1 : DESIGN RECAP Hadoop is COMPLEMENTARY to an existing data warehouse - not replacing it Hadoop can be a SINGLE SILO Facilitates analytics at massive scale Lots of choices for each task Data movement: Sqoop / ETL tools Data processing: MapReduce (Java / Pig / Hive), Spark, ETL tools SQL engines: Hive / Impala / Hive + Tez / Presto / Spark Mix & match Hadoop & Spark
37 Use Case 2 : Aggregate Data From Multiple Sources (Near Real Time)
38 USE CASE 2 : DATA COMING FROM MULTIPLE SOURCES Data coming in from multiple sources. Data is streaming in Capture data in Hadoop Do batch analytics
39 USE CASE 2 : FUNCTIONAL SKETCH
40 USE CASE 2 : DESIGN
41 DESIGN 2 : REVIEW Flume To bring in logs from multiple sources Distributed, reliable way to collect and move data If uplinks are disconnected, Flume agents will store and forward data HDFS Flume can directly write data to HDFS Files are segmented or rolled by size / time (e.g. timestamped log files)
42 DESIGN 2 : ANALYTICS Analytics stack: Pig / Hive / Oozie / Spark (same as in Use Case 1) Oozie Workflow manager "run this workflow every 1 hour" "run this workflow when data shows up in the input directory" Can manage complex workflows Sends alerts when processes fail, etc.
43 DESIGN 2 : REVIEW How can we process only the new data? E.g. logs that came in today Option 1) Use timestamped files and load them with wildcards (e.g. log*.log) Option 2) Hive partitions
44 DESIGN 2 : REVIEW Hive Partitions Data is partitioned over a dimension (time) Hive only picks up data in the selected partitions at query time: select * from ... where dt = ...
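Partition pruning is the point of this design: a query that filters on the partition column only touches that partition's files. A plain-Python sketch of the idea (the directory layout mimics Hive's `dt=<value>` partitions; file names are made up):

```python
# Hive lays partitioned data out as one directory per partition value,
# e.g. /warehouse/logs/dt=2015-10-01/. Filtering on dt prunes directories.
partitions = {
    "2015-10-01": ["part-00000", "part-00001"],
    "2015-10-02": ["part-00000"],
    "2015-10-03": ["part-00000", "part-00001", "part-00002"],
}

def files_to_scan(where_dt=None):
    # No partition filter -> full table scan; with a filter -> one partition
    if where_dt is None:
        return [f for files in partitions.values() for f in files]
    return partitions.get(where_dt, [])

print(len(files_to_scan()))              # 6 files: full table scan
print(len(files_to_scan("2015-10-02")))  # 1 file: pruned to one partition
```

New daily data simply lands in a new partition directory, so "process today's logs" becomes a single-partition query instead of a scan over everything.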
45 Use Case 3 : Real Time Store
46 USE CASE 3 : REAL TIME DATA STORE Events are coming in Need to store the events Can be billions of events And query them in real time e.g. last 10 events by user
47 USE CASE 3 : DESIGN HDFS is not ideal for updating data in real time And it is not ideal for random access to data We need a scalable real time store → HBase as operational store (new storage engine from Cloudera: Kudu)
48 USE CASE 3 : DESIGN
49 DESIGN 3 : REVIEW HBase supports real time updates Data comes trickling in (as a stream) Saved data becomes queryable immediately Use HBase APIs (Java / REST) to build dashboards Data can be queried in real time (milliseconds) E.g. a 6-node HBase cluster with 3 billion rows of data can query a single row in 1-20 ms
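A query like "last 10 events by user" comes down to row-key design: if the key is the user id plus a reverse timestamp, the newest events sort first and a short prefix scan answers the query. A plain-Python sketch of that key scheme (HBase itself stores rows sorted by key; the dict and sort here stand in for a region scan):

```python
MAX_TS = 10**13  # reverse timestamps so the newest rows sort first

def row_key(user, ts_millis):
    # Illustrative key scheme: <user>#<MAX_TS - timestamp>, zero-padded
    return f"{user}#{MAX_TS - ts_millis:013d}"

# Simulate an HBase table: rows kept sorted by key, like a region scan
table = {}
for ts in [1000, 2000, 3000, 4000]:
    table[row_key("user42", ts)] = f"event@{ts}"
table[row_key("user99", 2500)] = "other user's event"

def last_events(user, n):
    # Prefix scan on the user's keys; newest-first falls out of the sort
    prefix = user + "#"
    keys = sorted(k for k in table if k.startswith(prefix))
    return [table[k] for k in keys[:n]]

print(last_events("user42", 2))  # ['event@4000', 'event@3000']
```

With a real HBase client the same query is a scan with a start key of `user42#` and a row limit of 10, so it reads only a handful of rows regardless of table size.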
50 Use Case 4 : Real Time + Batch Analytics
51 USE CASE 4 : REAL TIME + BATCH Building on use case 3 We want to do extensive analysis on data on HBase E.g. : scoring user models flagging credit card transactions
52 USE CASE 4 : DESIGN HBase is the real time store Analytics is done via the MapReduce stack (Pig / Hive) Can we do them in a single stack? May not be a good idea Don't mix real time and batch analytics Batch analytics will impede real time performance
53 REAL TIME & BATCH DON'T MIX
54 USE CASE 4 : DESIGN (SEPARATE REAL TIME & BATCH)
55 USE CASE 4 : DESIGN REVIEW How to replicate data? 1 : periodic synchronization of data between clusters 2 : data goes to both clusters at the same time
56 USE CASE 4 : DESIGN REVIEW How to replicate data between clusters: HBase active sync (replication) Data in HDFS can be synchronized using utilities like DistCp How to import data into both clusters at the same time? Build a data pipeline to send data to both Use tools like Flume
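The second option (sending data to both clusters at the same time) amounts to a fan-out in the ingest pipeline: every event is delivered to both the real-time cluster and the batch cluster. A minimal sketch, with in-memory lists standing in for the two clusters:

```python
class FanOutPipeline:
    """Deliver each incoming event to every registered sink
    (Flume-style fan-out to two clusters, simulated with lists)."""
    def __init__(self, *sinks):
        self.sinks = sinks
    def send(self, event):
        for sink in self.sinks:
            sink.append(event)

realtime_cluster, batch_cluster = [], []
pipeline = FanOutPipeline(realtime_cluster, batch_cluster)
for event in ["login", "click", "purchase"]:
    pipeline.send(event)

# Both clusters received identical copies of the stream
print(realtime_cluster == batch_cluster)  # True
```

The trade-off versus periodic synchronization is freshness against complexity: dual writes keep both clusters current but push failure handling (one sink down, the other up) into the pipeline.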
57 Use Case 5 : Streaming
58 BIG DATA EVOLUTION Decision times: batch (hours / days) Use cases: modeling, ETL, reporting
59 MOVING TOWARDS FAST DATA Decision time: (near) real time - seconds (or milliseconds) Use cases: alerts (medical / security), fraud detection Streaming is becoming more prevalent: connected devices, Internet of Things Beyond batch: we need faster processing / analytics
60 STREAMING ARCHITECTURE (OVER-SIMPLIFIED)
61 STREAMING ARCHITECTURE - DATA BUCKET The data bucket captures incoming data Acts as a buffer - smooths out bursts So even if our processing is offline, we won't lose data Data bucket choices: Kafka, MQ (RabbitMQ, etc.), Amazon Kinesis
62 KAFKA ARCHITECTURE Producers write data to brokers Consumers read data from brokers All of this is distributed / parallel Failure tolerant Data is stored as topics (e.g. sensor_data, alerts)
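The broker model can be sketched in a few lines: a topic is an append-only log, producers append to it, and each consumer tracks its own read offset, so independent consumers replay the same data. This is a local toy for illustration, not the Kafka client API (real Kafka adds partitions, replication, and persistence):

```python
class TopicLog:
    """Append-only log per topic; consumers read independently by offset."""
    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer name -> next offset to read
    def produce(self, msg):
        self.messages.append(msg)
    def consume(self, consumer, max_msgs=10):
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_msgs]
        self.offsets[consumer] = start + len(batch)
        return batch

sensor_data = TopicLog()
sensor_data.produce("temp=21")
sensor_data.produce("temp=22")
print(sensor_data.consume("dashboard"))  # ['temp=21', 'temp=22']
sensor_data.produce("temp=23")
print(sensor_data.consume("dashboard"))  # ['temp=23'] - resumes at its offset
print(sensor_data.consume("alerting"))   # all three - a fresh consumer replays the log
```

Because the log is retained rather than deleted on read, the same topic can feed both the speed layer and a batch loader, which is exactly why Kafka fits the data-bucket role above.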
63 STREAMING ARCHITECTURE PROCESSING ENGINE Need to process events with low latency So many to choose from! Choices Storm Spark NiFi Flink
64 STREAMING SYSTEMS FEATURE COMPARISON
  Feature          | Storm                                              | Spark Streaming | Flink                     | NiFi
  Processing model | Event based by default (micro batch using Trident) | Micro batch     | Event based + micro batch | Event based (?)
  Windowing ops    | Supported by Trident                               | Yes             | Yes                       | ?
  Latency          | Milliseconds                                       | Seconds         | Milliseconds              | Milliseconds
  At-least-once    | Yes                                                | Yes             | Yes                       | Yes
  At-most-once     | Yes                                                | No              | Yes                       | ?
  Exactly-once     | Yes, with Trident                                  | Yes             | Yes                       | ?
65 STREAMING ARCHITECTURE DATA STORE Where processed data ends up Need to absorb data in real time Usually a NoSQL storage HBase Cassandra Lots of NoSQL stores
66 DATA STORAGE OPTIONS
67 DATA STORAGE CHOICES "Forever" storage: scalable distributed file systems - Hadoop! (HDFS actually) Real time store: traditional RDBMS won't work - they don't scale well (or are too expensive) and have rigid schema layouts - NoSQL!
68 LAMBDA ARCHITECTURE
69 LAMBDA ARCHITECTURE EXPLAINED 1. All new data is sent to both the batch layer and the speed layer 2. Batch layer Holds the master data set (immutable, append-only) Answers batch queries 3. Serving layer Updates batch views so they can be queried ad-hoc 4. Speed layer Handles new data Facilitates fast / real-time queries 5. Query layer Answers queries using batch & real-time views
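The query layer's merge of the two views can be sketched as: the batch view covers everything up to the last batch run, the speed layer covers what arrived since, and a query combines both (page-view counts are used here as an illustrative metric):

```python
from collections import Counter

# Batch view: precomputed from the immutable master data set
batch_view = Counter({"page_a": 1000, "page_b": 500})

# Speed layer: increments for data that arrived after the last batch run
speed_view = Counter({"page_a": 7, "page_c": 3})

def query(page):
    # Query layer: merge the batch view with the real-time view
    return batch_view[page] + speed_view[page]

print(query("page_a"))  # 1007: batch result plus recent events
print(query("page_c"))  # 3: seen only by the speed layer so far
```

When the next batch run completes, its results absorb the speed layer's window and the real-time view is reset, so approximations in the speed layer are continuously corrected by the batch layer.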
70 INCORPORATING LAMBDA ARCHITECTURE
71 ARCHITECTURE REVIEW Each component is scalable Each component is fault tolerant Incorporates best practices All open source!
72 SUMMARY We looked at a bunch of use cases Batch analytics: DB → Hadoop, multiple sources → Hadoop Real time + batch: real time data store using HBase, HBase + batch analytics Streaming / real time Lots of choices!
73 SUMMARY / BEST PRACTICES Start small Test with large amounts of data as soon as possible Iterate / iterate / iterate The only benchmark that matters is YOURS! Build in lots of metrics collection Host-level metrics are readily collected by monitoring systems Application-level metrics (most useful) have to be implemented by YOU e.g. "Request is taking 2000 ms... where is the time spent?" Let loose the chaos monkey
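Application-level timing ("where did the 2000 ms go?") can be collected with a small instrument of your own; a sketch using a context manager (the stage names and sleep durations are made up for illustration):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Record wall-clock time spent in each named stage of a request
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000  # ms

# Wrap the suspect stages of a request to see where the time goes
with timed("db_lookup"):
    time.sleep(0.01)   # stand-in for a database call
with timed("render"):
    time.sleep(0.005)  # stand-in for response rendering

for stage, ms in timings.items():
    print(f"{stage}: {ms:.1f} ms")
```

In a real system these per-stage numbers would be shipped to the same monitoring backend as the host-level metrics, so a slow request can be broken down without guesswork.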
74 THANKS AND QUESTIONS? Sujee Maniyam Founder / ElephantScale Expert consulting + training in Big Data technologies sujee@elephantscale.com ElephantScale.com Sign up for upcoming trainings: ElephantScale.com/training (Hadoop to Spark) ElephantScale.com/webinars/
Course Content of Big Data Hadoop( Intermediate+ Advance) Pre-requistes: knowledge of Core Java/ Oracle: Basic of Unix S.no Topics Date Status Introduction to Big Data & Hadoop Importance of Data& Data
More informationSQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
More informationHortonworks Data Platform
Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationBring Context To Your Machine Data With Hadoop, RDBMS & Splunk
Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference
More informationBig Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018
Big Data com Hadoop Impala, Hive e Spark VIII Sessão - SQL Bahia 03/03/2018 Diógenes Pires Connect with PASS Sign up for a free membership today at: pass.org #sqlpass Internet Live http://www.internetlivestats.com/
More informationThe age of Big Data Big Data for Oracle Database Professionals
The age of Big Data Big Data for Oracle Database Professionals Oracle OpenWorld 2017 #OOW17 SessionID: SUN5698 Tom S. Reddy tom.reddy@datareddy.com About the Speaker COLLABORATE & OpenWorld Speaker IOUG
More informationContainer 2.0. Container: check! But what about persistent data, big data or fast data?!
@unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein
More informationData Lake Based Systems that Work
Data Lake Based Systems that Work There are many article and blogs about what works and what does not work when trying to build out a data lake and reporting system. At DesignMind, we have developed a
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationDatabricks, an Introduction
Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,
More informationReport on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt
Report on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt Date: 10 Sep, 2017 Draft v 4.0 Table of Contents 1. Introduction... 3 2. Infrastructure Reference Architecture...
More informationAsanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks
Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationLambda Architecture for Batch and Stream Processing. October 2018
Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.
More informationExam Questions
Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure
More informationIntro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect
Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing
More informationOver the last few years, we have seen a disruption in the data management
JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,
More informationBig Data on AWS. Peter-Mark Verwoerd Solutions Architect
Big Data on AWS Peter-Mark Verwoerd Solutions Architect What to get out of this talk Non-technical: Big Data processing stages: ingest, store, process, visualize Hot vs. Cold data Low latency processing
More informationAnalyzing Flight Data
IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo
More informationThis is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.
About the Tutorial Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and
More informationAgenda. Spark Platform Spark Core Spark Extensions Using Apache Spark
Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing
More information