Hadoop An Overview - Socrates CCDH

What is Big Data?

Volume - Not gigabytes, but terabytes, petabytes, exabytes, zettabytes
- Due to handheld gadgets and HD-format images and videos
- Of all data in existence, roughly 90% was collected in the last 2 years

Velocity - The rate at which data is produced
- Due to the proliferation of technology, social media websites, texting, MMS, etc.

Variety - Structured, semi-structured, and unstructured data
- Due to texting, MMS, and the many audio, video, and file formats

Challenges with Big Data

Storage
- How can I store such large data? What are the options available?

Performance
- How can I seek, retrieve, and work on the data? What will the performance be?

Network Bandwidth
- How can I transport the data to my application? How can I use the corporate network efficiently?

Fault Tolerance
- What if my database fails, or I want to upgrade my DB? Can I continue to serve my customers? In other words, is my system fault tolerant?

What is the Solution? Hadoop

What is Hadoop?

- Not a single technology, but an ecosystem
- Not an RDBMS, but a file system
- Developed in Java; supports Ruby, Perl, and Python scripts, as well as C and C++
- Can be integrated with NoSQL ("Not Only SQL") databases: HBase, MongoDB, Cassandra, Neo4j, Riak
- Not a silver-bullet solution: it does not solve every problem in an enterprise

Two Core Projects

1. HDFS (Hadoop Distributed File System): a scalable, fault-tolerant file system, meant for storing a small number of large files
2. MapReduce (MR) framework: a computational framework that works on top of HDFS

Architecture & Processes

Master-Slave architecture.

HDFS:      Master = Name Node, Secondary Name Node;  Slave = Data Node
MapReduce: Master = Job Tracker;                     Slave = Task Tracker

Daemon processes:
1. Name Node (one per cluster)
2. Secondary Name Node (one per cluster)
3. Job Tracker (one per cluster)
4. Data Node (many, one per slave)
5. Task Tracker (many, one per slave)

Hadoop Infrastructure - PROD Cluster

[Diagram: the master nodes (Name Node, Secondary Name Node, Job Tracker) connect through a network switch to per-rack switches; the slave nodes sit in Rack-1, Rack-2, and Rack-3.]

Hadoop Infrastructure

- Rack servers, as opposed to blade servers
- Commodity hardware
- Graceful addition or removal of nodes
- Fault tolerant
- Can run on any OS (Unix, Mac, Windows)

HDFS File Copy

A 100 MB file is split into blocks (64 MB + 36 MB), and each block is copied to data nodes spread across the racks.

[Diagram: master nodes (Name Node, Secondary Name Node, Job Tracker) behind the network switch; the file's blocks land on slave nodes across Rack-1 through Rack-4.]

HDFS Fault Tolerance

[Diagram: the same cluster layout (master nodes behind the network switch, slave nodes across Rack-1 through Rack-4), shown over three slides as a failure scenario unfolds.]

1. Node 3 in Rack-3 goes down
2. Its block gets copied onto another node to maintain the replication factor
3. Node 3 in Rack-3 comes up after some time
4. The block from Node 3 in Rack-3 gets deleted to maintain the replication factor

HDFS Summary

- Default block size: 64 MB
- Default replication factor: 3
- Usable HDFS storage capacity: cluster capacity / replication factor (e.g., a 300 TB cluster with replication factor 3 stores 100 TB of data)
- Name Node is a Single Point of Failure (SPOF) in HDFS
- Name Node maintains the metadata, namely the list of:
  1. Blocks in each data node - helps replicate the blocks in the event of a Data Node failure
  2. Files and directories in HDFS (hadoop fs -ls)
  3. Blocks, and their locations, for a given file (e.g., foo.txt - b1, b2; faa.txt - b3, b4, b5)
  4. Operations carried out in HDFS
- Secondary Name Node:
  1. NOT a backup name node
  2. Copies the namespace image and log data from the Name Node to permanent storage
  3. Helps bring up the Name Node in case of failure
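To make the Name Node's metadata role concrete, here is a minimal sketch (not from the slides) that asks HDFS for a directory listing through the standard Java FileSystem API; the class name and the fs.defaultFS address are assumed placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed placeholder address; point this at your own Name Node.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // This is a metadata query, so the Name Node answers it; Data Nodes
        // are only contacted when actual block data is read.
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}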

HDFS Summary (contd.)

- Data Nodes store the actual data
- Each Data Node sends a periodic Heartbeat signal (block report, storage capacity) - say, every few seconds - to the Name Node
- The Name Node contacts a Data Node only in response to a Heartbeat signal
- Under-replication and over-replication are corrected through Heartbeat signals

Traditional Application

- Application size is constant, but data size varies
- Data gets transmitted over the network to where the process/program resides
- Network capacity remains the same, but volume keeps increasing (remember the 3 Vs?)

MapReduce: Move the Code to Where the Data Resides ("Data Localization")

MapReduce components: Master = Job Tracker; Slave = Task Tracker

[Diagram: the Job Tracker, behind the network switch, ships task T1 to a Task Tracker on the slave node that already holds the data, rather than pulling the data across the network to the code.]

MapReduce

- MapReduce consists of two phases: a Map phase and a Reduce phase
- A Mapper accepts its input as key/value pairs and emits its result as key/value pairs
- The map function is called once per record
- Reducer tasks start only after all Map tasks are done
- The Reducer executes on the Mapper output
- A Mapper or Reducer may output zero or more key/value pairs
- Speculative Execution: slow-running tasks are relaunched on other nodes, and the first copy to finish wins

How the Mapper Works (word count problem)

Input: key/value pairs. Output: zero, one, or more intermediate key/value pairs.

Sample input records:
1. Deer Bear River
2. Car Car River
3. Deer Car Deer

Pipeline: each Block (the actual data) feeds an Input Split (a reference to the data); a Record Reader reads the referenced data and hands records to a Mapper, which performs the mapping.

Without a Combiner, the intermediate key/value pairs are:
- Mapper 1: Deer:1, Bear:1, River:1
- Mapper 2: Car:1, Car:1, River:1
- Mapper 3: Deer:1, Car:1, Deer:1

With a Combiner (one per Mapper, combining that Mapper's output):
- Combiner 1: Deer:1, Bear:1, River:1
- Combiner 2: Car:{1,1}, River:1
- Combiner 3: Deer:{1,1}, Car:1
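As a concrete sketch of the word-count Mapper described above, using the standard org.apache.hadoop.mapreduce API (the class and field names here are our own):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output: one (word, 1) pair per token -- zero or more pairs per record.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);  // e.g. ("Deer", 1)
        }
    }
}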

How the Reducer Works

Input: intermediate key/value pairs from the Mappers. Output: one value per key.

The intermediate pairs go through shuffling, partitioning, and sorting before reaching the Reducer.

Without a Combiner:
- Intermediate pairs: Deer:1, Bear:1, River:1 / Car:1, Car:1, River:1 / Deer:1, Car:1, Deer:1
- After partitioning and sorting: Bear:1; Car:1, Car:1, Car:1; Deer:1, Deer:1, Deer:1; River:1, River:1
- Reducer output: Bear:1, Car:3, Deer:3, River:2

With a Combiner:
- Intermediate pairs: Deer:1, Bear:1, River:1 / Car:{1,1}, River:1 / Deer:{1,1}, Car:1
- After partitioning and sorting: Bear:1; Car:2, Car:1; Deer:1, Deer:2; River:1, River:1
- Reducer output: Bear:1, Car:3, Deer:3, River:2
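And a matching Reducer sketch, again with our own class names:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives all values for one key (e.g. "Car" -> [1, 1, 1]) and emits the sum.
// Because summing is associative, the same class can also serve as the Combiner.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);  // e.g. ("Car", 3)
    }
}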

MapReduce Summary

- Data Localization
- Speculative Execution
- Parallel processing
- Batch processing
- Output sorted by key
- Combiners and Reducers are optional
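To show how the optional Combiner from the summary above is wired in, here is a minimal driver sketch that reuses the Reducer as the Combiner; the job name and HDFS paths are illustrative, and the Mapper/Reducer classes are the ones sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        // The Combiner is optional: it pre-aggregates Mapper output locally,
        // shrinking the data shuffled across the network to the Reducers.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Illustrative HDFS paths.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}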

Challenges with Big Data: How Hadoop Answers Them

Storage (How can I store such large data? What are the options available?)
1. Distributed storage
2. Add more machines without affecting the data/application
3. Increase hard disk capacity

Performance (How can I seek, retrieve, and work on the data? What will the performance be?)
1. MapReduce: perform parallel processing
2. Minimize seek time by reading a full block of data at a time

Network Bandwidth (How can I transport the data to my application? How can I use the corporate network efficiently?)
1. Data localization: ship the code to the data (as opposed to the traditional model)

Fault Tolerance (What if my database fails, or I want to upgrade my DB? Can I continue to serve my customers?)
1. Keep redundant copies of the data
2. Speculative Execution

Hadoop Ecosystem: Hive

- Developed by Facebook
- Provides a SQL-like interface; commands are converted into MapReduce jobs
- Can be used by anyone familiar with SQL
- Does not support all SQL capabilities (e.g., sub-queries)

SELECT * FROM employee;
SELECT * FROM employee WHERE id > 100 ORDER BY id ASC LIMIT 20;

[Diagram: Hive sits on its Metastore and on MapReduce, which in turn runs over HDFS.]

Hadoop Ecosystem: Pig

- Developed by Yahoo!
- A high-level platform for MapReduce
- Uses the Pig Latin language; commands are converted into MapReduce jobs
- Does not require a Metastore like Hive

results = FILTER records BY project == 'en';
sorted_results = ORDER results BY $1 DESC;
STORE sorted_results INTO 'myresults';

[Diagram: Pig compiles to MapReduce jobs over HDFS; unlike Hive, no Metastore is needed.]

Hadoop Ecosystem Projects

HBase:     Scalable distributed DB for random read/write
Ambari:    Web-based tool for managing and monitoring Hadoop clusters; also provides a dashboard
Sqoop:     "SQL-to-Hadoop"; helps transfer data from an RDBMS to HDFS
Flume:     Distributed service for moving large data into HDFS (say, server logs)
Avro:      Data serialization mechanism
Mahout:    Scalable machine learning and data mining library
YARN:      Yet Another Resource Negotiator; MapReduce (v2)
Zookeeper: High-performance coordination service

Hadoop Ecosystem Projects (contd.)

Whirr:  Java API that allows Hadoop to run on EC2 and other cloud services
Giraph: Iterative graph processing system, currently used by Facebook for analyzing the social graph formed by its users
Hue:    UI framework and SDK for visual Hadoop apps
Impala: Interface that provides SQL query execution; implemented in C++, does not use MapReduce, 10-100x faster than Hive
Oozie:  Workflow engine that executes Pig, Hive, Sqoop, or MapReduce jobs; a workflow engine plus coordinator engine

Hadoop Ecosystem by Function

- Ingest / Propagate: Flume, Sqoop
- Describe / Develop: Hive, Pig
- Compute / Search: MapReduce, Giraph
- Persist:
  - File systems: HDFS, MapR DFS
  - Serialization: Avro
  - DBMS: Cassandra, Riak, MongoDB, HBase, Neo4j, CouchBase
- Monitor / Admin: Ambari, Oozie, Zookeeper
- Analytics / Machine Learning: Mahout

Who is using Hadoop? Netflix, Yahoo!, Facebook, and the BFSI (banking, financial services, and insurance) sector

Thank you! Questions?