Hadoop. Copyright 2011 Trainologic LTD


Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines, provides high availability, offers map-reduce functionality, and hides the complexity of achieving high scalability and better hardware utilization.

HDFS HDFS stands for Hadoop Distributed File System. It is a subproject of Hadoop and has the following features: a write-once-read-many access model for a simple concurrency model; replication for fault-tolerance; support for large files (terabytes/petabytes) and lots of them (millions); shell support (Linux-like); and data access by Map/Reduce streaming.
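
For illustration, here are a few typical HDFS shell commands in the Linux-like style the slide mentions (the paths and file names are hypothetical examples, not from the slides):

hadoop fs -mkdir /user/train
hadoop fs -put local.txt /user/train/local.txt
hadoop fs -ls /user/train
hadoop fs -cat /user/train/local.txt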

Map-Reduce Map-reduce was introduced by Google. It is a programming model that breaks a task into sub-tasks, distributes them to be executed in parallel (map), and aggregates the results (reduce). Between the Map and Reduce parts there is an intermediate phase called shuffle. In the Shuffle phase, the output of the Map operations is sorted and grouped for the Reduce part.

NameNode In a Hadoop cluster there is a single NameNode. The NameNode is responsible for the directory structure of HDFS and for the location of the blocks. It is currently a single point of failure, though this is expected to be fixed in the next version (2.0.2). There can be a secondary NameNode that keeps a checkpoint of the edit log, which enables faster recovery. Still, when the NameNode goes down, the whole HDFS is down.

DataNode There are many DataNodes inside a Hadoop cluster. They contain the data and provide high availability and fault-tolerance through replication.

Master Nodes There are two nodes in a Hadoop cluster that are single points of failure: the NameNode, the HDFS point of failure (discussed above), and the JobTracker, the MapReduce point of failure. If the JobTracker machine goes down, all running map-reduce jobs are halted. This is not planned to be resolved in the coming version.

JobTracker The JobTracker node is responsible for assigning job IDs and distributing the task code and configuration. It also monitors the tasks for failures, reassigns them when needed, and provides a diagnostic API for the jobs.

TaskTracker A TaskTracker is responsible for task execution. There is one TaskTracker per slave node. It uses a heartbeat service to notify the JobTracker of its progress.

Map-Reduce Interfaces In order to perform a Map-Reduce operation you'll have to create the following components: JobConf, InputFormat, OutputFormat, Mapper, Partitioner, Combiner, and Reducer. Let's review them.

JobConf Represents the configuration of the job and binds together all the other components. A job can be run by passing a JobConf to JobClient.runJob().
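
As a minimal sketch of how the pieces bind together, here is a driver against the classic org.apache.hadoop.mapred API; WordCountMapper and WordCountReducer are hypothetical classes sketched under the Mapper and Reducer slides below:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // bind the other components to the job
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // submits the job and blocks until it completes
    JobClient.runJob(conf);
  }
}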

Input Processing You can provide configuration for InputFormat, InputSplit and RecordReader to process the input and decide whether to split a file into records for the map operations, combine files, etc. Hadoop knows how to decompress files by default, and you can provide a codec implementation for additional formats.

Mapper The Mapper represents the logic to be performed on a key/value pair of input; it returns an intermediary key/value output. It's like the SELECT clause in SQL (with a simple WHERE). There are many built-in extending classes that provide services like concurrency, inverting, identity and regex.
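
A minimal word-count Mapper sketch against the classic org.apache.hadoop.mapred API (the class name is hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // emit an intermediate (word, 1) pair for every token in the line
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }
}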

Side Effects The mapper's logic should be idempotent (i.e., free of side effects). In fact, every implementation of Mapper, Combiner, Partitioner and Reducer should be purely functional.

Reducer The Reducer is responsible for aggregating the outputs from the Mappers into smaller results, just like aggregate functions in SQL. It is composed of 3 stages: Shuffle, fetching the relevant partition from the mappers; Sort, grouping the outputs by key; and Reduce, generating the final output (usually saved to the file system).
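
And the matching Reducer sketch, summing the counts per key (again using the classic API; the class name is hypothetical):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // sum all the counts that the shuffle grouped under this key
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}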

Combiner The combiner performs aggregation on the mapper's output before it is sent to the reducer. There is no special interface for the combiner; Hadoop uses the same interface as for the Reducer. Usually the combiner class is the same as the reducer class. It is best suited for monoid aggregators (associative operations with an identity, like sum).
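
Since the combiner shares the Reducer interface, registering one is a single JobConf call. A sketch, reusing the hypothetical WordCountReducer from above (which works because summing is a monoid operation):

// also run the reducer logic locally on each mapper's output
conf.setCombinerClass(WordCountReducer.class);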

Partitioner By default, the results from the mappers are distributed to the reducers by hashing the partition key. However, in order to implement an order-by behavior, you can implement a custom partitioner that uses ranges instead of hashing. Note that this is usually combined with a preliminary sampling job that builds a histogram of the keys to define the ranges.
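
A sketch of a range-based partitioner using the classic API; the alphabetical bucketing is a hypothetical stand-in for ranges that a sampling job would produce:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // in a real job the range boundaries would be read from the
    // configuration, e.g. as produced by a preliminary sampling job
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // hypothetical range scheme: bucket keys by their first letter so
    // that the concatenated reducer outputs are globally ordered
    String s = key.toString().toLowerCase();
    char first = s.isEmpty() ? 'a' : s.charAt(0);
    return (Math.min(Math.max(first - 'a', 0), 25) * numPartitions) / 26;
  }
}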

Reporter Both Mappers and Reducers can use Reporters to report their progress and to update counters. Counters are defined either by the application or by the map-reduce framework itself, and they are based on long values.
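
A sketch of counter and progress reporting from inside a map() method; the class and enum names are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // application-defined counter
  public enum Quality { MALFORMED_RECORDS }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    if (value.toString().trim().isEmpty()) {
      reporter.incrCounter(Quality.MALFORMED_RECORDS, 1); // counters hold long values
      return;
    }
    reporter.setStatus("processing offset " + key.get()); // progress report
    output.collect(new Text(value.toString()), new IntWritable(1));
  }
}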

Hadoop Streaming Bundled in the Hadoop distribution. It can take any executable as the Mapper and the Reducer. E.g.: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input idir -output odir -mapper /bin/cat -reducer /bin/wc

Hadoop Streaming You can also specify Java classes as the mapper, reducer, combiner, partitioner, etc. In fact, you can create the JobConf entirely from the command line.
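
For example, a streaming job can mix an executable mapper with a built-in Java reducer class and set configuration properties directly on the command line. This is a sketch; the property and class shown are standard, but the combination is illustrative:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -D mapred.reduce.tasks=2 \
  -input idir -output odir \
  -mapper /bin/cat \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer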

Writable Inputs and outputs in Hadoop must be serialized, and the standard Java serialization is way too slow. Instead, Hadoop uses DataInput and DataOutput stream wrappers to write the primitives of your data. You should implement the Writable interface if you want custom serialization, or use the out-of-the-box implementations. Note that there is no support for super-fast serialization using direct memory or Unsafe.
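
A minimal custom Writable sketch; the type and its fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CourseWritable implements Writable {
  private String name;
  private int duration;

  public void write(DataOutput out) throws IOException {
    // primitives are written directly; no reflection as in Java serialization
    out.writeUTF(name);
    out.writeInt(duration);
  }

  public void readFields(DataInput in) throws IOException {
    // fields must be read back in exactly the same order
    name = in.readUTF();
    duration = in.readInt();
  }
}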

Caching Hadoop provides a distributed cache utility that can efficiently distribute read-only files and JARs. The application specifies in the JobConf which files should be distributed through the cache.
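
A sketch of registering a file in the distributed cache and then locating it from a task; the HDFS path is hypothetical:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// in the driver: register a read-only lookup file that already lives in HDFS
DistributedCache.addCacheFile(new URI("/user/train/lookup.txt"), conf);

// in a task (e.g. in configure()): find the local copies of the cached files
Path[] localCopies = DistributedCache.getLocalCacheFiles(conf);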

What Hadoop is Not Hadoop is not well suited for the following use cases: low-latency applications (Hadoop is tuned for high throughput); small files (the default file block size is 64 MB); it is not fully HA; and it requires complex configuration. Note that for real-time processing you can use Twitter Storm.

Additional Projects The Hadoop ecosystem provides several projects that run on top of Hadoop and provide useful abstractions: Pig, Hive, Mahout, Zookeeper, HBase, and Oozie.

Pig Pig provides an SQL-like DSL on top of Hadoop. It also allows for great flexibility by way of writing Java functions that can be used in the select and where clauses; the framework automatically runs these functions in the Mappers or Reducers. Let's see some examples.

Pig C = LOAD 'course' USING PigStorage() AS (name:chararray, duration:int); Ns = FOREACH C GENERATE name;
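
And a sketch of the kind of Java function mentioned above, which Pig can invoke from a query (the class name is hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// upper-cases its single chararray argument
public class Upper extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

After registering the containing jar with REGISTER, it could be called as, e.g., Ns = FOREACH C GENERATE Upper(name);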

Hive Hive provides data-warehouse facilities on top of Hadoop and can be seen as an alternative to Pig. Its query language is very much like SQL. Some consider Pig appropriate for developers and Hive for analysts.

HBase HBase is a NoSQL storage on top of HDFS. It doesn't use Map-Reduce and is tuned for low latency. It belongs to the column-family category.

Zookeeper Zookeeper is a distributed coordination service. It allows for implementing distributed transactions, configuration management, leader election, etc. It is very useful in the Hadoop ecosystem.

Mahout A machine learning library on top of Hadoop. It implements most of the common machine-learning algorithms. Machine learning computations are usually well suited for Hadoop, as we usually have big data and some parts of the algorithms can easily be parallelized.