Introduction to Hadoop and MapReduce


Antonino Virgillito

Large-scale Computation
- Traditional solutions for computing over large quantities of data relied mainly on the processor: complex processing was performed on data moved into memory
- Scaling was possible only by adding power (more memory, a faster processor)
- This works for relatively small-to-medium amounts of data, but cannot keep up with larger datasets
- How to cope with today's indefinitely growing production of data, at terabytes per day?

Distributed Computing
- Multiple machines connected to each other and cooperating on a common job: a «cluster»
- Challenges:
  - Complexity of coordination: all processes and data have to be kept synchronized with respect to the global system state
  - Failures
  - Data distribution

Hadoop
- Open source platform for distributed processing of large datasets
- Based on a project developed at Google
- Functions:
  - Distribution of data and processing across machines
  - Management of the cluster
- Simplified programming model: easy to write distributed algorithms

Hadoop scalability
- Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model
- Huge clusters can be built using (cheap) commodity hardware: a 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines
- The cluster can easily scale up with little or no modification to the programs

Hadoop Concepts
- Applications are written in common high-level languages
- Inter-node communication is kept to a minimum
- Data is distributed in advance: bring the computation close to the data
- Data is replicated for availability and reliability
- Scalability and fault-tolerance

Scalability and Fault-tolerance
- Scalability principle:
  - Capacity can be increased by adding nodes to the cluster
  - Increasing load does not cause failures; in the worst case it causes only a graceful degradation of performance
- Fault-tolerance:
  - Failures of nodes are considered inevitable and are coped with in the architecture of the platform
  - The system continues to function when a node fails: tasks are re-scheduled
  - Data replication guarantees no data is lost
  - Dynamic reconfiguration of the cluster when nodes join and leave

Benefits of Hadoop
- Previously impossible or impractical analysis made possible
- Lower cost
- Less time
- Greater flexibility
- Near-linear scalability
- Ask Bigger Questions

Hadoop Components
(Figure: ecosystem tools such as Hive, Pig, Sqoop, HBase, Flume, Mahout and Oozie built on top of the Core Components)

Hadoop Core Components
- HDFS: Hadoop Distributed File System
  - Abstraction of a file system over a cluster
  - Stores large amounts of data by transparently spreading it over different machines
- MapReduce
  - Simple programming model that enables parallel execution of data processing programs
  - Executes the work on the data, near the data
- In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work

Structure of a Hadoop Cluster
- Hadoop cluster: group of machines working together to store and process data
- Any number of worker nodes, which run both HDFS and MapReduce components
- Two master nodes:
  - Name Node: manages HDFS
  - Job Tracker: manages MapReduce

Hadoop Principle
- Hadoop is basically a middleware platform that manages a cluster of machines
- The core component is a distributed file system (HDFS)
- Files in HDFS are split into blocks that are scattered over the cluster
- The cluster can grow indefinitely simply by adding new nodes
(Figure: one big data set spread as blocks over the HDFS cluster)

The MapReduce Paradigm
- Parallel processing paradigm: the programmer is unaware of parallelism
- Programs are structured into a two-phase execution:
  - Map: data elements are classified into categories
  - Reduce: an algorithm is applied to all the elements of the same category
(Figure: elements classified into categories, with counts x 4, x 5, x 3)

MapReduce and Hadoop
- In Hadoop, MapReduce is logically placed on top of HDFS
(Figure: the MapReduce layer stacked on top of the HDFS layer)

MapReduce and Hadoop
- MapReduce works on (big) files loaded on HDFS
- Each node in the cluster executes the MapReduce program in parallel, applying the map and reduce phases to the blocks it stores
- Output is written back to HDFS
- Scalability principle: perform the computation where the data is

HDFS Concepts
- Logical distributed file system that sits on top of the native file system of the operating system
- Written in Java, usable by several languages/tools
- All common commands for handling operations on a file system are defined: ls, chmod, ...
- Commands for moving files from/to the local file system are present

Accessing and Storing Files
- Data files are split into blocks and transparently distributed when loaded
- Each block is replicated on multiple nodes (default: 3)
- The NameNode stores the metadata
- Write-once file policy: no random writes/updates on files are allowed
- Optimized for streaming reads of large files; random reads are discouraged/impossible
- Performs best with large files (>100 MB)

Accessing HDFS
- Command line: hadoop fs -<command>
- Java API (see the sketch below)
- Hue: web-based, interactive UI; can browse, upload, download and view files
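As a minimal sketch of the Java API, the program below reads a file from HDFS and inspects its replication factor. It assumes the cluster configuration (core-site.xml, hdfs-site.xml) is on the classpath; the path /user/demo/input.txt is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS points at the NameNode
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/input.txt"); // hypothetical path

        // File metadata (including the replication factor) comes from the NameNode
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Replication factor: " + status.getReplication());

        // Streaming read of the file contents, block by block
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```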

MapReduce
- Programming model for parallel execution
- Programs are realized just by implementing two functions, Map and Reduce
- Execution is submitted to the Hadoop cluster and the functions are processed in parallel on the data nodes

MapReduce Concepts
- Automatic parallelization and distribution
- Fault-tolerance
- A clean abstraction for programmers
- MapReduce programs are usually written in Java, but can be written in any language using Hadoop Streaming
- All of Hadoop is written in Java
- MapReduce abstracts all the housekeeping away from the developer, who can simply concentrate on writing the Map and Reduce functions

Programming Model
- A MapReduce program transforms an input list into an output list
- Processing is organized into two steps:
  map (in_key, in_value) -> (out_key, intermediate_value) list
  reduce (out_key, intermediate_value list) -> out_value list
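In Hadoop's Java API these signatures correspond to the generic type parameters of the Mapper and Reducer base classes. A minimal skeleton, with illustrative placeholder class names and types, might look like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<in_key, in_value, out_key, intermediate_value>
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable inKey, Text inValue, Context ctx)
            throws IOException, InterruptedException {
        // emit zero or more (out_key, intermediate_value) pairs
        ctx.write(new Text("category"), new IntWritable(1));
    }
}

// Reducer<out_key, intermediate_value, out_key, out_value>
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text outKey, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        // aggregate all intermediate values for outKey into final value(s)
    }
}
```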

map
- The data source must be structured in records (lines out of files, rows of a database, etc.)
- Each record has an associated key
- Records are fed into the map function as key/value pairs, e.g., (filename, line)
- map() produces one or more intermediate values along with an output key derived from the input
- In other words, map identifies input values with the same characteristics, which are represented by the output key (not necessarily related to the input key)

reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() aggregates the intermediate values into one or more final values for that same output key
- In practice, usually only one final value per key

Example: word count
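The classic word count program, closely following the WordCount example from the Hadoop MapReduce tutorial: the mapper emits (word, 1) for every token in its input, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // sum is associative, so the reducer doubles as combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job would be packaged in a jar and launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, with both directories on HDFS.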

Beyond Word Count
- Word count is challenging over massive amounts of data:
  - Using a single compute node would be too time-consuming
  - The number of unique words can easily exceed available memory, so counts would need to be stored on disk
- The statistics involved are simple aggregate functions, distributive in nature: e.g., max, min, sum, count (see the sketch below)
- MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
- Many common tasks are very similar to word count, e.g., log file analysis
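As an illustration of a distributive aggregate, a hypothetical reducer that computes the maximum value observed per key could look like the sketch below; because max is distributive, the same class could also be registered as a combiner.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer for a distributive aggregate: the maximum value per key.
class MaxReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long max = Long.MIN_VALUE;
        for (LongWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new LongWritable(max));
    }
}
```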

MapReduce Applications
- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment