CS 378 Big Data Programming

CS 378 Big Data Programming
Lecture 5: Summarization Patterns
CS 378 Fall 2017 Big Data Programming

Review
- Assignment 2
  - Questions?
- MRUnit
  - How do you test map() or reduce() calls that produce multiple outputs?
- Issues with calculating variance

Summarization
- Other summarizations of interest
  - Min, max, median
- Suppose we are interested in these metrics for paragraph length (Assignment 2 data)
- If paragraph lengths are normally distributed, then the median will be very near the mean
- If the distribution of paragraph lengths is skewed, then the mean and median can differ substantially
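To make the mean-versus-median point concrete, here is a minimal sketch using Python's standard `statistics` module. The two lists of paragraph lengths are made-up illustrative values, not Assignment 2 data.

```python
from statistics import mean, median

# Hypothetical paragraph lengths (illustrative values only).
symmetric = [48, 49, 50, 51, 52]   # evenly spread around 50
skewed = [10, 11, 12, 13, 500]     # one very long paragraph

# Symmetric data: mean and median agree.
print(mean(symmetric), median(symmetric))

# Skewed data: the single outlier drags the mean far above the median.
print(mean(skewed), median(skewed))
```

The outlier changes the mean by roughly a factor of ten while leaving the median almost untouched, which is why the median is the more robust summary for skewed length distributions.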

Summarization
- Min and max are straightforward
- For each paragraph, output two values
  - Min length (the length of the current paragraph)
  - Max length (the length of the current paragraph)
  - Key?
- Combiner will get a key and a list of values
  - Select the min and max from the list, output the values
  - Key?
- Reducer does the same
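The min/max flow above can be sketched in plain Python, without Hadoop. Two assumptions here are not from the slides: paragraph length is measured in words, and every pair is emitted under one constant key (`"length"`), which is one possible answer to the "Key?" question. The function names are hypothetical.

```python
def map_paragraph(paragraph):
    """Emit (key, (min, max)); both values start as this paragraph's length."""
    length = len(paragraph.split())  # assumption: length counted in words
    return ("length", (length, length))


def combine(values):
    """Fold a list of (min, max) pairs into a single (min, max) pair.

    The same function serves as combiner and reducer, because taking
    min and max is associative and commutative.
    """
    lows, highs = zip(*values)
    return (min(lows), max(highs))


# Two input splits, each handled by its own mapper and combiner.
split_1 = ["a b c", "a b c d e f g"]
split_2 = ["a", "a b"]

partial_1 = combine([map_paragraph(p)[1] for p in split_1])
partial_2 = combine([map_paragraph(p)[1] for p in split_2])

# The reducer sees only one pair per split instead of one pair per paragraph.
result = combine([partial_1, partial_2])
print(result)
```

Because the combiner output has the same shape as the mapper output, the reducer cannot tell whether a combiner ran, which is the property Hadoop requires of a combiner.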

Summarization
- Median
  - Get all the values, sort them, then find the middle
- Since our computation is distributed, no mapper sees all the values
- Should we send them all to one reducer?
  - Not utilizing map-reduce (computation not distributed)
  - Data sizes likely too large to keep in memory

Summarization
- Median
  - Keep the unique paragraph lengths and the frequency of each length
  - Map output: <paragraph length, 1>
  - Combiner gets a list of these pairs and updates the count for recurring lengths
  - Reducer does the same, then computes the median
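The frequency-count approach to the median can be sketched as follows. This is a plain-Python stand-in, not Hadoop code: `collections.Counter` plays the role of the length-to-frequency map, and the function names are hypothetical.

```python
from collections import Counter


def map_length(length):
    """Mapper output for one paragraph: <paragraph length, 1>."""
    return (length, 1)


def combine(pairs):
    """Merge (length, count) pairs, summing the counts of recurring lengths."""
    counts = Counter()
    for length, count in pairs:
        counts[length] += count
    return counts


def median_from_counts(counts):
    """Walk the lengths in sorted order until the middle rank(s) are reached."""
    total = sum(counts.values())
    lower_rank = (total - 1) // 2  # zero-based rank of the lower middle value
    upper_rank = total // 2        # zero-based rank of the upper middle value
    seen = 0
    lower = upper = None
    for length in sorted(counts):
        seen += counts[length]
        if lower is None and seen > lower_rank:
            lower = length
        if seen > upper_rank:
            upper = length
            break
    return (lower + upper) / 2


# Two splits; each combiner ships a compact count map instead of raw values.
partial_1 = combine(map_length(n) for n in [3, 5, 5])
partial_2 = combine(map_length(n) for n in [8, 13, 5])

merged = partial_1 + partial_2  # reducer: merge the partial count maps
print(median_from_counts(merged))
```

Memory use at the reducer grows with the number of distinct lengths rather than the number of paragraphs, which is what makes this workable when the raw values would not fit in memory.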

Summarization
- Median
  - Hadoop provides the SortedMapWritable class
    - Can associate a frequency count with a paragraph length
    - Keeps the lengths in sorted order
  - See the example in Chapter 2 of MapReduce Design Patterns
- How could we compute min, max, and median all in one pass over the data?

Counters
- The Hadoop map-reduce infrastructure provides counters
  - Accessed by group name
- Cannot have a large number of counters
  - For example, can't use counters to solve WordCount
  - A few tens of counters can be used
- Counters are stored in memory on the JobTracker
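As a rough illustration of grouped counters, the sketch below keeps a small in-memory table keyed by group name and counter name. The `Counters` class, the group name `"ParagraphLength"`, and the bucket thresholds are all hypothetical stand-ins, not Hadoop's API; in Hadoop's Java API the corresponding call is `context.getCounter(groupName, counterName).increment(1)`.

```python
from collections import defaultdict


class Counters:
    """Hypothetical stand-in for Hadoop counters: group name -> counter name -> value."""

    def __init__(self):
        self._groups = defaultdict(dict)

    def increment(self, group, name, amount=1):
        self._groups[group][name] = self._groups[group].get(name, 0) + amount

    def value(self, group, name):
        return self._groups.get(group, {}).get(name, 0)

    def size(self):
        """Total number of distinct counters held in memory."""
        return sum(len(names) for names in self._groups.values())


def bucket(length):
    """Map a paragraph length to one of three buckets (thresholds are arbitrary)."""
    if length < 50:
        return "SHORT"
    if length < 200:
        return "MEDIUM"
    return "LONG"


counters = Counters()
for length in [12, 75, 300, 40]:
    counters.increment("ParagraphLength", bucket(length))

print(counters.value("ParagraphLength", "SHORT"), counters.size())
```

The set of counter names here is fixed and tiny no matter how much input is processed. WordCount would need one counter per distinct word, and since every counter is tracked centrally in memory, that does not scale.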

Counters
Figure 2-6, MapReduce Design Patterns

How Hadoop MapReduce Works
- We've seen some terms like:
  - Job, JobTracker, TaskTracker (MapReduce 1)
  - Job, ResourceManager, NodeManager (YARN, MapReduce 2)
- Let's look at what they do
- Details from Chapter 7, Hadoop: The Definitive Guide, 4th Edition

How Hadoop MapReduce Works
Figure 7-1, Hadoop: The Definitive Guide, 4th Edition

Job Submission
- Job submission checks
  - Do the input files exist? Can splits be computed?
  - Does the output directory exist? If yes, the job fails: Hadoop expects to create this directory
- Copy resources to HDFS
  - JAR files
  - Configuration file
  - Computed file splits
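The submission-time checks can be sketched like this. It is a simplified stand-in that uses the local file system in place of HDFS, and `validate_job` is a hypothetical name rather than a Hadoop function.

```python
import os
import tempfile


def validate_job(input_paths, output_dir):
    """Mimic the checks made before a job is accepted.

    Inputs must exist; the output directory must NOT exist,
    because the framework expects to create it itself.
    """
    for path in input_paths:
        if not os.path.exists(path):
            raise FileNotFoundError(f"input path does not exist: {path}")
    if os.path.exists(output_dir):
        raise FileExistsError(f"output directory already exists: {output_dir}")


with tempfile.TemporaryDirectory() as workdir:
    input_file = os.path.join(workdir, "input.txt")
    with open(input_file, "w") as handle:
        handle.write("some paragraph text\n")

    # A fresh output path passes the checks.
    validate_job([input_file], os.path.join(workdir, "out"))

    # Reusing an existing directory as the output is rejected.
    try:
        validate_job([input_file], workdir)
    except FileExistsError as error:
        print("rejected:", type(error).__name__)
```

Failing early on an existing output directory prevents a long-running job from silently overwriting the results of a previous run.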

Resource Manager
- Creates tasks (work to be done)
  - A map task for each input split
  - The requested number of reduce tasks
  - Job setup and job cleanup tasks
- Map tasks are assigned to task trackers that are close to the input split location
  - Data-local preferred
  - Rack-local next
- A reduce task can go anywhere. Why?
- The scheduling algorithm orders the tasks
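The locality preference for map tasks can be sketched as a three-tier lookup. This is a deliberately simplified illustration, not the actual Hadoop or YARN scheduler; `pick_node`, the node names, and the rack table are hypothetical.

```python
def pick_node(split_hosts, free_nodes, rack_of):
    """Choose a node for a map task, preferring nodes close to the data.

    split_hosts: set of nodes holding a replica of the input split
    free_nodes:  list of nodes with capacity to run a task
    rack_of:     dict mapping node name -> rack name
    Returns (node, locality), or (None, None) if no node is free.
    """
    # 1. Data-local: the node already stores a replica of the split.
    for node in free_nodes:
        if node in split_hosts:
            return node, "data-local"

    # 2. Rack-local: the node shares a rack with some replica.
    split_racks = {rack_of[host] for host in split_hosts}
    for node in free_nodes:
        if rack_of[node] in split_racks:
            return node, "rack-local"

    # 3. Otherwise any free node; the split crosses racks over the network.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(pick_node({"n1"}, ["n3", "n1"], rack_of))
print(pick_node({"n1"}, ["n3", "n2"], rack_of))
print(pick_node({"n1"}, ["n3"], rack_of))
```

A reduce task has no equivalent preference because its input comes from the output of every map task, spread across the cluster, so no single node is close to all of it.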

Task Execution
- Configured for several map and reduce tasks
- Each task has status info (state, progress, counters)
- Periodically sends info to MRAppMaster
  - Running, successful completion, failed
  - Progress (% complete)
- For a new task
  - Copy files to the local file system (JAR, configuration)
  - Launch a new JVM (YarnChild drives execution)
  - Load the mapper/reducer class and run the task

Task Progress
- Operations that count as progress:
  - Reading an input pair (mapper or reducer)
  - Writing an output pair (mapper or reducer)
  - Setting the status description
  - Incrementing a counter
  - Reporting progress

Task Progress
- Mapper: straightforward
  - How much of the input has been processed
- Reducer: more complicated
  - Sort, shuffle, and reduce are all considered
  - Progress is an estimate of how much of the total work has been done
  - One-third is allocated to each phase
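The progress estimates described above amount to simple arithmetic, sketched here in Python. The function names are hypothetical, and the inputs are assumed to be already-measured quantities (bytes read, and the completed fraction of each reducer phase).

```python
def mapper_progress(bytes_processed, split_length):
    """Mapper progress: the fraction of the input split consumed so far."""
    if split_length == 0:
        return 1.0
    return bytes_processed / split_length


def reducer_progress(shuffle_fraction, sort_fraction, reduce_fraction):
    """Reducer progress: each of the three phases contributes one-third."""
    for fraction in (shuffle_fraction, sort_fraction, reduce_fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("phase fractions must lie between 0 and 1")
    return (shuffle_fraction + sort_fraction + reduce_fraction) / 3


# A mapper that has read 32 MB of a 128 MB split.
print(mapper_progress(32, 128))

# A reducer that finished shuffle and sort and is halfway through reduce.
print(round(reducer_progress(1.0, 1.0, 0.5), 4))
```

One consequence of the equal weighting is that a reducer reports about 67% progress before it has called the user's reduce function on a single key.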

Shuffle
Figure 7-4, Hadoop: The Definitive Guide, 4th Edition

MapReduce in Hadoop
Figure 2.4, Hadoop: The Definitive Guide