MapReduce and Hadoop. Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata. November 10, 2014


Let's keep the intro short
- Modern data mining: process immense amounts of data quickly
- Exploit parallelism
- Traditional parallelism: bring data to compute
- MapReduce: bring compute to data
Pictures courtesy: Glenn K. Lockwood, glennklockwood.com

The MapReduce paradigm
Split → Map → Shuffle and sort → Reduce → Final
- Original input: may already be split in the filesystem
- Split produces input chunks
- Map turns chunks into <Key,Value> pairs
- Shuffle and sort produces <Key,Value> pairs grouped by keys
- Reduce produces output chunks; the final output may not need to be combined
- The user needs to write only the map() and the reduce()

An example: word frequency counting
Problem: Given a collection of documents, count the number of times each word occurs in the collection.
- Split: the original input (a collection of documents) becomes input chunks (subcollections of documents)
- Map: for each word w, emit pairs (w,1)
- Shuffle and sort: the pairs (w,1) for the same word are grouped together
- Reduce: count the number n of pairs for each w, make it (w,n)
- Final output: (w,n) for each w

An example: word frequency counting (continued)
Input: apple orange peach orange plum orange apple guava cherry fig peach fig peach
- Map emits: (apple,1) (orange,1) (peach,1) (orange,1) (plum,1) (orange,1) (apple,1) (guava,1) (cherry,1) (fig,1) (peach,1) (fig,1) (peach,1)
- Shuffle and sort groups by key: (apple,1) (apple,1) | (orange,1) (orange,1) (orange,1) | (guava,1) | (plum,1) | (cherry,1) | (fig,1) (fig,1) | (peach,1) (peach,1) (peach,1)
- Reduce outputs: (apple,2) (orange,3) (guava,1) (plum,1) (cherry,1) (fig,2) (peach,3)
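In plain Python (a minimal in-memory sketch, not Hadoop API code), the split/map/shuffle/reduce steps above can be simulated as follows; the two input chunks are a hypothetical split of the fruit input:

```python
from collections import defaultdict

# Input chunks: a hypothetical split of the slide's fruit input
chunks = [
    ["apple", "orange", "peach", "orange", "plum", "orange"],
    ["apple", "guava", "cherry", "fig", "peach", "fig", "peach"],
]

# Map: for each word w, emit the pair (w, 1)
mapped = [(w, 1) for chunk in chunks for w in chunk]

# Shuffle and sort: group the (w, 1) pairs by key w
groups = defaultdict(list)
for w, one in mapped:
    groups[w].append(one)

# Reduce: count the number n of pairs for each w, producing (w, n)
counts = {w: sum(ones) for w, ones in groups.items()}
```

Here `counts` ends up as the final output, e.g. `(orange, 3)` and `(peach, 3)`.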

HADOOP
Apache Hadoop: an open source MapReduce framework

Hadoop: two main components
- Hadoop Distributed File System (HDFS): to store data
- MapReduce engine: to process data
Master-slave architecture using commodity servers:
- HDFS: the master is the Namenode, the slaves are the Datanodes
- MapReduce: the master is the JobTracker, the slaves are the TaskTrackers

HDFS: Blocks
A big file is split into blocks (Block 1 ... Block 6) and distributed across datanodes, e.g.:
- Datanode 1: Block 1, Block 2, Block 3
- Datanode 2: Block 1, Block 3, Block 4
- Datanode 3: Block 2, Block 6, Block 5
- Datanode 4: Block 4, Block 6, Block 5
Properties:
- Runs on top of an existing filesystem
- Blocks are 64 MB (128 MB recommended)
- A single file can be larger than any single disk
- POSIX-based permissions
- Fault tolerant
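As a back-of-envelope sketch of the block split described above (the 400 MB file size is a made-up example; 64 MB is the classic default block size):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB default block size, in bytes
file_size = 400 * 1024 * 1024      # a hypothetical 400 MB file

# The file is cut into fixed-size blocks; the last block may be partial
n_blocks = math.ceil(file_size / BLOCK_SIZE)
last_block_size = file_size - (n_blocks - 1) * BLOCK_SIZE
```

A 400 MB file thus occupies 7 blocks, the last holding only 16 MB; unlike many filesystems, HDFS does not waste the unused part of the last block.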

HDFS: Namenode and Datanode
Namenode
- Only one per Hadoop cluster
- Manages the filesystem namespace: the filesystem tree and an edit log
- For each block block_i, records the datanode(s) in which block_i is saved
- Records all the blocks residing in each datanode
Secondary Namenode
- Backup namenode
Datanodes
- Many per Hadoop cluster
- Control block operations: physically put the blocks in the nodes
- Do the physical replication

HDFS: an example

MapReduce: JobTracker and TaskTracker
1. JobClient submits the job to the JobTracker; the binary is copied into HDFS
2. JobTracker talks to the Namenode
3. JobTracker creates an execution plan
4. JobTracker submits work to the TaskTrackers
5. TaskTrackers report progress via heartbeat
6. JobTracker updates the status

Map, Shuffle and Reduce: internal steps
1. Splits the data up to send it to the mapper
2. Transforms the splits into key/value pairs
3. Key/value pairs with the same key are sent to the same reducer
4. Aggregates the key/value pairs based on user-defined code
5. Determines how the results are saved

Fault tolerance
- If the master fails: MapReduce would fail; the entire job has to be restarted
- If a map worker node fails:
  - The master detects it (a periodic ping would time out)
  - All the map tasks for this node have to be restarted; even if the map tasks were done, their output was at that node
- If a reduce worker fails:
  - The master sets the status of its currently executing reduce tasks to idle
  - These tasks are rescheduled on another reduce worker

USING MAPREDUCE
Some algorithms using MapReduce

Matrix-vector multiplication
Multiply M = (m_ij), an n×n matrix, and v = (v_j), an n-vector:
Mv = (x_i), where x_i = Σ_{j=1..n} m_ij v_j
If n = 1000, no need of MapReduce!
Case 1: n large, M does not fit into main memory, but v does
- Since v fits into main memory, v is available to every map task
- Map: for each matrix element m_ij, emit the key-value pair (i, m_ij v_j)
- Shuffle and sort: group all the m_ij v_j values together for the same i
- Reduce: sum m_ij v_j over all j for the same i
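A toy in-memory sketch of Case 1, with a made-up sparse matrix M (stored as entries (i, j) → m_ij) and a vector v small enough to hand to every map task:

```python
from collections import defaultdict

M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 2): 4.0}  # sparse entries
v = [10.0, 20.0, 30.0]                                     # fits in memory

# Map: for each matrix element m_ij, emit the key-value pair (i, m_ij * v_j)
mapped = [(i, m_ij * v[j]) for (i, j), m_ij in M.items()]

# Shuffle and sort: group all products with the same row index i
groups = defaultdict(list)
for i, prod in mapped:
    groups[i].append(prod)

# Reduce: sum the products for each i, giving the entries x_i of Mv
x = {i: sum(prods) for i, prods in groups.items()}
```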

Matrix-vector multiplication (continued)
Case 2: n very large, even v does not fit into main memory
- For every map, many accesses to disk (for the parts of v) would be required!
- Solution: how much of v will fit?
  - Partition v, and the rows of M, so that each partition of v fits into memory
  - Take the dot product of one partition of v with the corresponding partition of M
  - Map and reduce the same as before

Relational algebra
A relation R(A_1, A_2, ..., A_n) is a relation with attributes A_i
- Schema: the set of attributes
- Selection on condition C: apply C on each tuple in R, output only those which satisfy C
- Projection on a subset S of attributes: output the components for the attributes in S
- Union, intersection, join
Example relation:
Attr1  Attr2  Attr3  Attr4
xyz    abc    1      true
abc    xyz    1      true
xyz    def    1      false
bcd    def    2      true
Example: links between URLs
URL1   URL2
url1   url2
url2   url1
url3   url5
url1   url3

Selection using MapReduce
Trivial!
- Map: for each tuple t in R, test if t satisfies C. If so, produce the key-value pair (t, t).
- Reduce: the identity function. It simply passes each key-value pair to the output.
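A minimal Python sketch on the URL-links relation above; the condition C (source is url1) is a made-up example:

```python
# R is the links-between-URLs relation; C is a hypothetical condition
R = [("url1", "url2"), ("url2", "url1"), ("url3", "url5"), ("url1", "url3")]

def C(t):
    return t[0] == "url1"

# Map: for each tuple t in R, if t satisfies C, produce the pair (t, t)
mapped = [(t, t) for t in R if C(t)]

# Reduce: the identity function, just pass the tuples through
selected = [t for _, t in mapped]
```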

Union using MapReduce
Union of two relations R and S (suppose R and S have the same schema)
- Map tasks are generated from chunks of both R and S
- Map: for each tuple t, produce the key-value pair (t, t)
- Reduce: only need to remove duplicates
  - For each key t, there would be either one or two values
  - Output (t, t) in either case
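A sketch of the union with duplicate elimination in the reducer; R and S share a schema and their overlapping tuple appears once in the output (data made up):

```python
from collections import defaultdict

R = [("url1", "url2"), ("url2", "url1")]
S = [("url1", "url2"), ("url3", "url5")]  # ("url1","url2") is a duplicate

# Map: for each tuple t from either relation, produce (t, t)
mapped = [(t, t) for t in R + S]

# Shuffle: each key t collects either one or two values
groups = defaultdict(list)
for t, value in mapped:
    groups[t].append(value)

# Reduce: output t once per key, whether it had one value or two
union = sorted(groups)
```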

Natural join using MapReduce
Join R(A,B) with S(B,C) on attribute B
- Map:
  - For each tuple t = (a,b) of R, emit the key-value pair (b, (R,a))
  - For each tuple t = (b,c) of S, emit the key-value pair (b, (S,c))
- Reduce:
  - Each key b would be associated with a list of values of the form (R,a) or (S,c)
  - Construct all pairs consisting of one value with first component R and one with first component S, say (R,a) and (S,c)
  - The output from this key and value list is a sequence of key-value pairs; the key is irrelevant, and each value is one of the triples (a, b, c) such that (R,a) and (S,c) are on the input list of values
Example:
R:  A  B        S:  B  C
    x  a            a  1
    y  b            c  3
    z  c            d  4
    w  d            g  7
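In plain Python, the join above (keyed on B, using the slide's sample relations) can be sketched as:

```python
from collections import defaultdict

R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]  # tuples (a, b)
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]          # tuples (b, c)

# Map: tag each tuple with its relation of origin and key it by B
mapped = [(b, ("R", a)) for (a, b) in R]
mapped += [(b, ("S", c)) for (b, c) in S]

groups = defaultdict(list)
for b, value in mapped:
    groups[b].append(value)

# Reduce: for each key b, pair every (R, a) with every (S, c) into (a, b, c)
joined = []
for b, values in groups.items():
    r_vals = [a for tag, a in values if tag == "R"]
    s_vals = [c for tag, c in values if tag == "S"]
    joined.extend((a, b, c) for a in r_vals for c in s_vals)
```

Only the B-values a, c and d occur in both relations, so three triples survive.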

Grouping and aggregation using MapReduce
Group and aggregate on a relation R(A,B) using aggregation function γ(B), grouping by A
- Map: for each tuple t = (a,b) of R, emit the key-value pair (a,b)
- Reduce: for each group {(a,b_1), ..., (a,b_m)} represented by a key a, apply γ to obtain b_a = b_1 + ... + b_m; output (a, b_a)
Example:
R:  A  B
    x  2
    y  1
    z  4
    z  1
    x  5
select A, sum(B) from R group by A;
A  SUM(B)
x  7
y  1
z  5
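The SQL query above, simulated in Python with γ = sum on the slide's relation R(A, B):

```python
from collections import defaultdict

R = [("x", 2), ("y", 1), ("z", 4), ("z", 1), ("x", 5)]

# Map: for each tuple (a, b), emit the key-value pair (a, b)
mapped = list(R)

# Shuffle: group the b-values by key a
groups = defaultdict(list)
for a, b in mapped:
    groups[a].append(b)

# Reduce: apply γ = sum to each group, output (a, b_a)
aggregated = {a: sum(bs) for a, bs in groups.items()}
```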

Matrix multiplication using MapReduce
A (m×n) times B (n×l) = C (m×l), where c_ik = Σ_{j=1..n} a_ij b_jk
Think of a matrix as a relation with three attributes:
- For example, matrix A is represented by the relation A(I, J, V)
- For every non-zero entry (i, j, a_ij), the row number is the value of I, the column number is the value of J, and the entry is the value of V
- An added advantage: usually most large matrices are sparse, so the relation has fewer entries
The product is ~ a natural join followed by a grouping with aggregation

Matrix multiplication using MapReduce (step 1)
Natural join of A(I,J,V) and B(J,K,W) → tuples (i, j, k, a_ij, b_jk)
- Map:
  - For every (i, j, a_ij), emit the key-value pair (j, (A, i, a_ij))
  - For every (j, k, b_jk), emit the key-value pair (j, (B, k, b_jk))
- Reduce: for each key j, for each pair of values (A, i, a_ij) and (B, k, b_jk), produce the key-value pair ((i,k), a_ij b_jk)

Matrix multiplication using MapReduce (step 2)
The first MapReduce pass has produced key-value pairs ((i,k), a_ij b_jk)
Another MapReduce pass groups and aggregates:
- Map: identity, just emit the key-value pair ((i,k), a_ij b_jk)
- Reduce: for each key (i,k), produce the sum of all the values for that key: c_ik = Σ_{j=1..n} a_ij b_jk
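Both passes can be simulated in plain Python; A and B are made-up sparse matrices stored as relations (i, j) → a_ij and (j, k) → b_jk:

```python
from collections import defaultdict

A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}  # A = [[1, 2], [0, 3]]
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}  # B = [[4, 0], [5, 6]]

# Pass 1 map: key every entry by the shared index j, tagged by matrix
mapped = [(j, ("A", i, a)) for (i, j), a in A.items()]
mapped += [(j, ("B", k, b)) for (j, k), b in B.items()]
groups = defaultdict(list)
for j, value in mapped:
    groups[j].append(value)

# Pass 1 reduce: for each j, emit ((i, k), a_ij * b_jk) for all A/B pairs
pairs = []
for j, values in groups.items():
    a_vals = [(i, a) for tag, i, a in values if tag == "A"]
    b_vals = [(k, b) for tag, k, b in values if tag == "B"]
    pairs += [((i, k), a * b) for i, a in a_vals for k, b in b_vals]

# Pass 2: identity map, then the reducer sums the products for each (i, k)
sums = defaultdict(int)
for key, prod in pairs:
    sums[key] += prod
C = dict(sums)
```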

Matrix multiplication using MapReduce: Method 2
A method with one MapReduce step
- Map:
  - For every (i, j, a_ij), emit for all k = 1, ..., l the key-value pair ((i,k), (A, j, a_ij))
  - For every (j, k, b_jk), emit for all i = 1, ..., m the key-value pair ((i,k), (B, j, b_jk))
- Reduce: for each key (i,k):
  - sort the values (A, j, a_ij) and (B, j, b_jk) by j to group them by j
  - for each j, multiply a_ij and b_jk
  - sum the products for the key (i,k) to produce c_ik = Σ_{j=1..n} a_ij b_jk
Caveat: the list of values for a key may not fit in main memory, requiring an expensive external sort!
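A sketch of the one-pass method with the same made-up matrices as before; the dimensions m and l are assumed known to the mappers, and this toy version matches A and B values on j with a dictionary lookup rather than the external sort a real reducer might need:

```python
from collections import defaultdict

m, l = 2, 2
A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}  # A = [[1, 2], [0, 3]]
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}  # B = [[4, 0], [5, 6]]

# Map: replicate each A entry to all k, each B entry to all i,
# keyed by the output cell (i, k)
mapped = [((i, k), ("A", j, a)) for (i, j), a in A.items() for k in range(l)]
mapped += [((i, k), ("B", j, b)) for (j, k), b in B.items() for i in range(m)]
groups = defaultdict(list)
for cell, value in mapped:
    groups[cell].append(value)

# Reduce: for each cell (i, k), match A and B values on j, multiply, and sum
C = {}
for cell, values in groups.items():
    a_by_j = {j: a for tag, j, a in values if tag == "A"}
    b_by_j = {j: b for tag, j, b in values if tag == "B"}
    C[cell] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j)
```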

References and acknowledgements
- Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 2
- Slides by Dwaipayan Roy