MapReduce and Hadoop
Debapriyo Majumdar
Indian Statistical Institute Kolkata
debapriyo@isical.ac.in

Let's keep the intro short
Modern data mining must process immense amounts of data quickly, which means exploiting parallelism.
- Traditional parallelism: bring the data to the compute.
- MapReduce: bring the compute to the data.
Pictures courtesy: Glenn K. Lockwood, glennklockwood.com

The MapReduce paradigm
Split → Map → Shuffle and sort → Reduce → Final
- Original input: may already be split in the filesystem
- Split: produces the input chunks
- Map: produces <Key,Value> pairs
- Shuffle and sort: <Key,Value> pairs grouped by keys
- Reduce: produces the output chunks
- Final output: may not need a combine step
The user only needs to write map() and reduce().

An example: word frequency counqng Split Map Shuffle and sort Reduce Final collec.on of documnts original Input subcollec.ons of documnts Input chunks Problem: Given a collecqon of documents, count the number of Qmes each word occurs in the collecqon map: for each word w, output pairs (w,1) <Key,Value> pairs map: for each word w, emit pairs (w,1) the pairs (w,1) for the same words are grouped together <Key,Value> pairs grouped by keys reduce: count the number (n) of pairs for each w, make it (w,n) Output chunks output: (w,n) for each w reduce: count the number (n) of pairs for each w, make it (w,n) Final output 4

An example: word frequency counqng Split Map Shuffle and sort Reduce Final apple orange peach orange plum orange apple guava cherry fig peach fig peach original Input apple orange peach orange plum orange apple guava cherry fig peach fig peach Input chunks Problem: Given a collecqon of documents, count the number of Qmes each word occurs in the collecqon (apple,1) (orange,1) (peach,1) (orange,1) (plum,1) (orange,1) (apple,1) (guava,1) (cherry,1) (fig,1) (peach,1) (fig,1) (peach,1) <Key,Value> pairs map: for each word w, output pairs (w,1) (apple,1) (apple,1) (orange,1) (orange,1) (orange,1) (guava,1) (plum,1) (plum,1) (cherry,1) (cherry,1) (fig,1) (fig,1) (peach,1) (peach,1) (peach,1) <Key,Value> pairs grouped by keys (apple,2) (orange,3) (guava,1) (plum,2) (cherry,2) (fig,2) (peach,3) Output chunks reduce: count the number (n) of pairs for each w, make it (w,n) (apple,2) (orange,3) (guava,1) (plum,2) (cherry,2) (fig,2) (peach,3) Final output 5

Apache Hadoop
An open source MapReduce framework.

Hadoop
Two main components:
- Hadoop Distributed File System (HDFS): stores data
- MapReduce engine: processes data
Master-slave architecture using commodity servers:
- HDFS: master is the Namenode, slaves are the Datanodes
- MapReduce: master is the JobTracker, slaves are the TaskTrackers

HDFS: Blocks
A big file is split into blocks (blocks 1-6 in the slide's figure), and each block is replicated across several datanodes (datanodes 1-4).
- Runs on top of an existing filesystem
- Blocks are 64 MB (128 MB recommended)
- A single file can be larger than any single disk
- POSIX-based permissions
- Fault tolerant

HDFS: Namenode and Datanodes
Namenode (only one per Hadoop cluster) manages the filesystem namespace:
- The filesystem tree
- An edit log
- For each block i, the datanode(s) on which block i is stored
- All the blocks residing on each datanode
Secondary namenode: a backup namenode.
Datanodes (many per Hadoop cluster) handle block operations:
- Physically store the blocks on the nodes
- Perform the physical replication

HDFS: an example (figure)

MapReduce: JobTracker and TaskTracker
1. The JobClient submits the job to the JobTracker; the binary is copied into HDFS
2. The JobTracker talks to the Namenode
3. The JobTracker creates an execution plan
4. The JobTracker submits work to the TaskTrackers
5. The TaskTrackers report progress via heartbeats
6. The JobTracker updates the status

Map, Shuffle and Reduce: internal steps
1. Split the data and send the splits to the mappers
2. Transform the splits into key/value pairs
3. Send key/value pairs with the same key to the same reducer
4. Aggregate the key/value pairs using user-defined code
5. Determine how the results are saved

Fault Tolerance
- If the master fails: MapReduce fails and the entire job has to be restarted.
- If a map worker node fails: the master detects it (a periodic ping times out), and all the map tasks of that node have to be restarted, even the completed ones, because their output was stored on the failed node.
- If a reduce worker fails: the master sets the status of its currently executing reduce tasks to idle and reschedules them on another reduce worker.

Some algorithms using MapReduce

Matrix-vector multiplication
Multiply M = (m_ij), an n×n matrix, with v = (v_j), an n-vector: Mv = (x_i), where $x_i = \sum_{j=1}^{n} m_{ij} v_j$. (If n = 1000, there is no need for MapReduce!)
Case 1: large n; M does not fit into main memory, but v does.
- Since v fits into main memory, v is available to every map task.
- Map: for each matrix element m_ij, emit the key-value pair (i, m_ij v_j).
- Shuffle and sort: group all the m_ij v_j values together for the same i.
- Reduce: sum m_ij v_j over all j for the same i.
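
A sketch of Case 1 in Python (illustrative names; it assumes M arrives as (i, j, m_ij) triples and that all of v is shipped to every map task):

```python
def map_mv(m_entries, v):
    # v, the full vector, is available in memory at every map task
    for (i, j, m_ij) in m_entries:
        yield (i, m_ij * v[j])      # one partial product per matrix entry

def reduce_mv(i, products):
    # x_i = sum over j of m_ij * v_j
    yield (i, sum(products))
```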

Matrix-vector multiplication (continued)
Case 2: very large n; even v does not fit into main memory (a partition of v will fit, but the whole vector no longer does).
- If v is stored as-is, every map task requires many disk accesses for parts of v!
- Solution: how much of v will fit in memory? Partition v, and the corresponding vertical stripes of M, so that each partition of v fits into memory.
- Take the dot products of one partition of v with the corresponding stripe of M; map and reduce are the same as before.
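
A sketch of the partitioned variant, under the assumption (mine, for illustration) that M is stored in vertical stripes aligned with the partitions of v:

```python
def map_mv_striped(stripe_entries, v_chunk, offset):
    # v_chunk holds v[offset : offset + len(v_chunk)]; this map task only
    # receives matrix entries whose column index j falls in that range
    for (i, j, m_ij) in stripe_entries:
        yield (i, m_ij * v_chunk[j - offset])

# The reduce step is unchanged: it sums the partial products for each
# row i, which now arrive from several stripes.
```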

Relational Algebra
A relation R(A_1, A_2, ..., A_n) is a relation with attributes A_i; its schema is the set of attributes.
- Selection on a condition C: apply C to each tuple in R; output only those that satisfy C.
- Projection on a subset S of the attributes: output only the components for the attributes in S.
- Other operations: union, intersection, join.
Example relation:
  Attr1  Attr2  Attr3  Attr4
  xyz    abc    1      true
  abc    xyz    1      true
  xyz    def    1      false
  bcd    def    2      true
Running example, links between URLs:
  URL1   URL2
  url1   url2
  url2   url1
  url3   url5
  url1   url3

Selection using MapReduce
Selection is trivial:
- Map: for each tuple t in R, test whether t satisfies C. If so, produce the key-value pair (t, t).
- Reduce: the identity function; it simply passes each key-value pair to the output.
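
A sketch (the names and the inline driver are illustrative):

```python
def map_select(t, C):
    # C is the selection condition, a boolean function on tuples
    if C(t):
        yield (t, t)

def reduce_select(t, values):
    yield (t, t)            # identity: pass the pair through

# e.g. select the Links tuples whose first URL is 'url1'
links = [("url1", "url2"), ("url2", "url1"), ("url1", "url3")]
for t in links:
    for kv in map_select(t, lambda x: x[0] == "url1"):
        print(kv)           # the (t, t) pairs for matching tuples
```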

Union using MapReduce
Union of two relations R and S, assuming R and S have the same schema:
- Map tasks are generated from chunks of both R and S.
- Map: for each tuple t, produce the key-value pair (t, t).
- Reduce: only needs to remove duplicates. For each key t there will be either one or two values; output (t, t) in either case.
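
A corresponding sketch, assuming tuples of R and S are hashable values:

```python
def map_union(t):
    yield (t, t)            # the same map runs over chunks of R and of S

def reduce_union(t, values):
    # values holds one or two copies of t; output t once either way,
    # which removes the duplicates
    yield (t, t)
```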

Natural join using MapReduce
Join R(A,B) with S(B,C) on attribute B.
- Map: for each tuple t = (a,b) of R, emit the key-value pair (b, (R,a)); for each tuple t = (b,c) of S, emit the key-value pair (b, (S,c)).
- Reduce: each key b is associated with a list of values of the form (R,a) or (S,c). Construct all pairs consisting of one value with first component R and one with first component S, say (R,a) and (S,c). The output for this key and value list is a sequence of key-value pairs: the key is irrelevant, and each value is a triple (a, b, c) such that (R,a) and (S,c) are on the input list of values.
Example: R = {(x,a), (y,b), (z,c), (w,d)}; S = {(a,1), (c,3), (d,4), (g,7)}.
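
A sketch of the join, carrying the relation name as a tag inside the values:

```python
def map_join(relation, t):
    if relation == "R":       # t = (a, b): the join key is b
        a, b = t
        yield (b, ("R", a))
    else:                     # t = (b, c) from S: the join key is b
        b, c = t
        yield (b, ("S", c))

def reduce_join(b, values):
    r_vals = [a for tag, a in values if tag == "R"]
    s_vals = [c for tag, c in values if tag == "S"]
    for a in r_vals:          # all (R, a) x (S, c) combinations
        for c in s_vals:
            yield (a, b, c)   # the key is irrelevant; emit the triple
```

On the slide's example, key a receives (R,x) and (S,1) and yields (x, a, 1); keys b and g have values from only one relation and yield nothing.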

Grouping and aggregation using MapReduce
Group and aggregate on a relation R(A,B) using an aggregation function γ(B), grouping by A.
- Map: for each tuple t = (a,b) of R, emit the key-value pair (a,b).
- Reduce: for each group {(a,b_1), ..., (a,b_m)} represented by a key a, apply γ to obtain b_a; for γ = SUM, b_a = b_1 + ... + b_m. Output (a, b_a).
Example: R = {(x,2), (y,1), (z,4), (z,1), (x,5)}. Then select A, sum(B) from R group by A; gives (x,7), (y,1), (z,5).
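
A sketch with γ = sum, matching the slide's example:

```python
def map_group(t):
    a, b = t
    yield (a, b)

def reduce_group(a, bs, gamma=sum):
    # gamma is the aggregation function applied to the grouped B-values
    yield (a, gamma(bs))

# R = {(x,2), (y,1), (z,4), (z,1), (x,5)} -> (x,7), (y,1), (z,5)
```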

Matrix multiplication using MapReduce
Multiply A (m×n) by B (n×l) to get C (m×l), where $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$.
Think of a matrix as a relation with three attributes. For example, matrix A is represented by the relation A(I, J, V): for every non-zero entry (i, j, a_ij), the row number is the value of I, the column number is the value of J, and the entry is the value of V.
- An added advantage: most large matrices are sparse, so the relation has far fewer entries than m×n.
- The product is then essentially a natural join followed by a grouping with aggregation.

Matrix multiplication using MapReduce (step 1)
The natural join of A(I,J,V) and B(J,K,W) yields the tuples (i, j, k, a_ij, b_jk):
- Map: for every (i, j, a_ij), emit the key-value pair (j, (A, i, a_ij)); for every (j, k, b_jk), emit the key-value pair (j, (B, k, b_jk)).
- Reduce: for each key j, and for each pair of values (A, i, a_ij) and (B, k, b_jk), produce the key-value pair ((i,k), a_ij b_jk).

Matrix multiplication using MapReduce (step 2)
The first MapReduce step has produced the key-value pairs ((i,k), a_ij b_jk). A second MapReduce step groups and aggregates them:
- Map: the identity; just emit the key-value pair ((i,k), a_ij b_jk).
- Reduce: for each key (i,k), produce the sum of all the values for that key: $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$.
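
A sketch of both steps (illustrative names; matrix entries arrive as tagged tuples such as ('A', i, j, a_ij)):

```python
# Step 1: the join on j
def map_join_step(entry):
    if entry[0] == "A":
        _, i, j, a_ij = entry
        yield (j, ("A", i, a_ij))
    else:
        _, j, k, b_jk = entry
        yield (j, ("B", k, b_jk))

def reduce_join_step(j, values):
    a_vals = [(i, a) for tag, i, a in values if tag == "A"]
    b_vals = [(k, b) for tag, k, b in values if tag == "B"]
    for i, a in a_vals:
        for k, b in b_vals:
            yield ((i, k), a * b)    # one partial product per (i, j, k)

# Step 2: group and aggregate (the map is the identity)
def reduce_sum_step(ik, products):
    yield (ik, sum(products))        # c_ik = sum over j of a_ij * b_jk
```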

Matrix multiplication using MapReduce: Method 2
A method with a single MapReduce step:
- Map: for every (i, j, a_ij), emit for all k = 1, ..., l the key-value pair ((i,k), (A, j, a_ij)); for every (j, k, b_jk), emit for all i = 1, ..., m the key-value pair ((i,k), (B, j, b_jk)).
- Reduce: for each key (i,k), sort the values (A, j, a_ij) and (B, j, b_jk) by j to group them by j; for each j, multiply a_ij by b_jk; sum the products to produce $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$.
Caveat: the value list for a key may not fit in main memory, forcing an expensive external sort!
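
A sketch of Method 2. It takes the dimensions m and l as parameters, and it groups by j with an in-memory dict rather than the slide's sort, which only works if the whole value list fits in memory, exactly the limitation the slide warns about:

```python
def map_onestep(entry, m, l):
    if entry[0] == "A":                  # replicate a_ij to every column k of C
        _, i, j, a_ij = entry
        for k in range(l):
            yield ((i, k), ("A", j, a_ij))
    else:                                # replicate b_jk to every row i of C
        _, j, k, b_jk = entry
        for i in range(m):
            yield ((i, k), ("B", j, b_jk))

def reduce_onestep(ik, values):
    # Group the values by j; assumes the value list fits in memory
    a_by_j, b_by_j = {}, {}
    for tag, j, x in values:
        (a_by_j if tag == "A" else b_by_j)[j] = x
    yield (ik, sum(a_by_j[j] * b_by_j[j] for j in a_by_j if j in b_by_j))
```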

Combiners
If the aggregation operator in the reduce algorithm is associative and commutative:
- Associative: (a * b) * c = a * (b * c)
- Commutative: a * b = b * a
then it does not matter in which order we aggregate, so we can do some of the reduce work in the map step.
Example: in the word count problem, instead of emitting (w,1) for each occurrence of a word, we can emit (w,n) per document or per map task.
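
For word count, in-mapper combining can be sketched as follows; Hadoop also lets you supply the combining step as a separate combiner class, with the same per-task effect:

```python
from collections import Counter

def map_wordcount_combined(doc_id, text):
    # Combine within the map task: one (w, n) pair per distinct word
    # instead of n copies of (w, 1); valid because + is associative
    # and commutative
    for word, n in Counter(text.split()).items():
        yield (word, n)

# The reducer is unchanged: it sums whatever partial counts arrive.
```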

More examples

An inverted index for text collecqons 28

Collection and documents
Example collection of eight documents, each listed with its terms:
1. The Curse of the Black Pearl: Ship, Captain, Jack, Sparrow, Caribbean, Turner, Elizabeth, Gun, Fight
2. Finding Nemo: Ocean, Fish, Nemo, Reef, Animation
3. Tintin: Ocean, Animation, Ship, Captain, Haddock, Tintin
4. Titanic: Ship, Rose, Jack, Atlantic, Ocean, England, Sink, Captain
5. The Dark Knight: Bruce, Wayne, Batman, Joker, Harvey, Gordon, Gun, Fight, Crime
6. Skyfall: 007, James, Bond, MI6, Gun, Fight
7. Silence of the Lambs: Hannibal, Lector, FBI, Crime, Gun, Cannibal
8. The Ghost Ship: Ship, Ghost, Ocean, Death, Horror
Document: the unit of retrieval. Collection: the group of documents from which we retrieve, also called a corpus (a body of texts).

Boolean retrieval
Over the collection above, we may want to:
- Find all documents containing a word w.
- Find all documents containing a word w1 but not containing the word w2.
- Answer queries in the form of any Boolean expression.
Example query: Jack (matches The Curse of the Black Pearl and Titanic).

Inverted index
The inverted index maps each term in the vocabulary (dictionary) to its posting list, the list of docids of the documents containing it. For the collection above (docids 1-8):
  Ship:    1 3 4 8
  Jack:    1 4
  Bond:    6
  Gun:     1 5 6 7
  Ocean:   2 3 4 8
  Captain: 1 3 4
  Batman:  5
  Crime:   5 7

Creating an inverted index
- Map: for each term in each document, write out a pair (term, docid): (Ship,1), (Captain,1), (Jack,1), (Ship,3), ...
- Shuffle and sort: sort and group the pairs by term: (Captain,1), (Jack,1), (Jack,4), ...
- Reduce: list the documents for each term, forming the posting lists: (Captain: 1), (Tintin: 3), (Jack: 1 4), (Ship: 1 3), ...
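
A sketch of the index builder; the set() deduplicates terms within a document so each posting list contains a docid at most once:

```python
def map_index(doc_id, text):
    for term in set(text.split()):    # one posting per document per term
        yield (term, doc_id)

def reduce_index(term, doc_ids):
    yield (term, sorted(doc_ids))     # the posting list for this term

# e.g. documents 1 and 4 both containing "Jack" yield ('Jack', [1, 4])
```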

A problem: find people you may know
In a social network, suggestions are based on several factors; a very important one is the number of mutual friends.
- How do we compute a list of suggestions? Is one mutual friend enough?
- If yes, the suggestions are simply all friends of friends. Sometimes that works, but more mutual friends is better.
- Consider a simple criterion: if X and Y have at least 3 mutual friends, then Y is a suggested friend of X.

The data
For every member, there is a list of friends:
  A → B C L
  B → A G K
  C → A G M
There are N members, and every member has about n friends, with N >> n.
Naive approach:
- For every friend X of A: O(n)
- For every friend Y of X: O(n)
- Compute the number of mutual friends between Y and A by intersecting the friend list of Y with the friend list of A: O(n)
That is O(n^3) per member, O(N n^3) in total!!

A better method
- For each friend X of A, for each friend Y of X, write the triple (A, Y, X): for example (A, A, B), (A, G, B), (A, K, B), (A, G, C), (A, G, P), ... This produces O(n^2) triples per member.
- Sort the triples by the first two entries, so that all the (A, G, ...) triples come together: (A, G, B), (A, G, C), (A, G, P), ... O(n^2 log n)
- For each pair (A, Y), count how many times it occurs: that count is the number of mutual friends between A and Y. O(n^2)
Total: O(n^2 log n) per member.

Challenges
The method again: for each friend X of A and each friend Y of X, write (A, Y, X); sort by the first two entries so all the (A, G, ...) triples come together; for each (A, Y), count its occurrences.
But the data may be too huge and distributed:
- The friend list of A is in one part of the storage
- The friend list of a friend of A is in another part

How to do this using MapReduce?
Assume the friend list for each user is on one machine. Describe an algorithm using MapReduce.
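
One possible answer, sketched under the stated assumption that the mapper sees a whole friend list at once. This is an illustrative solution, not the slides' prescribed one: it exploits the observation that every pair of X's friends has X as a mutual friend, so the mapper can emit the candidate pairs directly.

```python
def map_mutual(user, friends):
    # Every ordered pair (a, y) of user's friends has `user` as a
    # mutual friend
    for a in friends:
        for y in friends:
            if a != y:
                yield ((a, y), user)

def reduce_mutual(pair, mutual_friends, threshold=3):
    a, y = pair
    # Simplification: this sketch does not filter out pairs that are
    # already friends; a real job would also ship the friend lists
    if len(set(mutual_friends)) >= threshold:
        yield (a, y)          # suggest y to a
```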

Extensions of MapReduce
MapReduce is a two-step workflow. An extension: allow any collection of functions, with an acyclic graph representing the workflow (the slide shows a small DAG of functions f, g, h, i, j, k).
- Each function of the workflow can be executed by many tasks.
- A master controller divides the work among the tasks.
- Each task's output goes to its successor as input.
Two experimental systems: Clustera (University of Wisconsin) and Hyracks (University of California, Irvine). MapReduce can similarly be used recursively.

Cost of MapReduce-based algorithms
- Computation in the nodes → computation cost
- Transfer of data through the network → communication cost
Communication cost is most often the bottleneck. Why?

Why communication cost?
- The algorithm executed by each task is usually very simple, often linear in the size of its input.
- Network speed within a cluster is about 1 gigabit/s, but a processor executes simple map or reduce tasks even faster.
- Several network transfers may be required in the cluster at the same time → overload.
- Data is typically stored on disk; reading it into main memory may take more time than processing it in the map or reduce tasks.

Communication cost
Communication cost of a task = the size of the input to the task.
Why only the size of the input? Most often, the size of the input ≥ the size of the output: unless it is summarized, a large output is not of much use. And if the output is large, it is the input of another task, so counting input sizes suffices.

References and acknowledgements
- Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 2
- Slides by Dwaipayan Roy