MapReduce and Hadoop
  Debapriyo Majumdar
  Data Mining, Fall 2014
  Indian Statistical Institute Kolkata
  November 10, 2014
Let's keep the intro short
  Modern data mining: process immense amounts of data quickly
  Exploit parallelism
  Traditional parallelism: bring data to compute
  MapReduce: bring compute to data
  Pictures courtesy: Glenn K. Lockwood, glennklockwood.com
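The MapReduce paradigm
  Stages: Split, Map, Shuffle and sort, Reduce, Final
  Split: the original input (may be already split in the filesystem) becomes input chunks
  Map: input chunks become <Key,Value> pairs
  Shuffle and sort: <Key,Value> pairs are grouped by keys
  Reduce: grouped pairs become output chunks
  Final: output chunks form the final output (may not need to combine)
  The user needs to write the map() and the reduce()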
An example: word frequency counting
  Problem: Given a collection of documents, count the number of times each word occurs in the collection
  Original input: collection of documents
  Split: subcollections of documents (input chunks)
  Map: for each word w, emit the pair (w,1)
  Shuffle and sort: the pairs (w,1) for the same word are grouped together
  Reduce: count the number (n) of pairs for each w, make it (w,n) (output chunks)
  Final output: (w,n) for each w
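The word count pipeline above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_count`) and the sample chunks are made up for this sketch.

```python
from collections import defaultdict

def map_words(document):
    """Map: emit the pair (w, 1) for each word w in the document."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle and sort: group the values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_count(word, ones):
    """Reduce: count the number n of pairs for word w and emit (w, n)."""
    return (word, len(ones))

# Hypothetical input chunks (illustrative data, not taken from the slides).
chunks = ["apple orange peach", "orange plum", "orange apple guava"]

pairs = [pair for chunk in chunks for pair in map_words(chunk)]
grouped = shuffle(pairs)
word_counts = dict(reduce_count(w, ones) for w, ones in grouped.items())
```

In a real cluster each chunk would be mapped on a different node and the grouped keys would be spread over several reducers; here everything runs in one process to keep the data flow visible.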
An example: word frequency counting
  Problem: Given a collection of documents, count the number of times each word occurs in the collection
  Original input / input chunks:
    apple orange peach orange plum orange apple guava cherry fig peach fig peach
  Map: for each word w, output pairs (w,1)
    (apple,1) (orange,1) (peach,1) (orange,1) (plum,1) (orange,1) (apple,1) (guava,1) (cherry,1) (fig,1) (peach,1) (fig,1) (peach,1)
  Shuffle and sort: <Key,Value> pairs grouped by keys
    (apple,1) (apple,1) (orange,1) (orange,1) (orange,1) (guava,1) (plum,1) (plum,1) (cherry,1) (cherry,1) (fig,1) (fig,1) (peach,1) (peach,1) (peach,1)
  Reduce: count the number (n) of pairs for each w, make it (w,n)
    (apple,2) (orange,3) (guava,1) (plum,2) (cherry,2) (fig,2) (peach,3)
  Final output:
    (apple,2) (orange,3) (guava,1) (plum,2) (cherry,2) (fig,2) (peach,3)
Apache Hadoop
  An open source MapReduce framework
Hadoop
  Two main components
    Hadoop Distributed File System (HDFS): to store data
    MapReduce engine: to process data
  Master-slave architecture using commodity servers
  The HDFS
    Master: Namenode
    Slave: Datanode
  MapReduce
    Master: JobTracker
    Slave: TaskTracker
HDFS: Blocks
  A big file is split into blocks (Block 1 to Block 6 in the figure), and each block is stored on more than one datanode:
    Datanode 1: Block 1, Block 2, Block 3
    Datanode 2: Block 1, Block 3, Block 4
    Datanode 3: Block 2, Block 6, Block 5
    Datanode 4: Block 4, Block 6, Block 5
  Runs on top of existing filesystem
  Blocks are 64MB (128MB recommended)
  Single file can be > any single disk
  POSIX based permissions
  Fault tolerant
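The block layout can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the helper names `num_blocks` and `place_blocks`, the 384 MB file size, the round-robin placement and the replication factor of 2 (chosen because each block appears on two datanodes in the figure) are all invented for illustration. Real HDFS placement is decided by the namenode and is rack-aware, so it will not look like this.

```python
import math

def num_blocks(file_size_mb, block_size_mb=64):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)

def place_blocks(n_blocks, n_datanodes, replication=2):
    """Toy round-robin placement: each block goes to `replication` distinct datanodes."""
    if replication > n_datanodes:
        raise ValueError("replication cannot exceed the number of datanodes")
    placement = {}
    for b in range(n_blocks):
        # Blocks and datanodes are numbered from 1, as in the figure.
        placement[b + 1] = [(b + r) % n_datanodes + 1 for r in range(replication)]
    return placement

blocks = num_blocks(file_size_mb=384)            # a hypothetical 384 MB file
placement = place_blocks(blocks, n_datanodes=4)  # block number -> list of datanodes
```

Because every block lives on two different datanodes, losing any single datanode still leaves one copy of each block, which is the basis of the fault tolerance mentioned above.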
HDFS: Namenode and Datanode
  Namenode
    Only one per Hadoop cluster
    Manages the filesystem namespace
      The filesystem tree
      An edit log
      For each block i, the datanode(s) in which block i is saved
      All the blocks residing in each datanode
  Secondary Namenode
    Keeps periodic checkpoints of the namenode's metadata (often described as a backup namenode, though it is not a hot standby)
  Datanodes
    Many per Hadoop cluster
    Control block operations
    Physically put the blocks in the nodes
    Do the physical replication
HDFS: an example
  (figure)
MapReduce: JobTracker and TaskTracker
  1. JobClient submits job to JobTracker; binary copied into HDFS
  2. JobTracker talks to Namenode
  3. JobTracker creates execution plan
  4. JobTracker submits work to TaskTrackers
  5. TaskTrackers report progress via heartbeat
  6. JobTracker updates status
Map, Shuffle and Reduce: internal steps
  1. Splits data up to send it to the mappers
  2. Transforms splits into key/value pairs
  3. (Key-Value) pairs with the same key are sent to the same reducer
  4. Aggregates key/value pairs based on user-defined code
  5. Determines how the results are saved
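Step 3 relies on a deterministic rule that maps every key to one reducer. Hash partitioning is one common way to do this; the sketch below assumes a hypothetical `partition` function built on a CRC32 checksum and an arbitrary choice of three reducers, and is not Hadoop's own partitioner.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 3  # assumed value, for illustration only
pairs = [("apple", 1), ("fig", 1), ("apple", 1), ("peach", 1), ("fig", 1)]

# Route every (key, value) pair to the bucket of its reducer.
buckets = {r: [] for r in range(num_reducers)}
for key, value in pairs:
    buckets[partition(key, num_reducers)].append((key, value))
```

A checksum is used instead of Python's built-in `hash` because string hashing in Python is randomized per process, whereas all mappers must agree on the reducer for a given key.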
Fault Tolerance
  If the master fails
    MapReduce would fail; the entire job has to be restarted
  A map worker node fails
    Master detects it (periodic ping would time out)
    All the map tasks for this node have to be restarted
      Even if the map tasks were done, their output was stored at the failed node
  A reduce worker fails
    Master sets the status of its currently executing reduce tasks to idle
    Reschedule these tasks on another reduce worker
Some algorithms using MapReduce
Matrix Vector Multiplication
  Multiply $M = (m_{ij})$ (an $n \times n$ matrix) and $v = (v_i)$ (an $n$-vector)
  If $n = 1000$, no need of MapReduce!
  $Mv = (x_i)$, where $x_i = \sum_{j=1}^{n} m_{ij} v_j$
  Case 1: Large $n$; $M$ does not fit into main memory, but $v$ does
    Since $v$ fits into main memory, $v$ is available to every map task
    Map: for each matrix element $m_{ij}$, emit key value pair $(i, m_{ij} v_j)$
    Shuffle and sort: groups all $m_{ij} v_j$ values together for the same $i$
    Reduce: sum $m_{ij} v_j$ for all $j$ for the same $i$
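Case 1 can be sketched as follows. The sparse triple representation `(i, j, m_ij)`, the name `map_entry` and the tiny 2 x 2 example are assumptions made for this illustration; the point is only that each map call needs the whole of `v` but a single matrix element.

```python
from collections import defaultdict

# Matrix M stored as (i, j, m_ij) triples; v is small enough to hold in memory.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def map_entry(i, j, m_ij):
    """Map: for matrix element m_ij, emit the pair (i, m_ij * v_j)."""
    return (i, m_ij * v[j])

# Shuffle and sort: collect all products that share the same row index i.
groups = defaultdict(list)
for i, j, m_ij in M:
    key, value = map_entry(i, j, m_ij)
    groups[key].append(value)

# Reduce: sum the products for each row i to obtain x_i.
x = {i: sum(values) for i, values in sorted(groups.items())}
```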
Matrix Vector Multiplication
  As before: $x_i = \sum_{j=1}^{n} m_{ij} v_j$
  Figure labels: "This much will fit into main memory"; "This whole chunk does not fit in main memory anymore"
  Case 2: Very large $n$; even $v$ does not fit into main memory
    For every map, many accesses to disk (for parts of $v$) required!
    Solution:
      How much of $v$ will fit in?
      Partition $v$, and split the rows of $M$ into matching column ranges, so that each partition of $v$ fits into memory
      Take the dot product of one partition of $v$ and the corresponding partition of $M$
      Map and reduce same as before
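A minimal sketch of the partitioned version, assuming the matrix is cut into column stripes that line up with the partitions of `v`. The stripe width of 2, the helper `stripe_of` and the sample data are hypothetical; in a real job each stripe would be handled by separate map tasks that load only their own slice of `v`.

```python
from collections import defaultdict

# M as (i, j, m_ij) triples for a 2 x 4 matrix; v has 4 components.
M = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (0, 3, 4.0),
     (1, 0, 5.0), (1, 1, 6.0), (1, 2, 7.0), (1, 3, 8.0)]
v = [1.0, 2.0, 3.0, 4.0]
stripe_width = 2  # assumed: how many components of v fit in memory at once

def stripe_of(j):
    """Index of the stripe that column j (and component v_j) belongs to."""
    return j // stripe_width

# Split the matrix entries by stripe, matching the partition of v.
matrix_stripes = defaultdict(list)
for i, j, m_ij in M:
    matrix_stripes[stripe_of(j)].append((i, j, m_ij))

# Map: each stripe only ever touches its own slice of v.
groups = defaultdict(list)
for s, entries in matrix_stripes.items():
    start = s * stripe_width
    v_stripe = v[start:start + stripe_width]  # the only part of v held in memory
    for i, j, m_ij in entries:
        groups[i].append(m_ij * v_stripe[j - start])

# Reduce: same as before, sum the partial products for each row i.
x = {i: sum(values) for i, values in sorted(groups.items())}
```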
Relational Algebra
  Relation $R(A_1, A_2, \ldots, A_n)$ is a relation with attributes $A_i$
  Schema: set of attributes
  Selection on condition C: apply C on each tuple in R, output only those which satisfy C
  Projection on a subset S of attributes: output the components for the attributes in S
  Union, Intersection, Join

  Example relation:
    Attr1  Attr2  Attr3  Attr4
    xyz    abc    1      true
    abc    xyz    1      true
    xyz    def    1      false
    bcd    def    2      true

  Links between URLs:
    URL1  URL2
    url1  url2
    url2  url1
    url3  url5
    url1  url3
Selection using MapReduce
  Trivial
  Map: For each tuple t in R, test if t satisfies C. If so, produce the key-value pair (t, t).
  Reduce: The identity function. It simply passes each key-value pair to the output.
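Selection can be sketched on the "links between URLs" relation. The condition used here (first attribute equals "url1") and the function names are arbitrary choices for illustration.

```python
from collections import defaultdict

# The links relation R(URL1, URL2).
R = [("url1", "url2"), ("url2", "url1"), ("url3", "url5"), ("url1", "url3")]

def condition(t):
    """Hypothetical condition C: keep links that start at url1."""
    return t[0] == "url1"

def map_select(t):
    """Map: emit (t, t) only if t satisfies C."""
    if condition(t):
        yield (t, t)

def reduce_identity(key, values):
    """Reduce: the identity, pass each key-value pair through."""
    for value in values:
        yield (key, value)

groups = defaultdict(list)
for t in R:
    for key, value in map_select(t):
        groups[key].append(value)

output = [pair for key, values in sorted(groups.items())
          for pair in reduce_identity(key, values)]
selected = [key for key, _ in output]
```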
Union using MapReduce
  Union of two relations R and S
  Suppose R and S have the same schema
  Map tasks are generated from chunks of both R and S
  Map: For each tuple t, produce the key-value pair (t, t)
  Reduce: Only need to remove duplicates
    For each key t, there would be either one or two values
    Output (t, t) in either case
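A small sketch of the union, assuming R and S are sets (no duplicates inside either relation), which is what makes "one or two values" per key hold. The sample tuples and function names are illustrative.

```python
from collections import defaultdict

# Two relations with the same schema (URL1, URL2); sample data for illustration.
R = [("url1", "url2"), ("url2", "url1"), ("url1", "url3")]
S = [("url2", "url1"), ("url3", "url5")]

def map_tuple(t):
    """Map: emit (t, t) for every tuple, whichever relation it comes from."""
    return (t, t)

groups = defaultdict(list)
for t in R + S:
    key, value = map_tuple(t)
    groups[key].append(value)

def reduce_union(t, values):
    """Reduce: values holds one or two copies of t; emit (t, t) once."""
    return (t, t)

union = [reduce_union(t, values)[0] for t, values in sorted(groups.items())]
```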
Natural join using MapReduce
  Join R(A,B) with S(B,C) on attribute B
  Map:
    For each tuple t = (a,b) of R, emit key value pair (b,(R,a))
    For each tuple t = (b,c) of S, emit key value pair (b,(S,c))
  Reduce:
    Each key b would be associated with a list of values that are of the form (R,a) or (S,c)
    Construct all pairs consisting of one value with first component R and the other with first component S, say (R,a) and (S,c)
    The output from this key and value list is a sequence of key-value pairs
    The key is irrelevant. Each value is one of the triples (a, b, c) such that (R,a) and (S,c) are on the input list of values

  R:
    A  B
    x  a
    y  b
    z  c
    w  d

  S:
    B  C
    a  1
    c  3
    d  4
    g  7
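The join can be sketched with the two example relations R and S from this slide. The function names and the in-memory grouping are assumptions of this sketch; tagging each value with "R" or "S" follows the map step described above.

```python
from collections import defaultdict
from itertools import product

R = [("x", "a"), ("y", "b"), ("z", "c"), ("w", "d")]  # tuples (a, b)
S = [("a", 1), ("c", 3), ("d", 4), ("g", 7)]          # tuples (b, c)

def map_r(t):
    """Map for R: tuple (a, b) becomes (b, ("R", a))."""
    a, b = t
    return (b, ("R", a))

def map_s(t):
    """Map for S: tuple (b, c) becomes (b, ("S", c))."""
    b, c = t
    return (b, ("S", c))

groups = defaultdict(list)
for key, value in [map_r(t) for t in R] + [map_s(t) for t in S]:
    groups[key].append(value)

def reduce_join(b, values):
    """Reduce: pair every R value with every S value that shares the key b."""
    r_side = [a for tag, a in values if tag == "R"]
    s_side = [c for tag, c in values if tag == "S"]
    return [(a, b, c) for a, c in product(r_side, s_side)]

joined = sorted(triple for b, values in groups.items()
                for triple in reduce_join(b, values))
```

Keys that appear in only one relation (b in R, g in S) produce an empty product and therefore contribute nothing to the join.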
Grouping and Aggregation using MapReduce
  Group and aggregate on a relation R(A,B) using aggregation function γ(B), group by A
  Map: For each tuple t = (a,b) of R, emit key value pair (a,b)
  Reduce: For each group $\{(a,b_1), \ldots, (a,b_m)\}$ represented by a key a, apply γ to obtain $b_a$ (for a sum, $b_a = b_1 + \cdots + b_m$)
    Output $(a, b_a)$

  R:
    A  B
    x  2
    y  1
    z  4
    z  1
    x  5

  select A, sum(B) from R group by A;

    A  SUM(B)
    x  7
    y  1
    z  5
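The same query can be sketched in Python on the example relation from this slide, with sum as the aggregation function. The name `reduce_sum` is an illustrative choice; any other aggregation (max, count, average) would only change the body of the reduce step.

```python
from collections import defaultdict

R = [("x", 2), ("y", 1), ("z", 4), ("z", 1), ("x", 5)]  # tuples (a, b)

# Map: each tuple (a, b) is already the key-value pair (a, b).
# Shuffle and sort: group the b values by their key a.
groups = defaultdict(list)
for a, b in R:
    groups[a].append(b)

def reduce_sum(a, b_values):
    """Reduce: apply the aggregation (here a sum) to the group of key a."""
    return (a, sum(b_values))

result = dict(reduce_sum(a, b_values) for a, b_values in sorted(groups.items()))
```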
Matrix multiplication using MapReduce
  $A\,(m \times n) \cdot B\,(n \times l) = C\,(m \times l)$, with $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$
  Think of a matrix as a relation with three attributes
  For example, matrix A is represented by the relation A(I, J, V)
    For every non-zero entry $(i, j, a_{ij})$, the row number is the value of I, the column number is the value of J, and the entry is the value in V
    Added advantage: most large matrices are sparse, so the relation has fewer entries
  The product is essentially a natural join followed by a grouping with aggregation
Matrix multiplication using MapReduce
  $A\,(m \times n)$ with entries $(i, j, a_{ij})$; $B\,(n \times l)$ with entries $(j, k, b_{jk})$; $C = AB\,(m \times l)$, with $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$
  Natural join of (I,J,V) and (J,K,W) gives tuples $(i, j, k, a_{ij}, b_{jk})$
  Map:
    For every $(i, j, a_{ij})$, emit key value pair $(j, (A, i, a_{ij}))$
    For every $(j, k, b_{jk})$, emit key value pair $(j, (B, k, b_{jk}))$
  Reduce:
    for each key j
      for each value $(A, i, a_{ij})$ and $(B, k, b_{jk})$
        produce a key value pair $((i,k), a_{ij} b_{jk})$
Matrix multiplication using MapReduce
  $A\,(m \times n)$ with entries $(i, j, a_{ij})$; $B\,(n \times l)$ with entries $(j, k, b_{jk})$; $C = AB\,(m \times l)$
  First MapReduce process has produced key value pairs $((i,k), a_{ij} b_{jk})$
  Another MapReduce process to group and aggregate
  Map: identity, just emit the key value pair $((i,k), a_{ij} b_{jk})$
  Reduce:
    for each key (i,k)
      produce the sum of all the values for the key: $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$
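The two-step method (join on j, then group by (i,k) and sum) can be sketched as below. The small sparse matrices, the helper `group_by_key` and the in-memory grouping are assumptions made for this illustration.

```python
from collections import defaultdict

# Sparse matrices as relations; zero entries are simply absent.
A = [(0, 0, 1), (1, 0, 2), (1, 1, 3)]  # (i, j, a_ij) for A = [[1, 0], [2, 3]]
B = [(0, 0, 4), (0, 1, 5), (1, 1, 6)]  # (j, k, b_jk) for B = [[4, 5], [0, 6]]

def group_by_key(pairs):
    """Stand-in for shuffle and sort: collect the values of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Step 1, map: key both relations by the shared index j.
step1_mapped = ([(j, ("A", i, a)) for i, j, a in A] +
                [(j, ("B", k, b)) for j, k, b in B])

# Step 1, reduce: for each j, pair every A value with every B value.
products = []
for j, values in group_by_key(step1_mapped).items():
    a_side = [(i, a) for tag, i, a in values if tag == "A"]
    b_side = [(k, b) for tag, k, b in values if tag == "B"]
    for i, a in a_side:
        for k, b in b_side:
            products.append(((i, k), a * b))

# Step 2: identity map, then sum all products that share the key (i, k).
C = {key: sum(values) for key, values in sorted(group_by_key(products).items())}
```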
Matrix multiplication using MapReduce: Method 2
  $A\,(m \times n)$ with entries $(i, j, a_{ij})$; $B\,(n \times l)$ with entries $(j, k, b_{jk})$; $C = AB\,(m \times l)$
  A method with one MapReduce step
  Map:
    For every $(i, j, a_{ij})$, emit for all $k = 1, \ldots, l$ the key value pair $((i,k), (A, j, a_{ij}))$
    For every $(j, k, b_{jk})$, emit for all $i = 1, \ldots, m$ the key value pair $((i,k), (B, j, b_{jk}))$
  Reduce:
    for each key (i,k)
      sort values $(A, j, a_{ij})$ and $(B, j, b_{jk})$ by j to group them by j
      (the values for one key may not fit in main memory: expensive external sort!)
      for each j multiply $a_{ij}$ and $b_{jk}$
      sum the products for the key (i,k) to produce $c_{ik} = \sum_{j=1}^{n} a_{ij} b_{jk}$
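A sketch of the one-step method on small sparse matrices. Note one deliberate substitution: instead of sorting the values by j as described above, this sketch matches them through dictionaries keyed by j, which is only reasonable when the values for a key fit in memory. The sample matrices and the name `reduce_cell` are illustrative assumptions.

```python
from collections import defaultdict

A = [(0, 0, 1), (1, 0, 2), (1, 1, 3)]  # (i, j, a_ij) for A = [[1, 0], [2, 3]]
B = [(0, 0, 4), (0, 1, 5), (1, 1, 6)]  # (j, k, b_jk) for B = [[4, 5], [0, 6]]
m, l = 2, 2                            # rows of A, columns of B

# Map: replicate each entry to every output cell (i, k) it contributes to.
mapped = []
for i, j, a in A:
    for k in range(l):
        mapped.append(((i, k), ("A", j, a)))
for j, k, b in B:
    for i in range(m):
        mapped.append(((i, k), ("B", j, b)))

# Shuffle and sort: group the tagged values by output cell (i, k).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

def reduce_cell(values):
    """Reduce: match A and B values on j, multiply, and sum the products."""
    a_by_j = {j: a for tag, j, a in values if tag == "A"}
    b_by_j = {j: b for tag, j, b in values if tag == "B"}
    return sum(a * b_by_j[j] for j, a in a_by_j.items() if j in b_by_j)

C = {key: reduce_cell(values) for key, values in sorted(groups.items())}
```

The map step emits each entry of A l times and each entry of B m times, so the single step is paid for with more data moved through the shuffle.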
References and acknowledgements
  Mining of Massive Datasets, by Leskovec, Rajaraman and Ullman, Chapter 2
  Slides by Dwaipayan Roy