Generalizing Map-Reduce

Denis Ryan
5 years ago

1 Generalizing Map-Reduce

2 Example: A Map-Reduce Graph [Figure: a graph whose nodes are map and reduce processes.]

3 Map-reduce is not a solution to every problem, not even to every problem that can profitably use many compute nodes operating in parallel.

4 Algorithm Design Goal: Algorithms should exploit as much parallelism as possible. To encourage parallelism, we put a limit s on the amount of input or output that any one process can have. s could be: what fits in main memory; what fits on local disk; or no more than a process can handle before cosmic rays are likely to cause an error.

5 Cost Measures for Algorithms: 1. Communication cost = total I/O of all processes. 2. Elapsed communication cost = max of I/O along any path. 3. (Elapsed) computation costs are analogous, but count only the running time of processes.

6 Cost Measures: For a map-reduce algorithm, communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes. Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process.
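The two formulas above can be written out as a small calculation. This is only a sketch: the function names and the way sizes are passed in (plain numbers and lists of (input, output) pairs) are assumptions made for illustration, not part of the slides.

```python
def communication_cost(input_size, map_to_reduce_sizes, reduce_output_sizes):
    """Total communication cost of one map-reduce job.

    Files passed from Map to Reduce are counted twice: once as output
    of a Map process and once as input of a Reduce process.
    """
    return input_size + 2 * sum(map_to_reduce_sizes) + sum(reduce_output_sizes)


def elapsed_communication_cost(map_io, reduce_io):
    """Largest (input + output) of any Map process plus the largest
    (input + output) of any Reduce process.

    map_io and reduce_io are lists of (input_size, output_size) pairs,
    one pair per process.
    """
    return max(i + o for i, o in map_io) + max(i + o for i, o in reduce_io)


# Hypothetical job: input of size 100 split over two Map processes,
# which pass files of size 30 and 20 to two Reduce processes.
total = communication_cost(100, [30, 20], [10, 5])
elapsed = elapsed_communication_cost([(50, 30), (50, 20)], [(25, 10), (25, 5)])
```

In this made-up job the total cost is what is paid for overall, while the elapsed cost follows only the most expensive Map process and the most expensive Reduce process.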

7 What Cost Measures Mean: Either the I/O (communication) or the processing (computation) cost dominates, so we can ignore one or the other. Total costs tell what you pay in rent from your friendly neighborhood cloud. Elapsed costs are wall-clock time using parallelism.

8 Join By Map-Reduce: Our first example of an algorithm in this framework is a map-reduce example. Compute the natural join R(A,B) ⋈ S(B,C). R and S are each stored in files. Tuples are pairs (a,b) or (b,c).

9 Map-Reduce Join (2): Use a hash function h from B-values to 1..k. A Map process turns each input tuple R(a,b) into the key-value pair (b,(a,r)) and each input tuple S(b,c) into (b,(c,s)).

10 Map-Reduce Join (3): Map processes send each key-value pair with key b to Reduce process h(b). Hadoop does this automatically; just tell it what k is. Each Reduce process matches all the pairs (b,(a,r)) with all the pairs (b,(c,s)) and outputs (a,b,c).
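The join described on the last three slides can be simulated in a single Python process. This is a minimal sketch, not Hadoop code: the function names are invented for illustration, the shuffle step is written by hand, and h maps to 0..k-1 rather than 1..k.

```python
from collections import defaultdict


def map_join(relation_name, tuples):
    """Map: turn R(a,b) into (b, (a, "r")) and S(b,c) into (b, (c, "s"))."""
    for t in tuples:
        if relation_name == "R":
            a, b = t
            yield b, (a, "r")
        else:
            b, c = t
            yield b, (c, "s")


def reduce_join(pairs):
    """Reduce: for each key b, match every (a, "r") with every (c, "s")."""
    by_b = defaultdict(lambda: ([], []))
    for b, (value, tag) in pairs:
        by_b[b][0 if tag == "r" else 1].append(value)
    for b, (a_values, c_values) in by_b.items():
        for a in a_values:
            for c in c_values:
                yield (a, b, c)


def join(R, S, k=3, h=None):
    """Simulate the whole job with k Reduce processes."""
    h = h or (lambda b: hash(b) % k)  # assumed hash function, 0..k-1
    buckets = defaultdict(list)
    # Shuffle: the pair with key b goes to Reduce process h(b).
    for key, value in list(map_join("R", R)) + list(map_join("S", S)):
        buckets[h(key)].append((key, value))
    out = []
    for bucket in buckets.values():
        out.extend(reduce_join(bucket))
    return sorted(out)


R = [(1, 2), (3, 2), (4, 5)]
S = [(2, 7), (2, 8), (6, 9)]
joined = join(R, S)
```

All pairs with the same B-value land in the same bucket, which is why each Reduce process can complete its part of the join without seeing the other buckets.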

11 Cost of Map-Reduce Join: Total communication cost = O(|R| + |S| + |R ⋈ S|). Elapsed communication cost = O(s). We're going to pick k and the number of Map processes so that the I/O limit s is respected. With proper indexes, computation cost is linear in the input + output size, so computation costs are like communication costs.

12 Three-Way Join: We shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D). One way: a cascade of two 2-way joins, each implemented by map-reduce. Fine, unless the 2-way joins produce large intermediate relations.

13 Example: Large Intermediate Relations. A = "good pages"; B, C = "all pages"; D = "spam pages". R, S, and T each represent links. The 3-way join = paths of length 3 from a good page to a spam page. R ⋈ S = paths of length 2 from a good page to any page; S ⋈ T = paths of length 2 from any page to a spam page.

14 Another 3-Way Join: Reduce processes use hash values of entire S(B,C) tuples as key. Choose a hash function h that maps B- and C-values to k buckets. There are k² Reduce processes, one for each (B-bucket, C-bucket) pair.

15 Mapping for 3-Way Join: We map each tuple S(b,c) to the key-value pair ((h(b), h(c)), (S, b, c)). We map each R(a,b) tuple to ((h(b), y), (R, a, b)) for all y = 1, 2, …, k. We map each T(c,d) tuple to ((x, h(c)), (T, c, d)) for all x = 1, 2, …, k. Aside: even normal map-reduce allows inputs to map to several key-value pairs.

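16 Assigning Tuples to Reducers [Figure: a grid of reducers with rows indexed by h(b) = 0..3 and columns indexed by h(c) = 0..3. A tuple S(b,c) with h(b)=1 and h(c)=2 goes to one cell; a tuple R(a,b) with h(b)=2 goes to every cell in row 2; a tuple T(c,d) with h(c)=3 goes to every cell in column 3.]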

17 Job of the Reducers: Each reducer gets, for certain B-values b and C-values c: 1. All tuples from R with B = b, 2. All tuples from T with C = c, and 3. The tuple S(b,c) if it exists. Thus it can create every tuple of the form (a, b, c, d) in the join.
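The one-round 3-way join from the preceding slides can be sketched as a single-process simulation. The function name, the dictionary layout of each reducer's input, and the use of bucket numbers 0..k-1 instead of 1..k are assumptions made for illustration.

```python
from collections import defaultdict


def three_way_join(R, S, T, k, h):
    """Join R(A,B), S(B,C), T(C,D) using k*k simulated reducers.

    h is assumed to map B- and C-values to buckets 0..k-1.
    """
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    # Map: each S tuple goes to exactly one reducer.
    for b, c in S:
        reducers[(h(b), h(c))]["S"].append((b, c))
    # Map: each R tuple is replicated to all k reducers in row h(b).
    for a, b in R:
        for y in range(k):
            reducers[(h(b), y)]["R"].append((a, b))
    # Map: each T tuple is replicated to all k reducers in column h(c).
    for c, d in T:
        for x in range(k):
            reducers[(x, h(c))]["T"].append((c, d))

    # Reduce: every reducer joins what it received.
    result = []
    for parts in reducers.values():
        r_by_b = defaultdict(list)
        for a, b in parts["R"]:
            r_by_b[b].append(a)
        t_by_c = defaultdict(list)
        for c, d in parts["T"]:
            t_by_c[c].append(d)
        for b, c in parts["S"]:
            for a in r_by_b.get(b, []):
                for d in t_by_c.get(c, []):
                    result.append((a, b, c, d))
    return sorted(result)


R = [(1, 2), (5, 6)]
S = [(2, 3), (6, 7)]
T = [(3, 4), (3, 9), (8, 8)]
paths = three_way_join(R, S, T, k=2, h=lambda v: v % 2)
```

Because each S tuple reaches only one reducer, no output tuple is produced twice, even though R and T tuples are each copied k times.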

18 3-Way Join and Map-Reduce: This algorithm is not exactly in the spirit of map-reduce. While you could use the hash function h in the Map processes, Hadoop normally does the hashing of keys itself.

19 3-Way Join/Map-Reduce (2): But if you Map to attribute values rather than hash values, you have a subtle problem. Example: R(a, b) needs to go to all keys of the form (b, y), where y is any C-value. But you don't know all the C-values.

20 Semijoin Option: A possible solution: first semijoin, that is, find all the C-values in S(B,C). Feed these to the Map processes for R(A,B), so they produce only keys (b, y) such that y is in π_C(S). Similarly, compute π_B(S), and have the Map processes for T(C,D) produce only keys (x, c) such that x is in π_B(S).

21 Semijoin Option (2): Problem: while this approach works, it is not a map-reduce process. Rather, it requires three layers of processes: 1. Map S to π_B(S), π_C(S), and S itself (for the join). 2. Map R and π_C(S) to key-value pairs, and do the same for T and π_B(S). 3. Reduce (join) the mapped R, S, and T tuples.
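The three layers can be sketched in one Python function, with each layer as a separate step. This is an illustrative simulation under the assumption that relations are lists of distinct pairs; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def semijoin_three_way(R, S, T):
    """3-way join keyed on attribute values (b, c) instead of hash values."""
    # Layer 1: project S onto B and onto C.
    b_values = {b for b, _ in S}
    c_values = {c for _, c in S}

    # Layer 2: map R, S, and T to key-value pairs with keys (b, c).
    groups = defaultdict(lambda: {"R": [], "S": False, "T": []})
    for b, c in S:
        groups[(b, c)]["S"] = True
    for a, b in R:
        for y in c_values:  # only C-values that actually occur in S
            groups[(b, y)]["R"].append(a)
    for c, d in T:
        for x in b_values:  # only B-values that actually occur in S
            groups[(x, c)]["T"].append(d)

    # Layer 3: reduce (join) the tuples that share a key.
    out = []
    for (b, c), parts in groups.items():
        if parts["S"]:
            for a in parts["R"]:
                for d in parts["T"]:
                    out.append((a, b, c, d))
    return sorted(out)


R = [(1, 2), (5, 6)]
S = [(2, 3), (6, 7)]
T = [(3, 4), (3, 9), (8, 8)]
paths = semijoin_three_way(R, S, T)
```

Layer 2 cannot start until layer 1 has finished, which is the reason this plan does not fit into a single map-reduce job.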

22 Term Co-occurrence

23 Term Co-occurrence (2): How do we aggregate counts efficiently?

24 1st try: Pairs. Note: in all these slides, a key-value pair is denoted as k → v.

25 1st try: Pairs. Advantages: easy to implement, easy to understand. Disadvantages: lots of pairs to sort and shuffle around.
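The pseudocode for the pairs approach did not survive on these slides, so the following is a reconstruction of the commonly described pattern rather than the author's own code. It assumes that two terms co-occur when they appear in the same document, and all function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def map_pairs(doc):
    """Map: emit ((a, b), 1) for every ordered pair of co-occurring terms."""
    for a, b in permutations(doc, 2):
        yield (a, b), 1


def reduce_pairs(pairs):
    """Reduce: sum the counts for each (a, b) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def cooccurrence_pairs(docs):
    """Return the co-occurrence counts and the number of pairs shuffled."""
    emitted = [kv for doc in docs for kv in map_pairs(doc)]
    return reduce_pairs(emitted), len(emitted)


docs = [["a", "b", "c"], ["a", "b"]]
counts, shuffled = cooccurrence_pairs(docs)
```

The second return value makes the disadvantage visible: a document with m terms emits m(m-1) key-value pairs, all of which must be sorted and shuffled.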

26 Another try: Stripes

27 Another try: Stripes. Advantages: far less sorting and shuffling of key-value pairs; can make better use of combiners. Disadvantages: more difficult to implement; underlying object is more heavyweight; fundamental limitation in terms of size of event space.
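As with pairs, the stripes pseudocode is missing from the slides, so this is a hedged reconstruction of the usual pattern: each term is mapped to one associative array (a "stripe") holding the counts of its neighbors. Co-occurrence within the same document and the function names are assumptions.

```python
from collections import Counter, defaultdict


def map_stripes(doc):
    """Map: emit one stripe per term occurrence, term -> {neighbor: count}."""
    for i, a in enumerate(doc):
        stripe = Counter(b for j, b in enumerate(doc) if j != i)
        yield a, stripe


def reduce_stripes(pairs):
    """Reduce: merge all stripes of a term by element-wise addition."""
    merged = defaultdict(Counter)
    for a, stripe in pairs:
        merged[a].update(stripe)  # Counter.update adds counts
    return merged


def cooccurrence_stripes(docs):
    """Return the merged stripes and the number of key-value pairs shuffled."""
    emitted = [kv for doc in docs for kv in map_stripes(doc)]
    return reduce_stripes(emitted), len(emitted)


docs = [["a", "b", "c"], ["a", "b"]]
stripes, shuffled = cooccurrence_stripes(docs)
```

A document with m terms now emits only m key-value pairs, but each value is a whole dictionary that must fit in memory, which is the size limitation named on the slide.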

29 Conditional probabilities

30 P(B|A): Pairs

31 P(B|A): Stripes
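Only the titles of these two slides survive. Assuming the intended estimate is the relative frequency P(B|A) = count(A, B) / count(A), where count(A) is the sum of count(A, B') over all B', the two variants might be sketched as follows. The function names and input layouts are hypothetical.

```python
from collections import Counter


def conditional_from_stripe(stripe):
    """Stripes: the whole row for A is in one place, so normalize it directly."""
    total = sum(stripe.values())
    return {b: n / total for b, n in stripe.items()}


def conditional_from_pairs(pair_counts):
    """Pairs: first compute the marginal count of each A, then divide."""
    marginals = Counter()
    for (a, _), n in pair_counts.items():
        marginals[a] += n
    return {(a, b): n / marginals[a] for (a, b), n in pair_counts.items()}


p_given_a = conditional_from_stripe(Counter({"b": 2, "c": 1}))
p_pairs = conditional_from_pairs({("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 2})
```

With stripes the marginal is available inside a single reducer call. With pairs the counts for one A are spread over many keys, so a real job needs some way to bring the marginal to the reducer before the individual pairs.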

32 Synchronization in Hadoop

33 Matrix-vector multiplication: Suppose we have an n × n matrix M, whose element in row i and column j will be denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element x_i is given by $x_i = \sum_{j=1}^{n} m_{ij} v_j$.

34 Matrix-vector multiplication: Let us first assume that n is large, but not so large that vector v cannot fit in main memory and thus be available to every Map task. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file or because it is stored as a triple (i, j, m_ij). We also assume the position of element v_j in v is discoverable in the analogous way.

35 Matrix-vector multiplication: The Map function applies to one element of M. Each Map task will operate on a chunk of the matrix M. From each matrix element m_ij it produces the key-value pair (i, m_ij v_j). All terms of the sum that make up the component x_i of the matrix-vector product will get the same key, i.

36 Matrix-vector multiplication: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, x_i).
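The Map and Reduce functions just described can be simulated in a few lines. This sketch assumes the matrix is stored as triples (i, j, m_ij) with 0-based indices and that v is a Python list available to every Map task; the function names and the chunk size are illustrative choices.

```python
from collections import defaultdict


def map_matvec(chunk, v):
    """Map: from each element m_ij produce the pair (i, m_ij * v_j)."""
    for i, j, m_ij in chunk:
        yield i, m_ij * v[j]


def reduce_matvec(pairs):
    """Reduce: sum all values with the same key i to obtain x_i."""
    sums = defaultdict(float)
    for i, term in pairs:
        sums[i] += term
    return dict(sums)


def matvec(triples, v, chunk_size=2):
    """Split M into chunks, run one Map task per chunk, then reduce."""
    chunks = [triples[s:s + chunk_size] for s in range(0, len(triples), chunk_size)]
    emitted = [kv for chunk in chunks for kv in map_matvec(chunk, v)]
    return reduce_matvec(emitted)


# M = [[1, 2], [3, 4]] stored as triples, v = [5, 6]
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
x = matvec(M, [5, 6])
```

The result does not depend on how the triples are split into chunks, since every term of x_i carries the key i regardless of which Map task produced it.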

37 Matrix-vector multiplication, if the vector v cannot fit in main memory: If v does not fit in memory, there will be a very large number of disk accesses as we move pieces of the vector into main memory to multiply components by elements of the matrix.

38 Matrix-vector multiplication, if the vector v cannot fit in main memory: As an alternative, we can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height.

39 Matrix-vector multiplication: The ith stripe of the matrix multiplies only components from the ith stripe of the vector.

40 Matrix-vector multiplication: We can divide the matrix into one file for each stripe, and do the same for the vector. Each Map task is assigned a chunk from one of the stripes of the matrix and gets the entire corresponding stripe of the vector. The Map and Reduce tasks can then act exactly as was described above for the case where Map tasks get the entire vector.
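The striped variant can be sketched in the same style. The assumptions here are 0-based indices, a matrix stored as triples (i, j, m_ij), and a stripe width that need not divide n evenly, so the last stripe may be narrower; the function name is hypothetical.

```python
from collections import defaultdict


def striped_matvec(triples, v, stripe_width):
    """Multiply using vertical matrix stripes and matching vector stripes."""
    # Divide the matrix into vertical stripes by column index.
    matrix_stripes = defaultdict(list)
    for i, j, m_ij in triples:
        matrix_stripes[j // stripe_width].append((i, j, m_ij))

    sums = defaultdict(float)
    for s, elements in matrix_stripes.items():
        lo = s * stripe_width
        # A Map task for stripe s needs only this slice of v in memory.
        v_stripe = v[lo:lo + stripe_width]
        for i, j, m_ij in elements:
            # Same Map output as before, (i, m_ij * v_j); the addition below
            # plays the role of the Reduce function.
            sums[i] += m_ij * v_stripe[j - lo]
    return dict(sums)


# M = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] stored as triples, v = [1, 1, 2]
M = [(i, j, 3 * i + j + 1) for i in range(3) for j in range(3)]
x = striped_matvec(M, [1, 1, 2], stripe_width=2)
```

Each stripe contributes a partial sum to every x_i, and the Reduce step adds the partial sums from all stripes, so the final vector is the same as in the unstriped case.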
