Big Data Analytics. 4. Map Reduce I. Lars Schmidt-Thieme
Big Data Analytics
4. Map Reduce I
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science, University of Hildesheim, Germany
original slides by Lucas Rego Drumond, ISMLL
Outline
1. Introduction
2. Parallel programming paradigms
1. Introduction

Overview
- Part I: Distributed Database / Distributed File System
- Part II: Large Scale Computational Models
- Part III: Machine Learning Algorithms
Why do we need a Computational Model?
- Our data is nicely stored in a distributed infrastructure
- We have a number of computers at our disposal
- We want our analytics software to take advantage of all this computing power
- When programming, we want to focus on understanding our data, not our infrastructure
Shared Memory Infrastructure
[diagram: several processors all attached to a single shared memory]
Distributed Infrastructure
[diagram: processors connected by a network, each with its own local memory]
2. Parallel programming paradigms
Parallel Computing principles
- We have p processors available to execute a task T
- Ideally: the more processors, the faster the task is executed
- Reality: synchronisation and communication costs
- Speedup s(T, p) of a task T by using p processors:
  Let t(T, p) be the time needed to execute T using p processors. The speedup is given by

    s(T, p) = t(T, 1) / t(T, p)
Parallel Computing principles
- Efficiency e(T, p) of a task T by using p processors:

    e(T, p) = t(T, 1) / (p · t(T, p))
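The two definitions above can be checked with a small sketch; the class name and the runtimes below are illustrative values for the sake of the example, not measurements:

```java
// Sketch: computing speedup s(T, p) and efficiency e(T, p) from runtimes.
public class ParallelMetrics {
    // s(T, p) = t(T, 1) / t(T, p)
    static double speedup(double t1, double tp) {
        return t1 / tp;
    }

    // e(T, p) = t(T, 1) / (p * t(T, p))
    static double efficiency(double t1, double tp, int p) {
        return t1 / (p * tp);
    }

    public static void main(String[] args) {
        double t1 = 120.0;  // assumed time on 1 processor (seconds)
        double t4 = 40.0;   // assumed time on 4 processors (seconds)
        System.out.println(speedup(t1, t4, 4 > 0 ? 1 : 1));
    }
}
```

With these numbers the speedup is 3.0 rather than the ideal 4.0, and the efficiency is 0.75, illustrating the synchronisation and communication costs.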
Considerations
- It is not worth using a lot of processors to solve small problems
- Algorithms should increase efficiency with problem size
Paradigms - Shared Memory
- All the processors have access to all the data D := {d_1, ..., d_n}
- Pieces of data can be overwritten
- Processors need to lock data points before using them
- For each processor:
  1. lock(d_i)
  2. process(d_i)
  3. unlock(d_i)
Word Count Example
- Given a corpus of text documents D := {d_1, ..., d_n},
- each containing a sequence of words w_1, ..., w_m pooled from a set W of possible words,
- the task is to generate word counts for each word in the corpus
Word Count - Shared Memory
Shared vector of word counts: c ∈ R^|W|, initialised as c ← {0}^|W|
Each processor:
  1. access a document d ∈ D
  2. for each word w_i in document d:
     2.1 lock(c_i)
     2.2 c_i ← c_i + 1
     2.3 unlock(c_i)
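The locking scheme above can be sketched with Java threads. As a simplification, this sketch locks the whole count map instead of one counter c_i at a time; the class name and the document data are illustrative:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the shared-memory scheme: several threads update one shared
// count structure, taking a lock before each increment (lock / unlock).
public class SharedMemoryWordCount {
    public static Map<String, Integer> count(List<List<String>> docs, int nThreads)
            throws InterruptedException {
        Map<String, Integer> counts = new HashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (List<String> doc : docs) {
            pool.submit(() -> {
                for (String w : doc) {
                    synchronized (counts) {   // lock(c_i) ... unlock(c_i)
                        counts.merge(w, 1, Integer::sum);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return counts;
    }
}
```

The synchronized block is exactly the waiting point criticised later: every thread queues on the same lock, so adding processors adds contention rather than speed.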
Paradigms - Shared Memory
- Inefficient in a distributed scenario
- Results of a process can easily be overwritten
- Possibly long waiting times for a piece of data because of the lock mechanism
Paradigms - Message passing
- Each processor sees only one part of the data: π(D, p) := {d_p, ..., d_{p + n/p − 1}}
- Each processor works on its partition
- Results are exchanged between processors (message passing)
- For each processor p:
  1. for each d ∈ π(D, p):
     1.1 process(d)
  2. communicate results
Word Count - Message passing
We need to define two types of processes:
1. Slave: counts the words on a subset of documents and informs the master
2. Master: gathers counts from the slaves and sums them up
Word Count - Message passing
Slave:
Local memory:
- subset of documents: π(D, p) := {d_p, ..., d_{p + n/p − 1}}
- address of the master: addr_master
- local word counts: c ∈ R^|W|
1. c ← {0}^|W|
2. for each document d ∈ π(D, p):
   for each word w_i in document d:
     c_i ← c_i + 1
3. send message: send(addr_master, c)
Word Count - Message passing
Master:
Local memory:
- global word counts: c_global ∈ R^|W|
- list of slaves: S
Initialise: c_global ← {0}^|W|, s ← {0}^|S|
For each received message (p, c_p):
  1. c_global ← c_global + c_p
  2. s_p ← 1
  3. if ‖s‖_1 = |S|, return c_global
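The slave and master roles can be sketched in plain Java, modeling each message simply as a returned map of local counts (class and method names are illustrative; a real implementation would exchange the counts via an MPI-style send):

```java
import java.util.*;

// Sketch of the message-passing scheme: each "slave" counts words on its own
// partition, then the "master" sums up the local count vectors.
public class MessagePassingWordCount {
    // Slave: local counts over one partition of the documents.
    static Map<String, Integer> slave(List<List<String>> partition) {
        Map<String, Integer> local = new HashMap<>();
        for (List<String> doc : partition)
            for (String w : doc)
                local.merge(w, 1, Integer::sum);
        return local;   // stands in for send(addr_master, c)
    }

    // Master: c_global <- c_global + c_p for every received message.
    static Map<String, Integer> master(List<Map<String, Integer>> messages) {
        Map<String, Integer> global = new HashMap<>();
        for (Map<String, Integer> c : messages)
            c.forEach((w, n) -> global.merge(w, n, Integer::sum));
        return global;
    }
}
```

Note that no locks are needed: each slave works on private memory, and the only coordination is the final exchange of results.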
Paradigms - Message passing
- We need to manually assign master and slave roles to each processor
- Partitioning of the data needs to be done manually
- Implementations like Open MPI only provide services to exchange messages
Map-Reduce
- Builds on the distributed message passing paradigm
- Considers the data partitioned over the nodes
- Pipelined procedure:
  1. Map phase
  2. Reduce phase
- High level abstraction: the programmer only specifies a map and a reduce routine
Map-Reduce
- No need to worry about how many processors are available
- No need to specify which ones will be mappers and which ones will be reducers
Key-Value input data
- Map-Reduce requires the data to be stored in a key-value format
- Natural if one works with column databases
- Examples:

    Key        Value
    document   array of words
    document   word
    user       movies
    user       friends
    user       tweet
The Paradigm - Formally
Given:
- a set of input keys I
- a set of output keys O
- a set of input values X
- a set of intermediate values V
- a set of output values Y
we can define

    map : I × X → P(O × V)

and

    reduce : O × P(V) → O × Y

where P denotes the power set.
The Paradigm - Informally
1. Each mapper transforms some key-value pairs into a set of pairs of an output key and an intermediate value
2. All intermediate values are grouped according to their output keys
3. Each reducer receives some pairs of a key and all its intermediate values
4. Each reducer aggregates, for each key, all its intermediate values into one final value
Word Count Example
Map:
- Input: document-word list pairs
- Output: word-count pairs

    (d_k, ⟨w_1, ..., w_m⟩) ↦ [(w_i, c_i)]

Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs

    (w_i, [c_i]) ↦ (w_i, Σ_{c ∈ [c_i]} c)
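A minimal in-memory sketch of these two routines for word count, with the shuffle done by grouping the mapper output on the output key (all names are illustrative; this is not the Hadoop API):

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the paradigm: map emits (word, 1) pairs, the framework groups
// them by key, and reduce sums each group of intermediate values.
public class MapReduceSketch {
    // map: one document -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(List<String> words) {
        return words.stream()
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // reduce: (word, list of counts) -> total count for that word
    static int reduce(String key, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    // Run all mappers, shuffle (group by output key), then all reducers.
    static Map<String, Integer> run(List<List<String>> docs) {
        Map<String, List<Integer>> grouped = docs.stream()
                .flatMap(d -> map(d).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue,
                                Collectors.toList())));
        Map<String, Integer> out = new HashMap<>();
        grouped.forEach((k, v) -> out.put(k, reduce(k, v)));
        return out;
    }
}
```

The grouping step plays the role of the shuffle phase; in a real framework it involves moving the intermediate data across the network between mappers and reducers.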
Word Count Example
[diagram: seven documents, e.g. (d1, "love ain't no stranger"), (d2, "crying in the rain"), (d3, "looking for love"), ..., are distributed over the mappers; each mapper emits pairs such as (love, 1) and (crying, 1); the shuffle groups the pairs by word, and the reducers sum them up, yielding e.g. (love, 5), (crying, 2), (ain't, 2)]
Map

public static class Map
    extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Reduce

public static class Reduce
    extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {

    int sum = 0;
    while (values.hasNext())
      sum += values.next().get();

    output.collect(key, new IntWritable(sum));
  }
}
Execution snippet

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}
Considerations
- Maps are executed in parallel
- Reduces are executed in parallel
- Bottleneck: reducers can only execute after all the mappers have finished
Fault tolerance
When the master node detects node failures:
- re-executes completed and in-progress map tasks
- re-executes in-progress reduce tasks
When the master node detects particular key-value pairs that cause mappers to crash:
- the problematic pairs are skipped in the execution
Parallel Efficiency of Map-Reduce
- We have p processors for performing map and reduce operations
- Time to perform a task T on data D: t(T, 1) = wD
- Time for producing the intermediate data σD after the map phase: t(T_inter, 1) = σD
- Overheads:
  - intermediate data per mapper: σD/p
  - each of the p reducers needs to read one p-th of the data written by each of the p mappers: (σD/p) · (1/p) · p = σD/p
- Time for performing the task with Map-Reduce:

    t_MR(T, p) = wD/p + 2K · σD/p

  where K is a constant representing the overhead of IO operations (reading and writing data to disk)
Parallel Efficiency of Map-Reduce
- Time for performing the task on one processor: wD
- Time for performing the task with p processors on Map-Reduce: t_MR(T, p) = wD/p + 2K · σD/p
- Efficiency equation: e(T, p) = t(T, 1) / (p · t(T, p))
- Efficiency of Map-Reduce:

    e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))
Parallel Efficiency of Map-Reduce

    e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))
               = wD / (wD + 2KσD)
               = 1 / (1 + 2K · σ/w)
Parallel Efficiency of Map-Reduce

    e_MR(T, p) = 1 / (1 + 2K · σ/w)

- Apparently the efficiency is independent of p
- High speedups can be achieved with a large number of processors
- If σ is high (too much intermediate data), the efficiency deteriorates
- In many cases σ depends on p
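The closed form above can be evaluated for a few illustrative values of K, σ and w; the numbers below are assumptions chosen for the example, not measurements:

```java
// Sketch: evaluating e_MR(T, p) = 1 / (1 + 2K * sigma / w).
// The efficiency depends only on the IO constant K and the ratio sigma / w.
public class MapReduceEfficiency {
    static double efficiency(double K, double sigma, double w) {
        return 1.0 / (1.0 + 2.0 * K * sigma / w);
    }

    public static void main(String[] args) {
        double K = 1.0, w = 10.0;
        // Little intermediate data: efficiency stays close to 1.
        System.out.println(efficiency(K, 1.0, w));
        // As much intermediate data as work (sigma = w): efficiency drops to 1/3.
        System.out.println(efficiency(K, 10.0, w));
    }
}
```

This makes the last bullet concrete: as σ grows relative to w, the term 2Kσ/w dominates and the efficiency deteriorates regardless of how many processors are used.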
More informationILO3:Algorithms and Programming Patterns for Cloud Applications (Hadoop)
DISTRIBUTED RESEARCH ON EMERGING APPLICATIONS & MACHINES dream-lab.in Indian Institute of Science, Bangalore DREAM:Lab SE252:Lecture 13-14, Feb 24/25 ILO3:Algorithms and Programming Patterns for Cloud
More informationHadoop 3 Configuration and First Examples
Hadoop 3 Configuration and First Examples Big Data - 26/03/2018 Apache Hadoop & YARN Apache Hadoop (1.X) De facto Big Data open source platform Running for about 5 years in production at hundreds of companies
More informationW1.A.0 W2.A.0 1/22/2018 1/22/2018. CS435 Introduction to Big Data. FAQs. Readings
CS435 Introduction to Big Data 1/17/2018 W2.A.0 W1.A.0 CS435 Introduction to Big Data W2.A.1.A.1 FAQs PA0 has been posted Feb. 6, 5:00PM via Canvas Individual submission (No team submission) Accommodation
More informationBig Data Overview. Nenad Jukic Loyola University Chicago. Abhishek Sharma Awishkar, Inc. & Loyola University Chicago
Big Data Overview Nenad Jukic Loyola University Chicago Abhishek Sharma Awishkar, Inc. & Loyola University Chicago Introduction Three Types of Data stored in Corporations and Organizations Transactional
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 2: MapReduce Algorithm Design (1/2) January 10, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationParallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case
1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems
More informationTopics covered in this lecture
9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.0 CS435 Introduction to Big Data 9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.1 FAQs How does Hadoop mapreduce run the map instance?
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany GraphLab GraphLab 1 / 34 Outline 1. Introduction
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics July 14, 2017 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationMapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia
MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,
More informationDatabase System Architectures Parallel DBs, MapReduce, ColumnStores
Database System Architectures Parallel DBs, MapReduce, ColumnStores CMPSCI 445 Fall 2010 Some slides courtesy of Yanlei Diao, Christophe Bisciglia, Aaron Kimball, & Sierra Michels- Slettvet Motivation:
More informationComputer Science 572 Exam Prof. Horowitz Monday, November 27, 2017, 8:00am 9:00am
Computer Science 572 Exam Prof. Horowitz Monday, November 27, 2017, 8:00am 9:00am Name: Student Id Number: 1. This is a closed book exam. 2. Please answer all questions. 3. There are a total of 40 questions.
More informationGuidelines For Hadoop and Spark Cluster Usage
Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset
More informationExperiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018
Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018 abstract In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However,
More informationMapReduce: Simplified Data Processing on Large Clusters 유연일민철기
MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,
More informationEnter the Elephant. Massively Parallel Computing With Hadoop. Toby DiPasquale Chief Architect Invite Media, Inc.
Enter the Elephant Massively Parallel Computing With Hadoop Toby DiPasquale Chief Architect Invite Media, Inc. Philadelphia Emerging Technologies for the Enterprise March 26, 2008 Image credit, http,//www.depaulca.org/images/blog_1125071.jpg
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationApache Spark. Easy and Fast Big Data Analytics Pat McDonough
Apache Spark Easy and Fast Big Data Analytics Pat McDonough Founded by the creators of Apache Spark out of UC Berkeley s AMPLab Fully committed to 100% open source Apache Spark Support and Grow the Spark
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationI N F O R M A T I O N R E T R I E V A L A SSIGNMENT 6 P R O T O T Y P E O F A B I L I N G U A L S EARCH E NGINE. Architecture
Even Cheng 41123717 Karthik Raman - 95687560 CS221-IR March 18, 2010 I N F O R M A T I O N R E T R I E V A L A SSIGNMENT 6 P R O T O T Y P E O F A B I L I N G U A L S EARCH E NGINE Suppose you are bilingual.
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 7: Parallel Computing Cho-Jui Hsieh UC Davis May 3, 2018 Outline Multi-core computing, distributed computing Multi-core computing tools
More informationCSE 344 MAY 2 ND MAP/REDUCE
CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data
More informationChapter 3. Distributed Algorithms based on MapReduce
Chapter 3 Distributed Algorithms based on MapReduce 1 Acknowledgements Hadoop: The Definitive Guide. Tome White. O Reilly. Hadoop in Action. Chuck Lam, Manning Publications. MapReduce: Simplified Data
More informationBig Data Analytics: Insights and Innovations
International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations
More informationThe MapReduce Framework
The MapReduce Framework In Partial fulfilment of the requirements for course CMPT 816 Presented by: Ahmed Abdel Moamen Agents Lab Overview MapReduce was firstly introduced by Google on 2004. MapReduce
More information1/30/2019 Week 2- B Sangmi Lee Pallickara
Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable
More informationParallel Computing. Prof. Marco Bertini
Parallel Computing Prof. Marco Bertini Apache Hadoop Chaining jobs Chaining MapReduce jobs Many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationLocal MapReduce debugging
Local MapReduce debugging Tools, tips, and tricks Aaron Kimball Cloudera Inc. July 21, 2009 urce: Wikipedia Japanese rock garden Common sense debugging tips Build incrementally Build compositionally Use
More informationUsing Numerical Libraries on Spark
Using Numerical Libraries on Spark Brian Spector London Spark Users Meetup August 18 th, 2015 Experts in numerical algorithms and HPC services How to use existing libraries on Spark Call algorithm with
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationMapReduce. Arend Hintze
MapReduce Arend Hintze Distributed Word Count Example Input data files cat * key-value pairs (0, This is a cat!) (14, cat is ok) (24, walk the dog) Mapper map() function key-value pairs (this, 1) (is,
More informationMapReduce Algorithms
Large-scale data processing on the Cloud Lecture 3 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License) Outline
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationCS435 Introduction to Big Data Spring 2018 Colorado State University. 2/5/2018 Week 4-A Sangmi Lee Pallickara. FAQs. Total Order Sorting Pattern
W4.A.0.0 CS435 Introduction to Big Data W4.A.1 FAQs PA0 submission is open Feb. 6, 5:00PM via Canvas Individual submission (No team submission) If you have not been assigned the port range, please contact
More informationBig Data: Tremendous challenges, great solutions
Big Data: Tremendous challenges, great solutions Luc Bougé ENS Rennes Alexandru Costan INSA Rennes Gabriel Antoniu INRIA Rennes Survive the data deluge! Équipe KerData 1 Big Data? 2 Big Picture The digital
More information