Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)

So far, we have... storage as file system (HDFS) and storage as tables (HBase).

Data is only useful if we can query it... in parallel. [Stack diagram, top to bottom: Querying; Storage as tables (HBase); Storage as file system (HDFS)]

Data Processing: Input data -> Query -> Output data

MapReduce

Data Processing: data comes in chunks [diagram: Query]

Data Processing: the ideal case [diagram: eight Query boxes side by side]

Data Processing: the worst case

Data Processing: the typical case

Data Processing: Map here... and shuffle there

A common and useful sub-case: MapReduce

Input data -> Map (x8) -> Shuffle -> Reduce (x8) -> Output data

Data Processing: Data Model
  Input data
  Map (x8)
  Intermediate data (shuffled)
  Reduce (x8)
  Output data

Data Processing: Data Shape
  Key-value pairs
  Map (x8)
  Key-value pairs
  Reduce (x8)
  Key-value pairs

Data Processing: Data Types
  Input: key type 1 -> value type 1
  Map (x8)
  Intermediate: key type I -> value type I
  Reduce (x8)
  Output: key type A -> value type A

Data Processing: Most often
  Input: key type 1 -> value type 1
  Map (x8)
  Intermediate: key type A -> value type A
  Reduce (x8)
  Output: key type A -> value type A
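The type flow on these slides can be written down as two generic interfaces. This is a minimal sketch in plain Java, not the Hadoop API: the names `MapFunction`, `ReduceFunction`, `TOKENIZE`, and `SUM` are hypothetical, and the word-count types are only an illustration of the "most often" case, where the intermediate and output types coincide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

final class TypedStagesSketch {
    // Map: one (K1, V1) pair in, zero or more (K2, V2) pairs out via emit.
    interface MapFunction<K1, V1, K2, V2> {
        void map(K1 key, V1 value, BiConsumer<K2, V2> emit);
    }

    // Reduce: one key with all its V2 values in, zero or more (K3, V3) pairs out.
    interface ReduceFunction<K2, V2, K3, V3> {
        void reduce(K2 key, Iterable<V2> values, BiConsumer<K3, V3> emit);
    }

    // Input type: (Long, String). Intermediate type: (String, Integer).
    static final MapFunction<Long, String, String, Integer> TOKENIZE =
        (offset, line, emit) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    emit.accept(word, 1);
                }
            }
        };

    // Intermediate and output types are both (String, Integer).
    static final ReduceFunction<String, Integer, String, Integer> SUM =
        (word, counts, emit) -> {
            int total = 0;
            for (int count : counts) {
                total += count;
            }
            emit.accept(word, total);
        };

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        TOKENIZE.map(0L, "a b a", (k, v) -> emitted.add(k + "=" + v));
        SUM.reduce("a", List.of(1, 1), (k, v) -> emitted.add(k + "=" + v));
        System.out.println(emitted);
    }
}
```

The compiler enforces that the map output types match the reduce input types, which is the constraint the slides express with type 1, type I, and type A.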

Splitting: the input is split into key-value pairs with keys key 1, key 2, key 3, key 4

Mapping function: key 1 -> Map -> key I, key II

Mapping function... in parallel
  key 1 -> Map -> key I, key II
  key 2 -> Map -> key I, key III
  key 3 -> Map -> key II, key III

Put it all together: key I, key II, key I, key III, key II, key III

Sort by key: key I, key II, key I, key III, key III, key II, key I -> key I, key I, key I, key II, key II, key III, key III

Partition: key I, key I, key I | key II, key II | key III, key III

Reduce function: key I, key I, key I -> Reduce -> key A

Reduce function (with identical key sets): key A, key A, key A (values A, B, C) -> Reduce -> key A

Reduce function (most generic): key I, key I, key I -> Reduce -> key A (key B). More is fine, but uncommon.

Reduce function... in parallel
  key I, key I, key I -> Reduce -> key A
  key II, key II -> Reduce -> key B
  key III, key III -> Reduce -> key C

Overall: Map -> Sort -> Partition -> Reduce
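The Map, Sort, Partition, Reduce sequence can be simulated on a single machine. The following is a minimal in-memory sketch in plain Java, assuming a word-count query; the class and method names are hypothetical, and nothing here runs in parallel or uses Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceSketch {
    // Map: one input line yields zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: all values sharing a key are folded into one value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Sort and partition: a TreeMap keeps keys sorted and groups values by key.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce: one call per key group.
        TreeMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real cluster each `map` call and each `reduce` call could run on a different node; only the grouping step in the middle requires moving data between them.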

Input/Output formats

Input and output formats: from/to tables, from/to files

Formats: tabular: RDBMS, HBase [table figure: Row ID A1 1E0 22A 4A2]

Formats: files (e.g., from HDFS): Text, KeyValue, SequenceFile

Text files [example text: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed...]

Text files: NLine [example text: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed...]

Key-Value [example text: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... split into keys (Lorem sit consectetur adipiscing...) and values (ipsum dolor amet, elit, sed)]
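How a line of text becomes a key-value pair differs between the text and key-value formats. The sketch below is plain Java under two labeled assumptions that the slides do not state: for a text file, the key is the offset at which the line starts and the value is the line; for a key-value file, the line is cut at the first separator character (a tab in the example). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TextRecordsSketch {
    // Text file: one record per line. Assumed key = offset of the line start,
    // assumed value = the line. Offsets are counted in chars (same as bytes for ASCII).
    static List<Map.Entry<Long, String>> textRecords(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return records;
    }

    // Key-value file: key = text before the first separator, value = the rest.
    // A line without a separator becomes a key with an empty value.
    static Map.Entry<String, String> keyValueRecord(String line, char separator) {
        int cut = line.indexOf(separator);
        if (cut < 0) {
            return Map.entry(line, "");
        }
        return Map.entry(line.substring(0, cut), line.substring(cut + 1));
    }

    public static void main(String[] args) {
        System.out.println(textRecords("Lorem ipsum\ndolor sit amet"));
        System.out.println(keyValueRecord("Lorem\tipsum dolor", '\t'));
    }
}
```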

Sequence files: Hadoop binary format. Stores generic key-values. Record layout: KeyLength, Key, ValueLength, Value
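The KeyLength, Key, ValueLength, Value layout is a length-prefixed encoding. The sketch below illustrates only that idea in plain Java; it assumes 4-byte lengths and UTF-8 strings, and it is not the byte-exact Hadoop sequence file format (which also has headers and other fields). `LengthPrefixedRecord` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedRecord {
    // Write KeyLength, Key, ValueLength, Value in that order.
    static byte[] encode(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(4 + k.length + 4 + v.length);
        buffer.putInt(k.length).put(k).putInt(v.length).put(v);
        return buffer.array();
    }

    // Read the lengths first, so the reader knows how many bytes each field takes.
    static String[] decode(byte[] record) {
        ByteBuffer buffer = ByteBuffer.wrap(record);
        byte[] k = new byte[buffer.getInt()];
        buffer.get(k);
        byte[] v = new byte[buffer.getInt()];
        buffer.get(v);
        return new String[] {
            new String(k, StandardCharsets.UTF_8),
            new String(v, StandardCharsets.UTF_8)
        };
    }

    public static void main(String[] args) {
        byte[] record = encode("Lorem", "ipsum dolor");
        String[] pair = decode(record);
        System.out.println(record.length + " bytes: " + pair[0] + " -> " + pair[1]);
    }
}
```

Because every field announces its own length, keys and values can hold arbitrary bytes, including newlines and separators that would break a text format.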

Optimization

Optimization
  Input: key type 1 -> value type 1
  Mapper (Map x6)
  Intermediate: key type A -> value type A
  Reducer (Reduce x6)
  Output: key type A -> value type A

How to reduce* the amount of data shuffled around? (*pun intended; Eselsbrücke, i.e. a mnemonic)

Optimization: Combine
  Input: key type 1 -> value type 1
  Mapper (Map x6)
  key type A -> value type A
  Combine (x6)
  key type A -> value type A
  Reducer (Reduce x6)
  Output: key type A -> value type A

Combine: the 90% case. Often, the combine function is identical to the reduce function (Combine = Reduce). Disclaimer: there are assumptions.

Combine = Reduce: Assumption 1. Key/value types must be identical for reduce input and output: key type A -> value type A, Reduce, key type A -> value type A.

Combine = Reduce: Assumption 2. The reduce function must be commutative (key A with values A and B) and associative (key A with values A, B, and C).
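The two assumptions can be checked on a small example. This plain-Java sketch is an illustration, not taken from the slides: it compares a sum, which is commutative and associative and keeps the same type, with a mean, which does not survive being applied to partial results. The mapper outputs are hypothetical.

```java
import java.util.List;

final class CombineSketch {
    // Sum: commutative, associative, and integer in, integer out.
    // The same function can therefore serve as combine and as reduce.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Mean: a mean of partial means is generally not the overall mean,
    // so it cannot be reused as a combine function as-is.
    static double mean(List<Double> values) {
        double total = 0;
        for (double value : values) {
            total += value;
        }
        return total / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical values emitted for one key by two different mappers.
        List<Integer> fromMapper1 = List.of(1, 2, 3);
        List<Integer> fromMapper2 = List.of(4);

        int reducedDirectly = sum(List.of(1, 2, 3, 4));
        int combinedThenReduced = sum(List.of(sum(fromMapper1), sum(fromMapper2)));
        System.out.println(reducedDirectly + " vs " + combinedThenReduced); // 10 vs 10

        double meanDirectly = mean(List.of(1.0, 2.0, 3.0, 4.0));
        double meanOfMeans = mean(List.of(mean(List.of(1.0, 2.0, 3.0)), mean(List.of(4.0))));
        System.out.println(meanDirectly + " vs " + meanOfMeans); // 2.5 vs 3.0
    }
}
```

A common workaround for the mean is to combine (sum, count) pairs and divide only in the final reduce, at the cost of no longer having identical combine and reduce functions.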

Optimization: Bring the Query to the Data

MapReduce: the APIs

Supported frameworks: Hadoop MapReduce (Java, Streaming)

Java API: Mapper

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyOwnMapper extends Mapper<K1, V1, K2, V2> {
        @Override
        public void map(K1 key, V1 value, Context context)
                throws IOException, InterruptedException {
            ...
            K2 newKey = ...
            V2 newValue = ...
            context.write(newKey, newValue);
            ...
        }
    }

Java API: Reducer

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyOwnReducer extends Reducer<K2, V2, K3, V3> {
        @Override
        public void reduce(K2 key, Iterable<V2> values, Context context)
                throws IOException, InterruptedException {
            ...
            K3 newKey = ...
            V3 newValue = ...
            context.write(newKey, newValue);
            ...
        }
    }

Java API: Job

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyMapReduceJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setMapperClass(MyOwnMapper.class);
            job.setReducerClass(MyOwnReducer.class);
            FileInputFormat.addInputPath(job, ...);
            FileOutputFormat.setOutputPath(job, ...);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Java API: Combiner (=Reducer)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyMapReduceJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setMapperClass(MyOwnMapper.class);
            // The reducer class doubles as the combiner.
            job.setCombinerClass(MyOwnReducer.class);
            job.setReducerClass(MyOwnReducer.class);
            FileInputFormat.addInputPath(job, ...);
            FileOutputFormat.setOutputPath(job, ...);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Java API: InputFormat classes

    InputFormat
      DBInputFormat (RDBMS)
      TableInputFormat (HBase)
      FileInputFormat
        KeyValueTextInputFormat (key-value file)
        SequenceFileInputFormat (sequence file)
        TextInputFormat (text)
        FixedLengthInputFormat
        NLineInputFormat

Java API: OutputFormat classes

    OutputFormat
      DBOutputFormat (RDBMS)
      TableOutputFormat (HBase)
      FileOutputFormat
        SequenceFileOutputFormat (sequence file)
        TextOutputFormat (text)
        MapFileOutputFormat

MapReduce: the physical layer

Possible storage layers for Hadoop MapReduce: Local Filesystem, HDFS, S3, Azure Blob Storage

Hadoop MapReduce: Numbers. Several TBs of data. 1000s of nodes. [Diagram: many Map tasks over the data]

Hadoop infrastructure (version 1): Namenode, Datanode (x6)

Master-slave architecture: Master, Slave (x6)

Hadoop infrastructure (version 1): Namenode + JobTracker, Datanode + TaskTracker (x6). Bring the Query to the Data.

Tasks: a task is either a map task or a reduce task.

Splits [diagram: eight Map tasks and eight Reduce tasks]

Splits vs. map tasks: 1 split = 1 map task

In practice: 1 split = 1 block (subject to min and max size)

Splits vs. blocks: possible confusion
  Logical level (MapReduce): split, made of records (key/value pairs)
  Physical level (HDFS): block, made of bits

Records across blocks: a record in a split may run past the end of its block, which requires a remote read.

Fine-tuning to adjust splits to blocks
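The rule "1 split = 1 block (subject to min and max size)" can be read as clamping the block size between two bounds. The following plain-Java sketch assumes exactly that reading; the parameter names `minSize` and `maxSize` are hypothetical, and the split count is a simple ceiling division that ignores how a real implementation may treat a small trailing remainder.

```java
final class SplitSizeSketch {
    // Clamp the block size between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and hence map tasks) for a file, by ceiling division.
    static long numberOfSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long blockSize = 128 * mib;
        long fileSize = 1024 * mib;

        // Default bounds: the split size equals the block size.
        long defaultSplit = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        System.out.println(numberOfSplits(fileSize, defaultSplit) + " splits");

        // A larger minimum yields fewer, bigger splits that span several blocks.
        long biggerSplit = computeSplitSize(blockSize, 256 * mib, Long.MAX_VALUE);
        System.out.println(numberOfSplits(fileSize, biggerSplit) + " splits");

        // A smaller maximum yields more, smaller splits within one block.
        long smallerSplit = computeSplitSize(blockSize, 1, 64 * mib);
        System.out.println(numberOfSplits(fileSize, smallerSplit) + " splits");
    }
}
```

With the default bounds, splits line up with blocks, so each map task can read its split from a local block; moving either bound trades that locality against the number of map tasks.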

Hadoop infrastructure (version 1) [diagram: Namenode + JobTracker, file /dir/file, six Datanode + TaskTracker nodes]

Hadoop infrastructure: map tasks. As many map tasks as splits. Occasionally not possible to co-locate task and block. [Diagram: map tasks (M) placed on Datanode + TaskTracker nodes]

Hadoop infrastructure: reduce tasks. A few reduce tasks. [Diagram: reduce tasks (R) placed on Datanode + TaskTracker nodes]

Hadoop infrastructure: shuffling (in between) [diagram: map tasks (M) and reduce tasks (R) on the Datanode + TaskTracker nodes]
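With only a few reduce tasks, every intermediate key has to be routed to exactly one of them during shuffling. The slides do not say how; the sketch below assumes a hash-based assignment, in plain Java, with the hypothetical names `HashPartitionSketch` and `partitionFor`.

```java
final class HashPartitionSketch {
    // Route a key to one of numReduceTasks reduce tasks.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String key : new String[] {"key I", "key II", "key III"}) {
            System.out.println(key + " -> reduce task " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the assignment depends only on the key, all pairs with the same key end up at the same reduce task, no matter which mapper produced them.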

Shuffling phase: each mapper sorts its output key-value pairs. The reducer gets its key-value pairs from the mappers over HTTP.

Spilling to disk: key-value pairs are spilled to disk if necessary.

Issue 1: Tight coupling (Namenode + JobTracker, Datanode + TaskTracker)

Issue 2: Scalability. Namenode + JobTracker: only one!
