Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce)

Size: px

Start display at page:

Download "Ghislain Fourny. Big Data Fall Massive Parallel Processing (MapReduce)"

Abigail Allen
5 years ago
Views:

1 Ghislain Fourny Big Data Fall Massive Parallel Processing (MapReduce)

2 Let's begin with a field experiment 2

3 400+ Pokemons, 10 different 3

4 How many of each??????????? 4

5 400 distributed to many volunteers 5

6 Task 1 (10 people) 6

7 Task 1 (10 people)

8 Task 2 (8 people) 8

9 Task 2 (8 people) part 1 aka "The big mess"

10 Task 2 (8 people) part

11 Final summary

12 Let's go! 12

13 So far, we have... Storage as file system (HDFS) 13

14 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14

15 Data is only useful if we can query it Querying Storage as tables (HBase) Storage as file system (HDFS) 15

16 ... in parallel Querying Storage as tables (HBase) Storage as file system (HDFS) 16

17 Data Processing Input data 17

18 Data Processing Input data Query 18

19 Data Processing Input data Query Output data 19

20 MapReduce 20

21 Data Processing: data comes in chunks Query 21

22 Data Processing: the ideal case Query Query Query Query Query Query Query Query 22

23 Data Processing: the worst case 23

24 Data Processing: the typical case 24

25 Data Processing: Map here... 25

26 Data Processing:... and shuffle there 26

27 A common and useful sub-case: MapReduce Input data 27

28 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map 28

29 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle 29

30 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 30

31 A common and useful sub-case: MapReduce Input data Map Map Map Map Map Map Map Map Shuffle Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 31

32 Data Processing: Data Model Input data Map Map Map Map Map Map Map Map Intermediate data (shuffled) Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Output data 32

33 Data Processing: Data Shape Key- pairs Map Map Map Map Map Map Map Map Key- pairs Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce Key- pairs 33

34 Data Processing: Data Types key type 1 -> type 1 Map Map Map Map Map Map Map Map key type I -> type I Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 34

35 Data Processing: Most often key type 1 -> type 1 Map Map Map Map Map Map Map Map key type A -> type A Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 35

36 Splitting 36

37 Splitting Split 37

38 Splitting key 1 Split key 2 key 3 key 4 38

39 Mapping function key 1 39

40 Mapping function key 1 Map 40

41 Mapping function key 1 Map key I key II 41

42 Mapping function... in parallel key 1 Map key I key II 42

43 Mapping function... in parallel key 1 Map key I key II key 2 Map key I key III 43

44 Mapping function... in parallel key 1 Map key I key II key 2 Map key I key III key 3 Map key II key III 44

45 Put it all together key I key II key I key III key II key III 45

46 Put it all together key I key II key I key III key II key III 46

47 Put it all together key I key II key I key II key I key I key III key III key II key II key III key III 47

48 Sort by key key I key II key I key III key III key II key I 48

49 Sort by key key I key I key II key I key I key I key III key II key III key II key II key III key I key III 49

50 Partition key I key I key I key II key II key III key III 50

51 Partition key I key I key I key II key II key III key III 51

52 Partition key I key I key I key I key I key I key II key II key II key II key III key III key III key III 52

53 Reduce function key I key I key I 53

54 Reduce function key I key I key I Reduce 54

55 Reduce function key I key I key I Reduce key A 55

56 Reduce function (with identical key sets) key A key A key A A B C Reduce key A 56

57 Reduce function (most generic) key I key I key I Reduce key A ( key B ) More is fine, but uncommon 57

58 Reduce function... in parallel key I key I key I Reduce key A 58

59 Reduce function... in parallel key I key I key I Reduce key A key II key II Reduce key B 59

60 Reduce function... in parallel key I key I key I Reduce key A key II key II Reduce key B key III key III Reduce key C 60

61 Overall 61

62 Overall Map 62

63 Overall Map 63

64 Overall Map Sort 64

65 Overall Map Sort 65

66 Overall Map Sort Partition 66

67 Overall Map Sort Partition 67

68 Overall Map Sort Partition Reduce 68

69 Overall Map Sort Partition Reduce 69

70 Input/Output formats 70

71 Input and output formats 71

72 Input and output formats From/to tables 72

73 Input and output formats From/to tables From/to files 73

74 Formats: tabular 74

75 Formats: tabular RDBMS 75

76 Formats: tabular RDBMS Row ID A1 1E0 22A 4A2 HBase 76

77 Formats: tabular 77

78 Formats: tabular 78

79 Formats: files (e.g., from HDFS) 79

80 Formats: files (e.g., from HDFS) Text 80

81 Formats: files (e.g., from HDFS) Text KeyValue 81

82 Formats: files (e.g., from HDFS) Text KeyValue SequenceFile 82

83 Text files Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 83

84 Text files Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed 84

85 Text files: NLine Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 85

86 Text files: NLine Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed 86

87 Key-Value Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... 87

88 Key-Value Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed... Lorem sit consectetur adipiscing... ipsum dolor amet, elit, sed 88

89 Sequence files Hadoop binary format Stores generic key-s 89

90 Sequence files Hadoop binary format Stores generic key-s KeyLength Key ValueLength Value 90

91 Optimization 91

92 Optimization key type 1 -> type 1 Mapper Map Map Map Map Map Map key type A -> type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 92

93 Optimization How to reduce* the amount of data key type shuffled 1 -> around? type 1 Mapper *pun intended (Eselsbrücke) Map Map Map Map Map Map key type A -> type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 93

94 Optimization: Combine key type 1 -> type 1 Mapper Map Map Map Map Map Map key type A -> type A Combine Combine Combine Combine Combine Combine key type A -> type A Reducer Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 94

95 Combine: the 90% case 95

96 Combine: the 90% case Often, the combine function is identical to the reduce function. Combine Reduce Disclaimer: there are assumptions 96

97 Combine=Reduce: Assumption 1 Key/Value types must be identical for reduce input and output. key type A -> type A Reduce Reduce Reduce Reduce Reduce Reduce key type A -> type A 97

98 Combine=Reduce : Assumption 2 98

99 Combine=Reduce : Assumption 2 Reduce function must be Commutative key A key A A B 99

100 Combine=Reduce : Assumption 2 Reduce function must be Commutative key A key A A B and Associative key A key A key A A B C 100

101 Optimization: Bring the Query to the Data Query Data 101

102 MapReduce: the APIs 102

103 Supported frameworks Hadoop MapReduce 103

104 Supported frameworks Hadoop MapReduce Java Streaming 104

105 Supported frameworks Hadoop MapReduce Java Streaming 105

106 Java API: Mapper import org.apache.hadoop.mapreduce.mapper; public class MyOwnMapper extends Mapper<K1, V1, K2, V2>{ } public void map(k1 key, V1, Context context) throws IOException, InterruptedException {... K2 new-key =... V2 new- =... context.write(new-key, new-);... } 106

107 Java API: Mapper import org.apache.hadoop.mapreduce.mapper; public class MyOwnMapper extends Mapper<K1, V1, K2, V2>{ } public void map(k1 key, V1, Context context) throws IOException, InterruptedException {... K2 new-key =... V2 new- =... context.write(new-key, new-);... } 107

108 Java API: Mapper import org.apache.hadoop.mapreduce.mapper; public class MyOwnMapper extends Mapper<K1, V1, K2, V2>{ } public void map(k1 key, V1, Context context) throws IOException, InterruptedException {... K2 new-key =... V2 new- =... context.write(new-key, new-);... } 108

109 Java API: Reducer import org.apache.hadoop.mapreduce.reducer; public class MyOwnReducer extends Reducer<K2, V2, K3, V3>{ } public void reduce (K2 key, Iterable<V2> s, Context context) throws IOException, InterruptedException {... K3 new-key =... V3 new- =... context.write(new-key, new-);... } 109

110 Java API: Reducer import org.apache.hadoop.mapreduce.reducer; public class MyOwnReducer extends Reducer<K2, V2, K3, V3>{ } public void reduce (K2 key, Iterable<V2> s, Context context) throws IOException, InterruptedException {... K3 new-key =... V3 new- =... context.write(new-key, new-);... } 110

111 Java API: Reducer import org.apache.hadoop.mapreduce.reducer; public class MyOwnReducer extends Reducer<K2, V2, K3, V3>{ } public void reduce (K2 key, Iterable<V2> s, Context context) throws IOException, InterruptedException {... K3 new-key =... V3 new- =... context.write(new-key, new-);... } 111

112 Java API: Job import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 112

113 Java API: Job import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 113

114 Java API: Job import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 114

115 Java API: Job import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 115

116 Java API: Job import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 116

117 Java API: Combiner (=Reducer) import org.apache.hadoop.mapreduce.job; public class MyMapReduceJob { public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); Job job = Job.getInstance(conf, "word count"); job.setmapperclass(myownmapper.class); job.setcombinerclass(myownreducer.class); job.setreducerclass(myownreducer.class); FileInputFormat.addInputPath(job,...); FileOutputFormat.setOutputPath(job,...); } System.exit(job.waitForCompletion(true)? 0 : 1); 117

118 Java API: InputFormat classes InputFormat 118

119 Java API: InputFormat classes InputFormat DBInputFormat RDBMS 119

120 Java API: InputFormat classes InputFormat DBInputFormat RDBMS TableInputFormat HBase 120

121 Java API: InputFormat classes InputFormat DBInputFormat TableInputFormat FileInputFormat RDBMS HBase 121

122 Java API: InputFormat classes InputFormat DBInputFormat RDBMS TableInputFormat HBase FileInputFormat KeyValueTextInputFormat Key file 122

123 Java API: InputFormat classes InputFormat DBInputFormat TableInputFormat FileInputFormat RDBMS HBase KeyValueTextInputFormat SequenceFileInputFormat Key file Sequence file 123

124 Java API: InputFormat classes InputFormat DBInputFormat TableInputFormat FileInputFormat RDBMS HBase KeyValueTextInputFormat SequenceFileInputFormat TextInputFormat Key file Sequence file 124

125 Java API: InputFormat classes InputFormat DBInputFormat TableInputFormat FileInputFormat RDBMS HBase KeyValueTextInputFormat SequenceFileInputFormat TextInputFormat FixedLengthInputFormat Key file Sequence file Text 125

126 Java API: InputFormat classes InputFormat DBInputFormat TableInputFormat FileInputFormat RDBMS HBase KeyValueTextInputFormat SequenceFileInputFormat TextInputFormat FixedLengthInputFormat NLineInputFormat Key file Sequence file Text 126

127 Java API: OutputFormat classes OutputFormat DBOutputFormat RDBMS TableOutputFormat HBase FileoutputFormat SequenceFileOutputFormat TextOutputFormat Text MapFileOutputFormat Sequence file 127

128 MapReduce: the physical layer 128

129 Possible storage layers Hadoop MapReduce 129

130 Possible storage layers Hadoop MapReduce Local Filesystem HDFS S3 Azure Blob Storage 130

131 Possible storage layers Hadoop MapReduce Local Filesystem HDFS S3 Azure Blob Storage 131

132 Hadoop MapReduce: Numbers Several TBs of data Data 132

133 Hadoop MapReduce: Numbers Several TBs of data Data MapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMapMap Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map Map M 1000s of nodes 133

134 Hadoop infrastructure (version 1) Namenode Datanode Datanode Datanode Datanode Datanode Datanode 134

135 Master-slave architecture Master Slave Slave Slave Slave Slave Slave 135

136 Hadoop infrastructure (version 1) Namenode + JobTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 136

137 Hadoop infrastructure (version 1) Namenode + JobTracker Bring the Query to the Data Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 137

138 Tasks Task = or 138

139 Splits Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 139

140 Splits Map Map Map Map Map Map Map Map Reduce Reduce Reduce Reduce Reduce Reduce Reduce Reduce 140

141 Splits vs. map tasks Split 141

142 Splits vs. map tasks 1 split = 1 map task Split M M M M M 142

143 In practice M M M M 1 split Split 143

144 In practice M M M M 1 split = 1 block (subject to min and max size) Split Block 144

145 Splits vs. blocks: possible confusion Logical Level (MapReduce) Split Physical Level (HDFS) Block 145

146 Splits vs. blocks: possible confusion Logical Level (MapReduce) Split Record (key/ pair) Bit Physical Level (HDFS) Block 146

147 Records across blocks Logical Level (MapReduce) Split Physical Level (HDFS) Block 147

148 Records across blocks Logical Level (MapReduce) Split Remote read Physical Level (HDFS) Block 148

149 Fine-tuning to adjust splits to blocks Logical Level (MapReduce) Split Physical Level (HDFS) Block 149

150 Hadoop infrastructure (version 1) Namenode + JobTracker /dir/file Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 150

151 Hadoop infrastructure: map tasks Namenode + JobTracker As many map tasks as splits /dir/file M Datanode + TaskTracker M Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker 151

152 Hadoop infrastructure: map tasks As many map tasks as splits Namenode + JobTracker /dir/file Occasionally not possible to co-locate task and block M Datanode + TaskTracker M Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker 152

153 Hadoop infrastructure: reduce tasks A few reduce tasks Namenode + JobTracker /dir/file R R Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 153

154 Hadoop infrastructure: shuffling (inbetween) M R Namenode + JobTracker /dir/file M Datanode + TaskTracker R M Datanode + TaskTracker R M Datanode + TaskTracker Datanode + TaskTracker M Datanode + TaskTracker Datanode + TaskTracker 154

155 Shuffling phase Reducer Mappers 155

156 Shuffling phase Reducer Mappers Each mapper sorts its output key- pairs 156

157 Spilling to disk Key- pairs are spilled to disk if necessary 157

158 Shuffling phase Reducer Gets its key pairs over HTTP Mappers 158

159 Issue 1: Tight coupling Namenode + JobTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 159

160 Issue 2: Scalability Namenode + JobTracker Only one! Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker Datanode + TaskTracker 160

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data