Ghislain Fourny, Big Data, Fall 2018. 6. Massive Parallel Processing (MapReduce)
Let's begin with a field experiment
400+ Pokemons, 10 different kinds
How many of each?
400 Pokemons distributed to many volunteers
Task 1 (10 people): counts 1 3 1 2 1 1
Task 2 (8 people), part 1, aka "The big mess": 1 2 1 3 1
Task 2 (8 people), part 2: 1 2 1 8 3 1
Final summary: 8 6 4 2 10 5 3 7
Let's go!
So far, we have... storage as a file system (HDFS) and storage as tables (HBase).
Data is only useful if we can query it... in parallel.
Data Processing: Input data -> Query -> Output data
MapReduce
Data Processing: data comes in chunks.
Data Processing: the ideal case: one query instance per chunk, all running in parallel.
Data Processing: the worst case.
Data Processing: the typical case.
Data Processing: Map here... and shuffle there.
A common and useful sub-case: MapReduce. Input data -> Map -> Shuffle -> Reduce -> Output data
Data Processing: Data Model. Input data -> Map -> intermediate data (shuffled) -> Reduce -> Output data
Data Processing: Data Shape. Key-value pairs -> Map -> key-value pairs -> Reduce -> key-value pairs
Data Processing: Data Types. (key type 1, value type 1) -> Map -> (key type I, value type I) -> Reduce -> (key type A, value type A)
Data Processing: Most often, the map output types already match the reduce output types: (key type 1, value type 1) -> Map -> (key type A, value type A) -> Reduce -> (key type A, value type A)
Splitting: the input is split into key-value pairs with keys key 1, key 2, key 3, key 4, ...
Mapping function: an input pair (key 1, value) goes through Map, which emits intermediate pairs, here with keys I and II.
Mapping function... in parallel: (key 1) -> Map -> key I, key II; (key 2) -> Map -> key I, key III; (key 3) -> Map -> key II, key III
Put it all together: the intermediate pairs from all mappers are collected: key I, key II, key I, key III, key II, key III, ...
Sort by key: key I, key I, key I, key II, key II, key III, key III
Partition: key I, key I, key I | key II, key II | key III, key III
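Partitioning is what routes all pairs sharing a key to the same reduce task. As a hedged sketch (not part of the slides), the default behavior in Hadoop boils down to hashing the key modulo the number of reduce tasks:

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash partitioning: equal keys always map to the same partition,
// so each reducer sees all values for its keys.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    // Clear the sign bit to keep the result non-negative, then bucket by modulo.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}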
Reduce function: the group of pairs with key I goes through Reduce, which outputs a pair with key A.
Reduce function (with identical key sets): (key A, value A), (key A, value B), (key A, value C) -> Reduce -> (key A, value)
Reduce function (most generic): the output key (key A) may differ from the input key (key I), and emitting a second pair (key B) or more is fine, but uncommon.
Reduce function... in parallel: (key I, key I, key I) -> Reduce -> key A; (key II, key II) -> Reduce -> key B; (key III, key III) -> Reduce -> key C
Overall: Map -> Sort -> Partition -> Reduce
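To make the four phases concrete, here is a small, purely local Java sketch of the same pipeline, counting words instead of Pokemons; it involves no Hadoop at all, and all names in it are illustrative:

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {
  public static void main(String[] args) {
    List<String> splits = List.of("a b a", "b c", "c a");

    // Map: emit (word, 1) for every token of every split.
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String split : splits) {
      for (String word : split.split(" ")) {
        intermediate.add(new SimpleEntry<>(word, 1));
      }
    }

    // Sort + partition: group the pairs by key (a TreeMap keeps keys sorted).
    Map<String, List<Integer>> partitions = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      partitions.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Reduce: one call per key, over all of that key's values.
    partitions.forEach((word, ones) ->
        System.out.println(word + " -> " + ones.stream().mapToInt(Integer::intValue).sum()));
    // Prints: a -> 3, b -> 2, c -> 2
  }
}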
Input/Output formats
Input and output formats: from/to tables, or from/to files.
Formats: tabular: from/to an RDBMS, or from/to HBase (rows addressed by row IDs such as 000, 002, 0A1, 1E0, 22A, 4A2).
Formats: files (e.g., from HDFS): Text, KeyValue, SequenceFile.
Text files: each line becomes one record; the key is the byte offset at which the line starts (0, 18, 27, 38, ...), and the value is the line itself ("Lorem ipsum dolor sit amet,", "consectetur", ...).
Text files: NLine: the records are the same, but each split contains a fixed number N of lines; here, with N = 2, the splits start at offsets 0 and 27.
Key-Value files: each line is split at a separator; what precedes it becomes the key ("Lorem", "sit", "consectetur", "adipiscing", ...) and what follows becomes the value ("ipsum dolor", "amet,", "elit,", "sed", ...).
Sequence files: a Hadoop binary format that stores generic key-value pairs; each record is serialized as KeyLength, Key, ValueLength, Value.
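As a hedged sketch of how such a file can be written with the Hadoop 2.x API (the path and the pairs are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative path; on a cluster this would typically live in HDFS.
    Path path = new Path("pairs.seq");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class))) {
      // Each append serializes KeyLength, Key, ValueLength, Value.
      writer.append(new Text("pikachu"), new IntWritable(25));
      writer.append(new Text("bulbasaur"), new IntWritable(1));
    }
  }
}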
Optimization
Optimization: (key type 1, value type 1) -> Mapper (Map tasks) -> (key type A, value type A) -> Reducer (Reduce tasks) -> (key type A, value type A)
Optimization: how to reduce* the amount of data shuffled around? (*pun intended, as a memory aid)
Optimization: Combine. Each mapper runs a Combine step after Map, with identical input and output types: (key type 1, value type 1) -> Map -> (key type A, value type A) -> Combine -> (key type A, value type A) -> Reduce -> (key type A, value type A)
Combine: the 90% case. Often, the combine function is identical to the reduce function (Combine = Reduce). Disclaimer: there are assumptions.
Combine = Reduce, Assumption 1: key/value types must be identical for reduce input and output: (key type A, value type A) -> Reduce -> (key type A, value type A)
Combine = Reduce, Assumption 2: the reduce function must be commutative, i.e., reduce(value A, value B) = reduce(value B, value A), and associative, i.e., reduce(reduce(value A, value B), value C) = reduce(value A, reduce(value B, value C)).
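To see why this matters, note that counting, as in the field experiment, reduces by summation, which is commutative and associative, so any subset of a key's values may safely be pre-combined. Averaging is not associative, so an averaging reduce function cannot be reused as a combiner. A plain-Java illustration (no Hadoop):

public class CombinerAssumptions {
  static int sum(int a, int b) { return a + b; }
  static double mean(double a, double b) { return (a + b) / 2.0; }

  public static void main(String[] args) {
    // Sum: the grouping of values does not change the result.
    System.out.println(sum(sum(1, 2), 6)); // 9
    System.out.println(sum(1, sum(2, 6))); // 9
    // Mean: pre-combining a subset changes the result.
    System.out.println(mean(mean(1, 2), 6)); // 3.75, but the mean of {1, 2, 6} is 3.0
  }
}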
Optimization: bring the query to the data.
MapReduce: the APIs
Supported frameworks: Hadoop MapReduce can be programmed through its native Java API or through the Streaming API.
Java API: Mapper

import org.apache.hadoop.mapreduce.Mapper;

public class MyOwnMapper extends Mapper<K1, V1, K2, V2> {
  public void map(K1 key, V1 value, Context context)
      throws IOException, InterruptedException {
    ...
    K2 newKey = ...;
    V2 newValue = ...;
    context.write(newKey, newValue);
    ...
  }
}
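As a concrete instance of this template, the classic word-count mapper; it assumes TextInputFormat, so the input key is a byte offset (LongWritable) and the input value is one line of text:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The input key is the byte offset of the line; only the line itself is used.
    StringTokenizer it = new StringTokenizer(value.toString());
    while (it.hasMoreTokens()) {
      word.set(it.nextToken());
      context.write(word, ONE); // emit (word, 1) for every token
    }
  }
}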
Java API: Reducer

import org.apache.hadoop.mapreduce.Reducer;

public class MyOwnReducer extends Reducer<K2, V2, K3, V3> {
  public void reduce(K2 key, Iterable<V2> values, Context context)
      throws IOException, InterruptedException {
    ...
    K3 newKey = ...;
    V3 newValue = ...;
    context.write(newKey, newValue);
    ...
  }
}
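And the matching word-count reducer, which sums up the 1s emitted by the mapper for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result); // emit (word, total count)
  }
}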
Java API: Job

import org.apache.hadoop.mapreduce.Job;

public class MyMapReduceJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(MyOwnMapper.class);
    job.setReducerClass(MyOwnReducer.class);
    FileInputFormat.addInputPath(job, ...);
    FileOutputFormat.setOutputPath(job, ...);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
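Filling in the blanks for the word-count example, a complete driver could look as follows; in addition to the template above, it sets the jar and the output key/value types, and reads the paths from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);          // ship this jar to the cluster
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);           // output key type (map and reduce)
    job.setOutputValueClass(IntWritable.class);  // output value type (map and reduce)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}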
Java API: Combiner (= Reducer)

import org.apache.hadoop.mapreduce.Job;

public class MyMapReduceJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setMapperClass(MyOwnMapper.class);
    job.setCombinerClass(MyOwnReducer.class);
    job.setReducerClass(MyOwnReducer.class);
    FileInputFormat.addInputPath(job, ...);
    FileOutputFormat.setOutputPath(job, ...);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Java API: InputFormat classes

InputFormat
  DBInputFormat (RDBMS)
  TableInputFormat (HBase)
  FileInputFormat
    KeyValueTextInputFormat (key-value file)
    SequenceFileInputFormat (sequence file)
    TextInputFormat (text file)
    FixedLengthInputFormat
    NLineInputFormat
Java API: OutputFormat classes

OutputFormat
  DBOutputFormat (RDBMS)
  TableOutputFormat (HBase)
  FileOutputFormat
    SequenceFileOutputFormat (sequence file)
    TextOutputFormat (text file)
    MapFileOutputFormat
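Choosing formats from these hierarchies is a matter of one call per side on the Job. A hedged sketch (the helper and the particular format choices are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
  // Illustrative helper: read the input as N-line splits and write the
  // output as a binary sequence file instead of plain text.
  static void configureFormats(Job job) {
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000); // 1000 lines per split
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
  }
}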
MapReduce: the physical layer
Possible storage layers: Hadoop MapReduce can read from and write to the local file system, HDFS, S3, or Azure Blob Storage.
Hadoop MapReduce: numbers. Jobs process several TBs of data with map tasks spread over 1000s of nodes.
Hadoop infrastructure (version 1): a master-slave architecture: one Namenode (the master) and many Datanodes (the slaves).
For MapReduce, the Namenode is extended with a JobTracker and each Datanode with a TaskTracker, so that the query can be brought to the data.
Tasks: a task is either a map task or a reduce task.
Splits: the map tasks consume the input as splits; the reduce tasks consume the shuffled intermediate data.
Splits vs. map tasks: 1 split = 1 map task.
In practice: 1 split = 1 block (subject to a configured minimum and maximum size).
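The "subject to min and max size" rule is a simple clamp. Below is a sketch of the computation together with the FileInputFormat knobs a job can use to influence it; the 64 MB / 256 MB values are made up for illustration:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  // The split size is the block size, clamped into [minSize, maxSize]:
  // splitSize = max(minSize, min(maxSize, blockSize))
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  // Illustrative tuning: keep splits between 64 MB and 256 MB.
  static void tune(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 64L << 20);
    FileInputFormat.setMaxInputSplitSize(job, 256L << 20);
  }
}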
Splits vs. blocks: possible confusion. At the logical level (MapReduce), data consists of splits made of records (key/value pairs); at the physical level (HDFS), it consists of blocks made of bits.
Records across blocks: a record can straddle a block boundary, in which case the map task reading the split must fetch the rest of the record from the next block with a remote read.
Fine-tuning the split and block sizes helps align splits with blocks.
Hadoop infrastructure (version 1): a file /dir/file is stored as blocks spread across the Datanodes + TaskTrackers, with the Namenode + JobTracker keeping track of them.
Hadoop infrastructure: map tasks. There are as many map tasks as splits; each runs, whenever possible, on a Datanode + TaskTracker that holds the corresponding block. Occasionally it is not possible to co-locate a task and its block.
Hadoop infrastructure: reduce tasks. Only a few reduce tasks are started, on some of the Datanodes + TaskTrackers.
Hadoop infrastructure: shuffling (in between). The intermediate pairs travel from the map tasks to the reduce tasks across the cluster.
Shuffling phase: each mapper sorts its output key-value pairs.
Spilling to disk: key-value pairs are spilled to disk if necessary, i.e., when they do not fit in memory.
Shuffling phase: each reducer gets its key-value pairs from the mappers over HTTP.
Issue 1: tight coupling. In Hadoop version 1, the MapReduce layer (JobTracker, TaskTrackers) is tightly coupled to the storage layer (Namenode, Datanodes).
Issue 2: scalability. There is only one Namenode + JobTracker for the whole cluster, which limits scalability.