Ghislain Fourny. Big Data, Fall 2018. 6. Massive Parallel Processing (MapReduce)


Let's begin with a field experiment.

400+ Pokemons, 10 different kinds. How many of each?

The 400 Pokemons are distributed to many volunteers.

Task 1 (10 people): per-volunteer tallies (figure on slide).

Task 2 (8 people), part 1, aka "The big mess" (figure on slide).

Task 2 (8 people), part 2 (figure on slide).

Final summary: counts per kind (figure on slide).

Let's go!

So far, we have... storage as a file system (HDFS), and storage as tables (HBase) on top of it.

Data is only useful if we can query it: we need querying on top of storage as tables (HBase) and as a file system (HDFS)... in parallel.

Data Processing: Input data -> Query -> Output data

MapReduce

Data Processing: data comes in chunks.

Data Processing, the ideal case: each chunk is queried independently, fully in parallel.

Data Processing, the worst case.

Data Processing, the typical case: Map here... and shuffle there.

A common and useful sub-case, MapReduce: Input data -> Map -> Shuffle -> Reduce -> Output data

Data Processing, Data Model: Input data -> Map -> Intermediate data (shuffled) -> Reduce -> Output data

Data Processing, Data Shape: key-value pairs -> Map -> key-value pairs -> Reduce -> key-value pairs

Data Processing, Data Types: (key type 1, value type 1) -> Map -> (key type I, value type I) -> Reduce -> (key type A, value type A)

Data Processing, most often: (key type 1, value type 1) -> Map -> (key type A, value type A) -> Reduce -> (key type A, value type A), i.e., the map output types already match the final output types.
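
As an illustration (an assumption, not from the slides): in word count, the map input is (byte offset, line of text), and both the map output and the final output are (word, count). With Hadoop's Writable types, the generic signatures would look like:

    Mapper<LongWritable, Text, Text, IntWritable>      // (offset, line)  -> (word, 1)
    Reducer<Text, IntWritable, Text, IntWritable>      // (word, counts) -> (word, total)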

Splitting: the input is split into key-value pairs (key 1, key 2, key 3, key 4, ...).

Mapping function: (key 1, value) -> Map -> (key I, value), (key II, value)

Mapping function... in parallel: key 1 -> Map -> key I, key II; key 2 -> Map -> key I, key III; key 3 -> Map -> key II, key III

Put it all together: the key-value pairs output by all mappers are collected (key I, key II, key I, key III, key II, key III).

Sort by key: the collected pairs are sorted so that identical keys end up next to each other (key I, key I, key I, key II, key II, key III, key III).

Partition: the sorted pairs are partitioned, one partition per key group (key I | key II | key III), each destined for a reducer.
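
Which reducer gets which key group? By default, Hadoop decides with a hash partitioner; its core logic is essentially the following (a sketch of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        // Mask the sign bit so the result is non-negative, then take the
        // remainder modulo the number of reduce tasks.
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

All pairs with the same key hash to the same partition, so each key group ends up at exactly one reducer.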

Reduce function: (key I, values) -> Reduce -> (key A, value)

Reduce function (with identical key sets): (key A, [A, B, C]) -> Reduce -> (key A, value)

Reduce function (most generic): (key I, values) -> Reduce -> (key A, value) and possibly (key B, value); more output pairs are fine, but uncommon.

Reduce function... in parallel: key I -> Reduce -> key A; key II -> Reduce -> key B; key III -> Reduce -> key C

Overall: Map -> Sort -> Partition -> Reduce
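
To make the four phases concrete, here is a small word-count trace (illustrative data, an assumption rather than slide content):

    Input      split 1: "a rose is a rose"        split 2: "a daisy"
    Map        (a,1)(rose,1)(is,1)(a,1)(rose,1)   (a,1)(daisy,1)
    Sort       (a,1)(a,1)(is,1)(rose,1)(rose,1)   (a,1)(daisy,1)
    Partition  reducer 1: a, daisy                reducer 2: is, rose
    Reduce     (a,3)(daisy,1)                     (is,1)(rose,2)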

Input/Output formats

Input and output formats: from/to tables, and from/to files.

Formats, tabular: from/to an RDBMS, or from/to HBase (rows addressed by Row IDs such as 000, 002, 0A1, 1E0, 22A, 4A2).

Formats, files (e.g., from HDFS): Text, KeyValue, SequenceFile.

Text files: each line becomes one record; the key is the line's byte offset in the file (0, 18, 27, 38, ...) and the value is the line itself ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed...").

Text files, NLine: same records as Text (byte-offset keys: 0, 27, ...), but each input split contains a fixed number N of lines.

Key-Value files: each line is split at a separator; the key is the part before it (Lorem, sit, consectetur, ...) and the value the part after it (ipsum dolor, amet, adipiscing elit, sed, ...).

Sequence files: a Hadoop binary format that stores generic key-value pairs; each record is laid out as KeyLength, Key, ValueLength, Value.
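
As a sketch (not from the slides), such a file can be written with Hadoop's SequenceFile API; the output path and the key/value types below are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("demo.seq");  // hypothetical output path
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(IntWritable.class))) {
                // Each append materializes one KeyLength/Key/ValueLength/Value record.
                writer.append(new Text("rose"), new IntWritable(2));
                writer.append(new Text("is"), new IntWritable(1));
            }
        }
    }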

Optimization

Optimization: (key type 1, value type 1) -> Mapper (Map tasks) -> (key type A, value type A) -> Reducer (Reduce tasks) -> (key type A, value type A)

Optimization: how can we reduce* the amount of data shuffled around? (*pun intended, as a mnemonic)

Optimization, Combine: (key type 1, value type 1) -> Mapper -> (key type A, value type A) -> Combine -> (key type A, value type A) -> Reducer -> (key type A, value type A)

Combine, the 90% case: often, the combine function is identical to the reduce function (Combine = Reduce). Disclaimer: there are assumptions.

Combine=Reduce, Assumption 1: key/value types must be identical for reduce input and output: (key type A, value type A) -> Reduce -> (key type A, value type A)

Combine=Reduce, Assumption 2: the reduce function must be commutative and associative.
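
A quick sanity check with illustrative numbers: summation satisfies both properties, averaging does not, which is why a sum reducer can double as a combiner while an average reducer cannot:

    sum(sum(1, 2), 3) = 6    = sum(1, sum(2, 3))      // order of combining is irrelevant
    avg(avg(1, 2), 3) = 2.25 != avg(1, 2, 3) = 2      // combining early changes the result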

Optimization: bring the Query to the Data.

MapReduce: the APIs

Supported frameworks: Hadoop MapReduce, via the Java API and the Streaming API.

Java API: Mapper

import org.apache.hadoop.mapreduce.Mapper;

public class MyOwnMapper extends Mapper<K1, V1, K2, V2> {
    public void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        ...
        K2 newKey = ...
        V2 newValue = ...
        context.write(newKey, newValue);
        ...
    }
}
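
As a concrete instance of this skeleton (a sketch; class and variable names are illustrative, not from the slides): a word-count mapper, assuming TextInputFormat, so the input key is a line's byte offset (LongWritable) and the input value the line itself (Text):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the line.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }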

Java API: Reducer

import org.apache.hadoop.mapreduce.Reducer;

public class MyOwnReducer extends Reducer<K2, V2, K3, V3> {
    public void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        ...
        K3 newKey = ...
        V3 newValue = ...
        context.write(newKey, newValue);
        ...
    }
}
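
The matching word-count reducer, again as an illustrative sketch: it sums the 1s emitted for each word. Note that its input and output key/value types coincide (Text, IntWritable), which is exactly Assumption 1 from the Combine section:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum all partial counts for this word.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(word, total);
        }
    }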

Java API: Job

import org.apache.hadoop.mapreduce.Job;

public class MyMapReduceJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setMapperClass(MyOwnMapper.class);
        job.setReducerClass(MyOwnReducer.class);
        FileInputFormat.addInputPath(job, ...);
        FileOutputFormat.setOutputPath(job, ...);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Java API: Combiner (=Reducer)

import org.apache.hadoop.mapreduce.Job;

public class MyMapReduceJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setMapperClass(MyOwnMapper.class);
        job.setCombinerClass(MyOwnReducer.class);
        job.setReducerClass(MyOwnReducer.class);
        FileInputFormat.addInputPath(job, ...);
        FileOutputFormat.setOutputPath(job, ...);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
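
With the word-count classes sketched above, reusing the reducer as the combiner is safe: integer addition is commutative and associative, and the reducer's input and output types coincide, so pre-aggregating partial sums on the mapper side cannot change the final counts.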

Java API: InputFormat classes

InputFormat
    DBInputFormat (RDBMS)
    TableInputFormat (HBase)
    FileInputFormat
        KeyValueTextInputFormat (key-value file)
        SequenceFileInputFormat (sequence file)
        TextInputFormat (text)
        FixedLengthInputFormat
        NLineInputFormat

Java API: OutputFormat classes

OutputFormat
    DBOutputFormat (RDBMS)
    TableOutputFormat (HBase)
    FileOutputFormat
        SequenceFileOutputFormat (sequence file)
        TextOutputFormat (text)
        MapFileOutputFormat (map file)
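
A hedged sketch of wiring such formats into a job (WordCountMapper and WordCountReducer are the illustrative classes from above; the paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class FormatDemoJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "format demo");
            job.setJarByClass(FormatDemoJob.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setInputFormatClass(TextInputFormat.class);            // read plain text lines
            job.setOutputFormatClass(SequenceFileOutputFormat.class);  // write a binary sequence file
            FileInputFormat.addInputPath(job, new Path("in"));         // hypothetical paths
            FileOutputFormat.setOutputPath(job, new Path("out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }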

MapReduce: the physical layer

Possible storage layers: Hadoop MapReduce can run over the local file system, HDFS, S3, or Azure Blob Storage.

Hadoop MapReduce, numbers: several TBs of data, processed by map tasks spread over 1000s of nodes.

Hadoop infrastructure (version 1): one NameNode, many DataNodes.

Master-slave architecture: one master, many slaves.

Hadoop infrastructure (version 1): NameNode + JobTracker on the master; DataNode + TaskTracker on each slave.

Hadoop infrastructure (version 1): bring the Query to the Data.

Tasks: a task is either a map task or a reduce task.

Splits vs. map tasks: 1 split = 1 map task.

In practice: 1 split = 1 block (subject to configured minimum and maximum split sizes).

Splits vs. blocks, possible confusion: at the logical level (MapReduce) we speak of splits and records (key-value pairs); at the physical level (HDFS), of blocks and bits.

Records across blocks: a record (logical) can straddle two blocks (physical); reading it then requires a remote read for the missing part.

Fine-tuning can adjust splits to align with blocks.
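
One way to do this fine-tuning, as a sketch: FileInputFormat exposes minimum and maximum split sizes, and the effective split size is max(minSize, min(maxSize, blockSize)). Pinning both to the block size (128 MB here is only an example value) aligns splits with blocks:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitTuning {
        // Force splits to 128 MB; with a 128 MB block size, 1 split = 1 block.
        public static void pinSplitSize(Job job) {
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        }
    }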

Hadoop infrastructure (version 1): the NameNode + JobTracker knows, for a file such as /dir/file, on which DataNodes + TaskTrackers its blocks reside.

Hadoop infrastructure, map tasks: as many map tasks as splits, each scheduled on a DataNode + TaskTracker holding the corresponding block.

Hadoop infrastructure, map tasks: as many map tasks as splits; occasionally it is not possible to co-locate a task with its block.

Hadoop infrastructure, reduce tasks: a few reduce tasks, scheduled on DataNodes + TaskTrackers.

Hadoop infrastructure, shuffling (in between): map outputs travel from the map tasks to the reduce tasks across the cluster.

Shuffling phase: each mapper sorts its own output key-value pairs.

Spilling to disk: key-value pairs are spilled to disk if necessary (when they do not fit in memory).

Shuffling phase: each reducer gets its key-value pairs from the mappers over HTTP.

Issue 1, tight coupling: NameNode + JobTracker and DataNode + TaskTracker combine storage and processing in the same nodes.

Issue 2, scalability: there is only one NameNode + JobTracker!