MapReduce: Simplified Data Processing on Large Clusters. By Stephen Cardina


Edward Thomas
6 years ago

1 MapReduce: Simplified Data Processing on Large Clusters By Stephen Cardina

2 The Problem You have a large amount of raw data, such as a database or a web log, and you need to compute some derived data from it, such as how many words start with a specific letter or how many items fall within a specific range. When the input is large, the work has to be spread across many machines running at once to finish in a reasonable time. That raises further issues, such as how to distribute the data and what to do when a machine fails, which add troubleshooting effort and development time.

3 The Solution MapReduce is a programming model for processing and generating large data sets. The MapReduce library parallelizes the work automatically, so the data is processed much faster than it would be on a single machine. A MapReduce program has two distinct parts, map and reduce. Map goes through the input and, using a function the user supplies, emits intermediate key/value pairs. Grouping those pairs by key arranges them into columns for later use. Reduce merges the values that map produced for each key into a smaller result that is easier for the user to work with.

4 A Quick MapReduce Example Suppose you have an array of 5 words, [car, bike, train, bus, boat], and you want to separate them by how many letters they have. Map emits each word with its letter count as the key, and grouping by key gives 3 columns:
3: [car, bus]
4: [boat, bike]
5: [train]
Here 3, 4, and 5 are the intermediate keys, and the words are the values paired with them.

5 A Quick MapReduce Example Now suppose you don't care what the words actually are and just want to know how many 3 letter words there are. You can use reduce to make the columns even quicker to read:
3: 2
4: 2
5: 1
Here reduce merged the 3 letter words, [car, bus], into the single value 2, and did the same for the 4 and 5 letter words.
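The word-length example can be written out as a short single-machine sketch. This is only an illustration of the model, not the actual MapReduce library: the names map_fn, reduce_fn, and run_mapreduce are made up for this example, and everything runs in one process.

```python
from collections import defaultdict

def map_fn(word):
    """Emit one intermediate (key, value) pair: (letter count, word)."""
    yield len(word), word

def reduce_fn(key, values):
    """Merge all values that share a key into a single count."""
    return len(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    # Reduce each group, visiting the keys in sorted order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

words = ["car", "bike", "train", "bus", "boat"]
counts = run_mapreduce(words, map_fn, reduce_fn)
print(counts)  # {3: 2, 4: 2, 5: 1}
```

Swapping in a different reduce_fn, for example one that returns the sorted list of words, changes the result without touching the map side.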

6 Execution overview Step 1: The MapReduce library splits the input files into M smaller pieces, typically between 16 and 64 megabytes each. It then starts up many copies of the program on a cluster of machines.
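A rough sketch of the splitting in Step 1, assuming the input is just a run of bytes cut at fixed offsets. The function split_input and the sizes used are hypothetical, and a real splitter would also have to respect record boundaries.

```python
def split_input(total_bytes, split_bytes):
    """Return (offset, length) pairs that cover the whole input."""
    splits = []
    offset = 0
    while offset < total_bytes:
        length = min(split_bytes, total_bytes - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
splits = split_input(200 * MB, 64 * MB)  # assumed 200 MB input, 64 MB pieces
print(len(splits))          # 4, so M = 4 map tasks
print(splits[-1][1] // MB)  # 8, the last piece holds the 8 MB remainder
```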

7 Execution overview Step 2: One of the copies becomes the master, and all the others are workers. There are M map tasks and R reduce tasks to do, and the master assigns each one to an idle worker.
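The master's bookkeeping in Step 2 can be pictured with the toy sketch below. It assumes a simple first-come queue, which is a simplification chosen for illustration: assign_tasks and the worker names are hypothetical, and the sketch ignores the fact that a reduce task cannot finish until the map output it needs exists.

```python
from collections import deque

def assign_tasks(num_map, num_reduce, idle_workers):
    """Hand pending tasks to idle workers, one task per worker."""
    pending = deque(
        [("map", i) for i in range(num_map)]
        + [("reduce", i) for i in range(num_reduce)]
    )
    idle = deque(idle_workers)
    assignments = {}
    while pending and idle:
        assignments[idle.popleft()] = pending.popleft()
    # Tasks still pending wait until some worker becomes idle again.
    return assignments, list(pending)

assignments, waiting = assign_tasks(3, 2, ["w1", "w2", "w3", "w4"])
print(assignments["w4"])  # ('reduce', 0)
print(waiting)            # [('reduce', 1)]
```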

8 Execution overview Step 3: A worker that is assigned a map task reads the input piece it has been given. It parses key/value pairs out of that piece, passes each one to the map function, and buffers the intermediate key/value pairs the function produces in memory.

9 Execution overview Step 4: Periodically, the buffered pairs are written to the worker's local disk, partitioned into R regions, one per reduce task. The locations of these regions are reported to the master, which passes them on to the workers that will run the reduce tasks.
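One common way to divide buffered pairs into R regions is to hash the key and take the result modulo R. The slides do not say which partitioning function is used, so the sketch below is an assumption: partition and write_regions are hypothetical names, and zlib.crc32 stands in for the hash only because it gives the same value on every run.

```python
import zlib

def partition(key, num_reduce):
    """Pick a region for a key: hash(key) mod R (assumed scheme)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reduce

def write_regions(buffered_pairs, num_reduce):
    """Split buffered (key, value) pairs into R in-memory regions."""
    regions = [[] for _ in range(num_reduce)]
    for key, value in buffered_pairs:
        regions[partition(key, num_reduce)].append((key, value))
    return regions

pairs = [(3, "car"), (4, "bike"), (5, "train"), (3, "bus"), (4, "boat")]
regions = write_regions(pairs, 2)  # R = 2 reduce tasks
```

Because the region depends only on the key, every pair with the same key ends up in the same region, so a single reduce task sees all the values for that key.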

10 Execution overview Step 5: When a reduce worker is told about these locations, it reads the buffered data from the map workers' local disks. After it has read everything, it sorts the data by intermediate key so that all values for the same key are grouped together.

11 Execution overview Step 6: The reduce worker goes through the sorted data and, for each unique intermediate key, sends the key and its corresponding set of intermediate values to the reduce function. The output of the function is then appended to the output file.
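Steps 5 and 6 amount to a sort followed by a group-by. A minimal sketch, assuming all of the reduce worker's pairs fit in memory; run_reduce_task is a hypothetical name, and the returned list stands in for the output file.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(pairs, reduce_fn):
    """Sort pairs by key, then call reduce_fn once per unique key."""
    pairs = sorted(pairs, key=itemgetter(0))  # Step 5: sort by key
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))  # Step 6
    return output

pairs = [(3, "car"), (4, "bike"), (5, "train"), (3, "bus"), (4, "boat")]
result = run_reduce_task(pairs, lambda key, values: len(values))
print(result)  # [(3, 2), (4, 2), (5, 1)]
```

Sorting first matters because groupby only merges neighbouring items with equal keys.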

12 Execution overview Step 7: When all of the map and reduce tasks are done, the MapReduce call returns and control goes back to the user code.

13 Execution speed The graph on the right shows how fast the input is scanned. The rate rises gradually as more workers are assigned, peaks around the 55 second mark, and then slowly drops because there are no more tasks left to assign.

14 Comparing different executions This graph shows how long a sort program takes on 10^10 records. The run uses 15,000 map tasks, 4,000 reduce tasks (and therefore 4,000 output files), and 1,746 workers.

15 Comparing different executions Backup tasks come into play when a MapReduce operation is almost done: the master has idle workers run backup copies of the tasks still in progress, and a task counts as finished as soon as either copy completes. This saves a lot of time when stragglers are taking a long time. The graph on the right shows that the sort takes much longer, about 400 seconds more, with backup tasks disabled.
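The effect of backup tasks on a straggler can be seen in a toy timing model. All durations here are invented for illustration and completion_time is a hypothetical helper; the only idea taken from the slide is that a task is done when either its original or its backup copy finishes.

```python
def completion_time(task_durations, backup_start=None, backup_duration=None):
    """Time at which the last task finishes, with optional backup copies.

    Any task still running at backup_start gets a backup copy that takes
    backup_duration, and the task finishes when the first copy does.
    """
    finish_times = []
    for duration in task_durations:
        if backup_start is not None and duration > backup_start:
            duration = min(duration, backup_start + backup_duration)
        finish_times.append(duration)
    return max(finish_times)

durations = [10, 11, 12, 60]  # made-up seconds; the last task is a straggler
print(completion_time(durations))                                       # 60
print(completion_time(durations, backup_start=15, backup_duration=10))  # 25
```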

16 Comparing different executions MapReduce also handles workers that stop functioning. If the master doesn't hear back from a worker for a certain amount of time, it assumes the worker has failed, marks it as such, and reassigns its tasks to other workers. This helps when workers stop responding partway through a run, and the job still finishes in a decent time, as shown on the right.
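The timeout check described on this slide could look roughly like the sketch below. The timestamps, the 30 second timeout, and the names find_failed_workers and tasks_to_reschedule are all assumptions made for the example.

```python
def find_failed_workers(last_heard, now, timeout_seconds):
    """Return workers the master has not heard from within the timeout."""
    return sorted(
        worker for worker, seen in last_heard.items()
        if now - seen > timeout_seconds
    )

def tasks_to_reschedule(assignments, failed_workers):
    """Collect the tasks held by failed workers so they can be reassigned."""
    return [task for worker, task in assignments.items() if worker in failed_workers]

last_heard = {"w1": 100.0, "w2": 40.0, "w3": 95.0}  # seconds, hypothetical
assignments = {"w1": ("map", 0), "w2": ("map", 1), "w3": ("reduce", 0)}

failed = find_failed_workers(last_heard, now=105.0, timeout_seconds=30.0)
print(failed)                                    # ['w2']
print(tasks_to_reschedule(assignments, failed))  # [('map', 1)]
```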

17 In Summary MapReduce is a good way to deal with a large amount of raw data. It parallelizes the processing automatically. It separates the work into 2 tasks, map and reduce. Map takes the inputs and separates them into groups by intermediate key. Reduce takes the grouped data from map and reduces it to a result for the user. A master and several workers handle all of this. It copes with slow and non-responding workers without a large loss of time.

18 Questions?

19 Reference
