Big Data Exercises
Fall 2017 Week 5, ETH Zurich: MapReduce

Reading:
- White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O'Reilly Media, Inc. [ETH library] (Chapters 2, 6, 7, 8: mandatory; Chapter 9: recommended)
- George, L. (2011). HBase: The Definitive Guide (1st ed.). O'Reilly. [ETH library] (Chapter 7: mandatory)
- Original MapReduce paper: "MapReduce: Simplified Data Processing on Large Clusters" (optional)

This exercise consists of 2 main parts:
- Hands-on practice with MapReduce on Azure HDInsight
- Architecture and theory of MapReduce

1. Set up a cluster

Create an HBase cluster

Follow the steps from last week's exercise, but this time make the cluster slightly more performant: create 4 RegionServers instead of 2 and choose type D3 v2 for these RegionServers. HeadNodes and ZooKeeper nodes should be of type A3, like last week. If you still have the storage account from last week, you can connect the same one; otherwise create a new one. Make sure that:

- you have enabled "Custom" settings
- you have selected "North Europe" as the data center location
- you have downgraded the two HeadNodes to type A3
- you have set the four RegionServers to type D3 v2
- the total cost of your cluster is 3.69 CHF/hour

Access your cluster

Make sure you can access your cluster (the HeadNode) via SSH:

$ ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Understand the difference between DataNode and RegionServer

You have heard of DataNodes in the context of HDFS/MapReduce and of RegionServers in the context of HBase. What are the roles of these two components? Where are they located?

DataNodes are the processes that store HDFS blocks on a physical node and serve those blocks to clients. RegionServers are the processes that serve HBase regions. Since HBase stores its data in HDFS, it is usually good practice to co-locate DataNodes and RegionServers (i.e. 1 RegionServer and 1 DataNode running on 1 physical node).

What about NameNodes and the HMaster? Again, these serve two different purposes: the former is responsible for orchestrating access to HDFS blocks, the latter for coordinating the RegionServers and for admin functions (creating, deleting and updating tables). These can run on the same physical node or not.
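If you want to check this on your own cluster, two quick commands are useful (a sketch, assuming you are logged in via SSH and the standard Hadoop tools are on the PATH; dfsadmin may require admin privileges):

$ jps                     # lists the Java daemons running on this node, e.g. NameNode or HMaster
$ hdfs dfsadmin -report   # asks the NameNode for the list of live DataNodes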

Explore your cluster admin dashboard

Connect to your cluster dashboard by navigating to https://<cluster_name>.azurehdinsight.net. Username and password are the ones you have chosen during the cluster setup (notice that the username is not the SSH username). Here you can find a lot of information about your cluster. Select "Hosts" in the top bar and explore the components running on each node.

- Where are the RegionServers running?
- Where is the active HBase Master running?
- Where is the active NameNode running?
- Which services run on the head nodes?
- Which physical node is responsible for coordinating a MapReduce job?

2. Write a Word Count MapReduce job

We want to find the most frequently used English words. To answer this question, we prepared a big text file (1.2GB) in which we concatenated the 3,036 books of the Gutenberg dataset.

2.1 Load the dataset

The dataset we provide is a concatenation of 3,036 books (gutemberg.txt). We provide 3 versions:

- gutemberg_x0.1.txt - a smaller dataset of about 120MB
- gutemberg.txt - the original dataset, 1.2GB
- gutemberg_x10.txt - a bigger dataset of 12GB. This one is optional: load and process it only after you have finished the exercise with the first two, and be aware that it might take some time.

Follow the steps below to set this dataset up in HDFS.

Log in to the active NameNode via SSH:

ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Download the dataset from our storage to the local filesystem of the NameNode using wget:

wget <URL of gutemberg_x0.1.txt>
wget <URL of gutemberg.txt>
wget <URL of gutemberg_x10.txt>

Load the dataset into the HDFS filesystem. With ls -lh you should see the files mentioned above; they are now on the local hard drive of your HeadNode. Upload them into HDFS, where they can be consumed by MapReduce:

$ hadoop fs -copyFromLocal *.txt /tmp/
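To confirm the upload worked, list the target directory in HDFS (a quick sanity check, assuming you copied the files to /tmp/ as above):

$ hadoop fs -ls -h /tmp/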

2.2 Understand the MapReduce Java API

We wrote a template project that you can use to experiment with MapReduce. Download and unzip the following package:

wget <URL of mapreduce.zip>
unzip mapreduce.zip

Now examine the content of the src folder. You will see one Java class:

- MapReduceWordCount: a skeleton for a MapReduce job that loads data from a file

Start by looking at MapReduceWordCount. You can see that the main method is already provided. Our WordCountMapper and WordCountReducer are implemented as classes that extend Hadoop's Mapper and Reducer. For this exercise, you only need to consider (and override) the map() method of the mapper and the reduce() method of the reducer:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void map(KEYIN key, VALUEIN value,
                       Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void reduce(KEYIN key, Iterable<VALUEIN> values,
                          Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        Iterator<VALUEIN> it = values.iterator();
        while (it.hasNext()) {
            VALUEIN value = it.next();
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }
}

Consulting the documentation if necessary, answer the following questions:

1. What are possible types for KEYIN, VALUEIN, KEYOUT and VALUEOUT? Should KEYOUT and VALUEOUT of the Mapper be the same as KEYIN and VALUEIN of the Reducer?
2. What is the default behavior of a MapReduce job running the base Mapper and Reducer above?
3. What is the role of the Context object?

Answers:

1. Since the output of mapper and reducer must be serializable, the types have to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Hadoop provides classes implementing these interfaces for string, boolean, integer, long and short values. Yes: the KEYOUT and VALUEOUT of the Mapper must match the KEYIN and VALUEIN of the Reducer, because the reducer consumes exactly what the mapper emits.
2. Mapper and Reducer are identity functions. Since there is always a shuffling phase, the overall job sorts its input by key.
3. Context allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job and the methods the mapper and reducer use to emit output (write()). Applications can also use the Context to pass parameters to mappers and reducers running on different nodes.

2.3 Write and run your MapReduce wordcount

Edit the provided skeleton and implement the mapper and reducer to perform a word count. The goal is to know how many times each unique word appears in the dataset. You can consider each word to be any sequence of characters separated by whitespace, or implement a more sophisticated tokenizer if you wish.

Can you use your Reducer as a Combiner? If so, enable it by uncommenting the appropriate line in the main method.

Once you are confident in your solution, follow the instructions in the README to compile the program and run it on your cluster. The process is very similar to the one for HBase last week. Answer the following questions:

1. Run the MapReduce job on the cluster with the default configuration and 4 DataNodes, using only the medium-size Gutenberg file for now. (Note: if you want to run your job again, you first need to delete the previous result folder, because Hadoop refuses to write to an existing location: hadoop fs -rm -r <path-to-hdfs-output-folder>)
2. While the job proceeds, connect to the cluster dashboard to verify its status. Once done, check the final job report. How many map and reduce tasks were created with the default configuration?
3. Does it go faster with more reduce tasks? Experiment with job.setNumReduceTasks(). What is the disadvantage of having multiple reducers? (Hint: check the format of your output.)
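For reference, the wiring between mapper, reducer and types happens in the driver (the template's main method). Below is a minimal sketch of what such a driver typically looks like, using the standard org.apache.hadoop.mapreduce.Job API; the provided template may differ in its details. If the mapper's output types differed from the job's final output types, you would additionally call job.setMapOutputKeyClass()/job.setMapOutputValueClass().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(MapReduceWordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // job.setCombinerClass(WordCountReducer.class); // uncomment to enable the combiner
        job.setReducerClass(WordCountReducer.class);
        // job.setNumReduceTasks(8);                     // experiment with the number of reducers
        job.setOutputKeyClass(Text.class);               // KEYOUT: the word
        job.setOutputValueClass(IntWritable.class);      // VALUEOUT: the count
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /tmp/gutemberg.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}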

A possible solution is the one below. Notice that this Reducer can also be used as a Combiner.

Mapper:

private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Reducer:

private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Answers:

1. Some indicative figures are provided in the performance comparison of Section 3 below.
2. In our case, the number of map tasks was 3 and the number of reduce tasks was 1. By default, the number of map tasks for a given job is driven by the number of input splits. The parameter mapred.map.tasks can be used to set a suggested value. For reduce tasks, 1 is the default number. This might be a bad choice in terms of performance, but on the other hand the result lands in a single file, with no further merging necessary.
3. Generally, one reducer can become a bottleneck; here the bottleneck is more evident if no combiner is used. However, with multiple reducers, multiple output files are created and an additional merging step is needed to get a single result. Note: according to the official documentation, the right number of reduces is 0.95 or 1.75 multiplied by [no. of nodes] * [no. of maximum containers per node]. With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. For example, with our 4 worker nodes and, say, 8 containers per node, the 0.95 rule suggests roughly 0.95 * 4 * 8 ≈ 30 reduce tasks. The scaling factors are slightly less than whole numbers in order to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
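As a side note on question 3: with several reducers, the output is split across one part-r-NNNNN file per reducer. One way to merge them back into a single local file is getmerge (a sketch, reusing the output-folder placeholder from above):

$ hadoop fs -getmerge <path-to-hdfs-output-folder> results.txt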

2.4 Download and plot the results

The output of the MapReduce process has been stored in your remote cluster. You can now download the result file and plot the frequency of the 30 most common words in the dataset. The easiest way for this notebook to be able to access your result file is to set your container as public:

- Access the MapReduce container in portal.azure.com.
- Set the Access Policy of the container to public.
- Locate the result file in /tmp/results and copy the file URL. The link should now be accessible from your browser.
- Add the URL below, then run the cell to download the file and plot the results.

In [1]:

RESULT_FILE_URL = '...'
import urllib.request
urllib.request.urlretrieve(RESULT_FILE_URL, "results.txt")
print('Done')

Done

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt
import operator

print('Plotting...')
freq = {}

# Read input and sort by frequency. Keep only the top 30.
with open('results.txt', 'rb') as csvfile:
    for line in csvfile.readlines():
        word, count = line.decode('utf-8').split('\t')
        freq[word] = int(count)
srt = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)[:30]

# Generate plot
plt.figure(figsize=(16,6))
plt.bar(range(len(srt)), [x[1] for x in srt], align='center', color='#ba2121')
plt.xticks(range(len(srt)), [x[0] for x in srt])
plt.show()

Plotting...

If everything is correct, the 3 most frequent words should be "the", "of" and "and".

3. Performance comparison

Test your MapReduce job with the smaller gutemberg_x0.1.txt as well. If you want, you can also try gutemberg_x10.txt. For each test, write down the running time:

time mapreduce-wc /tmp/gutemberg_x0.1.txt /tmp/gut01/
time mapreduce-wc /tmp/gutemberg.txt /tmp/gut1/
time mapreduce-wc /tmp/gutemberg_x10.txt /tmp/gut10/

While a job proceeds, you can connect to the cluster dashboard to check its status. In the meanwhile, download the dataset to your laptop. You can either use wget or download one file at a time from your container in the Azure portal. Note: the biggest file is optional; you need at least 12.5GB of free space on your hard drive, and it will take some time to download and process! Alternatively, you can also run this experiment on the HeadNode of your cluster, where you should still have the text files.

wget <URL of gutemberg_x0.1.txt>
wget <URL of gutemberg.txt>
wget <URL of gutemberg_x10.txt>

We prepared a simple wordcount program in Python. Download it to your laptop (or to the cluster HeadNode) and test how long it takes to process the three datasets. Note down the times for the next exercise.

python wordcount.py < /pathtoyourfile/gutemberg_x0.1.txt
python wordcount.py < /pathtoyourfile/gutemberg.txt
python wordcount.py < /pathtoyourfile/gutemberg_x10.txt

3.1 Plot

Compare the performance of the MapReduce vs. the single-threaded implementation of the word count algorithm for the three different input sizes. Fill in the times (in seconds) in the code below to plot the results.

In [ ]:

# NOTE: remove the last number of each list below if you did not test gutemberg_x10.txt
size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]  # the input size in MB
time_mapreduce = [0., 0., 0.]    # replace the 0s with the times (in seconds) for the corresponding inputs
time_locallaptop = [0., 0., 0.]  # replace the 0s with the times (in seconds) for the corresponding inputs

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce, '#f37626', label='MapReduce', linewidth=3.0)
plt.plot(size_input, time_locallaptop, '#00b300', label='local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('input size (MB)')
plt.ylabel('time (seconds)')
plt.title('Wall-time comparison')
plt.show()

3.2 Discussion

1. Which is faster, MapReduce on your cluster or a local wordcount implementation? Why?
2. Based on your experiment, what input size is the approximate break-even point for time performance?
3. Why is MapReduce not performing better than local computation for small inputs?
4. How can you optimize the MapReduce performance for this job?

We have run some more tests. Here we present the running times for 4 configurations on HDInsight, one workstation and one laptop. The figures below are indicative only, because the performance of every machine depends on several factors.

- MapReduce v1: no combiner, default configuration (1 reducer)
- MapReduce v2: no combiner, 8 reduce tasks
- MapReduce v3: combiner, default configuration (1 reducer)
- MapReduce v4: combiner, 8 reduce tasks

Answers:

1. For small inputs, local computation is faster.
2. Using a combiner, MapReduce gets faster than a single workstation somewhere between 1GB and 10GB of data. This, of course, depends heavily on the task and on the configuration of the machines.
3. The shuffling phase adds a lot of overhead.
4. A combiner drastically improves the performance of this job, because the data to be shuffled can be reduced a lot.
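Why does the combiner help so much here? A rough back-of-the-envelope model (illustrative assumptions only): without a combiner, each map task shuffles one (word, 1) pair per token in its split, whereas with a combiner it shuffles roughly one (word, partial count) pair per distinct word in its split:

shuffle_without_combiner ∝ number of tokens in the split
shuffle_with_combiner    ∝ number of distinct words in the split

Since natural-language text repeats a small vocabulary very often, the second quantity is far smaller than the first, which matches the gap between v1/v2 and v3/v4 in the timings below.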

In [3]:

size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]  # the input size in MB
time_mapreduce1 = [123, 445, 2160]  # MapReduce v1
time_mapreduce2 = [100, 295, 970]   # MapReduce v2
time_mapreduce3 = [110, 240, 413]   # MapReduce v3
time_mapreduce4 = [110, 218, 385]   # MapReduce v4
#time_localstation = [11.4, 127, 1332]  # our workstation (java)
time_localstation = [5.1, 47.7, 475.5]  # our workstation
time_locallaptop = [14.9, 148.4, ]  # our laptop

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce1, '#7d0026', label='MapReduce v1', linewidth=3.0)
plt.plot(size_input, time_mapreduce2, '#f30000', label='MapReduce v2', linewidth=3.0)
plt.plot(size_input, time_mapreduce3, '#430064', label='MapReduce v3', linewidth=3.0)
plt.plot(size_input, time_mapreduce4, '#f37626', label='MapReduce v4', linewidth=3.0)
plt.plot(size_input, time_localstation, '#0055f3', label='local workstation', linewidth=3.0, linestyle='dashed')
plt.plot(size_input, time_locallaptop, '#00b300', label='local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('input size (MB)')
plt.ylabel('time (seconds)')
plt.title('Performance comparison')
plt.show()

Important: delete your cluster

It is very important that you delete your cluster from the Azure Portal as soon as you have finished this exercise.

4. Reverse engineering

Conceptually, a map function takes a key-value pair as input and emits a list of key-value pairs, while a reduce function takes a key with an associated list of values as input and returns a list of values or key-value pairs. Often the types of the final key and value are the same as those of the intermediate data:

map    (k1, v1)       --> list(k2, v2)
reduce (k2, list(v2)) --> list(k2, v2)

Analyze the following Mapper and Reducer, written in pseudo-code, and answer the questions below.

function map(key, value)
    emit(key, value);

function reduce(key, values)
    z = 0.0
    for value in values:
        z += value
    emit(key, z / values.length())

Questions

1. Explain what the result of running this job is on a list of pairs of type ([string], [float]).
2. Write the equivalent SQL query.
3. Could you use this reduce function as a combiner as well? Why or why not?
4. If your answer to the previous question was yes, does the number of different keys influence the effectiveness of the combiner? If your answer was no, can you change the map and reduce functions in such a way that the new reducer can be used as a combiner?

Answers:

1. The job outputs a pair (k1, v1) for each unique input key, where k1 is the input key and v1 is the average of the values associated with k1.
2. Equivalent to SELECT key, AVG(value) FROM tablename GROUP BY key.
3. No, because the average operation is not associative: in general, an average of partial averages is not the overall average.
4. If we allow the final output to contain an additional piece of information (how many samples the average represents), the reducer can be used as a combiner, with the values being themselves pairs:

function map(key, value)
    emit(key, (1, value));

function reduce(key, values)
    n = 0
    z = 0.0
    for value in values:
        n += value[0]
        z += value[0] * value[1]
    emit(key, (n, z / n))
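To make the answer to question 4 concrete, here is a minimal Java sketch of the combiner-friendly average (our own illustrative class names, not part of the exercise template; it assumes the input arrives as (Text key, FloatWritable value) pairs). The trick is a value type that carries (count, average), so that partial results can be recombined exactly.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Value type carrying (count, average) so partial averages can be merged.
class CountAverage implements Writable {
    private long count;
    private double average;

    public CountAverage() {}
    public CountAverage(long count, double average) { this.count = count; this.average = average; }

    public long getCount() { return count; }
    public double getAverage() { return average; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeDouble(average);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        average = in.readDouble();
    }
}

class AverageMapper extends Mapper<Text, FloatWritable, Text, CountAverage> {
    protected void map(Text key, FloatWritable value, Context context)
            throws IOException, InterruptedException {
        // A single value is the average of one sample.
        context.write(key, new CountAverage(1, value.get()));
    }
}

// Can serve both as Combiner and as Reducer: it merges (count, average) pairs.
class AverageReducer extends Reducer<Text, CountAverage, Text, CountAverage> {
    protected void reduce(Text key, Iterable<CountAverage> values, Context context)
            throws IOException, InterruptedException {
        long n = 0;
        double weightedSum = 0.0;
        for (CountAverage val : values) {
            n += val.getCount();
            weightedSum += val.getCount() * val.getAverage();
        }
        context.write(key, new CountAverage(n, weightedSum / n));
    }
}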

5. True or False

Say whether the following statements are true or false, and explain why.

1. Each mapper must generate the same number of key/value pairs as it received as input.
2. The TaskTracker is responsible for scheduling mappers and reducers and for making sure all nodes are running correctly.
3. Mappers' input key/value pairs are sorted by key.
4. MapReduce splits might not correspond to HDFS blocks.
5. One single Reducer is applied to all values associated with the same key.
6. Multiple Reducers can be assigned pairs with the same value.
7. In Hadoop MapReduce, the key-value pairs a Reducer outputs must be of the same type as its input pairs.

Answers:

1. False - for each input pair, the mapper can emit zero, one or several key/value pairs.
2. False - the JobTracker is responsible for this.
3. False - mapper input is not sorted.
4. True - since splits respect logical record boundaries, they might contain data from multiple HDFS blocks.
5. True - this is the principle behind partitioning: one Reducer is responsible for all values associated with a particular key.
6. True - values are not relevant in partitioning.
7. False - a Reducer's input and output pairs can have different types.

6. Some more MapReduce and SQL

Design, in Python or pseudo-code, MapReduce functions that take a very large file of integers and produce as output:

1. The largest integer.
2. The average of all the integers.
3. The same set of integers, but with each integer appearing only once.
4. The number of times each unique integer appears.
5. The number of distinct integers in the input.

For each of these, write the equivalent SQL query, assuming you have a column "values" that stores all the integers.

Answers:

1) SELECT MAX(values)... The naive solution below has a very low degree of parallelization, because everything is grouped on a single key. We can use a Combiner (identical to the Reducer) to increase the degree of parallelization.

function map(key, value)
    emit(1, key);  // group everything on the same key

function reduce(key, values)
    global_max = 0
    for value in values:
        if (value > global_max):
            global_max = value
    emit(global_max)

2) SELECT AVG(values)... Using the same map as above:

function reduce(key, values)
    sum = 0
    count = 0
    for value in values:
        sum += value
        count += 1
    emit(sum / count)

3) SELECT DISTINCT values...

function map(key, value)
    emit(key, 1)

function reduce(key, values)
    emit(key)

4) SELECT values, COUNT(*)... GROUP BY values

function map(key, value)
    emit(key, 1)

function reduce(key, values)
    emit(key, len(values))

5) SELECT COUNT(DISTINCT values)... Reuse (3) and add a post-processing step that combines the output of the reducers. A second MapReduce stage could also be used, but it would probably be overkill for this scenario.

7. TF-IDF in MapReduce

Imagine we want to build a search engine over the Gutenberg dataset of ~3,000 books. Given a word or a set of words, we want to rank these books according to their relevance for these words. We need a metric to measure the importance of a word in a set of documents.

7.1 Understand TF-IDF

TF-IDF is a statistic used to determine the relative importance of the words in a set of documents. It is computed as the product of two statistics, term frequency (tf) and inverse document frequency (idf). Given a word t, a document d (in this case a book) and the collection of all documents D, we can define tf(t, d) as the number of times t appears in d. This gives us some information about the content of a document, but because some terms (e.g. "the") are so common, term frequency tends to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms.

The inverse document frequency idf(t, D) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

It can be computed as:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

where |D| is the total number of documents and the denominator is the number of documents that contain the word t at least once. However, this would cause a division by zero if the user queries a word that never appears in the dataset. A better formulation is:

idf(t, D) = log( |D| / (1 + |{d ∈ D : t ∈ d}|) )

Then tfidf(t, d, D) is calculated as follows:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

A high tfidf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.

7.2 Implement TF-IDF in MapReduce (pseudo-code)

Implement Mapper and Reducer functions in pseudo-code to compute TF-IDF. Assume each Mapper receives the document name as a string key and the entire document content as a string value. The output of your job should be a list of key-value pairs, where the key is a string of the form "document:word" and the value is the tfidf score for that document/word pair.

function map(doc_id, doc_content)
    ...

Solution:

1) The Mapper outputs each word-document pair:

function map(doc_id, doc_content)
    tokens = doc_content.split()
    for token in tokens:
        emit(token, (doc_id, 1));

2) An optional Combiner performs an aggregation over all occurrences of the same word per document:

function combiner(token, values)
    frequencies = {}  // empty dictionary
    for (doc_id, count) in values:
        frequencies[doc_id] += count  // the list of documents should fit in memory
    for doc_id in frequencies.keys():
        emit(token, (doc_id, frequencies[doc_id]));

3) The Reducer is called for each unique word in the collection and outputs a list of document:word pairs with the tf-idf score:

function reduce(token, values)
    frequencies = {}  // empty dictionary
    for (doc_id, count) in values:
        frequencies[doc_id] += count  // the list of documents should fit in memory
    idf = log(n / (1 + frequencies.keys().length()))  // idf of this token; n is the total number of documents
    for doc_id in frequencies.keys():
        tf = frequencies[doc_id]  // tf of this token-doc_id pair
        emit(doc_id + ":" + token, tf * idf)
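If you want to turn the pseudo-code into a runnable job, here is a rough Java sketch (our own illustrative classes, not part of the provided template). It assumes the input format delivers (document name, document content) as (Text, Text) pairs, and that the driver stores the total number of documents in the job Configuration under a hypothetical key "total.docs".

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TfIdfMapper extends Mapper<Text, Text, Text, Text> {
    protected void map(Text docId, Text docContent, Context context)
            throws IOException, InterruptedException {
        for (String token : docContent.toString().split("\\s+")) {
            // One ("word", "docId:1") pair per occurrence.
            context.write(new Text(token), new Text(docId.toString() + ":1"));
        }
    }
}

class TfIdfReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    protected void reduce(Text token, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Per-document term frequencies of this token; assumes they fit in memory.
        Map<String, Integer> frequencies = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(":");
            frequencies.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
        // "total.docs" is a hypothetical configuration key set by the driver.
        long totalDocs = context.getConfiguration().getLong("total.docs", 1);
        double idf = Math.log((double) totalDocs / (1 + frequencies.size()));
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            context.write(new Text(entry.getKey() + ":" + token),
                          new DoubleWritable(entry.getValue() * idf));
        }
    }
}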
