Big Data Exercises
Fall 2017 Week 5, ETH Zurich: MapReduce

Reading:
- White, T. (2015). Hadoop: The Definitive Guide (4th ed.). O'Reilly Media, Inc. [ETH library] (Chapters 2, 6, 7, 8: mandatory; Chapter 9: recommended)
- George, L. (2011). HBase: The Definitive Guide (1st ed.). O'Reilly. [ETH library] (Chapter 7: mandatory)
- Original MapReduce paper: "MapReduce: Simplified Data Processing on Large Clusters" (optional)

This exercise consists of 2 main parts:
- Hands-on practice with MapReduce on Azure HDInsight
- Architecture and theory of MapReduce

1. Set up a cluster

Create an HBase cluster

Follow the steps from last week's exercise, but this time make the cluster slightly more performant: create 4 RegionServers instead of 2 and choose type D3 v2 for these RegionServers. HeadNodes and ZooKeeper nodes should be of type A3, like last week. If you still have the storage account from last week, you can connect the same one; otherwise create a new one. Make sure that:

- you have enabled "Custom" settings
- you have selected "North Europe" as the data center location
- you have downgraded the two HeadNodes to type A3
- you have set the four RegionServers to type D3 v2
- the total cost of your cluster is 3.69 CHF/hour

Access your cluster

Make sure you can access your cluster (the HeadNode) via SSH:

$ ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Understand the difference between DataNode and RegionServer

You have heard of DataNodes in the context of HDFS/MapReduce and of RegionServers in the context of HBase. What are the roles of these two components? Where are they located?

DataNodes are the processes that store HDFS blocks on a physical node and serve those blocks to clients. RegionServers are the processes that serve HBase regions. Since HBase stores its data in HDFS, it is usually good practice to co-locate DataNodes and RegionServers (i.e. 1 RegionServer and 1 DataNode running on 1 physical node).

What about NameNodes and the HMaster? Again, these serve two different purposes: the former is responsible for orchestrating access to HDFS blocks, the latter for coordinating the RegionServers and for admin functions (creating, deleting and updating tables). These can run on the same physical node or not.
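If you want to check this on your own cluster, two quick commands are useful (a sketch, assuming you are logged in via SSH and the standard Hadoop tools are on the PATH; dfsadmin may require admin privileges):

$ jps                     # lists the Java daemons running on this node, e.g. NameNode or HMaster
$ hdfs dfsadmin -report   # asks the NameNode for the list of live DataNodes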

Explore your cluster admin dashboard

Connect to your cluster dashboard by navigating to https://<cluster_name>.azurehdinsight.net. Username and password are the ones you have chosen during the cluster setup (notice that the username is not the SSH username). Here you can find a lot of information about your cluster. Select "Hosts" in the top bar and explore the components running on each node.

- Where are the RegionServers running?
- Where is the active HBase Master running?
- Where is the active NameNode running?
- Which services run on the head nodes?
- Which physical node is responsible for coordinating a MapReduce job?

2. Write a Word Count MapReduce job

We want to find the most frequently used English words. To answer this question, we prepared a big text file (1.2GB) in which we concatenated the 3,036 books of the Gutenberg dataset.

2.1 Load the dataset

The dataset we provide is a concatenation of 3,036 books (gutemberg.txt). We provide 3 versions:

- gutemberg_x0.1.txt - a smaller dataset of about 120MB
- gutemberg.txt - the original dataset, 1.2GB
- gutemberg_x10.txt - a bigger dataset of 12GB. This one is optional: load and process it only after you have finished the exercise with the first two, and be aware that it might take some time.

Follow the steps below to set this dataset up in HDFS.

Log in to the active NameNode via SSH:

ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

Download the dataset from our storage to the local filesystem of the NameNode using wget:

wget <URL of gutemberg_x0.1.txt>
wget <URL of gutemberg.txt>
wget <URL of gutemberg_x10.txt>

Load the dataset into the HDFS filesystem. With ls -lh you should see the files mentioned above; they are now on the local hard drive of your HeadNode. Upload them into HDFS, where they can be consumed by MapReduce:

$ hadoop fs -copyFromLocal *.txt /tmp/
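To confirm the upload worked, list the target directory in HDFS (a quick sanity check, assuming you copied the files to /tmp/ as above):

$ hadoop fs -ls -h /tmp/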

2.2 Understand the MapReduce Java API

We wrote a template project that you can use to experiment with MapReduce. Download and unzip the following package:

wget <URL of mapreduce.zip>
unzip mapreduce.zip

Now examine the content of the src folder. You will see one Java class:

- MapReduceWordCount: a skeleton for a MapReduce job that loads data from a file

Start by looking at MapReduceWordCount. You can see that the main method is already provided. Our WordCountMapper and WordCountReducer are implemented as classes that extend Hadoop's Mapper and Reducer. For this exercise, you only need to consider (and override) the map() method of the mapper and the reduce() method of the reducer:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void map(KEYIN key, VALUEIN value,
                       Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    protected void reduce(KEYIN key, Iterable<VALUEIN> values,
                          Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.Context context) {
        Iterator<VALUEIN> it = values.iterator();
        while (it.hasNext()) {
            VALUEIN value = it.next();
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }
}

Consulting the documentation if necessary, answer the following questions:

1. What are possible types for KEYIN, VALUEIN, KEYOUT and VALUEOUT? Should KEYOUT and VALUEOUT of the Mapper be the same as KEYIN and VALUEIN of the Reducer?
2. What is the default behavior of a MapReduce job running the base Mapper and Reducer above?
3. What is the role of the Context object?

Answers:

1. Since the output of mapper and reducer must be serializable, the types have to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. Hadoop provides classes implementing these interfaces for string, boolean, integer, long and short values. Yes: the KEYOUT and VALUEOUT of the Mapper must match the KEYIN and VALUEIN of the Reducer, because the reducer consumes exactly what the mapper emits.
2. Mapper and Reducer are identity functions. Since there is always a shuffling phase, the overall job sorts its input by key.
3. Context allows the Mapper/Reducer to interact with the rest of the Hadoop system. It includes configuration data for the job and the methods the mapper and reducer use to emit output (write()). Applications can also use the Context to pass parameters to mappers and reducers running on different nodes.

2.3 Write and run your MapReduce wordcount

Edit the provided skeleton and implement the mapper and reducer to perform a word count. The goal is to know how many times each unique word appears in the dataset. You can consider each word to be any sequence of characters separated by whitespace, or implement a more sophisticated tokenizer if you wish.

Can you use your Reducer as a Combiner? If so, enable it by uncommenting the appropriate line in the main method.

Once you are confident in your solution, follow the instructions in the README to compile the program and run it on your cluster. The process is very similar to the one for HBase last week. Answer the following questions:

1. Run the MapReduce job on the cluster with the default configuration and 4 DataNodes, using only the medium-size Gutenberg file for now. (Note: if you want to run your job again, you first need to delete the previous result folder, because Hadoop refuses to write to an existing location: hadoop fs -rm -r <path-to-hdfs-output-folder>)
2. While the job proceeds, connect to the cluster dashboard to verify its status. Once done, check the final job report. How many map and reduce tasks were created with the default configuration?
3. Does it go faster with more reduce tasks? Experiment with job.setNumReduceTasks(). What is the disadvantage of having multiple reducers? (Hint: check the format of your output.)
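For reference, the wiring between mapper, reducer and types happens in the driver (the template's main method). Below is a minimal sketch of what such a driver typically looks like, using the standard org.apache.hadoop.mapreduce.Job API; the provided template may differ in its details. If the mapper's output types differed from the job's final output types, you would additionally call job.setMapOutputKeyClass()/job.setMapOutputValueClass().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapReduceWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(MapReduceWordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // job.setCombinerClass(WordCountReducer.class); // uncomment to enable the combiner
        job.setReducerClass(WordCountReducer.class);
        // job.setNumReduceTasks(8);                     // experiment with the number of reducers
        job.setOutputKeyClass(Text.class);               // KEYOUT: the word
        job.setOutputValueClass(IntWritable.class);      // VALUEOUT: the count
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /tmp/gutemberg.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}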

A possible solution is the one below. Notice that this Reducer can also be used as a Combiner.

Mapper:

private static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Reducer:

private static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Answers:

1. Some indicative figures are provided in the performance comparison of Section 3 below.
2. In our case, the number of map tasks was 3 and the number of reduce tasks was 1. By default, the number of map tasks for a given job is driven by the number of input splits. The parameter mapred.map.tasks can be used to set a suggested value. For reduce tasks, 1 is the default number. This might be a bad choice in terms of performance, but on the other hand the result lands in a single file, with no further merging necessary.
3. Generally, one reducer can become a bottleneck; here the bottleneck is more evident if no combiner is used. However, with multiple reducers, multiple output files are created and an additional merging step is needed to get a single result. Note: according to the official documentation, the right number of reduces is 0.95 or 1.75 multiplied by [no. of nodes] * [no. of maximum containers per node]. With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. For example, with our 4 worker nodes and, say, 8 containers per node, the 0.95 rule suggests roughly 0.95 * 4 * 8 ≈ 30 reduce tasks. The scaling factors are slightly less than whole numbers in order to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
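As a side note on question 3: with several reducers, the output is split across one part-r-NNNNN file per reducer. One way to merge them back into a single local file is getmerge (a sketch, reusing the output-folder placeholder from above):

$ hadoop fs -getmerge <path-to-hdfs-output-folder> results.txt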

2.4 Download and plot the results

The output of the MapReduce process has been stored in your remote cluster. You can now download the result file and plot the frequency of the 30 most common words in the dataset. The easiest way for this notebook to be able to access your result file is to set your container as public:

- Access the MapReduce container in portal.azure.com.
- Set the Access Policy of the container to public.
- Locate the result file in /tmp/results and copy the file URL. The link should now be accessible from your browser.
- Add the URL below, then run the cell to download the file and plot the results.

In [1]:

RESULT_FILE_URL = '...'
import urllib.request
urllib.request.urlretrieve(RESULT_FILE_URL, "results.txt")
print('Done')

Done

In [2]:

%matplotlib inline
import matplotlib.pyplot as plt
import operator

print('Plotting...')
freq = {}

# Read input and sort by frequency. Keep only the top 30.
with open('results.txt', 'rb') as csvfile:
    for line in csvfile.readlines():
        word, count = line.decode('utf-8').split('\t')
        freq[word] = int(count)
srt = sorted(freq.items(), key=operator.itemgetter(1), reverse=True)[:30]

# Generate plot
plt.figure(figsize=(16,6))
plt.bar(range(len(srt)), [x[1] for x in srt], align='center', color='#ba2121')
plt.xticks(range(len(srt)), [x[0] for x in srt])
plt.show()

Plotting...

If everything is correct, the 3 most frequent words should be "the", "of" and "and".

3. Performance comparison

Test your MapReduce job with the smaller gutemberg_x0.1.txt as well. If you want, you can also try gutemberg_x10.txt. For each test, write down the running time:

time mapreduce-wc /tmp/gutemberg_x0.1.txt /tmp/gut01/
time mapreduce-wc /tmp/gutemberg.txt /tmp/gut1/
time mapreduce-wc /tmp/gutemberg_x10.txt /tmp/gut10/

While a job proceeds, you can connect to the cluster dashboard to check its status. In the meanwhile, download the dataset to your laptop. You can either use wget or download one file at a time from your container in the Azure portal. Note: the biggest file is optional; you need at least 12.5GB of free space on your hard drive, and it will take some time to download and process! Alternatively, you can also run this experiment on the HeadNode of your cluster, where you should still have the text files.

wget <URL of gutemberg_x0.1.txt>
wget <URL of gutemberg.txt>
wget <URL of gutemberg_x10.txt>

We prepared a simple wordcount program in Python. Download it to your laptop (or to the cluster HeadNode) and test how long it takes to process the three datasets. Note down the times for the next exercise.

python wordcount.py < /pathtoyourfile/gutemberg_x0.1.txt
python wordcount.py < /pathtoyourfile/gutemberg.txt
python wordcount.py < /pathtoyourfile/gutemberg_x10.txt

3.1 Plot

Compare the performance of the MapReduce vs. the single-threaded implementation of the word count algorithm for the three different input sizes. Fill in the times (in seconds) in the code below to plot the results.

In [ ]:

# NOTE: remove the last number of each list below if you did not test gutemberg_x10.txt
size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]  # the input size in MB
time_mapreduce = [0., 0., 0.]    # replace the 0s with the times (in seconds) for the corresponding inputs
time_locallaptop = [0., 0., 0.]  # replace the 0s with the times (in seconds) for the corresponding inputs

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce, '#f37626', label='MapReduce', linewidth=3.0)
plt.plot(size_input, time_locallaptop, '#00b300', label='local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('input size (MB)')
plt.ylabel('time (seconds)')
plt.title('Wall-time comparison')
plt.show()

3.2 Discussion

1. Which is faster, MapReduce on your cluster or a local wordcount implementation? Why?
2. Based on your experiment, what input size is the approximate break-even point for time performance?
3. Why is MapReduce not performing better than local computation for small inputs?
4. How can you optimize the MapReduce performance for this job?

We have run some more tests. Here we present the running times for 4 configurations on HDInsight, one workstation and one laptop. The figures below are indicative only, because the performance of every machine depends on several factors.

- MapReduce v1: no combiner, default configuration (1 reducer)
- MapReduce v2: no combiner, 8 reduce tasks
- MapReduce v3: combiner, default configuration (1 reducer)
- MapReduce v4: combiner, 8 reduce tasks

Answers:

1. For small inputs, local computation is faster.
2. Using a combiner, MapReduce gets faster than a single workstation somewhere between 1GB and 10GB of data. This, of course, depends heavily on the task and on the configuration of the machines.
3. The shuffling phase adds a lot of overhead.
4. A combiner drastically improves the performance of this job, because the data to be shuffled can be reduced a lot.
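Why does the combiner help so much here? A rough back-of-the-envelope model (illustrative assumptions only): without a combiner, each map task shuffles one (word, 1) pair per token in its split, whereas with a combiner it shuffles roughly one (word, partial count) pair per distinct word in its split:

shuffle_without_combiner ∝ number of tokens in the split
shuffle_with_combiner    ∝ number of distinct words in the split

Since natural-language text repeats a small vocabulary very often, the second quantity is far smaller than the first, which matches the gap between v1/v2 and v3/v4 in the timings below.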

In [3]:

size_input = [1.2*1e2, 1.2*1e3, 1.2*1e4]  # the input size in MB
time_mapreduce1 = [123, 445, 2160]  # MapReduce v1
time_mapreduce2 = [100, 295, 970]   # MapReduce v2
time_mapreduce3 = [110, 240, 413]   # MapReduce v3
time_mapreduce4 = [110, 218, 385]   # MapReduce v4
#time_localstation = [11.4, 127, 1332]  # our workstation (java)
time_localstation = [5.1, 47.7, 475.5]  # our workstation
time_locallaptop = [14.9, 148.4, ]  # our laptop

%matplotlib inline
# Import plot library
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(18,9))
plt.plot(size_input, time_mapreduce1, '#7d0026', label='MapReduce v1', linewidth=3.0)
plt.plot(size_input, time_mapreduce2, '#f30000', label='MapReduce v2', linewidth=3.0)
plt.plot(size_input, time_mapreduce3, '#430064', label='MapReduce v3', linewidth=3.0)
plt.plot(size_input, time_mapreduce4, '#f37626', label='MapReduce v4', linewidth=3.0)
plt.plot(size_input, time_localstation, '#0055f3', label='local workstation', linewidth=3.0, linestyle='dashed')
plt.plot(size_input, time_locallaptop, '#00b300', label='local laptop', linewidth=3.0, linestyle='dashed')
plt.xscale('log')
plt.yscale('log')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('input size (MB)')
plt.ylabel('time (seconds)')
plt.title('Performance comparison')
plt.show()

Important: delete your cluster

It is very important that you delete your cluster from the Azure Portal as soon as you have finished this exercise.

4. Reverse engineering

Conceptually, a map function takes a key-value pair as input and emits a list of key-value pairs, while a reduce function takes a key with an associated list of values as input and returns a list of values or key-value pairs. Often the types of the final key and value are the same as those of the intermediate data:

map    (k1, v1)       --> list(k2, v2)
reduce (k2, list(v2)) --> list(k2, v2)

Analyze the following Mapper and Reducer, written in pseudo-code, and answer the questions below.

function map(key, value)
    emit(key, value);

function reduce(key, values)
    z = 0.0
    for value in values:
        z += value
    emit(key, z / values.length())

Questions

1. Explain what the result of running this job is on a list of pairs of type ([string], [float]).
2. Write the equivalent SQL query.
3. Could you use this reduce function as a combiner as well? Why or why not?
4. If your answer to the previous question was yes, does the number of different keys influence the effectiveness of the combiner? If your answer was no, can you change the map and reduce functions in such a way that the new reducer can be used as a combiner?

Answers:

1. The job outputs a pair (k1, v1) for each unique input key, where k1 is the input key and v1 is the average of the values associated with k1.
2. Equivalent to SELECT key, AVG(value) FROM tablename GROUP BY key.
3. No, because the average operation is not associative: in general, an average of partial averages is not the overall average.
4. If we allow the final output to contain an additional piece of information (how many samples the average represents), the reducer can be used as a combiner, with the values being themselves pairs:

function map(key, value)
    emit(key, (1, value));

function reduce(key, values)
    n = 0
    z = 0.0
    for value in values:
        n += value[0]
        z += value[0] * value[1]
    emit(key, (n, z / n))
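To make the answer to question 4 concrete, here is a minimal Java sketch of the combiner-friendly average (our own illustrative class names, not part of the exercise template; it assumes the input arrives as (Text key, FloatWritable value) pairs). The trick is a value type that carries (count, average), so that partial results can be recombined exactly.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Value type carrying (count, average) so partial averages can be merged.
class CountAverage implements Writable {
    private long count;
    private double average;

    public CountAverage() {}
    public CountAverage(long count, double average) { this.count = count; this.average = average; }

    public long getCount() { return count; }
    public double getAverage() { return average; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeDouble(average);
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readLong();
        average = in.readDouble();
    }
}

class AverageMapper extends Mapper<Text, FloatWritable, Text, CountAverage> {
    protected void map(Text key, FloatWritable value, Context context)
            throws IOException, InterruptedException {
        // A single value is the average of one sample.
        context.write(key, new CountAverage(1, value.get()));
    }
}

// Can serve both as Combiner and as Reducer: it merges (count, average) pairs.
class AverageReducer extends Reducer<Text, CountAverage, Text, CountAverage> {
    protected void reduce(Text key, Iterable<CountAverage> values, Context context)
            throws IOException, InterruptedException {
        long n = 0;
        double weightedSum = 0.0;
        for (CountAverage val : values) {
            n += val.getCount();
            weightedSum += val.getCount() * val.getAverage();
        }
        context.write(key, new CountAverage(n, weightedSum / n));
    }
}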

5. True or False

Say whether the following statements are true or false, and explain why.

1. Each mapper must generate the same number of key/value pairs as it received as input.
2. The TaskTracker is responsible for scheduling mappers and reducers and for making sure all nodes are running correctly.
3. Mappers' input key/value pairs are sorted by key.
4. MapReduce splits might not correspond to HDFS blocks.
5. One single Reducer is applied to all values associated with the same key.
6. Multiple Reducers can be assigned pairs with the same value.
7. In Hadoop MapReduce, the key-value pairs a Reducer outputs must be of the same type as its input pairs.

Answers:

1. False - for each input pair, the mapper can emit zero, one or several key/value pairs.
2. False - the JobTracker is responsible for this.
3. False - mapper input is not sorted.
4. True - since splits respect logical record boundaries, they might contain data from multiple HDFS blocks.
5. True - this is the principle behind partitioning: one Reducer is responsible for all values associated with a particular key.
6. True - values are not relevant in partitioning.
7. False - a Reducer's input and output pairs can have different types.

6. Some more MapReduce and SQL

Design, in Python or pseudo-code, MapReduce functions that take a very large file of integers and produce as output:

1. The largest integer.
2. The average of all the integers.
3. The same set of integers, but with each integer appearing only once.
4. The number of times each unique integer appears.
5. The number of distinct integers in the input.

For each of these, write the equivalent SQL query, assuming you have a column "values" that stores all the integers.

Answers:

1) SELECT MAX(values)... The naive solution below has a very low degree of parallelization, because everything is grouped on a single key. We can use a Combiner (identical to the Reducer) to increase the degree of parallelization.

function map(key, value)
    emit(1, key);  // group everything on the same key

function reduce(key, values)
    global_max = 0
    for value in values:
        if (value > global_max):
            global_max = value
    emit(global_max)

2) SELECT AVG(values)... Using the same map as above:

function reduce(key, values)
    sum = 0
    count = 0
    for value in values:
        sum += value
        count += 1
    emit(sum / count)

3) SELECT DISTINCT values...

function map(key, value)
    emit(key, 1)

function reduce(key, values)
    emit(key)

4) SELECT values, COUNT(*)... GROUP BY values

function map(key, value)
    emit(key, 1)

function reduce(key, values)
    emit(key, len(values))

5) SELECT COUNT(DISTINCT values)... Reuse (3) and add a post-processing step that combines the output of the reducers. A second MapReduce stage could also be used, but it would probably be overkill for this scenario.

7. TF-IDF in MapReduce

Imagine we want to build a search engine over the Gutenberg dataset of ~3,000 books. Given a word or a set of words, we want to rank these books according to their relevance for these words. We need a metric to measure the importance of a word in a set of documents.

7.1 Understand TF-IDF

TF-IDF is a statistic used to determine the relative importance of the words in a set of documents. It is computed as the product of two statistics, term frequency (tf) and inverse document frequency (idf). Given a word t, a document d (in this case a book) and the collection of all documents D, we can define tf(t, d) as the number of times t appears in d. This gives us some information about the content of a document, but because some terms (e.g. "the") are so common, term frequency tends to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms.

The inverse document frequency idf(t, D) is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.

It can be computed as:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

where |D| is the total number of documents and the denominator is the number of documents that contain the word t at least once. However, this would cause a division by zero if the user queries a word that never appears in the dataset. A better formulation is:

idf(t, D) = log( |D| / (1 + |{d ∈ D : t ∈ d}|) )

Then tfidf(t, d, D) is calculated as follows:

tfidf(t, d, D) = tf(t, d) * idf(t, D)

A high tfidf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.

7.2 Implement TF-IDF in MapReduce (pseudo-code)

Implement Mapper and Reducer functions in pseudo-code to compute TF-IDF. Assume each Mapper receives the document name as a string key and the entire document content as a string value. The output of your job should be a list of key-value pairs, where the key is a string of the form "document:word" and the value is the tfidf score for that document/word pair.

function map(doc_id, doc_content)
    ...

Solution:

1) The Mapper outputs each word-document pair:

function map(doc_id, doc_content)
    tokens = doc_content.split()
    for token in tokens:
        emit(token, (doc_id, 1));

2) An optional Combiner performs an aggregation over all occurrences of the same word per document:

function combiner(token, values)
    frequencies = {}  // empty dictionary
    for (doc_id, count) in values:
        frequencies[doc_id] += count  // the list of documents should fit in memory
    for doc_id in frequencies.keys():
        emit(token, (doc_id, frequencies[doc_id]));

3) The Reducer is called for each unique word in the collection and outputs a list of document:word pairs with the tf-idf score:

function reduce(token, values)
    frequencies = {}  // empty dictionary
    for (doc_id, count) in values:
        frequencies[doc_id] += count  // the list of documents should fit in memory
    idf = log(n / (1 + frequencies.keys().length()))  // idf of this token; n is the total number of documents
    for doc_id in frequencies.keys():
        tf = frequencies[doc_id]  // tf of this token-doc_id pair
        emit(doc_id + ":" + token, tf * idf)
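If you want to turn the pseudo-code into a runnable job, here is a rough Java sketch (our own illustrative classes, not part of the provided template). It assumes the input format delivers (document name, document content) as (Text, Text) pairs, and that the driver stores the total number of documents in the job Configuration under a hypothetical key "total.docs".

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TfIdfMapper extends Mapper<Text, Text, Text, Text> {
    protected void map(Text docId, Text docContent, Context context)
            throws IOException, InterruptedException {
        for (String token : docContent.toString().split("\\s+")) {
            // One ("word", "docId:1") pair per occurrence.
            context.write(new Text(token), new Text(docId.toString() + ":1"));
        }
    }
}

class TfIdfReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    protected void reduce(Text token, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Per-document term frequencies of this token; assumes they fit in memory.
        Map<String, Integer> frequencies = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(":");
            frequencies.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
        // "total.docs" is a hypothetical configuration key set by the driver.
        long totalDocs = context.getConfiguration().getLong("total.docs", 1);
        double idf = Math.log((double) totalDocs / (1 + frequencies.size()));
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            context.write(new Text(entry.getKey() + ":" + token),
                          new DoubleWritable(entry.getValue() * idf));
        }
    }
}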
