Hadoop Lab 3 Creating your first Map-Reduce Process


Meredith Greene
6 years ago

Programming for Big Data

Lab work
  Take the map-reduce code from these notes and get it running on your Hadoop VM:
    Driver code
    Mapper code
    Reducer code
  Explore the results in HDFS and on the web interface.
  Additional exercises: complete before next week.
  More MR labs next week, plus the Assignment Handout.


Java
  Yes, you are going to use and write your own Java code: Map-Reduce uses Java.
  You need to be good at writing Java and at working at the command line.
  Use Eclipse as your Java IDE.
    Download and install it on your own (VM) machine, or install it on your own machine, or use the lab PCs.
    Be careful about which version of Java you use: the VM has Java 1.7.
    For the VM, install Eclipse Luna (download and unzip).


Exercise 1: Your First Map-Reduce Job
  Read through the following slides and notes before you commence Exercise 1.
  There is a lot covered in these, and it is important to understand them before commencing.

Java: a simple test
  Before we get to the fun stuff:
    In Eclipse, create the basic Hello World program (follow the tutorial in Eclipse).
    Test running the program as a Java Application in Eclipse.
    Generate the jar file.
    Run it on the VM:
      java -jar HelloWorld.jar
    No Hadoop or fun stuff!
  Is this easy :) or :( ? This needs to be easy :) :)
  If using Eclipse on your own machine or a lab PC:
    Check that the Java version on the VM matches the version on your own machine.
    If it does not, you need to configure this. The version on the VM should not be changed.
    It is up to you to manage this.
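The Hello World test above could look like the following minimal sketch. The class name HelloWorld is taken from the jar name in the slide; the helper method greeting() is a hypothetical addition that only makes the message easy to check. The code avoids anything newer than Java 7, since the VM runs Java 1.7.

```java
// HelloWorld.java: a plain Java program with no Hadoop dependency.
class HelloWorld {

    // Hypothetical helper: returns the message so it can be checked separately.
    static String greeting() {
        return "Hello World";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

For java -jar HelloWorld.jar to work, the jar needs a main class recorded in its manifest; in Eclipse, exporting as a Runnable JAR file (or naming the main class in the JAR export wizard) takes care of this.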

Loading Data into Hadoop
  Check your VM to see what data already exists on it. Do all of these tasks on the VM.
    Download the sample data set /Hadoop/Notes/shakespeare.tar.gz
    Unzip the shakespeare.tar.gz file.
    Insert the shakespeare directory into HDFS using the put command:
      hadoop fs -put shakespeare shakespeare
    Enter hadoop fs -ls to see the updated contents in HDFS.
    Enter hadoop fs -ls shakespeare to see the contents of the shakespeare directory in HDFS.
    Note that the default location in HDFS is /user/<your name>.
    You can use these same steps to load your own data.
  If the data already exists then follow these steps:
    Access the contents of the poems file using
      hadoop fs -cat shakespeare/poems | less
    Browse the web interface for the NameNode at http://localhost:50070/ and explore the contents.
    How many blocks are used? What else can you find out?

Create your 1st Map-Reduce Process
  Set up a project in Eclipse for your work on your host machine.
  Import the Hadoop libraries into the project configuration; these are available on the VM:
    hadoop-common-<version>.jar, available at home/soc/yarn/hadoop-<version>/share/hadoop/common
    hadoop-mapreduce-client-core-<version>.jar, available at home/soc/yarn/hadoop-<version>/share/hadoop/mapreduce
  Create a Mapper, Reducer and Driver class in the project.
  Add the appropriate code to each class.

Create your 1st Map-Reduce Process
  Create a Mapper, Reducer and Driver class in the project.
  Add the appropriate code to each class.
    See the code in the notes.
    Sample code is available on the module webpage.
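Before looking at the sample code, it may help to see the WordCount logic on its own. The sketch below is plain Java with no Hadoop dependency and is not the module's sample code: the class name WordCountSketch and its methods are hypothetical, and the tokenising rule (lower-casing and splitting on non-word characters) is an assumption. In the real job, the map and reduce steps live in classes that extend Hadoop's Mapper and Reducer, and the framework performs the grouping between them.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map, shuffle and reduce steps of WordCount.
class WordCountSketch {

    // Map step: emit one (word, 1) pair for each word in a line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle step: group all values by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = grouped.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in order over a small in-memory input.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        System.out.println(wordCount(lines));
    }
}
```

The Driver has no counterpart here: in a real job it configures the Job object (input and output paths, Mapper and Reducer classes) and submits it to the cluster.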

Run your 1st Map-Reduce Process
  Compile the Mapper, Reducer and Driver classes.
  Create a jar file: Export -> Java -> Jar File
  Run the Map-Reduce process on Hadoop:
    hadoop jar <jar filespec> <driver class name> <input-hdfs-dir> <output-hdfs-dir>
  E.g. to run WordCount on Shakespeare's poems:
    hadoop jar WordCount.jar WordCount shakespeare/poems myoutput
  NOTE: Before running the above command, check whether a file or directory with this name (myoutput) already exists in HDFS. If it does, you will need to remove it.

Monitor & Review your 1st Map-Reduce Process
  Browse the web interface and see that the job is running.
  When the job finishes, look at its History:
    How many mappers ran?
    How many reducers ran?
    How many input records were read by mappers? (See Counters)
  Browse the logs for the mappers and reducers.
  Note: you will see a stdout, stderr and syslog for each mapper and reducer that ran.

Examine output from your 1st Map-Reduce Process
  Check HDFS for the output using either the command line or the web interface. (For the example above, the output will be in a directory called myoutput.)
  Browse the output directory in HDFS. The part-r-0000x file(s) hold the output data, one file per reducer.
  Browse the part-r file(s) and check the output.
  Note: The output directory must not exist before running the job. If it does, Hadoop will complain and not run the job. This precaution prevents data loss, such as accidentally overwriting the output of a long job with the output of another.

Exercise 2: Calculate the Averages
  Using the structure of the WordCount programme, write a Hadoop program that calculates the average word length of all words that start with each character.
  To do this, consider:
    What key/value pairs should the Mapper output?
    Change the SumReducer to be an AverageReducer, which calculates an average rather than a sum.
  Complete this exercise before moving on to the next topic and exercises.
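One possible way to think about Exercise 2, shown in plain Java rather than as a Hadoop job: use the first character of each word as the key and the word's length as the value, then average the values for each key. This is only a sketch of the logic under stated assumptions, not the expected solution. The class name AverageWordLengthSketch and its methods are hypothetical, words are split on non-word characters, and case is preserved, so "A" and "a" would be separate keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of averaging word lengths per starting character.
class AverageWordLengthSketch {

    // Map step plus grouping: key = first character, value = word length.
    static Map<Character, List<Integer>> mapAndGroup(String text) {
        Map<Character, List<Integer>> grouped = new TreeMap<Character, List<Integer>>();
        for (String word : text.split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            char first = word.charAt(0);
            List<Integer> lengths = grouped.get(first);
            if (lengths == null) {
                lengths = new ArrayList<Integer>();
                grouped.put(first, lengths);
            }
            lengths.add(word.length());
        }
        return grouped;
    }

    // Reduce step: average the lengths for one key.
    // The cast to double avoids integer division truncating the result.
    static double average(List<Integer> lengths) {
        long sum = 0;
        for (int length : lengths) {
            sum += length;
        }
        return (double) sum / lengths.size();
    }

    // Applies the reduce step to every key.
    static Map<Character, Double> averageLengths(String text) {
        Map<Character, Double> result = new TreeMap<Character, Double>();
        for (Map.Entry<Character, List<Integer>> entry : mapAndGroup(text).entrySet()) {
            result.put(entry.getKey(), average(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(averageLengths("apple ant bee"));
    }
}
```

In the Hadoop version, the main changes from WordCount are the types of the key/value pairs the Mapper emits and the type of the value the Reducer writes, since an average is generally not a whole number.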


Exercise 3: Debugging a Map-Reduce Process
  You can include System.out.println() or System.err.println() statements in the code.
    For the Driver, the output is visible on the console.
    For the Mapper or Reducer, the output is visible through the web UI:
      View the Application History for the application at the Resource Manager, http://localhost:8088
      Select MapTasks for mappers.
      Select a Map task.
      Select the Logs for the task.

Debugging a Map-Reduce Process
  To debug, you can also set the number of reducers to zero. The output of the map tasks then goes directly to HDFS, unsorted.
  In the Driver, call job.setNumReduceTasks(0) on the Job object.
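To picture what a map-only run produces, the plain-Java sketch below prints the raw map output for one line, in input order, with no grouping or summing. It also writes a trace line to stderr, which is the kind of statement that would end up in a task's stderr log on the cluster. The class name MapOnlySketch is hypothetical and no Hadoop classes are used; the key, tab, value layout mirrors Hadoop's default text output.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of map output when no reduce step runs.
class MapOnlySketch {

    // Map step with a debug trace; each output record is "word<TAB>1".
    static List<String> map(String line) {
        System.err.println("map() received: " + line);
        List<String> output = new ArrayList<String>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                output.add(word + "\t1");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Records appear in input order and repeated words are not combined.
        for (String record : map("to be or not to be")) {
            System.out.println(record);
        }
    }
}
```

With the number of reducers set to zero, the output files in HDFS are named part-m-0000x rather than part-r-0000x, one per map task.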

Complete all exercises before next class.
