COMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.

Contents
- Introduction to MapReduce
- A WordCount example
- Running MapReduce applications

MapReduce
- Processes vast amounts of data
- Parallel computation
- Reliable, fault-tolerant
- <key, value> data format

When to Use
- Data size reaches the terabyte (TB) range
- The task can be divided into independent sub-tasks, e.g. the input file of the WordCount program can be sliced into several sub-files
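The independence of the sub-tasks can be illustrated without Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: the class name `SliceCountSketch` and its methods are hypothetical, and the "slices" are in-memory strings standing in for sub-files. It counts the words of each slice separately and then merges the partial counts, which gives the same result as counting the whole input at once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the lab code or of Hadoop.
public class SliceCountSketch {

    // Count the words of one slice, independently of every other slice.
    public static Map<String, Integer> countSlice(String slice) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(slice);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Merge the partial counts of all slices into one total.
    public static Map<String, Integer> countAll(List<String> slices) {
        Map<String, Integer> total = new TreeMap<>();
        for (String slice : slices) {
            countSlice(slice).forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two slices standing in for two sub-files of one input file.
        List<String> slices = Arrays.asList("hello world", "hello hadoop");
        System.out.println(countAll(slices)); // {hadoop=1, hello=2, world=1}
    }
}
```

Because no slice needs data from another slice, the per-slice counts could run on different machines in any order; only the final merge needs all of them.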

When NOT to Use
- Real-time data access
- Random data access
- Data that changes frequently
- Iterative tasks where each step depends on the previous step

Infrastructure
The MapReduce infrastructure and HDFS both run on the compute nodes, which gives high aggregate bandwidth.

Input / Output
    Input <Key1, Value1> -> map -> <Key2, Value2> -> combine -> <Key2, Value2> -> reduce -> <Key3, Value3> Output
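Between the map/combine stage and the reduce stage, the framework groups all values that share a key, so each reduce call receives one key together with the collection of its values. A minimal plain-Java sketch of that grouping step follows. The class `ShuffleSketch` and the method `groupByKey` are hypothetical names used only for illustration; real Hadoop performs this across nodes with partitioning and sorting rather than in a single in-memory map.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the lab code or of Hadoop.
public class ShuffleSketch {

    // Group <Key2, Value2> pairs into <Key2, list of Value2>, ordered by key.
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs as a map stage might emit them for the input "hello world hello".
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("hello", 1));
        mapOutput.add(new SimpleEntry<>("world", 1));
        mapOutput.add(new SimpleEntry<>("hello", 1));

        System.out.println(groupByKey(mapOutput)); // {hello=[1, 1], world=[1]}
    }
}
```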

Word Count

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits <word, 1> for every token of an input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts of one word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: args[0] is the input path, args[1] is the output path.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Part Explanation: Map Function

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

- Processes one line of input at a time
- Splits the line into tokens by whitespace
- Output: <word, 1>
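To see what the map function emits without starting a Hadoop job, the same tokenizing logic can be run on a plain string. This is a minimal sketch under stated assumptions: `MapSketch` is a hypothetical class, `String` and `Integer` stand in for Hadoop's `Text` and `IntWritable`, and the returned list stands in for the calls to `context.write`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration class; not part of the lab code or of Hadoop.
public class MapSketch {

    // Mirrors the map logic: one input line in, one <word, 1> pair out per token.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("to be or not to be"));
        // [to=1, be=1, or=1, not=1, to=1, be=1]
    }
}
```

Repeated words are emitted once per occurrence; nothing is summed at this stage. Since the split is on whitespace only, punctuation stays attached to a token, so "be," and "be" would count as different words.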

Part Explanation: Reduce Function

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

- Sums up the values of each key
- Output: <key, value>
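The WordCount driver registers the same class as both combiner and reducer. The plain-Java sketch below suggests why that works for a sum: adding up partial sums gives the same total as adding up every value directly. `ReduceSketch` is a hypothetical class, `Integer` stands in for `IntWritable`, and "nodeA"/"nodeB" are made-up names for two map tasks.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the lab code or of Hadoop.
public class ReduceSketch {

    // Mirrors the reduce logic: sum all values that arrived for one key.
    public static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Without a combiner: the reducer sees every 1 emitted for the key.
        int direct = reduce(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner: each map task pre-sums its own values locally,
        // and the reducer only sums the partial results.
        int nodeA = reduce(Arrays.asList(1, 1, 1));
        int nodeB = reduce(Arrays.asList(1, 1));
        int combined = reduce(Arrays.asList(nodeA, nodeB));

        System.out.println(direct + " " + combined); // 5 5
    }
}
```

This reuse relies on the operation being associative and commutative, as addition is; a reducer that computed an average, for example, could not be reused as a combiner unchanged.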

Check Variables
Variables set in ~/.bashrc:

    JAVA_HOME = /usr/lib/jvm/java openjdk-amd64
    PATH = $PATH:/home/panda/hadoop/hadoop /bin:/home/panda/hadoop/hadoop-3.0.2/sbin
    HADOOP_CLASSPATH = ${JAVA_HOME}/lib/tools.jar    (designates the compiler)

Compile & JAR
Compile the source code:

    $ bin/hadoop com.sun.tools.javac.Main WordCount.java

Create the JAR:

    $ jar cf wc.jar WordCount*.class

- c: create a JAR file
- f: write the output to a file rather than to stdout

The generated JAR package is wc.jar. All code must be packaged into a JAR in order to be executed by MapReduce.

Example
Remember to format first!

    1. cd share
    2. mkdir WordCount
    3. cd WordCount
    4. wget
    5. unzip Hadoop-WordCount.zip    # may need to install unzip first
    6. hdfs dfs -put Word_Count_input.txt /user/panda/input    # upload input

If the downloaded package already includes a compiled JAR, skip compilation.

Access HDFS
File structure:

    /user
      /panda
        /input
          /Word_Count_input.txt
        /output

List the files in an HDFS directory:

    $ hdfs dfs -ls /user/panda/input/

Check the content of the input file:

    $ hdfs dfs -cat /user/panda/input/Word_Count_input.txt

Run & Result
Run the job (wordcount.jar is the JAR package; the output directory does not need to be created manually):

    $ hadoop jar wordcount.jar WordCount /user/panda/input/Word_Count_input.txt /user/panda/output

Print the result (part-r is the output filename):

    $ hadoop fs -cat /user/panda/output/part-r

Result
- The output file lists <word, count> pairs
- The console shows the execution information of the job
