UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus
Getting to know MapReduce - MapReduce Execution Pipeline - Runtime Coordination and Task Management - MapReduce Application - Hadoop Word Count Implementation.

INTRODUCTION:

Hadoop is much more than a highly available, massive data storage engine. One of the main advantages of using Hadoop is that you can combine data storage and processing. Hadoop's main processing engine is MapReduce, which is currently one of the most popular big-data processing frameworks available. It enables you to seamlessly integrate existing Hadoop data storage into processing, and it provides a unique combination of simplicity and power. Numerous practical problems (ranging from log analysis, to data sorting, to text processing, to pattern-based search, to graph processing, to machine learning, and much more) have been solved using MapReduce.

GETTING TO KNOW MAPREDUCE

MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge data sets using a large number of commodity computers.

A map takes as input a function and a sequence of values, and applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. A combinator is a function that builds program fragments from program fragments; combinators aid in programming at a higher level of abstraction, and enable you to separate the strategy from the implementation.
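To make these functional roots concrete, the following minimal sketch (plain Java streams, not Hadoop code, and purely illustrative) maps a length function over a sequence of words and then reduces the results with addition:

import java.util.Arrays;
import java.util.List;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("big", "data", "rocks");
        // map: apply a function (here, String::length) to each value
        // reduce: combine the results with a binary operation (here, addition)
        int totalLength = words.stream()
                               .map(String::length)
                               .reduce(0, Integer::sum);
        System.out.println(totalLength); // prints 12
    }
}

Hadoop's mappers and reducers generalize exactly this pattern to key/value pairs distributed across a cluster.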
MapReduce was introduced to solve large-data computational problems, and is specifically designed to run on commodity hardware. It is based on divide-and-conquer principles: the input data sets are split into independent chunks, which are processed by the mappers in parallel. Additionally, execution of the maps is typically co-located with the data. The framework then sorts the outputs of the maps, and uses them as an input to the reducers.

The responsibility of the user is to implement mappers and reducers (classes that extend Hadoop-provided base classes) to solve a specific problem. As shown in Figure 3-1, a mapper takes input in the form of key/value pairs (k1, v1) and transforms them into another key/value pair (k2, v2). The MapReduce framework sorts a mapper's output key/value pairs and combines each unique key with all its values (k2, {v2, v2, ...}). These key/value combinations are delivered to reducers, which translate them into yet another key/value pair (k3, v3).

A mapper and reducer together constitute a single Hadoop job. A mapper is a mandatory part of a job, and can produce zero or more key/value pairs (k2, v2). A reducer is an optional part of a job, and can produce zero or more key/value pairs (k3, v3). The user is also responsible for the implementation of a driver (that is, the main application controlling some of the aspects of the execution).

The responsibility of the MapReduce framework (based on the user-supplied code) is to provide the overall coordination of execution. This includes choosing appropriate machines (nodes) for running mappers; starting and monitoring the mappers' execution; choosing appropriate locations for the reducers' execution; sorting and shuffling output of mappers and delivering the output to reducer nodes; and starting and monitoring the reducers' execution.

MAPREDUCE EXECUTION PIPELINE

Any data stored in Hadoop (including HDFS and HBase) or even outside of Hadoop (for example, in a database) can be used as an input to the MapReduce job. Similarly, output of the job can be stored either in Hadoop (HDFS or HBase) or outside of it. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. Figure 3-2 shows a high-level view of the MapReduce processing architecture. Following are the main components of the MapReduce execution pipeline:

Driver: This is the main program that initializes a MapReduce job. It defines job-specific configuration, and specifies all of its components (including input and output formats, mapper and reducer, use of a combiner, use of a custom partitioner, and so on). The driver can also get back the status of the job execution.
FIGURE 3-2: High-level Hadoop execution architecture

Context: The driver, mappers, and reducers are executed in different processes, typically on multiple machines. A context object (not shown in Figure 3-2) is available at any point of MapReduce execution. It provides a convenient mechanism for exchanging required system and job-wide information. Keep in mind that context coordination happens only when an appropriate phase (driver, map, reduce) of a MapReduce job starts. This means that, for example, values set by one mapper are not available in another mapper (even if another mapper starts after the first one completes), but are available in any reducer.
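As an illustration of this job-wide information exchange, a driver can place a value into the job's Configuration, and every task can read it back through its context. The following is a minimal sketch under assumptions not in the original text (the property name my.job.param and the class names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ContextExample {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String param;

        @Override
        protected void setup(Context context) {
            // Read the job-wide value the driver placed into the configuration.
            param = context.getConfiguration().get("my.job.param");
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("my.job.param", "42"); // visible to every mapper and reducer
        Job job = new Job(conf, "Context example");
        job.setMapperClass(MyMapper.class);
        // ... remaining job setup proceeds as in the word count example later ...
    }
}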
Input data: This is where the data for a MapReduce task is initially stored. This data can reside in HDFS, HBase, or other storage. Typically, the input data is very large: tens of gigabytes or more.

InputFormat: This defines how input data is read and split. InputFormat is a class that defines the InputSplits that break input data into tasks, and provides a factory for RecordReader objects that read the file. Several InputFormats are provided by Hadoop. InputFormat is invoked directly by a job's driver to decide (based on the InputSplits) the number and location of the map task execution.

InputSplit: An InputSplit defines a unit of work for a single map task in a MapReduce program. A MapReduce program applied to a data set is made up of several (possibly several hundred) map tasks. The InputFormat (invoked directly by a job driver) defines the number of map tasks that make up the mapping phase. Each map task is given a single InputSplit to work on. After the InputSplits are calculated, the MapReduce framework starts the required number of mapper jobs in the desired locations.

RecordReader: Although the InputSplit defines a data subset for a map task, it does not describe how to access the data. The RecordReader class actually reads the data from its source (inside a mapper task), converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method. The RecordReader class is defined by the InputFormat. Chapter 4 shows examples of how to implement a custom RecordReader.

Mapper: The mapper performs the user-defined work of the first phase of the MapReduce program. From the implementation point of view, a mapper implementation takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map execution. The map typically transforms the input pair into an output pair (k2, v2), which is used as an input for shuffle and sort. A new instance of a mapper is instantiated in a separate JVM instance for each map task that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine.

Partition: A subset of the intermediate key space (k2, v2) produced by each individual mapper is assigned to each reducer. These subsets (or partitions) are the inputs to the reduce tasks. Each map task may emit key/value pairs to any partition. All values for the same key are always reduced together, regardless of which mapper they originated from. As a result, all of the map nodes must agree on which reducer will process the different pieces of the intermediate data. The Partitioner class determines which reducer a given key/value pair will go to. The default Partitioner computes a hash value for the key, and assigns the partition based on this result. A custom partitioner can override this decision, as the sketch below shows.
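A minimal sketch of a custom Partitioner, under the assumption (not from the original text) that keys are Text and values are IntWritable; it routes all keys that start with the same character to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Keys sharing a first character land in the same partition,
        // and therefore go to the same reducer.
        String s = key.toString();
        int first = s.isEmpty() ? 0 : s.charAt(0);
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).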
Shuffle: Each node in a Hadoop cluster might execute several map tasks for a given job. Once at least one map function for a given node is completed, and the key space is partitioned, the runtime begins moving the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

Sort: Each reduce task is responsible for processing the values associated with several intermediate keys. The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop to form keys/values (k2, {v2, v2, ...}) before they are presented to the reducer.

Reducer: A reducer is responsible for the execution of user-provided code for the second phase of job-specific work. For each key assigned to a given reducer, the reducer's reduce() method is called once. This method receives a key, along with an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The reducer typically transforms the input key/value pairs into output pairs (k3, v3).

OutputFormat: The way that job output (which can be produced by the reducer, or by the mapper if a reducer is not present) is written is governed by the OutputFormat. The responsibility of the OutputFormat is to define a location of the output data and the RecordWriter used for storing the resulting data. Examples in Chapter 4 show how to implement a custom OutputFormat.

RecordWriter: A RecordWriter defines how individual output records are written.

The following are two optional components of MapReduce execution (see the wiring sketch after this list):

Combiner: This is an optional processing step that can be used for optimizing MapReduce job execution. If present, a combiner runs after the mapper and before the reducer. An instance of the Combiner class runs in every map task and some reduce tasks. The Combiner receives all data emitted by mapper instances as input, and tries to combine values with the same key, thus reducing the key space, and decreasing the number of keys (not necessarily data) that must be sorted. The output from the Combiner is then sorted and sent to the reducers. Chapter 4 provides additional information about combiners.

Distributed cache: An additional facility often used in MapReduce jobs is a distributed cache. This is a facility that enables the sharing of data globally by all nodes on the cluster. The distributed cache can be a shared library to be accessed by each task, a global lookup file holding key/value pairs, jar files (or archives) containing executable code, and so on. The cache copies over the file(s) to the machines where the actual execution occurs, and makes them available for local usage.
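In the word count job shown later, the reduce logic (summing counts) is associative, so the same class can double as a combiner. A hedged sketch of wiring both optional components in the driver (the HDFS path is a hypothetical example):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;

// ... inside the driver, after the mapper and reducer have been configured:

// Combiner: reuse the Reduce class, since summing partial counts is safe.
job.setCombinerClass(Reduce.class);

// Distributed cache: ship a read-only lookup file to every node
// where tasks of this job execute (the path is illustrative only).
DistributedCache.addCacheFile(new URI("/shared/lookup.txt"), job.getConfiguration());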
One of the most important MapReduce features is the fact that it completely hides the complexity of managing a large distributed cluster of machines, and the coordination of job execution between these nodes. A developer's programming model is very simple: he or she is responsible only for the implementation of mapper and reducer functionality, as well as a driver that brings them together as a single job and configures the required parameters. All user code is then packaged into a single jar file (in reality, the MapReduce framework can operate on multiple jar files) that can be submitted for execution on the MapReduce cluster.

RUNTIME COORDINATION AND TASK MANAGEMENT IN MAPREDUCE

Once the job jar file is submitted to a cluster, the MapReduce framework takes care of everything else. It transparently handles all of the aspects of distributed code execution on clusters ranging from a single node to a few thousand nodes. The MapReduce framework provides the following support for application development:

Scheduling: The framework ensures that multiple tasks from multiple jobs are executed on the cluster. Different schedulers provide different scheduling strategies, ranging from first come, first served, to ensuring that all the jobs from all users get their fair share of a cluster's execution. Another aspect of scheduling is speculative execution, which is an optimization implemented by MapReduce. If the JobTracker notices that one of the tasks is taking too long to execute, it can start an additional instance of the same task (using a different TaskTracker). The rationale behind speculative execution is ensuring that non-anticipated slowness of a given machine will not slow down execution of the task. Speculative execution is enabled by default, but you can disable it for mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution job options to false, respectively.
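For example, a job that should never launch speculative task instances can switch both options off in its configuration; a minimal sketch using the property names given above:

Configuration conf = new Configuration();
// Disable speculative execution for map and reduce tasks, respectively.
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
Job job = new Job(conf, "No speculation");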
Synchronization: MapReduce execution requires synchronization between the map and reduce phases of processing. (The reduce phase cannot start until all of a map's key/value pairs are emitted.) At this point, intermediate key/value pairs are grouped by key, which is accomplished by a large distributed sort involving all the nodes that executed map tasks, and all the nodes that will execute reduce tasks.

Error and fault handling: To accomplish job execution in an environment where errors and faults are the norm, the JobTracker attempts to restart failed task executions.

As shown in Figure 3-3, Hadoop MapReduce uses a very simple coordination mechanism. A job driver uses InputFormat to partition a map's execution (based on data splits), and initiates a job client, which communicates with the JobTracker and submits the job for execution. Once the job is submitted, the job client can poll the JobTracker waiting for the job completion. The JobTracker creates one map task for each split and a set of reducer tasks. (The number of created reduce tasks is determined by the job configuration.)

The actual execution of the tasks is controlled by TaskTrackers, which are present on every node of the cluster. TaskTrackers start map jobs and run a simple loop that periodically sends a heartbeat message to the JobTracker. Heartbeats have a dual function here: they tell the JobTracker that a TaskTracker is alive, and they are used as a communication channel. As a part of the heartbeat, a TaskTracker indicates when it is ready to run a new task. At this point, the JobTracker uses a scheduler to allocate a task for execution on a particular node, and sends its content to the TaskTracker by using the heartbeat return value. Hadoop comes with a range of schedulers (with the fair scheduler currently being the most widely used one). Once the task is assigned to the TaskTracker controlling its task slots (currently, every node can run several map and reduce tasks, and has several map and reduce slots assigned to it), the next step is for it to run the task.
First, the TaskTracker localizes the job jar file by copying it to its filesystem. It also copies any files needed by the application from the distributed cache to the local disk, and creates an instance of the task runner to run the task. The task runner launches a new Java virtual machine (JVM) for task execution. The child process (task execution) communicates with its parent (TaskTracker) through the umbilical interface. This way, it informs the parent of the task's progress every few seconds until the task is complete. When the JobTracker receives a notification that the last task for a job is complete, it changes the status for the job to completed. The job client discovers job completion by periodically polling for the job's status.

NOTE: By default, Hadoop runs every task in its own JVM to isolate them from each other. The overhead of starting a new JVM is around 1 second, which, in the majority of cases, is insignificant (compare it to several minutes for the execution of the map task itself). In the case of very small, fast-running map tasks (where the order of execution time is in seconds), Hadoop allows you to enable several tasks to reuse JVMs by specifying the job configuration option mapreduce.job.jvm.numtasks. If the value is 1 (the default), then JVMs are not reused. If it is -1, there is no limit to the number of tasks (of the same job) a JVM can run. It is also possible to specify some value greater than 1 using the Job.getConfiguration().setInt(Job.JVM_NUM_TASKS_TO_RUN, int) API.

MAPREDUCE APPLICATION

Listing 3-1 shows a very simple implementation of a word count MapReduce job.

Listing 3-1: Hadoop word count implementation

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            Iterator<IntWritable> values = val.iterator();
            while (values.hasNext()) {
                sum += values.next().get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Word Count");
        job.setJarByClass(WordCount.class);

        // Set up the input
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));

        // Mapper
        job.setMapperClass(Map.class);

        // Reducer
        job.setReducerClass(Reduce.class);

        // Output
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // Execute
        boolean res = job.waitForCompletion(true);
        if (res)
            return 0;
        else
            return -1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }
}

This implementation has two inner classes, Map and Reduce, that extend Hadoop's Mapper and Reducer classes, respectively.
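Once compiled and packaged, the job is launched with the hadoop jar command. Assuming (hypothetically) that the classes are packaged into wordcount.jar and that the input and output locations are HDFS paths, an invocation would look like this:

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

The first argument after the class name becomes args[0] (the input path) and the second becomes args[1] (the output path) in the run() method above.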
MAPPER CLASS

The Mapper class has three key methods (which you can override): setup, cleanup, and map (the only one that is implemented here). Both the setup and cleanup methods are invoked only once during a specific mapper's life cycle: at the beginning and end of mapper execution, respectively. The setup method is used to implement the mapper's initialization (for example, reading shared resources, connecting to HBase tables, and so on), whereas cleanup is used for cleaning up the mapper's resources and, optionally, if the mapper implements an associative array or counter, to write out the information.

Run method of the base mapper class:

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

This method is behind most of the magic of the mapper class execution. The MapReduce pipeline first sets up execution, that is, does all the necessary initialization. Then, while input records exist for this mapper, the map method is invoked with a key and value passed to it. Once all the input records are processed, a cleanup is invoked, including invocation of the cleanup method of the mapper class itself.

The Hollywood principle ("Don't call us, we'll call you") is a useful software development technique in which an object's (or component's) initial condition and ongoing life cycle is handled by its environment, rather than by the object itself. This principle is typically used for implementing a class/component that must fit into the constraints of an existing framework.
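Putting setup, map, and cleanup together, the following hedged sketch (not from the original text) shows a mapper that accumulates a counter across all map() calls and writes it out once, in cleanup, as described above:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int records;

    @Override
    protected void setup(Context context) {
        // One-time initialization at the start of the mapper's life cycle.
        records = 0;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        records++; // accumulate across all map() invocations of this task
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Write the accumulated information out once, at the end.
        context.write(new Text("records"), new IntWritable(records));
    }
}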
REDUCER CLASS

Similar to the mapper, a reducer class has three key methods (setup, cleanup, and reduce) as well as a run method (similar to the run method of the mapper class). Functionally, the methods of the reducer class are similar to the methods of the mapper class. The difference is that, unlike a map method that is invoked with a single key/value pair, a reduce method is invoked with a single key and an iterable set of values. (Remember, a reducer is invoked after execution of shuffle and sort, at which point all the input key/value pairs are sorted, and all the values for the same key are partitioned to a single reducer and come together.) A typical implementation of the reduce method iterates over the set of values, transforms the key/value pairs into new ones, and writes them to the output.

The WordCount class itself implements the Tool interface, which means that it must implement the run method responsible for configuring the MapReduce job. This method first creates a configuration object, which is used to create a job object. A default configuration object constructor (used in the example code) simply reads the default configuration of the cluster. If some specific configuration is required, it is possible to either overwrite the defaults (once the configuration is created), or set additional configuration resources that are used by the configuration constructor to define additional parameters. A job object represents the job submitter's view of the job. It allows the user to configure the job's parameters (they will be stored in the configuration object), submit it, control its execution, and query its state.

Job setup is comprised of the following main sections:

Input setup: This is set up as an InputFormat, which is responsible for the calculation of the job's input splits and the creation of the data reader. In this example, TextInputFormat is used. This InputFormat leverages its base class (FileInputFormat) to calculate splits (by default, these will be HDFS blocks) and creates a LineRecordReader as its reader. Several additional InputFormats supporting HDFS, HBase, and even databases are provided by Hadoop, covering the majority of scenarios used by MapReduce jobs. Because an InputFormat based on an HDFS file is used in this case, it is necessary to specify the location of the input data. You do this by adding an input path to the TextInputFormat class. It is possible to add multiple paths to the HDFS-based input format, where every path can specify either a specific file or a directory. In the latter case, all files in the directory are included as an input to the job.
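For instance, several input paths can be registered on the same job; a short sketch with hypothetical HDFS paths:

// Each call adds one more file or directory to the job's input.
TextInputFormat.addInputPath(job, new Path("/logs/2013/01"));
TextInputFormat.addInputPath(job, new Path("/logs/2013/02"));
TextInputFormat.addInputPath(job, new Path("/logs/extra/part-00000"));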
Mapper setup: This sets up the mapper class that is used by the job.

Reducer setup: This sets up the reducer class that is used by the job. In addition, you can set up the number of reducers that are used by the job. (There is a certain asymmetry in Hadoop setup: the number of mappers depends on the size of the input data and the split, whereas the number of reducers is explicitly settable.) If this value is not set up, a job uses a single reducer. For MapReduce applications that specifically do not want to use reducers, the number of reducers must be set to 0.

Output setup: This sets up the output format, which is responsible for outputting the results of the execution. The main function of this class is to create an OutputWriter. In this case, TextOutputFormat (which creates a LineRecordWriter for outputting data) is used. Several additional OutputFormats supporting HDFS, HBase, and even databases are provided with Hadoop, covering the majority of scenarios used by MapReduce jobs. In addition to the output format, it is necessary to specify the data types used for the output key/value pairs (Text and IntWritable, in this case), and the output directory (used by the output writer). Hadoop also defines a special output format, NullOutputFormat, which should be used in the case where a job does not use an output (for example, it writes its output to HBase directly from either map or reduce). In this case, you should also use the NullWritable class for the output key/value pair types.

Finally, when the job object is configured, a job can be submitted for execution. Two main APIs are used for submitting a job using a Job object:

The submit method submits a job for execution, and returns immediately. In this case, if, at some point, execution must be synchronized with completion of the job, you can use the method isComplete() on the Job object to check whether the job has completed. Additionally, you can use the isSuccessful() method on the Job object to check whether a job has completed successfully.

The waitForCompletion method submits a job, monitors its execution, and returns only when the job is completed.
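A minimal sketch of the asynchronous pattern described above (assuming the enclosing method declares throws Exception):

// Submit and return immediately, then poll for completion.
job.submit();
while (!job.isComplete()) {
    Thread.sleep(5000); // wait before polling the job status again
}
System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");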