Implementing Algorithmic Skeletons over Hadoop


Implementing Algorithmic Skeletons over Hadoop

Dimitrios Mouzopoulos

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2011

Abstract

In the past few years, there has been a growing interest in storing and processing vast amounts of data that often exceed the petabyte scale. To that end, MapReduce, a computational paradigm that was introduced by Google in 2003, has become particularly popular. It provides a simple interface with two functions, map and reduce, for developing and implementing scalable parallel applications. The goal of this project is to enhance Hadoop, the open source implementation of Google's MapReduce and its accompanying distributed file system, so that it supports additional computational paradigms. By providing more parallel patterns to the user of Hadoop, we believe that the task of dealing with specific kinds of problems becomes simpler and easier. To this end, we present our design of four Algorithmic Skeletons over Hadoop. Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. We implement these skeleton operations and, along with a streaming mechanism, we offer them in a library of skeleton operations. Using these operations on problems that are a good fit for the abstract parallel processing pattern they encapsulate results in more concise and efficient programs.

Acknowledgements

Special thanks are dedicated to Dr. Stratis Viglas, my thesis supervisor, not only for his constant and meaningful guidance and his expert opinion on my project, but also for his thoughtful support during the development and completion of this thesis. I wish to acknowledge the work of the Apache Software Foundation and all the individuals who were involved in the implementation of Hadoop, as well as all the groups and researchers who contributed with their work to Algorithmic Skeletons. To my family, for their continuous support and encouragement during my studies.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Dimitrios Mouzopoulos)

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Structure of the Report
2 On Algorithmic Skeletons and MapReduce
  2.1 Algorithmic Skeletons
  2.2 MapReduce
    2.2.1 Hadoop
  2.3 Skeletons over Hadoop
  2.4 Related Work
    2.4.1 Overview of Algorithmic Skeletons
    2.4.2 Related work for MapReduce
  2.5 Summary
3 Background
  3.1 Hadoop's MapReduce Implementation
    3.1.1 Setting up a MapReduce job
  3.2 Parallel For
  3.3 Parallel Sort
  3.4 Parallel While
    3.4.1 Condition in Parallel While
  3.5 Parallel If
    3.5.1 Condition in Parallel If
  3.6 Summary
4 Design
  4.1 Designing Parallel Skeletons over Hadoop
  4.2 Designing Parallel For over Hadoop
  4.3 Designing Parallel Sort over Hadoop
  4.4 Designing Parallel While over Hadoop
  4.5 Designing Parallel If over Hadoop
  4.6 Designing a Streaming API for Parallel Skeletons over Hadoop
  4.7 Summary
5 Implementation
  5.1 General Guidelines for Implementing Algorithmic Skeletons over Hadoop
  5.2 Implementing Parallel For over Hadoop
  5.3 Implementing Parallel Sort over Hadoop
  5.4 Implementing Parallel While over Hadoop
  5.5 Implementing Parallel If over Hadoop
  5.6 Implementing a Streaming API for Parallel Skeletons over Hadoop
  5.7 Summary
6 Evaluation
  6.1 Information regarding the process of Evaluation
    6.1.1 Environment of Evaluation
    6.1.2 Input
  6.2 Execution Time
    6.2.1 Evaluation of For
    6.2.2 Evaluation of Sort
    6.2.3 Evaluation of While
    6.2.4 Evaluation of If
    6.2.5 Evaluation of the Streaming API
  6.3 Level of Expressiveness
  6.4 Summary
7 Conclusion
  7.1 Summary
  7.2 Challenges
  7.3 Lessons Learned
  7.4 Future Work
A Setting up Hadoop so that it supports the implemented skeletons
  A.1 How to use the Skeletons library
    A.1.1 Alternative methods
  A.2 Examples of setting up Skeleton jobs
    A.2.1 Example of setting up a For job
    A.2.2 Example of setting up a Sort job
    A.2.3 Example of setting up an If job
    A.2.4 Example of setting up a Streaming job
B The API of the new package
  B.1 API of Skeleton Job
  B.2 API of For
  B.3 API of Sort
  B.4 API of While
  B.5 API of If
  B.6 API of Streaming
Bibliography

List of Figures

2.1 The MapReduce process in the form of a diagram [3]
2.2 The infrastructure of Hadoop [4]
4.1 The Design of Algorithmic Skeletons over Hadoop
6.1 Comparison between Skeleton For and MapReduce
6.2 Comparison between Skeleton Sort and MapReduce
6.3 Comparison between Skeleton While and MapReduce
6.4 Comparison between Skeleton If and MapReduce
6.5 Comparison of the Streaming, Non-Streaming and MapReduce implementations

List of Tables

5.1 Most important methods of the SkeletonJob API
5.2 Most important methods of the For API
5.3 Most important methods of the Sort API
5.4 Most important methods of the While API
5.5 Most important methods of the If API
5.6 Most important methods of the Streaming API
6.1 Execution time of the For and the equivalent MapReduce implementation
6.2 Execution time of the Sort and the equivalent MapReduce implementation
6.3 Execution time of the While and the equivalent MapReduce implementation
6.4 Execution time of the If and the equivalent MapReduce implementation
6.5 Execution time of the Streaming implementation
6.6 Comparison between MapReduce and Skeletons against lines of code
B.1 SkeletonJob API
B.2 For API
B.3 Sort API
B.4 While API
B.5 If API
B.6 Streaming API

Chapter 1

Introduction

In the past few years there has been a growing interest in parallel and distributed computing, mainly due to the vast amounts of data being produced. Most of the time, data is organized and stored in structured clusters of computers that may not even be in the same site. In any case, however, it needs to be processed in a quick and efficient way by various applications and for various reasons.

Algorithmic Skeletons [1] are an approach in which the complexity of parallel programming is abstracted away through a library of skeleton operations. Each skeleton captures a particular pattern of computation and interaction. Thus, it provides an interface for each pattern to the programmer without exposing the implementation of the pattern itself. This results in a far less complex and more efficient way to write programs in a parallel or a distributed environment. However, algorithmic skeletons have not been used broadly to this end, especially by commercial companies, but have rather remained more of an academic concept.

The MapReduce framework [2] can be considered an exception to the above. It was developed by Google in 2003 for processing large data sets and has grown in popularity since. While inspired by functions commonly used in functional programming, the MapReduce framework is not equivalent to those functions. MapReduce is used for processing petabytes of data stored across a large number of nodes organized over a distributed file system like Google's GFS or Hadoop's HDFS. A natural question, then, is why not implement additional skeleton operations over a MapReduce system like Hadoop and, more importantly, whether anything might be gained from doing so.

1.1 Motivation

The main reason why a significant number of skeleton operations exist is that each performs better and more efficiently on specific kinds of problems. For instance, when we have lines of numbers as input and we need to compute the average of each line, an algorithmic skeleton like Divide and Conquer is not useful at all, and others like MapReduce may prove inefficient due to redundant computation. In such a case a single Map or For operation will perform far more efficiently. In other words, the more naturally a problem fits the abstract pattern of a skeleton, the better that skeleton performs on it.

Hadoop is an open source implementation of Google's MapReduce system and, among other things, it offers a way to store large data sets in a distributed setting and schedule MapReduce jobs over them. Finding a way to offer additional parallel programming models over Hadoop, other than MapReduce, will help the user write far more efficient programs for certain kinds of problems. Moreover, another purpose of this project is to do this in a more user-friendly way than the original interface of MapReduce. Setting up a MapReduce job requires knowledge of the framework, so it would surely be useful if certain details could be hidden from the user when dealing with skeleton jobs. After all, algorithmic skeletons are all about hiding complex information from the user and providing him or her with a simple API for writing programs that run in parallel, without knowledge of the underlying infrastructure or of the process of actually setting up and configuring the job that is submitted to the framework.

1.2 Objectives

The key idea behind this project is to implement a selection of algorithmic skeletons over Hadoop. The main concept is to provide the users of Hadoop with more options regarding the way they can organize and parallelise the computations they need to perform over Big Data, beyond what the MapReduce programming framework offers. This will result in an improved data processing model with more capabilities. More specifically, four skeletons have been implemented: For, If, Sort and While. Additionally, an API for setting up streamed skeleton and/or MapReduce jobs as a pipeline has also been implemented. Implementing structured algorithmic skeletons over a MapReduce system like Hadoop is not a trivial task.

In order for this to be carried out, the implementation of MapReduce over Hadoop and HDFS must be examined. This will be the guide for designing and implementing additional skeleton patterns over Hadoop. As a result, before attempting to design the skeletons, comprehending the way MapReduce is organized and implemented may prove more than useful.

1.3 Structure of the Report

This chapter aimed to provide the reader with an understanding of the scope of this project. Moreover, it provided a description of the project's objectives and motivation. Chapter 2 provides information on related work within the fields of algorithmic skeletons and MapReduce. In Chapter 3, we give more details concerning the background of the concepts related to this project. More specifically, we give a short description of the Hadoop software framework. We also present the package org.apache.hadoop.mapreduce, which is the newest package of Hadoop that realises MapReduce. Moreover, we introduce to the reader the parallel algorithmic skeletons (For, Sort, While and If) that we have implemented for the purposes of this project. In Chapter 4 we describe the main ideas behind the design of the four skeletons over Hadoop and the interface for streaming Hadoop jobs, along with alternative approaches and the reasons behind the implementation choices. In this chapter we also mention issues that need addressing, together with ways of dealing with them. In Chapter 5 we present a detailed description of the implementation: low-level details, how issues were dealt with, what is offered to the user of Hadoop, and more. Chapter 6 is about evaluation. More specifically, we present the metrics used for evaluating the implementation of the four skeleton operations and the streaming API, along with the results and relevant comments. Finally, Chapter 7 concludes this thesis and this project. It offers the opportunity to summarize what was achieved and what is finally offered to the user of Hadoop. Information regarding possible future work is also given.

Chapter 2

On Algorithmic Skeletons and MapReduce

This project deals with a variety of computational concepts and systems. It is therefore necessary to introduce the reader to these concepts and outline the overall scope of our project. A significant amount of related work has already been carried out regarding both Algorithmic Skeletons and MapReduce, but none so far, to the best of our knowledge, has tried to provide a library of skeleton operations over a software framework that implements MapReduce.

2.1 Algorithmic Skeletons

Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. The main idea behind them is to provide the user with an abstract framework for programming in parallel and distributed settings, where many details regarding both the underlying architecture of the system and the implementation of the parallel pattern itself remain hidden. What separates Algorithmic skeletons from many other parallel programming models is that synchronization between the different tasks is defined by the skeleton itself and the programmer need not worry about it. They were introduced by Murray Cole in 1989 [1] as a way to offer the programmer a framework that appears non-parallel to him while its execution takes place in parallel. Cole presented four initial skeletons: divide and conquer, iterative combination, cluster and task queue. Afterwards, other research groups proposed additional algorithmic skeletons and developed many algorithmic skeleton frameworks based on different techniques such as functional, imperative and object-oriented languages.

2.2 MapReduce

MapReduce [2] can be described as a parallel programming model that serves for manipulating vast amounts of data. It has become rather popular over the past few years, mainly because it provides a simple interface with two functions for developing and implementing scalable parallel applications. Perhaps the major reason behind the great success of MapReduce is that it supports auto-parallelization of programs on large clusters of commodity machines. Moreover, its fault tolerance and scalability are invaluable as the size of the data that needs processing grows larger and larger. In essence, MapReduce is just one of many skeleton algorithms (a combination of the skeleton Map and the skeleton Reduce) implemented over a distributed infrastructure.

Figure 2.1: The MapReduce process in the form of a diagram. [3]

However, MapReduce comes with certain shortcomings, the primary reason being that its original purpose was not performing structured data analytics. What is more, this model does not support many features that would be useful for developers. Assertions have been made regarding the limitations, and thus the breadth of problems, that MapReduce can be used for. From a software engineering point of view, MapReduce systems lack features that other parallel algorithmic structures can offer to programmers.

2.2.1 Hadoop

Hadoop is a software framework that supports storing and processing large amounts of data. Basically, Hadoop can be considered the open source realisation of Google's MapReduce system and is used by a number of companies and organizations such as Yahoo and Facebook. It consists of a distributed, scalable and portable file system (HDFS), in which petabytes of data can be stored, and an implementation of the computational paradigm of MapReduce.

HDFS has a master/slave architecture, similar to that of Google's GFS. An HDFS cluster consists of a single NameNode, which is the master server and is responsible for managing the file system namespace and providing clients with access to the files stored in the cluster. Additionally, a number of DataNodes exist, usually one per node in the cluster, which manage the storage attached to the node or nodes that they run on.

Figure 2.2: The infrastructure of Hadoop. [4]

HDFS is designed to store and handle files larger than a few Gigabytes or even Petabytes, across the machines of a cluster. In addition, the files are replicated so as to offer a high level of fault tolerance. Every file is split into blocks of equal size, except the last block, and is then stored as a sequence of blocks. The files are written only once and by only one writer at any time. It has to be noted that the block size and the replication factor can be configured separately for every file.
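To illustrate this per-file configurability, the short sketch below (illustrative only, with a hypothetical path; it is not part of the skeletons library) uses the HDFS client API to choose the replication factor and the block size at file creation time.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettingsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Create one file with a replication factor of 2 and a 128 MB block size,
        // independently of the cluster-wide defaults.
        FSDataOutputStream out = fs.create(
                new Path("/user/example/data.txt"),  // hypothetical path
                true,                                // overwrite if it exists
                4096,                                // I/O buffer size in bytes
                (short) 2,                           // replication factor for this file only
                128L * 1024 * 1024);                 // block size for this file only
        out.writeBytes("example record\n");
        out.close();
    }
}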

Hadoop Common provides access to the file system supported by Hadoop. In short, the Hadoop Common package contains all the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community.

The MapReduce engine runs on top of the file system. It consists of one Job Tracker and multiple Task Trackers. When a user wants to submit a MapReduce job to the framework, his or her client submits it to the Job Tracker. The Job Tracker then pushes the necessary work to the available Task Tracker nodes. It is of the highest priority to move the computation to the data and not the other way around. This depends heavily upon the Job Tracker, which knows which node contains the data and which machines are nearby (it is preferable to use the node where the data resides or, if this is not possible, machines in the same rack). This results in reduced network traffic and more efficient jobs. By default Hadoop uses a first-in-first-out principle to schedule jobs from a work queue. In version 0.19 of Hadoop the job scheduler was refactored out of the core, which added the ability to use an alternative scheduler. Companies that use Hadoop took this opportunity to develop their own schedulers that better suit their needs. Most notably, Facebook developed the Fair scheduler and Yahoo the Capacity scheduler.

2.3 Skeletons over Hadoop

Hadoop's infrastructure can be used for developing more algorithmic skeletons over it. Providing developers with more parallel programming models, in a framework that is widely used, will of course strongly enhance it, but it will also allow programmers to implement applications that cover a broader range of problems and needs. This will ultimately result in a more expressive programming framework with more capabilities. There is great research interest in MapReduce, as it is a rather new technology and there is ground to further explore and optimize it. There are many papers and projects focused on enhancing MapReduce systems. These attempts mainly target the open source MapReduce system, Hadoop. This project aims at enhancing Hadoop by offering the user additional parallel programming models other than MapReduce. More specifically, four algorithmic skeletons are to be designed and implemented over Hadoop and HDFS, which will co-exist with MapReduce, along with a streaming mechanism for MapReduce and/or Skeleton jobs. The programmer will be able to choose between these programming models in order to implement his application, depending on the needs of the task at hand.

2.4 Related Work

2.4.1 Overview of Algorithmic Skeletons

Regarding Algorithmic Skeletons, there has been quite a lot of research and work at an academic level. For various reasons that are beyond the scope of this project, companies have not really taken an interest in using them in their applications. Most of the work that has been done focuses on providing programming frameworks that implement a number of algorithmic skeletons with generic parallel functionality, which can be used by the user to implement parallel programs [5].

There are three types of algorithmic skeletons: data-parallel, task-parallel and resolution [5]. Data-parallel skeletons operate on data structures. Task-parallel skeletons work on tasks and their functionality heavily depends on the interaction between the tasks. Resolution skeletons represent an algorithmic way of dealing with a given group of problems. Many frameworks have been developed that provide sets of algorithmic skeletons to users. These algorithmic skeleton frameworks (ASkFs) can be split into four categories according to their programming paradigm:

- Coordination ASkFs
- Functional ASkFs
- Object-oriented ASkFs
- Imperative ASkFs

ASSIST [6] can be classified as a coordination ASkF. Parallel programs are expressed as graphs of software modules by using a structured coordination language. The execution language is C++ and, while it supports type safety, it does not support skeleton nesting. The skeletons offered in ASSIST are seq and parmod. Skandium [7] is another ASkF that supports both data-parallel and task-parallel skeletons. It is a re-implementation of an older ASkF, Calcium, with multi-core computing in mind. The execution language is Java and both type safety and skeleton nesting are supported. For more information regarding the numerous ASkFs that have been developed, [5] provides a very good description of the most important ones, along with references to papers with further details.

2.4.2 Related work for MapReduce

Even though MapReduce systems are a relatively new idea (or, better, a new implementation of an old idea), there is a growing interest in them and a lot of effort from research groups aimed at enhancing them. The majority of these attempts are built on top of Hadoop, due to the fact that it is the most popular open source realization of MapReduce systems.

Hive is a data warehousing application in Hadoop. It was originally developed by Facebook a few years ago but is now open source [8], [9]. Hive organises the data in three ways which are analogous to well-known database concepts: tables, partitions and buckets. Hive provides a SQL-like query language called HiveQL (HQL) [8], [9]. HQL supports project, select, join, union, aggregate expressions and subqueries in the from-clause, like SQL. Hive translates HQL statements into a syntax tree. This syntax tree is then compiled into an execution plan of MapReduce jobs. Finally, these jobs are executed by Hadoop. It can be concluded that with Hive the developer has at his disposal a declarative query language which is close to SQL and supports quite a few functionalities that are essential for many data analytic jobs and are pretty repetitive [8], [9].

Pig, on the other hand, is a large scale dataflow system that is built on top of Hadoop. The idea behind Pig is similar to that of Hive. Pig programs are parsed and compiled into MapReduce jobs, which are then executed by the MapReduce framework on the cluster [10]. A Pig program goes through a number of intermediate stages before execution. First of all, it is parsed and checked for errors. A logical plan is produced, which is then optimized and compiled into a series of MapReduce jobs, after which it passes through another optimization phase. Finally, the jobs are sorted and submitted to Hadoop for execution [10]. The programs that are given as input to Pig are written in a specifically designed script language, Pig Latin. Pig Latin is a script programming language in which the user specifies a number of consecutive steps that implement a specific task. Each step is equivalent to a single, high level data transformation. This is different from the declarative approach of SQL, where only the constraints that define the final result are declared. As Pig Latin was developed with processing web-scale data in mind, only the parallelisable primitives were included in it [11].

Another extension of MapReduce is Twister [12], which aims to make MapReduce suitable for a wider range of applications. What the Twister runtime offers compared to similar MapReduce runtimes is the ability to support iterative MapReduce computations. Being a runtime itself, it is distinguished from Hadoop: the infrastructure is different but, most importantly, the programming model of MapReduce is extended so that it supports broadcast and scatter type data transfers. All the above make Twister far more efficient for iterative MapReduce computations. Although the architecture of Twister is different from that of Hadoop, on top of which this project will be built, the way in which the programming model of MapReduce is extended using communication concepts from parallel computing, like scatter and broadcast, so as to support iteration is quite interesting.

2.5 Summary

MapReduce has proven to be an extremely powerful tool for analysing and processing vast amounts of data. Hadoop, its open source realisation, is used by a significant number of companies, and it is no coincidence that large corporations like Yahoo, Amazon and Facebook, among others, use it for storing and analysing data. However, Hadoop offers a complex and detailed API for writing MapReduce jobs. It is evident that if a programmer wants to write a program to perform simple computations over large data sets, he has to write long, complex programs of many lines. It comes as no surprise that many projects that aim to offer a higher level API to facilitate large-data processing are under development or have been developed in the past few years. On the other hand, algorithmic skeletons have typically been provided as libraries in various frameworks. Their outstanding feature, that the synchronisation of the parallel activities is implicitly defined by the abstract skeleton patterns, helps programmers write parallel programs in an easier and, most importantly, sequential way. It is no surprise that providing a library of skeleton operations other than MapReduce over Hadoop may well prove valuable for its users. Especially if the provided skeleton operations are offered at a higher level than MapReduce, hiding many of its details, this will undoubtedly lead to a more expressive and easier to configure software framework.

Chapter 3

Background

Having provided all the necessary information regarding Algorithmic Skeletons, MapReduce and MapReduce's open source realization Hadoop, more details should be given about the scope of this project. This project aims to implement a number of skeleton operations and ultimately offer a library of skeleton operations to the user of Hadoop. To this end, a number of matters should be discussed. First and foremost, we must describe the package that implements MapReduce over Hadoop. The skeletons library will use this package to offer a level of indirection: in essence, when a skeleton operation is used, a MapReduce job will run underneath. This makes the study and understanding of the package more than important. Moreover, the algorithmic skeletons that will be implemented are presented thoroughly. Comprehending the pattern of each skeleton is the first step towards designing and implementing them on top of any software framework. The reader needs to understand the pattern that every single one of them offers before moving on to the following chapters, which give more details regarding far more complex aspects of the design and the implementation.

3.1 Hadoop's MapReduce Implementation

Before moving on to the description of how MapReduce is organised and functions in Hadoop [13], we should note that currently two different packages exist which provide the parallel algorithmic framework of MapReduce to the user of Hadoop: mapred and mapreduce. The differences between these two packages are beyond the scope of this report. However, it must be brought to attention that mapreduce is the newer implementation and is meant to completely replace mapred in the following releases.

As a result, the implementation of the skeleton operations is based on the most recent package of MapReduce. It is for this reason that we provide a description of the package mapreduce [14]. This description will aid the reader's comprehension of how the system works and will ultimately lead to a better understanding of the implementation of the parallel algorithmic skeletons, as classes of the mapreduce API are used.

A MapReduce job has two phases of operation. First, the input data is split into chunks which are processed by the Mappers in parallel. The output of the Mappers is sorted, grouped and then used as input for the Reducers. It should be noted that both the input and the output of a MapReduce job are stored in the distributed file system of Hadoop (HDFS), whereas the intermediate results of the Mappers are stored in the local file system of the Mappers. The framework takes care of scheduling tasks, monitoring them and re-executing the failed ones.

Now, let us take a closer look at the implementation. The MapReduce framework operates exclusively on Key,Value pairs; that is, the framework views the input to the job as a set of Key,Value pairs and produces a set of Key,Value pairs as the output of the job, conceivably of different types. The main classes of the API are: Job, Mapper and Reducer. The Job class is an extension of the class JobContext. It allows the user to configure and submit the job and offers him the ability to check and control the state of the execution. To this end, it contains certain set methods with which a user can configure the MapReduce job before it is submitted. Moreover, the classes Mapper and Reducer provide the API for creating the functions that realize the map and the reduce stages. They both contain an internal class Context that extends MapContext and ReduceContext respectively. Its purpose is to provide the context to the Mapper and the Reducer. A user who wants to create a MapReduce job needs to create a new instance of the class Job. Moreover, he has to create two new classes that extend the existing classes Mapper and Reducer. In the extended class of the Mapper he needs to override the method map according to the task he has to deal with, and in the extended class of the Reducer he overrides the method reduce depending on his needs. Finally, the programmer has to configure the Job instance he created with the two extended classes and submit the job.

3.1.1 Setting up a MapReduce job

Here is an example of a program that creates, configures and submits a MapReduce job for counting the number of occurrences of each individual word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3.2 Parallel For

Generally speaking, Parallel For [5], [7] represents finite iteration. The algorithmic skeleton of Parallel For is used when a user wants to apply some work to all the elements of a specific input. The input needs to be partitioned first so that the work can be parallelized; in essence, the work is done on the different partitions of the data in parallel. A piece of pseudo-code follows that describes the pattern of Parallel For.

int i = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> forSkel = new For<P, R>(nested, i);

3.3 Parallel Sort

Sorting is a well known concept in Computer Science. A significant number of algorithms have been introduced in the past decades, each of them serving its own purposes and designed for different needs. Hadoop, as an open source MapReduce system, deals with massive data sets that are stored on the various nodes of the cluster. It is rather likely that a user of the cluster will need to sort the data for his own purposes. For instance, it may well be the case that the user wants to find the maximum value of the data sets or sort some Text entries alphabetically. As mentioned above, a large number of algorithms exist for sorting. As far as parallel sorting [15] is concerned, the most well-known categories are the Bucket Sorts, the Exchange Sorts and the Partition Sorts. In the context of parallel sorting in Hadoop, we need not deal with any particular parallel sorting algorithm: in a MapReduce job the intermediate results of the Mappers are sorted according to the value of the key. It is only natural that, when designing a sort skeleton over Hadoop, this feature of MapReduce is going to be used. It should be pointed out that an implementation of sorting already exists in the Hadoop examples jar file. Nonetheless, we designed a different one for the purposes of this project so as to offer the additional parallel skeleton of Sort. The main difference is that the skeleton Sort that we implemented targets a more specific group of problems: it produces one sorted output file. Another difference between the two implementations is that the parallel sorting included in the examples jar file is based on the old implementation of MapReduce, mapred, whereas the implementation described in the following chapters is based on the new mapreduce package.

3.4 Parallel While

Parallel While [5], [7] represents conditional iteration, where a function (or possibly another skeleton) is applied to the data while a condition holds. This condition may or may not relate to the value that is read from the input. In Parallel While it may well be the case that the condition is checked more than once.

While the condition is true, a function is applied to the input data. However, the user may need to perform a different action when the condition returns false, which means that the loop has been exited. A piece of pseudo-code follows that describes the pattern of Parallel While.

Condition<P> condition = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> whileSkel = new While<P, R>(nested, condition);

From the pseudo-code it becomes obvious that the function or skeleton nested is executed while the condition is true. The data is partitioned and the processing takes place in parallel for each shard of input data. Perhaps the most important feature of Parallel While (and of Parallel If, which follows in the next section) is the definition of the Condition, for which we give details in the following sub-section.

3.4.1 Condition in Parallel While

It is evident that the condition that is to be checked in Parallel While could relate to the value read, but it could also be independent of it. Whichever the case, most likely the condition will be checked again and the data may have changed. As a result, a way is needed for storing the results of the function nested and providing them during the next step of the loop. The fact that the condition is checked more than once in the majority of cases is perhaps the key aspect that will guide the design and the implementation of this skeleton over Hadoop. It is only natural that, in the context of providing a more expressive framework that is also simple for the programmer to use, an abstract concept of condition will be provided, which it will be up to the user to implement according to his specific needs.
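To make the iterative structure concrete, the following fragment is only a sketch of how a driver could chain Hadoop jobs while such a condition holds; the Condition interface, class names and paths are illustrative assumptions and not the API of the skeletons library, which is presented in Appendix B.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WhileSketch {

    // Hypothetical user-supplied condition: it may inspect the current data or ignore it.
    public interface Condition {
        boolean check(Path currentData) throws Exception;
    }

    // Run the nested computation as a chain of jobs while the condition holds.
    public static Path runWhile(Configuration conf, Condition condition,
                                Class<? extends Mapper> nested,
                                Path input) throws Exception {
        int step = 0;
        while (condition.check(input)) {
            Path output = new Path(input.getParent(), "step" + step);
            Job iteration = new Job(conf, "while step " + step);
            iteration.setMapperClass(nested);   // the user's nested function
            iteration.setNumReduceTasks(0);     // apply the function only, no reduce phase
            FileInputFormat.addInputPath(iteration, input);
            FileOutputFormat.setOutputPath(iteration, output);
            iteration.waitForCompletion(true);
            input = output;                     // this step's results drive the next check
            step++;
        }
        return input;
    }
}

The essential point the sketch tries to capture is the one made above: each step must leave its results where the next iteration (and the next condition check) can find them.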

3.5 Parallel If

Parallel If [5], [7] can be described as conditional branching, where the choice of which computation to apply to which subset of the data is based solely on a condition specified by the user. In essence, for each data set of the input a condition is checked. This condition can be either about the data set itself or independent of it, according to the needs of the user. In any case, the input is split into various shards and, for the values contained in each shard, the condition is evaluated to determine whether it is met or not. Depending on the outcome (the result is either true or false), a different function is applied to the data set. Below is a simple and abstract description of this algorithmic skeleton.

Condition<P> Condition = ...;
Skeleton<P, R> TrueCase = ...;
Skeleton<P, R> FalseCase = ...;
Skeleton<P, R> If = new If<P, R>(Condition, TrueCase, FalseCase);

The parallelization of this model depends heavily on the fact that the input data is partitioned and the processing then occurs in parallel. The function TrueCase is executed if Condition returns true, whereas FalseCase is executed if Condition returns false. As per the description, it is possible that these two functions are themselves other skeletons, leading to skeleton nesting. What is more, as in the skeleton previously described (Parallel While), the Condition is of great importance and more details should be given regarding it.

3.5.1 Condition in Parallel If

The condition in Parallel If is quite similar to the one defined in Parallel While. Of course, an important difference distinguishes them. As we saw before, the condition in Parallel While is, by definition, checked more than once; after all, While in its essence provides the notion of a loop, so we may have to check the condition numerous times. On the other hand, in Parallel If the condition is checked strictly once. Depending on the outcome, a different function (or skeleton) is executed. This constitutes the major difference between the two skeletons: in Parallel While a single computation is executed numerous times (one or more), while in Parallel If one of two possible computations is executed exactly once. In the following chapter, we will describe how all these differences affect the design of these skeletons over the framework of Hadoop.
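The sketch below illustrates the branching idea with purely illustrative code (it is not the implementation of the If skeleton, which is described in Chapter 5): the condition is evaluated once for each input record, one simple interpretation of the per-shard check described above, and its outcome selects which of two computations is applied.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch: the condition and the two cases shown here are placeholders.
public class IfSketchMapper extends Mapper<Object, Text, Text, NullWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        // Evaluate the condition once; its outcome selects the computation.
        String result = condition(record) ? trueCase(record) : falseCase(record);
        context.write(new Text(result), NullWritable.get());
    }

    // Placeholder condition: does the record contain a comma?
    private boolean condition(String record) { return record.contains(","); }

    // Placeholder TrueCase computation.
    private String trueCase(String record) { return record.replace(",", ";"); }

    // Placeholder FalseCase computation.
    private String falseCase(String record) { return record; }
}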

3.6 Summary

This chapter introduced the reader to the implementation of MapReduce over Hadoop, along with the four skeleton operations that are to be included in the skeletons library. Understanding the MapReduce package of Hadoop is extremely important, as we will design the skeletons in a similar way. Furthermore, as it is our purpose to offer the skeleton operations at a higher level, the MapReduce package will be used underneath. In essence, one of the goals of this project is to offer another level of indirection above MapReduce, in the context of providing a library of algorithmic skeletons. The ultimate goal of this project, after all, is to enhance Hadoop so as to provide a more expressive software framework for processing large amounts of data, and both the implementation of the algorithmic skeletons and the streaming API serve this end.

Chapter 4

Design

4.1 Designing Parallel Skeletons over Hadoop

When providing another parallel algorithmic model over Hadoop, a way must be found to offer the programmer a new API that implements the new algorithmic skeleton using the existing API of MapReduce. A number of approaches exist for tackling this specific problem. For example, one may attempt to build an individual package that implements the algorithmic skeleton. Such a package would need to communicate with the distributed file system underneath and contain a number of low level methods to this end. This approach would require a deeper understanding of the whole Hadoop system. One may argue that the package of MapReduce already contains ways of communicating with HDFS. What is more, its functionality supports configuring, executing and monitoring MapReduce jobs. As a result, using parts of the existing functionality of MapReduce, and masking parts of it according to the task at hand, may prove an easier but, more importantly, a far more efficient way of implementing another parallel algorithmic framework over Hadoop. In addition, by following this approach we can further hide many of Hadoop's details, resulting in skeleton operations that offer a higher level of interaction with the framework. In a way, the user will have not only another programming model but also a less complex and easier to use one. This was the key idea which guided our implementation. More specifically, after careful inspection of the MapReduce API and implementation, we deduced a methodology for offering different parallel models using classes of the MapReduce API. As mentioned in the previous chapter, the main classes of the existing API are the following: Job, Mapper and Reducer. The class Job instantiates a MapReduce Job and has complete control over it.

What is required from the new API is a class that will extend the existing class Job, offer some of Job's functionality to the user, but coordinate a number of things internally. It is important that the new extended class fits the new model accordingly. This raises the question of which functionalities are to be hidden from the user. The answer to this question depends heavily on the specific parallel skeleton that is to be implemented. In the following sections, we give more details regarding the concepts behind the implementation of the algorithmic skeletons that we implemented over Hadoop. An indicative design of Algorithmic Skeletons over Hadoop is shown in Figure 4.1. In the context of this project, we followed this abstract design in order to implement the Algorithmic Skeletons we chose over Hadoop. This figure will provide the reader with a better understanding of the methodology we followed for accomplishing our objectives.

Figure 4.1: The Design of Algorithmic Skeletons over Hadoop.

4.2 Designing Parallel For over Hadoop

Parallel For is the application of a function to all the data elements of the input. This fact distinguishes it from MapReduce on certain key points. Firstly, there is no need for a reduce phase. As only a single computation is applied to the input, there is no sense in having an additional stage that does not perform any computation. Moreover, in a Parallel For no grouping or sorting should be performed on the output of the job.

It needs to be noted that in MapReduce the output of the map phase is sorted and grouped according to the key of the Key,Value pairs it produces as output. This needs to be avoided. Furthermore, the output of the Mappers is usually written to the local file system of the nodes; we require that the output of the job is written to HDFS. Lastly, the fact that the MapReduce framework operates on pairs must be taken into consideration. A Mapper takes a Key,Value pair as input and produces another Key,Value pair as output. In Parallel For we must find a way to hide the pairs and present the user with an easy way to handle and manipulate the input.

Having identified the key differences between a MapReduce job and a Parallel For job, we can now proceed to design the new parallel model using the existing one. In this part of the report we outline the ways that are best suited for dealing with the issues mentioned in the previous paragraph. To begin with, we need to remove the reduce phase from the new job class, an extension of the existing Job class, that we are to create. This can be done internally by setting the number of reduce tasks to zero and offering an API in which only a Mapper (soon to be renamed) can be declared by the user. Even though Parallel For's functionality is close to that of a Map, we need to offer a new class. This class will be an extension of the Mapper and it will hide the input and output pairs, offering the user a single input and output. Furthermore, the data types of Hadoop are of no use in the context of Parallel For, and thus both the input and the output will be manipulated as String data types. Additionally, by specifying zero reduce tasks, the output of the map phase is written to HDFS instead of the local file system of the nodes. The final obstacles that need to be dealt with are the grouping and sorting of the map output. Thankfully, the package of MapReduce has been designed in such a way that when no reduce tasks exist, the sorting and grouping of the map phase are omitted. Thus, all the issues have been dealt with. In Chapter 5, more details regarding the implementation of the new API are presented.
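The following sketch captures this design with illustrative names (the actual classes are listed in Appendix B): a map-only worker whose mapper hides the Key,Value pairs behind a single String-to-String operation.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch of the design described above; ForWorker is a hypothetical name.
public abstract class ForWorker extends Mapper<Object, Text, Text, NullWritable> {

    // The single operation the user provides: one input record in, one record out.
    protected abstract String work(String input);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The user never sees the Key,Value pairs, only plain Strings.
        context.write(new Text(work(value.toString())), NullWritable.get());
    }
}

On the driver side, the corresponding job class would internally call job.setNumReduceTasks(0), so that the map output is written directly to HDFS without being sorted or grouped.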

4.3 Designing Parallel Sort over Hadoop

Implementing Parallel Sort over Hadoop raises a number of issues that need to be addressed. First of all, all known data types should be supported: more specifically, the skeleton offered to the user of Hadoop should be able to sort integers, doubles, floats, strings and long integers. Moreover, the implementation must be able to sort both in descending and in ascending order. Recall that after the map phase the intermediate results are sorted in ascending order. This natural sorting of MapReduce is used for developing the Sort skeleton, but it is apparent that when sorting in descending order this feature of MapReduce needs tweaking. Finding the best possible ways of dealing with these issues ultimately leads to the best-fit implementation of Parallel Sort over Hadoop.

To begin with, there are two approaches for the skeleton to support the five data types. The dynamic approach is for the program to determine the data type at run time and select the relevant Mapper and Reducer for that type. The other approach can be considered static, as the main idea is to have separate pairs of Mappers and Reducers predefined for each data type; by supplying a parameter, the user defines the type of the data that is to be sorted. The final obstacle in designing an efficient and complete Sort skeleton is supporting sorting in descending order, besides sorting in ascending order. For the implementation to support descending order, we need to find a way to reverse the sorting of the intermediate results. In order to accomplish this, we have to look into the MapReduce package, deduce how the sorting occurs, and then add the functionality of sorting in descending order. Once these two issues are dealt with, all we have to do is create a class that the user can use to execute a sort over an input of his own. A MapReduce job must be internally set up and configured with the appropriate parameters. This MapReduce job will use the fitting pair of predefined Mappers and Reducers and perform the sorting, and the user will have the option to specify the output order. The predefined Mappers and Reducers need only perform their default functionality, which is to read the input pair and write it out; the sorting occurs in the intermediate phase, performed by the framework itself. What should be noted is that the Mappers should isolate the value upon which the sorting will occur and use it as the key of their own output pair. An important factor to take into consideration is that many sort algorithms already exist over Hadoop, so why implement a new one, or why use the skeleton Sort instead of another? The answer to these questions is that the skeleton Sort results in a single sorted output file, no matter what the input is or how many reducers there are. As a result, a final note is that in our implementation we must guarantee that the output of the skeleton Sort will reside in a single file. These details conclude the design of the parallel algorithmic skeleton of Sort over Hadoop.
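As a rough illustration of these points, the sketch below (illustrative names only; the thesis implementation is described in Chapter 5) shows one predefined Mapper/Reducer pair for long integers: the mapper emits the value to sort on as the key so that the framework's shuffle performs the sorting, a single reducer guarantees one sorted output file, and a decreasing comparator produces descending order.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortSketch {

    // The Mapper isolates the value to sort on and emits it as the key, so the
    // framework's shuffle phase does the actual sorting.
    public static class LongSortMapper extends Mapper<Object, Text, LongWritable, NullWritable> {
        private final LongWritable outKey = new LongWritable();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            outKey.set(Long.parseLong(value.toString().trim()));
            context.write(outKey, NullWritable.get());
        }
    }

    // The Reducer only writes the keys back out, once per occurrence, so duplicates survive.
    public static class LongSortReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "sort skeleton sketch");
        job.setJarByClass(SortSketch.class);
        job.setMapperClass(LongSortMapper.class);
        job.setReducerClass(LongSortReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);   // a single reducer guarantees one sorted output file
        // Descending order: replace the natural ascending comparator of the shuffle phase.
        job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}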


More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

2. MapReduce Programming Model

2. MapReduce Programming Model Introduction MapReduce was proposed by Google in a research paper: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System

More information

Java in MapReduce. Scope

Java in MapReduce. Scope Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Data Storage Infrastructure at Facebook

Data Storage Infrastructure at Facebook Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung Outline Strategy of data storage, processing, and log collection Data flow

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

MapReduce and Hadoop. The reference Big Data stack

MapReduce and Hadoop. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Lecture 1: Overview

Lecture 1: Overview 15-150 Lecture 1: Overview Lecture by Stefan Muller May 21, 2018 Welcome to 15-150! Today s lecture was an overview that showed the highlights of everything you re learning this semester, which also meant

More information

MapReduce for Parallel Computing

MapReduce for Parallel Computing MapReduce for Parallel Computing Amit Jain 1/44 Big Data, Big Disks, Cheap Computers In pioneer days they used oxen for heavy pulling, and when one ox couldn t budge a log, they didn t try to grow a larger

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

A Survey on Big Data

A Survey on Big Data A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce

More information

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon HBase vs Neo4j Technical overview Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon 12th October 2017 1 Contents 1 Introduction 3 2 Overview of HBase and Neo4j

More information

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,

More information

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

ExamTorrent.   Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

A Review Approach for Big Data and Hadoop Technology

A Review Approach for Big Data and Hadoop Technology International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 A Review Approach for Big Data and Hadoop Technology Prof. Ghanshyam Dhomse

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information