Implementing Algorithmic Skeletons over Hadoop


Implementing Algorithmic Skeletons over Hadoop

Dimitrios Mouzopoulos

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2011

Abstract

In the past few years, there has been a growing interest in storing and processing vast amounts of data that often exceed the petabyte scale. To that end, MapReduce, a computational paradigm that was introduced by Google in 2003, has become particularly popular. It provides a simple interface with two functions, map and reduce, for developing and implementing scalable parallel applications. The goal of this project is to enhance Hadoop, the open source implementation of Google's MapReduce and its accompanying distributed file system, so that it supports additional computational paradigms. By providing more parallel patterns to the user of Hadoop, we believe that the task of dealing with specific kinds of problems becomes simpler and easier. To this end, we present our design of four Algorithmic Skeletons over Hadoop. Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. We implement these skeleton operations and, along with a streaming mechanism, we offer them in a library of skeleton operations. Using these operations on problems that are a good fit for the abstract parallel processing pattern they encapsulate results in more concise and efficient programs.

Acknowledgements

Special thanks are dedicated to Dr. Stratis Viglas, my thesis supervisor, not only for his constant and meaningful guidance and his expert opinion on my project, but also for his thoughtful support during the development and completion of this thesis. I wish to acknowledge the work of the Apache Software Foundation and all the individuals who were involved in the implementation of Hadoop, as well as all the groups and researchers who contributed with their work to Algorithmic Skeletons. To my family, for their continuous support and encouragement during my studies.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Dimitrios Mouzopoulos)

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Structure of the Report
2 On Algorithmic Skeletons and MapReduce
  2.1 Algorithmic Skeletons
  2.2 MapReduce
    2.2.1 Hadoop
  2.3 Skeletons over Hadoop
  2.4 Related Work
    2.4.1 Overview of Algorithmic Skeletons
    2.4.2 Related work for MapReduce
  2.5 Summary
3 Background
  3.1 Hadoop's MapReduce Implementation
    3.1.1 Setting up a MapReduce job
  3.2 Parallel For
  3.3 Parallel Sort
  3.4 Parallel While
    3.4.1 Condition in Parallel While
  3.5 Parallel If
    3.5.1 Condition in Parallel If
  3.6 Summary
4 Design
  4.1 Designing Parallel Skeletons over Hadoop
  4.2 Designing Parallel For over Hadoop
  4.3 Designing Parallel Sort over Hadoop
  4.4 Designing Parallel While over Hadoop
  4.5 Designing Parallel If over Hadoop
  4.6 Designing a Streaming API for Parallel Skeletons over Hadoop
  4.7 Summary
5 Implementation
  5.1 General Guidelines for Implementing Algorithmic Skeletons over Hadoop
  5.2 Implementing Parallel For over Hadoop
  5.3 Implementing Parallel Sort over Hadoop
  5.4 Implementing Parallel While over Hadoop
  5.5 Implementing Parallel If over Hadoop
  5.6 Implementing a Streaming API for Parallel Skeletons over Hadoop
  5.7 Summary
6 Evaluation
  6.1 Information regarding the process of Evaluation
    6.1.1 Environment of Evaluation
    6.1.2 Input
  6.2 Execution Time
    6.2.1 Evaluation of For
    6.2.2 Evaluation of Sort
    6.2.3 Evaluation of While
    6.2.4 Evaluation of If
    6.2.5 Evaluation of the Streaming API
  6.3 Level of Expressiveness
  6.4 Summary
7 Conclusion
  7.1 Summary
  7.2 Challenges
  7.3 Lessons Learned
  7.4 Future Work
A Setting up Hadoop so that it supports the implemented skeletons
  A.1 How to use the Skeletons library
    A.1.1 Alternative methods
  A.2 Examples of setting up Skeleton jobs
    A.2.1 Example of setting up a For job
    A.2.2 Example of setting up a Sort job
    A.2.3 Example of setting up an If job
    A.2.4 Example of setting up a Streaming job
B The API of the new package
  B.1 API of Skeleton Job
  B.2 API of For
  B.3 API of Sort
  B.4 API of While
  B.5 API of If
  B.6 API of Streaming
Bibliography

List of Figures

2.1 The MapReduce process in the form of a diagram [3]
2.2 The infrastructure of Hadoop [4]
4.1 The Design of Algorithmic Skeletons over Hadoop
6.1 Comparison between Skeleton For and MapReduce
6.2 Comparison between Skeleton Sort and MapReduce
6.3 Comparison between Skeleton While and MapReduce
6.4 Comparison between Skeleton If and MapReduce
6.5 Comparison of the Streaming, Non-Streaming and MapReduce implementations

List of Tables

5.1 Most important methods of the SkeletonJob API
5.2 Most important methods of the For API
5.3 Most important methods of the Sort API
5.4 Most important methods of the While API
5.5 Most important methods of the If API
5.6 Most important methods of the Streaming API
6.1 Execution time of the For and the equivalent MapReduce implementation
6.2 Execution time of the Sort and the equivalent MapReduce implementation
6.3 Execution time of the While and the equivalent MapReduce implementation
6.4 Execution time of the If and the equivalent MapReduce implementation
6.5 Execution time of the Streaming implementation
6.6 Comparison between MapReduce and Skeletons against lines of code
B.1 SkeletonJob API
B.2 For API
B.3 Sort API
B.4 While API
B.5 If API
B.6 Streaming API

Chapter 1

Introduction

In the past few years there has been a growing interest in parallel and distributed computing, mainly due to the vast amounts of data being produced. Most of the time, data is organized and stored in structured clusters of computers that may not even be in the same site. In any case, however, it needs to be processed in a quick and efficient way by various applications and for various reasons.

Algorithmic Skeletons [1] are an approach in which the complexity of parallel programming is abstracted away through a library of skeleton operations. Each skeleton captures a particular pattern of computation and interaction. Thus, it provides an interface for each pattern to the programmer without exposing the implementation of the pattern itself. This results in a far less complex and more efficient way to write programs in a parallel or a distributed environment. However, algorithmic skeletons have not been used broadly to this end, especially by commercial companies, but have rather remained more of an academic concept.

The MapReduce framework [2] can be considered an exception to the above. It was developed by Google in 2003 for processing large data sets and has grown in popularity since. While inspired by functions commonly used in functional programming, the MapReduce framework is not equivalent to those functions. MapReduce is used for processing petabytes of data stored across a large number of nodes organized over a distributed file system like Google's GFS or Hadoop's HDFS. A natural question, then, is why not implement additional skeleton operations over a MapReduce system like Hadoop and, more importantly, whether anything might be gained from doing so.

1.1 Motivation

The main reason why a significant number of skeleton operations exist is that each performs better and more efficiently on specific kinds of problems. For instance, when we have lines of numbers as input and we need to compute the average of each line, an algorithmic skeleton like Divide and Conquer is not useful at all, and others like MapReduce may prove inefficient due to redundant computation. In such a case a single Map or For operation will perform far more efficiently. In other words, the more naturally a problem fits the abstract pattern of a skeleton, the better that skeleton performs on it.

Hadoop is an open source implementation of Google's MapReduce system and, among other things, it offers a way to store large data sets in a distributed setting and schedule MapReduce jobs over them. Finding a way to offer additional parallel programming models over Hadoop, other than MapReduce, will help the user write far more efficient programs for certain kinds of problems. Moreover, another purpose of this project is to do this in a more user-friendly way than the original interface of MapReduce. Setting up a MapReduce job requires knowledge of the framework, so it would surely be useful if certain details could be hidden from the user when dealing with skeleton jobs. After all, algorithmic skeletons are all about hiding complex information from the user and providing him or her with a simple API for writing programs that run in parallel, without knowledge of the underlying infrastructure or of the process of actually setting up and configuring the job that is submitted to the framework.

1.2 Objectives

The key idea behind this project is to implement a selection of algorithmic skeletons over Hadoop. The main concept is to provide the users of Hadoop with more options regarding the way they can organize and parallelise the computations they need to perform over Big Data, beyond what the MapReduce programming framework offers. This will result in an improved data processing model with more capabilities. More specifically, four skeletons have been implemented: For, If, Sort and While. Additionally, an API for setting up streamed skeleton and/or MapReduce jobs as a pipeline has also been implemented. Implementing structured algorithmic skeletons over a MapReduce system like Hadoop is not a trivial task.

In order for this to be carried out, the implementation of MapReduce over Hadoop and HDFS must be examined. This will be the guide for designing and implementing additional skeleton patterns over Hadoop. As a result, before attempting to design the skeletons, comprehending the way MapReduce is organized and implemented may prove more than useful.

1.3 Structure of the Report

This chapter aimed to provide the reader with an understanding of the scope of this project. Moreover, it provided a description of the project's objectives and motivation. Chapter 2 provides information on related work within the fields of algorithmic skeletons and MapReduce. In Chapter 3, we give more details concerning the background of the concepts related to this project. More specifically, we give a short description of the Hadoop software framework. We also present the package org.apache.hadoop.mapreduce, which is the newest package of Hadoop that realises MapReduce. Moreover, we introduce to the reader the parallel algorithmic skeletons (For, Sort, While and If) that we have implemented for the purposes of this project. In Chapter 4 we describe the main ideas behind the design of the four skeletons over Hadoop and the interface for streaming Hadoop jobs, along with alternative approaches and the reasons behind the implementation choices. In this chapter we also mention issues that need addressing, together with ways of dealing with them. In Chapter 5 we present a detailed description of the implementation: low-level details, how issues were dealt with, what is offered to the user of Hadoop, and more. Chapter 6 is about evaluation. More specifically, we present the metrics used for evaluating the implementation of the four skeleton operations and the streaming API, along with the results and relevant comments. Finally, Chapter 7 concludes this thesis and this project. It offers the opportunity to summarize what was achieved and what is finally offered to the user of Hadoop. Information regarding possible future work is also given.

Chapter 2

On Algorithmic Skeletons and MapReduce

This project deals with a variety of computational concepts and systems. It is therefore necessary to introduce the reader to these concepts and outline the overall scope of our project. A significant amount of related work has already been carried out regarding both Algorithmic Skeletons and MapReduce, but none so far, to the best of our knowledge, has tried to provide a library of skeleton operations over a software framework that implements MapReduce.

2.1 Algorithmic Skeletons

Algorithmic skeletons are structured parallel programming models that allow programmers to develop applications over parallel and distributed systems. The main idea behind them is to provide the user with an abstract framework for programming in parallel and distributed settings, where many details regarding both the underlying architecture of the system and the implementation of the parallel pattern itself remain hidden. What separates Algorithmic skeletons from many other parallel programming models is that synchronization between the different tasks is defined by the skeleton itself and the programmer need not worry about it. They were introduced by Murray Cole in 1989 [1] as a way to offer the programmer a framework that appears non-parallel to him while its execution takes place in parallel. Cole presented four initial skeletons: divide and conquer, iterative combination, cluster and task queue. Afterwards, other research groups proposed additional algorithmic skeletons and developed many algorithmic skeleton frameworks based on different techniques such as functional, imperative and object-oriented languages.

2.2 MapReduce

MapReduce [2] can be described as a parallel programming model that serves for manipulating vast amounts of data. It has become rather popular over the past few years, mainly because it provides a simple interface with two functions for developing and implementing scalable parallel applications. Perhaps the major reason behind the great success of MapReduce is that it supports auto-parallelization of programs on large clusters of commodity machines. Moreover, its fault tolerance and scalability are invaluable as the size of the data that needs processing grows larger and larger. In essence, MapReduce is just one of many skeleton algorithms (a combination of the skeleton Map and the skeleton Reduce) implemented over a distributed infrastructure.

Figure 2.1: The MapReduce process in the form of a diagram. [3]

However, MapReduce comes with certain shortcomings, the primary reason being that its original purpose was not performing structured data analytics. What is more, this model does not support many features that would be useful for developers. Assertions have been made regarding the limitations, and thus the breadth of problems, that MapReduce can be used for. From a software engineering point of view, MapReduce systems lack features that other parallel algorithmic structures can offer to programmers.

2.2.1 Hadoop

Hadoop is a software framework that supports storing and processing large amounts of data. Basically, Hadoop can be considered the open source realisation of Google's MapReduce system and is used by a number of companies and organizations such as Yahoo and Facebook. It consists of a distributed, scalable and portable file system (HDFS), in which petabytes of data can be stored, and an implementation of the computational paradigm of MapReduce.

HDFS has a master/slave architecture, similar to that of Google's GFS. An HDFS cluster consists of a single NameNode, which is the master server and is responsible for managing the file system namespace and providing clients with access to the files stored in the cluster. Additionally, a number of DataNodes exist, usually one per node in the cluster, which manage the storage attached to the node or nodes that they run on.

Figure 2.2: The infrastructure of Hadoop. [4]

HDFS is designed to store and handle files larger than a few Gigabytes or even Petabytes, across the machines of a cluster. In addition, the files are replicated so as to offer a high level of fault tolerance. Every file is split into blocks of equal size, except the last block, and is then stored as a sequence of blocks. The files are written only once and by only one writer at any time. It has to be noted that the block size and the replication factor can be configured separately for every file.
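To illustrate this per-file configurability, the short sketch below (illustrative only, with a hypothetical path; it is not part of the skeletons library) uses the HDFS client API to choose the replication factor and the block size at file creation time.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettingsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Create one file with a replication factor of 2 and a 128 MB block size,
        // independently of the cluster-wide defaults.
        FSDataOutputStream out = fs.create(
                new Path("/user/example/data.txt"),  // hypothetical path
                true,                                // overwrite if it exists
                4096,                                // I/O buffer size in bytes
                (short) 2,                           // replication factor for this file only
                128L * 1024 * 1024);                 // block size for this file only
        out.writeBytes("example record\n");
        out.close();
    }
}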

Hadoop Common provides access to the file system supported by Hadoop. In short, the Hadoop Common package contains all the necessary JAR files and scripts needed to start Hadoop. The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community.

The MapReduce engine runs on top of the file system. It consists of one Job Tracker and multiple Task Trackers. When a user wants to submit a MapReduce job to the framework, his or her client submits it to the Job Tracker. The Job Tracker then pushes the necessary work to the available Task Tracker nodes. It is of the highest priority to move the computation to the data and not the other way around. This depends heavily upon the Job Tracker, which knows which node contains the data and which machines are nearby (it is preferable to use the node where the data resides or, if this is not possible, machines in the same rack). This results in reduced network traffic and more efficient jobs. By default Hadoop uses a first-in-first-out principle to schedule jobs from a work queue. In version 0.19 of Hadoop the job scheduler was refactored out of the core, which added the ability to use an alternative scheduler. Companies that use Hadoop took this opportunity to develop their own schedulers that better suit their needs. Most notably, Facebook developed the Fair scheduler and Yahoo the Capacity scheduler.

2.3 Skeletons over Hadoop

Hadoop's infrastructure can be used for developing more algorithmic skeletons over it. Providing developers with more parallel programming models, in a framework that is widely used, will of course strongly enhance it, but it will also allow programmers to implement applications that cover a broader range of problems and needs. This will ultimately result in a more expressive programming framework with more capabilities. There is great research interest in MapReduce, as it is a rather new technology and there is ground to further explore and optimize it. There are many papers and projects focused on enhancing MapReduce systems. These attempts mainly target the open source MapReduce system, Hadoop. This project aims at enhancing Hadoop by offering the user additional parallel programming models other than MapReduce. More specifically, four algorithmic skeletons are to be designed and implemented over Hadoop and HDFS, which will co-exist with MapReduce, along with a streaming mechanism for MapReduce and/or Skeleton jobs. The programmer will be able to choose between these programming models in order to implement his application, depending on the needs of the task at hand.

2.4 Related Work

2.4.1 Overview of Algorithmic Skeletons

Regarding Algorithmic Skeletons, there has been quite a lot of research and work at an academic level. For various reasons that are beyond the scope of this project, companies have not really taken an interest in using them in their applications. Most of the work that has been done focuses on providing programming frameworks that implement a number of algorithmic skeletons with generic parallel functionality, which can be used by the user to implement parallel programs [5].

There are three types of algorithmic skeletons: data-parallel, task-parallel and resolution [5]. Data-parallel skeletons operate on data structures. Task-parallel skeletons work on tasks and their functionality heavily depends on the interaction between the tasks. Resolution skeletons represent an algorithmic way of dealing with a given group of problems. Many frameworks have been developed that provide sets of algorithmic skeletons to users. These algorithmic skeleton frameworks (ASkFs) can be split into four categories according to their programming paradigm:

- Coordination ASkFs
- Functional ASkFs
- Object-oriented ASkFs
- Imperative ASkFs

ASSIST [6] can be classified as a coordination ASkF. Parallel programs are expressed as graphs of software modules by using a structured coordination language. The execution language is C++ and, while it supports type safety, it does not support skeleton nesting. The skeletons offered in ASSIST are seq and parmod. Skandium [7] is another ASkF that supports both data-parallel and task-parallel skeletons. It is a re-implementation of an older ASkF, Calcium, with multi-core computing in mind. The execution language is Java and both type safety and skeleton nesting are supported. For more information regarding the numerous ASkFs that have been developed, [5] provides a very good description of the most important ones, along with references to papers with further details.

2.4.2 Related work for MapReduce

Even though MapReduce systems are a relatively new idea (or, better, a new implementation of an old idea), there is a growing interest in them and a lot of effort from research groups aimed at enhancing them. The majority of these attempts are built on top of Hadoop, due to the fact that it is the most popular open source realization of MapReduce systems.

Hive is a data warehousing application in Hadoop. It was originally developed by Facebook a few years ago but is now open source [8], [9]. Hive organises the data in three ways which are analogous to well-known database concepts: tables, partitions and buckets. Hive provides a SQL-like query language called HiveQL (HQL) [8], [9]. HQL supports project, select, join, union, aggregate expressions and subqueries in the from-clause, like SQL. Hive translates HQL statements into a syntax tree. This syntax tree is then compiled into an execution plan of MapReduce jobs. Finally, these jobs are executed by Hadoop. It can be concluded that with Hive the developer has at his disposal a declarative query language which is close to SQL and supports quite a few functionalities that are essential for many data analytic jobs and are pretty repetitive [8], [9].

Pig, on the other hand, is a large scale dataflow system that is built on top of Hadoop. The idea behind Pig is similar to that of Hive. Pig programs are parsed and compiled into MapReduce jobs, which are then executed by the MapReduce framework on the cluster [10]. A Pig program goes through a number of intermediate stages before execution. First of all, it is parsed and checked for errors. A logical plan is produced, which is then optimized and compiled into a series of MapReduce jobs, after which it passes through another optimization phase. Finally, the jobs are sorted and submitted to Hadoop for execution [10]. The programs that are given as input to Pig are written in a specifically designed script language, Pig Latin. Pig Latin is a script programming language in which the user specifies a number of consecutive steps that implement a specific task. Each step is equivalent to a single, high level data transformation. This is different from the declarative approach of SQL, where only the constraints that define the final result are declared. As Pig Latin was developed with processing web-scale data in mind, only the parallelisable primitives were included in it [11].

Another extension of MapReduce is Twister [12], which aims to make MapReduce suitable for a wider range of applications. What the Twister runtime offers compared to similar MapReduce runtimes is the ability to support iterative MapReduce computations. Being a runtime itself, it is distinguished from Hadoop: the infrastructure is different but, most importantly, the programming model of MapReduce is extended so that it supports broadcast and scatter type data transfers. All the above make Twister far more efficient for iterative MapReduce computations. Although the architecture of Twister is different from that of Hadoop, on top of which this project will be built, the way in which the programming model of MapReduce is extended using communication concepts from parallel computing, like scatter and broadcast, so as to support iteration is quite interesting.

2.5 Summary

MapReduce has proven to be an extremely powerful tool for analysing and processing vast amounts of data. Hadoop, its open source realisation, is used by a significant number of companies, and it is no coincidence that large corporations like Yahoo, Amazon and Facebook, among others, use it for storing and analysing data. However, Hadoop offers a complex and detailed API for writing MapReduce jobs. It is evident that if a programmer wants to write a program to perform simple computations over large data sets, he has to write long, complex programs of many lines. It comes as no surprise that many projects that aim to offer a higher level API to facilitate large-data processing are under development or have been developed in the past few years. On the other hand, algorithmic skeletons have typically been provided as libraries in various frameworks. Their outstanding feature, that the synchronisation of the parallel activities is implicitly defined by the abstract skeleton patterns, helps programmers write parallel programs in an easier and, most importantly, sequential way. It is no surprise that providing a library of skeleton operations other than MapReduce over Hadoop may well prove valuable for its users. Especially if the provided skeleton operations are offered at a higher level than MapReduce, hiding many of its details, this will undoubtedly lead to a more expressive and easier to configure software framework.

Chapter 3

Background

Having provided all the necessary information regarding Algorithmic Skeletons, MapReduce and MapReduce's open source realization Hadoop, more details should be given about the scope of this project. This project aims to implement a number of skeleton operations and ultimately offer a library of skeleton operations to the user of Hadoop. To this end, a number of matters should be discussed. First and foremost, we must describe the package that implements MapReduce over Hadoop. The skeletons library will use this package to offer a level of indirection: in essence, when a skeleton operation is used, a MapReduce job will run underneath. This makes the study and understanding of the package more than important. Moreover, the algorithmic skeletons that will be implemented are presented thoroughly. Comprehending the pattern of each skeleton is the first step towards designing and implementing them on top of any software framework. The reader needs to understand the pattern that every single one of them offers before moving on to the following chapters, which give more details regarding far more complex aspects of the design and the implementation.

3.1 Hadoop's MapReduce Implementation

Before moving on to the description of how MapReduce is organised and functions in Hadoop [13], we should note that currently two different packages exist which provide the parallel algorithmic framework of MapReduce to the user of Hadoop: mapred and mapreduce. The differences between these two packages are beyond the scope of this report. However, it must be brought to attention that mapreduce is the newer implementation and is meant to completely replace mapred in the following releases.

As a result, the implementation of the skeleton operations is based on the most recent package of MapReduce. It is for this reason that we provide a description of the package mapreduce [14]. This description will aid the reader's comprehension of how the system works and will ultimately lead to a better understanding of the implementation of the parallel algorithmic skeletons, as classes of the mapreduce API are used.

A MapReduce job has two phases of operation. First, the input data is split into chunks which are processed by the Mappers in parallel. The output of the Mappers is sorted, grouped and then used as input for the Reducers. It should be noted that both the input and the output of a MapReduce job are stored in the distributed file system of Hadoop (HDFS), whereas the intermediate results of the Mappers are stored in the local file system of the Mappers. The framework takes care of scheduling tasks, monitoring them and re-executing the failed ones.

Now, let us take a closer look at the implementation. The MapReduce framework operates exclusively on Key,Value pairs; that is, the framework views the input to the job as a set of Key,Value pairs and produces a set of Key,Value pairs as the output of the job, conceivably of different types. The main classes of the API are: Job, Mapper and Reducer. The Job class is an extension of the class JobContext. It allows the user to configure and submit the job and offers him the ability to check and control the state of the execution. To this end, it contains certain set methods with which a user can configure the MapReduce job before it is submitted. Moreover, the classes Mapper and Reducer provide the API for creating the functions that realize the map and the reduce stages. They both contain an internal class Context that extends MapContext and ReduceContext respectively. Its purpose is to provide the context to the Mapper and the Reducer. A user who wants to create a MapReduce job needs to create a new instance of the class Job. Moreover, he has to create two new classes that extend the existing classes Mapper and Reducer. In the extended class of the Mapper he needs to override the method map according to the task he has to deal with, and in the extended class of the Reducer he overrides the method reduce depending on his needs. Finally, the programmer has to configure the Job instance he created with the two extended classes and submit the job.

3.1.1 Setting up a MapReduce job

Here is an example of a program that creates, configures and submits a MapReduce job for counting the number of occurrences of each individual word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3.2 Parallel For

Generally speaking, Parallel For [5], [7] represents finite iteration. The algorithmic skeleton of Parallel For is used when a user wants to apply some work to all the elements of a specific input. The input needs to be partitioned first so that the work can be parallelized; in essence, the work is done on the different partitions of the data in parallel. A piece of pseudo-code follows that describes the pattern of Parallel For.

int i = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> forSkel = new For<P, R>(nested, i);

3.3 Parallel Sort

Sorting is a well known concept in Computer Science. A significant number of algorithms have been introduced in the past decades, each of them serving its own purposes and designed for different needs. Hadoop, as an open source MapReduce system, deals with massive data sets that are stored on the various nodes of the cluster. It is rather likely that a user of the cluster will need to sort the data for his own purposes. For instance, it may well be the case that the user wants to find the maximum value of the data sets or sort some Text entries alphabetically. As mentioned above, a large number of algorithms exist for sorting. As far as parallel sorting [15] is concerned, the most well-known categories are the Bucket Sorts, the Exchange Sorts and the Partition Sorts. In the context of parallel sorting in Hadoop, we need not deal with any particular parallel sorting algorithm: in a MapReduce job the intermediate results of the Mappers are sorted according to the value of the key. It is only natural that, when designing a sort skeleton over Hadoop, this feature of MapReduce is going to be used. It should be pointed out that an implementation of sorting already exists in the Hadoop examples jar file. Nonetheless, we designed a different one for the purposes of this project so as to offer the additional parallel skeleton of Sort. The main difference is that the skeleton Sort that we implemented targets a more specific group of problems: it produces one sorted output file. Another difference between the two implementations is that the parallel sorting included in the examples jar file is based on the old implementation of MapReduce, mapred, whereas the implementation described in the following chapters is based on the new mapreduce package.

3.4 Parallel While

Parallel While [5], [7] represents conditional iteration, where a function (or possibly another skeleton) is applied to the data while a condition holds. This condition may or may not relate to the value that is read from the input. In Parallel While it may well be the case that the condition is checked more than once.

While the condition is true, a function is applied to the input data. However, the user may need to perform a different action when the condition returns false, which means that the loop has been exited. A piece of pseudo-code follows that describes the pattern of Parallel While.

Condition<P> condition = ...;
Skeleton<P, R> nested = ...;
Skeleton<P, R> whileSkel = new While<P, R>(nested, condition);

From the pseudo-code it becomes obvious that the function or skeleton nested is executed while the condition is true. The data is partitioned and the processing takes place in parallel for each shard of input data. Perhaps the most important feature of Parallel While (and of Parallel If, which follows in the next section) is the definition of the Condition, for which we give details in the following sub-section.

3.4.1 Condition in Parallel While

It is evident that the condition that is to be checked in Parallel While could relate to the value read, but it could also be independent of it. Whichever the case, most likely the condition will be checked again and the data may have changed. As a result, a way is needed for storing the results of the function nested and providing them during the next step of the loop. The fact that the condition is checked more than once in the majority of cases is perhaps the key aspect that will guide the design and the implementation of this skeleton over Hadoop. It is only natural that, in the context of providing a more expressive framework that is also simple for the programmer to use, an abstract concept of condition will be provided, which it will be up to the user to implement according to his specific needs.
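To make the iterative structure concrete, the following fragment is only a sketch of how a driver could chain Hadoop jobs while such a condition holds; the Condition interface, class names and paths are illustrative assumptions and not the API of the skeletons library, which is presented in Appendix B.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WhileSketch {

    // Hypothetical user-supplied condition: it may inspect the current data or ignore it.
    public interface Condition {
        boolean check(Path currentData) throws Exception;
    }

    // Run the nested computation as a chain of jobs while the condition holds.
    public static Path runWhile(Configuration conf, Condition condition,
                                Class<? extends Mapper> nested,
                                Path input) throws Exception {
        int step = 0;
        while (condition.check(input)) {
            Path output = new Path(input.getParent(), "step" + step);
            Job iteration = new Job(conf, "while step " + step);
            iteration.setMapperClass(nested);   // the user's nested function
            iteration.setNumReduceTasks(0);     // apply the function only, no reduce phase
            FileInputFormat.addInputPath(iteration, input);
            FileOutputFormat.setOutputPath(iteration, output);
            iteration.waitForCompletion(true);
            input = output;                     // this step's results drive the next check
            step++;
        }
        return input;
    }
}

The essential point the sketch tries to capture is the one made above: each step must leave its results where the next iteration (and the next condition check) can find them.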

3.5 Parallel If

Parallel If [5], [7] can be described as conditional branching, where the choice of which computation to apply to which subset of the data is based solely on a condition specified by the user. In essence, for each data set of the input a condition is checked. This condition can be either about the data set itself or independent of it, according to the needs of the user. In any case, the input is split into various shards and, for the values contained in each shard, the condition is evaluated to determine whether it is met or not. Depending on the outcome (the result is either true or false), a different function is applied to the data set. Below is a simple and abstract description of this algorithmic skeleton.

Condition<P> Condition = ...;
Skeleton<P, R> TrueCase = ...;
Skeleton<P, R> FalseCase = ...;
Skeleton<P, R> If = new If<P, R>(Condition, TrueCase, FalseCase);

The parallelization of this model depends heavily on the fact that the input data is partitioned and the processing then occurs in parallel. The function TrueCase is executed if Condition returns true, whereas FalseCase is executed if Condition returns false. As per the description, it is possible that these two functions are themselves other skeletons, leading to skeleton nesting. What is more, as in the skeleton previously described (Parallel While), the Condition is of great importance and more details should be given regarding it.

3.5.1 Condition in Parallel If

The condition in Parallel If is quite similar to the one defined in Parallel While. Of course, an important difference distinguishes them. As we saw before, the condition in Parallel While is, by definition, checked more than once; after all, While in its essence provides the notion of a loop, so we may have to check the condition numerous times. On the other hand, in Parallel If the condition is checked strictly once. Depending on the outcome, a different function (or skeleton) is executed. This constitutes the major difference between the two skeletons: in Parallel While a single computation is executed numerous times (one or more), while in Parallel If one of two possible computations is executed exactly once. In the following chapter, we will describe how all these differences affect the design of these skeletons over the framework of Hadoop.
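The sketch below illustrates the branching idea with purely illustrative code (it is not the implementation of the If skeleton, which is described in Chapter 5): the condition is evaluated once for each input record, one simple interpretation of the per-shard check described above, and its outcome selects which of two computations is applied.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch: the condition and the two cases shown here are placeholders.
public class IfSketchMapper extends Mapper<Object, Text, Text, NullWritable> {

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String record = value.toString();
        // Evaluate the condition once; its outcome selects the computation.
        String result = condition(record) ? trueCase(record) : falseCase(record);
        context.write(new Text(result), NullWritable.get());
    }

    // Placeholder condition: does the record contain a comma?
    private boolean condition(String record) { return record.contains(","); }

    // Placeholder TrueCase computation.
    private String trueCase(String record) { return record.replace(",", ";"); }

    // Placeholder FalseCase computation.
    private String falseCase(String record) { return record; }
}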

3.6 Summary

This chapter introduced the reader to the implementation of MapReduce over Hadoop, along with the four skeleton operations that are to be included in the skeletons library. Understanding the MapReduce package of Hadoop is extremely important, as we will design the skeletons in a similar way. Furthermore, as it is our purpose to offer the skeleton operations at a higher level, the MapReduce package will be used underneath. In essence, one of the goals of this project is to offer another level of indirection above MapReduce, in the context of providing a library of algorithmic skeletons. The ultimate goal of this project, after all, is to enhance Hadoop so as to provide a more expressive software framework for processing large amounts of data, and both the implementation of the algorithmic skeletons and the streaming API serve this end.

Chapter 4

Design

4.1 Designing Parallel Skeletons over Hadoop

When providing another parallel algorithmic model over Hadoop, a way must be found to offer the programmer a new API that implements the new algorithmic skeleton using the existing API of MapReduce. A number of approaches exist for tackling this specific problem. For example, one may attempt to build an individual package that implements the algorithmic skeleton. Such a package would need to communicate with the distributed file system underneath and contain a number of low level methods to this end. This approach would require a deeper understanding of the whole Hadoop system. One may argue that the package of MapReduce already contains ways of communicating with HDFS. What is more, its functionality supports configuring, executing and monitoring MapReduce jobs. As a result, using parts of the existing functionality of MapReduce, and masking parts of it according to the task at hand, may prove an easier but, more importantly, a far more efficient way of implementing another parallel algorithmic framework over Hadoop. In addition, by following this approach we can further hide many of Hadoop's details, resulting in skeleton operations that offer a higher level of interaction with the framework. In a way, the user will have not only another programming model but also a less complex and easier to use one. This was the key idea which guided our implementation. More specifically, after careful inspection of the MapReduce API and implementation, we deduced a methodology for offering different parallel models using classes of the MapReduce API. As mentioned in the previous chapter, the main classes of the existing API are the following: Job, Mapper and Reducer. The class Job instantiates a MapReduce Job and has complete control over it.

What is required from the new API is a class that will extend the existing class Job, offer some of Job's functionality to the user, but coordinate a number of things internally. It is important that the new extended class fits the new model accordingly. This raises the question of which functionalities are to be hidden from the user. The answer to this question depends heavily on the specific parallel skeleton that is to be implemented. In the following sections, we give more details regarding the concepts behind the implementation of the algorithmic skeletons that we implemented over Hadoop. An indicative design of Algorithmic Skeletons over Hadoop is shown in Figure 4.1. In the context of this project, we followed this abstract design in order to implement the Algorithmic Skeletons we chose over Hadoop. This figure will provide the reader with a better understanding of the methodology we followed for accomplishing our objectives.

Figure 4.1: The Design of Algorithmic Skeletons over Hadoop.

4.2 Designing Parallel For over Hadoop

Parallel For is the application of a function to all the data elements of the input. This fact distinguishes it from MapReduce on certain key points. Firstly, there is no need for a reduce phase. As only a single computation is applied to the input, there is no sense in having an additional stage that does not perform any computation. Moreover, in a Parallel For no grouping or sorting should be performed on the output of the job.

It needs to be noted that in MapReduce the output of the map phase is sorted and grouped according to the key of the Key,Value pairs it produces as output. This needs to be avoided. Furthermore, the output of the Mappers is usually written to the local file system of the nodes; we require that the output of the job is written to HDFS. Lastly, the fact that the MapReduce framework operates on pairs must be taken into consideration. A Mapper takes a Key,Value pair as input and produces another Key,Value pair as output. In Parallel For we must find a way to hide the pairs and present the user with an easy way to handle and manipulate the input.

Having identified the key differences between a MapReduce job and a Parallel For job, we can now proceed to design the new parallel model using the existing one. In this part of the report we outline the ways that are best suited for dealing with the issues mentioned in the previous paragraph. To begin with, we need to remove the reduce phase from the new job class, an extension of the existing Job class, that we are to create. This can be done internally by setting the number of reduce tasks to zero and offering an API in which only a Mapper (soon to be renamed) can be declared by the user. Even though Parallel For's functionality is close to that of a Map, we need to offer a new class. This class will be an extension of the Mapper and it will hide the input and output pairs, offering the user a single input and output. Furthermore, the data types of Hadoop are of no use in the context of Parallel For, and thus both the input and the output will be manipulated as String data types. Additionally, by specifying zero reduce tasks, the output of the map phase is written to HDFS instead of the local file system of the nodes. The final obstacles that need to be dealt with are the grouping and sorting of the map output. Thankfully, the package of MapReduce has been designed in such a way that when no reduce tasks exist, the sorting and grouping of the map phase are omitted. Thus, all the issues have been dealt with. In Chapter 5, more details regarding the implementation of the new API are presented.
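The following sketch captures this design with illustrative names (the actual classes are listed in Appendix B): a map-only worker whose mapper hides the Key,Value pairs behind a single String-to-String operation.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A minimal sketch of the design described above; ForWorker is a hypothetical name.
public abstract class ForWorker extends Mapper<Object, Text, Text, NullWritable> {

    // The single operation the user provides: one input record in, one record out.
    protected abstract String work(String input);

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The user never sees the Key,Value pairs, only plain Strings.
        context.write(new Text(work(value.toString())), NullWritable.get());
    }
}

On the driver side, the corresponding job class would internally call job.setNumReduceTasks(0), so that the map output is written directly to HDFS without being sorted or grouped.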

4.3 Designing Parallel Sort over Hadoop

Implementing Parallel Sort over Hadoop raises a number of issues that need to be addressed. First of all, all known data types should be supported: more specifically, the skeleton offered to the user of Hadoop should be able to sort integers, doubles, floats, strings and long integers. Moreover, the implementation must be able to sort both in descending and in ascending order. Recall that after the map phase the intermediate results are sorted in ascending order. This natural sorting of MapReduce is used for developing the Sort skeleton, but it is apparent that when sorting in descending order this feature of MapReduce needs tweaking. Finding the best possible ways of dealing with these issues ultimately leads to the best-fit implementation of Parallel Sort over Hadoop.

To begin with, there are two approaches for the skeleton to support the five data types. The dynamic approach is for the program to determine the data type at run time and select the relevant Mapper and Reducer for that type. The other approach can be considered static, as the main idea is to have separate pairs of Mappers and Reducers predefined for each data type; by supplying a parameter, the user defines the type of the data that is to be sorted. The final obstacle in designing an efficient and complete Sort skeleton is supporting sorting in descending order, besides sorting in ascending order. For the implementation to support descending order, we need to find a way to reverse the sorting of the intermediate results. In order to accomplish this, we have to look into the MapReduce package, deduce how the sorting occurs, and then add the functionality of sorting in descending order. Once these two issues are dealt with, all we have to do is create a class that the user can use to execute a sort over an input of his own. A MapReduce job must be internally set up and configured with the appropriate parameters. This MapReduce job will use the fitting pair of predefined Mappers and Reducers and perform the sorting, and the user will have the option to specify the output order. The predefined Mappers and Reducers need only perform their default functionality, which is to read the input pair and write it out; the sorting occurs in the intermediate phase, performed by the framework itself. What should be noted is that the Mappers should isolate the value upon which the sorting will occur and use it as the key of their own output pair. An important factor to take into consideration is that many sort algorithms already exist over Hadoop, so why implement a new one, or why use the skeleton Sort instead of another? The answer to these questions is that the skeleton Sort results in a single sorted output file, no matter what the input is or how many reducers there are. As a result, a final note is that in our implementation we must guarantee that the output of the skeleton Sort will reside in a single file. These details conclude the design of the parallel algorithmic skeleton of Sort over Hadoop.
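As a rough illustration of these points, the sketch below (illustrative names only; the thesis implementation is described in Chapter 5) shows one predefined Mapper/Reducer pair for long integers: the mapper emits the value to sort on as the key so that the framework's shuffle performs the sorting, a single reducer guarantees one sorted output file, and a decreasing comparator produces descending order.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortSketch {

    // The Mapper isolates the value to sort on and emits it as the key, so the
    // framework's shuffle phase does the actual sorting.
    public static class LongSortMapper extends Mapper<Object, Text, LongWritable, NullWritable> {
        private final LongWritable outKey = new LongWritable();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            outKey.set(Long.parseLong(value.toString().trim()));
            context.write(outKey, NullWritable.get());
        }
    }

    // The Reducer only writes the keys back out, once per occurrence, so duplicates survive.
    public static class LongSortReducer extends Reducer<LongWritable, NullWritable, LongWritable, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "sort skeleton sketch");
        job.setJarByClass(SortSketch.class);
        job.setMapperClass(LongSortMapper.class);
        job.setReducerClass(LongSortReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);   // a single reducer guarantees one sorted output file
        // Descending order: replace the natural ascending comparator of the shuffle phase.
        job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}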


More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

2. MapReduce Programming Model

2. MapReduce Programming Model Introduction MapReduce was proposed by Google in a research paper: Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System

More information

Java in MapReduce. Scope

Java in MapReduce. Scope Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Big Data Analytics: Insights and Innovations

Big Data Analytics: Insights and Innovations International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 6, Issue 10 (April 2013), PP. 60-65 Big Data Analytics: Insights and Innovations

More information

Improved MapReduce k-means Clustering Algorithm with Combiner

Improved MapReduce k-means Clustering Algorithm with Combiner 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Improved MapReduce k-means Clustering Algorithm with Combiner Prajesh P Anchalia Department Of Computer Science and Engineering

More information

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018

Parallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018 Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

Data Storage Infrastructure at Facebook

Data Storage Infrastructure at Facebook Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung Outline Strategy of data storage, processing, and log collection Data flow

More information

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data

Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Performance Comparison of Hive, Pig & Map Reduce over Variety of Big Data Yojna Arora, Dinesh Goyal Abstract: Big Data refers to that huge amount of data which cannot be analyzed by using traditional analytics

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns

STATS Data Analysis using Python. Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns STATS 700-002 Data Analysis using Python Lecture 7: the MapReduce framework Some slides adapted from C. Budak and R. Burns Unit 3: parallel processing and big data The next few lectures will focus on big

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

MapReduce and Hadoop. The reference Big Data stack

MapReduce and Hadoop. The reference Big Data stack Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica MapReduce and Hadoop Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Attacking & Protecting Big Data Environments

Attacking & Protecting Big Data Environments Attacking & Protecting Big Data Environments Birk Kauer & Matthias Luft {bkauer, mluft}@ernw.de #WhoAreWe Birk Kauer - Security Researcher @ERNW - Mainly Exploit Developer Matthias Luft - Security Researcher

More information

Lecture 1: Overview

Lecture 1: Overview 15-150 Lecture 1: Overview Lecture by Stefan Muller May 21, 2018 Welcome to 15-150! Today s lecture was an overview that showed the highlights of everything you re learning this semester, which also meant

More information

MapReduce for Parallel Computing

MapReduce for Parallel Computing MapReduce for Parallel Computing Amit Jain 1/44 Big Data, Big Disks, Cheap Computers In pioneer days they used oxen for heavy pulling, and when one ox couldn t budge a log, they didn t try to grow a larger

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Map-Reduce in Various Programming Languages

Map-Reduce in Various Programming Languages Map-Reduce in Various Programming Languages 1 Context of Map-Reduce Computing The use of LISP's map and reduce functions to solve computational problems probably dates from the 1960s -- very early in the

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Guidelines For Hadoop and Spark Cluster Usage

Guidelines For Hadoop and Spark Cluster Usage Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset

More information

A Survey on Big Data

A Survey on Big Data A Survey on Big Data D.Prudhvi 1, D.Jaswitha 2, B. Mounika 3, Monika Bagal 4 1 2 3 4 B.Tech Final Year, CSE, Dadi Institute of Engineering & Technology,Andhra Pradesh,INDIA ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors

More information

Big Data landscape Lecture #2

Big Data landscape Lecture #2 Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop

More information

Map Reduce Group Meeting

Map Reduce Group Meeting Map Reduce Group Meeting Yasmine Badr 10/07/2014 A lot of material in this presenta0on has been adopted from the original MapReduce paper in OSDI 2004 What is Map Reduce? Programming paradigm/model for

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor

CS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html

More information

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop

Hadoop is supplemented by an ecosystem of open source projects IBM Corporation. How to Analyze Large Data Sets in Hadoop Hadoop Open Source Projects Hadoop is supplemented by an ecosystem of open source projects Oozie 25 How to Analyze Large Data Sets in Hadoop Although the Hadoop framework is implemented in Java, MapReduce

More information

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon HBase vs Neo4j Technical overview Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon 12th October 2017 1 Contents 1 Introduction 3 2 Overview of HBase and Neo4j

More information

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce)

Ghislain Fourny. Big Data 6. Massive Parallel Processing (MapReduce) Ghislain Fourny Big Data 6. Massive Parallel Processing (MapReduce) So far, we have... Storage as file system (HDFS) 13 So far, we have... Storage as tables (HBase) Storage as file system (HDFS) 14 Data

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture What Is Datacenter (Warehouse) Computing Distributed and Parallel Technology Datacenter, Warehouse and Cloud Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University,

More information

ExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you

ExamTorrent.   Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig

More information

A brief history on Hadoop

A brief history on Hadoop Hadoop Basics A brief history on Hadoop 2003 - Google launches project Nutch to handle billions of searches and indexing millions of web pages. Oct 2003 - Google releases papers with GFS (Google File System)

More information

A Review Approach for Big Data and Hadoop Technology

A Review Approach for Big Data and Hadoop Technology International Journal of Modern Trends in Engineering and Research www.ijmter.com e-issn No.:2349-9745, Date: 2-4 July, 2015 A Review Approach for Big Data and Hadoop Technology Prof. Ghanshyam Dhomse

More information

Hadoop & Big Data Analytics Complete Practical & Real-time Training

Hadoop & Big Data Analytics Complete Practical & Real-time Training An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material

More information

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig CSE 6242 / CX 4242 Scaling Up 1 Hadoop, Pig Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le

More information