Big Data Processing: Improve Scheduling Environment in Hadoop
Bhavik B. Joshi


IJSRD - International Journal for Scientific Research & Development, Vol. 4, Issue 06, 2016, ISSN (online): 2321-0613

Abstract: Nowadays, dealing with datasets on the order of terabytes or even petabytes is a reality. Processing such big datasets efficiently is therefore a clear need for many users. In this context, Hadoop Map Reduce is a big data processing framework that has rapidly become the de facto standard in both industry and academia. The main reasons for this popularity are the ease of use, scalability, and failover properties of Hadoop Map Reduce. However, these features come at a price: the performance of Hadoop Map Reduce is usually far from the performance of a well-tuned parallel database. Therefore, many research works (from industry and academia) have focused on improving the performance of Hadoop Map Reduce jobs in many aspects. For example, researchers have proposed different data layouts, join algorithms, high-level query languages, failover algorithms, query optimization techniques, and indexing techniques. We point out the similarities and differences between the techniques used in Hadoop and those used in parallel databases, and attempt to improve the scheduling environment in Hadoop.

Key words: Big Data, Hadoop, Map Reduce, Scheduling algorithms, SARS algorithm

I. INTRODUCTION

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, search, sharing, storage, transfer, visualization, and information privacy. Because the data are bigger, they require different approaches, techniques, tools, and architectures [1][2].
Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques [1][10].

(Author affiliation: LDRP-ITR, Kadi Sarva Vidhyalay, Gujarat, India)

A. Architecture of Hadoop

There are two kinds of nodes in Hadoop: master nodes and slave nodes. The master node runs the Job Tracker, and the slave nodes run Task Trackers [7][8]. Map Reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Hadoop Map Reduce (Hadoop Map/Reduce) is a software framework for the distributed processing of large data sets on compute clusters of commodity hardware [1][9]. It also uses HDFS, the Hadoop Distributed File System: a distributed file system that stores large amounts of data with high-throughput access on clusters. A Map Reduce program is composed of a map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "Map Reduce System" (also called the "infrastructure" or "framework") orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing redundancy and fault tolerance [7][8].

B. Working of Map Reduce

The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the Map Reduce framework is not the same as in their original forms.
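The map()/reduce() pairing described above can be sketched in a few lines of Python. This is a hypothetical in-process illustration of the programming model (not Hadoop's actual API), using the student-name example from the text:

```python
from collections import defaultdict

# Sketch of the Map Reduce model: map() emits (key, value) pairs, the
# framework groups them by key, and reduce() summarizes each group --
# here, counting the students holding each first name.

def map_phase(records):
    """Emit (first_name, 1) for every student record."""
    for student in records:
        first_name = student.split()[0]
        yield (first_name, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Summarize each queue: count the students per name."""
    return {name: sum(ones) for name, ones in groups.items()}

students = ["Alice Smith", "Bob Jones", "Alice Brown"]
counts = reduce_phase(shuffle(map_phase(students)))
print(counts)  # {'Alice': 2, 'Bob': 1}
```

In a real cluster the shuffle step crosses the network between machines; here it is a single dictionary, which is the essential simplification of the sketch.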
The key contributions of the Map Reduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once [1][2][4][6].

C. Hadoop Map Reduce Scheduling and Important Issues

Hadoop allows the user to configure a job, submit it, control its execution, and query its state. Every job consists of independent tasks, and every task needs a system slot to run. In Hadoop, all scheduling and allocation decisions are made at the task and node-slot level for both the map and reduce phases.

Fig. 1: [7][8]

All rights reserved by www.ijsrd.com 519
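The task-and-slot allocation just described can be pictured with a toy model. The class below and its slot counts are illustrative assumptions for the sketch, not Hadoop defaults:

```python
# Toy model of slot-level scheduling: each Task Tracker node advertises a
# fixed number of map slots and reduce slots, and the Job Tracker assigns
# one task per free slot of the matching kind.

class TaskTracker:
    def __init__(self, name, map_slots, reduce_slots):
        self.name = name
        self.free = {"map": map_slots, "reduce": reduce_slots}

    def take_slot(self, kind):
        """Claim one free slot of the given kind, if any remain."""
        if self.free[kind] > 0:
            self.free[kind] -= 1
            return True
        return False

nodes = [TaskTracker("node1", map_slots=2, reduce_slots=1),
         TaskTracker("node2", map_slots=2, reduce_slots=1)]

tasks = ["map", "map", "map", "reduce", "reduce", "reduce"]
placed = []
for kind in tasks:
    for node in nodes:
        if node.take_slot(kind):
            placed.append((kind, node.name))
            break

print(placed)
# Only 2 reduce slots exist across the cluster, so the third reduce task
# finds no slot and must wait -- the situation analyzed later in Fig. 5.
```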

The Hadoop scheduling model is a Master/Slave (Master/Worker) cluster structure. The master node (Job Tracker) coordinates the worker machines (Task Trackers). The Job Tracker is a process which manages jobs, and a Task Tracker is a process which manages tasks on the corresponding node. The scheduler resides in the Job Tracker and allocates Task Tracker resources to running tasks; map and reduce tasks are granted independent slots on each machine.

Fig. 2: [9]

In fact, the Map Reduce scheduling system operates in six steps. First, the user program divides the Map Reduce job. Second, the master node distributes map tasks and reduce tasks to different workers [1][3][4][5]. Third, the map tasks read in their data splits and run the map function on the data. Fourth, the map tasks write intermediate results to local disk. Fifth, the reduce tasks read the intermediate results remotely and run the reduce function on them. Finally, the reduce tasks write the final results into the output files [1][3][4][5].

Fig. 3: [5][6][7]
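The six steps above can be walked through with a toy word-count job. The "disks" here are plain Python structures, so this only sketches the data flow under that simplifying assumption, not a real cluster:

```python
from collections import defaultdict

# Toy walk-through of the six scheduling steps, with word count as the job.

text = "big data needs big clusters"
R = 2  # number of reduce tasks

# Step 1: the user program divides the job into input splits.
words = text.split()
splits = [words[0::2], words[1::2]]

# Steps 2-4: the master assigns each split to a map worker, which runs
# map() and writes intermediate (word, 1) pairs, partitioned by key hash,
# to its local disk.
local_disks = []
for split in splits:
    partitions = defaultdict(list)
    for word in split:
        partitions[hash(word) % R].append((word, 1))
    local_disks.append(partitions)

# Steps 5-6: each reduce task remotely reads its partition from every map
# worker's disk, runs reduce() (summing counts), and writes the output.
output = {}
for r in range(R):
    merged = defaultdict(int)
    for disk in local_disks:
        for word, one in disk.get(r, []):
            merged[word] += one
    output.update(merged)

print(output)  # word counts: big=2, data=1, needs=1, clusters=1
```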

D. Issues in Scheduling in Hadoop

There are three important scheduling issues in Map Reduce: locality, synchronization, and fairness.

(D.1) Locality
Locality is a crucial issue affecting performance in a shared cluster environment, due to limited network bisection bandwidth. It is defined as the distance between the input data node and the task-assigned node. Data locality in Map Reduce means that the scheduler tries to assign map tasks to slots available on machines in which the underlying storage layer holds the input intended to be processed [1][3][6][7].

(D.2) Synchronization
Synchronization, the process of transferring the intermediate output of the map processes to the reduce processes as input, is also considered a factor which affects performance. Generally, a reduce process starts only when the map process is completely finished. Due to this dependency, a single node can slow down the whole process, causing the other nodes to wait until it is finished. Synchronizing these two phases (the map and reduce phases) is therefore essential for the overall performance of the Map Reduce model [6][7].

(D.3) Fairness
A Map Reduce job with a heavy workload may dominate utilization of a shared cluster, so some short computation jobs may not achieve the desired response time. The demands of the workload can be elastic, so fair allocation of the workload to each job sharing the cluster should be considered. Fairness has trade-offs with locality and with the dependency between the map and reduce phases. To solve these important issues, many scheduling algorithms have been proposed over the past decades [1][5][6][7].

II. SCHEDULING ALGORITHMS

Due to the important issues described above, and many more problems in the scheduling of Map Reduce, scheduling is one of the most critical aspects of Map Reduce. Many algorithms address these issues with different techniques and approaches.
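The locality issue (D.1) can be made concrete with a small sketch. The task names, node names, and block placement below are hypothetical; the point is only the preference order a locality-aware scheduler follows:

```python
# Sketch of locality-aware assignment: prefer a free slot on the node that
# already stores a task's input block (zero network transfer), and fall
# back to a remote assignment only when no local slot is free.

block_location = {"task1": "nodeA", "task2": "nodeB", "task3": "nodeA"}

def assign(tasks, slots):
    assignments = {}
    for task in tasks:
        local = block_location[task]
        if local in slots:            # data-local: input is already there
            slots.remove(local)
            assignments[task] = (local, "local")
        elif slots:                   # remote: input must cross the network
            assignments[task] = (slots.pop(0), "remote")
    return assignments

print(assign(["task1", "task2", "task3"], ["nodeA", "nodeC"]))
# task1 -> nodeA (local), task2 -> nodeC (remote); task3 finds no slot left
```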
Some of them focus on improving data locality, some provide synchronization handling, and many have been designed to minimize the total completion time. In this section we briefly describe some of these algorithms [6][7].

A. Analysis of Scheduling Algorithms Based on Taxonomy, Advantages, and Disadvantages

1. FIFO
Advantages: (1) The scheduling cost of the entire cluster is low. (2) Simple to implement and efficient.
Disadvantages: Designed for only a single type of job; low performance when running multiple types of jobs; poor response times for short jobs compared to large jobs.

2. Fair Scheduling
Advantages: (1) Less complex. (2) Works well for both small and large clusters. (3) Can provide fast response times for small jobs mixed with large jobs.
Disadvantages: Does not consider the job weight of each node.

3. Capacity
Advantages: (1) Ensures guaranteed access, with the potential to reuse unused capacity and prioritize jobs within queues over a large cluster.
Disadvantages: (1) The most complex among the three basic schedulers.

4. Hybrid scheduler based on dynamic priority
Advantages: (1) A fast and flexible scheduler. (2) Improves response time for multi-user Hadoop environments.

5. LATE
Advantages: (1) Robust to node heterogeneity. (2) Addresses the problem of how to robustly perform speculative execution to maximize performance.
Disadvantages: (1) Only takes action on appropriate slow tasks. (2) Does not compute the remaining time for tasks correctly and cannot find truly slow tasks in the end. (3) Poor performance due to the static manner of computing task progress.

6. SAMR
Advantages: (1) Decreases the execution time of Map Reduce jobs. (2) Improves overall Map Reduce performance in heterogeneous environments.
Disadvantages: (1) Does not consider data locality management for launching backup tasks.
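The contrast between the FIFO and Fair Scheduling rows can be quantified with a minimal single-slot simulation. The job mix and task lengths are made-up numbers chosen only to show the response-time trade-off:

```python
# A long job (10 tasks) arrives just before a short job (2 tasks); every
# task takes 1 time unit and there is a single execution slot.

def fifo(jobs):
    """Run whole jobs in arrival order; return each job's completion time."""
    t, done = 0, {}
    for name, tasks in jobs:
        t += tasks
        done[name] = t
    return done

def fair(jobs):
    """Round-robin one task per job per turn, approximating fair sharing."""
    t, done = 0, {}
    remaining = dict(jobs)
    while remaining:
        for name in list(remaining):
            t += 1
            remaining[name] -= 1
            if remaining[name] == 0:
                done[name] = t
                del remaining[name]
    return done

jobs = [("long", 10), ("short", 2)]
print(fifo(jobs))   # short job waits behind the long one: finishes at t=12
print(fair(jobs))   # short job interleaves: finishes at t=4
```

Both policies finish the long job at t=12, but fair sharing cuts the short job's response time from 12 to 4, which is exactly the "fast response times for small jobs mixed with large jobs" advantage in the table.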

7. Delay scheduling
Advantages: (1) Simplicity of scheduling.
Disadvantages: (1) No particular

8. CREST
Advantages: (1) Decreases the response time of Map Reduce jobs. (2) Optimal running time for speculative map tasks.

9. LARTS
Advantages: (1) Improves data locality.

10. CoGRS
Advantages: (1) Decreases network traffic. (2) Saves Map Reduce network traffic.

11. SARS
Advantages: (1) Reduces completion time. (2) Decreases response time.
Disadvantages: (1) Focuses only on the reduce process.

Table 1: Analysis of scheduling algorithms

III. INTRODUCTION TO THE SARS SCHEDULING ALGORITHM

Map Reduce is by far one of the most successful realizations of large-scale data-intensive cloud computing platforms. When to start the reduce tasks is one of the key problems in advancing Map Reduce performance, and the existing implementations may result in a block of reduce tasks. When the output of the map tasks becomes large, the performance of a Map Reduce scheduling algorithm is seriously affected. Through an analysis of the current Map Reduce scheduling mechanism, this paper illustrates the reasons for the waste of system slot resources, which results in reduce tasks waiting around, and proposes an optimized reduce scheduling policy called SARS (Self-Adaptive Reduce Scheduling) for reduce task start times on the Hadoop platform. It can decide the start time point of each reduce task dynamically according to each job's context, including the task completion time and the size of the map output. By estimating the job completion time, the reduce completion time, and the system average response time, the experimental results illustrate that, compared with other algorithms, the reduce completion time decreases sharply; it is also shown that the average response time decreases by 11% to 29% when the SARS algorithm is applied on top of the traditional job scheduling algorithms FIFO, Fair Scheduler, and Capacity Scheduler [4][5][6][7].

A. Problem Analysis

Hadoop allows the user to configure a job, submit it, control its execution, and query its state. Every job consists of independent tasks, and each task needs a system slot to run. Fig. 5 shows the time delay and slot resource waste problem in reduce scheduling. The Y-axis in Fig. 5 represents the resource slots, divided into map slots and reduce slots. In Fig. 5, Job1 and Job2 are the currently running jobs; before time t2, each job is allocated two map slots to run its tasks. Because the reduce tasks begin once any map task finishes, during the interval t1 to t2 there are two reduce tasks running, one each from Job1 and Job2. Since the execution time of each task is usually not the same, Job2 finishes its map tasks at time t2 and its reduce tasks at time t3. Because the reduce tasks of Job1 have not finished by time t3, as indicated in Fig. 5, two new reduce tasks from Job1 are started, and all the reduce slot resources are now taken up by Job1. At moment t4, when Job3 starts, two idle map slots can be assigned to it, and the reduce tasks from this job should then start. However, all the reduce slots are already occupied by Job1, so the reduce tasks of Job3 have to wait for a slot release. The root cause of this problem is that the reduce tasks of Job3 must wait for all the reduce tasks of Job1 to complete, since Job1 takes up all the reduce slots and the Hadoop system does not support preemption by default.

Fig. 4: [6][7] The performance of the policies with respect to various graph sizes.

In the early algorithm design, reduce tasks can be scheduled once any map task has finished. One benefit is that the reduce tasks can copy the output of the map tasks as soon as possible. But the reduce tasks then have to wait until all map tasks are finished, and the pending tasks occupy slot resources, so that other jobs which have finished their map tasks cannot start their reduce tasks. All in all, this results in long waits for reduce tasks and greatly increases the delay of Hadoop jobs [5][6][7]. In practice, a shared cluster environment often runs multiple jobs in parallel, submitted by different users. If the above situation arises among different users at the same time, and the reduce slot resources are occupied for a long time, the jobs submitted by other users cannot proceed until the slots are released. Such inefficiency extends the average response time of a Hadoop system, lowers the resource utilization rate, and reduces the throughput of a Hadoop cluster [3][4][5][6].

IV. PROPOSED ALGORITHM & ARCHITECTURE

Following the above analysis, one way to optimize Map Reduce tasks is to select an adaptive time at which to schedule the reduce tasks. By this means, we can avoid reduce tasks waiting around and enhance the resource utilization rate. This section proposes a self-adaptive reduce task
scheduling algorithm, which gives a method to estimate the start time of a task, instead of the traditional mechanism where the reduce tasks are started once any map task is completed.

The reduce process can be divided into several phases. First, in the copy phase, the reduce task requests to read, from the map output data blocks, each piece of map output that belongs to this reduce function [5][6]. Next, in the sort process, these intermediate data are merged into an ordered data set. The data fall into two types. One type is the data in memory: when data are read from various maps at the same time, records with the same key should be merged. The other type behaves like a circular buffer: because the memory belonging to the reduce task is limited, the data in the buffer must be written to disk regularly in advance.

Fig. 5: [6] The default scheduling for reduce tasks.
Fig. 6: [6] The scheduling method for reduce tasks in SARS.

In this way, subsequent data need to be merged with the data written to disk earlier, the so-called external sorting. External sorting may need to be executed several times if the number of map tasks is large in practical workloads. The copy and sort phases together are customarily called the shuffle phase. Finally, after the copy and sort processes finish, the subsequent functions start, and the reduce tasks can be scheduled onto the compute nodes [3][4][5][6].

V. SARS ALGORITHM (A SELF-ADAPTIVE REDUCE SCHEDULING ALGORITHM)

A.
Reduce Task Algorithm

Require: M: the map slot collection; R: the reduce slot collection; Q: the job queue.

    MT = obtainMapTask(M, Q)
    start the map tasks in MT
    obtain KVS as the intermediate key-value set
    for each job in Q do
        for each key in KVS do
            S_MapTask = 0    // total finish time of all maps
            S_MapOut = 0     // total output of all maps
            for each i in [1, m] do
                S_MapTask += t_map_i
                S_MapOut += m_out_i
            end for
            // get the starting time of the reduce task (s is the number of
            // map slots; transSpeed and copyThread are the copy bandwidth
            // and the number of copy threads; the operator between the two
            // terms is reconstructed from the garbled source)
            startReduce = startMap + S_MapTask/s - S_MapOut/(transSpeed * copyThread)
            if System.currentTime >= startReduce then
                if the job has a waiting reduce task then
                    remove a reduce slot from R
                    start the reduce task
                else
                    break
                end if
            else
                break
            end if
        end for
    end for  [6]

B. Map Task Algorithm

Require: M: the map slot collection; Q: the job queue.
Ensure: MT: the map task set; m: the count of current map tasks.

    count = 0
    for each job in Q do
        if the job has a waiting map task t then
            MT = MT ∪ {t}
            remove a map slot from M
            count++
        else
            break
        end if
    end for
    m = count
    return MT  [6]
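The start-time estimate at the heart of the Reduce Task Algorithm can be sketched in Python. Note the caveat above: the operator between the two terms is garbled in the source, so this sketch assumes the reconstruction startReduce = startMap + S_MapTask/s - S_MapOut/(transSpeed * copyThread); all parameter values below are illustrative:

```python
# Hypothetical sketch of the SARS reduce start-time estimate: delay the
# reduce task long enough that map computation (spread over the map slots)
# finishes, minus the time needed to copy the accumulated map output.

def estimate_reduce_start(start_map, map_task_times, map_output_sizes,
                          map_slots, trans_speed, copy_threads):
    """Estimate when a job's reduce tasks should be scheduled."""
    s_maptask = sum(map_task_times)        # total map compute time
    s_mapout = sum(map_output_sizes)       # total map output volume
    return (start_map
            + s_maptask / map_slots
            - s_mapout / (trans_speed * copy_threads))

start = estimate_reduce_start(
    start_map=0,
    map_task_times=[10, 12, 8, 10],        # seconds per map task
    map_output_sizes=[100, 120, 80, 100],  # MB of output per map task
    map_slots=2,
    trans_speed=50,                        # MB/s per copy thread
    copy_threads=2,
)
print(start)  # 16.0
```

Intuitively: the maps need 40 s of compute spread over 2 slots (20 s), and copying 400 MB at 100 MB/s aggregate takes 4 s, so starting the reduce task at t=16 lets the copy finish just as the last map does, instead of holding a reduce slot from the first map's completion.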

VI. COMPARISON BETWEEN BASIC SCHEDULING ALGORITHMS AND THE SARS ALGORITHM ON WORD COUNT

Fig. 7: Comparison between the FIFO and SARS algorithms on Word Count
Fig. 8: Comparison between the FAIR and SARS algorithms on Word Count

VII. CONCLUSION

The goal of the improved algorithm presented in this report is to decrease the completion time of reduce tasks in the Map Reduce framework. The method decides the reduce task start time from the running situation of the jobs, so as to avoid the reduce function waiting around. In this paper, the performance of the algorithm is evaluated through the job completion time, the reduce completion time, and the system average response time. In the SARS algorithm, we therefore have to estimate the point at which the reduce function starts.

VIII. FUTURE WORK

We examined a robust scheduling algorithm, SARS, which improves the scheduling environment in the reduce phase; its disadvantage, however, is that it does not consider the map phase. Future work is to add a map-phase algorithm and analyze the result, and to implement a technique that identifies the point at which the reduce function should start so that it benefits the algorithm most. In future work we will therefore combine a more efficient map-phase algorithm with the SARS algorithm, and check and analyze the results on a Hadoop cluster.

REFERENCES

[1] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving Map Reduce Performance in Heterogeneous Environments," University of California, Berkeley.
[2] B. Thirumala Rao, "Survey on Improved Scheduling in Hadoop Map Reduce in Cloud Environments," Lakireddy Bali Reddy College of Engineering.
[3] A. Elsayed, O. Ismail, and M. E. El-Sharkawi, "Map Reduce: State-of-the-Art and Research Directions."
[4] Q. Chen, C. Liu, and Z. Xiao, "Improving Map Reduce Performance Using Smart Speculative Execution Strategy."
[5] J. Dittrich and J.-A. Quiané-Ruiz, "Efficient Big Data Processing in Hadoop Map Reduce," Information Systems Group, Saarland University.
[6] Z. Tang, L. Jiang, J. Zhou, K. Li, and K. Li, "A Self-Adaptive Scheduling Algorithm for Reduce Start Time."
[7] S. R. Pakize, "A Comprehensive View of Hadoop Map Reduce Scheduling Algorithms," Department of Computer, Islamic Azad University, Yazd Branch, Yazd, Iran.
[8] http://searchaws.techtarget.com/definition/amazon-Elastic-Compute-Cloud-Amazon-EC2
[9] http://www.ibm.com/developerworks/library/os-hadoop-scheduling/
[10] http://carleton.ca/ses/ad-hoc-room-booking/carletonuniversity-web-scheduler2015/1-web-scheduler-quickreference-guide-and-logging-into web-scheduler.