
PWBRR Algorithm of Hadoop Platform

Haiyang Li
Department of Computer Science and Information Technology, Roosevelt University, Chicago, 60616, USA
Email: hli01@mail.roosevelt.edu

ABSTRACT

With cloud computing growing in popularity, the programming platform has become an important part of cloud computing, and Hadoop is the most widely used open-source programming platform. In this paper, we survey the Hadoop platform and propose the Priority Based Weighted Round Robin (PBWRR) algorithm. According to our experimental results, the PBWRR algorithm is better suited to the Hadoop platform than the alternatives we examined.

Keywords: cloud computing, PBWRR, Hadoop

I. INTRODUCTION

Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It has been used by many large technology companies, such as Amazon, Facebook, Yahoo, and IBM. Hadoop [1] is best known for MapReduce and its distributed file system (HDFS). The MapReduce idea comes from a Google paper [2]; put simply, a MapReduce computation is another form of divide and conquer. Hadoop [8] is aimed at problems that require examination of all the available data. For example, text analysis and image processing generally require that every single record be read, and often interpreted in the context of similar records. Hadoop uses a technique called MapReduce to carry out this exhaustive analysis quickly, while HDFS provides and supports distributed storage. These are the two main subprojects of the Hadoop platform.

Hadoop uses FIFO as its default scheduling algorithm. In our research on scheduling algorithms for Hadoop, we found that FIFO cannot satisfy the demands of its users: we cannot simply keep the idea of first come, first served. We need to take into account the requirements of users with higher priority while still remaining fair to the other users. We therefore propose the PBWRR algorithm.

II. MAPREDUCE OVERVIEW

The MapReduce framework [3] consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling a job's component tasks on the slaves, monitoring them, and re-executing any failed tasks; the slaves execute the tasks as directed by the master. As mentioned, MapReduce applications are based on this master-slave model [6]. This section describes the operations a generic application performs to transform input data into output data according to that model, driven by the user-defined map and reduce functions [5]. The map function processes a key/value pair and returns a list of intermediate key/value pairs:

    Map(k1, v1) -> list(k2, v2)

The reduce function merges all intermediate values having the same intermediate key:

    Reduce(k2, list(v2)) -> list(v2)
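To make these signatures concrete, here is a minimal word-count sketch written against Hadoop's classic org.apache.hadoop.mapred Java API (the API in use when this paper was written). It is an illustration only, not part of the PBWRR work itself.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        // Map(k1, v1) -> list(k2, v2): emit (word, 1) for every token in a line.
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out, Reporter reporter)
                    throws IOException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) out.collect(new Text(word), ONE);
                }
            }
        }

        // Reduce(k2, list(v2)) -> list(v2): sum all counts collected for a word.
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> out, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) sum += values.next().get();
                out.collect(key, new IntWritable(sum));
            }
        }
    }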
The JobTracker first determines the number of splits from the input path (the split size is configurable, typically about 16-64 MB) and selects TaskTrackers based on their network proximity to the data sources; it then sends task requests to the selected TaskTrackers. Each TaskTracker starts the map phase by extracting the input data from its splits. For each record parsed by the InputFormat, it invokes the user-provided map function, which emits a number of key/value pairs into an in-memory buffer. A periodic wakeup process sorts the memory buffer toward the different reducer nodes, invoking the combine function where one is defined. The key/value pairs are thus sorted into one of R local files (assuming there are R reducer nodes).

When a map task completes (all of its splits are done), its TaskTracker notifies the JobTracker. When all the TaskTrackers are done, the JobTracker notifies the TaskTrackers selected for the reduce phase. Each of these TaskTrackers reads the region files remotely, sorts the key/value pairs, and for each key invokes the reduce function, which collects the key and its aggregated value into the output file (one per reducer node).
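The assignment of an intermediate key to one of the R local files described above is a partitioning decision. For reference (this is background, not part of this paper's contribution), the logic of Hadoop's default HashPartitioner is essentially the following, shown here with a simplified signature:

    // Default hash partitioning of an intermediate key across R reducers:
    // clear the sign bit of the hash, then take it modulo R.
    int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }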

III. HDFS OVERVIEW

The Hadoop Distributed File System (HDFS) [4] is the primary storage system used by Hadoop applications. It is designed to handle large files (multiple GB) with sequential read/write operations. Each file is broken into chunks, which are stored across multiple DataNodes as local OS files. A master NameNode keeps track of the overall file directory structure and the placement of chunks. Each DataNode reports all of its chunks to the NameNode at boot time. Every chunk has a version number that is increased on each update, so the NameNode can tell when any chunk held by a DataNode is stale; stale chunks are garbage collected later.

To read a file, the client API calculates the chunk index from the offset of the file pointer and makes a request to the NameNode. The NameNode replies with the DataNodes that hold a copy of that chunk. From this point on, the client contacts a DataNode directly without going through the NameNode.

To write a file, the client API first contacts the NameNode, which designates one of the replicas as the primary (by granting it a lease). The NameNode's response identifies the primary and the secondary replicas. The client then pushes its changes to all DataNodes in any order, and each DataNode holds the change in a buffer. Once the changes are buffered at all DataNodes, the client sends a commit request to the primary, which determines an update order and pushes that order to all the secondaries. After all secondaries complete the commit, the primary reports success to the client.

All changes of chunk placement and all metadata changes are written to an operation log at the NameNode. This log maintains an ordered list of operations and is essential for the NameNode to recover its view after a crash. The NameNode also maintains its persistent state by regularly checkpointing to a file. If the NameNode crashes, a new NameNode takes over by restoring the state from the last checkpoint file and replaying the operation log.
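The read path just described can be sketched as follows. This is a schematic illustration under the assumption of fixed-size chunks; the type and method names (NameNode.locateChunk, DataNode.readChunk, and so on) are hypothetical stand-ins, not the real HDFS client API.

    import java.util.List;

    interface NameNode {
        // Metadata-only call: which DataNodes hold chunk `index` of `path`?
        List<DataNode> locateChunk(String path, long index);
    }

    interface DataNode {
        // Data flows directly between client and DataNode, bypassing the NameNode.
        byte[] readChunk(String path, long index, long offsetInChunk, int length);
    }

    class ReadPathSketch {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;   // assume 64 MB chunks

        static byte[] read(NameNode nn, String path, long offset, int length) {
            long index = offset / CHUNK_SIZE;                // chunk index from offset
            List<DataNode> replicas = nn.locateChunk(path, index);
            return replicas.get(0).readChunk(path, index,    // read from any replica
                                             offset % CHUNK_SIZE, length);
        }
    }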
IV. HADOOP ALGORITHM

In Hadoop's MapReduce operation, scheduling is driven by requests from the TaskTrackers to the JobTracker. The principle is similar to non-preemptive scheduling in an ordinary operating system: a task cannot be interrupted once it has been assigned. Four classic scheduling algorithms are relevant here.

FIFO (First In, First Out). This expression describes the principle of a queue: conflicting demands are serviced in first come, first served order. What comes in first is handled first; what comes in next waits until the first is finished. This is the default algorithm in Hadoop.

RR (Round Robin). One method of having different processes take turns using the resources of a computer is to limit each process to a short time period (a "time slice") and then suspend it to give another process a turn. This is commonly described as round-robin scheduling.

HPF (Highest Priority First). At each scheduling decision, the processor is assigned to the ready process with the highest priority. The priority number may be static or dynamic. A static priority number is fixed when the process is created, based on the initial characteristics of the process or on user requirements, and cannot be changed while the process runs. A dynamic priority number is given an initial value when the process is created and is then adjusted as the characteristics of the process change during execution.

WRR (Weighted Round Robin). WRR is a scheduling discipline in which each packet flow or connection has its own packet queue in a network interface. It is the simplest approximation of generalized processor sharing (GPS): while GPS serves infinitesimal amounts of data from each nonempty queue, WRR serves a number of packets from each nonempty queue in proportion to its weight.
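As background for the next section, the following is a minimal, generic weighted round robin sketch (not Hadoop-specific): each nonempty queue is visited in turn and served up to its weight in items per cycle.

    import java.util.*;

    // Generic weighted round robin over a set of queues: in each cycle,
    // every nonempty queue i is served up to weights[i] items in turn.
    class WeightedRoundRobin<T> {
        private final List<Queue<T>> queues;
        private final int[] weights;

        WeightedRoundRobin(List<Queue<T>> queues, int[] weights) {
            this.queues = queues;
            this.weights = weights;
        }

        // Serve one full cycle, returning items in the order they were served.
        List<T> serveCycle() {
            List<T> served = new ArrayList<>();
            for (int i = 0; i < queues.size(); i++) {
                Queue<T> q = queues.get(i);
                for (int n = 0; n < weights[i] && !q.isEmpty(); n++) {
                    served.add(q.poll());   // higher weight => more items per cycle
                }
            }
            return served;
        }
    }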

V. PBWRR: PRIORITY BASED WEIGHTED ROUND ROBIN

Strict priority queueing and weighted round robin are two common scheduling schemes for cloud computing [7]. Based on the pattern of Hadoop's MapReduce operations and its non-preemptive scheduling behavior, we propose the Priority Based Weighted Round Robin (PBWRR) algorithm to avoid long waits in task scheduling. The scheduling priority of each task can be adjusted to the actual situation.

The basic idea of the PBWRR algorithm: each job is placed into a queue. In the unweighted case, the TaskTracker executes one task of each job in turn. In the weighted case, a job with higher priority has multiple tasks executed per turn.

The basic steps of the PBWRR algorithm:

1. When resources are available, a TaskTracker requests an assignment from the JobTracker.
2. When the JobTracker receives the request, it assigns a task from the current job to that TaskTracker for execution. The JobTracker then updates the information on the remaining tasks of the current job. At the same time, the value of thisRoundTask is reduced by one; if the result is less than one, the pointer moves to the next job, otherwise the pointer stays and waits for another request from a TaskTracker.
3. When the pointer reaches the end of the queue, the JobTracker's information is updated. If a new job has joined the queue, the JobTracker recalculates the value of thisRoundTask for each job and moves the pointer back to the beginning of the queue.

Figure 1: PBWRR algorithm processing cycle.

Each element of the job queue is a JobInfo structure:

    class JobInfo {
        int jobId;
        int jobSize;        // total input size of the job
        int taskNum;        // number of tasks in the job
        int meanTaskSize;   // jobSize / taskNum
        int priority;
        int weight;
        int thisRoundTask;  // tasks this job may run in the current round
    }

There are two types of queues in the overall scheduling algorithm: jobqueue and taskqueue[]. JobInfo is the element type of jobqueue; each element of taskqueue[] belongs to one job and holds all of that job's map and reduce tasks.

The assignment method, with signature

    synchronized List<Task> assignTasks(TaskTrackerStatus tracker)

works as follows:

1. Search the jobqueue for a map or reduce task waiting to be processed; if there is none, return null.
2. Find the JobInfo at the pointer position in the current jobqueue, and use the jobId of that JobInfo to locate taskqueue[i].
3. Determine whether taskqueue[i] is a map-task queue or a reduce-task queue. If it is a reduce-task queue, check whether the job's map tasks have completed; if not, move the jobqueue pointer back and redo step 2. If the map tasks have completed, assign the task at the head of the queue to the TaskTracker that requested the assignment.
4. Update the taskqueue information and delete the task from the taskqueue. Then update the JobInfo in the jobqueue (thisRoundTask - 1, doneTask + 1). If the thisRoundTask value of the JobInfo is less than 1, move the pointer to the next job; otherwise the pointer stays.
5. If the pointer is at the end of the queue, update the JobInfo entries of the jobqueue. If a new job has joined the queue, recalculate all job weights and the thisRoundTask value of each job, then move the pointer to the head of the queue.
6. Return to step 3.

The weight and thisRoundTask values in step 5 are recalculated as:

    void updateJobQueue(JobQueue queue) {
        weight = calculateWeight(priority);
        thisRoundTask = min(floor(weight), taskNum);
    }

We configure the priority values while the system is running. Because cloud computing services are offered over a wide range of fees and service-quality requirements, the priority values can be much richer in practice; however, in order to keep the entire system fair, we have to consider the priority values of all existing business when we introduce new business. In this paper, we simply set the priority values to 1, 2, and 3.
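The pointer logic of steps 1-6 can be sketched compactly as below. The helper names (Task, nextRunnableTask, and so on) are simplified stand-ins for illustration, not Hadoop's actual scheduler API; the JobInfo fields are those defined above.

    import java.util.List;

    class Task { /* stand-in for a map or reduce task */ }

    class PbwrrScheduler {
        List<JobInfo> jobs;   // the circular jobqueue
        int pointer = 0;      // index of the current job

        Task assignTask() {
            // Scan at most one full cycle; jobs with nothing runnable (e.g.
            // reduces still waiting on unfinished maps, step 3) are skipped.
            for (int scanned = 0; scanned < jobs.size(); scanned++) {
                JobInfo job = jobs.get(pointer);              // step 2
                Task t = nextRunnableTask(job);               // step 3
                if (t != null) {
                    job.thisRoundTask--;                      // step 4
                    if (job.thisRoundTask < 1) advance();
                    return t;
                }
                advance();                                    // nothing runnable here
            }
            return null;                                      // step 1: no waiting task
        }

        void advance() {                                      // step 5
            pointer++;
            if (pointer >= jobs.size()) {
                pointer = 0;
                recalculateWeightsAndRounds();                // handles newly joined jobs
            }
        }

        Task nextRunnableTask(JobInfo job) { return null; }   // head of the job's taskqueue
        void recalculateWeightsAndRounds() { }                // see the sketch in Section VI
    }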
VI. DERIVATION OF THE WEIGHT

For each job[i], meantasksize[i] = jobsize[i] / tasknum[i]. If the weight of job[i] is weight[i], the amount of data processed in one scheduling cycle of the entire system is

    S = sum over i of ( meantasksize[i] * weight[i] )                                      (1)

Meanwhile, if each job[i] has priority[i], then to make its priority effective we require that its share of the data processing capacity of the entire system equal its share of the total priority value:

    ( meantasksize[i] * weight[i] ) / S = priority[i] / sum over j of priority[j]          (2)

so that

    weight[i] = ( priority[i] * S ) / ( ( sum over j of priority[j] ) * meantasksize[i] )  (3)

In these formulas, priority[i] and meantasksize[i] are known. The only value left to determine is S, the amount of data in one processing cycle. Because of how the system processes data, most tasks are of size blocksize, so we consider task capacity in terms of map tasks (normally the reduce-task requirement is smaller than the map-task requirement) and rewrite the formula as:

    S = max( num * blocksize * jobnum, taskability * blocksize )                           (4)

Here num is the number of tasks executed per job in one processing cycle, and jobnum is the number of jobs with unexecuted tasks. Assume there are m computers in the entire system, that computer i can support t[i] TaskTrackers, and that each TaskTracker's processing ability is pt[i]; then the processing ability of the entire system is

    P = sum over i of ( t[i] * pt[i] ) = sum over i of p[i]                                (5)

where p[i] is the processing ability of Hadoop assigned to computer i.
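Putting formulas (3) and (4) together, the recalculation step could look as follows. This is a sketch under the assumption that calculateWeight in Section V implements formula (3); it would live in the scheduler class above and uses the JobInfo fields defined there.

    // Recompute each job's weight per formula (3): proportional to its
    // priority, inversely proportional to its mean task size.
    void recalculateWeightsAndRounds(List<JobInfo> jobs, long S) {
        long prioritySum = 0;
        for (JobInfo j : jobs) prioritySum += j.priority;
        for (JobInfo j : jobs) {
            double w = (double) j.priority * S
                     / (prioritySum * (double) j.meanTaskSize);
            j.weight = (int) Math.floor(w);
            // Tasks granted this round: floor(weight), capped by remaining tasks.
            j.thisRoundTask = Math.min(j.weight, j.taskNum);
        }
    }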

VII. EXPERIMENTS WITH PBWRR

We designed three tests comparing FIFO with PBWRR. FIFO is the default algorithm in Hadoop; PBWRR is the algorithm we propose, based on the RR, HPF, and WRR algorithms. In the first test all jobs have the same priority; in the second and third tests the jobs have different priorities, with the priority assignments differing between the second and third tests.

First test: all jobs have the same priority. We input 4 jobs: Job1 and Job3 each have seven files of 20 MB; Job2 and Job4 each have 2 files of 256 MB. All job priorities are 1.

Figure 2: PBWRR (first test).

Figure 3: FIFO (first test).

Comparing the results of Figure 2 and Figure 3 by time cost, the PBWRR algorithm shows its fairness. Because the reduce tasks depend on the map tasks, starting map tasks early lets the reduce tasks start early as well, which also enhances the parallelism of MapReduce.

Second test: we input 4 jobs. Job1 and Job3 each have seven files of 20 MB; Job2 and Job4 each have 2 files of 256 MB. Job1 and Job2 have priority 1; Job3 and Job4 have priority 2.

Figure 4: PBWRR (second test).

Figure 5: FIFO (second test).

Comparing the results of Figure 4 and Figure 5, Job3 runs twice as fast as Job1, which is the result we expected. Job4, however, is not much faster than Job2. There are two possible causes: first, the running time of the test was not long enough; second, with the TaskTrackers occupied by Job1 and Job3, the system did not take good care of Job4.

Third test: we input 6 jobs. Job1, Job2, and Job3 each have 3 files of 128 MB; Job4, Job5, and Job6 each have 3 files of 192 MB. Job1, Job2, and Job3 have priorities 1, 2, and 3, respectively; likewise Job4, Job5, and Job6 have priorities 1, 2, and 3.

Figure 6: PBWRR (third test).

Figure 7: FIFO (third test).

Comparing the results of Figure 6 and Figure 7, we find that the higher a job's priority is, the faster it completes, and it runs much faster than under FIFO, because its reduce tasks can be executed early.

VIII. CONCLUSION

In this paper, we surveyed the Hadoop platform and showed that the PBWRR algorithm we propose is well suited to it. A cloud computing platform normally supplies services at different service fees, and a user who pays for higher priority should not simply be subject to the first come, first served rule. The PBWRR algorithm clearly distinguishes the priorities of different users while still keeping the system fair: it prevents users with higher priority from taking most of the system resources. The PBWRR algorithm also inherits Hadoop's open-source character for developers and is a good supplement to Hadoop.

REFERENCES

[1] Tom White. Hadoop: The Definitive Guide. First Edition, p. 12, June 2009.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
[3] Hai Jin, Shadi Ibrahim, Tim Bell, Li Qi, Haijun Cao, Song Wu, and Xuanhua Shi. Tools and Technologies for Building Clouds. p. 13, 2010.
[4] Hadoop. http://hadoop.apache.org/common/docs/current/
[5] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. pp. 107-113.
[6] Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments. pp. 113-115.
[7] Y. Zhang and P. G. Harrison. Performance of a Priority-Weighted Round Robin Mechanism for Differentiated Service Networks.
[8] http://www.vmware.com/appliances/directory/uploaded_files/what%20is%20hadoop.pdf