Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters


Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters

Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

by

Radheshyam Nanduri

Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad, INDIA

February 2012

Copyright © Radheshyam Nanduri, 2012
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Dynamic Resource and Energy Aware Scheduling of MapReduce Jobs and Virtual Machines in Cloud Datacenters by Radheshyam Nanduri, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Advisor: Prof. Vasudeva Varma

To my parents and my teachers

Acknowledgments

Firstly, I would like to express my heartfelt gratitude to my parents and family, whose love, support and encouragement inspired me to complete this degree. I would like to dedicate this work to them; they have been a great influence in my life. I would like to thank my advisor Prof. Vasudeva Varma for his continuous guidance throughout the course of my thesis, without whom this thesis would not have been possible. His motivation has helped me pursue my research in the field of Cloud Computing, and his continued support has led me in the right direction. I am very grateful to Mr. Reddy Raja A. of Pramati Technologies for his valuable inputs in framing research ideas in the field of MapReduce. His continuous feedback about my work helped me move forward very swiftly. This thesis would not have come to completion without the help of my friends at IIIT-Hyderabad. I would like to express my thanks to each and every friend who helped me directly or indirectly during my master's program. Special thanks to my friends (in alphabetical order) Anil, Girish, Gowtham, Kiran, Kushal, Laxit, Nisarg, Praveen, Siddhartha Varma, Srikanth Reddy, Srinath and others, with whom I have spent most of my time at college. I am grateful to them for being there with me in good times and hard times and making my stay at IIIT-Hyderabad a memorable experience. I would also like to thank my lab friends, Nitesh, Akshat, Manisha and Dharmesh, for their help during my research work. Finally, I want to express my gratitude to Mr. Babji and Mr. Mahender for their enthusiasm in helping lab students.

Synopsis

Cloud Computing is the technology that enables users to host their applications or subscribe to computing resources on remote servers in a pay-as-you-go model. The users access these applications and computing resources in the form of web services. The services offered by Cloud providers can be broadly classified as: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS). The pricing models for different types of services vary from provider to provider. With the massive improvement in internet connectivity in recent years, many users and enterprises are favouring these kinds of services, as they save a lot of infrastructure and maintenance cost.

The MapReduce framework has received wide acclaim over the past few years for large scale computing, and has become a standard paradigm for batch oriented workloads. It is usually run on a large number of physical machines, called nodes, which are collectively termed a cluster. The MapReduce framework is very well suited for developing applications that rapidly process huge amounts of data in a distributed fashion over large clusters. It is widely used in industry to analyse log files, build indexes, crawl the web, monitor social media, mine data, and so on. MapReduce is a fairly simple framework which only takes the code to be executed and the data to be processed as input, while the rest of the process is handled transparently for the user. Though there are many distributed paradigms, MapReduce has seen wider adoption in industry and academia because of its simplicity. It works on a simple master-slave architecture, where jobs are submitted to the master, which divides them into smaller units and schedules them across the cluster.

MapReduce is mostly used to run data intensive jobs which are usually very long and sometimes run for days to months. Since the jobs may be very long, intelligent scheduling decisions help in reducing their overall runtime. Given the scale at which MapReduce applications work, reducing the overall runtime has a direct impact on minimizing the power consumption of the cluster. Since the cluster is comprised of heterogeneous commodity hardware, the scheduler faces the additional challenge of scheduling tasks effectively on a limited, diverse resource pool.

The first problem that we address in this thesis is: reducing the overall runtime of MapReduce jobs by intelligent scheduling. We propose an approach which tries to maintain harmony among the jobs running on the cluster, and in turn decrease their runtime. In our model, the scheduler is made aware of the different types of jobs running on the cluster. The scheduler allocates a task to a node only if the incoming task does not affect the tasks already running on that node. From the list of available pending tasks, our algorithm selects the one that is most compatible with the tasks already running on that node. We propose heuristic and machine learning based solutions for this approach and try to maintain a resource balance on the cluster by not overloading any of the nodes, thereby reducing the overall runtime of the jobs. We evaluate our algorithm on a variety of workloads, and our results show that the proposed algorithm achieves substantial savings in runtime when compared to Yahoo!'s Capacity scheduler. Our approach takes into account the interoperability of MapReduce tasks running on a node of the cluster and ensures that a task running on a node does not affect the performance of other tasks. This requires the scheduler to be aware of the resource usage information of each task running on the cluster.

The second problem that we address in this thesis is: energy efficient and SLA aware scheduling of virtual machines in a data center. We propose algorithms which try to minimize the energy consumption in the data center while duly maintaining the level of service that is formally agreed between the service provider and the user, termed the Service-Level Agreement (SLA). The algorithms try to utilize the least number of physical machines in the data center by dynamically rebalancing the virtual machines based on their utilization, and they perform an optimal consolidation of virtual machines on a physical machine while minimizing SLA violations. Our algorithms make sure that the physical machines are utilized to the maximum extent and put lowly utilized physical machines into standby mode by intelligently migrating their load onto other physical machines. In this part of the work, we try to utilize the least number of physical machines in the data center, thereby conserving energy. The virtual machines are consolidated on a physical machine based on their resource usage patterns in order to avoid any race condition for resources. To achieve close to perfect consolidation we follow a similarity model, and the virtual machines are consolidated based on this similarity measure. The data center is scaled up and down dynamically based on the resource requirements. We evaluate our algorithms on various diverse workloads against the Single Threshold algorithm; our algorithms save a considerable amount of energy and also minimize the number of SLA violations.
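To make the map/reduce model sketched above concrete, here is a minimal in-memory word count in Python. This is an illustration of the programming model only, not Hadoop code; the function names are chosen for this sketch.

```python
from itertools import groupby
from operator import itemgetter

# map step: (offset, record) -> list of intermediate (key, value) pairs
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# reduce step: (key, list of values) -> (key, aggregated value)
def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every (key, value) record.
    intermediate = []
    for offset, line in records:
        intermediate.extend(map_fn(offset, line))
    # Shuffle phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: apply reduce_fn to each key's list of values.
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

records = [(0, "cloud map reduce"), (17, "map reduce map")]
print(run_mapreduce(records, map_fn, reduce_fn))
```

In a real cluster the map, shuffle and reduce phases run in parallel across nodes, and the framework handles data placement and failure recovery; the sequential loop above only shows the data flow.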

Contents

1 Introduction
   1.1 MapReduce Framework
       MapReduce Programming Model
       JobTracker Architecture
   1.2 Virtualized Data Center
       Dynamic Resource Scheduling
   1.3 Problem Definition and Scope
       Job Aware Scheduling Algorithm for MapReduce Framework
       Dynamic Energy and SLA aware Scheduling of Virtual Machines in a Cloud Data Center
   1.4 Organization of the thesis

2 Context: Scheduling Algorithms in Cloud Frameworks
   Background: JobTracker and Scheduling
       MapReduce Job Run in detail
   Schedulers Available In Hadoop
       FIFO
       Fair Scheduler
       Capacity Scheduler
   Related Work: Scheduling algorithms
       Delay Scheduling
       Performance-Driven Task Co-Scheduling for MapReduce Environments
       A Self-adaptive MapReduce Scheduling Algorithm In Heterogeneous Environment
       Performance Management of Accelerated MapReduce Workloads in Heterogeneous Clusters
       Using Pattern Classification for Task Assignment in MapReduce
   Need For a Job Aware Scheduler
   Background: Cloud Data Center and Resource Management
       Private cloud data center: Eucalyptus and its Architecture
           Cloud Controller
           Walrus
           Cluster Controller
           Storage Controller
           Node Controller
       Resource Management in Cloud Data Center
       Need for Energy Awareness in Resource Management
   Related work: Resource Scheduling
       Round Robin, Greedy and Power Save
       Dynamic Round Robin
       Single Threshold
       Dynamic Voltage Scaling
       Dynamic Cluster Reconfiguration
   Summary

3 Job Aware Scheduling Algorithm for MapReduce Framework
   3.1 Proposed Algorithm
       Task Characteristics
       Calculation of E_x of a TaskVector
       Task Selection Algorithm
           Construction of Task Vector (T_k)
       Task Assignment Algorithm
           Machine Learning Approach
               Hardware Specifications of TaskTracker (Φ)
               Network Distance (Σ)
               Task Vector of Incoming Task (T_k)
               TaskVectors of Tasks Running on TaskTracker (T_compound(i))
               Incorporating Naive Bayes Classifier
           Heuristic-based Algorithm
               Task Compatibility Test
               Cosine Similarity Model
   3.2 Evaluation and Results
       Testing Environment
       Experiments Description
           Comparison on runtime of the jobs
           Comparison on resource usage
           Effect on low resource-intensive jobs
       Overhead of the scheduler
           No task overhead
           No resource monitoring overhead
           No decision making overhead
   Summary

4 Dynamic Energy and SLA Aware Scheduling and Provisioning of Virtual Machines in Cloud Data Center
   Allocation Algorithm
       Resource Vector
       Construction of Resource Vector
       Calculation of Similarity
       Utilization model
   Scale-up Algorithm
   Scale-down Algorithm
   Experimental Evaluation
       Simulation Model
       Experimental Set-up and Dataset
       Energy Savings
           Effect of Scale up Threshold
           Effect of scaling down
       SLA violations
           Effect of Similarity Threshold
           Effect of Scale up Threshold
           Effect of buffer
       Effectiveness of our algorithm against Single Threshold Algorithm
   Summary

5 Conclusions
   5.1 Job Aware Scheduling Algorithm for MapReduce Framework
       Future Work
   5.2 Energy and SLA Aware Scheduling Algorithm for Cloud Data Center
       Future Work
   5.3 Summary

Bibliography

List of Figures

1.1 Different types of services offered by Cloud providers. Source: [4]
1.2 Electricity consumption statistics of various countries. Source: Green Peace International [30]
1.3 Popular uses of MapReduce framework. Source: [16]
1.4 MapReduce workflow
1.5 Architecture of JobTracker
2.1 The figure shows how Hadoop runs the MapReduce job. Source: [58]
2.2 The figure shows the architecture of the Eucalyptus cloud system. Source: [11]
2.3 Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work. Source: [31]
3.1 Task Selection algorithm
3.2 The scenario of a MapReduce cluster
3.3 Task Assignment algorithm following the machine learning approach
3.4 Task Selection algorithm: The received task is tested for compatibility on an Incremental Naive-Bayes classifier and is accepted if the posterior probability is greater than or equal to C_ml
3.5 Task Assignment algorithm following the heuristic-based approach
3.6 Comparison of runtime (in hours) of the jobs between Capacity and heuristic based algorithms: The saving in the runtime of the jobs increases as the number of jobs increases
3.7 Comparison of runtime (in hours) of the jobs between Capacity and machine learning based algorithms: The saving in the runtime of the jobs increases as the number of jobs increases
3.8 Comparison of CPU requirement on a TaskTracker between Capacity and heuristic based algorithms: The CPU requirement mostly stays below 100% in the case of the heuristic based algorithm, except for a few surges. The time stamp is shown in minutes
3.9 Comparison of CPU requirement on a TaskTracker between Capacity and machine learning based algorithms: The CPU requirement mostly stays below 100% in the case of the machine learning based algorithm, except for a few surges. The time stamp is shown in minutes
3.10 Effect of the Capacity scheduler on low resource-intensive jobs: The three job queues alternately contain the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The low resource-intensive jobs starve until the completion of the high resource-intensive jobs
3.11 Effect of the heuristic based algorithm on low resource-intensive jobs: The job queue contains the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The starvation of low resource-intensive jobs is reduced
3.12 Effect of the machine learning based algorithm on low resource-intensive jobs: The job queue contains the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The starvation of low resource-intensive jobs is reduced
4.1 Allocation Algorithm. The VMs are consolidated on physical machines based on a similarity measure
4.2 Scale-up Algorithm. Upon reaching the scale-up trigger condition, the above algorithm is executed
4.3 Scale-down Algorithm. Upon reaching the scale-down trigger condition, the above algorithm is executed
4.4 The graph demonstrates the effect of Scale up Threshold on energy consumption (in kWh). We see a sudden drop in energy consumption when U_up is around 0.70
4.5 The graph demonstrates the effect of Scale down Threshold on energy consumption (in kWh). The algorithm with the scale down procedure enabled performs better in terms of energy conservation
4.6 The graph demonstrates the effect of Similarity Threshold on the number of SLA violations. Method 2 performs very well, with zero violations for any Similarity Threshold
4.7 The graph demonstrates the effect of Scale up Threshold on the number of SLA violations. No violations occur for lower values of Scale up Threshold
4.8 The graph demonstrates the effect of buffer on SLA violations. The number of SLA violations drops to zero beyond a certain buffer value
4.9 The graph demonstrates the effect of buffer on energy consumption (in kWh). We see a sudden drop in energy consumption when buffer is around 0.20, but it steadily increases beyond that
4.10 The graph demonstrates the effectiveness of our algorithm against the Single Threshold algorithm, in terms of both energy consumption (in kWh) and the number of SLA violations

List of Tables

3.1 Hadoop and Algorithm Parameters
4.1 Simulation and Algorithm Parameters

Chapter 1

Introduction

Cloud Computing has received wide acclaim in the recent past for the ease of use of the services it provides. From a user's perspective, Cloud Computing can be defined as a technology that enables hosting applications and using computing resources on remote servers in a pay-as-you-go model. The framework is simple to work with, as it abstracts all the complex aspects of maintaining, updating, managing and securing the infrastructure away from the user. From a Cloud provider's point of view, the technology is about a huge data center which provides services to its customers in the form of web services. The driving technology behind Cloud Computing is Virtualization [56]. The services provided by Cloud providers can be broadly classified as: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). Basic definitions of these services are given below.

IaaS: Through Infrastructure-as-a-Service, cloud providers offer computing resources such as CPU, memory, disk space etc., as a utility in a virtualized environment. Users can subscribe to a required specification of computing resources from a list of offerings and use them for as long as they wish.

PaaS: Platform-as-a-Service provides users with the required abstraction over the underlying framework. Users are given the freedom to develop their own customized applications and host them as a service.

SaaS: Through Software-as-a-Service, users can access applications which are already hosted, customize them as per their needs, and use them.

The main advantage of cloud computing is that customers only pay for what they use. A customer need not own the infrastructure to perform a huge computation. Instead, the computing resources can be leased over the internet and used only when required, which saves the customer a lot of cost. For example, a small online retailer may not be able to afford a data center with high-end servers to handle the peak traffic on its website, which occurs only for a short duration each year. Even if the retailer provisions servers to handle this traffic, they go unused for most of the year. The best solution for this use-case is to lease servers from an IaaS provider. The increased computation capacity leased from the provider handles the rise in traffic during the peak shopping season, and the customer can relinquish the services during other seasons, eventually paying for the resources only when they are used.

There are many reasons why small and medium sized organizations are migrating from the traditional paradigm to the cloud paradigm. Customers can avoid large expenditure on purchasing and installing computing infrastructure by moving to a cloud platform. The expenditure can easily be budgeted on a month-by-month basis, and money need not be spent on acquiring costly hardware. The IT solutions are maintained, updated and patched very quickly by the provider, so organizations need not worry about this tedious work. Scalability, availability and elasticity are highly valuable advantages offered by Cloud Computing. More systems can be added to the resource pool as and when required and can be brought down when not required. The providers guarantee the availability of the resources around the clock, so the organizations need not worry about the security and safety of the data centers.
The providers give uptime guarantees to the organizations through their Service-Level Agreements (SLAs) [33], which ensure that the applications are always accessible to the customers. Figure 1.1 shows the different types of services offered by Cloud providers. The services offered by Cloud Computing can be accessed using simple client devices with internet connectivity, such as a desktop, a laptop, or a hand-held portable device like a smart mobile phone. Access to the cloud services requires minimal computing resources, and hence the client devices act as little more than dumb terminals, whereas the real processing happens on the cloud framework which is remotely accessible through the internet. The inside of the Cloud Computing framework is very complex, involving many middleware services, monitoring applications, virtualized physical resources, storage devices, accounting services etc. All these components work together to offer better Quality of Service (QoS) to the customers.

Figure 1.1: Different types of services offered by Cloud providers. Source: [4]

With the growing spread of technology, the usage of electricity in data centers has grown significantly. The focus on Green Cloud Computing has been increasing day by day due to the shortage of energy resources. The U.S. Environmental Protection Agency (EPA) data center report [28] mentions that the energy consumed by data centers doubled between 2000 and 2006, and estimates another two-fold increase in the following years if the servers are not used in an improved operational scenario. The Server and Energy Efficiency Report [29] states that more than 15% of servers are run without being used actively on a daily basis. The Green Peace International survey [30] reports that the amount of electricity used by Cloud data centers (Figure 1.2) could be more than the total electricity consumed by a big country like India. This shows that there is a need to utilize resources very effectively and in turn save energy. The advent of Cloud Computing shows encouraging signs of cutting down the carbon footprint by reducing the power consumed per user. But this does not implicitly mean that Cloud Computing saves energy: the administrators of Cloud data centers have to follow stringent and effective consolidation policies through which this target can be achieved.

Figure 1.2: Electricity consumption statistics of various countries. Source: Green Peace International [30].

Relevant basic information related to the problems discussed in this thesis is given below.

1.1 MapReduce Framework

MapReduce [36] is a modern programming paradigm which can process huge amounts of data in parallel in a distributed environment, specially designed for heterogeneous commodity hardware. Hadoop [5], developed by the Apache Software Foundation, is the most widely used MapReduce implementation across industry and academia. Another MapReduce implementation is developed by Cloudera [8]. Apart from these MapReduce implementations, Amazon launched MapReduce as a service called Amazon Elastic MapReduce [2]. Using this web service, researchers, developers, data analysts etc., can cost-effectively process vast amounts of data in a short span of time. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) [1] and Amazon Simple Storage Service (Amazon S3) [3]. Apart from this service from Amazon, many companies have dedicated MapReduce clusters in their data centers. MapReduce has seen tremendous growth in recent years, especially for text indexing, log processing, web crawling, data mining, machine learning etc. [16] (Figure 1.3). MapReduce is best suited for batch-oriented jobs which tend to run for hours to days over a large dataset on the limited resources of the cluster.

Figure 1.3: Popular uses of MapReduce framework. Source: [16]

MapReduce follows a master-slave architecture. There are two important components of this framework: the computation division, and the storage division called the Distributed File System (the Hadoop Distributed File System, HDFS, in the case of Hadoop).

The JobTracker acts as the master daemon for the computation division, whereas the NameNode acts as the master daemon for the storage division. Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. In MapReduce terminology, a job is the bigger unit, comprising sub-units called tasks, and we follow the same convention throughout the thesis. The MapReduce framework consists of a single master, the JobTracker, which handles the framework. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves, called TaskTrackers, execute the tasks as directed by the master.

1.1.1 Programming Model

MapReduce [36] follows a simple programming paradigm which processes any type of data in the form of key and value pairs in two major steps, map and reduce, as represented in Figure 1.4. In the first step, map reads the data in the form of a key and value pair, the key usually being the record offset in the file and the value being the record itself. The user can write custom logic in this method which he wishes to be applied to each record, but the output of this method has to be a key and value pair again. The map method is applied to each and every record of the input file, and <key, value> pairs are given as intermediate output. Next, the reduce method receives the set of values corresponding to a particular key, and the custom logic of the user is applied to it, which again outputs a key and value pair.

Figure 1.4: MapReduce workflow.

1.1.2 JobTracker Architecture

MapReduce works on a simple master-slave architecture, where the JobTracker is the master and the TaskTrackers [19] are the slaves. The JobTracker is the heart of the Hadoop cluster and handles scheduling decisions for the MapReduce jobs. The architecture of the JobTracker is presented in Figure 1.5. The JobTracker in Hadoop is designed in such a way that schedulers can be plugged in and out of it. The TaskTrackers run tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job. The JobTracker and TaskTrackers communicate with each other through heartbeat messages. Through these messages a TaskTracker indicates to the JobTracker that it is alive. As part of the heartbeat message the TaskTracker also indicates whether it is ready to accept a new task, along with other related information. The client application submits jobs to the JobTracker. Hadoop divides the input data into multiple splits and replicates them on the cluster. The JobTracker talks to the NameNode (the master for the Hadoop Distributed File System [15]) to determine the location of the data, and then creates a map task for each data split. These tasks are queued up in the JobTracker as per the scheduling algorithm. Whenever a TaskTracker requests a task, the JobTracker submits a task to it as a return value. The TaskTracker launches each task in a new Java Virtual Machine. A TaskTracker can run multiple tasks at an instant, which can be configured through the configuration parameters. Once all the tasks of a job are completed, the output is stored on HDFS.

Figure 1.5: Architecture of JobTracker.

1.2 Virtualized Data Center

Virtualization is the technology which enables cloud computing by providing an intelligent abstraction that hides the complexities of the underlying software and hardware. Using this technology, multiple operating system instances called Virtual Machines (VMs) [56] can be executed on a single physical machine without interfering with each other. Each virtual machine is installed with its own operating system and acts as an independent machine running its own applications. The abstraction provided by this technology takes care of security and the isolation of computation and data across the virtual machines without the knowledge of the user. This gave rise to cloud computing, which commercializes the benefits of consolidation of virtual machines by exposing them as a utility.

1.2.1 Dynamic Resource Scheduling

A typical Cloud data center consists of hundreds of physical machines, each hosting multiple virtual machines. One of the challenges in managing such a huge data center is to optimally allocate resources to the virtual machines and re-configure them dynamically as required. An intelligent resource scheduler aggregates the computing capacity across the data center into logical resource pools and intelligently allocates the available resources among the virtual machines. The resource scheduler works according to the policies imposed on it by the data center administrator. The policies, or the algorithms, have to be intelligent enough to efficiently utilize the resources of the data center.
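As a toy illustration of pooling capacity and allocating it among virtual machines, the sketch below uses simple first-fit placement. The host and VM names, the normalized capacity units, and the `place_vms` helper are hypothetical; production schedulers apply far richer policies than first-fit.

```python
def place_vms(hosts, vms):
    """First-fit VM placement.
    hosts: {host name: free capacity (normalized)}
    vms:   list of (vm name, capacity demand)
    Returns {vm name: host name}."""
    free = dict(hosts)          # copy, so the caller's pool is untouched
    placement = {}
    for vm, demand in vms:
        # pick the first host that still has enough spare capacity
        host = next((h for h, cap in free.items() if cap >= demand), None)
        if host is None:
            raise RuntimeError(f"no host can accommodate {vm}")
        free[host] -= demand    # reserve the capacity on that host
        placement[vm] = host
    return placement

hosts = {"pm1": 1.0, "pm2": 1.0}
vms = [("vm1", 0.6), ("vm2", 0.5), ("vm3", 0.3)]
print(place_vms(hosts, vms))    # vm3 fits on pm1's leftover capacity
```

A dynamic scheduler would additionally re-run such a placement as utilization changes, migrating VMs between hosts, which is the topic of the next paragraph.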

The dynamic resource scheduler keeps track of resources across the physical machines and uses this information to dynamically allocate resources to virtual machines. If a virtual machine requires additional resources which are not present on its current physical machine, the scheduler migrates it to another physical machine which can satisfy its requirement. This process of migration is technically termed Live Migration of Virtual Machines. Live Migration [35] is the process of moving an application (a virtual machine in this scenario) from one physical machine to a target physical machine without any perceivable downtime of the application. This involves a very fast transfer of memory, data and other resources pertaining to the application from the source to the target physical machine. A lot of research has been carried out in the field of Live Migration, and current products such as vMotion [27] can achieve a seamless migration in just a second or two.

1.3 Problem Definition and Scope

In this thesis, we address the resource scheduling problem in cloud infrastructure. We discussed earlier the difficulty of managing resources effectively in the cloud data center so as to achieve better QoS. In this attempt, we present scheduling algorithms for cloud frameworks that use the resources cogently, improving efficiency and in turn reducing energy consumption. In the first problem, we propose scheduling algorithms for the MapReduce framework. The algorithms try to lower the runtime of the jobs running on the cluster and also increase the overall utilization of the individual nodes of the cluster. In the second problem, we propose scheduling algorithms which address resource management in a virtualized cloud infrastructure. The algorithms try to effectively consolidate the virtual machines on physical machines with conserving energy as the prime motive.
The algorithms also try to maintain the SLA guarantees agreed with the customer.

1.3.1 Job Aware Scheduling Algorithm for MapReduce Framework

Many scheduling algorithms have been proposed in the field of distributed computing, but many of them make scheduling decisions just by duly checking whether the available resources are sufficient to accommodate the incoming request. We argue that to improve scheduling decisions in a distributed environment, the scheduler should be intelligent enough to understand the true resource usage pattern of a job, and only then allocate the job to a machine where it does not affect the jobs that co-exist with it. In this first problem, we address the necessity of taking many factors related to resource usage into consideration to make better scheduling decisions. Our algorithms schedule jobs on the MapReduce framework duly taking into consideration the resource usage patterns of the jobs. A job is scheduled on a machine in such a way that it does not affect the jobs already running on that particular machine. The algorithms make sure that no race condition is created among the jobs running on the machine, through which the overall runtime can be brought down. We propose heuristic and machine learning based algorithms for our approach. The heuristic based algorithm tries to find the similarity between the resource usage patterns of the jobs by encoding each pattern as a vector; the scheduling decision is based on this similarity measure. In the machine learning based algorithm, a Naive Bayes [20] classifier is used to make scheduling decisions; it is trained with the resource usage patterns of the incoming jobs and the jobs running on the cluster, along with other framework related parameters. We have implemented the algorithms as a plugin to the actual MapReduce framework and tested them on diverse real life jobs widely run in the industry. The results show a saving in runtime of around 21% in the case of the heuristic based approach and around 27% in the case of the machine learning based approach when compared to Yahoo!'s Capacity scheduler [7].

1.3.2 Dynamic Energy and SLA aware Scheduling of Virtual Machines in a Cloud Data Center

We have already discussed the necessity of conserving energy in a data center.
This becomes very important in the case of cloud data centers because of the scale of the number of machines used. We propose scheduling algorithms which not only conserve energy and cut the cost of the data center, but also maintain the necessary SLA guarantees.
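One building block both problems rely on is comparing resource usage patterns as vectors. The sketch below computes a cosine similarity over such vectors and applies a threshold test; the vector layout, the threshold value, and the rule that sufficiently dissimilar patterns may be co-located are illustrative assumptions for this sketch, not the thesis implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two resource usage vectors,
    e.g. (cpu share, memory share, network share)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def can_colocate(vm_vec, host_vec, sim_threshold=0.5):
    # Assumption for illustration: co-locate only when the usage patterns
    # are dissimilar enough that the workloads are unlikely to race for
    # the same resource.
    return cosine_similarity(vm_vec, host_vec) <= sim_threshold

cpu_heavy = (0.9, 0.1, 0.1)   # mostly CPU-bound usage
io_heavy = (0.1, 0.2, 0.9)    # mostly network/IO-bound usage
print(can_colocate(cpu_heavy, io_heavy))
```

Here a CPU-bound and an IO-bound workload have a low cosine similarity, so the threshold test admits co-locating them on one machine.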

In this second problem, we discuss scheduling algorithms with a prime focus on energy conservation. Our algorithms try to use a minimum number of machines by effectively consolidating virtual machines together on physical machines. The resource usage pattern of each virtual machine is captured and a resource vector is constructed from it, which defines its true resource usage nature. Using this information, a similarity measure between the resource vectors of virtual machines is computed. Based on this similarity measure, the algorithms take consolidation decisions for virtual machines and try to avoid any race condition for resources between them. The algorithms also keep monitoring the resource usage patterns at regular intervals. Whenever the scheduler finds a need to scale up the data center, it dynamically adds an unused physical machine to the resource pool to serve further requests. The algorithms also try to scale down the data center by putting machines into standby mode when their utilizations are very low, migrating the virtual machines on them to other suitable physical machines. Through these dynamic scale-up and scale-down processes, the algorithms try to use only the required amount of resources, avoiding any wastage of energy. We have simulated the complete data center architecture with the necessary power and SLA models. Through extensive simulation, our algorithms showed 21% energy savings while ensuring 60% better SLA maintenance compared to the Single Threshold algorithm [32].

1.4 Organization of the thesis

The rest of the thesis is organized as follows. Chapter 2 begins by giving a background of MapReduce and the JobTracker. We present the schedulers available in Hadoop and then move on to discuss other algorithms proposed in the literature. We also discuss the need for a job aware scheduling algorithm for the MapReduce framework. Next, we give a background of the architecture of cloud data centers.
We then discuss the need for an energy aware scheduler in cloud data centers. Finally, we move on to discuss the related work regarding scheduling of virtual machines. Chapter 3 presents our job aware scheduling algorithm for the MapReduce framework. We start by discussing the vector model defining the task characteristics. We then discuss our Task Selection and Task Assignment algorithms, which try to assign tasks to a node while avoiding a race condition for resources. Next, the heuristic and machine learning based approaches of our algorithm are presented. Finally, we conclude with a discussion of our experiments and the evaluation of our algorithms. Chapter 4 presents and discusses the need for energy awareness in maintaining cloud data centers. We start the chapter with a discussion of the Resource Vector and Utilization Model. We then discuss our Allocation Algorithm, which tries to consolidate virtual machines based on a similarity measure. Dynamic scaling of the data center is presented through our scale-up and scale-down algorithms, which try to conserve energy. Finally, we conclude by presenting our simulation model and the efficacy of our algorithms in terms of energy conservation and maintaining SLA guarantees. Chapter 5 is the last chapter of this thesis; it summarizes our work and establishes the key lessons learned. We touch upon a number of topics for further research in this field, and conclude the thesis.

Chapter 2

Context: Scheduling Algorithms in Cloud Frameworks

In this chapter, we give a brief background of scheduling in the MapReduce framework and in cloud data centers. We first present the JobTracker architecture and the scheduling mechanism of the MapReduce framework, and then discuss the necessity of an efficient scheduler for the framework. We then discuss the importance of our first problem, mentioned in Section 1.3.1, by covering related work in this field. We also give a brief introduction to scheduling in cloud data centers. Finally, we present the need for an energy efficient scheduler for a data center with many machines and the importance of our second problem, mentioned in Section 1.3.2, moving on to discuss the related work in this field.

2.1 Background: JobTracker and Scheduling

In this section, we discuss the architecture of the computation division of the MapReduce framework.

2.1.1 Life cycle of a MapReduce Job

MapReduce has been identified as a standard paradigm for large scale computing over the past few years, primarily because of its scalability and fault tolerant nature. Hadoop [5] has become a widely used open source MapReduce implementation in industry and academia. The ease of this framework comes from just submitting the code that needs to be executed, describing the computation of
the job, and the data associated with it. The code for execution is submitted in the form of a JAR file [18], since MapReduce is written in Java. The rest is taken care of by the framework, which is totally abstracted from the user, so there is no need to worry about any internal processing of the framework. We briefly discuss the working of the framework in the following sections. Since MapReduce follows a master-slave architecture (Figure 2.1), the execution of a job is controlled by the JobTracker, whereas the TaskTrackers execute the job. The JobClient acts as the starting point for the execution of the job; it coordinates the initialization of the job setup with the JobTracker. The following explains the step by step execution procedure for a job run inside the MapReduce framework [58]. When the job is submitted, the JobClient requests the JobTracker for an ID to represent the job. The JobTracker responds with a Job ID, which is unique across the cluster. After receiving the Job ID, the JobClient performs initialization checks, to see if everything is in order before actually submitting the job to the JobTracker. Firstly, it checks whether the output folder specified by the user already exists. If it does, it throws an error indicating the existing output folder to the user. This check makes sure that the output folders of jobs do not interfere with each other, avoiding any possible mixing up of outputs. Then, the JobClient proceeds with the initial setup. Given the size of the input data, the JobClient splits the data into a number of input splits based on the split size specified in the configuration files. The size of an individual split can be controlled by the user through the configuration parameter mapred.min.split.size. The most common split size used in MapReduce clusters is either 64 MB or 128 MB.
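The relationship between input size, split size, and the number of map tasks can be sketched as a ceiling division. This is a simplification: Hadoop's actual split computation also involves the HDFS block size and the minimum/maximum split settings, so treat the function below as illustrative only.

```python
def num_input_splits(input_size_bytes, split_size_bytes=128 * 1024 * 1024):
    """Approximate number of input splits (hence map tasks) for an input."""
    if input_size_bytes == 0:
        return 0
    # Ceiling division: a final partial split still becomes one map task.
    return -(-input_size_bytes // split_size_bytes)

one_gib = 1024 * 1024 * 1024
print(num_input_splits(one_gib))                    # 8 splits of 128 MB
print(num_input_splits(one_gib, 64 * 1024 * 1024))  # 16 splits of 64 MB
print(num_input_splits(one_gib + 1))                # 9: the trailing byte gets its own split
```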
Next, these splits are copied to the Hadoop Distributed File System (HDFS) [15] under a directory named after the Job ID specified by the JobTracker. This makes sure that the input data is properly organized on HDFS. The input splits are not only copied but also replicated by a factor to maintain fault tolerance. After this replication, the job is said to be ready for execution, and the JobClient hands over control to the JobTracker for its further execution. As the JobTracker receives the job, it puts the job into its internal queue. The job waits in the queue until it is picked up by the scheduling algorithm plugged into the JobTracker. Once the job gets its turn, the JobTracker initializes it by creating its corresponding tasks. The number of map tasks created by the JobTracker depends on the size of the input data, whereas the number of reduce tasks can be explicitly specified by the user.

Figure 2.1: The figure shows how Hadoop runs a MapReduce job. Source: [58].

After the tasks are created, they are ready to be assigned to TaskTrackers. The scheduling algorithm holds the responsibility of assigning a particular task to a particular TaskTracker. Since the JobTracker needs to schedule the tasks effectively, it needs to have sufficient information regarding each of the TaskTrackers. To support this, each TaskTracker sends the required information to the JobTracker through communication messages called heartbeats. Through each heartbeat, a TaskTracker signals that it is up and also reports the availability of slots for task execution. Since each TaskTracker has a cap on the number of slots for task execution, the scheduler is restricted from assigning more tasks than this maximum number. While assigning the tasks, most schedulers make sure that the data is co-located for the execution, i.e. the data is present on the TaskTracker on which the task is executed. This concept of data locality is explained further in Section 2.3. After the assignment of a task, the TaskTracker identifies the data and JAR related to the task by getting the resources from the directory named after its Job ID. It copies the JAR and the input data from the distributed file system into its local file system. For each task, the TaskTracker initializes a separate TaskRunner, which spawns its own Java Virtual Machine (JVM). The TaskTracker runs each task in a separate JVM, to maintain isolation and avoid any failure caused by other tasks running on it. While the tasks run, the TaskTracker keeps track of each task's status. The status of each task is also reported to the JobTracker as part of the heartbeat. The JobTracker consolidates the statuses of all the tasks of a particular job and marks the status of the complete job. Whenever the JobClient polls for the job status, the JobTracker returns the status, which is displayed on the console for the user. Since MapReduce jobs are often long running, the job status gives the user an idea of the progress of the job. Finally, when the TaskTracker finds that its tasks have finished their execution, it marks them as complete and informs the JobTracker in the next heartbeat. The JobTracker then marks the job as complete and removes it from its execution queue. Both the JobTracker and the TaskTracker then clear up the working state for the job along with the intermediate data. The JobClient then displays the completion of the job on the console to the user.

2.2 Schedulers Available in Hadoop

Hadoop has the feature to plug a scheduler into the framework and use it as per the requirement. It comes with a default scheduler, FIFO (First In First Out).

2.2.1 FIFO

First In First Out is the default scheduler in Hadoop. In this scheduling algorithm, the JobTracker picks the job that is oldest in the queue and schedules it.
This scheduler has no concept of priority and is very simple and efficient in terms of execution.
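The FIFO policy reduces to picking the oldest queued job; a minimal sketch (class and method names are illustrative, not Hadoop's API):

```python
from collections import deque

class FifoScheduler:
    """Minimal FIFO job scheduler: always picks the oldest queued job."""

    def __init__(self):
        self.queue = deque()

    def submit(self, job_id):
        self.queue.append(job_id)

    def next_job(self):
        # Oldest submission first; no notion of priority or fairness.
        return self.queue.popleft() if self.queue else None

s = FifoScheduler()
for j in ("job-1", "job-2", "job-3"):
    s.submit(j)
print(s.next_job())  # job-1
```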

2.2.2 Fair Scheduler

Fair Scheduler [12] is a scheduling algorithm proposed and used by Facebook in its data centers. The algorithm follows a simple rule of allocating the resources of the cluster uniformly across all the jobs in the job queue. If there is a single job running on the cluster, all the resources are consumed by it. As the number of jobs increases, the resources of the cluster are shared uniformly across them. This algorithm makes sure that no job starves and that there is equality among the jobs running concurrently on the cluster. For the case of multiple departments of a company accessing the same cluster, the algorithm also provides a notion of pools. If the user base is divided into multiple pools, the resources are fairly shared across the pools, as discussed above. Each job is submitted to a pool, and the job gets its resource share according to the number of pools active on the cluster. In this way, resources are shared across the pools of jobs on the cluster. A pool can also be configured with a minimum amount of resources that it should hold at all times. In other words, this restricts adding pools indefinitely, which would otherwise keep reducing each pool's share of resources. A pool configured with a minimum amount of resources gets its minimum share at all times, and the excess resources are shared uniformly across the pools. In addition, the pools can share the resources according to their assigned weights (by default, these weights are equal). Within a pool, fair sharing of resources can in turn be followed across its jobs. If, under any condition, a pool does not receive its minimum share, resources can (optionally) be preempted from other pools to restore its minimum share.

2.2.3 Capacity Scheduler

Capacity Scheduler [7] was proposed by Yahoo! and is used in its data centers. In this scheduler, the resources are shared across queues.
Since multiple departments of an organization may access the MapReduce cluster, each department can be assigned a queue. Each queue can be assigned a share of resources according to the needs of the department. The advantage of this approach is that a bad user of a department who wants to get hold of
the whole of the cluster's resources is restricted to using only the resources allotted to his department. According to the dynamic needs of a particular department, the capacity of the queue associated with it can be re-configured. Each user submits jobs to a queue, and the jobs of the queue execute based on the resources allocated to it. If there are excess resources not used by any queue, they are shared across the queues. If a new queue is added and does not get its share, it can reclaim resources so as to achieve its minimum share. Within a queue, the jobs can optionally follow a priority based policy for using the resources.

2.3 Related Work: Scheduling Algorithms

Before we move on to discuss related work in this field, we define data locality and speculative execution in terms of the MapReduce framework.

Data Locality: When the JobTracker tries to run a task on a node of the cluster, the data associated with that particular task may or may not exist on that node. If the data for the task exists on that particular node, the task is said to be data-local. Data locality saves the overhead of copying the data from another node to the node where the computation takes place; hence it is advantageous to maintain data locality. Schedulers usually try to maintain data locality at the time of allocating tasks, but in certain cases it may not be achievable, and the task may then be scheduled in a non-data-local manner.

Speculative Execution: After the allocation of a task, the JobTracker keeps track of the status of the task periodically. If the JobTracker finds that the task is running very slowly (relative to other tasks of the same job), or showing very little progress over time, it schedules a copy of the same task on another node without suspending the initial task. Both tasks run simultaneously and separately. The JobTracker then approves whichever of the two tasks completes first and rejects the other.
This process is very advantageous, as a few tasks may sometimes lag in computation due to various conditions in the system. In such situations, the duplicate task may complete faster, and waiting on a particularly slow task is avoided.
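The core of the speculation decision is comparing each task's progress with the job average. The sketch below uses a fixed slack threshold as an illustrative assumption; Hadoop's real heuristic differs in its details.

```python
def pick_speculative_tasks(progress, slack=0.2):
    """Return task ids whose progress lags the job average by more than `slack`.

    `progress` maps task id -> fraction complete in [0, 1]. The 0.2 slack is
    an illustrative threshold, not Hadoop's actual rule.
    """
    if not progress:
        return []
    avg = sum(progress.values()) / len(progress)
    return [t for t, p in progress.items() if avg - p > slack]

running = {"t1": 0.9, "t2": 0.85, "t3": 0.3, "t4": 0.88}
print(pick_speculative_tasks(running))  # ['t3']
```

A duplicate of each returned task would then be launched on another node, and whichever copy finishes first wins.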

2.3.1 Delay Scheduling

In [59], the authors discuss the effects of data locality on job performance. They argue that schedulers sometimes degrade performance, reducing throughput and response times, due to poor data locality management. They propose two algorithms, delay scheduling and copy-compute splitting, which can increase performance by a factor of 2-10 in multi-user environments. The first problem they discuss is related to data locality: at times, schedulers execute tasks in a non-data-local manner so as to follow a strict queue order, such as FIFO. The authors propose an algorithm which does not impose a strict queueing rule on tasks in the scheduling process. If the scheduler does not find a data-local task, that task's execution is delayed and the next task in the queue is scheduled. After some time, the delayed task may become data-local and then be scheduled; but if the scheduler cannot schedule a task in a data-local manner within a certain time, the task is allowed to run non-data-locally. The authors further argue that the reduce phase has to wait for all map tasks to complete, which can considerably degrade the response time of the application. To overcome this, they propose a solution in which the reduce task is split into two logically distinct types of tasks, copy tasks and compute tasks, with separate forms of admission control. Copy tasks fetch and merge map outputs, an operation which is usually network-I/O bound. Compute tasks apply the user defined reduce function to the map outputs. Copy-compute splitting thus separates the copy and compute work into two processes, and scheduling these tasks separately increases overall performance.

2.3.2 Performance-Driven Task Co-Scheduling for MapReduce Environments

In [55], Jorda Polo et al. discuss a performance driven task scheduler for the MapReduce framework.
The proposed task scheduler adjusts the resource allocation of MapReduce jobs by dynamically predicting their performance. The scheduler also makes sure that resources are not over-provisioned to the jobs, while ensuring that applications meet their performance objectives.
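The property that makes such prediction tractable is that a MapReduce job is a collection of identical tasks, so completed tasks predict the remaining ones. A minimal sketch under that assumption (tasks assumed to run in waves of `slots` at a time; not the exact estimator of [55]):

```python
def estimate_remaining_time(completed_durations, tasks_left, slots):
    """Estimate seconds until job completion.

    Assumes the remaining tasks behave like the completed ones and run
    `slots` at a time, in waves.
    """
    avg = sum(completed_durations) / len(completed_durations)
    waves = -(-tasks_left // slots)  # ceiling: tasks execute in waves of `slots`
    return waves * avg

# Four completed map tasks took ~30 s each; 10 remain with 5 slots -> 2 waves.
print(estimate_remaining_time([28, 31, 30, 31], 10, 5))  # 60.0
```

A scheduler can compare this estimate against the job's completion time goal and release slots from jobs that are comfortably ahead of schedule.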

The proposed technique dynamically estimates the completion time of a job during its execution; this becomes easy since MapReduce jobs are collections of identical smaller tasks. The scheduler uses this completion time information and adjusts the resource allocation of other jobs accordingly. The scheduler tries to allocate slots to a job based on its estimated completion time, so a job may be allotted fewer slots if it is estimated to finish well ahead of its completion time goal. In this way, the scheduler makes sure that over-provisioning of resources is avoided while performance is duly taken into consideration.

2.3.3 A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment

In [34], Quan Chen et al. propose scheduling algorithms which try to improve the performance of the system. They argue that the progress calculation of tasks in Hadoop is static, and that using dynamic methods to calculate progress can help in taking better scheduling decisions and in turn improve performance. The scheduler allocates a series of tasks to the nodes of the cluster. As the map and reduce tasks run on the cluster, historical information regarding task execution is updated on each node. The algorithm adjusts the time-weight of each stage of a task accordingly, using this historical information. Due to dynamic time-weight adjustment and constant logging, the algorithm knows the true progress of each individual task. The algorithm classifies a task as a slow task if its progress is very slow, and similarly classifies a node as a slow node if the tasks on it often run slowly. Classifying a node as a slow node helps in not launching any speculative task on that node.

2.3.4 Performance Management of Accelerated MapReduce Workloads in Heterogeneous Clusters

In [54], the authors present an advanced task scheduler which can exploit the advantages of hybrid systems.
In this work, the adaptive scheduler discussed in [55] is improved to make it aware of hardware capabilities.

The Adaptive Task Scheduler discussed in [55] and the Cell/BE processor are leveraged to show how next generation software and hardware can be combined to meet the high level performance goals of customers while transparently managing resources. To take advantage of hardware accelerators on particular nodes, the scheduler tracks which hardware can support acceleration. The scheduler keeps track of the progress of each task, and if a job is expected to finish soon, the tasks of that job are allocated to the nodes with hardware acceleration. In this way, the tasks of the job can run faster and finish well within the expected completion time.

2.3.5 Using Pattern Classification for Task Assignment in MapReduce

In [37], the authors present a machine learning based task scheduler that uses pattern classification for utilization oriented task assignment in MapReduce. The scheduler initially puts the jobs in a job queue. One map task and one reduce task of each job are put into a separate list. A classifier labels the jobs in this list as good or bad: good jobs do not overload the TaskTracker during their execution, while jobs labelled bad overload the TaskTracker and hence are not considered for scheduling at that point of time. The job features, which are given by the user before job submission, and the node specifications are used as features to train the classifier. After classification, utility functions are used for prioritizing jobs, based on which tasks are scheduled.

2.4 Need For a Job Aware Scheduler

Many schedulers have been proposed and used in the MapReduce framework. A few of them take scheduling decisions based only on arrival time or priorities. Other schedulers take decisions based on the resources requested by a job, checked against the resource availability in the cluster. All these schedulers are simple and easy to implement, but have a minimal amount of information with which to take an intelligent decision.
We argue that the scheduler should have enough information to effectively come up with a decision. In Chapter 3, we present our algorithms, which dynamically take various details of the framework into consideration and use machine learning concepts to take scheduling decisions.
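To make the machine learning route concrete, the sketch below hand-rolls a tiny Gaussian Naive Bayes and labels a (job, node) pairing as good or bad. The features, labels, and training data are hypothetical illustrations, not the feature set or classifier configuration developed later in this thesis.

```python
import math

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes, enough to label a (job, node) pairing."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lbl in zip(X, y) if lbl == label]
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            variances = [
                max(1e-6, sum((v - m) ** 2 for v in c) / len(c))
                for c, m in zip(cols, means)
            ]
            self.stats[label] = (len(rows) / len(X), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, means, variances = self.stats[label]
            lp = math.log(prior)
            for xi, m, v in zip(x, means, variances):
                lp += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            return lp
        return max(self.stats, key=log_posterior)

# Hypothetical training data. Features: (incoming job's CPU demand, node's
# current CPU utilization); "good" = the placement did not overload the node.
X = [(0.2, 0.3), (0.3, 0.2), (0.8, 0.7), (0.9, 0.8), (0.1, 0.9), (0.7, 0.1)]
y = ["good", "good", "bad", "bad", "good", "good"]
clf = TinyGaussianNB().fit(X, y)
print(clf.predict((0.85, 0.75)))  # bad: high demand on an already loaded node
```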

Apart from basic information such as the number of nodes, the number of slots, and the size of the job, other details, such as the current utilization of each node of the cluster and the resource usage pattern of the incoming job, would help in taking better decisions. Understanding a job even before it is scheduled helps in understanding its future resource usage pattern. With this information, the job can be executed according to the current utilization of the nodes of the cluster dynamically. We demonstrate through our algorithms that this not only reduces the runtime of the jobs but also increases the overall utilization of each node of the cluster, in turn reducing the energy consumption.

2.5 Background: Cloud Data Center and Resource Management

We have discussed cloud data centers and how they are powered by virtualization technology. A typical cloud data center consists of a series of servers with a virtualization layer and virtual machine instances on top of it. There are many IaaS providers, such as Amazon EC2 [1], Rackspace [24], GoGrid [13] etc. If an enterprise does not want to subscribe to services offered by public cloud providers, a private cloud can be installed in its own data center. A private cloud data center can be set up using one of many open source cloud operating system distributions, such as Eucalyptus [11], OpenNebula [22], OpenStack [23] etc. In the following section, we present the components of Eucalyptus, one of the widely used private cloud operating systems.

2.5.1 Private cloud data center: Eucalyptus and its Architecture

Eucalyptus is an open source cloud operating system which can be installed on an enterprise's IT infrastructure to set up a private or hybrid cloud. The setup provides Infrastructure as a Service (IaaS) to users in or outside the enterprise. Eucalyptus provides a fairly simple web interface which is easy to understand and use. The following are the important components of Eucalyptus: Cloud Controller
(CLC), Walrus, Cluster Controller (CC), Storage Controller (SC), and Node Controller (NC), as shown in Figure 2.2.

Figure 2.2: The figure shows the architecture of the Eucalyptus cloud system. Source: [11].

Cloud Controller

The Cloud Controller (CLC) acts as a master which coordinates all the operations of the system. Administrators and users contact the system through this component of Eucalyptus. The Cloud Controller orchestrates the necessary management activities of the system, such as scheduling decisions and monitoring, by delegating activities to other components such as the Cluster Controller, Storage Controller etc.

Walrus

Walrus is the component which deals with the storage of data. The data stored on Walrus is persistent and takes the form of objects and buckets.

Users of the cloud are assigned buckets where they can store data. Walrus supports all the basic file operations, such as reading, writing, and deleting files. External public storage services such as Amazon S3 [3] can be synced with Walrus, extending its hybrid cloud features.

Cluster Controller

The Cluster Controller (CC) is the component which controls an individual cluster of the system. Each cluster is assigned its own Cluster Controller, which acts as a local master under the supervision of the Cloud Controller. The Cluster Controller constantly monitors the status of each Node Controller in its cluster and is responsible for enforcing policies, such as the scheduling of VMs, in its cluster.

Storage Controller

Amazon Elastic Block Store (EBS) [10] provides block storage for each user to store data related to each virtual machine instance launched by that user. Much like a hard disk in a personal computer, the block store holds the installation details pertaining to an operating system, stored in the form of an image. When the user re-launches a virtual machine instance, EBS loads the image onto the virtual machine. The Storage Controller acts much like Amazon's EBS. It also provides support for storage technologies like NFS [21], iSCSI [43] etc. The blocks handled by the Storage Controller are centrally stored in Walrus. Enterprise grade SAN [26] devices can be used to host the Storage Controller within the Eucalyptus cloud.

Node Controller

The Node Controller is a service that runs on each physical machine and is responsible for handling the execution of virtual machines on it. The Node Controller closely monitors the working of each virtual machine instance on the physical machine and can terminate or restart it as directed by the Cluster and Cloud Controllers.

2.5.2 Resource Management in a Cloud Data Center

Resource management in a traditional data center is inefficient because of the delay in reacting to the dynamic resource requirements of the applications running on the servers. If something goes wrong on any server, there has to be human intervention to correct it or take preventive measures for the future. Provisioning additional resources or recovering from a failure with human intervention hugely reduces the performance of the data center. If these processes are automated and taken care of by the framework dynamically, the performance of the data center can be improved significantly. With the advent of cloud technologies, all these processes are transparently taken care of by the framework. A cloud data center is usually distributed, while its control architecture is centralized. The server which acts as the master, viz. the Cloud Controller in Eucalyptus, is responsible for managing the resources of the entire data center. The cloud based infrastructure helps in providing elasticity on the fly, handling workload spikes without the overheads that accompany a traditional data center environment. This is achieved by instantiating new virtual machines dynamically and turning them off when not required. Scale-up and scale-down of the data center happens dynamically by migrating the virtual machines across the servers to handle the dynamic workload.

2.6 Need for Energy Awareness in Resource Management

In the recent past, a lot of focus has been placed on green computing [14] and energy conservation. Data centers contribute significantly to the total carbon footprint generated across the globe [33]. The data centers of companies these days have thousands of servers in their server farms, and the amount of energy consumed by them is significantly high. Many of the servers are usually under-utilized.
Google's server utilization and energy consumption study [31] reports that energy efficiency peaks at full utilization and drops significantly as the utilization level decreases (Figure 2.3). Moreover, the power consumption at zero utilization is still considerably high (around 50% of peak); essentially, even an idle server consumes about half its maximum power.
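This behavior is often captured with the linear server power model commonly used in energy-aware scheduling work. The sketch below uses illustrative constants chosen to match the roughly 50%-of-peak idle power reported in [31]; it is not the exact model used in that study.

```python
def power(util, p_idle=0.5, p_peak=1.0):
    """Linear server power model, as a fraction of peak power."""
    return p_idle + (p_peak - p_idle) * util

def energy_efficiency(util):
    """Useful work delivered per unit of power; peaks at full utilization."""
    return util / power(util)

for u in (0.0, 0.3, 0.7, 1.0):
    print(f"util={u:.1f}  power={power(u):.2f}  efficiency={energy_efficiency(u):.2f}")
```

The output makes the consolidation argument quantitative: a server at 30% utilization still draws 65% of peak power, so packing the same load onto fewer, busier servers raises efficiency.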

Figure 2.3: Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work. Source: [31].

There is a significant need to improve resource management in data centers so that the load can be consolidated effectively and only a limited number of machines need be used. The scheduling algorithms incorporated in the resource manager should be intelligent enough to conserve energy by cogently utilizing the resources.

2.7 Related Work: Resource Scheduling

Scheduling has always been a challenging research problem in the field of computer science. Many scheduling algorithms have been proposed, each having its own pros and cons.

2.7.1 Round Robin, Greedy and Power Save

Round Robin [25], Greedy and Power Save are the virtual machine scheduling algorithms provided with the Eucalyptus open source cloud operating system distribution. The Round Robin algorithm follows the basic mechanism of allocating the incoming virtual machine requests to physical machines in a circular fashion. It is a simple and starvation-free scheduling algorithm which is used in most private cloud infrastructures. The Greedy algorithm allocates a virtual machine to the first physical machine which has enough resources to satisfy the resources requested by it. In the Power Save algorithm, physical machines are put to sleep when they are not running any virtual machines and are re-awakened when new resources are requested. The algorithm first tries to allocate virtual machines to the physical machines that are running, followed by the machines that are asleep. These algorithms have limited or no support for making scheduling decisions based on resource usage statistics. Moreover, they do not take into account SLA violations, energy consumption etc., which are very important factors in real cloud environments.

2.7.2 Dynamic Round Robin

Ching-Chi Lin et al. in [49] presented an improved version of the Round Robin algorithm used in Eucalyptus. In the Dynamic Round Robin algorithm, if a virtual machine has finished its execution and there are still other virtual machines running on the same physical machine, that physical machine will not accept any new virtual machine requests. Such physical machines are referred to as being in the retirement state, meaning that after the execution of the remaining virtual machines, the physical machine can be shut down. If a physical machine stays in the retirement state for a sufficiently long period of time, the currently running virtual machines are forced to migrate to other physical machines, and the machine is shut down after the migration operation finishes. This waiting time is denoted the retirement threshold: a physical machine which remains in the retirement state beyond this threshold is forced to migrate its virtual machines and shut down.
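The retirement rule can be sketched as a simple per-machine state decision. The state names and the 300-second threshold below are illustrative assumptions, not values from [49].

```python
def pm_action(running_vms, vm_just_finished, retired_for, retirement_threshold=300):
    """Decide what a physical machine (PM) should do under Dynamic Round Robin.

    `retired_for` is the time (s) already spent in the retirement state; the
    300 s threshold is illustrative.
    """
    if running_vms == 0:
        return "shutdown"
    if retired_for >= retirement_threshold:
        return "migrate-and-shutdown"  # force-migrate leftover VMs, then power off
    if vm_just_finished or retired_for > 0:
        return "retire"  # stop accepting new VM requests
    return "accept-vms"

print(pm_action(3, vm_just_finished=True, retired_for=0))     # retire
print(pm_action(2, vm_just_finished=False, retired_for=400))  # migrate-and-shutdown
print(pm_action(0, vm_just_finished=False, retired_for=0))    # shutdown
```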
Even this algorithm has limited support for making scheduling decisions based on resource usage statistics, and it does not take into account SLA violations, energy consumption etc.

2.7.3 Single Threshold

In [32], the authors propose the Single Threshold algorithm, which sorts all the VMs in decreasing order of their current utilization and allocates each VM to a
physical machine that provides the least increase in power consumption due to this allocation. The algorithm optimizes the current allocation of VMs by choosing the VMs to migrate based on a CPU utilization threshold for each physical machine, called the single threshold. The idea is to place VMs while keeping the total CPU utilization of the physical machine below this threshold. The reason for limiting CPU usage below the threshold is to avoid SLA violations when there is a sudden increase in the CPU utilization of a VM, which can then be compensated from the reserve. The Single Threshold algorithm works better in terms of energy conservation than the Dynamic Round Robin algorithm discussed above. It is a fairly improved algorithm which takes into consideration the power consumption and CPU usage of physical machines.

2.7.4 Dynamic Voltage Scaling

Dynamic Voltage Scaling (DVS) is a power management technique in which under-volting (decreasing the voltage) and over-volting (increasing the voltage) are used to conserve power and increase computing performance, respectively. In [47, 48], the authors employ this DVS technique to design power-aware scheduling algorithms that minimize power consumption. A variation of this technique, called Dynamic Voltage Frequency Scaling (DVFS), is applied by Hsu et al. [44] to reduce overall power consumption by operating the servers at various CPU voltage and frequency levels. In [57], the authors focus on thermal-aware scheduling, where jobs are scheduled in a manner that minimizes the overall temperature of the data center. In that work, the authors focus mainly on reducing the energy needed to operate the data center cooling systems rather than conserving the energy used by the servers. In [45], Liting Hu et al. propose an approach called Magnet, which transfers load among the nodes of a multi-node ring based overlay through live migration of virtual machines.
The algorithms try to find a minimal set of servers that the workload can fit into.
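The Single Threshold placement loop described at the start of this section can be sketched as follows; the linear power model, the 80% threshold and all machine parameters are assumptions made for illustration, not details taken from [32]:

```python
def power_increase(pm, vm_cpu, p_idle=160.0, p_max=250.0):
    # Assumed linear power model: consumption grows linearly with CPU utilization.
    before = p_idle + (p_max - p_idle) * pm["util"]
    after = p_idle + (p_max - p_idle) * (pm["util"] + vm_cpu)
    return after - before

def single_threshold_place(vms, pms, threshold=0.8):
    """Sort VMs by decreasing CPU demand; place each one on the running
    physical machine whose power rises the least, while keeping that
    machine's total CPU utilization below the Single Threshold."""
    placement = {}
    for vm_id, vm_cpu in sorted(vms.items(), key=lambda kv: -kv[1]):
        fits = [pm for pm in pms if pm["util"] + vm_cpu <= threshold]
        if not fits:
            continue  # a real system would wake a standby machine here
        best = min(fits, key=lambda pm: power_increase(pm, vm_cpu))
        best["util"] += vm_cpu
        placement[vm_id] = best["name"]
    return placement

vms = {"vm1": 0.5, "vm2": 0.3, "vm3": 0.2}  # CPU demand per VM
pms = [{"name": "pm1", "util": 0.0}, {"name": "pm2", "util": 0.0}]
placement = single_threshold_place(vms, pms)
# vm1 and vm2 consolidate on pm1 (utilization 0.8); vm3 spills to pm2.
```

Keeping each machine at or below the threshold leaves the reserve the authors use to absorb sudden utilization spikes without SLA violations.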

2.7.5 Dynamic Cluster Reconfiguration

In [53, 41, 50], the authors propose systems that dynamically scale the cluster up and down to handle the load on the system effectively and to save power under low load. Based on the load imposed on the system, the proposed algorithms can take intelligent cluster re-configuration decisions on the fly. Our work is mainly inspired by these algorithms, which scale the cluster up and down as per requirement and save power. We tried to employ the same kind of principle in a virtualized data center environment. In [52], Pérez et al. come up with a mathematical formalism to achieve dynamic reconfiguration. A machine learning based approach to the same problem was proposed by Duy et al. [40], through the use of neural network predictors. Using the historical demand, they predict the future load, and based on this, the unused servers are turned off and restarted as per requirement to minimize the number of running servers. A considerable amount of research has been done in the fields of load balancing and cluster reconfiguration, with a prime focus on harvesting the cycles of idle machines [42, 38, 46]. Only a few of them really consider the energy aspect while load balancing and reconfiguring the cluster. Our work is based on load balancing and VM migration decisions with a prime focus on reducing the total number of running physical machines.

2.8 Summary

In this chapter we present a basic background of MapReduce scheduling and the architecture of a cloud data center. We have tried to give an insight into the problems that we attempt to solve in this thesis, by addressing the need for a job aware scheduler and a power aware resource manager for a data center. We also present the reader with relevant algorithms and related work in this field.

Chapter 3

Job Aware Scheduling Algorithm for MapReduce Framework

This chapter presents our job aware scheduling algorithm designed to reduce the overall runtime of jobs on the MapReduce framework. The algorithm tries to capture information regarding the nature of a job, combine it with information about the current state of the cluster, and take an intelligent scheduling decision accordingly. First, the task characteristics and the calculation of the task vector are explained, as these form the base of our algorithms. Next, we propose heuristic and machine learning based algorithms for our approach. Finally, we conclude by describing the experiments, the testing environment and the results of our evaluation.

3.1 Proposed Algorithm

Scheduling in distributed systems has always been a challenging problem in the grid community. Over the years many scheduling algorithms have been proposed, each having its own merits and demerits. Most schedulers take into account the priority of the job, the expected deadline, the resource availability on the cluster, etc. But many of them do not consider the nature of the incoming job, the characteristics of jobs already running on the cluster, or the stability of the cluster in terms of the resources used on its nodes. Keeping track of the nodes' resource usage and of cluster stability is a primary and vital duty of a scheduler, and it is what our scheduling algorithms concentrate upon.

In our approach, we try to monitor health down to the level of each task and each node in the cluster, as the performance of tasks and nodes is vital in any distributed environment. Our algorithm tries to maintain stability at the node and cluster level through intelligent scheduling of tasks. The uniqueness of this scheduler lies in its ability to take into account the resource usage pattern of a job before its tasks are scheduled on the cluster. First we discuss the characteristics of a task and how we arrive at a vector model for it, as this forms the base of our algorithms.

Task Characteristics

Based on the resources a task uses, it can be broadly classified as cpu-intensive, memory-intensive, disk-intensive or network-intensive. In a practical scenario, it might not be possible to categorize a task as belonging to exactly one of these categories. A task will have attributes of more than one category, and to perfectly describe its true nature it should be characterized as a weighted linear combination of parameters from each of these categories. We represent the true and complete nature of a task through its TaskVector T_k, defined below.

T_k = < E_cpu, E_mem, E_disk, E_nw >   (3.1)

where E_x (x is cpu, mem, disk, nw) is the percentage of the cpu, memory, disk and network resources of a physical machine used, respectively. Since we consider the percentage of resources used, E_x takes values 0 ≤ E_x ≤ 1.

Example: TaskVector (T_k)
T_k = < 0.70, 0.10, 0.05, 0.05 > denotes a CPU-intensive vector.

Calculation of E_x of a TaskVector

Every task on the TaskTracker runs as a java process. The atop utility can monitor the resource usage of any particular process. It keeps monitoring the resource usage of the map/reduce task (java process) at regular intervals

(say 0.5 sec, which is configurable), and the average resource usage can be calculated after the completion of that particular task. For example, if the average cpu utilization for the particular process is found to be 20%, then E_cpu of the TaskVector is 0.20. And if the average bandwidth usage of the process is 10 Mbps and the maximum supported bandwidth is 100 Mbps, then E_nw of the TaskVector is 0.10 (percentage of the maximum supported bandwidth). Similarly, E_mem and E_disk can be calculated based on the rate of memory and disk usage on that particular node, respectively. Since, in the MapReduce framework, the map/reduce methods run repeatedly on the individual records of the data split in sequence, the resource usage of a task does not vary much across the time taken to complete the task. Hence, it is fair to average the total resource usage of the task to arrive at an E_x.

Task Selection Algorithm

The JobTracker (master node) [19] receives an incoming job through the JobClient (client). The received job is queued up into the Pending Job List (J). Our Task Selection Algorithm (Figure 3.1) takes the pending jobs from J and tries to split them into their sub-units (map and reduce tasks).

Construction of the Task Vector (T_k)

Before scheduling any task of a particular job, the algorithm calculates the TaskVector for the task. As a task can be either a map or a reduce, each task has its own corresponding Map-TaskVector (T_kmap) and Reduce-TaskVector (T_kreduce). Now, we explain how the TaskVector is constructed before actually running all the tasks of the job.

Calculation of the TaskVector before running all the tasks: We assume the Map-TaskVector of a task of a particular job to be logically equivalent to the Map-TaskVector of the whole job, since the same code runs for each map task of the job. And the same is the case with reduce tasks. Typically, MapReduce jobs run over a large dataset with thousands of map tasks and hence are usually very long running.
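The E_x averaging described above can be sketched as follows; the per-interval readings and the 100 Mbps maximum bandwidth are assumed illustration values (in the real scheduler the readings come from atop rather than a list):

```python
def task_vector(samples, max_bw_mbps=100.0):
    """Average per-interval resource readings of one task into a
    TaskVector <E_cpu, E_mem, E_disk, E_nw>, each component in [0, 1].
    Each sample is (cpu %, memory %, disk %, bandwidth in Mbps)."""
    n = len(samples)
    e_cpu = sum(s[0] for s in samples) / n / 100.0
    e_mem = sum(s[1] for s in samples) / n / 100.0
    e_disk = sum(s[2] for s in samples) / n / 100.0
    # Network usage is expressed as a fraction of the maximum supported bandwidth.
    e_nw = sum(s[3] for s in samples) / n / max_bw_mbps
    return (e_cpu, e_mem, e_disk, e_nw)

# Three 0.5 s readings of a map task: 20% cpu on average, 10 Mbps network.
samples = [(18.0, 10.0, 5.0, 9.0), (20.0, 10.0, 5.0, 10.0), (22.0, 10.0, 5.0, 11.0)]
# task_vector(samples) gives <0.20, 0.10, 0.05, 0.10>, matching the example above.
```

Because map/reduce methods process records of a split repeatedly, this simple average is a fair summary of the task's resource behaviour.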
To calculate T_k, an initial sample of map/reduce tasks is executed. The sample size could be any small number; an ideal sample size would be fewer than 10 tasks. The algorithm employs an event capturing mechanism on the TaskTrackers [19] which

1: TaskSelection(TT)
2: task ← ∅
3: for all J_k ∈ J do
4:   {for each job J_k in pending job list J}
5:   if T_kmap ∉ T then
6:     {if TaskVector List T does not contain the Map-TaskVector of job J_k}
7:     task = getAMapTask(J_k)
8:     J_kmaps ← J_kmaps ∪ task
9:     return task {calculate Map-TaskVector of job J_k}
10:   end if
11:   if J_k has finished all Map Tasks then
12:     if T_kreduce ∉ T then
13:       {if TaskVector List T does not contain the Reduce-TaskVector of job J_k}
14:       task = getAReduceTask(J_k)
15:       J_kreduces ← J_kreduces ∪ task
16:       return task {calculate Reduce-TaskVector of job J_k}
17:     end if
18:   end if
19:   m ← ∅ {MapTask List for job J_k}
20:   m ← getRemainingMapTasks(J_k)
21:   M ← M ∪ m {MapTask List for all the jobs}
22:   r ← ∅ {ReduceTask List for job J_k}
23:   r ← getRemainingReduceTasks(J_k)
24:   R ← R ∪ r {ReduceTask List for all the jobs}
25:   J ← J − J_k
26: end for
27: for all R_k ∈ R do
28:   if J_k has finished all Map Tasks then
29:     decision = TaskAssignment(R_k, TT)
30:     if decision == TRUE then
31:       task ← R_k
32:       R ← R − R_k
33:       return task
34:     end if
35:   end if
36: end for
37: for all M_k ∈ M do
38:   decision = TaskAssignment(M_k, TT)
39:   if decision == TRUE then
40:     task ← M_k
41:     M ← M − M_k
42:     return task
43:   end if
44: end for

Figure 3.1: Task Selection algorithm

listens to events related to cpu, memory, disk and network to monitor the resource usage characteristics of that particular task, and creates a TaskVector as discussed above. The TaskVectors calculated through these few initial map/reduce tasks are averaged to get T_k.

Calculation of T_k for the remaining tasks: Once T_k is calculated from the execution of the sample tasks, the remaining tasks of that particular job are said to be ready for scheduling. The initial sample of tasks which are run to calculate the TaskVector process different data splits and may also be scheduled on different nodes. To overcome the minor differences in the TaskVectors generated due to the heterogeneity in the cluster and the data splits processed by the tasks, we use an average TaskVector T_k, which works as an approximate TaskVector for the rest of the tasks in that particular job.

Task Selection Algorithm: Whenever a TaskTracker TT (slave node) has an empty slot for a task, our Task Selection algorithm checks whether the TaskVector List T contains T_kmap for a job J_k from J. If T does not contain T_kmap, it schedules the TaskVector Calculation Algorithm on TT to calculate T_kmap. Otherwise, if T contains T_kmap, the algorithm schedules the TaskVector Calculation Algorithm on TT to calculate T_kreduce in a similar fashion, duly taking into consideration that all the map tasks of J_k have finished. If the TaskVectors for a job's map and reduce components have already been calculated, the algorithm queues up the remaining map/reduce tasks into the Task Queue. This queue is filled with tasks asynchronously as new jobs arrive at the JobTracker. Then, our algorithm takes the Task Queue as input and picks the task task_k that arrived first. This task_k is submitted to the Task Assignment Algorithm along with the details of TT to get approval for its scheduling.

Task Assignment Algorithm

The Task Assignment algorithm is the second part of our scheduling algorithm.
This algorithm is primarily responsible for deciding whether a task can be scheduled on a particular TaskTracker. The main objective of this algorithm is to avoid overloading any of the TaskTrackers by meticulously scheduling only compatible tasks on a particular node. By a compatible task, we mean a task that does

not affect the tasks already running on that node. We present two approaches to our algorithm: a machine learning based approach and a heuristic-based approach.

Machine Learning Approach

In this section, we present the machine learning based approach of our algorithm. We employ an automatically supervised Incremental Naive-Bayes classifier [39], [51] to decide whether a task is compatible on a particular TaskTracker. Whenever the Task Selection algorithm queries for the compatibility of a task on a TaskTracker, our algorithm computes the compatibility through the outcome of the classifier. We consider the following features to train the classifier:

1. Hardware Specifications of the TaskTracker (Φ)
2. Network Distance (Σ)
3. TaskVector of the Incoming Task (T_k)
4. TaskVectors of Tasks Running on the TaskTracker (T_compound(i))

We now describe the above features in detail and explain the reason for choosing them as features to train our classifier.

Hardware Specifications of the TaskTracker (Φ): MapReduce is designed as a programming model for large clusters of commodity hardware. In most cases, a high degree of heterogeneity can be expected in these commodity clusters. Consider the scenario in Figure 3.2, where we have a cluster with two TaskTrackers TT_2 and TT_4, each having an empty task slot. Assume that we have a cpu-intensive task task_k that needs to be scheduled. Let us also assume that TT_2 has lower computing power (processor speed) than TT_4. It is an intuitive decision to schedule task_k on TT_4, as a cpu-intensive job takes less runtime on a node with higher computing power. Hence, the knowledge of the hardware specifications Φ of a TaskTracker enhances the decision making capability of the scheduler.

Figure 3.2: The scenario of a MapReduce cluster

Network Distance (Σ)

Hadoop implements a rack aware data block replication policy for the data stored on the Hadoop Distributed File System (HDFS) [15]. For the default replication factor of three, HDFS's replica placement policy is to put one replica of the block on one node in the local rack, another on a different node in the same rack, and the third on a node in some other rack. Whenever a TaskTracker has a free task slot, the scheduler tries to assign a task to it. Although the scheduler tries to select a task which is data-local to the TaskTracker, there is no guarantee that this condition can always be met. Consider the scenario in Figure 3.2 again. A task task_k needs to be scheduled either on TT_2 or TT_4, which have an empty slot each, and the data block for task_k is local to TT_2. Scheduling task_k on TT_2 saves the overhead of copying the data from another node. But, if task_k is known to be cpu-intensive, it would be tricky for the scheduler to decide between TT_2 and TT_4. Scheduling task_k on TT_2 would save a little runtime by avoiding the copying of the data block. On the other hand, scheduling task_k on TT_4 could pay an extra price in terms of copying time, but could gain the advantage of running a cpu-intensive job on a node with higher computing power. It could turn out that the time for copying the data is much lower than the actual runtime because of very high speed connectivity between the nodes, and the second option of running the task on TT_4 with a minor copying overhead could turn out to be advantageous. Hence, the network distance Σ also becomes an important feature in making scheduling decisions.

TaskVector of the Incoming Task (T_k)

Apart from having all the relevant details regarding the task to be scheduled, the scheduler should also have its TaskVector for better decision making. In the previous sections, we have observed the additional informative value that is added by having the TaskVector T_k of task_k, through the scenario discussed in Figure 3.2.

TaskVectors of Tasks Running on the TaskTracker (T_compound(i)): Multiple tasks can be scheduled to run on a particular node, as each TaskTracker can be configured with the maximum number of tasks that it can run in parallel. Since multiple tasks run on a single node, the tasks have to share the finite amount of resources present on that node. Let us consider the scenario in Figure 3.2 again. Assume that task task_k needs to be scheduled on one of the TaskTrackers, TT_2 or TT_4. The task task_k was calculated to be a cpu-intensive task and the data block for the task is local to TT_2.
Having taken into consideration the higher computation power of TT_4 and the copying of the data block to its local file system, the scheduler decides that it would be advantageous to schedule task_k on TT_4, despite the minor copying overhead, as the runtime saved by running the task on the node with higher computation power could be much more than the copying time saved. Adding some more complexity to the existing scenario, let us assume that TT_4 currently has a very highly cpu-intensive task task_j running. The scheduler now has to decide by weighing many variables. Deciding to schedule the task on TT_4 would lead to a race condition for resources. Instead of degrading the performance of task_j along with task_k by scheduling task_k on TT_4, the scheduler may instead

choose TT_2. In this way, knowing the TaskVectors of all the tasks currently running on a TaskTracker, T_compound(i), brings out a lot of information which can prove very valuable for decision making. T_compound(i) is the vector addition of the vectors of all the tasks currently running on the TaskTracker TT_i, given by

T_compound(i) = T_1 + T_2 + ... + T_n   (3.2)

where n is the number of tasks currently running on that TaskTracker.

Incorporating the Naive Bayes Classifier: The features discussed above form the feature set F = {Φ, Σ, T_k, T_compound(i)}. Whenever the Task Selection algorithm queries the Task Assignment algorithm for the compatibility of a task task_k on a TaskTracker TT, the algorithm tests the compatibility on an Incremental Naive-Bayes classifier model. task_k = compatible denotes the event that task_k would be compatible with the other tasks running on TT. The probability P(task_k = compatible | F) is conditional on the feature set F. To calculate the compatibility of a task, the classifier uses its prior knowledge accumulated over a period of time. Using Bayes' theorem, the posterior probability P(task_k = compatible | F) is calculated as follows:

P(task_k = compatible | F) = P(F | task_k = compatible) P(task_k = compatible) / P(F)   (3.3)

The quantity P(F | task_k = compatible) is expanded as

P(F | task_k = compatible) = P(f_1, f_2, ..., f_n | task_k = compatible)   (3.4)

where f_1, f_2, ..., f_n are the features of the classifier. We assume that all the features are independent of each other (the Naive-Bayes assumption). Thus,

P(F | task_k = compatible) = ∏_{j=1}^{n} P(f_j | task_k = compatible)   (3.5)

The above equation forms the foundation of learning in our classifier. The classifier uses the results of decisions made in the past to make the current decision. This is achieved by keeping track of past decisions and their outcomes in the form of posterior probabilities. If this posterior probability is greater than or equal to the administrator-configured Minimum Acceptance Probability C_ml, then task_k is considered for scheduling on TT_i (algorithm in Figure 3.3).

1: TaskAssignment(task_k, TT)
2: Φ = getHardwareSpecifications(TT)
3: Σ = getNetworkDistance(task_k, TT)
4: T_k = getTaskVector(task_k)
5: T_compound = getVectorsOfTasksOnTaskTracker(TT)
6: compatibility = classifier(task_k, {Φ, Σ, T_k, T_compound})
7: if compatibility ≥ C_ml then
8:   return TRUE
9: else
10:   return FALSE
11: end if

Figure 3.3: Task Assignment algorithm following the machine learning approach

Reason for using the Naive Bayes Classifier: We have used the Incremental Naive Bayes classifier since the features used in our algorithm are independent of each other. Moreover, this classifier is fast and consumes little memory and cpu, which avoids any overhead on the scheduler.

Trained model update: Our algorithm trains the classifier incrementally after every task is completed. Every TaskTracker monitors itself for a race condition on its resources by checking whether it is overloaded, and sends feedback corresponding to the previous decision made on it after the completion of every task. If negative feedback is received from a TaskTracker pertinent to a previous task allocation, the classifier is re-trained at the JobTracker (Figure 3.4) to avoid such mistakes in the future. This keeps the classifier model updated with the current scenario of the cluster.
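As an illustration of this incremental training loop, here is a toy count-based Naive-Bayes sketch over discretized feature values; the Laplace smoothing and the two discrete features are assumptions made for the example (the actual implementation in this thesis uses Weka's incremental classifier):

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Toy incremental Naive-Bayes over discrete feature values: feedback
    from each completed task re-trains the model with a single count update."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # label -> count
        self.feature_counts = defaultdict(int)  # (label, index, value) -> count
        self.seen_values = defaultdict(set)     # feature index -> observed values

    def update(self, features, compatible):
        # One TaskTracker feedback message: was the past decision correct?
        self.class_counts[compatible] += 1
        for i, v in enumerate(features):
            self.feature_counts[(compatible, i, v)] += 1
            self.seen_values[i].add(v)

    def posterior(self, features):
        # Bayes rule with the independence assumption and Laplace smoothing.
        total = sum(self.class_counts.values())
        scores = {}
        for label in (True, False):
            p = self.class_counts[label] / total
            for i, v in enumerate(features):
                num = self.feature_counts[(label, i, v)] + 1
                den = self.class_counts[label] + len(self.seen_values[i])
                p *= num / den
            scores[label] = p
        return scores[True] / (scores[True] + scores[False])

nb = IncrementalNaiveBayes()
# Feedback from completed tasks: (node hardware class, task type) -> compatible?
nb.update(("fast", "cpu"), True)
nb.update(("fast", "cpu"), True)
nb.update(("slow", "cpu"), False)
# A task is accepted when nb.posterior(features) >= C_ml (0.45 in Table 3.1).
```

With this history, a cpu-intensive task on a "fast" node scores well above C_ml, while the same task on a "slow" node falls below it, so one negative feedback is enough to steer future decisions.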

Figure 3.4: Task Selection algorithm: the received task is tested for compatibility on an Incremental Naive-Bayes classifier and is accepted if the posterior probability is greater than or equal to C_ml.

Heuristic-based Algorithm

In this section we present a heuristic-based algorithm to decide whether a task is compatible on a particular TaskTracker. Whenever the Task Selection algorithm queries for the compatibility of a task on a TaskTracker, this algorithm runs the Task Compatibility Test.

Task Compatibility Test

The compound TaskVector of the tasks currently running on the TaskTracker TT_i is given by equation 3.2:

T_compound(i) = T_1 + T_2 + ... + T_n

Each TaskVector T_k in the Task Queue, for all k ∈ [1, n], is represented as follows,

T_k = E_cpu e_1 + E_mem e_2 + E_disk e_3 + E_nw e_4   (3.6)

where e_1, e_2, e_3 and e_4 are basis vectors. We obtain T_availability by calculating the difference between the total resources and T_compound(i), which is given by

T_availability = TotalResources − T_compound   (3.7)

Assuming that T_k is scheduled on the TaskTracker, the amount of unused resources on it is calculated through the following equation:

T_unused = T_availability − T_k   (3.8)

The Cosine Similarity model is used to measure the similarity in the resource usage patterns of the MapReduce tasks, which is discussed in the next section.

Cosine Similarity Model

Cosine similarity [9] gives the measure of the angle between two vectors. If the angle between two vectors is small, then they are said to possess similar alignment. The cosine of two vectors lies between −1 and 1. If the vectors point in the same direction, the cosine between them is 1, and the value decreases and falls to −1 as the angle between them increases. Using the Euclidean dot product, the cosine of two vectors is defined through

a · b = |a| |b| cos θ   (3.9)

and the similarity is given as follows,

similarity = (A · B) / (|A| |B|) = Σ_{i=1}^{n} A_i B_i / ( √(Σ_{i=1}^{n} A_i²) √(Σ_{i=1}^{n} B_i²) )   (3.10)

The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity. And the similarity value

lies between 0 and 1 in our case, since we are not dealing with negative physical resource values. The cosine similarity between T_availability(i) and T_k is calculated to measure the similarity in the resource usage patterns of T_k and T_availability(i). Since the resource usage patterns of T_k and T_compound(i) should be dissimilar (tasks should not conflict with each other), the pattern should be similar in the case of T_k and T_availability(i). The similarity is given by

similarity = (T_k · T_availability(i)) / (|T_k| |T_availability(i)|)   (3.11)

The greater the similarity value, the greater the similarity between T_k and T_availability(i), and hence the better the compatibility. At the same time, the smaller T_unused, the higher the resource utilization on a particular TaskTracker. Bringing them into a single equation, we get the following:

value_combination = α · similarity + (1 − α) · T_unused   (3.12)

The values α and C_heu are administrator-configured values which can be set based on the requirements of a particular cluster. If value_combination is greater than or equal to the Minimum Value for Acceptance C_heu, then T_k is considered for scheduling on TT_i (Figure 3.5).

1: TaskAssignment(task_k, TT)
2: T_k = getTaskVector(task_k)
3: T_compound = getVectorsOfTasksOnTaskTracker(TT)
4: T_availability = TotalResources − T_compound
5: T_unused = T_availability − T_k
6: similarity = (T_k · T_availability(i)) / (|T_k| |T_availability(i)|)
7: value_combination = α · similarity + (1 − α) · T_unused
8: if value_combination ≥ C_heu then
9:   return TRUE
10: else
11:   return FALSE
12: end if

Figure 3.5: Task Assignment algorithm following the heuristic-based approach
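The heuristic test of equations 3.7 through 3.12 can be traced with a small sketch; treating T_unused in equation 3.12 as the magnitude of the unused-resource vector is our interpretation (the combination needs a scalar), and the α = 0.5 and C_heu = 0.40 values come from Table 3.1:

```python
import math

def cosine_similarity(a, b):
    # Equation 3.11: cosine of the angle between two resource vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def value_combination(t_k, t_compound, total=(1.0, 1.0, 1.0, 1.0), alpha=0.5):
    t_avail = tuple(t - c for t, c in zip(total, t_compound))  # eq. 3.7
    t_unused = tuple(a - k for a, k in zip(t_avail, t_k))      # eq. 3.8
    similarity = cosine_similarity(t_k, t_avail)               # eq. 3.11
    unused = math.sqrt(sum(u * u for u in t_unused))           # |T_unused|, our interpretation
    return alpha * similarity + (1 - alpha) * unused           # eq. 3.12

# A cpu-intensive task is accepted when value_combination(...) >= C_heu (0.40).
t_k = (0.7, 0.1, 0.05, 0.05)
v_on_memory_loaded_node = value_combination(t_k, (0.1, 0.7, 0.1, 0.1))
v_on_cpu_loaded_node = value_combination(t_k, (0.8, 0.1, 0.05, 0.05))
```

As expected, the cpu-intensive task scores higher on the node running a memory-intensive load, whose free resources align with the task's vector, than on a node already running a cpu-intensive load.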

We have enforced the condition that a task must have its value_combination greater than or equal to C_heu, which ensures that the task is compatible with the tasks currently running on that node and, at the same time, that it increases the overall utilization of the node.

3.2 Evaluation and Results

We have implemented our algorithms as scheduler plugins for the MapReduce framework. We customized the method List<Task> assignTasks(TaskTrackerStatus) of the org.apache.hadoop.mapred.TaskScheduler class. Upon job submission, our algorithm calculates the TaskVector of the job by running a sample of map/reduce tasks. The TaskVector is constructed separately for the map and reduce phases; the sort and shuffle phases are merged with the reduce phase. In our experiments, at most five map/reduce tasks are run and the average amongst them is calculated to find T_k. The TaskVector is captured by monitoring the resource usage related to cpu, memory, disk and network through the atop [6] utility.

Testing Environment

Our testing environment consisted of 12 nodes (one master and 11 slaves). The master node was designated to run the JobTracker and the NameNode [15], while the slaves ran TaskTrackers and DataNodes (the slave daemon of HDFS). The nodes were heterogeneous, with Intel Core 2 Duo 2.4 GHz processors, a single hard disk with capacity ranging from 80 GB to 2 TB, and 2 or 4 GB of RAM. The nodes were interconnected with a gigabit ethernet switch. All the nodes were installed with the CentOS release 5.5 (Final) operating system with Java.

Experiments Description

There are very few schedulers which are even minimally resource aware, of which the Fair [12] and Capacity [7] schedulers are widely acclaimed. The Capacity scheduler, developed by Yahoo, is one of the most successful schedulers currently used in the industry for real-world applications, with resource division at the cluster level. Hence,

Table 3.1: Hadoop and Algorithm Parameters

Parameter                                                           Description
HDFS block size                                                     64 MB
Speculative execution                                               enabled
Heartbeat interval                                                  4 seconds
Number of map task slots per node
Number of reduce task slots per node
Replication factor                                                  3
Number of queues in Capacity scheduler
Cluster resources allocated for each queue in Capacity scheduler    %
alpha, α                                                            0.5
Minimum Value for Acceptance, C_heu                                 0.40
Minimum Acceptance Probability, C_ml                                0.45

we have chosen it as the baseline for testing our algorithms. Table 3.1 provides the Hadoop and algorithm parameters used for testing. Experiments were conducted on jobs like terasort, grep, web crawling, wordcount and video file format conversion, which are close to real-world applications. The input data size for these jobs goes up to 50 GB, with the number of tasks up to 800. The implementation of the Incremental Naive Bayes classifier for the machine learning based algorithm is taken from Weka [17].

Comparison on runtime of the jobs

We compare the overall runtime of the jobs by varying the number of input jobs given to the scheduler. From our experiments, we conclude that the overall runtime saved can go up to 21% in the heuristic based approach and 27% in the machine learning based approach when compared to the Capacity scheduler. The amount of savings also varies with the number of jobs given to the cluster. As the number of jobs increases, the scheduler finds it easier to find diverse jobs that can be scheduled together without conflicting. As the number of jobs increases, the savings in the overall runtime also increase, as can be seen from Figures 3.6 and 3.7.

Comparison on resource usage

We present the effect of our scheduling algorithms on the cpu usage of a TaskTracker. We monitored a random TaskTracker for its resource requirement. Figures 3.8 and 3.9 show a snapshot of the cpu requirement on a particular TaskTracker for a certain random period of time. From Figures 3.8 and 3.9, we can see that the cpu requirement of the tasks running on the TaskTracker with the Capacity scheduler reaches up to 250% (the resource requirement, not the resource usage). This situation occurs because the Capacity scheduler is unaware of the nature of the jobs. On the contrary, our algorithms try not to schedule similar tasks together on a TaskTracker and hence do not overload it, except for minor infrequent surges. We also observe that the overall utilization of the node is increased.
A similar effect of our algorithms on other resources, such as memory, disk and network, has been observed.

Figure 3.6: Comparison of runtime (in hours) of the jobs between the Capacity and heuristic based algorithms: the saving in the runtime of the jobs increases as the number of jobs increases.

Figure 3.7: Comparison of runtime (in hours) of the jobs between the Capacity and machine learning based algorithms: the saving in the runtime of the jobs increases as the number of jobs increases.

Figure 3.8: Comparison of cpu requirement on a TaskTracker between the Capacity and heuristic based algorithms: the cpu requirement mostly stays below 100% in the case of the heuristic based algorithm, except for a few surges. The time stamp is shown in minutes.

Figure 3.9: Comparison of cpu requirement on a TaskTracker between the Capacity and machine learning based algorithms: the cpu requirement mostly stays below 100% in the case of the machine learning based algorithm, except for a few surges. The time stamp is shown in minutes.

3.2.5 Effect on low resource-intensive jobs

Not all MapReduce jobs are highly resource-intensive; some consume very little cpu, memory, disk and network resources. These jobs are considered to be low resource-intensive jobs. When these low resource-intensive jobs are queued up after high resource-intensive jobs, they tend to starve until the high resource-intensive jobs finish in the case of the Capacity scheduler. But with our algorithms, these jobs get scheduled ahead of their actual turn and do not starve, as can be seen from Figures 3.10, 3.11 and 3.12. After scheduling a high resource-intensive task on a TaskTracker, the scheduler finds the resource availability on that particular TaskTracker too low to accommodate another similar task, and hence looks for a low resource-intensive task and allocates it. But sometimes we also see an increase in the individual runtime of these jobs under our algorithms. At the same time, the overall runtime of all the jobs is lower compared to the Capacity scheduler.

Figure 3.10: Effect of the Capacity scheduler on low resource-intensive jobs: the three job queues alternately contain the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The low resource-intensive jobs starve until the completion of the high resource-intensive jobs.

Figure 3.11: Effect of the heuristic based algorithm on low resource-intensive jobs: the job queue contains the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The starvation of low resource-intensive jobs is reduced.

Figure 3.12: Effect of the machine learning based algorithm on low resource-intensive jobs: the job queue contains the jobs in the order shown on the y-axis (bottom to top). The x-axis shows the runtime in hours. The starvation of low resource-intensive jobs is reduced.

3.2.6 Overhead of the scheduler

The algorithms discussed in this chapter create little or no overhead on the framework.

No task overhead

A sample of tasks is run ahead of the scheduling to calculate the TaskVector of the tasks of a job. The algorithms do not run these completed tasks again; they only run the remaining tasks of that particular job.

No resource monitoring overhead

The atop utility, which is used to listen to the events related to resource usage, is a very lightweight tool.

No decision making overhead

The Incremental Naive Bayes classifier used on the JobTracker is fast, consumes little memory and cpu, and does not affect any of the TaskTrackers.

3.3 Summary

In this chapter, we presented a job aware scheduling algorithm that tries to schedule jobs based on their resource usage pattern, by duly considering the resource usage pattern of the jobs currently running on the nodes of the cluster. The algorithms try to avoid race conditions between the jobs running on a node, which reduces the overall runtime, thereby implicitly saving energy for the cluster. Our algorithms increase the overall utilization of the nodes of the cluster and also lower the starvation of low resource-intensive jobs. The results of our algorithms indicate a significant saving in the runtime of the jobs and also an increase in the utilization of the nodes. The efficacy of our algorithms can be further increased if many jobs of diverse resource utilization nature are submitted to the cluster.

Chapter 4

Dynamic Energy and SLA Aware Scheduling and Provisioning of Virtual Machines in Cloud Data Center

Cloud computing presents a compelling opportunity to lower the power consumption of data centers. With efficient data center policies, it is a great way to save energy and reduce greenhouse gas emissions. In this chapter, we present our approach to handling the scheduling decisions for VMs in the data center. We discuss the vector model that forms the base for our algorithms, and the similarity model that tries to consolidate virtual machines on physical machines while avoiding SLA violations.

Initially, we assume that all the physical machines in the data center are put in standby mode except for one. We start with only one physical machine up and running, and awaken further physical machines only as and when required, as directed by the algorithms discussed ahead. When the data center receives a new request to allocate a VM, the request is directed to the Allocation Algorithm, which decides the physical machine on which the VM is allocated.

Allocation Algorithm

The Allocation Algorithm, presented in Figure 4.1, accepts the VM request and tries to fit it onto one of the currently running physical machines, based on the resource usage of the target physical machine. The resource usage of the target physical machine is represented by its

Resource Vector. Firstly, we discuss the Resource Vector, which forms the base for our algorithms.

Resource Vector

A virtual machine uses the computing resources based on the applications running on it. Based on its resource usage, a virtual machine can be broadly categorized as CPU-intensive if it uses a lot of CPU, memory-intensive if it accounts for more memory I/O, and similarly disk-intensive or network-intensive. But just identifying this label for a virtual machine does not give its exact resource usage pattern. To capture a better resource usage pattern of a virtual machine, we need to take into account all the resources used by it at once. So, we define the resource usage pattern of a virtual machine as a vector with four components, denoting the CPU, memory, disk and network resources:

ResourceVector(RV) = < E_cpu, E_mem, E_disk, E_bw >    (4.1)

where E_x (x is cpu, mem, disk, bw) represents the fraction of the corresponding resource used, i.e. the percentage of the total CPU, memory, disk and network resources, respectively, used on that physical machine. Since we denote E_x as a fraction of the resource used on the physical machine, its value ranges from 0 to 1.

Example: Resource Vector (RV)
Resource Vector 1 = < 0.70, 0.10, 0.05, 0.05 > denotes a CPU-intensive vector.
Resource Vector 2 = < 0.70, 0.50, 0.05, 0.05 > denotes a CPU and memory intensive vector.

Construction of Resource Vector

The resources used by a virtual machine are logged at regular intervals at the hypervisor level. The RV of a virtual machine is represented as RV_vm. Each E_x in RV_vm is calculated by averaging the corresponding resource usage (say E_cpu) over a period of time (the previous time units in the history). For

example, E_cpu at any time τ is the average percentage utilization of CPU by the virtual machine over the logging window ending at τ.

Handling Resource Vector in Heterogeneous Environment: The Resource Vector of a VM (RV_vm) is the vector representation of the fraction of resources utilized by the VM on a physical machine. But since the data center could be heterogeneous, this RV_vm may not be uniform across different physical machines, because of their diverse resource capacities. To handle such heterogeneous data center environments, the resource vector can be written as RV_vm(PM), denoting the resource vector of a VM on a particular PM.

Example: RV in heterogeneous environment
The Resource Vector of a VM on a physical machine PM_1 is given, similar to Equation 4.1, as:

RV_vm(PM_1) = < E_cpu, E_mem, E_disk, E_bw >    (4.2)

where

E_cpu = (CPU used by VM) / (max CPU capacity of PM_1)    (4.3)

The rest of the components of RV_vm(PM_1), namely E_mem, E_disk and E_bw, are calculated similarly. So, given the resource vector of a VM on one physical machine, say PM_1, i.e., RV_vm(PM_1), we can calculate its resource vector corresponding to another physical machine, say PM_2, denoted by RV_vm(PM_2). The calculation is straightforward, as the information about the resource capacities of both physical machines is available to the system.

Next, the Allocation Algorithm tries to allocate the new VM request onto the physical machine on which it fits best. To check whether a VM fits well on a running physical machine, we follow the Cosine Similarity model discussed earlier. The Allocation Algorithm uses this similarity model and tries to allocate the incoming virtual machine request onto a physical machine based on the similarity measure between the RV_vm(PM) of the incoming VM and the RV of the physical machine (denoted by RV_PM, which will be discussed later).
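The rescaling between heterogeneous machines follows directly from Equation 4.3: each component of RV_vm is a used amount divided by the host's capacity, so converting RV_vm(PM_1) to RV_vm(PM_2) only multiplies each component by the corresponding capacity ratio. A minimal sketch (the function name and the capacity tuples are illustrative, not from the thesis):

```python
def convert_rv(rv_pm1, cap_pm1, cap_pm2):
    """Rescale a VM's resource vector from PM1's capacities to PM2's.

    rv_pm1, cap_pm1, cap_pm2 are 4-tuples ordered (cpu, mem, disk, bw).
    Each RV component is a fraction of the host's capacity, so
    E_x(PM2) = E_x(PM1) * cap_x(PM1) / cap_x(PM2).
    """
    return tuple(e * c1 / c2 for e, c1, c2 in zip(rv_pm1, cap_pm1, cap_pm2))

# A VM using 70% CPU of a 2 GHz host uses 35% CPU of a 4 GHz host.
rv_on_pm2 = convert_rv((0.70, 0.10, 0.05, 0.05),
                       (2.0, 4.0, 500.0, 100.0),   # capacities of PM1
                       (4.0, 8.0, 500.0, 100.0))   # capacities of PM2
```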

This idea of similarity is used to allocate dissimilar VMs on a physical machine. By similar/dissimilar VMs, we are referring to the similarity/dissimilarity in the resource usage patterns of the VMs. For example, if VM1 is CPU-intensive, we would not want VM2, which is also CPU-intensive, to be allocated on the same physical machine, since there may be a race condition for the CPU resource. By allocating dissimilar VMs on a physical machine, the following benefits can be achieved:

1. Race conditions for resources between the VMs are minimized.
2. The physical machine tends to use all of its resources, increasing its overall utilization.

Reason for choosing Cosine Similarity model: The reason for choosing the Cosine Similarity model over other similarity models is that it is simpler and takes into consideration the similarity of each component of the vector, which perfectly suits our requirement of comparing the usage patterns of several resources at a time.

Before moving forward, we discuss the Resource Vector of a physical machine, RV_PM. RV_PM is the fraction of resources used on the physical machine, i.e., the usage accounted for by the sum of the resources consumed by all the virtual machines running on that particular physical machine. RV_PM can be written as:

RV_PM = < E_cpu_used, E_mem_used, E_disk_used, E_bw_used >    (4.4)

where E_x_used (x is cpu, mem, disk, bw) represents the fraction of the corresponding resource used, i.e. the percentage of the total CPU, memory, disk and network resources, respectively, used on the physical machine. Similarly, PM_free is the resource vector that represents the free resources available on the physical machine.

Calculation of Similarity

As discussed, the Allocation Algorithm uses the cosine similarity measure to find the physical machine that is most suitable for the incoming VM request. To use the

cosine similarity model, we need to know the RV of the incoming VM. But since the incoming VM's resource usage may not be known ahead of its allocation, we initially assume a default RV_vm(PM) of < 0.25, 0.25, 0.25, 0.25 >. Once the VM has been allocated and run for a time period, its exact RV_vm(PM) can be found by the mechanism discussed earlier.

To avoid race conditions for resources between the VMs, we need to allocate VMs of dissimilar nature. We propose two different methods of calculating the similarity measure, both based on Cosine Similarity.

Method 1 - Based on dissimilarity: In this method, we calculate the cosine similarity between the RV of the incoming VM and RV_PM, and select the running physical machine that gives the least cosine similarity measure with the incoming VM. The least cosine similarity value implies that the incoming VM is most dissimilar to the physical machine in terms of resource usage patterns. By equation 3.10 we arrive at the following formula:

similarity = (RV_vm(PM) · RV_PM) / (|RV_vm(PM)| |RV_PM|)    (4.5)

Method 2 - Based on similarity: In this method, we calculate the cosine similarity between the RV of the incoming VM and PM_free:

similarity = (RV_vm(PM) · PM_free) / (|RV_vm(PM)| |PM_free|)    (4.6)

We select the running physical machine that gives the maximum cosine similarity measure with the incoming VM. The maximum cosine similarity value implies that the incoming VM's resource requirements are most compatible with the free resources of the physical machine. The similarity value lies between 0 and 1, since we are not dealing with negative physical resource values.

Difference between Method 1 and 2: The similarity methods discussed above help in consolidating VMs on a physical machine without a race condition for resources, but there is a subtle difference between them. Method 1 tries to allocate VMs which are dissimilar in resource usage patterns.
This method helps in achieving the consolidation of VMs with diverse resource usage patterns. Method 2, on the other hand, tries to allocate a VM which can properly consume the unutilized resources of the physical machine. This method inherently makes sure that race conditions for resources are avoided, and at the same time improves the utilization of the physical machine.

Before discussing further algorithms, we present the utilization model for a physical machine, upon which the following algorithms are based.

Utilization model

Our work considers multiple resources of a physical machine, viz. CPU, memory, disk and network. It is difficult to incorporate the utilization of each resource individually into the algorithms. Hence, we come up with a unified model that represents the utilization of all these resources as a single measure, U. The unified utilization measure U is a weighted linear combination of the utilizations of the individual resources:

U = α E_cpu + β E_mem + γ E_disk + δ E_bw    (4.7)

where α, β, γ, δ ∈ [0, 1] can be weighed by the administrator as per the requirements, and

α + β + γ + δ = 1    (4.8)

So we measure the utilization of any physical machine or virtual machine through the single parameter U. This unified parameter is introduced for simplicity; it avoids the difficulty of carrying multiple parameters through our algorithms.

The Allocation Algorithm not only tries to consolidate dissimilar VMs but also makes sure that the physical machine is not overloaded after the allocation of a VM. Hence, the similarity measure between the VM and the physical machine is calculated first. If the algorithm finds that the similarity measure is good enough to accommodate the VM on the physical machine, we proceed to the next step, in which the algorithm estimates U after the VM allocation on the target physical machine (the machine suggested by the similarity measure) ahead of its actual allocation. The VM allocation is considered only if U after

allocation, i.e., the estimated utilization of the physical machine after the allocation of the VM on it, is less than the value (U_up − buffer). If U after allocation is greater than (U_up − buffer), we do not consider that physical machine, as the allocation may overload it. Instead, we take the physical machine which is next best in terms of the similarity measure and find its U after allocation. That physical machine is accepted if its U after allocation is less than (U_up − buffer); otherwise we repeat the same procedure with the physical machine having the next best similarity measure. The details of the value (U_up − buffer) are discussed later. If the algorithm fails to find any running physical machine which satisfies both of the above conditions, it awakens one of the standby physical machines and allocates the VM on it. The calculation of the estimated U after allocation is straightforward, since we have enough information about the resource vectors of the VMs and physical machines.

After the allocation, the resource usage of each physical machine is monitored at regular intervals. If the utilization of a physical machine reaches an administrator-specified threshold (Scale-up Threshold, U_up), we follow the Scale-up Algorithm below.

Scale-up Algorithm

If the utilization U of any physical machine is observed to be greater than U_up for a consistent time period T, the Scale-up Algorithm is triggered. The Scale-up Algorithm, presented in Figure 4.2, then tries to bring down the U of the physical machine by migrating the VM with the highest utilization on that particular physical machine to another physical machine. Firstly, the Scale-up Algorithm hands over the VM with the highest utilization on the overloaded physical machine to the Allocation Algorithm for suitable migration.
Then, the Allocation Algorithm tries to consolidate that particular VM on any of the other already running physical machines, duly taking into consideration that the migration does not overload the target physical machine either. If the Allocation Algorithm succeeds in finding a physical machine on which to allocate the VM, the migration of the VM onto the target physical machine is initiated. But if the Allocation Algorithm fails to find a suitable physical

1: Allocation_Algorithm(VMs_to_be_allocated)  {VMs_to_be_allocated is the argument passed to this algorithm}
2: for each VM in VMs_to_be_allocated do
3:   for each PM in Running_PMs do  {physical machine is represented as PM}
4:     similarity_PM = calculateSimilarity(RV_vm(PM), RV_PM)  {similarity is calculated using either of the two methods discussed}
5:     add similarity_PM to queue
6:   end for
7:   sort queue in ascending values of similarity_PM {if Method 1 is used} or in descending values of similarity_PM {if Method 2 is used}
8:   for each similarity_PM in queue do
9:     target_PM = PM corresponding to similarity_PM
10:    if U_after_allocation on target_PM < (U_up − buffer) then
11:      allocate(VM, target_PM)  {VM is allocated on target_PM}
12:      return SUCCESS
13:    end if
14:  end for
15:  return FAILURE  {VM cannot be allocated on any of the running machines}
16: end for

Figure 4.1: Allocation Algorithm. The VMs are consolidated on physical machines based on the similarity measure.
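Figure 4.1, together with the cosine similarity of Equations 4.5 and 4.6, can be rendered as a short runnable sketch. This is an illustration rather than the thesis implementation: the function names, the dictionary fields ('rv', 'free', 'u_after') and the way U after allocation is estimated are assumptions.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two 4-component resource vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def allocate(vm_rv, pms, u_up, buffer, method=1):
    """Pick a running PM for a VM request, following Figure 4.1.

    Each pm is a dict with 'rv' (used fractions), 'free' (free fractions)
    and a 'u_after' callable estimating utilization after allocation --
    the field names are illustrative.  Method 1 prefers the *least*
    similar RV_PM; Method 2 prefers the *most* similar PM_free vector.
    """
    scored = []
    for pm in pms:
        target = pm['rv'] if method == 1 else pm['free']
        scored.append((cosine_similarity(vm_rv, target), pm))
    # Method 1: ascending similarity; Method 2: descending (lines 7-9).
    scored.sort(key=lambda s: s[0], reverse=(method == 2))
    for _, pm in scored:
        if pm['u_after'](vm_rv) < u_up - buffer:   # overload check (line 10)
            return pm                              # SUCCESS
    return None                                    # FAILURE: wake a standby PM
```

For example, a CPU-intensive VM offered a CPU-heavy and a memory-heavy host would, under Method 1, land on the memory-heavy one.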

1: Scale_up_Algorithm()
2: if U > U_up then  {if U of a PM is greater than U_up}
3:   VM = VM with max U on that PM
4:   Allocation_Algorithm(VM)
5: end if
6: if Allocation_Algorithm fails to allocate VM then
7:   target_PM = add a standby machine to the running machines
8:   allocate(VM, target_PM)
9: end if

Figure 4.2: Scale-up Algorithm. Upon reaching the scale-up trigger condition, the above algorithm is executed.

machine, then one of the standby physical machines is awakened and the migration of the VM onto it is initiated. By doing this, we bring the U of the physical machine back below U_up. Standby physical machines are added to the running physical machines only when required, to handle rises in resource requirements. This makes sure that the physical machines are used optimally, conserving a lot of energy.

Similarly, if the utilization of a physical machine goes down below an administrator-specified threshold (Scale-down Threshold, U_down), we follow the Scale-down Algorithm below.

Scale-down Algorithm

If the utilization U of any physical machine is observed to be lower than U_down for a consistent time period T, the Scale-down Algorithm is triggered. This suggests that the physical machine is under-utilized. So the Scale-down Algorithm, presented in Figure 4.3, tries to migrate the VMs on that particular under-utilized physical machine to other running physical machines and put it into standby mode. The VMs on the physical machine are handed over to the Allocation Algorithm one after the other, for allocation on other running physical machines, duly taking into consideration that the target physical machines are not overloaded. If the Allocation Algorithm succeeds in finding suitable physical machines on which to consolidate these VMs, the migration of such VMs is initiated onto the target physical

1: Scale_down_Algorithm()
2: if U < U_down then  {if U of a PM is less than U_down}
3:   Allocation_Algorithm(VMs on PM)
4: end if

Figure 4.3: Scale-down Algorithm. Upon reaching the scale-down trigger condition, the above algorithm is executed.

machine. The physical machine is then put into standby mode after all the migration operations are performed. But if the Allocation Algorithm fails to find a suitable physical machine, the VMs are allowed to run on the same physical machine.

Reason for using Thresholds: The Scale-up and Scale-down Algorithms are triggered when the U of a physical machine is observed to be above U_up or below U_down, respectively, for a consistent period of time. These thresholds make sure that the physical machines are neither overloaded nor under-utilized. Keeping the utilization of a physical machine below U_up helps reserve sufficient resources for any sudden surge in the utilization of any of the VMs; this reserve of compute resources greatly helps in avoiding SLA violations. Similarly, the use of U_down helps conserve energy by putting under-utilized machines into standby mode. To trigger these algorithms, it is a necessary condition that the utilization activity on the physical machine persists consistently for a certain period of time. Sometimes there could be a sudden surge in the utilization of a VM which persists only for a small duration; by imposing this condition we avoid unnecessary migration of VMs in such situations. Since these thresholds are percentage utilizations of the physical machines, the algorithms work unchanged for heterogeneous data centers.

Reason for using buffer: Before a VM is allocated on a physical machine by the Allocation Algorithm, the machine's utilization U after the VM's allocation is calculated upfront, and the allocation of that VM is considered only if U after the allocation is less than U_up − buffer.
This buffer value makes sure that the utilization does not reach U_up immediately after an allocation, which would otherwise trigger an immediate scale-up operation.
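The utilization model of Equation 4.7 and the threshold checks above fit in a few lines. A sketch with assumed weights and threshold values; the requirement that the condition persist for a period T is noted in a comment but not implemented:

```python
def utilization(rv, weights=(0.25, 0.25, 0.25, 0.25)):
    """Unified utilization U (Eq. 4.7): weighted linear combination of
    the four resource fractions; the weights must sum to 1 (Eq. 4.8)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * e for w, e in zip(weights, rv))

def trigger(u, u_up=0.75, u_down=0.20):
    """Classify a PM's utilization against the two thresholds.
    (In the thesis, the condition must also hold for a time period T
    before either algorithm actually fires.)"""
    if u > u_up:
        return 'scale-up'     # migrate the highest-U VM away
    if u < u_down:
        return 'scale-down'   # migrate all VMs, put the PM on standby
    return 'steady'
```

Keeping U_up − U_down large, as the text advises, keeps the 'steady' band wide and avoids the jitter of frequent scale-up/scale-down cycles.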

Selection of standby physical machines while scaling up: During the scale-up operation, a standby physical machine may be re-awakened to accommodate VMs. The least recently used machine is picked when selecting a standby physical machine. This makes sure that all the machines in the data center are used uniformly, and avoids hotspots.

Difference between Scale-up and Scale-down Thresholds: U_up and U_down are set by the administrator of the data center as per the requirements. The difference between these values should be made sufficiently large so that the data center does not experience a jitter effect of scaling up and down very frequently.

4.1 Experimental Evaluation

The cloud data center architecture is simulated and the results are generated over it. The simulator is written in Java.

Simulation Model

Our simulator models the cloud data center at the granularity of physical machines, the virtual machines running on them, and the applications running on each virtual machine. Each physical machine can be configured with its own resource specification. Each virtual machine can be assigned to any physical machine dynamically, with a requested amount of resources. One or more applications can be run on each virtual machine, each with its own resource requirement dynamics. The simulator has a provision to incorporate scheduling algorithms which guide the allocation of resources in the data center. The simulator accounts for the amount of energy consumed using the model discussed in Google's server utilization and energy consumption study [31]. The simulator is designed with the following SLA model.

SLA Model: An SLA violation is counted at the process scheduling level of the hypervisor, whenever any requested resource cannot be supplied to a virtual machine.
In simpler terms, during the scheduling of VMs on a physical machine by the hypervisor (scheduling the VMs much like process scheduling in an operating system), an SLA violation is counted whenever requested resources such as

the amount of CPU, memory, disk or network cannot be supplied to a virtual machine.

Table 4.1: Simulation and Algorithm Parameters

Parameter | Value
Scale-up Threshold, U_up | [0.25, 1.0]
Scale-down Threshold, U_down | [0.0, 0.4]
buffer | [0.05, 0.5]
Similarity Threshold | [0, 1]
Similarity Method | Method 1 or 2
Number of physical machines | 100
Specifications of physical machines | Heterogeneous
Time period for which resource usage of a VM is logged for exact RV_vm calculation | 5 minutes

Experimental Set-up and Dataset

The simulation is performed on a machine with an Intel Core 2 Duo 2.4 GHz processor, 2 GB of memory and a 500 GB hard disk, running Ubuntu 10.04 LTS (Lucid Lynx). Rigorous simulation is carried out with various distinctive workloads, based on real data center usage, as the input dataset to the simulator. The simulator and algorithm parameters are specified in Table 4.1. To verify the efficacy of our algorithms, we compared them to the Single Threshold algorithm, and the results are recorded as follows.
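The parameters of Table 4.1 can be collected into a single configuration object. A sketch where the field names and the concrete default values (picked from within the stated ranges) are illustrative:

```python
from dataclasses import dataclass

@dataclass
class SimulationConfig:
    """Parameters from Table 4.1; field names and defaults are illustrative."""
    u_up: float = 0.75                 # Scale-up Threshold, in [0.25, 1.0]
    u_down: float = 0.20               # Scale-down Threshold, in [0.0, 0.4]
    buffer: float = 0.10               # in [0.05, 0.5]
    similarity_threshold: float = 0.5  # in [0, 1]
    similarity_method: int = 1         # Method 1 or 2
    num_physical_machines: int = 100   # heterogeneous specifications
    rv_log_period_minutes: int = 5     # logging window for exact RV_vm

    def __post_init__(self):
        # Keep the thresholds well separated to avoid scale up/down jitter.
        assert self.u_down < self.u_up
```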

4.1.3 Energy Savings

Effect of Scale-up Threshold

Experiments are carried out with our algorithms to find the effect of various values of U_up on energy consumption, and the output is plotted in Figure 4.4. The curve shows a dip when U_up is around 0.70 to 0.80, indicating a sudden drop in the energy consumed.

Figure 4.4: The graph demonstrates the effect of the Scale-up Threshold on energy consumption (in kWh). We see a sudden drop in energy consumption when U_up is around 0.70 to 0.80.

The curve shows that U_up should be neither too high nor too low, and that its optimal value is around 0.70 to 0.80. If U_up is low, the Scale-up Algorithm runs more physical machines to accommodate the VMs. And when U_up is too high, more VMs get consolidated on each machine, so a few surges in the usage of the VMs can lead to running new physical machines. Hence, we see a gradual increase in the energy consumption after 0.80.

Effect of scaling down

Figure 4.5 demonstrates the use of having a threshold to put machines to sleep, and its effect on energy conservation.

Figure 4.5: The graph demonstrates the effect of the Scale-down Threshold on energy consumption (in kWh). The algorithm with the scale-down procedure enabled performs better in terms of energy conservation.

The graph shows that the energy consumed by our algorithms with the scale-down procedure enabled is much lower than without it. Scaling down machines when there is not enough load on them can directly save up to 50% of energy, as demonstrated in the figure. The higher the value of U_down, the more physical machines are scaled down. At the same time, U_down should not be too high, which could result in a jitter effect of scaling up and down due to a low difference between U_up and U_down, as discussed earlier.


More information

Introduction To Cloud Computing

Introduction To Cloud Computing Introduction To Cloud Computing What is Cloud Computing? Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g.,

More information

Lesson 14: Cloud Computing

Lesson 14: Cloud Computing Yang, Chaowei et al. (2011) 'Spatial cloud computing: how can the geospatial sciences use and help shape cloud computing?', International Journal of Digital Earth, 4: 4, 305 329 GEOG 482/582 : GIS Data

More information

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R

Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Hadoop Virtualization Extensions on VMware vsphere 5 T E C H N I C A L W H I T E P A P E R Table of Contents Introduction... 3 Topology Awareness in Hadoop... 3 Virtual Hadoop... 4 HVE Solution... 5 Architecture...

More information

SMCCSE: PaaS Platform for processing large amounts of social media

SMCCSE: PaaS Platform for processing large amounts of social media KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud

Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud 571 Co-operative Scheduled Energy Aware Load-Balancing technique for an Efficient Computational Cloud T.R.V. Anandharajan 1, Dr. M.A. Bhagyaveni 2 1 Research Scholar, Department of Electronics and Communication,

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved. Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1 We believe data can make what is impossible today, possible tomorrow 2 We empower people to transform complex data into clear

More information

Big Data for Engineers Spring Resource Management

Big Data for Engineers Spring Resource Management Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models

More information

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Google File System (GFS) and Hadoop Distributed File System (HDFS) Google File System (GFS) and Hadoop Distributed File System (HDFS) 1 Hadoop: Architectural Design Principles Linear scalability More nodes can do more work within the same time Linear on data size, linear

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop

An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop An Adaptive Scheduling Technique for Improving the Efficiency of Hadoop Ms Punitha R Computer Science Engineering M.S Engineering College, Bangalore, Karnataka, India. Mr Malatesh S H Computer Science

More information

HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!

HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together! HADOOP 3.0 is here! Dr. Sandeep Deshmukh sandeep@sadepach.com Sadepach Labs Pvt. Ltd. - Let us grow together! About me BE from VNIT Nagpur, MTech+PhD from IIT Bombay Worked with Persistent Systems - Life

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi MATSUO Nagoya Institute of Technology Gokiso, Showa, Nagoya, Aichi,

More information

NOSQL OPERATIONAL CHECKLIST

NOSQL OPERATIONAL CHECKLIST WHITEPAPER NOSQL NOSQL OPERATIONAL CHECKLIST NEW APPLICATION REQUIREMENTS ARE DRIVING A DATABASE REVOLUTION There is a new breed of high volume, highly distributed, and highly complex applications that

More information

Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack

Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack Demystifying the Cloud With a Look at Hybrid Hosting and OpenStack Robert Collazo Systems Engineer Rackspace Hosting The Rackspace Vision Agenda Truly a New Era of Computing 70 s 80 s Mainframe Era 90

More information

Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution

Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Vol:6, No:1, 212 Optimizing Hadoop Block Placement Policy & Cluster Blocks Distribution Nchimbi Edward Pius, Liu Qin, Fion Yang, Zhu Hong Ming International Science Index, Computer and Information Engineering

More information

Cloud Computing Concepts, Models, and Terminology

Cloud Computing Concepts, Models, and Terminology Cloud Computing Concepts, Models, and Terminology Chapter 1 Cloud Computing Advantages and Disadvantages https://www.youtube.com/watch?v=ojdnoyiqeju Topics Cloud Service Models Cloud Delivery Models and

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION 2018 IJSRSET Volume 4 Issue 2 Print ISSN: 2395-1990 Online ISSN : 2394-4099 National Conference on Advanced Research Trends in Information and Computing Technologies (NCARTICT-2018), Department of IT,

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

2013 AWS Worldwide Public Sector Summit Washington, D.C.

2013 AWS Worldwide Public Sector Summit Washington, D.C. 2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic

More information

Cloud Computing: Making the Right Choice for Your Organization

Cloud Computing: Making the Right Choice for Your Organization Cloud Computing: Making the Right Choice for Your Organization A decade ago, cloud computing was on the leading edge. Now, 95 percent of businesses use cloud technology, and Gartner says that by 2020,

More information

Distributed Systems CS6421

Distributed Systems CS6421 Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool

More information

CHAPTER 7 CONCLUSION AND FUTURE SCOPE

CHAPTER 7 CONCLUSION AND FUTURE SCOPE 121 CHAPTER 7 CONCLUSION AND FUTURE SCOPE This research has addressed the issues of grid scheduling, load balancing and fault tolerance for large scale computational grids. To investigate the solution

More information

Part 1: Indexes for Big Data

Part 1: Indexes for Big Data JethroData Making Interactive BI for Big Data a Reality Technical White Paper This white paper explains how JethroData can help you achieve a truly interactive interactive response time for BI on big data,

More information

Forget about the Clouds, Shoot for the MOON

Forget about the Clouds, Shoot for the MOON Forget about the Clouds, Shoot for the MOON Wu FENG feng@cs.vt.edu Dept. of Computer Science Dept. of Electrical & Computer Engineering Virginia Bioinformatics Institute September 2012, W. Feng Motivation

More information

Modeling and Optimization of Resource Allocation in Cloud

Modeling and Optimization of Resource Allocation in Cloud PhD Thesis Progress First Report Thesis Advisor: Asst. Prof. Dr. Tolga Ovatman Istanbul Technical University Department of Computer Engineering January 8, 2015 Outline 1 Introduction 2 Studies Time Plan

More information

A Process Scheduling Algorithm Based on Threshold for the Cloud Computing Environment

A Process Scheduling Algorithm Based on Threshold for the Cloud Computing Environment Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

Mitigating Data Skew Using Map Reduce Application

Mitigating Data Skew Using Map Reduce Application Ms. Archana P.M Mitigating Data Skew Using Map Reduce Application Mr. Malathesh S.H 4 th sem, M.Tech (C.S.E) Associate Professor C.S.E Dept. M.S.E.C, V.T.U Bangalore, India archanaanil062@gmail.com M.S.E.C,

More information

Big Data Processing: Improve Scheduling Environment in Hadoop Bhavik.B.Joshi

Big Data Processing: Improve Scheduling Environment in Hadoop Bhavik.B.Joshi IJSRD - International Journal for Scientific Research & Development Vol. 4, Issue 06, 2016 ISSN (online): 2321-0613 Big Data Processing: Improve Scheduling Environment in Hadoop Bhavik.B.Joshi Abstract

More information

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data 46 Next-generation IT Platforms Delivering New Value through Accumulation and Utilization of Big Data

More information

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers

Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers WHITEPAPER JANUARY 2006 Using Virtualization to Reduce Cost and Improve Manageability of J2EE Application Servers J2EE represents the state of the art for developing component-based multi-tier enterprise

More information

NetApp Clustered Data ONTAP 8.2 Storage QoS Date: June 2013 Author: Tony Palmer, Senior Lab Analyst

NetApp Clustered Data ONTAP 8.2 Storage QoS Date: June 2013 Author: Tony Palmer, Senior Lab Analyst ESG Lab Spotlight NetApp Clustered Data ONTAP 8.2 Storage QoS Date: June 2013 Author: Tony Palmer, Senior Lab Analyst Abstract: This ESG Lab Spotlight explores how NetApp Data ONTAP 8.2 Storage QoS can

More information

Roadmap: Operating Pentaho at Scale. Jens Bleuel Senior Product Manager, Pentaho

Roadmap: Operating Pentaho at Scale. Jens Bleuel Senior Product Manager, Pentaho Roadmap: Operating Pentaho at Scale Jens Bleuel Senior Product Manager, Pentaho Agenda Worker Nodes Hear about new upcoming capabilities for scaling out the Pentaho platform in large enterprise operations.

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud? DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing Slide 1 Slide 3 ➀ What is Cloud Computing? ➁ X as a Service ➂ Key Challenges ➃ Developing for the Cloud Why is it called Cloud? services provided

More information

"Charting the Course... VMware vsphere 6.7 Boot Camp. Course Summary

Charting the Course... VMware vsphere 6.7 Boot Camp. Course Summary Description Course Summary This powerful 5-day, 10 hour per day extended hours class is an intensive introduction to VMware vsphere including VMware ESXi 6.7 and vcenter 6.7. This course has been completely

More information

Next-Generation Cloud Platform

Next-Generation Cloud Platform Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Comparative Analysis of K means Clustering Sequentially And Parallely

Comparative Analysis of K means Clustering Sequentially And Parallely Comparative Analysis of K means Clustering Sequentially And Parallely Kavya D S 1, Chaitra D Desai 2 1 M.tech, Computer Science and Engineering, REVA ITM, Bangalore, India 2 REVA ITM, Bangalore, India

More information

INFS 214: Introduction to Computing

INFS 214: Introduction to Computing INFS 214: Introduction to Computing Session 13 Cloud Computing Lecturer: Dr. Ebenezer Ankrah, Dept. of Information Studies Contact Information: eankrah@ug.edu.gh College of Education School of Continuing

More information

Big Data Using Hadoop

Big Data Using Hadoop IEEE 2016-17 PROJECT LIST(JAVA) Big Data Using Hadoop 17ANSP-BD-001 17ANSP-BD-002 Hadoop Performance Modeling for JobEstimation and Resource Provisioning MapReduce has become a major computing model for

More information

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification:

Boundary control : Access Controls: An access control mechanism processes users request for resources in three steps: Identification: Application control : Boundary control : Access Controls: These controls restrict use of computer system resources to authorized users, limit the actions authorized users can taker with these resources,

More information

Table of Contents HOL SLN

Table of Contents HOL SLN Table of Contents Lab Overview - - Modernizing Your Data Center with VMware Cloud Foundation... 3 Lab Guidance... 4 Module 1 - Deploying VMware Cloud Foundation (15 Minutes)... 7 Introduction... 8 Hands-on

More information

Big Data 7. Resource Management

Big Data 7. Resource Management Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage

More information

Storage Considerations for VMware vcloud Director. VMware vcloud Director Version 1.0

Storage Considerations for VMware vcloud Director. VMware vcloud Director Version 1.0 Storage Considerations for VMware vcloud Director Version 1.0 T e c h n i c a l W H I T E P A P E R Introduction VMware vcloud Director is a new solution that addresses the challenge of rapidly provisioning

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

Distributed Face Recognition Using Hadoop

Distributed Face Recognition Using Hadoop Distributed Face Recognition Using Hadoop A. Thorat, V. Malhotra, S. Narvekar and A. Joshi Dept. of Computer Engineering and IT College of Engineering, Pune {abhishekthorat02@gmail.com, vinayak.malhotra20@gmail.com,

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop

Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop Performance Enhancement of Data Processing using Multiple Intelligent Cache in Hadoop K. Senthilkumar PG Scholar Department of Computer Science and Engineering SRM University, Chennai, Tamilnadu, India

More information

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY

THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY THE COMPLETE GUIDE HADOOP BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017 Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google

More information

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework

International Journal of Advance Engineering and Research Development. A Study: Hadoop Framework Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja

More information

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS

ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS ACCELERATED COMPLEX EVENT PROCESSING WITH GRAPHICS PROCESSING UNITS Prabodha Srimal Rodrigo Registration No. : 138230V Degree of Master of Science Department of Computer Science & Engineering University

More information

A Study on Load Balancing in Cloud Computing * Parveen Kumar,* Er.Mandeep Kaur Guru kashi University, Talwandi Sabo

A Study on Load Balancing in Cloud Computing * Parveen Kumar,* Er.Mandeep Kaur Guru kashi University, Talwandi Sabo A Study on Load Balancing in Cloud Computing * Parveen Kumar,* Er.Mandeep Kaur Guru kashi University, Talwandi Sabo Abstract: Load Balancing is a computer networking method to distribute workload across

More information

50 Must Read Hadoop Interview Questions & Answers

50 Must Read Hadoop Interview Questions & Answers 50 Must Read Hadoop Interview Questions & Answers Whizlabs Dec 29th, 2017 Big Data Are you planning to land a job with big data and data analytics? Are you worried about cracking the Hadoop job interview?

More information

MapReduce programming model

MapReduce programming model MapReduce programming model technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines

Programming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs

More information

Distributed Systems. 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 31. The Cloud: Infrastructure as a Service Paul Krzyzanowski Rutgers University Fall 2013 December 12, 2014 2013 Paul Krzyzanowski 1 Motivation for the Cloud Self-service configuration

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Service Execution Platform WebOTX To Support Cloud Computing

Service Execution Platform WebOTX To Support Cloud Computing Service Execution Platform WebOTX To Support Cloud Computing KATOU Masayuki Abstract The trend toward reductions in IT investments due to the current economic climate has tended to focus our attention

More information