CHAPTER 7 CONCLUSION AND FUTURE SCOPE

This research has addressed the issues of grid scheduling, load balancing and fault tolerance for large-scale computational grids. To investigate the solution space, a study of fault-tolerant scheduling and load balancing was planned. A system model was developed to study these issues in computational grids. In particular, five decentralized algorithms were designed using only partial information. The resulting decentralized, dynamic schemes provide efficient fault tolerance, load assignment and load redistribution. They minimize the average job response time and optimize resource utilization despite the scale of grid systems, the heterogeneity in processing power and network bandwidth, and the considerable communication cost of collecting information.

This chapter concludes the dissertation by summarizing the major contributions and outlining future research directions. Section 7.1 highlights the chief contributions. Section 7.2 focuses on the future scope, which extends the past and current research on fault-tolerant scheduling and load balancing for computational grids.

7.1 MAJOR CONTRIBUTIONS

7.1.1 Recent Neighbour Load Balancing Algorithm

The Recent Neighbour (RN) technique is a decentralized, dynamic load balancing strategy with periodic load information exchanges. It logically divides the grid into three levels: grid level, cluster level and leaf nodes. The jobs are assumed to be computationally intensive and mutually independent, and can be executed at any cluster. No deterministic or a priori information about the jobs is available. Each job is assigned a timer when it is generated. If the timer reaches a threshold before the job has been processed, the job is given the highest priority for execution.

The algorithm maintains two queues for incoming jobs: the local job-waiting queue and the global job-waiting queue. The local job-waiting queue holds the jobs waiting to be assigned to intra-cluster nodes when load balancing is initiated. The global job-waiting queue holds the jobs waiting to be assigned to inter-cluster nodes when load balancing is initiated. The jobs in both queues are processed in First-Come First-Served order. The RN algorithm also maintains a node list, NSET, which contains information about the neighbours of the cluster, and a cluster list, CSET, which contains information about the neighbours of the grid. NSET and CSET are updated whenever a computing node joins the cluster or fails.

The RN algorithm first tries to assign jobs and perform load balancing locally. If any neighbour in the cluster or grid is overloaded at any instant, RN allots jobs to other neighbours in NSET or CSET with a minimal load, using the sender-initiated approach to load balancing. Hence, based on the load information, RN chooses the most suitable system for each job, thereby minimizing the job execution time and maximizing the system throughput. RN also takes into account the system heterogeneity with respect to processing power, but at the cost of a high communication delay caused by frequent load information exchanges.

7.1.2 Recent Neighbour Algorithm with Fault Tolerance

This technique is a fault-tolerant version of the RN algorithm. In a computing environment, job migration is the only efficient way to guarantee that the submitted jobs are completed reliably and efficiently even if a failure occurs. RN with fault tolerance detects the occurrence and type of a resource failure by analyzing information about the state of the resource. Resource failures are treated as process failures. The algorithm uses a passive replication scheme with a backup approach to avoid losing jobs during resource failures, and it guarantees that the submitted jobs are executed to completion using the available resources.

7.1.3 Symmetric-Initiated Algorithm

The Symmetric-Initiated algorithm (SI-LB) extends the RN method without fault tolerance. In symmetric-initiated load balancing, both the sender system and the receiver system are responsible for job migration. In SI-LB, the load of a system at a particular instant of time t is defined as the total length of the jobs in its job-waiting queue divided by the system's current processing capacity. The algorithm overcomes the issue of high communication delay by means of the mutual information feedback (MIF) policy.
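The load definition of Section 7.1.3, together with the minimal-load neighbour selection described in Section 7.1.1, can be illustrated with a minimal sketch. It is not the implementation used in this research: the class name System, its fields, the function least_loaded and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class System:
    """A grid system; all names and fields here are hypothetical."""
    name: str
    capacity: float  # current processing capacity (job-length units per time unit)
    queue: List[float] = field(default_factory=list)  # lengths of waiting jobs

    def load(self) -> float:
        # Load at this instant: total length of the queued jobs divided by
        # the system's current processing capacity.
        return sum(self.queue) / self.capacity


def least_loaded(neighbours: List[System]) -> System:
    # Choose the neighbour with the minimal load as the migration target.
    return min(neighbours, key=lambda s: s.load())


# Illustrative values only (assumed, not measured).
a = System("A", capacity=2.0, queue=[4.0, 6.0])  # load = 10 / 2 = 5.0
b = System("B", capacity=4.0, queue=[8.0])       # load = 8 / 4 = 2.0
c = System("C", capacity=1.0, queue=[3.0])       # load = 3 / 1 = 3.0

target = least_loaded([a, b, c])
print(target.name, target.load())  # prints "B 2.0"
```

Because the load is normalized by processing capacity, a fast system with a long queue can still be a better target than a slow system with a short queue, which is how heterogeneity in processing power enters the decision.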

MIF is an event-driven policy that minimizes the overhead involved in collecting load status information. In MIF, each system maintains the state information of other systems by using a state object. The state object helps a system to estimate the load of other systems at any time without message transfer. This is achieved through piggybacking. Each system collects and maintains the state information of its neighbouring systems only.

7.1.4 Hybrid Load Balancing Algorithm

This technique is a hybrid of static and dynamic load balancing strategies for non-dedicated grid environments. It employs the FCFS method for job scheduling. In the hybrid load balancing technique, the resources of the grid environment are considered dynamic; that is, each computing resource can join or leave the grid dynamically and specifies the time and level of its contribution.

7.1.5 Performance-Driven Load Balancing Algorithm

This technique extends the performance-driven load balancing proposed by Kai Lu et al (2006, 2007). It is based on a dedicated grid environment where all the computing resources work together to solve a compute-intensive problem. It proposes a primary-backup approach for fault tolerance with a minimum replication cost, and an efficient job scheduling technique with a minimum communication cost. The main idea of the passive replication scheme is that a backup copy of a job is activated only if a fault occurs while executing its primary copy. It does not require fault diagnosis and is guaranteed to recover all the jobs affected by a processor failure. In such a scheme, only two copies of the job are scheduled, subject to space exclusion (different processors) and time exclusion (Budhiraja et al 1992). This approach is particularly helpful in grids, where fault diagnosis is very difficult because a failure may occur in a processor whose hardware platform is unknown to the user.

Two techniques have been applied while scheduling the primary and backup copies of each job:

(1) Backup overloading, which schedules the backups of multiple primary jobs in the same time slot in order to make efficient use of the available processor time.

(2) De-allocation of the resources reserved for backup jobs when the corresponding primaries complete successfully.

Both the hybrid load balancing algorithm and the performance-driven technique combine the strong points of neighbour-based and cluster-based load balancing strategies. A load balancing algorithm in which a resource exchanges information and transfers jobs to its physical and/or logical neighbours is called a neighbour-based load balancing method. Load balancing algorithms in which the resources are partitioned into clusters based on network transfer delay are called cluster-based load balancing methods (Chatrapati et al 2010). To improve system flexibility and reliability and to save system resources, both approaches employ the passive replication scheme. The main objective of these techniques is to arrive at job assignments that achieve minimum response time, maximum resource utilization and a well-balanced load across all the computing resources in a grid.

7.1.6 Discussion

Optimizing workload allotment for a dynamic grid system is not a simple task. Jobs are assigned to systems so as to minimize the average response time and communication delay and to optimize resource utilization. Owing to the dynamic nature of grids, designing an optimal fault-tolerant scheduling and load balancing technique remains a challenge. It is hoped that the proposed techniques can serve as a reference for further research in the field of fault tolerance and load balancing.

7.2 FUTURE SCOPE

As the configuration of the grid is small, the mathematical and theoretical performance of the proposed techniques cannot be derived with certainty. In the course of designing and evaluating fault-tolerant scheduling and load balancing schemes for grids, several interesting issues have been found that require further investigation. These issues are as follows:

7.2.1 Real Grid Environment

The proposed techniques can be tested on a real grid environment with assumed distributions for the job arrival rate, resource requirements and execution times, in order to analyse their performance from the mathematical and theoretical perspective.

7.2.2 Security Concerns

Grids are mostly formed from resources owned by many organizations and thus are not dedicated to particular users. As such, jobs dispatched to remote systems may experience security issues if a system is attacked by malicious users. Hence, a grid scheduler must be security driven. Incorporating security into the proposed approaches is clearly a research opportunity.

7.2.3 Data Grid

A data grid is a collection of geographically dispersed storage resources over a wide area network. The goal of the data grid is to provide a large virtual storage framework with unlimited power through collaboration among individuals and institutions. Heterogeneity is a big challenge for data-intensive applications running on data grids, where interconnections are relatively slow and network latencies are high. Hence, the performance of the proposed approaches needs to be investigated for data grids.