AN EFFICIENT ALGORITHM IN FAULT TOLERANCE FOR ELECTING COORDINATOR IN DISTRIBUTED SYSTEMS

International Journal of Computer Engineering & Technology (IJCET), Volume 6, Issue 11, Nov 2015, Article ID: IJCET_06_11_005, IAEME Publication

Manoj Niranjan
Rustamji Institute of Technology, BSF Academy, Tekanpur

Mahesh Motwani
Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal

Cite this Article: Manoj Niranjan and Mahesh Motwani. An Efficient Algorithm in Fault Tolerance for Electing Coordinator in Distributed Systems. International Journal of Computer Engineering and Technology, 6(11), 2015.

1. INTRODUCTION

A distributed system consists of multiple autonomous computers that communicate over a computer network to attain a common goal [15]. Distributed computing systems, like computer-based systems in general, undergo undesired changes in their internal structure or external environment during normal operation; such changes are referred to as faults [15]. A fault may be an operational fault or a design fault, and it may occur once or repeatedly. Fault-tolerance techniques are used to make a system tolerate such faults. Checkpointing is a fault-tolerance technique that periodically records the state of the system in stable storage, and it provides fault tolerance without requiring extra effort from the programmer [1]. Each periodically saved state is called a checkpoint of the process [2,3]. A global state [4] [15] of a distributed system is a set of individual process states, one per process [2] [15]. Checkpointing may be either independent or coordinated. In independent checkpointing, each process takes checkpoints independently, without any synchronization with the other processes [5] [15].
In coordinated checkpointing, the processes coordinate their checkpointing actions so that the set of local checkpoints taken is consistent [6,7,8,9]. The current work proposes a new coordinated checkpointing algorithm that selects a new coordinator process whenever the existing coordinator stops working due to a failure. In this algorithm, the election of the new coordinator takes less time and requires fewer network messages than in existing algorithms.

2. EXISTING WORK

In existing protocols, the initiator communicates with the other processes to create a checkpoint. The global checkpoint may become inconsistent if message communication takes place after the initiator's checkpoint request. This is shown in Figure 1, in which m is a message sent by P_1 after it receives a checkpoint request from the initiator. The checkpoint becomes inconsistent if m reaches process P_0 before the checkpoint request does, because checkpoint C_{0,x} records that message m was received from P_1, while checkpoint C_{1,x} records that m was not sent by P_1 [14] [15].

[Figure 1: Message communication between P_0 and P_1 causing an inconsistent checkpoint. Labels: Coordinator, Non-Coordinator P_0 with checkpoint C_{0,x}, Non-Coordinator P_1 with checkpoint C_{1,x}, Request for Checkpoint, message m.]

In another protocol, message communication is permitted only within a fixed time interval, which reduces the communication overhead [10] [15]. The main drawback of this protocol is that it fixes a particular process as coordinator. A coordinator that is fixed for the entire system execution increases the probability of failure [15]. In yet another protocol, the coordinator process changes during system execution, which reduces the probability of coordinator failure. The disadvantage of this protocol is that messages may be communicated at any time, so the communication overhead and the output commit latency increase [11] [15].
The proposed protocol not only presents a new method for selecting a new coordinator when the coordinator fails, but also overcomes these shortfalls. It controls message communication by allowing it only within a fixed time interval, called the smart interval. The smart interval minimizes the communication overhead [15]. If, before the process completes, a non-coordinator process does not receive any system message (PREPARE CHECKPOINT or SAVE CHECKPOINT), it assumes that the coordinator process has failed. In this situation, the proposed algorithm starts selecting the new initiator.

3. SYSTEM MODEL

Consider a system model consisting of n processes, P_0, P_1, ..., P_{n-1}. The number of processes n does not change during execution. Let the i-th checkpoint of the k-th process be denoted CP_k^i, i.e., the initial checkpoint is CP_k^0 (i=0), the first checkpoint is CP_k^1 (i=1), the second checkpoint is CP_k^2 (i=2), and so on [15]. The initial checkpoint is taken at system initialization. Each process maintains its own independent state, data structures, and computations. The processes share neither memory nor a global clock, and they communicate only by message passing. We assume that the underlying network guarantees reliable FIFO (First In First Out) delivery of messages between any pair of processes; this assumption guarantees message synchronization [15]. We use the concept of a smart interval, which is the only period in which message communication takes place. The smart interval is the time that elapses between the control messages for checkpoint preparation and checkpoint taking. Any message sent within the smart interval is logged, and process execution continues. This enables handling of lost messages [13]. The initiator process sends the control messages for checkpoint preparation and checkpoint taking to the other processes [15]. If the initiator process fails at any moment, the selection of a new initiator starts. Each process has a priority, and the process with the highest priority acts as initiator. If the process with the highest priority fails, the process with the second-highest priority, i.e., (highest priority - 1), becomes the coordinator.

4. PROTOCOL DESCRIPTION

The checkpoint initiator process sends a checkpoint-prepare-request-message to the other processes to initiate checkpointing. The other processes then respond to the initiator process by sending a reply.
If the replies from all processes are received within the smart interval, the initiator sends a take-checkpoint-request-message to all processes; otherwise it sends an abort-checkpoint-request-message. The initiator process prepares a global checkpoint, which is the set of local checkpoints of all processes. The local i-th checkpoint of the k-th process is denoted by CP_k^i. In a system of n processes, the i-th global checkpoint is denoted as the set CP^i = {CP_0^i, CP_1^i, ..., CP_{n-1}^i}. The i-th global checkpoint CP^i is said to be consistent if and only if

$$\forall j,k \in [0,n-1] : j \neq k \Rightarrow \neg\,(CP_j^i \rightarrow CP_k^i)$$

where → denotes the happened-before relation described by Lamport in [12] [15].

Here t is the maximum transmission delay for a message to reach its destination and T is the checkpointing interval. T > 3t holds because the checkpoint interval T is greater than the smart interval, and the smart interval must be at least 3t long to accommodate the transmission delay of the three control messages (the checkpoint-prepare-request-message, the response to it, and the take-checkpoint-request-message, each of which may take up to t) and to enable logging of computational messages [15].
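The consistency condition can be read operationally through the Figure 1 scenario: a global checkpoint is inconsistent when some local checkpoint records the receipt of a message whose send is not recorded in the sender's checkpoint. The following Python sketch checks for such orphan messages. It is only an illustration under stated assumptions: the names LocalCheckpoint, orphan_messages, and is_consistent, and the representation of a checkpoint as sets of message identifiers, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class LocalCheckpoint:
    """Hypothetical checkpoint record: ids of messages sent and received so far."""
    sent: set = field(default_factory=set)
    received: set = field(default_factory=set)


def orphan_messages(checkpoints, senders):
    """Return (receiver, message) pairs whose send is missing from the sender's checkpoint.

    checkpoints: dict mapping process id -> LocalCheckpoint
    senders:     dict mapping message id -> id of the process that sent it
    """
    orphans = []
    for pid in sorted(checkpoints):
        for msg in sorted(checkpoints[pid].received):
            if msg not in checkpoints[senders[msg]].sent:
                orphans.append((pid, msg))
    return orphans


def is_consistent(checkpoints, senders):
    """A global checkpoint is treated as consistent when it has no orphan message."""
    return not orphan_messages(checkpoints, senders)


# Figure 1 scenario: P_1 sends m after taking C_{1,x}; P_0 receives m before C_{0,x}.
senders = {"m": 1}
figure_1 = {0: LocalCheckpoint(received={"m"}), 1: LocalCheckpoint()}
# Same message, but the send is also captured in the checkpoint of P_1.
repaired = {0: LocalCheckpoint(received={"m"}), 1: LocalCheckpoint(sent={"m"})}
```

In this sketch, figure_1 is reported as inconsistent because m is an orphan at P_0, while repaired passes the check.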

[Figure 2: Diagram showing message communication during the smart interval.]

We define the following terms:

t_prep = timestamp at which the initiator process sends the prepare request [15]
t_rec = timestamp at which the prepare request is received by a process [15]
T_trns = maximum transmission time for a message, including permissible delay (which is t) [15]
save_state(P_i) = method that saves the current state of process P_i [15]
send(), receive() = methods for sending and receiving messages, respectively [15]

5. CHECKPOINTING PROCESS

The checkpointing process starts with system initialization. The initiator process starts the next checkpoint a time interval T (chosen by the programmer) after the previous checkpoint. The initiator process P_i sends the checkpoint-prepare-request-message to all other processes at t_prep. On receiving the checkpoint-prepare-request-message, each process sends its response to the initiator and then writes a tentative checkpoint [15].

1. If responses from all processes are received within (t_prep + 2*T_trns), the initiator process sends a take-checkpoint-request-message to all processes. Each process makes its tentative checkpoint permanent on receiving the take-checkpoint-request-message from the initiator. This saves the states of all processes, which together form the global checkpoint. The tentative checkpoint (prepared in response to the checkpoint-prepare-request-message) is used to recover a failed process if one or more processes fail after responding to the checkpoint-prepare-request-message [15].
2. If one or more processes do not respond to the checkpoint-prepare-request-message, the initiator process sends an abort-checkpoint-request-message to all processes. Each process deletes its tentative checkpoint on receiving this message. A copy of the unacknowledged messages is logged in this case [15].
3. If a process does not receive any message, i.e., a checkpoint-prepare-request-message or a take-checkpoint-request-message, within the smart interval, it assumes that the initiator has failed. In this condition, the process of leader election starts.
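The three cases above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of reply and arrival times as plain numbers, and the inclusive comparison against the deadline t_prep + 2*T_trns are all assumptions made for the example.

```python
TAKE = "take-checkpoint-request-message"
ABORT = "abort-checkpoint-request-message"


def initiator_decision(t_prep, t_trns, reply_times, expected):
    """Cases 1 and 2: choose the control message the initiator sends next.

    reply_times: dict mapping process id -> time at which its reply arrived
    expected:    ids of all processes that must reply
    """
    deadline = t_prep + 2 * t_trns
    replied_in_time = {p for p, t in reply_times.items() if t <= deadline}
    return TAKE if replied_in_time >= set(expected) else ABORT


def apply_control(permanent, tentative, message):
    """Process side: return the new (permanent, tentative) checkpoint pair."""
    if message == TAKE:
        return tentative, None   # tentative checkpoint becomes permanent
    if message == ABORT:
        return permanent, None   # tentative checkpoint is deleted
    raise ValueError(f"unknown control message: {message}")


def initiator_suspected(control_arrivals, interval_start, interval_end):
    """Case 3: True if no control message arrived within the smart interval."""
    return not any(interval_start <= t <= interval_end for t in control_arrivals)
```

For example, with t_prep = 0 and T_trns = 1, replies arriving at times 1.5 and 2.0 from both expected processes lead to a take request, whereas a missing or late reply leads to an abort request.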

6. LEADER-ELECTION

As soon as a process learns that the initiator has failed, it starts the process of electing a new initiator. Each process knows the priority numbers of all the other processes. When the initiator fails, the process with the next-highest priority, i.e., (Highest Priority - 1), becomes the initiator. The detecting process sends a message about electing a new initiator to the process with the second-highest priority, i.e., the next initiator. On receiving this message, the new initiator informs all the remaining processes that it is the new initiator. Existing protocols, such as the Bully algorithm and the algorithm presented by Basu [15], take more time than the presented algorithm, and their network overhead is also higher.

7. LEADER ELECTION ALGORITHM

Step-I (executed by any non-initiator process)
    Start smart interval
    If checkpoint-prepare-request-message received from initiator Then
        Prepare the checkpoint accordingly and Exit
    Else if no-message-received AND smart-interval-ended Then
        Send message NEW-LEADER to process with PRIORITY = (HIGHEST PRIORITY - 1)
        Update the initiator priority, i.e., HIGHEST PRIORITY = HIGHEST PRIORITY - 1

Step-II (executed at the process with PRIORITY = (HIGHEST PRIORITY - 1))
    Receive message NEW-LEADER from any process
    Update initiator priority = myself
    Send message to all remaining processes with HIGHEST PRIORITY = HIGHEST PRIORITY - 1

8. PERFORMANCE RESULTS

The proposed algorithm was simulated in a Microsoft Windows environment using the JPVM library. The results show that the leader election time of the proposed protocol is lower than that of the existing protocols. This time difference is shown in Table 1.

Table 1 Result for proposed algorithm (columns: Test Case, Existing Algorithm, New Algorithm, Difference)
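Step-I and Step-II of the leader election algorithm in Section 7 can be sketched as a small in-memory simulation. This is a hedged illustration rather than the simulated JPVM implementation: the Process class, the elect helper, and the announcement name I-AM-NEW-INITIATOR are hypothetical, priorities are assumed to be consecutive integers, and the failed initiator is assumed to be the process with the highest priority.

```python
NEW_LEADER = "NEW-LEADER"
I_AM_LEADER = "I-AM-NEW-INITIATOR"  # hypothetical name for the announcement


class Process:
    def __init__(self, priority, highest_priority):
        self.priority = priority
        self.initiator = highest_priority  # local view of the initiator's priority

    def step_one(self, got_prepare_request):
        """Step-I: run by a non-initiator when its smart interval ends."""
        if got_prepare_request:
            return []                      # initiator is alive; checkpoint as usual
        self.initiator -= 1                # HIGHEST PRIORITY = HIGHEST PRIORITY - 1
        return [(self.initiator, NEW_LEADER)]

    def step_two(self, message, all_priorities):
        """Step-II: run by the successor when it receives NEW-LEADER."""
        if message != NEW_LEADER:
            return []
        failed = self.priority + 1
        self.initiator = self.priority     # initiator priority = myself
        return [(p, I_AM_LEADER) for p in all_priorities
                if p not in (self.priority, failed)]

    def on_announcement(self, sender_priority):
        self.initiator = sender_priority


def elect(priorities, detector):
    """Simulate one election after the highest-priority process has failed.

    Returns each surviving process's view of the initiator and the message count.
    """
    highest = max(priorities)
    procs = {p: Process(p, highest) for p in priorities if p != highest}
    messages = 0
    for dest, msg in procs[detector].step_one(got_prepare_request=False):
        messages += 1
        for target, _ in procs[dest].step_two(msg, priorities):
            messages += 1
            procs[target].on_announcement(dest)
    return {p: proc.initiator for p, proc in procs.items()}, messages


views, message_count = elect([1, 2, 3, 4, 5], detector=2)
```

In this run, process 2 detects the failure of process 5 and notifies process 4, which then announces itself to processes 1, 2, and 3, so all four surviving processes agree on initiator 4 after four messages.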


[Figure: chart comparing the New Algorithm and the Existing Algorithm (axis unit: Thousands).]

9. CONCLUSION

The results above show that the new algorithm takes less time than the existing algorithms in electing a new coordinator as well as in tolerating faults. The smart interval reduces the message overhead because message communication is not allowed outside the smart interval.

REFERENCES

[1] Partha Sarathi Mandal, Checkpointing and Self-Stabilization for Fault-Tolerance in Distributed Systems, Ph.D. Thesis (2006)
[2] D. Manivannan, R.H.B. Netzer & M. Singhal, Finding Consistent Global Checkpoints in a Distributed Computation, IEEE Trans. on Parallel & Distributed Systems, Vol. 8, No. 6 (June 1997)
[3] D. Manivannan, Quasi-Synchronous Checkpointing: Models, Characterization, and Classification, IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 7 (July 1999)
[4] J. Tsai & S. Kuo, Theoretical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability, IEEE Trans. on Parallel & Distributed Systems, Vol. 9, No. 10 (October 1998)
[5] B. Bhargava and S.R. Lian, Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems-An Optimistic Approach, Proceedings of IEEE Symposium on Reliable Distributed Systems (1988)
[6] Jiannong Cao, Weijia Jia, Xiaohua Jia, and To-yat Cheung, Design and Analysis of an Efficient Algorithm for Coordinated Checkpointing in Distributed Systems, Proc. of Advances in Parallel and Distributed Computing (March 1997)
[7] Guohong Cao and Mukesh Singhal, On Coordinated Checkpointing in Distributed Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 12 (Dec. 1998)
[8] D.D. Sharma and D.K. Pradhan, An Efficient Coordinated Checkpointing Scheme for Multicomputers, Proc. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems (June 1994)
[9] E.N. Elnozahy, D.B. Johnson, and W. Zwaenepoel, The Performance of Consistent Checkpointing, Proc. 11th Symp. Reliable Distributed Systems (Oct. 1992)
[10] Ch. D.V. Subba Rao and M.M. Naidu, A New, Efficient Coordinated Checkpointing Protocol Combined with Selective Sender-Based Message Logging, IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2008 (2008)
[11] Sarmistha Neogy, Anupam Sinha, Pradip K Das, CCUML: A Checkpointing Protocol for Distributed System Processes, IEEE Transactions on TENCON 2004, IEEE Region 10 Conference, Volume B (Nov. 2004)
[12] K.M. Chandy & L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, Vol. 3 (Feb. 1985)
[13] Ch. D.V. Subba Rao and M.M. Naidu, A Survey of Error Recovery Techniques in Distributed Systems, Proc. 28th Annual Convention and Exhibition of IEEE India Council (December 2002)
[14] E.N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang and David B. Johnson, A Survey of Rollback-Recovery Protocols in Message-Passing Systems, ACM Computing Surveys (CSUR), Volume 34, Issue 3 (September 2002)
[15] Jagdish Makhijani, Manoj Kumar Niranjan, Mahesh Motwani, A K Sachan, Anil Rajput, An Efficient Protocol using Smart Interval using Coordinated Checkpointing, Communications in Computer and Information Science, 2011
[16] Partha Das and Sushabhan Biswas, Fault Tolerance and Power Quality Study of DFIG Based Wind Turbine System, International Journal of Electrical Engineering & Technology, Volume 5, Issue 5, 2014
