Handling Single Node Failures Using Agents in Computer Clusters

Blesson Varghese, Gerard McKee and Vassil Alexandrov
School of Systems Engineering, University of Reading, Whiteknights Campus, Reading, Berkshire, United Kingdom, RG6 6AY

Abstract: The work reported in this paper is motivated by the need to handle single node failures for parallel summation algorithms in computer clusters. An agent based approach is proposed in which a task to be executed is decomposed into sub-tasks and mapped onto agents that traverse computing nodes. The agents intercommunicate across computing nodes to share information in the event of a predicted node failure. Two single node failure scenarios are considered. The Message Passing Interface is employed for implementing the proposed approach. Quantitative results obtained from experiments reveal that the agent based approach can handle failures more efficiently than traditional failure handling approaches.

Index Terms: Failure handling; Cluster computing; Message Passing Interface; Agent-based failure handling; Single node failure

I. INTRODUCTION

Proactive failure handling forms a crucial component of research in fault tolerance for distributed parallel computing systems. Handling failures proactively, as the term implies, involves predicting failures on computing nodes and moving tasks from nodes predicted to fail onto safer nodes that are less likely to fail soon [1][2][3][4]. Hence, proactive failure handling aims to control a situation by causing something to happen rather than waiting to respond after it happens.

Traditional approaches to handling failures include checkpointing, replication and message logging, and are reported in research that focuses on fault tolerance of distributed parallel computing systems.
However, the traditional failure handling approaches are challenged by single points of failure, scalability issues, communication overheads and prolonged periods of time for reinstating process execution [5][6].

In more recent times, agent technology has been employed for failure handling. Approaches that employ agent technology have incorporated failure handling strategies that tend to be more dynamic and address many issues that are a challenge for traditional failure handling approaches. Research on agent based failure handling can be classified as failure handling by an agent framework and failure handling by individual agents within an agent framework. Both lines of research have contributed significantly to achieving agent based failure handling.

Failure handling by agent frameworks is reported in [7], [8] and [9]. Tichy et al. [7] identify the characteristics of a failure handling multi-agent system and consider a potential framework, namely the Autonomous Cooperative System (ACS). Key concepts of the framework include reliable communication, a fault-tolerant agent platform, fault-tolerant social knowledge, physical distribution and a fault-tolerant agent architecture. Mendes et al. [8] propose a fault tolerant networked control system. In the proposed system, the number of critical communications needed for safe operation between system components is minimized, hence guaranteeing safe operation and performance even in faulty conditions. Almeida et al. [9] report the implementation of the Dynamic Agent Replication Extension (DARX), a failure handling agent framework. In this model, failure handling is performed by replicating those agents that are critical to the system and whose future plans could influence other agents in the system.

Failure handling by individual agents within an agent framework is reported in [10] and [11]. Khan et al. [10] propose exception handling and periodic events, sent to agents to inspect their state, as a means to handle failures, although these add overhead to the system. The agent and broker system is integrated into the architecture to achieve failure handling capabilities in the brokerage system. Summiya et al. [11] address the prevention of partial or complete loss of an agent in an agent framework by employing an algorithm similar to the sliding window model [12].

The research reviewed above considers failure handling by an agent framework and by agents within an agent framework. However, such research seldom explores the extension and implementation of these ideas for large scale distributed parallel computing systems. Hence, there exists a need to address failure handling that employs agent technology in parallel computing systems.

Failures of computing nodes in parallel computing systems can be classified as single node and multiple node failures. A single node failure occurs when one node in an array of computing nodes fails, while a multiple node failure occurs when more than one node in an array of computing nodes fails. The work reported in this paper considers only single node failures by prediction, on similar lines to proactive failure handling. An agent based approach is proposed in which a task to be executed is decomposed into sub-tasks and mapped onto agents that traverse an abstracted hardware layer. The agents intercommunicate across processors to share information in the event of a predicted node failure and to complete the task successfully. The agents hence contribute towards handling faults efficiently.

The remainder of this paper is organized as follows. Section 2 considers failure handling agents and their cognitive capabilities. Section 3 presents the implementation of the agent based approach for failure handling. The resources required for implementing the failure handling approach, and how these resources are glued together coherently to handle failures, are presented. Two scenarios of single node failure are also considered. Section 4 presents quantitative results based on experiments performed for the single node failure scenarios. Section 5 concludes the paper by considering future work.

II. FAILURE HANDLING AGENTS

Agent-based approaches are inspired by nature. For example, the swarming of agents in multi-agent systems such as swarm robotic systems is inspired by the biological phenomenon of swarming bees. Agents in a natural swarm also demonstrate intelligence through their cognitive capabilities in at least four different ways [13][14]. Firstly, an agent is able to know its environment, that is, the surroundings in which it is located. Secondly, an agent is capable of identifying a location in the environment in which it can suitably situate itself. Thirdly, an agent is capable of sensing any hazard that is likely to deteriorate or impair its functioning. Fourthly, an agent is capable of traversing from one location to another when necessary for survival. These four capabilities are also desirable for agents in a computing environment.
The aim of the agent based approach for failure handling considered in this section is to achieve agent intelligence in parallel computing systems and further demonstrate that the cognitive capabilities of an agent, complementing its intelligence, can lead towards effective failure handling.

In abstract terms, the agent based approach proposed in this paper can be summarised as follows. A task to be executed on a parallel computing system is decomposed into sub-tasks and mapped onto agents that carry these sub-tasks onto computing nodes for execution. The agent and the sub-task are independent of each other; in other words, the agents only carry the sub-tasks, or act as a wrapper around a sub-task, independent of the operations performed by the task. The agents move across the nodes to find an appropriate area in which to cluster and execute the task.

In the proposed approach, an agent possesses capabilities similar to the capabilities of a natural agent presented above. Intelligence of an agent in the computing environment is demonstrated in four different ways. Firstly, an agent is aware of its environment, that is, the computing nodes onto which it can carry a task, other agents in its vicinity and agents with which it interacts or shares information. Secondly, an agent can situate itself on a node that may not fail soon and can provide necessary and sufficient consistency in executing the task. Thirdly, an agent can predict node failures by consistent monitoring (for example, power consumption and heat dissipation of the nodes can be used to predict failures). Fourthly, an agent is capable of shifting gracefully from one node to another without interrupting the state of execution, and of notifying other interacting agents in the system, when a node on which a sub-task is being executed is predicted to fail.

III. IMPLEMENTATION

To implement the agent based approach considered in section 2, it is necessary to consider the resources and how these resources can be glued together. Four resources are considered, namely the executed problem, the parallel computing platform, the middleware and the hardware abstraction.

A. Resources

Firstly, the executed problem, an important aspect in large scale parallel computing systems, is considered [15]. Parallel reduction algorithms are identified as a class of algorithms that can benefit from the agent based failure handling approach for two reasons. Firstly, the computing nodes of a parallel reduction algorithm tend to be critical. The execution of the algorithm stalls or produces an incorrect solution if any node information is lost. Secondly, parallel reduction algorithms are employed in critical applications such as space applications. These applications need failure handling through self-managing real time systems. Parallel summation is an exemplar of a parallel reduction algorithm and is considered as the executed problem in this paper. Figure 1 is an illustration of the parallel summation algorithm. The problem of addition is sub-divided between nodes as shown in the diagram, thereby generating sub-problems. These sub-problems are executed in parallel on the nodes of a given level, but sequentially between different levels.

Secondly, the parallel computing platform to execute the problem is considered. In the research reported in this paper, a computer cluster is chosen as the platform for implementing the agent based approach for two reasons. Firstly, a cluster is often characterized by three basic elements, namely a collection of nodes, a network connecting these nodes and a facility to access and share information between the nodes [16], which are simpler elements to handle when compared to other parallel computing infrastructures. Secondly, existing middleware for clusters, namely the Message Passing Interface (MPI) [17], provides standard and portable programming interfaces. The cluster used for the research reported in this paper is one of the high performance computing resources available at the Centre for Advanced Computing and Emerging Technologies (ACET), University of Reading, United Kingdom [18][19]. The cluster consists of a head node and 33 compute nodes. All nodes are connected via a Gigabit Ethernet switch and communicate via standard TCP.

Fig. 1. Illustration of the Parallel Summation Algorithm

Fig. 2. Mapping hardware nodes to logical nodes

Thirdly, the middleware is considered. The Message Passing Interface, a standardized application programming interface (API) used for parallel and/or distributed computing, is chosen for implementing the agent based failure handling approach. Open MPI [20][21] version 1.3.3, an open source implementation of MPI 2.0, is employed on the cluster. An important feature of MPI 2.0, dynamic process creation and management, is essential for implementing the approach. The MPI dynamic process model permits the creation and management of a set of processes both when an MPI application begins and after the application has started. The management of newly created processes includes cooperative termination of a process, communication between newly created processes and the existing MPI application, and establishing communication between two independent processes. MPI_COMM_SPAWN is used to create a new MPI process and establish communication with it from an existing MPI application. On the other hand, MPI_COMM_ACCEPT and MPI_COMM_CONNECT can be used to establish communication between two independent processes. More MPI specific details on the dynamic process model can be obtained from [17][22].

Fourthly, the hardware abstraction is considered; it is obtained when hardware nodes are abstracted to logical nodes. The hardware layer comprises the physical nodes of the cluster, which are connected via a switch, thereby forming a fully connected mesh topology. The abstracted layer is obtained when the physical nodes are abstracted to logical nodes, and is achieved by implementing software rules/policies using the middleware. The policies are such that a carrier agent carrying an executing sub-task can only communicate with a vertically, horizontally or diagonally adjacent carrier agent or process, effectively leading to a grid topology on the abstracted layer.
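The adjacency policy of the abstracted layer can be illustrated with a small sketch. It is not the paper's implementation: the function names `grid_neighbours` and `may_communicate`, and the row-major numbering of logical nodes starting at 0, are assumptions made here purely for illustration.

```python
def grid_neighbours(rank, rows, cols):
    """Return the logical nodes that are vertically, horizontally or
    diagonally adjacent to `rank` on a rows x cols grid.

    Assumption: logical nodes are numbered row by row, starting at 0.
    """
    r, c = divmod(rank, cols)
    neighbours = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # a node is not its own neighbour
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                neighbours.append(nr * cols + nc)
    return neighbours


def may_communicate(a, b, rows, cols):
    """Policy check: only adjacent logical nodes may exchange messages."""
    return b in grid_neighbours(a, rows, cols)
```

On a 3 x 3 grid the centre node has eight neighbours and a corner node has three, so a policy of this kind restricts the fully connected hardware layer to a grid on the abstracted layer.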
For example, the nine nodes of a computer cluster forming a fully connected mesh topology in figure 2 are abstracted to a grid topology in the abstraction layer.

B. Gluing the Resources

Having considered the resources, there is a need to glue them together in a coherent fashion to achieve the goals of the agent based approach. This section hence considers how the resources considered in the previous sub-section are glued together.

The parallel summation algorithm works in four sequential levels. The first level, comprising nodes N1 to N8, receives a live input feed of data. The second level, comprising nodes N9 to N12, receives data from the first level, adds the data received and yields the result to the third level nodes N13 and N14. The fourth level adds the data received from the third level nodes and produces the final result. Figure 1 shows the nodes required in the parallel summation algorithm. For a given time step, every node in a level operates in parallel.

Each node is characterized by its input dependencies (the processes or processors on which a node depends for receiving an input), its output dependencies (the processes or processors to which a node yields data as output) and the data contained in the node. The first level nodes have one input dependency and one output dependency. For instance, node N1 has one input dependency, I1, and node N9 as its output dependency. However, the second, third and fourth levels have two input dependencies and one output dependency. For instance, node N13 of the third level has nodes N9 and N10 as input dependencies and node N15 as output dependency. The data contained in a node is either the input data, for the first level nodes, or a calculated value (the sum of two values in the case of a parallel summation algorithm) stored within the node. The agents on the abstracted layer are created such that they carry input and output dependencies and data.
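The node structure described above can be sketched as follows. This is a sequential illustration, not the MPI implementation used in the paper. The names `Node`, `build_tree` and `run_time_step` are hypothetical, and the pairing of neighbouring nodes (N1 and N2 feed N9, N3 and N4 feed N10, and so on) is an assumption extrapolated from the two examples given in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    inputs: List[str]          # input dependencies
    output: Optional[str]      # output dependency (None for the final node)
    data: Optional[float] = None


def build_tree():
    """Build the 15-node summation tree with its dependencies."""
    nodes = {}
    # Level 1: N1..N8, each fed by one live input I1..I8.
    for i in range(1, 9):
        nodes[f"N{i}"] = Node(f"N{i}", [f"I{i}"], f"N{9 + (i - 1) // 2}")
    # Level 2: N9..N12, each adding two level-1 nodes (assumed pairing).
    for i in range(9, 13):
        first = 2 * (i - 9) + 1
        nodes[f"N{i}"] = Node(f"N{i}", [f"N{first}", f"N{first + 1}"],
                              f"N{13 + (i - 9) // 2}")
    # Levels 3 and 4.
    nodes["N13"] = Node("N13", ["N9", "N10"], "N15")
    nodes["N14"] = Node("N14", ["N11", "N12"], "N15")
    nodes["N15"] = Node("N15", ["N13", "N14"], None)
    return nodes


def run_time_step(nodes, feed):
    """Evaluate one time step level by level and return the final sum."""
    for i in range(1, 9):
        node = nodes[f"N{i}"]
        node.data = feed[node.inputs[0]]
    for i in range(9, 16):  # ascending order respects the level ordering
        node = nodes[f"N{i}"]
        node.data = sum(nodes[dep].data for dep in node.inputs)
    return nodes["N15"].data
```

Feeding the values 1 to 8 into I1 to I8, for example, yields 36 at N15. The dependency lists and the `data` field are exactly the information an agent would need to carry with it to resume a node's role elsewhere.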
Since parallel summation is less complex than many other computational algorithms, the agents carry little information and have few dependencies.

Agent intelligence is demonstrated in four different ways. Firstly, an agent is able to know its environment, that is, the surroundings in which it is located. Information concerning the environment includes knowledge about the processing node on which it is situated, knowledge about other processing nodes in its vicinity and knowledge about other agents situated on processing nodes in its vicinity. Secondly, an agent is capable of identifying a location in the environment in which it can suitably situate itself. As an agent continues to gather and update information about its vicinity, it is also able to decide onto which processing node it can move when the processing node on which it is currently situated is likely to fail. Thirdly, an agent is capable of sensing any hazard that is likely to deteriorate or impair its functioning. This capability is along the lines of proactive fault tolerance, whereby a failure is predicted. In the intelligent agent based approach presented in this paper, the temperature of a processing node rising beyond a threshold is the factor that can impair the functioning of an agent. Hence, an agent is capable of sensing this hazard. Fourthly, an agent is capable of passing from one location to another when necessary for survival. If a hazard is sensed, the agent can relocate to another processing node and complete the execution of the task it is carrying. These four capabilities are utilized in the intelligent agent based approach.

The agent capabilities are combined to achieve failure handling in the following manner. Each process executing on a node gathers sensory information to predict whether the node is likely to fail, along the lines of proactive fault tolerance. In the implementation presented in this paper, node temperatures are simulated. When the temperature of a node rises beyond a threshold, the process executing on that node predicts a failure and hence spawns a process on an adjacent node in the abstracted layer. The agent on the abstracted node expected to fail shifts to the adjacent node on which the new process was spawned.
The dependency information carried by the agent that was shifted to the new node is employed to reinstate the state of execution of the algorithm. Because the agent carries the data for summation, either obtained from a previous level or a calculated value to be yielded to the next level, information is not lost and the final solution is unaffected, which matters in critical applications.

Though a preliminary implementation model was achieved, it was observed that MPI was not the most appropriate middleware for implementing the multi-agent approach. When an agent predicted a node failure, a new process had to be dynamically created on an adjacent node that was not predicted to fail, hence allowing the agent on the node predicted to fail to transfer control to the agent on the newly created process. For this, MPI_Comm_spawn, MPI_Comm_connect and MPI_Comm_accept were required. Since some of these functionalities provided unstable results on the cluster used for implementation, a work-around had to be sought. Hence, the process on the new node onto which the agent transferred was created during the initialization of the program and ran on the cluster as a dummy process until it came into play.

Two scenarios based on single node failure (only one node fails at any instant) are considered in the implementation of the agent based approach.

Fig. 3. Communication sequence for single node failure scenario 1

Fig. 4. Communication sequence for single node failure scenario 2

1) Single Node Failure Scenario 1: The first scenario is one in which no node connected to a node predicted to fail would itself fail in a consecutive time step. Figure 3 illustrates the sequence of events and communication in the first scenario. The communication sequences for the first scenario are as follows. Firstly, the hardware probing process of the node predicted to fail notifies the carrier agent situated on that node that it has predicted a failure.
The carrier agent immediately spawns a new process on a node adjacent to it. The carrier agent then sends a notification to the input dependent processes (two processes in the case of the parallel summation algorithm considered in this paper) and to the output dependent process (one process in the case of the parallel summation algorithm considered in this paper). After sending the notification to the dependent processes, the agent process terminates execution. The newly spawned process then re-establishes all input and output dependencies and continues execution.
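The sequence of events in scenario 1 can be traced with a minimal, single-process sketch. It makes several assumptions that are not from the paper: the threshold value, the names `CarrierAgent`, `probe_predicts_failure` and `migrate`, and the use of a plain list of log strings in place of real MPI messages. The first adjacent node is taken as the target, which reflects the scenario 1 assumption that no adjacent node fails in a consecutive time step.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

THRESHOLD = 70.0  # hypothetical temperature threshold, not a value from the paper


@dataclass
class CarrierAgent:
    node: int                  # logical node the agent is situated on
    inputs: List[str]          # input dependencies
    output: Optional[str]      # output dependency
    data: float                # value carried for the summation
    alive: bool = True


def probe_predicts_failure(temperature, threshold=THRESHOLD):
    """Hardware probing process: predict a failure above the threshold."""
    return temperature > threshold


def migrate(agent, temperature, adjacent_nodes, log):
    """Replay the scenario 1 sequence and return the agent that continues."""
    if not probe_predicts_failure(temperature):
        return agent  # nothing predicted, keep executing in place
    log.append(f"probe on node {agent.node} notifies agent: failure predicted")
    target = adjacent_nodes[0]  # scenario 1: adjacent nodes are assumed safe
    log.append(f"agent spawns a new process on node {target}")
    dependents = agent.inputs + ([agent.output] if agent.output else [])
    for dep in dependents:
        log.append(f"agent notifies {dep} of the move to node {target}")
    agent.alive = False
    log.append(f"agent process on node {agent.node} terminates")
    # The new process inherits dependencies and data, so no state is lost.
    successor = replace(agent, node=target, alive=True)
    log.append(f"process on node {target} re-establishes dependencies")
    return successor
```

For an agent carrying N13 (inputs N9 and N10, output N15), one predicted failure produces seven log entries: one probe notification, one spawn, three dependency notifications, one termination and one re-establishment.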

2) Single Node Failure Scenario 2: The second scenario is more realistic and assumes that any node connected to a node predicted to fail could also fail in a consecutive time step. Figure 4 illustrates the sequence of events and communication in the second scenario. Most of the communication sequences in the second scenario are similar to those of the first scenario. However, additional communication between the agent and the hardware probing processes on the adjacent nodes is required. The additional communication sequences enable the carrier agent to select, from the eight adjacent nodes, one target node on which a new process can be spawned.

In summary, a classic version and a failure handling version of the parallel summation algorithm were implemented. The failure handling algorithm incorporates the concepts of agent based failure handling considered in section 2. The algorithm's capability of handling single node failure scenarios is demonstrated. The quantitative results obtained from experiments based on the single node failure scenarios of the agent based parallel summation algorithm are reported in the next section.

Fig. 5. T_sn1 plotted for third level nodes N13 and N14

IV. RESULTS

The quantitative results obtained from the experiments performed are based on the single node failure scenarios considered in the above section. This section presents the results obtained from both scenarios. Nodes N9 to N15, as shown in figure 1, are the computational nodes of the parallel summation algorithm. In the experimental results reported in this section, only the third level nodes, N13 and N14, were considered.

A. Single Node Failure Scenario 1

The time taken by an agent in single node failure scenario 1 to transfer itself from a node predicted to fail onto an adjacent node in the abstracted layer and re-establish all process dependencies for seamless execution, referred to as T_sn1 and shown in figure 3, was noted.
Thirty trial runs were performed to gather the statistic. Figure 5 shows the graph that plots T_sn1 for nodes N13 and N14. The mean of T_sn1, in seconds, is shown in figure 5 as a red axis line.

B. Single Node Failure Scenario 2

The time taken by an agent in single node failure scenario 2 to transfer itself from a node predicted to fail onto an adjacent node in the abstracted layer and re-establish all process dependencies for seamless execution, referred to as T_sn2 and shown in figure 4, was noted. Thirty trial runs were performed to gather the statistic. Figure 6 shows the graph that plots T_sn2 for nodes N13 and N14. The mean of T_sn2, in seconds, is shown in figure 6 as a red axis line.

Fig. 6. T_sn2 plotted for third level nodes N13 and N14

The mean time for single node failure scenario 2 is greater than that for scenario 1. This is due to the additional communication sequences needed to gather sensory information that aids the decision on which target node can be used to spawn a new process. The additional time, in seconds, is calculated as T_x = T_sn2 - T_sn1.

The mean time taken by an agent in a realistic scenario (single node failure scenario 2) to transfer itself from a node predicted to fail onto an adjacent node in the abstracted layer and re-establish all process dependencies for seamless execution, in other words the mean time taken for reinstating execution after a predicted node failure, is measured in seconds. If traditional checkpointing (checkpointing only when the failure is predicted) with human administration, or incremental checkpointing (periodic checkpointing so that a process does not need to restart from the beginning), were employed, reinstating execution would take time at least of the order of minutes. This brief comparison reveals that the agent based approach is more effective than traditional failure handling methods.
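The quantities compared in this section can be written out explicitly. The following is a notational sketch only: the trial index i, the bar notation for sample means and the reading that each mean is taken over the n = 30 trial runs of its scenario are assumptions introduced here, and no measured values are implied.

```latex
\begin{align}
\bar{T}_{sn1} &= \frac{1}{n}\sum_{i=1}^{n} T_{sn1}^{(i)}, \qquad n = 30,\\
\bar{T}_{sn2} &= \frac{1}{n}\sum_{i=1}^{n} T_{sn2}^{(i)},\\
T_x &= \bar{T}_{sn2} - \bar{T}_{sn1}.
\end{align}
```

Under this reading, T_x isolates the cost of the extra probing messages that scenario 2 exchanges with adjacent nodes before a target node is chosen.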
In short, although only preliminary results obtained through simple experiments are presented, the agent based approach proposed in this paper is promising and paves a path towards handling faults more efficiently than traditional fault handling approaches in distributed parallel computing systems.

V. CONCLUSION

In this paper, handling single node failures for parallel summation algorithms in computer clusters has been considered. An agent based approach has been proposed in which a task to be executed is decomposed into sub-tasks and mapped onto agents that traverse computing nodes. The agents intercommunicate across computing nodes to share information in the event of a predicted node failure. Two single node failure scenarios are considered. It is also observed that implementing a realistic scenario for single node failures requires additional communication sequences between the agent and the hardware probing processes, at the expense of time. The Message Passing Interface has been employed for implementing the proposed approach. Quantitative results obtained from experiments reveal that the agent based approach can handle failures more efficiently than traditional failure handling approaches.

Future work will aim to extend the agent based approach to multiple node failures. More statistical results will be gathered to compare the efficiency of the proposed approach with existing and traditional failure handling approaches. Efforts will also be made to implement the approach on other large-scale parallel computing systems.

REFERENCES

[1] K. A. Hummel and G. Jelleschitz, A Robust Decentralized Job Scheduling Approach for Mobile Peers in Ad-hoc Grids, Proceedings of the 7th IEEE International Symposium on Cluster Computing and Grid, 2007, pp
[2] C. Engelmann, G. R. Vallee, T. Naughton and S. L. Scott, Proactive Fault Tolerance using Preemptive Migration, Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2009, pp
[3] B. Eckart, X. Chen, X. He and S. L. Scott, Failure Prediction Models for Proactive Fault Tolerance within Storage Systems, Proceedings of the IEEE International Symposium on Modelling, Analysis and Simulation of Computers and Telecommunication Systems, 2008, pp
[4] A. F. Iskander and A. A. Younis, A Proactive Fault Tolerance Management Algorithm for Mobile Ad Hoc Networks, Proceedings of the 4th IEEE Consumer Communications and Networking Conference, 2007, pp
[5] J. P. Walters and V. Chaudhary, Replication-Based Fault Tolerance for MPI Applications, IEEE Transactions on Parallel and Distributed Systems, Vol. 20, No. 7, July 2009, pp
[6] X. Yang, Y. Du, P. Wang, H. Fu and J. Jia, FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing, IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 10, October 2009, pp
[7] P. Tichy, P. Slechta, R. J. Staron, F. P. Maturana and K. H. Hall, Multi-agent Technology for Fault Tolerance and Flexible Control, IEEE Transactions on Systems, Man and Cybernetics, Part C: Application and Reviews, 2006, pp
[8] M. J. G. C. Mendes, B. M. S. Santos and J. Sa da Costa, Multi-agent Platform for Fault Tolerant Control Systems, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2007, pp
[9] Ad. L. Almeida, S. Aknine, J.-P. Briot and J. Malenfant, Plan-Based Replication for Fault-Tolerant Multi-Agent Systems, Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium,
[10] Z. A. Khan, S. Shahid, H. F. Ahmad, A. Ali and H. Suguri, Decentralized Architecture for Fault Tolerant Multi Agent System, Proceedings of the 7th IEEE International Symposium on Autonomous Decentralized Systems, 2005, pp
[11] S. Summiya, K. Ijaz, U. Manzoor and A. A. Shahid, A Fault Tolerant Infrastructure for Mobile Agents, Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation,
[12] D. E. Comer, Internetworking with TCP/IP, Volume 1: Principles, Protocols, and Architecture, Prentice Hall,
[13] M. Wooldridge, An Introduction to Multi-Agent Systems, Second Edition, John Wiley & Sons,
[14] D. Weyns, H. Van Dyke Parunak and F. Michel, Environments for Multi-Agent Systems, Lecture Notes in Artificial Intelligence 3374, Springer,
[15] M. J. Quinn, Parallel Computing Theory and Practice, McGraw-Hill Inc.,
[16] J. D. Sloan, High Performance Linux Cluster with OSCAR, Rocks, openMosix & MPI, O'Reilly,
[17] W. Gropp, E. Lusk and A. Skjullum, Using MPI-2: Advanced Features of the Message Passing Interface, MIT Press,
[18] Center for Advanced Computing and Emerging Technologies (ACET) website:
[19] High Performance Computing at ACET website:
[20] OpenMPI website:
[21] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, T. S. Woodall, Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, Proceedings of the 11th European PVM/MPI Users Group Meeting, Budapest, Hungary, 2004, pp
[22] MPI Tutorial: report.html
