
Framework for replica selection in fault-tolerant distributed systems

Daniel Popescu
Computer Science Department
University of Southern California
Los Angeles, CA 90089-0781
{dpopescu}@usc.edu

Abstract. This paper describes my term project, developed in the course CS 589 Software Engineering for Embedded Systems. The term project had to be the design and implementation of a novel application or development tool that exploits one or more existing approaches to software engineering in the context of embedded systems, demonstrates a novel idea in this domain, or overcomes a known significant challenge posed by embedded systems. In my project I examined how to select replica components in fault-tolerant systems so that the overall reliability of the system increases while the additional costs of the deployed replica components are taken into account. As a result, I developed a framework for different replica selection algorithms and evaluated five selection strategies within this framework.

1 Introduction

Software plays an increasingly integral part in our everyday environment. Almost every electronic device in a household contains complex software. Since software is also integrated into critical devices, systems are needed that can continue to operate after a severe fault has occurred. Distributed and mobile systems in particular have more points of possible failure than desktop applications and therefore a higher likelihood of failing. In a desktop system, when the hardware fails completely, the whole software system becomes unavailable. In a distributed system, which consists of many hardware nodes, the failure of a single hardware node does not mean that the whole system fails. The system can still operate if software services are independent or if replicas of critical software components exist.

If the replication of components is used as a fault-tolerance strategy, the following problem arises: if a hardware node fails and we have replicated certain software components, did we replicate the right components? To be sure that we have replicated the right components, we could replicate every single software component. However, this is not feasible, because additional software components consume memory, processing power and bandwidth, among other resources. Therefore, a trade-off decision about the criticality, reliability and costs of software components is needed to find the right set of replicas.

Finding a good trade-off is a difficult problem, because many decision variables exist.

An analyst has to decide which components to replicate, considering the importance of use cases, the resource consumption of software components, the reliability of the components, the computer hardware and the network. After selecting the components, the analyst has to decide where to deploy them, again considering the dimensions mentioned above.

To address this replica selection problem, I transformed it into an optimization problem [4]. In an optimization problem the objective is to find the best possible solution satisfying all given constraints. Optimization, or mathematical programming, is a well-studied domain in applied mathematics and operations research, and different methods and algorithms have been developed to solve such problems using strategies such as non-linear programming, greedy strategies or genetic algorithms. Since different algorithms can be used to solve optimization problems and since no canonical set of design constraints for replicas exists, I developed a framework providing extension ports for additional constraints and for different algorithms, allowing customized instantiations of the replica model. This framework can be seen as the base implementation for the development of later models.

The class project required not only designing a novel approach for the embedded systems domain, but also an implementation. This paper therefore also describes the architecture and the usage of the implemented tool. The tool is an Eclipse plug-in extending the deployment tool DeSi [4] developed at USC. Additionally, it can export data to the parallel coordinates visualization tool Parvis [2] for further analyses.

The remainder of this paper is structured as follows. Section 2 describes the developed framework model, an instantiation of the model with some constraints, and the developed algorithms to solve the optimization problem. Section 3 describes the developed tool. Section 4 presents an evaluation of the developed approach. Section 5 discusses possible future research directions, and Section 6 summarizes the contributions.

2 Framework

Based on the framework model of Malek et al. [4], the developed replication framework model enables the modelling of distributed systems. It allows customized constraints and parameters to be defined, so that instantiations of the model can address user-defined trade-off scenarios. Furthermore, since multiple strategies exist to solve optimization problems, the framework provides different algorithms to find solutions.

2.1 Framework model

The basic entities of the framework model are hosts, components, links and services. These entities are appropriate to model component deployments in distributed systems. In detail, a model consists of the following entities.

- A set of hosts, H, which represents the hardware nodes of the system. Possible attributes of hardware nodes are memory capacity, processing power or energy consumption.
- A set of components, C, where a possible attribute is the size of a component. Each component is deployed on one of the hosts defined above, and each component has a set of replicas, R, which also includes the original component.
- A set of services, S, which describe the different use cases that the whole system offers and can perform. A service is composed of the interaction of components in the system. Since multiple services can use, for example, an encryption component, a component can appear in multiple services.

The model contains three different types of links: physical network links, logical links and service logical links. Therefore, the model contains three different sets of links:

- The set of physical network links, PL, which connect two hosts to each other, indicating that components on these hosts can interact with each other. Physical network links have properties such as bandwidth.
- The set of logical links, LL. A logical link shows that two components are able to interact with each other, for example because they know how to invoke methods in each other.
- The set of service logical links, SLL. A service logical link connects two components that are part of the same service, showing that these two components interact in a use case of the system. It can have properties such as average exchanged data size and average data exchange frequency. Two components can only have a service logical link if they also have a logical link between each other.

To set up the global optimization function, some more attributes of the basic entities are needed.

Reliability. Different entities can fail in a distributed system: a host can break down, a physical network link can fail, or a software component can reach a failure state. Therefore, each host, each physical link and each component in the model has a reliability value, which describes the likelihood that this entity will not fail during operation.

Service criticality. The use cases of a system have different priorities for a user. For example, in a car a working brake system has a higher criticality than the entertainment system. Therefore, the model considers service criticality.

After defining all required elements, the global optimization function can be stated. It is based on the original deployment, the criticality of the services and the reliability of the components:

$$ \sum_{i=1}^{|C|} \sum_{j=1}^{|S|} \alpha \cdot \mathrm{Reliability}(S_j, C_i) \cdot \mathrm{Criticality}(S_j), \qquad \alpha = \begin{cases} 1 & \text{if } C_i \in S_j \\ 0 & \text{if } C_i \notin S_j \end{cases} $$
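
To make the objective concrete, the following minimal Java sketch evaluates the double sum above for a given deployment. All class and method names are illustrative only and do not correspond to the actual DeSi or plug-in API; the reliability term itself is defined in the next paragraphs.

```java
import java.util.List;
import java.util.Set;

/** Illustrative sketch of the global optimization function (names are hypothetical). */
class GlobalObjective {

    interface ReliabilityFunction {
        /** Reliability(S_j, C_i): reliability of component i within service j. */
        double of(int serviceIndex, int componentIndex);
    }

    /**
     * Sums alpha * Reliability(S_j, C_i) * Criticality(S_j) over all components i
     * and services j, where alpha is 1 if component i participates in service j.
     */
    static double value(List<Set<Integer>> serviceMembers,   // component indices of each service
                        double[] serviceCriticality,          // Criticality(S_j)
                        int componentCount,
                        ReliabilityFunction reliability) {
        double total = 0.0;
        for (int i = 0; i < componentCount; i++) {
            for (int j = 0; j < serviceMembers.size(); j++) {
                if (serviceMembers.get(j).contains(i)) {      // alpha = 1
                    total += reliability.of(j, i) * serviceCriticality[j];
                }
            }
        }
        return total;
    }
}
```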

Creating replica components increases the reliability of a component within the services it participates in. The reliability function therefore computes the reliability of a component based on the number of its instances, where each replica of a component is an additional instance of the original component:

$$ \mathrm{Reliability}(S_j, C_i) = 1 - \prod_{k=1}^{|\mathrm{ReplicaSet}(C_i)|} \mathrm{FailProbability}(S_j, \mathrm{Replica}(R_k, C_i)) $$

This reliability definition is based on the assumption that the failure events of the individual replicas are independent. This should be the case, since all replicas reside on different hosts and therefore have different causes for host and link failures. Identical components can never be deployed on the same host, because a constraint of the framework prohibits such double deployment.

The probability of failure of a component depends on the host on which it is placed and on the links it uses to serve a certain service. If the host fails (with probability $P(\bar{H})$), the component also fails; if the host does not fail (with probability $P(H) = 1 - P(\bar{H})$), the probability of failure depends on the link reliability and the component reliability. Therefore, by applying probability theory (conditional probability and the law of total probability), the following formula for the failure probability of a component can be derived:

$$ \mathrm{FailProbability}(C_i) = P(\bar{H}) + P(H)\,P(\bar{L} \mid H) + P(H)\,P(L \mid H)\,P(\bar{C_i} \mid L, H) $$

For this formula I assume that a component and a link cannot operate without a running host: if the host fails, then each attached network link and each software component deployed on it fails as well. As an example, if a host fails 30% of the time, one link fails 20% of the time and the software component itself does not fail at all, we obtain

$$ \mathrm{FailProbability}(C_i) = 0.3 + 0.7 \cdot 0.2 + 0.7 \cdot 0.8 \cdot 0 = 0.44 $$

In this example the probability that the component fails is 44%, considering host and link failures.
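
The two formulas can be evaluated as in the short sketch below; the numbers in the main method reproduce the worked example (host 30%, link 20%, component 0%). The class and method names are illustrative and are not the plug-in's actual API.

```java
/** Illustrative evaluation of the reliability formulas (hypothetical names). */
class ReplicaReliability {

    /** FailProbability = P(~H) + P(H)*P(~L|H) + P(H)*P(L|H)*P(~C|L,H). */
    static double failProbability(double pHostFails, double pLinkFailsGivenHost,
                                  double pComponentFailsGivenHostAndLink) {
        double pHostUp = 1.0 - pHostFails;
        double pLinkUp = 1.0 - pLinkFailsGivenHost;
        return pHostFails
             + pHostUp * pLinkFailsGivenHost
             + pHostUp * pLinkUp * pComponentFailsGivenHostAndLink;
    }

    /** Reliability(S_j, C_i) = 1 minus the product of all replica fail probabilities,
     *  assuming independent failures because the replicas sit on different hosts. */
    static double reliability(double[] replicaFailProbabilities) {
        double allFail = 1.0;
        for (double p : replicaFailProbabilities) {
            allFail *= p;
        }
        return 1.0 - allFail;
    }

    public static void main(String[] args) {
        // Worked example: host fails 30%, link fails 20%, component itself never fails.
        double f = failProbability(0.3, 0.2, 0.0);                // 0.3 + 0.7*0.2 + 0.7*0.8*0 = 0.44
        System.out.println(f);
        // Two independent replicas with the same fail probability.
        System.out.println(reliability(new double[] { f, f }));   // 1 - 0.44^2 = 0.8064
    }
}
```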

2.2 Framework Instantiation

For this framework an active replication strategy is assumed, although the framework can be adapted to other replication strategies as well. In active replication all replicas are running at the same time. The global optimization function defined above does not include any constraints. Used on its own it would be misleading, because hosts have memory and processing limits, and the original component forwards each received event to each replica, so every additional replica consumes resources of the system. The framework itself does not require any constraints, but different constraints can be defined.

To make a reasonable trade-off decision, each instantiation of the model should be equipped with some constraints.

Memory consumption is one example of a constraint for the model. Each host has only a limited amount of space available to install software components, and each software component has a size. This constraint ensures that the optimization algorithm does not generate too many replica components; at most it generates as many components as fit on the hosts. Using all remaining space for replica components is not optimal either, since no space would be left for maintenance tasks, so realistic boundaries for the memory usage are required. However, how does the algorithm know what good boundaries for memory usage are? Deciding on a trade-off requires intelligence, understanding and domain knowledge; an optimization algorithm does not know what memory usage means. Therefore, human input needs to be added to the model. Since each important parameter of the system is modeled, an engineer can examine how much of each resource is used and how much of each resource is available. Consequently, the engineer can set these constraints for a given system before running the optimization algorithm.

Bandwidth consumption is another example of a constraint. Each replica is placed on a different host, and whenever a data event is sent to a component, the event also needs to be forwarded to all of its replicas. This can generate a lot of overhead traffic. A human engineer knows how much overhead traffic is acceptable and provides this as input. Using the engineer's input and the available bandwidth of each physical network link, we obtain a realistic constraint for the optimization algorithm.

2.3 Framework algorithms

Several algorithms with different run-time behavior and quality of results exist for solving optimization problems. Since optimization problems are often so complex that an exact algorithm requires exponential runtime, algorithms based on heuristics can be chosen that find solutions which are approximations of the optimum.

2.3.1 Exhaustive Search

The first algorithm is an exhaustive search. This algorithm always finds the optimal solution by computing the global function value for every possible replication. Therefore, if the algorithm finds a solution, we can be sure that it is the best replication strategy. What are the dimensions of this search? Each component can be replicated multiple times and placed on each host; the only restriction is that one host can run only one instance of a component. Therefore, the exhaustive search tries out $2^{|C| \cdot |H|}$ configurations. This runtime complexity makes the algorithm inapplicable even for small problems, so its primary function is to serve as a benchmark for the other algorithms.

2.3.2 Greedy Search

Greedy algorithms are iterative algorithms that improve the solution in each iteration step by selecting the locally best option. Therefore, greedy algorithms have a much better runtime than an exhaustive search, enabling the optimization of large distributed systems. Greedy algorithms only find the optimal solution if the search space has no local maxima, which is rarely the case; the solution a greedy algorithm computes is therefore often only an approximation of the optimal result. In this framework, the greedy algorithm replicates in every step the component on the host that improves the optimization function the most. It therefore evaluates $|C| \cdot |H|$ possibilities in every step. In the worst case each component is replicated on each host; since in every step only one component is replicated and only one replica can be installed on each host, at most $|C| \cdot |H|$ iterations are needed. Therefore, if no constraints such as memory or bandwidth consumption are assumed, the greedy algorithm has a runtime of $O(|C|^2 \cdot |H|^2)$. This low runtime complexity makes the greedy strategy usable even for large distributed systems.
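
As a rough illustration of this strategy, the sketch below adds in each iteration the single (component, host) replica placement that improves the objective the most, stopping when no admissible placement yields an improvement. The model and constraint types are hypothetical placeholders, not the plug-in's actual classes.

```java
import java.util.List;

/** Greedy replica selection sketch (all types and method names are illustrative). */
class GreedyReplicaSelection {

    interface Model {
        List<Integer> components();
        List<Integer> hosts();
        boolean alreadyPlaced(int component, int host);        // at most one instance per host
        boolean constraintsSatisfied(int component, int host);  // e.g. memory, bandwidth budget
        double objective();                                      // current global function value
        double objectiveWithReplica(int component, int host);   // value after adding this replica
        void addReplica(int component, int host);
    }

    static void run(Model model) {
        while (true) {
            double best = model.objective();
            int bestComponent = -1, bestHost = -1;
            // Evaluate every |C| * |H| candidate placement in this iteration.
            for (int c : model.components()) {
                for (int h : model.hosts()) {
                    if (model.alreadyPlaced(c, h) || !model.constraintsSatisfied(c, h)) continue;
                    double value = model.objectiveWithReplica(c, h);
                    if (value > best) {
                        best = value;
                        bestComponent = c;
                        bestHost = h;
                    }
                }
            }
            if (bestComponent < 0) return;   // no admissible placement improves the objective
            model.addReplica(bestComponent, bestHost);
        }
    }
}
```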

2.4 Practical discussion

To compute an optimal solution, the framework requires quantifiable values. Values such as component size or the available memory of a host can easily be measured in the implemented system. However, values such as reliability or average data transmission cannot be exactly specified prior to runtime. These values need to be obtained using expert knowledge gained from earlier systems or through dynamic simulation [1].

3 Tool Support

An implementation of the conceptual model described above is another part of the deliverables. This section describes the implementation in detail.

3.1 Architecture

3.1.1 Foundation of the tool

The tool was developed as a plug-in for Eclipse. The Eclipse framework is an open-source, platform-independent software framework. It is mainly used as an IDE, but it was designed to be a general platform for rich clients. Its basic version consists of the Rich Client Platform, which provides the basic functionality for extensions, a framework for views and editors, and plug-ins for the Java programming language. It integrates OSGi [5] as a component framework and provides extension ports for additional components. Besides requiring the basic Eclipse framework, the replication plug-in also requires and extends the tool DeSi. The DeSi tool, an Eclipse plug-in itself, is a visual deployment exploration environment that supports specification, manipulation, visualization, and (re)estimation of deployment architectures for large-scale, highly distributed systems. DeSi exports an API for modifying its deployment model, which can be used to define new system parameters, since the underlying deployment framework of DeSi is similar to the framework of this paper. The basic data model and graphical user interface of DeSi could be reused for this project.

3.1.2 The extension mechanism of the replication plug-in

The replication plug-in was designed to provide two distinct extension ports for future change cases: an API to plug in new optimization algorithms and an API to plug in new constraints for the algorithms. To develop a new algorithm, a new subclass of an abstract base class has to be implemented. The newly developed algorithm is automatically integrated into the program, receiving all required data and being executable through the graphical user interface of the replication plug-in. Adding a new constraint is similarly easy. After the general constraint interface is implemented, the constraint is automatically integrated into the graphical user interface, enabling user input about the allowed cost increase. Additionally, the constraint interface of each algorithm has to be implemented; if the constraint interface of an algorithm is implemented, the constraint is automatically considered in the run of that algorithm.

3.1.3 Parvis - The analysis COTS component

To enhance visualization of the results, the replication plug-in exports its results in a format readable by the parallel coordinates visualization tool Parvis. Parvis uses the visualization technique called parallel coordinates [3], a way to visualize multi-dimensional data. In this technique each dimension is represented as a parallel axis, with equal distances between the axes. Each value of an n-dimensional data point is marked on the parallel axes and the marks are connected by one line, so that one data point can be traced over the different dimensions. To increase the readability of the visualization, the user can highlight single data points.

The replication tool exports its data using five dimensions; each component is a five-dimensional data point. The dimensions are the original reliability value, the improved reliability value, the number of component instances (replicas plus the original component), the number of services in which the component appears, and the average criticality of these services.
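
A minimal sketch of such a five-dimensional export is shown below. The field names and the whitespace-separated file layout are assumptions made for illustration; the concrete format expected by Parvis is not described here.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

/** Sketch of the five-dimensional result export for Parvis (names and format are hypothetical). */
class ParvisExport {

    static class ComponentResult {
        double originalReliability;
        double improvedReliability;
        int instanceCount;          // replicas + the original component
        int serviceCount;           // number of services the component appears in
        double averageCriticality;  // average criticality of those services
    }

    static void export(List<ComponentResult> results, String fileName) throws IOException {
        try (PrintWriter out = new PrintWriter(fileName)) {
            out.println("originalReliability improvedReliability instances services avgCriticality");
            for (ComponentResult r : results) {
                out.printf("%.3f %.3f %d %d %.2f%n",
                        r.originalReliability, r.improvedReliability,
                        r.instanceCount, r.serviceCount, r.averageCriticality);
            }
        }
    }
}
```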

3.2 Usage Description

3.2.1 The main views of the replication tool

This section shows an example execution of the tool. The replication plug-in can be invoked through a menu item in Eclipse. The screenshot above shows the deployment of an example embedded system: 12 components are deployed on four hosts. Each host is connected to every other host, as the black connecting lines indicate. If two components are connected, they are able to communicate. The properties of each entity (hosts, links and components) can be edited in the view below.

As described above, components together form services, which represent the use cases of the system for the user. Services and their component interactions can be modeled in the service view below.

3.2.2 The replication wizard

After the system has been modeled as desired, the algorithms can be invoked on the model using a graphical wizard. The screenshot above shows the first page of the wizard. In the first step the optimization strategies can be chosen; in the current version five strategies are implemented.

On the second and final page of the wizard, each constraint is displayed, showing the resources used in the whole system and the total resources available in that dimension. The user can enter how much cost increase he tolerates for each constraint. In the screenshot above a cost increase of 75% is entered; since 40.596 units of memory are already in use, 175% corresponds to 71.043 units of memory. The algorithm ensures that the solution does not consume more resources than allowed. If the specified increase exceeds the available resources, the available resources are the boundary for the algorithm.
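
A minimal sketch of how such a resource boundary might be derived from the tolerated cost increase is shown below; it reuses the 40.596-unit example from the screenshot, and the method name and the total-available figure are illustrative only.

```java
/** Sketch of the resource boundary handed to the algorithms (illustrative only). */
class ConstraintBoundary {

    /**
     * The user tolerates a relative cost increase (e.g. 0.75 for 75%); the resulting
     * boundary is capped by the total resources available in that dimension.
     */
    static double allowedUsage(double currentUsage, double toleratedIncrease, double totalAvailable) {
        return Math.min(currentUsage * (1.0 + toleratedIncrease), totalAvailable);
    }

    public static void main(String[] args) {
        // Example from the wizard: 40.596 memory units in use, 75% increase tolerated.
        System.out.println(allowedUsage(40.596, 0.75, 1000.0));   // 71.043
    }
}
```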

After the algorithm has computed the best replica set, it adds the replica components to the model. These components are highlighted by a grey bar at the bottom of the component. All properties can be examined in the same way as for the original components. In addition, a replica service is created in the service view, showing which components are connected to which replica components. In this view it can be analyzed how much data is transferred from each component to its replicas; the replication data transfer is inferred from the data exchange of each component within its services.

As an additional step, the effectiveness of the chosen algorithm can be analyzed using the integrated COTS tool Parvis. In the parallel coordinates result graph, a user sees at a glance how the reliability of each component has improved and can also read off other parameters such as service criticality. The results of the algorithms can therefore be validated visually.

4 Evaluation

For the evaluation of the system I developed three comparison heuristics as algorithm extensions for the replication tool. Since the domain is complex and the variable values are often hard to obtain, human engineers may use such heuristics when deciding on replicas. The first heuristic replicates each component of the system on each host, maximizing reliability. This strategy provides the maximum possible reliability in the system while being very costly; however, in some systems costs are not critical, so it is still a valid strategy. The second strategy is to replicate each component once, so that if a component fails, there is always one instance that can be used instead. This approach is less expensive than the first heuristic, but in unreliable environments, where components fail more often, it might be insufficient. The third strategy, replicating every component twice, can still be comparatively inexpensive while providing higher reliability. These more easily comprehensible selection strategies are compared against the exhaustive search and the greedy algorithm. Note that none of the three heuristics considers any constraints while creating replica components.

To evaluate the algorithms, I ran each described algorithm on three randomly generated distributed systems with parameters in the following data ranges. The units of attributes such as component size are not specified, since many different units are possible and this information is not essential for the algorithms.

Attribute               Range
Component Sizes         1..5
Service Criticality     1..3
Event Data Frequency    1..10
Event Data Size         1..10
Host Reliability        0.9..1
Component Reliability   0.7..1

The algorithms are compared along the dimensions of overall bandwidth usage increase, overall memory cost increase and the increase in the global optimization function value. Additionally, the tables show the reliability values of the two components with the lowest reliability after the optimization ($C_{w1}$ and $C_{w2}$), because these two components are the weakest points in the whole system and the most likely to fail. All three experiment runs have four hosts, four services and twelve components. Even for this small configuration the exhaustive search is no longer feasible and could not be executed; instead, the Replicate All strategy, in which each component is replicated on each host, was used as a benchmark. The results of the three experiments can be seen in the following tables. The first column, Base, shows the values before any replication strategy was applied. For Greedy (50%) the greedy algorithm was executed with both the memory and the bandwidth constraint allowing a cost increase of only 50%; for Greedy (100%) the allowed cost increase was 100%, and for Greedy (150%) it was 150%.

Table 1: Experiment 1

Name                  Bandwidth    Memory   Optimization   C_w1    C_w2
Base                      40.6     255.13          16.17   0.743   0.761
Greedy (50%)              56.59    366.02          21.59   0.838   0.853
Greedy (100%)             78.38    451.14          29.50   0.951   0.956
Greedy (150%)             97.43    565.91          30.76   0.974   0.988
Replicate All            162.38    871.07          30.97   0.997   0.998
Replicate each once       81.19    460.44          30.26   0.934   0.954
Replicate each twice     121.79    665.76          30.86   0.990   0.992

Table 2: Experiment 2

Name                  Bandwidth    Memory   Optimization   C_w1    C_w2
Base                      36.3     429.93          21.42   0.760   0.773
Greedy (50%)              48.68    660.67          26.94   0.824   0.826
Greedy (100%)             63.82    898.14          31.90   0.824   0.826
Greedy (150%)             76.36   1073.28          35.40   0.946   0.955
Replicate All            145.2    1970.28          36.97   0.997   0.998
Replicate each once       72.6     943.38          36.11   0.945   0.955
Replicate each twice     108.9    1456.83          36.83   0.987   0.988

Table 3: Experiment 3

Name                  Bandwidth    Memory   Optimization   C_w1    C_w2
Base                      32.25    426.42          28.78   0.774   0.775
Greedy (50%)              47.43    622.15          37.21   0.774   0.784
Greedy (100%)             64.39    866.15          43.21   0.946   0.957
Greedy (150%)             79.69   1113.92          44.69   0.990   0.995
Replicate All            129.02   1901.73          44.98   0.998   0.998
Replicate each once       64.41    918.19          44.15   0.944   0.948
Replicate each twice      96.76   1409.96          44.86   0.988   0.991

In general, all algorithms showed similar behavior across the three experiments. The first observation is that a cost increase of 50% improves the overall reliability, but in general too little compared to the maximum possible reliability of the Replicate All strategy. This can be seen especially in the overall optimization value of each experiment and in the small changes in the reliability of the most unreliable component. The Greedy (100%) strategy already produces better results, which are closer to the best possible solution. Replicating each component once yields slightly better results than the Greedy (100%) strategy while being only slightly more expensive. A similar trend can be observed when the Greedy (150%) strategy is compared to the strategy that replicates each component twice.

Both strategies compute almost the same results: Greedy (150%) performs slightly worse in the optimization value while having a slightly lower resource consumption. Both reach values that are close to the best possible solution.

How do these results transfer to replica selection in fault-tolerant systems? If a moderate reliability is sufficient or the hardware resources are constrained, the Greedy (100%) strategy or the strategy that replicates each component once is suitable. If high reliability is required, replicating every component twice or using the Greedy (150%) strategy is appropriate; both of these strategies require approximately three times the resources of the original deployment. It is interesting that a simple heuristic, such as replicating every component once, produces almost the same results as the more sophisticated greedy search. Since it might be difficult to obtain all reliability values and other parameters of the model, the simple heuristic can be used in many fault-tolerant distributed systems. If, however, more uneven constraint boundaries exist (e.g. a 65% allowed cost increase), the greedy algorithm should be selected.

5 Future Work

The developed framework helped to explore the problem space of the replication domain. Several questions remain open for future work. Since the greedy algorithm performs only slightly better than the simple heuristics, would a non-linear programming solver produce better results? How do the developed algorithms perform when they are bounded by new, additional constraints, and how could other resource constraints be modeled? The set of randomly created experiments helped to understand the different strategies better; as a next step the algorithms could be applied to real systems.

6 Contribution

In conclusion, this term project made several contributions:

- Design and implementation of a novel approach for component replica selection.
- A framework plus tool implementation to facilitate complex trade-off decisions between reliability and replica overhead costs based on service criticality.
- An architecture that provides extension ports for additional constraints and algorithms.
- Evaluation of different replica selection strategies.
- Design and COTS component integration of the parallel coordinates technique to visualize the results and the fault-tolerance improvements of the analyzed system.

7 References

[1] Edwards, G. et al. Scenario-Driven Dynamic Analysis of Distributed Architectures. USC-CSE-2006-617.
[2] Hauser, H., Ledermann, F., and Doleisch, H. Angular brushing of extended parallel coordinates. In Proc. of the IEEE Symposium on Information Visualization, pages 127-130, 2002.
[3] Inselberg, A. and Dimsdale, B. Parallel coordinates for visualizing multi-dimensional geometry. In CG International '87 on Computer Graphics (Karuizawa, Japan), T. L. Kunii, Ed., Springer-Verlag, New York, NY, pages 25-44, 1987.
[4] Malek, S. et al. A User Centric Approach for Improving a Distributed Software System's Deployment Architecture. USC-CSE-2006-602.
[5] OSGi Alliance. OSGi Service Platform, Release 3, March 2003.