Automatic Generation of Availability Models in RAScad

Dong Tang, Ji Zhu, and Roy Andrada
Sun Microsystems, Inc Network Circle, Santa Clara, CA {dong.tang, ji.zhu,

Abstract

RAScad is a Sun internal web-based reliability, availability, serviceability (RAS) architecture modeling and analysis tool for use in the computer system design and development phase. Two major goals of RAScad are making availability modeling possible for design engineers without a background in mathematical modeling and making availability modeling efficient for RAS engineers who understand the underlying mathematical models. To achieve these goals, RAScad integrates two modules: Model Generator (MG), which provides automatic model generation specific to Sun product RAS characteristics, and Graphical Model Builder (GMB), which provides general, graphical Markov, semi-Markov, and reliability block diagram modeling capabilities. An MG model is a hierarchical specification, in terms of an engineering language (MTBF, MTTR, redundancy, etc.), of the constituent components and associated parameters for the modeled system, and the user does not have to understand the underlying mathematical models generated by RAScad.

1. Introduction

As reliability, availability, and serviceability (RAS) are becoming increasingly important for networked computer server and storage systems running critical applications, RAS has been one of the major issues considered in designing such products. Manufacturers of these products have realized that the quantification of system-level RAS metrics needs to be performed in the early design phase. How to derive system-level availability and reliability measures from component-level RAS parameters (for both hardware and software) and how to relate models to field data have long been addressed in previous studies [1, 2, 3, 4, 5, 6, 10]. Many in-house and commercial software tools have also been developed to automate the evaluation procedure.
Some representative commercial dependability modeling tools are SHARPE [7], UltraSAN [8], and MEADEP [9]. Although these tools incorporate advanced modeling and evaluation techniques, they all require users to have a mathematical modeling background to build models with them. Model construction is time-consuming and error-prone even for an experienced modeler. In addition, these commercial tools are all stand-alone (not web-based) applications that lack support for connecting to existing enterprise RAS metrics databases (e.g., component-level MTBF and MTTR) and for file sharing across networks, which is desirable when a modeling effort is coordinated by a group of engineers located at different sites. It would be very beneficial to have a domain-specific tool that understands not only the mathematical language but also the engineering language from which mathematical models are generated automatically, and that addresses all of the above issues. RAScad is being developed to fill this void.

Two major goals of RAScad are (1) making availability modeling possible for design engineers without a background in mathematical modeling and (2) making availability modeling efficient for RAS engineers who understand the underlying mathematical models. To achieve these goals, RAScad integrates two modules: Model Generator (MG) and Graphical Model Builder (GMB). MG provides automatic model generation specific to Sun product RAS characteristics for use by system designers. GMB provides general, graphical Markov, semi-Markov, and reliability block diagram (RBD) modeling capabilities for use by RAS experts. RAScad also allows the combined use of MG models and GMB models. MG is used to develop a diagram/block model, which is a specification, in terms of an engineering language (MTBF, MTTR, redundancy, etc.), of the constituent components and associated parameters for the modeled system.
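To make the idea of an engineering-language specification concrete, the following Python sketch shows one way such a hierarchical diagram/block model could be held in memory. The class name, field names, and all numeric values are hypothetical and chosen only for illustration; they are not RAScad's internal representation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Block:
    """One component in a diagram; all field names are illustrative assumptions."""
    name: str
    quantity: int = 1            # number of components installed
    min_quantity: int = 1        # minimum number required by the system
    mtbf_hours: float = 0.0      # mean time between permanent failures
    mttr_minutes: float = 0.0    # total repair time
    subdiagram: List["Block"] = field(default_factory=list)

    @property
    def redundant(self) -> bool:
        # A block is redundant when more units are installed than required.
        return self.quantity > self.min_quantity


def walk(blocks: List[Block], level: int = 1) -> Iterator[Tuple[int, Block]]:
    """Yield (level, block) pairs depth-first; the root diagram is level 1."""
    for block in blocks:
        yield level, block
        yield from walk(block.subdiagram, level + 1)


# Hypothetical example values, not taken from any real product.
board = Block("System Board", mtbf_hours=200_000, mttr_minutes=60)
psu = Block("Power Supply", quantity=2, min_quantity=1,
            mtbf_hours=150_000, mttr_minutes=30)
system = Block("Server Box", subdiagram=[board, psu])

for level, block in walk([system]):
    print(level, block.name, "redundant" if block.redundant else "required")
```

The tree shape mirrors the hierarchical specification: a block either carries its own parameters or points to a subdiagram of further blocks.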
In the model solution procedure, RAScad translates the diagram/block model to RBDs and Markov chains, which are then solved using numerical methods. The MG user does not have to understand these underlying mathematical models. What is needed from the user is knowledge of the RAS architecture of the modeled system and a basic understanding of the MG diagram/block model structure. GMB is used to develop graphical RBD models and Markov/semi-Markov chains by drawing blocks, states, transitions, and other objects and by specifying related parameters hierarchically. To develop models using GMB, the user needs knowledge of RBD and Markov modeling and must understand how to map the

system RAS architecture to these models. However, GMB offers more powerful modeling capabilities than MG. For experienced users, GMB provides a very flexible and user-friendly environment for modeling system behaviors in great detail.

RAScad is implemented using Java™ technology and incorporates a rich set of features including:
- Automatic model generation
- Graphical Markov, semi-Markov, and RBD modeling and hierarchical approach
- A library of models for existing Sun products and integration with the component MTBF database
- Graphical output and parametric analysis capability
- File sharing across networks and documentation generation

In the following sections, we discuss MG only because it is the module that incorporates the automatic model generation, the topic of this paper.

2. RAS Characteristics Modeled

MG is intended to analytically assess and compare the RAS quantities achievable by the computer architectures under design. The tool is not intended to predict the actual field availability of a system. In particular, it is applicable to architectures with the RAS characteristics discussed in this section. The level of detail that can be modeled by MG is the Field Replaceable Unit (FRU), such as a CPU module or a power supply unit. Based on our investigation of the RAS architectures of Sun server products, the following RAS characteristics are identified as important for the generation of availability models:
- Redundancy
- Fault type (permanent/transient)
- Fault detection
- Fault recovery
- Logistic event
- Repair of faulty component
- Reintegration of repaired component

The redundancy feature is determined by the quantity and the minimum required quantity for the modeled component. In the current implementation of model generation, all redundant components of the same type are assumed to be functionally equivalent, or symmetric, and to have the same failure rate.
Model generation for the primary-standby and primary-secondary (e.g., cluster) architectures is work in progress.

A permanent fault refers to a hard failure of a component for which a physical repair action needs to be taken. A transient fault refers to an erroneous state in the system that is induced by cosmic rays, power surges, software defects, or environmental factors. In most cases, the erroneous state can be corrected by a restart of the system.

The detailed fault detection process is not modeled in MG, but the effect of fault detection is modeled: a fault is either detected or undetected (latent).

A recovery event occurs after a fault is detected. Depending on the redundancy and automatic recovery (AR) capability implemented in the architecture and operating system, the impact of the recovery event on user applications can be transparent or nontransparent. For example, if there are N+1 power supply units providing power sharing for the system, the failure of one power supply unit would have no effect on the applications and the recovery process is transparent to the user. In a server containing multiple CPUs, if AR for a CPU failure is implemented by system reboot, a CPU failure would trigger a reboot event to deconfigure the failed CPU. This recovery process is not transparent to the user. For both transparent and nontransparent recovery events, imperfect recovery needs to be modeled.

The logistic event follows the recovery event. The logistic event duration depends on the redundancy of the faulty component and on the maintenance strategy. If the faulty component is a required, non-redundant component, the system ceases operation upon the failure of the component. A call to customer service should be placed immediately, and the logistic time is just the service response time, the time for service personnel to arrive at the scene. If the faulty component is a redundant component, the system is still operational after recovering from the failure.
The repair of the faulty component can be scheduled at a later time (e.g., off-peak hours), and the waiting time before placing the service call is referred to as the service restriction time. The logistic event duration is thus the sum of the service restriction time and the service response time.

Similar to the recovery event, the repair event and the following reintegration event can be transparent or nontransparent. If the faulty component is hot-pluggable (it can be plugged in or out while the system is running) and the system supports dynamic reconfiguration for the component (i.e., the new component can be reintegrated on-line without service interruption), the repair event (including reintegration) is transparent to the user. If the faulty component is hot-pluggable and the system does not support dynamic reconfiguration for the component, the repair event is not transparent because the system has to be restarted to reintegrate the new component (incurring a short downtime). If the faulty component is not hot-pluggable, the repair event is not transparent because the system has to be powered off to replace the component (incurring a longer downtime). For both transparent and nontransparent repair events, imperfect repair (due to incorrect diagnosis or incorrect corrective action) needs to be modeled.

3. Model Generator GUI

The MG Graphical User Interface (GUI) is used to build a diagram/block model, from which the mathematical models are generated automatically. A diagram/block model


consists of MG diagrams and MG blocks. An MG diagram represents a system or subsystem and contains a number of MG blocks. Each MG block represents a component in the system modeled by the diagram and has a parameter list associated with it. An MG block can have a subdiagram to model the subcomponents in the component represented by the block. The root diagram is numbered level 1. All subdiagrams of the root diagram are numbered level 2, and so on. The overall diagram/block model is a tree structure of MG diagrams and MG blocks. Figure 1 and Figure 2 show a diagram/block model.

Figure 1. Diagram/block model level 1
Figure 2. Diagram/block model level 2

The first diagram (Data Center System) has four blocks: "Server Box", "Boot Drives, RAID1", "Storage 1, RAID5", and "Storage 2, RAID5". These four blocks are shown in a dark color, which means each of them has a subdiagram. The second diagram is the subdiagram of the block Server Box in the first diagram. This subdiagram consists of 19 blocks (System Board, CPU Module, etc.).

Each block has an associated parameter list. The parameters are as follows:
- Name: Name of this component
- Part Number: Part number of this component
- Description: User's description of this component
- Quantity: Quantity of this component
- Minimum Quantity Required: Minimum quantity of this component required by the system
- MTBF: Mean time between failures caused by permanent faults on this component (hours)
- Transient Failure Rate: Failure rate due to transient faults on this component (FIT, or failures/10^9 hours)
- MTTR Part 1, Diagnosis Time: Time to identify the failed component (min.)
- MTTR Part 2, Corrective Action Time: Time to replace the failed component (min.)
- MTTR Part 3, Verification Time: Time to verify the new component function or to restore lost data (min.)
- Service Response Time (Tresp): Time to wait for service (hours)
- Probability of Correct Diagnosis (Pcd): Probability of correctly identifying and replacing the faulty component (to model imperfect repair)

The following parameters are relevant only if Quantity is greater than Minimum Quantity Required (i.e., the block is a redundant component):
- Probability of Latent Fault (Plf)
- MTTDLF: Mean time to detect latent fault (hours)
- Automatic Recovery (AR) Scenario
  - Transparent: No downtime is associated with AR
  - Nontransparent: Downtime is associated with AR
- AR/Failover Time: User-defined downtime associated with AR (min.)
- Probability of SPF during AR (Pspf): Probability of single point of failure during AR
- SPF State Recovery Time (Tspf): Recovery time at the SPF state (min.)
- Repair Scenario
  - Transparent: No downtime in repair/reintegration
  - Nontransparent: Downtime is associated with repair/reintegration
- Reintegration Time: User-defined downtime associated with reintegration

While the above component parameters are local to a specific block, there are a few global parameters which apply to every block in the model, as shown on the Global Parameter Bar (below the Menu Bar) in Figures 1 and 2:
- Reboot Time (Tboot): Time to reboot the system.

- MTTM: Mean time to maintenance, or service restriction time. The average waiting time before the service call.
- MTTRFID: Mean time to repair from incorrect diagnosis.
- Mission Time: Time point used to calculate interval availability and reliability.

4. Models Generated

This section describes how a diagram/block model is translated to the underlying mathematical models: reliability block diagrams and Markov chains.

The modeling approach used in MG is based on the assumption that failures and repairs for different component types are independent. However, the possibility that a component failure causes a system failure is taken into account by the SPF state in the component model discussed below. Because component failures are independent, the probability of repairing multiple faulty components in a single service action is very low. The repair of multiple components in a service action seen in the field is most likely due to imperfect diagnosis/replacement, which is also modeled in the component model (by the Service Error state).

Given an MG diagram/block model as discussed in the previous section, each MG diagram is modeled by a serial RBD which consists of all the MG blocks in the diagram. Each block is then modeled by a Markov chain. The Markov chain may have a sub-RBD, depending on whether the corresponding block has a subdiagram. The overall model is a hierarchy of RBDs and Markov chains.

The system availability of an MG diagram containing n blocks is the product of the individual block availabilities, A_i (i = 1, 2, ..., n). How A_i is evaluated depends on the parameters associated with Block i. Let N represent Quantity and K represent Minimum Quantity Required. If there is no redundancy, i.e., N = K, A_i is evaluated from the Markov chain called Markov Model Type 0 (Figure 3). The states and parameters of the model are all explained in the figure. These parameters come from the block and global parameters discussed in the previous section.
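As a rough illustration of the serial RBD evaluation described above, the following Python sketch obtains each block availability from the steady-state solution of a small Markov chain and then multiplies the block availabilities. The two-state up/down chain, the function names, and the MTBF/MTTR values are assumptions made for this example; the chain is a simplified stand-in rather than Markov Model Type 0, and the solver is not RAScad's own.

```python
from math import prod


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small generator matrix Q."""
    n = len(Q)
    # Transpose Q, then replace the last equation with the normalisation
    # constraint sum(pi) = 1 so the system has a unique solution.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][c] * pi[c] for c in range(i + 1, n))
        pi[i] = (b[i] - s) / A[i][i]
    return pi


def two_state_availability(mtbf_hours, mttr_hours):
    """Availability of an assumed up/down chain (not Markov Model Type 0)."""
    lam = 1.0 / mtbf_hours   # failure rate
    mu = 1.0 / mttr_hours    # repair rate
    Q = [[-lam, lam],
         [mu, -mu]]
    up = [1, 0]              # 1 marks an operational state, 0 a failed state
    pi = steady_state(Q)
    return sum(p * u for p, u in zip(pi, up))


def diagram_availability(block_availabilities):
    """Serial RBD: the diagram is up only if every block is up."""
    return prod(block_availabilities)


# Hypothetical (MTBF hours, MTTR hours) pairs for three non-redundant blocks.
blocks = [(200_000.0, 2.0), (150_000.0, 1.0), (500_000.0, 4.0)]
system = diagram_availability(two_state_availability(m, r) for m, r in blocks)
print(f"system availability: {system:.8f}")
```

For the two-state chain the solver reproduces the familiar closed form MTBF / (MTBF + MTTR); the general solver is shown only because larger chains have no such shortcut.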
Each state is marked by either 1 or 0, which is a reward rate assigned to the state. A reward rate of 1 means the state is an operational (up) state. A reward rate of 0 means the state is a failure (down) state. The system availability is calculated based on the reward rate assignments [1, 6, 10].

If there is redundancy, i.e., N > K, A_i is evaluated from one of the four types of Markov chain discussed below. To simplify the discussion, we assume N = 2 and K = 1. That is, there are two components in the system and at least one of them is required for the system to function. For larger N and K values, more states are needed, and these states are all generated automatically in RAScad. The four Markov model types are determined by the four combinations of the parameters Automatic Recovery Scenario and Repair Scenario:
1. Transparent recovery, transparent repair
2. Transparent recovery, nontransparent repair
3. Nontransparent recovery, transparent repair
4. Nontransparent recovery, nontransparent repair

The Markov chain generated for case i above (i = 1, 2, 3, 4) is referred to as Markov Model Type i. The complexity of the model increases from Type 1 to Type 4. For illustration purposes, Markov Model Type 3 is shown in Figure 4 and is discussed here. The parameters in the model are either derived or directly obtained from the block and global parameters discussed in the previous section.

Figure 3. Markov Model Type 0
Figure 4. Markov Model Type 3

Markov Model Type 3 models nontransparent recovery and transparent repair. A detected permanent fault triggers an AR process (Ok → AR1). If the AR works, the system goes into a degraded mode (AR1 → PF1). Otherwise, it goes into the single point of failure state (AR1 → SPF),

where it stays for a period of time (Tspf) defined by the user. A non-detected permanent fault (latent fault) changes the system to another degraded mode, the latent fault state (Ok → Latent1). When the latent fault is detected after a delay of MTTDLF, the system has to go through the AR process again (Latent1 → AR1). In the PF1 state, a repair action takes place after a logistic event delay (MTTM + Tresp). If the repair (diagnosis and corrective action) is successful, the system goes back to the normal state (PF1 → Ok). Otherwise, it has to go through the service error state (PF1 → ServiceError), which represents a longer downtime (MTTRFID). If a second fault occurs while the system stays in the degraded mode (PF1 or Latent1), it goes to state PF2 if the fault is permanent or to TF2 if the fault is transient. In PF2, an immediate service call is placed to initiate a repair action. In the situation of a transient fault, either the first fault (Ok → TF1) or the second fault adding to a permanent fault (PF1 → TF2), the system clears the fault by an AR process. If the AR process does not work (e.g., due to data corruption), the system has to go through the SPF state.

As indicated in the figure, the number of states in the model is determined by N and K. For example, if N − K > 1, states TF1, AR1, PF1, and Latent1 will be repeated in the model. Because the model size varies, the implementation generates the internal matrix representation of the Markov models instead of the graphical representation.

The system measures generated by RAScad include:
- Steady-state availability, failure and recovery rates
- Interval availability, failure and recovery rates for (0, T), where T is the Mission Time defined in Section 3
- For the reliability model: MTTF, reliability at T, interval failure rate for (0, T), and hazard rate for the time increment in a loop

5. Conclusions

In this paper, we discussed the automatic model generation in RAScad, a RAS modeling tool that can generate mathematical models from an engineering specification and that provides tool access and model sharing across the Internet. Although the model generation method discussed in this paper was developed based on the Sun server architectures, we believe it is applicable to other server architectures available in the market because their recovery and repair processes are likewise either transparent or nontransparent, and these properties are the key elements determining the structure of the Markov models in our automatic model generation.

RAScad has been validated by comparing its results with those generated by SHARPE [7] and MEADEP [9] for selected example models and with field data collected from two large operational E10000 servers over 15 months. The availability and reliability results generated from the GMB models match very well with those from the above-mentioned commercial tools, and for the MG models, the relative errors in yearly downtime are all less than 0.2%. RAScad has been used to develop availability models for a variety of Sun system products during the development phase and is being used in the design of the availability architecture for the next generation of Sun products.

Acknowledgments

The authors would like to thank Vijay Radhakrishnan for his good GUI programming work. Special thanks go to Helen Cunningham, William Bryson, Emrys Williams, Steve Kendall, Swami Sankaran, Robert White, David Wonnacott, and Stefan Myslicki for their valuable comments on RAScad.

Sun, Sun Microsystems, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

References

[1] A. Goyal, S. S. Lavenberg and K. S. Trivedi, "Probabilistic Modeling of Computer System Availability," Annals of Operations Research, No. 8, March 1987, pp
[2] M. C. Hsueh, R. K. Iyer and K. S. Trivedi, "Performability Modeling Based on Real Data: A Case Study," IEEE Transactions on Computers, April 1988, pp
[3] J. C. Laprie, "Dependability Evaluation of Software Systems in Operation," IEEE Transactions on Software Engineering, Nov. 1984, pp
[4] J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Transactions on Computers, Aug. 1980, pp
[5] D. K. Pradhan (Ed.), Fault Tolerant Computer System Design, Prentice Hall PTR, Upper Saddle River, NJ,
[6] A. Reibman, R. Smith and K. Trivedi, "Markov and Markov Reward Model Transient Analysis: An Overview of Numerical Approaches," European Journal of Operational Research, Vol. 40, 1989, pp
[7] R. A. Sahner and K. S. Trivedi, "Reliability Modeling Using SHARPE," IEEE Transactions on Reliability, Feb. 1987, pp
[8] W. H. Sanders, W. D. Obal II, M. A. Qureshi and F. K. Widjanarko, "The UltraSAN Modeling Environment," Performance Evaluation, Oct./Nov. 1995, pp
[9] D. Tang, M. Hecht, J. Miller and J. Handal, "MEADEP: A Dependability Evaluation Tool for Engineers," IEEE Transactions on Reliability, Dec. 1998, pp
[10] K. S. Trivedi, Probability & Statistics with Reliability, Queuing and Computer Science Applications, Prentice Hall, Englewood Cliffs, NJ, 1982.
