SoS Dependability Assessment: Modelling and Measurement


DSoS IST Dependable Systems of Systems

SoS Dependability Assessment: Modelling and Measurement

Report Version: Deliverable CSDA3
Report Preparation Date: October 2002
Classification: Public Circulation
Contract Start Date: 1 April 2000
Duration: 36m
Project Co-ordinator: Newcastle University
Partners: DERA, Malvern UK; INRIA France; CNRS-LAAS France; TU Wien Austria; Universität Ulm Germany; LRI Paris-Sud - France

Project funded by the European Community under the Information Society Technology Programme ( )

LAAS-CNRS Report No

Table of Contents

1. Introduction

2. SoS Dependability Modelling: The Travel Agency Example
   The Travel Agency (TA) Presentation
   Function and User Levels
   Service and Function Levels
   Resource Level
   TA Availability Modelling
   Service Level Availability
   External services
   Internal services
   Function Level Availability
   User Level Availability
   Evaluation Results
   Summary

Measurement-based Evaluation
   Target system architecture
   Data collection and processing approach
   Event logging in Unix
   Event logging in Windows NT and 2K
   Data collection strategy
   Data processing
   Application to Unix Systems
   Identification of reboots
   Distribution of reboots per machine
   Machine uptimes and downtimes evaluation
   Availability evaluation
   Dependencies among machines
   Dependencies related to reboot events
   Dependencies related to SNR events
   Application to Windows NT and 2K Systems
   Identification of reboots
   Distribution of reboots per machine
   Reboot causes analysis
   Uptime and downtime evaluation
   Availability evaluation
   Summary

Conclusion

References

Mohamed Kaâniche, Karama Kanoun, Magnos Martinello, Cristina Simache
LAAS-CNRS (Toulouse, France)

1. Introduction

This report summarizes the work carried out within the VA Workpackage on SoS dependability modelling and assessment. Two complementary approaches are considered to support SoS dependability evaluation: (i) analytical modelling, and (ii) measurement-based assessment. Modelling guides the development of the target SoS during the design phase by providing quantitative measures that characterize its dependability at the successive design stages. Various design alternatives can be analysed and assessed in order to choose the solution that best satisfies the requirements. The hierarchical modelling approach proposed in deliverable DMS1 [Kaâniche et al. 2001] was defined to allow the easy construction and refinement of SoS dependability models during the early design stages. Measurement experiments are needed to provide estimates for the parameters used in the models and to demonstrate the validity of the modelling assumptions and evaluations.

This report is structured into two main parts. Part 1 concerns SoS dependability modelling. Part 2 addresses the estimation of dependability-related measures and parameters characterizing component systems, based on data collected from operation.

Part 1 illustrates the main concepts of the hierarchical modelling framework proposed in deliverable DMS1 [Kaâniche et al. 2001] for the dependability evaluation of systems of systems, using the travel agency (TA) case study described in deliverable DMS3 [Periorellis & Dobson 2001] as an example. The objectives are: 1) to show how to apply the framework, which decomposes the target SoS into four levels (user, function, service and resource levels), and 2) to present typical dependability analysis and evaluation results obtained from modelling, to help SoS providers make objective design decisions. In particular, several sensitivity analysis results are presented to illustrate the impact on the user-perceived availability of various assumptions concerning, e.g., the users' operational profile, the TA architecture and the fault coverage. The availability measure considered takes into account the combined impact of performance-related failures and traditional software and hardware failures.

The application of this framework requires the estimation of several model parameters using data collected from the field. This issue is covered in Part 2. The ideal situation would be to collect data from an operational TA SoS for which detailed information on the architecture is available. However, such an SoS is not available in the context of DSoS. To illustrate the type of measurement-based studies that can be carried out, the LAAS computing network is used as an example. The results presented in Part 2 are based on event logs collected during a three-year observation period from 373 SunOS/Solaris Unix machines, 76 Windows NT and 89 Windows 2K systems interconnected through the LAAS network. Identifying useful trends in large event logs is a time-consuming task that requires thorough manual analyses. In our study, we have focused on the identification of machine reboots and on the evaluation of statistical measures characterizing: a) the reboot distribution and occurrence rate, per machine, b) the distribution of uptimes and downtimes associated with these reboots and the corresponding availability, c) the classification of reboot causes and d) the analysis of error dependencies among machines.

2. SoS Dependability Modelling: The Travel Agency Example

The aim of this part is i) to illustrate the main concepts of the SoS hierarchical dependability modelling framework proposed in deliverable DMS1 and ii) to show its applicability by considering the travel agency case study presented in deliverable DMS3 as an example.

Dependability evaluation is performed in two main steps:

1) Hierarchical description of the SoS and its interactions with the users, from the functional and structural points of view; this step consists in identifying and structuring the main information needed to support SoS dependability modelling.
2) Hierarchical construction and solution of the SoS dependability model based on the information provided at step (1).

The information needed to describe the SoS behaviour from the user perspective is structured into four levels. The first level (user level) describes how the users interact with the SoS, and the three remaining levels (function, service and resource levels) detail how the user requests are implemented by the SoS. More specifically, the proposed levels are defined as follows:

- The user level describes the user operational profile in terms of the types of SoS functions invoked and the probability of activation of each of them.
- The function level describes the set of functions available at the SoS provider site.
- The service level describes the main services needed to implement each function and the interactions among them. Two categories of services are distinguished: those provided by the SoS provider (internal services) and those provided by external suppliers (external services).
- The resource level describes the architecture on which the services identified at the service level are implemented. At this level, the architecture and the fault tolerance and maintenance strategies implemented at the SoS provider site are detailed. However, each service provided by an external supplier is represented by a single resource that is considered as a black box.

This is illustrated in Figure 2.1, where the dependability measure considered is availability. The figure shows that the SoS availability modelling and evaluation step is directly related to the SoS hierarchical description. It is defined in such a way that the outputs of a given level are used in the level immediately above to compute the availability measures associated with that level (denoted by A(x), where x is a user, a function, a service or a resource). Accordingly, at the service level, the availability of each service is derived from the availability of the resources involved in the accomplishment of this service. Similarly, at the function level, the availability of each function is obtained from the availability of the services implementing it. Finally, at the user level, the availability measures are obtained from the availability measures of the functions invoked by the users.

Various techniques can be used to model each level of the hierarchy: fault trees, reliability block diagrams, Markov chains, stochastic Petri nets, etc. The selection of the right technique for each level mainly depends on the kinds of dependencies between the elements of the considered level and on the quantitative measures to be evaluated. In Section 2.2, we will mainly make use of block diagrams and Markov chains to evaluate the availability of the travel agency.

Although the dependability measure considered in Figure 2.1 is availability, other quantitative measures can be evaluated following the same approach, e.g., reliability and performability-related measures. This is illustrated in particular on the travel agency example, where availability measures taking into account performance-related failures are considered (see Section 2.2.1).

[Figure 2-1. SoS hierarchical availability modelling. Left: SoS description at the user level (users 1 to N, each following a path Start, F1 ... Fn, Exit), the function level (F1 ... Fn), the service level (internal services Si 1 ... Si m of the SoS provider and external services Se 1 ... Se p of the external suppliers) and the resource level (Ri 1 ... Ri k, Re 1 ... Re p). Right: SoS availability modelling at each level, producing A(Ri), A(Re), then A(Si), A(Se), then A(F 1) ... A(F n), then A(user 1) ... A(user N).]

The above presentation shows that we need to structure the information about the target system to characterize each level of the hierarchical model (user, function, service and resource levels). The rest of this part is organised as follows. Section 2.1 presents the travel agency according to the above hierarchical description. Section 2.2 concentrates on modelling the availability of the travel agency. Section 2.3 gives some examples of dependability evaluation results.

2.1 The Travel Agency (TA) Presentation

The TA is designed to allow users to plan and book trips over the web. To this end, the TA interacts through dedicated linking interfaces (LIFs) with several flight reservation, hotel booking and car rental component systems. The TA described in CS1 [Periorellis 2001, Periorellis & Dobson 2001] is composed of two basic components: the travel agent front end client side, denoted as TAFE-CS, and the travel agent front end server side, denoted as TAFE-SS (Figure 2.2). The TAFE-CS handles the user's inputs, performs the necessary checks and forwards the data to the TAFE-SS by calling the Abstract Service Interface. The TAFE-SS is the main component of the TA SoS. It is designed to respond to a number of calls from the TAFE-CS concerning, for instance, availability checking, booking, payment and cancellation of each item of a trip. The TAFE-SS handles all transactions to and from the booking systems, composes items into full trips, converts incoming data into a common data structure supported throughout the SoS and, finally, handles all exceptions.

[Figure 2-2. The TA high-level structuring: users, TAFE-CS, trip details passed through the Abstract Service Interface to the TAFE-SS, and the Flight, Hotel and Car interfaces to the flight reservation, hotel booking and car rental component systems.]

Starting from this very high-level description, we further detail the system according to the various aspects required for the hierarchical description. We first focus on the function and user levels together, then on the service and function levels, before addressing the resource level.

2.1.1 Function and User Levels

To fulfil its three main purposes (flight, hotel and car reservation), the TA supplies the users with various functions. The successive execution of these functions allows the users to obtain the actions or information they seek from the TA. We have identified seven such functions, defined as follows:

- Start: this state identifies the start of a customer visit to the TA web site.
- Home: this state is reached when a customer accesses the TA home page.
- Browse: in this state, the customer navigates through the links available at the TA web site to view any of the pages of the site. These links include, for example, the weekly promotions, help pages, frequent queries, etc.
- Search: here, the TA checks the availability of trip offers corresponding to the criteria specified by the customer. A user request can be composed of a flight, a hotel and a car reservation. Based on the information provided by the user, the TA converts the user request into transactions to several hotel, flight and car reservation component systems and returns the results of the search to the user.
- Book: the customer chooses the trip that suits his request and confirms his reservation.
- Pay: this state is reached when the customer is ready to pay the reservation fees for the trips booked on the TA site.
- Exit: end of the customer visit to the TA site.

Operational profile

To characterize the behaviour of the users accessing the TA web site, we consider the operational profile example presented in Figure 2.3, where the nodes represent the various functions identified above. The transitions among the nodes and the associated probabilities p_ij describe how the users interact with the TA web site. A given class of users is defined by a specific set of probabilities p_ij. These probabilities are usually obtained by collecting data on the web site (see e.g., [Menascé et al. 2000]).

[Figure 2-3. User operational profile graph: nodes Start, Home, Browse, Search, Book, Pay and Exit, with transitions labelled p12, p13, p23, p24, p27, p32, p33, p34, p37, p44, p45, p47, p54, p56, p57 and p67.]

User execution scenarios

Let us first assume that the various probabilities p_ij are specified. Each path from node Start to node Exit of the operational profile denotes a user execution scenario (or, in short, a user scenario) when visiting the TA web site. The probability of activation of each path denotes the relative frequency of the corresponding user scenario compared to the other scenarios of the same class.

Table 2.1 lists all the user scenarios derived from the example of Figure 2.3 and the associated probability of activation, as obtained from processing the user profile graph. The parameters p_ij are the probabilities associated with the transitions of the user operational profile. The notations {Home-Browse}* and {Search-Book}* mean that these functions are activated more than once in the corresponding scenarios, due to the presence of cycles in the graph (see footnote 1).

Table 2-1. TA user execution scenarios and associated probabilities (π_i)

User scenario
 1: Start-Home-Exit
 2: Start-Browse-Exit
 3: Start-{Home-Browse}*-Exit
 4: Start-Home-Search-Exit
 5: Start-Browse-Search-Exit
 6: Start-{Home-Browse}*-Search-Exit
 7: Start-Home-{Search-Book}*-Exit
 8: Start-Browse-{Search-Book}*-Exit
 9: Start-{Home-Browse}*-{Search-Book}*-Exit
10: Start-Home-{Search-Book}*-Pay-Exit
11: Start-Browse-{Search-Book}*-Pay-Exit
12: Start-{Home-Browse}*-{Search-Book}*-Pay-Exit

[Scenario activation probability (π_i) column: closed-form expressions in the p_ij, with cycle factors such as (1 - p33), (1 - p44), (1 - p33 - p32 p23) and (1 - p44 - p45 p54); the individual expressions are not legible in this copy.]

Footnote 1: Traditional techniques for computing path probabilities are presented e.g., in [Kemeny & Snell 1959] J. G. Kemeny and J. L. Snell, Finite Markov Chains, Princeton, NJ: Van Nostrand, 1959, and [Howard 1971] R. A. Howard, Dynamic Probabilistic Systems Volume I: Markov Models, 576p., John Wiley & Sons, Inc., New York, 1971.
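As a worked illustration of how the cycles of the graph enter such expressions, consider scenario 5 (Start-Browse-Search-Exit). The derivation below is a sketch under two assumptions that should be checked against the original table: the node numbering Start = 1, Home = 2, Browse = 3, Search = 4, Book = 5, Pay = 6, Exit = 7 suggested by the transition labels of Figure 2.3, and the convention that repeated visits through the self-loops p_33 (Browse) and p_44 (Search) are counted within the same scenario. Summing over the number of times k and m that each self-loop is taken gives two geometric series:

```latex
\begin{align*}
\pi_5 &= \sum_{k=0}^{\infty}\sum_{m=0}^{\infty}
         p_{13}\,p_{33}^{k}\,p_{34}\,p_{44}^{m}\,p_{47} \\
      &= p_{13}\,p_{34}\,p_{47}
         \left(\sum_{k=0}^{\infty} p_{33}^{k}\right)
         \left(\sum_{m=0}^{\infty} p_{44}^{m}\right) \\
      &= \frac{p_{13}\,p_{34}\,p_{47}}{(1-p_{33})\,(1-p_{44})},
         \qquad p_{33} < 1,\; p_{44} < 1 .
\end{align*}
```

Under the same assumptions, scenario 1 (Start-Home-Exit) contains no cycle, so its probability is simply the product p_12 p_27.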

Sensitivity analyses based on the equations presented in Table 2.1 allow us to understand how the parameters p_ij affect the probabilities associated with each path. Such analyses are useful to identify the most significant scenarios to be considered when evaluating the SoS availability as perceived by the users. Indeed, the higher the probability of activation of a given scenario, the higher its impact on the availability perceived at the user level. This availability measure is affected by the availability of the functions, services and resources involved in the corresponding user scenario.

The scenarios listed in Table 2.1 can be grouped into four categories, denoted as SC1, SC2, SC3 and SC4, according to the activated functions:

- SC1 gathers all scenarios that lead to the execution of functions Home or Browse without invoking the other functions (i.e., scenarios 1-3).
- SC2 gathers all scenarios that include the invocation of the Search function, without going through the Book or Pay functions (i.e., scenarios 4-6). These scenarios may require several interactions between the TA and the flight, hotel and car reservation component systems. However, they do not end up with a booking or payment.
- SC3 gathers all scenarios that include the invocation of the Book function without reaching the Pay function (i.e., scenarios 7-9). These scenarios involve several interactions between the TA and the booking systems.
- SC4 gathers all scenarios that reach the Pay function (i.e., scenarios 10-12). These scenarios end up with a payment.

Let us denote by π(SC1), π(SC2), π(SC3) and π(SC4) the activation probabilities of SC1, SC2, SC3 and SC4. These probabilities can be obtained from Table 2.1 by summing the probabilities associated with the corresponding scenarios.

Example of two user classes

For our example, we define two customer profiles (denoted as user class A and user class B), with different values for the transition probabilities p_ij. The class A profile is characterized by a high proportion of users who are mainly seeking information without any intention to buy, whereas the class B profile is characterized by a higher proportion of users who actually intend to book a trip. Tables 2.2 and 2.3 give the transition probability matrices associated with the class A and class B profiles, respectively. The associated scenario probabilities are given in Table 2.4 (as percentages).
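The category probabilities π(SC1) to π(SC4) can also be computed directly from a transition matrix, without enumerating the twelve scenarios. The Python sketch below is an illustration only: the matrix P holds hypothetical values (not those of Tables 2.2 or 2.3), the state ordering Start, Home, Browse, Search, Book, Pay, Exit is assumed, and the helper names are invented for this example. It relies on the structure of Figure 2.3, where Book can only be reached through Search and Pay only through Book, so that each category is the difference between two "reach before Exit" probabilities.

```python
# States of the operational profile graph, indexed 0..6 (assumed ordering).
START, HOME, BROWSE, SEARCH, BOOK, PAY, EXIT = range(7)

# HYPOTHETICAL transition probabilities p_ij, for illustration only.
# Only the transitions drawn in Figure 2.3 are non-zero; each row sums to 1.
P = [
    # Start Home Browse Search Book Pay  Exit
    [0.0, 0.6, 0.4, 0.0, 0.0, 0.0, 0.0],  # Start
    [0.0, 0.0, 0.3, 0.4, 0.0, 0.0, 0.3],  # Home
    [0.0, 0.2, 0.2, 0.3, 0.0, 0.0, 0.3],  # Browse
    [0.0, 0.0, 0.0, 0.2, 0.3, 0.0, 0.5],  # Search
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.4, 0.3],  # Book
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pay
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Exit (end of visit)
]


def reach_probability(P, target, start=START, tol=1e-12, max_steps=100_000):
    """Probability that a visit starting in `start` reaches `target` before Exit."""
    mass = [0.0] * len(P)
    mass[start] = 1.0
    reached = 0.0
    for _ in range(max_steps):
        nxt = [0.0] * len(P)
        for i, m in enumerate(mass):
            if i == EXIT or m == 0.0:
                continue  # a visit that reached Exit is over
            for j, p in enumerate(P[i]):
                if j == target:
                    reached += m * p  # first arrival at the target: stop tracking
                else:
                    nxt[j] += m * p
        mass = nxt
        if sum(mass[:EXIT]) < tol:  # no visit is still in progress
            break
    return reached


def category_probabilities(P):
    """pi(SC1)..pi(SC4), using: Book is reached only via Search, Pay only via Book."""
    r_search = reach_probability(P, SEARCH)
    r_book = reach_probability(P, BOOK)
    r_pay = reach_probability(P, PAY)
    return {
        "SC1": 1.0 - r_search,      # never reaches Search
        "SC2": r_search - r_book,   # reaches Search but never Book
        "SC3": r_book - r_pay,      # reaches Book but never Pay
        "SC4": r_pay,               # reaches Pay
    }


for name, value in category_probabilities(P).items():
    print(f"pi({name}) = {value:.4f}")
```

Replacing P with the matrix of a given user class would give that class's category probabilities, which can then be cross-checked against the sums of the corresponding scenario probabilities.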

Table 2-2. User class A profile
[Transition probability matrix p_ij with rows and columns Start, Home, Browse, Search, Book, Pay and Exit; the numerical entries are missing from this copy.]

Table 2-3. User class B profile
[Transition probability matrix p_ij with rows and columns Start, Home, Browse, Search, Book, Pay and Exit; the numerical entries are missing from this copy.]

Table 2-4. User scenario probabilities (in %) for user classes A and B
[Columns: user scenario, π_i for class A, π_i for class B; rows: the twelve scenarios of Table 2.1, from 1: Start-Home-Exit to 12: Start-{Home-Browse}*-{Search-Book}*-Pay-Exit; the numerical entries are missing from this copy.]

Table 2.5 gives the probabilities π(SC1), π(SC2), π(SC3) and π(SC4) associated with the scenario categories SC1 to SC4, corresponding to scenarios involving functions up to Browse, Search, Book and Pay respectively. User class B exhibits a higher probability of activation for scenarios SC2, SC3 and SC4 than user class A. In particular, about 80% of class B user transactions (SC2, SC3 and SC4 together) involve the external reservation systems in addition to the TA SoS, whereas this percentage is only around 50% for the class A profile. Moreover, the percentage of transactions that end up with the payment of a trip is around 20% for user class B, while it is almost 3 times lower for user class A.

Table 2-5. π(SC1), π(SC2), π(SC3) and π(SC4) for user classes A and B

          π(SC1)   π(SC2)   π(SC3)   π(SC4)
Class A   47.9%    38.2%    6.4%     7.5%
Class B   20.8%    44.0%    14.9%    20.3%

These two examples of user classes will be used in Section 2.3 to evaluate the user availability.

2.1.2 Service and Function Levels

The service level identifies the set of servers involved in the execution of each function and describes their interactions. This analysis requires a deep understanding of the business logic and the technical solutions implemented by the TA SoS provider. For the sake of illustration, Table 2.6 gives a simplified example of mapping between the functions provided at the TA SoS site, the internal servers directly controlled by the TA SoS provider and the external servers operated and controlled by external suppliers.

Table 2-6. Mapping between functions and services

          Internal services               External services
          Web   Application   Database    Flight   Hotel   Car   Payment
Home      x
Browse    x     x             x
Search    x     x             x           x        x       x
Book      x     x             x           x        x       x
Pay       x     x             x                                  x

The external suppliers correspond to the flight reservation component systems (AF, KLM, BA, ...), hotel reservation component systems (Sofitel, Holiday Inn, ...) and car rental component systems (Hertz, Avis, Europcar, ...) that provide information on the corresponding items of a trip. Also, we assume that the SoS provider uses the services of an external payment component system for handling card-based transactions.

The internal services are supported by three types of servers:

1) Web servers that receive user requests and send back the requested data.
2) Application servers that implement the main operations needed to process user requests.
3) Database servers handling data-related operations (for storing and retrieving information about flight reservation, hotel booking and car rental companies, as well as information on customer orders) (see footnote 2).

Footnote 2: If we refer to the TA high-level design presented in Figure 2.2, the web servers will typically host the travel agency front-end client side (TAFE-CS) component and the application servers will host the front-end server side (TAFE-SS) components (including the Abstract Service Interface, and the Flight, Hotel and Car LIFs).

The execution of the Home function involves only the web server. For the other functions, however, several servers are involved. In this case, it is necessary to analyse, for each function, the interactions among the servers involved. As for the user level, we have to identify for each function all possible function execution scenarios (also referred to as function scenarios). This is achieved through the interaction diagram dedicated to each function. Examples of interaction diagrams for the Browse, Search, Book and Pay functions are given hereafter.

Browse

Figure 2.4 describes the interactions among the servers involved in the accomplishment of the Browse function. The Begin and End nodes identify the beginning and the end of each function execution. Each path from the Begin node to the End node identifies one possible function scenario. The probability of activation of each scenario can be evaluated by taking into account the probabilities q_ij associated with the transitions involved in the corresponding scenario. Note that the probability of activation of non-labelled transitions is one.

[Figure 2-4. Interaction diagram of the Browse function: Begin (node 1), WS (node 2), then either End (node 3) or AS (node 4); from AS either WS and End (nodes 5 and 6) or DS, AS, WS and End (nodes 7 to 10); labelled transitions q2,3, q2,4, q4,5 and q4,7. WS: web server, AS: application server, DS: database server.]

We can identify three scenarios, described as follows:

- 1-2-3: The user sends a request to the web server (node 2). The requested data is available in the local cache and is returned to the user (node 3). This marks the end of this interaction.
- 1-2-4-5-6: The web server accepts the request from the user and sends it to the application server (node 4). In this case the requested data is not available in the local cache. The application server processes the user request and returns a dynamically generated page to the web server (node 5). The page is then forwarded to the user (node 6). The database is not involved in this case.
- 1-2-4-7-8-9-10: The application server requires some specific information that is held on the TA database server (node 7). After the database server has answered the application server, the latter processes the user request (node 8) and sends the results to the web server (node 9). The web server generates an HTML page incorporating the corresponding outputs (node 10).

Search

The interaction diagram describing the execution of the Search function is decomposed into 9 stages (Figure 2.5). The input data provided in the search request issued by the user (node 1) are first processed by the web server WS (node 2). WS performs the necessary checks and then breaks down the user request into three individual requests corresponding to each aspect of the trip. If the data is correct and in the right format, it is forwarded to the application server AS (node 4); otherwise an exception is sent to the user (node 3). AS uses the request information to formulate a query and asks the database server (node 5) for the list of component booking systems to be contacted. Based on the answer received, AS sends a query (node 6) to the selected systems (identified by the Flight, Hotel and Car nodes in our example). The AND operator means that the request is submitted to the three types of booking systems (nodes 7.a, 7.b, 7.c). The answers returned to AS are formatted by AS (node 8) and sent to WS (node 9), which forwards them to the user (node 10). The number of Flight, Hotel and Car reservation systems contacted is not indicated in this figure. We assume that the TA SoS always interacts with the same booking systems. We also assume that a transaction is successful when, for each type of service (Flight, Hotel and Car reservation), at least one system responds to the request submitted by AS.
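The success criterion just stated for a Search transaction can be sketched in a few lines of Python. This is an illustrative sketch, not the model used in the report: it assumes that the booking systems fail independently of each other, the availability values are hypothetical, and the function names are invented for the example. The first helper gives the probability that at least one system of a given type responds; the second combines the three types required by the AND operator; the third applies the same criterion to one observed transaction.

```python
from math import prod


def at_least_one_up(availabilities):
    """Probability that at least one of several independent systems responds."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def search_external_availability(flight, hotel, car):
    """Probability that every service type has at least one responding system."""
    return at_least_one_up(flight) * at_least_one_up(hotel) * at_least_one_up(car)


def search_succeeded(responses):
    """Success criterion for one observed transaction.

    `responses` maps each service type to a list of booleans, one per
    contacted system (True if that system answered the request from AS).
    """
    return all(any(answers) for answers in responses.values())


# HYPOTHETICAL availabilities of the contacted booking systems.
flight = [0.9, 0.9]        # two flight reservation systems
hotel = [0.9, 0.9, 0.9]    # three hotel booking systems
car = [0.95]               # a single car rental system

print(f"{search_external_availability(flight, hotel, car):.4f}")

observed = {"flight": [True, False], "hotel": [False, True, True], "car": [True]}
print(search_succeeded(observed))
```

The product over the three service types reflects the AND operator of the interaction diagram, while the complement of the product of unavailabilities within a type reflects the "at least one system responds" condition.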
[Figure 2-5. Interaction diagram of the Search function: Begin, WS, then either End (node 3, transition q2,3) or AS (transition q2,4), DS, AS, an AND split to Flight (7.a), Hotel (7.b) and Car (7.c), then AS, WS and End.]

Book

An example of interaction diagram for the Book function is given in Figure 2.6. In this example, the trip booking order received from the user through the web server is processed by the application server. Using the parameters embedded in the book order associated with the selected trip, the application server interacts with the corresponding flight, hotel and car booking systems to book the selected trip. The booking references returned to the application server are then stored in the database, before a confirmation is sent to the user through the web server.

[Figure 2-6. Interaction diagram of the Book function: Begin, WS, AS, an AND split to Flight (4.a), Hotel (4.b) and Car (4.c), then AS, DS, AS, WS and End.]

Pay

The interaction diagram for the Pay function is presented in Figure 2.7. When a payment call is received through the web server, the booking data is first checked by the application server; a call is then sent to the payment server for authentication and verification purposes, and also to accomplish the payment. Finally, the application server updates the information in the database concerning client orders, before sending a confirmation to the user.

[Figure 2-7. Interaction diagram of the Pay function: Begin, WS, AS, PS, AS, DS, WS and End. WS: web server, AS: application server, DS: database server, PS: payment server.]

2.1.3 Resource Level

The various services are mapped onto the resources involved in their accomplishment. We therefore need to take into account the real hardware and software organisation of the SoS. With respect to external services, as the architecture on which these services are implemented is not known, we associate with each external service a single resource that is considered as a black box. For internal services, whose architecture is known, it is possible to detail the organisation of the internal resources.

Different architectural solutions are possible for implementing the internal services. In particular, several alternatives corresponding to different organisations of the servers on the hardware support (e.g., dedicated hosts for each server vs. multiple servers on the same host) or different fault tolerance strategies (non-redundant servers vs. replicated servers) might be analysed and compared from the availability point of view. Replicated servers can be located at one site or be geographically distributed at distinct sites. Also, fault tolerance can be applied to provide redundant accesses to the Internet or redundant communication links between internal resources. Additionally, the architecture solutions might be compared with regard to the maintenance strategy adopted by the SoS provider (e.g., immediate maintenance vs. deferred maintenance, dedicated vs. shared repair resources).

For illustration purposes, we consider the two architectures presented in Figures 2.8 and 2.9. The basic architecture (Figure 2.8) consists in allocating a dedicated host to each server and interconnecting these hosts through a LAN. The LAN is viewed as a single resource providing communication between the servers. Concerning external services, we assume that the flight, hotel and car reservation systems are composed of NF, NH and NC components respectively.

The basic architecture suffers from several weak points due to the lack of redundancy and scalability. The architecture described in Figure 2.9 applies redundancy in several places to reduce some of these weaknesses. The TA SoS provider site architecture is based on a server farm configuration with load balancing. This redundant architecture is based on NW web servers, two application servers and two database servers with two mirrored disks. The servers are connected through a LAN (that can be replicated). Several LANs are generally used to interconnect these servers; nevertheless, we will assume that all of them are represented as a single LAN. Also, to simplify the modelling, the load balancers are not explicitly described in this architecture.
[Figure 2-8. Basic architecture. SoS provider site: one web server, one application server and one database server with its disk, interconnected by a LAN and connected to the Internet. External: payment server; flight reservation component systems #1 to #NF; hotel reservation component systems #1 to #NH; car reservation component systems #1 to #NC.]

[Figure 2-9. Redundant architecture. SoS provider site: web servers 1 to NW, application servers 1 and 2, database servers 1 and 2 with disks D1 and D2, interconnected by a LAN and connected to the Internet. External: payment server; flight reservation component systems #1 to #NF; hotel reservation component systems #1 to #NH; car reservation component systems #1 to #NC.]
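To give a feel for what the redundancy of Figure 2.9 can bring, the following back-of-the-envelope comparison treats the internal resources as a reliability block diagram. It is only a sketch under assumptions that are not taken from the report: resources fail and are repaired independently, a replicated resource is up as long as at least one replica is up (perfect coverage, no performance-related failures), and the function considered needs the web, application and database servers, the disk and the LAN. The symbols A_WS, A_AS, A_DS, A_D and A_LAN are introduced here to denote the availability of a single web server, application server, database server, disk and the LAN.

```latex
\begin{align*}
A_{\text{basic}} &= A_{WS}\,A_{AS}\,A_{DS}\,A_{D}\,A_{LAN} \\[4pt]
A_{\text{redundant}} &= \bigl[1-(1-A_{WS})^{N_W}\bigr]\,
                        \bigl[1-(1-A_{AS})^{2}\bigr]\,
                        \bigl[1-(1-A_{DS})^{2}\bigr]\,
                        \bigl[1-(1-A_{D})^{2}\bigr]\,A_{LAN}
\end{align*}
```

With these simplifications the LAN remains a single point of failure in both architectures, which matches its representation as a single resource; a model that accounts for imperfect coverage, shared repair or performance-related failures would give lower values for the redundant case.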

In the next section, we model the availability of both the basic and redundant architectures.

2.2 TA Availability Modelling

The availability modelling and evaluation of the TA SoS is carried out in four steps, following the hierarchical description of the SoS, so as to evaluate progressively the resource, service, function and user levels (see Figure 2.1). The outputs of a given level are used at the level immediately above to compute the availability measures associated with that level. An overview of the modelling steps is recalled below, before the TA case study is modelled specifically.

Resource models: The resource models describe the behaviour of the SoS provider resources, as resulting from component interactions, failures and repairs. Depending on the nature of the system and on the dependencies among components, one or several models are built. The outputs of these models are the availabilities of the various resources.

Service level model: We make a distinction between internal and external services. External services are delivered by providers about whom little information is known. It is assumed that external services are independent. The availability of these services, denoted $\{A(Se_j)\}$, is expected to be obtained from specific experiments or measurements, such as those presented in [Long et al. 1995, Kalyanakrishnam et al. 1999a, Machiraju et al. 2000]. Internal services are supplied by the resources of the TA SoS provider. The availability of these services, denoted $\{A(Si_j)\}$, is evaluated from the availability of the resources on which these services are implemented. Very often, because of the mapping between services and resources and the dependencies between resources, the service model and the resource model of a given service are built as a single model, for efficiency reasons.
In the case of the TA, we build the same models for services and resources.

Function level model: Availability modelling is based on the availability of the services involved in accomplishing each function, the matrix giving the mapping between functions and services, and the scenario probabilities derived from the interaction diagram of each function. The outputs of this level are the availabilities of the various functions $\{A(F_i)\}$, evaluated as follows:

$$A(F_i) = \sum_{j=1}^{M} \phi_j \, A(\sigma_j(F_i)) \qquad (1)$$

where:
- $M$ is the number of execution scenarios for function $F_i$ in the interaction diagram;
- $\phi_j$ is the activation probability of function execution scenario $j$;
- $\sigma_j(F_i)$ is the set of servers involved in function execution scenario $j$;
- $A(\sigma_j(F_i))$ is the availability of the servers involved in function execution scenario $j$.

User level model: The availability of the target SoS as perceived by a given user class is based on the execution scenarios followed by the user when visiting the SoS provider site(s) (derived from the user operational profile) and on the availability of the functions invoked in each scenario. The outputs of this level are the availabilities as seen by the various classes of users $\{A(user_k)\}$. Similarly to the function level, $A(user_k)$ is obtained as follows:

$$A(user_k) = \sum_{i=1}^{N} \pi_i \, A(L_i) \qquad (2)$$

where:
- $N$ is the number of user scenarios in the Markov chain describing the user operational profile;
- $\pi_i$ is the activation probability of user scenario $i$;
- $L_i$ is the set of functions involved in user scenario $i$;
- $A(L_i)$ is the availability of the functions involved in user scenario $i$.

In the following, we illustrate this availability evaluation approach on the TA example, starting with the availability measures at the service level, based on the modelling of the two architectures of Figures 2.8 and 2.9.

2.2.1 Service Level Availability

At this step, we are concerned with the evaluation of the availability of external and internal services.

External services

We assume that the external resources are identical for both architectures. They correspond to flight reservation, hotel reservation, car reservation and payment. To evaluate the availability of these services, each external component system is described by a single resource modelled as a black box. Each of these systems is assumed to fail independently of all the others. The availabilities of the various component systems are defined as follows:
- $A_{Fi}$: availability of flight reservation component system $i$, $i = 1, 2, \ldots, N_F$;
- $A_{Hi}$: availability of hotel reservation component system $i$, $i = 1, 2, \ldots, N_H$;
- $A_{Ci}$: availability of car reservation component system $i$, $i = 1, 2, \ldots, N_C$;
- $A_{PS}$: availability of the payment component system;
- $A_{net}$: availability of the TA connectivity to the Internet.
Using the failure independence assumption, and considering that a service is provided as long as at least one of its redundant component systems is available, the availability of the external services can be derived directly, as in Table 2.7.
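The "at least one component system available" rule stated above can be sketched in a few lines of Python. This is only an illustration: the function name and the choice of three component systems are assumptions, while the value 0.9 is the one the report later assigns to each reservation system.

```python
from math import prod

def parallel_availability(component_availabilities):
    """Availability of a service that is up as long as at least one of its
    independent component systems is up: 1 - prod(1 - A_i)."""
    return 1.0 - prod(1.0 - a for a in component_availabilities)

# Illustrative case: three flight reservation systems (the count is hypothetical),
# each with availability 0.9.
a_flight = parallel_availability([0.9, 0.9, 0.9])
print(f"A(Flight) = {a_flight:.6f}")
```

With a single component system the expression reduces to that system's own availability, which is a quick sanity check on the formula.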

Table 2.7. External service availability

| Service | Availability |
|---|---|
| A(Flight) | $1 - \prod_{i=1}^{N_F} (1 - A_{Fi})$ |
| A(Hotel) | $1 - \prod_{i=1}^{N_H} (1 - A_{Hi})$ |
| A(Car) | $1 - \prod_{i=1}^{N_C} (1 - A_{Ci})$ |
| A(Payment server) | $A_{PS}$ |

It is worth mentioning that if the TA connectivity to the Internet is unavailable, none of the services is provided. Consequently, the availability of the TA connectivity to the Internet is accounted for by multiplying the user availability expression by $A_{net}$, as will be seen in Section 2.2.3.

Internal services

The internal services are the web, application and database services. The availability measures are evaluated for the two architectures of Figures 2.8 and 2.9 (basic and redundant architectures, respectively). For both architectures, communication between servers is achieved by a local area network (LAN). The LAN is assumed to be a single point of failure, i.e., when the LAN is unavailable, all internal services are unavailable. Consequently, the LAN availability, denoted $A_{LAN}$, appears as a factor in all the equations giving the function availabilities (as will be seen in Section 2.2.2). $A_{LAN}$ can be evaluated using the model discussed in deliverable DMS1 [Kaâniche et al. 2001].

As the primary objective of this deliverable is to show the applicability of the approach to the TA SoS, we make simplistic assumptions for the application and database services. More realistic assumptions are made for the web service, to illustrate the kind of more complex calculations that can be performed. Similar approaches can be followed to evaluate the availability of the application and database services.

Application and database service availability

Let us denote by $C_{AS}$ and $C_{DS}$ the computer hosts associated with the application and database servers, respectively. Their availabilities are denoted $A(C_{AS})$ and $A(C_{DS})$. The disk availability is denoted $A(Disk)$.
To simplify the presentation, we assume that each component (i.e., computer host or disk) fails independently of the others. The application and database service availabilities are given in Table 2.8.
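Under this independence assumption, components that must all be up combine in series (their availabilities multiply), and a duplicated component behaves as a two-unit parallel system. A minimal Python sketch of these two rules follows; the function names and the numerical values are illustrative assumptions, not taken from the report.

```python
from math import prod

def series(*availabilities):
    """All components are needed: the availabilities multiply."""
    return prod(availabilities)

def duplex(availability):
    """Two identical, independent units, at least one of which is needed."""
    return 1.0 - (1.0 - availability) ** 2

# Hypothetical component availabilities, for illustration only.
a_host, a_disk = 0.99, 0.99

basic_db = series(a_host, a_disk)                      # one host, one disk
redundant_db = series(duplex(a_host), duplex(a_disk))  # two hosts, two mirrored disks
print(f"basic: {basic_db:.6f}  redundant: {redundant_db:.8f}")
```

The same two helpers cover the application service: a single host in the basic architecture, and a duplex host in the redundant one.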

Table 2.8. Application and database service availability

| | Basic architecture | Redundant architecture |
|---|---|---|
| A(Application service) | $A(C_{AS})$ | $1 - (1 - A(C_{AS}))^2$ |
| A(Database service) | $A(C_{DS})\,A(Disk)$ | $[1 - (1 - A(C_{DS}))^2]\,[1 - (1 - A(Disk))^2]$ |

In the following, we focus on the evaluation of the web service availability for the basic and redundant architectures, respectively.

Web service availability

To evaluate the availability of the web service, we distinguish two sources of failures:
1) Hardware and software failures that affect the computer host and lead to the failure of the web server;
2) Performance-related failures, which are due to the limited capacity of the web server: when the input buffer is full, incoming requests are not serviced.

The web service is assumed to be available when neither type of failure occurs. The impact of both types of failures on the web service availability can be accounted for by adopting a composite performance and availability (generally called performability) evaluation approach. The main idea was initially proposed by Meyer [Meyer 1980, Meyer 1982] and has since been used extensively in performability modelling. It consists in combining the results obtained from two models: a pure performance model and a pure availability model. The performance model takes into account the request arrival and service processes and evaluates performance-related measures conditioned on the state of the system as determined from the availability model. The availability model is used to evaluate the steady-state probabilities of the system states that result from the occurrence of failures and recoveries. This approach is based on the assumption that the system reaches a quasi steady state with respect to the performance-related events between successive occurrences of failure-recovery events.
This assumption is valid when the failure/recovery rates are much lower than the request arrival/service rates, which is typically true in our context.

Basic architecture

The basic architecture is composed of a single computer host, $C_{WS}$. Let us denote by $p_K$ the probability that the web server input buffer (of size $K$) is full when a request is received. The value of $p_K$ is derived from the performance model and depends on the assumptions made about the request arrival process and the request service process. Let us assume that request arrivals are modelled by a Poisson process with rate $\alpha$ and that request service times are exponentially distributed with rate $\nu$. The web server behaviour governed by the arrival and service processes can then be modelled by an M/M/1/K queue.

The probability that an arriving request is lost because the buffer is full is well known (see, e.g., [Allen 1978]) and is given by:

$$p_K = \frac{\rho^K (1 - \rho)}{1 - \rho^{K+1}} \qquad (3)$$

with:

$$\rho = \frac{\alpha}{\nu} \qquad (4)$$

The availability model is composed of two states: up and down. The steady-state probability of the up state corresponds to the steady-state availability of the host, denoted $A(C_{WS})$. The availability of the web service can then be expressed as follows:

$$A(\text{Web service}) = A(C_{WS})\,(1 - p_K) \qquad (5)$$

This definition of availability thus captures the inherent dependence between performance and dependability in a single equation.

Redundant architecture

The redundant architecture is composed of $N_W$ identical web servers. We assume that all component failures are independent and that the web service is provided as long as at least one of the redundant web servers is available. The performance model associated with this architecture is assumed to be an M/M/i/K queue, where $i$ is the number of servers available and $K$ is the size of the buffer. For a system state with $i$ operational servers, the probability that an arriving request is lost because the buffer is full, denoted $p_K(i)$, is given by (see, e.g., [Allen 1978]):

$$p_K(i) = \frac{\rho^K}{i^{K-i}\, i!} \left[ \sum_{j=0}^{i-1} \frac{\rho^j}{j!} + \sum_{j=i}^{K} \frac{\rho^j}{i^{j-i}\, i!} \right]^{-1} \qquad (6)$$

where

$$\rho = \frac{\alpha}{\nu} \qquad (7)$$

With respect to the availability model, the aim is to model the behaviour of the redundant architecture as resulting from the occurrence of failures and repairs, in order to evaluate the steady-state probability of each system state $i$ (where $i$ is the number of operational servers, as above). In the following, two assumptions are made with regard to web server failures and recovery. We first assume perfect coverage following the failure of a web server, and then consider the case where coverage is imperfect.
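The loss probabilities of equations (3) and (6) can be computed directly, as in the following Python sketch. The function names are illustrative assumptions, and the code assumes $1 \le i \le K$ (the case considered in this report, with at most 10 servers and $K = 10$). For $i = 1$, equation (6) reduces to equation (3), which the last lines check numerically.

```python
from math import factorial

def loss_probability(rho, servers, buffer_size):
    """Equation (6): probability that an arriving request finds the buffer
    full in an M/M/i/K queue, with rho = alpha / nu."""
    i, K = servers, buffer_size
    if not 1 <= i <= K:
        raise ValueError("this sketch assumes 1 <= servers <= buffer_size")
    below = sum(rho**j / factorial(j) for j in range(i))
    above = sum(rho**j / (i**(j - i) * factorial(i)) for j in range(i, K + 1))
    return (rho**K / (i**(K - i) * factorial(i))) / (below + above)

def loss_probability_single(rho, buffer_size):
    """Equation (3): closed form for a single server (requires rho != 1)."""
    K = buffer_size
    return rho**K * (1 - rho) / (1 - rho**(K + 1))

# alpha = 50 requests/s, nu = 100 requests/s, K = 10
rho = 50 / 100
print(loss_probability(rho, 1, 10), loss_probability_single(rho, 10))
```

Equation (3) is undefined at $\rho = 1$, whereas the summation form of equation (6) still applies there and gives $1/(K+1)$ for a single server.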
Perfect coverage: The model presented in Figure 2.10 is based on the assumption that each web server runs on a dedicated computer host. Web server failures occur with rate $\lambda$, and the repair rate is $\mu$. The model also assumes shared repair facilities. Upon the failure of a web server, the server is automatically disconnected and the system is reconfigured (with probability 1) with the web servers that are still operational.

[Figure 2.10: Markov model of the $N_W$ web servers (perfect coverage). States $N_W, N_W - 1, \ldots, 1, 0$; failure transitions from state $i$ to state $i - 1$ with rate $i\lambda$; repair transitions from state $i - 1$ to state $i$ with rate $\mu$.]

Let us denote by $\Pi_i$ the steady-state occupation probability of state $i$, $i = 0, 1, \ldots, N_W$. In state $i$, $i \neq 0$, $i$ web servers are available to process the input requests ($\Pi_0$ corresponds to web server unavailability). The $\Pi_i$ are given by:

$$\Pi_i = \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \Pi_0, \qquad i = 1, \ldots, N_W \qquad (8)$$

$$\Pi_0 = \left[ \sum_{i=0}^{N_W} \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \right]^{-1} \qquad (9)$$

The availability of the web service is as follows:

$$A(\text{Web service}) = 1 - \left[ \sum_{i=1}^{N_W} \Pi_i \, p_K(i) + \Pi_0 \right] \qquad (10)$$

where $p_K(i)$, the probability that a request arriving in state $i$ is lost because the buffer is full, is given by equation (6). This definition of availability incorporates the inherent dependence between performance and dependability in one equation. The expression between brackets is the probability that a web request is not serviced, either i) because the buffer is full or ii) because the web service is unavailable.

Imperfect coverage: The model of Figure 2.10 assumes perfect failure coverage and reconfiguration. This assumption is revisited in the model presented in Figure 2.11, where two output transitions are considered from each state $i$:
1) After a covered failure (transition with rate $ic\lambda$), the system is automatically reconfigured into an operational state with $(i-1)$ web servers.
2) Upon the occurrence of an uncovered failure (transition with rate $i(1-c)\lambda$), the system moves to a down state $y_i$, where a manual reconfiguration action is required before moving to operational state $(i-1)$. The corresponding reconfiguration times are assumed to be exponentially distributed with mean $1/\beta$.

[Figure 2.11: Markov model of the $N_W$ web servers (imperfect coverage). Operational states $N_W, \ldots, 1$ and down state 0; covered failures with rate $ic\lambda$ for $i \geq 2$ and rate $\lambda$ from state 1 to state 0; uncovered failures with rate $i(1-c)\lambda$ leading to down states $y_i$; reconfiguration transitions with rate $\beta$; repair transitions with rate $\mu$.]

Solving the model of Figure 2.11 for the steady-state probabilities leads to:

$$\Pi_i = \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \Pi_0, \qquad i = 1, \ldots, N_W \qquad (11)$$

$$\Pi_{y_i} = \frac{\mu (1-c)}{\beta\, (i-1)!} \left( \frac{\mu}{\lambda} \right)^{i-1} \Pi_0, \qquad i = 2, \ldots, N_W \qquad (12)$$

$$\Pi_0 = \left[ \sum_{i=0}^{N_W} \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} + \sum_{i=0}^{N_W - 2} \frac{\mu (1-c)}{\beta\, (N_W - i - 1)!} \left( \frac{\mu}{\lambda} \right)^{N_W - i - 1} \right]^{-1} \qquad (13)$$

Given that the states $y_i$ are down states, the availability of the web service can be computed as follows:

$$A(\text{Web service}) = 1 - \left[ \sum_{i=1}^{N_W} \Pi_i \, p_K(i) + \sum_{i=2}^{N_W} \Pi_{y_i} + \Pi_0 \right] \qquad (14)$$

where $p_K(i)$ is also given by equation (6).

Summary of web service availability

Table 2.9 recalls the equations of the web service availability for the basic and redundant architectures, assuming perfect and imperfect coverage.
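The composite availability of equations (8) to (14) can be evaluated numerically, as in the following Python sketch. It is written under stated assumptions: the function and parameter names are illustrative, the down states $y_i$ are indexed by the state $i$ that the uncovered failure leaves ($i = 2, \ldots, N_W$), and the example parameter values are those quoted for the sensitivity analyses later in this section. Only the ratios $\mu/\lambda$, $\mu/\beta$ and $\alpha/\nu$ enter the formulas, so the hourly and per-second rates can be mixed as long as each ratio uses consistent units.

```python
from math import factorial

def loss_probability(rho, i, K):
    """Equation (6): loss probability of an M/M/i/K queue (1 <= i <= K)."""
    below = sum(rho**j / factorial(j) for j in range(i))
    above = sum(rho**j / (i**(j - i) * factorial(i)) for j in range(i, K + 1))
    return rho**K / (i**(K - i) * factorial(i)) / (below + above)

def web_service_availability(n_w, lam, mu, alpha, nu, K, c=1.0, beta=None):
    """Equation (10) when c == 1 (perfect coverage), equation (14) otherwise."""
    if c < 1.0 and beta is None:
        raise ValueError("beta is required when the coverage c is below 1")
    r, rho = mu / lam, alpha / nu
    # Pi_i / Pi_0 for i = 0..n_w (equations (8) and (11)).
    up = [r**i / factorial(i) for i in range(n_w + 1)]
    # Pi_yi / Pi_0 for i = 2..n_w (equation (12)); empty with perfect coverage.
    down = []
    if c < 1.0:
        down = [mu * (1 - c) / (beta * factorial(i - 1)) * r**(i - 1)
                for i in range(2, n_w + 1)]
    pi0 = 1.0 / (sum(up) + sum(down))  # equations (9) and (13)
    lost = sum(up[i] * loss_probability(rho, i, K) for i in range(1, n_w + 1))
    return 1.0 - pi0 * (lost + sum(down) + 1.0)

# Four servers, lambda = 1e-3/h, mu = 1/h, alpha = nu = 100/s, K = 10.
perfect = web_service_availability(4, 1e-3, 1.0, 100.0, 100.0, 10)
imperfect = web_service_availability(4, 1e-3, 1.0, 100.0, 100.0, 10, c=0.98, beta=12.0)
print(f"unavailability: perfect {1 - perfect:.3e}, imperfect {1 - imperfect:.3e}")
```

With a single server and perfect coverage, the function reduces to equation (5), since $\Pi_1 = \mu / (\lambda + \mu)$ is the host availability.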

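Table 2.9. Web service availability

Basic architecture:

$$A(\text{Web service}) = A(C_{WS})\,(1 - p_K), \qquad p_K = \frac{\rho^K (1 - \rho)}{1 - \rho^{K+1}}, \qquad \rho = \frac{\alpha}{\nu}$$

Redundant architecture (perfect coverage):

$$A(\text{Web service}) = 1 - \left[ \sum_{i=1}^{N_W} \Pi_i \, p_K(i) + \Pi_0 \right]$$

$$p_K(i) = \frac{\rho^K}{i^{K-i}\, i!} \left[ \sum_{j=0}^{i-1} \frac{\rho^j}{j!} + \sum_{j=i}^{K} \frac{\rho^j}{i^{j-i}\, i!} \right]^{-1}, \qquad \rho = \frac{\alpha}{\nu}$$

$$\Pi_0 = \left[ \sum_{i=0}^{N_W} \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \right]^{-1}, \qquad \Pi_i = \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \Pi_0$$

Redundant architecture (imperfect coverage):

$$A(\text{Web service}) = 1 - \left[ \sum_{i=1}^{N_W} \Pi_i \, p_K(i) + \sum_{i=2}^{N_W} \Pi_{y_i} + \Pi_0 \right]$$

$$p_K(i) = \frac{\rho^K}{i^{K-i}\, i!} \left[ \sum_{j=0}^{i-1} \frac{\rho^j}{j!} + \sum_{j=i}^{K} \frac{\rho^j}{i^{j-i}\, i!} \right]^{-1}, \qquad \rho = \frac{\alpha}{\nu}$$

$$\Pi_0 = \left[ \sum_{i=0}^{N_W} \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} + \sum_{i=0}^{N_W - 2} \frac{\mu (1-c)}{\beta\, (N_W - i - 1)!} \left( \frac{\mu}{\lambda} \right)^{N_W - i - 1} \right]^{-1}$$

$$\Pi_i = \frac{1}{i!} \left( \frac{\mu}{\lambda} \right)^{i} \Pi_0, \qquad \Pi_{y_i} = \frac{\mu (1-c)}{\beta\, (i-1)!} \left( \frac{\mu}{\lambda} \right)^{i-1} \Pi_0$$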

2.2.2 Function Level Availability

The availability of each function identified at the function level is evaluated from the availabilities of the services involved in its accomplishment and, when several execution scenarios are possible, from the activation probability of each scenario. Table 2.10 gives the availability of the Home, Browse, Search, Book and Pay functions. A(WS), A(AS) and A(DS) correspond respectively to A(Web service), A(Application service) and A(Database service), given in Tables 2.8 and 2.9. A(PS) corresponds to A(Payment server), given in Table 2.7, as are A(Flight), A(Hotel) and A(Car). The parameters $q_{ij}$ involved in the availability of the Browse function are associated with the three execution scenarios of this function presented in Section 2.1.

Note that all the function equations include the product $A_{net} A_{LAN}$, meaning that if the TA connectivity to the Internet or the internal communication among the servers is not available, none of the TA functions can be invoked by the users. Also, the Book function has the same availability equation as the Search function, because we have assumed that the former uses a subset of the resources used by the latter. Indeed, in our example, the Book function can be achieved only if the Search function has succeeded. This led us to assume that if the Search function succeeds, the Book function automatically succeeds. Of course, other situations can be modelled.

Table 2.10. Function level availabilities

| Function | Availability |
|---|---|
| A(Home) | $A_{net} A_{LAN} A(WS)$ |
| A(Browse) | $A_{net} A_{LAN} [q_{23} A(WS) + q_{24} q_{45} A(WS) A(AS) + q_{24} q_{47} A(WS) A(AS) A(DS)]$ |
| A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| A(Book) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| A(Pay) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(PS)$ |

2.2.3 User Level Availability

For a given user operational profile, the availability perceived by the users can be obtained by evaluating, for each user execution scenario derived from the operational profile, the expression specifying that all the functions invoked in that scenario are available. When several functions are invoked in a given path, a careful analysis of the dependencies that might exist among the functions, due to shared services or resources, is needed at this stage to evaluate the availability measure associated with the path from the availability of the corresponding functions.

Table 2.11 gives the availabilities associated with the user scenarios presented in Section 2.1 (Table 2.3). The first column identifies the scenario and the functions invoked. The second column specifies the availability of the user scenario based on the availability of the functions and the analysis of their dependencies. The third column gives the availability of the user scenario in terms of the availability of the corresponding services and resources. It is worth mentioning that the results presented in column 2 take into account the dependencies that exist among the various functions involved in each scenario. As discussed in Section 2.2.2, such dependencies mainly result from resource sharing among the functions (this is the case of the Search and Book functions).

Table 2.11. User scenarios and associated availabilities

| Scenario | Availability wrt associated functions | Availability |
|---|---|---|
| 1: Start-Home-Exit | A(Home) | $A_{net} A_{LAN} A(WS)$ |
| 2: Start-Browse-Exit | A(Browse) | $A_{net} A_{LAN} [q_{23} A(WS) + q_{24} q_{45} A(WS) A(AS) + q_{24} q_{47} A(WS) A(AS) A(DS)]$ |
| 3: Start-{Home; Browse}*-Exit | A(Browse) | $A_{net} A_{LAN} [q_{23} A(WS) + q_{24} q_{45} A(WS) A(AS) + q_{24} q_{47} A(WS) A(AS) A(DS)]$ |
| 4: Start-Home-Search-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 5: Start-Browse-Search-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 6: Start-{Home; Browse}*-Search-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 7: Start-Home-{Search-Book}*-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 8: Start-Browse-{Search-Book}*-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 9: Start-{Home; Browse}*-{Search-Book}*-Exit | A(Search) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car)$ |
| 10: Start-Home-{Search-Book}*-Pay-Exit | A(Search; Pay) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) A(PS)$ |
| 11: Start-Browse-{Search-Book}*-Pay-Exit | A(Search; Pay) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) A(PS)$ |
| 12: Start-{Home; Browse}*-{Search-Book}*-Pay-Exit | A(Search; Pay) | $A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) A(PS)$ |

Taking into account:
- the activation probabilities $\pi_i$ of all user scenarios $i$ (whose equations are given in Table 2.3 for the TA example, and whose numerical values are given in Table 2.4 for user classes A and B),
- the contribution of all scenarios $i$, as given in Table 2.11,

the user availability is given by equation (15):

$$A(user) = A_{net} A_{LAN} A(WS) \Big[ \pi_1 + (\pi_2 + \pi_3) \{ q_{23} + A(AS) (q_{24} q_{45} + q_{24} q_{47} A(DS)) \} + A(AS) A(DS) A(Flight) A(Hotel) A(Car) \{ (\pi_4 + \pi_5 + \pi_6 + \pi_7 + \pi_8 + \pi_9) + (\pi_{10} + \pi_{11} + \pi_{12}) A(PS) \} \Big] \qquad (15)$$

Taking into account the grouping of user scenarios into four categories SC1, SC2, SC3 and SC4, as defined in Section 2.1.1, equation (15) can be written as follows:

$$A(user) = A(SC1) + A(SC2) + A(SC3) + A(SC4) \qquad (16)$$

where

$$A(SC1) = A_{net} A_{LAN} A(WS) \big[ \pi_1 + (\pi_2 + \pi_3) \{ q_{23} + A(AS) (q_{24} q_{45} + q_{24} q_{47} A(DS)) \} \big]$$

$$A(SC2) = A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) \, (\pi_4 + \pi_5 + \pi_6)$$

$$A(SC3) = A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) \, (\pi_7 + \pi_8 + \pi_9)$$

$$A(SC4) = A_{net} A_{LAN} A(WS) A(AS) A(DS) A(Flight) A(Hotel) A(Car) A(PS) \, (\pi_{10} + \pi_{11} + \pi_{12})$$

For a given user class, equation (16) enables the analysis of the relative contribution to availability of the scenarios that end up with a payment, compared to all the scenarios that might be invoked by the users. Equations (15) and (16) will be used in Section 2.3 to evaluate the availability perceived by the two user classes A and B, whose operational profiles are given in Tables 2.2, 2.3 and 2.4.

2.3 Evaluation Results

In the previous section, we have defined two user classes for the travel agency. They both use the same set of functions, which are activated differently. We have also defined two possible TA architectures: a basic architecture (in which each service is implemented on a single computer host) and a redundant architecture (composed of $N_W$ redundant web servers, a duplex application server and a duplex database server using a mirrored disk).
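Equations (15) and (16) of Section 2.2.3 lend themselves to a direct numerical evaluation, sketched below in Python. The function name, the dictionary keys, the uniform scenario probabilities and the 0.99 availabilities are all illustrative assumptions; only the $q_{ij}$ values are those listed among the model parameters of this report.

```python
def user_availability(pi, q, a):
    """Evaluate equation (16) term by term; the sum equals equation (15).

    pi: scenario activation probabilities, keys 1..12
    q:  Browse scenario probabilities, keys "23", "24", "45", "47"
    a:  availabilities, keys net, LAN, WS, AS, DS, PS, Flight, Hotel, Car
    """
    common = a["net"] * a["LAN"] * a["WS"]
    browse = q["23"] + a["AS"] * (q["24"] * q["45"] + q["24"] * q["47"] * a["DS"])
    search = a["AS"] * a["DS"] * a["Flight"] * a["Hotel"] * a["Car"]
    result = {
        "SC1": common * (pi[1] + (pi[2] + pi[3]) * browse),
        "SC2": common * search * (pi[4] + pi[5] + pi[6]),
        "SC3": common * search * (pi[7] + pi[8] + pi[9]),
        "SC4": common * search * a["PS"] * (pi[10] + pi[11] + pi[12]),
    }
    result["user"] = sum(result.values())
    return result

q = {"23": 0.2, "24": 0.8, "45": 0.4, "47": 0.6}
pi = {i: 1 / 12 for i in range(1, 13)}  # hypothetical operational profile
keys = ("net", "LAN", "WS", "AS", "DS", "PS", "Flight", "Hotel", "Car")
a = {k: 0.99 for k in keys}             # hypothetical availabilities
print(user_availability(pi, q, a))
```

A useful check is that, when every service is perfectly available, the result is the sum of the $\pi_i$, i.e. 1, because $q_{23} + q_{24} = 1$ and $q_{45} + q_{47} = 1$.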
We have established the models of the various services, the function models and the user model, and derived their availabilities. For the redundant architecture, two assumptions were made about web server recovery when building the web service model (perfect and imperfect coverage).

From equation (15), it can be seen that the availabilities of the LAN, the Internet connectivity and the web service are the most influential ones (i.e., their impact is of the first order, while that of the others is at least of the second order). This is due to the fact that all requests (more exactly, all user scenarios) use these three services.

In the rest of this section, we first show the impact of the number of web servers and of their failure rates on the web service availability, for different request arrival rates. Then, based on the equations derived in the previous section, we evaluate the availability perceived by user classes A and B.

Web service availability results

Figures 2.12 and 2.13 give the web service unavailability for perfect and imperfect fault coverage, with the number of web servers $N_W$ varying from 1 to 10. It is worth mentioning that when only one web server is used ($N_W = 1$), the results correspond to the basic architecture. The parameters used to obtain these curves are indicated on the figures. Sensitivity analyses are done considering different values of the web server failure rate ($10^{-2}$, $10^{-3}$ and $10^{-4}$ per hour) and of the request arrival rate (50, 100 and 150 requests per second). It is assumed that each web server has a processing rate $\nu$ equal to 100 per second and a repair rate $\mu$ equal to 1 per hour. The mean reconfiguration rate of the web server architecture ($\beta$) is 12 per hour (i.e., $1/\beta$ = 5 min) and the buffer size $K$ is assumed to be 10.

[Figure 2.12: Web service unavailability, $1 - A(WS)$, versus the number of web servers $N_W$ (perfect coverage, $c = 1$). Curves for $\alpha$ = 50, 100 and 150 per second and $\lambda$ = $10^{-2}$, $10^{-3}$ and $10^{-4}$ per hour; $\mu$ = 1/hour, $\nu$ = 100/sec, $\beta$ = 12/hour, $K$ = 10.]

[Figure 2.13: Web service unavailability, $1 - A(WS)$, versus the number of web servers $N_W$ (imperfect coverage, $c = 0.98$). Curves for $\alpha$ = 50 and 100 per second and $\lambda$ = $10^{-2}$, $10^{-3}$ and $10^{-4}$ per hour; $\mu$ = 1/hour, $\nu$ = 100/sec, $\beta$ = 12/hour, $K$ = 10.]

Both figures show that increasing the number of web servers $N_W$ from 1 to 2, 3 or 4 (depending on the failure and request arrival rates) reduces the web service unavailability. However, the trend is reversed when the coverage is imperfect, for $N_W$ values higher than 4 (Figure 2.13). This is due to the fact that, when the coverage is imperfect, increasing the number of servers also increases the probability of the system being in the states $y_i$ (of Figure 2.11), where the web service is unavailable and a manual reconfiguration action is required. The probability of a request being rejected because the buffer is full plays a significant role up to a certain value of $N_W$. When the number of servers is higher than this threshold, the total service rate and the buffer capacity are sufficient to handle the flow of arrivals without rejecting requests. In this case, the unavailability of the web service mainly results from hardware and software failures leading the web server architecture to a down state.

Compared to the imperfect coverage model, the model with perfect coverage is more sensitive to the variation of $N_W$. Indeed, the unavailability decreases exponentially when $N_W$ increases, and the trend is not reversed for values higher than 4. Also, the web server failure rate has a significant impact on availability only when the system load ($\alpha/\nu$) is lower than 1.

Design decisions can be made based on the results presented in these figures.
In particular, we can determine the number of servers needed to achieve a given availability requirement, or evaluate the maximum availability that can be obtained when the number of servers is set to a given value. For instance, considering the model with imperfect coverage, the number of servers needed to keep the unavailability below 5 min/year (unavailability < $10^{-5}$), with a failure rate equal to $10^{-3}$ per hour, is at least 2 if the request arrival rate is 50 per second and 4 if the request arrival rate is 100 per second. We obtain the same result with a failure rate of $10^{-4}$ per hour; however, such a requirement cannot be satisfied with a failure rate of $10^{-2}$ per hour.

Similar sensitivity analyses can be done to study the level of availability that can be achieved when the number of web servers is set to a given value. For instance, if we decide to employ three servers to support the web service, the unavailability is lower than 1 hour per year. This is true when the failure rate varies from $10^{-2}$ to $10^{-4}$ per hour and the system load ($\alpha/\nu$) is less than 1.

User level availability results

We consider equations (15) and (16), presented in Section 2.2.3, to evaluate the availability perceived by user classes A and B. Numerical values must be assigned to the various parameters involved in these equations. These parameters, together with their numerical values, are given in Table 2.12. The probabilities characterizing the operational profiles of user classes A and B have been presented in Tables 2.2, 2.3 and 2.4.

Table 2.12. Model parameters
- $A_{net} = A_{LAN} = A(C_{AS}) = A(C_{DS}) = A(Disk) = 0.9$
- $A_{PS} = A_{Fi} = A_{Hi} = A_{Ci} = 0.9$
- $q_{23} = 0.2$, $q_{24} = 0.8$, $q_{45} = 0.4$, $q_{47} = 0.6$

Referring to equation (15), to analyse the impact of the operational profile on the user perceived availability, we consider the normalized availability given by $A(user) / (A_{net} A_{LAN} A(WS))$. This quantity does not depend on the availability of the LAN, the Internet connectivity or the web service. Table 2.13 presents the normalized availability for user classes A and B, considering different values for the number of flight, car and hotel reservation systems ($N_F$, $N_H$, $N_C$) interacting with the travel agency SoS. For the sake of simplicity, the same number is assumed for $N_F$, $N_H$ and $N_C$. Also, according to Table 2.12, it is assumed that all the reservation systems have the same availability ($A_{Fi} = A_{Hi} = A_{Ci} = 0.9$).
The results in Table 2.13 show that, for a given user class, the normalized availability increases significantly when the number of reservation systems increases from 1 to 4, and then stabilizes. The rate of availability variation is directly related to the availability assigned to each reservation system. Comparison of the results obtained for class A and class B users shows that different operational profiles might lead to significant differences in the availability perceived by the users. For instance, considering the case $N_F = N_H = N_C = 10$, the normalized user perceived unavailability is about 57 hours per year for class A users and 74 hours per year for class B users. Such unavailability takes into account all the scenarios that might be invoked by the users.
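The downtime figures quoted in this section (minutes or hours per year) and the dimensionless unavailability values are related by a simple conversion, sketched below in Python. The helper names are illustrative assumptions, and a year is taken as 8760 hours (leap years ignored).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def downtime_hours_per_year(unavailability):
    """Convert a steady-state unavailability into hours of downtime per year."""
    return unavailability * HOURS_PER_YEAR

def unavailability_from_downtime(hours_per_year):
    """Convert hours of downtime per year into a steady-state unavailability."""
    return hours_per_year / HOURS_PER_YEAR

# An unavailability of 1e-5 is about 5 minutes of downtime per year.
print(f"{downtime_hours_per_year(1e-5) * 60:.2f} min/year")
# 57 and 74 hours per year correspond to unavailabilities below 1e-2.
print(f"{unavailability_from_downtime(57):.4f} {unavailability_from_downtime(74):.4f}")
```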

[Table 2.13: Normalized user perceived availabilities for user classes A and B as a function of the number of flight, car and hotel reservation systems. Columns: $N_F = N_H = N_C$; $A(\text{Class A users}) / (A_{net} A_{LAN} A(WS))$; $A(\text{Class B users}) / (A_{net} A_{LAN} A(WS))$.]

The user perceived availability can be analysed from another perspective by considering equation (16), which allows the evaluation of the relative contribution of each category of user scenarios (SC1, SC2, SC3 and SC4, as defined in Section 2.1.1) to the observed availability. This is illustrated in Figures 2.14 and 2.15 for class A and class B users, respectively, assuming that the web service is implemented on four servers with imperfect coverage. UA(A users) (respectively UA(B users)) denotes the unavailability perceived by class A (respectively class B) users, and UA(SCi), with i varying from 1 to 4, denotes the contribution of scenarios SCi to the user perceived unavailability.

It can be seen that the unavailability caused by scenarios SC4, which end up with a trip payment, is higher for class B users than for class A users (43 hours of downtime per year for class B users compared to 16 hours for class A users, when considering the steady values). Therefore, the impact in terms of loss of revenue for the TA provider will be higher. Indeed, if the user transaction rate is 100 per second, the total number of transactions ending up with a payment that are lost per year is 5.7 million for class A users and 15.5 million for class B users. Assuming that the average revenue generated by each transaction is 100 euros, the loss of revenue amounts to 570 million euros and 1.55 billion euros, respectively. This result clearly shows that it is important to have a faithful estimation of the user operational profile to obtain realistic predictions of the impact of failures from the economic and business viewpoints.
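The loss-of-revenue figures above follow from a short calculation: downtime in seconds, multiplied by the transaction rate, multiplied by the revenue per transaction. The Python sketch below reproduces it from the rounded downtime values quoted in the text (16 and 43 hours per year), so its results agree with the reported figures only up to rounding; the function name is an illustrative assumption.

```python
def lost_transactions(downtime_hours_per_year, transactions_per_second):
    """Number of transactions lost per year during the given downtime."""
    return downtime_hours_per_year * 3600 * transactions_per_second

REVENUE_PER_TRANSACTION = 100  # euros, as assumed in the text
RATE = 100                     # transactions per second, as assumed in the text

for user_class, downtime_hours in (("A", 16), ("B", 43)):
    lost = lost_transactions(downtime_hours, RATE)
    revenue_loss = lost * REVENUE_PER_TRANSACTION
    print(f"class {user_class}: {lost / 1e6:.2f} million transactions, "
          f"{revenue_loss / 1e6:.0f} million euros")
```

With these inputs the sketch gives 5.76 and 15.48 million lost transactions, i.e. about 576 million and 1.55 billion euros, consistent with the orders of magnitude reported above.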

Figure 2.14. Class A users unavailability with the unavailability of associated scenarios SC1, SC2, SC3 and SC4 (x-axis: NF = NH = NC; y-axis: unavailability; curves: UA(A users), UA(SC1), UA(SC2), UA(SC3), UA(SC4); parameters: µ = 1/hour, ν = 100/sec, β = 12/hour, K = 10, c = 0.98, α = 100/sec, λ = 1e-4/hour, Nw = 4)

Figure 2.15. Class B users unavailability with the unavailability of associated scenarios SC1, SC2, SC3 and SC4 (x-axis: NF = NH = NC; y-axis: unavailability; curves: UA(B users), UA(SC1), UA(SC2), UA(SC3), UA(SC4); parameters: µ = 1/hour, ν = 100/sec, β = 12/hour, K = 10, c = 0.98, α = 100/sec, λ = 1e-4/hour, Nw = 4)

2.4. Summary

In this part of the report, we have illustrated the main concepts that we defined within our hierarchical modelling framework proposed in deliverable DMS1 for the dependability evaluation of systems of systems. The example used for the illustration is the travel agency case study described in deliverable DMS3. Our objectives were: 1) to show how to apply our framework considering the decomposition of the target SoS according to four levels: user, function, service and resource levels, and 2) to present typical dependability analysis and evaluation results that could be obtained from the modelling to help the SoS providers in making objective design decisions. For the sake of illustration, we have deliberately considered simplified (yet realistic) assumptions.

We have shown that the proposed hierarchical framework provides a systematic and pragmatic modelling approach that is necessary to evaluate the dependability characteristics of the target SoS at different levels of abstraction. The proposed framework is general enough to be applied to more complex assumptions and models. The application of this framework requires the estimation of several parameters involved in the models. The second part of this report addresses this issue.

3. Measurement-based Evaluation

The application of the SoS dependability modelling framework requires the estimation of several parameters that characterise the failure and recovery behaviour of the component systems and resources included in the model(s). Such parameters can be estimated based on measurement [Arlat et al. 2000]. Measurement involves three main steps: (1) data collection, (2) data validation and (3) data processing.

Data collection consists of defining which data to collect and how to collect it. The analysis and assessment of computer systems based on data collected during operation provide valuable information on actual error/failure behaviour. In most commercial systems, in particular Unix and Windows NT and 2K based systems, error and failure data can be obtained from the event logging mechanisms offered by the operating system. Event logs include a large amount of information about the occurrence of various types of events; some of these events result from the normal activity of the target systems, whereas others are recorded when errors and failures affect local or distributed resources, or upon the occurrence of system reboots and shutdowns.

Usually, the collected data contains a large amount of redundant and irrelevant information, as well as incorrect or incomplete information. Such problems have been observed in several studies, e.g. those reported in [Kaâniche et al. 1990, Levendel 1990, Buckley & Siewiorek 1995, Thakur & Iyer 1996]. Therefore, data validation is needed in order to check the collected data for correctness, consistency and completeness. This consists in particular of filtering out invalid or irrelevant data and coalescing redundant or equivalent data. Once this step is achieved, the basic dependability characteristics of the measured system can be identified through data processing.
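The filtering and coalescing performed during data validation can be sketched as follows. This is only an illustration of one common heuristic (dropping events from sources judged irrelevant, then merging repeats of the same message that fall within a time window); it is not claimed to be the validation procedure used in this study. The function name, the tuple layout and the five-minute window are assumptions made for the example.

```python
from datetime import datetime, timedelta


def validate(events, irrelevant_sources=frozenset(),
             window=timedelta(minutes=5)):
    """Filter and coalesce a chronologically sorted list of events.

    events: iterable of (timestamp, source, message) tuples (assumed layout).
    Events from irrelevant sources are dropped. A repeat of the same
    (source, message) pair is dropped when it occurs within `window` of the
    previous occurrence, so a long burst collapses to its first event.
    """
    last_seen = {}
    kept = []
    for timestamp, source, message in events:
        if source in irrelevant_sources:
            continue  # filtering: not relevant to the dependability analysis
        key = (source, message)
        previous = last_seen.get(key)
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous <= window:
            continue  # coalescing: redundant repeat of a recent event
        kept.append((timestamp, source, message))
    return kept
```

The window length strongly affects the result: too short a window leaves redundant events in the data, while too long a window may merge events that stem from distinct problems.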
Data processing consists of performing statistical analyses on the validated data to identify and analyse trends and to evaluate quantitative measures that characterise dependability. Various statistics can be derived from the data to study the distribution of errors and failures among system components and their severity, evaluate the time to failure or time to recovery distribution, analyse the impact of the workload on the system behaviour, etc.

Measurement-based dependability analysis of computer systems, using event logs or data collected from the field, has given rise to a wide variety of research. A detailed survey of the state of the art was presented in deliverable BC2 [Arlat et al. 2000]. Today's computing environments are mainly based on interconnected Unix, Windows NT and Windows 2K systems. However, to the best of our knowledge, only a few studies have addressed the dependability analysis of Unix or Windows NT systems based on event logs [Thakur & Iyer 1996, Kalyanakrishnam et al. 1999b, Xu et al. 1999]. These studies did not cover Windows 2K systems. The work reported in [Thakur & Iyer 1996] is based on event logs collected from 69 SunOS workstations monitored over a period of 32 weeks. In [Kalyanakrishnam et al. 1999b], several analyses are carried out using event logs collected over a six-month period from 70 Windows NT mail servers. Similar analyses are presented in [Xu et al. 1999] based on event logs collected over a four-month period from 503 Windows NT servers running in a production environment. The systems analysed in these three studies are from distinct environments and the data collection period is rather short (less than 8 months). Clearly, additional measurement-based analyses are needed to understand the dependability characteristics of networked distributed systems and to give better insights into the problems that one might face when processing and analysing event logs.

In this part of the report, we summarize the results obtained from the analysis of event logs collected from 373 Unix SunOS/Solaris machines, 76 Windows NT and 89 Windows 2K systems, interconnected through the LAAS computing network. The data collection period was about 33 months for Unix systems (from November 1999 until July 2002), 44 months for Windows NT (from January 1999 until July 2002) and 23 months for Windows 2K (from September 2000 until July 2002). The identification of useful trends from large event logs is a time-consuming task that requires thorough manual analyses. In our study, we have focused on the identification of machine reboots, and the evaluation of statistical measures characterizing: a) the distribution of reboots (per machine, over time), b) the distribution of uptimes and downtimes associated with these reboots, c) the availability of machines including workstations and servers. These analyses have been done for both Unix and Windows systems. We also present some results concerning the classification of Windows NT and 2K reboot causes and the analysis of error dependencies among Unix machines. Preliminary analyses of subsets of the data presented in this report can be found in [Simache & Kaâniche 2001b, Simache et al. 2002].

This part of the report is organized as follows. Section 3.1 describes the target system architecture.
Section 3.2 outlines the data collection strategy and the main analyses carried out on the collected data. The results obtained from the analysis are presented in Section 3.3 for Unix systems and in Section 3.4 for Windows NT and 2K systems. The main conclusions are summarized in Section 3.5.

3.1 Target system architecture

The LAAS computing network is composed of a large set of heterogeneous workstations and servers interconnected through an Ethernet-based local area network. These systems are organized into subnets, according to their physical location and the research group they belong to. The subnets are interconnected through dedicated communication switches to a central switch. The latter provides connectivity to the servers shared by the whole network (SMTP, NIS+, Backup, HTTP, FTP, etc.) as well as to the Internet. Some of these services are replicated on several machines (e.g. the NIS+ server), and some machines host more than one service. In addition, some research groups have a set of servers dedicated to their users (NFS, POP, Application, Printing, etc.), while other servers are shared by several research groups. Most of the network and group servers are implemented on SunOS and Solaris machines (23), and a few shared servers run on Windows NT and 2K machines (10). The clients are a heterogeneous mix of Unix workstations, PCs and Macintoshes hosting many types of operating systems, such as SunOS, Solaris, Linux, Windows and MacOS, in a large variety of versions. It is noteworthy that some machines host more than one operating system (e.g. Linux and MacOS).

In our study, we focussed on Unix and Windows NT and 2K machines.

3.2 Data collection and processing approach

Our analysis is based on the operational data logged by the Unix, Windows NT and 2K based systems connected to the LAAS network. Each type of operating system has its own event logging mechanism. Event logging is a facility used by computer systems to record the occurrence of significant events: error reports, system alerts, and diagnostic messages. The Unix-based systems offer capabilities for event logging by means of the syslogd daemon and the Windows-based systems via the Event Logging facility. In this section, we present some details concerning these facilities, the data collection strategy used to collect the operational data and the processing approach used to analyse the data.

3.2.1 Event logging in Unix

The Unix operating system offers capabilities for event logging by means of the syslogd daemon. This background process records events generated by different local sources: kernel, system components (disk, network interfaces, memory), daemons and application processes that are configured to communicate with syslogd. Different types of events of various severity levels are generally recorded. Some of them result from the normal activity of the system, whereas others provide information about hardware, software and configuration errors as well as system events such as reboots and shutdowns. The configuration file /etc/syslog.conf specifies the destination of each event received by syslogd, depending on its severity level and its origin. The destination could be one or several log files, the administration console or the operator (notified by ). The events that are relevant to our study are generally stored in the log file /var/adm/messages.
Each event stored in this log file is formatted as follows:
- Date and time of the event
- Machine on which the event is logged
- Description of the message

Example: Dec 15 16:39:29 napoli unix: server butch not responding still trying

The Unix operating system provides the possibility to automatically control the size of the log files. This is done by executing, on a weekly basis, the script /usr/lib/newsyslog, via the cron mechanism. This script ensures that only the current log file /var/adm/messages and those recorded during the last four weeks (named messages.0, messages.1, messages.2 and messages.3) remain in the system. Therefore, data is lost if not archived within five weeks.
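The three-field format above can be split with a short parser. The sketch below is illustrative and is not the tooling used in this study (the collection scripts were written in Shell and Perl); the function name and the returned tuple layout are assumptions. Because syslogd does not record the year, the caller has to supply it.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# "<Mon> <day> <hh:mm:ss> <machine> <description>"
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<message>.*)$"
)


def parse_line(line, year):
    """Return (timestamp, machine, description), or None if the line is malformed."""
    match = LINE_RE.match(line.strip())
    if match is None or match["month"] not in MONTHS:
        return None
    timestamp = datetime(
        year, MONTHS[match["month"]], int(match["day"]),
        int(match["hour"]), int(match["minute"]), int(match["second"]))
    return timestamp, match["host"], match["message"]


# The year passed here (1999) is an arbitrary value chosen for the example.
print(parse_line(
    "Dec 15 16:39:29 napoli unix: server butch not responding still trying",
    1999))
```

Returning None for lines that do not match makes it easy to count and inspect malformed entries, which is part of the format verification done during data collection.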

3.2.2 Event logging in Windows NT and 2K

For Windows NT and 2K, event logging is implemented as a system service that runs in the background and waits for processes running on the local (or a remote) system to send it reports of events [Murray 1998]. Each event report is stored in a specific event log file on disk. There are three event log files:
- The security log contains events generated by the system security and auditing processes.
- The system log contains events generated by system components, including drivers and services. It is used primarily to store diagnostic messages that are used by the system administrators for troubleshooting abnormal conditions, or to find problems unnoticed by the users. For example, a driver has failed to load, the operation of a device has failed, an I/O error has occurred, etc.
- The application event log stores all event reports not involving security auditing and system component event reporting. It is most commonly used to report internal errors that occur during the execution of an application, such as failing to allocate memory, being unable to access an object, or aborting the transfer of a file.

The only native facility giving the user access to event logs is Event Viewer. This application displays the information on event records sorted in chronological order. It is also used to back up or clear the event logs, or to change the parameters of the event logging policy. The data displayed by Event Viewer is formatted according to the following fields:
- Event type: denotes the severity level of the event; five event types are defined: error, warning, information, success audit and failure audit.
- Date and time: indicates the date and time when the report was logged.
- Source: the registered name of the event source that reported the event.
- Category: source-specific event classification.
- Event: source-specific event identification (also called Event ID).
- User: name of the user account that generated the event.
- Computer: name of the computer that reported the event.

In addition, Event Viewer offers the possibility to display a description of the event, its cause, and where it occurred. However, such a description is not always available.

3.2.3 Data collection strategy

We have set up a data collection strategy to automatically collect the data stored in the /var/adm/messages.0 log file of each SunOS and Solaris machine and the Application and System event logs of each Windows NT and 2K machine connected to the network. This strategy has been defined to take into account the dynamic evolution of the network configuration resulting, for instance, from system administration and maintenance activities (connection of new machines, upgrade of OS versions, modification of shared services and resources configuration, modification of machine names and configuration, temporary disconnection of machines from the network, etc.).

The data collection strategy is decomposed into two main steps:
1) Identification of the list of machines to be included in the data collection process.
2) Collection of data from these machines to a dedicated machine used for data processing.

The identification of Unix and Windows machines from which data will be collected is based on the analysis of the hosts.org_dir master table maintained by the NIS+ server. All IP devices connected to the network, including Unix and Windows machines, are declared in this table. However, this table generally contains redundant information that corresponds, for instance, to machines that are declared under different IP addresses or with different names. The script that we have developed automatically detects and eliminates such redundant information to avoid collecting multiple copies of the same log files from the corresponding machines. The script also eliminates from the list those machines that are not relevant to our study: for example, machines used to support offline maintenance activities or specific experimental testbeds, Windows systems that have Linux as a second operating system, and laptops.

In the second step of the data collection strategy, the log files are remotely copied from the selected machines to a dedicated machine and collated into a single, chronologically sorted file for each machine. Only the new events logged since the last collection are selected and included in the final file containing the data for the corresponding machine. The format of the collected data is also verified at this step, and an additional field specifying the year is added to the date of each message corresponding to a Unix machine (by default, the year is not recorded by syslogd, unlike the Event Logging facility of Windows systems). This simplifies analyses of data collected over several years.
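The addition of a year field to Unix messages can be illustrated with the sketch below. It is a hypothetical reconstruction, not the authors' Perl script: it assumes the lines are in chronological order, that the year of the first line is known, and that consecutive lines are never more than a year apart, and it increments the year whenever the month number goes backwards.

```python
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def add_years(lines, first_year):
    """Yield each syslog line prefixed with an inferred year.

    lines: syslog lines starting with a three-letter month abbreviation,
    in chronological order. first_year: year of the first line (assumed known,
    for instance from the date of the collection run).
    """
    year = first_year
    previous_month = None
    for line in lines:
        month = MONTHS[line[:3]]
        if previous_month is not None and month < previous_month:
            year += 1  # month went backwards: a new year has started
        previous_month = month
        yield f"{year} {line}"


sample = [
    "Dec 31 23:59:58 napoli unix: last message of the year",
    "Jan  1 00:00:03 napoli unix: first message of the year",
]
for annotated in add_years(sample, 1999):
    print(annotated)
```

A limitation of this heuristic is that a message logged slightly out of order across a month boundary (for example after a clock step with a negative offset) would trigger a spurious year increment, so the input should be checked or sorted first.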
For Windows systems, the data collection is carried out manually once every month using the Event Viewer backup function. For Unix systems, the data collection is carried out using Shell and Perl scripts. These scripts are executed via cron on a weekly basis, in accordance with the mechanism provided by the operating system to control the size of the log files. However, manual verification is sometimes needed when problems affecting some target machines occur during the execution of these scripts; for example, a target may be down, or it may be up but cause the scripts to hang because of a local problem. Without this manual verification, some data might be lost, or the same data might be copied more than once (see [Simache & Kaâniche 2001a] for more detail).

Using this strategy, we have collected on a regular basis the event logs stored on 373 Unix, 76 Windows NT and 89 Windows 2K systems connected to the LAAS network. The data collection period was (October 1999, July 2002) for Unix systems, (January 1999, July 2002) for Windows NT systems and (September 2000, July 2002) for Windows 2K systems. However, due to the frequent addition and removal of machines from the network, the data collection period was not uniform for all the machines that we have monitored. This is illustrated in Figures 3-1 and 3-2, which plot the distribution of the data collection period (in hours) for Unix and Windows machines, respectively. These figures show a large variability among the machines with respect to the data collection period. In particular, the data collection period for a few Unix and Windows 2K machines that were recently connected to the network was short (less than 2000 hours). Clearly, these machines should be excluded from the analysis to avoid biasing the results because of the short time during which they were monitored. For the rest of the analysis, we decided to ignore all machines for which the data collection period was shorter than about three months (2000 hours). Accordingly, 23 Unix and 8 Windows 2K machines falling below this threshold were excluded from the analyses presented in the following sections.

Figure 3-1. Distribution of data collection period for Unix machines (axes: data collection period in hours, % of machines)

Figure 3-2. Distribution of data collection period for NT and 2K machines (axes: data collection period in hours, % of machines; series: NT, 2K)

3.2.4 Data processing

The data processing phase consists of: 1) extracting from the log files the information that is relevant to the dependability analysis of the target system and 2) evaluating statistical measures to identify significant trends. The log files contain a large amount of information that is not always easy to categorize. The identification of events corresponding to errors and the definition of error classification criteria require a thorough manual analysis of event logs. The classification is system dependent because, even for the same operating system, system hardware and software components, architecture and activity may strongly influence the criteria used for classification.

In this report, we focus on the identification of machine reboots, and the evaluation of statistical measures characterizing: a) the distribution of reboots (per machine, over time), b) the distribution of uptimes and downtimes associated with these reboots, c) the availability of machines including workstations and servers. These analyses have been done for both Unix and Windows systems. We also present some results concerning the classification of Windows NT and 2K reboot causes and the analysis of error dependencies among Unix machines. A summary of these results is presented in the following sections, considering first Unix systems (Section 3.3) and then Windows NT and 2K systems (Section 3.4).

3.3 Application to Unix Systems

In this section, we present the algorithm that we developed to identify machine reboots from the Unix event log files, and the results obtained from the analysis of the identified reboots. The event log files have been collected from 350 Unix machines during the period (October 1999, July 2002).

3.3.1 Identification of reboots

Three methods can be distinguished to identify when Unix machines are rebooted:
1) Use of the last reboot command
2) Analysis of /var/adm/wtmp log files
3) Analysis of /var/adm/messages log files

With the first and second methods, only the start timestamp of machine reboots can be identified. However, in our study, we are interested in identifying the start and end timestamps of machine reboots as well as the service interruption duration associated with these reboots.
Moreover, the causes of reboots can be investigated based on the analysis of the messages logged by the system before the machine is rebooted. Therefore, we have developed an algorithm to identify machine reboots based on the analysis of the /var/adm/messages log files collected from the target systems included in our data collection.

A manual analysis of the collected data revealed that not all reboots could be easily identified from the corresponding log files. Indeed, whereas some reboots are explicitly identified by a reboot or a shutdown event, many others can be detected only by identifying the sequence of initialisation events generated by the system when it is restarted. Generally, an initialisation sequence of the system is composed of about 70 messages, starting with unix: SunOS Release or unix: Copyright messages (note that these messages may appear several times in the sequence), and ending with clock synchronization messages generated by the ntpdate and xntpd or ntpd daemons. An example of such a sequence is presented in Figure 3-3.

2000 Jan 31 08:16:03 ripolin unix: Copyright , Sun Microsystems, Inc
2000 Jan 31 08:16:03 ripolin unix: SunOS Release Version Generic_ [UNIX System V Release 4.0]
2000 Jan 31 08:16:03 ripolin unix: root nexus = SUNW,SPARCstation
2000 Jan 31 08:16:03 ripolin unix: Ethernet address = 8:0:20:82:23:f
2000 Jan 31 08:16:03 ripolin unix: avail mem =
2000 Jan 31 08:16:04 ripolin unix: SunOS Release Version Generic_ [UNIX System V Release 4.0]
2000 Jan 31 08:16:04 ripolin unix: Copyright , Sun Microsystems, Inc
2000 Jan 31 08:16:13 ripolin unix: vol0 is /pseudo/vol@
2000 Jan 31 08:16:13 ripolin unix: pseudo-device: vol
2000 Jan 31 08:16:18 ripolin ntpdate[228]: step time server offset sec
2000 Jan 31 08:16:23 ripolin xntpd[231]: xntpd Tue Jul 6 18:01:08 MET DST 1999 (1)
2000 Jan 31 08:16:24 ripolin xntpd[231]: sched_setscheduler(): Operation not applicable
2000 Jan 31 08:16:25 ripolin xntpd[231]: tickadj = 5, tick = 10000, tvu_maxslew = 495, est. hz = 100

Figure 3-3. Initialisation sequence

However, we have identified several scenarios that do not fit the initialisation sequence presented in Figure 3-3. Such scenarios occur for example: a) when multiple reboots are needed before the machine can restore its normal functioning state, or b) when the time synchronization messages do not appear in the corresponding sequence, or their timestamp precedes the timestamp of the messages identifying the start of the sequence. Typically, the latter case corresponds to synchronization events with a negative offset value. An example of such a scenario is presented in Figure 3-4.
It can be seen that the timestamp of the ntpdate message (Jan 16 18:22:56) precedes the timestamp of the unix: SunOS Release message because of the negative value of the offset (-15.89).

2000 Jan 16 18:23:02 demeter unix: SunOS Release 5.7 Version Generic 64-bit [UNIX System V Release 4.0]
2000 Jan 16 18:23:02 demeter unix: Copyright , Sun Microsystems, Inc
2000 Jan 16 18:23:02 demeter unix: mem = K (0x )
2000 Jan 16 18:23:05 demeter unix: vol0 is /pseudo/vol@
2000 Jan 16 18:22:56 demeter ntpdate[269]: step time server offset sec
2000 Jan 16 18:23:01 demeter xntpd[273]: xntpd Tue Jul 6 18:01:08 MET DST 1999 (1)
2000 Jan 16 18:23:02 demeter xntpd[273]: kvm_open failed

Figure 3-4. ntpdate message with negative offset at the end of a reboot (original sequence, i.e., before sorting the data chronologically)

To identify reboots from the log files, we have developed an algorithm, implemented in Perl, that is based on the sequential parsing of the collected log files and the matching of each message against specific patterns or sequences of patterns characterizing the occurrence of reboots. These patterns correspond to explicit reboot messages or to sequences of events generated during the initialisation of the system, as explained above. The algorithm is detailed in [Simache & Kaâniche 2001a]. For each machine and for each reboot detected in the log file, this algorithm gives the timestamps of the start and of the end of the reboot, and the last event logged before the reboot with the corresponding timestamp.
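The pattern-matching idea can be illustrated with the simplified sketch below. It is not the algorithm of [Simache & Kaâniche 2001a], which is written in Perl and handles more cases; it only recognises initialisation sequences that start with a unix: SunOS Release or unix: Copyright message and end with a run of ntpdate, xntpd or ntpd messages. Explicit reboot and shutdown events are left out because their exact wording is not given here, and the event layout and function names are assumptions. Events are read in the order in which they were logged, not sorted by timestamp, so that a sequence such as the one in Figure 3-4 stays contiguous.

```python
from datetime import datetime

START_MARKERS = ("unix: SunOS Release", "unix: Copyright")
SYNC_DAEMONS = ("ntpdate", "xntpd", "ntpd")


def is_start(message):
    """True for the messages that open an initialisation sequence."""
    return message.startswith(START_MARKERS)


def is_sync(message):
    """True for clock synchronization messages such as 'ntpdate[228]: ...'."""
    daemon = message.split("[", 1)[0].split(":", 1)[0]
    return daemon in SYNC_DAEMONS


def find_reboots(events):
    """Detect reboots in the events of one machine.

    events: list of (timestamp, message) pairs in logged order (assumed layout).
    Returns a list of dicts with the start and end timestamps of each
    initialisation sequence and the last event logged before it.
    """
    reboots = []
    current = None    # initialisation sequence being assembled, if any
    previous = None   # last event seen outside an initialisation sequence
    for timestamp, message in events:
        if current is not None and current["synced"] and not is_sync(message):
            reboots.append(current)  # the run of sync messages is over
            current = None
        if current is None:
            if is_start(message):
                current = {"start": timestamp, "end": timestamp,
                           "last_event_before": previous, "synced": False}
            else:
                previous = (timestamp, message)
        else:
            # Repeated start markers inside a sequence are simply absorbed.
            current["end"] = timestamp
            current["synced"] = current["synced"] or is_sync(message)
    if current is not None:
        reboots.append(current)
    return reboots
```

This sketch does not cover the scenarios listed above: a sequence without synchronization messages is never closed before the end of the log, several reboots in a row are merged into one, and with a negative clock offset the recorded end timestamp can precede the start timestamp.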


The reboot identification algorithm allowed us to detect 8842 reboots from the log files collected from 350 Unix machines during 33 months (November 1999 until July 2002). During the observation period, several versions of SunOS and Solaris were running on these machines, including versions 1.2, 4.1.3, 4.1.4, 5.4, 5.5, 5.5.1, 5.6, 5.7 and 5.8. Among the machines that we monitored, 23 machines (referred to as main servers) hosted critical services shared by the whole network or by a large subset of users. In the following, we present various analyses of the 8842 reboots corresponding to the 350 Unix machines.

3.3.2 Distribution of reboots per machine

The number of reboots observed during the data collection period constitutes a large sample of data on which significant statistical analyses can be performed. However, these reboots are not uniformly distributed among the machines. This is illustrated by the statistics on the number of reboots per machine presented in Table 3-1. In particular, 85.7% of the Unix machines had more than 10 reboots. Further investigation showed that 50% of the reboots occurred on 25% of the machines. Such variability is explained by differences with respect to the length of the data collection period (see Figure 3-1), the configuration of these machines, the types of software running on them and the user workload.

Table 3-1. Distribution of the number of reboots per machine (five increasing ranges of the number of reboots #reb, starting from 0)
% of machines: 14.29%, 28.57%, 31.43%, 12.86%, 12.86%

The impact of the user's workload can be highlighted by considering the distribution of reboots according to the hour of the day when the reboots occurred. As illustrated in Figure 3-5, the majority of reboots occurred during normal working hours (8 AM to 6 PM).
The peak between 9 and 10 AM includes all reboots that are generally done during the morning by the system administrator to solve problems that occur during the night.

Figure 3-5. Number of reboots per hour of the day (x-axis: hour of day; y-axis: number of reboots; Unix machines)
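Once reboots have been identified, the two distributions discussed in this section (reboots per machine and reboots per hour of the day) reduce to simple counting. The sketch below shows one way to compute them; the function names and input layouts are assumptions, and the sample data is invented for illustration.

```python
import math
from collections import Counter
from datetime import datetime


def reboots_per_hour(reboot_starts):
    """Return a 24-element list: number of reboots that started in each hour."""
    counts = Counter(timestamp.hour for timestamp in reboot_starts)
    return [counts.get(hour, 0) for hour in range(24)]


def share_of_top_machines(reboots_per_machine, machine_fraction):
    """Fraction of all reboots due to the most-rebooted share of machines.

    reboots_per_machine: dict mapping machine name to its number of reboots.
    machine_fraction: e.g. 0.25 for the 25% of machines with the most reboots.
    """
    counts = sorted(reboots_per_machine.values(), reverse=True)
    top = max(1, math.ceil(len(counts) * machine_fraction))
    return sum(counts[:top]) / sum(counts)


# Invented sample data, for illustration only.
starts = [datetime(2000, 1, 31, 9, 5), datetime(2000, 1, 31, 9, 40),
          datetime(2000, 2, 1, 14, 0)]
print(reboots_per_hour(starts)[9])
print(share_of_top_machines({"a": 50, "b": 20, "c": 20, "d": 10}, 0.25))
```

Because the monitored machines were observed for different durations, raw counts per machine are not directly comparable; dividing each count by the machine's data collection period would give a reboot rate instead.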
