Enabling Rollback Support in IT Change Management Systems

Guilherme Sperb Machado, Fábio Fabian Daitx, Weverton Luis da Costa Cordeiro, Cristiano Bonato Both, Luciano Paschoal Gaspary, Lisandro Zambenedetti Granville
Institute of Informatics, UFRGS - Brazil
{gsmachado, ffdaitx, weverton.cordeiro, cbboth, paschoal, granville}@inf.ufrgs.br

Claudio Bartolini, Akhil Sahai, David Trastour, Katia Saikoski
HP Laboratories Palo Alto, USA
HP Laboratories Bristol, UK
HP Brazil R&D, Brazil
{claudio.bartolini, akhil.sahai, david.trastour, katia.saikoski}@hp.com

Abstract

The current research on IT change management has been exploring several aspects of this new discipline, but it usually assumes that changes expressed in Requests for Change (RFC) documents will be successfully executed over the managed IT infrastructure. This assumption, however, is not realistic in actual IT systems because failures during the execution of changes do happen and cannot be ignored. In order to address this issue, we propose a solution where tightly-related change activities are grouped together, forming atomic groups of activities. These groups are atomic in the sense that if one activity fails, all other already executed activities of the same group must roll back to move the system backwards to the previous state. The automation of change rollback is especially convenient because it relieves the IT human operator of manually undoing the activities of a change group that has failed. To prove the concept and its technical feasibility, we have materialized our solution in a prototype system that, using elements of the Business Process Execution Language (BPEL), is able to control how atomic groups of activities must be handled in IT change management systems.

I. INTRODUCTION

Currently, modern companies and organizations are often unable to deliver high quality services without employing sophisticated IT infrastructures to support their final businesses.
Sophisticated IT infrastructures, in turn, are usually accompanied by complex management challenges that often lead to increasing maintenance costs. A rational management of IT infrastructures then becomes a critical issue for any organization that aims at keeping good financial health. In order to provide a more systematic IT infrastructure management and thus reduce management costs, the widely recognized Information Technology Infrastructure Library (ITIL) [1] presents a set of best practices and processes that helps organizations to properly maintain their IT infrastructures. Among the ITIL processes, change management [2] is the one that defines how changes in IT infrastructures should be planned, scheduled, implemented, and assessed. The importance of change management resides in the fact that changes in IT infrastructures must be executed in a way that does not lead the managed systems to unknown or inconsistent states. To address this issue, changes required over the IT infrastructures are first expressed in Requests for Change (RFC) documents that define which changes are needed but not how they must be performed. The definition of an RFC is the first step of a process that will generate a final change plan, which is essentially a workflow of low-level activities that, when executed, will evolve the managed system to a new state consistent with the changes expressed in the original RFC.

Although change management is a relatively new discipline, several important challenges have already been investigated in research projects [3] [4] [5]. Given the complexity of the subject, these investigations have naturally made some assumptions that enable the investigations to progress. One of these assumptions is that once an RFC is approved and ready to be deployed, the activities of the associated change plan will always succeed and lead the IT infrastructure to the next consistent state.
This assumption does not in fact represent a realistic and practical situation in actual IT environments because failures during the execution of changes do happen and cannot be ignored.

In this paper we focus our attention on the necessity of handling failures in change plan execution, in order to prevent managed IT infrastructures from ending up in undesired states. We first define that, after the deployment of an RFC, the managed infrastructure must have either successfully evolved to a new state or returned to the state previous to the change request. In other words, RFC deployment is treated as a single atomic transaction. To support this behavior, whenever a failure occurs during a change plan execution, a rollback procedure is invoked to undo the changes executed so far and abort the ongoing change plan. In fact, we go further and observe that for some IT scenarios it would be too restrictive to consider an RFC the only possible atomic element. We thus propose that additional atomic elements can be complementarily defined at a granularity finer than that of an RFC, for example, at the change plan level.

In order to materialize atomic transactions in IT management, we have employed a set of techniques in a prototype system developed to evaluate our proposed solution. In particular, we have explored some exception-related mechanisms (e.g., fault handlers, asynchronous notifications, compensation activities) present in the Business Process Execution Language (BPEL) [6]. In our prototype, atomic transactions defined by human system operators at the RFC and change plan levels are translated into BPEL constructions via a translating algorithm also presented in this paper. A set of experiments has additionally been carried out to observe the impact of our proposal on the whole management system as well as on the managed IT infrastructure.

The remainder of this paper is organized as follows. In Section II we briefly review rollback support for computing systems. The proposed solution to incorporate rollback support in change plans is presented in Section III, while the associated prototype is described in Section IV. The results of the evaluation carried out in this research are then presented in Section V. Finally, we close this paper in Section VI, where conclusions and future work are discussed.

978-1-4244-2066-7/08/$25.00 © 2008 IEEE

II. BACKGROUND

Rollback support is a complex subject in diverse computer science disciplines. Several aspects of computing systems (e.g., faulty underlying communications, dependencies among distributed components, services unavailability) make saving consistent states and subsequently rolling back to them a task that cannot always be properly or successfully accomplished. Nevertheless, some mechanisms to support rollback have already been proposed, investigated, and implemented. In this section we review rollback-related work that has inspired the design of our proposed solution.

At the device level, a common way to implement rollback support is to download a device's configuration file to a configuration server, deploy a new configuration, and use the previous one again if the new configuration makes the device behavior unstable. An evolution of this solution can be seen in devices where candidate configurations are stored inside the managed devices themselves, dispensing with an exterior configuration server.
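The device-level scheme just described (save the current configuration, deploy the new one, fall back if the device becomes unstable) can be illustrated with a minimal sketch. The `Device` class, the `deploy_with_rollback` function, and the MTU-based stability rule are all hypothetical names and assumptions introduced here for illustration; they are not part of the paper or of any real device API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Device:
    """Hypothetical managed device with a key/value configuration."""
    name: str
    running: Dict[str, str]
    # Stands in for the external configuration server (or on-device storage).
    backups: List[Dict[str, str]] = field(default_factory=list)


def deploy_with_rollback(device: Device,
                         candidate: Dict[str, str],
                         is_stable: Callable[[Device], bool]) -> bool:
    """Apply `candidate`; restore the saved configuration if the device is unstable."""
    device.backups.append(dict(device.running))   # save the current configuration
    device.running = dict(candidate)              # deploy the new configuration
    if is_stable(device):
        return True
    device.running = device.backups.pop()         # fall back to the previous one
    return False


# Usage with an assumed stability rule: MTU values above 1500 break this device.
router = Device("edge-router", {"mtu": "1500"})
ok = deploy_with_rollback(router, {"mtu": "9000"},
                          lambda d: int(d.running["mtu"]) <= 1500)
```

In this sketch the backup list plays the role of the configuration server; keeping it inside the `Device` object loosely mirrors the variant where candidate configurations are stored on the device itself.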
Recently, the NETCONF protocol [7], proposed by the Internet Engineering Task Force (IETF), has incorporated the notion of transactions in a configuration task, which prevents the managed devices from evolving to states unknown to the network operator.

Rollback at the device level alone is not sufficient for complex IT scenarios because different devices and services often depend on one another. For example, if the installation of a new Web server that requires additional configuration of the border firewall fails, not only does the server installation itself need to be undone, but the configurations on the border firewall must also be returned to the previous state. Rollback at the network level (above the device level) is then required. In an earlier work, we have proposed a Policy-Based Network Management (PBNM) system [8] where failures in the deployment of QoS policies return the managed devices to the previous state using an adapted version of the two-phase commit protocol. Andrzejak et al. [9] have investigated automatic workflow generation that can adapt the underlying IT system in reaction to failures. However, this initial work does not present many details about automatic correction of graphs in response to partial failures. In addition, the authors recognize that the proposed solution has some bottlenecks, such as a complexity limit related to the number of objects and existing operators, an upper bound on the sum of costs, and limits on the specification of the actions.

In change management, as far as the authors of this paper are aware and as mentioned in the introduction, there is no work that addresses the question of employing rollback as a mechanism to maintain the managed IT infrastructure in a consistent state. The importance of the subject can even be directly observed in the ITIL documents themselves, where back-out plans are explicitly mentioned as a requirement of change management. However, no proper support is found in the current systems.
In the next sections we will introduce our proposed solution for rollback in change management.

III. ROLLBACK SOLUTION

In this section we present our solution for rollback support in IT change management systems. We first describe a general IT management architecture where the rollback support is introduced. We then discuss how atomic actions take place and how system administrators are able to group critical activities in atomic groups. Finally, we present the modeling of IT infrastructure classes to support the proposed approach.

A. IT Change Management Architecture and Rollback Support Components

Although there is not a single, widely employed architecture to support IT change management, it is possible to identify a set of basic functional components that, grouped together, form a general architecture. We introduce in such an architecture complementary elements to explicitly support rollback in change plans. Figure 1 depicts the general IT change management architecture, highlighting the components required for the rollback support.

[Figure 1: diagram of the change management system, the deployment system, and the managed IT infrastructure. Components shown: Change requester, Operator, Change designer, Change planner, Rollback planner, Config. Mgmt. Database, SW Config. Repository, Definitive SW Library, Rollback support generator, Change deployer, Rollback engine.]
Fig. 1. IT change management architecture

The specification of a new RFC begins when the change requester describes, in a high-level document, his/her change necessities. This is achieved by interacting with the change designer, which is a tool that helps the change requester to fill in the RFC document in a clear and consistent way. It is important to remember that an RFC expresses what is required, but not how to achieve it. The latter is initially defined when the operator, also interacting with the change designer, comes up with a preliminary change plan. For example, consider the statement "install a new Web-based project management system on server A", which is typical of an RFC. Here, nothing defines whether a new Web server is needed or if an additional database must be created prior to the installation of the Web application; this is, however, informed by the operator who, consulting the Configuration Management Database (CMDB), is able to check whether Web servers and required databases are available. The output of the change designer is then a preliminary change plan that needs to be further complemented.

The change planner component is the one responsible for automatically computing an actionable workflow that defines the final change plan. The algorithm for such computation is out of the scope of this paper, but it has already been addressed by other authors [3] and is the subject of complementary research in our group. The change planner, in its process of complementing the preliminary change plan, needs to consult two databases: the Configuration Management Database and the Software Configuration Repository. The CMDB stores updated information about the managed IT infrastructure, enabling the change planner to discover which elements must be manipulated in order to fulfill the original RFC. The Software Configuration Repository, in its turn, maintains information about software dependencies, consumed during the installation/uninstallation process. For example, the CMDB lists the software already available on server A, but it is the Software Configuration Repository that informs, during the installation process, that the Web-based management system requires a pair of Web server and database to work properly.
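The division of labor between the two databases can be sketched as follows: the CMDB answers "what is already on server A?", while the Software Configuration Repository answers "what does this package require?". The paper leaves the change planner's algorithm out of scope, so the function below is only an illustrative dependency walk under assumed data shapes; `missing_software`, the dictionary layouts, and the package names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def missing_software(target: str,
                     installed: Iterable[str],
                     requires: Dict[str, List[str]]) -> List[str]:
    """Return the packages still to be installed for `target`, dependencies first."""
    order: List[str] = []
    seen: Set[str] = set(installed)   # packages the CMDB reports as present

    def visit(pkg: str) -> None:
        if pkg in seen:
            return
        seen.add(pkg)                 # mark early so dependency cycles terminate
        for dep in requires.get(pkg, []):
            visit(dep)
        order.append(pkg)             # appended after its dependencies

    visit(target)
    return order


# Assumed contents of the two databases for the "server A" example.
cmdb = {"server A": {"web-server"}}
repository = {"project-mgmt-webapp": ["web-server", "database"]}
to_install = missing_software("project-mgmt-webapp", cmdb["server A"], repository)
```

With these assumed inputs the Web server is skipped because the CMDB already lists it, and the database is scheduled before the Web application that depends on it.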
In a change system with no rollback support, the workflow computed by the change planner would be ready to be submitted, upon the operator's order, to the deployment system. It then executes, using off-the-shelf solutions, the changes over the IT infrastructure (according to the change plan). One central point of this paper is that, in this case, if any fault occurs during the change plan deployment, the system will possibly enter an inconsistent state, because no reactions to faults are usually defined.

In our approach, we address this issue by complementing the change plan with rollback information. That is accomplished through the rollback planner component. As can be seen in Figure 1, it takes as input an actionable workflow as well as additional marks informed by the operator. These additional marks allow the rollback planner to create an enhanced version of the original change plan, which now includes marks to support rollback actions if failures occur.

Internally, the deployment system is composed of three components: rollback support generator, change deployer, and rollback engine. The rollback-enabled change plan is submitted to the rollback support generator, which creates internal structures to support eventual rollback actions, while the change deployer performs changes over Configuration Items (CIs) of the IT environment. If any failure occurs in the change process, the rollback engine is invoked and executes the rollback procedure for the failed action, following the marks provided in the rollback-enabled change plan.

Whether a change plan has been successfully deployed or a failure has triggered a rollback procedure, the deployment system must update the CMDB at the end of the process. This ensures that the configuration database will provide an updated view of the IT infrastructure, which is required by the change planner to compute future change plans.

B. Marking Rollback-enabled Change Plans

As mentioned before, the operator is responsible, in a rollback-enabled IT change management system, for marking the original change plan in order to complement it with rollback information. In order to present how these marks are defined, we first need to observe how RFCs and change plans are internally organized.

A single RFC is composed of one or more operations. Each operation is an independent element that must be executed to accomplish the change requested in the RFC. Since operations are independent from each other, two different operations of the same RFC can be executed in parallel. Internally to each operation, a single change plan is found, which means that one change plan is associated with each operation. In fact, all change plans from the same RFC could be merged into a single one to optimize the deployment process, but for the sake of simplicity we assume in this paper that RFCs with more than one operation will present one change plan for each operation. Finally, each change plan is composed of a set of activities that are chained to form the final actionable workflow.

In order to support rollback-enabled RFCs, we define that some elements (RFCs, operations, or activities) can be marked as atomic elements using atomicity marks. The simplest case is the one where the whole RFC is marked as atomic. It means that, if any problem happens during the execution of any of its change plans, all activities must go backward, moving the IT infrastructure to the state previous to the RFC deployment. If no mark is defined at all, any failure will abort all change plans in execution, but no rollback action will be performed, leaving to the system administrator the responsibility of leading the infrastructure back to a consistent state.

Limiting the atomicity marks to the RFC level may be too restrictive in some scenarios. Consider, for example, that one does not want to uninstall a Web server in the event of a failure in the process of installing an associated database.
We thus further define that, besides marking an RFC as atomic, one can mark each operation of an RFC as individually atomic as well. In this case, only the change plan of a failing operation rolls back, leaving the change plans of other operations untouched.

Finally, atomicity can be defined at the activity level. Here, if an activity defined as atomic fails, it will only roll back itself, and the subsequent activities will still be executed. If a failing activity is not atomic, no rollback is executed, and the associated change plan is aborted altogether. An additional mechanism is incorporated at the activity level: we define that a group of activities that are closely related to each other may form an atomic group. Once any activity of such a group fails, all other activities in the same group must roll back.

Using atomic groups, one can define an atomic operation in two different ways: (a) by marking, at the operation level, the operation as atomic, or (b) by grouping all activities of an operation in a single atomic group. Option (a) is obviously easier to choose, but the fact that both (a) and (b) are possible indicates that an atomic operation is, in fact, a particular case of an atomic group composed of all activities from a single change plan.

In order to exemplify the use of atomicity marks, Figure 2 depicts how atomicity is defined at the aforementioned three levels, i.e., at the RFC, operation, and activity levels.

[Figure 2: panels (a) and (b) show an RFC with operations OP1, OP2, and OP3; panels (c) and (d) show change plans composed of numbered activities.]
Fig. 2. Examples of atomicity marks and atomic groups

Figure 2-a shows an RFC composed of three operations: OP1, OP2, and OP3. In this case, the whole RFC is marked as atomic, implying that any problem in any operation will revert all actions executed. Figure 2-b presents the same RFC and operations, but atomicity is defined differently now. If OP1 fails, its internal actions will be undone, but it will not affect other operations of the RFC. The same happens with OP2. Since OP3 is not marked, any failure in its actions will abort the associated change plan without executing any rollback procedure.

Figure 2-c presents marks at the activity level. In this case, a change plan composed of activities numbered from 1 to 6 has two atomic groups: one formed by activities 1 and 2, and another formed by activities 4 and 5. Note that, in the first group, the activities are executed sequentially. In this case, if activity 2 fails, activity 2 itself and activity 1 are reversed. After that, the workflow continues, evolving to the execution, in parallel, of activities 3, 4, and 5.
In the second atomic group, if activity 4 or 5 fails, both will be undone. In this case, activity 3 is not affected at all. Once activity 3 finishes, and activities 4 and 5 roll back or complete successfully, activity 6 will be ready to be executed. In the event of a failure in activity 3, it is important to emphasize that the whole change plan is aborted, skipping the execution of activity 6. Finally, Figure 2-d shows the case where the whole change plan is made atomic. The same effect can be achieved by defining that the operation that generated this change plan is atomic, in this case without requiring the definition of an atomic group of all the operation's activities.

C. Rollback Model

In order to enable change plans to include rollback support, it is necessary to model change plan information with rollback in mind. As mentioned before, a change plan consists of a workflow of activities that, when executed, lead the managed infrastructure to a new state. Therefore, for our solution, a model for change plan information must express actionable workflows that include rollback support. We design our solution by defining a Requests for Change and Change Plan Model. Our model is strongly based on the change management guidelines presented in the ITIL Service Support book [1], and on the approach to specify workflows defined by the Workflow Management Coalition (WfMC) [10].

It is not our goal here to stress what information is required to fully express workflows. Actually, that would be a whole research effort in itself. Rather, we are interested in modeling the information required for rollback support in a change management system. Considering these assumptions, Figure 3 presents a partial view of the defined model, highlighting the elements required for the rollback support, while omitting others that are not related to rollback.

[Figure 3: class diagram divided into RFCM (Request for Change Model), CPM (Change Plan Model), and RM (Rollback Model). Classes shown include RFC (name: String, reason: String, priority: int, status: int, type: String, ...), Operation (name: String, priority: int, type: int, ...), ChangePlan, ActivitySet, Activity, BlockActivity, TransitionInformation, AtomicRFC, AtomicOperation, AtomicActivity, and ActivityAtomicGroup (groupname: String).]
Fig. 3. RFC and change plan model with rollback support

An RFC is composed of Operations that, in turn, are composed of one or more ChangePlans. RFC and Operation hold more abstract information of a change request, and thus form the Requests for Change part of our model. An AtomicRFC is a specialized RFC whose final actions must be treated as a single transaction by the deployment system, i.e., one marks an RFC as atomic by using the AtomicRFC class. In the same way, an AtomicOperation is an Operation whose associated actions will roll back in the event of a problem throughout their deployment.

Each operation in an RFC has a change plan composed of ActivitySets, which are groups of one or more activities intended to implement a change plan. An Activity can be either a low-level, non-refinable activity (LeafActivity), which is the lowest level of granularity that an action can represent, or grouped activities (SubProcessDefinition and BlockActivity), which are activities composed of another activity set or of a new change plan. The TransitionInformation class models how activities are chained in the final workflow. The classes ChangePlan, ActivitySet, Activity, LeafActivity, TransitionInformation, SubProcessDefinition, and BlockActivity form the Change Plan part of our model.

In order to define an atomic activity, one should mark it using the AtomicActivity class. Since several atomic activities can be grouped together in atomic groups, each atomic activity needs to provide the identification, in a string, of the atomic group it belongs to. If no name is provided, the activity belongs to a group formed solely by itself. Notice that it is not possible for a single activity to be part of more than one atomic group. If that were possible, the activity would work as a mechanism to merge the rollback behavior of those groups: if one atomic group rolled back, the activity common to both groups would have to roll back too, leading the activities of the second atomic group to roll back as well.

D. Translating Marked Change Plans to Actionable Workflows with Rollback

In order to produce actionable workflows to deploy an RFC with rollback support, an automatic mechanism to generate such rollback plans is necessary. For each marked element in the same atomic group, the rollback plan is automatically generated by following two general steps: (1) reversing the entire change plan, and (2) not including those activities that the marked element depends on.
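The two steps above can be sketched in a few lines. The paper does not spell out the translating algorithm in detail, so this is one possible reading rather than the authors' implementation: step (2) is interpreted here as keeping only the activities that share the failing activity's atomic group, which leaves out prerequisites lying outside that group. The `Activity` class and `rollback_plan` function are hypothetical names, and the sample activities follow the marks of Figure 2-c.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Activity:
    name: str
    atomic: bool = False
    group: Optional[str] = None   # atomic group name; None means "group of itself"

    def group_key(self) -> Optional[str]:
        """Identify the atomic group, or None for a non-atomic activity."""
        if not self.atomic:
            return None
        return self.group if self.group is not None else self.name


def rollback_plan(executed: List[Activity], failed: Activity) -> List[str]:
    """Names of the activities to undo, in undo order, when `failed` fails."""
    if not failed.atomic:
        return []                                        # unmarked: abort, no rollback
    reversed_plan = list(reversed(executed + [failed]))  # step (1): reverse the plan
    key = failed.group_key()
    # Step (2), as interpreted here: drop everything outside the atomic group.
    return [a.name for a in reversed_plan if a.group_key() == key]


# Marks of Figure 2-c: groups {1, 2} and {4, 5}; activity 3 is unmarked.
a1, a2 = Activity("1", True, "G1"), Activity("2", True, "G1")
a3 = Activity("3")
a4, a5 = Activity("4", True, "G2"), Activity("5", True, "G2")
plan_when_2_fails = rollback_plan([a1], a2)
plan_when_5_fails = rollback_plan([a1, a2, a3, a4], a5)
plan_when_3_fails = rollback_plan([a1, a2], a3)
```

Including the failing activity itself in the plan follows the description of Figure 2-c, where activity 2 is reversed together with activity 1.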
For example, if an activity A is marked to participate in an atomic group AG1, the first step to generate the rollback plan is to reverse the change plan. After that, it is possible to identify which activities activity A depends on, and then not include them in the rollback plan. Note that this proposed method reproduces, in a sense, the idea of a common stack: in a normal execution, activities are pushed onto the stack. Otherwise, if any failure happens, the system will pop each element in order to perform the rollback.

IV. PROTOTYPE IMPLEMENTATION

In order to prove the concept, we have developed a prototype system that implements our proposed rollback solution. Our implementation is based on Web services technologies and standards, mainly due to (1) the interprocess communication over the Internet and (2) Web services composition such as the Business Process Execution Language (BPEL), used to coordinate distributed actions over an IT infrastructure. In this context, our implementation is based on the following Web services solutions:

- At the final Configuration Items (CIs) side, we assume that target elements that need to be managed (e.g., hosts, servers, clusters, storage) implement a management interface as a Web service. It can, for example, be materialized following the Configuration Description, Deployment, and Lifecycle Management (CDDLM) specification [11]. Associated with the Web service management interface, we assume that a Web Service Description Language (WSDL) [12] document is also available, describing the management interface itself;
- In order to deploy the required changes over an IT infrastructure, actionable workflows are described in BPEL documents that can be read and executed by a BPEL engine, such as ActiveBPEL [13]. The BPEL engine operates as the deployment system of the previously presented architecture. The communications with the BPEL engine are accomplished using Web services as well;
- The change management system is implemented as a simple Web application accessed by both change requester and operator, and works as the Web service client of the deployment system.

A. BPEL Constructions to Support Rollback

While executing a change plan, the deployment system must keep track of how the actions to be performed are organized. To orchestrate the system as expressed in a given workflow, four BPEL constructs were used: sequence, flow, if, and links, the last being a basic BPEL construction to create links between activities. The BPEL invoke activity was used to represent a workflow activity, allowing a remote Web service to be called to perform the given task. It can thus be used to perform remote operations using two-way (request-response) or one-way messages. In a one-way communication, the invoker sends a message and does not wait for any response. In the request-response communication style, a message is sent by the BPEL engine and the processing of the remainder of the workflow is blocked until a response arrives. In this case, the communication is synchronous and timeout-related issues must be taken into account. Our prototype supports both synchronous and asynchronous communication, combining sequence, flow, if, and links to guide the execution of workflows for an effective change plan execution.

Finally, in order to detect configuration problems and thus trigger rollback actions, additional BPEL constructions have been employed. For example, an invoke activity can include fault handlers to deal with errors associated with the invoked service. In this case, the invoke activity must be added to a scope, and errors can be caught by a catchAll activity at execution time, deviating the normal execution flow to a different flow that handles the failure. As we assume in the solution that rollback actions do not fail, no additional fault handler constructions are attached to a catchAll activity.
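The control flow of a scope with a catch-all fault handler can be imitated outside BPEL to make the behavior concrete. The sketch below is a Python analogue, not BPEL and not the prototype's code: a `try`/`except` plays the role of the scope and its catchAll handler, and a stack of undo callbacks plays the role of the rollback flow. `run_scope` and the sample activities are hypothetical, and the sketch shares the assumption stated above that rollback actions themselves do not fail.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, do, undo)


def run_scope(activities: List[Step], log: List[str]) -> bool:
    """Run activities in order; on any fault, undo them in reverse (LIFO) order."""
    undo_stack: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, do, undo in activities:
            # Pushed before invocation so that a partially applied, failing
            # activity is reversed too, as in the Figure 2-c description.
            undo_stack.append((name, undo))
            do()
            log.append(f"done {name}")
    except Exception as fault:           # analogue of a catchAll fault handler
        log.append(f"fault: {fault}")
        while undo_stack:
            name, undo = undo_stack.pop()
            undo()                       # assumed never to fail
            log.append(f"undone {name}")
        return False
    return True


# Web server plus border firewall example, with the firewall step failing.
state = set()


def open_firewall() -> None:
    raise RuntimeError("firewall unreachable")


steps: List[Step] = [
    ("install-web-server", lambda: state.add("web-server"),
     lambda: state.discard("web-server")),
    ("open-firewall", open_firewall, lambda: state.discard("fw-rule")),
]
log: List[str] = []
ok = run_scope(steps, log)
```

Because the undo callbacks use `discard`, reversing an activity that never took effect is harmless, which is what makes it safe to push an activity before invoking it.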

B. Deployment System

The deployment system is implemented in Java and organized internally in three blocks already introduced in Figure 1: the rollback support generator, the change deployer, and the rollback engine. The complete diagram for the deployment system is presented in Figure 4.

[Figure 4: diagram of the deployment system. Rollback support generator: Import WSDL, Validate WSDL, Convert to BPEL, Add rollback support, Create depl. descriptor. Change deployer: Build depl. file, Execute change plan. Rollback engine: Order undo actions, Rollback. Also shown: Update CMDB.]
Fig. 4. Deployment system

First, the rollback support generator receives a marked change plan and, after reading the internal information, imports the set of WSDL files from all the endpoints that will be affected by the change plan. The WSDL files are then validated to guarantee that all required resources and operations are available in the managed elements. In this step, a verification is also done in order to determine what kind of communication will be used to perform each activity (synchronous or asynchronous). Next, the workflow conversion is used to convert the original marked workflow into a BPEL workflow. In our implementation, the change plan file is already a BPEL document that contains no rollback support but only atomicity marks. In order to transform these marks into BPEL rollback structures, the add rollback support component is used. Finally, with all information ready to be delivered for execution, the create deployment descriptor component creates a complementary file called Process Deployment Descriptor (PDD), required by the ActiveBPEL engine (which we also use in our implementation) to execute a whole actionable workflow. The complete set of files is then forwarded to the change deployer.

The change deployer expands the BPEL engine functionality by adapting its resources to the deployment system's purpose. First, it builds a file packing all previously generated files (WSDL, BPEL, and PDD). Finally, ActiveBPEL is called to execute the change plan.
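The WSDL validation step can be pictured as a check that every operation a change plan intends to invoke is actually declared by the endpoint. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module rather than the Java prototype's code; the sample WSDL document, the operation names, and the `missing_operations` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Set

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"   # WSDL 1.1 namespace

# Hypothetical management interface of a managed host.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="HostManagement">
  <portType name="HostManagementPort">
    <operation name="install"/>
    <operation name="uninstall"/>
    <operation name="configure"/>
  </portType>
</definitions>"""


def available_operations(wsdl_text: str) -> Set[str]:
    """Collect the operation names declared anywhere in the WSDL document."""
    root = ET.fromstring(wsdl_text)
    return {op.get("name") for op in root.iter(f"{{{WSDL_NS}}}operation")}


def missing_operations(wsdl_text: str, required: Iterable[str]) -> List[str]:
    """Operations the change plan needs that the endpoint does not declare."""
    return sorted(set(required) - available_operations(wsdl_text))


# A plan that needs "configure" and an assumed reverse action "unconfigure".
missing = missing_operations(SAMPLE_WSDL, ["install", "uninstall",
                                           "configure", "unconfigure"])
```

Listing the reverse action among the required operations is a design choice of this sketch: it surfaces, before deployment, an endpoint that could not undo one of its own actions.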
In fact, ActiveBPEL is responsible for executing the change plan and handling the faults. Once a failure is detected, the normal flow of the change plan is intercepted and the rollback engine is called. The first step is to order the activities that must be undone to accomplish an atomic behavior. In the current version of our prototype, this order is computed by simply following the reverse order of the atomic activities executed so far. Different orders to roll back the system may exist, possibly improving performance in terms of rollback latency. Optimizing the order of activities in rollback procedures is the subject of future work.

After computing the order of activities, the rollback itself is executed by invoking reversing operations at remote endpoints to undo the previous activities. For example, if an original activity was an install instruction, the reverse of it will be an uninstall instruction. Besides assuming that the configuration items' interfaces are implemented as Web services, we also assume that for each remote action there will be another action able to undo the first one. If that is not the case, however, the rollback procedure itself may lead the managed system to another unpredicted state. It is thus crucial that not only the rollback support works properly, but that the endpoints present reversible actions as well. These reversible actions are aligned with Recovery-Oriented Computing (ROC), where undo/redo is one proposed recovery technique that is able to cure a high percentage of failures [14].

After a rollback, the execution flow may return to the original change plan, or evolve directly to its end, depending on how the atomic activities and groups have been defined by the system operator. The last action, as already mentioned, is to update the CMDB to reflect the new state of the managed IT infrastructure.

V. CASE STUDY & ANALYSIS

To prove the concept and technical feasibility of our proposal, we have conducted an experimental deployment considering a real-life scenario.
In subsection V-A we provide a detailed view of the studied scenario, while in subsection V-B we discuss the results achieved.

A. Case Scenario

Our case study is based on a company that provides services to its customers over the Internet. In order to cope with the heavy demands of the provided services, the company employs a high-performance cluster composed of 10 nodes. Each node is equipped with a Dual Core Xeon processor and 2 GB of RAM. In addition to the cluster, an authentication server responsible for customer authentication is present as well. Finally, an HP 9000 server, configured using optimal performance options, hosts a MySQL database used to persist the information manipulated by the provided services. This IT environment is responsible for handling user requests within appropriate performance thresholds, such as the Emergency Load Threshold (ELT). ELT is calculated and measured monthly, taking into account the average number of customer accesses, and typically varies when the company releases a new service. Once ELT is exceeded, the availability of the provided services may be severely compromised. When the company intends to release a new service, an increase in the access load is expected; this increase is estimated through a poll. In order to support this load increase and maintain the system health, the company decides to upgrade its IT infrastructure to have ELT at 55%. Let us assume that, according to the poll, increasing ELT to 45% would be sufficient for a new service release, but the company wants to
have a comfort margin of an additional 10%. Considering this IT upgrade strategy, the previous decisions are materialized in the IT infrastructure according to the RFC document presented in Figure 5.

Fig. 5. Example of an RFC

In this RFC, two operations are defined: a) a database service update, and b) an access capability upgrade. The first operation intends to improve the database reliability by updating MySQL to the Enterprise Edition (EE). The second operation installs a new machine in the access cluster and tunes the configuration of the older machines in order to increase their performance. After being processed by the change planner, the operations generate the change plans depicted in Figure 6.

Fig. 6. Parallel change plans

Considering the first operation as a critical one, the change manager marks this operation as an atomic operation. The second operation, however, is not marked; in fact, the change manager delegates that to the operator, who must define, among the operation's internal activities, those that must be grouped in atomic groups. The change plan for the second operation is composed of the installation of a new machine with better hardware resources, and performance improvements on each old machine through the installation of additional memory. The first action of the change plan is to physically install and configure the hardware for the new access cluster, including a RAM installation for each old machine. This action is performed by humans, and we assume, in this example, that no failure occurs. As the change plan evolves, it executes the software installation/configuration in parallel. The technical conditions are an important point in defining which activities will participate in the atomic groups, which are set in our case study in a way that prevents the system from becoming unavailable. Considering this scenario, the operator observes that, to guarantee the access cluster availability and meet the new user access demands, the change plan must consider the following condition: the activity related to the new machine must succeed (install and configure software), and the activities related to the old machines (i.e., a configuration of performance options) may either succeed or fail. The result of the performance options configuration is not critical because, even with a failure, the access cluster will remain available whether or not a rollback action is performed. In this case, the performance configuration will lead to an increase of 10% in ELT, which is not sufficient for the new service release, but it does not affect the company's current services. Therefore, to support the operator's observations, some conditions are described as follows:

• If one activity related to an old machine fails, the system must roll back all other old machines to the previous consistent and working configuration. In this case, one machine configured differently from the others may render the customer services unavailable. Performing the rollback just for the old machines, in case of one failure, guarantees at least the cluster functionality with a better memory specification;

• If the new machine installation/configuration fails, the system must identify and roll back its actions. In this case, the change plan will not be interrupted and the access cluster will work as before. The success of this activity guarantees the company's new service release, since the new machine configuration can elevate ELT to 45%. Otherwise, the support for customer access demands that is required by the new service release cannot be assured.

For these reasons, expressing atomicity at the activity level guarantees at least the access cluster availability. In this case, the single activity related to the new machine is set as a participant of Atomic Group 1 (AG1), and all activities related to the old machines, in turn, are set as Atomic Group 2 (AG2).
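
To make the grouping concrete, the following minimal Python sketch illustrates how AG1 and AG2 could behave under the reverse-order rollback policy of our prototype. It is an illustration only, not the BPEL-based implementation: the names Activity, AtomicGroup, and build_case_study are hypothetical, the activities run sequentially rather than in parallel, only three old machines are modeled instead of ten, and a failing activity is assumed to leave no partial effects that need undoing.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Activity:
    """One change-plan activity paired with its (assumed) reversing action."""
    name: str
    do: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class AtomicGroup:
    """Activities that must all succeed or all be undone."""
    name: str
    activities: List[Activity]
    executed: List[Activity] = field(default_factory=list)

    def run(self) -> bool:
        """Run activities in order; on the first failure, roll back the group."""
        for activity in self.activities:
            try:
                activity.do()
                self.executed.append(activity)
            except Exception:
                self.rollback()
                return False
        return True

    def rollback(self) -> None:
        """Undo executed activities in the reverse of their execution order."""
        while self.executed:
            self.executed.pop().undo()


def build_case_study(failing: Set[str], log: List[str]) -> Tuple[AtomicGroup, AtomicGroup]:
    """Build AG1 (new machine) and AG2 (old machines); names in `failing` raise."""
    def make(name: str) -> Activity:
        def do() -> None:
            if name in failing:
                raise RuntimeError(f"{name} failed")
            log.append(f"do {name}")

        def undo() -> None:
            log.append(f"undo {name}")

        return Activity(name, do, undo)

    ag1 = AtomicGroup("AG1", [make("new-machine")])
    ag2 = AtomicGroup("AG2", [make(f"old-machine-{i}") for i in range(1, 4)])
    return ag1, ag2


if __name__ == "__main__":
    log: List[str] = []
    ag1, ag2 = build_case_study({"old-machine-3"}, log)
    print(ag1.run(), ag2.run())
    print(log)
```

In this sketch, a failure in one old machine undoes only the activities already executed in AG2, while the new machine configured by AG1 is left untouched, which mirrors the isolation between atomic groups intended in the case study.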

B. Failure Cases & Analysis

In the first operation performed by the change deployer, which is a database service update (Figure 6), suppose that a failure occurs in the MySQL Enterprise Edition installation. The change deployer then invokes the rollback engine, which immediately executes the following actions:

1) Request the database server to roll back the installation: wipe out all installation files related to the MySQL Enterprise Edition, resulting in the previous untouched MySQL server;
2) Request the database server to undo the backup process;
3) Request the database server to start MySQL again.

Note that the backup data activity (Figure 6) does not affect the system for the rollback procedure, meaning that undoing the backup or keeping it makes no difference. The second change plan operation, which has the goal of increasing the access cluster capability, has activities marked in atomic groups as presented in Figure 6. Supposing a situation where the configuration of the first old machine fails (activity with id number 3), the change deployer identifies the failure inside the BPEL constructs and invokes the rollback engine, which then orders the atomic group to roll back by remotely invoking rollback operations on all 10 old machines. The rollback requests happen in parallel, since the change plan activities were also originally expressed in parallel. Supposing that both the first old machine and the new machine installation/configuration fail, the change plan will not succeed at all and the final result will not be sufficient to allow the release of the new service. Despite that, the failed change does not lead to the unavailability of the other services previously available, because the access cluster is still up. The same occurs if the new machine installation/configuration fails but the performance configuration of the old machines succeeds: the access cluster is still up, but ELT was not increased enough to release the new service.
Finally, supposing that the new machine installation/configuration succeeds, the result of the performance options configuration on the old machines will not affect the new service release, since the new machine alone can increase ELT to 45%. In terms of scalability, the prototype generated the BPEL document in less than 1 second for the described case. The deployment execution time, on the other hand, always depends on the actions of the activities and on how the change plan was created.

VI. CONCLUSION AND FUTURE WORK

In this paper we have discussed how organizations implement their changes, and the importance of having a change management system able to identify failures and roll back the managed system. Most organizations have a complex IT infrastructure where changes are deployed by humans, increasing the failure probability. For this reason, we have proposed a solution to express atomic activities in a change plan, informing the system which activities must roll back to a previous consistent state in case of failure. Also, in our solution, there are three ways to express atomicity: at the RFC, operation, and activity levels. This makes the system more specialized, allowing not only the operator to express atomicity at a low level, but also the change manager to define which actions of a given RFC must roll back. The obtained results proved to be coherent in the sense that the system guarantees what the change manager or operator expressed. The use of atomic groups showed that activities can be handled as a single transaction without affecting the activities of other atomic groups. Moreover, the case study showed that the rollback solution helps change managers/operators to define atomicity in a more consistent way. Since our main objective was to provide rollback support in a change plan, making it possible to return to a previous consistent state through a rollback engine, we have not focused on expressing compensation activities or on predicting atomicity restrictions.
In future work, we intend to improve the rollback model to express and perform compensation activities, which can be very useful to achieve the RFC goal even with failures at different levels (i.e., failures in rollback actions), and to predict atomicity restrictions due to rollback planner misuse.

REFERENCES

[1] ITIL, "Information Technology Infrastructure Library (ITIL)," Office of Government Commerce (OGC), 2006. [Online]. Available: http://www.itil.co.uk/
[2] IT Infrastructure Library, "ITIL Service Support Version 2.3." Office of Government Commerce, 2000.
[3] A. Keller, J. L. Hellerstein, J. L. Wolf, K.-L. Wu, and V. Krishnan, "The CHAMPS system: Change management with planning and scheduling," in 9th IEEE/IFIP Network Operations and Management Symposium (NOMS 2004), Seoul, Korea, April 2004, pp. 395-408.
[4] C. Bartolini, J. Sauvé, and D. Trastour, "IT service management driven by business objectives - an application to incident management," in 11th IEEE/IFIP Network Operations and Management Symposium (NOMS 2006), Vancouver, Canada, April 2006, pp. 45-55.
[5] R. Rebouças, J. Sauvé, A. Moura, C. Bartolini, and D. Trastour, "A decision support tool to optimize scheduling of IT changes," in 10th IFIP/IEEE International Symposium on Integrated Network Management (IM 2007), Munich, Germany, May 2007, pp. 343-352.
[6] OASIS Standards, "Business process execution language version 2.0," Apr. 2007, http://docs.oasis-open.org/wsbpel/2.0/.
[7] R. Enns, "NETCONF Configuration Protocol," RFC 4741, Internet Engineering Task Force, Dec. 2006.
[8] R. S. Alves, L. Z. Granville, M. J. B. Almeida, and L. M. R. Tarouco, "A Protocol for Atomic Deployment of Management Policies in QoS-Enabled Networks," in 6th IEEE International Workshop on IP Operations and Management (IPOM 2006), ser. Lecture Notes in Computer Science, G. Parr, D. Malone, and M. Ó Foghlú, Eds., vol. 4268. Springer, 2006, pp. 132-143.
[9] A. Andrzejak, U. Hermann, and A. Sahai, "FEEDBACKFLOW-An Adaptive Workflow Generator for Systems Management," in 2nd International Conference on Autonomic Computing (ICAC 2006). IEEE Computer Society, 2005, pp. 335-336.
[10] The Workflow Management Coalition Specification, "Workflow Process Definition Interface - XML Process Definition Language." [Online]. Available: http://www.wfmc.org/standards/docs/TC-1025_10_xpdl_102502.pdf
[11] The GGF CDDLM working group, "Configuration Description, Deployment, and Lifecycle Management." [Online]. Available: https://forge.gridforum.org/projects/cddlm-wg
[12] W3C Note, "Web Services Description Language 1.1 (WSDL)." [Online]. Available: http://www.w3.org/TR/wsdl
[13] Active Endpoints, "ActiveBPEL Open Source Engine," http://www.activebpel.org.
[14] G. Candea, A. B. Brown, A. Fox, and D. A. Patterson, "Recovery-oriented computing: Building multitier dependability," IEEE Computer, vol. 37, no. 11, pp. 60-67, 2004.