Automatic Merging of Specification Documents in a Parallel Development Environment


Rickard Böttcher
Linus Karnland

Department of Computer Science
Lund University, Faculty of Engineering

December 16, 2008

Contact information

Authors:
Rickard Böttcher
Linus Karnland

Supervisors:
Martin Höst, Lund University, Faculty of Engineering
Niklas Leopold, Cybercom Group Sweden South AB

Abstract

Version control is today an important concept in the field of software development. One of its key features is the ability to have several developers working concurrently on the same file and then to merge the different versions automatically. Rational ClearCase is one of today's major version control tools and is used at the department of Process Management at Cybercom Group Sweden South AB. ClearCase supports automatic merging of source code; merging of specification documents, however, must be performed manually. This master's thesis investigates the possibility of automatically merging specification documents as well. We identify the main problems with manual merging and propose a solution that performs the merging automatically with as little user interaction as possible. The solution is based on a three-way merge-algorithm which identifies the differences between two specification documents and uses the information obtained to perform the merge. We also present a prototype that implements the merge-algorithm and is integrated into ClearCase.


Contents

1 Introduction
  1.1 Background
  1.2 Research questions
  1.3 Methodology
2 Theoretical background
  2.1 Software configuration management
  2.2 The need for merging in software development
  2.3 Merging techniques
  2.4 Merge conflicts
  2.5 Requirements engineering
3 The current use and structure of specifications at Process Management
  3.1 Specification structure
  3.2 Current merging techniques
  3.3 Problem identification
4 Solution
  4.1 General approach
  4.2 The diff-algorithm
  4.3 Merging two edited versions of a file
  4.4 Comparison with the GNU diff3-algorithm
  4.5 Summary
  4.6 Discussion
5 Developed merge-tool
  5.1 General description
  5.2 Tool goals
  5.3 Modules
  5.4 Future work
6 Evaluation
7 Conclusion

A Algorithms
  A.1 The diff-algorithm
  A.2 The merge-algorithm

Chapter 1

Introduction

In this chapter we present a short background of the main problems this master's thesis addresses. In section 1.2 we provide a list of questions the thesis set out to answer, and in section 1.3 the methods used are presented and discussed.

1.1 Background

Productivity is a critical factor in the field of software development. To stay competitive and not lose customers or market share, it is important to keep the development process as efficient as possible and to minimize the time developers have to spend on things other than developing software. Over time, the software development process has evolved, resulting in generally larger development teams and more complex software. In order to maintain a high level of productivity, software configuration management (SCM) systems are used. SCM consists of many principles and practices, but the area this thesis focuses on is the ability to manage and organize the various types of artifacts used throughout the software development process, and especially the support for parallel development.

Parallel development is the practice of several users developing the same software at the same time without interfering with each other. SCM-systems can achieve this with the use of different concurrency schemes. One of the most widely used schemes, called optimistic concurrency, allows two or more users to modify the same artifact at the same time. When a user so chooses, the SCM-system will incorporate the changes from all the contributors into a combined artifact. To be able to do this, the SCM-system must be able to identify and isolate the changes made by each contributor, and it must also be able to combine changes made by different developers. Since the SCM-system can be used to control and manage many different types of artifacts, these abilities need to be extended to all types of artifacts for the system to function efficiently. In practice this means that if the SCM-system is used to store and manage Microsoft Word documents, the system must be able to read and parse these types of files, as well as identify differences and combine them. Most SCM-systems today can perform these operations on a standard set of file types, the most common one being the plain text file.

At Cybercom Group Sweden South AB, the department of Process Management (PM) uses the SCM tool Rational ClearCase to maintain active projects. In one of the major projects at PM, an automated workflow system is being developed for a customer in the telecom business. Each part of the system is described in detail in a Microsoft Excel document known as a specification (or requirements) document. These documents provide detailed information about the requirements of the system and are used in the implementation process. As with the source code, the specification documents are continuously updated in a parallel development environment and remain under version control in ClearCase. As opposed to source code, however, there is no support in ClearCase for automatically integrating changes made to a specification document. This merging process is today performed manually, which is both time consuming and error prone.

The aim of this master's thesis is to investigate the possibility of increasing productivity by automating the process of merging specification documents. The main goal is to develop a prototype, integrated into ClearCase, that automatically performs the merging in a fast, efficient and reliable way.

1.2 Research questions

The questions this master's thesis aims to answer are as follows:

1. How are specifications managed and merged at the department of Process Management today?
2. What are the problems with the way specifications are managed and merged today?
3. What techniques can help improve the way specifications are merged?
4. How can a developed tool solve the identified problems?
5. How can the specifications be changed to further improve the developed tool?

1.3 Methodology

To start off the thesis, a literature study was carried out to get a good picture of the theoretical background of the merging techniques currently available, and also to gain an understanding of the basic concepts of software requirements and the requirements engineering process. Most of the information was gathered from research papers and technical literature, and the result of the study is presented in chapter 2.

The general purpose of the thesis is of an exploratory and problem-solving nature, and thus a case study [1] of the current system was carried out in combination with the development of a prototype tool. This methodology was chosen because it gives a deep understanding of the specific system studied. The drawback is that the conclusions from the study may not be generally applicable to other cases. From the perspective of this thesis this is not an issue, since the problem to solve is not generic. If the results need to be generally applicable, more case studies of similar systems may need to be carried out.

The techniques used to carry out the case study involved interviews and observation of the system. The interviews were of an informal and open nature, meaning that we let the person interviewed describe the current system and the problems with it. The intention of these interviews was to get a good picture of how work is being done in the system and what the most common problems are from a typical user's point of view. Alongside the interviews, the system properties and functionality were observed by exploring the system and trying it out for ourselves.

Figure 1.1: Evolutionary prototype development process

One of the goals of the thesis was to develop a prototype tool which can help answer the research questions. The tool was developed with an evolutionary approach [1][2]. This approach was chosen because it does not require a full system specification before any work can begin; instead the system and its requirements evolve over time as the system is being developed. The approach starts with a simple system which implements the most important user requirements. These requirements are updated and new ones are added as the system evolves. There is no detailed system specification and there may not be a formal requirements document [2]. The work was initiated based on an idea of the desired tool functionality, but it was not predetermined how it would be implemented or what the resulting tool would look like. An initial specification of the tool was developed and presented to the department management, and based on this the implementation process began. During the development phase, continuous feedback from the department managers and new ideas resulted in changes to the initial specification and system design, some of which are discussed in chapter 5.

Evolutionary prototyping differs in verification and validation from normal specification-based development approaches. Since verification is the process of assuring that a program conforms to its specification, it cannot be applied to evolutionary prototyping, as there are no formal specifications in this approach. Validation, on the other hand, is the process of determining if the program is suitable for its intended purpose, and in the end this will be up to the end-users to decide.

In order to assess our own work during the thesis an evaluation was needed, the main purpose of which was to measure whether the tool has improved the system performance or not. The method used to carry out the evaluation involved simulating how the system is used, with a real set of data as input. To measure the performance, the time it takes to manually merge a specification was compared with the time it takes to merge the same specification using the developed prototype. Another part of the evaluation was to ensure that the developed prototype produces output of high quality, meaning that it does not introduce any errors into the specifications. The actual evaluation is presented and analyzed in chapter 6.

With this type of evaluation some validity threats can be identified. The fact that we have conducted the evaluation ourselves can be seen as a threat to the validity. For example, the measured performance of the tool can be affected by the fact that we have more knowledge about the tool than any typical user, or by the fact that we do not have as good knowledge about the system as a typical user.

Chapter 2

Theoretical background

In this chapter we introduce and describe the major concepts used in this master's thesis. We begin by describing the need for software configuration management in software development and some of the main problems it addresses. We then introduce the concept of software merging and the different techniques used. In the last part we describe the basic concepts of software requirements and the requirements engineering process.

2.1 Software configuration management

Babich [3] defines software configuration management as the art of identifying, organizing, and controlling modifications to the software being built by a programming team, the goal being to maximize productivity by minimizing mistakes. He also identifies three typical problems that arise when a group of people do collaborative work:

- The shared data problem
- The double maintenance problem
- The simultaneous update problem

The shared data problem occurs as a result of multiple programmers modifying a single shared file. Changes made by one programmer will unavoidably interfere with the progress of others. For example, if one programmer introduces a fault into the software, all the other programmers will have to stop their work until the fault is corrected. The solution to this problem is to isolate the individual programmers in workspaces where they each can work without interference, and where each programmer can decide when to take in changes from the others into his own workspace.

The double maintenance problem arises as a result of the solution to the shared data problem. Because of the introduced workspaces there will exist several copies of the same software artifact, and when one of them changes all of them need to be updated. For example, if a fault is found by one programmer, it needs to be fixed in all of the copies. In time these copies will inevitably diverge and no longer be identical; one will have a fault or feature that the others do not, and the maintenance for this scenario will soon grow out of hand.

Because of the shared data problem, a software team cannot work with a single copy of the software artifact that everybody shares and changes. But the opposite, where everybody has a copy of the artifact, is not desired either because of the double maintenance problem. A solution is to divide the artifacts into modules; when someone wants to change a module, they make a copy of it and change the copy. When they are done with the changes they copy the module back to the shared copy. This strategy has a problem which arises if two users copy the same module, make parallel changes individually, and then one of them copies their changes back to the shared version. When the second developer is done with his changes and copies them back to the shared version, they will overwrite the changes made by the first developer. This is the simultaneous update problem: two people simultaneously update the same module. To cope with this problem you need to integrate the changes made by the first developer into your copy before copying your changes to the shared version. Doing this manually can be time-consuming, and therefore tools are needed to take care of this merge of changes automatically.

2.2 The need for merging in software development

In early SCM systems there was no need for merging support, since when a file was checked out from the repository it was locked by the user so that no one else could edit it at the same time. As a result, the system allowed concurrent development, but only on different files. As software development evolved over time, with increased product complexity, larger development teams and reduced time between releases, the need for a more efficient method than the locking scheme became apparent [4]. The solution was to allow developers to create and work on personal copies of the software artifacts. This made it possible for several developers to work on the same artifact at the same time. At some later point the personal copies are integrated, or merged, into a single shared version again. Several different merging techniques exist to make the merging process as efficient as possible and to reduce the number of merge conflicts that might occur. All of today's major SCM systems, such as CVS and ClearCase, support this type of concurrent development and use merging to achieve it.

2.3 Merging techniques

Merging techniques can be divided into different categories based on their functionality or their underlying technologies.

2.3.1 Functionality-based categories

One categorization based on functionality is textual and object merging [8]. Textual merging treats all files as plain text files and only applies pure textual changes to the documents. This makes the technique very flexible, since almost all types of files can be considered a piece of text. The major disadvantage with textual merging is that it is only able to detect basic conflicts, since it does not understand the syntax or semantics of the file. Object merging uses objects to represent the data and thus supports general types of files.

Another categorization is syntactic and semantic merging [8]. These types of merging are more powerful than a textual merge, since they take the syntax and/or semantics of a file into account and thus only issue a conflict when the syntax or semantics of the file is not correct. A problem with these techniques is that they cannot be applied when there is no clear syntax or semantics of the artifact that should be merged. Another problem is that if the syntax or semantics of the artifact changes, the merge-algorithm needs to be updated as well.

2.3.2 Categories based on underlying technologies

There are essentially three different merging techniques based on the underlying technology. The most basic one is the so-called two-way merge, which compares two different versions of an artifact and is therefore able to detect differences between them. Two-way merging is the basis of three-way merging, which is able to automatically resolve some conflicts. With three-way merging, the two versions to be merged are compared with the common base version from which they both originated. By doing this, the three-way merge is able to tell if information has been added, deleted or changed in the two edited versions. If, for example, one line is present in one of the edited versions but not in the other, it is impossible in a two-way merge to tell if this line was removed from one version or added in the other. But if the two versions are compared to a common base-file where the line is not present, then we know that the line was added in one of the edited versions. Thus we can resolve the conflict automatically by simply adding the line to the merged version. This would not be possible in a two-way merge.

The above techniques are examples of state-based merging. In contrast, change-based merging tracks all individual changes to the edited versions. Operation-based merging is a special kind of change-based merging which models the changes as operations. These operations usually correspond to the operations issued in the application used to develop the software artifact.

In section 4.2 we present an algorithm that performs the comparison made in a three-way merge. The algorithm finds the longest common substring of the edited version and its predecessor, and uses this information to determine what kinds of changes have taken place. The longest common substring is a special case of the more widely known longest common subsequence (LCS) problem. The LCS of two artifacts can be defined as the maximum number of identical symbols found in both artifacts while preserving the symbol order [5]. We can define the longest common substring as the LCS with the constraint that the matching symbols have to be consecutive in both artifacts. In our implementation we use a modified version of the longest common substring algorithm found in [6].
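Since the longest common substring is central to the diff-algorithm presented later, a minimal Python sketch of the classic dynamic-programming solution is given below for illustration. This is our own standalone version, not the modified algorithm from [6] that the tool actually uses.

    def longest_common_substring(a, b):
        """Return (start_a, start_b, length) of the longest run of
        consecutive elements that occurs in both sequences a and b."""
        # table[i][j] = length of the common substring ending at a[i-1] and b[j-1]
        table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best_len, best_i, best_j = 0, 0, 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                    if table[i][j] > best_len:
                        best_len, best_i, best_j = table[i][j], i, j
        return best_i - best_len, best_j - best_len, best_len

For example, longest_common_substring(list("ABCD"), list("EBCFD")) returns (1, 1, 2): the run [B, C] starts at index 1 in both sequences.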

2.4 Merge conflicts

A merge conflict occurs when changes cannot be integrated without user interaction. Merge conflicts can be divided into two categories: true conflicts and false conflicts. True conflicts are contradictory changes in the edited versions. If, for example, two developers make different changes to the same line in a textual merge scenario, there is no way to automatically integrate the changes. The developers have to manually decide which version to use, or combine the two versions into one. A false conflict, on the contrary, occurs when a merge conflict is announced even though the changes are not contradictory in nature, but due to limitations in the merge algorithm they cannot be integrated [7].

2.4.1 Conflict resolution techniques

The most basic conflict resolution technique is to let the user resolve the conflict manually, which often is a time-consuming process. Therefore a fully automatic strategy is desired to save time. One such simple strategy is to give the developers different priorities; if a conflict occurs, the change made by the developer with the highest priority is chosen. Another possibility is to have an interactive process where the developer interacts with the merge-tool to resolve the conflict when it cannot be resolved automatically [8].

2.5 Requirements engineering

A software requirement is a description of some functionality or property of the system that is being developed. It might for example describe a certain feature that should be included in the system or determine constraints regarding the overall memory usage. A requirement can be written at different levels of detail, and there are also different classifications used to specify its type and scope. The process of determining, analyzing and managing the requirements for a system is called requirements engineering and is described further in section 2.5.3. Requirements are often described in plain text files using natural language, but they can also be specified using figures, pseudo-code implementation or mathematical models. There are basically four different notations that can be used when writing requirements:

- Structured natural language: The requirements are written in plain text and natural language using a standard form or template. Using natural language makes it possible to write requirements that are expressive and easy to understand while still having a predefined structure.

- Design description language: A programming-like language is used to define an operational model. This notation is often useful for interface implementation. If a system under development should be able to interact with already existing systems, the interfaces of the existing ones need to be clearly specified. One way of doing this is to define the functionality in pseudo-code.

- Graphical notation: The requirements are defined using graphical descriptions with natural language as a supplement. Sequence diagrams and use cases are the most commonly used techniques in graphical notation today.

- Mathematical notation: Mathematical specifications are based on mathematical concepts such as sets or state machines. They usually have a strict syntax, which has the advantage of not being as ambiguous as natural language; however, the requirements can be hard to understand if the user is not familiar with the notation.

2.5.1 Software requirement levels

Software requirements can be written at different levels of detail depending on the kind of user they are intended for. A high-level abstract requirement is called a user requirement, and a more detailed description of the system functionality is called a system requirement.

User requirements are often written in natural language and should be easy to understand for users without specific technical knowledge. They describe what features and functionality should be supported, and under what constraints the system should be able to operate. The targets of user requirements are mainly users who are not interested in how the system is implemented, for example managers and end-users. Their main concern is the external behaviour of the system, not technical issues like the programming language used for implementation or what internal data structures to use. The main advantages of user requirements are that they are easy to understand and provide a good overview of the system. There are however some problems associated with the use of abstract descriptions and natural language:

- Ambiguity: The use of natural language may result in ambiguous requirements that can be interpreted in several different ways.

- Confusion: It might be hard to distinguish whether a certain requirement is functional, non-functional, or of some other classification.

- Amalgamation: One requirement may actually be a combination of several different requirements.

System requirements are more detailed versions of the user requirements. They are mainly used in the system design process and specify how the user requirements should be provided by the system. They should however still only specify the external behaviour of the system and not be implementation specific. Natural language is often used in system requirements as well, but to avoid the problems associated with user requirements it should be well structured and possibly combined with graphical or mathematical notation.

2.5.2 Software requirement classifications

A requirement can be classified as a functional, non-functional or domain requirement depending on what is specified.

Functional requirements

The purpose of functional requirements is to describe what the system should do, for example the services it should provide or how it should react to certain user input. Functional requirements can be written both as abstract user requirements and as more detailed and specific system requirements. In the ideal case, they should be complete in the sense that they cover all parts of the system and describe all system functionality. They should also be consistent, meaning that two individual requirements should not define any contradictory behaviour. However, as systems grow in size and complexity, it is very difficult to keep the functional requirements complete and consistent. One reason is that different stakeholders may have different needs, which might not be compatible.

Non-functional requirements

As indicated by the name, non-functional requirements do not describe specific system functions but are more concerned with the system's properties and constraints. They might for example specify system response times, usability or security. Since non-functional requirements often specify properties of the entire system, they are generally more critical than functional requirements. A system might operate quite well even if certain functionality is missing, but if a critical reliability requirement is not met the entire system might fail. There are basically three types of non-functional requirements:

- Product requirements: These are requirements that specify product behaviour such as portability, performance and reliability.

- Organizational requirements: This type of requirement specifies policies and procedures used by the organizations involved in the development process. It might for example specify the development method, delivery dates and budget constraints.

- External requirements: This category covers all requirements not concerning the system itself or the organization around it. These could be requirements on interoperability with other systems, or legislative considerations to ensure that no laws will be violated when using the system.

Domain requirements

Domain requirements are, unlike functional and non-functional requirements, not derived from desired functionality or properties of the system, but from the application domain. They are specialized and technical requirements that describe the characteristics and constraints of the specific domain. A domain requirement might for example specify that a certain technical standard must be adhered to when implementing a database, or define a mathematical formula that must be used in a specific calculation.

2.5.3 The requirements engineering process

The objective of the requirements engineering process is to discover, define and maintain all requirements associated with the system that should be developed. This process generally results in a requirements document that is used in the implementation of the system. The process consists of several steps, as illustrated by figure 2.1.

Feasibility study

The first step in the process is to conduct a feasibility study and summarize the result in a report. This study should investigate the incentives to build the system and conclude whether it is desirable to continue with the development process. A good strategy is to define a number of questions beforehand which are answered in the report. The following are a few examples of such questions:

1. Will the use of the system lead to higher productivity within the organization?
2. Is it possible to develop the system using existing techniques and within the specified budget?
3. How will the organization handle the situation if the system is not implemented?

Information to use in the study can be obtained from many different sources, such as developers, management representatives and end-users. This information should be used to answer the pre-defined questions and provide recommendations on whether the system should be developed or not. Suggestions may also be provided regarding issues like budget and schedule adjustments, or the introduction of additional high-level requirements.

Requirements elicitation and analysis

When the decision has been made to continue with the development, requirements on the system and application domain need to be collected and analyzed. This process can be divided into four steps:

1. Requirements from all the different stakeholders, such as customers, end-users and management, are collected.

2. The requirements are sorted and classified correctly in order to provide structure and organization.

3. Any existing conflicts are resolved. There will often be conflicting requirements when different stakeholders are involved. In a system function that measures distance, the customer might for example want to use yards as the distance unit while a majority of the end-users prefer meters. These types of conflicts must be resolved by negotiation before the process can continue. In this step requirements should also be prioritized in order of importance and need.

4. The requirements are documented and used to provide an early model of the system and an initial draft of the requirements document, as seen in figure 2.1.

Requirements elicitation and analysis is an iterative process, and the result of each of the four activities described above should be used as input to the other activities in the following iteration.

Requirements specification and validation

In the last step of the requirements elicitation and analysis process the requirements are documented. This step is closely related to the requirements specification step in figure 2.1, where the requirements are specified in a more formal way using standards or templates. These specifications result in user and system requirements that can be incorporated into the requirements document.

It is important that the requirements define the system that the customer wants. In the validation process, the main goal is to find errors, conflicts and inconsistencies in the requirements to make sure that the system is specified correctly. If errors exist in the requirements document, it is usually much more expensive to correct them during the development process or after the system has been deployed than during the requirements engineering process. Different types of checks should be performed in order to ensure that the requirements document is correct. Some properties that should be checked include:

- Consistency: There should not be any conflicting requirements.

- Completeness: All functionality should be defined in the requirements document, along with the constraints on the system.

- Verifiability: It should be possible to verify that a specific requirement has been met. This might be done either manually or by automated tests.

Figure 2.1: The requirements engineering process

The end result of the requirements engineering process is a requirements document that specifies the entire functionality of the system and its properties.

2.5.4 Requirements management

In the ideal case the requirements document will not have to be revised or changed during the development process. However, in large projects this is rarely true. There might be several reasons for this: the system might not be entirely specified prior to implementation, or stakeholders might change their minds about certain features or system properties. When end-users start using the system they might discover that they need additional functionality or want other changes to the system. The process of managing and updating requirements during development is called requirements management.

A standardized process should be used when making changes to requirements, in order to update the requirements document in a controlled and consistent way. This process should consist of three steps:

1. The problem with the original requirement is identified and the proposed change is analyzed to ensure that it is valid.

2. The system is analyzed in order to determine the cost of implementing the change. The cost is usually measured in the amount of work needed to integrate the changes into the requirements document, system design and implementation. When an estimated cost has been calculated, a decision must be made whether to implement the changes or whether the cost is too great.

3. If the decision was made to continue with the changes, they are integrated into the requirements document and, if necessary, the system design and implementation.

Requirement changes should always be handled in the order described above. It might be tempting to change the system design and implementation before updating the requirements document, but this might result in the requirements document not being up to date. When the changes have already been made to the system, it is easy to update the requirements document in a negligent way, or to forget to update it at all, resulting in an erroneous document [2].

Chapter 3

The current use and structure of specifications at Process Management

In this chapter we describe how specifications are used at the department of Process Management. Section 3.1 gives a brief overview of the structure of the specifications. Section 3.2 takes a look at what techniques are used today to merge the specifications. In section 3.3 we identify the main problems with the way specifications are managed and merged today.

3.1 Specification structure

In the major projects at PM, certain kinds of technical specifications are used as an aid in the software development process. These specifications are used in order to clarify the requirements from the customer and provide a low-level specification document that can be used for implementation, both manual and automatic. Specifications at PM are mainly used for specifying workflow processes and are written in Excel documents with several sheets, each describing a different part of the system. All specification documents contain a sheet called Revision History which contains a record of all changes that have been made to that process. For every update in the file a row is added containing the version of the revision, the date it was updated, who made the update and a description of the changes.

Which other sheets are defined in a document may vary from specification to specification, but the structure of these sheets is often well defined. The general structure is that a row defines an attribute, a mapping or a rule in the system. However, there are some sheets where the information is organized in columns instead of rows. In this case, if a new attribute, mapping or rule should be added to the sheet, a new column is created and the information inserted, instead of inserting a row as in the other case. An example of this can be seen in figure 3.1.

Figure 3.1: Example of a specification where the information is organized in columns as opposed to rows

In some sheets several rows together describe a part of the system, and the rows in the group depend on each other, meaning that if one of the rows in the group is changed or a row is added, it will affect the behavior of all the rows in the group. This can be seen in figure 3.2, where each gray line defines a new group of data. If one of the lines in a group is altered, the other lines are also likely to be altered.

The specifications are written mostly as low-level system requirements, meaning that they are more implementation specific than the more abstract user requirements (as described in section 2.5.1). They are generally written as functional requirements that specify the behavior and structure of the system, and also how it should be implemented. The main notation used when writing the requirements is a programming-like design description language. This type of language gives a clear and unambiguous specification of the system and can in some cases be automatically translated into source code. Some parts of the system are however written using a more natural language, and in some cases graphical notation such as figures and flow charts is also used to clarify or complement the written requirements.

As described in section 2.5.4, requirements management is an important part of software development. The requirements management and change request process at PM follows the described steps very closely:

1. The customer has a request for a change, which might be due to a fault or the desire to add some new functionality.

2. The change is analyzed in terms of what needs to be changed in the system and the cost of implementing the changes.

3. The customer either accepts the costs, and the implementation can begin, or the cost is rejected and further discussions are held until either the change is dismissed or an agreement on the cost can be made.

Figure 3.2: Example of a specification where multiple rows are grouped together

After the change request and the associated costs are approved, the implementation can begin. The first step is the design phase. In this step the specifications of the system are updated so that the desired changes are incorporated into the system. The updated specifications are then reviewed in the design review phase, and these two steps are repeated until the proposed design is accepted. After this, the actual implementation can begin, based on the updated specifications. During the implementation, different types of tests are continuously performed to get feedback that helps determine whether the specifications need to be further updated.

3.2 Current merging techniques

The current merging technique relies heavily on updates to specifications being well documented. As described in section 3.1, each specification contains a sheet called Revision History that keeps a record of the changes made to that specification. When a change is made, it is also documented via color codes. If something is to be removed from the specification, it is colored red instead of actually being removed from the document. If something has been added it is colored green, and if something has been changed in some way, the old line remains in the specification colored red and the altered line is added below it in green.

This can be seen as a rudimentary form of version control where a record of the differences between two versions is kept manually instead of letting the version control system take care of it, and this is actually what it has been used for. When these documents were initially created and used they were not stored under any version control tool, so this was the only way to keep track of the changes made to the documents.

The actual merging is done manually and involves bringing up the two documents that should be merged, manually looking for the differences between them, and incorporating the changes into the merged document. This process relies on the color coding being done correctly, as otherwise it is very easy to miss something. When the two sheets are merged, all red lines are deleted and all green lines are set to white. This process can be seen in figure 3.3, which depicts two developers concurrently working with the same specification. Each developer works against their own copy of the specification, adding and removing data and applying the correct color coding. At some later point in time these two concurrently edited specifications are merged and their respective changes are incorporated into one version. After all changes have been incorporated, the red lines are removed and all other color coding is removed.

Figure 3.3: Color-coding and merge process
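For illustration, the final clean-up step could in principle be scripted. The sketch below, using the openpyxl library, deletes rows whose first cell is filled red and clears the fill of green rows. The file name, sheet name and ARGB colour values are assumptions of ours; real specifications may use Excel theme colours, which openpyxl reports differently, and the relevant colour may sit in any cell of the row.

    from openpyxl import load_workbook
    from openpyxl.styles import PatternFill

    RED = "FFFF0000"     # assumed ARGB values for the red/green markup
    GREEN = "FF00FF00"

    wb = load_workbook("merged_spec.xlsx")   # hypothetical file name
    ws = wb["Attributes"]                    # hypothetical sheet name

    red_rows = []
    for row in ws.iter_rows():
        colour = row[0].fill.start_color.rgb
        if colour == RED:
            red_rows.append(row[0].row)      # remember deleted lines
        elif colour == GREEN:
            for cell in row:
                cell.fill = PatternFill()    # reset added lines to no fill

    # Delete bottom-up so the remaining row indices stay valid.
    for idx in reversed(red_rows):
        ws.delete_rows(idx)

    wb.save("merged_spec_clean.xlsx")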

3.3 Problem identification

Today the specifications are stored in ClearCase, but this is not utilized to its full extent; ClearCase is merely used as a way of storing the documents. It is not possible to use the version control system to identify the differences between two specification documents, or to merge them. In order to fully benefit from a configuration management system and a version control tool, these capabilities are required. When entire branches are merged today, much of the source code merging is automatically taken care of by the version control tool, but since there is no support for merging the specifications, this has to be done manually.

As described in section 3.2, the manual merging relies on each individual change being properly color coded and described in the Revision History sheet. If this is not the case, it is very likely that a change will be lost and not incorporated into the merged version. Even if a change is correctly documented, it is possible that something will be missed in the manual merging, since there is no easy way to verify that all the changes have been incorporated into the merged version. Because the manual merging process is inherently error prone, great care must be taken so that no changes are lost, which makes the process very time consuming. Some of the specifications may also be very large, which adds further to the time required to merge.

The identified problems with the way specifications are managed and merged at PM today can be summarized as follows:

- No integration in the version control system
- Time consuming
- Error prone

These problems were given as input at the start of the thesis work and had already been identified by the people working with the system. In chapter 4 we propose a solution for making the current system of specification handling and merging more efficient, focusing mainly on these problems.

Chapter 4

Solution

To solve the problems listed in section 3.3, we have developed a tool that automatically merges two versions of an Excel document. The tool is integrated into the version control tool ClearCase to make the merging process as easy and efficient as possible. In this chapter we focus on the underlying technology and algorithms used in the developed merge-tool. We also compare our merge-algorithm with the GNU diff3-tool, and list some features and drawbacks of our solution in the summary and discussion.

4.1 General approach

Our solution is based on a three-way, textual merge and basically consists of two major parts: one diff-algorithm and one merge-algorithm. The diff-algorithm finds the differences between an edited version and the ancestral (base) version of a file and is run once for each of the two edited versions. The merge-algorithm then uses the information obtained in order to merge the two edited versions with as little manual interaction as possible. The entire merging process consists of some smaller parts as well, and the whole procedure can be divided into the following steps, which are performed in sequential order:

1. Selecting the edited versions to merge in ClearCase and performing a checkout on them
2. Finding and registering the changes made in the edited versions
3. Merging the two edited versions
4. Writing the merged version to a new file
5. Putting the merged file back under version control

The first step has been integrated into ClearCase, the version control system used at PM. The edited versions to be merged are selected manually, and then ClearCase automatically finds their common ancestor and calls the tool that we have developed. In the second step the diff is performed on the versions that have been chosen. Section 4.2 describes in detail how this algorithm works and how it determines what changes have been made. The actual merging of the two edited versions is performed in the third step, which is described in section 4.3. When the merging is complete, the merged version is written to a new file which is automatically put under version control in ClearCase.

4.2 The diff-algorithm

In order to perform a three-way merge, the differences between each edited version and the common base version must first be determined. We have developed an algorithm that performs a comparison between two artifacts and presents the differences found. The artifacts have to be comparable in the sense that they consist of a number of individual elements, each with a specific position in the artifact. If for example the artifact is a sheet in an Excel document, a row in that sheet is an element (or if a row is an artifact, a cell is an element). By having a unique index for each element it is possible to compare the element at position m in one artifact with the element at position n in another one, which is necessary for the diff-algorithm to work.

The algorithm works by comparing blocks of data (elements) from the two artifacts and uses the longest common substring-algorithm to find the longest match of consecutive elements that can be found in both blocks. It then divides each block into two smaller blocks, with the first one containing the data prior to the matching symbols and the second one containing the data following them. The algorithm is then repeated in a recursive manner on the new, smaller blocks until one of the following base cases is encountered:

A. The first block is empty but the second one contains data.
B. The first block contains data but the second one is empty.
C. Both blocks contain data but the longest common substring-algorithm found no matching symbols.
D. Both blocks are empty.

An example of this recursion and the base cases can be seen in figure 4.1. The algorithm is initially performed with the entire common base version as the first block and one of the edited versions as the second block. After recursively calling the algorithm on blocks of decreasing size, eventually one of the cases stated above will be encountered. It is then possible to draw some conclusions on what kind of change has been made to the edited version regarding the two sub-blocks that generated the base case. In appendix A.1 the algorithm is presented in pseudocode.
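As a complement to the pseudocode in appendix A.1, the following is a simplified Python sketch of the recursion, under the assumption that an artifact is simply a list of comparable elements. It records each encountered base case as a tuple (our own representation) and reuses the longest_common_substring function sketched in section 2.3.

    def diff(base, edited, offset_base=0, offset_edit=0, changes=None):
        """Recursively compare a base block with an edited block and
        record the base cases A, B and C described above."""
        if changes is None:
            changes = []
        if not base and not edited:       # case D: nothing left to compare
            return changes
        if not base:                      # case A: data added in the edited version
            changes.append(("A", offset_base, offset_edit, edited))
            return changes
        if not edited:                    # case B: data removed from the base version
            changes.append(("B", offset_base, offset_edit, base))
            return changes
        i, j, length = longest_common_substring(base, edited)
        if length == 0:                   # case C: no common elements in the blocks
            changes.append(("C", offset_base, offset_edit, (base, edited)))
            return changes
        # Recurse on the sub-blocks before and after the matching run.
        diff(base[:i], edited[:j], offset_base, offset_edit, changes)
        diff(base[i + length:], edited[j + length:],
             offset_base + i + length, offset_edit + j + length, changes)
        return changes

For example, diff(list("ABCD"), list("EBCFD")) records case C for the pair ([A], [E]) and case A for the added element F, which is the same example that is worked through by hand in figure 4.3 below.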

Figure 4.1: Illustration of when the base cases occur in the recursive algorithm

4.2.1 Ability to find differences

The kinds of changes that may have been made in the edited version of a block of data can be divided into five categories:

1. Data has been added to the edited version.
2. Data has been removed from the ancestral version.
3. Data has been changed in the edited version.
4. Data has been changed in combination with added data.
5. Data has been changed in combination with removed data.

It is also possible to have a combination of the fourth and fifth case; however, we interpret this as either case four or five depending on the number of added and removed elements. If there are more added elements than removed, we interpret it as the fourth case, otherwise the fifth.

Depending on the base case encountered, it is possible to determine the correct category of the changes that have been made to the current blocks. However, there is not a one-to-one mapping between the base cases and the possible categories, so it is sometimes necessary to further analyze the blocks in order to determine the correct category. This is done by comparing the number of elements in the ancestral block and the edited one and then choosing a category based on the result. The mapping between the base cases and the categories can be described as follows:

Case A always corresponds to the first category. If the ancestral block is empty while the edited block contains data, that data was not present in the ancestral version of the file and has thus been added in the edited version. Since the ancestral block is empty, the number of elements that have been added is the same as the number of elements in the edited block.

Case B has a one-to-one mapping with the second category. The principle is the same as with base case A: if the ancestral block contains data but the edited block is empty, then the data in the ancestral block has been removed in the edited version.

Case C corresponds to either category three, four or five. If the number of elements is the same in both blocks, then the mapping is to category three. Since no matching elements were found, and the total number of elements is the same in both blocks, we reach the conclusion that all elements in the ancestral block have been changed in the edited version. (It is also possible that elements in the ancestral block were removed in the edited version and new elements were added to replace them. We have chosen to include this scenario in the definition of changed elements, since there is no practical difference.) If there are more elements in the edited block than in the ancestral one, some elements must have been added in the edited version. Since no match has been found between the blocks, the rest of the elements have been changed. In this case the corresponding category is the fourth one. On the other hand, if there are more elements in the ancestral block, then some elements have been removed in the edited version. This indicates that the correct category is the fifth one.

Case D does not correspond to any category of changes; it is only present to keep the algorithm from recursing when there is no more data to process.

Figure 4.2: Mapping between the base cases in the recursive algorithm and the possible changes
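The mapping just described is compact enough to state directly in code. A sketch follows (the category labels are ours):

    def categorize(case, n_base, n_edit):
        """Map a base case to a change category, given the number of
        elements in the ancestral (n_base) and edited (n_edit) blocks."""
        if case == "A":
            return "added"                    # category 1
        if case == "B":
            return "removed"                  # category 2
        if case == "C":
            if n_edit == n_base:
                return "changed"              # category 3
            if n_edit > n_base:
                return "changed + added"      # category 4
            return "changed + removed"        # category 5
        return None                           # case D: no change to record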

Figure 4.3 shows a small example of how the diff-algorithm works. Two small vectors with elements represented as characters are compared. In the second version the element A has been changed to E, and the element F has been added between elements C and D. In the first step (1), the longest common substring [B, C] is found (marked as LCS). Each vector is then divided into two new vectors, with the first one containing all elements prior to the longest common substring and the second one containing all elements following the substring (2). A search for the longest common substring is performed again on each pair of the smaller vectors. With the first pair, base case C is encountered, since no common elements can be found and they are of the same length. From this we can draw the conclusion that element A in the first vector has been changed to E in the second one. The vectors in the second pair both contain the element D, so they are again divided into new, smaller vectors (3). The first pair corresponds to base case A and the second one to base case D, so we now know that element F has been added to the second vector. Since each element has a unique position, we also know where the changes took place. When the diff-algorithm has finished we can conclude that in the second vector the element at the first position has been changed from A to E and that a new element F has been added at the fourth position.

Figure 4.3: An example of how the diff-algorithm works

4.2.2 Problems and drawbacks

In the first two base cases and corresponding categories we have all the information needed to determine what has happened to each element in the ancestral and edited blocks. Either all elements in the edited block have been added to the edited version (case A/category one), or all elements in the ancestral block have been removed in the edited version (case B/category two). In case C, when the number of elements is the same in both blocks, we also receive enough information to determine that all elements have been changed in the edited version. If, however, there are more elements in the edited block than in the ancestral one, we do not have enough information to determine what has happened to each individual element. If there are five elements in the ancestral block and seven elements in the edited one, we know that two elements have been added and that the remaining five elements have been changed, but we cannot tell which two elements are the added ones. We have chosen to interpret this as a special case where all elements in the ancestral block have been removed, and then all elements in the edited block have been added. All elements in the edited block are also marked as special, and this flag is used in the merging-algorithm described in section 4.3.

The same problem appears in the case where there are more elements in the ancestral block than in the edited one (base case C mapped to category five). We know that some elements have been removed, but cannot tell which. As in the previous case, we interpret all elements in the ancestral block as removed and all elements in the edited block as added, and mark all elements in the edited block as special.

Another problem with the diff-algorithm arises if the longest common substring-algorithm finds two identical, matching sequences of elements in the edited block that correspond to only one sequence in the ancestral block. In this case it is ambiguous which of the two matching sequences will be selected. If the wrong match is selected, the result of the diff-algorithm will not be correct. The risk of this happening is very low unless the number of changes is very high in relation to the total number of elements and there is a lot of duplicated data in the files. The specification documents on which the algorithm is used are structured in a way that makes this error occur very rarely.

If an element is moved from one position in the ancestral version to another one in the edited version, the diff-algorithm will not be able to detect this.

Instead, the element will be interpreted as removed from the ancestral version and then added at another position in the edited one. If the files did not allow duplicate elements, it would be possible to search for a supposedly removed element to determine if it has really been deleted or only moved to a different position in the file. However, if duplicates are allowed this cannot be done, since there might be several identical elements in the file, and there is no way to determine which element is the moved one.

4.3 Merging two edited versions of a file

When the differences between each edited version and their common ancestor have been found, the information obtained can be used to enhance the performance of the merging algorithm. Metadata has been added to the edited files with information about:

- which elements have been added to the edited version
- which elements in the ancestral version have been removed
- which elements have been changed
- which blocks of elements consist of a combination of added and changed elements, or deleted and changed elements, so called special-blocks

This information is used to further manipulate the edited versions in order for the merge-algorithm to be able to consolidate the two edited versions into a merged one.

4.3.1 Pre-processing

The merge-algorithm works by comparing the element at each position in one of the edited versions with the element at the same position in the other one. This means that in order for the algorithm to produce a correct result, the two versions must be of the same size and the elements have to be positioned correctly. By pre-processing the edited versions before merging them, these two criteria can be met. The pre-processing consists of three steps, where the data in the edited versions is manipulated based on the information about:

1. deleted elements
2. added elements
3. elements marked as special

In the first step, so called dummy-elements are inserted into the files to represent the elements that have been removed from the ancestral version. For the merge-algorithm to work, it is essential that these deleted-dummies are inserted at the right positions in the files.

The calculations made to determine where to place the deleted-dummies are based on their original positions in the ancestral version and the number (and positions) of the elements that have been added in the edited version. When this has been done, dummy-elements that represent added elements, so called added-dummies, are inserted into the files. If an element at position i in one edited version has been added, then an added-dummy is inserted at the same position in the other version (pushing all following elements down one step). However, if the element at the same position in the other version is also marked as added, then no added-dummy is inserted. By inserting added- and deleted-dummies we ensure that both edited versions of the file will have the same number of elements, and that corresponding elements are at the same position in both versions. Figure 4.4 describes the first two pre-processing steps performed on three small vectors (ancestor, edited1 and edited2) where each element is represented as a character. In one edited version the element B has been deleted, and in the other version a new element F has been inserted between elements C and D. After the pre-processing steps described above, the two edited versions can be merged by comparing each element in edited1 with the element at the same position in edited2. How merging decisions are made based on these comparisons is described in section 4.3.2.

The last pre-processing step is performed once for each of the two edited versions. If an element X in one of the edited versions is marked as special, we continue to mark each following element as special in that version until we encounter an element that has not been altered in either of the two edited versions. This is then repeated for each element preceding X as well. This procedure of expanding blocks of special-marked elements in both directions ensures that each element marked as special corresponds to another special-marked element in the other edited version. This is important since it is necessary that each element in one edited version corresponds to exactly one element in the other edited version. If we did not perform this last pre-processing step, a block of special-marked elements of size m in one edited version might correspond to a block of a different size n in the other version. Then we would not have a one-to-one mapping of each element, which is necessary for the merging-algorithm to work properly.

Figure 4.5 shows a small example of how this expansion of the special-blocks is performed. In the first edited version, element B has been replaced by W and X. In the other one, C and D have been changed to Y and Z. After the diff-algorithm, the two elements W and X in the first edited version have been marked as special (underlined in figure 4.5). The first two steps of the pre-processing add the added- and deleted-dummies to both versions, and the added-dummies corresponding to W and X are also marked as special. Each element following X in the two edited versions is now marked as special until an element is found that has not been altered in either of the versions, in this case E. Then each element preceding W is marked as special in the same way. After this, the pre-processing is finished and it is now possible to merge the two versions.
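The first two pre-processing steps can be sketched as follows. In the real tool the index bookkeeping is derived from the diff output; in this simplified illustration the positions are supplied by hand, the placeholder markers are our own notation, and an ancestor [A, B, C, D, E] is assumed for the figure 4.4 example.

    DEL = "<del-dummy>"   # placeholder for an element deleted from the ancestor
    ADD = "<add-dummy>"   # placeholder for an element added in the other version

    def preprocess(edited, own_deletions, other_additions):
        """Simplified sketch of pre-processing steps 1 and 2 for one version.
        own_deletions:   positions of this version's own deleted elements
        other_additions: positions where only the other version added elements
        """
        out = list(edited)
        for i in sorted(own_deletions):      # step 1: re-insert deletions as dummies
            out.insert(i, DEL)
        for i in sorted(other_additions):    # step 2: make room for the other side
            out.insert(i, ADD)
        return out

    # Figure 4.4 example: edited1 deleted B, edited2 inserted F between C and D.
    v1 = preprocess(["A", "C", "D", "E"], own_deletions=[1], other_additions=[3])
    v2 = preprocess(["A", "B", "C", "F", "D", "E"], own_deletions=[], other_additions=[])
    # v1 == ["A", "<del-dummy>", "C", "<add-dummy>", "D", "E"]
    # v2 == ["A", "B", "C", "F", "D", "E"]  -> equal length, aligned positions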

Figure 4.4: The first two steps of pre-processing

Figure 4.5: The third step of pre-processing, expanding special-blocks
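To make the first two pre-processing steps concrete, the following minimal Java sketch reproduces the scenario of figure 4.4. The Elem and Kind types, and the idea of letting each element carry its ancestral index, are illustrative constructions for this presentation only; the prototype operates on rows and cells rather than characters, and its actual classes look different.

    import java.util.ArrayList;
    import java.util.List;

    public class PreprocessSketch {

        enum Kind { REGULAR, ADDED, ADD_DUMMY, DEL_DUMMY }

        static final class Elem {
            final String value; final Kind kind; final int ancestorIndex; // -1 for added elements
            Elem(String value, Kind kind, int ancestorIndex) {
                this.value = value; this.kind = kind; this.ancestorIndex = ancestorIndex;
            }
            public String toString() {
                if (kind == Kind.DEL_DUMMY) return "(del)";
                if (kind == Kind.ADD_DUMMY) return "(add)";
                return value;
            }
        }

        // Step 1: insert a deleted-dummy wherever an edited version skips over
        // an ancestral position, i.e. where an ancestor element was deleted.
        static List<Elem> insertDeleteDummies(List<Elem> version, int ancestorLength) {
            List<Elem> out = new ArrayList<>();
            int expected = 0;                          // next ancestral index we expect to see
            for (Elem e : version) {
                if (e.ancestorIndex >= 0) {
                    while (expected < e.ancestorIndex) // skipped ancestor elements were deleted
                        out.add(new Elem("", Kind.DEL_DUMMY, expected++));
                    expected = e.ancestorIndex + 1;
                }
                out.add(e);
            }
            while (expected < ancestorLength)          // deletions at the end of the file
                out.add(new Elem("", Kind.DEL_DUMMY, expected++));
            return out;
        }

        // Step 2: pad each version with an added-dummy opposite every element that
        // was added only in the other version. If both versions added an element
        // at the same position, no dummy is inserted.
        static void insertAddDummies(List<Elem> v1, List<Elem> v2) {
            for (int i = 0; i < Math.max(v1.size(), v2.size()); i++) {
                boolean a = i < v1.size() && v1.get(i).kind == Kind.ADDED;
                boolean b = i < v2.size() && v2.get(i).kind == Kind.ADDED;
                if (a && !b) v2.add(i, new Elem("", Kind.ADD_DUMMY, -1));
                else if (b && !a) v1.add(i, new Elem("", Kind.ADD_DUMMY, -1));
            }
        }

        public static void main(String[] args) {
            // Ancestor: A B C D E. Edited1 deleted B; edited2 added F after C.
            List<Elem> v1 = new ArrayList<>(List.of(
                reg("A", 0), reg("C", 2), reg("D", 3), reg("E", 4)));
            List<Elem> v2 = new ArrayList<>(List.of(
                reg("A", 0), reg("B", 1), reg("C", 2),
                new Elem("F", Kind.ADDED, -1), reg("D", 3), reg("E", 4)));

            v1 = insertDeleteDummies(v1, 5);
            v2 = insertDeleteDummies(v2, 5);
            insertAddDummies(v1, v2);
            System.out.println(v1); // [A, (del), C, (add), D, E]
            System.out.println(v2); // [A, B, C, F, D, E]
        }

        static Elem reg(String value, int ancestorIndex) {
            return new Elem(value, Kind.REGULAR, ancestorIndex);
        }
    }

After the two steps both vectors have six aligned positions, which is exactly the invariant the merge-algorithm in section 4.3.2 relies on.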

4.3.2 The merge-algorithm

When the pre-processing of the two edited versions has been done, they both contain the same number of elements and each element is positioned correctly. This makes it possible to move through the edited versions and compare the two elements at each position. This is the basic strategy of the merge-algorithm, and the result of each comparison will in turn determine how the two elements are merged. The merged version of the file is initially empty, and elements are gradually added to it as the merge-algorithm moves through the edited versions. When two elements are compared, there are four possible outcomes:

1. The element in the first edited version is added to the merged file.
2. The element in the second edited version is added to the merged file.
3. Neither of the elements is added.
4. Neither of the elements is added, but the corresponding position in the merged file is marked as a conflict between the elements in the two edited versions.

Table 4.1 describes which of these four cases occurs, based on the elements that are compared. On top is the type of element in the first edited version, and on the left-hand side the type of element in the second one. The possible element types and the corresponding decisions are:

- Both elements in the edited versions are unchanged (regular). In this case the element in the first edited version is added to the merged file. (Since both elements are unchanged and originate from the same element in the ancestral version, they are identical, so it does not matter which one we choose to add. To be consistent, whenever the choice is arbitrary we always choose the element in the first version.)

- One element is regular and the other one has been changed. The changed element is added to the merged file.

- One element is added and the other one is an added-dummy. In this case the added element is chosen.

- One element is a deleted-dummy and the other one is either regular or also a deleted-dummy. This means that the element has been deleted in one (or both) of the two versions and consequently should be deleted in the merged file as well. This is accomplished by not adding any element to the merged file at all.

- One element is a deleted-dummy and the other one has been changed. This case generates a conflict, since it is not possible to automatically make a correct decision. The element position in the merged file is marked as a conflict, which must be resolved manually by either adding the changed element or not adding any element at all (i.e. deleting the element in the merged version).

- Both elements have been added, or both have been changed. Some additional evaluation must be performed in this case in order to make a decision. This process is described in table 4.2. If the elements are identical, i.e. the same element has been added or the exact same changes have been made in both edited versions, then the element in the first version is added to the merged file. If different elements have been added or conflicting changes have been made, a conflict is generated.

- A block of elements is marked as special in both versions. This situation also needs some additional evaluation, according to table 4.3. The first two cases mean that changes have occurred only in one of the two edited versions, and consequently the block of elements that contains changes is added to the merged file. In the third and fourth case, all changes made in one of the versions have also been made in the other one (possibly along with some additional changes). Since no conflicting changes have been made, we add the block of elements that contains the most changes to the merged file. If none of these four cases occur, we can draw the conclusion that conflicting changes have been made, so we issue a conflict on the entire block of elements in the merged file.

All of the other pairs of element types, marked as - in table 4.1, are combinations that cannot occur in the merge-algorithm. This is a result of the manipulation performed on the two edited versions in the pre-processing step described in section 4.3.1.

If no conflicts were generated during the work performed by the merge-algorithm, the merging is complete and the merged version has incorporated all changes made to the edited versions of the file. If there are conflicts, those must be taken care of manually by choosing which change to incorporate.

Table 4.1: Decision matrix used in the merge-algorithm (columns: type of element in the first edited version; rows: type of element in the second)

                 regular      changed      added        add-dummy    del-dummy    special
    regular      version 1    version 1    -            -            do nothing   -
    changed      version 2    process A    -            -            conflict     -
    added        -            -            process A    version 2    -            -
    add-dummy    -            -            version 1    -            -            -
    del-dummy    do nothing   conflict     -            -            do nothing   -
    special      -            -            -            -            -            process B

Table 4.2: Process A

    Case                                                 Action
    Element in version 1 equals element in version 2     version 1
    Else                                                 conflict

Table 4.3: Process B

    Case                                                              Action
    All non-dummy elements in version 1 are regular                   version 2
    All non-dummy elements in version 2 are regular                   version 1
    Special-block in version 1 exists as a sub-block in version 2     version 2
    Special-block in version 2 exists as a sub-block in version 1     version 1
    Else                                                              conflict

4.4 Comparison with the GNU diff3-algorithm

Diff3 is a utility developed by the Free Software Foundation, Inc. as a part of the GNU project. It is originally a tool for comparing three different files, but it can also be run as a merging tool, performing a three-way merge on the given files [9]. Our solution is based on the same three-way merging principles as diff3; however, there are some differences in the merging decisions made based on the differences between the versions. We have chosen to automatically solve some special kinds of conflicts which diff3 does not.

The first case is when the same element has been added at the same position in the two edited versions. Diff3 does not evaluate the added elements to check if they are identical and always produces a conflict that has to be solved manually. Our algorithm evaluates the added elements, and if they are identical, adds the element to the merged file without generating a conflict. A similar situation occurs when an element in the ancestral version has been changed in both the edited versions. If the changes made are identical, then our algorithm chooses to integrate that change in the merged file, whereas diff3 generates a conflict.

The third case where the two algorithms differ is the situation where a block of changed, added or deleted elements in one of the edited versions exists in its entirety in the other one. Diff3 will generate a conflict, while our algorithm will choose the block of elements containing all the changed elements in the other version. Suppose the elements [B, C] in the ancestral version have been changed to [Y, Z] in the first edited version and [X, Y, Z] in the second one. The changed elements in the first edited version ([Y, Z]) exist as a continuous sub-block in the second version. In this case, our algorithm selects the second version and the resulting merged file becomes [X, Y, Z]. If the changed elements in the first edited version had been [Y, X] instead, then our algorithm would have generated a conflict: the second version contains both elements, but not in the same order.

In the above three cases, diff3 probably generates conflicts just to be on the safe side. There might be scenarios where diff3 is used in which it is not desirable to automatically merge the edited versions as we do. However, the types and structure of the specification documents that our merge-algorithm will be used on allow us to merge automatically in these situations without the risk of making the wrong decision.
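The decision rules of tables 4.1-4.3, including the sub-block rule discussed above, can be condensed into a short sketch. As before, Elem and Kind are illustrative stand-ins rather than the prototype's actual classes, here extended with CHANGED, a special flag and a conflict marker; a conflict is reduced to a single marker element, whereas the prototype records the full set of conflicting elements for manual resolution.

    import java.util.ArrayList;
    import java.util.List;

    public class MergeSketch {

        enum Kind { REGULAR, CHANGED, ADDED, ADD_DUMMY, DEL_DUMMY, CONFLICT }

        static final class Elem {
            final String value; final Kind kind; final boolean special;
            Elem(String value, Kind kind, boolean special) {
                this.value = value; this.kind = kind; this.special = special;
            }
            static final Elem CONFLICT_MARK = new Elem("<conflict>", Kind.CONFLICT, false);
            public String toString() { return value; }
        }

        static List<Elem> merge(List<Elem> v1, List<Elem> v2) {
            List<Elem> merged = new ArrayList<>();
            int i = 0;
            while (i < v1.size()) {
                Elem a = v1.get(i), b = v2.get(i);
                if (a.special) {                        // special-blocks pair up after pre-processing
                    int j = i;
                    while (j < v1.size() && v1.get(j).special) j++;
                    merged.addAll(processB(v1.subList(i, j), v2.subList(i, j)));
                    i = j;
                    continue;
                }
                // element-by-element cases of table 4.1:
                if (a.kind == Kind.DEL_DUMMY || b.kind == Kind.DEL_DUMMY) {
                    if (a.kind == Kind.CHANGED || b.kind == Kind.CHANGED)
                        merged.add(Elem.CONFLICT_MARK); // deleted vs. changed: manual decision
                    // otherwise the element was deleted: add nothing
                } else if (a.kind == Kind.ADD_DUMMY) {
                    merged.add(b);                      // element added only in version 2
                } else if (b.kind == Kind.ADD_DUMMY) {
                    merged.add(a);                      // element added only in version 1
                } else if (a.kind == b.kind && a.kind != Kind.REGULAR) {
                    // both added or both changed: process A (table 4.2)
                    merged.add(a.value.equals(b.value) ? a : Elem.CONFLICT_MARK);
                } else if (b.kind == Kind.CHANGED) {
                    merged.add(b);                      // a changed element wins over a regular one
                } else {
                    merged.add(a);                      // a is changed, or both are regular
                }
                i++;
            }
            return merged;
        }

        // Process B (table 4.3) on two corresponding special-blocks. Simplified:
        // sub-block containment is tested on the blocks with dummies stripped.
        static List<Elem> processB(List<Elem> b1, List<Elem> b2) {
            List<Elem> c1 = strip(b1), c2 = strip(b2);
            if (allRegular(c1)) return c2;              // all changes are in version 2
            if (allRegular(c2)) return c1;
            if (isSubBlock(c1, c2)) return c2;          // version 2 contains version 1's changes
            if (isSubBlock(c2, c1)) return c1;
            return List.of(Elem.CONFLICT_MARK);         // conflicting block edits
        }

        static List<Elem> strip(List<Elem> block) {
            List<Elem> out = new ArrayList<>();
            for (Elem e : block)
                if (e.kind != Kind.ADD_DUMMY && e.kind != Kind.DEL_DUMMY) out.add(e);
            return out;
        }

        static boolean allRegular(List<Elem> block) {
            for (Elem e : block) if (e.kind != Kind.REGULAR) return false;
            return true;
        }

        static boolean isSubBlock(List<Elem> small, List<Elem> big) {
            for (int s = 0; s + small.size() <= big.size(); s++) {
                boolean match = true;
                for (int k = 0; k < small.size(); k++)
                    if (!big.get(s + k).value.equals(small.get(k).value)) { match = false; break; }
                if (match) return true;
            }
            return false;
        }
    }

On the aligned vectors of figure 4.4 this loop produces [A, C, F, D, E], and on the [Y, Z] versus [X, Y, Z] example above, process B selects the second version's block, as described.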

4.5 Summary

In order to solve the main problems listed in section 3.3, we have developed a tool that automates the merging process as far as possible and is integrated into ClearCase. This chapter has described the underlying technology of our solution and the major algorithms that have been implemented in the developed prototype. The solution is based on a three-way textual merge and is generic in the sense that it can be applied to any artifact that consists of comparable elements. In order for the elements to be comparable, they must have a specified position in the artifact which is unique for each element. Regarding the specifications that the developed prototype is used on, an artifact can be an Excel document, an Excel sheet or a row in a sheet, and the respective elements are Excel sheets, rows and cells.

The first major part of the solution is the diff-algorithm used to find differences between two versions of an artifact: one base version and one version where elements have been changed, added or deleted. The algorithm is recursive and uses different base cases to determine what kind of change has been made to an element. An element that has been changed in any way is marked with the kind of change that has occurred. If a continuous block of elements in the edited version consists of a combination of added/deleted and changed elements, then each element in that block is marked as special, since it is not possible to determine which elements have been changed and which have been added or deleted. The information gained about modified elements is later used in the merge-algorithm to make it as efficient and automatic as possible.

When the diff has been performed on the base version and each of the two edited versions respectively, the edited versions are manipulated further in a step called pre-processing. So-called added- and deleted-dummies are inserted at the correct positions to represent elements that have been deleted, or added in the other version. The blocks of elements marked as special are also expanded to ensure that there is a one-to-one correspondence between the special-marked elements in the two edited versions. When the pre-processing is done, the edited versions are of the same length and each element is positioned correctly.

The second major part of the merging process is the merge-algorithm. It compares each element in the first edited version with the corresponding element in the second one and then makes a merging decision based on the result of that comparison. All possible decisions and how they are chosen are described in table 4.1. If changes that cannot be automatically merged have been made to an element in both versions, the corresponding element position in the merged version is marked as a conflict. This conflict must later be resolved manually by choosing which change to incorporate. When the merge-algorithm has gone through all elements in the edited versions, the automated part of the merging is complete. If there are no conflicts present there is nothing more to be done and the merged version is put under version control in ClearCase; otherwise the conflicts must first be resolved.

The developed merge-algorithm can handle the basic kinds of changes: added, deleted and changed elements. If an element has been moved, this can

not be detected; the move is instead interpreted as if the element was removed from one position and then added at another. As opposed to diff3, our algorithm can automatically handle some special cases, which are listed in section 4.4. These features make the algorithm a bit less generic, but on the other hand improve its performance by generating fewer conflicts.

The diff-algorithm does not depend on any of the other parts of the merge-process and can be implemented in a standalone application to find the differences between two versions of a file. As a part of the developed prototype described in the next chapter, we have created a graphical user interface that uses only the diff-algorithm to display the changes made in one of the edited versions. The merge-algorithm, however, does not work without first performing a diff and then pre-processing the two edited versions.

4.6 Discussion

As described in section 2.3, there are basically two types of merging techniques based on functionality. In our solution we have chosen to use a textual merging approach with some similarities to object merging. Each element can be thought of as an object with a specific position in the artifact; however, there is only one type of object. It would have been possible to use semantic merging to make the algorithm more efficient on some parts of the specification documents, but then the merge-algorithm would only work properly on these documents. Since the structure of the specification documents may change over time, we chose to go with a more generic merge-algorithm that can be applied to most types of documents and does not need to be revised when structural changes are made.

As the underlying technology, a three-way merge-algorithm is used. A two-way merge-algorithm would not be able to detect what changes have been made in an edited file (since it does not have any information about the previous version), and as a consequence the merging would be far less efficient. For example, if one edited version contains [A, B] and the other contains [A], a two-way comparison cannot tell whether B was added in the first version (and should be kept) or deleted in the second (and should be removed); only the ancestor resolves this. Another possibility would have been to use a change-based merging technique, but since ClearCase does not keep track of the individual changes made to the specifications, this technique could not be used.

In section 3.3 we identify the main problems that we wish to solve. The first problem, that the merging process is not integrated into the version control system, is solved by a ClearCase-plugin. With a simple command in ClearCase the merging is performed, and when finished the resulting merged version is put under version control. The second problem, that of the manual merging being time consuming, is solved by automating the entire merging process as far as possible. There is no longer any need to manually compare the two edited versions and combine them, since this is done automatically and the result is displayed in a graphical user interface. This significantly reduces the time spent on merging: a document that would take 15 to 20 minutes to merge manually now takes only a few minutes, unless the number of conflicts is extremely large (see chapter 6).

Automating the merge-process also reduces the risk of making errors, which is the third main problem that is addressed. When manually trying to find differences between two versions it is easy to miss a change, for example if it has not been marked properly according to the system of color-coding. This might result in the change not being incorporated in the merged version, leading to a faulty specification. This cannot happen in the automated process, since both edited files are compared to the base version and all changes that have been made are found automatically and displayed in the GUI.

In section 3.1 we described how the specifications are written and used today. In order to mark changes and keep track of them, color-coding is used. We suggest that this system of marking changes is abandoned. Instead of marking the row that should be removed in the next merge, we suggest that the row is simply removed from the document. With the new tool we have developed it is easy to automatically identify differences between two documents, and thus the color-coding becomes superfluous and only adds extra confusion and work. We also recommend that a uniform structure is used for the specifications: in some of the sheets the information is grouped in columns instead of rows, which can lead to some unwanted behavior in the merge-algorithm (see section 5.4), so our recommendation is to use the structure where the information is grouped in rows.

Chapter 5

Developed merge-tool

One of the main goals of the master thesis was to develop a prototype tool which could present the differences between specifications and also be able to merge them. Different aspects of the tool are presented and discussed in this chapter. In section 5.1 a brief overview of the implementation of the tool is given. The main goals used during the development of the tool are presented in section 5.2. In section 5.3 the implementation and design of the tool are presented and discussed, and in section 5.4 we discuss how the tool can be further improved.

5.1 General description

The specifications are written in Excel documents and thus the tool needed to be able to read these. For this purpose we chose an open source Excel Application Programming Interface (API) [10], which was the most mature API available. It has support for reading, creating, modifying and writing Excel documents. Initially there were some ideas about writing a parser for Excel documents ourselves, although this was abandoned early in the work due to the time needed to implement such a feature. Because the chosen API is open source, we also had the ability to extend it in the event this was needed. The API is developed in Java, and thus the rest of the tool was implemented in Java as well. To create the graphical elements of the tool we chose the Eclipse Foundation's Standard Widget Toolkit (SWT) because of its native look and feel and high portability.

5.2 Tool goals

The main goal of the tool is to help the developers in their process of developing software in a parallel development environment with continuous integration. In order for this to be possible it is important that the developed tool is seamlessly integrated into the software development environment. The requirements on the tool can be divided into the following categories:

Differences: The tool should be able to identify the differences between two documents, both when used as a standalone application and when used internally in the three-way merge.

Merging: As far as possible, the tool should be able to automatically incorporate the changes from two documents into one. If any conflicting changes have been made, the tool will require input from the user.

Conflicts: The tool should be able to identify conflicting changes made in two documents, in order to ensure consistency in the merged documents.

Graphical user interface: In order to present differences and merged documents in an intuitive way, a rich graphical user interface is needed. The interface should be able to present the documents in a way similar to what the users normally see when they are working with them.

Integration into the version control tool: The tool should be integrated into the version control tool to give the developers better support.

5.3 Modules

In figure 5.1 a schematic view of the tool and how the user interacts with it is presented. As can be seen in the figure, the user interacts with ClearCase and, for example, tells ClearCase to compare the current document to the previous version. ClearCase then brings up the files that should be compared and calls our tool with the files as input. In the case of a merge operation it brings up three files, the two edited versions and the base version, and then expects the merged version of the documents as output from our tool. The tool can also be used as a standalone application (without ClearCase); this means that in figure 5.1 the ClearCase block is removed and the user interacts directly with the documents and the tool.

When the tool is executed it expects two or three input files, depending on whether it should perform a merge or identify the differences between them. To read the files the tool has a module called ExcelParser; this module utilizes the Apache Excel-API (see section 5.3.3) to parse the documents and store the information in our internal model. When this is done, the Diff/Merge module can perform its task by working against the model, identifying differences and merging the documents within the model. After this is done, the user is presented with a GUI, which again reads its data from the model. The GUI can also change things in the model; for example, conflicting changes can be resolved by the user, and the model is then updated accordingly. When the user is satisfied with the merged document, the ExcelWriter module is called, which takes the internal representation, the model, and writes it back to an Excel file.
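The division of responsibilities between the modules can be summarized as a small set of Java interfaces. The prototype's actual class names and signatures are not listed in this report, so everything below is an illustrative assumption about the structure, not the real API:

    // A sketch of the module boundaries in figure 5.1; all names are
    // illustrative assumptions.
    final class WorkBook { /* internal model: sheets, rows, cells plus change metadata */ }

    interface ExcelParser {
        WorkBook parse(String path);                   // Excel file -> internal model
    }

    interface DiffMerge {
        void diff(WorkBook base, WorkBook edited);     // annotate 'edited' with change metadata
        WorkBook merge(WorkBook base, WorkBook v1, WorkBook v2); // three-way merge, may hold conflicts
    }

    interface MergeGui {
        void display(WorkBook result);                 // present diff/merge; user resolves conflicts
    }

    interface ExcelWriter {
        void write(WorkBook model, String path);       // internal model -> Excel file
    }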

Figure 5.1: Schematic view of the tool's modules and the interaction with the user

5.3.1 ClearCase-plugin

Since ClearCase is the configuration management tool used in the development process, the developed tool needed to be integrated into this environment to allow automatic merging of specifications when entire branches are merged together. ClearCase works by defining a tag called element-type on each file it has under version control. This tag is used to determine what kind of file format the file in question has; some examples of element-types are text file, compressed file and binary. This information is then used to create a mapping between the element-types and various programs that can, for example, compare two files of the given element-type, or merge them. So by defining a new element-type, specification, mapping this type to our developed diff/merge-tool, and setting the type on all specifications under version control, our tool will be called whenever a file of the defined type should be merged or its differences identified. This means that the integration of the tool into ClearCase is completely transparent, and the user never has to worry about loading the tool with the correct input files et cetera.
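In cleartool terms, creating the element-type and retyping a specification could look roughly as follows. The element-type and file names are hypothetical, and the final step, pointing the new type's type manager at the diff/merge-tool, is done in ClearCase's type-manager map file, whose exact location and format vary with ClearCase version and platform, so it is not shown here:

    # Create a new element type for specification documents (done per VOB)
    cleartool mkeltype -supertype file -c "Excel specification document" specification

    # Retype an existing specification so that our tool handles it
    cleartool chtype specification design_spec.xls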

5.3.2 ExcelParser

We decided early on that we needed an internal data structure in Java to represent the documents. The reason we needed this was to be able to store not only the original information from the specifications but also metadata about the structures in the specifications. This metadata is used by the diff/merge-algorithms to keep track of, for example, changes in the specifications. The classes used to represent the Excel files are fairly straightforward and represent the structures used in Excel, that is WorkBook, Sheet, Row et cetera. As can be seen in figure 5.2, a document is represented by a WorkBook which contains several Sheets. A Sheet in turn contains several Rows and also a CellStyle which contains styling information for the row. The Row contains several Cells which also contain CellStyles that describe styling for individual cells.

A subclass of Row called ConflictBlock is used in the merging phase when conflicting changes have been made. The idea behind this class is that it contains all the conflicting changes (Rows and Cells), and since it is a subclass of Row it can be used in any place where Row is used. When the conflict has been resolved (via the GUI) the ConflictBlock returns only the chosen changes. Both Row and Cell implement an interface called Differentiable which contains a set of functions used in the diff-algorithm. This is described further in section 5.3.4.

The ExcelParser module expects a filename as input and returns an instance of the object WorkBook with all the other classes (Sheet, Row, et cetera) created and associated properly. It is able to do this by utilizing the Apache API to parse the actual Excel file. The API contains several functions for reading out the information from the documents, and these are used to copy the information into the internal structure shown in figure 5.2. In figure 5.1 there is also a module called ExcelWriter; the function of this module is the opposite of the ExcelParser, that is, it takes our internal representation, the model, and via the Apache Excel-API writes it back to an Excel file.

5.3.3 Apache Excel-API

The specifications are written in Excel documents and thus our tool needed to be able to read this format. Initially there were some ideas about writing a parser for Excel files ourselves, although this was quickly discarded because of the time needed to write such a feature. There were also some ideas about letting Excel convert the files into XML format and then letting our program parse the XML files instead; since XML is an open format and there are a lot of APIs that support reading and writing XML files in Java, this was a very realistic approach. The reason this approach was not pursued further was that it would add an extra step (converting the Excel files to XML), and the module converting the XML into our internal structure would be more complex. Another possibility would be to write a macro in Excel which would perform the diff and merge directly from within Excel. This approach was not used for several reasons.

Figure 5.2: UML diagram of the classes used to represent an Excel file

If Excel is changed or upgraded in some way, the tool may need to be revised. With the chosen approach the tool can be used without Excel installed on the system, which would not be possible if the tool had been built into Excel. Another advantage of the way the tool was built is that it can be extended to support other formats, since the diff-algorithm is built in a generic manner.

The API we finally chose has an easy way to parse Excel files and get the information into a well-structured form in Java. It is the most mature Java API available for parsing Excel files, and because it is open source it can be extended. When the Apache Excel-API parses an Excel file, it creates a class structure much like the one in figure 5.2, and thus the conversion from this representation into our own is a simple matter of traversing the class structure and copying all the information. In the parsing phase all information is copied, including styling information for Cells and Rows. However, the API had no support for parsing styling information for entire rows, and since we needed this information to recreate the documents with their original appearance when the merged documents are written back to file, we had to extend the Apache API to support this functionality. To be able to do this, information about the Excel format was

needed, and the OpenOffice organization [11] had the necessary information. The API was extended with this functionality, and thus a custom build of the Apache API is needed for the tool to function correctly.

5.3.4 Diff

In order to implement the diff-algorithm described in section 4.2 and appendix A.1, an interface was introduced into the model described in figure 5.2. The interface, called Differentiable, contains a set of functions which are needed by the diff-algorithm to do its job. This means that any class which implements this interface can be processed by the diff-algorithm and its differences found. The interface contains functions to compare objects of the specified type and a number of flags used to represent the differences found. In our model the classes Row and Cell implement the interface and are the objects on which we perform the diff-algorithm. This means that we can identify not only which rows have been changed, added or removed, but also which individual cells in a row have been changed. If some other type of element needs to be processed for differences in the future, the only thing needed is to implement the Differentiable interface.

To illustrate what the diff-algorithm can do, a screenshot of the actual diff-gui is included in figure 5.3. This is the GUI the user is presented with when comparing two versions of a document. The GUI is made up of three sections:

Top section: In this section information about changes is presented. When a row or cell has been changed, the user can click in it to see the old value of the field, which is presented here.

Middle section: The middle section is where the actual document is presented. It has the same layout as the original document in Excel, with all the correct colors, fonts and sizes. In the left margin of this section different icons are presented depending on what has happened to the row. A red icon with a cross means the row has been deleted; in this case the actual row is also faded out. A green icon with a plus sign means an added row, and a yellow icon with an exclamation mark means the row has been changed. In the case of a changed row, the actual cell(s) that have been changed are marked with a border, and it is possible to click the cell to see the previous value in the top section.

Bottom section: The bottom section contains a tab folder where all the different sheets in the documents are listed. These are also color-coded with an icon. In the diff-gui a green icon means no changes have been made, and a red icon means that the sheet has been modified in some way.
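The actual signatures of the Differentiable interface are not listed in this report, so the following Java sketch of the interface and the recursive row/cell comparison is an assumption about its shape rather than the prototype's real code:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: the method and field names are assumptions.
    interface Differentiable {
        boolean sameContentAs(Differentiable other);  // comparison used by the diff base cases
        void setChangeFlag(Change c);                 // flag recorded by the diff-algorithm
    }

    enum Change { NONE, ADDED, DELETED, CHANGED, SPECIAL }

    final class Cell implements Differentiable {
        String value = "";
        Change flag = Change.NONE;
        public boolean sameContentAs(Differentiable other) {
            return other instanceof Cell && value.equals(((Cell) other).value);
        }
        public void setChangeFlag(Change c) { flag = c; }
    }

    final class Row implements Differentiable {
        final List<Cell> cells = new ArrayList<>();
        Change flag = Change.NONE;
        // A row is unchanged only if every cell is unchanged. When the
        // row-level diff marks a row as CHANGED, the same diff-algorithm
        // is run again over its cells, so the individual changed cells
        // are found as well.
        public boolean sameContentAs(Differentiable other) {
            if (!(other instanceof Row)) return false;
            Row r = (Row) other;
            if (cells.size() != r.cells.size()) return false;
            for (int i = 0; i < cells.size(); i++) {
                if (!cells.get(i).sameContentAs(r.cells.get(i))) return false;
            }
            return true;
        }
        public void setChangeFlag(Change c) { flag = c; }
    }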

Figure 5.3: Screenshot of the diff-gui

5.3.5 Merge

The merge-gui is similar to the diff-gui. As can be seen in figure 5.4, the overall look is the same, with some minor differences. In the top section of the GUI two buttons have been introduced. The first one is not clickable and is marked with a red cross when there are unsolved conflicts in the document. When all conflicts are resolved this changes to a Merge button which, when clicked, writes the merged document to file. Next to this is a button called Show/Hide which can be used when there are conflicts in a document. Once a conflict has been resolved, the rows that were not chosen are faded out in the GUI; to completely remove them from the GUI, so that it is easier to see what the merged document will look like, the Show/Hide button can be pushed to toggle showing or hiding these rows. Next to this button there is a text area which is used, as in the diff-gui, to show old and new values when the user clicks in a changed element.

As in the diff-gui, a graphic representation of the document is presented. Icons in the left margin show whether the row has been added or changed. There are no deleted rows presented in the merge-gui, since these have been automatically removed by the merge-algorithm (unless they are in conflict with some other changes). When conflicting changes have been made to a row, the base version and the two edited versions of the row are all presented in the GUI; they are marked with a red border and a button on each row which is used to select the one the user wants. When the user selects a row, the red border is

removed and the other rows involved in the conflict are faded out. These rows can be completely removed from the GUI by pressing the Show/Hide button. As described in chapter 4, entire blocks can end up in a conflict (as a result of the special case described in section 4.2). In this case the user selects the entire block in the GUI, as can be seen in the example in figure 5.4. At the bottom of the GUI the different sheets are listed as tabs. These are color-coded: a red icon means that the sheet is still in conflict and needs to be taken care of before a merge can be performed, and a green icon means that there are no conflicts in the sheet or that all the conflicts in the sheet have been resolved.

Figure 5.4: Screenshot of the merge-gui

5.4 Future work

Since the specifications are written in Excel documents, the abilities of the developed tool are highly dependent on how well the tool can parse the documents and later recreate them. One area where the tool could be further improved is the ExcelParser/ExcelWriter modules. These modules are responsible for reading the Excel documents and putting them back together. Since Excel documents can contain a wide variety of artifacts, for example pictures, charts, formulas,


More information

MONIKA HEINER.

MONIKA HEINER. LESSON 1 testing, intro 1 / 25 SOFTWARE TESTING - STATE OF THE ART, METHODS, AND LIMITATIONS MONIKA HEINER monika.heiner@b-tu.de http://www.informatik.tu-cottbus.de PRELIMINARIES testing, intro 2 / 25

More information

Software Engineering 2 A practical course in software engineering. Ekkart Kindler

Software Engineering 2 A practical course in software engineering. Ekkart Kindler Software Engineering 2 A practical course in software engineering Quality Management Main Message Planning phase Definition phase Design phase Implem. phase Acceptance phase Mainten. phase 3 1. Overview

More information

Enterprise Architect. User Guide Series. File Based Projects

Enterprise Architect. User Guide Series. File Based Projects Enterprise Architect User Guide Series File Based Projects In Sparx Systems Enterprise Architect, quickly create a new file-based repository as a.eap file (Access database) or.feap file (Firebird repository),

More information

Standard Glossary of Terms used in Software Testing. Version 3.2. Foundation Extension - Usability Terms

Standard Glossary of Terms used in Software Testing. Version 3.2. Foundation Extension - Usability Terms Standard Glossary of Terms used in Software Testing Version 3.2 Foundation Extension - Usability Terms International Software Testing Qualifications Board Copyright Notice This document may be copied in

More information

Semantics, Metadata and Identifying Master Data

Semantics, Metadata and Identifying Master Data Semantics, Metadata and Identifying Master Data A DataFlux White Paper Prepared by: David Loshin, President, Knowledge Integrity, Inc. Once you have determined that your organization can achieve the benefits

More information

Part 5. Verification and Validation

Part 5. Verification and Validation Software Engineering Part 5. Verification and Validation - Verification and Validation - Software Testing Ver. 1.7 This lecture note is based on materials from Ian Sommerville 2006. Anyone can use this

More information

Software Engineering 2 A practical course in software engineering. Ekkart Kindler

Software Engineering 2 A practical course in software engineering. Ekkart Kindler Software Engineering 2 A practical course in software engineering V. Working Together Working together Management Process Models Version Management Systems Collaborative Development Environments 3 Parts

More information

RE Process. Lawrence Chung Department of Computer Science The University of Texas at Dallas

RE Process. Lawrence Chung Department of Computer Science The University of Texas at Dallas 1 RE Process Lawrence Chung Department of Computer Science The University of Texas at Dallas 2 RE Process: What is a Process? Given input, transforms it into output Consist of a set of activities Process

More information

WACC Report. Zeshan Amjad, Rohan Padmanabhan, Rohan Pritchard, & Edward Stow

WACC Report. Zeshan Amjad, Rohan Padmanabhan, Rohan Pritchard, & Edward Stow WACC Report Zeshan Amjad, Rohan Padmanabhan, Rohan Pritchard, & Edward Stow 1 The Product Our compiler passes all of the supplied test cases, and over 60 additional test cases we wrote to cover areas (mostly

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/22891 holds various files of this Leiden University dissertation Author: Gouw, Stijn de Title: Combining monitoring with run-time assertion checking Issue

More information

SOME TYPES AND USES OF DATA MODELS

SOME TYPES AND USES OF DATA MODELS 3 SOME TYPES AND USES OF DATA MODELS CHAPTER OUTLINE 3.1 Different Types of Data Models 23 3.1.1 Physical Data Model 24 3.1.2 Logical Data Model 24 3.1.3 Conceptual Data Model 25 3.1.4 Canonical Data Model

More information

EXAM PREPARATION GUIDE

EXAM PREPARATION GUIDE EXAM PREPARATION GUIDE PECB Certified ISO 21500 Lead Project Manager The objective of the PECB Certified ISO 21500 Lead Project Manager examination is to ensure that the candidate has the knowledge and

More information

WordPress User Interface Expert Review Gabriel White Version 1.0 DRAFT March, 2005

WordPress User Interface Expert Review Gabriel White Version 1.0 DRAFT March, 2005 WordPress User Interface Expert Review Gabriel White Version 1.0 DRAFT March, 2005 WordPress User Interface Expert Review, Gabriel White (v1.0 Draft, March, 2005) 2 Copyright Copyright Gabriel White, 2005.

More information

The IDN Variant TLD Program: Updated Program Plan 23 August 2012

The IDN Variant TLD Program: Updated Program Plan 23 August 2012 The IDN Variant TLD Program: Updated Program Plan 23 August 2012 Table of Contents Project Background... 2 The IDN Variant TLD Program... 2 Revised Program Plan, Projects and Timeline:... 3 Communication

More information

BPMN Working Draft. 1. Introduction

BPMN Working Draft. 1. Introduction 1. Introduction The Business Process Management Initiative (BPMI) has developed a standard Business Process Modeling Notation (BPMN). The primary goal of BPMN is to provide a notation that is readily understandable

More information

THE USE OF PARTNERED USABILITY TESTING TO HELP TO IDENTIFY GAPS IN ONLINE WORK FLOW

THE USE OF PARTNERED USABILITY TESTING TO HELP TO IDENTIFY GAPS IN ONLINE WORK FLOW THE USE OF PARTNERED USABILITY TESTING TO HELP TO IDENTIFY GAPS IN ONLINE WORK FLOW Dianne Davis Fishbone Interactive Gordon Tait Department of Surgery, University of Toronto Cindy Bruce-Barrett Strategic

More information

Software Requirements and the Requirements Engineering Process. Chapters 5 and 6

Software Requirements and the Requirements Engineering Process. Chapters 5 and 6 Software Requirements and the Requirements Engineering Process Chapters 5 and 6 References Software Engineering. Ian Sommerville. 6th edition. Pearson. Code Complete. Steve McConnell. (CC) The art of triage.

More information

QA Best Practices: A training that cultivates skills for delivering quality systems

QA Best Practices: A training that cultivates skills for delivering quality systems QA Best Practices: A training that cultivates skills for delivering quality systems Dixie Neilson QA Supervisor Lynn Worm QA Supervisor Maheen Imam QA Analyst Information Technology for Minnesota Government

More information