Evaluating the Evolution of a C Application

Elizabeth Burd, Malcolm Munro
Liz.Burd@dur.ac.uk
The Centre for Software Maintenance
University of Durham
South Road
Durham, DH1 3LE, UK

Abstract

This paper describes a case study in which versions of software are used to track the actual changes made to a software application. The analysis of 30 sequential versions of the gcc C compiler is described. The results, where possible, are highlighted using graphical representations of the change process. The discussion of results aims to identify some of the reasons for the specific change features identified. The overall objective of the approach is to gain a more detailed understanding of how and where change takes place. In future this will allow the change processes to be characterised, so that eventually a number of metrics can be identified. These metrics will ultimately be used to assess the future maintainability of software applications.

Introduction

Despite its cost implications, software maintenance is generally perceived as having a low profile within the software community. Management has in the past often placed little emphasis on maintenance-related activities. Only small advances have been made in combating these problems, although high-profile maintenance tasks such as the year 2000 problem have been successful at highlighting the issues.

Software maintenance is made difficult by the age of the software requiring maintenance. The age of the software means that documentation has often been lost or is out of date. Furthermore, staff turnover and constant demands for change, due to user enhancements or environmental changes, exacerbate the problems. Often the code is the only true and accurate description of the functionality of the software. Unfortunately, constant corrective maintenance that is not supported by structural improvements tends to make the software more difficult to maintain in the future.
This paper describes some of the current work being conducted within the Centre for Software Maintenance at the University of Durham. Specifically, the paper describes an ongoing study into the evolution of C programs. The objectives of this work are to gain a deeper understanding of how applications evolve and the reasons for their doing so. Furthermore, the outcome of this study is expected to lead to a more detailed understanding of different maintenance approaches, and to some form of qualitative review that will highlight the most beneficial types of maintenance approach for supporting software evolution.

Approach

In order to evaluate the results of the process of software change, a number of case studies have been carried out. The case studies have involved the analysis of a number of large software applications. This paper describes one of these studies, in which the GNU compiler gcc is examined. In total 30 sequential versions of the software have been examined, totalling over 9 million lines of C.

In order to examine the evolution process, the changes carried out to the software over time are investigated by analysing different versions of the same software. Specifically, our current work concentrates on gaining an understanding of the process of evolution in a number of main areas. These are:

- General change features within the code, such as the size of changes, the time taken to make the modification and the number of source code files involved within a change.
- Changes in the calling of procedures (including additions, deletions and movement of procedures within the call structure).
- Changes in data usage (including additions, deletions and movement of data items across procedures).

The C code is analysed using an in-house C analyser developed by Kinloch [Kinloch93]. The analyser produces a fine-grained intermediate representation of programs written in C, from which program call graphs and flow-sensitive data-flow, definition-use and control-dependence views can be constructed. The approach extends the work of Harrold and Malloy [Harrold91] by dealing with constructs such as pointer and structure variables and value-returning functions with pass-by-value and pointer parameters. Kinloch refers to the analyser as CCG, the Combined C Graph. The analysis of the gcc application, which is currently ongoing, will eventually assess changes in each of the above representations. At present, however, we have performed only a detailed high-level analysis investigating changes to files and changes to the calling structure within the files.

Comparisons between versions are made in a number of ways, the most basic of which involves the use of the UNIX utility diff. Diff is particularly useful for identifying the actual changes that were made between versions. It is thus possible to ensure that a change involves a modification of the source code rather than changes within the comments or file headers.

A more advanced approach to comparison between versions is to use the dominator metric and to investigate how the value of this metric changes over the lifetime of the software. Work on dominance trees has been carried out by the Department of Informatica e Sistemistica at the University of Naples [Cilitile97], the Fraunhofer Institute for Experimental Software Engineering [Girard97] and the Centre for Software Maintenance [Burd96a, Burd96b].
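The diff-based check described above, which separates genuine source changes from edits to comments or file headers, can be sketched as follows. This is only an illustration and not the tooling used in the study: it uses Python's difflib in place of the UNIX diff utility, and the function names and the regular expression are assumptions introduced here. The comment stripping is deliberately simplistic and would be confused by comment markers inside string literals.

```python
import difflib
import re

# Simplistic C comment pattern (assumption: no comment markers inside strings).
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def code_lines(source):
    """Return the non-blank lines of C source with comments removed."""
    stripped = _COMMENT.sub("", source)
    return [line.strip() for line in stripped.splitlines() if line.strip()]


def source_changes(old, new):
    """Return added ('+ ') and removed ('- ') code lines between two versions."""
    delta = difflib.ndiff(code_lines(old), code_lines(new))
    return [line for line in delta if line.startswith(("+ ", "- "))]


old = "/* release 1 */\nint f(void) { return 1; }\n"
comment_only = "/* release 2: header updated */\nint f(void) { return 1; }\n"
code_change = "/* release 2 */\nint f(void) { return 2; }\n"

print(source_changes(old, comment_only))  # no code lines differ
print(source_changes(old, code_change))   # one line removed, one added
```

A file would then count as changed in a version only when `source_changes` returns a non-empty list.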
The dominance tree approach is primarily used as a means of providing an abstraction of, in this case, the call relations within source code. The dominance trees essentially represent dependencies between code modules in the form of a tree. The dominance relations are defined in the following way:

- In a call directed acyclic graph (CDAG), a node px dominates a node py if and only if every path from the initial node x of the graph to py passes through px.
- In a CDAG, a node px directly dominates a node py if and only if px dominates py and all the other nodes that dominate py also dominate px.
- In a CDAG, there is a relation of strong direct dominance between the nodes px and py if and only if px directly dominates py and px is the only node that calls py.

The expression of dominance through the strong and direct dominance relations provides an indication of the complexity of the calling structure of the code. The greater its complexity, the harder the software is to understand and therefore the harder it is to change. Therefore, the higher the proportion of direct dominance relations (the more complex kind), the harder the software is to maintain. By tracking the process of evolution using these relations, it is possible to gain an understanding of the changes in dominance complexity that are occurring within the code over time.

Dominance trees can also be used to express potential objects within the code [Burd96b]. For instance, each subgraph within the dominance tree can be considered as a potential object. Dominance trees are often used in this way to give an indication of how source code can be restructured. Thus, through the examination of the trees, an indication of the modular nature of the code can be obtained. Changes in the modularisation of the code, such as the overall number of objects and their composition, also provide important information regarding the changing nature of software.
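The dominance relations defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analyser used in the study: the call graph is given as a dictionary from caller to callees, the function names in the example are hypothetical, and dominators are computed with the standard iterative set-intersection method. A relation is labelled "strong" when it is a strong direct dominance and "direct" when the direct dominator is not the only caller.

```python
from collections import defaultdict


def reachable(calls, root):
    """Return the set of functions reachable from root in the call graph."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(calls.get(node, ()))
    return seen


def classify_dominance(calls, root):
    """Map each non-root function to (direct dominator, 'strong' or 'direct')."""
    nodes = reachable(calls, root)
    callers = defaultdict(set)
    for caller in nodes:
        for callee in calls.get(caller, ()):
            callers[callee].add(caller)

    # dom[n] is the set of nodes that dominate n (including n itself).
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            new = set.intersection(*(dom[c] for c in callers[n])) | {n}
            if new != dom[n]:
                dom[n], changed = new, True

    relations = {}
    for n in nodes - {root}:
        strict = dom[n] - {n}
        # The direct dominator is the one dominated by all other dominators of n.
        direct = next(d for d in strict if dom[d] == strict)
        kind = "strong" if callers[n] == {direct} else "direct"
        relations[n] = (direct, kind)
    return relations


# Hypothetical call graph: lookup is called from two places.
calls = {"main": ["parse", "emit"], "parse": ["lex", "lookup"], "emit": ["lookup"]}
relations = classify_dominance(calls, "main")
for function in sorted(relations):
    print(function, relations[function])
```

In this example `lookup` is called by both `parse` and `emit`, so its direct dominator is `main` and the relation is not strong, whereas `lex` is called only by `parse` and is strongly directly dominated by it.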
Because such changes in modularisation are informative, other investigations on the source code are also performed, such as examining the number of nodes in specific dominance trees and the overall number of dominance trees per version of the software.

Results

In order to give an overview of the gcc application, an indication will now be given of the changes that occur to the software over time at the file level. In this case only changes to the source code are recorded. Figure 1 shows all 30 versions of the application. Each column represents a different version of the software. Versions are presented sequentially, with the oldest version to the left and the most recent version (2.8.1) to the right. Each of the rows represents a different file within the application. The shaded boxes represent the files that have been changed within the specific version. The files are sorted by number of changes: those at the top change most frequently, and those at the bottom are changed very infrequently.
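The layout of Figure 1 described above (one column per version, one row per file, rows sorted by how often the file changed) can be sketched as follows. The change history in the example is invented purely for illustration and the version labels are placeholders; only the file names version.c and combine.c come from the study, and other.c is hypothetical. Files that never change do not appear in this sketch.

```python
def change_matrix(history):
    """Build rows of (file, [changed in each version?]) sorted by change count.

    history is an ordered list of (version label, set of changed files).
    """
    counts = {}
    for _, files in history:
        for name in files:
            counts[name] = counts.get(name, 0) + 1
    # Most frequently changed files first; ties broken alphabetically.
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return [(name, [name in files for _, files in history]) for name in ordered]


def render(rows):
    """Render the matrix as text: '#' for a changed file, '.' otherwise."""
    return "\n".join(
        f"{name:<12}" + "".join("#" if cell else "." for cell in cells)
        for name, cells in rows
    )


# Hypothetical change history, for illustration only.
history = [
    ("v1", {"version.c", "combine.c"}),
    ("v2", {"version.c"}),
    ("v3", {"version.c", "combine.c", "other.c"}),
]
rows = change_matrix(history)
print(render(rows))
```

Sorting by descending change count places the most volatile files in the top rows, which is what makes the frequently changed files easy to pick out by eye.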

From the figure, a number of characteristics of the changes can be seen. For instance, the top row is shaded for each version of the software. This means that this particular file has been modified in each version of the software. The file is actually version.c, which prints out the version number of the application being used, so in this case it should be expected to change on each occasion. It is, however, the only file to change on every occasion. The columns that are most heavily shaded represent major changes to the software. Those columns that contain only a few changes may represent, for instance, the result of small bug corrections.

It is interesting to see that the majority of the changes are made to relatively few of the files. This is especially true when major changes to the software are discounted. Specifically, 3 or 4 files seem to be changed in each version of the software. It is therefore likely that these files are in most need of preventative maintenance, as they either represent the core functions of the application or are hard to understand. Where the software is difficult to understand, mistakes may be made during the process of updating it, and such files therefore often require bug fixes. An investigation into this area is an ongoing direction of the research.

The visual representation of such a change history provides an important guide to assist the preventative maintenance process. The chart allows visual identification of those files that are frequently changing, and these may be the files at which maintenance work is best targeted. However, it would also be of interest to see how many changes are made to each of the files, and in what detail.
One possible solution to this problem would be the use of colour to categorise the degree to which changes have been made for each file and version. For instance, red boxes might indicate that a large number of changes had been made. In this way a row with a large number of red boxes would be easily identifiable as the place where the major proportion of the maintenance activity occurred.

Figure 1: 30 versions of the gcc application
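The colour-coding idea suggested above could be sketched as a simple mapping from the size of a change to a colour category. The thresholds and the colours other than red are assumptions made here for illustration; the study does not specify them.

```python
# Hypothetical thresholds: (minimum number of changed lines, colour).
THRESHOLDS = ((100, "red"), (20, "orange"), (1, "yellow"))


def colour_for(changed_lines, thresholds=THRESHOLDS):
    """Return the colour category for one file in one version."""
    for minimum, colour in thresholds:
        if changed_lines >= minimum:
            return colour
    return "white"  # file unchanged in this version


print([colour_for(n) for n in (0, 5, 40, 150)])
```

Each cell of the change chart would then be drawn in the colour returned for that file and version instead of a uniform shade.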

Figure 2 shows a graph of the changes to the application. The 30 versions of the software are represented across the horizontal axis, whereas the vertical axis shows, first, the number of source code files involved within a change per version and, second, the number of months between each release. The graph shows a fair degree of correspondence between the number of changes per version and the time between each release.

[Figure 2: Number of Changes and Time to Make Them. Chart titled "Showing Number of Changes and Time to make Modifications"; horizontal axis: version (1 to 30); vertical axes: Changes and Months.]

The figure identifies the presence of 5 major changes within the software. Studies in evolution indicate that when such major changes occur within the software its complexity increases [Lehman97]. Other specific patterns can also be identified from the graph. For instance, immediately following a major change, a small number of changes are often performed soon after the initial release. It is assumed that these relate to minor corrections such as bug fixes. Later changes show significant increases in the number of files involved within a change. This increase appears to be followed soon afterwards by a new major change.

The above results give an indication of the changes that are occurring to the application as a whole. A lower-level analysis will now be given of the changes that are occurring within the files, in particular at the calling level. By way of illustration, this paper will now describe the results of the study of evolutionary change on a specific file within the gcc application, combine.c. The results of the analysis showed that no functions were added to the C code. The changes that did occur, however, were related to the calls between the functions. Within the study, combine.c was updated 20 times over the 30 versions studied.
However, only 5 of these changes resulted in a change to the call graph. The degree of change to the call graph differed greatly between the releases, varying from a single change to 168 accountable changes. From the analysis of the call graph, changes were broadly categorised into 4 types:

- the addition of a new call;
- the removal of a call;
- an increase in the number of calls between two specific functions;
- a decrease in the number of calls between two specific functions.

Each of these changes has the potential to change the dominance relations within the code, but will not necessarily do so. The results showed that, of the five changes to the call graph, 3 resulted in a change to the dominance relations within the code.

It was indicated above that changes within the dominance relations have the potential to express changes in the maintainability of the application. An increase in direct dominance relations tends to mean that the comprehensibility of the code decreases and that the maintenance process will therefore be harder to perform in the future. Similarly, an increase in the number of strong dominance relations expresses the reverse process. The results of the analysis showed that there is an inverse relation between an increase in the number of direct dominance relations and an increase in the number of strong dominance relations. Thus, in general it was found that when one increased the change in the other was small. For instance, a change within one version shows 3 relations changing from strong to direct dominance, whereas the reverse change accounted for 25 relations.

[Figure 3: Changes to the dominance relations. Chart titled "Cumulative total of Dominance Relation Changes"; horizontal axis: version; series: s->d (strong to direct) and d->s (direct to strong).]

Figure 3 shows the cumulative changes over a number of versions. It is interesting to note the high increase in strong dominance relations between versions 2.7.2 and 2.8.0. It would appear that a preventative maintenance process was performed during this major release. This process was not found to occur when 2.7.0 was released.

Conclusions and Further Work

This paper has described some of the initial findings from a study of the gcc application. It has identified a number of hypotheses regarding the changes to the application over time. Further work will now be performed to verify whether these hypotheses are correct. Furthermore, the analysis has so far been carried out only at the file and calling structure level. More detailed analysis is soon to be performed to identify changes, for instance, within the data structures.

References

[Burd96a] Burd E.L., Munro M., Wezeman C., Analysing Large COBOL Programs: the extraction of reusable modules, Proceedings of the International Conference on Software Maintenance, California, IEEE Press, 1996.

[Burd96b] Burd E.L., Munro M., Wezeman C., Extracting Reusable Modules from Legacy Code: Considering issues of module granularity, Proceedings of the 3rd Working Conference on Reverse Engineering, California, IEEE Press, 1996.

[Cilitile97] Cimitile A., De Lucia A., Di Lucca G.A., Fasolino A.R., Identifying Objects in Legacy Systems, International Workshop on Program Comprehension, IEEE Press, 1997.

[Girard97] Girard J-F., Koschke R., Finding Components in a Hierarchy of Modules: a step towards architectural understanding, International Conference on Software Maintenance, IEEE Press, 1997.

[Harrold91] Harrold M., Malloy B., A Unified Interprocedural Program Representation for a Maintenance Environment, Proceedings of the Conference on Software Maintenance, Italy, IEEE Press, 1991.

[Kinloch93] Kinloch D., Munro M., A Combined Representation for the Maintenance of C Programs, 2nd Workshop on Program Comprehension, WPC 93, Italy, IEEE Press, 1993.

[Lehman97] Lehman M.M., Ramil J.F., Wernick P.D., Perry D.E., 'Metrics and Laws of Software Evolution - the nineties view', Symposium on Software Metrics, IEEE Press, Nov 1997.