Visual Querying and Analysis of Large Software Repositories

Size: px

Start display at page:

Download "Visual Querying and Analysis of Large Software Repositories"

Bernard Hill
6 years ago
Views:

1 Empirical Software Engineering manuscript No. (will be inserted by the editor) Visual Querying and Analysis of Large Software Repositories Lucian Voinea 1, Alexandru Telea 2 1 Technische Universiteit Eindhoven, The Netherlands, l.voinea@tue.nl 2 Technische Universiteit Eindhoven, The Netherlands, alext@win.tue.nl Received: date / Revised version: date Abstract We present a software framework for visual mining of software repositories. Our framework addresses three aspects: data extraction from repositories, data analysis, and interactive visualization and exploration. Along these, it provides an open structure, extensible with new functionality. We first discuss the challenges of data extraction and storage, and propose a flexible way to deal with implementation inconsistencies in software repositories. We next present a new technique to enrich the raw data with information about artifacts showing similar evolution. Finally, we propose a visualization back-end that supports visual navigation and query of the extracted data at several levels of detail. We demonstrate the applicability of our framework by presenting several case studies performed on industry-size software repositories. 1 Introduction Software Configuration Management (SCM) systems are widely accepted instruments for managing large software development projects containing millions of lines of code spanning thousands of files developed by hundreds of people over many years. SCMs maintain a history of changes in both the structure and contents of the managed project. This information is very suitable for empirical studies on software evolution [1,2,15]. Many SCM systems exist on the market and are used in the daily practice in the software industry. Among the most widespread SCM systems we mention CVS, Subversion, Visual SourceSafe, RCS, CM Synergy, and ClearCase. The Concurrent Versions System (CVS) and Subversion, available via the Open Source community, are very popular SCM systems used by the Open Source community. Many CVS and Subversion repositories covering long evolution periods, e.g., 5-10 years, are freely available for analysis. These repositories are, hence, interesting options for research on software evolution.

2 2 Lucian Voinea, Alexandru Telea However, most repositories and SCM systems are mainly designed for archiving data. For example, CVS and its web-based front-ends offer only a basic querying interface for retrieving a given version of a file or an attribute list with the file state evolution. CVS provides no functionality that enables users to get data overviews easily. Such overviews are essential in case the questions one asks involve many files or file versions, rather than a particular file. Most questions concerning spotting trends in software evolution are of this type. On the other hand, the feedback provided by CVS, e.g., queried state attributes list, is delivered only in a human readable textual format, which makes it unhandy for quick browsing and automatic parsing. This makes the preprocessing needed for answering more complex questions difficult. Subversion is a SCM system built on the same concepts as CVS, but it tries to address some of its shortcomings. Most notably, the Subversion communication protocol is entirely machine parsable, which simplifies the task of data extraction from repositories. However, most SCM systems, including CVS and Subversion, are primarily designed to support the task of archiving software. The main operations they provide are check-in, check-out, and a limited amount of functionality for navigating the intermediate versions that a software system has during its evolution. This leads us to a major challenge of software evolution research based on SCMs, namely how to tackle data size and complexity. Raw repository information is too large and provides directly just limited insight in the evolution of a software project. Extra analysis is needed to process these data and extract relevant evolution features. In this paper we address the challenges of software evolution assessment in CVS and Subversion repositories. We propose an open framework for data extraction and analysis. We illustrate the capabilities of this framework with a customized implementation. The basic questions we try to answer using this framework are: How to deal with the large size of repository data and various limitations of textual feedback? How to extract logical coupling information from evolution? How to efficiently present evolution data to users to enable correlations across entire projects? How to get insight of a system evolution at different levels of detail of the software? The structure of this paper is as follows. In section 2 we review existing repository data extraction methods and software evolution analysis techniques. Section 3 presents our flexible interface with software repositories. Section 4 describes a new clustering technique for detecting logical coupling of files based on evolution similarity. Section 5 describes a visual back-end for evolution assessment and demonstrates our entire framework in five experiments performed on four different large-scale, real-world repositories. Section 6 summarizes our contribution and outlines open issues for future research.

3 Visual Querying and Analysis of Large Software Repositories 3 2 Background The huge potential of the data stored in SCMs for empirical studies on software evolution has been recently acknowledged. The growth in popularity and use of SCM systems, e.g., the open source CVS [7] and Subversion [20], opened new ways for project accounting, auditing and understanding. So far, the research efforts to support these developments can be grouped in two directions: data mining and data visualization. Data mining focuses on extracting and processing relevant information from SCM systems. SCM systems have not been designed to support empirical studies, so they often lack direct access to high-level, aggregated evolution information. Hence, information is distilled from the raw stored data by data mining tools. In the following, we review a number of the prominent results in this field. Fischer et al. [10] extend the SCM evolution data with information on file merge points. Gall [12] and German [14] use transaction recovery methods based on fixed time windows. Zimmermann et al. [29] extended this work with sliding windows and facts mined from commit s. Ball analyzes cohesion of classes using a mined probability of classes being modified together [1]. Bieman et al. [3] and Gall et al. [12] also mine relations between classes based on change similarities. Ying et al. [27] and Zimmermann et al. [29, 28] address relations between finer-grained artifacts, e.g., functions. Lopez-Fernandez et al. [17] apply general social network analysis methods on SCM data to assess the similarity and development process of large projects. All in all, the above methods provide several types of distilled information from the raw facts, thereby trying to answer different questions, such as: which software entities evolved together, which are the high and low activity areas of a project, how is the developers network structured, and how do a number of quality metrics, such as maintainability or modularity, change during a project s evolution. In the data mining step, an information size reduction often takes place, as only the metrics and facts relevant to the actual question are examined further. However useful, this approach can discard important facts in the data which are not directly reflected by the mining process, but which can give useful collateral insight. Data visualization, the second research direction, takes a different path, focusing on making the large amount of evolution information effectively available to users. Visualization methods make few assumptions on the data - the goal is here to let users discover patterns and trends rather than coding these in the mining process. SeeSoft [8], one of the earliest works in this direction, is a line-based code visualization tool which uses color to show code snippets matching given modification requests. Augur [11] visually combines project artifact and activity data at a given moment. Xia [26] uses treemap layouts for software structure, colored to show evolution metrics, e.g., time and author of last commit and number of changes. Such tools successfully show the structure of software systems and the change dependencies at given moments. Yet, they don t give insight into code attributes and structure changes made throughout an entire project. A first step in the direction of getting insight into several versions, UNIX s gdiff and Windows Win- Diff tools display code differences between two versions of a file by showing line

4 4 Lucian Voinea, Alexandru Telea insertions, deletions, and edits computed by the diff tool. Still, such tools cannot show the evolution of thousands of files and hundreds of versions. To overcome these shortcomings, Collberg et al. [6] depicts the evolution of software structures and mechanisms as a sequence of graphs, for medium-size projects. Lanza [16] depicts the evolution of object-oriented software systems at class level. Wu et al. [25] visualize the evolution of entire projects at file level and emphasize the evolution moments. Burch et. al [5] propose a framework for visual mining of both evolution and structure at various levels of detail. SoftChange [13] is a first attempt in this direction for a coherent environment to support the comparison of Open Source projects, targeting CVS, project mailing lists, and bug report databases. SoftChange focuses mainly on data extraction and analysis, aiming to be a generic foundation for building evolution visualization tools. Finally, our own work provided software evolution visualizations at several granularity levels: CVSscan [24] for the line-level evolution of a few source code files and CVSgrab [21] for filelevel, project-wide evolution investigations. Data extraction is a less detailed, yet very important, aspect of software evolution analysis. Subversion offers a standard API for accessing the software repository. However, Subversion is a relatively new SCM system. It appeared in June 2000 and it was first managed using CVS. This reflects in a lower number of large size repositories freely available for investigations. A large part of the open source community still uses CVS as the primary SCM system. Therefore, a large part of research in software evolution consider data from CVS repositories, e.g., [29, 10,12,13,27,17,5,24,21]. Yet, a standard framework for CVS data extraction still lacks. Two main challenges exist here for the research community: evolution data retrieval and CVS output parsing. The huge amount of data in software repositories is usually available over the Internet. On-the-fly retrieval is not suited for interactive assessment, given the sheer data size. Storing data locally requires long acquisition times, large storage space, and consistency checks. Next, CVS output is badly suited for machine reading. Many CVS systems use ambiguous or nonstandard output formats. Attempts to address these problems exist, but are incomplete. Libraries exist that offer an application interface (API) to CVS, e.g., Java s javacvs or Perl s libcvs. However, javacvs is basically undocumented. Libcvs handles only local repositories. The Eclipse environment offers a CVS client implementation but not an API. The Bonsai project [4] offers several tools to populate a database with evolution data obtained from CVS repositories. These tools are mainly meant as a web data access package and are little documented. The best supported effort for CVS data acquisition is the NetBeans javacvs package [18], a welldocumented API with allegedly full CVS client support that parses CVS output into API-level data structures. There exist only a few tools which combine data extraction, data mining and analysis, and data visualization functionality in a single environment [13, 19]. We propose a new approach towards an integrated framework that addresses these issues. Our goal is twofold. First, we aim to leverage the power of interactive visualization techniques in the analysis of the large amounts of useful and interesting information in software repositories. Second, we aim at developing an experimenting foundation, both in terms of software tools and methodology, for practical re-

5 Visual Querying and Analysis of Large Software Repositories 5 SCM systems Subversion? Repository Repository CVS Repository Subversion access library Subversion format converter CVS client CVS format converter Generic software evolution model?? format converter Control data acquisition Data processing and enrichment tools (e.g. Bugzilla acquisition module) Evoluition DB access API Cached evolution database Content manager Evoluition DB access API Evolution analysis tools Data acquisition mediator Fig. 1 A data acquisition mediator for generic, fast and and transparent access to SCM repositories search on software evolution. Ultimately, we would like to provide end users with a complete software evolution analysis chain. Our approach is described next. 3 Data extraction Repository data extraction is a main practical problem for research on software evolution. Most repository access protocols, such as CVS and Subversion, implement only basic access functions to support the archiving process. Additionally, in the concrete case of CVS, navigation commands do not have a machine-readable output. Navigation feedback is given in a compiled text format that is not always easy to decipher. Parsing this output often fails on some repositories due to awkward local conventions, e.g., date format is yyyy-mm-dd and not dd/mm/yyyy or file names may contain spaces. This makes uniform access to CVS data difficult. In such cases, one usually searches a parser that copes with the output format at hand and tries to add it to the experimental setup. We propose next an approach towards repository data access that simplifies this process using a data acquisition mediator (see Figure 1). The mediator is an easy-to-customize layer between repositories, data acquisition, processing and analysis tools. When format inconsistencies occur between the CVS output and a parser, we don t need a new CVS data acquisition tool. Instead, we adapt the mediator with a simple rule to transform the new format into

6 6 Lucian Voinea, Alexandru Telea the one accepted by the tool. While this does not completely remove the problems of inconsistent output formatting, it is a flexible way to solve problems without removing the preferred data acquisition tool. We developed an open source, easy to customize mediator, in a simple to use programming language: Python. The mediator mainly consists of a set of scripts which act as data filters, reading the raw data produced by the SCM tool (e.g. CVS client), applying data and format conversion, and outputting a filtered data stream. Filters can be connected in a dataflow-like manner to yield the desired data acquisition tool. Several filter pipelines can be added to work in parallel and have their outputs merged at the end. This enables alternative parsing strategies, e.g. in case one does not know beforehand how to detect the data format used. A second function of the mediator is to offer selective access to CVS repositories by retrieving only information about a desired folder or file, as demanded by the data analysis or visualization tool further in the pipeline. The mediator also caches the retrieved information locally, using a custom developed database. This design lets one control the trade-off between latency, bandwidth and storage space in the data acquisition step as desired. 4 Mining evolutionary coupling Raw repository data is too large and low-level to provide insight in the evolution of software projects. Extra analysis is needed to extract relevant evolution aspects. An important analysis use-case is to identify artifacts that have similar evolution. Several approaches exist for this task [3, 12, 27, 29]. All these approaches use similarity measures based on recovered repository transactions, i.e., sets of files committed by a user at some moment. The assumption is that related files have a similar evolution pattern, and thus their revisions will often share the same repository transaction. This information about correlated files is used to predict future changes in the analyzed system, from the perspective of a given artifact. We propose a more general approach. We argue that not transactions, but pure commit moments, are important for finding similar files. Transaction-based similarity measures fail to correlate files developed by different authors and with different comments attached, but which are still highly coupled. To handle such cases, we propose a similarity measure using the time distance between commit moments. If S 1 = {t i i = 1..N} are the commit moments for a file F 1 and S 2 = {t j j = 1..M} the commit moments for F 2, we define the similarity between F 1 and F 2 as the symmetric sum: Φ(F 1, F 2 ) = N i=1 1/ min( t i t j t j S 2, t i t j < k) M j=1 1/ min( t j t i t i S 1, t j t i < k) + 1 (1) where k is a customizable neighborhood factor intended to reduce the influence of completely unrelated events on the similarity measure. The square root is meant to attenuate the influence of the network latency on the transaction time recorded by the repository. Intuitively, this measure considers, for each commit moment of

7 Visual Querying and Analysis of Large Software Repositories 7 F 1, the closest commit moment from F 2, weighted by the inverse time distance between the two moments. We next use this measure in an agglomerative clustering algorithm to group files that have a similar evolution. The clustering works in a bottom-up fashion, as follows. First, all individual files are placed in their own cluster. Next, the two most similar clusters, as given by the above similarity metric, are found and merged in a parent cluster, which gets the two original clusters as children. The process is repeated until a single root cluster is obtained. To measure the similarity of nonleaf clusters, we use Equation 1 on the union of commit moments of all files in the first, respectively second, cluster. This is equivalent to a full linkage clustering which is computationally more expensive than other techniques such as single linkage or k means [9]. To give a practical estimation of the efficiency, clustering a system containing around 850 versions of 2000 files, using our evolutionary coupling implemented in C++, took just under 5 minutes on a 2.5 GHz Windows PC. After the clustering tree is computed, we use it to visualize a system evolution decomposition by displaying those clusters on the level of detail selected by the user interactively. This technique is presented along our other visualization tools in the next section. 5 Visualization Visualization tries to give insight in large and complex evolution data by delegating the pattern detection and correlation making to the human visual system. Visualization can also present the results of data analysis in an intuitive, readyto-use, way. Visualization is a main ingredient of our software repository mining framework. We present next a set of experiments that illustrate the applicability of our framework on real-life, industry-size software repositories. To this end we use the evolutionary coupling computation technique proposed in the previous section and the CVSgrab tool [21] as our visualization back-end. The evolutionary coupling algorithm presented in Section 4 is connected to the visualization back-end through the data acquisition mediator described in Section 3 to form our complete evolution analysis framework. We first outline the functionality of the CVSgrab tool. For more details, we refer to [21]. CVSgrab is an interactive tool that depicts project evolution at file level. A complete project, as present in a SCM repository, is rendered as a set of horizontal strips. Each strip shows a file s evolution in time (Figure 2). The x axis shows the time. The file strips are stacked along the y axis following different layouts, which can be interactively set up in CVSgrab to suit different types of analyses. Color can be used to show several attributes: file type, size, author, metrics computed on the file contents, or other information such as the occurrence of a specific word in the file s commit log or information from an associated bug database. All these mechanisms are presented in the next sections, by illustrating how they can be used to assess several aspects of software evolution.

8 8 Lucian Voinea, Alexandru Telea F 1 Time V 1 V 2 V 3 V 4 F 2 F 3 F 4 Files Color encodes version based file metrics Fig. 2 CVSgrab visualization of project evolution 5.1 Case 1: Assessment of Software Metrics Evolution We analyzed the evolution of ArgoUML, an object-oriented design tool with a 6- year evolution of 4452 files developed by 37 authors. This project is hosted in a CVS repository. Data acquisition took 31 minutes over a T1 Internet connection: 8 minutes for the initial setup (i.e., one-time retrieval of the last version of 59MB) and 23 minutes to retrieve the evolution data to be visualized (20MB). In Figure 3, a 12-snapshot matrix shows ArgoUML s evolution, displayed using CVSgrab. Each column shows the evolution of one metric: Column 1 shows the development team evolution. Color shows the ID of the user who committed each file. Column 2 shows the file size evolution, measured as changed lines of code (LOC). Hue encodes change type: Gray shows the first commit of a file, blue shows file size increase, red is decrease, and yellow is a commit that changes content but not size. Brightness encodes change size: Lighter colors show smaller changes, darker colors show larger changes. Column 3 shows file type: red for java sources, green for images, and yellow for HTML files. Column 4 shows files that contain a given string in their commit comment: green are files that contain the word fix, blue are files that contain the word error. Each matrix row in Figure 3 uses another layout on the y axis. In row A, files are sorted alphabetically on their full path, and thus show the folder structure. In row B, files are sorted from top to bottom in decreasing order of number of versions, i.e., on file activity. Files that have the same number of versions are further sorted in decreasing order of creation time. In row C, files are sorted in decreasing order of creation time. Files created in the beginning of the project are at the bottom, while young files are at the top. Figure 3 shows several interesting aspects of the process and organization of ArgoUML. Cell C3 shows that the project started with a documentation base (green and yellow files) that probably contained the system specification. This was contributed by one user (jrobbins = brown in C1) and remained unchanged for the entire project duration except for a large addition (dark blue in C2) done by another user two years later (dennyd = red in C1). The added code concerned an underspecified issue as it was extend again two years later (dark blue in C2) by another author (mvw = cyan in C1). The real implementation first appeared 6 months

Visual Querying and Analysis of Large Software Repositories 9 Sort alphabetically 2 1 1 2 A 2 1 1 2 Sort by creation time

3 ArgoUML metrics evolution visualization with CVSgrab after the specification was committed (java source files = red in

Two years from the project start, another big documentation chunk was added (yellow and green in C3) by one user

Although the specification, implementation, and documentation appear to be the work of one author each, it is intriguing

We believe that this denotes work of more people, which was first checked in bulk by one single person.

implementation (red in A3). The same pattern can be seen following the large ovals (1) in B1 and B2.

The major color groups correspond to the folders documentation (green at the top), src_new (red at the middle) and www

From B3, one can see that most activity during the project was related, as expected, to changes in the implementation files

9 Visual Querying and Analysis of Large Software Repositories 9 Sort alphabetically A Sort by creation time Sort by activity B C 1: team evolution 2: size evolution 3: file type evolution 4: search evolution Fig. 3 ArgoUML metrics evolution visualization with CVSgrab after the specification was committed (java source files = red in C3) and was contributed by one author (1sturm = blue in C1). Two years from the project start, another big documentation chunk was added (yellow and green in C3) by one user (jeremybennett = yellow in C1). Although the specification, implementation, and documentation appear to be the work of one author each, it is intriguing that they were all committed at one time and not incrementally, by one author, and contained many (around 400) files. We believe that this denotes work of more people, which was first checked in bulk by one single person. For the rest of the project, one user has a major contribution (linus = green in C1), with one exception in the fifth project year (mvw = cyan in C1). The large oval (1) in A1 shows that mvw (cyan in A1) made a large contribution (dark blue, large oval (1) in A2) to implementation (red in A3). The same pattern can be seen following the large ovals (1) in B1 and B2. A3 shows that the project has a very clean organization. The major color groups correspond to the folders documentation (green at the top), src_new (red at the middle) and www (yellow and green at the bottom). From B3, one can see that most activity during the project was related, as expected, to changes in the implementation files (red at the top) followed by changes in the documentation (yellow in the middle) and in the documentation images (green at the bottom). B2 shows that almost one-third of the files added during the project did not change during all six years (as they are grey). Most such files are documentation (yellow and green in B3). To this group belongs also the largest part of the previously identified system specification (brown in B1, by correlation with C1 and C3). Another interesting aspect is shown in the small ovals (2) in A1 and A4. In the fourth project year, linus made a significant contribution, not as size (no big size

Lucian Voinea, Alexandru Telea Sort alphabetically 10 1 A 2 1 3 Sort by activity 2 Sort by creation time B 3 3 2 1:

4 PostgreSQL metrics evolution visualization with CVSgrab change pattern seen in A2) but in terms of code cleaning.

It suggests that previous work has been done in that area without being committed.

One exception, highlighted in C2, shows a size drop for documentation files (yellow in C3) that occurred at the end

It shows the evolution of PostgreSQL, an object-relational database management system hosted on a CVS repository,

The data acquisition step took 28 minutes: 7 minutes for the initial setup (i.e., one time retrieval of the last

The evolution retrieval time was now smaller than in our first example, even though more data was retrieved.

When retrieving evolution data, the connection has to be established for each file.

10 Lucian Voinea, Alexandru Telea Sort alphabetically 10 1 A Sort by activity 2 Sort by creation time B : team evolution 2: size evolution 3: file type evolution C 4: search evolution Fig. 4 PostgreSQL metrics evolution visualization with CVSgrab change pattern seen in A2) but in terms of code cleaning. Many implementation files (red in A3) containing the words fix and error in their commit comment have been committed by linus. The same pattern can be seen in row B. The large green (i.e., fix ) horizontal pattern visible in column 4 corresponds to an initial checkout of documentation files. It suggests that previous work has been done in that area without being committed. Figure 3 shows also the project size almost never decreased. One exception, highlighted in C2, shows a size drop for documentation files (yellow in C3) that occurred at the end of the fourth year. Figure 4 shows a second example of using our framework to assess software evolution. It shows the evolution of PostgreSQL, an object-relational database management system hosted on a CVS repository, with a history of 10 years, 2829 files, and 27 authors. The data acquisition step took 28 minutes: 7 minutes for the initial setup (i.e., one time retrieval of the last project version = 56MB) and 21 minutes for retrieving the evolution information to be visualized (29MB). The evolution retrieval time was now smaller than in our first example, even though more data was retrieved. This is explained by the connection overhead. When retrieving evolution data, the connection has to be established for each file. In the second example, there are less files than in the first example, which decreased the overall connection latency. Figure 4 shows a 12-snapshot matrix of the evolution of PostgreSQL, structured similarly to Figure 3. Columns show the development team (1), size evolution (2), file type (3) and string search (4) encoded by colors, just as in example 1, except for file type and string search. In column 3, C source files are blue, C headers are light green, SGML documentation files are dark green, SQL files are

11 Visual Querying and Analysis of Large Software Repositories 11 pink, and test support files are red. In column 4, green shows versions that contain the word fix in their associated commit comment, and red versions that contain the word bug. The matrix rows use different sortings for files on the y axis: alphabetical (A), number of revisions (B) and creation time (C). Using Figures 4 and 3, we can compare the evolution of PostgreSQL and ArgoUML. Cell C3 (Figure 4) shows that the project started with a source code base (blue at bottom) and not with a specification, as for ArgoUML. Even the header files (light green) were not committed until several months later. As for ArgoUML, the initial contribution (about 400 C sources and headers) was done by one person (scrappy = red in C1). This suggests previous development unrecorded in CVS. The project is contributed by a few authors: momjian (light green), tgl (dark blue), pgsql (magenta), petere (cyan), thomas (yellow-green). The commits of momjian and tgl are interleaved at intervals of around 6-8 months (column 1) and address the most active parts of the system (B1). These parts correspond to the C sources and headers, by correlation via A1 and A3. These parts are also modified by pgsql in the last two project years. A detailed look at B1 and B2 reveals the commit patterns of momjian and tgl. momjian s commits do not usually change file sizes (yellow in B2) and are done at large intervals. In contrast, tgl s commits are done more often and often change file sizes. The contributions of momjian interrupt abruptly the ones of tgl but not conversely. This suggests the real work might be done by tgl while momjian has more the role of a code standard manager. By zooming in on the files contents, we saw that the changes done by momjian were mainly indentation and copyright text. Finally, petere and thomas mainly contributed to the documentation (by correlating A1 and A3). As for ArgoUML, PostgreSQL has a clean organization (A3): source, header, documentation, and test files are well separated. Most activity occurs in the implementation (C) files, but also in headers and documentation files. This suggests frequent architectural changes. No large size changes are seen throughout the project, except documentation, (label 1 in A2). Finally, column 4 in Figure 4 shows the distribution of the words fix and bug in commit logs. The green patterns (2) show versions containing the word fix. By correlating C3, C4 and C1, we saw that these patterns match header files committed by momjian. Hence, we thought they would contain cosmetic changes. Indeed, file-by-file analysis showed that fix refers actually to the indentation tool and not to the system code itself. Other occurrences of the word fix are evenly spread mainly over the evolution of C sources (A4). The red patterns (3 in A4 and C4) show versions containing the word bug. These are test files and appear close to the file creation moment. This, together with the fact that test files appear early in the project, suggests an active test policy. The rest of the occurrences of the word bug are evenly distributed, mostly over of C sources. 5.2 Case 2: Assessment of Evolutionary Coupling One question we were interested in was which are the components of PostgreSQL which show similar evolution patterns. To answer this, we computed a cluster hierarchy based on evolutionary coupling, as explained in Section 4. Figure 5 shows

12 12 Lucian Voinea, Alexandru Telea the clustering using the CVSgrab tool. Files in the same cluster are drawn using the so-called plateau cushions, dark at the edges and bright in the middle[21]. In each cluster, files are sorted on the y axis in inverse order of their creation time. Color shows different metrics: word distribution (1), size evolution (2), file author (3) and file type (4), just as in Figure 4. We see five main important evolution groups. Column 4 shows three main groups: sources (a,b,c), documentation (d), and test scenario files (e). We can easily see that source files appearing in the beginning of the project evolve similarly (b). These may refer to a subsystem having a high logical coupling which can be seen as a building block or component. The same holds for the other two source code clusters (a,c). All these clusters share the same developer team (3) and size evolution patterns (2). The early introduced source code cluster (b) has, however, more versions whose commit logs contain the word bug (marked in 1). This suggests a problematic implementation in this cluster. Documentation forms a separate cluster (d), which suggests it does not target the system detailed design, as it doesn t change in sync with the headers. Finally, the large cluster at the bottom of the images (e) contains mainly test scenarios, so it nicely factors out files intended to support the development process, rather than implementing actual functionality. Changing the level of detail, we can further split the above clusters for a finer analysis. This can be useful not only to obtain a logical decomposition, but also to predic future changes with different levels of confidence, similar to [29]. d a b c Main clusters e 1: word distribution 2: size evolution 3: file author 4: file type Fig. 5 PostgreSQL evolution clusters visualization with CVSgrab 5.3 Case 3: Assessment of Source Code Style A frequent question that appears when assessing for the first time a third party software stack is: How does the source code of a given project look like? We chose the KDE KOffice project as a case study in this direction. We used the Subversion module of the the data acquisition mediator to retrieve the entire KDE KOffice project history (10616 files, 37 authors, 9 years of evolution) and cache it locally. This operation took 30 minutes, which is several times faster than a similar action on a CVS repository for two reasons. First, Subversion allows retrieving the list of files directly from the repository. There is no need for the initial CVS setup step

13 Visual Querying and Analysis of Large Software Repositories 13 where the final versions of all files are retrieved prior to history information. Secondly, Subversion offers a compact, thus network-friendly, output format. After acquisition, we used CVSgrab s filter functions to select the 108 C++ source and header files from the KOfficeui folder for detailed investigation. We split the files into source and header using the sort-by-type feature of CVSgrab and then sorted each group in the increasing order of creation time. Finally, we parsed the files using a custom lightweight parser, computed the amount of code and comment lines, and visualized these using color. Figure 6 top shows the size evolution using a rainbow color map. Most implementation files have over 90 LOC (red). There are two files with less then 10 LOC (blue). The source files seem to stay quite constant in size, while their number steadily increases. This signals addition of new features as opposed to maintenance or refactoring. In contrast, most headers are up to 40 LOC. One larger header was simplified (B). At one point (A), an interface refactoring took place. A lot of code from the headers was systematically removed across the entire library, yet the implementation stayed constant (as there is no drop in the sources size). Figure 6 bottom uses a rainbow color map to show the comments evolution. Both headers and sources have a similar amount of comments, which denotes good coding style. The amount of comments increases in a few sources at one point (C), while the size stays constant, which suggests a code review and documentation addition. 5.4 Case 4: Assessment of Change Propagation during Debugging An important question in maintenance is whether debugging-induced changes propagate system-wide or are localized to subsystems. The latter situation is desirable, as it allows modular, simpler, and ultimately cheaper maintenance. We studied this question on the Mozilla Firefox code (659 files, 120 authors, 5 years of evolution). We first acquired the Mozilla evolution data from the official CVS site, which took 30 minutes including the initial setup step. Next, we implemented a custom data acquisition module to extract related bug data from the project s Bugzilla database. We merged the history and bug acquisition modules in the mediator s data pipeline and finally loaded all the information into the visualization. We used small red square markers, as opposed to colors, to show the time and files where debugging activities related to the so-called blocker bugs occur (Figure 7 top). Each marker is surrounded by a transparent disk of constant radius. This allows bugs located close to each other in time and project space to blend together, which reveals a debugging-intensive area. Figure 7 top shows a red area of high density indicating intensive debugging activity in the recent past. Looking at folder names, we saw that this area almost perfectly matches the /components/places/content folder (colored in green in Figure 7 middle). Consequently, we wanted to know which other files evolve similarly to these intensively debugged files. For this, we clustered the project based on evolution similarity (Section 4). Figure 7 bottom shows a zoomed-in view of the cluster containing files from the /components/places/content folder. The found

14 Lucian Voinea, Alexandru Telea.cpp B A.h C.cpp.h Fig.

Increasing number of constant size sources suggests new feature addition as

A complex header file is simplified (B), size of headers drops at one moment

14 14 Lucian Voinea, Alexandru Telea.cpp B A.h C.cpp.h Fig. 6 Source code evolution in KDE Koffice. Increasing number of constant size sources suggests new feature addition as opposed to maintenance. A complex header file is simplified (B), size of headers drops at one moment due to an interface refactoring (A)(top). Number of comment lines is similar in implementation and header files, some implementation files become more commented possibly after a code review (C) (bottom)

Red spots show debugging activities addressing blocker bugs (top).

15 Visual Querying and Analysis of Large Software Repositories 15 Fig. 7 Assessment of change propagation during debugging in Firefox. Red spots show debugging activities addressing blocker bugs (top). Green marks the folder containing the most intense debugging activity (middle). These files form an apart evolution cluster (bottom), so changes in the debugging zone stay localized.

16 Lucian Voinea, Alexandru Telea cluster clearly contain only files from the investigated folder (green).

5 Case 5: Assessment of Architecture Stability Another question in software evolution is how the code structure changes during a project s lifetime.

Next, we focused on the Libs folder which contains C/C++ source files and their interface headers.

This operation took one hour. The resulting data was visualized with CVSgrab. We focused next on those interface functions which are implemented in several source files.

16 16 Lucian Voinea, Alexandru Telea cluster clearly contain only files from the investigated folder (green). Hence, debugging activities are nicely matching the folder structure, and do not propagate to other system parts. 5.5 Case 5: Assessment of Architecture Stability Another question in software evolution is how the code structure changes during a project s lifetime. In this case study, we look at how stable function interfaces are. We used the same process as in case 3 to acquire the history of KDE KOffice. Next, we focused on the Libs folder which contains C/C++ source files and their interface headers. Next, we integrated a custom C/C++ parser as a filter into our data acquisition mediator, and used it to extract the names of the functions implemented in the Libs folder. This operation took one hour. The resulting data was visualized with CVSgrab. We focused next on those interface functions which are implemented in several source files. These are typically overriden methods of some C++ base class. For each such function, we visualize when do the files which implement it change, by coloring those file versions with the function s color ID (Figure 8). When several functions are selected for inspection and they all change in the same version, we color that version red. For some interfaces, e.g. the draw method, we see that the Fig. 8 Assessment of interface stability in KDE KOffice. Files implementing the name method (blue) do not change together, i.e., the associated interface is stable (top). Files implementing the draw method (green) change together, so the associated interface is less stable (bottom)

17 Visual Querying and Analysis of Large Software Repositories 17 files that implement it have many moments when they change together, as shown in Figure 8 bottom by the vertical stripes. We next used the details-on-demand mechanism of CVSgrab to investigate the associated commit messages of those files that changed together. In some cases, the changes addressed textual issues such as indentation (4 in Figure 8 bottom). These situations are not interesting for analysis. However, there were also cases in which the change referred to an issue of the interface, which propagated to all the files that implemented it (1,2 and 3 in Figure 8 bottom). Such patterns indicate an unstable interface. The most favorable case is when changes in the files implementing the interface are uncorrelated (Figure 8 top). 6 Conclusions We presented a new framework for visual data mining and analysis of software repositories. Our goal is twofold. First, we use several novel visualization techniques to analyze the large amounts of information from software repositories. Our techniques are simple: 2D Cartesian layouts, color showing attributes, sorting on any attribute, clustering based on evolutionary similarity, and bugs displayed as blended areas. However, their power resides precisely in this simplicity. Combined with their genericity, i.e. the fact that any technique can be used on any attribute, and driven by interactive selection and zooming mechanisms, this yields a versatile and effective visualization environment. Adding the generic dataflow-driven filter mechanism of the data acquisition mediator yields a complete framework for evolution data acquisition, mining and visualization. Our second goal was to develop an experimenting foundation, both in software tools and methodology, for research on software evolution. Our methodology is illustrated by means of five different case studies, covering analysis of developer network, software interfaces, code style, system modularity, and debugging activities, over four industry-size Open Source projects: ArgoUML, PostgreSQL, Mozilla Firefox and KDE Koffice. We stress that the examples presented here are simplified scenarios for space constraints. We have used our framework together with actual developers of commercial software, as well as with developers of open source libraries in several more involved activities targeting more complex questions. The feedback collected revealed a methodology very similar to the one presented in the scenarios here. Moreover, our framework competed with success in several software repository mining and visualization challenges [22,23]. A strong point revealed in all these situations was the simplicity of retrieving repository data, supported by the flexible data acquisition mediator design. We also propose a new technique for computing evolutionary coupling. This technique assists users both to perform a logical decomposition of the system, and to predict future changes in the system from the perspective of select files. This technique is completely independent on the file contents, which is a strong advantage in terms of robustness, simplicity, and genericity. We aim next to improve the similarity measure of the evolutionary coupling mechanism by using additional attributes, e.g., file type. The challenge here is to find the best similarity description that matches a given user requirement. Additionally, we would like to extend

18 18 Lucian Voinea, Alexandru Telea the framework with other generic visualization and code analysis mechanisms, to support more complex and also more specific questions. References 1. T. Ball, J. Kim, A. Porter, and H. Siy. If your version control system could talk... In Proc. Workshop on Process Modeling and Empirical Studies of Software Engineering, K. Bennett, E. Burd, C. Kemerer, L. M. M., M. Lee, R. Madachy, C. Mair, D. Sjoberg, and S. Slaughter. Empirical studies of evolving systems. Empirical Software Engineering, 4(4): , J. M. Bieman, A. Andrews, and H. Yang. Understanding change-proneness in oo software through visualization. In Proc. IWPC 03, pages IEEE Press, Bonsai, www M. Burch, S. Diehl, and P. Weißgerber. Visual data mining in software archives. In SoftVis 05: Proceedings of the 2005 ACM symposium on Software visualization, pages 37 46, New York, NY, USA, ACM Press. 6. C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler. A system for graph-based visualization of the evolution of software. In Proc. ACM SoftVis 03, pages ACM Press, CVS, www S. G. Eick, J. Steffen, and E. Sumner. Seesoft a tool for visualizing line oriented software statistics. IEEE Trans. Soft. Eng., 18(11): , E. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold Publishers, Inc., M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proc. ICSM, pages IEEE Press, J. Froehlich and P. Dourish. Unifying artifacts and activities in a visual tool for distributed software development teams. In Proc. ICSE 04, pages IEEE Press, H. Gall, M. Jazayeri, and J. Krajewski. Cvs release history data for detecting logical couplings. In Proc. IWPSE, pages IEEE Press, D. German, A. Hindle, and N. Jordan. Visualizing the evolution of software using softchange. In Proc. 16th Intl. Conf. on Software Engineering and Knowledge Engineering, pages , D. German and A. Mockus. Automating the measurement of open source projects. In Proc. Workshop on Open Source Software Engineering (OOSE), R. M. Greenwood, B. Warboys, R. Harrison, and P. Henderson. An empirical study of the evolution of a software system. In Proc. 13 th Conf. on Automated Software Engineering (ASE), pages , M. Lanza. The evolution matrix: recovering software evolution using software visualization techniques. In Proc. IWPSE 01, pages ACM Press, L. Lopez-Fernandez, G. Robles, and J. Gonzalez-Barahona. Applying social network analysis to the information in cvs repositories. In Proc. Intl. Workshop on Mining Software Repositories (MSR). IEEE Press, NetBeans.javacvs, www M. Pinzger, H. Gall, M. Fischer, and M. Lanza. Visualizing multiple evolution metrics. In SoftVis 05: Proceedings of the 2005 ACM symposium on Software visualization, pages 67 75, New York, NY, USA, ACM Press. 20. SVN, www.

19 Visual Querying and Analysis of Large Software Repositories L. Voinea and A. Telea. CVSgrab: Mining the history of large software projects. In Proc. EuroVis 06, pages IEEE Press, L. Voinea and A. Telea. How do changes in buggy mozilla files propagate? In SoftVis 06: Proceedings of the 2006 ACM symposium on Software visualization, pages , New York, NY, USA, ACM Press. 23. L. Voinea and A. Telea. Mining software repositories with cvsgrab. In MSR 06: Proceedings of the 2006 international workshop on Mining software repositories, pages , New York, NY, USA, ACM Press. 24. L. Voinea, A. Telea, and J. van Wijk. Visualization of code evolution. In Proc. ACM SoftVis 05, pages ACM Press, J. Wu, C. Spitzer, A. Hassan, and R. Holt. Evolution spectrographs: Visualizing punctuated change in software evolution. In Proc. of IWPSE 04, pages IEEE, X. Wu. Visualization of version control information. Master s thesis. University of Victoria, A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll. Predicting source code changes by mining revision history. IEEE Trans. on Soft. Eng., 30(9): , T. Zimmermann and Weisgerber. Preprocessing CVS data for fine-grained analysis. In Proc. Intl. Workshop on Mining Software Repositories (MSR). IEEE Press, T. Zimmermann, P. Weisgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proc. ICSE 04, pages IEEE Press, 2004.

A File Based Visualization of Software Evolution

A File Based Visualization of Software Evolution Keywords: Software evolution, Software visualization. S.L. Voinea l.voinea@tue.nl A. Telea alext@win.tue.nl Technische Universiteit Eindhoven Abstract Software